

Parsing an HTML page as XML

Posted on 2005-05-13
Medium Priority
Last Modified: 2013-11-19
How can I parse an HTML (HTML 4.01 Transitional) page as if it were XML?

For example, I have a standard HTML page that contains multiple tables.  I want to grab the 3rd table and read all rows/columns from it.

Question by:chilltemp
LVL 10

Expert Comment

ID: 13999544
I've never written an HTML parser, and there is probably an easier way, but this is just something I came up with off the top of my head. In your code you could make a copy of the file, strip out everything that comes before the <body> tag, and then add the XML declaration at the top. I think it would look like this:

<?xml version="1.0" encoding="utf-8" ?>

I have some misgivings about that, but if it worked you could just parse it like normal XML until you got to the "elements" you were interested in. Anyway, like I said I just came up with it off the top of my head.

Author Comment

ID: 13999691
I've tried that, but I keep getting errors thrown due to elements that aren’t closed properly.  Such as:

<MAP NAME="map">
  <AREA SHAPE="rect" COORDS="___" HREF="___">
  <AREA SHAPE="rect" COORDS="___" HREF="___">

Do you know of any "reasonable" way to either correct these, or to convince the xml parser to ignore them?

Author Comment

ID: 13999979
The error message is:
"The 'AREA' start tag on line '5' does not match the end tag of 'MAP'. Line 6, position 3."

LVL 10

Expert Comment

ID: 14000942
Ah yes, here's the problem with that. I knew something like that would crop up. XML doesn't allow tags that are left open without a closing tag, like this:

<SomeTag stuff="...">

So any tags set up like that will cause the parser to fail. In XML, a single tag like that would need to be written like this:

<AREA SHAPE="rect" COORDS="___" HREF="___"/>  (notice the /> at the end; I can't remember if that is valid HTML, but in XML a single tag with no closing tag has to end with "/>")
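To illustrate the difference, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree (the thread itself does not name a language, so the choice of Python and the helper name is_well_formed are assumptions for illustration only). It checks the same MAP/AREA fragment with and without the trailing "/>":

```python
import xml.etree.ElementTree as ET

# The AREA tag is left open, as in the original HTML.
unclosed = '<MAP NAME="map"><AREA SHAPE="rect" COORDS="___" HREF="___"></MAP>'
# The same fragment with AREA written as a self-closing tag.
closed = '<MAP NAME="map"><AREA SHAPE="rect" COORDS="___" HREF="___"/></MAP>'


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(unclosed))  # False: </MAP> does not match the open AREA
print(is_well_formed(closed))    # True
```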

Is the HTML file always going to be in the same format?


Author Comment

ID: 14010472
This page is always in the same format, so I can probably do a RegEx replace to convert the HTML into valid XML.  But my goal was to create a function/class that would be a little more reusable. (But there's also the question of how much time invested into this is reasonable.)
LVL 10

Accepted Solution

NetworkArchitek earned 1000 total points
ID: 14010570
Well you can use a regex to make it reusable. Basically you will want to take anything in the form of

<SomeTag stuff="...">  (That does not have a closing tag)

And just turn it into

<SomeTag stuff="..." />

But then you still have the problem of finding the matching closing tag, and some tags are sometimes written as single tags and sometimes with a closing tag. There may be a tool out there that does this, but I have never seen or heard of one. You could implement it yourself, but it would be a lot of work. I think you'd end up writing the equivalent of an XML parser just to get the page ready for the XML parser if you went this route.
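A rough sketch of the regex idea, in Python with the standard re module (an assumption; the thread does not say which language is in use). Rather than hunting for matching closing tags, it only rewrites a hand-picked list of elements that HTML 4.01 declares empty, which sidesteps the "sometimes single, sometimes closed" problem for those tags. The names VOID_TAGS and self_close_void_tags are hypothetical. It will still fail on attribute values containing ">", unquoted attributes, entities such as &nbsp;, and tags whose closing tag HTML allows to be omitted:

```python
import re
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (hand-picked, not exhaustive).
VOID_TAGS = ("area", "base", "br", "col", "hr", "img",
             "input", "link", "meta", "param")

# Match <tag ...> or <tag ... /> for the tags above, in any letter case.
VOID_RE = re.compile(
    r"<(" + "|".join(VOID_TAGS) + r")\b([^<>]*?)\s*/?>",
    re.IGNORECASE,
)


def self_close_void_tags(html):
    """Rewrite every listed empty element as <tag ... />."""
    return VOID_RE.sub(r"<\1\2 />", html)


fragment = (
    '<MAP NAME="map">\n'
    '  <AREA SHAPE="rect" COORDS="___" HREF="___">\n'
    '  <AREA SHAPE="rect" COORDS="___" HREF="___">\n'
    '</MAP>'
)

fixed = self_close_void_tags(fragment)
root = ET.fromstring(fixed)          # parses once the AREA tags are closed
print(len(root.findall("AREA")))     # 2
```

Tags that are already self-closed pass through unchanged in meaning, so the function can be run more than once on the same string.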

Author Comment

ID: 14011701
You're probably right about that.  So I've gone the route of creating a RegEx based row scraper.  This simply scrapes all rows/columns from an HTML string and dumps them into a data table.  It does not handle nested tables very well, but it will suffice for my needs.

Of course, now I'm having a problem with the RegEx functions parsing my HTML string of almost 3,000,000 characters!
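For anyone reading later, a regex-based row scraper along these lines could be sketched as follows. This is not the code used above (that code was never posted); it is a Python illustration with hypothetical names (scrape_table and the *_RE patterns), returning a list of rows instead of a data table. It assumes every row and cell has an explicit closing tag, and, like the original, it does not cope with nested tables:

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", FLAGS)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", FLAGS)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", FLAGS)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining markup inside a cell


def scrape_table(html, table_index):
    """Return the rows of the table at table_index (0-based) as lists of text."""
    tables = TABLE_RE.findall(html)
    rows = []
    for row_html in ROW_RE.findall(tables[table_index]):
        cells = [TAG_RE.sub("", cell).strip()
                 for cell in CELL_RE.findall(row_html)]
        rows.append(cells)
    return rows


sample = (
    "<table><tr><td>x</td></tr></table>"
    "<table><tr><td>y</td></tr></table>"
    "<TABLE border=1><TR><TH>Name</TH><TH>Qty</TH></TR>"
    "<TR><TD><b>apple</b></TD><TD> 3 </TD></TR></TABLE>"
)

# Index 2 is the third table on the page.
rows = scrape_table(sample, 2)
print(rows)  # [['Name', 'Qty'], ['apple', '3']]
```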

Author Comment

ID: 14011848
A little bit of research showed that my RegEx was working.  I was just being a bit too skeptical.

Anyways, thanks for your help.  I'm abandoning the XML method due to its complexity.


Question has a verified solution.
