Simple way to parse HTML

I'm fairly new to Java and I'm looking for the simplest possible way to parse an HTML document and get the values from a table. For instance, take this HTML document:

<html>
<body>
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
</table>
</body>
</html>

If I simply wanted to find the table cells and get their contents, what would be the simplest way to do it? Examples would be very helpful (particularly as it pertains to opening the HTML document from the code).

Ovi

1. You can use a JEditorPane (not necessarily visible on the screen). Load it with your page through its constructor or the setPage(URL/String) method, or set the contentType to "text/html" so that an HTMLEditorKit is used as the default kit; you will then be able to use the setText("html string") method. After that, retrieve the document of the JEditorPane with the getDocument() method, cast it to HTMLDocument, iterate through the elements, and retrieve their content.
2. You could use the parser already included in the javax.swing.text.html package to do the job for you.
3. Implement your own simple parser specialized for simple tables. I will try to do this for you now...
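
In the meantime, option 2 could look roughly like the sketch below. It assumes the ParserDelegator and HTMLEditorKit.ParserCallback classes that ship with Swing; the class name TableCellExtractor and the method extractCells are just names made up for this example, and nested tables are not handled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named only for this example.
class TableCellExtractor {

    /** Collects the trimmed text of every <td> element read from the given HTML source. */
    static List<String> extractCells(Reader html) throws IOException {
        final List<String> cells = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside a <td> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TD) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD && current != null) {
                    cells.add(current.toString().trim());
                    current = null;
                }
            }
        };

        // The third argument tells the parser to ignore charset directives in the page.
        new ParserDelegator().parse(html, callback, true);
        return cells;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><table>"
                + "<tr><td>1</td><td>2</td><td>3</td></tr>"
                + "</table></body></html>";
        System.out.println(extractCells(new StringReader(html)));
    }
}
```

To read from a file or a URL instead of a string, you would pass a java.io.FileReader or an InputStreamReader wrapped around the URL's stream in place of the StringReader.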

saxaboo

If you can rely on certain things about the HTML you want to parse, maybe you could use regular expressions instead of running a full parser on it.

Example: assume I know that there is only one table in the page I'm receiving. Then a /<table>(.*)<\/table>/ regex run on the source allows me to extract the inner HTML.

Assume I know that all the rows will be 3 columns wide and that the HTML developer puts carriage returns between rows in the source code.
Then I can try a /<tr><td>(.*)<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr>/ match on each line to extract the cells.

Of course, this depends heavily on the assumptions you can safely make about the input HTML format and on the reliability you want, but trying regexes is always a good idea anyway; they are sooooo powerful :)
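
As a rough Java sketch of that idea, using java.util.regex: the class name RegexCellExtractor and the method extractCells are made up for this example, and I have swapped in non-greedy (.*?) groups with the DOTALL flag, matching each cell individually instead of a whole row, so that the table may span several lines. It still assumes a single table with plain <table> and <td> tags and no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class RegexCellExtractor {

    // DOTALL lets '.' match line breaks, so the table may span several lines.
    private static final Pattern TABLE =
            Pattern.compile("<table>(.*?)</table>", Pattern.DOTALL);

    // Non-greedy group so each match stops at the first closing </td>.
    private static final Pattern CELL =
            Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    /** Returns the trimmed contents of every cell in the first table, or an empty list. */
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        if (!table.find()) {
            return cells; // no table in the input
        }
        Matcher cell = CELL.matcher(table.group(1));
        while (cell.find()) {
            cells.add(cell.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<table>\n"
                + "<tr><td>1</td><td>2</td><td>3</td></tr>\n"
                + "</table>\n</body>\n</html>";
        System.out.println(extractCells(html)); // prints [1, 2, 3]
    }
}
```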

If you can post some more info, maybe we can help you a little further.

-S

ASKER CERTIFIED SOLUTION

Ovi

moshecristel

ASKER

It's not as simple as I had hoped but it seems to work. Thanks.