moshecristel
asked on
Simple way to parse HTML
I'm fairly new to Java and I'm looking for the simplest way possible to parse an HTML document and get the values from a table. For instance, take this HTML document...
<html>
<body>
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
</table>
</body>
</html>
If I simply wanted to find the table cells and get their contents, what would be the simplest way to do it? Examples would be very helpful (particularly as it pertains to opening the HTML document from the code).
If you can make some safe assumptions about the HTML you want to process, maybe you could use regular expressions instead of a full parser.
Example: assume I know that there is only one table in the page I'm receiving. Then a /<table>(.*)<\/table>/ regex run on the source (with the dot set to match newlines, since the table spans several lines) lets me extract the table's inner HTML.
Assume I know that all the rows will be 3 columns wide, as in your sample, and that the HTML developer puts a line break between rows in his source code.
Then I can try a /<tr><td>(.*)<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr>/ match on each line to extract the cells.
Of course, this depends heavily on the assumptions you can safely make about the input HTML format and on the reliability you need, but regexes are always worth a try, they are sooooo powerful :)
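To make the two regex steps above concrete, here is a minimal sketch in Java. It assumes exactly one table, rows of exactly three plain `<td>` cells with no attributes, and lowercase tags. The class and method names (`RegexTableSketch`, `cells`) are made up for illustration, and the patterns use reluctant `(.*?)` groups instead of the greedy `(.*)` shown above so that a match stops at the first closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: only valid under the assumptions stated in the text.
class RegexTableSketch {
    // DOTALL lets '.' match line breaks, because the table spans several lines.
    private static final Pattern TABLE =
            Pattern.compile("<table>(.*?)</table>", Pattern.DOTALL);

    // One row with exactly three cells; each group captures one cell's text.
    private static final Pattern ROW =
            Pattern.compile("<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>");

    static List<String> cells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        if (!table.find()) {
            return cells; // no table found: return an empty list
        }
        Matcher row = ROW.matcher(table.group(1));
        while (row.find()) {
            for (int i = 1; i <= row.groupCount(); i++) {
                cells.add(row.group(i));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<table>\n"
                + "<tr><td>1</td><td>2</td><td>3</td></tr>\n"
                + "</table>\n</body>\n</html>";
        System.out.println(cells(html)); // prints [1, 2, 3]
    }
}
```

If any of the assumptions fail (attributes on the tags, a nested table, a row with a different number of cells), the row is silently skipped, which is the reliability trade-off mentioned above.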
If you can post some more info, maybe we can help you a little further.
-S
ASKER CERTIFIED SOLUTION
ASKER
It's not as simple as I had hoped but it seems to work. Thanks.
2. You could use the parser already included in the JDK, in the javax.swing.text.html package, to do the job for you.
3. Implement your own simple parser specialized for simple tables. I will try to do this for you now...
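Option 2 might look roughly like the sketch below. This is not the code from the accepted solution, just an illustration of the callback style of the JDK's built-in parser: `ParserDelegator` reads the HTML from a `Reader` and calls back into an `HTMLEditorKit.ParserCallback` for each tag and text run. The class and method names (`SwingTableCells`, `cells`) are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every <td> cell in the input.
class SwingTableCells {
    static List<String> cells(Reader in) {
        final List<String> cells = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside a <td>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TD) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD && current != null) {
                    cells.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<html><body><table>"
                + "<tr><td>1</td><td>2</td><td>3</td></tr>"
                + "</table></body></html>";
        System.out.println(cells(new StringReader(html)));
    }
}
```

To open the HTML document from a file instead of a string, pass a `java.io.FileReader` for your file (for example a hypothetical `page.html`) in place of the `StringReader`. Unlike the regex approach, this does not depend on how the rows are laid out in the source.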