How to parse the source of an HTML file in Java

Hi,
I have an HTML file and I would like to parse its source.

Let's assume the file is report.html and I know its exact path on my machine.

The pattern I want to parse in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>


And this pattern continues.

Now I would like to parse the file, extract these values, and write a plain view of the report to a separate text file, like this:

Title        Status
ABCD:        pass
EFGHJKLM:    fail

How can I do this in Java?

Thanks,


Asked by: Tolgar
 
CEHJ commented:
I usually use HttpUnit for this kind of thing; its high-level API makes it easy. The code below, run with the file URL file:x.html as its argument, prints:

ABCD pass
EFGHJKLM fail

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.httpunit.HttpUnitOptions;


public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            WebResponse wr = wc.getResponse(args[0]);
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
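
To get from the cells array to the text file asked for in the question, something along these lines might work. It is only a sketch: it uses nothing but the JDK and assumes each row returned by table.asText() holds the title in column 0 and the status in column 1. The class name ReportWriter, its two methods and the 20-character column width are hypothetical choices, not part of HttpUnit, and the hard-coded array in main stands in for the real table.asText() result.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReportWriter {
    // Builds the two-column report as one String.
    // Rows with fewer than two cells are skipped (an assumption about the data).
    static String format(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-20s%s%n", "Title", "Status"));
        for (String[] row : cells) {
            if (row.length < 2) {
                continue;
            }
            // Left-align the title in a 20-character column, then the status
            sb.append(String.format("%-20s%s%n", row[0].trim() + ":", row[1].trim()));
        }
        return sb.toString();
    }

    // Writes the formatted report to the given file as UTF-8 text.
    static void write(String[][] cells, Path target) throws IOException {
        Files.write(target, format(cells).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for table.asText() from the code above
        String[][] cells = { { "ABCD", "pass" }, { "EFGHJKLM", "fail" } };
        Path target = Files.createTempFile("report", ".txt");
        write(cells, target);
        System.out.println("Report written to " + target);
    }
}
```

In the real program you would pass the cells array from TableCollector to ReportWriter.write together with whatever output path you want.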

 
for_yan commented:
You can use a SAX parser for this and retrieve the content of the "a" elements;
the alternating values will give you the title and the status.
 
for_yan commented:

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
 
for_yan commented:
On reflection, SAX was probably not a good suggestion: the order of the elements is not that simple here.
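
For what it's worth, the snippet in the question is also not well-formed XML (the tr and td tags are never closed and color=green is unquoted), which is another reason a strict XML parser is awkward here. If you want to avoid an extra library, a regular expression may be enough for this one fixed layout. The following is only a sketch, not the approach from the HttpUnit answer: it assumes every row contains exactly two "a" elements in title/status order, as in the question, and it will be brittle against HTML that deviates from that. AnchorPairs and extract are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorPairs {
    // Matches one <a ...>...</a> element and captures its inner markup
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns {title, status} pairs built from consecutive anchor texts.
    static List<String[]> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Drop nested tags such as <font ...> and keep only the text
            texts.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        List<String[]> pairs = new ArrayList<>();
        // Assumption: anchors alternate title, status, title, status, ...
        for (int i = 0; i + 1 < texts.size(); i += 2) {
            pairs.add(new String[] { texts.get(i), texts.get(i + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<tr>",
                "<td><a href=\"http://www-SOMEPATH/page1.html#check_ABCD\">ABCD</a>",
                "<td><a href=\"SOMEPATH.html#ABCD\"><font color=green>pass</font></a>",
                "<tr>",
                "<td><a href=\"http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM\">EFGHJKLM</a>",
                "<td><a href=\"SOMEPATH.html#EFGHJKLM\"><font color=red>fail</font></a>");
        for (String[] pair : extract(html)) {
            System.out.printf("%s: %s%n", pair[0], pair[1]);
        }
    }
}
```

Run on the sample rows from the question, main prints "ABCD: pass" and "EFGHJKLM: fail" on separate lines.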
 
Tolgar (author) commented:
Hi CEHJ,
Thanks for your prompt response. Could you please explain the for loops with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output in a String, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}


Thanks,
 
CEHJ commented:
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:

String QualActv = "";
for (String[] row : cells) {
    for (String cell : row) {
        QualActv += cell;
    }
}
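
Note that the loop above concatenates the cells with nothing between them, so the two sample rows would come out roughly as ABCDpassEFGHJKLMfail. A small sketch, assuming you want a space between cells and a newline after each row, and using a StringBuilder rather than repeated String concatenation. CellJoiner and join are illustrative names only, and the array in main stands in for table.asText().

```java
class CellJoiner {
    // Joins the cells of each row with a space and ends each row with a newline.
    static String join(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : cells) {
            sb.append(String.join(" ", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the cells array returned by table.asText()
        String[][] cells = { { "ABCD", "pass" }, { "EFGHJKLM", "fail" } };
        String qualActv = join(cells);
        System.out.print(qualActv);
    }
}
```

With the two sample rows this prints "ABCD pass" and "EFGHJKLM fail" on separate lines.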
 
CEHJ commented:
:)
 
Tolgar (author) commented:
An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
 
objects commented:
> HttpUnitOptions.setScriptingEnabled(false);

This tells HttpUnit not to run any JavaScript on the page.
Typically you shouldn't set it to false.
 
Tolgar (author) commented:
So if there is JavaScript on the page, what will happen?

Thanks,
 
objects commented:
If it's set to false, the JavaScript will just be ignored.
 
Tolgar (author) commented:
So why do we set it to false in this case? I want it to happen regardless of whether there is JavaScript or not.


Thanks,
 
CEHJ commented:
>>Because I want it to happen

You want what to happen?
 
Tolgar (author) commented:
I want to read the table as text, which is what the solution does. But I don't really understand why we call:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
 
CEHJ commented:
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
 
objects commented:
What code are you referring to?
 
Tolgar (author) commented:
Well, for now we don't have any. But this page is not under my control, so there may be some at a later point. What should I do then?

Thanks,
 
objects commented:
What code exactly are you referring to?
 
Tolgar (author) commented:
Here it is:
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.httpunit.HttpUnitOptions;


public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            WebResponse wr = wc.getResponse(args[0]);
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
objects commented:
I'm not sure why that was added. Generally you'd want scripting enabled so that JavaScript is executed.
Open a new question if you need any more help with it.
 
CEHJ commented:
Just remove that line. It's as simple as that.
Question has a verified solution.