Tolgar

asked on

How to parse the source of an HTML file in Java

Hi,
I have an HTML file and I would like to parse its source.

Let's assume report.html is my HTML file and I know the exact path to this file on my machine.

The pattern that I am parsing in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>


And this pattern continues.

Now I would like to extract these values and write a plain view of this report to a separate text file, like the following:

Title        Status
ABCD:        pass
EFGHJKLM:    fail

How can I do this in Java?

Thanks,


for_yan

You can use a SAX parser to do it and retrieve the content of the "a" elements;
alternating values will give you TITLE and STATUS.

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
No, on second thought SAX was probably not a good suggestion: the order of the elements is not that simple there.
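For comparison, here is a dependency-free way to apply the "alternating values" idea above without SAX. It is only a minimal sketch, assuming every row follows exactly the pattern shown in the question (one link holding the title, then one link holding the status, optionally wrapped in a font tag). The class name `ReportRowExtractor`, the regular expression and the 12-character column width are illustrative choices, not something from this thread, and a regex like this is fragile on HTML that deviates from the pattern.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pairs up the link texts of the report into {title, status} rows.
class ReportRowExtractor {
    // Captures the text that follows an <a ...> tag, skipping an optional <font ...> tag.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<a\\s[^>]*>\\s*(?:<font[^>]*>)?([^<]*)", Pattern.CASE_INSENSITIVE);

    static List<String[]> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        // Assumption: link texts alternate title, status, title, status, ...
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i + 1 < texts.size(); i += 2) {
            rows.add(new String[] { texts.get(i), texts.get(i + 1) });
        }
        return rows;
    }

    static String format(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-12s%s%n", "Title", "Status"));
        for (String[] row : rows) {
            sb.append(String.format("%-12s%s%n", row[0] + ":", row[1]));
        }
        return sb.toString();
    }

    // args[0]: path to the HTML report, args[1]: path of the text file to write
    public static void main(String[] args) throws IOException {
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), format(extract(html)).getBytes(StandardCharsets.UTF_8));
    }
}
```

It could be run as `java ReportRowExtractor report.html report.txt`, where both file names are placeholders.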
ASKER CERTIFIED SOLUTION
CEHJ
This solution is only available to members.
Tolgar

ASKER

Hi CEHJ,
Thanks for your prompt response. Can you please explain the for loops to me with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output in a String, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}


Thanks,
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:
String QualActv = "";
for (String[] row : cells) {      // each row of the table
    for (String cell : row) {     // each cell in that row
        QualActv += cell;         // append the cell text to the result
    }
}

:)
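The collecting loop above joins the cells with nothing between them. As a hedged variation, the sketch below uses a `StringBuilder` and adds a space between cells and a newline after each row; the class name `CellJoiner` and the choice of separators are assumptions for illustration, not part of the original answer.

```java
// Hypothetical helper: turns the String[][] from table.asText() into one String.
class CellJoiner {
    static String join(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : cells) {
            // Cells of one row, separated by a single space
            sb.append(String.join(" ", row));
            // One output line per table row
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

With this in place, `String QualActv = CellJoiner.join(cells);` would replace the nested loops.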
Tolgar

ASKER

An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
> HttpUnitOptions.setScriptingEnabled(false);

tells HttpUnit not to run any JavaScript on the page.
Typically you shouldn't set it to false.
Tolgar

ASKER

So if there is JavaScript running on the page, what is going to happen?

Thanks,
If it's false, the JavaScript will just be ignored.
Tolgar

ASKER

So then why do we set it to false in this case? Because I want it to happen regardless of whether there is JavaScript or not.


Thanks,
>>Because I want it to happen

You want what to happen?
Tolgar

ASKER

I want to read the table as text which is what we do in the solution. But I don't really understand why we say:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
What code are you referring to?
Tolgar

ASKER

Well, for now we don't. But this page is not under my control, so there may be some in the future. What should I do then?

Thanks,
What code exactly are you referring to?
Tolgar

ASKER

Here it is:
import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;

public class TableCollector {
    public static void main(String[] args) {
        try {
            // Do not execute any JavaScript found on the page
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            // args[0] is the URL of the HTML report
            WebResponse wr = wc.getResponse(args[0]);
            // Take the first table on the page and read its cells as text
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Not sure why that's been added. Generally you'd want to enable it so that JavaScript is executed.
Open a new question if you need any more help with it.
Just remove that line. It's as simple as that.