Solved

How to parse the source of an HTML file in Java

Posted on 2011-03-17
Last Modified: 2012-05-11
Hi,
I have an HTML file and I would like to parse its source.

Let's assume report.html is my HTML file and I know the exact path to it on my machine.

The pattern that I am parsing in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>



And this pattern continues.

Now I would like to parse this, extract the values, and write a plain view of the report to a separate text file, like this:

Title               Status
ABCD:               pass
EFGHJKLM:           fail

How can I do this in Java?

Thanks,


Question by:Tolgar
 
LVL 47

Expert Comment

by:for_yan
ID: 35157277
You can use a SAX parser to do it and retrieve the content of the "a" elements;
alternating values will give you TITLE and STATUS.
 
LVL 47

Expert Comment

by:for_yan
ID: 35157310

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
 
LVL 47

Expert Comment

by:for_yan
ID: 35157365
No, SAX was probably not a good suggestion after all: keeping track of the element order is not simple there.
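
For what it's worth, the "alternating values" idea does not strictly need an XML parser. Below is a minimal sketch that swaps in plain java.util.regex instead of SAX. It assumes the report keeps exactly the shape shown in the question (anchors strictly alternating title, status, with at most simple tags such as <font> inside them). The class and method names (AnchorTextExtractor, extractRows) are made up for illustration, and regular expressions are fragile on arbitrary HTML, so treat this as a sketch rather than a general parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
final class AnchorTextExtractor {
    // Matches one <a ...>...</a> element; group 1 is everything between the tags.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns {title, status} pairs, assuming anchors strictly alternate title, status. */
    static List<String[]> extractRows(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Drop nested tags such as <font color=green> and keep only the visible text.
            texts.add(m.group(1).replaceAll("<[^>]*>", "").trim());
        }
        List<String[]> rows = new ArrayList<>();
        // Pair the texts up: even positions are titles, odd positions are statuses.
        for (int i = 0; i + 1 < texts.size(); i += 2) {
            rows.add(new String[] {texts.get(i), texts.get(i + 1)});
        }
        return rows;
    }
}
```

The HTML string could be read from report.html with java.nio.file.Files.readAllBytes before being passed to extractRows. If the page layout changes, a real HTML parser is the safer choice.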

 
LVL 86

Accepted Solution

by:
CEHJ earned 2000 total points
ID: 35157535
I usually use HttpUnit for this kind of thing; a high-level API makes it easy. For the code below, run with the file URL file:x.html, the output is:

ABCD pass
EFGHJKLM fail

import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;

public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            // args[0] is the URL of the page to read, e.g. file:x.html
            WebResponse wr = wc.getResponse(args[0]);
            // Take the first table found in the document
            WebTable table = wr.getTables()[0];
            // cells[row][column] holds the text of each table cell
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
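
The question also asked for the result to end up in a plain text file with a Title/Status header. The following is a minimal sketch of that last step, assuming the cells array has the same shape as the one returned by table.asText() above (one String[] per row, title first, status second). ReportWriter, formatReport and writeReport are hypothetical names chosen for illustration, and the column width of 20 is an arbitrary choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only uses the JDK, not HttpUnit.
final class ReportWriter {
    /** Builds the report lines: a header, then one "TITLE: status" line per row. */
    static List<String> formatReport(String[][] cells) {
        List<String> lines = new ArrayList<>();
        // %-20s left-aligns the first column and pads it to 20 characters.
        lines.add(String.format("%-20s%s", "Title", "Status"));
        for (String[] row : cells) {
            if (row.length < 2) {
                continue; // skip rows that lack a title/status pair
            }
            lines.add(String.format("%-20s%s", row[0].trim() + ":", row[1].trim()));
        }
        return lines;
    }

    /** Writes the report to the given file, replacing any existing content. */
    static void writeReport(String[][] cells, Path target) throws IOException {
        Files.write(target, formatReport(cells), StandardCharsets.UTF_8);
    }
}
```

Inside TableCollector this could be called as ReportWriter.writeReport(cells, java.nio.file.Paths.get("report.txt")), where report.txt is just an example file name.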

 

Author Comment

by:Tolgar
ID: 35157711
Hi CEHJ,
Thanks for your prompt response. Could you please explain the for loops with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output in a string, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}



Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35157764
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:
String QualActv = "";
for (String[] row : cells) {
    for (String cell : row) {
        QualActv += cell;
    }
}
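
Note that the loop above runs all cells together with nothing between them (for the sample data, "ABCDpassEFGHJKLMfail"). If the collected string should keep one row per line, something along these lines may be closer to what is wanted. This is only a sketch: CellCollector and collect are hypothetical names, and it uses a StringBuilder rather than += so that a large table does not create a new String on every append.

```java
// Hypothetical helper for illustration; works on the String[][] from table.asText().
final class CellCollector {
    /** Joins each row's cells with a space and puts each row on its own line. */
    static String collect(String[][] cells) {
        StringBuilder out = new StringBuilder();
        for (String[] row : cells) {
            // String.join places a single space between the cells of one row.
            out.append(String.join(" ", row)).append(System.lineSeparator());
        }
        return out.toString();
    }
}
```

It would be used as String QualActv = CellCollector.collect(cells);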

 
LVL 86

Expert Comment

by:CEHJ
ID: 35158191
:)
 

Author Comment

by:Tolgar
ID: 35158713
An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);




Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35160724
> HttpUnitOptions.setScriptingEnabled(false);

That tells HttpUnit not to run any JavaScript on the page.
Typically you shouldn't set it to false.
 

Author Comment

by:Tolgar
ID: 35160850
So if there is JavaScript on the page, what is going to happen?

Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35160890
If it's false, the JavaScript will just be ignored.
 

Author Comment

by:Tolgar
ID: 35161000
So then why do we set it to false in this case? Because I want it to happen regardless of whether there is JavaScript or not.


Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161004
>>Because I want it to happen

You want what to happen?
 

Author Comment

by:Tolgar
ID: 35161047
I want to read the table as text which is what we do in the solution. But I don't really understand why we say:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161056
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
 
LVL 92

Expert Comment

by:objects
ID: 35161090
What code are you referring to?
 

Author Comment

by:Tolgar
ID: 35165283
Well, for now we don't have any. But this page is not under my control, so there may be some in the future. What should I do then?

Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35169356
What code exactly are you referring to?
 

Author Comment

by:Tolgar
ID: 35169604
Here it is:
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.httpunit.HttpUnitOptions;


public class TableCollector {
    public static void main(String[] args) {
	try {
	    HttpUnitOptions.setScriptingEnabled(false);
	    WebConversation wc = new WebConversation();
	    WebResponse wr = wc.getResponse(args[0]);
	    WebTable table = wr.getTables()[0];
	    String[][] cells = table.asText();
	    for(String[] row : cells) {
		for(String cell : row) {
		    System.out.printf("%s ", cell);
		}
		System.out.println();	
	    }
	}
	catch(Exception e) {
	    e.printStackTrace();	
	}
    }
}

 
LVL 92

Expert Comment

by:objects
ID: 35169618
Not sure why that's been added. Generally you'd want to enable it so that JavaScript is executed.
Open a new question if you need any more help with it.
 
LVL 86

Expert Comment

by:CEHJ
ID: 35171297
Just remove that line. It's as simple as that.
