Solved

How to parse the source of an HTML file in Java

Posted on 2011-03-17
Last Modified: 2012-05-11
Hi,
I have an HTML file and I would like to parse its source.

Let's assume report.html is my HTML file and I know the exact path to it on my machine.

The pattern that I am parsing in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>



And this pattern continues.

Now I would like to parse the file, extract the values, and write a plain view of this report to a separate text file, like the following:

Title       Status
ABCD:       pass
EFGHJKLM:   fail

How can I do this in Java?

Thanks,


Question by:Tolgar
21 Comments
 
LVL 47

Expert Comment

by:for_yan
ID: 35157277
You can use a SAX parser to do this and retrieve the content of the "a" elements;
alternating values will give you the TITLE and STATUS.
 
LVL 47

Expert Comment

by:for_yan
ID: 35157310

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
 
LVL 47

Expert Comment

by:for_yan
ID: 35157365
No, using SAX was probably not a good suggestion: the ordering is not simple there.
 
LVL 86

Accepted Solution

by:
CEHJ earned 500 total points
ID: 35157535
I usually use HttpUnit for this kind of thing; a high-level API makes things easy. For the code below, run with the file URL file:x.html as its argument, the output is

ABCD pass
EFGHJKLM fail

import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;


public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            WebResponse wr = wc.getResponse(args[0]);
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
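
For readers who cannot add HttpUnit to their project, here is a rough sketch that uses only the standard library and also writes the "Title / Status" layout from the question to a text file. It assumes the markup looks exactly like the sample in the question (a title link followed by a status link wrapped in a font tag); the names ReportExtractor, extractRows and formatReport are made up for illustration, and a regular expression like this is brittle if the page layout changes.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of HttpUnit.
class ReportExtractor {
    // Assumed row shape: a title link, then a status link whose text sits inside <font>.
    private static final Pattern ROW = Pattern.compile(
        "<td>\\s*<a\\s+href=\"[^\"]*\">([^<]+)</a>\\s*"
        + "<td>\\s*<a\\s+href=\"[^\"]*\">\\s*<font[^>]*>([^<]+)</font>",
        Pattern.CASE_INSENSITIVE);

    // Returns one {title, status} pair per matched table row.
    static List<String[]> extractRows(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return rows;
    }

    // Builds the plain-text report: a header line, then "title: status" lines.
    static String formatReport(List<String[]> rows) {
        StringBuilder sb = new StringBuilder(String.format("%-12s%s%n", "Title", "Status"));
        for (String[] row : rows) {
            sb.append(String.format("%-12s%s%n", row[0] + ":", row[1]));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java ReportExtractor <report.html> <output.txt>");
            return;
        }
        // args[0] is the HTML file to read, args[1] the text file to write.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String report = formatReport(extractRows(html));
        Files.write(Paths.get(args[1]), report.getBytes(StandardCharsets.UTF_8));
    }
}
```

Run it as: java ReportExtractor report.html report.txt. For anything beyond this fixed pattern, a real HTML parser such as the HttpUnit approach above is the safer choice.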

 

Author Comment

by:Tolgar
ID: 35157711
Hi CEHJ,
Thanks for your prompt response. Can you please explain the for loops with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output into a string, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}



Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35157764
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:

String QualActv = "";
for(String[] row : cells) {
    for(String cell : row) {
        QualActv += cell;
    }
}
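
To spell the loops out with comments: the outer loop visits one table row at a time, and the inner work handles the cells of that row. The sketch below is one possible variation, not the code from the answer: it uses a StringBuilder, which is usually preferred over repeated += in a loop, and puts a space between cells and a newline after each row. The class name CellJoiner and the method join are made up for this example, and the sample array stands in for the result of table.asText().

```java
// Hypothetical helper for illustration; the 2D array stands in for table.asText().
class CellJoiner {
    static String join(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        // Outer loop: each element of cells is one table row.
        for (String[] row : cells) {
            // Join the cells of this row with a single space, then end the line.
            sb.append(String.join(" ", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] cells = { { "ABCD", "pass" }, { "EFGHJKLM", "fail" } };
        System.out.print(join(cells));
    }
}
```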

 
LVL 86

Expert Comment

by:CEHJ
ID: 35158191
:)
 

Author Comment

by:Tolgar
ID: 35158713
An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);




Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35160724
> HttpUnitOptions.setScriptingEnabled(false);

This says not to run any JavaScript on the page.
Typically you shouldn't set it to false.
 

Author Comment

by:Tolgar
ID: 35160850
So if there is JavaScript running on the page, what is going to happen?

Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35160890
If it's false, the JavaScript will just be ignored.
 

Author Comment

by:Tolgar
ID: 35161000
So then why do we set it to false in this case? Because I want it to happen regardless of whether there is JavaScript or not.


Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161004
>>Because I want it to happen

You want what to happen?
 

Author Comment

by:Tolgar
ID: 35161047
I want to read the table as text which is what we do in the solution. But I don't really understand why we say:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161056
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
 
LVL 92

Expert Comment

by:objects
ID: 35161090
what code are you referring to?
 

Author Comment

by:Tolgar
ID: 35165283
Well, for now we don't have any. But this page is not under my control, so there may be some in the future. Then what should I do?

Thanks,
 
LVL 92

Expert Comment

by:objects
ID: 35169356
what code exactly are you referring to?
 

Author Comment

by:Tolgar
ID: 35169604
Here it is:
import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;


public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            WebResponse wr = wc.getResponse(args[0]);
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

 
LVL 92

Expert Comment

by:objects
ID: 35169618
Not sure why that's been added. Generally you'd want to enable it so JavaScript is executed.
Open a new question if you need any more help with it.
 
LVL 86

Expert Comment

by:CEHJ
ID: 35171297
Just remove that line - simple as that