Solved

How to parse the source of an HTML file in Java

Posted on 2011-03-17
Last Modified: 2012-05-11
Hi,
I have an HTML file and I would like to parse its source.

Let's assume report.html is my html and I know the exact path to this file on my machine.

The pattern that I am parsing in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>

And this pattern continues.

Now I would like to extract the title and status from each row and write them to a separate plain-text file, like this:

Title        Status
ABCD:        pass
EFGHJKLM:    fail

How can I do this in Java?

Thanks,


Question by:Tolgar
21 Comments
 
LVL 47

Expert Comment

by:for_yan
ID: 35157277
You can use a SAX parser to do it and retrieve the content of the "a" elements.
Alternating values will give you the TITLE and the STATUS.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 35157310

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
0
 
LVL 47

Expert Comment

by:for_yan
ID: 35157365
No, it was probably not a good suggestion to use SAX - the order is not something simple there
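
The sample markup in the question never closes its td and tr tags, so a strict XML parser such as SAX would likely reject it. For readers who want to stay within the JDK, here is a minimal sketch of a different technique, a regular expression over the raw source. It is not something any poster in this thread proposed, it assumes every row looks exactly like the sample in the question, and it will break on markup that deviates from that shape. The names ReportRegexParser and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls (title, status) pairs out of the sample markup.
class ReportRegexParser {

    // Group 1: text of the first link (the title).
    // Group 2: text inside the <font> element of the following link (the status).
    private static final Pattern ROW = Pattern.compile(
            "<a\\s+href=\"[^\"]*\">([^<]+)</a>\\s*<td>\\s*"
            + "<a\\s+href=\"[^\"]*\">\\s*<font[^>]*>([^<]+)</font>",
            Pattern.CASE_INSENSITIVE);

    // Returns one {title, status} array per matched row.
    static List<String[]> parse(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return rows;
    }

    public static void main(String[] args) {
        // The two rows from the question, inlined for illustration.
        String html =
              "<tr>\n"
            + "<td><a href=\"http://www-SOMEPATH/page1.html#check_ABCD\">ABCD</a>\n"
            + "<td><a href=\"SOMEPATH.html#ABCD\"><font color=green>pass</font></a>\n"
            + "<tr>\n"
            + "<td><a href=\"http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM\">EFGHJKLM</a>\n"
            + "<td><a href=\"SOMEPATH.html#EFGHJKLM\"><font color=red>fail</font></a>\n";
        for (String[] row : parse(html)) {
            System.out.println(row[0] + ": " + row[1]);
        }
    }
}
```

A real HTML parser is the more robust choice whenever the page layout is not under your control.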
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 500 total points
ID: 35157535
I usually use HttpUnit to do this kind of thing; a high-level API makes things easy. For the code below, executed with the file URL file:x.html, the output is

ABCD pass
EFGHJKLM fail

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.httpunit.HttpUnitOptions;

public class TableCollector {
    public static void main(String[] args) {
        try {
            // Do not execute any JavaScript found in the page
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            // args[0] is the URL of the page, e.g. file:x.html
            WebResponse wr = wc.getResponse(args[0]);
            // Take the first table in the document
            WebTable table = wr.getTables()[0];
            // One String[] per table row, one String per cell
            String[][] cells = table.asText();
            for (String[] row : cells) {
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
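
The question also asks for the result to go into a separate text file with a "Title / Status" header, which the code above does not do. Here is a minimal sketch of that last step, assuming the cells array has the shape returned by table.asText() (one row per entry, title in column 0, status in column 1). ReportFormatter, format, write and the file name report.txt are hypothetical names chosen for illustration; they are not part of HttpUnit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: turns table cells into the plain-text report.
class ReportFormatter {

    // Builds a header line plus one "title: status" line per row.
    static String format(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-12s%s%n", "Title", "Status"));
        for (String[] row : cells) {
            if (row.length < 2) {
                continue; // skip rows that lack a title or a status cell
            }
            sb.append(String.format("%-12s%s%n", row[0].trim() + ":", row[1].trim()));
        }
        return sb.toString();
    }

    // Writes the formatted report to the given file as UTF-8.
    static void write(String[][] cells, Path target) throws IOException {
        Files.write(target, format(cells).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the array returned by table.asText().
        String[][] cells = { { "ABCD", "pass" }, { "EFGHJKLM", "fail" } };
        write(cells, Paths.get("report.txt"));
    }
}
```

In TableCollector, the nested loops could then be replaced by a single call such as ReportFormatter.write(cells, Paths.get("report.txt")).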


0
 

Author Comment

by:Tolgar
ID: 35157711
Hi CEHJ,
Thanks for your prompt response. Could you please explain the for loops with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output in a string, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}



Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35157764
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:
String QualActv = "";
for(String[] row : cells) {
    for(String cell : row) {
        QualActv += cell;
    }
}
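
Note that the loop above glues all cells together with no separator, and each += creates a new String. Here is a minimal sketch of the same collection using StringBuilder, keeping a space between cells and a newline after each row. It assumes cells has the shape returned by table.asText(); CellJoiner and join are hypothetical names.

```java
// Hypothetical helper: collects all table cells into one String.
class CellJoiner {

    // Joins the cells of each row with a space and ends each row with '\n'.
    static String join(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : cells) {
            sb.append(String.join(" ", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by table.asText().
        String[][] cells = { { "ABCD", "pass" }, { "EFGHJKLM", "fail" } };
        System.out.print(join(cells));
    }
}
```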


0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35158191
:)
0
 

Author Comment

by:Tolgar
ID: 35158713
An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);




Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35160724
> HttpUnitOptions.setScriptingEnabled(false);

It tells HttpUnit not to run any JavaScript on the page.
Typically you shouldn't set it to false.
0
 

Author Comment

by:Tolgar
ID: 35160850
So if there is javascript running on the page, then what is gonna happen?

Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35160890
If it's false, the JavaScript will just be ignored.
0
 

Author Comment

by:Tolgar
ID: 35161000
So then why do we set it to false in this case? I want it to happen regardless of whether there is JavaScript or not.


Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161004
>>Because I want it to happen

You want what to happen?
0
 

Author Comment

by:Tolgar
ID: 35161047
I want to read the table as text which is what we do in the solution. But I don't really understand why we say:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161056
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
0
 
LVL 92

Expert Comment

by:objects
ID: 35161090
what code are you referring to?
0
 

Author Comment

by:Tolgar
ID: 35165283
Well, for now we don't have any. But this page is not under my control, so some may be added later. What should I do then?

Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35169356
what code exactly are you referring to?
0
 

Author Comment

by:Tolgar
ID: 35169604
Here it is: the TableCollector class from the accepted solution above, in particular this line:

HttpUnitOptions.setScriptingEnabled(false);

0
 
LVL 92

Expert Comment

by:objects
ID: 35169618
Not sure why that's been added. Generally you'd want to enable it so that JavaScript is executed.
Open a new question if you need any more help with it.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35171297
Just remove that line - simple as that
0
