Solved

how to parse the source of html file in Java

Posted on 2011-03-17
Last Modified: 2012-05-11
Hi,
I have an HTML file and I would like to parse its source.

Let's assume report.html is my html and I know the exact path to this file on my machine.

The pattern that I am parsing in the source looks like this:

<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>



And this pattern continues.

Now I would like to parse the HTML, extract these values, and write a plain view of the report to a separate text file, like this:

Title        Status
ABCD:        pass
EFGHJKLM:    fail

How can I do this in Java?

Thanks,


Question by:Tolgar
21 Comments
 
LVL 47

Expert Comment

by:for_yan
ID: 35157277
You can use a SAX parser to do this and retrieve the content of the "a" elements;
the alternating values will give you TITLE and STATUS.
0
 
LVL 47

Expert Comment

by:for_yan
ID: 35157310

This is a simple example of using SAX:

http://www.rgagnon.com/javadetails/java-0408.html
0
 
LVL 47

Expert Comment

by:for_yan
ID: 35157365
No, it was probably not a good suggestion to use SAX; getting the order right is not simple there.
0
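For readers following along: the sample markup in the question is not well-formed XML (the <td> and <tr> tags are never closed and color=green is unquoted), which is one reason a strict XML parser is awkward here. The "alternating links" idea can still be sketched with the standard library alone. This is only a rough sketch, assuming the report always follows the exact pattern shown in the question (a title link followed by a status link); the class and method names are made up for illustration, and a regular expression like this will break on markup that deviates from that pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ReportRegexSketch {
    // Captures the text inside <a ...>...</a>, allowing one optional <font ...> wrapper.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<a\\s[^>]*>(?:<font[^>]*>)?([^<]*)(?:</font>)?</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns {title, status} pairs, assuming links strictly alternate title, status.
    static List<String[]> extractRows(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i + 1 < texts.size(); i += 2) {
            rows.add(new String[] {texts.get(i), texts.get(i + 1)});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input copied from the question.
        String html =
                "<tr>\n"
              + "<td><a href=\"http://www-SOMEPATH/page1.html#check_ABCD\">ABCD</a>\n"
              + "<td><a href=\"SOMEPATH.html#ABCD\"><font color=green>pass</font></a>\n";
        for (String[] row : extractRows(html)) {
            System.out.println(row[0] + ": " + row[1]);
        }
    }
}
```

For anything beyond a fixed, machine-generated report like this one, a real HTML parser is the safer choice.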
 
LVL 86

Accepted Solution

by:
CEHJ earned 2000 total points
ID: 35157535
I usually use HttpUnit for this kind of thing; a high-level API makes it easy. Running the code below with the file URL file:x.html as its argument prints:

ABCD pass
EFGHJKLM fail

import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;


public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            // args[0] is the URL of the page, e.g. file:x.html
            WebResponse wr = wc.getResponse(args[0]);
            // Take the first table on the page
            WebTable table = wr.getTables()[0];
            // One String[] per table row, one String per cell
            String[][] cells = table.asText();
            for (String[] row : cells) {
                // Print every cell of this row on one line, separated by spaces
                for (String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


0
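The question also asked for the result in a separate text file with Title and Status columns. A minimal sketch of that last step, assuming a cells array shaped like the one returned by table.asText() above (one row per table row, title in the first cell, status in the second). The class name, method names and the output file name report.txt are hypothetical, and only the standard library is used:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HttpUnit.
final class ReportWriterSketch {
    // Builds the plain-text report: a header line, then "title:   status" per row.
    static String format(String[][] cells) {
        // Size the first column to the longest title (plus its trailing colon).
        int width = "Title".length();
        for (String[] row : cells) {
            if (row.length >= 2) {
                width = Math.max(width, row[0].trim().length() + 1);
            }
        }
        String fmt = "%-" + (width + 2) + "s%s%n";
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(fmt, "Title", "Status"));
        for (String[] row : cells) {
            if (row.length < 2) {
                continue; // skip rows that do not have both cells
            }
            sb.append(String.format(fmt, row[0].trim() + ":", row[1].trim()));
        }
        return sb.toString();
    }

    // Writes the formatted report to the given file as UTF-8.
    static void write(String[][] cells, Path target) throws IOException {
        Files.write(target, format(cells).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String[][] cells = {{"ABCD", "pass"}, {"EFGHJKLM", "fail"}};
        write(cells, Paths.get("report.txt")); // hypothetical output file name
        System.out.print(format(cells));
    }
}
```

In TableCollector, the nested printing loops could then be replaced by a single call such as ReportWriterSketch.write(cells, Paths.get("report.txt")).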
 

Author Comment

by:Tolgar
ID: 35157711
Hi CEHJ,
Thanks for your prompt response. Can you please explain the for loops with some comments? I had difficulty understanding them.

Also, if I want to collect the whole output into a string, how should I change this part?

for(String[] row : cells) {
    for(String cell : row) {
        java.lang.String QualActv += cell;
    }
    System.out.println();
}


Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35157764
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:

String QualActv = "";
for(String[] row : cells) {
    for(String cell : row) {
        QualActv += cell;
    }
}

0
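A note on the snippet above: += appends each cell directly, so the values run together with no separator between cells or rows. A small sketch of one way to keep them apart, assuming the same cells array and Java 8 or later for String.join; the class and method names are hypothetical:

```java
// Hypothetical helper showing StringBuilder-based collection.
final class CellCollectorSketch {
    // Joins the cells of each row with a space and ends each row with a newline.
    static String collect(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : cells) {
            sb.append(String.join(" ", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] cells = {{"ABCD", "pass"}, {"EFGHJKLM", "fail"}};
        System.out.print(collect(cells));
    }
}
```

StringBuilder also avoids creating a new String object on every +=, which starts to matter for large tables.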
 
LVL 86

Expert Comment

by:CEHJ
ID: 35158191
:)
0
 

Author Comment

by:Tolgar
ID: 35158713
An additional question.

Why do we set this to false, and what does it mean?

HttpUnitOptions.setScriptingEnabled(false);



Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35160724
> HttpUnitOptions.setScriptingEnabled(false);

It tells HttpUnit not to run any JavaScript on the page.
Typically you shouldn't set it to false.
0
 

Author Comment

by:Tolgar
ID: 35160850
So if there is JavaScript running on the page, what is going to happen?

Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35160890
If it's false, the JavaScript will just be ignored.
0
 

Author Comment

by:Tolgar
ID: 35161000
So then why do we set it to false in this case? I want it to happen regardless of whether there is JavaScript or not.


Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161004
>>Because I want it to happen

You want what to happen?
0
 

Author Comment

by:Tolgar
ID: 35161047
I want to read the table as text which is what we do in the solution. But I don't really understand why we say:

HttpUnitOptions.setScriptingEnabled(false);


Thanks,
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35161056
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
0
 
LVL 92

Expert Comment

by:objects
ID: 35161090
what code are you referring to?
0
 

Author Comment

by:Tolgar
ID: 35165283
Well, for now we don't. But this page is not under my control, so there may be some in the future. What should I do then?

Thanks,
0
 
LVL 92

Expert Comment

by:objects
ID: 35169356
what code exactly are you referring to?
0
 

Author Comment

by:Tolgar
ID: 35169604
Here it is:
(The TableCollector code from the accepted solution above, ID: 35157535, which includes the line HttpUnitOptions.setScriptingEnabled(false);)

0
 
LVL 92

Expert Comment

by:objects
ID: 35169618
Not sure why that's been added. Generally you'd want scripting enabled so that JavaScript is executed.
Open a new question if you need any more help with it.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35171297
Just remove that line - simple as that
0
