Tolgar
asked on
How to parse the source of an HTML file in Java
Hi,
I have an HTML file and I would like to parse its source.
Let's assume report.html is my HTML file and I know the exact path to it on my machine.
The pattern that I am parsing in the source looks like this:
<tr>
<td><a href="http://www-SOMEPATH/page1.html#check_ABCD">ABCD</a>
<td><a href="SOMEPATH.html#ABCD"><font color=green>pass</font></a>
<tr>
<td><a href="http://www-SOMEPATH/sbcheck.html#check_EFGHJKLM">EFGHJKLM</a>
<td><a href="SOMEPATH.html#EFGHJKLM"><font color=red>fail</font></a>
And this pattern continues.
Now I would like to parse this, extract the values, and write a plain view of the report to a separate text file, like the following:
Title Status
ABCD: pass
EFGHJKLM: fail
How can I do this in Java?
Thanks,
No, it was probably not a good suggestion to use SAX - the order is not something simple there
ASKER
Hi CEHJ,
Thanks for your prompt response. Can you please explain the for loops with some comments? I had difficulty understanding them.
Also, if I want to collect the whole output in a string, how should I change this part?
for(String[] row : cells) {
for(String cell : row) {
java.lang.String QualActv += cell;
}
System.out.println();
}
Thanks,
You've got the idea about the loops. If you want to collect all cells, you need to define the collecting String first:
String QualActv = "";
for(String[] row : cells) {
for(String cell : row) {
QualActv += cell;
}
}
:)
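For a larger table, the same collection could also be sketched with a StringBuilder, which avoids building a new String on every +=. This is only an illustration: the class name CellJoiner, the method name join, and the sample data are made up here, and the space and newline separators are an addition that the snippet above does not have. The cells array stands in for whatever table.asText() returns.

```java
public class CellJoiner {
    // Joins every cell of every row into one String.
    // Cells are followed by a space, rows by a newline (an assumption;
    // the original snippet concatenates without separators).
    public static String join(String[][] cells) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : cells) {        // outer loop: one table row at a time
            for (String cell : row) {       // inner loop: each cell in that row
                sb.append(cell).append(' ');
            }
            sb.append('\n');                // end of the row
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data shaped like the report in the question.
        String[][] cells = { {"ABCD", "pass"}, {"EFGHJKLM", "fail"} };
        String qualActv = join(cells);
        System.out.print(qualActv);
    }
}
```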
ASKER
An additional question.
Why do we set this to false, and what does it mean?
HttpUnitOptions.setScriptingEnabled(false);
Thanks,
> HttpUnitOptions.setScriptingEnabled(false);
That says not to run any JavaScript on the page.
Typically you shouldn't set it to false.
ASKER
So if there is JavaScript running on the page, what is going to happen?
Thanks,
If it's false, the JavaScript will just be ignored.
ASKER
So then why do we set it to false in this case? Because I want it to happen regardless of whether there is JavaScript or not.
Thanks,
>>Because I want it to happen
You want what to happen?
ASKER
I want to read the table as text, which is what we do in the solution. But I don't really understand why we say:
HttpUnitOptions.setScriptingEnabled(false);
Thanks,
That was just something left over from a previous bit of code. Do you have JavaScript in the HTML?
What code are you referring to?
ASKER
Well, for now we don't. But this page is not under my control, so there may be some in the future. Then what should I do?
Thanks,
What code exactly are you referring to?
ASKER
Here it is:
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.httpunit.HttpUnitOptions;

public class TableCollector {
    public static void main(String[] args) {
        try {
            HttpUnitOptions.setScriptingEnabled(false);
            WebConversation wc = new WebConversation();
            WebResponse wr = wc.getResponse(args[0]);
            WebTable table = wr.getTables()[0];
            String[][] cells = table.asText();
            for(String[] row : cells) {
                for(String cell : row) {
                    System.out.printf("%s ", cell);
                }
                System.out.println();
            }
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }
}
Not sure why that's been added. Generally you'd want to enable it so that JavaScript is executed.
Open a new question if you need any more help with it.
Just remove that line - simple as that
Alternating values will give you TITLE and STATUS.
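As a rough sketch of that last point, the cell values could be paired up and written out in the "Title Status" layout from the question. Everything named here is an assumption for illustration: the class ReportFormatter, the method format, the output file name report.txt, and the sample data. The cells array stands in for the result of table.asText(), and the sketch assumes the non-empty cells really do alternate title, status, title, status.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReportFormatter {
    // Builds the plain-text report from the table cells.
    // Assumes non-empty cells alternate: title, status, title, status, ...
    public static String format(String[][] cells) {
        // Flatten the table into one list, skipping blank cells.
        List<String> values = new ArrayList<>();
        for (String[] row : cells) {
            for (String cell : row) {
                String text = cell.trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
        }
        // Pair up alternating values; a trailing unpaired value is dropped.
        StringBuilder report = new StringBuilder("Title Status\n");
        for (int i = 0; i + 1 < values.size(); i += 2) {
            report.append(values.get(i)).append(": ")
                  .append(values.get(i + 1)).append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data; in TableCollector this would be table.asText().
        String[][] cells = { {"ABCD", "pass"}, {"EFGHJKLM", "fail"} };
        // report.txt is a placeholder output path.
        Files.write(Paths.get("report.txt"),
                    format(cells).getBytes(StandardCharsets.UTF_8));
    }
}
```

Flattening first means the pairing works whether asText() returns one row per title/status pair or a single long row.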