How to create a text file from html

awking00
I have some html files that show the difference between two different schema definitions (xsds). The attached Word file is an excerpt of a sample html file and depicts how it looks. The attached .txt file shows the underlying html source code for that excerpt. What I would like to do is write a Java program that would create a new text file in the following format:
Line 1: OLD line(s): 235 [EMPTY] [becomes] New line(s): 236,246 [DATA] name=OtherSupportSumAmt
Line 2: OLD line(s): 1759 [DATA] name=OrganizationTypeDesc [becomes] New line(s): 1770 [DATA] name=OrganizationTypeCd
Line 3: OLD line(s): 1763 [N/A] [becomes] New line(s): 1774
Basically, I want
1) the old and new information on the same line of text with [becomes] (or some other delimiter) in between.
2) the word [DATA] where the substring "name=" exists followed by name=[whatever is in the quotes that follow].
3) the word [EMPTY] or [NULL] if no data follows the line(s) number.
4) the phrase "not applicable" or N/A where the substring "name=" does not exist.
In other words, one new line of text for each pair of old and new html data.
DiffWebpage.docx
DiffHTML.txt
Top Expert 2007

Commented:
What have you tried so far?  Post what you have and tell us what isn't working.
awking00, Information Technology Specialist

Author

Commented:
I really haven't tried anything yet, as I don't know where to begin. Perhaps I should start by asking, "Can I create one long string of text from the html?" Then I can figure out how I might parse that string.
Top Expert 2007
Commented:
Yes. Read the html file in as text and you can search it. You probably want to look for the opening tag on each line and the closing tag, then use the text between the two. If you can rely on the format of the html file, that will be enough.
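
A minimal sketch of that search, using only the standard library. The tag names (`<td>`, `</td>`) and the sample string are assumptions for illustration, since the real markup is in the attachment; the class and method names are hypothetical. It also assumes the tags are not nested and carry no attributes.

```java
import java.util.ArrayList;
import java.util.List;

class TagTextExtractor {
    // Collects the text found between each openTag ... closeTag pair.
    // Assumption: tags are written exactly as given and are not nested.
    static List<String> textBetween(String html, String openTag, String closeTag) {
        List<String> results = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(openTag, from);
            if (start < 0) break;                      // no more opening tags
            int contentStart = start + openTag.length();
            int end = html.indexOf(closeTag, contentStart);
            if (end < 0) break;                        // unmatched opening tag
            results.add(html.substring(contentStart, end).trim());
            from = end + closeTag.length();
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real file could be read into a String first.
        String html = "<tr><td>235</td><td> name=\"OtherSupportSumAmt\" </td></tr>";
        for (String cell : textBetween(html, "<td>", "</td>")) {
            System.out.println(cell);
        }
    }
}
```

On Java 11 or later, the whole file can be loaded into one string with `Files.readString(Path.of(...))` before calling `textBetween`.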

You could also use an open source package to read the html tags once you have read in the file. For example, something like jsoup lets you go to a specific html tag and get the text between the opening and closing of that tag.
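
A rough sketch of the jsoup route, assuming the jsoup jar is on the classpath. The `td` selector and the sample table are assumptions, not taken from the attached file; the class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class JsoupCellReader {
    // Parses the html and returns the visible text of every <td> element.
    // Assumption: the data of interest lives in table cells.
    static List<String> cellTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element cell : doc.select("td")) {
            texts.add(cell.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the real diff page.
        String html = "<table><tr><td>235</td>"
                    + "<td>name=\"OtherSupportSumAmt\"</td></tr></table>";
        for (String text : cellTexts(html)) {
            System.out.println(text);
        }
    }
}
```

The selector string is the part to adjust once you know which tag (or class) wraps the old and new columns in the real page.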
awking00Information Technology Specialist

Author

Commented:
I just downloaded the jsoup libraries and that's pretty much what I was looking for. I haven't completed all the parsing I need to do yet, but that's a fairly straightforward exercise. Thanks a lot.
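
For the remaining step, here is one possible sketch of building an output line in the format described in the question. It rests on one reading of the rules, which is an assumption: blank content becomes [EMPTY], content with name="..." becomes [DATA] name=..., and anything else becomes [N/A]. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiffLineFormatter {
    // Matches name="value" and captures the value inside the quotes.
    private static final Pattern NAME =
            Pattern.compile("name\\s*=\\s*\"([^\"]*)\"");

    // Maps the text that follows the line number(s) to a label.
    static String describe(String content) {
        if (content == null || content.trim().isEmpty()) {
            return "[EMPTY]";
        }
        Matcher m = NAME.matcher(content);
        return m.find() ? "[DATA] name=" + m.group(1) : "[N/A]";
    }

    // Joins the old and new sides with the [becomes] delimiter.
    static String format(String oldLines, String oldContent,
                         String newLines, String newContent) {
        return "OLD line(s): " + oldLines + " " + describe(oldContent)
             + " [becomes] New line(s): " + newLines + " " + describe(newContent);
    }

    public static void main(String[] args) {
        System.out.println(format("235", "", "236,246",
                "<xsd:element name=\"OtherSupportSumAmt\"/>"));
    }
}
```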