awking00

How to create a text file from html

I have some HTML files that show the differences between two schema definitions (XSDs). The attached Word file is an excerpt of a sample HTML file and shows how it looks. The attached .txt file contains the underlying HTML source for that excerpt. What I would like to do is write a Java program that creates a new text file in the following format:
Line 1 "OLD line(s): 235 [EMPTY] [becomes] New line(s): 236,246 [DATA] name=OtherSupportSumAmt"
Line 2 "OLD line(s): 1759 [DATA] name=OrganizationTypeDesc [becomes] New line(s): 1770 [DATA] name=OrganizationTypeCd"
Line 3 "OLD line(s): 1763 [N/A] [becomes] New line(s): 1774"
Basically, I want:
1) the old and new information on the same line of text, with [becomes] (or some other delimiter) in between;
2) the word [DATA] where the substring "name=" exists, followed by name=[whatever is in the quotes that follow];
3) the word [EMPTY] or [NULL] if no data follows the line number(s);
4) the phrase "not applicable" or N/A where the substring "name=" does not exist.
In other words, one new line of text for each old/new pair of HTML data.
DiffWebpage.docx
DiffHTML.txt
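
The four requirements above could be sketched in Java roughly as follows. This is only a minimal illustration, not a complete solution: the class and method names (`DiffLineFormatter`, `describe`, `formatRow`) are hypothetical, and it assumes the old and new line numbers and their content have already been pulled out of the HTML as plain strings. It also always prints a marker for the new side, which differs slightly from sample Line 3 above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: formats one old/new pair as a single output line.
class DiffLineFormatter {
    // Matches name="value" (optional spaces around '=') and captures the quoted value.
    private static final Pattern NAME_ATTR =
            Pattern.compile("name\\s*=\\s*\"([^\"]*)\"");

    // Returns [EMPTY] for blank content, [DATA] name=... when a name attribute
    // is present, and [N/A] for any other non-blank content.
    static String describe(String content) {
        if (content == null || content.trim().isEmpty()) {
            return "[EMPTY]";
        }
        Matcher m = NAME_ATTR.matcher(content);
        return m.find() ? "[DATA] name=" + m.group(1) : "[N/A]";
    }

    // Joins the old and new sides with the [becomes] delimiter.
    static String formatRow(String oldLines, String oldContent,
                            String newLines, String newContent) {
        return "OLD line(s): " + oldLines + " " + describe(oldContent)
                + " [becomes] New line(s): " + newLines + " " + describe(newContent);
    }

    public static void main(String[] args) {
        // Inputs here are made-up stand-ins for data extracted from the HTML.
        System.out.println(formatRow("235", "",
                "236,246", "<xs:element name=\"OtherSupportSumAmt\"/>"));
        System.out.println(formatRow("1759", "name=\"OrganizationTypeDesc\"",
                "1770", "name=\"OrganizationTypeCd\""));
    }
}
```

Keeping the formatting separate from the HTML extraction means the output format can be adjusted (for example, a different delimiter) without touching the parsing step.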
mrcoffee365

What have you tried so far?  Post what you have and tell us what isn't working.

ASKER

I really haven't tried anything yet, as I don't know where to begin. Perhaps I should start by asking, "Can I create one long string of text from the HTML?" Then I can figure out how to parse that string.
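
Reading the HTML into one long string is possible with the standard library alone. The sketch below is only a rough starting point under stated assumptions: the class and method names (`HtmlToText`, `readFile`, `stripTags`) are hypothetical, the file is assumed to be UTF-8, and the tag stripping is a crude regex that leaves entities such as `&quot;` untouched. A real HTML parser would be more robust than this.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: loads an HTML file and reduces it to plain text.
class HtmlToText {
    // Reads the whole file into a single string (assumes UTF-8 encoding).
    static String readFile(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    // Crude tag removal: replaces anything between '<' and '>' with a space,
    // then collapses runs of whitespace. HTML entities are not decoded.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the attachment name from the question if no path is given.
        String html = readFile(args.length > 0 ? args[0] : "DiffHTML.txt");
        System.out.println(stripTags(html));
    }
}
```

Note that stripping tags also throws away the table structure that separates the old side from the new side, so one long string is probably easier to read than to parse.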
ASKER CERTIFIED SOLUTION
mrcoffee365

[Solution text is hidden behind a membership paywall in the source.]
I just downloaded the jsoup library, and that's pretty much what I was looking for. I haven't completed all the parsing I need to do yet, but that's a fairly straightforward exercise. Thanks a lot.