How to create a text file from HTML

I have some HTML files that show the differences between two schema definitions (XSDs). The attached Word file is an excerpt of a sample HTML file and shows how it renders. The attached .txt file contains the underlying HTML source for that excerpt. I would like to write a Java program that creates a new text file in the following format:
Line 1: OLD line(s): 235 [EMPTY] [becomes] New line(s): 236,246 [DATA] name=OtherSupportSumAmt
Line 2: OLD line(s): 1759 [DATA] name=OrganizationTypeDesc [becomes] New line(s): 1770 [DATA] name=OrganizationTypeCd
Line 3: OLD line(s): 1763 [N/A] [becomes] New line(s): 1774
Basically, I want
1) the old and new information on the same line of text, with [becomes] (or some other delimiter) in between.
2) the word [DATA] where the substring "name=" exists, followed by name=[whatever is in the quotes that follow].
3) the word [EMPTY] or [NULL] if no data follows the line number(s).
4) the phrase "not applicable" or N/A where the substring "name=" does not exist.
In other words, one new line of text for each pair of old and new HTML data.
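As a rough illustration of rules 1 through 4, here is a minimal sketch of the formatting step only. It assumes the line numbers and the text that follows them have already been pulled out of the HTML as plain strings. The class name `DiffLineFormatter` and the methods `describe` and `formatPair` are made up for this example, and it always prints one of the three markers on each side, which is a guess (sample Line 3 shows nothing after the new line number).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DiffLineFormatter {
    // Matches name="something" (optional spaces around '=') and captures the quoted value.
    private static final Pattern NAME_ATTR =
            Pattern.compile("name\\s*=\\s*\"([^\"]*)\"");

    // Builds one side of the output: the line numbers plus an [EMPTY], [DATA] or [N/A] marker.
    public static String describe(String lineNumbers, String content) {
        String text = (content == null) ? "" : content.trim();
        if (text.isEmpty()) {
            return lineNumbers + " [EMPTY]";                 // rule 3: nothing follows the line number(s)
        }
        Matcher m = NAME_ATTR.matcher(text);
        if (m.find()) {
            return lineNumbers + " [DATA] name=" + m.group(1); // rule 2: name="..." is present
        }
        return lineNumbers + " [N/A]";                       // rule 4: data without name=
    }

    // Rule 1: old and new information on one line, separated by [becomes].
    public static String formatPair(String oldLines, String oldContent,
                                    String newLines, String newContent) {
        return "OLD line(s): " + describe(oldLines, oldContent)
                + " [becomes] New line(s): " + describe(newLines, newContent);
    }

    public static void main(String[] args) {
        // Hypothetical input; the real content would come from the parsed HTML.
        System.out.println(formatPair("235", "",
                "236,246", "<xsd:element name=\"OtherSupportSumAmt\"/>"));
    }
}
```

Each string returned by `formatPair` could then be written to the output file, for example with a `java.io.PrintWriter`.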
DiffWebpage.docx
DiffHTML.txt
awking00 Asked:
mrcoffee365 Commented:
What have you tried so far?  Post what you have and tell us what isn't working.
awking00 (Author) Commented:
I really haven't tried anything yet, as I don't know where to begin. Perhaps I should start by asking, "Can I create one long string of text from the HTML?" Then I can figure out how to parse that string.
mrcoffee365 Commented:
Yes. Read the HTML file in as text and you can search it. You will probably want to find the opening tag and the closing tag on each line, then use the text between the two. If you can rely on the format of the HTML file, that will be enough.
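The search described above could be sketched roughly like this, using only `String.indexOf`. The tag name `td`, the sample markup, and the class and method names are assumptions for illustration, since the real structure is in the attached file. It also assumes the tags are well formed and not nested inside tags of the same name.

```java
import java.util.ArrayList;
import java.util.List;

public class TagTextExtractor {
    // Collects the text between each <tagName ...> and the next </tagName>.
    // Simple prefix match: "<td" would also match a tag such as "<tdx", so this
    // relies on the HTML having a predictable format.
    public static List<String> textBetween(String html, String tagName) {
        List<String> results = new ArrayList<>();
        String open = "<" + tagName;
        String close = "</" + tagName + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);          // beginning of the opening tag
            if (start < 0) break;
            int contentStart = html.indexOf('>', start);  // end of the opening tag
            if (contentStart < 0) break;
            contentStart++;
            int end = html.indexOf(close, contentStart);  // matching closing tag
            if (end < 0) break;
            results.add(html.substring(contentStart, end).trim());
            pos = end + close.length();
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical markup; the real diff page may use different tags.
        String sample = "<tr><td class=\"old\">1759 name=\"OrganizationTypeDesc\"</td>"
                + "<td class=\"new\">1770 name=\"OrganizationTypeCd\"</td></tr>";
        for (String cell : textBetween(sample, "td")) {
            System.out.println(cell);
        }
    }
}
```

To run this against a real file, the whole file can first be loaded into one string, for example with `new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)`, where `path` is the location of the HTML file.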

You can also use an open source package to parse the HTML tags once you have read in the file, which you may prefer. For example, you could use something like jsoup to select your specific HTML tag and get the text between its opening and closing tags.
awking00 (Author) Commented:
I just downloaded the jsoup library, and it's pretty much what I was looking for. I haven't completed all the parsing I need to do yet, but that's a fairly straightforward exercise. Thanks a lot.