dkim18

asked on

help with parsing html?

I am trying to parse this file.
It is not well-formed XML, so I might have to read it in as a string and parse it myself.

After each header, there will be one or more items.
For example, under the Alias header there might be one or more aliases.

I just want to read all the aliases and save them.
Then read the next header's values (Schools), grab all the schools, and so on.
The items are not listed one per line.

How can I read them and break them up per header?
Alias = Doe, John and Doe, John S
School = Univer of xx , Univer of xx
Phones = 32333-,23233, etc.

<data mark="A" header="true">Alias</data>
<data mark="U">Doe, John</data>
<data mark="U">Doe, John S.</data>
<data mark="A" header="true">Schools</data>
<data mark="U">University of Utah, BS, 2012</data>
<data mark="U">University of Miami, MS, 2014</data>

<data mark="A" header="true">Phones</data>
<data mark="U">122-122-12212</data>
<

CEHJ

That doesn't really look like it has anything to do with HTML at all. Can you attach an example file, please?
dkim18

ASKER

I checked my XML data file against an XML validator and it is not well formed,
so I need to parse it as a string.

I was tokenizing it with StringTokenizer on the </data> tag.

I am wondering if you can help me find a smarter way to parse it and group the items together.
dkim18

ASKER

I take it back; it is not parsing correctly.
I was hoping to get

<data mark="A" header="true">Alias,
<data mark="U">Doe, John, and so on, but it is not splitting correctly.


StringTokenizer stringTokenizer = new StringTokenizer(revText, "<data>");
while (stringTokenizer.hasMoreElements()) {
    String tmp = stringTokenizer.nextElement().toString();
    System.out.println(tmp);
}


dkim18

ASKER

I need some sort of logic that finds a <data> tag, reads until the matching </data> tag, and keeps going until there are no more <data> tags.
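
One way to sketch that find-and-read loop is with java.util.regex: a Pattern matches each <data ...>...</data> element, and a Matcher walks the text until no more matches are found. This is only an illustration, not the accepted answer from this thread. The class name DataTagGrouper and the method group are made up for the example, and it assumes that every header element carries header="true", that values always follow their header, and that <data> elements are never nested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups <data> values under the most recent header element.
final class DataTagGrouper {
    // Group 1 = the tag's attributes, group 2 = the text between <data ...> and </data>.
    private static final Pattern DATA =
            Pattern.compile("<data\\b([^>]*)>(.*?)</data>", Pattern.DOTALL);

    static Map<String, List<String>> group(String text) {
        // LinkedHashMap keeps the headers in the order they appear in the input.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> current = null;
        Matcher m = DATA.matcher(text);
        while (m.find()) {                       // loops until there are no more <data> elements
            String attrs = m.group(1);
            String value = m.group(2).trim();
            if (attrs.contains("header=\"true\"")) {
                // Assumption: header="true" marks the start of a new group.
                current = groups.computeIfAbsent(value, k -> new ArrayList<>());
            } else if (current != null) {
                current.add(value);              // value belongs to the latest header
            }
            // Values that appear before any header are skipped.
        }
        return groups;
    }

    public static void main(String[] args) {
        String revText =
                "<data mark=\"A\" header=\"true\">Alias</data>\n"
              + "<data mark=\"U\">Doe, John</data>\n"
              + "<data mark=\"U\">Doe, John S.</data>\n"
              + "<data mark=\"A\" header=\"true\">Schools</data>\n"
              + "<data mark=\"U\">University of Utah, BS, 2012</data>\n"
              + "<data mark=\"U\">University of Miami, MS, 2014</data>\n";
        group(revText).forEach((header, items) ->
                System.out.println(header + " = " + String.join(" | ", items)));
    }
}
```

Because the pattern only matches complete elements, stray fragments such as a dangling "<" at the end of the input are ignored rather than causing an error. The values are joined with " | " when printed because the values themselves contain commas.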
dkim18

ASKER

I see what my problem is:


StringTokenizer stringTokenizer = new StringTokenizer(revText, "<data>");

How would I split on the whole tag as a single word, rather than on its individual characters?
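
For context, StringTokenizer treats every character in its delimiter string as a separate delimiter, so "<data>" splits on '<', 'd', 'a', 't' and '>' individually. String.split takes a regular expression instead, so a quoted literal is matched as one unit. The sketch below splits on the closing tag; the class name WholeTagSplit and the method splitOnClosingTag are invented for this example, and it assumes the closing tag is always written exactly as </data>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits text on the literal string "</data>".
final class WholeTagSplit {
    static List<String> splitOnClosingTag(String text) {
        List<String> pieces = new ArrayList<>();
        // Pattern.quote makes the tag a literal, so the whole string is the delimiter.
        for (String piece : text.split(Pattern.quote("</data>"))) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {            // drop blank pieces between elements
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        String revText =
                "<data mark=\"A\" header=\"true\">Alias</data>\n"
              + "<data mark=\"U\">Doe, John</data>\n";
        for (String piece : splitOnClosingTag(revText)) {
            System.out.println(piece);
        }
    }
}
```

Each piece still starts with its opening <data ...> tag, which is the shape hoped for earlier in the thread; the attributes and the text would need a second step to separate.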
SOLUTION
CEHJ
ASKER CERTIFIED SOLUTION