sasidhar1229
asked on
Java regex parse html tags
Hi,
I need to get data between two tags.
For example :
<div>something<div>second div starts within first div</div>something</div>
I need to get the data inside the first div, like this
something<div>second div starts within first div</div>something
not like this
<div>something<div>second div starts within first div
You would be better off trying HtmlAgilityPack (I am not sure whether it exists for Java or not).
It basically converts the HTML document into an XML document.
Then you can easily parse the XML document to do whatever you want.
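To illustrate the "treat it as an XML document" idea with nothing but the JDK's built-in XML APIs, here is a minimal sketch. It assumes the input is already well-formed XML, which the example in the question happens to be; real-world HTML would first need a tidying parser in front of it. The class and method names (`InnerMarkup`, `firstDivContents`, `innerXml`) are made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the markup is well-formed XML.
class InnerMarkup {

    // Serializes every child of 'parent' back to markup, without an XML declaration.
    static String innerXml(Node parent) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            t.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    // Returns the markup between the first <div> (in document order) and its own end tag.
    static String firstDivContents(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Node firstDiv = doc.getElementsByTagName("div").item(0);
            return firstDiv == null ? null : innerXml(firstDiv);
        } catch (Exception e) {
            // Malformed input ends up here; a real program would handle it properly.
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstDivContents(
                "<div>something<div>second div starts within first div</div>something</div>"));
    }
}
```

Because the parser builds a tree, the nested div is simply a child of the outer one, so the "which closing tag belongs to which opening tag" problem never comes up.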
ASKER
ok
<div class="s">
<div class="f kv">
<cite>
xxxxxxxxxxxxxxxxx
</cite>
<span class="vshid">
xxxxxx
</span>
</div>
<div class="esc slp" id="poS15" style="display:none">xxxx</div>
<span class="st">
<span class="f">second span</span>xxxxxxxx
</span>
</div>
I need to get data between <span class="st"> and this span's end tag
<span class="st">
<span class="f">second span</span>xxxxxxxx
</span>
i.e. I need to get this string
<span class="f">second span</span>xxxxxxxx
And the contents inside <span class="st"> can be pretty much ANYTHING?
If so, then this is very hard (if not impossible) to do with regular expressions alone. In that case, along the lines of what umartopia said, you should really be parsing this in a structured way. Use something like TagSoup to parse the content, then access it via XPath, SAX handlers, DOM methods, etc.
If you know that the contents will only ever follow a certain pattern, or a small number of patterns, then you may be able to write a regex, but it will break as soon as you get an HTML input file that is slightly different from what you expect.
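As a rough sketch of the XPath route, the following uses only the JDK's `javax.xml.xpath` API and assumes the page has already been turned into well-formed XML (for instance by a tidying parser such as TagSoup, which is not shown here). The class name `SpanExtractor` and the method `contentsOfStSpan` are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is well-formed XML.
class SpanExtractor {

    // Returns the serialized children of every <span class="st">, concatenated.
    static String contentsOfStSpan(String xhtml) {
        try {
            // Select all child nodes (elements and text) of the target span.
            NodeList children = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//span[@class='st']/node()",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (int i = 0; i < children.getLength(); i++) {
                t.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new IllegalStateException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div class=\"s\"><div class=\"f kv\"><cite>x</cite></div>"
                + "<span class=\"st\">\n<span class=\"f\">second span</span>xxxxxxxx\n</span></div>";
        System.out.println(contentsOfStSpan(page).trim());
    }
}
```

The predicate `[@class='st']` matches the attribute value exactly; a span with several classes would need a different expression.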
Now I know that the answer to the above is NO. But with the information you have given us, we can't really give you anything better. You need to provide the FULL and EXACT input that you will be searching over, and the expected result. For example, what if the above is contained within other <div>s? Is this one special in any way? What makes the first <div> more special than the second <div>? Is it possible that you might have <div>, <div > and/or <div someAttribute="blah">?
Need more info...
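If the input really is limited to plain `<div>`, `<div >` and `<div someAttribute="blah">` variants, one possible workaround is to use a regex only to find the tags and keep a nesting counter in ordinary code. The sketch below is an assumption-laden illustration, not a parser: it ignores comments, scripts, self-closing `<div/>`, and attribute values that contain `>`. The names `NestedTagScanner` and `firstDivContents` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex finds the tags, a counter tracks the nesting depth.
class NestedTagScanner {

    // Matches opening tags like <div>, <div > or <div a="b"> and closing tags like </div>.
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the text between the first opening div and its matching end tag,
    // or null if no balanced pair is found.
    static String firstDivContents(String html) {
        Matcher m = DIV_TAG.matcher(html);
        int depth = 0;
        int contentStart = -1;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    contentStart = m.end(); // content begins right after the outer opening tag
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    return html.substring(contentStart, m.start());
                }
            }
        }
        return null; // unbalanced or no div at all
    }

    public static void main(String[] args) {
        System.out.println(firstDivContents(
                "<div>something<div>second div starts within first div</div>something</div>"));
    }
}
```

The counter is what a single regular expression cannot express: the outer div only ends when as many closing tags as opening tags have been seen.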