Java regex parse html tags

Hi,

I need to get data between two tags.

For example:

<div>something<div>second div starts within first div</div>something</div>

I need to get the content of the first (outer) div, like this:

something<div>second div starts within first div</div>something

not like this:

<div>something<div>second div starts within first div

mccarl

Can't you just strip the first 5 characters and the last 6 characters of the string?

Now, I know the answer to that is NO, but with the information you have given us we can't really offer anything better. You need to provide the FULL and EXACT input that you will be searching over, and the expected result. For example, what if the above is contained within other <div>s? Is this one special in any way? What makes the first <div> more special than the second <div>? Is it possible that you might have <div>, <div > and/or <div someAttribute="blah">?

Need more info...

Umar Topia

You'd be better off trying HtmlAgilityPack (I'm not sure whether it exists for Java).

It basically converts the HTML document into an XML document.

Then you can easily parse the XML document and do whatever you want with it.

sasidhar1229

ASKER

ok

<div class="s">
<div class="f kv">
<cite>
xxxxxxxxxxxxxxxxx
</cite>
<span class="vshid">
xxxxxx
</span>
</div>
<div class="esc slp" id="poS15" style="display:none">xxxx</div>
<span class="st">
<span class="f">second span</span>xxxxxxxx
</span>
</div>

I need to get the data between <span class="st"> and this span's end tag:

<span class="st">
<span class="f">second span</span>xxxxxxxx
</span>

i.e. I need to get this string:

<span class="f">second span</span>xxxxxxxx

mccarl

And the contents inside <span class="st"> can be pretty much ANYTHING?

If so, then this is very hard (if not impossible) to do with regular expressions alone. In that case, along the lines of what Umar Topia said, you should really be parsing this in a structured way. Use something like TagSoup to parse the content and access it via XPath, SAX handlers, DOM methods, etc.
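
As a rough illustration of the structured approach, here is a minimal sketch that uses only the JDK's built-in XML parser and XPath. It assumes the input is already well-formed XML (the sample posted above happens to be); for real-world HTML you would first run it through a tidying parser such as TagSoup. The class name SpanExtractor and the method innerMarkup are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the thread: assumes well-formed XML input.
class SpanExtractor {

    /**
     * Returns the serialized children of the first node matching the XPath
     * expression, or null if nothing matches.
     */
    static String innerMarkup(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Node target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            if (target == null) {
                return null;
            }

            // Identity transformer, used here only to turn DOM nodes back into text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            // Serialize each child in turn so the enclosing tag itself is left out.
            for (Node child = target.getFirstChild(); child != null; child = child.getNextSibling()) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }
}
```

Calling SpanExtractor.innerMarkup(html, "//span[@class='st']") on the sample above should give the inner span plus the trailing text, surrounded by the original line breaks, so trim() the result if you don't want them. Nesting is handled by the parser, so no tag counting is needed.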

If you know that the contents will only ever follow a certain pattern, or a certain (small) number of patterns, then you may be able to write some regex, but if you then get an HTML input file that is slightly different from what you expect, it will break.
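
To make the regex-only route concrete: a single lazy pattern such as <div>(.*?)</div> stops at the first closing tag, which is exactly the unwanted result from the original question. One workaround, sketched below under the assumption that the markup is simple (no comments, scripts, or self-closing tags with the same name), is to use regex only to find tags and keep a depth counter yourself. The class name InnerContent and the method between are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: a depth-counting scan over tags.
class InnerContent {

    /**
     * Returns the text between the first tag matching openTagRegex and the
     * close tag that balances it, or null if there is no balanced match.
     */
    static String between(String html, String tagName, String openTagRegex) {
        Matcher open = Pattern.compile(openTagRegex).matcher(html);
        if (!open.find()) {
            return null;
        }
        int contentStart = open.end();

        // Matches any opening or closing tag with this name, e.g. <div class="x"> or </div>.
        // Group 1 is "/" for a closing tag and empty for an opening tag.
        Pattern anyTag = Pattern.compile("<(/?)" + Pattern.quote(tagName) + "\\b[^>]*>");
        Matcher tag = anyTag.matcher(html).region(contentStart, html.length());

        int depth = 1; // we are already inside the first opening tag
        while (tag.find()) {
            if (tag.group(1).isEmpty()) {
                depth++; // nested opening tag
            } else if (--depth == 0) {
                // This close tag balances the first opening tag.
                return html.substring(contentStart, tag.start());
            }
        }
        return null; // ran out of input before the tags balanced
    }
}
```

For the original example, InnerContent.between(html, "div", "<div>") returns something<div>second div starts within first div</div>something. As said above, this kind of code breaks as soon as the input deviates from what it expects, so treat it as a stopgap and prefer a real parser.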

Umar Topia

Various scrapers:

http://stackoverflow.com/questions/2861/options-for-html-scraping

ASKER CERTIFIED SOLUTION

kaufmed
