sasidhar1229 asked:

Java regex parse html tags

Hi,

I need to get data between two tags.

For example:

<div>something<div>second div starts within first div</div>something</div>

I need to get the data inside the first div, like this:

something<div>second div starts within first div</div>something

not like this:

<div>something<div>second div starts within first div
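
The truncated result above is what a reluctant (non-greedy) quantifier produces, since it stops at the first closing tag it meets. The pattern actually used is not shown in the question, so the following is only a sketch assuming something along the lines of `<div>(.*?)</div>`; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div>something<div>second div starts within first div</div>something</div>";

        // Reluctant: stops at the FIRST </div>, so the inner div is cut off.
        System.out.println(firstGroup("<div>(.*?)</div>", html));

        // Greedy: runs to the LAST </div> in the whole input.
        System.out.println(firstGroup("<div>(.*)</div>", html));
    }
}
```

The greedy form only gives the wanted string here because the input holds exactly one outer div. With two sibling divs it would swallow everything from the first opening tag to the last closing tag, so neither quantifier handles nesting in general.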
mccarl

Can't you just strip the first 5 characters and the last 6 characters of the string?

Now, I know that the answer to the above is NO, but with the information you have given us, we can't really offer anything better. You need to provide the FULL and EXACT input that you will be searching over, and the expected result. For example, what if the above is contained within other <div>s? Is this one special in any way? What makes the first <div> more special than the second <div>? Is it possible that you might have <div>, <div     > and/or <div  someAttribute="blah">?

Need more info...
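
To illustrate why the opening-tag variants matter: a literal `<div>` in a pattern misses the padded and attributed forms. A minimal sketch of a more tolerant opening-tag pattern follows, assuming no attribute value ever contains a literal `>`; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class OpeningDivTag {
    // \b keeps "<divider>" from matching; [^>]* absorbs whitespace and attributes.
    // Assumption: no attribute value contains a literal '>'.
    static final Pattern OPEN_DIV =
            Pattern.compile("<div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // True when the whole string is a single opening div tag.
    static boolean isOpeningDiv(String tag) {
        return OPEN_DIV.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div>", "<div     >", "<div  someAttribute=\"blah\">", "<divider>", "</div>"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isOpeningDiv(s));
        }
    }
}
```

This only recognises the opening tag. It says nothing about which closing tag belongs to it, which is the harder part of the question.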
You would be better off trying HtmlAgilityPack (I am not sure whether it exists for Java or not).

It basically converts the HTML document into an XML document.

Then you can easily parse the XML document to do whatever you want.
sasidhar1229 (ASKER)

ok

<div class="s">
            <div class="f kv">
                <cite>
                    xxxxxxxxxxxxxxxxx
                </cite>
                <span class="vshid">
                    xxxxxx
                </span>
            </div>
            <div class="esc slp" id="poS15" style="display:none">xxxx</div>
            <span class="st">
                <span class="f">second span</span>xxxxxxxx
            </span>
        </div>

I need to get the data between <span class="st"> and this span's end tag:

 <span class="st">
                <span class="f">second span</span>xxxxxxxx
 </span>

i.e. I need to get this string:

 <span class="f">second span</span>xxxxxxxx
And the contents inside <span class="st"> can be pretty much ANYTHING?

If so, then this is very hard (if not impossible) to do with regular expressions alone. In that case, along the lines of what umartopia said, you should really be parsing this in a structured way. Use something like TagSoup to parse the content, then access it via XPath, SAX handlers, DOM methods, etc.
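
As a rough idea of what the structured route looks like, here is a sketch using only the JDK's DOM and XPath classes. It assumes the markup is already well-formed XML (real-world HTML often is not, which is where a tidying parser such as TagSoup would come in; that step is not shown). The class name, method name and XPath expression are illustrative choices, not something from the thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SpanContentsViaXPath {
    // Serialises the children of the first node selected by xpathExpr.
    // Returns null when the expression selects nothing.
    // Assumption: xml is well-formed; malformed input raises IllegalArgumentException.
    static String innerMarkup(String xml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODE);
            if (target == null) {
                return null;
            }
            // Identity transformer, used here purely as a DOM-to-text serialiser.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            NodeList children = target.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                t.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class=\"s\"><span class=\"st\">"
                + "<span class=\"f\">second span</span>xxxxxxxx</span></div>";
        System.out.println(innerMarkup(xml, "//span[@class='st']"));
    }
}
```

Because the parser builds a tree, the nested span is simply a child node of the outer one, so there is no question of which closing tag matches which opening tag.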

If you know that the contents will only ever follow a certain pattern, or a small number of patterns, then you may be able to write a regex, but if you then get an HTML input file that is slightly different from what you expect, it will break.
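
A possible middle ground, when a full parser is not an option, is to use the regex only to locate tags and keep the nesting count in ordinary code. The sketch below is not a general HTML parser: it assumes tags are properly balanced, that no attribute value contains `>`, and that no comments or scripts contain tag-like text. All names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagExtractor {
    // Returns the markup between the first tag matching openRegex and its
    // matching close tag, or null when either one cannot be found.
    static String innerHtml(String html, String tagName, String openRegex) {
        Matcher open = Pattern.compile(openRegex).matcher(html);
        if (!open.find()) {
            return null;
        }
        int contentStart = open.end();

        // Group 1 is "/" for a closing tag; group 2 is "/" for a self-closing tag.
        Pattern anyTag = Pattern.compile(
                "<(/?)" + Pattern.quote(tagName) + "\\b[^>]*?(/?)>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = anyTag.matcher(html);

        int depth = 1; // we are inside the tag matched by openRegex
        boolean found = m.find(contentStart);
        while (found) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(2).isEmpty();
            if (closing) {
                depth--;
                if (depth == 0) {
                    return html.substring(contentStart, m.start());
                }
            } else if (!selfClosing) {
                depth++;
            }
            found = m.find();
        }
        return null; // unbalanced input: the matching close tag never appeared
    }

    public static void main(String[] args) {
        String html = "<div class=\"s\"><span class=\"st\">\n"
                + "<span class=\"f\">second span</span>xxxxxxxx\n"
                + "</span></div>";
        String inner = innerHtml(html, "span", "<span\\s+class=\"st\"\\s*>");
        System.out.println(inner == null ? "no match" : inner.trim());
    }
}
```

This still breaks on input that deviates from the stated assumptions, so it is best kept for input whose shape is known in advance.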
ASKER CERTIFIED SOLUTION
kaufmed