sasidhar1229

Java regex parse html tags

Hi,

I need to get data between two tags.

For example:

<div>something<div>second div starts within first div</div>something</div>

I need to get the data inside the first div, like this:

something<div>second div starts within first div</div>something

not like this:

<div>something<div>second div starts within first div
Regular Expressions, Java


8/22/2022 - Mon
mccarl

Can't you just strip the first 5 characters and the last 6 characters of the string?

Now, I know that the answer to the above is no, but with the information you have given us, we can't really offer anything better. You need to provide the FULL and EXACT input that you will be searching over, and the expected result. For example, what if the above is contained within other <div> elements? Is this one special in any way? What makes the first <div> more special than the second <div>? Is it possible that you might have <div>, <div     > and/or <div  someAttribute="blah">?

Need more info...
Umar Topia

You would be better off trying HtmlAgilityPack (I am not sure whether it exists for Java or not).

It basically converts the HTML document into an XML document.

Then you can easily parse the XML document to do whatever you want.
sasidhar1229

ASKER
ok

<div class="s">
            <div class="f kv">
                <cite>
                    xxxxxxxxxxxxxxxxx
                </cite>
                <span class="vshid">
                    xxxxxx
                </span>
            </div>
            <div class="esc slp" id="poS15" style="display:none">xxxx</div>
            <span class="st">
                <span class="f">second span</span>xxxxxxxx
            </span>
        </div>

I need to get data between <span class="st"> and this span's end tag

 <span class="st">
                <span class="f">second span</span>xxxxxxxx
 </span>

i.e. I need to get this string:

 <span class="f">second span</span>xxxxxxxx
mccarl

And the contents inside <span class="st"> can be pretty much ANYTHING?

If so, then this is very hard (if not impossible) to do with regular expressions alone. In that case, along the lines of what umartopia said, you should really be parsing this in a structured way. Use something like TagSoup to parse the content, and then access it via XPath, SAX handlers, DOM methods, etc.
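To illustrate the structured approach, here is a minimal sketch that uses only the XML APIs bundled with the JDK. It assumes the markup is already well-formed XML (for real-world HTML, a tidying parser such as TagSoup would have to run first; that step is not shown). The class name InnerMarkup, the method innerMarkup, and the XPath expression are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative only.
class InnerMarkup {

    // Returns the serialized children of the first node matched by the XPath
    // expression, or null if nothing matches.
    // Assumption: the input is well-formed XML, not arbitrary HTML.
    static String innerMarkup(String wellFormedMarkup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            Node target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
            if (target == null) {
                return null;
            }

            // Identity transformer used purely as a serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Write each child (text or element) in document order.
            StringWriter out = new StringWriter();
            for (Node child = target.getFirstChild(); child != null; child = child.getNextSibling()) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize the markup", e);
        }
    }

    public static void main(String[] args) {
        String sample =
                "<div class=\"s\">"
              + "<div class=\"esc slp\" id=\"poS15\" style=\"display:none\">xxxx</div>"
              + "<span class=\"st\">\n  <span class=\"f\">second span</span>xxxxxxxx\n</span>"
              + "</div>";
        System.out.println(innerMarkup(sample, "//span[@class='st']"));
    }
}
```

Because the parser builds a tree, the nested span needs no special handling: the children of the matched node are simply written back out in order.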

If you know that the contents will only ever follow a certain pattern, or a certain (small) number of patterns, then you may be able to write a regex for it. But if you then get an HTML input file that is slightly different from what you expect, it will break.
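For the restricted case, one possible sketch is to let a regex find only the tags and keep a nesting counter, so that the end tag that balances the chosen start tag is picked rather than the first end tag encountered. This is not a general HTML parser: it assumes no self-closing tags of the same name, no '>' inside attribute values, and no comments or scripts containing tag-like text. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class NestedTagExtractor {

    // Returns the text between the first match of openingTag and the closing
    // tag that balances it, or null if there is no match or the tags are
    // unbalanced.
    static String innerText(String html, String tagName, Pattern openingTag) {
        Matcher start = openingTag.matcher(html);
        if (!start.find()) {
            return null;
        }

        // Matches any opening or closing tag with this name, including
        // variants such as <div>, <div     > and <div someAttribute="blah">.
        // Limitation: a self-closing tag would be counted as an opening tag.
        Pattern anyTag = Pattern.compile(
                "<(/?)" + Pattern.quote(tagName) + "\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher tags = anyTag.matcher(html);

        int depth = 1; // we are inside the tag matched by openingTag
        boolean found = tags.find(start.end());
        while (found) {
            boolean closing = !tags.group(1).isEmpty();
            depth += closing ? -1 : 1;
            if (depth == 0) {
                return html.substring(start.end(), tags.start());
            }
            found = tags.find();
        }
        return null; // ran out of input before the tags balanced
    }

    public static void main(String[] args) {
        String first = "<div>something<div>second div starts within first div</div>something</div>";
        System.out.println(innerText(first, "div", Pattern.compile("<div\\b[^>]*>")));

        String second = "<span class=\"st\">\n  <span class=\"f\">second span</span>xxxxxxxx\n</span>";
        String inner = innerText(second, "span", Pattern.compile("<span\\s+class=\"st\"\\s*>"));
        System.out.println(inner == null ? null : inner.trim());
    }
}
```

The counter is what a single regular expression cannot express in Java's regex dialect, which is why a plain greedy or lazy pattern ends up stopping at the wrong end tag once the nesting varies.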
Umar Topia

ASKER CERTIFIED SOLUTION
kaufmed
