Java regex parse html tags

sasidhar1229
Hi,

I need to get data between two tags.

For example :

<div>something<div>second div starts within first div</div>something</div>

I need to get the data between first div like this

something<div>second div starts within first div</div>something

not like this

<div>something<div>second div starts within first div
mccarl, IT Business Systems Analyst / Software Developer
Top Expert 2015

Commented:
Can't you just strip the first 5 characters and the last 6 characters of the string??

Now I know that the answer to the above is NO. But with the information that you give us, we can't really give you anything better. You need to provide the FULL and EXACT input that you will be searching over, and the expected result. For example, what if the above is contained within other <div>'s, is this one special in any way? What makes the first <div> more special than the second <div>? Is it possible that you might have <div>, <div     > and/or <div  someAttribute="blah">?

Need more info...
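To illustrate why the exact input matters, here is a minimal sketch (the class and method names are made up for this example) that runs two plain regexes over the one-line sample from the question. The reluctant version reproduces the unwanted result, the greedy version happens to give the wanted one, and the greedy version then goes wrong as soon as a sibling <div> follows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class NestedDivDemo {

    // Returns group 1 of the first match, or null when nothing matches.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String nested =
            "<div>something<div>second div starts within first div</div>something</div>";

        // <div[^>]*> tolerates <div>, <div     > and <div someAttribute="blah">.
        String reluctant = "<div[^>]*>(.*?)</div>";
        String greedy = "<div[^>]*>(.*)</div>";

        // Reluctant stops at the FIRST closing tag:
        // something<div>second div starts within first div
        System.out.println(capture(reluctant, nested));

        // Greedy runs to the LAST closing tag:
        // something<div>second div starts within first div</div>something
        System.out.println(capture(greedy, nested));

        // Greedy breaks once a sibling div follows:
        // a</div><div>b
        System.out.println(capture(greedy, "<div>a</div><div>b</div>"));
    }
}
```

Neither pattern counts opening and closing tags, so neither can pair a <div> with its own end tag in general.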
Umar Topia, .Net Full Stack Developer

Commented:
You would be better off trying HtmlAgilityPack (I am not sure whether it exists for Java).

It basically converts the HTML document into an XML document.

Then you can easily parse the XML document and do whatever you want with it.

Author

Commented:
ok

<div class="s">
            <div class="f kv">
                <cite>
                    xxxxxxxxxxxxxxxxx
                </cite>
                <span class="vshid">
                    xxxxxx
                </span>
            </div>
            <div class="esc slp" id="poS15" style="display:none">xxxx</div>
            <span class="st">
                <span class="f">second span</span>xxxxxxxx
            </span>
        </div>

I need to get data between <span class="st"> and this span's end tag

 <span class="st">
                <span class="f">second span</span>xxxxxxxx
 </span>

i.e. need to get this string

 <span class="f">second span</span>xxxxxxxx
mccarl, IT Business Systems Analyst / Software Developer
Top Expert 2015

Commented:
And the contents inside <span class="st"> can be pretty much ANYTHING?

If so, then this is very hard (if not impossible) to do with regular expressions alone. In that case, along the lines of what umartopia said, you should really be parsing this in a structured way. Use something like TagSoup to parse the content, then access it via XPath, SAX handlers, DOM methods, etc.

If you know that the contents will only ever follow a certain pattern, or a certain (small) number of patterns, then you may be able to write some regex. But if you then get an HTML input file that is slightly different from what you expect, it will break.
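As a rough sketch of the structured approach, the example below uses only the JDK's built-in XML parser and XPath rather than TagSoup, so it assumes the fragment is already well-formed XML (the posted sample is). Real-world HTML usually is not, which is where a tolerant parser such as TagSoup would come in. The class name, method name, and XPath expression are illustrative choices, not anything from the thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes well-formed XML input.
class SpanInnerMarkup {

    // Returns the markup between the start and end tag of the first node
    // selected by the XPath expression, or null when nothing is selected.
    static String innerMarkup(String xhtml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Node target = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
        if (target == null) {
            return null;
        }

        // Identity transformer, used here only to serialize child elements.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        NodeList children = target.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Plain text is copied as-is (entities are not re-escaped).
                out.write(child.getNodeValue());
            } else {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<div class=\"s\">"
          + "<div class=\"esc slp\" id=\"poS15\" style=\"display:none\">xxxx</div>"
          + "<span class=\"st\"> <span class=\"f\">second span</span>xxxxxxxx </span>"
          + "</div>";
        System.out.println(innerMarkup(sample, "//span[@class='st']"));
    }
}
```

Because the parser tracks nesting itself, the contents of the span can be arbitrarily deep without the extraction logic changing.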
Glanced up at my screen and thought I had coded the Matrix...  Turns out, I just fell asleep on the keyboard.
Most Valuable Expert 2011
Top Expert 2015
Commented:
Try this:

Pattern p = Pattern.compile("<span\\s+class=\"st\">[^<]*(<span(?: [^>]*)?>[^<]+</span>)");


I'm going to assume you know how to use this in conjunction with the Matcher class. Once you execute the pattern, the text you are interested in will be in group 1.
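For anyone who has not used Matcher before, a minimal sketch follows. INNER_SPAN is the pattern from this comment, with the inner quotes escaped so that it compiles. INNER_SPAN_AND_TEXT is a hypothetical variant, not part of the original answer, that also keeps the text trailing the nested span, since the question asked for that as well. Both assume exactly one simple nested span with no further markup inside it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class InnerSpanRegexDemo {

    // Pattern from the comment: group 1 is the nested span only.
    static final Pattern INNER_SPAN = Pattern.compile(
        "<span\\s+class=\"st\">[^<]*(<span(?: [^>]*)?>[^<]+</span>)");

    // Assumed variant: group 1 also includes the text that follows the
    // nested span, up to (but excluding) whitespace before the outer </span>.
    static final Pattern INNER_SPAN_AND_TEXT = Pattern.compile(
        "<span\\s+class=\"st\">\\s*(<span(?: [^>]*)?>[^<]+</span>[^<]*?)\\s*</span>");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<span class=\"st\">\n"
                    + "    <span class=\"f\">second span</span>xxxxxxxx\n"
                    + "</span>";

        // <span class="f">second span</span>
        System.out.println(firstGroup(INNER_SPAN, html));

        // <span class="f">second span</span>xxxxxxxx
        System.out.println(firstGroup(INNER_SPAN_AND_TEXT, html));
    }
}
```

Either pattern will stop matching if the outer span gains extra attributes or the inner content contains more tags, which is the fragility mentioned earlier in the thread.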
