Extract specific data between <div> tags using RegEx in Visual Basic

Hello,

I would like a solution to extract specific data between <div> tags using RegEx in Visual Basic. Here are the requirements:

1. There is a chunk of text that occurs between the very first <div class="item text"> tags. But there is a pattern to it.

(a) The <div class="item text"> may contain child <div> tags in the following pattern
<div class="item text">
   <div class="attachtitle">  blah blah blah </div>
   <div class="attachcontent">blah blah blah </div>  
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  
 
In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

(b) The <div class="item text"> DOES NOT contain child tags
<div class="item text">
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  

In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

2. I want the text that shows up in between <div "..." class="downRow" .... ></div> tags.
For example:
<div "..." class="downRow" .... >B</div>

I want to extract B (just an example)

I have attached two files for testing. One has only one <div> tag, the other has two child <div> tags
TwoDivTags.txt
OneDivTag.txt
Asked by: Jay Balu

Jay Balu (Author) commented:
I would prefer a solution along the same lines as this one:

http://www.experts-exchange.com/Programming/Languages/Visual_Basic/Q_28568498.html

aikimark commented:
This pattern comes close to extracting the "item text" content, but you will need to confirm this with testing. It would have been helpful if you had included the text you expected in addition to the input text.
<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>

This pattern gets the downrow text:
<div .*? class="downRow".*?>\s*(.*?)\s*</div>
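
A quick way to sanity-check the downRow pattern is to run it against a small sample. The sketch below uses Python's re module purely as a test harness; the thread targets Visual Basic, so results under VB's regex engine should be confirmed separately. The sample markup and variable names are made up for illustration.

```python
import re

# downRow pattern quoted from the post above, unchanged.
DOWNROW = re.compile(r'<div .*? class="downRow".*?>\s*(.*?)\s*</div>')

# Made-up sample: another attribute precedes class, as in the question.
sample = '<div id="r1" class="downRow" style="x">B</div>'
match = DOWNROW.search(sample)
downrow_text = match.group(1) if match else None
print(downrow_text)

# Made-up sample where class is the first attribute.
class_first = DOWNROW.search('<div class="downRow">B</div>')
print(class_first)
```

Note that, as written, the pattern expects at least one attribute before class (the literal space in front of class has to match something after "<div "), so the second sample is not matched.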

This pattern might get both sets of text:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>

Remember that inside a VB string literal, each embedded double-quote character must be doubled. For example, the text class="downRow" is written as "class=""downRow""" in VB source.

Jay Balu (Author) commented:
Thanks, aikimark. I will test and report back shortly. For the two-<div> pattern, this is the text I am expecting:
 
Which of the following inequalities is an algebraic expression for the shaded part of the number line above?<br /><br />(A) |x| &lt;= 3 <br />(B) |x| &lt;= 5 <br />(C) |x - 2| &lt;= 3 <br />(D) |x - 1| &lt;= 4 <br />(E) |x +1| &lt;= 4																		

For one div pattern, this is the kind of text I am expecting (taken from the attached file):
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />

P.S. Please note that the text to be extracted will end at some <div> tag (that way we can make sure no characters are lost).

Jay Balu (Author) commented:
So ...

1. When I ran the first RegEx against an input that has the one-<div> pattern, it yielded almost the correct answer, except for some extra information. I was expecting the text only up to the next <div>, but it returned a lot more. If it takes a lot of tweaking, don't sweat it; I can use Mid to extract the info. Here is what I mean:

Instead of
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />

It gave me all of this
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br /><div style="border: 1px solid black; padding: 10px; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0);"><span style="font-weight: bold"><span style="color: #459926">Practice Questions</span><br />Question: 51<br />Page: 159<br />Difficulty: 600</span>

2. The second RegEx gives me exactly what I want. So we are good there.

3. The third one is a boo-boo. In fact, I was hoping that sucker would work. You see, my input files are generated from a set of URLs (remember your solution from yesterday, Q_28568498?). Each of those links opens an HTML page that has the text I want to extract. That "text of interest" may or may not be preceded by child <div> tags (one <div> or two), and I do not know which in advance. It'd help me immensely if there were ONE generic RegEx that works in both situations.

Hope I made sense!

aikimark commented:
That is because the question indicated you wanted the text up to the </div> tag.  I'll look at it in light of the feedback.
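
The over-capture can be reproduced on a small made-up sample. This is only a sketch using Python's re module as a harness (VB's regex engine may behave differently, for instance around line endings), and the sample markup is invented to resemble the output quoted earlier in the thread.

```python
import re

# First "item text" pattern from the thread, unchanged: it stops only at </div>.
FIRST = re.compile(r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>')

# Made-up one-div sample: the wanted text is followed by a nested <div>.
sample = (
    '<div class="item text">\n'
    'If y is an integer<br />(A) 1<br />'
    '<div style="border: 1px solid black;"><span>Difficulty: 600</span>\n'
    '</div>\n'
)

captured = FIRST.search(sample).group(1)
print(captured)
```

Because the capture group is only terminated by </div>, the nested <div ...> opening tag and its <span> end up inside the captured text.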

aikimark commented:
Note: the third pattern has two capture groups.  The first submatch will be the class value and the second submatch will be the text up to the </div>

aikimark commented:
Also, I only tested these patterns on the TwoDivTags.txt file.

aikimark commented:
Please try this pattern:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*(?:</div>|<div )
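
To see the two capture groups in action, here is a minimal sketch that runs this pattern in Python's re module (used only as a harness; behavior under VB's regex engine should be verified separately). The helper name extract and both samples are hypothetical, and each sample is searched on its own.

```python
import re

# Combined pattern from the post above, unchanged.
COMBINED = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?'
    r'\s*(\w.*?)\s*(?:</div>|<div )'
)

def extract(html):
    """Return (class value, text) for the first match, or None."""
    m = COMBINED.search(html)
    return (m.group(1), m.group(2)) if m else None

# Made-up samples; the real attachments are not reproduced here.
one_div = (
    '<div class="item text">\n'
    'If y is an integer<br />(A) 1<br />'
    '<div style="border: 1px solid black;">Practice</div>\n'
    '</div>\n'
)
down_row = '<div id="r1" class="downRow">B</div>'

item_result = extract(one_div)
down_result = extract(down_row)
print(item_result)
print(down_result)
```

The first group reports which class matched and the second holds the text, which now stops at either </div> or the next "<div ". Since the pattern uses greedy .* runs, the outcome depends on how the markup is split across lines, so it is worth testing against the real files.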

Terry Woods (IT Guru) commented:
I'd suggest some slight modifications to that pattern:

<div .*

should be
<div [^>]*

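to avoid matching a class attribute that belongs to some later, non-div tag: .* can run past the end of the <div> opening tag, whereas [^>]* cannot cross the closing >.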
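
The difference is easy to see on a made-up fragment where the class sits on a <span> inside the <div>. The sketch below uses Python's re module as a harness, and the names LOOSE and STRICT are illustrative only.

```python
import re

LOOSE = re.compile(r'<div .*class="downRow"')      # .* may cross tag boundaries
STRICT = re.compile(r'<div [^>]*class="downRow"')  # [^>]* stays inside the tag

# Made-up markup: the class is on a <span>, not on the <div>.
nested = '<div id="x"><span class="downRow">not a div</span></div>'
# Made-up markup: the class is on the <div> itself.
direct = '<div id="x" class="downRow">B</div>'

loose_hits_span = LOOSE.search(nested) is not None
strict_hits_span = STRICT.search(nested) is not None
strict_hits_div = STRICT.search(direct) is not None
print(loose_hits_span, strict_hits_span, strict_hits_div)
```

With these samples the loose form matches the span's class through the <div>, while the strict form matches only when the class is on the <div> itself.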

I haven't tested this, but I've made a few other subtle changes to try to ensure the result is always an expected one.
<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )
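
As a rough check, the sketch below compares this pattern with the previous one on a made-up sample laid out like case (a) in the question, with two child <div> tags on separate lines. It uses Python's re module as a harness, so behavior under VB's regex engine still needs to be confirmed; the names PREVIOUS, REVISED and first_groups are hypothetical.

```python
import re

PREVIOUS = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?'
    r'\s*(\w.*?)\s*(?:</div>|<div )'
)
REVISED = re.compile(
    r'<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*'
    r'\s*(\w.*?)\s*(?:</div>|<div )'
)

# Made-up sample following layout (a) from the question.
two_divs = (
    '<div class="item text">\n'
    '   <div class="attachtitle">  blah blah blah </div>\n'
    '   <div class="attachcontent">blah blah blah </div>\n'
    ' Which inequality?<br />(A) |x| &lt;= 3\n'
    '</div>\n'
)

def first_groups(pattern, html):
    """Return (class value, text) for the first match, or None."""
    m = pattern.search(html)
    return (m.group(1), m.group(2)) if m else None

previous_result = first_groups(PREVIOUS, two_divs)
revised_result = first_groups(REVISED, two_divs)
print(previous_result)
print(revised_result)
```

On this particular layout the earlier pattern finds nothing, because its optional group skips at most one child <div>, while the (?:\s*<div.*?</div>)* group in the revised pattern skips any number of them before the text is captured.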

Most points to @aikimark please, if you use my solution.

aikimark commented:
@Jay

Thanksgiving travel plans approach.  If you need help, please post before noon today (11/26).

Jay Balu (Author) commented:
My apologies for the late response. I have been traveling for the holidays and did not have a chance to work on this resolution. I will accept aikimark's resolution even though I don't have the time to test it at this time. I will come back if there are any questions. Thanks to Mr. Woods too for the response.