Jay Balu
 asked on

Extract specific data between <div> tags using RegEx in Visual Basic

Hello,

I would like a solution to extract specific data between <div> tags using RegEx in Visual Basic. Here are the requirements:

1. There is a chunk of text that occurs between the very first <div class="item text"> tags. But there is a pattern to it.

(a) The <div class="item text"> may contain child <div> tags in the following pattern
<div class="item text">
   <div class="attachtitle">  blah blah blah </div>
   <div class="attachcontent">blah blah blah </div>  
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  
 
In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

(b) The <div class="item text"> DOES NOT contain child tags
<div class="item text">
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  

In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

2. I want the text that shows up in between <div "..." class="downRow" .... ></div> tags.
For example,
<div "..." class="downRow" .... >B</div>

I want to extract B (just an example)

I have attached two files for testing. One has only one <div> tag, the other has two child <div> tags
TwoDivTags.txt
OneDivTag.txt
Visual Basic Classic, Programming, VB Script


8/22/2022 - Mon
Jay Balu

ASKER
I would prefer a solution along the same lines as this one:

https://www.experts-exchange.com/Programming/Languages/Visual_Basic/Q_28568498.html
aikimark

This pattern comes close to getting the downtext parse, but you will need to confirm this with testing.  It would have been helpful if you included the text you expected in addition to the input text.
<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>


This pattern gets the downrow text:
<div .*? class="downRow".*?>\s*(.*?)\s*</div>


This pattern might get both sets of text:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>


Remember that in a VB string literal, every quote character inside the pattern must be doubled (" becomes "").
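
As a rough check of the first two patterns, here is a minimal sketch in Python rather than VB. The patterns are copied unchanged from the comment above; the sample HTML strings are hypothetical, modelled on cases 1(b) and 2 of the question, and are not the contents of the attached files. Python raw strings need no quote doubling, so that remark applies only when the pattern is pasted into a VB string literal.

```python
import re

# Patterns copied verbatim from the comment above.
ITEM_TEXT = re.compile(r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>')
DOWN_ROW = re.compile(r'<div .*? class="downRow".*?>\s*(.*?)\s*</div>')


def first_group(pattern, html):
    """Return capture group 1 of the first match, or None when nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


# Hypothetical inputs (assumptions, not the real attachments).
item_html = '<div class="item text">\n EXTRACT THE TEXT FROM THIS SPOT  \n</div>'
row_html = '<div id="r1" class="downRow" style="color: red">B</div>'

print(first_group(ITEM_TEXT, item_html))
print(first_group(DOWN_ROW, row_html))
```

Note that the second pattern expects at least one attribute before class, as in the question's example; a tag that starts with class="downRow" would need the pattern adjusted.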
Jay Balu

ASKER
Thanks aikimark - I will test and report back in a few. As for the text I am expecting, it will be this for the two-<div> pattern:
 
Which of the following inequalities is an algebraic expression for the shaded part of the number line above?<br /><br />(A) |x| &lt;= 3 <br />(B) |x| &lt;= 5 <br />(C) |x - 2| &lt;= 3 <br />(D) |x - 1| &lt;= 4 <br />(E) |x +1| &lt;= 4


For one div pattern, this is the kind of text I am expecting (taken from the attached file):
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />


P.S. Please note that the text to be extracted will end at some <div> tag (that way we can make sure there is no loss of letters).
Jay Balu

ASKER
So ...

1. When I ran the first RegEx against an input that has the one-<div> pattern, it yielded "almost" the correct answer, apart from some extra information. I was expecting the text only up to the next <div>, but it gave me a lot more. If it takes a lot of tweaking, don't sweat it... I can use Mid to extract the info. Here is what I mean:

Instead of
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />


It gave me all of this
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br /><div style="border: 1px solid black; padding: 10px; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0);"><span style="font-weight: bold"><span style="color: #459926">Practice Questions</span><br />Question: 51<br />Page: 159<br />Difficulty: 600</span>



2. The second RegEx gives me exactly what I want. So we are good there.

3. The third one is a boo-boo. In fact, I was hoping that one would work. You see, my input files are generated from a set of URLs (remember your solution from yesterday, Q_28568498?). Each of those links opens an HTML page that has the text I want to extract. That "text of interest" may or may not be enclosed in a single <div> or a double <div>, and I do not know which in advance. It'd help me immensely if there were ONE generic RegEx that works in both situations.

Hope I made sense!
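
The fallback mentioned in item 1 (cutting the captured text by hand) can be sketched as follows. This is a Python illustration under stated assumptions: the helper name trim_at_first_div is hypothetical, and the captured string is a shortened version of the over-long result quoted above. In VB the same idea would use InStr and Left or Mid.

```python
def trim_at_first_div(captured):
    """Keep only the part of a captured string that precedes the first '<div'."""
    head, _separator, _rest = captured.partition("<div")
    return head


# Shortened, illustrative version of the over-long capture quoted in item 1.
captured = (
    "(D) 4<br />(E) 5<br /><br />"
    '<div style="border: 1px solid black; padding: 10px;">'
    '<span style="font-weight: bold">Practice Questions</span>'
)

print(trim_at_first_div(captured))
```

If the captured text contains no <div at all, the helper returns it unchanged.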
aikimark

That is because the question indicated you wanted the text up to the </div> tag.  I'll look at it in light of the feedback.
aikimark

Note: the third pattern has two capture groups.  The first submatch will be the class value and the second submatch will be the text up to the </div>
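
To make the two submatches concrete, here is a small Python sketch using the third pattern unchanged. The sample tags are hypothetical single-line inputs, not the attached files. Python numbers capture groups from 1, whereas the VBScript SubMatches collection is indexed from 0, so these correspond to SubMatches(0) and SubMatches(1).

```python
import re

# Third pattern from the earlier comment, copied verbatim.
COMBINED = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>'
)


def class_and_text(html):
    """Return (class value, extracted text) for the first match, or None."""
    match = COMBINED.search(html)
    return match.groups() if match else None


# Hypothetical inputs (assumptions for illustration only).
samples = [
    '<div id="r1" class="downRow" title="t">B</div>',
    '<div class="item text">Some question text</div>',
]

for html in samples:
    print(class_and_text(html))
```

Because the first group reports which class matched, the calling code can decide how to treat each result.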
aikimark

Also, I only tested these patterns on the twodivtags.txt file
ASKER CERTIFIED SOLUTION
aikimark

Log in or sign up to see answer
Terry Woods

I'd suggest some slight modifications to that pattern:

<div .*

should be
<div [^>]*

to avoid matching a class in a non-div tag.

I haven't tested this, but I've made a few other subtle changes to try to ensure the result is always an expected one.
<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )


Most points to @aikimark please, if you use my solution.
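
For anyone who wants to try the modified pattern before porting it to VB, here is a minimal Python sketch. The pattern is the one suggested above, unchanged; both inputs are hypothetical reconstructions of the two cases described in the thread (child <div> tags before the text, and a <div> following the text), not the attached files.

```python
import re

# Modified pattern from this comment, split over two lines for readability.
SUGGESTED = re.compile(
    r'<div [^>]*class="(item text|downRow)".*?>'
    r'(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )'
)


def extract_text(html):
    """Return the second capture group (the wanted text), or None if no match."""
    match = SUGGESTED.search(html)
    return match.group(2) if match else None


# Hypothetical input: two child <div> tags precede the wanted text.
two_children = (
    '<div class="item text">\n'
    '   <div class="attachtitle">  blah blah blah </div>\n'
    '   <div class="attachcontent">blah blah blah </div>  \n'
    ' EXTRACT THE TEXT FROM THIS SPOT  \n'
    '</div>'
)

# Hypothetical input: the wanted text is followed by another <div>.
trailing_div = (
    '<div class="item text">If y is an integer<br />'
    '<div style="border: 1px solid black;">Practice Questions</div></div>'
)

print(extract_text(two_children))
print(extract_text(trailing_div))
```

Since . does not match a line break here, wanted text that spans several lines would need a different construct such as [\s\S]; that case is not covered by this sketch.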
aikimark

@Jay

Thanksgiving travel plans approach.  If you need help, please post before noon today (11/26).
Jay Balu

ASKER
My apologies for the late response. I have been traveling for the holidays and did not have a chance to work on this resolution. I will accept aikimark's resolution even though I don't have the time to test it at this time. I will come back if there are any questions. Thanks to Mr. Woods too for the response.