Jay Balu

Extract specific data between <div> tags using RegEx in Visual Basic

Hello,

I would like a solution to extract specific data between <div> tags using RegEx in Visual Basic. Here are the requirements:

1. There is a chunk of text that occurs between the very first <div class="item text"> tags. But there is a pattern to it.

(a) The <div class="item text"> may contain child <div> tags in the following pattern
<div class="item text">
   <div class="attachtitle">  blah blah blah </div>
   <div class="attachcontent">blah blah blah </div>  
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  
 
In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

(b) The <div class="item text"> DOES NOT contain child tags
<div class="item text">
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  

In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

2. I want the text that shows up in between <div "..." class="downRow" .... ></div> tags.
For example:
<div "..." class="downRow" .... >B</div>

I want to extract B (just an example)

I have attached two files for testing. One has only one <div> tag, the other has two child <div> tags
TwoDivTags.txt
OneDivTag.txt
aikimark
This pattern comes close to parsing the item text, but you will need to confirm this with testing.  It would have been helpful if you had included the text you expected in addition to the input text.
<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>

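A quick way to sanity-check the pattern above is to run it against both shapes described in the question. The sketch below uses Python's `re` module purely because it is easy to run; the pattern is copied verbatim and uses only constructs (non-capturing groups, lazy quantifiers) that should carry over to a VB RegExp object. The sample strings and the helper name `extract_item_text` are made up for illustration and are not taken from the attached files. It assumes the whole `<div class="item text">` block sits on one line, since `.` does not match a newline by default.

```python
import re

# aikimark's first pattern, copied verbatim from the post above.
ITEM_TEXT = re.compile(
    r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>'
)

# Hypothetical single-line inputs modelled on cases (a) and (b).
with_children = (
    '<div class="item text"><div class="attachtitle">title</div>'
    '<div class="attachcontent">content</div> WANTED TEXT </div>'
)
without_children = '<div class="item text"> WANTED TEXT </div>'

def extract_item_text(html):
    """Return the first captured item text, or None if nothing matches."""
    m = ITEM_TEXT.search(html)
    return m.group(1) if m else None

print(extract_item_text(with_children))
print(extract_item_text(without_children))
```

Both calls print `WANTED TEXT`. If each child `<div>` sits on its own line, the optional group can consume only one of them and the pattern finds nothing, so multi-line input would need separate handling.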


This pattern gets the downrow text:
<div .*? class="downRow".*?>\s*(.*?)\s*</div>

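The downRow pattern can be exercised the same way. This is a minimal Python sketch, not VB; the pattern is copied verbatim, while the helper name `extract_down_row` and the sample tags are assumptions made for the example. It also shows one edge case worth knowing about: the pattern expects at least one attribute before `class`, which matches the `<div "..." class="downRow" ...>` shape in the question.

```python
import re

# aikimark's downRow pattern, copied verbatim from the post above.
DOWN_ROW = re.compile(r'<div .*? class="downRow".*?>\s*(.*?)\s*</div>')

def extract_down_row(html):
    """Return the text inside the first downRow div, or None if absent."""
    m = DOWN_ROW.search(html)
    return m.group(1) if m else None

# Hypothetical input: other attributes before and after class.
print(extract_down_row('<div id="r1" class="downRow" style="x"> B </div>'))

# Hypothetical input: class is the very first attribute, so the literal
# space the pattern requires before "class" has nothing left to match.
print(extract_down_row('<div class="downRow">B</div>'))
```

The first call prints `B` (surrounding whitespace is trimmed by the `\s*` on either side of the group); the second prints `None`.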


This pattern might get both sets of text:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>



Remember that string literals containing quote characters need to double up those internal quote characters, e.g. "<div class=""item text"">" in VB.
Thanks Aikimark - I will test and report back in a few. As for the text I am expecting, it will be this for the two-<div> pattern:
 
Which of the following inequalities is an algebraic expression for the shaded part of the number line above?<br /><br />(A) |x| &lt;= 3 <br />(B) |x| &lt;= 5 <br />(C) |x - 2| &lt;= 3 <br />(D) |x - 1| &lt;= 4 <br />(E) |x +1| &lt;= 4



For one div pattern, this is the kind of text I am expecting (taken from the attached file):
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />



P.S. Please note that the text to be extracted will end at some <div> tag (that way we can make sure no characters are lost).
So ...

1. When I ran the first RegEx against an input that has one <div> pattern, it yielded "almost" the correct answer...if not for some extra information. I was expecting the text only up until <div> but it gave me a lot more information. If it means a lot to tweak, don't sweat it... I can "mid" and extract the info. Here is what I meant

Instead of
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />



It gave me all of this
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br /><div style="border: 1px solid black; padding: 10px; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0);"><span style="font-weight: bold"><span style="color: #459926">Practice Questions</span><br />Question: 51<br />Page: 159<br />Difficulty: 600</span>




2. The second RegEx gives me exactly what I want. So we are good there.

3. The third one is a boo-boo. In fact, I was hoping that sucker would work. You see... my input files are generated by a set of URLs (remember your solution from yesterday?...Q_28568498). Each of those links will open up an HTML page which has the text that I want to extract. That "text of interest" may or may not be enclosed in a single <div> or a double <div>; I do not know which in advance. It'd help me immensely if there were ONE generic RegEx that works in both situations.

Hope I made sense!
That is because the question indicated you wanted the text up to the </div> tag.  I'll look at it in light of the feedback.
Note: the third pattern has two capture groups.  The first submatch will be the class value and the second submatch will be the text up to the </div>
Also, I only tested these patterns on the twodivtags.txt file
ASKER CERTIFIED SOLUTION
aikimark
This solution is only available to members.
I'd suggest some slight modifications to that pattern:

<div .*


should be
<div [^>]*


to avoid matching a class in a non-div tag.
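
To illustrate why the change matters, here is a small Python sketch (Python only for ease of running; the same character-class idea applies to a VB RegExp). The two shortened patterns and the sample tag are hypothetical, chosen to isolate the difference: `.*` is free to run past the end of the `<div>` tag, while `[^>]*` has to stop at the first `>`.

```python
import re

# Shortened, hypothetical versions of the two prefixes being compared.
loose = re.compile(r'<div .*class="downRow"')
strict = re.compile(r'<div [^>]*class="downRow"')

# Hypothetical input: the class belongs to a <span>, not to the <div>.
html = '<div id="a"><span class="downRow">x</span></div>'

# ".*" crosses the ">" of the div and finds the span's class attribute.
print(loose.search(html) is not None)

# "[^>]*" cannot cross ">", so the span's class is never reached.
print(strict.search(html) is not None)
```

This prints `True` for the loose form and `False` for the strict one.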

I haven't tested this, but I've made a few other subtle changes to try to ensure the result is always an expected one.
<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )

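Since the pattern above was posted untested, here is a rough Python sketch of how it behaves on a made-up page fragment. The pattern is copied verbatim; the names `COMBINED`, `SAMPLE` and `extract_all` and the HTML itself are assumptions for illustration, not the contents of the attached files. Each top-level `<div>` is kept on one line because `.` does not match a newline by default.

```python
import re

# The revised pattern from the post above, copied verbatim.
COMBINED = re.compile(
    r'<div [^>]*class="(item text|downRow)".*?>'
    r'(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )'
)

# Hypothetical fragment: an "item text" div with two child divs, the wanted
# text, then a trailing styled div; followed by a downRow div.
SAMPLE = (
    '<div class="item text"><div class="attachtitle">title</div>'
    '<div class="attachcontent">content</div> WANTED TEXT '
    '<div style="border: 1px solid black">Practice Questions</div></div>\n'
    '<div id="r1" class="downRow" style="x">B</div>'
)

def extract_all(html):
    """Return (class value, extracted text) pairs for every match."""
    return COMBINED.findall(html)

for css_class, text in extract_all(SAMPLE):
    print(css_class, "->", text)
```

On this sample the first submatch is the class value and the second is the text, giving `item text -> WANTED TEXT` and `downRow -> B`. The trailing `<div ` alternative is what stops the capture before the styled box that the single-pattern version ran into.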


Most points to @aikimark please, if you use my solution.
@Jay

Thanksgiving travel plans approach.  If you need help, please post before noon today (11/26).
My apologies for the late response. I have been traveling for the holidays and did not have a chance to work on this resolution. I will accept aikimark's resolution even though I don't have the time to test it at this time. I will come back if there are any questions. Thanks to Mr. Woods too for the response.