Solved

Extract specific data between <div> tags using RegEx in Visual Basic

Posted on 2014-11-25
Last Modified: 2014-12-04
Hello,

I would like a solution to extract specific data between <div> tags using RegEx in Visual Basic. Here are the requirements:

1. There is a chunk of text that occurs between the very first <div class="item text"> tags. But there is a pattern to it.

(a) The <div class="item text"> may contain child <div> tags in the following pattern
<div class="item text">
   <div class="attachtitle">  blah blah blah </div>
   <div class="attachcontent">blah blah blah </div>  
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  
 
In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

(b) The <div class="item text"> DOES NOT contain child tags
<div class="item text">
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  

In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

2. I want the text that shows up in between <div "..." class="downRow" .... ></div> tags.
For example:
<div "..." class="downRow" .... >B</div>

I want to extract B (just an example)

I have attached two files for testing. One has only one <div> tag, the other has two child <div> tags
TwoDivTags.txt
OneDivTag.txt
Question by:Jay Balu
 

Author Comment

by:Jay Balu
ID: 40464491
I would prefer a solution in the same lines as this one here

http://www.experts-exchange.com/Programming/Languages/Visual_Basic/Q_28568498.html
 
LVL 45

Expert Comment

by:aikimark
ID: 40464691
This pattern comes close to parsing the item text, but you will need to confirm this with testing.  It would have been helpful if you had included the text you expected in addition to the input text.
<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>



This pattern gets the downrow text:
<div .*? class="downRow".*?>\s*(.*?)\s*</div>
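A quick way to sanity-check the downRow pattern is to run it against a small sample. The sketch below uses Python's re module only as a test harness; the sample strings are made up (the attached files are not shown in the thread), and the pattern's behaviour in the VB RegExp object should still be confirmed separately.

```python
import re

# downRow pattern copied from the thread.
DOWNROW = r'<div .*? class="downRow".*?>\s*(.*?)\s*</div>'

# Hypothetical sample: another attribute precedes class, as in the question.
sample = '<div id="row1" class="downRow" style="float: left">B</div>'
answer = re.search(DOWNROW, sample).group(1)
print(answer)

# The pattern expects something between "<div " and " class=", so a tag
# whose first attribute is class is not matched.
no_match = re.search(DOWNROW, '<div class="downRow">B</div>')
print(no_match)
```

If the real pages can have class as the first attribute, the leading part of the pattern would need to be loosened.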



This pattern might get both sets of text:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>



Remember that in a VB string literal, every embedded quote character must be doubled.
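To illustrate the quote doubling, here is a small Python helper (vb_string_literal is a hypothetical name, not part of any library) that turns a raw pattern into the text you would paste into VB source. It is only a sketch of the escaping rule.

```python
def vb_string_literal(pattern: str) -> str:
    """Return the pattern as VB string-literal source text.

    VB has no backslash escapes inside string literals; the only change
    needed is to double each embedded double-quote character.
    """
    return '"' + pattern.replace('"', '""') + '"'


literal = vb_string_literal('<div class="item text">')
print(literal)  # prints: "<div class=""item text"">"
```

The backslashes in the patterns (\s, \w) are left alone, since VB passes them through to the regex engine unchanged.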
 

Author Comment

by:Jay Balu
ID: 40464730
Thanks Aikimark - I will test and report back in a few. As for the text I am expecting, it will be this for the two <div> pattern:
 
Which of the following inequalities is an algebraic expression for the shaded part of the number line above?<br /><br />(A) |x| &lt;= 3 <br />(B) |x| &lt;= 5 <br />(C) |x - 2| &lt;= 3 <br />(D) |x - 1| &lt;= 4 <br />(E) |x +1| &lt;= 4



For one div pattern, this is the kind of text I am expecting (taken from the attached file):
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />



P.S. Please note that the text to be extracted ends at the next <div> tag (that way we can make sure no characters are lost).
 

Author Comment

by:Jay Balu
ID: 40464800
So ...

1. When I ran the first RegEx against an input that has one <div> pattern, it yielded "almost" the correct answer...if not for some extra information. I was expecting the text only up until <div> but it gave me a lot more information. If it means a lot to tweak, don't sweat it... I can "mid" and extract the info. Here is what I meant

Instead of
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />



It gave me all of this
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br /><div style="border: 1px solid black; padding: 10px; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0);"><span style="font-weight: bold"><span style="color: #459926">Practice Questions</span><br />Question: 51<br />Page: 159<br />Difficulty: 600</span>




2. The second RegEx gives me exactly what I want. So we are good there.

3. The third one is a boo-boo. In fact, I was hoping that sucker would work. You see... my input files are generated by a set of URLs (remember your solution from yesterday?...Q_28568498). Each of those links opens an HTML page that has the text I want to extract. That "text of interest" may or may not be enclosed in a single <div> or a double <div>, and I do not know which in advance. It'd help me immensely if there were ONE generic RegEx that works in both situations.

Hope I made sense!
 
LVL 45

Expert Comment

by:aikimark
ID: 40464915
That is because the question indicated you wanted the text up to the </div> tag.  I'll look at it in light of the feedback.

 
LVL 45

Expert Comment

by:aikimark
ID: 40464954
Note: the third pattern has two capture groups.  The first submatch will be the class value and the second submatch will be the text up to the </div>
 
LVL 45

Expert Comment

by:aikimark
ID: 40464978
Also, I only tested these patterns on the twodivtags.txt file
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40464985
Please try this pattern
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*(?:</div>|<div )
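The difference between the first pattern and this one can be seen on a small sample. The sketch below uses Python's re module as a test harness; SAMPLE is a made-up stand-in for OneDivTag.txt (the attachment is not shown in the thread), and its layout, with the question text and the styled <div> on the same line, is an assumption based on the output posted earlier.

```python
import re

# Hypothetical one-<div> sample; the real attachment may be laid out differently.
SAMPLE = (
    '<div class="item text">\n'
    'If y is an integer<br />(A) 1<br />'
    '<div style="border: 1px solid black;"><span>Practice Questions</span></div>\n'
    '</div>'
)

# First pattern from the thread: the capture runs on to the first </div>.
FIRST = r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>'

# Accepted pattern: the capture also stops at the next "<div ".
ACCEPTED = (r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?'
            r'\s*(\w.*?)\s*(?:</div>|<div )')

first_text = re.search(FIRST, SAMPLE).group(1)

accepted = re.search(ACCEPTED, SAMPLE)
class_name = accepted.group(1)      # first capture group: the class value
accepted_text = accepted.group(2)   # second capture group: the extracted text

print(first_text)
print(class_name, '|', accepted_text)
```

On this sample the first pattern drags the styled <div> into the capture, while the accepted pattern stops before it. In the VB RegExp object the same two values would be read from SubMatches(0) and SubMatches(1) of the match.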

 
LVL 35

Expert Comment

by:Terry Woods
ID: 40465823
I'd suggest some slight modifications to that pattern:

<div .*


should be
<div [^>]*


to avoid matching a class in a non-div tag.

I haven't tested this, but I've made a few other subtle changes to try to ensure the result is always an expected one.
<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )
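Since this pattern is posted as untested, here is a rough check against a made-up input containing both the two-child "item text" block and a downRow tag. Python's re module is used only as a harness, TWO_DIV is a hypothetical stand-in for TwoDivTags.txt, and the one-tag-per-line layout is an assumption.

```python
import re

# Terry Woods' suggested pattern, copied from the thread.
PATTERN = re.compile(
    r'<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*'
    r'\s*(\w.*?)\s*(?:</div>|<div )'
)

# Hypothetical sample: two child <div> tags, then the text of interest,
# followed by a separate downRow tag.
TWO_DIV = (
    '<div class="item text">\n'
    '   <div class="attachtitle">Attachment:</div>\n'
    '   <div class="attachcontent">numberline.png</div>\n'
    ' Which inequality matches the number line?<br />(A) |x| &lt;= 3\n'
    '</div>\n'
    '<div id="row1" class="downRow" style="float: left">B</div>'
)

# Collect (class value, extracted text) for every match in the input.
results = [(m.group(1), m.group(2)) for m in PATTERN.finditer(TWO_DIV)]
for class_name, text in results:
    print(class_name, '|', text)
```

On this sample the repeated (?:\s*<div.*?</div>)* group skips both child tags, so one pattern returns the question text and the downRow value. The text of interest has to sit on a single line here, because . does not match a line break by default.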



Most points to @aikimark please, if you use my solution.
 
LVL 45

Expert Comment

by:aikimark
ID: 40466704
@Jay

Thanksgiving travel plans approach.  If you need help, please post before noon today (11/26).
 

Author Comment

by:Jay Balu
ID: 40482273
My apologies for the late response. I have been traveling for the holidays and did not have a chance to work on this resolution. I will accept aikimark's resolution even though I don't have the time to test it at this time. I will come back if there are any questions. Thanks to Mr. Woods too for the response.