Solved

Extract specific data between <div> tags using RegEx in Visual Basic

Posted on 2014-11-25
Last Modified: 2014-12-04
Hello,

I would like a solution to extract specific data between <div> tags using RegEx in Visual Basic. Here are the requirements:

1. There is a chunk of text that occurs between the very first <div class="item text"> tags. But there is a pattern to it.

(a) The <div class="item text"> may contain child <div> tags in the following pattern
<div class="item text">
   <div class="attachtitle">  blah blah blah </div>
   <div class="attachcontent">blah blah blah </div>  
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  
 
In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

(b) The <div class="item text"> DOES NOT contain child tags
<div class="item text">
 EXTRACT THE TEXT FROM THIS SPOT  
</div>  

In this case, I want the text present at this location (EXTRACT THE TEXT FROM THIS SPOT...this isn't the real text, just an example)

2. I want the text that shows up in between <div "..." class="downRow" .... ></div> tags.
For example:
<div "..." class="downRow" .... >B</div>

I want to extract B (just an example)

I have attached two files for testing. One has only one <div> tag, the other has two child <div> tags
TwoDivTags.txt
OneDivTag.txt
Question by:Jay Balu
 

Author Comment

by:Jay Balu
ID: 40464491
I would prefer a solution in the same lines as this one here

http://www.experts-exchange.com/Programming/Languages/Visual_Basic/Q_28568498.html
 
LVL 45

Expert Comment

by:aikimark
ID: 40464691
This pattern comes close to getting the downtext parse, but you will need to confirm this with testing.  It would have been helpful if you included the text you expected in addition to the input text.
<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>
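
For anyone who wants to sanity-check this pattern before wiring it into VB, here is a minimal Python sketch. The sample HTML strings and the helper name `extract_item_text` are made up for illustration (the attached files are not reproduced here), and Python's `re` engine is only assumed to behave like the VB regex object for this pattern.

```python
import re

# First pattern from the comment above, unchanged.
ITEM_TEXT = re.compile(r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>')

def extract_item_text(html):
    """Return the captured text, or None when the pattern does not match."""
    m = ITEM_TEXT.search(html)
    return m.group(1) if m else None

# Hypothetical case (b): no child <div> tags.
no_children = '<div class="item text">\n If y is an integer<br />(A) 1<br />(B) 2\n</div>'

# Hypothetical case (a): two child <div> tags on the same line as the text.
two_children = ('<div class="item text"><div class="attachtitle">T</div>'
                '<div class="attachcontent">C</div>Which inequality?<br />(A) 3</div>')

print(extract_item_text(no_children))   # If y is an integer<br />(A) 1<br />(B) 2
print(extract_item_text(two_children))  # Which inequality?<br />(A) 3
```

In this sketch `.` does not match a line break, so the optional `(?:<div .*</div>)?` part only skips child tags that sit on a single line; child tags spread over several lines would likely need a different approach.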


This pattern gets the downrow text:
<div .*? class="downRow".*?>\s*(.*?)\s*</div>
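
A rough Python check of the downRow pattern, under the assumption that each row looks like the `<div "..." class="downRow" .... >B</div>` shape from the question. The `id` and `style` attributes below are invented for the example.

```python
import re

# downRow pattern from the comment above, unchanged.
DOWN_ROW = re.compile(r'<div .*? class="downRow".*?>\s*(.*?)\s*</div>')

# Hypothetical rows; surrounding whitespace inside the second one is trimmed by \s*.
html = ('<div id="r1" class="downRow" style="color:red">B</div>'
        '<div id="r2" class="downRow"> D </div>')

answers = DOWN_ROW.findall(html)
print(answers)  # ['B', 'D']

# With no attribute before class, the literal space in " class=" has nothing to match.
bare = DOWN_ROW.search('<div class="downRow">B</div>')
print(bare)  # None
```

The second check suggests the pattern relies on at least one attribute appearing before `class`, which is the case in the question's sample but may be worth confirming against the real pages.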


This pattern might get both sets of text:
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>


Remember that string literals containing quote characters need to double up those internal quote characters (for example, "<div class=""item text"">" as a VB string literal).
 

Author Comment

by:Jay Balu
ID: 40464730
Thanks Aikimark - I will test and report back in a few. As far as the text I am expecting, it will be this for the two-<div> pattern:
 
Which of the following inequalities is an algebraic expression for the shaded part of the number line above?<br /><br />(A) |x| &lt;= 3 <br />(B) |x| &lt;= 5 <br />(C) |x - 2| &lt;= 3 <br />(D) |x - 1| &lt;= 4 <br />(E) |x +1| &lt;= 4																		


For one div pattern, this is the kind of text I am expecting (taken from the attached file):
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />


P.S.: please note that the text to be extracted will end at some <div> tag (that way we can make sure there is no loss of letters)
 

Author Comment

by:Jay Balu
ID: 40464800
So ...

1. When I ran the first RegEx against an input that has one <div> pattern, it yielded "almost" the correct answer...if not for some extra information. I was expecting the text only up until <div> but it gave me a lot more information. If it means a lot to tweak, don't sweat it... I can "mid" and extract the info. Here is what I meant

Instead of
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br />


It gave me all of this
If y is an integer, then the least possible value of |23 - 5y| is<br /><br />(A) 1<br />(B) 2<br />(C) 3<br />(D) 4<br />(E) 5<br /><br /><div style="border: 1px solid black; padding: 10px; background-color: rgb(255, 255, 255); color: rgb(0, 0, 0);"><span style="font-weight: bold"><span style="color: #459926">Practice Questions</span><br />Question: 51<br />Page: 159<br />Difficulty: 600</span>


2. The second RegEx gives me exactly what I want. So we are good there.

3. The third one is a boo-boo. In fact, I was hoping that sucker would work. You see... my input files are generated by a set of URLs (remember your solution from yesterday?...Q_28568498). Each of those links will open up an HTML page which has the text that I want to extract. That "text of interest" may or may not be enclosed in a single <div> or a double <div>, and I do not know which in advance. It'd help me immensely if there were ONE generic RegEx that works in both situations.

Hope I made sense!
 
LVL 45

Expert Comment

by:aikimark
ID: 40464915
That is because the question indicated you wanted the text up to the </div> tag.  I'll look at it in light of the feedback.
 
LVL 45

Expert Comment

by:aikimark
ID: 40464954
Note: the third pattern has two capture groups.  The first submatch will be the class value and the second submatch will be the text up to the </div>
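
To make the two submatches concrete, here is a small Python sketch run against made-up single-line inputs. In Python the groups are numbered from 1; if the VBScript RegExp object is in use, the same values would typically be read through `SubMatches(0)` and `SubMatches(1)`, which is an assumption about the caller's setup rather than something stated in this thread.

```python
import re

# Third (combined) pattern from the earlier comment, unchanged.
COMBINED = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>'
)

# Hypothetical single-line inputs, one per class.
samples = [
    '<div class="item text">If y is an integer</div>',
    '<div id="r1" class="downRow" style="color:red">B</div>',
]

results = []
for html in samples:
    m = COMBINED.search(html)
    # group(1) is the class value, group(2) is the extracted text.
    results.append((m.group(1), m.group(2)))

print(results)  # [('item text', 'If y is an integer'), ('downRow', 'B')]
```

Checking the first group lets the calling code decide which kind of text it just received.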
 
LVL 45

Expert Comment

by:aikimark
ID: 40464978
Also, I only tested these patterns on the twodivtags.txt file
 
LVL 45

Accepted Solution

by:aikimark (earned 500 total points)
ID: 40464985
Please try this pattern
<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*(?:</div>|<div )
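
The change here is the trailing `(?:</div>|<div )`, which lets the capture end at the next opening `<div ` as well as at a closing tag. A minimal Python sketch of the effect, using a made-up input where the question text is followed by an empty nested `<div>` (the real pages may differ):

```python
import re

# First pattern from the earlier comment, for comparison.
FIRST = re.compile(r'<div class="item text">\s*(?:<div .*</div>)?\s*(\w.*?)\s*</div>')
# Pattern from this comment, unchanged.
ACCEPTED = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*(?:</div>|<div )'
)

# Hypothetical input: question text, then a nested <div>, then the closing tag.
html = ('<div class="item text">If y is an integer<br />(A) 1<br />'
        '<div style="padding: 10px"></div></div>')

first_text = FIRST.search(html).group(1)
accepted_text = ACCEPTED.search(html).group(2)

print(first_text)     # If y is an integer<br />(A) 1<br /><div style="padding: 10px">
print(accepted_text)  # If y is an integer<br />(A) 1<br />
```

On this input the first pattern runs on into the nested tag, much like the over-capture reported earlier in the thread, while the revised pattern stops where the nested `<div ` begins.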

 
LVL 35

Expert Comment

by:Terry Woods
ID: 40465823
I'd suggest some slight modifications to that pattern:

<div .*

should be
<div [^>]*

to avoid matching a class in a non-div tag.

I haven't tested this, but I've made a few other subtle changes to try to ensure the result is always an expected one.
<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )
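
One observable difference, at least in Python's `re` engine and on made-up inputs: the greedy `.*>` in the accepted pattern can run past the opening tag when a later `>` is followed directly by word text, whereas the lazy `.*?>` here stops at the first `>`. The sketch below is only an illustration under those assumptions, not a test against the attached files.

```python
import re

ACCEPTED = re.compile(
    r'<div .*class="(item text|downRow)".*>\s*(?:<div .*</div>)?\s*(\w.*?)\s*(?:</div>|<div )'
)
SUGGESTED = re.compile(
    r'<div [^>]*class="(item text|downRow)".*?>(?:\s*<div.*?</div>)*\s*(\w.*?)\s*(?:</div>|<div )'
)

# Hypothetical input: a nested <div> after the question text contains words of its own.
nested = ('<div class="item text">If y is an integer<br />(A) 1<br />'
          '<div style="padding: 10px"><span>Practice Questions</span></div></div>')

# Hypothetical input: two child <div> tags ahead of the question text.
two_children = ('<div class="item text"><div class="attachtitle">T</div>'
                '<div class="attachcontent">C</div>Which inequality?<br />(A) 3</div>')

accepted_text = ACCEPTED.search(nested).group(2)
suggested_text = SUGGESTED.search(nested).group(2)
children_text = SUGGESTED.search(two_children).group(2)

print(accepted_text)   # Practice Questions</span>
print(suggested_text)  # If y is an integer<br />(A) 1<br />
print(children_text)   # Which inequality?<br />(A) 3
```

If the real pages carry a "Practice Questions" style block after the question text, this difference may matter, so both patterns are worth running against both attached files.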


Most points to @aikimark please, if you use my solution.
 
LVL 45

Expert Comment

by:aikimark
ID: 40466704
@Jay

Thanksgiving travel plans approach.  If you need help, please post before noon today (11/26).
 

Author Comment

by:Jay Balu
ID: 40482273
My apologies for the late response. I have been traveling for the holidays and did not have a chance to work on this resolution. I will accept aikimark's resolution even though I don't have the time to test it at this time. I will come back if there are any questions. Thanks to Mr. Woods too for the response.