Solved

regular expression for checking attributes in an html tag

Posted on 2012-03-26
Last Modified: 2012-03-30
In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract "randomtext" from each of the following tags except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or for more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
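The behaviour described above can be reproduced with a short script. This is only a sketch in Python (the question does not name a language); the helper name `matching_indexes` is made up for illustration, and the pattern is the first look-ahead-only expression from the question, copied unchanged.

```python
import re

# The ten sample tags from the question: 1-6 should match, 7-10 should not.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext</font>',
    '<font size=5 face="verdana" color="red">randomtext</font>',
    '<font color="red" size=5 face="verdana">randomtext</font>',
    '<font color="red" face="verdana" size=5>randomtext</font>',
    '<font face="verdana" size=5 color="red">randomtext</font>',
    '<font face="verdana" color="red" size=5>randomtext</font>',
    '<font size=5 size=5 size=5>randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext</font>',
]

# Look-ahead-only pattern from the question: it checks that the three
# attributes are present, but [^>]* then accepts anything else in the tag.
LOOKAHEAD_ONLY = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)


def matching_indexes(pattern, samples):
    """Return the 1-based positions of the samples that the pattern matches."""
    return [i for i, s in enumerate(samples, start=1) if pattern.search(s)]


print(matching_indexes(LOOKAHEAD_ONLY, SAMPLES))
# prints [1, 2, 3, 4, 5, 6, 8, 9, 10]
```

Sample 7 is already rejected because two of the look-aheads fail; samples 8 to 10 slip through because nothing limits the number of attributes.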
Question by:lucavilla
10 Comments
 
LVL 74

Expert Comment

by:käµfm³d 👽
What language are you using?
 
LVL 10

Assisted Solution

by:pfrancois
pfrancois earned 100 total points
If your HTML code is valid, it is rather easy to transform it into XML. Then, instead of using a regular expression, you can select the text you want with XPath. In that case, the XPath expression should be:
//font[(@size = '5') and (@color = 'red') and (@face = 'verdana') and (count(@*) = 3)]/text()

Example in PHP:
$htmlfile = '<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>';
// Keep libxml parse warnings (for example about the repeated size attribute) out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);
// XPath string comparison is case-sensitive, so the values are written exactly as in the HTML.
$nodelist = $xpath->query("//font[(@size = '5') and (@color = 'red') and (@face = 'verdana') and (count(@*) = 3)]/text()");
// Print each selected text node on its own line.
foreach ($nodelist as $text) {
	echo $doc->saveXML($text), "\n";
}
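
The same "exactly these three attributes" test can also be done with a parser that has no XPath support. The following is a rough sketch, not the DOM/XPath method shown above: it swaps in Python's standard-library `html.parser`, and the names `FontTextExtractor`, `REQUIRED` and `extract` are made up for illustration. It assumes `font` tags are not nested.

```python
from html.parser import HTMLParser

# Assumed requirement, taken from the question: exactly these three attributes.
REQUIRED = {"size": "5", "color": "red", "face": "verdana"}


class FontTextExtractor(HTMLParser):
    """Collects the text of <font> tags whose attributes are exactly REQUIRED."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: a repeated attribute appears
        # several times and a valueless attribute has the value None, so the
        # length check rejects both duplicates and extra attributes.
        if tag == "font":
            self.inside = len(attrs) == len(REQUIRED) and dict(attrs) == REQUIRED

    def handle_endtag(self, tag):
        if tag == "font":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.texts.append(data.strip())


def extract(html):
    """Return the text of every <font> tag that carries exactly REQUIRED."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


print(extract('<font color="red" size=5 face="verdana">kept</font>'
              '<font face="verdana" color="red" size=5 garbagetext>dropped</font>'))
```

Like the XPath version, this compares attribute values case-sensitively and ignores attribute order.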

 
LVL 35

Accepted Solution

by:Robert Schutt
Robert Schutt earned 400 total points
Reusing your 3 look-aheads, I made this in VBScript:

Option Explicit

Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "<\s*font(?=[^>]*\s+size\s*=\s*5[ >])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:size\s*=\s*5|color\s*=\s*[""']red[""']|face\s*=\s*[""']verdana[""'])){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"

Dim strTest
strTest = _
"<font size=5 color=""red"" face=""verdana"">randomtext1</font>" & vbCrLf & _
"<font size=5 face=""verdana"" color=""red"">randomtext2</font>" & vbCrLf & _
"<font color=""red"" size=5 face=""verdana"">randomtext3</font>" & vbCrLf & _
"<font color=""red"" face=""verdana"" size=5>randomtext4</font>" & vbCrLf & _
"<font face=""verdana"" size=5 color=""red"">randomtext5</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5>randomtext6</font>" & vbCrLf & _
"<font size=5 size=5 size=5>randomtext7</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random"">randomtext8</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random=pippo"">randomtext9</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 garbagetext>randomtext10</font>"

Dim msg, m
msg = "matches:" & vbCrLf
For Each m In re.Execute(strTest)
	msg = msg & m.SubMatches(0) & vbCrLf ' if you want the complete matched text use: m.Value
Next

MsgBox msg

So what it does: it takes your 3 look-aheads and also checks that there are exactly 3 attributes, using the {3} part.
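
For readers without a VBScript host, here is a sketch of the same pattern in Python. The regular expression is the one from the VBScript above, only split across several raw strings for readability; the names `ATTR`, `PATTERN` and `TEST` are chosen for this example.

```python
import re

# One of the three allowed attributes, exactly as in the VBScript pattern.
ATTR = r'''(?:size\s*=\s*5|color\s*=\s*["']red["']|face\s*=\s*["']verdana["'])'''

PATTERN = re.compile(
    r'''<\s*font'''
    # Three look-aheads: each required attribute is present somewhere in the tag.
    r'''(?=[^>]*\s+size\s*=\s*5[ >])'''
    r'''(?=[^>]*\scolor\s*=\s*["']red["'])'''
    r'''(?=[^>]*\sface\s*=\s*["']verdana["'])'''
    # Exactly three attributes, each one of the allowed three, then the tag closes.
    r'''(?:\s+''' + ATTR + r'''){3}'''
    r'''\s*>\s*([^<]+?)\s*<\s*/font\s*>''',
    re.IGNORECASE,
)

TEST = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
])

# findall returns the captured group, i.e. the text inside each matching tag.
print(PATTERN.findall(TEST))
# prints ['randomtext1', 'randomtext2', 'randomtext3', 'randomtext4', 'randomtext5', 'randomtext6']
```

The `{3}` group alone would accept `size=5 size=5 size=5`; it is the combination with the look-aheads that forces each attribute to appear exactly once.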
 

Author Comment

by:lucavilla
pfrancois: XPath seems nice here, but I'd like to test it on the fly like I do with RegexBuddy or web regex testers. Is there an XPath tester on the web where I can just paste my HTML and my XPath and press a button to see the results?

robert_schutt: great, you arrived at the same solution I did, which has the only "defect" of repeating the 3 attributes a second time. Any idea how to avoid repeating them (without losing efficiency)?
 
LVL 10

Expert Comment

by:pfrancois
A web page where you paste your HTML code and your XPath, with a submit button, is a great idea, and I could implement it in 10 minutes, but the only server I have with PHP support is dedicated to liturgy (http://www.romanliturgy.org), not the best place to publish this kind of thing, however...

My environment is Linux and I love working on the command line, which gives me higher productivity. What I usually do in that environment is:
wget http://some.website.com/some/path/to/some/page.html
xpath --html '//some/xpath/expression()' page.html

and I see the result. xpath is a 30-line PHP script I wrote.
 
LVL 10

Expert Comment

by:pfrancois
I found a lot of online XPath checkers but, so far, none of them works with HTML as input, only XML.

But I found THE solution in case you are working with Firefox: it is the addon "XPath Checker", that rocks. It accepts HTML input.

See: https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
 
LVL 35

Expert Comment

by:Robert Schutt
When you said "avoid repeating them", I think I misunderstood you: I was trying to solve it differently, but you probably just mean not using them twice in the regex.

And looking back, that's not necessary at all. See this more generic version, which just checks for 3 attributes, since the look-aheads already make sure the right ones are there:

re.Pattern = "<\s*font(?=[^>]*\ssize\s*=\s*5[\s>])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:\w+\s*=\s*(?:""\w+""|'\w+'|\d+))){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"
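
Taking the idea one step further, the look-aheads themselves can be generated so that each attribute is written only once. The following Python sketch assumes a hypothetical helper, `build_exact_attr_pattern`, that is not part of the thread. It also relaxes the original slightly, accepting each required value with or without quotes, and it inherits the generic version's assumption that attribute values consist of word characters only.

```python
import re


def build_exact_attr_pattern(tag, required):
    """Build a regex matching <tag> with exactly the attributes in `required`.

    `required` maps attribute names to literal values. Hypothetical helper:
    values may be quoted or unquoted, which is looser than the original.
    """
    # One look-ahead per required attribute, generated instead of hand-written.
    lookaheads = "".join(
        rf"""(?=[^>]*\s{re.escape(name)}\s*=\s*["']?{re.escape(value)}["']?[\s>])"""
        for name, value in required.items()
    )
    # Generic attribute as in the version above: name = "word", 'word' or digits.
    any_attr = r"""\s+\w+\s*=\s*(?:"\w+"|'\w+'|\d+)"""
    return re.compile(
        rf"<\s*{re.escape(tag)}{lookaheads}(?:{any_attr}){{{len(required)}}}"
        rf"\s*>\s*([^<]+?)\s*<\s*/{re.escape(tag)}\s*>",
        re.IGNORECASE,
    )


pattern = build_exact_attr_pattern(
    "font", {"size": "5", "color": "red", "face": "verdana"}
)
print(pattern.findall('<font color="red" size=5 face="verdana">kept</font>'))
print(pattern.findall('<font face="verdana" color="red" size=5 foobar="random">dropped</font>'))
```

The repetition count is derived from the number of required attributes, so adding a fourth attribute to the dictionary adjusts both the look-aheads and the `{n}` quantifier.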

 
LVL 10

Expert Comment

by:pfrancois
The XPath-based solution above matches the pattern only when each of the three attributes is present exactly once.
 

Author Comment

by:lucavilla
pfrancois: XPath Checker is nice, but unfortunately it wants an HTML page from the web... I don't think it's possible to give it the HTML code on the fly...

robert_schutt: thanks!
 
LVL 10

Expert Comment

by:pfrancois
If your HTML code is not on the web but in a local file, just point the browser to the local file with the URL file:///some-path/some-file.html, or open it with File > Open File... and navigate to the HTML file you want to test. XPath Checker expects your XPath to be "on-the-fly", but I suppose your HTML code is a constant, or are you really modifying your HTML on the fly as well? In the latter case, make your HTML code XML-compliant and you will be able to use all the online XPath checkers you can find on the web.
