Solved

regular expression for checking attributes in an html tag

Posted on 2012-03-26
Last Modified: 2012-03-30
In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract "randomtext" from every line below except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or for more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
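
For readers who want to reproduce the problem, here is a minimal Python sketch (Python is an assumption; the question does not name a language). It runs the first pattern above over the ten sample tags, with the texts numbered so each match can be traced back to its line. The names PATTERN, SAMPLES and matched are illustrative only.

```python
import re

# The first look-ahead pattern from the question, unchanged.
PATTERN = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

# The ten sample tags; the texts are numbered here for traceability.
SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
]

# Collect the captured text of every match across all samples.
matched = [m.group(1) for m in PATTERN.finditer("\n".join(SAMPLES))]
print(matched)
```

The look-aheads reject sample 7 because color and face are missing, but nothing limits the total number of attributes, so samples 8 to 10 are still captured along with the six wanted ones.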
Question by:lucavilla
10 Comments
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37765388
What language are you using?
 
LVL 10

Assisted Solution

by:pfrancois
pfrancois earned 100 total points
ID: 37765507
In this case, you can transform your HTML into XML, which is rather easy if the HTML code is valid. Then, instead of using a regular expression, you can select the text you want with XPath. The XPath expression should be:
//font[(@size = '5') and (@color = 'red') and (@face = 'verdana') and (count(@*) = 3)]/text()

Example in PHP:
$htmlfile = '<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>';
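// Keep libxml from printing warnings for the malformed sample tags (e.g. the repeated size attribute).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);
// XPath string comparison is case-sensitive, so 'verdana' must match the attribute value exactly.
$nodelist = $xpath->query("//font[(@size = '5') and (@color = 'red') and (@face = 'verdana') and (count(@*) = 3)]/text()");
foreach ($nodelist as $text) {
	echo $doc->saveXML($text), "\n"; // one matched text per line
}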
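
The same "exactly these three attributes" test can also be written without XPath. The sketch below is not the approach from this comment: it swaps in Python's standard html.parser module and compares the parsed attribute list against the wanted set, which plays the role of the count(@*) = 3 check. The class name FontTextExtractor and the function extract are hypothetical.

```python
from html.parser import HTMLParser

# The three required attributes and their values (compared case-sensitively).
WANTED = {"size": "5", "color": "red", "face": "verdana"}


class FontTextExtractor(HTMLParser):
    """Collects the text of <font> tags carrying exactly the WANTED attributes."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # attrs is a list of (name, value) pairs with duplicates kept,
            # so the length check rejects repeated or extra attributes.
            self._capture = len(attrs) == 3 and dict(attrs) == WANTED

    def handle_endtag(self, tag):
        if tag == "font":
            self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.results.append(data.strip())


def extract(html):
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.results


print(extract('<font color="red" size=5 face="verdana">kept</font>'
              '<font face="verdana" color="red" size=5 garbagetext>dropped</font>'))
```

A valueless attribute such as garbagetext is reported by the parser with a value of None, so it counts toward the attribute total just like a regular one.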

 
LVL 35

Accepted Solution

by:Robert Schutt
Robert Schutt earned 400 total points
ID: 37765963
Re-using your 3 look-aheads, I made this in VBScript:

Option Explicit

Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "<\s*font(?=[^>]*\s+size\s*=\s*5[ >])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:size\s*=\s*5|color\s*=\s*[""']red[""']|face\s*=\s*[""']verdana[""'])){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"

Dim strTest
strTest = _
"<font size=5 color=""red"" face=""verdana"">randomtext1</font>" & vbCrLf & _
"<font size=5 face=""verdana"" color=""red"">randomtext2</font>" & vbCrLf & _
"<font color=""red"" size=5 face=""verdana"">randomtext3</font>" & vbCrLf & _
"<font color=""red"" face=""verdana"" size=5>randomtext4</font>" & vbCrLf & _
"<font face=""verdana"" size=5 color=""red"">randomtext5</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5>randomtext6</font>" & vbCrLf & _
"<font size=5 size=5 size=5>randomtext7</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random"">randomtext8</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random=pippo"">randomtext9</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 garbagetext>randomtext10</font>"

Dim msg, m
msg = "matches:" & vbCrLf
For Each m In re.Execute(strTest)
	msg = msg & m.SubMatches(0) & vbCrLf ' if you want the complete matched text use: m.Value
Next

MsgBox msg

What it does: it keeps your 3 look-aheads and additionally uses the {3} quantifier to check that the tag contains exactly 3 attributes, each of which must be one of the three allowed ones.
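
As a cross-check outside VBScript, here is a sketch of the same pattern ported to Python's re module, assuming re.IGNORECASE as the counterpart of re.IgnoreCase = True. The names SIZE, COLOR, FACE and extract are made up for illustration. Keeping the three sub-patterns in variables means each is typed once, although the compiled regex still contains them twice.

```python
import re

# Sub-patterns for the three required attributes, taken from the pattern above.
SIZE = r"size\s*=\s*5"
COLOR = r"""color\s*=\s*["']red["']"""
FACE = r"""face\s*=\s*["']verdana["']"""

PATTERN = re.compile(
    r"<\s*font"
    # Look-aheads: each required attribute occurs somewhere inside the tag.
    + r"(?=[^>]*\s+" + SIZE + r"[ >])"
    + r"(?=[^>]*\s" + COLOR + r")"
    + r"(?=[^>]*\s" + FACE + r")"
    # Exactly three attributes, each being one of the allowed ones.
    + r"(?:\s+(?:" + SIZE + "|" + COLOR + "|" + FACE + r")){3}"
    + r"\s*>\s*([^<]+?)\s*<\s*/font\s*>",
    re.IGNORECASE,
)

SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
]


def extract(html):
    """Return the captured text of every matching <font> element."""
    return [m.group(1) for m in PATTERN.finditer(html)]


print(extract("\n".join(SAMPLES)))
```

Sample 7 passes the {3} count but fails the color and face look-aheads, while samples 8 to 10 pass the look-aheads but fail the count, so only the first six texts remain.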
 

Author Comment

by:lucavilla
ID: 37770017
pfrancois: XPath seems nice here, but I'd like to test it on the fly, as I do with RegexBuddy or web regex testers. Is there an XPath tester on the web where I can just paste my HTML and my XPath and press a button to see the results?

robert_schutt: great, you arrived at the same solution I did, whose only "defect" is that each of the 3 attributes is written a second time. Any idea how to avoid repeating them (without losing efficiency)?
 
LVL 10

Expert Comment

by:pfrancois
ID: 37770347
A webpage where you paste your html code and your xpath, with a submit button, is a great idea, and I am ready to implement it in 10 minutes, but the only server I have with PHP support is dedicated to liturgy (http://www.romanliturgy.org), not the best place to publish such kind of things, however...

My environment is Linux and I love working in command-line mode, which gives me higher productivity. What I usually do in that environment is:
wget http://some.website.com/some/path/to/some/page.html
xpath --html '//some/xpath/expression()' page.html

and I see the result. Here, xpath is a 30-line PHP script I wrote.
 
LVL 10

Expert Comment

by:pfrancois
ID: 37770523
I found a lot of online XPath checkers but, so far, none of them works with HTML as input, only XML.

But I found THE solution in case you are working with Firefox: it is the addon "XPath Checker", that rocks. It accepts HTML input.

See: https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 37771459
I think I misunderstood you when you said "avoid repeating them": I was trying to solve it differently, but you probably just meant not using them twice in the regex.

Looking back, repeating them is not necessary at all. See this more generic version, which just checks that there are 3 attributes, since the look-aheads already make sure the right ones are present:

re.Pattern = "<\s*font(?=[^>]*\ssize\s*=\s*5[\s>])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:\w+\s*=\s*(?:""\w+""|'\w+'|\d+))){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"
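
A sketch of this generic variant in Python's re module, again assuming re.IGNORECASE stands in for the VBScript IgnoreCase setting; GENERIC and extract are hypothetical names. It is only meant to show that counting any three name=value attributes, combined with the three look-aheads, still rejects tags with repeated or extra attributes.

```python
import re

GENERIC = re.compile(
    r"<\s*font"
    # Look-aheads: the three required attributes are present in the tag.
    r"(?=[^>]*\ssize\s*=\s*5[\s>])"
    r"""(?=[^>]*\scolor\s*=\s*["']red["'])"""
    r"""(?=[^>]*\sface\s*=\s*["']verdana["'])"""
    # Any three name=value attributes: "word", 'word' or bare digits.
    r"""(?:\s+(?:\w+\s*=\s*(?:"\w+"|'\w+'|\d+))){3}"""
    r"\s*>\s*([^<]+?)\s*<\s*/font\s*>",
    re.IGNORECASE,
)


def extract(html):
    """Return the captured text of every matching <font> element."""
    return [m.group(1) for m in GENERIC.finditer(html)]


tests = [
    '<font size=5 color="red" face="verdana">plain</font>',
    "<font face='verdana'  size = 5 color=\"red\" >  spaced </font>",
    '<font size=5 size=5 size=5>repeated</font>',
    '<font face="verdana" color="red" size=5 foobar="random">extra</font>',
    '<font face="verdana" color="red" size=5 garbagetext>valueless</font>',
]
for tag in tests:
    print(extract(tag), "<-", tag)
```

Because the counted values are limited to \w+ or digits, an attribute value cannot contain text such as size=5 that would fool a look-ahead, which is why the count of three is sufficient here.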

 
LVL 10

Expert Comment

by:pfrancois
ID: 37772049
The xpath based solution above matches the pattern only when each of the three attributes is present exactly once.
 

Author Comment

by:lucavilla
ID: 37773967
pfrancois: XPath Checker is nice, but unfortunately it wants an HTML page from the web... I don't think it's possible to give it the HTML code on the fly...

robert_schutt: thanks!
 
LVL 10

Expert Comment

by:pfrancois
ID: 37775408
If your HTML code is not on the web but in a local file, just point the browser to your local file with the URL file:///some-path/some-file.html, or open it with File > Open File... and navigate to the HTML file you want to test. XPath Checker expects your XPath to be "on-the-fly", but I suppose your HTML code is a constant, or are you really modifying your HTML on the fly as well? In the latter case, make your HTML code XML compliant and you will be able to use all the online XPath checkers you can find on the web.