Regular expression for checking attributes in an HTML tag

In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract every "randomtext" below except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>


...or for more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>


The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
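
To reproduce the problem outside a regex tester, here is a minimal Python sketch. Python is an assumption, since the question does not name a language. It runs the first look-ahead pattern above over the ten sample tags; the number appended to each "randomtext" is added only so the matches can be told apart.

```python
import re

# The first pattern from the question: three look-aheads, then "anything up to >".
PATTERN = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>'
)

SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
]

html = "\n".join(SAMPLES)
matched = PATTERN.findall(html)
print(matched)
```

The look-aheads only prove that the three attributes are present somewhere in the tag. The trailing [^>]* then accepts whatever else is there, which is why randomtext8, randomtext9 and randomtext10 are returned alongside the six wanted ones.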
lucavilla Asked:
käµfm³d 👽Commented:
What language are you using?
Pierre François (Senior consultant) Commented:
If the HTML code is valid, it is rather easy to transform it into XML. Then, instead of using a regular expression, you can select the text you want with XPath. Attribute values are compared case-sensitively, so the expression must use the same lowercase 'verdana' as the markup:
//font[(@size = '5') and (@color = 'red') and (@face = 'verdana') and (count(@*) = 3)]/text()


Example in PHP:
$htmlfile = '<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>';
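$doc = new DOMDocument();
// loadHTML() tolerates invalid markup; collect its parser warnings
// (e.g. about the duplicated size attribute) instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($htmlfile);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
// XPath string comparison is case-sensitive, so 'verdana' must be lowercase.
$nodelist = $xpath->query("//font[@size = '5' and @color = 'red' and @face = 'verdana' and count(@*) = 3]/text()");
foreach ($nodelist as $text) {
	echo $text->nodeValue, "\n";
}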
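
For readers who are not on PHP, the same "parse first, then count the attributes" idea can be sketched with Python's standard html.parser module. This is not the XPath solution itself, only an illustration of the same check. The class name FontTextExtractor and the constant REQUIRED are made up for this example, and the sketch assumes font tags are not nested.

```python
from html.parser import HTMLParser

# The exact attribute set the tag must carry (values compared case-sensitively).
REQUIRED = {"size": "5", "color": "red", "face": "verdana"}


class FontTextExtractor(HTMLParser):
    """Collect text inside <font> tags whose attributes are exactly REQUIRED."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: repeated attributes are all
        # listed, and a bare attribute such as "garbagetext" has value None.
        if tag == "font":
            self.capturing = len(attrs) == 3 and dict(attrs) == REQUIRED

    def handle_endtag(self, tag):
        if tag == "font":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.results.append(data.strip())


html = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
])

parser = FontTextExtractor()
parser.feed(html)
parser.close()
print(parser.results)
```

Checking both the length of the attribute list and the resulting dictionary rejects the tag with three size attributes as well as the tags with a fourth attribute, so only randomtext1 and randomtext6 are collected here.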

Robert Schutt (Software Engineer) Commented:
Reusing your 3 look-aheads, I made this in VBScript:

Option Explicit

Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "<\s*font(?=[^>]*\s+size\s*=\s*5[ >])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:size\s*=\s*5|color\s*=\s*[""']red[""']|face\s*=\s*[""']verdana[""'])){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"

Dim strTest
strTest = _
"<font size=5 color=""red"" face=""verdana"">randomtext1</font>" & vbCrLf & _
"<font size=5 face=""verdana"" color=""red"">randomtext2</font>" & vbCrLf & _
"<font color=""red"" size=5 face=""verdana"">randomtext3</font>" & vbCrLf & _
"<font color=""red"" face=""verdana"" size=5>randomtext4</font>" & vbCrLf & _
"<font face=""verdana"" size=5 color=""red"">randomtext5</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5>randomtext6</font>" & vbCrLf & _
"<font size=5 size=5 size=5>randomtext7</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random"">randomtext8</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random=pippo"">randomtext9</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 garbagetext>randomtext10</font>"

Dim msg, m
msg = "matches:" & vbCrLf
For Each m In re.Execute(strTest)
	msg = msg & m.SubMatches(0) & vbCrLf ' if you want the complete matched text use: m.Value
Next

MsgBox msg


What it does: it keeps your 3 look-aheads and additionally requires, with the {3} group, that the tag contains exactly 3 attributes, each being one of the three allowed ones.
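
For anyone not on VBScript, here is a rough Python port of the same idea, offered as a sketch rather than an exact translation. The names ATTRS, lookaheads and exactly_three are mine. One small deviation: the [\s>] boundary check is applied after all three look-aheads, not only after size=5.

```python
import re

# The three required attributes, written once and reused below.
ATTRS = [
    r"size\s*=\s*5",
    r"""color\s*=\s*["']red["']""",
    r"""face\s*=\s*["']verdana["']""",
]

# One look-ahead per attribute: each must appear somewhere before the closing ">".
lookaheads = "".join(r"(?=[^>]*\s" + attr + r"[\s>])" for attr in ATTRS)

# Then consume exactly three attributes, each one of the allowed alternatives.
exactly_three = r"(?:\s+(?:" + "|".join(ATTRS) + r")){3}"

PATTERN = re.compile(
    r"<\s*font" + lookaheads + exactly_three + r"\s*>\s*([^<]+?)\s*<\s*/font\s*>",
    re.IGNORECASE,
)

SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
]

html = "\n".join(SAMPLES)
matches = PATTERN.findall(html)
print(matches)
```

Building the pattern from a list means each attribute is typed only once in the source, even though the compiled regex still contains it twice. On the ten samples only randomtext1 to randomtext6 are returned.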
lucavilla (Author) Commented:
pfrancois: XPath seems nice here, but I'd like to test it on the fly like I do with RegexBuddy or web regex testers. Is there an XPath tester on the web where I just copy and paste my HTML and my XPath and press a button to see the results?

robert_schutt: great, you arrived at the same solution I did. Its only "defect" is that each of the 3 attributes is written a second time. Any idea how to avoid repeating them (without losing efficiency)?
Pierre François (Senior consultant) Commented:
A webpage where you paste your HTML code and your XPath, with a submit button, is a great idea, and I could implement it in 10 minutes. However, the only server I have with PHP support is dedicated to liturgy (http://www.romanliturgy.org), which is not the best place to publish that kind of thing...

My environment is Linux and I love working in command-line mode, which gives me higher productivity. What I usually do in that environment is:
wget http://some.website.com/some/path/to/some/page.html
xpath --html '//some/xpath/expression()' page.html

and I see the result. xpath is a 30-line PHP script I wrote.
Pierre François (Senior consultant) Commented:
I found a lot of online XPath checkers but, so far, none of them works with HTML as input, only XML.

But I found THE solution in case you are working with Firefox: it is the addon "XPath Checker", that rocks. It accepts HTML input.

See: https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
Robert Schutt (Software Engineer) Commented:
When you said "avoid repeating them", I think I misunderstood you: I was trying to solve it differently, but you probably just meant not using them twice in the regex.

Looking back, the repetition is not necessary at all. See this more generic version that just checks for 3 attributes, since the look-aheads already make sure the right ones are there:

re.Pattern = "<\s*font(?=[^>]*\ssize\s*=\s*5[\s>])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:\w+\s*=\s*(?:""\w+""|'\w+'|\d+))){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"
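
As a sketch of how this generic version could look in Python (again an assumption, since the thread's code is VBScript), the pattern can be written in verbose mode so that each part carries its own comment. The name GENERIC is mine, and the inner non-capturing group around the attribute is dropped because it changes nothing.

```python
import re

GENERIC = re.compile(
    r"""
    <\s*font
    (?=[^>]*\ssize\s*=\s*5[\s>])               # size=5 is present
    (?=[^>]*\scolor\s*=\s*["']red["'])         # color is red
    (?=[^>]*\sface\s*=\s*["']verdana["'])      # face is verdana
    (?:\s+\w+\s*=\s*(?:"\w+"|'\w+'|\d+)){3}    # exactly three name=value attributes
    \s*>\s*([^<]+?)\s*<\s*/font\s*>            # capture the text content
    """,
    re.IGNORECASE | re.VERBOSE,
)

SAMPLES = [
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
]

generic_matches = GENERIC.findall("\n".join(SAMPLES))
print(generic_matches)

# Single-quoted values are accepted too.
single_quoted = GENERIC.findall("<font face='verdana' size=5 color='red'>ok</font>")
print(single_quoted)
```

Because the look-aheads guarantee three distinct attributes and the {3} group allows only three in total, no other attribute can fit. Note that the generic attribute part accepts only values made of word characters, so it would need widening for other kinds of values.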

Pierre François (Senior consultant) Commented:
The xpath based solution above matches the pattern only when each of the three attributes is present exactly once.
lucavilla (Author) Commented:
pfrancois: XPath Checker is nice, but unfortunately it wants an HTML page from the web... I don't think it's possible to give it the HTML code on the fly...

robert_schutt: thanks!
Pierre François (Senior consultant) Commented:
If your HTML code is not on the web but in a local file, just point the browser to your local file with the URL file:///some-path/some-file.html, or open it with File > Open File... and navigate to the HTML file you want to test. XPath Checker expects your XPath to be "on the fly", but I suppose your HTML code is constant. Or are you really modifying your HTML on the fly as well? In that case, make your HTML code XML compliant and you will be able to use all the online XPath checkers you can find on the web.