  • Status: Solved

regular expression for checking attributes in an html tag

In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract "randomtext" from every line below except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or for more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
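The behaviour described above can be reproduced outside a regex tester. The following is a minimal sketch in Python, chosen only for illustration since the thread does not say which language the asker uses. It runs the more flexible pattern against the ten sample lines. The names `LOOKAHEAD_ONLY`, `SAMPLES` and `matched_texts` are invented for this sketch, and each sample text is numbered so the matches can be told apart.

```python
import re

# The asker's flexible pattern, unchanged: three look-aheads, then "[^>]*>".
LOOKAHEAD_ONLY = re.compile(
    r"<\s*font(?=[^>]*\s+size\s*=\s*5)"
    + r"""(?=[^>]*\scolor\s*=\s*["']red["'])"""
    + r"""(?=[^>]*\sface\s*=\s*["']verdana["'])"""
    + r"[^>]*>\s*([^<]+?)\s*<\s*/font\s*>"
)

# The ten sample tags from the question, with numbered text.
SAMPLES = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font size=5 face="verdana" color="red">randomtext2</font>',
    '<font color="red" size=5 face="verdana">randomtext3</font>',
    '<font color="red" face="verdana" size=5>randomtext4</font>',
    '<font face="verdana" size=5 color="red">randomtext5</font>',
    '<font face="verdana" color="red" size=5>randomtext6</font>',
    '<font size=5 size=5 size=5>randomtext7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext10</font>',
])


def matched_texts(pattern, html):
    """Return the captured text of every match of pattern in html."""
    return pattern.findall(html)


print(matched_texts(LOOKAHEAD_ONLY, SAMPLES))
```

Under these assumptions the list contains randomtext1 to randomtext6 but also randomtext8, randomtext9 and randomtext10, because the trailing `[^>]*` accepts any extra attribute. Only randomtext7 is rejected, since its tag has no color or face attribute.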
lucavilla
Asked:
2 Solutions
 
käµfm³d 👽 Commented:
What language are you using?
 
Pierre François (Senior consultant) Commented:
If the HTML code is valid, it is rather easy to transform it into XML. Then, instead of using a regular expression, you can select the text you want with XPath. In that case, the XPath expression should be:
//font[@size = '5' and @color = 'red' and @face = 'verdana' and count(@*) = 3]/text()

Example in PHP:
$htmlfile = '<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>';
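// Collect parser warnings (e.g. for the repeated size attribute) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);
// XPath string comparison is case-sensitive, so 'verdana' must be spelled as in the source.
$nodelist = $xpath->query("//font[@size = '5' and @color = 'red' and @face = 'verdana' and count(@*) = 3]/text()");
foreach ($nodelist as $text) {
	echo $text->nodeValue, "\n"; // one extracted text per line
}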
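The same idea (parse the markup first, then test the attributes) can be sketched without XPath. The following Python example is an illustration only and is not part of the PHP answer above. It uses the standard-library `html.parser` module, and the names `WANTED`, `FontTextExtractor`, `extract_texts` and `SAMPLE_HTML` are made up for this sketch. A FONT tag is accepted only if it carries exactly three attributes whose names and values equal the required set.

```python
from html.parser import HTMLParser

# Required attributes; HTMLParser lower-cases names and strips the quotes.
WANTED = {"size": "5", "color": "red", "face": "verdana"}


class FontTextExtractor(HTMLParser):
    """Collect the text of <font> tags carrying exactly the WANTED attributes."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so repeated attributes
        # stay visible: three pairs that collapse to WANTED means each
        # required attribute is present exactly once.
        self.capturing = (
            tag == "font" and len(attrs) == 3 and dict(attrs) == WANTED
        )

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.results.append(data.strip())


def extract_texts(html):
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE_HTML = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font face="verdana" color="red" size=5>randomtext2</font>',
    '<font size=5 size=5 size=5>randomtext3</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext4</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext5</font>',
])

print(extract_texts(SAMPLE_HTML))
```

With this sample only randomtext1 and randomtext2 are returned. A valueless attribute such as `garbagetext` arrives as a pair with value `None`, so it counts as a fourth attribute and the tag is rejected.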

 
Robert Schutt (Software Engineer) Commented:
Reusing your 3 look-aheads, I wrote this in VBScript:

Option Explicit

Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = "<\s*font(?=[^>]*\s+size\s*=\s*5[ >])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:size\s*=\s*5|color\s*=\s*[""']red[""']|face\s*=\s*[""']verdana[""'])){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"

Dim strTest
strTest = _
"<font size=5 color=""red"" face=""verdana"">randomtext1</font>" & vbCrLf & _
"<font size=5 face=""verdana"" color=""red"">randomtext2</font>" & vbCrLf & _
"<font color=""red"" size=5 face=""verdana"">randomtext3</font>" & vbCrLf & _
"<font color=""red"" face=""verdana"" size=5>randomtext4</font>" & vbCrLf & _
"<font face=""verdana"" size=5 color=""red"">randomtext5</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5>randomtext6</font>" & vbCrLf & _
"<font size=5 size=5 size=5>randomtext7</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random"">randomtext8</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 foobar=""random=pippo"">randomtext9</font>" & vbCrLf & _
"<font face=""verdana"" color=""red"" size=5 garbagetext>randomtext10</font>"

Dim msg, m
msg = "matches:" & vbCrLf
For Each m In re.Execute(strTest)
	msg = msg & m.SubMatches(0) & vbCrLf ' if you want the complete matched text use: m.Value
Next

MsgBox msg

What it does: it keeps your 3 look-aheads and adds the {3} part, which checks that the tag contains exactly 3 attributes, each taken from the allowed set.
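For readers without a VBScript host, here is a hedged Python transcription of the pattern above. It is a sketch, not the answerer's code: the names `ALLOWED_ATTR`, `EXACT_THREE` and `SAMPLE_HTML` are invented here, `re.IGNORECASE` stands in for `re.IgnoreCase = True`, and the sample set is shortened to five tags.

```python
import re

# One allowed attribute, written as in the VBScript pattern.
ALLOWED_ATTR = r"""(?:size\s*=\s*5|color\s*=\s*["']red["']|face\s*=\s*["']verdana["'])"""

EXACT_THREE = re.compile(
    r"<\s*font"
    # The look-aheads demand that each required attribute occurs somewhere.
    + r"(?=[^>]*\s+size\s*=\s*5[ >])"
    + r"""(?=[^>]*\scolor\s*=\s*["']red["'])"""
    + r"""(?=[^>]*\sface\s*=\s*["']verdana["'])"""
    # Exactly three attribute slots, each filled from the allowed set.
    + r"(?:\s+" + ALLOWED_ATTR + r"){3}"
    + r"\s*>\s*([^<]+?)\s*<\s*/font\s*>",
    re.IGNORECASE,
)

SAMPLE_HTML = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font face="verdana" color="red" size=5>randomtext2</font>',
    '<font size=5 size=5 size=5>randomtext3</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext4</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext5</font>',
])

print(EXACT_THREE.findall(SAMPLE_HTML))
```

With this sample only randomtext1 and randomtext2 are captured. The two parts work together: the look-aheads require three different attributes, and the `{3}` group leaves room for only three, so each required attribute must appear exactly once and nothing else fits before the closing `>`.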

 
lucavilla (Author) Commented:
pfrancois: XPath seems nice here, but I'd like to test it on the fly like I do with RegexBuddy or web regex testers. Is there an XPath tester on the web where I just copy and paste my HTML and my XPath and press a button to see the results?

robert_schutt: great, you arrived at the same solution I did. Its only "defect" is that each of the 3 attributes is written twice in the pattern. Any idea how to avoid repeating them (without losing efficiency)?
 
Pierre François (Senior consultant) Commented:
A web page where you paste your HTML code and your XPath, with a submit button, is a great idea, and I could implement it in 10 minutes. The only server I have with PHP support is dedicated to liturgy (http://www.romanliturgy.org), though, which is not the best place to publish that kind of thing, however...

My environment is Linux and I love working on the command line, which gives me higher productivity. What I usually do in that environment is:
wget http://some.website.com/some/path/to/some/page.html
xpath --html '//some/xpath/expression()' page.html

and I see the result. xpath is a 30-line PHP script I wrote.
 
Pierre François (Senior consultant) Commented:
I found a lot of online XPath checkers but, so far, none of them works with HTML as input, only XML.

But I found THE solution in case you are working with Firefox: it is the add-on "XPath Checker", which rocks. It accepts HTML input.

See: https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
 
Robert Schutt (Software Engineer) Commented:
When you said "avoid repeating them", I think I misunderstood you: I was trying to solve it differently, but you probably just meant not using them twice in the regex.

Looking back, that's not necessary at all. See this more generic version, which just checks that there are 3 attributes, since the look-aheads already make sure the right ones are there:

re.Pattern = "<\s*font(?=[^>]*\ssize\s*=\s*5[\s>])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:\w+\s*=\s*(?:""\w+""|'\w+'|\d+))){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"
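The generic version suggests that the pattern can be generated from a list of required attributes instead of being typed by hand. The following Python sketch explores that idea under stated assumptions: `build_font_pattern`, `REQUIRED` and `SAMPLE_HTML` are hypothetical names, the caller supplies each value as a regex fragment, and, unlike the pattern above, the sketch appends `[\s>]` after every value rather than only after the size.

```python
import re


def build_font_pattern(required):
    """Build a FONT-tag pattern; required maps name -> regex for its value."""
    # One look-ahead per required attribute, in any order.
    lookaheads = "".join(
        rf"(?=[^>]*\s{re.escape(name)}\s*=\s*{value}[\s>])"
        for name, value in required.items()
    )
    # A generic attribute: name = "word", 'word' or digits.
    generic_attr = r"""\s+\w+\s*=\s*(?:"\w+"|'\w+'|\d+)"""
    # As many generic slots as there are required attributes.
    return re.compile(
        rf"<\s*font{lookaheads}(?:{generic_attr}){{{len(required)}}}"
        r"\s*>\s*([^<]+?)\s*<\s*/font\s*>",
        re.IGNORECASE,
    )


REQUIRED = {
    "size": r"5",
    "color": r"""["']red["']""",
    "face": r"""["']verdana["']""",
}

SAMPLE_HTML = "\n".join([
    '<font size=5 color="red" face="verdana">randomtext1</font>',
    '<font face="verdana" color="red" size=5>randomtext2</font>',
    '<font size=5 size=5 size=5>randomtext3</font>',
    '<font face="verdana" color="red" size=5 foobar="random">randomtext4</font>',
    '<font face="verdana" color="red" size=5 garbagetext>randomtext5</font>',
])

print(build_font_pattern(REQUIRED).findall(SAMPLE_HTML))
```

With this sample only randomtext1 and randomtext2 are captured. Each attribute is now written once, in `REQUIRED`. The generic slot only accepts values made of word characters, which is enough here because the look-aheads already pin down all three values.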

 
Pierre François (Senior consultant) Commented:
The XPath-based solution above matches the pattern only when each of the three attributes is present exactly once.
 
lucavilla (Author) Commented:
pfrancois: XPath Checker is nice, but unfortunately it wants an HTML page from the web. I don't think it's possible to give it HTML code on the fly.

robert_schutt: thanks!
 
Pierre François (Senior consultant) Commented:
If your HTML code is not on the web but in a local file, just point the browser to your local file with the URL file:///some-path/some-file.html, or open it with File > Open File... and navigate to the HTML file you want to test. XPath Checker expects your XPath to be "on the fly", but I suppose your HTML code is constant. Or are you really modifying your HTML on the fly as well? In the latter case, make your HTML code XML-compliant and you will be able to use all the online XPath checkers you can find on the web.
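As an illustration of the XML-compliant route, here is a small Python sketch that uses only the standard library. It is an assumption-laden example, not one of the tools mentioned above: `XML_DOC` and `exact_font_texts` are invented names, and the sample assumes the markup has already been made well-formed (quoted attribute values, a single root element). `xml.etree.ElementTree` supports only a subset of XPath and has no `count()`, so the attribute count is checked in Python instead.

```python
import xml.etree.ElementTree as ET

# XML-compliant markup: every attribute value is quoted.
XML_DOC = """<root>
<font size="5" color="red" face="verdana">randomtext1</font>
<font face="verdana" color="red" size="5">randomtext2</font>
<font face="verdana" color="red" size="5" foobar="random">randomtext3</font>
</root>"""


def exact_font_texts(xml_text):
    """Return the text of font elements with exactly the three attributes."""
    root = ET.fromstring(xml_text)
    # Chained predicates select elements that have all three values.
    candidates = root.findall(
        ".//font[@size='5'][@color='red'][@face='verdana']"
    )
    # Stand-in for count(@*) = 3, which ElementTree's XPath subset lacks.
    return [el.text for el in candidates if len(el.attrib) == 3]


print(exact_font_texts(XML_DOC))
```

With this sample the function returns randomtext1 and randomtext2. A tag with a repeated attribute or an attribute without a value is not well-formed XML, so the parser rejects such input outright rather than skipping that one tag.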