lucavilla

regular expression for checking attributes in an html tag

In an HTML source I need to extract any simple text inside a FONT tag with exactly (no more, no less) these 3 attributes, in any order: size=5, color="red", face="verdana".

For example, the regular expression must extract "randomtext" from all of the following tags except the last four.

<font size=5 color="red" face="verdana">randomtext</font>
<font size=5 face="verdana" color="red">randomtext</font>
<font color="red" size=5 face="verdana">randomtext</font>
<font color="red" face="verdana" size=5>randomtext</font>
<font face="verdana" size=5 color="red">randomtext</font>
<font face="verdana" color="red" size=5>randomtext</font>
<font size=5 size=5 size=5>randomtext</font>
<font face="verdana" color="red" size=5 foobar="random">randomtext</font>
<font face="verdana" color="red" size=5 foobar="random=pippo">randomtext</font>
<font face="verdana" color="red" size=5 garbagetext>randomtext</font>

I solved the "in any order" problem by using 3 look-aheads:
<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")[^>]*>([^<]+)</font>

...or for more HTML flexibility:
<\s*font(?=[^>]*\s+size\s*=\s*5)(?=[^>]*\scolor\s*=\s*["']red["'])(?=[^>]*\sface\s*=\s*["']verdana["'])[^>]*>\s*([^<]+?)\s*<\s*/font\s*>

The problem is that it also matches the last three. How can I exclude those matches? (Obviously in a general and reasonably short/efficient way, i.e. without codifying all possible positive combinations and without using literal negative expressions that work only on my examples.)
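
To make the problem concrete, here is a minimal Python sketch that runs the first look-ahead pattern above against the ten sample tags. Python's `re` module is an assumption (the thread does not say which regex engine is in use), and the sample text is numbered `text1` to `text10` only so the matches can be told apart.

```python
import re

# The first look-ahead pattern from the question, copied unchanged.
PATTERN = re.compile(
    r'<font(?=[^>]* size=5)(?=[^>]* color="red")(?=[^>]* face="verdana")'
    r'[^>]*>([^<]+)</font>'
)

# The ten sample tags from the question, with numbered text.
SAMPLES = [
    '<font size=5 color="red" face="verdana">text1</font>',
    '<font size=5 face="verdana" color="red">text2</font>',
    '<font color="red" size=5 face="verdana">text3</font>',
    '<font color="red" face="verdana" size=5>text4</font>',
    '<font face="verdana" size=5 color="red">text5</font>',
    '<font face="verdana" color="red" size=5>text6</font>',
    '<font size=5 size=5 size=5>text7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">text8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">text9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>text10</font>',
]

# 1-based positions of the samples that the pattern accepts.
matched = [i + 1 for i, s in enumerate(SAMPLES) if PATTERN.search(s)]
print(matched)  # samples 8, 9 and 10 are accepted as well as 1 to 6
```

The look-aheads only prove that the three attributes are present somewhere in the tag; the trailing `[^>]*` then swallows anything else, which is why the tags with an extra attribute still match.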
kaufmed

What language are you using?
SOLUTION
Pierre François

This solution is only available to members.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
lucavilla

ASKER

pfrancois: XPath seems nice here, but I'd like to test it on the fly like I do with RegexBuddy or web regex testers. Is there an XPath tester on the web where I just copy and paste my HTML and my XPath and press a button to see the results?

robert_schutt: great, you arrived at the same solution I arrived at, which has the only "defect" of repeating the 3 attributes a second time. Any idea how to avoid repeating them (without losing efficiency)?
A web page where you paste your HTML code and your XPath, with a submit button, is a great idea, and I am ready to implement it in 10 minutes, but the only server I have with PHP support is dedicated to liturgy (http://www.romanliturgy.org), which is not the best place to publish this kind of thing. However...

My environment is Linux and I love to work on the command line, which gives me higher productivity. What I usually do in that environment is:
wget http://some.website.com/some/path/to/some/page.html
xpath --html '//some/xpath/expression()' page.html

and I see the result. xpath is a 30-line PHP script I wrote.
I found a lot of online XPath checkers but, so far, none of them works with HTML as input, only XML.

But I found THE solution in case you are working with Firefox: it is the add-on "XPath Checker", which rocks. It accepts HTML input.

See: https://addons.mozilla.org/en-US/firefox/addon/xpath-checker/
When you said "avoid repeating them", I think I misunderstood you. I was trying to solve it differently, but you probably just meant not using them twice in the regex.

And looking back, that's not necessary at all. See this more generic version, which just checks that there are exactly 3 attributes, since the look-aheads already make sure the right ones are there:

re.Pattern = "<\s*font(?=[^>]*\ssize\s*=\s*5[\s>])(?=[^>]*\scolor\s*=\s*[""']red[""'])(?=[^>]*\sface\s*=\s*[""']verdana[""'])(?:\s+(?:\w+\s*=\s*(?:""\w+""|'\w+'|\d+))){3}\s*>\s*([^<]+?)\s*<\s*/font\s*>"
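
As a quick check of the pattern above, here is a hedged Python sketch that ports it and runs it against the ten sample tags from the question. The port to Python's `re` module is an assumption (the original assigns to `re.Pattern` with doubled quotes, which are written as single `"` here), and the numbered sample text is only there to tell the results apart.

```python
import re

# Port of the pattern above: three look-aheads require the right attributes,
# then "{3}" allows exactly three name=value attributes before the ">".
STRICT = re.compile(
    r'''<\s*font'''
    r'''(?=[^>]*\ssize\s*=\s*5[\s>])'''
    r'''(?=[^>]*\scolor\s*=\s*["']red["'])'''
    r'''(?=[^>]*\sface\s*=\s*["']verdana["'])'''
    r'''(?:\s+(?:\w+\s*=\s*(?:"\w+"|'\w+'|\d+))){3}'''
    r'''\s*>\s*([^<]+?)\s*<\s*/font\s*>'''
)

# The ten sample tags from the question, with numbered text.
SAMPLES = [
    '<font size=5 color="red" face="verdana">text1</font>',
    '<font size=5 face="verdana" color="red">text2</font>',
    '<font color="red" size=5 face="verdana">text3</font>',
    '<font color="red" face="verdana" size=5>text4</font>',
    '<font face="verdana" size=5 color="red">text5</font>',
    '<font face="verdana" color="red" size=5>text6</font>',
    '<font size=5 size=5 size=5>text7</font>',
    '<font face="verdana" color="red" size=5 foobar="random">text8</font>',
    '<font face="verdana" color="red" size=5 foobar="random=pippo">text9</font>',
    '<font face="verdana" color="red" size=5 garbagetext>text10</font>',
]

# Collect the captured text of every sample that matches.
extracted = [m.group(1) for m in map(STRICT.search, SAMPLES) if m]
print(extracted)  # only text1 to text6 are extracted
```

Because three distinct attributes are required and only three are allowed, each one can appear only once, so the attribute names no longer have to be spelled out a second time.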

The XPath-based solution above matches the pattern only when each of the three attributes is present exactly once.
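
The XPath expression itself is not reproduced in this thread, so the following is not that solution. It is a rough parser-based sketch of the same "each attribute exactly once, and nothing else" check, using Python's standard `html.parser` module instead of XPath. The names `FontTextExtractor`, `REQUIRED` and `extract` are made up for the illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# The three attributes that must each appear exactly once, with nothing else.
REQUIRED = Counter([("size", "5"), ("color", "red"), ("face", "verdana")])


class FontTextExtractor(HTMLParser):
    """Collects the text that directly follows a FONT tag with exactly REQUIRED."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; duplicates are kept, and an
        # attribute without a value arrives as (name, None).
        self.capturing = tag == "font" and Counter(attrs) == REQUIRED

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.results.append(data.strip())

    def handle_endtag(self, tag):
        # Any closing tag ends the capture.
        self.capturing = False


def extract(html):
    """Return the text of every FONT tag in html that passes the check."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


print(extract('<font color="red" size=5 face="verdana">kept</font>'
              '<font color="red" size=5 face="verdana" foobar="x">dropped</font>'))
```

Comparing `Counter` objects ignores attribute order but not repetition, so a tag such as `<font size=5 size=5 size=5>` is rejected without any special case.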
pfrancois: XPath Checker is nice, but unfortunately it wants an HTML page from the web... I don't think it's possible to give it the HTML code on the fly...

robert_schutt: thanks!
If your HTML code is not on the web but in a local file, just point the browser to your local file with the URL file:///some-path/some-file.html, or open it with File > Open File... and navigate to the HTML file you want to test. XPath Checker expects your XPath to be "on the fly", but I suppose your HTML code is constant. Or are you really modifying your HTML on the fly as well? In that case, make your HTML code XML-compliant and you will be able to use all the online XPath checkers you can find on the web.