Solved

Regex to match certain HTML attributes

Posted on 2006-11-27
Last Modified: 2013-11-19
Hi,

Regexes are sometimes quite challenging. I've been banging my head on this table for hours now and want to stop.  Please help me get rid of this headache!

I need to remove all style and class attributes in an HTML file whilst leaving all other attributes untouched.  I just need the regex for this - I've written a generic filter that uses the Regex,  but I just can't seem to get this one to work (I'm failing to get the regex to ignore other attributes between the tag and the style=...).

Given the following HTML (which came from pasting from the truly awful MS Werd - I really couldn't invent this rubbish if I tried!):

<H1 style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah blah<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></font></H1>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: justify"><?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:PlaceName w:st="on"><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">blah blah</SPAN></st1:PlaceName><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">

I need just the Regex and the Replacement strings.  It should:
 - remove (match) style and class attributes
 - work with and without quotes - note that 'Century Gothic' is wrapped with single quotes
 - assume the attribute quotes are "double" (or missing)
 - the attributes must be allowed to be in *any* order in the tag
 - all other attributes and tags must be left in situ

I've other regexes that clean the rest of the vomit - at least ten of them!

For a bonus,  if anyone has the name of the idiot who created the Werd HTML engine.....  I'd just love to write to his/her mother and tell her how her child is messing with people's heads:-)

Cheers,
Question by:GlennGilbert
 
LVL 10

Assisted Solution

by:xanius
xanius earned 100 total points
ID: 18027043
search:

(perl style regex):

\b(class|style)=("[^"]+?"|\S+)\s*

replace: nothing
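
A minimal sketch of how this search and replace could be applied, assuming Python's `re` module stands in for whatever engine the filter actually uses (the helper name `strip_attrs` is made up for illustration). The pattern is unchanged, and "replace: nothing" becomes an empty replacement string. Note that the pattern matches anywhere in the text, not only inside a tag.

```python
import re

# The pattern from the post above, unchanged.
ATTR = re.compile(r'\b(class|style)=("[^"]+?"|\S+)\s*')

def strip_attrs(html: str) -> str:
    """Delete every class=... or style=... match (hypothetical helper)."""
    return ATTR.sub("", html)

print(strip_attrs('<P class=MsoNormal style="MARGIN: 0cm" align=left>'))
# <P align=left>
print(strip_attrs('<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>'))
# <SPAN >&nbsp; </SPAN>
print(strip_attrs('plain text with class=foo in it'))
# plain text with in it
```

The last call shows the pattern also firing on ordinary text, because nothing in it ties the match to a tag.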

Xanius
 
LVL 3

Author Comment

by:GlennGilbert
ID: 18027052
Cheers!

Could you add the angle brackets to make it specific for an HTML tag please.

 
LVL 3

Author Comment

by:GlennGilbert
ID: 18027079
I always seem to mess up when adding the more specific characters:

Yours:
  \b(class|style)=("[^"]+?"|\S+)\s*


Mine:
  <.*\b(class|style)=("[^"]+?"|\S+)\s*.*>
  Which,  needless to say,  doesn't work as it matches everything.
 
LVL 84

Accepted Solution

by:ozo
ozo earned 400 total points
ID: 18027153
$_=q<
<H1 style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah blah<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></font></H1>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: justify"><?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:PlaceName w:st="on"><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">blah blah</SPAN></st1:PlaceName><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">
>;

s/(<[^<>]*?)\b((class|style)=("[^"]+?"|\S+)\s*)+([^<>]*>)/$1$5/g;

print;
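
For readers not using Perl, here is a rough Python equivalent of the substitution above, as a sketch only. The pattern is copied as posted; the replacement `$1$5` is written `\1\5` in Python, and the helper name `strip_in_tags` is invented for this example.

```python
import re

# Same pattern as the Perl s/// above.
# Group 1 = start of the tag, group 5 = rest of the tag up to '>'.
TAG_ATTRS = re.compile(
    r'(<[^<>]*?)\b((class|style)=("[^"]+?"|\S+)\s*)+([^<>]*>)'
)

def strip_in_tags(html: str) -> str:
    """Drop a run of adjacent class/style attributes inside each tag."""
    return TAG_ATTRS.sub(r'\1\5', html)

print(strip_in_tags('<H1 style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah blah</FONT></H1>'))
# <H1 ><FONT color=#000000>blah blah</FONT></H1>
print(strip_in_tags('<P class=MsoNormal style="MARGIN: 0cm">x</P>'))
# <P >x</P>
print(strip_in_tags('class=foo <B>x</B>'))
# class=foo <B>x</B>
```

Because `[^<>]` cannot cross a `<` or `>`, text outside tags is left alone. The space before the removed attribute stays in group 1, which is why the output reads `<H1 >`.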
 
LVL 3

Author Comment

by:GlennGilbert
ID: 18027624
Fantastic - thank you!

My learning point: match only characters that aren't a tag's opening < or closing >, and do so lazily (as few characters as possible) using *?

I made a small change to capture the leading space(s) before the style/class attribute, so it looks like this (the syntax is that of my underlying filter, which this will be poked into):
<Filter>
(<[^<>]*?)\s*\b((class|style)=("[^"]+?"|\S+))+([^<>]*>)
</Filter>
<Replacement>
$1$5
</Replacement>


Input:
<H1 style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah blah<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></font></H1>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: justify"><?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:PlaceName w:st="on"><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">blah blah</SPAN></st1:PlaceName><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">


Output:
<H1><FONT color=#000000>blah blah<SPAN>&nbsp; </SPAN></font></H1>
<P style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: justify"><?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:PlaceName w:st="on"><SPAN>blah blah</SPAN></st1:PlaceName><SPAN>
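
The output above can be reproduced with a short Python sketch (Python is an assumption here, since the filter's own engine isn't named, and the helper names are hypothetical). With `\s*` moved outside the repeated group, only one attribute is removed per match, so the P tag keeps its style. Re-applying the substitution until nothing changes is one possible workaround; it is not something proposed in the thread.

```python
import re

# The <Filter> pattern as posted; the replacement $1$5 is \1\5 in Python.
FILTER = re.compile(
    r'(<[^<>]*?)\s*\b((class|style)=("[^"]+?"|\S+))+([^<>]*>)'
)

def strip_once(html: str) -> str:
    """Apply the filter a single time (global replace)."""
    return FILTER.sub(r'\1\5', html)

def strip_until_stable(html: str) -> str:
    """Assumed workaround: repeat the filter until the text stops changing."""
    while True:
        cleaned = strip_once(html)
        if cleaned == html:
            return cleaned
        html = cleaned

tag = '<P class=MsoNormal style="MARGIN: 0cm">'
print(strip_once(tag))          # <P style="MARGIN: 0cm">
print(strip_until_stable(tag))  # <P>
```

Each pass removes at least one attribute or changes nothing, so the loop always terminates.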


And I'll just get on with clearing out the rest of the vomit - like the xml prologue, smart tag stuff, font tags, empty styles... the list seems to go on forever.

Cheers,
Glenn
 
LVL 10

Expert Comment

by:xanius
ID: 18027665
Glenn,

> Which,  needless to say,  doesn't work as it matches everything.

When such things happen, you generally have a regex that is too greedy. As in the examples above, as a rule of thumb, put a '?' after the '*' and '+' quantifiers.
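
A small illustration of greedy versus lazy matching, sketched in Python's `re` module (the sample strings are made up). It also shows why the `[^<>]` character class matters: a lazy `.*?` will still run across tag boundaries if that is the only way to reach a match.

```python
import re

text = 'a <B>bold</B> b'

# Greedy: .* grabs as much as it can, so one match spans both tags.
print(re.findall(r'<.*>', text))      # ['<B>bold</B>']

# Lazy: .*? stops at the first '>' that lets the match succeed.
print(re.findall(r'<.*?>', text))     # ['<B>', '</B>']

html = '<B>x</B><P style="c">'

# Lazy alone is not enough: '.' may still step over '>' and '<'.
print(re.search(r'<.*?style=', html).group(0))       # <B>x</B><P style=

# Excluding '<' and '>' keeps the match inside a single tag.
print(re.search(r'<[^<>]*?style=', html).group(0))   # <P style=
```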

 
LVL 10

Expert Comment

by:xanius
ID: 18027733
(Sorry, I was crossposting)

<Filter>
(<[^<>]*?)\s*\b((class|style)=("[^"]+?"|\S+))+([^<>]*>)
</Filter>
<Replacement>
$1$5
</Replacement>

works perfectly as long as there are no other attributes between 'class' and 'style'. If this is true, you're OK. If not, you should rather use two regexes: the first to identify the tag and the second to work on it.

Regex1:
<Filter>
(<[^<>]*?(?:class|style)=[^<>]*?>)
</Filter>
Put "$1" into some variable, then apply the following regex to it with the global switch:

Regex2:
<Filter>
\b(class|style)=("[^"]+?"|\S+)\s*
</Filter>
<Replacement>
</Replacement>
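
The two-regex idea can be sketched in Python with a replacement callback: Regex1 finds each tag that contains class= or style=, and Regex2 is then applied globally to just that tag. This is an illustration under assumptions, not the filter's real mechanism. The helper names are invented, and Regex2 carries one deliberate tweak: `[^\s>]+` instead of `\S+`, so that an unquoted value at the very end of a tag does not swallow the closing `>`.

```python
import re

# Regex1 as posted: a whole tag containing class= or style=.
TAG = re.compile(r'(<[^<>]*?(?:class|style)=[^<>]*?>)')

# Regex2 with an assumed tweak: [^\s>]+ replaces \S+ (see note above).
ATTR = re.compile(r'\b(class|style)=("[^"]+?"|[^\s>]+)\s*')

def clean_tag(match: re.Match) -> str:
    """Second pass: remove every class/style attribute from one tag."""
    return ATTR.sub('', match.group(1))

def strip_attrs(html: str) -> str:
    """First pass: hand each matching tag to clean_tag."""
    return TAG.sub(clean_tag, html)

print(strip_attrs('<P class=MsoNormal align=left style="MARGIN: 0cm">text</P>'))
# <P align=left >text</P>
print(strip_attrs('<P class=MsoNormal>'))
# <P >
```

The attributes can now sit in any order within the tag, at the cost of an occasional leftover space before `>`.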

Cheers
Xanius