jfowlie


Quote unquoted HTML attributes (regular expression)

I need a regular expression that looks at every tag in a string of HTML.
If any attribute value in a tag is not surrounded by quotes (' or "), the expression should add them.

E.g.
$text = '<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT face=Times New Roman size=3>my text here</FONT></P>';
$text = preg_replace(this is where I need help!);

and $text becomes....

$text = '<P class="MsoNormal" style="MARGIN: 0cm 0cm 0pt"><FONT face="Times New Roman" size="3">my text here</FONT></P>';
shmert

I'd recommend using the PECL tidy extension to clean up the HTML.
http://pecl.php.net/package/tidy
It does good stuff.  You'll have a hard time doing this with a regular expression, because you need to handle:
- nested quotes
- backslash-escaped quotes
- quoted angle brackets
and other weirdness.  Tidy will do this and a lot more for you.
jfowlie

ASKER

Sorry, I should have said that I'm already using the Tidy extension. It's good, but it doesn't do everything I need, which is why I was trying to run a regular expression on the "untidy" HTML.

Basically the problem I'm having concerns a tag such as the following, where the font face has not been put in quotes
e.g.
<font face=Times New Roman>my text here</font>
I'm running Tidy as follows:

tidy_parse_string($HTMLtext);

tidy_setopt( 'bare' , true );
tidy_setopt( 'drop-proprietary-attributes' , true);
tidy_setopt( 'show-body-only' , true);
tidy_setopt( 'word-2000' , true );
tidy_setopt( 'fix-backslash' ,true);
tidy_setopt( 'logical-emphasis' ,true);
tidy_setopt( 'lower-literals' ,true);
tidy_setopt( 'drop-empty-paras', true);

tidy_clean_repair();

$HTMLtext= tidy_get_output();

And using the settings above, Tidy does the following:
<FONT face=times roman="" new="">my text here</FONT>

I don't understand why it does this when I have 'drop-proprietary-attributes' enabled.

I do not want it to drop the font tags.
<font face=Times New Roman>my text here</font>
This seems really problematic to me.  How do you know where the attribute value ends and the next attribute name starts?

For example:
<option value=my value selected class=x>

"selected" is a standalone attribute!  How can a parser differentiate "my value" from "selected"?  This might be something that can't be reasonably automated.  If the attribute values didn't contain spaces, you could just close the quote when you come to a space, and the next character that isn't an angle bracket would be the start of a new attribute.
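
The simpler rule mentioned above (close the quote at the first space) could be sketched like this. It is only an illustration under the assumption that values never contain whitespace; the function name quote_spaceless_attributes and the patterns are made up for this example.

```php
<?php
// Hypothetical helper: quote bare attribute values, assuming a value
// never contains whitespace, so it ends at the first space.
function quote_spaceless_attributes(string $html): string
{
    // Visit each opening tag; assumes no '>' inside attribute values.
    return preg_replace_callback('~<([a-z][a-z0-9]*)([^>]*)>~i', function (array $tag): string {
        $attrs = preg_replace_callback(
            // name= followed by a quoted value (left alone) or a bare run of non-space characters
            '~([\w:-]+)=(?:"[^"]*"|\'[^\']*\'|([^\s"\']+))~',
            function (array $m): string {
                // $m[2] only exists when the bare-value branch matched.
                return isset($m[2]) ? $m[1] . '="' . $m[2] . '"' : $m[0];
            },
            $tag[2]
        ) ?? $tag[2];
        return '<' . $tag[1] . $attrs . '>';
    }, $html) ?? $html;
}
```

On the option example this gives <option value="my" value selected class="x">, which shows the ambiguity: the rest of "my value" is left behind as stray attributes.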
jfowlie

ASKER

You're right - the example you gave is a problem.

The following idea is not perfect and it will corrupt your example, but I think it would meet my requirements.  
How would I write something that put everything after an opening attribute= in quotes until it either
1) reached the end of the tag
or
2) came across another attribute (that uses =)

At the end of the day, I have HTML content that is being pasted from Word (which Tidy does a good job of handling) or from other web pages (which are the main culprit for the issue I'm having). The good news is that FORM elements are never part of this content.

Alternatively, can you suggest some better options for me to use with Tidy?
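
The rule described above (quote everything after attribute= until the end of the tag or the next name=) can be sketched roughly as follows. This is only an illustration, not the solution posted in this thread: the function name quote_bare_attributes and both patterns are assumptions, and the sketch assumes no '>' appears inside an attribute value.

```php
<?php
// Hypothetical helper: quote bare attribute values, letting a value run
// until the next "name=" or the end of the tag.
function quote_bare_attributes(string $html): string
{
    // Visit each opening tag; closing tags and comments do not match.
    return preg_replace_callback('~<([a-z][a-z0-9]*)([^>]*)>~i', function (array $tag): string {
        $attrs = preg_replace_callback(
            // name= followed by either an already-quoted value (left alone)
            // or a bare value that stops before " name=", " /" or the end.
            '~([\w:-]+)=(?:"[^"]*"|\'[^\']*\'|(.*?)(?=\s+[\w:-]+=|\s+/$|\s*$))~s',
            function (array $m): string {
                // $m[2] only exists when the bare-value branch matched.
                if (!isset($m[2])) {
                    return $m[0];
                }
                // Escape stray double quotes so the new quoting stays valid.
                return $m[1] . '="' . str_replace('"', '&quot;', $m[2]) . '"';
            },
            $tag[2]
        ) ?? $tag[2];
        return '<' . $tag[1] . $attrs . '>';
    }, $html) ?? $html;
}

$text = '<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT face=Times New Roman size=3>my text here</FONT></P>';
echo quote_bare_attributes($text), "\n";
```

On the string from the question this produces the desired result. On <option value=my value selected class=x> it produces value="my value selected", which is the corruption already acknowledged above, so it is only suitable when standalone attributes are not expected in the content.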
ASKER CERTIFIED SOLUTION
shmert

SOLUTION
skullnobrains

$nb=count($chunks); // this line is not necessary

a few $chunks are typed as $chunk: these are typos
thanks a lot for the forced accept.
work sometimes gets a reward, it seems :)

just a tiny note, as I looked at my code for some time before I figured out what it does... '"' is actually a double quote enclosed between two single quotes. may help others some day...
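
A small sketch of that quoting detail, for anyone puzzled by it. The variable names are made up for illustration.

```php
<?php
// A double-quote character written two ways in PHP.
$single = '"';   // double quote inside a single-quoted string
$double = "\"";  // the same character, escaped inside a double-quoted string

var_dump($single === $double); // bool(true)
```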

see you all on the threads sometime.