Solved

Quote unquoted HTML attributes (regular expression)

Posted on 2004-03-31
Last Modified: 2013-11-19
I need a regular expression that looks at every tag in a string of HTML.
If any attribute value in a tag is not surrounded by quotes (' or "), it should add them.

E.g.
$text = '<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT face=Times New Roman size=3>my text here</FONT></P>';
$text = preg_replace(this is where I need help!);

and $text becomes....

$text = '<P class="MsoNormal" style="MARGIN: 0cm 0cm 0pt"><FONT face="Times New Roman" size="3">my text here</FONT></P>';
Question by:jfowlie
LVL 11

Expert Comment

by:shmert
ID: 10729241
I'd recommend using the PECL tidy extension to clean up the HTML.
http://pecl.php.net/package/tidy
It does good stuff.  You'll have a hard time doing this with a regular expression, because you need to handle:
- nested quotes
- backslash-escaped quotes
- quoted angle brackets
and other weirdness.  Tidy will do this and a lot more for you.
 
LVL 1

Author Comment

by:jfowlie
ID: 10732743
Sorry, I should have said that I'm already making use of the Tidy extension. It's good, but it doesn't do everything I need, which is why I was trying to run a regular expression on the "untidy" HTML.

Basically the problem I'm having concerns a tag such as the following, where the font face has not been put in quotes
e.g.
<font face=Times New Roman>my text here</font>
I'm running Tidy as follows:

tidy_parse_string($HTMLtext);

tidy_setopt( 'bare' , true );
tidy_setopt( 'drop-proprietary-attributes' , true);
tidy_setopt( 'show-body-only' , true);
tidy_setopt( 'word-2000' , true );
tidy_setopt( 'fix-backslash' ,true);
tidy_setopt( 'logical-emphasis' ,true);
tidy_setopt( 'lower-literals' ,true);
tidy_setopt( 'drop-empty-paras', true);

tidy_clean_repair();

$HTMLtext= tidy_get_output();

And using the settings above, Tidy does the following:
<FONT face=times roman="" new="">my text here</FONT>

I don't understand why it does this when I have 'drop-proprietary-attributes' enabled.

I do not want it to drop the font tags.
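
The Tidy output above is consistent with the usual HTML rule that an unquoted attribute value ends at the first whitespace, so "New" and "Roman" get read as separate attributes with no value. Below is a minimal sketch of that rule in PHP, only to show where the split happens. It is not what Tidy runs internally, and the function name quote_until_whitespace is made up for this example.

```php
<?php
// Hypothetical helper: quotes unquoted values, ending each one at the first
// whitespace character (or at the closing ">").
function quote_until_whitespace(string $html): string
{
    return preg_replace_callback('/<[^<>]+>/', function (array $tag): string {
        // Already-quoted values are matched first and returned untouched;
        // an unquoted value is everything after "=" up to whitespace or ">".
        return preg_replace_callback(
            '/"[^"]*"|\'[^\']*\'|=([^\s"\'>]+)/',
            function (array $m): string {
                return isset($m[1]) ? '="' . $m[1] . '"' : $m[0];
            },
            $tag[0]
        ) ?? $tag[0];
    }, $html) ?? $html;
}

echo quote_until_whitespace('<FONT face=Times New Roman size=3>my text here</FONT>'), "\n";
// prints: <FONT face="Times" New Roman size="3">my text here</FONT>
```

With this rule, values without spaces (face=Arial, size=3) come out correctly quoted, while "Times New Roman" is cut after "Times". That is the case any workaround has to deal with before the HTML reaches Tidy.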
 
LVL 11

Expert Comment

by:shmert
ID: 10735187
<font face=Times New Roman>my text here</font>
This seems really problematic to me.  How do you know where the attribute value ends and the next attribute name starts?

For example:
<option value=my value selected class=x>

"selected" is a standalone attribute!  How can a parser differentiate "my value" from "selected"?  This might be something that can't be reasonably automated.  If the attribute values didn't contain spaces, you could just close the quote when you come to a space, and the next non-angle-bracket character you come to is the start of a new attribute.
LVL 1

Author Comment

by:jfowlie
ID: 10735375
You're right - the example you gave is a problem.

The following idea is not perfect and it will corrupt your example, but I think it would meet my requirements.  
How would I write something that put everything after an opening attribute= in quotes until it either
1) reached the end of the tag
or
2) came across another attribute (that uses =)

At the end of the day, I have HTML content that is being pasted from Word (which Tidy does a good job of handling)  or other web pages (which are the main culprit for the issue I'm having). The good news is that FORM elements are never part of this content.

Alternatively, can you suggest some better options for me to use with Tidy?
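
The rule described above (quote everything after attribute= until the end of the tag or the next attribute=) could be sketched with preg_replace_callback roughly as follows. This is a sketch under assumptions, not a tested solution for arbitrary HTML: it assumes there is no whitespace around "=", that "<" and ">" only appear as tag delimiters, and the function name quote_until_next_attribute is hypothetical.

```php
<?php
// Hypothetical helper implementing the "until the next attribute= or the end
// of the tag" rule. Assumes no whitespace around "=".
function quote_until_next_attribute(string $html): string
{
    // name=value, where value is quoted or runs lazily up to the next
    // " name=", a closing " /", or the end of the attribute list.
    $attr = '/([\w:-]+)=("[^"]*"|\'[^\']*\'|.*?)(?=\s+[\w:-]+=|\s+\/$|\s*$)/s';

    return preg_replace_callback(
        '/<([a-zA-Z][\w:-]*)(\s[^<>]*)>/',   // opening tags that carry attributes
        function (array $tag) use ($attr): string {
            $attrs = preg_replace_callback($attr, function (array $m): string {
                $first = $m[2][0] ?? '';
                if ($first === '"' || $first === "'") {
                    return $m[0];            // already quoted: keep as-is
                }
                // Escape stray double quotes so the new quoting stays valid.
                return $m[1] . '="' . str_replace('"', '&quot;', $m[2]) . '"';
            }, $tag[2]) ?? $tag[2];
            return '<' . $tag[1] . $attrs . '>';
        },
        $html
    ) ?? $html;
}

$text = '<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT face=Times New Roman size=3>my text here</FONT></P>';
echo quote_until_next_attribute($text), "\n";
// prints: <P class="MsoNormal" style="MARGIN: 0cm 0cm 0pt"><FONT face="Times New Roman" size="3">my text here</FONT></P>
```

As expected, the <option value=my value selected class=x> example still comes out wrong (value="my value selected"), because a valueless attribute cannot be told apart from part of the preceding value.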
 
LVL 11

Accepted Solution

by:
shmert earned 250 total points
ID: 10745571
Maybe you can do it with regex, although it certainly won't handle every situation (single quotes surrounded by double quotes, for example, like many javascript tags might have).

I'd probably do some sort of iterative parser that looks at the HTML one character at a time.  Have various flags that you set for when you encounter an opening tag, whether you are in a quote block, and the last potential spot where an attribute value ended.
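
A rough sketch of what such a character-by-character scan might look like in PHP. The flags follow the description above (inside a tag, inside a quote block, last potential end of a value), but the function name quote_attributes_scan and the exact rules are assumptions for illustration: it assumes no whitespace around "=" and that "<" only appears as a tag delimiter, and it is not a full HTML parser.

```php
<?php
// Hypothetical helper: a flag-based scan, not a full HTML parser.
function quote_attributes_scan(string $html): string
{
    $wrap = function (string $v): string {       // trim, escape, and quote a value
        return '"' . str_replace('"', '&quot;', trim($v)) . '"';
    };
    $out = '';
    $inTag = false;    // between "<" and ">"
    $quote = '';       // quote character of the quoted value we are inside, or ''
    $inValue = false;  // collecting an unquoted value
    $value = '';       // unquoted value collected so far
    $lastEnd = -1;     // offset in $value where the value may have ended

    for ($i = 0, $len = strlen($html); $i < $len; $i++) {
        $c = $html[$i];
        if (!$inTag) {                            // plain text: copy through
            $out .= $c;
            $inTag = ($c === '<');
        } elseif ($quote !== '') {                // inside a quoted value
            $out .= $c;
            if ($c === $quote) {
                $quote = '';
            }
        } elseif (!$inValue) {                    // tag name, attribute name, spacing
            $out .= $c;
            $next = $html[$i + 1] ?? '';
            if ($c === '>') {
                $inTag = false;
            } elseif ($c === '=' && ($next === '"' || $next === "'")) {
                $quote = $next;                   // a quoted value follows
                $out .= $next;
                $i++;
            } elseif ($c === '=') {
                $inValue = true;                  // an unquoted value follows
                $value = '';
                $lastEnd = -1;
            }
        } elseif ($c === '>') {                   // the tag ends, so the value ends
            $tail = '>';
            if (preg_match('/\s\/$/', $value)) {  // keep an XHTML-style " />"
                $value = substr($value, 0, -1);
                $tail = ' />';
            }
            $out .= $wrap($value) . $tail;
            $inValue = $inTag = false;
        } elseif ($c === '=' && $lastEnd >= 0) {
            // The last word was really the next attribute's name: close the
            // value at the last potential end, then re-read this "=".
            $out .= $wrap(substr($value, 0, $lastEnd)) . substr($value, $lastEnd);
            $inValue = false;
            $i--;
        } else {
            if (trim($c) === '' && trim(substr($value, -1)) !== '') {
                $lastEnd = strlen($value);        // whitespace right after a word
            }
            $value .= $c;
        }
    }
    return $inValue ? $out . $value : $out;       // unterminated tag: emit as-is
}
```

Compared with a single regular expression, the explicit flags make it easier to treat an "=" inside an unquoted value with no preceding space (for example href=page.php?id=3) as part of the value, and to copy quoted values through untouched even when they contain ">".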
 
LVL 27

Assisted Solution

by:skullnobrains
skullnobrains earned 250 total points
ID: 10750238
I would use something that does not really work like a parser but should be useful.
$dest is the destination string
$orig is the original string
I assume that you converted all '>' and '<' to HTML entities in the regular text.
This will give wrong results if you have an '=' sign in an attribute value (which is often the case).
It only accepts double quotes, but that is easy to change.
You'll need to check for commas and parentheses if you need the code.

// Return $haystack up to and including the last $needle (whole string if absent).
function strtochr($haystack, $needle)
{
   $pos = strrpos($haystack, $needle);
   if ($pos === false) {
       return $haystack;
   }
   return substr($haystack, 0, $pos + 1);
}

$dest = '';
$len = strlen($orig);
for ($i = 0; $i < $len; $i++) {
   // copy everything outside tags unchanged
   $dest .= $orig[$i];
   if ($orig[$i] != '<') continue;

   $end = strpos($orig, '>', $i);   // closing '>' of this tag
   if ($end === false) { $dest .= substr($orig, $i + 1); break; }
   $chunks = explode('=', substr($orig, $i + 1, $end - $i - 1));
   $last = count($chunks) - 1;

   // $chunks[0] is "tagname firstattr"; each later chunk is "value nextattr",
   // except the last one, which holds only a value.
   for ($n = 1; $n <= $last; $n++) {
      $chunk = trim($chunks[$n]);
      $value = ($n < $last) ? rtrim(strtochr($chunk, ' ')) : $chunk;
      $name  = (string) substr($chunk, strlen($value));   // " nextattr" or ''
      if ($value === '' || $value[0] != '"') $value = '"' . $value;
      if (strlen($value) < 2 || substr($value, -1) != '"') $value .= '"';
      $chunks[$n] = $value . $name;
   }

   $dest .= implode('=', $chunks) . '>';
   $i = $end;   // continue after the closing '>'
}
 
LVL 27

Expert Comment

by:skullnobrains
ID: 10750245
$nb=count($chunks); // this line is not necessary

A few $chunks are typed as $chunk: these are typos.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 12667519
thanks a lot for the forced accept.
work sometimes gets a reward, it seems :)

just a tiny note, as I looked at my code for some time before I figured out what it does... '"' is actually a double quote embedded between two single quotes. may help others some day...

see you all on the threads sometime.