Solved

find broken HTML tags in string

Posted on 2007-03-22
13
447 Views
Last Modified: 2012-06-27
Is it possible to detect if a  string has broken HTML tags in it?

For example: <p>this is my string with <b>broken</b> HTML tags.  Can</b> i find them?</p>

As you can probably tell, im trying to find and remove broken tags from a string.  I am only concerned about <b></b> tags for now...

please advise?
0
Comment
Question by:ellandrd
  • 9
  • 4
13 Comments
 
LVL 49

Expert Comment

by:Roonaan
ID: 18770229
Hi Ellandrd,

You might look into the htmltidy library if installed on your server.(http://devzone.zend.com/node/view/id/761)

With some work we should also able to use preg_* functionalities of course.

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770258
Ive just found this JavaScript code:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/CSS/Q_21038792.html

But im not sure if this will work... as i need something in PHP if possible.
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770286
after looking into this htmltidy stuff - it looks but i dont have control over the server configuration for this project. - the site is hosted on a paid/shared hosting provider.

however i will install on my own server at home for my own use...

what do you think of that JavaScript code?
0
 
LVL 49

Expert Comment

by:Roonaan
ID: 18770347
Javascript isn't the solution.

In php you could use something like this. (Typed from heart, don't have a php server available at this time)

<?php

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);
$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

function getElementName($tag) {
  return strtolower(preg_replace('/^(\S+)(.*)$/', '\1', $tag));
}

?>

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770589
is $html your string or HTML code?
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770808
my string that i tested:

<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>

with directly copying and pasting your code and running it, the output i get is this:

<p>this is my bold</b> tag and this is a <b>broken tag</p>
0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 
LVL 16

Author Comment

by:ellandrd
ID: 18770823
when i print the contents of the stack, its empty.  when i print the contents of $parts, i get this:

Array
(
    [0] => this is my
    [1] => bold tag and this is a broken tag
)

Array
(
)

Array
(
)
0
 
LVL 49

Accepted Solution

by:
Roonaan earned 500 total points
ID: 18770976
My preg_split statement was wrong, and the getElementName could use a minor adjustment. Please try going with:

<?php

$html = '<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>';

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, 0, PREG_SPLIT_DELIM_CAPTURE);

var_export($parts);

$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

while(count($stack) > 0) {
  echo '</'.array_pop($stack).'>';
}

echo "\n".$html;
echo "\n".$newhtml;

function getElementName($tag) {
  return strtolower(preg_replace('/^([^\s>]+)(.*)$/', '\1', $tag));
}

?>
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18771194
it works brilliant - thank you so much!!
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18772781
ok ive just noticed a bug.  when testing this earlier all i was using was simple strings like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</p> which came out like:

<p>this is my <b>bold</b> tag and this is a <b>broken</b></p>

but when you test some thing like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</p> it comes out like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</b></p>
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18772790
when actaully it should look like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</b> tag and not too good at all</p>
0
 
LVL 49

Expert Comment

by:Roonaan
ID: 18777955
How can you be sure that <b> only just span one word always?

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18778199
good question....

i need to have a rethink about how im going to overcome this...
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Author Note: Since this E-E article was originally written, years ago, formal testing has come into common use in the world of PHP.  PHPUnit (http://en.wikipedia.org/wiki/PHPUnit) and similar technologies have enjoyed wide adoption, making it possib…
Introduction This article is intended for those who are new to PHP error handling (https://www.experts-exchange.com/articles/11769/And-by-the-way-I-am-New-to-PHP.html).  It addresses one of the most common problems that plague beginning PHP develop…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

910 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now