find broken HTML tags in string

Is it possible to detect if a  string has broken HTML tags in it?

For example: <p>this is my string with <b>broken</b> HTML tags.  Can</b> i find them?</p>

As you can probably tell, im trying to find and remove broken tags from a string.  I am only concerned about <b></b> tags for now...

please advise?
LVL 16
ellandrdAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

RoonaanCommented:
Hi Ellandrd,

You might look into the htmltidy library if installed on your server.(http://devzone.zend.com/node/view/id/761)

With some work we should also able to use preg_* functionalities of course.

-r-
0
ellandrdAuthor Commented:
Ive just found this JavaScript code:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/CSS/Q_21038792.html

But im not sure if this will work... as i need something in PHP if possible.
0
ellandrdAuthor Commented:
after looking into this htmltidy stuff - it looks but i dont have control over the server configuration for this project. - the site is hosted on a paid/shared hosting provider.

however i will install on my own server at home for my own use...

what do you think of that JavaScript code?
0
Cloud Class® Course: SQL Server Core 2016

This course will introduce you to SQL Server Core 2016, as well as teach you about SSMS, data tools, installation, server configuration, using Management Studio, and writing and executing queries.

RoonaanCommented:
Javascript isn't the solution.

In php you could use something like this. (Typed from heart, don't have a php server available at this time)

<?php

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);
$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

function getElementName($tag) {
  return strtolower(preg_replace('/^(\S+)(.*)$/', '\1', $tag));
}

?>

-r-
0
ellandrdAuthor Commented:
is $html your string or HTML code?
0
ellandrdAuthor Commented:
my string that i tested:

<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>

with directly copying and pasting your code and running it, the output i get is this:

<p>this is my bold</b> tag and this is a <b>broken tag</p>
0
ellandrdAuthor Commented:
when i print the contents of the stack, its empty.  when i print the contents of $parts, i get this:

Array
(
    [0] => this is my
    [1] => bold tag and this is a broken tag
)

Array
(
)

Array
(
)
0
RoonaanCommented:
My preg_split statement was wrong, and the getElementName could use a minor adjustment. Please try going with:

<?php

$html = '<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>';

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, 0, PREG_SPLIT_DELIM_CAPTURE);

var_export($parts);

$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

while(count($stack) > 0) {
  echo '</'.array_pop($stack).'>';
}

echo "\n".$html;
echo "\n".$newhtml;

function getElementName($tag) {
  return strtolower(preg_replace('/^([^\s>]+)(.*)$/', '\1', $tag));
}

?>
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
ellandrdAuthor Commented:
it works brilliant - thank you so much!!
0
ellandrdAuthor Commented:
ok ive just noticed a bug.  when testing this earlier all i was using was simple strings like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</p> which came out like:

<p>this is my <b>bold</b> tag and this is a <b>broken</b></p>

but when you test some thing like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</p> it comes out like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</b></p>
0
ellandrdAuthor Commented:
when actaully it should look like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</b> tag and not too good at all</p>
0
RoonaanCommented:
How can you be sure that <b> only just span one word always?

-r-
0
ellandrdAuthor Commented:
good question....

i need to have a rethink about how im going to overcome this...
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.