Solved

Find broken HTML tags in a string

Posted on 2007-03-22
Last Modified: 2012-06-27
Is it possible to detect whether a string has broken HTML tags in it?

For example: <p>this is my string with <b>broken</b> HTML tags.  Can</b> i find them?</p>

As you can probably tell, I'm trying to find and remove broken tags from a string. I'm only concerned with <b></b> tags for now.

Please advise.
Question by:ellandrd
LVL 49

Expert Comment

by:Roonaan
ID: 18770229
Hi Ellandrd,

You might look into the htmltidy library, if it is installed on your server (http://devzone.zend.com/node/view/id/761).

With some work we should also be able to use the preg_* functions, of course.
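
As a rough sketch of the tidy route, assuming the tidy extension is loaded (it is not part of every PHP build), something like the following could repair a fragment. repairWithTidy is a made-up helper name for illustration; tidy_repair_string and the show-body-only option come from the tidy extension itself.

```php
<?php
// Hypothetical helper: returns the repaired fragment, or null when the
// tidy extension is not available (common on shared hosting).
function repairWithTidy($html) {
    if (!function_exists('tidy_repair_string')) {
        return null;
    }
    // show-body-only keeps tidy from wrapping the fragment in <html>/<body>
    $config = array('show-body-only' => true);
    return tidy_repair_string($html, $config, 'utf8');
}

$html  = '<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>';
$fixed = repairWithTidy($html);
echo $fixed === null ? "tidy extension not available\n" : $fixed;
```

Where tidy decides to close the open <b> is up to tidy, so check the result on your own input before relying on it.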

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770258
I've just found this JavaScript code:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/CSS/Q_21038792.html

But I'm not sure whether it will work, as I need something in PHP if possible.
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770286
After looking into this htmltidy stuff, it looks good, but I don't have control over the server configuration for this project: the site is hosted with a paid shared hosting provider.

However, I will install it on my own server at home for my own use.

What do you think of that JavaScript code?
0
 
LVL 49

Expert Comment

by:Roonaan
ID: 18770347
JavaScript isn't the solution.

In PHP you could use something like this. (Typed from memory; I don't have a PHP server available at the moment.)

<?php

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);
$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

function getElementName($tag) {
  return strtolower(preg_replace('/^(\S+)(.*)$/', '\1', $tag));
}

?>

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770589
Is $html your string or HTML code?
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770808
The string I tested:

<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>

After copying and pasting your code directly and running it, the output I get is this:

<p>this is my bold</b> tag and this is a <b>broken tag</p>
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18770823
When I print the contents of the stack, it's empty. When I print the contents of $parts, I get this:

Array
(
    [0] => this is my
    [1] => bold tag and this is a broken tag
)

Array
(
)

Array
(
)
0
 
LVL 49

Accepted Solution

by:
Roonaan earned 500 total points
ID: 18770976
My preg_split statement was wrong, and the getElementName could use a minor adjustment. Please try going with:

<?php

$html = '<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>';

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, 0, PREG_SPLIT_DELIM_CAPTURE);

var_export($parts);

$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

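// close any tags that are still open at the end of the input,
// appending to $newhtml so they end up at the end of the repaired string
while(count($stack) > 0) {
  $newhtml .= '</'.array_pop($stack).'>';
}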

echo "\n".$html;
echo "\n".$newhtml;

function getElementName($tag) {
  return strtolower(preg_replace('/^([^\s>]+)(.*)$/', '\1', $tag));
}

?>
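
To show why the first version never filled the stack, here is a small comparison of the two preg_split calls. The signature is preg_split($pattern, $subject, $limit, $flags), and PREG_SPLIT_DELIM_CAPTURE is the integer 2, so passing it as the third argument sets a limit of 2 pieces and captures no tags at all. The short $html value is just a sample for illustration.

```php
<?php
$html = '<p>a <b>b</b></p>';
$preg = '/(<[^>]+>)/';

// Flag in the $limit slot: at most 2 pieces, and the tags are thrown away.
$wrong = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);

// Limit given explicitly (0 or -1 means "no limit"), flag in the $flags slot:
// every tag and every text run becomes its own array element.
$right = preg_split($preg, $html, 0, PREG_SPLIT_DELIM_CAPTURE);

print_r($wrong);
print_r($right);
```

With no tags in $parts, the loop only ever reaches the plain-text branch, which is why nothing was pushed onto the stack.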
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18771194
It works brilliantly - thank you so much!!
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18772781
OK, I've just noticed a bug. When testing this earlier, all I was using was simple strings like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</p>

which came out like:

<p>this is my <b>bold</b> tag and this is a <b>broken</b></p>

But when you test something like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</p>

it comes out like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</b></p>
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18772790
when actually it should look like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</b> tag and not too good at all</p>
0
 
LVL 49

Expert Comment

by:Roonaan
ID: 18777955
How can you be sure that <b> always spans just one word?
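
Since the code can't know where the author meant to close the tag, one option is to only detect and report the unbalanced <b> tags and leave the repair to a person. A minimal sketch, where findUnbalancedBold is a made-up helper name and only <b> tags are considered:

```php
<?php
// Hypothetical helper: lists <b> and </b> tags that have no partner.
// Each entry is array('tag' => ..., 'offset' => ...); stray closing tags
// are listed first, then opening tags that were never closed.
function findUnbalancedBold($html) {
    // \b after "b" keeps <br>, <body> and <blockquote> from matching
    preg_match_all('/<\/?b\b[^>]*>/i', $html, $matches, PREG_OFFSET_CAPTURE);
    $open   = array();   // stack of <b> tags still waiting for a </b>
    $broken = array();   // </b> tags with nothing to close
    foreach ($matches[0] as $m) {
        list($tag, $offset) = $m;
        if ($tag[1] !== '/') {
            $open[] = array('tag' => $tag, 'offset' => $offset);
        } elseif (count($open) > 0) {
            array_pop($open);   // this </b> closes the most recent <b>
        } else {
            $broken[] = array('tag' => $tag, 'offset' => $offset);
        }
    }
    return array_merge($broken, $open);
}

print_r(findUnbalancedBold('<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>'));
```

The offsets are byte positions in the input string, so they can be used to highlight the problem spots for whoever edits the text.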

-r-
0
 
LVL 16

Author Comment

by:ellandrd
ID: 18778199
Good question...

I need to have a rethink about how I'm going to overcome this.
0


Question has a verified solution.
