Solved

find broken HTML tags in string

Posted on 2007-03-22
Last Modified: 2012-06-27
Is it possible to detect whether a string has broken HTML tags in it?

For example: <p>this is my string with <b>broken</b> HTML tags.  Can</b> i find them?</p>

As you can probably tell, I'm trying to find and remove broken tags from a string. I am only concerned about <b></b> tags for now.

Please advise.
Question by:ellandrd
LVL 49

Expert Comment

by:Roonaan
ID: 18770229
Hi Ellandrd,

You might look into the HTML Tidy library, if it is installed on your server (http://devzone.zend.com/node/view/id/761).

With some work we should also be able to use the preg_* functions, of course.

-r-
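For readers who do have the tidy extension available, a minimal sketch of the HTML Tidy route might look like the following. The helper name tidyFragment is hypothetical; tidy_repair_string() and the 'show-body-only' option come from PHP's tidy extension, and the sketch returns null when the extension is not loaded (as is common on shared hosting).

```php
<?php
// Hypothetical helper: repair an HTML fragment with the tidy extension.
// Assumption: ext/tidy may or may not be loaded, so we check first.
function tidyFragment(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available on this server
    }

    $config = [
        'show-body-only' => true, // return just the fragment, not a full document
    ];

    $result = tidy_repair_string($html, $config, 'utf8');

    return $result === false ? null : $result;
}

$fixed = tidyFragment('<p>this is a <b>broken tag</p>');
var_dump($fixed);
```

Note that Tidy decides for itself where the missing closing tag goes, so the result should still be checked against what you intended.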
 
LVL 16

Author Comment

by:ellandrd
ID: 18770258
I've just found this JavaScript code:

http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/CSS/Q_21038792.html

But I'm not sure if this will work, as I need something in PHP if possible.
 
LVL 16

Author Comment

by:ellandrd
ID: 18770286
After looking into this HTML Tidy stuff: it looks good, but I don't have control over the server configuration for this project. The site is hosted with a paid shared hosting provider.

However, I will install it on my own server at home for my own use.

What do you think of that JavaScript code?
 
LVL 49

Expert Comment

by:Roonaan
ID: 18770347
JavaScript isn't the solution.

In PHP you could use something like this. (Typed from memory; I don't have a PHP server available at the moment.)

<?php

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);
$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

function getElementName($tag) {
  return strtolower(preg_replace('/^(\S+)(.*)$/', '\1', $tag));
}

?>

-r-
 
LVL 16

Author Comment

by:ellandrd
ID: 18770589
Is $html your string or HTML code?
 
LVL 16

Author Comment

by:ellandrd
ID: 18770808
The string I tested:

<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>

After copying and pasting your code directly and running it, the output I get is this:

<p>this is my bold</b> tag and this is a <b>broken tag</p>
 
LVL 16

Author Comment

by:ellandrd
ID: 18770823
When I print the contents of the stack, it's empty. When I print the contents of $parts, I get this:

Array
(
    [0] => this is my
    [1] => bold tag and this is a broken tag
)

Array
(
)

Array
(
)
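A small sketch of why $parts comes out wrong with the code as posted: preg_split() takes its limit as the third argument and its flags as the fourth, so a flag passed in third position is read as a limit. PREG_SPLIT_DELIM_CAPTURE has the value 2, which means "split into at most two pieces" with no delimiters captured. The short test string here is made up for illustration.

```php
<?php
$html = '<p>a<b>b</b>';
$preg = '/(<[^>]+>)/';

// Flag in the $limit position: treated as limit = 2, flags = 0.
$wrong = preg_split($preg, $html, PREG_SPLIT_DELIM_CAPTURE);

// Limit -1 (no limit), flag in the $flags position: tags are kept as parts.
$right = preg_split($preg, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

var_export($wrong); // two pieces, the tags themselves are not captured
echo "\n";
var_export($right); // text and tags interleaved
```

With the second call, every tag becomes its own array element, which is what the stack loop needs in order to see opening and closing tags at all.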
 
LVL 49

Accepted Solution

by:
Roonaan earned 500 total points
ID: 18770976
My preg_split statement was wrong (the flag was passed where the limit argument belongs), and getElementName could use a minor adjustment. Please try going with:

<?php

$html = '<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>';

// create stack
$stack = array();
// Match an html element
$preg = '/(<[^>]+>)/';
// Break $html on each and every tag
$parts = preg_split($preg, $html, 0, PREG_SPLIT_DELIM_CAPTURE);

var_export($parts);

$newhtml = '';
// Walk through $parts and maintain the $stack:
foreach($parts as $part) {
   if(substr($part, 0, 2) == '</') {
      // check stack, and remove element from stack if ok
      $partname = getElementName(substr($part, 2));
      $stacksize = count($stack);
      // close all tags that are unclosed
      while($stacksize > 0 && $stack[$stacksize-1] != $partname) {
           $newhtml .= '</'.array_pop($stack).'>';
           $stacksize = count($stack);
      }
      // close the current tag
      if($stacksize > 0 && $stack[$stacksize-1] == $partname) {
           $newhtml .= $part;
           array_pop($stack);
      }
   } elseif(substr($part, 0, 1) == '<') {
      $newhtml .= $part;
      // strip attributes from part, and add element to stack
      $partname = getElementName(substr($part,1));
      array_push($stack, $partname);
    } else {
       $newhtml .= $part;
    }
}

// close any tags still open at the end of the string
while(count($stack) > 0) {
  $newhtml .= '</'.array_pop($stack).'>';
}

echo "\n".$html;
echo "\n".$newhtml;

function getElementName($tag) {
  return strtolower(preg_replace('/^([^\s>]+)(.*)$/', '\1', $tag));
}

?>
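One case the loop above does not cover is a stray closing tag with no matching opener, as in the original question's example (`Can</b> i find them?`): the while loop would pop and close every open element while looking for a match. A possible hardening, sketched here as a hypothetical function repairTags(), drops closing tags that have no opener on the stack and skips a few void elements such as <br>. The void list is a partial assumption, not a complete one.

```php
<?php
// Hypothetical helper: repair tag nesting in an HTML fragment using a stack.
function repairTags(string $html): string
{
    // Elements without a closing tag (partial list, an assumption of this sketch).
    $void  = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];
    $out   = '';

    // Split into text and tags, keeping the tags as separate parts.
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $part) {
        // Anything that is not an opening or closing tag is copied through.
        if (!preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/', $part, $m)) {
            $out .= $part;
            continue;
        }

        $name = strtolower($m[2]);

        if ($m[1] === '/') {
            if (!in_array($name, $stack, true)) {
                continue; // stray closing tag: drop it
            }
            // Close everything that was opened after the matching element.
            while (end($stack) !== $name) {
                $out .= '</' . array_pop($stack) . '>';
            }
            array_pop($stack);
            $out .= $part;
        } else {
            $out .= $part;
            // Void and self-closed elements never go on the stack.
            if (!in_array($name, $void, true) && substr($part, -2) !== '/>') {
                $stack[] = $name;
            }
        }
    }

    // Close whatever is still open at the end of the string.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }

    return $out;
}

echo repairTags('<p>this is my string with <b>broken</b> HTML tags.  Can</b> i find them?</p>');
```

Like the accepted code, this closes an unclosed <b> at the point where its parent ends; it cannot know where the author meant the bold text to stop.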
 
LVL 16

Author Comment

by:ellandrd
ID: 18771194
It works brilliantly, thank you so much!
 
LVL 16

Author Comment

by:ellandrd
ID: 18772781
OK, I've just noticed a bug. When testing this earlier, all I was using was simple strings like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</p> which came out like:

<p>this is my <b>bold</b> tag and this is a <b>broken</b></p>

But when you test something like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</p> it comes out like this:

<p>this is my <b>bold</b> tag and this is a <b>broken tag and not too good at all</b></p>
 
LVL 16

Author Comment

by:ellandrd
ID: 18772790
when actually it should look like this:

<p>this is my <b>bold</b> tag and this is a <b>broken</b> tag and not too good at all</p>
 
LVL 49

Expert Comment

by:Roonaan
ID: 18777955
How can you be sure that <b> always spans just one word?

-r-
 
LVL 16

Author Comment

by:ellandrd
ID: 18778199
Good question.

I need to have a rethink about how I'm going to overcome this.
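Since the intended end of an unclosed <b> cannot be guessed reliably, one option is to only detect the problem and report where it is, leaving the fix to a person. The following is a rough sketch under that assumption; findBrokenTags() and the shape of its return value are hypothetical, and it checks a single tag name (b by default), matching the scope of the original question.

```php
<?php
// Hypothetical helper: report unbalanced occurrences of one tag, with byte offsets.
function findBrokenTags(string $html, string $tag = 'b'): array
{
    $issues = [];
    $open   = []; // offsets of opening tags that are not yet closed

    // Match <tag ...> and </tag>; \b keeps "b" from matching <br> or <body>.
    $pattern = '/<(\/?)' . preg_quote($tag, '/') . '\b[^>]*>/i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($matches as $m) {
        $offset = $m[0][1];
        if ($m[1][0] === '/') {
            if ($open) {
                array_pop($open); // closes the most recent opener
            } else {
                $issues[] = ['type' => 'stray-close', 'offset' => $offset];
            }
        } else {
            $open[] = $offset;
        }
    }

    // Openers left over were never closed.
    foreach ($open as $offset) {
        $issues[] = ['type' => 'unclosed', 'offset' => $offset];
    }

    return $issues;
}

print_r(findBrokenTags('<p>this is my <b>bold</b> tag and this is a <b>broken tag</p>'));
```

An empty array means the tag is balanced; otherwise each entry says whether a closing tag had no opener or an opener was never closed, and at which offset in the string.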