PHP/REGEX: Automatically close HTML elements

Using PHP, how can I automatically close all HTML tags that need to be closed?
<pre><?php

$html = '
<div>
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

$html = closeHTML($html);

echo htmlentities($html);

function closeHTML($html) {
 // Close all open HTML tags that need to be closed
 return $html; // 
}

?></pre>

hankknight Asked:
 
käµfm³d 👽 Commented:
Here's why using regex for this is a bad idea:

Regex can be used to tokenize an input, because you are saying with your pattern, "I expect valid tokens to fit this pattern." For the sake of argument, let's take tokens to mean anything not whitespace. Just having tokens, however, doesn't allow you to say which tokens match each other (e.g. beginning and ending tags). A parser (crudely, regex with a stack) is what you would use to keep track of tokens.

You need some way of denoting when you have found an opening tag. This is where a stack comes into play. With a stack, you can track when you encounter an opening tag (by pushing it onto the stack when you find it), and when you encounter a matching ending tag (by popping the beginning tag off the stack if it is on top of the stack).

What you're trying to do is fake a parser with regex, and this generally doesn't work (except in the ultra-rare, simplistic case). I suggest you look at using an array (stack) to track your tags. You can still use regex to find the tags, but you need the array to keep track of opening/closing tags. You'll still need to define rules for ambiguity. For example, given:

    <html>
        <body>
                <span>
                        <span>
                                <font>hello world!</font>
</span>
        </body>
    </html>

Which <span> does the </span> above close? This is a very simple example, but these are the things you have to think about.
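
To make the stack idea concrete, here is a minimal PHP sketch, not a full HTML parser. It assumes that tags can be found with a simple regex, that a closing tag closes the nearest open element of the same name (one possible answer to the ambiguity above), and that anything opened inside that element is closed just before it. The function name `closeOpenTags` and the void-element list are illustrative assumptions; attribute values containing `>`, comments, and script content are not handled.

```php
<?php

// Hypothetical helper: inserts missing closing tags using a stack of open elements.
function closeOpenTags(string $html): string
{
    // Void elements never take a closing tag, so they are never pushed.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Tokenize: group 1 is "/" for a closing tag, group 2 is the tag name.
    preg_match_all(
        '#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i',
        $html,
        $tokens,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $stack  = [];  // names of elements that are currently open
    $out    = '';
    $cursor = 0;   // byte offset of the first character not yet copied

    foreach ($tokens as $token) {
        [$tag, $offset] = $token[0];
        $isClosing = $token[1][0] === '/';
        $name = strtolower($token[2][0]);

        // Copy the text between the previous tag and this one unchanged.
        $out .= substr($html, $cursor, $offset - $cursor);
        $cursor = $offset + strlen($tag);

        if (!$isClosing) {
            // Push only elements that expect a closing tag.
            if (!in_array($name, $void, true) && substr($tag, -2) !== '/>') {
                $stack[] = $name;
            }
            $out .= $tag;
            continue;
        }

        // A closing tag with no open partner is passed through untouched.
        if (!in_array($name, $stack, true)) {
            $out .= $tag;
            continue;
        }

        // Close everything opened after the matching element, innermost first.
        while (($top = array_pop($stack)) !== $name) {
            $out .= "</$top>";
        }
        $out .= $tag;
    }

    // Copy any trailing text, then close whatever is still open.
    $out .= substr($html, $cursor);
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }

    return $out;
}
```

Applied to the markup in the question, this sketch would place `</strong>` just before `</p>` and append one `</div>` at the end. Whether that is where the author meant the tags to close is exactly the kind of rule you have to decide on.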
 
Cornelia Yoder (Artist) Commented:
In PHP, you can echo the closing tags like this:

echo "</td></tr></table>";

but there is no magic way for PHP to know which tags need to be closed, or where. That is the job of the programmer.
 
Lukasz Chmielewski Commented:
If you desperately need this, here is an idea: count the occurrences of each tag in an array. An odd count for a tag indicates that a tag of that kind is not closed (so you would know which kind of tag). Determining where it should be closed is harder; see the comment above.
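
A rough sketch of the counting idea in PHP follows. It compares the number of opening and closing tags per name rather than checking strict parity, because two unclosed tags of the same name would give an even total. The function name `countUnclosedTags` and the void-element list are assumptions made for illustration, and the regex is only a simple tokenizer.

```php
<?php

// Hypothetical helper: returns [tagName => number of missing closing tags].
function countUnclosedTags(string $html): array
{
    // Void elements never take a closing tag, so they are not counted.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $tokens, PREG_SET_ORDER);

    $balance = [];
    foreach ($tokens as $token) {
        $name = strtolower($token[2]);
        if (in_array($name, $void, true) || substr($token[0], -2) === '/>') {
            continue; // void or self-closing: nothing to balance
        }
        // +1 for an opening tag, -1 for a closing tag.
        $balance[$name] = ($balance[$name] ?? 0) + ($token[1] === '/' ? -1 : 1);
    }

    // Keep only the tags that were opened more often than they were closed.
    return array_filter($balance, function ($n) {
        return $n > 0;
    });
}
```

This tells you which kinds of tags are unclosed and how many, but not where each one should be closed.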
 
Beverley Portlock Commented:
You could look at running the code through HTML Tidy, which can do some of this:

http://tidy.sourceforge.net/
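
If PHP's tidy extension happens to be installed, a call along these lines may be enough. This is a sketch under that assumption: `repairWithTidy` is a hypothetical wrapper name, and it returns null when the extension is missing so that the caller can fall back to something else.

```php
<?php

// Hypothetical wrapper: returns repaired markup, or null if tidy is unavailable.
function repairWithTidy(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // the tidy extension is not loaded
    }

    $config = [
        'show-body-only' => true, // return only the fragment, without <html>/<body>
        'wrap'           => 0,    // do not re-wrap long lines
    ];

    $fixed = tidy_repair_string($html, $config, 'utf8');

    return $fixed === false ? null : $fixed;
}

$result = repairWithTidy('<p><strong>Hello</p>');
echo $result === null ? 'tidy is not available' : htmlentities($result);
```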
 
käµfm³d 👽 Commented:
Regular expressions are NOT going to be a good tool to use for this scenario  = )
 
Ray Paseur Commented:
I don't think REGEX is the right solution.  And if kaufmed does not have a REGEX for it, it probably does not exist!

I think the right solution is to use a design pattern that takes advantage of careful indenting of your control structures and tag sequences, and that separates your logic from your presentation layer.  Then use the W3 validator to find and fix any HTML hiccups.  If you code to strict standards you will have no trouble.
http://validator.w3.org/
 
käµfm³d 👽 Commented:
Oh Ray_Paseur, you're such a kidder    = )
 
Beverley Portlock Commented:
@kaufmed  +1  :-D

 
hankknight (Author) Commented:
OK, I managed to get something working, but it only fixes one problem. If there is a single unclosed tag, it closes it; if there are six unclosed tags, it only fixes the first one.

I understand that the code it creates may not be valid; however, I cannot use the Tidy extension for this. So even if it closes a <strong> tag in the wrong place, that is fine.
<pre><?php

$html = '
<div id="abc">
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

echo htmlentities($html);
echo '<hr />';
echo htmlentities(closeTags($html));

function closeTags($html) {
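    // Opening tags, skipping void elements; tag names may contain digits (h1 to h6).
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)([a-z][a-z0-9]*)(?: [^>]*)?(?<![/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    // Closing tags.
    preg_match_all('#</([a-z][a-z0-9]*)>#iU', $html, $result);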
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
} 

?></pre>

 
käµfm³d 👽 Commented:
Perhaps I should have said, "here's why using regex alone for this is a bad idea..."   = )
 
hankknight (Author) Commented:
Thank you all for your insights.  Would it be a better idea to REMOVE the innermost offending tags that are not closed?
 
käµfm³d 👽 Commented:
That's a call you'll have to make. We don't know what your HTML is, what it does, or how it's structured, so only you can know whether removing an offending tag would still give valid HTML. I can say that, because of the (currently) non-rigid nature of HTML, it is easy to completely wreck your page if you start removing tags solely because you could not find a matching tag.
Question has a verified solution.
