Solved

PHP/REGEX: Automatically close HTML elements

Posted on 2011-02-21
Last Modified: 2012-06-27
Using PHP, how can I automatically close all HTML tags that need to be closed?
<pre><?php

$html = '
<div>
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

$html = closeHTML($html);

echo htmlentities($html);

function closeHTML($html) {
 // Close all open HTML tags that need to be closed
 return $html; // 
}

?></pre>

Question by:hankknight
12 Comments
 
LVL 27

Assisted Solution

by:yodercm
yodercm earned 50 total points
ID: 34942509
In PHP, you can echo the closing tags like this:

echo "</td></tr></table>";

but there is no magic way for PHP to know which tags need to be closed, or where; that is the job of the programmer.
 
LVL 27

Assisted Solution

by:Lukasz Chmielewski
Lukasz Chmielewski earned 50 total points
ID: 34942535
If you desperately need that, here is an idea: count how often each tag name appears (opening plus closing) in an array keyed by tag name. If the count for a tag is odd, at least one tag of that type is not closed, and you know which kind of tag it is. Determining where it should be closed is another matter; see the comment above.
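A minimal sketch of that counting idea, under some assumptions: `tagImbalance()` is a hypothetical helper name, the list of void tags is abbreviated, and self-closing syntax on non-void tags is ignored. It compares opening and closing counts per tag name rather than checking strict parity, because an even total (two openers, no closers) can still hide a mismatch.

```php
<?php
// Sketch only: tagImbalance() is a hypothetical helper, not code from this thread.
// Returns [tagName => openers minus closers] for every tag that does not balance.
function tagImbalance(string $html): array
{
    // Abbreviated list of tags that never take a closing tag (assumption).
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $balance = [];

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true)) {
            continue;
        }
        // +1 for an opening tag, -1 for a closing tag.
        $balance[$name] = ($balance[$name] ?? 0) + ($tag[1] === '' ? 1 : -1);
    }

    // array_filter() without a callback drops the zero (balanced) entries.
    return array_filter($balance);
}
```

This tells you which tags are unbalanced and by how much, but, as noted above, not where the missing closing tags belong.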
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 50 total points
ID: 34942707
You could try running the code through HTML Tidy, which can do some of this:

http://tidy.sourceforge.net/
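If PHP's tidy extension happens to be available, a call along these lines may be enough. This is only a sketch: `repairWithTidy()` is a hypothetical wrapper name, the two config options are just one plausible choice, and the function returns null when the extension is not installed.

```php
<?php
// Sketch only: repairWithTidy() is a hypothetical wrapper around PHP's tidy extension.
function repairWithTidy(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not installed
    }

    // 'show-body-only' keeps Tidy from wrapping the fragment in <html>/<body>;
    // 'wrap' => 0 turns off line wrapping. Both choices are assumptions.
    $config = ['show-body-only' => true, 'wrap' => 0];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : $repaired;
}
```

Tidy decides for itself where the missing closing tags go, so the result is worth checking against what you expected.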
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34943025
Regular expressions are NOT going to be a good tool to use for this scenario  = )
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 50 total points
ID: 34943169
I don't think REGEX is the right solution.  And if kaufmed does not have a REGEX for it, it probably does not exist!

I think the right solution is to use a design pattern that takes advantage of careful indenting of your control structures and tag sequences, and that separates your logic from your presentation layer.  Then use the W3C validator to find and fix any HTML hiccups.  If you code to strict standards you will have no trouble.
http://validator.w3.org/
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34943276
Oh Ray_Paseur, you're such a kidder    = )
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34943407
@kaufmed  +1  :-D

 
LVL 16

Author Comment

by:hankknight
ID: 34944340
OK, I managed to get something to work for one tag only.  The problem with my code is that it fixes only one problem: if there is a single unclosed tag, it will fix it, but if there are six unclosed tags it will only fix the first one.

I understand that the code it creates may not be valid; however, I cannot use the Tidy extension for this. So even if it closes a <strong> tag in the wrong place, that is fine.
<pre><?php

$html = '
<div id="abc">
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

echo htmlentities($html);
echo '<hr />';
echo htmlentities(closeTags($html));

function closeTags($html) {
    // Group the alternation so that \b applies to every void tag name, not only "input"
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
} 

?></pre>

 
LVL 75

Accepted Solution

by:käµfm³d 👽
käµfm³d 👽 earned 300 total points
ID: 34945284
Here's why using regex for this is a bad idea:

Regex can be used to tokenize an input, because you are saying with your pattern, "I expect valid tokens to fit this pattern." For the sake of argument, let's take tokens to mean anything not whitespace. Just having tokens, however, doesn't allow you to say which tokens match each other (e.g. beginning and ending tags). A parser (crudely, regex with a stack) is what you would use to keep track of tokens.

You need some way of denoting when you have found an opening tag. This is where a stack comes into play. With a stack, you can track when you encounter an opening tag (by pushing it onto the stack when you find it), and when you encounter a matching ending tag (by popping the beginning tag off the stack if it is on top of the stack).

What you're trying to do is fake a parser with regex, and this generally doesn't work (except in the ultra-rare, simplistic case). I suggest you look at using an array (stack) to track your tags. You can still use regex to find the tags, but you need the array to keep track of opening/closing tags. You'll still need to define rules for ambiguity. For example, given:

    <html>
        <body>
                <span>
                        <span>
                                <font>hello world!</font>
</span>
        </body>
    </html>

Which <span> does the </span> above close?? This is a very simple example, but these are things you have to think about.
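To make the stack idea concrete, here is a rough sketch rather than a definitive implementation. `closeOpenTags()` and `VOID_TAGS` are hypothetical names, and the ambiguity rule is an assumption chosen for illustration: a closing tag matches the nearest open tag of the same name, and anything opened after that tag is closed first. A regex still finds the tags; the array does the bookkeeping.

```php
<?php
// Sketch only: closeOpenTags() and VOID_TAGS are hypothetical names.
// Assumed rule: a closing tag matches the NEAREST open tag of the same name.
const VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                   'input', 'link', 'meta', 'source', 'track', 'wbr'];

function closeOpenTags(string $html): string
{
    $stack = []; // names of tags that are currently open, innermost last

    $out = preg_replace_callback(
        '#<(/?)([a-z][a-z0-9]*)\b([^>]*)>#i',
        function (array $m) use (&$stack): string {
            $name = strtolower($m[2]);
            $selfClosing = substr(rtrim($m[3]), -1) === '/';
            if ($selfClosing || in_array($name, VOID_TAGS, true)) {
                return $m[0]; // nothing to track
            }
            if ($m[1] === '') {
                $stack[] = $name; // opening tag: push
                return $m[0];
            }
            // Closing tag: look for the nearest matching opener, top of stack first.
            $pos = array_search($name, array_reverse($stack, true), true);
            if ($pos === false) {
                return $m[0]; // stray closing tag: left untouched (assumption)
            }
            // Close everything that was opened after the match, innermost first.
            $closers = '';
            while (count($stack) > $pos + 1) {
                $closers .= '</' . array_pop($stack) . '>';
            }
            array_pop($stack); // the matching opener itself
            return $closers . $m[0];
        },
        $html
    );

    if ($out === null) {
        return $html; // regex failure: give the input back unchanged
    }
    // Whatever is still open at the end gets closed, innermost first.
    while ($stack !== []) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}
```

Under this rule, the `</span>` in the example above closes the inner `<span>`, and the outer one is closed just before `</body>`. On the sample from the question, `</strong>` is inserted before `</p>` and a final `</div>` is appended. The sketch does not handle comments, `<script>` contents, or attribute values containing `>`, and it knows nothing about which elements HTML allows to nest.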
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34945310
Perhaps I should have said, "here's why using regex alone for this is a bad idea..."   = )
 
LVL 16

Author Comment

by:hankknight
ID: 34945412
Thank you all for your insights.  Would it be a better idea to REMOVE the innermost offending tags that are not closed?
 
LVL 75

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 300 total points
ID: 34945437
That's a call you'll have to make. We have no idea what your HTML is, what it does, or how it's structured. At this point, only you would know whether removing an offending tag would still give valid HTML. I can say that, because HTML is (currently) not rigid about closing tags, it is easy to completely wreck your page if you start removing tags solely because you could not find a matching tag.