Solved

PHP/REGEX: Automatically close HTML elements

Posted on 2011-02-21
Last Modified: 2012-06-27
Using PHP, how can I automatically close all HTML tags that need to be closed?
<pre><?php

$html = '
<div>
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

$html = closeHTML($html);

echo htmlentities($html);

function closeHTML($html) {
 // Close all open HTML tags that need to be closed
 return $html;
}

?></pre>

Question by:hankknight
12 Comments
 
LVL 27

Assisted Solution

by:yodercm
yodercm earned 50 total points
ID: 34942509
In PHP, you can echo the closing tags like this:

echo "</td></tr></table>";

but PHP has no magic way to know which tags need to be closed, or where; that is the programmer's job.
 
LVL 27

Assisted Solution

by:Lukasz Chmielewski
Lukasz Chmielewski earned 50 total points
ID: 34942535
If you really need this, here is one idea: collect the tags into an array and compare the number of opening and closing tags for each tag name. If a tag name has more openings than closings, some of those tags were never closed, and you know which kind of tag is affected. Determining where each one should be closed is another matter; see the comment above.
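
The counting idea could be sketched roughly as follows. This is only an illustration: the function name `findUnclosedTags` and the list of void elements are assumptions, and the simple patterns will be fooled by a `>` inside an attribute value.

```php
<?php
// Hypothetical helper: reports tag names that have more opening than closing tags.
function findUnclosedTags(string $html): array
{
    // Void elements never take a closing tag, so they are skipped (assumed list).
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];

    // Collect the names of all opening tags and all closing tags.
    preg_match_all('#<([a-z][a-z0-9]*)\b[^>]*>#i', $html, $open);
    preg_match_all('#</([a-z][a-z0-9]*)\s*>#i', $html, $close);

    // Count how often each (lower-cased) tag name occurs.
    $opened = array_count_values(array_map('strtolower', $open[1]));
    $closed = array_count_values(array_map('strtolower', $close[1]));

    $unclosed = [];
    foreach ($opened as $tag => $count) {
        if (in_array($tag, $void, true)) {
            continue;
        }
        $missing = $count - ($closed[$tag] ?? 0);
        if ($missing > 0) {
            $unclosed[$tag] = $missing; // tag name => number of missing closers
        }
    }
    return $unclosed;
}
```

For the HTML in the question this reports one missing `</div>` and one missing `</strong>`, but it says nothing about where they belong.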
 
LVL 34

Assisted Solution

by:Beverley Portlock
Beverley Portlock earned 50 total points
ID: 34942707
You could try running the markup through HTML Tidy, which can do some of this:

http://tidy.sourceforge.net/
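
From PHP, Tidy is usually reached through the optional tidy extension. A minimal sketch, assuming that extension is installed; the helper name `closeWithTidy` is made up for illustration, and the exact whitespace of the repaired output depends on the Tidy version and configuration:

```php
<?php
// Hypothetical helper: repairs an HTML fragment with the optional tidy extension.
// Returns null when the extension is missing or the repair fails.
function closeWithTidy(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available on this server
    }

    $config = [
        'show-body-only' => true, // return just the fragment, not a full document
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : $repaired;
}
```

A caller would check for `null` and fall back to some other approach when Tidy is not available.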
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34943025
Regular expressions are NOT going to be a good tool to use for this scenario  = )
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 50 total points
ID: 34943169
I don't think REGEX is the right solution.  And if kaufmed does not have a REGEX for it, it probably does not exist!

I think the right solution is to use a design pattern that relies on careful indenting of your control structures and tag sequences, and that separates your logic from your presentation layer.  Then use the W3C validator to find and fix any HTML hiccups.  If you code to strict standards you will have no trouble.
http://validator.w3.org/
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34943276
Oh Ray_Paseur, you're such a kidder    = )
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 34943407
@kaufmed  +1  :-D

 
LVL 16

Author Comment

by:hankknight
ID: 34944340
OK, I managed to get something to work, but only for one tag.  If there is a single unclosed tag, my code will fix it.  If there are six unclosed tags, it will only fix the first one.

I understand that the code it creates may not be valid, but I cannot use the Tidy extension for this. So even if it closes a <strong> tag in the wrong place, that is fine.
<pre><?php

$html = '
<div id="abc">
 <div>
  <p>
   <strong>Hello
  </p>
 </div>
';

echo htmlentities($html);
echo '<hr />';
echo htmlentities(closeTags($html));

function closeTags($html) {
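    // Opening tags only; the grouped lookahead applies \b to every void-element name, not just "input"
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);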
    $openedtags = $result[1];
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
} 

?></pre>

 
LVL 74

Accepted Solution

by:käµfm³d 👽
käµfm³d 👽 earned 300 total points
ID: 34945284
Here's why using regex for this is a bad idea:

Regex can be used to tokenize an input, because you are saying with your pattern, "I expect valid tokens to fit this pattern." For the sake of argument, let's take tokens to mean anything not whitespace. Just having tokens, however, doesn't allow you to say which tokens match each other (e.g. beginning and ending tags). A parser (crudely, regex with a stack) is what you would use to keep track of tokens.

You need some way of denoting when you have found an opening tag. This is where a stack comes into play. With a stack, you can track when you encounter an opening tag (by pushing it onto the stack when you find it), and when you encounter a matching ending tag (by popping the beginning tag off the stack if it is on top of the stack).

What you're trying to do is fake a parser with regex, and this generally doesn't work (except in the ultra-rare, simplistic case). I suggest you look at using an array (stack) to track your tags. You can still use regex to find the tags, but you need the array to keep track of opening/closing tags. You'll still need to define rules for ambiguity. For example, given:

    <html>
        <body>
                <span>
                        <span>
                                <font>hello world!</font>
</span>
        </body>
    </html>

Which <span> does the </span> above close? This is a very simple example, but these are the kinds of decisions you have to make.
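
To make the stack idea concrete, here is a rough sketch rather than a real HTML parser. The function name `closeTagsWithStack`, the void-element list, and the ambiguity rules are all assumptions: a closing tag is matched to the most recently opened tag of the same name, any tags opened after that one are closed first, and a closing tag with no opener is dropped.

```php
<?php
// Hypothetical helper: regex finds the tags, an array used as a stack tracks them.
function closeTagsWithStack(string $html): string
{
    // Void elements are never pushed because they have no closing tag (assumed list).
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];
    $out = '';

    // Split into alternating text and tag tokens, keeping the tags.
    $tokens = preg_split('#(<[^>]*>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($tokens as $token) {
        // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closed.
        if (!preg_match('#^<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>$#i', $token, $m)) {
            $out .= $token; // plain text, comments, doctype: copy through
            continue;
        }
        $name = strtolower($m[2]);

        if ($m[1] === '') { // opening tag
            if ($m[3] === '' && !in_array($name, $void, true)) {
                $stack[] = $name; // push
            }
            $out .= $token;
            continue;
        }

        // Closing tag: find the innermost open tag with the same name.
        $pos = array_search($name, array_reverse($stack, true), true);
        if ($pos === false) {
            continue; // stray closing tag: dropped (policy choice)
        }
        // Close everything that was opened after the matching tag.
        while (count($stack) > $pos + 1) {
            $out .= '</' . array_pop($stack) . '>';
        }
        array_pop($stack); // pop the matching opener
        $out .= $token;
    }

    // Anything still on the stack was never closed: close innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}
```

With these rules, `<div id="abc"><div><p><strong>Hello</p></div>` comes back with `</strong>` inserted before `</p>` and a final `</div>` appended. Different rules for the ambiguous cases would give different, equally defensible output.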
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34945310
Perhaps I should have said, "here's why using regex alone for this is a bad idea..."   = )
 
LVL 16

Author Comment

by:hankknight
ID: 34945412
Thank you all for your insights.  Would it be a better idea to REMOVE the inner-most offending tags which are not closed?
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 300 total points
ID: 34945437
That's a call you'll have to make. We don't know what your HTML is, what it does, or how it's structured, so only you can tell whether removing an offending tag would still give valid HTML. I can say that, given the (currently) non-rigid nature of HTML, it is easy to completely wreck your page if you start removing tags solely because you did not find a matching tag.