regular expression strip and replace tags within a string

Posted on 2006-06-08
Last Modified: 2008-02-01
Dear Experts,

In my previous posts you've noticed that everything which has to do with regular expression is not my cup of T.

For my news system I would like to do the following:

$html_text contains an HTML page.

With the following code I can display the beginning of the news item (stored in $html_text) under the news title, cut to a number of positions ($minimum) when the string is longer than $maximum:

if(strlen($html_text) > $maximum) {
    $html_text = substr($html_text, 0, $minimum);
}

This is all well, but there are some drawbacks:
1. It will stop in the middle of a word (cutting words in two).
2. As I use images, tables, etc. in the string $html_text, the cut can also land in the middle of a tag, well, you know what the problem is.

So I would first like to do the following:
Strip out the <img ... > and replace it with [image: alt-attribute of the img-tag] (if the alt-attribute exists; if not, do nothing).
Strip out the complete <table> from start to end and replace it with ['here is table'].
Then strip out all other html-tags (like <p>, <br>, <font>, <hr>, <strong>, <i>, <u>, <ol>, <ul>, <li>, etc.) and replace them with nothing.
But the <a href=...> hyperlink tags should be left intact.

After that <taggy>-operation, determine how long the output string will be.
If the string is longer than xxx positions, then output yyy positions of that string.
But it has to keep the last word of its output intact (not cut in two).

Thanks in advance for your help !


Question by:gijsbertjr

Expert Comment

ID: 16867843
I did this quick, so let me know if you run into any problems. But this should work:

$regex = array('@<img.*?alt=(\'(?:.*?)\'|"(?:.*?)"|(?:[^\'">\s]*))[^>]*>@ie', '@<table[^>]*>.*?</table>@i');
$replace = array('trim(stripslashes(\'$1\'),\'\\\'"\')', '[\'here is table\']');

$html_text = preg_replace($regex, $replace, $html_text);
$html_text = strip_tags($html_text, '<a>');

if ( strlen($html_text) > $xxx ) {
      // Extend the cut forward to the next whitespace so the last word stays whole.
      $extra = substr($html_text, $xxx);
      $pos = ( preg_match('@\s@', $extra, $m) ) ? strpos($extra, $m[0]) : 0;
      $html_text = substr($html_text, 0, $xxx + $pos);
}
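
A note for readers on current PHP: the `e` pattern modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so the same idea is usually written with `preg_replace_callback` today. What follows is only a minimal sketch, not the code posted above. It assumes PHP 7 or later, assumes the placeholders `[image: ...]`, `[image]` and `[here is table]`, and uses a hypothetical function name, `replace_img_and_table`. It also assumes no `>` character appears inside an attribute value.

```php
<?php
// Hypothetical helper: swaps <img> and <table> markup for text placeholders.
function replace_img_and_table(string $html): string
{
    // Replace each whole <table>...</table> block. The "s" modifier lets "."
    // match newlines, so tables that span several lines are caught too.
    $html = preg_replace('@<table\b[^>]*>.*?</table>@is', '[here is table]', $html);

    // Replace each <img> tag, reading its alt attribute if there is one.
    return preg_replace_callback(
        '@<img\b[^>]*>@i',
        function (array $m): string {
            // The alt value may be double-quoted, single-quoted, or unquoted.
            $pattern = '@\balt\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))@i';
            if (!preg_match($pattern, $m[0], $alt)) {
                return '[image]'; // no alt attribute at all
            }
            // At most one of the three capture groups holds text.
            $text = trim($alt[1] ?? '') . trim($alt[2] ?? '') . trim($alt[3] ?? '');
            return $text === '' ? '[image]' : '[image: ' . $text . ']';
        },
        $html
    );
}
```

Calling `strip_tags($html, '<a>')` on the result would then remove the remaining tags while keeping the links, as in the code above.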

Expert Comment

ID: 16867857
> ...everything which has to do with regular expression is not my cup of T.

If you ever find some free time, and are curious, take a look at this website.

Accepted Solution

Roonaan earned 500 total points
ID: 16869133
Alternative code:

  function limit_html($html, $max=100) {
    $html = strip_tags($html, '<a>,<table>,<img>');
    $html = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '[image:\2]', $html);
    $html = preg_replace('/<table(.*?)table>/im', '[table]', $html);
    $max = 900;
    if(strlen($html) > $max) {
      $html = substr($html, 0, $max);
      // Cut back to the last placeholder, space, or closing link tag.
      $pos = max(strrpos($html, ']'), strrpos($html,' '), strrpos($html, '</a>')+5);
      if($pos > 0) {
        $html = substr($html, 0, $pos);
      }
    }
    return $html;
  }
  echo limit_html( file_get_contents(''), 900);
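
The word-boundary step can also be written on its own. This is a minimal sketch under a few assumptions: the text is single-byte, words are separated by plain spaces, and the function name `truncate_at_word` is hypothetical. Unlike the `max(strrpos(...))` line above, it checks for `false` explicitly, because `strrpos` returns `false` when the needle is missing and `false + 5` evaluates to 5. It does not try to avoid cutting inside a kept `<a ...>` tag.

```php
<?php
// Hypothetical helper: shortens $text to at most $limit bytes without splitting a word.
function truncate_at_word(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text; // already short enough
    }
    // Look one byte past the limit: if that byte is a space,
    // the word ending exactly at the limit is kept whole.
    $head = substr($text, 0, $limit + 1);
    $pos = strrpos($head, ' ');
    if ($pos === false) {
        return substr($text, 0, $limit); // no space found: fall back to a hard cut
    }
    return rtrim(substr($head, 0, $pos));
}
```

For example, `truncate_at_word('The quick brown fox', 12)` cuts back to the end of "quick" instead of stopping inside "brown".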


Expert Comment

ID: 16870990
Roonaan, the only dispute I would have with your code is that it's not very flexible with the <img> tags. It requires that the alt-attribute is enclosed specifically in double quotes (which is not always the case) and that there are no greater-than symbols (>) in the alt-attribute (which is not always the case either). Mine is a little more flexible, since it allows for double quotes, single quotes, or no quotes, and also allows for a greater-than symbol in the alt-attribute. Sure, having that symbol in the alt-attribute is not W3C valid code, but who ever said he'd be parsing valid code to begin with? Also, what if the user had something like this:

Again, not W3C valid by any means, but your regex would not handle it properly. I know that one's a little bit of a stretch, but I think the image one is certainly a valid concern.

And I just realized my code should say this:
$replace = array('\'[image:\'.trim(stripslashes(\'$1\'),\'\\\'"\').\']\'', '[\'here is table\']');

Author Comment

ID: 16875607
Good morning soapergem and Roonaan,

First let me tell you how much I appreciate you guys helping me out! As a 43-year-old youngster with a steep learning curve ahead, the brain starts to refuse to grasp some issues. :)

Thanks soapergem for the regex tutorial tip: I'll check it out this weekend. I've noticed you posted a similar question in the past concerning the <img>-tag.

OK, about the code. Both of them do not perform as I expected.

1) soapergem's code:
a) concerning images:
Produces this [image:drop] (drop being the alt-attribute) when the image has an alt-attribute.
But when the image has no alt-attribute it produces nothing, or rather &nbsp;

b) concerning tables:
It removes all the tags, but leaves the text between the opening and closing table tags.

2) Roonaan's code:
a) concerning images:
Produces this [image:] whether the image has an alt-attribute or not.

b) concerning tables:
The opening and closing table tags are still there, as well as the text.
Of course the other tags such as <tr> have been stripped by the strip_tags function.

3) the solution:
a) images:
Perhaps I described the problem badly, but this is what I would like:
when the image has an alt-attribute ==> [image: alt-attribute]
when the image has no alt-attribute ==> [image]

b) tables:
When a table starts with the <table>-tag, everything that follows (e.g. <tr><td>bla bla bla</td></tr>) up to the first closing table-tag </table> should be replaced with [here is table].
What I noticed with both codes is that stripping the table leaves a lot of white space behind.

c) The other tags are stripped, and in both cases the <a href>-tag still works (thanks to the strip_tags function and its list of tags not to strip).

d) concerning the length of the output:
The output will always be appended with <a href="index.php?news_id=1">Read more ...</a>.
So if the original string is, let's say, 100 positions long, it should output only 80 (which leaves a bit to read).
On the other hand, there should be some flexibility in the length of the output, based on the length of the original string:
if the length of the original string is > 100 and < 200 (output maximum 80)
if the length of the original string is > 199 and < 400 (output maximum 250)
if the length of the original string is > 399 and < 1000 (output maximum 400)
if the length of the original string is > 999 (output maximum 600)
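
The tiers in d) can be sketched as a small lookup function. This is only an illustration: `output_limit` is a hypothetical name, lengths of 100 or less are assumed to fall in the first tier (following the "100 positions, output 80" example), and the third tier is read as starting at 400, since the 250 tier already covers 200 to 399.

```php
<?php
// Hypothetical helper: maps the length of the cleaned text to the maximum output length.
function output_limit(int $length): int
{
    if ($length < 200) {
        return 80;   // assumption: lengths of 100 or less are treated the same way
    }
    if ($length < 400) {
        return 250;
    }
    if ($length < 1000) {
        return 400;  // assumption: this tier covers 400 to 999
    }
    return 600;
}
```

The result would be passed as the cut-off to whatever truncation step is used, for example `substr($html_text, 0, output_limit(strlen($html_text)))` followed by the word-boundary correction.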

Many thanks for helping me out !

best regards,


Author Comment

ID: 16877424

First of all, I found a great tutorial at

I have adapted Roonaan's code to suit my needs.
I left out the whole alt-attribute thing because I didn't find a solution ... yet.

My site is tri-lingual and I use a file to store all the language variables.

$lang_image = "Here is an image"; // language variable
$lang_table = "Here is a table"; // language variable
$lang_extension = " ..."; // language variable

function limit_text ($html_text, $lang_image, $lang_table, $lang_extension) {

    $html_text = strip_tags($html_text, '<a>,<table>,<img>');
    $html_text = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '&nbsp;<i>['.$lang_image.']</i>&nbsp;', $html_text);
    $html_text = preg_replace('/<table(.*?)table>/ims', '&nbsp;<i>['.$lang_table.']</i>&nbsp;', $html_text); // notice the "s" at ims which solves the [here is table] not showing
    $html_text = preg_replace('/\s\s+/', '', $html_text); // get rid of white space

    // depending on the length of the string the length of the output will be different
    if(strlen($html_text) < 100) {
        $html_text = substr($html_text, 0, 80);
    } elseif(strlen($html_text) > 99 && strlen($html_text) < 1000) {
        $html_text = substr($html_text, 0, 250);
    } else {
        $html_text = substr($html_text, 0, 380);
    }
    // cut back to the last placeholder, space or closing link tag so no word is split
    $pos = max(strrpos($html_text, ']'), strrpos($html_text,' '), strrpos($html_text, '</a>')+5);
    if($pos > 0) {
        $html_text = substr($html_text, 0, $pos);
    }
    $html_text = $html_text . $lang_extension; // adding the dots ... at the end
    return $html_text;
}

echo limit_text($news_text, $lang_image, $lang_table, $lang_extension);

If you have any suggestions for improvement, they are welcome.

I'll grant the points to Roonaan as I used his code as the basis for my solution.

Thanks again,

