regular expression strip and replace tags within a string

Posted on 2006-06-08
Last Modified: 2008-02-01
Dear Experts,

In my previous posts you may have noticed that anything to do with regular expressions is not my cup of T.

For my new-system I would like to do the following:

$html_text (contains a html-page);

with the following code I can display under the news title the beginning of the news item (stored in $html_text) for a number of positions ($minimum) when the string is longer than $maximum.

if(strlen($html_text) > $maximum) {
    $html_text = substr($html_text, 0, $minimum);
}

This is all well, but there are some drawbacks:
1. it will stop in the middle of a word (cutting words in two).
2. as I use images, tables, etc. in the string $html_text, the cut can land in the middle of a tag and break the markup.

So I would first like to do the following:
Strip out the <img ... > and replace it with [image: alt-attribute of the img-tag] (if the alt-attribute exists; if not, do nothing).
Strip out the complete <table> from start to end and replace it with ['here is table'].
Then strip out all other html-tags (like <p>, <br>, <font>, <hr>, <strong>, <i>, <u>, <ol>, <ul>, <li>, etc.) and replace them with nothing.
But the <a href=...> hyperlink tags should be left complete.

After that <taggy>-operation, determine how long the resulting string is.
If the string is longer than xxx positions, then output yyy positions of that string.
But it has to keep the last word of its output intact (not cut in two).

Thanks in advance for your help !


Question by:gijsbertjr

Expert Comment

ID: 16867843
I did this quick, so let me know if you run into any problems. But this should work:

$regex = array('@<img.*?alt=(\'(?:.*?)\'|"(?:.*?)"|(?:[^\'">\s]*))[^>]*>@ie', '@<table[^>]*>.*?</table>@i');
$replace = array('trim(stripslashes(\'$1\'),\'\\\'"\')', '[\'here is table\']');

$html_text = preg_replace($regex, $replace, $html_text);
$html_text = strip_tags($html_text, '<a>');

if ( strlen($html_text) > $xxx )
{
      $extra = substr($html_text, $xxx);
      $pos = ( preg_match('@\s@', $extra, $m) ) ? strpos($extra, $m[0]) : 0;
      $html_text = substr($html_text, 0, $xxx + $pos);
}
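
A variation on the length check above, offered as a sketch rather than a replacement: instead of extending the cut forward to the next whitespace, it backs up to the previous one, so the result never exceeds the limit. The function name truncate_at_word and its parameters are made up for this illustration, and it counts bytes the same way strlen() and substr() do.

```php
<?php
// Hypothetical helper (not from the thread): cut $text to at most $limit
// bytes, backing up to the last whitespace so no word is split in two.
function truncate_at_word(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text; // short enough, nothing to cut
    }
    // Take one extra byte so a cut landing exactly on a word end keeps that word.
    $cut = substr($text, 0, $limit + 1);
    // Drop the last whitespace run and whatever partial word follows it.
    $trimmed = preg_replace('/\s+\S*$/', '', $cut);
    // No whitespace at all: fall back to a hard cut at the limit.
    return ($trimmed === $cut) ? substr($text, 0, $limit) : $trimmed;
}

echo truncate_at_word('The quick brown fox jumps', 12), "\n"; // prints: The quick
```

This should be run after the tags have been stripped, since it knows nothing about markup.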

Expert Comment

ID: 16867857
> ...everything which has to do with regular expression is not my cup of T.

If you ever find some free time, and are curious, take a look at this website.

Accepted Solution

Roonaan earned 500 total points
ID: 16869133
Alternative code:

  function limit_html($html, $max=100) {
    $html = strip_tags($html, '<a>,<table>,<img>');
    $html = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '[image:\2]', $html);
    $html = preg_replace('/<table(.*?)table>/im', '[table]', $html);
    $max = 900;
    if(strlen($html) > $max) {
      $html = substr($html, 0, $max);
      $pos = max(strrpos($html, ']'), strrpos($html,' '), strrpos($html, '</a>')+5);
      if($pos > 0) {
        $html = substr($html, 0, $pos);
      }
    }
    return $html;
  }
  echo limit_html( file_get_contents(''), 900);
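
One thing to watch in the $pos calculation above: strrpos() returns false when the needle is not found, so strrpos($html, '</a>')+5 evaluates to 5 in that case, and '</a>' is only four characters long. The last space can also sit inside an <a href="..."> tag. The sketch below is one possible way to avoid ending the output inside a link; cut_outside_anchor is a hypothetical name, and it assumes <a> is the only tag left in the string that starts with "<a".

```php
<?php
// Hypothetical helper (not from the thread): hard-cut $html at $max bytes,
// then back up if the cut ended inside an <a ...> tag or between <a> and </a>.
// Assumes all tags other than <a> have already been stripped.
function cut_outside_anchor(string $html, int $max): string
{
    if (strlen($html) <= $max) {
        return $html;
    }
    $cut   = substr($html, 0, $max);
    $open  = strripos($cut, '<a');   // start of the last opening anchor, or false
    $close = strripos($cut, '</a>'); // start of the last complete closing tag, or false
    // The anchor is still open when "<a" appears after the last complete "</a>".
    if ($open !== false && ($close === false || $open > $close)) {
        $cut = substr($cut, 0, $open); // drop the unfinished link entirely
    }
    return rtrim($cut);
}

$sample = 'Read the <a href="x.php">full story here</a> now';
echo cut_outside_anchor($sample, 30), "\n"; // prints: Read the
```

Dropping the whole unfinished link is a design choice; keeping it and appending a closing </a> would be another option.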


Expert Comment

ID: 16870990
Roonaan, the only dispute I would have with your code is that it's not very flexible with the <img> tags. It requires that the alt-attribute is enclosed specifically in double quotes (which is not always the case) and that there are no greater-than symbols (>) in the alt-attribute (which is not always the case either). Mine is a little more flexible, since it allows for double quotes, single quotes, or no quotes, and also allows for a greater-than symbol in the alt-attribute. Sure, having that symbol in the alt-attribute is not W3C valid code, but who ever said he'd be parsing valid code to begin with? Also, what if the user had something like this:

Again, not W3C valid by any means, but your regex would not handle it properly. I know that one's a little bit of a stretch, but I think the image one is certainly a valid concern.

And I just realized my code should say this:
$replace = array('\'[image:\'.trim(stripslashes(\'$1\'),\'\\\'"\').\']\'', '[\'here is table\']');

Author Comment

ID: 16875607
Good morning soapergem and Roonaan,

First let me tell you how much I appreciate you guys helping me out! As a 43-year-old youngster with a steep learning curve ahead, the brain starts to refuse to grasp some issues. :)

Thanks soapergem for the regex tutorial tip: I'll check it out this weekend. I've noticed you posted a similar question in the past concerning the <img> tag.

OK, about the code. Neither of them performs as I expected.

1) soapergem's code:
a) concerning images:
It produces [image:drop] (drop being the alt attribute) when the image has an alt attribute.
But when the image has no alt attribute it produces nothing, or rather only &nbsp;

b) concerning tables:
It removes all the tags, but leaves the text between the opening and closing table tags.

2) Roonaan's code:
a) concerning images:
It produces [image:] whether the image has an alt attribute or not.

b) concerning tables:
The opening and closing table tags are still there, as well as the text.
Of course the other tags such as <tr> have been stripped by the strip_tags function.

3) the solution:
a) images:
Perhaps I described the problem badly, but this is what I would like:
when the image has an alt attribute ==> [image: alt-attribute]
when the image has no alt attribute ==> [image]
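
The two image cases above could be handled with preg_replace_callback, sketched below under a few assumptions. The /e modifier used earlier in this thread was removed in PHP 7, so a callback is the usual substitute. The name replace_images is made up for this example, and unlike soapergem's pattern it treats the first > as the end of the tag, so a > inside an alt value is not supported.

```php
<?php
// Hypothetical helper (not from the thread): turn every <img> tag into
// "[image: alt text]", or "[image]" when there is no usable alt attribute.
function replace_images(string $html): string
{
    return preg_replace_callback(
        '@<img\b[^>]*>@i',
        function (array $m): string {
            // Accept alt="...", alt='...' or an unquoted alt=value.
            $found = preg_match(
                '@\salt\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))@i',
                $m[0],
                $a
            );
            if ($found) {
                // Only one of the three groups can be non-empty.
                $alt = trim(($a[1] ?? '') . ($a[2] ?? '') . ($a[3] ?? ''));
                if ($alt !== '') {
                    return '[image: ' . $alt . ']';
                }
            }
            return '[image]';
        },
        $html
    );
}

echo replace_images('<p><img src="a.png" alt="drop"> and <img src="b.png"></p>'), "\n";
// prints: <p>[image: drop] and [image]</p>
```

It has to run before strip_tags(), or with <img> in the list of allowed tags, otherwise the images are gone before the pattern sees them.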

b) tables:
When a table starts with the <table> tag, everything that follows (e.g. <tr><td>bla bla bla</td></tr>) up to and including the first closing </table> tag should be replaced with [here is table].
What I noticed with both codes is that stripping the table leaves a lot of white space behind.
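
A possible way to get this table behaviour, as a sketch only: the "s" modifier lets "." match newlines, so a table spread over several lines is matched as a whole, and the lazy ".*?" stops at the first </table>. The function name replace_tables and the default placeholder are assumptions made for this example.

```php
<?php
// Hypothetical helper (not from the thread): replace each <table>...</table>
// block with a placeholder and tidy up the whitespace that is left behind.
function replace_tables(string $html, string $placeholder = '[here is table]'): string
{
    // i = case-insensitive, s = "." also matches newlines.
    $html = preg_replace('@<table\b[^>]*>.*?</table\s*>@is', $placeholder, $html);
    // Collapse every whitespace run to one space rather than to nothing,
    // so neighbouring words are not glued together.
    return trim(preg_replace('/\s+/', ' ', $html));
}

$sample = "Intro\n<table border=1>\n<tr><td>bla bla bla</td></tr>\n</table>\n\n  Outro";
echo replace_tables($sample), "\n"; // prints: Intro [here is table] Outro
```

Like the image replacement, this needs to see the <table> tags, so it belongs before strip_tags() or <table> must be in the allowed list. A placeholder containing $ or \ followed by digits would be read as a backreference by preg_replace.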

c) the other tags are stripped and in both cases the <a href> tag still works (the strip_tags function and its exceptions for which tags not to strip)

d) concerning the length of the output:
The output will always be appended with <a href="index.php?news_id=1">Read more ...</a>.
So if the original string is, let's say, 100 positions long, it should output only 80 (which leaves a bit to read).
On the other hand there should be some flexibility in the length of the output based on the length of the original string.
if the length of the original string is > 100 and < 200 (output maximum 80)
if the length of the original string is > 199 and < 400 (output maximum 250)
if the length of the original string is > 399 and < 1000 (output maximum 400)
if the length of the original string is > 999 (output maximum 600)
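
The tiers listed above could be expressed as a small lookup function. This is a sketch under two assumptions that are not in the post: the third tier is taken to start at 400 so the ranges do not overlap, and anything shorter than 200 positions gets the 80 limit. The name output_limit is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): map the length of the original
// string to the maximum number of positions to output.
function output_limit(int $length): int
{
    if ($length >= 1000) {
        return 600;
    }
    if ($length >= 400) { // assumed lower bound of the third tier
        return 400;
    }
    if ($length >= 200) {
        return 250;
    }
    return 80; // assumed to cover everything below 200
}

echo output_limit(350), "\n"; // prints: 250
```

The result would then be passed as the limit to whatever word-safe truncation is used, for example output_limit(strlen($html_text)).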

Many thanks for helping me out !

best regards,


Author Comment

ID: 16877424

First of all, I found a great tutorial at

I have adapted Roonaan's code to suit my needs.
I left out the whole alt-attribute thing because I didn't find a solution ... yet.

My site is tri-lingual and I use a file to store all the language variables.

$lang_image = "Here is an image"; // language variable
$lang_table = "Here is a table"; // language variable
$lang_extension = " ..."; // language variable

function limit_text ($html_text, $lang_image, $lang_table, $lang_extension) {

    $html_text = strip_tags($html_text, '<a>,<table>,<img>');
    $html_text = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '&nbsp;<i>['.$lang_image.']</i>&nbsp;', $html_text);
    $html_text = preg_replace('/<table(.*?)table>/ims', '&nbsp;<i>['.$lang_table.']</i>&nbsp;', $html_text); // notice the "s" at ims which solves the [here is table] not showing
    $html_text = preg_replace('/\s\s+/', '', $html_text); // get rid of white space

    // depending on the length of the string the length of the output will be different
    if(strlen($html_text) < 100) {
        $html_text = substr($html_text, 0, 80);
    } elseif(strlen($html_text) > 99 && strlen($html_text) < 1000) {
        $html_text = substr($html_text, 0, 250);
    } else {
        $html_text = substr($html_text, 0, 380);
    }
    $pos = max(strrpos($html_text, ']'), strrpos($html_text,' '), strrpos($html_text, '</a>')+5);
    if($pos > 0) {
        $html_text = substr($html_text, 0, $pos);
    }
    $html_text = $html_text . $lang_extension; // adding the dots ... at the end
    return $html_text;
}

echo limit_text($news_text, $lang_image, $lang_table, $lang_extension);

If you have any suggestions for improvement, they are welcome.

I'll grant the points to Roonaan as I used his code as the basis for my solution.

Thanks again,

