regular expression strip and replace tags within a string

Posted on 2006-06-08
Medium Priority
Last Modified: 2008-02-01
Dear Experts,

In my prevoius posts you've noticed that everything which has to do with regular expression is not my cup of T.

For my new-system I would like to do the following:

$html_text (contains a html-page);

with the following code I can display under the news title the beginning of the news item (stored in $html_text) for a number of positions ($minimum) when the string is longer than $maximum.

if(strlen($html_text) > $maximum) {
$html_text = substr($html_text, 0, $minimum);

This is al well, but there are some drawbacks:
1. it will stop in the middle of a word (cutting words in two).
2. as I use images, tables, ect... in the string $html_text, well you know what's the problem.

So I would first like to do the following:
strip out the <img ... > and replace it with [image: alt-atribute of the img-tag (if the alt-attribute exist, if not: do nothing]
strip out the complete <table> from start till ending and replace it with ['here is tatble']
Then strip out all html-tags (like <p>, <br>, <font>, <hr>, <strong>, <i>,<u>, <ol>, <ul> ,<li>, etc...) and replace with nothing.
But the <a href=...> hyperling tags should be left complete.

After that <taggy>-operation determine how long the output of a string will be.
If the string is longer than xxx-positions then output yyy-postions of that string.
But it has to keep the last word of its output in tact (not cut in two).

Thanks in advance for your help !


Question by:gijsbertjr
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 2

Expert Comment

ID: 16867843
I did this quick, so let me know if you run into any problems. But this should work:

$regex = array('@<img.*?alt=(\'(?:.*?)\'|"(?:.*?)"|(?:[^\'">\s]*))[^>]*>@ie', '@<table[^>]*>.*?</table>@i');
$replace = array('trim(stripslashes(\'$1\'),\'\\\'"\')', '[\'here is table\']');

$html_text = preg_replace($regex, $replace, $html_text);
$html_text = strip_tags($html_text, '<a>');

if ( strlen($html_text) > $xxx )
      $extra = substr($html_text, $xxx);
      $pos = ( preg_match('@\s@', $extra, $m) ) ? strpos($extra, $m[0]) : 0;
      $html_text = substr($html_text, 0, $xxx + $pos);

Expert Comment

ID: 16867857
> ...everything which has to do with regular expression is not my cup of T.

If you ever find some free time, and are curious, take a look at this website.
LVL 49

Accepted Solution

Roonaan earned 2000 total points
ID: 16869133
Alternative code:

  function limit_html($html, $max=100) {
    $html = strip_tags($html, '<a>,<table>,<img>');
    $html = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '[image:\2]', $html);
    $html = preg_replace('/<table(.*?)table>/im', '[table]', $html);
    $max = 900;
    if(strlen($html) > $max) {
      $html = substr($html, 0, $max);
      $pos = max(strrpos($html, ']'), strrpos($html,' '), strrpos($html, '</a>')+5);
      if($pos > 0) {
        $html = substr($html, 0, $pos);
    return $html;
  echo limit_html( file_get_contents('http://www.experts-exchange.com'), 900);

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!


Expert Comment

ID: 16870990
Roonan, the only dispute I would have with your code is that it's not very flexible with the <img> tags. It requires that the alt-attribute is enclosed specifically with double quotes (which is not always the case) and that there are no greater-than symbols (>) in the alt-attribute (which is not always the case). Mine is a little more flexible, since it allows for double quotes, single quotes, or no quotes, and also allows for a greater-than symbol in the alt-attribute. Sure, having that symbol in the alt-attribute is not W3C valid code, but who ever said he'd be parsing valid code to begin with? Also, what if the user had something like this:

Again, not W3C valid by any means, but your regex would not handle it properly. I know that one's a little bit of a stretch, but I think the image one is certainly a valid concern.

And I just realized my code should say this:
$replace = array('\'[image:\'.trim(stripslashes(\'$1\'),\'\\\'"\').\']\'', '[\'here is table\']');

Author Comment

ID: 16875607
Goodmorning  soapergem and Roonaan,

First let me tell you how much I appreciate you guys helping me out ! As a 43 year old youngster, with a steap learning curve ahead, the brain starts to refuses to grasp some issues. :)

Thanks soapergem for the regex tutorial tip: I'll check it out this weekend. I've noticed you posted a similar question in the past concerning <img>-tag.

Ok, about the the code. Both of them do not perform as I expected.

1) soapergem's code:
a) concerning images:
Produces this [image:drop] (drop being the alt-attrib) when the image has an alt-attribute.
But when the image has no alt-attribute it produces nothing or better &nbsp;

b) concerning table
it removes al the tags, but leaves the text between the opening and closing table tags.

2) Roonan's code
a) concerning images:
Produces this [image:] wheter the image has an alt-attribute or not.

b) concerning table:
the opening and closing table tag are still there, aswell as the text.
of course the other tags as <tr> have been strip with strip_tags function

3) the solution:
a) images:
perhaps I've badly discribed  the problem, but that is what I would like:
when image has an alt-attribute ==> [image: alt-atrib]
when image has no alt-attribute ==> [image]

b) tables:
when a table starts with the <table>-tag, everything that follows ( f.e. <tr><td>bla bla bla</td></tr>) until the first closing table-tag </table> should be replaced with [here is table].
What I noticed with both codes is that stripping the table leaves a lot of white space behind.

c) the other tags are stripped and in both case the <a href>-tag still works (strip_tag function and the execptions which ones not to strip)

d) concerning the lenght of the output:
The ouput will allways be appended with <a href="index.php?news_id=1>Read more ...</a>.
So if the original string is let's say 100 positions long it should output only 80 (leave's a bit to read).
On the other hand there should be some flexibility in the lenght of the output based on the lenght of the original string.
if the length of the orginal string is  > 100 and < 200 (output maximum 80)
if the length of the origanal string is >199 and < 400 (output maximum 250)
if the length of the original string is > 199 and < 1000 (output maximum 400)
if the length of the original string is > 999 (outpunt maximum 600)

Many thanks for helping me out !

best regards,


Author Comment

ID: 16877424

First off all i found a great tutorial at http://www.tote-taste.de/X-Project/regex/index.php.

I have adapted Roonaan's code to suit my needs.
I left out the whole alt-attribute thing because it did'nt find a solution ... yet.

My site is tri-lingual and a use a file to store all the language variables.

$lang_image = "Here is an image"; // language variable
$lang_table = "Here is a table"; // language variable
$lang_extension = " ..."; // language variable

function limit_text ($html_text, $lang_image, $lang_table, $lang_extension) {

$html_text = strip_tags($html_text, '<a>,<table>,<img>');
$html_text = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '&nbsp;<i>['.$lang_image.']</i>&nbsp;', $html_text);
$html_text = preg_replace('/<table(.*?)table>/ims', '&nbsp;<i>['.$lang_table.']</i>&nbsp;', $html_text); // notice the "s" at ims which solves the [here is table] not showing
$html_text = preg_replace('/\s\s+/', '', $html_text); // get rid of white space

// depending on the lenght of the string the length of the output will be different
    if(strlen($html_text) < 100) {
    $html_text = substr($html_text, 0, 80);
    } elseif(strlen($html_text) > 99 && strlen($html_text) < 1000) {
    $html_text = substr($html_text, 0, 250);
    } else {
    $html_text = substr($html_text, 0, 380);
    $pos = max(strrpos($html_text, ']'), strrpos($html_text,' '), strrpos($html_text, '</a>')+5);
    if($pos > 0) {
        $html_text = substr($html_text, 0, $pos);
    $html_text = $html_text . $lang_extension; // adding the dots ... at the end
    return $html_text;

echo limit_text($news_text, $lang_image, $lang_table, $lang_extension);

If you have any suggestions for improvement, there are welcome.

I'll grant the point to Roonan as I used his code as the basis for my solution.

Thanks again,


Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I imagine that there are some, like me, who require a way of getting currency exchange rates for implementation in web project from time to time, so I thought I would share a solution that I have developed for this purpose. It turns out that Yaho…
This article discusses four methods for overlaying images in a container on a web page
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
Suggested Courses

649 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question