• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1855
  • Last Modified:

regular expression strip and replace tags within a string

Dear Experts,

In my prevoius posts you've noticed that everything which has to do with regular expression is not my cup of T.

For my new-system I would like to do the following:

$html_text (contains a html-page);

with the following code I can display under the news title the beginning of the news item (stored in $html_text) for a number of positions ($minimum) when the string is longer than $maximum.

code:
if(strlen($html_text) > $maximum) {
$html_text = substr($html_text, 0, $minimum);
}

This is al well, but there are some drawbacks:
1. it will stop in the middle of a word (cutting words in two).
2. as I use images, tables, ect... in the string $html_text, well you know what's the problem.


So I would first like to do the following:
strip out the <img ... > and replace it with [image: alt-atribute of the img-tag (if the alt-attribute exist, if not: do nothing]
strip out the complete <table> from start till ending and replace it with ['here is tatble']
Then strip out all html-tags (like <p>, <br>, <font>, <hr>, <strong>, <i>,<u>, <ol>, <ul> ,<li>, etc...) and replace with nothing.
But the <a href=...> hyperling tags should be left complete.

After that <taggy>-operation determine how long the output of a string will be.
If the string is longer than xxx-positions then output yyy-postions of that string.
But it has to keep the last word of its output in tact (not cut in two).

Thanks in advance for your help !

Gijs










0
gijsbertjr
Asked:
gijsbertjr
  • 3
  • 2
1 Solution
 
soapergemCommented:
I did this quick, so let me know if you run into any problems. But this should work:

<?php
$regex = array('@<img.*?alt=(\'(?:.*?)\'|"(?:.*?)"|(?:[^\'">\s]*))[^>]*>@ie', '@<table[^>]*>.*?</table>@i');
$replace = array('trim(stripslashes(\'$1\'),\'\\\'"\')', '[\'here is table\']');

$html_text = preg_replace($regex, $replace, $html_text);
$html_text = strip_tags($html_text, '<a>');

if ( strlen($html_text) > $xxx )
{
      $extra = substr($html_text, $xxx);
      $pos = ( preg_match('@\s@', $extra, $m) ) ? strpos($extra, $m[0]) : 0;
      $html_text = substr($html_text, 0, $xxx + $pos);
}
?>
0
 
soapergemCommented:
> ...everything which has to do with regular expression is not my cup of T.

If you ever find some free time, and are curious, take a look at this website.
http://www.regular-expressions.info/tutorial.html
0
 
RoonaanCommented:
Alternative code:

<?php
  function limit_html($html, $max=100) {
    $html = strip_tags($html, '<a>,<table>,<img>');
    $html = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '[image:\2]', $html);
    $html = preg_replace('/<table(.*?)table>/im', '[table]', $html);
    $max = 900;
    if(strlen($html) > $max) {
      $html = substr($html, 0, $max);
      $pos = max(strrpos($html, ']'), strrpos($html,' '), strrpos($html, '</a>')+5);
      if($pos > 0) {
        $html = substr($html, 0, $pos);
      }
    }
    return $html;
  }
  echo limit_html( file_get_contents('http://www.experts-exchange.com'), 900);
?>

-r-
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
soapergemCommented:
Roonan, the only dispute I would have with your code is that it's not very flexible with the <img> tags. It requires that the alt-attribute is enclosed specifically with double quotes (which is not always the case) and that there are no greater-than symbols (>) in the alt-attribute (which is not always the case). Mine is a little more flexible, since it allows for double quotes, single quotes, or no quotes, and also allows for a greater-than symbol in the alt-attribute. Sure, having that symbol in the alt-attribute is not W3C valid code, but who ever said he'd be parsing valid code to begin with? Also, what if the user had something like this:
    <table><tr><td>table></td></tr></table>

Again, not W3C valid by any means, but your regex would not handle it properly. I know that one's a little bit of a stretch, but I think the image one is certainly a valid concern.

And I just realized my code should say this:
$replace = array('\'[image:\'.trim(stripslashes(\'$1\'),\'\\\'"\').\']\'', '[\'here is table\']');
0
 
gijsbertjrAuthor Commented:
Goodmorning  soapergem and Roonaan,

First let me tell you how much I appreciate you guys helping me out ! As a 43 year old youngster, with a steap learning curve ahead, the brain starts to refuses to grasp some issues. :)

Thanks soapergem for the regex tutorial tip: I'll check it out this weekend. I've noticed you posted a similar question in the past concerning <img>-tag.

Ok, about the the code. Both of them do not perform as I expected.


1) soapergem's code:
a) concerning images:
Produces this [image:drop] (drop being the alt-attrib) when the image has an alt-attribute.
But when the image has no alt-attribute it produces nothing or better &nbsp;

b) concerning table
it removes al the tags, but leaves the text between the opening and closing table tags.


2) Roonan's code
a) concerning images:
Produces this [image:] wheter the image has an alt-attribute or not.

b) concerning table:
the opening and closing table tag are still there, aswell as the text.
of course the other tags as <tr> have been strip with strip_tags function

3) the solution:
a) images:
perhaps I've badly discribed  the problem, but that is what I would like:
when image has an alt-attribute ==> [image: alt-atrib]
when image has no alt-attribute ==> [image]

b) tables:
when a table starts with the <table>-tag, everything that follows ( f.e. <tr><td>bla bla bla</td></tr>) until the first closing table-tag </table> should be replaced with [here is table].
What I noticed with both codes is that stripping the table leaves a lot of white space behind.

c) the other tags are stripped and in both case the <a href>-tag still works (strip_tag function and the execptions which ones not to strip)

d) concerning the lenght of the output:
The ouput will allways be appended with <a href="index.php?news_id=1>Read more ...</a>.
So if the original string is let's say 100 positions long it should output only 80 (leave's a bit to read).
On the other hand there should be some flexibility in the lenght of the output based on the lenght of the original string.
F.e.
if the length of the orginal string is  > 100 and < 200 (output maximum 80)
if the length of the origanal string is >199 and < 400 (output maximum 250)
if the length of the original string is > 199 and < 1000 (output maximum 400)
if the length of the original string is > 999 (outpunt maximum 600)

Many thanks for helping me out !

best regards,

Gijs
0
 
gijsbertjrAuthor Commented:
Hi,

First off all i found a great tutorial at http://www.tote-taste.de/X-Project/regex/index.php.

I have adapted Roonaan's code to suit my needs.
I left out the whole alt-attribute thing because it did'nt find a solution ... yet.

My site is tri-lingual and a use a file to store all the language variables.

$lang_image = "Here is an image"; // language variable
$lang_table = "Here is a table"; // language variable
$lang_extension = " ..."; // language variable

function limit_text ($html_text, $lang_image, $lang_table, $lang_extension) {

$html_text = strip_tags($html_text, '<a>,<table>,<img>');
$html_text = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '&nbsp;<i>['.$lang_image.']</i>&nbsp;', $html_text);
$html_text = preg_replace('/<table(.*?)table>/ims', '&nbsp;<i>['.$lang_table.']</i>&nbsp;', $html_text); // notice the "s" at ims which solves the [here is table] not showing
$html_text = preg_replace('/\s\s+/', '', $html_text); // get rid of white space

// depending on the lenght of the string the length of the output will be different
    if(strlen($html_text) < 100) {
    $html_text = substr($html_text, 0, 80);
    } elseif(strlen($html_text) > 99 && strlen($html_text) < 1000) {
    $html_text = substr($html_text, 0, 250);
    } else {
    $html_text = substr($html_text, 0, 380);
    }
    $pos = max(strrpos($html_text, ']'), strrpos($html_text,' '), strrpos($html_text, '</a>')+5);
    if($pos > 0) {
        $html_text = substr($html_text, 0, $pos);
        }
    $html_text = $html_text . $lang_extension; // adding the dots ... at the end
            
    return $html_text;
  }

echo limit_text($news_text, $lang_image, $lang_table, $lang_extension);

If you have any suggestions for improvement, there are welcome.

I'll grant the point to Roonan as I used his code as the basis for my solution.

Thanks again,

Gijs
0

Featured Post

Vote for the Most Valuable Expert

It’s time to recognize experts that go above and beyond with helpful solutions and engagement on site. Choose from the top experts in the Hall of Fame or on the right rail of your favorite topic page. Look for the blue “Nominate” button on their profile to vote.

  • 3
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now