Solved

regular expression strip and replace tags within a string

Posted on 2006-06-08
Medium Priority
1,849 Views
Last Modified: 2008-02-01
Dear Experts,

In my previous posts you may have noticed that anything to do with regular expressions is not my cup of tea.

For my news system I would like to do the following:

$html_text (contains a html-page);

With the following code I can display, under the news title, the beginning of the news item (stored in $html_text), cut to $minimum characters when the string is longer than $maximum.

code:
if(strlen($html_text) > $maximum) {
$html_text = substr($html_text, 0, $minimum);
}

This is all well, but there are some drawbacks:
1. It will stop in the middle of a word (cutting words in two).
2. As I use images, tables, etc. in the string $html_text, the cut can land in the middle of a tag, and you know what problem that causes.


So I would first like to do the following:
Strip out each <img ... > and replace it with [image: alt-attribute of the img-tag] (if the alt-attribute exists; if not, do nothing).
Strip out the complete <table>, from start to end, and replace it with ['here is table'].
Then strip out all other html-tags (like <p>, <br>, <font>, <hr>, <strong>, <i>, <u>, <ol>, <ul>, <li>, etc.) and replace them with nothing.
But the <a href=...> hyperlink tags should be left complete.

After that <taggy>-operation, determine how long the output string will be.
If the string is longer than xxx positions, then output yyy positions of that string.
But it has to keep the last word of its output intact (not cut in two).

Thanks in advance for your help !

Gijs

Question by:gijsbertjr
6 Comments
 
LVL 6

Expert Comment

by:soapergem
ID: 16867843
I did this quickly, so let me know if you run into any problems. But this should work:

<?php
$regex = array('@<img.*?alt=(\'(?:.*?)\'|"(?:.*?)"|(?:[^\'">\s]*))[^>]*>@ie', '@<table[^>]*>.*?</table>@i');
$replace = array('trim(stripslashes(\'$1\'),\'\\\'"\')', '[\'here is table\']');

$html_text = preg_replace($regex, $replace, $html_text);
$html_text = strip_tags($html_text, '<a>');

if ( strlen($html_text) > $xxx )
{
      $extra = substr($html_text, $xxx);
      $pos = ( preg_match('@\s@', $extra, $m) ) ? strpos($extra, $m[0]) : 0;
      $html_text = substr($html_text, 0, $xxx + $pos);
}
?>
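
A note for anyone running this on a current PHP version: the /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so the image replacement needs a callback instead. Below is a minimal sketch of that idea, not soapergem's code. replace_images() is a made-up helper name, the sketch assumes no attribute value contains a '>', and it turns an image without alt text into a bare [image], which is what the asker requests later in the thread.

```php
<?php
// Sketch only: replace_images() is a hypothetical helper, not part of the thread.
function replace_images(string $html): string
{
    // Match each whole <img ...> tag; the alt value is extracted in the callback.
    $result = preg_replace_callback(
        '@<img\b[^>]*>@i',
        function (array $m): string {
            // alt may be double-quoted, single-quoted, or unquoted.
            $alt_pattern = '@(?<![\w-])alt\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))@i';
            if (preg_match($alt_pattern, $m[0], $a)) {
                // Only one of the three capture groups can be non-empty.
                $alt = trim(($a[1] ?? '') . ($a[2] ?? '') . ($a[3] ?? ''));
                if ($alt !== '') {
                    return '[image: ' . $alt . ']';
                }
            }
            return '[image]'; // no usable alt attribute
        },
        $html
    );
    return $result ?? $html; // preg_replace_callback() returns null on a PCRE error
}
```

This has to run before strip_tags(), or with '<img>' in the list of allowed tags, otherwise there is nothing left to match.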
 
LVL 6

Expert Comment

by:soapergem
ID: 16867857
> ...everything which has to do with regular expression is not my cup of T.

If you ever find some free time, and are curious, take a look at this website.
http://www.regular-expressions.info/tutorial.html
 
LVL 49

Accepted Solution

by:
Roonaan earned 2000 total points
ID: 16869133
Alternative code:

<?php
  function limit_html($html, $max=100) {
    $html = strip_tags($html, '<a>,<table>,<img>');
    $html = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '[image:\2]', $html);
    $html = preg_replace('/<table(.*?)table>/im', '[table]', $html);
    $max = 900;
    if(strlen($html) > $max) {
      $html = substr($html, 0, $max);
      $pos = max(strrpos($html, ']'), strrpos($html,' '), strrpos($html, '</a>')+5);
      if($pos > 0) {
        $html = substr($html, 0, $pos);
      }
    }
    return $html;
  }
  echo limit_html( file_get_contents('http://www.experts-exchange.com'), 900);
?>

-r-
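
One detail worth knowing about the table pattern: in PCRE the m modifier only changes where ^ and $ match, while the s modifier is what lets the dot cross line breaks. A table spread over several lines therefore needs s. The following is a small sketch under assumptions: replace_tables() and its default placeholder are illustrative names, the match stops at the first closing table tag (so nested tables are not handled), and it has to run while the table tags are still in the string.

```php
<?php
// Sketch only: replace_tables() and the default placeholder are illustrative.
function replace_tables(string $html, string $placeholder = '[here is table]'): string
{
    // 'i' ignores case, 's' lets '.' match newlines,
    // and the lazy '.*?' stops at the first closing </table>.
    $result = preg_replace('@<table\b[^>]*>.*?</table\s*>@is', $placeholder, $html);
    return $result ?? $html; // preg_replace() returns null on a PCRE error
}
```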
 
LVL 6

Expert Comment

by:soapergem
ID: 16870990
Roonaan, the only dispute I would have with your code is that it's not very flexible with the <img> tags. It requires that the alt-attribute is enclosed specifically in double quotes (which is not always the case) and that there are no greater-than symbols (>) in the alt-attribute (which is not always the case either). Mine is a little more flexible, since it allows for double quotes, single quotes, or no quotes, and also allows for a greater-than symbol in the alt-attribute. Sure, having that symbol in the alt-attribute is not W3C valid code, but who ever said he'd be parsing valid code to begin with? Also, what if the user had something like this:
    <table><tr><td>table></td></tr></table>

Again, not W3C valid by any means, but your regex would not handle it properly. I know that one's a little bit of a stretch, but I think the image one is certainly a valid concern.

And I just realized my code should say this:
$replace = array('\'[image:\'.trim(stripslashes(\'$1\'),\'\\\'"\').\']\'', '[\'here is table\']');
 

Author Comment

by:gijsbertjr
ID: 16875607
Good morning soapergem and Roonaan,

First let me tell you how much I appreciate you guys helping me out! As a 43-year-old youngster with a steep learning curve ahead, the brain starts to refuse to grasp some issues. :)

Thanks, soapergem, for the regex tutorial tip: I'll check it out this weekend. I've noticed you posted a similar question in the past concerning the <img> tag.

OK, about the code. Neither of them performs as I expected.


1) soapergem's code:
a) concerning images:
It produces [image:drop] (drop being the alt attribute) when the image has an alt attribute.
But when the image has no alt attribute it produces nothing, or rather &nbsp;

b) concerning tables:
It removes all the tags, but leaves the text between the opening and closing table tags.


2) Roonaan's code:
a) concerning images:
It produces [image:] whether the image has an alt attribute or not.

b) concerning tables:
The opening and closing table tags are still there, as well as the text.
Of course the other tags, such as <tr>, have been stripped by the strip_tags function.

3) the solution:
a) images:
Perhaps I described the problem badly, but this is what I would like:
when the image has an alt attribute ==> [image: alt-attribute]
when the image has no alt attribute ==> [image]

b) tables:
When a table starts with the <table> tag, everything that follows (e.g. <tr><td>bla bla bla</td></tr>) up to the first closing </table> tag should be replaced with [here is table].
What I noticed with both codes is that stripping the table leaves a lot of white space behind.

c) the other tags are stripped, and in both cases the <a href> tag still works (thanks to the strip_tags function and its list of tags not to strip).

d) concerning the length of the output:
The output will always be appended with <a href="index.php?news_id=1">Read more ...</a>.
So if the original string is, let's say, 100 positions long, it should output only 80 (which leaves a bit to read).
On the other hand, there should be some flexibility in the length of the output, based on the length of the original string.
E.g.
if the length of the original string is > 100 and < 200 (output maximum 80)
if the length of the original string is > 199 and < 400 (output maximum 250)
if the length of the original string is > 399 and < 1000 (output maximum 400)
if the length of the original string is > 999 (output maximum 600)
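
The length tiers in d) could be kept in one small function, so the cut-off is decided in a single place. This is only a sketch: output_limit() is a hypothetical name, it assumes the third tier covers 400 to 999 characters, and it assumes strings of 100 characters or fewer fall into the first tier as well.

```php
<?php
// Sketch only: output_limit() is a hypothetical helper name.
// Takes the length of the stripped text and returns the maximum output length.
function output_limit(int $length): int
{
    if ($length < 200) {
        return 80;  // assumption: also used for strings of 100 characters or fewer
    }
    if ($length < 400) {
        return 250;
    }
    if ($length < 1000) {
        return 400; // assumption: this tier is read as 400 to 999
    }
    return 600;
}
```

It would be called as output_limit(strlen($html_text)) after the tags have been replaced and stripped, since that is the string whose length matters.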

Many thanks for helping me out !

best regards,

Gijs
 

Author Comment

by:gijsbertjr
ID: 16877424
Hi,

First of all, I found a great tutorial at http://www.tote-taste.de/X-Project/regex/index.php.

I have adapted Roonaan's code to suit my needs.
I left out the whole alt-attribute thing because I didn't find a solution ... yet.

My site is trilingual and I use a file to store all the language variables.

$lang_image = "Here is an image"; // language variable
$lang_table = "Here is a table"; // language variable
$lang_extension = " ..."; // language variable

function limit_text($html_text, $lang_image, $lang_table, $lang_extension) {
    // Keep only links, tables and images; every other tag is stripped.
    $html_text = strip_tags($html_text, '<a>,<table>,<img>');
    $html_text = preg_replace('/<img([^>]*)(alt="([^"]+)")?([^>]*)>/i', '&nbsp;<i>['.$lang_image.']</i>&nbsp;', $html_text);
    // Notice the "s" in "ims": it lets "." match newlines, which solves the [here is table] not showing.
    $html_text = preg_replace('/<table(.*?)table>/ims', '&nbsp;<i>['.$lang_table.']</i>&nbsp;', $html_text);
    $html_text = preg_replace('/\s\s+/', '', $html_text); // remove runs of two or more whitespace characters

    // Depending on the length of the string, the length of the output will be different.
    if (strlen($html_text) < 100) {
        $html_text = substr($html_text, 0, 80);
    } elseif (strlen($html_text) > 99 && strlen($html_text) < 1000) {
        $html_text = substr($html_text, 0, 250);
    } else {
        $html_text = substr($html_text, 0, 380);
    }
    $pos = max(strrpos($html_text, ']'), strrpos($html_text, ' '), strrpos($html_text, '</a>') + 5);
    if ($pos > 0) {
        $html_text = substr($html_text, 0, $pos);
    }
    $html_text = $html_text . $lang_extension; // add the dots ... at the end

    return $html_text;
}

echo limit_text($news_text, $lang_image, $lang_table, $lang_extension);

If you have any suggestions for improvement, they are welcome.

I'll grant the points to Roonaan, as I used his code as the basis for my solution.

Thanks again,

Gijs
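
On the request for suggestions, two things in the posted function may be worth a second look. The strrpos() cut runs even when the text was not shortened, so the last word is always dropped. And strrpos() returns false when there is no '</a>', so that term of max() evaluates to 5. Below is one possible sketch of the cutting step on its own. truncate_words() is a hypothetical helper, it works on bytes like strlen() and substr() do (multi-byte text would need the mb_ functions), and it assumes the only tags left in the string are links.

```php
<?php
// Sketch only: truncate_words() is a hypothetical helper, not code from the thread.
function truncate_words(string $text, int $limit, string $suffix = ' ...'): string
{
    if (strlen($text) <= $limit) {
        return $text; // short enough: nothing to cut, no suffix
    }
    $cut = substr($text, 0, $limit);

    // Drop a trailing partial word, unless the cut fell exactly between two words.
    if (!ctype_space($text[$limit])) {
        $cut = preg_replace('@\s+\S*$@', '', $cut) ?? $cut;
    }

    // If that leaves an unfinished tag such as '<a' or '<a href="x', drop it too.
    $open = strrpos($cut, '<');
    $close = strrpos($cut, '>');
    if ($open !== false && ($close === false || $close < $open)) {
        $cut = substr($cut, 0, $open);
    }

    // Close a link that was opened but not closed before the cut.
    $opened = preg_match_all('@<a\b@i', $cut);
    $closed = preg_match_all('@</a\s*>@i', $cut);
    if ($opened > $closed) {
        $cut .= '</a>';
    }
    return rtrim($cut) . $suffix;
}
```

A text with no whitespace before the limit is cut at the limit as is, since there is no word boundary to back up to.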