Solved

Regex to remove <img>

Posted on 2006-04-25
Last Modified: 2012-08-17
Hi everybody. This question might be a little trickier than it seems. I need a regular expression to remove HTML image tags. Sure. Easy enough. Why not something like this, right?
$html = preg_replace('/<img[^>]*>/i', '', $html);

Except I was reading a website that pointed out that the alt attribute can contain a greater-than sign (>). In that case, the command above would change something like this:

<img src="next.jpg" alt=">">

Into this:

">

But clearly, I'd like it to delete all of that in one fell swoop. So I'm looking for a similar regular expression replacement that accommodates greater-than signs in the alt attribute (but remember, there's always a possibility that the alt attribute isn't even there to begin with). This has probably already been addressed somewhere, but I had a few extra points to give away. Thanks.
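For anyone who wants to reproduce the problem, here is a minimal sketch. The function name strip_img_naive is made up for illustration; the pattern is the one from the question.

```php
<?php
// The naive pattern from the question: "<img", then everything up to the FIRST ">".
// strip_img_naive is a hypothetical helper name, not part of any library.
function strip_img_naive(string $html): string
{
    return preg_replace('/<img[^>]*>/i', '', $html);
}

// The ">" inside the alt value ends the match early, so the tail of the tag survives.
echo strip_img_naive('<img src="next.jpg" alt=">">'), "\n"; // prints: ">
```

The character class [^>] knows nothing about quotes, so any ">" inside an attribute value looks like the end of the tag.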
Question by:soapergem
8 Comments
 
LVL 6

Author Comment

by:soapergem
ID: 16540096
And one more thing, please also remember that even if the alt attribute is there, it isn't necessarily enclosed in double quotes. Could be single quotes, could be no quotes, could be some invalid combination of single and double quotes. Thanks again.
 
LVL 9

Accepted Solution

by:
blue_hunter earned 500 total points
ID: 16540444
/[\<]+\s*(img)\s*(src)[\=]+(\"|\')*[a-zA-Z0-9\.\/]*(\"|\')*(\s)*(alt)*[\=]*\s*(\"|\')*[a-zA-Z0-9]*(\"|\')*\s*\/*[\>]+
try this

*PS: not yet tested; try enhancing the regexp with the ignore-case flag

 
LVL 6

Author Comment

by:soapergem
ID: 16540498
I tested that on the same example I used above in the question and it still exhibited the same behavior (leaving the extra ">). And yes, I remembered to add the extra / on the end that you omitted. ;) (plus the letter i for case-insensitive). So unfortunately, my tests show that the very complex and well-thought-out expression you came up with doesn't quite do it. But that's a far more complex expression than I could come up with, so maybe you're on the right track. (I wouldn't really know, or rather, I don't really want to take the time to comprehend all of that expression.)

But one more comment: theoretically speaking, the "src" attribute wouldn't have to be there. Obviously you don't get an image without it, so omitting it would do nobody any good, but it seems like your expression requires that it be there.
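Both observations can be checked with a short sketch. The pattern is the one posted above with the closing / and the i flag added, as described; the name strip_img_v1 is made up for illustration.

```php
<?php
// blue_hunter's pattern with the missing closing delimiter and the "i" flag added.
// strip_img_v1 is a hypothetical helper name.
function strip_img_v1(string $html): string
{
    $pattern = '/[\<]+\s*(img)\s*(src)[\=]+(\"|\')*[a-zA-Z0-9\.\/]*(\"|\')*(\s)*(alt)*[\=]*\s*(\"|\')*[a-zA-Z0-9]*(\"|\')*\s*\/*[\>]+/i';
    return preg_replace($pattern, '', $html);
}

// A plain tag is removed completely.
var_dump(strip_img_v1('<img src="a.png" alt="x">'));

// The alt value may only contain letters and digits, so the ">" inside it
// is taken as the end of the tag and the trailing "> is left behind.
var_dump(strip_img_v1('<img src="next.jpg" alt=">">'));

// "src" is mandatory and must come first, so this tag is not touched at all.
var_dump(strip_img_v1('<img alt="x">'));
```

The pattern spells out src and alt in a fixed order, so any other attribute (width, class, title) or a different ordering also prevents a match.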

 
LVL 6

Author Comment

by:soapergem
ID: 16541188
Never mind... I figured it out myself. This works:

/(<img.*?((alt="[^"]*?")|(alt='[^']*?'))[^>]*?>)|(<img[^>]*?>)/i

Thanks for your honest attempt, though, blue_hunter. I'm in a good mood, so I'll give you the points anyway.
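A hedged sketch of how this pattern behaves in practice, including two cases where it still goes wrong. The names IMG_ALT_PATTERN and strip_img_alt are made up for illustration; the pattern is unchanged apart from escaping the single quotes for a PHP string.

```php
<?php
// The pattern from this comment; single quotes are escaped for the PHP string literal.
// IMG_ALT_PATTERN and strip_img_alt are hypothetical names.
const IMG_ALT_PATTERN = '/(<img.*?((alt="[^"]*?")|(alt=\'[^\']*?\'))[^>]*?>)|(<img[^>]*?>)/i';

function strip_img_alt(string $html): string
{
    return preg_replace(IMG_ALT_PATTERN, '', $html);
}

// Works: ">" inside a double- or single-quoted alt value, and tags without alt.
var_dump(strip_img_alt('<img src="next.jpg" alt=">">'));
var_dump(strip_img_alt("<img alt='a > b' src=\"x.png\">"));
var_dump(strip_img_alt('<img src="x.png">text'));

// Still broken: a ">" in any attribute other than alt leaves "> behind.
var_dump(strip_img_alt('<img src="x.png" title=">">'));

// Over-matching: ".*?" may run past the end of the first tag to reach an alt
// attribute in a later tag on the same line, deleting the text in between.
var_dump(strip_img_alt('<img src="a.jpg"> keep me <img src="b.jpg" alt="b">'));
```

The last case returns an empty string, so "keep me" is lost. That is a side effect of letting .*? search for alt= without stopping at the tag boundary.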
 
LVL 6

Author Comment

by:soapergem
ID: 16541283
And now I *really* found a good answer, probably a better one, thanks to this:

http://haacked.com/archive/2004/10/25/1471.aspx
 
LVL 6

Author Comment

by:soapergem
ID: 16541299
So this is probably the most reliable, according to that article:
</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>
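The expression above is posted without delimiters, so here is a minimal sketch of one way to use it with preg_replace. The # delimiter, the i flag, and the names IMG_TAG_PATTERN and strip_img_tags are my assumptions, not part of the original post or the linked article.

```php
<?php
// Pattern from the post above. "#" is used as the delimiter (an assumption) so the
// "/" characters inside the pattern need no escaping; the "i" flag is also added here.
const IMG_TAG_PATTERN = '#</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>#i';

// strip_img_tags is a hypothetical helper name.
function strip_img_tags(string $html): string
{
    // preg_replace() returns null on failure (e.g. backtrack limit); fall back to the input.
    return preg_replace(IMG_TAG_PATTERN, '', $html) ?? $html;
}

$samples = [
    '<img src="next.jpg" alt=">">',     // ">" inside a double-quoted value
    "<img src='a.png' alt='a > b' />",  // single quotes, XHTML-style close
    '<img src=a.png alt=arrow>',        // unquoted values
    '<p>kept<IMG></p>',                 // no attributes, upper case
];
foreach ($samples as $sample) {
    var_dump(strip_img_tags($sample));
}
```

The first three samples come back as empty strings and the last as <p>kept</p>. Two limits worth knowing: attribute names are matched with \w+, so a tag with a hyphenated attribute such as data-x="1" is not matched, and mismatched quotes are not handled either.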
 
LVL 9

Expert Comment

by:blue_hunter
ID: 16542820
I came back to look at this question again; thanks for the points.
That is a pretty cool regular expression (in your latest post).

I had left out quite a lot of <img> attributes; thanks for reminding me.

Cheers.
 

Expert Comment

by:jimmieandersson
ID: 38304702
Thank you, that actually worked.
I got what I wanted; however, it's really slow.

This is what I did:
// wb1 is assumed to be a WebBrowser field declared elsewhere.
wb1 = new WebBrowser();
wb1.ScrollBarsEnabled = false;
wb1.ScriptErrorsSuppressed = true;
wb1.DocumentText = source;
wb1.AllowNavigation = false;

// Pump the message loop until the control has finished loading the document.
while (wb1.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); }

var collection = wb1.Document.GetElementsByTagName("a");

wb1.Dispose();



The source string I assign to DocumentText is downloaded with WebClient. Before assigning it, I use the regex to strip all image tags and <link rel="stylesheet"> tags, which speeds up loading.

I also tried running this on multiple threads, but the WebBrowser control didn't like that. I now start the application multiple times instead.

Not a perfect solution, but the biggest problem is the speed.

If someone has a better approach, I would very much like to take a look at it.
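One possible alternative, sketched in PHP (the language of the original question) rather than C#: hand the markup to an HTML parser instead of a browser control, so nothing is rendered and no images or stylesheets are fetched. This assumes PHP's DOM extension is available; extract_links is a made-up name, and whether this is actually faster for a given page is untested here.

```php
<?php
// Collect the href of every <a> element using PHP's DOM extension (assumed installed).
// extract_links is a hypothetical helper name.
function extract_links(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

// The parser copes with ">" inside attribute values, so no regex pre-cleaning is needed.
print_r(extract_links('<p><img src="next.jpg" alt=">"><a href="/one">One</a> <a href=\'/two\'>Two</a></p>'));
```

Because the parser only builds a tree and never loads sub-resources, stripping the <img> and stylesheet tags beforehand becomes unnecessary with this approach.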