Regex to remove <img>

Hi everybody. This question might be a bit more involved than it seems. I need a regular expression to remove HTML image tags. Sure, easy enough. Why not something like this, right?
$html = preg_replace('/<img[^>]*>/i', '', $html);

Except I was reading a website that pointed out that the alt attribute might contain a greater-than sign (>). In that case, the command above would change something like this:

<img src="next.jpg" alt=">">

Into this:

">
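
For anyone who wants to reproduce this, here is a minimal sketch that runs the naive pattern from the question against the example tag. The surrounding <p> markup and the variable names are only illustrative.

```php
<?php
// Naive pattern from the question: it stops at the first '>' it sees,
// even when that '>' sits inside a quoted attribute value.
$naive = '/<img[^>]*>/i';

// Example tag from the question, wrapped in a paragraph for context.
$html = '<p>Next: <img src="next.jpg" alt=">"></p>';

$result = preg_replace($naive, '', $html);

// Only part of the tag is removed; the tail of the alt value is left behind.
echo $result, PHP_EOL; // prints: <p>Next: "></p>
```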

But clearly, I'd like it to delete all of that in one fell swoop. So I'm looking for a similar regular expression replacement that accommodates greater-than signs in the alt attribute (but remember, there's always a possibility that the alt attribute isn't even there to begin with). This has probably already been addressed somewhere, but I had a few extra points to give away. Thanks.
Asked by: soapergem

soapergem (Author) commented:
And one more thing, please also remember that even if the alt attribute is there, it isn't necessarily enclosed in double quotes. Could be single quotes, could be no quotes, could be some invalid combination of single and double quotes. Thanks again.

blue_hunter (Technical Consultant) commented:
/[\<]+\s*(img)\s*(src)[\=]+(\"|\')*[a-zA-Z0-9\.\/]*(\"|\')*(\s)*(alt)*[\=]*\s*(\"|\')*[a-zA-Z0-9]*(\"|\')*\s*\/*[\>]+
Try this.

P.S. Not yet tested. Try enhancing the regexp with the ignore-case flag.

soapergem (Author) commented:
I tested that on the same example I used in the question, and it still exhibited the same behavior (leaving the extra ">). And yes, I remembered to add the closing / that you omitted ;) (plus the letter i for case-insensitive matching). So unfortunately, my tests show that the very complex and well-thought-out expression you came up with doesn't quite do it. But that's a far more complex expression than I could come up with, so maybe you're on the right track. (I wouldn't really know, or rather, I don't really want to take the time to comprehend all of that expression.)

But one more comment: theoretically speaking, the "src" attribute wouldn't have to be there. Obviously you don't get an image without it, so omitting it would do nobody any good, but it seems like your expression requires that it be there.

soapergem (Author) commented:
Never mind... I figured it out myself. This works:

/(<img.*?((alt="[^"]*?")|(alt='[^']*?'))[^>]*?>)|(<img[^>]*?>)/i

Thanks for your honest attempt, though, blue_hunter. I'm in a good mood, so I'll give you the points anyway.
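
A quick way to check this pattern is to run it over a few sample tags. The sketch below does that; the sample strings are made up for illustration. It also shows one caveat worth knowing: because the lazy .*? is not limited to a single tag, an <img> without an alt attribute can be matched together with everything up to a later alt attribute on the same line.

```php
<?php
// Pattern from the comment above, unchanged (single quotes escaped for PHP).
$pattern = '/(<img.*?((alt="[^"]*?")|(alt=\'[^\']*?\'))[^>]*?>)|(<img[^>]*?>)/i';

$samples = [
    '<img src="next.jpg" alt=">">',   // '>' inside a double-quoted alt
    "<img alt='>' src='next.jpg'>",   // '>' inside a single-quoted alt
    '<img src="plain.jpg">',          // no alt attribute at all
];

$results = [];
foreach ($samples as $html) {
    $results[] = preg_replace($pattern, '', $html);
}
var_dump($results); // every entry is the empty string

// Caveat: the first <img> has no alt, so ".*?" keeps scanning and stops at the
// alt attribute of the second tag, taking the text in between with it.
$twoTags = '<img src="a.jpg"> keep me <img src="b.jpg" alt="b">';
$overMatched = preg_replace($pattern, '', $twoTags);
var_dump($overMatched); // string(0) "" (the "keep me" text is lost)
```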

soapergem (Author) commented:
And now I *really* found a good answer, probably a better one, thanks to this:

http://haacked.com/archive/2004/10/25/1471.aspx

soapergem (Author) commented:
So this is probably the most reliable, according to that article:
</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>
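
As posted, the pattern has no delimiters, so it can't be passed to preg_replace() as-is. Here is a minimal sketch of one way to wire it up, assuming ~ as the delimiter (so the / characters inside need no escaping) and the i flag. The helper name strip_img_tags is hypothetical, and falling back to the original HTML on a PCRE error is just one possible choice.

```php
<?php
// Pattern from the comment above, wrapped in "~" delimiters (an assumption of
// this sketch) with the case-insensitive flag added.
$pattern = '~</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>~i';

// Hypothetical helper: remove <img> tags, or return the input untouched if
// preg_replace() reports an error by returning null.
function strip_img_tags(string $html, string $pattern): string
{
    return preg_replace($pattern, '', $html) ?? $html;
}

$samples = [
    '<img src="next.jpg" alt=">">',      // '>' inside a double-quoted value
    "<img src='a.png' alt='x > y' />",   // single quotes, self-closing
    '<IMG SRC=a.png>',                   // upper case, unquoted value
    '<img>',                             // no attributes at all
];
foreach ($samples as $html) {
    var_dump(strip_img_tags($html, $pattern)); // each prints string(0) ""
}

// Limitation: \w+ does not match hyphens, so a tag with a hyphenated
// attribute name is left in place.
$hyphenated = strip_img_tags('<img data-src="a.png">', $pattern);
var_dump($hyphenated === '<img data-src="a.png">'); // bool(true)
```

Mismatched or unbalanced quotes are not handled by this pattern either, so it is still a best-effort cleanup rather than a full HTML parse.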

blue_hunter (Technical Consultant) commented:
I came back to look at this question again. Thanks for the points.
That's a pretty cool regular expression (in your latest post).

I had left out quite a lot of <img> attributes; thanks for reminding me.

Cheers.







jimmieandersson commented:
Thank you, that actually worked.
I got what I wanted; however, it's really slow.

This is what I did:
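    // Load the pre-cleaned HTML into a WebBrowser control.
    wb1 = new WebBrowser();
    wb1.ScrollBarsEnabled = false;
    wb1.ScriptErrorsSuppressed = true;
    wb1.DocumentText = source;
    wb1.AllowNavigation = false;

    // Wait for the document to finish loading, pumping the message loop meanwhile.
    while (wb1.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); }

    // Collect every <a> element from the parsed document.
    var collection = wb1.Document.GetElementsByTagName("a");

    wb1.Dispose();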

The source string I assign to DocumentText is downloaded with WebClient. Before assigning it, I use a regex to remove all image tags and <link rel="stylesheet"> tags to speed things up.

I also tried running this on multiple threads, but the WebBrowser control didn't like that. I now start the application multiple times instead.

Not a perfect solution, but the biggest problem is the speed.

If someone has a better approach, I would very much like to take a look at it.
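
For reference, the pre-cleaning step described here could look roughly like the following in PHP (the language of the original question, not the C# used in this comment). The helper name strip_heavy_tags and the <link> pattern are assumptions made for this sketch. The <img> pattern is the one quoted earlier in the thread.

```php
<?php
// Hypothetical helper: drop <img> tags and stylesheet <link> tags from
// downloaded HTML before it is handed to a parser or browser control.
function strip_heavy_tags(string $html): string
{
    $patterns = [
        // Attribute-aware <img> pattern quoted earlier in this thread.
        '~</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>~i',
        // Assumed pattern for <link ... rel=stylesheet ...>. It does not
        // handle a '>' inside an attribute value.
        '~<link\b[^>]*\brel\s*=\s*["\']?stylesheet["\']?[^>]*>~i',
    ];

    // preg_replace() returns null on error; fall back to the original HTML.
    return preg_replace($patterns, '', $html) ?? $html;
}

$source = '<link rel="stylesheet" href="a.css"><a href="/x"><img src="x.png" alt=">"></a>';
echo strip_heavy_tags($source), PHP_EOL; // prints: <a href="/x"></a>
```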