JAndyEvans

Better regular expressions or programmatically reformatting malformed IMG tags found in HTML

In a web-based application that I am writing, I need to read in a URL's HTML, use Regex to extract the IMG tags, and read each tag's SRC attribute. Everything worked fine until this morning, when I found that one URL's IMG tags appear to be improperly formatted (i.e. <img href=http://www.myurl.com/images/imagesource.gif width=1>).

My code is below using the following regex:

REGEX_MATCH_IMGSRC = <img\s+src\s*=\s*""?(?<img>[^\s""]+)
REGEX_EXTRACT_IMG = <img[^<>]+>

UriCollection imageSources = new UriCollection();
string pageSource = FetchPage(url).ToString();
Uri pageUri = new Uri(url);
Regex regex = new Regex(REGEX_MATCH_IMGSRC, RegexOptions.Compiled |    
        RegexOptions.IgnoreCase);
MatchCollection matches = Regex.Matches(pageSource, REGEX_EXTRACT_IMG);

foreach (Match match in matches)
{
        Match srcMatch = regex.Match(match.Value);
        string srcValue = srcMatch.Groups["img"].ToString();

        /* Insert remaining code to read and process the image */
}

The problem is that srcValue returns a null using the img tag listed above. Am I wrong in thinking that the img tag is malformed, or is my regex too restrictive?


Fernando Soto

Hi JAndyEvans;

I am not an HTML expert, but according to my HTML reference manual href is not a valid attribute of the img tag, so the tag is malformed. I modified this statement in your code sample. All I did was replace src with (src|href). The alternation means: match src, and if that is not found, match href instead.

REGEX_MATCH_IMGSRC = @"<img\s+(src|href)\s*=\s*""?(?<img>[^\s""]+)";

In looking over your code I noticed that you run two regex commands. This is not needed, because the regex above already finds the tag and lets you extract the link. I show in the sample code below what you can do.

Fernando

string REGEX_MATCH_IMGSRC = @"<img\s+(src|href)\s*=\s*""?(?<img>[^\s""]+)";
// You do not need the next pattern because the one above already finds the tag
//REGEX_EXTRACT_IMG = <img[^<>]+>

// Finds the img tags and extracts the src at the same time.
// Call Matches on the instance so that the IgnoreCase option is applied.
Regex regex = new Regex(REGEX_MATCH_IMGSRC, RegexOptions.Compiled | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(pageSource);

foreach (Match match in matches)
{
    string srcValue = match.Groups["img"].ToString();

    /* Insert remaining code to read and process the image */
}
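
For anyone who wants to try the single-pass approach in isolation, here is a minimal self-contained sketch. The class name ImgSrcDemo, the method name ExtractSources, and any sample markup you feed it are assumptions made for illustration; only the pattern comes from the post above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcDemo
{
    // Same pattern as in the post above; (src|href) tolerates the mislabeled attribute.
    private const string MatchImgSrc = @"<img\s+(src|href)\s*=\s*""?(?<img>[^\s""]+)";

    // Built once; the instance carries the IgnoreCase option into every Matches call.
    private static readonly Regex ImgRegex =
        new Regex(MatchImgSrc, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns the captured attribute value of every img tag the pattern matches.
    // Tags the pattern does not match are simply skipped.
    public static List<string> ExtractSources(string pageSource)
    {
        var sources = new List<string>();
        foreach (Match match in ImgRegex.Matches(pageSource))
        {
            sources.Add(match.Groups["img"].Value);
        }
        return sources;
    }
}
```

Calling ImgSrcDemo.ExtractSources(pageSource) from the original loop would replace both regex passes. Note that the pattern still requires src (or href) to be the first attribute after <img.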


JAndyEvans

ASKER

Many apologies, I goofed on the img tag. It should be:

<img src=http://www.myurl.com/images/imagesource.gif width=1>

Replace 'href' with 'src'.
I guess the problem is that the img tag does not terminate properly. I need to either adjust my regex to accept the malformed tag or modify the tag. Or would it be better to find the src attribute in tags that begin with "<img"?

The tag that is returned to me is:

<img src=http://www.myurl.com/images/imagesource.gif width=1>

instead of:

<img src=http://www.myurl.com/images/imagesource.gif width=1 />

or

<img src=http://www.myurl.com/images/imagesource.gif width=1></img>

Hi JAndyEvans;

This regex pattern:

String REGEX_MATCH_IMGSRC = @"<img\s+src\s*=\s*""?(?<img>[^\s""]+)";

will match the sample tags that you posted as follows:

<img src=http://www.myurl.com/images/imagesource.gif width=1>
Will match: <img src=http://www.myurl.com/images/imagesource.gif
Will have access to: http://www.myurl.com/images/imagesource.gif

<img src=http://www.myurl.com/images/imagesource.gif width=1 />
Will match: <img src=http://www.myurl.com/images/imagesource.gif
Will have access to: http://www.myurl.com/images/imagesource.gif

<img src=http://www.myurl.com/images/imagesource.gif width=1></img>
Will match: <img src=http://www.myurl.com/images/imagesource.gif
Will have access to: http://www.myurl.com/images/imagesource.gif

Do not use this pattern, "REGEX_EXTRACT_IMG = <img[^<>]+>", in your code. Use the regex pattern I posted above in this post, which is the same as your original REGEX_MATCH_IMGSRC. It will match the img tags and parse out the URL in one pass. See the sample code in my first post, use the REGEX_MATCH_IMGSRC posted here, and it should give you what you want.

Fernando
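
On the question of finding the src attribute in any tag that begins with "<img": the pattern above only matches when src is the first attribute, and its unquoted branch would also swallow a closing > that directly follows the URL. The sketch below is one possible looser variant, not something posted in this thread. It assumes that attribute values never contain < or >, and the names TolerantImgSrc and ExtractSources are hypothetical. It relies on .NET allowing the same group name in several alternatives.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TolerantImgSrc
{
    // <img, then any attributes (lazily), then whitespace followed by src = value.
    // The value may be double-quoted, single-quoted, or unquoted; each branch
    // captures into the same named group "img".
    private static readonly Regex SrcAnywhere = new Regex(
        @"<img\b[^<>]*?\ssrc\s*=\s*(?:""(?<img>[^""]*)""|'(?<img>[^']*)'|(?<img>[^\s""'<>]+))",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns the src value of every img tag found, regardless of how the tag is closed.
    public static List<string> ExtractSources(string pageSource)
    {
        var sources = new List<string>();
        foreach (Match match in SrcAnywhere.Matches(pageSource))
        {
            sources.Add(match.Groups["img"].Value);
        }
        return sources;
    }
}
```

Because the match stops as soon as the src value ends, it does not matter whether the tag finishes with >, /> or ></img>. Requiring whitespace before src keeps attributes such as data-src from being picked up by mistake.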

ASKER CERTIFIED SOLUTION
margajet24

This solution is only available to members of Experts Exchange.
margajet24,

I actually needed to use the src attribute so I came up with this.

<img\ssrc\=(?<src>"?\S+"?).*\>

What was the leading backslash for?
Oh yeah, sorry. \ is used as an escape character because some symbols need it to be recognized properly.
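
To illustrate the escape character in general terms (the accepted solution is not visible here, so this is not a reconstruction of it): in a .NET pattern a backslash either introduces a character class such as \s or \S, or makes the next symbol literal. Symbols like = and > have no special meaning, so \= and \> in the pattern above behave the same as = and >, whereas a dot does need escaping to mean a literal dot. The class name EscapeDemo and its members are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // An unescaped dot matches any single character, so "axgif" is accepted too.
    public static bool UnescapedDotMatches(string input)
    {
        return Regex.IsMatch(input, @"^a.gif$");
    }

    // An escaped dot matches only a literal '.'.
    public static bool EscapedDotMatches(string input)
    {
        return Regex.IsMatch(input, @"^a\.gif$");
    }

    // Escaping '=' and '>' is harmless but unnecessary: both patterns match the same text.
    public static bool EscapedAndPlainAgree(string input)
    {
        return Regex.IsMatch(input, @"<img\ssrc\=x\>") == Regex.IsMatch(input, @"<img\ssrc=x>");
    }

    // Regex.Escape adds the backslashes needed to treat arbitrary text literally.
    public static string EscapeLiteral(string literal)
    {
        return Regex.Escape(literal);
    }
}
```

Regex.Escape is handy when part of a pattern comes from data, for example a file name that contains dots.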