Better regular expressions or programmatically reformatting malformed IMG tags found in HTML

Asked by JAndyEvans (United States of America)
Tags: HTML, C#, Regular Expressions

In a web-based application that I am writing, I need to read in a URL's HTML, use Regex to extract the IMG tags, and read each tag's SRC attribute. Everything worked fine until this morning, when I found that one URL's IMG tags appear to be improperly formatted (i.e. <img href= width=1>).

My code is below, using the following two regexes:

REGEX_MATCH_IMGSRC = <img\s+src\s*=\s*""?(?<img>[^\s""]+)
REGEX_EXTRACT_IMG = <img[^<>]+>

UriCollection imageSources = new UriCollection();
string pageSource = FetchPage(url).ToString();
Uri pageUri = new Uri(url);
Regex regex = new Regex(REGEX_MATCH_IMGSRC, RegexOptions.Compiled /* | ... */);
MatchCollection matches = Regex.Matches(pageSource, REGEX_EXTRACT_IMG);

foreach (Match match in matches)
{
        // Look for the SRC attribute inside each extracted IMG tag
        Match srcMatch = regex.Match(match.Value);
        string srcValue = srcMatch.Groups["img"].ToString();

        /* Insert remaining code to read and process the image */
}

The problem is that srcValue comes back null for the img tag listed above. Am I wrong in thinking that the img tag is malformed, or is my regex being too restrictive?
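
For what it is worth, the sample tag <img href= width=1> contains no SRC attribute at all, so no SRC pattern can pull a value out of it. Separately, REGEX_MATCH_IMGSRC only matches when src is the first attribute after <img. The sketch below is one possible way to loosen that, not a definitive fix: it looks for src anywhere inside the tag, accepts double-quoted, single-quoted, or unquoted values, and checks Match.Success before reading the group. The class and method names (ImgSrcExtractor, ExtractSources) are hypothetical and only for illustration. Regexes remain fragile on unusual HTML, for example an attribute value that itself contains " src=".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class ImgSrcExtractor
{
    // Matches a whole <img ...> tag, in the spirit of REGEX_EXTRACT_IMG.
    private static readonly Regex ImgTag =
        new Regex(@"<img\b[^<>]*>", RegexOptions.IgnoreCase);

    // Finds src anywhere in the tag. The value may be double-quoted,
    // single-quoted, or bare. .NET allows the same group name ("img")
    // in each alternative.
    private static readonly Regex SrcAttr =
        new Regex(@"\ssrc\s*=\s*(?:""(?<img>[^""]*)""|'(?<img>[^']*)'|(?<img>[^\s""'>]+))",
                  RegexOptions.IgnoreCase);

    public static List<string> ExtractSources(string html)
    {
        var sources = new List<string>();
        foreach (Match tag in ImgTag.Matches(html))
        {
            Match src = SrcAttr.Match(tag.Value);
            if (!src.Success)
            {
                continue; // tag has no usable src, e.g. <img href= width=1>
            }

            string value = src.Groups["img"].Value;
            if (value.Length > 0)
            {
                sources.Add(value);
            }
        }
        return sources;
    }
}
```

Calling ImgSrcExtractor.ExtractSources(pageSource) would then return only the tags that actually carry a src value, and the malformed tag is skipped instead of yielding an empty string.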
