shanemay
asked on
Advanced C# string parse to pull specific string and substrings out of a larger string.
I have a C# string that represents HTML code. I need to parse the string and find each instance of the img element. Once the img element is taken out, I need to then extract the src attribute from the img element.
For example, given the C# string, I need to find <img src="http://www.someurl.com/image1.gif" /> and assign it to a string. Then I need to parse that new string for the src value, http://www.someurl.com/image1.gif. I am somewhat familiar with C# string functions, but I am not sure how to begin working on this.
If you need more clarification please let me know. Thank you for your help.
Regular expressions are definitely the way to go, but the code provided is pretty cumbersome. It won't match any img tag with more than a src attribute (e.g. <img src="img.jpg" id="img1">), and it does a lot of extra processing. You can extract everything you need in one pass with a function like this (it returns the sources in an ArrayList):
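// Requires: using System.Collections; using System.Text.RegularExpressions;
private ArrayList getImageSources(String HTML)
{
    // Group 1 captures the src value; other attributes may appear before or after it.
    Regex re = new Regex("<img\\s[^>]*?src=[\"']([^\"']+)[\"'][^>]*>", RegexOptions.IgnoreCase);
    MatchCollection matches = re.Matches(HTML);
    ArrayList sources = new ArrayList();
    foreach (Match m in matches) {
        // Store the captured text, not the Group object itself.
        sources.Add(m.Groups[1].Value);
    }
    return sources;
}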
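Since the question asks for both the whole img element and its src, here is a minimal sketch of how the same pattern could return both in one pass. The class name ImgTagDemo, the method FindImages, and the second sample URL (image2.gif) are hypothetical, made up for illustration; only the regex itself comes from the function above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgTagDemo
{
    // Same pattern as in the function above; group 1 captures the src value.
    private static readonly Regex ImgRegex = new Regex(
        "<img\\s[^>]*?src=[\"']([^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase);

    // Returns (whole tag, src value) pairs for every img element found.
    public static List<KeyValuePair<string, string>> FindImages(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in ImgRegex.Matches(html))
        {
            // m.Value is the full matched tag; Groups[1].Value is the src URL.
            results.Add(new KeyValuePair<string, string>(m.Value, m.Groups[1].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample input; the first URL is the one from the question.
        string html = "<p>Intro</p><img src=\"http://www.someurl.com/image1.gif\" />"
                    + "<IMG id='img2' src='http://www.someurl.com/image2.gif'>";

        foreach (var pair in FindImages(html))
        {
            Console.WriteLine("tag: " + pair.Key);
            Console.WriteLine("src: " + pair.Value);
        }
    }
}
```

A regex like this is fine for simple, well-formed markup, but it will miss img tags whose src is unquoted and can be fooled by attributes that merely end in "src=". For arbitrary real-world HTML, a proper HTML parser is usually the safer choice.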