Parse HTML for specific string pattern

Greetings,
I am trying to write a query in VB.NET that parses a web page, finds every string that matches a particular pattern, and places those strings in a collection:
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User1</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 1.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User2</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 2.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User3</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 3.</p>
			</div>


I want to pull out:

User1   Sample comment 1.
User2   Sample comment 2.
User3   Sample comment 3.

Thanks in advance.

Asked by: MaxKroy
Ioannis Paraskevopoulos commented:
You can use HtmlAgilityPack (available on NuGet). If you do not use NuGet, you can get the binaries from CodePlex.

The following sample code builds an enumerable of anonymous objects, each with a User and a Comment property. Use it as you like:

	Dim html As String
	html = _
	"<div class=""comment-author"">" + _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User1</a></cite>		" + _				
	"</div>" + _
	"<div class=""comment_content"">" + _
	"	<p>Sample comment 1.</p>" + _
	"</div>" + _
	"<div class=""comment-author"">" + _
	"		 <img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"		<cite class=""fn""><a href='/bio.html' class='url'>User2</a></cite>" + _
	"</div>" + _
	"<div class=""comment_content"">" + _
	"		<p>Sample comment 2.</p>" + _
	"</div>" + _
	"<div class=""comment-author"">" + _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User3</a></cite>" + _				
	"</div>" + _
	"<div class=""comment_content"">" + _
	"	<p>Sample comment 3.</p>" + _
	"</div>"
	
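	' Requires a reference to HtmlAgilityPack; the LINQ calls need Imports System.Linq.
	Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
	htmlDoc.LoadHtml(html)

	' Pair each "comment-author" div with the first div that follows it.
	' Descendants also finds divs nested inside other elements, GetAttributeValue returns ""
	' for a div with no class attribute, and the XPath skips whitespace text nodes between divs.
	Dim Result = htmlDoc.DocumentNode.Descendants("div") _
		.Where(Function(x) x.GetAttributeValue("class", "") = "comment-author") _
		.Select(Function(x) New With {.User = x.Element("cite").InnerText.Trim(), .Comment = x.SelectSingleNode("following-sibling::div[1]").InnerText.Trim()})
	For Each obj In Result
		Console.WriteLine("User={0}, Comment={1}", obj.User, obj.Comment)
	Next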



Giannis
apeter commented:
Can't you use LINQ to XML to parse the markup?

Use XDocument to parse it as XML: http://msdn.microsoft.com/en-us/library/bb918016.aspx
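
A minimal sketch of the LINQ to XML route, assuming the markup is well-formed XML. The HTML as posted in the question is not (the <img> tags are never closed and there is no single root element), so XDocument.Parse would throw an XmlException on it. Here the <img> tag is self-closed and a hypothetical <comments> root wraps the fragment; the module name and the shortened sample are likewise only for illustration.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module LinqToXmlSketch
    Sub Main()
        ' Assumption: well-formed XML. The <img> tag is self-closed and a
        ' hypothetical <comments> root element wraps the fragment.
        Dim xml As String = _
            "<comments>" & _
            "<div class=""comment-author"">" & _
            "<img src=""/images/avatar.jpg"" class=""avatar photo"" width=""44"" />" & _
            "<cite class=""fn""><a href=""/bio.html"" class=""url"">User1</a></cite>" & _
            "</div>" & _
            "<div class=""comment_content""><p>Sample comment 1.</p></div>" & _
            "</comments>"

        Dim doc = XDocument.Parse(xml)

        ' Pair each author <div> with the first <div> element that follows it.
        Dim result = doc.Root.Elements("div") _
            .Where(Function(d) CType(d.Attribute("class"), String) = "comment-author") _
            .Select(Function(d) New With { _
                .User = d.Element("cite").Value.Trim(), _
                .Comment = d.ElementsAfterSelf("div").First().Value.Trim()})

        For Each item In result
            Console.WriteLine("User={0}, Comment={1}", item.User, item.Comment)
        Next
    End Sub
End Module
```

For pages you do not control, real-world HTML is rarely valid XML, so a tolerant parser such as HtmlAgilityPack is usually the safer choice.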