Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
?
Solved

Parse HTML for specific string pattern

Posted on 2014-03-09
3
Medium Priority
?
501 Views
Last Modified: 2014-03-30
Greetings,
I am trying to compose a query in vb.net that will parse a website looking for all the strings in a particular pattern and placing those strings in a collection:
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User1</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 1.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User2</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 2.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User3</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 3.</p>
			</div>

Open in new window

I want to pull out:

User1   Sample comment 1.
User2   Sample comment 2.
User3   Sample comment 3.

Thanks in advance.

M
0
Comment
Question by:MaxKroy
2 Comments
 
LVL 25

Expert Comment

by:apeter
ID: 39917051
Can't you use Linq to xml to parse the xml ?

Use XDocument to parse the xml.  http://msdn.microsoft.com/en-us/library/bb918016.aspx
0
 
LVL 23

Accepted Solution

by:
Ioannis Paraskevopoulos earned 2000 total points
ID: 39917105
You may use HtmlAgilityPack (available on Nuget). If you do not use NuGet you may get the binaries from CodePlex .

You may check the following sample code that gets an Enumerable of Anonymous objects that have a User and a Comment properties. Use it as you like:

	Dim html As String
	html = _
	"<div class=""comment-author"">" + _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User1</a></cite>		" + _				
	"</div>" + _
	"<div class=""comment_content"">" + _
	"	<p>Sample comment 1.</p>" + _
	"</div>" + _
	"<div class=""comment-author"">" + _
	"		 <img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"		<cite class=""fn""><a href='/bio.html' class='url'>User2</a></cite>" + _
	"</div>" + _
	"<div class=""comment_content"">" + _
	"		<p>Sample comment 2.</p>" + _
	"</div>" + _
	"<div class=""comment-author"">" + _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" + _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User3</a></cite>" + _				
	"</div>" + _
	"<div class=""comment_content"">" + _
	"	<p>Sample comment 3.</p>" + _
	"</div>"
	
	Dim htmlDoc = New HtmlAgilityPack.HtmlDocument
	htmlDoc.LoadHtml(html)

	Dim Result  = htmlDoc.DocumentNode.Elements("div").Where(Function(x) x.Attributes("class").Value = "comment-author").Select(Function(x) New With {.User = x.Element("cite").InnerText, .Comment = x.NextSibling.InnerText.Trim})
	For Each obj In Result
		Console.WriteLine("User={0}, Comment={1}",obj.User, obj.Comment)
	Next

Open in new window


Giannis
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Entity Framework is a powerful tool to help you interact with the DataBase but still doesn't help much when we have a Stored Procedure that returns more than one resultset. The solution takes some of out-of-the-box thinking; read on!
Exception Handling is in the core of any application that is able to dignify its name. In this article, I'll guide you through the process of writing a DRY (Don't Repeat Yourself) Exception Handling mechanism, using Aspect Oriented Programming.
Despite its rising prevalence in the business world, "the cloud" is still misunderstood. Some companies still believe common misconceptions about lack of security in cloud solutions and many misuses of cloud storage options still occur every day. …
How can you see what you are working on when you want to see it while you to save a copy? Add a "Save As" icon to the Quick Access Toolbar, or QAT. That way, when you save a copy of a query, form, report, or other object you are modifying, you…

564 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question