Solved

Parse HTML for specific string pattern

Posted on 2014-03-09
Last Modified: 2014-03-30
Greetings,
I am trying to write a query in VB.NET that parses a web page, finds every string that matches a particular pattern, and places those strings in a collection:
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User1</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 1.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User2</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 2.</p>
			</div>
<div class="comment-author">
			 <img  src="/images/avatar.jpg" class="avatar photo"   width="44">
			<cite class="fn"><a href='/bio.html' class='url'>User3</a></cite>						</div>
						<div class="comment_content">
			<p>Sample comment 3.</p>
			</div>


I want to pull out:

User1   Sample comment 1.
User2   Sample comment 2.
User3   Sample comment 3.

Thanks in advance.

M
Question by:MaxKroy
3 Comments
 
LVL 25

Expert Comment

by:apeter
ID: 39917051
Can't you use LINQ to XML to parse the markup?

Use XDocument to parse it: http://msdn.microsoft.com/en-us/library/bb918016.aspx
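
For reference, a minimal sketch of the LINQ to XML approach, assuming the markup is well-formed XML. The sample in the question is not: its <img> tags are never closed and it has no single root element, so XDocument.Parse would throw an XmlException on it as posted. The <root> wrapper, the self-closed <img />, and the module name below are assumptions added for illustration only.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module LinqToXmlSketch
    Sub Main()
        ' Assumption: well-formed input. The <root> wrapper and the
        ' self-closed <img /> are NOT in the original page.
        Dim xhtml As String = _
            "<root>" & _
            "<div class=""comment-author""><img src=""/images/avatar.jpg"" />" & _
            "<cite class=""fn""><a href=""/bio.html"" class=""url"">User1</a></cite></div>" & _
            "<div class=""comment_content""><p>Sample comment 1.</p></div>" & _
            "</root>"

        Dim doc = XDocument.Parse(xhtml)

        ' Pick every author div, then read the first div that follows it.
        Dim result = doc.Descendants("div") _
            .Where(Function(d) d.Attribute("class") IsNot Nothing AndAlso _
                               d.Attribute("class").Value = "comment-author") _
            .Select(Function(d) New With { _
                .User = d.Element("cite").Value.Trim(), _
                .Comment = d.ElementsAfterSelf("div").First().Value.Trim()})

        For Each item In result
            Console.WriteLine("{0}   {1}", item.User, item.Comment)
        Next
    End Sub
End Module
```

For pages that are not valid XML, an HTML-tolerant parser is usually the safer choice.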
 
LVL 23

Accepted Solution

by:
Ioannis Paraskevopoulos earned 500 total points
ID: 39917105
You can use HtmlAgilityPack (available on NuGet). If you do not use NuGet, you can get the binaries from CodePlex.

The following sample code returns an enumerable of anonymous objects, each with a User and a Comment property. Use it as you like:

	' Sample markup from the question; in practice this is the downloaded page source.
	Dim html As String = _
	"<div class=""comment-author"">" & _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" & _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User1</a></cite>" & _
	"</div>" & _
	"<div class=""comment_content"">" & _
	"	<p>Sample comment 1.</p>" & _
	"</div>" & _
	"<div class=""comment-author"">" & _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" & _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User2</a></cite>" & _
	"</div>" & _
	"<div class=""comment_content"">" & _
	"	<p>Sample comment 2.</p>" & _
	"</div>" & _
	"<div class=""comment-author"">" & _
	"	<img  src=""/images/avatar.jpg"" class=""avatar photo""   width=""44"">" & _
	"	<cite class=""fn""><a href='/bio.html' class='url'>User3</a></cite>" & _
	"</div>" & _
	"<div class=""comment_content"">" & _
	"	<p>Sample comment 3.</p>" & _
	"</div>"
	
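	' Needs a reference to HtmlAgilityPack and Imports System.Linq.
	Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
	htmlDoc.LoadHtml(html)

	' Descendants finds the author divs at any depth; the following-sibling XPath
	' skips the whitespace text nodes that NextSibling returns on a real page.
	Dim result = htmlDoc.DocumentNode.Descendants("div") _
		.Where(Function(x) x.GetAttributeValue("class", "") = "comment-author") _
		.Select(Function(x) New With { _
			.User = x.Element("cite").InnerText.Trim(), _
			.Comment = x.SelectSingleNode("following-sibling::div[@class='comment_content']").InnerText.Trim()})
	For Each obj In result
		Console.WriteLine("User={0}, Comment={1}", obj.User, obj.Comment)
	Next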
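
Since the question is about parsing a live site and collecting the results, here is a rough sketch of how the same idea might look when the page is downloaded with HtmlAgilityPack's HtmlWeb and the pairs are stored in a List. The URL is a placeholder, and the module and the GetComments helper are hypothetical names, not part of the library. It also assumes each comment_content div is a later sibling of its comment-author div, as in the sample markup.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module CommentScraperSketch
    ' Hypothetical helper: returns (user, comment) pairs found in the document.
    Function GetComments(ByVal doc As HtmlDocument) As List(Of KeyValuePair(Of String, String))
        Dim comments As New List(Of KeyValuePair(Of String, String))

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim authors = doc.DocumentNode.SelectNodes("//div[@class='comment-author']")
        If authors Is Nothing Then Return comments

        For Each author As HtmlNode In authors
            Dim cite = author.SelectSingleNode(".//cite")
            Dim content = author.SelectSingleNode("following-sibling::div[@class='comment_content'][1]")
            ' Skip incomplete pairs instead of throwing.
            If cite Is Nothing OrElse content Is Nothing Then Continue For

            ' DeEntitize turns entities such as &amp; back into plain characters.
            comments.Add(New KeyValuePair(Of String, String)( _
                HtmlEntity.DeEntitize(cite.InnerText).Trim(), _
                HtmlEntity.DeEntitize(content.InnerText).Trim()))
        Next
        Return comments
    End Function

    Sub Main()
        ' Placeholder URL (an assumption, not from the original thread).
        Dim doc = New HtmlWeb().Load("http://example.com/post-with-comments")
        For Each pair In GetComments(doc)
            Console.WriteLine("{0}   {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

The null checks trade completeness for robustness: a comment block whose markup differs from the sample is skipped silently, so you may prefer to log those cases instead.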



Giannis