vb.net - gather text from web page

First of all, I should point out that I am looking to gather some text from our own company's internal website and am not trying to do anything naughty to external websites.

I have a Windows Forms application that performs many useful functions that our Desktop Analysts use on a regular basis. One of these is to show who is logged on to a PC and what their full name is.

I want to go a little further...

It can already open our internal directory website and search for the name of the person logged on, which provides a photo, telephone number, position, area they work in, etc. But this is an extra click. I want my app to gather this info itself (telephone number, area they work in) and maybe even import the photo.

So this is how it works so far:

Dim ADUsername As String = txtUsername.Text
Dim newusername As String = ADUsername.Split("\"c)(1)
Dim cluesLookup As String = "https://intranet/directory/servlet/domain.intranet.server.servlet.CorpDir?SearchType=Everyone&NumSearchFields=1&DisplayStart=1&SortColumn=null&Ascending=&Advanced=null&SearchField0=LAN+User+ID&SearchOperator0=Contains&ShowAllResults=true&NumOfResultsToDisplay=15&SearchText0=" & newusername
Process.Start(cluesLookup)


This brings up the search results, which include the name and a telephone number. On its own this would be acceptable, as I can at least gather the telephone number from this source code:
<td align="left" valign="top" CLASS="oddrow">+44 1234 567890</td>
This is the portion of the source code I am interested in:
<td align="left" valign="top" CLASS="oddrow"><a href="/intranet/servlet/domain.intranet.server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PPS-000A4801,OU=People,dc=domain,dc=com" onMouseOver="self.status='View personal information'; return true" onMouseOut="self.status=''">Firstname Surname</a>


What I would need from this code is recid=PPS-000A4801 because I could then use it to bring up the full details of the person using this link:
https://intranet.domain.com/intranet/servlet/intranet.domain.server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PPS-000A4801,OU=People,dc=domain,dc=com

This page shows all the info I am looking for, in particular these sections of the source code:
<td><b> Internal Email Address</b></td>
					<td>&nbsp;&nbsp;&nbsp;&nbsp;</td>
					<td colspan="3"><a href="mailto:surname_firstname@domain.com">surname_firstname@domain.com</a></td>

<td><IMG src="https://pictures.domain.com/private/pictures/4658572099.jpg" border=0> </td>


There's some other stuff I can pick out once I get a good idea of how to do it.
What I would like is for the end user to click a button that silently gathers the RECID=PPS-xxxxxxxx number, constructs the link to the page containing the full information, and then gathers the required text from that page to populate my Windows Forms application.

Unfortunately there is no API I can use to get this info, and this internal website is the only repository for it. For example, there is no RECID in AD that I could use.

So...any thoughts?
fruitloopyAsked:

käµfm³d 👽Commented:
I suggest looking into the HTML Agility Pack on CodePlex. You can acquire it through NuGet. It exposes classes and methods that make parsing HTML much easier.

For example, if I wanted to pull the "Related Questions" section of this very page, I might do the following:

Imports HtmlAgilityPack

Public Class Form1

    Private Sub btnDownload_Click(sender As Object, e As EventArgs) Handles btnDownload.Click
        Dim downloader As New HtmlWeb()
        Dim hDoc As HtmlDocument = downloader.Load(Me.txtAddress.Text)
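        Dim relatedQuestions As HtmlNode

        ' Show the raw HTML of the downloaded page.
        Me.txtHtml.Text = hDoc.DocumentNode.OuterHtml

        ' Locate the element that holds the "Related Questions" list.
        relatedQuestions = hDoc.DocumentNode.SelectSingleNode("//*[@id='listContent']")

        Me.lstRelatedQuestions.Items.Clear()

        ' SelectSingleNode and SelectNodes return Nothing when nothing matches.
        If relatedQuestions Is Nothing Then Return
        Dim questions As HtmlNodeCollection = relatedQuestions.SelectNodes("li/a/span")
        If questions Is Nothing Then Return

        For Each question As HtmlNode In questions
            Me.lstRelatedQuestions.Items.Add(question.InnerText)
        Next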
    End Sub
End Class
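
Applied to the personal-info page from the question, the same approach might look roughly like the sketch below. It assumes the page contains a single relevant mailto: link and a photo whose src contains "/pictures/", as in the snippets quoted in the question. The module and function names (DetailPageSketch, ExtractEmail, ExtractPhotoUrl) are hypothetical and only for illustration.

```vbnet
Imports HtmlAgilityPack

Module DetailPageSketch

    ' Returns the address from the first mailto: link in the HTML,
    ' or Nothing if no such link exists.
    Function ExtractEmail(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim link As HtmlNode =
            doc.DocumentNode.SelectSingleNode("//a[starts-with(@href, 'mailto:')]")
        If link Is Nothing Then Return Nothing

        ' Strip the "mailto:" prefix from the href value.
        Return link.GetAttributeValue("href", "").Substring("mailto:".Length)
    End Function

    ' Returns the src of the first image whose src contains "/pictures/",
    ' or Nothing if no such image exists. The "/pictures/" test is an
    ' assumption based on the sample markup in the question.
    Function ExtractPhotoUrl(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim img As HtmlNode =
            doc.DocumentNode.SelectSingleNode("//img[contains(@src, '/pictures/')]")
        If img Is Nothing Then Return Nothing

        Return img.GetAttributeValue("src", "")
    End Function

End Module
```

Both functions take the page HTML as a string, so they can be tried against a saved copy of the page before any download code is wired in.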


Screenshot

fruitloopyAuthor Commented:
This looks awesome. I've tried to install it via NuGet, but it fails each time:

PM> Install-Package HtmlAgilityPack
Install-Package : File contains corrupted data.
At line:1 char:1
+ Install-Package HtmlAgilityPack
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Install-Package], FileFormatException
    + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PowerShell.Commands.InstallPackageCommand

käµfm³d 👽Commented:
Unsure. I've never run into this before. As you can see, it downloaded for me. Do you have any kind of proxy servers or firewalls that might be blocking downloads of such?
fruitloopyAuthor Commented:
It worked OK for other NuGet installs. I've raised this on the support forum, so I'll have to wait and see.
käµfm³d 👽Commented:
Does it download if you go through the GUI? I wouldn't expect it to, but that's what I used.
fruitloopyAuthor Commented:
I don't know how to do it through the GUI.
käµfm³d 👽Commented:
Right-click your project, then look for "Manage NuGet Packages...":

Screenshot
You can search for HTML Agility Pack using the dialog that opens:

Screenshot
fruitloopyAuthor Commented:
I see the problem. It doesn't like Windows 8!
Windows 8 error
I tried it on a Win7 PC and it installs fine.
käµfm³d 👽Commented:
Hmmm...  I'm running Windows 10 right now, and it worked for me. Is NuGet up to date on your machine? Does it show as an update in Extension manager (i.e. Tools->Extensions and Updates)? If nothing else, you could pull the binaries from CodePlex:

http://htmlagilitypack.codeplex.com/releases/view/90925
fruitloopyAuthor Commented:
You GENIUS! There was a NuGet update and after updating and restarting VS the HTMLAgilityPack has installed successfully.
Gold star!
fruitloopyAuthor Commented:
Finding VB.NET examples to gather text is a bit difficult.

The crappy code I have put together so far is:
Dim user As String = txtUsername.Text.Split("\"c)(1)
Dim webclient As New WebClient

Dim Weblink As String = webclient.DownloadString("https://server.servlet.CorpDir?SearchType=Everyone&NumSearchFields=1&DisplayStart=1&SortColumn=null&Ascending=&Advanced=null&SearchField0=LAN+User+ID&SearchOperator0=Contains&ShowAllResults=true&NumOfResultsToDisplay=15&SearchText0=" & user)
Dim htmldoc As New HtmlAgilityPack.HtmlDocument()
'Dim recid As String

htmldoc.LoadHtml(Weblink)
For Each link As HtmlNode In htmldoc.DocumentNode.SelectNodes("//*[text()[contains(., 'recid=')]]")
    If link.InnerText IsNot Nothing Then
        txtclsid.Text = (link.InnerText)
    End If
Next


I get an error (not a surprise!):
System.NullReferenceException was unhandled
  HResult=-2147467261
  Message=Object reference not set to an instance of an object.

The section of source code it should find is:
<td align="left" valign="top" CLASS="oddrow"><a href="/server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PIP-10153771,OU=People,dc=domain,dc=com" onMouseOver="self.status='View personal information'; return true" onMouseOut="self.status=''">


Someone give me a clue!!
käµfm³d 👽Commented:
Which line raises the error?
fruitloopyAuthor Commented:
This line:
For Each link As HtmlNode In htmldoc.DocumentNode.SelectNodes("//*[text()[contains(., 'recid=')]]")
käµfm³d 👽Commented:
OK, so your SelectNodes isn't finding nodes. You probably need to refine your XPath.
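
One possible refinement, offered as a sketch rather than a confirmed fix: in the markup quoted above, 'recid=' appears inside the href attribute of the link, not in any text node, so an XPath that tests @href may be closer to what is needed. SelectNodes returns Nothing when nothing matches, so the result is checked before looping. The module and function names (RecIdSketch, ExtractRecId) are hypothetical, and the regular expression assumes the recid value ends at the first comma or ampersand, as it does in the sample.

```vbnet
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module RecIdSketch

    ' Returns the recid value (for example "PIP-10153771") from the first
    ' link whose href contains "recid=", or Nothing if none is found.
    Function ExtractRecId(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Match on the href attribute, because that is where recid appears.
        Dim links As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//a[contains(@href, 'recid=')]")

        ' SelectNodes returns Nothing (not an empty collection) on no match.
        If links Is Nothing Then Return Nothing

        For Each link As HtmlNode In links
            Dim href As String = link.GetAttributeValue("href", "")
            ' Capture everything after "recid=" up to the next comma or ampersand.
            Dim m As Match = Regex.Match(href, "recid=([^,&]+)")
            If m.Success Then Return m.Groups(1).Value
        Next

        Return Nothing
    End Function

End Module
```

The returned value could then be substituted into the ShowPersonalInfo link given in the question (between "SearchDN=recid=" and ",OU=People,dc=domain,dc=com") to reach the detail page.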
fruitloopyAuthor Commented:
I've found the problem and I'm not sure there's a way to work around it.
I used the HTMLAgilityPack XPath Finder and directed it to the page I want to get the info from.

Because our intranet sits behind a login page, the tool reads the source of that login redirect instead of the page I actually want.
This is the source code the XPath Finder gets:
<html>
<body>
<script language="JavaScript">
    window.location.replace("https://login.domain.com/cgi-bin/kerb");
</script>
</body>
</html>

Is there any way to work around this?
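
One thing that might be worth trying, purely as an assumption: the redirect target (/cgi-bin/kerb) hints at Kerberos, so the site may accept Integrated Windows Authentication. If so, asking WebClient to send the logged-on user's Windows credentials could return the real page instead of the redirect stub. Nothing in this thread confirms that the intranet works this way, and the module and function names below are hypothetical.

```vbnet
Imports System.Net

Module AuthenticatedDownloadSketch

    ' Downloads a page while offering the current user's Windows credentials.
    ' This only helps if the server negotiates Windows authentication
    ' (an assumption); a cookie-based login would need a different approach.
    Function DownloadWithWindowsCredentials(url As String) As String
        Using client As New WebClient()
            client.UseDefaultCredentials = True
            Return client.DownloadString(url)
        End Using
    End Function

End Module
```

If the returned HTML still contains the window.location.replace script, the login probably depends on the browser running that redirect and keeping its session cookies, which a plain WebClient download does not do.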
fruitloopyAuthor Commented:
Thanks for all the help.