vb.net - gather text from web page

First of all, I should point out that I am looking to gather some text from our own company's internal website, not trying to do anything naughty to external websites.

I have a Windows Forms application that performs many useful functions our Desktop Analysts use on a regular basis. One of these is to show who is logged on to a PC and what their full name is.

I want to go a little further...

It can already open our internal directory website and search for the name of the person logged on to provide a photo, telephone number, position, area they work in, etc. But this is an extra click. I want my app to gather this info itself, such as the telephone number and the area they work in, and maybe even import the photo.

So this is how it works so far:

Dim ADUsername As String = txtUsername.Text
Dim newusername As String = ADUsername.Split("\"c)(1)
Dim cluesLookup As String = "https://intranet/directory/servlet/domain.intranet.server.servlet.CorpDir?SearchType=Everyone&NumSearchFields=1&DisplayStart=1&SortColumn=null&Ascending=&Advanced=null&SearchField0=LAN+User+ID&SearchOperator0=Contains&ShowAllResults=true&NumOfResultsToDisplay=15&SearchText0=" & newusername
Process.Start(cluesLookup)


Which brings up the results of the search which includes the name and a telephone number. This on its own would be acceptable as I can at least gather the telephone number from this source code:
<td align="left" valign="top" CLASS="oddrow">+44 1234 567890</td>
This is the portion of the source code I am interested in:
<td align="left" valign="top" CLASS="oddrow"><a href="/intranet/servlet/domain.intranet.server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PPS-000A4801,OU=People,dc=domain,dc=com" onMouseOver="self.status='View personal information'; return true" onMouseOut="self.status=''">Firstname Surname</a>


What I would need from this code is recid=PPS-000A4801 because I could then use it to bring up the full details of the person using this link:
https://intranet.domain.com/intranet/servlet/intranet.domain.server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PPS-000A4801,OU=People,dc=domain,dc=com

This page shows all the info I am looking for, in particular these sections of the source code:
<td><b> Internal Email Address</b></td>
					<td>&nbsp;&nbsp;&nbsp;&nbsp;</td>
					<td colspan="3"><a href="mailto:surname_firstname@domain.com">surname_firstname@domain.com</a></td>

<td><IMG src="https://pictures.domain.com/private/pictures/4658572099.jpg" border=0> </td>


There's some other stuff I can pick out once I get a good idea of how to do it.
What I would like is for the end user to click a button that silently gathers the recid=PPS-xxxxxxxx value, constructs the link to the page with the full information, and then gathers the required text from that page to populate my Windows Forms application.

There is no API I can use to get this info unfortunately and this internal website is the only repository for this info. There is no RECID in AD for example that I could use.

So...any thoughts?

Most Valuable Expert 2011
Top Expert 2015
Commented:
I suggest looking into the HTML Agility Pack on CodePlex. You can acquire it through NuGet. It exposes classes and methods that make parsing HTML much easier.

For example, if I wanted to pull the "Related Questions" section of this very page, I might do the following:

Imports HtmlAgilityPack

Public Class Form1

    Private Sub btnDownload_Click(sender As Object, e As EventArgs) Handles btnDownload.Click
        Dim downloader As New HtmlWeb()
        Dim hDoc As HtmlDocument = downloader.Load(Me.txtAddress.Text)
        Dim relatedQuestions As HtmlNode

        Me.txtHtml.Text = hDoc.DocumentNode.OuterHtml

        relatedQuestions = hDoc.DocumentNode.SelectSingleNode("//*[@id='listContent']")

        Me.lstRelatedQuestions.Items.Clear()

        For Each question As HtmlNode In relatedQuestions.SelectNodes("li/a/span")
            Me.lstRelatedQuestions.Items.Add(question.InnerText)
        Next
    End Sub
End Class


Screenshot

Author

Commented:
This looks awesome and I've tried to install it via NuGet packages but it fails each time:

PM> Install-Package HtmlAgilityPack
Install-Package : File contains corrupted data.
At line:1 char:1
+ Install-Package HtmlAgilityPack
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Install-Package], FileFormatException
    + FullyQualifiedErrorId : NuGetCmdletUnhandledException,NuGet.PowerShell.Commands.InstallPackageCommand

Most Valuable Expert 2011
Top Expert 2015

Commented:
Unsure. I've never run into this before. As you can see, it downloaded for me. Do you have any kind of proxy servers or firewalls that might be blocking downloads of such?

Author

Commented:
It worked ok for other NuGet installs. I've mentioned this on the forum support so I'll have to wait and see.
Most Valuable Expert 2011
Top Expert 2015

Commented:
Does it download if you go through the GUI? I wouldn't expect it to, but that's what I used.

Author

Commented:
I don't know how you do it through the GUI.
Most Valuable Expert 2011
Top Expert 2015

Commented:
Right-click your project, then look for "Manage NuGet Packages...":

Screenshot
You can search for HTML Agility Pack using the dialog that opens:

Screenshot

Author

Commented:
I see the problem. It doesn't like Windows 8!
Windows 8 error
I tried it on a Win7 PC and it installs fine.
Most Valuable Expert 2011
Top Expert 2015

Commented:
Hmmm...  I'm running Windows 10 right now, and it worked for me. Is NuGet up to date on your machine? Does it show as an update in Extension manager (i.e. Tools->Extensions and Updates)? If nothing else, you could pull the binaries from CodePlex:

http://htmlagilitypack.codeplex.com/releases/view/90925

Author

Commented:
You GENIUS! There was a NuGet update and after updating and restarting VS the HTMLAgilityPack has installed successfully.
Gold star!

Author

Commented:
Finding VB.NET examples to gather text is a bit difficult.

The crappy code I have put together so far is:
Dim user As String = txtUsername.Text.Split("\"c)(1)
Dim webclient As New WebClient

Dim Weblink As String = webclient.DownloadString("https://server.servlet.CorpDir?SearchType=Everyone&NumSearchFields=1&DisplayStart=1&SortColumn=null&Ascending=&Advanced=null&SearchField0=LAN+User+ID&SearchOperator0=Contains&ShowAllResults=true&NumOfResultsToDisplay=15&SearchText0=" & user)
Dim htmldoc As New HtmlAgilityPack.HtmlDocument()
'Dim recid As String

htmldoc.LoadHtml(Weblink)
For Each link As HtmlNode In htmldoc.DocumentNode.SelectNodes("//*[text()[contains(., 'recid=')]]")
    If link.InnerText IsNot Nothing Then
        txtclsid.Text = (link.InnerText)
    End If
Next


I get an error (not a surprise!):
System.NullReferenceException was unhandled
  HResult=-2147467261
  Message=Object reference not set to an instance of an object.

The section of source code it should find is:
<td align="left" valign="top" CLASS="oddrow"><a href="/server.servlet.ShowPersonalInfo?SearchType=Everyone&SearchDN=recid=PIP-10153771,OU=People,dc=domain,dc=com" onMouseOver="self.status='View personal information'; return true" onMouseOut="self.status=''">


Someone give me a clue!!
Most Valuable Expert 2011
Top Expert 2015

Commented:
Which line raises the error?

Author

Commented:
This line:
For Each link As HtmlNode In htmldoc.DocumentNode.SelectNodes("//*[text()[contains(., 'recid=')]]")
Most Valuable Expert 2011
Top Expert 2015

Commented:
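OK, so your SelectNodes isn't finding any nodes. When nothing matches, it returns Nothing rather than an empty collection, which is why the For Each throws. You probably need to refine your XPath.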
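
As a rough sketch of what a refined query might look like: in the snippet you posted, recid= sits inside the href attribute of the link rather than in any text node, so matching on @href is one option. The module and function names below (RecIdSketch, FindRecId) are made up for illustration, and the regular expression assumes the value ends at the first comma or ampersand, as it does in your sample.

```vbnet
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module RecIdSketch
    ' Hypothetical helper: returns the first recid value found in a link's href,
    ' or Nothing when the page contains no such link.
    Function FindRecId(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Match on the href attribute; the recid is not part of the visible text.
        Dim links As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//a[contains(@href, 'recid=')]")

        ' SelectNodes returns Nothing when there is no match, so guard before looping.
        If links Is Nothing Then Return Nothing

        For Each link As HtmlNode In links
            Dim href As String = link.GetAttributeValue("href", "")
            ' Assumption: the value runs up to the next comma or ampersand.
            Dim m As Match = Regex.Match(href, "recid=([^,&]+)")
            If m.Success Then Return m.Groups(1).Value
        Next

        Return Nothing
    End Function
End Module
```

You could then call FindRecId(Weblink) in place of the For Each loop and only build the ShowPersonalInfo link when the result is not Nothing.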

Author

Commented:
I've found the problem and I'm not sure there's a way to work around it.
I used the HTMLAgilityPack XPath Finder and directed it to the page I want to get the info from.

Because our intranet uses a login page, the finder reads the source code of that redirect page instead of the page I want.
This is the source code the XPath Finder gets:
<html>
<body>
<script language="JavaScript">
    window.location.replace("https://login.domain.com/cgi-bin/kerb");
</script>
</body>
</html>

Is there any way to work around this?
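
One avenue that might be worth trying, offered as a sketch rather than a known fix: WebClient and HtmlWeb do not run JavaScript, so the window.location.replace redirect is never followed. If the login endpoint accepts Windows integrated authentication and then issues a session cookie (both are assumptions about this intranet, not something the page source confirms), a request that sends the logged-on user's credentials and keeps cookies between calls may get past it. The names IntranetFetchSketch and FetchHtml are hypothetical.

```vbnet
Imports System.IO
Imports System.Net

Module IntranetFetchSketch
    ' Shared cookie jar, so any session cookie set by the login page
    ' is sent again on later requests.
    Private ReadOnly Cookies As New CookieContainer()

    ' Hypothetical helper: downloads a page as a string using the
    ' current Windows user's credentials.
    Function FetchHtml(url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.UseDefaultCredentials = True   ' send the logged-on user's credentials
        request.CookieContainer = Cookies      ' keep cookies across requests

        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

The idea would be to call FetchHtml on the https://login.domain.com/cgi-bin/kerb address first, then call it again on the directory search URL and pass the result to htmldoc.LoadHtml. Whether the second call returns the real page depends entirely on how the login server sets up the session.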

Author

Commented:
Thanks for all the help