Solved

Using HTML Agility Pack to select a node

Posted on 2013-01-15
Last Modified: 2013-02-08
Hi guys,

From the link below, I am trying to extract the text '16 line items found.'

http://www.netcomponents.com/results.htm?t=f&r=1&pn1=12345J

My code so far is below; it keeps returning 0 results.

    HtmlWeb NC = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = NC.Load(NetComsURL.NetComURL);
    HtmlNodeCollection NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]");

    if (NetlinkNodes != null)
    {
        // Loop through the nodes and grab the last one
        foreach (var node in NetlinkNodes)
        {
            txtResult.Text += node.InnerText;
        }
    }



In case you cannot load the link, here is a snippet of the page source:

<tr><td colspan="2"></td></tr></table></td><TD ALIGN=CENTER VALIGN=MIDDLE NOWRAP WIDTH=100% >&nbsp;<div ID="livesearch" style="display:none;;background: infobackground;"></div>16 line items found.</td> 



Many Thanks,

Dean
Question by:deanlee17
38 Comments
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38777736
I'd put a breakpoint on line 5 (if (NetlinkNodes != null)), inspect the values of the variables you have, and then single-step through the code to see what is happening.
0
 

Author Comment

by:deanlee17
ID: 38777751
Hi Andy, I have done that. It's not finding anything, hence the result is 0.
0
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38777768
Have you iterated through the nodes in the document and inspected their values?
0
 

Author Comment

by:deanlee17
ID: 38777776
Not exactly, as I'm not sure how to do that. I'm new to this, so I'm still finding my way.
0
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38777846
Something like:
foreach (var node in doc.DocumentNode.Descendants())
{
  // check what is in the node, e.g. node.Name and node.InnerText
}
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38778097
Are you sure you are bringing back the correct page? When I click on your link above, I am taken to a login screen. Clicking the search button takes me to the page with the "16 line items found." line. The only difference I see between the target page and the link you posted is the "d=1" querystring parameter.

In addition to that, the "livesearch" <div> does not contain the text you are seeking; the <td> does.

(Formatted)
<TD ALIGN=CENTER VALIGN=MIDDLE NOWRAP WIDTH=100% >
    &nbsp;<div ID="livesearch" style="display:none;position:absolute;background: infobackground;"></div>
    16 line items found.
</td>
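
Since the div is empty and the text sits next to it inside the same cell, one option is to find the div and then read its parent. The following is a minimal sketch, not a tested fix for this site: the class and method names are made up for illustration, and it assumes the page really contains the markup quoted above. As far as I know, HAP exposes attribute names to XPath in lower case, so `@id` is the safer spelling even though the page writes `ID`.

```csharp
using HtmlAgilityPack;

public static class ResultCountReader
{
    // Returns the visible text of the cell that contains the "livesearch" div,
    // or null when the div is not in the document (for example, a login page).
    public static string ReadResultText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Lower-case attribute name (assumption: HAP normalises names for XPath).
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='livesearch']");
        if (div == null || div.ParentNode == null)
        {
            return null;
        }

        // The div itself is empty; the text is its sibling inside the parent <td>.
        // DeEntitize turns &nbsp; into a real character so Trim can drop it.
        return HtmlEntity.DeEntitize(div.ParentNode.InnerText).Trim();
    }
}
```

A null return is a useful signal here: it usually means HAP was served a different page (such as the login screen) rather than the results page.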


0
 

Author Comment

by:deanlee17
ID: 38778138
Hi kaufmed,

We have an account with that site, so when my user clicks to go to it they are logged in automatically. Yes, you are correct: it's the TD data that I need to navigate to.

Any ideas?
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38778226
Again, I'd make sure that HAP is pulling the correct document. When I try to run against the page, I don't get the correct HTML because apparently the site requires cookies:

Screenshot
There are apparently some ways to overcome this within HAP, but I don't have time at the moment to craft an example. I can work one up later this morning.
0
 

Author Comment

by:deanlee17
ID: 38778255
An example would be fantastic if you do get the time.

Also, how did you generate the HTML visualizer image?
0
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38778272
>>Are you sure you are bringing back the correct page?

That is why I suggested, in an earlier comment, checking what actually was being processed.
0
 

Author Comment

by:deanlee17
ID: 38778285
Yes, sorry Andy. I need to get Agility Pack to show me what is being read into it; I'm looking into how to do this.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38778345
Click the little arrow next to the magnifying glass for the property you are interested in:

Screenshot
0
 

Author Comment

by:deanlee17
ID: 38778477
Guys, you are absolutely right, it seems to be the cookies problem. I got the same error as in the screenshot you posted earlier.

Earlier in my code I have:

            string BrokerPrefix = "http://www.brokerforum.com/electronic-components-search-en.jsa?originalFullPartNumber=";
            string BrokerSuffix = "&x=50&y=16&hasNoSearchCriteria=false";
            string ConcatBroker = BrokerPrefix + TxtboxSearch + BrokerSuffix;
            BroBrowser.Navigate(ConcatBroker);


            BrokerSearch = new SearchResults ();
            BrokerSearch.BrokerURL = ConcatBroker;



The page loads correctly in the browser control, so I assumed it was also loading correctly in HTML Agility Pack.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38778502
Well the WebBrowser control (I assume that's what "BroBrowser" is) will handle cookies on its own. It's a scaled down version of IE, effectively. HAP relies on HttpWebRequest/Response, and so cookies need to be handled manually.
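
To make "handled manually" concrete: with HttpWebRequest, cookies are only stored and replayed if a CookieContainer is attached to each request. Below is a small sketch of that idea under stated assumptions. The class and method names are hypothetical, and example.com is a placeholder. On newer .NET versions, HttpClient with HttpClientHandler.CookieContainer plays the same role.

```csharp
using System;
using System.Net;

public static class SharedCookieJar
{
    // One container for the whole session; every request reads from and writes to it.
    public static readonly CookieContainer Jar = new CookieContainer();

    // Creates a request that sends the stored cookies and records any new ones.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Jar;
        return request;
    }

    // Number of stored cookies that would be sent to the given URL.
    public static int CookiesFor(string url)
    {
        return Jar.GetCookies(new Uri(url)).Count;
    }
}
```

Because the same container instance is reused, a session cookie set by the first response is sent automatically with later requests to the same site.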
0
 

Author Comment

by:deanlee17
ID: 38778511
Yes, you are correct, it is a WebBrowser control. I see.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38779364
This appears to work for grabbing the page:

using System.Net;
using HtmlAgilityPack;

namespace ConsoleApplication56
{
    class Program
    {
        private static CookieCollection _cookies = new CookieCollection();

        static void Main(string[] args)
        {
            HtmlWeb web = new HtmlWeb() { PreRequest = BeforeHAPRequest, PostResponse = AfterHAPRequest };
            HtmlDocument doc = web.Load("http://www.netcomponents.com/results.htm?t=f&r=1&pn1=12345J");

            doc = web.Load("http://www.netcomponents.com/results.htm?d=1&t=f&r=1&pn1=12345J");
        }

        static bool BeforeHAPRequest(HttpWebRequest request)
        {
            request.CookieContainer = new CookieContainer();

            foreach (Cookie cookie in _cookies)
            {
                request.CookieContainer.Add(cookie);
            }

            return true;
        }

        static void AfterHAPRequest(HttpWebRequest request, HttpWebResponse response)
        {
            _cookies = request.CookieContainer.GetCookies(request.RequestUri);
        }
    }
}



Try executing your search against the HTML that is returned by the above.
0
 

Author Comment

by:deanlee17
ID: 38781892
Oh, excellent.

OK, so now I need to get the data returned in 'doc' into my code below?

    public void NetComponents()
    {
        HtmlWeb NC = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = NC.Load(NetComsURL.NetComURL);
        HtmlNodeCollection NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]//td");

        if (NetlinkNodes != null)
        {
            // Loop through the nodes and grab the last one
            foreach (var node in NetlinkNodes)
            {
                // Create instance of class and load result into it
                NetCommsReturnResults = new NetCommsForumSearch();
                NetCommsReturnResults.NetCommsResult += node.InnerText;
            }
        }
    }



How can I get that value out of the static class?

Thanks.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38782350
What static class?
0
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38782458
                    //Create instance of class and load result into it
                    NetCommsReturnResults = new NetCommsForumSearch();
                    NetCommsReturnResults.NetCommsResult += node.InnerText;



Each time you assign a new object to an existing variable (line 2 of the snippet above), the previous object is replaced. This code would ONLY ever give the value of the final node.InnerText; all the values from the other nodes are thrown away.
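
If the text of every node is wanted, the object has to be created once, before the loop. Here is a minimal sketch of that pattern. SearchResultHolder is a hypothetical stand-in for NetCommsForumSearch, and plain strings stand in for node.InnerText so the example runs without HAP.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the NetCommsForumSearch class in the thread.
public sealed class SearchResultHolder
{
    public string Result = "";
}

public static class NodeTextCollector
{
    // Creates the holder once, then appends every node's text to it.
    public static SearchResultHolder Collect(IEnumerable<string> nodeTexts)
    {
        var holder = new SearchResultHolder();   // outside the loop: never replaced
        var buffer = new StringBuilder();

        foreach (string text in nodeTexts)
        {
            buffer.Append(text);
        }

        holder.Result = buffer.ToString();
        return holder;
    }
}
```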
0
 

Author Comment

by:deanlee17
ID: 38782545
Oh yes, I understand that it would only get the final node. Basically, I'm struggling to integrate your code into my project. Ignore what I said about a static class.

Is the result of your class the HTML stored within 'doc'?

How can I use 'doc' in the code below?

  public void NetComponents()
        {

            HtmlWeb NC = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = NC.Load(NetComsURL.NetComURL);
            HtmlNodeCollection NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]//td");

            if (NetlinkNodes != null)
            {
                // Loop through the nodes in and grab the last one
                foreach (var node in NetlinkNodes)
                {
                    //Create instance of class and load result into it
                    NetCommsReturnResults = new NetCommsForumSearch();
                    NetCommsReturnResults.NetCommsResult += node.InnerText;

                }

            }
}



Many Thanks.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38782721
using System.Net;

...

private CookieCollection _cookies = new CookieCollection();

public void NetComponents()
{
    HtmlWeb NC = new HtmlWeb() { PreRequest = BeforeHAPRequest, PostResponse = AfterHAPRequest };
    HtmlAgilityPack.HtmlDocument doc = NC.Load(NetComsURL.NetComURL);
    HtmlNodeCollection NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]//td");

    if (NetlinkNodes != null)
    {
        // Loop through the nodes in and grab the last one
        foreach (var node in NetlinkNodes)
        {
            //Create instance of class and load result into it
            NetCommsReturnResults = new NetCommsForumSearch();
            NetCommsReturnResults.NetCommsResult += node.InnerText;

        }

    }
}

private bool BeforeHAPRequest(HttpWebRequest request)
{
    request.CookieContainer = new CookieContainer();

    foreach (Cookie cookie in _cookies)
    {
        request.CookieContainer.Add(cookie);
    }

    return true;
}

private void AfterHAPRequest(HttpWebRequest request, HttpWebResponse response)
{
    _cookies = request.CookieContainer.GetCookies(request.RequestUri);
}



I think you'll need to preload the cookies collection by navigating to the base URL first. This is why you saw two requests in my example. Try it without that step, and if it doesn't work, add another call to preload the cookies.
0
 

Author Comment

by:deanlee17
ID: 38782778
Invalid URI: The format of the URI could not be determined.

On line....

 HtmlAgilityPack.HtmlDocument doc = NC.Load(NetComsURL.NetComURL);

==========================================

OK, ignore the above; I was sending the wrong link. I've sorted that out and added a breakpoint, and what is being passed in is the site login page.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38782926
Are you handling logins within your app? I recall you mentioning so earlier in the discussion.
0
 
LVL 44

Expert Comment

by:AndyAinscow
ID: 38782956
If you only require the last node check if this works:
    if (NetlinkNodes != null)
    {
        // Get last node
        NetCommsReturnResults = new NetCommsForumSearch();
        NetCommsReturnResults.NetCommsResult = NetlinkNodes[NetlinkNodes.Count - 1].InnerText;
    }



If it does, no looping is required and only one new statement is executed.
0
 

Author Comment

by:deanlee17
ID: 38782987
kaufmed: I am logging into the website manually (the first time the app loads) within the web browser and saving the credentials.

AndyAinscow: I shall try this when I have sorted out logging into the site. You are right, it makes no sense to do all that looping just to get the last node.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38783064
Ah. I don't think that is going to work. I believe the WebBrowser and HAP's cookies will be mutually exclusive. Neither is accessible to the other. Try this out:

using System.IO;   // StreamWriter
using System.Net;

...

private CookieCollection _cookies = new CookieCollection();

public void NetComponents()
{
    HtmlWeb NC = new HtmlWeb() { PostResponse = AfterHAPRequest };
    HtmlAgilityPack.HtmlDocument doc;
    HtmlNodeCollection NetlinkNodes;

    NC.PreRequest = BeforeHAPRequestLogin;
    NC.Load("http://www.netcomponents.com/login.htm");    
    NC.PreRequest = BeforeHAPRequest; 
    doc = NC.Load(NetComsURL.NetComURL);
    NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]//td");
    
    if (NetlinkNodes != null)
    {
        // Loop through the nodes in and grab the last one
        foreach (var node in NetlinkNodes)
        {
            //Create instance of class and load result into it
            NetCommsReturnResults = new NetCommsForumSearch();
            NetCommsReturnResults.NetCommsResult += node.InnerText;

        }

    }
}

private bool BeforeHAPRequestLogin(HttpWebRequest request)
{
    string org = "org=" + your_acct_num;
    string login = "login=" + your_login_name;
    string pwd = "pwd=" + your_password;

    request.Method = "POST";

    using (StreamWriter writer = new StreamWriter(request.GetRequestStream()))
    {
        string requestBody = org + "&" + login + "&" + pwd;

        writer.Write(requestBody);
    }

    return true;
}

private bool BeforeHAPRequest(HttpWebRequest request)
{
    request.CookieContainer = new CookieContainer();

    foreach (Cookie cookie in _cookies)
    {
        request.CookieContainer.Add(cookie);
    }

    return true;
}

private void AfterHAPRequest(HttpWebRequest request, HttpWebResponse response)
{
    _cookies = request.CookieContainer.GetCookies(request.RequestUri);
}



Be sure to change your_acct_num, your_login_name, and your_password accordingly.
0
 

Author Comment

by:deanlee17
ID: 38783121
Ok changed it and got...

Object reference not set to an instance of an object.

on line

_cookies = request.CookieContainer.GetCookies(request.RequestUri);
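
A likely cause: HttpWebRequest.CookieContainer is null unless something assigns it, and the login PreRequest handler in the code above never does, so the PostResponse callback dereferences null on the login request. Here is a hedged sketch of a guard (the class and method names are made up for illustration). It only avoids the exception; it does not make the login itself succeed.

```csharp
using System.Net;

public static class CookieReader
{
    // Returns the cookies stored for the request's URI, or an empty
    // collection when no CookieContainer was attached to the request.
    public static CookieCollection ReadCookies(HttpWebRequest request)
    {
        CookieContainer container = request.CookieContainer;
        if (container == null)
        {
            return new CookieCollection();
        }

        return container.GetCookies(request.RequestUri);
    }
}
```

Inside AfterHAPRequest this would be used as `_cookies = CookieReader.ReadCookies(request);`.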
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38783248
Can you put a breakpoint on that line and see if there is a value in response.Cookies?
0
 

Author Comment

by:deanlee17
ID: 38783417
Home now mate, will do it first thing in morning
0
 

Author Comment

by:deanlee17
ID: 38786259
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38787571
Sorry, I was asking for the response object  : )
0
 

Author Comment

by:deanlee17
ID: 38787823
I'm sorry, I don't know which line you mean :)
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38787893
In the code above, you have the method AfterHAPRequest. It takes two parameters. The first is the request; the second is the response. Mouse over the second parameter (while at a breakpoint within the method) and see if the Cookies collection is populated.
0
 

Author Comment

by:deanlee17
ID: 38792372
Ah ok, see attached...
printscreen.png
0
 
LVL 74

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 38792504
Can you expand Cookies? If you see a cookie count of more than zero, then try the following; otherwise, we'll have to try a different approach.

using System.IO;   // StreamWriter
using System.Net;

...

private CookieCollection _cookies = new CookieCollection();

public void NetComponents()
{
    HtmlWeb NC = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc;
    HtmlNodeCollection NetlinkNodes;

    NC.PreRequest = BeforeHAPRequestLogin;
    NC.PostResponse = AfterHAPRequestLogin;
    NC.Load("http://www.netcomponents.com/login.htm");    
    NC.PreRequest = BeforeHAPRequest;
    NC.PostResponse = AfterHAPRequest;
    doc = NC.Load(NetComsURL.NetComURL);
    NetlinkNodes = doc.DocumentNode.SelectNodes("//div[@ID=\"livesearch\"]//td");
    
    if (NetlinkNodes != null)
    {
        // Loop through the nodes in and grab the last one
        foreach (var node in NetlinkNodes)
        {
            //Create instance of class and load result into it
            NetCommsReturnResults = new NetCommsForumSearch();
            NetCommsReturnResults.NetCommsResult += node.InnerText;

        }

    }
}

private bool BeforeHAPRequestLogin(HttpWebRequest request)
{
    string org = "org=" + your_acct_num;
    string login = "login=" + your_login_name;
    string pwd = "pwd=" + your_password;

    request.Method = "POST";

    using (StreamWriter writer = new StreamWriter(request.GetRequestStream()))
    {
        string requestBody = org + "&" + login + "&" + pwd;

        writer.Write(requestBody);
    }

    return true;
}

private void AfterHAPRequestLogin(HttpWebRequest request, HttpWebResponse response)
{
    _cookies = response.Cookies;
}

private bool BeforeHAPRequest(HttpWebRequest request)
{
    request.CookieContainer = new CookieContainer();

    foreach (Cookie cookie in _cookies)
    {
        request.CookieContainer.Add(cookie);
    }

    return true;
}

private void AfterHAPRequest(HttpWebRequest request, HttpWebResponse response)
{
    _cookies = request.CookieContainer.GetCookies(request.RequestUri);
}


0
 

Author Comment

by:deanlee17
ID: 38792531
Cookie count was zero, sadly :(
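
One plausible explanation for the empty collection: HttpWebResponse.Cookies is only filled in when the request had a CookieContainer assigned, and the login handler above does not assign one. The sketch below shows how the login request could be prepared with that in mind. It is an assumption-laden example, not a verified login for this site: the field names org, login, and pwd are taken from the code earlier in the thread, the class and method names are hypothetical, and the real form may need other fields or a different URL.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class LoginRequest
{
    // Shared by the login request and every later page request.
    public static readonly CookieContainer Jar = new CookieContainer();

    // Builds a URL-encoded form body so special characters in the values survive.
    public static string BuildBody(string org, string login, string pwd)
    {
        return "org=" + Uri.EscapeDataString(org)
             + "&login=" + Uri.EscapeDataString(login)
             + "&pwd=" + Uri.EscapeDataString(pwd);
    }

    // Sets cookie handling and headers; does not touch the network.
    public static void Prepare(HttpWebRequest request)
    {
        request.CookieContainer = Jar;   // without this, response cookies are not collected
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
    }

    // Writes the body to the request stream (this opens the connection).
    public static void WriteBody(HttpWebRequest request, string body)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(body);
        request.ContentLength = bytes.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(bytes, 0, bytes.Length);
        }
    }
}
```

A PreRequest handler for the login call would then call Prepare followed by WriteBody and return true, and the handler for later requests would only need to assign `request.CookieContainer = LoginRequest.Jar`.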
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 38792606
If you login with your WebBrowser control, does [WebBrowser Control].Document.Cookie contain anything?
0
 

Author Comment

by:deanlee17
ID: 38804466
Hi, sorry, I was off sick yesterday.

Are you asking me to set a breakpoint?

I cannot run my program as it stands: it throws the error mentioned above because 'HttpWebResponse response' is empty.
0
