• Status: Solved

Parse Google Search Engine Results using C#

Hello

I am trying to figure out a way to screen scrape a Google search results page using C#. I currently use the Custom Search API to get a ranking, and I want to compare that ranking with the one obtained from the screen scrape.

So can anybody point me to any help on this?

Thanks
Kev.
Asked by: KABarrie
1 Solution
 
Asim Nazir Commented:
OK. Do the following:

Add a WebBrowser control to a Windows Form in a C# application.
Set www.google.com as the default URL to navigate to.
Once the results are shown, analyze them using an HTML parsing library together with the WebBrowser control.
This gives you access to the DOM. You can also use regular expressions for parsing.

I parsed a different site a long time ago and was successful using this approach, so the approach is tested :)

Let me know if you need assistance with these steps. I hope it helps.
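
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the form name `ScrapeForm` is hypothetical, and it only lists anchor targets through the WebBrowser control's own DOM wrapper once a document has finished loading.

```csharp
using System;
using System.Diagnostics;
using System.Windows.Forms;

// Hypothetical form hosting a WebBrowser control (the name is an assumption).
public class ScrapeForm : Form
{
    private readonly WebBrowser browser = new WebBrowser
    {
        Dock = DockStyle.Fill,
        ScriptErrorsSuppressed = true
    };

    public ScrapeForm()
    {
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate("http://www.google.com/");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (browser.Document == null) return;

        // Walk the DOM exposed by the control and write each link target
        // to the debug output.
        foreach (HtmlElement anchor in browser.Document.GetElementsByTagName("a"))
        {
            Debug.WriteLine(anchor.GetAttribute("href"));
        }
    }

    [STAThread]
    private static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ScrapeForm());
    }
}
```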
 
KABarrie (Author) Commented:

I'm new to all this, so can you go into more detail on the HTML and WebBrowser step? Thanks.

Kev.
 
Asim Nazir Commented:
OK. Go with this HTML parser then: http://htmlagilitypack.codeplex.com/
It has a number of examples with source code. I am sure this will do.
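
As a rough sketch of what such a parser call might look like, assuming the Html Agility Pack assembly is referenced: the markup string below is made up for illustration and stands in for whatever HTML you obtain from the WebBrowser control.

```csharp
using System;
using HtmlAgilityPack;

class LinkLister
{
    static void Main()
    {
        // Hypothetical markup; in the real application this would come from
        // the WebBrowser control (e.g. Document.Body.InnerHtml).
        string html = "<div><a href=\"http://example.com/\">Example</a><a name=\"x\">no href</a></div>";

        var doc = new HtmlDocument();
        // LoadHtml parses a string of markup. Load(string) expects a file path instead.
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;

        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.GetAttributeValue("href", "") + " : " + link.InnerText);
        }
    }
}
```

The XPath `//a[@href]` selects only anchors that carry an `href` attribute, so the second anchor in the sample markup is skipped.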
 
KABarrie (Author) Commented:

Sorry to be a pain, but I am trying to load the HTML into an HtmlDocument. See the code:

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(webBrowser1.Document.Body.InnerHtml);

            listBox1.Items.Clear();

            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                string value = link.InnerHtml.ToString();
            }

I keep getting "Illegal characters in path."

Am I doing this right?

Kev.
 
KABarrie (Author) Commented:
Hello

I've managed to parse the first page of a Google search using the WebBrowser control and Html Agility Pack. My code looks like this:

webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(ScrapeScreen);

webBrowser.Navigate(buildURL.ToString());


private void ScrapeScreen(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            if (e.Url.Equals(((WebBrowser)sender).Url))
            {  
                if (((WebBrowser)sender).Document.Body.InnerHtml != null)
                {
                    bool found = false;

                    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
                    doc.LoadHtml(((WebBrowser)sender).Document.Body.InnerHtml.ToString());

                    var hrefs = doc.DocumentNode.SelectNodes("//h3");

                    int i = 0;

                    if (hrefs != null)
                    {
                        foreach (var tag in hrefs)
                        {
                            string tagURL = tag.InnerHtml.ToString();
                            i++;

                            if (tagURL.ToUpper().Contains(txtURL.Text.ToString().ToUpper()))
                            {
                                found = true;
                                txtScrapeRanking.Text = i.ToString();
                                break;
                            }
                        }

                        if (!found)
                        {

                        }
                    }
                }
            }
        }

My question now is: if I have not found the value I am looking for on the first page, I need to access the second page of Google results. How can I tell that the DocumentCompleted event has finished, so that I can check the result and, if nothing was found, run again with a new URL?

I hope this makes sense.

Kevin.
 
Asim Nazir Commented:
Hi Kevin,

Nice to hear this. DocumentCompleted is the right event for this, but be aware that it can fire multiple times for a single page (for example, once per frame). For a more detailed solution, see this article: http://support.microsoft.com/kb/180366
You can also check the solution provided here: http://stackoverflow.com/questions/2777878/detect-webbrowser-complete-page-loading

I hope it helps.
Asim
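
One way this might be wired up is sketched below, under stated assumptions: `ResultPager`, `MaxPages`, and the `pageHasMatch` callback are hypothetical names introduced for illustration, and the URL format (a `start` offset of 10 results per page) is an assumption about Google's query string that may change. The handler ignores the per-frame firings by comparing `e.Url` with the browser's top-level `Url`, then navigates to the next page only when no match was found.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper: pages through results until a match is found.
public class ResultPager
{
    private const int MaxPages = 5; // arbitrary cap (assumption)
    private readonly WebBrowser browser;
    private readonly string query;
    private readonly Func<string, bool> pageHasMatch;
    private int page;

    public ResultPager(WebBrowser browser, string query, Func<string, bool> pageHasMatch)
    {
        this.browser = browser;
        this.query = query;
        this.pageHasMatch = pageHasMatch;
        browser.DocumentCompleted += OnDocumentCompleted;
    }

    public void Start()
    {
        page = 0;
        browser.Navigate(BuildUrl(query, page));
    }

    // Assumption: results are paged with a "start" offset of 10 per page.
    public static string BuildUrl(string query, int page)
    {
        return "http://www.google.com/search?q=" + Uri.EscapeDataString(query)
             + "&start=" + (page * 10);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Frames raise this event too; only the top-level document's URL
        // matches the browser's own Url.
        if (!e.Url.Equals(browser.Url)) return;
        if (browser.Document == null || browser.Document.Body == null) return;

        string html = browser.Document.Body.InnerHtml ?? string.Empty;
        if (pageHasMatch(html)) return; // found: stop paging

        page++;
        if (page < MaxPages)
        {
            // Navigating here causes DocumentCompleted to fire again
            // for the next page of results.
            browser.Navigate(BuildUrl(query, page));
        }
    }
}
```

In the code from the earlier post, the callback could wrap the existing `//h3` search, and `Start()` would replace the direct `webBrowser.Navigate(...)` call. Note that a rank counted inside the callback restarts on each page, so the page index would need to be factored in to get an overall position.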
 
KABarrie (Author) Commented:
Great help.
