KABarrie

Parse Google Search Engine Results using C#

Hello

I am trying to figure out a way to screen scrape the Google search results page using C#. I currently use the Custom Search API to get a ranking, and I want to compare that ranking with the one from the screen scrape.

So can anybody point me to any help on this?

Thanks
Kev.
Asim Nazir

OK. Do the following:

Add a WebBrowser control to a Windows Form in a C# application.
Set www.google.com as the control's starting URL (its Url property).
Once the results are shown, analyze the page using an HTML parsing library together with the WebBrowser control.
This gives you access to the DOM. You can also use regular expressions for parsing.

I parsed a different site this way a long time ago and it worked, so the approach is tested :)

Let me know if you need help with any of the steps above. I hope it helps.
KABarrie

ASKER


I'm new to all this, so can you go into more detail on the HTML and WebBrowser step? Thanks.

Kev.
OK. Go with this HTML parser then: http://htmlagilitypack.codeplex.com/
It has a number of examples with source code. I am sure this will do.

Sorry to be a pain, but I am trying to load the HTML into an HtmlDocument. Here is my code:

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(webBrowser1.Document.Body.InnerHtml);

            listBox1.Items.Clear();

            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                string value = link.InnerHtml.ToString();
            }

and I keep getting the error "Illegal characters in path."

Am I doing this right?

Kev.
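
For anyone hitting the same "Illegal characters in path" error: in HtmlAgilityPack, HtmlDocument.Load(string) expects a file path, while HtmlDocument.LoadHtml(string) parses markup that is already in memory. Below is a minimal sketch of the second form. The sample markup and the class name LoadHtmlExample are made up for illustration and stand in for webBrowser1.Document.Body.InnerHtml; it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class LoadHtmlExample
{
    static void Main()
    {
        // Hypothetical sample markup standing in for webBrowser1.Document.Body.InnerHtml.
        string html = "<div><a href=\"http://example.com/\">Example</a></div>";

        var doc = new HtmlDocument();

        // LoadHtml parses the string itself; Load(string) would treat it as a file path.
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // Read the href attribute, falling back to "" if it is missing.
                Console.WriteLine(link.GetAttributeValue("href", ""));
            }
        }
    }
}
```

The null check matters in the original loop as well: iterating directly over SelectNodes(...) would throw if the page contained no matching links.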
Hello

I've managed to parse the first page of a Google search using the WebBrowser control and HTML Agility Pack. My code looks like this:

webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(ScrapeScreen);

webBrowser.Navigate(buildURL.ToString());


private void ScrapeScreen(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser browser = (WebBrowser)sender;

    // DocumentCompleted also fires for frames; only handle the top-level page.
    if (e.Url.Equals(browser.Url))
    {
        if (browser.Document.Body.InnerHtml != null)
        {
            bool found = false;

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(browser.Document.Body.InnerHtml);

            // Each search result title is rendered inside an <h3> element.
            var hrefs = doc.DocumentNode.SelectNodes("//h3");

            int i = 0;

            if (hrefs != null)
            {
                foreach (var tag in hrefs)
                {
                    string tagURL = tag.InnerHtml;
                    i++;

                    // Case-insensitive check for the URL being ranked.
                    if (tagURL.ToUpper().Contains(txtURL.Text.ToUpper()))
                    {
                        found = true;
                        txtScrapeRanking.Text = i.ToString();
                        break;
                    }
                }

                if (!found)
                {
                    // Not found on this page.
                }
            }
        }
    }
}

My question now is: if I have not found the value I am looking for on the first page, I need to access the second page of Google results. How can I tell that the DocumentCompleted event has finished, so that I can check the result and, if the value was not found, run again with a new URL?

I hope this makes sense.

Kevin.
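
One possible way to approach the paging, offered as a sketch and not as the accepted answer: the ScrapeScreen handler already runs only after the page has loaded, so the empty "not found" branch is a natural place to navigate to the next page, which makes DocumentCompleted fire again. The helper below only tracks the page state. The class name PagedRankSearch and its members are hypothetical, and it assumes that Google accepts a zero-based start offset in the query string and returns 10 results per page, neither of which is confirmed in this thread.

```csharp
using System;

// Hypothetical helper: remembers which results page is being scraped.
class PagedRankSearch
{
    const int ResultsPerPage = 10;   // assumption: 10 results per page

    readonly string query;
    readonly int maxPages;
    int pageIndex;                   // zero-based index of the current page

    public PagedRankSearch(string query, int maxPages)
    {
        this.query = query;
        this.maxPages = maxPages;
    }

    // URL of the current page; "start" is assumed to be the result offset.
    public string CurrentUrl
    {
        get
        {
            return "https://www.google.com/search?q=" + Uri.EscapeDataString(query)
                 + "&start=" + (pageIndex * ResultsPerPage);
        }
    }

    // Converts a 1-based position on the current page into an overall rank.
    public int OverallRank(int positionOnPage)
    {
        return pageIndex * ResultsPerPage + positionOnPage;
    }

    // Moves to the next page; returns false once maxPages pages have been tried.
    public bool MoveNext()
    {
        if (pageIndex + 1 >= maxPages) return false;
        pageIndex++;
        return true;
    }
}

class Demo
{
    static void Main()
    {
        var search = new PagedRankSearch("html agility pack", 3);
        Console.WriteLine(search.CurrentUrl);

        // Simulates the "not found" branch being reached on the first page.
        if (search.MoveNext())
        {
            Console.WriteLine(search.CurrentUrl);
            Console.WriteLine(search.OverallRank(4));
        }
    }
}
```

In the form, an instance could be kept in a field and the "not found" branch could call search.MoveNext() followed by webBrowser.Navigate(search.CurrentUrl), with txtScrapeRanking.Text set from search.OverallRank(i) on a match. The maxPages limit keeps the handler from re-navigating forever when the URL never appears.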
ASKER CERTIFIED SOLUTION
Asim Nazir
Great help.