Parse Google Search Engine Results using C#

Last Modified: 2016-12-07
Hello

I am trying to figure out a way to screen scrape the Google search results page using C#. I use the Custom Search API to get a ranking, and I want to compare that ranking with the one from the screen scrape.

So can anybody point me to any help on this?

Thanks
Kev.

Asim Nazir, Project Manager

Commented:
OK. Do the following:

Add a WebBrowser control to a Windows Form in a C# application.
Set www.google.com as the default URL (the control's Url property).
Once the results are shown, analyze the page using an HTML parsing library together with the WebBrowser control.
You can get access to the DOM this way. You can also use regular expressions for parsing.

I parsed a different site a long time ago and was successful using this approach, so it is tested :)

Let me know if you need help following the steps above. I hope it helps.
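
The steps above could look roughly like the following. This is a minimal sketch, not code from the thread: `SearchForm`, `linkList`, `ExtractHrefs` and the example query URL are hypothetical names and values chosen for illustration. It shows both routes mentioned: reading links through the WebBrowser DOM, and pulling them out of raw markup with a regular expression.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Windows.Forms;

// Hypothetical form: hosts a WebBrowser and lists every link on the loaded page.
public class SearchForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly ListBox linkList = new ListBox { Dock = DockStyle.Bottom, Height = 150 };

    public SearchForm()
    {
        Controls.Add(browser);
        Controls.Add(linkList);
        browser.DocumentCompleted += OnDocumentCompleted;
        // Example query only; any search URL can be passed here.
        browser.Navigate("https://www.google.com/search?q=html+agility+pack");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event also fires for frames; wait for the top-level document.
        if (!e.Url.Equals(browser.Url) || browser.Document == null) return;

        linkList.Items.Clear();

        // DOM access: walk the <a> elements the control has already parsed.
        foreach (HtmlElement anchor in browser.Document.GetElementsByTagName("a"))
        {
            string href = anchor.GetAttribute("href");
            if (!string.IsNullOrEmpty(href)) linkList.Items.Add(href);
        }
    }

    // Regex alternative: pull href values straight out of raw markup.
    // Fine for quick checks, but more brittle than a real HTML parser.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        var pattern = @"<a\s[^>]*?href\s*=\s*[""']([^""']+)[""']";
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
            hrefs.Add(m.Groups[1].Value);
        return hrefs;
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new SearchForm());
    }
}
```

The regex helper takes any markup string, so it can be fed `browser.Document.Body.InnerHtml` from inside the handler if the DOM route is not wanted.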

Author

Commented:

I'm new to all this, so can you go into the HTML and WebBrowser step? Thanks.

Kev.
Asim Nazir, Project Manager

Commented:
OK. Go with this HTML parser then: http://htmlagilitypack.codeplex.com/
It has a number of examples with source code. I am sure this will do.

Author

Commented:

Sorry to be a pain, but I am trying to load the HTML into an HtmlDocument. See the code:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(webBrowser1.Document.Body.InnerHtml);

    listBox1.Items.Clear();

    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        string value = link.InnerHtml.ToString();
    }

I keep getting this error: Illegal characters in path.

Am I doing this right?

Kev.
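
The "Illegal characters in path" error is consistent with how the Html Agility Pack overloads work: `HtmlDocument.Load(string)` treats its argument as a file path, while `HtmlDocument.LoadHtml(string)` parses a string of markup. A minimal sketch, assuming the markup is already in a string and the HtmlAgilityPack package is referenced; `LinkLister` and `GetHrefs` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href of every anchor in the markup (empty list if there are none).
    public static List<string> GetHrefs(string html)
    {
        // Fully qualified because System.Windows.Forms also defines an HtmlDocument type.
        var doc = new HtmlAgilityPack.HtmlDocument();

        // LoadHtml parses markup held in a string; Load(string) would look for a file.
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return hrefs;

        foreach (HtmlNode anchor in anchors)
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));

        return hrefs;
    }
}
```

In the snippet from the post, this would mean passing `webBrowser1.Document.Body.InnerHtml` to `LoadHtml` instead of `Load`, and checking the `SelectNodes` result for null before the `foreach`.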

Author

Commented:
Hello

I've managed to parse the first page of a Google search using the WebBrowser control and Html Agility Pack. My code looks like this:

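webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(ScrapeScreen);
webBrowser.Navigate(buildURL.ToString());

private void ScrapeScreen(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var browser = (WebBrowser)sender;

    // DocumentCompleted also fires for frames; only handle the top-level page.
    if (!e.Url.Equals(browser.Url))
        return;

    // Nothing to parse until a document with a body has loaded.
    if (browser.Document == null || browser.Document.Body == null || browser.Document.Body.InnerHtml == null)
        return;

    bool found = false;

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(browser.Document.Body.InnerHtml);

    // Each <h3> is treated as one search result; SelectNodes returns null when nothing matches.
    var headings = doc.DocumentNode.SelectNodes("//h3");
    if (headings == null)
        return;

    int i = 0;
    foreach (var tag in headings)
    {
        i++;

        // Case-insensitive match of the target URL against the heading markup.
        if (tag.InnerHtml.ToUpper().Contains(txtURL.Text.ToUpper()))
        {
            found = true;
            txtScrapeRanking.Text = i.ToString();
            break;
        }
    }

    if (!found)
    {
        // Not on this page: this is where the next results page would be requested.
    }
}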

My question now is: if I have not found the value I am looking for on the first page, I need to access the second page of Google results. How can I tell that the DocumentCompleted event has finished, so that I can check the result and, if it was not found, run again with a new URL?

I hope this makes sense.

Kevin.
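
One possible way to approach the paging question, sketched under assumptions and not taken from the answers in this thread: the DocumentCompleted handler running is itself the signal that the page has loaded, so the handler can decide at its end whether to navigate to the next page, which makes the same handler fire again. The `RankingSearch` class below is hypothetical. It assumes that Google pages results through a `start` offset in the query string and that each page is requested in order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: tracks the current results page and converts a
// position on that page into an overall ranking.
public class RankingSearch
{
    private readonly string query;
    private readonly string target;
    private readonly int maxPages;
    private readonly int pageSize;
    private int page;   // zero-based index of the page being processed
    private int seen;   // results counted on the pages already checked

    public RankingSearch(string query, string target, int maxPages = 5, int pageSize = 10)
    {
        this.query = query;
        this.target = target;
        this.maxPages = maxPages;
        this.pageSize = pageSize;
    }

    // URL of the current page; the "start" offset is an assumption about Google's URL scheme.
    public string CurrentUrl
    {
        get
        {
            return "https://www.google.com/search?q=" + Uri.EscapeDataString(query)
                 + "&start=" + (page * pageSize);
        }
    }

    // Returns the overall 1-based rank if the target appears on this page, otherwise null.
    public int? CheckPage(IList<string> resultsOnPage)
    {
        for (int i = 0; i < resultsOnPage.Count; i++)
        {
            if (resultsOnPage[i].IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0)
                return seen + i + 1;
        }
        seen += resultsOnPage.Count;
        return null;
    }

    // Moves to the next page; returns false once the page limit is reached.
    public bool MoveToNextPage()
    {
        if (page + 1 >= maxPages) return false;
        page++;
        return true;
    }
}

// Wiring inside the existing ScrapeScreen handler (names taken from the post above),
// where "titles" holds the InnerHtml of each <h3> node and "search" is a RankingSearch field:
//
//   int? rank = search.CheckPage(titles);
//   if (rank.HasValue)                txtScrapeRanking.Text = rank.Value.ToString();
//   else if (search.MoveToNextPage()) ((WebBrowser)sender).Navigate(search.CurrentUrl);
//   else                              txtScrapeRanking.Text = "Not found";
```

Keeping the page counter outside the handler means no loop or wait is needed: each completed page either ends the search or triggers exactly one further `Navigate` call.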
Project Manager
Commented:

Author

Commented:
Great help.