KABarrie
asked on
Parse Google Search Engine Results using C#
Hello
I am trying to figure out a way to screen scrape the Google search results page using C#. I currently use the Custom Search API to get a ranking, and I want to compare that ranking with the one from the screen scrape.
Can anybody point me to any help on this?
Thanks
Kev.
ASKER
I'm new to all this, so can you go into the HTML and WebBrowser step in more detail? Thanks.
Kev.
OK. Go with this HTML parser then: http://htmlagilitypack.codeplex.com/
It has a number of examples with source code. I am sure this will do.
ASKER
Sorry to be a pain, but I am trying to load the HTML into an HtmlDocument. See the code:
HtmlAgilityPack.HtmlDocume
doc.Load(webBrowser1.Docum
listBox1.Items.Clear();
foreach(HtmlNode link in doc.DocumentNode.SelectNod
{
string value = link.InnerHtml.ToString();
}
and I keep getting the error "Illegal characters in path."
Am I doing this right?
Kev.
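The code above is cut off, so the exact call cannot be confirmed. If it passes the page markup as a string to `doc.Load(...)`, that would explain the error: in HTML Agility Pack, `Load(string)` treats its argument as a file path, while `LoadHtml(string)` parses markup held in memory. A minimal sketch under that assumption follows; the class name, method name, and sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LoadHtmlSketch
{
    // Returns the href of every <a> element found in a string of markup.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        // LoadHtml parses markup held in memory. Load(string) expects a file
        // path and fails if it is handed a whole page of HTML instead.
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        // SelectNodes returns null when nothing matches, so guard first.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                hrefs.Add(link.GetAttributeValue("href", string.Empty));
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical sample standing in for the markup held by the WebBrowser control.
        string html = "<html><body><a href=\"http://example.com/\">Example</a></body></html>";
        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

In a Windows Forms file that also imports System.Windows.Forms, the type name HtmlDocument is ambiguous, so it needs to be written as HtmlAgilityPack.HtmlDocument.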
ASKER
Hello
I've managed to parse the first page of a Google search using the WebBrowser control and HTML Agility Pack. My code looks like this:
webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(ScrapeScreen);
webBrowser.Navigate(buildURL.ToString());
private void ScrapeScreen(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (e.Url.Equals(((WebBrowser)sender).Url))
    {
        if (((WebBrowser)sender).Document.Body.InnerHtml != null)
        {
            bool found = false;
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(((WebBrowser)sender).Document.Body.InnerHtml.ToString());
            var hrefs = doc.DocumentNode.SelectNodes("//h3");
            int i = 0;
            if (hrefs != null)
            {
                foreach (var tag in hrefs)
                {
                    string tagURL = tag.InnerHtml.ToString();
                    i++;
                    if (tagURL.ToUpper().Contains(txtURL.Text.ToString().ToUpper()))
                    {
                        found = true;
                        txtScrapeRanking.Text = i.ToString();
                        break;
                    }
                }
                if (!found)
                {
                }
            }
        }
    }
}
My question now is this: if I have not found the value I am looking for on the first page, I need to access the second page of Google results. How can I tell that the DocumentCompleted event has finished, so that I can check the result and, if the value was not found, run the search again with a new URL?
I hope this makes sense.
Kevin.
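One possible way to approach this, offered as a sketch rather than the accepted answer: DocumentCompleted fires again for every navigation, so the handler itself can decide whether to request the next page. The paging state then has to live outside the handler. The RankSearch class below is a hypothetical helper that holds that state; it assumes Google pages its results through a `start` offset with ten results per page, which is worth verifying against the URLs the browser actually shows.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps paging state between DocumentCompleted calls.
public class RankSearch
{
    private readonly string query;
    private readonly string target;
    private readonly int maxPages;
    private int pageIndex;   // zero-based index of the page being parsed
    private int seen;        // results counted on earlier pages

    // Overall position of the first match, or null if none was found yet.
    public int? Rank { get; private set; }

    public RankSearch(string query, string target, int maxPages)
    {
        this.query = query;
        this.target = target;
        this.maxPages = maxPages;
    }

    // Assumption: results are paged with a "start" offset, ten per page.
    public string CurrentUrl()
    {
        return "https://www.google.com/search?q=" + Uri.EscapeDataString(query)
             + "&start=" + (pageIndex * 10);
    }

    // Call once per completed page with the <h3> contents in page order.
    // Returns the next URL to navigate to, or null when the search is over.
    public string OnPageParsed(IList<string> headings)
    {
        for (int i = 0; i < headings.Count; i++)
        {
            if (headings[i].IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                Rank = seen + i + 1;
                return null;   // found: stop paging
            }
        }
        seen += headings.Count;
        pageIndex++;
        return pageIndex < maxPages ? CurrentUrl() : null;
    }
}
```

Inside ScrapeScreen, the InnerHtml of each h3 node would be collected into a list and passed to OnPageParsed. If the return value is not null, calling webBrowser.Navigate with it loads the next page, and ScrapeScreen runs again when that page completes. If it is null, Rank holds either the position or null after maxPages pages.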
ASKER
Great help.
Add a WebBrowser control to a Windows Form in a C# application.
Set www.google.com as the default URL to navigate to.
Once the results are shown, run the analysis using an HTML parsing library together with the WebBrowser control.
This gives you access to the DOM. You can also use regular expressions for parsing.
I parsed a different site this way a long time ago and it worked, so the approach is tested :)
Let me know if you need help following these steps. I hope it helps.
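The regular-expression option mentioned in the steps above can be sketched as follows. This is only an illustration: the HeadingParser class is hypothetical, it assumes each result title sits in an h3 element (as the asker's XPath does), and a regex is more fragile than a real HTML parser when the markup changes.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadingParser
{
    // Matches each <h3>...</h3> pair; Singleline lets "." span line breaks.
    private static readonly Regex Heading = new Regex(
        @"<h3\b[^>]*>(.*?)</h3>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Strips tags nested inside a heading, such as the <a> around the title.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    public static List<string> ExtractHeadings(string html)
    {
        var headings = new List<string>();
        foreach (Match m in Heading.Matches(html))
        {
            string text = Tag.Replace(m.Groups[1].Value, string.Empty);
            // Decode entities such as &amp; and trim surrounding whitespace.
            headings.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return headings;
    }

    public static void Main()
    {
        // Hypothetical sample markup; in the form it would come from the WebBrowser control.
        string html = "<h3 class=\"r\"><a href=\"http://example.com/\">Example &amp; Co</a></h3>"
                    + "<h3><a href=\"http://example.org/\">Second</a></h3>";
        foreach (string heading in ExtractHeadings(html))
        {
            Console.WriteLine(heading);
        }
    }
}
```

In the form, the input string could come from the WebBrowser control's DocumentText property once DocumentCompleted has fired.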