How to Extract URLs

Yogesh_Agarwal asked:

Hi,

I want to extract URLs from the Google search engine, more than 5k of them. It's all in JavaScript, so I can't sort anything out. Please give me some code or resources to build it.
 
Tony McCreath (Technical SEO Consultant) commented:
Try this.
  • I converted your array into a generic list so I can easily keep adding URIs to it.
  • I added a loop to repeat the query, each time telling Google to return the next page of results.
  • I changed the For loop to a For Each because it looks neater.
    ' Required at the top of the file:
    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Net
    Imports System.Web
    Imports HtmlAgilityPack

    Public Function GetResults(ByVal query As String, ByVal pagesToGet As Integer) As List(Of Uri)
        Dim links As New List(Of Uri)()

        ' URL-encode the query (spaces become +)
        query = HttpUtility.UrlEncode(query)

        ' Loop through the requested number of result pages
        For page As Integer = 0 To pagesToGet - 1

            ' Build the query string
            Dim url As String = "http://www.google.de/search?q=" & query

            ' Google expects a start variable that indicates the first result to display.
            ' As each page contains 10 results, the start value = page * 10
            If page > 0 Then
                url &= "&start=" & (page * 10)
            End If

            Dim doc As New HtmlAgilityPack.HtmlDocument()

            ' Use a WebClient that identifies itself as "Internet Explorer 7"
            Using client As New WebClient()
                client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

                ' Read the HTML page, closing the response stream afterwards
                Using stream As Stream = client.OpenRead(url)
                    doc.Load(stream)
                End Using
            End Using

            ' Select all result links using XPath
            Dim resultNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@class='l']")

            ' SelectNodes returns Nothing when no node matches, so stop paging
            If resultNodes Is Nothing Then Exit For

            ' Collect every result, skipping any href that is not an absolute URI
            For Each linkNode As HtmlNode In resultNodes
                Dim link As Uri = Nothing
                If Uri.TryCreate(linkNode.GetAttributeValue("href", ""), UriKind.Absolute, link) Then
                    links.Add(link)
                End If
            Next
        Next

        Return links
    End Function
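
The start-offset arithmetic used in the function above can be looked at in isolation. The following is only a sketch: the module name and the BuildSearchUrl helper are hypothetical (they are not part of the posted answer), and it assumes 10 results per page as described in the comments. It uses WebUtility.UrlEncode from System.Net (available from .NET 4.5) so that it runs without a System.Web reference.

```vbnet
Imports System
Imports System.Net

Module PagingSketch
    ' Hypothetical helper (not from the thread): builds the URL for one
    ' zero-based results page, assuming 10 results per page.
    Function BuildSearchUrl(ByVal query As String, ByVal page As Integer) As String
        Dim url As String = "http://www.google.de/search?q=" & WebUtility.UrlEncode(query)
        If page > 0 Then
            ' Page 1 starts at result 10, page 2 at result 20, and so on
            url &= "&start=" & (page * 10).ToString()
        End If
        Return url
    End Function

    Sub Main()
        ' Print the URLs for the first three pages
        For page As Integer = 0 To 2
            Console.WriteLine(BuildSearchUrl("vb.net tutorial", page))
        Next
    End Sub
End Module
```

The first URL carries no start parameter; the second and third end in &start=10 and &start=20.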

Tony McCreath (Technical SEO Consultant) commented:
I think more information is needed.

What is in JavaScript?

You placed the question in the VB.NET and Windows zones, so is this going to be a Windows application?

By "search engine", do you mean scraping URLs from search engine results for particular search phrases?

Yogesh_Agarwal (Author) commented:
Yes, milking URLs from the search engine and adding them to a ListBox. I will send the search string programmatically, and it should give me all the URLs that Google has for that phrase.
 
Shanmuga Sundaram (Director of Software Engineering) commented:
Did you check here?
http://urenjoy.blogspot.com/2008/10/extract-links-from-string.html

 
Tony McCreath (Technical SEO Consultant) commented:
Google provides a JavaScript/AJAX API for search requests, where the results are easier to read:

http://code.google.com/apis/ajaxsearch/
 
Yogesh_Agarwal (Author) commented:
I would be glad if somebody gave me the code directly. I am confused about how to use this. I will accept the solution of whoever gives me the full code.
 
Tony McCreath (Technical SEO Consultant) commented:
Please clarify which language you want this code in.

It sounds like you are asking two questions that have been answered quite a few times on this website:

  • How do I make an HTTP request? (HttpWebRequest or WebClient class)
  • How do I scrape data from HTML? (normally using the Regex class)
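
A minimal sketch of the second question, assuming the HTML is already available as a string. The module name, the ExtractHrefs helper, the pattern and the sample markup are all illustrative assumptions, not code from this thread. A regular expression like this only copes with simple, well-formed anchor tags.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexScrapeSketch
    ' Hypothetical helper: returns the href value of every <a> tag found in html.
    Function ExtractHrefs(ByVal html As String) As List(Of String)
        Dim hrefs As New List(Of String)()
        ' Matches href="..." or href='...' inside an <a ...> tag
        ' and captures the quoted value in group 1.
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']+)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            hrefs.Add(m.Groups(1).Value)
        Next
        Return hrefs
    End Function

    Sub Main()
        ' Made-up sample markup with one double-quoted and one single-quoted href
        Dim sample As String = "<p><a class=""l"" href=""http://example.com/a"">A</a> <a href='http://example.com/b'>B</a></p>"
        For Each href As String In ExtractHrefs(sample)
            Console.WriteLine(href)
        Next
    End Sub
End Module
```

For real pages, an HTML parser is usually more robust than a regular expression, because it copes with attribute order, odd quoting and nested markup.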
 
Yogesh_Agarwal (Author) commented:
I don't know how to code them. I am a noob, so I want some correct code in VB.NET.
 
Yogesh_Agarwal (Author) commented:
Can anyone give me code? I have code, but it fetches only the first page of Google results. It's written using httpclient and webrequest objects.
 
Tony McCreath (Technical SEO Consultant) commented:
Post the code and maybe we can work on it.
 
Yogesh_Agarwal (Author) commented:
Yeah, sure.
' Required at the top of the file:
Imports System.Net
Imports System.Web
Imports HtmlAgilityPack

Public Class cls2
    Public Function GetResults(ByVal query As String) As Uri()

        ' URL-encode the query (spaces become +)
        query = HttpUtility.UrlEncode(query)

        ' Build the query string
        Dim url As String = "http://www.google.de/search?q=" & query

        ' Use a WebClient that identifies itself as "Internet Explorer 7"
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

        ' Read the HTML page and select the root node
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load(client.OpenRead(url))
        Dim rootNode As HtmlNode = doc.DocumentNode

        ' Select all result links using XPath
        Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

        ' Copy every result into the array
        Dim links As Uri() = New Uri(resultNodes.Count - 1) {}
        For i As Integer = 0 To resultNodes.Count - 1
            links(i) = New Uri(resultNodes(i).Attributes("href").Value)
        Next
        Return links

    End Function
End Class

' In the form. The Sub header was missing from the post,
' so the handler name Button1_Click is an assumption.
Private Sub Button1_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Button1.Click
    ' c was not declared in the post; assumed to be a cls2 instance
    Dim c As New cls2()
    ListBox1.Items.Clear()
    Dim results() As Uri
    Dim s As String = TextBox1.Text
    results = c.GetResults(s)
    For Each result As Uri In results
        ListBox1.Items.Add(result)
    Next
End Sub

 
Yogesh_Agarwal (Author) commented:
Hey, I got an error. I have posted a picture.
Untitled.jpg
 
Yogesh_Agarwal (Author) commented:
Please give me 2-4 code suggestions. I have to submit my project in the next few days.
 
Yogesh_Agarwal (Author) commented:
I converted the Uri values to strings and used an array. Now everything works perfectly. Thanks for the help!
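
The thread does not show the final code, so the following is only a sketch of what such a conversion might look like. The module name, the ToStringArray helper and the sample URLs are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module UriToStringSketch
    ' Hypothetical helper: turns a list of Uri objects into a String array.
    Function ToStringArray(ByVal links As List(Of Uri)) As String()
        ' VB array bounds are inclusive, so Count - 1 gives Count elements
        Dim result(links.Count - 1) As String
        For i As Integer = 0 To links.Count - 1
            result(i) = links(i).ToString()
        Next
        Return result
    End Function

    Sub Main()
        ' Made-up sample data standing in for the scraped results
        Dim links As New List(Of Uri)()
        links.Add(New Uri("http://example.com/a"))
        links.Add(New Uri("http://example.com/b"))
        For Each s As String In ToStringArray(links)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

In a Windows Forms application, the resulting array could be passed to ListBox1.Items.AddRange instead of being printed.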
 
Yogesh_Agarwal (Author) commented:
Nice solution, very fast.
 
Yogesh_Agarwal (Author) commented:
How do I make WebClient use a proxy? Google is blocking me for making these requests.
 
Tony McCreath (Technical SEO Consultant) commented:
As this indicates that the activity breaks Google's TOS (Terms of Service), I would rather not help develop a way to try to fool Google.
 
Yogesh_Agarwal (Author) commented:
OK. Can you tell me where I can find consistent and reliable proxies (IP and port)?
 
Yogesh_Agarwal (Author) commented:
Why is the timeout option not showing up? client.Timeout is not available.
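
This question is not answered in the thread. For reference, WebClient itself exposes no Timeout property, but the WebRequest it creates internally does. A common workaround is a small subclass that overrides GetWebRequest. The sketch below assumes that approach; the class name TimeoutWebClient and the TimeoutMs field are hypothetical names.

```vbnet
Imports System
Imports System.Net

' Hypothetical subclass (the name is an assumption): applies a timeout
' to every request that this WebClient creates.
Public Class TimeoutWebClient
    Inherits WebClient

    ' Timeout in milliseconds; 10 seconds is an arbitrary default
    Public TimeoutMs As Integer = 10000

    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        ' Let WebClient build the request, then set the timeout on it
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If request IsNot Nothing Then
            request.Timeout = TimeoutMs
        End If
        Return request
    End Function
End Class
```

It can then be used wherever a WebClient is created, for example Dim client As New TimeoutWebClient() followed by client.TimeoutMs = 5000.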