How to Extract URL

Last Modified: 2013-12-19
Hi,

I want to extract URLs from the Google search engine, more than 5k of them. The results page is fully in JavaScript, so I can't sort anything out of it. Please give me some code or a resource to build it.

Tony McCreath, Technical SEO Consultant

Commented:
I think more information is needed.

What is in JavaScript?

You placed the question in the VB.NET and Windows zones, so is this going to be a Windows application?

By search engine, are you talking about scraping URLs from search engine results for particular search phrases?

Author

Commented:
Yes, scraping URLs from the search engine and adding them to a ListBox. I will send the search string programmatically, and it should give me all the URLs that Google has for that phrase.
Shanmuga Sundaram, Director of Software Engineering
CERTIFIED EXPERT

Commented:
Did you check here?
http://urenjoy.blogspot.com/2008/10/extract-links-from-string.html


Tony McCreath, Technical SEO Consultant

Commented:
Google provides a JavaScript/AJAX API for search requests, whose results are easier to read:

http://code.google.com/apis/ajaxsearch/

Author

Commented:
I would be glad if somebody gave me direct code. I am confused about how to use it. I will accept the solution of whoever gives me full code.
Tony McCreath, Technical SEO Consultant

Commented:
Please clarify the language you want this code in.

It sounds like you are asking two questions that have been answered quite a few times on this website:

  • How do I make an HTTP request? (HttpWebRequest or WebClient class)
  • How do I scrape data from HTML? (normally using the Regex class)
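
For illustration only (this is not from the original replies), the two steps above can be sketched in VB.NET roughly as follows. The module name LinkScraper and the functions DownloadHtml and ExtractLinks are hypothetical names chosen for this sketch, and the regular expression is a simplifying assumption: it only picks up absolute http/https links in quoted href attributes, so an HTML parser is usually a more robust choice for real pages.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

Module LinkScraper
    ' Step 1: make an HTTP request and return the page body as text.
    ' (Hypothetical helper; the caller supplies the URL.)
    Public Function DownloadHtml(ByVal url As String) As String
        Using client As New WebClient()
            Return client.DownloadString(url)
        End Using
    End Function

    ' Step 2: scrape data from the HTML with the Regex class.
    ' Assumption: only absolute http/https links in quoted href attributes are wanted.
    Public Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)()
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""'](https?://[^""']+)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main()
        ' A fixed sample string is used here so the sketch runs without network access.
        Dim sample As String = "<p><a href=""http://example.com/a"">A</a> <a class='x' href='https://example.org/b'>B</a></p>"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

In a Windows Forms application, the strings returned by ExtractLinks could be added to a ListBox one by one with ListBox1.Items.Add.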

Author

Commented:
I don't know how to code them. I am a noob, so I want some correct code in VB.NET.

Author

Commented:
Can anyone give me code? I have code, but it fetches only the first page of Google results. It is written using httpclient and webrequest objects.
Tony McCreath, Technical SEO Consultant

Commented:
Post the code and maybe we can work on it.

Author

Commented:
yeah sure..
' Requires references to System.Web (for HttpUtility) and HtmlAgilityPack
Imports System.Net
Imports System.Web
Imports HtmlAgilityPack

Public Class cls2
    Public Function GetResults(ByVal query As String) As Uri()

        ' Encode url and replace spaces with +
        query = HttpUtility.UrlEncode(query)

        ' Build query string
        Dim url As String = "http://www.google.de/search?q=" & query

        ' We use a WebClient to query and pose as "Internet Explorer 7"
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

        ' Read the html-page and select the root-node
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load(client.OpenRead(url))
        Dim rootNode As HtmlNode = doc.DocumentNode

        ' Now select all links by using xpath
        Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

        ' SelectNodes returns Nothing when no node matches
        If resultNodes Is Nothing Then Return New Uri() {}

        ' Loop over all results
        Dim links As Uri() = New Uri(resultNodes.Count - 1) {}
        For i As Integer = 0 To resultNodes.Count - 1
            links(i) = New Uri(resultNodes(i).Attributes("href").Value)
        Next
        Return links

    End Function
End Class

' In the form class. The Sub header was missing from the post; Button1_Click is an assumed name.
Private Sub Button1_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Button1.Click
    ListBox1.Items.Clear()
    Dim c As New cls2()
    Dim s As String = TextBox1.Text
    Dim results() As Uri = c.GetResults(s)
    For Each result As Uri In results
        ListBox1.Items.Add(result)
    Next
End Sub


Tony McCreath, Technical SEO Consultant
Commented:
Try this.
  • I converted your array into a generic list so I can easily keep adding URIs to it.
  • I added a loop to repeat the query, each time telling Google to return the next page of results.
  • I changed the For to a For Each because it's neater looking.
    Public Function GetResults(ByVal query As String, ByVal pagesToGet As Integer) As List(Of Uri)
        Dim links As New List(Of Uri)()

        ' Encode url and replace spaces with +
        query = HttpUtility.UrlEncode(query)

        ' Loop through the first pagesToGet pages of results
        For page As Integer = 0 To pagesToGet - 1

            ' Build query string
            Dim url As String = "http://www.google.de/search?q=" & query

            ' Google expects a start variable that indicates the first result to display.
            ' As each page contains 10 results, the start value = page * 10
            If page > 0 Then
                url &= "&start=" & (page * 10)
            End If

            ' We use a WebClient to query and pose as "Internet Explorer 7"
            Dim client As New WebClient()
            client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

            ' Read the html-page and select the root-node
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.Load(client.OpenRead(url))
            Dim rootNode As HtmlNode = doc.DocumentNode

            ' Now select all links by using xpath
            Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

            ' SelectNodes returns Nothing when no node matches, so stop paging
            If resultNodes Is Nothing Then Exit For

            ' Loop over all results
            For Each linkNode As HtmlNode In resultNodes
                links.Add(New Uri(linkNode.Attributes("href").Value))
            Next
        Next
        Return links

    End Function


Author

Commented:
Hey, I got an error. I have posted a picture.
Untitled.jpg

Author

Commented:
Please give me 2-4 code suggestions. I have to submit my project in the next few days.

Author

Commented:
I converted the Uri to a String and used an array. Now all is perfect. Thanks for the help!

Author

Commented:
Nice solution, very fast.

Author

Commented:
How do I make WebClient use a proxy? I am getting blocked by Google for making such requests.
Tony McCreath, Technical SEO Consultant

Commented:
As this indicates that the activity breaks Google's TOS (Terms of Service), I would rather not help develop a way to try to fool Google.

Author

Commented:
OK, can you tell me where I can find consistent and reliable proxies (IP and port)?

Author

Commented:
Mate, why is the timeout option not showing up? client.timeout is not available.