Solved

How to Extract URL

Posted on 2009-02-11
Last Modified: 2013-12-19
Hi,

I want to extract URLs from the Google search engine, more than 5k of them. The results page is all JavaScript, so I can't sort anything out. Please give me some code or resources to build this.
Question by:Yogesh_Agarwal
19 Comments
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 23620590
I think more information is needed.

What is in JavaScript?

You placed the question in the VB.NET and Windows zones, so is this going to be a Windows application?

By search engine, do you mean scraping URLs from search engine results for particular search phrases?

 

Author Comment

by:Yogesh_Agarwal
ID: 23620621
Yeah, milking URLs from the search engine and adding them to a ListBox. I will send the search string programmatically, and it should give me all the URLs that Google has for that phrase.
 
LVL 17

Expert Comment

by:Shanmuga Sundaram
ID: 23620642
Did you check here?
http://urenjoy.blogspot.com/2008/10/extract-links-from-string.html


 
LVL 23

Expert Comment

by:Tony McCreath
ID: 23620650
Google provides a JavaScript/AJAX API for search requests, where the results are easier to read:

http://code.google.com/apis/ajaxsearch/
 

Author Comment

by:Yogesh_Agarwal
ID: 23621718
I will be glad if somebody gives me direct code. I am confused about how to use this. I will accept the solution of whoever gives me full code.
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 23623503
Please clarify the language you want this code in.

It sounds like you are asking two questions that have been answered quite a few times on this website:

How do I make an HTTP request? (HttpWebRequest or WebClient class)

How do I scrape data from HTML? (normally using the Regex class)
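
As a rough illustration of those two steps together, here is a minimal VB.NET sketch. It assumes a plain HTML page and uses a deliberately simple pattern, not a full HTML parser. The names LinkScraper and ExtractLinks and the example.com URL are placeholders, not anything from this thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

Module LinkScraper
    ' Pulls href values out of an HTML string with a simple regular expression.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)()
        ' Matches href="..." or href='...' (a rough pattern; it will miss unquoted hrefs)
        Dim pattern As String = "href\s*=\s*[""']([^""']+)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main()
        ' Step 1: make the HTTP request (placeholder URL)
        Dim html As String
        Using client As New WebClient()
            html = client.DownloadString("http://example.com/")
        End Using

        ' Step 2: scrape the links out of the returned HTML
        For Each link As String In ExtractLinks(html)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

A regex like this is fine for a quick scan, but a parser such as HtmlAgilityPack is usually more robust once the markup gets messy.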
 

Author Comment

by:Yogesh_Agarwal
ID: 23623556
I don't know how to code them. I am a noob, so I want some correct code in VB.NET.
 

Author Comment

by:Yogesh_Agarwal
ID: 23635324
Can anyone give me code? I have code, but it fetches only the first page of Google results. It's written using httpclient and webrequest objects.
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 23639771
Post the code and maybe we can work on it.
 

Author Comment

by:Yogesh_Agarwal
ID: 23639780
yeah sure..
Imports System.Net
Imports System.Web
Imports HtmlAgilityPack

Public Class cls2
    Public Function GetResults(ByVal query As String) As Uri()

        ' URL-encode the query (spaces become +)
        query = HttpUtility.UrlEncode(query)

        ' Build the search URL
        Dim url As String = "http://www.google.de/search?q=" & query

        ' Use a WebClient to send the query, posing as Internet Explorer 7
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

        ' Read the HTML page and select the root node
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load(client.OpenRead(url))
        Dim rootNode As HtmlNode = doc.DocumentNode

        ' Select all result links using XPath
        Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

        ' SelectNodes returns Nothing when no link matches
        If resultNodes Is Nothing Then Return New Uri() {}

        ' Copy the href of each result into the array
        Dim links As Uri() = New Uri(resultNodes.Count - 1) {}
        For i As Integer = 0 To resultNodes.Count - 1
            links(i) = New Uri(resultNodes(i).Attributes("href").Value)
        Next
        Return links

    End Function
End Class

    ' Calling code from the form. The Sub header was cut off in the post,
    ' so the name FillListBox and the declaration of c are placeholders.
    Private Sub FillListBox()
        Dim c As New cls2()
        ListBox1.Items.Clear()
        Dim results() As Uri
        Dim s As String = TextBox1.Text
        results = c.GetResults(s)
        For Each result As Uri In results
            ListBox1.Items.Add(result)
        Next
    End Sub

 
LVL 23

Accepted Solution

by:
Tony McCreath earned 2000 total points
ID: 23639925
Try this.
  • I converted your array into a generic list so I can easily keep adding URIs to it.
  • I added a loop to repeat the query, each time telling Google to return the next page of results.
  • I changed the For to a For Each because it looks neater.
    ' Needs at the top of the file: Imports System.Collections.Generic, System.Net, System.Web, HtmlAgilityPack
    Public Function GetResults(ByVal query As String, ByVal pagesToGet As Integer) As List(Of Uri)
        Dim links As New List(Of Uri)()

        ' URL-encode the query (spaces become +)
        query = HttpUtility.UrlEncode(query)

        ' Loop through the first pagesToGet pages of results
        For page As Integer = 0 To pagesToGet - 1

            ' Build the search URL
            Dim url As String = "http://www.google.de/search?q=" & query

            ' Google expects a start variable that indicates the first result to display.
            ' As each page contains 10 results, the start value = page * 10
            If page > 0 Then
                url &= "&start=" & (page * 10)
            End If

            ' Use a WebClient to send the query, posing as Internet Explorer 7
            Dim client As New WebClient()
            client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

            ' Read the HTML page and select the root node
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.Load(client.OpenRead(url))
            Dim rootNode As HtmlNode = doc.DocumentNode

            ' Select all result links using XPath
            Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

            ' SelectNodes returns Nothing when no link matches, so stop paging there
            If resultNodes Is Nothing Then Exit For

            ' Add every result link on this page to the list
            For Each linkNode As HtmlNode In resultNodes
                links.Add(New Uri(linkNode.Attributes("href").Value))
            Next
        Next
        Return links

    End Function

 

Author Comment

by:Yogesh_Agarwal
ID: 23640339
Hey, I got an error. I have posted a picture.
Untitled.jpg
 

Author Comment

by:Yogesh_Agarwal
ID: 23640341
Please give me 2-4 code suggestions. I have to submit my project in the next few days.
 

Author Comment

by:Yogesh_Agarwal
ID: 23640362
I converted the Uri values to strings and used an array. Now all is perfect. Thanks for the help!
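
The error itself is only visible in the attached screenshot, so the following is just a guess at what that kind of conversion might look like: a small helper that turns a List(Of Uri) into a String array. The names UriToStringExample and ToStringArray and the example.com addresses are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module UriToStringExample
    ' Converts each Uri in the list to its string form.
    Function ToStringArray(ByVal links As List(Of Uri)) As String()
        Dim result(links.Count - 1) As String
        For i As Integer = 0 To links.Count - 1
            result(i) = links(i).ToString()
        Next
        Return result
    End Function

    Sub Main()
        ' Placeholder data standing in for the list returned by GetResults
        Dim links As New List(Of Uri)()
        links.Add(New Uri("http://example.com/a"))
        links.Add(New Uri("http://example.com/b"))

        For Each s As String In ToStringArray(links)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```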
 

Author Closing Comment

by:Yogesh_Agarwal
ID: 31545953
nice solution.. very fast..
 

Author Comment

by:Yogesh_Agarwal
ID: 23640389
How do I make WebClient use a proxy? I am getting blocked by Google for making such requests.
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 23642994
As this indicates that the activity breaks Google's TOS (Terms of Service), I would rather not help develop a way to try to fool Google.
 

Author Comment

by:Yogesh_Agarwal
ID: 23643426
OK man, can you tell me where I can find consistent and reliable proxies (IP and port)?
 

Author Comment

by:Yogesh_Agarwal
ID: 23704674
Mate, why is the timeout option not showing up? client.Timeout is not available.
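
For context, WebClient does not expose a Timeout property, which would explain why it does not show up. One common workaround, sketched here under the assumption that a per-request timeout in milliseconds is what is wanted, is to subclass WebClient and set the timeout on the underlying WebRequest. The class name TimeoutWebClient is hypothetical.

```vbnet
Imports System
Imports System.Net

' A WebClient subclass that applies a timeout to every request it creates.
Public Class TimeoutWebClient
    Inherits WebClient

    Private ReadOnly _timeoutMs As Integer

    Public Sub New(ByVal timeoutMs As Integer)
        _timeoutMs = timeoutMs
    End Sub

    ' WebClient calls this for each request; the timeout is set on the request object.
    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If request IsNot Nothing Then
            request.Timeout = _timeoutMs
        End If
        Return request
    End Function
End Class
```

It would then be used in place of the plain client, for example `Dim client As New TimeoutWebClient(10000)` for a 10-second limit.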