Yogesh_Agarwal

asked on

How to Extract URL

hi,

I want to extract URLs from the Google search engine, more than 5k of them. The results page is fully in JavaScript, so I can't sort anything out. Please give me some code or resources to build it.
Tony McCreath

I think more information is needed.

What is in JavaScript?

You placed the question in a VB.NET and Windows zone, so is this going to be a Windows application?

By search engine, are you talking about scraping URLs from search engine results for particular search phrases?

Yogesh_Agarwal

ASKER

Yeah, milking URLs from the search engine and adding them to a ListBox. I will send the search string programmatically, and it should give me all the URLs that Google has for that phrase.
Shanmuga Sundaram D
Did you check here?
http://urenjoy.blogspot.com/2008/10/extract-links-from-string.html


Google provides a JavaScript/AJAX API for search requests, where the results are easier to read:

http://code.google.com/apis/ajaxsearch/
I will be glad if somebody gives me direct code. I am confused about how to use it. I will accept the solution of whoever gives me full code.
Please clarify the language you want this code in.

It sounds like you are asking two questions which have been answered quite a few times on this website.

How do I make an HTTP request? (HttpWebRequest or WebClient class)

How do I scrape data from HTML? (normally using the Regex class)
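The two steps named above can be sketched roughly as follows. This is a minimal illustration, not the accepted solution: the module name LinkScraper, the function names, and the regular expression are all assumptions made for the example, and a regex like this only catches absolute http/https links in simple markup.

```vbnet
Imports System.Collections.Generic
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical helper module; none of these names come from the thread.
Public Module LinkScraper

    ' Step 1: make the HTTP request and return the page as text.
    Public Function DownloadHtml(ByVal address As String) As String
        Using client As New WebClient()
            Return client.DownloadString(address)
        End Using
    End Function

    ' Step 2: pull the absolute http/https href values out of the HTML.
    Public Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)()
        ' Matches href="http://..." or href='https://...' (assumed pattern).
        Dim pattern As String = "href\s*=\s*[""'](https?://[^""']+)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

End Module
```

A caller would pass the result of DownloadHtml to ExtractLinks and add each returned string to the ListBox. For anything beyond simple pages, an HTML parser is usually more robust than a regex.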
I don't know how to code them. I am a noob, so I want some correct code in VB.NET.
Can anyone give me code? I have code, but it fetches only the first page of Google results. It is written using an HTTP client and web request object.
Post the code and maybe we can work on it.
yeah sure..
Imports System.Net
Imports System.Web
Imports HtmlAgilityPack

Public Class cls2
    Public Function GetResults(ByVal query As String) As Uri()

        ' Encode the query for use in a URL (spaces become +)
        query = HttpUtility.UrlEncode(query)

        ' Build the search URL
        Dim url As String = "http://www.google.de/search?q=" & query

        ' Use a WebClient and identify as "Internet Explorer 7"
        Dim client As New WebClient()
        client.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)")

        ' Read the HTML page and select the root node
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load(client.OpenRead(url))
        Dim rootNode As HtmlNode = doc.DocumentNode

        ' Select all result links by using XPath
        Dim resultNodes As HtmlNodeCollection = rootNode.SelectNodes("//a[@class='l']")

        ' SelectNodes returns Nothing when no node matches
        If resultNodes Is Nothing Then Return New Uri() {}

        ' Loop over all results
        Dim links As Uri() = New Uri(resultNodes.Count - 1) {}
        For i As Integer = 0 To resultNodes.Count - 1
            links(i) = New Uri(resultNodes(i).Attributes("href").Value)
        Next
        Return links

    End Function
End Class

' Form code. The original post omitted the Sub header and the declaration of c;
' the handler name below is assumed.
Private Sub Button1_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Button1.Click
    ListBox1.Items.Clear()
    Dim c As New cls2()
    Dim results() As Uri = c.GetResults(TextBox1.Text)
    For Each result As Uri In results
        ListBox1.Items.Add(result)
    Next
End Sub


ASKER CERTIFIED SOLUTION
Tony McCreath

This solution is only available to members.
Hey, I got an error. I have posted a picture.
Untitled.jpg
Please give me 2-4 code suggestions. I have to submit my project in the next few days.
I converted the Uri to a string and used an array. Now all is perfect. Thanks for the help!
Nice solution, very fast.
How do I make WebClient use a proxy? I am getting blocked by Google for making such requests.
As this indicates that the activity is breaking Google's TOS (Terms of Service), I would rather not help develop a way to try to fool Google.
OK, can you tell me where I can find consistent and reliable proxies (IP and port)?
Why is the timeout option not showing up? client.Timeout is not available.
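On the timeout question: WebClient does not expose a Timeout property, which is why it does not show up. A common workaround is to subclass WebClient and set the timeout on the underlying WebRequest. The sketch below assumes that approach; the class name TimeoutWebClient and the property name TimeoutMilliseconds are made up for the example and are not part of the framework.

```vbnet
Imports System
Imports System.Net

' Hypothetical subclass; the class and property names are assumptions.
Public Class TimeoutWebClient
    Inherits WebClient

    Private _timeoutMilliseconds As Integer = 10000

    ' Timeout applied to each request, in milliseconds.
    Public Property TimeoutMilliseconds() As Integer
        Get
            Return _timeoutMilliseconds
        End Get
        Set(ByVal value As Integer)
            _timeoutMilliseconds = value
        End Set
    End Property

    ' WebClient calls this to create each request, so the timeout is set here.
    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If request IsNot Nothing Then
            request.Timeout = _timeoutMilliseconds
        End If
        Return request
    End Function

End Class
```

It would be used in place of WebClient, for example Dim client As New TimeoutWebClient() followed by client.TimeoutMilliseconds = 5000 before the download call.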