How to write a crawler to extract all links from a website

I want to write an application that accepts a start URL for a website, finds all the links present on that site, and writes them to a text file.
Any ideas?
Thanks
mmalik15 asked:
 
harshgandhi18 commented:
This link may help you:

http://stackoverflow.com/questions/2425043/how-do-you-screen-scrape

You can also use the Html Agility Pack (http://htmlagilitypack.codeplex.com/) for this.
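For reference, link extraction with the Html Agility Pack might look something like the sketch below (a minimal, untested example; it assumes the Html Agility Pack library is installed, and the target URL is just a placeholder):

```csharp
using System;
using HtmlAgilityPack; // third-party: the Html Agility Pack library

class LinkDump
{
    static void Main()
    {
        // Load the page and query every <a> element that carries an href.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.example.com");

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode link in anchors)
            {
                Console.WriteLine(link.GetAttributeValue("href", ""));
            }
        }
    }
}
```

The advantage over a regex is that the library parses the HTML properly, so it copes with attributes in any order, single or double quotes, and malformed markup.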
 
Om Prakash commented:
Please check the following link:
http://www.dotnetperls.com/scraping-html
 
käµfm³d 👽 commented:
This is not terribly difficult to accomplish. You can use a combination of WebClient, List, Queue, and regular expressions. Here's a quick, untested example which should give you the basics:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

namespace _27504172
{
    class Program
    {
        static void Main(string[] args)
        {
            Queue<string> toVisit = new Queue<string>();
            List<string> visited = new List<string>();
            string baseURL = "http://www.example.com";

            toVisit.Enqueue(baseURL);

            using (WebClient client = new WebClient())
            {
                while (toVisit.Count > 0)
                {
                    string current = toVisit.Dequeue();

                    // new Uri(string) throws on a relative URL, so parse with
                    // UriKind.RelativeOrAbsolute and resolve relative links
                    // against the base URL.
                    Uri parsed = new Uri(current, UriKind.RelativeOrAbsolute);

                    if (!parsed.IsAbsoluteUri)
                    {
                        current = baseURL + "/" + current.TrimStart('/');
                    }

                    if (!visited.Contains(current))
                    {
                        // Record the URL before crawling it so we never fetch
                        // the same page twice (and never loop forever).
                        visited.Add(current);

                        string html = client.DownloadString(current);

                        foreach (Match m in Regex.Matches(html, "<a [^>]*href=['\"]?([^ '\"]+)"))
                        {
                            toVisit.Enqueue(m.Groups[1].Value);
                        }
                    }
                }
            }

            // Dump every link found to a text file, as requested.
            System.IO.File.WriteAllLines("links.txt", visited.ToArray());
        }
    }
}



 
mmalik15 (author) commented:
Many thanks guys.
 
käµfm³d 👽 commented:
NP. Glad to help  = )

One thing to note is that my code is not a full solution. For example, you may end up crawling the whole web, because my code simply collects every link on a page; it doesn't check whether a link points back to the current site. You should be able to account for this easily; just know that the "flaw" is there.
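To illustrate, a same-site check can be done by resolving each candidate link against the base URI and comparing hosts. This is a minimal sketch (the `IsSameSite` helper and the example URLs are my own, not part of the code above):

```csharp
using System;

class HostFilterDemo
{
    // Returns true when the candidate link (absolute or relative)
    // resolves to the same host as the base URI.
    static bool IsSameSite(Uri baseUri, string candidate)
    {
        Uri resolved;
        // Resolve the candidate against the base so relative links work.
        if (!Uri.TryCreate(baseUri, candidate, out resolved))
            return false;
        // Compare hosts case-insensitively.
        return string.Equals(resolved.Host, baseUri.Host,
                             StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        Uri site = new Uri("http://www.example.com");
        Console.WriteLine(IsSameSite(site, "/about.html"));              // True
        Console.WriteLine(IsSameSite(site, "http://www.example.com/x")); // True
        Console.WriteLine(IsSameSite(site, "http://other.com/page"));    // False
    }
}
```

In the crawler above, you would call a check like this before enqueuing each match and skip links where it returns false.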
Question has a verified solution.