Solved

web bots and wget for spiders?

Posted on 2010-09-21
Medium Priority
1,065 Views
Last Modified: 2013-12-08
Hello. Is anyone familiar with writing spiders? Could you say what the pros and cons are of using wget for spidering versus other tools? My goal is to do focused crawling and downloading of blogs from the web.

Here is some info on other crawlers, from Wikipedia:

Open-source crawlers
Aspseek is a crawler, indexer and a search engine written in C++ and licensed under the GPL.
crawler4j is a crawler written in Java and released under an Apache License. It can be configured in a few minutes and is suitable for educational purposes.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search ( http://wikiasearch.com ) uses to crawl the web.
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on Web-site Parse Templates using the computer's free CPU resources only.
mnoGoSearch is a crawler, indexer and a search engine written in C and licensed under the GPL.
Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
Open Search Server is a search engine and web crawler software released under the GPL.
Pavuk is a command-line Web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and HTTrack, e.g., regular-expression-based filtering and file creation rules.
YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
Crawling the Deep Web and Web Applications

Crawling the Deep Web
A vast number of Web pages lie in the deep or invisible Web[41]. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find them if there are no links that point to them. Google's Sitemap Protocol and mod oai[42] are intended to allow discovery of these deep-Web resources.
Deep Web crawling also multiplies the number of Web links to be crawled. Some crawlers only take some of the URLs in <a href="URL"> form. In some cases, such as the Googlebot, Web crawling is done on all text contained inside the hypertext content, tags, or text.
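For illustration only (www.example.com is a placeholder): a site's sitemap can be fetched and its <loc> entries pulled out to seed a crawl, roughly like this:

# Download the sitemap and extract the listed URLs as a seed list:
wget -q -O - http://www.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's/<[^>]*>//g' > seed-urls.txt

# Feed the seed list back to wget (here just checking that the pages exist):
wget --spider -i seed-urls.txt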
Crawling Web 2.0 Applications
Shreeraj Shah provides insight into Crawling Ajax-driven Web 2.0 Applications.
Interested readers might wish to read AJAXSearch: Crawling, Indexing and Searching Web 2.0 Applications.
Making AJAX Applications Crawlable, from Google Code. It defines an agreement between web servers and search engine crawlers that allows for dynamically created content to be visible to crawlers. Google currently supports this agreement.[43]
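As a rough illustration of that agreement (www.example.com is a placeholder): a page addressed in the browser as http://www.example.com/page#!state is requested by the crawler with the #! fragment rewritten into an _escaped_fragment_ query parameter, and the server answers with a static HTML snapshot of the dynamically generated content:

# Fetch the crawler-visible snapshot of http://www.example.com/page#!state
wget -O snapshot.html "http://www.example.com/page?_escaped_fragment_=state"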
Question by:onyourmark
1 Comment
 
LVL 2

Accepted Solution

by: aaronblum (earned 2000 total points)
ID: 33746460
Depending on the blog structures (as long as they don't require form entries), "wget --spider" might provide you with everything that you need. Writing a full crawler can be very time consuming, and I expect that many of the options you are seeking already exist in wget.
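For example (blog.example.com is just a placeholder for the blog you want), something along these lines may already cover the use case:

# Pass 1: walk the site with --spider to see what is there, without saving pages:
wget --spider -r -l 3 -np -o spider.log http://blog.example.com/

# Pass 2: actually mirror the blog, politely, keeping HTML pages and fixing links for offline reading:
wget -r -l inf -np -k -p --wait=1 --random-wait \
     --domains blog.example.com -A html,htm \
     http://blog.example.com/

The -A/-R accept/reject lists, -np (--no-parent) and --domains options give you a fair amount of "focus" without writing any crawler code; if you need regular-expression filtering or form handling you would have to look at something like Pavuk, Heritrix or a custom crawler instead.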