Solved

open source web crawlers?

Posted on 2004-12-01
Last Modified: 2012-05-05
Can anyone recommend an open source Java program that crawls the web? So far the best I have found is Heritrix (http://crawler.archive.org/). However, it uses RAM rather than disk to store both the list of URLs it has already visited and the queue of URLs it has encountered but not yet visited. This is a major limitation, since broad crawls (crawls that follow every link encountered) will consume many gigabytes of memory. I am therefore looking for a freely available web crawler, written in Java, that uses a bounded amount of RAM regardless of the size of the crawl.

Thank you!
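
For readers wondering what "bounded RAM regardless of crawl size" can look like in practice, here is a minimal sketch. It is not how Heritrix or any other named crawler works. The class name `DiskFrontier`, its methods, and the filter parameters are all hypothetical, chosen only for illustration. It assumes the pending-URL queue is kept in a file on disk and the "already seen" set is approximated by a fixed-size Bloom filter, so a small fraction of new URLs may be wrongly skipped as duplicates.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.BitSet;

/** Hypothetical sketch: a crawl frontier whose RAM use does not grow with the crawl. */
class DiskFrontier implements AutoCloseable {
    private static final int HASHES = 4;   // assumed number of Bloom filter probes

    private final RandomAccessFile file;   // pending URLs live on disk, not in RAM
    private long readPos = 0;              // file offset of the next URL to hand out
    private final BitSet seen;             // fixed-size Bloom filter bits
    private final int bits;

    DiskFrontier(Path queueFile, int bloomBits) throws IOException {
        this.file = new RandomAccessFile(queueFile.toFile(), "rw");
        this.file.setLength(0);            // start with an empty queue
        this.bits = bloomBits;
        this.seen = new BitSet(bloomBits);
    }

    /** Enqueues the URL unless the filter reports it as (probably) seen before. */
    boolean offer(String url) throws IOException {
        if (!markSeen(url)) {
            return false;
        }
        file.seek(file.length());          // append at the end of the file
        file.writeUTF(url);                // note: writeUTF is limited to 65535 encoded bytes
        return true;
    }

    /** Returns the next pending URL, or null when the queue is empty. */
    String poll() throws IOException {
        if (readPos >= file.length()) {
            return null;
        }
        file.seek(readPos);
        String url = file.readUTF();
        readPos = file.getFilePointer();
        return url;
    }

    /** Sets the URL's filter bits; returns true if at least one bit was not set yet. */
    private boolean markSeen(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        boolean isNew = false;
        for (int i = 0; i < HASHES; i++) {
            int idx = Math.floorMod(h1 + i * h2, bits);
            if (!seen.get(idx)) {
                isNew = true;
                seen.set(idx);
            }
        }
        return isNew;
    }

    /** 32-bit FNV-1a over the string's chars, used as a second hash. */
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

In this sketch RAM use is fixed by `bloomBits` (one bit each), while the queue file keeps growing on disk because consumed entries are never reclaimed. A real crawler would also need per-host politeness, crash recovery, and a filter sized for the expected number of URLs.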
Question by:bobwood2000
10 Comments
 
LVL 8

Expert Comment

by:kiranhk
ID: 12721097
 

Author Comment

by:bobwood2000
ID: 12721192
Have either of you used any of these programs? Do you know which would be suitable for performing crawls of hundreds of millions of web sites while using only a bounded amount of RAM?

Thanks.
 
LVL 86

Expert Comment

by:CEHJ
ID: 12721244
I haven't personally, but if I'd needed a web crawler of any power, I'd probably have avoided Java.
 

Author Comment

by:bobwood2000
ID: 12721823
Why would you avoid Java? Lots of developers have chosen to use Java for web crawlers. And, for distributed web crawlers, what could be better than using J2EE?
 
LVL 86

Expert Comment

by:CEHJ
ID: 12722047
>>Why would you avoid Java?

Because Java's HTML-oriented libraries are weak. Also, for something like this you need as small a footprint and as fast a performance as possible, and native code is better for that.

>>Lots of developers have chosen to use Java for web crawlers

Well, there are a lot of Java programmers around these days ;-)
 

Author Comment

by:bobwood2000
ID: 12722492
Okay, I'll try to justify using Java for this. Let me know if I'm overlooking something.

1. I need a memory footprint that does not grow with the number of pages crawled. Whether the memory footprint is bounded by 1/2 GB or 1 GB does not matter too much for this application.

2. I probably won't use HTML-oriented libraries, because I would rather just use regular expressions to parse the web pages. My thinking is that regular expressions will be faster, perhaps by multiple orders of magnitude; forming HTML parse trees is ridiculously slow in all languages.

3. CPU usage might be higher in Java, but I would guess by not more than 25%. (Does that seem about right?) I'd be very willing to rent an extra couple of servers in order to avoid dealing with segmentation faults.

Am I overlooking anything?
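
To make point 2 concrete, here is a rough sketch of regex-based link extraction using only `java.util.regex`. The class name `RegexLinkExtractor` and the pattern itself are illustrative assumptions, not code from any crawler mentioned in this thread. The pattern only covers `href` attributes on `<a>` tags, and it makes no claim about speed relative to a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull href values out of anchor tags without building a parse tree. */
class RegexLinkExtractor {
    // Matches <a ... href=VALUE where VALUE is double-quoted, single-quoted, or unquoted.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the raw href values in document order (relative URLs are not resolved). */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is non-null for each match.
            String url = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            links.add(url);
        }
        return links;
    }
}
```

A sketch like this will also pick up links inside HTML comments or script blocks and does not decode entities such as `&amp;`, so the extracted strings would still need normalizing (for example with `java.net.URI.resolve`) before being queued.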
 
LVL 86

Expert Comment

by:CEHJ
ID: 12722529
No, that seems about right. Your crucial point is 2. If you're *not* going to use Java HTML processing, then you're in with a good chance.
 
LVL 86

Accepted Solution

by:
CEHJ earned 1500 total points
ID: 12722568
Having said that, of course if you end up having to write custom code to perform 2. then that hardly reflects well on Java, the 'language of networking' ;-)
 
LVL 2

Expert Comment

by:jprgn
ID: 12723205