Open source web crawlers?

Can anyone recommend an open source Java program that crawls the web? So far the best I have found is Heritrix (http://crawler.archive.org/). However, it uses RAM rather than disk to store the list of URLs that it has already visited and the queue of URLs it has encountered but not yet visited. This is a major limitation, since broad crawls (crawls of all links encountered) will consume many gigabytes. Thus, I am seeking a freely available web crawler written in Java that uses a bounded amount of RAM regardless of the size of the crawl.
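
To make the "bounded RAM" requirement concrete, here is a minimal sketch of one common way to cap the memory used by the visited-URL list: a fixed-size Bloom filter. This is a swapped-in technique for illustration, not how Heritrix or any crawler named in this thread works. The class name `VisitedUrlFilter`, the method `markVisited`, and the sizing parameters are all hypothetical. The trade-off is that a Bloom filter can report false positives, so a small fraction of unvisited URLs would be skipped. The queue of unvisited URLs is not covered here; it would still need to live on disk.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical fixed-size Bloom filter: memory is set at construction and never grows. */
class VisitedUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    VisitedUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /**
     * Marks the URL as visited.
     * Returns true if the URL was probably seen before (false positives are possible),
     * and false if it was definitely not seen before.
     */
    boolean markVisited(String url) {
        long h = fnv1a64(url);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // force an odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            int index = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(index)) {
                seen = false;
                bits.set(index);
            }
        }
        return seen;
    }

    /** 64-bit FNV-1a hash of the URL's UTF-8 bytes. */
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

With, say, `new VisitedUrlFilter(1 << 30, 4)` the filter occupies about 128 MB no matter how many URLs pass through it; the false-positive rate rises as more URLs are added, so the bit count has to be chosen for the expected crawl size.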

Thank you!
bobwood2000 Asked:

bobwood2000 (Author) Commented:
Have either of you used any of these programs? Do you know which would be suitable for performing crawls of hundreds of millions of web sites while using only a bounded amount of RAM?

Thanks.

CEHJ Commented:
I haven't personally, but if I'd needed a web crawler of any power then I'd probably have avoided Java.
bobwood2000 (Author) Commented:
Why would you avoid Java? Lots of developers have chosen to use Java for web crawlers. And, for distributed web crawlers, what could be better than using J2EE?
CEHJ Commented:
>>Why would you avoid Java?

Because Java's HTML-oriented libraries are weak, and for something like this you need as small a footprint and as fast a performance as possible. Native code is better for that.

>>Lots of developers have chosen to use Java for web crawlers

Well, there are a lot of Java programmers around these days ;-)
bobwood2000 (Author) Commented:
Okay, I'll try to justify using Java for this. Let me know if I'm overlooking something.

1. I need a memory footprint that does not grow with the number of pages crawled. Whether the memory footprint is bounded by 1/2 GB or 1 GB does not matter too much for this application.

2. I probably won't use html-oriented libraries, because I would rather just use regular expressions to parse the web pages. My thinking is that regular expressions will be faster by perhaps multiple orders of magnitude; forming html parse trees is ridiculously slow in all languages.

3. CPU usage might be higher in Java, but I would guess by not more than 25%. (Does that seem about right?) I'd be very willing to rent an extra couple of servers in order to avoid dealing with segmentation faults.
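
As a rough illustration of point 2, here is a minimal sketch of pulling links out of a page with `java.util.regex` instead of building a parse tree. The class name `RegexLinkExtractor` and the pattern itself are assumptions made for this example. The pattern only handles quoted `href` values inside `<a>` tags, so unquoted attributes and other edge cases in real-world HTML would need more work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts href values from anchor tags using a regular expression. */
class RegexLinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
    // Group 1 is the opening quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw href values in document order (not resolved against a base URL). */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

The returned values are whatever appears between the quotes, so relative links such as `/b` would still need to be resolved against the page URL before being queued.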

Am I overlooking anything?
CEHJ Commented:
No, that seems about right. Your crucial point is 2. If you're *not* going to use Java HTML processing then you're in with a good chance.
CEHJ Commented:
Having said that, of course if you end up having to write custom code to perform 2. then that hardly reflects well on Java, the 'language of networking' ;-)
jprgn Commented:

Question has a verified solution.
