Perl Spider

I would like to build a spider program in Perl.  I have read up on a few modules that will do this but I am confused about how to program using modules.

The end result I want is to be able to search the web and read meta tags.

What do I do?  Where do I start?
aboskoco asked:

Tintin commented:
All the various Perl modules for spidering contain plenty of documentation and examples.

When you say "search the web", what parameters/limitations/starting point are you intending to use?

mike_chase commented:
http://www.thebananatree.org/vector_space/building_a_spider.html provides a good starting point on how to put together a web spider.  

emblue commented:
On CPAN, read the explanation of LWP because that's the best library for downloading pages.

Your spider will probably want to take a starting address and download that page with LWP, then parse the HTML and extract the info you want using regular expressions.

You can look for links of the form <a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>
This gives the URL in $1 and the link title in $2.

You can then make a list of those URLs and visit each of them.
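
The steps above could be sketched roughly like this. It is only a minimal sketch, not a finished spider: the sub names (extract_links, fetch_with_lwp, crawl), the page limit, and the in-memory %pages "web" are all made up for illustration so the example runs offline. The link pattern is the one given above, with both capture groups closed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Link pattern from the comment above: URL in $1, link title in $2.
my $LINK_RE = qr{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}i;

# Return a list of { url, title } hashes for every link the pattern finds.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /$LINK_RE/g) {
        push @links, { url => $1, title => $2 };
    }
    return @links;
}

# Hypothetical fetcher built on LWP::UserAgent. The module is loaded lazily,
# so the rest of this sketch runs even where LWP is not installed.
# Returns the page body, or undef if the request failed.
sub fetch_with_lwp {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;
}

# Breadth-first crawl: keep a queue of URLs to visit and a %seen set so each
# URL is queued only once. Stops after $max_pages successful downloads.
sub crawl {
    my ($start, $fetch, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (@queue && @visited < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;
        for my $link (extract_links($html)) {
            next if $seen{ $link->{url} }++;
            push @queue, $link->{url};
        }
    }
    return @visited;
}

# Stand-in for the real web (assumption: three tiny pages on example.com).
my %pages = (
    'http://example.com/'  => '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    'http://example.com/a' => '<a href="http://example.com/">Home</a>',
    'http://example.com/b' => 'no links here',
);

my @visited = crawl('http://example.com/', sub { $pages{ $_[0] } }, 10);
print "$_\n" for @visited;
```

To crawl real pages you would pass \&fetch_with_lwp instead of the stub. Note that this sketch only follows absolute URLs; relative links would need resolving against the page address first (URI->new_abs from the URI module is one way to do that).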

Adam314 commented:
If you want to be robust to anything in a page, don't use a regex to parse the HTML; use one of the HTML parsers.

Take a look at one of the spiders already built:
  http://search.cpan.org/~ashley/WWW-Spyder-0.19/Spyder.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/RobotUA.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/WWW/RobotRules.pm
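
As a rough idea of what the last module in that list is for, here is a small sketch of WWW::RobotRules checking URLs against a robots.txt file. The robot name 'MySpider/0.1', the example.com URLs, and the robots.txt content are all invented for the example; in a real spider the robots.txt text would be downloaded from the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name; use whatever your spider identifies itself as.
my $rules = WWW::RobotRules->new('MySpider/0.1');

# Made-up robots.txt content, supplied inline so the example runs offline.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Register the rules for this host.
$rules->parse('http://example.com/robots.txt', $robots_txt);

# Ask whether each URL may be fetched before downloading it.
for my $url ('http://example.com/index.html',
             'http://example.com/private/data.html') {
    printf "%-40s %s\n", $url, $rules->allowed($url) ? 'allowed' : 'blocked';
}
```

LWP::RobotUA (the second link) is a user agent that applies these rules for you, so a spider built on it should not need to call WWW::RobotRules directly.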

emblue commented:
For a simple spider with a specific purpose, regex parsing works very well. I've written a few custom-purpose spiders using that method.

If you want something more generic, there are modules such as:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
which can parse the HTML document completely and give you access to the tags and attributes.

Personally, I find these usually provide more features than are needed, and they add some complexity.
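
For comparison, here is a minimal sketch of reading meta tags (the original goal of the question) with HTML::Parser. The sub name extract_meta and the sample page are assumptions made for the example; the parser is set up with a start-tag handler that receives the tag name and a hash of its attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Collect <meta name="..." content="..."> (and http-equiv) pairs from a page.
# Returns a hash reference keyed by the lowercased meta name.
sub extract_meta {
    my ($html) = @_;
    my %meta;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag with the tag name and an attribute hash.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'meta';
                my $key = $attr->{name} // $attr->{'http-equiv'};
                return unless defined $key && defined $attr->{content};
                $meta{ lc $key } = $attr->{content};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return \%meta;
}

# Made-up sample page for illustration.
my $page = <<'END';
<html><head>
<title>Sample</title>
<meta name="Description" content="A test page">
<meta name="keywords" content="perl, spider">
</head><body>Hello</body></html>
END

my $meta = extract_meta($page);
print "$_: $meta->{$_}\n" for sort keys %$meta;
```

In a spider, $page would be the HTML downloaded for each visited URL rather than a literal string.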
Adam314 commented:
If you know what the HTML will look like, a regex will work.  But for generic HTML, a regex doesn't work so well.

Such as:
    <a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>
    <img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...
This isn't too common, so the regex works in many cases, but if you want it to be robust to many different things in general, a parser works better.
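
As an illustration of the parser route, the sketch below runs the two snippets above through HTML::LinkExtor (part of the HTML-Parser distribution) and collects the href of each <a> tag. The sub name anchor_hrefs is made up for the example, and the snippets are used as written, including the trailing "...".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return the href of every <a> tag in the given HTML.
sub anchor_hrefs {
    my ($html) = @_;
    my $extor = HTML::LinkExtor->new;   # no callback: links are stored internally
    $extor->parse($html);
    $extor->eof;
    my @hrefs;
    for my $link ($extor->links) {
        # Each entry is [ $tag, attribute => value, ... ].
        my ($tag, %attr) = @$link;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

my @samples = (
    '<a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>',
    '<img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...',
);

for my $html (@samples) {
    print "$_\n" for anchor_hrefs($html);
}
```

Because the parser tracks quoted attribute values, the "<" and ">" characters inside them should not be mistaken for tag boundaries.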
