Solved

Perl Spider

Posted on 2007-03-30
Last Modified: 2008-02-01
I would like to build a spider program in Perl.  I have read up on a few modules that will do this but I am confused about how to program using modules.

The end result I want is to be able to search the web and read meta tags.

What do I do?  Where do I start?
Question by:aboskoco
6 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 18834006
All the various Perl modules for spidering contain plenty of documentation and examples.

When you say "search the web", what parameters/limitations/starting point are you intending to use?

 
LVL 1

Expert Comment

by:mike_chase
ID: 18850803
http://www.thebananatree.org/vector_space/building_a_spider.html provides a good starting point on how to put together a web spider.  

 
LVL 4

Expert Comment

by:emblue
ID: 18920187
On CPAN, read the explanation of LWP because that's the best library for downloading pages.

Your spider will probably want to take a starting address and download the page with LWP, then parse the HTML and extract the info you want using regular expressions.

You can look for links with a pattern of the form <a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>
This gives the URL in $1 and the link title in $2.

You can then make a list of those URLs and visit each of them.
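
The steps above can be sketched roughly as follows. This is only an illustration, not a finished spider: the names crawl and lwp_fetch, the page limit, and the example.com pages are all made up for the example, and the link pattern only handles double-quoted absolute URLs. The fetcher is passed in as a code reference so the loop can be tried offline against an in-memory "site"; lwp_fetch shows how LWP::UserAgent could be plugged in instead.

```perl
use strict;
use warnings;

# Hypothetical helper: breadth-first crawl starting at $start.
# $fetch is a code ref that takes a URL and returns the page HTML (or undef).
# $max caps the number of pages visited so the loop always terminates.
sub crawl {
    my ($start, $fetch, $max) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (@queue && @visited < $max) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;
        # Pull out double-quoted href values; $1 is the URL.
        while ($html =~ m{<a[^>]*href="([^"]*)"}gi) {
            my $link = $1;
            next unless $link =~ m{^https?://}i;    # this sketch skips relative links
            push @queue, $link unless $seen{$link}++;
        }
    }
    return @visited;
}

# A fetcher backed by LWP::UserAgent, loaded only when it is actually called.
my $ua;
sub lwp_fetch {
    my ($url) = @_;
    require LWP::UserAgent;
    $ua ||= LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;
}

# Offline demonstration with an in-memory "site" (made-up URLs and pages).
my %pages = (
    'http://example.com/'  => '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    'http://example.com/a' => '<a href="http://example.com/">home</a>',
    'http://example.com/b' => '',
);
print "$_\n" for crawl('http://example.com/', sub { $pages{ $_[0] } }, 10);
```

To run it against a real site, pass \&lwp_fetch as the second argument in place of the anonymous sub, and keep the page limit small while testing.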
 
LVL 39

Accepted Solution

by:Adam314 (earned 1500 total points)
ID: 18920362
If you want to be robust to anything that might appear in a page, don't use a regex to parse the HTML; use one of the HTML parsers.

Take a look at one of the spiders already built:
  http://search.cpan.org/~ashley/WWW-Spyder-0.19/Spyder.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/RobotUA.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/WWW/RobotRules.pm
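
As a rough idea of what the last of these modules does, here is a small sketch that feeds WWW::RobotRules a robots.txt and asks whether two URLs may be fetched. The robot name 'MySpider/0.1', the example.com URLs, and the robots.txt text are made up for the example; a real spider would download robots.txt from the site first, which LWP::RobotUA is designed to handle for you.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# 'MySpider/0.1' is a made-up robot name for this sketch.
my $rules = WWW::RobotRules->new('MySpider/0.1');

# In a real spider this text would be downloaded from the site's /robots.txt.
my $robots_txt = "User-agent: *\nDisallow: /private/\n";
$rules->parse('http://example.com/robots.txt', $robots_txt);

# Ask whether each URL may be fetched under the rules parsed above.
for my $url ('http://example.com/index.html',
             'http://example.com/private/notes.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url: $verdict\n";
}
```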
 
LVL 4

Expert Comment

by:emblue
ID: 18920670
For a simple spider built for a specific purpose, regex parsing works very well; I've written a few custom-purpose spiders using that method.

If you want something more generic, there are modules such as:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
which can parse the HTML document completely and give you access to the tags and attributes.

Personally, I find these usually provide more features than are needed, which adds some complexity.
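
Since the original question was about reading meta tags, here is a minimal sketch of how HTML::Parser might be used for that. The helper name extract_meta and the sample page are made up for the example, and only meta tags that have both a name and a content attribute are collected.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect <meta name="..." content="..."> pairs from a page.
sub extract_meta {
    my ($html) = @_;
    my %meta;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag; tag and attribute names arrive lowercased.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'meta';
                return unless defined $attr->{name} && defined $attr->{content};
                $meta{ lc $attr->{name} } = $attr->{content};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return %meta;
}

# Made-up sample page; note the mixed case and the '>' inside an attribute.
my $page = <<'HTML';
<html><head>
<title>Sample</title>
<META NAME="Description" CONTENT="A made-up page about spiders">
<meta name="keywords" content="perl, spider, a > b">
</head><body><p>Hello</p></body></html>
HTML

my %meta = extract_meta($page);
print "$_: $meta{$_}\n" for sort keys %meta;
```

The parser takes care of quoting and case, so the callback only has to decide which tags it is interested in.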
 
LVL 39

Expert Comment

by:Adam314
ID: 18920851
If you know what the HTML will look like, a regex will work.  But for generic HTML, a regex doesn't work so well.

Such as:
    <a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>
    <img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...
This isn't too common, so the regex works in many cases... but if you want the spider to cope with whatever HTML it runs into, a parser works better.
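
As an illustration of the parser approach on the two snippets above, here is a short sketch using HTML::LinkExtor (a link extractor built on HTML::Parser). The helper name anchor_hrefs is made up for the example, and the snippets are used exactly as written, including the trailing "...".

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @samples = (
    '<a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>',
    '<img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...',
);

# Hypothetical helper: return the href of every <a> tag the parser finds.
sub anchor_hrefs {
    my ($html) = @_;
    my $extor = HTML::LinkExtor->new;    # no callback: links are accumulated
    $extor->parse($html);
    $extor->eof;
    my @hrefs;
    for my $link ($extor->links) {
        # Each entry is [ tag, attribute => url, ... ].
        my ($tag, %attr) = @$link;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

print join(', ', anchor_hrefs($_)), "\n" for @samples;
```

The img tag's src also shows up in the list returned by links, which is why the helper filters on the tag name.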
