Perl Spider

I would like to build a spider program in Perl.  I have read up on a few modules that will do this but I am confused about how to program using modules.

The end result is that I want to be able to search the web and read meta tags.

What do I do?  Where do I start?

Asked by: aboskoco

Tintin commented:
All the various Perl modules for spidering contain plenty of documentation and examples.

When you say "search the web", what parameters/limitations/starting point are you intending to use?

mike_chase commented:
http://www.thebananatree.org/vector_space/building_a_spider.html provides a good starting point on how to put together a web spider.  

emblue commented:
On CPAN, read the explanation of LWP because that's the best library for downloading pages.

Your spider will probably want to take a starting address and download that page with LWP, then parse the HTML and extract the info you want using regular expressions.

You can look for links of the form <a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>
This gives the URL in $1 and the link text in $2.

You can then make a list of those URLs and visit each of them.
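
The steps described above might be sketched roughly as follows. This is only an illustration, not emblue's actual code: the sub names (extract_links, fetch_page, crawl), the page limit, and the /gi flags are assumptions, and the regex is the one quoted above with its first capture group closed. Relative links are skipped to keep the sketch short.

```perl
use strict;
use warnings;

# Return [url, text] pairs found by the link regex from the comment above.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}gi) {
        push @links, [$1, $2];
    }
    return @links;
}

# Download one page with LWP; returns the body, or undef on failure.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded lazily so extract_links works without LWP
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Breadth-first crawl from $start, attempting at most $max_pages pages
# (a hypothetical limit so the sketch cannot run forever).
sub crawl {
    my ($start, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;
    while (@queue && @visited < $max_pages) {
        my $url = shift @queue;
        push @visited, $url;
        my $html = fetch_page($url);
        next unless defined $html;
        for my $link (extract_links($html)) {
            my $next = $link->[0];
            next unless $next =~ m{^https?://}i;    # absolute URLs only
            push @queue, $next unless $seen{$next}++;
        }
    }
    return @visited;
}
```

Calling crawl('http://www.example.com/', 20) would return the list of URLs it attempted. A real spider should also honour robots.txt and pause between requests.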
 
Adam314 commented:
If you want to be robust to anything that can appear in a page, don't parse the HTML with a regex; use one of the HTML parsers.

Take a look at one of the spiders already built:
  http://search.cpan.org/~ashley/WWW-Spyder-0.19/Spyder.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/RobotUA.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/WWW/RobotRules.pm

emblue commented:
For a simple spider built for a specific purpose, regex parsing works very well; I've written a few custom-purpose spiders using that method.

If you want something more generic, there are modules such as:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
which can parse the HTML document completely and give you access to the tags and attributes.

Personally, I find these usually provide more features than are needed, and they add some complexity.
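
For comparison, here is a minimal sketch of reading meta tags and link targets with HTML::Parser, which is what the original question was after. The sub name parse_page and the shape of its return value are assumptions made for illustration; the start_h handler and the "tagname, attr" argspec are part of the module's version 3 API.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect <meta name=... content=...> pairs and <a href=...> targets
# from an HTML string. Returns (\%meta, \@links).
sub parse_page {
    my ($html) = @_;
    my (%meta, @links);
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                # Tag and attribute names arrive lowercased by default.
                my ($tag, $attr) = @_;
                if ($tag eq 'meta' && defined $attr->{name}) {
                    $meta{ lc $attr->{name} } = $attr->{content} // '';
                }
                elsif ($tag eq 'a' && defined $attr->{href}) {
                    push @links, $attr->{href};
                }
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return (\%meta, \@links);
}
```

After my ($meta, $links) = parse_page($html), the description tag, if the page has one, is in $meta->{description}.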

Adam314 commented:
If you know what the HTML will look like, a regex will work.  But for generic HTML, a regex doesn't work so well.

For example:
    <a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>
    <img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...
This isn't too common, so the regex works in many cases. But if you want it to be robust to many different things in general, a parser works better.
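
To make the trade-off concrete, the following sketch runs the link regex from earlier in the thread (with its first capture group closed) against a few inputs it mishandles. These inputs are hypothetical ones chosen for illustration, not the examples above, and the names $LINK_RE, regex_links, and @cases are likewise assumptions.

```perl
use strict;
use warnings;

# Link regex quoted earlier in the thread, with its capture group closed.
my $LINK_RE = qr{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>};

# Return the URLs the regex finds in an HTML fragment.
sub regex_links {
    my ($html) = @_;
    my @urls;
    while ($html =~ /$LINK_RE/g) {
        push @urls, $1;
    }
    return @urls;
}

# Hypothetical fragments: three links the regex misses, one it should not find.
my @cases = (
    [ 'single-quoted href'  => q{<a href='single.html'>one</a>} ],
    [ 'markup in link text' => q{<a href="bold.html"><b>bold</b> text</a>} ],
    [ '">" in an attribute' => q{<a title="a > b" href="gt.html">x</a>} ],
    [ 'commented-out link'  => q{<!-- <a href="old.html">old</a> -->} ],
);

for my $case (@cases) {
    my ($label, $html) = @$case;
    my @found = regex_links($html);
    printf "%-22s found: %s\n", $label, @found ? join(', ', @found) : '(nothing)';
}
```

The first three fragments come back empty even though each contains a valid link, and the fourth reports old.html even though that link is inside a comment. Whether any of this matters depends on the pages you plan to crawl.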
