Solved

Perl Spider

Posted on 2007-03-30
Last Modified: 2008-02-01
I would like to build a spider program in Perl.  I have read up on a few modules that will do this but I am confused about how to program using modules.

The end result is that I wish to be able to search the web and read meta tags.

What do I do?  Where do I start?
Question by:aboskoco
6 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 18834006
All the various Perl modules for spidering contain plenty of documentation and examples.

When you say "search the web", what parameters/limitations/starting point are you intending to use?

0
 
LVL 1

Expert Comment

by:mike_chase
ID: 18850803
http://www.thebananatree.org/vector_space/building_a_spider.html provides a good starting point on how to put together a web spider.  

0
 
LVL 4

Expert Comment

by:emblue
ID: 18920187
On CPAN, read the explanation of LWP because that's the best library for downloading pages.

Your spider will probably want to take a starting address and download that page with LWP, then parse the HTML and extract the info you want using regular expressions.

You can look for links of the form <a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>
This gives the URL in $1 and the link text in $2.
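
A minimal sketch of that approach, with the first capture group closed right after the URL as `([^"]*)`. The names `extract_links` and `fetch_page` and the agent string `ExampleSpider/0.1` are made up for illustration. `fetch_page` loads LWP::UserAgent lazily so the link extraction can be tried on a plain string without network access.

```perl
use strict;
use warnings;

# Link pattern: URL in capture group 1, link text in capture group 2.
my $LINK_RE = qr{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}i;

# Return a list of { url => ..., text => ... } hashes found in an HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /$LINK_RE/g) {
        push @links, { url => $1, text => $2 };
    }
    return @links;
}

# Download a page with LWP::UserAgent; returns the decoded body or undef.
# (Hypothetical helper, not called below so the sketch runs offline.)
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'ExampleSpider/0.1');
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Try the pattern on a small in-memory sample.
my $sample = '<p><a href="http://example.com/a.html">First</a> '
           . '<a class="x" href="/b.html">Second</a></p>';
for my $link (extract_links($sample)) {
    print "$link->{url} => $link->{text}\n";
}
```

To run it against a live page, pass the result of `fetch_page($start_url)` to `extract_links` after checking that it is defined.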

You can then make a list of those URLs and visit each of them.
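
The "make a list and visit each" step could be sketched as a queue plus a set of URLs already seen, so no page is fetched twice. Everything here is an assumption for illustration: the `crawl` function, the `$max_pages` cap, and the two code references (`$fetch` returns HTML for a URL, `$extract` returns the URLs found in that HTML). The demo uses an in-memory hash instead of the real web, and it only follows absolute URLs.

```perl
use strict;
use warnings;

# Breadth-first crawl starting at $start.
#   $fetch   - code ref: URL  -> HTML string, or undef on failure
#   $extract - code ref: HTML -> list of URLs
#   $max_pages - stop after this many pages have been visited
sub crawl {
    my ($start, $fetch, $extract, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);    # URLs already queued
    my @visited;                  # URLs fetched successfully, in order

    while (@queue && @visited < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;

        for my $next ($extract->($html)) {
            next if $seen{$next}++;    # skip anything queued before
            push @queue, $next;
        }
    }
    return @visited;
}

# A tiny in-memory "web" standing in for real downloads.
my %pages = (
    'http://example.com/'  => '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    'http://example.com/a' => '<a href="http://example.com/">home</a>',
    'http://example.com/b' => '<a href="http://example.com/a">A</a>',
);

my @order = crawl(
    'http://example.com/',
    sub { $pages{ $_[0] } },                 # fetch from the hash
    sub { $_[0] =~ /href="([^"]*)"/g },      # all href values in the page
    10,
);
print "$_\n" for @order;
```

In a real spider the fetch callback would download the page (for example with LWP), and relative links would need to be resolved against the page URL before being queued.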
0

 
LVL 39

Accepted Solution

by:
Adam314 earned 500 total points
ID: 18920362
If you want to be robust to anything in a page, you wouldn't use a regex to parse the HTML; you would use one of the HTML parsers.

Take a look at one of the spiders already built, and at the modules for honoring robots.txt:
  http://search.cpan.org/~ashley/WWW-Spyder-0.19/Spyder.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/RobotUA.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/WWW/RobotRules.pm
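
As a rough sketch of what WWW::RobotRules is for: it parses a robots.txt file and then answers whether a given URL may be fetched. The robots.txt content, the agent name `ExampleSpider/0.1`, and the helper `may_fetch` below are invented for illustration, and the file is parsed from a string so the example needs no network access.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical helper: only fetch when the rules clearly say yes.
# Anything other than a return value of 1 is treated as "do not fetch".
sub may_fetch {
    my ($rules, $url) = @_;
    return $rules->allowed($url) == 1 ? 1 : 0;
}

# The robot name is matched against User-agent lines in robots.txt.
my $rules = WWW::RobotRules->new('ExampleSpider/0.1');

# A made-up robots.txt; a real spider would download this from the site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

$rules->parse('http://example.com/robots.txt', $robots_txt);

for my $url ('http://example.com/index.html',
             'http://example.com/private/data.html') {
    printf "%-40s %s\n", $url, may_fetch($rules, $url) ? 'allowed' : 'blocked';
}
```

LWP::RobotUA wraps the same idea into a user agent that consults robots.txt before each request, which may be more convenient than checking by hand.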
0
 
LVL 4

Expert Comment

by:emblue
ID: 18920670
For a simple spider built for a specific purpose, regex parsing works very well. I've written a few custom-purpose spiders using that method.

If you want something more generic, there are modules such as:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
which can parse the HTML document completely and provide you with access to the tags and attributes.

Personally, I find these usually provide more features than are needed, and they add some complexity.
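
Since the original goal was reading meta tags, here is a minimal sketch of how HTML::Parser might be used for that. The function name `extract_meta` and the sample page are assumptions for illustration; the parser calls (`new` with `api_version => 3` and a `start_h` handler, then `parse` and `eof`) are the event-driven interface the module documents.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return a hash of meta tag name => content found in an HTML string.
# Keys are lowercased so "Description" and "description" are the same.
sub extract_meta {
    my ($html) = @_;
    my %meta;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag with its name and attribute hash.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'meta';
                my $key = $attr->{name} // $attr->{'http-equiv'};
                return unless defined $key && defined $attr->{content};
                $meta{ lc $key } = $attr->{content};
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return %meta;
}

# Made-up sample page.
my $page = <<'END';
<html><head>
<title>Sample</title>
<META NAME="Description" CONTENT="A sample page">
<meta name="keywords" content="perl, spider">
</head><body></body></html>
END

my %meta = extract_meta($page);
print "$_: $meta{$_}\n" for sort keys %meta;
```

The handler only looks at start tags, so the rest of the document is parsed but ignored.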
0
 
LVL 39

Expert Comment

by:Adam314
ID: 18920851
If you know what the HTML will look like, a regex will work.  But for generic HTML, a regex doesn't work so well.

Such as:
    <a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>
    <img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...
This isn't too common, so the regex works in many cases, but if you want to be robust to many different things in general, a parser works better.
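
To make the trade-off concrete, here is a small comparison sketch. The two sample links are my own (not from the examples above): one uses single quotes around the href value, the other has a tag nested inside the link text. The regex is the double-quote link pattern discussed earlier in the thread, and `extract_hrefs` is a hypothetical helper built on HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the href of every <a> start tag using a real HTML parser.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                push @hrefs, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# The regex approach: expects double quotes and plain text inside the link.
my $pattern = qr{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}i;

my @samples = (
    q{<a href='single.html'>single quotes</a>},
    q{<a href="bold.html"><b>nested tag</b></a>},
);

for my $html (@samples) {
    my $by_regex  = $html =~ $pattern ? $1 : '(no match)';
    my $by_parser = join ', ', extract_hrefs($html);
    print "regex: $by_regex | parser: $by_parser\n";
}
```

For pages you control or know well, the regex can be adjusted to fit; the parser is the safer default when the markup is unknown.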

0


Question has a verified solution.
