Solved

Perl Spider

Posted on 2007-03-30
Last Modified: 2008-02-01
I would like to build a spider program in Perl.  I have read up on a few modules that will do this but I am confused about how to program using modules.

The end result I want is to be able to search the web and read meta tags.

What do I do?  Where do I start?
Question by:aboskoco
6 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 18834006
All the various Perl modules for spidering contain plenty of documentation and examples.

When you say "search the web", what parameters/limitations/starting point are you intending to use?

 
LVL 1

Expert Comment

by:mike_chase
ID: 18850803
http://www.thebananatree.org/vector_space/building_a_spider.html provides a good starting point on how to put together a web spider.  

 
LVL 4

Expert Comment

by:emblue
ID: 18920187
On CPAN, read the explanation of LWP because that's the best library for downloading pages.

What your spider will probably want to do is take a starting address and download the page with LWP, then parse the HTML and extract the info you want using regular expressions.

You can look for links of the form <a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>
This gives the URL in $1, and link title in $2.

You can then make a list of those URLs and visit each of them.
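
The steps described in this comment can be sketched roughly as follows. This is only a minimal illustration, not a finished spider: the sub names (extract_links, fetch_page, crawl), the agent string, and the sample HTML are all made up for the example, and the link pattern is the one from this comment written with both capture groups closed. It assumes LWP::UserAgent (from libwww-perl) is installed for the download step; the module is loaded lazily so the link extraction can be tried offline.

```perl
use strict;
use warnings;

# Pull (url, title) pairs out of an HTML string with the link regex.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}gi) {
        push @links, { url => $1, title => $2 };
    }
    return @links;
}

# Download one page. Assumes LWP::UserAgent is installed; it is loaded
# only when this sub is called, so the rest of the sketch runs without it.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(agent => 'example-spider/0.1', timeout => 10);
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Breadth-first crawl from a start URL, visiting at most $max_pages pages.
sub crawl {
    my ($start, $max_pages) = @_;
    my @queue   = ($start);
    my %seen    = ($start => 1);
    my $visited = 0;
    while (@queue && $visited < $max_pages) {
        my $url  = shift @queue;
        my $html = fetch_page($url);
        $visited++;
        next unless defined $html;
        for my $link (extract_links($html)) {
            my $next = $link->{url};
            next unless $next =~ m{^https?://}i;   # this sketch skips relative links
            next if $seen{$next}++;                # never queue the same URL twice
            print "$url -> $next ($link->{title})\n";
            push @queue, $next;
        }
    }
    return $visited;
}

# Offline demonstration on a hand-written page (no network access needed).
my $sample = '<p><a href="http://example.com/a.html">First</a> and '
           . '<a class="x" href="http://example.com/b.html" id="y">Second</a></p>';
printf "%s => %s\n", $_->{url}, $_->{title} for extract_links($sample);
```

Calling crawl('http://example.com/', 10) would start an actual crawl limited to ten pages. A real spider would also need to resolve relative links (the URI module can do this) and respect robots.txt.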
 
LVL 39

Accepted Solution

by:
Adam314 earned 500 total points
ID: 18920362
If you want to be robust to anything in a page, you wouldn't use a regex to parse the HTML; you would use one of the HTML parsers.

Take a look at one of the spiders already built:
  http://search.cpan.org/~ashley/WWW-Spyder-0.19/Spyder.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/RobotUA.pm
  http://search.cpan.org/~gaas/libwww-perl-5.805/lib/WWW/RobotRules.pm
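
As a rough sketch of how the LWP::RobotUA module linked above might be used: it behaves like a normal LWP user agent but fetches and obeys each site's robots.txt and waits between requests to the same host. The sub names (make_robot, polite_fetch), the agent name, and the contact address below are placeholders made up for this example, and libwww-perl is assumed to be installed.

```perl
use strict;
use warnings;
use LWP::RobotUA;   # part of libwww-perl

# Build a user agent that checks robots.txt before fetching from a host.
# The agent name and contact address are placeholders.
sub make_robot {
    my $ua = LWP::RobotUA->new('example-spider/0.1', 'webmaster@example.com');
    $ua->delay(1 / 60);   # minimum delay between requests to one host, in minutes
    $ua->timeout(10);     # give up on a request after ten seconds
    return $ua;
}

# Fetch a page politely. Returns the HTML, or undef if the request failed
# or robots.txt does not allow it.
sub polite_fetch {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

my $robot = make_robot();
print "Agent: ", $robot->agent, "\n";
```

Nothing is downloaded until polite_fetch is called, for example polite_fetch($robot, 'http://example.com/').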
 
LVL 4

Expert Comment

by:emblue
ID: 18920670
For a simple spider built for a specific purpose, regex parsing works very well. I've written a few custom-purpose spiders using that method.

If you want something more generic, there are modules such as:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm
which can parse the HTML document completely and provide you with access to the tags and attributes.

Personally, I find these modules usually provide more features than are needed, and they add some complexity.
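
For comparison, reading meta tags with a parser might look something like the sketch below. It uses HTML::TokeParser, which ships in the same HTML-Parser distribution as the module linked above; the sub name read_meta_tags and the sample page are made up for illustration, and the distribution is assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # ships with the HTML-Parser distribution on CPAN

# Return a hash of meta name => content for one HTML document.
sub read_meta_tags {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my %meta;
    while (my $tag = $parser->get_tag('meta')) {
        my $attr = $tag->[1];   # hash ref of the tag's attributes
        next unless defined $attr->{name} && defined $attr->{content};
        $meta{ lc $attr->{name} } = $attr->{content};
    }
    return %meta;
}

# Hand-written sample page, so the sketch runs without network access.
my $sample = <<'HTML';
<html><head>
<title>Sample</title>
<meta name="description" content="A hand-written test page">
<meta name="Keywords" content="perl, spider">
</head><body><p>Hello</p></body></html>
HTML

my %meta = read_meta_tags($sample);
print "$_: $meta{$_}\n" for sort keys %meta;
```

The parser takes care of quoting styles and attribute order, so the code only has to look at the name and content attributes.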
 
LVL 39

Expert Comment

by:Adam314
ID: 18920851
If you know what the HTML will look like, a regex will work.  But for generic HTML, a regex doesn't work so well.

Such as:
    <a href="http://www.mysite.com/path/to/foo<bar>/page.html">link</a>
    <img alt="b<a" src="b_lessthan_a.gif"><a href="page.html">...
This isn't too common, so the regex works in many cases... but if you want it to be robust to many different things in general, a parser works better.
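
One quick way to see where a regex stops being reliable is to run it over a few small snippets. The sketch below uses the link pattern from the earlier comment (with both capture groups closed); the four snippets are hand-written for this illustration and not taken from any real site, and the sub name count_links is made up.

```perl
use strict;
use warnings;

# Link pattern from the earlier comment, with both capture groups closed.
my $link_re = qr{<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>}i;

# Count how many links the regex finds in one HTML string.
sub count_links {
    my ($html) = @_;
    my $count = 0;
    $count++ while $html =~ /$link_re/g;
    return $count;
}

# Hypothetical snippets, each containing what a browser treats as one link
# (or, for the comment, as no link at all).
my @samples = (
    [ 'double-quoted href' => '<a href="page.html">Home</a>' ],
    [ 'single-quoted href' => q{<a href='page.html'>Home</a>} ],
    [ 'markup inside link' => '<a href="page.html"><b>Home</b></a>' ],
    [ 'commented-out link' => '<!-- <a href="old.html">Old</a> -->' ],
);

for my $sample (@samples) {
    my ($label, $html) = @$sample;
    printf "%-20s %d link(s) found\n", $label, count_links($html);
}
```

The regex finds the double-quoted link, misses the single-quoted link and the link with markup inside its text, and reports the commented-out link even though a browser would ignore it.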
