Parse & Download HTML pages

I want to write a Perl script that does two things:

1. Parse a set of HTML pages in different directories recursively and extract all <A HREF=.....> </A> tags into separate text files while maintaining the directory structure. The text files have to reside in the same directories as the html pages. (Note that tags/links are to html pages).

2. Use the extracted tags from text files and download the html pages and save them to the respective directories.

I have managed to extract the tags, but only into a single file. The directory structure needs to be maintained. I would enter the home directory, and the script should then do the extraction and download recursively.

Appreciate your help.
ankhan100599 Asked:
 
smisk Commented:
# Capture the text between the quotes of the first HREF on the line.
my ($link) = $line =~ /HREF="(.*?)"/i;
print "link : $link\n";
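
As a slightly fuller sketch of the same idea, the snippet below collects every HREF value from a chunk of HTML rather than only the first one on a line. The names extract_links and page_of are hypothetical helpers made up for illustration, and the pattern assumes double-quoted attribute values like those in the examples in this thread; a real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Return every HREF value found inside <A ...> tags in the given text.
# Assumes double-quoted attribute values (an assumption, not a full HTML parser).
sub extract_links {
    my ($html) = @_;
    my @links = $html =~ /<A\s[^>]*?HREF="([^"]*)"/gi;
    return @links;
}

# Drop the "#fragment" part so only the page name remains.
sub page_of {
    my ($link) = @_;
    (my $page = $link) =~ s/#.*//;
    return $page;
}

my $line = '<A HREF="htmlsrpl.html#old">old="..."</A>';
for my $link (extract_links($line)) {
    print "link : $link\n";
    print "page : ", page_of($link), "\n";
}
```

Anchoring the match on the <A tag keeps text such as old="..." in the link label from being picked up, and each value ends up in an ordinary scalar that can be passed on as a variable.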
 
smisk Commented:
This may be off topic, but have you tried using the tool wget?
 
ankhan100599 Author Commented:
I don't think wget can work recursively through a directory structure, downloading HTML pages against the links in the text files. If it can, please let me know how.

Remember that I need to fire this off from a home directory. All downloaded HTML pages will be placed in the same folder as the text file containing the links.
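
The extraction half of that requirement could be sketched roughly as follows, using the core File::Find module. The function name extract_tree and the output file name links.txt are assumptions made for illustration, as is the double-quoted HREF pattern. The download step is left out because the links shown in this thread are relative and no base URL is given.

```perl
use strict;
use warnings;
use File::Find;

# Walk $root recursively. For every .html/.htm file, append its HREF values
# to links.txt (a hypothetical name) in the same directory as that file.
sub extract_tree {
    my ($root) = @_;
    find(sub {
        return unless -f $_ && /\.html?$/i;
        my $name = $File::Find::name;
        open my $in, '<', $_ or do { warn "cannot read $name: $!"; return };
        my $html = do { local $/; <$in> } // '';   # slurp the whole file
        close $in;
        my @links = $html =~ /HREF="([^"]*)"/gi;
        return unless @links;
        # find() has already chdir'ed into the file's directory,
        # so a relative file name is created next to the HTML page.
        open my $out, '>>', 'links.txt' or die "cannot write links.txt: $!";
        print {$out} "$_\n" for @links;
        close $out;
    }, $root);
}

# Usage: perl extract.pl /path/to/home/directory
extract_tree($ARGV[0]) if @ARGV;
```

Because the output is opened in append mode, rerunning the script adds duplicate entries; deleting the old links.txt files first avoids that.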

smisk Commented:
Ok, I thought you just wanted a simple spider...
 
jhurst Commented:
wget -r http://www.yahoo.com

This performs a recursive get (the -r option) of the website and preserves its directory structure. Note that wget limits recursion to five levels by default; the -l option changes that depth.
 
ankhan100599 Author Commented:
I have managed to extract the links. Can someone tell me how to extract the actual links (the portion between the quotes after HREF=) in the following examples:

<A HREF="htmlsrpl.html#bascomlopt">Basic command-line options:</A>
<A HREF="htmlsrpl.html#old">old="..."</A>
<A HREF="htmlsrpl.html#upcase">upcase=1</A>
<A HREF="htmlsrpl.html#new">new="..."</A>
<A HREF="htmlsrpl.html#old">old="..."</A>
<A HREF="htmlsrpl.html#intags">intags=1</A>
<A HREF="htmlqref.html#syntax">tags</A>
<A HREF="htmlsrpl.html#inclexcl">Element inclusion/exclusion command-line options:</A>
<A HREF="htmlsrpl.html#inside">inside=...</A>
<A HREF="htmlqref.html#syntax">elements</A>
<A HREF="htmlsrpl.html#outside">outside=...</A>
<A HREF="htmlsrpl.html#inmost">inmost=...</A>
<A HREF="htmlsrpl.html#inside">inside=</A>
<A HREF="htmlqref.html#inline">&lt;IMG&gt;</A>
<A HREF="htmlqref.html#br">&lt;BR&gt;</A>
<A HREF="htmlchek.html">htmlchek</A>

I need to pass each of these on as a variable.

Thanks for the responses.