Parsing HTML to get all link addresses on a page.

I am looking for a way to parse through HTML code and extract all the links in it. I am trying to make a Perl script that will retrieve all the files in a directory on a remote WWW server. I was told I need to use a socket, retrieve the directory listing as a web browser would (in HTML), and then parse it to get the links.

I also have a question: will this work for both HTTP and FTP servers?

If you have any suggestions as to how to go about getting a directory listing from a remote server, please let me know. It would be much appreciated.
capsite asked:
 
flivauda commented:
#!/usr/local/bin/perl5

use HTML::LinkExtor;
use LWP::Simple;

# Base URL comes from the first command-line argument.
$base_url = shift;

# Fetch the page and extract every link, resolved against the base URL.
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url))->eof;
@links = $parser->links;

# Each element of @links is [tag, attr1 => url1, attr2 => url2, ...];
# collect the unique URLs into %seen.
foreach $linkarray (@links)
{
  my @element = @$linkarray;
  my $elt_type = shift @element;   # the tag name (a, img, ...)
  while (@element)
  {
    my ($attr_name, $attr_value) = splice(@element, 0, 2);
    $seen{$attr_value}++;
  }
}

# Print the unique links in sorted order.
for (sort keys %seen)
{
 print $_, "\n";
}

------------------------------end of code

Then run it from the command line:
% findlinks.pl http://www.collegestudent.com

or whatever web page you wish to find links on.  It is also easy to extend it to grab all the links on a page, then grab all the links on those pages, and so on.
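
For example, a rough (untested) sketch of a crawl that goes one level deep might look like the following; the links_on subroutine and the starting URL are just placeholders for illustration:

#!/usr/local/bin/perl5
# Rough sketch: collect the links on a starting page, then fetch each
# of those pages and collect their links too (one level deep).

use HTML::LinkExtor;
use LWP::Simple;

my %seen;

# Return every link found on one page, resolved against its URL.
sub links_on {
    my ($url) = @_;
    my $page = get($url);
    return () unless defined $page;
    my $parser = HTML::LinkExtor->new(undef, $url);
    $parser->parse($page)->eof;
    my @found;
    foreach my $linkarray ($parser->links) {
        my ($tag, %attrs) = @$linkarray;
        push @found, values %attrs;
    }
    return @found;
}

my $start = "http://www.collegestudent.com";   # placeholder starting page
foreach my $link (links_on($start)) {
    next if $seen{$link}++;                    # skip links we already have
    # go one level deeper, keeping only http links
    $seen{$_}++ for grep { /^http:/ } links_on($link);
}

print "$_\n" for sort keys %seen;

------------------------------end of code

A real crawler would also want to stay on the same site and cap how many pages it fetches.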

Let me know if you need any more help to finish it out.
 
capsite (Author) commented:
Internal Server Error...

I do not have access to the UNIX command line, only FTP, and then I must check through the browser. I have never needed to do this before, so I am unfamiliar with some of these calls. What is LinkExtor? Do I need it for this procedure?

Thanks for responding, but I cannot determine what I need to do to make this work...

 
flivauda commented:
Okay, if you don't have access to the command line, then you will need to run it from a web page, and you will have to change some of it.

Change the line:
$base_url = shift;
to
$base_url = "http://www.mywebpage.com";

Then try it out.  The original was trying to read a command-line parameter, and that could be why it died.  Do you have any more information about the error?  You also need to make sure the program is executable (chmod +x filename.pl), but if you don't have command-line access you may not have control over the execute permissions, and you may have to ask your ISP to set them for you.

Try using this and see what happens:
#!/usr/local/bin/perl5

    use HTML::LinkExtor;
    use LWP::Simple;

    $base_url = "http://www.collegestudent.com";

    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse(get($base_url))->eof;
    @links = $parser->links;

    foreach $linkarray (@links)
    {
      my @element = @$linkarray;
      my $elt_type = shift @element;
      while (@element)
      {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
      }
    }

    for (sort keys %seen)
    {
     print $_, "\n";
    }

 
capsite (Author) commented:
Nope...

I really don't understand why; the script is chmod 777 and all I get is "Premature end of script headers" on a 500 error. I have "Content-Type: text/html\n\n" for a header. What is this LinkExtor thing? I have never heard of it before; could not having it cause the script to crash? Does my ISP need to install it? I would increase the points for you, but I'm all out... Sorry. I'll work on getting some more.
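
For reference, here is roughly how the top of the script looks with that header in place (just a sketch of my setup; the rest is the code posted above):

#!/usr/local/bin/perl5

# Under CGI this header line, plus the blank line from the second "\n",
# has to be printed before any other output; otherwise the server reports
# "Premature end of script headers" with a 500 error.
print "Content-Type: text/html\n\n";

use HTML::LinkExtor;
use LWP::Simple;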