Solved

Parsing HTML to get all link addresses on a page.

Posted on 1998-12-28
Last Modified: 2010-05-18
I am looking for a way to parse HTML and extract all the links it contains. I am trying to write a Perl script that will retrieve all the files in a directory on a remote WWW server. I was told I need to use a socket, retrieve the directory listing as a web browser would (as HTML), and then parse it to get the links.

I also have a question: will this work for both HTTP and FTP servers?

If you have any suggestions on how to get a directory listing from a remote server, please let me know. It would be much appreciated.
Question by:capsite
LVL 1

Accepted Solution

by:
flivauda earned 10 total points
ID: 1207178
#!/usr/local/bin/perl5

use HTML::LinkExtor;
use LWP::Simple;

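$base_url = shift or die "usage: findlinks.pl URL\n";

# Fetch the page; get() returns undef on failure.
$webpage = get($base_url);
die "could not fetch $base_url\n" unless defined $webpage;

# Passing a base URL makes LinkExtor return absolute links.
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($webpage)->eof;
@links = $parser->links;

# Each entry is [tag, attr_name => url, attr_name => url, ...].
foreach $linkarray (@links)
{
  my @element = @$linkarray;
  my $elt_type = shift @element;
  while (@element)
  {
    my ($attr_name, $attr_value) = splice(@element, 0, 2);
    $seen{$attr_value}++;   # hash keys remove duplicate URLs
  }
}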

for (sort keys %seen)
{
 print $_, "\n";
}

------------------------------end of code

then run it from the command line:
% findlinks.pl http://www.collegestudent.com

or whatever web page you wish to find links on. It is easy to extend the script to grab all the links on a page, then grab all the links on those pages, and so on.
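One way the "links on those pages" idea could look is sketched below. It is only an illustration, not part of the original answer: the helper names extract_links and crawl, the depth limit of 2, and the fetcher passed in as a code reference are all assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple qw(get);

# Return the unique link URLs found in $html, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my %found;
    for my $link ($parser->links) {
        my ($tag, %attrs) = @$link;
        $found{"$_"}++ for values %attrs;   # stringify the URI objects
    }
    return sort keys %found;
}

# Breadth-first walk from $start, fetching pages fewer than $max_depth hops away.
# $fetch is a code ref that takes a URL and returns HTML, or undef on failure.
sub crawl {
    my ($start, $max_depth, $fetch) = @_;
    my %seen  = ($start => 1);
    my @queue = ([$start, 0]);
    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $depth >= $max_depth;          # record the URL but do not fetch it
        my $html = $fetch->($url);
        next unless defined $html;
        for my $link (extract_links($html, $url)) {
            next unless $link =~ m{^https?://}i;   # skip mailto:, ftp:, etc.
            next if $seen{$link}++;                # already queued or visited
            push @queue, [$link, $depth + 1];
        }
    }
    return sort keys %seen;
}

# Crawl only when run directly with a URL argument.
if (!caller && @ARGV) {
    print "$_\n" for crawl($ARGV[0], 2, \&get);
}
```

Note that this sketch follows links to any host, so a real version would probably want to restrict itself to the starting server and keep the depth limit small.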

Let me know if you need any more help to finish it out

Author Comment

by:capsite
ID: 1207179
Internal Server Error...

I do not have access to the UNIX command line, only FTP, so I have to check the script through the browser. I have never needed to do this before, so I am unfamiliar with some of the calls. What is LinkExtor? Do I need it for this script to work?

Thanks for responding but I cannot determine what I need to do to make this work...

LVL 1

Expert Comment

by:flivauda
ID: 1207180
Okay, if you don't have access to the command line, then you will need to run it from a web page, and you will have to change some of it.

Change the line:
$base_url = shift;
to
$base_url = "http://www.mywebpage.com";

Then try it out. It was trying to read a command-line parameter, and that could be why it died. Do you have any more information about the error? You also need to make sure the program is executable (chmod +x filename.pl), but if you don't have command-line access you may not have control over the execute permissions, and you may have to ask your ISP to set them for you.
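To get more information about the error without shell access, something like the following diagnostic script might help. It is a sketch under assumptions, not a guaranteed fix: the interpreter path is simply copied from this thread and may be wrong on your server, and module_status is a hypothetical helper name. It prints the header before doing anything else, then tries to load each module at run time, so a missing module shows up as text in the browser instead of a 500 error.

```perl
#!/usr/local/bin/perl5
use strict;
use warnings;

# Try to load each named module; return a hash of module name => status text.
sub module_status {
    my %status;
    for my $module (@_) {
        if (eval "require $module; 1") {
            $status{$module} = 'installed';
        } else {
            my ($reason) = split /\n/, $@;   # keep only the first line of the error
            $status{$module} = "NOT available: $reason";
        }
    }
    return %status;
}

# Unbuffer output and send the header first, so the server always gets one.
$| = 1;
print "Content-type: text/plain\n\n";

my %status = module_status('HTML::LinkExtor', 'LWP::Simple');
print "$_: $status{$_}\n" for sort keys %status;
```

If even this script produces a 500 error, the problem is likely outside the Perl code itself, for example the interpreter path on the first line, the file permissions, or line endings changed by a binary-mode FTP upload.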

Try using this and see what happens:
#!/usr/local/bin/perl5

    use HTML::LinkExtor;
    use LWP::Simple;

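    $base_url = "http://www.collegestudent.com";

    # A CGI script must print a header and a blank line before any other output.
    print "Content-type: text/plain\n\n";

    # Fetch the page; get() returns undef on failure.
    $webpage = get($base_url);
    unless (defined $webpage) { print "Could not fetch $base_url\n"; exit; }

    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($webpage)->eof;
    @links = $parser->links;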

    foreach $linkarray (@links)
    {
      my @element = @$linkarray;
      my $elt_type = shift @element;
      while (@element)
      {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
      }
    }

    for (sort keys %seen)
    {
     print $_, "\n";
    }
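Since the goal is a directory listing, you may only care about the <a href> links and not images or other tags. A rough sketch of that filtering, using the callback form of HTML::LinkExtor, is below. The function name anchor_hrefs, the sample HTML, and the www.example.com URL are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return only the href targets of <a> tags, resolved against $base.
sub anchor_hrefs {
    my ($html, $base) = @_;
    my @hrefs;
    my $parser = HTML::LinkExtor->new(
        sub {
            # Called once per link-bearing tag: tag name, then attr => URL pairs.
            my ($tag, %attrs) = @_;
            push @hrefs, "$attrs{href}" if $tag eq 'a' && defined $attrs{href};
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

# Hypothetical listing fragment; a real one would come from get($url).
my $listing = '<img src="icon.gif"><a href="notes.txt">notes.txt</a> <a href="sub/">sub/</a>';
print "$_\n" for anchor_hrefs($listing, 'http://www.example.com/files/');
```

The image is skipped and the two anchor targets come back as absolute URLs, which can then be passed to get() or getstore() from LWP::Simple to retrieve each file.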


Author Comment

by:capsite
ID: 1207181
Nope...

I really don't understand why. The script is chmod 777, and all I get is "Premature end of script headers" on a 500 error. I have "Content-Type: text/html\n\n" for a header. What is this LinkExtor thing? I have never heard of it before. Could not having it cause the script to crash? Does my ISP need to install it? I would increase the points for you, but I'm all out... Sorry. I'll work on getting some more.