Solved

Parsing HTML to get all link addresses on a page.

Posted on 1998-12-28
Last Modified: 2010-05-18
I am looking for a way to parse HTML and extract all the links in it. I am trying to write a Perl script that will retrieve all the files in a directory on a remote WWW server. I was told I need to use a socket, retrieve the directory listing as a web browser would (in HTML), and then parse it to get the links.

Will this also work for both HTTP and FTP servers?

If you have any suggestions on how to get a directory listing from a remote server, please let me know. It would be much appreciated.
Question by:capsite

Accepted Solution

by:
flivauda earned 10 total points
ID: 1207178
#!/usr/local/bin/perl5

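use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple;

# The page to scan is given as the first command-line argument.
my $base_url = shift or die "usage: $0 URL\n";

# Fetch the page; get() returns undef on failure.
my $webpage = get($base_url);
defined $webpage or die "Could not fetch $base_url\n";

# Passing $base_url makes the parser return absolute URLs.
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($webpage)->eof;
my @links = $parser->links;

# Each entry is [tag, attr_name => attr_value, ...]; collect unique values.
my %seen;
foreach my $linkarray (@links)
{
  my @element = @$linkarray;
  my $elt_type = shift @element;
  while (@element)
  {
    my ($attr_name, $attr_value) = splice(@element, 0, 2);
    $seen{$attr_value}++;
  }
}

for (sort keys %seen)
{
  print $_, "\n";
}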

------------------------------end of code

then run it from the command line:
% findlinks.pl http://www.collegestudent.com

or whatever web page you wish to find links on. It is easy to extend it to grab all the links on a page, then grab all the links on those pages, and so on.

Let me know if you need any more help to finish it.
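
As a rough illustration of the "links on those pages" idea, here is a minimal sketch of a depth-limited, breadth-first crawl. The names crawl and fetch_links_http, the depth of 2, and the /usr/bin/perl path are assumptions made for this example and are not part of the script above. The link fetcher is passed in as a code reference so that the crawl loop itself does not depend on the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Breadth-first crawl, limited to $max_depth hops from $start_url.
# $fetch_links is a code ref (hypothetical interface) that takes a URL
# and returns the absolute links found on that page.
sub crawl {
    my ($start_url, $max_depth, $fetch_links) = @_;
    my %seen  = ($start_url => 1);
    my @queue = ([$start_url, 0]);

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $depth >= $max_depth;       # record this page but do not fetch it
        for my $link ($fetch_links->($url)) {
            next if $seen{$link}++;         # skip anything already queued
            push @queue, [$link, $depth + 1];
        }
    }
    return sort keys %seen;
}

# Fetcher built on LWP::Simple and HTML::LinkExtor; keeps only <a href="...">.
sub fetch_links_http {
    my ($url) = @_;
    require LWP::Simple;
    require HTML::LinkExtor;

    my $html = LWP::Simple::get($url);
    return () unless defined $html;         # fetch failed; treat as no links

    my $parser = HTML::LinkExtor->new(undef, $url);
    $parser->parse($html);
    $parser->eof;

    my @found;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        push @found, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }
    return @found;
}

# Run only when invoked directly with a URL argument.
if (!caller() && @ARGV) {
    print "$_\n" for crawl($ARGV[0], 2, \&fetch_links_http);
}
```

This sketch follows every href it finds, including links to other sites, mailto: addresses, and #fragment variants of the same page, so a real version would probably want to filter on the host name before queuing a link.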

Author Comment

by:capsite
ID: 1207179
Internal Server Error...

I do not have access to the UNIX command line, only FTP, so I have to test through the browser. I have never needed to do this before, so I am unfamiliar with some of the calls. What is LinkExtor? Do I need it for this to work?

Thanks for responding, but I cannot determine what I need to do to make this work...


Expert Comment

by:flivauda
ID: 1207180
Okay, if you don't have access to the command line then you will need to run it from a web page, and you will have to change some of it.

Change the line:
$base_url = shift;
to
$base_url = "http://www.mywebpage.com";

Then try it out. The script was trying to read a command-line parameter, and that could be why it died. Do you have any more information about the error? You also need to make sure the program is executable (chmod +x filename.pl), but if you don't have command-line access you may not have control over the execute permissions and may have to ask your ISP to set them for you.

Try using this and see what happens:
#!/usr/local/bin/perl5

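    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::Simple;

    # A CGI script must send a header before any other output.
    print "Content-type: text/plain\n\n";

    my $base_url = "http://www.collegestudent.com";

    # Fetch the page; get() returns undef on failure.
    my $webpage = get($base_url);
    defined $webpage or die "Could not fetch $base_url\n";

    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($webpage)->eof;
    my @links = $parser->links;

    # Each entry is [tag, attr_name => attr_value, ...]; collect unique values.
    my %seen;
    foreach my $linkarray (@links)
    {
      my @element = @$linkarray;
      my $elt_type = shift @element;
      while (@element)
      {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
      }
    }

    for (sort keys %seen)
    {
      print $_, "\n";
    }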


Author Comment

by:capsite
ID: 1207181
Nope...

I really don't understand why. The script is chmod 777 and all I get is "Premature end of script headers" on a 500 error. I print "Content-Type: text/html\n\n" as a header. What is this LinkExtor thing? I have never heard of it before. Could not having it cause the script to crash? Does my ISP need to install it? I would increase the points for you, but I'm all out... Sorry. I'll work on getting some more.
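
HTML::LinkExtor and LWP::Simple are add-on Perl modules, and a "use" line for a module that is not installed stops the script at compile time, before any header is printed, which the server would typically report as "Premature end of script headers". One way to check from the browser is a small diagnostic script along these lines. This is only a sketch: the helper name module_status and the /usr/bin/perl path are assumptions, and the interpreter path must match whatever the server actually uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try to load a module by name at run time and report the result.
# (module_status is a hypothetical helper name used for this example.)
sub module_status {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # HTML::LinkExtor -> HTML/LinkExtor.pm
    my $ok = eval { require $file; 1 };       # eval traps the failure instead of dying
    return $ok ? "$module: installed" : "$module: NOT available";
}

# Send the header first so the server always receives one,
# even if a module turns out to be missing.
$| = 1;
print "Content-type: text/plain\n\n";
print module_status($_), "\n" for qw(HTML::LinkExtor LWP::Simple);
```

If either line reports NOT available, the ISP would need to install the module. If the diagnostic itself still produces a 500 error, the cause is more likely the interpreter path on the first line, the file permissions, or the file having been uploaded over FTP in binary rather than ASCII mode.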