Solved

Parsing HTML to get all link addresses on a page.

Posted on 1998-12-28
Last Modified: 2010-05-18
I am looking for a way to parse HTML and extract all the links in it. I am trying to write a Perl script that will retrieve all the files in a directory on a remote WWW server. I was told I need to use a socket, retrieve the directory listing as a web browser would (in HTML), and then parse it to get the links.

I also have a question: will this work for both HTTP and FTP servers?

If you have any suggestions on how to get a directory listing from a remote server, please let me know. It would be much appreciated.
Question by:capsite
LVL 1

Accepted Solution

by:
flivauda earned 30 total points
ID: 1207178
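#!/usr/local/bin/perl5

use strict;
use HTML::LinkExtor;   # link parser from the HTML-Parser distribution
use LWP::Simple;       # provides get()

# The page to scan is given as the first command-line argument
my $base_url = shift @ARGV or die "Usage: $0 URL\n";

# Fetch the page; get() returns undef on failure
my $webpage = get($base_url);
die "Could not fetch $base_url\n" unless defined $webpage;

# Passing $base_url makes the parser return absolute URLs
my $parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($webpage)->eof;
my @links = $parser->links;

# Each entry is [tag, attr => url, attr => url, ...]
my %seen;
foreach my $linkarray (@links)
{
  my @element = @$linkarray;
  my $elt_type = shift @element;
  while (@element)
  {
    my ($attr_name, $attr_value) = splice(@element, 0, 2);
    $seen{$attr_value}++;
  }
}

# Print each distinct link once, sorted
for (sort keys %seen)
{
 print $_, "\n";
}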

------------------------------end of code

then run it from the command line:
% findlinks.pl http://www.collegestudent.com

or whatever web page you wish to find links on. It is easy to extend it to grab all the links on a page, then all the links on those pages, and so on.

Let me know if you need any more help to finish it out.
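
The extension mentioned above (grab the links on a page, then the links on those pages) could look roughly like the sketch below. It is not part of the accepted answer: crawl_links, its depth limit, and the in-memory %pages used for the demonstration are all hypothetical names made up here, and only <a href> links are followed.

```perl
#!/usr/local/bin/perl5
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the links reachable from $start, fetching pages up to $max_depth
# levels deep. $fetch is a code ref that takes a URL and returns HTML or
# undef; if it is omitted, LWP::Simple::get is used.
sub crawl_links {
    my ($start, $max_depth, $fetch) = @_;
    $fetch ||= do { require LWP::Simple; \&LWP::Simple::get };

    my %seen  = ($start => 1);
    my @queue = ([$start, 0]);

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $depth >= $max_depth;     # do not fetch beyond the limit

        my $html = $fetch->($url);
        next unless defined $html;        # skip pages that failed to load

        my $parser = HTML::LinkExtor->new(undef, $url);
        $parser->parse($html);
        $parser->eof;

        for my $link ($parser->links) {
            my ($tag, %attrs) = @$link;
            next unless $tag eq 'a' && defined $attrs{href};
            my $target = "$attrs{href}";  # stringify the URI object
            next if $seen{$target}++;     # already recorded
            push @queue, [$target, $depth + 1];
        }
    }
    return sort keys %seen;
}

# Demonstration with made-up pages, so nothing is fetched from the network.
my %pages = (
    'http://example.test/'  => '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.test/a' => '<a href="/c">C</a>',
);
print "$_\n" for crawl_links('http://example.test/', 2, sub { $pages{ $_[0] } });
```

To run it against a real site, call crawl_links with only the URL and the depth, and it falls back to LWP::Simple's get. Keep the depth small, since every extra level can multiply the number of requests.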
 

Author Comment

by:capsite
ID: 1207179
Internal Server Error...

I do not have access to the UNIX command line, only FTP, so I must check through the browser. I have never needed to do this before, so I am unfamiliar with some of the calls. What is LinkExtor? Do I need it for this?

Thanks for responding but I cannot determine what I need to do to make this work...

 
LVL 1

Expert Comment

by:flivauda
ID: 1207180
Okay, if you don't have access to the command line, then you will need to run it from a web page, and you will have to change some of it.

Change the line:
$base_url = shift;
to
$base_url = "http://www.mywebpage.com";

Then try it out. It was trying to read a command-line parameter, and that could be why it died. Do you have any more information about the error? You need to make sure the program is executable (chmod +x filename.pl), but if you don't have command-line access you may not have control over the execute permissions, and you may have to ask your ISP to set them for you.

Try using this and see what happens:
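#!/usr/local/bin/perl5

    use strict;
    use HTML::LinkExtor;   # link parser from the HTML-Parser distribution
    use LWP::Simple;       # provides get()

    # A CGI script must send a header before any other output
    print "Content-Type: text/plain\n\n";

    my $base_url = "http://www.collegestudent.com";

    # Fetch the page; get() returns undef on failure
    my $webpage = get($base_url);
    unless (defined $webpage)
    {
      print "Could not fetch $base_url\n";
      exit;
    }

    # Passing $base_url makes the parser return absolute URLs
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($webpage)->eof;
    my @links = $parser->links;

    # Each entry is [tag, attr => url, attr => url, ...]
    my %seen;
    foreach my $linkarray (@links)
    {
      my @element = @$linkarray;
      my $elt_type = shift @element;
      while (@element)
      {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
      }
    }

    # Print each distinct link once, sorted
    for (sort keys %seen)
    {
     print $_, "\n";
    }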
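
If the script still dies with a 500 error and there is no shell access to read the server log, one possible way to see the real message is to send the header first and run the actual work inside eval, printing whatever error comes back. This is only a sketch: run_and_report is a hypothetical helper name, and the die inside it stands in for the real link-extraction code.

```perl
#!/usr/local/bin/perl5
use strict;
use warnings;

# Run a code ref and return its error message, or an empty string on success.
sub run_and_report {
    my ($code) = @_;
    my $ok = eval { $code->(); 1 };
    return $ok ? '' : "Script failed: $@";
}

# Send the header before anything else so the server always receives one.
$| = 1;
print "Content-Type: text/plain\n\n";

# Placeholder for the real work; this one fails on purpose to show the report.
print run_and_report(sub {
    die "could not fetch the page\n";
});
```

Note that eval { } only traps errors raised while the script is running. A module pulled in with use is loaded at compile time, before the eval is reached, so a missing module would still produce a bare 500 error.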

 

Author Comment

by:capsite
ID: 1207181
Nope...

I really don't understand why. The script is chmod 777 and all I get is "Premature end of script headers" on a 500 error. I have "Content-Type: text/html\n\n" for a header. What is this LinkExtor thing? I have never heard of it before. Could not having it cause the script to crash? Does my ISP need to install it? I would increase the points for you, but I'm all out... Sorry. I'll work on getting some more.
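
On the question of whether the modules are installed: HTML::LinkExtor and LWP::Simple are add-on modules rather than part of core Perl, so a script that loads them will fail to compile on a server that lacks them. A small CGI check along the following lines could confirm this from the browser. It is a sketch only, and module_available is a hypothetical helper name, not something from the answers above.

```perl
#!/usr/local/bin/perl5
use strict;
use warnings;

# Return 1 if the named module can be loaded on this server, 0 otherwise.
sub module_available {
    my ($name) = @_;
    return 0 unless $name =~ /^\w+(?:::\w+)*$/;   # accept plain module names only
    return eval("require $name; 1") ? 1 : 0;
}

# Send the header first, then report on each module the script needs.
print "Content-Type: text/plain\n\n";
for my $module (qw(HTML::LinkExtor LWP::Simple)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'NOT installed';
}
```

If either line reports NOT installed, the ISP would need to install that module (or the script would need to bundle it) before the link extractor can run.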
