Solved

Parsing HTML to get all link addresses on a page.

Posted on 1998-12-28
Last Modified: 2010-05-18
I am looking for a way to parse HTML and extract all the links in it. I am trying to write a Perl script that retrieves all the files in a directory on a remote WWW server. I was told I need to use a socket, retrieve the directory listing as a web browser would (as HTML), and then parse it to get the links.

I also have a question: will this work for both HTTP and FTP servers?

If you have any suggestions on how to get a directory listing from a remote server, please let me know. It would be much appreciated.
Question by:capsite
 
LVL 1

Accepted Solution

by:
flivauda earned 10 total points
ID: 1207178
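#!/usr/local/bin/perl5

use HTML::LinkExtor;
use LWP::Simple;

# The page to scan is the first command-line argument.
$base_url = shift or die "usage: $0 URL\n";

# Fetch the page; get() returns undef on failure.
$webpage = get($base_url);
die "Could not fetch $base_url\n" unless defined $webpage;

# Passing $base_url makes the parser return absolute URLs.
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse($webpage)->eof;
@links = $parser->links;

# Each entry is [tag, attr_name => attr_value, ...]; record every URL once.
foreach $linkarray (@links)
{
  my @element = @$linkarray;
  my $elt_type = shift @element;
  while (@element)
  {
    my ($attr_name, $attr_value) = splice(@element, 0, 2);
    $seen{$attr_value}++;
  }
}

# Print the unique URLs in sorted order.
for (sort keys %seen)
{
 print $_, "\n";
}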

------------------------------end of code

Then run it from the command line:
% findlinks.pl http://www.collegestudent.com

or whatever web page you wish to find links on. It is easy to extend the script to grab all the links on a page, then all the links on those pages, and so on.

Let me know if you need any more help finishing it.
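One way the "links on a page, then links on those pages" idea could look is sketched below. This is not part of the original answer: the names page_links and crawl, the depth limit, and the optional $fetch hook (a code reference that takes a URL and returns HTML or undef) are all assumptions made for illustration. It only follows <a href> links and does no politeness checks such as robots.txt or rate limiting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple ();

# Return the absolute URLs of all <a href="..."> links found in $html.
sub page_links {
    my ($html, $base_url) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html);
    $parser->eof;
    my %found;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;          # [tag, attr => url, ...]
        next unless $tag eq 'a' && defined $attr{href};
        $found{"$attr{href}"} = 1;          # stringify the URI object
    }
    return sort keys %found;
}

# Breadth-first crawl from $start, fetching pages at depths 0 .. $max_depth-1.
# $fetch is a hypothetical hook (URL in, HTML or undef out); it defaults to
# LWP::Simple::get and exists mainly so the logic can be tried offline.
sub crawl {
    my ($start, $max_depth, $fetch) = @_;
    $fetch ||= \&LWP::Simple::get;
    my %seen  = ($start => 1);
    my @queue = ([$start, 0]);
    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $depth >= $max_depth;       # record the URL but do not fetch it
        my $html = $fetch->($url);
        next unless defined $html;          # skip pages that could not be fetched
        for my $link (page_links($html, $url)) {
            next if $seen{$link}++;         # already queued or visited
            push @queue, [$link, $depth + 1];
        }
    }
    return sort keys %seen;
}

# Command-line use: print every URL reachable within two levels of the start page.
if (@ARGV) {
    print "$_\n" for crawl($ARGV[0], 2);
}
```

The %seen hash plays the same role as in the script above: it keeps each URL from being printed, or fetched, more than once.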
 

Author Comment

by:capsite
ID: 1207179
Internal Server Error...

I do not have access to the UNIX command line, only FTP, so I must check the result through the browser. I have never needed to do this before, so I am unfamiliar with some of the calls. What is LinkExtor? Do I need it for this to work?

Thanks for responding but I cannot determine what I need to do to make this work...

 
LVL 1

Expert Comment

by:flivauda
ID: 1207180
Okay, if you don't have access to the command line, then you will need to run it from a web page, and you will have to change some of it.

Change the line:
$base_url = shift;
to
$base_url = "http://www.mywebpage.com";

Then try it out. The script was trying to read a command-line parameter, and that could be why it died. Do you have any more information about the error? You need to make sure the program is executable (chmod +x filename.pl), but if you don't have command-line access you may not have control over the execute permissions, and you may have to ask your ISP to set them for you.

Try using this and see what happens:
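#!/usr/local/bin/perl5

    use HTML::LinkExtor;
    use LWP::Simple;

    # When run through a web server, the header must be printed first.
    print "Content-Type: text/plain\n\n";

    $base_url = "http://www.collegestudent.com";

    # Fetch the page; get() returns undef on failure.
    $webpage = get($base_url);
    die "Could not fetch $base_url\n" unless defined $webpage;

    # Passing $base_url makes the parser return absolute URLs.
    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($webpage)->eof;
    @links = $parser->links;

    # Each entry is [tag, attr_name => attr_value, ...]; record every URL once.
    foreach $linkarray (@links)
    {
      my @element = @$linkarray;
      my $elt_type = shift @element;
      while (@element)
      {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
      }
    }

    # Print the unique URLs in sorted order.
    for (sort keys %seen)
    {
     print $_, "\n";
    }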

 

Author Comment

by:capsite
ID: 1207181
Nope...

I really don't understand why. The script is chmod 777, and all I get is a 500 error with "Premature end of script headers". I have "Content-Type: text/html\n\n" for a header. What is this LinkExtor thing? I have never heard of it before. Could not having it cause the script to crash? Does my ISP need to install it? I would increase the points for you, but I'm all out... Sorry. I'll work on getting some more.
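A missing module is one plausible cause: a `use` line for a module that is not installed makes Perl stop at compile time, before any header is printed, which a web server typically reports as "Premature end of script headers". A minimal sketch of a check that can be uploaded and opened in the browser follows. The helper name module_available and the /usr/bin/perl path are assumptions, not something from this thread; adjust the first line to wherever Perl lives on the server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try to load a module by name at run time.
# Returns 1 if it loads, 0 if it is missing or fails to compile.
sub module_available {
    my ($name) = @_;
    (my $file = "$name.pm") =~ s{::}{/}g;   # HTML::LinkExtor -> HTML/LinkExtor.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Print the header before anything else, so the report always reaches
# the browser even if a module turns out to be missing.
print "Content-Type: text/plain\n\n";

for my $module (qw(HTML::LinkExtor LWP::Simple)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? "installed" : "NOT installed";
}
```

If either module is reported as not installed, the ISP would need to install it (HTML::LinkExtor ships with the HTML-Parser distribution and LWP::Simple with libwww-perl) before the link-extraction script can run.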