atticus

asked on

How to grab and parse the contents of a URL?

I would like to be able to do in CGI what Jeffrey Friedl does in a command-line driven Perl script with webget.pl, namely, enter a URL into a form field, hit submit, and have the HTML text at that URL be parsed by my CGI script so I can do with the text whatever I want.  Where do I start?  Is there an easy way to do this, or will it be long and complex?
Avatar of atticus
atticus

ASKER

Edited text of question
If you have the command line driven script then it should be easy to modify. There are only two things you need to do.

You need to pick up the URL from a CGI variable rather than from a command-line argument.

All printed output should be changed to output an HTML document rather than just text.
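
Those two changes can be sketched roughly as follows. This is not taken from the webget script; the helper names target_url, html_escape and html_response are made up for illustration, and the parameter name URL is an assumption. The sketch uses only built-in Perl, no modules.

```perl
use strict;
use warnings;

# Hypothetical helper: return the target URL from the CGI query string
# (for example "URL=http%3A%2F%2Fwww.cnn.com%2F") or, failing that, from
# the first command-line argument.
sub target_url {
    my ($query, @args) = @_;
    if (defined $query && $query =~ /(?:^|&)URL=([^&]*)/) {
        my $url = $1;
        $url =~ tr/+/ /;                                     # '+' encodes a space
        $url =~ s/%([0-9a-fA-F]{2})/pack("C", hex($1))/eg;   # %xx -> byte
        return $url;
    }
    return $args[0];
}

# Hypothetical helper: escape text so it is safe to embed in HTML.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: build a complete CGI response, that is, the
# Content-type header, a blank line, and then the HTML document.
sub html_response {
    my ($title, $body_text) = @_;
    return "Content-type: text/html\n\n"
         . "<HTML><HEAD><TITLE>" . html_escape($title) . "</TITLE></HEAD>\n"
         . "<BODY><PRE>" . html_escape($body_text) . "</PRE></BODY></HTML>\n";
}
```

In the CGI script you would call target_url($ENV{'QUERY_STRING'}, @ARGV) where the command-line version reads its argument, and print html_response(...) where it prints plain text.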

If you were to post the script then I could show you how to modify it.
Avatar of atticus

ASKER

OK, I will hold off on grading (if I can--that was my first post).  Jeffrey Friedl's script is here:  http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

If you can show me every line that needs changing you'll get an A!  I know that I need to replace ARGV with a variable, but am not sure how to do it properly.  I can't seem to call this script from an eval or a system call.

Thanks.
Avatar of atticus

ASKER

Adjusted points to 205
Avatar of atticus

ASKER

I'm new to this.  I guess it really doesn't hurt "icd" if I reject the answer, right?  I want others to be able to answer.  His/hers was a good answer but incomplete.
What you need to do is quite simple:

1. Create a form that contains at least a field for the URL.  Let's say you name this field "inputurl".  The action of the form should point to your CGI script, like this: <FORM METHOD="POST" ACTION="http://your_cgi_domain/your_cgi_path/something.pl">

2. In the CGI script, read $ENV{'CONTENT_LENGTH'} to get the length of the CGI input, then read that many bytes from STDIN into a buffer.  The buffer will contain something like inputurl=what_ever_user_input&otherinputfield=other_value

3. Parse the buffer to get the user's input for the URL.  The name=value pairs are separated by "&", so Perl's split() does this job quite easily.  Then you need to URL-unescape each value (convert %xx back to its original character).

4. Now you have the URL and can do whatever processing you want.  When you output the result, format it using HTML tags.

I think you are more familiar with your script than I am, so I am only supplying this guide.  Do it once and you will be familiar with it; it is really simple.

By the way, a typical way to do step 3 is code like this (suppose $buffer contains the data read from STDIN).  At the end, $FORM{"inputurl"} will be the value you want.

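      # Split the urlencoded input into name=value pairs and decode each value.
      my %FORM;
      foreach my $pair (split(/&/, $buffer)) {
            my ($name, $value) = split(/=/, $pair, 2);
            $value = '' unless defined $value;                      # field sent with no value
            $value =~ tr/+/ /;                                      # '+' encodes a space
            $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;   # %xx -> byte (unsigned)
            $FORM{$name} = $value;
      }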
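
Step 2 above (filling $buffer from STDIN) could look roughly like this. The helper name read_post_body is made up for illustration. The loop is there because read() is allowed to return fewer bytes than requested.

```perl
use strict;
use warnings;

# Hypothetical helper: read exactly $length bytes of POST data from $fh.
# In a CGI script $fh would be \*STDIN and $length would be
# $ENV{'CONTENT_LENGTH'}.
sub read_post_body {
    my ($fh, $length) = @_;
    return '' unless defined $length && $length =~ /^\d+$/ && $length > 0;
    my $buffer = '';
    my $got    = 0;
    while ($got < $length) {
        # Append at offset $got so earlier chunks are kept.
        my $n = read($fh, $buffer, $length - $got, $got);
        last unless $n;    # undef on error, 0 at end of input
        $got += $n;
    }
    return $buffer;
}
```

In the script this would be called as read_post_body(\*STDIN, $ENV{'CONTENT_LENGTH'}), and the result handed to the split() loop as $buffer.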

Avatar of atticus

ASKER

This is a detailed answer but it misses the point.  I want to grab the entire contents of what's on that URL and be able to read what's on that page entirely into my script.

For example, if input:

http://www.mydomain.com/cgi-bin/getURL.cgi?URL=http://www.cnn.com

I should have the program know that on that page the words "CNN Headline News" appear in the <TITLE> tags and the like.

I'm not sure that your answer tells me how to do that.  I want to take the URL and do whatever I want with the HTML contents of that remote page.

If you can answer THAT I'll give you an A.

Thanks for responding, "faster" !

---atticus---
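
For reference, a minimal sketch of what is being asked for here: fetch the remote page and pull out the contents of its <TITLE> tags. This is not how the webget script does it. It assumes the core HTTP::Tiny module (shipped with Perl since 5.14), and the helper names fetch_html and extract_title are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Tiny;    # core module since Perl 5.14

# Hypothetical helper: fetch a page and return its HTML, or undef on failure.
sub fetch_html {
    my ($url) = @_;
    my $response = HTTP::Tiny->new(timeout => 15)->get($url);
    return $response->{success} ? $response->{content} : undef;
}

# Hypothetical helper: return the text inside the first <TITLE> element,
# or undef if there is none. A regex is enough for a sketch; a real HTML
# parser copes better with messy markup.
sub extract_title {
    my ($html) = @_;
    return undef unless defined $html;
    if ($html =~ m{<title[^>]*>(.*?)</title>}is) {
        my $title = $1;
        $title =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        return $title;
    }
    return undef;
}
```

With the submitted URL in $url, extract_title(fetch_html($url)) gives the page title, and the full text returned by fetch_html is there for any other processing.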
ASKER CERTIFIED SOLUTION
Avatar of faster
faster

This solution is only available to members.
Avatar of atticus

ASKER

This is an acceptable answer, but my second response gave the URL of a script that already does this, though only as a command-line driven script:

     http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

I wanted somebody to show me what I need to change in the above script to get it to work.  I know it's asking a lot, but that's why I assigned it so many points (all I had at the time!).

Subsequent answers still appreciated.  Thanks to all who have helped so far.
I didn't look at your script, but if it can already get the content of the URL, what other problem do you have?  The first half of my answer shows how to get the input URL from the form.  Once you have that, there should be little difference from getting it on the command line.  So where exactly is the problem?
Avatar of atticus

ASKER

I suppose you're right, "faster."  I was looking for the easy way out, i.e. someone to show me exactly what in the script I need to change from command line input/output to CGI.  But it's better I learn it on my own.

Shishir Gundavaram's "Mouse" book from O'Reilly has a good section on "Checking Hypertext (HTTP) Links" that explains it pretty well.  Together with your answer I should be able to build a script that works :)