How to grab and parse the contents of a URL?

Posted on 1997-07-15
Last Modified: 2013-12-25
I would like to do in CGI what Jeffrey Friedl does with a command-line Perl script: enter a URL into a form field, hit submit, and have my CGI script fetch and parse that URL's HTML text so I can do whatever I want with it.  Where do I start?  Is there an easy way to do this, or will it be long and complex?
Question by:atticus

Author Comment

ID: 1829091
Edited text of question

Expert Comment

ID: 1829092
If you already have the command-line script, it should be easy to modify. There are only two things you need to do.

You need to pick up the URL from a CGI variable rather than from a command-line argument.

All printed output should be changed to produce an HTML document rather than plain text.

If you were to post the script then I could show you how to modify it.
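
The second change above (printing an HTML document instead of plain text) might look roughly like the following. This is only a sketch, not taken from the script under discussion: the helper names `html_escape` and `html_response` are made up for illustration, and wrapping the output in `<PRE>` is just one simple way to preserve the script's text layout.

```perl
use strict;
use warnings;

# Escape the characters that have special meaning in HTML.
# (Hypothetical helper, not part of the original script.)
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a complete CGI response: the Content-type header, a blank
# line, then the document itself with the text inside <PRE>.
sub html_response {
    my ($title, $text) = @_;
    return "Content-type: text/html\r\n\r\n"
         . "<HTML><HEAD><TITLE>" . html_escape($title) . "</TITLE></HEAD>\n"
         . "<BODY><PRE>" . html_escape($text) . "</PRE></BODY></HTML>\n";
}
```

In the CGI version, the script would collect what it used to print into a string and finish with something like `print html_response("Result", $output);`.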

Author Comment

ID: 1829093
OK, I will hold off on grading (if I can--that was my first post).  Jeffrey Friedl's script is here:

If you can show me every line that needs changing you'll get an A!  I know that I need to substitute a variable for ARGV, but I am not sure how to do it properly.  I can't seem to call this script from an eval or a system call.


Author Comment

ID: 1829094
Adjusted points to 205

Author Comment

ID: 1829095
I'm new to this.  I guess it really doesn't hurt "icd" if I reject the question, right?  I want others to be able to answer.  His/hers was a good answer but incomplete.

Expert Comment

ID: 1829096
What you need to do is quite simple:

1. Create a form which contains at least a field for the URL.  Let's say you name this field "inputurl".  The action of this form should point to your CGI script, like this: <FORM METHOD=POST ACTION=http://your_cgi_domain/your_cgi_path/>

2. In the CGI script, read $ENV{'CONTENT_LENGTH'} to get the length of the CGI input, then read that number of bytes from stdin into a buffer.  This buffer will contain inputurl=what_ever_user_input&otherinputfield=other_value

3. Parse the buffer to get the user's input for the URL.  All the name=value pairs are separated by "&", so Perl's split() will do this job quite easily.  Then you need to do URL unescaping (convert each %xx back to its original character).

4. Now you have the URL and can do whatever processing you like.  But when you output the result, format it using HTML tags.

I think you are more familiar with your script, so I only supply this guide.  Anyway, do it once and you will be familiar with it; it is really simple.

By the way, a typical way to do step 3 is some code like this (suppose $buffer contains the data read from stdin).  At the end, $FORM{"inputurl"} will be the one you want.

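      # Split the buffer into name=value pairs and decode each value.
      @pairs = split(/&/, $buffer);
      foreach $pair (@pairs) {
            ($name, $value) = split(/=/, $pair, 2);
            $value =~ tr/+/ /;                                            # '+' encodes a space
            $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;  # %xx -> byte
            $FORM{$name} = $value;
      }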
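
Steps 2 and 3 can be tied together along these lines. Treat it as a sketch under assumptions rather than the one right way: the function names `url_decode`, `parse_form` and `read_post_body` are invented for illustration, and the form is assumed to be submitted with METHOD=POST using the default URL encoding.

```perl
use strict;
use warnings;

# Decode one URL-encoded component: '+' means space, %xx is a hex byte.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/eg;
    return $s;
}

# Split "name=value&name2=value2" into a hash of decoded fields.
sub parse_form {
    my ($buffer) = @_;
    my %form;
    for my $pair (split /&/, $buffer) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $form{ url_decode($name) } = url_decode(defined $value ? $value : '');
    }
    return %form;
}

# Read up to $length bytes of POST data from the given filehandle.
sub read_post_body {
    my ($fh, $length) = @_;
    my $buffer = '';
    read($fh, $buffer, $length) if $length && $length > 0;
    return $buffer;
}
```

Inside the CGI script this would be used roughly as `my %FORM = parse_form(read_post_body(\*STDIN, $ENV{CONTENT_LENGTH} || 0));`, after which `$FORM{inputurl}` holds the decoded URL.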


Author Comment

ID: 1829097
This is a detailed answer but it misses the point.  I want to grab the entire contents of what's on that URL and be able to read what's on that page entirely into my script.

For example, if input:

I should have the program know that on that page the words "CNN Headline News" appear in the <TITLE> tags and the like.

I'm not sure that your answer tells me how to do that.  I want to take the URL and do whatever I want with the HTML contents of that remote page.
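
To make the goal concrete: once the page's HTML is in a string, pulling out the `<TITLE>` text could be sketched as below. The function name `extract_title` is hypothetical, and a regular expression is only a rough tool for this; it assumes reasonably well-formed markup and returns the first title it finds.

```perl
use strict;
use warnings;

# Return the text between <TITLE> and </TITLE>, or undef if there is none.
sub extract_title {
    my ($html) = @_;
    # /i: tag case varies; /s: the title may span several lines.
    if ($html =~ m{<title[^>]*>(.*?)</title\s*>}is) {
        my $title = $1;
        $title =~ s/\s+/ /g;     # collapse runs of whitespace
        $title =~ s/^ | $//g;    # trim leading and trailing space
        return $title;
    }
    return;
}
```

Called on a page whose head contains `<TITLE>CNN Headline News</TITLE>`, this returns the string `CNN Headline News`.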

If you can answer THAT I'll give you an A.

Thanks for responding, "faster" !


Accepted Solution

faster earned 200 total points
ID: 1829098
If you need the "content" of the URL, then I only answered half of your question.  The remaining half is more difficult to implement, but it is possible (actually it is what the web worms are doing, right?)

Basically, your CGI has to act as a browser (more precisely, an HTTP client).  After you get the URL, you need to connect to the server that owns it.  Let's say the URL is http://www.somesite/somepath/1.html: you need to connect to the server www.somesite.  Of course, you will be using TCP, and the port number is 80 by default, or the one that appears in the URL.  After you successfully connect to the server, send it the following:

GET /somepath/1.html HTTP/1.0\r\n\r\n

Then you need to receive the response from the server; that is the "content" of the URL (even when it is actually an image or a Java class).

All the connecting, sending and receiving has to be coded using sockets.  I don't know whether there are existing scripts or software that can extract the content of a URL for you (it is not difficult if one is familiar with sockets and the HTTP protocol); maybe you can find one instead of writing the socket code yourself.  Anyway, if you really need to do it yourself, the steps I mentioned above are sufficient: create a socket, connect to the server, send the request and receive the response.
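
The socket steps just described could be sketched as follows with the core IO::Socket::INET module. This is an illustration under assumptions, not a robust client: the function names `split_url`, `build_request` and `fetch_url` are invented, only plain http:// URLs are handled, redirects and error status codes are ignored, and the Host header is an addition that the request shown above does not include (many servers that host several sites need it).

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Split an http:// URL into host, port and path (port defaults to 80).
sub split_url {
    my ($url) = @_;
    my ($host, $port, $path) = $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}i
        or return;
    return ($host, $port || 80, defined $path ? $path : '/');
}

# Build the HTTP/1.0 request text. The Host header is an assumption
# added here; it is not part of the request shown in the thread.
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
}

# Connect, send the request, read the whole response, return the body.
sub fetch_url {
    my ($url) = @_;
    my ($host, $port, $path) = split_url($url) or return;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 30,
    ) or return;
    print $sock build_request($host, $path);
    local $/;                      # slurp mode: read until the server closes
    my $response = <$sock>;
    close $sock;
    return unless defined $response;
    # Headers and body are separated by the first blank line.
    my ($headers, $body) = split /\r?\n\r?\n/, $response, 2;
    return $body;
}
```

A call such as `my $html = fetch_url($FORM{inputurl});` would then give the script the page text to parse. If the libwww-perl distribution happens to be installed, its LWP::Simple module offers a `get($url)` function that does the same job with far less code.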

Have fun.

Author Comment

ID: 1829099
This is an acceptable answer, but my second response gave the location of a script that already does this, although only when run from the command line:

I wanted somebody to show me what I need to change in the above script to get it to work.  I know it's asking a lot, but that's why I assigned it so many points (all I had at the time!).

Subsequent answers still appreciated.  Thanks to all who have helped so far.

Expert Comment

ID: 1829100
I didn't look at your script, but if it can already get the content of the URL, then what other problem do you have?  The first half of my answer shows how to get the input URL from the form; once you have it, there should be little difference from getting it on the command line.  So where exactly is the problem?

Author Comment

ID: 1829101
I suppose you're right, "faster."  I was looking for the easy way out, i.e. someone to show me exactly what in the script I need to change from command line input/output to CGI.  But it's better I learn it on my own.

Shishir Gundavaram's "Mouse" book from O'Reilly has a good section on "Checking Hypertext (HTTP) Links" that explains it pretty well.  Together with your answer I should be able to build a script that works :)

Question has a verified solution.