How to grab and parse the contents of a URL?

Posted on 1997-07-15
Last Modified: 2013-12-25
I would like to do in CGI what Jeffrey Friedl does with a command-line driven Perl script: namely, enter a URL into a form field, hit submit, and have the HTML text at that URL parsed by my CGI script so I can do whatever I want with it.  Where do I start?  Is there an easy way to do this, or will it be long and complex?
Question by:atticus

Author Comment

ID: 1829091
Edited text of question

Expert Comment

ID: 1829092
If you have the command-line driven script then it should be easy to modify. There are only two things you need to do.

You need to pick up the URL from a CGI variable rather than from a command-line argument.

All printed output should be changed to produce an HTML document rather than just text.

If you were to post the script then I could show you how to modify it.
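
The two changes can be sketched roughly as follows. This is only a minimal sketch, not a patch to the actual script (which is not shown here): the sub names and the form field name "url" are made up for illustration, and it assumes the form is submitted with GET so that the value arrives in QUERY_STRING.

```perl
use strict;
use warnings;

# Hypothetical helper: take the URL from the CGI query string when one is
# present, otherwise fall back to the first command-line argument.
sub get_target_url {
    my ($env, @args) = @_;
    if (defined $env->{QUERY_STRING}
        && $env->{QUERY_STRING} =~ /(?:^|&)url=([^&]*)/) {
        my $value = $1;
        $value =~ tr/+/ /;                              # "+" encodes a space
        $value =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/eg;  # undo %XX escapes
        return $value;
    }
    return $args[0];
}

# Hypothetical helper: wrap plain text in a minimal HTML document, preceded
# by the Content-type header that a CGI script has to print first.
sub html_page {
    my ($title, $text) = @_;
    for ($title, $text) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }
    return "Content-type: text/html\n\n"
         . "<HTML><HEAD><TITLE>$title</TITLE></HEAD>\n"
         . "<BODY><PRE>$text</PRE></BODY></HTML>\n";
}
```

In the script itself this would be used along the lines of `my $url = get_target_url(\%ENV, @ARGV);` followed by `print html_page("Result", $output);` in place of the plain print statements.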

Author Comment

ID: 1829093
OK, I will hold off on grading (if I can; that was my first post).  Jeffrey Friedl's script is here:

If you can show me every line that needs changing you'll get an A!  I know that I need to replace ARGV with a CGI variable, but I am not sure how to do it properly.  I can't seem to call this script from an eval or a system call.


Author Comment

ID: 1829094
Adjusted points to 205

Author Comment

ID: 1829095
I'm new to this.  I guess it really doesn't hurt "icd" if I reject the question, right?  I want others to be able to answer.  His/her answer was good but incomplete.

Expert Comment

ID: 1829096
What you need to do is quite simple:

1. Create a form which contains at least a field for the URL.  Let's say you name this field "inputurl".  The action of this form should point to your CGI script, like this: <FORM METHOD=POST ACTION="http://your_cgi_domain/your_cgi_path/">

2. In the CGI script, read $ENV{'CONTENT_LENGTH'} to get the length of the CGI input, then read that number of bytes from stdin into a buffer.  The buffer will contain inputurl=what_ever_user_input&otherinputfield=other_value

3. Parse the buffer to get the user's input for the URL.  The name=value pairs are separated by "&", so Perl's split() will do this job quite easily.  Then you need to unescape the value (convert each %xx back to its original character).

4. Now you have the URL and can do whatever processing you like.  When you output the result, format it using HTML tags.

I think you are more familiar with your script than I am, so I am only supplying this guide.  Do it once and you will be familiar with it; it is really simple.
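
Step 2 can be sketched roughly like this. It is a minimal sketch rather than a complete CGI script; the sub name read_post_body is made up for illustration, and the filehandle is passed in so the same code can be tried outside a web server.

```perl
use strict;
use warnings;

# Read the raw form data of a POST request: $length bytes from the given
# filehandle (STDIN in a real CGI script).
sub read_post_body {
    my ($fh, $length) = @_;
    my $buffer = '';
    $length = 0 unless defined $length && $length =~ /^\d+$/;
    my $got = 0;
    while ($got < $length) {
        # read() may return fewer bytes than asked for, so keep appending
        my $n = read($fh, $buffer, $length - $got, $got);
        last unless $n;          # EOF or error: stop with what we have
        $got += $n;
    }
    return $buffer;
}
```

In the CGI script the call would be something like `my $buffer = read_post_body(\*STDIN, $ENV{'CONTENT_LENGTH'});`.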

By the way, a typical way to do step 3 is code like this (suppose $buffer contains the data read from stdin); at the end, $FORM{"inputurl"} will hold the value you want.

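      # Split the form data into name=value pairs
      @pairs = split(/&/, $buffer);
      foreach $pair (@pairs) {
            ($name, $value) = split(/=/, $pair, 2);
            $value =~ tr/+/ /;      # "+" encodes a space
            # Convert each %XX escape back to its original character
            $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
            $FORM{$name} = $value;
      }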


Author Comment

ID: 1829097
This is a detailed answer but it misses the point.  I want to grab the entire contents of what's on that URL and be able to read what's on that page entirely into my script.

For example, if I input:

I should have the program know that on that page the words "CNN Headline News" appear in the <TITLE> tags and the like.

I'm not sure that your answer tells me how to do that.  I want to take the URL and do whatever I want with the HTML contents of that remote page.

If you can answer THAT I'll give you an A.

Thanks for responding, "faster" !


Accepted Solution

faster earned 200 total points
ID: 1829098
If you need the "content" of the URL, then I only answered half of your question.  The remaining half is more difficult to implement, but it is possible (it is actually what the web worms are doing, right?)

Basically, your CGI script has to act as a browser (more precisely, an HTTP client).  After you get the URL, you need to connect to the server named in it.  Say the URL is http://www.somesite/somepath/1.html: you need to connect to the server www.somesite using TCP, on port 80 by default or on the port number that appears in the URL.  After you successfully connect, send the server the following:

GET /somepath/1.html HTTP/1.0\r\n\r\n
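
Getting from the URL to that request can be sketched as below. This is a minimal sketch: the sub names are made up for illustration, and it only handles plain http:// URLs of the form host[:port][/path].

```perl
use strict;
use warnings;

# Split an http:// URL into host, port and path.
# Returns an empty list if the URL does not look like http://host[:port][/path].
sub split_http_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}i;
    my ($host, $port, $path) = ($1, $2, $3);
    $port = 80  unless defined $port;   # default HTTP port
    $path = '/' unless defined $path;   # a bare host means the root document
    return ($host, $port, $path);
}

# Build the request text that is sent to the server after connecting.
sub build_get_request {
    my ($path) = @_;
    return "GET $path HTTP/1.0\r\n\r\n";
}
```

For the example URL, `split_http_url("http://www.somesite/somepath/1.html")` gives the host www.somesite, port 80 and path /somepath/1.html.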

Then you need to receive the response from the server: a status line and headers, a blank line, and then the "content" of the URL (even when it is actually an image or a Java class).
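
Once the raw response is in a string, separating the headers from the page and pulling out something like the <TITLE> might look like this. It is a rough sketch with made-up sub names; a regular expression is not a real HTML parser, so treat the title match as a quick approximation.

```perl
use strict;
use warnings;

# Split a raw HTTP response into (status line, header block, body).
sub split_response {
    my ($raw) = @_;
    # Headers end at the first blank line (CRLF CRLF, or LF LF from lax servers).
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    my ($status, $headers) = split /\r?\n/, $head, 2;
    return ($status, defined $headers ? $headers : '', $body);
}

# Return the text between <TITLE> and </TITLE> in an HTML page, or undef
# if the page has no title element.
sub html_title {
    my ($html) = @_;
    return $1 if $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
    return;
}
```

With a page whose head contains <TITLE>CNN Headline News</TITLE>, calling html_title on the body returns the string "CNN Headline News".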

All the connecting, sending and receiving has to be coded using sockets.  I don't know whether there are existing scripts or software that can extract the content of a URL for you (it is not difficult if you are familiar with sockets and the HTTP protocol); maybe you can find one instead of writing the socket code yourself.  Anyway, if you really need to do it yourself, the steps I mentioned above are sufficient: create a socket, connect to the server, send the request and receive the response.
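
Those socket steps can be sketched with Perl's IO::Socket::INET module. This is a minimal sketch, not a full HTTP client: the sub name fetch_raw and the 15-second timeout are arbitrary choices, the Host header is an addition to the bare request shown above (servers hosting several sites on one address generally need it), and redirects and error statuses are not handled.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Connect to $host:$port, send a GET for $path, and return the raw response
# (status line, headers and body) as one string. Returns undef if the
# connection cannot be made.
sub fetch_raw {
    my ($host, $port, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 15,            # arbitrary choice for this sketch
    ) or return;

    # Send the request; the Host header is an assumption added here.
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";

    # An HTTP/1.0 server closes the connection when done, so read until EOF.
    my $response = '';
    my $chunk;
    while (read($sock, $chunk, 4096)) {
        $response .= $chunk;
    }
    close $sock;
    return $response;
}
```

A call such as `my $raw = fetch_raw("www.somesite", 80, "/somepath/1.html");` then returns everything the server sent. If the libwww-perl package is installed, its LWP::Simple module does the same job in one call, `get($url)`, which saves writing the socket code by hand.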

Have fun.

Author Comment

ID: 1829099
This is an acceptable answer, but my second response gave the URL of a script that already does this, though only from the command line:

I wanted somebody to show me what I need to change in the above script to get it to work.  I know it's asking a lot, but that's why I assigned it so many points (all I had at the time!).

Subsequent answers still appreciated.  Thanks to all who have helped so far.

Expert Comment

ID: 1829100
I didn't look at your script, but if it can already get the content of the URL, then what problem is left?  The first half of my answer shows how to get the input URL from the form; once you have it, there should be little difference from getting it on the command line.  So where exactly is the problem?

Author Comment

ID: 1829101
I suppose you're right, "faster."  I was looking for the easy way out, i.e. someone to show me exactly what in the script I need to change from command line input/output to CGI.  But it's better I learn it on my own.

Shishir Gundavaram's "Mouse" book from O'Reilly has a good section on "Checking Hypertext (HTTP) Links" that explains it pretty well.  Together with your answer I should be able to build a script that works :)
