Solved

How 2 grab and parse the contents of a URL?

Posted on 1997-07-15
Last Modified: 2013-12-25
I would like to be able to do in CGI what Jeffrey Friedl does in his command-line Perl script webget.pl: enter a URL into a form field, hit submit, and have my CGI script fetch and parse that URL's HTML so I can do whatever I want with the text. Where do I start? Is there an easy way to do this, or will it be long and complex?
Question by:atticus
11 Comments
 

Author Comment

by:atticus
ID: 1829091
Edited text of question
 
LVL 5

Expert Comment

by:icd
ID: 1829092
If you have the command-line script then it should be easy to modify. There are only two things you need to do.

You need to pick up the URL from a CGI variable rather than from a command-line argument.

All printed output should be changed so that the script emits an HTML document rather than just plain text.
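As a rough sketch of those two changes (this is not Friedl's code, which is not reproduced in this thread): the URL is taken from the QUERY_STRING environment variable instead of @ARGV, and the output starts with a Content-type header followed by HTML. The field name URL and the helper names url_from_query, html_escape and cgi_page are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull one field out of a GET query string such as
# "URL=http%3A%2F%2Fwww.cnn.com". The field name is an assumption.
sub url_from_query {
    my ($query, $field) = @_;
    foreach my $pair (split /&/, defined $query ? $query : '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && $name eq $field;
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                               # "+" encodes a space
        $value =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/eg;   # %xx -> character
        return $value;
    }
    return;   # field not present
}

# Escape text so it can be shown safely inside an HTML page.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build the full CGI response: header, blank line, then an HTML body.
sub cgi_page {
    my ($url) = @_;
    my $shown = defined $url ? html_escape($url) : '(no URL given)';
    return "Content-type: text/html\r\n\r\n"
         . "<HTML><BODY><P>Requested URL: $shown</P></BODY></HTML>\n";
}

# Where the command-line version would read $ARGV[0], read the CGI variable.
print cgi_page(url_from_query($ENV{QUERY_STRING}, 'URL'));
```

Everything the original script printed as plain text would go through something like html_escape before being placed inside the HTML body.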

If you were to post the script then I could show you how to modify it.
 

Author Comment

by:atticus
ID: 1829093
OK, I will hold off on grading (if I can; that was my first post). Jeffrey Friedl's script is here: http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

If you can show me every line that needs changing you'll get an A! I know that I need to substitute a variable for ARGV, but I am not sure how to do it properly. I can't seem to call this script from an eval or a system call.

Thanks.
 

Author Comment

by:atticus
ID: 1829094
Adjusted points to 205
 

Author Comment

by:atticus
ID: 1829095
I'm new to this. I guess it really doesn't hurt "icd" if I reject the question, right? I want others to be able to answer. His/hers was a good answer but incomplete.
 
LVL 7

Expert Comment

by:faster
ID: 1829096
What you need to do is quite simple:

1. Create a form which contains at least a field for the URL. Let's say you name this field "inputurl". The action of this form should point to your CGI script, like this: <FORM METHOD=POST ACTION=http://your_cgi_domain/your_cgi_path/something.pl>

2. In the CGI script, read $ENV{'CONTENT_LENGTH'} to get the length of the CGI input, then read that many bytes from stdin into a buffer. The buffer will contain inputurl=what_ever_user_input&otherinputfield=other_value

3. Parse the buffer to get the user input for the URL. The name=value pairs are separated by "&", so Perl's split() does this job quite easily. Then you need to do URL unescaping (convert each %xx back to its original character).

4. Now you have the URL and can do whatever processing you like. But when you output the result, format it using HTML tags.

I think you are more familiar with your script, so I am only supplying this guide. Do it once and you will be familiar with it; it is really simple.

By the way, a typical way to do step 3 is code like this (suppose $buffer contains the stuff read from stdin). At the end, $FORM{"inputurl"} will be the one you want.

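      my %FORM;
      foreach my $pair (split(/&/, $buffer)) {
            next if $pair eq '';                      # skip empty pairs such as "a=1&&b=2"
            my ($name, $value) = split(/=/, $pair, 2);   # limit 2: the value may contain "="
            $value = '' unless defined $value;        # field sent without "="
            foreach ($name, $value) {
                  tr/+/ /;                            # "+" encodes a space
                  s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;   # %xx -> original byte
            }
            $FORM{$name} = $value;
      }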
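Step 2 (reading the POST data) is not shown above, so here is a minimal sketch of it. It assumes the form was submitted with METHOD=POST; the sub name read_post_body is made up for illustration. In a real CGI script you would pass it \*STDIN and $ENV{CONTENT_LENGTH}, as the last lines do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read exactly $length bytes of form data from the given filehandle.
# Returns an empty string if the length is missing or not a number.
sub read_post_body {
    my ($fh, $length) = @_;
    return '' unless defined $length && $length =~ /^\d+$/ && $length > 0;
    my $buffer = '';
    my $got    = 0;
    # read() may return fewer bytes than asked for, so loop until done.
    while ($got < $length) {
        my $n = read($fh, $buffer, $length - $got, $got);
        last unless $n;   # undef on error, 0 at end of input
        $got += $n;
    }
    return $buffer;
}

# In the CGI script itself:
binmode STDIN;
my $buffer = read_post_body(\*STDIN, $ENV{CONTENT_LENGTH});
print "Content-type: text/plain\r\n\r\n";
print "Read ", length($buffer), " bytes of form data\n";
```

The resulting $buffer is what the parsing loop in step 3 expects.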

 

Author Comment

by:atticus
ID: 1829097
This is a detailed answer but it misses the point.  I want to grab the entire contents of what's on that URL and be able to read what's on that page entirely into my script.

For example, if the input is:

http://www.mydomain.com/cgi-bin/getURL.cgi?URL=http://www.cnn.com

I should have the program know that on that page the words "CNN Headline News" appear in the <TITLE> tags and the like.

I'm not sure that your answer tells me how to do that.  I want to take the URL and do whatever I want with the HTML contents of that remote page.
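To make concrete what I mean by knowing what is in the <TITLE> tags, here is a small sketch. It assumes the page's HTML is already in a string; the sub name extract_title and the sample page are made up for illustration. A regular expression like this is only a quick approximation and is not a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text between <TITLE> and </TITLE>, or undef if none is found.
sub extract_title {
    my ($html) = @_;
    # /i: tag case does not matter; /s: the title may span several lines.
    if ($html =~ m{<title[^>]*>(.*?)</title>}is) {
        my $title = $1;
        $title =~ s/\s+/ /g;     # collapse newlines and runs of spaces
        $title =~ s/^ | $//g;    # trim a leading or trailing space
        return $title;
    }
    return undef;
}

# Made-up sample page standing in for the fetched HTML.
my $page = "<HTML><HEAD><TITLE>\n CNN Headline News\n</TITLE></HEAD></HTML>";
print extract_title($page), "\n";   # prints "CNN Headline News"
```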

If you can answer THAT I'll give you an A.

Thanks for responding, "faster" !

---atticus---
 
LVL 7

Accepted Solution

by:
faster earned 200 total points
ID: 1829098
If you need the "content" of the URL, then I only answered half of your question. The remaining half is more difficult to implement, but it is possible (actually it is what the web worms are doing, right?)

Basically, your CGI script has to act as a browser (more precisely, an HTTP client). After you get the URL, you need to connect to the server that owns the URL. Let's say the URL is http://www.somesite/somepath/1.html. You need to connect to the server www.somesite. Of course, you will be using TCP, and the port number is 80 by default, or the one that appears in the URL. After you successfully connect to the server, send it the following:

GET /somepath/1.html HTTP/1.0\r\n\r\n

Then you need to receive the response from the server. It starts with a status line and headers, followed by a blank line and then the "content" of the URL (even when it is actually an image or a Java class).

All the connecting, sending and receiving has to be coded using sockets. I don't know whether there are existing scripts/software that can extract the content of the URL for you (it is not difficult if one is familiar with sockets and the HTTP protocol); maybe you can find one instead of writing the socket code yourself. Anyway, if you really need to do that yourself, the steps I mentioned above are sufficient: create a socket, connect to the server, send the request and receive the response.
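The steps above can be sketched roughly as follows, using Perl's IO::Socket::INET module. This is only an illustration under some assumptions: it handles plain http:// URLs only, does not follow redirects, and the sub names (split_http_url, build_request, fetch_url, split_response) are made up here. The Host header is an addition beyond the minimal request shown above; servers that host several sites on one address generally need it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Split "http://host[:port]/path" into its parts; port defaults to 80.
sub split_http_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}i;
    my ($host, $port, $path) = ($1, $2, $3);
    return ($host, defined $port ? $port : 80, defined $path ? $path : '/');
}

# Build the request text as it is sent on the wire.
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
}

# Connect, send the request, and return the raw response (headers + body).
sub fetch_url {
    my ($url) = @_;
    my ($host, $port, $path) = split_http_url($url)
        or die "Not an http:// URL: $url\n";
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 15,
    ) or die "Cannot connect to $host:$port: $@\n";
    print $sock build_request($host, $path);
    local $/;                    # slurp mode: read until the server closes
    my $response = <$sock>;
    close $sock;
    return defined $response ? $response : '';
}

# Separate the headers from the body at the first blank line.
sub split_response {
    my ($response) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $response, 2;
    return ($headers, defined $body ? $body : '');
}

# Command-line demo only; a CGI script would take the URL from the form.
if (@ARGV) {
    my ($headers, $body) = split_response(fetch_url($ARGV[0]));
    print $body;
}
```

If the libwww-perl (LWP) modules are installed on the server, they do the same job with far less code and are worth a look before writing the socket code by hand.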

Have fun.
 

Author Comment

by:atticus
ID: 1829099
This is an acceptable answer, but my second response gave the URL of a script that already does this, although only when run from the command line:

     http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

I wanted somebody to show me what I need to change in the above script to get it to work.  I know it's asking a lot, but that's why I assigned it so many points (all I had at the time!).

Subsequent answers still appreciated.  Thanks to all who have helped so far.
 
LVL 7

Expert Comment

by:faster
ID: 1829100
I didn't look at your script, but if it can already get the content of the URL, then what other problem do you have? The first half of my answer shows how to get the input URL from the form. Once you have that, there should be little difference from getting it from the command line. So where exactly is the problem?
 

Author Comment

by:atticus
ID: 1829101
I suppose you're right, "faster." I was looking for the easy way out, i.e. someone to show me exactly what in the script I need to change to go from command-line input/output to CGI. But it's better I learn it on my own.

Shishir Gundavaram's "Mouse" book from O'Reilly has a good section on "Checking Hypertext (HTTP) Links" that explains it pretty well. Together with your answer I should be able to build a script that works :)