Solved

How to grab and parse the contents of a URL?

Posted on 1997-07-15
Last Modified: 2013-12-25
I would like to do in CGI what Jeffrey Friedl does with his command-line Perl script webget.pl: enter a URL in a form field, hit submit, and have my CGI script fetch and parse that URL's HTML text so I can do whatever I want with it.  Where do I start?  Is there an easy way to do this, or will it be long and complex?
Question by:atticus
11 Comments
 

Author Comment

by:atticus
Edited text of question
 
LVL 5

Expert Comment

by:icd
If you have the command-line script then it should be easy to modify. There are only two things you need to do.

You need to pick up the URL from a CGI variable rather than from a command-line argument.

All printed output should be changed to produce an HTML document rather than plain text.

If you were to post the script then I could show you how to modify it.
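
As a rough illustration of those two changes (this is not Friedl's code, which isn't reproduced in this thread), here is a minimal sketch. It assumes the URL arrives as a GET parameter named URL; the helper names url_from_query and html_escape are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the "URL" parameter out of a CGI query string.
# (The command-line version would have taken this from @ARGV instead.)
sub url_from_query {
    my ($query) = @_;
    foreach my $pair (split /&/, defined $query ? $query : '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && $name eq 'URL';
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                               # "+" encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # undo %xx escapes
        return $value;
    }
    return;    # no URL parameter present
}

# Hypothetical helper: make user-supplied text safe to print inside HTML.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Only act as a CGI script when the web server has set up the environment.
if (defined $ENV{'GATEWAY_INTERFACE'}) {
    my $url = url_from_query($ENV{'QUERY_STRING'});
    # Change 2: send a header and an HTML page instead of bare text.
    print "Content-type: text/html\r\n\r\n";
    print "<HTML><BODY><P>Requested URL: ",
          (defined $url ? html_escape($url) : "(none)"),
          "</P></BODY></HTML>\n";
}
```

The existing script's fetching and parsing code would go where the sketch prints the URL.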
 

Author Comment

by:atticus
OK, I will hold off on grading (if I can; that was my first post).  Jeffrey Friedl's script is here:  http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

If you can show me every line that needs changing you'll get an A!  I know that I need to replace ARGV with a variable, but I'm not sure how to do it properly.  I can't seem to call this script from an eval or a system call.

Thanks.
 

Author Comment

by:atticus
Adjusted points to 205
 

Author Comment

by:atticus
I'm new to this.  I guess it really doesn't hurt "icd" if I reject the question, right?  I want others to be able to answer.  His/hers was a good answer but incomplete.
 
LVL 7

Expert Comment

by:faster
What you need to do is quite simple:

1. Create a form that contains at least a field for the URL.  Let's say you name this field "inputurl".  The action of this form should point to your CGI script, like this: <FORM METHOD=POST ACTION="http://your_cgi_domain/your_cgi_path/something.pl">

2. In the CGI script, read $ENV{'CONTENT_LENGTH'} to get the length of the CGI input, then read that many bytes from stdin into a buffer.  The buffer will contain inputurl=what_ever_user_input&otherinputfield=other_value

3. Parse the buffer to get the user's input for the URL.  The name=value pairs are separated by "&", so a Perl split() does the job quite easily.  Then you need to URL-unescape each value (convert "+" to a space and each %xx back to its original character).

4. Now you have the URL and can do whatever processing you like.  When you output the result, format it using HTML tags.

I think you are more familiar with your script than I am, so I am only supplying this guide.  Do it once and you will be familiar with it; it is really simple.

By the way, a typical way to do step 3 is code like this (suppose $buffer contains the data read from stdin).  At the end, $FORM{"inputurl"} will be the value you want.

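      @pairs = split(/&/, $buffer);                   # one name=value pair per element
      foreach $pair (@pairs) {
            ($name, $value) = split(/=/, $pair, 2);   # split on the first "=" only
            $value = "" unless defined $value;        # a field may have no value
            $value =~ tr/+/ /;                        # "+" encodes a space
            # convert each %xx escape back to its byte ("C" is an unsigned char)
            $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
            $FORM{$name} = $value;
      }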
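
For steps 2 and 4, which the loop above does not cover, something like the following might serve as a starting point. It is only a sketch: the function names read_form_body and parse_form are invented for the example, and the field name "inputurl" is the one assumed in step 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 2: read up to $length bytes of form data from a filehandle.
# read() may return fewer bytes than requested, so loop until done or EOF.
sub read_form_body {
    my ($fh, $length) = @_;
    my $buffer = '';
    while (length($buffer) < $length) {
        my $got = read($fh, $buffer, $length - length($buffer), length($buffer));
        last unless $got;    # 0 means EOF, undef means a read error
    }
    return $buffer;
}

# Step 3 wrapped as a function: decode name=value pairs into a hash.
sub parse_form {
    my ($buffer) = @_;
    my %form;
    foreach my $pair (split /&/, $buffer) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name;
        $value = '' unless defined $value;
        foreach ($name, $value) {                  # decode both in place
            tr/+/ /;
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
        }
        $form{$name} = $value;
    }
    return %form;
}

# Run as a CGI script only when the server handed us a POST request.
if (($ENV{'REQUEST_METHOD'} || '') eq 'POST') {
    my %FORM = parse_form(read_form_body(\*STDIN, $ENV{'CONTENT_LENGTH'} || 0));
    my $shown = defined $FORM{'inputurl'} ? $FORM{'inputurl'} : '';
    # Escape the value before echoing it, so it cannot inject markup.
    $shown =~ s/&/&amp;/g;
    $shown =~ s/</&lt;/g;
    $shown =~ s/>/&gt;/g;
    # Step 4: answer with a header followed by an HTML document.
    print "Content-type: text/html\r\n\r\n";
    print "<HTML><BODY><P>You asked for: $shown</P></BODY></HTML>\n";
}
```

Taking the filehandle as a parameter, rather than reading STDIN directly, is just a convenience so the reading code can be tried out on a string or a file.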

 

Author Comment

by:atticus
This is a detailed answer but it misses the point.  I want to grab the entire contents of what's on that URL and be able to read what's on that page entirely into my script.

For example, given this input:

http://www.mydomain.com/cgi-bin/getURL.cgi?URL=http://www.cnn.com

I should have the program know that on that page the words "CNN Headline News" appear in the <TITLE> tags and the like.

I'm not sure that your answer tells me how to do that.  I want to take the URL and do whatever I want with the HTML contents of that remote page.

If you can answer THAT I'll give you an A.

Thanks for responding, "faster" !

---atticus---
 
LVL 7

Accepted Solution

by:
faster earned 200 total points
If you need the "content" of the URL, then I only answered half of your question.  The remaining half is more difficult to implement, but it is possible (it is what the web worms are doing, right?).

Basically, your CGI has to act as a browser (more precisely, an HTTP client).  After you get the URL, you need to connect to the server that owns it.  Say the URL is http://www.somesite/somepath/1.html: you need to connect to the server www.somesite using TCP, on port 80 by default or on the port that appears in the URL.  After you successfully connect to the server, send it the following:

GET /somepath/1.html HTTP/1.0\r\n\r\n

Then you need to receive the response from the server.  It consists of a status line and headers, a blank line, and then the "content" of the URL (even when that is actually an image or a Java class).

All the connecting, sending and receiving has to be coded using sockets.  I don't know whether there are existing scripts or software that can extract the content of a URL for you (it is not difficult if one is familiar with sockets and the HTTP protocol); maybe you can find one instead of writing the socket code yourself.  Anyway, if you really need to do it yourself, the steps I mentioned above are sufficient: create a socket, connect to the server, send the request and receive the response.
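
The steps above could be sketched roughly as follows, using the IO::Socket::INET module that ships with Perl 5.  Treat it as an illustration under several assumptions: it handles plain http:// URLs only, does not follow redirects, and the function names are invented for the example.  The Host header is also an addition of mine, not part of the request shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Split an http:// URL into host, port (default 80) and path (default "/").
sub parse_http_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}i;
    return ($1, defined $2 ? $2 : 80, defined $3 ? $3 : '/');
}

# Build the request text.  The Host header is an assumption added here
# because many servers host several sites on one address.
sub build_request {
    my ($host, $port, $path) = @_;
    my $host_header = $port == 80 ? $host : "$host:$port";
    return "GET $path HTTP/1.0\r\nHost: $host_header\r\n\r\n";
}

# The reply is a status line and headers, a blank line, then the body.
sub split_response {
    my ($response) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $response, 2;
    return ($headers, defined $body ? $body : '');
}

# Pull out the text between <TITLE> and </TITLE> (a crude regex, not a parser).
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Connect, send the request, and read until the server closes the connection.
sub fetch_url {
    my ($url) = @_;
    my ($host, $port, $path) = parse_http_url($url)
        or die "Not an http:// URL: $url\n";
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host, PeerPort => $port, Proto => 'tcp', Timeout => 15,
    ) or die "Cannot connect to $host:$port: $@\n";
    print $sock build_request($host, $port, $path);
    my $response = do { local $/; <$sock> };    # slurp the whole reply
    close $sock;
    return split_response(defined $response ? $response : '');
}

# Command-line use: pass a URL as the first argument.
if (@ARGV) {
    my ($headers, $body) = fetch_url($ARGV[0]);
    my $title = extract_title($body);
    print "Title: ", (defined $title ? $title : "(none found)"), "\n";
}
```

In a CGI script you would call fetch_url() with the URL taken from the form instead of from @ARGV.  If the libwww-perl modules happen to be installed, LWP::Simple's get($url) does the fetching in a single call, which saves writing the socket code.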

Have fun.
 

Author Comment

by:atticus
This is an acceptable answer, but my second response gave the URL of a script that already does this, though only as a command-line script:

     http://enterprise.ic.gc.ca/~jfriedl/perl/inlined/webget

I wanted somebody to show me what I need to change in the above script to get it to work.  I know it's asking a lot, but that's why I assigned it so many points (all I had at the time!).

Subsequent answers still appreciated.  Thanks to all who have helped so far.
 
LVL 7

Expert Comment

by:faster
I didn't look at your script, but if it can already get the content of the URL, then what problem remains?  The first half of my answer shows how to get the input URL from the form; once you have that, there should be little difference from getting it on the command line.  So where exactly is the problem?
 

Author Comment

by:atticus
I suppose you're right, "faster."  I was looking for the easy way out, i.e. someone to show me exactly what in the script I need to change to go from command-line input/output to CGI.  But it's better that I learn it on my own.

Shishir Gundavaram's "Mouse" book from O'Reilly has a good section on "Checking Hypertext (HTTP) Links" that explains it pretty well.  Together with your answer I should be able to build a script that works :)