  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 281

Reading data from a website

I was thinking of doing this in Perl if possible, but other languages might work too.  I am trying to figure out how I would go about reading data from a web page that constantly changes.  For example, if I want to read the winners of a local game, and I know the web address stays the same, how do I actually get the page source into my code, if that makes sense?
Asked by: feldmani
3 Solutions
 
ozoCommented:
perl -MLWP::Simple -e "getprint 'http://webaddress'"
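Or, as a slightly fuller sketch along the same lines: fetch the page into a string with LWP::Simple's get(), then pull out whatever you need with a regex. The URL and the "winner" markup are placeholders to adjust for the real page.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

my $url  = 'http://webaddress/results.html';    # placeholder address
my $page = get($url) or die "Couldn't fetch $url";

# Hypothetical markup: <td class="winner">Some Team</td>
while ( $page =~ m{<td class="winner">([^<]+)</td>}g ) {
    print "Winner: $1\n";
}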
 
bigjosh2Commented:
Check out these tools...

http://www.webzinc.com/online/

and

http://poorva.com/aie/ale.shtml

Both make it pretty easy to write programs to scrape data off a web page.

If you want something cheap (free) and the page you are using is pretty simple, you could use a tool like WGET...

http://www.gnu.org/software/wget/

...to download the page to a file on your local machine and then use Perl to open the file and grab the data you are looking for.
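For example, a rough sketch of that two-step approach (the URL, the local filename, and the "winner" markup are all assumptions to adjust for your page):

#!/usr/bin/perl
# Step 1 (outside this script): download the page with WGET, e.g.
#   wget -q -O results.html http://webaddress/results.html
# Step 2: open the saved copy in Perl and grab the data you want.
use strict;
use warnings;

open my $fh, '<', 'results.html' or die "Can't open results.html: $!";
my $page = do { local $/; <$fh> };    # slurp the whole file into one string
close $fh;

while ( $page =~ m{<td class="winner">([^<]+)</td>}g ) {
    print "Winner: $1\n";
}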
 
feldmaniAuthor Commented:
I would prefer to automate this as much as possible, so I would rather not download and save the page by hand; if I did that, the rest would be fairly easy, but I want it all to happen automatically.  I have not tried your option yet, ozo.
 
bigjosh2Commented:
You would call WGET from *inside* your Perl program using, say, the "system" function (rather than "exec", which replaces the running program and never returns), so the whole thing would be completely automated.
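A minimal sketch of that, assuming wget is on the PATH and using a placeholder URL and filename:

use strict;
use warnings;

my $url  = 'http://webaddress/results.html';    # placeholder
my $file = 'results.html';

# system() runs wget and waits for it to finish, then the script carries on.
system('wget', '-q', '-O', $file, $url) == 0
    or die "wget failed: $?";

open my $fh, '<', $file or die "Can't open $file: $!";
my $page = do { local $/; <$fh> };    # slurp the downloaded page
close $fh;

# ...then pick the data out of $page as in the earlier example...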

 
feldmaniAuthor Commented:
But would I be able to run that on any system?  Will it work in Windows like that?
 
bigjosh2Commented:
WGET is available for most major platforms, including Linux and Windows.
 
wnrossCommented:
Or use ozo's technique; that way you are using an entirely Perl-based approach.

Or, if you want fine-grained control:
--------------- CUT HERE ----------------
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket;

my $source = "www.cnn.com";

my %params = (
  Proto    => 'tcp',
  PeerPort => 'http(80)',
  PeerHost => $source,
);

my $socket = IO::Socket::INET->new( %params )
  or die "Unable to connect: $!";

# HTTP header lines must end in CRLF, and a blank line ends the request.
$socket->send("GET / HTTP/1.1\r\n");
$socket->send("User-Agent: Mozilla/5.0 (Perl Really; linux)\r\n");
$socket->send("Host: $source\r\n");
$socket->send("Pragma: no-cache\r\n");
$socket->send("Accept: text/html, */*\r\n");
$socket->send("Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n");
$socket->send("Connection: close\r\n\r\n");

# Slurp the entire response (headers plus body) in one read.
my $page = do { local $/; <$socket> };

$socket->close();

# Print any links found in the page.
while ( $page =~ m/<a href="(\S+)">/mg ) {
   print "$1\n";
}
--------------- CUT HERE ----------------
 
feldmaniAuthor Commented:
Right on, thanks. You guys have both been very helpful.
 
wnrossCommented:
No problem, thanks for the points
-Bill
 
bigjosh2Commented:
wnross's answer is definitely the way to get the most control, but it can also require a lot of work. Web browsers handle a huge number of details for you when they download pages, and you may end up having to do some of that work yourself depending on the web page.

Here is an example: The HTTP spec allows a web server to compress a page when serving it using, say, the "Content-Encoding: gzip" header. You can read about it here...

http://www.websiteoptimization.com/speed/tweak/compress/

If you use the above code to get a compressed page, then all you will end up with is garbage that you will have to write extra code to detect and decompress.
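For what it's worth, if you did run into that, one way to handle it is with the core IO::Uncompress::Gunzip module. A rough sketch, assuming you have already read the raw HTTP response into a string:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# $raw is assumed to hold the full response read from the socket.
sub decode_body {
    my ($raw) = @_;
    my ( $headers, $body ) = split /\r?\n\r?\n/, $raw, 2;
    if ( $headers =~ /^Content-Encoding:\s*gzip/mi ) {
        my $plain;
        gunzip( \$body => \$plain )
            or die "gunzip failed: $GunzipError";
        $body = $plain;
    }
    return $body;
}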

There are many other cases where a page you grab yourself might need a lot of extra work before you can actually use it - HTTP is a pretty complex protocol. If the URL happens to return a REDIRECT, you'll have to add code to handle that. If the URL you want happens to be HTTPS, you'd have a lot more work ahead of you to get that working.

If you only need to get that one single page, and the brute force method seems to work, then no problem - but I think the WGET method is a more general and preferable solution because it should work for pretty much any web page coming off any web server without you having to worry about any of the details of HTTP.

-josh
 
wnrossCommented:
Josh:

A web server will *never* compress a page for you unless you ask it to with another header:

Accept-Encoding: compress, gzip

But you are right about redirects.  Still, if you are only checking a single page then you're probably OK. I think feldmani was looking for a Perl-only solution, in which case ozo's use of the LWP library wins the race.
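To illustrate that, a minimal LWP::UserAgent sketch (the URL is a placeholder): it follows redirects by default, and decoded_content() takes care of any Content-Encoding on the response.

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;    # follows redirects for GET by default
my $response = $ua->get('http://webaddress/results.html');
die "Fetch failed: ", $response->status_line unless $response->is_success;

my $page = $response->decoded_content;    # undoes gzip etc. for you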

Cheers,
-Bill
 
bigjosh2Commented:
You're right about the compression header - sorry I missed that.

I agree that in the limited case of a program that is running on a known machine with no proxies and is fetching a single static page from a simple web server without redirects, the above solution is a good one.

Thanks,
josh