Solved

Using cURL to download an entire webpage (HTML, images, css, js etc...)

Posted on 2011-03-22
Last Modified: 2012-05-11
Hi,

I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want a script that will download a single page of a website along with all of its content, i.e. images, CSS, JS, etc.

I have been able to save the html (text) like this:

// Fetch the body of $url as a string; curl_exec() returns false on failure.
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

$returned_content = get_data('http://example.com/page.htm');
if ($returned_content === false) {
	die('Download failed');
}

$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle);

This saves a file called 'file.htm' containing all the HTML, but none of the images, CSS, JS, etc.

I have also been able to do this:

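$img[]='http://example.com/image.jpg';

foreach($img as $i){
	save_image($i);
	// getimagesize() returns false if the file is missing or is not a valid image
	if(@getimagesize(basename($i))){
		echo 'Image ' . basename($i) . ' Downloaded OK';
	}else{
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

// Download $img and write it to $fullpath (by default, the URL's file name in the current directory)
function save_image($img,$fullpath='basename'){
	if($fullpath=='basename'){
		$fullpath = basename($img);
	}
	$ch = curl_init ($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	// CURLOPT_BINARYTRANSFER is omitted: it has had no effect since PHP 5.1.3
	$rawdata=curl_exec($ch);
	curl_close ($ch);
	// Remove any stale copy first, because mode 'x' refuses to overwrite
	if(file_exists($fullpath)){
		unlink($fullpath);
	}
	if($rawdata===false){
		return; // download failed, nothing to write
	}
	$fp = fopen($fullpath,'x');
	fwrite($fp, $rawdata);
	fclose($fp);
}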


This saves that specific image, but I haven't found anything that will save the entire HTML page together with all the content behind it.
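
A cURL-only approach would need to find the asset URLs in the downloaded HTML before fetching each one. The following is a minimal sketch of that discovery step, assuming PHP's DOM extension is available. The helper names extract_asset_urls() and resolve_url() are made up for illustration, and the URL resolution is deliberately simplified (it ignores ports, "../" segments and <base> tags).

```php
<?php
// Hypothetical helpers: list the image, script and stylesheet URLs a page references.

// Turn a possibly relative URL into an absolute one (simplified on purpose).
function resolve_url($base, $rel)
{
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (substr($rel, 0, 2) === '//') {            // protocol-relative
        return $parts['scheme'] . ':' . $rel;
    }
    if (parse_url($rel, PHP_URL_SCHEME) !== null) { // already absolute
        return $rel;
    }
    if (substr($rel, 0, 1) === '/') {              // root-relative
        return $root . $rel;
    }
    // Relative to the directory of the base document.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $rel;
}

function extract_asset_urls($html, $base)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $wanted = array('img' => 'src', 'script' => 'src', 'link' => 'href');
    $urls   = array();
    foreach ($wanted as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // Only <link rel="stylesheet"> counts as a page asset here.
            if ($tag === 'link' && stripos($node->getAttribute('rel'), 'stylesheet') === false) {
                continue;
            }
            $value = $node->getAttribute($attr);
            if ($value !== '') {
                $urls[] = resolve_url($base, $value);
            }
        }
    }
    return array_values(array_unique($urls));
}
```

Each returned URL could then be passed to a download function such as save_image() above, which writes raw bytes and so is not limited to images. Rewriting the links inside the saved HTML so they point at the local copies is a separate step that this sketch does not attempt.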


Thanks for your help in advance!
Question by:jambla

Expert Comment

by:roemelboemel
ID: 35190001
I would suggest calling wget from PHP.
Wget has the "-p" option to fetch a copy of the page including all images, CSS, etc.
If you would also like the links to the images and CSS rewritten to point at the downloaded content, you have to specify "-k".
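
As a rough sketch of what such a call could look like from PHP: wget's long-form options --page-requisites, --convert-links and --directory-prefix correspond to -p, -k and -P. The helper name build_wget_command() and the target directory are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: assemble a wget command line for saving one page
// plus its requisites (images, CSS, JS) into $targetDir.
function build_wget_command($url, $targetDir)
{
    // escapeshellarg() quotes each value so that characters such as
    // '&' or '?' in a URL are not interpreted by the shell.
    return 'wget --page-requisites --convert-links --directory-prefix='
        . escapeshellarg($targetDir) . ' ' . escapeshellarg($url);
}

$cmd = build_wget_command('http://www.gnu.org/', '/tmp/');
echo $cmd, "\n";
// To actually run it: exec($cmd, $outputLines, $exitCode);
```

Quoting matters most when the URL comes from user input or contains a query string; for a fixed, simple hostname the plain string form works just as well.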

Author Comment

by:jambla
ID: 35190060
Hello roemelboemel,

Thanks for your response.  In my hours of searching I have seen a lot of talk about wget.  However, I'm not sure how to use it.  I have verified that my server has enabled it.

Using wget and php how would I go about saving a webpage to a folder with all the contents of the webpage?  Could you show me an example of code that would do this for me?

Thanks again.

Expert Comment

by:roemelboemel
ID: 35190246
This snippet would save http://www.gnu.org in the directory /tmp/www.gnu.org/:
exec('wget -P /tmp/ -p -k www.gnu.org');


Author Comment

by:jambla
ID: 35190507
Hi roemelboemel,

I uploaded this to my server:

<?php

exec('wget -P /tmp/ -p -k www.gnu.org');

?>


When I execute it I get a "500 Internal Server Error".  I contacted my host and they confirmed that wget is enabled.

Any suggestions?

Expert Comment

by:roemelboemel
ID: 35190648
I've tested it on one of my servers and it works. Please check the Apache error logs for the specific error. Possible causes:

no permission to write to /tmp/
not allowed to call external commands using exec
SELinux preventing something
...
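
When the server logs are hard to reach, some of these causes can be narrowed down from PHP itself. The sketch below is one possible way to do that, not a definitive diagnostic. The helper names run_and_capture() and exec_is_disabled() are invented for this example, and /tmp/ is simply the directory used in this thread.

```php
<?php
// Hypothetical helper: run a command and return its exit code and output.
function run_and_capture($command)
{
    $lines    = array();
    $exitCode = -1;
    // wget writes its log to stderr; "2>&1" folds stderr into stdout
    // so that error messages end up in $lines.
    exec($command . ' 2>&1', $lines, $exitCode);
    return array('exit' => $exitCode, 'output' => implode("\n", $lines));
}

// Hypothetical helper: true if exec() is missing or listed in disable_functions.
function exec_is_disabled()
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return !function_exists('exec') || in_array('exec', $disabled, true);
}

$dir = '/tmp/';
echo 'exec available:  ', exec_is_disabled() ? 'no' : 'yes', "\n";
echo 'target writable: ', is_writable($dir) ? 'yes' : 'no', "\n";
```

A non-zero 'exit' value from run_and_capture(), together with the captured text, usually says more than a silent exec() call: for example, the shell reports it if wget is not on the web server user's PATH.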

Author Comment

by:jambla
ID: 35190807
Ok,

I was able to get rid of the 500 Internal Server Error and it seems to be executing fine; however, when I look in the /tmp/ folder I don't see any files or folders that match the downloaded site.

Expert Comment

by:roemelboemel
ID: 35190966
Should be in /tmp/www.gnu.org/ with all the images etc.
What about when running it on the commandline?

Author Comment

by:jambla
ID: 35191066
When I try to go to /tmp/www.gnu.org/ I get "550 Can't change directory to /tmp/www.gnu.org: No such file or directory".

"What about when running it on the commandline?"

I'm running Windows 7.  Can I run a Linux command from Windows?

Accepted Solution

by:
roemelboemel earned 500 total points
ID: 35191295
If you have shell access via SSH, you could run it on the server. Furthermore, if you have a directory that is writable by the user the web server runs as (which doesn't have to be the same user you use to upload your PHP file), you could point wget there instead of /tmp.
You could make a directory on the web server, e.g.
mkdir /home/<myhomedir>/foo

Then give the web server user write access to it or, for testing, just
chmod 777 /home/<myhomedir>/foo

And then point the wget output there

<?php
exec('wget -P /home/<myhomedir>/foo/ -p -k www.gnu.org');
?> 


Author Comment

by:jambla
ID: 35191372
Hi roemelboemel,

Thanks for the suggestion; however, I don't have shell access.  I have been messing around with wget for a few days now and I always end up getting nowhere, which is why I started looking more at cURL.

I will try to create a directory with 777 permissions and send the files there.

Author Comment

by:jambla
ID: 35191477
Hi roemelboemel,

I have tried creating a folder, changing its permissions to 777 and running the wget script, but I'm still not getting the folder or files.

Author Comment

by:jambla
ID: 35191528
Hi roemelboemel,

I tried a few other things and I got it working!


Thanks so much for your help!

Author Closing Comment

by:jambla
ID: 35191546
As always, the gurus at EE come through!