Solved

Using cURL to download an entire webpage (HTML, images, css, js etc...)

Posted on 2011-03-22
Last Modified: 2012-05-11
Hi,

I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want to have a script that will download one page of a website with all the content i.e. images, css, js etc...

I have been able to save the html (text) like this:

function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
	$data = curl_exec($ch); // false on failure
	curl_close($ch);
	return $data;
}

$returned_content = get_data('http://example.com/page.htm');
if ($returned_content === false) {
	die('Download failed');
}

$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle);



This will save a file called 'file.htm' with all the HTML but no images, css, js etc...

I have also been able to do this:

$img[]='http://example.com/image.jpg';

foreach($img as $i){
	save_image($i);
	if(getimagesize(basename($i))){
		echo 'Image ' . basename($i) . ' Downloaded OK';
	}else{
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

function save_image($img,$fullpath='basename'){
	if($fullpath=='basename'){
		$fullpath = basename($img);
	}
	$ch = curl_init($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the raw bytes as a string
	$rawdata = curl_exec($ch);
	curl_close($ch);
	if($rawdata === false){
		return false; // download failed, leave any existing file alone
	}
	// Mode 'w' truncates an existing file, so no unlink() is needed first
	$fp = fopen($fullpath,'w');
	fwrite($fp, $rawdata);
	fclose($fp);
	return true;
}




This will save that specific image but I haven't found anything that will save the entire HTML with all the content behind it.
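
For anyone wanting to stay with PHP and cURL, the missing piece is finding the asset URLs inside the saved HTML. The following is only a rough sketch, assuming the DOM extension is available; the function name extract_asset_urls is made up for illustration. It collects the src/href values of img, script and link tags, which could then be fed to a downloader such as save_image() above.

```php
<?php
// Hypothetical helper (not from the thread): lists asset URLs referenced by a page.
function extract_asset_urls(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag name => attribute holding the URL. Note that <link> also covers
    // non-stylesheet links such as icons.
    $sources = ['img' => 'src', 'script' => 'src', 'link' => 'href'];
    $urls = [];
    foreach ($sources as $tag => $attribute) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attribute);
            if ($value !== '') {
                $urls[] = $value;
            }
        }
    }
    // Drop duplicates and reindex.
    return array_values(array_unique($urls));
}
```

The returned values are exactly what the page contains, so relative URLs would still have to be resolved against the page address before downloading, and assets referenced from inside CSS files are not found at all.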


Thanks for your help in advance!
Question by:jambla
13 Comments
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190001
I would propose using a wget call from PHP.
Wget has the "-p" option to get a copy of the page including all images, CSS etc.
If you would also like the links to the images and CSS rewritten to point to the downloaded content, you have to specify "-k".
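
As a small sketch of how such a call could be assembled from PHP: the helper name build_wget_command is invented here, and the URL and directory are just examples. The arguments are passed through escapeshellarg() so that an unusual URL cannot break the command line.

```php
<?php
// Hypothetical helper: builds the wget command line for saving one page.
function build_wget_command(string $url, string $targetDir): string
{
    // -p  download page requisites (images, CSS, JS)
    // -k  convert links in the saved HTML to point at the local copies
    // -P  directory prefix under which the files are saved
    return 'wget -p -k -P ' . escapeshellarg($targetDir) . ' ' . escapeshellarg($url);
}

// Only prints the command; nothing is executed here.
echo build_wget_command('http://www.gnu.org', '/tmp/'), PHP_EOL;
```

The resulting string can then be handed to exec().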
 

Author Comment

by:jambla
ID: 35190060
Hello roemelboemel,

Thanks for your response.  In my hours of searching I have seen a lot of talk about wget.  However, I'm not sure how to use it.  I have verified that my server has enabled it.

Using wget and php how would I go about saving a webpage to a folder with all the contents of the webpage?  Could you show me an example of code that would do this for me?

Thanks again.
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190246
This snippet would save http://www.gnu.org in the directory /tmp/www.gnu.org/:
exec('wget -P /tmp/ -p -k www.gnu.org');


 

Author Comment

by:jambla
ID: 35190507
Hi roemelboemel,

I uploaded this to my server:

<?php

exec('wget -P /tmp/ -p -k www.gnu.org');

?>



When I execute it I'm getting a "500 Internal Server Error".  I contacted my server and they confirmed that wget is enabled.

Any suggestions?
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190648
I've tested it on one of my servers and it works. Please check the error logs of the Apache web server for the specific error. Possible causes:

no permission to write to /tmp/
not allowed to call external commands using exec
SELinux preventing something
...
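
The first two causes in that list can be checked from PHP itself. This is a minimal sketch with an invented function name (diagnose_wget); it cannot detect SELinux restrictions, for which the server logs are still needed.

```php
<?php
// Hypothetical diagnostic helper: returns a list of likely problems (empty if none found).
function diagnose_wget(string $targetDir): array
{
    $problems = [];

    // exec() can be switched off through the disable_functions ini setting.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (!function_exists('exec') || in_array('exec', $disabled, true)) {
        $problems[] = 'exec() is disabled in this PHP configuration';
    }

    // The directory must exist and be writable by the user PHP runs as.
    if (!is_dir($targetDir) || !is_writable($targetDir)) {
        $problems[] = "cannot write to $targetDir";
    }

    return $problems;
}

// Example: print whatever the check finds for /tmp/.
print_r(diagnose_wget('/tmp/'));
```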
 

Author Comment

by:jambla
ID: 35190807
Ok,

I was able to get rid of the 500 Internal Server Error and the script seems to be executing fine. However, when I look in the /tmp/ folder I don't see any files or folders that match the downloaded site.
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190966
Should be in /tmp/www.gnu.org/ with all the images etc.
What about when running it on the commandline?
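
If no command line is available, a rough equivalent is to capture what wget prints and its exit status from within PHP. The sketch below uses a made-up helper name (run_and_capture) and assumes exec() is permitted on the server.

```php
<?php
// Hypothetical helper: runs a command and returns its exit status and output lines.
function run_and_capture(string $command): array
{
    $output = [];
    $status = -1;
    // 2>&1 merges stderr into stdout, so error messages are captured as well.
    exec($command . ' 2>&1', $output, $status);
    return ['status' => $status, 'output' => $output];
}

// Usage (commented out so that nothing is downloaded when this file is loaded):
// $result = run_and_capture('wget -P /tmp/ -p -k www.gnu.org');
// echo 'Exit status: ', $result['status'], PHP_EOL;
// echo implode(PHP_EOL, $result['output']), PHP_EOL;
```

An exit status of 0 means wget reported success; anything else, together with the captured messages, should hint at why no files appear.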
 

Author Comment

by:jambla
ID: 35191066
When I try to go to /tmp/www.gnu.org/ I'm getting "550 Can't change directory to /tmp/www.gnu.org: No such file or directory"

Re: "What about when running it on the commandline?"

I'm running Windows 7.  Can I run a Linux command from Windows?
 
LVL 4

Accepted Solution

by:
roemelboemel earned 500 total points
ID: 35191295
If you have shell access via SSH, you could run it on the server. Furthermore, if you have a directory that is writable by the user the web server runs as (which doesn't have to be the same user you use to upload your PHP file), you could point wget there instead of /tmp.
You could make a directory on the web server, e.g.:
mkdir /home/<myhomedir>/foo


Then give the web server user write access to it or, for testing, just run:
chmod 777 /home/<myhomedir>/foo


And then point the wget output there:

<?php
exec('wget -P /home/<myhomedir>/foo/ -p -k www.gnu.org');
?> 
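
Without shell access, the mkdir/chmod steps could also be done from PHP. This is only a sketch: ensure_writable_dir is an invented name, and mode 0777 mirrors the "for testing" suggestion above rather than a recommended permanent setting.

```php
<?php
// Hypothetical helper: creates the directory if needed and reports whether PHP can write to it.
function ensure_writable_dir(string $dir): bool
{
    // The third argument makes mkdir() create missing parent directories too.
    if (!is_dir($dir) && !@mkdir($dir, 0777, true)) {
        return false;
    }
    // mkdir()'s mode is reduced by the umask, so set it explicitly (testing only).
    @chmod($dir, 0777);
    return is_writable($dir);
}
```

A directory created this way belongs to the user PHP runs as, so wget started through exec() from the same script should normally be able to write into it.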

 

Author Comment

by:jambla
ID: 35191372
Hi roemelboemel,

Thanks for the suggestion; however, I don't have shell access.  I have tried to mess around with wget for a few days now and I always end up getting nowhere, which is why I started looking more at cURL.

I will try to create a dir with 777 and try to send the files there.
 

Author Comment

by:jambla
ID: 35191477
Hi roemelboemel,

I have tried creating a folder, changing the permission to 777 and running the wget script but I'm still not getting the folder/files.
 

Author Comment

by:jambla
ID: 35191528
Hi roemelboemel,

I tried a few other things and I got it working!


Thanks so much for your help!
 

Author Closing Comment

by:jambla
ID: 35191546
Like always, the gurus at EE come through!