Solved

Using cURL to download an entire webpage (HTML, images, css, js etc...)

Posted on 2011-03-22
Last Modified: 2012-05-11
Hi,

I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want a script that will download one page of a website with all of its content, i.e. images, CSS, JS, etc.

I have been able to save the html (text) like this:

function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

$returned_content = get_data('http://example.com/page.htm');

$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle); // close the handle so the file is flushed to disk

This saves a file called 'file.htm' with all the HTML but no images, CSS, JS, etc.

I have also been able to do this:

$img[]='http://example.com/image.jpg';

foreach($img as $i){
	save_image($i);
	if(getimagesize(basename($i))){
		echo 'Image ' . basename($i) . ' Downloaded OK';
	}else{
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

function save_image($img,$fullpath='basename'){
	if($fullpath=='basename'){
		$fullpath = basename($img);
	}
	$ch = curl_init ($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
	$rawdata=curl_exec($ch);
	curl_close ($ch);
	if(file_exists($fullpath)){
		unlink($fullpath);
	}
	$fp = fopen($fullpath,'x');
	fwrite($fp, $rawdata);
	fclose($fp);
}

This will save that specific image but I haven't found anything that will save the entire HTML with all the content behind it.
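
One way to approach this with cURL alone is to parse the saved HTML for the URLs of its images, stylesheets, and scripts, then fetch each one with a function like save_image() above. The following is only a minimal sketch under stated assumptions: extract_asset_urls() and resolve_url() are hypothetical helper names, PHP's DOM extension is assumed to be available, only img src, script src, and link rel="stylesheet" href are considered, and the URL resolution is deliberately simplified (no "../" handling, nothing referenced from inside the CSS).

```php
<?php
// Hypothetical helper: turn a possibly relative URL into an absolute one.
// Simplified on purpose; it does not normalise "../" segments.
function resolve_url($base, $rel)
{
    if (preg_match('#^https?://#i', $rel)) {
        return $rel;                              // already absolute
    }
    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    if (substr($rel, 0, 2) === '//') {
        return $scheme . ':' . $rel;              // protocol-relative
    }
    $origin = $scheme . '://' . $host;
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    if (substr($rel, 0, 1) === '/') {
        return $origin . $rel;                    // root-relative
    }
    // Otherwise resolve against the directory of the base document.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $rel;
}

// Hypothetical helper: list the absolute URLs of images, scripts and
// stylesheets referenced by an HTML string.
function extract_asset_urls($html, $baseUrl)
{
    if (trim($html) === '') {
        return array();
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach (array('img', 'script') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $src = $node->getAttribute('src');
            if ($src !== '') {
                $urls[] = resolve_url($baseUrl, $src);
            }
        }
    }
    foreach ($doc->getElementsByTagName('link') as $node) {
        $href = $node->getAttribute('href');
        if (strtolower($node->getAttribute('rel')) === 'stylesheet' && $href !== '') {
            $urls[] = resolve_url($baseUrl, $href);
        }
    }
    return array_values(array_unique($urls));
}

// Usage sketch (needs network access, so left commented out):
// foreach (extract_asset_urls($returned_content, 'http://example.com/page.htm') as $u) {
//     save_image($u);
// }
```

The saved HTML would still point at the original URLs, so the references in it would also have to be rewritten to the local copies for the page to work offline.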


Thanks for your help in advance!
Question by:jambla
13 Comments

Expert Comment

by:roemelboemel
ID: 35190001
I would propose using a wget call from PHP.
Wget has the "-p" option to get a copy of the page including all images, CSS, etc.
If you would also like the links to the images and CSS rewritten to point to the downloaded content, you have to specify "-k".

Author Comment

by:jambla
ID: 35190060
Hello roemelboemel,

Thanks for your response.  In my hours of searching I have seen a lot of talk about wget.  However, I'm not sure how to use it.  I have verified that my server has enabled it.

Using wget and php how would I go about saving a webpage to a folder with all the contents of the webpage?  Could you show me an example of code that would do this for me?

Thanks again.

Expert Comment

by:roemelboemel
ID: 35190246
This snippet would save http://www.gnu.org in the directory /tmp/www.gnu.org/:
exec('wget -P /tmp/ -p -k www.gnu.org');
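
When a call like this appears to run but nothing is written, it can help to capture wget's own messages and exit status instead of discarding them. A minimal sketch, assuming exec() is permitted on the host; build_wget_command() is a hypothetical helper name, and the 2>&1 redirect assumes a Unix-like shell:

```php
<?php
// Hypothetical helper: build the wget command line with each argument
// shell-escaped, so a URL or path containing spaces or quotes cannot break it.
function build_wget_command($url, $targetDir)
{
    return 'wget -P ' . escapeshellarg($targetDir)
         . ' -p -k ' . escapeshellarg($url)
         . ' 2>&1';   // fold stderr into stdout so wget's messages are captured
}

$cmd = build_wget_command('www.gnu.org', '/tmp/');
echo $cmd, "\n";

// To actually run it (needs network access and exec() enabled):
// exec($cmd, $output, $status);
// echo "exit status: $status\n";      // wget returns 0 on success
// echo implode("\n", $output), "\n";  // wget's log, including any error text
```

A non-zero exit status together with the captured lines usually says more about what went wrong than an empty target directory does.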


Author Comment

by:jambla
ID: 35190507
Hi roemelboemel,

I uploaded this to my server:

<?php

exec('wget -P /tmp/ -p -k www.gnu.org');

?>

When I execute it I'm getting a "500 Internal Server Error".  I contacted my hosting provider and they confirmed that wget is enabled.

Any suggestions?

Expert Comment

by:roemelboemel
ID: 35190648
I've tested it on one of my servers and it's working. Please check the Apache error log for the specific error. Possible causes:

no permission to write to /tmp/
not allowed to call external commands using exec
SELinux preventing something
...
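
The first two causes can be checked from PHP itself before calling wget. A rough sketch, with check_wget_prerequisites() as a hypothetical helper name; it does not detect SELinux denials, which would only show up in the server's logs:

```php
<?php
// Hypothetical helper: return a list of likely reasons why
// exec('wget ...') would fail to write into $targetDir.
function check_wget_prerequisites($targetDir)
{
    $problems = array();

    // Hosts often block exec() through the disable_functions ini setting.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (!function_exists('exec') || in_array('exec', $disabled, true)) {
        $problems[] = 'exec() is disabled in this PHP configuration';
    }

    // is_writable() tests against the user PHP runs as, which for a web
    // request is the web server user, not necessarily the upload user.
    if (!is_dir($targetDir)) {
        $problems[] = $targetDir . ' does not exist';
    } elseif (!is_writable($targetDir)) {
        $problems[] = $targetDir . ' is not writable by the user PHP runs as';
    }

    return $problems;
}

foreach (check_wget_prerequisites('/tmp/') as $problem) {
    echo $problem, "\n";
}
```

An empty result only rules out these two causes; the error log remains the authoritative source.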

Author Comment

by:jambla
ID: 35190807
Ok,

I was able to get rid of the 500 Internal Error and it seems to be executing fine, however I looked in the /tmp/ folder but I don't see any files/folders there that match the downloaded site.

Expert Comment

by:roemelboemel
ID: 35190966
It should be in /tmp/www.gnu.org/ with all the images etc.
What about when running it on the command line?

Author Comment

by:jambla
ID: 35191066
When I try to go to /tmp/www.gnu.org/ I'm getting "550 Can't change directory to /tmp/www.gnu.org: No such file or directory".

"What about when running it on the commandline?"

I'm running Windows 7.  Can I run a Linux command from Windows?

Accepted Solution

by:roemelboemel (earned 500 total points)
ID: 35191295
If you have shell access via SSH, you could run it on the server. Furthermore, if you have a directory that is writable by the user the webserver runs as (this doesn't have to be the same user you use to upload your PHP file), you could point wget there instead of /tmp.
You could make a directory on the webserver, e.g.
mkdir /home/<myhomedir>/foo

Then give write access to the webserver user or, for testing, just
chmod 777 /home/<myhomedir>/foo

And then point the wget output there:

<?php
exec('wget -P /home/<myhomedir>/foo/ -p -k www.gnu.org');
?> 


Author Comment

by:jambla
ID: 35191372
Hi roemelboemel,

Thanks for the suggestion, however I don't have shell access.  I have tried to mess around with wget for a few days now and I always end up getting nowhere, which is why I started looking more at cURL.

I will try to create a dir with 777 and try to send the files there.

Author Comment

by:jambla
ID: 35191477
Hi roemelboemel,

I have tried creating a folder, changing the permission to 777 and running the wget script but I'm still not getting the folder/files.

Author Comment

by:jambla
ID: 35191528
Hi roemelboemel,

I tried a few other things and I got it working!


Thanks so much for your help!

Author Closing Comment

by:jambla
ID: 35191546
As always, the gurus at EE come through!