
Using cURL to download an entire webpage (HTML, images, css, js etc...)

Posted on 2011-03-22
Last Modified: 2012-05-11
Hi,

I have been searching here and on Google for the past few days, but I haven't been able to find an answer.

I want a script that downloads a single page of a website together with all of its content, i.e. images, CSS, JS, etc.

I have been able to save the html (text) like this:

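// Fetch a URL and return the response body, or false on failure.
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

$returned_content = get_data('http://example.com/page.htm');
if ($returned_content === false) {
	die('Download failed');
}

// Write the HTML to disk and close the handle so the data is flushed.
$my_file = 'file.htm';
$handle = fopen($my_file, 'w') or die('Cannot open file:  '.$my_file);
fwrite($handle, $returned_content);
fclose($handle);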


This saves a file called 'file.htm' containing all the HTML, but no images, CSS, JS, etc.

I have also been able to do this:

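$img[] = 'http://example.com/image.jpg';

foreach ($img as $i) {
	// save_image() returns false if the download or the write failed
	if (save_image($i) && getimagesize(basename($i))) {
		echo 'Image ' . basename($i) . ' Downloaded OK';
	} else {
		echo 'Image ' . basename($i) . ' Download Failed';
	}
}

// Download $img and write it to $fullpath (defaults to the file name in the current directory).
function save_image($img, $fullpath = 'basename') {
	if ($fullpath == 'basename') {
		$fullpath = basename($img);
	}
	$ch = curl_init($img);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	// CURLOPT_BINARYTRANSFER is omitted: it has had no effect since PHP 5.1.3
	$rawdata = curl_exec($ch);
	curl_close($ch);
	if ($rawdata === false) {
		return false;
	}
	// 'wb' truncates any existing file, so no unlink() is needed first
	$fp = fopen($fullpath, 'wb');
	if ($fp === false) {
		return false;
	}
	fwrite($fp, $rawdata);
	fclose($fp);
	return true;
}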


This saves that specific image, but I haven't found anything that will save the entire HTML page together with all the content behind it.


Thanks for your help in advance!
Question by:jambla
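
For a cURL-only route, one possible approach (a sketch, not something posted in this thread) is to parse the saved HTML for asset references and then fetch each one. The helper names extract_asset_urls() and resolve_url() are hypothetical. The sketch assumes PHP 7+ with the DOM extension, and it deliberately ignores <base> tags, "../" normalisation, and url() references inside stylesheets.

```php
<?php
// Hypothetical helper: list the image, script and stylesheet URLs a page references.
function extract_asset_urls(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];
    }
    // Real-world HTML rarely validates, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Tag name => attribute that holds the asset URL.
    $sources = ['img' => 'src', 'script' => 'src', 'link' => 'href'];
    $urls = [];
    foreach ($sources as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $ref = trim($node->getAttribute($attr));
            if ($ref === '' || stripos($ref, 'data:') === 0) {
                continue; // nothing to download
            }
            // Only follow <link> tags that point at stylesheets.
            if ($tag === 'link' && stripos($node->getAttribute('rel'), 'stylesheet') === false) {
                continue;
            }
            $urls[] = resolve_url($ref, $baseUrl);
        }
    }
    return array_values(array_unique($urls));
}

// Hypothetical helper: turn a reference found in the page into an absolute URL.
function resolve_url(string $ref, string $baseUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $ref)) {
        return $ref; // already absolute
    }
    $p = parse_url($baseUrl);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if (strpos($ref, '//') === 0) {
        return $p['scheme'] . ':' . $ref; // protocol-relative
    }
    if ($ref[0] === '/') {
        return $origin . $ref; // root-relative
    }
    // Relative to the directory of the page itself.
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $ref;
}
```

Each returned URL could then be passed to a download function such as the save_image() shown in the question, since that function just writes the raw bytes to disk. The links inside the saved HTML would still point at the original site unless they are rewritten separately.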
13 Comments
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190001
I would propose calling wget from PHP.
Wget has the "-p" option to get a copy of the page including all images, CSS, etc.
If you also want the links to the images and CSS rewritten to point to the downloaded content, you have to specify "-k".
 

Author Comment

by:jambla
ID: 35190060
Hello roemelboemel,

Thanks for your response.  In my hours of searching I have seen a lot of talk about wget.  However, I'm not sure how to use it.  I have verified that it is enabled on my server.

Using wget and php how would I go about saving a webpage to a folder with all the contents of the webpage?  Could you show me an example of code that would do this for me?

Thanks again.
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190246
This snippet would save http://www.gnu.org in the directory /tmp/www.gnu.org/
exec('wget -P /tmp/ -p -k www.gnu.org');

 

Author Comment

by:jambla
ID: 35190507
Hi roemelboemel,

I uploaded this to my server:

<?php

exec('wget -P /tmp/ -p -k www.gnu.org');

?>


When I execute it I get a "500 Internal Server Error".  I contacted my hosting provider and they confirmed that wget is enabled.

Any suggestions?
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190648
I've tested it on one of my servers and it works. Please check the Apache web server's error logs for the specific error. Possible causes:

no permission to write to /tmp/
PHP is not allowed to call external commands using exec
SELinux is preventing something
...
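
The first two causes can be checked from PHP itself. The following is only a sketch under assumptions: run_and_report() is a hypothetical helper name, and the script assumes PHP 7+. It captures wget's exit code and messages instead of discarding them, which usually says more than an empty directory does.

```php
<?php
// Hypothetical helper: run a shell command and return its exit code and combined output.
function run_and_report(string $command): array
{
    // exec() may be switched off through the disable_functions ini setting.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (!function_exists('exec') || in_array('exec', $disabled, true)) {
        return ['code' => -1, 'output' => ['exec() is disabled in this PHP configuration']];
    }
    $output = [];
    $code = 0;
    // 2>&1 folds stderr into stdout; wget writes its messages to stderr.
    exec($command . ' 2>&1', $output, $code);
    return ['code' => $code, 'output' => $output];
}

// Example use (commented out so that including this file has no side effects):
// echo is_writable('/tmp/') ? "/tmp/ is writable\n" : "/tmp/ is NOT writable\n";
// $result = run_and_report('wget -P /tmp/ -p -k www.gnu.org');
// echo 'exit code: ' . $result['code'] . "\n" . implode("\n", $result['output']);
```

A non-zero exit code together with the captured messages should point at the failing step. SELinux denials would not show up here and still have to be looked up in the server logs.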
 

Author Comment

by:jambla
ID: 35190807
Ok,

I was able to get rid of the 500 Internal Server Error and the script seems to be executing fine. However, I looked in the /tmp/ folder and I don't see any files or folders there that match the downloaded site.
 
LVL 4

Expert Comment

by:roemelboemel
ID: 35190966
The files should be in /tmp/www.gnu.org/ with all the images etc.
What about when running it on the command line?
 

Author Comment

by:jambla
ID: 35191066
When I try to go to /tmp/www.gnu.org/ I get "550 Can't change directory to /tmp/www.gnu.org: No such file or directory".

"What about when running it on the commandline?"

I'm running Windows 7.  Can I run a Linux command from Windows?
 
LVL 4

Accepted Solution

by:
roemelboemel earned 2000 total points
ID: 35191295
If you have shell access via SSH, you could run it on the server. Furthermore, if there is a directory that the user the web server runs as can write to (this doesn't have to be the same user you use to upload your PHP file), you could point wget there instead of /tmp.
You could make a directory on the web server, e.g.
mkdir /home/<myhomedir>/foo

Then give the web server user write access to it or, for testing, just run
chmod 777 /home/<myhomedir>/foo

And then point the wget output there

<?php
exec('wget -P /home/<myhomedir>/foo/ -p -k www.gnu.org');
?> 
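
Without shell access, the directory can also be created from PHP, in which case it is owned by the web server user and no chmod 777 should be needed. This is a sketch under assumptions: ensure_writable_dir() and build_wget_command() are hypothetical helper names, PHP 7+ is assumed, and the path passed in must be somewhere the web server user is allowed to create directories.

```php
<?php
// Hypothetical helper: make sure a download directory exists and is writable by the PHP process.
function ensure_writable_dir(string $dir): bool
{
    // Create the directory (and any missing parents) if it is not there yet.
    if (!is_dir($dir) && !@mkdir($dir, 0755, true) && !is_dir($dir)) {
        return false;
    }
    return is_writable($dir);
}

// Hypothetical helper: build the wget call from the thread with both arguments shell-escaped.
function build_wget_command(string $url, string $dir): string
{
    return 'wget -P ' . escapeshellarg($dir) . ' -p -k ' . escapeshellarg($url);
}

// Example use (commented out so that including this file has no side effects):
// $dir = __DIR__ . '/foo';
// if (ensure_writable_dir($dir)) {
//     exec(build_wget_command('www.gnu.org', $dir));
// }
```

Escaping the arguments matters mainly when the URL or the directory comes from user input. For a hard-coded call like the one above it changes nothing.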

 

Author Comment

by:jambla
ID: 35191372
Hi roemelboemel,

Thanks for the suggestion; however, I don't have shell access.  I have been messing around with wget for a few days now and I always end up getting nowhere, which is why I started looking more at cURL.

I will try to create a dir with 777 and try to send the files there.
 

Author Comment

by:jambla
ID: 35191477
Hi roemelboemel,

I have tried creating a folder, changing its permissions to 777 and running the wget script, but I'm still not getting the folder or files.
 

Author Comment

by:jambla
ID: 35191528
Hi roemelboemel,

I tried a few other things and I got it working!


Thanks so much for your help!
 

Author Closing Comment

by:jambla
ID: 35191546
As always, the gurus at EE come through!
