Can someone help me with a script which will simply get each page of a website via HTTP, by following (internal) links and give a report on the total size of each page?

The report would need the HTML file size, the image file size (including background images) and the total for each page.

Modules OK.
I would probably use wget and do a recursive fetch of the whole website. Then get the size of all the data you downloaded.

But I guess you don't want to have to download the whole website. You will still have to download all the web pages in order to get the links to other pages and images, but not all the GIFs and other files.
For those you can use the HEAD method to get the size of the files (the server reports it in the Content-Length header, when it sends one).

Have a look at HTML::LinkExtor. It's a subclass of HTML::Parser, and its documentation has a simple example of how to get the link details of each tag out of the HTML.


Maybe I didn't mention I was being lazy.

Give me some code to do the basics (let's say, create a hash with the page URL as the key and the size as the value) and I'll handle all the rest.

More points available if you think I'm being stingy.
Well, if you want to be really lazy, you'd wget the whole thing. It would just cost a bit of bandwidth.

But your server will need wget installed, not to mention enough free disk space.

Something like this:
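use strict;
use warnings;

my $website = '';    # host name only, without the http:// prefix

# wget -r stores all the files in a directory named after the host, under the current directory
system('wget', '-r', "http://$website/") == 0
    or warn "wget exited with status $?\n";
my $size = `du -sh $website`;
print "Size is $size\n";
#system('rm', '-rf', $website);   # uncomment this to delete all the files...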
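
If downloading everything is too heavy, the HEAD / HTML::LinkExtor idea from earlier in the thread could be sketched roughly like this. It is only a starting point, not a finished crawler. It assumes LWP::UserAgent, HTML::LinkExtor and URI are installed; the sub names (extract_links, head_size, crawl), the 50-page limit and the shape of the result hash are made up for illustration. Background images set through CSS are not picked up, only background= attributes, and an image whose server sends no Content-Length is counted as 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Split the links found in $html into same-host pages and images (absolute URLs).
sub extract_links {
    my ($html, $base) = @_;
    my $host = URI->new($base)->host;
    my (@pages, @images);
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # <img src="..."> plus background="..." on body, table, td and so on
        my $img = $tag eq 'img' ? $attr{src} : $attr{background};
        push @images, URI->new_abs($img, $base)->as_string if defined $img;
        if ($tag eq 'a' && defined $attr{href}) {
            my $uri = URI->new_abs($attr{href}, $base);
            # skip mailto:, javascript: and links to other hosts
            return unless ($uri->scheme // '') =~ /^https?\z/ && $uri->host eq $host;
            $uri->fragment(undef);    # page.html#top is the same page as page.html
            push @pages, $uri->as_string;
        }
    });
    $parser->parse($html);
    $parser->eof;
    return (\@pages, \@images);
}

# Size of a file according to a HEAD request; 0 if the request fails
# or the server does not send Content-Length.
sub head_size {
    my ($ua, $url) = @_;
    my $res = $ua->head($url);
    return $res->is_success ? ($res->content_length // 0) : 0;
}

# Follow internal links from $start and return { url => { html, images, total } }.
sub crawl {
    my ($start, $max_pages) = @_;
    $max_pages //= 50;    # arbitrary safety limit, adjust as needed
    my $ua = LWP::UserAgent->new(timeout => 15);
    my (%size, %seen);
    my @queue = ($start);
    while (defined(my $url = shift @queue)) {
        last if keys(%size) >= $max_pages;
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success && $res->content_type eq 'text/html';
        my ($pages, $images) =
            extract_links($res->decoded_content // '', $res->base->as_string);
        my $html_size = length $res->content;
        my ($img_size, %img_seen) = (0);
        $img_size += head_size($ua, $_) for grep { !$img_seen{$_}++ } @$images;
        $size{$url} = {
            html   => $html_size,
            images => $img_size,
            total  => $html_size + $img_size,
        };
        push @queue, @$pages;
    }
    return \%size;
}

# Only crawl when a start URL is given on the command line.
if (@ARGV) {
    my $sizes = crawl($ARGV[0]);
    printf "%-60s %8d %8d %8d\n", $_, @{ $sizes->{$_} }{qw(html images total)}
        for sort keys %$sizes;
}
```

Run it with the start URL as the only argument. Each output line shows the page URL followed by the HTML, image and total sizes in bytes. An image that appears on several pages is counted once per page, which is probably what you want for a per-page report.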

