Using wget safely

I have a directory; let's call it:

http://www.mysite.com/files

I would like to use wget to back up everything in this directory, including subdirectories. But I don't want it to grab the files that sit in the directory above it, e.g. http://www.mysite.com/index.html

I don't want to tax the server while performing this backup, so I think the wait or rate-limit option should be used.

I would tell you everything I've tried so far but that would just confuse the issue.
Asked by hrolsons
 
mccracky commented (accepted solution):
Oops, I forgot to add the -np option above (and since it's a backup, I might not convert the links):

wget -c -np -r -N -l inf -w 5 --limit-rate=<rate you want> http://www.mysite.com/files
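
For example, filled in with a 20 KB/s cap (the rate and the 5-second wait are just illustrative values; pick whatever your server can comfortably handle):

wget -c -np -r -N -l inf -w 5 --limit-rate=20k http://www.mysite.com/files/

Note the trailing slash: without it, wget treats "files" as a file rather than a directory, and -np then has nothing to anchor to.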
 
omarfarid commented:
If you use wget with the recursive option, it will download all files and subdirectories as well. If you specify the URL as http://www.mysite.com/files, it will not download http://www.mysite.com/index.html

Looking at the wget man page (http://linux.die.net/man/1/wget), the options below are useful:

--limit-rate=amount
    Limit the download speed to amount bytes per second. Amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. For example, --limit-rate=20k will limit the retrieval rate to 20KB/s. This is useful when, for whatever reason, you don't want Wget to consume the entire available bandwidth.

    This option allows the use of decimal numbers, usually in conjunction with power suffixes; for example, --limit-rate=2.5k is a legal value.

    Note that Wget implements the limiting by sleeping the appropriate amount of time after a network read that took less time than specified by the rate. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be achieved, so don't be surprised if limiting the rate doesn't work well with very small files.
-w seconds
--wait=seconds
    Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the "m" suffix, in hours using "h" suffix, or in days using "d" suffix.
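
Put together, a polite recursive fetch of just that directory might look like this (the 5-second wait and 20 KB/s cap are only examples):

wget -r -np -w 5 --limit-rate=20k http://www.mysite.com/files/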
 
hrolsons (author) commented:
Cool, that is looking good. How would it treat files that it already got on a previous backup?
 
hrolsons (author) commented:
Darn it, it's still getting too many files. Let me change my original example; what I want is:

http://www.mysite.com/files/set1

and under "files", I have set1, set2, set3 ...

It's not just grabbing the set1 files; it also grabs set2, set3, etc.

The command I issue is:

wget --limit-rate=20K -r http://www.mysite.com/files/set1


 
mccracky commented:
You need to look into the -np option too (no parent directories so you don't go up the tree, only down).

Do you want the links to be rewritten to work locally or not?  (The -k option)

Do you want to keep the domain and all the directories?  (The -nH and --cut-dirs options)

Personally, I'd probably use something like:

wget -c -k -r -N -l inf -w 5 --limit-rate=<rate you want> http://www.mysite.com/files
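
If you also want to drop the hostname and the leading "files" directory from the local paths, a variant along these lines should work (the --cut-dirs count assumes the layout described above, and the rate is a placeholder):

wget -c -k -np -r -N -l inf -nH --cut-dirs=1 -w 5 --limit-rate=20k http://www.mysite.com/files/set1/

With -nH and --cut-dirs=1, the files end up under a local set1/ directory instead of www.mysite.com/files/set1/.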
 
omarfarid commented:
See if the below will help:

-m
--mirror
    Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
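
In other words, something like the following should be roughly equivalent to the earlier commands (again, the wait and rate values are just examples):

wget -m -np -w 5 --limit-rate=20k http://www.mysite.com/files/set1/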
 
omarfarid commented:
Is rsync an option for you?

http://linux.die.net/man/1/rsync
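
Note that rsync needs shell (SSH) or rsync-daemon access to the server rather than plain HTTP. Assuming you have that, a sketch might be (the login, remote path, and bandwidth limit are placeholders):

rsync -av --bwlimit=20 user@mysite.com:/var/www/files/set1/ ./set1-backup/

Here --bwlimit is in KB/s, and -a preserves timestamps, so repeated runs only transfer files that have changed.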