Using wget safely

Posted on 2011-02-21
Medium Priority
Last Modified: 2012-05-11
I have a directory, let's call it:


I would like to use wget to back up everything in this directory, including subdirectories, but I don't want it to grab files that sit in the directory above it, e.g. http://www.mysite.com/index.html

I don't want to tax the server while performing this backup, so I think the wait or rate-limit option should be used.

I would tell you everything I've tried so far but that would just confuse the issue.
Question by: hrolsons
LVL 40

Expert Comment

ID: 34948761
If you use wget with the recursive option, it will fetch all files and subdirectories as well. If you specify the URL as http://www.mysite.com/files, it will not download http://www.mysite.com/index.html
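For example, a minimal sketch (the URL is the questioner's hypothetical site; the trailing slash matters, since wget otherwise treats the last path component as a file):

```shell
# Recursively fetch everything linked under /files/ (hypothetical URL).
# -r turns on recursive retrieval; wget follows the links it finds
# in the directory index pages.
wget -r http://www.mysite.com/files/
```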

Looking at the wget man page (http://linux.die.net/man/1/wget), the options below are useful:

--limit-rate=amount
    Limit the download speed to amount bytes per second. Amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. For example, --limit-rate=20k will limit the retrieval rate to 20KB/s. This is useful when, for whatever reason, you don't want Wget to consume the entire available bandwidth.

    This option allows the use of decimal numbers, usually in conjunction with power suffixes; for example, --limit-rate=2.5k is a legal value.

    Note that Wget implements the limiting by sleeping the appropriate amount of time after a network read that took less time than specified by the rate. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be achieved, so don't be surprised if limiting the rate doesn't work well with very small files.
-w seconds
    Wait the specified number of seconds between the retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Instead of in seconds, the time can be specified in minutes using the "m" suffix, in hours using "h" suffix, or in days using "d" suffix.
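Combining the two man-page options above, a hedged sketch (the URL and both numbers are illustrative, not from the question):

```shell
# Throttled recursive backup:
# --limit-rate=20k caps the transfer speed at roughly 20 KB/s
# -w 2             sleeps 2 seconds between successive requests
wget -r -w 2 --limit-rate=20k http://www.mysite.com/files/
```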

Author Comment

ID: 34948798
Cool, that is looking good. How would it treat files that it already fetched in a previous backup?

Author Comment

ID: 34948871
Darn it, it's still getting too many files. Let me change my original example to what I want:


and under "files", I have set1, set2, set3 ...

It's not just grabbing the set1 files; it also grabs set2, set3, etc.

The command I issue is:

wget --limit-rate=20K -r http://www.mysite.com/files/set1


LVL 12

Expert Comment

ID: 34948885
You need to look into the -np option too (no parent directories so you don't go up the tree, only down).

Do you want the links to be rewritten to work locally or not?  (The -k option)

Do you want to keep the domain and all the directories?  (The -nH and --cut-dirs options)
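To illustrate how those path-trimming options interact, a sketch (--cut-dirs=1 assumes the single /files path component from the question):

```shell
# By default wget saves into www.mysite.com/files/set1/...
# -nH          drops the www.mysite.com/ host directory
# --cut-dirs=1 also drops the leading "files/" component, so
#              set1, set2, ... land directly in the current directory
wget -r -np -nH --cut-dirs=1 http://www.mysite.com/files/
```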

Personally, I'd probably use something like:

wget -c -k -r -N -l inf -w 5 --limit-rate=<rate you want> http://www.mysite.com/files
LVL 12

Accepted Solution

mccracky earned 2000 total points
ID: 34948892
Oops, I forgot to add the -np option above (and since it's a backup, I might not convert the links):

wget -c -np -r -N -l inf -w 5 --limit-rate=<rate you want> http://www.mysite.com/files
LVL 40

Expert Comment

ID: 34948923
See if the option below helps:

-m
--mirror
    Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
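In other words, the command from the accepted solution could be spelled with -m, since -m bundles -r -N -l inf (a sketch; the rate value is illustrative):

```shell
# Roughly equivalent to the accepted answer: -m replaces -r -N -l inf
# (and additionally keeps FTP directory listings).
wget -m -c -np -w 5 --limit-rate=20k http://www.mysite.com/files/
```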
LVL 40

Expert Comment

ID: 34948940
Is rsync an option for you?
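If you have shell access to the server, a hypothetical rsync sketch (the user, host, and remote path are made up; --bwlimit throttles in KB/s, much like wget's --limit-rate):

```shell
# Mirror the remote directory over SSH, throttled to ~20 KB/s.
# -a archive mode (recursion, permissions, timestamps)
# -z compress data in transit
# On re-runs, only files that changed since the last backup transfer.
rsync -az --bwlimit=20 user@mysite.com:/var/www/files/set1/ ./set1-backup/
```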


Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Fine Tune your automatic Updates for Ubuntu / Debian
Welcome back to our beginners guide of the popular Unix tool, cron. If you missed part one where we introduced this tool, the link is below. We left off learning how to build a simple script to schedule automatic back ups. Now, we’ll learn how to se…
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

597 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question