Alternatives to du

Hello Experts!

I've got a real quick question: I have several directories on my CentOS 6.4 server that have lots and lots of files in them. When I try to find out the folder size, du hangs or takes forever to provide an answer.

Are there any alternatives that I could use to find out quickly the size of these folders (with their subfolders)?

Thanks!
Asked by OmniUnlimited

savone commented:
Oftentimes there are many ways to accomplish the same thing in Linux. I am afraid this is NOT one of those times.

The problem is, any program would have to stat all the files to find out what their sizes are and then total them. That is what is causing the delay, and unfortunately I do not think there is any other way.

Short answer "No."
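
For illustration, roughly what any such tool has to do under the hood (a sketch assuming GNU find; the path is a placeholder):

# Stat every file and sum the sizes - the cost scales with the number of files, not the bytes.
# (This sums apparent sizes; du counts allocated blocks, but the I/O work is the same.)
find /path/to/folder -type f -printf '%s\n' | awk '{ t += $1 } END { printf "%.2f GiB\n", t/1024/1024/1024 }'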

arnold commented:
Are the directories that pose this issue NFS mounts?

As was pointed out, you could break up the way you catalog the data, i.e. recursively/iteratively drill down to the directories and then navigate back up, totaling up the subdirectories.
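
A sketch of that per-subdirectory approach (the path is a placeholder):

# One summary line per top-level subdirectory plus a grand total (-c), sizes in KiB,
# sorted by size (the grand total from -c ends up on the last line):
du -csk /path/to/folder/*/ | sort -n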

Gerwin Jansen (EE MVE, Topic Advisor) commented:
Are you using du without any options? If you do, du will show you the progress of what it is doing: you will see a scrolling list of files and folders.

If you do "du -sk", then du will do all its processing without printing anything and only show you the total once it is finished. You could perceive this as 'hanging'.

You could try adding the -x flag so it skips directories on other file systems.

What command are you giving exactly?
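
For reference, the variants being discussed (the path is a placeholder):

# Per-directory breakdown, sizes in KiB:
du -k /path/to/folder
# Silent until finished, then a single summary line:
du -sk /path/to/folder
# Stay on one file system (skip other mount points):
du -skx /path/to/folder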

OmniUnlimited (Author) commented:
Experts, thank you so much for your assistance.

@savone: thank you for your honesty, although it is not the answer I am looking for (i.e. it does not provide a solution to my dilemma).

@arnold: Hi arnold! Man, you are fast becoming my "go to" man for Linux questions. To answer your question, no, these are not NFS mounts. Drilling down is not really an option: the folders I am trying to get the sizes of are image folders, divided by category, and each category contains thousands of images with several variations of each image. So I could do a "du" on just one of these folders, but the end result is still the same: du hangs. (You could ask, "How do you know du hangs?" Well, I let the thing run for over 48 hours on one folder and it still did not give me a response.)

@gerwinjansen:  yes, I am using du without any options.  I do not want to use the -sk option as I need a breakdown of each folder, as well as the total.  I do not want to use the -x option because every file is important to obtain the total size.

Gerwin Jansen (EE MVE, Topic Advisor) commented:
How about this approach:

1 - use df to find the used space on the device you're interested in
2 - use du -sk for each folder that you're not interested in
3 - grab a calculator and subtract the results from (2) from the used space from (1) (see the sketch below)

:)
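
A sketch of that arithmetic (paths are placeholders, and it assumes the df output fits on one line):

# Used space (KiB) on the filesystem that holds the folder:
df -k /path/to/folder | awk 'NR==2 { print $3 }'
# Total (KiB) of the folders you are NOT interested in, to subtract from the number above:
du -sk /path/to/uninteresting1 /path/to/uninteresting2 | awk '{ t += $1 } END { print t }'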

OmniUnlimited (Author) commented:
Thanks gerwinjansen, but the folders I'm not interested in literally number in the thousands. It would be quicker for me to hand count the files one by one. :)

savone commented:
Another option would be to run du with "nice" in order to lower its CPU scheduling priority. Maybe that will stop it from "hanging".

http://linux.die.net/man/1/nice
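
For example (the path is a placeholder; the ionice line assumes the cfq I/O scheduler is in use):

# Run du at the lowest CPU priority:
nice -n 19 du -sk /path/to/folder
# If disk I/O is the real bottleneck, the idle I/O class may matter more than CPU priority:
ionice -c3 nice -n 19 du -sk /path/to/folder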

arnold commented:
One option is to have a process that catalogues the directory structure and maintains its space usage, while at the same time recording the catalogued items with their creation date, modification date and size.
I.e., for each directory there would be a dbm-type flat DB file.
The simpler approach, if you have the space and the option, is to create a completely new mount point such as /filesofinterest and move the data to this location; df -k, as was suggested, will then report the amount of space consumed by the files.
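
A quick sketch of that second approach (the mount point name is taken from the comment above):

# With the images on their own partition, the filesystem already tracks the usage:
df -k /filesofinterest    # used/available in KiB, returned instantly
df -h /filesofinterest    # same, human readable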

OmniUnlimited (Author) commented:
@savone: attempted use of nice, not seeing a noticeable change in performance.  Will keep you informed.

@arnold:

One option is to have a process that catalogues the directory structure and maintains its space usage, while at the same time recording the catalogued items with their creation date, modification date and size.

Intriguing.  Exactly how would I go about doing that?  Is this something I would program, like through PHP?

savone commented:
@arnold, that solution would involve a process that does the same thing as du.

arnold commented:
Savone, yes, but it is a process that can run on a schedule, only cataloguing what has been added and removing entries for anything that has been deleted.

PHP or Perl; you can use MySQL or PostgreSQL, etc., where the file reference and its size are stored.
Then a query summing the sizes of the files will get you that data, based on the last time the cataloguer ran.
Creating a dedicated partition where the data is stored is simpler and more straightforward, since the OS/filesystem maintains usage data itself.
In such a setup, depending on what process adds a file, modifying the process to maintain the file reference and size might eliminate the cataloging need.

Presumably you have a reference/library through which you can easily locate the file.
Adding file size to that data might be .....
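
One minimal way to sketch such a catalogue in shell (a flat file rather than the database described above; paths and file names are placeholders):

# Refresh the catalogue from cron during off-hours; the rebuild still stats every file,
# but it runs on a schedule instead of at the moment you need the number.
find /path/to/images -type f -printf '%p\t%s\t%T@\n' > /var/tmp/image_catalog.tsv
# Instant totals later come from the catalogue, not from the disk:
awk -F'\t' '{ t += $2 } END { printf "%.2f GiB\n", t/1024/1024/1024 }' /var/tmp/image_catalog.tsv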

savone commented:
@arnold, good luck.

arnold commented:
@savone,
Is this a sarcastic comment?

The user would need to transition to a setup where the information is easily accessible.

junipllc commented:
You might be able to speed up the process using rsync, although it would be a manual one. I'm not talking about actually using rsync to copy the files, just to enumerate them. In my smaller environments (I don't have one with that many files) it does better than du.

rsync -anh --stats /path/to/folder/to/count /any/path

The -a is for archive and the -n is for dry-run (I believe; you might need to look it up). The -h is for human-readable sizes, so if you need the raw numbers without MB/GB/etc. then take that out.

The idea here is to get rsync to enumerate all of the files both by count and by space taken, and then do nothing but display the --stats for it. The destination path must exist, but it won't be written to. I use /tmp typically but you can create one if you wish.

Here's an example of what I did in order to run this against the /home folder of one of my servers:

root# rsync -anh --stats /home /tmp

Number of files: 291643
Number of files transferred: 222638
Total file size: 15.06G bytes
Total transferred file size: 15.06G bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 7.29M
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 8.24M
Total bytes received: 943.81K

sent 8.24M bytes  received 943.81K bytes  1.08M bytes/sec
total size is 15.06G  speedup is 1640.52 (DRY RUN)

The process took less than 3 seconds to check over a quarter million files.

Cheers,

Mike

slubek commented:
1. Have you tried executing a simple 'find' command and watching in which directory it stops? (See the sketch just below.)
2. du should not stop without cause. Have you checked your logs for disk read errors?
3. If you can mount these directories on a machine with Gnome, you can use baobab (from gnome-utils) to calculate disk space usage and drill down into directories.
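
A sketch of that first suggestion (the path is a placeholder):

# Print every path as it is found and watch where the output pauses;
# the last directory shown before the stall is the problem area.
find /path/to/folder
# To keep a record while watching (piping adds a little output buffering):
find /path/to/folder | tee /var/tmp/find_progress.txt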

OmniUnlimited (Author) commented:
@junipllc: I really had some high hopes for your rsync solution, but it's coming up on 24 hours since I ran that command and it is still processing.  I think the difference could also be in the fact that you used that command to check 15 GB worth of files.  I think (I'm only guessing because I can't get anything to tell me the size) that my files are closer to half a TB.

@slubek: find also gets "stuck" (actually just takes an outrageous amount of time).  I did check the logs, there are no errors.  We don't have a Gnome machine available.

@arnold: I'm sorry, but I haven't had the time to create a disk usage application yet.  I'll keep you posted.

slubek commented:
Have you noticed in which directory find gets stuck? It has a really fast algorithm, so I don't think it is possible to count files much faster. IMHO, you have to find which directory causes the problem and exclude it from du -hs.
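
For example, with GNU du (the directory name is a placeholder):

# Size the tree without the problem directory (--exclude takes a shell-glob pattern):
du -hs --exclude='problem_category' /path/to/folder
# Then handle the excluded directory separately, or estimate it another way:
du -hs /path/to/folder/problem_category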

OmniUnlimited (Author) commented:
@slubek: find doesn't provide feedback until it finds something.  How can I find out in which directory it gets stuck?

Gerwin Jansen (EE MVE, Topic Advisor) commented:
300,000 files (around 15 GB) in 3 seconds would mean that your 500 GB of files should take roughly 100 seconds using the rsync method. Since it has been running for 24 hours already, either something is wrong with your disk or you have mounted file systems in the tree. I'm afraid none of the solutions proposed above will help you any further.

OmniUnlimited (Author) commented:
@gerwinjansen: you are right.  The system does have mounted file systems.  We have two multi-partitioned slave drives hooked up.  Do these affect the function and performance of du, as well as rsync and the other commands?

Gerwin Jansen (EE MVE, Topic Advisor) commented:
Yes. Mounted file systems are remote and are accessed over the network. Accessing files over the network is a lot slower than local access. That's why you have to wait a long time.

OmniUnlimited (Author) commented:
We recently did a copy of some files over to one of the slave drives.  The transfer rate was approximately 800 MB per hour.  Can I expect this to be about the speed of du, rsync and find over the same system?

slubek commented:
I mean 'find' without any parameters. On my system it shows the full path to each file. All I have to do is watch its output and see which directory was found just before it got stuck.
Anyway, network mounts are a good track - I'm going back to lurking in this thread.

arnold commented:
Are these USB 1 external IDE drives?
The only way to speed up future inquiries is to take the hit and catalog all the files, as well as make sure the process that adds/modifies files uses the cataloging option, where the data is in a database with a file location reference, create date, modify date, and size of the file.

OmniUnlimited (Author) commented:
@slubek: running find with no parameters stops in the same directory du does.  If I'm not mistaken, this directory is somewhere around 300 GB of small (<2MB) files.  When find hits that directory, all output stops.  It doesn't even list the first small file.

@arnold:

Are these USB 1 external IDE drives?

No, they are SCSI drives.

The only way to speed up future inquiries...

arnold, these files get transferred in from dozens of sources. How can I ensure every process uses the cataloging option? I think I would be better off setting up some sort of monitoring system that detects when that directory has changed and rescans it (running into the same problem I am having now).

arnold commented:
The way to achieve cataloging in an environment where the source of the data is dispersed is to use a staging area into which all new files are placed. The moving process then catalogs and relocates the files based on preset criteria.

A SCSI drive at 800 MB per hour is slow. The issue might be compounded if all the external sources are trying to write new files at the same time.
One way to disperse the load is to set up a software RAID that spans multiple drives, distributing the writes.

Depending on the average size of the files, properly tuning the filesystem might help a bit.
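
A sketch of the software RAID idea with mdadm (device names are placeholders, and creating the array destroys whatever is on those partitions - this is only an illustration, not a recommendation for live data):

# Stripe two spare partitions together (RAID 0), put a filesystem on the result and mount it:
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
mkfs.ext4 /dev/md0
mount /dev/md0 /filesofinterest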

Gerwin Jansen (EE MVE, Topic Advisor) commented:
>> 800 MB per hour

This is really slow... - this is about USB1 'speed' I believe.

skullnobrains commented:
no simple command will do much better than du as the problem is IO-bound

you can implement the catalog system using inotify on the directory. just hook file writes and update the catalog after each write. this can be done in a shell script or php daemon pretty easily. if you don't want to keep info for each file, you can hook file open and close and compute the difference in sizes. note that it will lose sync if you write in the directory while the monitoring process is not running
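
a rough sketch of that hook with inotifywait (from inotify-tools, if installed; paths are placeholders, and recursive watches on a big tree may need fs.inotify.max_user_watches raised):

# log one line per finished write so a catalog can pick up new files without rescanning the tree
inotifywait -m -r -e close_write --format '%w%f' /path/to/images |
while read -r path; do
    printf '%s\t%s\n' "$path" "$(stat -c %s "$path")" >> /var/tmp/new_files.tsv
done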

you can achieve a similar goal by using fuse and implementing the same algo in the fuse fs. at least that way you can make sure it stays synced

but the easier way would be to use a decent filesystem. i assume you are using ext3, or worse, ext2. ext4 would do better; ufs, zfs, xfs (and i guess reiserfs as well) could all handle such situations MUCH better than any ext filesystem.

splitting that dir into smaller subdirs should also help on ext. a rule of thumb would be: don't go over 500 files per directory on ext3 or 2,000-5,000 on ext4, and make sure you have enough ram for the directory cache to work properly.

another simple and efficient solution: use a dedicated partition for this directory. then you can simply df the partition, which will be instantaneous. basically let the filesystem do the work for you

OmniUnlimited (Author) commented:
Experts, I am so sorry it took so long to get back to you. I am afraid that nothing I tried produced a significantly different result than du. I would have liked to award points to everyone, but I figured I needed to keep this to the strict sense of awarding points only for solutions. Since the title of the question is "Alternatives to du", I awarded points to those who came up with solutions that did not involve du.

Again, thanks to all who participated.

skullnobrains commented:
try the dedicated filesystem, or possibly quotas if your system handles them. this will give excellent performance for little work. using alternate commands will not produce any measurable improvement and building a catalog seems overkill.
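
a sketch of the quota idea using group quotas (assumes the quota tools are installed; the group name, paths and mount point are placeholders):

# give the image tree its own group so the kernel's quota accounting tracks it
groupadd imgstore
chgrp -R imgstore /path/to/images
find /path/to/images -type d -exec chmod g+s {} +    # new files inherit the group
# add ,grpquota to that filesystem's options in /etc/fstab, then:
mount -o remount /path/to/mountpoint
quotacheck -gvm /path/to/mountpoint
quotaon /path/to/mountpoint
repquota -g /path/to/mountpoint    # per-group block usage, reported instantly, no tree walk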

best regards