zizi21

asked on

subdirectories in unix/linux system

Hi there,

I have a program that is taking a very long time to access a directory because there are too many files in it. If I make subdirectories in that directory, wouldn't it be the same thing?

tq,
zizi
SOLUTION
enachemc
Directories use a tree structure, and the access time is proportional to the number of direct child nodes.
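For illustration, here is a minimal C sketch that counts a directory's entries with opendir()/readdir() and times the scan; the path argument is just a placeholder. Running it against directories with different numbers of files gives a feel for how the scan time grows with the entry count.

/* scan.c -- count entries in one directory and report how long it took.
 * Compile: gcc scan.c -o scan   (add -lrt on very old glibc) */
#include <dirent.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";   /* placeholder default */
    struct timespec t0, t1;
    struct dirent *de;
    long count = 0;
    DIR *dir;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(dir)) != NULL)   /* walk every entry once */
        count++;
    closedir(dir);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld entries in %s, scanned in %.3f s\n", count, path, secs);
    return 0;
}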
SOLUTION
Narendra Kumar S S
Depends on your type of filesystem and operating system. (Your topic and zones conflict -- is it BSD or Linux?) In general, older OSes and filesystems had more of a problem with big directories, and creating subdirectories will speed things up. That is why the terminfo directory is structured the way it is, for example.
zizi21

ASKER

It is Linux. I have stopped the program and I am using the Linux ls now, and it is still the same problem: when I do an ls, it takes hours and hours. Someone suggested using subdirectories, but I don't see how a subdirectory is going to help.

For instance,

main directory --> sub1
               --> sub2

I mean, when I do an ls, I am still going to do it in the main directory.
> I mean, when I do an ls, I am still going to do it in the main directory.
So, is ls taking a long time to complete in any directory, or only in directories with a lot of files?

I suspect that your system RAM is being used up by some process that is not releasing it; maybe it has a memory leak. Because of that, the system has become slow.

Can you run the "top" command and observe the values under RSS?
RSS means Resident Set Size. For any process, this should not keep growing.
If it is continuously growing, that process is eating up memory and not releasing it.
So, check whether any process has an unusually high RSS value compared to the other processes.
Is ls in the main directory still taking hours when all the main directory contains is the two subdirectories?
zizi21

ASKER

If there are only subdirectories, it is fast. But I read that the more subdirectories you have, the more time it would take: http://serverfault.com/questions/46384/how-to-solve-linux-subdirectories-number-limit

This means that if I have 1000 subdirectories in the main directory, access would be slow. Please correct me if I am wrong.
zizi21

ASKER

I meant, if there are only 2 subdirectories, it is fast... but...
What do you want to do with 1000 subdirectories? Are you performing a task that takes seconds per directory?
Why do you need so many subdirectories?
zizi21

ASKER

My programs need to create many files; the count could go into the millions. And then I need to access those million files.
zizi21

ASKER

However, if those million files are in one directory, it takes days and days. Someone said to put them in multiple subdirectories so that access to those files will be faster.
The part of the program that does a directory scan may be faster, but it sounds like that may be only a small part of the overall task.
If it takes 100 milliseconds to process one file, then a million files will take over a day to process (100 ms x 1,000,000 = 100,000 seconds, roughly 28 hours).
So, even ls will take a very long time to display all the files. A million names rolling over your screen will take time.

So, if you want a speedup, you will have to think about reducing the number of files itself. One million files in one directory is too much processing for any program that has to process all of them. So, see if you can merge these files and end up with only a few.
zizi21

ASKER

I am confused. I am not scanning the directory; I am opening files using C's fopen and reading with fread and writing with fwrite.
zizi21

ASKER

Sorry, I am confused.
> I am not scanning the directory; I am opening files using C's fopen and reading with fread and writing with fwrite.
You are doing this for all one million files, right?
SOLUTION
zizi21

ASKER

Thanks for the explanation. What do you mean by SSD? Do you mean solid-state disk? (I found this when I googled SSD.)
Hello,

There is a limit of 31998 subdirectories per directory on ext3, stemming from its limit of 32000 links per inode.
ozo is right. Merge 1000 files into one; then you have 1000 files, which is few enough. Or use a database and write one file to one record. How big is one file on average? Is it a text file?

Sara
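To make the merge idea concrete, here is a rough C sketch. The names "merged.dat", "merged.idx" and the record filenames are invented for the example: each small file is appended to a single data file, and a plain-text index line of "name offset size" is written so that a record can later be located with fseek() instead of opening a separate file.

#include <stdio.h>

/* Append the contents of 'name' to 'data' and record its position in 'index'. */
static int append_file(const char *name, FILE *data, FILE *index)
{
    FILE *in = fopen(name, "rb");
    if (!in)
        return -1;                      /* skip files that cannot be opened */

    long offset = ftell(data);
    long size = 0;
    char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        fwrite(buf, 1, n, data);
        size += (long)n;
    }
    fclose(in);

    fprintf(index, "%s %ld %ld\n", name, offset, size);   /* name offset size */
    return 0;
}

int main(void)
{
    FILE *data = fopen("merged.dat", "wb");
    FILE *index = fopen("merged.idx", "w");
    if (!data || !index)
        return 1;

    /* Placeholder names; in practice these would come from the program
     * that currently creates one file per record. */
    const char *names[] = { "rec_000001.txt", "rec_000002.txt", "rec_000003.txt" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        append_file(names[i], data, index);

    fclose(index);
    fclose(data);
    return 0;
}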
Yes, an SSD is a solid-state disk, which is about ten times faster (and ten times more expensive) than a normal hard disk.

Sara
In an old SunOS system, we had problems trying to open a number of files. There were millions of files in one folder (against my protests). When the open took too long, they remembered my protests and asked me to fix it. The solution was to have two layers of subdirectories (e.g., 128 subdirectories under the main archive folder, with each of these 128 subdirectories having another 128 subdirectories). This resulted in 16384 subdirectories in the tree.

The files followed a long filenaming convention. Given a filename, I applied two hash functions, one for each of the two directory layers. The two numbers would indicate the two-layer path to the folder where the file would reside.
>> This resulted in 16384 subdirectories in the tree
To clarify, there are 16384 subdirectories at the bottom layer of the tree, and the files resided only in the bottom layer. The top layer, the archive directory, had only 128 directories in it (and no files), named "000" through "127". Each of these 128 directories had 128 directories in them (again, no files). So the pathname of a file would look like: "/archive/045/101/filename.dat"

The processing time improvement was extremely significant.
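A minimal C sketch of this two-layer scheme. The two hash functions here are generic string hashes chosen only for illustration (the actual functions were tailored to the real filenames); each one picks one of 128 buckets, and together they build a path of the form "/archive/NNN/NNN/filename.dat".

#include <stdio.h>

#define FANOUT 128   /* 128 directories per level, as in the example above */

/* Two simple, independent string hashes; placeholders for whatever spreads
 * the real filenames evenly. Each returns a bucket in [0, FANOUT). */
static unsigned hash1(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % FANOUT;
}

static unsigned hash2(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return h % FANOUT;
}

/* Build "/archive/NNN/NNN/<filename>" into out. */
static void make_path(const char *filename, char *out, size_t outlen)
{
    snprintf(out, outlen, "/archive/%03u/%03u/%s",
             hash1(filename), hash2(filename), filename);
}

int main(void)
{
    char path[512];
    make_path("filename.dat", path, sizeof path);
    puts(path);   /* prints the two-layer path this name hashes to */
    return 0;
}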
zizi21

ASKER

Thank you.
zizi21

ASKER

phoffric,
I don't have access to the topmost level. I have access from the third level onwards. Is that okay?

/Users/myuser/main
From main onwards, I have 128 directories, and in those 128, I have another 128 directories. Thanks.
>> I have access from the third level onwards. Is that okay?
Your tree can start wherever you wish, as long as you have permission to create it.

Now you need to develop two hash functions that operate on the filename. Try to develop these hash functions so that you get a reasonably even distribution of the files across the 128*128 = 16K lowest-level directories, so as to minimize the number of files in any given directory.

Instead of 128, you can experiment with a smaller number (e.g., 64), depending on how many files you expect to have. A directory holding only 200-300 files should be no problem.

My filenames had a specialized pattern to them (a sequence of numbers and dots). I ran simulations to show that the double-hash tree structure did produce a reasonably even distribution. You should write a simulation for your type of filenames to test your hash functions and verify a good distribution.
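A small simulation along those lines, sketched in C. The "rec_%07ld.dat" naming pattern is an assumption standing in for the real filenames, and the hash functions are the same illustrative ones as above; the program hashes a million synthetic names into the 128 x 128 buckets and reports the emptiest and fullest bucket so you can judge whether the distribution is even.

#include <stdio.h>

#define FANOUT 128
#define NFILES 1000000L

static unsigned hash1(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % FANOUT;
}

static unsigned hash2(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return h % FANOUT;
}

int main(void)
{
    static long counts[FANOUT][FANOUT];   /* zero-initialised bucket counters */
    char name[64];

    /* Synthetic names; substitute the real naming pattern here. */
    for (long i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "rec_%07ld.dat", i);
        counts[hash1(name)][hash2(name)]++;
    }

    long min = counts[0][0], max = counts[0][0];
    for (int a = 0; a < FANOUT; a++)
        for (int b = 0; b < FANOUT; b++) {
            if (counts[a][b] < min) min = counts[a][b];
            if (counts[a][b] > max) max = counts[a][b];
        }
    printf("files per directory: min %ld, max %ld (ideal ~%ld)\n",
           min, max, NFILES / (FANOUT * (long)FANOUT));
    return 0;
}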
What is the average number of files that you have?
Based on that, you can decide the number of directories and the number of levels of the tree you need.
You can also have this decided dynamically in your code.
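A tiny C sketch of that sizing step. The target of roughly 250 files per directory and the 128-way fanout are assumptions, not numbers from the thread: given an expected file count, it works out how many leaf directories are needed and how many levels of the chosen fanout that implies.

/* Compile with: gcc sizing.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double expected_files = 1e6;   /* "could go into the millions", from the thread */
    double per_dir_target = 250;   /* comfortable files per directory (assumption) */
    double fanout = 128;           /* directories per level (assumption) */

    double leaves = ceil(expected_files / per_dir_target);
    double levels = ceil(log(leaves) / log(fanout));

    printf("need about %.0f leaf directories: %.0f level(s) of %.0f-way fanout\n",
           leaves, levels, fanout);
    return 0;
}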
zizi21

ASKER

I can have thousands of folders. Would the performance be affected by this?

/main/directory1/anotherdirectory/nameoffile

Sorry for the numerous questions.
It looks to me as if you would be using the file system as a substitute for a database. You shouldn't do that; it is highly inefficient, slow, and error-prone.

Sara
ASKER CERTIFIED SOLUTION
zizi21

ASKER

Thanks a lot.
I saw a question with my name on it, but it was deleted. Did I miss something important?
BTW, on the SunOS system, many operations in a folder were O(n²), not O(n), so doubling the number of files resulted in a four-fold increase in the time to perform the operation on the files in a folder. My understanding is that some modern OSes, such as Windows 7, do not suffer from the O(n²) performance degradation.
zizi21

ASKER

phoffric,

I am surprised it was deleted. I have posted the question again. Please accept it. Thanks.