zizi21
subdirectories in unix/linux system
hi there,
I have a program that is taking a very long time to access a directory because there are too many files in it.
If I make subdirectories in that directory, wouldn't it be the same thing?
tq,
zizi
Directories form a tree structure, and on many filesystems a name lookup scans a directory's entries linearly, so access time grows with the number of direct child nodes. Splitting the files across subdirectories keeps each scan short.
Depends on your type of filesystem and operating system. (Your topic and zones conflict: is it BSD or Linux?) In general, older OSes and filesystems had more of a problem with big directories, and creating subdirectories will speed things up. That is why the terminfo directory is structured the way it is, for example.
ASKER
It is in Linux. I have stopped the program and am using the Linux ls now, and it is still the same problem: when I do an ls, it takes hours and hours. Someone suggested having subdirectories, but I don't see how a subdirectory is going to help.
For instance,
main directory --> sub1
               --> sub2
I mean, when I do an ls, I am still going to do it in the main directory.
> I mean, when I do a ls, I am still going to do in the main directory.
So, is ls taking a long time to complete in any directory, or only in directories with a lot of files?
I suspect your system RAM is being used up by some process that is not releasing it - maybe it has a memory leak.
Because of that, the system has become slow.
Can you execute the "top" command and observe the values under RSS?
RSS means Resident Set Size. For any process, this should not keep on growing.
If it is continuously growing, that means the process is eating up memory but not releasing it.
So, check whether any process has an unusually high value under RSS compared to other processes.
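As an illustration of what "top" reports under RSS: on Linux, a process's resident set size is also exposed in /proc/[pid]/status on the VmRSS line, in kilobytes. The sketch below (a hypothetical helper, not something from this thread) reads the calling process's own VmRSS; in practice you would simply watch other processes with top or ps.

```c
#include <stdio.h>
#include <string.h>

/* Read this process's resident set size (in kB) from /proc/self/status.
 * "top" shows the same per-process figure in its RSS column.
 * Returns -1 on failure. Linux-specific. */
long own_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {     /* e.g. "VmRSS:  1234 kB" */
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}
```

A process with a leak would show this number growing continuously over time.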
Is ls in the main directory still taking hours when all the main directory contains is the two subdirectories?
ASKER
If there are only subdirectories, it is fast. But I read that the more subdirectories you have, the more time it takes. http://serverfault.com/questions/46384/how-to-solve-linux-subdirectories-number-limit
This means that if I have 1000 subdirectories in the main directory, access would be slow. Please correct me if I am wrong.
ASKER
I meant, if there are only 2 subdirectories, it is fast... but...
What do you want to do with 1000 subdirectories? Are you performing a task that takes seconds per directory?
Why do you need so many subdirectories?
ASKER
My program needs to create multiple files, where it could go to millions... And then I need to access those million files.
ASKER
However, if those million files are in one directory, it takes days and days. Someone said to put them in multiple subdirectories so that access to those files will be faster.
The part of the program that does a directory scan may be faster, but it sounds like that may be only a small part of the overall task.
If it takes 100 milliseconds to process one file, then a million files will take over a day to process (1,000,000 x 0.1 s = 100,000 s, about 28 hours).
So even ls will take a very long time to display all the files.
A million names rolling over your screen will take time.
So, if you want to speed up, you will have to think about reducing the number of files itself.
One million files in one directory is too much processing for any program that has to process all the files.
So, see if you can merge these files and have only a few.
ASKER
I am confused. I am not scanning the directory; I am opening files using C 'fopen', reading using fread, and writing using fwrite.
ASKER
Sorry, i am confused.
> I am not scanning the directory but i am opening files using C 'fopen' and reading using fread and writing using fwrite
You are doing this for all the one million files, right?
ASKER
Thanks for the explanation. What do you mean by SSD? Do you mean solid state disk? (I found this when I googled SSD.)
Hello,
There is a limit of 31998 subdirectories per directory (on ext3), stemming from its limit of 32000 links per inode.
ozo is right. Merge 1000 files into one; then you have 1000 files, which is few enough. Or use a database, writing one file to one record. How big is one file on average? Is it a text file?
Sara
Yes, an SSD is a solid state disk, which is about ten times faster (and ten times more expensive) than a normal hard disk.
Sara
In an old SunOS system, we had problems trying to open a number of files. There were millions of files in one folder (against my protests). When the open took too long, they remembered my protests and asked me to fix it. The solution was to have two layers of subdirectories (e.g., 128 subdirectories under the main archive folder, and each of these 128 subdirectories would have another 128 subdirectories). This resulted in 16384 subdirectories in the tree.
The files followed a long filenaming convention. Given a filename, I applied two hash functions, one for each of the two directory layers. The two numbers would indicate the two-layer path to the folder where the file would reside.
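The two-hash scheme above can be sketched in C. This is a hypothetical illustration, not the original poster's code: the hash is a plain djb2-style function run with two different seeds (one per directory layer), the root "/archive" and the 128-per-layer constants are taken from the example above, and the function names are made up.

```c
#include <stdio.h>

/* Map a filename to one of 128 directories in a layer, using a simple
 * djb2-style hash. Different seeds give independent layer choices.
 * Illustrative only; any hash with a reasonably even spread works. */
static unsigned dir_hash(const char *name, unsigned seed)
{
    unsigned h = seed;
    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h % 128u;
}

/* Build "<root>/NNN/MMM/<name>" into out, e.g. "/archive/045/101/x.dat". */
void hashed_path(char *out, size_t outsz, const char *root, const char *name)
{
    unsigned top = dir_hash(name, 5381u);   /* first-layer directory  */
    unsigned bot = dir_hash(name, 1469u);   /* second-layer directory */
    snprintf(out, outsz, "%s/%03u/%03u/%s", root, top, bot, name);
}
```

The same filename always hashes to the same path, so no lookup table is needed to find a file again later.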
>> This resulted in 16384 sub directories in the tree
To clarify, there are 16384 subdirectories at the bottom layer of the tree, and the files resided only in the bottom layer. The top layer, the archive directory, had only 128 directories in it (and no files), named "000" through "127". Each of these 128 directories had 128 directories in them (again, no files). So the pathname of a file would look like: "/archive/045/101/filename.dat"
The processing time improvement was extremely significant.
ASKER
Thank you.
ASKER
phoffric,
I don't have access to the topmost level; I have access from the third level onwards. Is that okay?
/Users/myuser/main
From main onwards, I have 128 directories, and in those 128, I have another 128 directories. Thanks.
>> i have access from the third level onwards. is that okay?
Your tree can start wherever you wish, as long as you have permission to create it.
Now you need to develop two hash functions that operate on the filename. Try to design them so that you get a reasonably even distribution of the files across the 128*128 = 16K lowest-level directories, to minimize the number of files in any given directory.
Instead of 128, you can experiment with a smaller number (e.g., 64), depending on how many files you expect to have. A directory holding only 200-300 files should be no problem.
My filenames had a specialized pattern to them (a sequence of numbers and dots). I ran simulations to show that the double-hash tree structure did produce a reasonably even distribution. You should write a simulation for your type of filenames to test that your hash functions give a good distribution.
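The simulation suggested above might look like the sketch below. It is a hypothetical example: it generates synthetic sequentially numbered filenames (one guess at what the poster's names might look like), hashes each into a 128x128 grid of buckets with a placeholder djb2-style hash, and reports the fullest and emptiest bucket. A good pair of hash functions keeps those two counts close together.

```c
#include <stdio.h>

#define LAYER 128   /* directories per layer, as in the example above */

/* Placeholder hash, run with two seeds for the two layers. */
static unsigned dir_hash(const char *name, unsigned seed)
{
    unsigned h = seed;
    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h % LAYER;
}

/* Hash nfiles synthetic filenames into the LAYER x LAYER grid and
 * report the minimum and maximum bucket occupancy. */
void bucket_stats(int nfiles, int *min_out, int *max_out)
{
    static int count[LAYER][LAYER];   /* static: too big for the stack */
    char name[64];
    int i, j, min, max;

    for (i = 0; i < LAYER; i++)
        for (j = 0; j < LAYER; j++)
            count[i][j] = 0;

    for (i = 0; i < nfiles; i++) {
        snprintf(name, sizeof name, "data.%07d.dat", i);
        count[dir_hash(name, 5381u)][dir_hash(name, 1469u)]++;
    }

    min = nfiles;
    max = 0;
    for (i = 0; i < LAYER; i++)
        for (j = 0; j < LAYER; j++) {
            if (count[i][j] < min) min = count[i][j];
            if (count[i][j] > max) max = count[i][j];
        }
    *min_out = min;
    *max_out = max;
}
```

Replace the synthetic name pattern with your real filenaming convention before trusting the numbers; a hash that spreads one pattern evenly can cluster badly on another.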
What is the average number of files that you have?
Based on that, you can decide the number of directories and the number of tree levels you need.
You can also have this decided dynamically in your code.
ASKER
I can have thousands of folders. Would the performance be affected by this?
\main\directory1\anotherdirectory\nameoffile
Sorry for the numerous questions.
It looks to me as if you would be using the filesystem as a substitute for a database. You shouldn't do that; it is highly inefficient, slow, and error-prone.
Sara
ASKER CERTIFIED SOLUTION
ASKER
Thanks a lot.
I saw a question with my name on it, but it was deleted. Did I miss something important?
BTW - on the SunOS system, many operations in a folder were O(n²), not O(n), so doubling the number of files resulted in a four-fold increase in the time to perform the operation on the files in a folder. My understanding is that some modern OSes, such as Windows 7, do not suffer from the O(n²) performance degradation.
ASKER
phoffric,
I am surprised it was deleted. I have posted the question again; please accept it. Thanks.
ASKER
This is the link. thanks again.
https://www.experts-exchange.com/questions/26971568/Points-for-phoffric.html