• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 654

Subdirectories in a Unix/Linux system

Hi there,

I have a program that is taking a very long time to access a directory because there are too many files in it.
If I make subdirectories in that directory, wouldn't it be the same thing?

tq,
zizi
Asked by: zizi21
4 Solutions
 
enachemcCommented:
No, it wouldn't be the same thing; it would work faster.
But the number of subdirectories per directory in Linux is limited, while the number of files is not really limited.
0
 
enachemcCommented:
Directories use a tree structure, and the access time is proportional to the number of direct child nodes.
0
 
ssnkumarCommented:
> i have a program that is taking a very long time to access the directory
How are you trying to access the directory?
Are you logged into the system and using the cd command?
Or are you trying to access the directory using ssh or rsh?

> If i make subdirectories in that directory, wouldn't it be the same thing
This will organize your directory, and it should be easier to search for a particular file.
But your first problem looks different, so I really don't know if having subdirectories is going to make any difference!
0

 
SuperdaveCommented:
It depends on your type of filesystem and operating system. (Your topic and zones conflict -- is it BSD or Linux?) In general, older OSes and filesystems had more of a problem with big directories, and creating subdirectories will speed things up. That is why the terminfo directory is structured the way it is, for example.
0
 
zizi21Author Commented:
It is Linux. I have stopped the program and I am using the Linux ls command now, and it is still the same problem: when I do an ls, it takes hours and hours. Someone suggested making subdirectories, but I don't see how a subdirectory is going to help.

For instance:

main directory --> sub1
               --> sub2

I mean, when I do an ls, I am still going to do it in the main directory.
0
 
ssnkumarCommented:
> I mean, when I do a ls, I am still going to do in the main directory.
So, is ls taking a long time to complete in any directory, or only in directories with a lot of files?

I suspect your system RAM is being used up by some process that is not releasing it - maybe it has a memory leak - and because of that the system has become slow.

Can you run the "top" command and observe the values under RSS?
RSS means Resident Set Size. For any process, this should not keep growing.
If it is continuously growing, that means the process is eating up memory but not releasing it.
So check whether any process has an unusually high RSS value compared to the other processes.
0
 
ozoCommented:
Is ls in the main directory still taking hours when all the main directory contains is the two subdirectories?
0
 
zizi21Author Commented:
If there are only subdirectories, it is fast. But I read that the more subdirectories you have, the more time it would take: http://serverfault.com/questions/46384/how-to-solve-linux-subdirectories-number-limit

This means that if I have 1000 subdirectories in the main directory, the access would be slow. Please correct me if I am wrong.
0
 
zizi21Author Commented:
I meant, if there are only 2 subdirectories, it is fast... but...
0
 
ozoCommented:
What do you want to do with 1000 subdirectories? Are you performing a task that takes seconds per directory?
Why do you need so many subdirectories?
0
 
zizi21Author Commented:
My programs need to create many files - it could go to millions. And then I need to access those million files.
0
 
zizi21Author Commented:
However, if those million files are in one directory, it takes days and days. Someone said to put them in multiple subdirectories so that access to those files will be faster.
0
 
ozoCommented:
The part of the program that does a directory scan may be faster, but it sounds like that may be only a small part of the overall task.
If it takes 100 milliseconds to process one file, then a million files will take over a day to process.
0
 
ssnkumarCommented:
So even ls will take a very long time to display all the files; a million names rolling over your screen will take time.

If you want to speed things up, you will have to think about reducing the number of files itself.
One million files in one directory... that is too much processing for any program that has to process all of them.
So see if you can merge these files and have only a few.
0
 
zizi21Author Commented:
I am confused. I am not scanning the directory; I am opening files using C's fopen, reading using fread, and writing using fwrite.
0
 
zizi21Author Commented:
Sorry, I am confused.
0
 
ssnkumarCommented:
>  I am not scanning the directory but i am opening files using C 'fopen' and reading using fread and writing using fwrite
You are doing this for all the one million files, right?
0
 
ozoCommented:
fopen requires scanning the directory for the file name.
A smaller directory may possibly speed up the fopen, but it should not much affect the speed of the freads or fwrites.
Millions of freads and fwrites will take a long time, even if you have a fast SSD.
If you cannot reduce the freads and fwrites, then you probably want to get the fastest SSD you can.
0
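To make ozo's point concrete, here is a minimal C sketch (the two paths are hypothetical placeholders, not anything from this thread) that times nothing but the fopen()/fclose() pair, so you can compare a file sitting in one huge flat directory against the same data stored under a small subdirectory:

#include <stdio.h>
#include <time.h>

/* Hypothetical paths - point these at real files on your own system. */
#define FLAT_PATH   "/data/flat/file_0499999.dat"         /* one of ~1M files in a single directory */
#define NESTED_PATH "/data/tree/045/101/file_0499999.dat" /* same data, small subdirectory */

/* Time 'repeats' fopen()/fclose() pairs on one path; no fread/fwrite at all,
 * so the measurement isolates the directory-lookup cost ozo describes. */
static double time_fopen(const char *path, int repeats)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < repeats; i++) {
        FILE *fp = fopen(path, "rb");
        if (fp)
            fclose(fp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("flat directory:  %.3f s for 1000 opens\n", time_fopen(FLAT_PATH, 1000));
    printf("small directory: %.3f s for 1000 opens\n", time_fopen(NESTED_PATH, 1000));
    return 0;
}

Whatever the numbers come out to on your filesystem, the fread/fwrite time itself is unchanged, which is why reducing the amount of I/O matters more than reorganizing the directories.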
 
zizi21Author Commented:
Thanks for the explanation. What do you mean by SSD? Do you mean solid-state disk? (That is what I found when I googled SSD.)
0
 
singh677Commented:
Hello,

There is a limit of 31,998 subdirectories per directory, stemming from the filesystem's limit of 32,000 links per inode.
0
 
sarabandeCommented:
ozo is right. Merge 1000 files into one; then you have 1000 files, which is quite enough. Or use a database and write one file to one record. How big is one file on average? Is it a text file?

Sara
0
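A rough sketch of the merge idea (this is not code from the thread, and the file names are made up): append each small file to one big data file and keep a simple name/offset/length index, so a later reader can fseek() straight to a record instead of doing millions of fopens.

#include <stdio.h>

/* Append the contents of 'name' to the open data file; return the offset where it starts. */
static long append_file(FILE *data, const char *name)
{
    FILE *in = fopen(name, "rb");
    if (!in)
        return -1;
    long start = ftell(data);
    char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, data);
    fclose(in);
    return start;
}

int main(void)
{
    FILE *data  = fopen("merged.dat", "wb");
    FILE *index = fopen("merged.idx", "w");
    if (!data || !index)
        return 1;

    /* Hypothetical input names - in practice you would loop over your real files. */
    const char *names[] = { "rec_000001.dat", "rec_000002.dat" };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        long start = append_file(data, names[i]);
        if (start >= 0)
            fprintf(index, "%s %ld %ld\n", names[i], start, ftell(data) - start);
    }
    fclose(index);
    fclose(data);
    return 0;
}

The index file is doing the same job a database would do for you; if the records vary a lot in size or need to be updated in place, sarabande's database suggestion is the simpler route.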
 
sarabandeCommented:
Yes, an SSD is a solid-state disk, which is roughly ten times faster (and ten times more expensive) than a normal hard disk.

Sara
0
 
phoffricCommented:
In an old SUN OS system, we had problems trying to open a number of files. There were millions of files in one folder (against my protests). When the open took too long, they remembered my protests and asked me to fix it. The solution was to have two layers of subdirectories (e.g., 128 subdirectories under the main archive folder, and each of these 128 subdirectories would have another 128 subdirectories). This resulted in 16384 subdirectories in the tree.

The files followed a long filenaming convention. Given a filename, I applied two hash functions - one for each of the two directory layers. The two numbers would indicate the two-layer path to the folder where the file would reside.
0
 
phoffricCommented:
>> This resulted in 16384 sub directories in the tree
To clarify, there are 16384 subdirectories at the bottom layer of the tree, and the files only resided in the bottom layer. The top layer, the archive directory, had only 128 directories in it (and no files), named "000" through "127". Each of these 128 directories had 128 directories in them (again, no files). So the pathname of a file would look like: "/archive/045/101/filename.dat"

The processing time improvement was extremely significant.
0
 
zizi21Author Commented:
Thank you.
0
 
zizi21Author Commented:
phoffric,
I don't have access to the topmost level. I have access from the third level onwards. Is that okay?

/Users/myuser/main

From main onwards, I have 128 directories, and in those 128, I have another 128 directories. Thanks.
0
 
phoffricCommented:
>> i have access from the third level onwards. is that okay?
Your tree can start wherever you wish, as long as you have permissions to create the tree.

Now you need to develop two hash functions that operate on the filename. You should try to develop these hash functions so that you attempt to get a reasonably even distribution of the files across the 128*128 = 16K lowest level directories, so as to try to minimize the number of files in any given directory.

Instead of 128, you can experiment with a smaller number (e.g., 64) depending upon how many files you expect to have. A directory having only 200-300 files in it should be no problem.

My filenames had a specialized pattern to them (a sequence of {numbers and dots}). I ran simulations to show that the double-hash tree structure did produce a reasonably even distribution. You should write a simulation for your type of filenames to test your hash functions and verify a good distribution.
0
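A minimal sketch of the two-hash layout phoffric describes (this is not his original code; the djb2-style hash, the fan-out N = 128, and the base path are assumptions for illustration):

#include <stdio.h>

#define N 128   /* folders per level, so N*N leaf directories in total */

/* Simple string hash (djb2 style) reduced to a directory number; two
 * different seeds give two independent-looking directory levels. */
static unsigned level_hash(const char *name, unsigned seed)
{
    unsigned h = seed;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        h = h * 33u + *p;
    return h % N;
}

/* Build "base/<level1>/<level2>/<filename>" into 'path'. */
static void make_path(char *path, size_t size, const char *base, const char *filename)
{
    unsigned d1 = level_hash(filename, 5381u);
    unsigned d2 = level_hash(filename, 2166u);
    snprintf(path, size, "%s/%03u/%03u/%s", base, d1, d2, filename);
}

int main(void)
{
    char path[512];
    make_path(path, sizeof path, "/Users/myuser/main", "record_000123.dat");
    printf("%s\n", path);   /* prints the two-level hashed path for this name */
    return 0;
}

As phoffric says, run the same hash over a sample of your real filenames first and count how many land in each of the N*N leaves; if the distribution is lopsided, change the hash functions rather than the tree.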
 
ssnkumarCommented:
What is the average number of files that you have?
Based on that, you can decide the number of directories and the number of levels of the tree you need.
You can also have this decided dynamically in your code.
0
 
zizi21Author Commented:
I can have thousands of folders. Would the performance be affected by this?

\main\directory1\anotherdirectory\nameoffile

Sorry for the numerous questions.
0
 
sarabandeCommented:
It looks to me as if you are using the file system as a substitute for a database. You shouldn't do that; it is highly inefficient, slow, and error-prone.

Sara
0
 
phoffricCommented:
Each OS has different perks, so doing a timing test as an experiment is advised.

In my SUN OS system, we used the Oracle database to store file metadata information. Each file was anywhere from 4MB to 100MB in length. The files were stored in a robotic tape library (very slow access). The disk file system was the cache for this tape library.

In my old SUN OS system, there was an added advantage to having very short folder names due to their filename caching scheme IIRC (I chose 3 digit folder names, as explained above).

For timing comparison purposes, I would recommend starting with a two layer folder system. For example, your full pathname would look like:
     /Users/myuser/main/045/101/filename.dat

Organize the tree so that there are only 64 to 256 folders or files within any folder. That is, in /Users/myuser/main, there should be N folders, where N is a fixed number and doesn't change. You create the folder tree before adding any files. If, for example, N = 128, then
   /Users/myuser/main/ will have exactly 128 folders in it (and no files).

One of these folders will be called 082; and the path /Users/myuser/main/082 will also have exactly 128 folders in it (and no files). And one of these folders will be called 101.

And the path /Users/myuser/main/082/101 will contain only files. How many will be determined by the two hash functions if using this approach.

If you expect to have 4 million files, then in a perfect hash world, you would have in each of the bottom tier folders about 244 files - i.e., 4000000/(128*128).

If you have more than 4 million files, then N = 128 may be too small. In my SUN OS system, I needed to satisfy this formula:
     number_of_files/N² < 256

This system was deployed, and was highly effective, had very fast (disk) cache file access, and ran without any fielded errors.
0
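If it helps, here is a small sketch (assumptions: POSIX mkdir(), the /Users/myuser/main base path, and N = 128) that pre-creates the fixed N x N folder tree before any data files are written, as phoffric recommends:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define BASE "/Users/myuser/main"
#define N    128   /* pick N so that expected_files / (N*N) stays in the low hundreds */

int main(void)
{
    char path[512];

    for (int i = 0; i < N; i++) {
        snprintf(path, sizeof path, "%s/%03d", BASE, i);
        mkdir(path, 0755);            /* first level: BASE/000 .. BASE/127 (EEXIST ignored for brevity) */
        for (int j = 0; j < N; j++) {
            snprintf(path, sizeof path, "%s/%03d/%03d", BASE, i, j);
            mkdir(path, 0755);        /* second level: BASE/000/000 .. BASE/127/127 */
        }
    }
    printf("created %d leaf directories under %s\n", N * N, BASE);
    return 0;
}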
 
zizi21Author Commented:
Thanks a lot.
0
 
phoffricCommented:
I saw a question with my name on it, but it was deleted. Did I miss something important?
0
 
phoffricCommented:
BTW - on the SUN OS system, many operations in a folder were O(n²), not  O(n). So doubling the number of files resulted in a four-fold increase in time to perform the operation on the files in a folder. My understanding is that some modern OS such as Windows 7 do not suffer from the O(n²) performance degradation.
0
 
zizi21Author Commented:
phoffric,

I am surprised it was deleted. I have put the question up again. Please accept it. Thanks.
0
 
