The structure of the file descriptor "table"?

Maarten Bruins
Maarten Bruins used Ask the Experts™
on
Actually the file descriptor table is not a real table. It's just an array of pointers to the "open file table" (struct file). But let's say we will see it as a table. What are the columns? For example:

FD   | Pointer to "open file table"
----------------------------------
...  | ...

In short, that's the question. I see a lot of different figures on the internet, but they are all different. For example, see:

http://faculty.winthrop.edu/dannellys/csci325/10_shared.htm
There they have a column "fd flags" (read/write), but I would think that this column is part of the "open file table" and not part of the "file descriptor table". See for example: http://man7.org/linux/man-pages/man2/open.2.html


       A call to open() creates a new open file description, an entry in the
       system-wide table of open files.  The open file description records
       the file offset and the file status flags (see below).  A file
       descriptor is a reference to an open file description; this reference
       is unaffected if pathname is subsequently removed or modified to
       refer to a different file.  For further details on open file
       descriptions, see NOTES.

       The argument flags must include one of the following access modes:
       O_RDONLY, O_WRONLY, or O_RDWR.  These request opening the file read-
       only, write-only, or read/write, respectively.

So this creates a file descriptor and an entry in the open file table. A file descriptor is only an integer, so then I would conclude that O_RDONLY/O_WRONLY/O_RDWR is part of the open file table and not of the file descriptor table. They also say this:

A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file.

If write/read would be part of the file descriptor table, then the above can not be true in my opinion.

Another example:

http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch03lev1sec10.html
Here they also have a column "flags". Do they also mean read/write et cetera with "flags"? Or what exactly do they mean by flags in this case?

https://www.usna.edu/Users/cs/aviv/classes/ic221/s16/lec/21/lec.html
Here they don't have the column flags at all.

I think for the answer we have to look at https://elixir.bootlin.com/linux/v3.18/source/include/linux/fdtable.h

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
      atomic_t count;
      struct fdtable __rcu *fdt;
      struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
      spinlock_t file_lock ____cacheline_aligned_in_smp;
      int next_fd;
      unsigned long close_on_exec_init[1];
      unsigned long open_fds_init[1];
      struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

Although they write "open file table structure", probably this is what other people call the "file descriptor table". These structures are still a bit abracadabra for me, but I don't see something like read/write flags.

So how I have to see it? Which columns does the file descriptor "table" have? (if we would see it as a table)
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
nociSoftware Engineer
Distinguished Expert 2018

Commented:
This structure refers to another struct fdtable array (which is slightly above it, in the same source)
The struct fdtable references a struct file array...

struct file can be found in  /usr/src/linux/include/linux/fs.h
struct file {
        union {
                struct llist_node       fu_llist;
                struct rcu_head         fu_rcuhead;
        } f_u;
        struct path             f_path;
        struct inode            *f_inode;       /* cached value */
        const struct file_operations    *f_op;

        /*
         * Protects f_ep_links, f_flags.
         * Must not be taken from IRQ context.
         */
        spinlock_t              f_lock;
        enum rw_hint            f_write_hint;
        atomic_long_t           f_count;
        unsigned int            f_flags;
        fmode_t                 f_mode;
        struct mutex            f_pos_lock;
        loff_t                  f_pos;
        struct fown_struct      f_owner;
        const struct cred       *f_cred;
        struct file_ra_state    f_ra;

        u64                     f_version;
#ifdef CONFIG_SECURITY
        void                    *f_security;
#endif
        /* needed for tty driver, and maybe others */
        void                    *private_data;

#ifdef CONFIG_EPOLL
        /* Used by fs/eventpoll.c to link all the hooks to this file */
        struct list_head        f_ep_links;
        struct list_head        f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
        struct address_space    *f_mapping;
        errseq_t                f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));  /* lest something weird decides that 2 is OK */


That contains all the flags you need...   so please follow the breadcrum marked road.

Author

Commented:
Thanks! So actually you're saying that @mccarl was wrong here: https://www.experts-exchange.com/questions/29120426/Understanding-the-structure-of-the-open-file-table.html#a42697350

It is a simple mismatch between terms used in the kernel documentation and lecture text that you linked to. What the kernel (struct files_struct) is calling "Open file table" in the source comments is what the lecture text is referring to as the "per process table" or "Per process file information" and what you are calling "File descriptor table".

So according to you, this is incorrect?

And see: http://faculty.winthrop.edu/dannellys/csci325/10_shared.htm

This figure is also incorrect? Because the read/write (fd flags) are part of the "open file table" and not of the "file descriptor table"?
mccarlIT Business Systems Analyst / Software Developer
Top Expert 2015

Commented:
So actually you're saying that @mccarl was wrong here

Maybe @noci can also respond further but I don't think anything like that has been said. @noci has just posted the "struct file" definition which is the elements in the "open file table", i.e. what they have called "File table" in the second and third links that you originally posted.


So, to the 3 links that you originally posted...

http://faculty.winthrop.edu/dannellys/csci325/10_shared.htm - this one I would pretty much disregard if you are looking for a more accurate picture of what is stored where. Quoting from that page "The following picture roughly illustrates the File Descriptor Table", so I don't think their aim was to fully/accurately depict the actual kernel structures but just give a general idea of what the types of info stored across those structures. Anyway, yeah, for what you are trying to learn, ignore this link.

http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch03lev1sec10.html - looks pretty accurate in terms of the diagram and most of the words that I have read. I would use this one as your reference.

Here they also have a column "flags". Do they also mean read/write et cetera with "flags"? Or what exactly do they mean by flags in this case?
They actually give you the answer in the page that yo linked to. Look at the words above the picture (section 3.10), in particular, the bullet point 1. sub point a. which is referring to the flags in the FD table. But in summary, no, these are different to the read/write/etc flags as stored in the file table entry. If you follow the link in this bullet point to section 3.14, you need to read all this very carefully, but you will see that it makes a distinction between "file descriptor flags" and "file status flags", and there are different IOCTL commands to get/set these two different sets of flags, GET_FD/SET_FD and the GET_FL/SET_FL commands


https://www.usna.edu/Users/cs/aviv/classes/ic221/s16/lec/21/lec.html - this 3rd link also has pretty accurate diagrams, they look almost the same as the 2nd link, but reading some of the words they have written, I think they would end up confusing you. They seem to change the terms that they use for some entities, even with the one paragraph. I would stick to the 2nd link.

Here they don't have the column flags at all.
As I said above, the diagrams are basically the same as the 2nd link, you must have missed the "flags" column because it is there.



I think for the answer we have to look at https://elixir.bootlin.com/linux/v3.18/source/include/linux/fdtable.h
As @noci said, to get the full picture of the FD table you also need to look just above the "struct files_struct" that you posted, there is a "struct fdtable" which looks like...

struct fdtable {
	unsigned int max_fds;
	struct file __rcu **fd;      /* current fd array */
	unsigned long *close_on_exec;
	unsigned long *open_fds;
	struct rcu_head rcu;
};

Open in new window


Here you can see the "struct file __rcu **fd;" member which is an array of pointers to "struct file" objects. And also you can see "unsigned long *close_on_exec;" which is a bit mask of the close_on_exec flags, which if you read that section 3.14, you will know is the *only* flag that is stored in the "file descriptor flags".

You will also see "unsigned long *open_fds;" which is a bit mask of which FD are "open" or not. They don't include this in the description of "file descriptor flags" but you could think of this like a flag.
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

nociSoftware Engineer
Distinguished Expert 2018

Commented:
Close on exec needs to task specific.. to have a parent be able to close FD's that should not propagate to a child when exec will be succesful.
The open_fds is a bitmask that is meant to search faster though available fd's. (Checking a bit in that array is way faster than walking all elements and check various offsets. if an element is closed or not).

Also you are talking about columns... the better terms are records and fields (generic) or structures and fields (C).

Author

Commented:
Thanks a lot @mccarl! Always really refreshing your posts!!! (and probably 90% of the answer to my question is in there)

Maybe @noci can also respond further but I don't think anything like that has been said. @noci has just posted the "struct file" definition which is the elements in the "open file table", i.e. what they have called "File table" in the second and third links that you originally posted.

The question was about "file descriptor table" and @noci came up with struct file. However, "struct file" refers to the "open file table", so anyway it's weird to come up with it when it's about the "file descriptor table". But I also don't know what he exactly wanted to say with it.

But you said that I can compare "struct files_struct" with the file descriptor table, but then how I have to see "struct fdtable"? I've tried to understand the structures a bit to answer this question myself. However, I don't see it yet because the structures are still too abstract for me to understand at this point. So maybe you have a hint how I have to see it?

Later on in your post, you're saying:

Here you can see the "struct file __rcu **fd;" member which is an array of pointers to "struct file" objects.

That I understand, because the "file descriptor table" is actually just an array of pointers to "struct file" objects. But how I have to see "struct fdtable" in this story?

And:

They actually give you the answer in the page that yo linked to. Look at the words above the picture (section 3.10), in particular, the bullet point 1. sub point a. which is referring to the flags in the FD table. But in summary, no, these are different to the read/write/etc flags as stored in the file table entry.

Thanks mcccarl, this is exactly one of the reasons why I opened this question. There is "only one" flag in the file descriptor table, right? (close_on_exec, and somehow open_fds) So to answer my own question, see: http://faculty.winthrop.edu/dannellys/csci325/10_shared.htm
Then this figure is indeed incorrect, because they say it's the "file descriptor table" and the "open file table", but then they should put the "fd flags" (read/write) in the "open file table" and not in the "file descriptor table". This is what I would expected in the first place, but I got confused because I thought maybe I understand it the wrong way.

You're saying:

so I don't think their aim was to fully/accurately depict the actual kernel structures but just give a general idea of what the types of info stored across those structures.

But even if that was their aim, then why they would put "fd flags" (read/write) in the file descriptor table instead of the open file table? I don't see any reason for that, it's just incorrect, right?

And:

but you will see that it makes a distinction between "file descriptor flags" and "file status flags"

Exactly, and "fd flags" (read/write) are part of the "file status flags", because it's used in the system call open(), right? So read/write are not part of the "file descriptor table".

So now I only need to understand how I have to see "struct fdtable" in this all. But already thanks a lot mccarl! Your post was really really useful and actually it confirms thoughts that already passed my mind. You're a good teacher ;).

@noci:

the better terms are records and fields (generic) or structures and fields (C).

If that's better, then why 90% of the websites explain it in a different way? (with tables/columns) Because the structures are not that easy to understand or to start with. First you need to have some basic idea about it. But I understand you, because I also like to make the connection to the structures. Then at least I have an idea what's really behind the tables. I also don't like to simplify things if they are not correct anymore later on. So anyway it's good that you're adding that kind of stuff to your posts (it's useful for me). However, many website are talking about tables, so if I would only know about the structures, than I don't understand many websites. My main goal is to just understand the basics of Linux. My main work area is PHP/MYSQL, CSS, HTML, JavaScript, htaccess, caching. That is where my knowledge lies. I graduated physics and astronomy, so sometimes I just notice that I miss some basic knowledge. That's why I decided to learn some more about Linux. So I don't need to program in C later on, or I don't need to use the structures later on, but I want to understand at least the basics such as "file descriptor table", "open file table", "in-RAM inode table" the correct way. Many websites are showing different columns, so I wanted to know how I have to see these columns. I would prefer to see something like this:

FILE DESCRIPTOR TABLE
FD | close_on_exec | open_fds | pointer to open file table

or something like this:

FILE DESCRIPTOR TABLE
FD | flags*| pointer to open file table
* close_on_exec | open_fds

I know websites don't write it down like that, because it might confuse people but I like to see it like that. I think if you want to see it as a table, then this is maybe the best way to write it down, right? That's actually what my question is about. And the same I want to know for the "open file table" (in that other question).

Anyway both thanks a lot again!! I really appriate your help!
nociSoftware Engineer
Distinguished Expert 2018

Commented:
Tables & columns are in spreadsheets, databases and that is what most people see first...., (except maybe for programmers)...
So an ordered list of structures might be compared to A tables of equal rows....

On FDTable structure....
They are actualy 3 equaly long arrays ( programmer term vs. users "table")   the File Descriptors (pointers -> Open file structures)
close_on_exec (collection of bits), actualy active FD's (open_fds).
The fd array will have N pointers, the other 2 will have N bits. (N = max_fds field).
why do you think of *close_on_exec as only one flag... it is a pointer to an arbitrary long list of longs (32 bits/long).... (each a collection of bits)...  Why 3 tables? it is faster.... and it conserves space. adding 2 bits to the FD array  in structure would waste 62 bits/ entry on 64 bit systems (almost 50% of memory used) if entries are needed to keep aligned.
Making a structure of these 3 and forming one array will cause unaligned access to memory ==> forcing more data transfers from memory -> cpu and back. ==> causing slower access.
Chances are if the open_fds and close_on_exec are properly aligned that 128 - 256 bits  (depending on CPU) will be read anyway. So one cache row miss will get you the most data.

Author

Commented:
Is this an answer to this quote?

But how I have to see "struct fdtable" in this story?

Because in summary: mccarl is saying that I must see the "file descriptor table" as:

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
      atomic_t count;
      struct fdtable __rcu *fdt;
      struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
      spinlock_t file_lock ____cacheline_aligned_in_smp;
      int next_fd;
      unsigned long close_on_exec_init[1];
      unsigned long open_fds_init[1];
      struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

And now I want to know how I have to see:

struct fdtable {
      unsigned int max_fds;
      struct file __rcu **fd;      /* current fd array */
      unsigned long *close_on_exec;
      unsigned long *open_fds;
      struct rcu_head rcu;
};

You're saying:

They are actualy 3 equaly long arrays

So you're saying that the "file descriptor table" are actually 3 long arrays? And where I can find these arrays in the structures?

But actually immediately I got confused again. Mccarl just told me that I can see close_on_exec as a flag. And now you're saying that I have to see it as a pointer too:

why do you think of *close_on_exec as only one flag... it is a pointer to an arbitrary long list of longs (32 bits/long)

A flag is just "yes/no" (your words), so how can "yes/no" be a pointer? (close_on_exec) So I got stuck again after a couple of sentences.
nociSoftware Engineer
Distinguished Expert 2018

Commented:
fd is an array (pointer to an array of pointer to be exact...)
close_on_exec is a pointer to an array of bits. ...
each of the bits is a yes/no.
say your hard fle limit = 1024 (so max_fds = 1024) then there are
1024 fd elements...
(the fd field point to a array of 1024 pointers)
the close_on_exec hold a pointer to 1024 bits = (32 long words of 32 bits each).
the open_fds holds a pointer to 1024 bits (=32 longwords of 32 bits each).

FD=0 ->   references the fd->[0] pointer, bit 0 ( close_on_exec->[0/32]  & 1<<(0%32) ) like wise for open_fds
FD=33 -> references the fd->[33] pointer, bit 33 (close_on_exec->[33/32] & 1<<(33%32)) ==  (close_on_exec->[1] & 2)

this is how the C language deals with pointers (& arrays).

Author

Commented:
Okay, yeah we were trying to simplify it to a table, so for that it's good that @mccarl just saw close_on_exec as a column/flag in this table. But with which "struct" you want to compare my "file descriptor table", because that's still not really clear for me?

"struct fdtable {" contains "close_on_exec" and "open_fds", what we are talking about right now. So then I would say, "struct fdtable {" should be seen as the file descriptor table.  However, @mccarl sees "struct files_struct" as being the "file descriptor table"? Probably you both are in some way right, but I don't see yet how I have to see this.
IT Business Systems Analyst / Software Developer
Top Expert 2015
Commented:
But how I have to see "struct fdtable" in this story?
Basically, you just have to look at "struct files_struct" and "struct fdtable" together as being the file descriptor table. The fact that it is broken into 2 structs in merely an implementation detail.

Essentially, each process has 1 struct files_struct object which points (via the fdt member) to 1 struct fdtable object which contains the array of pointers and the bitmask of close_on_exec flags.
nociSoftware Engineer
Distinguished Expert 2018

Commented:
@mccarl is correct here

The cause is in history the fdtable is implemented (naively):

//attempt #1
#define N_FILES 1024
struct fdtable {
     struct file *fd
     bit close_on_exec;
     bit open_fds;
} fdtable[N_FILES];

This has a few drawbacks. the first fdtable entry = alligned then others are not... unaligned access depends on hardware involved can take twice -> 1000-times as much time to resolve as un alligned data... Some hardware just disallows it and the OS will abort the program with something like BUS-error.
//so Attempt #2:
#define N_FILES 1024
struct fdtable {
     struct file *fd[N_FILES]
     bit close_on_exec[N_FILES];
     bit open_fds[N_FILES];
};

Ok much better, the fdtable will be aligned, so will the others... trouble is  the bit type doesn't  exist...
so we pack them in natural elements.. (longwords f.e. 32 bits/entity).
//Attempt #3
#define N_FILES 1024
struct fdtable {
     struct file *fd[N_FILES]
     long close_on_exec[N_FILES/(sizeof(long)*8)];
     long open_fds[N_FILES/(sizeof(long)*8)];
}xyz;
#define IS_SET (base,x) (base[x/(sizeof(long)*8)] & (1 << x%(sizeof(long)*8))
where  test would be IS_SET(xyz.close_on_exec,33)

Now we still have a problem we don;t want to recompile the kernel when the number of files allowed per process changes... (Not this was actually needed around  1990)....
So modern kernels allocate memory dynamicaly..

hence the structure is required to become
//Attempt #4
#define N_FILES 1024
struct fdtable {
     long max_fds;     /* current actual value for N_FILES... */
     struct file **fd;
     long *close_on_exec;
     long *open_fds;
}xyz;

With some init functions to allocate memory and assign it to the fields to get the actual arrays.
If a start student is confronted with the latest version then it isn't about 4-5 concepts around files...
but also memory management, how arrays are implemented indirections, alignment to hardware boundaries,
how compilers implement alignment, SMP cache coherencemanagement  etc. etc. (20+ concepts (and counting) in one heap).

(oh concept extra there are system where you need to be capable to open 10000's of files in a process...
so having ALL processes have a large table one can build using smaller tables and chain those together...)
Then there is an extra issue: performance of CPU's made them highly asynchronous, so multiple internal units work processing instructioncauseing out of order execution.
ensure some stuff is consistent means there need to be check points on save data elements.
this mechanism is called RCU:  see here for explanation, i won't go into that on EE.   https://lwn.net/Articles/262464/

Author

Commented:
Thanks a lot!! So now I see it as:

FILE DESCRIPTOR "TABLE"
FD | flags*| pointer to entry in open file "table"

* close_on_exec, open_fds

And I can compare it with: "struct files_struct" and "struct fdtable". This is exactly what I wanted to know.

It's still a bit weird why  it is broken into 2 structs, but probably tomorrow I'll dive a little bit further into it, so I have at least a little bit an idea what's behind it. Apparently is has something to do with terms as "alligned", so first I need some more time to read about this. So I'll do my homework and then I'll come back to this question, but I think @mccarl just answerd this question with "struct files_struct" and "struct fdtable" (and close_on_exec, open_fds).
mccarlIT Business Systems Analyst / Software Developer
Top Expert 2015

Commented:
I could well be totally wrong here but I believe the main reason for both files_struct and fdtable is, as @noci mentioned above, to allow for a dynamic number of fdtable entries. In my (very) layman's terms, it allows a new "fdtable" copy to be built, maybe of a different size though, and for the files_struct->fdt pointer to "switch" to the new fdtable in one atomic operation. But yeah, as I said, possibly wrong with this, but it is how I see it.

Author

Commented:
Thanks mccarl!! That's already a bit easier to understand. For now I decided to skip the question why they are two separate structures. Probably that's just a couple of steps too much for me for now.

I also started to read a bit about structural alignment, but probably that's a big separate subject and for understanding to basics of Linux, I don't really need to know about it.

So I think this question has been answered by you. Probably later today I'll read the posts again and I'll mark the posts as helpful/solution. Thanks a lot! (also noci)
nociSoftware Engineer
Distinguished Expert 2018

Commented:
Structural alignment is a hardware thingy, not a linux one. You will have to readup on CPU design to grasp that one..
Involved elements: I-Cache, D-Cache, 1st level Cache, 2nd Level cache, 3rd-Level Cache, Cache coherence protocols between cores, chipsets, memory controllers, page size, Out of order execution.
Then you need to do that for major CPU's: Intel x86, Intel x86-64, AMD x86-64, ARM v4 - ARM v7.
(Within the Intel Line & Amd Line the various generations...)

So i would think you have your work cut out if you want to know it all...
Again Good Luck.

Author

Commented:
Thanks noci! That's the reason why I'm gonna skip that part (probably not only now, but also in the future).

I really don't need to know everything ;). But with the "file descriptor table", "open file table" it's different. That are just the basics when you start reading about Linux on the internet. Then at least I want to understand that a bit more. I think I understand the "file descriptor table" now good enough...at least what I wanted to understand about it. However, with for example the "open file table" (other question) I'm not there yet, for example with the "modes" et cetera. And such subjects are actually directly related to the "open file table". With e.g. structural alignments that's not the case, so I don't need to know it.

But anyway thanks for your explanation and advices about it.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial