JYoungman


atomic locking over NFS with link(2),stat(2)

Because of the potential for the loss of RPC reply packets on NFS, the O_EXCL option of open does not work.  For this reason, one can/should open (O_CREAT|O_EXCL) a [temporary] file, link(2) this to the name of the file we really wanted to open [discarding the return value of the link(2) call], and then stat the newer name (the name we really wanted to create).  If the hard link count is 2, the process has worked.  Otherwise the process has failed.

What I'm not so clear on is what the backout strategy for failure is that allows us to retry later.  If the strategy fails and the link count is >2, presumably one should unlink() the new name.  Is this right?  But what if the link count is only 1?  Do I unlink anything then?  What actions do I take if the stat(2) fails?  What are the pros/cons of using lstat/fstat here instead of stat?  Which of these is correct?

(This lockfile strategy doesn't have to be compatible with any MTA or anything; I just need to atomically create a lockfile whose name is fixed).
elfie

As far as I know, advisory locks are recommended over NFS.

You can also try setting the file access bits to 6664. For normal files, the setuid and setgid bits affect the locking mechanism.


From the man pages on HPUX:

man lockf(2)...

Only advisory record locking is implemented for NFS files.


JYoungman

ASKER

Advisory record locks are completely inappropriate for this form of locking.  The lockfile is just that -- a lockfile, whose existence signals the "locked" state of some other file.  Hence record locks on the lockfile are not useful at all.  In addition, not all systems provide even advisory locks over NFS.  Setting the setuid bit to request mandatory locking adds less still.

If my question was unclear, I apologise (please say so if it was!).

The man pages for open on HPUX and Solaris do not state any restriction on using the O_EXCL flag!  AIX 4.1 states:
The O_EXCL flag is not fully supported for Network File Systems (NFS). The NFS protocol does not guarantee the designed function of the O_EXCL flag.

I assume you are sure of the support for the O_EXCL flag on your UNIX (which is it, anyway?).

I am not clear on your logic.  I assume you want to create the [temporary] file in the local file system and link it to a new file on the NFS file system.  But hard links are not allowed across file systems, right?
There is no "your UNIX".  The idea is to write *portable* programs.  The program I work on works on at least twelve different varieties of Unix and at least one OS that isn't Unix at all.

As for O_EXCL, it is NOT POSSIBLE to make it work correctly and reliably over NFS because NFS is nonstateful and has no RPC call for "open" at all; the create RPC call still isn't enough since the O_EXCL flag can't be passed on to the server.

Hence the link(2) scheme I outline above.

Why discard the return value of link?  Surely if link fails, you can be certain you haven't got the lock.  Similarly, if you can't lstat the new file, you should assume you haven't got a lock.  If lstat succeeds though, check the thing isn't a symlink, and check the link count.  If it is a symlink, or the link count isn't 2, you haven't got the lock, and must assume something else is messing with your locking mechanism.

I'm probably missing something, my knowledge of low-level NFS is zero, though as an alternative locking mechanism, at least on systems that support symlinks, how about using symlink(lock_file, lock_file) to grab a lock?
It's imperative to discard the return value of link() because this is being done over NFS.   One can get an RPC failure and an error value returned from link(), even if the actual filesystem operation on the remote server succeeded.  The RPC reply packet(s) may get lost or dropped, for example, leading to a failure report when in fact the hard link did get made.  That is why the check is made with stat(2) afterward -- if the link count is 2, the link(2) operation must have succeeded on the server even if the reply had been lost.  If the stat(2) RPC fails, then the link may or may not have worked...you can't tell.

The thing with symlink() is that it's no more likely to succeed than link(), and you can't check the result by looking at the link count on the target.

Ok. now I sort of understand a bit more.  I think it's unsafe to assume anything about the lock file iff the link count isn't two, and you can't ensure that it is linked to the original file (maybe stat/lstat the tmp file and the lock and compare st_dev and st_ino might do this).  If other processes are trying to create the lock using the same mechanism, and all you're doing is  checking the link count, there are race points which I can't see a way of overcoming.
So how about encoding a lock key such as host:pid:time within a symlink, and then reading it back with readlink(2)?  If it matches, you got the lock.
Now we're rolling :-)
System call return values occur after the = sign.

Host A                            Host B
open("A:100", O_CREAT|O_EXCL)=0  
                                   open("B:100", O_CREAT|O_EXCL)=0
                                   symlink("B:100", "z.foo")=-1
                                   (RPC reply lost but operation
                                   succeeded on server)
symlink("A:100", "z.foo")=-1
(EEXIST)

So at this point both parties will do a readlink(2) on "z.foo" and determine that it actually points to B:100.  A knows that wasn't its file and so determines that it has failed to acquire the lock.

OK, I understand how that works.  It seems to me that the same approach with hard links has all the same advantages as well as working on Unix systems which don't support symbolic links.
Another problem is that the hostname of machine A may itself be longer than the filename length limit of the NFS server (or for the case of locking a local file, be longer than its own filename length limit).   Hence I think I'll use a hard link.

Now that we're on the same page, what should lockers do if they determine that the readlink() or stat() call has returned the wrong pathname/failed (as appropriate for symbolic and hard links respectively)?  What is the correct backout strategy?

There is no need to create any temporary files.  The lock-key is stored directly in the symlink, eg.
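 sprintf(lock_key, "%s:%d.%ld", this_host, (int)getpid(), (long)time(0));
 symlink(lock_key, lock_file);  /* result ignored; checked via readlink below */
 /* leave room for the terminator: readlink does not NUL-terminate */
 if ((len = readlink(lock_file, buf, sizeof(buf) - 1)) != -1) {
  buf[len] = '\0';
  if (! strcmp(lock_key, buf)) {
   /* lock obtained, do the stuff */
   unlink(lock_file);
  }
 }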
So, this doesn't hit old 14char filename limits.
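
For anyone who wants to try the fragment above, here is a self-contained sketch of the same idea wrapped in two functions. The names acquire_symlink_lock, release_symlink_lock and KEY_MAX are made up for illustration, the host name is passed in by the caller, and nothing here deals with stale locks or a failed readlink beyond treating it as "no lock".

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define KEY_MAX 256   /* assumed upper bound on the key length */

/* Hypothetical helper: returns 0 if this process now holds the lock,
   -1 otherwise. host is whatever name identifies this machine. */
static int acquire_symlink_lock(const char *lock_file, const char *host)
{
    char key[KEY_MAX], buf[KEY_MAX];
    ssize_t len;
    int n;

    /* Build the key host:pid.time, as in the fragment above. */
    n = snprintf(key, sizeof key, "%s:%ld.%ld",
                 host, (long) getpid(), (long) time(NULL));
    if (n < 0 || (size_t) n >= sizeof key)
        return -1;              /* key would have been truncated */

    /* Result ignored on purpose: over NFS an error may be reported
       even though the symlink was created on the server. */
    (void) symlink(key, lock_file);

    /* Read the link back; only a matching key means we own the lock. */
    len = readlink(lock_file, buf, sizeof buf - 1);
    if (len == -1)
        return -1;              /* cannot tell; treated as "no lock" */
    buf[len] = '\0';            /* readlink does not NUL-terminate */

    return strcmp(key, buf) == 0 ? 0 : -1;
}

/* Hypothetical helper: drop the lock by removing the symlink. */
static int release_symlink_lock(const char *lock_file)
{
    return unlink(lock_file);
}
```

A caller would do acquire_symlink_lock("z.foo", this_host), do its work if the result is 0, then call release_symlink_lock("z.foo"). A loser must not unlink the lock file, since it belongs to someone else.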

As to what happens if stat/readlink fails, I'm unsure.  I must stress NFS is not my gig, but I would be surprised if the kernel doesn't attempt to re-satisfy a "read" operation if it doesn't get a response within a certain time.  And I assume that the call will block until the kernel does get a response or a critical amount of time has passed.  I remember from way back being thoroughly hacked off when anyone shut down any NFS server I had mounted filesystems from, because there was a good chance my machine would hang until the NFS box was back to life.

The problem I have with link mechanism, is the link count will not indicate who obtained the lock ie.
 process A                       process B
 link(tmp_fileA, lock_file)
                                 link(tmp_fileB, lock_file)  /* this link fails */
                                 stat(lock_file, &buf)       /* this succeeds */
                                 buf.st_nlink == 2
 stat(lock_file, &buf)           /* this also succeeds */
 buf.st_nlink == 2

So who has the lock?  If you resort to comparing the inode of tmp_file against the inode of lock_file, you'll get closer to resolving the problem, but 32bit rather than 16bit inode numbers are a recent innovation.
You'd stat your *own* temporary file, not the lockfile itself.
But this doesn't solve the problem of a clash of temporary file names; I'm not sure that's solvable.   Well, you've earned the points.
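
Putting the pieces of the thread together, the hard-link variant might be sketched roughly as below. This is only an illustration: acquire_link_lock and release_link_lock are hypothetical names, the caller is assumed to supply a temporary name that is unique to it (for example built from host and pid) in the same directory as the lock, and the case where stat itself fails is simply reported as "no lock", which can leave a stale lock behind if the link was in fact made.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the lock was acquired, -1 otherwise.
   tmp_name must be unique to this locker and on the same filesystem as
   lock_name. Over NFS the uniqueness has to come from the name itself,
   since O_EXCL cannot be relied on there. */
static int acquire_link_lock(const char *tmp_name, const char *lock_name)
{
    struct stat st;
    int got = -1;
    int fd = open(tmp_name, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1)
        return -1;              /* could not create our own temporary file */
    close(fd);

    /* Result deliberately ignored: the reply can be lost even though
       the link was made on the server. */
    (void) link(tmp_name, lock_name);

    /* Ask about our OWN file, not the lock name. Its link count is 2
       only if our link() is the one that created lock_name. */
    if (stat(tmp_name, &st) == 0 && st.st_nlink == 2)
        got = 0;

    /* Backout: the temporary name is removed in every case. A loser
       never touches lock_name; a winner keeps it as the only link. */
    unlink(tmp_name);
    return got;
}

/* Hypothetical helper: only the lock holder should call this. */
static int release_link_lock(const char *lock_name)
{
    return unlink(lock_name);
}
```

The design choice here is that a process only ever unlinks names it created itself: its temporary file always, and the lock name only after it has confirmed ownership. A failed attempt therefore leaves nothing behind, so it can simply be retried later.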

ASKER CERTIFIED SOLUTION
ecw
