nfs_statfs: statfs error = 116

I have a file server (linux12) running SuSE 9.0. It serves several disks to about 20 clients.
It was working fine until about 4 weeks ago; since then, occasionally (10-20 times per day, on single, nearly always different, clients) mounts defined in the client's /etc/fstab get a stale NFS handle.
To repair this stale file handle I can react in 3 ways:
1. re-mount the dropped file system on the client on e.g. /mnt -> the original mount comes back too
2. unmount the dropped file system and re-mount it on the same mount point on the client
3. restart the NFS server on the file server (rcnfsserver restart).
I don't know exactly when this behaviour started; maybe it was around the time I made the file server a DNS server? This "stale NFS handle" failure only seems to happen when the client is in use and affects "random" clients, normally during the daytime. It normally affects only one client at a time. I then repair the condition with method 3 (on the server).
Here is a log from a client called linux17 (mounts with "-" are "stale NFS handle"):

linux17# df
Filesystem           1K-blocks      Used Available Use% Mounted on
linux12:/users               -         -         -   -  /users
linux12:/noquota-users/commands
                             -         -         -   -  /commands
linux12:/noquota-users/IMDM  -         -         -   -  /IMDM
(all 3 mounts dropped)
linux17# mount linux12:/users /mnt             (repair type 1)
linux17# df
Filesystem           1K-blocks      Used Available Use% Mounted on
linux12:/users        32257404   3945452  26673324  13% /users
linux12:/noquota-users/commands
                             -         -         -   -  /commands
linux12:/noquota-users/IMDM  -         -         -   -  /IMDM
linux12:/users        32257404   3945452  26673324  13% /mnt
linux17# umount /IMDM             (repair type 2)
linux17# df
Filesystem           1K-blocks      Used Available Use% Mounted on
linux12:/users        32257404   3945452  26673324  13% /users
linux12:/noquota-users/commands
                             -         -         -   -  /commands
linux12:/users        32257404   3945452  26673324  13% /mnt
linux17# mount linux12:/noquota-users/IMDM /IMDM        (repair type 2)
linux17# df
Filesystem           1K-blocks      Used Available Use% Mounted on
linux12:/users        32257404   3945480  26673296  13% /users
linux12:/noquota-users/commands
                      32281500   3721296  26920372  13% /commands
linux12:/users        32257404   3945480  26673296  13% /mnt
linux12:/noquota-users/IMDM
                      32281500   3721296  26920372  13% /IMDM
linux17# tail /var/log/messages
...
Nov 18 17:02:40 linux17 in.rshd[25001]: connect from 192.124.235.182 (192.124.235.182)
Nov 18 17:02:40 linux17 pam_rhosts_auth[25001]: denied to myuser@linux12 as myuser: access not allowed
Nov 18 17:02:40 linux17 in.rshd[25001]: rsh denied to myuser@linux12 as myuser: Permission denied.
Nov 18 17:02:40 linux17 in.rshd[25001]: rsh command was 'ls -ld /users/myuser/.cshrc ; uptime'
Nov 18 17:03:21 linux17 kernel: nfs_statfs: statfs error = 116
Nov 18 17:03:39 linux17 last message repeated 4 times
Nov 18 17:03:42 linux17 in.rshd[25088]: connect from 192.124.235.182 (192.124.235.182)
Nov 18 17:03:42 linux17 pam_rhosts_auth[25088]: allowed to myuser@linux12 as myuser
Nov 18 17:03:42 linux17 in.rshd[25089]: myuser@linux12 as myuser: cmd='ls -ld /users/myuser/.cshrc ; uptime'
Nov 18 17:04:17 linux17 kernel: nfs_statfs: statfs error = 116
Nov 18 17:04:44 linux17 in.rshd[25188]: connect from 192.124.235.182 (192.124.235.182)
Nov 18 17:04:44 linux17 pam_rhosts_auth[25188]: allowed to myuser@linux12 as myuser
Nov 18 17:04:44 linux17 in.rshd[25189]: myuser@linux12 as myuser: cmd='ls -ld /users/myuser/.cshrc ; uptime'
...

I'm running
linux12# rsh linux17 "ls -ld /users/myuser/.cshrc ; uptime"
every minute to detect the error condition; it shows up as "Permission denied." because the .rhosts file used for rsh authentication is on the failing /users file system.

I looked through loads of web pages stating this sporadic unmount condition.

I'd be really grateful for any help or just new ideas on how to tackle this problem!
1 Solution
 
wesly_chenCommented:
Hi,

  "stale NFS handle" happens
when the application (or a shell script) on NFS client opens a file residing on a NFS mounted directory and the file
becomes unavailable (for some reasons) so the application creates a pseudo file and reports stale NFS handle.

The reasons why a mounted filesystem can become unavailable (a combined quick-check sketch follows the list):
1. Network connection down between NFS server and clients.
==> Check the network connection for TX/RX errors.
2. The file/directory was deleted by some other application (or someone else).
==> Check whether the file/directory still exists on the NFS server.
3. The mount point on the client side has a problem (very rare). It is possible, though, if the client uses automount with a short timeout
   period (default 10 minutes) and automountd unmounts the idle mount point automatically.
==> umount and re-mount the mount point on the client side.
==> Increase the timeout period ("--timeout=" option in the autofs configuration, /etc/sysconfig/autofs?).
4. The NFS daemon on the NFS server has a problem.
==> Restart nfsd.
5. The NFS server is under heavy load and fails to respond to the NFS calls.
==> Find out which processes take a lot of CPU and kill them, or move that service to another machine.
6. Heavy NFS requests and the running nfsd threads fail to respond to those requests in time.
==> Increase the number of running nfsd threads by modifying /etc/sysconfig/nfs:
    RPCNFSDCOUNT=16  (default is 8)
    Then restart nfsd.
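
As a combined quick check for points 1, 5 and 6, something like the following on the server should do (standard tools; the exact output format varies a little between distributions):

linux12# netstat -i                    # RX-ERR / TX-ERR columns should stay at 0
linux12# nfsstat -s | head -4          # calls / badcalls / badauth / badclnt / xdrcall counters
linux12# uptime                        # server load average
linux12# ps ax | grep -c '[n]fsd'      # number of nfsd threads actually running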

I hope this will help.

Regards,

Wesly

 
riemerAuthor Commented:
Hi,

thank you very much for your instant comment!

1. I checked it; we also changed the switch - no errors seem to occur.
2. The complete mount goes funny, not just a single file, e.g.:
linux17# df
Filesystem           1K-blocks      Used Available Use% Mounted on
linux12:/users               -         -         -   -  /users
The .rhosts file I'm using for testing is never deleted.
3. I'm not using automount, the mount is specified in /etc/fstab:
...
linux12:/users /users nfs defaults 0 0
linux12:/noquota-users/commands /commands nfs defaults 0 0
linux12:/noquota-users/IMDM /IMDM nfs defaults 0 0
4. Yes, that helps, but I need to check the mounted volumes on all clients constantly because it happens 10-20 times per day, and that's the workaround I'm currently using. I'm checking every minute, so there can be up to a minute in which applications on a client die because of a stale file handle -> not very practical.
5. I'm running an "uptime" command on the server and on all clients - the server load is rarely above 0.1 and never higher than 0.6.
6. The parameter in my /etc/sysconfig/nfs was called
USE_KERNEL_NFSD_NUMBER=4  
I increased it to 16, did
linux12# rcnfsserver restart
- and now I am waiting for results ...
- ps auxww is already showing 16 nfsd processes instead of the previous 4 ones.

Thanks again!
 
riemerAuthor Commented:
Increasing the number of NFS daemons on the server (suggestion 6) didn't help - after about 1 1/2 hours one of the clients lost the /users drive again. All the other clients had no problem at more or less the same time (a script on the server executes
linux12# rsh $CLIENT "ls -ld /users/myuser/.cshrc ; uptime"
for each of the available clients, and only one client failed. An automated
linux12# rcnfsserver restart
cured that mount failure).

I don't seem to be able to provoke the loss of the mounted drive; the only pattern I know of is that nearly all cases happen within office hours, when most of the clients are used more heavily - that's why I thought suggestion 6 might be promising.

Any other idea?
 
wesly_chenCommented:
Hi,

  As for your situation, I would focus more on 1, 3 (even without automount) and 4.
When you see this "stale NFS handle" happen, run "nfsstat -s" on the NFS server
to check "badcalls", "badauth", "badclnt", "xdrcall".
And run "nfsstat -c" on the client to check "retrans".

   On the NFS server, add "async" to /etc/exports
---
/users  *(rw,async)
---
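
If a full "rcnfsserver restart" is too disruptive, the changed export options can normally be applied in place with exportfs (the standard nfs-utils tool; assuming it is installed on SuSE 9.0):

linux12# exportfs -ra      # re-read /etc/exports and re-export everything
linux12# exportfs -v       # verify that /users now shows the async option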
Besides, check the syslog files for any suspicious error messages,
such as memory-related errors, filesystem errors, network packet drops....

   For the client side, pick one machine and add "nfsvers=3,tcp" to the mount options:
--/etc/fstab
linux12:/users /users nfs defaults 0 0  ===> linux12:/users /users nfs nfsvers=3,tcp,async 0 0
----

Regards,
 
riemerAuthor Commented:
Thank you very much for your further ideas!

1. I changed "sync" into "async" in the server's /etc/exports and did
 
linux12# rcnfsserver restart
Shutting down kernel based NFS server                                                                                 done
Starting kernel based NFS server                                                                                      done
linux12#

- still, a little later, files on /users on linux05 had "Stale NFS file handle" (all other clients were OK):

Here are some results with client linux05 and server linux12:

linux05# ls -l /users/myuser/.rhosts
/users/myuser/.rhosts: Stale NFS file handle.
linux05# df
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users       77371252437321868667518976         0 77371252437321868667518976   0% /users
...
linux05#

linux12# nfsstat -s
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
79575461   0          0          0          0      
Server nfs v2:
null       getattr    setattr    root       lookup     readlink  
0       0% 119559 16% 22654   3% 0       0% 542253 74% 133     0%
read       wrcache    write      create     remove     rename    
417     0% 0       0% 28031   3% 2676    0% 2665    0% 2649    0%
link       symlink    mkdir      rmdir      readdir    fsstat    
2       0% 0       0% 0       0% 0       0% 2026    0% 78      0%

Server nfs v3:
null       getattr    setattr    lookup     access     readlink  
21094   0% 15432198 19% 136934  0% 11761921 14% 6699005  8% 8065    0%
read       write      create     mkdir      symlink    mknod      
42458407 54% 1372341  1% 67222   0% 236     0% 47      0% 9       0%
remove     rmdir      rename     link       readdir    readdirplus
62401   0% 124     0% 63850   0% 1349    0% 99455   0% 152500  0%
fsstat     fsinfo     pathconf   commit    
45408   0% 5342    0% 28      0% 197592  0%

linux12#

linux05# mount linux12:/users /mnt
linux05# df
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users        32257404   3935872  26682904  13% /users
linux12:/users        32257404   3935872  26682904  13% /mnt
...
linux05#


2. I put "nfsvers=3,tcp" into one client's /etc/fstab but cannot unmount /users as it is "busy" - is there any other way than rebooting?

Thank you very much for all your help!
 
wesly_chenCommented:
> but cannot unmount /users as it is "busy"
As root:
# fuser -k /users
and
# lsof
to check for any open files under /users and kill those processes.

Then you should be able to umount /users ("umount -f" might work sometimes, but is not recommended).
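
Putting that together, a minimal sketch for freeing and re-mounting /users (fuser's -m/-k and lsof are standard; the exact sequence below is a suggestion, not something tested on this setup):

linux17# lsof | grep /users       # processes with files open under /users
linux17# fuser -vm /users         # the same information, per mount point
linux17# fuser -km /users         # kill everything still using the mount (careful: kills user jobs)
linux17# umount /users
linux17# mount /users             # re-mount with the new options from /etc/fstab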

I don't see any suspicious activity in the "nfsstat -s" output, so I suspect this might be related to a network issue.
Change the network cable, change the switch port (even a different switch), or even the network card, and see whether there is any improvement.

Wesly
 
riemerAuthor Commented:
Maybe the change of "sync" into "async" in the server's /etc/exports did the trick?

Apart from the one incident I reported on linux05 (which happened less than 1/2 hour after the change), we have had no "Stale NFS file handle" for about 9 hours so far (we normally have 10-20 in that period).

I'll wait for a couple of days and then confirm the solution. - Thanks a lot!!

Martin
 
wesly_chenCommented:
>Maybe the change of "sync" into "async" in the server's /etc/exports did the trick?
Yes, it definitely improves the performance of NFS. However, I would like to let you know the pros and cons of "async":
---man exports ----
async  This option allows the NFS server to violate  the  NFS  protocol
          and  reply to requests before any changes made by that request
          have been committed to stable storage (e.g. disc drive).
          Using this option usually improves performance, but at the  cost
          that  an unclean server restart (i.e. a crash) can cause data to
          be lost or corrupted.
---------------------
It should be OK if you don't care about real-time data sync between memory and disk.

Wesly
 
riemerAuthor Commented:
I have a very strange and disappointing effect:

For more than 24 hours our NFS problem seemed to be solved and I nearly got the champagne out :-)
Then I started to clean up the clients' /etc/fstab files, which at that time contained the IP addresses instead of the server's name "linux12" (from a previous attempt to avoid any problems caused by DNS).
At that stage a "df" command showed the relevant mounts twice, e.g.:
linux09# df
Filesystem           1K-blocks      Used Available Use% Mounted on
...
192.124.235.182:/users
                      32257404   3940340  26678436  13% /users
192.124.235.182:/noquota-users/commands
                      32281500   3721296  26920372  13% /commands
linux12:/users        32257404   3940340  26678436  13% /users
linux12:/noquota-users/commands
                      32281500   3721296  26920372  13% /commands
...
linux09#

This was caused by editing /etc/fstab (old version: /etc/fstab.4):
linux09# diff  /etc/fstab  /etc/fstab.4
13,14c13,14
< 192.124.235.182:/users                        /users  nfs     defaults 0 0
< 192.124.235.182:/noquota-users/commands       /commands       nfs     defaults 0 0
---
> linux12:/users                        /users  nfs     defaults 0 0
> linux12:/noquota-users/commands       /commands       nfs     defaults 0 0
linux09# mount -a
...
In order to finally get rid of these duplicate entries I rebooted all modified clients. During that time the original problem re-occurred several times. So this morning (after 18 hours of recurring NFS errors) I edited the clients' /etc/fstab files and ran "mount -a" to get back to the state the clients were in 24 hours ago (duplicate entries for each disk served by linux12, once by name and once by IP address). But the problem remained the same.

A few weeks ago we physically changed our main switch from a recent 1 Gb device back to an older 100 Mb Ethernet device which logs errors. No errors have been logged on that line from 4 Nov until now.
I would have to order a replacement network card for the server if there is a chance it is faulty - but wouldn't that have shown up in "nfsstat -s" and in the error counts on the switch?

I had changed the following
--/etc/fstab
linux12:/users /users nfs defaults 0 0  ===> linux12:/users /users nfs nfsvers=3,tcp,async 0 0
----
on 2 clients too, but changed it back to the original as it seemed to have no effect. Should I try that again, as it is your only suggestion I have not yet tried on all clients?

Do you have any other suggestions?
 
wesly_chenCommented:
> I had changed the following
> --/etc/fstab
> linux12:/users /users nfs defaults 0 0  ===> linux12:/users /users nfs nfsvers=3,tcp,async 0 0
>----
> on 2 clients too, but changed it back to the original as it seemed to have no effect. Should I try that again, as it is your only suggestion I have not yet tried on all clients?

"async" is good enough; there is no need for "nfsvers=3,tcp" on all the clients, since some clients may not support NFS version 3.

> A few weeks ago we physically changed our main switch from a recent 1Gb back to an older 100Mb
> Ethernet device which logs errors.
The NFS server should have a Gigabit NIC and connect to a Gigabit switch for more network bandwidth.
Otherwise heavy NFS requests will jam up the network.

Wesly
 
riemerAuthor Commented:
I ordered a replacement network card for the server, as I see that as one of the last chances ...
 
riemerAuthor Commented:
I've been busy trying to reproduce and analyze this error - with some success:
I can provoke the error (after waiting for about 30 minutes) by
starting a program and opening its "File / Open ..." dialog.
At the same time I continuously (at a 3-second interval) run the "df" command in a shell loop in another
window.
The "File / Open ..." dialog starts in the user home directory /users/myuser .
Then I walk up the tree to the root / folder (via double click).
Instantly (at 13:47:35, see <==== click to /), when I reach the root folder, the mounted volumes
disappear:

view of shell loop:
linux18# /users/root/cmd_loop.scr df 3
Fri Dec  3 13:47:32 CET 2004 ... finish via ^C
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users        32257404   3620152  26998624  12% /users
linux12:/noquota-users/commands
                      32281500   3721296  26920372  13% /commands
linux12:/noquota-users/IMDM
                      32281500   3721296  26920372  13% /IMDM
linux12:/noquota-users/IMI
                      32281500   3721296  26920372  13% /mnt_IMI
linux16:/data_l16c   288354844   9884964 263822220   4% /data/data_l16c
Fri Dec  3 13:47:35 CET 2004 ... finish via ^C             <==== click to /
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users               -         -         -   -  /users
linux12:/noquota-users/commands
                             -         -         -   -  /commands
linux12:/noquota-users/IMDM
                             -         -         -   -  /IMDM
linux12:/noquota-users/IMI
                             -         -         -   -  /mnt_IMI
linux16:/data_l16c   288354844   9884964 263822220   4% /data/data_l16c
linux12:/data_l12b   157558536  49955692  99599264  34% /data/data_l12b
Fri Dec  3 13:47:38 CET 2004 ... finish via ^C
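
(The /users/root/cmd_loop.scr helper used above is not included in the thread; a minimal sketch of such a repeat-a-command loop, with the name and the 3-second interval taken from the output above and the implementation assumed, could look like this:)

#!/bin/sh
# cmd_loop.scr - run a command repeatedly with a timestamp
# usage: cmd_loop.scr <command> <interval-seconds>
CMD=$1
INTERVAL=$2
while true
do
    echo "`date` ... finish via ^C"
    $CMD
    sleep $INTERVAL
done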

I also noted that linux12:/data_l12b always appears in the "df" listing; it is an automounted
device referenced by a top-level link "/scratch". I removed this (the only top-level link to
/data/data_l12b), waited for 30 minutes, accessed the root directory - and NO ERROR!
I could then reproduce the error by accessing /data/data_l12b directly, walking down the /data path.

So it seems that automounting linux12:/data_l12b causes the error for linux12:/users and the linux12:/noquota-users mounts.

Have you got any idea how to bypass that problem?

Martin
 
riemerAuthor Commented:
I just found that manually unmounting linux12:/data_l12b causes the error too.
 
wesly_chenCommented:
> 3. I'm not using automount, the mount is specified in /etc/fstab
In your previous post you said you didn't use automount;
now you have found that you do use it.
Which automount daemon do you use, Amd or Autofs?

For autofs, on the NFS client machine, edit /etc/init.d/autofs (the path may vary),
search for "OPTIONS" and add the mount options
noatime,async,hard,intr,nfsvers=3
and the daemon option
--timeout=1200   <=== 20 minutes

Then restart autofs on the NFS client (a sketch of where these settings can go on SuSE follows below).

This might help. With automount, please don't expect too much when your network traffic gets heavy.
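
On SuSE 9.0, where /etc/init.d/autofs reads its options from /etc/sysconfig/autofs (as noted later in this thread), the daemon option would go there and the NFS mount options into the automounter map; a sketch under those assumptions (the map file name and entry are examples only, not taken from this setup):

---/etc/sysconfig/autofs
AUTOFS_OPTIONS="--timeout=1200"
---/etc/auto.data  (example map for /data)
data_l12b  -rw,hard,intr,noatime,nfsvers=3  linux12:/data_l12b
---
linux17# rcautofs restart      # restart the automounter to pick up the changes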

Wesly
 
wesly_chenCommented:
> I just found that manually unmounting linux12:/data_l12b causes the error too
Then the problem is most likely on the NFS server linux12.
1. Check the disks:
shutdown -rF now   <== force fsck on reboot

2. Check the network connectivity. It can be improved with a Gigabit NIC.

3. Check the system log for error messages (a quick-check sketch for points 1-3 follows below).

4. Add a CPU or upgrade to a faster CPU. Add more memory.
    Change to faster hard disks (15k rpm Fibre Channel disks), use a hardware RAID controller.

Those are ways to improve the performance of the NFS server.

For really heavy disk access, you might go for a dedicated NAS (Network Appliance) or SAN solution.
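
A few concrete commands for points 1-3, assuming standard tools on SuSE 9.0 (and that the boot scripts honour /forcefsck, as most distributions of that era did):

linux12# touch /forcefsck && reboot      # alternative way to force fsck on the next boot
linux12# ethtool eth0                    # negotiated speed/duplex and link state of the NIC
linux12# grep -iE 'error|fail|stale' /var/log/messages | tail -50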

Wesly
 
riemerAuthor Commented:
Thank you very much for your suggestions!

I'll try them out on Monday as I'm at home already (7pm CET) and cannot access the machines at (re)boot time if they get stuck.

Just a few answers to your questions which I can already find out from home:

The command for controlling automount is "rcautofs", so I suppose I'm using autofs (I basically just clicked the automount option at installation time). The disks causing problems are all mounted permanently via /etc/fstab; that's why I wrote "I'm not using automount", as the other, automounted disks never seemed to cause any problems. I didn't realize that an automount is triggered when I just browse through a directory (/ in my case) that contains links pointing to a directory on an automount volume.

I'd first also like to try to (auto)(un)mount other disks from a client to see whether the error is caused by just one disk or by all automount disks, and then try fsck on the server linux12 if only one disk fails. We already had one reboot (after a crash caused by a faulty program - an endless memory hog) during the time of the error - and all fscks were fine (and took over 1 hour, as there is nearly 1 TB of disk space).

If unsuccessful, I'll try to modify a client's /etc/sysconfig/autofs, which contains the only valid line:
AUTOFS_OPTIONS=""
On SuSE 9.0 /etc/init.d/autofs seems to be only a script file which reads its options from there.
As I can now provoke the error within 30 minutes, testing the effect should be easier than just waiting for the few events each day.

This week I also monitored the server's network load on the switch and I don't think load is the problem, as it was always well under 10%. So I hope I can leave hardware upgrades aside for a while; a Gbit network is already available (I put the link back to 100 Mbit as that was one of the changes around the time of the first error).

Thanks again - I hope I'll have more results on Monday.

Martin
 
riemerAuthor Commented:
More results:
ALL disks currently mounted from linux12 (via /etc/fstab AND automount) go into the error condition when automounting OR mounting any other linux12 disk, e.g.:
linux18# ls /data/data_l12c   (see <==== )
...
at 15:57:44

causes: (view of shell loop)
linux18# /users/root/cmd_loop.scr df 3
Mon Dec  6 15:57:41 CET 2004 ... finish via ^C
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users        32257404   3621936  26996840  12% /users
linux12:/noquota-users/commands
                      32281500   3721296  26920372  13% /commands
linux12:/noquota-users/IMDM
                      32281500   3721296  26920372  13% /IMDM
linux12:/noquota-users/IMI
                      32281500   3721296  26920372  13% /mnt_IMI
linux12:/data_l12f   157558536 128028196  21526760  86% /data/data_l12f
linux12:/data_l12e   157558536 126972080  22582876  85% /data/data_l12e
linux12:/data_l12d   157558536 103869740  45685216  70% /data/data_l12d
Mon Dec  6 15:57:44 CET 2004 ... finish via ^C      <====  ls /data/data_l12c
Filesystem           1K-blocks      Used Available Use% Mounted on
...
linux12:/users               -         -         -   -  /users
linux12:/noquota-users/commands
                             -         -         -   -  /commands
linux12:/noquota-users/IMDM
                             -         -         -   -  /IMDM
linux12:/noquota-users/IMI
                             -         -         -   -  /mnt_IMI
linux12:/data_l12f           -         -         -   -  /data/data_l12f
linux12:/data_l12e           -         -         -   -  /data/data_l12e
linux12:/data_l12d           -         -         -   -  /data/data_l12d
linux12:/data_l12c   157558536 149554956         0 100% /data/data_l12c
Mon Dec  6 15:57:47 CET 2004 ... finish via ^C
...

(I could not reproduce the error by unmounting.)

As linux12:/data_l12a-f are all separate disks, I don't think it's an individual disk problem (/, /users and /noquota-users are on the same disk).

Thinking about anything special in linux12's setup, I can only mention that most disks are on a RAID controller. Via "yast2" I can see a "Triones HPT370A" and an "Adaptec AHA-2940AU Single", and that the first 3 disks are EIDE while the last 4 (data_l12c-f) are on the RAID controller. Still, as the disks are working locally without any fault, I doubt that this could be the problem.

Thanks for any help,

Martin

PS: Knowing how to avoid the error is already a great help - the number of errors dropped from 10-15 to only 2 today.
 
wesly_chenCommented:
> Still, as the disks are working locally without any fault I doubt that this could be the problem
Then you might narrow it down to the NFS server daemon, applications which create heavy
network traffic or interrupt NFS, and network issues.

In most of my experience, it is aging hardware that causes network or system issues.

Wesly
 
riemerAuthor Commented:
After some workarounds, like a script that automatically runs
# rcnfsserver restart
if the result of
# rsh -n nfsclient ls /users/myfile
contains "Permission denied"
(checking every 60 seconds),
I gave up and installed a new server with SuSE 9.2, which is now working absolutely fine.
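
(For reference, a minimal sketch of such a watchdog, with the client names, the test file and the 60-second interval taken from this thread and everything else assumed:)

#!/bin/sh
# Hypothetical watchdog run on the server linux12: probe each client once a
# minute via rsh; a "Permission denied" means that client cannot read its
# .rhosts on /users, i.e. the mount has gone stale, so restart the NFS server.
CLIENTS="linux05 linux09 linux17 linux18"
while true
do
    for CLIENT in $CLIENTS
    do
        if rsh -n $CLIENT "ls /users/myuser/.cshrc" 2>&1 | grep -q "Permission denied"
        then
            echo "`date`: $CLIENT reports stale /users - restarting NFS server"
            rcnfsserver restart
            break
        fi
    done
    sleep 60
done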

Another piece of advice:
Via another bug, which caused
# mount -a
not to be executed successfully at boot time,
I learned that it seems advisable to use
# yast2
as much as possible and to avoid manually editing /etc/fstab and other system files.
 
CetusMODCommented:
PAQed with points refunded (500)

CetusMOD
Community Support Moderator