Solved

NFS anomalies

Posted on 2002-03-04
Last Modified: 2010-03-18
I am a user (without root privileges) on a small academic cluster of about 15 homogeneous Linux PCs. They are supposed to share the file system which contains the users' home directories. I am a Linux newbie and know nothing about Linux networking.

After the cluster came online about 2 weeks ago and I moved my home directory to the server there, I started having strange problems, mostly when saving files in applications.

1) Netscape frequently gives "Error saving bookmarks file."

2) Emacs frequently gives "I/O error" when I try to save a file.

3) When I compile a file using g++ -c, the compiler often reports a "no space left on device" error. I don't believe that the disk space available to the compiler is insufficient.

4) When I try to "more" a file which I have recently saved in emacs, the result is the same as "more"ing an empty file, i.e. the file exists but appears to have no contents. However, ls -l on the same file reveals that the file is in fact 23KB or so.

5) When I reload files, I sometimes find mysterious control characters in files which I had been working on, obliterating portions of the text which I had entered.

**

The frequency of the above effects depends on which machine in the cluster I started the application from.

 Please explain to me what is the likely cause of and remedy for the above. What exactly should I say to my system manager? I'm not sure that he is an expert either.

 Thank you very much

 Victor
Question by:glebspy
14 Comments
 
LVL 4

Expert Comment

by:MFCRich
What do you see when you type 'mount' at the prompt?
 
LVL 1

Author Comment

by:glebspy
On one workstation (the one where it most often fails) I see
vil@rdtm07:~>mount
/dev/hda1 on / type ext3 (rw)
none on /proc type proc (rw)
usbdevfs on /proc/bus/usb type usbdevfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
/dev/hda3 on /usr type ext3 (rw)
automount(pid1074) on /nfs type autofs (rw,fd=5,pgrp=1074,minproto=2,maxproto=3)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
rdfs:/home on /nfs/home type nfs (rw,nosuid,soft,intr,rsize=8192,wsize=8192,addr=130.158.109.13)
rdfs:/soft/linux on /nfs/local type nfs (rw,nosuid,soft,intr,rsize=8192,wsize=8192,addr=130.158.109.13)


On a machine where I experience fewer problems, I find
the following difference
vil@libra1:~>diff l1mount rdmount
7,8c7
< automount(pid1044) on /nfs type autofs (rw,fd=5,pgrp=1044,minproto=2,maxproto=3)
< rdfs:/home on /nfs/home type nfs (rw,nosuid,soft,intr,rsize=8192,wsize=8192,addr=130.158.109.13)
---
> automount(pid1074) on /nfs type autofs (rw,fd=5,pgrp=1074,minproto=2,maxproto=3)
10c9
< rdfs:/soft/linux on /nfs/local type nfs (rw,nosuid,soft,intr,rsize=8192,wsize=8192,addr=130.158.109.13)
---
> rdfs:/home on /nfs/home type nfs (rw,nosuid,soft,intr,rsize=8192,wsize=8192,addr=130.158.109.13)
 
LVL 40

Expert Comment

by:jlevie
It would be most helpful to know which Linux is running on the cluster.

If this happened to be a RedHat 7.2 cluster, then I'd bet on it being a problem with the network, most likely a link speed/mode negotiation problem. A good clue that this is the problem would be very low transfer rates between the machines when moving a 3-5 MB file with FTP. Normal rates would be something like 600-800 KB/sec or 6-8 MB/sec for 10 Mbps or 100 Mbps networks, respectively.
 
LVL 1

Author Comment

by:glebspy
It is indeed RedHat 7.2.

Is there anything that can be done?
 
LVL 40

Expert Comment

by:jlevie
I thought it might be RedHat 7.2...

I'm reasonably certain that the problem is fixable. But first we have to determine exactly what it is. I need you to run the FTP test between the file server and the problematic cluster node (FTP a 2-5 MB file in both directions) and see what the transfer rates are. If, as I suspect, they are low with respect to the normal rate for a 100 Mbps connection (which is what I'd assume you are using), then I need to know what ethernet cards are in the server and the node.
 
LVL 1

Author Comment

by:glebspy
Incredibly dumb question coming up..

How can I tell which machine is the server?

I'm making a wild guess and performing the test with a 5mb file.

on trouble node, "put" to server
227 Entering Passive Mode (130,158,109,69,64,38)
150 Opening BINARY mode data connection for glump.
226 Transfer complete.
5731476 bytes sent in 96.7 secs (58 Kbytes/sec)

on trouble node, "get" from server
ftp> get glump
local: glump remote: glump
227 Entering Passive Mode (130,158,109,69,99,248)
150 Opening BINARY mode data connection for glump (5731476 bytes).
226 Transfer complete.
5731476 bytes received in 8.26 secs (6.8e+02 Kbytes/sec)

Is there any way I can find out what the hardware (ethernet) configuration is without asking the system manager (i.e. by asking the computer)?
 
LVL 40

Expert Comment

by:jlevie
Well, you can determine what the NIC is, if it's a PCI card, by looking at the contents of /proc/pci. That is usually enough information to allow one to figure out which of the NIC diag/setup programs from http://www.scyld.com/diag/ to use. Unfortunately that's probably not going to do much good if you don't have root privs, as I believe that the diag progs require root privs to access the device registers.
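
Picking the NIC out of that listing can be scripted. The following is only a rough sketch, assuming the 2.4-era /proc/pci text layout in which the device description follows the words "Ethernet controller:"; the function name `ethernet_controllers` is made up for illustration.

```shell
ethernet_controllers() {
    # Read a PCI listing on stdin and print whatever follows
    # "Ethernet controller:" on each line that contains it.
    sed -n 's/^.*Ethernet controller: *//p'
}

# Usage: only attempt the read if /proc/pci exists and is readable.
if [ -r /proc/pci ]; then
    ethernet_controllers < /proc/pci
fi
```

On systems without /proc/pci the same filter could be fed from another PCI listing tool, though the line format there may differ and the pattern might need adjusting.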

You can tell which machine the server is by looking at the output of df. It'll list the source of an automounted file system, like your home dir. For example, on the system I'm using right now I can see:

chaos> df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda2              2519792   2017968    373824  85% /
...
wilowisp:/nfs0        20724064   8841064  10830272  45% /.automount/wilowisp/root/nfs0

From that I can see that /nfs0 is automounted from host wilowisp.
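
The same lookup can be done mechanically. This is a minimal sketch, assuming that NFS sources appear in the first column as `host:/path` (as they do in both the df listing above and in the output of mount); `nfs_servers` is a hypothetical helper name, not a standard command.

```shell
nfs_servers() {
    # Read df or mount output on stdin; print each distinct host that
    # appears in a first field of the form host:/path.
    awk '$1 ~ "^[^:/]+:/" { h = $1; sub(/:.*/, "", h); print h }' | sort -u
}

# Usage: list the file servers this node currently has mounted.
df 2>/dev/null | nfs_servers
```

Local devices such as /dev/hda1 are skipped because their first field starts with a slash and contains no host part.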

From the FTP transaction shown above I'd say that you have a problem. The 'put' only got a transfer rate of 58 KB/sec, as opposed to the 'get', which saw 680 KB/sec.

Since you don't have root privs on the boxes there's going to be very little that you can do to solve the problem except to document its existence. I'd say that it's time to call in your sysadmin and show him what you've found. Personally I would want to see normal FTP rates between each of the cluster nodes and the file server before I'd consider the problem solved, as any of the nodes could have a similar problem.
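
For documenting the problem, the rates can be recomputed from the byte counts and times that ftp reports. The sketch below uses the figures from the FTP session quoted earlier in the thread; `rate_kb` is a hypothetical helper, and the factor-of-four threshold is an arbitrary assumption rather than an established rule.

```shell
rate_kb() {
    # $1 = bytes transferred, $2 = elapsed seconds.
    # Prints the rate in KB/sec, taking 1 KB = 1024 bytes.
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1024 }'
}

put=$(rate_kb 5731476 96.7)   # "put" figures from the session above
get=$(rate_kb 5731476 8.26)   # "get" figures from the session above
echo "put: $put KB/sec, get: $get KB/sec"

# Assumed threshold: flag the link if one direction is under a quarter of the other.
if [ $((put * 4)) -lt "$get" ] || [ $((get * 4)) -lt "$put" ]; then
    echo "asymmetric transfer rates, suspect a speed/duplex negotiation problem"
fi
```

With those figures the two rates come out at about 58 and 678 KB/sec, so the warning is printed. Repeating the measurement for every node would give the sysadmin a simple table of which links are affected.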
 
LVL 1

Author Comment

by:glebspy
I found that the machine I previously took to be the server wasn't really the server. I don't seem to be allowed to log into the real server; my usual password doesn't work.

Here's the output of cat /proc/pci (on the trouble node).

PCI devices found:
  Bus  0, device   0, function  0:
    Host bridge: Intel Corporation 82850 850 (Tehama) Chipset Host Bridge (MCH) (rev 2).
      Prefetchable 32 bit memory at 0xe8000000 [0xebffffff].
  Bus  0, device   1, function  0:
    PCI bridge: Intel Corporation 82850 850 (Tehama) Chipset AGP Bridge (rev 2).
      Master Capable.  Latency=64.  Min Gnt=14.
  Bus  0, device  30, function  0:
    PCI bridge: Intel Corporation 82801BAM PCI (rev 4).
      Master Capable.  No bursts.  Min Gnt=6.
  Bus  0, device  31, function  0:
    ISA bridge: Intel Corporation 82801BA ISA Bridge (ICH2) (rev 4).
  Bus  0, device  31, function  1:
    IDE interface: Intel Corporation 82801BA IDE U100 (rev 4).
      I/O at 0xf000 [0xf00f].
  Bus  0, device  31, function  2:
    USB Controller: Intel Corporation 82801BA(M) USB (Hub A) (rev 4).
      IRQ 11.
      I/O at 0xd000 [0xd01f].
  Bus  0, device  31, function  4:
    USB Controller: Intel Corporation 82801BA(M) USB (Hub B) (rev 4).
      IRQ 9.
      I/O at 0xd400 [0xd41f].
  Bus  0, device  31, function  5:
    Multimedia audio controller: Intel Corporation 82801BA(M) AC'97 Audio (rev 4).
      IRQ 5.
      I/O at 0xd800 [0xd8ff].
      I/O at 0xdc00 [0xdc3f].
  Bus  1, device   0, function  0:
    VGA compatible controller: nVidia Corporation NV11 DDR (rev 178).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=5.Max Lat=1.
      Non-prefetchable 32 bit memory at 0xec000000 [0xecffffff].
      Prefetchable 32 bit memory at 0xe0000000 [0xe7ffffff].
  Bus  2, device  10, function  0:
    Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 12).
      IRQ 11.
      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
      Non-prefetchable 32 bit memory at 0xef020000 [0xef020fff].
      I/O at 0xc000 [0xc03f].
      Non-prefetchable 32 bit memory at 0xef000000 [0xef01ffff].


Here's the situation:

 The sysadmin is aware of the problem, but it is not at the top of his priority list. He's a physicist, not a computer person, and I think he set up the network 'out of a box' and may have to do a certain amount of research to figure out how to deal with the problem, which is probably an irritation for him. He has asked me to try to make do, and apparently no-one else on the system has complained (yet). It's very much at the top of *my* priority list, since I genuinely anticipate losing a substantial amount of work in an unpredictable way and at unpredictable times. I've already lost my netscape bookmarks file and part of a couple of C++ source files, which I was able to recover thanks to a coincidental backup and not having changed them very much recently.

  What would really help the situation from my point of view, is if you could give me a recipe, which I could tell the system manager, and save him any trouble, or at least fill him in slightly about how he might be able to fix the problem. Making it easier for him in this way, he might be willing to attend to it immediately, which is what I really want. Or he might even give me temporary root privileges, if he thought I could really shoot the problem dead.

  Can you tell me what I would do, *if* I had root privileges?
 
LVL 40

Accepted Solution

by:
jlevie earned 125 total points
Okay, we can see that at least one of the cluster nodes is exhibiting the kind of behaviour that one would expect to see if there is a link speed/mode negotiation problem. Assuming that all of the cluster members have the same hardware, it's probably fair to assume that all are having the same problem.
Also, we can see from the contents of /proc/pci that the NIC is an Intel Ethernet PRO 100, which is a 10/100 card.

The solution to this problem is to force the NICs to the highest speed the link will accept. For a simple 10/100 or 100-only hub that will be 100Mbps/HDX. If the nodes connect to a switch you can set the NICs to 100Mbps/FDX. To accomplish this you need to get the eepro100-diag.c file from http://www.scyld.com/diag/ and compile it according to the instructions at the end of the source code. Then it needs to be run on each node in the cluster, like:

# eepro100-diag -F 100baseTx-FD    # for a switch
# eepro100-diag -F 100baseTx-HD    # for a hub

After executing that command on two nodes you can try an FTP in both directions between those nodes and see whether you get data rates in the 6-8 MB/sec range.

This will need to be done at each boot and the command can be invoked in /etc/rc.d/rc.local.
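
A fragment along the following lines could be appended to /etc/rc.d/rc.local. It is only a sketch: the install path /usr/local/sbin/eepro100-diag is an assumption (use wherever the compiled binary was placed), and the mode string should match the commands above (100baseTx-FD for a switch, 100baseTx-HD for a hub).

```shell
# Assumed install location of the compiled diag program; adjust as needed.
DIAG=${DIAG:-/usr/local/sbin/eepro100-diag}
# Mode to force: 100baseTx-FD for a switch, 100baseTx-HD for a hub.
MODE=${MODE:-100baseTx-FD}

force_link_mode() {
    if [ -x "$DIAG" ]; then
        "$DIAG" -F "$MODE"
    else
        # Do not fail the boot script if the tool is missing; just report it.
        echo "eepro100-diag not found at $DIAG; link mode left unchanged" >&2
    fi
}

force_link_mode
```

Guarding the call with a check for the binary means a node where the tool has not been installed still boots normally and simply logs a message.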
 
LVL 1

Author Comment

by:glebspy
Ok, thanks for that, at least now we have something to work with.

There is one other thing I forgot to mention. This problem, if my memory serves, started at about the time that my node got moved and the LAN cable was consequently switched to a different access point. The system manager doesn't think this could have anything to do with the problem, and I admit I don't see how it could myself. It seems a bit of a coincidence though.

If you have an opinion on that, please say.
 
LVL 51

Expert Comment

by:ahoffmann
How many hubs are in between your machine and the server?
How long (in meters) is the cable?
 
LVL 1

Author Comment

by:glebspy
I have no idea.. I'll find out as soon as I can.
 
LVL 40

Expert Comment

by:jlevie
I'd say that it could have everything to do with the problem. It could be, as ahoffmann suggests, that you are too far (electrically) from the server. Or it could be that link speed/mode negotiation is failing when you are connected to this access point.
 
LVL 1

Author Comment

by:glebspy
Thanks for the comment. I'll check the length of the new cable, but the machine wasn't moved very far, only about 3 meters. It seems to be connected to a different port on the same box.

(By box I mean the little box with a lot of LAN cables coming out of it.)

Could it be then that some switch on the box needs to be flipped, or... what?