
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 16411

Slow NFS Performance

We moved our office a few weeks ago, and as part of that move, re-ip'd the whole environment from a 10.0.0.x range to a 192.168.0.x range.

Since then, we've had performance issues on Solaris, AIX, and HP-UX NFS clients.  Our linux clients are as well-behaved as they ever were.

We get a lot of RPC timeouts, and very slow performance on the NFS mounted directories.  Have tried playing with vers=2/vers=3 and tweaking the wsize/rsize params, but so far, no luck.
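
For reference, the kind of mount line we've been experimenting with looks roughly like this (Solaris-style syntax; the server name and paths are just placeholders):

mount -F nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 nfsserver:/export/data /mnt/data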

Looking for any suggestions or ideas...  Since the linux environments remain fast, I think we can safely rule out a network or hardware issue, so I'm thinking it's on the nfs clients themselves.

netstat -i shows no collisions on the interfaces themselves.

nfsstat from the client shows a lot of badcalls and badxids:

# nfsstat -rc

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
27347      2599       537        2563       0          0          0
cantconn   nomem      interrupts
0          0          17
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds   badverfs
9          1          0          0          0          0          0
timers     nomem      cantsend
6          0          0



Finally, a truss of a cp from an NFS mount to local disk on the client hangs on the following line:

write(4, 0xFE800000, 8388608)   (sleeping...)


We have a fairly heterogeneous environment: the NFS servers in question are a Linux-based Snap Server and a stock Solaris 8 box, with clients running Linux, Solaris, HP-UX, and AIX.  None of these issues were present in the old environment, and the Linux clients are all still happy.

Looking for any ideas...



Asked by aaamr
2 Solutions
 
gheistCommented:
Maybe some TCP extension recently implemented in Linux, or PMTU blackhole.
 
ahoffmannCommented:
does your DNS reverse lookup work correctly?
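
A quick way to check, roughly (the names and address here are just examples):

nslookup nfsserver            # forward lookup
nslookup 192.168.0.50         # reverse lookup -- the two should agree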
 
aaamrAuthor Commented:
Reverse DNS is ok...

Did a quick audit of the network topology and found a mess of switches chained into our QA area... current hypothesis is "too many hops", so I'm redoing the network to eliminate as many chained devices as possible.
 
ahoffmannCommented:
reset all switches (they may struggle with their MAC tables;-)
 
gheistCommented:
calls      badcalls   badxids    timeouts
27347      2599       537        2563  

Looks more like PMTU blackhole in the middle.
 
aaamrAuthor Commented:
Would PMTU blackholes affect a local LAN?  What's the best way to diagnose (and fix?)?
 
gheistCommented:
Try pinging with 1500- or 9000-byte packets.
 
gheistCommented:
(or broken checksum processor in netcard, or wiring fault)
 
aaamrAuthor Commented:
Doesn't look like PMTU issues:

# ping -sn nfsserver 9000
PING nfsserver (192.168.0.50): 9000 data bytes
9008 bytes from 192.168.0.50: icmp_seq=0. time=2.96 ms
9008 bytes from 192.168.0.50: icmp_seq=1. time=2.53 ms

...

----nfsserver PING Statistics----
13 packets transmitted, 13 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 2.52/2.565/2.96/0.120
 
gheistCommented:
Add the -D flag to ping to make it non-fragmenting.
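
E.g., roughly (assuming your ping supports -D for the don't-fragment bit; 1472 data bytes fills a standard 1500-byte MTU):

ping -D -s nfsserver 1472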
 
aaamrAuthor Commented:
Max non-fragmenting size is 1472 bytes (the standard 1500-byte MTU less 28 bytes of IP and ICMP headers).

Packets larger than that don't get through.

C:\> ping -l 1473 -f wise

Pinging wise [192.168.0.50] with 1473 bytes of data:

Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Ping statistics for 192.168.0.50:
    Packets: Sent = 2, Received = 0, Lost = 2 (100% loss),
 
ahoffmannCommented:
hardware problem, either NIC or hub or switch
 
aaamrAuthor Commented:
Hmmm... the behavior is evident on several servers, which would seem to rule out an individual NIC.

The switches are a mix of Cisco, SMC, Netgear, and 3Com gear... some of them are pretty old.

I'm planning to subnet the network to try to isolate the issue, but can't do so until May.
 
ahoffmannCommented:
did you reset *all* switches (see http:#16417635)?
Another reason might be a routing problem; check with traceroute from both ends.
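
E.g. (the hostnames here are placeholders):

traceroute nfsserver       # from the client
traceroute nfsclient       # and from the server back toward the client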
 
gheistCommented:
Is there a ROUTER or FIREWALL involved where the problem occurs???
 
aaamrAuthor Commented:
No router or firewall between the boxes in question, though when we subnet the network we'll add a small router at that time.

Traceroutes seem ok... will try resetting all switches today.
 
aaamrAuthor Commented:
Did some network sniffing and dumped the results into Ethereal for analysis.  I'm seeing a lot of:

- TCP Previous segment lost
- TCP Dup ACK
- TCP Retransmission
- TCP Out-of-order
- TCP Window Update


This implies that packets are being lost somewhere, or at the very least going temporarily astray.
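
For anyone reproducing this, a capture along these lines is enough (snoop is the stock Solaris sniffer; the file name is arbitrary, and the resulting trace loads straight into Ethereal):

snoop -o /tmp/nfs.cap host nfsserver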
 
aaamrAuthor Commented:
Can someone shed some light on this?  I flipped the interface on one of the nfs servers to the other nic on the assumption that we might have a bad nic.

It autonegotiated to 100-half-duplex (this is a Solaris 8 box) and things suddenly got a lot faster.

This box is plugged directly into a Cisco Catalyst 2950 XL switch... which should talk 100fd, but I can't argue with the results.

Why would 100hd be faster in this case when plugged into a switch?
 
ahoffmannCommented:
hmm, sounds like the auto-negotiation failed

try installing ethtool http://freshmeat.net/redir/ethtool/20128/url_homepage/gkernel (not sure if it works for Sun boxes)
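
On Solaris, the rough equivalent is ndd against the NIC driver; assuming an hme interface (adjust the device and instance for your hardware):

ndd -set /dev/hme instance 0     # select hme0
ndd -get /dev/hme link_speed     # 0 = 10 Mbit, 1 = 100 Mbit
ndd -get /dev/hme link_mode      # 0 = half duplex, 1 = full duplex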
 
gheistCommented:
Catalysts have always had serious problems with NWay auto-negotiation. Install switch firmware newer than the NIC's, or force both ends to the same speed/duplex.

- TCP Previous segment lost -> duplex prob
- TCP Dup ACK -> duplex prob
- TCP Retransmission -> duplex prob
- TCP Out-of-order -> normal work
- TCP Window Update -> normal work

Bad checksums would indicate a wiring problem, or a problem with the checksum engine in one node's NIC.
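
A rough sketch of forcing 100/full-duplex on the Solaris side (again assuming an hme NIC; add the same settings to a boot script so they survive a reboot):

ndd -set /dev/hme instance 0          # select hme0
ndd -set /dev/hme adv_autoneg_cap 0   # turn off auto-negotiation
ndd -set /dev/hme adv_100fdx_cap 1    # advertise only 100 Mbit full duplex
ndd -set /dev/hme adv_100hdx_cap 0
ndd -set /dev/hme adv_10fdx_cap 0
ndd -set /dev/hme adv_10hdx_cap 0

The Catalyst port should be set to match ("speed 100" / "duplex full"); forcing only one end is what typically produces exactly this kind of duplex mismatch.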
 
gheistCommented:
I have asked the moderators to reopen the question. Please reconsider which answer led you to acknowledging that you use a Catalyst, which in turn led me to the solution, and share the points accordingly.
 
aaamrAuthor Commented:
Agreed... moderators, please split the points between gheist and ahoffmann. It was a team effort.
 
aaamrAuthor Commented:
Thanks everyone for all the ideas, and for working through this issue with me.
