US-IT

SCO Unix Network Hang

Hi All,
We have a SCO OpenServer 5.0.7 server in our office. For the past week it has been hanging roughly every 12 hours, and I have to reboot it. The reboot clears the problem and the server runs fine for a while. We have two remote NFS shares mounted on the local file system, and the box also serves several Samba shares. When the system begins to hang, these become unreachable (the NFS shares from the SCO box, the Samba shares from remote systems). About 50 users are logged in to our FoxBase applications at a time. Sometimes (I think when it is not rebooted quickly enough) it stops responding to any network protocol (ssh, telnet, etc.).

Here is the netstat -m output from before the last reboot (this was fairly deep into the hang):
streams allocation:
                         config    alloc     free       total      max     fail
stream                     8448      202     8246        2464      202        0
queues                      908      413      495        4939      413        0
mblks                      8634     8164      470    26890848     8599        0
buffer headers             9018     8568      450      308777     8969        0
class  1,     64 bytes      396       74      322    10407791      393        0
class  2,    128 bytes       50       20       30     2273073       76        0
class  3,    256 bytes       44       28       16     5951744      585        0
class  4,    512 bytes       14       10        4       34800       58        1
class  5,   1024 bytes       31        0       31       38453       54        3
class  6,   2048 bytes     7310     7309        1     2284286     7310      804
class  7,   4096 bytes      532      532        0       15352      585 43361300
class  8,   8192 bytes        0        0        0      114823        9       47
class  9,  16384 bytes        0        0        0      224169        3       43
class 10,  32768 bytes        0        0        0      154546        3      893
class 11,  65536 bytes        0        0        0        1830        3       54
class 12, 131072 bytes        0        0        0           0        0        0
class 13, 262144 bytes        0        0        0           0        0        0
class 14, 524288 bytes        0        0        0           0        0        0
total configured streams memory: 17024.00KB
streams memory in use: 17129.64KB
maximum streams memory used: 18012.29KB


I get this error message repeated quite a bit:
WARNING: allocb failed - NSTRPAGES exceeded


I am out of ideas about where to look next. Any help is greatly appreciated. Thank you.
dfke

This looks like a kernel tuning issue, since the fail column should be all zeros.
I see that the streams memory in use has reached the total configured streams memory. You should increase NSTRPAGES, which controls the number of 4K pages of memory that can be dynamically allocated for STREAMS use.

Furthermore, NSTREAM should be set to at least 256 on systems that mount NFS filesystems or invoke remote X clients.
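
As a rough worked example of the page arithmetic, here is a small sketch. It assumes the "total configured streams memory" line printed by netstat -m is simply NSTRPAGES multiplied by the 4K page size, which I have not verified against the SCO documentation; the function name kb_to_pages is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a STREAMS memory size in KB into whole 4K pages.
# Assumes 1 page = 4 KB, per the NSTRPAGES description above.
kb_to_pages() {
    expr "$1" / 4
}

# 17024 KB is the configured figure from the netstat -m output in the question.
kb_to_pages 17024
# 18012 KB is the reported peak usage, rounded down to whole KB.
kb_to_pages 18012
```

On those figures the configured pool works out to 4256 pages, while the observed peak of 18012.29KB needs slightly more than 4503, so any new NSTRPAGES value would need some headroom above that.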
Right, this looks like a bottleneck issue. You need to either reduce the number of users or increase the allocated resources, or check whether you are hitting a real physical hardware limit.
US-IT

ASKER

I'm thinking it is a kernel issue as well. Last night I disconnected a mapped drive we had set up from a 2003 server, and I ran /etc/conf/cf.d/configure to change the streams values. The server has been up for almost 19 hours with only 2 fails in class 8 and one in class 7. Are any fails acceptable, or do I need to get these down to 0? Also, I ran the configure command for the streams parameters after reading about it only a little, so I wasn't too comfortable with the changes. They seem to have helped, but there were a lot of parameters to change. Which ones should I focus on, or is it all of them?

US-IT

ASKER

I spoke too soon. After a little over 20 hours:

streams allocation:
                         config    alloc     free       total      max     fail
stream                    15000      332    14668        7527      333        0
queues                     1362      674      688       15066      676        0
mblks                     16996    16785      211    59642597    16939        0
buffer headers            17082    16998       84     3695278    17058   237255
class  1,     64 bytes      342      256       86    25284596      383        0
class  2,    128 bytes      213      192       21     4416542      212        0
class  3,    256 bytes      322      253       69    10898034     1052       26
class  4,    512 bytes       13       11        2       63549       44        4
class  5,   1024 bytes       33        0       33       49609       70        8
class  6,   2048 bytes    14742    14740        2     4360914    14741        1
class  7,   4096 bytes     1000     1000        0       22585     1050      640
class  8,   8192 bytes        0        0        0      191912        9      147
class  9,  16384 bytes        0        0        0      414915        4        2
class 10,  32768 bytes        0        0        0      290699        3        2
class 11,  65536 bytes        0        0        0        1993        3        0
class 12, 131072 bytes        0        0        0           0        0        0
class 13, 262144 bytes        0        0        0           0        0        0
class 14, 524288 bytes        0        0        0           0        0        0
total configured streams memory: 32000.00KB
streams memory in use: 34311.99KB
maximum streams memory used: 35239.42KB

Note: Users began logging in after the 19th hour of uptime. We also have a web application that accesses a shared drive, and its traffic most likely begins around the same time.
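
To keep an eye on the fail counters between reboots, one option is to filter netstat -m down to the rows whose fail column is nonzero. This is only a minimal sketch: it assumes the column layout shown above, with fail as the last field, and the function name show_fails is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print rows of `netstat -m` output whose last
# column (fail) is a nonzero number. Reads the report on stdin.
show_fails() {
    awk '
        # Data rows have at least 7 fields and end in a purely numeric one.
        # Headers, blank lines, and the "streams memory" summary lines
        # (which end in "KB") do not match and are skipped.
        NF >= 7 && $NF ~ /^[0-9]+$/ && $NF + 0 > 0 { print }
    '
}

# Usage on the live system:
#   netstat -m | show_fails
```

Run against the report above, this would leave only the buffer headers row and the class rows that recorded failures, which makes it easier to see which buffer sizes are under pressure.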
US-IT

ASKER

Don't know if this information helps at all.

Client nfs:
calls      badcalls   nclget     nclsleep
34186      0          34223      0          
null       getattr    setattr    root       lookup     readlink   read      
0  0%      2279  6%   10  0%     0  0%      4097 11%   0  0%      17128 50%  
wrcache    write      create     remove     rename     link       symlink    
0  0%      9519 27%   429  1%    39  0%     79  0%     0  0%      0  0%      
mkdir      rmdir      readdir    fsstat    
0  0%      0  0%      452  1%    154  0%    



$ ls
[Lists all files in nfs mount]
$ l
[hangs]

OK, maybe there is a problem with the network card or its driver. Check whether an updated driver is available, or try switching cards.

If that doesn't help, you can capture the network traffic and compare its timing with the netstat -m output.

Try a shell script that records `netstat -m` output:

while :; do
    date            # timestamp each sample
    netstat -m
    sleep 1         # sampling interval, in seconds
done > netstat-m.log

Meanwhile, put a packet sniffer on the LAN, tell it to capture
everything being sent to the server's IP address.  Try to make sure the
sniffer and server agree closely about the time (within a second or
better).  Then run the sniff for long enough to observe the buffers
rising significantly, according to the `netstat -m` log.

You should be able to identify specific times when buffers were
consumed.  Look at the corresponding times in the sniffer log: is there
a particular kind of incoming packet that seems to be causing this?
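
To make the log easier to line up against a capture, it can be reduced to one line per sample. A sketch, assuming netstat-m.log was produced by the loop above (a date line immediately followed by a netstat -m report) and that your awk accepts the POSIX -v option; the function name class_series is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: summarise one buffer class from netstat-m.log.
# Usage: class_series CLASS < netstat-m.log
# Prints "timestamp | alloc | fail" for each sample found in the log.
class_series() {
    awk -v want="$1," '
        # The date line is whatever precedes each "streams allocation:" header.
        /^streams allocation:/ { stamp = prev }
        # Class rows: class N, SIZE bytes config alloc free total max fail
        $1 == "class" && $2 == want { print stamp " | " $6 " | " $NF }
        # Remember the current line so the next header can pick up its timestamp.
        { prev = $0 }
    '
}

# Example: follow the 4096-byte buffers (class 7) over time.
#   class_series 7 < netstat-m.log
```

A sudden jump in the alloc or fail column then gives you a timestamp to look up in the sniffer log.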
US-IT

ASKER

I believe the NIC is onboard. This may be a dumb question, but what would be the best brand/model of NIC to try (Dell PowerEdge 2500)? My knowledge is much better suited to Linux, so while I know some things, I'm almost a newcomer to SCO/Unix.

I will work on getting the packet sniffer going.

Thanks for your help.

What about having two NICs in addition to the built-in one, all on the same network and providing the same service?

A packet sniffer is a good idea, but a bandwidth manager may be a better one, since it lets you both monitor traffic and control overshoots.
You can get a free trial of SarCheck, which will identify the kernel tunables to adjust. You may need to let the system get close to crashing for it to give a useful result.

Sarcheck:

http://www.sarcheck.com/scosr5.htm

Go back to the home page and you can find the free trial.
Also - make sure you are not running out of space on the /, /var (if it's there), /usr (if it's there) filesystems.

Look at:
http://docsrv.sco.com:507/en/PERFORM/kernel_configure.html

In particular:  
STRMSGSZ
   

Although, if the problem is new, and no configuration changes were made before the problem cropped up, I'd suspect either a network issue or a Chatty Cathy client inundating the server with packets.  
ASKER CERTIFIED SOLUTION
yotech