US-IT
SCO Unix Network Hang
Hi All,
We have a SCO OpenServer 5.0.7 server in our office. For the past week it has been hanging roughly every 12 hours, and I have to reboot it. The reboot clears the problem, and it runs fine for a while. We have 2 remote NFS shares mounted on the local file system, and the box also serves several Samba shares. When the system begins to hang, these become unreachable (the NFS shares from the SCO box, the Samba shares from remote systems). We have about 50 users logged in at a time to our foxbase applications. Sometimes (I think if it is not rebooted quickly enough) it stops responding to any network protocol (ssh, telnet, etc.).
Here is the netstat -m from before the last reboot (this was fairly deep into the hang)
streams allocation:
                       config  alloc   free     total    max      fail
stream                   8448    202   8246      2464    202         0
queues                    908    413    495      4939    413         0
mblks                    8634   8164    470  26890848   8599         0
buffer headers           9018   8568    450    308777   8969         0
class 1, 64 bytes         396     74    322  10407791    393         0
class 2, 128 bytes         50     20     30   2273073     76         0
class 3, 256 bytes         44     28     16   5951744    585         0
class 4, 512 bytes         14     10      4     34800     58         1
class 5, 1024 bytes        31      0     31     38453     54         3
class 6, 2048 bytes      7310   7309      1   2284286   7310       804
class 7, 4096 bytes       532    532      0     15352    585  43361300
class 8, 8192 bytes         0      0      0    114823      9        47
class 9, 16384 bytes        0      0      0    224169      3        43
class 10, 32768 bytes       0      0      0    154546      3       893
class 11, 65536 bytes       0      0      0      1830      3        54
class 12, 131072 bytes      0      0      0         0      0         0
class 13, 262144 bytes      0      0      0         0      0         0
class 14, 524288 bytes      0      0      0         0      0         0
total configured streams memory: 17024.00KB
streams memory in use: 17129.64KB
maximum streams memory used: 18012.29KB
I get this error message repeated quite a bit:
WARNING: allocb failed - NSTRPAGES exceeded
I am out of ideas where to look next, any help is greatly appreciated. Thank you.
Right, this looks like a bottleneck issue. You need to reduce the number of users or increase the allocated resources, or check whether you are hitting the real physical limit of the hardware.
ASKER
I'm thinking it is a kernel issue as well. Last night, I disconnected a mapped drive we had set up from a 2003 server, and I ran /etc/conf/cf.d/configure to change the streams values. The server has been up for almost 19 hours, with only 2 fails in class 8 and one in class 7. Are any fails acceptable, or do I need to get these down to 0? Also, I ran the configure command for the streams, but having only read about it a little, I wasn't too comfortable with the changes. They seem to have helped, but there were a lot of parameters to change. Which ones should I focus on, or is it all of them?
ASKER
Spoke too early. After little over 20 hours:
streams allocation:
                       config  alloc   free     total    max      fail
stream                  15000    332  14668      7527    333         0
queues                   1362    674    688     15066    676         0
mblks                   16996  16785    211  59642597  16939         0
buffer headers          17082  16998     84   3695278  17058    237255
class 1, 64 bytes         342    256     86  25284596    383         0
class 2, 128 bytes        213    192     21   4416542    212         0
class 3, 256 bytes        322    253     69  10898034   1052        26
class 4, 512 bytes         13     11      2     63549     44         4
class 5, 1024 bytes        33      0     33     49609     70         8
class 6, 2048 bytes     14742  14740      2   4360914  14741         1
class 7, 4096 bytes      1000   1000      0     22585   1050       640
class 8, 8192 bytes         0      0      0    191912      9       147
class 9, 16384 bytes        0      0      0    414915      4         2
class 10, 32768 bytes       0      0      0    290699      3         2
class 11, 65536 bytes       0      0      0      1993      3         0
class 12, 131072 bytes      0      0      0         0      0         0
class 13, 262144 bytes      0      0      0         0      0         0
class 14, 524288 bytes      0      0      0         0      0         0
total configured streams memory: 32000.00KB
streams memory in use: 34311.99KB
maximum streams memory used: 35239.42KB
Note: Users began logging in after the 19th hour of uptime. Also note, we have a web application that accesses a shared drive, most likely traffic beginning around the same time.
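To keep an eye on which rows are still failing after a tuning change, the fail column of the `netstat -m` output can be filtered with a short script. This is only a sketch and was not posted in the thread: the function name `streams_fails` is made up for illustration, and it assumes the table layout shown above, where each row ends in six numeric columns (config, alloc, free, total, max, fail).

```shell
# Print every row of `netstat -m` output whose last column (fail) is non-zero.
# Reads the netstat -m text on stdin. A line counts as a table row only if
# its last six fields are all plain integers (assumed layout, see above).
streams_fails() {
    awk '
        NF >= 7 {
            ok = 1
            for (i = NF - 5; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) ok = 0
            if (ok && $NF > 0) {
                # Everything before the six numeric columns is the row name.
                name = $1
                for (i = 2; i <= NF - 6; i++) name = name " " $i
                printf "%s: fail=%s alloc=%s of config=%s\n", name, $NF, $(NF-4), $(NF-5)
            }
        }'
}
```

It would be run as `netstat -m | streams_fails`. Rows where alloc has reached config alongside a growing fail count are the ones worth comparing between samples.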
ASKER
Don't know if this information helps at all.
Client nfs:
calls      badcalls   nclget     nclsleep
34186      0          34223      0
null       getattr    setattr    root       lookup     readlink   read
0 0%       2279 6%    10 0%      0 0%       4097 11%   0 0%       17128 50%
wrcache    write      create     remove     rename     link       symlink
0 0%       9519 27%   429 1%     39 0%      79 0%      0 0%       0 0%
mkdir      rmdir      readdir    fsstat
0 0%       0 0%       452 1%     154 0%
$ ls
[Lists all files in nfs mount]
$ l
[hangs]
OK, maybe there is a problem with the network card or its driver. Check whether there is an updated driver, or try switching cards.
If that doesn't help, you can try to sniff the packets in some way and compare the timings with the netstat -m output:
Try a shell script that records `netstat -m` output:
while :; do
    date
    netstat -m
    sleep 1    # pause one second between samples
done > netstat-m.log
Meanwhile, put a packet sniffer on the LAN and tell it to capture
everything being sent to the server's IP address. Try to make sure the
sniffer and server agree closely about the time (within a second or
better). Then run the sniff for long enough to observe the buffers
rising significantly, according to the `netstat -m` log.
You should be able to identify specific times when buffers were
consumed. Look at the corresponding times in the sniffer log: is there
a particular kind of incoming packet that seems to be causing this?
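To find those times in the log without reading every sample, the log can be reduced to one line per sample. The following is a rough sketch, not part of the original suggestion: the function name `streams_usage_timeline` is invented for illustration, and it assumes the log was produced by the loop above, so that each `date` line sits directly before a "streams allocation:" header.

```shell
# Reduce netstat-m.log to: timestamp, streams memory in use, change since
# the previous sample. Reads the log on stdin.
streams_usage_timeline() {
    awk '
        # The line just before each "streams allocation:" header is the date stamp.
        /^[ \t]*streams allocation:/ { stamp = prev }
        /^[ \t]*streams memory in use:/ {
            kb = $NF + 0              # "17129.64KB" converts to 17129.64
            delta = seen ? kb - last : 0
            printf "%s\t%.2f KB\t%.2f KB\n", stamp, kb, delta
            last = kb; seen = 1
        }
        { prev = $0 }'
}
```

Run it as `streams_usage_timeline < netstat-m.log`. Samples with a large positive change in the last column are the timestamps to look up in the sniffer capture.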
ASKER
I believe the NIC is onboard. This may be a dumb question, but what would be the best brand/model of NIC to try out (Dell PowerEdge 2500)? My knowledge is much more suited to Linux, so while I know some things, I'm almost a newcomer to SCO/Unix.
I will work on getting the packet sniffer going.
Thanks for your help.
What about having 2 NICs, plus the built-in one, all on the same network and providing the same service?
A packet sniffer is a good idea, but a bandwidth manager may be a better one, since it lets you both monitor the traffic and control overshoots.
You can get a free trial of SarCheck that will ID all of the kernel tunables to adjust - you may need to get to near crash status to have it give the desired result.
Sarcheck:
http://www.sarcheck.com/scosr5.htm
go back to the home page and you can find the free trial.
Also - make sure you are not running out of space on the /, /var (if it's there), /usr (if it's there) filesystems.
Look at:
http://docsrv.sco.com:507/en/PERFORM/kernel_configure.html
In particular:
STRMSGSZ
Although, if the problem is new, and no configuration changes were made before the problem cropped up, I'd suspect either a network issue or a Chatty Cathy client inundating the server with packets.
ASKER CERTIFIED SOLUTION
I see that the total configured streams memory almost matches the streams memory in use. You should increase the value of NSTRPAGES. NSTRPAGES controls the number of 4K pages of memory that can be dynamically allocated for STREAMS use.
Furthermore, NSTREAM should be set to at least 256 on systems that mount NFS filesystems or invoke remote X clients.
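Since NSTRPAGES counts 4K pages, the KB figures printed by `netstat -m` can be converted to pages by dividing by 4: the 17024.00KB configured in the first output is 4256 pages, and the 32000.00KB configured after the change is 8000 pages. The sketch below does that conversion for the three summary lines. It is an illustration only: the function name `streams_pages` is made up, and treating "total configured streams memory" as NSTRPAGES times 4K is an assumption that the thread does not confirm.

```shell
# Convert the netstat -m memory summary from KB to 4K pages.
# Reads netstat -m text on stdin. Assumes (not confirmed in the thread)
# that the configured total corresponds to NSTRPAGES * 4K.
streams_pages() {
    awk '
        /total configured streams memory:/ { conf = $NF + 0 }
        /streams memory in use:/           { used = $NF + 0 }
        /maximum streams memory used:/     { peak = $NF + 0 }
        END {
            c = conf / 4; u = used / 4; p = peak / 4
            # Round partial pages up to whole pages.
            if (c > int(c)) c = int(c) + 1
            if (u > int(u)) u = int(u) + 1
            if (p > int(p)) p = int(p) + 1
            printf "configured=%d in_use=%d peak=%d (4K pages)\n", c, u, p
        }'
}
```

Run as `netstat -m | streams_pages`. If the peak figure stays above the configured one during busy hours, that gives a starting point for how far NSTRPAGES might need to be raised, to be confirmed by watching whether the allocb warnings stop.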