tbeasley123

Netware 6.5 sp3 spinlock errors, locked up server

I've been having issues with my NetWare 6.5 SP3 server over the past couple of days.  Each morning for the past two days, when I've come in, users are unable to log in, GroupWise is not working, and other services (web) aren't working.  The server console is showing a black screen.

After rebooting, everything seems to work OK.  However, over the course of the day we intermittently lose access to shares for a brief period of time (usually less than a minute).

Pasted below are snippets of the abend log.  I've noticed this spinlock error in the log that seems to occur at nearly the same time each day.  

This is a production server so whatever help I can get will be greatly appreciated.

Server LGHSV001 halted Wednesday, July 25, 2007   1:07:25.113 am
Abend 2 on P02: Server-5.70.03-0: Attempt to release an unacquired spinlock
Registers:
    CS = 0060 DS = 007B ES = 007B FS = 007B GS = 007B SS = 0068
    EAX = 00000008 EBX = 4D6547E0 ECX = 4D8D50C8 EDX = 00000003
    ESI = 00000020 EDI = 00000000 EBP = 00000020 ESP = B3BB3F08
    EIP = 0011415B FLAGS = 00000096
    LOADER.NLM|DMAMutexLock:
0011415B 68EC4C1300     PUSH    LOADER.NLM|SystemDMASpinLock
    EIP in UNKNOWN memory area

The violation occurred while processing the following instruction:
LOADER.NLM|DMAMutexLock:
0011415B 68EC4C1300     PUSH    LOADER.NLM|SystemDMASpinLock
00114160 E8DBFAFFFF     CALL    LOADER.NLM|kSpinLock
00114165 83C404         ADD     ESP, 00000004
00114168 C3             RET    
LOADER.NLM|DMAMutexUnlock:
00114169 68EC4C1300     PUSH    LOADER.NLM|SystemDMASpinLock
0011416E E8FDFBFFFF     CALL    LOADER.NLM|kSpinUnlock
00114173 83C404         ADD     ESP, 00000004
00114176 C3             RET    
LOADER.NLM|kSpinLockDebugParser:
00114177 C705684C130001000000 MOV     [00134C68]=00000000, 00000001
00114181 C3             RET
Additional Information:
    The NetWare OS detected a problem with the system while executing a process owned by LOADER.NLM.  It may be the source of the problem or there may have been a memory corruption.



Server LGHSV001 halted Thursday, July 26, 2007   1:51:23.128 am
Abend 2 on P01: Server-5.70.03-0: Attempt to release an unacquired spinlock

Registers:
    CS = 0008 DS = 0068 ES = 0068 FS = 0068 GS = 007B SS = 0068
    EAX = B2C088F0 EBX = B2C088F0 ECX = 004EAD60 EDX = 000000FA
    ESI = 00000000 EDI = 00000001 EBP = 710F0BCC ESP = 55AC61EC
    EIP = 00109DBB FLAGS = 00000096
    LOADER.NLM|DMAMutexLock:
00109DBB 684CA91200     PUSH    LOADER.NLM|SystemDMASpinLock
    EIP in UNKNOWN memory area

The violation occurred while processing the following instruction:
LOADER.NLM|DMAMutexLock:
00109DBB 684CA91200     PUSH    LOADER.NLM|SystemDMASpinLock
00109DC0 E8DBFAFFFF     CALL    LOADER.NLM|kSpinLock
00109DC5 83C404         ADD     ESP, 00000004
00109DC8 C3             RET    
LOADER.NLM|DMAMutexUnlock:
00109DC9 684CA91200     PUSH    LOADER.NLM|SystemDMASpinLock
00109DCE E8FDFBFFFF     CALL    LOADER.NLM|kSpinUnlock
00109DD3 83C404         ADD     ESP, 00000004
00109DD6 C3             RET    
LOADER.NLM|kSpinLockDebugParser:
00109DD7 C705C8A8120001000000 MOV     [0012A8C8]=00000000, 00000001
00109DE1 C3             RET
Additional Information:
    The NetWare OS detected a problem with the system while executing a process owned by LOADER.NLM.  It may be the source of the problem or there may have been a memory corruption.
ShineOn

What's rough about what you're posting is that it's "abend 2."  If you don't have any "abend 1" entries in your abend log, that means the first abend is being followed so fast by the second abend that the first abend doesn't get logged.  

You'll need to trap the Abend 1 for any useful diagnostics.  The Abend 2 is usually caused by the Abend 1.

To trap the Abend 1 you will have to turn off Auto Restart After Abend.  It's usually set to 1.  

At the server console type "Set Auto Restart After Abend = 0" to force it to stop after the first abend and wait for you to respond.

If you search Experts-Exchange for "Auto Restart after Abend" or words to that effect, you should find several topics that cover abend debugging procedures, including using loadstage startup and such.  For now, just try to capture the actual "causative" abend.
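
To check quickly whether an abend log contains any "Abend 1" entries at all, something like the following rough Python sketch could be run against a copy of the log on a workstation. The header pattern is taken only from the excerpts pasted in this thread; the function names are made up for illustration, and other log variants may need a different pattern.

```python
import re
from collections import Counter

# Header format as it appears in the excerpts in this thread, e.g.
# "Abend 2 on P02: Server-5.70.03-0: Attempt to release an unacquired spinlock"
ABEND_HEADER = re.compile(r"^Abend (\d+) on P(\d+): (.*)$")

def summarize_abends(log_text):
    """Return a list of (abend_number, processor, message) tuples."""
    entries = []
    for line in log_text.splitlines():
        match = ABEND_HEADER.match(line.strip())
        if match:
            entries.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return entries

def has_first_abend(entries):
    """True if at least one entry is the first ("causative") abend."""
    return any(number == 1 for number, _, _ in entries)

# Two headers copied from the original post.
sample = """\
Server LGHSV001 halted Wednesday, July 25, 2007   1:07:25.113 am
Abend 2 on P02: Server-5.70.03-0: Attempt to release an unacquired spinlock
Server LGHSV001 halted Thursday, July 26, 2007   1:51:23.128 am
Abend 2 on P01: Server-5.70.03-0: Attempt to release an unacquired spinlock
"""

entries = summarize_abends(sample)
counts = Counter(number for number, _, _ in entries)
print(counts)
print(has_first_abend(entries))
```

With only the two excerpts from the original post as input, no Abend 1 header is found, which matches the situation described above.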
tbeasley123

ASKER

Thanks for the help ShineOn.  I've issued that command and we'll see what happens if it abends again.

Also, something I didn't mention in my first post: I applied SP3 to the system last Thursday, and while it was running on SP2 I hadn't had these issues.  Since applying SP3 I've had intermittent, temporary (30 seconds to a minute) loss of access to the volumes (network shares).  I applied the NW65OS3A post-SP3 patch yesterday and things have run better today, but I still got the abend with the same information in the abend file.  I will apply the NW65OS3B patch (which is supposed to resolve some abend issues) tonight and see if the abend occurs tomorrow morning.

I've checked crontab to see if there is anything kicking off that might be causing problems.  The only thing running daily is:
0 0 * * * perl sys:/apache2/rotate.pl sys:/apache2/rotate.ini --noscreen
I tried running this by itself today during operating hours and nothing crashed, although it failed because the file rotate.pl doesn't exist.  Is this something to be concerned about?

Thanks again for your help.  I'll post an update on Monday.
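
For anyone reading the crontab entry quoted above, here is a small Python sketch that splits it into its schedule fields. It assumes the usual five-field cron layout (minute, hour, day of month, month, day of week) followed by the command; the helper names are hypothetical and this is not how the NetWare cron module itself parses the file.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def parse_crontab_line(line):
    """Split a crontab entry into (schedule dict, command string)."""
    parts = line.split(None, 5)
    if len(parts) != 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]

def runs_daily_at(schedule):
    """Return (hour, minute) if the entry fires once a day at a fixed time, else None."""
    fixed_time = schedule["minute"].isdigit() and schedule["hour"].isdigit()
    every_day = all(schedule[name] == "*"
                    for name in ("day_of_month", "month", "day_of_week"))
    if fixed_time and every_day:
        return int(schedule["hour"]), int(schedule["minute"])
    return None

# The entry quoted in the post above.
entry = "0 0 * * * perl sys:/apache2/rotate.pl sys:/apache2/rotate.ini --noscreen"
schedule, command = parse_crontab_line(entry)
print(runs_daily_at(schedule))  # (0, 0)
print(command)
```

Read this way, the job is scheduled for midnight every day, while the two abends in the original post were logged at 1:07 am and 1:51 am.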
Um.

SP3 had problems, which is why it was followed by SP4 and SP5, and there's now an SP6.

SP3 is from, like, 2 years ago.

Is there a reason why you only went from SP2 to SP3?  Is there some 3rd party software that's holding you back?  Usually, you do not have to install support packs sequentially - you can jump directly from SP2 to SP6.  At least I'm not aware of any reason why you'd have to install any other SP's before going to SP6.

There will be some "gotcha's" like having to re-create your certificates because of a major upgrade to Certificate Services, but other than that - all you get is improvements.

I suggest you download SP6 separately so it can be applied, but also download the SP6 Overlay Images (there should be two - one for OS and one for Products) so if you want to install anything, you don't have to go back and re-apply the SP's, and if you want to install another server or six you don't have to install the SP's separately.  The overlay images also will have the latest hardware drivers, making for a nicer install experience on new hardware...


And if you were ever to call Novell for support, the first thing they'd say is "call us back when you're fully patched".
Thanks for the suggestions.  I've begun the download of the patch.  As for why I didn't patch earlier: we're a small company and we rely heavily on this NetWare box; it's running everything (file server, GroupWise, etc.).  Up until last week it was running reliably patched to SP2.  We just added our first Microsoft Vista machine last week, and I was having issues with the Novell Client for Vista not mapping network drives, along with other miscellaneous login issues.  I noticed that this didn't occur on my development box running NetWare 6.5 SP3.  I decided to go ahead and upgrade the production server to NW 6.5 SP3, since the upgrade appeared to be pretty minor and didn't look like it had the potential to break other programs.  Up until Tuesday of this week the SP3 patch seemed to be fine on the production server.  That is when we started getting the lockups.  

I've read around and found another posting on EE where a similar issue was occurring and Sophos was mentioned.  I am running Sophos for NetWare on this server, and from the logs it appears that the Sophos manager has been having issues updating Sophos since the update.  I've turned off the daily full scan to see if this eliminates the error.

Long story short, I'm still troubleshooting the reason for this error, but ultimately I'm hoping it will be resolved after the sp6 upgrade.

thanks, i'll keep you posted.
I have almost always worked for small companies (between 50 and 150 users) and have never taken the position to hold off for years on NetWare support packs because we rely heavily on the server(s).  I have always taken the opposite position of carefully planning regular updates to my servers - not immediately on release of a support pack, but within a few months of its release (to let the major bugs shake out in bigger companies) - BECAUSE we rely heavily on the servers.

The "careful planning" comes down to making sure the version/support pack/patch level of the software the server supports doesn't have issues with the support pack, and where they might, checking for updates to those packages to make them compatible, and determining the order in which the updates should be applied.  It also includes checking with the hardware manufacturer(s) for BIOS flash updates, other firmware updates, LPARs, whatever they might call them - again, to make sure the hardware firmware/drivers/etc are all at the latest supported level to ensure compatibility with the support pack and the other software updates.  Note that I say "updates" - not "upgrades."

I *never* consider support packs or patches to be "upgrades."  I wouldn't say, for example, "I'm hoping it will be resolved after the SP6 upgrade."  I'd say "I'm hoping it will be resolved after I apply SP6."  A Support Pack is an integrated and cross-tested set of bugfixes and patches, with the occasional feature enhancement or security improvement, NOT an upgrade.  If it were an upgrade, it would at minimum change the major revision level of the OS, if not the version.  Support packs only change a minor rev level.  After applying SP6 you'd have NetWare 6.5.6.  You now have NetWare 6.5.3.  It's still NetWare 6.5, so it's not an upgrade.  It's supposed to be part of ongoing maintenance.
I understand what you're saying, ShineOn, but I'm a one-man band for now and I'm constantly working on other issues, so applying these patches often gets put off, especially if nothing appears to be "broken".  I appreciate your advice and will make time in the near future to get everything patched up to date.  I'm sure I'll be posting a question when that occurs, because inevitably something that is now working will not work when I apply those patches.

As for the issue this posting is about: I've concluded that the error is occurring after Arcserve runs a full backup.  I will check for any recent Arcserve patches, but since we've had issues with it recently I'm pretty sure we're not 2 years out of date on that.  I'll be applying those today and will post the results.

I'll continue to post my updates to this thread since I'm sure others will or have had similar issues after applying service packs.  

Thanks
Is ARCserve running on the server or is the actual backup server on another server, with this server being backed up across the network?

What TSA modules are you loading?   Are you doing a "hot" backup of GroupWise?  Are you using any open file agents?  Can you check the version(s) of the TSA modules?  (In Windows Explorer, right-click the module in SYS:/SYSTEM and select "Properties." On the "NetWare Version" tab, click the button.  That'll give you the same info as if you had done a "Modules" command on the system console.)
As to the "one man band" and constantly working on other issues, I can relate.
Thanks Shineon, I'm sure you probably can relate.  Fortunately I have exchanges like this where I can gain insight.

Arcserve is actually running on this server.  It is set to kickoff nightly and do a full backup nightly Monday thru Friday.  Prior to starting it shuts down Sweep and after running it starts it back up again.

Yes, I am doing a hot backup of GroupWise.  I'm not currently using any open file agents.  I was told by my previous counterpart that they weren't necessary.  I'm open to purchasing them if they're needed.

I couldn't seem to get the module info like you indicated in your previous post, but here is the information captured from console:

TSAFS.NLM                                                        
  Loaded from [SYS:SYSTEM\]                                      
  (Address Space = OS)                                          
  SMS - File System Agent for NetWare 6.X                        
  Version 6.50.11 January 13, 2005                              
  Copyright (C) 2002-03, 2005 Novell, Inc.  All Rights Reserved.
UNIQSVR.NLM                                                            
  Loaded from [SYS:\ARCSERVE\NLM\]                                    
  (Address Space = OS)                                                
  CA Universal RPC Queue Manager r11.1 SP3 (Build 991.011 03/01/07)    
  Version 11.10 March 1, 2007                                          
  (C) 2007 CA                                                          
POOLUTIL.NLM                                                          
  Loaded from [SYS:\ARCSERVE\NLM\]                                    
  (Address Space = OS)                                                
  CA File System Pool Utility Module r11.1 SP3 (Build 991.011 03/01/07)
  Version 11.10 March 1, 2007                                          
  (C) 2007 CA                                                          
UNIDB.NLM                                                              
  Loaded from [SYS:\ARCSERVE\NLM\]                                    
  (Address Space = OS)                                                
  CA Universal Database Manager r11.1 SP3 (Build 991.011 03/01/07)    
  Version 11.10 March 1, 2007                                          
  (C) 2007 CA                                                          
DISCOVER.NLM                                                          
  Loaded from [SYS:\ARCSERVE\NLM\]                                    
  (Address Space = OS)                                                
  CA Discovery Module r11.1 SP3 (Build 991.011 03/01/07)              
  Version 11.10 March 1, 2007                                          
  (C) 2007 CA    
  Loaded from [SYS:\ARCSERVE\NLM\]                                              
  (Address Space = OS)                                                          
  BrightStor ARCserve r11.1 SP3 ARCserve Validation Module (Build 991.011 03/01/07)
  Version 11.10 March 1, 2007                                                  
  (C) 2007 CA                                                                  
UNIDMSVR.NLM                                                                    
  Loaded from [SYS:\ARCSERVE\NLM\]                                              
  (Address Space = OS)                                                          
  CA Universal RPC Device Manager r11.1 SP3 (Build 991.011 03/01/07)            
  Version 11.10 March 1, 2007                                                  
  (C) 2007 CA                                                                  
STANDARD.NLM                                                                    
  Loaded from [SYS:\ARCSERVE\NLM\]                                              
  (Address Space = OS)                                                          
  BrightStor ARCserve r11.1 SP3 Standard Tape Support (Build 991.011 03/01/07)  
  Version 11.10 March 1, 2007                                                  
  (C) 2007 CA                                                                  
TAPEALRT.NLM                                                                    
  Loaded from [SYS:\ARCSERVE\NLM\]                                              
  (Address Space = OS)                                                          
  Tape Alert r11.1 SP3 SNMP Agent For NetWare (Build 991.011 03/01/07)          
  Version 11.10 March 1, 2007                                                  
  (C) 2007 CA                                                                  
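
A console listing like the one above can be turned into comparable version numbers with a short script. The Python sketch below assumes the layout shown in this post (a module name on its own line, then an indented "Version x.y.z date" line); the function name is hypothetical. The 6.52.06 value used in the comparison is the TSAFS.NLM version quoted elsewhere in this thread for TSA5UP21.

```python
import re

MODULE_NAME = re.compile(r"^([A-Z0-9_]+\.NLM)\s*$")
VERSION_LINE = re.compile(r"^\s+Version (\d+)\.(\d+)(?:\.(\d+))?\s+(.*\S)\s*$")

def parse_modules(text):
    """Map module name -> ((major, minor, revision), date string)."""
    modules = {}
    current = None
    for line in text.splitlines():
        name = MODULE_NAME.match(line)
        if name:
            current = name.group(1)
            continue
        version = VERSION_LINE.match(line)
        if version and current:
            numbers = (int(version.group(1)),
                       int(version.group(2)),
                       int(version.group(3) or 0))
            modules[current] = (numbers, version.group(4))
    return modules

# Two modules copied from the listing above.
listing = """\
TSAFS.NLM
  Loaded from [SYS:SYSTEM\\]
  (Address Space = OS)
  SMS - File System Agent for NetWare 6.X
  Version 6.50.11 January 13, 2005
  Copyright (C) 2002-03, 2005 Novell, Inc.  All Rights Reserved.
UNIQSVR.NLM
  Loaded from [SYS:\\ARCSERVE\\NLM\\]
  (Address Space = OS)
  CA Universal RPC Queue Manager r11.1 SP3 (Build 991.011 03/01/07)
  Version 11.10 March 1, 2007
  (C) 2007 CA
"""

modules = parse_modules(listing)
loaded_tsafs = modules["TSAFS.NLM"][0]
print(loaded_tsafs)                # (6, 50, 11)
print(loaded_tsafs < (6, 52, 6))   # True: older than the 6.52.06 build
```

Comparing versions as integer tuples avoids the trap of comparing "6.50.11" and "6.52.06" as strings.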
                                                     


No GroupWise agent should be necessary.  Whether an OFA for regular files is desirable depends on your server / nss config.

However, there are SMS/TSA requirements for successful "hot" backup of GroupWise.   They include the use of the /enablegw switch on the load of TSAFS as well as a "helper app" called TSAFSGW.  These replace the old GWTSA.NLM agent.
http://support.novell.com/cgi-bin/search/searchtid.cgi?/10095865.htm
http://support.novell.com/docs/Tids/Solutions/10098834.html

I wonder if you had a newer (or older) set of TSA's prior to the SP3 update than you have now...

Check for the latest TSA updates at download.novell.com.  The latest one at the time of this posting is TSA5UP21.ZIP, which has TSAFS.NLM version 6.52.06, dated February 2, 2007.

It doesn't have the TSAFSGW helper app, or TSANDS.  To get an updated TSAFSGW, you'd have to go for an earlier TSA5UPxx patch kit, like TSA5UP19, which has the TSAFSGW.NLM dated June 7, 2005, or a GroupWise SP, like GW7SP1 (in the "agents" folder), or an individual GroupWise patch (GW6.5 NetWare Target Service Agent rev C).  It looks like the latest TSANDS.NLM is dated 19 Jul 2005; it would also come with things like eDirectory updates or NetWare SP's.

Anyway, you should have the TSAFS load line read "LOAD TSAFS.NLM /EnableGW=Yes".  Also, for best results, add the "LOAD TSAFSGW" command to follow the TSAFS /EnableGW command.
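
As a rough way to double-check the load order described above, the following Python sketch scans a list of load lines (for example, copied out of whichever NCF file loads the backup modules; which file that is on a given server is an assumption here) and reports anything that differs from the recommendation. The function name and messages are hypothetical.

```python
def tsa_load_problems(ncf_lines):
    """Return a list of problems with the TSAFS / TSAFSGW load lines."""
    tsafs_at = None
    tsafs_has_switch = False
    tsafsgw_at = None
    for index, raw in enumerate(ncf_lines):
        tokens = raw.strip().upper().split()
        if tokens and tokens[0] == "LOAD":
            tokens = tokens[1:]
        if not tokens:
            continue
        # Accept the module name with or without the .NLM extension.
        module = tokens[0][:-4] if tokens[0].endswith(".NLM") else tokens[0]
        if module == "TSAFS":
            tsafs_at = index
            tsafs_has_switch = any(t.startswith("/ENABLEGW") for t in tokens[1:])
        elif module == "TSAFSGW":
            tsafsgw_at = index

    problems = []
    if tsafs_at is None:
        problems.append("TSAFS is not loaded")
    elif not tsafs_has_switch:
        problems.append("TSAFS is loaded without /EnableGW")
    if tsafsgw_at is None:
        problems.append("TSAFSGW is not loaded")
    elif tsafs_at is not None and tsafsgw_at < tsafs_at:
        problems.append("TSAFSGW is loaded before TSAFS")
    return problems

recommended = ["LOAD TSAFS.NLM /EnableGW=Yes", "LOAD TSAFSGW"]
plain = ["LOAD TSAFS.NLM"]
print(tsa_load_problems(recommended))  # []
print(tsa_load_problems(plain))
```

The check only looks at the text of the load lines; it says nothing about which versions of the modules are actually on the server.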
shineon,

I'm in the process of upgrading my servers to SP6.  I'm having an issue with our intranet server running on this box.  SP6 upgrades PHP to version 5.  I'm having trouble loading zlib in PHP 5 and can't seem to find the solution on the internet.  I saw a post on EE regarding this issue that psicop helped someone with, but the solution wasn't posted from what I could tell.  Have you encountered this?

it's throwing this error:
[03-Aug-2007 14:14:49] PHP Warning:  PHP Startup: Unable to load dynamic library 'sys:/php5/ext/php_zlib.nlm' - (dlfcn) Load failure including unresolved symbol in Unknown on line 0

I copied the php_zlib.nlm file that I was using with PHP 4 from the php directory to the php5 directory.

I modified the php.ini file and added the extension php_zlib.nlm.

I also modified the line:
zlib.output_compression = On

Any ideas anyone?
I disabled hyperthreading in the BIOS and things have been running better.  It's not locking up every day; however, I'm still having tons of issues with dropped connectivity throughout the day, where we lose access to the network shares, etc.  Usually within a minute everything is working fine again.  I'm pretty sure that when I did the SP3 upgrade I opted not to install the driver updates.  I'm thinking I'll go back and re-apply the patch, this time choosing to install them, and see if that helps any.  Any other suggestions?

Server LGHSV001 halted Monday, August 6, 2007  10:00:41.949 am
Abend 1 on P00: Server-5.70.03: Page Fault Processor Exception (Error code 00000000)

Registers:
    CS = 0060 DS = 007B ES = 007B FS = 007B GS = 007B SS = 0068
    EAX = 00000000 EBX = 6F4960C0 ECX = 00000024 EDX = 094E4F28
    ESI = 6F4960C0 EDI = 094E4F28 EBP = 6B4F1F2C ESP = 6B4F1EDC
    EIP = 00107516 FLAGS = 00010246
    00107516 3901           CMP     [ECX]=?, EAX
    EIP in UNKNOWN memory area
    Access Location: 0x00000024

The violation occurred while processing the following instruction:
00107516 3901           CMP     [ECX], EAX
00107518 757D           JNZ     00107597
0010751A 9C             PUSHFD  
0010751B FA             CLI    
0010751C 8B1500F00300   MOV     EDX, [LOADER.NLM|CpuCurrentProcessor]=00000000
00107522 42             INC     EDX
00107523 FF05A8E00300   INC     dword ptr [0003E0A8]=0516EC62
LOADER.NLM|kspinlockdisable_patch:
00107529 F00FB111(LOCK) CMPXCHG [ECX], EDX
0010752D 7561           JNZ     00107590
0010752F 833DE883120000 CMP     [001283E8]=00000000, 00000000



Running process: Server 00:141 Process
Thread Owned by NLM: SERVER.NLM
Stack pointer: 6B4F1F60
OS Stack limit: 6B4EA020
Scheduling priority: 67371008
Wait state: 50500F0  Waiting for work
Stack: B7ACA5A2  (WSPSSL.NLM|WSPSSL_UpperConnLayerDataReceive+36)
       --00000024  ?
       --6B4F1F2C  ?
       --FFFEFFFE  ?
       --0000024C  ?
       --6B4F1F2C  ?
       --FFFEFFFE  ?
       --097CAC98  ?
       --097CAC64  ?
       61F9043B  ?
       --6F4960C0  ?
       --094E4F28  ?
       --00000000  ?
       --00000003  ?
       --00000000  ?
       --094E4F28  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --00000017  ?
       --097CAC8C  ?
       --FFFEFFFE  ?
       --097CAC64  ?
       61F7E190  (NILE.NLM|SSLDeRegister+E50)
       61F7E2E3  (NILE.NLM|SSLDeRegister+FA3)
       --097CAC64  ?
       --00000000  ?
       --097CAC78  ?
       --FFFEFFFE  ?
       --FFFFFFFF  (LOADER.NLM|KernelTempAliasesEnd+FFF)
       61F7E190  (NILE.NLM|SSLDeRegister+E50)
       003570C9  (SERVER.NLM|StartWorkToDo+23)
       --097CAC78  ?
       --FFFEFFFE  ?
       --FFFFFFFF  (LOADER.NLM|KernelTempAliasesEnd+FFF)
       --00000000  ?
       --17072299  ?
       00223223  (SERVER.NLM|kWorkerThread+DB)
       --097CAC78  ?
       --00000000  ?
       --69EDF540  ?
       --00000000  ?
       --69EDF540  ?
       0021CEA4  (SERVER.NLM|TcoNewSystemThreadEntryPoint+3C)
       --69EDF540  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --74007303  ?
       --6C007900  ?
       --2E006500  ?
       --69007600  ?
       --69007300  ?
       --69006200  ?
       --69006C00  ?
       --34343434  ?
       --0F000000  ?
       --1CF13C00  ?
       --00000000  ?
       --142AFB00  ?
       --02000000  ?
       --01000000  ?
       --03000000  ?
       --00000130  ?
       --57000000  ?
       --1F000000  ?
       --18F12500  ?
       --00000000  ?
       --10F12800  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --00000000  ?
       --3E001F00  ?
       --000015F1  ?
       --42000000  ?
       --00000DF1  ?
       --00700300  ?
       --00740070  ?
       --0079005F  ?
       --000F0000  ?
       --001CF13C  ?
       --00000000  ?
       --00142AFB  (LOADER.NLM|titleBarSaveBuffer+1DAF)
       --00020000  (SERVER.NLM|BIOSDriveCount+5E38)
       --00010000  ?
       --08030000  ?
       --00000001  ?
       --00570000  ?
       --000F0000  ?
       --00382B02  ?
       --000F0000  ?
       00302B08  (SERVER.NLM|DoLanguageCommand+64)
       --00000000  ?
       00102B03  (LOADER.NLM|InitializeProcessorTaskGates+4B)
       --00010000  ?
       --00000000  ?
       --08030000  ?
       
Additional Information:
    The CPU encountered a problem executing code in LOADER.NLM.  The problem may be in that module or in data passed to that module by a process owned by SERVER.NLM.
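
To see at a glance which modules show up in a stack dump like the one above, the resolved entries can be pulled out with a short Python sketch. It assumes that entries without the leading "--" are the ones the abend handler treated as code addresses, which is how this dump appears to be laid out; that reading and the function name are assumptions, not something taken from Novell documentation.

```python
import re

# Matches lines such as
#   "Stack: B7ACA5A2  (WSPSSL.NLM|WSPSSL_UpperConnLayerDataReceive+36)"
#   "       --FFFFFFFF  (LOADER.NLM|KernelTempAliasesEnd+FFF)"
STACK_ENTRY = re.compile(
    r"^\s*(?:Stack:\s*)?(-*)([0-9A-F]{8})\s+\((\w+\.NLM)\|(\w+)\+([0-9A-F]+)\)"
)

def code_frames(stack_text):
    """Return (address, module, symbol) for resolved entries without a '--' prefix."""
    frames = []
    for line in stack_text.splitlines():
        match = STACK_ENTRY.match(line)
        if match and not match.group(1):
            frames.append((match.group(2), match.group(3), match.group(4)))
    return frames

# A few lines copied from the dump above.
excerpt = """\
Stack: B7ACA5A2  (WSPSSL.NLM|WSPSSL_UpperConnLayerDataReceive+36)
       --00000024  ?
       61F9043B  ?
       61F7E190  (NILE.NLM|SSLDeRegister+E50)
       --FFFFFFFF  (LOADER.NLM|KernelTempAliasesEnd+FFF)
       003570C9  (SERVER.NLM|StartWorkToDo+23)
       00223223  (SERVER.NLM|kWorkerThread+DB)
"""

frames = code_frames(excerpt)
modules_in_order = list(dict.fromkeys(module for _, module, _ in frames))
print(modules_in_order)  # ['WSPSSL.NLM', 'NILE.NLM', 'SERVER.NLM']
```

On this excerpt the resolved frames point at WSPSSL.NLM and NILE.NLM ahead of the SERVER.NLM worker thread, which only tells you where to start looking, not which module is at fault.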



any other ideas on this anyone?
ASKER CERTIFIED SOLUTION
ShineOn
upgraded to sp5.  see new ticket 22780363 with new issues.
i mean, upgraded to sp6