dmarinenko

asked on

Locking server

I have a server that keeps locking up. It has also randomly rebooted. I have installed new memory, a new motherboard, and a new CPU (in that order). Memtest used to shut down the server until I installed the new motherboard; now it can run the test fine, and I haven't had the server reboot or shut down since. I am still getting lockups, though. I thought for a while it was something with Backup Exec, but last night it locked up at 8:15 and Backup Exec doesn't run until 9 p.m. It locked up a second time at around 4 a.m., while Backup Exec was verifying.

I am not on site, so it is a little harder to gather information. I do know that a red or orange light comes on on the case; this is the light below the warning triangle (note: it is a Supermicro case). It doesn't seem to come on until the server freezes up, because the last time I was there it was green for a while. I asked someone today whether it was red or amber, and they said it looks orange.

I also ran chkdsk without the /f option, and it came up with 3 reparse records and 5 unindexed files as issues. I don't want to use the /f option unless absolutely needed, as it takes forever and this is their main server.
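For reference, the read-only pass was roughly the first command below; the /f run is the one I am holding off on, since it needs exclusive access to the volume (a reboot if it is the system drive). C: is just a placeholder for the volume letter here.

  rem read-only scan, reports problems but repairs nothing
  chkdsk C:

  rem repair pass I am avoiding for now; requires the volume to be dismounted
  chkdsk C: /f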

There aren't any errors in the event log except the occasional DNS or NTFRS replication errors, which I have fixed a few times. Those are due to the server being off for too long at times.

Any ideas would be appreciated.

akrdm

Are the lockup times pretty consistent, or do the lockups happen at random intervals?
Does it happen every day or is that random as well?

One thing that I have seen cause lockups and reboots (besides software) is a bad power supply. If you have a power supply tester, you can test the power supply to see if it's having issues. Just a suggestion; hope it helps.
dnairns

Are you using SBS?
dmarinenko (ASKER)

It seems to lock up at night more often, but there is no real pattern. It does have redundant power supplies, and while I have personally seen the UPS work (the power and lights went out in a cramped server room while I was holding a 4U server, lining up the rails to slide it into the rack; perfect timing, lol), it still looks a little old. Because of this, as a test I split the two power supplies: one to the UPS and one to a surge protector on a wall outlet.

I would think that with redundant power supplies, even one bad power supply shouldn't take it down?

I have also thought about replacing the midplane as a possibility, but I am reluctant to keep throwing parts at it.

I forgot to note earlier that I have Intel Active System Console installed, and I do not see any current errors in there.
Also what version of Backup Exec are you using?
No, it is Server 2003 Standard.
And it is Veritas 10d, updated to Service Pack 4.
Have you tried monitoring the resource usage around the times when this is happening? If you can do so, this will help us see what resources are being used when the failure happens.
It seems as though it is not a hardware failure, but if you could post some details on the motherboard, CPU, and memory, and the specs of each, we can look at the configuration a little more to find out whether there is a hardware problem causing memory corruption that leads to the system locking up.
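For the monitoring, one way to capture it on 2003 without sitting at the console is a counter log started from the command line. This is only a rough sketch; the log name, counters, and output path are just examples:

  rem create a counter log sampling every 15 seconds
  logman create counter LockupTrace -c "\Processor(_Total)\% Processor Time" "\Memory\Available MBytes" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" -si 15 -o C:\PerfLogs\LockupTrace
  logman start LockupTrace

  rem after a lockup, stop the log and open the .blg in Performance Monitor
  logman stop LockupTrace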
It sounds a lot like data corruption. Are you using an array, or is it a non-redundant system?
It has an Intel S5000PAL motherboard with an Intel Xeon 5120 (1.86 GHz) processor.

2 GB of Kingston FB-DDR2 (PC2-5300) RAM, of which 80% is used while I am logged in and running SIW.

It has an Adaptec 3805 with 4 Seagate 73 GB drives in a RAID 5 with a hot spare, plus a battery backup. The array is not degraded.

It is in an Intel SR2500NA Chassis

I updated everything (BIOS, boot block, bus, etc.) to the latest 10/13/2008 updates when I put in the new motherboard.

The RAID card drivers are up to date; the firmware on it is slightly out of date (Build 12814 from 8/07, while they now have a Build 15753 update as of 10/08).

I suspect data corruption, but there are two things that bother me: chkdsk didn't pull up anything major, and the orange light on the case. I can run Performance Monitor and see what happens over the weekend.

By the way, I closed out a couple of things and memory usage is at 77%; if I weren't logged in it would probably be around 74%. Still a little high, though; they could definitely use more memory. They have 3 SQL databases running that are eating up over 680 MB.
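If it helps, I can also pull a quick per-process snapshot next time I am logged in, along these lines; the ~100 MB cutoff (the filter value is in KB) and the output path are just examples:

  rem list processes using more than roughly 100 MB
  tasklist /fi "memusage gt 102400"

  rem or dump everything (with hosted services) to a file for later comparison
  tasklist /svc > C:\temp\processes_before_lockup.txt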
Go ahead and run Performance Monitor to see if there is a process that spikes around 8:15 or so.
I suspect that the data on your volume is corrupted, but I get the idea that it has been progressing for some time.
If possible, duplicate the volume and run CHKDSK with the /f flag on the copy to see what the corruption looks like. Sometimes CHKDSK does not catch all of the errors until you give it the /f flag. I just did this on a server two days ago, and it was locking up too.
Also, try to run a utility to check the MFT. CHKDSK can repair it, but only to a certain extent; if there is corruption in that table, CHKDSK will not always see it.
It seems to me that a program is trying to access a corrupt file on the volume, and when it does it runs into many problems. It may be that somewhere within the file the data is corrupted. If this is the case, you will have to find the particular file and restore it to a point where it was working.
I have seen servers suffer memory corruption when the memory was a bit slow: the data got corrupted and was then saved to the HDD. Even after checking the file system with CHKDSK we found no errors, yet the problem persisted. Upon further investigation, it turned out that a file used by a remote management agent we had installed on the system had been corrupted. The headers and footers of the file were fine, so it seemed to be okay, but in between there was some corrupt data; when the agent ran and tried to update the inventory on another server, it caused the system to halt.
Do this only after running CHKDSK, as your problem may be simpler: check any programs you have installed on the server, check the files associated with them for corruption, and see whether one is loading data around the time that the locking occurs.
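Before reaching for a third-party tool, the built-in fsutil will at least tell you whether the dirty bit is set and show the basic MFT figures. It is not a full MFT check, and the drive letter below is just an example:

  rem is the volume flagged dirty?
  fsutil dirty query D:

  rem basic NTFS statistics, including MFT size and MFT zone
  fsutil fsinfo ntfsinfo D: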
Any suggestions on a utility to check the MFT?  Any freeware out there?
I will look, but I dismounted the volume and used a Linux utility to check it. It is called TestDisk, if you are interested. Otherwise I have no suggestions at the moment. Sorry!
No problem, you've been pretty helpful so far.
Maybe I'll just grab TestDisk. I'd rather have a Windows program, but I have built a few Rocks clusters and played around with getting Openfiler onto a USB drive, so I am familiar with Linux, although I wouldn't call myself an expert.
:-) Okay. It is included with most Linux platforms these days, and it is totally free. Just boot to Ubuntu and run it from the live disc if all else fails.
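If you do go the Ubuntu route, it is only a couple of commands from the live session, assuming it has network access to pull the package; the device name below is just an example, so pick the actual disk carefully:

  sudo apt-get install testdisk
  # read-only analysis of the chosen disk
  sudo testdisk /dev/sda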
Reinstall the video drivers.
OK, I ran chkdsk and fixed the few problems, and I didn't see any problems with TestDisk. I also replaced the power distribution module on someone else's advice. The server is still locking up. While I didn't do the video drivers lately, all the drivers were updated at some point in this mess and it never solved the problem. This has been getting progressively worse, although there was only one lockup this weekend, which is better than normal. Any other ideas?
ASKER CERTIFIED SOLUTION
dnairns
Sorry, it is in an Intel SR2500NA Chassis?
Does the light blink or stay solid?
Yes, it is an SR2500NA chassis, and it is a solid orange/amber light. I have seen a red light on them (FYI), so I know it isn't red.
It looks like it was actually a memory issue, due to how many things they had running. Thanks for all your help. dnairns, I will give you points for trying.
Months later.............
Although it has been solved for a while, it ended up being a bad replacement motherboard, sent from Intel to replace the bad one already in it.
Wow, that is unfortunate. I have been having a lot of problems with Intel motherboards lately; I had 4 DOAs in a row. I feel for you!
Thanks for the info though!
Oh, just for reference, the 4 DOAs in a row were on a DP45SG.