Locking server

I have a server that keeps locking up. It has also randomly rebooted. I have installed new memory, a new motherboard, and a new CPU (in that order). Memtest used to shut down the server until I installed the new motherboard. Now it can run the test fine. I haven't had a server reboot or shutdown since. I am still getting lockups though. I thought for a while it was something with Backup Exec, but last night it locked up at 8:15 and Backup Exec doesn't run until 9 p.m. It locked up a second time at 4 a.m. or so when Backup Exec was verifying.

I am not on site, so it is a little harder to gather information. I do know that a red or orange light comes on at the front of the case. This is the light below the warning triangle (note it is a Supermicro case). It doesn't seem to come on until the server freezes, because the last time I was there it was green for a while. I asked someone today whether it was red or amber, and they said it looks orange.

I also ran chkdsk without the /f option, and it reported 3 reparse records and 5 unindexed files as issues. I don't want to use the /f option unless absolutely needed, as it takes forever and this is their main server.

There aren't any errors in the event log except the occasional DNS or NTFRS replication error, which I have fixed a few times. These are due to the server being off for too long at times.

Any ideas would be appreciated.

dmarinenkoAsked:

akrdmCommented:
Are the lockup times pretty consistent, or do the lockups happen at random intervals?
Does it happen every day or is that random as well?

One thing that I have seen cause lockups and reboots (besides software) is a bad power supply. If you have a power supply tester, you can test the power supply to see if it's having issues. Just a suggestion, hope it helps.
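To answer the question about whether the lockup times are consistent, it can help to write the times down and bucket them by hour of day. The following is only a minimal sketch: the timestamps in `LOCKUPS` are hypothetical placeholders, not observations from this server, and the function names and the 20:00 to 06:00 "night" window are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical lockup timestamps noted by hand; replace with real observations.
LOCKUPS = [
    "2008-10-20 20:15",
    "2008-10-21 04:05",
    "2008-10-23 21:40",
    "2008-10-25 03:50",
]

FMT = "%Y-%m-%d %H:%M"


def lockups_by_hour(stamps):
    """Count lockups per hour of day (0-23)."""
    hours = [datetime.strptime(s, FMT).hour for s in stamps]
    return Counter(hours)


def night_fraction(stamps, start=20, end=6):
    """Fraction of lockups between start:00 and end:00 (window wraps midnight)."""
    if not stamps:
        return 0.0
    counts = lockups_by_hour(stamps)
    night = sum(n for h, n in counts.items() if h >= start or h < end)
    return night / len(stamps)


if __name__ == "__main__":
    # Print a crude text histogram, one row per hour that saw a lockup.
    for hour, n in sorted(lockups_by_hour(LOCKUPS).items()):
        print(f"{hour:02d}:00  {'#' * n}")
    print(f"night fraction: {night_fraction(LOCKUPS):.2f}")
```

If most lockups cluster in a few hours, scheduled jobs running in that window become the first suspects; an even spread points more toward hardware or power.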
dnairnsCommented:
Are you using SBS?
dmarinenkoAuthor Commented:
It seems to lock up at night more often, but there is no real pattern. It does have redundant power supplies, and I have personally seen the UPS work (the power and lights went off in a cramped server room while I was holding a 4U server, lining up the rails to slide it into the rack; perfect timing, lol). The UPS still looks a little old, though. Because of this I split the 2 power supplies, 1 to the UPS and 1 to a surge protector on a wall outlet, as a test.

I would think that with redundant power supplies, even 1 bad power supply shouldn't take it down?

I have also thought about replacing the midplane as a possibility. I am kind of reluctant to keep throwing parts at it, though.

I forgot to note earlier that I have Intel Active System Console and do not see any current errors in there.
dnairnsCommented:
Also what version of Backup Exec are you using?
dmarinenkoAuthor Commented:
No, it is Server 2003 Standard.
dmarinenkoAuthor Commented:
And it is Veritas 10d, updated to Service Pack 4.
dnairnsCommented:
Have you tried monitoring the resource usage around the times where this is happening? If you can do so this will help us see what resources are being used when the failure happens.
It seems as though it is not a hardware failure, but if you could post some details on MOBO, CPU and Memory and the specs of each we can look at the configuration a little more to find out if there is a hardware problem causing memory corruption that leads to the system locking up.
It sounds a lot like data corruption. Are you using an array, or is it a non-redundant system?
dmarinenkoAuthor Commented:
It has an Intel S5000PAL motherboard with an Intel 5120 (1.86 GHz) processor.

2 GB of Kingston FB-DDR2 (PC2-5300) RAM, of which 80% is used while I am logged in and running SIW.

It has an Adaptec 3805 with 4 Seagate 73GB drives in a RAID 5 with a hot spare, and a battery backup. The array is not degraded.

It is in an Intel SR2500NA Chassis

I updated everything (BIOS, Boot Block, Bus etc.) to the latest 10/13/2008 updates when I put in the new motherboard.

The RAID card drivers are up to date; the firmware on it is slightly out of date (Build 12814, which was the 8-07 release; there is now a Build 15753 update as of 10-08).

I am suspecting data corruption. There are 2 things that bother me though: chkdsk didn't pull up anything major, and there is the orange light on the case. I can run the performance monitor and see what happens over the weekend.
dmarinenkoAuthor Commented:
By the way, I closed out a couple of things, and memory usage is at 77%; if I wasn't logged in it would probably be 74%. Still a little high though; they could definitely use more memory. They have 3 SQL databases running that are eating up 680+ MB.
dnairnsCommented:
Go ahead and run the performance monitor to see if there is a process that spikes around 8:15 or so.
I suspect that the data on your volume is corrupted, and I get the idea that it has been progressing for some time.
If possible, duplicate the volume, run CHKDSK with the /f flag on the copy, and see what the corruption looks like. Sometimes CHKDSK does not catch all of the errors until you pass the /f flag to it. I just did this on a server 2 days ago, and it was locking up too.
Also, try to run a utility to check the MFT. CHKDSK can repair this, but only to a certain extent. If there is corruption in this table, CHKDSK will not always see it.
It seems to me that a program is trying to access some corrupt file on the volume, and when it does it runs into many problems. It may be that somewhere within the file the data is corrupted. If this is the case, you will have to find the particular file and restore it to a point where it was working.
I have seen servers suffer memory corruption when the memory was a bit slow; the data got corrupted and was then saved to the HDD. Even after checking the FS with CHKDSK, we found no errors, and the problem still persisted. Upon further investigation, a file used by a remote management agent we had installed on the system had been corrupted. The headers and footers in the file were fine, so the file seemed to be okay, but in between them there was some corrupt data that caused the system to halt when the agent ran and tried to update the inventory on another server.
Do this only after running CHKDSK, as your problem may be a bit simpler: check any programs you have installed on the server, check the files associated with them for corruption, and see if one is loading data around the time that the locking occurs.
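As a rough illustration of looking for a process that spikes shortly before a lockup, the sketch below scans a per-process CPU log for the peak value of each process in the minutes leading up to a given time. The CSV layout, the process names `service_a` and `service_b`, and all the numbers are invented for the example; this is not the format Performance Monitor exports, so a real log would need its own parsing.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical log format and values (not a real Performance Monitor export).
SAMPLE_LOG = """timestamp,process,cpu_percent
2008-10-24 20:00,service_a,12
2008-10-24 20:00,service_b,3
2008-10-24 20:10,service_a,15
2008-10-24 20:10,service_b,88
2008-10-24 20:14,service_a,14
2008-10-24 20:14,service_b,97
"""

FMT = "%Y-%m-%d %H:%M"


def peak_before(log_text, lockup, minutes=10):
    """Return {process: peak cpu_percent} for samples in the window before lockup."""
    end = datetime.strptime(lockup, FMT)
    start = end - timedelta(minutes=minutes)
    peaks = {}
    for row in csv.DictReader(io.StringIO(log_text)):
        when = datetime.strptime(row["timestamp"], FMT)
        if start <= when <= end:
            cpu = float(row["cpu_percent"])
            name = row["process"]
            peaks[name] = max(cpu, peaks.get(name, 0.0))
    return peaks


if __name__ == "__main__":
    peaks = peak_before(SAMPLE_LOG, "2008-10-24 20:15")
    # Highest peak first, so the most likely suspect is at the top.
    for name, cpu in sorted(peaks.items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} peak {cpu:5.1f}%")
```

Running the same check against several lockup times and comparing which process tops the list each time would show whether one program is consistently busy right before the freeze.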
dmarinenkoAuthor Commented:
Any suggestions on a utility to check the MFT?  Any freeware out there?
dnairnsCommented:
I will look, but I dismounted the volume and used a Linux utility to check it. It is called TestDisk, if you are interested. Otherwise I have no suggestions at the moment. Sorry!
dmarinenkoAuthor Commented:
No problem, you've been pretty helpful so far.
Maybe I'll just grab TestDisk. I'd rather have a Windows program, but I have built a few Rocks clusters and played around with getting Openfiler on a USB drive to get familiar with Linux, although I wouldn't call myself an expert.
dnairnsCommented:
:-) Okay. It is something that is on most Linux platforms these days. Totally free. Just boot to Ubuntu and run it from disk if all else fails.
kml57Commented:
Reinstall the video drivers.
dmarinenkoAuthor Commented:
OK, I ran chkdsk and fixed the few problems. I didn't see any problems with TestDisk. I also replaced the power distribution module on the advice of someone else. The server is still locking up. While I haven't updated the video drivers lately, all the drivers were updated at some point in this mess, and it never solved the problem. This has been getting progressively worse, although there was only 1 lockup this weekend, which is better than normal. Any other ideas?
dnairnsCommented:
Whew! Ruled out data corruption! Now on to memory leaks..... Ugh.
Okay, performance data is really the only way I know of to diagnose a memory leak. If you can start a log of what processes are running and how much memory they are using, that would be fantastic. From this data we will be able to see if there is a particular process that is using up a TON of memory, which would burn up all of the memory and cause the system to hang. It is almost like a program not responding, but more like ALL of the programs/services not responding at the same time.
This is certainly sounding more and more like a memory leak, but we cannot be sure.

Another route is to find out what the light is hooked to on the motherboard and check the motherboard manual to see what the light represents. From there you can watch the light when it comes on and see if there is a pattern to the blinking. If there is a pattern, as opposed to blinking at a constant rate, you can go to Supermicro's website and see what it means.
Can you also post some event logs for applications and system? This will help rule out some of the problems.
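For the memory logging suggested above, one simple way to spot a leak candidate is to fit a straight line to each process's memory usage over time and look for a steady upward slope. This is only a sketch under stated assumptions: the sample tuples, the process names `service_a` and `service_b`, and the values are made up for illustration, and a real log would have to be converted into this shape first.

```python
from collections import defaultdict

# Hypothetical samples: (minutes since logging started, process name, memory in MB).
SAMPLES = [
    (0, "service_a", 200), (0, "service_b", 150),
    (60, "service_a", 205), (60, "service_b", 210),
    (120, "service_a", 198), (120, "service_b", 270),
    (180, "service_a", 203), (180, "service_b", 330),
]


def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0  # all samples at the same time: no trend can be estimated
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return cov / var_x


def growth_mb_per_hour(samples):
    """Return {process: estimated memory growth in MB per hour}."""
    by_proc = defaultdict(list)
    for minute, name, mem_mb in samples:
        by_proc[name].append((minute, mem_mb))
    return {name: slope(pts) * 60 for name, pts in by_proc.items()}


if __name__ == "__main__":
    growth = growth_mb_per_hour(SAMPLES)
    # Fastest-growing process first.
    for name, rate in sorted(growth.items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {rate:+7.1f} MB/hour")
```

A process whose usage wobbles around a fixed level will show a slope near zero, while one that keeps climbing between restarts stands out. A steady climb is only a hint, since caches and databases also grow by design up to their configured limits.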
dnairnsCommented:
Sorry, is it in an Intel SR2500NA chassis?
dnairnsCommented:
Does the light blink or stay solid?
dmarinenkoAuthor Commented:
Yes, it is an SR2500NA chassis, and it is a solid orange/amber light. I have seen a red light on them (FYI), so I know it isn't red.
dmarinenkoAuthor Commented:
It looks like it was actually a memory issue, due to how many things they had running. Thanks for all your help. dnairns, I will give you points for trying.
dmarinenkoAuthor Commented:
Months later...
Although this has been solved for a while, it ended up being a bad replacement motherboard, sent from Intel to replace the bad one already in it.
dnairnsCommented:
Wow, that is unfortunate. I have been having a lot of problems with Intel motherboards lately. I had 4 DOAs in a row. I feel for you!
Thanks for the info though!
dnairnsCommented:
Oh, just for reference, the 4 DOAs in a row were on a DP45SG.