• Status: Solved
  • Priority: Medium
  • Security: Public

Server hard-locks (ish) on random nights

I deployed a brand new HP ProLiant ML350 G5 in March of this year. It's running Server 2003 SBS Standard and uses Symantec Backup Exec 12 for backups (external hard drives as backup media). The server ran like a champ for a couple of months, and then on May 5th, at night, it locked up and my customer took it down with the power button. It ran fine for a few days after that, and then it happened again. This time I went onsite to see exactly what was happening. The server appeared to be in a hard lock (the Num Lock light would not toggle, and the monitor was just black). I have four 1-TB drives in RAID 5, and the activity LEDs were going crazy. I went to a workstation and could ping the server successfully. I could also telnet to port 21, and it asked me for FTP credentials but wouldn't authenticate me when I gave them to it. So I had to power it down with the power button again. Since then it's locked up several more times.
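For anyone who wants to repeat the "ping works, port 21 answers, but nothing really responds" check from a workstation, here is a minimal Python sketch. It only tests whether a TCP connection opens and whether the service sends a greeting within a timeout. The function name `probe_tcp` and the default timeout are my own choices for illustration, not anything from the server or from Backup Exec.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection and read the service greeting.

    Returns (connected, banner). banner is an empty string if the port
    accepted the connection but sent nothing before the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                # FTP servers normally greet a new connection with a "220" line.
                banner = sock.recv(256).decode("ascii", errors="replace").strip()
            except socket.timeout:
                banner = ""
            return True, banner
    except OSError:
        # Refused, unreachable, reset, or timed out while connecting.
        return False, ""
```

Calling something like `probe_tcp("192.168.1.10", 21)` (the address is a placeholder) during a lockup would show whether the network stack still accepts connections even though the console is dead, which is what I saw by hand with telnet.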

Here are the times and dates that it appears to have gone into lock (as indicated by the event log. When I restart the server it logs a 6008 (previous system shutdown at blah blah was unexpected...). The event date shows the time that the server came back up after I shut it down, but inside the event it shows the time at night when I presume it locked up):

5-24 - 2:33 am
5-19 - 2:02 am
5-14 - 11:44 pm
5-12 - 11:30 pm
5-5 -  2:19 am

Other than the shutdown event, there is nothing in any of the event logs that is consistent -- nothing that would indicate a particular application or service or scheduled task causing the issue.

The only thing that runs at night is the backup. I run two backup jobs each night, splitting the vast amount of data between two external 1-TB hard drives. One job does a bunch of data plus system, system state, and Exchange. The other just does about 700 GB of data (mostly Photoshop and Illustrator drawings). I run a full backup on Fridays, and differential backups Mon-Thurs. As you can see by the dates, I'm experiencing the lockup both on differential nights and on full nights. The first night the server locked up, both backups (differential) completed successfully before the lockup. Other times one completes and the other might or might not complete. The lockup times do not directly relate to the start or finish of any backup job.
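To line the lockup times up against the backup schedule, a small Python sketch like the one below can help. It is only an illustration under two assumptions that are not stated in the post: the year is taken to be 2009, and a lockup shortly after midnight is attributed to the previous evening's job by shifting the timestamp back 12 hours. The helper names are made up for this example.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2009  # assumption: the year is not given in the post

# Lockup times as reported by the 6008 events (date, time of day).
LOCKUPS = [
    ("5-24", "2:33 am"),
    ("5-19", "2:02 am"),
    ("5-14", "11:44 pm"),
    ("5-12", "11:30 pm"),
    ("5-5", "2:19 am"),
]


def parse_lockup(day, clock, year=ASSUMED_YEAR):
    """Turn a ('5-24', '2:33 am') pair into a datetime."""
    return datetime.strptime(f"{year}-{day} {clock.upper()}", "%Y-%m-%d %I:%M %p")


def backup_night(ts):
    """Date of the evening whose backup window the timestamp falls in.

    Assumption: anything before noon belongs to the previous evening.
    """
    return (ts - timedelta(hours=12)).date()


def scheduled_job(night):
    """Job type for a given evening: full on Friday, differential Mon-Thu."""
    weekday = night.weekday()  # Monday is 0
    if weekday == 4:
        return "full"
    if weekday <= 3:
        return "differential"
    return "none scheduled"


if __name__ == "__main__":
    for day, clock in LOCKUPS:
        ts = parse_lockup(day, clock)
        night = backup_night(ts)
        print(ts.strftime("%a %m-%d %I:%M %p"), "->", night, scheduled_job(night))
```

If the assumed year or the 12-hour cutoff is wrong, the weekday mapping will be wrong too, so adjust both before reading anything into the output.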

No updates were installed, and no application installations or changes had been made at any point.

I have Microsoft involved on this, and they installed perfwiz and poolmon to collect data at the time of lockup, but upon analysis of the logs, they're not coming up with much right now.

Any ideas?  This one is driving me crazy.
Iamthecreator OM (Systems and Network Administrator) Commented:
Are you also running SEP or BESR?
logicaltechs (Author) Commented:
Neither of them
I'd flash the firmware of the RAID controller and upgrade to the latest ProLiant Support Pack. Also upgrade the firmware on the disks themselves.

Firmware 1.82 (17 April 09) for the E200 fixes 3 possible BSODs: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=uk&prodTypeId=329290&prodSeriesId=1157688&swItem=MTX-0bbd582676044917bb0a304b34&prodNameId=3182562&swEnvOID=1005&swLang=8&taskId=135&mode=4&idx=3 (you might have a P400 rather than an E200; the P400 is a much better controller).
logicaltechs (Author) Commented:
Flashed all the firmware and performed all the latest driver updates. Locked up that night. This has to be Backup Exec related -- it never happens except at night, and that's the only difference. I saw some other EE articles just recently that might indicate an issue with the VSS writer post Service Pack 2 in Server 2003. There's apparently a hotfix. For now, I have uninstalled Backup Exec, and I'm going to give it a few days to make sure that's what's doing it. If I can confirm Backup Exec is the root cause, then I may reinstall it and try the VSS hotfix.
logicaltechs (Author) Commented:
Just wanted to post an update. I removed BEWS 12.5 and have just been running differential data backups and Exchange backups with NTBackup for the last week. The server has been completely solid, no problems at all. So it's something Backup Exec related. I think I'm going to make sure the server is completely up to date, reinstall Backup Exec, and slowly add work to the backup job. I'm going to start with just data, then add Exchange, then add system state, to see if I can pinpoint exactly where this is happening.
Question has a verified solution.
