Solved

Server shuts down unexpectedly.

Posted on 2013-11-09
Last Modified: 2014-03-09
Hello everybody,

I have a server running Windows Server 2003 R2 as a domain controller. It shuts down unexpectedly three or four times a week. There are three events in Event Viewer related to the shutdowns (snapshots attached). No new software or hardware has been installed. I changed the power source, checked that all drivers are installed correctly, and ran a scan with Kaspersky, which found 15 threats, but the issue persists. I'm waiting for your suggestions and help.

Thanks.
01.png
02.png
03.png
Question by:CanadianITS
42 Comments
 
LVL 87

Expert Comment

by:rindi
Use the tools you have available to check your hardware. For example it looks like at least one of the disks in your array should be replaced. On a Dell Server you have the OpenManage utilities to monitor and report hardware problems, on HP's there is something similar. Or open your RAID utility, it should show you which disk to replace.

Also run a memtest86+, it'll show you if your RAM is still good. You'll find that tool on the UBCD:

http://ultimatebootcd.com

Also clean out all the dust, and make sure all the fans run smoothly.

Plan on replacing your server soon, along with the OS. Server 2003 will only be supported until July 2015; after that there will be no updates or security patches, which means your data will be exposed to attack.
 
LVL 11

Expert Comment

by:Manjunath Sullad
Hi,

Please go through the memory dump, which is saved by default as C:\Windows\MEMORY.DMP.

Please use a Microsoft dump analysis tool to examine this dump.

If you have not configured a memory dump, please configure one to find out the root cause (this will surely help you identify the reason for the unexpected reboots).

Please go through below link to generate memory dump.
Link : http://support.microsoft.com/kb/972110

~ Manju
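
As a rough illustration of the dump settings discussed above: Windows stores them under the registry key HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, where CrashDumpEnabled is 0 (none), 1 (complete), 2 (kernel) or 3 (small/minidump). The sketch below only interprets those values. The helper names are made up for this example, and the live registry read assumes Python 3 on Windows, which may not be practical on a Server 2003 machine; regedit shows the same values.

```python
import sys

# Meaning of the CrashDumpEnabled value under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def describe_crash_control(values):
    """Summarize crash-dump settings given as a {name: value} dict."""
    mode = values.get("CrashDumpEnabled", 0)
    kind = DUMP_TYPES.get(mode, "unknown (%r)" % mode)
    # Minidumps go to a folder; the other dump types go to a single file.
    if mode == 3:
        target = values.get("MinidumpDir", "<not set>")
    else:
        target = values.get("DumpFile", "<not set>")
    reboot = "yes" if values.get("AutoReboot", 0) else "no"
    return "dump type: %s; written to: %s; auto reboot: %s" % (kind, target, reboot)


def read_crash_control():
    """Read the live settings from the registry (Windows only)."""
    import winreg  # standard library, available only on Windows

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    wanted = ("CrashDumpEnabled", "DumpFile", "MinidumpDir", "AutoReboot")
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        for name in wanted:
            try:
                values[name] = winreg.QueryValueEx(key, name)[0]
            except FileNotFoundError:
                pass  # value not present on this system
    return values


if __name__ == "__main__" and sys.platform == "win32":
    print(describe_crash_control(read_crash_control()))
```

If the reported dump type is "none", no MEMORY.DMP or minidump will be written on the next crash, whatever the cause of the crash is.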
 
LVL 5

Expert Comment

by:mbkitmgr
Hi CanadianITS.

01.png - tells me the server is doing an ASR reboot as the OS has frozen.  Why.... see next paragraph

I can see from the error message capture 02.png you are running a Proliant server.  It is also telling me that something is failing the self test at boot.  Most likely a disk in your array.  You may also have a problem with the Smart Array Controller in the server.  Why ... see next paragraph

The error message you are receiving in 03.png, is telling me that one of the drives in your array is failing.

What to do?

1. Replace the drive that is failing in the array and let the array rebuild itself with the new disk.

2. Check whether there are any updated PSPs (ProLiant Support Packs) for your server and update the firmware in the array controller. Once upgraded, check the server. You may find the array controller is OK, or it may highlight that the controller also has a problem.

What model ProLiant are you using?
 

Author Comment

by:CanadianITS
Dear All,

First, sorry for my late reply; I was in another city.

Second, this server is an HP ProLiant ML370 G5 and the operating system is Windows Server 2003 R2.

I don't think this issue comes from the hard disk or any other device; see attached.
01.png
02.png
 
LVL 5

Expert Comment

by:mbkitmgr
Perhaps a bit of time on the HP website will shed some light on my comments.
 

Author Comment

by:CanadianITS
Mr. Mbkitmgr

Thanks for your concern.
 
LVL 87

Expert Comment

by:rindi
You haven't posted the result of memtest, nor have you shown us what the array management utility reports. You also didn't say whether you followed the advice and cleaned out all the dust.
 
LVL 22

Expert Comment

by:65td
You could connect a crossover cable to the iLO and see whether any logs were captured.
 
LVL 20

Expert Comment

by:Daniel McAllister
I had a similar experience with a 2003 server recently -- kept locking up overnight.
It turned out that a drive had a failing sector -- but the reason the system kept crashing overnight was that the client had a defrag scheduled every night. So the bad sectors would be hit nearly every night during the defrag process.

If there is a time pattern here (e.g. it happens at about the same time each day), then check your scheduled events. That's how I found the defrag that had been scheduled.

FWIW: Predictive failure means (in my organization) that you can still back it up -- but not for long... replace ASAP. Every one of my hard drive vendors (Seagate, WD, Samsung, etc.) allows warranty returns on drives that begin to fail SMART testing.

Just my thoughts...

Dan
IT4SOHO
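
One way to check for the time pattern described above is to bucket the shutdown times by hour of day. This is only a sketch: the timestamps are made up, standing in for the unexpected-shutdown events you would copy out of Event Viewer, and the 50% threshold is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps):
    """Count crashes per hour of day (0-23) from ISO-format timestamps."""
    hours = (datetime.fromisoformat(ts).hour for ts in timestamps)
    return Counter(hours)


def clustered_hours(timestamps, min_share=0.5):
    """Return the hours of day that account for at least min_share of all crashes."""
    counts = crashes_by_hour(timestamps)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(h for h, n in counts.items() if n / total >= min_share)


# Hypothetical timestamps, not taken from the server in this thread.
sample = [
    "2013-11-04 02:14:00",
    "2013-11-06 02:31:00",
    "2013-11-07 14:05:00",
    "2013-11-09 02:47:00",
]
print(clustered_hours(sample))  # prints [2]
```

If one hour dominates, compare it against the scheduled tasks, backup jobs and antivirus scans configured for that time. An empty result suggests the shutdowns are spread across the day.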
 
LVL 24

Expert Comment

by:smckeown777
Your screenshot 03.png shows you what is wrong(as already stated by others)

One of the drives in the server is failing...

You need to get the HP hardware diags software to get more details, but the screenshot shows it as drive in Slot1,Port2...do you know how many disks are in this server?

You need this to see what is going on on the array - http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=3279719&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D%257CswItem%253DMTX_332d5ba4f7ed4d8ab043593ff4%257CswEnvOID%253D1005%257CitemLocale%253D%257CswLang%253D%257Cmode%253D%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

Or...check if the HP management software is already installed and open Array Manager to check the status of the drives...

This is the diags package for the server hardware - http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=3279719&spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D1%257CswItem%253DMTX_a8e4c0d6decb48a2a62c497c5b%257CswEnvOID%253D1005%257CitemLocale%253D%257CswLang%253D%257Cmode%253D4%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

Try that to see if it reports more info for the issue...
 

Author Comment

by:CanadianITS
Thanks, everyone; you have helped me a lot and given me more approaches to try.

First, this server belongs to one of our customers. I'll go there today and check several things. I know one hard disk is corrupted, but I don't think that is the issue.

Last time, I copied the dump file from the server and tried to analyze it, and this was the result:

------------------------------------------------------------------------------------------------------------------

Symbol search path is: *** Invalid ***
****************************************************************************
* Symbol loading may be unreliable without a symbol search path.           *
* Use .symfix to have the debugger choose a symbol path.                   *
* After setting your symbol path, use .reload to refresh symbol locations. *
****************************************************************************
Executable search path is: 
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
Unable to read PsLoadedModuleList
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
KdDebuggerData.KernBase < SystemRangeStart
Windows Server 2003 Kernel Version 3790 MP (2 procs) Free x86 compatible
Product: LanManNt, suite: TerminalServer SingleUserTS
Machine Name:
Kernel base = 0x00000000 PsLoadedModuleList = 0x808af9c8
Debug session time: Mon Apr 19 10:45:47.453 2010 (UTC + 3:00)
System Uptime: 39 days 1:38:18.274
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
Unable to read PsLoadedModuleList
**************************************************************************
THIS DUMP FILE IS PARTIALLY CORRUPT.
KdDebuggerDataBlock is not present or unreadable.
**************************************************************************
KdDebuggerData.KernBase < SystemRangeStart
Loading Kernel Symbols
Unable to read PsLoadedModuleList
GetContextState failed, 0xD0000147
GetContextState failed, 0xD0000147
CS descriptor lookup failed
GetContextState failed, 0xD0000147
GetContextState failed, 0xD0000147
GetContextState failed, 0xD0000147
GetContextState failed, 0xD0000147
Unable to get program counter
GetContextState failed, 0xD0000147
Unable to get current machine context, NTSTATUS 0xC0000147
GetContextState failed, 0xD0000147
GetContextState failed, 0xD0000147
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 8E, {c0000005, bf9f0375, f64a000c, 0}

***** Debugger could not find nt in module list, module list might be corrupt, error 0x80070057.

GetContextState failed, 0xD0000147
Unable to get current machine context, NTSTATUS 0xC0000147
GetContextState failed, 0xD0000147
Unable to read selector for PCR for processor 0
[... the same "GetContextState failed", "Unable to get current machine context" and "Unable to read selector for PCR" messages repeat for roughly 240 lines ...]
Unable to read selector for PCR for processor 1
GetContextState failed, 0xD0000147
Unable to read selector for PCR for processor 0
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )

Followup: MachineOwner
---------

GetContextState failed, 0xD0000147
[... the line above appears 36 times in total ...]
  

----------------------------------------------------------------------------------


I'll check the server today and report back. Thanks again, everyone.
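
Even from a partially corrupt dump, the "BugCheck" line in the output above can still be read. A minimal sketch of decoding it follows; the two lookup tables cover only the values that appear in this log (0x8E is KERNEL_MODE_EXCEPTION_NOT_HANDLED and 0xC0000005 is STATUS_ACCESS_VIOLATION), and the function names are invented for the example.

```python
import re

# Names for the bug check codes this sketch knows about; extend as needed.
BUGCHECK_NAMES = {
    0x8E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED",
}
# For bug check 0x8E the first parameter is the NTSTATUS of the exception.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [params]) as integers, or None if the line does not match."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe_bugcheck(line):
    """Turn a debugger 'BugCheck' line into a short human-readable summary."""
    parsed = parse_bugcheck(line)
    if parsed is None:
        return "no bug check line found"
    code, params = parsed
    name = BUGCHECK_NAMES.get(code, "unknown")
    text = "0x%X (%s)" % (code, name)
    if code == 0x8E and params:
        status = NTSTATUS_NAMES.get(params[0], "unknown status")
        text += ", exception 0x%08X (%s)" % (params[0], status)
    return text


print(describe_bugcheck("BugCheck 8E, {c0000005, bf9f0375, f64a000c, 0}"))
```

An access violation in kernel mode is consistent with a faulty driver or bad hardware, but it does not distinguish between them. Note also that the header of the output above gives a debug session time of 19 April 2010, so it may be worth confirming that the dump file was actually written during one of the recent shutdowns.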
 
LVL 87

Expert Comment

by:rindi
A corrupt dmp file points to problems with the HD or rest of the hardware, like the RAM. Also make sure you only have minidumps, and not the full memory dumps. minidumps are less likely to corrupt, and they don't use as much space.
 

Author Comment

by:CanadianITS
There are no minidumps. How can I enable them?
 
LVL 87

Expert Comment

by:rindi
Right-click "Computer", select "Properties", then "Advanced System Settings". On the "Advanced" tab, click "Settings" in the "Startup and Recovery" section, and then under "System Failure" select "Small memory dump". You can also change the location where the minidumps are saved. That's how you get to this on Windows 7, but it'll be similar on Server 2003...

Of course after you have set those settings you will have to wait for the system to crash so the minidump gets written.
 

Author Comment

by:CanadianITS
OK, thanks. I'll do it and report back.
 

Author Comment

by:CanadianITS
Last time I configured the minidump and left the location as windows\minidump. The system has crashed twice since then, but no minidump was generated!

I'll bring a new copy of the full dump today; maybe it is not corrupted and will help us.
 

Author Comment

by:CanadianITS
I confirm again that I can't find any minidump after a crash, although I have already configured minidumps.

[Log deleted, rindi, EE Topic Advisor]
01.bmp
 
LVL 87

Expert Comment

by:rindi
Make sure you have show all files enabled in folder options. The minidumps are hidden. Look in the C:\Windows\Minidump folder for them, unless you set another location in the settings. Also, please, when posting a log like you did above, post it as either an attachment or code snippet. Like this it makes the thread unreadable.
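
If Explorer's hidden-file setting is in doubt, the folder can also be listed from a script, which sees hidden files regardless. The sketch below is an illustration only: the function name is invented, the demonstration runs against a throwaway folder with made-up file names, and running it on the server itself assumes a Python 3 interpreter is available there.

```python
import os
import tempfile
from datetime import datetime


def list_dump_files(directory):
    """Return (name, size_in_bytes, modified_time) for every .dmp file found.

    os.scandir lists hidden files too, so the Explorer
    "show hidden files" setting does not matter here.
    """
    if not os.path.isdir(directory):
        return []
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.lower().endswith(".dmp"):
                info = entry.stat()
                found.append(
                    (entry.name, info.st_size, datetime.fromtimestamp(info.st_mtime))
                )
    # Oldest first, so the most recent crash is the last entry.
    return sorted(found, key=lambda item: item[2])


# Demonstration on a temporary folder; on the server you would pass
# r"C:\Windows\Minidump" (or whatever location was configured) instead.
with tempfile.TemporaryDirectory() as folder:
    for name in ("Mini110913-01.dmp", "notes.txt"):
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(b"\0" * 16)
    for name, size, modified in list_dump_files(folder):
        print(name, size, modified)
```

An empty result for the real folder after a crash means the dump was never written, rather than merely hidden.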
 

Author Comment

by:CanadianITS
First, sorry to all about the last log; I didn't know it was that long.
I hope an Experts Exchange supervisor removes the log from my last comment; I'll attach it to this comment instead.

Second, regarding the minidump: I didn't change the location and I have already enabled showing hidden files, but there are no files. I'll check tomorrow; maybe one will be generated today.
report.txt
 
LVL 20

Expert Comment

by:Daniel McAllister
I will re-iterate my belief that a server that shuts down overnight, but not during the day, is most likely failing (or hanging) due to some task scheduled to run overnight.

In my most recent case, it was a disk defragment routine that a client had installed without my knowledge in an attempt to optimize the server.

In another recent case (about a year ago), a client had added a program that "wiped" empty parts of the disk -- presumably to prevent "recovery" of deleted files. Unfortunately, it not only crashed the server nightly, but it also crashed Exchange periodically through the day.

In both cases, the "mysterious overnight shutdowns" ceased when the scheduled tasks were removed.

While it is certainly possible that there is a hardware error, the timing points to a more human-caused issue. Hardware errors hardly ever wait for overnight to happen... they actually happen more frequently during the day, because the system is usually under more stress (from users) at that time.

Just my thoughts

Dan
IT4SOHO
 

Author Comment

by:CanadianITS
@it4soho

You may be right, but I don't really think that is the reason!

The only scheduled tasks on this server are two: "Daily Backup" and "Weekly Backup".

I'll try disabling these two tasks and see what happens over the next two days.
 

Author Comment

by:CanadianITS
After I deleted the two scheduled tasks, the server ran fine yesterday without shutting down, but when I came in today I found it had shut down!

So I think the scheduled tasks are not the issue.
 

Author Comment

by:CanadianITS
No minidump is being generated. Last time I switched it to a full dump, but that was not generated either!

I have attached two snapshots from Event Viewer.
04.png
05.png
 
LVL 87

Expert Comment

by:rindi
Use the management utilities that came with the server to check the state of its hardware. Make sure all dust is cleaned out. Upgrade the firmware (also that of the RAID controller). Scan the system for malware and remove any it finds.
 

Author Comment

by:CanadianITS
Dear all,

Thanks to everyone for replying and helping me.
I'm sorry for my late reply and feedback.

I have now brought the server to my office and cleaned out the dust and cleaned the RAM modules, but the issue is the same.

The server now shuts down while working and shows a red light on the LED labeled "Internal Health" (images attached).
01.jpg
02.jpg
 
LVL 24

Expert Comment

by:smckeown777
Did you load the diagnostics utility on this server yet? Without that there is no way to tell what is going on with the hardware...since you have a red light on the server that means there is an issue...the diagnostics suite should tell you exactly what is wrong

http://www.experts-exchange.com/Software/Server_Software/Q_28289777.html#a39670194

Either that or get the manual for that server to see if it shows any info on the red lights...
 
LVL 87

Expert Comment

by:rindi
Most servers have diagnostics available specifically for that hardware via their manufacturers. So if you haven't already installed those utilities on the server, find them on the manufacturer's site and run them.
 
LVL 20

Expert Comment

by:Daniel McAllister
Failure to create the SW dumps from the OS leads me to believe that this is a power supply or overheating issue.

Check the system's BIOS logs (if such things exist).

Verify cooling and temp values

Dan
IT4SOHO
 

Author Comment

by:CanadianITS
Dear All,

I found a tool called HP System Management that includes a diagnostics utility, but it shows no errors. If you can provide a link to download the diagnostics utility and explain how to use it, that would be very useful to me.

thanks.
 
LVL 20

Accepted Solution

by:
Daniel McAllister earned 500 total points
OK, I've looked over the entire history of this question... and it appears to me that mbkitmgr & smckeown777 got it right -- 3 months ago. Unfortunately, they failed to explain why Windows would not report any such error, which INCORRECTLY led you to believe that there was no hardware problem.... sending you searching through software that is likely operating normally.

Your HP ProLiant server has a RAID controller, and at least ONE of the drives in one of the RAID ARRAYS is in a pre-fail state (or was back in November). If you have not identified and replaced that drive, then that issue still remains, and IMHO is the most likely candidate for the cause of your freezes & reboots.

The RAID controller has only a minimal set of status information it can provide to the standard Windows device status -- and since the RAID ARRAY's DATA is still technically valid, the "drive" reports as "healthy". In fact, however, the RAID ARRAY is in a "degraded state" -- which means your data remains intact, but at significant risk of future loss.

This is like the loud noises a cooling fan makes when the bearings wear out -- that finally "fixes itself" (that is: the noise stops) when the fan completely fails and stops spinning. The lack of noise is actually an indication that the fan is now FULLY failed, vs. still operating but just not "well".

In this case, when the hard drive in question finally fails completely, the RAID controller will remove it from the array and make the appropriate "changes"... it will then no longer have the seek and/or read/write failures it's having to deal with currently (and that is likely causing your restarts). Unfortunately, any pretense of protecting your data will be gone at that time as well.

What is needed is the RAID management program (or programs) for your RAID controller, so you can identify the failing hard drive. (I predict that, since you're still experiencing issues, the drive is not yet fully failed -- or else the problems likely would have gone away.)

With a couple of quick Google inquiries, I have found that your RAID controller is most likely an HP Smart Array P400 (though it could also be a P200, or even a 3rd-party add-on card). I also located the UTILITIES for this controller (not just the DRIVER) on the HP site, and I can see the following that *I* would try:

HP ProLiant Array Configuration Utility for Windows (not the CLI)
HP ProLiant Array Diagnostics and SmartSSD Wear Gauge Utility for Windows

With these UTILITIES installed, you should be able to see the error status on the RAID array.

NOTE: This is a personal pet-peeve of mine: servers supposedly "protected" with RAID, but no way to monitor the actual RAID ARRAY status! It's like having backups you cannot validate!

Dan
IT4SOHO
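
The gap described above, where the volume looks healthy to Windows while one member disk is on its way out, can be shown with a toy model. This is a simplified sketch, not the Smart Array controller's actual logic; the state labels and bay names are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    bay: str
    state: str  # "ok", "predictive_failure", or "failed" (labels chosen for this sketch)


def mirror_status(disks: List[Disk]) -> str:
    """Status of a simplified RAID 1 volume, as a coarse OS-level view would see it."""
    working = [d for d in disks if d.state != "failed"]
    if not working:
        return "failed"
    if len(working) < len(disks):
        return "degraded"
    return "ok"


def disks_needing_attention(disks: List[Disk]) -> List[str]:
    """Bays that a RAID management utility would flag, even if the volume reads 'ok'."""
    return [d.bay for d in disks if d.state != "ok"]


# Hypothetical two-disk mirror with one member reporting predictive failure.
mirror = [Disk("bay 1", "ok"), Disk("bay 2", "predictive_failure")]
print(mirror_status(mirror))            # volume-level view
print(disks_needing_attention(mirror))  # per-disk view
```

The point of the model is that the per-disk view carries information the volume-level view hides, which is why the array utilities are needed in addition to what Windows reports.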
 

Author Comment

by:CanadianITS
Dear Mr. IT4SOHO,

Thank you very much for your reply; it was really useful and clarified everything for me.

Actually, I knew that one hard disk had failed, and I already know which one.

But I didn't believe the issue came from this hard disk.

I have already removed the failed disk but have not replaced it yet.

This disk was configured with another one as RAID 1 to hold the operating system.
Can I remove this array and leave the remaining hard disk working alone without RAID, or will the operating system be damaged once I remove it?

I know I have to replace the failed disk, but I'm asking whether it is possible to remove this RAID and leave the operating system running on one hard disk.
 
LVL 87

Expert Comment

by:rindi
If you don't replace the bad disk as soon as possible, you risk losing your OS installation should the still running disk fail. A server should ALWAYS have RAID, and a failed disk should ALWAYS be replaced as soon as possible (in fact, you should, if you have enough empty drive bays in the server, have a "hot-spare" which can take over and rebuild the array as soon as a drive fails, and then you can replace the bad drive with another drive which then takes the part of being the "hot-spare").

You should also ALWAYS check that your backups are OK, as RAID doesn't replace backups.
 
LVL 24

Expert Comment

by:smckeown777
Comment Utility
I can't understand why you haven't replaced the failed disk...makes no sense

Troubleshooting problems in the IT world isn't easy...but in order to get to a root cause you need to eliminate issues along the way...you had a failed disk in a RAID1 array, and you've pulled the failed disk out, am I correct? Why can you not plug in a replacement disk? Your server is continuously shutting down and giving issues...and you are now running on a single disk...so that makes life even harder...

If you don't want to lose this server completely I suggest you replace the failed disk first...once the array rebuilds we can continue to try to locate the issue...replacing this disk will not make things WORSE, it will help things move along in the right direction

Even if the failed disk was not the CAUSE of the issues...you still realistically need to replace it

Author Comment

by:CanadianITS
Dear rindi & smckeown,

I know the importance of RAID in general, and in my environment I have RAID 1 for the operating system with a spare disk, and the data disk configured with RAID 5, also with a spare disk. But this server belongs to one of our customers, and I face many difficulties getting approval to change or add anything, so I want to know exactly where the issue is and then ask to replace the hard disk together with any other component that may need changing.

So I want to test this: if I remove RAID 1 from the operating system disk and run on one hard disk, will the issue remain or be solved?

If I remove RAID 1, will the data be removed or not?

LVL 24

Expert Comment

by:smckeown777
No, that will not solve the issue.

The picture you attached a few posts back: was that the PSU? If so, it appears to have a warning LED, so, as a previous expert mentioned, it might be as simple as replacing that.

LVL 87

Expert Comment

by:rindi
If one of the members of the array is dead already you are currently running the server without RAID. If your client refuses to fund the replacement disk I'd also refuse to go any further. You can't work without the tools needed, and if you do anything you risk paying more than you earn through the customer.

The first thing to do in such situations, and that's something you must insist on, is to make 200% sure that the backups are good and complete, and that they are current. If they aren't, back up now and test your backup. Then have the bad disk replaced and wait for it to finish rebuilding the array. After that you can go on analyzing the situation.
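
One way to "test your backup" for plain file-level copies is to restore them to a scratch folder and compare checksums against the originals. The following is only a minimal sketch of that idea in Python, not a feature of any particular backup product: the function names (`sha256_of`, `compare_trees`) are made up for illustration, and it assumes the restored files are reachable as an ordinary directory tree from a machine that has Python 3 installed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source: Path, backup: Path) -> dict:
    """Compare every file under `source` with the same relative path under `backup`.

    Hypothetical helper: it only checks that each source file exists in the
    backup with identical content; extra files in the backup are ignored.
    """
    report = {"ok": [], "missing": [], "mismatched": []}
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src_file.relative_to(source)
        dst_file = backup / relative
        if not dst_file.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(src_file) != sha256_of(dst_file):
            report["mismatched"].append(relative.as_posix())
        else:
            report["ok"].append(relative.as_posix())
    return report
```

Calling `compare_trees(Path(source_dir), Path(restored_dir))` returns three lists of relative paths. Files that changed on the server after the backup ran will show up as mismatched, so the comparison is most meaningful for data that was not modified in between.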

Author Comment

by:CanadianITS
Thanks Mr. rindi, smckeown & IT4SOHO

As a first step, I'll ask to have this hard disk replaced and the RAID rebuilt; after that I'll check the status and update you.

LVL 20

Expert Comment

by:Daniel McAllister
With regards to RAID-1 arrays and replacing hard drives.... here is my experience:

A typical server is built with RAID-1 with 2 drives that are typically purchased together... thus, they typically have a manufacture date either in common, or close to it.

Now let's visit the hard drive factory for a moment... like all manufacturing facilities, there are good days, and there are bad days... but generally speaking, drives manufactured at the same time are going to have similar lifetimes.

The "published" lifetime of a hard drive model is typically an aggregate of drives tested from many "batches", and may have little bearing on the actual lifetime of any single disk.

The point of all of these observations is that drives manufactured together often fail together. Thus, for RAID arrays built with drives from the same batch, it is even more important than you think that you replace a failed drive -- because the remaining drive USUALLY doesn't have much time before it too will fail.... which will result in loss of data, and in your case, loss of the server OS (which can be difficult to get re-installed later).
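
To put rough numbers on why the length of time spent on a single disk matters, here is a small back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the function name is hypothetical, the annual failure rates are invented rather than measured, and the constant-hazard (exponential) model is a simplification that does not capture batch effects or wear-out.

```python
import math


def failure_probability(annual_failure_rate: float, days_exposed: float) -> float:
    """Probability that one drive fails within `days_exposed` days.

    Assumes a constant hazard rate chosen so that the probability of
    failure over 365 days equals `annual_failure_rate`.
    """
    if not 0.0 <= annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be in [0, 1)")
    hazard_per_year = -math.log(1.0 - annual_failure_rate)
    return 1.0 - math.exp(-hazard_per_year * days_exposed / 365.0)


if __name__ == "__main__":
    # Illustrative rates only: a "healthy" drive versus a worn drive
    # from the same batch as one that has already failed.
    for label, rate in (("healthy (assumed 3%/yr)", 0.03),
                        ("same worn batch (assumed 20%/yr)", 0.20)):
        for days in (7, 30, 90):
            risk = failure_probability(rate, days)
            print(f"{label}: {days} days unprotected -> {risk:.2%} chance of failure")
```

Whatever rate is plugged in, the result grows with the number of days the mirror stays broken, which is the practical argument for replacing the failed drive quickly.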

As a consultant, I have a form I give to clients if they either decline RAID in a server, or decline to replace a failed drive in a RAID array. The form indemnifies me from any loss caused by the computer malfunctioning; specifies that my labor rates for recovering their system AFTER loss will be substantially more than the cost of replacing the drive and re-establishing the RAID array; specifies their acknowledgement that backups may be corrupted by a slowly failing drive, and that therefore, their backups may not be as valid as they may be counting upon.

In most cases, when clients see the form, they choose to purchase the new drive (and/or pay for RAID in their new server).

BTW: I am confused about one thing -- you clearly state that the RAID-1 array was built "with a spare" -- the RAID controller should have already activated the spare, so your RAID array should be HEALTHY at this time. I'm far less concerned with the timeliness of replacing the SPARE, than I am with replacing the MIRROR component.

All of that being said, I want to revisit the original complaint (now 3 months old): the server is halting or rebooting unexpectedly, and without software notification (that is, there are no log files, or entries in current log files, that report a problem).

Just because the RAID arrays need attention doesn't mean this HAS to be the problem. It is RARE for a hard drive problem to result in a consistent system failure with no log entry. Someone else mentioned that the power supply showed a warning light. I would also encourage you to visually examine the motherboard on the server. (If it's running 2003, it's not likely a very new piece of hardware).

Bulged capacitors would result (in most cases) in an inconsistent CPU and/or RAM function -- which could result in a system halt that goes otherwise unreported. Likewise, a power supply that is out-of-spec (aka: failing) could cause power problems that would similarly result in an unexpected halt.

The point is, we don't want to just repair the RAID -- we want to restore stability to the server.

Thanks for reading...

Dan
IT4SOHO

Author Comment

by:CanadianITS
Dear Mr. it4soho

Thanks again for your reply and your concern,

I have already sent a request to replace the failed hard disk; after that I'll run tests to diagnose the issue.

Author Comment

by:CanadianITS
I will close this topic. Once the new hard disk is delivered, I'll open a new topic if the issue is not solved. Thanks to everyone who helped me.