GenasysTechnologies (South Africa) asked:

Server Freezes or Reboots on its own

Hi Guys
I have a brand new R610 rack server that keeps giving problems. It was fine for a week, but for the last three weeks it has behaved very erratically. It either reboots on its own or just freezes, at which point I cannot access it over the network or locally and I have to hold the power button in to shut it down. Please see the attached screenshots for the errors.

Here is the HW config of the server:
Intel Xeon E5620 Processor (2.40GHz, 4C, 12M Cache, 5.86 GT/s QPI, 80W TDP, Turbo, HT) 1066MHz Max Memory
8GB Memory for 1CPU (4x2GB Dual Rank RDIMMs) 1066MHz
2 x 146GB SAS 6Gbps 15k 2.5" HD Hot Plug (RAID1)
3 x 146GB SAS 6Gbps 15k 2.5" Additional HD Hot Plug (RAID5)
PERC 6/i RAID Controller Card 256MB PCIe, 2x4 Connectors
16X DVD-ROM Drive SATA
High Output Redundant Power Supply (2 PSU) 717W, Performance BIOS Setting
Embedded Broadcom GbE LOM with TOE and iSCSI Offload HW Key
C8 MSS R1/R5 for PERC 6i/H700, Exactly 2 Primary and 3-4 Additional Drives

Currently I have teaming enabled for two of the NICs on a 1Gbps network.

This server is running Exchange 2010 SP1 and Server 2008 R2. It also has the latest version of ESET Mail Security for Exchange installed.

Is there any way I can fix this?

Thanks!!

ScreenShot-1.jpg
ScreenShot-2.jpg
faizbaig

The CPU temperature needs monitoring.
Maybe download and install a program called SpeedFan to see the CPU core temperatures. None of them should be above about 52 degrees C.

Download link
http://www.almico.com/sfdownload.php
....and

You might want to check the memory for errors, memory problems can often cause crashes and random restarts.

A good program for checking your memory is made by Microsoft. It's free and runs from a boot disk.

Go to this page:
http://oca.microsoft.com/en/windiag.asp

Click on the second link from the top "Download Windows Memory Diagnostic".

Download the little program and get a blank floppy disk. When you run the program it will create a bootable floppy disk with a memory testing program on it. Switch off your computer and remove the second stick of memory. Start it up and boot from the floppy disk (you may have to change the boot order in the BIOS).

When the program loads it will immediately start checking your memory in 'standard mode'. Any errors encountered will be displayed at the bottom.
If your memory passes the test without any errors then it's probably okay, but just to be safe you can press 'T' which will make it go into a more thorough mode (this takes a bit longer - I recommend leaving it overnight to do this one).

Yes, the freezing or restarting could be caused by one bad memory stick.
If hardware is causing the reboots, you should be able to find clues to the cause in the integrated log of the Dell server management system (OpenManage).
GenasysTechnologies (ASKER)

Thanks Faiz!

The temperature should not be the problem as it never goes over 20 degrees C. I will check the memory now. I just updated the controller firmware to the latest version and am rebooting.
Has a memory dump been generated on the server? If not, generate a manual memory dump and analyze it to find the root cause.

http://support.microsoft.com/kb/244139
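
For anyone who wants to script the setup from that KB article, here is a minimal sketch. It only builds the `reg add` command strings for the `CrashOnCtrlScroll` value under the PS/2 (`i8042prt`) and USB (`kbdhid`) keyboard driver keys and prints them for review; it does not run anything. The function name `crash_dump_commands` is made up for illustration. This assumes the server is already set to write a complete or kernel memory dump, and the registry change needs a restart before holding the right Ctrl key and pressing Scroll Lock twice will trigger the dump.

```python
# Sketch: build the `reg add` commands that turn on the keyboard-triggered
# crash dump described in KB 244139. Nothing is executed here; the function
# only returns command strings so they can be reviewed first.

# Keyboard driver services that read the CrashOnCtrlScroll value.
KEYBOARD_SERVICES = {
    "ps2": "i8042prt",
    "usb": "kbdhid",
}


def crash_dump_commands(keyboards=("ps2", "usb")):
    """Return one `reg add` command string per requested keyboard type."""
    commands = []
    for kind in keyboards:
        service = KEYBOARD_SERVICES[kind]
        key = (
            "HKLM\\SYSTEM\\CurrentControlSet\\Services\\"
            + service
            + "\\Parameters"
        )
        commands.append(
            'reg add "' + key + '" /v CrashOnCtrlScroll /t REG_DWORD /d 1 /f'
        )
    return commands


if __name__ == "__main__":
    for command in crash_dump_commands():
        print(command)
```

The printed commands would be run from an elevated command prompt on the server. Since forcing the dump crashes the box on purpose, it only makes sense while the server is already hung.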

Hi Dax

I checked the hardware logs in OpenManage already and there are no errors.
ASKER CERTIFIED SOLUTION
Mohammed Basheer

Hey Basheer

Yes, I installed it with the bootable CD. I will download the SUU and do the needed updates.

Thanks!
Mohammed Basheer

Thanks for the update.
Try the second test also. Keep the server on in non-Windows mode and see whether it restarts. That way I guess you can isolate whether it's a Windows problem or a hardware problem.

Regards
Great, will do. I'll only be able to do it on the weekend though, as this is our production Exchange server.

Will update once I'm done.
Okay, good luck :) and don't forget to update us.
I have approx 45 R610s in my inventory and I've had a problem with a number of them where the memory gets slightly loose in transit. I eject all the memory modules and reseat them as standard practice now. I'd give that a try.
"Fatal firmware error" on the RAID controller and you hide it in a screenshot?

Just log a hardware fault with Dell assuming said firmware is up to date.
@ryan

Thanks, will try that this weekend as well and send an update.

@andy

What I hid in the screenshot was the name of my server, all the other info is there. I'll log a call as soon as I cannot fix it myself, thanks.
I meant that you hadn't put the message "fatal firmware error" in the text or the title but only in the screenshot. A lot of people just skim the threads and don't read through screenshots.

Bad RAM can't cause that. It can only be the RAID controller, whether onboard or in a PCI slot; it can't even be the PCI slot connection. A fatal firmware error on anything is integral to that specific device. At least the controller shuts itself down because it knows it's had a brain fart rather than writing gibberish to the disks. With the I/O subsystem stopped the server hangs, although the mouse and video might still work.
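
Since a message like that is easy to miss, here is a rough sketch of pulling it out of a log that has been exported to plain text. The export format, the sample lines and the helper name `find_controller_errors` are all assumptions for illustration; nothing here is taken from the actual server in this thread.

```python
import re

# Phrases worth surfacing. "fatal firmware error" is the message from this
# thread; the list can be extended with other strings of interest.
PATTERNS = [re.compile(r"fatal firmware error", re.IGNORECASE)]


def find_controller_errors(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, only to show the call.
    sample = [
        "controller 0: patrol read started",
        "controller 0: Fatal firmware error",
        "controller 0: battery charge complete",
    ]
    for number, text in find_controller_errors(sample):
        print(number, text)
```

Pasting the matching lines into the question text or title makes the key error visible to people who skim.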
Ahh, ok, thanks for clearing that up, I will update the title now.

I did update the firmware of the controller yesterday to the latest version from Dell's site so I'm keeping an eye on it. For now all seems ok, but I don't want to speak too soon.
Ok, wait and see
Ok, the first thing I tried on Saturday was to install the latest firmware from the latest Update Utility image I downloaded on Friday. After doing this, the fatal firmware errors disappeared and the server has been fine until now. Before that it usually froze once or twice per day.

So for me the solution was to update all the firmware to the latest versions.

Thanks Basheer and everyone else!