Mark Galvin
asked on
Dell PowerEdge 2950 crashing on a weekly basis
The client has a Dell PowerEdge 2950 running Windows Server 2008 R2 and SQL Server 2012.
A month ago, users reported that apps (which use that server as their database back end) were running slowly. When we checked the console, it was totally unresponsive. We could ping it by name and IP address and got a response, but we could not do anything else. We power cycled the server and it came back up.
Seven days later, the same thing happened. I was not on site at the time, so a junior member of IT power cycled it and it came back up.
Seven days later it happened again, and this time I was on site. It seems that the server's performance gradually gets worse until the server eventually becomes unresponsive. After a power cycle, I checked Dell System Admin and found that the PERC 5/i battery had failed. I know that can cause performance degradation in the RAID array. We replaced it, and after 48 hours the battery reported as active.
I had hoped that would stop the symptoms, but it happened again yesterday. I was not on site, so the server was power cycled.
Tonight I ran Windows Update. There were about 17 updates; all installed and the server rebooted.
I have checked the server for tasks that run on a regular basis (such as AV) and found nothing that corresponds with the performance degradation and subsequent unresponsiveness, which has happened on either a Tuesday or a Wednesday each week for the last 4 weeks.
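To test whether the incidents really follow the calendar, one option is to list the time of each unexpected shutdown (for example, from the System log Event ID 6008 entries written at startup after each power cycle) and look at both the weekday and the gap since the previous incident. The Python sketch below does only that; the function name is illustrative and the dates in it are hypothetical placeholders, not the real incident dates. If the gaps stay near a fixed number of days while the weekday drifts, that might hint at something accumulating since the last reboot rather than a scheduled task, but that is an interpretation to check, not a conclusion.

```python
from datetime import date


def summarize_incidents(dates):
    """Return (weekday name, days since previous incident) for each date, oldest first."""
    ordered = sorted(dates)
    summary = []
    previous = None
    for d in ordered:
        # The first incident has no predecessor, so its gap is None.
        gap = None if previous is None else (d - previous).days
        summary.append((d.strftime("%A"), gap))
        previous = d
    return summary


if __name__ == "__main__":
    # Hypothetical dates for illustration only; substitute the real incident dates.
    incidents = [date(2014, 3, 4), date(2014, 3, 11), date(2014, 3, 19), date(2014, 3, 25)]
    for weekday, gap in summarize_incidents(incidents):
        print(weekday, gap)
```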
Firmware on PERC is up to date. BIOS is up to date.
First, does anyone have any thoughts on troubleshooting the issue preemptively?
Second, does anyone have any suggestions for some basic monitoring (out-of-the-box Windows tools) to keep an eye on what is happening, so that I can review it when the next crash takes place? There are plenty of other servers in the domain that can be used to store logs etc.
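One out-of-the-box option is `typeperf`, which ships with Windows and can log performance counters to a CSV file (for example `typeperf "\Memory\Available MBytes" -si 30 -o C:\PerfLogs\mem.csv`; the output path is hypothetical). A Performance Monitor data collector set can also write CSV. The Python sketch below assumes the typeperf-style CSV layout (a header row of counter paths, a timestamp in the first column, one column per counter) and fits a least-squares line to one counter, so a slow, steady decline in available memory (or growth in a process's private bytes) shows up as a single number per sample. The server name `SQL01` and the function name are hypothetical, and treating blank cells as missing samples is an assumption about the file.

```python
import csv
import io


def counter_trend(csv_text, counter_substring):
    """Least-squares slope (counter units per sample) of one counter in a typeperf-style CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Pick the first column whose counter path contains the requested text.
    col = next(i for i, name in enumerate(header) if counter_substring in name)
    points = []
    for i, row in enumerate(data):
        cell = row[col].strip() if col < len(row) else ""
        if cell:  # assumption: a blank cell means the sample is missing
            points.append((i, float(cell)))
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


if __name__ == "__main__":
    # Hypothetical three-minute excerpt; a negative slope means memory is shrinking.
    sample = (
        '"(PDH-CSV 4.0)","\\\\SQL01\\Memory\\Available MBytes"\n'
        '"03/04/2014 10:00:00.000","8000"\n'
        '"03/04/2014 10:01:00.000","7900"\n'
        '"03/04/2014 10:02:00.000","7800"\n'
    )
    print(counter_trend(sample, "Available MBytes"))  # prints -100.0
```

Run against a few days of samples, a persistently negative slope for Available MBytes between reboots would be the kind of gradual degradation described above.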
Thanks
Mark
ASKER
It started happening again today. I was able to get onto the server remotely and found that the w3wp.exe process was not releasing memory after a user had run a Reporting Services report. I changed the recycling settings in IIS so that the application pool recycles after two requests rather than at the default interval of 1740 minutes (29 hours).
I suspect that on the days the server's performance degraded, a number of users were running reports. I noticed today that after one report the process held onto between 1 and 2 GB of memory. There's only 32 GB in the server, so I'm not surprised that it was crashing.
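As a rough back-of-envelope check of that reasoning, the Python sketch below counts how many reports' worth of retained memory fit in RAM. It assumes, as a simplification, that each report's memory is never released until the next recycle or reboot; the `other_usage_gb` parameter (memory used by SQL Server, the OS and everything else) is a hypothetical input that would have to be measured, and the function name is illustrative.

```python
def reports_until_exhausted(total_gb, retained_per_report_gb, other_usage_gb=0.0):
    """How many reports' un-released memory fits before RAM runs out (simplified model)."""
    if retained_per_report_gb <= 0:
        raise ValueError("retained memory per report must be positive")
    free_gb = total_gb - other_usage_gb
    # Floor division: a partial report's worth of memory does not count as one more report.
    return max(0, int(free_gb // retained_per_report_gb))


if __name__ == "__main__":
    # Figures from the thread: 32 GB installed, 1 to 2 GB retained per report.
    print(reports_until_exhausted(32, 2))  # worst case per report
    print(reports_until_exhausted(32, 1))  # best case per report
```

With only the figures from the thread, the ceiling is 16 to 32 reports before any other memory use is counted, which a handful of users running reports over several days could plausibly reach.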
Thanks for the feedback.
Also make sure all memory is in recommended slots and is the right type and speed, and each slot pair has matched memory.
The slot diagram should be on the inside of the cover.
If you still have problems, I would check the processor heat sink compound.
It could just be too little memory for the task, and the server slows as the cache ages and fills...