Problem with Hardware or Memory leak? (SATA RAID1 "dirty" after RAM swap)

Hello Experts.

I have a server with two RAID1 arrays. In other words, 4 SATA drives make up 2 Windows volumes (C: and D:). D: is the data volume and is 70 GB; the other volume is 30 GB (or so).

This server randomly freezes, and I have been trying to understand why. I went to replace the RAM: after shutting down, unplugging the server, swapping the RAM, and rebooting, the RAID status screen displayed a critical message that the 70 GB data RAID array was operating on only one drive! The client had said that this happened before when the server froze, but not EVERY time. After the OS (Windows Server 2003) booted, everything looked good; I checked the RAID status utility, and the data (70 GB) RAID1 array was rebuilding and was at 20%. I have never seen behavior like this before.

As you can probably guess by now, the new RAM did not fix the freezing, and so now I am left wondering if it’s hardware or software related. When the server was being built, it ran fine for several days before being put into a production environment, so the possibility of a memory leak from an app is there, but I wanted some insight regarding the RAID1 being “dirty” only after a RAM swap out.
Could this mean that there is a problem with a drive, or the RAID controller? The SATA drives are in a hot-swap “bay” with a hot-swap backplane (Intel components).
Has anyone run into this?

Thanks in advance for any help in this matter.
talkingbob Asked:
 
CetusMOD Commented:
PAQ'd, 266 points refunded.
CetusMOD
Community Support Moderator
 
gjohnson99 Commented:
A lock-up like this will most likely cause the RAID failure 90% of the time.

Check the logs for errors. It could be a driver or software.

 
rindi Commented:
Run memtest86 (http://memtest.org). If you don't get an error on the first pass, run at least 5 passes.

If the RAM is OK, try updating the firmware of your raid controllers.
 
nobus Commented:
If you have a spare drive, you can swap the drives out one at a time and test them all that way.
 
reedsr Commented:
What RAID controllers are you using?
 
talkingbob (Author) Commented:
Promise* PDC-20319 Serial ATA RAID is the controller. It's integrated on an Intel S875WP1-E server board.

This is the most recent event log error:
ID: 119
The driver for device \\device\harddisk1\dr1 delayed non-paging Io requests for 0 ms to recover from a low memory condition.


ID: 2019
The server was unable to allocate from the system nonpaged pool because the pool was empty.

ID: 1001
The computer has rebooted from a bugcheck....


Hope this helps.
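
As a rough illustration of how the three events above fit together, here is a minimal Python sketch. The event IDs and their meanings come from the log excerpt; the grouping into "memory pressure" and "crash" events, and the function names, are assumptions made for the example, not part of any Windows tool.

```python
# Event IDs are taken from the log excerpt in this post.
# The grouping below is an illustrative assumption, not an official taxonomy.
MEMORY_PRESSURE_IDS = {119, 2019}  # low memory condition, nonpaged pool empty
CRASH_IDS = {1001}                 # rebooted from a bugcheck


def summarize(event_ids):
    """Count memory-pressure events and crash events in a list of event IDs."""
    pressure = sum(1 for e in event_ids if e in MEMORY_PRESSURE_IDS)
    crashes = sum(1 for e in event_ids if e in CRASH_IDS)
    return {"memory_pressure": pressure, "crashes": crashes}


def suggests_pool_exhaustion(event_ids):
    """Return True when a memory-pressure event precedes a bugcheck reboot.

    event_ids is expected in chronological order (oldest first).
    """
    seen_pressure = False
    for e in event_ids:
        if e in MEMORY_PRESSURE_IDS:
            seen_pressure = True
        elif e in CRASH_IDS and seen_pressure:
            return True
    return False
```

Calling `suggests_pool_exhaustion([119, 2019, 1001])` returns `True`, which matches the pattern in the log: memory pressure first, then a bugcheck. That pattern hints at something draining the nonpaged pool rather than at the RAM sticks themselves, though it is only a hint.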
 
rindi Commented:
Check your memory.
 
talkingbob (Author) Commented:
I DID replace ALL the RAM that was in the server with new sticks, and the same problem happened. I thought this ruled out the chance that the memory was bad.
 
rindi Commented:
No, not necessarily. RAM is quite often bad. The RAM sockets are quite often bad too, so it often helps to try another slot or to reseat the RAM.
 
talkingbob (Author) Commented:
UPDATE:

I went over to the server again on the 13th and changed the RAM to a different slot. This time, even after a warm reboot, the system came up saying that a drive in the data RAID 1 array was critical. I watched this drive rebuild, then the server froze for a moment, and then the drive was gone and the array went from rebuilding to critical. I rebooted again, and it started rebuilding from 0%.

I hope this is the root of all other problems, and yes another drive is on its way.

We'll see...
 
talkingbob (Author) Commented:
Ok,

The drive was replaced, and the errors have gone away (in the RAID event log). It is still locking up, though, and I may have found the reason why.

The Promise RAID management utility (PAM) had a known memory leak issue in the version I was using. I upgraded to version 4.0 and then disabled the PAM service. It has not locked up in 60+ hours, so I'm hoping that did it.

We will see...
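
For anyone chasing a similar leak, one way to confirm it is to sample memory usage at intervals and check whether it climbs steadily. The Python sketch below fits a least-squares slope to such samples. The sample values, the function names, and the 1 MB/hour threshold are all hypothetical; how the samples are collected (for example from Performance Monitor counters) is left out.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours_elapsed, bytes_used) samples, in bytes/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def looks_like_leak(samples, threshold_bytes_per_hour=1_000_000):
    """Flag sustained growth above a (hypothetical) threshold as a likely leak."""
    return growth_per_hour(samples) > threshold_bytes_per_hour
```

With hypothetical samples such as `[(0, 10_000_000), (1, 15_000_000), (2, 20_000_000)]`, the slope works out to 5,000,000 bytes per hour and `looks_like_leak` returns `True`. A flat series returns `False`. A positive slope alone does not prove a leak, since caches also grow, so this is best read alongside the event log.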
 
talkingbob (Author) Commented:
The problem was solved, but I arrived at a solution through my own research and no expert comment helped.

I guess points need to be refunded.
 
rindi Commented:
I suggest a PAQ/Refund, not a Delete/Refund, as the user provided the answer and this could be useful in the future.
Question has a verified solution.