Hey guys (and some gals),
We've got a serious problem. Our file server has been crashing while verifying its array. The array doesn't seem to become inconsistent or lose any data. The server just crashes almost every time - three times today!
The server used to kick disks out randomly during verification, about once every two or three weeks, but that all but stopped when I upgraded from 2-disk RAID 1 to 4-disk RAID 1/0. After the upgrade, the RAID controller didn't kick a disk, or crash while verifying, for about six months.
We moved to a new office, and the file server got physically relocated three or four times while we were settling in and getting our floor plan hashed out. I think maybe the bumps and dings started this round of crash problems, but all the cables and cards are seated properly and firmly. We use all Highpoint ATA cables and an X-Connect power supply.
Since the move, the computer crashes about 85% of the time when verifying the array, usually 30 minutes to an hour into the verification process. It hasn't really been kicking disks out, except for one time three weeks ago when it kicked out THREE of the four disks! Now that was a fiasco. Luckily we have a robust backup policy and things were recovered (relatively) smoothly.
It used to give me video card errors, so I replaced the video card with something older, less adventurous, and presumably more stable. Now the errors don't refer to a video card driver anymore; they just say "The system has recovered from a serious error".
I don't believe it's a processor, motherboard, or memory problem because we have never had a crash that I am aware of that wasn't during the verification process. It's not the disk drives - even the ones that get kicked out are always fine. No clicks, no excess heat, no SMART errors. All four disks have been cycled through the system over the last month because of this problem, trying to see if a specific disk was responsible.
Our system is:
1 GB Crucial RAM
Highpoint RocketRAID 404 Controller
4x 200 GB Seagate ATA drives
ATI Rage XL
WinXP Pro SP2 fully updated
2x 200mm Antec Quiet Fans (given to show that we do have adequate cooling)
Zalman Copper 92mm CPU fan (same reason)
Software running on it:
BOINC - Seti@Home
HPT Service Manager
Therapist Helper Server
WinAmp (waiting room music)
Highpoint RAID Management Console
I think the problem might be related to the fact that the HPT cards offload the XOR routine to the host processor. BOINC runs Seti@Home while the system is otherwise inactive, so I wonder if the offloaded XOR work might be colliding with the SETI processes, but this seems a little out there.
Can anyone suggest a method to properly troubleshoot what actually causes the crash, how to stop the crashing, or, as a last resort, a known good and stable ATA RAID controller with eight channels and a reasonable price?
I can't seem to find ANY reviews of RAID controllers with a review period longer than a week (DAMN COMPUTER! It just crashed again right now while verifying), and we all know that a week or two is nowhere NEAR long enough to assess a RAID card's long-term reliability. It's like assessing a new car model for reliability by glancing at the interior in a magazine spread.
So I guess this is a multi-pronged request - troubleshoot, fix, or suggest a replacement that is compatible and known good.