
Highpoint RAID 404 Crashes During Verify

slbriggsphd asked
Last Modified: 2012-08-13
Hey guys (and some gals),

We've got a serious problem. Our file server has been crashing while verifying its array. The array doesn't seem to end up inconsistent or lose any data. The server just crashes almost every time - three times today!

The server used to kick disks out randomly during verification, about once every two or three weeks, but that all but stopped when I upgraded from 2-disk RAID 1 to 4-disk RAID 1/0. After the upgrade, the RAID controller didn't kick a disk, or crash while verifying, for about six months.

We moved to a new office, and the file server got physically relocated three or four times while we were settling in and getting our floor plan hashed out. I think maybe the bumps and dings started this round of crash problems, but all the cables and cards are seated properly and firmly. We use all Highpoint ATA cables and an X-Connect power supply.

Since the move, the computer crashes about 85% of the time when verifying the array, usually about 30m-1hr into the verification process. It hasn't really been kicking the disks out except for one time three weeks ago when it kicked out !three of the four! disks. Now that was a fiasco. Luckily we have a robust backup policy and things were recovered (relatively) smoothly.

It used to give me video card errors, so I replaced the video card with something older, less adventurous, and presumably more stable. Now the errors don't refer to a video card driver anymore; they just say "The system has recovered from a serious error".

I don't believe it's a processor, motherboard, or memory problem because we have never had a crash that I am aware of that wasn't during the verification process. It's not the disk drives - even the ones that get kicked out are always fine. No clicks, no excess heat, no SMART errors. All four disks have been cycled through the system over the last month because of this problem, trying to see if a specific disk was responsible.

Our system is:
P4 2.8
1 gig Crucial
Intel D865PERLK
Highpoint RocketRAID 404 Controller
4x 200 Seagate ATA
ATI Rage XL
WinXP Pro SP2 fully updated
2x 200mm Antec Quiet Fans (given to show that we do have adequate cooling)
Zalman Copper 92mm CPU fan (" ")

Runs:
Spybot
Ad-Aware
AVG Antivirus
AllSync Scheduler
PowerClock Server
BOINC - Seti@Home
HPT Service Manager
Therapist Helper Server
WinAmp (waiting room music)
Highpoint RAID Management Console

I think the problem might be related to the fact that the HPT cards offload the XOR routine to the processor. BOINC runs Seti@Home while the system is otherwise inactive, so I wonder whether the offloaded XOR work might collide with the SETI processes, but this seems a little out there.

Can anyone suggest a method to properly troubleshoot what actually causes the crash, how to stop the crashing, or, as a last resort, a known good and stable ATA RAID controller with eight channels and a reasonable price?

I can't seem to find ANY reviews of RAID controllers that have a review period of longer (DAMN COMPUTER! Just crashed again right now while verifying) than a week, and we all know that a week or two is nowhere NEAR long enough to assess the capabilities of a RAID card for long-term reliability. It's like assessing a new car model for reliability by glancing at the interior in a magazine spread.

So I guess this is a multi-pronged request - troubleshoot, fix, or suggest a replacement that is compatible and known good.

Thanks.

Author

Commented:
Just checked all the components - no excess heat on the CPU, northbridge, memory, video card, disk drives, or RAID controller. Nothing is more than slightly warm to the touch.
CERTIFIED EXPERT
Most Valuable Expert 2015

Commented:
I haven't had very good experiences with Highpoint RAID controllers. I find the Promise RAID cards are much more reliable, and the 3ware ones are even better, but Highpoint seems to justify its cheapness with low quality. If a firmware update as suggested by scrathy doesn't help, get more reliable RAID cards.

Author

Commented:
Okay, I updated the RAID drivers in Windows, could not update the controller BIOS (long story), and updated the motherboard software package and BIOS.

Something interesting happened after I updated the RAID drivers. The verification failed, but instead of immediately rebooting like usual, the computer stayed on-line and gave me an error message that the second channel had failed. This is the channel I had been watching and suspected was a problem. This in hand, I moved the disk on that channel to the first channel as the slave. It has been stable since, and will verify.

However, I can't run both disks of the mirror off the first channel, because it halves the performance. I think the card is out of warranty, Highpoint won't do warranty work on cards bought from resellers, and NewEgg doesn't sell the card anymore.

What kind of problems am I going to have moving the array onto a new controller? I'll basically have to make disk images and start from there, won't I? Is there any chance a new card, even a Highpoint card, will recognize my existing array? I don't think so... but... well, comments anyone?
The only way another RAID controller will recognize the existing array is if it is the SAME chipset on the controller card, and the same version, in which case you just plug the array and hope it works.  Usually this only works for mirror RAID 1 anyway.  I think it is safe to assume that if you want to move beyond the existing controller and its problems, you will have to wipe the array and start again.

But this is easier than you think. Just install a good old IDE drive and copy all the data in the array to it. Then remove the RAID from the system, boot from the Windows XP CD into the Recovery Console, and make the IDE drive bootable by running fixboot C: there.

Once you know the system can boot from this IDE drive, it does not matter what happens to the RAID. Get a new controller, reinitialize the array, and copy all the data back. At least use RAID 1 or RAID 10 so that you have a mirror in the future: RAID 0 has no redundancy and is lost if any single drive fails, and RAID 5 survives only one failed drive.
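Before wiping the array, it may be worth confirming that the copy on the IDE drive really matches what is on the RAID. The following is only a rough sketch of one way to do that in Python, by hashing every file on both sides. The function names (file_digest, compare_trees) and the drive letters in the usage note are made up for illustration and are not part of any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(source_root, copy_root):
    """Return relative paths that are missing from, or differ in, the copy.

    Hypothetical helper: it only checks files that exist under source_root,
    so extra files on the copy are ignored.
    """
    source_root, copy_root = Path(source_root), Path(copy_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = copy_root / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

Something like compare_trees("D:/", "E:/") would then list the files to look at again, assuming D: is the array and E: is the IDE copy. Files that are locked by the running system may fail to open, so this is best run against data folders rather than a live Windows directory.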

Author

Commented:
Yeah, we're using RAID 10 currently. It's been very robust until this current problem.

A worse problem developed last night after I left - the stripe of the mirrors broke. Until now it's been one of the mirrors that breaks, which can be easily rebuilt with a spare drive. But the stripe broke somehow, reducing the problem to the same as a broken RAID 0 - I don't know of a way to recover this! From what I understand, RAID 0 breaks are unrecoverable in most situations.

I put in two spares, booted, and the controller didn't recognize the spares as useful - at least, there was no option to rebuild. I wouldn't really expect one, having it reduced to a RAID 0 situation anyway. It looks, at least intellectually, like a more robust system would be a mirror of two stripes, as opposed to a stripe of two mirrors...
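One way to sanity-check that intuition is to enumerate which disk losses each four-disk layout can survive. The Python sketch below does that for a stripe of two mirrors and a mirror of two stripes. The disk labels and function names are hypothetical, and it only counts outright disk losses. It does not model a controller losing the stripe configuration while all disks are healthy, which seems closer to what happened here.

```python
from itertools import combinations

# Hypothetical labels: A1/A2 hold the first half of the data, B1/B2 the second.
DISKS = ("A1", "A2", "B1", "B2")


def raid10_survives(failed):
    """Stripe of mirrors: mirrors (A1, A2) and (B1, B2), striped together.

    The array survives as long as each mirror keeps at least one disk.
    """
    mirrors = [{"A1", "A2"}, {"B1", "B2"}]
    return all(mirror - failed for mirror in mirrors)


def raid01_survives(failed):
    """Mirror of stripes: stripes (A1, B1) and (A2, B2), mirrored.

    The array survives as long as at least one stripe is fully intact.
    """
    stripes = [{"A1", "B1"}, {"A2", "B2"}]
    return any(not (stripe & failed) for stripe in stripes)


def count_survivable(survives, n_failed):
    """Return (survivable cases, total cases) for n_failed simultaneous losses."""
    cases = [set(c) for c in combinations(DISKS, n_failed)]
    return sum(survives(c) for c in cases), len(cases)


if __name__ == "__main__":
    for name, fn in (("stripe of mirrors", raid10_survives),
                     ("mirror of stripes", raid01_survives)):
        ok, total = count_survivable(fn, 2)
        print(f"{name}: survives {ok} of {total} two-disk failures")
```

Under these assumptions both layouts survive any single disk loss, but the stripe of mirrors survives 4 of the 6 possible two-disk losses while the mirror of stripes survives only 2 of 6, so the stripe of mirrors comes out ahead on disk failures alone.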

Well, good thing we back up every night. Too bad it's just the database, and not the OS and entire system, though.

Yeesh. Looks like I got some work to do. I'll get back here with my resolution for posterity and points.

Author

Commented:
rindi,

The array is a 1/0, which is a stripe of mirrors. When I say we're "reduced to a RAID 0 situation" I just mean that  the stripe between them broke. So we have two mirrors that are no longer striped, each mirroring only half the data.

Does anyone know of a tool to rebuild a RAID 10 array? The Raid Rebuilder from GetDataBack is only for RAID 0 and RAID 5. Highpoint was supposed to email me a tool, but I haven't seen it yet, and it was supposed to be here a few hours ago.
CERTIFIED EXPERT
Most Valuable Expert 2015

Commented:
Check the software out. It can rebuild a broken RAID 0 and therefore also a broken RAID 10.

Author

Commented:
I pulled a drive from each mirror and am running the Raid Rebuilder on them now to an image on an external HD. Let's see how this goes.
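For anyone curious what this kind of rebuild amounts to, here is a minimal Python sketch of the general idea: take one image from each mirror and interleave stripe-sized chunks into a single output image. This is not how Raid Rebuilder itself works, and the stripe size, member order, and data offset below are placeholder assumptions, since the on-disk layout used by the RocketRAID 404 isn't documented in this thread. It should only ever be pointed at copies of the drives, never the originals.

```python
def reassemble_raid0(member_paths, out_path, stripe_size=64 * 1024, data_offset=0):
    """Interleave stripe-sized chunks from each member image into out_path.

    member_paths must be in stripe order (first half of each stripe row first).
    stripe_size and data_offset are guesses that have to match the real array.
    """
    members = [open(p, "rb") for p in member_paths]
    try:
        for m in members:
            m.seek(data_offset)  # skip any controller metadata at the start
        with open(out_path, "wb") as out:
            while True:
                chunks = [m.read(stripe_size) for m in members]
                if not any(chunks):
                    break  # every member image is exhausted
                for chunk in chunks:
                    out.write(chunk)
    finally:
        for m in members:
            m.close()
```

A call such as reassemble_raid0(["mirror_a.img", "mirror_b.img"], "array.img") would produce a candidate image, with both file names being hypothetical. If the resulting image has no readable partition table or file system, the usual suspects are the wrong stripe size, the members in the wrong order, or a non-zero data offset.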
