Solved

Highpoint RAID 404 Crashes During Verify

Posted on 2006-06-16
Last Modified: 2012-08-13
Hey guys (and some gals),

We've got a serious problem. Our file server has been crashing while verifying its array. It doesn't seem to be inconsistent, or lose any data. It just crashes almost every time - three times today!

The server used to kick disks out randomly during verification, about once every two or three weeks, but that all but stopped when I upgraded from 2-disk RAID 1 to 4-disk RAID 1/0. After the upgrade, the RAID controller didn't kick a disk, or crash while verifying, for about six months.

We moved to a new office, and the file server got physically relocated three or four times while we were settling in and getting our floor plan hashed out. I think maybe the bumps and dings started this round of crash problems, but all the cables and cards are seated properly and firmly. We use all Highpoint ATA cables and an X-Connect power supply.

Since the move, the computer crashes about 85% of the time when verifying the array, usually 30 minutes to an hour into the verification process. It hasn't really been kicking disks out, except for one time three weeks ago when it kicked out !three of the four! disks. Now that was a fiasco. Luckily we have a robust backup policy and things were recovered (relatively) smoothly.

It used to give me video card errors, so I replaced the video card with something older, less adventurous, and presumably more stable. Now the errors don't refer to a video card driver anymore; they just say "The system has recovered from a serious error".

I don't believe it's a processor, motherboard, or memory problem because we have never had a crash that I am aware of that wasn't during the verification process. It's not the disk drives - even the ones that get kicked out are always fine. No clicks, no excess heat, no SMART errors. All four disks have been cycled through the system over the last month because of this problem, trying to see if a specific disk was responsible.

Our system is:
P4 2.8
1 gig Crucial
Intel D865PERLK
Highpoint RocketRAID 404 Controller
4x 200 Seagate ATA
ATI Rage XL
WinXP Pro SP2 fully updated
2x 200mm Antec Quiet Fans (given to show that we do have adequate cooling)
Zalman Copper 92mm CPU fan (" ")

Runs:
Spybot
Ad-Aware
AVG Antivirus
AllSync Scheduler
PowerClock Server
BOINC - Seti@Home
HPT Service Manager
Therapist Helper Server
WinAmp (waiting room music)
Highpoint RAID Management Console

I think the problem might be related to the fact that the HPT cards offload the XOR routine to the processor. BOINC runs Seti@Home while the system is otherwise inactive, so I wonder whether the offloaded XOR work might collide with the SETI processes, but this seems a little out there.

Can anyone suggest a method to properly troubleshoot what actually causes the crash, how to stop the crashing, or, as a last resort, a known good and stable ATA RAID controller with eight channels and a reasonable price?

I can't seem to find ANY reviews of RAID controllers that have a review period of longer (DAMN COMPUTER! Just crashed again right now while verifying) than a week, and we all know that a week or two is nowhere NEAR long enough to assess the capabilities of a RAID card for long-term reliability. It's like assessing a new car model for reliability by glancing at the interior in a magazine spread.

So I guess this is a multi-pronged request - troubleshoot, fix, or suggest a replacement that is compatible and known good.

Thanks.
Question by:slbriggsphd
11 Comments
 

Author Comment

by:slbriggsphd
ID: 16923879
Just checked all the components - no excess heat on the CPU, northbridge, memory, video card, disk drives, or RAID controller. Nothing is more than slightly warm to the touch.
 
LVL 44

Accepted Solution

by:
scrathcyboy earned 230 total points
ID: 16925647
Clearly, the controller or the motherboard is going.  You should not get crashes trying to verify the array.

1.  First, look for a BIOS update for the RAID controller, from the manufacturer's website of course.

2.  If installing the BIOS does not fix the problem, go over the RAID settings once again, as I am sure you did.

3.  If you are determined to keep this controller, move it and the drives to a different motherboard.  That might solve the problem right there; the IRQ line on the motherboard might be unstable.

4.  If that does not work, then suspect the controller card.  Of course, you know you will have to back up all the data.  The best are Promise RAID controllers, or the Highpoint 370, both very reliable.

5.  Problem is, everyone is now moving to SATA RAID controllers, so if you are looking for a future solution, you are stuck with SATA, which means a whole new drive array, and this is expensive.
 
LVL 87

Expert Comment

by:rindi
ID: 16926898
I haven't had very good experiences with Highpoint RAID controllers. I find the Promise RAID cards are much more reliable, and the 3ware ones are even better, but Highpoint seems to justify its cheapness with low quality. If a firmware update as suggested by scrathy doesn't help, get more reliable RAID cards.
 

Author Comment

by:slbriggsphd
ID: 16964595
Okay, I updated the RAID drivers in Windows; could not update the controller BIOS (long story); and updated the MB software package and BIOS.

Something interesting happened after I updated the RAID drivers. The verification failed, but instead of immediately rebooting like usual, the computer stayed on-line and gave me an error message that the second channel had failed. This is the channel I had been watching and suspected was a problem. This in hand, I moved the disk on that channel to the first channel as the slave. It has been stable since, and will verify.

However, I can't run both disks of the mirror off the 1st channel, because it halves the performance. I think the card is out of warranty, Highpoint won't do warranty work on cards bought from resellers, and NewEgg doesn't sell the card anymore.

What kind of problems am I going to have moving the array onto a new controller? I'll basically have to make disk images and start from there, won't I? Is there any chance a new card, even a Highpoint card, will recognize my existing array? I don't think so... but... well, comments anyone?
 
LVL 44

Expert Comment

by:scrathcyboy
ID: 16966173
The only way another RAID controller will recognize the existing array is if it has the SAME chipset on the controller card, and the same version, in which case you just plug in the array and hope it works.  Usually this only works for mirror RAID 1 anyway.  I think it is safe to assume that if you want to move beyond the existing controller and its problems, you will have to wipe the array and start again.

But this is easier than you think.  Just install a good old IDE drive and copy all the data in the array to it.  Remove the RAID from the system, boot from the Windows XP CD, and make the IDE drive bootable by running fixboot C: from the Recovery Console.

Once you know the system can boot from this IDE drive, it does not matter what happens to the RAID. Get a new controller, reinitialize it, and copy all the data back -- but at least use RAID 1 or RAID 10 so that you have a mirror in the future. RAID 0 and RAID 5 are very prone to failures on removal of a drive.

 

Author Comment

by:slbriggsphd
ID: 16971350
Yeah, we're using RAID 10 currently. It's been very robust until this current problem.

A worse problem developed last night after I left - the stripe of the mirrors broke. Until now it's been one of the mirrors that breaks, which can be easily rebuilt with a spare drive. But the stripe broke somehow, reducing the problem to the same as a broken RAID 0 - I don't know of a way to recover this! From what I understand, RAID 0 breaks are unrecoverable in most situations.

I put in two spares, booted, and the controller didn't recognize the spares as useful - at least, there was no option to rebuild. I wouldn't really expect one, having it reduced to a RAID 0 situation anyway. It looks, at least intellectually, that a more robust system would be a mirror of two stripes, as opposed to a stripe of two mirrors...
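
A quick way to test that intuition is to enumerate every two-disk failure for both layouts. The sketch below is only a model under stated assumptions: the disk numbering is hypothetical, and it assumes a mirror-of-stripes controller treats a whole stripe as dead once any member fails, which is typical behavior but not something confirmed for the RocketRAID 404.

```python
from itertools import combinations

# Hypothetical numbering for a four-disk array: disks 0..3.
DISKS = (0, 1, 2, 3)
# Assumption: disks (0, 1) form one group and (2, 3) the other in both layouts.
GROUPS = [(0, 1), (2, 3)]


def raid10_survives(failed):
    """Stripe of mirrors: every mirror needs at least one live disk."""
    return all(any(d not in failed for d in mirror) for mirror in GROUPS)


def raid01_survives(failed):
    """Mirror of stripes: at least one stripe needs all of its disks alive."""
    return any(all(d not in failed for d in stripe) for stripe in GROUPS)


def count_survivable(survives, n_failed=2):
    """Count how many n_failed-disk failure combinations the layout survives."""
    return sum(1 for combo in combinations(DISKS, n_failed) if survives(set(combo)))


if __name__ == "__main__":
    total = len(list(combinations(DISKS, 2)))
    print("two-disk failure combinations:", total)
    print("stripe of mirrors survives:", count_survivable(raid10_survives))
    print("mirror of stripes survives:", count_survivable(raid01_survives))
```

Under those assumptions the stripe of mirrors tolerates more two-disk failure combinations than the mirror of stripes, so on paper the existing layout is probably the more robust of the two. It says nothing about a controller that loses track of the stripe itself, which is what seems to have happened here.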

Well, good thing we back up every night. Too bad it's just the database, and not the OS and entire system, though.

Yeesh. Looks like I got some work to do. I'll get back here with my resolution for posterity and points.
 
LVL 87

Assisted Solution

by:rindi
rindi earned 230 total points
ID: 16971414
There is software you can use to recover a broken RAID 0, but a restore from a backup is usually the real way to go. I strongly recommend you change the RAID controller now.

RAID Reconstructor:

http://runtime.org
 

Author Comment

by:slbriggsphd
ID: 16972274
rindi,

The array is a 1/0, which is a stripe of mirrors. When I say we're "reduced to a RAID 0 situation" I just mean that the stripe between them broke. So we have two mirrors that are no longer striped, each mirroring only half the data.

Does anyone know of a tool to rebuild a RAID 10 array? The Raid Rebuilder from GetDataBack is only for RAID 0 and RAID 5. Highpoint was supposed to email me a tool, but I haven't seen it yet, and it was supposed to be here a few hours ago.
 
LVL 87

Expert Comment

by:rindi
ID: 16972630
Check the software out. It can rebuild a broken RAID 0 and therefore also a broken RAID 10.
0
 

Author Comment

by:slbriggsphd
ID: 16972886
I pulled a drive from each mirror and am running the Raid Rebuilder on them now to an image on an external HD. Let's see how this goes.
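
For anyone curious what this kind of recovery amounts to, here is a minimal sketch of de-striping two member images into one. It is not how the recovery tool works internally. The function name `destripe`, the 64 KiB chunk size, the member order, and the absence of any controller metadata at the start of each disk are all assumptions for illustration; a real tool has to detect those parameters, and a wrong guess produces garbage.

```python
import io


def destripe(member_a, member_b, out, chunk_size=64 * 1024):
    """Interleave fixed-size chunks from two RAID 0 members into `out`.

    member_a, member_b: binary file-like objects, one image per mirror.
    out: binary file-like object that receives the reassembled data.
    chunk_size: assumed stripe chunk size in bytes (hypothetical default).
    """
    while True:
        chunk_a = member_a.read(chunk_size)
        chunk_b = member_b.read(chunk_size)
        if not chunk_a and not chunk_b:
            break  # both images are exhausted
        out.write(chunk_a)  # assumed order: member A holds the even chunks
        out.write(chunk_b)  # and member B holds the odd chunks


if __name__ == "__main__":
    # Tiny in-memory demonstration with a 4-byte chunk size.
    a = io.BytesIO(b"AAAACCCC")
    b = io.BytesIO(b"BBBBDDDD")
    result = io.BytesIO()
    destripe(a, b, result, chunk_size=4)
    print(result.getvalue())
```

With real disks the inputs would be image files opened in binary mode, never the original drives, so that a wrong chunk size or member order costs nothing but time.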