Problematic server... Intel Raid to another controller?

This is a production SQL server. Windows 2003 R2 x64. Supermicro motherboard, 24 GB RAM, quad Xeon.
It has on-board Intel RAID, configured as RAID 1 (for the OS) and RAID 5 (for data).
Configuration is as follows:
Drive 0--RAID 1
Drive 1--RAID 1
Drive 2--RAID 5
Drive 3--RAID 5
Drive 4--RAID 5
Drive 5--RAID 5
All drives the same model Seagate SATA.

The Intel RAID Manager has been problematic, as I have also found it to be on regular workstations. Randomly, it pops up with a "raid degraded" message. This can be quickly cleared by right-clicking on the drive (in the Intel Manager software) and selecting "Normal". Poof--the problem is fixed, and Intel proceeds to rebuild the drive. The same happens when Intel RAID Manager claims that a hard drive has failed: one click and it's back online (buggy software?). This machine has also been rebooting spontaneously. I checked the logs, but the reboots are too quick for even the Event Log to catch, so there is no trace.

But this is not the main issue. I am trying to move this machine to a different RAID controller, Adaptec to be specific. Data on the server is crucial. In past RAID cases, I was always able to make an Acronis image (.tib) and move it to another pre-configured RAID. With this machine, however, even Acronis fails (a "cannot load linux kernel" message comes up; the CD is not scratched and works on all other RAID machines).

Are there any other ways that I could move that machine from Intel to Adaptec controller, without loss of data? The Intel Raid Manager is a mess, and it's not reliable enough to be used further. Any suggestions?

Thank you very much!

David

Specifically, what is the make/model of the disks, and what is the model of the Adaptec controller?

wolfcamel

I use StorageCraft IT Edition for this - you can get a 2-week license, which is a bit cheaper than a full license.

wolfcamel

The Adaptec controller probably won't boot until you disable the onboard Intel RAID.

ASKER CERTIFIED SOLUTION

rindi

This solution is only available to members.

94704

ASKER

@ Wolfcamel: I am aware of that. I'll try StorageCraft and post results soon.
@ dlethe: ST3250310NS Seagate 250 GB SATA
New RAID card would be ADAPTEC-SUPERMICRO AOC-LPZCR2 rev 3.00 (the card has been suggested by the manufacturer as 100% tested and approved for the motherboard)

94704

ASKER

@ rindi: I'll give it a try, sounds like miracle software.

David

You probably have a much bigger problem. The symptoms are a classic indication of a TLER issue with the Intel controller, assuming you are using consumer-class disks and not enterprise/server drives. The crux of the issue is that consumer-class disks go into a deep recovery cycle when they encounter a bad block. This can take 10-30 seconds, depending on the specific model. (The drive basically freezes and dedicates 100% of its effort to recovering the block.)

Unfortunately, the Intel controller only allows around 7 seconds before it figures the drive died, and so it kills it from the RAID set. This is WHY you are having such a problem. Bad block -> deep recovery -> disk "locks up" for too long -> controller thinks it died -> degraded RAID -> you manually reset & rebuild -> repeat.
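To make the chain above concrete, here is a minimal sketch of the timing logic. It is only a toy model, not how the Intel firmware actually works: the function names are made up for illustration, and the 7-second limit is just the approximate figure quoted above.

```python
# Toy model of the failure chain described above. The names are hypothetical
# and the 7-second limit is the approximate figure from this thread, not a
# value taken from any controller datasheet.
CONTROLLER_TIMEOUT_S = 7.0


def controller_drops_drive(recovery_time_s, timeout_s=CONTROLLER_TIMEOUT_S):
    """Return True if the drive is unresponsive longer than the controller waits."""
    return recovery_time_s > timeout_s


def raid5_state(recovery_times_s, timeout_s=CONTROLLER_TIMEOUT_S):
    """Classify a RAID 5 set given each member's time spent recovering a block."""
    dropped = sum(controller_drops_drive(t, timeout_s) for t in recovery_times_s)
    if dropped == 0:
        return "normal"
    # RAID 5 survives the loss of exactly one member.
    return "degraded" if dropped == 1 else "failed"


# One of four members spends 15 s in deep recovery on a bad block.
print(raid5_state([0.5, 0.5, 15.0, 0.5]))
```

A drive with a short, capped error recovery time (the TLER behaviour) would return from the bad block before the timeout and stay in the set.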

Now MANY, but not all, of the Adaptec RAID controllers also require enterprise drives. I would hate for you to go through all of this trouble and continue to see the issue.

You can read more about it here (among other related issues regarding disk/data reliability in general).
https://www.experts-exchange.com/Storage/Misc/A_2757-Disk-drive-reliability-overview.html

So my suggestion is to first step back and consider the root cause before making changes.

David

Never mind, you have enterprise-class drives. But this still should not happen. Make sure you don't have acoustic (quiet) mode turned on, as this affects timing and performance. Also, if these are OEM drives, not retail-labeled, they could have been programmed with different firmware to make them behave better in another config.

Run full diagnostics, including media verify. What you are experiencing is still an indication of an inherent drive issue. Are these retail disks with standard firmware?

94704

ASKER

@ dlethe: the disks are enterprise level, just as you noticed. We purchased them directly from Seagate, specified for server use. Acoustic mode is not turned on, and no "quiet" or "power-save" setting is enabled anywhere. Also, the drives have the original manufacturer firmware.

David

Good, acoustic mode should be off. The other tunable parameter on enterprise drives that is sometimes enabled and messes things up is the power-saving "green" stuff. Make sure they aren't going all tree-hugger on you and spinning down ;)

If you have NOT been running regular (weekly at least) data consistency checks, then please look at the event log and start doing so. The check runs within the controller firmware: it reads all blocks on all disks, looks for parity XOR errors as well as unreadable blocks, and fixes them. Drives go offline for a reason: if a drive hits a few consecutive bad blocks, it will time out. A rebuild will repair the blocks and move on.
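For anyone wondering what "parity XOR errors" means in practice, here is a rough sketch of the idea for a single RAID 5 stripe. It is an illustration only, not the controller's firmware: the helper names are made up, and it works on small in-memory byte strings rather than real disk blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when the XOR of its data blocks equals the parity."""
    return xor_blocks(data_blocks) == parity_block


def rebuild_block(surviving_blocks):
    """Recover one missing block by XORing all surviving blocks (data and parity)."""
    return xor_blocks(surviving_blocks)


# Hypothetical stripe on a 4-disk RAID 5: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

print(stripe_is_consistent(data, parity))
# Pretend the second disk returned an unreadable block and rebuild it.
print(rebuild_block([data[0], data[2], parity]) == data[1])
```

The same XOR that verifies a stripe also regenerates a missing block, which is why a rebuild can repair unreadable blocks as long as only one member of the stripe is bad.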

Personally, I would take a downtime window and run extensive diags on the disks through a NON-RAID controller. It is foolish to throw drives at a new controller when you may very well have bad drives. Windows ScanDisk and other tests won't run true diags, especially with that RAID controller in the way.

94704

ASKER

Event logs don't show anything out of the ordinary. Sometimes a failed service at most.

Drives are not treehugger-friendly, so I don't suspect that they would spin down. Besides, the server is used constantly nearly 24/7, so I'd doubt that it would have the time to sleep the drives even if it did have treehugger option enabled.

The goal, at this moment, is to get it off the Intel raid. We bought enterprise-level WD5001AALS drives, just to make sure we don't trash the production drives. At this moment, even an Acronis backup messes up the Intel Raid and I have to rebuild after doing a simple backup.

94704

ASKER

Paragon software worked where Acronis failed miserably. It also had an option to inject new RAID drivers into the OS, which allowed me to transfer from the Intel to the Adaptec RAID.