Asked by Faxxer

Unpredictable RAID behavior

Hello fellow geeks,

I have run the gamut and request higher minds than my own.


Allow me to lay out the scene....

I built a server back in August with two 500 GB drives in one RAID 1 array and two 2 TB drives in a second RAID 1 array.

The two 500 GB drives hold C: and the OS, Windows SBS 2011 Standard.
The two 2 TB drives hold all the stores, data, etc. Pretty straightforward.

The server motherboard has the integrated LSI SAS2008 MegaRAID chipset and came with dual ports and cables to hold up to 8 HDDs on its RAID platform.

The original installation and move into production went flawlessly, and it has been in service since roughly August.

Well, a couple of weeks ago I got a notice that the RAID had degraded, so I ran a consistency check in the MegaRAID software, and it showed that DRIVE 1 on the first RAID array had failed.

So I ordered an EXACT replacement drive (in this case, Seagate's enterprise Barracuda drive with the five-year warranty).

The drive arrived, and I went down to the customer.

I shut down the server, swapped the drive it showed as bad, rebooted into the SAS utility in the BIOS, and saw that it was rebuilding the new drive into the RAID...
AS EXPECTED.

After it finished rebuilding, I rebooted to the OS and got a buttload of MegaRAID alerts saying that the array had degraded. I ran the consistency check again, and it showed the SAME drive as failed and the array as degraded again.

I'm thinking at this point that MAYBE the drive labels are wrong (I've read this sometimes happens with the little stickers), so I decided to put the just-removed drive into the other slot, the one I had NOT just replaced.

To clarify... At THIS point, I had replaced drive 1 with a brand new replacement, and it showed as rebuilt successfully. Then I replaced drive 0 with the drive I had just removed... still with me?

Now the RAID utility would not recognize the drive at all! It showed NO DRIVE in the slot. I tried reseating cables, etc... no change. At this point I'm kinda freaking out, because this drive had JUST been in service and working fine, more or less.

So I took the drive back out and put the ORIGINAL drive 0 back in place...
This time the utility began and completed a rebuild of the array without hesitation.

Again on reboot: massive alerts and a degraded RAID.

So I took the drive that would not show up in the server RAID utility to my shop and stuck it in my little external desktop plug-in... boom. It shows up with files and data intact. CrystalDiskInfo says the drive is perfectly fine!

This is my very first time having to swap out a bad drive in a RAID 1 array, and the first rebuild behaved EXACTLY like I expected it to, minus the alerts on reboot.

I did NOT test the new drive before putting it into service. My initial thought is that I'm dealing with another failing drive (likely the one I just purchased was DOA), but I didn't want to pull it out and find out after the nightmare I just had...

Has anyone had a similar experience?

I'm trying VERY hard not to break my OS so I can keep things running well without having to rely on restores from backups.

I do have backups from the Microsoft utility, but they're full system restores only, so I'd prefer to use them as a very last resort.

Any advice is appreciated.

Ike

SOLUTION
David Johnson, CD

This solution is only available to members.

SOLUTION

This solution is only available to members.

Faxxer (ASKER)

They are SAS interfaces.  

Regarding the 1st post, I get the following info:
bytes per sector 512
bytes per physical sector <not supported>
bytes per cluster 4096
bytes per file record segment 1024
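
The two sector sizes listed above are what distinguish an Advanced Format drive from a traditional one. Below is a minimal Python sketch of that check, assuming the values come from `fsutil fsinfo ntfsinfo`-style output (the post does not name the command, so that is an assumption). The function names `parse_sector_info` and `classify` are hypothetical, and a missing physical sector size is treated as "unknown" rather than guessed.

```python
import re

def parse_sector_info(text):
    """Pull logical and physical sector sizes out of fsutil-style lines."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"\s*bytes per (physical sector|sector)\s*:?\s*(.+?)\s*$",
                     line, re.IGNORECASE)
        if not m:
            continue  # e.g. "bytes per cluster" is not a sector size
        key = "physical" if m.group(1).lower().startswith("physical") else "logical"
        value = m.group(2)
        # Non-numeric values such as "<not supported>" become None
        info[key] = int(value) if value.isdigit() else None
    return info

def classify(info):
    """Name the drive format implied by the two sector sizes."""
    logical, physical = info.get("logical"), info.get("physical")
    if logical is None or physical is None:
        return "unknown"  # the storage stack did not report a size
    if (logical, physical) == (512, 512):
        return "512n"     # traditional 512-byte sectors
    if (logical, physical) == (512, 4096):
        return "512e"     # Advanced Format, emulating 512-byte sectors
    if (logical, physical) == (4096, 4096):
        return "4Kn"      # Advanced Format, native 4K sectors
    return "unknown"

# The values quoted in the post
sample = """bytes per sector 512
bytes per physical sector <not supported>
bytes per cluster 4096
bytes per file record segment 1024"""
print(classify(parse_sector_info(sample)))
```

With the values from the post, the physical sector size is not reported, so the sketch cannot tell a 512n drive from a 512e one.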

"not supported" is not a very useful description!

Faxxer (ASKER)

http://www.newegg.com/Product/Product.aspx?Item=N82E16813182322

This is the exact motherboard, by the way.

Does your MegaRAID controller (a $2 fake RAID controller) support Advanced Format drives? Either step up to the plate and get a real, battery-backed controller, or use Windows' built-in RAID.

Faxxer (ASKER)

How does one tell the difference between a hardware and software controller?

LSI SAS2008 is a known controller card from the LSI website.

The controller sits on the PCIe bus, according to the motherboard's block diagram.

Please enlighten me on these things.

ASKER CERTIFIED SOLUTION

This solution is only available to members.

Faxxer (ASKER)

Hi Gary,

Glad to learn of this!

I do let it finish rebuilding before booting into the OS.

So if I were to decide to move the RAID arrays to the SATA ports and use the other type, would I be able to do this without having to reinstall my OS and data?
Is it simply a matter of setting up the RAID in the BIOS and then plugging the hard drives into the appropriate SATA ports?

Surely there's a "Trick" to this?

You should be able to just break the RAID through the BIOS or utilities.

Then I would set up the new RAID using the built-in Windows RAID.
But make a FULL backup before you try it!

Faxxer (ASKER)

Well I defer to the experts then.

I will go ahead and take the SAS offline and attempt a rebuild of the RAID 1 on the SATA ports via the BIOS, and thus bypass the LSI component completely for the C drive OS.

IF it works EASILY for the C drive, then in theory it should work just as easily for my DATADRIVE E, which houses all the DBs, libraries, etc.

As long as the drive letters don't change, SBS should make the connections for Exchange, right?

Now IF it blows up and I get total loss of the OS, I will go ahead and simply restore from a full backup I have.

The server has been backing up fully every day since it went into production back in August. I think it failed ONE time out of all those, and I've done full restores before when I had an SP for Exchange go bad. They work pretty well for the most part, so I'm not worried about that.

Will report back....after the deed is done...perhaps this weekend.

Thank you guys for all the valuable info.  As always, EE experts ARE THE BEST on the planet in my opinion.

Ike

Since these are mirrors, I would work with them one at a time to convert them to JBOD. Back up, yes... but you don't want to be in the position of doing a bare metal recovery. Proceed carefully, and good luck!

HTH
Gary

Faxxer (ASKER)

My only backup option is the Windows SBS utility, which unfortunately is a full bare metal backup.

But this is a very small law firm; they just didn't want to buy a $500+ backup software option (like Acronis) because they've got all their vital data backed up to 2 externals (done manually at regular intervals).

If I had to reinstall the OS (absolute worst case scenario), it would not be a big deal in terms of data risk, just time-consuming on my part.

I will post back on the matter...

I don't feel so intimidated and worried after having consulted the mighty EE crew.

Ike

Well, let me just do a cost justification: sooner or later, the backup software will need to back up to a 4K-format USB drive.

Those drives (or anything larger than 2.2 TB) won't work with the built-in backup.

It was fixed in Server 2012, but not for Server 2008/2011.
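
The 2.2 TB figure lines up with what 32-bit sector addressing can reach on 512-byte sectors (the same arithmetic behind the MBR partition limit). Whether that is the exact mechanism behind the built-in backup's limitation is an assumption here; the sketch below only shows where the number comes from. The helper name `max_addressable_bytes` is hypothetical.

```python
def max_addressable_bytes(sector_bytes, address_bits=32):
    """Largest capacity reachable with fixed-width sector addresses."""
    return sector_bytes * 2**address_bits

limit = max_addressable_bytes(512)
print(limit)            # total bytes with 512-byte sectors
print(limit / 10**12)   # same limit in decimal terabytes (TB)
print(limit / 2**40)    # same limit in binary tebibytes (TiB)

# With 4096-byte sectors the same 32-bit address space reaches 8x further
print(max_addressable_bytes(4096) / 2**40)
```

With 512-byte sectors the result is 2,199,023,255,552 bytes, which is about 2.2 TB in decimal units or exactly 2 TiB.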

Faxxer (ASKER)

OH yes!  

I learned that lesson on initial purchase!

I had to downgrade two 4 TB drives to 2 TB ones.

Fortunately I had a use for them in a NAS.

Faxxer (ASKER)

I had a long weekend with this one...

I shall attempt to reconstruct exactly all that happened, and then decide on how this should be concluded....

First of all, as instructed, I broke the RAID and moved both drives to the standard SATA ports, then enabled RAID via the Adaptec RAID utility. It took 7 hours to initialize the array. (This might be my first mistake.)

After that I attempted to boot: BSOD. Safe mode boot: BSOD. OK, maybe it's corrupted, I think...
I ran the bare metal restore option (12 hours pass): BSOD in normal or safe boot.

I did some research and learned that Windows had reassigned the drive volume letters and that this could be part of my issue.

I used Diskpart to reassign drive letters to match the prior partition letters.
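
The letter reassignment described above can be scripted rather than typed interactively. Below is a minimal Python sketch that builds a Diskpart script from a volume-to-letter mapping. The volume numbers and the helper name `build_diskpart_script` are hypothetical; the real numbers would come from `list volume` on the affected machine.

```python
def build_diskpart_script(letter_by_volume):
    """Return Diskpart commands that assign each volume its intended letter."""
    lines = []
    for volume, letter in sorted(letter_by_volume.items()):
        lines.append(f"select volume {volume}")
        lines.append(f"assign letter={letter.upper()}")
    return "\n".join(lines) + "\n"

# Hypothetical mapping: volume 1 should be C:, volume 2 should be E:
script = build_diskpart_script({1: "C", 2: "E"})
print(script)
```

The text can be saved to a file and run with `diskpart /s <file>`. Diskpart refuses a letter that is already in use, so a real run may first need `remove letter=` on whichever volume currently holds it.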

BSOD no matter what.

............
Now by this point in time, it had turned into Saturday, and the first restore had taken a full 12 hours before I got to boot to a BSOD.

This office is a CPA law firm, and TAXES were due Tuesday for HUNDREDS of their customers... Panic is a slight understatement.

...I realize I may have already done things incorrectly, but time was in my face and so were a bunch of lawyers, so I made the choice to go BACK to the SAS connections and attempt a bare metal restore from there.

...................................

Saturday at around 7 pm was when I finally got to restart my RAID initializations, and they ran until Sunday at 4 am. Time is really in my face by this point, and I can't fathom why these operations are taking hours and hours when they took mere minutes the first time I built the server.

..................................
After the RAID init finished, I attempted the BMR and got a horrifying error from the Microsoft restore utility... It said no suitably sized drives could be found, not big enough to restore to... That's right: the EXACT SAME SIZED drives were now considered too small by the restore utility.

I read around the MS site and saw that this issue goes as far back as 2009, and NO SOLUTION was ever posted, except that one dude decided he might as well try a bigger hard drive and see if it worked. In his case it did, but my C drive was 500 GB and my E drive was already the MAX size of 2 TB, the largest SBS will see anyway...

Needless to say, I had another 2 TB drive sitting in my storage and felt I had no other option but to put the 2 TB drive in as the new C drive and, of course, pray.

Since I had only one spare drive, I broke the RAID 1 on both arrays and just had C and E drives that were both 2 TB. I reassigned the drive letters using Diskpart again (just to be sure).

The BMR started and completed in literally 25 minutes, and I booted to the desktop as I had last known it Friday at 9 pm, when I ran the backup before it all started.

................................................

So... a lot of mistakes, yes? A lot to learn? Yes.

I could have left out some details, but I think I covered most of it... a lot of waiting the first two days on LONG, LONG, slow percentage counters in the drive utilities.

Some observations:

1. The SATA Adaptec RAID utility is also just as FAKE, right? (It's built into the mobo.)
2. You guys meant for me to use the WINDOWS partition manager to create the RAID array, right? I only realized this TODAY, if it's true, as I had NO IDEA that Windows 7 could do that from inside the OS!!! I never had to do it or needed to do it. Is that correct?

3. Was it a DRIVER issue that caused my BSODs on the Adaptec RAID controller? My BSOD would only say that it appeared that the hard drive configuration had changed or a drive was recently added, and not much else but the stop codes, which I was just too panicked for time to research... Is that a proper observation?

4. If I HAD just used the Windows drive management option and tried booting directly from SATA without enabling the Adaptec hardware, would Windows have correctly repaired the boot and then allowed me to add RAID from inside the OS? Was that my first mistake from above?

I ask because I still think you are right about getting OFF of SAS, but I can't spend a whole weekend shooting blindly like that again... I want to be armed with knowledge AS WELL as the past experience to do it better.

I appreciate all the advice and help you guys feel like giving.

Ike

p.s.  Time was the enemy I had to fight the most, not my ability to work through the issues as they came up.  Time pushed me into panic and fear because I knew just how much the customer needed access to their tax applications to file electronically for hundreds of their clients.  

I've never yet encountered a problem I could not find a solution to given enough time to read forums and ask questions.

Some here might think I lack the proper knowledge and/or experience to be admin of a server and network like this... you may be correct. But I never ONCE got a customer on my raw knowledge. I get AND KEEP them by being a person of integrity and honestly telling them exactly what is going on. OH yes, I tell them I don't know what is going on when I don't know, and I tell them I'll FIND the solution in a forum and ask experts in places like Experts Exchange. They don't care that I don't have MCSE certs; they care that I can be trusted and am forthright.

Because of this business model, my business THRIVES, with more work than I can do by myself. I'm just myself, and I don't cop the "IT IS GOD" attitude that I saw so many times in past jobs before I started my company.

I'll never work for someone again if I can help it, because I AM very happy being my own boss and providing DIRECT customer service to my customers, not via proxy.

To conclude: my customer knew my difficulties all weekend and knew I stayed up there all those nights and days... They happily pay me my rate and don't consider my difficulties my failure at all, even though I willingly and profusely apologized for not having a quick and simple restoration when they needed it most. Why do they not hold me responsible? Because they know my attitude and know my goal is to see them prosper. They feel I take ownership in their business just as I do in my own.

I have no idea why I typed all that out, but it's the OTHER side of the coin in this business...and I think it's overlooked all too often.

I can find 100 geeks to fix my computer, but only a few will be the person I want to do it because of their outlook, attitude, and sincere desire for helping.  The customer KNOWS the difference.

Ike

FYI, WD Red drives use 512-byte sectors, so you could use the built-in backup with them as external USB drives.

Faxxer (ASKER)

One more note...they now decided that Acronis sounds like a better option...So there's progress for the future!
Symantec System Recovery 2013 is also a pretty good product, so...

Faxxer (ASKER)

Info from each of you helped to create a larger picture for me to work from. While the specific issue was not resolved by the suggestions in this thread, they all contributed to future actions, and the info learned here is very useful for this and future situations with this particular server.