Solved

Unpredictable RAID behavior

Posted on 2014-04-08
Last Modified: 2016-10-27
Hello fellow geeks,

I have run my gambit and request help from higher minds than my own.


Allow me to lay out the scene....

I built a server back in August in which there are two 500G drives in RAID 1, and 2 2TB drives in a RAID 1.

The 2 500G drives are for C: and the OS running Win SBS 2011 Standard
The 2 2TB drives are for all the stores, data, etc...pretty straightforward

The server motherboard has the integrated LSI SAS 2008 MegaRAID chipset and came with dual ports and cables to hold up to 8 HDDs on its RAID platform.

Original installation and insertion into production went flawlessly and it's been in service since August roughly.

Well, a couple of weeks ago I got a notice that the RAID had degraded, so I ran a consistency check in the MegaRAID software and it showed that DRIVE 1 on the first RAID array had failed.

So I ordered an EXACT replacement drive (in this case it's Seagate's enterprise Barracuda drive with the five year warranty).

The drive arrived, and I went down to the customer.

I shut down the server, swapped the drive it showed was bad, rebooted into the SAS utility in the BIOS and saw that it was rebuilding the new drive into the RAID....
AS EXPECTED.

After it finished rebuilding, I rebooted to the OS and got a buttload of MegaRAID alerts saying that it had degraded, so I ran the consistency check again and it showed the SAME drive failed and degraded again.

I'm thinking at this point that MAYBE the drive labels are wrong (I've read this happens sometimes on the little stickers), so I decide to put the newly removed drive in place of the NON newly replaced slot.

To clarify...  At THIS point, I've replaced drive 1 with a brand new replacement and it showed as rebuilt successfully.  But then I replaced drive 0 with the recently removed drive....still with me?

At this point, the RAID utility will not even recognize the drive at all!  It shows NO DRIVE in the slot.  I try reseating cables, etc...no change.   At this point I'm kinda freaking out because this drive had JUST been in service just fine more or less.

So I take the drive back out, and put the ORIGINAL drive 0 back in place...
this time the utility begins and completes a rebuild of the array without hesitation.

Again on reboot, massive alerts and degraded RAID.

So I take the drive that would not show up in the server RAID utility to my shop and stick it in my little external desktop plug in...boom.  It shows up with files and data intact.  CrystalDiskInfo says the drive is perfectly fine!

This is my very first time having to swap out a bad drive in a RAID 1 array, and the first time it behaved EXACTLY like I expected it to, minus the alerts on reboot.

I did NOT test the new drive before putting it into service. My initial thought is that I'm dealing with another failing drive (likely the one I just purchased was DOA), but I didn't want to pull it out and find out after the nightmare I just had...

Has anyone had a similar experience?

I'm trying VERY hard to not break my OS so I can keep things running well without having to rely on restores from backups.

I do have backups from the Microsoft utility, but they're full system restores only, so I prefer to do that as a very last resort.

Any advice is appreciated.

Ike
0
Question by:Faxxer
20 Comments
 
LVL 78

Assisted Solution

by:David Johnson, CD, MVP
David Johnson, CD, MVP earned 166 total points
I've seen this with drives that have different sector sizes: the newer drives have 4K sectors, where the older ones had 512-byte sectors.

fsutil fsinfo ntfsinfo <driveletter:> will show you the information, for example:
Newer Drive
C:\Windows\system32>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number :       0x8216f51a16f51041
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Older Drive
C:\Windows\system32>fsutil fsinfo ntfsinfo I:
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
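
For anyone comparing drives this way, here is a minimal Python sketch that reads saved fsutil output and labels the drive as 512n, 512e, or 4Kn. The function names (parse_sector_sizes, classify) and the sample strings are made up for illustration; the sample values are copied from the two listings above. It treats a non-numeric value such as "<Not Supported>" as unknown rather than guessing.

```python
import re

# Sample text copied from the two fsutil listings above (illustrative only).
NEWER_DRIVE = """\
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
"""

OLDER_DRIVE = """\
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
"""


def parse_sector_sizes(fsutil_output):
    """Return (logical, physical) bytes per sector; None where a value is missing."""
    sizes = {}
    for line in fsutil_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name : value" line
        key = " ".join(key.split()).lower()  # normalize spacing and case
        match = re.fullmatch(r"\s*(\d+)\s*", value)
        sizes[key] = int(match.group(1)) if match else None
    return sizes.get("bytes per sector"), sizes.get("bytes per physical sector")


def classify(logical, physical):
    """Label the sector layout using the common 512n / 512e / 4Kn names."""
    if logical is None:
        return "unknown"
    if physical is None:
        return "physical size not reported"
    if (logical, physical) == (512, 512):
        return "512n (native 512-byte sectors)"
    if (logical, physical) == (512, 4096):
        return "512e (4K physical sectors with 512-byte emulation)"
    if (logical, physical) == (4096, 4096):
        return "4Kn (native 4K sectors)"
    return "unrecognized combination"


if __name__ == "__main__":
    for name, text in (("newer", NEWER_DRIVE), ("older", OLDER_DRIVE)):
        print(name, "->", classify(*parse_sector_sizes(text)))
```

If the physical size is not reported at all, the sketch says so and stops there; the real value would have to come from the drive's data sheet or the controller utility.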
0
 
LVL 30

Assisted Solution

by:pgm554
pgm554 earned 167 total points
My opinion: that's a fake RAID chip (XOR chip).

I would dump it and use the built-in Windows RAID.
Faster on reads.

Or get a real hardware RAID controller with a coprocessor and cache.

If you're married to that controller, I would check for firmware and driver upgrades.

As for the 512 thing, my guess is the 500s are 512-byte.
The 2 TB are 4K.

Are those SAS or SATA interfaces?
0
 

Author Comment

by:Faxxer
They are SAS interfaces.  

I get the following info from the command in the 1st post..
bytes per sector 512
bytes per physical sector <not supported>
bytes per cluster 4096
bytes per file record segment 1024

"not supported" is not a very useful description!
0
 

Author Comment

by:Faxxer
http://www.newegg.com/Product/Product.aspx?Item=N82E16813182322

This is the exact motherboard by the way
0
 
LVL 78

Expert Comment

by:David Johnson, CD, MVP
Does your MegaRAID controller (a $2 fake RAID controller) support Advanced Format drives? Either step up to the plate and get a real, battery-backed controller, or use Windows' built-in RAID.
0
 

Author Comment

by:Faxxer
How does one tell the difference between a hardware and software controller?

LSI SAS2008 is a known controller card from the LSI website.

The controller runs on the PCIe bus according to the motherboard's block diagram.

Enlighten me on these things.
0
 
LVL 12

Accepted Solution

by:Gary Coltharp
Gary Coltharp earned 167 total points
If it says "Host RAID" it is usually of the type that the other experts are referring to as fake RAID. The LSI 2008 is a host RAID adapter. It basically means that there is no controller per se. The RAID is handled in the BIOS rather than by a separate card with its own processor, memory and battery backup. These can be flaky. Even if you configured it for RAID, VMware's ESX, for instance, will see the separate physical drives and ignore the RAID because it isn't "real".

My suggestion for your immediate issue, though, would be to initiate a rebuild and let it finish before you continue booting your OS. Sometimes that will return you to a consistent running state.

HTH
Gary
0
 

Author Comment

by:Faxxer
Hi Gary,

Glad to learn of this!

I do let it finish rebuilding before booting into the OS.

So if I were to decide to move the RAID arrays to the SATA ports and use the other type, will I be able to do this without having to reinstall my OS and data?  
Is it simply a matter of setting up the RAID in the bios and then plugging in the hard drives to the appropriate SATA port?  

Surely there's a "Trick" in this?
0
 
LVL 30

Expert Comment

by:pgm554
You should be able to just break the RAID through the BIOS or utilities.

Then I would set up the new RAID using the built-in Windows RAID.
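
As a rough illustration of what "built-in Windows RAID" can look like outside the Disk Management GUI, here is a small Python sketch that only writes out a diskpart script for mirroring one volume onto a second disk. The function name build_mirror_script and the disk numbers are hypothetical; the diskpart commands it emits (select disk, convert dynamic, select volume, add disk=) are standard ones, but the sketch assumes both disks are still basic and that the target disk is empty. It does not handle any extra boot partition without a drive letter, so treat the output as something to review, not something to run blind.

```python
def build_mirror_script(source_disk, target_disk, volume_letter):
    """Return diskpart script text that mirrors one volume onto target_disk.

    Assumes both disks are basic disks (they are converted to dynamic here)
    and that target_disk has enough unallocated space for the mirror.
    """
    if source_disk == target_disk:
        raise ValueError("source and target disk must be different")
    if len(volume_letter) != 1 or not volume_letter.isalpha():
        raise ValueError("volume_letter must be a single drive letter")

    lines = [
        f"select disk {source_disk}",
        "convert dynamic",                # software mirrors need dynamic disks
        f"select disk {target_disk}",
        "convert dynamic",
        f"select volume {volume_letter.upper()}",
        f"add disk={target_disk}",        # adds a mirror of the selected volume
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical layout: C: lives on disk 0, the empty second drive is disk 1.
    print(build_mirror_script(0, 1, "C"), end="")
```

The text can be saved to a file and passed to diskpart with its /s switch once the disk numbers have been checked against "list disk" on the actual machine.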
0
 
LVL 16

Expert Comment

by:Gerald Connolly
But make a FULL backup before you try it!
0

 

Author Comment

by:Faxxer
Well I defer to the experts then.

I will go ahead and take the SAS offline and attempt a rebuild of the RAID 1 on the SATA ports via BIOS and thus bypass the LSI component completely for the C drive OS.

IF it works EASILY for the C drive then in theory it should work just as easily for my DATADRIVE E which houses all the DBs and libraries, etc.

As long as the drive letters don't change, SBS should make the connections for Exchange, right?

Now IF it blows up and I get total loss of OS, I will go ahead and simply restore from a full backup I have.

The server has been backing up fully every day since it went online for production back in August.  I think it failed ONE time out of all those, and I've done full restores before when I had an SP for Exchange go bad.   They work pretty well for the most part so I'm not worried about that.

Will report back....after the deed is done...perhaps this weekend.

Thank you guys for all the valuable info.  As always, EE experts ARE THE BEST on the planet in my opinion.

Ike
0
 
LVL 12

Expert Comment

by:Gary Coltharp
Since these are mirrors, I would work with them one at a time to convert them to JBOD. Backup yes... but you don't want to be in the position of doing a bare metal recovery. Proceed carefully and good luck!

HTH
Gary
0
 

Author Comment

by:Faxxer
My only backup option is the Windows SBS utility which is a full bare metal backup unfortunately.

But this is a very small law firm; they just didn't want to buy a $500+ backup software option (like Acronis) because they've got all their vital data backed up to 2 externals (done manually at regular intervals).

If I had to reinstall the OS (absolute worst case scenario), it would not be a big deal in terms of data risk, just time consuming on my part.

I will post back on the matter...

I don't feel so intimidated and worried after having consulted the mighty EE crew.

Ike
0
 
LVL 30

Expert Comment

by:pgm554
Well, let me just do a cost justification: sooner or later, the backup software will need to back up to a 4K format USB drive.

Those drives (or anything larger than 2.2 TB) won't work with the built-in backup.

It was fixed in Server 2012, but not for Server 2008/SBS 2011.
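
The thread does not say where the 2.2 TB figure comes from. One plausible reading, and it is only an assumption here, is that it matches the classic 32-bit sector addressing ceiling (the same one MBR partition tables have) with 512-byte sectors. The arithmetic is easy to check with a few lines of Python; max_addressable_bytes is just an illustrative helper name.

```python
def max_addressable_bytes(sector_bytes, address_bits=32):
    """Largest capacity reachable with address_bits-wide sector numbers."""
    return (2 ** address_bits) * sector_bytes


if __name__ == "__main__":
    for sector_bytes in (512, 4096):
        limit = max_addressable_bytes(sector_bytes)
        # Decimal terabytes (10**12 bytes), the unit drive vendors use.
        print(f"{sector_bytes}-byte sectors: {limit} bytes = {limit / 10**12:.2f} TB")
```

With 512-byte sectors the ceiling works out to 2,199,023,255,552 bytes, which is the "2.2 TB" number; with 4096-byte sectors the same 32-bit addressing reaches eight times as far.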
0
 

Author Comment

by:Faxxer
OH yes!  

I learned that lesson on initial purchase!

I had to downgrade 2 4TB drives to 2TB ones.

Fortunately I had a use for them in a NAS.
0
 

Author Comment

by:Faxxer
I had a long weekend with this one...

I shall attempt to reconstruct exactly all that happened, and then decide on how this should be concluded....

First of all, as instructed, I broke the RAID and moved both drives to the standard SATA ports...and enabled RAID via the Adaptec RAID utility. It took 7 hours to init the array. (This might be my first mistake.)

After that I attempted to boot: BSOD.  Safe mode boot: BSOD. OK, maybe it's corrupted, I think....
Run the bare metal restore option (12 hours pass): BSOD in normal or safe boot.

I do some research and learn that Windows had reassigned drive volume letters and that this could be part of my issue..

Use Diskpart to reassign drive letters to match the prior partition letters..

BSOD no matter what.

............
Now by this point in time, it had turned into Saturday and the first restore took a full 12 hours before I got to boot to BSOD.

This office is a CPA law firm, and TAXES were due Tuesday for HUNDREDS of their customers....  Panic is a slight understatement.

.....I realize I may have already done things incorrectly, but time was in my face and so was a bunch of lawyers so I made the choice to go BACK to SAS connections and attempt bare metal restore from there.  

...................................

Saturday at around 7 pm was when I got to finally restart my RAID initializations and it went until Sunday at 4am.  Time is really in my face by this point and I can't fathom why these operations are taking hours and hours when they took mere minutes the first time I built the server.

..................................
After the RAID init finished, I attempt BMR and get a horrifying error from Microsoft restore utility...  It says no suitably sized drives can be found ...not big enough to restore to.....  that's right.... the EXACT SAME SIZED drives are now considered too small by the restore utility.

I read around the MS site and see this issue goes as far back as 2009...and NO SOLUTION was ever posted, except one dude decided he might as well try a bigger hard drive and see if it works...in his case it did, but my C drive was 500G and my E drive was already the MAX size of 2TB, the largest SBS will see anyway...

Needless to say, I had another 2TB drive sitting in my storage and felt I had no other option but to put the 2TB in as the new C drive and of course pray.

Since I had only 1 drive, I broke the RAID 1 on both arrays and just had C and E drives that were both 2TB. Reassigned drive letters using Diskpart again. (Just to be sure)

The BMR started and completed in literally 25 minutes, and I booted to the desktop I had last known Friday at 9pm when I ran the backup before it all started.

................................................

So...a lot of mistakes, yes?  A lot to learn, yes?

I could have left out some details, but I think I covered most of it...a lot of waiting the first two days on LONG LONG slow counters of percentages on drive utilities.

Some observations:

1. The SATA Adaptec RAID utility is also just as FAKE, right?  (Built into mobo)
2. You guys meant for me to use the WINDOWS partition manager to create the RAID array, right?  ...I only realized this TODAY, if this is true, as I had NO IDEA that Windows could do that from inside the OS!!!  Never had to do it or needed to do it.  Is that correct?

3. Was it a DRIVER issue that caused my BSODs on the Adaptec RAID controller?  My BSOD would only say that it appeared that the hard drive configuration had changed or a drive was recently added...and not much else but the stop codes, which I was just too panicked about time to research....Is that a proper observation?

4.  If I HAD just used the Windows drive management option and tried booting directly to SATA without enabling the Adaptec hardware, would Windows have correctly repaired the booting and then allowed me to add RAID from inside the OS? Was that my first mistake from above?

I ask because I still think you are right about getting OFF of SAS, but I can't spend a whole weekend shooting blindly like that again....I want to be armed with knowledge AS WELL as the past experience to do it better.

I appreciate all the advice and help you guys feel like giving.

Ike

p.s.  Time was the enemy I had to fight the most, not my ability to work through the issues as they came up.  Time pushed me into panic and fear because I knew just how much the customer needed access to their tax applications to file electronically for hundreds of their clients.  

I've never yet encountered a problem I could not find a solution to given enough time to read forums and ask questions.

Some here might think I lack the proper knowledge and/or experience to be admin of a server and network like this... you may be correct.  But I never ONCE got a customer on my raw knowledge...I get AND KEEP them by being a person of integrity and honestly tell them exactly what is going on.   OH yes...I tell them I don't know what is going on when I don't know..and I tell them I'll FIND the solution in a forum and ask experts in places like Experts Exchange.  They don't care that I don't have MCSE certs, they care that I can be trusted and forthright.  

Because of this business model my business THRIVES, with more work than I can do by myself.  I just be myself and don't cop the IT IS GOD attitude that I have seen so many times in other past jobs before I started my company.

I'll never work for someone again if I can help it because I AM very happy being my own boss and providing DIRECT customer service to my customers, not via proxy.

To conclude:  My Customer knew my difficulties all weekend, knew I stayed up there all the nights and days....They happily pay me my rate and don't consider my difficulties my failure at all.  Even though I willingly and profusely apologize for not having a quick and simple restoration when they needed it most.  Why do they not hold me responsible?  Because they know my attitude, and know my goal is to see them prosper.  They feel I take ownership in their business just as I do my own.

I have no idea why I typed all that out, but it's the OTHER side of the coin in this business...and I think it's overlooked all too often.

I can find 100 geeks to fix my computer, but only a few will be the person I want to do it because of their outlook, attitude, and sincere desire for helping.  The customer KNOWS the difference.

Ike
0
 
LVL 30

Expert Comment

by:pgm554
FYI, WD Red drives are 512-byte, so you could use the built-in backup for external USB drives.
0
 

Author Comment

by:Faxxer
One more note...they now decided that Acronis sounds like a better option...So there's progress for the future!
0
 
LVL 30

Expert Comment

by:pgm554
Symantec System Restore 2013 is also a pretty good product, so ...
0
 

Author Closing Comment

by:Faxxer
Info from each of you helped to create a larger picture for me to work from.  While the specific issue was not resolved by the suggestions from this thread, they all contributed to future actions, and the info learned here is very useful for this and future situations with this particular server.
0
