Solved

How to diagnose Dell T610 RAID or hard drive issues without Windows

Posted on 2014-02-19
Last Modified: 2016-11-23
Hi Everyone,

We had one of our ESXi servers die the other day, and it looked to be caused by 2 dead drives in the 4-drive RAID 5 array.

We reseated the drives and put them back online, and the controller asked us if we wanted to import the foreign array. We eventually said yes and the RAID went into rebuild (the drives were reseated in the same bays they were in, no changes).

However, both servers' data (VM files) was corrupt and not readable. We did a non-destructive reinstall of ESXi 5.1 and can now see the files, but none were readable or recoverable.

We have now failed over to our Axcient backup system until we can diagnose and repair the Dell T610 server issues (either hard drives or RAID or both). I have found online that you can use the Dell OpenManage boot disks to diagnose RAID and hard drive issues, but I have not been able to figure out how: the only options I can see are to reinstall the OS and build a new array/virtual drive.

The RAID controller is a Dell PERC 6/i SAS (which I believe is an LSI product) with 4x 500GB SAS 7.2K hard drives.

Any help you can provide would be appreciated.

thanks,

TheSonicOne
Question by:TheSonicGod
 
LVL 47

Accepted Solution

by:
dlethe earned 300 total points
ID: 39871569
When you have a foreign config situation, then understand you are in a multiple failure scenario.  There is no universal answer other than you need to get all the logs from the controller and configuration data (using diagnostic software) into the hands of an expert.  

Has the array been running degraded?  If so, which disk was the first to go, and how much data was written afterwards?  Did a HDD die during a previous rebuild?  Were the drives reassembled in the wrong order?  Was a failed drive with stale data forced into a configuration that was degraded at the time?  Is your parity data correct for 100% of the target?  Is the metadata correct?

You don't have the software to make such determinations, because it doesn't come cheap, and the skills to determine what to do can't just be picked up w/o lots of training.  

This is why data recovery firms charge $10K+ to try to fix a busted RAID, especially one that uses NTFS within ESXi.  
 
Your data is gone. I can't tell you what happened without looking at that information, and running some proprietary code to analyze things. You'd have to even put those disks behind a non-RAID controller so I could look at the raw physical blocks through a binary editor.

So bottom line, when/if this happens again, call Dell first, and if they won't / can't walk you through it, determine if it is worth $10K or so for onTrack to fix it, or to call somebody else who does this and get their opinion.
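The stale-drive question above can be illustrated with a toy model of RAID 5 parity. This is only a sketch: real controllers rotate parity across the drives and work on much larger stripes, and the block contents and the `xor_blocks` helper are made up for illustration. It shows why forcing a drive that dropped out earlier back into a degraded array can produce wrong data even when every drive is physically healthy.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a hypothetical 4-drive RAID 5: three data blocks plus parity.
d0_old, d1, d2 = b"AAAA", b"BBBB", b"CCCC"

# Drive 0 drops out. The array runs degraded, and a later write to drive 0's
# block can only be recorded in parity; the physical drive still holds d0_old.
d0_new = b"ZZZZ"
parity = xor_blocks([d0_new, d1, d2])

# A degraded read of drive 0's block is still correct: it comes from parity.
recovered_d0 = xor_blocks([parity, d1, d2])

# Now drive 1 drops out and the stale drive 0 is forced back online.
# Rebuilding drive 1 from the stale block gives garbage, with no error raised.
rebuilt_d1 = xor_blocks([d0_old, d2, parity])

print(recovered_d0 == d0_new)  # degraded read returns the current data
print(rebuilt_d1 == d1)        # rebuild from the stale member does not match
```

This is why the order in which the two drives went offline matters: only the drive that failed last holds data consistent with the surviving members.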
 

Author Comment

by:TheSonicGod
ID: 39871694
Thanks dlethe,

I really don't care about the lost servers at this point as we had full image backup and failed over to the appliance and they are up and running on it now anyway. We will just restore them back to either the fixed or new hardware once we can.

What I need to do now is determine whether I can fix the existing server and how to find out what is wrong/damaged/defective on it, or whether I need to replace the unit completely. The server is almost 4 years old now and out of warranty, so both are possibilities.

Thanks,

Spencer
 
LVL 47

Assisted Solution

by:dlethe
dlethe earned 300 total points
ID: 39871788
Replace all 4 disk drives.  Call Dell and see what they support as a replacement since I expect the smallest capacity they offer is larger than what you have anyway.

In fact, unless you are highly read-intensive and doing large-block I/O, a pair of larger disks configured as a RAID1 will be cheaper and much faster than a 4-disk RAID5.  A 3-disk RAID5 will easily have less than half the write speed (maybe much less) of a 2-disk RAID1, and the RAID1's overall performance will most likely be better as well.
 

Author Comment

by:TheSonicGod
ID: 39871819
But how can I be sure this issue was not caused by a failing RAID card itself? I want to be sure that I do not need a new card.

Given the prices, it may be more cost-effective to get a new server with drives and a new RAID card if the card and drives need replacing.

thoughts?
 
LVL 47

Assisted Solution

by:dlethe
dlethe earned 300 total points
ID: 39871848
The odds of a failing RAID card are so small, especially in light of you losing 2 x 4-year-old drives, that it isn't worth considering.

You'll have to crunch the numbers to see if buying 2 new SAS drives (3 with a hot spare) vs a whole server with 2-4 SAS drives is a better deal.
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 100 total points
ID: 39871855
What you really needed was to know which drive failed first and which one failed second, then you would have known which to import.

I don't know enough Dell stuff to tell you how to make a bootable OMSA CD, although it's definitely possible. A DSET report would probably have enough info too. PowerEdgeTech will probably tell us how to do that tomorrow.

This basic diagnostic software comes with the box; you just need a bootable CD with it on rather than the Windows version.
 

Assisted Solution

by:TheSonicGod
TheSonicGod earned 0 total points
ID: 39872372
Thanks everyone.

I found this Dell support article that has Windows and non-Windows sets of diagnostics that you can load onto various media, including a USB stick:

http://en.community.dell.com/support-forums/disk-drives/f/3534/t/19525983.aspx

I am running all the hardware tests now so we should have a better answer in the morning once all the tests have run.

Thanks,

TheSonicOne
 
LVL 32

Assisted Solution

by:PowerEdgeTech
PowerEdgeTech earned 100 total points
ID: 39874533
"I have found online that you can use the Dell Open Manage boot disks to diagnose raid and hard drive issues but I have not been able to figure out how as there are no options I can see to do this only to reinstall the OS and build a new array/virtual drive."

Those are the installation utilities.

OMSA Live is what you want:
http://linux.dell.com/files/openmanage-contributions/omsa-71-live/OMSA71-CentOS6-x86_64-LiveDVD.iso

Boot to it, create a password, startx, then open OpenManage Server Administrator to view the status of RAID, pull the RAID controller logs, and/or view the hardware logs.

The diagnostics are a good start to determine the actual health of the drives.
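Once OMSA Live is booted, the same physical disk status shown in the Server Administrator GUI can also be read from a terminal with the `omreport` command that is part of OMSA. A minimal sketch, assuming `omreport` is on the PATH, the PERC is controller 0, and the report uses its usual `Key : Value` layout. The helper names are hypothetical, not part of any Dell tool:

```python
import subprocess


def parse_omreport(text):
    """Turn 'Key : Value' report text into one dict per record.

    A new record is assumed to start at each 'ID' line.
    """
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings and blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "ID":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def physical_disks(controller=0):
    """Run 'omreport storage pdisk' for one controller and parse the output."""
    result = subprocess.run(
        ["omreport", "storage", "pdisk", f"controller={controller}"],
        capture_output=True, text=True, check=True,
    )
    return parse_omreport(result.stdout)


def suspect_disks(disks):
    """Return disks whose Status is not Ok or that report a predicted failure."""
    return [
        d for d in disks
        if d.get("Status") != "Ok" or d.get("Failure Predicted") == "Yes"
    ]
```

On the live system, `suspect_disks(physical_disks(0))` would list any disk the controller does not consider healthy. Disks that were kicked out and later re-imported may well show as Ok here, so the controller and hardware logs are still worth pulling.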
 

Author Comment

by:TheSonicGod
ID: 39881583
Hey Everyone,

OK, so I have run all the utilities using the Dell diagnostics via a bootable USB pen drive as per my prior posts. Everything passed, including all the confidence tests on all 4 hard drives.

I am curious about your thoughts on what to do next. Could this all have been caused by an overheating issue?

The reason I ask is that we noticed one of the cooling fans for the cabinet had failed and the room itself was much hotter than usual (likely 90+ degrees F). The ambient heat was not hot enough to force any server shutdowns, but I am wondering if it was hot enough to cause these drive failures.

We opened all the doors to the room to drop the temperature before we reseated these drives and tried to rebuild the RAID. As I mentioned in a prior post, the drives came back online right away, so I am not sure what to think now.

I am wondering if I should just rebuild the arrays from scratch, install ESXi, and move the VMs back to this unit, since all the tests have passed.

Thoughts????

Thanks,

TheSonicOne
 
LVL 47

Expert Comment

by:dlethe
ID: 39881630
Those Dell diagnostics are pretty much toys compared to the DVT tests that professionals use.  If the drive is dead or has a profoundly high number of bad blocks, then the Dell tests will fail the HDD.

But those tests are simply incapable of detecting problems related to interoperability, configurable settings, data integrity, ECC problems, vibration; it is a long list.

Besides, you have all the tests you need. You lost 2 HDDs, didn't you?
 

Author Comment

by:TheSonicGod
ID: 39881679
Is there any way to run the DVT tests on these drives?

Thanks,

TheSonicOne
 
LVL 47

Expert Comment

by:dlethe
ID: 39881685
Quality HDD test software costs more than your computer, and you would have to use a JBOD SAS controller.

But that is not the issue. You really need to have somebody who knows what they are doing look at it, for the reasons I mentioned earlier.  Interoperability issues need talented test engineers and detailed specs.
 

Author Comment

by:TheSonicGod
ID: 39881702
I have been working with servers for 20+ years, so I think I know what I am doing. I was asking for others' opinions. Is this not what this service is for? (Experts exchanging their experiences.)

I have had many RAID controller issues and drive failures over the years, but I have never had a drive come back to life, act like nothing ever happened, and then pass every test I can throw at it, which is the case this time (actually 2 drives in this case). Usually they either stay dead or fail some test at some point.

I was looking for reasons why this would happen and whether anyone else has experienced it. I also wanted to know whether I should act like the drives are still dead and replace them, or just rebuild the server and see what happens. I have full image fail-over with our Axcient backup for both VMs this server holds and can fail over in less than 10 minutes without losing anything, as there is no data held on this server, only an email filtering service and a BES server.

I was weighing my options and getting other suggestions for review.
 
LVL 47

Expert Comment

by:dlethe
ID: 39881725
If I had the drive in my lab and your controller logs I could probably tell you, but the fact of the matter is you don't have the hardware, software, or experience required to do a level 2 diagnostic.

There could be dozens of individual reasons and hundreds of thousands of different combinations of reasons.  Nobody can tell you what is wrong if an HDD passes the basic manufacturer's tests.

I've worked with OEMs and had to do level 3 diagnostics on why HDDs fail in a certain environment, and it cost well over $100K worth of engineering resources.  This involved getting the electron microscopes out to examine the heads & platters in a particularly bizarre set of circumstances.

My answer is that unless you are willing to pay some serious coin, then you'll never know for sure unless the HDDs fail consistently.  This particular problem has to be resolved by a pro who has the talent and equipment.
 

Author Comment

by:TheSonicGod
ID: 39881739
dlethe, I think you are way overthinking and complicating this issue. I am not looking for an absolute reason; I understand that this server is not worth that kind of time and money.

I am asking for an opinion: is it possible that this was just an issue caused by an overheated server room?

Given the backup coverage I have, I am leaning toward just setting up the virtual disks again, installing ESXi, and copying back the live VM files.

I am just asking if anyone has experienced this type of drive activity before and if so how did they handle it.
 
LVL 47

Expert Comment

by:dlethe
ID: 39882300
It is possible it is heat, but that is one thing you can test by monitoring the temperatures of the HDDs. Since these are SAS drives, the temperature is going to be in SCSI log page 2F.  You'll also need to monitor the unrecovered and recovered error count log pages in real time.

But since you need to do this as a pass-through via the LSI / MPT API, just forget it.  There is no shareware / free product that can do that.  So while this information can be confirmed with software, such software isn't available without spending a lot of money.
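For reference, log page 2F is the SCSI Informational Exceptions page, and its general parameter (code 0000h) carries the drive's most recent temperature reading alongside any failure-prediction sense code. The sketch below only decodes a raw page that has already been fetched with LOG SENSE; getting those bytes from a drive behind the PERC is the pass-through problem described above and is not shown. The function name and the sample bytes in the usage note are hypothetical.

```python
import struct


def decode_ie_log_page(buf):
    """Decode the general parameter of an Informational Exceptions log page (0x2F).

    Layout assumed: 4-byte log page header, then a 4-byte parameter header
    (code, control, length) followed by ASC, ASCQ and the temperature byte.
    """
    if len(buf) < 11:
        raise ValueError("buffer too short for header plus general parameter")
    page_code = buf[0] & 0x3F
    if page_code != 0x2F:
        raise ValueError(f"unexpected log page 0x{page_code:02X}")
    param_code, _control, param_len = struct.unpack_from(">HBB", buf, 4)
    if param_code != 0x0000 or param_len < 3:
        raise ValueError("general parameter 0000h missing or too short")
    asc, ascq, temp = buf[8], buf[9], buf[10]
    return {
        "asc": asc,
        "ascq": ascq,
        # ASC 5Dh is "failure prediction threshold exceeded"
        "failure_predicted": asc == 0x5D,
        # 0xFF is treated here as "no temperature available"
        "temperature_c": None if temp == 0xFF else temp,
    }
```

For example, `decode_ie_log_page(bytes([0x2F, 0, 0, 8, 0, 0, 0x03, 4, 0, 0, 38, 0]))` reports a temperature of 38 C with no failure predicted. Polling this value while the room warms up would show whether the drives were running hot.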
 
LVL 55

Expert Comment

by:andyalder
ID: 39882638
>I have had many raid controller issues and drive failures over the years but I have never had a drive come back to life and act like nothing ever happened which is the case this time and then pass every test I can throw at them (Actually 2 drives in this case). Usually they either stay dead or fail some test at some point.

I've had drives "fail" when there's nothing really wrong with them. If they take too long to respond, the controller will fail them and kick them out of the array. No amount of testing will show such disks as having failed because they didn't fail; the controller just had a huff and kicked them out. TLER/CCTL/ERC is meant to limit the length of time that the disk retries a bad block before giving up. It is set to a low value for drives on a RAID controller so that, instead of waiting, the controller can recover the data from parity. In some cases, though, a good drive may still take too long to respond, so the controller marks it as failed. Perfectly good disks can lock up just like perfectly good servers can crash. Electronics aren't infallible; a stray alpha particle may cause a soft error.
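The interaction between the drive's error recovery limit and the controller's timeout can be sketched as a toy model. All of the numbers and function names here are hypothetical; real controllers and drives have their own timeout values and retry behaviour.

```python
def drive_response(retry_needed_s, erc_limit_s=None):
    """Return (seconds until the drive answers, whether it returned the data).

    With an ERC/TLER limit set, the drive gives up at the limit and reports a
    read error; without one it keeps retrying until it succeeds.
    """
    if erc_limit_s is not None and retry_needed_s > erc_limit_s:
        return erc_limit_s, False
    return retry_needed_s, True


def controller_outcome(retry_needed_s, controller_timeout_s, erc_limit_s=None):
    """Decide what the RAID controller does with one slow read."""
    elapsed, got_data = drive_response(retry_needed_s, erc_limit_s)
    if elapsed > controller_timeout_s:
        return "drive marked failed"
    return "data returned" if got_data else "block recovered from parity"


# A read that needs 30 s of retries, against a controller that waits 10 s:
print(controller_outcome(30, 10, erc_limit_s=7))  # drive gives up at 7 s
print(controller_outcome(30, 10))                 # no limit, drive overruns
```

In the second case the drive would eventually have returned the data, which is why it can pass every diagnostic afterwards even though the controller dropped it.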
 
LVL 47

Expert Comment

by:dlethe
ID: 39882769
There is ALWAYS a reason.  The most difficult things for lay people to test are intermittent issues and problems that are specific to a load and to the configurable settings at a particular point in time.

A perfectly fine drive can pass every diagnostic on the planet but still fail due to simple interoperability or stress-induced problems.  Here is an easy example.  The block size is 520 but the controller expects 512 bytes/block.  The drive will pass all tests but simply not work.  Obviously block size is not your issue, but it is easy to get your head around.  There are hundreds of subtle configurable settings on a HDD alone, and dozens have to do with error recovery scenarios that only fail when a particular series of events happens.

Andy is right, and even what he posts is only a fraction of what can happen.  We didn't even get into firmware bugs, which come up constantly, and their fixes.  In fact Seagate put out an update a few weeks ago for a certain enterprise class HDD that still had SEV1 / high priority fixes that PREVENT DATA LOSS in certain I/O situations.  One of my OEM customers was seeing drives drop off randomly, and they have tens of thousands of these disks.
 
LVL 55

Expert Comment

by:andyalder
ID: 39918930
Extra points for useless waffle?

>"OMSA Live is what you want:" plus the link to get it from was worth more than the rest of jargon put together AFAIK.
 

Author Closing Comment

by:TheSonicGod
ID: 39931074
Found download of required utilities on my own