TheSonicGod asked:

How to diagnose Dell T610 RAID or hard drive issues without Windows

Hi Everyone,

We had one of our ESXi servers die the other day, and the failure looked to be caused by two dead drives in the four-drive RAID 5 array.

We reseated the drives and brought them back online, and the controller asked whether we wanted to import the foreign configuration. We eventually said yes and the RAID went into rebuild (the drives went back into the same bays they came from; no changes).

However, both servers' data (the VM files) was corrupt and unreadable. We did a non-destructive reinstall of ESXi 5.1 and can now see the files, but none of them are readable or recoverable.

We have now failed over to our Axcient backup system until we can diagnose and repair the Dell T610 server issues (hard drives, RAID, or both). I have read online that the Dell OpenManage boot disks can be used to diagnose RAID and hard drive issues, but I have not been able to figure out how; the only options I can see are to reinstall the OS or build a new array/virtual disk.

The RAID controller is a Dell PERC 6/i SAS (which I believe is an LSI product) with 4x 500GB 7.2K SAS hard drives.
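
In case it helps to frame what I am after, below is a rough sketch of the kind of per-drive health check I was hoping to run from a Linux live USB instead of Windows. It is only a sketch: I am assuming smartmontools can reach the drives behind the PERC 6/i through its MegaRAID pass-through, and the block device name and target IDs 0-3 are just guesses on my part.

    #!/usr/bin/env python3
    # Sketch only: ask each physical disk behind the PERC 6/i for its SMART
    # health verdict. Assumes a Linux live environment with smartmontools,
    # and that the PERC answers to smartctl's "megaraid,N" pass-through.
    import subprocess

    CONTROLLER_DEV = "/dev/sda"   # placeholder: any block device owned by the PERC

    for target in range(4):       # the four drives in the RAID 5 set
        print(f"=== physical disk megaraid,{target} ===")
        result = subprocess.run(
            ["smartctl", "-H", "-d", f"megaraid,{target}", CONTROLLER_DEV],
            capture_output=True, text=True)
        print(result.stdout or result.stderr)

If something like that is possible from the OpenManage boot media instead, that would be even better.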

Any help you can provide would be appreciated.

thanks,

TheSonicOne
ASKER CERTIFIED SOLUTION from David (member-only content)

TheSonicGod (ASKER):

Thanks dlethe,

I really don't care about the lost servers at this point, as we had full image backups and failed over to the appliance, and they are up and running on it now anyway. We will just restore them to either the repaired or new hardware once we can.

What I need to do now is determine whether I can fix the existing server, how to find out what is wrong/damaged/defective on it, or whether I need to replace the unit completely. The server is almost 4 years old and out of warranty, so both are possibilities.

Thanks,

Spencer
SOLUTION (member-only content)

But how can I be sure this issue was not caused by the RAID card itself failing? I want to be sure that I do not need a new card.

Given the prices, it may be more cost-effective to get a new server (with drives and a new RAID card) if both the existing card and the drives need replacing.

Thoughts?
SOLUTION (member-only content)

Hey Everyone,

OK, so I have run all the utilities in the Dell diagnostics from a bootable USB pen drive, as per my prior posts. Everything passed, including all the confidence tests on all 4 hard drives.

I am curious about your thoughts on what to do next. Could this all have been caused by overheating?

The reason I ask is that we noticed one of the cooling fans for the cabinet had failed and the room itself was much hotter than usual (likely 90+ degrees F). The ambient heat was not high enough to force any server shutdowns, but I am wondering if it was enough to cause these drive failures.

We opened all the doors to the room to drop the temperature before we reseated the drives and tried to rebuild the RAID. As I mentioned in a prior post, the drives came back online right away, so I am not sure what to think now.

I am wondering whether I should just rebuild the array from scratch, install ESXi, and move the VMs back to this unit, since all the tests have passed.

Thoughts????

Thanks,

TheSonicOne
Those Dell diagnostics are pretty much toys compared to the DVT tests that professionals use. If the drive is dead or has a profoundly high number of bad blocks, then the Dell tests will fail the HDD.

But those tests are simply incapable of detecting problems related to interoperability, configurable settings, data integrity, ECC, vibration... it is a long list.

Besides, you have all the tests you need. You lost 2 HDDs, didn't you?
Is there any way to run the DVT tests on these drives?

Thanks,

TheSonicOne
Quality HDD test software costs more than your computer, and you would have to use a JBOD SAS controller.

But that is not the issue; you really need to have somebody who knows what they are doing look at it, for the reasons I mentioned earlier. Interoperability issues need talented test engineers and detailed specs.
I have been working with servers for 20+ years, so I think I know what I am doing. I was asking for others' opinions; is that not what this service is for? (Experts exchanging their experiences.)

I have had many RAID controller issues and drive failures over the years, but I have never had a drive come back to life and act like nothing ever happened, which is the case this time, and then pass every test I can throw at it (actually two drives in this case). Usually they either stay dead or fail some test at some point.

I was looking for reasons why this would happen and asking whether anyone else has experienced it. I am also wondering whether I should treat the drives as still dead and replace them, or just rebuild the server and see what happens, given that I have full image fail-over with our Axcient backup for both of the VMs this server holds and can fail over in less than 10 minutes without losing anything (there is no data held on this server, only an email filtering service and a BES server).

I was weighing my options and getting other suggestions for review.
If I had the drive in my lab and your controller logs, I could probably tell you, but the fact of the matter is that you don't have the hardware, software, or experience required to do a level 2 diagnostic.

There could be dozens of individual reasons and hundreds of thousands of different combinations of them. Nobody can tell you what is wrong if an HDD passes the basic manufacturer's tests.

I've worked with OEMs and had to do level 3 diagnostics on why HDDs fail in a certain environment, and it cost well over $100K worth of engineering resources.  This involved getting the electron microscopes out to examine the heads & platters in a particularly bizarre set of circumstances.

My answer is that unless you are willing to pay some serious coin, you'll never know for sure, unless of course the HDDs start failing consistently. This type of problem has to be resolved by a pro who has the talent and the equipment.
dlethe - I think you are overthinking and overcomplicating this issue. I am not looking for an absolute reason; I understand that this server is not worth that kind of time and money.

I am asking for an opinion: is it possible that this was just an issue caused by an overheated server room?

Given the backup coverage I have, I am leaning toward just recreating the virtual disks, installing ESXi, and copying the live VM files back.

I am just asking whether anyone has experienced this type of drive behavior before and, if so, how they handled it.
It is possible it is heat, but that is one thing you can test by monitoring the temperatures of the HDDs. Since these are SAS drives, the temperature is reported in SCSI log page 2F. You'll also need to monitor the unrecovered and recovered error count log pages in real time.

But since you need to do this through the LSI/MPT pass-through API, just forget it. There is no shareware/free product that can do that. So while this information can be confirmed with software, such software isn't available without spending a lot of money.
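
If the disks were sitting on a plain JBOD SAS HBA where they show up as generic SCSI devices, the kind of polling I am talking about would look roughly like the sketch below using sg3_utils (log page 0x2F for informational exceptions/temperature, 0x02 and 0x03 for the write and read error counters). The device names are placeholders, and behind the PERC this runs straight into the pass-through problem I mentioned.

    #!/usr/bin/env python3
    # Sketch only: periodically dump temperature and error counter log pages
    # from SAS drives that are visible as /dev/sgN (e.g. on a JBOD HBA).
    # Assumes sg3_utils is installed; the device names are placeholders.
    import subprocess
    import time

    DRIVES = ["/dev/sg1", "/dev/sg2", "/dev/sg3", "/dev/sg4"]
    PAGES = ["0x2f",   # Informational Exceptions page (includes temperature)
             "0x02",   # Write Error Counter page
             "0x03"]   # Read Error Counter page

    while True:
        for dev in DRIVES:
            for page in PAGES:
                out = subprocess.run(["sg_logs", f"--page={page}", dev],
                                     capture_output=True, text=True)
                print(f"--- {dev} log page {page} ---")
                print(out.stdout or out.stderr)
        time.sleep(60)   # re-sample every minute while the array is under load
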
>I have had many RAID controller issues and drive failures over the years, but I have never had a drive come back to life and act like nothing ever happened, which is the case this time, and then pass every test I can throw at it (actually two drives in this case). Usually they either stay dead or fail some test at some point.

I've had drives "fail" when there was nothing really wrong with them: if they take too long to respond, the controller will fail them and kick them out of the array. No amount of testing will show such disks as having failed, because they didn't fail; the controller just had a huff and kicked them out.

TLER/CCTL/ERC is meant to limit the length of time that the disk retries a bad block before giving up. This is set to a low value for drives on a RAID controller so that, instead of waiting, the controller can recover the data from parity. It may be that in some cases a good drive still takes too long to respond, so the controller marks it as failed. Perfectly good disks can lock up, just like perfectly good servers can crash. Electronics isn't infallible; a stray alpha particle may cause a soft error.
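
For what it's worth, on SATA drives you can read (and set) that error-recovery timeout through smartctl's SCT ERC support; the sketch below shows the idea, with the usual 7.0 second cap for drives behind a hardware RAID controller. Your SAS drives keep the equivalent recovery time limit in the read-write error recovery mode page instead, so the exact commands differ there, and the device name here is only a placeholder.

    #!/usr/bin/env python3
    # Sketch only: show, and optionally cap, the SCT Error Recovery Control
    # timeouts on a SATA drive so a long internal retry doesn't get the disk
    # kicked out of the array. Requires smartmontools and root; /dev/sdX is
    # a placeholder device name.
    import subprocess

    def show_erc(dev: str) -> None:
        # Print the current read/write recovery timeouts (in deciseconds).
        subprocess.run(["smartctl", "-l", "scterc", dev], check=False)

    def set_erc(dev: str, deciseconds: int = 70) -> None:
        # Cap read and write recovery at 7.0 s (70 deciseconds).
        subprocess.run(
            ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", dev],
            check=False)

    if __name__ == "__main__":
        show_erc("/dev/sdX")
        # set_erc("/dev/sdX")   # uncomment to actually change the setting
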
There is ALWAYS a reason. The most difficult things for lay people to test are intermittent issues and problems that are specific to a particular load and to the configurable settings in effect at any point in time.

A perfectly fine drive can pass every diagnostic on the planet but still fail due to simple interoperability or stress-induced problems. Here is an easy example: the drive is formatted at 520 bytes/block but the controller expects 512. The drive will pass all tests but simply not work. Obviously block size is not your issue, but it is easy to get your head around. There are hundreds of subtle configurable settings on an HDD alone, and dozens have to do with error recovery scenarios that only fail when a particular series of events happens.
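
Checking the formatted block size, at least, is easy if you can reach the drive as a generic SCSI device; a rough sketch with sg3_utils is below (the device name is a placeholder, and behind a RAID controller you would again need pass-through access).

    #!/usr/bin/env python3
    # Sketch only: print the formatted logical block length of a SAS drive so
    # a 520- vs 512-bytes/block mismatch is obvious. Assumes sg3_utils is
    # installed and the drive is visible as a generic SCSI device.
    import subprocess

    def show_block_size(dev: str) -> None:
        # sg_readcap --long reports the capacity and the logical block length.
        out = subprocess.run(["sg_readcap", "--long", dev],
                             capture_output=True, text=True)
        print(out.stdout or out.stderr)

    show_block_size("/dev/sg1")   # placeholder device name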

Andy is right, and even what he posts is only a fraction of what can happen. We didn't even get into the firmware bugs that come up constantly, and their fixes. In fact, Seagate put out an update a few weeks ago for a certain enterprise-class HDD that still had SEV1/high-priority fixes that PREVENT DATA LOSS in certain I/O situations. One of my OEM customers was seeing drives drop off randomly, and they have tens of thousands of these disks.
Extra points for useless waffle?

>"OMSA Live is what you want:" plus the link to get it from was worth more than the rest of jargon put together AFAIK.
I found the download for the required utilities on my own.