Problems with Dell Perc H810 cards - twice

crp0499
We have a super weird thing happening.  We have two Dell R730 servers, both with Perc H810 controllers in them.  The two servers are connected to an IBM V3700 and a Dell MD1220.

Last week, we rebooted the first host and at the boot, we got this error:

"LSI-EFI SAS Driver:
Unhealthy status reported by this UEFI driver without specific error

UEFI0116: One or more boot drivers have reported issues
Check the Driver Health Menu in the Boot Manager for details.

One or more boot drivers require configuration changes.  Press any key to load the driver health manager for configurations."

So, we got that after the reboot and after some googling, we decided to move the Perc H810 out of the second server into the first and it booted right up, no issues.  So, we chalked it up to a bad card, ordered a replacement, put it into the first server and it booted just fine as well.  There you have it, we had a bad Perc card.

Now, fast forward a week.  Server 1 has its new Perc and it's running great, seeing all of the storage, and we are happy.

Server 2 is running great with its Perc card that we stole from server 1 last week and life is good.

Now, tonight, I need to reboot server 2 and boom, it hangs and is now reporting the exact same error as noted above.

Pressing any key does nothing.  The machine will not boot into anything...iDRAC doesn't even work when it's in this state.  All we can do is remove the Perc card, and then it boots normally.

It "could be" that we lost two Perc H810 cards, but it's too coincidental to me.  It seems something else is amiss and I can't put my finger on it.

Anyone else know what's going on here?

Thanks

Cliff

By the way, we migrated all of the active VMs to the first server with its new Perc H810 and we plan on replacing the H810 in server 2, which now seems to have gone kaput!

Dr. Klahn, Principal Software Engineer

Commented:
Swap your spare motherboard into the first system, let it run for a week and see if the problem is alleviated.  If so, there's probably a PCI / PCIe bus issue.

Side note:  Check the BIOS revision for both systems and see if they are identical, and also check the Dell BIOS updates to see if there is (a) a later one than that installed in the servers and (b) one which directly addresses this specific issue.  Don't update the BIOS just to update it; that is begging for trouble.
Top Expert 2014

Commented:
This does not imply a problem with the card; it can be caused by a failed disk, a foreign disk, a cable problem, etc.  You have to use the configuration utility to see what the problem is, but it hangs rather than going into the configuration menu.

Being "old school", I would switch it into BIOS mode, reboot, watch the old-fashioned boot menu and use <Ctrl>+R, fix the disk problem, and then switch it back to UEFI.
crp0499, CEO

Author

Commented:
That's it.  Nothing works here.  The server locks up and no matter what I change in regards to the boot order, it won't enter setup, it won't bring up the boot menu, nothing.  The only way to get the server to boot is to remove the card and power cycle it.  Again, this card has been working for months, even through reboots.  My hang-up is that two identical servers, connected to the exact same storage, are exhibiting the exact same issue with the exact same Perc card.  I could handle the idea that two cards just "went bad", but it just seems odd and I feel like I'm missing something.

Looking in iDRAC, for the Perc H810, I see this error under foreign config:

STOR079: The device does not support this operation or is in a state that does not allow this operation.  Make sure the device supports the requested operation. If the operation is supported, then make sure the server is turned on and retry the operation.

I am also wondering if what I am doing will work.  You see, both R730 servers are connecting to the same MD1200 via the same Perc cards.  The first server was used to create the array that's in the first server so I'm expecting the second server with its second Perc card to see that array and provide access to it.  Shared storage.  That seems perfectly normal to me, but who knows.

Top Expert 2014
Commented:
Two servers connected to the same enclosure? Not a valid configuration; I'm surprised it ever worked. The MD1200 is a dumb shelf, and you cannot connect two PERCs to it. It is not shared storage. When one server writes to it, it actually writes to the cache module on its own PERC. The other server can't read that cache, so your data will get corrupted almost instantly. There is no mechanism to synchronize the cache in the two servers.
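To make the cache problem concrete, here is a minimal sketch in Python. It is a toy model only, not PERC firmware behavior: the `Shelf` and `Controller` classes, the block numbers, and the values are all hypothetical names made up for illustration. It assumes each controller acknowledges writes into a private write-back cache and only later flushes them to the shared shelf.

```python
class Shelf:
    """Toy 'dumb' disk shelf: stores blocks, knows nothing about caches."""
    def __init__(self):
        self.blocks = {}


class Controller:
    """Toy RAID controller with a private write-back cache (hypothetical model)."""
    def __init__(self, name, shelf):
        self.name = name
        self.shelf = shelf
        self.cache = {}  # dirty blocks that have not reached the disks yet

    def write(self, lba, data):
        # The write is acknowledged once it is in this controller's cache.
        self.cache[lba] = data

    def read(self, lba):
        # A controller can only see its own cache, then the disks.
        if lba in self.cache:
            return self.cache[lba]
        return self.shelf.blocks.get(lba)

    def flush(self):
        # Push dirty blocks to the shelf, overwriting whatever is there.
        self.shelf.blocks.update(self.cache)
        self.cache.clear()


def demo():
    shelf = Shelf()
    a = Controller("A", shelf)
    b = Controller("B", shelf)

    a.write(0, "v1")
    a.flush()            # disk now holds v1
    a.write(0, "v2")     # v2 sits only in A's cache
    stale = b.read(0)    # B cannot see A's cache, so it reads the old v1

    b.write(0, "v3")
    b.flush()            # disk now holds v3
    a.flush()            # A's older v2 silently overwrites B's newer v3
    return stale, shelf.blocks[0]
```

In this model `demo()` shows both failure modes described above: the second controller reads stale data, and a late flush from the first controller overwrites a newer write.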

Also, on boot the PERC writes a timestamp to the disks. There is no mechanism to synchronize that timestamp either, so your disks will keep going into a foreign state. There's probably nothing wrong with your original card, but replacing it with a new one which had no stored configuration lets it automatically import the config from the disks.
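The timestamp behavior can be sketched the same way. This is a simplified model of the description above, not the actual PERC metadata format or boot logic: the `Disks` and `Perc` classes, the integer stamps, and the status strings are assumptions invented for the example. It assumes a controller compares its stored stamp with the one on the disks, treats a mismatch as foreign, and auto-imports only when it has no stored configuration.

```python
class Disks:
    """Toy disk group carrying the stamp of the last controller that owned it."""
    def __init__(self):
        self.stamp = None


class Perc:
    """Toy controller (hypothetical model); stored_stamp None means no stored config."""
    def __init__(self, name):
        self.name = name
        self.stored_stamp = None

    def boot(self, disks):
        if self.stored_stamp is None and disks.stamp is not None:
            status = "auto-imported"   # blank card adopts the on-disk config
        elif self.stored_stamp == disks.stamp:
            status = "ok"              # disks match what this card remembers
        else:
            return "foreign"           # someone else stamped the disks
        # Write a fresh stamp to the disks and remember it on the card.
        new_stamp = (disks.stamp or 0) + 1
        disks.stamp = new_stamp
        self.stored_stamp = new_stamp
        return status


def scenario():
    disks = Disks()
    card_a, card_b = Perc("A"), Perc("B")
    results = [
        card_a.boot(disks),   # first boot, array created on A
        card_b.boot(disks),   # B has no stored config, so it imports
        card_a.boot(disks),   # A's remembered stamp no longer matches
    ]
    replacement = Perc("new")
    results.append(replacement.boot(disks))  # a blank card imports cleanly
    return results
```

Under these assumptions the sequence comes out as `ok`, `auto-imported`, `foreign`, `auto-imported`, which matches the story in the thread: the original card looks broken after the other server boots, while a blank replacement card comes up fine.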

Have you really been running like this for months? I find that almost impossible.
crp0499, CEO

Author

Commented:
Months! Literally. Both ESX hosts show the same storage and I was even vMotioning between the two hosts.

That being said, I'm onsite. I took the card and put it into a PC and it halts the BIOS just like in the server.  It seems the card is bad.

What you say matches what I thought. Do you have any documentation to back up your point about the MD1200 not being usable for shared storage?  Not that I doubt you, but when I was called to troubleshoot this, I was thinking the same thing.
crp0499, CEO

Author

Commented:
Never mind! I found it! I'm an idiot! Going to clean up this mess I made and get some shared storage!

Thanks Andy
Top Expert 2014

Commented:
The MD1200 could be in split mode, but then you couldn't vMotion unless it was just acting as local storage with vSAN software on top.

VMware only has one host accessing each VM's files, so although both hosts are accessing the shelf, they never access the same data area. That would explain why it's not corrupting the data: Server A never needs to see Server B's cache. I'm surprised vMotion works, though, because that's the one case where one server does have to access the data area that was previously used by the other one. The reason there's no corruption is probably that the cache is flushed to disk fast enough.
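A rough sketch of that reasoning, again as a toy model with made-up names (`Host`, `handoff`, and the `vm1`/`vm2` keys are hypothetical, and real vMotion involves far more than this): each host caches its own writes, the two VMs live in disjoint areas, and a handoff is only safe if the old host's cache has reached the disks first.

```python
class Host:
    """Toy host plus controller with a private write-back cache (hypothetical)."""
    def __init__(self, disk):
        self.disk = disk      # shared dict standing in for the shelf
        self.cache = {}       # writes not yet flushed to the shelf

    def write(self, area, data):
        self.cache[area] = data

    def read(self, area):
        # Own cache first, then whatever is actually on the disks.
        if area in self.cache:
            return self.cache[area]
        return self.disk.get(area)

    def flush(self):
        self.disk.update(self.cache)
        self.cache.clear()


def handoff(flush_first):
    """Move 'vm1' from host A to host B and return what B reads for it."""
    disk = {}
    host_a, host_b = Host(disk), Host(disk)

    host_a.write("vm1", "latest")   # vm1 runs on A
    host_b.write("vm2", "other")    # vm2 runs on B, a disjoint data area

    if flush_first:
        host_a.flush()              # A's cache reaches the disks in time

    return host_b.read("vm1")       # B takes over vm1
```

In this model `handoff(True)` gives B the latest data, while `handoff(False)` leaves B reading whatever was on the disks before A's write, which is the window the comment above suspects vMotion was narrowly avoiding.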

If the card stops a PC with no disks connected, then I guess it must be faulty. I would try clearing its cache by removing the battery for a few minutes, though, just to confirm.
