PERC H700 predictive failure continues with new drive

Hi,

The system is a PowerEdge R710 with a PERC H700 integrated controller. The virtual disk in question is RAID 5 with 3 drives. I have been monitoring the controller logs weekly and running consistency checks for the past 3 months with 0 issues. Suddenly yesterday, OMSA showed 1 drive as predictive failure. I went through the logs and saw a bunch of unexpected sense entries, many of them stating "corrected medium error", but I do not see any unrecoverable errors.

I went onsite, offlined the drive, then inserted a new Dell-branded drive. The rebuild completed, but the new drive is also reporting predictive failure. I went through the logs again and noticed many more unexpected sense entries during the rebuild. I then ran a consistency check, which again produced many unexpected sense entries. The drive was still in predictive failure state, so I replaced it again with a new drive. This time, after the rebuild, the drive was not showing predictive failure. Just to be sure, I ran another consistency check, which put the same drive in predictive failure state again. There are again a bunch of unexpected sense entries, many of them stating "corrected medium error". I am unsure how to proceed.

The firmware for the controller, drives, and BIOS is all up to date, but iDRAC6 and Lifecycle are out of date. iDRAC6 is at 1.92.00 (build 5) and Lifecycle is at 1.4.0.445.

Please let me know your thoughts. If you recommend updating iDRAC6 and Lifecycle, please let me know where to find the updates and the steps for applying them. I am a little confused about where to locate these updates on the support page. Also, can I update straight to the latest version, or does it have to be done in steps?

Thank you in advance!
itechresults Asked:
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant, Commented:
We usually just use the Firmware DVD and update all firmware.

Anyway your issue is something we’ve been noticing of late with Dell support and their supply of drives!

However the problem went away because Dell could no longer supply us the drives!

We had to retire the servers, and we did purchase non-Dell drives from eBay. The only issue with them was that we now have another yellow warning for a non-certified drive, but it fixed the predictive failure issue!

But the servers are now out of production!
itechresults (Author) Commented:
Thanks. I need this server running for 4 more months, then it will be out of production!
andyalder Commented:
Normally, predictive failure propagation like this only occurs if you have a punctured stripe, in which case you have to back up, wipe the array, and start again.

Can you upload the log?
itechresults (Author) Commented:
I agree that there is something up with the virtual disk. I searched the logs for "puncture" and did not find any matches, but I'm not sure whether H700 controllers log punctured stripes. Also, all the unexpected sense entries reference PD 03, which is the drive I'm having issues with. I uploaded 3 logs. Thanks!
lsi_0203_1.log
lsi_0203_2.log
lsi_0203_5.log
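For anyone repeating the keyword search described above, a minimal Python sketch follows. The file names are the three attachments listed here. The search terms are phrases quoted in this thread, not an official list of H700 event strings, so a zero count for "puncture" only shows that the word is absent, not that the controller would have logged it.

```python
import re
from collections import Counter
from pathlib import Path

# Phrases quoted in this thread (assumed wording, not an official event list).
# The leading \b keeps "uncorrected medium error" from matching "corrected".
PATTERNS = {
    "puncture": re.compile(r"\bpuncture", re.IGNORECASE),
    "corrected medium error": re.compile(r"\bcorrected medium error", re.IGNORECASE),
    "unexpected sense": re.compile(r"\bunexpected sense", re.IGNORECASE),
}

def count_keywords(text):
    """Count the lines of `text` that contain each phrase."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__":
    for name in ("lsi_0203_1.log", "lsi_0203_2.log", "lsi_0203_5.log"):
        path = Path(name)
        if path.exists():
            print(name, dict(count_keywords(path.read_text(errors="replace"))))
```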
andyalder Commented:
I spy uncorrectable medium errors: the entry below occurs before any "corrected medium error" entries in the log. I would like to see an older log, as I suspect this one starts after the problem occurred.

01/28/19  1:59:53: DEV_REC:Medium Error DevId[3] devHandle b RDM=8055c400 retires=0

REC does not mean recovered. I think "DEV_REC" is short for device record; I'm not sure, but it certainly is a URE (unrecoverable read error).

https://www.dell.com/support/article/us/en/04/sln111497/double-faults-and-punctures-in-raid-arrays?lang=en may apply where it says "Data error on an online drive is propagated (copied) to a rebuilding drive". However, the LBAs reported in the first log are not present in the third one. I haven't had time to check each LBA for a pattern, so it may just be multiple bad spares.
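To see which device the medium errors are logged against, and when the first one appears, the logs could be scanned with something like the Python sketch below. The pattern is inferred from the single DEV_REC line quoted above, so it is an assumption that other entries (or other firmware versions) use the same layout; the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the one log line quoted in this thread (assumption).
DEV_REC = re.compile(
    r"^\s*(\d{2}/\d{2}/\d{2})\s+(\d{1,2}:\d{2}:\d{2}):\s*"
    r"DEV_REC:Medium Error DevId\[([0-9A-Fa-f]+)\]"
)

def medium_errors_by_device(lines):
    """Map each DevId to the (date, time) stamps of its DEV_REC medium errors."""
    hits = defaultdict(list)
    for line in lines:
        match = DEV_REC.match(line)
        if match:
            date, time, dev_id = match.groups()
            hits[dev_id].append((date, time))
    return dict(hits)

sample = [
    "01/28/19  1:59:53: DEV_REC:Medium Error DevId[3] devHandle b RDM=8055c400 retires=0",
]
print(medium_errors_by_device(sample))
# prints {'3': [('01/28/19', '1:59:53')]}
```

Assuming DevId corresponds to the PD number, any hits under a DevId other than 3 would point at one of the surviving member drives rather than the replaced one, which is the propagation case the Dell article describes.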
itechresults (Author) Commented:
Thank you, andyalder, for your time analyzing the logs. I have attached 2 more logs: one from 1 week prior and one from 3 weeks prior to the predictive failure.

I ordered new drives and plan on replacing PD 03 again.
lsi_0113.log
lsi_0127.log
itechresults (Author) Commented:
Andrew & andyalder, thank you for taking the time to help me troubleshoot this issue. I replaced PD 03 for a third time, which resolved the predictive failure issue. Although I am happy to see the controller not logging errors and PD 03 in a normal state, I'm still not confident the virtual disk is healthy.