Dell Storage Array

GalaxyTechService
Hello Experts,

We have a Dell storage array. It was running very slowly yesterday, and the problem eventually resolved itself. Attached is the major event log. Any input on the situation would be great.

Thank you,

-GTS
majorEventLog.txt
John, Business Consultant (Owner)
Most Valuable Expert 2012
Expert of the Year 2018

Commented:
I would run Dell hardware diagnostics on the drives / RAID array. If it is RAID, Dell should have a RAID storage manager that will show you the drive health. That is how we manage Lenovo servers.
Philip Elder, Technical Architect - HA/Compute/Storage

Commented:
Let's see what can be found:
 * Discrete Lines error - indicates communication issue between controllers
 * Controller battery backup capacity error - a new battery is needed, probably for both controllers
 * Physical disk replacement processes as well as a disk failure
 * Enclosure 0 failed power supply fan module 3/slot 1
 * Enclosure 1 failed power supply fan module 2/slot 1
 * More battery failures on Enclosure 0
 * Disk path degradation - probably one of the controllers going offline

All of the entries in the log are dated 2017? Is the date set up correctly?

Author

Commented:
Yes, the dating is correct. Anything older than 60 days has already been addressed.

-GTS
Philip Elder, Technical Architect - HA/Compute/Storage
Commented:
A search using "attention: true" brings up a battery error and a discrete lines error. So it looks as though trouble is on the horizon.

If the battery stalls, the controller associated with it goes into write-through mode instead of write-back. That has a large negative impact on write performance.
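
For anyone who wants to repeat that search outside a text editor, here is a minimal Python sketch. It assumes the exported log is plain text made of `Field: value` lines, with each record starting at a `Date/Time:` line. The function names and the sample text are illustrative only, and the second sample entry is made up to demonstrate the filter.

```python
# Illustrative sample in the assumed "Field: value" layout.
# The second entry is hypothetical and exists only to demonstrate the filter.
SAMPLE_LOG = """\
Date/Time: 1/2/19 4:09:44 PM
Sequence number: 41095
Event needs attention: false
Description: Alternate RAID controller module removed or replaced

Date/Time: 1/2/19 4:20:00 PM
Sequence number: 99999
Event needs attention: true
Description: (hypothetical entry for illustration)
"""


def parse_events(text):
    """Split log text into a list of {field: value} dicts, one per record."""
    events, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # A new "Date/Time:" line closes the previous record.
        if line.startswith("Date/Time:") and current:
            events.append(current)
            current = {}
        # Split on the first ": " only, so times like 4:09:44 stay intact.
        if ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    if current:
        events.append(current)
    return events


def needs_attention(events):
    """Return only the records flagged 'Event needs attention: true'."""
    return [
        event for event in events
        if event.get("Event needs attention", "").lower() == "true"
    ]


for event in needs_attention(parse_events(SAMPLE_LOG)):
    print(event["Date/Time"], event["Description"])
```

To run it against the real export, replace `SAMPLE_LOG` with the contents of the attached file, for example `open("majorEventLog.txt").read()`.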
Top Expert 2014

Commented:
SAS topology changes yesterday? You don't see that very often in conjunction with "resolved its self".

Surely some human intervention took place?

Author

Commented:
I'm sure it did, andyalder.

Author

Commented:
Philip Elder, this R720 has two MD3620s attached to it as its storage arrays. Do you think this is the battery for the MD or the PE?

-GTS
Philip ElderTechnical Architect - HA/Compute/Storage
Commented:
The log is from the storage array so that's where it will be.
Top Expert 2014

Commented:
No human intervention, yet "Alternate RAID controller module removed or replaced" appears at about 4 PM on 2nd January.

The battery looks OK to me, but the effect on performance of a hung or removed controller is very similar to that of a dead battery: the surviving controller goes into write-through mode because it cannot mirror the cache to the other controller. There is a command to alter this behaviour for people who only bought single-controller variants; obviously you do not want to do that, as yours is dual controller.
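
As a rough way to picture that rule, here is a toy Python model of the decision as described in this thread. It is not Dell firmware logic. The function name, the parameters, and the `mirroring_required` switch (standing in for the single-controller override mentioned above) are all assumptions made for illustration.

```python
from enum import Enum


class CacheMode(Enum):
    WRITE_BACK = "write-back"
    WRITE_THROUGH = "write-through"


def effective_cache_mode(battery_ok, partner_online, mirroring_required=True):
    """Toy model: write-back is only safe when cached writes are protected.

    battery_ok         -- the controller's cache battery is healthy
    partner_online     -- the other controller is up and can mirror the cache
    mirroring_required -- hypothetical switch for the single-controller override
    """
    if not battery_ok:
        # No battery protection: cached writes could be lost on power failure.
        return CacheMode.WRITE_THROUGH
    if mirroring_required and not partner_online:
        # Cache cannot be mirrored to the partner, so fall back to the slow path.
        return CacheMode.WRITE_THROUGH
    return CacheMode.WRITE_BACK


# Healthy battery, but the partner controller is hung or removed:
print(effective_cache_mode(battery_ok=True, partner_online=False).value)
```

In this model a good battery with a missing partner gives the same result as a dead battery, which is why the two faults look alike from a performance point of view.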
Top Expert 2014

Commented:
Oh, by the way, if you can confirm that someone tried reseating the controller at about 12:30 PM and that a new one was fitted at 4:10 PM, I can explain the battery error. I need to know whether the 4:10 event was a second reseat or a swap for a completely different controller.

Author

Commented:
We have to bring down the entire array to do any type of maintenance.  The errors would have come from when the whole system was up and running.
Top Expert 2014

Commented:
No you don't; every component on the MD3620 is redundant. If it is cabled up correctly you can remove either controller without stopping access, and even if it is wired incorrectly, one of the controllers can still be removed. It's clear from the log that between 4:09 and 4:13 one of the controllers was reset (probably removed and replaced). It takes about 4 minutes to completely boot the controller.

Unfortunately the controllers do not put their serial numbers in the boot-up messages or I could tell if it was swapped or merely reseated.
Top Expert 2014

Commented:
Do you still insist there was no human intervention?

Date/Time: 1/2/19 4:09:44 PM
Sequence number: 41095
Event type: 400B
Event category: Internal
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Alternate RAID controller module removed or replaced
Event specific codes: 0/0/0
Component type: RAID Controller Module
Component location: Enclosure 0, Slot 0
Logged by: RAID Controller Module in slot 0

Author

Commented:
According to the on-site I.T. members, no one has touched the storage arrays.
Top Expert 2014

Commented:
Well, they say one thing and the log says another. Maybe you have two sites? You can always log onto MDSM, flash some lights, and have someone on site verify that you are talking about the same unit. Of course, I can't verify the timestamp; it hasn't logged a "set by NTP" recently.

Date/Time: 1/2/19 4:12:22 PM
Sequence number: 41131
Event type: 730D
Event category: Internal
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Battery replaced
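
To put a number on the gap between the two records quoted in this thread ("Alternate RAID controller module removed or replaced" and "Battery replaced"), here is a small Python sketch. It assumes the log timestamps are month/day/two-digit-year with a 12-hour clock, which matches the "2nd January" reading of 1/2/19 above. The helper names are illustrative.

```python
from datetime import datetime

# Assumed layout of the log's Date/Time field, e.g. "1/2/19 4:09:44 PM".
LOG_TIME_FORMAT = "%m/%d/%y %I:%M:%S %p"


def parse_log_time(stamp):
    """Convert a log Date/Time string into a datetime object."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def seconds_between(first, second):
    """Elapsed seconds from the first log timestamp to the second."""
    return (parse_log_time(second) - parse_log_time(first)).total_seconds()


gap = seconds_between("1/2/19 4:09:44 PM", "1/2/19 4:12:22 PM")
print(f"{gap:.0f} seconds between the two events")  # 158 seconds
```

That is a little over two and a half minutes between the two entries, which fits inside the 4:09 to 4:13 window discussed earlier.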

Author

Commented:
We only have one set of these storage arrays. We will be replacing the battery first to see what changes that makes.
Top Expert 2014

Commented:
There are no battery errors until *after* the time you said it slowed down, and they appear at the same time as the "controller replaced" and state change messages. If you are definite there was no human intervention, then you would be wrong to replace the battery, because a bad battery does not cause "RAID controller module reset itself" to be logged.

Power down at your earliest convenience and re-seat both controllers, because someone (or something, like vibration) caused two reboots on that day: one that seems to have made it go slow (at about 12:30, lunchtime) and one that corrected it at about 4:15 PM.

If you don't believe me then log it on Dell's MDxx forum, that's free even out of warranty and manned by Dell engineers.

I'm a bit baffled about Philip Elder saying "All of the entries in the log are dated 2017? Is the date set up correctly?" when the errors I see are in 2019. Did you upload the file and then replace it with a newer one?

Author

Commented:
I have only uploaded that one file since starting this thread.
Top Expert 2014

Commented:
Sorry, I didn't mean to accuse you of anything. I was just pointing out that I see 2019 log entries but Philip saw only 2017 ones.

Both batteries that were in it at midnight on 1/1/19 were good. I think one of your controllers locked up due to a disk error it did not know how to interpret. If the controller was not physically replaced, it threw the same messages as a re-seat and also recharged its local battery nicely on the second attempt. I did not write the firmware, though.

There may even be a "reboot partner" algorithm that reset one controller twice on 1/2/19, but something kicked it pretty hard to make it do that (not your boot, but a start-of-day routine).
