SAN Troubleshooting

gutenmorgen used Ask the Experts™
I am preparing for an interview and am looking for a systematic approach to troubleshooting a SAN. For example:

1/ Host side (please tell me which things we need to check)
2/ Switch (please tell me which things we need to check)
3/ Storage (SVC) side (please tell me which things we need to check)

I imagine it's a vast topic. Please give me the basic idea so that I can isolate any issue that arises in the SAN.

Most Valuable Expert 2013
Top Expert 2013

Dividing the whole issue into parts (as you did in your question) is already a good, classic approach.

You have frontend, infrastructure and backend, and this segmentation will greatly help with problem isolation.

In order to decide which part might be responsible for a problem you could ask, starting from the frontend:

1) Does the server fail to see just a few LUNs, or is it unable to access the SAN at all?
2) Do all adapters of a server experience the problem, or just a single one?
3) Can other servers access the SAN smoothly?
4) Are the LUNs pertaining to a particular storage server (or SVC) inaccessible, while the LUNs on a different device remain available?
5) Is there more than one fabric? If so, are all paths over all fabrics unavailable, or can the problem be attributed to a specific fabric?
6) Is zoning in use? If so, and if it is WWN zoning: Did a WWN change due to hardware replacement? If it is D/P (domain/port) zoning: Has a machine's adapter been connected to a different port than before?
Is the zoning set up correctly? Any typos in the WWN specifications?
7) Inspecting the logs on server/switch/backend: Do you see any hints indicating hardware failures?
8) Is the firmware (microcode) of the server adapters, the switches and the storage devices current? Any hints on the manufacturer's web pages regarding issues pertaining to a particular firmware level?
9) Regarding the backend machines: Is the host mapping set up correctly? Any typos in the WWNs?
Is the OS type of the frontend specified correctly? Some hosts expect SCSI masked LUNs, some must see SCSI mapped LUNs.
10) Are all fibre cables and plugs intact? Any damage, kinks or sharp bends? Any loose plugs?
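
Questions 6 and 9 both ask about typos in WWNs. As a rough sketch (not a vendor tool), a comparison like the following could catch them: it normalizes WWNs written with or without colons and reports which of the WWNs a host's adapters actually present are missing from a zone or host mapping. The function names and the example WWNs are made up for illustration; in practice you would paste in the values shown by the HBA utility, the switch and the storage GUI/CLI.

```python
import re


def normalize_wwn(raw: str) -> str:
    """Return a WWN as 16 lowercase hex digits, or raise ValueError."""
    # Accept "10:00:00:00:c9:12:34:56", "10-00-...", or plain hex.
    digits = re.sub(r"[:\-\s]", "", raw).lower()
    if not re.fullmatch(r"[0-9a-f]{16}", digits):
        raise ValueError(f"not a valid 64-bit WWN: {raw!r}")
    return digits


def missing_wwns(hba_wwns, configured_wwns):
    """WWNs presented by the host that do not appear in the configuration."""
    configured = {normalize_wwn(w) for w in configured_wwns}
    presented = {normalize_wwn(w) for w in hba_wwns}
    return sorted(presented - configured)


if __name__ == "__main__":
    # Hypothetical values: the second configured WWN has two digits swapped.
    host = ["10:00:00:00:c9:12:34:56", "10:00:00:00:c9:12:34:57"]
    zone = ["10000000C9123456", "10:00:00:00:c9:12:34:75"]
    print(missing_wwns(host, zone))
```

Any WWN that comes back from `missing_wwns` is either not zoned/mapped at all or was entered with a typo, which is exactly the case where a host sees no LUNs on one path.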

Some of the above questions relate specifically to either frontend, infrastructure or backend, while others can relate to two or even all three of these areas.
Should one of the questions of the latter type yield some hint, you'll have to drill down in order to find the responsible area.
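
The drill-down behind questions 1 to 5 can be sketched as a small script. This is only an illustration under assumed inputs: each path is a record with made-up keys `host`, `hba`, `fabric`, `storage` and a flag `ok` that you would fill in by hand from what the multipath driver reports. For each area it lists the components whose paths all fail while at least one other component of the same kind still works.

```python
from collections import defaultdict


def suspect_components(paths):
    """Map each dimension to the values for which every path is down.

    A dimension is reported only if some other value in it still has a
    working path; otherwise it does not help to narrow the problem down.
    """
    result = {}
    for dim in ("host", "hba", "fabric", "storage"):
        status = defaultdict(list)
        for p in paths:
            status[p[dim]].append(p["ok"])
        all_bad = sorted(v for v, oks in status.items() if not any(oks))
        any_good = any(any(oks) for oks in status.values())
        if all_bad and any_good:
            result[dim] = all_bad
    return result


if __name__ == "__main__":
    # Hypothetical setup: two servers, two fabrics, one SVC; fabric B is down.
    paths = [
        {"host": "srv1", "hba": "srv1/hba0", "fabric": "A", "storage": "svc1", "ok": True},
        {"host": "srv1", "hba": "srv1/hba1", "fabric": "B", "storage": "svc1", "ok": False},
        {"host": "srv2", "hba": "srv2/hba0", "fabric": "A", "storage": "svc1", "ok": True},
        {"host": "srv2", "hba": "srv2/hba1", "fabric": "B", "storage": "svc1", "ok": False},
    ]
    print(suspect_components(paths))
```

In this made-up case the output names fabric B and the two adapters attached to it. Since a single fabric explains both failed adapters, the switch side is the place to look first.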

In any case, answering the questions above will always give you a starting point from which to investigate further.

Just in case: The above list by no means claims to be exhaustive!

Good luck!

Top Expert 2010
I'd add:
1. Get an understanding of the infrastructure. If the SAN isn't switched or zoned, you save yourself a lot of time by NOT going down many of the items in the list above, because they aren't applicable.
2. WMP's list is a great start, but it is binary in nature, meaning it asks whether there is a connection at all. Many times in a SAN the issue is performance, so you need to establish expected throughput levels and look at counters (which can be measured at the HBA, the switch or the target LUNs, depending on what you have).

So ask questions about what software they have licensed to help determine how well each connection is doing.  
3. Is path failover / failback working?  Should it be working?
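
To make point 2 a bit more concrete, here is a minimal sketch of a baseline check. It assumes you can read a cumulative byte counter for each connection twice, some seconds apart (from the HBA, the switch port or the storage, whichever you have), and that you have agreed on an expected throughput per connection. The names, the 80% tolerance and the figures are all assumptions for illustration, not values from any product.

```python
def throughput_mb_s(bytes_start: int, bytes_end: int, seconds: float) -> float:
    """Average throughput in MB/s between two readings of a byte counter."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return (bytes_end - bytes_start) / seconds / 1_000_000


def below_baseline(measured, expected, tolerance=0.8):
    """Connections whose measured MB/s is under tolerance * expected MB/s."""
    return sorted(
        name for name, value in measured.items()
        if value < expected[name] * tolerance
    )


if __name__ == "__main__":
    # Hypothetical counters sampled 60 seconds apart on two adapters.
    measured = {
        "hba0": throughput_mb_s(0, 6_000_000_000, 60),
        "hba1": throughput_mb_s(0, 21_000_000_000, 60),
    }
    expected = {"hba0": 400.0, "hba1": 400.0}
    print(below_baseline(measured, expected))
```

A connection that shows up here is reachable but slow, which the connectivity questions in the first list would not catch. A reading taken under light load will of course also fall below the baseline, so compare like with like.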


Many thanks
