Solved

MSA1000 FC Switch replaced and when came online, caused some VMs to crash.

Posted on 2010-08-25
Last Modified: 2013-11-14
I would like to know exactly why some VMs had trouble during a Fibre Channel switch replacement.

Infrastructure:

MSA1000 with two HP MSA Switch 2/8 in Active/Active configuration.
Two ESX 3.5 Hosts - DL380 G4
HP Blade c3000 with two blades and two Brocade 4/12 SAN Switch for HP c-Class BladeSystem

Recently, I had an HP MSA1000 Fibre Channel switch replaced. When the switch came online, it seemed to work: HP Systems Insight Manager showed that the switch was up, and I could see it in the Switch Explorer web interface. Also, under the LUN properties of the ESX hosts, the Path Status changed from "dead" to "On".

However, once the replacement switch came online, HPSIM generated a number of "lost connection" alerts. VMs on the ESX host lost connectivity for about 5 seconds, and I am assuming this is DRS or VMotion in action. Most of the VMs were fine, but three VMs had problems and exhibited different behaviors:

VM1 - Generated these errors but recovered automatically.

Event Type:      Error
Event Source:      Disk
Event Category:      None
Event ID:      11
Date:            2010/08/24
Time:            22:09:24
User:            N/A
Computer:      xxxxx
Description:
The driver detected a controller error on \Device\Harddisk0

Event Type:      Error
Event Source:      vmscsi
Event Category:      None
Event ID:      15
Date:            2010/08/24
Time:            22:09:24
User:            N/A
Computer:      xxxxx
Description:
The device, \Device\Scsi\vmscsi1, is not ready for access yet.

VM2 - Had the same errors as above but did not recover. Had to reboot the VM.

Event Type:      Error
Event Source:      Disk
Event Category:      None
Event ID:      11
Date:            08/24/2010
Time:            10:09:18 PM
User:            N/A
Computer:      ttttt
Description:
The driver detected a controller error on \Device\Harddisk0.

Event Type:      Error
Event Source:      symmpi
Event Category:      None
Event ID:      15
Date:            08/24/2010
Time:            10:09:18 PM
User:            N/A
Computer:      ttttt
Description:
The device, \Device\Scsi\symmpi1, is not ready for access yet.

VM3 - Generated no errors but rebooted all of a sudden and got stuck at the BIOS screen. The VM worked normally after restarting it.

I am trying to find the exact reason why these VMs had problems, beyond the fact that it was related to the FC switch being replaced.

I looked up how to troubleshoot the Event ID 11 and 15 errors and tried the following on one of the ESX hosts:

1. Looked for error messages in the /var/log/vmkwarning log. Found this:

Aug 24 22:14:15 MY_ESX_HOST03 vmkernel: 930:21:32:49.137 cpu3:1722)WARNING: Swap: vm 1721: 7515: Swap sync read failed: status=195887167, retrying...
.
.

Aug 24 22:14:05 MY_ESX_HOST03 vmkernel: 930:21:32:39.410 cpu0:1055)WARNING: SCSI: 5306: vml.0200010000600805f3001a3c70a0e14b09724e00054d534120564f: Too many failed retries 33 (32),  Returning I/O failure. 0x16 1/0x0 0x0 0x0 0x0

2. Ran the command "vm-support -x" and found the vm ID 1721 listed above:

VMware ESX Server Support Script 1.29

Available worlds to debug:

vmid=1425       AAAAA
vmid=1490       BBBBB
vmid=1503       CCCCC
vmid=1673       DDDDD
vmid=1695       EEEEE
vmid=1721       The VM that had problems after the switch replacement

3. Confirmed the error with the command "ls /proc/vmware/vm/1721/disk"

vml.0200010000600805f3001a3c70a0e14b09724e00054d534120564f
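The three steps above can be sketched as a small script that pulls the VM id and the LUN identifier out of vmkwarning lines so they can be compared with the `vm-support -x` and `/proc/vmware/vm/<id>/disk` output. This is a minimal sketch, assuming the log lines look exactly like the two quoted in this post; the regular expressions and the function name are illustrative, not part of any VMware tool.

```python
import re

# Sample lines copied from the vmkwarning excerpt in this post.
LOG_LINES = [
    "Aug 24 22:14:15 MY_ESX_HOST03 vmkernel: 930:21:32:49.137 cpu3:1722)"
    "WARNING: Swap: vm 1721: 7515: Swap sync read failed: status=195887167, retrying...",
    "Aug 24 22:14:05 MY_ESX_HOST03 vmkernel: 930:21:32:39.410 cpu0:1055)"
    "WARNING: SCSI: 5306: vml.0200010000600805f3001a3c70a0e14b09724e00054d534120564f: "
    "Too many failed retries 33 (32),  Returning I/O failure. 0x16 1/0x0 0x0 0x0 0x0",
]

# Patterns are assumptions based only on the two lines above.
SWAP_RE = re.compile(r"WARNING: Swap: vm (\d+): \d+: Swap sync read failed")
SCSI_RE = re.compile(r"WARNING: SCSI: \d+: (vml\.[0-9a-f]+): Too many failed retries")


def summarize(lines):
    """Return (vm_ids, lun_ids) seen in swap-read and SCSI retry warnings."""
    vm_ids, lun_ids = set(), set()
    for line in lines:
        swap = SWAP_RE.search(line)
        if swap:
            vm_ids.add(int(swap.group(1)))
        scsi = SCSI_RE.search(line)
        if scsi:
            lun_ids.add(scsi.group(1))
    return vm_ids, lun_ids


vm_ids, lun_ids = summarize(LOG_LINES)
print("VM ids with failed swap reads:", sorted(vm_ids))
print("LUNs returning I/O failure:", sorted(lun_ids))
```

If the LUN printed here matches the entry listed under the VM's disk directory, the swap failure and the SCSI retry failure point at the same storage path.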

I think this shows why VM1 and VM2 had problems, but I cannot find out why VM3 rebooted. On VM3, no crash dump file was generated. The only thing that I could find was in the vmware.log in the VM's folder, which just shows that it could not find an OS to boot:


Aug 24 22:15:33.788: vcpu-0| Msg_Post: Warning
Aug 24 22:15:33.788: vcpu-0| [msg.Backdoor.OsNotFound] No bootable device was detected.  A bootable device might be a CD, floppy, hard disk, or network device, as when booting with PXE.
Aug 24 22:15:33.788: vcpu-0| To install an operating system, insert a bootable CD or floppy and restart the virtual machine by clicking the Reset button.----------------------------------------
Aug 25 07:43:28.645: mks| SOCKET 11 recv error 5: Input/output error
Aug 25 07:43:28.645: mks| SOCKET 11 destroying VNC backend on socket error: 5
Aug 25 07:44:00.747: vcpu-0| Unknown int 10h func 0x2000
Aug 25 07:44:01.132: mks| VNCENCODE 12 encoding mode change: (640x480x16depth,16bpp)
Aug 25 07:44:09.409: mks| VNCENCODE 12 encoding mode change: (720x400x16depth,16bpp)
Aug 25 07:44:10.337: mks| VNCENCODE 12 encoding mode change: (640x480x16depth,16bpp)
Aug 25 07:44:10.666: vcpu-1| CPU reset: soft
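For anyone scanning a longer vmware.log for the same symptoms, here is a minimal sketch that filters for the two markers visible in the excerpt above. The marker strings come only from this excerpt and are not a complete list of VMware log messages; the sample text is abbreviated, and the function name is hypothetical.

```python
import re

# Abbreviated lines from the vmware.log excerpt in this post.
LOG = """\
Aug 24 22:15:33.788: vcpu-0| Msg_Post: Warning
Aug 24 22:15:33.788: vcpu-0| [msg.Backdoor.OsNotFound] No bootable device was detected.
Aug 25 07:43:28.645: mks| SOCKET 11 recv error 5: Input/output error
Aug 25 07:44:10.666: vcpu-1| CPU reset: soft
"""

# Timestamp, thread name, message: a layout assumed from the lines above.
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}): ([\w-]+)\| (.*)$")
MARKERS = ("msg.Backdoor.OsNotFound", "CPU reset")


def interesting(text):
    """Return (timestamp, thread, message) tuples for lines containing a marker."""
    hits = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and any(marker in m.group(3) for marker in MARKERS):
            hits.append((m.group(1), m.group(2), m.group(3)))
    return hits


hits = interesting(LOG)
for stamp, thread, message in hits:
    print(stamp, thread, message)
```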

Is there anything else that I can check?

Please reference my original question about the actual FC switch replacement for more background information:

My original EE question concerning the SAN.
Question by:SSAKUSEISHA
24 Comments
 
LVL 47

Expert Comment

by:dlethe
ID: 33521083
When you replaced the switch, all the WWNs changed.  Configure the switch and set up zoning properly as well.  One just does not replace a switch and go on their merry way as if they are changing a battery or cable.  You have to do some real work to get the new paths mapped everywhere.
 
LVL 118

Accepted Solution

by:
Andrew Hancock (VMware vExpert / EE MVE) earned 250 total points
ID: 33521455
From the event logs that you have provided from the virtual machines, the virtual machines lost contact with the SAN and a timeout occurred.

This is a very common issue with incorrectly configured ESX servers and SANs. I've also seen this error on iSCSI and FC SANs where the disk subsystem (SAN) has disappeared for a microsecond.

However, in your case, because some VMs crashed, I would think they lost the path to the SAN.

You can see from the above NT event logs and VMkernel logs that the ESX server is having difficulty connecting to the storage.

Please post the switch configs that you previously posted, without the linethroughs, so we can check the zoning and WWNN info.

How are the SAN and ESX servers now? Did this happen because, when the switch was removed, you weren't truly running Active/Active fault tolerant? Have you ever tested this before, I wonder, by failing a switch?

@dlethe: We asked @SSAKUSEISHA to post a new question relating to this specific issue, as his original question was due to a missing license on the switch fabric, due to the HP engineer. So don't flame the poor guy.....
 
LVL 118
ID: 33521476
My apologies, just seen the previous link.

 
LVL 118
ID: 33521500
can you upload the images with the lines through...
 

Author Comment

by:SSAKUSEISHA
ID: 33527310
hanccocka,

Thank you again for your help.

I have attached the configs - note that the Fabric OS version for the replaced switch (switch1) is still different. The HP engineer said that as long as both are 3.2.x then it is okay. Just a bit of background info, the other switch2 was replaced earlier this year and at that time, HP sent out two engineers that replaced the switch without any problems. That is why for this replacement, I was not expecting so much trouble.

You can see from the output that there are two E-Ports configured. These attach to an HP c3000 Blade ESX datacenter; it was the VMs on that ESX datacenter that had problems. The F-ports (ports 3 and 4) are attached to the two DL380 G4 servers in the other ESX datacenter. On that datacenter, I manually shut down the VMs, put the two DL380 ESX servers in maintenance mode, and then shut down the servers, so these VMs obviously did not have any problems. I could not shut down the blade servers because of some running critical application servers. The whole idea of this "redundancy" is to not have to shut down anything, but I digress...

I did not see the engineer remove Switch2 - only Switch1. However, on Switch2, I ran the uptime command and saw the following output:

SGM06019L2-switch2:admin> uptime
Up for:      1 day, 10:53
Powered for: 1606 days,  8:51
Last up at:  Tue Aug 24 04:33:44 2010
Reason:      Power-on

Correct me if I am wrong, but this means that the switch was not removed but was rebooted, correct? If "Last up at" is when the switch was rebooted, that time does not coincide with the time of the replacement, which occurred around 22:07 on Aug 24.

For reference, here is the output for the replaced switch, Switch1:

SGM06019L2-switch1:admin> uptime
Up for:      1 day, 10:55
Powered for: 3 days,  3:58
Last up at:  Tue Aug 24 22:09:21 2010
Reason:      Power-on
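A quick arithmetic check on the two uptime outputs, offered as a sketch only: adding "Up for" to "Last up at" gives the wall-clock time each switch believed it was when the command ran. The assumption here, which the post does not confirm, is that both commands were run within a few minutes of each other. Note also that "Up for" has only minute resolution.

```python
from datetime import datetime, timedelta

# Values copied from the two `uptime` outputs above.
SWITCHES = {
    "switch2": {
        "up_for": timedelta(days=1, hours=10, minutes=53),
        "last_up": "Tue Aug 24 04:33:44 2010",
    },
    "switch1": {
        "up_for": timedelta(days=1, hours=10, minutes=55),
        "last_up": "Tue Aug 24 22:09:21 2010",
    },
}
FMT = "%a %b %d %H:%M:%S %Y"  # assumes the default English month/day names


def implied_now(entry):
    """Clock time the switch believed it was when `uptime` was run."""
    return datetime.strptime(entry["last_up"], FMT) + entry["up_for"]


now1 = implied_now(SWITCHES["switch1"])
now2 = implied_now(SWITCHES["switch2"])
clock_gap = now1 - now2

print("switch1 clock at command time:", now1)
print("switch2 clock at command time:", now2)
print("difference between the two clocks:", clock_gap)
```

Under that assumption the two switch clocks disagree by roughly 17.5 hours, so Switch2's "Last up at" value may reflect a wrong clock rather than a reboot at 04:33, and the nearly identical "Up for" values would mean both switches came up within minutes of each other. If the commands were actually run hours apart, this conclusion does not hold.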

To reiterate, most of the VMs were fine - just the three mentioned above had issues. It was not a case where the bad switch was removed and then the other active switch was removed which would obviously kill everything attached to the SAN. I have attached screenshots from HPSIM that show the Active/Active configuration - it was like this before the switch failure as well.

The engineers that designed the infrastructure wrote a build document in which they describe the following test cases:

Blade Switch Failure
Fibre Channel Path Failure
MSA1000 Controller Failure
ESX Server Failure

All cases resulted in no errors or downtime except for the ESX Server Failure case, where all VMs lost connectivity for about 5 minutes. I could not find any logs or technical info for the cases, but these engineers were professionals, so I am very confident that they did not just draw some Visio diagrams and write "Pass" next to each test case.
MSA1000.Switch.1.switchShow.jpg
MSA1000.Switch.2.switchShow.jpg
MSA1000.Switch1.Active.jpg
MSA1000.Switch2.Active.jpg
 
LVL 118
ID: 33528794
Maybe I have missed this, but which version of ESX is this?
 
LVL 118
ID: 33528829
Can you log and attach the output of

show tech_support

from each controller, using the serial console cable?

Sometimes this doesn't run correctly, so you may have to run the commands

show units
show acls
show connections

instead.


 

Author Comment

by:SSAKUSEISHA
ID: 33528985
hanccocka,

Thank you. The version of ESX is 3.5.0 Build 64607 for all four ESX servers. VirtualCenter is version 2.5.0 Build 64192. We have DRS and VMotion Licensing as well.

The MSA1000 is located at our Data Center (which is far away) and unfortunately, I only have a console cable attached to the controller that had the failed FC switch. I will attach that log for now. Ironically, one of the disks failed today (you will see that in the log), so I have to go to the DC to replace it. While I am there, I will get the tech_support output from the other controller.

Thanks again for your time and assistance.
MSA1000.Controller1.showTech.zip
 
LVL 118
ID: 33529026
what FC cards are installed in the ESX servers? (the server with the issue?)

Just a word of caution, (you may already know this), the MSA 1000 has been dropped from the HCL for ESX 4. Although we are running production here on firmwares 4.48 (Active/Passive) and 7.00 (Active/Active) on ESXi (ESX4).
 
LVL 55

Assisted Solution

by:andyalder
andyalder earned 250 total points
ID: 33529427
I'm not at all happy with the switchshows: both switches have E ports going to the same device, and you're meant to have two separate fabrics. What are the devices at the WWNs ending 07 00 e6 and 06 f1 26? I'd guess they are the SAN switches in your blade enclosure.

To have redundant fabrics you have to disconnect the cross-connect cables you've got between these (and add extra ISLs if you want for bandwidth). With a single fabric like you've got, replacing a switch will cause a topology change which may interrupt traffic for a short period, and there isn't a redundant fabric to fail over onto. Think of fabric redundancy as protection against a madman at the console: he can screw up a whole fabric of hundreds of switches with a couple of keystrokes, but if they aren't connected together he can't screw both up at the same time.

Just an aside; did you notice "Disk202: Box 2, Bay 02, (B:T:L 2:01:00)   DRIVE FAILED! (0x0D)"
 
LVL 55

Expert Comment

by:andyalder
ID: 33529523
Another option, instead of disconnecting the cross-connects that join your two fabrics together, is to put the switches in the blade enclosure into Access Gateway mode. They then act similarly to passthroughs (although they support multiple hosts on a single fibre cable); in Access Gateway mode they don't behave like switches and don't take part in Fibre Channel FSPF.
 

Author Comment

by:SSAKUSEISHA
ID: 33539356
@andyalder - Thanks for your help again. I wrote in my post to hanccocka that I know about the drive failure, and I have since replaced it. Yes, the two devices that you mentioned connect to the two Brocade 4/12 SAN switches in the blade enclosure. Are you saying that the current setup is completely wrong, or not ideal? I am obviously not qualified to go in and start changing the infrastructure, especially on this production system with over 20 critical VMs. I am trying to learn, but we don't have any spares to test and practice on, and my company has no plans to send me to training. I tried to find at least a simulator to work with but had no luck. Anyway, if this configuration is not correct, then I will need to inform management and look at hiring a consultant.

@hanccocka - Thank you for the heads up on ESX 4. The organization has no plans to upgrade at this time. I have both controller show tech_support outputs now - please see them in the attached .zip. Also, I took screen shots of the FC card info from all four ESX servers. Note that ESX3/4 are BL460c G1 Blades and ESX 5/6 are DL380 G4 ESX Hosts.

Thanks again for all of your help!


MSA1000.Controller.CConfigs.zip
ESX3.HBA.INFO.jpg
ESX4.HBA.INFO.jpg
ESX5.HBA.INFO.jpg
ESX6.HBA.INFO.jpg
 
LVL 118
ID: 33539437
I was going to mention that SAN configs are critical to continued operation, and I'm not sure how happy your management would be for you to change the configuration on the strength of advice looked up on the web at Experts Exchange, whether correct or not. There is a high degree of risk if you are not confident or trained with FC SANs. I would suggest that your management either send you on the relevant training courses, if they require support in house, or mitigate the risk and hire SAN consultants to health-check your environment. I will look at the configs now.
 
LVL 55

Expert Comment

by:andyalder
ID: 33540888
Well, I can charge you a consultation fee if you want, I'm a Master ASE in StorageWorks and have been for the last 12 odd years.

You probably know that with Ethernet, when you add or remove switches that introduce loops, spanning tree protocol sorts out the loops by blocking some ports, and this can cause temporary loss of connectivity until the network converges. Well, it's near enough the same with Fibre Channel.

So if you have two fabrics then if the topology changes on one of them the other will remain stable, but with only one fabric a topology change will affect everything. In your case your single fabric can be converted into dual fabric by removing the cross-connects between the switches. Beware though that removing those cross-connects involves a topology change so you'll have to suspend all your VMs first or you may get the same temporary loss of connectivity that crashed them last time.
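To illustrate the single-fabric versus dual-fabric point, here is a minimal sketch that treats each inter-switch link as a graph edge and counts the resulting fabrics as connected components. The switch names and the exact cabling are assumptions for illustration, not taken from the switchShow output.

```python
from collections import defaultdict


def fabrics(links):
    """Group switches into fabrics (connected components of the ISL graph)."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical cabling: each MSA switch to "its own" blade switch...
STRAIGHT = [("msa_sw1", "blade_sw1"), ("msa_sw2", "blade_sw2")]
# ...plus the cross-connects that join everything together.
CROSS = [("msa_sw1", "blade_sw2"), ("msa_sw2", "blade_sw1")]

with_cross = fabrics(STRAIGHT + CROSS)
without_cross = fabrics(STRAIGHT)
print("with cross-connects:   ", with_cross)
print("without cross-connects:", without_cross)
```

With the cross-connects in place all four switches land in one group, so a topology change on any switch is seen by every path. Without them there are two independent groups, and a change in one leaves the other untouched.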
 
LVL 118
ID: 33540992
@andyalder: "Well, I can charge you a consultation fee if you want, I'm a Master ASE in StorageWorks and have been for the last 12 odd years." - that would probably work if your business has got PI!
 
LVL 55

Expert Comment

by:andyalder
ID: 33549353
I don't see the need for a private investigator, anyone who knows anything about SANs will confirm that a single fabric isn't optimal for redundancy.

Here's a logical diagram, I had to use mspaint as I don't have Visio at home.


The c3000 backplane connects the mezzanine HBAs to both switches in the c3000.

The MSA1000 controllers have a single fibre connection to the integrated switches.

In red are the cross-connect fiber cables that I recommend removing.

In orange are the cross-connects in the c3000 which should also be disabled, they are disabled by default anyway so probably not in use.

In blue are optional extra ISLs. They are not really needed, as a single ISL can't be over-subscribed with just the MSA1000, but they may be in future if more storage is plugged into the MSA1000 integrated switches.

SSAKUSEISHA, if you want a second opinion, just post another question entitled something like "FC SAN topology for VMware" for 20 points, with the URL for this one pasted in the body.
MSAplusCclass.GIF
 
LVL 118
ID: 33549398
@andyalder: PI = Professional Indemnity Insurance (PI).
 
LVL 118
ID: 33549419
Has this question not gone a bit off topic? The question asked was "MSA1000 FC Switch replaced and when came online, caused some VMs to crash".

We seem to have established that it's because of loss of connectivity, maybe due to poor design, with LUNs going offline.

Now we are into redesigning SSAKUSEISHA's SAN topology. Whether or not SSAKUSEISHA and/or his management accept what we post here, right or wrong, SSAKUSEISHA may not have the confidence to change a working production environment, with his management team asking why he changed it. It's very easy for us, at arm's length, to specify designs and quote experience (12+ years, or 20+ years in my case), but that doesn't help him when it all goes pear-shaped.

We are not at the coal face doing this, so it's very easy for us to say do this and configure that. It doesn't affect us when it fails; it's SSAKUSEISHA that gets it in the neck.

I would recommend that SSAKUSEISHA take the advice given here and run through it with the consultants they use, to check the current SAN topology and ask why failover failed here.

 
LVL 55

Expert Comment

by:andyalder
ID: 33549506
It hasn't gone off topic (except for your irrelevant posts about PI and engaging a consultant), I didn't know you were a Master ASE as well, what's your ASE number?

All that is required is a few more experts confirming that a single fabric is not optimal and that that's probably what caused the hiccup in connectivity.

I'll gladly go through the logs on the switches, they'll probably show the topology changes.
 

Author Comment

by:SSAKUSEISHA
ID: 33555215
Thank you both for your comments.

For now, due to the current configuration, if any of the FC switches need to be replaced in the future I will assume that the VMs will temporarily lose connectivity and plan accordingly.

Hopefully, the next budget will allow training and possibly a hardware upgrade to a solution that can be better managed by the current and future IT staff.

I will leave this question open for one more day if you would like to offer some final comments.

@andyalder - Thank you very much for the offer; however, since I am in Japan, it would be much easier (for all of us) to stick with a local vendor. The information that you have provided has been very helpful.

@hanccocka - Thank you for your insightful and helpful comments. If I could duplicate the environment with spare equipment I would be more confident in making changes but since that is probably never going to happen, I won't touch the SAN configuration with a 10 ft pole! :)
 
LVL 55

Expert Comment

by:andyalder
ID: 33556894
Why don't you log a call with VMware or HP and get them to confirm whether your topology is valid or not? Presumably you've still got support from both of them. You hardly need a vendor, since you're likely to end up with two fewer cables, which you can sell (not for very much) on eBay.
 

Author Comment

by:SSAKUSEISHA
ID: 33574180
@andyalder - Unfortunately, the company does not have a support contract with VMware. We only have a CarePack agreement with HP but we did try and ask;  HP said that they cannot recommend "custom" configurations without sufficient testing.
 

Author Closing Comment

by:SSAKUSEISHA
ID: 33574192
Thank you again for your advice and help.
 
LVL 55

Expert Comment

by:andyalder
ID: 33574423
Exactly, HP can't recommend custom configurations which is what yours is at the moment. It would become a standard configuration if you took those extra cables out.