VMware FC failover to second target

Hello all,

I have a question about VMware ESXi 4.x and Fibre Channel failover from one target/storage processor to another.

We have an Openfiler active/passive cluster; the setup works by one server taking over from the other on shutdown, failure, etc.
We are using shared direct-attached storage arrays that connect to both servers.  When a server takes over, it mounts all the appropriate filesystems and takes over a virtual IP.  Consequently, all our iSCSI sessions fail over to it.

With iSCSI this is very simple, since we point to the virtual IP.
About a year back we began moving to Fibre Channel because of speed and cost.  (2 Gb FC is much cheaper than 10 GbE, and bonding 1 GbE links in VMware for iSCSI has been lackluster.)

While both target servers are set up and working, we are unsure how to have VMware fail over to the other target server, since the two have different WWNs and only one of them is presenting LUNs at startup.

I looked into virtual WWNs (analogous to the virtual IP), but no luck there.

Thoughts?

Thanks!!
-Cheers, Peter.
Asked by ein_mann_betrieb
 
Paul Solovyovsky (Senior IT Advisor) commented:
You need to configure Round Robin on the datastores: set the path selection policy to Round Robin, set the IOPS limit to the vendor's preference (the default is 1,000), and configure the queue depth on the HBAs accordingly (I believe QLogic is 65 and Emulex is 16).
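From memory, the ESXi 4.x commands are along these lines (the naa ID is just a placeholder; pull the real one from "esxcli nmp device list" and check the exact values against your vendor's recommendations):

# Make Round Robin the default PSP for the active/active SATP
esxcli nmp satp setdefaultpsp --satp VMW_SATP_DEFAULT_AA --psp VMW_PSP_RR

# Or switch an individual device over and lower its IOPS limit per path
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
esxcli nmp roundrobin setconfig --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1

# QLogic HBA queue depth (takes effect after a reboot)
esxcfg-module -s ql2xmaxqdepth=64 qla2xxx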

here's an example
 
Paul Solovyovsky (Senior IT Advisor) commented:
oops..forgot the link

http://www.ivobeerens.nl/?p=465
 
ein_mann_betrieb (Author) commented:
Hi paulsolov,
   How does ESXi know about the other WWN?  Does it rescan on failure and pick it up based on the signature of the VMFS?
If so, do you know what kind of failover time would be typical for it to find the other node and come online?

Thanks!  -Cheers, Peter.
 
Paul Solovyovsky (Senior IT Advisor) commented:
Check the datastore and see whether the second target shows up as a path. By default ESXi uses the most recently used path, but it will fail over.
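You can check from the console (or the vSphere CLI) with something like:

esxcfg-rescan vmhba1       # rescan an HBA for new targets/LUNs (repeat per vmhba)
esxcfg-mpath -l            # detailed listing of every path ESXi sees
esxcli nmp device list     # PSP in use and the working paths for each device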
 
ein_mann_betrieb (Author) commented:
Hi.

Sorry... I haven't abandoned this, but I don't have a test rig set up yet.
I'm waiting for a spare server to use for the failover test.  Will report back as soon as I can.

Thanks.  -Cheers, Peter.
 
ein_mann_betrieb (Author) commented:
Hi paulsolov,
  Sorry for the delay.   So I set up a second target server with direct-attached shared storage.  When I fail over the storage system to the other server, VMware shows both paths as "Dead", but I don't see it recognize the path via the new storage server until I reboot the ESXi host.

I must be missing something...

Thanks!  -Cheers, Peter.
 
Paul Solovyovsky (Senior IT Advisor) commented:
What do you mean by failing over to the other system?  Are you talking about VMware HA?  Are both systems in a VMware cluster or an MSCS cluster?
 
ein_mann_betrieb (Author) commented:
Hi paulsolov,
  Not HA.  In this case I have one ESXi server and two storage processors in an active/standby configuration.

When I start up ESXi it sees only the active storage processor.  That makes sense, since the other one is in standby mode and is not presenting LUNs.

If I force a failure on the active storage processor, the standby storage processor takes over, but I can't seem to get ESXi to find the datastore again until I reboot ESXi.

Thoughts?

Thanks!  -Cheers, Peter.
 
Paul Solovyovsky (Senior IT Advisor) commented:
What make and model is the SAN?
 
ein_mann_betrieb (Author) commented:
Hi paulsolov,
   It's based on Openfiler and the Linux scst-fc target software.
Our HBAs and switches are all QLogic, specifically QLA2342 HBAs and SANbox 2 switches.
 
Paul Solovyovsky (Senior IT Advisor) commented:
Most likely a storage SP issue.  I have worked with NetApp, HP, etc., and as long as the path is up on the initial controller, the second controller picks up where the first one left off, usually with a battery-backed cache that preserves in-flight data.  This is why I avoid Openfiler in a production environment: the software is good, but you don't always get the tight integration of software and hardware, or the support that comes along with it.

Normally the second SP should take over the function of the failed controller; in this case it's either not failing the dead path fast enough or not picking up the new path.

Since Openfiler is not on the VMware HCL, these types of issues tend to occur because the combination hasn't been fully tested and certified.
 
ein_mann_betrieb (Author) commented:
Hi paulsolov,
  So I think I have my answer...  I spent several hours this past weekend trying to rule out all the components.  It seems the HBA in one of my SPs had developed some kind of fault... not sure exactly what, but when a PCI SERR error was thrown, I knew something had definitely gone bad.

  I replaced the HBA and now Round Robin works like a champ... copied several gigs of data and the MD5 sums all matched... yay.

  It seems the trick is to present the same LUNs on both SPs at the same time.  So it's not something you keep offline and bring online in case of failure; it needs to be more of an active/active config in order to work.

  Still not sure why rescanning the HBAs in ESXi doesn't show any new LUNs until a reboot.  But maybe this is a limitation of the free license?
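  (For the record, the rescan I'm doing is the standard one, roughly like this, where the naa ID is a placeholder for whatever "esxcli nmp device list" reports for the datastore:

esxcfg-rescan vmhba1                                  # rescan the HBA for new LUNs (repeat per vmhba)
vmkfstools -V                                         # refresh/re-read VMFS volumes
esxcli nmp path list --device naa.xxxxxxxxxxxxxxxx    # should show a path through each SP once both present the LUN
  )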

Thanks ever so much!
  -Cheers, Peter.
 
Paul Solovyovsky (Senior IT Advisor) commented:
The free and paid licenses have the same functionality when it comes to storage.  Make sure you're running the latest ESXi 4.1 Update 1.
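You can confirm the version and build from the console or an SSH session with:

vmware -v     # prints something like "VMware ESXi 4.1.0 build-NNNNNN"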
 
murdochka commented:
Hi ein_mann_betrieb

Is it at all possible for you to elaborate on how you got your setup to work?

I am a bit confused.  Is DRBD handling the replication, or is VMware doing that for you?

If you could explain in a bit more detail, that would be a great help.

Regards
 
ein_mann_betrieb (Author) commented:
Hi murdochka,
   We are not using DRBD for this setup, though that is certainly one possible way of implementing it.  We are using special clustering RAID hardware whereby two servers are both attached to the same external disk shelves, so the RAID cards on both servers work in lock-step with one another.  Either controller can take over completely if the other fails.

  When the heartbeat daemon detects a failure, it runs scripts on the failover server.  The failover server then mounts all the LVM volumes we have on the disks, takes over the virtual iSCSI IP addresses, and brings up the daemons for NFS, the iSCSI target, the FC target, etc.  A rough sketch of that takeover logic is below.
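To give you an idea, the takeover script that heartbeat calls is conceptually along these lines (the volume group, mount point, IP, and service names are placeholders, not our real config, and the real script has more error handling):

#!/bin/sh
# Sketch of a heartbeat takeover resource script -- all names are placeholders.
case "$1" in
  start)
    vgchange -a y vg_shared                     # activate the shared LVM volume group
    mount /dev/vg_shared/lv_data /mnt/data      # mount the filesystems
    ip addr add 192.168.10.50/24 dev eth0       # take over the virtual IP
    /etc/init.d/nfs start                       # NFS exports
    /etc/init.d/iscsi-target start              # iSCSI target
    /etc/init.d/scst start                      # FC target
    ;;
  stop)
    /etc/init.d/scst stop
    /etc/init.d/iscsi-target stop
    /etc/init.d/nfs stop
    ip addr del 192.168.10.50/24 dev eth0
    umount /mnt/data
    vgchange -a n vg_shared
    ;;
esac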

Hope that helps.  -Cheers, Peter.