mtkaus
asked on
VMware VCB from iSCSI SAN very slow
Having an issue backing up full VMs in the following environment:
ESX 3.5, VC 2.5, Backup Exec 11d
Proxy Server- Dell workstation with W2k3
iSCSI San- Dell CX3-10
Full VMs copy very slowly to the proxy when using the 'san' method - a 10 GB VM takes over 2 hours. When I run the backup using the 'nbd' option it copies within half an hour. I'd be grateful for any thoughts on what might be the cause.
Thanks
What type of NIC are you using in your VCB proxy server? Fast Ethernet or Gigabit Ethernet?
ASKER
Using a Gigabit Ethernet
Please upload a rough diagram of your network with respect to your VCB proxy server and iSCSI SAN connectivity.
ASKER
It's too simple to need a diagram! There are two network cards in the proxy - one goes into the network switch (Cisco gigabit) and the other goes into the SAN switch (also gigabit, but only a Netgear). The SAN also connects to this switch. The two networks are on different subnets.
First of all, there is not much to go on here. It might sound simple, but there are a lot of pieces involved.
I would want to prove the NIC on the proxy tied to the SAN is performing as it should. Create and mount a LUN from the SAN on the proxy as a native volume and measure the performance of that link. This will verify half the path. Test the other side as well if there is any significant traffic at all. Do this with a Windows copy as well as Backup Exec to verify the numbers.
Make sure that all the BIOS/NIC and NIC drivers are up to date. Broadcom and Intel both are sensitive to this, but Broadcom more so.
What is the second NIC in the proxy?
Try switching the IPs between the NICs and swapping cables to see if that changes things.
Are you running Jumbo frames on either network?
How is Backup Exec configured? Media Servers?
The key to solving this is to test each component to eliminate it from the problem while taking into account the exact path of the data flow. (Kind of a "be the packet" approach...)
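The "create a LUN and measure it" step above can be scripted. The sketch below (a rough Python illustration, not part of the original advice; the path is hypothetical and should point at the mounted SAN volume) times a sequential write and read to estimate the raw throughput of that half of the path:

```python
import os
import time

def measure_throughput(path, size_mb=1024, block_mb=4):
    """Write then re-read a test file at `path`, returning (write_MBps, read_MBps).
    Point `path` at a file on the SAN-mounted volume to test that link.
    Note: the read pass may be served from the OS cache if size_mb is small."""
    block = b"\0" * (block_mb * 1024 * 1024)
    blocks = size_mb // block_mb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data to the volume, not just the write cache
    write_mbps = size_mb / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_mb * 1024 * 1024):
            pass  # discard the data; we only care about elapsed time
    read_mbps = size_mb / (time.perf_counter() - start)

    os.remove(path)  # clean up the test file
    return write_mbps, read_mbps
```

Run it with a size large enough to swamp caching (a few GB) and compare the result against the rates seen during the VCB copy.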
ASKER
Thanks for the last comment - it all makes perfect sense. Easier said than done in my situation, though, as the whole SAN is formatted with VMFS, so I have no way of creating an NTFS LUN.
I will try the NIC swap.
No jumbo frames running.
Backup Exec is installed on a media server and is configured to back up the VCB server, but the problem occurs before BE comes into play - during testing it was just the image copy from ESX to the VCB server.
Mount the image copy manually, try a file copy to the NUL: device, and watch the NIC stats to see if there are errors and what kind of performance you are getting. That will show whether the issue is the read rate on the LUN.
ASKER
Hi - I'm not fully sure what you mean by this - what is the NUL: device? In the meantime I have swapped the NICs and run a SAN disk mount (as per the full VM copy) and this is still running very slowly. According to Task Manager's networking tab, the NIC is working at <1%. So I think this discounts the NICs and leaves the problem in the SAN?
Excuse me... I cannot quite tell from the post...
So are you saying that the source of the backups is on the VMFS via VCB and it is slow when copying to another device? (What are you copying to? Tape, local disk, another volume on the SAN, etc.?)
Also what update level are you running on the ESX and VCB?
These may help explain a bit about the SAN LUN and some backup options as well...
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1003955&sliceId=1&docTypeID=DT_KB_1_1&dialogID=20110236&stateId=1
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1002293&sliceId=1&docTypeID=DT_KB_1_1&dialogID=20110236&stateId=1
http://vmprofessional.com/index.php?content=esx3backups
BTW, if things are working well I would think that a single VMDK image backup should only take 5-10 minutes for a 10 GB file. I have backed them up in as little as 3 minutes for 10 GB with a really fast tape drive... (faster than LTO-4)
ASKER
Hi, thanks for sticking with this! The part where things are really slow on the backups is just the very first step - when copying the vm from the vmfs datastore to the vcb server.
The ESX version is 3.5 R2 (I think - it's build 110268)
The VCB version is 1.5 build 102898
Thanks
Sandra
Are you copying the vmdk image file to local storage on the VCB server? As in local SAS drives in a Raid array?
ASKER
Yes, I am copying the image from the SAN to local storage on the VCB server. At the moment this server is just a desktop with a SATA HDD and a couple of gigabit network cards. It will soon be moved to a server with SCSI drives (10K, I think) in a RAID 10 configuration.
OK, this helps. An otherwise idle, empty SATA drive can receive >25 MB/sec if the copy is a single large file; many can achieve >50 MB/sec. If the GigE NICs are on PCIe (PCI Express), or both are on a modern system board, they should be able to sustain >50 MB/sec and really closer to 100 MB/sec given the correct configuration, though not likely with iSCSI. What we would like to do here is find a way to separate iSCSI performance in Windows to/from the SAN versus any issues with VCB and its rather funky way of making data available to the VCB host.
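For a sense of scale, the rates reported earlier in the thread can be worked out with a quick back-of-the-envelope calculation (a Python sketch of the arithmetic, not from the original post):

```python
def mb_per_sec(size_gb, minutes):
    """Average throughput in MB/s for copying `size_gb` GB in `minutes` minutes."""
    return size_gb * 1024 / (minutes * 60)

# 'san' mode: 10 GB in just over 2 hours -> roughly 1.4 MB/s
san_rate = mb_per_sec(10, 120)
# 'nbd' mode: 10 GB in about half an hour -> roughly 5.7 MB/s
nbd_rate = mb_per_sec(10, 30)
```

Both figures are far below what even a single SATA drive or a gigabit link should sustain, which is why testing each layer individually matters.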
So what I meant before by the copy to the NUL: device was this. In Windows, there is a special file called NUL: - a virtual device name for something that you can write an infinite amount of data to, never fill, and incur minimal overhead in doing so. This makes it a great way to test one side of a connection and eliminate other overhead to help isolate any bandwidth bottlenecks. The way it would work in this case is:
Mount the VMFS file system in question, then read the file with the copy command and write it to the NUL: device (a.k.a. the bit bucket)... To eliminate as many VMFS SCSI reservation issues as possible, it is best to copy a file (like a VMDK) that is both large and not in use by any other program. A VMDK that is part of a VM currently not running works fine. A VMDK located on a RAID LUN volume that is lightly used, or even otherwise unused, is best, but that is not always possible in small environments. The syntax looks like this from the Windows command prompt:
C:\> COPY /B X:\MOUNTED_PATH\BIGFILE.VMDK NUL:
It is not case sensitive, and you want the file large enough to take at least a couple of minutes to copy. Windows Task Manager can show network statistics well enough to give you an idea of the currently active throughput.
This will show you what Windows can do as far as reading the file directly from the VMFS mount. I have generally seen this to be slower than Windows native volumes, but it should be better than you are getting so long as you are copying the VMDK as a single file instead of opening the VMDK and copying the logical disk contents (as in Windows files inside the VMDK).
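If a scripted version of that read test is more convenient than watching Task Manager, the same idea can be sketched in Python (a hypothetical illustration, not part of the original advice; the path is whatever the VMFS mount exposes):

```python
import time

def read_rate(path, block_size=4 * 1024 * 1024):
    """Sequentially read `path` and discard the data - the script equivalent of
    COPY /B file NUL: - returning the average throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)  # count bytes read; the data itself is discarded
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / elapsed
```

Run it against the mounted VMDK and compare the reported rate with the ~1.4 MB/s the 'san' backup is achieving.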
Does this make sense? Troubleshooting this is like peeling back layers of an onion. First we test a layer, determine what that layer is capable of and then peel it back to test what is underneath. Eventually we should find the truth somewhere along the way...
:-)
-Corey