Getting the most out of VMware

We are looking to increase the performance of our VMware environment as a whole. Please read and then re-read my post a few times before you comment.

Here is a quick run down of our systems.

6 IBM x3650s, M1s and M2s (yes, they are older; say 5 years)
4 iSCSI Cat 5 cables to an HP 5406zl switch (2 for vMotion, 2 for iSCSI data)
IBM DS3500 SAN, 48 disks (SAS2), RAID 10, split into 6 LUNs: 5 for production, 1 for ISOs

IBM DS3300 SAN, 12 disks, RAID 5, split into 4 LUNs. (This is where our backups are pushed.)

Veeam backs up 3 primary production servers automatically every 4 hours. SQL, email, and other servers are backed up manually with Veeam every 3 months.
Backups from within the SQL and email servers vary: email is daily, and SQL is backed up much more often.

SQL backups range from very quick (only seconds) to 17 minutes; it all depends on how many transactions were done. Sizes run from 300 MB to 15 GB or so.
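[Editor's note] A back-of-envelope sketch of what those sizes and durations imply for sustained throughput (decimal units assumed; the figures are from the post above):

```python
# Rough math on the numbers above: even the largest dump (~15 GB in 17 min)
# averages only ~118 Mbit/s when spread evenly; higher observed peaks would
# come from the job writing in bursts rather than a steady stream.
def avg_mbit_per_s(size_gb, minutes):
    # size_gb * 8000 converts gigabytes to megabits (decimal units)
    return size_gb * 8000 / (minutes * 60)

print(round(avg_mbit_per_s(15, 17)))  # -> 118
```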

50 running servers, various flavours, from 2003 Terminal Servers to 2008 R2 domain controllers.

Split networks for data and iSCSI; they do not interact.

Our biggest issues are:
When backups run (mostly SQL and email), we get a huge slowdown and various issues right across the board. I monitor these lines and see the pipe running at 300-800 Mbit, depending on what is happening during the backups. Normal run speeds are in the 8-16 Mbit range.

These backups are very big for us but vitally important:
Transaction logs every 15 minutes, differentials a few times a day, and fulls at midnight. I don't know the exact details, but this is what the company has laid out as acceptable risk.

My job is to get the data to the backup and to make sure things are running smoothly.

At this point, in my opinion, there is simply not enough pipe to push the data quickly enough from A to B; there is more than enough disk on the 3500 to send and more than enough disk on the 3300 to receive.
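[Editor's note] A rough comparison of link capacity versus what a 48-spindle RAID 10 array can stream supports this; the per-disk rate below is an assumption for illustration, not a measured figure:

```python
# A single 1 Gbit iSCSI link tops out around 125 MB/s before protocol
# overhead, while a 48-disk SAS2 RAID 10 array can stream far more.
link_mb_per_s = 1 * 1000 / 8                 # 1 Gbit/s line rate in MB/s
DISK_MB_PER_S = 100                          # assumed per-spindle streaming rate
array_mb_per_s = (48 // 2) * DISK_MB_PER_S   # RAID 10: data striped across half
print(link_mb_per_s, array_mb_per_s)         # 125.0 vs 2400: the link is the limit
```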

Let's converse:

Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Can you bandwidth-shape your data across the link?
I realize that every environment is unique and you are doing a lot of backups. But we do as well, although probably not as frequently as you do. It is our anecdotal experience that iSCSI running over a gigabit LAN is not adequate for a really busy VM environment. Most of our production stuff, especially the big servers like SQL, runs on a Fibre Channel system that is old by computer standards, but it seems to keep up for the most part.

We are seriously looking at a storage solution from PureStorage that will give us a 100% SSD storage array environment for most of our production stuff.

I was a little concerned about how it would handle our workload. I have a friend who runs this storage device. He's running 600 VMware View desktops (notorious for disk I/O needs), 8 XenApp VMs, 2 Exchange 2010 DAG servers, 2 large SQL VMs, and a few other things. He says they average 3,000-4,000 IOPS with sub-1 ms latency.
wlacroix (Author) commented:
hanccoka, we have thought about that; it's possible inside VMware.
Our only issue seems to be peak bandwidth, so to speak: the backup fills up the pipe and slows down the whole system.
wlacroix (Author) commented:
jhyiesla, I can easily do 100,000 IOPS on my system if they are small; I have tested this with Iometer a few times just to make sure.
I can even do it during the day when people are in full production: ramp everything up to, say, 500 Mbit across 2-3 servers with no issues.
If I push beyond 750 Mbit I start to run into issues, and this seems to be exactly what my backups are doing: filling the pipe, then causing a slowdown. On occasion I get a note in Tasks and Events:

Path redundancy to storage device
naa.60080e50001c0b64000011ac4dbaa182 (Datastores: "3500-2-5")
restored. Path vmhba34:C3:T0:L5 is active again.
8/15/2013 8:40:58 AM
I have checked with my backup guy, and our SQL backup runs at 8:30 AM; once it ramps up, it may or may not trigger this event. The event itself does not cause me any pain, just a slowdown in the system where people start calling the help desk.

Latency runs around 3-6 ms average, depending on the LUN;
it will peak up to 37.5 ms during backup times, but only for, say, 3-5 seconds.

We are running multipath, with 2 unique paths to the storage system, but we can't use round robin; the DS3500 does not support it, so one path is very busy while the other is not so busy but still has data flowing down it.
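[Editor's note] The effect of a fixed-path policy versus round robin can be shown with a toy model (an illustration only, not the DS3500's actual path-selection logic):

```python
def distribute_io(io_count, policy, n_paths=2):
    """Count how many I/Os land on each path under a given policy."""
    paths = [0] * n_paths
    for i in range(io_count):
        if policy == "fixed":
            paths[0] += 1            # everything rides the one active path
        elif policy == "round_robin":
            paths[i % n_paths] += 1  # alternate across all available paths
    return paths

print(distribute_io(1000, "fixed"))        # [1000, 0]
print(distribute_io(1000, "round_robin"))  # [500, 500]
```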

As mentioned, things run fairly smoothly during the day, even with Veeam backups going; it's the damn SQL pushes of large data to my backup that cause issues.

Fiber might not be a bad option, but I already have an iSCSI infrastructure in place, so I was thinking of trying out 10 GbE. I am unsure of the requirements of 10 GbE, but I would test it on my current cabling first, then see if I had to make any changes.
But if I have to change, going fiber might be a good idea too.
wlacroix (Author) commented:
Is there a way in VMware to trunk two 1 Gb pipes together so it will run at 2 Gbit instead of only 1?
My switch supports trunking.
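[Editor's note] Even where the switch and host both support trunking, hash-based teaming pins each flow to one member link, so a single host-to-SAN iSCSI session still tops out at 1 Gbit. A simplified model of "route based on IP hash" (not ESXi's exact algorithm):

```python
import socket
import struct

def chosen_uplink(src_ip, dst_ip, n_uplinks):
    """Pick an uplink from the source/destination IP pair (simplified)."""
    src = struct.unpack("!I", socket.inet_aton(src_ip))[0]
    dst = struct.unpack("!I", socket.inet_aton(dst_ip))[0]
    return (src ^ dst) % n_uplinks

# The same host-to-SAN flow hashes to the same uplink every time, so it
# never uses more than one link's worth of bandwidth:
a = chosen_uplink("10.0.0.10", "10.0.0.50", 2)
b = chosen_uplink("10.0.0.10", "10.0.0.50", 2)
print(a == b)  # True
```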
Obviously, going to 10 GbE would require infrastructure changes. You'd probably need new cables, depending on what you have, as well as cards and switches.

If you're seriously thinking about going that direction, take a look at an all-SSD storage option. Not sure what your budget is, but by the time you bought the fiber chassis and disks, you would be close to a total SSD solution.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
Yes, we have seen this, hence why we usually bandwidth-shape WAN links or assign priority to traffic.
wlacroix (Author) commented:
jhyiesla, I just purchased an HP 5406zl; it's module-based, so I could swap the blades out, but do they want to spend another $15k-30k on parts and cabling to hit the Cat 6a standard required?
I thought about fiber, but you're in the same boat as iSCSI, with a 1 Gb maximum. There are 10-gig transceivers for fiber too. I have 4x 1-gig ones from my primary room to my secondary room; they were cheap at $600-800, but I can't see the 10 Gb ones being that cheap. And fiber in my server room would cost just as much as Cat 6a, I'm sure.

hanccoka, this is all in my server room, not a traditional WAN link, but I think you're talking about NIC teaming inside VMware, correct? I don't think VMware has the ability to prioritize traffic like a packet-shaper device; we're strictly talking iSCSI traffic on an isolated network.

NIC teaming or trunking might be a good idea in my current infrastructure; it wouldn't cost me much to do. I've heard that NIC teaming inside VMware doesn't work worth a damn. Have either of you ever deployed or tried it?
When we looked at the SSD solution, we asked what connectivity they supported, and they said either fiber or 10 GbE. We already have a 4 Gb fiber infrastructure and asked if that would be adequate; the vendor said it would, and that the bottleneck would be the servers. So if/when we go that direction, we'll probably stay with fiber, simply because it's already there and works.
wlacroix (Author) commented:
jhyiesla, do you remember what you spent on your fiber infrastructure, and is it that much better than a copper solution?
From my experience, fiber's very low error rate is where the additional speed comes from. I moved from copper to fiber between my primary room and my backup room, and it's so fast. I should have gone with fewer strands (12), but hey, I prepare for the future, because if they weren't there, they would ask me for them.
I do not. We started buying it about 5-6 years ago; we went right to fiber because we were concerned that iSCSI over 1 Gb would be too slow. We bought an IBM DS4700 chassis with a few 146 GB disks, which is what they were selling at the time. Over the years we added an expansion chassis and more disks; the last time we bought disks for it, we were getting 460 GB disks for around $500.

There is a definite difference between iSCSI and fiber. We just upgraded ESXi on all hosts to 5.0 U1 and have spent the last week retooling how our datastores are sized and laid out to make better use of the storage. While doing this, we've had to put some of our less beefy VMs on iSCSI storage, and when I access a server that lives on that storage from the console, I notice a difference in speed. I've not taken the time to quantify it, but it is slower.

And even if SSD isn't on your radar, today's storage systems are much faster, and there are tiered systems that use flash cache as well as SSD and spinning disks to get increased throughput.
Last year we looked at a hybrid solution from Nimble Storage, which seemed quite nice and speedy. Nimbus is another major player, though I've not seen their products, and I think EMC has a solution in that space as well.

wlacroix (Author) commented:
We moved up from the IBM DS3300 to the 3500 about a year and a half ago. I can easily change out the controllers for fiber ones; I just don't think my company will shell out for fiber at this time. I also have a server refresh coming in a year or two on top of it.

Still running VMware 4.1 with the latest patches. The powers that be opted out of maintenance due to cost; now they realize that was a bad idea.

I have plans to test out VMware 5.1 on another box, but if I can't figure out the performance issues on this one, that plan may have to wait.

Can you run 2 controllers on a DS3500 with different interfaces, say fiber on B and iSCSI on A? In this scenario I would lose redundancy on my beefy system, but the cost would be easier to swallow if it was spread over a few years, similar to your own situation.
wlacroix (Author) commented:
The question is: would fiber solve my issues and allow me future growth?
wlacroix (Author) commented:
I have been drawing it out on paper, old school, like a whiteboard.

Could I go fiber from my hosts to my 5406zl and leave the iSCSI back end from the switch to storage?
Given a 10 Gb fiber connection from host to switch, I would in theory max out at 1 Gb on the leg from switch to storage.

The cost is huge from what I can see.

There is no simple solution for me to increase peak bandwidth to my storage without some huge expenses.
I was just looking again at my idea for trunking, and it's moot if I can't increase bandwidth from the switch to the storage.

Can I retool my whole back end as fiber from storage to switch, then use iSCSI trunks from host to switch?
It's a technology change, but in theory it's still TCP/IP packets, is it not?
This would give me, say, 10 Gb fiber from storage to switch, then 2 Gb copper from host to switch.
In theory; I should Visio it up.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, commented:
If you want to increase throughput on iSCSI, use multipath; teaming is not supported in any environment, VMware or Microsoft, for iSCSI.

I thought this was a WAN; you'll not be able to bandwidth-shape via VMware.

Use VLANs to isolate storage traffic, or a dedicated storage network.
wlacroix (Author) commented:
hanccoka, I would sacrifice multipath for additional bandwidth via teaming.

I guess the fact is that 1 Gb is just not enough at peak times, though at normal running times it is; we only peak during backups, like the majority of the planet.
But this peak causes slowdowns in production, and that's bad. If I reduce backups, it works fine, of course, but the data-retention mandate is what it is right now. We will have to have a meeting, I think.
VLANs won't save me at this point; my multipathing is working, and we're still peaking out a full line. If anything else at all is running on that line, it causes the slowdown.
Granted, there is another line available, but it's just not being used, as the path is already established inside VMware; my storage system does not support round robin, which would in theory give me 2 Gb of throughput to my storage system across multiple paths.

Does anyone know if you can throttle a SQL job? That would help for sure: it would take longer, and I would have sustained data throughput for, say, 3 times as long, but it might not cause me issues in production.
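[Editor's note] The trade-off described here can be quantified with a quick sketch: pushing the same backup over a longer window lowers its average bandwidth demand proportionally (figures are illustrative):

```python
def avg_mbit(size_gb, window_min):
    # average link demand if size_gb is pushed evenly over window_min minutes
    return size_gb * 8000 / (window_min * 60)

print(avg_mbit(15, 2))   # 1000.0 Mbit/s -> saturates a 1 Gbit link
print(avg_mbit(15, 6))   # ~333 Mbit/s -> leaves headroom for production
```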
wlacroix (Author) commented:
I don't think VMware supports link aggregation for iSCSI data.
wlacroix, what are you using (software) for the daily email/SQL backups? Just to understand how it actually backs up the data, interacts with the OS and SQL, and uses the Ethernet fabric.

Correct/answer any of the following if I'm wrong:
- all servers are VMs;
- the backup server is a physical machine;
- you're doing "backup agent" based backups via the data VLAN through the backup server (no HW snapshots, no RDMs?);
- the data VLAN is 1GbE;
- your backup server is attached via iSCSI to the DS3300 (SW or HW initiators?)/(with/without MPIO?);
- you are not seeing storage traffic performance degradation (IOPS drop/latency increase) when backups are done, it's the data traffic and backup traffic that are being affected.

I run a 100% iSCSI setup similar to yours (albeit with smaller data volumes) on a DS3300 and a V3700. We are in the process of moving iSCSI and data over to 10 GbE, and we are considering iSCSI vs. FCoE as a mid-term improvement step. I cannot directly compare this setup to yours, but when we suffered similar performance issues, we never found them to be uniquely or specifically iSCSI-protocol related.

I would tend to blame your performance problems on "oversubscription" of the lines/ports/NICs when you perform backups. If at all possible, I'd use separate VLANs for data and backups, simultaneously shaping and prioritizing the bandwidth at the switch level (per VLAN). I'd need much more detail on your physical connections and VLAN layouts to come up with a specific solution (but I don't believe this to be the purpose of your question).

Regarding iSCSI vs. Fibre Channel and copper vs. fiber, all I would add right now is that you may want to look into converged networks (FCoE/CEE). You already have a lot invested in Ethernet; there's no use spending more on FC if you feel your storage is performing well (IOPS).
Being based on TCP/IP, iSCSI is inherently not a lossless protocol, and this has varying levels of impact on different technologies like backups, storage access, etc. You have to look into your switching to see if anything could be improved there before considering a major technology shift like iSCSI over copper -> FC over fiber.
- consider fiber over copper if you are seeing dropped packets and errors on a non-saturated link (but check the cables, length and connectors first, it may be only a bad one or wrong spec, or external electromagnetic interference and/or bad shielding);
- consider 10GbE if you are seeing saturated links with dropped packets or high latencies (and you can't work around it with QoS or bandwidth shaping);
- Are any of the links saturated when backups occur? If yes: which links are you seeing at 100%? Dropped packets? Errors? Latency?

In this respect, FCoE and CEE are probably what you should be looking into as a future option. CEE works toward assured delivery of storage traffic over Ethernet: effectively, it all boils down to making sure that storage packets never get dropped and are delivered as expected; it's the data traffic that gets delayed or dropped. You can achieve similar or identical results by correctly configuring QoS and bandwidth shaping on ports and/or VLANs, uplinks, port configurations, etc. As of today I see FCoE/CEE as an emerging technology, geared more toward consolidating all traffic on a single fabric (Ethernet), hopefully simplifying configuration requirements and saving some money. Look at it as a migration path for moving FC to Ethernet without losing performance, or for allowing FC and TCP/IP to coexist on the same switches. Unfortunately, it is not always so: you get all the complexity of Ethernet and FC mixed together, and only marginal savings. But you do end up using less hardware, less rack space, and less cabling, and leveraging your FC knowledge.

However, like I said, based on my assumptions, your problem is most likely a bottleneck somewhere in your switch links being temporarily oversubscribed with data and backup traffic (not storage traffic), and therefore impacting both... you may be able to work around this successfully with some additional configuration.
wlacroix (Author) commented:
Costa73, you are 100% right; it's not iSCSI where the problem lies, it is in fact oversubscription.

All of the backups in question come from SQL itself, right out of a job. My other backups come out of Veeam, and they don't cause any issues.

We are backing up from a Windows 2008 R2 box running SQL 2008 to itself. I just had a thought: I bet the guy set up the read/write to itself, so it's not actually going anywhere other than the 3500. Let me go take a peek.

We just replaced our switch with an HP 5406zl for iSCSI.
We do get saturation, but I believe it's due to the above, where it's writing and reading to the same location. Let me go take a look, then work with them to put it in an alternate location, i.e., a drive directly on the DS3300.
wlacroix (Author) commented:
Yup, just as I suspected: they have it reading and writing to the same LUN on the same datastore. I'm going to move it and see how things perform.

I have to say I enjoy talking with like minds; it's a nice change from the everyday ;)

Thank you for all your responses.