wlacroix

asked on

Getting the most out of VMware

We are looking to increase the performance of our VMware environment as a whole. Please read, and then re-read, my post before you comment.

Here is a quick run down of our systems.

6 IBM x3650s, M1s and M2s (yes, they are older, say 5 years)
4 iSCSI Cat 5 cables to an HP 5406zl switch (2 for vMotion, 2 for iSCSI data)
IBM DS3500 SAN, 48 disks (SAS2), RAID 10, split into 6 LUNs: 5 for production, 1 for ISOs

NON PRODUCTION:
IBM DS3300 SAN, 12 disks, RAID 5, split into 4 LUNs. (This is where our backups are pushed.)

BACKUPS:
Veeam backs up 3 primary production servers automatically every 4 hours. SQL, email, and other servers are done manually with Veeam every 3 months.
Backups from within the SQL and email servers vary: email is daily, SQL is backed up much more often.

SIZE\TIME:
SQL backups can run anywhere from a few seconds to 17 minutes; it all depends on how many transactions are done. Sizes range from 300 MB to 15 or so GB.

50 running servers, various flavours from 2003 TS to 2008 R2 domain controllers.

Split networks for data and iSCSI; they do not interact.

Our biggest issues are:
When backups run (mostly SQL, and email), we get a huge slowdown and various issues right across the board. I monitor these links and see the pipe running at 300-800 Mbit, depending on what is happening during backups. Normal speeds are in the 8-16 Mbit range.

These are very big backups for us, but vitally important:
Transaction logs every 15 minutes, differentials a few times a day, and fulls at midnight. I don't know the exact details on them, but this is what the company has laid out as acceptable risk.

My job is to get the data to the backup and to make sure things are running smoothly.

At this point, in my opinion, there is simply not enough pipe to push the data quickly enough from A to B; there is more than enough disk on the 3500 to send and more than enough disk on the 3300 to receive.
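
To put some rough numbers on that, here is a quick back-of-the-envelope sketch in Python; the ~80% usable-line-rate figure is my assumption, not a measurement, and the sizes are the 300 MB to 15 GB range mentioned above:

    # Back-of-envelope: how long a backup of a given size occupies a
    # 1 Gbit/s path versus 10 GbE. Assumes ~80% of line rate is usable
    # for sustained sequential transfer (an assumption, not a measurement).

    def transfer_seconds(size_gb, link_gbit, efficiency=0.8):
        size_bits = size_gb * 8e9                 # GB -> bits (decimal units)
        usable_bps = link_gbit * 1e9 * efficiency
        return size_bits / usable_bps

    for size_gb in (0.3, 5.0, 15.0):              # backup sizes from this post
        t1 = transfer_seconds(size_gb, 1)
        t10 = transfer_seconds(size_gb, 10)
        print(f"{size_gb:4.1f} GB: ~{t1/60:4.1f} min on 1 GbE, ~{t10:4.1f} s on 10 GbE")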

Let's converse:
Andrew Hancock (VMware vExpert PRO / EE Fellow / British Beekeeper)

Can you bandwidth shape your data across the link?
SOLUTION
jhyiesla

wlacroix

ASKER

hanccoka, we have thought about that; it's possible inside VMware.
Our only issue seems to be peak bandwidth, so to speak: the backup fills the pipe and slows down the whole system.
jhyiesla, I can easily do 100,000 IOPS on my system if they are small; I have tested this with Iometer a few times just to make sure.
I can even do it during the day when people are in full production, ramping everything up to say 500 Mbit across 2-3 servers with no issues.
If I push beyond 750 Mbit I start to run into issues, and this seems to be exactly what my backups are doing: filling the pipe and then causing a slowdown. On occasion I get a note in Tasks and Events:

Path redundancy to storage device
naa.60080e50001c0b64000011ac4dbaa182 (Datastores: "3500-2-5")
restored. Path vmhba34:C3:T0:L5 is active again.
info
8/15/2013 8:40:58 AM
vmhost1.domain.local
I have checked with my backup guy and our SQL backup runs at 8:30 AM; once it ramps up it may or may not trigger this event. The event itself does not cause me any pain, just a slowdown in the system, at which point people start calling the help desk.

Latency runs around 3-6 ms average, depending on the LUN:
it will peak up to 37.5 ms during backup times, but only for say 3-5 seconds.
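
For what it's worth, the 100,000 IOPS number and the link saturation aren't contradictory: the MB/s behind an IOPS figure depends entirely on block size. A quick sketch of the arithmetic; the 512-byte and 64 KB block sizes are illustrative assumptions, not what Iometer or SQL actually used:

    # Throughput implied by an IOPS number depends entirely on block size.
    # Block sizes below are illustrative assumptions, not measurements.

    LINK_MB_S = 1000 / 8.0                    # ~125 MB/s raw on a 1 Gbit path

    def mb_per_sec(iops, block_kb):
        return iops * block_kb / 1024.0

    small_random = mb_per_sec(100_000, 0.5)   # e.g. 512-byte Iometer workers
    backup_reads = mb_per_sec(1_500, 64)      # e.g. large sequential backup I/O

    print(f"100,000 IOPS @ 512 B: {small_random:6.1f} MB/s "
          f"({100 * small_random / LINK_MB_S:.0f}% of a 1 Gbit link)")
    print(f"  1,500 IOPS @ 64 KB: {backup_reads:6.1f} MB/s "
          f"({100 * backup_reads / LINK_MB_S:.0f}% of a 1 Gbit link)")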

We are running multipath, with 2 unique paths to the storage system, but we can't use round robin (the DS3500 does not support it), so one path is very busy while the other path is not so busy but still has data flowing down it.
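
Just to sketch why that matters for a single busy LUN; the per-link rate below is an assumed ~125 MB/s for 1 Gbit iSCSI, not a measured figure:

    # Ceiling for one busy LUN under a fixed/preferred path policy versus
    # round robin. Per-link rate is an assumed ~125 MB/s (1 Gbit iSCSI).

    LINK_MB_S = 125.0

    def lun_ceiling_mb_s(policy, paths=2):
        # Fixed/preferred: each LUN's I/O rides one link even if the other
        # path sits idle. Round robin spreads I/O across all active paths.
        return LINK_MB_S if policy == "fixed" else LINK_MB_S * paths

    print("fixed path (DS3500 today):", lun_ceiling_mb_s("fixed"), "MB/s per LUN")
    print("round robin (unsupported):", lun_ceiling_mb_s("round-robin"), "MB/s per LUN")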

As mentioned, things run fairly smoothly during the day, even with Veeam backups going; it's the damn SQL pushes of large data to my backup that cause issues.

Fiber might not be a bad option, but I already have an iSCSI infrastructure in place, so I was thinking of trying out 10 GbE. I am unsure of the requirements of 10 GbE, but I would test it on my current cabling first and see whether I had to make any changes.
But if I have to change, going fiber might be a good idea too.
Is there a way in VMware to trunk two 1 Gb pipes together so it will run at 2 Gbit instead of only 1?
My switch supports trunking.
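
One thing I want to sanity-check before counting on that (this is just my understanding, sketched below with made-up addresses and CRC32 standing in for whatever hash the switch really uses): a trunk hashes each flow onto a single member link, so one iSCSI session would still top out at 1 Gbit.

    import zlib

    # Why a 2 x 1 Gbit trunk doesn't give one iSCSI session 2 Gbit: the
    # switch hashes each flow (src/dst addresses, ports) onto ONE member
    # link. CRC32 here is only a stand-in for the switch's real hash, and
    # the addresses are made-up examples.

    def member_link(src_ip, dst_ip, dst_port, links=2):
        key = f"{src_ip}>{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % links

    # One backup session (one flow) always lands on the same physical link:
    for _ in range(3):
        print("backup flow -> link", member_link("10.0.0.11", "10.0.0.50", 3260))

    # A second initiator/target pair may (or may not) land on the other link:
    print("second host -> link", member_link("10.0.0.12", "10.0.0.50", 3260))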
Obviously going to 10 GbE would require infrastructure changes. You'd probably need new cables, depending on what you have, as well as cards and switches.

If you're seriously thinking about going that direction, take a look at some all-SSD storage. Not sure what your budget is, but by the time you bought the fiber chassis and disks, you would be close to a total SSD solution.
Yes, we have seen this, which is why we usually bandwidth-shape WAN links or assign priority to traffic.
jhyiesla, I just purchased an HP 5406zl; it is module-based so I could change the blades out, but will they want to spend another 15k-30k on parts and cabling to hit the Cat 6a standard required?
I thought about fiber, but you're in the same boat as iSCSI with a 1 Gb maximum. There are 10 Gig transceivers for fiber too. I have 4x 1 Gig ones from my primary room to my secondary room; they were cheap at 600-800 bucks, but I can't see the 10G ones being that cheap, hehe. And fiber in my server room would cost just as much as the Cat 6a standard, I'm sure.

hanccoka, this is all in my server room and not a traditional WAN link, but I think you're talking about NIC teaming inside VMware, correct? I don't think VMware has the ability for me to prioritize traffic like a packet-shaper device; we're strictly talking iSCSI traffic on an isolated network.

NIC teaming or trunking might be a good idea in my current infrastructure; it won't cost me much to do. I heard that NIC teaming inside VMware does not work worth a damn; have either of you guys ever deployed it or tried it?
When we looked at the SSD solution, we asked what connectivity they supported and they said either fiber or 10 GbE. We already have a 4 Gb fiber infrastructure and asked if that would be adequate; the guy said that it would and that the bottleneck would be the servers. So if/when we go that direction, we'll probably stay with fiber simply because it's already there and works.
jhyiesla, do you remember what you spent on your fiber infrastructure, and is it that much better than a copper solution?
From my experience, fiber has such a low error rate that that's where the additional speed comes from. I moved from copper to fiber between my primary room and my backup room and it's so fast. I should have gone with fewer strands (12), but hey, I prepare for the future, because if they were not there they would ask me for them.
I do not. We started buying it about 5-6 years ago; we went right to fiber because we were concerned that iSCSI over 1 Gb would be too slow. We bought an IBM DS4700 chassis with a few 146 GB disks, which is what they were selling at the time. Over the years we added an expansion chassis and more disks, and the last time we bought disks for it we were getting 460 GB disks for around $500.

There is a definite difference between iSCSI and fiber. We just upgraded ESXi on all hosts to 5.0 U1. We've spent the last week retooling how our datastores are sized and laid out to get better use of the storage. While doing this we've had to put some of our less beefy VMs on iSCSI storage, and if I have to access a server that lives on that storage from the console, I notice a difference in speed. I've not taken the time to quantify it, but it is slower.

And even if SSD isn't on your radar, the storage systems today are much faster, and there are tiered systems that use flash cache as well as SSD and spinning disks to get increased throughput.
ASKER CERTIFIED SOLUTION
We moved up from the IBM DS3300 to the DS3500 about a year and a half ago. I can easily change out the controllers for fiber ones; I just don't think my company will shell out for fiber at this time. I have a server refresh coming in a year or two on top of it.

Still running VMware 4.1 with the latest patches. The powers that be opted out of maintenance due to cost; now they realize that was a bad idea.

I have plans to test out VMware 5.1 on another box, but if I can't figure out the performance issues on this one, that plan may have to wait.

Can you run 2 controllers on a DS3500 with different interfaces, say fiber on controller B and iSCSI on controller A? In this scenario I would lose out on redundancy on my beefy system, but the cost would be easier to swallow if it were done over a few years, similar to your own situation.
The question is: would fiber solve my issues and allow for future growth?
I have been drawing it out on paper, old school, like a whiteboard.

Could I go fiber from my hosts to my 5406zl and leave the iSCSI back end from the switch to the storage?
That would give a 10G fiber connection from host to switch... in theory I would max out at 1 Gig on the leg from switch to storage.

The cost is huge from what I can see.

There is no simple solution for me to increase peak bandwidth to my storage without some huge expense.
I was just looking again at my idea for trunking, and it's moot if I can't increase bandwidth from the switch to the storage.

Can I retool my whole back end as fiber from my storage to the switch, then use iSCSI trunks from host to switch?
It's a technology change, but in theory it's still TCP/IP packets, is it not?
This would give me say 10G fiber from storage to switch, then 2G copper from host to switch.
In theory. I should Visio it up.
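
The rough rule in every one of these layouts (just sketching the arithmetic, with assumed line rates rather than anything measured) is that the end-to-end ceiling is whatever the slowest leg allows:

    # End-to-end storage throughput is capped by the slowest leg in the
    # path. Leg speeds below match the scenarios sketched above and are
    # assumed line rates, not measurements.

    def path_ceiling_gbit(*legs_gbit):
        return min(legs_gbit)

    print("10G host->switch, 1G switch->storage :", path_ceiling_gbit(10, 1), "Gbit")
    print("2G trunk host->switch, 10G to storage:", path_ceiling_gbit(2, 10), "Gbit")
    print("10G end to end                       :", path_ceiling_gbit(10, 10), "Gbit")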
If you want to increase throughput, use iSCSI multipath; NIC teaming is not supported for iSCSI in any environment, VMware or Microsoft.

I thought this was a WAN link; you'll not be able to bandwidth-shape via VMware.

Use VLANs to isolate storage traffic, or a dedicated storage network.
hanccoka, I would sacrifice multipath for additional bandwidth via teaming.

I guess the fact is that 1 Gb is just not enough at peak times, but during normal running it is; we only peak during backups, like the majority of the planet.
But this peak causes a slowdown in production, and that's bad. If I reduce backups it works fine, of course, but the mandate is what it is for data retention right now. We will have to have a meeting, I think.
VLANs won't save me at this point; my multipathing is working, but we're still peaking out a full line, and if there is anything else at all running on said line, it causes the slowdown.
Granted, there is another line available, but it's just not being used, as the path is already established inside VMware. My storage system does not support round robin, which would in theory give me 2 Gb throughput to my storage system across multiple paths.

Anyone know if you can throttle a SQL job? That would help for sure; it would take longer and I would have sustained data throughput for say 3 times as long, but it might not cause me issues in production.
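
Roughly what I mean, sketched out below; the 15 GB size comes from earlier in this thread, and the caps are assumptions I picked for illustration:

    # Same bytes, longer window: capping the backup rate stretches the job
    # but leaves headroom on the ~1 Gbit path for production I/O. The 15 GB
    # size comes from earlier in this thread; the caps are assumptions.

    LINK_MB_S = 125.0                      # ~1 Gbit/s, roughly

    def backup_window_min(size_gb, cap_mb_s):
        return size_gb * 1024 / cap_mb_s / 60

    for cap in (125.0, 60.0, 40.0):
        mins = backup_window_min(15, cap)
        headroom = LINK_MB_S - cap
        print(f"cap {cap:5.1f} MB/s -> ~{mins:4.1f} min backup, "
              f"{headroom:5.1f} MB/s left for production")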
I don't think that VMware supports link aggregation for iSCSI data.
SOLUTION
Costa73, you are 100% right; it's not iSCSI where the problem lies, it is in fact over-subscription.

All of the backups in question come from SQL itself, right out of a job. My other backups come out of Veeam and they don't cause any issues.

We are backing up from a Windows 2008 R2 box running SQL 2008 to itself. I just had a thought: I bet the guy set up the read/write to itself, so it's not actually going anywhere other than the 3500. Let me go take a peek.

We just replaced our switch with an HP 5406zl for iSCSI.
We do get saturation, but I believe it's due to the above, where it's reading and writing to the same location. Let me go take a look, then work with them to put it in an alternate location, i.e. a drive directly on the DS3300.
Yup, just as I suspected: they have it reading and writing to the same LUN on the same datastore. Going to move it and see how things perform.
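
If that is what's happening, the arithmetic is roughly this; the 200 MB/s sequential figure for the source LUN is an assumption, not a benchmark:

    # A backup that reads from and writes to the SAME LUN pushes both
    # streams through the same disks and the same path, roughly halving
    # the usable backup rate versus landing the write elsewhere (e.g. the
    # DS3300). The 200 MB/s figure is an assumption, not a benchmark.

    LUN_SEQ_MB_S = 200.0

    same_lun = LUN_SEQ_MB_S / 2     # read and write compete on one LUN
    split    = LUN_SEQ_MB_S         # read here, write lands on other storage

    print(f"read + write on one LUN : ~{same_lun:.0f} MB/s backup rate")
    print(f"read here, write there  : ~{split:.0f} MB/s backup rate")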

I have to say I enjoy talking with like minds; it's a nice change from the everyday. ;)

Thank you for all your responses.