Solved

Help explain networking basics to my boss

Posted on 2010-11-10
658 Views
Last Modified: 2012-06-21
I have to explain the following situations to my boss (I just started here). I need help putting my explanation in layman's terms.

1. They have a Windows Server 2008 R2 server with 2 onboard NICs, connected to a single switch. One NIC has the main IP and gateway and is what internet traffic is NAT-ed to. The 2nd NIC has the same gateway and an IP in that subnet and is for other traffic. I know from experience that they should not both have gateways, even if they are in the same subnet. Why is it bad to have it like this?


2. The same server is connected to a layer 2 switch with no VLANs set up. Connected to that switch is an iSCSI SAN that serves up storage to the server over the 2nd IP. I am failing to explain to them why, for security and performance reasons, they should either use VLANs or get a separate switch for the iSCSI traffic. They do need to be able to replicate that SAN storage over the WAN, so it can't be completely isolated. They are basically using a flat network and the server is getting hosed. They have a 48 port switch with many SANs and many servers all on the same VLAN. They also do replication from that SAN to another SAN in a different city over a trunk. How do I explain that we either need a switch that can separate the ports into different VLANs and segregate the traffic, or we need to get a separate iSCSI switch?

3. Lastly, every SAN vendor I've ever worked with has recommended purchasing separate iSCSI HBA cards instead of using the onboard NICs that come with most servers today (ours come with 4 onboard). Again, I can't figure out how to explain to them that the HBAs take load off the server's CPU and motherboard.

Thanks!
Question by:MrVault
24 Comments
 
LVL 6

Expert Comment

by:wwakefield
ID: 34102238
First and foremost, explain what he is getting out of it!   Cost / benefit...

1.  If you supply X, then this system will work faster, resulting in an increase of 42 transactions per minute.
2.  If you supply X, your staff will be able to do whatever faster, resulting in 12 additional sales per hour.
3.  If you do not purchase X, then the chances of Y failing have increased by Z and it may fail. If this system fails, it will take 4 days to return to full operations, resulting in $$$ of lost productivity or lost sales or whatever.
4.  At the same time, what have you done to take care of it with what you have on hand? Have you maximized your resources?

Give him big picture stuff that shows bang for the buck. If you are unable to show him what the dollar buys or results in, then there probably is no bang for the buck and it's not necessary for the company. So is it just nice to have for the IT team, or does it benefit operations?
 

Author Comment

by:MrVault
ID: 34102503
I guess I should talk to the current IT guy and ask how many customers are on a given server, then talk to sales and find out the average revenue per customer across the servers. Then I can present the risk of hardware failure. But I'm not really asking that yet. That is a conversation for high availability and disaster recovery, something I plan on bringing to them at some point.

The main issue right now is this: their servers are running very slow. And instead of following best practices for how to configure a database server in an iSCSI SAN environment, they are just throwing more RAM and more/faster drives with bigger caches at it, and putting a bigger backplane on their switches, which sit in a flat network. What I'm trying to convey to them is that before they keep throwing thousands of dollars (his words, not mine) at increasing hardware capabilities, they need to be in the proper configuration; only then can they conclude whether it's really the horsepower rather than the configuration.

The trouble is putting it into words that he can understand. He's pretty technical for a CEO (he founded this IT service company). But I have to explain why a flat network is bad from a performance standpoint, why offloading to HBAs will help performance, and why the dual gateway settings should be removed.
 
LVL 14

Accepted Solution

by:
Otto_N earned 100 total points
ID: 34102938
To get the servers to run faster, you need to open up the bottlenecks.

Right now, your network connection is the bottleneck.  If everything is on one VLAN (SAN, NICs, and all other users), it all needs to fit into the capacity of that single link (1Gbps, if you're using a Gigabit switch).  If you start using a separate VLAN, you're doubling capacity, provided that the switch hardware can handle the load.  If the switch cannot, you need a better switch, or a separate one.  But only upgrading the switch isn't going to make Gigabit Ethernet faster than 1Gbps.

It's also no use to increase RAM or processing power on the server if it only has a single 1Gbps port to connect to the outside world.  You need more than one interface, for a start, but then you also have to direct outgoing traffic to the other interfaces, which is only really possible with static routing on your server, and that usually requires the NICs to be in separate subnets/VLAN segments.

An HBA is another thing that can increase speed, as the traffic does not need to go via the motherboard's shared bus, which is also a potential bottleneck.
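To put rough numbers on that single shared link, here are a few lines of Python (back-of-envelope only; the ~10% figure for TCP/IP + iSCSI protocol overhead is an assumption, not a measurement of your network):

# Rough, illustrative arithmetic only.
link_bps = 1_000_000_000            # Gigabit Ethernet line rate
overhead = 0.10                     # assumed TCP/IP + iSCSI framing overhead

usable_mb_per_s = link_bps * (1 - overhead) / 8 / 1_000_000
print(f"Usable throughput on one shared 1Gbps link: ~{usable_mb_per_s:.0f} MB/s")
# ~112 MB/s total, and iSCSI reads/writes, SAN replication and ordinary LAN
# traffic all have to fit inside that figure when they share one link.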

I hope this helps you to convince your boss...
 
LVL 17

Expert Comment

by:Kvistofta
ID: 34103080
1) Windows does not work properly when configured with multiple default gateways. When talking about routing in general, a routing table should be a map of the entire world. Each device should know of exactly ONE way to reach every other device. If it knows of NO way, it cannot communicate with that peer. If it knows of 2 different ways to reach a destination, which should it use? Sure, it can load balance. Or it can use either of them and hope for the best. There are many ways to handle this in routers, but they generally involve routing protocols or other techniques that require configuration. And a Windows box is simply not that clever when it comes to routing.

That's just the way it is.
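If you want to show this on the box itself, here is a rough Python sketch (assuming Python is installed on the server; the parsing of "route print" output is approximate) that counts the default routes Windows knows about:

import subprocess

# Dump the IPv4 routing table (Windows: "route print -4").
output = subprocess.check_output(["route", "print", "-4"], text=True)

# Default routes show up as lines beginning with "0.0.0.0  0.0.0.0 ...".
defaults = [line.strip() for line in output.splitlines()
            if line.strip().startswith("0.0.0.0")]

print(f"Default routes found: {len(defaults)}")
for line in defaults:
    print("  ", line)

if len(defaults) > 1:
    print("More than one default gateway: Windows has two 'ways out' "
          "and no reliable way to pick between them.")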

/Kvistofta
 

Author Comment

by:MrVault
ID: 34103105
Thanks Otto. I think you're on the right track. I just want to pull out some more technical explanation if possible. Something like this:

1. Because the NICs are all in the same VLAN, X is happening, which results in latency. Y is happening, which results in collisions. Z is happening, which results in a saturated NIC (I would verify with tools). Additionally, security is reduced because T is possible if someone did W.

2. When we use onboard NICs for our iSCSI traffic, it uses the motherboard bus like this: X. This causes Y, which results in reduced performance. An HBA has feature B, which removes the Y issue. This is why it's worth spending $D on that HBA.

To me it seems like the current person is saying "well, if the switch is overwhelmed, let's get a better switch" instead of figuring out why it's overwhelmed and how to fix that first. Then, if the configuration is ideal and it's still hammered, pursue a better switch or more of them.


 

Author Comment

by:MrVault
ID: 34103156
Kvistofta, if there are multiple gateways (the same one on each NIC), does this increase retransmits, broken connections, duplicate traffic across each NIC, etc.? One error we're getting in the DNS event logs says there's a duplicate name/ID on the network, presumably because each NIC/IP is registering the server's name against its own NIC/IP/MAC address. What does this result in if there are duplicates (besides just an event log entry)?

Thanks.
 
LVL 32

Expert Comment

by:aleghart
ID: 34105317
Think of your traffic like well water.  You can keep throwing RAM and faster drives (pump the handle faster), or put in an HBA and have pressurized water.  There are plenty of houses with wells...but nobody uses a manual pump anymore.  It's inefficient and affects the throughput.

As for VLANs and switch fabric...letting Windows pump the network traffic may be the bigger problem.  Segregating traffic may be a good plan, and smart.  But, if each server had 2x or 4xGbE connections to a very fast switch...you'd end up with disks being the bottleneck, not random broadcast traffic.
 

Author Comment

by:MrVault
ID: 34106541
aleghart, nice analogy. I like it. I might use that. For me, though, can you explain what we're doing by using an HBA that improves things? Something much more technical.

Regarding the VLANs: if we optimize the segregation of the traffic and then we see the disks as a bottleneck, that's fine. We can address that problem next. But until we start crossing out issues we're just flying blind. At least then I'd be able to justify buying bigger, faster disks, more of them, etc. The part I'm having trouble explaining is why the performance is negatively impacted by having a flat network. I'm talking about performance only, not security, high availability, or ease of management.

My colleague just told me the Dell rep had told him 2 years ago that as long as certain counters were not outside certain thresholds, having a flat network shouldn't be an issue, and supposedly we're not outside those thresholds. Now I have found 4 different docs about setting up a Dell EqualLogic iSCSI SAN and all four of them said we should segregate (though no reason why, just that it's best practice for performance). He hasn't given me the counters or the thresholds or any doc that backs this Dell rep's supposed claim, so until I can show counters that are being negatively affected because of our flat network, it's going to be difficult to convince them that we need to follow best practices for real reasons and not "just 'cuz everyone says to".

Thanks for the help so far.
 
LVL 32

Assisted Solution

by:aleghart
aleghart earned 100 total points
ID: 34107023
An HBA frees up CPU cycles by offloading TCP wrapping and unwrapping.  It's called TOE (TCP Offload Engine).  In a single-purpose server with fast CPUs, this wouldn't be a big problem...take up 20% of the CPU for TCP work, but the server isn't doing much else but spinning disks.  Single-purpose servers are over-spec'd and under-utilized.  That's fine...it gives you breathing room as the workload rises.

But, as server utilization increases (multi-purpose, smaller servers/blades, virtualization), tapping that 20% of CPU could have negative effects on other server processes.

Is your CPU running at 5-10% average, or 50%+?  It will make a difference.

If you have money to throw at faster processors (RAM won't make a difference), then you can just as effectively spend $500-1000 on an HBA.  Adding CPU horsepower just to overcome TCP crunching seems counter-intuitive.  Bigger CPU does not mean less work for a standard NIC.  The NIC will only work as fast as the feed from the CPU.

When you can offload the TCP work, the NIC/HBA moves traffic faster, and the CPU is more available to do work.

Imagine sticking a bigger and bigger engine into my little Honda.  After a certain point, I need a new transmission with taller gears to go faster.  Otherwise it's just sucking up more gas to go the same speed.  An HBA is a racing transmission...lighter load on the engine, and taller gears to handle more speed.
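If you want one rough number for how much CPU is going into packet handling rather than real work, interrupt/DPC time is a crude proxy. A Python sketch (assuming the psutil package is installed; interrupt/DPC time only approximates network-stack overhead):

import psutil

# Sample the CPU time breakdown over a 5-second window. On Windows, psutil
# exposes interrupt and DPC time, which is where NIC/packet processing lands.
cpu = psutil.cpu_times_percent(interval=5)

print(f"user:      {cpu.user:.1f}%")
print(f"system:    {cpu.system:.1f}%")
print(f"interrupt: {cpu.interrupt:.1f}%")
print(f"dpc:       {cpu.dpc:.1f}%")

# If interrupt + dpc stays high while iSCSI is busy, offloading (TOE/HBA)
# has something to win; if it's near zero, spend the money elsewhere.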
 
LVL 42

Assisted Solution

by:kevinhsieh
kevinhsieh earned 300 total points
ID: 34108246
What kind of switch are you using? I am a Dell EqualLogic customer and I use Cisco 2960 gigabit switches. I actually don't see much of a problem with your network. It's pretty rare to need a TCP offload NIC or iSCSI HBA. The HBAs are really expensive, and I think that it's better to put the $$ into a newer, faster server. If your CPU isn't too high, that's not an issue at all. Having a separate VLAN for iSCSI is nice, but it shouldn't really do anything for performance unless your switch is oversubscribed, which is really hard to do, particularly if you only have 1 server attached to your storage. Look at Dell's EqualLogic SAN Headquarters and see how the networking stats look. I doubt that utilization is very high. You probably should have the EqualLogic Host Integration Toolkit installed and MPIO configured on the server.

You can take the gateway off the 2nd NIC on your server and configure it not to register in DNS, and unbind all protocols except TCP/IP. That should make the setup a little cleaner, but aside from the message in the event log, I don't think it's really a problem.

If your server is hammered, you need to find out why, but I don't think that anything you've described in the setup is really an issue. Separate VLANs are nice, but then you need a router to connect the VLANs, so complexity just went up, and performance between the VLANs probably just went down because the router won't likely route at gigabit wire speed. You are using a gigabit switch, right?
 

Author Comment

by:MrVault
ID: 34136098
Thanks Aleghart. Our CPUs are running at 50-70% utilization. They're SQL servers that are constantly running queries, inserting data, deleting data, and serving as the transfer station for tons of data over iSCSI.

They have Foundry gigabit 48-port layer 2 switches (not routers). Our CPUs are usually quad-core 3.0 GHz, with 2 of them per server. The NICs are both onboard and in the same network (no subnetting or VLANs used). We took the gateway off the 2nd NICs and that helped with some issues, but not CPU utilization. We also disabled it from registering with DNS and turned off NetBIOS and all protocols except TCP/IPv4.
 
LVL 42

Expert Comment

by:kevinhsieh
ID: 34139050
If you set your NICs and switch to support jumbo Ethernet frames, it will take fewer packets to transfer your data, and that can help with CPU utilization. You should look to see how much of the CPU is being used by SQL Server, and how much is used by everything else. If most of it is being used by SQL, there isn't a lot to be gained from something like an iSCSI HBA or TOE. It's possible that your server NICs already support TOE and that it just needs to be turned on.

Something to look at is whether or not your EqualLogic is the bottleneck. You should look at the counters and reports provided by SAN HQ, and probably talk to Dell Tech Support to help you understand any performance issues. If you have a bunch of high use SQL servers hitting a single PS series array with SATA disk, you are eventually going to hit a wall with your random reads. Adding RAM to the SQL servers to increase cache will help, but so will adding more/faster spindles/SSD to the array.
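A quick way to check whether the OS is even trying to offload (a rough Python sketch, assuming Python is on the box; on 2008 R2 "netsh int tcp show global" lists the Chimney Offload State):

import subprocess

# Show the global TCP parameters; "Chimney Offload State" indicates whether
# Windows will hand TCP connections off to a TOE-capable NIC.
output = subprocess.check_output(["netsh", "int", "tcp", "show", "global"],
                                 text=True)
print(output)

for line in output.splitlines():
    if "Chimney Offload State" in line:
        print("->", line.strip())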
 

Author Comment

by:MrVault
ID: 34139264
What's the perfmon counter for CPU utilization by SQL Server?

As for jumbo frames, don't those become an issue when you're replicating the SAN across networks? They would have to be supported by every hop in between. In our case it goes out the switch, into the firewall, and then to another firewall via a direct link miles away. In my experience the use of jumbo frames hurts if the hops between replicating SANs do not support them. Unfortunately I know of no way to say "use jumbo frames for regular disk IO but not for SAN-to-SAN replication".
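For what it's worth, the way I'd planned to test whether a given path actually passes jumbo frames is a do-not-fragment ping, wrapped here in a Python sketch (the address is a placeholder; 8972 bytes of payload plus 28 bytes of IP/ICMP headers makes a 9000-byte frame):

import subprocess

SAN_IP = "192.168.1.50"   # placeholder -- substitute the replication target

# Windows ping: -f = don't fragment, -l = payload size, -n = count.
result = subprocess.run(["ping", "-f", "-l", "8972", "-n", "2", SAN_IP],
                        capture_output=True, text=True)
print(result.stdout)

if "needs to be fragmented" in result.stdout or result.returncode != 0:
    print("Jumbo frames do NOT survive this path.")
else:
    print("This path appears to pass 9000-byte frames.")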

We have 1, sometimes 2 SQL servers hitting our SAN. They are isolated, and the SAN is used for bulk storage (up to 300K blocks), not for the database. The SANs don't seem to be hitting the disks too hard.

One issue is that the guy has ordered and installed a bunch of 6-disk servers. The OS, page file, tempdb, database, transaction logs, and indexes are all on internal storage. I think he has it set up as a RAID 1 set for OS/page file and a RAID 10 set for everything else. I'm not sure how we can improve that without changing the chassis to one that supports more drive bays, or using DAS or a SAN to offload some of the activity. I think three RAID 1 sets are out of the question because our database can easily grow to 500GB and has at times gotten as large as 1.5 TB.

 
LVL 42

Expert Comment

by:kevinhsieh
ID: 34140965
If you aren't using the SAN for the databases, then we can take the SAN and the network out of the picture as the reason why your SQL servers are slow. It sounds like it's all related to the local SQL server. It could be a disk IO issue, or bad SQL code/stored procedures/queries/indexes, etc., because your CPU is pretty high. If the SQL configuration is RAID 1 for the OS and RAID 10 for the SQL files, there are some things that can be done. The log files can be moved to the RAID 1, and maybe tempdb as well. I would really consider moving part or all of the SQL files to the SAN, since it's going to have more spindles to read from, and writes are really, really fast because of the write cache on the EqualLogic.
 

Author Comment

by:MrVault
ID: 34145211
Unfortunately the EqualLogic is one array and we're storing bulk files there in RAID 5, so even if we had the room, the database would be on a RAID 5 set, which is not ideal compared to RAID 10.

So you would recommend putting the pagefile, tempdb files, and transaction logs all on the RAID 1 set that the OS is on?

Also, how do I see what portion of the CPU utilization is from the sqlservr process (or the application as a whole)?
 

Author Comment

by:MrVault
ID: 34146339
Which of these would you prefer, assuming a heavy-use SQL server?

RAID 1: OS/Pagefile/tempdb files
RAID 10: database + transaction logs

or

RAID 1: OS/pagefile/trans-logs
RAID 10: database/tempdb files

?
 
LVL 42

Assisted Solution

by:kevinhsieh
kevinhsieh earned 300 total points
ID: 34147052
I actually think that the RAID 5 on the EqualLogic would be faster than your local RAID 10. RAID 5 is normally slower for writes and faster for reads given the same number of spindles, but your EqualLogic has over twice as many spindles, which further increases the read performance, and it has a huge write cache, so your writes are committed to the EqualLogic much faster than they are to the local RAID 10.

I am not a DBA, so I am not an expert in optimizing for SQL. That said, I would try to move the log files to the EqualLogic, or if you can't do that, to the OS RAID 1. That will separate the log I/O from the database I/O.

To see CPU utilization, go to Task Manager, Processes tab, and look for the SQLServer process. You may need to show processes from all users. You can sort by process name, CPU utilization, process ID, username, and memory usage.
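If you'd rather capture it from a script than eyeball Task Manager, here is a rough Python sketch (assuming the psutil package; per-process CPU is summed across cores, so it's divided by the core count to make it comparable to the overall figure):

import psutil

cores = psutil.cpu_count()

# Prime the per-process counters, then sample over a 5-second window.
procs = list(psutil.process_iter(["name"]))
for p in procs:
    try:
        p.cpu_percent(None)
    except psutil.NoSuchProcess:
        pass

total = psutil.cpu_percent(interval=5)

sql = 0.0
for p in procs:
    try:
        if p.info["name"] and p.info["name"].lower().startswith("sqlservr"):
            sql += p.cpu_percent(None) / cores
    except psutil.NoSuchProcess:
        pass

print(f"Total CPU:        {total:.1f}%")
print(f"sqlservr portion: {sql:.1f}%")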


 

Author Comment

by:MrVault
ID: 34175656
Thanks for that insight. We'll have to take that into consideration.
 

Author Comment

by:MrVault
ID: 34219327
Due to space issues, moving the logs to the SAN is not an option at this point.

I'm curious about the jumbo frames question, and why the network cards should be taken out of the picture just because the DB isn't on the SAN.
 
LVL 42

Assisted Solution

by:kevinhsieh
kevinhsieh earned 300 total points
ID: 34219440
What I understand you to be saying is that the SAN isn't being used to store the databases; it's being used to store some sort of bulk objects, which I assume are related to your database activity. If the bulk objects aren't what's slowing you down and there is very little IO to the SAN, then the network connection to the SAN shouldn't be an issue either, and neither should the processor load related to the SAN traffic. If the SAN isn't an issue, then neither are jumbo frames, which are only really useful when talking to the SAN. I think that puts us back to looking at SQL, how much RAM it has, and how fast the local disks are.

What is the CPU utilization of SQL compared to everything else?
How much RAM do your servers have, and how much is being used by the SQLServer processes?
 

Author Comment

by:MrVault
ID: 34219482
Before we move on to CPU, SQL, etc., can we take a step back? Forgive me.

Where do I measure the following:

1. IO to the SAN (should this be done with the Windows NIC counters, something in the EqualLogic, etc.)?
2. What do I look for with SNMP or perfmon counters to see if the NICs are a bottleneck?
3. I'm not sure I agree with your jumbo frames comment. I have worked in environments where the performance of the SAN (and its effect on the server) was dramatically better once we removed the jumbo-frames configuration.

There is 64 GB RAM with 32 GB dedicated to SQL.
 
LVL 42

Assisted Solution

by:kevinhsieh
kevinhsieh earned 300 total points
ID: 34219548
You can see IO to the SAN using perfmon or just Task Manager. Look for traffic on the interfaces that you are using for iSCSI traffic. We also need to establish how much CPU SQL Server is using compared to the overall system. Are any of your cores running above 50%? 75%?


You should be using SAN HQ from EqualLogic to see the stats on the SAN. You can see networking stats, IOPS, average IO size, disk queues, etc. on an overall and per volume or volume group basis. SAN HQ is an SNMP based app that can be loaded onto any Windows based machine. I have it running on a Windows 2003 VM that runs lots of other stuff. You can load a remote console to your workstation. If you aren't using it, you should be.

I am saying that jumbo frames shouldn't be an issue because I don't think that SAN traffic is an issue, because you're telling me that SAN IO is low. SAN HQ will be able to tell us if that's true.


Your servers have 64 GB RAM and 32 GB is being used by SQL. What is the rest being used for? I would think that you could let it have at least 48 GB, and possibly as high as 60 GB.
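For the NIC side specifically, here is a rough Python sketch (assuming the psutil package) that shows per-interface throughput, so you can see whether the iSCSI ports are anywhere near the 1Gbps ceiling:

import time
import psutil

INTERVAL = 5  # seconds

before = psutil.net_io_counters(pernic=True)
time.sleep(INTERVAL)
after = psutil.net_io_counters(pernic=True)

for nic in sorted(after):
    if nic not in before:
        continue
    rx = (after[nic].bytes_recv - before[nic].bytes_recv) / INTERVAL
    tx = (after[nic].bytes_sent - before[nic].bytes_sent) / INTERVAL
    # A gigabit port tops out around 110-120 MB/s in one direction.
    print(f"{nic:25s}  in: {rx / 1e6:8.1f} MB/s   out: {tx / 1e6:8.1f} MB/s")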
 

Author Comment

by:MrVault
ID: 34219688
Because of the nature of our app, data from servers comes onto our servers and is then copied to our SAN for block-level storage. If they need to get data back, they do a lookup in our DB and then pull the data back out the pipe. Their thought (though not based on numbers) was that giving more memory to the system might help because of the disk activity. They said they did see a performance increase when they split it. However, it was more that before, 32 GB was onboard and SQL took all of it; this time they just added 32 more but dedicated 32 to SQL. They have no idea if those numbers are the best case. I'm on the same page in guessing it can handle more being dedicated to SQL, but like them, this is just a gut feeling. I have no idea how to determine the best layout. I'm guessing that if I change the SQL memory dedication it won't take effect until the services are restarted.

For one of our more sluggish servers doing this role, during peak times (8pm to 6am), the % Processor Time _Total is 77. The % Processor Time for sqlservr is 435.225, which at first made no sense to me.
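I gather perfmon's Process counter is summed across all logical processors, so it can go above 100; dividing by the logical core count should make it comparable to _Total. A quick Python check (the 435.225 is our measured figure; the core count is whatever the box reports):

import os

raw_sqlservr = 435.225          # \Process(sqlservr)\% Processor Time (summed over cores)
logical_cores = os.cpu_count()  # e.g. 8 for 2x quad-core, 16 with hyper-threading

print(f"sqlservr is using ~{raw_sqlservr / logical_cores:.0f}% of total CPU "
      f"capacity across {logical_cores} logical cores")
# With 8 logical cores that's ~54%, with 16 it's ~27% -- in line with the
# 77% _Total figure rather than contradicting it.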

For that particular server, the C drive has the OS and apps. The E drive has the page file and some static files. The F drive has the main DB transaction logs and the tempdb (though it should be noted that on the other servers, where tempdb is not shared with the t-logs and db files, the average queue depth is well below the maximum recommended value). G drive is the SAN drive with the client data, and H drive is where the database is stored; that one is an EqualLogic SAN all by itself. I won't get into why. Let's just say the CFO wasn't happy to learn how the $$ was spent when they did this.

Anyway, C and E are logical volumes on a RAID1 set. F is a RAID10 set with 4 drives. And H and G are both on their own SANs with 14 disks, RAID10 for the DB and RAID50 for the bulk storage.
 
LVL 42

Expert Comment

by:kevinhsieh
ID: 34363083
So did you identify the bottleneck? What was it?
