MrVault asked:

Help explain networking basics to my boss

I have to explain the following situations to my boss (I just started here). I need help putting my explanation into layman's terms.

1. They have a Windows Server 2008 R2 server with 2 onboard NICs and a single switch connected. One NIC has the main IP and gateway and is what internet traffic is NAT-ed to. The 2nd NIC has the same gateway and an IP in that subnet and is for other traffic. I know from experience they should not both have gateways, even if they're in the same subnet. Why is it bad to have it like this?


2. The same server is connected to a layer 2 switch with no VLANs set up. Connected to that switch is an iSCSI SAN that serves up storage to the server over the 2nd IP. I am failing to explain to them why, for security and performance reasons, they should either use VLANs or get a separate switch for the iSCSI traffic. They do need to be able to replicate that SAN storage over the WAN, so it can't be completely isolated. They are basically using a flat network and the server is getting hosed. They have a 48-port switch with many SANs and many servers all on the same VLAN. They also do replication from that SAN to another SAN in a different city over a trunk. How do I explain that we either need a switch that can separate the ports into different VLANs and segregate the traffic, or we need to get a separate iSCSI switch?

3. Lastly, every SAN vendor I've ever worked with has recommended purchasing separate iSCSI HBA cards instead of using the onboard NICs that come with most servers today (ours come with 4 onboard). Again, I can't figure out how to explain to them that the HBAs take load off the server's CPU and motherboard.

Thanks!
wwakefield:

First and foremost, explain what he is getting out of it! Cost / benefit...

1. If you supply X, then this system will work faster, resulting in an increase of 42 transactions per minute.
2. If you supply X, your staff will be able to do whatever faster, resulting in 12 additional sales per hour.
3. If you do not purchase X, then the chances of Y failing have increased by Z and it may fail. If this system fails, it will take 4 days to return to full operations, resulting in $$$ of lost productivity or lost sales or whatever.
4. At the same time, what have you done to take care of it with what you have on hand? Have you maximized your resources?

Give him big-picture stuff that shows bang for the buck. If you are unable to show him what the dollar buys or results in, then there probably is no bang for the buck and it's not necessary for the company. So is it just nice to have for the IT team, or does it benefit operations?
MrVault (Asker):

I guess I should talk to the current IT guy and ask how many customers are on a given server, then talk to sales and find out the average revenue per customer across the servers. Then I can present the risk of hardware failure. But I'm not really asking that yet. That is a conversation for high availability and disaster recovery, something I plan on bringing to them at some point.

The main issue right now is this: their servers are running very slowly. Instead of following best practices for how to configure a database server in an iSCSI SAN environment, they are just throwing in more RAM and more/faster drives with bigger cache, and putting a bigger backplane on their switches, which are in a flat network. What I'm trying to convey to them is that before they keep throwing thousands of dollars (his words, not mine) at increasing hardware capabilities, they need to get to the proper configuration first; only then can they tell whether it's really a horsepower problem rather than a configuration problem.

The trouble is putting it into words that he can understand. He's pretty technical for a CEO (he founded this IT service company). But I have to explain why a flat network is bad from a performance standpoint, and why offloading to HBAs will help performance, as well as why we should remove the dual gateway settings.
ASKER CERTIFIED SOLUTION by Otto_N (answer visible to members only)
Jimmy Larsson, CISSP, CEH:
1) Windows does not work properly when configured with multiple default gateways. Talking about routing in general, a routing table should be a map of the entire world: each host should know of exactly ONE way to reach every other host. If it knows of NO way, it cannot communicate with that peer. If it knows of 2 different ways to reach a destination, which should it use? Sure, it can load balance, or it can use either of them and hope for the best. There are many ways to handle this properly, but they generally involve routing protocols or other techniques that require configuration, and a Windows box is simply not that clever when it comes to routing.
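A quick way to show the ambiguity on the box itself (a rough sketch, assuming a Windows host; route.exe ships with the OS):

```python
import subprocess

# Dump the IPv4 routing table (route.exe is built into Windows).
output = subprocess.run(
    ["route", "print", "-4"], capture_output=True, text=True, check=True
).stdout

# A default route is a row whose destination and netmask are both 0.0.0.0.
# With a gateway configured on both NICs you will see two such rows (plus,
# possibly, duplicates in the "Persistent Routes" section), and Windows has to
# guess by metric which one to use for any given packet -- exactly the
# ambiguity described above.
default_routes = [
    line.strip()
    for line in output.splitlines()
    if line.strip().startswith("0.0.0.0")
]

print(f"Found {len(default_routes)} default-route entries:")
for line in default_routes:
    print("  ", line)

if len(default_routes) > 1:
    print("More than one default gateway: remove the gateway from the second "
          "NIC so the routing table has exactly one way out of the subnet.")
```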

That's just the way it is.

/Kvistofta
MrVault (Asker):

Thanks Otto. I think you're on the right track. I just want to pull out some more technical explanation if possible. Something like this:

1. Because the NICs are all in the same VLAN, X is happening, which results in latency. Y is happening, which results in collisions. Z is happening, which results in a saturated NIC (I would verify with tools). Additionally, security is reduced because T is possible if someone did W.

2. When we use onboard NICs for our iSCSI traffic, it uses the motherboard bus like this: X. This causes Y, which results in reduced performance. An HBA has feature B, which removes the Y issue. This is why it's worth spending $D on that HBA.

To me it seems like the current person is saying "well, if the switch is overwhelmed, let's get a better switch" instead of figuring out why it's overwhelmed and how to fix that first. Then, if the configuration is ideal and it's still hammered, pursue a better switch or more of them.


MrVault (Asker):

Kvistofta, if there are multiple gateways (the same one on each NIC), does this increase retransmits, broken connections, duplicate traffic across each NIC, etc.? One error we're getting in the DNS event logs says there's a duplicate name/ID on the network. Presumably that's because each NIC/IP is broadcasting that the server's name is associated with its NIC/IP/MAC address. What does this result in if there are duplicates (besides just an event log entry)?

Thanks.
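A quick sanity check for the duplicate-registration symptom (a minimal sketch; the hostname is a placeholder to replace with the real server name):

```python
import socket

SERVER_NAME = "yourserver.example.local"  # placeholder -- use the real server name

# Collect the distinct IPv4 addresses that name resolution returns for the server.
addrs = sorted({
    info[4][0]
    for info in socket.getaddrinfo(SERVER_NAME, None, socket.AF_INET)
})

print(f"{SERVER_NAME} resolves to {len(addrs)} address(es): {', '.join(addrs)}")

# If both NIC addresses show up, both adapters are registering the same name,
# and clients can end up connecting over whichever address they are handed --
# including the one meant to be reserved for iSCSI/storage traffic.
```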
aleghart:

Think of your traffic like well water. You can keep throwing RAM and faster drives at it (pump the handle faster), or put in an HBA and have pressurized water. There are plenty of houses with wells...but nobody uses a manual pump anymore. It's inefficient and affects the throughput.

As for VLANs and switch fabric...letting Windows pump the network traffic may be the bigger problem.  Segregating traffic may be a good plan, and smart.  But, if each server had 2x or 4xGbE connections to a very fast switch...you'd end up with disks being the bottleneck, not random broadcast traffic.
MrVault (Asker):

aleghart, nice analogy. I like it. I might use that. For me, though, can you explain what we're doing by using an HBA that improves things? Something much more technical.

Regarding the VLANs: if we optimize the segregation of the traffic and then we see the disks as a bottleneck, that's fine. We can address that problem next. But until we start crossing out issues we're just flying blind. At least then I'd be able to justify buying bigger, faster disks, more of them, etc. The part I'm having trouble explaining is why performance is negatively impacted by having a flat network. I'm talking about performance only, not security, not high availability or ease of management.

The colleague just told me the Dell rep had told him 2 years ago that as long as certain counters were not outside certain thresholds, having a flat network shouldn't be an issue, and supposedly we're not outside those thresholds. Now, I have found 4 different docs about setting up a Dell EqualLogic iSCSI SAN, and all four of them said we should segregate (though no reason why, just that it's best practice for performance). He hasn't given me the counters, the thresholds, or any doc that backs this Dell rep's supposed claim, so until I can show counters that are being negatively affected by our flat network, it's going to be difficult to convince them that we need to follow best practices for real reasons and not "just 'cuz everyone says to".
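One way to gather that evidence (a rough sketch, assuming typeperf.exe, which ships with Windows; check the exact instance name of each NIC first with `typeperf -q "Network Interface"`):

```python
import subprocess

# Perfmon counters that show whether the flat network is actually hurting:
# sustained Bytes Total/sec near line rate, a non-zero Output Queue Length, or
# a steady stream of discarded packets during the busy window is the kind of
# evidence that can go in front of the boss.
counters = [
    r"\Network Interface(*)\Bytes Total/sec",
    r"\Network Interface(*)\Packets Received Discarded",
    r"\Network Interface(*)\Output Queue Length",
]

# Sample every 15 seconds for an hour (240 samples), written to CSV for Excel.
subprocess.run(
    ["typeperf", *counters, "-si", "15", "-sc", "240", "-o", "nic_counters.csv", "-y"],
    check=True,
)
```

Run it during the busy window and compare the iSCSI-facing NIC against the public one; numbers like these are what the Dell rep's threshold claim can actually be tested against.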

thanks for the help so far.
SOLUTION (answer visible to members only)
SOLUTION (answer visible to members only)
MrVault (Asker):

Thanks Aleghart. Our CPUs are running at 50-70% utilization. They're SQL servers that are constantly running queries, inserting data, deleting data, and serving as the transfer station for tons of data over iSCSI.

They have Foundry gigabit 48-port layer 2 switches (not routers). Our CPUs are often quad-core 3.0 GHz with 2 of them per server. The NICs are both onboard and in the same network (no subnetting or VLANs used). We took the gateway off the 2nd NIC and that helped with some issues, but not CPU utilization. We also disabled it from registering with DNS and turned off NetBIOS and all protocols except TCP/IPv4.
If you set your NICs and switch to support jumbo Ethernet frames, it will take fewer packets to transfer your data, and that can help with CPU utilization. You should look to see how much of the CPU is being used by SQL Server, and how much is used by everything else. If most of it is being used by SQL, there isn't a lot to be done in terms of something like an iSCSI HBA or TOE. It's possible that your server NICs already support TOE and that it just needs to be turned on.
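A rough sketch of how to gather that split with Perfmon's % Processor Time counters via typeperf (note the per-process counter is summed across logical cores, so it can exceed 100 and should be divided by the core count to compare it with _Total):

```python
import csv
import os
import subprocess

# Two counters: overall CPU (0-100, averaged over all cores) and the sqlservr
# process (summed across logical cores, so it can legitimately exceed 100).
counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\Process(sqlservr)\% Processor Time",
]

# 20 samples at 15-second intervals (about 5 minutes), written to CSV.
subprocess.run(
    ["typeperf", *counters, "-si", "15", "-sc", "20", "-o", "cpu_split.csv", "-y"],
    check=True,
)

cores = os.cpu_count()
with open("cpu_split.csv", newline="") as f:
    rows = [row for row in csv.reader(f) if row]

# First row is the header; each sample row is [timestamp, total, sqlservr].
for row in rows[1:]:
    try:
        total, sql = float(row[1]), float(row[2])
    except (ValueError, IndexError):
        continue  # typeperf occasionally emits a blank sample
    print(f"total {total:5.1f}%   sqlservr {sql / cores:5.1f}% (per-core normalized)")
```

If the normalized sqlservr figure accounts for most of _Total, the time is going into the queries themselves rather than network or storage plumbing, which changes what is worth buying.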

Something to look at is whether or not your EqualLogic is the bottleneck. You should look at the counters and reports provided by SAN HQ, and probably talk to Dell Tech Support to help you understand any performance issues. If you have a bunch of high use SQL servers hitting a single PS series array with SATA disk, you are eventually going to hit a wall with your random reads. Adding RAM to the SQL servers to increase cache will help, but so will adding more/faster spindles/SSD to the array.
MrVault (Asker):

What's the perfmon counter for CPU utilization by SQL Server?

As for jumbo frames, don't those become an issue when you're replicating the SAN across networks? Every hop in between would have to support them. In our case it goes out the switch, into the firewall, and then to another firewall via a direct link miles away. In my experience, using jumbo frames hurts if the hops between replicating SANs don't support them. Unfortunately I know of no way to say "use jumbo frames for regular disk IO but not for SAN-to-SAN replication".
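One way to test whether a given path actually passes jumbo frames end to end (a rough sketch using the Windows ping's don't-fragment flag; the target address is a placeholder, and 8972 bytes of payload corresponds to a 9000-byte MTU after IP/ICMP headers):

```python
import subprocess

TARGET = "10.0.0.50"   # placeholder: replication partner or remote SAN interface
PAYLOAD = 8972         # 9000-byte MTU minus 20-byte IP and 8-byte ICMP headers

# Windows ping: -f sets Don't Fragment, -l sets payload size, -n the count.
result = subprocess.run(
    ["ping", "-f", "-l", str(PAYLOAD), "-n", "4", TARGET],
    capture_output=True, text=True,
)
print(result.stdout)

# On an English-locale system, "Packet needs to be fragmented but DF set"
# (or plain 100% loss) means some hop on this path will not pass jumbo frames:
# fine on an isolated iSCSI segment, a problem on the WAN replication path.
if "needs to be fragmented" in result.stdout.lower():
    print("At least one hop on this path does not support jumbo frames.")
```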

We have 1, sometimes 2, SQL servers hitting our SAN. They are isolated, and the SAN is used for bulk storage (up to 300K blocks), not for the database. The SANs are not hitting the disks too hard, it seems.

One issue is that the guy has ordered and installed a bunch of 6-disk servers. The OS, page file, tempdb, database, transaction logs, and indexes are all on internal storage. I think he has it set up as a RAID 1 set for the OS/page file and a RAID 10 set for everything else. Not sure how we can improve that without changing the chassis to one that supports more drive bays, or using DAS or SAN to offload some of the activity. I think three RAID 1 sets is out of the question because our database can easily grow to 500 GB and has at times gotten as large as 1.5 TB.

If you aren't using the SAN for the databases, then we can take the SAN and the network out of the picture as the reason why your SQL servers are slow. It sounds like it's all related to the local SQL server. It could be a disk IO issue, or bad SQL code/stored procedures/queries/indexes, etc., because your CPU is pretty high. If the SQL configuration is RAID 1 for the OS and RAID 10 for the SQL files, there are some things that can be done. The log files can be moved to the RAID 1, and maybe tempdb as well. I would really consider moving part or all of the SQL files to the SAN, since it's going to have more spindles to read from, and writes are really, really fast because of the write cache on the EqualLogic.
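Before moving anything, it helps to confirm exactly which drive each data and log file currently sits on. A minimal sketch (assuming Python with pyodbc and a SQL Server ODBC driver installed; the connection string is an assumption to adjust):

```python
import pyodbc  # assumes the pyodbc package and a SQL Server ODBC driver are installed

# Connection string is an assumption -- adjust driver name, server, and auth.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes;"
)

# sys.master_files lists every data and log file for every database with its
# physical path -- a quick way to confirm which RAID set / drive letter each
# file actually sits on before deciding what to move where.
rows = conn.execute(
    "SELECT DB_NAME(database_id) AS db, type_desc, physical_name "
    "FROM sys.master_files ORDER BY db, type_desc"
).fetchall()

for db, kind, path in rows:
    print(f"{db:<20} {kind:<10} {path}")
```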
MrVault (Asker):

Unfortunately the EqualLogic is one array and we're storing bulk files there in RAID 5, so even if we had the room, the database would be on a RAID 5 set, which is not ideal compared to RAID 10.

So you would recommend putting the pagefile, tempdb files, and transaction logs all on the RAID 1 set that the OS is on?

Also, how do I see what portion of the CPU utilization is from the sqlservr process (or the application as a whole)?
MrVault (Asker):

Which of these would you prefer assuming a heavy use SQL server?

RAID 1: OS/Pagefile/tempdb files
RAID 10: database + transaction logs

or

RAID 1: OS/pagefile/trans-logs
RAID 10: database/tempdb files

?
SOLUTION (answer visible to members only)
MrVault (Asker):

thanks for that insight. we'll have to take that into consideration.
MrVault (Asker):

due to space issues, moving the logs to the SAN is not an option at this point.

I'm curious about the jumbo frames question and why the network cards should be taken out of the picture just because the DB isn't on the SAN.
SOLUTION (answer visible to members only)
MrVault (Asker):

Before we move on to CPU, SQL, etc., can we take a step back? Forgive me.

Where do I measure the following:

1. IO to the SAN (should this be done with the Windows NIC counters, something in the EqualLogic, etc.)? (See the sketch after this post.)
2. What do I look for with SNMP or perfmon counters to see if the NICs are a bottleneck?
3. I'm not sure I agree with your jumbo frames comment. I have worked in environments where the performance of the SAN (and its effect on the server) was amazingly better when we removed the jumbo-frames configuration.

There is 64 GB of RAM, with 32 dedicated to SQL.
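For question 1 above, a rough sketch of gathering the Windows-side disk counters for the volumes that live on the SAN (the drive letters are placeholders; pair this with the NIC counters from the earlier sketch to see which side hits its limit first):

```python
import subprocess

# Disk-latency and throughput counters for the volumes that live on the SAN.
# A rule of thumb often quoted for SQL workloads: sustained Avg. Disk sec/Read
# or sec/Write much above ~20 ms is worth investigating.
SAN_DRIVES = ["G:", "H:"]   # placeholder drive letters -- use your SAN volumes

counters = []
for drive in SAN_DRIVES:
    counters += [
        rf"\LogicalDisk({drive})\Avg. Disk sec/Read",
        rf"\LogicalDisk({drive})\Avg. Disk sec/Write",
        rf"\LogicalDisk({drive})\Disk Bytes/sec",
        rf"\LogicalDisk({drive})\Current Disk Queue Length",
    ]

# Sample every 15 seconds for an hour (240 samples), written to CSV.
# If disk latency climbs while the NIC counters stay flat, the bottleneck is
# past the switch (the array); if the NIC saturates first, it's the network.
subprocess.run(
    ["typeperf", *counters, "-si", "15", "-sc", "240", "-o", "san_io.csv", "-y"],
    check=True,
)
```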
SOLUTION (answer visible to members only)
MrVault (Asker):

Because of the nature of our app, data from servers comes onto our servers and is then copied to our SAN for block-level storage. If they need to get data back, they do a lookup in our DB and then pull the data back out the pipe. Their thought (though not based on numbers) was that maybe giving more memory to the system would help because of the disk activity. They said they did see a performance increase when they split it. However, it was more that before, 32 GB was onboard and SQL took all of it. This time they just added 32 more but dedicated 32 to SQL. They have no idea if those numbers are the best case. I'm on the same page in guessing it can handle more being dedicated to SQL, but like them, this is just a gut feeling. I have no idea how to determine the best layout. I'm guessing that if I change the SQL dedication amount it won't take effect until the services are restarted.

For one of our more sluggish servers doing this role, during peak times (8pm to 6am), the % Processor Time _Total is 77. The % Processor Time for sqlservr is 435.225, which makes no sense to me.

For that particular server, the C drive has the OS and apps. The E drive has the page file and some static files. The F drive has the main DB transaction logs and the tempdb (though it should be noted that on the other servers, where tempdb is not shared with the t-logs and db files, the average queue depth value is well below the max recommended value). The G drive is the SAN drive with the client data, and the H drive is where the database is stored. That is on an EqualLogic SAN. I won't get into why it's on a SAN by itself. Let's just say the CFO wasn't happy to learn how the $$ was spent when they did this.

Anyway, C and E are logical volumes on a RAID 1 set. F is a RAID 10 set with 4 drives. And H and G are each on their own SAN with 14 disks: RAID 10 for the DB and RAID 50 for the bulk storage.
So did you identify the bottleneck? What was it?