Network Design needed - Resilient and Fast!

Posted on 2007-08-02
Last Modified: 2008-03-16

I'm upgrading my 10-year-old network architecture, and could use some great advice! Here's what I've got now:

- 4 x HP Procurve 4000m 10/100 managed switches connected in a fully meshed topology. Each switch has a (1Gbps) mesh connection to each of the other 3 switches.
- The mesh function ties the switches together with multiple physical routes, giving (1) non-stop protection against link failure, and (2) higher bandwidth switch-switch.
- Two switches are located in the basement data center, and two are on the second floor serving desktops (I expect to add the first floor in the next few years).
- If a second floor switch fails, about half of the desktops are affected, and I can restore those by taking them off the failed switch, and plugging them into the survivor.
- If a data center switch fails, all critical servers are connected with redundant NICs to separate switches, and so they maintain their connections on the surviving switch.
- Everything is on one IP subnet, with no defined VLANs. There are about 100 connected systems.
- We have a single Internet connection to/from the LAN: 2 T-1s bonded in a Cisco 3640, feeding a firewall box that's decent (though it doesn't route very well), which connects to the LAN.
- All systems (web servers, etc.) reside "inside" on the single LAN.
- Because of the simple configuration, I can pretty much plug and unplug things anywhere on the network without trouble.
- It's *really* simple to manage, obviously.

What's driving the upgrade:

- VoIP deployment on the LAN in the next 90 days; so I want more LAN security and performance headroom.
- Rapid growth in the Internet-facing applications; so I want to get them out of the LAN, and make them more scalable.

My goals:

- Get Gigabit connectivity, especially among the servers, but retain the same resilience (or better) that I have now (i.e. drop a switch, and keep running).
- Improve/enforce security by segmenting the network into LAN and a few DMZs.
- Create an infrastructure that will last for the next 10 years (assuming no huge growth spurts, but be able to roll with them if they occur).
- Keep it simple to manage.

My thoughts on how to solve (and I have diagrams if anyone is interested): Some of this is conceptual, and some revolves around particular products to make an example. If I'm off-base on the concepts, then please correct me there. I don't want to get sidetracked discussing this product vs. that if the picture is wrong to start.

A. Create a topology where a capable *redundant* routing firewall (a Juniper SSG140 is in my mind here) controls access to and from
    (1) the Internet (eventually with redundant links);
    (2) an "Internet" DMZ, where the web servers, DNS, e-mail proxy, etc. reside;
    (3) an "Intranet" DMZ, where company apps, VoIP proxy, VPN access, etc. reside;
    (4) the internal LAN, and possibly
    (5) a couple of other subnets for server/device management, sandbox systems, etc. These could possibly be set up as VLANs riding on the internal LAN's gear instead.

B. Upgrade the LAN switches to Gigabit. At least for the data center, and preferably for all of them (management asks why can't we get gig to the desk for all this money?). For longevity, I'd look for the highest switch bandwidth/lowest latency for the buck. Issues:
    (1) I still see the internal LAN as essentially one subnet (may grow to 200 servers/desktops). I'd love to set them up meshed as now, but in the HP line, at least, they've restricted meshing to the high-end switches (e.g. 5400zl @ $100+/port). I might swing this in the end, but I'm also paying for a lot of functions I'm not sure I'll ever use. On the other hand, they have a lot of bandwidth...
    (2) Interconnects are expensive. It's great to have fast switches, but how to connect them? 10GbE is out of the price ballpark, so best I can do is trunk 1Gb links. Again, meshing helps here (though with lots of ports consumed).
    (3) If I go with a non-mesh setup (e.g. using 4200vl switches), how does my fault-tolerance fare? In other words, I could see two switches in the data center with trunked links between them for a high-speed core. Then, a set of trunks from one DC switch to one 2nd floor closet switch, and another set of trunks from the 2nd DC switch to the other 2nd floor switch. I lose something here, since a DC switch failure also kills a second floor switch. Do you combine all this with spanning trees for link redundancy? Yuck, seems complex and error-prone. And, how do I connect my redundant server-NICs to the DC switches in this case so that they behave as nicely as they do now? I'm really unclear on this.

So, that's it in a nutshell. There are obviously many more questions to be answered, but I think this covers the big issues, and I wanted to get it out there. I have not been up on all the latest products the last few years, and things have exploded. There seems to be a lot of functional overlap in routers, switches, and firewalls, so how do you get what you need without wasting a bunch? I've been wrestling with this for a week reading up on current products and design ideas, but then thought I should turn it over to the guys with real-world experience. I'd love to hear from you! Thanks!!
Question by:hillandale
    LVL 4

    Assisted Solution

    In my experience, decent switches don't fail very often. If it were me, personally, I wouldn't worry as much about the switches failing as about other possible problems. I would get GigE for everything and not worry about meshing, but maybe keep a spare on hand. When a switch dies, it is pretty easy to figure out, and it takes all of 5 minutes to replace as well. Your cabling, servers, NICs, firewalls, etc. are all as likely or more likely to fail. I would put my money toward those things (redundant Internet, a good firewall, etc.) rather than worrying too much about redundancy with switches. I'm sure someone will disagree, but that's what I think (unless someone dies or something if you lose connectivity for 5 minutes on one floor).

    Author Comment

    dempsedm, I would tend to agree with you. The only unplanned outage I've had with the existing gear in 7+ years was, ironically, caused by monitoring software that was supposed to inform me on the network's health. That said, we run a call center, and any service disruption stops the entire business dead in its tracks. Plus, the data center is largely lights-out (long story). So, in our case, I think I really am going to need to provide redundancy in the LAN. Thanks!!
    LVL 3

    Expert Comment

    In all honesty, and I am not trying to be a jerk, just honest: if you don't know the answers to those questions, then any equipment that you purchase or topology that you produce will not have a lifespan equal to the years that you are wanting to get out of it. I would hire a consultant; in the long run it will save you a lot of time and money.
    LVL 14

    Expert Comment

    You may want to contact HP.  They will provide free design services for you.
    If you need, I can try to find the links.

    Author Comment

    Thanks for your replies -

    trath, that's a valid viewpoint/solution, and it's a possibility. The point of posting here is to get some general insight and/or ideas that can help me better understand the issues, even if a consultant eventually handles the details.

    steveoskh, I've done that. Unfortunately, I knew more about the products and possibilities than the assigned "tech", and the "senior" guy hasn't yet been available. I did contact these guys on another issue a few years ago, and was impressed... so far, not this time.
    LVL 17

    Accepted Solution

    Unless something unusual happens on your network on a daily basis, 10GE is unnecessary. If you only have 200 servers/desktops, 1GE is more than sufficient. It's the backplane of the switch, and the packets per second that can be switched across it, that count.

    Unless your offices have two data jacks, making desktops redundant will be a problem. You should use the Core/Distribution/Access methodology. Your core, which should house your backbone and WAN, should be the fastest, encompassing 1GE switches. These switches should be connected to each other using EtherChannel for redundancy. Each server should have one network card on each core switch, providing its own redundancy. Since you may or may not be big enough to have a distribution layer, you should get access layer switches, each with its own EtherChannel uplink to both core switches in the data center. At that point your redundancy stops at the desktops.
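
    For illustration, a minimal sketch of that core-to-core EtherChannel in Cisco IOS syntax (interface numbers and the PAgP choice are my assumptions, not a tested config):

        ! Core switch A: bundle two gigabit links toward core switch B
        interface range GigabitEthernet0/1 - 2
         channel-group 1 mode desirable    ! PAgP; use "mode active" for LACP
        !
        interface Port-channel1
         switchport trunk encapsulation dot1q
         switchport mode trunk

    Core switch B mirrors this, and the bundle then survives a single link failure without any reconvergence.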

    I would be more inclined to put money into a high-speed switching backbone to provide good QoS for VoIP and server access, and leave the desktops on 100M Ethernet. I would, however, get a decent 100Mbit access layer switch for the desktops with high packet-per-second throughput, since you're moving to VoIP. The last thing you want to do is get low-end switches and then have to start configuring and testing QoS on the switch to get good voice quality. As long as you have the throughput, QoS will be limited to the core. Don't be afraid to buy 3 smaller switches and split up your users instead of buying larger ones. If you have 100 users with roughly 33 per switch, you have a greater portion of the company up and functional should a switch fail. All of these would be EtherChanneled back to the core. This also avoids the problem of a couple of people copying large files impacting VoIP quality. Don't use a trunking protocol at this point; drop each access switch's ports straight onto the VLAN configured on that switch for even more performance gain.

    At your core, you should have 3 VLANs propagated according to your schematics: one VLAN for the servers, one VLAN for end users, and one VLAN for VoIP phones. Any Internet access should be off the user VLAN, unless you have lots of servers open to the Internet, which should be on a DMZ anyway.
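
    As a rough example, the VLAN definitions could look like this (IDs are arbitrary placeholders):

        vlan 10
         name SERVERS
        vlan 20
         name USERS
        vlan 30
         name VOICE
        !
        ! A typical user port is then just dropped onto its VLAN:
        interface FastEthernet0/10
         switchport mode access
         switchport access vlan 20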

    ALL WAN access should be outside a firewall unless it's trusted remote sites. Redundant firewalls, with an IP load balancer sitting behind them for Internet-based server access, should be considered. From a security perspective, you have more to fear from your users than from the outside world if you have a good firewall in place.

    The IP phone system you get SHOULD NOT be proprietary in nature; that will cause problems if you do need to start doing QoS. You should use standard protocols like SIP and H.323, with decent compression, to cross the core. With a high-speed core, RTP to the desktop won't be a problem.
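
    If you do end up doing QoS at the core, the usual starting point on the Catalyst side is simply trusting the DSCP markings coming from the voice gear; a sketch in 3750-family syntax (confirm the commands for whatever platform you actually buy):

        mls qos                      ! enable QoS globally (it's off by default)
        !
        interface GigabitEthernet0/5
         mls qos trust dscp          ! honor EF-marked RTP from the VoIP equipment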

    Just a few things to throw out there. Always remember: switch when you can, route when you must. The more of a switching backbone you have, the fewer quality problems you'll have with VoIP and the fewer performance problems with end users.

    Author Comment

    Good feedback mikecr - thanks! I'm really tied up on another project at the moment, but would like to continue the discussion with you a bit. So, if you don't mind, please keep an eye out on the thread. Thanks again - Gary

    Author Comment

    OK, it's taken a while, but I'm back on this one. If you're still getting this, mikecr, here are a few follow-up questions for you:

    Given the size of my LAN, I'd see using a 2-tier setup at most. Practically, this might be two high-end (L2/3/4, ~200+Mpps) switches in the data center that are trunked together with (at least) 4 x 1Gb links (a 10Gb CX4 link or two seems reasonably economical if the base switch supports it). Any critical server would be connected to both switches with redundant NICs, though configured so that most server-server communication would traverse only a single switch.

    Based on your note, as I understand it, I'd drop access switches for client connections off of this "core". For example, I might trunk two Gb links to each access switch. Now, you mention putting these clients into their own VLAN, with the servers and voice on two more. Most of my traffic is client-server (or server-server); there's very little client peer-peer. So, the access switches would almost entirely carry the client VLAN, and provide desktop phones access to the voice VLAN. They really could just be fast L2 switches in this case. The core switches would route from the client VLAN to the server VLAN. This seems to be a lot of extra work ("route when you must") for little gain, though. What drives the segmentation of the client and server LANs? Is there a performance angle in this small a case, or is it primarily security? Subjective, I know, but is it "worth it" to complicate management, troubleshooting, recovery? (I also didn't follow you when you talked about using more, smaller edge switches; you wrote: "Don't use a trunking protocol at this point, drop them on the VLAN configured on the switch for even more performance gain.")
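
    For concreteness, the routed core I'm picturing would be something like this on the core switches, sketched in Cisco-style syntax (addresses and VLAN IDs invented):

        ip routing
        !
        interface Vlan10
         ip address 10.0.10.1 255.255.255.0    ! server VLAN gateway
        interface Vlan20
         ip address 10.0.20.1 255.255.255.0    ! client VLAN gateway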

    Another complication is that we're mostly using (SIP) softphones on the clients, and I'm not aware (yet) of any switches that will assign traffic at an interface to a VLAN based on IP port. So, since I'll have voice and data traffic coming over the same interface, the endpoints will have to tag the traffic themselves. The developers were hoping to avoid this, but I would expect that's typical, and can't imagine it's that big a deal...
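
    If the endpoints must tag the traffic themselves, I assume each desktop port effectively becomes a small 802.1Q trunk, along these lines (untested; VLAN numbers as in my example above):

        interface FastEthernet0/7
         switchport mode trunk
         switchport trunk native vlan 20        ! untagged data from the PC
         switchport trunk allowed vlan 20,30    ! softphone tags VLAN 30 itself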

    Finally, based on how we typically do things here, I would still prefer to employ redundant links. So for example, with the two "core" switches as above, I'd add the access layer as 2 x 48-port switches. Each of those would connect back to the two core switches (with single or trunked links). Typically, then, I'd use some sort of Spanning Tree to designate primary and fail-over links from the edge to the core. 802.1s MSTP would seem to be a good choice conceptually, as you wouldn't have any purely idle links (data could flow on one, while voice traversed the other).
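
    My mental model of the MSTP piece, again in Cisco-style syntax as a sketch (the instance-to-VLAN mapping is just an example):

        spanning-tree mode mst
        !
        spanning-tree mst configuration
         name CAMPUS
         revision 1
         instance 1 vlan 20    ! data
         instance 2 vlan 30    ! voice
        !
        ! Core A is made root for instance 1 and core B for instance 2,
        ! so both edge uplinks carry traffic until a failure:
        spanning-tree mst 1 priority 4096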

    My question on this is how fast can I expect the Spanning Tree function to redirect traffic in the case of a failed link? Basically, I can (1) have an edge switch fail, and it knocks out the directly-connected endpoints. I'm OK with that. (2) A link from edge to core, or a core switch could fail. Then, whatever VLANs were running over the failed link (or to the failed core switch) would have to divert to using the fail-over link. How fast does that happen? Different vendors make different claims, but I have to think my topology is so simple that it would be pretty fast. I've not played with it, though, so do you have an experienced opinion?

    From your comments, an alternative would be to use, say, 4 x 24-port switches at the edge, with each linking back to only one core switch. In that case there wouldn't be any need for STP. A link failure would knock out 24 edge ports, and a core switch failure would knock out 48. With STP, no edge ports would be knocked out in these cases, so that's why I'd lean that way if it (STP) works well, and isn't too unwieldy to manage.

    (I've learned since my original post that the HP "mesh" function appears to be proprietary, and so may not be familiar to many. I took it for granted that other vendors do similar things, but not really. It's rather like turning standard Ethernet ports into the dedicated "stacking" ports you see on some low- to mid-range switches. With HP's mesh function, I can connect multiple switches (up to 12, I think) together with multiple links, and designate those ports as "mesh". That essentially bonds the switches together as one virtual switch. There are lots of loops and redundant paths, and they're all active, but the switches deal with that. While I still think this is cool and useful, it's limited in that the IP routing functions must be disabled to use it. So, for instance, if I have separate client and server VLANs, I'd have to connect those through some router external to the meshed switches. That defeats most of HP's "mesh" advantage over STP, i.e. multiple active mesh paths giving zero-latency fail-over and more switch-to-switch bandwidth.)

    So, in summary, please discuss any of my details, but I think the essential questions I'm asking on this path are:
    1. Do the VLANs (voice/client/server in the LAN) really benefit me, and how?
    2. If I use them, does that relegate me to using "typical" Spanning Tree methods for resiliency/redundancy in the LAN, and how will those perform in case of a failure?
    3. If the answer to #1 is "great benefit", and the answer to #2 is "STP should work well in my case", then it's time to go shopping....

    Hope you get this, and thanks!
    LVL 17

    Assisted Solution

    First off, get high-speed backbone switches for your data center to plug your servers into. I use Cisco for everything, so I don't know what a comparable HP switch would be. I design and troubleshoot networks for people; that's what I do, so I try to stick with what I know best. Build your network based on current need with some expansion for the future. Keep in mind that workstations running at full duplex do 200Mbit per second (100 each way), more than enough for the next few years. Servers should be all gigabit. The slowest part of your network will be your Internet connection and firewall because of the jobs they do. You don't need to put a whole lot of money there except for redundant Internet connections, unless you have a need.

    Now here is the design. Since you're running softphones on the client workstations, which will be OK for now but won't be if you grow much bigger, you definitely need VLANs. Your VoIP system needs to reside on the same subnet as your clients, if at all possible, so that you switch instead of route. Routing will add latency and force you to design a QoS strategy. Your servers will need to be on another VLAN. VLANs create broadcast domains, and broadcasts stop at that boundary. Using VLANs, you keep your servers from processing needless broadcasts from the client workstations.

    I would get two high-speed backbone switches for the data center. My personal pick would be either a Cisco 4900 series or 4500 series. 3750s may also work in your case; they come in a lot lower in price than the 4500/4900 but have a fast switch fabric with one really nice feature: clustering. You can cluster both of your core switches together, making them look and act like one switch. With this in mind, you can put EtherChannel-capable NICs in your servers and bond them together, putting one NIC on each switch. There's your server redundancy, sweet and simple. Next, I would use 2960 switches for the clients, splitting them up onto each switch as you have now. The interesting part will be: NO SPANNING TREE. You would use bonded switch ports, once again using EtherChannel, as your uplinks to the core switches. These bonded ports give you aggregated bandwidth equal to the sum of the member ports: bond two gigabit ports into one EtherChannel, configure load balancing across them, and you essentially get 2Gbps! All of this is EXTREMELY easy to configure, keeping your network not only resilient but easy to manage.
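
    To make that concrete, a hedged sketch of one such uplink, assuming a two-member 3750 stack (early 3750 code supports only channel-group mode "on" across stack members, so check your IOS release; port numbers invented):

        ! On the 3750 stack: one member port on each stack member
        interface range GigabitEthernet1/0/24 , GigabitEthernet2/0/24
         channel-group 10 mode on
        !
        ! On each 2960 access switch: both uplinks in the same bundle
        interface range GigabitEthernet0/1 - 2
         channel-group 10 mode on
        !
        interface Port-channel10
         switchport mode trunk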

    Since your uplink switches would be clustered (in the case of the 3750), you could put one gigabit port from each client switch, and one server NIC, on each of the clustered switches, giving you 100% uptime. The only thing that would drop you would be a total switch failure. No spanning tree. Just leave it at its default in case somebody starts hooking up switches wrong, but you don't need to rely on it to make your network redundant. An EtherChannel appears as one virtual interface on the switch when configured, so it doesn't require spanning-tree configuration, but STP is still nice to have just in case.

    You need two VLANs, one for the users and one for the servers. Since you will probably be opening some servers on your network to the Internet, I would put the inside of the firewall on the server VLAN so that you cut down on the number of hops into the network. It's all about latency: how many mountains do I have to cross before I get there? The fewer you cross, the faster you get there. Now this would add one hop for users getting to the Internet, but it's an acceptable trade-off. If you have bonded T1s, I would replace the 3640 with a Cisco 2821. It does a much better job and will support VPN connections if needed.

    If you have any more questions, please don't hesitate to ask.

    Author Comment

    I'm grateful for your response, and am glad to have your Cisco recommendations, as I'm not really that familiar with the line. Just to make sure I understand correctly, your recommendation with some Cisco-specific functionality is: (1) A fast core, composed of at least two switches. For Cisco, the minimum would be the 3750 baseline in a "clustered" configuration. (2) An access layer, composed of (at least two) 2960 switches, each of which is uplinked to each 3750 switch for redundancy (I assume these are EtherChannel links, even though they're attached to two 3750 physical switches). This setup is largely self-configuring, and does not involve spanning-tree for the redundant links.

    If that's the case, then the Cisco "cluster" sounds quite a bit like HP's "mesh" function. If I go with the HP, they seem to have fine performance. 300+Gbps fabric, < 4usec FIFO latency, and 200+Mpps routing speed. They're also chassis-based with open slots to max out at 144 gig ports. Throw in a lifetime HW/SW warranty for ~ $5K, and that's a good value to me (a few high-end features, like OSPF are extra). The one limitation at the moment is that they don't support IP routing when the mesh feature is enabled. So, if I use VLANs, I have to route outside the switches, and that seems like a step backwards (HP has been mainly a closet vendor, so I don't expect them to compete with Cisco on functionality).

    I agree that 100Mbps would suit the desktop fine, and I see the Cisco 2960 FE version is a *whole* lot less than the gigabit. With 2x3750 core and FE 2960's on the edge I still probably make my budget. So, a few 3750 questions from their datasheet:

    (1) Under redundancy, they mention CrossStack UplinkFast "provides increased redundancy...through fast spanning-tree convergence...across a switch stack..." Later, it says "Stacked units behave as a single spanning-tree node." There's no mention of "clustering" by name. So, is this the same thing? And if so, I gather that the stack as a "single spanning-tree node" means the usual (slow) convergence doesn't occur. The bottom-line for me is that I really don't want a dropped core switch to take out an edge switch (or two).

    (2) I'm surprised that the 3750 specs out with a 32Gbps fabric, and 39Mpps through the switch and/or stack. This seems awfully slow...not that I'm going to blow that out anytime soon, but it doesn't sound really future-proof. This is, after all, a $10K switch when all is said and done (48-port gig with RPS). Am I missing something, or is there some reason that spec isn't terribly meaningful?
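
    For what it's worth, here's my own back-of-the-envelope check on that spec, assuming worst-case 64-byte frames (which work out to roughly 1.488 Mpps per gigabit port):

        48 ports x 1.488 Mpps  = ~71.4 Mpps for wire speed on minimum-size frames
        quoted forwarding rate =  39 Mpps

    So it isn't line-rate in the worst case, but it's far beyond any realistic traffic mix on a 200-node LAN. Is that the right way to read it?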

    (3) I notice a mention of policy-based routing (3750). My reorganized LAN/WAN will have multiple security zones (mainly a couple of DMZs added), and I was considering handling the routing between those areas with the firewall device. Would the routing in the 3750 replace that functionality securely, and let me just use a firewall as a "straight-through" device on the perimeter? Which makes more sense? On a related note, you had mentioned using 2821s in place of the 3640s. My ISP's giving away 1841s at the moment if we re-up with them, so that may be the thing to do....

    I'll segue into a few more questions for you... I think I need to post these as new topics, since they really stand-alone, and merit more points for answers. But, since I imagine Cisco has a solution for all of them, here's a preview:

    (A) As mentioned, I need a firewall box, and I see Cisco offers the PIX and ASA boxes. I gather the ASA is more of a UTM-type appliance. This supports all inbound/outbound Internet, with no VPN at the moment. The Internet facility will be 6Mbps for the next couple of years, so not a screamer. As mentioned previously, I do plan on using this as a routing point between LAN, WAN, and DMZs. Of course, I'd expect to get two of them for redundancy. I also assume these would handle fail-over routing/balancing for multiple WAN circuits (terminated on the access routers). A bit of layer 7 routing that could shunt web surfing from the LAN off to our cheap cable-modem would be nice (rough sketch below). I understand that the firewall may not be the place for all this, but that's how I see it at the moment.
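
    (On the shunting idea: I realize plain policy-based routing only sees layers 3/4, not layer 7, but something like this on the inside router might approximate it for web traffic; untested, addresses made up:)

        access-list 100 permit tcp 10.0.20.0 0.0.0.255 any eq www
        !
        route-map WEB-VIA-CABLE permit 10
         match ip address 100
         set ip next-hop 192.168.100.1        ! cable-modem gateway
        !
        interface Vlan20
         ip policy route-map WEB-VIA-CABLE    ! applied to LAN user traffic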

    (B) For redundancy and ease of [rolling] upgrades (more than performance), I expect to put a load-balancer in front of the web servers (6 boxes at the moment). Some layer 7 routing would be nice (e.g. certain heavy web pages are served better by the hot machines), but not essential. No heavy SSL requirement (logins only), though if we can place an SSL certificate at a gateway point, rather than on each web server, that would be simpler I think? Of course, I'd look for a redundant pair.

    (C) Seems like the same box that would do (B) would also do the same thing for DNS?

    (D) VoIP gateways and - you guessed it - load balancing (again, mainly fail-over). I'd like to get appliances to take our switched T1/PRI voice traffic and turn it into IP on the LAN. We're using Asterisk for the telephony inside, SIP protocol, and G.711 or G.722 CODECs. For now, I expect to provision two quad-span gateways, with a 3rd box as a spare. If I could somehow configure the 3rd box (with something like drop and insert on the T-1 side) so that it takes over on failure of one of the others, that would be great! I've read a bit on the various AS53xx boxes. If all I need is TDM voice to IP, then it seems some used AS5300s would be very cost-effective.

    I'm wondering, though, if there might be some routing capability in the higher-end boxes. Let's assume I have (multiple) voice gateways forwarding traffic to one or more Asterisk boxes (I have two, naturally). I would like to be able to take one Asterisk box off-line for upgrade and "shut down gracefully". That is, all in-progress calls should continue to that Asterisk box, but any new calls should be assigned to the alternate box. For normal operation, we'd probably only have one Asterisk box active at a time, with the other one as a hot-standby. In an Asterisk failure scenario, I'd expect to immediately route all traffic to the standby, but in-progress calls would be lost. If this intelligence isn't in the voice gateway, then I would hope to use an external solution to do it. If you've messed with any of this, I'm all ears for pointers. Seems like any Cisco application load-balancing is relegated to the high-end products, and thus out of our price range.
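
    (The one concrete piece I know of on the Asterisk side is its graceful stop, which refuses new calls while letting active ones finish; this is 1.2/1.4-era CLI syntax, so verify on your version:)

        # On the Asterisk box being taken out of service:
        asterisk -rx "stop gracefully"

    The open question is getting the gateways to steer new calls to the surviving box while that one drains.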

    My budget for all the above was submitted as (~ $52K):
    4 x switches (core-edge) $17.5K
    2 x firewall with UTM $10K
    load balancing for web $5K
    Spam appliance (e.g. Barracuda) $1500
    Voice gateways and load-balancing/fail-over switch $18K
    Plus whatever I clear on the used 3640's. I may have some room above this if I can make a persuasive case, of course.

    I know that all this redundancy can sometimes complicate things to the point where they are less reliable. After all, good network gear is quite reliable, and few hands will touch them. At the same time, this network (voice and data) IS the business. Any outage immediately loses business now, and endangers customer relationships. The company owners like to see lots of backup.

    Again, I'll post these last items separate (a bit later), so feel free to answer there instead.... Thanks again!!

    Author Comment

    Sorry for the delay. I got derailed on solving some of the VoIP-specific issues. Thank you for your input!
