hillandale

asked on

Network Design needed - Resilient and Fast!

Hi!

I'm upgrading my 10-year-old network architecture, and could use some great advice! Here's what I've got now:

- 4 x HP Procurve 4000m 10/100 managed switches connected in a fully meshed topology. Each switch has a (1Gbps) mesh connection to each of the other 3 switches.
- The mesh function ties the switches together over multiple physical routes, giving (1) non-stop protection against link failure, and (2) higher switch-to-switch bandwidth.
- Two switches are located in the basement data center, and two are on the second floor serving desktops (I expect to add the first floor in the next few years).
- If a second floor switch fails, about half of the desktops are affected, and I can restore those by taking them off the failed switch, and plugging them into the survivor.
- If a data center switch fails, all critical servers are connected with redundant NICs to separate switches, and so they maintain their connections on the surviving switch.
- Everything is on one IP subnet, with no defined VLANs. There are about 100 connected systems.
- We have a single Internet connection: 2 T-1s bonded in a Cisco 3640, which connects through a firewall box (decent, though it doesn't route very well) to the LAN.
- All systems (web servers, etc.) reside "inside" on the single LAN.
- Because of the simple configuration, I can pretty much plug and unplug things anywhere on the network without trouble.
- It's *really* simple to manage, obviously.

What's driving the upgrade:

- VoIP deployment on the LAN in the next 90 days, so I want more LAN security and performance headroom.
- Rapid growth in the Internet-facing applications, so I want to get them out of the LAN and make them more scalable.

My goals:

- Get Gigabit connectivity, especially among the servers, but retain the same resilience (or better) that I have now (i.e. drop a switch, and keep running).
- Improve/enforce security by segmenting the network into LAN and a few DMZs.
- Create an infrastructure that will last for the next 10 years (assuming no huge growth spurts, but be able to roll with them if they occur).
- Keep it simple to manage.

My thoughts on how to solve this (and I have diagrams if anyone is interested): Some of this is conceptual, and some revolves around particular products to serve as examples. If I'm off-base on the concepts, then please correct me there. I don't want to get off discussing this product vs. that if the picture is wrong to start.

A. Create a topology where a capable *redundant* routing firewall (a Juniper SSG140 is in my mind here) controls access to and from
    (1) the Internet (eventually with redundant links);
    (2) an "Internet" DMZ, where the web servers, DNS, e-mail proxy, etc. reside;
    (3) an "Intranet" DMZ, where company apps, VoIP proxy, VPN access, etc. reside;
    (4) the internal LAN, and possibly
    (5) a couple of other subnets for server/device management, sandbox systems, etc. These could possibly be set up as VLANs riding on the internal LAN's gear instead. (A rough subnet sketch follows this list.)
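To make the zone layout concrete, here's a rough sketch of how the five zones might be carved out of private address space. This is Python just to show the arithmetic; all names and subnet sizes are assumptions for illustration, not a recommendation.

    # Hypothetical zone/subnet plan for the firewall-centric design above.
    # All addresses and prefix lengths are illustrative assumptions.
    import ipaddress

    zones = {
        "internet-dmz": ipaddress.ip_network("10.0.10.0/24"),  # web, DNS, mail proxy
        "intranet-dmz": ipaddress.ip_network("10.0.20.0/24"),  # company apps, VoIP proxy, VPN
        "internal-lan": ipaddress.ip_network("10.0.30.0/23"),  # ~200 desktops/servers, room to grow
        "management":   ipaddress.ip_network("10.0.40.0/26"),  # switch/iLO/UPS management
        "sandbox":      ipaddress.ip_network("10.0.50.0/26"),  # test systems
    }

    for name, net in zones.items():
        print(f"{name:14} {str(net):16} usable hosts: {net.num_addresses - 2}")

Whether the management and sandbox ranges end up as firewall interfaces or as VLANs on the LAN switches only changes where the inter-zone routing/filtering happens, not the addressing.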

B. Upgrade the LAN switches to Gigabit. At least for the data center, and preferably for all of them (management asks why we can't get gig to the desktop for all this money). For longevity, I'd look for the highest switch bandwidth/lowest latency for the buck. Issues:
    (1) I still see the internal LAN as essentially one subnet (may grow to 200 servers/desktops). I'd love to set them up meshed as now, but in the HP line, at least, they've restricted meshing to the high-end switches (e.g. 5400zl @ $100+/port). I might swing this in the end, but I'm also paying for a lot of functions I'm not sure I'll ever use. On the other hand, they have a lot of bandwidth...
    (2) Interconnects are expensive. It's great to have fast switches, but how to connect them? 10GbE is out of the price ballpark, so the best I can do is trunk 1Gb links. Again, meshing helps here, though with lots of ports consumed (a quick port count follows this list).
    (3) If I go with a non-mesh setup (e.g. using 4200vl switches), how does my fault-tolerance fare? In other words, I could see two switches in the data center with trunked links between them for a high-speed core. Then, a set of trunks from one DC switch to one 2nd floor closet switch, and another set of trunks from the 2nd DC switch to the other 2nd floor switch. I lose something here, since a DC switch failure also kills a second floor switch. Do you combine all this with spanning tree for link redundancy? Yuck, seems complex and error-prone. And, how do I connect my redundant server NICs to the DC switches in this case so that they behave as nicely as they do now? I'm really unclear on this.
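On the port-consumption point, a quick back-of-the-envelope count (Python only to show the arithmetic; the two-links-per-switch-pair figure is an assumption):

    # Ports consumed by a full mesh of n switches, with k parallel 1 Gb
    # links per switch pair (k = 2 here is an assumption).
    def mesh_ports(n_switches, links_per_pair=2):
        pairs = n_switches * (n_switches - 1) // 2
        total_links = pairs * links_per_pair
        ports_used = total_links * 2                 # each link burns a port on both ends
        per_switch = (n_switches - 1) * links_per_pair
        return pairs, total_links, ports_used, per_switch

    for n in (2, 3, 4):
        pairs, links, ports, per_sw = mesh_ports(n)
        print(f"{n} switches: {pairs} pairs, {links} links, "
              f"{ports} ports total, {per_sw} uplink ports per switch")

With four switches and two-link trunks, each switch gives up 6 ports to the mesh before a single host is plugged in, and the count grows quadratically as switches are added.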

So, that's it in a nutshell. There are obviously many more questions to be answered, but I think this covers the big issues, and I wanted to get it out there. I have not been up on all the latest products the last few years, and things have exploded. There seems to be a lot of functional overlap among routers, switches, and firewalls, so how do you get what you need without wasting a bunch? I've been wrestling with this for a week, reading up on current products and design ideas, but then thought I should turn it over to the guys with real-world experience. I'd love to hear from you! Thanks!!
SOLUTION
dempsedm
This solution is only available to Experts Exchange members.
hillandale

ASKER

dempsedm, I would tend to agree with you. The only unplanned outage I've had with the existing gear in 7+ years was, ironically, caused by monitoring software that was supposed to inform me about the network's health. That said, we run a call center, and any service disruption stops the entire business dead in its tracks. Plus, the data center is largely lights-out (long story). So, in our case, I think I really am going to need to provide redundancy in the LAN. Thanks!!
In all honesty, and I am not trying to be a jerk, just honest: if you don't know the answers to those questions, then any equipment that you purchase or topology that you produce will not have a lifespan equal to the years that you are wanting to get out of it. I would hire a consultant; in the long run it will save you a lot of time and money.
steveoskh
You may want to contact HP.  They will provide free design services for you.
If you need, I can try to find the links.
Thanks for your replies -

trath, that's a valid viewpoint/solution, and it's a possibility. The point of posting here is to get some general insight and/or ideas that can help me better understand the issues, even if a consultant eventually handles the details.

steveoskh, I've done that. Unfortunately, I knew more about the products and possibilities than the assigned "tech", and the "senior" guy hasn't yet been available. I did contact these guys on another issue a few years ago, and was impressed... so far, not this time.
ASKER CERTIFIED SOLUTION
This solution is only available to Experts Exchange members.
Good feedback mikecr - thanks! I'm really tied up on another project at the moment, but would like to continue the discussion with you a bit. So, if you don't mind, please keep an eye on the thread. Thanks again - Gary
OK, it's taken a while, but I'm back on this one. If you're still getting this, mikecr, here are a few follow-up questions for you:

Given the size of my LAN, I'd see using a 2-tier setup at most. Practically, this might be two high-end (L2/3/4, ~200+Mpps) switches in the data center that are trunked together with (at least) 4 x 1Gb links (a 10Gb CX4 link or two seems reasonably economical if the base switch supports it). Any critical server would be connected to both switches with redundant NICs, though configured so that most server-server communication would traverse only a single switch.

Based on your note, as I understand it, I'd drop access switches for client connections off of this "core". For example, I might trunk two Gb links to each access switch. Now, you mention putting these clients into their own VLAN, with the servers and voice on two more. Most of my traffic is client-server (or server-server); there's very little client peer-to-peer. So, the access switches would almost entirely carry the client VLAN, and provide desktop phones access to the voice VLAN. They really could just be fast L2 switches in this case. The core switches would route from the client VLAN to the server VLAN. This seems to be a lot of extra work ("route where you must") for little gain, though. What drives the segmentation of the client and server LANs? Is there a performance angle in this small a case, or is it primarily security? Subjective, I know, but is it "worth it" to complicate management, troubleshooting, and recovery? (I also didn't follow you when you talked about using more, smaller edge switches; you wrote: "Don't use a trunking protocol at this point, drop them on the VLAN configured on the switch for even more performance gain.")

Another complication is that we're mostly using (SIP) softphones on the clients, and I'm not aware (yet) of any switches that will assign traffic at an interface to a VLAN based on IP port. So, since I'll have voice and data traffic coming over the same interface, the endpoints will have to tag the traffic themselves. The developers were hoping to avoid this, but I would expect that's typical, and can't imagine it's that big a deal...
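If 802.1Q tagging from the softphone itself turns out to be painful, one common fallback (just a sketch of the idea, not necessarily what your developers have planned) is to skip the voice VLAN for softphones and have them mark their RTP packets with DSCP EF, then let the switches classify and prioritize on DSCP. In Python the marking part looks roughly like this:

    # Sketch: marking a softphone's RTP socket with DSCP EF (46) so switches
    # can prioritize voice without an 802.1Q tag from the endpoint.
    # Works on Linux/macOS; Windows generally needs a QoS policy instead.
    import socket

    DSCP_EF = 46                  # Expedited Forwarding, the usual class for voice
    TOS_VALUE = DSCP_EF << 2      # DSCP sits in the top 6 bits of the old TOS byte

    rtp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rtp_sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
    rtp_sock.bind(("0.0.0.0", 0)) # ephemeral RTP port; real stacks negotiate this via SDP
    print("RTP socket marked with DSCP", DSCP_EF)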

Finally, based on how we typically do things here, I would still prefer to employ redundant links. So, for example, with the two "core" switches as above, I'd add the access layer as 2 x 48-port switches. Each of those would connect back to the two core switches (with single or trunked links). Typically, then, I'd use some sort of Spanning Tree to designate primary and fail-over links from the edge to the core. 802.1s MSTP would seem to be a good choice conceptually, as you wouldn't have any purely idle links (data could flow on one, while voice traversed the other).

My question on this is how fast can I expect the Spanning Tree function to redirect traffic in the case of a failed link? Basically, I can (1) have an edge switch fail, and it knocks out the directly-connected endpoints. I'm OK with that. (2) A link from edge to core, or a core switch could fail. Then, whatever VLANs were running over the failed link (or to the failed core switch) would have to divert to using the fail-over link. How fast does that happen? Different vendors make different claims, but I have to think my topology is so simple that it would be pretty fast. I've not played with it, though, so do you have an experienced opinion?
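For rough expectations, the standard timer defaults tell most of the story (a sketch only; actual convergence varies by vendor, and classic 802.1D is the slow case that RSTP/MSTP was designed to fix):

    # Rough STP reconvergence expectations from standard timer defaults.
    # Ballpark only; real behaviour depends on vendor and topology.
    MAX_AGE = 20        # seconds, 802.1D default
    FORWARD_DELAY = 15  # seconds, 802.1D default

    indirect_failure = MAX_AGE + 2 * FORWARD_DELAY   # stale topology info must age out first
    direct_link_down = 2 * FORWARD_DELAY             # listening + learning only

    print(f"802.1D, indirect failure: up to ~{indirect_failure}s")   # ~50s
    print(f"802.1D, direct link down: ~{direct_link_down}s")         # ~30s
    print("802.1w/802.1s (RSTP/MSTP) on point-to-point links: typically a few seconds or less")

So with a small two-tier topology running MSTP on full-duplex point-to-point links, failover should be fast enough that data sessions survive; whether an in-progress voice call notices depends on how many seconds it actually takes.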

From your comments, an alternative would be to use, say, 4 x 24-port switches at the edge, with each linking back to only one core switch. In that case there wouldn't be any need for STP. A link failure would knock out 24 edge ports, a core switch failure would knock out 48. With STP, no edge ports would be knocked out in these cases, so that's why I'd lean that way if it (STP) works well, and isn't too unwieldy to manage.

I've learned since my original post that the HP "mesh" function appears to be proprietary, and so may not be familiar to many. I took it for granted that other vendors do similar things, but not really. It's rather like turning standard Ethernet ports into the dedicated "stacking" ports you see on some low- to mid-range switches. With HP's mesh function, I can connect multiple switches (up to 12, I think) together with multiple links, and designate those ports as "mesh". That essentially bonds the switches together into one virtual switch. There are lots of loops and redundant paths, and they're all active, but the switches deal with that. While I still think this is cool and useful, it's limited in that the IP routing functions must be disabled to use it. So, for instance, if I have separate client and server VLANs, I'd have to connect those through some router external to the meshed switches. That defeats most of HP's "mesh" advantage over STP (i.e. multiple active mesh paths give fail-over without convergence delay, plus more switch-to-switch bandwidth).

So, in summary, please discuss any of my details, but I think the essential questions I'm asking on this path are:
1. Do the VLANs (voice/client/server in the LAN) really benefit me, and how?
2. If I use them, that relegates me to using "typical" Spanning Tree methods for resiliency/redundancy in the LAN; how will that perform in case of a failure?
3. If the answer to #1 is "great benefit", and the answer to #2 is "STP should work well in my case", then it's time to go shopping....

Hope you get this, and thanks!
SOLUTION
This solution is only available to Experts Exchange members.
I'm grateful for your response, and am glad to have your Cisco recommendations, as I'm not really that familiar with the line. Just to make sure I understand correctly, your recommendation with some Cisco-specific functionality is: (1) A fast core, composed of at least two switches. For Cisco, the minimum would be the 3750 baseline in a "clustered" configuration. (2) An access layer, composed of (at least two) 2960 switches, each of which is uplinked to each 3750 switch for redundancy (I assume these are EtherChannel links, even though they're attached to two 3750 physical switches). This setup is largely self-configuring, and does not involve spanning-tree for the redundant links.

If that's the case, then the Cisco "cluster" sounds quite a bit like HP's "mesh" function. If I go with the HP, they seem to have fine performance: 300+Gbps fabric, < 4 µs FIFO latency, and 200+Mpps routing speed. They're also chassis-based with open slots to max out at 144 gig ports. Throw in a lifetime HW/SW warranty for ~$5K, and that's a good value to me (a few high-end features, like OSPF, are extra). The one limitation at the moment is that they don't support IP routing when the mesh feature is enabled. So, if I use VLANs, I have to route outside the switches, and that seems like a step backwards (HP has been mainly a closet vendor, so I don't expect them to compete with Cisco on functionality).

I agree that 100Mbps would suit the desktop fine, and I see the Cisco 2960 FE version is a *whole* lot less than the gigabit. With 2x3750 core and FE 2960's on the edge I still probably make my budget. So, a few 3750 questions from their datasheet:

(1) Under redundancy, they mention CrossStack UplinkFast "provides increased redundancy...through fast spanning-tree convergence...across a switch stack..." Later, it says "Stacked units behave as a single spanning-tree node." There's no mention of "clustering" by name. So, is this the same thing? And if so, I gather that the stack as a "single spanning-tree node" means the usual (slow) convergence doesn't occur. The bottom-line for me is that I really don't want a dropped core switch to take out an edge switch (or two).

(2) I'm surprised that the 3750 specs out with a 32Gbps fabric, and 39Mpps through the switch and/or stack. This seems awfully slow...not that I'm going to blow that out anytime soon, but it doesn't sound really future-proof. This is, after all, a $10K switch when all is said and done (48-port gig with RPS). Am I missing something, or is there some reason that spec isn't terribly meaningful?
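For what it's worth, here's the arithmetic behind why that spec looks low (a rough calculation; it assumes the 48-port gigabit model and the usual worst case of minimum-size frames on every port):

    # What "non-blocking" would require for 48 gigabit ports, versus the
    # 32 Gbps fabric / 39 Mpps forwarding quoted for the 3750.
    PORTS = 48
    LINE_RATE = 1_000_000_000            # 1 Gbps per port
    MIN_FRAME_ON_WIRE = (64 + 20) * 8    # 64-byte frame + preamble/IFG = 672 bits

    fabric_needed_gbps = PORTS * 2 * LINE_RATE / 1e9    # full duplex: 96 Gbps
    pps_per_port = LINE_RATE / MIN_FRAME_ON_WIRE        # ~1.488 Mpps per port
    pps_needed_mpps = PORTS * pps_per_port / 1e6        # ~71.4 Mpps

    print(f"Non-blocking fabric needed: {fabric_needed_gbps:.0f} Gbps (quoted: 32)")
    print(f"Wire-speed forwarding needed: {pps_needed_mpps:.1f} Mpps (quoted: 39)")

So on paper the 3750 is oversubscribed by roughly 2-3x, but only against the pathological case of every port blasting 64-byte frames at line rate; typical client-server traffic never looks like that.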

(3) I notice a mention of policy-based routing (3750). My reorganized LAN/WAN will have multiple security zones (mainly a couple DMZs added), and I was considering handling the routing between those areas with the firewall device. Would the routing in the 3750 replace that functionality securely, and let me just use a firewall as a "straight-through" device on the perimeter? Which makes more sense? On a related note, you had mentioned using 1821s in place of the 3640s. My ISP's giving away 1841s at the moment if we re-up with them, so that may be the thing to do....

I'll segue into a few more questions for you... I think I need to post these as new topics, since they really stand alone, and merit more points for answers. But, since I imagine Cisco has a solution for all of them, here's a preview:

(A) As mentioned, I need a firewall box, and I see Cisco offers the PIX and ASA boxes. I gather the ASA is more of a UTM-type appliance. This supports all in/out Internet, with no VPN at the moment. The Internet facility will be 6Mbps for the next couple years, so not a screamer. As before, I did/do plan on using this as a routing point between LAN, WAN, and DMZs. Of course, I'd expect to get two of them for redundancy. I also assume these would handle fail-over routing/balancing for multiple WAN circuits (terminated on the access routers, I assume). A bit of layer 7 routing that could shunt web surfing from the LAN off to our cheap cable modem would be nice... I understand that the firewall may not be the place for all this, but that's how I see it at the moment.

(B) For redundancy and ease of [rolling] upgrades (more than performance), I expect to put a load-balancer in front of the web servers (6 boxes at the moment). Some layer 7 routing would be nice (e.g. certain heavy web pages are served better by the hot machines), but not essential. No heavy SSL requirement (logins only), though if we can place an SSL certificate at a gateway point, rather than on each web server, that would be simpler I think? Of course, I'd look for a redundant pair.
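Conceptually, the behaviour wanted for rolling upgrades is just "round-robin over whichever web servers pass a health check", so a box being upgraded fails its check and drops out of rotation. A toy sketch of that selection logic (hypothetical addresses; a real deployment would use the appliance's VIP/health-check features, or something like HAProxy, rather than this):

    # Toy sketch: round-robin across web servers that pass a TCP health check.
    # Addresses are hypothetical.
    import itertools, socket

    WEB_SERVERS = [("10.0.10.11", 80), ("10.0.10.12", 80), ("10.0.10.13", 80)]
    _rr = itertools.count()

    def healthy(host, port, timeout=1.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def next_backend():
        live = [s for s in WEB_SERVERS if healthy(*s)]
        if not live:
            raise RuntimeError("no healthy web servers")
        return live[next(_rr) % len(live)]

    try:
        print("next request goes to:", next_backend())
    except RuntimeError as exc:
        print(exc)

SSL termination at the balancer works the same way conceptually: the certificate lives on the balancer's virtual IP, and the pool members only ever see plain HTTP.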

(C) Seems like the same box that would do (B) would also do the same thing for DNS?

(D) VoIP gateways and - you guessed it - load balancing (again, mainly fail-over). I'd like to get appliances to take our switched T1/PRI voice traffic and turn it into IP on the LAN. We're using Asterisk for the telephony inside, SIP protocol, and G.711 or G.722 codecs. For now, I expect to provision two quad-span gateways, with a third box as a spare. If I could somehow configure the third box (with something like drop and insert on the T-1 side) so that it takes over on failure of one of the others, that would be great! I've read a bit on the various AS53xx boxes. If all I need is TDM voice to IP, then it seems some used AS5300s would be very cost-effective.
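For sizing, a quick estimate of what two fully loaded quad-span PRI gateways put on the LAN (assumes 23 B-channels per span, G.711 at 20 ms packetization, and the usual Ethernet/IP/UDP/RTP framing overhead; all of those are assumptions to illustrate the math):

    # LAN load from the voice gateways at full load (rough estimate).
    GATEWAYS, SPANS_PER_GW, CHANNELS_PER_SPAN = 2, 4, 23   # PRI: 23 B-channels per T1

    payload = 160                  # bytes of G.711 per 20 ms packet
    headers = 12 + 8 + 20 + 18     # RTP + UDP + IPv4 + Ethernet w/ FCS
    pps = 50                       # packets per second per direction
    kbps_per_call = (payload + headers) * 8 * pps / 1000   # ~87 kbps each way

    calls = GATEWAYS * SPANS_PER_GW * CHANNELS_PER_SPAN    # 184 concurrent calls
    print(f"{calls} calls x {kbps_per_call:.1f} kbps = "
          f"{calls * kbps_per_call / 1000:.1f} Mbps each way at full load")

That comes out around 16 Mbps per direction, which is trivial for a gigabit core; the VLAN/QoS work is really about protecting latency and jitter rather than raw bandwidth.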

I'm wondering, though, if there might be some routing capability in the higher-end boxes. Let's assume I have (multiple) voice gateways forwarding traffic to one or more Asterisk boxes (I have two, naturally). I would like to be able to take one Asterisk box off-line for upgrade and "shut down gracefully". That is, all in-progress calls should continue to that Asterisk box, but any new calls should be assigned to the alternate box. For normal operation, we'd probably only have one Asterisk box active at a time, with the other one as a hot standby. In an Asterisk failure scenario, I'd expect to immediately route all traffic to the standby, but in-progress calls would be lost. If this intelligence isn't in the voice gateway, then I would hope to use an external solution to do it. If you've messed with any of this, I'm all ears for pointers. Seems like any Cisco application load-balancing is relegated to the high-end products, and thus out of our price range.
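What's described here is essentially connection draining: pin existing calls to the box they started on, and stop handing that box new calls. A toy sketch of the dispatch logic (purely hypothetical; this is not a feature of the gateways or of Asterisk being quoted, just the desired behaviour spelled out):

    # Toy sketch of drain-mode call dispatch: in-progress calls stay put,
    # new calls only go to servers that are active (not draining or down).
    class CallDispatcher:
        def __init__(self, servers):
            self.state = {s: "active" for s in servers}   # active | draining | down
            self.calls = {}                               # call_id -> server

        def set_state(self, server, state):
            self.state[server] = state

        def route(self, call_id):
            if call_id in self.calls:                     # existing call: keep it pinned
                return self.calls[call_id]
            active = [s for s, st in self.state.items() if st == "active"]
            if not active:
                raise RuntimeError("no active Asterisk servers")
            self.calls[call_id] = active[0]               # primary / hot-standby: first active wins
            return self.calls[call_id]

    d = CallDispatcher(["asterisk-1", "asterisk-2"])
    print(d.route("call-100"))               # asterisk-1
    d.set_state("asterisk-1", "draining")    # take it out of rotation for an upgrade
    print(d.route("call-100"))               # still asterisk-1 (call already in progress)
    print(d.route("call-101"))               # asterisk-2 (new call)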

My budget for all the above was submitted as (~ $52K):
4 x switches (core-edge) $17.5K
2 x firewall with UTM $10K
load balancing for web $5K
Spam appliance (e.g. Barracuda) $1500
Voice gateways and load-balancing/fail-over switch $18K
Plus whatever I clear on the used 3640's. I may have some room above this if I can make a persuasive case, of course.

I know that all this redundancy can sometimes complicate things to the point where they are less reliable. After all, good network gear is quite reliable, and few hands will touch them. At the same time, this network (voice and data) IS the business. Any outage immediately loses business now, and endangers customer relationships. The company owners like to see lots of backup.

Again, I'll post these last items separately (a bit later), so feel free to answer there instead.... Thanks again!!
Gary
Sorry for the delay. I got derailed on solving some of the VoIP-specific issues. Thank you for your input!