Solved

Network packet storm created by backup over routed Cisco network

Posted on 2008-10-20
Last Modified: 2012-05-05
We have an obvious problem in that whenever a backup is initiated using IBM Tivoli and its new super fast VTL tape library, network problems occur due to some sort of packet / broadcast storm.

We have recently moved this backup service to a building 100 metres away, connected by 1 Gb fibre, in order to serve as an immediate off-site backup for the time being until we sort out replication. The backup server now sits on another subnet and must now traverse a 'vlan interface' default gateway in order to do all of its backup to all the servers in the computer room (when the server was local a gateway was obviously not required).

Our HQ building and the large branch office building are connected via a routed vlan. The vlan at the branch building is vlan 29 (on the 172.29 subnet) - and the HQ building here with the server room is on vlan 20 (172.20 subnet). In order for this backup server to reach vlan 20 from its access vlan 29 Cisco 3560 switch, it must now go to the 172.29.1.252 address of the vlan 29 interface before leaping onto vlan 20. The vlan 29 interface is configured on our core switch (comprising 6 stacked 3750s as one logical unit) in the HQ central computer room. The core switch acts as the server vlan db - whereas the branch building's floor switches are the client vlan db.

So as soon as the backup starts off from access vlan 29, addressing all of the servers it has to back up on vlan 20, all hell breaks loose on the vlan 20 network. Unfortunately this includes some WAN bridged connections on the same subnet, which now disconnect: being on the same subnet and not partitioned off, they also take on board all this excess traffic, which swamps the slow bridged 2 Mb link, and they drop off.
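To put rough numbers on why the bridged link drops, here is a minimal Python sketch. It assumes the backup server's NIC runs at 1 Gb/s line rate and the bridged WAN link carries 2 Mb/s; both figures are read from the description above and are assumptions, not measurements.

```python
def saturating_fraction(source_bps: float, link_bps: float) -> float:
    """Return the fraction of the source rate that is enough to fill the link."""
    return link_bps / source_bps


# Assumed rates (not measured): 1 Gb/s backup NIC, 2 Mb/s bridged WAN link.
GIGABIT_NIC_BPS = 1_000_000_000
WAN_BRIDGE_BPS = 2_000_000

fraction = saturating_fraction(GIGABIT_NIC_BPS, WAN_BRIDGE_BPS)
print(f"{fraction:.1%} of the backup stream is enough to fill the WAN bridge")
```

Under those assumptions, only a tiny share of the backup traffic needs to reach the bridge before it is full, which is consistent with the links dropping as soon as the backup starts.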

Of course I will partition off these critical bridged connections in response to this now - but we never had this problem before when the backup server was plugged into access vlan 20 at HQ straight on the same subnet as the servers. So what has inter-vlan routing got to do with changing things so dramatically like this? And how can I solve it?
Question by:klwn
LVL 10

Assisted Solution

by:kyleb84
kyleb84 earned 500 total points
ID: 22762872
I'm afraid that an issue such as this requires quite a bit more attention and testing before a conclusion can be made as to what the direct issue is.

If I were you, I would isolate the backup server even more by putting it on VLAN 21 for example, setting the VLAN priority to 2 across all the switches, and making a VLAN 21 path all the way to the server, with another network card in it.

So the backup server would be 172.21.0.10 and the main server's second network adaptor's IP would be 172.21.0.20.

This would completely isolate all backup traffic, and since the VLAN's priority is 2, all other VLAN traffic would default to Pri-3 and even when the backup is processing it should not take as much of a hit on your network.

--------------------

I doubt it's a broadcast storm, but it's possible there could be a routing loop.

Some things I would look at are:
- The bytes/second rate on the server's NIC, compared to your switches' uplinks/trunks along the data path.
- Plug a laptop into switches along the way, doing a port mirror + short packet capture on each switch for its trunking ports, looking for duplicate packets + decreasing TTL values.
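The duplicate-packet check can be sketched in Python once the capture has been exported to simple records. This is only an illustration: the `Pkt` record, the `find_looping_packets` helper and the sample addresses are hypothetical, and it assumes the source, destination and IP ID fields together identify one packet.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Pkt(NamedTuple):
    """One captured IPv4 packet header (hypothetical record layout)."""
    src: str
    dst: str
    ip_id: int
    ttl: int


def find_looping_packets(packets: Iterable[Pkt]) -> Dict[Tuple[str, str, int], List[int]]:
    """Return packets seen more than once with a strictly decreasing TTL.

    The same (src, dst, IP ID) reappearing with a lower TTL each time
    suggests the packet is being routed around in a circle.
    """
    ttls_by_packet: Dict[Tuple[str, str, int], List[int]] = defaultdict(list)
    for p in packets:
        ttls_by_packet[(p.src, p.dst, p.ip_id)].append(p.ttl)

    looping = {}
    for key, ttls in ttls_by_packet.items():
        if len(ttls) > 1 and all(a > b for a, b in zip(ttls, ttls[1:])):
            looping[key] = ttls
    return looping


# Made-up sample capture, in capture order; addresses are illustrative only.
capture = [
    Pkt("172.29.0.10", "172.20.0.50", 1, 64),
    Pkt("172.29.0.10", "172.20.0.50", 1, 62),
    Pkt("172.29.0.10", "172.20.0.50", 1, 60),
    Pkt("172.29.0.10", "172.20.0.50", 2, 64),
]
for key, ttls in find_looping_packets(capture).items():
    print(key, "seen with TTLs", ttls)
```

Duplicates with an unchanged TTL are deliberately not flagged, since a mirror port that copies both directions of a trunk can show the same frame twice without any loop.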

If it's not a routing loop issue, time to look at bottlenecks / interface issues:
- Is it GbE all the way through?
- Are there packet errors on any interface on the way?
- Check for large % of CRC errors on interfaces
- Check for duplex mismatches
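As a rough way to judge what counts as a "large %" of CRC errors, the counters read from each interface can be turned into a percentage. The sketch below is a hedged example: the counter values are invented for illustration, `Gi0/1` is a hypothetical uplink name, and the 1% threshold is an assumption rather than a vendor recommendation.

```python
from typing import Dict, List, Tuple


def crc_error_percent(crc_errors: int, input_packets: int) -> float:
    """Return CRC errors as a percentage of received packets."""
    if input_packets == 0:
        return 0.0
    return 100.0 * crc_errors / input_packets


def flag_interfaces(counters: Dict[str, Tuple[int, int]], threshold_percent: float) -> List[str]:
    """Return interface names whose CRC error percentage exceeds the threshold."""
    return [
        name
        for name, (input_packets, crc_errors) in counters.items()
        if crc_error_percent(crc_errors, input_packets) > threshold_percent
    ]


# Invented sample counters: interface -> (input packets, CRC errors).
sample_counters = {
    "Gi0/3": (1_000_000, 25_000),
    "Gi0/1": (1_000_000, 3),
}
ASSUMED_THRESHOLD_PERCENT = 1.0  # assumption, tune to taste

print(flag_interfaces(sample_counters, ASSUMED_THRESHOLD_PERCENT))
```

A port with a high CRC percentage on one side only is also worth checking for a duplex mismatch, as in the last bullet above.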

Lastly double check your configs:
- Though it's easier to do it via VTP, maybe manually configure each switch for VLAN membership
- Disable ip routing on switches that don't need to perform it
- Use more VLANs!


Good luck, and let me know how you go!

Author Comment

by:klwn
ID: 22763478
Good advice, and yes, segmenting traffic into more VLANs is the way to go, except I still don't understand why this has only started since moving to a different VLAN.
And with your example about creating a dedicated VLAN to the server - I'm afraid it's not just the one server - but a Computer Room full of servers (about 80 of them). So the TSM backup starts from VLAN 29 and inter-vlan routes immediately to VLAN 20 where all these servers (not just the one server) get backed up very very quickly (due to the new VTL). Unfortunately this same VLAN 20 extends throughout the rest of the building and extends out further to some bridged WAN connections affecting all these also. Basically anything on VLAN 20 is hammered! And this seems to have happened since moving the backup server to another VLAN (or it might be also due to the new VTL library making backup streams run very quickly thus possibly intensifying network traffic in doing so).
Can you think of a quick fix to control this traffic for now? The backup server is connected via a single 1 Gb NIC. Can I lower the priority of the traffic coming from the 3560 port it's plugged into, for example, until I find quality time to get to the bottom of it?
LVL 10

Accepted Solution

by:kyleb84
kyleb84 earned 500 total points
ID: 22763606
Hmm, since it's on VLAN 20 with many other devices, you'd still have to isolate the backup server to reduce its priority across most of the network.

A few "get by" solutions that might help:

Move the backup server back to the other servers?
Create another VLAN all the way to the Backup server from the core switches?
Create ACLs in the routing switch to set the 802.1p value of the backup traffic?

----------------------------

From a diagnostic point, when you say the "WAN" devices become swamped with data because they're on the same VLAN as the Backup server - how do these WAN bridges connect to the network in relation to the Backup server?

How does this VTL work?
- Does it use broadcast/multicast to transfer the data?

Author Comment

by:klwn
ID: 22765293
Thanks for the suggestions... because we are looking for a work around, we are certainly going to move the backup server back onto the same network as the other servers, thus keeping this pack on the same VLAN 20. Although theoretically inter-vlan routing should have nothing to do with the increased traffic unless you think otherwise - putting it back on the same VLAN just puts it back to its original location where problems were not prevalent.
A VTL is a 'Virtual Tape Library' - which is a bunch of spinning hard disks pretending to be tapes - hence the much quicker backup times now that we have recently changed from backing up to tapes to backing up to disks - which is why I think the traffic is so much more intense. However, if things go back to normal by bringing the backup back onto VLAN 20 then I guess it's nothing to do with this factor.
Your other suggestion of putting the backup server onto its own VLAN surely won't eliminate the fact that it would still have to inter-vlan route out to VLAN 20, where all of the 80 servers it has to back up exist? So why would you do that?
I think changing the 802.1p value of the backup traffic would have merit if it means lowering the priority of the packets coming out of the switchport. For example our backup server sits on switchport Gi0/3 - and I would like to lower the priority of all the packet streams coming out of that port - what command would you put in to achieve this? I don't think you would need an ACL, just a direct command on that switchport but an example would be nice.
The WAN bridges are simply connections to other very small branch offices containing no more than 5 people. At the time it seemed prudent just to bridge these connections, which effectively puts them on the same subnet, which of course I see as something 'not' to do in future. They have never been affected before - but we have never had such a busy backup traffic problem before either. Nevertheless - lesson learnt there despite the small numbers involved.
LVL 10

Expert Comment

by:kyleb84
ID: 22782915
Hmm you seem to have misunderstood some of my questions..

"From a diagnostic point, when you say the "WAN" devices become swamped with data because they're on the same VLAN as the Backup server - how do these WAN bridges connect to the network in relation to the Backup server?"

Where I was going with this is that if the WAN devices are plugged directly into the switches that handle this backup data, it could still be a case of the switches being overutilised.

"How does this VTL work?
- Does it use broadcast/multicast to transfer the data?"

I know what a VTL is, I want to know its method of communication when backing up.
TCP? UDP? Multicast?

Where I was going with this is that if it uses multicast, a configuration error could lead to network chaos.
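One way to answer the unicast/multicast question from a packet capture is to classify the destination addresses seen during a backup. The following Python sketch uses only the standard `ipaddress` module; the `/16` mask on the 172.20 subnet and the sample addresses are assumptions made for illustration.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def classify_destination(dst: str, subnet: str) -> str:
    """Classify an IPv4 destination as 'multicast', 'broadcast' or 'unicast'."""
    addr = ipaddress.ip_address(dst)
    net = ipaddress.ip_network(subnet)
    if addr.is_multicast:
        return "multicast"
    # Either the subnet's directed broadcast or the limited broadcast address.
    if addr == net.broadcast_address or addr == ipaddress.ip_address("255.255.255.255"):
        return "broadcast"
    return "unicast"


def summarise(destinations: Iterable[str], subnet: str) -> Counter:
    """Count how many captured destinations fall into each class."""
    return Counter(classify_destination(d, subnet) for d in destinations)


# Assumed mask for VLAN 20; the thread only gives "172.20 subnet".
VLAN20_SUBNET = "172.20.0.0/16"

# Made-up destinations standing in for a real capture.
sample = ["172.20.0.50", "172.20.0.51", "172.20.255.255", "239.1.1.1"]
print(summarise(sample, VLAN20_SUBNET))
```

If nearly everything is unicast, a true broadcast or multicast storm becomes less likely and attention shifts to why unicast frames are reaching ports they should not.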



Author Comment

by:klwn
ID: 22783455
After all this the problem has been solved!

It ended up being the network card rather than anything else. The backup administrator had plugged the backup server into the switch whilst it was still configured as a team pair - but without the second pair member. Also, while he was there he upgraded the NIC with the latest drivers and hey presto, absolutely no problems exist at all now.

After this, even I will be looking for respectable, easy-to-use network monitoring software that allows me to be more proactive in finding problems like this - so the search starts now. Nothing too expensive if I can avoid it.
LVL 10

Assisted Solution

by:kyleb84
kyleb84 earned 500 total points
ID: 22791308
Wow, how odd is that?

OpenNMS is a favourite of mine for SNMP/Net monitoring.


Have a look at:
http://en.wikipedia.org/wiki/Network_monitoring_comparison

It's got a feature list / licensing info for some common NMS applications.
