Solved

Spanning-Tree Packets where there should be none

Posted on 2008-10-02
Last Modified: 2010-04-21
I am troubleshooting some strange network problems on a very small network.  We have one LOB application (Eclipse from Galactek) that intermittently locks up.  It is the only application on the network that is experiencing any problem.  

The network consists of a WatchGuard firewall / router, a single Netgear unmanaged switch, a Dell PowerEdge 2900 running SBS2003 (from an original shrink wrap), and about 6 desktops and laptops (one of which is, for troubleshooting purposes, currently hosting the problem application).  

Troubleshooting has been long and complicated.  We have replaced the Sentinel dongle, moved the software offsite, moved it to another machine on the network, temporarily removed anti-virus, and even temporarily removed the entire server.  Taking the software offsite or removing the entire server are the only two changes that seem to make a difference.  We have now reached the point where we are running Wireshark to look at the actual data on the wire.  

We notice the following strange behavior that seems to occur whenever we have the lockup problems.  The host machine sends a large stream of data (spanning multiple packets).  During that stream of packets we get a series of Spanning-Tree packets.
 
   Source:          Hughes_00:00:01
   Destination:   Spanning-tree-(for-bridges)_01
   Protocol:        CTRL
   Info:               MAC PAUSE: Quanta 65535

Since the application seems to crash consistently whenever this behavior occurs, we strongly suspect the two are related.  There is only one switch on the network.  Why are there Spanning-Tree packets on the network at all?

As mentioned earlier, troubleshooting seems to indicate that when we remove the Dell server from the network the problem does not occur.  This may be coincidence or it may be part of the problem.  If it is part of the problem...WHY?  We have replaced the Broadcom NIC in the server with an Intel NIC.  Teaming is NOT enabled, nor was it ever.

I love a challenge, but I'm banging my head against the wall on this one.  Any ideas people?
Question by:ITnavigators
24 Comments
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22627406
It sounds like the server is trying to process Spanning-Tree BPDUs. Do you have multiple NICs in the server? What is the actual source MAC address of the BPDUs?
 
LVL 1

Author Comment

by:ITnavigators
ID: 22627937
Nice hunch.  The actual packet summary follows.
   Ethernet II, Src: Hughes_00:00:01 (00:00:10:00:00:01), Dst: Spanning-tree-(for-bridges)_01 (01:80:c2:00:00:01).  

The source refers to Sytek (now Hughes LAN Systems), which created the NetBIOS protocol.  I'm not familiar enough with the specifics below that level, but I think it suggests that the server O/S may be trying to process the BPDUs.  

Yes there are multiple NICs.  The two Broadcoms on the motherboard are disabled.  We added a dual port Intel NIC card.  One of those is also disabled.  None of the ports are currently teamed.
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22630014
So even though there are multiple NICs in the server, only one is connected, right?

I don't see anything in your list of equipment that would be generating BPDUs. Since you have no managed switches, there's no way to determine where this BPDU is coming from, other than watching the BPDUs with a protocol analyzer and unplugging devices one at a time.

Or you could figure out why your server is trying to process them. I don't know Microsoft, but I wonder if having more than one NIC enables spanning tree on the server?
 
LVL 1

Author Comment

by:ITnavigators
ID: 22630167
We are planning to disable NetBIOS on the server tonight.  It isn't required on our network and could potentially be chatting with the unmanaged switch.  I am seeing information on the net that Windows NetBIOS can cause this type of problem.  

Will let you know what happens.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22637483
Disabled NetBIOS on the server (and on the temporary host machine).  Still getting spanning tree packets.  Eclipse started producing errors as well.  They must still use NetBIOS.  :)  It is based on a FairCom database.
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22638123
You say that the crash happens when you see these BPDUs. Yet you don't have any managed switches. Therefore, you should not be seeing BPDUs. I would try to find the source of the BPDUs.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22638401
I agree.  I sure wish there was some data in the packets that positively identified where they were coming from.  
 
LVL 1

Accepted Solution

by:MrRichardHead
MrRichardHead earned 500 total points
ID: 22766124
Don't know if you have solved this problem, but I had an identical problem and think this is the answer. The frames (not packets, as all this is happening at Layer 2) that you are seeing in Wireshark are generated by Ethernet flow control and not by the Spanning Tree algorithm. Ethernet flow control PAUSE frames use the MAC multicast address 01-80-C2-00-00-01. The Bridge Protocol Data Units used by Spanning Tree to work out network topology use the neighboring address 01-80-C2-00-00-00, and Wireshark's name resolution labels that whole reserved block as Spanning-tree-(for-bridges), so the flow control frames look like Spanning Tree frames. The Info column (MAC PAUSE: Quanta 65535) gives them away.
http://en.wikipedia.org/wiki/Ethernet_flow_control
http://en.wikipedia.org/wiki/Spanning_tree_protocol
With Ethernet flow control either the NIC on the server or the port on the switch sends a frame saying 'Don't send any more frames for x amount of time because I can't process the ones I have already'. Why this is happening is a completely different question.
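For anyone who wants to check a capture by hand, here is a minimal Python sketch of the distinction described above. It assumes raw Ethernet frame bytes without the FCS. The function name `classify_frame` and the sample frame are illustrative: the sample is rebuilt from the addresses and quanta value quoted in the question, not taken from a real capture file.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # 802.3x MAC Control (PAUSE) multicast
STP_DST = bytes.fromhex("0180c2000000")    # 802.1D bridge group address (BPDUs)
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PAUSE = 0x0001


def classify_frame(frame: bytes) -> str:
    """Label a raw Ethernet frame as a PAUSE frame, an STP BPDU, or other."""
    if len(frame) < 14:
        return "too short"
    dst = frame[0:6]
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len == ETHERTYPE_MAC_CONTROL and len(frame) >= 18:
        opcode, quanta = struct.unpack("!HH", frame[14:18])
        if opcode == OPCODE_PAUSE:
            return f"MAC PAUSE, quanta={quanta}"
        return "MAC Control (other opcode)"
    # BPDUs are 802.3 length frames carrying LLC with DSAP/SSAP 0x42.
    if dst == STP_DST and type_or_len <= 1500 and frame[14:16] == b"\x42\x42":
        return "STP BPDU"
    return "other"


# Illustrative frame: dst 01:80:c2:00:00:01, src 00:00:10:00:00:01,
# MAC Control EtherType, PAUSE opcode, quanta 65535, zero padding to 60 bytes.
sample = (
    PAUSE_DST
    + bytes.fromhex("000010000001")
    + struct.pack("!HHH", ETHERTYPE_MAC_CONTROL, OPCODE_PAUSE, 0xFFFF)
    + bytes(42)
)
print(classify_frame(sample))
```

The check keys on the EtherType and opcode rather than the resolved address name, which is the part of the Wireshark display that caused the confusion here.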
 
LVL 1

Author Comment

by:ITnavigators
ID: 22767076
Thank you for the response and the clarification.  I will certainly check into that.  

Since it seems to occur when the host machine is sending large packets, that may make sense.  The question remains...why...and what can be done about it.
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22768785
Good catch Richard. I didn't notice the 01 at the end of the MAC address. I've never turned this feature on (I'm hoping it's off by default) but it certainly sounds like the problem.
 
LVL 1

Assisted Solution

by:MrRichardHead
MrRichardHead earned 500 total points
ID: 22774543
Ethernet flow control at both the switch and the NIC is usually turned on by default. I wouldn't turn it off or you might start losing data. What you need to do is address the fact that too much data is trying to go down that Ethernet connection. We are currently investigating teaming multiple NICs - will let you know if this works. In the original post you don't mention the speed of the switch: if it is not a Gigabit switch, you could try upgrading the switch.
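To put the "Quanta 65535" value from the capture in perspective, here is a rough Python sketch of how long one such PAUSE request lasts at different link speeds. It relies on the 802.3x definition of one pause quantum as 512 bit times; the function name `pause_seconds` is just for illustration, and the figures are an upper bound, since a device can send a PAUSE with quanta 0 to resume early.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum is 512 bit times


def pause_seconds(quanta: int, link_bps: int) -> float:
    """Maximum pause duration requested by a PAUSE frame on a given link speed."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


for label, bps in (
    ("10 Mb/s", 10_000_000),
    ("100 Mb/s", 100_000_000),
    ("1 Gb/s", 1_000_000_000),
):
    print(f"{label}: {pause_seconds(65535, bps) * 1000:.1f} ms")
```

So 65535 is the maximum value: roughly 34 ms of silence on a gigabit link and about a third of a second on a 100 Mb/s link, per frame, which is a long time for an application that is in the middle of a burst.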
 
LVL 1

Author Comment

by:ITnavigators
ID: 22776531
The switch is an unmanaged Ethernet switch.  

The server has two onboard Broadcom gig ports.  During testing those were disabled and a two port Intel gig card was added.  Only one port is currently in use.  The other is disabled.

The host machine (the LOB application was moved to a separate computer for diagnostic purposes) also has a gig port.  

I suspect the client machines are currently running 10/100 cards.  They are not that new.  It is certainly possible that the LOB application is pushing data to the client machines faster than the client machines can handle.  We see the problem when they are bursting data.  For diagnostics I may have the host machine use a 10/100 port.  If we can slow down the send it shouldn't overrun the receive.
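A back-of-the-envelope Python sketch of the speed mismatch described above: how quickly a switch port buffer fills when a gigabit sender bursts toward a 100 Mb/s client. The 128 KB buffer size is a made-up figure for illustration, not a specification of the Netgear switch, and `seconds_until_full` is a hypothetical helper name.

```python
def seconds_until_full(buffer_bytes, ingress_bps, egress_bps):
    """Time for a buffer to fill when data arrives faster than it drains.

    Returns None when the egress side keeps up and the buffer never fills.
    """
    net_bps = ingress_bps - egress_bps
    if net_bps <= 0:
        return None
    return buffer_bytes * 8 / net_bps


BUFFER_BYTES = 128 * 1024  # hypothetical per-port buffer size

fill = seconds_until_full(BUFFER_BYTES, 1_000_000_000, 100_000_000)
print(f"gigabit in, 100 Mb/s out: buffer full after {fill * 1000:.2f} ms")

same_speed = seconds_until_full(BUFFER_BYTES, 100_000_000, 100_000_000)
print(f"100 Mb/s in, 100 Mb/s out: {same_speed}")
```

Under these assumptions the buffer is exhausted in a little over a millisecond of sustained burst, which is when a switch with flow control enabled would start sending PAUSE frames back toward the sender. With matched speeds the buffer never fills, which is the idea behind the 10/100 test.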
 
LVL 1

Author Comment

by:ITnavigators
ID: 22776537
Oops...  The switch is an unmanaged gigabit switch.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22786722
We replaced the gigabit switch with a 10/100 switch for the duration of the test.  The entire network will be one giant bottleneck, but at least everything will be at the same speed.  

Will post the results.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22793893
We made it through an entire day without a single glitch.  Slowing the entire network down to 5-year-old technology made quite a difference.  If we make it through one more, we may have a solution.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22839464
Problems are back (though I haven't yet verified the presence of the flow control frames).  This is a pesky problem.
 
LVL 1

Author Comment

by:ITnavigators
ID: 22852002
On a recommendation from Microsoft, we contacted the LOB vendor for instructions on how to move the executable to the client machines (instead of launching it across the network).  Unfortunately we still have problems.  Starting another round of sniffing.
 
LVL 1

Author Comment

by:ITnavigators
ID: 23255840
Just a follow-up for the record (and the benefit of anyone searching this issue).

Repeated the test of removing the server from the network.  Obviously, you lose all access to server based resources (authentication, shared printers, shared folders, Exchange, etc).  But the Eclipse problem goes away for as long as the server is disconnected from the LAN.  The only software that has a problem is the Eclipse program.  But somehow the server is involved.  

We reinstalled the operating system on the same box.  No change.  

Pulled up task manager on a client machine that was locked up.  Eclipse is using 85% of the processor...but doing nothing.  Asked Galactek for symbol files so that we could run Process Monitor and at least see what it is doing when it is so busy doing nothing.  No response.

Ran a full hardware level diagnostic on the server.  Absolutely nothing failed.  Not even any soft errors.

Random failures occur throughout the day -- but there are a couple of times a day when glitches are almost guaranteed.  9:30am (plus or minus 10 minutes) and 11:30am (plus or minus 10 minutes).  Nothing in the server event logs or scheduled tasks relating to that time.  
 
LVL 1

Author Comment

by:ITnavigators
ID: 23590927
Followup (for anyone looking for this solution)...

Using Process Explorer from SysInternals, we noted that during a glitch the Eclipse application is consuming roughly 80% of the CPU.  We requested symbol tables from Galactek but our request was denied.  We used Process Monitor and captured a glitch in progress.  The application is stuck in an infinite loop reading two items out of the registry.  The loop appears to be inside a function or subroutine, as it completes several steps on each pass.  

We provided the information to Galactek but have not heard back.

We have since captured a ProcMon log going into the loop.  There are several items of interest but nothing conclusive yet.  We are attempting to capture more crashes to identify what is happening that leads into the crash.

It seems clear to me that there is a bug in the program, but until we can identify for them what triggers the bug, we are stuck with the blame.
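For anyone repeating this kind of analysis, a rough Python sketch of how a Process Monitor log exported to CSV could be summarized to spot a loop like this. It assumes the default ProcMon export columns (Process Name, Operation, Path and so on); the process name `eclipse.exe`, the registry paths, and the sample rows are all made up for illustration and are not from our actual log.

```python
import csv
import io
from collections import Counter


def hot_registry_paths(rows, process_name, top=2):
    """Count registry operations per path for one process, most frequent first."""
    counts = Counter(
        row["Path"]
        for row in rows
        if row["Process Name"].lower() == process_name.lower()
        and row["Operation"].startswith("Reg")
    )
    return counts.most_common(top)


# Tiny made-up export; a real one would be read with open(path, newline="").
sample_csv = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"9:31:02 AM","eclipse.exe","1234","RegQueryValue","HKLM\Software\Example\A","SUCCESS",""
"9:31:02 AM","eclipse.exe","1234","RegQueryValue","HKLM\Software\Example\B","SUCCESS",""
"9:31:02 AM","eclipse.exe","1234","RegQueryValue","HKLM\Software\Example\A","SUCCESS",""
"9:31:02 AM","explorer.exe","800","ReadFile","C:\temp\x.txt","SUCCESS",""
"9:31:02 AM","eclipse.exe","1234","RegQueryValue","HKLM\Software\Example\B","SUCCESS",""
"9:31:02 AM","eclipse.exe","1234","RegQueryValue","HKLM\Software\Example\A","SUCCESS",""
'''

hot = hot_registry_paths(csv.DictReader(io.StringIO(sample_csv)), "eclipse.exe")
for path, count in hot:
    print(count, path)
```

In a capture of a real hang, a couple of paths with counts far above everything else is the signature to look for. It only shows what the process is spinning on, not what pushed it into the loop.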
 
LVL 1

Author Comment

by:ITnavigators
ID: 23687585
Followup (for anyone looking for this solution)...

There seems to be no consistency in anything that triggers this problem.  Working with another provider, we have disabled all TCP Offload functions on the server.  Still no change.
 
LVL 1

Author Comment

by:ITnavigators
ID: 24450622
Long overdue update.  We have had a second Technology Provider go back through the server and network with a fine-tooth comb.  He doesn't see anything unusual.  He repeated a number of the tests done above with the same results.  The customer has contacted the vendor and has asked to evaluate the Client Server version of the software.  Like the basic version, this is built on a FairCom database.  

I'm not expecting it to resolve the problem, but will try to keep an open mind.
 
LVL 1

Author Comment

by:ITnavigators
ID: 24683809
Another long overdue update.  The client finally convinced Galactek to upgrade them to the Client Server version of the software as a test.  Thus far, there have been no issues.  Not all of the features have been re-enabled, but we are anxiously awaiting the outcome of the test.
 
LVL 1

Author Comment

by:ITnavigators
ID: 25156733
Upgrading the product appears to have solved the problem.  No idea how or what the base issue actually was.  I suspect that the backend DB is just not that robust.
 
LVL 1

Author Closing Comment

by:ITnavigators
ID: 31620360
The answer didn't specifically solve the problem, but pointed us to the source of the problem.  Customer is now running a client-server version of the software which is not experiencing the same issue.