Solved

Spanning-Tree Packets where there should be none

Posted on 2008-10-02
Last Modified: 2010-04-21
I am troubleshooting some strange network problems on a very small network.  We have one LOB application (Eclipse from Galactek) that intermittently locks up.  It is the only application on the network that is experiencing any problem.  

The network consists of a WatchGuard firewall / router, a single Netgear unmanaged switch, a Dell PowerEdge 2900 running SBS2003 (from an original shrink wrap), and about 6 desktops and laptops (one of which is, for troubleshooting purposes, currently hosting the problem application).  

Troubleshooting has been long and complicated.  We have replaced the Sentinel dongle, moved the software offsite, moved it to another machine on the network, temporarily removed anti-virus, and even temporarily removed the entire server.  Taking the software offsite or removing the entire server are the only two resolutions that seem to make a difference.  We have now reached the point where we are running Wireshark to look at the actual data on the wire.  

We notice the following strange behavior that seems to occur whenever we have the lockup problems.  The host machine sends a large stream of data (spanning multiple packets).  During that stream of packets we get a series of Spanning-Tree packets.
 
   Source:          Hughes_00:00:01
   Destination:   Spanning-tree-(for-bridges)_01
   Protocol:        CTRL
   Info:               MAC PAUSE: Quanta 65535

Since the application seems to crash consistently whenever this behavior occurs, we strongly suspect the two are related.  There is only one switch on the network.  Why are there Spanning-Tree packets on the network at all?

As mentioned earlier, troubleshooting seems to indicate that when we remove the Dell server from the network the problem does not occur.  This may be coincidence or it may be part of the problem.  If it is part of the problem...WHY?  We have replaced the Broadcom NIC in the server with Intel.  Teaming is NOT nor was it ever enabled.

I love a challenge, but I'm banging my head against the wall on this one.  Any ideas people?
Question by:ITnavigators
24 Comments
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22627406
It sounds like the server is trying to process Spanning-Tree BPDUs. Do you have multiple NICs in the server? What is the actual source MAC address of the BPDUs?
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22627937
Nice hunch.  The actual packet summary follows.
   Ethernet II, Src: Hughes_00:00:01 (00:00:10:00:00:01), Dst: Spanning-tree-(for-bridges)_01 (01:80:c2:00:00:01).  

The source refers to Sytek (now Hughes LAN Systems), which created the NetBIOS protocol.  I'm not familiar enough with the specifics below that level, but I think it suggests that the server O/S may be trying to process the BPDUs.  

Yes there are multiple NICs.  The two Broadcoms on the motherboard are disabled.  We added a dual port Intel NIC card.  One of those is also disabled.  None of the ports are currently teamed.
0
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22630014
So even though there are multiple NICs in the server, only one is connected, right?

I don't see anything in your list of equipment that would be generating BPDUs. Since you have no managed switches, there's no way to determine where this BPDU is coming from, other than watching the BPDUs with a protocol analyzer and unplugging devices one at a time.

Or you could figure out why your server is trying to process them. I don't know Microsoft, but I wonder if having more than one NIC enables spanning tree on the server?
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22630167
We are planning to disable NetBIOS on the server tonight.  It isn't required on our network and could potentially be chatting with the unmanaged switch.  I am seeing information on the net that Windows NetBIOS can cause this type of problem.  

Will let you know what happens.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22637483
Disabled NetBIOS on the server (and on the temporary host machine).  Still getting spanning tree packets.  Eclipse started producing errors as well.  They must still use NetBIOS.  :)  It is based on a FairCom database.
0
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22638123
You say that the crash happens when you see these BPDUs. Yet you don't have any managed switches. Therefore, you should not be seeing BPDUs. I would try to find the source of the BPDUs.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22638401
I agree.  I sure wish there was some data in the packets that positively identified where they were coming from.  
0
 
LVL 1

Accepted Solution

by:
MrRichardHead earned 2000 total points
ID: 22766124
Don't know if you have solved this problem, but I had an identical problem and think this is the answer. The frames (not packets, as all of this is happening at Layer 2) that you are seeing in Wireshark are generated by Ethernet flow control and not by the spanning tree algorithm. Ethernet flow control uses the MAC multicast address 01-80-C2-00-00-01. The Bridge Protocol Data Units that spanning tree uses to work out the network topology are sent to the neighboring address 01-80-C2-00-00-00 in the same reserved block, and Wireshark resolves the whole block to the name "Spanning-tree-(for-bridges)", so the Ethernet flow control frames are displayed as if they were spanning tree frames.
http://en.wikipedia.org/wiki/Ethernet_flow_control
http://en.wikipedia.org/wiki/Spanning_tree_protocol
With Ethernet flow control either the NIC on the server or the port on the switch sends a frame saying 'Don't send any more frames for x amount of time because I can't process the ones I have already'. Why this is happening is a completely different question.
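
For anyone who wants to confirm this from a raw capture, here is a minimal sketch of how a PAUSE frame can be recognized, assuming the standard 802.3x layout (EtherType 0x8808, opcode 0x0001, then a 16-bit pause time). The function name and the sample frame are made up for illustration; the sample is built from the source and destination addresses and the quanta value quoted earlier in this thread, padded with zeros.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808                 # EtherType for MAC control frames
PAUSE_OPCODE = 0x0001                          # MAC control opcode for PAUSE
PAUSE_DST = bytes.fromhex("0180c2000001")      # reserved multicast used by PAUSE


def parse_pause_frame(frame: bytes):
    """Return details of an 802.3x PAUSE frame, or None if it is something else."""
    if len(frame) < 18:
        return None
    dst, src = frame[0:6], frame[6:12]
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "quanta": quanta,
        "to_reserved_address": dst == PAUSE_DST,
    }


# Hypothetical sample: addresses and quanta taken from the capture summary above,
# zero padding added to reach the 60-byte minimum (FCS not included).
sample = bytes.fromhex("0180c2000001" "000010000001" "8808" "0001" "ffff") + bytes(42)

print(parse_pause_frame(sample))
```

A frame that returns None here but is addressed to 01-80-C2-00-00-00 would be worth looking at as a real BPDU; a frame that parses as above is flow control.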
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22767076
Thank you for the response and the clarification.  I will certainly check into that.  

Since it seems to occur when the host machine is sending large packets, that may make sense.  The question remains...why...and what can be done about it.
0
 
LVL 50

Expert Comment

by:Don Johnston
ID: 22768785
Good catch Richard. I didn't notice the 01 at the end of the MAC address. I've never turned this feature on (I'm hoping it's off by default) but it certainly sounds like the problem.
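
Since the last octet is so easy to miss, a small lookup like the one below may help when reading captures. It is only a sketch: the function names are made up, and the table covers just a few well-known entries of the IEEE reserved block 01-80-C2-00-00-0X rather than the full list.

```python
# A few well-known destinations in the IEEE reserved block 01:80:C2:00:00:0X.
RESERVED = {
    "01:80:c2:00:00:00": "Spanning tree BPDUs (bridge group address)",
    "01:80:c2:00:00:01": "MAC control / 802.3x PAUSE (flow control)",
    "01:80:c2:00:00:02": "Slow protocols (e.g. LACP)",
    "01:80:c2:00:00:0e": "LLDP",
}


def normalize(mac: str) -> str:
    """Accept 01-80-C2-... or 01:80:c2:... and return lowercase colon form."""
    return mac.strip().lower().replace("-", ":")


def describe_destination(mac: str) -> str:
    mac = normalize(mac)
    if mac in RESERVED:
        return RESERVED[mac]
    if mac.startswith("01:80:c2:00:00:0"):
        return "Other IEEE reserved link-local address"
    return "Not in the IEEE reserved 01:80:C2:00:00:0X block"


print(describe_destination("01:80:c2:00:00:01"))
print(describe_destination("01-80-C2-00-00-00"))
```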
0
 
LVL 1

Assisted Solution

by:MrRichardHead
MrRichardHead earned 2000 total points
ID: 22774543
Ethernet flow control at both the switch and the NIC is usually turned on by default. I wouldn't turn it off or you might start losing data. What you need to do is address the fact that too much data is trying to go down that Ethernet connection. We are currently investigating teaming multiple NICs - will let you know if this works. In the original post you don't mention the speed of the switch. If it is not a Gigabit switch, you could try upgrading the switch.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22776531
The switch is an unmanaged Ethernet switch.  

The server has two onboard broadcom gig ports.  During testing those were disabled and a two port Intel gig card was added.  Only one port is currently in use.  The other is disabled.

The host machine (the LOB application was moved to a separate computer for diagnostic purposes) also has a gig port.  

I suspect the client machines are currently running 10/100 cards.  They are not that new.  It is certainly possible that the LOB application is pushing data to the client machines faster than the client machines can handle.  We see the problem when they are bursting data.  For diagnostics I may have the host machine use a 10/100 port.  If we can slow down the send it shouldn't overrun the receive.
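
For a sense of scale, here is a rough calculation of how long a single PAUSE frame with the quanta value from the capture (65535) asks the sender to stay quiet. It assumes the 802.3x definition of one pause quantum as 512 bit times on the link where the frame was sent; the function name is made up for illustration.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum = 512 bit times (IEEE 802.3x)


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Longest time a single PAUSE frame can hold off the sender."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


for label, bps in [("10 Mb/s", 10e6), ("100 Mb/s", 100e6), ("1 Gb/s", 1e9)]:
    print(f"{label}: {pause_seconds(65535, bps) * 1000:.1f} ms")
```

That works out to roughly 34 ms on a gigabit link and roughly a third of a second on a 100 Mb/s link for each frame, and a receiver that stays congested can keep sending new PAUSE frames before the previous one expires.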
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22776537
Oops...  The switch is an unmanaged gigabit switch.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22786722
We replaced the gigabit switch with a 10/100 switch for the duration of the test.  The entire network will be one giant bottleneck, but at least everything will be at the same speed.  

Will post the results.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22793893
We made it through an entire day without a single glitch.  Slowing the entire network down to 5 year old technology made quite a difference.  If we make it through one more, we may have a solution.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22839464
Problems are back (though I haven't yet verified the presence of the FlowControl frames).  This is a pesky problem.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 22852002
On recommendation from Microsoft we contacted the LOB vendor for instructions on how to move the executable to the client machines (instead of launching it across the network).  Unfortunately we still have problems.  Starting another round of sniffing.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 23255840
Just followup for the record (and the benefit of anyone searching this issue).

Repeated the test of removing the server from the network.  Obviously, you lose all access to server based resources (authentication, shared printers, shared folders, Exchange, etc).  But the Eclipse problem goes away for as long as the server is disconnected from the LAN.  The only software that has a problem is the Eclipse program.  But somehow the server is involved.  

We reinstalled the operating system on the same box.  No change.  

Pulled up task manager on a client machine that was locked up.  Eclipse is using 85% of the processor...but doing nothing.  Asked Galactek for symbol files so that we could run Process Monitor and at least see what it is doing when it is so busy doing nothing.  No response.

Ran a full hardware level diagnostic on the server.  Absolutely nothing failed.  Not even any soft errors.

Random failures occur throughout the day -- but there are a couple of times a day when glitches are almost guaranteed.  9:30am (plus or minus 10 minutes) and 11:30am (plus or minus 10 minutes).  Nothing in the server event logs or scheduled tasks relating to that time.  
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 23590927
Followup (for anyone looking for this solution)...

Using Process Explorer from SysInternals, we noted that during a glitch the Eclipse application is consuming roughly 80% of the CPU.  We requested symbol tables from Galactek but our request was denied.  We used Process Monitor and captured a glitch in progress.  The application is stuck in an Infinite Loop reading two items out of the registry.  It appears to be a function or subroutine that is in the loop, as it completes several steps as a part of the loop.  

We provided the information to Galactek but have not heard back.

We have since captured a ProcMon log going into the loop.  There are several items of interest but nothing conclusive yet.  We are attempting to capture more crashes to identify what is happening that leads into the crash.

It seems clear to me that there is a bug in the program, but until we can identify for them what triggers the bug, we are stuck with the blame.
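
For anyone repeating this kind of analysis, a tight loop tends to stand out if you count how often each (operation, path) pair appears in a Process Monitor log saved as CSV. The sketch below assumes the default ProcMon column headers ("Process Name", "Operation", "Path"); the function name, the process name and the registry paths in the sample are hypothetical and are not taken from our actual capture.

```python
import csv
import io
from collections import Counter


def top_repeated_operations(csv_text: str, process_name: str, limit: int = 5):
    """Count (Operation, Path) pairs for one process and return the most frequent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        (row["Operation"], row["Path"])
        for row in reader
        if row["Process Name"].lower() == process_name.lower()
    )
    return counts.most_common(limit)


# Hypothetical sample data shaped like a ProcMon CSV export.
header = ['"Time of Day","Process Name","PID","Operation","Path","Result","Detail"']
loop_rows = [
    r'"9:31:02","Eclipse.exe","1234","RegQueryValue","HKLM\Example\ValueA","SUCCESS",""',
    r'"9:31:02","Eclipse.exe","1234","RegQueryValue","HKLM\Example\ValueB","SUCCESS",""',
]
other = [r'"9:31:02","explorer.exe","888","CreateFile","C:\Temp","SUCCESS",""']
sample = "\n".join(header + loop_rows * 500 + other)

for (operation, path), count in top_repeated_operations(sample, "Eclipse.exe"):
    print(count, operation, path)
```

A real export would be read from the saved file instead of a string; two registry reads dominating the counts by a wide margin is the pattern described above.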
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 23687585
Followup (for anyone looking for this solution)...

There seems to be no consistency in anything that triggers this problem.  Working with another provider, we have disabled all TCP Offload functions on the server.  Still no change.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 24450622
Long overdue update.  We have had a second Technology Provider go back through the server and network with a fine tooth comb.  Doesn't see anything unusual.  He repeated a number of the tests done above with the same results.  The customer has contacted the vendor and has asked to evaluate the Client Server version of the software.  Like the basic version, this is built on a Faircom database.  

I'm not expecting it to resolve the problem, but will try to keep an open mind.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 24683809
Another long overdue update.  The client finally convinced Galactek to upgrade them to the Client Server version of the software as a test.  Thus far, there have been no issues.  Not all of the features have been re-enabled, but we are anxiously awaiting the outcome of the test.
0
 
LVL 1

Author Comment

by:ITnavigators
ID: 25156733
Upgrading the product appears to have solved the problem.  No idea how, or what the base issue actually was.  I suspect that the backend DB is just not that robust.
0
 
LVL 1

Author Closing Comment

by:ITnavigators
ID: 31620360
The answer didn't specifically solve the problem, but it pointed us to the source of the problem.  Customer is now running a client-server version of the software which is not experiencing the same issue.
0

Question has a verified solution.
