• Status: Solved

Spanning-Tree Packets where there should be none

I am troubleshooting some strange network problems on a very small network.  We have one LOB application (Eclipse from Galactek) that intermittently locks up.  It is the only application on the network that is experiencing any problem.  

The network consists of a WatchGuard firewall / router, a single Netgear unmanaged switch, a Dell PowerEdge 2900 running SBS2003 (from an original shrink wrap), and about 6 desktops and laptops (one of which is, for troubleshooting purposes, currently hosting the problem application).  

Troubleshooting has been long and complicated.  We have replaced the Sentinel dongle, moved the software offsite, moved it to another machine on the network, temporarily removed anti-virus, and even temporarily removed the entire server.  Taking the software offsite or removing the entire server are the only two changes that seem to make a difference.  We have now reached the point where we are running Wireshark to look at the actual data on the wire.  

We notice the following strange behavior that seems to occur whenever we have the lockup problems.  The host machine sends a large stream of data (spanning multiple packets).  During that stream of packets we get a series of Spanning-Tree packets.
 
   Source:          Hughes_00:00:01
   Destination:   Spanning-tree-(for-bridges)_01
   Protocol:        CTRL
   Info:               MAC PAUSE: Quanta 65535

Since the application seems to consistently crash at the same time when this behavior occurs we strongly suspect they are related.  There is only one switch on the network.  Why are there Spanning-Tree packets on the network at all?

As mentioned earlier, troubleshooting seems to indicate that when we remove the Dell server from the network the problem does not occur.  This may be coincidence or it may be part of the problem.  If it is part of the problem...WHY?  We have replaced the Broadcom NIC in the server with Intel.  Teaming is NOT nor was it ever enabled.

I love a challenge, but I'm banging my head against the wall on this one.  Any ideas people?
Asked by: ITnavigators
 
Don Johnston (Instructor) Commented:
It sounds like the server is trying to process Spanning-Tree BPDUs. Do you have multiple NICs in the server? What is the actual source MAC address of the BPDUs?
 
ITnavigators (Author) Commented:
Nice hunch.  The actual packet summary follows.
   Ethernet II, Src: Hughes_00:00:01 (00:00:10:00:00:01), Dst: Spanning-tree-(for-bridges)_01 (01:80:c2:00:00:01).  

The source address resolves to Sytek (now Hughes LAN Systems), the company that created the NetBIOS protocol.  I'm not familiar enough with the specifics below that level, but I think it suggests that the server O/S may be trying to process the BPDUs.  

Yes there are multiple NICs.  The two Broadcoms on the motherboard are disabled.  We added a dual port Intel NIC card.  One of those is also disabled.  None of the ports are currently teamed.
 
Don Johnston (Instructor) Commented:
So even though there are multiple NICs in the server, only one is connected, right?

I don't see anything in your list of equipment that would be generating BPDUs. Since you have no managed switches, there's no way to determine where these BPDUs are coming from, other than watching them with a protocol analyzer and unplugging devices one at a time.

Or you could figure out why your server is trying to process them. I don't know Microsoft, but I wonder if having more than one NIC enables spanning tree on the server?
 
ITnavigators (Author) Commented:
We are planning to disable NetBIOS on the server tonight.  It isn't required on our network and could potentially be chatting with the unmanaged switch.  I am seeing information on the net that Windows NetBIOS can cause this type of problem.  

Will let you know what happens.
 
ITnavigators (Author) Commented:
Disabled NetBIOS on the server (and on the temporary host machine).  Still getting spanning tree packets.  Eclipse started producing errors as well.  They must still use NetBIOS.  :)  It is based on a FairCom database.
 
Don Johnston (Instructor) Commented:
You say that the crash happens when you see these BPDUs. Yet you don't have any managed switches. Therefore, you should not be seeing BPDUs. I would try to find the source of the BPDUs.
 
ITnavigators (Author) Commented:
I agree.  I sure wish there was some data in the packets that positively identified where they were coming from.  
 
MrRichardHead Commented:
Don't know if you have solved this problem, but I had an identical problem and think this is the answer. The frames (not packets, as all of this is happening at Layer 2) that you are seeing in Wireshark are generated by Ethernet flow control and not by the Spanning Tree algorithm. Ethernet flow control uses the MAC multicast address 01-80-C2-00-00-01. The Bridge Protocol Data Units that Spanning Tree uses to work out the network topology go to 01-80-C2-00-00-00, in the same reserved block of addresses, so Wireshark's name resolution labels the destination of the flow control frames as Spanning-tree-(for-bridges).
http://en.wikipedia.org/wiki/Ethernet_flow_control
http://en.wikipedia.org/wiki/Spanning_tree_protocol
With Ethernet flow control either the NIC on the server or the port on the switch sends a frame saying 'Don't send any more frames for x amount of time because I can't process the ones I have already'. Why this is happening is a completely different question.
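
For anyone who wants to check a capture by hand, here is a minimal Python sketch of how the two frame types can be told apart from the raw bytes. It assumes the frame starts at the destination MAC (no preamble or FCS). The function names are made up for illustration, and the sample frame is rebuilt from the addresses and quanta value quoted earlier in this thread rather than taken from a real capture. A PAUSE frame carries EtherType 0x8808 with opcode 0x0001, and each pause quantum is 512 bit times.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # MAC Control (PAUSE) multicast address
STP_DST = bytes.fromhex("0180c2000000")    # 802.1D bridge group address used by BPDUs
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
BITS_PER_QUANTUM = 512                     # one pause quantum is 512 bit times


def classify_frame(frame: bytes) -> str:
    """Return 'pause', 'stp', or 'other' for a raw Ethernet frame."""
    if len(frame) < 18:
        return "other"
    dst = frame[0:6]
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len == MAC_CONTROL_ETHERTYPE:
        (opcode,) = struct.unpack("!H", frame[14:16])
        return "pause" if opcode == PAUSE_OPCODE else "other"
    # BPDUs are 802.3 length-style frames with an LLC header (DSAP/SSAP 0x42).
    if dst == STP_DST and type_or_len <= 1500 and frame[14:16] == b"\x42\x42":
        return "stp"
    return "other"


def pause_quanta(frame: bytes) -> int:
    """Read the 16-bit pause time that follows the opcode in a PAUSE frame."""
    (quanta,) = struct.unpack("!H", frame[16:18])
    return quanta


def pause_seconds(quanta: int, link_bps: int) -> float:
    """Convert pause quanta to seconds for a given link speed in bits per second."""
    return quanta * BITS_PER_QUANTUM / link_bps


# Frame rebuilt from the summary in this thread: source 00:00:10:00:00:01, quanta 65535.
src = bytes.fromhex("000010000001")
frame = PAUSE_DST + src + struct.pack("!HHH", 0x8808, 0x0001, 65535) + bytes(42)

print(classify_frame(frame))
print(pause_quanta(frame))
print(pause_seconds(65535, 1_000_000_000))  # requested pause on a gigabit link
print(pause_seconds(65535, 100_000_000))    # requested pause on a 100 Mb/s link
```

A quanta value of 65535 is the longest pause a single frame can request: roughly 34 ms on a gigabit link and roughly 336 ms on a 100 Mb/s link.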
 
ITnavigators (Author) Commented:
Thank you for the response and the clarification.  I will certainly check into that.  

Since it seems to occur when the host machine is sending large packets, that may make sense.  The question remains...why...and what can be done about it.
 
Don Johnston (Instructor) Commented:
Good catch Richard. I didn't notice the 01 at the end of the MAC address. I've never turned this feature on (I'm hoping it's off by default) but it certainly sounds like the problem.
 
MrRichardHead Commented:
Ethernet flow control at both the switch and the NIC is usually turned on by default. I wouldn't turn it off or you might start losing data. What you need to do is address the fact that too much data is trying to go down that Ethernet connection. We are currently investigating teaming multiple NICs - will let you know if this works. In the original post you don't mention the speed of the switch: if it is not a Gigabit switch, you could try upgrading the switch.
 
ITnavigators (Author) Commented:
The switch is an unmanaged Ethernet switch.  

The server has two onboard Broadcom gig ports.  During testing those were disabled and a two-port Intel gig card was added.  Only one port is currently in use.  The other is disabled.

The host machine (the LOB application was moved to a separate computer for diagnostic purposes) also has a gig port.  

I suspect the client machines are currently running 10/100 cards.  They are not that new.  It is certainly possible that the LOB application is pushing data to the client machines faster than the client machines can handle.  We see the problem when they are bursting data.  For diagnostics I may have the host machine use a 10/100 port.  If we can slow down the send it shouldn't overrun the receive.
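
To put rough numbers on the overrun idea, here is a back-of-the-envelope Python sketch. The buffer size is purely a hypothetical figure (the actual per-port buffering of the Netgear switch is not known here), and the function name is made up for illustration. It estimates how long a sustained burst from a gigabit sender takes to fill a buffer that drains at 100 Mb/s.

```python
def seconds_until_buffer_full(buffer_bytes: int, ingress_bps: int, egress_bps: int) -> float:
    """Time for a sustained burst to fill a buffer that drains slower than it fills."""
    if ingress_bps <= egress_bps:
        return float("inf")  # the buffer never fills when the drain keeps up
    fill_rate_bytes_per_s = (ingress_bps - egress_bps) / 8
    return buffer_bytes / fill_rate_bytes_per_s


# Hypothetical 128 KiB of buffering; the real switch may have more or less.
HYPOTHETICAL_BUFFER_BYTES = 128 * 1024

t = seconds_until_buffer_full(HYPOTHETICAL_BUFFER_BYTES, 1_000_000_000, 100_000_000)
print(f"buffer full after about {t * 1000:.2f} ms of sustained burst")
```

Under that assumption the buffer fills in about a millisecond, at which point the switch has to either drop frames or send PAUSE frames back toward the sender.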
 
ITnavigators (Author) Commented:
Oops...  The switch is an unmanaged gigabit switch.
 
ITnavigators (Author) Commented:
We replaced the gigabit switch with a 10/100 switch for the duration of the test.  The entire network will be one giant bottleneck, but at least everything will be at the same speed.  

Will post the results.
 
ITnavigators (Author) Commented:
We made it through an entire day without a single glitch.  Slowing the entire network down to 5-year-old technology made quite a difference.  If we make it through one more, we may have a solution.
 
ITnavigators (Author) Commented:
Problems are back (though I haven't yet verified the presence of the FlowControl frames).  This is a pesky problem.
 
ITnavigators (Author) Commented:
On recommendation from Microsoft we contacted the LOB vendor for instructions on how to move the executable to the client machines (instead of launching them across the network).  Unfortunately we still have problems.  Starting another round of sniffing.
 
ITnavigators (Author) Commented:
Just followup for the record (and the benefit of anyone searching this issue).

Repeated the test of removing the server from the network.  Obviously, you lose all access to server based resources (authentication, shared printers, shared folders, Exchange, etc).  But the Eclipse problem goes away for as long as the server is disconnected from the LAN.  The only software that has a problem is the Eclipse program.  But somehow the server is involved.  

We reinstalled the operating system on the same box.  No change.  

Pulled up task manager on a client machine that was locked up.  Eclipse is using 85% of the processor...but doing nothing.  Asked Galactek for symbol files so that we could run Process Monitor and at least see what it is doing when it is so busy doing nothing.  No response.

Ran a full hardware level diagnostic on the server.  Absolutely nothing failed.  Not even any soft errors.

Random failures occur throughout the day -- but there are a couple of times a day when glitches are almost guaranteed.  9:30am (plus or minus 10 minutes) and 11:30am (plus or minus 10 minutes).  Nothing in the server event logs or scheduled tasks relating to that time.  
 
ITnavigators (Author) Commented:
Followup (for anyone looking for this solution)...

Using Process Explorer from SysInternals, we noted that during a glitch the Eclipse application is consuming roughly 80% of the CPU.  We requested symbol tables from Galactek but our request was denied.  We used Process Monitor and captured a glitch in progress.  The application is stuck in an infinite loop reading two items out of the registry.  The loop appears to be inside a function or subroutine, as it completes several steps on each pass.  

We provided the information to Galactek but have not heard back.

We have since captured a ProcMon log going into the loop.  There are several items of interest but nothing conclusive yet.  We are attempting to capture more crashes to identify what is happening that leads into the crash.

It seems clear to me that there is a bug in the program, but until we can identify for them what triggers the bug, we are stuck with the blame.
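
For anyone trying to spot a similar loop, here is a minimal Python sketch that counts the most frequently repeated operation/path pairs for one process in a Process Monitor CSV export. It assumes the export has a header row with the column names "Process Name", "Operation", and "Path" and is saved as UTF-8 with a BOM. The function names, the process name, and the registry paths in the sample are hypothetical and are not taken from our actual log.

```python
import csv
from collections import Counter
from typing import Iterable, List, Tuple


def load_procmon_csv(path: str) -> List[dict]:
    """Read a Process Monitor CSV export (assumed UTF-8 with BOM and a header row)."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def top_repeated_operations(
    rows: Iterable[dict], process_name: str, limit: int = 5
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the most common (Operation, Path) pairs for one process."""
    counts = Counter(
        (row["Operation"], row["Path"])
        for row in rows
        if row["Process Name"].lower() == process_name.lower()
    )
    return counts.most_common(limit)


# Hypothetical rows standing in for a real export; a real run would use
# load_procmon_csv("capture.csv") with whatever file name you exported.
sample = [
    {"Process Name": "eclipse.exe", "Operation": "RegQueryValue", "Path": r"HKLM\SOFTWARE\Example\ValueA"},
    {"Process Name": "eclipse.exe", "Operation": "RegQueryValue", "Path": r"HKLM\SOFTWARE\Example\ValueB"},
    {"Process Name": "eclipse.exe", "Operation": "RegQueryValue", "Path": r"HKLM\SOFTWARE\Example\ValueA"},
    {"Process Name": "explorer.exe", "Operation": "ReadFile", "Path": r"C:\Windows\example.dll"},
    {"Process Name": "eclipse.exe", "Operation": "RegQueryValue", "Path": r"HKLM\SOFTWARE\Example\ValueB"},
    {"Process Name": "eclipse.exe", "Operation": "RegQueryValue", "Path": r"HKLM\SOFTWARE\Example\ValueA"},
    {"Process Name": "eclipse.exe", "Operation": "ReadFile", "Path": r"C:\Example\data.dat"},
]

for (operation, path), count in top_repeated_operations(sample, "eclipse.exe"):
    print(count, operation, path)
```

In a capture of a tight loop, a small number of pairs should dominate the counts by a wide margin, which makes the looping reads easy to pick out before digging into what preceded them.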
 
ITnavigators (Author) Commented:
Followup (for anyone looking for this solution)...

There seems to be no consistency in anything that triggers this problem.  Working with another provider, we have disabled all TCP Offload functions on the server.  Still no change.
 
ITnavigators (Author) Commented:
Long overdue update.  We have had a second Technology Provider go back through the server and network with a fine-tooth comb.  He doesn't see anything unusual.  He repeated a number of the tests done above with the same results.  The customer has contacted the vendor and has asked to evaluate the Client Server version of the software.  Like the basic version, this is built on a FairCom database.  

I'm not expecting it to resolve the problem, but will try to keep an open mind.
 
ITnavigators (Author) Commented:
Another long overdue update.  The client finally convinced Galactek to upgrade them to the Client Server version of the software as a test.  Thus far, there have been no issues.  Not all of the features have been re-enabled, but we are anxiously awaiting the outcome of the test.
 
ITnavigators (Author) Commented:
Upgrading the product appears to have solved the problem.  No idea how or what the base issue actually was.  I suspect that the backend DB is just not that robust.
 
ITnavigators (Author) Commented:
The answer didn't specifically solve the problem, but pointed us to the source of the problem.  Customer is now running a client-server version of the software which is not experiencing the same issue.