ITnavigators asked:
Spanning-Tree Packets where there should be none

I am troubleshooting some strange network problems on a very small network.  We have one LOB application (Eclipse from Galactek) that intermittently locks up.  It is the only application on the network that is experiencing any problem.  

The network consists of a WatchGuard firewall / router, a single Netgear unmanaged switch, a Dell PowerEdge 2900 running SBS2003 (from an original shrink wrap), and about 6 desktops and laptops (one of which is, for troubleshooting purposes, currently hosting the problem application).  

Troubleshooting has been long and complicated.  We have replaced the Sentinel dongle, moved the software offsite, moved it to another machine on the network, temporarily removed anti-virus, and even temporarily removed the entire server.  Taking the software offsite or removing the entire server are the only two resolutions that seem to make a difference.  We have now reached the point where we are now running Wireshark to look at the actual data on the wire.  

We notice the following strange behavior that seems to occur whenever we have the lockup problems.  The host machine sends a large stream of data (spanning multiple packets).  During that stream of packets we get a series of Spanning-Tree packets.
 
   Source:       Hughes_00:00:01
   Destination:  Spanning-tree-(for-bridges)_01
   Protocol:     CTRL
   Info:         MAC PAUSE: Quanta 65535

Since the application seems to consistently crash at the same time when this behavior occurs we strongly suspect they are related.  There is only one switch on the network.  Why are there Spanning-Tree packets on the network at all?

As mentioned earlier, troubleshooting seems to indicate that when we remove the Dell server from the network the problem does not occur.  This may be coincidence or it may be part of the problem.  If it is part of the problem...WHY?  We have replaced the Broadcom NIC in the server with Intel.  Teaming is NOT nor was it ever enabled.

I love a challenge, but I'm banging my head against the wall on this one.  Any ideas people?
Don Johnston replied:
It sounds like the server is trying to process Spanning-Tree BPDU's. Do you have multiple NIC's in the server? What is the actual source MAC address of the BPDU's?
ITnavigators (asker) replied:
Nice hunch.  The actual packet summary follows.
   Ethernet II, Src: Hughes_00:00:01 (00:00:10:00:00:01), Dst: Spanning-tree-(for-bridges)_01 (01:80:c2:00:00:01).  

The source OUI resolves to Sytek (later Hughes LAN Systems), the company that created the NetBIOS protocol.  I'm not familiar enough with the specifics below that level, but I think it suggests that the server O/S may be trying to process the BPDUs.

Yes, there are multiple NICs.  The two Broadcoms on the motherboard are disabled.  We added a dual-port Intel NIC.  One of those ports is also disabled.  None of the ports are currently teamed.
So even though there are multiple NIC's in the server, only one is connected, right?

I don't see anything in your list of equipment that would be generating BPDU's. Since you have no managed switches, there's no way to determine where this BPDU is coming from, other than watching the BPDU's with a protocol analyzer and unplugging devices one at a time.

Or you could figure out why your server is trying to process them. I don't know Microsoft, but I wonder if having more than one NIC enables spanning tree on the server?
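For the watch-and-unplug approach, here is a minimal sketch using Python with scapy (my choice of tool, not something named in the thread): it prints the source MAC of every frame sent to the reserved bridge multicast addresses, so you can unplug devices one at a time until the frames stop.

    # Sketch: report who is sending frames to the reserved bridge
    # multicasts. BPDUs go to 01:80:c2:00:00:00; the capture above
    # showed frames to 01:80:c2:00:00:01. Needs scapy and root/admin.
    from scapy.all import sniff

    BPF = "ether dst 01:80:c2:00:00:00 or ether dst 01:80:c2:00:00:01"
    sniff(filter=BPF, store=False,
          prn=lambda p: print(f"{p.src} -> {p.dst}"))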
We are planning to disable NetBIOS on the server tonight.  It isn't required on our network and could potentially be chatting with the unmanaged switch.  I am seeing information on the net that Windows NetBIOS can cause this type of problem.

Will let you know what happens.
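For anyone repeating this step: NetBIOS over TCP/IP can be disabled per adapter on the WINS tab of the advanced TCP/IP settings, or through the registry. A rough sketch of the registry route (an assumption on my part: Windows, admin rights, and the standard NetBT NetbiosOptions value, where 2 means disabled):

    # Sketch: set NetbiosOptions = 2 (disable NetBIOS over TCP/IP)
    # on every interface listed under the NetBT service. Run as admin.
    import winreg

    BASE = r"SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, BASE) as root:
        index = 0
        while True:
            try:
                name = winreg.EnumKey(root, index)  # e.g. Tcpip_{GUID}
            except OSError:
                break
            with winreg.OpenKey(root, name, 0, winreg.KEY_SET_VALUE) as iface:
                winreg.SetValueEx(iface, "NetbiosOptions", 0,
                                  winreg.REG_DWORD, 2)
            index += 1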
Disabled NetBIOS on the server (and on the temporary host machine).  Still getting spanning tree packets.  Eclipse started producing errors as well.  They must still use NetBIOS.  :)  It is based on a FairCom database.
You say that the crash happens when you see these BPDU's, yet you don't have any managed switches. Therefore, you should not be seeing BPDU's. I would try to find the source of the BPDU's.
I agree.  I sure wish there was some data in the packets that positively identified where they were coming from.  
ASKER CERTIFIED SOLUTION from MrRichardHead
Thank you for the response and the clarification.  I will certainly check into that.  

Since it seems to occur when the host machine is sending large packets, that may make sense.  The question remains...why...and what can be done about it.
Good catch, Richard. I didn't notice the 01 at the end of the MAC address. I've never turned this feature on (I'm hoping it's off by default) but it certainly sounds like the problem.
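To spell out the distinction for anyone finding this later: real STP BPDUs are sent to 01:80:c2:00:00:00, while 01:80:c2:00:00:01 with EtherType 0x8808 and opcode 1 is an 802.3x PAUSE (flow control) frame. Wireshark's name resolution labels both with the "Spanning-tree-(for-bridges)" prefix, which is what made these look like BPDUs. A small sketch that sorts a capture accordingly (the file name capture.pcap is a placeholder):

    # Sketch: separate 802.3x PAUSE frames from true STP BPDUs in a pcap.
    import struct
    from scapy.all import rdpcap, Ether

    PAUSE_DST = "01:80:c2:00:00:01"   # MAC Control (PAUSE) multicast
    STP_DST = "01:80:c2:00:00:00"     # Spanning Tree (BPDU) multicast

    for pkt in rdpcap("capture.pcap"):  # placeholder file name
        if Ether not in pkt:
            continue
        dst = pkt[Ether].dst.lower()
        raw = bytes(pkt[Ether].payload)
        if dst == PAUSE_DST and pkt[Ether].type == 0x8808 and len(raw) >= 4:
            opcode, quanta = struct.unpack("!HH", raw[:4])
            if opcode == 0x0001:      # MAC Control opcode 1 = PAUSE
                print(f"PAUSE from {pkt[Ether].src}, quanta={quanta}")
        elif dst == STP_DST:
            print(f"STP BPDU from {pkt[Ether].src}")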
SOLUTION
The switch is an unmanaged Ethernet switch.  

The server has two onboard Broadcom gig ports.  During testing those were disabled and a two-port Intel gig card was added.  Only one port is currently in use.  The other is disabled.

The host machine (the LOB application was moved to a separate computer for diagnostic purposes) also has a gig port.  

I suspect the client machines are currently running 10/100 cards.  They are not that new.  It is certainly possible that the LOB application is pushing data to the client machines faster than they can handle.  We see the problem when it is bursting data.  For diagnostics I may have the host machine use a 10/100 port.  If we can slow down the send, it shouldn't overrun the receive.
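As a sanity check on that theory, the arithmetic is simple: one 802.3x pause quantum is 512 bit times, so the Quanta 65535 seen in the capture silences a sender for about 33.6 ms per frame on gigabit and roughly 335 ms at 100 Mb/s (a quick sketch assuming the standard quantum):

    # One 802.3x pause quantum = 512 bit times on the link.
    def pause_ms(quanta: int, link_bps: int) -> float:
        """Duration (ms) a PAUSE frame with the given quanta halts TX."""
        return quanta * 512 / link_bps * 1000

    print(f"{pause_ms(65535, 1_000_000_000):.1f} ms at 1 Gb/s")   # ~33.6 ms
    print(f"{pause_ms(65535, 100_000_000):.1f} ms at 100 Mb/s")   # ~335.5 ms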
Oops...  The switch is an unmanaged gigabit switch.
We replaced the gigabit switch with a 10/100 switch for the duration of the test.  The entire network will be one giant bottleneck, but at least everything will be at the same speed.  

Will post the results.
We made it through an entire day without a single glitch.  Slowing the entire network down to five-year-old technology made quite a difference.  If we make it through one more, we may have a solution.
Problems are back (though I haven't yet verified the presence of the FlowControl frames).  This is a pesky problem.
On recommendation from Microsoft we contacted the LOB vendor for instructions on how to move the executable to the client machines (instead of launching them across the network).  Unfortunately we still have problems.  Starting another round of sniffing.
Just followup for the record (and the benefit of anyone searching this issue).

Repeated the test of removing the server from the network.  Obviously, you lose all access to server-based resources (authentication, shared printers, shared folders, Exchange, etc.).  But the Eclipse problem goes away for as long as the server is disconnected from the LAN.  The only software that has a problem is the Eclipse program.  But somehow the server is involved.

We reinstalled the operating system on the same box.  No change.  

Pulled up task manager on a client machine that was locked up.  Eclipse is using 85% of the processor...but doing nothing.  Asked Galactek for symbol files so that we could run Process Monitor and at least see what it is doing when it is so busy doing nothing.  No response.

Ran a full hardware level diagnostic on the server.  Absolutely nothing failed.  Not even any soft errors.

Random failures occur throughout the day -- but there are a couple of times a day when glitches are almost guaranteed.  9:30am (plus or minus 10 minutes) and 11:30am (plus or minus 10 minutes).  Nothing in the server event logs or scheduled tasks relating to that time.  
Followup (for anyone looking for this solution)...

Using Process Explorer from SysInternals, we noted that during a glitch the Eclipse application is consuming roughly 80% of the CPU.  We requested symbol tables from Galactek but our request was denied.  We used Process Monitor and captured a glitch in progress.  The application is stuck in an infinite loop reading two items out of the registry.  The loop appears to span a function or subroutine, as it completes several steps on each pass.

We provided the information to Galactek but have not heard back.

We have since captured a ProcMon log going into the loop.  There are several items of interest but nothing conclusive yet.  We are attempting to capture more crashes to identify what is happening that leads into the crash.

It seems clear to me that there is a bug in the program, but until we can identify for them what triggers the bug, we are stuck with the blame.
Followup (for anyone looking for this solution)...

There seems to be no consistency in anything that triggers this problem.  Working with another provider, we have disabled all TCP Offload functions on the server.  Still no change.
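For the record: on Windows Server 2003 SP2 with the Scalable Networking Pack, the offload features are switched off with registry values under the Tcpip parameters key (per Microsoft KB 948496); per-adapter checksum offloads live separately in the NIC driver's advanced properties. A sketch, assuming that platform (admin rights and a reboot required):

    # Sketch: disable TCP Chimney offload, RSS, and NetDMA (KB 948496).
    import winreg

    KEY = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY, 0,
                        winreg.KEY_SET_VALUE) as params:
        for value in ("EnableTCPChimney", "EnableRSS", "EnableTCPA"):
            winreg.SetValueEx(params, value, 0, winreg.REG_DWORD, 0)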
Long overdue update.  We have had a second technology provider go back through the server and network with a fine-tooth comb.  He didn't find anything unusual.  He repeated a number of the tests done above with the same results.  The customer has contacted the vendor and asked to evaluate the Client Server version of the software.  Like the basic version, it is built on a FairCom database.

I'm not expecting it to resolve the problem, but will try to keep an open mind.
Another long overdue update.  The client finally convinced Galactek to upgrade them to the Client Server version of the software as a test.  Thus far, there have been no issues.  Not all of the features have been re-enabled, but we are anxiously awaiting the outcome of the test.
Upgrading the product appears to have solved the problem.  No idea how, or what the base issue actually was.  I suspect that the backend DB is just not that robust.
The answer didn't specifically solve the problem, but it pointed us to the source of the problem.  The customer is now running the client-server version of the software, which is not experiencing the same issue.