JamesonJendreas asked:

Random Network Dropouts

I have a really odd issue.  I have users who are having issues where certain parts of their network drop out.  These users are set up to grab IPs via DHCP (from a Windows 2003 server).  Some have a reservation set up.

Anyway, after about 10 minutes of being online (say, after an IP release/renew) they lose connectivity to any server that is not in my direct subnet.  This includes internet access, access to my web servers in my DMZ (which have public IPs, routed by my SonicWall in transparent mode), and to my phone system.

The phone system is a good place to start.  My phone system's PBX is at IP  My users are in the subnet.  Users connect to the PBX (for unified messaging services) from their PCs (the phones themselves live on the subnet).  Now, in my firewall, I have a route set up for all subnet to route to a VLAN on my phone switch ( of

My users have no problem pinging the port on my phone switch, but can't ping anything in the  The annoying part is that it is NOT system-wide but contained to a specific group of users.  I can't be certain they are all patched into the same switch - I can go back and trace my drops (which I may need to do), but at least 3 users are geographically close to each other, so it's likely they are.

Now, my users are patched into my Netgear GSM7352S switches, which I have in a 7-member stack.  The stack is in a duplex ring connected by add-on modules (CX4).  The vast majority of my users are patched into these 7 switches (I have 3 other switches on the fringe of my network, 2 48-port Ciscos and 1 28-port Linksys, all patched via fiber).

So what could be going on here?  It's only certain users.  

One thing I've noticed, and as far as I can tell it shouldn't really make a difference, is that my DHCP is passing a subnet of along, when we've been using as our subnet.  In my DHCP server, it won't let me change the subnet (it's grayed out).  My DHCP range is
Also, once the connection drops, they can connect to any internal resource, but cannot ping the firewall.  So the issue has to be something with routing to my FW.  My PC, for instance, does not have any of these problems.
What device is the PBX connected to?  Is that a switch that has layer 3 routing enabled and if so what networks are configured on it?

You started out saying that this only happens after 10 mins.  What IP address are they using when it does work?  Can you reproduce the success/failure using static IP addresses?

It kind of sounds like you may have a rogue DHCP server.  If that is the case, then you should be able to defeat the issue by using static addresses.
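One way to check for a rogue DHCP server is to capture traffic (in Wireshark, filter on DHCP and look at the Offer messages) and compare the server identifiers you see against the servers you actually run. A minimal Python sketch of that comparison - the addresses here are placeholders, not from this network:

```python
# Flag DHCP Offer sources that are not on the known-good list.
# KNOWN_SERVERS is a placeholder for your authorized DHCP server(s).
KNOWN_SERVERS = {"10.0.0.10"}  # e.g. the Win2003 DHCP server

def rogue_servers(observed_offer_sources):
    """Return the set of observed Offer sources that are not authorized."""
    return set(observed_offer_sources) - KNOWN_SERVERS

# Server identifiers pulled out of a capture (illustrative):
offers = ["10.0.0.10", "10.0.0.10", "10.0.0.99"]
print(sorted(rogue_servers(offers)))  # ['10.0.0.99']
```

Anything the sketch flags is a DHCP server you didn't know about and worth hunting down.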
Hm, it seems like turning off Link-Layer Topology Discovery and Responder possibly alleviates this issue (only time will tell if this is true).  If so, what would cause this to break my network?
Do you have a simple network diagram we can look at?
Very unlikely LLDP is your issue.  
I had a similar issue.  It turned out to be the switch.  Are the users that are affected plugged into the same switch?  Try plugging those users into a spare switch for testing.  My switches were 5 years old and failed almost exactly five years to the date.
I've patched myself into the same switch as some of the problematic users - I have no problems
These are new switches, about 2 months old.  I've plugged myself into one of the problematic switches and cannot reproduce the issue.  I've also moved one of the users over to one of my Cisco switches; they're still having the same issue.  I also noted that I do not see these users in my FW's ARP table.  Along with that, these users look to be getting their IPs from a DHCP reservation

 but as for a network diagram:
Check the hosts file on one of the problem workstations.  I have seen a hosts file with old IP server settings cause problems.
No hosts file entries.  Note, when this issue happens users cannot ping my firewall (by IP) but can ping any internal server.  Any routes defined by my firewall (to the DMZ or phone networks) are unavailable, as one would expect if you can't reach the firewall at all.
What is the default gateway for the PC's?  What are the VLAN #'s?  Are the netgear switches layer 3 capable?  Do you have the VLANs defined on all of the switches?
Also, I've noted that I don't seem to see the problematic systems' MACs in either my switch's or my firewall's ARP tables.

Any suggestions?
Going through my switches, I do see a few ports that have a really high number of packet collisions.  All other ports show 0....
And I can confirm that at least one of the users who was having issues is patched into a port with high packet collisions

Would that be a loop?  If so, shouldn't the issue be system-wide?  It seems some 5-10 users out of over 100 are affected by this.
The default gateway is my firewall, and the PCs are getting the right IP for it.  They get on the network, can talk to the gateway for about 5-10 minutes, then all of a sudden they are no longer able to talk to the gateway, or anything that is routed by the gateway.

Regarding VLANs:

The VLAN is only defined on my voice switch.  The switch has a single port on the VLAN (.  The VLAN is within my data network's subnet.  Routing to the VLAN is done by a simple route in my firewall: routes to

Then, the voice switch itself routes the VLAN port over to the voice network (where my PBX lives).  This hasn't been an issue in the past is patched into my data network.  All users can ping this interface, but traffic is not routed there unless they can actually talk to the firewall.  If I set up a route on my switch to do the same, I could see how it would fix users' access to my unified messaging, but it wouldn't fix the fact that the users can't get to my DMZ or the internet.

The issue is definitely a connection between these users and my firewall.
OK - there is something definitely wrong here.  I moved one of the problematic users to a new port (that had 0 packet collisions) and within a few minutes, there were already 3,586 packet collisions on the port.

Now, how to go about finding what the cause of said collisions is.
And yes, the switches are layer-3
I have noticed the problematic users' MACs are not showing in my firewall's ARP table.  Attempting to manually add entries.
Well, adding users to the ARP table seems to be helping, although I can't figure out why these units can't register their MAC with my firewall.  I did note (as mentioned above) that I had one PC that was causing a bunch of packet collisions.  Turns out this was patched into a port that we've had issues with.  We have a 1 Gbit network and the PC could only make a 100 Mbps connection.  We had our cable guys come in to re-punch; now the unit connects at 1 Gbit, but this port is now the one causing all the packet collisions.
>Well adding users to the ARP table seems to be helping, although I can't figure why these units can't register their MAC with my firewall

Adding the users to the ARP table is a good troubleshooting step; however, I would not recommend leaving it that way.  Collisions do not exist on a switched network as long as the ports are full duplex.  With that being said, check the speed and duplex settings on all ports to ensure that they match.  What are the CPU and memory statistics like on the routers and switches?  If you are not seeing the MAC addresses of the users having issues in the ARP table, what about the MAC tables on the switches?  You can also check to ensure that you are not exceeding any MAC or ARP entry limits on the routers and/or switches.  Another thing to consider is that MAC and ARP tables can be poisoned, so it could very likely be a user causing havoc.
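The poisoning rfc describes shows up in a capture as one IP being claimed by more than one MAC. Given (ip, mac) pairs pulled out of "is at" ARP replies like the capture lines further down the thread, a small sketch to flag conflicts - the sample data is illustrative, not from this network:

```python
# Detect IPs claimed by more than one MAC across a set of ARP replies,
# a classic symptom of ARP poisoning or a stale/duplicate device.
from collections import defaultdict

def conflicting_ips(arp_replies):
    """arp_replies: iterable of (ip, mac) pairs from 'X is at MAC' packets.

    Returns a dict of ip -> set of MACs for every IP with >1 claimant."""
    seen = defaultdict(set)
    for ip, mac in arp_replies:
        seen[ip].add(mac.lower())
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}

replies = [
    ("192.168.1.1", "f0:4d:a2:2b:3c:14"),
    ("192.168.1.1", "00:17:c5:aa:bb:cc"),  # a second MAC claiming the same IP
    ("192.168.1.50", "bc:30:5b:9f:49:b2"),
]
print(conflicting_ips(replies))
```

An empty result means every IP in the capture was answered by exactly one MAC.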

Thanks rfc - I'll look into the memory and CPU stats for the firewall, switches and router.  The clients' MACs do seem to be showing on the switches.  Also, note the users can't ping the firewall by IP, but they can ping anything else in my local LAN (that doesn't require any routing from my firewall itself).

I don't plan to use the static ARP table as a fix.  It's more of a short-term band aid at this point.  Oddly the one user this isn't even working for is my boss (who, luckily sees himself as a low priority compared to my other users).  I've also tried flushing my ARP table on both my firewall and switches.  

I did a reboot of the sonicwall last night at midnight to no avail as well.

Looking at the last 24hrs of my firewall, CPU hasn't reached over 5% and doesn't give me any info on memory usage.  I'll have to check my router as well, but since the users seem to get blocked before the router, I think the issue has to be either the switches or the FW.

As for packet collisions, that is starting to look coincidental - I switched the user over to a different data drop and we are no longer seeing the packet collisions, but are still having the same issues.  Although my boss's unit says it's connected at 1 Gbit, my switch has its amber light on, signifying a 10/100 connection (this is a Netgear; I've noticed on my Cisco switches it's the opposite).

I understand MAC poisoning (and spoofing) for things like man-in-the-middle attacks and MAC flooding.  So it sounds like a possible security intrusion.  What's a good way to diagnose this?  Wireshark, and just sniff packets and look for ARP entries?  I ran Wireshark for about an hour yesterday.  I do see a bunch of ARP entries like:

139083      492.747251      Dell_2b:3c:14      Dell_46:34:9c      ARP is at f0:4d:a2:2b:3c:14
139085      492.788247      Dell_9f:49:b2      Dell_46:34:9c      ARP is at bc:30:5b:9f:49:b2
139086      492.789623      Dell_2b:3c:14      Dell_46:34:9c      ARP is at f0:4d:a2:2b:3c:14
131351      467.740520      Dell_2b:3c:14      Dell_46:34:9c      ARP is at f0:4d:a2:2b:3c:14

Seems like everyone is talking to Dell_46:34:9c - who's that?  Note I totally forgot that I do have a Dell PowerEdge 10 Gbit switch.  It's only got two servers plugged in and an uplink via SFP+ to my Netgear stack.

Duh, that would make sense: Dell_46:34:9c is myself...
Also, my switches have a bunch of port security settings, are there any that I should consider using?
According to Wireshark, 65.70% of packets and 38.86% of bytes are coming from ARP.  Is that a bit high?
38.86% is a bit high, and this is dependent on your broadcast domain too.  You will want to limit your design based on organizational boundaries, creating subnets (VLANs) to separate your broadcast domains; however, with that being said, 38 percent depends on how much utilization your network has.  38 percent ARP on a 1 Gigabit network is not that bad; on a 10/100 network, I would say it is a bit high.  With the number of switches you have in your network, are you monitoring bandwidth via SNMP using Cacti or MRTG?  This will give you an idea of your network utilization, and you can configure CPU/memory graphing too.
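The 65.70% / 38.86% figures come straight from Wireshark's Protocol Hierarchy statistics; the arithmetic behind such a share is simply the protocol's count over the total, which is worth sanity-checking yourself when comparing captures (the counts below are illustrative):

```python
def protocol_share(counts, proto):
    """Percentage of the total (packets or bytes) attributed to one protocol.

    counts: dict mapping protocol name -> packet or byte count."""
    total = sum(counts.values())
    return 100.0 * counts[proto] / total if total else 0.0

# Illustrative packet counts from a one-hour capture:
packets = {"arp": 6570, "other": 3430}
print(round(protocol_share(packets, "arp"), 2))  # 65.7
```

Comparing the packet share against the byte share (ARP packets are tiny) tells you whether ARP is a chatter problem or a bandwidth problem.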

Well I've got my ARP requests down to 13% of bytes, looks like my firewall was attempting to connect to a syslog server that didn't exist.  Hasn't seemed to fix my issue though.  I was able to get a few of the problematic users up-and-running by switching out some cabling.  This isn't working for all users.  

Could this be an issue with wiring?  
>Anyway, after about 10 minutes of being online (say after an Ip release/renew) they loose connectivity to any server that is not in my direct subnet.

If you can replicate this every time, that leads me to believe it is related to the host.  Some application is eating up bandwidth, or is possibly a resource hog.  If, after you release and renew, everything is fine for 10 minutes and then slow network performance or network drops occur, that sounds like a host issue.  How are other hosts on the same switch performing?  Also, I see that you have a T1; you could possibly be saturating the T1.  Have you reviewed the bandwidth utilization on the T1 interface?

On the router, perform a show interface serialx/x for the interface the T1 connects to.

The issues occur before the users even make it to the T1 - they can't get to the firewall.  So it's not a problem with the T1 itself - if I were to pull my WAN connection they'd still have issues connecting to my voice subnet, as it's routed by my firewall.

I may have found a source for my headache: my firewall was looking for a syslog server that didn't exist.  It was constantly sending ARP requests for an IP and it looked to be flooding my network.  After removing the syslog server IP (the server did not exist) things have seemed to stabilize.

One other thing: I upped the ARP cache timeout.  I noticed that problematic users would get an entry in my firewall's ARP cache, and then once it expired (after 10 min) it wouldn't renew (unless a release/renew of the IP was done).
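The failure mode described here - an entry appears, ages out after the timeout, and is never re-learned - can be modeled as a simple TTL cache. A toy sketch (the IP, MAC, and the 10-minute timeout are illustrative, not the firewall's actual implementation):

```python
# Minimal ARP-cache model with a TTL, illustrating the observed failure:
# an entry that is never refreshed ages out, and lookups start failing.
class ArpCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at > self.ttl:
            del self.entries[ip]  # expired, and nothing re-learns it
            return None
        return mac

cache = ArpCache(ttl_seconds=600)  # 10-minute timeout, as on the firewall
cache.learn("10.0.0.50", "f0:4d:a2:2b:3c:14", now=0)
print(cache.lookup("10.0.0.50", now=300))  # within the TTL: still cached
print(cache.lookup("10.0.0.50", now=700))  # past the TTL: entry is gone
```

Raising the TTL only stretches the window; the real question is why the re-learn (the client's ARP reply reaching the firewall) never happens.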
[Rick_at_ptscinti's accepted solution is member-only and not shown.]
>, I upped the ARP cache timeout
Cisco's default is 4 hours; the ARP cache should not be less than 15 minutes for most networks.

Thanks guys, I'm going to go over these items today.

One BIG change: after I upped the ARP cache timeout on my FW (it was at 5 min), we've noticed that while the network still drops out for these users, it's now just a blip - they lose their connection for maybe 5-10 seconds and then are back on.
Sorry, ARP was at 10 minutes, and is now at 60 minutes.

One other thing: we are a 1 Gbit network, and it does seem that the problematic users' link light on my switch is amber, which for Netgear means 10/100.  I've noticed that the status in the adapter properties shows 1 Gbit.  Should I force these cards to run at 100 Mbps?

Also, what's the recommended max length for a 1 Gbit drop?  A few of the users are pretty far from my MDF, and replacing one user's patch cable seemed to help.  Could the speed discrepancy between the NICs and my switches be due to too-long runs, bad cables, or old cables?  I'm not positive, but some of the drops may be Cat5 and not even Cat5e.  I do not believe there was any upgrading of the actual cable runs when we made the switch from a 100 Mbps to a 1 Gbit network (I was not here at that time).

Cat5 is not really rated for Gigabit, but it will work for short distances.  Cat5e is good for about 300 ft (the spec limit for a 1000BASE-T channel is 100 m, roughly 328 ft).

doubling up the cable pairs will definitely extend your range.....but it is a little ghetto.  You would probably be better off just adding an IDF closer to the users and putting in a new uplink.

Forcing the ports to 100 Full should definitely help.
"Just adding an IDF closer to the users and putting in a new uplink."

That's a plan I have - all I need is to get some fiber run and I'm going to set that part of the building up with its own IDF.  Although I realized one of my problematic users is actually in an office that shares a wall with my MDF.  Tomorrow I'm going to force both the users and the switch to run 100 Mbps on those ports that are showing a 100 link.

Also, I decided to upgrade my switch firmware at midnight tonight, and I'm going to reboot my firewall at the same time.  Tomorrow evening I'll be dropping in a new SonicWall NSA 4500, so hopefully between all that we'll find a solution.
Also, for the next user that goes down, I'm going to run a ping to my FW and run Wireshark at the same time.  If they show the incorrect MAC, then I've made some progress.  Then I'll find that MAC in my switch address table and hunt it down.
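That check - confirm which MAC the client has cached for the gateway while the ping runs - can be scripted by parsing arp -a output and comparing against the known-good MAC. A sketch assuming Windows-style arp -a formatting; the addresses and MAC are placeholders:

```python
def gateway_mac(arp_output, gateway_ip):
    """Pull the cached MAC for gateway_ip out of Windows 'arp -a' output.

    Returns the MAC normalized to colon-separated lowercase, or None."""
    for line in arp_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == gateway_ip:
            return parts[1].lower().replace("-", ":")
    return None

# Illustrative 'arp -a' output from a client:
sample = """Interface: 192.168.1.100 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           f0-4d-a2-2b-3c-14     dynamic
"""
EXPECTED = "f0:4d:a2:2b:3c:14"  # known-good firewall MAC (placeholder)
print(gateway_mac(sample, "192.168.1.1") == EXPECTED)  # True
```

Run it during an outage: if the cached MAC differs from the known-good one, something else is answering ARP for the gateway.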
Here's another thing - although it doesn't seem to be the users who are having issues, in the trap log of my switches I get:

7      JAN 01 06:49:08 1970      Link Down: 4/0/4
8      JAN 01 06:48:42 1970      Link Up: 6/0/15
9      JAN 01 06:48:38 1970      Link Down: 6/0/15
10      JAN 01 06:48:12 1970      Temperature change alarm: Sensor ID: 0 Event: 3
11      JAN 01 06:48:04 1970      Temperature change alarm: Sensor ID: 0 Event: 1
12      JAN 01 06:47:40 1970      Link Up: 5/0/29
13      JAN 01 06:47:37 1970      Link Up: 6/0/19
14      JAN 01 06:47:37 1970      Link Down: 5/0/29
15      JAN 01 06:47:34 1970      Link Down: 6/0/19
16      JAN 01 06:47:33 1970      Link Up: 6/0/19
17      JAN 01 06:47:31 1970      Link Down: 6/0/19
18      JAN 01 06:46:45 1970      Link Up: 7/0/30
19      JAN 01 06:46:43 1970      Link Down: 7/0/30
20      JAN 01 06:46:14 1970      Link Up: 1/0/31
21      JAN 01 06:46:12 1970      Link Down: 1/0/31
22      JAN 01 06:45:37 1970      Link Up: 3/0/7
23      JAN 01 06:45:34 1970      Link Down: 3/0/7
24      JAN 01 06:45:32 1970      Link Up: 3/0/7
25      JAN 01 06:45:30 1970      Link Down: 3/0/7
26      JAN 01 06:45:01 1970      Link Up: 6/0/38
27      JAN 01 06:44:57 1970      Link Down: 6/0/38
28      JAN 01 06:44:44 1970      Link Up: 7/0/25
29      JAN 01 06:44:42 1970      Link Down: 7/0/25
30      JAN 01 06:44:22 1970      Temperature change alarm: Sensor ID: 0 Event: 3
31      JAN 01 06:44:12 1970      Temperature change alarm: Sensor ID: 0 Event: 1
32      JAN 01 06:41:17 1970      Link Up: 1/0/31
33      JAN 01 06:41:13 1970      Link Down: 1/0/31
34      JAN 01 06:40:12 1970      Link Up: 7/0/30
35      JAN 01 06:40:08 1970      Link Down: 7/0/30
36      JAN 01 06:39:20 1970      Link Up: 7/0/25
37      JAN 01 06:39:16 1970      Link Down: 7/0/25
38      JAN 01 06:39:11 1970      Link Up: 5/0/3
39      JAN 01 06:39:09 1970      Link Down: 5/0/3
And they pretty much cycle up and down.
Also, that's from today; I'm not too sure why the date and time are off so much (I cleared the log this morning).  The JAN 01 1970 timestamps suggest the switch clock was never set.
I'm wondering if disabling STP could help.  
Also, I've attached a bit more detailed network diagram (I forgot I had it)
>I'm wondering if disabling STP could help
Noooo, never disable STP on any switch (even if you do not have redundant layer 2 paths); however, others have different views on that, but lessons learned is all I can tell you.  With that being said, you do have interfaces flapping, and here is one thing to consider: what STP protocol are you using (IEEE 802.1D, RSTP, etc.), and are the interfaces that are bouncing host interfaces or interfaces in the VLAN of the users that are reporting issues?  What does STP status tell you about the stability of the STP domain?

The STP version is IEEE 802.1s.  As mentioned above, VLANs aren't in any real use except for routing between my phone and voice networks, so yes, the bouncing interfaces ARE in the same VLAN and subnet.  As far as STP status, I'm actually having trouble finding any info on the switches.
In case anyone cares, the switches are NetGear GSM7352Sv2 (how much I wish we had Cisco equipment...)
So I do have options between
IEEE 802.1d  (STP)
IEEE 802.1w (RSTP)
IEEE 802.1s (MSTP)

Any reason to use one over the other?
Switched to RSTP, and I have gone through and changed every single port that's flapping (there were about 15) to force 100 full duplex.  I then traced down the PCs (using my address table to find the MAC, then my local ARP to find the IP, then connecting to their C$ share and figuring out whose it was by the profiles) and forced the local NIC to 100 full duplex.

Seems like the flapping ports have stopped.  Also seems like the random dropouts are fixed.  I truly believe there were multiple issues here and we essentially broke a threshold of what my network could take.
OK - ports are flapping again.  Not necessarily the same ones, though some are.  All these ports are set to 100 Mbps/FDX, same as the clients.

Any Help!?
looks like you've got about every manufacturer of switching equipment represented....geez.  The harsh reality is that you need to punt and replace everything with a standard brand.

So let me try to provide some productive input.  If your issue was spanning tree or loops, you would be seeing messages like "DEVICENAME is the new root of the spanning tree", indicating that spanning tree is changing the layer two paths.  Another indicator would be ports in the "blocking" state (as opposed to learning or forwarding)... I don't think this is your issue.  It wouldn't be that random.  It would be a good idea to set the spanning tree priority to 1 on the device that is at the core of the network.  Looks like that is the Dell.  (commentary... I hate Dell switches... end commentary)

So we need to narrow this down if we are going to get it resolved.  Is there a specific switch or port that is consistently having this behavior?

The first thing to do is determine whether you have a physical problem or a software problem.  CRC errors are a pretty good indicator of a physical problem.  The next thing I would look at is the port status when this is happening.  Are you losing the link indicator?  That is also an indicator that layer 2 connectivity is failing.

This should have been asked at step 1, but has anything changed in the network?
To be honest, I think you need to get back to basics: start capturing packets with Wireshark and analyzing the issue.  You are flying blind at this point and taking random shots at what the issue could be.  I would highly recommend conducting packet captures and analyzing the data.

"The harsh reality is that you need to punt and replace everything with a standard brand."  I agree 100%.  I really wish I was here when all the new hardware was purchased.  Essentially the 7 switches were purchased to replace 90% of the ports, then I was brought in.

"If your issue was spanning tree or loops you would be seeing messages like "DEVICENAME is the new root of the spanning tree" indicating that spanning tree is changing the layer two routes.  Another indicating would be ports in the "blocking" state"  You are correct sir, STP is not likely the cause, as the ports are staying in forward

" CRC errors are a pretty good indicator of a physical problem" - Now I understand what a CRC error is (it's pretty much a checksum, right?), but I'm going to have to look into actually checking for them on my system.

" Is there a specific switch or port that is consistently having this behavior?"
After much screwing around, I've found that it seems to be specific end-user machines.  It was a bit weird at first: if I switched a user's port it didn't seem to do much.  Now, when I switch a user from a flapping port to another, the newly patched port starts to flap.

"starting capturing packets with wireshark and analyzing the issue"
  - Here's the thing: I have been doing this and there really isn't much to show for it, unless there is something specific I should be looking for.  I have a few .cap files that were run on problematic computers.  The only thing I could really find was what looked to me like excessive ARP packets flying around.

And thanks again for all the continued support, I really appreciate it 100 times over.
Capturing traffic is tough if you don't know exactly what you are looking for...

Are the issues happening on all stack member switches?  Does your stack share a common MAC address table?  Some brands do; with others, stacking just gives you a single IP to manage with, but that's it.  Can you move a known-bad PC to a port on the Dell switch?  I pick the Dell because it appears to be as close to a core device as you've got.
Regarding the stack: the issue happens on any of the stack members.  If I move a problematic user to a different member switch, it starts to flap.  The stack uses a single address table, which is one way I have been tracing down the units - I see the ports that are going up/down, then I search my address table for that port, which gives me a MAC.  From there I use either my local PC's ARP table, the firewall, or the DHCP server to resolve the IP from the MAC.  I am not able to do this on the switch itself.  I just noticed that there are two separate address tables on my stack - one that ties a MAC to a port, and an 'ARP' table, but the ARP table is blank.
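The tracing procedure described here (flapping port, then MAC via the stack's address table, then IP via ARP or DHCP) amounts to chaining two lookups. A sketch with placeholder data, just to make the chain explicit:

```python
# Chain the two tables the thread describes: the switch address table
# (port -> learned MACs) and an ARP/DHCP mapping (MAC -> IP).
def trace_port(port, port_to_macs, mac_to_ip):
    """Return the IPs of hosts learned on a given switch port."""
    return [mac_to_ip.get(mac, "unknown") for mac in port_to_macs.get(port, [])]

# Placeholder data, echoing the port/MAC formats seen in the trap log:
port_to_macs = {"6/0/15": ["f0:4d:a2:2b:3c:14"]}
mac_to_ip = {"f0:4d:a2:2b:3c:14": "192.168.1.57"}
print(trace_port("6/0/15", port_to_macs, mac_to_ip))  # ['192.168.1.57']
```

Dumping both tables once and joining them like this beats looking up each flapping port by hand.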

I'm going to move one of the users over to my Dell switch and monitor it as well.  Although I'm going to have to go search out the IP and login for that switch (set up before I was here).
I bet something in the stack is going bad or has a conflict.
@Rick: That's what it's looking like more and more to me.  I'm not too sure how to move forward to fix this, but I do have a ticket open with Netgear.  The stack is in a duplex loop, which is recommended by NetGear, but I may go ahead and nix the loop and have a simpler tree topology in the stack.

The other odd thing is we are experiencing what looks to be some sort of security breach (we've seen a bunch of random users from all over the world trying to connect to our terminal server).
OK - so after moving one of my problem ports (6/0/14) over to port 24 on my Dell switch (this is an Ethernet port), I start to see:

0      42 days 09:52:30      1/0/24 is transitioned from the Learning state to the Forwarding state in instance 0
1      42 days 09:52:30      Spanning Tree Topology Change: 0, Unit: 1
2      42 days 09:52:08      Link Up: 1/0/24
3      42 days 09:52:08      1/0/24 is transitioned from the Forwarding state to the Blocking state in instance 0
4      42 days 09:52:03      Link on 1/0/24 is failed
5      42 days 09:52:03      Link Down: 1/0/24
6      42 days 09:52:03      1/0/24 is transitioned from the Forwarding state to the Blocking state in instance 0

It goes to the blocking state, down, forwarding, up...
Interestingly, I followed that drop back, and that unit had a "Broadcom Advanced Server Driver" installed, which shouldn't really be there - none of my other PCs have this installed (and they are the same PC models).  Not to mention it's a desktop with a single NIC, so there'd be no reason for load balancing/teaming...  I'm uninstalling the adapter and running an AV scan to see if I can get a stable connection.
Nope, uninstalling the driver didn't help.  I'm going to try setting the adapters to not power down when idle.
Also, Netgear's support requested I turn off STP on all ports to see if it fixes the issue.  It did not.  These may be RMA units.
Here's another thing, related or not...  I attempted to swap out my firewall last night with a new NSA 4500.  If I give the new firewall the same LAN IP as the old one, I can't reach it (after removing the old one, of course).  If I assign it a new IP and change my gateway to that, it works A-OK.  If I leave the old address and ping the unit, I'll get a reply every 20-30 pings or so.  I reset the address tables on my switches, rebooted them and all.  No love.

Now, I could use the new IP, but I fear I may have a bunch of static IPs out there.  But mainly (for the purposes of this thread) I am wondering if there is some connection between this and all the other issues I have been seeing.
[Member-only solution not shown.]
>probably more than likely related to stale ARP entries  
<--This, right here, I think may be one of my biggest problems.  I cleared all the ARP caches, on the machine and on the switches.  Running arp -a on the workstation shows the correct MAC.  My switches, for some reason, don't seem to show anything in their ARP tables.  There is an address table that shows the MAC tied to a port, but the actual ARP table of IPs to MACs is, and has always been, empty.

And I even attempted to have my new SonicWall clone the MAC address of the old unit.  Still, no love to be had.  I'm at my wits' end!  I even brought in some hired help, and they are as much at a loss as I am.  The only thing that's come of it is my increased confidence, as I had already tried everything they did.  I wish I got that hourly rate...
This may sound silly, but have you power-cycled the whole stack?  Can you clear the MAC table manually?

You said moving the device to the dell switch made the dell spit out stp alarms.  You might try removing the uplink to the netgear stack and see if that stops the stp events.
> You might try removing the uplink to the netgear stack and see if that stops the stp events.
I'll give it a whirl.
 I've power cycled the stack a few times, reset to factory defaults, upgraded the firmware, dumped the MAC and addressing tables, all to no avail.  This evening I'm changing the stack master (requires a full stack reboot, so off hours), as apparently it holds all the addressing.  I've got my fingers crossed that the current master has a bug and this will alleviate the issue.
Also, while my switches reboot I'm going to be attempting to swap out the SonicWall again
Well, I'm not 100% sure, but I think my issue is resolved.  Thanks all for the help; I probably couldn't have done it without the support.  So here's what I did / what happened:

I attempted to install my new SonicWall last night.  I couldn't reach it if I used the old FW's IP.  So, I kept looking through ARP tables, rebooting switches, etc.  Then I noticed something while running a ping to the new FW: if I dropped my local ARP table (and this was using a problem PC), I'd get a single ping across!  Note that NO computer could talk to the FW, not just the ones with previous issues.  So, I knew I was on the right track.  I would then run an arp -a, and sure enough the unit would have the MAC for the old FW.  So, after running Wireshark on the unit, I saw it was getting incorrect ARP information (along with the correct, but the incorrect showed up at a much higher rate).  Oddly, the source was my old FW, which had been fully removed and turned off.

My next move was of the 'screw it' type.  I pulled down every single switch and brought them up slowly, starting with the switch the firewall was patched to (I patched my test PC over to it as well).  The PC started getting the correct MAC!  As they came up, I'd drop my ARP table, wait a bit, and then check it.  Then, after bringing up the 5th switch in the stack, I started getting the wrong MAC again.  So I deduced the issue was on that switch.  I pulled it down, dropped my ARP again, and voilà, correct MAC again.  I brought up the remainder of my network and waited 10-15 minutes.  Dropped my ARP, waited, and I kept the right MAC.

Finally, it was down to figuring out who the culprit was.  I brought the switch back up and started getting the wrong MAC again.  So, hoping it wasn't the actual switch, I started pulling patches, 10 at a time: wait, drop the ARP, check the ARP.  Finally (and of course in the last possible batch of ports) I started getting the right MAC again.  I slowly patched everything back, and found the culprit.  It was patched to an old switch (that wasn't even in use, hidden out on my shop floor).  I pulled it, and we're all getting the right MAC.
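The pull-half-and-retest procedure used here is a bisection. Sketched generically (assuming exactly one culprit and a repeatable test, which matched the situation):

```python
def bisect_culprit(items, still_broken):
    """Find the single item whose presence makes still_broken(subset) True.

    still_broken(subset) -> True if the fault shows with that subset active.
    Assumes exactly one culprit, mirroring the unplug-and-retest procedure."""
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        candidates = half if still_broken(half) else candidates[len(half):]
    return candidates[0]

# Simulate 48 patched ports with one bad one:
ports = [f"port{i}" for i in range(1, 49)]
culprit = "port37"
print(bisect_culprit(ports, lambda subset: culprit in subset))  # port37
```

Halving instead of pulling ten at a time gets you to one port in log2(N) tests, about 6 rounds for 48 ports.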

So, most things are fixed.  We're still having ports flap, and had a really shaky morning, but after screwing around a bit with my switch, my routing, VLANs and firewall, everything seems stable.  I am also seeing a bit of trouble with pings - to my FW it jumps from <1 ms to over 200 ms every 10 pings or so, and pinging my stack is constantly around 2 ms.  The port flapping doesn't really seem to affect the users - it's only for a moment.  These ports are also seeing a high level of errors, but in general things are good.  I may open a new thread regarding the switch issue/ping issues, but I think for the scope of this thread I'm going to close it and award points.

I hope no one objects to my awarding of points; I feel that I got the help I needed to a) make sure I wasn't wasting time looking at the wrong things, and b) get ideas of what to look for and potential issues.  I thank everyone who helped out.  This is by far the longest EE thread I've ever been a part of, and I hope some of the info can help others in the future!