Solved

failing core switch or dhcp issue?

Posted on 2011-02-10
Last Modified: 2012-05-11
I'm desperate for help.
I'm still a new network admin.
I've run this by two consultants and two Dell support staff members.

A quick synopsis of the problem: we run a physical star, with the MDF in the middle of campus (the core is a Dell PC 6024F) and fiber connections to the IDFs (Dell PC 3448 switches) at 7 other buildings.

We have a management vlan 255, a server vlan of 10, and building vlans of 100, 101, 102, etc.

Building A's vlan is 101, port number 5 on core switch
Building B's vlan is 102, port number 6 on core switch
Building C... etc

Building G's vlan is 105, port number 9 on core switch
Building H's vlan is 107, port number 11 on core switch

Yesterday one building (G, vlan 105), with 40 workstations (mainly Dell GX620s or Dell Optiplex 760s), dropped completely: no pings and no connection to the network.  This was followed by three other buildings (B, Z, and H) losing their connection to the network.  I can ping the switches in the IDFs but cannot ping any nodes on those particular vlans (G, B, Z, or H).  Those nodes cannot ping other devices, including their own switch.  Eventually buildings B, Z, and H came back up while G is still down.  In G users cannot log into the domain, but when logging in locally I see an IP of 169.254..., a sign that the workstation can't get to the DHCP server... right??  That switch carries vlan 105 and vlan 20 (a vlan for our printers).  You might think we have a bad cable/fiber/GBIC... nope... we have a printer on vlan 20 that I can ping and print to from another vlan across campus.
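For reference, an address in 169.254.0.0/16 is an IPv4 link-local (APIPA) address that a client assigns itself when its DHCP request goes unanswered. It shows that the DHCP exchange did not complete, not that the DHCP server itself is at fault. A minimal Python sketch of that check follows; the function name and the sample address are hypothetical, chosen only for illustration.

```python
import ipaddress


def is_self_assigned(address: str) -> bool:
    """Return True if the address is IPv4 link-local (169.254.0.0/16)."""
    ip = ipaddress.ip_address(address)
    # is_link_local is True for 169.254.0.0/16 on IPv4 addresses.
    return ip.version == 4 and ip.is_link_local


# Hypothetical workstation address, not taken from the thread.
print(is_self_assigned("169.254.12.7"))
```

A workstation that reports such an address never received a lease, which is consistent with either a DHCP problem or a lower-layer problem between the client and the relay.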

This morning buildings G, H, and A are down completely.  From the core, I can ping their switches in the IDFs (Dell PC 3448) but cannot ping nodes on those VLANs.  Nodes on those VLANs cannot ping the IDF switch of their building and cannot ping the core switch.  Again... 169.254...
Today I could get 6 workstations on vlan 20 to come across the switch in the G building and connect with the network.  Any more workstations we try to add on vlan 20 won’t connect… just a 169.254 address.

The problem is not electrical, and no other device has been installed on the network.  No other changes were made on the network.

On the core switch we are seeing a ton of activity in the Statistics/RMON, Table Views, Utilization, Counter, Interface and Etherlike:

Utilization shows 3 ports with 100% Non Unicast Packets Received, while all the other ports are working fine and show 100% Unicast received.

Counter Summary shows significantly more Received Non Unicast Packets on the bad ports than on the good ports, which report much lower counts.

Interface Statistics: I've cleared the counters, and on the bad ports I'm seeing significantly more broadcast packets than unicast or multicast packets.  The good ports show unicast packets in the tens of thousands and very few broadcast or multicast packets.
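The comparison described above can be made explicit by computing the non-unicast share of received packets per port and flagging ports where it dominates. This is only a sketch: the counter values, the dictionary layout, and the 50% threshold are assumptions for illustration, not readings from the switch.

```python
def non_unicast_share(unicast: int, multicast: int, broadcast: int) -> float:
    """Fraction of received packets that were multicast or broadcast."""
    total = unicast + multicast + broadcast
    if total == 0:
        return 0.0
    return (multicast + broadcast) / total


def suspect_ports(counters: dict, threshold: float = 0.5) -> list:
    """Return the ports whose non-unicast share meets the threshold."""
    return sorted(
        port
        for port, c in counters.items()
        if non_unicast_share(**c) >= threshold
    )


# Hypothetical counters keyed by core switch port number.
sample = {
    5: {"unicast": 40000, "multicast": 50, "broadcast": 200},
    9: {"unicast": 120, "multicast": 300, "broadcast": 9000},
}
print(suspect_ports(sample))
```

A port that stays near 100% non-unicast after the counters are cleared is a reasonable place to start looking for a loop or a storm source.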

There are zero reports under the Etherlike Stats for bad or good ports... all looks good there.

Port 9 on the core switch connects to the IDF in building G, vlan 105.  I even moved the connection to port 12 thinking it could be a bad port.  Same issue.

I considered a broadcast storm but since we have some buildings up and some buildings down I cannot isolate a particular building.  At one point I disconnected all nodes from building G and rebooted everything hoping to break a loop if there was one.  no luck.

I have bounced every switch at least 5 times.

Is my core switch failing or do I have some weird dhcp issue?

Thanks for any help you can give me.
LVL 38

Expert Comment

by:Aaron Tomosky
Is there wifi in the buildings? Could someone have plugged into one building and connected to wifi in another and crossed vlans?
 
LVL 38

Expert Comment

by:Aaron Tomosky
Ok, I just thought of a much better reason. Someone could have plugged the LAN side of a dhcp enabled router into a port thinking they can just use it as a switch.
 

Author Comment

by:imayjustdriveoffintothesunset
Thank you for these two ideas.  No on the first, and no on the second. Our switches are all in locked boxes, and only I have the key.
 
LVL 21

Accepted Solution

by:
Rick_O_Shay earned 500 total points
Can you check your DHCP server and see what the leases look like? They shouldn't all be depleted or aged out or anything like that, but active with various amounts of time left on them.

Also, can you manually configure one PC in the failing area or areas so you can see whether it is a network issue or a DHCP issue when the problem occurs?

Make sure all of your uplinks are properly configured for spanning tree. I would also recommend that you use spanguard, or whatever Dell calls it, to prevent edge ports from activating with anything except edge devices: PCs, printers, etc., but not other switches.

I would take a look with wireshark and confirm what may be causing a broadcast storm and that no one else is adding a DHCP server.
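One way to act on the second half of that suggestion is to note the source address of every DHCP Offer seen in a capture and compare it against the servers you expect. The sketch below assumes the source addresses have already been extracted from the capture by hand or by export; the function name and all addresses are hypothetical.

```python
from collections import Counter


def unexpected_dhcp_servers(offer_sources, known_servers):
    """Count DHCP Offers per source IP that is not a known server."""
    counts = Counter(offer_sources)
    return {ip: n for ip, n in counts.items() if ip not in known_servers}


# Hypothetical data: 10.1.10.5 stands in for the one valid DHCP server.
offers = ["10.1.10.5", "192.168.1.1", "10.1.10.5", "192.168.1.1"]
print(unexpected_dhcp_servers(offers, {"10.1.10.5"}))
```

An empty result suggests that no second DHCP server answered during the capture window; it does not rule out one that was silent or on a VLAN that was not captured.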
 
LVL 38

Expert Comment

by:Aaron Tomosky
I mean in building G, someone could have plugged in a rogue access point (consumer router), thereby adding another DHCP server to the network. I could still be totally wrong; I just wanted to clarify.
 

Expert Comment

by:manni78
If you can ping all your switches from the core switch, then I suggest you start troubleshooting by giving a static IP to the PCs that are not picking up an IP from DHCP.
If you can ping nodes in other buildings from the server VLAN, then it’s not an issue with physical connectivity.
Check the configuration of the uplink ports between the core and access layer switches. Each uplink port should be a trunk port; make sure the server VLAN is allowed on the trunk port.
One more thing: check whether there is an option to define the DHCP relay server IP on the core/access switch.

TA
 

Author Comment

by:imayjustdriveoffintothesunset
I can't tell you how appreciative I am to have all these suggestions...

aarontomosky:  I went to all 9 rooms and physically inspected and unplugged each port.... thinking that same thing.  No rogue AP.  We do use some palm switches so I thought maybe a teacher or kid messed with the cords and caused a loop... nothing.


Rick O Shay:  DHCP looks okay.  The buildings that are down (the scopes are defined per building) have no leases... they expired yesterday.  The buildings that are still up show leases ending at various times today or tomorrow.

I manually config'd a few workstations in building G with a vlan address...   Vlan 105 ip is:  10.1.105.1 so I manually config'd 10.1.105.100 or 10.1.105.200, etc.  subnet of 255.255.255.0 gateway of:  10.1.105.1 and nothing... can't even ping the IDF switch.
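As a sanity check on that static configuration, the host addresses can be tested against the gateway and mask. This is a minimal sketch using the addresses quoted above; the function name is hypothetical.

```python
import ipaddress


def same_subnet(host: str, gateway: str, netmask: str) -> bool:
    """Return True if host falls in the subnet implied by gateway/netmask."""
    # strict=False lets us build the network from a host address (the gateway).
    network = ipaddress.ip_network(f"{gateway}/{netmask}", strict=False)
    return ipaddress.ip_address(host) in network


for host in ("10.1.105.100", "10.1.105.200"):
    print(host, same_subnet(host, "10.1.105.1", "255.255.255.0"))
```

Both addresses sit in the same /24 as the gateway, so the settings are valid on paper. A correctly addressed host that still cannot reach its own IDF switch suggests the fault lies below DHCP, most likely at layer 2.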

All uplinks are properly config'd for spanning tree.  This network has been running properly for years this way but I did go in with Dell support to confirm it's correct.

On the 4th thing you mentioned, I'd like a bit more info to help me out...  Which vlan would you suggest I monitor traffic on?  I have nothing coming through at all on vlan 105 (Building G), 101 (Building A), or 107 (Building H).  Should I look for DHCP Offers or ACKs from any DHCP server other than the one I know is valid?


manni78:  I can ping all the switches from the core... even the ones whose buildings are down.  I can't ping even with a static IP.  For a 6 hour period yesterday our printer vlan 20 was able to ping out to all other vlans and nodes across its IDF switch and across the core switch.  The uplink ports were working fine for years before now and nothing has changed.  They are trunk ports and the vlan is allowed on the trunk port.  Yes, our core is config'd to define the DHCP relay.
 

Author Comment

by:imayjustdriveoffintothesunset
could this be it?  I've used wireshark to watch the traffic on and off for 30 minutes now.  The core switch continues to send ARP to some servers.  

I see about 200 ARP requests to 10.1.10.17 in 3 seconds of capture
I see no response

I see about 200 ARP requests to 10.1.10.16 in 3 seconds of capture
I see no response

I see 50 ARP requests to 10.1.10.15 in 3 seconds of capture
10.1.10.15 (a server) responds back 13 responses

When I look at the switch I see the ARP cache.  All these IPs and MACs are already listed.
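To put those capture numbers on a common scale, they can be converted to requests per second and a reply ratio. This is a rough sketch: the counts are the approximate figures quoted above, and the function name is hypothetical.

```python
def arp_summary(requests: int, replies: int, seconds: float) -> dict:
    """Summarize ARP traffic for one target as a rate and a reply ratio."""
    return {
        "requests_per_sec": round(requests / seconds, 1),
        "reply_ratio": round(replies / requests, 2) if requests else 0.0,
    }


# Approximate counts from the 3-second captures described above.
observed = {
    "10.1.10.17": (200, 0),
    "10.1.10.16": (200, 0),
    "10.1.10.15": (50, 13),
}
for target, (requests, replies) in observed.items():
    print(target, arp_summary(requests, replies, 3))
```

Roughly 67 unanswered ARP requests per second for a single address that is already in the ARP cache is far more than normal cache refresh would explain, which is consistent with broadcast traffic being repeated or replies being lost.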


 

Author Comment

by:imayjustdriveoffintothesunset
super bad news.

I config'd a new core (Dell 6024F), put it on the network, and it didn't fix my issue.

So a core switch issue is ruled out.

could this be dhcp?  services are running.  leases look fine.  Can the program code get messed up???  What could be interfering with DHCP?
 
LVL 38

Expert Comment

by:Aaron Tomosky
I like the arp path you were on for a second but I don't know how to troubleshoot that.
 

Author Comment

by:imayjustdriveoffintothesunset
Would it be fruitless to put dhcp on another server and take the old one down?
 
LVL 38

Expert Comment

by:Aaron Tomosky
The fact that you put in a static IP and still couldn't ping the switch from the building makes me think it's not DHCP as the root cause.
 

Author Comment

by:imayjustdriveoffintothesunset
The only reason I can't let the DHCP idea go is that for a 6 hour period I could get workstations on the printer vlan using static and dynamic addresses, and that I'm getting a 169.254 address.
 
LVL 38

Expert Comment

by:Aaron Tomosky
You could be right. I'm just leaning toward all traffic being jacked possibly because of arp and dhcp is just a side effect. But no way to be sure yet.
 

Author Comment

by:imayjustdriveoffintothesunset
thanks for all the back and forth aarontomosky... it gives me more brain power.  Another thought after 3 hours of reading.  could STP be the culprit?  It seems to get nasty on Dells?
 

Expert Comment

by:manni78
As aarontomosky said, if you can’t ping with a static IP it can’t be a DHCP issue. But yes, it could be STP.

Have you made any changes on the core switch? If yes, could you check that the ports that interconnect switches are not configured with "spanning-tree portfast"? Do you have any logs from the core and access switches?
 

Author Comment

by:imayjustdriveoffintothesunset
Well... it was a broadcast storm started from a small switch in a teacher's classroom in a building that wasn't taken down.  It took down 4 other buildings!
