vanroybel

Network is unstable: we lose it, it comes back, we lose it again, and so on

Hello,

We have a strange problem since yesterday. Suddenly, in the morning, we lost all network functionality. I could not ping any machine and nothing worked anymore (printer, as400, net, mail, ...). Since it was everything at once I thought it was the switch.

We tested a bit and then changed one switch. It worked fine for 2 hours. Then again, we lost all connections. We replaced another switch and the problem went away. 2 months ago we had to change another switch, so now all the switches on the network are relatively new. It worked fine for the remainder of the day.

This morning, five minutes after I got in, the same thing happened. No more network. Two minutes later the network came back. 20 minutes later it went down again. We searched for an hour and shut down the server and router, but we didn't find why it went down. Finally, 20 minutes ago the problem went away. We could ping any machine again and everything seemed to work.
When the network is down, the switch still shows all the computers connected to it and we have blinking lights meaning there is traffic.

So we still have no idea where the problem is coming from, and we have no solution if it comes back. Does anyone know where it could be coming from?
ASKER CERTIFIED SOLUTION
Syed Muhammad Usman
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
vanroybel

ASKER

We don't have that many users: around 25, plus 15 printers and a lot of devices that have IP addresses. So we only have 2 switches that are next to each other. We have other switches, but they are located where we need them: around ten 8-port switches in the factory, one for each place where we need to connect a few devices, and another 24-port switch that we use to connect some of the users and the factory.

I'm going to verify what you said, because it might not be so hard for me to find since we don't have that many connections. And I have been asked to draw a plan of what is connected where on those switches. That's gonna be fun.
Solution:
Finding loops is the most difficult job (I believe), so you had better buy one L3 switch (e.g. Cisco 3750) and make it the core switch; connect all servers and other switches to the core switch. Enable STP.

I don't think my boss is gonna like that solution. The price is unbelievable for a 25-user firm. Over $1.5k???
Besides, I don't think loops are the problem, the problem began yesterday morning and I didn't change anything in the network configuration for a while now. I'm still gonna verify but let's imagine I ask my boss for the new switch and we get the problem again the day after. I think my boss would kill me!
From your post "(printer, as400, net, mail, ...)" I assumed you are running AIX on an IBM AS400 with some mission-critical applications or something very important; that's why I suggested an L3 switch.
Once you have STP enabled on the switch, and IF the problem is due to loops, you will not have the problem. Why?

Once STP is enabled and the switch detects a loop, it will automatically block the redundant port, and on a Cisco switch you can view the blocked ports with "show spanning-tree blockedports".
But I fully agree with you: if you have only 25 users and a small network, there is no need to buy an L3 switch if you don't have the budget. Try troubleshooting. I would suggest you draw a cable map.
By the way, do you have any antivirus installed on your network?
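
Once the cable map is drawn, checking it for loops can be done mechanically. Below is a minimal Python sketch, assuming the map is written down as a list of switch-to-switch links. The switch names and the stray patch cable are invented for illustration, not taken from this network. It reports every link that closes a loop, using a union-find structure.

```python
def find_redundant_links(links):
    """Return the links that close a loop in an undirected cable map."""
    parent = {}

    def find(node):
        # Follow parent pointers to the representative of the node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected by other cables: this one makes a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical cable map; the names are placeholders.
cable_map = [
    ("core-1", "core-2"),
    ("core-1", "factory-24p"),
    ("factory-24p", "factory-8p-a"),
    ("factory-8p-a", "core-2"),  # hypothetical stray patch cable
]
print(find_redundant_links(cable_map))
```

Which link gets reported depends on the order in which the links are listed; any one cable of a loop can be the one to unplug. On unmanaged switches without STP, no such link should exist at all.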
Have you checked errpt on the AIX box to see what if any errors it is reporting?
Yes, coming from an SMB consultant, that would be a very high-end switch for a small network. Maybe a 2960G would be a better suggestion. But aside from that, unless the switch died, which is entirely possible, you need to identify the source device. The problem is being caused by some device or loop on the network. If you only have < 48 devices on the network, it should not take long to remove everything and then put it back in an orderly manner. That is what I would refer to as the brute force approach.

Besides a network loop, another cause could be a device doing a rogue proxy ARP on your network. This is just as difficult to find, though. I would review the ARP tables on the PCs after running tests, then cross-reference the MAC addresses with the IP addresses and see if they are right.

What OS level is your AS400 at?  There was a recent issue similar to yours related to v6r1.

What kind of network switch do you have? Is it manageable at all?

So...
1) Check the ARP tables on the PCs and servers to find any inconsistencies
2) Try to identify a loop, either through the management interface or through brute force.
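
The ARP cross-check in step 1 can be partly automated. Here is a rough Python sketch, assuming the table was saved from the Windows "arp -a" command (IP address followed by a dash-separated MAC address on each line). The sample addresses are made up. It lists every MAC address that answers for more than one IP, which is what a device doing proxy ARP would look like.

```python
import re
from collections import defaultdict

# One table row: an IPv4 address, spaces or tabs, then a MAC address.
ARP_LINE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})[ \t]+((?:[0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2})"
)


def macs_with_multiple_ips(arp_output):
    """Map each MAC address seen with more than one IP to its sorted IP list."""
    ips_by_mac = defaultdict(set)
    for ip, mac in ARP_LINE.findall(arp_output):
        mac = mac.lower().replace(":", "-")
        # Broadcast and IPv4 multicast entries are expected, so skip them.
        if mac == "ff-ff-ff-ff-ff-ff" or mac.startswith("01-00-5e"):
            continue
        ips_by_mac[mac].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Made-up sample in the Windows "arp -a" layout.
SAMPLE = """
Interface: 10.0.0.15 --- 0xb
  Internet Address      Physical Address      Type
  10.0.0.1              00-11-22-33-44-01     dynamic
  10.0.0.2              00-11-22-33-44-02     dynamic
  10.0.0.40             00-11-22-33-44-99     dynamic
  10.0.0.41             00-11-22-33-44-99     dynamic
  10.0.0.255            ff-ff-ff-ff-ff-ff     static
"""
print(macs_with_multiple_ips(SAMPLE))
```

A hit is only a lead, not proof: a router or a server with several addresses will legitimately show up here too, so each flagged MAC still has to be matched against the real device.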

~Jon
It is possible a bad network card on a PC or printer could be flooding the network.  You don't mention what kind of switch you are using.  If it is a switch with some management capabilities via web interface you may be able to determine which port on the switch is getting flooded.

We've had a net card in a printer go haywire and flood a switch with bad data which dropped the segment.
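
If the switch exposes per-port counters, finding the flooded port comes down to comparing two readings taken some time apart. A small Python sketch of that comparison follows, assuming the broadcast packet counters have been copied by hand from the web interface into dictionaries keyed by port number. All the numbers are invented.

```python
def busiest_ports(before, after, interval_s, top=3):
    """Rank ports by packets per second between two counter readings."""
    rates = {}
    for port, count in after.items():
        # Ports missing from the first reading are treated as starting at zero.
        delta = count - before.get(port, 0)
        rates[port] = delta / interval_s
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top]


# Invented broadcast counters read 60 seconds apart.
before = {1: 1000, 2: 500, 21: 2000}
after = {1: 1600, 2: 560, 21: 602000}
print(busiest_ports(before, after, interval_s=60))
```

A port whose rate is orders of magnitude above the others is the first cable to unplug. The sketch ignores counter resets and wrap-around, so the readings should be taken close together.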
The problem did not occur again.

Syed :
We have an antivirus on each computer of the network and it is up to date.

Carmd :
I looked on Wikipedia to see what AIX is, and I don't think we have that here. Is it a part of an AS400? If so, I don't know about it.

Jsnyderman:
I will check the ARP tables. I still need to draw the maps, I didn't have time to do it yet.

Rward :
The switch we have has no management capabilities. Is there another way to find the problem with a netcard going haywire?
I found a managed switch and I am learning how to use it to monitor my network traffic.

I'll post here when I get results.
So I installed Procurve Manager and I'm monitoring activity. I had a few warnings lately :
"Critical treshold violation on 10.0.0.2(10.0.0.2):21:port#21, utilization Rx(ingress) 100 >= 90"

I have no idea what it means. I had it 3 times between 13:14 and 13:33.

I just sent a lot of pictures on the network to print on our production printer, could that be the cause?

Other than that, there don't seem to be any errors. I'm still reading the manual when I have time; maybe I'll find a better way to find errors than looking at the dashboard.
Well, it probably had nothing to do with the photos I sent over the network, since I tried it a couple of times again and didn't get the error message anymore.
ProCurves are very nice manageable switches. You should be able to track it with the manager when it happens. That message simply indicates high utilization on port 21. What is plugged into port 21? This could be a server or that printer, but if the problem is not persistent, it's probably unrelated.
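
To make the "100 >= 90" part of that warning concrete: utilization is the share of the link's capacity used over a sampling interval, and 90 is the alert threshold. The Python sketch below shows the usual arithmetic on received-byte counters. How ProCurve Manager samples the port internally is an assumption here, and the 100 Mbit/s link speed and the byte counts are made-up values.

```python
def rx_utilization_percent(octets_start, octets_end, interval_s, link_bps):
    """Percentage of link capacity used by received traffic over the interval."""
    bits_received = (octets_end - octets_start) * 8
    return 100.0 * bits_received / (interval_s * link_bps)


THRESHOLD_PERCENT = 90  # the "90" in the warning message

# Assumed example: 675,000,000 bytes received in 60 s on a 100 Mbit/s port.
utilization = rx_utilization_percent(0, 675_000_000, 60, 100_000_000)
print(utilization, utilization >= THRESHOLD_PERCENT)
```

A large print job or file copy can legitimately fill a port for a few seconds, so a short burst at 100 is not a fault on its own. A port that stays pinned there is the one to investigate.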

Have you not had the problem again since yesterday? All the causes we have mentioned would be a lot more persistent than that. What time zone are you in? Look for devices which may have been powered off or in power saver mode.

~Jon
You have a bad network adapter or a bad cable feeding one of your machines.
The problem still hasn't come back.
I've had no other events on my ProCurve 1700.
I'll wait til the end of this week for the problem to appear again, then give points if it does not.
SOLUTION
I just had another thought. I would recommend analysing the port stats on the ProCurve. If the problem was not the switch, but rather a bad NIC, the switch could be filtering the packets and eliminating the problem. If your old unit was a switch, it should have done the same. But if your old unit was a hub, one bad NIC could easily bring down a network.

~Jon
It was a switch before too.
If there is a bad NIC, it shouldn't be able to access the network, right?
Because in that case we don't have a bad NIC: every computer and device can access the network.
Not necessarily. A bad NIC can manifest itself in many ways, including a network flood. Not working would be the most obvious, but many other symptoms can arise from a bad NIC or bad cable. BTW, even with a switch, if the NIC is flooding with broadcast traffic, it will still affect your whole network. It's just less likely on a switch.

~Jon
Sorry for the typos :)
Thanks everyone for your input. The problem did not occur again this week.

I'll post another question if it comes back.