• Status: Solved
  • Priority: Medium
  • Security: Public

Mysterious problem, can't find a pattern

We are experiencing a major mysterious issue and hope that someone can make any sense of this.

We have a 35 user Novell 3.12 (yes, I know) single server network being accessed by mostly WinXP computers. The server is located in a computer room that is air conditioned and has been running smoothly for 13 years.

As of 1 month ago, we began experiencing slowdowns in our DSL Internet connectivity at approximately 5:20pm. We did the usual troubleshooting, eventually replaced the modem, and all was great for another week. Then, between 5pm and 6pm, we'd lose Internet and then network connectivity. We'd reboot and couldn't connect.

- we'd check the logs on the server and all seemed fine; it looked as if the server was still operational and that the issue was infrastructure related
- we'd down the server at the time of failure and bring it back up, but we still couldn't connect
- on one occasion we left the server on despite the failure, and in the morning around 9:00am everything resumed as if nothing had happened
- we've replaced the network card in the server
- we've replaced the switches
- we've replaced the UPS

consistencies:
- disconnect from Internet then network
- always between 5p and 6pm
- everything is normal after 9:10 am

inconsistencies:
- it doesn't happen every day
- if it happens on the weekend, we wouldn't know, because we're not here

Your ideas and suggestions are greatly appreciated.
Boopig
Asked:
 
_treySter Commented:
Do you lose connectivity to the Novell box or just the Internet at those times? (or both)
 
fruhj Commented:
You make it sound like you're losing network connectivity around the same time.
You also mention you lose Internet first, then the network.

I suspect something on your network is flooding the network during that time.
If my hunch is right, one of your PCs is responsible.

What I would try is this: when you think you've lost Internet connectivity, then the network, unplug the Internet and see if the network returns to normal.

If it doesn't, unplug 12 users and see if it improves, then another 12, then the last 11, and plug in one of the others to see if the network has improved. (I assume you can do this by just pulling cables from your hub or switch.)

If I'm right, pulling one of the groups will restore your network, and if that's the case it should be pretty easy for you to find out which PC it is. From there you can determine if it's compromised, has a faulty NIC, etc.
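
Pulling groups of cables like this is essentially a halving search. Here is a rough sketch of that logic in Python, assuming there is exactly one offending machine. The `network_is_healthy` callback is hypothetical: it stands in for the manual step of unplugging a set of PCs and testing the network, and the host names are made up.

```python
from typing import Callable, List, Set

def find_culprit(hosts: List[str],
                 network_is_healthy: Callable[[Set[str]], bool]) -> str:
    """Narrow a single flooding host down by unplugging half the suspects at a time.

    network_is_healthy(unplugged) is a hypothetical stand-in for the manual
    step: unplug exactly those hosts and report whether the network recovered.
    Assumes exactly one culprit among `hosts`.
    """
    suspects = list(hosts)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if network_is_healthy(set(half)):
            # Network recovered with this half unplugged: the culprit is in it.
            suspects = half
        else:
            # Still flooded: the culprit is among the suspects left plugged in.
            suspects = suspects[len(half):]
    return suspects[0]

# Simulated run: 35 workstations, with PC-23 as the (made-up) flooder.
pcs = [f"PC-{n:02d}" for n in range(1, 36)]
checks = []

def simulated_check(unplugged: Set[str]) -> bool:
    checks.append(len(unplugged))
    return "PC-23" in unplugged

culprit = find_culprit(pcs, simulated_check)
print(culprit, "found after", len(checks), "rounds of cable pulling")
```

With 35 machines this takes at most 6 rounds, compared with up to 35 if you pull one cable at a time. If more than one machine is misbehaving, the single-culprit assumption breaks and you would need to repeat the search after removing each machine you find.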

- Jack
 
Boopig (Author) Commented:
Thanks Jack.

I'll try this tomorrow and I'll update you.

-BP
 
Boopig (Author) Commented:
treyster, missed your comment, sorry. At first it's just the Internet, and all seems fine on the server if you're already logged in. The moment you reboot to restart the Internet, you discover you can't log in.

And for all, it's NetWare 4.11 (still old, I know).
 
mtpcbypc Commented:
Sounds like fruhj is on to it. I had a very similar problem on a completely different server. It turned out that one of the workstations had a network-aware virus on it (the client wasn't paying for anti-virus updates). Disconnect the users, reset the gateway and router, and see if the problem goes away at the server. Then add one workstation at a time until things go bad again. Also, the next time the network is up and good, do an online scan through Trend Micro or Symantec from each of the workstations.


Good luck
 
Boopig (Author) Commented:
fruhj / jack,

You were right. Tonight it happened again, at about the same time, but this time I unplugged the DSL modem/router and the network continued to function. If we logged off, we could log back in.

Now it's a matter of finding the culprit. Here are my ideas, and some of them are along the lines of yours. Please let me know what you think and thank you very much so far!

- when it happens again I will disconnect modem from switch but plug it directly into a laptop to see if that can surf w/o a problem

- run some sort of packet sniffer capture product to analyze where and what kind of traffic is causing the flooding (Do you have any suggestions of a free trial one?)

- run Hijack This on all workstations to see if there's something secretly running
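
For the packet capture idea, the useful question is usually "which address is sending the most?". A minimal sketch in Python of that tally, assuming the capture has already been exported to rows of (source, destination, size in bytes). The export format and the sample addresses are invented for illustration and do not come from any particular sniffer.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Packet = Tuple[str, str, int]  # (source address, destination address, size in bytes)

def top_talkers(packets: Iterable[Packet], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n source addresses that sent the most bytes, busiest first."""
    sent = Counter()
    for src, _dst, size in packets:
        sent[src] += size
    return sent.most_common(n)

# Invented sample rows standing in for an exported capture.
sample = [
    ("192.168.1.20", "192.168.1.1", 1500),
    ("192.168.1.31", "192.168.1.1", 400),
    ("192.168.1.20", "192.168.1.1", 1500),
    ("192.168.1.42", "192.168.1.1", 60),
    ("192.168.1.20", "192.168.1.255", 1500),
]
for address, total in top_talkers(sample):
    print(f"{address:15} {total:6d} bytes")
```

A machine that dominates this list during the 5pm to 6pm window would be the first one to unplug and inspect.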

Please let me know what you think; that goes for anyone else reading this, too.

Cheers,
BP
 
TreyH Commented:
Odds are you have one or more infected workstations. Fruhj is 'right on the money.' Narrow down your problem workstations by pulling wires.
Depending on what type of switches/hubs you have, you can sometimes make a good guess by keeping an eye on the activity lights as you disconnect workstations. If you have managed switches, you can check the logs for unusually high traffic on any ports.
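
If the switch exposes cumulative byte counters per port, two readings taken a known interval apart are enough to spot the noisy port. A small Python sketch of that arithmetic follows; the port numbers and counter values are invented, and it assumes you have copied the counters by hand from the switch's management page. Counter wraparound is ignored.

```python
from typing import Dict

def port_rates_kbps(before: Dict[int, int], after: Dict[int, int],
                    interval_s: float) -> Dict[int, float]:
    """Turn two cumulative byte-counter readings per port into kbit/s."""
    return {port: (after[port] - before[port]) * 8 / 1000 / interval_s
            for port in before if port in after}

def busiest_port(rates: Dict[int, float]) -> int:
    """Return the port number with the highest rate."""
    return max(rates, key=rates.get)

# Invented readings taken 60 seconds apart (port number -> bytes so far).
before = {1: 1_000_000, 2: 5_000_000, 3: 250_000}
after = {1: 1_030_000, 2: 5_010_000, 3: 2_650_000}

rates = port_rates_kbps(before, after, interval_s=60)
for port, kbps in sorted(rates.items()):
    print(f"port {port}: {kbps:.1f} kbit/s")
print("busiest port:", busiest_port(rates))
```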

The usual war chest:
http://www.grisoft.com/
http://www.safer-networking.org/en/index.html
http://www.lavasoft.com/
http://www.spywareinfo.com/~merijn/downloads.html     (Hijackthis)

Good Luck.

BTW: Still running a Novell 3.12 box in production - they just won't die, lol.
 
Boopig (Author) Commented:
Thank you TreyH!

I'll keep you posted.

-BP
 
fruhj Commented:
Hi Boopig

The next step I would take is to unplug groups of PCs from your switch while leaving the Internet connection up.

Here's my logic for doing this:
The problem is likely on the inside of your network, going out.
My hunch is that something's running on a PC, and that once it loses Internet connectivity, it stops flooding. (If it didn't, you'd still have a dead network after disconnecting the Internet.)
So if you disconnect the Internet and connect it to a laptop, you won't really learn anything new.
Ideally you'd take a machine you know is good (i.e. one with a clean Windows install on it) and use it as the "judge".
Then you'd take down the "suspects" one at a time, using the "judge" to check whether the network is still down.

Since you said the Internet was affected first, I would use the judge to do something on the Internet and see when it improves.
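
The "do something on the Internet" check can be as simple as timing a TCP connection from the judge machine. A minimal sketch in Python, using only the standard library; the function names are made up for this example, and the target host and port are whatever reachable service you choose.

```python
import socket
import time
from typing import Optional

def probe(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and failed name lookups.
        return None

def describe(result: Optional[float]) -> str:
    """Format a probe result for a quick log line."""
    return "unreachable" if result is None else f"ok in {result * 1000:.0f} ms"
```

From the judge, you would call something like `describe(probe("www.example.com", 80))` after each suspect is unplugged (the host here is only a placeholder) and note the point at which the result changes from "unreachable" to "ok".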

You mentioned using a sniffer, and those have their place, but I think they are somewhat limited here: if your switch is doing its job, the traffic won't make it to the port you have your sniffer on. Many managed switches have a copy mode (port mirroring) where they can copy one port's traffic over to another, but that's a lot more work than just pulling one cable at a time. That said, I've used the AnalogX sniffer before. I like it because it doesn't require a special packet driver to be installed, and it's freeware. Look at http://www.analogx.com and then look for the software called "packetmon".

There's some software I really like called RootkitRevealer, from
http://www.sysinternals.com/ntw2k/freeware/rootkitreveal.shtml
Take a moment to read what it does, then run it on any workstation you suspect has been compromised.

If you've narrowed the problem down to one machine, you're probably best off reformatting it.

Good luck!
 
mtpcbypc Commented:
No matter what else you do, if you have an infection, please disconnect the problem machine from the network and do thorough scans of the other machines. There's no better way to protect the rest of the network than to unplug the culprit.
Good luck.
That's my 2 cents.
 
Boopig (Author) Commented:
Here's the latest update.

- It happened again this evening at the usual time.
- Found that about 6 computers were ON and I pulled them off the network BUT still NO Internet.
- On one of the switches, found the light to be quickly flickering, different from other lights. Pulled that cable and went to cubicle where that computer was OFF.
- Booted that computer (Win Me) and ran Hijack this on it. Will analyze/post it here later.
- Will have the clean computer tomorrow in order to 'judge' the network after all computers are pulled from it.

Noticed this too, which makes sense:
We've got an ancient PRIME minicomputer attached to the network, accessible via Kermit95 using an IP address. It is NOT accessible during the flooding condition, whereas the IPX/SPX server is.

Thanks,
BP
 
mtpcbypc Commented:
From my experience, that is one of two things: a network collision/broadcast storm caused by a bad NIC or bad cable, or a bad port on the switch. Try to isolate each of these. If the machine was off, there should have been no activity on the port, especially none caused by software. Do you have, or can you borrow, a replacement switch? One in a million, it was an outside attack on the last recorded location of a machine with a backdoor-trojan-type infection. I'd put my money on the switch port, cable, or NIC.
Good luck.
Running HijackThis on a machine that was powered down at the time may be a waste of your time.
 
fruhj Commented:
Hey, mtpcbypc already beat me to this, but hearing that your IPX stack is OK and IP is not does raise questions about the switch.

I had a 3Com switch once that seemed to lose the 10Mb segment. It was a 10/100 switch and all ports were 10/100. We thought we were losing our printer, but it turned out that anything at 10Mbit was down (we only had a few devices at 10Mbit, so it wasn't painfully obvious).

Anyway, this was a common and ongoing problem until we replaced the switch with a newer one.

You could try changing the port the Internet connection uses, to see if the switch shut down that port.
 
Boopig (Author) Commented:
The switches were both replaced 2 weeks ago and so was the network card in the server as a troubleshooting attempt to resolve this issue. I'll still try a different port.

-BP
 
mtpcbypc Commented:
It sounds like you are down to the cabling or a bad NIC, not in the server but in the workstation whose activity light was going nuts: the unit that was off but still locking up the port. Collision lights don't flash for nothing. If the workstation has onboard LAN, disable it and throw in a generic Realtek or a 3Com 3c905cx-txm if you're feeling fancy. From where I sit, it's a 30% chance it's the NIC in the workstation, 65% a cabling problem, and 5% that static electricity killed a port on your new switch.
Good luck.
 
mtpcbypc Commented:
fruhj had a good point there about speed settings and NICs. If you have a network of mixed speeds, sometimes 10/100 auto-negotiation isn't as happy as it seems. I've had 3 installations with setups that made me pull my hair out. I had to manually set the negotiation speeds on the NICs to either 10Mb half duplex or 100Mb half duplex, depending on their original capacity. Then my problems went away.
Thanks fruhj
 
Boopig (Author) Commented:
Though the problem's actual culprit was not found, I thought I'd close this out because of potential delays in finding that culprit. Here's what was discovered since my last update.

Spoke with the DSL provider, Covad/XO, and together we monitored the router and saw that during the day there were huge fluctuations in line usage, BUT they were huge because the line is a paltry 384k SDSL line with 35 users!! So it's clear that it is an INBOUND traffic issue and a matter of finding which computer is causing it. The client has agreed to upgrade to a much fatter DSL line of 3Mbps/768kbps and is aware that the problem could potentially continue. It's become a little political because the boss, non-technical as he is, refuses to accept that one or more of his employees is streaming music or using the Internet for his/her own purposes!!
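
To put numbers on how little headroom that line has, here is a back-of-the-envelope calculation in Python. The line speeds and user count come from the post above; the 128 kbps figure for a single music stream is an assumption chosen only for illustration.

```python
def per_user_kbps(line_kbps: float, users: int) -> float:
    """Even share of a line's capacity if every user is active at once."""
    return line_kbps / users

USERS = 35
OLD_SDSL_KBPS = 384    # symmetric SDSL line, from the post
NEW_DOWN_KBPS = 3000   # 3 Mbps downstream on the upgraded line, from the post
STREAM_KBPS = 128      # assumed bitrate of one music stream, not from the post

print(f"old line, per user: {per_user_kbps(OLD_SDSL_KBPS, USERS):.1f} kbps")
print(f"new line, per user (down): {per_user_kbps(NEW_DOWN_KBPS, USERS):.1f} kbps")
print(f"streams that saturate the old line: {OLD_SDSL_KBPS // STREAM_KBPS}")
print(f"streams that fit in the new downstream: {NEW_DOWN_KBPS // STREAM_KBPS}")
```

Under that assumption, three simultaneous streams would be enough to fill the old line completely, which fits with the connection collapsing while the IPX side of the LAN keeps working.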

Thank you to all who have contributed to resolving this matter and thank you FRUHJ for your help especially!!

Cheers,
BP
 
fruhj Commented:
You're welcome!

Good luck with this one!