BeGentleWithMe-INeedHelp


Server loses network connectivity every few days.... a reboot solves it. Any suggestions?

I help out a charity that has a Windows Server 2012 R2 machine running as Essentials. Email is handled by Office 365. Twice now, they have lost access to the server and the internet (the server is the DNS server for the machines on the LAN) after a few days of uptime. A reboot cures everything. I have an RMM agent on the server, and both times the server last checked in to the RMM tool around 4 AM.

Wondering how to troubleshoot that.

Some background:

Recently, there was a lightning strike nearby. It fried the key card door entry system, the alarm system and the phone system, along with some of the computer network.

Side note / question: the other systems were OLD... electronics are more susceptible to surges as they age, right? Caps dry out, etc. A surge that might not have affected a newer machine could cause more harm to an older system?

Anyway, it blew out the main 16 port switch in the telco closet.  Getting that replaced.  Blew out a couple older data switches in a couple offices (using a single data line to that office to put the printer and PC on the internet), blew out a PC and monitor in another office.

The server appeared to be OK... but we're starting to see the need to reboot it. Twice now, they have lost access to the server from the machines on the LAN.

Sitting at the server, I can ping its own IP address but can't ping any other IP addresses. From the other machines, I can ping other machines and the router. On the server I can do most anything: browse files, check event logs, open apps. But anything related to getting out through the lone network port fails.

The machine is still under warranty with Dell: it's 4 years old on a 5 year warranty.

If I call Dell, they'll likely have me reinstall the OS / it takes days for the issue to crop up so that's not going to work : )

I look in the event log and don't see any smoking guns.

In network settings, I disable / enable the adapter and it doesn't bring it back online. I choose Diagnose and it gives me an error about the IP address not being set as dynamic. The server should be static, right? It's been static for 4 years now, so I don't think that's the issue.

Any ideas / recommendations?
arnold

If your switches are managed, look at whether the port is locked out. Make sure there are no CRC errors on the port,
e.g. from a mismatch where the switch is configured for Auto but the speed/mode on the network adapter is locked at 1000FDX.
This will destabilize the port and could bring the port down; in older days, the switch would become unresponsive.

By static, do you mean that in the network adapter's settings both IPv4 and IPv6 have a static IP? Often IPv6 is not set, and the diagnostic might be referring to that, with an IPv6-related issue causing the problem.
Note that when you disable the adapter, the running applications effectively lock/prevent the release of the network, which is likely why the issue does not resolve. If you stop the related services (Exchange, etc.), then disable the port and bring it back up, it may...
sole server, once its network drops, services such as DNS, DHCP and ....

Does the system/application event log report loss of network?
What hardware is in use, and does it have hardware monitoring/reporting (IBM Director, HP Insight, Dell OpenManage)?

Next time, hopefully ... when you disable the network adapter, run

netstat -an | find /i "established"

You will likely see several connections that will take some time to release...
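If the raw netstat listing is too long to eyeball, a rough sketch like the one below could tally the TCP rows by state, so you can see at a glance how many connections are still ESTABLISHED versus waiting to be released. It assumes the usual English column layout of `netstat -an` (protocol, local address, foreign address, state); the sample text and its addresses are made up for illustration, and `count_tcp_states` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative sample only: these rows are invented, not taken from the
# netstat.txt attached to this thread.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    192.168.1.4:445        192.168.1.20:49712     ESTABLISHED
  TCP    192.168.1.4:3389       192.168.1.21:50311     ESTABLISHED
  TCP    192.168.1.4:49200      192.168.1.22:445       TIME_WAIT
  UDP    0.0.0.0:53             *:*
"""


def count_tcp_states(netstat_text):
    """Count TCP rows per connection state in `netstat -an` style text."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows carry at least: proto, local address, foreign address, state.
        # UDP rows have no state column, and header lines do not start with TCP.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3].upper()] += 1
    return states


for state, total in count_tcp_states(SAMPLE).most_common():
    print(state, total)
```

To use it on the real thing, save the command output to a file (for example `netstat -an > netstat.txt`) and pass the file's contents to the function instead of the sample.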
What kind (make/model) of network switch is it? I've seen a similar issue (never solved), but with a specific switch brand.
Check the switch obviously. Try a different port for the server. Also be sure to follow all of the previous suggestions.

One major thing I see left out of this: were/are any of the devices on surge protection units? If not, obviously this should be sorted right away. Some UPS devices offer surge protection for incoming phone or cable lines (good to have if your internet is DSL or cable, as you can protect the modem from surges two different ways).
Always happens at about 4 am? I remembered seeing this on a workstation once, but never figured out the cause as the problem just went away after about 2 weeks. And that would drop daily at the same time.

Also, is there a second NIC in the system? After some of the other troubleshooting, it may be worth looking at using the other one.
If you replaced the network switches, odds are they are at a lower firmware revision, so a flash might help.
Most stuff coming from the factory has been sitting on the shelf awhile, so chances are they need to be checked for the latest firmware.
As for how they are configured, is spanning tree enabled?
I've seen issues with different switches and spanning tree from time to time.
BeGentleWithMe-INeedHelp

ASKER

Thanks for all the info and advice!

The switch is an unmanaged Netgear. The one that died from the surge was a 16 port gigabit. I have a new 24 port gigabit in there temporarily while waiting for a replacement switch from Netgear under their lifetime warranty. So yes, the problem has been there since the 24 port went in. But I've power cycled the switch and that didn't change the situation. Only a server reboot does.

The server only has one NIC

Here are pictures from the event logs. They were taken on the 20th at 3PM before the reboot.  The server appears to have the network issue starting around 1AM when it can't get to logmein? on the 20th, but nothing in the event log as a smoking gun.

There are loads of event log categories to look at under Microsoft. Is there a way to see all the subsections at the same time? I looked under Microsoft at DNS Server, DHCP Server and networking and didn't see any smoking guns.


It's a Dell PowerEdge. It likely has OpenManage on it, but does that have to be configured? What would I want to set it to do?

that command:

netstat -an | find /i "established"

Should I run that now or when there's a network outage?

Is there an app or script or something I can set up in scheduled tasks so that when it can't ping the router after some low number of attempts, it reboots the server and logs something to let me know?  

That way at least I don't have to wait for them to call saying there's a problem.
Please double check the configuration on the network adapter (Properties, Configure, Advanced): the speed/duplex should not be fixed but rather set to auto-negotiate.


I would stay away from automatic reboots. It is one thing when the cause is network related, but with repeated reboots you can get into a reboot loop.....
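If you want something watching the link without rebooting anything, a minimal sketch along these lines could be run from Task Scheduler every few minutes. It only appends a timestamped line to a log when the router stops answering several pings in a row. The router address 192.168.1.1 is the one mentioned in this thread; the failure count, the log file name and the function names are assumptions for illustration. On Windows, ping can exit with code 0 even when the reply is "Destination host unreachable", so the sketch also looks for "TTL=" in the output.

```python
import datetime
import platform
import subprocess

ROUTER_IP = "192.168.1.1"   # router address as given in this thread
MAX_FAILURES = 3            # assumption: consecutive failed pings before logging
LOG_PATH = "netwatch.log"   # hypothetical log file name


def ping_once(host, timeout_s=2):
    """Return True if a single ping to `host` gets a real echo reply."""
    on_windows = platform.system() == "Windows"
    if on_windows:
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, OSError):
        return False
    if result.returncode != 0:
        return False
    # Windows may return 0 for "Destination host unreachable"; require a TTL.
    return (not on_windows) or b"TTL=" in result.stdout


def link_is_down(probe, attempts=MAX_FAILURES):
    """True only if every one of `attempts` probes fails (stops at first success)."""
    return not any(probe() for _ in range(attempts))


def log_outage(path, now=None):
    """Append one timestamped line to the log and return it."""
    now = now or datetime.datetime.now()
    line = f"{now.isoformat(timespec='seconds')} cannot reach {ROUTER_IP}\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
    return line


def check_and_log():
    """One watchdog pass: log if the router is unreachable, otherwise do nothing."""
    if link_is_down(lambda: ping_once(ROUTER_IP)):
        return log_outage(LOG_PATH)
    return None
```

Call `check_and_log()` at the bottom of the script and schedule it. The log then gives you the time of the first failure, which is more useful for narrowing down the cause than a blind reboot.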

Double check in OpenManage that the iDRAC (remote access) IP is not the same as the server's IP.....

If you can access the iDRAC, you should be able to reboot the server from it. If your iDRAC is sharing the only network interface and LAN access is not present, the iDRAC interface might not be reachable either.

Move to a different port and see if that makes a difference.

Based on your symptoms, presumably you have tried replacing the patch cable from server to wall, and from patch panel to switch.
The only other concern is that when the switch blew, it had some impact on this network card..
but recheck it to make sure it is set to auto-negotiate.....
2 things:

Here's the output from the netstat command.  Is that normal? All those ports?

And the pictures.
netstat.txt
2018-07-20-15.51.35.jpg
2018-07-20-15.51.08.jpg
2018-07-20-15.50.05.jpg
In an elevated command prompt, run netstat -anb; it will tell you which process each of those connections belongs to.

You could also use Process Explorer/Process Monitor from sysinternals.com, which MS purchased .....
How many zones does your DNS have? Make sure you did not configure it as a ROOT DNS: check the DNS server's Root Hints tab... there should be listings there for a.root-servers.net ......


Without knowing what is running and what services it provides, it is hard to answer. In some cases it is normal. Check the forwarding settings on the DNS server to make sure you are not creating loops there.
arnold - no iDRAC, and the network connection on the server is set to auto-negotiate.

Replace cable??  Interesting!  can't assume anything I guess.  I'll change that out tomorrow.  I hate these wait several days to see if it's fixed intermittent problems.

Oh, as for UPSs: yes, all the computers are on UPSs (or surge protectors for some PCs), but the phone system and the data switch were blown out and I know they had UPSs. The fire alarm and door control systems are elsewhere in the building and I don't know about their UPS or surge protection.

Also, the UPSs / surge protectors don't have data cable filtering. Is that common or uncommon these days? And the ones that do have it don't slow down the network, i.e. they keep it at gigabit speeds?

With all the cat 5 cables in a network, that stuff acts as an antenna, picking up the energy from a nearby lightning strike, right?  

And UPSs / surge protectors have a voltage at which they kick in, right? Clamping voltage? A surge of 200V might get by the surge protector and harm older systems? Or were the UPSs / surge protectors just older? Would you replace all of them after a big hit like this?
Wow! that netstat -anb is a long output!

See attached.

 netstat.txt

And DNS - check this picture.  

User generated image
Should there be 2 entries? Server and server.domain.local? They look like they have the same info.
Again, this is a simple network: 1 Essentials machine, 10 computers, email on Office 365. Web hosting is elsewhere. This is just a file server and QuickBooks database server. Is there a way to see the date each of those was created? Was server created 4 years ago and server.domain.local last week, or something else?
They are the same, it is possible that you had one open and then added the other and saved the console view..

The majority relate to the Server Essentials OS....
Get DSET from Dell and run it.

Usually there are/were surge with phone line shielding. Surge protection of the equipment is often sufficient without ...

Does the server have another path to the switch, i.e. a direct route bypassing .
In short, eliminate any .. while placing a workstation on the server's connection to see whether the issue follows the network or the machine.

I.e. switch the server path with a workstation path. If the workstation starts experiencing these issues, you know the issue is with the cabling.
If the issue remains with the server, you'll know the issue is with the SW/HW on the server.

One option: use GPMC to disable the server's Microsoft network connection test:
Computer Configuration\Administrative Templates\System\Internet Communication Management\Internet Communication settings
Turn off Windows Network Connectivity Status Indicator active tests (enable the policy) and see if that helps,
i.e. periodically the test fails and the system goes into a mode as though .....
Usually there are/were surge with phone line shielding. Surge protection of the equipment is often sufficient without ...

The fact that network devices were hit (3 switches, the NIC on 1 PC, along with 1 PC / monitor that won't boot at all) makes me feel the surge got onto the data wires.

Would you recommend we delete 1 of the 2 DNS entries? Which would you go for? Just to clean things up? Or would that cause other problems?

Another path to the switch: yeah, I'll try swapping the data cable from the server with a desktop on a separate home run nearby.

When I google dset dell, I get:

https://www.dell.com/support/article/us/en/19/sln292237/how-to-use-dset-22-in-windows?lang=en

and it says DSET is old? The machine is a PowerEdge T110 II. Do you know what generation that is? Would you suggest SupportAssist Enterprise instead?

But the fact that a server reboot solves things, I am thinking it's in the server itself.
SupportAssist, I think, starts working with gen 13 (PE 720/320/420/620); the T110 II is likely still generation 11, circa 2012.
http://www.dell.com/downloads/global/products/pedge/T110_II_Spec_Sheet.pdf

search for dset on this link:
https://www.dell.com/support/home/us/en/19/product-support/product/poweredge-t110-2/drivers

This is an entry level system, not sure how much detail it logs. Something else might be causing the lock up.
Does the OS report any issues/validation?
When the network stops, you are still able to login on console and try different things?


Run ipconfig /all and make sure the Physical Address is present and not blank. If it is blank, then following a reboot look at what the Physical Address shows, use the adapter configuration to manually set the MAC address to that identifier, and see whether that fixes the issue.

Not recently, but I have seen a time when the Physical Address disappears, without which the packets ..... access is lost.
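To make that check less error-prone during an outage, a small sketch like this could scan saved `ipconfig /all` output and list any adapter whose Physical Address line is blank or malformed. It assumes the English-language output format (adapter headings ending in a colon, a "Physical Address" label, hyphen-separated hex pairs); the sample text, its MAC value and the helper names are made up for illustration.

```python
import re

# At least six hyphen-separated hex pairs (tunnel adapters can show eight).
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(-[0-9A-Fa-f]{2}){5,}$")

# Illustrative sample only: not the ipconfig output attached to this thread.
SAMPLE = """\
Windows IP Configuration

Ethernet adapter Ethernet:

   Description . . . . . . . . . . . : Example NIC
   Physical Address. . . . . . . . . : 00-11-22-33-44-55

Ethernet adapter Ethernet 2:

   Description . . . . . . . . . . . : Example NIC #2
   Physical Address. . . . . . . . . :
"""


def physical_addresses(ipconfig_text):
    """Return (adapter heading, physical address) pairs from `ipconfig /all` text."""
    adapter = None
    found = []
    for line in ipconfig_text.splitlines():
        stripped = line.rstrip()
        # Adapter headings are unindented and end with a colon.
        if stripped and not line[0].isspace() and stripped.endswith(":"):
            adapter = stripped[:-1]
        elif "Physical Address" in line and ":" in line:
            found.append((adapter, line.split(":", 1)[1].strip()))
    return found


def adapters_missing_mac(ipconfig_text):
    """Names of adapters whose physical address is blank or not MAC-shaped."""
    return [name for name, mac in physical_addresses(ipconfig_text)
            if not MAC_RE.match(mac)]


print(adapters_missing_mac(SAMPLE))
```

Saving the output once while the network is working (`ipconfig /all > good.txt`) and again during an outage gives you two files to run through the same check and compare.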
I rebooted the server a couple days ago I think it was, for something else.

today, it seems around 1AM, the server lost connectivity. I'm going there soon. I'll do the IPconfig /all and anything else people above suggested....

anything else?
Cabling, port switch, please pay attention on the physical address.
Check event log.
A follow up.  

Some notes from last week. Since these notes, I wound up setting up nightly reboots.  That's kept the unit up during the day. But would love to know if it's hardware or software issue.

server has been up since 7/25 3:30PM  It's now 7/27 12 noon.  Lost connectivity around 1AM so running for <1 1/2 days
ipconfig /all - see attached
can ping its own ip4 address 192.168.1.4
can't ping the router 192.168.1.1
swapped data cable with PC next to it, so different data cable from NIC to switch & Different port on switch - problem of no network connectivity remains with server

Something interesting?  Before and after swapping cable, the port lights on the nic - 1 was solid green and flashing (data?) - so the NIC could see a data cable was connected?

Event logs - nothing of note since going off line early 7/27 - see attached pic
Deleted 2nd dns zone

Without network connectivity, how to get DSET on there? I didn't have a thumb drive, so I took a 2 TB USB drive from the server (used for backups), formatted as NTFS.
Connected the drive to a desktop, copied DSET from the download folder to the drive.
saw the file in windows explorer on the USB drive
ejected the drive from task bar
Connected drive on server - all the backup files were there, but not the recently added DSET

Moved drive back to the desktop, copied the file into an existing subdirectory, brought it back to server....still not visible on server.  I closed / opened / refreshed the view in windows explorer and still no new file
On server, I opened a DOS window - DSET file not there on USB drive
attrib *.* file not there
tried USB thumb drive - not seeing the thumb drive when inserted in usb of server

device manager - WD ses Device USB device - the drivers for this device are not installed.- this was the USB 2TB drive

chkdsk says it's raw format?!

After reboot, I could get to the files on the USB 2TB and thumb drive along with network connectivity restored.

I installed dset.  Ran report. A dos window opens briefly.  And then check the report browser and nothing is there. I've tried basic reports and advanced reports.  Nothing shows in the browser.

Thoughts?
Before-reboot-can-t-ping.jpg
before-reboot-ethernet.jpg
before-reboot-IP-settings.jpg
before-reboot-limited-network.jpg
before-reboot-network-lights-on.jpg
before-reboot-network-status.jpg
before-reboot-no-errors.jpg
Before-reboot-troubleshooting.jpg
even-after-rboot--creating-a-report-.jpg
bump??

Can anyone recommend a NIC I can install in this PowerEdge T110 II, to see if using that is the answer?

I am thinking I boot up on a linux USB over a weekend and see if it loses connectivity before monday AM (pointing to hardware issue)

And or there's windows bootable USBs?  I don't want to play with the drives / hard drive controller... I'll solve 1 problem and cause another : (

Being a single mobo, Dell would just replace the entire mobo...  with shadowprotect full image, anyone know what I'd have to do to get things running with a new Mobo?
ASKER CERTIFIED SOLUTION
arnold
Thanks. The nightly reboots have put the issue aside, but I guess I should find a better / different fix.

I'll try that NIC.