Network problem? Server drops pings, then comes back, every 2 minutes

We have two DHCP servers (both domain controllers) in our Active Directory domain. ADC1 is a Windows 2003 machine; ADC2 is Windows 2008.

One of our servers (SQL) keeps dropping connections. For example: for 2 minutes it will ping properly, and shares show up when accessed via the network (\\sql). Afterwards, it drops the connection, and becomes un-pingable. After a brief period, it comes back online, and responds to ping requests.

Some workstations on the network don't have the problem. Some constantly switch between working and non-working. Rebooting the ADCs and the SQL server itself does not help.

Computers with a lease from either ADC1 or ADC2 have the same problem. For example, one PC with a DHCP lease from ADC1 reaches SQL without drops in connection, while another, on the same DHCP server, has its connection dropped.

SQL is on a static IP, and its Event Viewer does not show any network changes or any service or driver failures. No automatic updates, no expired licenses, no limit on the number of simultaneous users, no new IP address leases. Workstations also don't change IPs, even though they're on DHCP.

Any ideas on what could be causing SQL to drop and re-establish connections?

The only suspicious log I found on ADC1 was "The DNS server has encountered numerous run-time events. To determine the initial cause of these run-time events, examine the DNS server event log entries that precede this event...." Event ID: 3000.

Unfortunately, the log does not show anything prior to this event.
Asked by: 94704

Dave_LaSalle commented:
On the server ping loopback with the -t switch
let it run for a while
use ctrl + Break keys to get stats of pings/loss
if it's dropping it may be your nic
if not then suspect dns
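
If you would rather have the loss figure in a script than read it off the console, a minimal Python sketch could parse the summary that ping prints when you break out of it. It assumes the English-language Windows output format, and the sample text below is made up for illustration.

```python
import re

# Matches the summary line printed by Windows ping (English locale assumed), e.g.
#   Packets: Sent = 120, Received = 118, Lost = 2 (1% loss),
STATS_RE = re.compile(r"Sent = (\d+), Received = (\d+), Lost = (\d+)")


def parse_ping_stats(output: str):
    """Return (sent, received, lost) from ping output, or None if no summary is found."""
    match = STATS_RE.search(output)
    if match is None:
        return None
    sent, received, lost = (int(group) for group in match.groups())
    return sent, received, lost


def loss_percent(sent: int, lost: int) -> float:
    """Percentage of echo requests that got no reply (0.0 if nothing was sent)."""
    return 100.0 * lost / sent if sent else 0.0


# Illustrative sample only, not output captured from the server in this thread.
sample = """
Ping statistics for 127.0.0.1:
    Packets: Sent = 120, Received = 118, Lost = 2 (1% loss),
"""

stats = parse_ping_stats(sample)
if stats is not None:
    sent, received, lost = stats
    print(f"sent={sent} received={received} lost={lost} loss={loss_percent(sent, lost):.1f}%")
```

Loss on the loopback address points at the local TCP/IP stack rather than the wire, so it is worth running the same check against the server's real IP as well.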
snusgubben commented:
If you have any sort of anti-virus programs on the SQL server I would try to uninstall it and see if that helps (even if you have marked the SQL program folder to not be online scanned).
castercorey commented:
Are you using a 100 Mbit switch or a gigabit switch on your network?
94704 (author) commented:
The loopback pings on the server (on sql: ping sql -t) work fine. NIC is ok.

No antivirus on the server; nothing that would refresh it every 2 minutes or so.
pwindell commented:
First, a few misconceptions. Computers are not "connected to DCs". They start up, they authenticate, then nothing. You could throw the DC out in the street and the workstation wouldn't know the difference until the next time authentication was needed for something. DHCP: by default the client contacts the DHCP server once at startup and then once every 4 days if left running (half of the default 8-day lease), so DHCP is irrelevant.
DNS could be one of your problems (just a big guess).
DCs should never be multi-homed.
DNS scheme should be:
1. All machines on the LAN use only the AD/DNS and nothing else, ever. That includes the DCs themselves and the firewall device.
2. The external DNS (the ISP's?) should be listed as a Forwarder within the config of the DNS service. If omitted, the DNS will default to using root hints.
3. Examine the DNS zone and delete any incorrect, duplicated, or wrong entries. Most entries will be added automatically, so I recommend that you do not manually add any. A fast way to make a machine re-register itself in DNS is to right-click on the "connection" and tell it to do a repair. You can also do this from a command prompt:
IPCONFIG /registerdns
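
To make step 3 less of an eyeball exercise, here is a rough Python sketch that flags host names carrying more than one address. It assumes you have copied the zone's A records into a list of (name, IP) pairs by hand; the records and the `find_conflicting_hosts` helper below are hypothetical, not taken from this network.

```python
from collections import defaultdict


def find_conflicting_hosts(records):
    """Group (hostname, ip) pairs and return the names that map to more than one IP."""
    by_name = defaultdict(set)
    for name, ip in records:
        # DNS names are case-insensitive, so "SQL" and "sql" are the same host.
        by_name[name.strip().lower()].add(ip.strip())
    return {name: sorted(ips) for name, ips in by_name.items() if len(ips) > 1}


# Hypothetical A records (documentation address range), for illustration only.
records = [
    ("SQL", "192.0.2.10"),
    ("sql", "192.0.2.57"),
    ("adc1", "192.0.2.2"),
    ("adc2", "192.0.2.3"),
]

for name, ips in find_conflicting_hosts(records).items():
    print(f"{name} has {len(ips)} addresses: {', '.join(ips)}")
```

Several A records for one name can be intentional (round robin, for instance), so treat a hit as something to review rather than something to delete automatically.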
snusgubben commented:
On SQL server: ping 127.0.0.1 -t

at the same time:

On a host:
ping <IP to SQL> -t


Do both drop packets after, e.g., 2 minutes?
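
To line the two runs up against each other, something like the following Python sketch could timestamp each probe and report the outage windows. It is only a sketch: `ping_once`, `find_outages` and `monitor` are names made up here, it shells out to the Windows `ping` command with `-n` (count) and `-w` (timeout in ms), and it assumes that real replies contain a `TTL=` field.

```python
import subprocess
import time


def ping_once(host: str, timeout_ms: int = 1000) -> bool:
    """Send one echo request using the Windows ping command."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        capture_output=True,
        text=True,
    )
    # Windows ping can exit with 0 on "Destination host unreachable",
    # so also require the TTL field that only real replies carry (assumption).
    return result.returncode == 0 and "TTL=" in result.stdout


def find_outages(samples):
    """Turn [(timestamp, ok), ...] into a list of (start, end) outage windows."""
    outages, start = [], None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp          # first failed probe of a new outage
        elif ok and start is not None:
            outages.append((start, timestamp))
            start = None
    if start is not None:
        outages.append((start, None))  # still down at the last sample
    return outages


def monitor(host, probe=ping_once, interval=1.0, count=600):
    """Probe the host `count` times, `interval` seconds apart, and return its outages."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe(host)))
        time.sleep(interval)
    return find_outages(samples)


# Synthetic samples: down from t=1 to t=3, and down again from t=4 onward.
print(find_outages([(0, True), (1, False), (2, False), (3, True), (4, False)]))
```

Running `monitor("127.0.0.1")` on the server and `monitor("<IP to SQL>")` on a workstation at the same time would show whether the outage windows coincide or only appear on the workstation side.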
bhartwell commented:
Try restarting the DNS service on your server and see if that helps.
snusgubben commented:
btw which HW is the SQL on?
94704 (author) commented:
@ pwindell: I already tried ipconfig /registerdns. No errors in the Event Viewer.
@ snusgubben: yes, I tried pinging IP, pinging the hostname, and pinging loopback 127.0.0.1. No loss, no downtime. Responds to pings constantly. Basically, I'm convinced that NIC is fine.
@ bhartwell: both ADCs have been restarted a few times now, therefore DNS service on them has been restarted as well.
94704 (author) commented:
Do you think re-joining the domain under another name, rebooting, and then re-joining it again as SQL would fix the problem?
snusgubben commented:
You said that it became un-pingable in your initial post?!

If you have 1 Gbit network cards on the SQL, you could try to lock it to 100 Mbit just to see if it is a negotiation problem.

Outdated NIC drivers?
94704 (author) commented:
@ snusgubben: the machine becomes unpingable only by other workstations in the AD. For example: workstation1 can ping SQL. 100% works. 2 minutes later, the same workstation cannot ping SQL. 2 minutes later, workstation1 is able to ping SQL. And so on...
snusgubben commented:
If your SQL had a machine account problem, the DCs would have logged it.

This sounds to me like a NIC / NIC driver / "1000 Mb vs 100 Mb" problem or some sort of anti-virus (firewall) problem.

The clients cache the FQDN and get a Kerberos ticket the first time they connect to the server. I doubt it's a domain problem (but who knows :)
Galtar99 commented:
Is your SQL server plugged into a managed switch?  If it is, can you look at the switch to see if it notices the link dropping?  (show log)  Make sure logging is set to informational.
If it isn't, do you see the link light go out?  Maybe try switching it to another port on the same switch to see if that port is going bad.  It may seem like an obvious thing, but update the NIC driver and ensure the server has all the latest windows updates.
Maybe hard code the switch and/or NIC to one speed 1000MB/Full or whatever you feel comfortable with.
94704 (author) commented:
@ Galtar99: The server and the machine are both set to 100FDx, so that rules out any auto-config mismatches. Procurve (switch) log does not show any conflicts on the port or mac-address related conflicts.

@ snusgubben: Windows firewall is off/disabled. No other antivirus or firewall is present on the machine. Just updated the network card drivers (intel) to the most recent.

Still, most workstations on the network lose their connection to SQL every 2 minutes.

Odd: the connection with SQL (and ability to successfully ping) gets restored sooner, if the workstation1 user re-accesses the shared folder (\\sql). Nonetheless, the server drops connections and re-establishes them every 2 minutes.
castercorey commented:
Configure your NIC to match your network switch. So if your switch is a gigabit switch, then configure the NIC for full duplex at gigabit speed.
94704 (author) commented:
With the help of a few of you and your suggestions, I figured it out.

The problem was a combination of:

1.) Speed mismatch between the switch and SQL. Now, instead of "auto" on both ends, the server and switch are set to 100FDx (although both support a gigabit connection).

That fixed the collisions and, partially, the DNS issue.

2.) DNS entries. Over the years, one of the ADCs accumulated several static entries for "SQL". Deleting every one of those entries in the DNS management console and allowing DNS to populate again fixed that issue.

What really fixed it was adding a second NIC and re-registering it with the ADCs (DNS servers), while removing any previous references to SQL in the DNS management console.

Big thank you to all of you. I hope this information will be useful to someone with a similar problem.
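
For anyone who wants to check for the same stale-record symptom from a workstation, a minimal Python sketch along these lines lists every IPv4 address a name currently resolves to and flags any that are not the expected one. The host name, addresses and helper names are placeholders, and the call only reports what the local resolver returns, which may include cached answers.

```python
import socket


def resolved_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses the name resolves to."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror (name not found, no DNS, ...).
        return []
    return sorted(set(addresses))


def check_host(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Report all resolved addresses and those that differ from the expected one."""
    addresses = resolved_addresses(hostname, resolver)
    stale = [ip for ip in addresses if ip != expected_ip]
    return {"addresses": addresses, "stale": stale}


def fake_resolver(hostname):
    """Stand-in resolver with made-up data, so the example runs without a network."""
    return ("sql.example.local", [], ["192.0.2.10", "192.0.2.57"])


print(check_host("sql", "192.0.2.10", resolver=fake_resolver))
```

Dropping the `resolver` argument makes it query the real resolver. Running `ipconfig /flushdns` on the workstation first avoids reading an old cached answer.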
pwindell commented:
OK, well, the speed thing would have fixed transmission errors, not collisions. Switches don't have collisions because of the way they create virtual circuits. Not a big thing, but you should be aware of that.
Adding a second NIC is almost never a solution to anything unless the first NIC is removed or disabled, and even then you can have trouble if the old NIC was not set to use DHCP before it was removed. So if what you did caused the machine to become dual-homed, you may have to get ready for future problems. Unless a machine is being used as a router, firewall, or proxy, or is using NIC teaming, dual-homing is almost always a very bad thing.