tmaususer asked:

Servers unable to UNC to each other but can to different servers

We have several Win2016 servers, in particular a SQL server and a web server.  After some arbitrary number of days the web server can no longer run any queries: no SQL-backed application runs from the web server without error.  The web server is unable to \\unc to SQL and SQL is unable to \\unc to the web server, but both can \\unc to other servers just fine.  The message in the web server's event log is "A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)"  If I change the connection string in web.config to point at our development SQL server, everything works.  SQL and Web can ping each other by name and by ip address.  Both have hosts file entries pointing to each other.  Neither is running a firewall.  Rebooting the web server changes nothing, but in the past rebooting the production SQL server has allowed them to talk again.  This is not a good fix.  Any clues?  Again, they do talk, then they just stop.
Paul MacDonald

Is it possible the DNS entry for the SQL server expires (is scavenged)?  Check DNS the next time this happens, if possible, and/or do a ipconfig /registerdns on the SQL server.
Also, verify that the "Shared Memory" and "Named Pipes" protocols are enabled for the SQL instance(s).  I can't remember for sure which one it is that SQL might use for those types of connections.
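
To make the DNS check repeatable the next time the problem occurs, something like the following Python sketch could be run on each server.  It is only an illustration, not anything posted in this thread: the function name, the host name `sqlprod01`, and the address `10.0.0.20` are placeholders.  Note that the operating system's resolver typically consults the hosts file as well, so a hosts entry can mask a DNS problem.

```python
import socket


def check_forward_lookup(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Resolve hostname and report whether expected_ip is among the answers.

    resolver is injectable so the logic can be exercised without a network;
    by default it uses the operating system's resolver.
    """
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return {"hostname": hostname, "ok": False, "addresses": [], "error": str(exc)}
    return {
        "hostname": hostname,
        "ok": expected_ip in addresses,
        "addresses": list(addresses),
        "error": None,
    }


# Example call (placeholder values, substitute the real name and static IP):
# print(check_forward_lookup("sqlprod01", "10.0.0.20"))
```

Running it from both the web server and the SQL server, while working and while broken, would show whether name resolution actually changes between the two states.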
tmaususer (ASKER)

It is happening right now and will be until I reboot our production sql.
Under "SQL Native Client 11.0 Configuration (32bit)" Shared Memory, TCP/IP and Named Pipes are enabled.
Under "SQL Server Network Configuration" Shared Memory and TCP/IP are enabled, but Named Pipes is disabled.
In DNS there is only one entry for our sql server.  The timestamp on it is 4 days ago.  (But a lot of things are older than today.)
The problem is not general.  It is very specific.  We run some 40 databases off this thing and only one server is unable to connect to it.
"We run some 40 databases off this thing and only one server is unable to connect to it."
That wasn't clear to me in the original post.

You say there's no firewall on the destination server, but is there any sort of intrusion detection software?  Anything that might selectively block the source server?

"SQL and Web can ping each other by name and by ip address."
Is this true while the problem is active?  Also, can you map a different drive letter to the same destination while the problem is active?
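
Ping only proves ICMP reachability, while SQL connections and UNC paths ride on TCP.  As a rough sketch, assuming the default ports (1433 for a default SQL Server instance, 445 for SMB file sharing), the Python below tries a plain TCP connect to each port.  The function names are made up for illustration, and a named instance or a custom port would need different numbers.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def probe(host, ports=(445, 1433), timeout=3.0):
    """Check each port and return a {port: reachable} mapping."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example call (placeholder host name):
# print(probe("sqlprod01"))
```

Comparing the result by name and by IP address, from the blocked server and from a healthy one, helps separate a name resolution problem from a transport problem.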

Sorry for the confusion.  Yes, this sql server is answering other servers and desktop clients just fine.  It is only blocking a particular machine.  (That makes this real fun.)
The Windows firewall is disabled across all types, public, private, domain on both sql and web.  There is no 3rd party scanning done on either of these servers.  They both have the Windows Defender, but they all do.  SQL has lots of exceptions, but that is for performance reasons.  It is exactly like something monitors communications and eventually decides that it doesn't like the other.  I just can't put my finger on it.

From SQL, via explorer I can't map by name or ip, but can "net use" by IP Address, but not name.  The new map does show in Explorer and is browseable.
I can open, modify and close a file via the map.  Fast.

From Web, via explorer I can't map by name or ip, but can "net use" by Name AND IP Address. The new maps do show in Explorer and are browseable.
I can open, modify and close a file via the map.  Crazy slow lag. Notepad goes "Not Responding" but does eventually create or modify a text file from both maps.
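
To put numbers on "fast" versus "crazy slow", a small timing probe could be run against each mapped drive or UNC path.  The Python below is a minimal sketch under that assumption; the function name and the example path are hypothetical, and it needs write permission on the share.

```python
import os
import time
import uuid


def time_file_roundtrip(directory, payload=b"unc latency probe\n"):
    """Create, read back, and delete a small file in directory.

    Returns the elapsed seconds for each step so the two directions
    can be compared as numbers rather than impressions.
    """
    path = os.path.join(directory, f"probe-{uuid.uuid4().hex}.tmp")
    timings = {}

    start = time.perf_counter()
    with open(path, "wb") as handle:
        handle.write(payload)
    timings["write"] = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()
    timings["read"] = time.perf_counter() - start

    start = time.perf_counter()
    os.remove(path)
    timings["delete"] = time.perf_counter() - start

    if data != payload:
        raise IOError("read-back data did not match what was written")
    return timings


# Example call (placeholder share):
# print(time_file_roundtrip(r"\\webserver\share"))
```

Logging these timings periodically would also show whether the slowdown builds up gradually or appears all at once.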
Some questions:

- Do either/both of these servers have multiple network adapters? If so only one should have a gateway and they should be the same gateway.

- Are both servers on the same subnet?

- Is either server using DHCP (hopefully not) and if so, are they both using the same DHCP server?

- Is your DNS server active directory integrated?

- Are both servers on the domain?
Only one NIC per server.  (These are VMs.)  They both have OS static ip addresses, same vlan, same gateway, same subnet, same domain.  

Yes, our DNS is Active Directory-Integrated.  Replication : All domain controllers in this domain.

They have been working together for a long time but something recently is blocking communication.  It is only these 2 VMs.  I restarted sql this weekend and all is working again, but it is a ticking bomb.  It will break again.  It would seem that it is something on the sql server that is amiss since restarting it helps.

More fun.  After the sql restart the web started working again, but now another applications server has stopped talking to the sql server and vice versa.
Is it possible you have IP Address conflicts?
Or, maybe your Servers have static IPs but you have a DHCP server that is handing out the same IPs?
IP addresses for these machines are all static and excluded from our DHCP ranges.
I restarted SQL again this weekend to get it talking to a different server.  I really don't want to rebuild my sql server.
Since it's a virtual machine you may want to verify that VMware tools are installed and the proper network card is used.  I have seen many strange issues without the appropriate drivers.
Some quick thoughts that I don't think have been mentioned above yet.  Is this the same for any user?  If you log on to the server with a different admin user account, is it able to UNC to the other box, either after logging out the original account or with both logged on?

Have you tried killing explorer.exe in task manager and then run explorer.exe again?

Is the SQL server instance running as a user account or as Local System?
The network drivers have been the same for a long time now and I'm not sure they would cause an intermittent problem like we are seeing.  But I'll give it a look.

The servers log in as the same user.  Sql is run as Network Service or Local System.
Right now the systems are talking just fine.  Which is weird and just supports the erratic behaviour.  Restarting Explorer is a new idea.  When it happens I try to find things to do that do not power down the network or the server.  Restarting Explorer is worth a try.

Is port exhaustion a possibility?
Port exhaustion as in too many packets too quickly?  Might occur momentarily, but once this event occurs it stays broken until the sql server is rebooted.  It is also specific in its denial.  Other machines are unaffected while one is blocked.  While we are working fine now, there are two servers, plus the sql server, that this issue rotates between.  Each of the VMs is a clone of a standard installation.  So NIC and drivers are the same across the board.
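
For reference, port exhaustion usually refers to running out of ephemeral (dynamic) TCP ports rather than to packet rate, and one quick way to look for it is to count connections by state and by remote host.  The Python sketch below parses text in the layout that `netstat -ano` prints on Windows; the function name is made up for illustration, and nothing here was posted in the thread.

```python
from collections import Counter


def summarize_netstat(text):
    """Count TCP connections by state and by remote host from `netstat -ano` text."""
    by_state = Counter()
    by_remote = Counter()
    for line in text.splitlines():
        fields = line.split()
        # TCP rows look like: TCP <local> <foreign> <state> <pid>
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        foreign, state = fields[2], fields[3]
        by_state[state] += 1
        if state != "LISTENING":
            # rpartition splits on the last colon, so the port is dropped
            # and an IPv6 address such as [fe80::1]:445 stays intact
            remote_host = foreign.rpartition(":")[0]
            by_remote[remote_host] += 1
    return by_state, by_remote
```

On the SQL server the input could come from `subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout`.  A very large count for one remote address, or thousands of sockets in TIME_WAIT, would be worth comparing against the range shown by `netsh int ipv4 show dynamicport tcp`.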
It has done it again with yet a different server.  I have logged out of the affected server and logged in as a different admin level user.  I have a test app that does a simple query.  It still fails to connect.  All network drivers are the same.  Host file is empty.  Each can ping the other.  UNC is either extremely slow, minutes, or just fails to connect.  The blocked server can nslookup and get a response.  The SQL server fails an nslookup with a timeout.  DNS has entries for both servers.
It sounds like you have a DNS issue or something esoteric such as legacy WINS impacting communication.  Just a hunch, but also check DHCP to make sure that the IPs aren't being given out to other systems.  Check for dropped frames on the switch as well.
What about \\server.domain.local\share ?
All UNC is timing out.  I checked DNS again.  Both servers have records, forward and reverse.  SQL was up to date.  The other was 4 days old.  I deleted it and added it back.  No effect.  Both have static ip addresses with exclusions in dhcp.
We are putting a sniffer on each in hopes of getting more information.
ASKER CERTIFIED SOLUTION
tmaususer
This solution is only available to members.
I've seen this type of issue with the E1000 or flex controllers, as mentioned before; the VMXNET3 controllers are usually a good option to try.  In a 3000 VM environment we had the E1000/flex NICs go down from time to time, and changing them out to VMXNET3 would usually resolve the issue.

We have always used the VMXNET3.  We found out they were more compatible with our hardware right from the beginning.  I'm just glad I didn't have to stop production, but I was about to.  What I don't know is what abstraction layer was refreshed to make it work again.  Maybe there is a way to refresh it on some schedule.
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I have recommended this question be closed as follows:

Accept: 'tmaususer' (https:#a43153826)

If you feel this question should be closed differently, post an objection and the moderators will review all objections and close it as they feel fit. If no one objects, this question will be closed automatically the way described above.

seth2740
Experts-Exchange Cleanup Volunteer