nrhelpdesk

asked on

File/folder connection failures using UNC

I have a truly frustrating problem...

We have numerous Perl (ActivePerl 5.6) scripts running on a Windows Server 2003 SP2 machine that connect to various remote folders for production processing.  We are experiencing spurious failures when trying to connect to these folders when using UNC connections. (Mapped drives don't exhibit the problem but we don't want to use mapped drives for all the obvious reasons - locked into user context, etc.)

The connections will work for some period of time then fail with the infamous "No such file or directory" error.  This can occur after numerous connections to the same folder had just worked fine!  

A FileMon trace just give a "Bad Network Path" error when the errors occur.  Again, this happens even though previous connections to the same folder work fine just seconds before.

A network capture of the time the error occurs shows *no connection* attempt to the remote folder/file at the exact time the error occurs.  To me, this indicates the problem is internal to the file system some place as the folder/file connection attempt never actually makes it to the network interface.

There appear to be no identifiable resource (memory, file i/o, network bandwidth) issues at the time of the errors occur. It appears to be completely random. It doesn't always fail after x number of connection attempts...

Here's one odd symptom:  If I map a drive to the same location, I don't have the problem.  Even stranger, is that if I leave that drive mapped, but then return to using the UNC connection to the same location (same process context), the error does not occur. This condition is completely reproducible.

I really don't want to use mapped drives for the reasons mentioned above.   I know I can work around the issue by mapping a drive on-the-fly within the Perl script, but I would just really like to get to the root of the problem.
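For reference, the on-the-fly workaround would look something like this (the share path and drive letter are just placeholders, not our real ones):

use strict;
use warnings;

# Rough sketch of the drive-mapping workaround: map a letter, do the work,
# then clean up. \\fileserver\prod and Z: are placeholder names.
my $unc   = '\\\\fileserver\\prod';
my $drive = 'Z:';

system("net use $drive $unc") == 0
    or die "net use failed: $?\n";

opendir(DIR, "$drive\\") or die "Can't open $drive\\: $!\n";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir DIR;
closedir DIR;

print scalar(@entries), " entries found\n";

system("net use $drive /delete");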

Thanks for any help/ideas. I'd even be happy with some suggestions (tools, etc) on how to further debug the problem...


~A
ChiefIT

Sounds like a NetBIOS problem. Are these UNC sites on VPNs?

Mapped drives use DNS,
Wow, that didn't come out right, did it?

I meant to say are these computers that you are trying to access with the UNC accessed through a VPN?
nrhelpdesk (ASKER)

No... same network.  In at least one case, the same switch!

And, as I pointed out in my original post, when I do a network trace on the problem, the failures don't even make it to the network interface - at least there is nothing in the capture for the actual failure...
Are there any errors or warnings in the application or system event log entries on either the local or remote machine when this happens?
So when you did the trace, nothing showed up.

This is what I originally thought:
____________________________________________________________________
UNC paths differ by syntax:

http://en.wikipedia.org/wiki/Path_(computing)

\\computername\share uses NetBIOS
\\computername.domain.name\share uses DNS
\\IPaddress\share uses IP

NetBIOS is not routable, meaning it will not go over NAT, through a VPN tunnel, or across most firewalls. It also requires NetBIOS over TCP/IP. If this problem is random, you may have a problem with the browse list. So, on your PDCe, you might see 8032 and 8021 errors saying another computer thinks it is the master browser. If so, that conflict may screw up your browse list.

Try mapping by domain or IP UNC paths (there's a quick check below), or make sure you don't have a browser conflict and that NetBIOS is enabled on the NIC of both computers you are using the NetBIOS UNC path between.

_________________________________________________________________________________
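For what it's worth, here's a quick Perl sketch that exercises all three forms side by side (the NetBIOS name, FQDN, and IP below are placeholders):

use strict;
use warnings;

# Probe the same share via NetBIOS name, FQDN, and raw IP.
# All three paths are placeholders.
my @paths = (
    '\\\\FILESRV\\prod',                 # NetBIOS name
    '\\\\filesrv.example.com\\prod',     # DNS / FQDN
    '\\\\192.168.1.10\\prod',            # IP address
);

for my $p (@paths) {
    if (-d $p) {
        print "OK   $p\n";
    }
    else {
        print "FAIL $p ($!)\n";
    }
}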

NOW I am thinking:

I am beginning to think you have an intermittent NIC problem, but on various computers? Hmmm.

I think we should troubleshoot spanning tree/portfast.

The switch port uses either spanning tree or portfast. If this is a managed switch, XP Pro and Vista machines require portfast. Under spanning tree, the discovery of a path takes about 50 seconds, but it does define the path. That 50 seconds will time out an XP or Vista machine. So portfast skips the path discovery and goes straight to forwarding the packets.

Portfast and spanning tree are inverses of each other. Spanning tree will discover the path prior to forwarding the data. For switch-to-switch communications, this is highly recommended, but for workstation ports, portfast should be enabled. You might see 5719 errors in the event logs of the client computers if portfast is your problem.

I have an article that explains this issue:
http://tcpmag.com/qanda/article.asp?EditorialsID=277

Recommendations:
Enable portfast on workstation ports only, and use spanning tree on switch-to-switch or switch-to-router connections.

If it were a spanning tree issue, I would think the connections would never succeed. I can run a test (a Perl script using the same connection syntax the production scripts use) with multiple connection attempts: I can get 20 successful, successive connections, but suddenly the 21st will fail. It may then fail for several more connections, then resume as normal.
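The test is essentially just a loop like this (the share path here is a placeholder, not our production one):

use strict;
use warnings;

# Minimal version of the repeat-connection test: hit the same UNC path
# over and over, and log which iteration fails and with what error.
my $unc = '\\\\fileserver\\prod';    # placeholder share

for my $i (1 .. 100) {
    if (opendir(DIR, $unc)) {
        closedir DIR;
        print "$i: ok\n";
    }
    else {
        print "$i: FAILED - $!\n";   # typically "No such file or directory"
    }
    sleep 1;
}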

And again, I *never* see the failed connections in a sniffer trace.  If it was a spanning tree/routing issue, wouldn't I at least see the packets leave the server, then fail someplace in between source and destination?

I'm really thinking this is something internal to the server/file system.

*Very* frustrating problem...
What service pack are you running?
ASKER CERTIFIED SOLUTION
ChiefIT

Running SP2...

We were considering changing the speed/flow control on the NIC.  It's currently set to auto-negotiate. (Both sides - nic and switch port)  Not sure this would really help, however, as we are not seeing any packet loss or retransmits on the switch port.

We also have at least two other, identically configured servers (Dell PE2850's) that don't exhibit the problem.  That is, I can run the same test script on these servers ad-infinitum and never get the error.  

 I guess I'm just going to have to keep drilling into this server until I can find what is "broken"...

""We were considering changing the speed/flow control on the NIC.  It's currently set to auto-negotiate. (Both sides - nic and switch port)  Not sure this would really help, however, as we are not seeing any packet loss or retransmits on the switch port.""

I think I would hold off on this until we troubleshoot the MTU first. If you set your computer to a fixed setting while the switch stays on auto-negotiate, you may introduce a new problem into your LAN.

I also need to know if this server is a multihomed server. Multihomed servers can be problematic at best. Maybe the other packets are going out the other NIC while your primary NIC is busy. Multihomed is defined as having two or more IPs; that could mean two IPs on the same NIC, or two or more NICs.
This server is multihomed in both senses.  Multiple NICs - multiple addresses.

The 2nd NIC is connected to our backup LAN. Different subnet, and it doesn't even have a default gateway set, so it's not a likely culprit.

The primary nic does have 3 addresses assigned - primary address and 2 for oracle database connectivity.  All the same subnet. And there are no static routes defined on this server...

Let's troubleshoot the MTU first, then consider troubleshooting whether your packets are going out the wrong IP or NIC.

I could see packets being dispersed to the other NIC so that you end up with fragmented packets. What do you think of this theory?

You might look in DNS and see which IP addresses have SRV records. SRV records are SeRVice records that provide domain services to the client PCs. It does sound like you have DNS configured right; it's just that some packets may be going to the wrong IP or NIC.

Have you thought about moving Oracle to a different computer and teaming a couple NICs to provide a nice, fast connection to your Oracle applications? That way, it doesn't become a problem to your DC.

Another problem with Server 2003 is that, regardless of deselecting its ability to register the SRV records automatically, it will register both NICs anyway. Let me see if I can provide some information on this as well. Yep, here it is:

https://www.experts-exchange.com/questions/23356031/There-are-currently-no-logon-servers-available-to-service-the-logon-request.html

If it were me, I would consider moving Oracle to a node all to itself and doing backups on the primary NIC, leaving one NIC and one IP, meaning one defined path to the server.
MTU size tests out at 1500.  It's the same on all our servers.

I don't think this is an SRV record problem.  This is a stand-alone server - not a DC/GC.  No SRV records involved.  We already have the patch in place to ensure we don't register the backup NIC.

The only DNS entries we have (in AD integrated DNS) for this server are the stubbed addresses (all A records) we expect/want - one for the host name and two others that are used for database connectivity.

I'm afraid moving Oracle off the box is out of the question. That's its primary reason for living.  :o)

I did some further testing to try to eliminate the "traffic going out the wrong NIC" theory:

I ran a sniffer trace against the 2nd NIC just to verify that there were no attempted connections out that address/NIC.  I was able to repro the problem as usual. I could see multiple connection attempts going out the primary NIC as expected until they suddenly just stop and further connection attempts fail.  I saw no traffic going out to the target host via the 2nd NIC once the failures started...

So... can I add any other environmental factors for you to make troubleshooting this problem more difficult?  :o)

Now you know why I'm here...



Can you sniff the second and third IP of NIC 1 in the same manner?

The idea of moving Oracle was to put it on a stand-alone server. I just assumed this box was a DC. It is fine where it is. I just think we should look into where the HECK these mysteriously vanishing packets are going - maybe IP 1, 2, 3, or 4 of that server.

Have you considered disabling NIC 2 (while not doing a backup) to see if you can reproduce the error without it? The sniffer sounds pretty conclusive, but I still like a second opinion.

Tracing the primary NIC captured all three stubbed IPs, so I was seeing all of the traffic going in/out of that interface.

I can try disabling the 2nd nic for a test...  not sure that'll buy me much since the sniffer trace captured nada.  Worth a try, however.

It's really a perplexing problem.  Most of the "usual suspects" have been eliminated.  That's why I was looking for suggestions for additional tools/monitors that might be able to watch things better at the file system level.  It really doesn't seem like a network problem, tho' that was my first assumption as well.  From what I can see, whatever "dies" does so before getting out to the network layer in the communication/handshake process...

Weird...


I agree, this is perplexing as heck!!! Why would these packets disappear before leaving the machine? Usually the "abyss" is the network itself, not the machine. Sniffer programs don't always capture all data. So maybe we are relying too heavily on sniffer programs and not enough on ping.

Have you performed a ping IPaddress -t (a continuous ping that runs until you stop it)? To stop the ping, press CTRL+C. Do we lose any packets on a continuous ping?

And try disabling NIC two to see if this makes a difference. I doubt it, but it is worth a try.
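If you want to quantify the loss instead of eyeballing the -t output, a rough Perl loop like this would count drops (the host IP is a placeholder):

use strict;
use warnings;

# Rough packet-loss counter using the Windows ping command.
# $host is a placeholder for the remote file server's IP.
my $host = '192.168.1.10';
my ($sent, $lost) = (0, 0);

for (1 .. 200) {
    $sent++;
    # -n 1 sends a single echo request; a non-zero exit code means no reply
    my $rc = system("ping -n 1 -w 1000 $host > NUL");
    $lost++ if $rc != 0;
    sleep 1;
}

printf "%d sent, %d lost (%.1f%%)\n", $sent, $lost, 100 * $lost / $sent;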

Do you have any anti-virus programs running and have you tried disabling them?

Is the remote server also windows server 2003?

Are you getting any redirector event log entries?

If it is a filesystem problem, it sounds like it is in the MUP but I cannot find any KB articles that sound like your problem:

From http://msdn.microsoft.com/en-us/library/ms794158.aspx:
"The multiple UNC provider (MUP) is a kernel-mode component responsible for channeling all remote file system accesses using a Universal Naming Convention (UNC) name to a network redirector (the UNC provider) that is capable of handling the remote file system requests. " ...

"MUP is not involved during an operation that creates a mapped drive letter (the "NET USE" command, for example). "

SAV 10 is installed but not currently running. (All services are currently set to disabled...)

The remote server I've been testing with is a W2k3-SP2 server. (Connected by a 40MB DS3 pipe...)

No event log entries on either side...

The MSDN article you provided is very informative, however.  I'm seeing the "BAD_NETWORK_PATH" error documented in the article.  At least it gives me a little more to go on in that I know exactly what calls are being executed when the connection fails.

Now, I just need to figure out why they are failing. (Even after working just seconds before!)





@ llman:

Are you thinking purging the MUP cache might resolve this issue?
Not sure a MUP cache refresh would do anything.  A reboot doesn't clear the problem...

To add to the mystery, at ChiefIT's suggestion, I disabled the 2nd NIC and tried my usual test. I was NOT able to reproduce the problem.  (I'm not sure whether that made me happy or more annoyed...)  :o)
Flipped the NIC back on and the problem returned...

I'm not sure exactly what's going on yet, but it definitely seems linked to the fact that this server is dual homed.  

Maybe more importantly...  

The server is actually only running SP1!  I thought it had been upgraded but I was mistaken!  (Come get your breakfast now... the eggs are over here on my face  :o)  )  So...  perhaps the MTU issue comes into play here?

Needless to say, I'm going to start by trying to get the SP2 update scheduled! I'll have to see if the problem persists after that.


If this is a terminal server, reboot twice after upgrading to SP2. There's a little-known glitch with SP2 that just requires a second reboot.
This server is scheduled for the SP2 update this afternoon. (8/18)

I'll post a follow-up as soon as I'm able to confirm whether or not that solves the problem...


Would be really interested to know if the SP2 update helped. Having a similar issue...

Yes!  The SP2 update fixed the problem.  I have not been able to repro the problem again while testing, nor have any of our production scripts failed since the update.
@rrable:

""Would be really interested to know if the sp2 update helped. Having simillar issue...""

http://support.microsoft.com/default.aspx?scid=kb;en-us;898060
As this article describes:
"This problem occurs because the code incorrectly increments the number of host routes on the computer when the code modifies the MTU size of a host route. The maximum number of host routes is controlled by the registry value in MaxIcmpHostRoutes. The default number of host routes is 10,000. Because of the incorrect increment, the number of host routes eventually reaches the maximum value. After the maximum value is reached, the ICMP packets are ignored.

Note The default number of host routes was incorrectly listed as 1,000 in the original version of this article. The change to 10,000 reflects a correction, not a code change."

Basically, the update (MS05-019) included in SP1 has this bug: the code keeps incrementing the host-route count every time it adjusts the MTU size of a host route, so the count eventually hits the MaxIcmpHostRoutes ceiling and ICMP packets start getting ignored. (The 1,000 vs. 10,000 figure was just a typo in the original KB article, not a code change.) Since this was a known issue in SP1, SP2 corrects it. Intermittent connectivity is the symptom, and it can hit just one service on the server (like DNS or DHCP).
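If anyone wants to check whether a box is creeping up on that host-route limit before patching, one rough way (just a sketch, not an official check) is to count the 255.255.255.255-mask entries in the routing table from Perl:

use strict;
use warnings;

# Rough count of host routes (entries with a 255.255.255.255 netmask)
# in the output of "route print"; per KB 898060, hitting the host-route
# ceiling on a pre-SP2 box is what makes connectivity go intermittent.
my @lines = `route print`;
my $hosts = grep { /\s255\.255\.255\.255\s/ } @lines;

print "Host routes currently in the table: $hosts\n";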

Odd as heck, I know!

I think you just made my day...