digitized75

Computer periodically stalls/freezes when accessing mapped drive

Hello Experts

I have a problem that is very puzzling and I am hoping someone will be able to help me out.  First, the description.

I have to remove an old Windows 2003 DC from my site and I am replacing it with a new Windows 2008 DC.  This is a seven-site network, all using the same domain.  The old 2003 DC held all the master/emulation roles and was also a certificate server (we don't need it; it was set up by the old admin for Exchange webmail purposes, and we use a 3rd-party cert now).  I transferred all roles to the new 2008 DC without any problems and I then uninstalled the CA services from the Windows 2003 DC.  This is where the problem began.

When the users would go to access the newly mapped drives (mapped to the new 2008 DC) their workstations would stall/freeze up.  The problem could be resolved by rebooting the new Windows 2008 DC.  After a couple of days I decided to demote the old Windows 2003 DC and make it a member server, which I did without issue.  This solved the problem of the user workstations freezing up for about 5 days.  This morning however the problem occurred again.

I have checked the event logs on the workstations and the server, but found nothing at all.  The only clue I was able to find, and on only one workstation at that, was an LSASRV 40961 SPNEGO error.  This error corresponded to the time the problem occurred, but like I said it was only on one workstation.

I have verified that the new DC holds all the roles needed, and I have made sure all traces of the CA server have been removed from AD.  I am not sure what is causing this problem.  Any ideas would be greatly appreciated.

Thanks
RDAdams

Have all the computers' IP leases expired and been renewed since the new DC was put into service?
digitized75

ASKER

No, I have the DHCP settings set for infinite, but the workstations all get their IP address from the new DC anyway.
When I see that error, the first thing I think of is a multihomed server. Do you have multiple NICs on the server??
ChiefIT:  Yes I have multiple NICs, but one of them is disabled and not connected to ethernet.

I did find an interesting clue and hopefully you guys can help me make some sense of it.  On the old DC, which has been demoted to a member server, I am seeing some Security Audit events that correspond to the times I have the issue described above.  I have attached a screenshot showing a couple of the event log entries.

I don't believe these should be showing up on what is now nothing more than a member server.  Am I correct in that thinking?

Thanks
Here is the file attachment.  I had to use Firefox to do it for some reason, as IE 8 wouldn't upload this.
eventlog.jpg
I would set the IP lease to expire in 1 hour on all machines then wait for a day or so to ensure all the machines have their lease from the new DC.  Once completed then you could set it back to infinite.  Remember that setting the lease time short will also help the DNS cache on each computer renew to the current proper settings.  This should fix your issue.
Do you have PST files stored on this file server? How busy is it in general?

The security audits aren't really interesting.

Thanks,
Brian Desmond
Active Directory MVP
Okay:

Here is what I think is going on. Though you have one nic disabled, it may have been enabled for a short period of time. So, it registered its DNS SRV records of both NICs. This causes slowness when contacting the DC for services, such as logging on and authentication.

You may want to clean up your SRV records and prevent that second NIC from registering its SRV records with DNS.

This will help guide you through verifying your SRV records:
http://support.microsoft.com/kb/816587

This will help you prevent both NICs from registering their SRV records again:
https://www.experts-exchange.com/questions/23356031/There-are-currently-no-logon-servers-available-to-service-the-logon-request.html

Also, you may want to run a DCdiag /v report to see if you can find additional information on any discrepancies that utility can pick up.
RDAdams, I am in the process of trying your solution now.  It has been almost 24 hours and no incident yet, but I am going to give it a full week before proclaiming this solved.

bdesmond, no I don't have any PST's stored on this server

ChiefIT, This was a new server and I never had the NIC connected to the network.  I just disabled it, but I am checking out the links anyway and educating myself a bit.  I will also be using them to check my SRV records, just in case.

I will update this case in a week or sooner if the problem happens again.

Thanks
Well, the problem came back this afternoon.  It is like the computer browser service on the PC's is hanging up when trying to access/explore any mapped drives.  

I have shortened the lease time, renewed the IP address and checked the SRV records.  Still having this problem.  I checked with the DCDiag tool and no problems were found.

Once again, there were no event logs generated on the server or the clients from this issue.

These are the problems that make people go crazy!!!!  Any more ideas from anyone?
OK I'm a little lost here now that we've got a browser issue too.

What exactly are the repro steps for this hang? Where does the hang occur, and what process/UI appears to hang?

Thanks,
Brian Desmond
Active Directory MVP
Sorry about that, I am adding a little confusion here.  Let me try to explain the one and only problem I am having.

Periodically (sometimes daily, sometimes after a few days) the client PCs at one of my sites will hang when trying to access networked drives or when trying to browse drives (either local or network).  This will happen suddenly and without warning.  The only thing that fixes it is to reboot the Domain Controller (Windows 2008 Standard) for that site.

This has been occurring since demoting the old Windows 2003 DC to a member/file server and adding the new Windows 2008 DC.  All roles have been transferred correctly to the new DC.

When this issue occurs, it leaves no event log traces and no services appear to be hung up.  As stated, the only thing that fixes the problem is rebooting the new 2008 DC.
ASKER CERTIFIED SOLUTION
bdesmond
This solution is only available to members.
The users' home folders are on the new DC and their mapped drives point to this new DC as well.  When this problem occurs, it doesn't just affect this server's shares though.  Sometimes it even affects the workstation's ability to browse its own local drives.
OK. Let's collect some data on the new DC.

We really need to capture Poolmon data when this is happening. There is a set of scripts that will do logging for us over time. I put a copy at http://bdesmond01.dyndns.org/poolmon3vbsperf.zip. Extract them somewhere (e.g. c:\poolmon), run _LogPool-as-a-service and leave it until this hang happens again. We'll need the resulting data (in Poolmon-Output) at that point. Also a timestamp for the hang(s) would help a lot.

Thanks,
Brian Desmond
Active Directory MVP
Thanks, I will try this out and post again after the issue occurs again.  It could be a few days, just to let you know.

Thanks again.
Good news, the problem is happening right now.  I just started poolmon and I am letting it collect data for a little bit.  I will post back shortly.  The problem started at about 5:15 PM CDT.
SOLUTION
This solution is only available to members.
ChiefIT:  I am running Windows Server 2008 SP1.  Are there any similar DNS issues with it?

bdesmond:  I have attached the poolmon output that was captured while the problem was occurring.  I didn't notice anything unusual, but maybe you can spot something.

Thanks both of you for the continued ideas.

Poolmon-OUTPUT.zip
OK something didn't work here.

If you run poolmon.exe, what happens? Can you post a screenshot?

Thanks,
Brian Desmond
Active Directory MVP
Here is what I get.
Whoops, here
screenshot.doc
OK that's the script - run poolmon.exe from that folder...

Thanks,
Brian Desmond
Active Directory MVP
Ok, that is running and I have a blue screen with white writing all over.  How is this being logged?
It isn't.

Let's try doing this without the service (just leave the session logged in). There is an additional batch file in the archive called LogPool. Run that one and let's see if we get any data. All that data you see on the screen should be getting dumped to the text files in the output folder.

Thanks,
Brian Desmond
Active Directory MVP
I don't believe 2008 SP1 had the same discrepancy as 2003 server SP1.
I am currently running these:

Poolmon.exe - I am watching the screen and looking for large differences between the DIFF and FREES fields.  If the DIFF value is higher than the FREES value, then I put a watch on it as a possible memory leak.

Wireshark running - To look for any odd protocol traffic running, specifically when the issue occurs.

I have recorded GPO application times on the servers and workstations, in the hope that I can try to find some correlation there with a possible security policy being applied.

If a conclusion isn't reached while looking for a memory leak, I am going to take you on another path. I think your OLD server may be trying to communicate with your 2008 server using NTLM hash authentication instead of compatible Kerberos authentication. You may have old credentials saved on your old member server. You might try to remap the drive with new credentials, telling it to reconnect at logon.

Before doing so, you might go into Control Panel >> Users >> Advanced >> Manage Passwords and remove all passwords managed for that mapped drive.

This is a little bit about how the different authentications work. 2008 server is NOT compatible with legacy authentication of LM or NTLMhash.

https://www.experts-exchange.com/questions/23132123/Computer-failed-to-join-or-logon-to-domain-days-later-after-reboot.html
It's not a leak simply if Diff>Free - memory has to stay allocated for things to work.

Send a screenshot of the poolmon screen and I can tell you if something looks odd...

Thanks,
Brian Desmond
Active Directory MVP
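
To illustrate the point above (an outstanding Diff on its own is not a leak; what matters is usage that keeps climbing over time), here is a minimal Python sketch. It assumes you have copied successive Poolmon readings by hand into dictionaries of tag to bytes. The tag names, the numbers, and the 1 MB growth threshold are all made up for illustration and are not from this server.

```python
def growing_tags(snapshots, min_growth_bytes=1_000_000):
    """Return (tag, growth) pairs for tags whose byte usage never shrinks
    between consecutive snapshots and grows by at least min_growth_bytes
    from the first snapshot to the last. Largest growth first."""
    if len(snapshots) < 2:
        return []
    suspects = []
    for tag in snapshots[0]:
        series = [snap.get(tag) for snap in snapshots]
        if any(value is None for value in series):
            continue  # tag is missing from at least one snapshot
        never_shrinks = all(b >= a for a, b in zip(series, series[1:]))
        growth = series[-1] - series[0]
        if never_shrinks and growth >= min_growth_bytes:
            suspects.append((tag, growth))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical readings taken some time apart (tag -> bytes in use).
samples = [
    {"MmSt": 40_000_000, "Abcd": 5_000_000, "File": 12_000_000},
    {"MmSt": 39_500_000, "Abcd": 9_000_000, "File": 12_100_000},
    {"MmSt": 41_000_000, "Abcd": 15_000_000, "File": 12_050_000},
]
print(growing_tags(samples))
```

A tag that is large but fluctuates is not flagged; only one that grows steadily across every reading is. That is only a hint for where to look, not proof of a leak.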
ChiefIT:  The mapped drives are on the XP clients and mapped to the new server.  The member server is on its way out the door once this problem is corrected and two more programs are moved.  No clients are now mapping to the OLD server.  Could your possible resolution still apply?

bdesmond:  I will post screenshots tomorrow evening when I go in to work on this after hours.


I am also going to do one other thing.  This server is an HP Proliant DL365 G5 model with a P400 SAS RAID card in it.  It also has a quad-core Opteron processor and 4GB of DDR2 memory.  I have found that I need to update my hardware drivers, since some of them are 2 1/2 years out of date.  I am going to do this as well, just in case this could be the problem.

Let me know your thoughts please.  Again, I appreciate the help from both of you.
If your clients were on an older domain, and were configured to use non-Kerberos logons, it could still apply. Legacy authentication protocols existed in 2003 server to allow the 2003 server to be backwards compatible with older computers and printers. Without telling those computers and printers "we no longer use LM or NTLMhash", some hashes may be saved on the client computer and cause the inability to log on successfully to some domain services, such as some files and printers. You could set a group policy to NOT allow LM authentication, and another group policy that prevents saving credentials on the client machine.

Most likely, the problems that exist are a result of a token or memory leak. You may operate fine for a couple of hours and then your access token may go away. This is equivalent to all of a sudden losing your password. A reboot might fix the issue for a short duration on a token leak. However, a memory leak will build until your server or client computer freezes.

To enhance your security and possibly prevent a token leak, it is best to disable LMhash and NTLMhash saved credentials on the client machines. This is what I think might be happening. After a few hours of being logged on using your Kerberos logon, you lose your security token and access to the mapped share. So, the client machine goes in and locates old credentials for that mapped drive. Those credentials don't comply with the Kerberos standard and are therefore told to bugger off. The client will continue to try to gain access to the mapped drive using the wrong credential set, and eventually the flood of attempts may be freezing or shutting down the mapped drive.

Token leaks and memory leaks are two different entities. A token leak is when you lose your security token or Kerberos ticket from the Kerberos ticket granting agent (your DC). If this happens you will usually lose the ability to use any domain services and you should see event log errors on the client and server. The memory leak, it sounds like you have a decent grasp on, and you are trying to evaluate if something is not freeing or allocating correctly. A memory leak on the server should affect the entire domain. So, I would look for a memory leak on the client.
++++++++++++++++++++++++++++++++++++++
Another plausible issue is a master browser conflict. This is pretty easy to troubleshoot and fix. To troubleshoot, simply go into the event logs of the two servers and see if there are 8032 and 8021 events that allude to something like "xxxcomputer thinks it is the master browser for the domain, the browser service is stopped and an election has been forced".

The domain master browser service is designed to provide the skeleton of your computers and file shares for the network. It populates the computers and shares in "My Network Places", in other words. Then, to access the file, your Kerberos ticket is compared to the file ACL (Access Control List) of the shares. Since I see that $ sign in front of your "Kerberos" comparison of the ACL, it appears to me like your client is trying to gain control of that mapped drive using old credentials. Since the 2008 server says "no way", I think your client is hounding the server for access.
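
The event log check described above can be sketched in Python. This is only an illustration: it assumes the System log has been exported to CSV with hypothetical column names (`EventID`, `TimeGenerated`, `Source`, `Message`), and the sample rows and timestamps are invented, not taken from these servers.

```python
import csv
import io

# Event IDs mentioned above as signs of a master browser conflict.
BROWSER_EVENT_IDS = {8021, 8032}


def find_browser_events(csv_text, event_ids=BROWSER_EVENT_IDS):
    """Return the rows of an exported event log whose EventID is in event_ids.
    Rows with a missing or non-numeric EventID are skipped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        try:
            event_id = int(row["EventID"])
        except (KeyError, ValueError, TypeError):
            continue
        if event_id in event_ids:
            hits.append(row)
    return hits


# Invented sample export, for illustration only.
sample_export = """EventID,TimeGenerated,Source,Message
7036,2009-05-01 08:00:00,Service Control Manager,A service entered the running state
8032,2009-05-01 17:14:02,BROWSER,Example browser message
8021,2009-05-01 17:15:40,BROWSER,Another example browser message
"""

for event in find_browser_events(sample_export):
    print(event["TimeGenerated"], event["EventID"], event["Source"])
```

Comparing the timestamps of any hits against the times the clients froze would show whether the two line up.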
It just hit me like a brick wall what you're alluding to.  I will check this out when I go in tomorrow evening.  Excellent explanation by the way!  I will post again tomorrow regarding this.

Thanks
ChiefIT:  Here is what I did to test the token leak theory out.

Changed GPO's for all client workstations.

Network Security:  Do Not Store LAN Manager hash value on next password change - Enabled (Was set for Not Defined)
Network Security:  LAN Manager authentication level - Not Defined  (Was set for Send LM and NTLM Response)

I believe this was the proper way to do this.
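
For reference, the "LAN Manager authentication level" policy corresponds to the `LmCompatibilityLevel` registry value under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa`, with levels 0 to 5. The Python sketch below is just a lookup table for reading those numbers; the descriptions follow the policy's option names, and the helper function names are invented for this example. It does not read the registry, and "Not Defined" is represented here as `None`, meaning the GPO does not specify a level.

```python
# LmCompatibilityLevel values and the policy option each one corresponds to.
LM_COMPATIBILITY_LEVELS = {
    0: "Send LM & NTLM responses",
    1: "Send LM & NTLM - use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only. Refuse LM",
    5: "Send NTLMv2 response only. Refuse LM & NTLM",
}


def describe_lm_level(level):
    """Return the policy option name for a level, or a note for Not Defined."""
    if level is None:
        return "Not Defined (the policy does not specify a level)"
    if level not in LM_COMPATIBILITY_LEVELS:
        raise ValueError(f"unknown LmCompatibilityLevel: {level!r}")
    return LM_COMPATIBILITY_LEVELS[level]


def client_sends_lm(level):
    """True when a client configured at this level still sends LM responses."""
    return level in (0, 1)


print(describe_lm_level(0))   # the previous setting described above
print(client_sends_lm(0), client_sends_lm(3))
```

Note that setting the policy back to Not Defined removes the GPO's opinion rather than choosing a stricter level, so it is worth confirming on a client which value is actually in effect.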
Well I am still getting workstation names with the $ at the end on the old server in the security logs.  I wonder if what I did hasn't addressed the token leak issue.
bdesmond:

I have attached the screenshots of the poolmon.exe as requested.

Thanks.
poolmon-screenshots.zip
Can you press B to get it sorted by Bytes descending. Use P to cycle between Paged and NonPaged. It'd be helpful if you could get the logging script going.
The logging script keeps crashing on this server, unless I am doing something wrong.
Including LogPool.bat?
I have attached a screenshot of what I have running now.

Here is the attachment.
logging.doc
OK do me a favor and post the Poolmon-Output folder and let's see if it's got useful data...
Here is the output.

Rename the APP_HandleCount.txt to .csv.  I had to change it for posting purposes.
PoolmonOutput1.zip
OK output is still FUBARed. We'll do this a different way.

Please add this registry setting - http://support.microsoft.com/kb/244139 and reboot. Next time the box is hung/not accepting requests, hold down the /right/ Control key and press Scroll Lock twice. Your machine should generate a blue screen and write a dump out. If you could zip and attach the resulting file that would be a good start.

Note that occasionally KVMs eat the Scroll Lock key and don't pass it on to the system. Also if this doesn't work, hit caps lock a couple times - does the LED change state on the keyboard?

Thanks,
Brian Desmond
Active Directory MVP
bdesmond:  I wanted to make sure that I clarified this first.  The server actually never hangs and will serve requests while the issue is occurring.  It is the clients that hang and some clients cannot access their mapped drives while others can.  Eventually the issue begins to affect everybody.  So I am not sure if I will catch anything from the server or not.
You bounce the server though to resolve?

Thanks,
Brian Desmond
Active Directory MVP
Yea, I do bounce the server to resolve the situation, but still the server never hangs, never generates event logs explaining why this happened, and generally doesn't show any hint of a problem.  
Yep so next time instead of the bounce, do the Ctrl+Scroll Scroll thing.

Thanks,
Brian Desmond
Active Directory MVP
Just wanted to give an update regarding this case.  I implemented ChiefIT's idea Sunday night and I also updated the hardware drivers for this server.  I have not yet had the issue happen again, but it hasn't been that long.  I did notice one thing weird about an hour ago and that was if I hit the ALT-CTRL-DEL key combo to lock my machine, it would take a long time to display anything on the screen after clicking LOCK COMPUTER.  When I would hit ALT-CTRL-DEL again to unlock the workstation, it would display a blue screen for awhile before displaying the desktop icons.  I would estimate the waiting period for either operation was about 30 seconds - 1 minute.  The problem has since cleared up.  I am not sure if this is related or not.
Bad news...... the issue has happened again.  I am beginning to lean towards a memory leak for two reasons.

1.  It went 5 days before the issue happened
2.  My monitor was blank (no signal) and was fixed after the reboot.

I was unable to make sure the dump file worked, as I couldn't verify if it blue screened and I didn't see the dump file in %systemroot%.  I cannot seem to use poolmon because Windows 2008 server seems to mess it up.

This is so frustrating.....  any other ideas guys?

Thanks so much for your suggestions thus far, they are much appreciated.
What kind of keyboard (USB/PS2) and do you have SP2 for 2008 loaded?

You would have seen the bluescreen. Is this an HP server by chance?

Thanks,
Brian Desmond
Active Directory MVP
Any event log errors in the 1000's, like 1020 or 1009?
bdesmond:  This is connected to a KVM using the PS/2 port.  Also, this is an HP Proliant DL365 G5 server and no, SP2 for Windows 2008 isn't installed.  I believe it was just released to RTM a few days ago.

ChiefIT:  I cannot find any Event ID's matching those you gave.
OK so these steps will work more reliably in lieu of the KVM http://briandesmond.com/blog/forcing-a-blue-screen-via-ilo-ilo2-version/

Thanks,
Brian Desmond
Active Directory MVP
OK, I have everything ready to go and I will generate the crash the next time this issue happens.

Thanks
I wanted to add another symptom that occurs when the issue I am dealing with happens.  This server also acts as a squid proxy server for internet usage.  When I first opened this thread, it appeared that network shares on this server became inaccessible and that was the only symptom.  Turns out, while continuing to troubleshoot this, that I have run across a couple of others.

1.  The internet will quit working for the end users (since this server also runs squid).
2.  Computers at different remote sites within our organization will not be able to access this server's shares.  (I originally thought this was limited to only the workstations on the local site).  This is a multi-site domain (only one domain).
3.  Workstations can browse other server shares without problem, just not the shares of the server having the problem.
4.  This will sometimes cause the entire workstation to freeze up and it struggles to find the share that was trying to be accessed.
5.  I can still release and renew the IP address from the server while this issue is being experienced.

It doesn't appear that all network services are being affected, but browsing the network shares of this server or trying to use the internet, which relies on the squid service running on this server, are the most noticeable problems.  I currently have perfmon running to watch for trends (Windows 2008 server has a much improved interface for this) and I am also using task manager to watch for private working set memory usage.

What I don't understand is this.... why I sometimes have to reboot this server once a day, but then I can also go for 5 days before a reboot.  This is very random and shows very little sign of a pattern to follow.

Thanks again.
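
Because the failure leaves no event log entries and comes at irregular intervals, one way to pin down when it starts is to probe the affected services from another machine and log the results with timestamps. The Python sketch below is a rough illustration under several assumptions: port 445 is used for SMB file sharing, 3128 is squid's default port (the thread does not say which port this squid instance uses), and the function names are invented for this example. A successful TCP connect only shows that the port accepts connections, not that shares are actually being served, so treat a run of failures or slow connects as a hint, not a diagnosis.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection. Return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        ok = False
    return ok, time.monotonic() - start


def format_result(timestamp, host, port, ok, elapsed):
    """Build one log line, e.g. '2009-05-01T17:15:00 srv:445 FAIL 5.000s'."""
    status = "OK" if ok else "FAIL"
    return f"{timestamp} {host}:{port} {status} {elapsed:.3f}s"


def log_probes(host, ports, log_path, timeout=5.0):
    """Probe each port once and append the results to log_path."""
    with open(log_path, "a", encoding="utf-8") as log:
        for port in ports:
            ok, elapsed = probe(host, port, timeout)
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(format_result(stamp, host, port, ok, elapsed) + "\n")
```

Calling `log_probes("yourserver", [445, 3128], "probe.log")` once a minute from a scheduled task (the server name is a placeholder) would give a timeline to compare against perfmon data and the times users report freezes.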
Could be a number of things. Your workstations freezing in this scenario is likely because they have home drives on this box.

Thanks,
Brian Desmond
Active Directory MVP
Alright, this problem is as good as solved!  Here is what the problem was...

The culprit was Symantec Antivirus 10.2.  I have included a link that pointed me in this direction.

http://www.petri.co.il/forums/archive/index.php/t-25791.html

I uninstalled Symantec and am running Vipre (Sunbelt Software) and I have not had any issues now for 8 days.

Thanks for the time, ideas, and knowledge Chief and bdesmond.  I learned quite a few things from the input you both provided.  I will split the points in half for your time.
The solution ended up being a program that was installed.  I am awarding points based on the time, effort, and good ideas put forth.  Their input really helped narrow the potential problems down.