jcfrietman


Exchange crashing with errors 9056, 9057, 9188, 8260.

Hello All

Once the Exchange server comes back up after a reboot, it stays stable for about 30 minutes and then it starts crashing.  The event log reports the following:

Event ID 9188
System Attendant failed to read the membership of group...
Event ID 9144
Clients will not be directed to this GC
Event ID 8260
Cannot access Address List
Event ID 9057
NSPI Proxy cannot contact any GC
Event ID 9056
NSPI Proxy listener thread
Event ID 7031
The SA terminated unexpectedly.

The DC is actually on another server.  The FSMO roles are also on this server, and another Exchange server can successfully authenticate with this server.

Server is currently down.
JasonBigham

when did this start? sounds like it's trying to contact a GC that isn't available. Anything changed in your environment? Added or removed a GC? then of course, if the first error is right, then all you'd need to do is add that computer object to the Exchange Domain servers group, per Jason's suggestion.

could you post the entire 9188 and 9144 events?

D
jcfrietman

ASKER

Thanks for this.  I did actually look at this article, but I think the problem goes a bit deeper.

The server is fully operational for about 20 minutes after the services have been restarted.  Then they start failing with the event IDs listed above.

At the moment SA has crashed but the Info Store is still running (somehow).  When we try to get into Users and Comp we get "Naming Information cannot be located because:
The server is not operational"

The FSMO/DC is running fine on another server.

Many Thanks

Juan Carlos
You still haven't posted the events.....

d
How many DCs and GCs do you have? When did this start?

David
One more...is something taking up a lot of processor, or does this machine seem very sluggish and slow?

d
I know sorry just posting them now:

9144
NSPI Proxy failed to connect to Global Catalog server.domain.com over Tcp/Ip. This server is down or unreachable. Clients will not be directed to this GC until it is available again.
Solution
The global catalog server which the Exchange server is trying to contact is offline. Bring the server back online and Exchange will then service the clients.

9188
Microsoft Exchange System Attendant failed to read the membership of group 'cn=Exchange Domain Servers,cn=Users,dc=your domain'. Error code '80072030'.

Please check whether the local computer is a member of the group. If it is not, stop all the Microsoft Exchange services, add the local computer into the group manually and restart all the services.
Solution
From a newsgroup post: "I had this same problem and resolved it two different ways. The first way was when I made the E2K server a DC the error would go away. However, once I demoted it back to a member the error came back. Fortunately, I found help from someone in the newsgroups. They suggested I should go to AD Users and Comps and remove the server in question from the EDS group and restart the system. Once I removed it, the server added it back, by realizing it was an exchange server. This then solved my problem. I have not seen the error for nearly a month since doing this".


________________________________________________

Nothing has changed in the environment.  We have two GCs at this site and two Exchange servers.  The first E2K is working fine and connecting with no problem to the FSMO and GC.  The second E2K decides to die after 30 minutes.

Thanks for all your help
Hello Again

Yes we do have them and we were wondering about that.  The process is part of the storage manager and is needed in order to access the info store on the SAN.

Yes, the system is currently running at 76 %, even though System Attendant is not.  

I know the CPU is being abused but will this actually kill all links to the GC and kill the SA ?

thanks (again)

It's possible, especially since the reboot clears all processes, and the Exchange server comes up normally. If it couldn't contact a GC on startup, nothing would start. You seem to be failing later on, as the CPU usage begins to increase dramatically. Does the other Exchange server experience this kind of usage by the storage manager? If it doesn't, then there may be something wrong with that app, or even your SAN (God forbid!)
Let me know...
D
Hello

Our plan as per your suggestion:

Stop all Exchange services and reboot the server.  Once it is back up, do not restart Exchange.
Monitor the server and monitor GC access.
We will monitor the processes and see if GC dies.  If GC does not die but CPU is still very high then it is a problem with that process.
The problem with that process is that it is essential to the path to the Exchange databases.  We do not want to upgrade to Volume Manager 3.0 because if all else fails we will not be able to find the Exchange databases.
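
The "monitor GC access" step of the plan above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: it opens a TCP connection to the Global Catalog port (3268 is the standard GC LDAP port) once a minute and logs the result with a timestamp, so the moment the GC becomes unreachable can be lined up against the CPU readings. The host name `server.domain.com` is the placeholder from the 9144 event text, and the function names, interval and round count are assumptions. A successful TCP connect only shows that the port is accepting connections; it does not prove that LDAP or NSPI requests are being answered.

```python
import socket
import time
from datetime import datetime

GC_PORT = 3268  # standard Global Catalog LDAP port


def probe_tcp(host, port=GC_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def monitor(host, port=GC_PORT, interval=60.0, rounds=30):
    """Probe host:port every `interval` seconds; print and return the log lines."""
    lines = []
    for i in range(rounds):
        state = "reachable" if probe_tcp(host, port) else "UNREACHABLE"
        stamp = datetime.now().isoformat(timespec="seconds")
        line = f"{stamp} {host}:{port} {state}"
        print(line)
        lines.append(line)
        if i < rounds - 1:
            time.sleep(interval)
    return lines


if __name__ == "__main__":
    # Hypothetical host name; substitute the real GC for the site.
    monitor("server.domain.com")
```

With the defaults this covers roughly the 25 to 30 minute window in which the server has been failing.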

Another thing we are doing is building another Exchange box with a different name.  I would like to move the info store to this box and try to bring it online.  Will this have any implications for AD? For example, Joe Bloggs belongs to Server A in AD but is now on Server B, and AD has of course not been updated.

Many thanks

Hello

Restarted the server, waited about 25 minutes and then the server failed once again (Exchange was not started).

The only error I have got so far in the Application Log is as follows:

Event ID   2104
Topology
Process WINMGMT.EXE (PID 1576). All the DS servers in the domain are not responding

any ideas please?
Might be time to bust this out as well;

http://support.microsoft.com/?id=321708
"Another thing we are doing, is building another exchange box with a different name.  I would like to move the info store to this box and try and bring it up on line.  Will this have any implications with the AD, since for example Joe Bloggs belongs to Server A on the AD but now he is on Server B, but AD has of course not updated."

Bad idea. No one will be able to contact that server, since the user objects are looking at server A. Second, you'll then have a server orphaned in your AD that owns all the mailboxes. If you delete it from AD, your AD will go haywire. Sit tight, this obviously is not an Exchange issue right now; don't compound your misery.

D

BTW, did you pinpoint the name of the service and the exe file that's eating your processor? If so, post here please.

D
Hello Guys

Thanks for all your help so far.

The system log reported the following :
Event ID 5783
The session setup to the Windows NT or Windows 2000 Domain Controller  <server name 2> for the domain  <domain name 2> is not responsive. The current RPC call from Netlogon on <server name 1> to <server name 2> has been cancelled.

One of the solutions was to apply SP3.  (We are currently running SP2 for Windows and Exchange.)  I agree with David, this is not an Exchange issue but rather a W2K issue.  Can you guys still help?

The actual file that is taking most resources is VxSvc.exe, but I think this might be a dead end.  I am just about to say my prayers and install SP3.  Do we need to install any additional patches?

thanks
Are you not allowed to go to later SP's? If not, sounds like it would be a good option to consider... considering your situation.

I use all the latest, no probs at all... nothing to fear.
I'm starting to think that this machine is not patched against the RPC bug. Would you please check to see that the hotfixes 823980 and 824124 have been applied to this machine? Something on your network is pounding this machine flat....

D
Well, you didn't want to read this, but...

http://forums.veritas.com/discussions/thread.jspa?threadID=3890&tstart=60

Your SAN is giving you problems, or at least the service is. I've seen several threads like this one, but no solution yet. you may want to give Dell a call SOON!!

D
Yea, I LOVE how none of these guys got a solid answer, don't you? there's a reason that Dell is at the bottom of the food chain when it comes to enterprise class servers.

D
BINGO!!!

"We had a big problem with that here. Traced it down to the Volume Capacity Monitoring "feature" of Array Manager. It was slowing down machines, and causing crashes. There is a Dell utility with "Array manager Utilities" that will allow you to toggle the feature. What we found was to load the utility, connect to the box in question, check the feature on, apply, check the feature off, apply.

That should help greatly, it's fixed most of our woes, we're working on mass disabling it."

Stole that from the Dell site...

D
Hello

We have loaded the following Hot Fixes:

KB823980 and KB 824146.  It is something on the machine that is killing it.

Are these fixes OK?  

If SP3 does not work what do you guys think?  go home and cry?

cheers
those are the mandatory RPC hotfixes, to block the blaster virus and its variants. you're fine on those, every server and desktop in your company should have those 2 fixes.
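
A check like "does this machine have both fixes?" can be sketched as a simple set comparison. This is a minimal illustration, not part of the original reply: the two KB numbers are the ones confirmed above, while the function names and the way the installed list is obtained are assumptions. The normalisation step is there because the numbers get written both as `KB823980` and as `KB 824146` in this thread.

```python
REQUIRED_HOTFIXES = {"KB823980", "KB824146"}  # the two RPC fixes named above


def normalize(kb):
    """Upper-case a KB identifier and drop spaces, e.g. 'kb 824146' -> 'KB824146'."""
    return kb.upper().replace(" ", "")


def missing_hotfixes(installed, required=REQUIRED_HOTFIXES):
    """Return the required KB identifiers not present in `installed`, sorted."""
    have = {normalize(kb) for kb in installed}
    need = {normalize(kb) for kb in required}
    return sorted(need - have)


if __name__ == "__main__":
    # Hypothetical inventory for one machine, typed in by hand.
    print(missing_hotfixes(["KB823980", "KB 824146"]))
```

An empty list means both fixes are present on that machine.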

did you read my other posts?

d
I have been reading them and even though I agree that we might have a problem with the actual exe itself, I do not think we have a problem with the SAN.

I have been monitoring the server and it is currently running at around 66% to 75 %, so no memory leaks here.  (I know we cannot see the DC either!).

I have forwarded this to my colleague Cliff, who will investigate as well, and pass on his comments.  (He might respond using my alias.)

Yes, after spending endless nights in front of monitors, we managed to patch all servers and desktops, so I think we are clear of the virus! (I hope)

thanks
you're right, it's the file itself. The 66-75% is usable I suppose, unless there's more than one processor, in which case it's actually using up the 100%. Fact is, it's keeping you from seeing the GC/DC, and since all that operates on LDAP, this exe must be taking a very high priority.
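
The processor arithmetic behind that remark, as a rough sketch. It assumes the 66-75% reading is the machine-wide figure, which Task Manager averages across all processors; the processor count of this server is not stated in the thread, so the counts used below are purely illustrative and the function name is made up.

```python
def busy_processors(overall_percent, processor_count):
    """Convert a machine-wide CPU percentage into 'processors fully busy'."""
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    if not 0 <= overall_percent <= 100:
        raise ValueError("overall_percent must be between 0 and 100")
    return overall_percent / 100.0 * processor_count


if __name__ == "__main__":
    # Illustrative processor counts only; the real count is not given.
    for count in (1, 2, 4):
        print(count, "processor(s):", busy_processors(75, count))
```

On a two-processor box, for example, a 75% overall reading corresponds to 1.5 processors' worth of work, which is enough for one process to have a whole processor pinned at 100%.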

D
What model server is this?

D
Cheers for your comments.

The server is a Stratus Server.  I have also noticed that the NMS Service (nmssvc.exe) is not running on the faulty server even though it is running on the working server.

http://www.answersthatwork.com/Tasklist_pages/tasklist_n.htm

I have looked it up and it is an Intel driver service that liaises with the Simple Network Management Protocol.  Not sure if this is essential to the overall running of the system, but I will check this after the reboot.  It is set to Automatic, but it is currently stopped.  I will only try this after I have installed SP3 (currently backing up files).


You mentioned two hotfixes beforehand, we have installed these two:  

KB823980 and KB 824146   (you mentioned 824124).  Are we still running the correct ones?
Thanks

JC
ASKER CERTIFIED SOLUTION
David Wilhoit
Hello  again and good morning!

Well, one of the NICs failed on the server after the reboot with SP3.  Corrected the error and rebooted once again.

The server has been up and running ever since, but of course we are monitoring it.  Many thanks for all your help, much appreciated.

We are still going to move the mailboxes to a more stable server, but we will use the wizard instead.

thanks

Juan Carlos
just build another server in the AG, and move mailboxes....you'll be better off....

D