EBIZ-Mark

Lotus Notes Domino Server - Service keeps halting

Hi,

Our Lotus Notes Domino service keeps halting on our Win 2003 Server.

Win32_Exit_Code is 1067 (0x42B)

Notes server is v 7.0.2

My initial question: what information do I need to gather to troubleshoot this, and where will I find it?

Thanks
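As a starting point, the exit code itself can be decoded. A minimal sketch, assuming only the standard Win32 meaning of code 1067 (ERROR_PROCESS_ABORTED); the function name and the one-entry lookup table are illustrative, not part of any Domino or Windows tool:

```python
# Windows reports the service exit code in decimal; the Win32 headers list it in hex.
# This tiny table is illustrative and only contains the code from this thread.
WIN32_ERRORS = {
    1067: "ERROR_PROCESS_ABORTED (the process terminated unexpectedly)",
}

def describe_exit_code(code):
    """Return the code in decimal and hex, with its Win32 name if known."""
    name = WIN32_ERRORS.get(code, "not in this small table")
    return f"{code} (0x{code:X}): {name}"

print(describe_exit_code(1067))
```

The code only says that the service process died; it does not say which Domino task caused it, so the NSD and console logs are still needed.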
EBIZ-Mark

ASKER

Found the following log files.
core-nSMTP-W32I-WTR01-2012-11-14.zip
Sjef Bosman
Steps to execute:
1. upgrade to R8.5.3
2. upgrade!
3. ... ;-)
:
:
7. open the notes.ini file of your server and comment out the line with ServerTasks, then restart the server; gradually, using the Admin console, reactivate all tasks until the server crashes; the last task you started probably killed the server

Hopefully, the Admin console will give you more information than that, e.g. which database was being used, and maybe even why it stopped. There may be a corrupt database, a mail db or a mail.box db, or it may be a damaged DLL. Once you have found it, you can use Compact or Fixup to try to repair that db.
Most likely it's the SMTP process. If it is, repair or remove the mail.box database(s). If you intend to remove it, stop the Router task first, then rename that db and restart the Router task. The Router will create a new mail.box. Check what mails are in the renamed database, these can be moved to the newly created mail.box database.
Thanks.

If I read between the lines, I take it you are very subtly suggesting that I upgrade :-)

This needs to be on our list, and we are seriously considering it.

I like the idea of the trial and error on the ServerTasks, however, I haven't got a fixed time that the server stays up for before it crashes, so this might take longer than it sounds.

We've also got a Backup Exec Agent faulting at the same time. so I thought they might be linked.

The server has been up for 3 hours, but has just gone down again. So I'll try the mail.box suggestion above.
Use the Lotus Notes Diagnostic (LND) tool from IBM to read the nsd*.log file (it's free - just google it.) That tool can open the NSD, and may help you find what caused the crashes. It will give keywords to search at ibm.com, and it may list the databases that were open when it crashed.
Me?? Suggesting anything? Noooooooh, wouldn't dare...  the idea... really...
OK. Renamed the Mail.box and let the system create a fresh one.

The one I renamed was 1.3GB in size !!  Nothing deliverable left in it, so didn't need to copy anything over.

Can I hope that this was the cause ?   I'll also look at the LND that has been suggested.

I've still got the Backup Exec Agent that is failing, which has stopped us doing a Tape backup of Notes the past two nights. I'll have to look at this, and post another thread for help if I need to.

Back to the Notes problem. We'll monitor it and see how stable it is now.

The number of undeliverables is an issue I need to sort too.   Is there a setting I can change that stops the system from leaving these for me to delete ?  i.e. If the mail can't be delivered, delete after 5 days or something ?

Thanks
For the undeliverables, tell Notes to purge what is left in mail.box with a Replication setting:

Open mail.box > click the File menu > Replication > Settings > Space Savers > check/enable the "Remove documents not modified" setting > change the number next to it to any number from 1 to 5 (yes, you can use "1" safely) > click OK

That will automatically purge documents that sit in mail.box for 1-5 days. Since only delivery failures sit in mail.box, this setting will not touch processed email.
1.3Gig ?? That's a lot. You should know that regular inspection of mail.box is very important for the health of the system. Once a week or so should do (but if you adopt Thomas's idea you won't be able to see what's clogging up the mailbox).

Next issue: if you have 1 mail.box database, and you have more than (say) 20 users, it can help to increase the number of mailboxes. See your Admin Help database about this. It's in the Configuration document for the server, the Router-SMTP/Basics tab. You have to restart the Router for it to take effect.
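The regular mail.box inspection mentioned above could be partly automated with a size check. A rough sketch only: the directory you pass in, the size limit, and the assumption that additional mailboxes follow a mail*.box naming pattern are all illustrative and should be checked against your own server.

```python
import os

def oversized_mailboxes(data_dir, limit_bytes):
    """List (name, size) for mail*.box files in data_dir larger than limit_bytes.

    Matching is case-insensitive because Windows file names are
    (the thread mentions both "mail.box" and "Mail.box").
    """
    hits = []
    for name in os.listdir(data_dir):
        lower = name.lower()
        if lower.startswith("mail") and lower.endswith(".box"):
            size = os.path.getsize(os.path.join(data_dir, name))
            if size > limit_bytes:
                hits.append((name, size))
    # Largest first, so the worst offender is at the top of the report.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Example call (hypothetical path and a 500 MB limit):
# oversized_mailboxes(r"D:\Lotus\Domino\Data", 500 * 1024 * 1024)
```

This only flags growth; it does not replace opening the mailbox to see which messages are stuck.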
OK Thanks. I'll look at the secondary mailbox idea and the auto purge idea.

Unfortunately, I'm aware of the lack of day to day server admin that's being done, and although I'd love to, I'm not going into a rant on here as to why this is an issue !

So, Notes went down again last night.   I've looked at the LND tool with a couple of NSD log files, but I'm not sure what it's telling me.   Obviously I see the lines like
"Invalid stack frame detected: Unable to read process memory for frame", but I don't know how to narrow it down.

Notes appears to be staying up for about 3 to 4 hours between crashes.

Anything else I could share with you to allow you to advise me ?  

Many Thanks

PS:  Each time the server has to be restarted, everyone's Inbox and (some) Sent messages turn 'unread' for messages dated in the last 2-3 days. Is this normal, and if so, why ?
As much as I would love to read the rant, I'm not going to push you... ;-)

Domino down again, during the night? Is it always during the night, and around the same time? Or does it also crash during the day, every 3-4 hours (as you say)?

Disable some tasks from the ServerTasks line in notes.ini, the ones that you can do without. Can you post the line here, please?

By the way, is this happening all of a sudden? What changed recently, can you remember??
Really busy at the moment but Sjef pointed me at this Q too.  I will look back this evening or tomorrow if you don't get anywhere in between.

Any issues in the Windows event logs etc.?

Can you disable the anti-virus on there in case it is doing a full scan at that time of night and locking files that are needed?  Do you have backups running at those times?  Volume shadow copy turned on?

Whereabouts are you in the world?  Maybe it would be worth getting someone to give the server a once-over?

Steve
Thanks

I don't think anything has changed.  Symantec would have done a Liveupdate as it regularly does, but this wasn't immediately before the problem.

Looking back at our logs. The service did stop once before in March this year, but only once and then not again until 15:08 on 13th Nov.
Since then the service has stopped at:
15:23 13th Nov
15:53
16:53
05:53 14th Nov ( after being restarted at 23:30 on 13th)
12:53
16:06
23:06

Restarted at 6:30 (ish) this morning.

Partial Notes.ini :-
 
ServerTasks=ntask,npas,Update,Replica,Router,AMgr,AdminP,CalConn,Sched,RnRMgr,Http
ServerTasksAt1=Catalog,Design
ServerTasksAt2=UpdAll
ServerTasksAt3=Object Info -Full
ServerTasksAt5=Statlog

I take it any changes I make to this will only go live when the service is next started?  So, I should wait until the next time it goes down ?  
Disable = temporarily delete ?
If you have symantec AV on there make sure you exclude the domino data directory IMO.
ASKER CERTIFIED SOLUTION
Sjef Bosman
Could well be Sjef... will look more when I can, still in the inbox.
Thanks for this.  Server has continued to stop.  Ran yesterday afternoon without the HTTP, but it still 'terminated unexpectedly'. Chose HTTP as it is something we use, and is used by the department/site that runs 24 hours.

Good to know that I can stop/start tasks in Admin to test.

The repetition at similar times continues...   Since I last listed them, they have occurred at:

13:06:46 15th Nov
20:06:50
03:06:56 16th Nov
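The pattern in the stop times listed so far in this thread becomes easier to see when they are grouped by minute-of-hour. A minimal sketch using only the times quoted in the posts (seconds dropped); the function name is illustrative:

```python
from collections import Counter

# Stop times exactly as quoted in the thread, 13th to 16th Nov (24-hour clock).
crash_times = [
    "15:08", "15:23", "15:53", "16:53", "05:53", "12:53",
    "16:06", "23:06", "13:06", "20:06", "03:06",
]

def minute_histogram(times):
    """Count how often each minute-of-hour appears in a list of HH:MM strings."""
    return Counter(int(t.split(":")[1]) for t in times)

hist = minute_histogram(crash_times)
for minute, count in hist.most_common():
    print(f":{minute:02d} past the hour -> {count} stop(s)")
```

Nine of the eleven stops fall at either :53 or :06 past the hour, which points towards something running on a clock-based schedule rather than a random fault.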

Bear in mind that overnight, the gap between downtimes does not equal up time.  I check the system at about 11pm and if it's down, I put it back up.  I then do the same at about 6:30am.  

I'm tidying up the mail area, by archiving off all mailboxes for people no longer in the Domino Directory ( ex-employees ).

I followed a fix for Backup Exec and thought I'd solved that issue, but that service halted again last night too. At least I was able to get one more backup to tape before that happened again.

We do have some very large mailboxes, and that is something that we are trying to deal with, but although far from ideal, they are not mailboxes that would be being used into the evening and overnight.

I'll go through any logs I'm aware of and see if there is any commonality between them at the times fault happens.

Bearing in mind the suggestion that it may be something that is scheduled - is there an obvious place to look for this ?
Also, ntask and npas.  I'm sure I saw these reported with a 'file not found' error in one of the logs. Should these be loadable at the console in the same way as router, sched, replica etc ?
Depending upon what you call large, 8GB is quite common at some places, and the databases should be able to handle getting on for 64GB, though I don't normally let one go over about 30GB if possible.
If it is scheduled you could stop the agent manager (amgr) task or look at what is scheduled to run with TELL AMGR SCHED from the console.

Steve
Hi,

Nothing but 'Out of office' scheduled. All for a time that doesn't correspond with the problems. Can't see anything in the logs that gives any clues.

Largest mail file 17GB,  another 10 over 10GB,  20 or so more over 1GB, another 40 or so under 1GB.

Running on a Windows 2003 server - 32 bit. 4GB RAM.  It is really slow to open at the client, so I've been blaming it on the mail sizes.

Went down again at 10:06am !!

Thanks
Over the weekend, each time the server failed, I put it back up with different ServerTasks running (and not running).   I'm sure I've proved that the server service stops whichever one of the 11 tasks you leave out.

Managed to log this with IBM this morning too.  Although they may well come back and suggest I need to upgrade to get any support from them ( I'll wait and see)

Can't see anything on the server that would explain the :06 minutes repetition.  There are a couple of systems that send out emails via SMTP.  Scheduled Macros in Access, and an event monitoring system that can email as certain events occur.

I'm going to see if any of these have been running or attempting to run at :06 past the hour recently.

Not sure what else to be trying now.

Thanks for your ongoing help
Did you also try leaving many tasks out of the ServerTasks line? What's the actual status, what tasks are enabled?

How many people use the server over the weekend, or during the night? If virtually nobody, you could plan to start the server with only a few tasks in ServerTasks enabled.

Anyway, if I remember correctly you can install R7 over the current installation. It will replace faulty templates and DLLs, but that's all. Just install from the media, but don't run setup of course.
Sjef,

What I did over the weekend was edit the line in the notes.ini file so that at some point between Friday afternoon and Monday morning, the Notes server was started with each of these tasks NOT in the ini file.

e.g.
Original line in ini file:

ServerTasks=ntask,npas,Update,Replica,Router,AMgr,AdminP,CalConn,Sched,RnRMgr,Http

I changed this to say,

ServerTasks=Update,Router,AdminP,CalConn,Sched,Http

and restarted the server. Then a few hours later it failed, so I re-edited the same line
to say,

ServerTasks=ntask,npas,Router,AMgr,AdminP,CalConn,Sched,RnRMgr

and restarted
etc. etc.

Until finally the only one I hadn't checked was Router, and so I did that late yesterday and it went down late last night.

Changed everything back to normal and it obviously went down again overnight.

Has this done the check I thought it would, or have I misunderstood what that line does ?
No, no misunderstandings at all. Your method seems to do the same as what I had suggested, yet differently. I prefer to remove all tasks, and see what happens then. If it crashes, good, for then it's one of the really basic tasks of Domino. If it doesn't, I'd add some tasks to the line. It's similar, but not the same.
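The two test strategies discussed here can be written out side by side. A sketch under stated assumptions: the ServerTasks line is the one posted in this thread, the helper names are made up for illustration, and the generated console commands assume each task can be started with the standard `load <task>` console command, which should be verified per task.

```python
ORIGINAL = "ServerTasks=ntask,npas,Update,Replica,Router,AMgr,AdminP,CalConn,Sched,RnRMgr,Http"

def parse_server_tasks(line):
    """Split a ServerTasks line into its task names, keeping their order."""
    key, sep, value = line.partition("=")
    if not sep or key.strip().lower() != "servertasks":
        raise ValueError("not a ServerTasks line")
    return [t.strip() for t in value.split(",") if t.strip()]

def leave_one_out(tasks):
    """Remove-and-restart approach: one ServerTasks line per omitted task."""
    return {
        skipped: "ServerTasks=" + ",".join(t for t in tasks if t != skipped)
        for skipped in tasks
    }

def add_one_at_a_time(tasks):
    """Start-bare approach: console commands to load the tasks one by one."""
    return ["load " + t for t in tasks]

tasks = parse_server_tasks(ORIGINAL)
for skipped, line in leave_one_out(tasks).items():
    print(f"without {skipped}: {line}")
for command in add_one_at_a_time(tasks):
    print(command)
```

The first approach needs a server restart per variant; the second starts from a bare server and points at the task that was loaded last before the crash.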

The console.log file you attached only talks about one executable that crashed: belnapi.exe. So I'd test again, with ntask and npas both removed (the Sighmantec stuff ;). Also make very sure that the Domino data tree isn't backed up! What you could verify is that the releases of Domino and Symantec are compatible.

By the way, are there any lines in notes.ini with EXTMGR_ADDINS ?
Or maybe NSFHooks and other lines in there too?  Perhaps we could see notes.ini file?

If you run it with the Backup Exec remote agents stopped, does it still crash?  Sorry if you said you've already done this. I'd try it with no anti-virus, no Backup Exec services running, just Domino, and see.

So remove the tasks Sjef suggests, stop the Backup Exec and anti-virus services and start with that.

Steve
Hi and thanks.

Yesterday, the only thing I actually got around to doing was looking at systems that may use SMTP to send emails through the Notes server.  We have a PC here that runs an Access Macro and depending on the status of a certain database, it then sends an email to some individuals.
It appeared to have an Access window hanging in the background, although other tasks were running OK.  I cleared the window and made a note to check it again later.
Anyway, since yesterday Notes has stayed up !!   That's now 26 hours, whereas in the past week it has been going down every 3 to 4 hours.
I will try to investigate this a bit more later.

IBM have had all the log files and info I could give them, and their official response is that 'this is a known problem with 7.0.2', and the solution is to patch it ( with a patch they suggested was no longer available) or to upgrade to a later version.
This seems strange that we have been sat at 7.0.2 for 5 years, and then all of a sudden start getting a crash so regularly.

As for Backup Exec., we are supposed to have the Domino Agent for this, so that the mail files can be backed up as an 'open file'  (I believe). Although we have also got the Open Files agent too.

I attach our notes.ini file.

Strange it's staying up now. You can tell that everyone else here thinks the problem is fixed; they've all started hitting me with their day to day problems again !

I'll update again, when I've checked out the status of that emailing Access PC.

Thanks
notes.ini
OK, well let's hope it is fixed, but if not, putting 7.0.4 on would certainly be a very low-risk, easy way of getting the latest R7 release in place.

Moving to 8.5 if and when of course is relatively easy and gives you lots of new options and features.

Good luck with it!
There is indeed a line with EXTMGR_ADDINS in notes.ini. If you (still) want to test without the Sighmantec backup, you'd have to remove ntask and npas from the ServerTasks line, plus you have to comment out the entire EXTMGR_ADDINS line.

And you're making progress, apparently. Good! Still, it beats me how Domino could crash on an Access application not having terminated correctly.

PS And I was clearly more subtle than IBM, wasn't I? ;-)
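The notes.ini change described above (drop ntask and npas from ServerTasks, comment out EXTMGR_ADDINS) can be sketched as a text transformation. Assumptions to check before using anything like this: a leading semicolon is taken to mark a commented-out notes.ini line, the EXTMGR_ADDINS value in the sample is a made-up placeholder because the real one is only in the attached file, and the function name is illustrative. Work on a copy of notes.ini with the server stopped.

```python
def disable_backup_hooks(ini_text, drop_tasks=("ntask", "npas"), comment_prefix=";"):
    """Return notes.ini text with the given tasks removed from ServerTasks
    and any EXTMGR_ADDINS line commented out. Other lines are untouched."""
    drop = {t.lower() for t in drop_tasks}
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        name = key.strip().lower()
        if sep and name == "servertasks":
            kept = [t.strip() for t in value.split(",")
                    if t.strip() and t.strip().lower() not in drop]
            out.append(key + "=" + ",".join(kept))
        elif sep and name == "extmgr_addins":
            # Assumption: ';' at the start of a line disables it.
            out.append(comment_prefix + line)
        else:
            out.append(line)
    return "\n".join(out)

sample = "\n".join([
    "ServerTasks=ntask,npas,Update,Replica,Router,AMgr,AdminP,CalConn,Sched,RnRMgr,Http",
    "ServerTasksAt1=Catalog,Design",
    "EXTMGR_ADDINS=exampleaddin",  # hypothetical value, not from the thread
])
print(disable_backup_hooks(sample))
```

Only the exact ServerTasks key is edited; the scheduled ServerTasksAt lines are left as they are.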
You can have defective databases.

Try running fixup on all databases and see if there are any errors.

Any defective database should be moved outside the Domino datadirectory.

PS! we are running all our Domino servers both linux and windows on VMware ESXi free version.

A simple script makes snapshot and backup of live Domino servers - I have been able to restart any Domino server from backup - even on another VMware server.
Update from IBM, after more analysis of the log files, maybe

"The crash is being caused by a particular content in a mail causing the SMTP task to crash"

There is an auto email being sent out by Access on that 'hung PC' that was being generated at 6 minutes past the hour!!

Still no problems since my last post.   Don't think I'm going to be allowed the time to look into this any further.

We plan to look at upgrading as soon as we can.

Thanks for all your help.

Sjef - seems like you were co-ordinating help on this with dragon-it.  Where do you think the points should go?
Give 'em to Sjef, I was just passing by and didn't have enough time to look properly.

Hope it goes well. The upgrade to 8.5.x isn't too difficult if/when you do get to it, and there are lots of new gadgets to play with... though personally it's IBM Notes 9 Social Edition (what a stupid name, like Windows for Workgroups and the like), with its browser-based full client, that interests me next.

No idea where you are in the world but ask if you need help during the upgrades!

Steve
Solved, for now? Great! And 'twas the SMTP task after all. Mighty strange that Domino barfs for no good reason (personally, I'd have some problems digesting mails sent by MS Access too ;-)) and that it doesn't leave a clearer message.
Sjef gets the points for all the problem solving ideas that he came up with, and for constantly giving me updates and further things to try.
His very first post mentioned the SMTP task being the likely culprit and although it was another system bringing the Notes Server down, it was indeed due to an SMTP message hanging the system.
Thanks again.
You're more than welcome :-)
Still amazed an SMTP conversation, even from a Microsoft piece of software, could cause it to die like this, but I guess there was some sort of unexpected 'illegal' SMTP transfer going on there causing an overflow or whatever.

Interesting one.... do get that server updated though :-)  Aside from everything else, I don't think it was until 7.0.3 or 7.0.4 that those nasty winmail.dat rich-text / attachment files from Outlook were handled.

Steve