longfingaz

Lotus Domino Server Router Needs to be restarted

Hi, I have a Lotus Domino 8.0.1 messaging server. This server functions fine, but on occasion when large group emails are sent out (I think this is why) the router decides it can no longer see certain domains and starts to queue up messages. If I stop and restart the Router in Domino, all messages are released fine. I had tweaked some settings in my Transfer Control settings in the messaging settings and it seemed to alleviate it. However, it has started to come back... Attached is a screenshot of my Transfer Control settings. Can anyone tell me ways to tweak this so that the router doesn't crap out on me anymore?



transfer.bmp
SysExpert

How many CPU cores are on the server?
Did you check the recommendations for these settings in the Admin client Help and on the IBM site?

Do you have antivirus/spam software running on the server? It can slow things down.



I hope this helps !
longfingaz

ASKER

2 Xeon processors, 6 GB of RAM... AV and spam software are running, but only for incoming messages.
How many mailboxes did you set up? If it's still only one, set it to 2 or 3, you'll see that mail processing will go a lot smoother. You'll have to restart the Domino server, the mailboxes will be created automatically.
Generally messages get queued up when the destination server is unavailable. I wonder if this is not a hardware issue at all but perhaps a user/situational issue. From what you say, it is only having trouble with "some" domains. Perhaps your box (which should be quite robust from the sound of it) is overloading the destination server and so queuing up the messages for which it no longer gets a response when initiating a connection. Queuing is a funny thing ... after the first few retries it can wait 15 min ... 30 min ... an hour ... then 3 ... etc. This would explain why, when you restart your router, it sends them off quickly: all the other messages have now gone through, the destination is ready to take the queued-up ones, and on the restart it tries all queued messages (as it has no idea how long they have been queued).

Hard thing to test for... but if there are still other messages going through and the queues are for certain domains... it most likely isn't your router's fault, but the destinations.
I have 4 mailboxes already.  
I guess the best test ... would be to let "nature" take its course next time ... and watch what happens.
Well, that's the thing, it's definitely available. The domain, that is... My email domain is Server.Domain.com and the place it is hanging with is Domain.com (the parent domain). There are large email blasts leaving the server and mainly going to the parent domain. This is where I believe I am getting the issue. I am going to attempt to have this email sent via a listserv rather than through Domino groups. But I wanted to know if anyone knew better settings specifically for Transfer Control.
I understand what you are saying, that the "domain is available"... but how many threads does the destination have running? The server may not be available... if you try to throw 100 messages with all 13 threads at one destination and it only has 5 threads to take in your messages ... something has to give... and that would be your messages going into queue. Then when those 5 finish, you have another 5 going from unqueued ones that haven't been tried yet, plus the 8 that are already in queue. They may not get to go for a long time in "computer time", as new ones are going to keep filling in the threads, and when the retry times come around, the threads on the destination will still be busy with fresh messages, so they go back into queue with each retry interval getting longer. So more get sent to the queue and more unqueued messages fill in the threads. The chances of queued messages hitting that server while unqueued ones are still flowing are very small. I really don't think this has anything to do with your server at all.

Without restarting nrouter, how long does it take for all the messages to go on their own? What kind of bulk mail quantity are you talking about: 1,000s? 10,000s? More? Have you ever let it go 24 hours?

Here's a way to test... Try lowering the value of the Initial Transfer Retry Interval, but I wouldn't go too low or who knows what kind of fits your server might have.

I posted a snippet in case IBM decides to change the URL in my previous post (for posterity). And OK, I was wrong about the 1 and 3 hours ... but you get the idea. From IBM's support site:

To view the messages you have pending and their current status, issue this command at the Domino Server console:
TELL ROUTER SHOW QUEUE
The number in the State of Retry ( ) is what indicates whether this was the first attempt to route the mail, after the initial retry was attempted or if it is a subsequent attempt.

The Algorithm used:
The Router will use the "Initial Transfer Retry Interval" field on the Server Configuration document in the Domino Directory (names.nsf). The value of this field is used to determine when a failed message will be re-tried for the first time. The additional attempts to send the message will be based on 2X and 3X the value of this field. All remaining attempts, after the third retry, will be done at this interval, (3 times the Initial Transfer Retry Interval) for the total of the 24-hour period.

The following example is based on the default value of 15 minutes:

* The Initial retry will be attempted in 15 minutes. (This is actually the second attempt to send the message).

* If the Initial Retry attempt is unsuccessful, the Router will back off the attempts, by doubling the "Initial Transfer Retry Interval" value before trying again. Retrying in 30 minutes. (This is now the third attempt to send the message).

* If the second Retry attempt is unsuccessful, the Router will back off the attempt again, this time tripling the "Initial Transfer Retry Interval" value before trying again. Retrying in 45 minutes. (This is now the fourth attempt to send the message).

* All remaining attempts will be done at the 45 minute interval for 24 hours.

Lowering the value of the "Initial Transfer Retry Interval" will increase the retry attempts per hour and could possibly increase the success rate of routing the messages.

Increasing the value of the "Initial Transfer Retry Interval" will decrease the retry attempts per hour and will result in longer routing times.

NOTE: The only way to reset the retry interval is to recycle the Router task. Issuing a "route servername" at the console will attempt an immediate transfer, but if it is unsuccessful, the next retry interval will be attempted.
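
For anyone who wants to see how the timing plays out, here is a rough Python sketch of the backoff described in the IBM note above. It only illustrates the arithmetic (1X, then 2X, then 3X the interval for every later retry); it is not Domino's code. The function name `retry_offsets` and the assumption that retries simply stop once the 24-hour window has elapsed are mine, not IBM's.

```python
def retry_offsets(initial_minutes=15, window_hours=24):
    """Minutes after the first failed attempt at which each retry would fire.

    Assumption (not from IBM's note): a retry landing exactly on the end of
    the window still counts, and nothing is retried after the window closes.
    """
    window = window_hours * 60
    offsets = []
    elapsed = 0
    retry_number = 1
    while True:
        # 1st retry waits 1X the interval, 2nd waits 2X, all later ones wait 3X
        multiplier = min(retry_number, 3)
        elapsed += initial_minutes * multiplier
        if elapsed > window:
            break
        offsets.append(elapsed)
        retry_number += 1
    return offsets


if __name__ == "__main__":
    for minutes in (5, 15, 30):
        schedule = retry_offsets(minutes)
        print(minutes, "min interval:", len(schedule), "retries, first few at", schedule[:4])
```

With the default 15 minutes the first few retries fall 15, 45, 90 and 135 minutes after the original failure, which matches the 15/30/45 minute gaps in the note. Lowering the interval packs more retries into the same 24 hours, which is all the "tweak" really buys you.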
This is pure science man!

In awe...

SB
OK, lots of great information here. I follow completely, but the thing is, until I restart the router it will not deliver those messages. They actually show as failed after 24 hrs and NDRs are sent to the users. If I don't restart the server/router, the messages don't actually retry. My question now is: how many "failed" attempts before it marks the message as dead? Maybe I have my retry interval too close, and because it's 1000 messages to one domain the queue is still not getting breathing room to get those messages out.
And your line speed? Can you monitor that line or the IP-address and port? It is only Notes mail, isn't it? It could be a failing Connection document, for all I know. If you don't have an explicit Connection document to the parent domain's server, create one.
I went as far as making a "host" record in Windows. It will say in the queue (when it happens) "No route found to domain".
Alright well the "No Route found to domain" is new info.  Next mass mail check the command

TELL ROUTER SHOW QUEUE

This will let you know where it stands ... I doubt that it's really taking 24 hrs to send a message.

What is it saying in the logs?


10/22/2008 08:35:40 AM  Router: No messages transferred to MAIL GATEWAY/DREXEL_IST via Notes
10/22/2008 08:35:40 AM  Router: No messages transferred to DREXEL.EDU (host DREXEL.EDU) via SMTP: Remote system no longer responding
10/22/2008 08:35:40 AM  Router: Successfully issued a request to push to MAIL GATEWAY/DREXEL_IST
10/22/2008 08:35:40 AM  Router: Successfully issued a request to push to MAIL GATEWAY/DREXEL_IST
10/22/2008 08:35:40 AM  Router: Failed to connect to SMTP host DREXEL.EDU because Remote system no longer responding
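
If you want to see whether the router is really retrying over the 24 hours, one option is to tally the failure lines per host from an exported log. This is a minimal Python sketch that assumes the log lines look like the excerpt above; the regular expression and the name `count_smtp_failures` are made up for illustration, and real logs may well contain variants this pattern misses.

```python
import re
from collections import Counter

# Matches lines shaped like the "Failed to connect to SMTP host ..." entry above.
FAILED = re.compile(
    r"Router: Failed to connect to SMTP host (?P<host>\S+) because (?P<reason>.+)$"
)


def count_smtp_failures(lines):
    """Count failed SMTP connection attempts per (host, reason) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line.strip())
        if match:
            counts[(match.group("host"), match.group("reason"))] += 1
    return counts


# The five log lines posted above, used as sample input.
sample = [
    "10/22/2008 08:35:40 AM  Router: No messages transferred to MAIL GATEWAY/DREXEL_IST via Notes",
    "10/22/2008 08:35:40 AM  Router: No messages transferred to DREXEL.EDU (host DREXEL.EDU) via SMTP: Remote system no longer responding",
    "10/22/2008 08:35:40 AM  Router: Successfully issued a request to push to MAIL GATEWAY/DREXEL_IST",
    "10/22/2008 08:35:40 AM  Router: Successfully issued a request to push to MAIL GATEWAY/DREXEL_IST",
    "10/22/2008 08:35:40 AM  Router: Failed to connect to SMTP host DREXEL.EDU because Remote system no longer responding",
]

if __name__ == "__main__":
    for (host, reason), n in count_smtp_failures(sample).items():
        print(host, "|", reason, "|", n)
```

If a full day of log only shows one such line per stuck destination, that would support the idea that the retries are not happening at all.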
I was wrong about my error being "no route to domain"; it's "remote system no longer responding".
I still think the issue is with the destination... how often do these blasts go out?  Are there other servers blasting to it too?  if it gets blasted hourly then the queues may never get a chance.  How old are the messages that the log snippet shows?
It could also be a network/DNS issue, or doesn't that translate into "no longer responding"? Is the server inside your domain or outside? If it is inside, and it's a Domino server, do you know why you use SMTP to transfer mail?
The blast goes out once a week. And honestly I'm not 100 percent sure that it is related.

My real question is, at what point does Domino mark a message as dead and STOP trying to redeliver it? For some reason, even if it's a network "hiccup", the retries don't seem to be happening. I am going to be upgrading this server's hardware in the weeks to come and was hoping that would help resolve the issue. But I was just curious if it could be related to Domino settings. When the server is holding the messages, I can ping / nslookup the destination domain from the Windows command prompt no problem.

I think my issue is somewhere in the configuration document related to retries and when to retry.
BTW the destination server is NOT a domino server so it goes via SMTP.
Don't try ping or nslookup then, but a
      telnet smtp.DREXEL.EDU 25
with the correct FQHN of course. That would connect to the SMTP server all right.
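
If you'd rather script that check than type it by hand each time the queue backs up, something like this Python sketch does roughly what the telnet test does: open a TCP connection to port 25 and read the greeting. The function name `smtp_banner` and the 10-second timeout are my own choices, and the host name in the comment is just the placeholder from above, not a verified server name.

```python
import socket


def smtp_banner(host, port=25, timeout=10.0):
    """Return the first line the server sends, or None if we cannot connect.

    A healthy SMTP server normally greets with a line starting with "220".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(1024)
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return None
    return data.decode("ascii", errors="replace").strip() or None


# Example (placeholder host name, replace with the real FQHN):
#   print(smtp_banner("smtp.DREXEL.EDU"))
```

Run it from the Domino box itself while messages are stuck. If it returns a 220 greeting at the same moment the router says "Remote system no longer responding", that points away from plain network trouble.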

I don't think that it's Domino, but who am I... The other end may have had some anti-congestion treatment. It might consider your end to be a notorious spammer who tries to flood the server. How much data is that blast? Would it be possible to send several farts instead of one big blast?

By the way, do you also manage the receiving server? What do its logs say?
ASKER CERTIFIED SOLUTION
Stan Reeser
This solution is only available to members.
Just interested: what was the real cause, and what did you do to solve the problem??
merchant_sc

Wow.. been a while since I've weighed in on Experts Exchange for Notes stuff... but I ran across this searching for answers to my own, similar issue. I know an answer has been accepted, but I thought I'd throw in my 2 cents for anyone else suffering from a similar problem.

First, I firmly believe this is somehow tied to version 8. I have 2 SMTP outbound servers that seem to intermittently do the same thing. They are supposed to be "balanced" by a content switch which hands mail off to the 2 of them. I suspect that balance is a little off and one takes more volume than the other at times, but regardless of the exact balance, each handles a pretty heavy load.

Prior to 8.0.1 (4 weeks or so ago) we ran 7.0.3 without any issue like the ones I'm seeing. We have been watching our mail queue counts for months and have run with a single mail.box. I was one of those 4 mail.box users, then dropped to 2 and finally settled on 1 because it made queue checks so much easier and seemed to have NO impact on my routing, despite our volume.

On a typical day I could find 50-80 pending messages in queue.  MANY of those pending would be to bad domains.  Users who reply to the internet with the out of office, users who type the domain name incorrectly, domains that simply are not valid, domains that are not responding, etc.  This is expected behavior.  Messages sit in queue for 24 hours and fail back to the sender if possible.  Some wind up dead, never a huge issue.

8.0.1 comes along and I notice an issue with the queue backing up. Somewhat like your problem description, yet different circumstances. We never sent blasts of mail out.. just the normal day to day stuff.. a blast of mail might result in a higher pending queue (more bad addresses) than some days, but overall would normally have little impact. With 8.0.1 I started seeing the queues on one server or the other grow throughout the day.. from 50 pending in the AM to 200+ pending in the afternoon. All the while mail was still routing in and out.

The common problem I had was when a message was handed off to the router and it failed to route - in my case it would fail to route with this error:  DNS is unavailable or query timed out, message will be requeued.

It would sit in queue. Now I KNOW I have pending for bad domains. I have always had that. And I KNOW DNS was unavailable, a side effect of our infrastructure where we get these 2 second timeouts if we're querying a domain we haven't talked to (not cached) in 24 hours. That's OK, because the next attempt yields immediate results...

And that is my underlying problem. There never IS a second attempt. That message sits in queue while messages that fail for other reasons appear to retry. Those increment, and perhaps there are too many, but from the sounds of it, it's simply a connection check, failure and move on. Again.. new mail will route, just those that fail get stuck in limbo.

So I send to abc.com and they fail because DNS isn't available.  A quick MX record check from a command prompt will give me results, I know that if the server simply attempts again, it will route.  This doesn't do a thing until I restart the router - then it will route.

That's where we seem to have a common thread. Your 8.0.1 server failed to route, quite possibly because your destination server was not immediately available or did fail to respond .. and it never retried. At least that's the way it sounds to me. Do your logs support that, or do you see repeated failed attempts to route over the 24 hour period before it sends the NDR to the sender? Mine shows no activity until it hits the 24 hour mark and it fails. The ones that fail for other reasons seem to be quite content to retry and show up that way...

I do have an open PMR and was hoping this would be recognized as an issue and some ideas given, but so far I've only gotten as far as explaining myself (like this) over and over and being told that I could try playing with some Windows/Notes settings that impact the retry intervals.. and DNS check timeouts.. That's not my underlying problem. The problem is either that it's tying everything up with a few bad ones and not cycling through all my messages, or that it just drops these messages into limbo and ignores them...

Whatever the reason a regular restart of the router seems to be the only good medicine .. I'll update you all if I find any more out.
Actually, I had even forgotten about this, but it happened to a client of mine on a 7.0.3 server. They could send to EVERYONE except this one address, and it had gone out fine the day before. We were utterly convinced it was their issue (peering or some such nonsense). Had all the ISPs involved on a conference call... yadda yadda. What was really odd, and the reason I went to restart the server, was that new messages to be sent to that domain weren't even being tried. When I took the server down, the router hung. (I guess something about this issue had permeated my brain and that's why I rebooted.) Anyway, it comes back up and all's well. I think I will open a PMR Monday, and if you give me the number of yours I can tell them to link it. You can send it to my screenname at thedatacenter dot com.

I guess I should give back my points but hey I only got a B anyways :)

Clearly Lotus needs to look into this more.  
Merchant ... what OS are you running Domino on? Win2K3 here.

Has this just started happening to you? It's really odd to see it happen across different versions. It happened to my client just last week.
Callin' the Loti on monday.
"And three people do it, three, can you imagine, three people walking in
singin a bar of Alice's Restaurant and walking out. They may think it's an
organization." - or at least a bug.
Ah..  2 Win2k3 servers as well.    I will send you my PMR #

This did just start happening post - 8.0.2 and it's sporadic at best. My issue seems to be a combination of a couple of things ....

* Our DNS servers, which do not respond fast enough if the information is not cached (DNS query times out)
* Domino server that does not attempt to retry after that initial failure.

I haven't had the router hang yet and I typically stop/start the router task. The longer I let it go, the more messages queue up. If I do a "tell router show queue", my problem messages are not retrying; they are listed simply as "waiting for DNS availability" and the only way to make it attempt again is to stop/start the router.

The guy at IBM was suggesting that it's related to the number of bad addresses we have queued up (people send mail to things like ALO.com, not AOL.com, etc...) but we've always had that kind of garbage in our queues. And I can't imagine that those 10... 20.. or even 30... bad addresses create this issue. The router seems content to retry those.. and new mail routes fine IF it can find DNS, otherwise it falls by the wayside with the rest.

The most perplexing thing is that the problem is not consistent.. so some of these things may be factors, but they certainly aren't working right!