multiple "queue active" lines for a single message, with several minutes interval

hello all
 
this may be something some of you guys stumbled across.
i worked with postfix for years but either did not or can't remember.
here are the symptoms :
 
FACTS
 
some of the mail we receive take several minutes to transit
 
when i look in the logs, i see :
- a normal reception ( connect + cleanup + qmgr ),
- then NOTHING happens for about 5 minutes (no log line contain this message id),
- then a second "qmgr...queue active" line is printed (duplicate of the previous one except for the date)
- then i see normal delivery lines
 
<log obfuscation=words_in_capital_letters>
Mar 31 16:24:54 vm-sm-prxmail01 postfix/screened_submission/smtpd[7887]: 69FDC1A8: client=HOSTNAME[ADDR], sasl_method=PLAIN, sasl_username=SENDER
Mar 31 16:24:54 vm-sm-prxmail01 postfix/cleanup[9611]: 69FDC1A8: message-id=<005c01d06bbe$69ed6bf0$3  dc843d0$@fr>
Mar 31 16:25:04 vm-sm-prxmail01 postfix/qmgr[12992]: 69FDC1A8: from=<SENDER>, size  =640961, nrcpt=1 (queue active)
Mar 31 16:30:03 vm-sm-prxmail01 postfix/qmgr[18606]: 69FDC1A8: from=<SENDER>, size  =640961, nrcpt=1 (queue active)
Mar 31 16:30:04 vm-sm-prxmail01 postfix/smtp[18611]: 69FDC1A8: to=<RECIPIENT>, relay=o  RELAY[172.17.20.197]:587, delay=311, delays=310/0.02/0.1/0.66  , dsn=2.0.0, status=sent (250 2.0.0 t2VETixt017803 Message accepted for delivery)
Mar 31 16:30:04 vm-sm-prxmail01 postfix/qmgr[18606]: 69FDC1A8: removed
</log>
 
A BIT OF INTERPRETATION
 
looking at the delays (in the smtp client's log line), it would seem that the reception of the data took 5 minutes, and the emission was short
... which would be consistent if the first "queue active" line was not present at all
 
looking at the peer's logs, it would seem like postfix never attempted to deliver the message during those 5 minutes
which would also be consistent with a long reception duration
( i'm not knowlegeable regarding that server and cannot play with as much as i'd like to, so i'm unsure when the first log line occurs )
 
the mail is not very big, but it is sent using a smartphone over a mobile network (3G at best and probably 2G or edge)
which would also look consistent
 
unfortunately i cannot activate debugs easily or the server will just die
 
i saw the same symptom several times for different messages.
most of the messages were bigger than the average on this platform ( but small still : 1M or so tops ).
 
QUESTION(S)
 
so my question is the following :
- should i assume that the data was still in transfer during those 5 minutes ? (AFAIK, clearly no)
- should i assume that the data transfer was short ? maybe the qmgr just crashed and picked up a few minutes later ? (this would be my bet, currently)
any other ideas regarding this issue ?
 
EXTRA INFORMATION
- postfix version is 2.11ubuntu ( with a bunch of hacks added by debian maintainers and/or canonical )
- the "screened submission" stuff is because i use postscreen in front of 587 ( only using tarpitting and very basic rejections based on dirty smtp implementations )
- the rest of the configuration is pretty straightforwards : only authenticated users can send email, a few basic checks are performed by a remote milter which answers very fast, and all email is sent to one of 2 smarthosts based on the recipient using a regular file based transport table
 
thanks all
 
alexis
LVL 27
skullnobrainsAsked:
Who is Participating?
 
skullnobrainsAuthor Commented:
i think i finally got it :

seems that in many cases (and i'd assume all cases), postfix will log " warning: problem talking to service rewrite: Connection reset by peer" more or less at the same time.

a mixture of debugging, old memories and browsing the net gives 3 possibilities :
- trivial rewrite crashes (and is restarted by master so subsequent connections work alright)
- a lookup failed -> in my case, trivial rewrite does not need any lookup except for a texthash transport map which just does not fail, even less randomly and would produce explicit log lines anyway
- trivial rewrite was killed

after tracing back, i tracked back some reload commands occuring at the same time. i'll monitor those, and take a few extra steps and be back there one i can confirm this was / was not the issue

for now, my guess is the following :
- postfix relaod kills the rewrite process before the queue manager finishes to process the message.
- the queue manager somehow crashes ( i guess since the id changes ) or looses track of the message anyway
- the queue manager is restarted/waked-up more or less 300 seconds later (as per the default master config)
- ... and when it starts or rescans the queue, the queue manager logs the "active" line again
0
 
DrAtomicCommented:
I've seen similar behaviour with badly implemented greylisting on a domain; just as a thought of investigation.

If the connection is indeed edge the 5 minutes for 1M messages seem very legit.
0
 
skullnobrainsAuthor Commented:
there is no greylisting on this platform : they are used by actual users, using SASL over submission port, so greylisting would be a no-go. the smarthost does not implement greylisting either.

given the fact that the first line gives the mail size, (and the size is always the real one whatever the headers say), it seems clear that the whole mail was transferred before the first line was written to the log.

when you say similar issues, do you mean a duplicate "queue active" line ? what i'm really wondering about, is what the hell was postfix doing during those 5 minutes ? btw, i know the qmgr was not actually dead because other mail went through during that time.
0
Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

 
DrAtomicCommented:
Jup, I saw the duplicate active line before with a bad greylisting implementation (where the connection was held open for 10 minutes but nothing was allowed), anyway it was just a thought that occurred.

With regard to the data session; it might get double stamped if the data contains from: and  to: fields, not sure can't test myself atm (i.e. the log parses the rcptto and mailto first then opens the datasession, and logs the second line as the datasession is completed by parsing the additional from and to mailheaders).

Try to duplicate the behaviour by performing the manual smtp test in this link: http://www.port25.com/how-to-check-an-smtp-connection-with-a-manual-telnet-session-2/
0
 
skullnobrainsAuthor Commented:
"I saw the duplicate active line before" : as far as i know, greylisting would reject the connection rather than let it open for 10 minutes. i never saw postfix print such a log line before it actually queued the message. can you elaborate ?

i'm not sure i understand your second §. do you mean postfix logs this information during the smtp exchange (namely after the RCPT TO and before the DATA) ??? this would be inconsistent with both my experience and the fact that postfix accurately logs the data size... i'm not sure i understand your point. i can confirm for sure that whatever is written in from: and to: headers never makes it's way to the log unless you configure specific header_checks, or run in debug mode

I already did dozens of test using netcat, scripts, and whatever else available. this error is either too seldom, or due to some weird client/network behavior which i failed to guess. i'm more than fluent when it comes to smtp and would not be asking questions here before i ran such a test (and actually such a huge bunch of them i can't enumerate). waiting for ages before sending the final "." does not reproduce the problem for example, but maybe i just missed something obvious.

--

i'm very interested in what was wrong with the greylisting implementation if you can confirm that you saw these exact same duplicate 2 lines ( i mean 2 qmgr lines with the same QID and size actually corresponding to the same message both stating "queue active" )

thanks for your time
0
 
DrAtomicCommented:
i'm very interested in what was wrong with the greylisting implementation if you can confirm that you saw these exact same duplicate 2 lines ( i mean 2 qmgr lines with the same QID and size actually corresponding to the same message both stating "queue active" )

What happened with the greylisting thingy is that the data stream didnt get processed untill the greylisting timer ran out (it sorta held the data stream hostage so to speak, the session didnt get closed where it should have); once that ran out the smtp session continued reposting the sender log entry and actually processed the mail fine. I can't remember what was wrong with the greylisting exactly other then that we threw out the config and rebuild it from the ground up. Basicly where the greylisting should have rejected/postponed accepting the connection it accepted it initially then stopped processing it.

Again I just threw this out there because that's the one instance where I saw similar double stamped log entries, the train of thought was that it may help you with an angle of further investigating your issue.

i'm not sure i understand your second §. do you mean postfix logs this information during the smtp exchange (namely after the RCPT TO and before the DATA) ???

Basically it was an idea to test out; in the data section you can define from: and to: fields as well in addition to the mail to: and rcpt to: fields in the smtp envelope; I was wondering if that might be the cause of the double stamped sender.

The best way to tackle this issue would be to get a hold of that mobile phone and have it send a large message over edge whilst you run a packet capture on the postfix server so that you can compare the contents of the packets with the log entry in the maillog.
0
 
skullnobrainsAuthor Commented:
ok, thanks for your input

i'm pretty sure that postfix does not log any header-related information whatsoever unless you apply header-checks or run in debug mode, or possibly get a warning because of invalid headers. anyway header information is never mixed with envelope information.

thanks for the information regarding greylisting.

note, that it does not have to be a bad implementation but can also be some kind of tarpitting, but it gave me a few ideas of things to try... unsuccessfully until now.

in my case, this is a submission port so there is no greylisting.

also, note that greylisting usually occurs at the RCPT_TO stage while the duplicate log line only apears after the complete data was received.

when using postfix in proxy mode, it is possible to apply checks after data but then postfix does not get to log this line.

since milters can perform post-data checks as well (before sending the confirmation), it might be possible to reproduce with a faulty milter, but unfortunately i did not get to try so far. since there is a milter :[ running on this system, maybe it just makes a mess of things from time to time, i'll dig into that whenever i get a chance.


The best way to tackle this issue would be to get a hold of that mobile phone and have it send a large message over edge whilst you run a packet capture on the postfix server so that you can compare the contents of the packets with the log entry in the maillog.

sure. i'm pretty aware of that but i just can't : i'm currently debugging this issue on a platform i setup for an ISP, and this user is from an external company. i'm not even sure that the user is aware that we are the ones handling his email ( and if not, i'm not supposed to tell ) and i have no idea regarding his location other that what geoiplookup, whois, and the likes can provide. and when it comes to mobile phone, it all boils up to next to nothing... so grabbing the phone is not all that realistic.

for now, i tried using various scripts. i cannot play a lot with the smtp part since i setup postfix to kick users whenever they take too much time to send commands, too much time on the smtp part, misimplement the protocol in various ways (among other things). i tried tarpitting like mad ( like sending one character at a time, using pv to throttle, ... ) ... until now to no avail but i was unable to test for more than a few minutes due to some idiotic network security in the office... guess i'll have to try at home

note that the user is hardly aware of the problem, which only occurs very unfrequently which is why it is such a pain to debug

-

anyway, thanks a lot for your input

i'm still hoping that somebody saw the exact same issue and possibly can shed some more light so i'll keep this question open for now.
0
 
DrAtomicCommented:
also, note that greylisting usually occurs at the RCPT_TO stage while the duplicate log line only apears after the complete data was received.

Grey listing can actually be done on all smtp levels, I suspect the one config we had the issue with and killed was doing it at the data level: https://tools.ietf.org/html/rfc6647#page-6

Bit of a bummer to hear you can't grab the phone, thanks for your response as well.

Investigating the milter(s) might be worthwhile, check your error log because if a milter adds a rcpt to the envelope then that has an error level of 9 (anything <10 should be logged in the error log in a default config).

Hopefully someone else will join in and confirm he saw the issue too but also found out why exactly.
0
 
skullnobrainsAuthor Commented:
<b>a normal postfix server produces a whole big bunch of 'queue active' lines when it is reloaded because of the initial queue rescan. this is the only moment when postfix would normally repeat "queue active" for something that is already in the queue.</b> ( except when you use smtp filters but then you actually have different messages ).

this host is configured to use a local smarthost, so it has a normally empty or almost empty queue at all times. so it  only produces a single duplicate one from time to time even when it is accidentally frequently reloaded.

note : i'm unsure whether the 5 minutes delay is produced by the next reload (the faulty script is croned every 5 minutes) or by the qmgr's wakeup.
0
 
skullnobrainsAuthor Commented:
updates
- the warning relative to trivial-rewrite does not always occur
- the qmgr does not "crash", it is recycled which is normal during reloads
- the 5 minutes are still a half/mystery
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.