Mark

spamassassin Bayes data files which, where?

I'm running spamassassin 3.3.2 on Slackware64 14.1, kernel 3.10.17. I use sa-learn to "train" Bayes weekly, sometimes daily. I'm trying to locate the Bayes data files. I do not have bayes_path set in local.cf, so I suppose it is using a "default" location. Searching my drive, I came up with:
-rw------- 1 spamd spamd 652K Mar 26 00:42 /home/spamd/.spamassassin/bayes_seen
-rw------- 1 spamd spamd 5.0M Mar 26 00:42 /home/spamd/.spamassassin/bayes_toks
-rw------- 1 spamd spamd 51K Mar 26 00:42 /home/spamd/.spamassassin/bayes_journal
-rw------- 1 root root 2.5M Mar 26 00:34 /root/.spamassassin/bayes_seen
-rw------- 1 root root 4.7M Mar 26 00:34 /root/.spamassassin/bayes_toks


It appears that I have Bayes database files in 2 places. Can that be? I've checked several times, and the files in both locations seem to be modified within minutes of each other after running sa-learn. The dates on the /root/.spamassassin files remain the same until I run sa-learn again. The /home/spamd/.spamassassin files change throughout the day.

My assumption is that the files in /root/.spamassassin are updated as a result of sa-learn and the ones in /home/spamd are updated by the continuously running spamd -- possibly as a result of auto-learn.

My concern is the connection between the 2 sets of files. Is my sa-learn training going into /root/.spamassassin making it into /home/spamd, and is spamd using my training? If I used bayes_path in local.cf would I only have one set of files?

Could someone please explain to me how this works.
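
For context, SpamAssassin's documented default for bayes_path is `~/.spamassassin/bayes`, so each account that runs sa-learn or spamd ends up with its own bayes_* files unless the path is overridden. A rough sketch of listing candidate databases per home directory follows; the function name `list_bayes_dbs` is made up for illustration and is not part of SpamAssassin.

```shell
# list_bayes_dbs: hypothetical helper, not part of SpamAssassin.
# Prints every bayes_* file found under <home>/.spamassassin for each
# home directory given as an argument.
list_bayes_dbs() {
    for home in "$@"; do
        dir="$home/.spamassassin"
        [ -d "$dir" ] || continue
        for f in "$dir"/bayes_*; do
            # Skip the unexpanded glob when no bayes_* file exists.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}
```

Called as `list_bayes_dbs /root /home/spamd`, it should list the same files as the directory search above, grouped by the account that owns them.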
arnold

The usual setup is a systemwide DB, which here would be in the /home/spamd directory, plus a more customized per-user DB in each user's home directory.
If the systemwide DB is being regularly updated, it would suggest that your main delivery /etc/procmailrc includes a teaching/learning setup to spamd.

If I am not mistaken, you are using procmail as LDA for sendmail.
Mark

ASKER

Arnold:
If I am not mistaken, you are using procmail as LDA for sendmail.
yes, you are correct.
If the systemwide DB is being regularly updated, it would suggest that your main delivery /etc/procmailrc includes a teaching/learning setup to spamd.
What I am doing is capturing a copy of all mail sent or received into an /etc/mail/allmail mbox folder. I then periodically, manually sort this mail into ham/spam and use `sa-learn --[sp|h]am` to train. local.cf is set to have bayes autolearn, but I am pretty sure that the spam filtering is done before procmail gets involved. Spam filtering is done by spamd and spamassassin milters in sendmail. After passing through those, it goes to a bcc-milter that saves a copy of the email in /etc/mail/allmail, then the message is passed on to the local delivery agent, procmail. Spamassassin might actually totally reject messages based on spam score and messages thus rejected never make it to the bcc-milter or the allmail archive or the users' mail folders.

Given all that, I still don't exactly understand the relationship between the /root and /home/spamd bayes databases. It does seem that my manual training has an effect, since messages similar to the ones I train as spam get higher BAYES_nn scores as time goes on.

Right now (1:00AM) I just ran sa-learn --showdots --mbox --spam isSpam where isSpam is my hand extracted collection of spam. I then have:
-rw------- 1 spamd spamd 652K Mar 27 00:43 /home/spamd/.spamassassin/bayes_seen
-rw------- 1 spamd spamd 5.0M Mar 27 00:43 /home/spamd/.spamassassin/bayes_toks
-rw------- 1 spamd spamd 27K Mar 27 00:44 /home/spamd/.spamassassin/bayes_journal
-rw------- 1 root root 2.5M Mar 27 01:00 /root/.spamassassin/bayes_seen
-rw------- 1 root root 4.7M Mar 27 01:00 /root/.spamassassin/bayes_toks


Which is leading me to believe the databases are not connected and the /home/spamd database is getting updated by self-learning and I'm just amusing myself with the sa-learn to no actual effect.

Do you agree? If so, how do I get these databases synchronized/merged and my sa-learn actually doing something useful? The thing is, I originally simply enabled Bayes in local.cf and there weren't really any predefined settings for the database. And the bayes scoring did seem to kick in after I had added enough messages, so it does seem the databases are connected somehow -- but at this point it's essentially magic to me.
The self learning IMHO is a mistake, given that an errant classification is self-reinforcing.
I believe sa-learn includes an option to point to the DB that should be updated.

Pass --dbpath /home/spamd/... to sa-learn.
The auto learning might be triggered by an option used when starting spamd.

I think sa-learn has a sync option.
Your sa-learn does help. At this stage I guess you are having it learn when a spam passes through or a good email is caught.

The difficulty in answering your question, though, is that one person's spam is another's wanted email.

Presumably your main filter running on all incoming deals with capturing general agreed upon spam, then each individual's mailbox learns when a user places a message in a spam....
Mark

ASKER

Presumably your main filter running on all incoming deals with capturing general agreed upon spam, then each individual's mailbox learns when a user places a message in a spam....
No. While it is true that spamassassin can be configured on a per-user basis, it is not in this case. There are no bayes DB files in user folders (other than /home/spamd). This host serves as a Maildir IMAP repository. Users connect via Outlook. What they subsequently do with their mail is of no concern to spamassassin.

I've looked at other systems I've configured with spamassassin as a system-wide milter and they also have this dual database thing going on: one in /root which gets updated by manual sa-learn training and one under /home/spamd that gets updated god-knows-how.

This is my fundamental question: does the /home/spamd bayes database get updated from the /root bayes database? If so, how and when? These files are binary, so I can't compare them, and I find no utility that actually tells me which database bayes is looking at.

Maybe this is unknowable.
ASKER CERTIFIED SOLUTION
arnold
Mark

ASKER

`sa-learn --dbpath` lets me set the path, but doesn't display the current/default path.

`sa-learn --dump` gives me meaningless gobbledygook like:

:
0.972         17          1 1427309898  8988cde38d
0.987          1          0 1423494801  6825dc24df
0.999         10          0 1424405986  35f21c286b
0.004          0          4 1425186068  46c3133d97
:

`sa-learn --dump magic` gives me a bunch of counts/totals which are likewise meaningless to me. No info about where the database is.
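
The `--dump magic` counters are easier to read if only the message counts are pulled out. The sketch below assumes the usual five-column layout, where the third column holds the value and the trailing text names the counter (for example `non-token data: nspam`). The helper name `bayes_counts` is illustrative and not part of SpamAssassin.

```shell
# bayes_counts: hypothetical helper, not part of SpamAssassin.
# Reads `sa-learn --dump magic` output on stdin and prints the nspam,
# nham and ntokens counters as name=value pairs.
bayes_counts() {
    awk '/non-token data: (nspam|nham|ntokens)$/ { print $NF "=" $3 }'
}
```

Running `sa-learn --dump magic | bayes_counts` as root, and then `sa-learn --dbpath /home/spamd/.spamassassin/bayes --dump magic | bayes_counts`, would give two sets of counts; if they differ, the two commands are presumably reading different databases.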

bayes_tok is not referenced in sa-learn except as "=item bayes_toks". I guess I'll have to read through the sa-learn source to see if I can find where the database is being accessed.

The lack of inquiry tools to determine the location of the database kind of makes it seem like a "don't worry about it" thing; that we're not really supposed to mess with it. Yet they do have `sa-learn --dbpath`, but with no more information than saying the path should be in "bayes_path form", whatever that is. I guess I'll have to search for some examples of how people have used this option.
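
On "bayes_path form": as far as the SpamAssassin documentation describes it, the value is a directory plus a filename prefix rather than a directory alone, and suffixes such as `_toks`, `_seen` and `_journal` are appended to that prefix. A small sketch of the expansion; `bayes_files` is a made-up name for illustration.

```shell
# bayes_files: hypothetical helper, not part of SpamAssassin.
# Expands a bayes_path-style value (directory plus filename prefix)
# into the file names that are built from it.
bayes_files() {
    for suffix in toks seen journal; do
        printf '%s_%s\n' "$1" "$suffix"
    done
}
```

For example, `bayes_files /home/spamd/.spamassassin/bayes` prints the three /home/spamd paths from the listings earlier in the thread.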
You can use --dbpath path_to_db_files when you are teaching spamassassin.

sa-learn --dump all

Compare the output of the same sa-learn command run as root and run as spamd: sudo -s -u spamd sa-learn --dbpath

There was a binary Bayes filter that used Berkeley DB, but that is neither here nor there.
--dump has magic, all and data.


Deflecting in part.
 Bogofilter, http://sourceforge.net/projects/bogofilter/

BMF
http://sourceforge.net/projects/bmf/files/bmf-0.9.4.tar.gz/download?use_mirror=heanet

back to topic at hand.

Try the following to confirm whether your sa-learn updates both or just one.

Use sa-learn on a message that you see as spam that passed through.
Then pass the same message through the filter to test whether it gets marked as spam.

Does your /etc/procmailrc, after dropping privileges, run spamassassin tests as well, or do you have a local .procmailrc?

I think the repository is stored in the home directory of the user under whose credentials spamd is running.
sa-learn run as spamd will teach the systemwide DB; run as another user, it will teach the local ....
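
One way to see which database a given sa-learn run touches is to drop a timestamp marker, run the training, and then list the bayes files modified after the marker. A sketch under that assumption; `bayes_changed_since` and the marker path in the usage note are hypothetical.

```shell
# bayes_changed_since: hypothetical helper, not part of SpamAssassin.
# Prints every bayes_* file in the given directories whose modification
# time is newer than that of <marker>.
# usage: bayes_changed_since <marker-file> <dir>...
bayes_changed_since() {
    marker=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        find "$dir" -type f -name 'bayes_*' -newer "$marker"
    done
}
```

A possible sequence: `touch /tmp/marker`, run the usual sa-learn command, then `bayes_changed_since /tmp/marker /root/.spamassassin /home/spamd/.spamassassin`. Files under /home/spamd may also show up simply because spamd updated them in the meantime, so a short interval gives a cleaner answer.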
Mark

ASKER

I set 'bayes_path /home/spamd/.spamassassin/bayes' in local.cf and made sure that the bayes* files in that folder were owned by spamd.spamd. Then my training command is:

sa-learn -u spamd --showdots --mbox --spam isSpam caughtSpam

This all does update the spamd bayes files and it does seem to work with trapping new messages.
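
To double-check which path a configuration file sets, the bayes_path line can be pulled out directly. A minimal sketch; `get_bayes_path` is a hypothetical helper, and the location of local.cf has to be supplied by the caller since the thread does not say where the file lives.

```shell
# get_bayes_path: hypothetical helper, not part of SpamAssassin.
# Prints the value of the last uncommented bayes_path line in the
# given .cf file; prints nothing if the setting is absent.
get_bayes_path() {
    awk '$1 == "bayes_path" { path = $2 }
         END { if (path != "") print path }' "$1"
}
```

With the setting described above, `get_bayes_path` run on local.cf would be expected to print /home/spamd/.spamassassin/bayes; empty output would mean the per-user default is still in effect for that file.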