spamassassin Bayes data files which, where?

I'm running spamassassin 3.3.2 on Slackware64 14.1, kernel 3.10.17. I use sa-learn to "train" Bayes weekly, sometimes daily. I'm trying to locate the Bayes data files. I do not have bayes_path set in local.cf, so I suppose it is using a "default" location. In searching my drive I came up with:
-rw------- 1 spamd spamd 652K Mar 26 00:42 /home/spamd/.spamassassin/bayes_seen
-rw------- 1 spamd spamd 5.0M Mar 26 00:42 /home/spamd/.spamassassin/bayes_toks
-rw------- 1 spamd spamd 51K Mar 26 00:42 /home/spamd/.spamassassin/bayes_journal
-rw------- 1 root root 2.5M Mar 26 00:34 /root/.spamassassin/bayes_seen
-rw------- 1 root root 4.7M Mar 26 00:34 /root/.spamassassin/bayes_toks

It appears that I have Bayes database files in 2 places. Can that be? I've checked several times, and the timestamps in both locations seem to be within minutes of each other after running sa-learn. The dates on the /root/.spamassassin files remain the same until I run sa-learn again. The /home/spamd/.spamassassin files change throughout the day.

My assumption is that the files in /root/.spamassassin are updated as a result of sa-learn and the ones in /home/spamd are updated by the continuously running spamd, possibly as a result of auto-learn.

My concern is the connection between the 2 sets of files. Is my sa-learn training going into /root/.spamassassin making it into /home/spamd, and is spamd using my training? If I used bayes_path in local.cf would I only have one set of files?

Could someone please explain to me how this works.
MarkAsked:
arnoldCommented:
The setup is often to have a systemwide DB which would be in the /home/spamd directory and a more customized per user which would be in the user's homedir.
If the systemwide DB is being regularly updated, it would suggest that your main delivery /etc/procmailrc includes a teaching/learning setup to spamd.

If I am not mistaken, you are using procmail as LDA for sendmail.
MarkAuthor Commented:
Arnold:
"If I am not mistaken, you are using procmail as LDA for sendmail."
Yes, you are correct.
"If the systemwide DB is being regularly updated, it would suggest that your main delivery /etc/procmailrc includes a teaching/learning setup to spamd."
What I am doing is capturing a copy of all mail sent or received into an /etc/mail/allmail mbox folder. I then periodically, manually sort this mail into ham/spam and use `sa-learn --[sp|h]am` to train. local.cf is set to have bayes autolearn, but I am pretty sure that the spam filtering is done before procmail gets involved. Spam filtering is done by spamd and spamassassin milters in sendmail. After passing through those, it goes to a bcc-milter that saves a copy of the email in /etc/mail/allmail, then the message is passed on to the local delivery agent, procmail. Spamassassin might actually totally reject messages based on spam score and messages thus rejected never make it to the bcc-milter or the allmail archive or the users' mail folders.

Given all that, I still don't exactly understand the relationship between the /root and /home/spamd bayes databases. It does seem like my manual training has effect as messages I train as spam seem to have like messages increasing in BAYES_nn scores as time goes on.

Right now (1:00AM) I just ran sa-learn --showdots --mbox --spam isSpam where isSpam is my hand extracted collection of spam. I then have:
-rw------- 1 spamd spamd 652K Mar 27 00:43 /home/spamd/.spamassassin/bayes_seen
-rw------- 1 spamd spamd 5.0M Mar 27 00:43 /home/spamd/.spamassassin/bayes_toks
-rw------- 1 spamd spamd 27K Mar 27 00:44 /home/spamd/.spamassassin/bayes_journal
-rw------- 1 root root 2.5M Mar 27 01:00 /root/.spamassassin/bayes_seen
-rw------- 1 root root 4.7M Mar 27 01:00 /root/.spamassassin/bayes_toks

Which is leading me to believe the databases are not connected and the /home/spamd database is getting updated by self-learning and I'm just amusing myself with the sa-learn to no actual effect.

Do you agree? If so, how do I get these databases synchronized/merged and my sa-learn actually doing something useful? The thing is, I originally simply enabled Bayes in local.cf and there weren't really any predefined settings for the database. And the bayes scoring did seem to kick in after I had added enough messages, so it does seem the databases are connected somehow -- but at this point it's essentially magic to me.
arnoldCommented:
The self learning, IMHO, is a mistake, given that an errant classification is self-reinforcing.
I believe sa-learn includes an option to point to the DB that should be updated:

pass --dbpath /home/spamd/... to sa-learn.
The auto learning might be triggered by an option used when starting spamd.

I think sa-learn has a sync option.
Your sa-learn does help. At this stage I guess you are having it learn when a spam passes through or a good email is caught.

The difficulty in answering your question, though, is that one person's spam is another's wanted email.

Presumably your main filter running on all incoming deals with capturing general agreed upon spam, then each individual's mailbox learns when a user places a message in a spam....
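
On the sync option: sa-learn does have --sync, which merges the journal (the bayes_journal file in the listings earlier in this thread) into the main database. A minimal sketch, assuming the spamd path from this thread and an account that is allowed to write those files; journal_pending is a made-up helper name, not part of SpamAssassin.

```shell
#!/bin/sh
# Succeeds if the journal for the given bayes_path prefix exists and is non-empty.
# journal_pending is a hypothetical helper used only for this illustration.
journal_pending() {
    [ -s "${1}_journal" ]
}

# Path taken from the file listings in this thread.
prefix=/home/spamd/.spamassassin/bayes

if journal_pending "$prefix"; then
    echo "journal has pending entries: ${prefix}_journal"
    # Only sync if sa-learn is installed and this account can write the token file.
    if command -v sa-learn >/dev/null 2>&1 && [ -w "${prefix}_toks" ]; then
        sa-learn --dbpath "$prefix" --sync
    else
        echo "run as an account that owns the files: sa-learn --dbpath $prefix --sync"
    fi
else
    echo "no pending journal for $prefix"
fi
```

After a sync, the journal file should shrink or disappear and the timestamp on bayes_toks should move forward.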
MarkAuthor Commented:
"Presumably your main filter running on all incoming deals with capturing general agreed upon spam, then each individual's mailbox learns when a user places a message in a spam...."
No. While it is true that spamassassin can be configured on a per-user basis, it is not in this case. There are no Bayes DB files in user folders (other than /home/spamd). This host serves as a Maildir IMAP repository. Users connect via Outlook. What they subsequently do with their mail is of no concern to spamassassin.

I've looked at other systems I've configured with spamassassin as a system-wide milter and they also have this dual database thing going on: one in /root which gets updated by manual sa-learn training and one under /home/spamd that gets updated god-knows-how.

This is my fundamental question: does the /home/spamd Bayes database get updated from the /root Bayes database? If so, how and when? These files are binary, so I can't compare them, and I find no utility that actually tells me what database Bayes is looking at.

Maybe this is unknowable.
arnoldCommented:
I'm not sure whether sa-learn as run by root updates both or just one.
You can choose which one when using sa-learn --dbpath.
You could also use sa-learn --dump (all, magic, data) along with an import into the new one.

Look at the dump option to display what the settings are in the /home/spamd versus /root/ databases.

sa-learn is a perl script and you can see where and what it is doing.

another option is to run sa-learn as spamd

sudo -s -u spamd sa-learn

The location of the config file used by the service might be in /etc/mail/spamassassin it could be elsewhere....
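
For anyone wanting to compare the two databases directly, here is a minimal sketch. It assumes the two locations from the listings in this thread, and it relies on bayes_path (and therefore --dbpath) being a filename prefix rather than a directory: SpamAssassin appends _toks, _seen and _journal to it. The helper name list_bayes_files is made up for illustration.

```shell
#!/bin/sh
# List the Bayes files that share a given bayes_path prefix, with size and mtime.
# list_bayes_files is a hypothetical helper, not part of SpamAssassin.
list_bayes_files() {
    prefix=$1
    for suffix in toks seen journal; do
        f="${prefix}_${suffix}"
        if [ -e "$f" ]; then
            ls -l "$f"
        else
            echo "missing: $f"
        fi
    done
}

# Prefixes taken from the file listings in this thread.
for prefix in /root/.spamassassin/bayes /home/spamd/.spamassassin/bayes; do
    echo "== $prefix"
    list_bayes_files "$prefix"
    # Query the database only if sa-learn is installed and the token file is readable.
    if command -v sa-learn >/dev/null 2>&1 && [ -r "${prefix}_toks" ]; then
        sa-learn --dbpath "$prefix" --dump magic
    fi
done
```

If the nspam and nham counts in the two "magic" dumps differ, the two accounts are learning into separate databases.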
MarkAuthor Commented:
`sa-learn --dbpath` lets me set the path, but doesn't display the current/default path.

`sa-learn --dump` gives me meaningless gobbledygook like:

:
0.972         17          1 1427309898  8988cde38d
0.987          1          0 1423494801  6825dc24df
0.999         10          0 1424405986  35f21c286b
0.004          0          4 1425186068  46c3133d97
:

`sa-learn --dump magic` gives me a bunch of counts/totals which are likewise meaningless to me. No info about where the database is.

bayes_tok is not referenced in sa-learn except as "=item bayes_toks". I guess I'll have to read through the sa-learn source to see if I can find where the database is being accessed.

The lack of inquiry tools to determine the location of the database kind of makes it seem like a "don't worry about it" thing; that we're not really supposed to mess with it. Yet they do have `sa-learn --dbpath`, but with no more information than saying the path should be in "bayes_path form", whatever that is. I guess I'll have to search for some examples of how people have used this option.
arnoldCommented:
You can use --dbpath path_to_db_files when you are teaching spamassassin.

sa-learn --dump all

compare the running of the same sa-learn as root and as sudo -s -u spamd sa-learn --dbpath

There was a binary Bayes filter that used Berkeley DB, but that is neither here nor there.
--dump has magic, all and data.


Deflecting in part.
 Bogofilter, http://sourceforge.net/projects/bogofilter/

BMF
http://sourceforge.net/projects/bmf/files/bmf-0.9.4.tar.gz/download?use_mirror=heanet

back to topic at hand.

Try the following to confirm whether your sa-learn updates both or just one.

Use sa-learn on a message that you see as spam that passed through.
Then pass the same message through again to test whether it gets marked as spam.

Your /etc/procmailrc, after dropping privileges: does it run spamassassin tests as well, or do you have a local .procmailrc?

I think the repository is stored in the home directory of the user under whose credentials spamd is running.
sa-learn run as spamd will teach the system-wide database; run as another user, it will teach that user's local one ....
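
That matches the documented default as far as I know: when bayes_path is not set, it falls back to ~/.spamassassin/bayes, with ~ resolved for whichever account runs the command. A rough sketch of what that means for root versus spamd follows. It assumes local accounts listed in /etc/passwd, and default_bayes_prefix is a hypothetical helper, not part of SpamAssassin.

```shell
#!/bin/sh
# Print the default Bayes prefix (home directory + .spamassassin/bayes) for a
# local account. Returns non-zero if the account is not found in /etc/passwd.
default_bayes_prefix() {
    home=$(awk -F: -v u="$1" '$1 == u { print $6 }' /etc/passwd)
    [ -n "$home" ] || return 1
    echo "$home/.spamassassin/bayes"
}

# The two accounts discussed in this thread.
for user in root spamd; do
    if prefix=$(default_bayes_prefix "$user"); then
        echo "$user -> $prefix"
    else
        echo "$user -> no such local account"
    fi
done
```

On the system described in this thread, that would presumably print /root/.spamassassin/bayes for root and /home/spamd/.spamassassin/bayes for spamd, which is exactly the pair of locations found on disk.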
MarkAuthor Commented:
I set 'bayes_path /home/spamd/.spamassassin/bayes' in local.cf and made sure that the bayes* files in that folder were owned by spamd.spamd. Then my training command is:

sa-learn -u spamd --showdots --mbox --spam isSpam caughtSpam

This all does update the spamd bayes files and it does seem to work with trapping new messages.
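
A hedged sketch of how this setup could be double-checked after a training run. As far as I know, -u makes sa-learn act on behalf of that user without switching the process to that account, so when the command is run as root it is worth confirming that the files are still owned by spamd. The helper check_bayes_owner is hypothetical; the path and user come from the post above.

```shell
#!/bin/sh
# Report any existing Bayes file under the given prefix that is not owned by
# the expected user. Missing files are skipped. Returns non-zero on a mismatch.
check_bayes_owner() {
    prefix=$1
    expected=$2
    status=0
    for suffix in toks seen journal; do
        f="${prefix}_${suffix}"
        [ -e "$f" ] || continue
        # Third field of "ls -ld" is the owning user.
        owner=$(ls -ld "$f" | awk '{ print $3 }')
        if [ "$owner" != "$expected" ]; then
            echo "wrong owner on $f: $owner (expected $expected)"
            status=1
        fi
    done
    return $status
}

# Path and user as set via bayes_path in local.cf in the post above.
if check_bayes_owner /home/spamd/.spamassassin/bayes spamd; then
    echo "no ownership problems found"
else
    echo "fix with: chown spamd:spamd /home/spamd/.spamassassin/bayes_*"
fi
```

If spamd cannot write the files, its own updates to the database would likely fail, so this is worth running once after the first training pass.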