sean finney on 30 Mar 2004 15:12:03 -0000



[PLUG] a question for the spamassassin gurus


hey folks,

we're attempting to get sa's bayesian filtering implemented globally
for the 2,000-3,000 users on our mail system, and were wondering if anyone
with similar experiences had anything to say about the pros and cons
of different ways of setting this up on a modestly large scale like this.

currently, sa places tagged spam into a junk mailbox, and everything
else defaults into an inbox.  one idea we've had is to create a third
mailbox "spam", where users could put spam that made it past sa.  this
way we wouldn't have to worry about dilution of the bayesian dbs by
already caught spam.  the trouble with this is that it requires teaching
users to do something, and isn't immediately effective (sa has to
learn from >= 200 spam and >= 200 ham messages before the bayesian
filter even starts working).
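
for what it's worth, here's a rough python sketch of how one might check
whether a given bayes db has crossed that 200-message threshold.  it
assumes the text comes from "sa-learn --dump magic" and that the counts
show up on lines ending in "non-token data: nspam" / "nham" with the value
in the third column; the function names and the sample text are made up
for illustration, so check the layout against your own sa version.

```python
import re

# assumed threshold, taken from the ">= 200 messages" figure above
MIN_MESSAGES = 200

# assumed line layout: <prob> <spam> <value> <atime> non-token data: <name>
MAGIC_LINE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)


def parse_magic(dump_text):
    """Pull the nspam/nham counters out of dump text; ignore other lines."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_ready(counts, minimum=MIN_MESSAGES):
    """True only if both the spam and the ham counts reach the minimum."""
    return (counts.get("nspam", 0) >= minimum
            and counts.get("nham", 0) >= minimum)


# hypothetical sample text, not real output from any system
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        250          0  non-token data: nspam
0.000          0        120          0  non-token data: nham
"""

if __name__ == "__main__":
    counts = parse_magic(SAMPLE)
    print(counts, "ready" if bayes_ready(counts) else "not ready yet")
```

run per user, something like this would at least show which accounts have
a db big enough for the filter to kick in.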

another idea is to have sa-learn process the junk mailbox.  the pros
i see to this are that many folks have already been trained to put
their spam there, and there's already a sizeable corpus from which to
learn.  the cons are that the majority of this mail was probably already
caught by sa, so i could see this diluting the effectiveness of the
bayesian filter to catch stuff that sa missed on its own[1].  of course,
maybe the messages are still similar enough that this would be helpful?
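
to make the second idea concrete, a minimal python sketch of the nightly
job might just build one sa-learn invocation per user's junk mailbox.
it assumes the mailboxes are mbox files (hence --mbox) and that they live
at /home/<user>/mail/junk, which is a made-up layout; the helper names are
hypothetical too.  it only builds the command lines and doesn't run them.

```python
from pathlib import PurePosixPath


def build_learn_command(mailbox, is_spam):
    """Return the sa-learn argv for one mbox file (nothing is executed)."""
    kind = "--spam" if is_spam else "--ham"
    return ["sa-learn", kind, "--mbox", str(mailbox)]


def junk_commands(home_root, users, junk_name="mail/junk"):
    """One spam-learning command per user; the path layout is an assumption."""
    return [
        build_learn_command(PurePosixPath(home_root) / user / junk_name, True)
        for user in users
    ]


if __name__ == "__main__":
    # dry run: print what the cron job would execute
    for cmd in junk_commands("/home", ["alice", "bob"]):
        print(" ".join(cmd))
```

printing first makes it easy to eyeball the list before handing each one
to something like subprocess.run from cron.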

the third idea we had was to administer a global bayesian db ourselves
(us == mail admins or maybe the its dept.).  the pros to this are that
there's less work to get that going, no 4-hour cron jobs every night, and more
technically skilled folks are ensuring the effectiveness of the filter.
the cons are of course that the individual user would not have the
ability to report spam/ham (at least in an automated sense[2]).


has anyone implemented anything like this?  other ideas?  thoughts
would be greatly appreciated.


thanks,
	sean

[1] apparently sa-learn automatically ignores spamassassin markup, which
    is convenient, but i think it'd still be reinforcing sa to catch
    what it already knows is spam
[2] we thought about this, but in the end a disgruntled user could
    exploit this to mark anything from certain higher-ups or internal
    mailing lists as spam, which wouldn't be all that great with a large
    database used by everyone.
