gabriel rosenkoetter on 30 Mar 2004 18:12:03 -0000


Re: [PLUG] a question for the spamasassin gurus


I'm actually setting up an SMTP relay (Postfix) to run spam and
virus filtering here at work RIGHT NOW. :^>

On Tue, Mar 30, 2004 at 10:11:18AM -0500, sean finney wrote:
> we're attempting to get sa's bayesian filtering implemented globally
> for the 2-3,000 users on our mail system, and was wondering if anyone
> with similar experiences had anything to say about the pros and cons
> of different ways of setting this up on a modestly large scale like this.

SA's Bayesian component mostly sucks. You probably actually want to
be using SpamProbe. See http://spamprobe.sourceforge.net/.

SP is a bit better suited to site-wide implementation (it doesn't
feel like a balancing act to maintain a system-wide DB) and it does
a *way* better job at Bayesian[-like] spam filtering, as it uses
multi-word tokens. It's also significantly faster than SA's Bayesian
processing.

You could choose to only completely filter messages that both SA and
SP think are spam, merely tagging the others (ideally in the
subject line, so that users who don't want to know what a mail
header is can see it). That's probably what I'll do here at work.
You could also just toss messages that either one thinks is spam
into the junk mbox (which is what I do at home).
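
A minimal sketch of those two policies, assuming you already have a
yes/no verdict from each filter. The names here (Action, decide,
tag_subject, the "[SPAM?]" marker) are made up for illustration and
are not part of SA or SP:

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"  # goes to the inbox untouched
    TAG = "tag"          # goes to the inbox with a marked subject
    JUNK = "junk"        # goes to the junk mbox


def decide(sa_is_spam: bool, sp_is_spam: bool, strict: bool = True) -> Action:
    """Combine two independent spam verdicts into one action.

    strict=True  : junk only when both filters agree, tag when one does.
    strict=False : junk when either filter says spam.
    """
    if sa_is_spam and sp_is_spam:
        return Action.JUNK
    if sa_is_spam or sp_is_spam:
        return Action.TAG if strict else Action.JUNK
    return Action.DELIVER


def tag_subject(subject: str, marker: str = "[SPAM?]") -> str:
    """Prefix the subject with a marker, without stacking duplicates."""
    if subject.startswith(marker):
        return subject
    return f"{marker} {subject}"
```

strict=True is the "work" policy, strict=False is the "home" one.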

> currently, sa places tagged spam into a junk mailbox, and everything
> else defaults into an inbox.  one idea we've had is to create a third
> mailbox "spam", where users could put spam that made it past sa.  this
> way we wouldn't have to worry about dilution of the bayesian db's by
> already caught spam.  the trouble with this is that it requires teaching
> users to do something, and isn't immediately effective (users have to
> learn from >= 200 messages before the bayesian filter even starts
> working).

Why are you making each user have their own DB?

Keep a central one, and feed it off the spam that the IT staff gets.
That'll be close enough to what the rest of your users will see.

Remember that you MUST also feed ham to Bayes DBs regularly (in
vaguely equal proportion to the amount of spam you feed them).
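
If you want a sanity check before a training run, something like
the following would do. It's only a sketch: the 2:1 threshold is my
own arbitrary stand-in for "vaguely equal", and the function names
are hypothetical. It uses Python's standard mailbox module to count
messages in an mbox.

```python
import mailbox


def count_messages(mbox_path: str) -> int:
    """Return the number of messages in an existing mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def is_balanced(n_spam: int, n_ham: int, max_ratio: float = 2.0) -> bool:
    """True if neither corpus outweighs the other by more than max_ratio.

    An empty corpus on either side counts as unbalanced, since the
    filter would be learning only one class.
    """
    if n_spam == 0 or n_ham == 0:
        return False
    ratio = max(n_spam, n_ham) / min(n_spam, n_ham)
    return ratio <= max_ratio
```

You'd call count_messages() on the spam and ham mboxes and skip (or
top up) the run when is_balanced() says no.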

> learn.  the cons are that the majority of this mail was probably already
> caught by sa, so i could see this diluting the effectiveness of the
> bayesian filter to catch stuff that sa missed on its own[1].

sa-learn knows when you give it the same message again and refuses
to process it. (Try it; feed it an mbox, then feed it the same mbox
again. Then put four more messages in the mbox, and feed it to
sa-learn again. See what I mean?)
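
To illustrate the idea (not sa-learn's actual code, which I'm not
reproducing here): a learner can remember a digest of every message
it has seen and skip anything whose digest it already knows. In this
sketch the choice of SHA-1 over the raw message bytes and the names
message_digest and new_messages are assumptions of mine.

```python
import hashlib
from typing import Iterable, Iterator, Set


def message_digest(raw: bytes) -> str:
    """One-way digest of a raw message, used as its identity."""
    return hashlib.sha1(raw).hexdigest()


def new_messages(raw_messages: Iterable[bytes], seen: Set[str]) -> Iterator[bytes]:
    """Yield only messages not seen before, recording each new digest.

    `seen` is updated in place, so passing the same set on a later run
    makes repeats of earlier messages cost a hash but nothing else.
    """
    for raw in raw_messages:
        digest = message_digest(raw)
        if digest in seen:
            continue
        seen.add(digest)
        yield raw
```

Feeding the same list twice yields nothing the second time; adding
four messages and feeding it again yields only those four.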

So, processing the junk mailboxes will only waste time (it calculates
a cryptographic one-way function of the message to do this), not
dilute the DB. If it's *almost* but not quite the same message,
you do want it reprocessed, because that means

The reason this approach won't work is that it's precisely the spam
that was NOT caught that you want to feed to the Bayes db. If it
already got sent to junk, you don't care any more. (Unless you're
suggesting that users *already* manually move things into junk? They
probably don't. They probably just delete it.)

> the third idea we had was to administer a global bayesian db ourselves
> (us == mail admins or maybe the its dept.).  the pros to this is there's
> less work to get that going, no 4 hour cron jobs every night, and more
> technically skilled folks are ensuring the effectiveness of the filter.
> the cons are of course that the individual user would not have the
> ability to report spam/ham (at least in an automated sense[2]).

There's no reason you couldn't *let* them report it in the way you
outlined above. You just don't force them to, and you have someone
with a clue glance over the messages in the shared missed-spam
mbox before feeding it to your Bayesian learning tool.

The bonus of using a shared DB is that you can populate it right
now, so the Bayesian filtering will be functional for all users
immediately. (SA's Bayesian stuff may suck... but it's still better
than just SA with no Bayesian component; trust me, I know.) If they
have to populate their own, they'll need to dump a bunch of messages
in before they see any benefit. There's not much sense in that.

Also, you've totally neglected giving users a way to populate *ham*
into the Bayesian databases (which means the DBs will all be
weighted incorrectly, and probably generate a lot of false
positives). This is absolutely necessary if they're keeping their
own DBs, but it's something you can do behind the scenes if they're
using a shared one.

> [2] we thought about this, but in the end a disgruntled user could
>     exploit this to mark anything from certain higher-ups or internal
>     mailing lists as spam, which wouldn't be all that great with a large
>     database used by everyone.

Rubbish. That's what whitelists are for. SA has a pretty good set of
known public mailing lists, and it's easy to copy what it's doing
for your internal mailing lists (which, incidentally, you'd need
to do even if you were setting up a Bayes DB for each user, since
the first thing that most users on your system would dump in that
spam mailbox is a certain mailing list with the initials R.S.).
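
The same idea can be applied one step earlier, when screening what
users have reported before it gets learned as spam. This is a sketch
of that screening, not of how SA's own whitelisting works, and the
addresses, patterns and function names are all hypothetical:

```python
from email.utils import parseaddr
from fnmatch import fnmatch

# Hypothetical patterns: an internal list domain and one higher-up.
WHITELIST = ["*@lists.example.org", "ceo@example.org"]


def is_whitelisted(from_header: str, patterns=WHITELIST) -> bool:
    """True if the address in a From: header matches any glob pattern."""
    address = parseaddr(from_header)[1].lower()
    return any(fnmatch(address, pattern.lower()) for pattern in patterns)


def screen_reports(from_headers, patterns=WHITELIST):
    """Keep only reported messages whose sender is NOT whitelisted."""
    return [h for h in from_headers if not is_whitelisted(h, patterns)]
```

Anything from a whitelisted sender gets dropped from the report pile
instead of being learned. Note that From: headers are trivially
forged, so this is a convenience, not a security boundary.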

-- 
gabriel rosenkoetter
gr@eclipsed.net
