gabriel rosenkoetter on 30 Mar 2004 18:12:03 -0000
I'm actually setting up an SMTP relay (Postfix) to run spam and virus
filtering here at work RIGHT NOW. :^>

On Tue, Mar 30, 2004 at 10:11:18AM -0500, sean finney wrote:
> we're attempting to get sa's bayesian filtering implemented globally
> for the 2-3,000 users on our mail system, and was wondering if anyone
> with similar experiences had anything to say about the pros and cons
> of different ways of setting this up on a modestly large scale like this.

SA's Bayesian component mostly sucks. You probably actually want to be
using SpamProbe. See http://spamprobe.sourceforge.net/.

SP is a bit better suited to site-wide implementation (it doesn't feel
like a balancing act to maintain a system-wide DB) and it does a *way*
better job at Bayesian[-like] spam filtering, as it uses multi-word
tokens. It's also significantly faster than SA's Bayesian processing.

You could choose to only completely filter messages that both SA and
SP think are spam, merely tagging the others (ideally in the subject
line, so that users who don't want to know what a mail header is can
see it). That's probably what I'll do here at work. You could also
just toss messages that either one thinks is spam into the junk mbox
(which is what I do at home).

> currently, sa places tagged spam into a junk mailbox, and everything
> else defaults into an inbox. one idea we've had is to create a third
> mailbox "spam", where users could put spam that made it past sa. this
> way we wouldn't have to worry about dilution of the bayesian db's by
> already caught spam. the trouble with this is that it requires teaching
> users to do something, and isn't immediately effective (users have to
> learn from >= 200 messages before the bayesian filter even starts
> working).

Why are you making each user have their own DB? Keep a central one,
and feed it off the spam that the IT staff gets. That'll be close
enough to what the rest of your users will see.
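
For what it's worth, the two SA-plus-SP policies mentioned above (junk
only when both agree, versus junk when either fires) can be sketched in
a few lines of Python. This is only an illustration under my own
assumptions: the function name, the policy names, and the three
outcomes are hypothetical, and nothing here is an interface that
SpamAssassin or SpamProbe actually provides.

```python
# Hypothetical sketch: neither SpamAssassin nor SpamProbe provides this
# function; the names and the three outcomes are made up for illustration.

def disposition(sa_says_spam: bool, sp_says_spam: bool, policy: str = "both") -> str:
    """Return "junk", "tag", or "inbox" for one message.

    policy="both":   junk only when both filters agree; tag when just one fires.
    policy="either": junk when either filter fires; nothing is merely tagged.
    """
    if policy not in ("both", "either"):
        raise ValueError(f"unknown policy: {policy!r}")
    # Count how many of the two filters flagged the message.
    hits = int(sa_says_spam) + int(sp_says_spam)
    if hits == 0:
        return "inbox"
    if hits == 2 or policy == "either":
        return "junk"
    # Exactly one filter fired under the "both" policy.
    return "tag"
```

Under the "both" policy, a message that only one filter flags comes
back as "tag", which is the point where you'd mark the subject line
instead of filing the message away.
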
Remember that you MUST also feed ham to Bayes DBs regularly (in
vaguely equal proportion to the amount of spam you feed it).

> learn. the cons are that the majority of this mail was probably already
> caught by sa, so i could see this diluting the effectiveness of the
> bayesian filter to catch stuff that sa missed on its own[1].

sa-learn knows when you give it the same message again and refuses to
process it. (Try it; feed it an mbox, then feed it the same mbox
again. Then put four more messages in the mbox, and feed it to
sa-learn again. See what I mean?) So, processing the junk mailboxes
will only waste time (it calculates a cryptographic one-way function
of the message to do this), not dilute the DB. If it's *almost* but
not quite the same message, you do want it reprocessed, because that
means

The reason this approach won't work is that it's precisely the spam
that was NOT caught that you want to feed to the Bayes db. If it
already got sent to junk, you don't care any more. (Unless you're
suggesting that users *already* manually move things into junk? They
probably don't. They probably just delete it.)

> the third idea we had was to administer a global bayesian db ourselves
> (us == mail admins or maybe the its dept.). the pros to this is there's
> less work to get that going, no 4 hour cron jobs every night, and more
> technically skilled folks are ensuring the effectiveness of the filter.
> the cons are of course that the individual user would not have the
> ability to report spam/ham (at least in an automated sense[2]).

There's no reason you couldn't *let* them report it in the way you
outlined above; you just don't force them to, and someone with a clue
takes a glance over the messages in the shared missed-spam mbox before
feeding it to your Bayesian learning tool.

The bonus of using a shared DB is that you can populate it right now,
so the Bayesian filtering will be functional for all users
immediately. (SA's Bayesian stuff may suck...
but it's still better than just SA with no Bayesian component; trust
me, I know.) If they have to populate their own, they'll need to dump
a bunch of messages in before they see any benefit. There's not much
sense in that.

Also, you've totally neglected giving users a way to populate *ham*
into the Bayesian databases (which means the DBs will all be weighted
incorrectly, and probably generate a lot of false positives). This is
absolutely necessary if they're keeping their own DBs, but it's
something you can do behind the scenes if they're using a shared one.

> [2] we thought about this, but in the end a disgruntled user could
> exploit this to mark anything from certain higher-ups or internal
> mailing lists as spam, which wouldn't be all that great with a large
> database used by everyone.

Rubbish. That's what whitelists are for. SA has a pretty good set of
known public mailing lists, and it's easy to copy what it's doing for
your internal mailing lists (which, incidentally, you'd need to do
even if you were setting up a Bayes DB for each user, since the first
thing that most users on your system would dump in that spam mailbox
is a certain mailing list with the initials R.S.).

--
gabriel rosenkoetter
gr@eclipsed.net

Attachment:
pgp8kwJQhDRnv.pgp