sean finney on 30 Mar 2004 15:12:03 -0000
hey folks,

we're attempting to get sa's bayesian filtering implemented globally for the 2-3,000 users on our mail system, and were wondering if anyone with similar experience had anything to say about the pros and cons of different ways of setting this up on a modestly large scale like this.

currently, sa places tagged spam into a junk mailbox, and everything else defaults into an inbox.

one idea we've had is to create a third mailbox, "spam", where users could put spam that made it past sa. this way we wouldn't have to worry about dilution of the bayesian dbs by already-caught spam. the trouble with this is that it requires teaching users to do something, and it isn't immediately effective (users have to train on >= 200 messages before the bayesian filter even starts working).

another idea is to have sa-learn process the junk mailbox. the pros i see to this are that many folks have already been trained to put their spam there, and there's already a sizeable corpus from which to learn. the cons are that the majority of this mail was probably already caught by sa, so i could see this diluting the bayesian filter's ability to catch stuff that sa missed on its own[1]. of course, maybe the messages are still similar enough that this would be helpful?

the third idea we had was to administer a global bayesian db ourselves (us == mail admins or maybe the its dept.). the pros to this are that there's less work to get it going, no 4 hour cron jobs every night, and more technically skilled folks would be ensuring the effectiveness of the filter. the cons are of course that individual users would not have the ability to report spam/ham (at least in an automated sense[2]).

has anyone implemented anything like this? other ideas? thoughts would be greatly appreciated.
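to make the second idea a bit more concrete: one way to sidestep the dilution worry might be to train only on the junk-mailbox messages that sa did not tag itself. below is a rough python sketch of that, not something we've deployed. it assumes the junk mailbox is in mbox format and that sa leaves its usual X-Spam-Flag / X-Spam-Status headers on tagged mail (local config can change that). the function names and any paths are made up for illustration, and the sa-learn command is only built, not run; it would need to run as the mailbox owner (or against the right bayes db).

```python
import mailbox
from email.message import Message
from typing import Iterable, List


def already_tagged(msg: Message) -> bool:
    """True if the message carries SpamAssassin's spam markers (assumed headers)."""
    flag = str(msg.get("X-Spam-Flag", "")).strip().upper()
    status = str(msg.get("X-Spam-Status", "")).strip().lower()
    return flag == "YES" or status.startswith("yes")


def missed_spam(messages: Iterable[Message]) -> List[Message]:
    """Keep only the messages that SpamAssassin did not tag on delivery."""
    return [m for m in messages if not already_tagged(m)]


def write_training_mbox(junk_path: str, out_path: str) -> int:
    """Copy untagged messages from junk_path into out_path; return how many."""
    junk = mailbox.mbox(junk_path, create=False)
    out = mailbox.mbox(out_path)
    try:
        kept = missed_spam(junk)
        for msg in kept:
            out.add(msg)
        out.flush()
        return len(kept)
    finally:
        junk.close()
        out.close()


def sa_learn_command(mbox_path: str) -> List[str]:
    """Build (but do not run) an sa-learn call that learns an mbox as spam."""
    return ["sa-learn", "--spam", "--mbox", mbox_path]
```

a nightly job could call write_training_mbox() per user and hand the resulting file to the command from sa_learn_command(). whether the leftover corpus is big enough to reach the 200-message minimum is a separate question.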
thanks,
sean

[1] apparently sa-learn automatically ignores spamassassin markup, which is convenient, but i think it'd still be reinforcing sa to catch what it already knows is spam

[2] we thought about this, but in the end a disgruntled user could exploit this to mark anything from certain higher-ups or internal mailing lists as spam, which wouldn't be all that great with a large database used by everyone.

Attachment:
signature.asc