sean finney on 30 Mar 2004 19:49:07 -0000



Re: [PLUG] a question for the spamasassin gurus


On Tue, Mar 30, 2004 at 01:11:47PM -0500, gabriel rosenkoetter wrote:
> SA's Bayesian component mostly sucks. You probably actually want to
> be using SpamProbe. See http://spamprobe.sourceforge.net/.

cool, i'll take a look at that.

> > else defaults into an inbox.  one idea we've had is to create a third
> > mailbox "spam", where users could put spam that made it past sa.  this
> > way we wouldn't have to worry about dilution of the bayesian db's by
> > already caught spam.  the trouble with this is that it requires teaching
> > users to do something, and isn't immediately effective (users have to
> > learn from >= 200 messages before the bayesian filter even starts
> > working).
> 
> Why are you making each user have their own DB?

this is not what we're currently doing, it's one of three options we're
considering.  reasons for considering per-user dbs are outlined below.

> Keep a central one, and feed it off the spam that the IT staff gets.
> That'll be close enough to what the rest of your users will see.

i think this is probably the best idea and what we'll end up doing.

> Remember that you MUST also feed ham to Bayes DBs regularly (in
> vaguely equal proportion to the amount of spam you feed it).

yup.

> > learn.  the cons are that the majority of this mail was probably already
> > caught by sa, so i could see this diluting the effectiveness of the
> > bayesian filter to catch stuff that sa missed on its own[1].
> 
> sa-learn knows when you give it the same message again and refuses
> to process it. (Try it; feed it an mbox, then feed it the same mbox
> again. Then put four more messages in the mbox, and feed it to
> sa-learn again. See what I mean?)

i think you missed my point.  by using a corpus of mail already tagged
as spam, you're basically reinforcing sa to catch mail it already knew
was spam, instead of training it to catch what it missed.  thus, when
you send a piece of spam missed by sa, its weight is somewhat diluted
due to the other messages.  in reality this might not be all that bad
because i'd be willing to wager that there's enough of a similarity
between the two that even with caught spam you'd still have a net
positive change on sa's effectiveness (just somewhat less than with an
equal quantity of real spam).
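a toy numeric sketch of the dilution point (not sa's actual algorithm, though its bayes code follows gary robinson's combining scheme): an estimator with a neutral prior pulls a token's spam-probability toward 0.5 when it's been seen only a few times, so the distinctive tokens of missed spam stay much weaker than the heavily-reinforced tokens of already-caught spam.

```python
# toy illustration, assuming a robinson-style prior-adjusted estimate;
# the strength/prior values here are made up for demonstration.
def adjusted_prob(spam_count, ham_count, strength=1.0, prior=0.5):
    n = spam_count + ham_count
    raw = spam_count / n if n else prior
    # few observations -> result stays near the prior; many -> near raw
    return (strength * prior + n * raw) / (strength + n)

# token from spam sa already catches, fed in 200 times:
common = adjusted_prob(200, 0)   # ~0.9975, very spammy
# token unique to the spam sa *missed*, fed in only twice:
rare = adjusted_prob(2, 0)       # ~0.833, much less decisive
assert common > rare
```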

> So, processing the junk mailboxes will only waste time (it calculates
> a cryptographic one-way function of the message to do this), not
> dilute the DB. If it's *almost* but not quite the same message,
> you do want it reprocessed, because that means

our junk boxes are deleted weekly; i imagine that if we were going to
learn from them it'd be just before that, so there'd be as little
wasted duplicate effort as possible.
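the hash-based dedup gabriel describes boils down to something like this (a rough sketch with hypothetical names; sa-learn does this internally with its own message digests):

```python
import hashlib

# digests of messages we've already fed to the learner
seen = set()

def learn_once(message_bytes, learn):
    """feed message_bytes to learn() only if we haven't seen it before."""
    digest = hashlib.sha1(message_bytes).hexdigest()
    if digest in seen:
        return False          # duplicate: only the hash was wasted effort
    seen.add(digest)
    learn(message_bytes)
    return True
```

so re-feeding last week's junk mbox costs one hash per message and nothing more.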

> The reason this approach won't work is that it's precisely the spam
> that was NOT caught that you want to feed to the Bayes db. If it
> already got sent to junk, you don't care any more. (Unless you're
> suggesting that users *already* manually move things into junk? They
> probably don't. They probably just delete it.)

this is true, and also why i think the first idea (creating a second
mailbox for missed spam) is doomed to fail.

> There's no reason you couldn't *let* them report it in the way you
> outlined above, you just don't force them to, and someone with a
> clue takes a glance over the messages in the shared missed-spam
> mbox before feeding it to your Bayesian learning tool.

i'm sure as hell not volunteering to have a clue if it means spending
another half hour of my day reading other people's spam :) 

> The bonus of using a shared DB is that you can populate it right
> now, so the Bayesian filtering will be functional for all users
> immediately. (SA's Bayesian stuff may suck... but it's still better
> than just SA with no Bayesian component, trust me I know.) If they
> have to populate their own, they'll need to dump a bunch of messages
> in before they see any benefit. There's not much sense in that.

i completely agree.  i think most folks would lose interest in having
to manually do this long before they started seeing results.

> Also, you've totally neglected giving users a way to populate *ham*
> into the Bayesian databases (which means the DBs will all be
> weighted incorrectly, and probably generate a lot of false
> positives). This is absolutely necessary if they're keeping their
> own DBs, but it's something you can do behind the scenes if they're
> using a shared one.

if we were doing this per-user, we'd probably just use their inboxes
(note that this is separate from their mail spool, so they would still
have a chance to delete/move missed spam before it's scanned).  this is
horribly ugly because some folks have ~40 MB mailboxes which would have
to be scanned each time.  yet another reason global spam prefs sound
like a better idea.


> > [2] we thought about this, but in the end a disgruntled user could
> >     exploit this to mark anything from certain higher-ups or internal
> >     mailing lists as spam, which wouldn't be all that great with a large
> >     database used by everyone.
> 
> Rubbish. That's what whitelists are for. SA has a pretty good set of
> known public mailing lists, and it's easy to copy what it's doing
> for your internal mailing lists (which, incidentally, you'd need
> to do even if you were setting up a Bayes DB for each user, since
> the first thing that most users on your system would dump in that
> spam mailbox is a certain mailing list with the initials R.S.).

r-s is exactly why i don't want to do that.  the problem is that
managing this whitelist could quickly turn into a pita, unless you
can whitelist a whole domain.  even then, i wouldn't be comfortable
letting one random user arbitrarily affect whether another user's mail
were categorized as spam.
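for the record, sa's whitelist_from does take file-glob patterns, so whitelisting a whole domain in local.cf is at least possible (addresses below are made up):

```
# hypothetical local.cf entries; whitelist_from accepts globs
whitelist_from *@lists.example.com
whitelist_from announce@example.com
```

it doesn't solve the one-user-poisons-the-shared-db problem, but it keeps the whitelist from growing one entry per sender.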

anyway, i think we'll end up going with the global IT-administered
learning, and i'll look into spamprobe too.


thanks,
	sean

Attachment: signature.asc
Description: Digital signature