gabriel rosenkoetter on 30 Mar 2004 23:06:02 -0000



Re: [PLUG] a question for the spamassassin gurus


On Tue, Mar 30, 2004 at 02:48:28PM -0500, sean finney wrote:
> cool, i'll take a look at that.

Using a combination of SA and SP was the going Best Practice in the
general OSS stop-spamming-me-dammit community, last I checked. (It's
also what Darxus does; you'd have to email him privately these days,
since I doubt he's reading PLUG carefully any more. I started doing
it because he pointed it out.)

> this is not what we're currently doing, it's one of three options we're
> considering.  reasons for considering per-user dbs are outlined below.

Right right, I hadn't read everything when I started replying. :^>

> i think you missed my point.  by using a corpus of mail already tagged
> as spam, you're basically reinforcing sa to catch mail it already knew
> was spam, instead of training it to catch what it missed.  thus, when
> you send a piece of spam missed by sa, its weight is somewhat diluted
> due to the other messages.

I don't think either of us knows enough about the math going on
inside sa-learn and its Bayesian-like analysis (none of these
filters do proper Bayesian analysis, if I recall correctly, but
they're Close Enough) to say whether or not that would dilute
anything. (I can think of ways of implementing this where it
wouldn't.)

The real reason you don't want to feed it spam that the regular
rules caught is that you simply don't need to, and, considering how
lumbering and slow SA's Bayesian gear is, you want to keep its DB as
slim as possible and run sa-learn on as few messages as possible. So
only feed it stuff that the latest SA rules missed.
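
Concretely, that can be as simple as collecting the misses in an
mbox folder and feeding that folder in; the paths here are just
placeholders for wherever you actually stash things:

    sa-learn --spam --mbox /path/to/missed-spam      # spam the rules didn't catch
    sa-learn --ham  --mbox /path/to/false-positives  # and the occasional reverse mistake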

(The case is not the same in SP. The standard approach there is to
just feed it every single piece of mail that comes in, then prune
out tokens that appear fewer than 2 times and haven't been seen for
some set period of time. It seems to do all of that way faster than
SA.)
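
From memory, the SP moving parts look roughly like this (the
filenames are just stand-ins, and check `man spamprobe` before
trusting me on the exact arguments and cleanup defaults):

    spamprobe train   < message.eml   # score a message, update the token counts as needed
    spamprobe spam    < missed.eml    # correct a miss: re-train it as spam
    spamprobe good    < fp.eml        # correct a false positive
    spamprobe cleanup                 # prune rare tokens that haven't been seen recently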

> our junk boxes are deleted weekly, i imagine that if we were going to
> learn from them it'd be just before that, so there'd be as little
> wasted duplicate effort as possible.

Fair enough. I still say that teaching SA's Bayesian filter about
spam that SA already marked as spam is mostly pointless.

> this is true, and also why i think the first idea (creating a second
> mailbox for missed spam) is doomed to fail.

Go ahead and create a second mailbox and tell people that, if they
feel like helping out, they're welcome to toss their spam in there.
Then go pick up the messages periodically (once a week?) and skim
over them to make sure they're not legit mail (or just trust your
whitelists ;^>). We've had such a shared mailbox on our Exchange
server for over a year now. I've got quite a bit of ${EMPLOYER}-
specific spam to feed to SP now. (I just have to bug the Windows
folks to *get* it...)
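
If you go that route, the periodic pick-up can be a weekly cron job
that runs the same sa-learn invocation as above over the shared
folder, after somebody has skimmed it; the mailbox path is obviously
just a stand-in for wherever yours lives:

    # Sunday, 3am: teach the Bayes DB from the hand-skimmed shared spam folder
    0 3 * * 0   sa-learn --spam --mbox /var/mail/shared-spam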

> i'm sure as hell not volunteering to have a clue if it means spending
> another half hour of my day reading other people's spam :) 

It's pretty easy to scan headers visually, though. I mean, if you're
doing spam filtering on your own mail, you're already used to this
(making sure you didn't get any false positives in the last
week/month/whatever).

Or whatever, decide they're all a bunch of punks. They mostly are.
:^>

> i completely agree.  i think most folks would lose interest in having
> to manually do this long before they started seeing results.

A "mark as spam!" button in Swat's webmail interface couldn't hurt (much).

Especially if it was fed through something (like, oh, Los ;^>) for
approval. No, wait, that's a really terrible idea. Never mind.

> if we were doing this per-user, we'd probably just use their inboxes
> (note that this is separate from their mail spool, so they would still
> have a chance to delete/move missed spam before it's scanned).  this is
> horribly ugly because some folks have ~40 MB mailboxes which would have
> to be scanned each time.

Well, keep in mind that any sane learner (both SA's and SP's do
this) will hash the messages (SP uses md5; dunno about SA) and won't
reprocess individual messages it's already seen. But that'd still be
a lot to process.

> r-s is exactly why i don't want to do that.  the problem is that
> managing this whitelist could quickly turn into a pita, unless you
> can whitelist a whole domain.

Whitelist entries are full-fledged globs in SA (see the sample
user_prefs file you've ALREADY GOT THERE ;^>), so whitelisting a
whole domain is easy, and you can also whitelist based on the relay
in the Received: headers (search for whitelist_from_rcvd in `perldoc
Mail::SpamAssassin::Conf`) rather than on the envelope sender alone.
(Though I think r-s rewrites the envelope sender without rewriting
the From: header, so that mail still appears to be from the
originator to most MUAs but is very obviously from the list, right?
If so, you don't have to do anything fancy; whitelists only care
about the envelope sender, last I checked.)
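
For the whole-domain case, the user_prefs lines look like this
(example.com standing in for whatever domain you actually want to
trust; whitelist_from_rcvd's second argument is matched against the
relay's reverse DNS):

    whitelist_from       *@example.com
    whitelist_from_rcvd  *@example.com  example.com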

> even then, i wouldn't be comfortable letting one random user
> arbitrarily affect whether another user's mail were categorized as
> spam.

Huh? I don't understand how that would happen.

> anyway, i think we'll end up going with the global ITS-administered
> learning, and i'll look into spamprobe too.

I haven't played with SP for as long as I have with SA, and I
wouldn't use it alone (whereas you can get away with using SA
alone), but I'm liking it so far.

-- 
gabriel rosenkoetter
gr@eclipsed.net
