Jeff Abrahamson on 22 Oct 2004 13:18:02 -0000



Re: [PLUG] Spam programs


On Fri, Oct 22, 2004 at 08:27:13AM -0400, Tobias DiPasquale wrote:
> On Oct 22, 2004, at 7:53 AM, Jeff Abrahamson wrote:
> > I am running bogofilter with a database of 23,272 spams and 43,034
> > non-spam messages.
> 
> I would recommend using a database composed of an order of magnitude 
> less spam and ham (on the order of 2000 apiece). This has proven to be 
> the most accurate in terms of individual precision. Pick the 2000 
> spammiest spams and 2000 hammiest hams and create a database using 
> those and see if your false negative rate doesn't go down.

The problem is that that requires manual selection.

Perhaps I should use grepmail to select all spam and ham from the past
N months, where N is chosen to make the final numbers work out.
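
To make "N is chosen to make the final numbers work out" concrete, here
is a minimal sketch in Python. It is not anything bogofilter or grepmail
provides: the function names, the 30-day month, and the assumption that
the mail sits in mbox files are all mine, for illustration only. It
reads the Date headers and finds the smallest N whose window holds at
least the target number of messages.

```python
import mailbox
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def message_dates(mbox_path):
    """Return the parsed Date header of every message in an mbox file.

    Hypothetical helper: assumes the corpus is stored as mbox.
    """
    dates = []
    for msg in mailbox.mbox(mbox_path):
        try:
            when = parsedate_to_datetime(str(msg["Date"]))
        except (TypeError, ValueError):
            continue  # skip messages with a missing or unparseable Date
        if when.tzinfo is None:
            # Assumption: treat dates without a zone as UTC.
            when = when.replace(tzinfo=timezone.utc)
        dates.append(when)
    return dates


def months_needed(dates, target, now, days_per_month=30):
    """Smallest N such that at least `target` messages are strictly newer
    than `now - N * days_per_month` days.

    Returns None if the corpus holds fewer than `target` messages.
    """
    if target <= 0:
        return 0
    recent_first = sorted(dates, reverse=True)
    if len(recent_first) < target:
        return None
    oldest_kept = recent_first[target - 1]  # the target-th newest message
    age_days = max((now - oldest_kept).days, 0)
    return age_days // days_per_month + 1


if __name__ == "__main__":
    # Synthetic dates standing in for a real spam mbox.
    now = datetime(2004, 10, 22, tzinfo=timezone.utc)
    sample = [now - timedelta(days=d) for d in (1, 10, 40, 70, 100)]
    print(months_needed(sample, target=3, now=now))
```

Running this once for the spam box and once for the ham box with a
target of about 2000 would give two values of N; taking the larger one
keeps both windows at or above the target.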

But I'm curious why this should be so.  It's usually possible to reach
a decision with more confidence if one has more data.  More data adds
nuance to decisions.  Why should Bayesian filters (or Markovian or...)
work worse if there's more data?

-- 
 Jeff

 Jeff Abrahamson  <http://www.purple.com/jeff/>    +1 215/837-2287
 GPG fingerprint: 1A1A BA95 D082 A558 A276  63C6 16BF 8C4C 0D1D AE4B

 A cool book of games, highly worth checking out:
 http://www.amazon.com/exec/obidos/ASIN/1931686963/purple-20
