Jeff Abrahamson on 22 Oct 2004 13:18:02 -0000
On Fri, Oct 22, 2004 at 08:27:13AM -0400, Tobias DiPasquale wrote:
> On Oct 22, 2004, at 7:53 AM, Jeff Abrahamson wrote:
> > I am running bogofilter with a database of 23,272 spams and 43,034
> > non-spam messages.
>
> I would recommend using a database composed of an order of magnitude
> less spam and ham (on the order of 2000 apiece). This has proven to
> be the most accurate in terms of individual precision. Pick the 2000
> spammiest spams and the 2000 hammiest hams, create a database using
> those, and see if your false negative rate doesn't go down.

The problem is that this requires manual selection. Perhaps I should
use grepmail to select all spam and ham from the past N months, where
N is chosen to make the final numbers work out.

But I'm curious why this should be so. It's usually possible to reach
a decision with more confidence if one has more data. More data adds
nuance to decisions. Why should Bayesian filters (or Markovian or...)
work worse if there's more data?

-- 
Jeff

Jeff Abrahamson <http://www.purple.com/jeff/>          +1 215/837-2287
GPG fingerprint: 1A1A BA95 D082 A558 A276 63C6 16BF 8C4C 0D1D AE4B

A cool book of games, highly worth checking out:
http://www.amazon.com/exec/obidos/ASIN/1931686963/purple-20
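Picking the extremes needn't be manual. Below is a minimal sketch of
automating Tobias's selection, assuming bogofilter's terse mode (-t)
prints a classification letter followed by a spamicity score for a
message read on stdin, and assuming one-message-per-file spam/ and
ham/ directories; both the layout and the output parsing are
assumptions to check against your bogofilter version, not something
from this thread.

    import os
    import subprocess

    def spamicity(path):
        # Terse output is assumed to look like "S 0.995000"; the
        # second field is the score. Verify against your version.
        with open(path, 'rb') as f:
            out = subprocess.run(['bogofilter', '-t'], stdin=f,
                                 capture_output=True, text=True)
        return float(out.stdout.split()[1])

    def extremes(directory, n, spammy):
        # Highest scores first if spammy, lowest first otherwise.
        paths = [os.path.join(directory, f)
                 for f in os.listdir(directory)]
        return sorted(paths, key=spamicity, reverse=spammy)[:n]

    # Point BOGOFILTER_DIR at an empty directory before this step so
    # the retraining builds a fresh database rather than adding to
    # the old one.
    for flag, picks in [('-s', extremes('spam', 2000, True)),
                        ('-n', extremes('ham', 2000, False))]:
        for path in picks:
            with open(path, 'rb') as f:
                subprocess.run(['bogofilter', flag], stdin=f)

One bogofilter process per message is slow over 66,000 messages, but
it only has to run once to build the experimental database.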
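For the date cut, something like grepmail -d 'since 6 months ago'
spam.mbox avoids writing any code (grepmail's -d takes a Date::Manip
style restriction, if memory serves). The same selection in Python's
mailbox module, with hypothetical mbox names and months approximated
as 30 days:

    import mailbox
    import time
    from email.utils import parsedate_tz, mktime_tz

    N_MONTHS = 6  # tune N so the final counts land near 2000 apiece
    cutoff = time.time() - N_MONTHS * 30 * 24 * 3600

    src = mailbox.mbox('spam.mbox')
    dst = mailbox.mbox('spam-recent.mbox')
    for msg in src:
        parsed = parsedate_tz(msg['Date'] or '')
        if parsed and mktime_tz(parsed) >= cutoff:
            dst.add(msg)
    dst.flush()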
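As for the "why", it helps to see what the score actually is. The
sketch below is Graham-style combining from "A Plan for Spam", not
necessarily what bogofilter does by default (it ships Robinson's
method), but the shape of the math is similar: a token at exactly 0.5
leaves the score unchanged, so the score is only as decisive as the
most extreme per-token estimates, and those estimates come straight
from the training counts.

    from math import prod

    def combine(probs):
        # P(spam) = prod(p) / (prod(p) + prod(1 - p))
        p = prod(probs)
        return p / (p + prod(1 - x for x in probs))

    print(combine([0.99, 0.95, 0.90]))  # ~0.9999: extreme tokens decide it
    print(combine([0.60, 0.55, 0.52]))  # ~0.665: middling tokens are indecisive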