Rich Kulawiec on 4 Oct 2016 04:09:48 -0700


Re: [PLUG] Top-posting

On Mon, Sep 26, 2016 at 07:37:21PM -0500, Rich Freeman wrote:
> Honestly, the on-disk storage format of the mail and index data
> doesn't really concern me because email clients shouldn't ever look at
> that stuff.  They should be talking to the mail server and giving it
> queries.

Wellllll...but it does.  Munging original source data to support search
functionality is a bad idea.  If you want to create a modified, enriched
version of original source data: well, okay, but then you just doubled
storage requirements and every time you change the way those modifications
and enrichments are done, you have to re-import all your data.  Which
isn't awful, but it is somewhat of a PITA.

I think architecturally it's much better to keep the original data in
its native form and to do the heavy lifting in the search engine.

> The problem is IMAP really seems inadequate to this task.  

Right.  It is.  That's why you don't use it for search, you only use
it to retrieve the messages which are returned as the result of a
search query.  There are lots of ways to do this: one is to use
the search engine's output ("the list of messages which are tagged
as relevant to the Foo Project") as input to an IMAP server, take
the entire corpus of returned messages and construct a synthetic,
temporary mailbox which contains only those messages, and then hand
that to the mail client.  This has obvious performance issues and
thus needs refinement (e.g., if there are 4000 such matching messages
then the temporary mailbox will be large and take a while to build)
but it's meant to be illustrative, not a final design.
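
To make that a little more concrete, here is a minimal sketch of
the "synthetic, temporary mailbox" step using Python's standard
mailbox module.  It assumes the search engine has already handed
back the raw messages as bytes; the function name
build_temp_mailbox is made up for illustration, and pointing an
IMAP server at the resulting file is left out entirely.

```python
import mailbox
import os
import tempfile

def build_temp_mailbox(raw_messages):
    """Write raw messages (bytes) into a fresh mbox file; return its path.

    Hypothetical helper: raw_messages stands in for whatever the
    search engine returned for the query.
    """
    fd, path = tempfile.mkstemp(suffix=".mbox")
    os.close(fd)
    box = mailbox.mbox(path, create=True)
    try:
        for raw in raw_messages:
            box.add(raw)   # mbox supplies the "From " separator line
        box.flush()
    finally:
        box.close()
    return path
```

The cost is linear in the number of matching messages, which is
exactly the 4000-message problem mentioned above.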

> The storage can be in whatever format works well.  Getting my mail out
> in mbox format is useless if it doesn't still have the tags applied,
> though there is no reason you couldn't export in that format if it
> isn't the native format.  I'm not convinced that a huge sequential
> text file with indexed byte offsets is the best solution for what
> amounts to a key/value storage system.  But, whatever, the storage
> isn't the problem here; the protocol is the problem as far as I can tell.  

You're darn right about that last part: IMAP predates the ubiquity
of search and thus wasn't designed with search in mind.  (But with
deference to the late Mark Crispin, who not only did most of this
work but wrote the terrific UW-IMAP server: it's not IMAP's job
to do search.  It does what it's supposed to do really, really well.)

And in general, there are all kinds of tradeoffs here.

Storing mail in big sequential text files really isn't that bad:
if the search engine indices note the filename and starting byte
offset, access to a message is an open() and an lseek() away.
(If the ending byte is there too, it's one read() call to
have the whole thing.)  If these are stored on something fast,
like an SSD, then performance should be pretty zippy.
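
A rough sketch of what that looks like, in Python since its os
module exposes the same open()/lseek()/read() calls.  The names
index_mbox and fetch_message are invented here, and the indexer
assumes a "From " at the start of a line always begins a new
message (i.e., that body lines starting with "From " were escaped
when the mbox was written).  A real search engine would store the
offsets in its own index rather than rescanning the file.

```python
import os

def index_mbox(path):
    """Return a list of (start, end) byte ranges, one per message."""
    starts = []
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                starts.append(offset)
            offset += len(line)
    ends = starts[1:] + [offset]
    return list(zip(starts, ends))

def fetch_message(path, start, end):
    """Return the raw bytes stored at [start, end) in path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, start, os.SEEK_SET)
        remaining = end - start
        chunks = []
        while remaining > 0:          # read() may return fewer bytes
            chunk = os.read(fd, remaining)
            if not chunk:
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```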

But mail can also be stored MH-style, in individual files, in which
case retrieval means grabbing the entire contents.  This alleviates
the need to store start/end bytes in the search engine indices but
it comes at the cost of way more open() calls, proportional to the
number of messages to be retrieved.  (But this also allows you to
use Unix command-line tools on messages en masse, and that is pretty
darn useful.)

If you anticipate all of the searching and retrieval happening later,
you can do things to stack the deck in your favor.  For instance:
suppose you subscribe to a great many discussion lists, including
one about BIND -- "bind-users".  Use procmail to toss all that incoming
mail into a single file, also named bind-users, and stash it.  (If it gets
too big: bind-users.001, bind-users.002, etc.)  When a future query
arrives and the first 87 out of 100 results happen to hit messages
which once traversed bind-users, many of those messages are going
to be in the same file/files.  This is going to make retrieval a lot
faster (far fewer open() calls, much higher ratio of hits on cache)
than it would be if those messages were scattered over many diverse files.
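
Here is one way the retrieval side could exploit that, as a sketch.
It assumes each search hit arrives as a (path, start, end) tuple,
which is my own invention and not any particular search engine's
output format, and fetch_grouped is likewise a made-up name.  Hits
are grouped by file so each file is opened once, and read in
ascending offset order within it.

```python
from collections import defaultdict

def fetch_grouped(hits):
    """hits: iterable of (path, start, end) tuples.

    Returns {(path, start): raw_bytes}, opening each file only once.
    """
    by_file = defaultdict(list)
    for path, start, end in hits:
        by_file[path].append((start, end))

    results = {}
    for path, ranges in by_file.items():
        with open(path, "rb") as f:
            for start, end in sorted(ranges):   # ascending offsets
                f.seek(start)
                results[(path, start)] = f.read(end - start)
    return results
```

With 87 hits spread over two or three bind-users files, that is
two or three open() calls instead of 87.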

There are all kinds of tradeoffs like this, and which ones pay off
really depends on the mail corpus in play -- its nature and size --
and the search engine, and the metadata you want to use, and...

But two things to keep in mind are (1) sequential text files
aren't as slow as they used to be ;) and (2) throwing hardware
at the problem is often a very cost/time-effective move.  Just moving
200G of mbox files from disk to SSD makes a lot of performance
issues go away.  (Particularly if you make that read-only...which you
can...because then the cache is never invalid.)

It's also not unreasonable to keep all the original/important
message metadata in RAM cache.  My last message to this list showed
up with 2852 bytes of headers.  A first pass at discarding
unimportant headers leaves 1581 bytes.  A second pass gets it
down to 747 bytes, and removing the labels leaves 558 bytes.
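
For anyone who wants to try the same exercise on their own mail,
a sketch using Python's standard email package follows.  The KEEP
set is only my guess at which headers count as "important" (it is
not the list used for the numbers above), and slim_headers and
stored_size are hypothetical names.  "Removing the labels" here
means storing just the values, without the "Subject: " and similar
prefixes.

```python
from email import message_from_bytes

# Assumed set of headers worth keeping; adjust to taste.
KEEP = frozenset({
    "message-id", "date", "from", "to", "cc", "subject",
    "in-reply-to", "references", "list-id",
})

def slim_headers(raw):
    """Return [(name, value), ...] for the headers listed in KEEP."""
    msg = message_from_bytes(raw)
    return [(name, str(value)) for name, value in msg.items()
            if name.lower() in KEEP]

def stored_size(pairs, labels=True):
    """Approximate bytes needed to store the pairs, one per line."""
    total = 0
    for name, value in pairs:
        if labels:
            total += len(name.encode("ascii", "replace")) + 2   # "Name: "
        total += len(value.encode("utf-8", "replace")) + 1      # value + newline
    return total
```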

So if we double that and guesstimate 1K/message, we can easily fit
message metadata for 1M messages in 1G of RAM.  The last 15 years-ish
of the linux-kernel mailing list (which is really REALLY busy)
contains about 2M messages, so: 2G of RAM *before* we get fancy
and collapse duplicate information (e.g., List-Id).  It's thus
not out of the question to stuff 32G or 64G of memory into a box and
do things like "read in all metadata into an in-RAM hash".
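
The arithmetic, plus a toy version of the in-RAM hash, as a sketch.
The 1024 bytes/message figure is the guesstimate from above; the
record layout (plain dicts keyed by lowercase header name) and the
names estimate_gib and build_index are assumptions for
illustration.  Note that Python's per-object overhead means a dict
like this would use noticeably more memory than the raw byte count
suggests; a tighter representation would be needed to actually hit
these numbers.

```python
import sys

BYTES_PER_MESSAGE = 1024   # guesstimate: ~558 bytes doubled, rounded up

def estimate_gib(n_messages, per_message=BYTES_PER_MESSAGE):
    """Raw metadata size in GiB for n_messages."""
    return n_messages * per_message / 2**30

def build_index(records):
    """records: iterable of dicts that each have a 'message-id' key.

    Returns {message_id: record}.  List-Id values are interned so
    identical strings are stored once rather than once per message.
    """
    index = {}
    for rec in records:
        slim = {k: (sys.intern(v) if k == "list-id" else v)
                for k, v in rec.items()}
        index[rec["message-id"]] = slim
    return index
```

estimate_gib(1_000_000) comes out a bit under 1, and
estimate_gib(2_000_000) a bit under 2, matching the figures above.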

This isn't meant to be prescriptive, just descriptive.  The point
(well, one of the points) is that if you think carefully about the
nature of your data and the kind of searches you're going to want
to perform, you can use some pretty simple methods to gain a lot
of performance.  But -- and I think I'm agreeing with you here --
you're probably not going to want to do all of this via IMAP.

Philadelphia Linux Users Group