Rich Kulawiec on 4 Oct 2016 04:09:48 -0700
Re: [PLUG] Top-posting

On Mon, Sep 26, 2016 at 07:37:21PM -0500, Rich Freeman wrote:
> Honestly, the on-disk storage format of the mail and index data
> doesn't really concern me because email clients shouldn't ever look at
> that stuff. They should be talking to the mail server and giving it
> queries.

Wellllll...but it does. Munging original source data to support search
functionality is a bad idea. If you want to create a modified, enriched
version of original source data: well, okay, but then you just doubled
storage requirements and every time you change the way those
modifications and enrichments are done, you have to re-import all your
data. Which isn't awful, but it is somewhat of a PITA. I think
architecturally it's much better to keep the original data in its native
form and to do the heavy lifting in the search engine.

> The problem is IMAP really seems inadequate to this task.

Right. It is. That's why you don't use it for search, you only use it
to retrieve the messages which are returned as the result of a search
query. There are lots of ways to do this: one is to use the search
engine's output ("the list of messages which are tagged as relevant to
the Foo Project") as input to an IMAP server, take the entire corpus of
returned messages and construct a synthetic, temporary mailbox which
contains only those messages, and then hand that to the mail client.
This has obvious performance issues and thus needs refinement (e.g., if
there are 4000 such matching messages then the temporary mailbox will be
large and take a while to build) but it's meant to be illustrative, not
a final design.

> The storage can be in whatever format works well. Getting my mail out
> in mbox format is useless if it doesn't still have the tags applied,
> though there is no reason you couldn't export in that format if it
> isn't the native format. I'm not convinced that a huge sequential
> text file with indexed byte offsets is the best solution for what
> amounts to a key/value storage system.
> But, whatever, the storage
> isn't the problem here; the protocol is the problem as far as I can tell.

You're darn right about that last part: IMAP predates the ubiquity of
search and thus wasn't designed with search in mind. (But with
deference to the late Mark Crispin, who not only did most of this work
but wrote the terrific UW-IMAP server: it's not IMAP's job to do search.
It does what it's supposed to do really, really well.)

And in general, there are all kinds of tradeoffs here. Storing mail in
big sequential text files really isn't that bad: if the search engine
indices note the filename and starting byte offset, access to a message
is an open() and an lseek() away. (If the ending byte is there too,
it's one read() call to have the whole thing.) If these are stored on
something fast, like an SSD, then performance should be pretty zippy.

But mail can also be stored MH-style, in individual files, in which case
retrieval means grabbing the entire contents. This alleviates the need
to store start/end bytes in the search engine indices but it comes at
the cost of way more open() calls, proportional to the number of
messages to be retrieved. (But this also allows you to use Unix
command-line tools on messages en masse, and that is pretty darn useful.)

If you anticipate all of the searching and retrieval happening later,
you can do things to stack the deck in your favor. For instance:
suppose you subscribe to a great many discussion lists, including one
about BIND -- "bind-users". Use procmail to toss all those incoming
into a single file, also named bind-users, and stash it. (If it gets
too big: bind-users.001, bind-users.002, etc.) When a future query
arrives and the first 87 out of 100 results happen to hit messages which
once traversed bind-users, many of those messages are going to be in the
same file/files.
This is going to make retrieval a lot faster (far fewer open() calls,
much higher ratio of hits on cache) than it would be if those messages
were scattered over many diverse files.

There are all kinds of tradeoffs like this, and which ones pay off
really depends on the mail corpus in play -- its nature and size -- and
the search engine, and the metadata you want to use, and... But two
things to keep in mind are (1) sequential text files aren't as slow as
they used to be ;) and (2) throwing hardware at the problem is often a
very cost/time-effective move. Just moving 200G of mbox files from disk
to SSD makes a lot of performance issues go away. (Particularly if you
make that read-only...which you can...because then the cache is never
invalid.)

It's also not unreasonable to keep all the original/important message
metadata in RAM cache. My last message to this list showed up with 2852
bytes of headers. A first pass at discarding unimportant headers leaves
1581 bytes. A second pass gets it down to 747 bytes, and removing the
labels leaves 558 bytes. So if we double that and guesstimate
1K/message, we can easily fit message metadata for 1M messages in 1G of
RAM. The last 15 years-ish of the linux-kernel mailing list (which is
really REALLY busy) contains about 2M messages, so: 2G of RAM *before*
we get fancy and collapse duplicate information (e.g., List-Id). It's
thus not out of the question to stuff 32G or 64G of memory into a box
and do things like "read in all metadata into an in-RAM hash".

This isn't meant to be prescriptive, just descriptive. The point (well,
one of the points) is that if you think carefully about the nature of
your data and the kind of searches you're going to want to perform, you
can use some pretty simple methods to gain a lot of performance.

But -- and I think I'm agreeing with you here -- you're probably not
going to want to do all of this via IMAP.

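To make the "an open() and an lseek() away" point concrete, here is a
rough Python sketch of what a byte-offset index over an mbox file could
look like. It is illustrative only: the function names
(build_offset_index, fetch_message) and the sample messages are made up
for this example, and the separator detection assumes every line that
starts with "From " begins a new message, which real mbox variants do
not all guarantee.

```python
import os
import tempfile

def build_offset_index(path):
    """Scan an mbox file once and return [(start, end), ...] byte ranges.

    Assumption: a line starting with b"From " always begins a new message
    (i.e. body lines have already been ">From "-escaped).
    """
    index = []
    start = None
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                if start is not None:
                    index.append((start, pos))
                start = pos
            pos += len(line)
    if start is not None:
        index.append((start, pos))
    return index

def fetch_message(path, start, end):
    """Retrieve one message: one open(), one lseek(), then read()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, start, os.SEEK_SET)
        remaining = end - start
        chunks = []
        # read() may return fewer bytes than requested, so loop until done.
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)

# Tiny demonstration with two made-up messages in a temporary file.
sample = (
    b"From alice@example.org Tue Oct  4 04:09:48 2016\n"
    b"Subject: first\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From bob@example.org Tue Oct  4 05:00:00 2016\n"
    b"Subject: second\n"
    b"\n"
    b"body two\n"
)
with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
    tmp.write(sample)
    mbox_path = tmp.name

index = build_offset_index(mbox_path)
second = fetch_message(mbox_path, *index[1])
print(second.decode())
os.unlink(mbox_path)
```

In a real setup the (filename, start, end) triples would live in the
search engine's index rather than being rebuilt by a scan, so the only
per-message cost at query time is the fetch_message() part.
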
---rsk
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug