Soren Harward on 19 Dec 2016 15:22:16 -0800

Re: [PLUG] genome sequencing (was Re: Wanted: volunteers with bandwidth, storage, coding skills to help save climate data)

On Mon, Dec 19, 2016 at 4:49 PM Rich Freeman wrote:
(I assume they're still using replication termination of some kind to do sequencing).

Sort of.  The Solexa/Illumina platform does "sequencing by synthesis".  Basically, you start with several million short, single-stranded oligonucleotides anchored to a surface.  Then you add fluorescently-tagged nucleotide bases one at a time (called a "flow"), and image the surface each time a new base is added.  Then do 30–100 cycles of flows, gradually synthesizing the complementary sequence; hence the name "sequencing by synthesis".  You figure out the sequence for each short oligonucleotide by seeing which flow causes it to light up.  So TAAGTC would light up on the A flow in the first cycle (remember it's the complementary base), the T flow on the second and third cycles, the C flow on the fourth cycle, etc.
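As a toy illustration of the scheme described above, here is a minimal Python sketch.  It assumes the simplified picture from this post (exactly one flow lights up per cycle, and that flow is the complement of the template base at that position); the function names are made up for the example and don't come from any sequencing software.

```python
# Simplified model: one lit flow per cycle, and the flow that lights up
# is the complement of the template base at that position.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def flows_that_light_up(template):
    """Return, cycle by cycle, which flow lights up for this template."""
    return [COMPLEMENT[base] for base in template]

def template_from_flows(lit_flows):
    """Recover the template by complementing the lit flow of each cycle."""
    return "".join(COMPLEMENT[flow] for flow in lit_flows)

lit = flows_that_light_up("TAAGTC")
print(lit)                       # ['A', 'T', 'T', 'C', 'A', 'G']
print(template_from_flows(lit))  # TAAGTC
```

The first four entries match the A, T, T, C pattern in the example above.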

Most of the ~1 TB of raw data from a single sequencing run consists of the hundreds of multi-megapixel, 16-bit grayscale, uncompressed digital images of the surface.  It's so much data that even in 2005, Solexa had to use an FPGA accelerator so that the image analysis didn't take weeks.  The software reduces the images to base calls for each oligonucleotide "read".  So even though the final sequences of all the reads compress down to a few dozen MB, it's good practice to keep the raw image data around until you're certain you don't need it.
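To put rough numbers on that size gap, here is a back-of-the-envelope sketch in Python.  The run size (5 million reads of 50 bases) is a hypothetical figure picked from the ranges mentioned above, and the 2-bits-per-base packing is an assumption that ignores quality scores and read names.

```python
def packed_read_bytes(n_reads, read_length, bits_per_base=2):
    """Bytes needed to store all called bases at a fixed width per base."""
    return n_reads * read_length * bits_per_base // 8

# Hypothetical run: 5 million reads of 50 bases each.
size = packed_read_bytes(5_000_000, 50)
print(size / 1e6)   # 62.5  (MB of packed sequence)
print(1e12 / size)  # 16000.0  (how many times larger ~1 TB of images is)
```

Under those assumptions the called sequences are four orders of magnitude smaller than the images they came from.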

The main reason you'd need the raw image data is that calling bases, or at least doing it well, is much, much harder than you'd expect, so you may want to redo it later with better software.  Early versions of the Solexa software weren't that good at base calling, and even now I think the stock software is still a bit behind the state of the art.  I'm a patent examiner in bioinformatics, and every year I examine a couple of applications for new base calling algorithms.
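For a sense of what the naive approach looks like, here is a deliberately simple Python sketch.  It is not the Solexa/Illumina algorithm: it assumes you already have four per-cycle intensities (one per base) for a read, picks the brightest, and emits "N" when the top two are too close.  The function names, the purity score, and the 0.6 threshold are all hypothetical choices for illustration.

```python
CHANNELS = "ACGT"

def naive_call(intensities):
    """Call one base from four channel intensities (order A, C, G, T).

    Returns (base, purity), where purity = brightest / (brightest + runner-up).
    """
    ranked = sorted(zip(intensities, CHANNELS), reverse=True)
    (top, base), (second, _) = ranked[0], ranked[1]
    total = top + second
    purity = top / total if total > 0 else 0.0
    return base, purity

def call_read(cycles, min_purity=0.6):
    """Call every cycle of a read, writing 'N' for ambiguous cycles."""
    return "".join(
        base if purity >= min_purity else "N"
        for base, purity in (naive_call(c) for c in cycles)
    )

cycles = [
    (900, 40, 30, 50),   # clear A
    (60, 35, 45, 800),   # clear T
    (300, 20, 280, 25),  # A and G nearly tied
]
print(call_read(cycles))  # ATN
```

A caller like this gives up on every ambiguous cycle, and being able to revisit those cycles with something smarter is the kind of thing you'd keep the raw images for.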

Soren Harward

