Rich Freeman on 19 Dec 2016 13:49:02 -0800


Re: [PLUG] Wanted: volunteers with bandwidth, storage, coding skills to help save climate data

On Mon, Dec 19, 2016 at 12:02 PM, Doug Stewart <> wrote:
> Sequences coming off of the Illumina sequencers were a terabyte raw. (The
> Illuminas had single-use special purpose chips that would ostensibly take a
> sample and break it down such that it could parallelize the sequencing to a
> great extent, reducing a process that would take a week/weeks prior into
> about a 4 hour process.) Post-processing and alignment internal to the
> sequencers themselves would generally pull the data down to ~400GB which was
> then shuffled off to our Isilon array. The full alignment and post-proc
> process on the computing cluster would reduce the per-subject sizes down to
> 50-100GB per person, or, if the researchers were really aggressive and only
> needed a few sections of the genome, all the way down to 10GB per or so.

Interesting.  I have been out of the field since those came along, but
unless they're doing an incredible amount of over-sampling I suspect
this isn't actually sequence data but raw sensor data, such as a trace
of fluorescence/absorbance at multiple wavelengths over time at a
reasonably high sampling rate.  They could even be sampling an entire
spectrum and storing that at a high sampling rate.  That seems like
overkill, since presumably there are only 4 wavelengths they really
need to sample, but it would account for the large amount of data.  The
peaks as the individual oligonucleotides come through would be
processed to yield the sequence (I assume they're still using
replication termination of some kind to do sequencing).

> The figures I generally see are that a fully-aligned, non-repeating human
> genome, if each base pair is represented with their corresponding ASCII
> characters, is approximately 90GB per human, uncompressed. Those numbers
> above would suggest to me that the Illumina units were emitting subjects'
> sequences roughly in quadruplicate.

The human genome is only 3.2 Gb (gigabases) total (haploid), or 6.4 Gb
if you want both copies (which would be relevant if you're interested
in diagnostic information for an individual).  A base can take only
one of 4 values, which is at most two bits of entropy even if it were
completely random, which of course it is not.
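To make the arithmetic concrete (my own illustration, not from the thread): since a base carries at most 2 bits, you can pack 4 bases per byte, so a 6.4 Gb diploid genome fits in ~1.6 GB instead of 6.4 GB of one-character-per-base ASCII.

```python
# Sketch: pack a DNA sequence at 2 bits per base, 4 bases per byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a string of A/C/G/T into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4]:
            b = (b << 2) | CODE[base]
        out.append(b)  # last byte may hold fewer than 4 bases
    return bytes(out)

seq = "ACGT" * 1000      # 4,000 bases as ASCII = 4,000 bytes
packed = pack(seq)       # 1,000 bytes: a 4:1 reduction
```

Scaled up, the same 4:1 ratio takes 6.4e9 ASCII bases down to 1.6e9 bytes before any actual compression is applied.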

Now if you store it as one ASCII character per base then that is
6.4 GB of data, but that is still a long way off from the size you
quote, unless it is 10x oversampled and you're storing all the raw
contigs (which I think would not be useful after re-assembly).  Even
if you don't take any effort to store the files in a particularly
efficient manner, simply running them through gzip would at least get
them down to about 2 bits per base, and probably quite a bit more (how
much of the genome consists of repeats, palindromes, and such?).

Now, if their files are extremely verbose, they're storing a lot of
redundant information, and they don't bother to run it through gzip,
well, then sure, you can make just about anything take up 90GB if you
try hard enough...  :)

Otherwise, I'm not really sure what they're actually trying to store here...

Philadelphia Linux Users Group         --