Rich Freeman on 19 Dec 2016 13:49:02 -0800
Re: [PLUG] Wanted: volunteers with bandwidth, storage, coding skills to help save climate data
On Mon, Dec 19, 2016 at 12:02 PM, Doug Stewart <zamoose@gmail.com> wrote:
> Sequences coming off of the Illumina sequencers were a terabyte raw. (The
> Illuminas had single-use special purpose chips that would ostensibly take a
> sample and break it down such that it could parallelize the sequencing to a
> great extent, reducing a process that would take a week/weeks prior into
> about a 4 hour process.) Post-processing and alignment internal to the
> sequencers themselves would generally pull the data down to ~400GB which was
> then shuffled off to our Isilon array. The full alignment and post-proc
> process on the computing cluster would reduce the per-subject sizes down to
> 50-100GB per person, or, if the researchers were really aggressive and only
> needed a few sections of the genome, all the way down to 10GB per or so.

Interesting. I have been out of the field since those came along, but
unless they're doing an incredible amount of over-sampling, I suspect
this isn't actually sequence data but raw sensor data, such as a trace
of fluorescence/absorbance at multiple wavelengths over time at a
reasonably high sampling rate. They could even be sampling an entire
spectrum and storing that at a high sampling rate. That seems like
overkill, since presumably there are only four wavelengths they really
need to sample, but it would account for the large amount of data. The
peaks as the individual oligonucleotides come through would be
processed to yield the sequence (I assume they're still using
replication termination of some kind to do the sequencing).

> The figures I generally see are that a fully-aligned, non-repeating human
> genome, if each base pair is represented with their corresponding ASCII
> characters, is approximately 90GB per human, uncompressed. Those numbers
> above would suggest to me that the Illumina units were emitting subjects'
> sequences roughly in quadruplicate.

The human genome is only about 3.2 billion bases total (haploid), or
6.4 billion if you want both copies (which would be relevant if you're
interested in diagnostic information for an individual). A base can
only have one of four values, which is only two bits of entropy even if
it were completely random, which of course it is not. If you store it
as one ASCII character per base, that is 6.4 GB of data, which is still
a long way off from the size you quote, unless it is 10x oversampled
and you're storing all the raw contigs (which I think would not be
useful after re-assembly).

Even if you don't take any effort to store the files in a particularly
efficient manner, simply running them through gzip would at least get
them down to about 2 bits per base, and probably quite a bit more (how
much of the genome contains repeating palindromes and such?).

Now, if their files are extremely verbose and they're storing a lot of
redundant information and they don't bother to run it through gzip,
well, then sure, you can make just about anything take up 60GB if you
try hard enough... :)

Otherwise, I'm not really sure what they're actually trying to store here...

-- 
Rich

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
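
A rough sketch of the arithmetic above, in Python, assuming a ~3.2-billion-base
haploid genome and (as a simplification) a uniformly random sequence; the
numbers are illustrative, not measurements from any Illumina pipeline:

import gzip
import random

BASES = "ACGT"
GENOME_BASES = 3_200_000_000          # ~3.2 billion bases, haploid

# Naive storage: one ASCII byte per base vs. packing four bases per byte.
ascii_gb = GENOME_BASES / 1e9         # ~3.2 GB at 1 byte per base
packed_gb = GENOME_BASES / 4 / 1e9    # ~0.8 GB at 2 bits per base
print(f"ASCII, 1 byte/base : {ascii_gb:.1f} GB")
print(f"Packed, 2 bits/base: {packed_gb:.1f} GB")

# Sanity-check the gzip claim on a small random sample: even with no
# real structure to exploit, gzip lands close to 2 bits per base.
sample = "".join(random.choice(BASES) for _ in range(1_000_000)).encode()
compressed = gzip.compress(sample, compresslevel=9)
print(f"gzip on 1 Mbase of random sequence: "
      f"{8 * len(compressed) / len(sample):.2f} bits/base")

Even on random input gzip comes in around 2 bits per base; repeats and other
structure in a real genome would only push that lower, so the 60-90GB
per-subject figures really only make sense for verbose, uncompressed,
heavily oversampled intermediate files.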