Re: [PLUG] Wanted: volunteers with bandwidth, storage, coding skills to help save climate data

On Mon, Dec 19, 2016 at 12:02 PM, Doug Stewart <zamoose@gmail.com> wrote:

Working from memory:
Sequences coming off of the Illumina sequencers were a terabyte raw. (The Illuminas had single-use special-purpose chips that would ostensibly take a sample and break it down so that the sequencing could be parallelized to a great extent, cutting a process that previously took a week or more down to about 4 hours.) Post-processing and alignment internal to the sequencers themselves would generally pull the data down to ~400GB, which was then shuffled off to our Isilon array. The full alignment and post-processing run on the computing cluster would reduce the sizes down to 50-100GB per subject, or, if the researchers were really aggressive and only needed a few sections of the genome, all the way down to 10GB or so per subject.

There's obviously a lot of repeated and discardable information in those data sets, but the full data sets were of extreme interest to the research community, so the sequencers' output (the ~400GB results) tended to be what they needed to shuttle about.

When I left CHOP, we were hosting almost half a petabyte, largely of sequenced samples, in our research storage cluster, and the researchers would readily fill whatever space we threw at them.

The figures I generally see are that a fully-aligned, non-repeating human genome, if each base pair is represented with its corresponding ASCII character, is approximately 90GB per human, uncompressed. Those numbers above would suggest to me that the Illumina units were emitting subjects' sequences roughly in quadruplicate.
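For anyone who wants to check the "roughly in quadruplicate" estimate, here is a minimal sketch of the ratio, using only the from-memory figures quoted above. The function name `redundancy` and the decimal-GB constants are illustrative assumptions, not anything the sequencers or the cluster reported.

```python
# Figures quoted in this message (from memory, decimal GB).
RAW_GB = 1000            # "a terabyte raw"
ON_INSTRUMENT_GB = 400   # after on-sequencer post-processing
GENOME_ASCII_GB = 90     # the per-human ASCII figure cited above


def redundancy(output_gb: float, reference_gb: float) -> float:
    """How many times larger the output is than the reference size."""
    return output_gb / reference_gb


if __name__ == "__main__":
    # 400 / 90 is a bit over 4, hence "roughly in quadruplicate".
    print(f"on-instrument vs. ASCII genome: {redundancy(ON_INSTRUMENT_GB, GENOME_ASCII_GB):.1f}x")
    print(f"raw vs. on-instrument:          {redundancy(RAW_GB, ON_INSTRUMENT_GB):.1f}x")
```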

Again, this is all working from memory.

On Mon, Dec 19, 2016 at 11:17 AM, Rich Freeman <r-plug@thefreemanclan.net> wrote:
On Mon, Dec 19, 2016 at 11:11 AM, Doug Stewart <zamoose@gmail.com> wrote:
> The problem with data is that, even at the fattest pipe speeds, the fastest
> transit method is still overnighting HDDs via FedEx. We used to get DNA
> sequences from Tufts, Johns Hopkins, etc. via this method when I was at
> CHOP. Transfer time via Internet2 connections: ~1 month. Via FedEx: 2 days.
>

How long ago was that? A human genome is only 4 gigabases, with 2 bits
per base (before compression). Granted, I hear some plants are just
insane but a lot of that is duplicative.

1GB isn't THAT much data to transfer, and that is before compression.

Now, if it is all stored as ASCII files with 1 character per base and
maybe 10-20% overhead with things like line numbers and such then I
could see it expanding, but that is still only a 4-5x expansion in
size.

So, maybe for a human genome that is 10-20x oversampled (you're sending
raw contigs and not the assembled result) and poorly encoded, you're
talking about a day of downloading.

Unless you're talking about 1998 and your network admin doesn't want
you using more than 20kb/s of bandwidth...
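
A back-of-the-envelope sketch of the arithmetic above, for anyone who wants to plug in their own numbers. The 4 gigabase, 2 bit, 20% overhead and 20x figures come from this message; the 10 Mbit/s link speed and the function names are assumptions picked for illustration only.

```python
SECONDS_PER_DAY = 86_400


def genome_bytes(bases: float, bits_per_base: float) -> float:
    """Size in bytes of a sequence stored at a fixed number of bits per base."""
    return bases * bits_per_base / 8


def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec / SECONDS_PER_DAY


BASES = 4e9          # genome size used in this message
LINK_BPS = 10e6      # ASSUMPTION: a sustained 10 Mbit/s, not a figure from the thread

packed = genome_bytes(BASES, 2)               # 2 bits per base
ascii_text = genome_bytes(BASES, 8) * 1.2     # 1 char per base plus 20% overhead
oversampled = ascii_text * 20                 # 20x oversampled raw reads

if __name__ == "__main__":
    print(f"packed:      {packed / 1e9:.1f} GB")
    print(f"ASCII:       {ascii_text / 1e9:.1f} GB")
    print(f"oversampled: {oversampled / 1e9:.1f} GB")
    print(f"download:    {transfer_days(oversampled, LINK_BPS):.2f} days")
```

At the assumed 10 Mbit/s, the worst case here comes out a little under a day; a faster or slower link scales that linearly.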

--
Rich
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug

--
-Doug
