Rich Freeman on 19 Dec 2016 13:49:02 -0800



Re: [PLUG] Wanted: volunteers with bandwidth, storage, coding skills to help save climate data


On Mon, Dec 19, 2016 at 12:02 PM, Doug Stewart <zamoose@gmail.com> wrote:
> Sequences coming off of the Illumina sequencers were a terabyte raw. (The
> Illuminas had single-use special purpose chips that would ostensibly take a
> sample and break it down such that it could parallelize the sequencing to a
> great extent, reducing a process that would take a week/weeks prior into
> about a 4 hour process.) Post-processing and alignment internal to the
> sequencers themselves would generally pull the data down to ~400GB which was
> then shuffled off to our Isilon array. The full alignment and post-proc
> process on the computing cluster would reduce the per-subject sizes down to
> 50-100GB per person, or, if the researchers were really aggressive and only
> needed a few sections of the genome, all the way down to 10GB per or so.

Interesting.  I have been out of the field since those came along, but
unless they're doing an incredible amount of over-sampling I suspect
this isn't actually sequence data but raw sensor data, such as a trace
of fluorescence/absorbance at multiple wavelengths over time at a
reasonably high sampling rate.  They could even be sampling an entire
spectrum and storing that at a high sampling rate.  That seems like
overkill, since presumably there are only 4 wavelengths they really
need to sample, but it would account for the large amount of data.  The
peaks as the individual oligonucleotides come through would be
processed to yield the sequence (I assume they're still using chain
termination of some kind to do the sequencing).
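
For what it's worth, a quick back-of-envelope (Python, and every number
in it is my own guess for illustration, not anything from your figures)
shows that keeping raw intensity values per base call gets you to a
terabyte pretty quickly:

    # Back-of-envelope: could raw per-base sensor traces plausibly reach ~1 TB?
    # Every number below is an assumption on my part, purely for illustration.
    genome_size     = 3.2e9  # haploid human genome, bases
    coverage        = 40     # assumed oversampling (reads covering each base)
    channels        = 4      # assumed one intensity channel per base (A, C, G, T)
    bytes_per_point = 2      # assumed 16-bit raw intensity value per channel
    points_per_base = 1      # assumed raw samples kept per base call

    raw = genome_size * coverage * channels * bytes_per_point * points_per_base
    print("%.1f TB of raw intensities" % (raw / 1e12))  # ~1.0 TB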

>
> The figures I generally see are that a fully-aligned, non-repeating human
> genome, if each base pair is represented with their corresponding ASCII
> characters, is approximately 90GB per human, uncompressed. Those numbers
> above would suggest to me that the Illumina units were emitting subjects'
> sequences roughly in quadruplicate.
>

The human genome is only about 3.2 Gb (gigabases) total (haploid), or
6.4 Gb if you want both copies (which would be relevant if you're
interested in diagnostic information for an individual).  A base can
only have one of 4 values, which is at most two bits of entropy per
base even if the sequence were completely random, which of course it
is not.
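
Spelled out (a trivial Python sketch, nothing in it beyond the numbers
above):

    # Rough sizes for a diploid human genome under two encodings.
    # 6.4e9 bases is the figure above; the rest is just arithmetic.
    bases = 6.4e9                  # both copies, ~3.2 Gb each
    ascii_bytes  = bases * 1       # one 8-bit ASCII character per base
    packed_bytes = bases * 2 / 8   # 2 bits per base, 4 bases packed per byte

    print("ASCII: %.1f GB" % (ascii_bytes / 1e9))   # ~6.4 GB
    print("2-bit: %.1f GB" % (packed_bytes / 1e9))  # ~1.6 GB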

Now if you store it as one ASCII character per base, then that is
6.4 GB of data, but that is still a long way off from the sizes you
quote, unless it is 10x oversampled and you're storing all the raw
reads (which I think would not be useful after assembly).  Even if you
don't take any effort to store the files in a particularly efficient
manner, simply running them through gzip would at least get them down
to roughly 2 bits per base, and probably quite a bit better (how much
of the genome contains repeats, palindromes, and such?).
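
Here's a quick sanity check of that (Python standard library only; the
exact ratio will vary, and this uses a random string, which should be
harder to compress than real sequence):

    # Sanity check: gzip on ASCII-encoded sequence should land near 2 bits/base.
    # A uniformly random ACGT string is roughly the worst case for the
    # compressor; real sequence, with its repeats, should do better.
    import gzip
    import random

    n = 1_000_000
    seq = "".join(random.choice("ACGT") for _ in range(n)).encode("ascii")

    compressed = gzip.compress(seq, compresslevel=9)
    print("%.2f bits per base" % (8 * len(compressed) / n))  # a bit over 2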

Now, if their files are extremely verbose and they're storing a lot of
redundant information and they don't bother to run it through gzip,
well, then sure you can make just about anything take up 60GB if you
try hard enough...  :)

Otherwise, I'm not really sure what they're actually trying to store here...

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug