Keith C. Perry on 25 Apr 2016 16:47:04 -0700



Re: [PLUG] [plug-announce] TOMORROW - Tue, Apr 19, 2016: PLUG North - "Linux Containers" by Jason Plum and Rich Freeman (*6:30 pm* at CoreDial in Blue Bell)


What may help is if we get some reference points...

Some quick google-fu, for what it's worth:

https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_rays1

(which references http://www.zdnet.com/article/dram-error-rates-nightmare-on-dimm-street/ as the "well-documented fact" that modern computers are susceptible to occasional random bit flips)

https://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electronics

Rich may have some others...

We know error correction can help with such matters.  However, error correction is not a silver bullet and will not correct all errors.  There are limits, but practically speaking things rarely happen in such a way that no error detection or correction mechanism somewhere along the way catches them.  So my contention is that if the n+1 case generically means an error can be detected and corrected, then there exists the n+2 case where you will have a silent failure (really that should be n+3, since the way error correction works you usually still have some error detection even when the error can't be automatically corrected).  That is not a bad thing, because we still have two cases that are in our favor.
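
To make that n+1 / n+2 / n+3 idea concrete, here's a little Python sketch I threw together (my own illustration, nothing from the articles above) of a SECDED-style extended Hamming code over one byte: one flip gets corrected, two flips get detected but not corrected, and three or more can alias back to something that looks valid, which is the silent case.

# A minimal sketch of SECDED (single-error-correct, double-error-detect)
# Hamming coding over one byte, just to make the n+1 / n+2 / n+3 point
# concrete.  Illustrative only -- real ECC DIMMs and filesystem checksums
# work on much wider words, but the failure modes are the same.

def encode(byte):
    """Encode 8 data bits into a 13-bit extended Hamming codeword.

    code[1..12] is the Hamming(12,8) codeword (positions 1, 2, 4, 8 are
    parity); code[0] is an overall parity bit over positions 1..12.
    """
    code = [0] * 13
    data_positions = [p for p in range(1, 13) if p & (p - 1)]  # non powers of two
    for i, p in enumerate(data_positions):
        code[p] = (byte >> i) & 1
    for p in (1, 2, 4, 8):  # parity over every position whose index contains bit p
        code[p] = sum(code[q] for q in range(1, 13) if q & p and q != p) % 2
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, byte). status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for p in range(1, 13):
        if code[p]:
            syndrome ^= p
    overall_ok = (sum(code[1:]) % 2) == code[0]
    status = 'ok'
    if syndrome and not overall_ok:        # n+1: single flip -- locate and fix it
        code = code[:]
        code[syndrome] ^= 1
        status = 'corrected'
    elif syndrome and overall_ok:          # n+2: double flip -- detected, not fixable
        return 'uncorrectable', None
    elif not syndrome and not overall_ok:  # the flip hit the overall parity bit itself
        status = 'corrected'
    # n+3 and beyond can alias back to a valid-looking word: the silent case
    data_positions = [p for p in range(1, 13) if p & (p - 1)]
    byte = sum(code[p] << i for i, p in enumerate(data_positions))
    return status, byte

if __name__ == '__main__':
    word = encode(0b10110010)
    flipped = word[:]
    flipped[5] ^= 1                # one flip: corrected
    print(decode(flipped))         # ('corrected', 178)
    flipped[9] ^= 1                # two flips: detected but not correctable
    print(decode(flipped))         # ('uncorrectable', None)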

However, if n+1 is rare, then n+2 is rarer (and n+3 rarer still), and we're only talking about one component of a system.  This is definitely a good thing.

Therefore, while it may be technically correct to say one file system is better than another, practically speaking bit flips from cosmic rays alone aren't enough of a motivator to choose one file system over another (assuming you can even definitively say that a particular file system is better suited to handle randomly occurring bit flips).  XFS has error detection and redundancy, and I think EXT4 does too.  Of course BTRFS and ZFS do as well, so that alone isn't a strong practical argument when other parts of the data storage and retrieval process also provide some level of these functions.
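
For what it's worth, the detect-and-repair idea behind the checksumming file systems fits in a few lines.  This is just a toy illustration with CRC32 and a mirrored copy (my own sketch, not actual ZFS or BTRFS code):

# A toy illustration of per-block checksums plus redundancy: store a
# checksum with each block, keep a mirrored copy, and on read serve
# whichever copy still verifies.
import zlib

def write_block(data):
    """Store a block as (checksum, data) with a mirrored second copy."""
    csum = zlib.crc32(data)
    return [(csum, bytearray(data)), (csum, bytearray(data))]  # primary, mirror

def read_block(copies):
    """Return data from whichever copy verifies; None if every copy is bad."""
    for csum, data in copies:
        if zlib.crc32(bytes(data)) == csum:
            return bytes(data)
    return None  # the (much rarer) case where all copies are corrupt

copies = write_block(b"important data")
copies[0][1][3] ^= 0x04        # simulate a bit flip in the primary copy
print(read_block(copies))      # b'important data' -- served from the mirror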

I agree, bad data and corrupt data are not the same, and my point is that a silent failure at the file system layer would treat good, bad, and corrupt data equally.  That should not happen from a process point of view; something else should be able to identify the problem and correct it, or at least flag it.  This illustrates why such functions are not solely implemented in the file system.

My own personal experience has been that I've always been able to track random data corruption down to a cause; I've never had to chalk any of it up to cosmic rays.  If the studies are to be believed, that implies current processes are robust enough to ensure data gets written properly despite them.  There are lots of other things I could talk about that do break processes (water, lightning, squirrels, magnets, humidity, a guy in a backhoe, et cetera) but not that one, LOL.  I suppose that is a good thing.

If anyone can share a story about a cosmic ray bit flip, I'd like to hear about it and how you tracked it down.


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Owner, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: bergman@merctech.com
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Monday, April 25, 2016 6:19:30 PM
Subject: Re: [PLUG] [plug-announce] TOMORROW - Tue, Apr 19, 2016: PLUG North - "Linux Containers" by Jason Plum and Rich Freeman (*6:30 pm* at CoreDial in Blue Bell)

In the message dated: Mon, 25 Apr 2016 17:45:16 -0400,
The pithy ruminations from "Keith C. Perry" on 
<Re: [PLUG] [plug-announce] TOMORROW - Tue, Apr 19, 2016: PLUG North - "Linux Containers" by Jason Plum and Rich Freeman (*6:30 pm* at CoreDial in Blue Bell)> were:
=> "
=> The whole point of zfs/btrfs is that they DO detect these kinds of
=> corruptions, and they automatically use the redundant copy of the
=> data.  Sure, if the same block in n+2 drives gets hit by a cosmic ray
=> then you'll still lose it, but that is MUCH less likely to happen than
=> a single flip, or two flips of unrelated blocks.  The whole server
=> could get hit by an asteroid as well."
=> 
=> I think you might be missing my point that the file system is only one layer of technology.  Even if we assume that one file system is superior to another, you would still have the problem of the hard disk failing.  Whether it's due to cosmic rays or some other random event, it is possible for the physical HD to fail in a way that would represent a silent failure in the file system.
=> 

I'm not clear on what you're trying to say here. By "silent failure
in the file system" do you mean that the fault in a physical drive was
handled transparently and caused no disruption to the filesystem (a good
thing), or do you mean that the fault was undetected by the filesystem,
which then failed (a bad thing)?

A physical HD that has a fault and is not detected by other layers in
the system represents a design (or operation) failure, much more than
a hardware failure.

If you consider the entire 'system', from user management of files,
to the filesystem, to redundant paths to storage devices, to redundant
controllers, to logical volumes in a redundant array of drives, down to
individual storage devices, then a well designed (and well-managed &
monitored) system does not have silent failures.

Data consistency at each level can be verified and some corruption
potentially recovered.

For some perspective, see the 2007 LISA talk by Andrew Hume:

	https://www.usenix.org/legacy/event/lisa07/tech/tech.html#hume

=> Likewise, you can realize a silent error from a higher level: if you were to write a file with bad data to the file system, that file system, any file system, would never know.  For instance, a topic I'm constantly discussing now is ransomware.  Despite specific methods of active detection for it, the general question remains: how can you determine, in an absolute manner over time, that a file has been encrypted without the user's consent?  Those solutions are out of scope here, but I will say that how you do that has nothing to do with the file system.

You need to distinguish bad data (i.e., unwanted content that has no
errors) from corrupt data (data that has errors after the content was
created). Ransomware represents bad data, but not corrupt data. Yes, from the
user's point of view the content is inaccessible, but it is not corrupt.

Now you're getting into general questions of Information Theory, not
data corruption. [Paging Dr.  Turing & Dr. Shannon to the Blue Conference
Room, please.]

=> 
=> 
=> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
=> Keith C. Perry, MS E.E. 
=> Owner, DAO Technologies LLC 
=> (O) +1.215.525.4165 x2033 
=> (M) +1.215.432.5167 
=> www.daotechnologies.com
=> 

-- 
Mark Bergman    Biker, Rock Climber, SCUBA Diver, Unix mechanic, IATSE #1 Stagehand
'94 Yamaha GTS1000A^2
bergman@panix.com 				https://www.flickr.com/photos/rmsppu

http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40panix.com

I want a newsgroup with an infinite S/N ratio! Now taking CFV on:
rec.motorcycles.stagehands.pet-bird-owners.pinballers.unix-supporters
15+ So Far--Want to join? Check out: http://www.panix.com/~bergman 
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
___________________________________________________________________________