Rich Freeman on 25 Jul 2012 05:32:14 -0700

Re: [PLUG] Backup drive filling up

On Wed, Jul 25, 2012 at 7:51 AM, Walt Mankowski <> wrote:
> I think it's got to do something close to a complete read pass.  It
> generally takes about 45 minutes on my box.

That is why I generally don't use rsync for backups.  It does have
benefits if you're running it as a daemon over a network, since it is
economical with network IO, but it does not do much to cut down on
disk IO.  Basically it assumes that files could have been modified
without their mtimes being updated, which actually can happen.
Looking at the man page, it seems you can tell it to ignore times
entirely (-I/--ignore-times) or to compare whole-file checksums
(-c/--checksum), but not to rely on times on one side while skipping
the checksum pass on the other.

Ideally it would let you use checksums on only one side of the
operation and cache them on the other.  There is a real risk that
mtimes aren't reliable in my source data, but an offline backup disk
shouldn't be touched by anything except rsync, so if rsync just kept a
file with a checksum index on the backup disk, it could read that file
instead of recalculating hashes over 500GB of data.
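That idea is simple enough to sketch by hand.  This is a toy version
(the function names and the .md5index filename are made up, and paths
with spaces aren't handled) -- hash the source every run, but read a
saved index for the backup side:

```shell
#!/bin/sh
# Sketch of the cached-checksum idea: hash the source tree on every
# run, but read a saved index for the backup disk instead of
# re-hashing it.  Hypothetical helper, not an rsync feature.

hash_tree() {  # print "md5  ./path" for every file under $1
    ( cd "$1" && find . -type f ! -name .md5index -print0 |
          sort -z | xargs -0 md5sum )
}

changed_files() {  # usage: changed_files SRC DST
    index="$2/.md5index"
    # Build the backup's index once; later runs just read the cache.
    [ -f "$index" ] || hash_tree "$2" > "$index"
    # Lines only on the source side ("<") are new or modified files.
    hash_tree "$1" | diff - "$index" | awk '/^</ { print $3 }'
}
```

The catch, of course, is keeping the index honest: anything that
writes to the backup disk behind the script's back invalidates it.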

If you use Amazon S3 and utilities like s3cmd sync, you get this
automatically: Amazon S3 maintains a hash for everything it stores,
which can be read without retrieving the file (and incurring bandwidth
costs).  s3cmd sync reads files locally to determine their checksums,
compares those to Amazon's, and only uploads the files that don't
match.  Alas, I don't think it supports binary diffs, etc.

While it would probably be harder to read without tools, a
content-hashed backup solution would get around some of this.  I used
to use a Linux backup solution called BackupPC which didn't do
content-hashing per se, but it did make a pass over the backup
directories and replace identical files with hard links to
deduplicate them.  That is dangerous if anything other than the
software modifies the backups (simply reading the directories is
fine), but the backups stored by that software were otherwise
ordinary directory trees.  A safer option would be to convert the
files to reflinks if btrfs were available (these are COW copies that
share blocks until one of them is modified).
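A toy version of that hard-link pass might look like this (an
illustration only -- BackupPC's real pooling is more elaborate, and
paths with spaces aren't handled; on btrfs you'd swap the ln for
cp --reflink=always):

```shell
#!/bin/sh
# Toy dedup pass: replace identical files under $1 with hard links to
# one canonical copy per checksum.  Dangerous in the way described
# above -- editing any one copy now edits them all.
dedup_tree() {
    seen=$(mktemp)   # lines of "checksum path-of-first-copy"
    find "$1" -type f | while read -r f; do
        sum=$(md5sum "$f" | awk '{ print $1 }')
        first=$(awk -v s="$sum" '$1 == s { print $2; exit }' "$seen")
        if [ -z "$first" ]; then
            echo "$sum $f" >> "$seen"      # first copy: remember it
        elif ! [ "$f" -ef "$first" ]; then
            ln -f "$first" "$f"            # duplicate: hard-link it
        fi
    done
    rm -f "$seen"
}
```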

The biggest limitation of anything that stores backups as copied
directory trees is that it is very wasteful of space, as
general-purpose filesystems aren't particularly efficient at storing
files you rarely access.

Philadelphia Linux Users Group