bergman on 13 Aug 2013 09:02:58 -0700
Re: [PLUG] Offline Backup Solutions
In the message dated: Sun, 11 Aug 2013 20:20:00 -0400, the pithy ruminations
from Rich Freeman on <[PLUG] Offline Backup Solutions> were:

=> After attending yesterday's Bacula talk I am thinking about doing
=> offline backups to an eSATA drive. I'm not sure if Bacula is actually
=> the right tool for the job though.

Interesting discussion... I've actively used BackupPC since ~2005, bacula
since ~2006, and rsync & tar & friends since 198?. However, I don't think
the issue here is really about the software package so much as it is about
the process.

=> I'd like to define the following classes of jobs:
=> 1. MythTV video - 1-2x/wk backup, no retention of deleted/changed
=> files (~1TB with high turnover)
=> 2. Unimportant files - daily backup, short retention of
=> deleted/changed files (~1TB with low turnover)
=> 3. Important files - hourly backup, long retention of deleted/changed
=> files (~30GB with low turnover)

It sounds like you've got a good understanding of your data, including the
relative importance and churn rate of different categories. That's a vital
starting point in deciding on any backup solution.

So a full backup would be ~2TB, and a nightly incremental would be small.
However, a differential backup could be the same order of magnitude as a
full, given that the MythTV data may change completely. That's a pretty
small volume, given today's storage media. Think of cheap redundant media
as a way to gain reliability (against media errors or physical disaster)
while reducing the complexity of the software part of the process.

=> Some of the important files might come from other hosts running
=> Windows (which makes something like Bacula more attractive).

BackupPC can do Windows quite easily.

=> I'd like all but #1 to run automatically (optional for #1). I'd like
=> my large offline storage to remain, well, offline (not physically
=> connected). Automated backups would all go into online storage, and
=> would be migrated to the offline storage when it is connected. Online
=> storage would have a capacity of ~100-200GB tops (ie it cannot store a
=> full backup of anything but the important files).

It sounds like your critical concern (particularly based on later postings)
is keeping the full backup offline as much as possible, in order to decrease
the chance of media or human error corrupting that backup. That doesn't
strike me as an unreasonable concern, but you should weigh the risk of
corrupting an 'online' backup device AND then needing a recovery from that
same backup against the complexity required to mitigate that risk.

=> Any suggestions? How are others handling offline storage? I could
=> just manually mirror things but then I lose the security of automated
=> backups. I could leave the offline storage online, but then that
=> makes it vulnerable to many failures that would take out the originals
=> (even if unmounted when not in use).
=>
=> I was looking at Bacula and it seems like I could sort-of do this.

[Valid concerns about Bacula SNIPPED]

=> This just seems more complicated than it needs to be. Surely somebody
=> must be doing backups using offline disks? Most of the logic is built
=> around having a box of tapes and rotating through those, but that is
=> incredibly expensive these days as tape just hasn't kept pace, and I'm

Define 'expensive'. A tape drive is expensive; tape media is cheap. I've
got 100s of LTO4 tapes, each holding 800GB~1.6TB depending on compression,
at ~$25 each. However, that's probably not the solution for your problem.

=> not going to rotate disks that will end up being 90% empty, or have

Why not? Disks are 'cheap' compared to your time and effort.

=> the system be doing full backups on multiple-TB of data with any
=> frequency.
=>
=> I could just do manual rsyncs/etc, but then if I forget to do it for a
=> week I am taking a fair bit of risk, and managing retention with rsync
=> doesn't sound simple. I could also just leave the drive online but

One possibility is not to try to manage retention with rsync, but to rsync
to multiple devices that are physically rotated off-site.
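For instance, a bare-bones nightly cron job for that approach might look
something like the sketch below (the 'offsite' label, the mount point, and
the /home source path are all made-up names -- substitute your own):

    #!/bin/sh
    # Minimal nightly mirror to whichever rotated drive happens to be
    # attached. The "offsite" filesystem label and mount point here are
    # placeholders, not a recommendation.
    DEV=/dev/disk/by-label/offsite
    MNT=/mnt/offsite

    # Complain loudly (email) if no rotated drive is attached.
    [ -b "$DEV" ] || { echo "no rotated drive attached" | \
        mail -s 'backup skipped' root; exit 1; }
    mountpoint -q "$MNT" || mount "$DEV" "$MNT" || exit 1

    # --delete keeps each drive an exact mirror; "retention" is simply
    # however many drives are in the rotation, not anything rsync does.
    rsync -aHx --delete /home/ "$MNT/home/"
    umount "$MNT"

Worst case, if you forget to swap drives for a week, the copy that's
off-site is merely stale -- it's still a usable copy.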
=> unmounted. One advantage of rsync though is that recovery is
=> brain-dead simple. I don't mind the thought of recovering onto bare

That's a huge advantage. Don't forget, the reason we do backups is to be
able to recover data, not for the warm happy feeling of having copied 1s
and 0s. If you can't recover the data accurately and easily, then the
'backup' effort was wasted. Simple = Good [and this advice is coming from
someone who is managing a way-too-complex bacula installation].

=> metal from something like tar/dar/etc, but for something like Bacula
=> the bar is considerably higher.
=>
=> The important stuff is already being backed up to S3, and I don't
=> think I'm going to change that. This is really about faster recovery
=> in the event of something other than a fire and backing up all the
=> other junk that doesn't warrant that kind of treatment. I'm also
=> contemplating moving to btrfs and I'd really only want to do that if I
=> had a fairly full set of recent backups at all times.
=>
=> How are others handling offline backup? I may just be
=> over-engineering things. I could probably script up manual backups

Yes, I believe you're in danger of over-engineering things. Think about
the parameters and risk cases here:

  * Do you have [practical] limits on the backup & restore duration?
    It'll take significant time to copy ~2TB of data, but perhaps a
    multi-hour backup is OK, and it lets you remove the complexity of
    deduplication.

  * What are the risk cases you're trying to minimize?

    - Accidental erasure of online files.
      Any backup solution will handle this.

    - Damage to the original drives/computer that necessitates a
      bare-metal recovery.
      Some backup solutions are better at this than others.

    - Logical damage to the attached (on-line) backup devices
      (i.e., 'rm -rf' or 'fdisk').
      How great is this risk, really?

    - Physical damage to the attached backup devices (i.e., power
      surge, burst pipe, etc.), whether on-line or off-line.
      Keeping the drives off-line is of little benefit unless they are
      physically separate from the original data & computers.

    - More importantly, how great is the risk that you will experience
      damage to the backup media at a time when you also need to do a
      recovery?
      [The only mitigation is off-site storage, whether that means
      physically keeping your backup drives separate from the original
      data, or using S3 as your data backup and 'apt-get' as your OS
      installation backup.]

Some [semi-random] observations:

  * Keep it simple. Drives are cheap, especially compared to
    engineering time, and a simple backup solution makes recovering
    data more likely, especially under stress.

  * In my observation, "generic" external USB drives have the highest
    failure rate (regardless of manufacturer, size, drive model, etc.)
    of any computing items I've worked with over the last 20+ years...
    I wouldn't rely on them as the sole copy of data, but they're fine
    for one of "N" copies.

I'd suggest something like:

  * Purchase 3x 3TB drives.

  * Determine a backup 'solution' with a schedule that meets your
    needs -- for example, a weekly 'full' and nightly incrementals.

  * Label the drives (physically and logically) so that both the
    backup software and the human beings can easily tell which drive
    should be used.

  * Physically rotate the drives, something like:

      week 1 (week N of the year not evenly divisible by 2 or 3):
        put drive C off-site
        retrieve drive B from off-site storage
        drive A on-line: full backup, incr x 6

      week 2 (week N of the year evenly divisible by 2):
        put drive A off-site
        retrieve drive C from off-site storage
        drive B on-line: full backup, incr x 6

      week 3 (week N of the year evenly divisible by 3):
        put drive B off-site
        retrieve drive A from off-site storage
        drive C on-line: full backup, incr x 6

Yes, you're risking some kind of damage by leaving the active drive
'online', but the trade-off is that the chance of successful backups is
much higher than with nightly mounting/unmounting: you reduce the
likelihood of physical/electrical damage from disconnecting/reconnecting
the drive every night, you reduce the chance of human error (forgetting
to connect the drive), etc.

Note that this scheme can make use of backup software that keeps an index
and is aware of backups stored on media that's currently offline, but it
doesn't require such software. I do something similar with BackupPC--I've
got multiple external drives, and keep the 'catalog' (and all BackupPC
config files, etc.) on each drive--when I plug in a drive and restart
BackupPC, there's no awareness of any other media.

I'd use a fairly simple wrapper (run via cron nightly) around the chosen
backup software (see the sketch after this list), where the wrapper
would just:

  * check the week (i.e., week "N" of the year, modulo 2 or modulo 3)

  * look for a drive (physically attached, powered up) with the
    correct label

  * complain loudly (email, on-screen message, etc.) if the device
    cannot be found

  * mount the drive if it is not already mounted

  * complain loudly (email, on-screen message, etc.) if the device
    cannot be mounted

  * run your chosen backup

  * if it's the end of the week (for whatever day you decide to rotate
    media), send a reminder to move the device off-site
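Something like this -- everything here (the backup-A/B/C labels, the mount
point, the "run_backup" command, and the Sunday rotation day) is a stand-in
for whatever you actually choose, and I've used a udev by-label lookup
rather than parsing "fdisk -l" output, since it's less fragile:

    #!/bin/sh
    # Hypothetical nightly wrapper around your chosen backup program.
    ADMIN=root
    MNT=/mnt/backup

    # Week "N" of the year picks the drive, mirroring the rotation
    # above. GNU date's %-V suppresses the leading zero, which the
    # shell would otherwise treat as octal. Weeks divisible by both
    # 2 and 3 fall to drive B here -- adjust to taste.
    WEEK=$(date +%-V)
    if   [ $((WEEK % 2)) -eq 0 ]; then LABEL=backup-B
    elif [ $((WEEK % 3)) -eq 0 ]; then LABEL=backup-C
    else                               LABEL=backup-A
    fi

    # Complain loudly if the expected drive isn't attached.
    DEV=/dev/disk/by-label/$LABEL
    if [ ! -b "$DEV" ]; then
        echo "expected drive '$LABEL' is not attached" | \
            mail -s 'backup: drive missing' $ADMIN
        exit 1
    fi

    # Mount it if it isn't already mounted; complain loudly on failure.
    mountpoint -q "$MNT" || mount "$DEV" "$MNT" || {
        echo "could not mount $DEV on $MNT" | \
            mail -s 'backup: mount FAILED' $ADMIN
        exit 1
    }

    # Run whatever backup you settled on (rsync, tar, BackupPC, ...).
    run_backup "$MNT"

    # On rotation day (Sunday here), nag a human to swap the media.
    [ "$(date +%u)" -eq 7 ] && \
        echo "time to rotate drive '$LABEL' off-site" | \
            mail -s 'backup: rotate media' $ADMIN
    exit 0

The script never tries to be clever -- it either complains or runs the
backup, and the only part left to a human is swapping drives once a week.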
Mark

=>
=> Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug