Backing up is hard to do (right)
You can never overstate the importance of backups. Over the last year I have put quite a bit of effort into making sure my data is backed up properly. The purpose of this article is not to describe backup best practices (that is a vast subject, there are other, better resources available on the web, and in any case there is no one-size-fits-all solution). I am just documenting my setup and the requirements that drove it, in the hope of giving readers some ideas.
The first step in planning for backups is to take an inventory of the assets you are trying to protect. In my case, in order of priority:
- 1.5GB of scans of important documents: birth certificates, diplomas, invoices, legal documents, bank statements, and so on. This data is very sensitive, and should be encrypted.
- 150GB of digital photos and scans
- My address book, which lives on my laptop
- My source code repositories
- My personal email, approximately 0.75GB
- The contents of this website, about 5GB
- 190GB of music (lossless rips of my CD collection)
- My Temboz article database
Thus the total storage capacity required for a full backup is reaching the 400GB mark. This in itself precludes DVD-R or even tape backup (short of buying an expensive LTO-4 tape drive or an autoloader, that is).
The second step is to devise your threat model. In my case, in decreasing order of likelihood:
- Human error
- Hard drive failure
- Software failure (e.g. filesystem corruption)
- Silent data loss or corruption, e.g. a defective disk
- Theft
- Fire, earthquake, natural disaster, etc.
Third, some general principles I believe in:
- Do not use proprietary backup formats. The best format is plain files on a filesystem identical in structure to the original.
- Do not rely on offline media for backups. The watched pot does not boil over: online data is much less likely to go bad without my noticing until it is too late.
- A backup plan needs to be effortless to be successful. Plugging in external drives when backups are needed, or rotating drives between home and office, are things I have tried but never stuck with.
- Backups should be verified: they should generate positive feedback, so that the absence of feedback can alert me to problems.
- For all types of data, there should be one and only one reference machine that holds the authoritative copy. Multi-master synchronization and replication are possible using tools like Unison, but they are much harder to manage and increase the risk of human error.
With these preliminaries out of the way, here is my system:
- My primary backups reside on my home server, a Sun Ultra 40 M2 workstation, running Solaris 10. This machine is very quiet, so I can keep it running in the room next to my bedroom without disturbing my sleep. It is also relatively power-efficient at 160W with seven hard drives.
- One of the seven drives is the 160GB boot drive; the other six are 750GB Seagate drives configured in a 3TB ZFS RAID-Z2 storage pool (the zpool command is sketched after this list).
- With large SATA drives, reconstruction after a drive failure takes a long time, and the risk of another drive failing under the stress of rebuilding is not negligible. RAID-Z2 can tolerate two drive failures, unlike RAID 5, which can only tolerate one. This level of data protection is also higher than RAID 1, which won’t protect you if two drives that mirror one another both fail. RAID 6 or RAID-DP offer the same level of protection.
- I have scripts to take ZFS snapshots daily, equivalent to the auto-snapshot service; a sketch of such a script follows this list. The daily snapshots are kept for the current month, then I keep only monthly snapshots. Snapshots are the primary line of defense against human error.
- A snapshot consumes only as much disk space as is required to store the differences between the snapshot and the current version of a file. This is much more efficient than schemes like Apple’s Time Machine, where a single-byte change to a multi-gigabyte file like a Parallels virtual disk image causes the entire file to be duplicated, wasting storage. Because snapshots are taken near-instantly and cost almost nothing, they are an extremely powerful feature of a storage subsystem.
- I back up from my various machines to the Sun using rsync over ssh (see the sketch after this list). An incremental backup of my PowerMac G5, which holds most of the 400GB in my backup set, takes less than 5 minutes over Gigabit Ethernet, despite the ssh encryption.
- ZFS is probably the best filesystem available, bar none, but it is not perfect, as demonstrated by the Joyent outage, and you still need another copy for backup in case of ZFS corruption.
- Every night at 2AM, a cron job on my old home server (2x400GB, ZFS RAID 0), which I now keep at work, pulls updates from the Sun using rsync over ssh, since the company firewall won’t let me push updates to it from the Sun (the crontab is sketched after this list). Another cron job at 8AM kills any leftover rsync processes, e.g. if there are more data changes to transfer than fit in the 1-2GB that can be transferred in 6 hours over my relatively pokey 320-512kbps DSL uplink (no thanks to AT&T’s benighted refusal to upgrade its tired infrastructure).
- My cron jobs use verbose output, which generates an email sent back to me. I could suppress those messages, but then I would lose the ability to detect errors.
- A last line of defense is to back up my server at work to a D-Link DNS-323 NAS box using rsync over NFS. This cute little unit holds two Western Digital Green Power 1TB drives in RAID 1, which slide right in, no tools required. It consumes next to no power or desk space. Since it runs Linux and is easy to extend using fun-plug, I could conceivably run the cron and rsync jobs from there. As a bonus, the built-in mt-daapd server streams my entire music collection to iTunes over the LAN so I can listen to any of my CDs at work.
- It can take a few days for this data bucket brigade to catch up with a particularly intense photo shoot, but it eventually does, and is never too far behind. This gives me near-continuous data protection and disaster recovery.
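For reference, a RAID-Z2 pool like this takes a single command to create. A minimal sketch, assuming the pool is named tank and the six drives appear as c1t0d0 through c1t5d0 (both the pool name and the device names are hypothetical):

```sh
# Create a double-parity RAID-Z2 pool out of six drives; any two of
# them can fail without data loss. Device names are hypothetical:
# check format(1M) for the actual ones on your system.
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

# Verify the layout and health of the pool.
zpool status tank
```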
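The daily snapshot job can be as simple as the sketch below. The filesystem name and the exact retention logic are illustrative, not a verbatim copy of my scripts:

```sh
#!/bin/ksh
# Daily ZFS snapshots with simple retention: dailies are kept for the
# current month, then pruned in favor of a single monthly snapshot.
# "tank/backup" is an illustrative filesystem name.
FS=tank/backup
MONTH=$(date +%Y-%m)

# Take today's snapshot, e.g. tank/backup@daily-2008-06-15.
zfs snapshot "$FS@daily-$MONTH-$(date +%d)"

# On the first of the month, take the monthly snapshot, then destroy
# every daily snapshot left over from previous months.
if [ "$(date +%d)" = "01" ]; then
    zfs snapshot "$FS@monthly-$MONTH"
    zfs list -H -t snapshot -o name -r "$FS" | grep "@daily-" |
        grep -v "@daily-$MONTH" | while read snap; do
            zfs destroy "$snap"
        done
fi
```

Run from cron, this costs a few seconds a day, and recovering from a fat-fingered rm is then a single zfs rollback, or a copy out of the snapshot’s .zfs/snapshot directory.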
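The per-machine backups each boil down to one rsync invocation. A hedged sketch, with hypothetical host and path names:

```sh
# Mirror the G5's photo library to the Sun over ssh. --archive
# preserves permissions, timestamps and symlinks; --delete propagates
# removals so the copy stays an exact replica (the ZFS snapshots are
# what protects against deleting something by mistake).
# "ultra40" and both paths are hypothetical placeholders.
rsync --archive --delete --verbose -e ssh \
    /Users/me/Pictures/ ultra40:/tank/backup/g5/Pictures/
```

Because rsync only transfers files that changed, and only the changed portions of those files, the incremental run is bounded by how much actually changed, not by the size of the backup set.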
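The pull from work is just two crontab entries; a minimal sketch with hypothetical host and path names (cron mails each job’s output to the crontab owner, which is where the positive feedback comes from):

```sh
# 2AM: pull last night's changes from the home server over ssh.
0 2 * * * rsync --archive --delete --verbose -e ssh ultra40:/tank/backup/ /backup/
# 8AM: reap any transfer still running when the six-hour DSL window closes.
0 8 * * * pkill rsync
```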
Update (2009-10-07):
I made some changes. My office backup server is now an inexpensive Shuttle KPC 4500 running OpenSolaris 2009.06, with a 1TB drive. It in turn backs up to the DNS-323, although I need to qualify that recommendation: like many embedded Linux devices, the DNS-323 has a distressing tendency to get wedged every now and then, requiring a reboot, and in my book it is not reliable enough to serve as a primary offsite backup. OpenSolaris, of course, is rock-stable, and the hardware is not much more expensive (I paid $400 for the KPC).
My backups are now much faster since I upgraded to 20Mbps symmetric Metro Ethernet service from Webpass a month ago.
Update (2014-01-09):
Since I moved to a semi-suburban house two years ago and had to revert to AT&T’s abysmally slow DSL service, remote backups over rsync are no longer a viable option, and I have to use sneakernet. My current setup is:
- A Time Machine backup onto a 4TB internal drive inside my Mac
- Hourly rsync backups onto a 2TB WD My Passport Studio; a sketch of the cron job follows. I actually have two of these and rotate them between home and office. They have a metal case (which helps heat dissipation and increases drive lifetime and reliability) as well as hardware AES encryption.
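The hourly job is again a single crontab entry (launchd would be the more idiomatic mechanism on a Mac, but cron still works). A minimal sketch, with hypothetical volume and directory names:

```sh
# Hourly: mirror the home directory to whichever My Passport drive is
# currently plugged in. The test makes the job a silent no-op when the
# drive is at the office; volume and path names are hypothetical.
0 * * * * [ -d /Volumes/Passport ] && rsync --archive --delete /Users/me/ /Volumes/Passport/backup/
```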