
Arnie Simple Incremental Backups System

Author: Martin Blais <blais@furius.ca>
Date: 2005-07-23

Abstract

A design for my offsite backup solution. This is an internal design document and may not be up to date. These are just notes I took while implementing the project.

Forces

We want:

Solution

Some definitions:

Client
The machine that needs to be backed up.
Server
The machine where backed up data archives are to be sent.

The algorithms

Backup

A file listing the files that were last backed up, including a checksum for each file (e.g. an MD5 sum), is kept and maintained on the client. If this file is not available, a full backup should be run.

  1. We load the last-backed-up file list;
  2. We compare against the current list of files to be backed up:
    • Files that have been added are marked to be backed up;
    • Files that have changed (their MD5 sums or timestamps differ) are marked to be backed up.
  3. We tar/zip (.tar.bz2) the list of files that have been added or changed into a backup archive file;
  4. We also save a full listing of the files that are present at the moment of backup, in a special file in the backup archive;
  5. The backup archive is optionally encrypted with a key, named with the current date/time, and sent to the server, to be put in a directory along with the other archives.
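
The backup steps above can be sketched roughly as follows. This is only a minimal sketch, not the actual implementation: the function names, the one-path-per-line history format, and the use of "arniehistory" as the name of the special file are assumptions made for illustration. Encryption and the transfer to the server are left out. Passing an empty previous listing gives a full backup.

```python
import hashlib
import io
import os
import tarfile
import time


def snapshot(root):
    """Map each file path (relative to root) to its (MD5 hex digest, mtime) pair."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            listing[os.path.relpath(full, root)] = (digest, int(os.stat(full).st_mtime))
    return listing


def files_to_backup(previous, current):
    """Return the paths that were added, or whose MD5 sum or mtime differs."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)


def write_archive(root, archive_path, changed, current, history_name="arniehistory"):
    """Store the changed files plus the full current listing in a .tar.bz2 archive."""
    # Hypothetical history format: one relative path per line.
    history = "".join(path + "\n" for path in sorted(current)).encode("utf-8")
    with tarfile.open(archive_path, "w:bz2") as tar:
        info = tarfile.TarInfo(history_name)
        info.size = len(history)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(history))
        for path in changed:
            tar.add(os.path.join(root, path), arcname=path)
```

A run would call snapshot() on the client directory, compare the result against the previously saved listing with files_to_backup(), and hand both to write_archive().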

Restore

We assume that we have a list of backup archives whose union contains at least one copy of each file in the file list stored in the backup archive we restore from.

  1. We open the latest archive file for the date at which we want to restore the backup and fetch the special file list for that date;
  2. We fetch the list of files stored within all the archives that precede the date for which we want to restore. We do this lazily: we only open an archive if we need to (see below).
  3. For each of the files in the list of files to restore: look for it from latest to earliest archive file. Open the archives as needed (and keep their file lists cached).
    1. When a file is found, untar it into the restored output directory;
    2. If we reach the end of the list of archives without finding the file, output an error and continue.
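
The lookup described above could look something like the sketch below. It assumes the archives are passed oldest first, that the last one is the archive for the date being restored, and that the special file is named "arniehistory" and holds one path per line. These are illustration choices, not a description of the real script.

```python
import tarfile


def restore(archives, output_dir, history_name="arniehistory"):
    """Restore the file list of the last archive; return the paths not found."""
    opened = {}  # archive path -> (TarFile, set of member names), filled lazily

    def load(path):
        # Open an archive only the first time it is needed, then cache its file list.
        if path not in opened:
            tar = tarfile.open(path, "r:bz2")
            opened[path] = (tar, set(tar.getnames()))
        return opened[path]

    missing = []
    try:
        latest, _names = load(archives[-1])
        wanted = latest.extractfile(history_name).read().decode("utf-8").splitlines()
        for name in wanted:
            # Search from the latest archive back to the earliest one.
            for path in reversed(archives):
                tar, names = load(path)
                if name in names:
                    tar.extract(name, output_dir)
                    break
            else:
                print("error: %s not found in any archive" % name)
                missing.append(name)
    finally:
        for tar, _names in opened.values():
            tar.close()
    return missing
```

Because the search stops at the first hit, older archives are opened only when a file has not changed recently.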

Another idea for restore would be to untar all the archives in order over each other and then to remove the files that are not supposed to be present. This might actually be faster than the lookup I was planning to do.
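
A rough sketch of this alternative, under the same assumptions as before (archives given oldest first, a hypothetical "arniehistory" member with one path per line, POSIX-style path separators):

```python
import os
import tarfile


def restore_by_overlay(archives, output_dir, history_name="arniehistory"):
    """Extract all archives oldest first, then delete files absent from the last listing."""
    with tarfile.open(archives[-1], "r:bz2") as tar:
        wanted = set(tar.extractfile(history_name).read().decode("utf-8").splitlines())
    # Later archives overwrite the files extracted from earlier ones.
    for path in archives:
        with tarfile.open(path, "r:bz2") as tar:
            tar.extractall(output_dir)
    removed = []
    for dirpath, _dirnames, filenames in os.walk(output_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, output_dir)
            if rel not in wanted:
                os.remove(full)
                removed.append(rel)
    return sorted(removed)
```

Note that the extracted history file itself is not in the listing, so this sketch removes it along with the stale files.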

About Dates

There are two options for the restore algorithm to sort the archives by date:

  1. The filenames are assumed to contain the date/time embedded in the filename. There is no need to open the file to find out the date/time of the archive;
  2. The date/time is stored as the mtime of the history file contained within the archive. We need to open all the archives in order to just find the date/times of the archives. The advantage is that the archives can be named anything.

Decision: We archive with information both in the filename and in the mtime of the history file contained therein. The restore script by default will rely on the filename to determine its timestamp, but optionally will be able to look into the history file in case the filenames have been munged somehow.
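
As an illustration of this decision, the timestamp lookup might be sketched as below. The filename pattern (YYYYMMDD-HHMMSS, read as UTC), the function name, and the "arniehistory" member name are assumptions; the notes do not fix the actual naming format.

```python
import calendar
import os
import re
import tarfile
import time

# Hypothetical naming scheme, e.g. "arnie-20050723-120000.tar.bz2".
STAMP = re.compile(r"\d{8}-\d{6}")


def archive_timestamp(path, use_history=False, history_name="arniehistory"):
    """Return the archive's date/time as seconds since the epoch."""
    if use_history:
        # Slower: the archive has to be opened to read the history file's mtime.
        with tarfile.open(path, "r:bz2") as tar:
            return int(tar.getmember(history_name).mtime)
    match = STAMP.search(os.path.basename(path))
    if match is None:
        raise ValueError("no date/time found in filename: %s" % path)
    return calendar.timegm(time.strptime(match.group(0), "%Y%m%d-%H%M%S"))
```

Sorting the archives is then a matter of calling sorted(archives, key=archive_timestamp), switching to use_history=True only when the filenames have been munged.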

About File Naming

Decision: full backups should not be named specially. The restore script will be able to work out which archives are required to extract the files listed in the contained history file.

Notes

Implementation Notes

Treating Directories Just Like Files

  • Treat directories exactly like files, so that empty backups are of size 0 (it is also much cleaner).
    • This means that we should store directories only if their attributes (permissions) changed. What matters is:
      • For files: crc, permissions, user, mtime
      • For directories: permissions, user, mtime
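
One way to express "what matters" is a per-entry signature that is compared between runs. The sketch below is an assumption about how this could be done, using MD5 as the checksum; the function name and tuple layout are made up for illustration.

```python
import hashlib
import os
import stat


def signature(path):
    """Return a tuple that changes whenever the entry needs to be backed up again."""
    st = os.stat(path)
    common = (stat.S_IMODE(st.st_mode), st.st_uid, int(st.st_mtime))
    if stat.S_ISDIR(st.st_mode):
        # Directories: permissions, user, mtime.
        return ("dir",) + common
    with open(path, "rb") as f:
        crc = hashlib.md5(f.read()).hexdigest()
    # Files: crc, permissions, user, mtime.
    return ("file", crc) + common
```

An entry is stored in the archive only when its signature differs from the one recorded at the last backup, so an unchanged tree produces an empty archive.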

Empty Directories

  • Check if .tar.bz2 can contain empty directories.

    Answer: Yes! And it also stores the permissions of directories ONLY if the directories are stored as entries of their own. Otherwise, the permissions are not kept on the parent directories.

    So you must store entries for the directories as you go if you want to recover all of the permissions.

  • Can empty directories be created with Python's tarfile module? Can I store permissions too? If so, then we should store the list of directories in the history file, to make sure that we can recreate them exactly the same, with the same permissions as well.

    Yes! Extracting a directory entry recreates the directory with all its permissions.
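
A small sketch of storing a directory as an entry of its own with the tarfile module (the helper name is hypothetical). Passing recursive=False adds only the directory entry, with its mode, and none of its contents. One caveat: recent Python versions apply an extraction filter by default, which may skip restoring directory permission bits, so this sketch only reads the stored mode back rather than relying on extraction.

```python
import os
import tarfile


def add_directory_entry(archive_path, directory, arcname):
    """Write a .tar.bz2 archive containing only the directory entry itself."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(directory, arcname=arcname, recursive=False)


def stored_mode(archive_path, arcname):
    """Return (is_directory, permission bits) recorded for an archive member."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        member = tar.getmember(arcname)
        return member.isdir(), member.mode & 0o777
```

For example, after add_directory_entry("dirs.tar.bz2", "some/empty/dir", "empty"), calling stored_mode("dirs.tar.bz2", "empty") reports the permissions the directory had when it was archived.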

Future Ideas (Will Not Necessarily Get Done)

Date for archives

  • We could use the time specified on the arniehistory file inside the archive instead of the string that is embedded within the archive filename.

Dates between machines

  • Dates between machines could be different, so choosing a date that is stored on the source machine seems appropriate and important.

    Implement both: the date/time embedded within the filename (default), or the date/time stored as the mtime of the contained history file (slower, since all the archives have to be opened to find out their date/times).

Attributes Support

  • Attributes not supported: st_uid, st_gid, st_atime, st_mtime, st_ctime.

    Add support for some of those.