Tuesday, October 13, 2009

Getting started with rsync, for the paranoid

When a computer tool has the potential to be dangerous, my paranoia manifests itself by making sure I understand what the tool is doing in detail before I use it. rsync is a very powerful tool for cloning directory trees. It can also wipe out your local files, and figuring out exactly what it does is quite complicated. It doesn't help that the rsync manual page is a monster.

The basic tutorials I find in Google all seem a bit off, so let me start with why I wrote this. You don't need to start an rsync server to use it, you really don't need or even want to start by setting up insecure keys, and the tutorials that just focus on the basics leave me unsure what I just did. Quick and dirty guide to rsync is the closest to what I'm going to do here, but it lacks the theory and distrust I find essential to keeping myself out of trouble.

Let's start with local rsync, which is how you should get familiar with the tool. One useful mental model is to think of rsync initially as a more powerful cp and scp rolled into one, then focus on how it differs. The canonical simplest rsync example looks like this:
$ rsync -av source destination

What does this actually do though? To understand that, you first need to unravel the options presented. This takes a while, because they're nested two levels deep! Here's a summary:
-v, --verbose      increase verbosity
-a, --archive      archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive    recurse into directories

-l, --links        copy symlinks as symlinks
-D                 same as --devices --specials
    --devices      preserve device files (super-user only)
    --specials     preserve special files

-t, --times        preserve times
-p, --perms        preserve permissions
-g, --group        preserve group
-o, --owner        preserve owner (super-user only)

I've broken these out into similar groups here. You're going to want verbose on in most cases, outside of automated operations like those run from cron. The first thing to be aware of with this simple recipe is that turning on archive mode means you're going to get recursive directory traversal. The "-l -D" behavior you probably want in most cases, to properly handle special files and symbolic links. You'll almost always want to preserve the times involved too. But whether you want to preserve the user and group information really depends on the situation. If you're copying to a remote system, this might not make any sense at all, which means you can't just use "-a" and will need to decompose the operations here to include all of the remaining ones. In many cases where remote transfer is involved, you'll also want to use "-z" to compress the data in transit.
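As a rough sketch of what decomposing "-a" might look like, the script below copies a throwaway tree with -rltD plus -v, leaving out -p, -g and -o. The temp directory, file names and function name are made up for illustration, and the demo is skipped if rsync isn't installed.

```shell
#!/bin/sh
# Sandbox demo: "-a" minus permissions, group and owner, i.e. -rltD.
# Every path below is a throwaway example created under a temp directory.
demo_decomposed_archive() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/source/sub"
    echo "hello" > "$work/source/sub/file.txt"
    # Give the file an old, known timestamp so -t has something to preserve.
    touch -t 200910131200 "$work/source/sub/file.txt"

    # -r recurse, -l symlinks, -t times, -D devices/specials, -v verbose.
    # The trailing slash on source/ copies its contents, not the directory itself.
    rsync -rltDv "$work/source/" "$work/destination/"
}

if command -v rsync >/dev/null 2>&1; then
    demo_decomposed_archive
else
    echo "rsync not found; skipping demo" >&2
fi
```

Recent rsync versions also accept negated forms such as --no-perms alongside -a, which may be shorter if you only want to drop one or two of the implied behaviors.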

How does rsync make its decisions?

What are the problem spots to be concerned about here, the ones that can eat your data if you're not careful? In order to talk about that, you really need to understand how rsync makes its decisions by default, and its other major modes. Here are the relevant bits from the man page that describe how it decides what files should be transferred; you have to collect the beginning and the details related to a couple of options to figure out the major modes it might run in:
Rsync finds files that need to be transferred using a “quick check” algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file’s data does not need to be updated.

-I, --ignore-times: Normally rsync will skip any files that are already the same size and have the same modification timestamp. This option turns off this “quick check” behavior, causing all files to be updated.

--size-only: This modifies rsync’s “quick check” algorithm for finding files that need to be transferred, changing it from the default of transferring files with either a changed size or a changed last-modified time to just looking for files that have changed in size. This is useful when starting to use rsync after using another mirroring system which may not preserve timestamps exactly.

-c, --checksum: This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a “quick check” that (by default) checks if each file’s size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit MD4 checksum for each file that has a matching size. Generating the checksums means that both sides will expend a lot of disk I/O reading all the data in the files in the transfer (and this is prior to any reading that will be done to transfer changed files), so this can slow things down significantly... Note that rsync always verifies that each transferred file was correctly reconstructed on the receiving side by checking a whole-file checksum that is generated as the file is transferred, but that automatic after-the-transfer verification has nothing to do with this option’s before-the-transfer “Does this file need to be updated?” check.

From this we can assemble the method used on each source file to determine whether to transfer it or not. Once the decision to transfer has been made, the rest of the tests related to that decision are redundant.
  1. Is --ignore-times on? If so, decide to transfer the file.
  2. Do the sizes match? If not, decide to transfer.
  3. Default mode with --size-only and --checksum off: compare the modification times on the file. If they differ in either direction, decide to transfer.
  4. Checksum mode: instead of comparing times, compute remote and local checksums. If they don't match, decide to transfer.
  5. Transfer the file if we decided to above, computing a checksum along the way.
  6. Confirm the transfer checksum matches the original.
  7. Update any attributes we're supposed to manage whether or not the file was transferred.
Understanding rsync's workflow and decision making process is essential if you want to reach the point where you can safely use the really dangerous options like "--delete".
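To make the decision steps concrete, here is a toy model of them in plain shell. It is not rsync's code: the function name needs_transfer and its mode names are my own, cksum stands in for rsync's checksum, and the time test treats any difference in either direction as a reason to transfer, which is how the quick check behaves when --update isn't in play.

```shell
#!/bin/sh
# Toy model of the transfer decision. NOT rsync's implementation, just the
# same logic expressed with standard shell tools.
# Usage: needs_transfer SRC DST [MODE]
# MODE is one of: quick (default), size-only, checksum, ignore-times
needs_transfer() {
    src=$1; dst=$2; mode=${3:-quick}

    # A file missing on the destination always has to be sent.
    if [ ! -e "$dst" ]; then echo transfer; return; fi

    # --ignore-times short-circuits every other test.
    if [ "$mode" = ignore-times ]; then echo transfer; return; fi

    # Differing sizes always mean a transfer.
    src_size=$(($(wc -c < "$src")))
    dst_size=$(($(wc -c < "$dst")))
    if [ "$src_size" -ne "$dst_size" ]; then echo transfer; return; fi

    case $mode in
        size-only)
            echo skip ;;
        checksum)
            # Compare content checksums instead of times.
            if [ "$(cksum < "$src")" = "$(cksum < "$dst")" ]; then
                echo skip
            else
                echo transfer
            fi ;;
        *)
            # Quick check: any difference in modification time counts.
            if [ "$src" -nt "$dst" ] || [ "$dst" -nt "$src" ]; then
                echo transfer
            else
                echo skip
            fi ;;
    esac
}

# Same content and size, but the destination has a later timestamp.
work=$(mktemp -d)
echo "hello" > "$work/src.txt"
echo "hello" > "$work/dst.txt"
touch -t 200910131200 "$work/src.txt"
touch -t 200910131300 "$work/dst.txt"

needs_transfer "$work/src.txt" "$work/dst.txt"            # prints: transfer
needs_transfer "$work/src.txt" "$work/dst.txt" size-only  # prints: skip
needs_transfer "$work/src.txt" "$work/dst.txt" checksum   # prints: skip
```

The model leaves out the post-transfer verification and the attribute updates, since those happen regardless of which test triggered the transfer.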

Common problem spots

One thing to be concerned about even in simple cases is that if you made a copy of something in the past without preserving the times, the copy will have a later timestamp than the original. This can turn ugly if you're trying to get the local additions to a system back to the original again, as all the copies will look like later ones and you'll transfer way more data than you'd expect. If you know you've just added files on a remote system and don't want to touch the ones that are already there, you can use this option:
  --ignore-existing skip updating files that exist on receiver

This will also keep you from making many classes of horrible errors by not allowing it to overwrite files, so turning it on can be extremely helpful when learning rsync in the first place.
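A small sandbox run shows the effect. This is only a sketch: the directory layout, file names and function name are invented for the demo, and it is skipped if rsync isn't installed.

```shell
#!/bin/sh
# Sandbox demo of --ignore-existing: new files are copied, files that already
# exist on the receiving side are left alone even though their content differs.
demo_ignore_existing() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/source" "$work/destination"
    echo "new version" > "$work/source/existing.txt"
    echo "brand new"   > "$work/source/added.txt"
    echo "old version" > "$work/destination/existing.txt"

    rsync -av --ignore-existing "$work/source/" "$work/destination/"

    # existing.txt keeps its old content; added.txt has arrived.
    cat "$work/destination/existing.txt"
    cat "$work/destination/added.txt"
}

if command -v rsync >/dev/null 2>&1; then
    demo_ignore_existing
else
    echo "rsync not found; skipping demo" >&2
fi
```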

If you're not sure what files have changed but always want to prefer the version on the source node, you can save on network bandwidth here by using the checksum option. It can take a while to scan all of the files involved to compute the checksums, but you'll only transfer the ones that changed, even if the modification times match. Another useful option to know about here is --modify-window, which allows you to add some slack into the timestamp comparison, for example if the system clocks involved have low resolution or are a bit out of sync.

Using rsync to compare copies

The sophistication of the options here means that you can get rsync to answer questions like "what files have really changed between these two copies?" without actually doing anything. You just need to use one or both of these options:
-n, --dry-run         perform a trial run with no changes made
-i, --itemize-changes output a change-summary for all updates

When learning how to use rsync in the first place, this should be your standard approach anyway: do a dry run with itemized changes, confirm it's doing what you expected, and then fire it off. You'll learn how the whole thing works that way soon enough. Note that if using checksum mode, those will get computed twice this way, but if your files are big enough that this matters you probably should be really paranoid about messing them up too. An rsync dry run with checksums turned on is a great way to get a high-level "diff" between two directory trees, either locally or remotely, without touching either one of them.
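Here is a sketch of that "diff" trick on two throwaway trees. The file names and function name are invented, and the two versions of changed.txt are deliberately given the same size and timestamp so that only the checksum comparison can tell them apart. The demo is skipped if rsync isn't installed.

```shell
#!/bin/sh
# Sandbox demo: dry run (-n) with itemized changes (-i) and checksums (-c)
# used as a directory "diff". Nothing is written to either tree.
demo_dry_run_diff() {
    work=$(mktemp -d) || return 1
    mkdir -p "$work/a" "$work/b"
    echo "same"  > "$work/a/same.txt"
    echo "same"  > "$work/b/same.txt"
    echo "apple" > "$work/a/changed.txt"
    echo "grape" > "$work/b/changed.txt"
    # Identical sizes and timestamps, so the default quick check sees no change.
    touch -t 200910131200 "$work/a/same.txt" "$work/b/same.txt" \
                          "$work/a/changed.txt" "$work/b/changed.txt"

    report=$(rsync -a -n -i -c "$work/a/" "$work/b/")
    printf '%s\n' "$report"
}

if command -v rsync >/dev/null 2>&1; then
    demo_dry_run_diff
else
    echo "rsync not found; skipping demo" >&2
fi
```

The report should list changed.txt but not same.txt, and b/changed.txt is left untouched because of -n. Dropping -n from the same command line is what would actually perform the copy.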

Other useful parameters to turn on when getting started with rsync are "--stats" and "--progress".

Remote links

Next up are some notes on how the remote links work. If you put a ":" in the name, rsync defaults to using ssh for remote links; again you can think of this as being like scp. Since no admin in their right mind sets up an rsync server nowadays, this is the standard way you're going to want to operate. If you're not using the default ssh port (22), you need to specify it like this:
$ rsync --rsh='ssh -p 12345' source destination

You can abbreviate this to "-e", but I find the long version makes more sense and is easier to remember. You're specifying how it should reach the remote shell, and that's reflected in the long option; the short one just got a random character that wasn't already used.

Wrap-up

That covers the basic rsync internals I wanted to know before I used the tool, and that usually get skipped over. The other tricky bit you should know is how directory handling changes based on whether there's a trailing slash on paths. That's covered elsewhere quite well, so I'm not going to get into it here.

You should know enough now to use rsync and really understand what it's going to do, as well as how to be paranoid about using it. Don't overwrite things unless you know it's safe, always use a dry run for a new candidate rsync command, and break down the options you use to the subset you need if the big options collections like "-a" do more than that.

Where to go from here? In order of increasing knowledge requirements I'd suggest these three links:
  1. rsync Tips & Tricks: This gives some more detail about some of the options you should know about that I skimped on, and covers a lot of odd situations too.
  2. Backups using rsync: Great description of how many of the more obscure parameters actually work. This will suggest what underdocumented parameters like the deletion ones actually do, and suggest how you could use some of them.
  3. Easy Automated Snapshot-Style Backups with Linux and Rsync: The gold mine guide of advanced techniques here. Once past the basics, it's easy to justify studying this for as long as it takes to understand how the whole thing works, as you'll learn a ton about how powerful rsync can be along the way.

1 comment:

DeBiL said...

"If you're copying to remote system, this might not make any sense at all, which means you can't just use "-a" and will need to decompose the operations here to include all of the remaining ones."

You don't need to do that. You can use -a and --no-perms to exclude -p.