sources/tech/20190508 How to use advanced rsync for large Linux backups.md
6.9 KiB
How to use advanced rsync for large Linux backups
Basic rsync commands are usually enough to manage your Linux backups, but a few extra options add speed and power to large backup sets.
It seems clear that backups are always a hot topic in the Linux world. Back in 2017, David Both offered Opensource.com readers tips on "Using rsync to back up your Linux system," and earlier this year, he published a poll asking us, "What's your primary backup strategy for the /home directory in Linux?" In another poll this year, Don Watkins asked, "Which open source backup solution do you use?"
My response is rsync. I really like rsync! There are plenty of large and complex tools on the market that may be necessary for managing tape drives or storage library devices, but a simple open source command line tool may be all you need.
Basic rsync
I managed the binary repository system for a global organization that had roughly 35,000 developers with multiple terabytes of files. I regularly moved or archived hundreds of gigabytes of data at a time. Rsync was used. This experience gave me confidence in this simple tool. (So, yes, I use it at home to back up my Linux systems.)
The basic rsync command is simple.
`rsync -av SRC DST`
Indeed, the rsync commands taught in any tutorial will work fine for most general situations. However, suppose we need to back up a very large amount of data. Something like a directory with 2,000 sub-directories, each holding anywhere from 50GB to 700GB of data. Running rsync on this directory could take a tremendous amount of time, particularly if you're using the checksum option, which I prefer.
Performance is likely to suffer if we try to sync large amounts of data or sync across slow network connections. Let me show you some methods I use to ensure good performance and reliability.
Advanced rsync
One of the first lines that appears when rsync runs is: "sending incremental file list." If you do a search for this line, you'll see many questions asking things like: why is it taking forever? or why does it seem to hang up?
Here's an example based on this scenario. Let's say we have a directory called /storage that we want to back up to an external USB device mounted at /media/WDPassport.
If we want to back up /storage to a USB external drive, we could use this command:
`rsync -cav /storage /media/WDPassport`
The c option tells rsync to use file checksums instead of timestamps to determine changed files, and this usually takes longer. In order to break down the /storage directory, I sync by subdirectory, using the find command. Here's an example:
`find /storage -type d -exec rsync -cav {} /media/WDPassport \;`
This looks OK, but if there are any files in the /storage directory, they will not be copied. So, how can we sync the files in /storage? There is also a small nuance where certain options will cause rsync to sync the . directory, which is the root of the source directory; this means it will sync the subdirectories twice, and we don't want that.
Long story short, the solution I settled on is a "double-incremental" script. This allows me to break down a directory, for example, breaking /home into the individual users' home directories or in cases when you have multiple large directories, such as music or family photos.
Here is an example of my script:
HOMES="alan"
DRIVE="/media/WDPassport"
for HOME in $HOMES; do
cd /home/$HOME
rsync -cdlptgov --delete . /$DRIVE/$HOME
find . -maxdepth 1 -type d -not -name "." -exec rsync -crlptgov --delete {} /$DRIVE/$HOME \;
done
The first rsync command copies the files and directories that it finds in the source directory. However, it leaves the directories empty so we can iterate through them using the find command. This is done by passing the d argument, which tells rsync not to recurse the directory.
`-d, --dirs transfer directories without recursing`
The find command then passes each directory to rsync individually. Rsync then copies the directories' contents. This is done by passing the r argument, which tells rsync to recurse the directory.
`-r, --recursive recurse into directories`
This keeps the increment file that rsync uses to a manageable size.
Most rsync tutorials use the a (or archive ) argument for convenience. This is actually a compound argument.
`-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)`
The other arguments that I pass would have been included in the a ; those are l , p , t , g , and o.
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
The --delete option tells rsync to remove any files on the destination that no longer exist on the source. This way, the result is an exact duplication. You can also add an exclude for the .Trash directories or perhaps the .DS_Store files created by MacOS.
`-not -name ".Trash*" -not -name ".DS_Store"`
Be careful
One final recommendation: rsync can be a destructive command. Luckily, its thoughtful creators provided the ability to do "dry runs." If we include the n option, rsync will display the expected output without writing any data.
`rsync -cdlptgovn --delete . /$DRIVE/$HOME`
This script is scalable to very large storage sizes and large latency or slow link situations. I'm sure there is still room for improvement, as there always is. If you have suggestions, please share them in the comments.
via: https://opensource.com/article/19/5/advanced-rsync
作者:Alan Formy-Duval 选题:lujun9972 译者:译者ID 校对:校对者ID