In Day 9's post we looked at some ideas for how to do proper backups when using AWS services.
In today's post we'll take a hands-on approach to automating the creation of resources and the actions needed to achieve these kinds of backups, using some bash scripts and the Boto Python library for AWS.
Since IO performance is key for many applications and services, it is common to use your EC2 instance's ephemeral storage and Linux software RAID for your instance's local data storage. While EBS volumes can have erratic performance, they are useful for providing backup storage that's not tied to your instance but is still accessible through a filesystem.
The approach we're going to take is as follows:
- Make a RAID1 array from two EBS volumes and mount it as /backups
- Make a shell script to rsync /data to /backups
- Set the shell script up to run as a cron job
Making the EBS volumes
Adding the EBS volumes to your instance can be done with a simple Boto script:
add-volumes.py
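The original add-volumes.py isn't reproduced here, but a minimal sketch using the boto EC2 API might look like the following. The region, instance ID, availability zone, volume size, and device names are placeholder assumptions; substitute your own values.

```python
import time

import boto.ec2

# Placeholder values -- substitute your own.
REGION = 'us-east-1'
INSTANCE_ID = 'i-12345678'
AVAILABILITY_ZONE = 'us-east-1a'
VOLUME_SIZE_GB = 100
DEVICES = ['/dev/sdf', '/dev/sdg']

conn = boto.ec2.connect_to_region(REGION)

for device in DEVICES:
    # Create an EBS volume in the same availability zone as the instance.
    volume = conn.create_volume(VOLUME_SIZE_GB, AVAILABILITY_ZONE)

    # Wait until the volume is available before attaching it.
    while volume.status != 'available':
        time.sleep(5)
        volume.update()

    # Attach the volume to the instance as a local block device.
    conn.attach_volume(volume.id, INSTANCE_ID, device)
    print("Created and attached %s as %s" % (volume.id, device))
```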
Once you've run this script you'll have two new volumes attached as local devices on your EC2 instance.
Making the RAID1
Now you'll want to make a two-volume RAID1 from the EBS volumes and create a filesystem on it.
The following shell script takes care of this for you:
make-raid1-format.sh
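The original make-raid1-format.sh isn't shown here; the sketch below covers the same steps with mdadm, assuming the volumes appear as /dev/xvdf and /dev/xvdg (on some kernels they show up as /dev/sdf and /dev/sdg) and that ext4 is an acceptable filesystem.

```bash
#!/bin/bash
# Sketch only: device names, filesystem, and mount point are assumptions.
set -e

DEVICE1=/dev/xvdf
DEVICE2=/dev/xvdg
RAID_DEVICE=/dev/md0
MOUNT_POINT=/backups

# Build a two-device RAID1 array from the EBS volumes.
mdadm --create "$RAID_DEVICE" --level=1 --raid-devices=2 "$DEVICE1" "$DEVICE2"

# Record the array so it reassembles on reboot
# (the config may live at /etc/mdadm/mdadm.conf on Debian/Ubuntu).
mdadm --detail --scan >> /etc/mdadm.conf

# Create a filesystem on the array and mount it.
mkfs.ext4 "$RAID_DEVICE"
mkdir -p "$MOUNT_POINT"
mount "$RAID_DEVICE" "$MOUNT_POINT"

# Mount the filesystem automatically at boot.
echo "$RAID_DEVICE $MOUNT_POINT ext4 defaults,noatime 0 0" >> /etc/fstab
```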
Now you have a /backups/ filesystem you can rsync files and folders to as part of your backup process.
rsync shell script
rsync is the best method for syncing data on Linux servers.
The following shell script will use rsync to make backups for you.
rsync-backups.sh
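Again, the original rsync-backups.sh isn't included; a minimal sketch, assuming your data lives under /data and the RAID1 is mounted at /backups, could look like this.

```bash
#!/bin/bash
# Sketch only: source, destination, and log paths are assumptions.
set -e

SOURCE=/data/
DEST=/backups/data/
LOG=/var/log/rsync-backups.log

mkdir -p "$DEST"

# -a preserves permissions, ownership, and timestamps; --delete keeps the
# destination an exact mirror of the source.
rsync -a --delete "$SOURCE" "$DEST" >> "$LOG" 2>&1
```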
Making a cron job
To make this a cron job that runs once a day, you can add a file like the following, which assumes you put rsync-backups.sh in /usr/local/bin.
This cron job will run as root, at 12:15AM in the timezone of the instance.
/etc/cron.d/backups
MAILTO="me@me.biz" 15 00 * * * root /usr/bin/flock -w 10 /var/lock/backups /usr/local/bin/rsync-backups.sh > /dev/null 2>&1
Data Rotation, Retention, Etc
To improve on how your data is rotated and retained, you can explore a number of open source tools.
Now that you've got your data backed up to EBS volumes, or you're using EBS volumes as your primary datastore, you're going to want to ensure a copy of your data exists somewhere else as well. This is where S3 is a great fit.
As you've seen, rsync is often the key tool for moving data around on and between Linux filesystems, so it makes sense that we'd use an rsync-style utility that talks to S3.
For this we'll look at how we can use boto-rsync.
boto-rsync is a rough adaptation of boto's s3put script which has been reengineered to more closely mimic rsync. Its goal is to provide a familiar rsync-like wrapper for boto's S3 and Google Storage interfaces.
By default, the script works recursively and differences between files are checked by comparing file sizes (e.g. rsync's --recursive and --size-only options). If the file exists on the destination but its size differs from the source, then it will be overwritten (unless the -w option is used).
boto-rsync is simple to use, being as easy as
boto-rsync [OPTIONS] /local/path/ s3://bucketname/remote/path/
which assumes you have your AWS keys in ~/.boto or set as environment variables.
boto-rsync has a number of options you'll recognize from rsync; consult the README to get more familiar with them.
As you can see, you can easily couple boto-rsync with a cron job and a small script to get backups going to S3.
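For example, a second cron entry in the same style as the one above could push /backups up to a bucket nightly; the bucket name, schedule, and install path of boto-rsync here are assumptions.

```
MAILTO="me@me.biz"
30 00 * * * root /usr/bin/flock -w 10 /var/lock/s3-backups /usr/local/bin/boto-rsync /backups/ s3://my-backup-bucket/backups/ > /dev/null 2>&1
```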
One of the recent features added to S3 is the ability to use lifecycle policies to archive your S3 objects to Glacier.
You can create a lifecycle policy to archive data in an S3 bucket to Glacier very easily with the following Boto code.
s3-glacier.py
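The original s3-glacier.py isn't reproduced here, but a minimal sketch using boto's S3 lifecycle support might look like the following; the bucket name, rule id, empty prefix, and 30-day transition are assumptions.

```python
import boto
from boto.s3.lifecycle import Lifecycle, Rule, Transition

# Placeholder value -- substitute your own bucket.
BUCKET_NAME = 'my-backup-bucket'

conn = boto.connect_s3()
bucket = conn.get_bucket(BUCKET_NAME)

# Transition objects (all keys, since the prefix is empty) to Glacier
# 30 days after they are created.
transition = Transition(days=30, storage_class='GLACIER')
rule = Rule(id='archive-to-glacier', prefix='', status='Enabled',
            transition=transition)

lifecycle = Lifecycle()
lifecycle.append(rule)

# Apply the lifecycle configuration to the bucket.
bucket.configure_lifecycle(lifecycle)
```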
As you can see, there are many options for automating your backups on AWS in comprehensive and flexible ways, and this post is only the tip of the iceberg.