
Offsite Rsync Backups Using Amazon S3

 

Persistent Storage

Backing up with rsync to Amazon is about to get a lot easier:

 > We would like to share some details about a major upcoming feature
 > that many of you have requested - persistent storage for EC2.
 >
 > This new feature provides reliable, persistent storage volumes, for
 > use with Amazon EC2 instances.  These volumes exist independently from
 > any Amazon EC2 instances, and will behave like raw, unformatted hard
 > drives or block devices, which may then be formatted and configured
 > based on the needs of your application.  The volumes will be
 > significantly more durable than the local disks within an Amazon EC2
 > instance.  Additionally, our persistent storage feature will enable
 > you to automatically create snapshots of your volumes and back them up
 > to Amazon S3 for even greater reliability.
 > You will be able to create volumes ranging in size from 1 GB to 1 TB,
 > and will be able to attach multiple volumes to a single instance.
 > Volumes are designed for high throughput, low latency access from
 > Amazon EC2, and can be attached to any running EC2 instance where they
 > will show up as a device inside of the instance.  This feature will
 > make it even easier to run everything from relational databases to
 > distributed file systems to Hadoop processing clusters using Amazon
 > EC2.
 >
 > When persistent storage is launched, Amazon EC2 will be adding several
 > new APIs to support the persistent storage feature.  Included will be
 > calls to manage your volume (CreateVolume, DeleteVolume), mount your
 > volume to your instance (AttachVolume, DetachVolume) and save
 > snapshots to Amazon S3 (CreateSnapshot, DeleteSnapshot).
 >
 > This new functionality is already being used privately by a handful of
 > EC2 customers, and will be publicly available later this year.  We
 > will be expanding the private offering as we get closer to launch.
 > Please go to the web page below to sign up if you are interested in
 > participating.
 >
 > Sign Up:
 > http://www.amazon.com/gp/html-forms-controller/ec2-persistent-storage

My original idea on how to solve this problem, sans persistent storage, is below.


Amazon's S3 is a great service that provides off-site, internet accessible storage at stunningly cheap prices.

But they don't support rsync, which is a crying shame.

This is the theory of a work-around, which would give you:

At today's prices it would be ~$25/month. Half as much storage (50 GB) would be $15. It'll scale up too.

The only thing missing is the implementation, which is left as an exercise for the reader. If you have an implementation, I'll be happy to link to it.

Ingredients

Amazon EC2

Not storage, but a remote "virtual" computer you can ssh into and do stuff with. We don't need it 24/7, so we only pay for the hours it's running, which should be few.

Amazon S3

You can create, start/stop, and terminate EC2 instances via command-line tools, although they require a Java runtime. Once an instance is running, you can ssh to it as root, without a password, using a key pair.

This is all automatable with about a dozen lines of bash, including getting the domain/IP of your newly created instance.
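As a rough illustration of the "getting the domain/IP" step, here is a minimal Python sketch (rather than bash) that picks the public hostname out of the text printed by ec2-describe-instances. It assumes the output is tab-separated with the fields INSTANCE, instance id, AMI id, public DNS, private DNS, state; that layout is an assumption worth checking against your version of the tools. The function name and the sample line are made up for illustration.

```python
from typing import Optional


def public_dns_of(describe_output: str, instance_id: str) -> Optional[str]:
    """Return the public DNS name of a running instance, or None if it isn't ready."""
    for line in describe_output.splitlines():
        fields = line.split("\t")
        # Assumed field layout (not verified here):
        # INSTANCE, instance id, AMI id, public DNS, private DNS, state, ...
        if len(fields) >= 6 and fields[0] == "INSTANCE" and fields[1] == instance_id:
            if fields[5] == "running" and fields[3]:
                return fields[3]
    return None


# Made-up sample line, for illustration only.
sample = "\t".join([
    "INSTANCE", "i-00000000", "ami-00000000",
    "ec2-host.example.com", "internal.example.com", "running", "my-keypair",
])
print(public_dns_of(sample, "i-00000000"))
```

In practice you would poll until this returns a hostname, then hand it to ssh and rsync.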

Loopback Filesystem

We need a loopback-style filesystem, henceforth called LBFS. Ideally, we want a "growable" one, which is only as large as the data it contains, but for now I'm going to assume a simple, standard LBFS that is 100 GB in size.
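One cheap approximation of "growable" is a sparse image file: it has a large apparent size but only occupies disk blocks once data is written. The sketch below is one possible approach, not necessarily what the final implementation should use. It assumes Linux (for st_blocks) and a filesystem that supports sparse files; the file name is hypothetical and the demo uses 64 MiB rather than 100 GB.

```python
import os
import tempfile


def create_sparse_image(path: str, size_bytes: int) -> None:
    """Create a file with the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "backup.img")      # hypothetical file name
    create_sparse_image(image, 64 * 1024**2)     # 64 MiB for the demo
    st = os.stat(image)
    apparent = st.st_size                        # what `ls -l` reports
    on_disk = st.st_blocks * 512                 # st_blocks counts 512-byte units on Linux
    print(apparent, on_disk)
```

Turning the image into a usable LBFS still needs a filesystem created on it (for example with mkfs) and a loop mount as root, which is outside this sketch.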

Implementation

Per backup-run:

In addition to bandwidth charges, you pay for the S3 storage and $0.10 per (wall-clock) hour that your EC2 instance runs to complete this task.

1 hour @ 40 KB/sec upload on ADSL =~ 140 MB. Let's assume that is a decent daily average. Conveniently, 140 MB/day =~ 100 GB every 2 years.
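The upload arithmetic can be checked in a few lines. This sketch assumes decimal units (1 KB = 1000 bytes) and that the rates are in bytes, not bits, which is what makes the numbers line up.

```python
KB, MB, GB = 1000, 1000**2, 1000**3

upload_rate = 40 * KB                 # bytes per second on ADSL
per_hour = upload_rate * 3600         # bytes uploaded in one hour
daily = 140 * MB                      # the rounded-down daily average
two_years = daily * 365 * 2           # bytes uploaded over two years

print(per_hour / MB)                  # MB per hour of uploading
print(two_years / GB)                 # GB after two years
```

One hour comes out at 144 MB, and two years of 140 MB/day at about 102 GB.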

Fetching/putting the 100 GB LBFS: at, say, 5 MB/sec, that's =~ 6 hours each way. This is "internal" bandwidth between Amazon machines, which is free.
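For the transfer time, a quick check under the same assumptions (decimal units, rate in megabytes per second); transfer_hours is just a helper name for this sketch.

```python
def transfer_hours(size_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mb_per_sec megabytes/second."""
    seconds = size_gb * 1000 / rate_mb_per_sec
    return seconds / 3600


one_way = transfer_hours(100, 5)
print(round(one_way, 1))        # hours each way
print(round(2 * one_way, 1))    # hours for fetch plus put
```

That gives roughly 5.6 hours each way, so "6 hours" is a slightly pessimistic round-up, and about 11 hours of instance time for the round trip before any rsync work.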

5 MB/sec is the one untested part of my theory, but some googling suggests it's not unreasonable, and by running the 20 GETs in parallel we may improve on it. Anyway...
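Running the GETs in parallel could look something like the sketch below, which uses a thread pool. The fetch_chunk function is a stand-in that only simulates a download; a real version would issue one S3 GET per chunk. The chunk naming is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

NUM_CHUNKS = 20  # the "20 GETs" mentioned above


def fetch_chunk(index: int) -> bytes:
    """Stand-in for one S3 GET; returns fake data instead of downloading."""
    return f"chunk-{index:02d}".encode()


def fetch_all(num_chunks: int = NUM_CHUNKS, workers: int = NUM_CHUNKS) -> List[bytes]:
    """Fetch every chunk concurrently and return them in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever order they finish in.
        return list(pool.map(fetch_chunk, range(num_chunks)))


chunks = fetch_all()
print(len(chunks), chunks[0], chunks[-1])
```

Whether this actually beats a single 5 MB/sec stream is exactly the untested part; threads only help if the bottleneck is per-connection rather than the instance's network link.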

Costs

Running the above:

Assume we run this once per week: ($2.30 * 4) + $15/month for S3 storing 100 GB.

~$25/month for 100 GB of redundant, offsite storage backed up weekly via rsync, plus some additional command-line shenanigans that you only have to write once.
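The monthly figure can be reproduced as follows. The $2.30 per run and $15/month for storage come from the numbers above; the $0.15 per GB-month rate is only inferred from "$15/month for 100 GB", and the function name is made up for this sketch.

```python
def monthly_cost(runs_per_month: int, cost_per_run: float,
                 storage_gb: float, storage_rate_per_gb: float = 0.15) -> float:
    """Per-run charges plus S3 storage for one month, in dollars."""
    return runs_per_month * cost_per_run + storage_gb * storage_rate_per_gb


weekly = monthly_cost(runs_per_month=4, cost_per_run=2.30, storage_gb=100)
print(round(weekly, 2))
```

That works out to $24.20, which is where the rounded "~$25/month" comes from.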

If you left your EC2 instance running 24/7, it would be $75/month, which is more than bytemark.co.uk, but then you get 140 GB of disk space included and 1.7 GB of RAM. AFAIK, your IP address remains static as long as the instance remains running, but they don't make promises about uptime. Don't shoot me if they change your IP; the notion that it doesn't change is just a theory.
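The always-on figure is just the hourly rate multiplied out; a quick sketch, assuming the $0.10/hour instance price quoted earlier and ignoring storage and bandwidth:

```python
HOURLY_RATE = 0.10  # dollars per instance-hour


def always_on_cost(days_in_month: int) -> float:
    """Instance charge for running 24/7 for the given number of days."""
    return HOURLY_RATE * 24 * days_in_month


print(round(always_on_cost(30), 2))
print(round(always_on_cost(31), 2))
```

So $72 to $74.40 depending on the month, or roughly $75.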

With 50 GB instead of 100 GB, it's ~$15/month all-in on Amazon.

The equivalent $25 on rsync.net gets you 15 GB, but more frequent backups if you need 'em. That said, run your Amazon instance 24/7 and you can back up stuff as often as you like. I'm struggling to see a likely scenario where using rsync.net is better than the Amazon route, with the possible exception that to go the Amazon route you (or someone) have to write the various bits of wrapping code to automate the process.

If you want to get in touch, you can find my details on the About page.
