Syncing between large S3 buckets

My problem is as follows. I want to copy many, many files from one S3 bucket to another. So many, in fact, that s3cmd explodes with Killed because the OOM killer has taken exception to it listing every single file in both buckets before starting (yes, I am using the latest version).

This script assumes that the copying account can read from the source bucket. If you want to copy files from a bucket owned by a different user, give the user running the script access to the bucket by visiting the S3 console and adding the username of the copying user (email address works fine) as a “Grantee” in the Permissions for the source bucket, with “list” permissions.
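
If you’d rather not click through the console, the same grant can be made from code. Here’s a minimal sketch using the boto 2 API the script below already depends on; the SOURCE_OWNER_* keys are placeholders, it has to run with the source bucket owner’s credentials, and ‘READ’ on a bucket is what the console surfaces as “list”:

# Sketch: grant the copying account list access to the source bucket.
# SOURCE_OWNER_ACCESS_KEY_ID / SOURCE_OWNER_SECRET_ACCESS_KEY are placeholders
# for the *owner's* credentials, not the copying account's.
from boto.s3.connection import S3Connection

owner_conn = S3Connection(SOURCE_OWNER_ACCESS_KEY_ID, SOURCE_OWNER_SECRET_ACCESS_KEY)
source = owner_conn.get_bucket("source-bucket-name")
# 'READ' on the bucket itself is the ACL permission behind the console's "list".
source.add_email_grant('READ', 'copying-user@example.com')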

It’s a very stupid script. It’ll enumerate every file in the source bucket, then copy it to the target bucket blindly unless there’s already a file in the target with the same name. It doesn’t do checksumming, size comparison, or anything clever. It won’t delete files in the target that don’t exist in the source. It won’t update changed files. But it’s faster than s3cmd, it starts doing things instantly, it’s pretty cheap (and safe) to kill and restart, and it’ll run on a tiny EC2 instance, so you won’t incur bandwidth transfer charges (you’ll still pay for requests; this isn’t free).

ASSETS_AWS_ACCESS_KEY_ID = '...'
ASSETS_AWS_SECRET_ACCESS_KEY = '...'
BUCKET_FROM = "source-bucket-name"
BUCKET_TO = "target-bucket-name"

import boto # developed on boto 2.9.6.
from boto.s3.connection import S3Connection
conn = S3Connection(ASSETS_AWS_ACCESS_KEY_ID, ASSETS_AWS_SECRET_ACCESS_KEY)
source = conn.get_bucket(BUCKET_FROM)
target = conn.get_bucket(BUCKET_TO)

# .list() is a magical iterator object; it'll make
# more requests of S3 as needed
for idx, entry in enumerate(source.list()):
    if entry.name.endswith("/"):
        continue
    print idx, entry.name
    if not target.get_key(entry.name):
        # this is a trade-off. Checking for target existence makes the first
        # run slower, but subsequent runs much faster, assuming only a subset
        # of files change.
        print "..copying"
        try:
            entry.copy(dst_bucket=target, dst_key=entry.name, validate_dst_bucket=False)
        except boto.exception.S3ResponseError, e:
            # Only copying files I have access to. Never bomb out half-way.
            print e

print "all done!"

(Side note. Be nice to future maintainers – don’t use the S3 Virtual Hosting stuff. You can’t rename buckets, and you can’t re-parent them, so you’re stuck using that bucket forever. Put the files in a bucket with any old name, then use CloudFront to serve them.)
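
For what it’s worth, the CloudFront half of that advice is scriptable with the same boto version. A rough sketch, with the bucket name and comment as placeholders:

# Sketch: put a CloudFront distribution in front of an arbitrarily-named bucket.
import boto
from boto.cloudfront.origin import S3Origin

cf = boto.connect_cloudfront(ASSETS_AWS_ACCESS_KEY_ID, ASSETS_AWS_SECRET_ACCESS_KEY)
origin = S3Origin('any-old-bucket-name.s3.amazonaws.com')
dist = cf.create_distribution(origin=origin, enabled=True, comment='assets')
print dist.domain_name  # point your CNAME at this, not at the bucket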

Links on distributed databases and replication

Reading a RWW article on non-relational databases, I came across the term ‘Eventual Consistency’ which is something I’ve seen a couple of times recently. I immediately and loudly demanded that mattb tell me what it meant. He proceeded to dump waaay too much reading material on me almost instantly, which tells me that I’m onto something. I hereby relay the following, so that I don’t lose the links:

  • Eventually Consistent – Revisited by Werner Vogels (Amazon CTO). A nice overview of what the term means, including a list of things that you take for granted and aren’t guaranteed, things you don’t take for granted and aren’t guaranteed, and things that it never occurred to you to doubt, that aren’t guaranteed. For instance…

  • ..suppose you wanted to sync with a database traveling in a different relativistic frame? Well, ok, maybe slightly less serious than that, but things can still disagree on what the time is. Leslie Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System.

  • Amazon S3 Availability Event: July 20, 2008 – Amazon S3 fell over a few months ago. I remember this because all my Twitter icons went away. Anyway, this is why. Interesting in the context of the other two.