Tom Insam

Syncing between large S3 buckets

My problem is as follows. I want to copy many, many files from one S3 bucket to another. So many, in fact, that s3cmd explodes with “Killed”, because the OOM killer has taken exception to it listing every single file in both buckets before starting (yes, I am using the latest version).

This script assumes that the copying account can read from the source bucket. If you want to copy files from a bucket owned by a different user, give the user running the script access to the bucket: visit the S3 console and add the username of the copying user (an email address works fine) as a “Grantee” in the Permissions for the source bucket, with “list” permission.
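
If you'd rather script that grant than click around the console, boto exposes the same thing as Bucket.add_email_grant: a “READ” grant on the bucket is what the console UI calls “list”. A minimal sketch, assuming you have credentials for the source bucket's owner (the variable names and email address below are placeholders):

import boto
from boto.s3.connection import S3Connection

# placeholder credentials: these must belong to the *owner* of the source bucket
SOURCE_OWNER_ACCESS_KEY_ID = '...'
SOURCE_OWNER_SECRET_ACCESS_KEY = '...'
# placeholder: the email address of the AWS account that will run the copy
COPYING_USER_EMAIL = 'copier@example.com'

conn = S3Connection(SOURCE_OWNER_ACCESS_KEY_ID, SOURCE_OWNER_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('source-bucket-name')

# READ on the bucket itself is the "list" permission from the console UI.
# recursive=False (the default) leaves the individual keys' ACLs alone.
bucket.add_email_grant('READ', COPYING_USER_EMAIL)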

It’s a very stupid script. It’ll enumerate every file in the source bucket, then blindly copy each one to the target bucket unless there’s already a file in the target with the same name. It doesn’t do checksumming, size comparison, or anything clever. It won’t delete files in the target that don’t exist in the source. It won’t update changed files. But it’s faster than s3cmd, it starts doing things instantly, it’s pretty cheap (and safe) to kill and restart, and it’ll run on a tiny EC2 instance, so you won’t incur bandwidth transfer charges (you’ll still pay for requests; this isn’t free).

ASSETS_AWS_ACCESS_KEY_ID = '...'
ASSETS_AWS_SECRET_ACCESS_KEY = '...'
BUCKET_FROM = "source-bucket-name"
BUCKET_TO = "target-bucket-name"

import boto # developed on boto 2.9.6.
from boto.s3.connection import S3Connection
conn = S3Connection(ASSETS_AWS_ACCESS_KEY_ID, ASSETS_AWS_SECRET_ACCESS_KEY)
source = conn.get_bucket(BUCKET_FROM)
target = conn.get_bucket(BUCKET_TO)

# .list() is a magical iterator object, it'll make
# more requests of S3 as needed
for idx, entry in enumerate(source.list()):
    if entry.name.endswith("/"):
        # skip "folder" placeholder keys (zero-byte objects some tools create)
        continue
    print idx, entry.name
    if not target.get_key(entry.name):
        # this is a trade-off. Checking for target existence makes the first
        # run slower, but subsequent runs much faster, assuming only a subset
        # of files change.
        print "..copying"
        try:
            entry.copy(dst_bucket=target, dst_key=entry.name, validate_dst_bucket=False)
        except boto.exception.S3ResponseError, e:
            # Only copying files I have access to. Never bomb out half-way.
            print e

print "all done!"

(Side note. Be nice to future maintainers: don’t use the S3 Virtual Hosting stuff. You can’t rename buckets, and you can’t re-parent them, so you’re stuck with that bucket name forever. Put the files in a bucket with any old name, then use CloudFront to serve them; a rough sketch of that follows.)
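
If you go that route, here’s a rough sketch of the CloudFront side using boto 2’s CloudFront support, reusing the credential constants from the script above. The bucket name and comment are placeholders, and you’ll want to review the distribution’s settings before relying on it:

import boto
from boto.cloudfront.origin import S3Origin

cf = boto.connect_cloudfront(ASSETS_AWS_ACCESS_KEY_ID, ASSETS_AWS_SECRET_ACCESS_KEY)

# the origin is the bucket's plain S3 endpoint; the bucket's name never
# appears in the URLs you hand out, so it can be anything
distribution = cf.create_distribution(
    origin=S3Origin('target-bucket-name.s3.amazonaws.com'),
    enabled=True,
    comment='assets, served from an arbitrarily-named bucket')

# point a CNAME at this domain and the bucket name stops mattering
print distribution.domain_name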