
Slow S3 Bucket to Bucket copy #787

Closed
muhammad-ammar opened this issue Aug 25, 2016 · 16 comments

Comments

@muhammad-ammar

I have written a Python 3 script that uses boto to copy data from one S3 bucket to another. I have now updated that script to use boto3. The issue is that the bucket-to-bucket copy is very slow compared to the code written with boto.

I have tested the code on my local system as well as on an EC2 instance, but the results are the same.

Both scripts are below. The script written with boto takes around 26 minutes to copy 2 GB of data from one S3 bucket to another. The script written with boto3 takes around 1 hour and 20 minutes to copy the same 2 GB. Both buckets are in the same region.

Can anyone please help me understand the reason for the slowness with boto3?

Code using boto3

from queue import Queue
from threading import Thread

############ boto 2.x
import boto
from boto.s3.key import Key
from boto.s3.connection import S3Connection

############ boto 3
import boto3
import boto3.session
import botocore


class CopyWorker(Thread):

    def __init__(self, queue, src_bucket_name, dst_bucket_name):
        self._queue = queue
        self._src_bucket_name = src_bucket_name
        self._dst_bucket_name = dst_bucket_name
        session = boto3.session.Session()
        s3 = session.resource('s3')
        self._src_bucket, self._dst_bucket = s3.Bucket(self._src_bucket_name), s3.Bucket(self._dst_bucket_name)
        super(CopyWorker, self).__init__()

    def run(self):
        while True:
            key = self._queue.get()
            self._dst_bucket.copy(
                CopySource={
                    'Bucket': self._src_bucket_name,
                    'Key': key.key
                },
                Key=key.key,
            )
            self._queue.task_done()


class S3(object):

    @staticmethod
    def s3_resource():
        return boto3.resource('s3')

    @classmethod
    def copy_files(cls, src_bucket_name, dst_bucket_name, threads=20):
        src_bucket = cls.bucket(src_bucket_name)
        dst_bucket = cls.bucket(dst_bucket_name)
        copy_queue = Queue(maxsize=1000)

        for thread in range(threads):
            worker = CopyWorker(copy_queue, src_bucket_name, dst_bucket_name)
            worker.daemon = True
            worker.start()

        for keys in cls.bucket_keys(src_bucket):
            for key in keys:
                copy_queue.put(key)

        copy_queue.join()

    @classmethod
    def bucket_keys(cls, bucket):
        keys = []
        for key in bucket.objects.all():
            keys.append(key)

            if len(keys) == 1000:
                yield keys
                keys = []

        if keys:
            yield keys

    @classmethod
    def bucket(cls, bucket_name):
        s3 = cls.s3_resource()
        bucket = s3.Bucket(bucket_name)
        try:
            s3.meta.client.head_bucket(Bucket=bucket.name)
        except botocore.exceptions.ClientError as e:
            # If a client error is thrown, then check that it was a 404 error.
            # If it was a 404 error, then the bucket does not exist.
            error_code = int(e.response['Error']['Code'])
            if error_code == 404:
                raise ValueError('{} bucket doesn\'t exist'.format(bucket_name))

        return bucket

S3.copy_files('bucket-prod', 'bucket-bkp')

Code using boto

from queue import Queue
from threading import Thread

from boto.s3.connection import S3Connection


class CopyWorker(Thread):

    def __init__(self, queue, src_bucket_name, dst_bucket_name):
        self._queue = queue
        self._src_bucket_name = src_bucket_name
        self._dst_bucket_name = dst_bucket_name
        self._src_bucket, self._dst_bucket = self.__s3()
        super(CopyWorker, self).__init__()

    def __s3(self):
        conn = S3Connection()
        src_bucket = conn.get_bucket(self._src_bucket_name)
        dst_bucket = conn.get_bucket(self._dst_bucket_name)
        return src_bucket, dst_bucket

    def run(self):
        while True:
            key = self._queue.get()
            self._dst_bucket.copy_key(key.key, self._src_bucket_name, key.key)
            self._queue.task_done()


class S3(object):

    @staticmethod
    def _connect():
        return S3Connection()

    def copy_files(self, src_bucket_name, dst_bucket_name, threads=20):
        s3 = self._connect()
        src_bucket = self._bucket(s3, src_bucket_name)
        dst_bucket = self._bucket(s3, dst_bucket_name)
        copy_queue = Queue(maxsize=1000)

        for thread in range(threads):
            worker = CopyWorker(copy_queue, src_bucket_name, dst_bucket_name)
            worker.daemon = True
            worker.start()

        for keys in self._keys(src_bucket):
            for key in keys:
                copy_queue.put(key)

        copy_queue.join()

    @staticmethod
    def _bucket(connection, bucket_name):
        bucket = connection.lookup(bucket_name)
        if bucket is None:
            raise ValueError('Incorrect Bucket Name >> {}'.format(bucket_name))

        return bucket

    @staticmethod
    def _keys(bucket):
        keys = []
        for key in bucket:
            keys.append(key)

            if len(keys) == 1000:
                yield keys
                keys = []

        if keys:
            yield keys

S3().copy_files('bucket-prod', 'bucket-bkp')
@kyleknap
Contributor

@muhammad-ammar
So one thing to note is that copy() is already multithreaded, as noted in the docs. How large are the files you are transferring, typically?

I have a feeling the threads may be starving themselves, in the sense that you have 20 threads making copy() calls, but each copy() alone can make 10 concurrent requests if the files are relatively large (in terms of MB). So in the worst case you can have 200 threads making requests at the same time, and at that point you may be throttling yourself a lot. If that is the case, one thing you may want to do is configure the managed copy to use fewer threads by passing a TransferConfig object to the copy() method.

Let me know if that helps.

@kyleknap kyleknap added question closing-soon This issue will automatically close in 4 days unless further comments are made. labels Aug 30, 2016
@muhammad-ammar
Author

@kyleknap Thanks for the tip. I will check it today and update you.

@muhammad-ammar
Author

@kyleknap There are 70616 files I am copying from one bucket to another, and only two of them are 1.6 MB in size. All the other files are less than 300 KB. According to TransferConfig, multipart_threshold is 8 MB, which means multiple threads will be used only when a file of at least 8 MB is encountered, which in my case will never happen.

@kyleknap
Contributor

Interesting. That should not be the problem then. We will do some research, try to reproduce this, and get back to you on what we find.

The one thing that does not make sense to me yet: if only CopyObject calls are being made, and assuming request latency is the bottleneck in the transfer process, why is it 3x worse in the boto3 version than in the boto version? That should not be the case. At worst, there should be some small overhead.

@kyleknap kyleknap added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed closing-soon This issue will automatically close in 4 days unless further comments are made. question labels Aug 31, 2016
@muhammad-ammar
Author

Thanks @kyleknap. I will also look into it further to get some hint about what's happening.

@kyleknap
Contributor

kyleknap commented Sep 1, 2016

If you are doing research as well, one thing I would probably try (assuming I could reproduce it) is the client.copy_object() or object.copy_from() methods. Those are the low-level methods and do not have any managed components. That will at least determine whether it is an issue with the high-level transfer methods.

@muhammad-ammar
Author

@kyleknap I think I have already tried object.copy_from(), but I will check them both again. Thanks.

@muhammad-ammar
Author

@kyleknap Looks like client.copy_object() gives a huge performance improvement. Using it, I was able to copy the same amount of data in just 22 minutes. I had little time for testing and only checked it once. I will check it further and update you.

@muhammad-ammar
Author

@kyleknap client.copy_object() improves the performance, and the issue is resolved now. But I am wondering why performance is worse for the higher-level resource objects?

@kyleknap
Contributor

kyleknap commented Sep 6, 2016

@muhammad-ammar so are you seeing copy_object() on the client being much faster than copy_from() on the Bucket or Object resource? I wonder if the issue is that some extra loading of the resource is happening. What does the client version of the code look like?

@muhammad-ammar
Author

muhammad-ammar commented Sep 7, 2016

self.s3.meta.client.copy_object(
    CopySource={
        'Bucket': self._src_bucket_name,
        'Key': key.key
    },
    Bucket=self._dst_bucket_name,
    Key=key.key
)

@kyleknap This is what I changed in the CopyWorker class. After this change, the bucket-to-bucket copy has improved, and the average time to copy the same amount of data is now less than the time taken by the old code.

@kyleknap
Contributor

kyleknap commented Sep 8, 2016

Thanks for looking into this. We will try this out too, to see if we get a difference. One way to look at what may be going on is to use boto3.set_stream_logger('') to see all of the debug logs and find where the slowness is coming from. You can also filter by package, so I would suggest replacing the empty string with 'botocore' or 'boto3' to get more focused debug logs.

@muhammad-ammar
Author

Thanks Kyle, I will definitely try this.


@mukeshyadav1997

Hi @muhammad-ammar
I need some help: I want to pass a list of objects to be copied from one bucket to another. Can someone please help me understand how to use such a list as input to the class?

@mukeshyadav1997

mukeshyadav1997 commented Jul 26, 2020

I have keys stored in a list. In this script we query the bucket for the keys, but in my case I already have the list of keys, so the only thing I want to know is how I can pass my list of keys to the class.
Also, please help me understand what happens in each class method, as I am new to multithreading and classes.

@swetashre swetashre removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Sep 14, 2020
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to reopen it.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 14, 2021