
Slow S3 Bucket to Bucket copy #787

Closed
muhammad-ammar opened this issue Aug 25, 2016 · 16 comments

Comments

@muhammad-ammar

I have written a Python 3 script that uses boto to copy data from one S3 bucket to another. I have now updated that script to use boto3. The issue is that the bucket-to-bucket copy is very slow compared to the code written with boto.

I have tested the code on my local system as well as on an EC2 instance, but the results are the same.

Both scripts are below. The script written with boto takes around 26 minutes to copy 2 GB of data from one S3 bucket to another. The script written with boto3 takes around 1 hour and 20 minutes to copy the same 2 GB. Both buckets are in the same region.

Can anyone please help me understand the reason for the slowness with boto3?

Code using boto3

from queue import Queue
from threading import Thread

############ boto 2.x
import boto
from boto.s3.key import Key
from boto.s3.connection import S3Connection

############ boto 3
import boto3
import boto3.session
import botocore


class CopyWorker(Thread):

    def __init__(self, queue, src_bucket_name, dst_bucket_name):
        self._queue = queue
        self._src_bucket_name = src_bucket_name
        self._dst_bucket_name = dst_bucket_name
        session = boto3.session.Session()
        s3 = session.resource('s3')
        self._src_bucket, self._dst_bucket = s3.Bucket(self._src_bucket_name), s3.Bucket(self._dst_bucket_name)
        super(CopyWorker, self).__init__()

    def run(self):
        while True:
            key = self._queue.get()
            self._dst_bucket.copy(
                CopySource={
                    'Bucket': self._src_bucket_name,
                    'Key': key.key
                },
                Key=key.key,
            )
            self._queue.task_done()


class S3(object):

    @staticmethod
    def s3_resource():
        return boto3.resource('s3')

    @classmethod
    def copy_files(cls, src_bucket_name, dst_bucket_name, threads=20):
        src_bucket = cls.bucket(src_bucket_name)
        dst_bucket = cls.bucket(dst_bucket_name)
        copy_queue = Queue(maxsize=1000)

        for thread in range(threads):
            worker = CopyWorker(copy_queue, src_bucket_name, dst_bucket_name)
            worker.daemon = True
            worker.start()

        for keys in cls.bucket_keys(src_bucket):
            for key in keys:
                copy_queue.put(key)

        copy_queue.join()

    @classmethod
    def bucket_keys(cls, bucket):
        keys = []
        for key in bucket.objects.all():
            keys.append(key)

            if len(keys) == 1000:
                yield keys
                keys = []

        if keys:
            yield keys

    @classmethod
    def bucket(cls, bucket_name):
        s3 = cls.s3_resource()
        bucket = s3.Bucket(bucket_name)
        try:
            s3.meta.client.head_bucket(Bucket=bucket.name)
        except botocore.exceptions.ClientError as e:
            # If a client error is thrown, then check that it was a 404 error.
            # If it was a 404 error, then the bucket does not exist.
            error_code = int(e.response['Error']['Code'])
            if error_code == 404:
                raise ValueError('{} bucket doesn\'t exist'.format(bucket_name))

        return bucket

S3.copy_files('bucket-prod', 'bucket-bkp')

Code using boto

from queue import Queue
from threading import Thread

from boto.s3.connection import S3Connection


class CopyWorker(Thread):

    def __init__(self, queue, src_bucket_name, dst_bucket_name):
        self._queue = queue
        self._src_bucket_name = src_bucket_name
        self._dst_bucket_name = dst_bucket_name
        self._src_bucket, self._dst_bucket = self.__s3()
        super(CopyWorker, self).__init__()

    def __s3(self):
        conn = S3Connection()
        src_bucket = conn.get_bucket(self._src_bucket_name)
        dst_bucket = conn.get_bucket(self._dst_bucket_name)
        return src_bucket, dst_bucket

    def run(self):
        while True:
            key = self._queue.get()
            self._dst_bucket.copy_key(key.key, self._src_bucket_name, key.key)
            self._queue.task_done()


class S3(object):

    @staticmethod
    def _connect():
        return S3Connection()

    def copy_files(self, src_bucket_name, dst_bucket_name, threads=20):
        s3 = self._connect()
        src_bucket = self._bucket(s3, src_bucket_name)
        dst_bucket = self._bucket(s3, dst_bucket_name)
        copy_queue = Queue(maxsize=1000)

        for thread in range(threads):
            worker = CopyWorker(copy_queue, src_bucket_name, dst_bucket_name)
            worker.daemon = True
            worker.start()

        for keys in self._keys(src_bucket):
            for key in keys:
                copy_queue.put(key)

        copy_queue.join()

    @staticmethod
    def _bucket(connection, bucket_name):
        bucket = connection.lookup(bucket_name)
        if bucket is None:
            raise ValueError('Incorrect Bucket Name >> {}'.format(bucket_name))

        return bucket

    @staticmethod
    def _keys(bucket):
        keys = []
        for key in bucket:
            keys.append(key)

            if len(keys) == 1000:
                yield keys
                keys = []

        if keys:
            yield keys

S3().copy_files('bucket-prod', 'bucket-bkp')
@kyleknap
Contributor

@muhammad-ammar
So one thing to note is that copy() is already multithreaded, as noted in the docs. How large are the files you are transferring, typically?

I have a feeling the threads may be starving themselves, in the sense that you have 20 threads making copy() calls, but each copy() alone can make 10 concurrent requests if the files are relatively large (in terms of MB). So in the worst case you can have 200 threads making requests at the same time, and at that point you may be throttling yourself a lot. If that is the case, one thing you may want to do is configure the managed copy to use fewer threads by passing a TransferConfig object to the copy() method.

Let me know if that helps.

@kyleknap kyleknap added question closing-soon This issue will automatically close in 4 days unless further comments are made. labels Aug 30, 2016
@muhammad-ammar
Author

@kyleknap Thanks for the tip. I will check it today and update you.

@muhammad-ammar
Author

@kyleknap There are 70616 files I am copying from one bucket to another, and only two of them are 1.6 MB in size. All the other files are less than 300 KB. According to TransferConfig, multipart_threshold is 8 MB, which means multiple threads will be used only when a file of at least 8 MB is encountered, which in my case will never happen.

@kyleknap
Contributor

Interesting. That should not be the problem then. We will do some research, try to reproduce this, and get back to you on what we find.

The one thing that does not make sense to me yet: if only CopyObject calls are being made, and assuming request latency is the bottleneck in the transfer process, why is it 3x worse in the boto3 version than in the boto version? That should not be the case. At worst, there should be some small overhead.

@kyleknap kyleknap added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed closing-soon This issue will automatically close in 4 days unless further comments are made. question labels Aug 31, 2016
@muhammad-ammar
Author

Thanks @kyleknap. I will also look into it further to get some hint about what's happening.

@kyleknap
Contributor

kyleknap commented Sep 1, 2016

If you are doing research as well, one thing I would probably try (assuming I could reproduce it) is the client.copy_object() or object.copy_from() methods. Those are the low-level methods and do not have any managed components. That will at least determine whether it is an issue with the high-level transfer methods.

@muhammad-ammar
Author

@kyleknap I think I have already tried object.copy_from(), but I will check them both again. Thanks.

@muhammad-ammar
Author

@kyleknap Looks like client.copy_object() gives a huge performance improvement. Using it, I was able to copy the same amount of data in just 22 minutes. I had little time for testing and only checked it once. I will check it further and update you.

@muhammad-ammar
Author

@kyleknap client.copy_object() improves the performance, and the issue is resolved now. But I am wondering why performance is worse for the higher-level resource objects?

@kyleknap
Contributor

kyleknap commented Sep 6, 2016

@muhammad-ammar so are you seeing copy_object() on the client being much faster than copy_from() on the Bucket or Object resource? I wonder if the issue is that some extra loading of the resource is happening. What does the client version of the code look like?

@muhammad-ammar
Author

muhammad-ammar commented Sep 7, 2016

self.s3.meta.client.copy_object(
    CopySource={
        'Bucket': self._src_bucket_name,
        'Key': key.key
    },
    Bucket=self._dst_bucket_name,
    Key=key.key
)

@kyleknap This is what I changed in the CopyWorker class. After this change, the bucket-to-bucket copy has improved, and the average time to copy the same amount of data is now less than the time taken by the old code.

@kyleknap
Contributor

kyleknap commented Sep 8, 2016

Thanks for looking into this. We will try this out too, to see if we get a difference. One way to look at what may be going on is to use boto3.set_stream_logger('') to see all of the debug logs and find where the slowness is coming from. You can also filter by package, so I would suggest replacing the empty string with 'botocore' or 'boto3' to get more focused debug logs.

@muhammad-ammar
Author

Thanks Kyle, I will definitely try this.


@mukeshyadav1997

Hi @muhammad-ammar
I need some help: I want to pass a list of objects to be copied from one bucket to another. Can someone please help me understand how to use such a list as input to the class?

@mukeshyadav1997

mukeshyadav1997 commented Jul 26, 2020

I have keys stored in a list. In this script we query the bucket for the keys, but in my case I already have the list of keys, so the only thing I want to know is how I can pass my list of keys to the class.
Also, please help me understand what happens in each class method, as I am new to multithreading and classes.

@swetashre swetashre removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Sep 14, 2020
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to reopen it.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 14, 2021