Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threaded parallel uploads on S3 #1337

Closed
Amoki opened this issue Nov 9, 2023 · 5 comments · Fixed by #1386
Closed

Threaded parallel uploads on S3 #1337

Amoki opened this issue Nov 9, 2023 · 5 comments · Fixed by #1386

Comments

@Amoki
Copy link

Amoki commented Nov 9, 2023

I'm uploading a lot of small files. I can thread the uploads using ThreadPool and it works great.
If I use more than 10 threads, I have the warning Connection pool is full, discarding connection: my.s3.server.url. Connection pool size: 10 .
I found 2 issues about this problem (#1106 and #610).

I tried to use TransferConfig, without success.

AWS_S3_TRANSFER_CONFIG = TransferConfig(
    use_threads=True,
    max_concurrency=70,
)

I tried with use_threads=False and use_threads=True.

What should I do to increase the max number of threads?

@jschneier
Copy link
Owner

Thanks, clearly my answer in the other thread is wrong. The correct config that needs to be overriden is the one on the class level currently, here.

@jschneier
Copy link
Owner

Would definitely take documentation (or a feature!) for this one. It's session config vs transfer config.

@Amoki
Copy link
Author

Amoki commented Nov 22, 2023

Thanks!
I managed to make it work :)

The best way to override the config is not very clear.

class MyS3Storage(S3Storage):
    def __init__(self, **settings):
        # I don't have an easy access to settings before the init
        super().__init__(**settings)
        # Is there any side effect if I change the config after its first definition 
        self.config = Config(
            s3={"addressing_style": self.addressing_style},
            signature_version=self.signature_version,
            proxies=self.proxies,
            max_pool_connections=70,
        )


class MyS3Storage(S3Storage):
    # No easy access to settings here
    config = Config(
        s3={"addressing_style": "virtual"},
        signature_version=5,
        proxies=None,
        max_pool_connections=70,
    )

What do you think about something like:

class S3Storage(CompressStorageMixin, BaseStorage):
    """
    Amazon Simple Storage Service using Boto3

    This storage backend supports opening files in read or write
    mode and supports streaming(buffering) data in chunks to S3
    when writing.
    """

    def __init__(self, **settings):
        omitted = object()

        super().__init__(**settings)

        ...

        self.config = self.get_config()

        ...

    def get_config(self):
        # If config provided in subclass, signature_version and addressing_style
        # settings/args are ignored.
        if hasattr(self, 'config'):
            # Keep retro-compatibility
            return self.config

        return Config(
            s3={"addressing_style": self.addressing_style},
            signature_version=self.signature_version,
            proxies=self.proxies,
        )

@Amoki
Copy link
Author

Amoki commented Nov 24, 2023

Or the same API than TransertConfig

@anentropic
Copy link

anentropic commented Jan 9, 2024

am I right in understanding that using a thread pool TransferConfig will benefit upload of large files (they get turned into concurrent multipart upload) but not really for "lots of small files" case?

seems like underlying Django storage implementation wants to upload each file one-by-one, though there is a workaround for threading multiple files outlined in comments to this old issue: https://code.djangoproject.com/ticket/23517

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants