S3Client to use Boto3 #2149
Conversation
Super excited for this! Thanks for taking a stab at the update, @ouanixi!
luigi/contrib/s3.py
Outdated
  (bucket, key) = self._path_to_bucket_and_key(destination_s3_path)

  # grab and validate the bucket
- s3_bucket = self.s3.get_bucket(bucket, validate=True)
+ if not self.validate_bucket(bucket):
+     self.s3.create_bucket(Bucket=bucket)
Do we really want to create a new bucket here? I'd think not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that functionality is desired, I'd lobby for an optional parameter which is off by default.
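For illustration, here's a minimal sketch of what that opt-in could look like. The `create_bucket` flag and `_get_bucket` helper are hypothetical, not part of this PR:

```python
import boto3
from botocore.exceptions import ClientError


class S3Client(object):
    def __init__(self, create_bucket=False, **kwargs):
        self.s3 = boto3.resource('s3', **kwargs)
        self._create_bucket = create_bucket  # off by default, per the review

    def validate_bucket(self, bucket_name):
        # head_bucket raises ClientError when the bucket is missing or forbidden
        try:
            self.s3.meta.client.head_bucket(Bucket=bucket_name)
            return True
        except ClientError:
            return False

    def _get_bucket(self, bucket_name):
        if not self.validate_bucket(bucket_name):
            if not self._create_bucket:
                raise ValueError('bucket %s does not exist' % bucket_name)
            # only create the bucket when the caller explicitly opted in
            self.s3.create_bucket(Bucket=bucket_name)
        return self.s3.Bucket(bucket_name)
```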
luigi/contrib/s3.py
Outdated
  # grab and validate the bucket
- s3_bucket = self.s3.get_bucket(bucket, validate=True)
+ if not self.validate_bucket(bucket):
+     self.s3.create_bucket(Bucket=bucket)
Same concern here (see the comment within put()).
luigi/contrib/s3.py
Outdated
- for item in s3_bucket.list(prefix=key_path):
-     last_modified_date = time.strptime(item.last_modified, "%Y-%m-%dT%H:%M:%S.%fZ")
+ for item in s3_bucket.objects.filter(Prefix=key_path):
+     last_modified_date = item.last_modified.date()
This removes the ability to limit dir contents by datetime. When I added this functionality, I only needed it for dates, but perhaps someone may need it for more granular times.
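A sketch of one way to keep that granularity; `list_keys_modified_since` is a hypothetical helper, but the key point is that boto3's `ObjectSummary.last_modified` is already a timezone-aware datetime, so the full timestamp can be compared directly instead of being truncated to a date:

```python
from datetime import datetime, timezone


def list_keys_modified_since(s3_bucket, key_path, cutoff):
    """Yield keys under key_path modified at or after cutoff (a datetime)."""
    for item in s3_bucket.objects.filter(Prefix=key_path):
        # boto3 returns last_modified as a timezone-aware datetime, so no
        # strptime parsing is needed and no granularity is lost.
        if item.last_modified >= cutoff:
            yield item.key


# e.g. everything modified since noon UTC on 2017-06-01:
# list_keys_modified_since(bucket, 'data/', datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc))
```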
It appears you removed copy(), __copy_multipart(), and put_multipart().

@dlstadther Yeah, they've been removed on purpose; I'll add them again later. The point, of course, is to maintain the same interface as we had before. Sorry for rushing with an MR, but it was more to let you guys know someone is working on the update. It is still work in progress. Would it be best for me to ping you when I believe the code is ready for reviewing, or are you happy to give feedback on the go like this?
@ouanixi Cool. Just wanted to make sure they weren't forgotten. I didn't see any explicit mention of them forthcoming in the PR description. But glad to know they will be included when complete. I'll likely review as you push changes (and as I have time). Thanks!
@dlstadther so far, most of the functionality provided by the old interface has been tested and implemented. The remaining work is:

Questions I have are:
Note that I removed the S3Target tests. I'll obviously put them back once I'm finished with S3Client.
@ouanixi There are some cases where we only need to get back the string representation of the key, and other times (i.e. list with returned key) where we need to get back the key's metadata. It'd be awesome if boto3 could still return key metadata including size and created/last-modified dates.
@dlstadther Thanks for getting back to me. I've been trying to stick to the current interface we have so we stay backward compatible, but AWS's replacement of the concept of Key might be problematic. A key in boto3 simply means a full path to an object. The remaining interface in the S3Client is fine; it's literally just the return value of get_key.
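For what it's worth, boto3 can still surface that metadata: an `s3.Object` issues a HEAD request on `load()` and then exposes size and last-modified much like a boto 2 Key did. A sketch, with hypothetical bucket and key names:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')


def get_key(bucket_name, key):
    """Return the s3.Object for key, or None if it does not exist."""
    obj = s3.Object(bucket_name, key)
    try:
        obj.load()  # HEAD request; populates the metadata attributes below
    except ClientError:
        return None
    return obj


obj = get_key('my-bucket', 'path/to/file.csv')  # hypothetical names
if obj is not None:
    print(obj.content_length, obj.last_modified)  # size and modified timestamp
```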
luigi/contrib/s3.py
Outdated
  class ReadableS3File(object):
      def __init__(self, s3_key):
-         self.s3_key = s3_key
+         # This is a botocore StreamingBody object
+         self.s3_key = s3_key.get()['Body']
Sorry @ouanixi for such a delay getting back to you! The current luigi implementation only requires Key metadata for
I'm not sure I follow the need for an iterable key object for
Personally, I don't use it. But it looks like @ddaniels888 added this functionality back in late 2013. Maybe he can weigh in on this matter.
Hey @dlstadther — I added the ability to iterate line-by-line for parity with the way other file-like objects behaved in luigi. I don't think we're using it personally, but I'm not sure who else uses ReadableS3File (lots of AWS luigi users around).
@dlstadther That's OK, I think I could use an adapter until the guys down in botocore inherit StreamingBody from
I'm now working on a few failing tests. I will push as soon as I have an update and things are tidy.
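A minimal sketch of such an adapter, assuming the goal is line-by-line iteration over botocore's `StreamingBody`; the class name and buffering details are illustrative, not the PR's actual code:

```python
class StreamingBodyLineAdapter(object):
    """Wraps a botocore StreamingBody to allow line-by-line iteration."""

    def __init__(self, streaming_body, chunk_size=8192):
        self.body = streaming_body
        self.chunk_size = chunk_size

    def __iter__(self):
        buf = b''
        while True:
            chunk = self.body.read(self.chunk_size)
            if not chunk:
                break
            buf += chunk
            # emit every complete line in the buffer, keep the remainder
            while b'\n' in buf:
                line, buf = buf.split(b'\n', 1)
                yield line + b'\n'
        if buf:
            yield buf  # trailing data without a final newline
```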
Hey @dlstadther This is now ready for reviewing. I apologize there are a lot of changes, but in general the main points you should be aware of are the following:

1 - S3Client has no attribute Key anymore.

Also, as this is my first contribution, I'm not sure why the CI is failing. All of the tests pass in my local environment!
@dlstadther thank you. I think the problem is with my understanding of tox and travis though! Cheers
@ouanixi @dlstadther Any timeline on this? I would like to use it! :)

@ouanixi Did you see this comment from Dillon about the CI failures?
@brianestlin hey I did see the comments indeed. That didn't help unfortunately :(
Sorry guys; I haven't had time to review this. @brianestlin do you feel comfortable reviewing too?
@dlstadther Maybe -- I'll try to take a look this week and let you know.
@ouanixi Regarding the
Regarding
Hope this helps.
@brianestlin Thanks for your feedback. Regarding
Don't know how @dlstadther feels about this?
I would think Travis can receive env vars through its config; see the Travis documentation. Assuming there's no valid reason to directly access the s3 property, I'm cool making it private. (Personally, I've never needed to access it directly.)
Hi @dlstadther, thanks for your feedback. I've had a little bit of time to look into this earlier on and found that the batch tests aren't mocking the batch client. They pass fine because they skipOnTravis when
This is not the first time this has happened (I've had to change tests for redshift, ecs, and now batch), and I feel maybe we should be stricter when accepting PRs using this pattern of testing?
I'll try and find some time to fix the failing batch tests this weekend hopefully :) Thanks again for your help
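For reference, a sketch of the mocking pattern being described, using moto with illustrative names, so the test never touches real AWS and never needs a skipOnTravis escape hatch:

```python
import boto3
from moto import mock_s3


@mock_s3
def test_put_and_read_back():
    # everything below runs against moto's in-memory S3, not real AWS
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='test-bucket')
    s3.Object('test-bucket', 'hello.txt').put(Body=b'hi')
    body = s3.Object('test-bucket', 'hello.txt').get()['Body'].read()
    assert body == b'hi'
```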
🙌
Since all of my comments have been addressed, there is no need for another approval. Go ahead!
copy_source = {
    'Bucket': src_bucket,
    'Key': src_key
}
self.s3.meta.client.copy(
    copy_source, dst_bucket, dst_key, Config=transfer_config, ExtraArgs=kwargs)
I've been using this for a couple days and it is taking drastically longer to copy lots of small files.
I'm trying to understand this multi-threaded logic for boto3.
Previously, we were assigning async threads to copy individual files. This new version appears to be assigning possibly multiple threads to copy portions of a single file. Am I reading this correctly?
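That reading is consistent with boto3's managed transfer: the `TransferConfig` thread pool parallelises the multipart pieces of a single object, so a file below the multipart threshold gets no concurrency at all. A sketch with hypothetical names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')

transfer_config = TransferConfig(
    max_concurrency=10,                   # threads for the parts of ONE object
    multipart_threshold=8 * 1024 * 1024,  # below this size: one API call, no threads
)

# A small file never crosses multipart_threshold, so the thread pool sits
# idle and each copy is effectively serial.
s3.meta.client.copy(
    {'Bucket': 'src-bucket', 'Key': 'small-file.csv'},  # hypothetical names
    'dst-bucket', 'small-file.csv',
    Config=transfer_config)
```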
I'm not sure if I'm possibly trying to supply too many threads. So I'm decreasing my thread count and will report back with results when I have them.
Regardless of thread count, I'm getting 6.67 files copied per second. I'm skeptical that the current threading implementation provides any benefit for copying lots of small files.
Re-enabled the ThreadPool logic using apply_async, and multithreaded copy actually works again. Will submit a PR.
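For reference, a sketch of that per-file threading pattern with hypothetical bucket and key names: each `apply_async` call copies one whole file, so many small files proceed concurrently.

```python
import boto3
from multiprocessing.pool import ThreadPool

s3 = boto3.resource('s3')


def copy_one(src_bucket, src_key, dst_bucket, dst_key):
    s3.meta.client.copy(
        {'Bucket': src_bucket, 'Key': src_key}, dst_bucket, dst_key)


pool = ThreadPool(processes=10)
results = [
    pool.apply_async(copy_one, ('src-bucket', key, 'dst-bucket', key))
    for key in ['a.csv', 'b.csv', 'c.csv']  # hypothetical keys
]
pool.close()
pool.join()
for r in results:
    r.get()  # re-raise any exception from the worker threads
```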
Description
This work moves the S3Client away from boto to Boto3.
Motivation and Context
Boto is no longer supported, so Luigi should move away from it and use the current Boto3. This would solve a number of issues, for example:

One of the main motivations on my part is the lack of support for AWS task roles in boto. As more people use this AWS functionality, it makes sense to move the S3Client to boto3.
More reasons can be found here: Update S3 client to use Boto3 #1344
Have you tested this? If so, how?
This is work in progress. I'm trying my best to stick to the original tests, but sometimes change is inevitable (happy to chat for more details).
Note
This is my very first contribution, so feedback and suggestions are more than welcome.