S3Client to use Boto3 #2149

Merged: 78 commits merged into spotify:master on May 1, 2018
Conversation

ouanixi
Contributor

@ouanixi ouanixi commented Jun 4, 2017

Description

This work is to move away from boto to start using Boto3 for the S3Client.

Motivation and Context

Boto is no longer supported, so Luigi should move away from it and use the current Boto3. This would solve a number of issues, for example:

  • More stable downloads for big files
  • Built-in support for multipart uploads
  • Encryption with KMS
  • ....
One of the main motivations on my part is the lack of support for AWS task roles in boto. As more people are using this AWS functionality, it would make sense to move the S3Client to use boto3.
More reasons can be found here: Update S3 client to use Boto3 #1344

Have you tested this? If so, how?

This is a work in progress. I'm trying my best to stick to the original tests, but sometimes change is inevitable (happy to chat for more details).

Note

This is my very first contribution so feedback and suggestions are more than welcome.

@mention-bot

@ouanixi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ddaniels888, @jpiper and @gpoulin to be potential reviewers.

@dlstadther
Collaborator

Super excited for this! Thanks for taking a stab at the update, @ouanixi !

(bucket, key) = self._path_to_bucket_and_key(destination_s3_path)

# grab and validate the bucket
s3_bucket = self.s3.get_bucket(bucket, validate=True)
if not self.validate_bucket(bucket):
    self.s3.create_bucket(Bucket=bucket)
Collaborator

Do we really want to create a new bucket here? I'd think not.

Collaborator

If that functionality is desired, I'd lobby for an optional parameter which is off by default.
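One possible shape for such an opt-in, as a minimal sketch only; the auto_create_bucket parameter name and the error handling are assumptions, not luigi's actual API:

# Hypothetical sketch; `auto_create_bucket` is an assumed parameter name and
# the error handling is illustrative only.
def put(self, local_path, destination_s3_path, auto_create_bucket=False, **kwargs):
    (bucket, key) = self._path_to_bucket_and_key(destination_s3_path)

    if not self.validate_bucket(bucket):
        if not auto_create_bucket:
            raise ValueError('Bucket %s does not exist' % bucket)
        self.s3.create_bucket(Bucket=bucket)

    self.s3.meta.client.upload_file(local_path, bucket, key, ExtraArgs=kwargs)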


# grab and validate the bucket
s3_bucket = self.s3.get_bucket(bucket, validate=True)
if not self.validate_bucket(bucket):
    self.s3.create_bucket(Bucket=bucket)
Collaborator

Same concern here (see comment within put()).

for item in s3_bucket.list(prefix=key_path):
    last_modified_date = time.strptime(item.last_modified, "%Y-%m-%dT%H:%M:%S.%fZ")
for item in s3_bucket.objects.filter(Prefix=key_path):
    last_modified_date = item.last_modified.date()
Collaborator

This removes the ability to limit dir contents by datetime.

When I added this functionality, I only needed it for dates. But perhaps someone may need it for more granular times.
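For what it's worth, a small sketch showing that boto3 can keep the full timestamp, since ObjectSummary.last_modified is already a timezone-aware datetime; the bucket, prefix, and cutoff below are placeholders:

# Illustrative sketch only; bucket name, prefix, and cutoff are placeholders.
from datetime import datetime, timezone

import boto3

s3 = boto3.resource('s3')
s3_bucket = s3.Bucket('my-bucket')
cutoff = datetime(2017, 6, 1, 12, 30, tzinfo=timezone.utc)

for item in s3_bucket.objects.filter(Prefix='some/key/path/'):
    # Compare full datetimes instead of truncating via item.last_modified.date()
    if item.last_modified > cutoff:
        print(item.key, item.last_modified)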

Collaborator

@dlstadther dlstadther left a comment

It appears you removed copy(), __copy_multipart(), and put_multipart().

@ouanixi
Contributor Author

ouanixi commented Jun 5, 2017

@dlstadther Yeah, they've been removed on purpose; I'll add them again later. Of course, the point is to maintain the same interface as we had before. Sorry for rushing with an MR, but it was more to let you guys know someone is working on the update. It is still a work in progress.

Would it be best for me to ping you when I believe the code is ready for review? Or are you happy to give feedback on the go like this?

@dlstadther
Collaborator

@ouanixi Cool. Just wanted to make sure they weren't forgotten. I didn't see any explicit mention of them forthcoming in the PR description. But glad to know they will be included when complete.

I'll likely review as you push changes (and as I have time).

Thanks!

@ouanixi
Contributor Author

ouanixi commented Jun 7, 2017

@dlstadther so far, most of the functionality provided by the old interface has been implemented and tested. The remaining work is:

  • implement and test copy() and __copy_multipart() (although the latter might not be needed if boto3 offers automatic multi-part like it does with upload)
  • limit dir contents by datetime rather than date only (as per your comment)

Questions I have are:

  • get_key() used to return a Key object, which doesn't exist in boto3. Would we be OK with returning the key string representation? Or would you prefer an ObjectSummary object instead?
  • The new S3Client won't have a Key object as an attribute; would this cause any issues?

@ouanixi
Contributor Author

ouanixi commented Jun 7, 2017

Note that I removed the S3Target tests. I'll obviously put them back once I'm finished with S3Client

@dlstadther
Collaborator

@ouanixi There are some cases where we only need to get back the string representation of the key and other times (i.e. list with returned key) where we need to get back the key's metadata. It'd be awesome if boto3 could still return key metadata which included size and created/last modified dates.

@ouanixi
Contributor Author

ouanixi commented Jun 17, 2017

@dlstadther Thanks for getting back to me. I've been trying to stick to the current interface so we stay backward compatible, but AWS's replacement of the concept of a Key might be problematic. A key in boto3 simply means a full path to an object. In order to get metadata (more specifically the size of the object) we need to return an ObjectSummary from get_key. This doesn't matter much until we look at ReadableS3File, which takes in a Key object (which is iterable). The closest we can get to the Key interface is a StreamingBody, which exposes read and close methods but is unfortunately not iterable.
Any suggestions as to what we could do here?

The remaining interface in the S3Client is fine. It's literally just the return value of get_key.



class ReadableS3File(object):
    def __init__(self, s3_key):
        self.s3_key = s3_key
        # This is a botocore StreamingBody object
        self.s3_key = s3_key.get()['Body']

@dlstadther
Collaborator

Sorry @ouanixi for such delay getting back to you!

The current luigi implementation only requires Key metadata for the S3Client list/listdir methods (also used by copy). I'm not sure it's a problem that ReadableS3File can't return this metadata.

I'm not sure I follow the need for an iterable key object for ReadableS3File. Do you mean for the benefit of updating its __iter__ dunder method? It appears that that is for iterating over the contents of the file. If boto3 doesn't offer the ability to iterate over the contents of the file by line, we could deprecate ReadableS3File.

Personally, I don't use it. But it looks like @ddaniels888 added this functionality back in late 2013. Maybe he can weigh in on this matter.

@ddaniels
Contributor

ddaniels commented Jul 5, 2017

Hey @dlstadther — I added the ability to iterate line-by-line for parity with the way other file-like objects behaved in luigi. I don't think we're using it personally, but I'm not sure who else uses ReadableS3File (lots of AWS luigi users around).

@ouanixi
Contributor Author

ouanixi commented Jul 23, 2017

@dlstadther That's OK. I think I could use an adapter until the folks at botocore make StreamingBody inherit from IOBase, which would then make it iterable.
Here's the discussion in their repo: boto/botocore#879

I'm now working on a few failing tests. I will push as soon as I have an update and things are tidy.
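For illustration, a minimal sketch of such an adapter (not the PR's actual code): it wraps a botocore StreamingBody and yields lines, relying only on its read() and close() methods:

# Hypothetical adapter sketch: wraps a botocore StreamingBody so it can be
# iterated line by line, similar to the old boto Key behaviour.
class IterableBody(object):
    def __init__(self, streaming_body, chunk_size=8192):
        self._body = streaming_body
        self._chunk_size = chunk_size

    def read(self, size=-1):
        return self._body.read() if size < 0 else self._body.read(size)

    def close(self):
        self._body.close()

    def __iter__(self):
        # Buffer raw chunks and yield complete lines as they appear.
        buffer = b''
        while True:
            chunk = self._body.read(self._chunk_size)
            if not chunk:
                break
            buffer += chunk
            while b'\n' in buffer:
                line, buffer = buffer.split(b'\n', 1)
                yield line + b'\n'
        if buffer:
            yield buffer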

@ouanixi
Contributor Author

ouanixi commented Jul 24, 2017

Hey @dlstadther, this is now ready for review. I apologize that there are a lot of changes, but in general the main points you should be aware of are the following:

1. S3Client no longer has a Key attribute.
2. S3Client's get_key returns an ObjectSummary object (see the usage sketch below): https://boto3.readthedocs.io/en/latest/reference/services/s3.html#objectsummary
3. As a consequence, ReadableS3File takes an ObjectSummary object as the constructor argument.
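A brief usage sketch of that return type; the s3 path is a placeholder and the import path is an assumption (it may differ by luigi version):

# Hypothetical usage sketch; the path is a placeholder and the import path
# may differ by luigi version.
from luigi.contrib.s3 import S3Client

client = S3Client()  # credentials resolved by boto3's usual chain
summary = client.get_key('s3://my-bucket/some/key')

# The ObjectSummary carries the metadata the old boto Key exposed...
print(summary.key, summary.size, summary.last_modified)

# ...and its body is the botocore StreamingBody that ReadableS3File wraps.
body = summary.get()['Body']
data = body.read()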

Also, as this is my first contribution, I'm not sure why the CI is failing. All of the tests pass in my local environment!

@ouanixi ouanixi changed the title [WIP] S3Client to use Boto3 S3Client to use Boto3 Jul 24, 2017
@dlstadther
Collaborator

dlstadther commented Jul 26, 2017

@ouanixi I'll review this as i have time.

Regarding the Travis failures, @Tarrasch merged a PR recently that allows Travis to run properly. You'll need to rebase with master to get this update.

@ouanixi ouanixi changed the title S3Client to use Boto3 [WIP] S3Client to use Boto3 Jul 26, 2017
@ouanixi
Contributor Author

ouanixi commented Jul 26, 2017

@dlstadther thank you.
Moved this back to [WIP] as I'm not sure what's happening with the CI! Random tests are getting errors now, namely:

  • contrib.ecs_test.TestECSTask: botocore raising NoCredentialsError: Unable to locate credentials
  • redshift_test.TestRedshiftManifestTask: botocore raising Missing required parameter in input: "Bucket"

I think the problem is with my understanding of tox and Travis, though!
I'll wait for your input.

Cheers

@brianestlin
Contributor

@ouanixi @dlstadther Any timeline on this? I would like to use it! :)

@ouanixi Did you see this comment from Dillon about the CI failures?

You'll need to rebase with master to get this update.

@ouanixi
Contributor Author

ouanixi commented Aug 7, 2017

@brianestlin hey I did see the comments indeed. That didn't help unfortunately :(
Waiting on @dlstadther for reviewing :)

@dlstadther
Collaborator

Sorry guys; I haven't had time to review this. @brianestlin, do you feel comfortable reviewing too?

@brianestlin
Contributor

@dlstadther Maybe -- I'll try to take a look this week and let you know.

@brianestlin
Contributor

@ouanixi Regarding the NoCredentialsError, see https://github.com/spotify/luigi/blob/master/test/contrib/ecs_test.py#L24 and L39. Prior to your change, boto3 wasn't installed, so this test was being skipped. Now that you're requiring boto3, the test is being run and tripping up on the lack of credentials in the environment. Or that's my hunch, anyway. I'm not sure of the right approach to make this work: whether there is some way to configure AWS credentials on Travis, or whether we just need another way to opt this test out (maybe it should check for the existence of the credentials in that try block in L39?).
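One possible shape for that opt-out, as a hypothetical sketch (not the actual contents of ecs_test.py): make a cheap boto3 call up front and skip the module when credentials are missing:

# Hypothetical test-skipping sketch; the names and the exact call are
# assumptions, not the actual test/contrib/ecs_test.py contents.
import unittest

import boto3
from botocore.exceptions import NoCredentialsError

try:
    # A cheap call that forces credential resolution.
    client = boto3.client('ecs', region_name='us-east-1')
    client.list_clusters()
except NoCredentialsError:
    raise unittest.SkipTest('AWS credentials not available; skipping ECS tests')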

Regarding TestRedshiftManifestTask, I believe that is a legit test failure because the value of client.s3 here https://github.com/spotify/luigi/blob/master/test/redshift_test.py#L64 is now a boto3 object with a different API than before.

Hope this helps.

@ouanixi
Contributor Author

ouanixi commented Aug 9, 2017

@brianestlin Thanks for your feedback.
Regarding the NoCredentialsError:
boto3 can get the credentials from env vars, so maybe there's a way to expose them in Travis? Does anyone know how to do it?

Regarding TestRedshiftManifestTask: that test accesses the s3 property from within the S3Client, which is now a boto3 object, so yeah, you're right, that's why it's failing. I think we should make the s3 property private so people know not to use it? I think it should have been private from the start anyway.
I feel the best thing to do is to edit the test not to use S3Client to create buckets, but to use boto3 directly (a rough sketch of what I mean is below). I don't think exposing internal boto3 stuff in S3Client is a good idea anyway!
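A minimal sketch of that idea; the use of moto, the method name, and the bucket name are my assumptions here, not the actual redshift_test.py code:

# Hypothetical test-setup sketch; not the actual redshift_test.py contents.
import unittest

import boto3
from moto import mock_s3


class TestRedshiftManifestTask(unittest.TestCase):
    @mock_s3
    def test_manifest(self):
        # Create the fixture bucket with boto3 directly instead of reaching
        # into S3Client's internal s3 attribute.
        s3 = boto3.resource('s3', region_name='us-east-1')
        s3.create_bucket(Bucket='bucket')
        # ... the rest of the original test body would go here ...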

Don't know how @dlstadther feels about this?

@dlstadther
Collaborator

I would think Travis can receive env vars through its config. See travis documentation.

Assuming there's no valid reason to directly access the s3 property, I'm cool with making it private. (Personally, I've never needed to access it directly.)

@ouanixi
Contributor Author

ouanixi commented Apr 26, 2018

Hi @dlstadther thanks for your feedback.

I've had a little bit of time to look into this earlier on and found that the batch tests aren't mocking the batch client. They pass fine because they skipOnTravis when boto3 cannot be imported, but of course now, with my changes, it can!

This is not the first time this has happened (I've had to change tests for redshift, ecs, and now batch), and I feel maybe we should be stricter when accepting PRs that use this pattern of testing?

I'll try and find some time to fix the failing batch tests this weekend hopefully :)

Thanks again for your help

Collaborator

@dlstadther dlstadther left a comment

🙌

@honnix
Member

honnix commented May 1, 2018

Since all of my comments have been addressed, there is no need for another approval. Go ahead!

@dlstadther dlstadther merged commit c76fb2b into spotify:master May 1, 2018
    'Key': src_key
}
self.s3.meta.client.copy(
    copy_source, dst_bucket, dst_key, Config=transfer_config, ExtraArgs=kwargs)
Collaborator

I've been using this for a couple of days and it is taking drastically longer to copy lots of small files.

I'm trying to understand this multi-threaded logic for boto3.

Previously, we were assigning async threads to copy individual files. This new version appears to be assigning possibly multiple threads to copy portions of a single file. Am I reading this correctly?
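For context, a sketch of the boto3 managed-copy knobs that drive that behaviour; the values shown are illustrative, not what this PR sets:

# Illustrative sketch: boto3's managed transfer parallelises parts of ONE
# object, not many objects; the numbers below are example values only.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')

transfer_config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # objects below this copy in a single request
    max_concurrency=10,                   # threads used per single object
    multipart_chunksize=8 * 1024 * 1024,
)

copy_source = {'Bucket': 'src-bucket', 'Key': 'some/key'}  # placeholders
s3.meta.client.copy(copy_source, 'dst-bucket', 'some/key', Config=transfer_config)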

Collaborator

I'm not sure if I'm supplying too many threads. So I'm decreasing my thread count and will report back with results when I have them.

Collaborator

Regardless of thread count, I'm getting 6.67 files copied per second.

I'm skeptical that the current threading implementation provides any benefit for the copying of lots of small files.

Collaborator

Re-enabled the ThreadPool logic using apply_async, and multithreaded copy actually works again.

Will submit a PR.
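For reference, a rough sketch of the apply_async pattern being described: one boto3 copy call per (small) object, fanned out over a ThreadPool. Bucket and prefix names are placeholders, and this is not the follow-up PR itself:

# Hypothetical sketch of per-file parallel copies; bucket/prefix names are
# placeholders and this is not the actual follow-up PR code.
from multiprocessing.pool import ThreadPool

import boto3

s3 = boto3.resource('s3')
pool = ThreadPool(processes=16)


def copy_one(src_bucket, src_key, dst_bucket, dst_key):
    # Each call copies a whole (small) object in a single worker thread.
    s3.meta.client.copy({'Bucket': src_bucket, 'Key': src_key}, dst_bucket, dst_key)


results = []
for obj in s3.Bucket('src-bucket').objects.filter(Prefix='some/prefix/'):
    dst_key = obj.key.replace('some/prefix/', 'dest/prefix/', 1)
    results.append(pool.apply_async(copy_one, (obj.bucket_name, obj.key, 'dst-bucket', dst_key)))

pool.close()
pool.join()
for r in results:
    r.get()  # propagate any copy errors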

@ouanixi ouanixi deleted the s3client branch July 13, 2018 16:50