S3Client to use Boto3 #2149

Merged: 78 commits merged into spotify:master on May 1, 2018
Conversation

ouanixi
Contributor

@ouanixi ouanixi commented Jun 4, 2017

Description

This work is to move away from boto to start using Boto3 for the S3Client.

Motivation and Context

Boto is no longer supported, so Luigi should move away from it and use the current Boto3. This would solve a number of issues, for example:

  • More stable downloads for big files
  • Built-in support for multipart uploads
  • Encryption with KMS
  • ....
One of the main motivations on my part is the lack of support for AWS task roles in boto. As more people are using this AWS functionality, it would make sense to move the S3Client to use boto3.
More reasons can be found here: Update S3 client to use Boto3 #1344

Have you tested this? If so, how?

This is a work in progress. I'm trying my best to stick to the original tests, but sometimes change is inevitable (happy to chat for more details).

Note

This is my very first contribution so feedback and suggestions are more than welcome.

@mention-bot

@ouanixi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ddaniels888, @jpiper and @gpoulin to be potential reviewers.

@dlstadther
Collaborator

Super excited for this! Thanks for taking a stab at the update, @ouanixi !

(bucket, key) = self._path_to_bucket_and_key(destination_s3_path)

# grab and validate the bucket
s3_bucket = self.s3.get_bucket(bucket, validate=True)
if not self.validate_bucket(bucket):
    self.s3.create_bucket(Bucket=bucket)
Collaborator

Do we really want to create a new bucket here? I'd think not.

Collaborator

If that functionality is desired, I'd lobby for an optional parameter which is off by default.
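One possible shape for such an opt-in, as a minimal sketch only; the auto_create_bucket parameter name and the error handling are assumptions, not luigi's actual API:

# Hypothetical sketch; `auto_create_bucket` is an assumed parameter name and
# the error handling is illustrative only.
def put(self, local_path, destination_s3_path, auto_create_bucket=False, **kwargs):
    (bucket, key) = self._path_to_bucket_and_key(destination_s3_path)

    if not self.validate_bucket(bucket):
        if not auto_create_bucket:
            raise ValueError('Bucket %s does not exist' % bucket)
        self.s3.create_bucket(Bucket=bucket)

    self.s3.meta.client.upload_file(local_path, bucket, key, ExtraArgs=kwargs)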


# grab and validate the bucket
s3_bucket = self.s3.get_bucket(bucket, validate=True)
if not self.validate_bucket(bucket):
    self.s3.create_bucket(Bucket=bucket)
Collaborator

Same concern here (see comment within put()).

for item in s3_bucket.list(prefix=key_path):
    last_modified_date = time.strptime(item.last_modified, "%Y-%m-%dT%H:%M:%S.%fZ")
for item in s3_bucket.objects.filter(Prefix=key_path):
    last_modified_date = item.last_modified.date()
Collaborator

This removes the ability to limit dir contents by datetime.

When I added this functionality, I only needed it for dates. But perhaps someone may need it for more granular times.
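For what it's worth, a small sketch showing that boto3 can keep the full timestamp, since ObjectSummary.last_modified is already a timezone-aware datetime; the bucket, prefix, and cutoff below are placeholders:

# Illustrative sketch only; bucket name, prefix, and cutoff are placeholders.
from datetime import datetime, timezone

import boto3

s3 = boto3.resource('s3')
s3_bucket = s3.Bucket('my-bucket')
cutoff = datetime(2017, 6, 1, 12, 30, tzinfo=timezone.utc)

for item in s3_bucket.objects.filter(Prefix='some/key/path/'):
    # Compare full datetimes instead of truncating via item.last_modified.date()
    if item.last_modified > cutoff:
        print(item.key, item.last_modified)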

Collaborator

@dlstadther dlstadther left a comment

It appears you removed copy(), __copy_multipart(), and put_multipart().

@ouanixi
Contributor Author

ouanixi commented Jun 5, 2017

@dlstadther Yeah, they've been removed on purpose; I'll add them again later. Of course, the point is to maintain the same interface as we had before. Sorry for rushing with an MR, but it was more to let you guys know someone is working on the update. It is still a work in progress.

Would it be best for me to ping you when I believe the code is ready for review? Or are you happy to give feedback on the go like this?

@dlstadther
Collaborator

@ouanixi Cool. Just wanted to make sure they weren't forgotten. I didn't see any explicit mention of them forthcoming in the PR description. But glad to know they will be included when complete.

I'll likely review as you push changes (and as I have time).

Thanks!

@ouanixi
Contributor Author

ouanixi commented Jun 7, 2017

@dlstadther so far, most of the functionality provided by the old interface has been implemented and tested. The remaining work is:

  • implement and test copy() and __copy_multipart() (although the latter might not be needed if boto3 offers automatic multi-part like it does with upload)
  • limit dir contents by datetime rather than date only (as per your comment)

Questions I have are:

  • get_key() used to return a Key object, which doesn't exist in boto3. Would we be OK with returning the key string representation? Or would you prefer an ObjectSummary object instead?
  • The new S3Client won't have a Key object as an attribute; would this cause any issues?

@ouanixi
Contributor Author

ouanixi commented Jun 7, 2017

Note that I removed the S3Target tests. I'll obviously put them back once I'm finished with S3Client

@dlstadther
Collaborator

@ouanixi There are some cases where we only need to get back the string representation of the key and other times (i.e. list with returned key) where we need to get back the key's metadata. It'd be awesome if boto3 could still return key metadata which included size and created/last modified dates.

@ouanixi
Contributor Author

ouanixi commented Jun 17, 2017

@dlstadther Thanks for getting back to me. I've been trying to stick to the current interface so we stay backward compatible, but AWS's replacement of the concept of a Key might be problematic. A key in boto3 simply means a full path to an object. In order to get metadata (more specifically the size of the object) we need to return an ObjectSummary from get_key. This doesn't matter much until we look at ReadableS3File, which takes in a Key object (which is iterable). The closest we can get to the Key interface is a StreamingBody, which exposes read and close methods but is unfortunately not iterable.
Any suggestions as to what we could do here?

The remaining interface in the S3Client is fine. It's literally just the return value of get_key.



class ReadableS3File(object):
    def __init__(self, s3_key):
        self.s3_key = s3_key
        # This is a botocore StreamingBody object
        self.s3_key = s3_key.get()['Body']

@dlstadther
Collaborator

Sorry @ouanixi for such delay getting back to you!

The current luigi implementation only requires Key metadata for the S3Client list/listdir methods (also used by copy). I'm not sure it's a problem that ReadableS3File can't return this metadata.

I'm not sure I follow the need for an iterable key object for ReadableS3File. Do you mean for the benefit of updating its __iter__ dunder method? It appears that that is for iterating over the contents of the file. If boto3 doesn't offer the ability to iterate over the contents of the file by line, we could deprecate ReadableS3File.

Personally, I don't use it. But it looks like @ddaniels888 added this functionality back in late 2013. Maybe he can weigh in on this matter.

@ddaniels
Contributor

ddaniels commented Jul 5, 2017

Hey @dlstadther — I added the ability to iterate line-by-line for parity with the way other file-like objects behaved in luigi. I don't think we're using it personally, but I'm not sure who else uses ReadableS3File (lots of AWS luigi users around).

@ouanixi
Contributor Author

ouanixi commented Jul 23, 2017

@dlstadther That's OK. I think I could use an adapter until the folks at botocore make StreamingBody inherit from IOBase, which would then make it iterable.
Here's the discussion in their repo: boto/botocore#879

I'm now working on a few failing tests. I will push as soon as I have an update and things are tidy.
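For illustration, a minimal sketch of such an adapter (not the PR's actual code): it wraps a botocore StreamingBody and yields lines, relying only on its read() and close() methods:

# Hypothetical adapter sketch: wraps a botocore StreamingBody so it can be
# iterated line by line, similar to the old boto Key behaviour.
class IterableBody(object):
    def __init__(self, streaming_body, chunk_size=8192):
        self._body = streaming_body
        self._chunk_size = chunk_size

    def read(self, size=-1):
        return self._body.read() if size < 0 else self._body.read(size)

    def close(self):
        self._body.close()

    def __iter__(self):
        # Buffer raw chunks and yield complete lines as they appear.
        buffer = b''
        while True:
            chunk = self._body.read(self._chunk_size)
            if not chunk:
                break
            buffer += chunk
            while b'\n' in buffer:
                line, buffer = buffer.split(b'\n', 1)
                yield line + b'\n'
        if buffer:
            yield buffer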

@ouanixi
Contributor Author

ouanixi commented Jul 24, 2017

Hey @dlstadther, this is now ready for review. I apologize that there are a lot of changes, but in general the main points you should be aware of are the following:

1. S3Client no longer has a Key attribute.
2. S3Client's get_key returns an ObjectSummary object (see the usage sketch below): https://boto3.readthedocs.io/en/latest/reference/services/s3.html#objectsummary
3. As a consequence, ReadableS3File takes an ObjectSummary object as the constructor argument.
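A brief usage sketch of that return type; the s3 path is a placeholder and the import path is an assumption (it may differ by luigi version):

# Hypothetical usage sketch; the path is a placeholder and the import path
# may differ by luigi version.
from luigi.contrib.s3 import S3Client

client = S3Client()  # credentials resolved by boto3's usual chain
summary = client.get_key('s3://my-bucket/some/key')

# The ObjectSummary carries the metadata the old boto Key exposed...
print(summary.key, summary.size, summary.last_modified)

# ...and its body is the botocore StreamingBody that ReadableS3File wraps.
body = summary.get()['Body']
data = body.read()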

Also, as this is my first contribution, I'm not sure why the CI is failing. All of the tests pass in my local environment!

@ouanixi ouanixi changed the title [WIP] S3Client to use Boto3 S3Client to use Boto3 Jul 24, 2017
@dlstadther
Collaborator

dlstadther commented Jul 26, 2017

@ouanixi I'll review this as i have time.

Regarding the Travis failures, @Tarrasch merged a PR recently that allows Travis to run properly. You'll need to rebase with master to get this update.

@ouanixi ouanixi changed the title S3Client to use Boto3 [WIP] S3Client to use Boto3 Jul 26, 2017
@ouanixi
Contributor Author

ouanixi commented Jul 26, 2017

@dlstadther thank you.
Moved this back to [WIP] as I'm not sure what's happening with the CI! Random tests are getting errors now, namely:

  • contrib.ecs_test.TestECSTask: botocore raising NoCredentialsError: Unable to locate credentials
  • redshift_test.TestRedshiftManifestTask: botocore raising Missing required parameter in input: "Bucket"

I think the problem is with my understanding of tox and Travis, though!
I'll wait for your input.

Cheers

@brianestlin
Contributor

@ouanixi @dlstadther Any timeline on this? I would like to use it! :)

@ouanixi Did you see this comment from Dillon about the CI failures?

You'll need to rebase with master to get this update.

@ouanixi
Contributor Author

ouanixi commented Aug 7, 2017

@brianestlin hey I did see the comments indeed. That didn't help unfortunately :(
Waiting on @dlstadther for reviewing :)

@dlstadther
Collaborator

Sorry guys; I haven't had time to review this. @brianestlin, do you feel comfortable reviewing too?

@brianestlin
Contributor

@dlstadther Maybe -- I'll try to take a look this week and let you know.

@brianestlin
Contributor

@ouanixi Regarding the NoCredentialsError, see https://github.com/spotify/luigi/blob/master/test/contrib/ecs_test.py#L24 and L39. Prior to your change, boto3 wasn't installed, so this test was being skipped. Now that you're requiring boto3, the test is being run and tripping up on the lack of credentials in the environment. Or that's my hunch, anyway. I'm not sure of the right approach to make this work: whether there is some way to configure AWS credentials on Travis, or whether we just need another way to opt this test out (maybe it should check for the existence of the credentials in that try block in L39?).
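One possible shape for that opt-out, as a hypothetical sketch (not the actual contents of ecs_test.py): make a cheap boto3 call up front and skip the module when credentials are missing:

# Hypothetical test-skipping sketch; the names and the exact call are
# assumptions, not the actual test/contrib/ecs_test.py contents.
import unittest

import boto3
from botocore.exceptions import NoCredentialsError

try:
    # A cheap call that forces credential resolution.
    client = boto3.client('ecs', region_name='us-east-1')
    client.list_clusters()
except NoCredentialsError:
    raise unittest.SkipTest('AWS credentials not available; skipping ECS tests')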

Regarding TestRedshiftManifestTask, I believe that is a legit test failure because the value of client.s3 here https://github.com/spotify/luigi/blob/master/test/redshift_test.py#L64 is now a boto3 object with a different API than before.

Hope this helps.

@ouanixi
Contributor Author

ouanixi commented Aug 9, 2017

@brianestlin Thanks for your feedback.
Regarding the NoCredentialsError:
boto3 can get the credentials from env vars, so maybe there's a way to expose them in Travis? Does anyone know how to do it?

Regarding TestRedshiftManifestTask: that test accesses the s3 property from within the S3Client, which is now a boto3 object, so yeah, you're right, that's why it's failing. I think we should make the s3 property private so people know not to use it? I think it should have been private from the start anyway.
I feel the best thing to do is to edit the test not to use S3Client to create buckets, but to use boto3 directly (a rough sketch of what I mean is below). I don't think exposing internal boto3 stuff in S3Client is a good idea anyway!
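A minimal sketch of that idea; the use of moto, the method name, and the bucket name are my assumptions here, not the actual redshift_test.py code:

# Hypothetical test-setup sketch; not the actual redshift_test.py contents.
import unittest

import boto3
from moto import mock_s3


class TestRedshiftManifestTask(unittest.TestCase):
    @mock_s3
    def test_manifest(self):
        # Create the fixture bucket with boto3 directly instead of reaching
        # into S3Client's internal s3 attribute.
        s3 = boto3.resource('s3', region_name='us-east-1')
        s3.create_bucket(Bucket='bucket')
        # ... the rest of the original test body would go here ...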

Don't know how @dlstadther feels about this?

@dlstadther
Collaborator

I would think Travis can receive env vars through its config. See travis documentation.

Assuming there's no valid reason to directly access the s3 property, I'm cool with making it private. (Personally, I've never needed to access it directly.)

@ouanixi
Contributor Author

ouanixi commented Apr 26, 2018

Hi @dlstadther thanks for your feedback.

I've had a little bit of time to look into this earlier on and found that the batch tests aren't mocking the batch client. They pass fine because they skipOnTravis when boto3 cannot be imported, but of course now, with my changes, it can!

This is not the first time this has happened (I've had to change tests for redshift, ecs, and now batch), and I feel maybe we should be stricter when accepting PRs that use this pattern of testing?

I'll try and find some time to fix the failing batch tests this weekend hopefully :)

Thanks again for your help

Collaborator

@dlstadther dlstadther left a comment

🙌

@honnix
Member

honnix commented May 1, 2018

Since all of my comments have been addressed, there is no need for another approval. Go ahead!

@dlstadther dlstadther merged commit c76fb2b into spotify:master May 1, 2018
    'Key': src_key
}
self.s3.meta.client.copy(
    copy_source, dst_bucket, dst_key, Config=transfer_config, ExtraArgs=kwargs)
Collaborator

I've been using this for a couple of days and it is taking drastically longer to copy lots of small files.

I'm trying to understand this multi-threaded logic for boto3.

Previously, we were assigning async threads to copy individual files. This new version appears to be assigning possibly multiple threads to copy portions of a single file. Am I reading this correctly?
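For context, a sketch of the boto3 managed-copy knobs that drive that behaviour; the values shown are illustrative, not what this PR sets:

# Illustrative sketch: boto3's managed transfer parallelises parts of ONE
# object, not many objects; the numbers below are example values only.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')

transfer_config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # objects below this copy in a single request
    max_concurrency=10,                   # threads used per single object
    multipart_chunksize=8 * 1024 * 1024,
)

copy_source = {'Bucket': 'src-bucket', 'Key': 'some/key'}  # placeholders
s3.meta.client.copy(copy_source, 'dst-bucket', 'some/key', Config=transfer_config)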

Collaborator

I'm not sure if I'm supplying too many threads. So I'm decreasing my thread count and will report back with results when I have them.

Collaborator

Regardless of thread count, I'm getting 6.67 files copied per second.

I'm skeptical that the current threading implementation provides any benefit for the copying of lots of small files.

Collaborator

Re-enabled the ThreadPool logic using apply_async, and multithreaded copy actually works again.

Will submit a PR.
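For reference, a rough sketch of the apply_async pattern being described: one boto3 copy call per (small) object, fanned out over a ThreadPool. Bucket and prefix names are placeholders, and this is not the follow-up PR itself:

# Hypothetical sketch of per-file parallel copies; bucket/prefix names are
# placeholders and this is not the actual follow-up PR code.
from multiprocessing.pool import ThreadPool

import boto3

s3 = boto3.resource('s3')
pool = ThreadPool(processes=16)


def copy_one(src_bucket, src_key, dst_bucket, dst_key):
    # Each call copies a whole (small) object in a single worker thread.
    s3.meta.client.copy({'Bucket': src_bucket, 'Key': src_key}, dst_bucket, dst_key)


results = []
for obj in s3.Bucket('src-bucket').objects.filter(Prefix='some/prefix/'):
    dst_key = obj.key.replace('some/prefix/', 'dest/prefix/', 1)
    results.append(pool.apply_async(copy_one, (obj.bucket_name, obj.key, 'dst-bucket', dst_key)))

pool.close()
pool.join()
for r in results:
    r.get()  # propagate any copy errors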

@ouanixi ouanixi deleted the s3client branch July 13, 2018 16:50