Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(s3.list): type hinting for chunked arg #1113

Merged
merged 5 commits into from
Jan 25, 2022

Conversation

jmahlik
Copy link
Contributor

@jmahlik jmahlik commented Jan 14, 2022

Issue #, if available:
N/A

Description of changes:
Fixing the return type of s3.list_objects to account for the chunked arg by using typing.overload.

Currently, I have to use cast in a lot of places (40+ and growing). cast is confusing to data scientists who consume the code base, as they think it is actually casting the value or haven't seen it before. This addition allows the correct return type to be reported based on the args passed. It will save quite a bit of overhead on the user end.

Tested with mypy==0.910 and mypy==0.931.

from typing import cast, List
import awswrangler as wr

# Currently I have to do this all over the code base (not using chunked)
# Or even worse... #type: ignore it
thing = cast(List[str], wr.s3.list_objects("thing"))[0]

# --- with the changes ---
# Mypy issues, and runtime issues if you try to index a 
# generator
# error: Value of type "Union[List[str], Iterator[List[str]]]" is not indexable
thing = wr.s3.list_objects("thing", chunked=True)[0]

# No mypy issues nor runtime issues, this will return a list
# So, we no longer need to use cast
thing = wr.s3.list_objects("thing")[0]

If this is agreeable, I would be happy to take on some of the other functions that have chunking.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: adbb892
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kukushking
Copy link
Contributor

Neat, I like it, although I must say it's quite verbose

@jmahlik
Copy link
Contributor Author

jmahlik commented Jan 14, 2022

Neat, I like it, although I must say it's quite verbose

Agreed,, it is extremely verbose. I could not figure out a better way to implement it though. If someone can, please do. I guess these lines would take care of maybe several hundred lines on the user side.

@diegodebrito @victoriarouton grep for it

@jaidisido
Copy link
Contributor

I agree that it's a bit too verbose I am afraid, especially that there are other methods that have even larger input arguments. Since we are only interested in the return type, I wonder if there is a way to specify **args, **kwargs instead?

@jaidisido
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 804bb0f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jmahlik
Copy link
Contributor Author

jmahlik commented Jan 17, 2022

I agree that it's a bit too verbose I am afraid, especially that there are other methods that have even larger input arguments. Since we are only interested in the return type, I wonder if there is a way to specify **args, **kwargs instead?

It might be possible to only include the chuncked arg with a ... in the overloads. I'll give a shot at making something like these examples from the mypy docs work tonight.

from typing import overload

@overload
def mouse_event(x1: int, y1: int) -> ClickEvent: ...
@overload
def mouse_event(x1: int, y1: int, x2: int, y2: int) -> DragEvent: ...

# The actual *implementation* of 'mouse_event'.
# The implementation contains the actual runtime logic.
#
# It may or may not have type hints. If it does, mypy
# will check the body of the implementation against the
# type hints.
#
# Mypy will also check and make sure the signature is
# consistent with the provided variants.

def mouse_event(x1: int,
                y1: int,
                x2: Optional[int] = None,
                y2: Optional[int] = None) -> Union[ClickEvent, DragEvent]:
    if x2 is None and y2 is None:
        return ClickEvent(x1, y1)
    elif x2 is not None and y2 is not None:
        return DragEvent(x1, y1, x2, y2)
    else:
        raise TypeError("Bad arguments")


class M: ...

@overload
def get_model(model_or_pk: M, flag: bool = ...) -> M: ...
@overload
def get_model(model_or_pk: int, flag: bool = ...) -> M | None: ...

from typing import Union, overload

# Overload *variants* for 'mouse_event'.
# These variants give extra information to the type checker.
# They are ignored at runtime.

@jmahlik
Copy link
Contributor Author

jmahlik commented Jan 18, 2022

I tried a few more things. A possible alternative is converting chucked to a keyword only arg. That is potentially more verbose and would impact the api.

python/mypy#7333

python/mypy#8634

Copy link
Contributor

@kukushking kukushking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO better type hints >> LOC so I'm ok to merge this

@jmahlik
Copy link
Contributor Author

jmahlik commented Jan 21, 2022

IMO better type hints >> LOC so I'm ok to merge this

That'd be splendid! s3 list is probably the one users (at least in our case) are most likely to be indexing that doesn't resolve to Any (like the pandas returns), so it should cut out quite a few cast LOC. Taking in to account just our end user code base, it would already be an overall net reduction in LOC.

The one thing I'd like to validate a bit more is if the overloads should be using the defaults i.e. ignore_suffix: Union[str, List[str], None] = None, or the ... i.e. ignore_suffix: Union[str, List[str], None] = ...,. I'll get on that asap.

@jmahlik
Copy link
Contributor Author

jmahlik commented Jan 21, 2022

IMO better type hints >> LOC so I'm ok to merge this

That'd be splendid! s3 list is probably the one users (at least in our case) are most likely to be indexing that doesn't resolve to Any (like the pandas returns), so it should cut out quite a few cast LOC. Taking in to account just our end user code base, it would already be an overall net reduction in LOC.

The one thing I'd like to validate a bit more is if the overloads should be using the defaults i.e. ignore_suffix: Union[str, List[str], None] = None, or the ... i.e. ignore_suffix: Union[str, List[str], None] = ...,. I'll get on that asap.

Did some testing. It works either way but the ... seems like the better option since if any of the defaults ever changes, the overload doesn't need to be updated.

I updated to ... in the overloads.

Happy to rebase this if needed since there were some merges into it.

Here's what I used for testing.

# Redundant cast error -- this is a good thing
thing = cast(List[str], wr.s3.list_objects("thing"))[0]

thing = wr.s3.list_objects("thing", "*", chunked=True)

# These are all valid calls
wr.s3.list_objects("thing", ["*"], ["^"], datetime.datetime(2022, 1, 1), chunked=True)
wr.s3.list_objects("thing", ["*"], ["^"], datetime.datetime(2022, 1, 1), chunked=False)


wr.s3.list_objects(
    path="thing",
    suffix="*",
    ignore_suffix="^",
    last_modified_begin=datetime.datetime(2022, 1, 1),
    last_modified_end=datetime.datetime(2022, 1, 1),
    ignore_empty=True,
    chunked=False,
    s3_additional_kwargs={"foo": "bar"},
    boto3_session=boto3.Session(),
)

wr.s3.list_objects(
    "thing",
    "*",
    "^",
    last_modified_begin=datetime.datetime(2022, 1, 1),
    last_modified_end=datetime.datetime(2022, 1, 1),
    ignore_empty=True,
    chunked=False,
    s3_additional_kwargs={"foo": "bar"},
    boto3_session=boto3.Session(),
)

wr.s3.list_objects(
    "thing",
    "*",
    "^",
    last_modified_begin=datetime.datetime(2022, 1, 1),
    last_modified_end=datetime.datetime(2022, 1, 1),
    ignore_empty=True,
    chunked=True,
    s3_additional_kwargs={"foo": "bar"},
    boto3_session=boto3.Session(),
)

wr.s3.list_objects(
    "thing",
    "*",
    "^",
    last_modified_begin=datetime.datetime(2022, 1, 1),
    last_modified_end=datetime.datetime(2022, 1, 1),
    chunked=True,
    s3_additional_kwargs={"foo": "bar"},
    boto3_session=boto3.Session(),
)

@jmahlik jmahlik force-pushed the fix-s3-list-typing branch from 511b471 to df49487 Compare January 21, 2022 18:34
@jaidisido
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 511b471
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jaidisido jaidisido merged commit 02a2d46 into aws:main Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants