Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: raise error to eager len calls #416

Merged
merged 9 commits into from
Nov 15, 2023

Conversation

ikrommyd
Copy link
Contributor

I've seen new users not immediately understanding that len calls are eager when divisions are unknown and wondering why their analysis code takes too long before finding out it was just a couple of len calls that shouldn't be there.
I was a victim of this myself. Therefore, I think a good idea would be to add a warning.

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (fa07194) 93.94% compared to head (eaf85ea) 93.94%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
- Coverage   93.94%   93.94%   -0.01%     
==========================================
  Files          23       23              
  Lines        3123     3122       -1     
==========================================
- Hits         2934     2933       -1     
  Misses        189      189              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@agoose77
Copy link
Collaborator

I'd be tempted to make this an error. Implicit computation is something we generally want to avoid, and whilst we have a config option for this, I wonder if the default should change.

@douglasdavis ?

@ikrommyd
Copy link
Contributor Author

@agoose77 It is definitely an option. I'm wondering however whether it will ruin pytests or something where len calls are performed on small collections. We can adapt to that for sure but what I don't know is whether users want to use len sometimes even though it's a an eager computation for some reason.

@jpivarski
Copy link
Collaborator

I think so, too. Warnings in Python are too easy to lose, so they don't have much benefit beyond deprecations (because pytest can be configured to always show them and treat them as errors, not warnings after all).

You don't want to be getting the len of an unknown-length object. First, it definitely shouldn't make one's whole process eager and therefore run for a long time, but also, (sometimes?) getting a message saying that it's probably not what you want is too weak. An object that can't return a meaningful integer for len shouldn't be considered as satisfying the Sized protocol.

@jpivarski
Copy link
Collaborator

That is, it should be an error.

@jpivarski
Copy link
Collaborator

If someone wants an eager collection, there should be only one way to do it: by calling compute. (That's what's Awkward v1 laziness was missing, this explicitness.) The error message can tell them to call compute if they want something to be eager.

@ikrommyd
Copy link
Contributor Author

Alrighty. If @douglasdavis agrees, I'm gonna change this into an error and then go through the dask-awkward and coffea codes for any len calls in the tests suites

@douglasdavis
Copy link
Collaborator

I agree with Jim and Angus that this should raise for unknown partition sizes:

something along the lines of:

if not self.known_divisions:
    raise ValueError("cannot determine length of collection with unknown partitions sizes")

I was scratching my head trying to figure out why this didn't raise to begin with and I honestly can't figure that out. I even added the patch to dask/dask 2 years ago to raise on calling dask.array.Array.__len__ with unknown chunks 🤦

@douglasdavis
Copy link
Collaborator

Having the exception get raised for unknown divisions sizes may surface some places where should be able to define divisions at collection instantiation time

@martindurant
Copy link
Collaborator

"cannot determine length of collection with unknown partitions sizes"

"... without executing the graph"

(or something like that)

@ikrommyd
Copy link
Contributor Author

Alrighty, I'll make it happen and also add another sentence that directs the user to use dak.num on axis=0 for a lazy Scalar of the length. I must also go through all len calls in dask-awkward and coffea and make sure no eager len calls are happening anywhere

@douglasdavis
Copy link
Collaborator

Yea you may surface a test or two in dask-awkward that will need updating, just changing the expectation to the raise is probably all that needs doing on those. If it pops up, something along the lines of:

assert not array.known_divisions
with pytest.raises(ValueError, match="cannot determine length of collection"):
    len(array)

I'm not sure about coffea :)

@ikrommyd
Copy link
Contributor Author

Yup. There shouldn't be any such len calls in the source code itself though right?

@ikrommyd ikrommyd changed the title feat: add a user warning to eager len calls feat: raise error to eager len calls Nov 15, 2023
@ikrommyd
Copy link
Contributor Author

I've updated the code to raise an error and also the 3 tests that were failing. I don't understand the mypy pre-commit.ci error though. I'll check the code source code and also coffea tomorrow in case I'm missing something. Let's not merge yet.

@ikrommyd
Copy link
Contributor Author

@lgray There a few eager len calls in coffea's analysis_tools tests that I'm gonna replace with dak.num(..., axis=0) calls with a PR. I've ran pytest locally and those are the only tests that failed due to this change. Do you know of the top of your head of any such len calls in the coffea source code that I should take care of? I'm not expecting any.

@lgray
Copy link
Collaborator

lgray commented Nov 15, 2023

Not that I recall - I can give a search later (or you can since you're on this particular thread).

@douglasdavis douglasdavis changed the title feat: raise error to eager len calls feat!: raise error to eager len calls Nov 15, 2023
@ikrommyd
Copy link
Contributor Author

I've just went over control-Fing the src code in my IDE and all the len(...) calls looked fine to me. You can check this as well if you want but otherwise I'm good with merging

@douglasdavis
Copy link
Collaborator

Great- thanks, @iasonkrom!

@douglasdavis douglasdavis merged commit 5bd87d0 into dask-contrib:main Nov 15, 2023
21 of 22 checks passed
@ikrommyd ikrommyd deleted the warn-on-eager-len-call branch November 15, 2023 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants