Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make IndexOpsMixin (and Index) generic #760

Merged
merged 25 commits into from
Aug 14, 2023

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Aug 3, 2023

Mostly working, but still a few failing tests and I haven't added new tests.

xref microsoft/pyright#5642

tests/test_scalars.py Outdated Show resolved Hide resolved
@twoertwein
Copy link
Member Author

Most of the the remaining errors caused by Index[S1] where S1 is Any/Unknown, e.g., df.columns. Even though S1 is bound, the type system/checkers widen the type to Any/Unknown.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 3, 2023

I think this may close #502 and #340 . Can you check if that should be added to the list?

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed some things to consider.

Am wondering how MultiIndex will work as a subclass of Index[S1] . Is everything good because we just say class MultiIndex(Index), so the S1 is ignored? But will some of the methods of Index that refer to S1 still work right if the index is a MultiIndex ?

pandas-stubs/core/frame.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/base.pyi Show resolved Hide resolved
pandas-stubs/core/indexes/base.pyi Show resolved Hide resolved
tests/test_indexes.py Show resolved Hide resolved
@twoertwein
Copy link
Member Author

twoertwein commented Aug 3, 2023

Am wondering how MultiIndex will work as a subclass of Index[S1] . Is everything good because we just say class MultiIndex(Index), so the S1 is ignored? But will some of the methods of Index that refer to S1 still work right if the index is a MultiIndex ?

I'm not sure, same applies to CategoricalIndex (pd.Categorical is not in S1, but probably should be there).

edit: Leaving CategoricalIndex as-is might not be too bad as CategoricalIndex([...])[0] is the underlying type and not pd.Categorical

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 3, 2023

@twoertwein I'm headed out of town for 3 days, so won't look at this until Monday.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good, IMHO

pandas-stubs/core/indexes/base.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/base.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/multi.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/timedeltas.pyi Show resolved Hide resolved
@@ -1761,7 +1761,8 @@ def test_getmultiindex_columns() -> None:
[(i, s) for i in [1] for s in df.columns.get_level_values(1)]
]
res4: pd.DataFrame = df[[df.columns[0]]]
check(assert_type(df[df.columns[0]], pd.Series), pd.Series)
column: Scalar = df.columns[0]
check(assert_type(df[column], pd.Series), pd.Series)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be annoying if one can't write df[df.columns[0]] without doing what you have done here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this can be fixed - maybe we can do some gymnastic with the order of overloads. df.columns is Index[Any] and would therefore probably match the first overload.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that both mypy and pyright do not limit a bounded TypeVar to its bound when it is unknown/any, we have only two options:

  • make people cast the Index to the appropriate type
  • return Scalar
  • return Scalar now; in the future, make pd.DataFrame generic in terms of the Index; return S1

I have no particular preference, except that making DataFrame generic might not happen anytime soon

btw. I will offline from Friday-Sunday, you are welcome to push changes to this PR or also merge it. Since this is a large PR and the tests might not cover everything, I would prefer if you or others could run this version of pandas-stubs on your internal codebase before the next release to catch potential regressions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated: seems that the newest version of numexp broke CI/pandas

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that both mypy and pyright do not limit a bounded TypeVar to its bound when it is unknown/any, we have only two options:

  • make people cast the Index to the appropriate type
  • return Scalar
  • return Scalar now; in the future, make pd.DataFrame generic in terms of the Index; return S1

I have no particular preference, except that making DataFrame generic might not happen anytime soon

Would changing DataFrame.columns() to return Index[Scalar] work?

btw. I will offline from Friday-Sunday, you are welcome to push changes to this PR or also merge it. Since this is a large PR and the tests might not cover everything, I would prefer if you or others could run this version of pandas-stubs on your internal codebase before the next release to catch potential regressions.

I will see if I can give this a try.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a large PR and the tests might not cover everything, I would prefer if you or others could run this version of pandas-stubs on your internal codebase before the next release to catch potential regressions.

I tried the version I placed in your repo with the PR on two large code bases and no new errors appeared, so I think once you merge that in, I can approve and merge in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for testing it and finding a way to let mypy (and pyright) clearly indicate the unintended calls!

tests/test_indexes.py Show resolved Hide resolved
tests/test_indexes.py Show resolved Hide resolved
@twoertwein twoertwein marked this pull request as ready for review August 9, 2023 00:46
Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test if using DataFrame.columns() returning Index[Scalar] would solve the problem noted in another comment so that the expression df[df.columns[0]] would work without having to do a cast or creating a temporary variable?

Comment on lines 749 to 754
assert_type(
md_int64_index // td, Never # pyright: ignore[reportGeneralTypeIssues]
)
assert_type(
md_float_index // td, Never # pyright: ignore[reportGeneralTypeIssues]
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerned about the above change. Previously, mypy was seeing that this is an invalid operation. If you make the above change, mypy is no longer detecting that. That's not a good thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't know how to fix it. I think the reason why it worked previously was a behavior of Never when it is being used on input arguments: it is interpreted by mypy/pyright to indicate that that call is invalid. Now, with the generic class, we need to have annotations on the input arguments (we don't have the luxury of using Never there anymore). We can still return Never (but that has a slightly different semantic meaning).

Copy link
Collaborator

@Dr-Irv Dr-Irv Aug 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a fix. I created a PR against your branch at twoertwein#3

Key was to remove __floordiv__() from OpsMixin and let any subclass of OpsMixin define only the valid values.

I think it's a mypy bug. See python/mypy#15861

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the mypy folks say it's not a bug, and pyright says that its implementation was incorrect. So I think the only way to do this is with what I did - you can't use Never or NoReturn to "override" the implementation in a subclass that matches Any

Comment on lines 787 to 792
assert_type(
md_int64_index / td, Never # pyright: ignore[reportGeneralTypeIssues]
)
assert_type(
md_float_index / td, Never # pyright: ignore[reportGeneralTypeIssues]
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@twoertwein
Copy link
Member Author

twoertwein commented Aug 9, 2023

Can you test if using DataFrame.columns() returning Index[Scalar] would solve the problem noted in another comment so that the expression df[df.columns[0]] would work without having to do a cast or creating a temporary variable?

It sounds as if it should work but it creates different problems:

  • Scalar is a Union: using df.columns[0] for comparisons or other operations would fail (we have a few tests for that - they work because it is currently Any)
  • Scalar is actually not compatible with S1 (I believe np.tiemdelta64 and np.datetime64 are missing in S1)
  • it makes columns: Index[str] = df.columns not possible anymore Bug with iterating df.columns #502

I try to see how far I get on my local branch but I'm not too optimistic that there is an ideal solution for this problem (except for DataFrame being generic in index and columns)

edit: columns -> Index[str] is a lie but It passes the tests and is probably also true for most people

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 9, 2023

edit: columns -> Index[str] is a lie but It passes the tests and is probably also true for most people

Yes, I agree. For the cases where that was incorrect, then you have to do a cast, and if we make the type wider, then you always have to do a cast, even in the most common case.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 9, 2023

I will see if I can find some time to test that one issue related to why mypy isn't picking up the incorrect arithmetic operation and to test with some code bases I have. Once done, I'll approve and merge.

@twoertwein
Copy link
Member Author

I will see if I can find some time to test that one issue related to why mypy isn't picking up the incorrect arithmetic operation

I believe you have an open issue at mypy related to that :) In theory, mypy should still warn about unreachable code after those lines and prevent people from using the return value.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to avoid using npt in the constructors

pandas-stubs/core/indexes/base.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/base.pyi Outdated Show resolved Hide resolved
pandas-stubs/core/indexes/base.pyi Outdated Show resolved Hide resolved
Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @twoertwein . Great contribution

@Dr-Irv Dr-Irv merged commit f7621f4 into pandas-dev:main Aug 14, 2023
13 checks passed
@twoertwein twoertwein deleted the iter_interpolate branch August 14, 2023 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants