Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type hints to core.py #7707

Merged
merged 10 commits into from
Mar 23, 2022
Merged

Add type hints to core.py #7707

merged 10 commits into from
Mar 23, 2022

Conversation

bridgream
Copy link
Contributor

I attempted to add type hints to core.py. I wasn't able to eliminate all mypy errors, but I hoped to make incremental changes because adding type hints seemed very intrusive and could easily trigger conflicts.

Related to #6496

@bridgream
Copy link
Contributor Author

CI failed because I used typing.ParamSpec, which was only introduced in Python 3.10. I'll seek to fix that.

@@ -27,12 +28,18 @@
# lesser tested. For now we encourage users to pass a simple list of string.
FeatNamesT = Optional[List[str]]

list_or_dict = TypeVar("list_or_dict", List, Dict)
ArrayLike = np.ndarray | Sequence
Copy link
Member

@trivialfis trivialfis Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now we can just use Any. I have been working on defining the protocol of input data: #7242 We accept a lot more types than ndarray.

python-package/xgboost/core.py Outdated Show resolved Hide resolved
raise RuntimeError(f"expected {ctype} pointer")
res = np.zeros(length, dtype=dtype)
if not ctypes.memmove(res.ctypes.data, cptr, length * res.strides[0]):
if not ctypes.memmove(res.ctypes.data, cptr, length * res.strides[0]): # "memmove" does not return a value?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's probably a bug in mypy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right. By the way, the maintainers of typeshed were able to fix this bug in a flash.

@@ -790,7 +807,7 @@ def set_uint_info(self, field: str, data) -> None:
from .data import dispatch_meta_backend
dispatch_meta_backend(self, data, field, 'uint32')

def save_binary(self, fname, silent=True) -> None:
def save_binary(self, fname: str | os.PathLike, silent: bool = True) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using Union instead. This requires from future import __annotations__, which has different behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that different behavior matters (except for the explicit import), but since typing.Union is available for Python <3.10, I agree it's better to stick with that.

@@ -806,7 +823,7 @@ def save_binary(self, fname, silent=True) -> None:
c_str(fname),
ctypes.c_int(silent)))

def set_label(self, label) -> None:
def set_label(self, label: Collection) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why is this a Collection? Maybe ArrayLike?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad; I was confused by dispatch_meta_backend but apparently I missed the docstring...

@bridgream
Copy link
Contributor Author

The CI build still failed due to TypeError: '_ctypes.PyCArrayType' object is not subscriptable. Could it be because I used a platform specific annotation Union[bytes, ctypes.Array[ctypes.c_char_p]]?

@bridgream bridgream marked this pull request as draft March 1, 2022 02:42
@bridgream
Copy link
Contributor Author

bridgream commented Mar 1, 2022

I was able to eliminate most mypy errors (from 60+ to 19; hopefully that was not because I introduced too many Any). I think most of the remaining errors would involve more intrusive changes or changes of other files.

@bridgream
Copy link
Contributor Author

It turns out we cannot include ctypes.pointer[] in any part of the Python scripts. However, to properly annotate typed C pointers, it seems we have to use this grammar. In fact, typeshed has a class interface for ctypes.pointer but this is not recognized by Python.

Maybe we should separate the definitions of types from source code by putting them in a stub file?

@trivialfis
Copy link
Member

Apologies for the slow response, will come back to this tomorrow.

@trivialfis
Copy link
Member

Maybe we should separate the definitions of types from source code by putting them in a stub file?

That sounds like a good idea, we need to have some stubs later.

@bridgream
Copy link
Contributor Author

@trivialfis Thanks for following up! I've done a quick demo using stub (basically I just used stubgen and filled types that weren't automatically inferred, so that I was able to make mypy satisfied) and erased previous work (I made a backup of the previous commit, though).

Some issues:

  1. With the existence of core.pyi, mypy simply ignores all other files including core.py itself (which is expected behavior). Is this still helpful? I guess we also wanted mypy to help us do static check for implementation?
  2. Should we generate a stub file for each .py file, or just one __init__.pyi enumerating all APIs?
  3. I've added a _typing.py to collect all typing-related declaration but maybe I should follow typing.py as in this draft PR. Could you please offer some advice?

@trivialfis
Copy link
Member

Thank you for experimenting with this! Does it make sense to simply use ctypes.pointer for now? I think python typing is still evolving and we don't have to be too harsh on ourselves at the moment. It's unlikely that we have a perfect solution for everything in short term.

@bridgream
Copy link
Contributor Author

@trivialfis Simply using ctypes.pointer should work. In principle, typeshed invents its own annotation for specific C pointers (e.g. ctypes.pointer[ctypes.c_char_p], but subscription is not natively supported by Python), so this is only usable in .pyi files.

Let me move on with inline annotation without specific types for C pointers for now.
Also no worries about the work; I'm having great fun playing with the typing system after all.

@trivialfis
Copy link
Member

Let me move on with inline annotation without specific types for C pointers for now.

Excellent!

@trivialfis
Copy link
Member

Hi, please let me know when it's ready for review. ;-)

@bridgream
Copy link
Contributor Author

Hi @trivialfis, after some of our work, there are still 22 mypy errors in core.py but I find them quite tricky to tackle. The errors can be generated with

cd python-package
mypy xgboost/core.py --ignore-missing-imports

Besides the errors that relate to ctypes, some errors are due to the flexible user interface. For example, the setter feature_names starting from core.py:1004 takes either a str, a List[str] or None. Even if there's code to deal with each case, mypy cannot understand the implementation but simply throws errors.

While I don't think simply adding # type: ignore is a good solution, I'm not sure whether we still should solve all the errors now. Otherwise, given the improvement we've already done, I think it's a good time to review and possibly merge.

@bridgream bridgream marked this pull request as ready for review March 21, 2022 23:37
Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice work. One comment inlined.

ArrayLike = Any
PathLike = Union[str, os.PathLike]
CupyT = ArrayLike # maybe need a stub for cupy arrays
NdarrayOrCupyT = Union[np.ndarray, CupyT]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be better off to return Any for prediction. The return type is actually a dependent type, which means given the input type, the return type is known so it's not a union. From a static type language's point of view (like c++, rust), the union can be changed at runtime, but the return type is determined at compile time.

Copy link
Member

@trivialfis trivialfis Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy doesn't support dependent type yet, you can find some related feature requests in their issue list. To provide a concrete example, if the users use a cupy array as input to inplace_predict, then they know the output is also a cupy array. However, with Union the users will be asked to handle both numpy array and cupy array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. By the way, should we use string literal for cupy types, e.g. "cupy.ndarray" instead of CupyT here https://github.com/bridgream/xgboost/blob/mypy/python-package/xgboost/core.py#L276

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bridgream I think it will show an error as mypy couldn't find cupy on the test environment (cupy is an optional dependency)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, overloading might help, but the change might be tedious as cupy is an optional dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should follow your solution in this PR by creating an interface. Since xgboost only touches some basic interface, this shouldn't take too much work. We could then introduce @overload to achieve better hints.
I can try that next.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the nice work! Will merge as part of 1.6.

@bridgream
Copy link
Contributor Author

Great, thank you! I'll continue to improve type hints in other Python scripts.

@trivialfis trivialfis merged commit c92ab2c into dmlc:master Mar 23, 2022
@bridgream bridgream deleted the mypy branch March 24, 2022 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants