[python] add type hints for custom objective and metric functions in scikit-learn interface #4547
python-package/lightgbm/sklearn.py
    [_ArrayLike, _ArrayLike, _GroupType],
    Tuple[np.ndarray, np.ndarray]
Could you please clarify why you are distinguishing these three types (_ArrayLike, _GroupType, np.ndarray)? They are all documented as array-like.
LightGBM/python-package/lightgbm/sklearn.py
Lines 32 to 49 in bd28a36
    y_true : array-like of shape = [n_samples]
        The target values.
    y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
        The predicted values.
        Predicted values are returned before any transformation,
        e.g. they are raw margin instead of probability of positive class for binary task.
    group : array-like
        Group/query data.
        Only used in the learning-to-rank task.
        sum(group) = n_samples.
        For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
        where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
    grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
        The value of the first order derivative (gradient) of the loss
        with respect to the elements of y_pred for each sample point.
    hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
        The value of the second order derivative (Hessian) of the loss
        with respect to the elements of y_pred for each sample point.
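To make the documented signature concrete, here is a minimal sketch of a custom objective that takes y_true and y_pred and returns grad and hess. The function name `squared_error_objective` is hypothetical and not part of LightGBM. The sketch assumes a plain squared-error loss and converts its inputs with `np.asarray` so that it tolerates lists as well as numpy arrays.

```python
import numpy as np


def squared_error_objective(y_true, y_pred):
    """Gradient and Hessian of 0.5 * (y_pred - y_true) ** 2 w.r.t. y_pred.

    Hypothetical example, not taken from the LightGBM code base.
    """
    # Accept lists or numpy arrays; both become 1-D float64 arrays here.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    grad = y_pred - y_true          # first-order derivative per sample
    hess = np.ones_like(y_pred)     # second-order derivative is constant
    return grad, hess


grad, hess = squared_error_objective([1.0, 2.0, 3.0], np.array([1.5, 2.0, 2.0]))
```

Both outputs are 1-D numpy arrays with one entry per sample, which matches the shape the docstring describes for a single-output task.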
I didn't realize that y_true and y_pred could be lists; I thought they had to be a pandas Series, numpy array, or scipy matrix.
For grad and hess, it seems that they cannot be scipy matrices or pandas DataFrames / Series (although I didn't realize they could be lists).
LightGBM/python-package/lightgbm/basic.py
Lines 2956 to 2957 in bd28a36
    grad, hess = fobj(self.__inner_predict(0), self.train_set)
    return self.__boost(grad, hess)
LightGBM/python-package/lightgbm/basic.py
Lines 2972 to 2977 in bd28a36
    grad : list or numpy 1-D array
        The value of the first order derivative (gradient) of the loss
        with respect to the elements of score for each sample point.
    hess : list or numpy 1-D array
        The value of the second order derivative (Hessian) of the loss
        with respect to the elements of score for each sample point.
To be honest, I'm pretty unsure about the meaning of "array-like" in different parts of LightGBM's docs, and when I see that term I'm not always sure which of the following are supported:
- list
- numpy array
- scipy sparse matrix
- h2o datatable
- pandas DataFrame
- pandas Series
So I took a best guess based on a quick look through the code, but I probably need to test all of those combinations and then update this PR / the docs as appropriate.
Yeah, I absolutely agree that array-like everywhere in the sklearn wrapper looks confusing. I might be wrong, but it was written before scikit-learn introduced a formal definition of the array-like term:
https://scikit-learn.org/stable/glossary.html#term-array-like
All these values in custom function signatures are supposed to have exactly 1 dimension, right? I believe it will be safe for now to assign them the following types, which we treat as 1-D arrays internally:
LightGBM/python-package/lightgbm/basic.py
Lines 179 to 180 in bd28a36
    raise TypeError(f"Wrong type({type(data).__name__}) for {name}.\n"
                    "It should be list, numpy 1-D array or pandas Series")
For grad and hess, the function list_to_1d_numpy is applied directly.
LightGBM/python-package/lightgbm/basic.py
Lines 2984 to 2985 in bd28a36
    grad = list_to_1d_numpy(grad, name='gradient')
    hess = list_to_1d_numpy(hess, name='hessian')
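To illustrate the kind of check described above, here is a simplified stand-in for a 1-D conversion helper. It is not LightGBM's list_to_1d_numpy: the name `to_1d_array` and the default dtype are assumptions, and pandas Series handling is left out to keep the sketch dependent on numpy only.

```python
import numpy as np


def to_1d_array(data, dtype=np.float64, name="data"):
    """Convert a list or 1-D numpy array to a 1-D numpy array.

    Hypothetical helper for illustration; anything else raises TypeError.
    """
    if isinstance(data, (list, np.ndarray)):
        arr = np.asarray(data, dtype=dtype)
        if arr.ndim == 1:
            return arr
    raise TypeError(
        f"Wrong type({type(data).__name__}) for {name}.\n"
        "It should be list or numpy 1-D array"
    )


grad = to_1d_array([0.5, 0.0, -1.0], name="gradient")

# A 2-D array is rejected, which is why scipy matrices and
# DataFrames would not pass a check of this kind.
try:
    to_1d_array(np.zeros((2, 2)), name="hessian")
    rejected_2d = False
except TypeError:
    rejected_2d = True
```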
For weight and group only np.ndarray is possible, if I'm not mistaken:
LightGBM/python-package/lightgbm/sklearn.py
Lines 176 to 179 in bd28a36
    elif argc == 3:
        return self.func(labels, preds, dataset.get_weight())
    elif argc == 4:
        return self.func(labels, preds, dataset.get_weight(), dataset.get_group())
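The dispatch on argument count quoted above can be sketched in isolation as follows. This is an illustration under assumptions rather than the wrapper's actual code: `call_metric` and `weighted_mae` are hypothetical names, and the two-argument branch is inferred from the three- and four-argument branches shown in the snippet.

```python
import inspect

import numpy as np


def call_metric(func, labels, preds, weight=None, group=None):
    """Call a custom metric with 2, 3 or 4 positional arguments (hypothetical helper)."""
    argc = len(inspect.signature(func).parameters)
    if argc == 2:
        return func(labels, preds)
    if argc == 3:
        return func(labels, preds, weight)
    if argc == 4:
        return func(labels, preds, weight, group)
    raise TypeError(f"Custom metric should have 2, 3 or 4 arguments, got {argc}")


def weighted_mae(y_true, y_pred, weight):
    """Weighted mean absolute error as (name, value, is_higher_better)."""
    value = float(np.average(np.abs(y_true - y_pred), weights=weight))
    return "weighted_mae", value, False


result = call_metric(
    weighted_mae,
    labels=np.array([1.0, 2.0]),
    preds=np.array([2.0, 2.0]),
    weight=np.array([1.0, 3.0]),
)
```

In this sketch weight and group reach the user function exactly as the caller supplies them, so their runtime type is decided by whatever the Dataset getters return.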
LightGBM/python-package/lightgbm/basic.py
Lines 2215 to 2225 in bd28a36
    def get_weight(self):
        """Get the weight of the Dataset.

        Returns
        -------
        weight : numpy array or None
            Weight for each data point from the Dataset.
        """
        if self.weight is None:
            self.weight = self.get_field('weight')
        return self.weight
LightGBM/python-package/lightgbm/basic.py
Lines 2271 to 2288 in bd28a36
    def get_group(self):
        """Get the group of the Dataset.

        Returns
        -------
        group : numpy array or None
            Group/query data.
            Only used in the learning-to-rank task.
            sum(group) = n_samples.
            For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
            where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
        """
        if self.group is None:
            self.group = self.get_field('group')
            if self.group is not None:
                # group data from LightGBM is boundaries data, need to convert to group size
                self.group = np.diff(self.group)
        return self.group
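The boundaries-to-sizes conversion mentioned in the comment above can be checked with the docstring's own example. This is a small sketch; the assumption that boundaries are cumulative sums starting at 0 is mine, inferred from the use of np.diff.

```python
import numpy as np

# Group sizes from the docstring example: 6 groups, 100 documents in total.
group_sizes = np.array([10, 20, 40, 10, 10, 10])

# Assumed boundary representation: cumulative offsets with a leading 0.
boundaries = np.concatenate(([0], np.cumsum(group_sizes)))

# np.diff turns consecutive boundaries back into per-group sizes.
recovered = np.diff(boundaries)
```

Because np.diff returns a numpy array, this path is consistent with the point that group reaches custom functions as np.ndarray.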
LightGBM/python-package/lightgbm/basic.py
Lines 1507 to 1510 in bd28a36
    if weight is not None:
        self.set_weight(weight)
    if group is not None:
        self.set_group(group)
LightGBM/python-package/lightgbm/basic.py
Lines 2099 to 2119 in bd28a36
    def set_weight(self, weight):
        """Set weight of each instance.

        Parameters
        ----------
        weight : list, numpy 1-D array, pandas Series or None
            Weight to be set for each data point.

        Returns
        -------
        self : Dataset
            Dataset with set weight.
        """
        if weight is not None and np.all(weight == 1):
            weight = None
        self.weight = weight
        if self.handle is not None and weight is not None:
            weight = list_to_1d_numpy(weight, name='weight')
            self.set_field('weight', weight)
            self.weight = self.get_field('weight')  # original values can be modified at cpp side
        return self
LightGBM/python-package/lightgbm/basic.py
Lines 2141 to 2162 in bd28a36
    def set_group(self, group):
        """Set group size of Dataset (used for ranking).

        Parameters
        ----------
        group : list, numpy 1-D array, pandas Series or None
            Group/query data.
            Only used in the learning-to-rank task.
            sum(group) = n_samples.
            For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
            where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

        Returns
        -------
        self : Dataset
            Dataset with set group.
        """
        self.group = group
        if self.handle is not None and group is not None:
            group = list_to_1d_numpy(group, np.int32, name='group')
            self.set_field('group', group)
        return self
For y_true the same logic applies as for weight and group.
LightGBM/python-package/lightgbm/sklearn.py
Line 172 in bd28a36
    labels = dataset.get_label()
For y_pred only np.ndarray is possible:
LightGBM/python-package/lightgbm/basic.py
Line 2956 in bd28a36
    grad, hess = fobj(self.__inner_predict(0), self.train_set)
LightGBM/python-package/lightgbm/basic.py
Line 3732 in bd28a36
    feval_ret = eval_function(self.__inner_predict(data_idx), cur_data)
LightGBM/python-package/lightgbm/basic.py
Line 3763 in bd28a36
    return self.__inner_predict_buffer[data_idx]
LightGBM/python-package/lightgbm/basic.py
Line 3750 in bd28a36
    self.__inner_predict_buffer[data_idx] = np.empty(n_preds, dtype=np.float64)
Sorry for taking so long to get back to this one!
I just pushed ea1aada with my best understanding of your comments above, but to be honest I still am confused about exactly what is allowed.
Here is my interpretation of those comments / links:
- eval function
  - y_true = list, numpy array, or pandas Series
  - y_pred = numpy array
  - group = numpy array
  - weight = numpy array
- objective function
  - y_true = list, numpy array, or pandas Series
  - y_pred = numpy array
  - group = numpy array
  - grad (output) = list, numpy array, or pandas Series
  - hess (output) = list, numpy array, or pandas Series
Double-checked this and I think your interpretation is fine. Thanks for the detailed investigation.
I'm sorry, my previous comment was very vague. But you've got almost everything right from it! 😄
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Thank you very much for the deep investigation into accepted types!
Thanks again for the help @StrikerRUS, this one required a lot of investigation 😄
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Created in response to #4544 (review).
Contributes to #3756.
Proposes introducing more specific type hints for custom objective and metric functions in the scikit-learn and Dask interfaces in the Python package.