[python-package] Support 2d collections as input for init_score in multiclass classification task #4150
@StrikerRUS I'd appreciate your feedback on this when you have time.
@jmoralez Thanks a lot for picking this up!
Thank you. I've removed the try/except and replaced it with checks for the dimension of the input collection, and I flatten without looking for the classes. Please let me know what you think.
@StrikerRUS should I keep going with this?
@jmoralez Sorry I've missed this PR! Yes, I think a more intuitive shape of arguments is a good enhancement. Will get back to this PR soon.
As I already said, I think this is a useful enhancement, but it seems it requires deeper integration and more docs updates (see comments below).
@jmoralez
I'm so sorry for the delay again!
I like the PR in its current state and I hope this is the last round of review. Please check my comments below.
@@ -1583,17 +1583,14 @@ def test_init_score(task, output, cluster):
'time_out': 5
}
init_score = random.random()
# init_scores must be a 1D array, even for multiclass classification
I guess we need updates to the type hints and docstrings for the Dask module.
LightGBM/python-package/lightgbm/dask.py
Line 395 in cfe8eb1
init_score: Optional[_DaskVectorLike] = None,
LightGBM/python-package/lightgbm/dask.py
Line 401 in cfe8eb1
eval_init_score: Optional[List[_DaskCollection]] = None,
LightGBM/python-package/lightgbm/dask.py
Lines 423 to 424 in cfe8eb1
init_score : Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)
Init score of training data.
LightGBM/python-package/lightgbm/dask.py
Lines 442 to 443 in cfe8eb1
eval_init_score : list of Dask Arrays, Dask Series or None, optional (default=None)
Initial model score for each validation set in eval_set.
LightGBM/python-package/lightgbm/dask.py
Line 1024 in cfe8eb1
init_score: Optional[_DaskVectorLike] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1030 in cfe8eb1
eval_init_score: Optional[List[_DaskCollection]] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1162 in cfe8eb1
init_score: Optional[_DaskVectorLike] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1167 in cfe8eb1
eval_init_score: Optional[List[_DaskCollection]] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1195 in cfe8eb1
init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",
LightGBM/python-package/lightgbm/dask.py
Line 1198 in cfe8eb1
eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",
LightGBM/python-package/lightgbm/dask.py
Line 1341 in cfe8eb1
init_score: Optional[_DaskVectorLike] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1345 in cfe8eb1
eval_init_score: Optional[List[_DaskCollection]] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1372 in cfe8eb1
init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",
LightGBM/python-package/lightgbm/dask.py
Line 1375 in cfe8eb1
eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",
LightGBM/python-package/lightgbm/dask.py
Line 1502 in cfe8eb1
init_score: Optional[_DaskVectorLike] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1507 in cfe8eb1
eval_init_score: Optional[List[_DaskCollection]] = None,
LightGBM/python-package/lightgbm/dask.py
Line 1539 in cfe8eb1
init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",
LightGBM/python-package/lightgbm/dask.py
Line 1542 in cfe8eb1
eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",
Addressed in 2c7ef3c. I think the docstrings maybe ended up a bit too verbose, let me know what you think.
@jmoralez Many thanks for addressing all previous comments. I don't have any more, except one new comment below and #4150 (comment).
python-package/lightgbm/basic.py
Outdated
@@ -1145,7 +1188,7 @@ def __init__(self, data, label=None, reference=None,
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
- init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
+ init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task) or None, optional (default=None)
@StrikerRUS do you think this should have a comma before `or None`? I included it in the Dask docstrings, but I just realized this one doesn't have it. I'll make them consistent, but I'd like to know which one you think is correct.
I like how you did it. I believe a comma before `or None` prevents users from thinking that it is possible to include `None`s in a list: #4557 (comment). However, I'm not sure whether it is grammatically correct or not. @jameslamb should know for sure 🙂
I personally prefer `a, b, c, or None, optional (default=None)` (with the `,` before `or None`), but both are equally valid and I don't think you need to change anything.
Thank you very much for this PR! LGTM, except one typo below:
@jameslamb Would you like to provide your review?
yes please, thanks! I can provide a review later tonight
Looks pretty good to me! The logic makes sense and I think the added test (plus the fact that the changes to existing tests were so minimal) gives me extra confidence that this change is working.
I just left some minor comments around two areas:
- when introducing new module-level objects in the Python package that are only intended for internal use, I think we should prefix them with a `_` to make that a bit clearer
- new functions / methods / classes in the Python package should have type hints added, in pursuit of [python-package] type hints in python package #3756. PR authors and reviewers have the most context about the expected types right now, during the PR introducing new code
python-package/lightgbm/basic.py
Outdated
@@ -161,14 +161,24 @@ def is_1d_list(data):
    return isinstance(data, list) and (not data or is_numeric(data[0]))


def is_1d_collection(data):
Suggested change:
- def is_1d_collection(data):
+ def _is_1d_collection(data: Any) -> bool:
I think we should be adding type hints for new code when possible, to increase the chance of catching bugs with `mypy` and reduce the amount of effort needed for #3756.
I'd also like to recommend prefixing objects that we don't want to encourage people to import with `_`, to make it clearer that they're intended to be internal.
python-package/lightgbm/basic.py
Outdated
@@ -180,6 +190,39 @@ def list_to_1d_numpy(data, dtype=np.float32, name='list'):
"It should be list, numpy 1-D array or pandas Series")


def is_numpy_2d_array(data):
Suggested change:
- def is_numpy_2d_array(data):
+ def _is_numpy_2d_array(data: Any) -> bool:
python-package/lightgbm/basic.py
Outdated
    return isinstance(data, np.ndarray) and len(data.shape) == 2 and data.shape[1] > 1


def is_2d_list(data):
Suggested change:
- def is_2d_list(data):
+ def _is_2d_list(data: Any) -> bool:
python-package/lightgbm/basic.py
Outdated
    return isinstance(data, list) and len(data) > 0 and is_1d_list(data[0])


def is_2d_collection(data):
Suggested change:
- def is_2d_collection(data):
+ def _is_2d_collection(data: Any) -> bool:
python-package/lightgbm/basic.py
Outdated
)


def data_to_2d_numpy(data, dtype=np.float32, name='list'):
Suggested change:
- def data_to_2d_numpy(data, dtype=np.float32, name='list'):
+ def _data_to_2d_numpy(data, dtype=np.float32, name='list'):
Could you also add type hints here?
python-package/lightgbm/basic.py
Outdated
def data_to_2d_numpy(data, dtype=np.float32, name='list'):
    """Convert data to numpy 2-D array."""
    if is_numpy_2d_array(data):
Suggested change:
- if is_numpy_2d_array(data):
+ if _is_numpy_2d_array(data):
python-package/lightgbm/basic.py
Outdated
dtype = np.float64
data = list_to_1d_numpy(data, dtype, name=field_name)
if is_1d_collection(data):
Suggested change:
- if is_1d_collection(data):
+ if _is_1d_collection(data):
python-package/lightgbm/basic.py
Outdated
data = list_to_1d_numpy(data, dtype, name=field_name)
if is_1d_collection(data):
data = list_to_1d_numpy(data, dtype, name=field_name)
elif is_2d_collection(data):
Suggested change:
- elif is_2d_collection(data):
+ elif _is_2d_collection(data):
python-package/lightgbm/basic.py
Outdated
if is_1d_collection(data):
data = list_to_1d_numpy(data, dtype, name=field_name)
elif is_2d_collection(data):
data = data_to_2d_numpy(data, dtype, name=field_name)
Suggested change:
- data = data_to_2d_numpy(data, dtype, name=field_name)
+ data = _data_to_2d_numpy(data, dtype, name=field_name)
add type hints to new functions
make commas consistent in dask and basic
Looks like all the suggestions I made were addressed, thanks very much.
This PR should remove a bit of friction for users, really nice addition!
Looking forward for other PRs with |
@jmoralez Could you please sync with |
Hi. I have covid so I'll be inactive for a couple of weeks, feel free to make any changes to my PRs.
Ah! I'm so sorry to hear that @jmoralez !!! Don't worry at all about notifications in LightGBM, focus on resting up and hopefully feeling better soon! Let us know in the maintainer Slack whenever you're feeling better, and until then I'll avoid |
Ah, get well soon!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This intends to solve #4046 by allowing an (n_samples, n_classes) collection to be provided as init_score in lightgbm.LGBMClassifier's fit method.
I'd also like to investigate the possibility of allowing grad and hess to be 2d collections as well for custom objectives.
I'm opening this to get some feedback on my current approach.
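To make the shape discussion concrete, here is a small sketch of how an (n_samples, n_classes) array can be flattened into the 1-D form that was previously required. It assumes the flat array is grouped by class first and then by sample, so that the score of sample i for class j sits at index `j * n_samples + i`; that layout and the helper name `flatten_init_score` are assumptions made for illustration, not details taken from this thread.

```python
import numpy as np


def flatten_init_score(init_score: np.ndarray) -> np.ndarray:
    """Flatten an (n_samples, n_classes) array column by column (class-major)."""
    if init_score.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {init_score.ndim}-D")
    # order="F" walks down each column first, so all scores for class 0
    # come first, then all scores for class 1, and so on.
    return init_score.ravel(order="F")


n_samples, n_classes = 4, 3
init_score = np.arange(n_samples * n_classes, dtype=np.float64).reshape(n_samples, n_classes)
flat = flatten_init_score(init_score)

# Under the assumed layout, sample i / class j is found at flat[j * n_samples + i].
i, j = 2, 1
assert flat[j * n_samples + i] == init_score[i, j]
```

With 2d input supported, a user would pass the (n_samples, n_classes) collection directly instead of doing this flattening by hand.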