Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Support 2d collections as input for init_score in multiclass classification task #4150

Merged
merged 19 commits into from
Sep 17, 2021

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Apr 1, 2021

This intends to solve #4046 by allowing an (n_samples, n_classes) collection to be provided as init_score in lightgbm.LGBMClassifier's fit method.

I'd also like investigate the possibility to allow grad and hess to be 2d collections as well for custom objectives.

I'm opening this to get some feedback on my current approach.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Apr 1, 2021

@StrikerRUS I'd appreciate your feedback on this when you have time.

@StrikerRUS
Copy link
Collaborator

@jmoralez Thanks a lot for picking this up!
As an early feedback I can say that I support the way you do it. But

@jmoralez
Copy link
Collaborator Author

jmoralez commented Apr 3, 2021

Thank you. I've removed the try/except and replaced it with checks for the dimension of the input collection and I flatten without looking for the classes. Please let me know what you think.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jun 8, 2021

@StrikerRUS should I keep going with this?

@StrikerRUS
Copy link
Collaborator

@jmoralez Sorry I've missed this PR! Yes, I think more intuitive shape of arguments is good enhancement. Will get back to this PR soon.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I already said, I think this is useful enhancement, but it seems it requires deeper integration and more docs updates (see comments below).

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Show resolved Hide resolved
tests/python_package_test/test_basic.py Show resolved Hide resolved
tests/python_package_test/test_basic.py Show resolved Hide resolved
python-package/lightgbm/basic.py Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmoralez
I'm so sorry for the delay again!
I like the PR in it's current state and I hope this is the last round of review. Please check my comments below.

@@ -1583,17 +1583,14 @@ def test_init_score(task, output, cluster):
'time_out': 5
}
init_score = random.random()
# init_scores must be a 1D array, even for multiclass classification
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need updates in type hints and docstrings for Dask module.

init_score: Optional[_DaskVectorLike] = None,

eval_init_score: Optional[List[_DaskCollection]] = None,

init_score : Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)
Init score of training data.

eval_init_score : list of Dask Arrays, Dask Series or None, optional (default=None)
Initial model score for each validation set in eval_set.

init_score: Optional[_DaskVectorLike] = None,

eval_init_score: Optional[List[_DaskCollection]] = None,

init_score: Optional[_DaskVectorLike] = None,

eval_init_score: Optional[List[_DaskCollection]] = None,

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

init_score: Optional[_DaskVectorLike] = None,

eval_init_score: Optional[List[_DaskCollection]] = None,

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

init_score: Optional[_DaskVectorLike] = None,

eval_init_score: Optional[List[_DaskCollection]] = None,

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2c7ef3c. I think the docstrings maybe ended up a bit too verbose, let me know what you think.

Copy link
Collaborator

@StrikerRUS StrikerRUS Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmoralez Thanks a lot! I like your explicit wordings.
I'm so sorry, I merged #4558 and introduced conflicts.

Also, I just understood... There shouldn't be any updates for DaskLGBMRegressor's and DaskLGBMRanker's docstrings and type annotations, (for multi-class task) is not applicable there.

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
@StrikerRUS StrikerRUS changed the title WIP: [python-package] Support 2d collections as input for multiclass classification WIP: [python-package] Support 2d collections as input for init_score in multiclass classification task Aug 9, 2021
@jmoralez jmoralez changed the title WIP: [python-package] Support 2d collections as input for init_score in multiclass classification task [python-package] Support 2d collections as input for init_score in multiclass classification task Aug 25, 2021
@jmoralez jmoralez marked this pull request as ready for review August 25, 2021 03:02
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmoralez Many thanks for addressing all previous comments. I don't have any more except one new below and #4150 (comment).

tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
@@ -1145,7 +1188,7 @@ def __init__(self, data, label=None, reference=None,
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task) or None, optional (default=None)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS do you think this should have a comma before or None? I did include it in the dask docstrings but I just realized this doesn't have it. I'll make them consistent but would like to know what you think is the correct one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you did it. I believe a comma before or None prevents users from thinking that it possible to include Nones into a list: #4557 (comment). However, I'm not sure whether it is grammatically correct or not. @jameslamb should know for sure 🙂

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a personally prefer a, b, c, or None, optional (default=None) (with the ,beforeor None`), but both are equally valid and I don't think you need to change anything

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR! LGTM, except one typo below:

python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
@StrikerRUS
Copy link
Collaborator

@jameslamb Would you like to provide your review?

@jameslamb
Copy link
Collaborator

@jameslamb Would you like to provide your review?

yes please, thanks! I can provide a review later tonight

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me! The logic makes sense and I think the added test (plus the fact that the changes to existing tests were so minimal) gives me extra confidence that this change is working.

I just left some minor comments around two areas:

  • when introducing new module-level objects in the Python package that are only intended for internal use, I think we should prefix them with a _ to make that a bit clearer
  • new functions / methods / classes in the Python package should have type hints added, in pursuit of [python-package] type hints in python package #3756 . PR authors and reviewers have the most context about the expected types right now, during the PR introducing new code

@@ -161,14 +161,24 @@ def is_1d_list(data):
return isinstance(data, list) and (not data or is_numeric(data[0]))


def is_1d_collection(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_1d_collection(data):
def _is_1d_collection(data: Any) -> bool:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be adding type hints for new code when possible, to increase the chance of catching bugs with mypy and reduce the amount of effort needed for #3756.

I'd also like to recommend prefixing objects that we don't want to encourage people to import with _, to make it clearer that they're intended to be internal

@@ -180,6 +190,39 @@ def list_to_1d_numpy(data, dtype=np.float32, name='list'):
"It should be list, numpy 1-D array or pandas Series")


def is_numpy_2d_array(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_numpy_2d_array(data):
def _is_numpy_2d_array(data: Any) -> bool:

return isinstance(data, np.ndarray) and len(data.shape) == 2 and data.shape[1] > 1


def is_2d_list(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_2d_list(data):
def _is_2d_list(data: Any) -> bool:

python-package/lightgbm/basic.py Show resolved Hide resolved
return isinstance(data, list) and len(data) > 0 and is_1d_list(data[0])


def is_2d_collection(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def is_2d_collection(data):
def _is_2d_collection(data: Any) -> bool:

)


def data_to_2d_numpy(data, dtype=np.float32, name='list'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def data_to_2d_numpy(data, dtype=np.float32, name='list'):
def _data_to_2d_numpy(data, dtype=np.float32, name='list'):

Could you also add type hints here?


def data_to_2d_numpy(data, dtype=np.float32, name='list'):
"""Convert data to numpy 2-D array."""
if is_numpy_2d_array(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if is_numpy_2d_array(data):
if _is_numpy_2d_array(data):

dtype = np.float64
data = list_to_1d_numpy(data, dtype, name=field_name)
if is_1d_collection(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if is_1d_collection(data):
if _is_1d_collection(data):

data = list_to_1d_numpy(data, dtype, name=field_name)
if is_1d_collection(data):
data = list_to_1d_numpy(data, dtype, name=field_name)
elif is_2d_collection(data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif is_2d_collection(data):
elif _is_2d_collection(data):

if is_1d_collection(data):
data = list_to_1d_numpy(data, dtype, name=field_name)
elif is_2d_collection(data):
data = data_to_2d_numpy(data, dtype, name=field_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data = data_to_2d_numpy(data, dtype, name=field_name)
data = _data_to_2d_numpy(data, dtype, name=field_name)

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all the suggestions I made were addressed, thanks very much.

This PR should remove a bit of friction for users, really nice addition!

@StrikerRUS
Copy link
Collaborator

Looking forward for other PRs with grad/hess and y_pred in custom objectives.

@StrikerRUS
Copy link
Collaborator

@jmoralez Could you please sync with master to merge recent CI fixes into this branch?

@jmoralez
Copy link
Collaborator Author

Hi. I have covid so I'll be inactive for a couple of weeks, feel free to make any changes to my PRs.

@jameslamb
Copy link
Collaborator

Ah! I'm so sorry to hear that @jmoralez !!!

Don't worry at all about notifications in LightGBM, focus on resting up and hopefully feeling better soon! Let us know in the maintainer Slack whenever you're feeling better, and until then I'll avoid @-ing you.

@StrikerRUS
Copy link
Collaborator

Ah, get well soon!

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[python-package] init_score and data structures in custom functions shape for multiclass classification
3 participants