[python-package] Support 2d collections as input for `init_score` in multiclass classification task #4150

jmoralez · 2021-04-01T04:36:30Z

This intends to solve #4046 by allowing an (n_samples, n_classes) collection to be provided as init_score in lightgbm.LGBMClassifier's fit method.
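For illustration, here is a minimal sketch of what an (n_samples, n_classes) init_score could look like. The helper name `prior_init_score` and the choice of log class priors as starting scores are assumptions made for this example only; they are not part of the PR.

```python
import numpy as np


def prior_init_score(y, n_classes):
    """Build an (n_samples, n_classes) array of log class priors (hypothetical helper)."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    priors = counts / counts.sum()
    # Assumes every class appears at least once, otherwise log(0) yields -inf.
    log_priors = np.log(priors)
    # Repeat the same row of scores for every sample.
    return np.tile(log_priors, (y.shape[0], 1))


y = np.array([0, 1, 2, 2])
init_score = prior_init_score(y, n_classes=3)
print(init_score.shape)  # (4, 3)
```

With the change proposed here, an array of this shape is what would be passed as `init_score` to `lightgbm.LGBMClassifier.fit`, instead of a manually flattened 1-D array.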

I'd also like to investigate the possibility of allowing grad and hess to be 2d collections as well for custom objectives.

I'm opening this to get some feedback on my current approach.

jmoralez · 2021-04-01T04:38:48Z

@StrikerRUS I'd appreciate your feedback on this when you have time.

StrikerRUS · 2021-04-02T14:38:34Z

@jmoralez Thanks a lot for picking this up!
As early feedback, I can say that I support the way you do it. But:

- I'd rather replace try/except with a complex if ***-1d check.
- self.params is not a reliable way to retrieve any info (for example, refer to "Load back saved parameters with save_model to Booster object" #2613). If we were in the Booster class, the number of classes should be obtained via the https://lightgbm.readthedocs.io/en/latest/C-API.html#c.LGBM_BoosterGetNumClasses C API function. But we are in the Dataset class, so I guess we can just blindly flatten the data and let the cpp code deal with any wrong resulting data points.

jmoralez · 2021-04-03T22:51:05Z

Thank you. I've removed the try/except and replaced it with checks for the dimension of the input collection and I flatten without looking for the classes. Please let me know what you think.
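A rough sketch of the approach described above: check the dimension of the input, then flatten without consulting the number of classes. The function names are hypothetical, and the column-major (class-by-class) order is an assumption about the layout the C++ side expects.

```python
import numpy as np


def flatten_init_score(init_score):
    """Return a 1-D float64 array from a 1-D or 2-D collection (hypothetical helper)."""
    arr = np.asarray(init_score, dtype=np.float64)
    if arr.ndim == 1:
        return arr
    if arr.ndim == 2:
        # Assumed layout: all scores for class 0, then class 1, and so on.
        return arr.ravel(order="F")
    raise ValueError(f"init_score must be 1-D or 2-D, got {arr.ndim}-D")


def restore_init_score(flat, n_samples):
    """Reshape a flattened init_score back to (n_samples, n_classes)."""
    flat = np.asarray(flat)
    n_classes = flat.size // n_samples
    if n_classes > 1:
        return flat.reshape((n_samples, n_classes), order="F")
    return flat


scores = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]
flat = flatten_init_score(scores)
restored = restore_init_score(flat, n_samples=2)
```

Because the number of classes is recovered from the array size alone, nothing in this sketch needs to read it from the parameters.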

jmoralez · 2021-06-08T02:52:27Z

@StrikerRUS should I keep going with this?

StrikerRUS · 2021-06-08T22:24:28Z

@jmoralez Sorry I've missed this PR! Yes, I think a more intuitive shape of arguments is a good enhancement. Will get back to this PR soon.

StrikerRUS

As I already said, I think this is a useful enhancement, but it seems it requires deeper integration and more docs updates (see comments below).

python-package/lightgbm/basic.py

tests/python_package_test/test_basic.py

python-package/lightgbm/basic.py

StrikerRUS

@jmoralez
I'm so sorry for the delay again!
I like the PR in its current state and I hope this is the last round of review. Please check my comments below.

StrikerRUS · 2021-08-09T20:23:45Z

tests/python_package_test/test_dask.py

@@ -1583,17 +1583,14 @@ def test_init_score(task, output, cluster):
            'time_out': 5
        }
        init_score = random.random()
-        # init_scores must be a 1D array, even for multiclass classification


I guess we need updates in type hints and docstrings for Dask module.

LightGBM/python-package/lightgbm/dask.py

Line 395 in cfe8eb1

init_score: Optional[_DaskVectorLike] = None,

LightGBM/python-package/lightgbm/dask.py

Line 401 in cfe8eb1

eval_init_score: Optional[List[_DaskCollection]] = None,

LightGBM/python-package/lightgbm/dask.py

Lines 423 to 424 in cfe8eb1

init_score : Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)

Init score of training data.

LightGBM/python-package/lightgbm/dask.py

Lines 442 to 443 in cfe8eb1

eval_init_score : list of Dask Arrays, Dask Series or None, optional (default=None)

Initial model score for each validation set in eval_set.

LightGBM/python-package/lightgbm/dask.py

Line 1024 in cfe8eb1

init_score: Optional[_DaskVectorLike] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1030 in cfe8eb1

eval_init_score: Optional[List[_DaskCollection]] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1162 in cfe8eb1

init_score: Optional[_DaskVectorLike] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1167 in cfe8eb1

eval_init_score: Optional[List[_DaskCollection]] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1195 in cfe8eb1

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

LightGBM/python-package/lightgbm/dask.py

Line 1198 in cfe8eb1

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

LightGBM/python-package/lightgbm/dask.py

Line 1341 in cfe8eb1

init_score: Optional[_DaskVectorLike] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1345 in cfe8eb1

eval_init_score: Optional[List[_DaskCollection]] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1372 in cfe8eb1

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

LightGBM/python-package/lightgbm/dask.py

Line 1375 in cfe8eb1

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

LightGBM/python-package/lightgbm/dask.py

Line 1502 in cfe8eb1

init_score: Optional[_DaskVectorLike] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1507 in cfe8eb1

eval_init_score: Optional[List[_DaskCollection]] = None,

LightGBM/python-package/lightgbm/dask.py

Line 1539 in cfe8eb1

init_score_shape="Dask Array or Dask Series of shape = [n_samples] or None, optional (default=None)",

LightGBM/python-package/lightgbm/dask.py

Line 1542 in cfe8eb1

eval_init_score_shape="list of Dask Arrays or Dask Series or None, optional (default=None)",

Addressed in 2c7ef3c. I think the docstrings maybe ended up a bit too verbose, let me know what you think.

@jmoralez Thanks a lot! I like your explicit wordings.
I'm so sorry, I merged #4558 and introduced conflicts.

Also, I just realized... There shouldn't be any updates to DaskLGBMRegressor's and DaskLGBMRanker's docstrings and type annotations; "(for multi-class task)" is not applicable there.

python-package/lightgbm/basic.py

tests/python_package_test/test_basic.py

StrikerRUS

@jmoralez Many thanks for addressing all previous comments. I don't have any more except one new below and #4150 (comment).

tests/python_package_test/test_basic.py

jmoralez · 2021-09-02T14:37:12Z

python-package/lightgbm/basic.py

@@ -1145,7 +1188,7 @@ def __init__(self, data, label=None, reference=None,
            sum(group) = n_samples.
            For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
            where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
-        init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
+        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task) or None, optional (default=None)


@StrikerRUS do you think this should have a comma before or None? I did include it in the dask docstrings but I just realized this doesn't have it. I'll make them consistent but would like to know what you think is the correct one.

I like how you did it. I believe a comma before or None prevents users from thinking that it is possible to include Nones in a list: #4557 (comment). However, I'm not sure whether it is grammatically correct or not. @jameslamb should know for sure 🙂

I personally prefer `a, b, c, or None, optional (default=None)` (with the `,` before `or None`), but both are equally valid and I don't think you need to change anything.

StrikerRUS

Thank you very much for this PR! LGTM, except one typo below:

python-package/lightgbm/dask.py

StrikerRUS · 2021-09-02T22:41:08Z

@jameslamb Would you like to provide your review?

jameslamb · 2021-09-02T23:19:29Z

> @jameslamb Would you like to provide your review?

yes please, thanks! I can provide a review later tonight

jameslamb

Looks pretty good to me! The logic makes sense and I think the added test (plus the fact that the changes to existing tests were so minimal) gives me extra confidence that this change is working.

I just left some minor comments around two areas:

- When introducing new module-level objects in the Python package that are only intended for internal use, I think we should prefix them with a _ to make that a bit clearer.
- New functions / methods / classes in the Python package should have type hints added, in pursuit of "[python-package] type hints in python package" (#3756). PR authors and reviewers have the most context about the expected types right now, during the PR introducing new code.

jameslamb · 2021-09-03T01:52:21Z

python-package/lightgbm/basic.py

@@ -161,14 +161,24 @@ def is_1d_list(data):
    return isinstance(data, list) and (not data or is_numeric(data[0]))


+def is_1d_collection(data):


Suggested change

-def is_1d_collection(data):
+def _is_1d_collection(data: Any) -> bool:

I think we should be adding type hints for new code when possible, to increase the chance of catching bugs with mypy and reduce the amount of effort needed for #3756.

I'd also like to recommend prefixing objects that we don't want to encourage people to import with _, to make it clearer that they're intended to be internal

jameslamb · 2021-09-03T01:54:44Z

python-package/lightgbm/basic.py

@@ -180,6 +190,39 @@ def list_to_1d_numpy(data, dtype=np.float32, name='list'):
                        "It should be list, numpy 1-D array or pandas Series")


+def is_numpy_2d_array(data):


Suggested change

-def is_numpy_2d_array(data):
+def _is_numpy_2d_array(data: Any) -> bool:

jameslamb · 2021-09-03T01:55:18Z

python-package/lightgbm/basic.py

+    return isinstance(data, np.ndarray) and len(data.shape) == 2 and data.shape[1] > 1
+
+
+def is_2d_list(data):


Suggested change

-def is_2d_list(data):
+def _is_2d_list(data: Any) -> bool:

python-package/lightgbm/basic.py

jameslamb · 2021-09-03T01:55:51Z

python-package/lightgbm/basic.py

+    return isinstance(data, list) and len(data) > 0 and is_1d_list(data[0])
+
+
+def is_2d_collection(data):


Suggested change

-def is_2d_collection(data):
+def _is_2d_collection(data: Any) -> bool:

jameslamb · 2021-09-03T01:57:17Z

python-package/lightgbm/basic.py

+    )
+
+
+def data_to_2d_numpy(data, dtype=np.float32, name='list'):


Suggested change

-def data_to_2d_numpy(data, dtype=np.float32, name='list'):
+def _data_to_2d_numpy(data, dtype=np.float32, name='list'):

Could you also add type hints here?

jameslamb · 2021-09-03T01:58:45Z

python-package/lightgbm/basic.py

+
+def data_to_2d_numpy(data, dtype=np.float32, name='list'):
+    """Convert data to numpy 2-D array."""
+    if is_numpy_2d_array(data):


Suggested change

-if is_numpy_2d_array(data):
+if _is_numpy_2d_array(data):

jameslamb · 2021-09-03T01:59:21Z

python-package/lightgbm/basic.py

            dtype = np.float64
-        data = list_to_1d_numpy(data, dtype, name=field_name)
+            if is_1d_collection(data):


Suggested change

-if is_1d_collection(data):
+if _is_1d_collection(data):

jameslamb · 2021-09-03T01:59:28Z

python-package/lightgbm/basic.py

-        data = list_to_1d_numpy(data, dtype, name=field_name)
+            if is_1d_collection(data):
+                data = list_to_1d_numpy(data, dtype, name=field_name)
+            elif is_2d_collection(data):


Suggested change

-elif is_2d_collection(data):
+elif _is_2d_collection(data):

jameslamb · 2021-09-03T01:59:48Z

python-package/lightgbm/basic.py

+            if is_1d_collection(data):
+                data = list_to_1d_numpy(data, dtype, name=field_name)
+            elif is_2d_collection(data):
+                data = data_to_2d_numpy(data, dtype, name=field_name)


Suggested change

-data = data_to_2d_numpy(data, dtype, name=field_name)
+data = _data_to_2d_numpy(data, dtype, name=field_name)

jameslamb

Looks like all the suggestions I made were addressed, thanks very much.

This PR should remove a bit of friction for users, really nice addition!

StrikerRUS · 2021-09-09T19:22:07Z

Looking forward to other PRs with grad/hess and y_pred in custom objectives.

StrikerRUS · 2021-09-12T11:25:48Z

@jmoralez Could you please sync with master to merge recent CI fixes into this branch?

jmoralez · 2021-09-13T00:56:05Z

Hi. I have covid so I'll be inactive for a couple of weeks, feel free to make any changes to my PRs.

jameslamb · 2021-09-13T14:35:27Z

Ah! I'm so sorry to hear that @jmoralez !!!

Don't worry at all about notifications in LightGBM, focus on resting up and hopefully feeling better soon! Let us know in the maintainer Slack whenever you're feeling better, and until then I'll avoid @-ing you.

StrikerRUS · 2021-09-15T22:36:52Z

Ah, get well soon!

github-actions · 2023-08-23T16:25:48Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

initial implementation of init_score for multiclass classification

d387d6c

jameslamb added feature in progress labels Apr 1, 2021

jmoralez added 2 commits April 3, 2021 16:45

check for 1d or 2d collection in init_score

75bb7ef

remove dataset import

cd9a41c

StrikerRUS reviewed Jun 14, 2021

jmoralez added 3 commits June 23, 2021 18:52

initial comments

5644295

merge master

0edd5e7

update dask test and docstrings

e11a44c

StrikerRUS reviewed Jun 26, 2021

python-package/lightgbm/basic.py (outdated)

update docstrings

a6b1744

StrikerRUS reviewed Jul 9, 2021

python-package/lightgbm/basic.py (outdated)

jmoralez added 2 commits July 12, 2021 20:21

move logic to set_field. reshape back on get_field

ad44959

merge master

6222521

StrikerRUS requested changes Aug 9, 2021

StrikerRUS changed the title ~~WIP: [python-package] Support 2d collections as input for multiclass classification~~ WIP: [python-package] Support 2d collections as input for init_score in multiclass classification task Aug 9, 2021

jmoralez added 2 commits August 24, 2021 22:00

add type hints and update docstrings for dask. fix Dataset.set_field

2c7ef3c

Merge branch 'master' into feature/multiclass-init_score

d3b763f

jmoralez changed the title ~~WIP: [python-package] Support 2d collections as input for init_score in multiclass classification task~~ [python-package] Support 2d collections as input for init_score in multiclass classification task Aug 25, 2021

jmoralez marked this pull request as ready for review August 25, 2021 03:02

jmoralez requested review from chivee, henry0312 and shiyu1994 as code owners August 25, 2021 03:02

StrikerRUS reviewed Sep 2, 2021

tests/python_package_test/test_basic.py (outdated)

revert wrong docstrings and type hints

1676e55

jmoralez added 2 commits September 1, 2021 20:53

merge master

16f0b9d

add extra comma for consistency

9bb4454

jmoralez commented Sep 2, 2021

StrikerRUS approved these changes Sep 2, 2021

python-package/lightgbm/dask.py (outdated)

jameslamb added awaiting review and removed in progress labels Sep 2, 2021

jameslamb requested changes Sep 3, 2021

jameslamb removed the awaiting review label Sep 4, 2021

jmoralez added 3 commits September 4, 2021 21:22

prefix private functions with underscore

a3e890d

add type hints to new functions make commas consistent in dask and basic

add missing spaces after type hint

94aa6a9

remove shape condition for dataframe in is_2d_collection

9fb7f2b

jmoralez linked an issue Sep 6, 2021 that may be closed by this pull request

[python-package] init_score and data structures in custom functions shape for multiclass classification #4046

Closed

StrikerRUS added the awaiting review label Sep 9, 2021

StrikerRUS requested a review from jameslamb September 9, 2021 16:14

jameslamb approved these changes Sep 9, 2021

Merge branch 'master' into feature/multiclass-init_score

91e7429

StrikerRUS removed the awaiting review label Sep 15, 2021

Merge branch 'master' into feature/multiclass-init_score

7822810

StrikerRUS merged commit f1f5ba1 into microsoft:master Sep 17, 2021

StrikerRUS mentioned this pull request Sep 17, 2021

[python-package] init_score and data structures in custom functions shape for multiclass classification #4046

Closed

jmoralez deleted the feature/multiclass-init_score branch September 20, 2021 14:33

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
