[dask] pass additional predict() parameters through when input is a Dask Array #4399

jameslamb · 2021-06-23T04:16:35Z

.predict() in the Dask estimators allows users to pass additional prediction parameters (https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict-parameters) through **kwargs. To be applied correctly, those **kwargs have to be passed through several layers of function calls inside the package.

One of those pass-throughs is currently missing, and as a result additional prediction parameters will be silently ignored when data passed to .predict() is a Dask Array.

This PR proposes fixing that and adding tests confirming that additional parameters are being passed through correctly.
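To make the failure mode concrete, here is a minimal sketch of how one missing pass-through can silently drop keyword arguments. It is not the actual `lightgbm.dask` code: `predict_partition`, `predict_buggy`, `predict_fixed`, and the `kind` argument are hypothetical names invented for this illustration, and plain Python lists stand in for Dask collections.

```python
def predict_partition(rows, raw_score=False):
    """Hypothetical per-partition predictor: raw sums, or 0/1 labels by default."""
    scores = [sum(row) for row in rows]
    return scores if raw_score else [int(s > 0) for s in scores]


def predict_buggy(partitions, kind, **kwargs):
    """Forwards kwargs on one branch only (the bug pattern described above)."""
    if kind == "dataframe":
        return [predict_partition(p, **kwargs) for p in partitions]
    # The array branch forgets **kwargs, so raw_score=True is silently ignored.
    return [predict_partition(p) for p in partitions]


def predict_fixed(partitions, kind, **kwargs):
    """Forwards kwargs on every branch."""
    return [predict_partition(p, **kwargs) for p in partitions]


parts = [[[1.0, 2.0]], [[-3.0, 1.0]]]
print(predict_buggy(parts, "array", raw_score=True))  # labels, not raw scores
print(predict_fixed(parts, "array", raw_score=True))  # raw scores
```

No error is raised in the buggy version, which is why a test that compares outputs with and without the extra parameter is the practical way to catch this.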

Notes for Reviewers

I looked at lightgbm.basic._InnerPredictor.predict() for the names of specific parameters to be passed through.

LightGBM/python-package/lightgbm/basic.py

Lines 661 to 663 in bd21efe

    def predict(self, data, start_iteration=0, num_iteration=-1,
                raw_score=False, pred_leaf=False, pred_contrib=False, data_has_header=False,
                is_reshape=True):

StrikerRUS

Very nice catch! Just a suggestion for a more descriptive variable name below.
What about early stopping for prediction? Is it supported in Dask?
https://lightgbm.readthedocs.io/en/latest/Parameters.html#pred_early_stop

Some examples of corresponding tests for non-Dask estimators:

LightGBM/tests/python_package_test/test_sklearn.py

Lines 593 to 599 in d517ba1

    # Tests other parameters for the prediction works
    res_engine = gbm.predict(X_test)
    res_sklearn_params = clf.predict_proba(X_test,
                                           pred_early_stop=True,
                                           pred_early_stop_margin=1.0)
    with pytest.raises(AssertionError):
        np.testing.assert_allclose(res_engine, res_sklearn_params)

LightGBM/tests/python_package_test/test_sklearn.py

Lines 627 to 633 in d517ba1

    # Tests other parameters for the prediction works, starting from iteration 10
    res_engine = gbm.predict(X_test, start_iteration=10)
    res_sklearn_params = clf.predict_proba(X_test,
                                           pred_early_stop=True,
                                           pred_early_stop_margin=1.0, start_iteration=10)
    with pytest.raises(AssertionError):
        np.testing.assert_allclose(res_engine, res_sklearn_params)

LightGBM/tests/python_package_test/test_engine.py

Lines 451 to 462 in c738c83

    pred_parameter = {"pred_early_stop": True,
                      "pred_early_stop_freq": 5,
                      "pred_early_stop_margin": 1.5}
    ret = multi_logloss(y_test, gbm.predict(X_test, **pred_parameter))
    assert ret < 0.8
    assert ret > 0.6  # loss will be higher than when evaluating the full model
    pred_parameter = {"pred_early_stop": True,
                      "pred_early_stop_freq": 5,
                      "pred_early_stop_margin": 5.5}
    ret = multi_logloss(y_test, gbm.predict(X_test, **pred_parameter))
    assert ret < 0.2

jameslamb · 2021-06-23T23:50:39Z

> What about early stopping for prediction? Is it supported in Dask?

I don't understand what "early stopping for prediction" actually means, can you explain it to me? The parameter descriptions at https://lightgbm.readthedocs.io/en/latest/Parameters.html#pred_early_stop are just short phrases using the same words as the parameter name (e.g. pred_early_stop_margin = "the threshold of margin in early-stopping prediction"), and I don't understand from the unit tests linked to in #4399 (review) what that functionality actually does.

I understand that early stopping for training means "stop the boosting process if performance on a validation set fails to improve", but I don't understand what early stopping means when you're generating predictions.

StrikerRUS · 2021-06-24T19:48:46Z

I guess it is something like "stop accumulating the predictions of individual trees into the final prediction once individual contributions become insignificant".
Here is an explanation from the original author that I just found: #565 (comment).
Original PR: #550.
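As a rough illustration of that idea, the sketch below accumulates per-tree raw scores for a single row and stops once the gap between the best and second-best class exceeds a margin, checking only every few trees. This is a simplified model written for this discussion, not LightGBM's implementation: the function name `accumulate_with_early_stop` and its arguments `margin` and `freq` are hypothetical, loosely mirroring `pred_early_stop_margin` and `pred_early_stop_freq`.

```python
def accumulate_with_early_stop(tree_scores, margin, freq):
    """Sum per-tree raw scores for one row, stopping once the leading class is clear.

    tree_scores: list of per-tree contributions, each a list with one value
        per class (at least two classes are assumed).
    margin: stop when (best score - second-best score) exceeds this value.
    freq: evaluate the stopping condition only every `freq` trees.
    Returns (accumulated scores, number of trees used).
    """
    totals = [0.0] * len(tree_scores[0])
    used = 0
    for i, contribution in enumerate(tree_scores, start=1):
        totals = [t + c for t, c in zip(totals, contribution)]
        used = i
        if i % freq == 0:
            second, best = sorted(totals)[-2:]
            if best - second > margin:
                break
    return totals, used


# Ten identical toy "trees", each pushing class 0 up by 1.0.
trees = [[1.0, 0.0, 0.0]] * 10
print(accumulate_with_early_stop(trees, margin=1.5, freq=2))    # stops after 2 trees
print(accumulate_with_early_stop(trees, margin=100.0, freq=2))  # uses all 10 trees
```

Under this model, a larger margin makes the prediction use more trees and therefore land closer to the full model's output, which is consistent with the `test_engine.py` snippet above, where the loss bound is tighter for margin 5.5 than for margin 1.5.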

jameslamb · 2021-06-24T23:42:34Z

ahhh I see, interesting! Ok I can add calls with prediction early stopping to the tests in these PRs.

From #550, the tests you linked in #4399 (review), and my own investigation, it seems that ~~that parameter~~ those parameters will only have an effect for classification objectives, so I'll only add them to the classifier tests.

jameslamb · 2021-06-25T02:03:54Z

Added a test with prediction early stopping in dccf44e

StrikerRUS

Thanks for the fix!

StrikerRUS · 2021-06-26T13:00:08Z

tests/python_package_test/test_dask.py

+        p1_early_stop_raw = dask_classifier.predict(
+            dX,
+            pred_early_stop=True,
+            pred_early_stop_margin=1.0,
+            pred_early_stop_freq=2,
+            raw_score=True
+        )


Just curious: why does this particular line not end with .compute()?

jameslamb · 2021-06-27

Just an oversight, there should be a .compute(). I've opened #4412 to add it.

github-actions · 2023-08-23T19:18:17Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added 2 commits June 22, 2021 22:37

[dask] pass predict() kwargs through when input is a Dask Array

80aed4e

add tests

7e8d3a7

jameslamb added the fix label Jun 23, 2021

jameslamb requested a review from StrikerRUS June 23, 2021 04:16

jameslamb changed the title ~~[dask] pass predict() kwargs through when input is a Dask Array~~ [dask] pass additional predict() parameters through when input is a Dask Array Jun 23, 2021

jameslamb commented Jun 23, 2021

tests/python_package_test/test_dask.py (outdated, resolved)

StrikerRUS reviewed Jun 23, 2021

Apply suggestions from code review

41da3fd

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

jameslamb added 2 commits June 24, 2021 20:49

Merge branch 'master' into fix/dask-predict-kwargs

0e9ec96

add prediction early stopping params

dccf44e

jameslamb requested a review from StrikerRUS June 25, 2021 02:03

jameslamb added the awaiting review label Jun 25, 2021

StrikerRUS approved these changes Jun 26, 2021

StrikerRUS removed the awaiting review label Jun 26, 2021

StrikerRUS merged commit 8116d88 into master Jun 26, 2021

StrikerRUS deleted the fix/dask-predict-kwargs branch June 26, 2021 13:01

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

