
[python] raise an informative error instead of segfaulting when custom objective produces incorrect output #4815

Merged (22 commits) on Dec 30, 2021

Conversation

@yaxxie (Contributor) commented Nov 19, 2021

Running the following code can cause a segfault because LGBM_BoosterUpdateOneIterCustom assumes that the passed float arrays match the length of the training data.

import lightgbm
import numpy

X = numpy.random.randn(10_000_000, 5)
Y = numpy.random.choice([0, 1], 10_000_000)

ds = lightgbm.Dataset(X, Y)

def bad_grads(x, y):
    return numpy.random.randn(2), numpy.random.rand(2)

lightgbm.train({}, ds, fobj=bad_grads)

A faulty fobj function can cause, at best, totally incorrect boosting and, at worst, a segmentation fault. This small patch prevents this from occurring. I noticed it while adding support for LGBM_BoosterUpdateOneIterCustom to the Julia package (see https://github.com/IQVIA-ML/LightGBM.jl/pull/114/files).

@StrikerRUS (Collaborator)

@yaxxie Thanks a lot for this PR!

making an assumption that the passed float arrays match the length of the training data.

This assumption is wrong. grad and hess should have length n_samples * num_model_per_iteration. num_model_per_iteration equals 1 for regression and binary classification, and the number of classes for multiclass classification.
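To make the expected lengths concrete, here is a small sketch (illustrative variable names only, not LightGBM API code):

```python
n_samples = 100
num_model_per_iteration = 3  # 1 for regression/binary, n_classes for multiclass

# A custom objective must return gradient and Hessian arrays of this length:
expected_len = n_samples * num_model_per_iteration
grad = [0.0] * expected_len
hess = [1.0] * expected_len
assert len(grad) == len(hess) == 300
```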

Also, what do you think about moving this check to cpp side

bool TrainOneIter(const score_t* gradients, const score_t* hessians) override {

bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {

so that all language packages will benefit from it?

@yaxxie (Contributor, author) commented Nov 20, 2021

Thanks @StrikerRUS, I figured the assumption would be wrong when I saw some failing tests.

If we move this to the CPP side, it would require an API change, right?

@StrikerRUS (Collaborator)

If we move this to the CPP side, it would require an API change, right?

Sorry, could you please clarify which API changes you mean?

I believe something like

int64_t total_size = static_cast<int64_t>(num_data_) * num_tree_per_iteration_;
CHECK_EQ(hessians.size(), gradients.size());
CHECK_EQ(hessians.size(), total_size);

would be enough.

@yaxxie (Contributor, author) commented Nov 20, 2021

Happy to try this out, but I can't see how it would work. When the data array comes from outside the lib, how do we know its size?

@StrikerRUS (Collaborator)

When the data array comes from outside the lib, how do we know its size?

Sorry, didn't get your question.

@yaxxie (Contributor, author) commented Nov 23, 2021

The gradients and hessians variables are raw arrays of float, not std::vector with a size() method. This makes sense because sometimes they are generated internally (so the gradients_ and hessians_ members are std::vector), but when they come from outside the library (as pointers passed to LGBM_BoosterUpdateOneIterCustom) we do not, and cannot, know their sizes.

The other C-API entry points pass size variables where they need to know them (such as when we construct from a mat and need a size variable there, or when we retrieve string names and need to know how much was allocated to the underlying buffer so as not to write past allocated memory).

This is why the API can segfault when passed incorrectly sized gradients: an underlying assumption is made about the size of the allocated buffers which is not (and cannot be) verified by the library itself. We could have the user pass the sizes to the API call, but this would still require changes in all language implementations.

I'm happy for someone to point out where I've gone wrong with the above.

(I tried the code sample provided; it won't compile because a float* doesn't have a size() method.)

/home/yaxattax/git/LightGBM/include/LightGBM/utils/log.h:35:9: note: in definition of macro ‘CHECK’
   35 |   if (!(condition))                                                         \
      |         ^~~~~~~~~
/home/yaxattax/git/LightGBM/src/boosting/gbdt.cpp:385:3: note: in expansion of macro ‘CHECK_EQ’
  385 |   CHECK_EQ(hessians.size(), gradients.size());
      |   ^~~~~~~~
/home/yaxattax/git/LightGBM/src/boosting/gbdt.cpp:386:21: error: request for member ‘size’ in ‘hessians’, which is of non-class type ‘const score_t*’ {aka ‘const float*’}
  386 |   CHECK_EQ(hessians.size(), total_size);
      |                     ^~~~

@shiyu1994 (Collaborator)

@yaxxie Thanks for working on this! Yes, I think there should be a change in the C API. Currently the C API only accepts the pointers to the gradients and hessians. If we want to know the length of an array allocated outside the C API, we must add new parameters. But a change in the C API may require changes in the code of every language package.

So maybe doing the check on the Python and R side is preferable. @StrikerRUS WDYT.

@yaxxie (Contributor, author) commented Nov 28, 2021

So maybe doing the check on the Python and R side is preferable.

We'd also want to update the docs for C-API to make it explicit that there is a length expectation for the array of floats passed.

@StrikerRUS (Collaborator)

@yaxxie Sorry, I didn't notice what type gradients and hessians actually are. I confused them with gradients_ and hessians_.

@shiyu1994

If we want to know the length of an array allocated outside the C API, we must add new parameters. But a change in the C API may require changes in the code of every language package.

Can we use the fact that those arrays should always have length static_cast<int64_t>(num_data_) * num_tree_per_iteration_? I believe new parameters for the actual lengths make little sense here, as we already know their size in the correct API-usage scenario.

So maybe doing the check in the Python and R side is preferable.

If we cannot do any checks with raw pointers, I'm OK with this way.

@shiyu1994 (Collaborator)

Can we use the fact that those arrays should always be the length of static_cast<int64_t>(num_data_) * num_tree_per_iteration_?

@StrikerRUS I think the fix itself is what guarantees this. To check the size on the C API side, we must know the exact length of the array allocated on the Python side.

@StrikerRUS (Collaborator) left a comment

Please consider checking some of my minor suggestions below:

python-package/lightgbm/basic.py (three resolved review suggestions)
yaxxie and others added 2 commits November 30, 2021 23:48
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@yaxxie (Contributor, author) commented Dec 1, 2021

@StrikerRUS Thanks, will get around to doing these properly soon.
I want to add a simple test, and I also think it would be worthwhile to update the C-API documentation to state this requirement on the caller -- could you point me to the right place to make this change?

@StrikerRUS (Collaborator)

@yaxxie Thank you!
Basic tests in which you call the C API directly should be added in the following file:
https://github.com/microsoft/LightGBM/blob/master/tests/c_api_test/test_.py
For basic Python tests where you don't use the train() or cv() functions from the engine.py file, you should use this file:
https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_basic.py
If you want to add cpp tests, they should go somewhere in this folder:
https://github.com/microsoft/LightGBM/tree/master/tests/cpp_tests

As for modifying the C API docs, edit the comments in Doxygen format in this file:
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/c_api.h

yaxxie and others added 2 commits December 16, 2021 18:03
@yaxxie (Contributor, author) commented Dec 21, 2021

@StrikerRUS Anything else needed?

@StrikerRUS (Collaborator)

@jameslamb @shiyu1994 Would you like to be a second reviewer for this PR?

@jameslamb (Collaborator) left a comment

Thanks very much for this!

I followed the conversation with @StrikerRUS and @shiyu1994 and understand why we've chosen to do this check on the Python / R side instead of in C/C++.

Please see two small suggestions to make the tests slightly stricter. Would you please also write up a feature request at https://github.com/microsoft/LightGBM/issues documenting the need to do this same work for the R package?

bad_bst_multi = lgb.Booster({'objective': "none", "num_class": len(classes)}, ds_multiclass)
good_bst_multi = lgb.Booster({'objective': "none", "num_class": len(classes)}, ds_multiclass)
good_bst_binary.update(fobj=_good_gradients)
with pytest.raises(ValueError):
Collaborator comment:

Suggested change
with pytest.raises(ValueError):
with pytest.raises(ValueError, match="number of models per one iteration (1)"):

Could you please use match to look for a specific error message? That way, this test won't silently pass if a change results in lightgbm raising a different, unrelated ValueError.
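For context, pytest applies the match pattern with re.search against the string form of the raised exception, so an unrelated ValueError fails the test instead of silently passing. A sketch of that behavior using only the standard library:

```python
import re

def raises_matching(func, pattern):
    # Mimics pytest.raises(ValueError, match=pattern): succeeds only if a
    # ValueError is raised AND its message matches re.search(pattern, msg).
    try:
        func()
    except ValueError as exc:
        return re.search(pattern, str(exc)) is not None
    return False

def bad_update():
    raise ValueError("number of models per one iteration (1) is wrong")

assert raises_matching(bad_update, "number of models")
assert not raises_matching(bad_update, "some unrelated message")
```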

with pytest.raises(ValueError):
bad_bst_binary.update(fobj=_bad_gradients)
good_bst_multi.update(fobj=_good_gradients)
with pytest.raises(ValueError):
Collaborator comment:

Suggested change
with pytest.raises(ValueError):
with pytest.raises(ValueError, match="number of models per one iteration (3)"):

@jameslamb jameslamb changed the title fix for bad grads causing segfault [python] raise an informative error instead of segfaulting when custom objective produces incorrect output Dec 22, 2021
@jameslamb (Collaborator)

@yaxxie @StrikerRUS I just changed the title of this PR to hopefully be a bit more informative for the purposes of release notes

@yaxxie (Contributor, author) commented Dec 22, 2021

@jameslamb I opened #4905 and pushed commit to address your remarks. Please do let me know if anything else is required.

@jameslamb (Collaborator) left a comment

Thanks for the test changes and for opening #4905! Please see a few more small suggestions.

bad_bst_multi = lgb.Booster({'objective': "none", "num_class": len(classes)}, ds_multiclass)
good_bst_multi = lgb.Booster({'objective': "none", "num_class": len(classes)}, ds_multiclass)
good_bst_binary.update(fobj=_good_gradients)
with pytest.raises(ValueError, match="number of models per one iteration \(1\)"):
Collaborator comment:

Suggested change
with pytest.raises(ValueError, match="number of models per one iteration \(1\)"):
with pytest.raises(ValueError, match="number of models per one iteration \\(1\\)"):

Please see these linting errors from https://github.com/microsoft/LightGBM/runs/4613473477?check_suite_focus=true

./tests/python_package_test/test_basic.py:604:78: W605 invalid escape sequence '('
./tests/python_package_test/test_basic.py:604:81: W605 invalid escape sequence ')'
./tests/python_package_test/test_basic.py:607:79: W605 invalid escape sequence '('
./tests/python_package_test/test_basic.py:607:95: W605 invalid escape sequence ')'

Collaborator comment:

Please don't escape any symbols, for readability purposes. Just add a * symbol at the end:

Suggested change
with pytest.raises(ValueError, match="number of models per one iteration \(1\)"):
with pytest.raises(ValueError, match="number of models per one iteration (1) *"):

Contributor (author) comment:

The escape is necessary; ( and ) are characters which mean something to the regular expression engine. I'll switch to using re.escape.
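For reference, re.escape backslash-escapes regex metacharacters so the parentheses in the message match literally:

```python
import re

msg = "number of models per one iteration (1)"
pattern = re.escape(msg)  # '(' and ')' become '\(' and '\)'

# The escaped pattern matches the literal message inside a longer error string:
assert re.search(pattern, f"Lengths don't match: {msg}") is not None
# Unescaped, '(1)' would be parsed as a capture group matching the bare text '1'.
```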

with pytest.raises(ValueError, match="number of models per one iteration \(1\)"):
bad_bst_binary.update(fobj=_bad_gradients)
good_bst_multi.update(fobj=_good_gradients)
with pytest.raises(ValueError, match=f"number of models per one iteration \({len(classes)}\)"):
Collaborator comment:

Suggested change
with pytest.raises(ValueError, match=f"number of models per one iteration \({len(classes)}\)"):
with pytest.raises(ValueError, match=f"number of models per one iteration \\({len(classes)}\\)"):

Comment on lines 593 to 596
X = np.random.randn(100, 5)
y_binary = np.random.choice([0, 1], 100)
classes = [0, 1, 2]
y_multiclass = np.random.choice(classes, 100)
Collaborator comment:

Suggested change
X = np.random.randn(100, 5)
y_binary = np.random.choice([0, 1], 100)
classes = [0, 1, 2]
y_multiclass = np.random.choice(classes, 100)
X = np.random.randn(100, 5)
y_binary = np.array([0] * 50 + [1] * 50)
classes = [0, 1, 2]
y_multiclass = np.array([0] * 33 + [1] * 33 + [2] * 34)

Sorry, just thought of this... can you please remove the randomness from this data construction? Since you're using completely random data and not testing the produced models, the values don't really matter.

Choosing randomly and using such a small amount of data makes it possible that these tests could fail randomly due to situations like "y_binary is all 0s". It may seem like a small probability, but consider that the Python tests run about 40 times on every commit to every pull request in this project.

@yaxxie (Contributor, author) commented Dec 23, 2021

@jameslamb let me know if a946d28 addresses your concerns

@jameslamb (Collaborator) left a comment

looks ok to me, thanks very much for the help!

tests/python_package_test/test_basic.py (two resolved review suggestions)
@@ -572,6 +572,9 @@ LIGHTGBM_C_EXPORT int LGBM_BoosterRefit(BoosterHandle handle,
/*!
* \brief Update the model by specifying gradient and Hessian directly
* (this can be used to support customized loss functions).
* \note
* The length of the arrays referenced by ``grad`` and ``hess`` must be equal to
* ``num_class * num_train_data``, this is not verified by the library, the caller must ensure this.
Collaborator comment:

Should this be "this IS verified by the library", or should we simply delete this last sentence? We are actually verifying this through this pull request.

Contributor (author) comment:

This is the C-API docs -- the context here is the lightgbm.so library rather than the Python or R libraries. What we're saying is that a caller of LGBM_BoosterUpdateOneIterCustom is responsible for ensuring that the condition is met. I'm happy to tweak the wording, but what the Python library does as a convenience to the user is not applicable here.

@StrikerRUS (Collaborator) left a comment

Thanks for fixing custom objective function signature in recent commit.

@github-actions (bot)
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

4 participants