When the model encounters missing values during prediction after being trained without any missing data, it predicts these examples without logging a warning or raising an exception (see [this issue](#2921)):
- Missing numerical values are set to zero
- Missing categorical values are sent to the right leaf
As a result, several or all features could be missing and the model would still return a prediction (of unknown quality).
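A minimal sketch of the behavior (data and parameters are made up for illustration): training on fully observed data and then predicting on rows containing `NaN` silently succeeds.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))        # fully observed training data
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = lgb.LGBMClassifier(n_estimators=50, verbose=-1).fit(X_train, y_train)

X_prod = X_train[:5].copy()
X_prod[:, 0] = np.nan                       # a feature goes missing in "production"
print(model.predict(X_prod))                # predictions returned, no warning logged
```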
I propose to at least log a warning, and to allow the model to be configured in a strict mode where unexpected missing values raise an exception (I would argue this should be the default, but that may not be feasible for backward-compatibility reasons).
## Motivation
Changing the current behavior is important for using LightGBM in production. With a train/test split, missing data in the test set is easy to recognize, and a mismatch between train and test is less likely than a mismatch between training and production data.

In production, data or code bugs can lead to one or more features being missing. In my experience, bugs that change the data happen as commonly as any other kind of bug.

The current behavior silently imputes these values to zero (numerical case) or assigns them to an existing leaf (categorical case). The model would silently misbehave, and this could be hard to detect, especially if the bug exists only on the inference side and not in the training data (which is typical when the data does not come from a shared feature store).
## Description
Throw an exception when missing values are seen during inference but were not seen during training. Value imputation should probably be done before calling the model, so I propose making the exception the default behaviour. You might not agree (hence the current implementation), so perhaps logging a warning could be offered as an option?
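Until something along these lines exists in the library, the strict mode can only be approximated on the caller's side. A hypothetical sketch (the wrapper and its names are illustrative, not part of the LightGBM API):

```python
import numpy as np

def predict_strict(model, X, allow_nan_columns=()):
    """Raise instead of silently predicting when unexpected NaNs are present."""
    X = np.asarray(X, dtype=float)
    nan_cols = np.flatnonzero(np.isnan(X).any(axis=0))
    unexpected = [c for c in nan_cols if c not in set(allow_nan_columns)]
    if unexpected:
        raise ValueError(f"unexpected missing values in feature columns {unexpected}")
    return model.predict(X)
```

`allow_nan_columns` would be populated with the columns that actually contained missing values at training time, which is exactly the bookkeeping the library itself is best placed to do.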
It would also be great to improve the documentation for the `use_missing=false` flag:

> set this to false to disable the special handle of missing value

The doc string doesn't explain what is done instead, during training and inference, when the special missing-value handling is disabled.
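For reference, the flag is an ordinary parameter (synthetic data, shown only to make the discussion concrete); what LightGBM does with `NaN` once it is disabled is precisely what the docs leave open:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# use_missing=False disables the special missing-value handling;
# the documented description stops there.
booster = lgb.train({"objective": "binary", "use_missing": False, "verbose": -1},
                    lgb.Dataset(X, label=y))
```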
Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
## References
#2921
https://lightgbm.readthedocs.io/en/latest/Parameters.html