When the model encounters missing values during prediction after being trained without any missing data, it predicts these examples without logging a warning or raising an exception (see [this issue](#2921)):
- Missing numerical values are set to zero
- Missing categorical values are sent to the right leaf
As a result, several or all features could be missing and the model would still return a prediction (of unknown quality).
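A minimal sketch of the behavior (data and parameters are made up for illustration): training on fully observed data and then predicting on rows containing `NaN` silently succeeds.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))        # fully observed training data
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = lgb.LGBMClassifier(n_estimators=50, verbose=-1).fit(X_train, y_train)

X_prod = X_train[:5].copy()
X_prod[:, 0] = np.nan                       # a feature goes missing in "production"
print(model.predict(X_prod))                # predictions returned, no warning logged
```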
I propose to at least log a warning, and to allow the model to be configured in a strict mode where unexpected missing values raise an exception (I would argue this should be the default, but that may not be feasible for backward-compatibility reasons).
## Motivation
Changing the current behavior is important for using LightGBM in production. With a train/test split, missing data in the test set is easy to recognize, and a mismatch between train and test is less likely than a mismatch between training and production data.

In production, data or code bugs can lead to one or more features being missing. In my experience, bugs that change the data happen as commonly as any other kind of bug.

The current behavior silently imputes these values to zero (numerical case) or assigns them to an existing leaf (categorical case). The model would silently misbehave, and this could be hard to detect, especially if the bug exists only on the inference side and not in the training data (which is typical when the data does not come from a shared feature store).
## Description
Throw an exception when missing values are seen during inference but were not seen during training. Value imputation should probably be done before calling the model, so I propose making the exception the default behaviour. You might not agree (hence the current implementation), so perhaps logging a warning could be offered as an option?
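Until something along these lines exists in the library, the strict mode can only be approximated on the caller's side. A hypothetical sketch (the wrapper and its names are illustrative, not part of the LightGBM API):

```python
import numpy as np

def predict_strict(model, X, allow_nan_columns=()):
    """Raise instead of silently predicting when unexpected NaNs are present."""
    X = np.asarray(X, dtype=float)
    nan_cols = np.flatnonzero(np.isnan(X).any(axis=0))
    unexpected = [c for c in nan_cols if c not in set(allow_nan_columns)]
    if unexpected:
        raise ValueError(f"unexpected missing values in feature columns {unexpected}")
    return model.predict(X)
```

`allow_nan_columns` would be populated with the columns that actually contained missing values at training time, which is exactly the bookkeeping the library itself is best placed to do.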
It would also be great to improve the documentation for the `use_missing=false` flag:

> set this to false to disable the special handle of missing value

The doc string doesn't explain what is done instead, during training and inference, when the special missing-value handling is disabled.
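For reference, the flag is an ordinary parameter (synthetic data, shown only to make the discussion concrete); what LightGBM does with `NaN` once it is disabled is precisely what the docs leave open:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# use_missing=False disables the special missing-value handling;
# the documented description stops there.
booster = lgb.train({"objective": "binary", "use_missing": False, "verbose": -1},
                    lgb.Dataset(X, label=y))
```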
Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
## References
#2921
https://lightgbm.readthedocs.io/en/latest/Parameters.html