[python package]: suggestion: lgb.Booster.predict() should check that the input X data makes sense #812

Closed
j-mark-hou opened this issue Aug 9, 2017 · 9 comments · Fixed by #4909

Comments

@j-mark-hou
Contributor

j-mark-hou commented Aug 9, 2017

In particular, I'm thinking about these things:

  1. if the input is an np.array, check that the number of columns is the same as the number of features the lgb.Booster object uses. if not, throw a warning.
  2. if the input is a pd.DataFrame object, check that the feature_names of the lgb.Booster object are a subset of the columns of the pd.DataFrame
  3. if feature names in the booster object are repeated, or if column names in the pd.DataFrame are repeated, fall back to check 1.

if these things sound reasonable I'd be happy to add these checks to the lgb.Booster.predict() function prior to calling the lgb._InnerPredictor.predict()
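
A minimal sketch of what such checks could look like, written as a standalone helper rather than as LightGBM code. The name `check_predict_input` and its return convention are hypothetical. It assumes a DataFrame is acceptable when it contains every feature the booster expects, and it only relies on the input having a 2-D `shape` and, for DataFrame-like inputs, a `columns` attribute.

```python
import warnings


def check_predict_input(X, feature_names):
    """Hypothetical pre-predict sanity check (not part of LightGBM's API).

    X: any object with a 2-D ``shape`` (e.g. np.ndarray) and, optionally,
       a ``columns`` attribute (e.g. pd.DataFrame).
    feature_names: the feature names the trained booster expects.
    Returns True if the input looks consistent; warns and returns False otherwise.
    """
    n_expected = len(feature_names)
    columns = getattr(X, "columns", None)
    column_list = list(columns) if columns is not None else None

    # Names are only usable when neither side contains duplicates;
    # otherwise matching by name is ambiguous.
    names_usable = (
        column_list is not None
        and len(set(column_list)) == len(column_list)
        and len(set(feature_names)) == n_expected
    )

    if names_usable:
        # Name-based check: every model feature must be present in the input.
        present = set(column_list)
        missing = [name for name in feature_names if name not in present]
        if missing:
            warnings.warn(f"input is missing features: {missing}")
            return False
        return True

    # Fallback (plain arrays or duplicated names): compare column counts only.
    n_columns = X.shape[1]
    if n_columns != n_expected:
        warnings.warn(
            f"input has {n_columns} columns, model expects {n_expected}"
        )
        return False
    return True
```

A caller could run this right before handing the data to the booster, passing the booster's feature names as the second argument. Whether a failed check should warn or raise is a design choice that the proposal above leaves open.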

@guolinke
Collaborator

@j-mark-hou sure, very happy to see this feature.

@guolinke
Collaborator

@j-mark-hou any updates ?

@j-mark-hou
Contributor Author

I think this should maybe be rolled into a more systematic rewrite of the pandas API. There are some things about the current implementation that I don't quite understand, and unfortunately I don't have the time right now to dig deeper into this. Sorry, I'll let you know if I revisit this at some point in the future.

@arsenyinfo

Currently, it leads to some kind of inconsistency:

In [1]: from lightgbm import LGBMClassifier, Dataset, train
   ...: import numpy as np
   ...:
   ...: x_data, y_data = np.random.rand(1000, 100), np.random.rand(1000) > .5
   ...: x_bad = np.random.rand(1000, 101)
   ...:
   ...:
   ...: def sklearn_style():
   ...:     clf = LGBMClassifier()
   ...:     clf.fit(x_data, y_data)
   ...:     return clf.predict(x_bad)
   ...:
   ...:
   ...: def xgboost_style():
   ...:     dataset = Dataset(x_data, y_data)
   ...:     params = {'application': 'binary'}
   ...:     booster = train(params, dataset)
   ...:     return booster.predict(x_bad)
   ...:

In [2]: sklearn_style().shape
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-281be6aeee0d> in <module>()
----> 1 sklearn_style().shape

<ipython-input-1-adc7de9c0c86> in sklearn_style()
      9     clf = LGBMClassifier()
     10     clf.fit(x_data, y_data)
---> 11     return clf.predict(x_bad)
     12
     13

~/.pyenv/versions/3.6.2/lib/python3.6/site-packages/lightgbm/sklearn.py in predict(self, X, raw_score, num_iteration)
    674
    675     def predict(self, X, raw_score=False, num_iteration=0):
--> 676         class_probs = self.predict_proba(X, raw_score, num_iteration)
    677         class_index = np.argmax(class_probs, axis=1)
    678         return self._le.inverse_transform(class_index)

~/.pyenv/versions/3.6.2/lib/python3.6/site-packages/lightgbm/sklearn.py in predict_proba(self, X, raw_score, num_iteration)
    704                              "match the input. Model n_features_ is %s and "
    705                              "input n_features is %s "
--> 706                              % (self._n_features, n_features))
    707         class_probs = self.booster_.predict(X, raw_score=raw_score, num_iteration=num_iteration)
    708         if self._n_classes > 2:

ValueError: Number of features of the model must match the input. Model n_features_ is 100 and input n_features is 101

In [3]: xgboost_style().shape
[LightGBM] [Info] Number of positive: 506, number of negative: 494
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data: 1000, number of used features: 100
[LightGBM] [Info] Finished loading 100 models
Out[3]: (1000,)

IMHO, a check on the feature count is really important: without it, input with the wrong number of features is silently accepted, which can lead to nasty, hard-to-debug issues.
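
Until such a check exists in the library, one possible workaround is a thin wrapper around the booster's `predict` that compares the input width against `Booster.num_feature()`. The wrapper name `checked_predict` is hypothetical, and the sketch assumes `X` is 2-D and exposes `shape`.

```python
def checked_predict(booster, X, **predict_kwargs):
    """Hypothetical guard around booster.predict(); not part of LightGBM.

    Raises ValueError when the number of input columns differs from the
    number of features the booster was trained on.
    """
    n_expected = booster.num_feature()  # feature count stored in the model
    n_given = X.shape[1]                # assumes a 2-D input
    if n_given != n_expected:
        raise ValueError(
            "Number of features of the model must match the input. "
            f"Model n_features_ is {n_expected} and input n_features is {n_given}"
        )
    return booster.predict(X, **predict_kwargs)
```

The error message mirrors the one the sklearn wrapper raises in the traceback above, so both entry points would fail the same way on mismatched input.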

@StrikerRUS
Collaborator

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@jsh9

jsh9 commented Aug 31, 2019

I may be interested in helping improve this. What would be the ideal behavior here? Does LightGBM make predictions based only on column order (like sklearn), or based on column names?

@StrikerRUS
Collaborator

The shape of the data for prediction is now checked on the C++ side, thanks to #2464.

The next steps may be to check the type and order of features, and their names in case of pandas DataFrame.
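
As a rough illustration of the name-based step, the sketch below computes the column permutation that puts the input columns into the order the model expects. The helper name `column_permutation` is hypothetical, and raising on duplicated or missing names is an assumption about the desired behavior, not something decided in this thread.

```python
def column_permutation(feature_names, column_names):
    """Return indices into column_names that reorder them to feature_names order.

    Hypothetical helper: raises ValueError on duplicated input column names
    or when a model feature is absent from the input.
    """
    position = {}
    for idx, name in enumerate(column_names):
        if name in position:
            raise ValueError(f"duplicate column name: {name!r}")
        position[name] = idx

    missing = [name for name in feature_names if name not in position]
    if missing:
        raise ValueError(f"columns missing from input: {missing}")

    # Extra input columns (e.g. a target column) are simply not selected.
    return [position[name] for name in feature_names]
```

With a NumPy array `X` whose columns are described by `column_names`, `X[:, column_permutation(feature_names, column_names)]` would select and reorder the columns; with a DataFrame, `df.iloc[:, perm]` does the same.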

@jmoralez
Collaborator

Reopening because I'm working on this.

@jmoralez jmoralez reopened this Dec 24, 2021
StrikerRUS added a commit that referenced this issue Jun 27, 2022
…812) (#4909)

* check feature names and order in predict with dataframe

* slice df in predict to remove the target

* scramble features

* handle int column names

* only change column order when needed

* include validate_features param in booster and sklearn estimators

* document validate_features argument

* use all_close in preds checks and check for assertion error to compare different arrays

* perform remapping and checks in cpp

* remove extra logs

* fixes

* revert cpp

* proposal

* remove extra arg

* lint

* restore _data_from_pandas arguments

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* move data conversion to Predictor.predict

* use Vector2Ptr

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
6 participants