
[python-package] do not use all features when using lbgm.LGBMClassifier #5915

Closed
meiyuxin opened this issue Jun 8, 2023 · 7 comments

@meiyuxin

meiyuxin commented Jun 8, 2023

# (train_iter, train_y, test_iter and true_loss() are defined elsewhere in my code)
import numpy as np
import lightgbm as lgbm
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

oofs = []
preds = []
for i in range(20):
    # rebalance to 108 records per class, resampling with a different seed each round
    under = RandomUnderSampler(sampling_strategy={1: 108, 0: 108},
                               random_state=i, replacement=True)
    x_re, y_re = under.fit_resample(train_iter, train_y)
    x_tr, x_val, y_tr, y_val = train_test_split(x_re, y_re, test_size=0.2,
                                                random_state=42)

    model = lgbm.LGBMClassifier(metric='binary_logloss',
                                objective='binary',
                                n_estimators=1000,
                                random_state=42,
                                verbose=10,
                                boosting_type='gbdt')

    model.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], early_stopping_rounds=20)
    val_pred = model.predict_proba(x_val, num_iteration=model.best_iteration_)
    val_pred = val_pred[:, 1]
    oof = true_loss(y_val, val_pred)
    oofs.append(oof)
    print(f'fold {i}, loss: {oof}')
    pred = model.predict_proba(test_iter)
    preds.append(pred)
print(f'average loss: {np.mean(np.array(oofs))}')
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 81, number of negative: 91
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.746124
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.047252
[LightGBM] [Debug] init for col-wise cost 0.000099 seconds, init for row-wise cost 0.000292 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000359 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2385
[LightGBM] [Info] Number of data points in the train set: 172, number of used features: 55
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.470930 -> initscore=-0.116410
[LightGBM] [Info] Start training from score -0.116410
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3

issue:

[LightGBM] [Info] Number of data points in the train set: 172, number of used features: 55

Why does the model use only 55 features when there are 56 features in total?
Is there any problem? I'm really confused.

x_tr:  172 rows × 56 columns
AB AF AH AM AR AX AY AZ BC BD ... FI FL FR FS GB GE GF GH GI GL
0.734956 8981.87235 85.200147 87.864987 8.138688 5.519157 0.027100 24.653424 4.687676 3040.05118 ... 9.603646 3.624578 1.34879 0.602797 30.715204 102.463358 40629.670470 22.277627 103.211788 0.186686
0.371751 4517.85686 107.469060 29.978960 8.138688 7.716189 0.025578 22.920374 1.229900 5108.29464 ... 7.773330 0.173229 1.35372 1.171729 17.924954 72.611063 119643.014900 33.362486 50.629820 21.978000
0.316202 2851.62152 85.200147 9.566633 8.138688 3.215817 0.025578 12.087236 1.229900 7070.56673 ... 7.271647 14.566408 0.76966 0.853398 18.445866 72.611063 11844.496760 37.224884 35.601624 0.055005
1.734838 9189.95242 85.200147 630.518230 8.138688 11.596431 0.025578 21.666276 15.855168 4924.93213 ... 9.068885 0.173229 0.49706 0.358969 18.771436 139.528613 28703.793260 35.059262 19.966436 21.978000
0.482849 5302.16918 85.200147 62.689474 11.112036 4.642116 0.025578 12.667020 1.229900 3721.06285 ... 8.627845 2.934356 0.96164 0.115141 18.864456 72.611063 5625.703206 29.474041 86.532368 0.069980
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0.222196 1720.37921 85.200147 11.458900 8.138688 3.268971 0.032277 10.020180 1.229900 3220.22568 ... 8.363221 0.173229 1.75218 0.067730 26.873478 72.611063 47954.860760 35.662064 21.334740 21.978000
2.042494 3209.18378 105.701133 14.837727 8.138688 9.992952 0.025578 13.177482 1.229900 6366.32014 ... 11.031513 0.173229 0.76386 0.101595 16.594768 140.632863 14173.430870 29.034963 21.499348 21.978000
0.346113 1819.38504 145.666734 19.498712 8.138688 4.899027 0.027710 10.234448 2.765518 5845.53993 ... 17.421080 1.514886 1.32037 0.196417 18.110994 85.181845 12851.988610 31.475939 17.672212 0.693000
0.585401 2830.23235 85.200147 14.407244 8.138688 4.305474 0.038976 9.919348 1.229900 5865.79458 ... 11.941158 11.222916 0.77604 0.399607 29.012938 91.719005 1759.614273 28.934496 31.175212 0.091696
0.679407 3996.42551 85.200147 196.938230 8.138688 7.264380 0.025578 11.967498 4.396014 3885.96846 ... 11.621404 5.833028 2.08046 0.738257 18.966778 72.611063 3361.259349 31.513149 7.541104 0.134308

172 rows × 56 columns

@jameslamb
Collaborator

Thanks for using LightGBM.

Without seeing the exact dataset I can't say for sure, but I suspect that one of your columns cannot possibly be used based on the parameters you're using.

For example, LightGBM has a parameter min_data_in_bin which prevents the creation of bins in the feature histograms that contain too few records. (docs link)

That parameter defaults to 3. During Dataset construction, features that are impossible to split based on this rule are dropped. For example, if that feature AR has 171 values of 8.138688 and a single value of 11.112036, it can't possibly be used by LightGBM... there is no possible split that would result in at least 3 records on either side.

If you really really really want to try to use such features, you can set min_data_in_bin = 1, min_data_in_leaf = 1.

If you want to avoid dropping uninformative features during Dataset construction, you can pass feature_pre_filter=False (docs link) in the params for lgb.Dataset().
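
As a rough, self-contained sketch of both of those knobs (the data here is a synthetic stand-in echoing the AR example above, not your actual dataset):

import numpy as np
import lightgbm as lgb

# synthetic stand-in: a 172-row dataset whose second column is near-constant,
# like the AR column described above
rng = np.random.default_rng(0)
X = rng.normal(size=(172, 2))
X[:, 1] = 8.138688
X[0, 1] = 11.112036
y = rng.integers(0, 2, size=172)

params = {
    "objective": "binary",
    "feature_pre_filter": False,  # don't drop unsplittable features during Dataset construction
    "min_data_in_bin": 1,         # allow histogram bins holding a single record
    "min_data_in_leaf": 1,        # allow leaves holding a single record
}
booster = lgb.train(params, lgb.Dataset(X, label=y))
# the "[Info] ... number of used features" line should now report both columns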

@jameslamb
Collaborator

NOTE: I've reformatted your question to more clearly differentiate between code + logs and your question text.

If you're new to GitHub, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for information on how to format text here.

@jameslamb jameslamb changed the title do not use all features when using lbgm.LGBMClassifier [python-package] do not use all features when using lbgm.LGBMClassifier Jun 8, 2023
@meiyuxin
Author

meiyuxin commented Jun 8, 2023

I just reviewed my dataset and indeed found a feature that only has two values. You clarified my doubts, and I'm very grateful for that. However, I have one more question: Why is it that when I wrap the data using lgb_train = lgb.Dataset(X_train.drop("Id",axis=1), y_train,free_raw_data=False) and then proceed with training, the model can use all the features?

p.s. I am a beginner and I am learning how to use GitHub. ^_^

@jameslamb
Collaborator

jameslamb commented Jun 8, 2023

Your dataset is very small (172 rows), and that's before you do a train-test split that reduces the training data to around 137 rows.

LightGBM's overfitting protections have default values designed to work well on larger datasets, with at least thousands of records. For example (see the sketch after this list):

  • min_data_in_leaf (default = 20) (docs)
  • min_data_in_bin (default = 3) (docs)
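
With the scikit-learn interface from your code, those could be loosened like this (the values shown are illustrative, not a recommendation):

import lightgbm as lgbm

# extra keyword arguments on the scikit-learn estimators are passed through
# to LightGBM as-is, so core parameter names like min_data_in_bin work here
model = lgbm.LGBMClassifier(
    objective='binary',
    n_estimators=1000,
    min_child_samples=1,  # scikit-learn-style name for min_data_in_leaf (default 20)
    min_data_in_bin=1,    # default 3
    random_state=42,
)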

In addition, LightGBM is not deterministic by default. What you may be seeing as "I get a different result when I do this other thing" may actually just be the run-to-run variation due to randomness in LightGBM. And small random changes could have a very significant impact on such a small dataset. To get fully-reproducible results across multiple runs on the same machine and with the same version of LightGBM, pass the following parameters through params:

  • deterministic = True
  • num_threads = 1
  • force_row_wise = True
  • seed = 708
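
Applied to the classifier from your question, that could look like the following sketch (using the scikit-learn-style names n_jobs and random_state for num_threads and seed):

import lightgbm as lgbm

model = lgbm.LGBMClassifier(
    objective='binary',
    n_estimators=1000,
    deterministic=True,   # passed through as the core `deterministic` parameter
    force_row_wise=True,  # fix the histogram-building strategy instead of auto-choosing
    n_jobs=1,             # num_threads = 1
    random_state=708,     # seed = 708
)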

For more on that, see the following:

@meiyuxin
Author

meiyuxin commented Jun 9, 2023 via email

@jameslamb
Collaborator

No problem, glad it helped 😊

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 13, 2023