
[python-package] do not use all features when using lbgm.LGBMClassifier #5915

Closed
meiyuxin opened this issue Jun 8, 2023 · 7 comments

@meiyuxin

meiyuxin commented Jun 8, 2023

# (train_iter, train_y, test_iter and true_loss() are defined elsewhere in my code)
import numpy as np
import lightgbm as lgbm
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

oofs = []
preds = []
for i in range(20):
    # rebalance to 108 records per class, resampling with a different seed each round
    under = RandomUnderSampler(sampling_strategy={1: 108, 0: 108},
                               random_state=i, replacement=True)
    x_re, y_re = under.fit_resample(train_iter, train_y)
    x_tr, x_val, y_tr, y_val = train_test_split(x_re, y_re, test_size=0.2,
                                                random_state=42)

    model = lgbm.LGBMClassifier(metric='binary_logloss',
                                objective='binary',
                                n_estimators=1000,
                                random_state=42,
                                verbose=10,
                                boosting_type='gbdt')

    model.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], early_stopping_rounds=20)
    val_pred = model.predict_proba(x_val, num_iteration=model.best_iteration_)
    val_pred = val_pred[:, 1]
    oof = true_loss(y_val, val_pred)
    oofs.append(oof)
    print(f'fold {i}, loss: {oof}')
    pred = model.predict_proba(test_iter)
    preds.append(pred)
print(f'average loss: {np.mean(np.array(oofs))}')
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 81, number of negative: 91
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.746124
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.047252
[LightGBM] [Debug] init for col-wise cost 0.000099 seconds, init for row-wise cost 0.000292 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000359 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2385
[LightGBM] [Info] Number of data points in the train set: 172, number of used features: 55
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.470930 -> initscore=-0.116410
[LightGBM] [Info] Start training from score -0.116410
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 6 and depth = 3

issue:

[LightGBM] [Info] Number of data points in the train set: 172, number of used features: 55

Why does the model use only 55 features when there are 56 features in total?
Is there any problem? I'm really confused.

x_tr:  172 rows × 56 columns
AB AF AH AM AR AX AY AZ BC BD ... FI FL FR FS GB GE GF GH GI GL
0.734956 8981.87235 85.200147 87.864987 8.138688 5.519157 0.027100 24.653424 4.687676 3040.05118 ... 9.603646 3.624578 1.34879 0.602797 30.715204 102.463358 40629.670470 22.277627 103.211788 0.186686
0.371751 4517.85686 107.469060 29.978960 8.138688 7.716189 0.025578 22.920374 1.229900 5108.29464 ... 7.773330 0.173229 1.35372 1.171729 17.924954 72.611063 119643.014900 33.362486 50.629820 21.978000
0.316202 2851.62152 85.200147 9.566633 8.138688 3.215817 0.025578 12.087236 1.229900 7070.56673 ... 7.271647 14.566408 0.76966 0.853398 18.445866 72.611063 11844.496760 37.224884 35.601624 0.055005
1.734838 9189.95242 85.200147 630.518230 8.138688 11.596431 0.025578 21.666276 15.855168 4924.93213 ... 9.068885 0.173229 0.49706 0.358969 18.771436 139.528613 28703.793260 35.059262 19.966436 21.978000
0.482849 5302.16918 85.200147 62.689474 11.112036 4.642116 0.025578 12.667020 1.229900 3721.06285 ... 8.627845 2.934356 0.96164 0.115141 18.864456 72.611063 5625.703206 29.474041 86.532368 0.069980
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0.222196 1720.37921 85.200147 11.458900 8.138688 3.268971 0.032277 10.020180 1.229900 3220.22568 ... 8.363221 0.173229 1.75218 0.067730 26.873478 72.611063 47954.860760 35.662064 21.334740 21.978000
2.042494 3209.18378 105.701133 14.837727 8.138688 9.992952 0.025578 13.177482 1.229900 6366.32014 ... 11.031513 0.173229 0.76386 0.101595 16.594768 140.632863 14173.430870 29.034963 21.499348 21.978000
0.346113 1819.38504 145.666734 19.498712 8.138688 4.899027 0.027710 10.234448 2.765518 5845.53993 ... 17.421080 1.514886 1.32037 0.196417 18.110994 85.181845 12851.988610 31.475939 17.672212 0.693000
0.585401 2830.23235 85.200147 14.407244 8.138688 4.305474 0.038976 9.919348 1.229900 5865.79458 ... 11.941158 11.222916 0.77604 0.399607 29.012938 91.719005 1759.614273 28.934496 31.175212 0.091696
0.679407 3996.42551 85.200147 196.938230 8.138688 7.264380 0.025578 11.967498 4.396014 3885.96846 ... 11.621404 5.833028 2.08046 0.738257 18.966778 72.611063 3361.259349 31.513149 7.541104 0.134308

172 rows × 56 columns

@jameslamb
Collaborator

Thanks for using LightGBM.

Without seeing the exact dataset I can't say for sure, but I suspect that one of your columns cannot possibly be used based on the parameters you're using.

For example, LightGBM has a parameter min_data_in_bin which prevents the creation of bins in the feature histograms that contain too few records. (docs link)

That parameter defaults to 3. During Dataset construction, features that are impossible to split based on this rule are dropped. For example, if that feature AR has 171 values of 8.138688 and a single value of 11.112036, it can't possibly be used by LightGBM... there is no possible split that would result in at least 3 records on either side.

If you really really really want to try to use such features, you can set min_data_in_bin = 1, min_data_in_leaf = 1.

If you want to avoid dropping uninformative features during Dataset construction, you can pass feature_pre_filter=False (docs link) in the params for lgb.Dataset().
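
As a rough, self-contained sketch of both of those knobs (the data here is a synthetic stand-in echoing the AR example above, not your actual dataset):

import numpy as np
import lightgbm as lgb

# synthetic stand-in: a 172-row dataset whose second column is near-constant,
# like the AR column described above
rng = np.random.default_rng(0)
X = rng.normal(size=(172, 2))
X[:, 1] = 8.138688
X[0, 1] = 11.112036
y = rng.integers(0, 2, size=172)

params = {
    "objective": "binary",
    "feature_pre_filter": False,  # don't drop unsplittable features during Dataset construction
    "min_data_in_bin": 1,         # allow histogram bins holding a single record
    "min_data_in_leaf": 1,        # allow leaves holding a single record
}
booster = lgb.train(params, lgb.Dataset(X, label=y))
# the "[Info] ... number of used features" line should now report both columns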

@jameslamb
Collaborator

NOTE: I've reformatted your question to more clearly differentiate between code + logs and your question text.

If you're new to GitHub, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for information on how to format text here.

@jameslamb jameslamb changed the title do not use all features when using lbgm.LGBMClassifier [python-package] do not use all features when using lbgm.LGBMClassifier Jun 8, 2023
@meiyuxin
Author

meiyuxin commented Jun 8, 2023

I just reviewed my dataset and indeed found a feature that only has two values. You clarified my doubts, and I'm very grateful for that. However, I have one more question: Why is it that when I wrap the data using lgb_train = lgb.Dataset(X_train.drop("Id",axis=1), y_train,free_raw_data=False) and then proceed with training, the model can use all the features?

p.s. I am a beginner and I am learning how to use GitHub. ^_^

@jameslamb
Collaborator

jameslamb commented Jun 8, 2023

Your dataset is very small (172 rows), and that's before you do a train-test split that reduces the training data to around 137 rows.

LightGBM's overfitting protections have default values designed to work well on larger datasets, with at least thousands of records. For example (see the sketch after this list):

  • min_data_in_leaf (default = 20) (docs)
  • min_data_in_bin (default = 3) (docs)
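
With the scikit-learn interface from your code, those could be loosened like this (the values shown are illustrative, not a recommendation):

import lightgbm as lgbm

# extra keyword arguments on the scikit-learn estimators are passed through
# to LightGBM as-is, so core parameter names like min_data_in_bin work here
model = lgbm.LGBMClassifier(
    objective='binary',
    n_estimators=1000,
    min_child_samples=1,  # scikit-learn-style name for min_data_in_leaf (default 20)
    min_data_in_bin=1,    # default 3
    random_state=42,
)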

In addition, LightGBM is not deterministic by default. What you may be seeing as "I get a different result when I do this other thing" may actually just be the run-to-run variation due to randomness in LightGBM. And small random changes could have a very significant impact on such a small dataset. To get fully-reproducible results across multiple runs on the same machine and with the same version of LightGBM, pass the following parameters through params:

  • deterministic = True
  • num_threads = 1
  • force_row_wise = True
  • seed = 708
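
Applied to the classifier from your question, that could look like the following sketch (using the scikit-learn-style names n_jobs and random_state for num_threads and seed):

import lightgbm as lgbm

model = lgbm.LGBMClassifier(
    objective='binary',
    n_estimators=1000,
    deterministic=True,   # passed through as the core `deterministic` parameter
    force_row_wise=True,  # fix the histogram-building strategy instead of auto-choosing
    n_jobs=1,             # num_threads = 1
    random_state=708,     # seed = 708
)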

For more on that, see the following:

@meiyuxin
Author

meiyuxin commented Jun 9, 2023 via email

@jameslamb
Collaborator

No problem, glad it helped 😊

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 13, 2023