[python-package] do not use all features when using lbgm.LGBMClassifier #5915
Thanks for using LightGBM. Without seeing the exact dataset I can't say for sure, but I suspect that one of your columns cannot possibly be used based on the parameters you're using. For example, LightGBM has a parameter `min_data_in_bin`. That parameter defaults to `3`. If you really, really, really want to try to use such features, you can set `min_data_in_bin=1`. If you want to avoid dropping uninformative features during `Dataset` construction, you can also set `feature_pre_filter=false`.
NOTE: I've reformatted your question to more clearly differentiate between code + logs and your question text. If you're new to GitHub, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for information on how to format text here.
I just reviewed my dataset and indeed found a feature that only has two values. You clarified my doubts, and I'm very grateful for that. However, I have one more question: why is it that when I wrap the data using `lgb_train = lgb.Dataset(X_train.drop("Id", axis=1), y_train, free_raw_data=False)` and then proceed with training, the model can use all the features? p.s. I am a beginner and I am learning how to use GitHub. ^_^
Your dataset is very small (172 rows), and that's before you do a train-test split that reduces the training data to around 137 rows. LightGBM's overfitting protections have default values designed to work well on larger datasets, with at least 1000s of records. For example:
min_data_in_leaf (default = 20) (docs)
min_data_in_bin (default = 3) (docs)
In addition, LightGBM is not deterministic by default. What you may be seeing as "I get a different result when I do this other thing" may actually just be the run-to-run variation due to randomness in LightGBM. And small random changes could have a very significant impact on such a small dataset. To get fully-reproducible results across multiple runs on the same machine and with the same version of LightGBM, pass the following parameters through `params`:
deterministic = True
num_threads = 1
force_row_wise = True
seed = 708
For more on that, see the following: #5887 (comment)
Sorry for the late reply. Thank you very much; your answer solved my problem.
No problem, glad it helped 😊
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
issue:
Why does the model use only 55 features when there are 56 features in total?
Is there any problem? I'm really confused.
172 rows × 56 columns