Target and Count encodings for categorical features - Iteration 1 #4964
const std::string count_prior_weight_key = "prior_weight";

namespace LightGBM {
double CategoryFeatureTargetEncoder::Encode(double feature_value) {
Is the fold ID considered for encoding of training dataset?
Ok. I see that the encoder is per-fold. So there's no question.
@tongwu-msft Thank you for working on this. The design overall looks good to me. Can we move on to integrating with the dataset classes?
Yes, let me split the big PR into multiple iterations. This is iteration 1, and we can work on the parts in parallel.
@tongwu-msft is this still WIP?
I'm very confused. After 7 months of inactivity, this PR was closed without any comment...and it was supposed to replace #3234, which has not received a commit in about 8 months. @shiyu1994 @tongwu-msft @guolinke what is happening with the large changes to LightGBM's handling of categorical values? Is someone working on this, and should it still be considered a requirement for v4.0.0 (#5153)?
I can pick this up, but I'll first focus on the CUDA parts before v4.0.0. Another option would be to extract the categorical feature handling into a separate module for LightGBM dataset preprocessing (something like https://github.com/microsoft/LightGBM-transform), so that we can focus on the core GBDT training algorithm in our codebase.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Split the PR "Target and Count encodings for categorical features" into multiple parts.
Implementation