Target and Count encodings for categorical features - Iteration 1 #4964

Closed
wants to merge 22 commits into from


tongwu-sh
Contributor

@tongwu-sh tongwu-sh commented Jan 21, 2022

This PR splits the larger PR "Target and Count encodings for categorical features" into multiple parts. This is the first part.

Implementation

  1. CategoryFeatureEncoder: abstract base class for encoders.
  2. CategoryFeatureEncoderManager: encodes a value using multiple aggregated encoders.
  3. CategoryFeatureTargetInformationCollector: collects information about the data before training.
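
The three pieces listed above can be pictured with a small, self-contained sketch. It is not the code from this PR: every name in it (`SketchEncoder`, `SketchCountEncoder`, `SketchTargetEncoder`, `SketchEncoderManager`, `CollectStats`) is hypothetical, and the smoothing formula `(label_sum + prior * prior_weight) / (count + prior_weight)` is the common textbook form of target encoding, assumed here only because the diff mentions a `prior_weight` key.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the PR's code.
struct CategoryStats {
  double label_sum = 0.0;  // sum of target values seen for this category
  double count = 0.0;      // number of rows seen for this category
};
using StatsMap = std::unordered_map<int, CategoryStats>;

// Collector step: one pass over the data before training.
StatsMap CollectStats(const std::vector<int>& categories,
                      const std::vector<double>& labels) {
  assert(categories.size() == labels.size());
  StatsMap stats;
  for (std::size_t i = 0; i < categories.size(); ++i) {
    CategoryStats& s = stats[categories[i]];
    s.label_sum += labels[i];
    s.count += 1.0;
  }
  return stats;
}

// Abstract interface: map a raw categorical value to a number.
class SketchEncoder {
 public:
  virtual ~SketchEncoder() = default;
  virtual double Encode(int category) const = 0;
};

// Count encoding: how many collected rows carried this category.
class SketchCountEncoder : public SketchEncoder {
 public:
  explicit SketchCountEncoder(StatsMap stats) : stats_(std::move(stats)) {}
  double Encode(int category) const override {
    auto it = stats_.find(category);
    return it == stats_.end() ? 0.0 : it->second.count;
  }
 private:
  StatsMap stats_;
};

// Target encoding: per-category label mean, smoothed towards a prior.
class SketchTargetEncoder : public SketchEncoder {
 public:
  SketchTargetEncoder(StatsMap stats, double prior, double prior_weight)
      : stats_(std::move(stats)), prior_(prior), prior_weight_(prior_weight) {
    assert(prior_weight_ >= 0.0);
  }
  double Encode(int category) const override {
    auto it = stats_.find(category);
    if (it == stats_.end()) return prior_;  // unseen category: use the prior
    const CategoryStats& s = it->second;
    return (s.label_sum + prior_ * prior_weight_) / (s.count + prior_weight_);
  }
 private:
  StatsMap stats_;
  double prior_;
  double prior_weight_;
};

// Manager: applies every registered encoder to the same raw value.
class SketchEncoderManager {
 public:
  void Add(std::unique_ptr<SketchEncoder> encoder) {
    encoders_.push_back(std::move(encoder));
  }
  std::vector<double> EncodeAll(int category) const {
    std::vector<double> out;
    for (const auto& e : encoders_) out.push_back(e->Encode(category));
    return out;
  }
 private:
  std::vector<std::unique_ptr<SketchEncoder>> encoders_;
};
```

In this sketch the manager simply returns one encoded value per registered encoder, so a single categorical column would expand into several numeric columns. Whether the PR aggregates encoders this way is not stated in the conversation.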

const std::string count_prior_weight_key = "prior_weight";

namespace LightGBM {
double CategoryFeatureTargetEncoder::Encode(double feature_value) {
Collaborator

Is the fold ID considered for encoding of training dataset?

Collaborator

Ok. I see that the encoder is per-fold. So there's no question.
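
For readers unfamiliar with why the fold ID matters: if a training row is encoded with statistics that include its own label, the encoded feature leaks the target. A common remedy is to encode each training row with statistics gathered from the other folds only. The sketch below illustrates that idea under stated assumptions; the function name `OutOfFoldTargetEncode`, the `FoldStats` struct, and the smoothing formula are hypothetical and not taken from the PR, which may organize its per-fold encoders differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper types and names; an illustration only.
struct FoldStats {
  double label_sum = 0.0;  // sum of labels
  double count = 0.0;      // number of rows
};

// Encode every training row using only rows from the OTHER folds.
std::vector<double> OutOfFoldTargetEncode(const std::vector<int>& categories,
                                          const std::vector<double>& labels,
                                          const std::vector<int>& fold_ids,
                                          int num_folds, double prior,
                                          double prior_weight) {
  assert(categories.size() == labels.size());
  assert(categories.size() == fold_ids.size());
  const std::size_t n = categories.size();

  // Totals per category, plus totals per (fold, category).
  std::unordered_map<int, FoldStats> total;
  std::vector<std::unordered_map<int, FoldStats>> in_fold(num_folds);
  for (std::size_t i = 0; i < n; ++i) {
    assert(fold_ids[i] >= 0 && fold_ids[i] < num_folds);
    FoldStats& t = total[categories[i]];
    t.label_sum += labels[i];
    t.count += 1.0;
    FoldStats& f = in_fold[fold_ids[i]][categories[i]];
    f.label_sum += labels[i];
    f.count += 1.0;
  }

  std::vector<double> encoded(n);
  for (std::size_t i = 0; i < n; ++i) {
    const FoldStats& all = total.at(categories[i]);
    const FoldStats& own = in_fold[fold_ids[i]].at(categories[i]);
    // Subtract the row's own fold so its label cannot influence its encoding.
    const double sum = all.label_sum - own.label_sum;
    const double cnt = all.count - own.count;
    const double denom = cnt + prior_weight;
    encoded[i] = denom > 0.0 ? (sum + prior * prior_weight) / denom : prior;
  }
  return encoded;
}
```

With this scheme, a category that appears only inside one fold falls back to the prior for rows in that fold, since no out-of-fold evidence exists for it.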

@shiyu1994
Collaborator

@tongwu-msft Thank you for working on this. The design overall looks good to me. Can we move on to integrating with the dataset classes?

@tongwu-sh
Contributor Author

> @tongwu-msft Thank you for working on this. The design overall looks good to me. Can we move on to integrating with the dataset classes?

Yes, let me split the big PR into multiple iterations. This is iteration 1, and we can work on the iterations in parallel.

@tongwu-sh tongwu-sh marked this pull request as ready for review January 27, 2022 09:11
@guolinke
Collaborator

guolinke commented Mar 9, 2022

@tongwu-msft is this still WIP?

@tongwu-sh tongwu-sh closed this Aug 9, 2022
@jameslamb
Collaborator

I'm very confused. After 7 months of inactivity, this PR was closed without any comment...and it was supposed to replace #3234 which has not received a commit in about 8 months.

@shiyu1994 @tongwu-msft @guolinke what is happening with the large changes to LightGBM's handling of categorical values? Is someone working on this and should it still be considered a requirement for v4.0.0 (#5153)?

cc @jmoralez @StrikerRUS @btrotta

@shiyu1994
Collaborator

I can pick this up, but I'll first focus on the CUDA parts before v4.0.0. Another option would be to extract the categorical feature handling into a separate module for LightGBM dataset preprocessing (something like https://github.com/microsoft/LightGBM-transform), so that we can focus on the core GBDT training algorithm in our codebase.

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
4 participants