Target and Count encodings for categorical features - Iteration 1 #4964

tongwu-sh · 2022-01-21T09:12:01Z

This splits the PR "Target and Count encodings for categorical features" into multiple parts.

Implementation

CategoryFeatureEncoder: abstract base class for encoders.
CategoryFeatureEncoderManager: aggregates multiple encoders and encodes values through them.
CategoryFeatureTargetInformationCollector: collects information about the data before training.
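
To make the roles of these classes concrete, here is a minimal sketch of how an abstract encoder, two concrete encoders, and a manager could fit together. It is an illustration only, not the code in this PR: apart from `CategoryFeatureEncoder` and the `Encode(double)` signature visible in the diff, every name below (`CountEncoderSketch`, `MeanTargetEncoderSketch`, `EncoderManagerSketch`, `LabelStats`) is hypothetical, and the statistics are assumed to have been collected beforehand.

```cpp
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Abstract interface: maps a raw categorical value to a numeric encoding.
class CategoryFeatureEncoder {
 public:
  virtual ~CategoryFeatureEncoder() = default;
  virtual double Encode(double feature_value) const = 0;
};

// Count encoding (hypothetical class): a category is replaced by the number
// of times it was seen in the training data; unseen categories map to 0.
class CountEncoderSketch : public CategoryFeatureEncoder {
 public:
  explicit CountEncoderSketch(std::unordered_map<int, int> counts)
      : counts_(std::move(counts)) {}
  double Encode(double feature_value) const override {
    auto it = counts_.find(static_cast<int>(feature_value));
    return it == counts_.end() ? 0.0 : static_cast<double>(it->second);
  }

 private:
  std::unordered_map<int, int> counts_;
};

// Per-category label statistics (hypothetical helper type).
struct LabelStats {
  double label_sum;
  int count;
};

// Target encoding (hypothetical class): a category is replaced by the mean
// label of the training rows in that category; unseen categories fall back
// to a caller-supplied default.
class MeanTargetEncoderSketch : public CategoryFeatureEncoder {
 public:
  MeanTargetEncoderSketch(std::unordered_map<int, LabelStats> stats,
                          double default_value)
      : stats_(std::move(stats)), default_value_(default_value) {}
  double Encode(double feature_value) const override {
    auto it = stats_.find(static_cast<int>(feature_value));
    if (it == stats_.end() || it->second.count == 0) return default_value_;
    return it->second.label_sum / it->second.count;
  }

 private:
  std::unordered_map<int, LabelStats> stats_;
  double default_value_;
};

// Manager (hypothetical class): holds several encoders and returns one
// encoded value per encoder for a given raw categorical value.
class EncoderManagerSketch {
 public:
  void Add(std::unique_ptr<CategoryFeatureEncoder> encoder) {
    encoders_.push_back(std::move(encoder));
  }
  std::vector<double> Encode(double feature_value) const {
    std::vector<double> out;
    out.reserve(encoders_.size());
    for (const auto& encoder : encoders_) {
      out.push_back(encoder->Encode(feature_value));
    }
    return out;
  }

 private:
  std::vector<std::unique_ptr<CategoryFeatureEncoder>> encoders_;
};
```

With this layout, one categorical column expands into one numeric column per registered encoder, which is one plausible reading of "encode value for aggregated multiple encoders".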

shiyu1994 · 2022-01-25T07:55:02Z

src/feature_engineering/category_feature_target_encoder.cpp

+const std::string count_prior_weight_key = "prior_weight";
+
+namespace LightGBM {
+  double CategoryFeatureTargetEncoder::Encode(double feature_value) {


Is the fold ID taken into account when encoding the training dataset?

OK, I see that the encoder is per-fold, so that answers my question.
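
For context on why a per-fold encoder settles the fold ID question, here is a minimal sketch of the usual out-of-fold scheme. It is an assumption about the general technique, not a copy of this PR's code: each training row is encoded with statistics from rows in the other folds only, so its own label never leaks into its encoded value. `PerFoldTargetEncoderSketch`, `CategoryStats`, `Collect`, and the smoothing formula are hypothetical; only the `prior_weight` name comes from the diff.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Per-category label statistics (hypothetical helper type).
struct CategoryStats {
  double label_sum = 0.0;
  int count = 0;
};

// Hypothetical sketch: keeps one statistics table per fold. The table for
// fold k is built only from rows whose fold ID is NOT k.
class PerFoldTargetEncoderSketch {
 public:
  PerFoldTargetEncoderSketch(int num_folds, double prior, double prior_weight)
      : stats_(static_cast<std::size_t>(num_folds)),
        prior_(prior),
        prior_weight_(prior_weight) {}

  // Adds one training row to every fold's table except its own fold.
  void Collect(int category, double label, int fold_id) {
    for (std::size_t k = 0; k < stats_.size(); ++k) {
      if (static_cast<int>(k) == fold_id) continue;
      CategoryStats& s = stats_[k][category];
      s.label_sum += label;
      s.count += 1;
    }
  }

  // Encodes a training row using its own fold's table, smoothed towards the
  // prior: (label_sum + prior * prior_weight) / (count + prior_weight).
  // A category with no out-of-fold rows falls back to the prior itself.
  double Encode(int category, int fold_id) const {
    const auto& fold_stats = stats_[static_cast<std::size_t>(fold_id)];
    auto it = fold_stats.find(category);
    if (it == fold_stats.end()) return prior_;
    return (it->second.label_sum + prior_ * prior_weight_) /
           (it->second.count + prior_weight_);
  }

 private:
  std::vector<std::unordered_map<int, CategoryStats>> stats_;
  double prior_;
  double prior_weight_;
};
```

In this sketch the same category can encode to different values depending on the row's fold, which is the property the question above was checking for.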

shiyu1994 · 2022-01-25T07:59:12Z

@tongwu-msft Thank you for working on this. The design overall looks good to me. Can we move on to integrating with the dataset classes?

tongwu-sh · 2022-01-27T09:06:42Z

> @tongwu-msft Thank you for working on this. The design overall looks good to me. Can we move on to integrating with the dataset classes?

Yes. Let me split the big PR into multiple iterations; this is iteration 1, and we can work on the iterations in parallel.

guolinke · 2022-03-09T16:01:43Z

@tongwu-msft is this still WIP?

jameslamb · 2022-08-16T02:20:42Z

I'm very confused. After 7 months of inactivity, this PR was closed without any comment...and it was supposed to replace #3234 which has not received a commit in about 8 months.

@shiyu1994 @tongwu-msft @guolinke what is happening with the large changes to LightGBM's handling of categorical values? Is someone working on this and should it still be considered a requirement for v4.0.0 (#5153)?

cc @jmoralez @StrikerRUS @btrotta

shiyu1994 · 2022-08-16T07:58:21Z

I can pick this up, but I'll first focus on the CUDA parts before v4.0.0. Another option would be to extract the categorical feature handling into a separate module for LightGBM dataset preprocessing (something like https://github.com/microsoft/LightGBM-transform), so that we can focus on the core GBDT training algorithm in our codebase.

github-actions · 2023-11-15T00:20:57Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Tong Wu added 14 commits January 10, 2022 18:10

add encoders

6c5a52b

0111

4293a7c

0112

57fd43e

add dump to json

d6f5da0

0114

16e897a

0118

7581e38

0120

70a6504

add tests

0e58649

Add more tests

253c596

Merge branch '0121master'

f7f9511

Fix format

cc99a50

remove tab

e09391a

fix warning

1c8c932

remopve space

b1162e0

shiyu1994 reviewed Jan 25, 2022


Tong Wu added 8 commits January 27, 2022 12:03

fix warning

b2290f9

fix warning

b0c3fbd

fix warning

b587b16

fix warning

da0d50b

fix build error

b3232b5

fix build error

61bd980

fix warning

21f8f49

fix warning

9d66950

tongwu-sh marked this pull request as ready for review January 27, 2022 09:11

tongwu-sh requested review from btrotta, guolinke, henry0312 and huanzhang12 as code owners January 27, 2022 09:11

tongwu-sh requested review from hzy46, jameslamb, Laurae2 and StrikerRUS as code owners January 27, 2022 09:11

jameslamb added the feature and in progress labels Feb 4, 2022

tongwu-sh closed this Aug 9, 2022

jameslamb removed the in progress label Aug 13, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
