Validation dataset creation via Sequence #4184
@shiyu1994 Would you mind explaining Lines 726 to 770 in fba18e4?
Specifically: can it be used to create a training dataset as well? From my current understanding, it creates a new dataset, regardless of training or validation, from a referenced dataset. To be specific, it copies the reference's BinMapper, resets the actual data and associated states to empty, and nothing else. However, I'm not sure what these do: Lines 750 to 752 in fba18e4
Lines 758 to 760 in fba18e4
Are they related to
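To make the "copy the BinMapper, reset the actual data" pattern described above concrete, here is a minimal, hypothetical C++ sketch. The type names (`BinMapper`, `Dataset`, `CreateFrom`) are simplified stand-ins, not LightGBM's real classes; the point is only that the new dataset shares the reference's binning scheme while its data starts empty.

```cpp
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for LightGBM's internal types.
struct BinMapper {
  int num_bins = 0;  // binning scheme learned from the training data
};

struct Dataset {
  std::vector<std::shared_ptr<BinMapper>> bin_mappers;  // per-feature binning
  std::vector<std::vector<int>> binned_data;            // per-feature bin values
  int num_data = 0;

  // Sketch of creating a dataset from a referenced one: the BinMapper
  // objects are reused so both datasets bin raw values identically,
  // while the actual data and associated state start empty.
  static Dataset CreateFrom(const Dataset& reference, int num_data) {
    Dataset d;
    d.bin_mappers = reference.bin_mappers;  // copy the binning scheme only
    d.binned_data.assign(reference.bin_mappers.size(), {});  // no data yet
    d.num_data = num_data;
    return d;
  }
};
```

Sharing the binning scheme is what makes a validation set comparable to the training set: raw feature values on both sides fall into the same bins.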
In principle, we don't specify Lines 750 to 752 in fba18e4
Here the feature_need_push_zeros_ are just copied from the referenced training data, and it makes no difference if we write feature_need_push_zeros_ = dataset->feature_need_push_zeros_.
Lines 758 to 760 in fba18e4
But these lines do indicate an important difference between training data and validation data. In LightGBM, features are bundled together according to their sparsity to speed up training; see Exclusive Feature Bundling in the paper: https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf Since validation data is only used for inference, there's no need to group the features, so we set group_feature_start_[i] = i and group_feature_cnt_[i] = 1 to make each feature remain a group by itself.
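The identity mapping described above can be sketched as follows. This is a hypothetical helper, not LightGBM's actual code; the function name `SetValidationFeatureGroups` and the pointer-output signature are assumptions made for illustration, while the member names `group_feature_start_` and `group_feature_cnt_` come from the discussion.

```cpp
#include <vector>

// Hypothetical sketch: for validation data, every feature forms its own
// "group", so the group-to-feature bookkeeping is the identity mapping.
// (For training data, Exclusive Feature Bundling would instead merge
// mutually sparse features into shared groups.)
void SetValidationFeatureGroups(int num_features,
                                std::vector<int>* group_feature_start,
                                std::vector<int>* group_feature_cnt) {
  group_feature_start->resize(num_features);
  group_feature_cnt->resize(num_features);
  for (int i = 0; i < num_features; ++i) {
    (*group_feature_start)[i] = i;  // group i starts at feature i
    (*group_feature_cnt)[i] = 1;    // and contains exactly one feature
  }
}
```

With this layout, group i covers exactly feature i, whereas on training data a group may span several bundled sparse features.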
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
Implemented in #4089.
Originally posted by @shiyu1994 in #4089 (comment)
This post was created to keep track of, and to give new LightGBM developers more insight into, the design of LightGBM's data pipelines.
Please close this issue once #4089 is merged.