
Validation dataset creation via Sequence #4184

Closed · Willian-Zhang opened this issue Apr 16, 2021 · 4 comments

Willian-Zhang (Contributor) commented Apr 16, 2021

This issue refers to an under-development feature/PR: Sequence support for dataset.

Originally posted by @shiyu1994 in #4089 (comment)

This post is created to keep track of the discussion and to give new LightGBM developers more insight into the design of LightGBM's data pipelines.

Please close this issue once #4089 is merged.

Willian-Zhang (Contributor, Author) commented

@shiyu1994 Would you mind explaining CreateValid from:

LightGBM/src/io/dataset.cpp

Lines 726 to 770 in fba18e4

void Dataset::CreateValid(const Dataset* dataset) {
  feature_groups_.clear();
  num_features_ = dataset->num_features_;
  num_groups_ = num_features_;
  max_bin_ = dataset->max_bin_;
  min_data_in_bin_ = dataset->min_data_in_bin_;
  bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
  use_missing_ = dataset->use_missing_;
  zero_as_missing_ = dataset->zero_as_missing_;
  feature2group_.clear();
  feature2subfeature_.clear();
  has_raw_ = dataset->has_raw();
  numeric_feature_map_ = dataset->numeric_feature_map_;
  num_numeric_features_ = dataset->num_numeric_features_;
  // copy feature bin mapper data
  feature_need_push_zeros_.clear();
  group_bin_boundaries_.clear();
  uint64_t num_total_bin = 0;
  group_bin_boundaries_.push_back(num_total_bin);
  group_feature_start_.resize(num_groups_);
  group_feature_cnt_.resize(num_groups_);
  for (int i = 0; i < num_features_; ++i) {
    std::vector<std::unique_ptr<BinMapper>> bin_mappers;
    bin_mappers.emplace_back(new BinMapper(*(dataset->FeatureBinMapper(i))));
    if (bin_mappers.back()->GetDefaultBin() !=
        bin_mappers.back()->GetMostFreqBin()) {
      feature_need_push_zeros_.push_back(i);
    }
    feature_groups_.emplace_back(new FeatureGroup(&bin_mappers, num_data_));
    feature2group_.push_back(i);
    feature2subfeature_.push_back(0);
    num_total_bin += feature_groups_[i]->num_total_bin_;
    group_bin_boundaries_.push_back(num_total_bin);
    group_feature_start_[i] = i;
    group_feature_cnt_[i] = 1;
  }
  feature_groups_.shrink_to_fit();
  used_feature_map_ = dataset->used_feature_map_;
  num_total_features_ = dataset->num_total_features_;
  feature_names_ = dataset->feature_names_;
  label_idx_ = dataset->label_idx_;
  real_feature_idx_ = dataset->real_feature_idx_;
  forced_bin_bounds_ = dataset->forced_bin_bounds_;
}

Specifically: can it be used to create a training dataset as well?

From my current understanding, it creates a new dataset, whether for training or validation, from a referenced dataset. To be specific, it copies the reference's BinMappers, resets the actual data and associated state to empty, and does nothing else.

However, I'm not sure what these do:

LightGBM/src/io/dataset.cpp

Lines 750 to 752 in fba18e4

    if (bin_mappers.back()->GetDefaultBin() !=
        bin_mappers.back()->GetMostFreqBin()) {
      feature_need_push_zeros_.push_back(i);

LightGBM/src/io/dataset.cpp

Lines 758 to 760 in fba18e4

    group_bin_boundaries_.push_back(num_total_bin);
    group_feature_start_[i] = i;
    group_feature_cnt_[i] = 1;

Are they related to Exclusive Feature Bundling featured in the paper?

shiyu1994 (Collaborator) commented

In principle, we don't specify a reference when creating a training dataset, so CreateValid shouldn't be used to create a training dataset.
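For illustration, the convention looks roughly like this through the C API (a minimal sketch only; the exact signatures live in include/LightGBM/c_api.h, and the file names and parameter string here are placeholders): the training dataset is created with a null reference, and only the validation dataset passes the training handle as its reference, which is the path that ends up in Dataset::CreateValid.

#include <LightGBM/c_api.h>
#include <cstdio>

int main() {
  DatasetHandle train = nullptr;
  DatasetHandle valid = nullptr;

  // Training data: no reference, so bin mappers and feature groups (EFB)
  // are constructed from the training data itself.
  if (LGBM_DatasetCreateFromFile("train.txt", "max_bin=255", nullptr, &train) != 0) {
    std::fprintf(stderr, "failed to load training data\n");
    return 1;
  }

  // Validation data: pass the training dataset as the reference; this is
  // what reaches Dataset::CreateValid and copies the bin mappers.
  if (LGBM_DatasetCreateFromFile("valid.txt", "max_bin=255", train, &valid) != 0) {
    std::fprintf(stderr, "failed to load validation data\n");
    LGBM_DatasetFree(train);
    return 1;
  }

  LGBM_DatasetFree(valid);
  LGBM_DatasetFree(train);
  return 0;
}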

LightGBM/src/io/dataset.cpp

Lines 750 to 752 in fba18e4

    if (bin_mappers.back()->GetDefaultBin() !=
        bin_mappers.back()->GetMostFreqBin()) {
      feature_need_push_zeros_.push_back(i);

Here feature_need_push_zeros_ is just rebuilt from the referenced training data, and it would make no difference if we wrote feature_need_push_zeros_ = dataset->feature_need_push_zeros_ instead.
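A tiny self-contained sketch (using a simplified stand-in for BinMapper, not LightGBM's actual class) of why the two are equivalent: the list is derived purely from the copied bin mappers, so rebuilding it inside the loop yields exactly what a direct copy from the reference would contain.

#include <cassert>
#include <vector>

struct MiniBinMapper {
  int default_bin;
  int most_freq_bin;
};

int main() {
  // Pretend these were copied one by one from the reference dataset.
  std::vector<MiniBinMapper> bin_mappers = {{0, 0}, {1, 0}, {2, 2}, {3, 1}};

  // Variant 1: rebuild, as the loop in Dataset::CreateValid does.
  std::vector<int> rebuilt;
  for (int i = 0; i < static_cast<int>(bin_mappers.size()); ++i) {
    if (bin_mappers[i].default_bin != bin_mappers[i].most_freq_bin) {
      rebuilt.push_back(i);
    }
  }

  // Variant 2: what a direct copy from the reference would contain,
  // because that vector is itself derived from the same bin mappers.
  std::vector<int> copied = {1, 3};

  assert(rebuilt == copied);  // identical contents either way
  return 0;
}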

LightGBM/src/io/dataset.cpp

Lines 758 to 760 in fba18e4

    group_bin_boundaries_.push_back(num_total_bin);
    group_feature_start_[i] = i;
    group_feature_cnt_[i] = 1;

But these lines do indicate an important difference between training data and validation data. In LightGBM, features are bundled together according to their sparsity to speed up training. See Exclusive Feature Bundling in the paper: https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
Since validation data is only used for inference, there's no need to group the features. So we set group_feature_start_[i] = i and group_feature_cnt_[i] = 1 so that each feature remains a group by itself.
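To make the contrast concrete, here is a small self-contained sketch (simplified types and made-up bin counts, not LightGBM's internals) of the trivial per-feature grouping that CreateValid sets up. For training data, EFB would instead merge several sparse, mutually exclusive features into one group, so these arrays would not be this regular.

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical per-feature bin counts copied from the reference dataset.
  const std::vector<int> bins_per_feature = {16, 4, 255, 32};
  const int num_features = static_cast<int>(bins_per_feature.size());

  // Same bookkeeping as the loop in CreateValid, with one feature per group.
  std::vector<uint64_t> group_bin_boundaries = {0};
  std::vector<int> group_feature_start(num_features);
  std::vector<int> group_feature_cnt(num_features);

  uint64_t num_total_bin = 0;
  for (int i = 0; i < num_features; ++i) {
    num_total_bin += bins_per_feature[i];
    group_bin_boundaries.push_back(num_total_bin);
    group_feature_start[i] = i;  // group i starts at feature i ...
    group_feature_cnt[i] = 1;    // ... and contains only that one feature
  }

  for (int i = 0; i < num_features; ++i) {
    std::cout << "group " << i << ": feature " << group_feature_start[i]
              << ", bins [" << group_bin_boundaries[i] << ", "
              << group_bin_boundaries[i + 1] << ")\n";
  }
  return 0;
}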

StrikerRUS (Collaborator) commented

Closed in favor of tracking this in #2302; we decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

StrikerRUS (Collaborator) commented

Implemented in #4089.

#4089 (comment)
