Refactor split evaluator using device-wide Scan primitive #7197
base: master
Conversation
This reverts commit b32a634.
Can someone help me fix …

Let me take a look later today.

Failing test from the JVM package:
…

That's a test asserting that the result should closely match the exact tree method. I removed the constraint in #7214 to get the test to pass, but I haven't investigated the cause yet.
Reproduced the failing gtest. I think there are still some issues with this PR; I compared the results between master and this PR with this simple script:

```python
import numpy as np
import xgboost

np.random.seed(1994)
kRows = 2048
kCols = 100
X = np.random.randn(kRows, kCols)
y = np.random.randn(kRows)
dtrain = xgboost.DMatrix(X, y)
bst = xgboost.train(
    {"tree_method": "gpu_hist"},
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, "validation_0")],
)
```

Output:

```
master:             [0] validation_0-rmse:1.01426
rework-evaluation:  [0] validation_0-rmse:1.01243
```
@trivialfis Thanks for the simple Python script. I managed to fix the bug that was failing it. (The bug had to do with setting the threshold …) Unfortunately, the gtest …
EDIT: The bug is fixed in the latest commit.

Another small counterexample, where categorical data doesn't yield the same result as its one-hot-encoded version:

```python
import pandas as pd
import numpy as np
import xgboost as xgb


def make_categorical(
    n_samples: int, n_features: int, n_categories: int, onehot: bool
):
    rng = np.random.RandomState(1994)
    pd_dict = {}
    for i in range(n_features + 1):
        c = rng.randint(low=0, high=n_categories, size=n_samples)
        pd_dict[str(i)] = pd.Series(c, dtype=np.int64)
    df = pd.DataFrame(pd_dict)
    label = df.iloc[:, 0]
    df = df.iloc[:, 1:]
    for i in range(0, n_features):
        label += df.iloc[:, i]
    label += 1
    df = df.astype("category")
    categories = np.arange(0, n_categories)
    for col in df.columns:
        df[col] = df[col].cat.set_categories(categories)
    if onehot:
        return pd.get_dummies(df), label
    return df, label


rows = 10
cols = 3
rounds = 2
cats = 4

onehot, label = make_categorical(rows, cols, cats, True)
cat, _ = make_categorical(rows, cols, cats, False)

parameters = {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}

m = xgb.DMatrix(onehot, label, enable_categorical=False)
xgb.train(
    parameters,
    m,
    num_boost_round=rounds,
    evals=[(m, "Train")],
)

m = xgb.DMatrix(cat, label, enable_categorical=True)
xgb.train(
    parameters,
    m,
    num_boost_round=rounds,
    evals=[(m, "Train")],
)
```

Output:
…

The correct output is as follows, using the code from the …
Force-pushed from d01b721 to 7b8bf95.
I managed to fix all failing tests in gtest. Fingers crossed.
EDIT: The bug is fixed in the latest commit.

Nope, another failing test, this time with the custom objective:
…
This reverts commit eadb3ce.
```cpp
if (kv.first == "default_left" || kv.first == "split_conditions") {
  auto const& l_arr = get<Array const>(l_obj.at(kv.first));
  auto const& r_arr = get<Array const>(r_obj.at(kv.first));
  ASSERT_EQ(l_arr.size(), r_arr.size());
} else {
  CompareJSON(l_obj.at(kv.first), r_obj.at(kv.first));
}
```
This change allows two models to differ in `default_left` and `split_conditions` but still requires them to produce identical `loss_chg`. The relaxation is necessary because the test produced two split candidates with identical `loss_chg` and `findex` but different `default_left` and `split_conditions`. This scenario occurred because there were no missing values for that particular feature.
All gtests are now fixed. One last test to fix, this time from the multi-GPU side:
…
I believe I've fixed all the tests that I've seen fail so far. Hopefully there are no more issues. Fingers crossed.
All tests have passed. cc @trivialfis @RAMitchell
```cpp
 */
dh::device_vector<DeviceSplitCandidate> out_reduce(3);
GPUTrainingParam param = left.param;
thrust::reduce_by_key(
```
Note: we may want to use `cub::DeviceSegmentedReduce` instead.
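For reference, a hedged sketch of what that alternative could look like, reusing the update-the-better-candidate semantics from the lambda elsewhere in this diff. `SegmentedReduce`, `ReduceOp`, and the offsets layout are illustrative names, not code from this PR:

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>

// Binary reduction functor mirroring the lambda used in the PR:
// keep the better of two candidates under the training parameters.
struct ReduceOp {
  GPUTrainingParam param;
  __device__ DeviceSplitCandidate operator()(DeviceSplitCandidate l,
                                             DeviceSplitCandidate r) const {
    l.Update(r, param);
    return l;
  }
};

// d_offsets holds {0, s1, s2, n}: entries i and i+1 bound segment i.
void SegmentedReduce(DeviceSplitCandidate const* d_in,
                     DeviceSplitCandidate* d_out,
                     uint32_t const* d_offsets, GPUTrainingParam param) {
  size_t temp_bytes = 0;  // first call only queries the scratch size
  cub::DeviceSegmentedReduce::Reduce(nullptr, temp_bytes, d_in, d_out,
                                     /*num_segments=*/3, d_offsets,
                                     d_offsets + 1, ReduceOp{param},
                                     DeviceSplitCandidate{});
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(temp.data().get(), temp_bytes, d_in,
                                     d_out, /*num_segments=*/3, d_offsets,
                                     d_offsets + 1, ReduceOp{param},
                                     DeviceSplitCandidate{});
}
```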
Codecov Report
```
@@           Coverage Diff           @@
##           master    #7197   +/-   ##
=======================================
  Coverage   82.76%   82.76%
=======================================
  Files          13       13
  Lines        4061     4061
=======================================
  Hits         3361     3361
  Misses        700      700
```

Continue to review the full report at Codecov.
Some minor improvements could be made around the hard-to-read `if` statements, but mostly this looks good.
As discussed, we will need performance analysis before merging.
```cpp
TreeEvaluator::SplitEvaluator<GPUTrainingParam> evaluator,
EvaluateSplitInputs<GradientSumT> left,
EvaluateSplitInputs<GradientSumT> right) {
  auto l_n_features = left.feature_segments.empty() ? 0 : left.feature_segments.size() - 1;
```
In what situations do we get empty feature segments?
When the `left` parameter is set to `{}`. This case is handled in a short `if` block labeled "trivial split"; see the sketch below.
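A hedged sketch of the kind of early exit being described; the variable names are borrowed from the surrounding snippets, not copied from the PR:

```cuda
// Trivial split: nothing to evaluate, so emit default (empty)
// candidates and skip the scan/reduce pipeline entirely.
if (l_n_features == 0 && r_n_features == 0) {
  thrust::fill(thrust::device, out_splits.data(),
               out_splits.data() + out_splits.size(),
               DeviceSplitCandidate{});
  return;
}
```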
```cpp
  l.Update(r, param);
  return l;
});
if (right.gradient_histogram.empty()) {
```
Does this edge case occur?
Root node.
```cpp
});
/**
 * Perform segmented reduce to find the best split candidate per node.
 * Note that there will be THREE segments:
```
Why are there three segments? It looks like you're only using two output values.
This is because we first generate candidate splits with a forward pass over the gradient histogram, in order of ascending `node_idx` (0 or 1), and then generate candidate splits with a backward pass over the histogram, in order of descending `node_idx`. So the list of candidate splits consists of three segments:

[segment for left child node] [segment for right child node] [segment for left child node]

After the segmented reduce, a small kernel launch copies the three entries of `out_reduce` into the two entries of `out_splits`.
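A rough sketch of that copy kernel under the naming used above (illustrative, not the PR's actual kernel; it assumes the `dh::ToSpan`/`DeviceSplitCandidate::Update` usage shown elsewhere in this diff). Segments 0 and 2 both belong to the left child, so their winners must be combined:

```cuda
dh::LaunchN(2, [out_reduce = dh::ToSpan(out_reduce), out_splits,
                param] __device__(std::size_t i) {
  if (i == 0) {
    // Left child: merge the forward-pass winner (segment 0)
    // with the backward-pass winner (segment 2).
    DeviceSplitCandidate best = out_reduce[0];
    best.Update(out_reduce[2], param);
    out_splits[0] = best;
  } else {
    out_splits[1] = out_reduce[1];  // Right child: segment 1 only.
  }
});
```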
```cpp
ret.forward = entry.forward;
ret.gpair = split_input.gradient_histogram[entry.hist_idx];
ret.parent_sum = split_input.parent_sum;
if (((entry.node_idx == 0) &&
```
This section of code is hard to understand and could be improved.
I will create a small utility function that checks whether `feature_set` contains `ret.findex`.
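Such a helper might look like the following; this is a sketch, and the name as well as the assumption that `feature_set` is sorted are mine:

```cuda
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>

// True iff findex appears in the (sorted) feature sampling set.
__device__ bool HasFeature(common::Span<const bst_feature_t> feature_set,
                           int32_t findex) {
  return thrust::binary_search(thrust::seq, feature_set.cbegin(),
                               feature_set.cend(), findex);
}
```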
```cpp
// Segmented Scan
  return rhs;
}
if (((lhs.node_idx == 0) &&
```
Same here.
```cpp
EvaluateSplits(out_split, evaluator, input, {});
```

```cpp
ReduceElem<GradientSumT>
__device__ ReduceValueOp<GradientSumT>::operator() (ScanElem<GradientSumT> e) {
  ReduceElem<GradientSumT> ret;
```
Can readability be improved here?
Can you be more specific? All this function does is initialize the members of `ReduceElem`. Do you find the computation of `left_sum` and `right_sum` confusing? Any suggestions are appreciated.
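For context, the usual relationship in split evaluation, sketched here under the assumption that `ScanElem` carries the `forward`, `partial_sum`, and `parent_sum` fields shown in this diff (this is the idea, not the PR's exact code):

```cuda
// In a forward pass the scanned prefix is the left child's sum;
// in a backward pass it is the right child's. The other side is
// always the parent sum minus the prefix.
template <typename GradientSumT>
__device__ void ComputeChildSums(ScanElem<GradientSumT> const& e,
                                 GradientSumT* left_sum,
                                 GradientSumT* right_sum) {
  if (e.forward) {
    *left_sum = e.partial_sum;
    *right_sum = e.parent_sum - e.partial_sum;
  } else {
    *right_sum = e.partial_sum;
    *left_sum = e.parent_sum - e.partial_sum;
  }
}
```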
Initial review. Will run some benchmarks.
```cpp
auto l_n_features = left.feature_segments.empty() ? 0 : left.feature_segments.size() - 1;
auto r_n_features = right.feature_segments.empty() ? 0 : right.feature_segments.size() - 1;
if (!(r_n_features == 0 || l_n_features == r_n_features)) {
  throw std::runtime_error("Invariant violated");
```
`LOG(FATAL)`?
```cpp
 * [segment for left child node] [segment for right child node] [segment for left child node]
 * This is due to how we perform forward and backward passes over the gradient histogram.
 */
dh::device_vector<DeviceSplitCandidate> out_reduce(3);
```
Maybe use `dh::TemporaryArray`, which uses a caching allocator.
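Presumably a one-line swap, assuming `dh::TemporaryArray` is a drop-in replacement for `dh::device_vector` at this call site:

```cuda
// Scratch buffer served by the caching allocator, so repeated
// EvaluateSplits() calls avoid a fresh device allocation each time.
dh::TemporaryArray<DeviceSplitCandidate> out_reduce(3);
```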
```cpp
  return l;
});
if (right.gradient_histogram.empty()) {
  dh::LaunchN(1, [out_reduce = dh::ToSpan(out_reduce), out_splits] __device__(std::size_t) {
```
Maybe we can fuse this kernel with the one here: `xgboost/src/tree/updater_gpu_hist.cu`, line 342 in d27a427:

```cpp
dh::LaunchN(2, [=] __device__(size_t idx) {
```
Maybe. However, it would make the `EvaluateSplits()` function harder to reason about.
From recent profiling results, small kernels have a significant impact when we grow deeper trees. But I agree that fusing them now would be premature optimization. Let's see the results first. ;-)
```cpp
  // size <= i < size * 2
  return thrust::make_tuple(size * 2 - 1 - i, false);
} else {
  return thrust::make_tuple(static_cast<uint32_t>(0), false);
```
Does this happen?
It should not; I put it there just in case.
You can put a `SPAN_CHECK` here as a runtime assert.
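Putting the pieces together, the mapping under discussion might then read like this (a sketch; the function name is mine, and `SPAN_CHECK` is assumed to be xgboost's span-checking macro):

```cuda
#include <thrust/tuple.h>
#include <xgboost/span.h>  // SPAN_CHECK

// Maps a flat scan index onto the histogram: the first half walks
// forward, the second half walks backward over the same bins.
__device__ thrust::tuple<uint32_t, bool> MapScanIndex(uint32_t i,
                                                      uint32_t size) {
  if (i < size) {            // 0 <= i < size: forward pass
    return thrust::make_tuple(i, true);
  }
  SPAN_CHECK(i < size * 2);  // runtime assert instead of a dead branch
  // size <= i < size * 2: backward pass
  return thrust::make_tuple(size * 2 - 1 - i, false);
}
```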
```cpp
template <typename GradientSumT>
struct ScanElem {
  uint32_t node_idx;
```
`bst_node_t`.
```cpp
GradientSumT partial_sum{0.0, 0.0};
GradientSumT parent_sum{0.0, 0.0};
float loss_chg{std::numeric_limits<float>::lowest()};
int32_t findex{-1};
```
`bst_feature_t`.
Instead of performing a per-block scan, perform a segmented scan using the entire GPU. We simulate the segmented scan with a custom scan operator. This refactor will help balance work and raise GPU utilization.
Original idea by @trivialfis
WIP, as some tests are failing.
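To illustrate the core trick with a minimal, self-contained sketch (simplified element types, not the PR's actual operator): a segmented scan can be simulated with an ordinary `thrust::inclusive_scan` whose binary operator restarts the running sum whenever the segment id changes.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdint>
#include <vector>

struct Elem {
  uint32_t node_idx;  // segment id
  float partial_sum;  // value being scanned
};

struct SegmentedSumOp {
  __host__ __device__ Elem operator()(Elem lhs, Elem rhs) const {
    if (lhs.node_idx != rhs.node_idx) {
      return rhs;  // rhs starts a new segment: drop the running sum
    }
    rhs.partial_sum += lhs.partial_sum;
    return rhs;
  }
};

int main() {
  // Two segments: node 0 and node 1.
  std::vector<Elem> h = {{0, 1.f}, {0, 2.f}, {0, 3.f}, {1, 4.f}, {1, 5.f}};
  thrust::device_vector<Elem> v(h.begin(), h.end());
  thrust::inclusive_scan(v.begin(), v.end(), v.begin(), SegmentedSumOp{});
  // partial_sum is now {1, 3, 6, 4, 9}: the sum resets at node 1.
  return 0;
}
```

The operator is associative, which is what lets a single device-wide scan replace the per-block version.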