Fixes for large size clusters. #10880
Conversation
Commits: Debug GCP. log. don't clean. Disable log. Logs. More. Less. check grid size. work on test. finalise. Log early. Type. build. Log. Remove. Cleanup.
Leaving only a few comments as I'm not very familiar with XGBoost's internals. Overall, looks good.
```diff
@@ -56,6 +57,14 @@ SockAddrV4 SockAddrV4::InaddrAny() { return MakeSockAddress("0.0.0.0", 0).V4(); }
 SockAddrV6 SockAddrV6::Loopback() { return MakeSockAddress("::1", 0).V6(); }
 SockAddrV6 SockAddrV6::InaddrAny() { return MakeSockAddress("::", 0).V6(); }
+
+[[nodiscard]] Result TCPSocket::Listen(std::int32_t backlog) {
+  backlog = std::max(backlog, 256);
```
Since you're imposing a minimum of 256, should that be added to the method docs as well?
Will add a brief mention, thank you for pointing this out.
Done.
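For illustration, a doc note along these lines would capture the behavior (a sketch with hypothetical wording, not necessarily the committed comment):

```cpp
/**
 * @brief Listen for incoming connections.
 *
 * @param backlog Hint for the length of the pending-connection queue.
 *                Values below 256 are raised to 256, so the effective
 *                minimum backlog is 256.
 */
[[nodiscard]] Result Listen(std::int32_t backlog);
```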
```diff
@@ -123,7 +123,8 @@ RabitTracker::RabitTracker(Json const& config) : Tracker{config} {
     listener_ = TCPSocket::Create(addr.IsV4() ? SockDomain::kV4 : SockDomain::kV6);
     return listener_.Bind(host_, &this->port_);
   } << [&] {
-    return listener_.Listen();
+    CHECK_GT(this->n_workers_, 0);
+    return listener_.Listen(this->n_workers_);
```
What happens now if this->n_workers_ > 256? Will those connections sit in the backlog and be processed only after some of the first 256 are handled?
Yeah, the listener puts them in the backlog while they wait to be accepted. The tracker handles workers whenever they connect, first come, first served.
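For anyone unfamiliar with the semantics: the backlog passed to listen is the length of the OS queue for completed-but-not-yet-accepted connections, and accept simply drains that queue in arrival order. A minimal POSIX sketch (illustrative only, not XGBoost code; error handling omitted):

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = 0;  // let the OS pick a free port
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  // Up to 256 completed connections may wait in the kernel queue; further
  // attempts are queued or retried by clients until accept() makes room.
  listen(fd, 256);
  for (;;) {
    int conn = accept(fd, nullptr, nullptr);  // dequeues the oldest pending connection
    if (conn < 0) break;
    close(conn);  // handle one worker, then accept the next: first come, first served
  }
  close(fd);
  return 0;
}
```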
```cuda
template <typename L>
__global__ void LaunchNKernel(int device_idx, size_t begin, size_t end,
                              L lambda) {
  for (auto i : GridStrideRange(begin, end)) {
    lambda(i, device_idx);
  }
}
```
From what I can see this is being removed because it's not used anywhere, right?
It's not used, small cleanup.
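For context on what the removed helper did: it's the standard grid-stride loop pattern, where each thread strides by the total number of launched threads so any grid size covers the whole range. A self-contained sketch of the same idea without the GridStrideRange helper (assumed semantics, not XGBoost's implementation):

```cuda
#include <cstddef>

// Grid-stride loop: thread i handles elements i, i + stride, i + 2*stride, ...
// where stride is the total number of launched threads.
__global__ void ScaleKernel(std::size_t n, float a, float* data) {
  std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (std::size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    data[i] *= a;
  }
}

// Launched e.g. as: ScaleKernel<<<(n + 255) / 256, 256>>>(n, a, data);
```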
```cpp
[] XGBOOST_DEVICE(bst_idx_t ridx, std::int32_t /*nidx_in_batch*/, RegTree::Node) {
  return ridx < 3;
});
ASSERT_EQ(partitioner.GetNumNodes(), 3);
```
Cumulative sum of left and right indices, in line 197 above: 1 + 2 = 3.
The n_nodes_ internal variable is used for sanity checks; it's initialized to 1, with the root node included. The reasoning behind this choice is that we don't update the partitioner if there's no split for the root node, but n_nodes >= 1 is an invariant in all cases since we must have a root node.
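A minimal sketch of that invariant (assumed semantics: the count starts at the root and each applied binary split adds two children; NumNodesAfterSplits is a hypothetical helper, not XGBoost API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the invariant described above.
std::int32_t NumNodesAfterSplits(std::int32_t n_splits) {
  return 1 + 2 * n_splits;  // root plus two children per split, so n_nodes >= 1 always
}

int main() {
  assert(NumNodesAfterSplits(0) == 1);  // no split applied: root only
  assert(NumNodesAfterSplits(1) == 3);  // one split: 1 + 2 = 3, matching the test above
  return 0;
}
```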
Looks good, thanks @trivialfis!
- Increase listener backlog.
- Check for empty kernels.
Ideally, we should use n_workers for the backlog in listen. However, during bootstrap, the workers don't have any information about the communication group other than the tracker address. We could add an additional communication protocol for obtaining n_workers before starting to listen, but that requires careful testing. We will backport the one-line change for now.
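To make the effect concrete, a hedged sketch of the two call sites after this change (illustrative shapes mirroring the diffs above, not the actual code paths):

```cpp
#include <algorithm>
#include <cstdint>

// The floor applied inside TCPSocket::Listen per the first diff above.
std::int32_t EffectiveBacklog(std::int32_t requested) {
  return std::max(requested, 256);
}

int main() {
  // The tracker knows n_workers and passes it explicitly (second diff above).
  std::int32_t n_workers = 1024;  // hypothetical large cluster
  std::int32_t tracker_backlog = EffectiveBacklog(n_workers);  // 1024
  // Workers don't know n_workers at bootstrap, so they rely on the raised minimum.
  std::int32_t worker_backlog = EffectiveBacklog(1);  // raised to 256
  return (tracker_backlog == 1024 && worker_backlog == 256) ? 0 : 1;
}
```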