Fixes for large size clusters. #10880

Merged
merged 7 commits into from
Oct 14, 2024

trivialfis
Member

@trivialfis trivialfis commented Oct 10, 2024

  • Increase listener backlog.
  • Check for empty kernels.

Ideally, the backlog passed to listen should be n_workers. However, during bootstrap the workers know nothing about the communication group except the tracker address. We could add an extra protocol step to obtain n_workers before a worker starts listening, but that requires careful testing. For now, we will backport only the one-line change.
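
For readers unfamiliar with the backlog argument, the one-line change can be pictured with the minimal sketch below. It assumes a POSIX socket API; `kMinBacklog`, `EffectiveBacklog`, and `ListenWithFloor` are hypothetical names for illustration, not XGBoost functions. The point is that `std::max` acts as a floor: small requests are raised to 256, larger requests pass through unchanged.

```cpp
#include <algorithm>
#include <cstdint>

#include <sys/socket.h>  // POSIX listen()

// Hypothetical constant: the smallest backlog we are willing to request.
constexpr std::int32_t kMinBacklog = 256;

// Hypothetical helper: raise small requests to the floor.
// Requests above the floor are returned unchanged.
inline std::int32_t EffectiveBacklog(std::int32_t requested) {
  return std::max(requested, kMinBacklog);
}

// Start listening on an already-bound socket; returns true on success.
// The OS may still cap the queue length (on Linux, at net.core.somaxconn).
inline bool ListenWithFloor(int fd, std::int32_t requested) {
  return ::listen(fd, EffectiveBacklog(requested)) == 0;
}
```

Under these assumptions, a worker that does not know the group size can pass any small value and still get a queue of at least 256 pending connections, subject to the OS limit.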

Debug GCP.

log.

don't clean.

Disable log.

Logs.

More.

Less.

check grid size.

work on test.

finalise.

Log early.

Type.

build.

Log.

Remove.

Cleanup.
@trivialfis trivialfis changed the title from "[Don't Merge] Debug gcp" to "Fixes for large size clusters." Oct 11, 2024
@trivialfis trivialfis mentioned this pull request Oct 11, 2024
10 tasks
@trivialfis trivialfis marked this pull request as ready for review October 14, 2024 05:20

@pentschev pentschev left a comment

Leaving only a few comments as I'm not very familiar with XGBoost's internals. Overall, looks good.

@@ -56,6 +57,14 @@ SockAddrV4 SockAddrV4::InaddrAny() { return MakeSockAddress("0.0.0.0", 0).V4();
 SockAddrV6 SockAddrV6::Loopback() { return MakeSockAddress("::1", 0).V6(); }
 SockAddrV6 SockAddrV6::InaddrAny() { return MakeSockAddress("::", 0).V6(); }
 
+[[nodiscard]] Result TCPSocket::Listen(std::int32_t backlog) {
+  backlog = std::max(backlog, 256);

Since you're imposing a limit of 256, should that be added to the method docs as well?

Member Author

@trivialfis trivialfis Oct 14, 2024

Will add a brief mention, thank you for pointing this out.

Member Author

Done.

@@ -123,7 +123,8 @@ RabitTracker::RabitTracker(Json const& config) : Tracker{config} {
     listener_ = TCPSocket::Create(addr.IsV4() ? SockDomain::kV4 : SockDomain::kV6);
     return listener_.Bind(host_, &this->port_);
   } << [&] {
-    return listener_.Listen();
+    CHECK_GT(this->n_workers_, 0);
+    return listener_.Listen(this->n_workers_);

What happens now if this->n_workers_ > 256, will those be in the backlog and processed only after some of the first 256 listeners terminate?

Member Author

@trivialfis trivialfis Oct 14, 2024

Yeah, the listener should queue them in the backlog as long as the queue is not full. The tracker handles workers as they connect, first come, first served.

Comment on lines -206 to -212
-template <typename L>
-__global__ void LaunchNKernel(int device_idx, size_t begin, size_t end,
-                              L lambda) {
-  for (auto i : GridStrideRange(begin, end)) {
-    lambda(i, device_idx);
-  }
-}

From what I can see this is being removed because it's not used anywhere, right?

Member Author

It's not used, small cleanup.
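
The "Check for empty kernels" item in the description is not shown in the excerpts above. As a rough illustration of the general idea only, the host-side sketch below skips a launch when there is no work, since CUDA rejects a kernel launch whose grid dimension is zero. `GridSizeFor` and `LaunchIfNonEmpty` are hypothetical names, and the launch itself is abstracted as a callback so the sketch stays plain C++; how the PR actually implements the check is not confirmed here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of blocks needed to cover n items, rounded up.
// Returns 0 when there is no work (or no threads per block).
inline std::uint32_t GridSizeFor(std::size_t n, std::uint32_t block_threads) {
  if (n == 0 || block_threads == 0) {
    return 0;
  }
  return static_cast<std::uint32_t>((n + block_threads - 1) / block_threads);
}

// Hypothetical wrapper: call `launch(grid, block)` only when the grid is
// non-empty. Returns true if the callback ran.
template <typename Launch>
bool LaunchIfNonEmpty(std::size_t n, std::uint32_t block_threads, Launch&& launch) {
  std::uint32_t grid = GridSizeFor(n, block_threads);
  if (grid == 0) {
    return false;  // empty range: skip the launch entirely
  }
  launch(grid, block_threads);
  return true;
}
```

In a real CUDA build the callback would wrap the `<<<grid, block>>>` launch; an empty input then becomes a no-op instead of an invalid-configuration error.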

[] XGBOOST_DEVICE(bst_idx_t ridx, std::int32_t /*nidx_in_batch*/, RegTree::Node) {
return ridx < 3;
});
ASSERT_EQ(partitioner.GetNumNodes(), 3);

Cumulative sum of left and right indices, in line 197 above: 1 + 2 = 3.

Member Author

The internal n_nodes_ variable is used for sanity checks; it is initialized to 1 to account for the root node. The reasoning is that we don't update the partitioner when the root node has no split, yet n_nodes >= 1 must hold in all cases because there is always a root node.
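
The counting argument behind the expected value of 3 can be sketched as follows. `NodeCounter` is a hypothetical stand-in for illustration, not the partitioner class from the PR; it only assumes that the count starts at 1 for the root and that each split adds a left and a right child.

```cpp
#include <cstdint>

// Hypothetical stand-in for the partitioner's node bookkeeping.
class NodeCounter {
 public:
  // Current number of nodes; never less than 1 because the root always exists.
  std::int32_t NumNodes() const { return n_nodes_; }

  // Splitting one node adds a left child and a right child.
  void ApplySplit() { n_nodes_ += 2; }

 private:
  std::int32_t n_nodes_{1};  // the root node is counted from the start
};
```

With this model, a tree whose root is never split reports 1 node, and a single root split gives 1 + 2 = 3, matching the assertion in the test.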


@pentschev pentschev left a comment

Looks good, thanks @trivialfis !

@trivialfis trivialfis merged commit 347bb14 into dmlc:master Oct 14, 2024
31 checks passed
@trivialfis trivialfis deleted the debug-gcp branch October 14, 2024 16:11
trivialfis added a commit to trivialfis/xgboost that referenced this pull request Oct 16, 2024
- Increase listener backlog.
- Check for empty kernels.
trivialfis added a commit that referenced this pull request Oct 17, 2024
- Increase listener backlog.
- Check for empty kernels.

3 participants