[rabit_bootstrap_cache] Failed xgb worker recovery from other workers #4808

Merged Sep 17, 2019 (39 commits)

Commits
cfa48ff  rebase and pull rabit with latest feature (chenqin, Aug 25, 2019)
c763e68  avoid pulling fault-tolerant rabit on the windows platform (Aug 25, 2019)
fb484fb  detect windows 64 (chenqin, Aug 25, 2019)
4889c2a  Revert "detect windows 64" (chenqin, Aug 26, 2019)
df6f715  patch rabit bootstrap cache to xgb (except checkpoint) (chenqin, Aug 26, 2019)
d8a8a3e  add max_depth, tree_method into checkpoint (Aug 27, 2019)
1ec7dc7  pass configs when loading model (Aug 27, 2019)
3dfa7e4  save missed configs to checkpoint (Aug 27, 2019)
6734f9a  add support for xgb worker failure recovery for approx (chenqin, Aug 28, 2019)
b96888a  Merge branch 'master' into rabit_dist (chenqin, Aug 28, 2019)
6e1bdca  work around booster save/load inconsistency, leave fix to item 3 in ht… (chenqin, Aug 29, 2019)
e2369d3  Merge branch 'rabit_dist' of github.com:chenqin/xgboost into rabit_dist (chenqin, Aug 29, 2019)
2bde716  check if rabit_bootstrap_cache is set before writing to checkpoint (Aug 29, 2019)
e30e45e  fix clang-tidy (Aug 30, 2019)
82bbfbe  remove unnecessary changes, add hist test (Aug 30, 2019)
98034e2  revert CMake file change on win32 check (Aug 30, 2019)
4e134f9  remove n_gpus (Sep 5, 2019)
61615c0  point to latest rabit, remove is_bootstrap (Sep 5, 2019)
ece34c9  visual studio doesn't support latest openmp (Sep 5, 2019)
fc1a39f  clean up xgb_recovery test scripts (Sep 5, 2019)
d31fe75  include linux environment (chenqin, Sep 6, 2019)
283ed5b  avoid signature issue on osx (chenqin, Sep 6, 2019)
d77ad53  remove linux environment (chenqin, Sep 6, 2019)
a006266  fix openmp vs support (chenqin, Sep 6, 2019)
884736d  misc (chenqin, Sep 6, 2019)
15fd6a6  update rabit (chenqin, Sep 7, 2019)
4e90a6e  Revert "update rabit" (Sep 7, 2019)
a65f065  apply memory access optimized customized reduction (chenqin, Sep 8, 2019)
7865e6b  Revert "apply memory access optimized customized reduction" (chenqin, Sep 8, 2019)
656d43c  update rabit (chenqin, Sep 8, 2019)
c34de2f  update rabit with is_bootstrap parameter removed (Sep 9, 2019)
337f1a8  switch base to dmlc/rabit master (Sep 10, 2019)
ac90063  per feedback (chenqin, Sep 11, 2019)
0a38691  Merge branch 'rabit_dist' of github.com:chenqin/xgboost into rabit_dist (chenqin, Sep 11, 2019)
3246613  per feedback, move distributed xgboost recovery to jenkins (chenqin, Sep 13, 2019)
03c790a  try fix jenkins (chenqin, Sep 13, 2019)
b822358  use exec in runxgb shell script (Sep 12, 2019)
d81965b  try fix jenkins missing xgboost exe (Sep 13, 2019)
9e94f0f  try fix path in jenkins (Sep 13, 2019)
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,7 +1,7 @@
# disable sudo for container build.
sudo: required

-# Enabling test on Linux and OS X
+# Enabling test on OS X
os:
- osx

23 changes: 16 additions & 7 deletions CMakeLists.txt
@@ -32,6 +32,7 @@ option(GOOGLE_TEST "Build google tests" OFF)
option(USE_DMLC_GTEST "Use google tests bundled with dmlc-core submodule (EXPERIMENTAL)" OFF)
option(USE_NVTX "Build with cuda profiling annotations. Developers only." OFF)
set(NVTX_HEADER_DIR "" CACHE PATH "Path to the stand-alone nvtx header")
+option(RABIT_MOCK "Build rabit with mock" OFF)
## CUDA
option(USE_CUDA "Build with GPU acceleration" OFF)
option(USE_NCCL "Build with NCCL to enable distributed GPU support." OFF)
@@ -88,17 +89,25 @@ list(APPEND LINKED_LIBRARIES_PRIVATE dmlc)

# rabit
# full rabit doesn't build on windows, so we can't import it as subdirectory
-if(MINGW OR R_LIB)
+if(MINGW OR R_LIB OR WIN32)
   set(RABIT_SOURCES
     rabit/src/engine_empty.cc
     rabit/src/c_api.cc)
 else ()
-  set(RABIT_SOURCES
-    rabit/src/allreduce_base.cc
-    rabit/src/allreduce_robust.cc
-    rabit/src/engine.cc
-    rabit/src/c_api.cc)
-endif (MINGW OR R_LIB)
+  if(RABIT_MOCK)
+    set(RABIT_SOURCES
+      rabit/src/allreduce_base.cc
+      rabit/src/allreduce_robust.cc
+      rabit/src/engine_mock.cc
+      rabit/src/c_api.cc)
+  else()
+    set(RABIT_SOURCES
+      rabit/src/allreduce_base.cc
+      rabit/src/allreduce_robust.cc
+      rabit/src/engine.cc
+      rabit/src/c_api.cc)
+  endif(RABIT_MOCK)
+endif (MINGW OR R_LIB OR WIN32)
add_library(rabit STATIC ${RABIT_SOURCES})
target_include_directories(rabit PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}/dmlc-core/include>
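Usage note: the RABIT_MOCK switch added above is what the CI changes below rely on. tests/ci_build/build_mock_cmake.sh (added at the bottom of this diff) configures the build with -DRABIT_MOCK=ON, so rabit is compiled from engine_mock.cc instead of engine.cc and test runs can instrument collective calls to kill chosen workers.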
32 changes: 32 additions & 0 deletions Jenkinsfile
@@ -56,6 +56,7 @@ pipeline {
script {
parallel ([
'build-cpu': { BuildCPU() },
+'build-cpu-rabit-mock': { BuildCPUMock() },
'build-gpu-cuda9.0': { BuildCUDA(cuda_version: '9.0') },
'build-gpu-cuda10.0': { BuildCUDA(cuda_version: '10.0') },
'build-gpu-cuda10.1': { BuildCUDA(cuda_version: '10.1') },
@@ -76,6 +77,7 @@
'test-python-gpu-cuda10.0': { TestPythonGPU(cuda_version: '10.0') },
'test-python-gpu-cuda10.1': { TestPythonGPU(cuda_version: '10.1') },
'test-python-mgpu-cuda10.1': { TestPythonGPU(cuda_version: '10.1', multi_gpu: true) },
+'test-cpp-rabit': { TestCppRabit() },
'test-cpp-gpu': { TestCppGPU(cuda_version: '10.1') },
'test-cpp-mgpu': { TestCppGPU(cuda_version: '10.1', multi_gpu: true) },
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '2.4.3') },
@@ -185,6 +187,22 @@ def BuildCPU() {
}
}

+def BuildCPUMock() {
+  node('linux && cpu') {
+    unstash name: 'srcs'
+    echo "Build CPU with rabit mock"
+    def container_type = "cpu"
+    def docker_binary = "docker"
+    sh """
+    ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_mock_cmake.sh
+    """
+    echo 'Stashing rabit C++ test executable (xgboost)...'
+    stash name: 'xgboost_rabit_tests', includes: 'xgboost'
+    deleteDir()
+  }
+}


def BuildCUDA(args) {
node('linux && cpu') {
unstash name: 'srcs'
@@ -279,6 +297,20 @@ def TestPythonGPU(args) {
}
}

+def TestCppRabit() {
+  node(nodeReq) {
+    unstash name: 'xgboost_rabit_tests'
+    unstash name: 'srcs'
+    echo "Test C++, rabit mock on"
+    def container_type = "cpu"
+    def docker_binary = "docker"
+    sh """
+    ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/runxgb.sh xgboost tests/ci_build/approx.conf.in
+    """
+    deleteDir()
+  }
+}
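This stage consumes the xgboost binary stashed by BuildCPUMock and drives it through tests/ci_build/runxgb.sh with the tests/ci_build/approx.conf.in training config; both files are added at the bottom of this diff.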

def TestCppGPU(args) {
nodeReq = (args.multi_gpu) ? 'linux && mgpu' : 'linux && gpu'
node(nodeReq) {
2 changes: 2 additions & 0 deletions src/common/hist_util.cc
@@ -303,6 +303,8 @@ void DenseCuts::Init
}
CHECK_EQ(summary_array.size(), in_sketchs->size());
size_t nbytes = WXQSketch::SummaryContainer::CalcMemCost(max_num_bins * kFactor);
+// TODO(chenqin): rabit failure recovery assumes no bootstrap one-time call after LoadCheckPoint;
+// we need to move this allreduce before the LoadCheckPoint call in the future
sreducer.Allreduce(dmlc::BeginPtr(summary_array), nbytes, summary_array.size());
p_cuts_->min_vals_.resize(sketchs.size());

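To make the TODO concrete, here is a minimal sketch of the ordering constraint against rabit's public checkpoint API (illustrative code, not xgboost's; the Model type, buffer sizes, and round count are invented for the example). A relaunched worker resumes from LoadCheckPoint and never re-executes startup-only code, so a one-time collective placed after LoadCheckPoint would leave the surviving workers blocked on it, unless, as this PR arranges, the bootstrap cache can replay its result to the recovering worker.

```cpp
#include <rabit/rabit.h>

// Trivial global model so the checkpoint calls have something to carry.
struct Model : public rabit::Serializable {
  float weight {0.0f};
  void Load(rabit::Stream *fi) override { fi->Read(&weight, sizeof(weight)); }
  void Save(rabit::Stream *fo) const override {
    fo->Write(&weight, sizeof(weight));
  }
};

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);
  Model model;

  // Bootstrap-time collective: runs exactly once, so keep it *before*
  // LoadCheckPoint; recovery then never needs to replay it mid-training.
  float sketch[8] = {0};
  rabit::Allreduce<rabit::op::Sum>(sketch, 8);

  int version = rabit::LoadCheckPoint(&model);  // 0 on a fresh start
  for (int iter = version; iter < 10; ++iter) {
    // Per-iteration collectives are covered by rabit's normal replay path.
    rabit::Allreduce<rabit::op::Sum>(&model.weight, 1);
    rabit::CheckPoint(&model);  // advances the recoverable version number
  }
  rabit::Finalize();
  return 0;
}
```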
2 changes: 1 addition & 1 deletion src/common/random.h
@@ -127,7 +127,7 @@ class ColumnSampler {
*/
ColumnSampler() {
uint32_t seed = common::GlobalRandom()();
-rabit::Broadcast(&seed, sizeof(seed), 0);
+rabit::Broadcast(&seed, sizeof(seed), 0, "seed");
rng_.seed(seed);
}

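The trailing string argument added in this hunk is the visible surface of the bootstrap cache, and the data.cc hunk below passes the input file name the same way so each loaded matrix gets its own cache entry. A hedged sketch of the two call shapes, inferred from the changed lines in this diff (the four- and five-argument signatures are the PR's modified rabit, not upstream's documented API): the string names the call site so the result of a one-time collective can be stored and replayed to a recovering worker.

```cpp
// Sketch of the tagged collectives, matching the changed lines in this diff.
#include <cstdint>
#include <rabit/rabit.h>

void BootstrapCollectives() {
  // random.h hunk: every worker adopts rank 0's RNG seed, cached under "seed".
  uint32_t seed = 0;
  rabit::Broadcast(&seed, sizeof(seed), 0, "seed");

  // learner.cc hunk: agree on the maximum feature count across workers,
  // cached under "num_feature"; prepare_fun/prepare_arg stay nullptr.
  unsigned num_feature = 0;
  rabit::Allreduce<rabit::op::Max>(&num_feature, 1, nullptr, nullptr,
                                   "num_feature");
}
```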
3 changes: 2 additions & 1 deletion src/data/data.cc
@@ -229,7 +229,8 @@ DMatrix* DMatrix::Load(const std::string& uri,
/* sync up number of features after matrix loaded.
* partitioned data will fail the train/val validation check
* since partitioned data not knowing the real number of features. */
-rabit::Allreduce<rabit::op::Max>(&dmat->Info().num_col_, 1);
+rabit::Allreduce<rabit::op::Max>(&dmat->Info().num_col_, 1, nullptr,
+                                 nullptr, fname.c_str());
// backward compatiblity code.
if (!load_row_split) {
MetaInfo& info = dmat->Info();
24 changes: 21 additions & 3 deletions src/learner.cc
@@ -270,6 +270,9 @@ class LearnerImpl : public Learner {
kv.second = "cpu_predictor";
LOG(INFO) << "Switch gpu_predictor to cpu_predictor.";
}
+if (saved_configs_.find(saved_param) != saved_configs_.end()) {
+  cfg_[saved_param] = kv.second;
+}
}
}
attributes_ = std::map<std::string, std::string>(attr.begin(), attr.end());
@@ -302,6 +305,10 @@
p_metric->Configure({cfg_.begin(), cfg_.end()});
}

+// copy dsplit from config since it will not run again during restore
+if (tparam_.dsplit == DataSplitMode::kAuto && rabit::IsDistributed()) {
+  tparam_.dsplit = DataSplitMode::kRow;
+}
this->configured_ = true;
}
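The pinning above matters for recovery: per the in-diff comment, Configure() does not run again when a restarted worker restores from a checkpoint, so a dsplit left at kAuto could be resolved differently by the recovered worker than by the healthy ones; fixing it to row split up front in any distributed (rabit::IsDistributed()) job keeps all workers consistent.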

@@ -332,8 +339,15 @@
}
}
{
-// Write `predictor`, `gpu_id` parameters as extra attributes
-for (const auto& key : std::vector<std::string>{"predictor", "gpu_id"}) {
+std::vector<std::string> saved_params{"predictor", "gpu_id"};
+// check if rabit_bootstrap_cache was set to non-zero before adding to checkpoint
+if (cfg_.find("rabit_bootstrap_cache") != cfg_.end() &&
+    (cfg_.find("rabit_bootstrap_cache"))->second != "0") {
+  std::copy(saved_configs_.begin(), saved_configs_.end(),
+            std::back_inserter(saved_params));
+}
+// Write `predictor`, `gpu_id`, and (with rabit_bootstrap_cache) the saved configs as extra attributes
+for (const auto& key : saved_params) {
auto it = cfg_.find(key);
if (it != cfg_.end()) {
mparam.contain_extra_attrs = 1;
@@ -601,7 +615,7 @@
num_feature = std::max(num_feature, static_cast<unsigned>(num_col));
}
// run allreduce on num_feature to find the maximum value
-rabit::Allreduce<rabit::op::Max>(&num_feature, 1);
+rabit::Allreduce<rabit::op::Max>(&num_feature, 1, nullptr, nullptr, "num_feature");
if (num_feature > mparam_.num_feature) {
mparam_.num_feature = num_feature;
}
@@ -648,6 +662,10 @@
std::vector<std::shared_ptr<DMatrix> > cache_;

common::Monitor monitor_;

+/*! \brief saved config keys used to restore a failed worker */
+std::set<std::string> saved_configs_ = {"max_depth", "tree_method", "dsplit",
+    "seed", "silent", "num_round", "gamma", "min_child_weight"};
};

std::string const LearnerImpl::kEvalMetric {"eval_metric"}; // NOLINT
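Taken together, the learner.cc hunks implement a save/load round trip for the hyperparameters a restarted worker needs. A minimal sketch of that round trip, with plain std::map stand-ins and hypothetical function names (CollectExtraAttrs, RestoreSavedConfigs), not xgboost's actual serialization code:

```cpp
#include <map>
#include <set>
#include <string>

using Config = std::map<std::string, std::string>;

const std::set<std::string> kSavedConfigs = {
    "max_depth", "tree_method", "dsplit", "seed",
    "silent", "num_round", "gamma", "min_child_weight"};

// Mirror of the save-side gating: only persist the extra keys when
// rabit_bootstrap_cache is present and non-zero.
Config CollectExtraAttrs(const Config& cfg) {
  Config attrs;
  auto bc = cfg.find("rabit_bootstrap_cache");
  if (bc != cfg.end() && bc->second != "0") {
    for (const auto& key : kSavedConfigs) {
      auto it = cfg.find(key);
      if (it != cfg.end()) attrs[it->first] = it->second;
    }
  }
  return attrs;
}

// Mirror of the load-side hunk: restore only whitelisted keys into the config.
void RestoreSavedConfigs(const Config& attrs, Config* cfg) {
  for (const auto& kv : attrs) {
    if (kSavedConfigs.count(kv.first) != 0) (*cfg)[kv.first] = kv.second;
  }
}
```

In the diff itself the key list lives in saved_configs_ and the attributes travel inside the model checkpoint, so a relaunched worker that loads the checkpoint trains with the same max_depth, tree_method, seed, and related settings as the workers it rejoins.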
12 changes: 12 additions & 0 deletions tests/ci_build/approx.conf.in
@@ -0,0 +1,12 @@
+# Originally an example in demo/regression/
+tree_method = approx
+eta = 0.5
+gamma = 1.0
+seed = 0
+min_child_weight = 0
+max_depth = 5
+
+num_round = 12
+save_period = 100
+data = "demo/data/agaricus.txt.train"
+eval[test] = "demo/data/agaricus.txt.test"
10 changes: 10 additions & 0 deletions tests/ci_build/build_mock_cmake.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+set -e
+
+rm -rf build
+mkdir build
+cd build
+cmake -DRABIT_MOCK=ON -DCMAKE_VERBOSE_MAKEFILE=ON ..
+make clean
+make -j$(nproc)
+cd ..
13 changes: 13 additions & 0 deletions tests/ci_build/runxgb.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# Run make in rabit/test to generate librabit_mock,
+# update config.mk, and build xgboost using the mock.
+export DMLC_SUBMIT_CLUSTER=local
+
+submit="python3 dmlc-core/tracker/dmlc-submit"
+# build xgboost with the librabit mock
+# define max worker retries with dmlc-core's local num attempt
+# instrument worker failures with mock=xxxx
+# check whether hosts recovered from the expected iteration
+echo "====== 1. Fault recovery distributed test ======"
+exec $submit --cluster=local --num-workers=10 --local-num-attempt=10 $1 $2 mock=0,10,1,0 mock=1,11,1,0 mock=1,11,1,1 mock=0,11,1,0 mock=4,11,1,0 mock=9,11,1,0 mock=8,11,2,0 mock=4,11,3,0 rabit_bootstrap_cache=1 rabit_debug=1
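A note on the arguments, assuming rabit's documented mock-engine format: each mock=a,b,c,d tuple is read as rank,version,seq,ndeath, i.e. the worker with that rank kills itself when it reaches the collective call with sequence number c in checkpoint version (iteration) b for the d-th time. With num_round = 12 in approx.conf.in, the tuples above inject failures on assorted ranks around iterations 10 and 11, some repeatedly; the local tracker relaunches each failed worker up to --local-num-attempt=10 times, rabit_bootstrap_cache=1 lets the relaunched workers replay bootstrap-time collectives (the feature under test), and rabit_debug=1 presumably turns on verbose logging of the recovery path.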