Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-enable recovery test after rabit allow duplicated allreduce calls with same signature #5031

Closed
wants to merge 9 commits into from

Conversation

chenqin
Copy link
Contributor

@chenqin chenqin commented Nov 12, 2019

Reeanble rabit tests after fix
dmlc/rabit#128 merged

Rabit assume bootstrap allreduce has unique signature, in #5012 we find it failed to satisfy such constraint.

The goal of this pr is to loosen such constraint and allow last write overwrite results with same signature during bootstrap phase. Here is example https://github.com/dmlc/xgboost/pull/4990/files#diff-5fabd019d3f1af7572baa4a6301cf076R36

@trivialfis
Copy link
Member

Restarted the aborted tests.

@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@1733c9e). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5031   +/-   ##
=========================================
  Coverage          ?   71.52%           
=========================================
  Files             ?       11           
  Lines             ?     2311           
  Branches          ?        0           
=========================================
  Hits              ?     1653           
  Misses            ?      658           
  Partials          ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1733c9e...5dd5a75. Read the comment docs.

@chenqin chenqin changed the title try to enable rabit test reenable rabit test after rabit allow duplicated bootstrap allreduce calls with same signature Nov 12, 2019
@chenqin chenqin changed the title reenable rabit test after rabit allow duplicated bootstrap allreduce calls with same signature re-enable recovery test after rabit allow duplicated allreduce calls with same signature Nov 12, 2019
@chenqin chenqin marked this pull request as ready for review November 13, 2019 03:47
@chenqin
Copy link
Contributor Author

chenqin commented Nov 13, 2019

#5033

@chenqin
Copy link
Contributor Author

chenqin commented Nov 14, 2019

@trivialfis can we rerun tests, a bit weird.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 14, 2019

The error message in the R tests is

undefined symbol: _ZN5rabit6engine9AllgatherEPvmmmm

I thought the R package was using mock Rabit? (Distributed training is not supported for XGBoost-R)

@chenqin
Copy link
Contributor Author

chenqin commented Nov 14, 2019

The error message in the R tests is

undefined symbol: _ZN5rabit6engine9AllgatherEPvmmmm

I thought the R package was using mock Rabit? (Distributed training is not supported for XGBoost-R)

I see, that’s probably not related to mock but all gather interface merged in master.

@chenqin
Copy link
Contributor Author

chenqin commented Nov 25, 2019

depends on dmlc/rabit#129
@trivialfis

@trivialfis
Copy link
Member

Will look into this once our Windows worker is back. Sorry for being slow here.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 1, 2019

@trivialfis Sorry for the delay, I was spending time with my family over the Thanksgiving holiday. The blocking issue now is inability to run the compiled C++ tests (testxgboost.exe) in Windows workers; the executable simply crashes. I'm investigating now.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 2, 2019

@trivialfis I found the reason why testxgboost.exe was crashing. I had to install a system lib called Microsoft Visual C++ Redistributable for Visual Studio. #5061 will be resolved shortly.

@trivialfis
Copy link
Member

@hcho3. That's great!

@trivialfis
Copy link
Member

@chenqin Could you make a rebase onto master branch. As you used your master branch I can't do that for you.

@trivialfis
Copy link
Member

@chenqin pls help rebasing, otherwise the Jenkins Windows tests can not pass.

@chenqin
Copy link
Contributor Author

chenqin commented Dec 15, 2019

@nateagr
dmlc/rabit#113

`
Error: package or namespace load failed for 'xgboost' in dyn.load(file, DLLpath = DLLpath, ...):

[2019-12-15T01:52:26.359Z] unable to load shared object '/workspace/xgboost/xgboost.Rcheck/xgboost/libs/xgboost.so':

[2019-12-15T01:52:26.359Z] /workspace/xgboost/xgboost.Rcheck/xgboost/libs/xgboost.so: undefined symbol: ZN5rabit6engine9AllgatherEPvmmmmPKciS3
`

@chenqin
Copy link
Contributor Author

chenqin commented Jan 2, 2020

@trivialfis @hcho3 @CodingCat happy new year

@hcho3
Copy link
Collaborator

hcho3 commented Jan 2, 2020

@chenqin You too

@trivialfis
Copy link
Member

trivialfis commented Jan 2, 2020

@CodingCat Could you please share the rabit bug you discovered? My next target would be merging existing PRs. No more new bug fix from me unless critical.

@CodingCat
Copy link
Member

CodingCat commented Jan 2, 2020 via email

@trivialfis
Copy link
Member

I mean I won't focus on bug fixes, others can continue the process. And merging PRs will focus on bug fixes PRs.

@trivialfis
Copy link
Member

Clarified, sorry for the ambiguity.

trivialfis added a commit to trivialfis/xgboost that referenced this pull request Feb 15, 2020
@trivialfis trivialfis mentioned this pull request Feb 15, 2020
@chenqin
Copy link
Contributor Author

chenqin commented Feb 25, 2020

thanks @trivialfis helping get enable rabit test pr in. close this pr

@chenqin chenqin closed this Feb 25, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants