
Set a minimal reducer size and parent_down size #139

Merged · 8 commits merged into dmlc:master on Jul 25, 2020
Conversation

@FelixYBW (Contributor) commented May 7, 2020

Currently, distributed XGBoost's performance varies widely from run to run, even when we feed exactly the same partitions to each worker. Performance is more than 2x worse if we wrongly configure it to use a 1GbE network (that is a separate issue we need to fix), or if we run on an IPoIB network.

The root cause is that sometimes a message is broken into two pieces when a child receives it, and the second piece can be delayed by more than 10 ms while the network is actually idle.
It looks like poll(-1) causes this issue, but changing it to poll(0) + pause doesn't help.
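For illustration, here is a generic sketch of the blocking receive pattern in question; this is not rabit's actual code, and the names are mine. poll(-1) sleeps until the socket is readable, so a late wakeup for the second piece stalls the whole receive:

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>

// Generic poll(-1) + recv loop: block until readable, then read.
// Per the report above, the wakeup for the second piece of a split
// message can arrive >10 ms late even though the network is idle.
std::size_t RecvAll(int fd, char *buf, std::size_t total) {
  std::size_t got = 0;
  while (got < total) {
    pollfd pfd{fd, POLLIN, 0};
    poll(&pfd, 1, /*timeout=*/-1);  // sleep until data is available
    ssize_t n = recv(fd, buf + got, total - got, 0);
    if (n <= 0) break;              // error or connection closed
    got += static_cast<std::size_t>(n);
  }
  return got;
}
```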

[Image: trace showing the delayed second piece of a split message]

This happens more frequently on 1GbE and IPoIB networks, which leads to even worse performance.

The fix is to set a minimal reducer size of N bytes, so that each time a parent can reduce and pass down at least N bytes, and a child can wait until it has received N bytes. In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.

[Chart: end-to-end XGBoost training performance improvement on 3 clusters with different networks]

FelixYBW added 2 commits May 8, 2020 06:10
Fix bug: rewrite the minimal reducer size check, making sure each transfer is 1~N times the minimal reduce size

Assume the minimal reduce size is X. The logic here is (see the sketch after this list):
1. Each child uploads a message of total_size.
2. Each parent receives at least X bytes, up to total_size.
3. The parent reduces an X, NxX, or total_size message.
4. The parent sends the X, NxX, or total_size message to its parent.
5. The parent's parent receives at least X bytes, up to total_size, then reduces an X, NxX, or total_size message.
6. The parent's parent sends the X, NxX, or total_size message to its children.
7. The parent receives the X, NxX, or total_size message and sends it to its children.
8. Each child receives the X, NxX, or total_size message.

During the whole process, each transfer is a (1~N)xX-byte message, up to total_size.

If X is larger than total_size, then allreduce always reduces the whole message and passes it down.
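As a minimal sketch of the chunking arithmetic described above (function and parameter names are mine, not rabit's actual API):

```cpp
#include <cstddef>
#include <cstdio>

// How many of the received bytes a node may reduce and forward so far:
// a 1~N multiple of the minimal reduce size X, or the whole message
// once all of it has arrived. If X > total_size, wait for everything.
std::size_t ReducibleBytes(std::size_t received, std::size_t total_size,
                           std::size_t min_reduce_size) {
  if (received >= total_size) return total_size;  // whole message is in
  if (min_reduce_size >= total_size) return 0;    // X > total: wait for all
  return received / min_reduce_size * min_reduce_size;  // floor to NxX
}

int main() {
  // 1 MB message with X = 256 KB: 300 KB received -> 256 KB can go out.
  std::printf("%zu\n", ReducibleBytes(300u << 10, 1u << 20, 256u << 10));
  return 0;
}
```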
@FelixYBW (Contributor, Author) commented:

I ran an experiment with epoll, but performance is the same even with edge trigger, and an epoll_wait + recv still leads to delays, so we still need the busy-recv fix here.
Some questions are still open:

  • It mainly happens on parent_down; it rarely happens when receiving child_up.
  • This fix removed most of the delay, but not all: sometimes the whole message is delayed, possibly due to network hardware delay (less likely) or OS scheduling.
  • Why doesn't the busy poll (poll with 0 timeout) solve the problem? (See the sketch below.)
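For reference, the "busy poll" variant mentioned above would look roughly like this (a sketch, not rabit's code): poll() with a zero timeout returns immediately, so the caller spins in user space instead of sleeping in the kernel.

```cpp
#include <poll.h>

// Spin until the socket becomes readable; poll(..., 0) never blocks.
// Per the observations above, even this busy loop did not remove the
// >10 ms delay on the second message piece.
void SpinUntilReadable(int fd) {
  pollfd pfd{fd, POLLIN, 0};
  while (poll(&pfd, 1, /*timeout=*/0) == 0) {
    // not readable yet: keep spinning (optionally issue a CPU pause)
  }
}
```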

@trivialfis (Member) commented:

Let me fix the Travis tests.

@trivialfis self-requested a review May 15, 2020 12:34
@trivialfis (Member) commented:

CI should be fixed in #140.

@FelixYBW (Contributor, Author) commented Jun 5, 2020:

What's the error? The style check passed on my local machine.

src/engine_mpi.cc:13: Found C++ system header after other system header. Should be: engine_mpi.h, c system, c++ system, other.

That file wasn't touched.

11 #define NOMINMAX
12 #include <mpi.h>
13 #include <...>
14 #include "rabit/internal/engine.h"

@hcho3 (Contributor) commented Jun 5, 2020:

Cpplint 1.5.0 was released last week, and it changed the way it checks header order.

@hcho3 (Contributor) commented Jun 5, 2020:

CI is fixed in #141.

@FelixYBW (Contributor, Author) commented Jun 5, 2020:

Passed eventually. It's odd that I have to move <mpi.h> in front of <...>. Added commit: fix allreduce_base.cc's header seq.

@FelixYBW (Contributor, Author) commented:

@trivialfis @hcho3

Any comments on the PR? Can we merge it?

@hcho3 (Contributor) commented Jul 21, 2020:

@FelixYBW Was this change tested in a production cluster with XGBoost (or another application)? I ask because the plot you posted shows only the time for sending messages, and the CI set up here may not detect all possible ways this change could break XGBoost.

> In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.

Can I take it that you tested with XGBoost? What is the impact on end-to-end training performance?

@FelixYBW (Contributor, Author) commented:

> @FelixYBW Was this change tested in a production cluster with XGBoost (or another application)?

Yes, it was tested in several clusters with XGBoost training.

@hcho3 (Contributor) left a review:

No objection from me.

@FelixYBW (Contributor, Author) commented:

> @FelixYBW Was this change tested in a production cluster with XGBoost (or another application)? I ask because the plot you posted shows only the time for sending messages, and the CI set up here may not detect all possible ways this change could break XGBoost.
>
> > In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.
>
> Can I take it that you tested with XGBoost? What is the impact on end-to-end training performance?

Yes, we tested with XGBoost, and it works well so far. The chart in the first post shows the end-to-end XGBoost training performance improvement on 3 clusters, each with 4 nodes but a different network. The dataset is 600M rows x 46 columns. The patch also makes XGBoost training time much more stable.

@hcho3 (Contributor) commented Jul 21, 2020:

> The chart in the first post shows the end-to-end XGBoost training performance improvement on 3 clusters, each with 4 nodes but a different network.

@FelixYBW Nice, thanks for clarifying. A 1.5-3x end-to-end boost is quite nice.

@hcho3 merged commit e6cd74e into dmlc:master on Jul 25, 2020
@hcho3 (Contributor) commented Jul 25, 2020:

@FelixYBW Merged. Thanks! I'll file a pull request to use this commit in XGBoost.
