
Set a minimal reducer size and parent_down size #139

Merged · 8 commits merged into dmlc:master on Jul 25, 2020
Conversation

@FelixYBW (Contributor) commented May 7, 2020

Currently, distributed XGBoost's performance varies widely from run to run, even when we feed exactly the same partitions to each worker. Performance is more than 2x worse if we wrongly configure it to use a 1GbE network (that is a separate issue we need to fix), or if we run on an IPoIB network.

The root cause is that sometimes a message is broken into two pieces when a child receives it, and the second piece can be delayed by more than 10 ms while the network is actually idle.
It looks like poll(-1) causes this issue, but changing it to poll(0) + pause doesn't help.
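For illustration, here is a generic sketch of the blocking receive pattern in question; this is not rabit's actual code, and the names are mine. poll(-1) sleeps until the socket is readable, so a late wakeup for the second piece stalls the whole receive:

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstddef>

// Generic poll(-1) + recv loop: block until readable, then read.
// Per the report above, the wakeup for the second piece of a split
// message can arrive >10 ms late even though the network is idle.
std::size_t RecvAll(int fd, char *buf, std::size_t total) {
  std::size_t got = 0;
  while (got < total) {
    pollfd pfd{fd, POLLIN, 0};
    poll(&pfd, 1, /*timeout=*/-1);  // sleep until data is available
    ssize_t n = recv(fd, buf + got, total - got, 0);
    if (n <= 0) break;              // error or connection closed
    got += static_cast<std::size_t>(n);
  }
  return got;
}
```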

[Image: trace showing the delayed second piece of a split message]

This happens more frequently on 1GbE and IPoIB networks, which leads to even worse performance.

The fix is to set a minimal reducer size of N bytes, so that each time a parent can reduce and pass down at least N bytes, and a child can wait until it has received N bytes. In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.

[Chart: end-to-end XGBoost training performance improvement on 3 clusters with different networks]

FelixYBW added 2 commits May 8, 2020 06:10
Fix bug: rewrite the minimal reducer size check, making sure each transfer is 1~N times the minimal reduce size

Assume the minimal reduce size is X. The logic here is (see the sketch after this list):
1. Each child uploads a message of total_size.
2. Each parent receives at least X bytes, up to total_size.
3. The parent reduces an X, NxX, or total_size message.
4. The parent sends the X, NxX, or total_size message to its parent.
5. The parent's parent receives at least X bytes, up to total_size, then reduces an X, NxX, or total_size message.
6. The parent's parent sends the X, NxX, or total_size message to its children.
7. The parent receives the X, NxX, or total_size message and sends it to its children.
8. Each child receives the X, NxX, or total_size message.

During the whole process, each transfer is a (1~N)xX-byte message, up to total_size.

If X is larger than total_size, then allreduce always reduces the whole message and passes it down.
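As a minimal sketch of the chunking arithmetic described above (function and parameter names are mine, not rabit's actual API):

```cpp
#include <cstddef>
#include <cstdio>

// How many of the received bytes a node may reduce and forward so far:
// a 1~N multiple of the minimal reduce size X, or the whole message
// once all of it has arrived. If X > total_size, wait for everything.
std::size_t ReducibleBytes(std::size_t received, std::size_t total_size,
                           std::size_t min_reduce_size) {
  if (received >= total_size) return total_size;  // whole message is in
  if (min_reduce_size >= total_size) return 0;    // X > total: wait for all
  return received / min_reduce_size * min_reduce_size;  // floor to NxX
}

int main() {
  // 1 MB message with X = 256 KB: 300 KB received -> 256 KB can go out.
  std::printf("%zu\n", ReducibleBytes(300u << 10, 1u << 20, 256u << 10));
  return 0;
}
```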
@FelixYBW (Contributor, Author) commented:

I ran an experiment with epoll, but performance is the same even with edge trigger, and an epoll_wait + recv still leads to delays, so we still need the busy-recv fix here.
Some questions are still open:

  • It mainly happens on parent_down; it rarely happens when receiving child_up.
  • This fix removed most of the delay, but not all: sometimes the whole message is delayed, possibly due to network hardware delay (less likely) or OS scheduling.
  • Why doesn't the busy poll (poll with 0 timeout) solve the problem? (See the sketch below.)
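For reference, the "busy poll" variant mentioned above would look roughly like this (a sketch, not rabit's code): poll() with a zero timeout returns immediately, so the caller spins in user space instead of sleeping in the kernel.

```cpp
#include <poll.h>

// Spin until the socket becomes readable; poll(..., 0) never blocks.
// Per the observations above, even this busy loop did not remove the
// >10 ms delay on the second message piece.
void SpinUntilReadable(int fd) {
  pollfd pfd{fd, POLLIN, 0};
  while (poll(&pfd, 1, /*timeout=*/0) == 0) {
    // not readable yet: keep spinning (optionally issue a CPU pause)
  }
}
```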

@trivialfis (Member) commented:

Let me fix the Travis tests.

@trivialfis self-requested a review May 15, 2020 12:34
@trivialfis (Member) commented:

CI should be fixed in #140.

@FelixYBW (Contributor, Author) commented Jun 5, 2020:

What's the error? The style check passed on my local machine.

src/engine_mpi.cc:13: Found C++ system header after other system header. Should be: engine_mpi.h, c system, c++ system, other.

That file wasn't touched.

11 #define NOMINMAX
12 #include <mpi.h>
13 #include <...>
14 #include "rabit/internal/engine.h"

@hcho3 (Contributor) commented Jun 5, 2020:

Cpplint 1.5.0 was released last week, and it changed the way it checks header order.

@hcho3 (Contributor) commented Jun 5, 2020:

CI is fixed in #141.

@FelixYBW (Contributor, Author) commented Jun 5, 2020:

Passed eventually. It's odd that I have to move <mpi.h> in front of <...>. Added commit: fix allreduce_base.cc's header seq.

@FelixYBW (Contributor, Author) commented:

@trivialfis @hcho3

Any comments on the PR? Can we merge it?

@hcho3 (Contributor) commented Jul 21, 2020:

@FelixYBW Was this change tested in a production cluster with XGBoost (or another application)? I ask because the plot you posted shows only the time for sending messages, and the CI set up here may not detect all possible ways this change could break XGBoost.

> In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.

Can I take it that you tested with XGBoost? What is the impact on end-to-end training performance?

@FelixYBW (Contributor, Author) commented:

> @FelixYBW Was this change tested in a production cluster with XGBoost (or another application)?

Yes, it was tested in several clusters with XGBoost training.

@hcho3 (Contributor) left a review:

No objection from me.

@FelixYBW (Contributor, Author) commented:

> @FelixYBW Was this change tested in a production cluster with XGBoost (or another application)? I ask because the plot you posted shows only the time for sending messages, and the CI set up here may not detect all possible ways this change could break XGBoost.
>
> > In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set the size as high as 1 MB, which is larger than the maximal message size of our workload.
>
> Can I take it that you tested with XGBoost? What is the impact on end-to-end training performance?

Yes, we tested with XGBoost, and it works well so far. The chart in the first post shows the end-to-end XGBoost training performance improvement on 3 clusters, each with 4 nodes but a different network. The dataset is 600M rows x 46 columns. The patch also makes XGBoost training time much more stable.

@hcho3 (Contributor) commented Jul 21, 2020:

> The chart in the first post shows the end-to-end XGBoost training performance improvement on 3 clusters, each with 4 nodes but a different network.

@FelixYBW Nice, thanks for clarifying. A 1.5-3x end-to-end boost is quite nice.

@hcho3 merged commit e6cd74e into dmlc:master on Jul 25, 2020
@hcho3 (Contributor) commented Jul 25, 2020:

@FelixYBW Merged. Thanks! I'll file a pull request to use this commit in XGBoost.
