Set a minimal reducer size and parent_down size #139
Conversation
Fix a bug. Rewrite the minimal reducer size check to make sure each reduce is 1~N times the minimal reduce size. Assume the minimal reduce size is X; the logic here is:

1. Each child uploads total_size of the message.
2. Each parent receives at least X of the message, up to total_size.
3. The parent reduces X, NxX, or total_size of the message.
4. The parent sends X, NxX, or total_size of the message to its parent.
5. The parent's parent receives at least X, up to total_size, then reduces X, NxX, or total_size of the message.
6. The parent's parent sends X, NxX, or total_size of the message down to its children.
7. The parent receives X, NxX, or total_size of the message and sends it to its children.
8. Each child receives X, NxX, or total_size of the message.

During the whole process, each transfer is a (1~N)xX-byte message, or up to total_size. If X is larger than total_size, then allreduce always reduces the whole message and passes it down.
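A minimal sketch of the size rounding this list describes, using hypothetical names (RoundReduceSize, bytes_available) rather than the actual rabit identifiers:

```cpp
#include <cstddef>

// Hypothetical helper (illustrative only, not the actual rabit code):
// round the number of bytes a node reduces/forwards down to a whole
// multiple of the minimal reduce size X, except when the full message
// (total_size) has already arrived.
std::size_t RoundReduceSize(std::size_t bytes_available,
                            std::size_t total_size,
                            std::size_t min_reduce_size /* X */) {
  // If X is larger than total_size, only the whole message is ever reduced.
  if (min_reduce_size >= total_size) {
    return bytes_available >= total_size ? total_size : 0;
  }
  // Once everything has arrived, flush the remainder even if it is not a
  // whole multiple of X.
  if (bytes_available >= total_size) return total_size;
  // Otherwise reduce/forward whole multiples of X (0 means: keep waiting).
  return (bytes_available / min_reduce_size) * min_reduce_size;
}
```

For example, with X = 1 MB and total_size = 600 KB, the function only ever returns 0 or 600 KB, i.e. the whole message is reduced and passed down at once, matching the last sentence above.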
I did an experiment with epoll, but performance is the same even if I use edge trigger. We still need the busy-recv fix here; an epoll_wait + recv still leads to delays.
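For reference, a minimal sketch of the kind of epoll_wait + recv loop this experiment refers to (edge-triggered; names and buffer handling are illustrative, not the rabit code):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

// Illustrative edge-triggered receive: block in epoll_wait, then drain the
// non-blocking socket with recv until EAGAIN. A sketch of the experiment
// described above, not the rabit implementation.
std::size_t EdgeTriggeredRecv(int epfd, int sock, char *buf, std::size_t cap) {
  // The socket is assumed to be registered elsewhere with EPOLLIN | EPOLLET.
  epoll_event ev;
  if (epoll_wait(epfd, &ev, 1, /*timeout=*/-1) <= 0) return 0;
  std::size_t got = 0;
  while (got < cap) {
    ssize_t n = recv(sock, buf + got, cap - got, 0);
    if (n > 0) {
      got += static_cast<std::size_t>(n);
    } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      break;  // socket drained: wait for the next readiness edge
    } else {
      break;  // peer closed the connection or a hard error occurred
    }
  }
  return got;
}
```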
Let me fix the Travis tests.
CI should be fixed in #140.
What's the error? The style check on my local machine passed. The reported error is:
src/engine_mpi.cc:13: Found C++ system header after other system header. Should be: engine_mpi.h, c system, c++ system, other.
The file wasn't touched (line 11 is #define NOMINMAX).
Cpplint 1.5.0 was released last week, and it changed the way it checks header order.
CI is fixed in #141 |
Passed eventually. It's odd that I have to move <mpi.h> in front of . Added
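For the record, the header order cpplint 1.5.0 expects looks roughly like this (the headers below are placeholders to show the ordering, not the actual contents of src/engine_mpi.cc):

```cpp
// Expected order: the file's own header, C system headers, C++ system
// headers, then other project headers. <mpi.h> is treated as a C system
// header, so it has to precede the C++ system headers.
#include "engine_mpi.h"            // 1. own header
#include <mpi.h>                   // 2. C system headers
#include <cstring>                 // 3. C++ system headers
#include <string>
#include "rabit/internal/utils.h"  // 4. other headers (placeholder)
```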
Any comments on the PR? Can we merge it?
@FelixYBW Was this change tested in a production cluster with XGBoost (or another application)? I ask this question because the plot you posted shows only the time for sending messages. The CI setup here may not detect all possible ways this change can break XGBoost.
Can I take it that you tested with XGBoost? What is the impact on the end-to-end training performance?
Yes, it's tested in several clusters with XGBoost training.
No objection from me.
Yes, we tested with XGBoost. It works well so far. The chart in the first post shows the end-to-end XGBoost training performance improvement on 3 clusters, each with 4 nodes but using a different network. The dataset is 600M rows x 46 columns. The patch also makes XGBoost training time much more stable.
@FelixYBW Nice, thanks for clarifying. A 1.5-3x boost end-to-end is quite nice.
@FelixYBW Merged. Thanks! I'll file a pull request to use this commit in XGBoost.
Currently, distributed XGBoost's performance varies greatly from run to run, even when we feed exactly the same partition into each worker. Its performance is more than 2x worse if we wrongly configure it to use a 1GbE network (that's another issue we need to fix), or if we run on an IPoIB network.
The root cause is that sometimes a message is broken into two pieces when a child receives it, and the second piece can be delayed by more than 10 ms even though the network is actually idle.
It looks like poll(-1) caused this issue, but changing it to poll(0) + pause doesn't help.
This happens more frequently on 1GbE and IPoIB networks, which leads to even worse performance.
The fix is to set a minimal reducer size of N bytes, so each time the parent can reduce and pass down at least N bytes, and the child can wait until it has received N bytes. In our test, XGBoost uses allreduce(sum) and the actual calculation time is very small, so we can set it up to 1 MB, which is larger than the maximal message size of our workload.
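A minimal sketch of the child-side wait described above, assuming a blocking socket and illustrative names (WaitForReducibleChunk, buf) rather than the actual rabit code:

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only (not the rabit implementation): the child keeps buffering
// incoming data until at least min(min_reduce_size, total_size) bytes are
// available, instead of reducing each arbitrarily small fragment on arrival.
std::size_t WaitForReducibleChunk(int sock, std::vector<char> *buf,
                                  std::size_t total_size,
                                  std::size_t min_reduce_size) {
  const std::size_t target = std::min(min_reduce_size, total_size);
  char tmp[4096];
  while (buf->size() < target) {
    // Blocking recv for simplicity; the code under discussion multiplexes
    // with poll() instead.
    ssize_t n = recv(sock, tmp, sizeof(tmp), 0);
    if (n <= 0) break;  // peer closed or error: stop waiting
    buf->insert(buf->end(), tmp, tmp + n);
  }
  // Hand back whole multiples of the minimal size, or everything once the
  // full message is buffered.
  if (buf->size() >= total_size) return total_size;
  return (buf->size() / min_reduce_size) * min_reduce_size;
}
```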