[jvm-packages] Fix #3489: Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives #3654
Conversation
The build passed the first time, but it failed after I closed / reopened the PR.
Something went wrong when downloading a file from GitHub... retriggered the test.
I took the first round of review and have some questions about the correctness of the algorithm.
Do the new changes lead to some test cases getting stuck?
Hmm, all tests passed in my local build. The missing tests start from "distributed training with the specified worker number". Here is my local log:
Compared with the Javis log: it looks like the cache writing was not finished (with "useExternalMemory = true")?
Re-triggered the build.
@CodingCat The reason the Java test hung was that I set "nWorkers = 3" in the test, which exceeded the number of CPU cores in the testing container. I changed it to "nWorkers = numWorkers".
@CodingCat Need some help here. Do you have any idea why the Python test failed?
@weitian I restarted the test. It had failed due to a random out-of-memory error.
@hcho3 Pre-merge still failed, any idea?
@weitian Tests have been flaky lately. I'll need to take a look.
Why did you restart the tests? Jenkins was still running.
@hcho3 @CodingCat
@weitian I have reset the Jenkins environment to get rid of residual files that were causing conflicts. @CodingCat Does it look good to you? Should we merge it once tests are fixed?
I haven't got the bandwidth to take a second round on this; maybe @yanboliang can help with the review.
I took a quick review of the current approach; the assumptions here still raise some concerns for me. How about we make it as simple as
OR
val edgeGroups = trainingData.mapPartitions(
  new LabeledPointGroupIterator(_)).filter(_._1).map(r => (TaskContext.getPartitionId(), r._2))
// the edgeGroups should be small enough to be processed in one partition
val stitchedGroup = edgeGroups.coalesce(1)
Is this assumption true? What if we have 4 partitions, each with millions of records belonging to a single group?
Let's do some math here.
Assume we have 100GB of training data and the HDFS block size is 128MB, so we get 800 partitions when loading the data into an RDD.
Each RDD partition contributes at most two edge groups, its first group and its last group, so the total number of edge groups is 800 * 2, which is small enough to be coalesced onto one executor and processed there.
What is an edge group? Suppose a group has 100 LabeledPoints: 30 of them may land in one RDD partition and the other 70 in the next, so the group straddles a partition boundary. Only the first and last group of a partition can straddle a boundary like this, hence at most two edge groups per partition.
The millions of groups other than the first and last group in an RDD partition are normal groups, and we do not need to stitch them.
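To make the bound concrete, here is a tiny local sketch (plain Scala, no Spark; the data and names are illustrative, not from this PR) showing that a partition yields at most two edge-group candidates no matter how many records or groups it holds:

```scala
object EdgeGroupExample extends App {
  // Group ids of records as they happen to land in three partitions;
  // group 2 and group 4 each straddle a partition boundary.
  val partitions = Seq(
    Seq(1, 1, 1, 2, 2),
    Seq(2, 2, 3, 3, 4),
    Seq(4, 5, 5, 5)
  )

  // Only the first and the last group id of a partition can belong to a
  // group that continues in a neighbouring partition, so each partition
  // contributes at most two edge groups regardless of its size.
  val edgeGroupIds = partitions.map(p => (p.head, p.last))
  println(edgeGroupIds) // List((1,2), (2,4), (4,5))
}
```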
How do you guarantee that
- the data is loaded as a plain HDFS file, and
- the user didn't apply any additional transformation, including repartition, before feeding it in as training data?
I just used HDFS as an example. Since we only collect two groups from each partition, the total size of the edge groups is relatively small, so this is actually a performance optimization, not a cause of performance problems.
Alternatively, we could use an RDD groupBy, which does the same logic in a distributed fashion (see the sketch below).
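For reference, a hedged sketch of that groupBy alternative (this is not code from the PR; the use of XGBLabeledPoint's group field and the ordering of halves by partition id are assumptions based on the snippet quoted above):

```scala
import org.apache.spark.rdd.RDD

// edgeGroups is keyed by partition id, as produced by the
// mapPartitions/filter/map chain quoted earlier in this thread.
def stitchDistributed(
    edgeGroups: RDD[(Int, Array[XGBLabeledPoint])]): RDD[Array[XGBLabeledPoint]] = {
  edgeGroups
    // key each boundary chunk by its group id instead of coalescing to one partition
    .map { case (partitionId, chunk) => (chunk.head.group, (partitionId, chunk)) }
    .groupByKey() // matching halves of a group meet on one reducer
    .map { case (_, chunks) =>
      // order the halves by originating partition to keep record order
      chunks.toSeq.sortBy(_._1).flatMap(_._2).toArray
    }
}
```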
You still didn't answer my previous question: what if the user passes in an RDD containing 4 partitions, each of which contains only half a group's data?
Will all the groups be collected into a single partition?
Yes, in that case all the data is collected into one partition, which then produces the list of groups in a single partition.
@weitian Please ignore the failed test
@CodingCat Pushing the group-partitioning logic to the user may not be a good idea. Sharing my own experience: when I started to use the xgboost-spark package, I just called the XGBoostRegressor API and there was no error or exception. I only found this bug when I saw metrics that were very different from the results of a model trained on a single node. Since evenly partitioned data is critical for parallel training, the RDD has to be repartitioned, and the user would have to deal with groups crossing partition borders, which is very tricky. Another issue is that, even if the data is pre-partitioned, the current logic for splitting the training and testing sets still messes up the groups. I know this is a PR with a lot of changes, but logically all the fixes in it are connected, and it is hard to break them into multiple PRs.
@weitian It's not that this PR has many changes; it's that some assumptions this PR is based on either produce potentially incorrect results (the previous dropping-group version) or a performance bottleneck (the current coalesce(1) version). My suggestion is:
2.1) if the requirement is not followed, by default stop the application and give the reason (a sketch of such a check follows below); 2.2) the user can switch to the dropping-group version if they know what they are doing.
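As an illustration only (not code from this PR), a fail-fast check along the lines of 2.1 might look like the following, assuming XGBLabeledPoint carries an integer group field and group ids do not repeat non-contiguously:

```scala
import org.apache.spark.rdd.RDD

// Collect (partition id, first group id, last group id) for every
// non-empty partition; only these boundary ids can reveal a group that
// was split across partitions.
def requireGroupsIntact(trainingData: RDD[XGBLabeledPoint]): Unit = {
  val bounds = trainingData.mapPartitionsWithIndex { (pid, it) =>
    if (it.hasNext) {
      val first = it.next().group
      var last = first
      it.foreach(p => last = p.group)
      Iterator((pid, first, last))
    } else Iterator.empty
  }.collect().sortBy(_._1)

  // If a group crosses a boundary, the last id of one partition equals
  // the first id of the next; stop the application with the reason (2.1).
  bounds.sliding(2).foreach {
    case Array((_, _, prevLast), (_, nextFirst, _)) =>
      require(prevLast != nextFirst,
        s"group $prevLast crosses a partition boundary; pre-partition the data by group")
    case _ => // a single non-empty partition: nothing to check
  }
}
```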
I have filed issue #3720 to address the flaky tests once and for all.
Generally LGTM; a few more comments, though.
assert(groups(15)._2(0).group == 16)
}

test("distributed training with group data") {
Is there any meaningful but small dataset we can use for the ranking unit tests? I think the current unit tests for ranking only check whether the training can finish.
This PR is to make sure that, after the training data is repartitioned and split, the training groups are still correct and the order of LabeledPoints within each group is kept.
My changed unit test has good coverage of the corner cases: test("repartitionForTrainingGroup with group data").
Yes, it only tests whether the training can finish. I cannot use my training data from Apple, so let's see if we can find some sample dataset for ranking.
[jvm-packages] Fix #3489: Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives
@hcho3 There is still some failure in the AppVeyor build. Any idea?
@weitian That was a weird failure. I think it was due to a spurious download failure for R packages.
Thanks, merged!
[jvm-packages] Fix #3489: Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives (dmlc#3654)
Ranker training needs grouped data, and the order of LabeledPoints within a group has to be kept for evaluation.
For distributed training, repartitioning the training data is important to keep the workload balanced, but the RDD repartition here shuffles all the data, which breaks the groups and loses the ordering required for ranking.
The random splitting of the test data out of the training data here breaks the groups, too.
The fix is to group the training data as RDD[Array[XGBLabeledPoint]] before the repartitioning and splitting.
While loading training data into RDD partitions, we cannot control where each partition ends, so the LabeledPoints belonging to one group may be broken across two RDD partitions; such chunks are stitched back together by group id, as sketched below.
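To make the flow concrete, here is a condensed sketch of the approach described above (simplified from the actual PR code; stitchByGroupId is a hypothetical helper standing in for the real stitching logic):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

def repartitionForTrainingGroupSketch(
    trainingData: RDD[XGBLabeledPoint],
    nWorkers: Int): RDD[Array[XGBLabeledPoint]] = {
  // 1. Lazily fold consecutive points with the same group id into arrays;
  //    the Boolean marks arrays that touch a partition boundary.
  val groups: RDD[(Boolean, Array[XGBLabeledPoint])] =
    trainingData.mapPartitions(new LabeledPointGroupIterator(_))

  // 2. Pull the few boundary chunks together and glue matching halves
  //    back into whole groups; each partition has at most two of them.
  val stitched: RDD[Array[XGBLabeledPoint]] = groups
    .filter(_._1)
    .map(r => (TaskContext.getPartitionId(), r._2))
    .coalesce(1)
    .mapPartitions(stitchByGroupId) // hypothetical helper, not from the PR

  // 3. Repartition and split whole groups, never individual rows, so no
  //    later shuffle can break a group or reorder the points inside it.
  groups.filter(!_._1).map(_._2).union(stitched).repartition(nWorkers)
}
```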
Another challenge is to conserve JVM heap memory, so that we can leave more memory to the XGBoost native library for training. Thanks to the streaming nature of Scala collections, an Iterator is "view based" rather than "builder based":
a Scala Iterator is lazily evaluated and does not load all the data into memory.
So we need to keep every transformation step lazy as well (Spark transformations are lazy by default). This is the design idea behind LabeledPointGroupIterator; a simplified sketch follows.
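A generic, simplified sketch of that idea (not the actual LabeledPointGroupIterator, which additionally flags boundary groups): it pulls one run of equal keys at a time, so heap usage is bounded by the largest single group rather than by the partition size:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

class GroupIterator[T: ClassTag, K](base: Iterator[T], key: T => K)
    extends Iterator[Array[T]] {
  private val buffered = base.buffered // allows peeking without consuming

  override def hasNext: Boolean = buffered.hasNext

  override def next(): Array[T] = {
    // Consume exactly one run of records sharing the same key; everything
    // after the run stays unread in the underlying iterator.
    val k = key(buffered.head)
    val run = ArrayBuffer.empty[T]
    while (buffered.hasNext && key(buffered.head) == k) {
      run += buffered.next()
    }
    run.toArray
  }
}

// Usage: rows of (groupId, feature) come out one whole group at a time.
// new GroupIterator(Iterator((1, "a"), (1, "b"), (2, "c")), (_: (Int, String))._1)
//   yields Array((1,"a"), (1,"b")) and then Array((2,"c")).
```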
For the same reason, duplicating an Iterator here is not very memory efficient, since the gap between the two iterators is cached in memory.
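For example (standard Scala behaviour, shown only to illustrate the cost):

```scala
// Iterator.duplicate shares one underlying buffer: every element already
// consumed by the faster copy but not yet by the slower one is kept on
// the JVM heap until the slower copy catches up.
val (fast, slow) = (1 to 1000000).iterator.duplicate
fast.foreach(_ => ())  // drain the fast copy first...
println(slow.size)     // ...so all 1,000,000 elements were buffered for `slow`
```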