
repartition #4859

Closed
paleylouie opened this issue Sep 16, 2019 · 5 comments


paleylouie commented Sep 16, 2019

I trained XGBoost on Spark and found something interesting: if I repartition the training data RDD but do not repartition the test data RDD, the AUC on the test dataset comes out around 0.5 (it seems the predictions end up in the wrong order).
If I do not repartition the training RDD, everything works fine. I read the Scala source code of XGBoost and did not find anything unusual.

PS:
Spark version 2.3.0
xgboost4j-spark version 0.82

@hcho3
Collaborator

hcho3 commented Sep 16, 2019

  1. As part of [RFC] Design of Checkpoint Mechanism in XGBoost-Spark #4786, there will be an option to enable deterministic repartitioning, so that you can ensure reproducibility of model training. See [jvm-packages] enable deterministic repartitioning when checkpoint is enabled #4807 for work in progress.

  2. We have an outstanding issue with AUC currently, i.e. in the distributed setting, the AUC metric is not accurate. See [Roadmap] More robust metric calculation in distributed setting #4663. For now, would you be able to use other metrics such as accuracy or RMSE? Alternatively, you should tune your partitioning logic to mix your data well, so that positive and negative labels are both well represented in all partitions.
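The "mix your data well" advice in point 2 can be sketched locally without Spark: key each row on a stable unique id and hash that id into a partition, so class balance does not depend on row order. All names below are illustrative plain Scala, not the Spark or XGBoost API:

```scala
// Sketch: deterministic hash-partitioning by row id, so every partition
// sees both positive and negative labels. Hypothetical data, plain Scala.
val numPartitions = 4

// 1000 rows: ids 0..999, the first 100 labeled positive, the rest negative.
val rows: Seq[(Long, Int)] = (0 until 1000).map(i => (i.toLong, if (i < 100) 1 else 0))

// Partition on a hash of the id, never on the label or the row position.
def partitionOf(id: Long): Int = (id.hashCode & Int.MaxValue) % numPartitions

val byPartition: Map[Int, Seq[(Long, Int)]] = rows.groupBy { case (id, _) => partitionOf(id) }

// With ids hashed, each of the 4 partitions holds a mix of both classes.
val positivesPerPartition: Map[Int, Int] =
  byPartition.map { case (p, rs) => p -> rs.count(_._2 == 1) }
```

Had the rows instead been sorted by label and split into contiguous ranges, some partitions would contain only one class, which is exactly the situation that makes per-partition metrics misleading.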

@hcho3 hcho3 closed this as completed Sep 16, 2019
@hcho3 hcho3 reopened this Sep 16, 2019
@paleylouie
Author


Thank you for answering, but I did not use the metrics in the xgboost package. I computed predicted probabilities on the test dataset and then used org.apache.spark.mllib.evaluation.BinaryClassificationMetrics to calculate AUC, so point 2 may not apply to my situation.

As for point 1, I guess the checkpointing of the training RDD may have affected my prediction step?
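An AUC near 0.5 on an otherwise good model is consistent with predictions being zipped against the wrong labels. A minimal, Spark-free sketch of that effect (the rank-based AUC below is a hand-rolled stand-in for BinaryClassificationMetrics, and all data is synthetic):

```scala
import scala.util.Random

// Rank-statistic AUC for (score, label) pairs with distinct scores:
// AUC = (sum of ranks of positives - nPos*(nPos+1)/2) / (nPos * nNeg)
def auc(pairs: Seq[(Double, Int)]): Double = {
  val ranked   = pairs.sortBy(_._1).zipWithIndex             // rank = index + 1
  val posRanks = ranked.collect { case ((_, 1), i) => (i + 1).toLong }
  val nPos     = posRanks.size.toLong
  val nNeg     = pairs.size - nPos
  (posRanks.sum - nPos * (nPos + 1) / 2).toDouble / (nPos * nNeg)
}

val labels = (0 until 1000).map(i => if (i < 500) 1 else 0)
// A perfect scorer: every positive outscores every negative
// (a tiny index-based jitter keeps all scores distinct).
val scores = labels.zipWithIndex.map { case (l, i) => l + i * 1e-6 }

val aucAligned = auc(scores.zip(labels))                     // perfect: 1.0

// Pair the same scores with labels in a different order — what happens when
// the prediction RDD and the label RDD end up partitioned/ordered differently.
val shuffledScores = new Random(42).shuffle(scores)
val aucMisaligned  = auc(shuffledScores.zip(labels))         // collapses toward 0.5
```

The scores are unchanged in both cases; only their alignment with the labels differs, which is enough to drive AUC from 1.0 to roughly 0.5.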

@hcho3
Collaborator

hcho3 commented Sep 16, 2019

It's rather that, currently, we do not control the way data gets partitioned, so partitioning is non-deterministic. See "Deterministic Partitioning" section in #4786.

@CodingCat
Member

0.82 has a prediction bug when there is an upstream repartition; please use 0.9.
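For reference, the upgrade in an sbt build would look something like the following; the coordinates are the ones published on Maven Central for the 0.9 release, but verify the exact version string for your setup:

```scala
// build.sbt — bump xgboost4j / xgboost4j-spark from 0.82 to the 0.9 release.
// Version strings assumed from Maven Central; double-check before using.
libraryDependencies ++= Seq(
  "ml.dmlc" % "xgboost4j"       % "0.90",
  "ml.dmlc" % "xgboost4j-spark" % "0.90"
)
```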

@paleylouie
Author


OK, thank you guys! I will try it later!

@hcho3 hcho3 closed this as completed Sep 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Dec 15, 2019