
repartition #4859

Closed
paleylouie opened this issue Sep 16, 2019 · 5 comments


paleylouie commented Sep 16, 2019

I trained XGBoost on Spark and found something interesting: if I repartition the training data RDD but do not repartition the test data RDD, the AUC on the test dataset comes out around 0.5 (it seems the predictions end up in the wrong order).
If I do not repartition the training RDD, everything works fine. I read the Scala source code of XGBoost and did not find anything unusual.

PS:
Spark version 2.3.0
xgboost4j-spark version 0.82

@hcho3
Collaborator

hcho3 commented Sep 16, 2019

  1. As part of [RFC] Design of Checkpoint Mechanism in XGBoost-Spark #4786, there will be an option to enable deterministic repartitioning, so that you can ensure reproducibility of model training. See [jvm-packages] enable deterministic repartitioning when checkpoint is enabled #4807 for work in progress.

  2. We have an outstanding issue with AUC currently, i.e. in the distributed setting, the AUC metric is not accurate. See [Roadmap] More robust metric calculation in distributed setting #4663. For now, would you be able to use other metrics such as accuracy or RMSE? Alternatively, you should tune your partitioning logic to mix your data well, so that positive and negative labels are both well represented in all partitions.
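The "mix your data well" advice in point 2 can be sketched locally without Spark: key each row on a stable unique id and hash that id into a partition, so class balance does not depend on row order. All names below are illustrative plain Scala, not the Spark or XGBoost API:

```scala
// Sketch: deterministic hash-partitioning by row id, so every partition
// sees both positive and negative labels. Hypothetical data, plain Scala.
val numPartitions = 4

// 1000 rows: ids 0..999, the first 100 labeled positive, the rest negative.
val rows: Seq[(Long, Int)] = (0 until 1000).map(i => (i.toLong, if (i < 100) 1 else 0))

// Partition on a hash of the id, never on the label or the row position.
def partitionOf(id: Long): Int = (id.hashCode & Int.MaxValue) % numPartitions

val byPartition: Map[Int, Seq[(Long, Int)]] = rows.groupBy { case (id, _) => partitionOf(id) }

// With ids hashed, each of the 4 partitions holds a mix of both classes.
val positivesPerPartition: Map[Int, Int] =
  byPartition.map { case (p, rs) => p -> rs.count(_._2 == 1) }
```

Had the rows instead been sorted by label and split into contiguous ranges, some partitions would contain only one class, which is exactly the situation that makes per-partition metrics misleading.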

@hcho3 hcho3 closed this as completed Sep 16, 2019
@hcho3 hcho3 reopened this Sep 16, 2019
@paleylouie
Author


Thank you for answering, but I did not use the metrics in the xgboost package. I computed predicted probabilities on the test dataset and then used org.apache.spark.mllib.evaluation.BinaryClassificationMetrics to calculate AUC, so point 2 may not apply to my situation.

As for point 1, I guess the checkpointing of the training RDD may have affected my prediction step?
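An AUC near 0.5 on an otherwise good model is consistent with predictions being zipped against the wrong labels. A minimal, Spark-free sketch of that effect (the rank-based AUC below is a hand-rolled stand-in for BinaryClassificationMetrics, and all data is synthetic):

```scala
import scala.util.Random

// Rank-statistic AUC for (score, label) pairs with distinct scores:
// AUC = (sum of ranks of positives - nPos*(nPos+1)/2) / (nPos * nNeg)
def auc(pairs: Seq[(Double, Int)]): Double = {
  val ranked   = pairs.sortBy(_._1).zipWithIndex             // rank = index + 1
  val posRanks = ranked.collect { case ((_, 1), i) => (i + 1).toLong }
  val nPos     = posRanks.size.toLong
  val nNeg     = pairs.size - nPos
  (posRanks.sum - nPos * (nPos + 1) / 2).toDouble / (nPos * nNeg)
}

val labels = (0 until 1000).map(i => if (i < 500) 1 else 0)
// A perfect scorer: every positive outscores every negative
// (a tiny index-based jitter keeps all scores distinct).
val scores = labels.zipWithIndex.map { case (l, i) => l + i * 1e-6 }

val aucAligned = auc(scores.zip(labels))                     // perfect: 1.0

// Pair the same scores with labels in a different order — what happens when
// the prediction RDD and the label RDD end up partitioned/ordered differently.
val shuffledScores = new Random(42).shuffle(scores)
val aucMisaligned  = auc(shuffledScores.zip(labels))         // collapses toward 0.5
```

The scores are unchanged in both cases; only their alignment with the labels differs, which is enough to drive AUC from 1.0 to roughly 0.5.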

@hcho3
Collaborator

hcho3 commented Sep 16, 2019

It's rather that, currently, we do not control the way data gets partitioned, so partitioning is non-deterministic. See "Deterministic Partitioning" section in #4786.

@CodingCat
Member

0.82 has a prediction bug when there is an upstream repartition; please use 0.9.
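For reference, the upgrade in an sbt build would look something like the following; the coordinates are the ones published on Maven Central for the 0.9 release, but verify the exact version string for your setup:

```scala
// build.sbt — bump xgboost4j / xgboost4j-spark from 0.82 to the 0.9 release.
// Version strings assumed from Maven Central; double-check before using.
libraryDependencies ++= Seq(
  "ml.dmlc" % "xgboost4j"       % "0.90",
  "ml.dmlc" % "xgboost4j-spark" % "0.90"
)
```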

@paleylouie
Author


OK, thank you guys! I will try it later!

@hcho3 hcho3 closed this as completed Sep 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Dec 15, 2019