
[jvm-packages] add Rapids plugin support #7491

Merged: 3 commits merged into dmlc:master on Dec 17, 2021

Conversation

@wbo4958 (Contributor) commented Nov 30, 2021

This PR is the final PR for #7361.

For now, there is an issue with CPU transform: Model A transforms a CPU dataset read from fileABC and gets result AA, while the same Model A transforms a GPU dataset read from fileABC and gets result BB. We expect AA to equal BB, but in fact they differ.

I've figured out why, and will create a follow-up PR to fix it after this PR is merged.

Add GPU train/transform support for XGBoost4j-Spark-Gpu by leveraging
spark-rapids.
@wbo4958 (Contributor, Author) commented Nov 30, 2021

@trivialfis Could you help review it? Thanks.

@wbo4958 (Contributor, Author) commented Dec 2, 2021

I just ran a performance test (ETL + training) for CPU and GPU on the Mortgage 2000-year dataset and got about a (541 − 180) ÷ 180 ≈ 2x speed-up.
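As a side note, the speed-up figures quoted in this thread appear to be computed as a relative improvement, (CPU time − GPU time) ÷ GPU time, rather than the plain ratio CPU ÷ GPU. A minimal self-contained sketch of that calculation (class and method names are illustrative, not from the PR):

```java
public class Speedup {
    // Relative improvement as used in this thread: (cpu - gpu) / gpu.
    // Note this is smaller than the plain ratio cpu / gpu.
    static double relative(double cpuSeconds, double gpuSeconds) {
        return (cpuSeconds - gpuSeconds) / gpuSeconds;
    }

    public static void main(String[] args) {
        // (541 - 180) / 180 is roughly 2.0, matching the "2x speed-up" above.
        System.out.println(relative(541, 180));
    }
}
```

The same formula reproduces the 6.4x and 2.79x figures reported later in the thread from their respective CPU/GPU timings.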

@trivialfis (Member) left a comment

Initial review. I noticed that external memory is mentioned in the code; is there any usage example of it in the Spark package?

<scala.version>2.12.8</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<hadoop.version>2.7.3</hadoop.version>
<maven.wagon.http.retryHandler.count>5</maven.wagon.http.retryHandler.count>
<log.capi.invocation>OFF</log.capi.invocation>
<use.cuda>OFF</use.cuda>
<cudf.version>21.08.2</cudf.version>
Member: Why is it necessary to move it here?

Author: Both xgboost4j-gpu and xgboost4j-spark-gpu need cudf.version.

case regressor: XGBoostRegressor => if (regressor.isDefined(regressor.groupCol)) {
regressor.getGroupCol } else ""
case _: XGBoostClassifier => ""
case _ => throw new RuntimeException("Unsupporting estimator: " + estimator)
Member:

Suggested change
case _ => throw new RuntimeException("Unsupporting estimator: " + estimator)
case _ => throw new RuntimeException("Unsupported estimator: " + estimator)

Author: I just checked all the comments, and it seems no extra commit is needed, so I'd like to fix it in the follow-up PR.

Author: Done

require(est.isDefined(est.treeMethod) && est.getTreeMethod.equals("gpu_hist"),
s"GPU train requires tree_method set to gpu_hist")
val groupName = estimator match {
case regressor: XGBoostRegressor => if (regressor.isDefined(regressor.groupCol)) {
Member: Is the regressor in the Spark package responsible for ranking too?

Author: Yes, correct.

Member: That's weird.

isCacheData: Boolean): Map[String, ColumnDataBatch] = {
// Cache is not supported
if (isCacheData) {
logger.warn("Dataset cache is not support for GPU pipeline!")
Member: What is a cache in the context of the Spark package?


Author (@wbo4958) commented Dec 7, 2021:

Cache here means caching the Spark computation result in local storage. The GPU pipeline only accelerates Dataset caching, not RDD caching, so for now we just disable the cache.

I will figure out a way to handle this in a follow-up PR.

Member: Got it. That sounds fine.

noEvalSet: Boolean): RDD[Watches] = {

val sc = dataMap(TRAIN_NAME).rawDF.sparkSession.sparkContext
val maxBin = xgbExeParams.toMap.getOrElse("max_bin", 256).asInstanceOf[Int]
Member: Can we lower this configuration into C++, or set a default parameter on the API that's more visible?

Author: Actually, XGBoost's Java package already sets the default value to 256. Here it's just a safer way to get max_bin.
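For illustration, the getOrElse-style defensive lookup being discussed here can be sketched in plain Java (this is a hypothetical helper mirroring the pattern, not the actual xgboost4j-spark code):

```java
import java.util.HashMap;
import java.util.Map;

public class MaxBinLookup {
    // Hypothetical sketch: fall back to XGBoost's documented default of 256
    // when "max_bin" is absent from the parameter map, as in the
    // xgbExeParams.toMap.getOrElse("max_bin", 256) call under review.
    static int maxBin(Map<String, Object> params) {
        Object value = params.getOrDefault("max_bin", 256);
        return ((Number) value).intValue();
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        System.out.println(maxBin(params)); // falls back to the default
        params.put("max_bin", 128);
        System.out.println(maxBin(params)); // an explicit value wins
    }
}
```

The reviewer's point below is that if the parameter is always populated upstream, this fallback duplicates the default already defined in the XGBoost Java layer.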

Member: If the parameter is not optional, then I think this safety is not necessary. We should aim to remove any duplicated logic.

Author: I don't think this needs to be removed, but if you insist, I can get rid of it in the follow-up PR.

@@ -0,0 +1,292 @@
/*
Member: GpuPerTest vs. GpuPreTest?

Author: GpuPerTest is correct.

Member: What does it mean?

Author: Done

@wbo4958 (Contributor, Author) commented Dec 10, 2021

I just ran another round of XGBoost training on Mortgage (rows: 83,270,160; feature columns: 27) in Spark local mode, using the latest spark-rapids jars and cudf jar.

| Type | Time (s) |
| --- | --- |
| CPU load + train | 464.432 |
| GPU load + train | 62.354 |

The speed-up is 6.4x.

| Type | Time (s) |
| --- | --- |
| CPU load + ETL + train | 1222.662 |
| GPU load + ETL + train | 322.677 |

The speed-up is 2.79x.

As I said, there is still room for optimization in the GPU ETL + train path.

@trivialfis (Member) commented:

@hcho3 @RAMitchell I looked into the PR and it seems fine to me, but I haven't been able to provide a detailed review due to my lack of experience with Spark and the size of this PR. It would be really helpful if you could take a look.

@wbo4958 (Contributor, Author) commented Dec 14, 2021

@hcho3 @RAMitchell, could you help review it?

@RAMitchell (Member) left a comment

Thanks @wbo4958.

> For now, there is an issue with CPU transform: Model A transforms a CPU dataset read from fileABC and gets result AA, while the same Model A transforms a GPU dataset read from fileABC and gets result BB. We expect AA to equal BB, but in fact they differ.

Can you elaborate on this and why it cannot be fixed in this PR?

One concern we have had recently is the run time of the JVM tests on CI; the JVM tests use a disproportionately large share of the CI budget. Can you measure how long the tests take on CI and make sure the time is not increasing significantly due to this PR?

Apart from the above, I'm inclined to merge this as it's mostly tests.

@wbo4958 (Contributor, Author) commented Dec 16, 2021

Hi @RAMitchell, I could fix this in this PR, but it is really a legacy CPU bug that was not introduced by any of my previous PRs, so I'd like to fix it in a separate PR.

Yes, running the whole CPU unit-test suite takes a lot of time, so I've disabled the CPU unit tests when running the GPU unit tests, which are pretty fast.

@trivialfis trivialfis merged commit 24e2580 into dmlc:master Dec 17, 2021
@wbo4958 wbo4958 deleted the xgb-spark-gpu branch December 20, 2021 01:50