[PySpark] add gpu support for spark local mode #8068
Conversation
@trivialfis How do I run the python tests locally? Is there a doc for that?
You can check the doc for pytest.
python-package/xgboost/spark/core.py
Outdated
if self.getOrDefault(self.num_workers) > 1:
    raise ValueError(
        "Training XGBoost on the spark local mode only supports num_workers = 1, "
        + "and only primary GPU device will be used."
    )
I think we can still support multiple GPU workers in spark local mode.
We can get the partition id from the spark TaskContext and use it as the gpu_id for the corresponding spark task. WDYT?
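A minimal sketch of the idea, assuming the call happens inside a running Spark task (the helper name is hypothetical, not the PR's actual code):

```python
from pyspark import TaskContext

def _gpu_id_for_local_task() -> int:
    # In spark local mode, map each task to a GPU by its partition id,
    # so task 0 uses GPU 0, task 1 uses GPU 1, and so on.
    ctx = TaskContext.get()
    if ctx is None:
        raise RuntimeError("must be called from inside a Spark task")
    return ctx.partitionId()
```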
Thx. Hmm, there will be only one process in spark local mode. If we support the num_workers > 1 scenario, then all the GPU training tasks will run in the same process, which does not seem to be supported due to the NCCL issue. E.g.,
Task1 taking GPU 0 runs on Process1
Task2 taking GPU 1 runs on Process1
@trivialfis is this supported by xgboost?
It should work but let's not invite trouble.
Yeah, I agree. GPU support for local mode is mostly used for local debugging.
@wbo4958
A correction:
> in local mode pyspark, all the GPU training tasks will run on the same process
This is not true.
For pyspark in barrier mode, each python UDF task is run in an individual python process (in pyspark code, this process is called a python worker).
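A quick way to verify this locally (an illustrative sketch assuming a local[2] master, not part of this PR): each barrier task should report a distinct python worker pid.

```python
import os
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

def report_pid(_):
    ctx = BarrierTaskContext.get()
    ctx.barrier()  # synchronize, as barrier-mode training does
    yield (ctx.partitionId(), os.getpid())

# Expect two rows with different pids: one python worker per task.
rdd = spark.sparkContext.parallelize(range(2), 2)
print(rdd.barrier().mapPartitions(report_pid).collect())
```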
Oh right, my bad. I previously thought it was on the JVM side. Ok, let me add support for it.
Please address the linter error.
LGTM
This PR is to support gpu training (use_gpu=True) on Spark local mode.
Please note that this feature is needed since it's really convenient for local debugging.
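For illustration, a hedged usage sketch of what this enables (the tiny dataset is made up; SparkXGBClassifier and use_gpu come from the python-package/xgboost/spark module this PR touches):

```python
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0), 0), (Vectors.dense(3.0, 4.0), 1)],
    ["features", "label"],
)
# With this PR, use_gpu=True works on a local master; tasks are mapped
# to GPUs by partition id as discussed above.
clf = SparkXGBClassifier(use_gpu=True, num_workers=1)
model = clf.fit(df)
```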