[pyspark] Cleanup data processing. #8344
Conversation
* Enable additional combinations of ctor parameters.
* Unify procedures for QuantileDMatrix and DMatrix.
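To illustrate the unification, here is a hedged sketch of what a single construction path for both matrix types can look like; `make_matrix`, `parts`, and `label_col` are illustrative names, not the PR's actual code:

```python
import pandas as pd
import xgboost as xgb

def make_matrix(parts, label_col, use_quantile, **kwargs):
    """Build either matrix type from collected pandas partitions."""
    df = pd.concat(parts)
    X = df.drop(columns=[label_col])
    y = df[label_col]
    # One shared code path; only the constructor differs.
    ctor = xgb.QuantileDMatrix if use_quantile else xgb.DMatrix
    return ctor(X, label=y, **kwargs)
```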
@@ -208,10 +208,14 @@ def create_dmatrix_from_partitions( # pylint: disable=too-many-arguments

    def append_m(part: pd.DataFrame, name: str, is_valid: bool) -> None:
        nonlocal n_features
        if name in part.columns and part[name].shape[0] > 0:
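For readers outside the diff context, a hedged sketch of the closure pattern above (not the library's implementation; the `collect` wrapper and accumulator are assumptions, and the real closure also updates `nonlocal n_features`, elided here):

```python
from collections import defaultdict
from typing import Dict, List
import pandas as pd

def collect(parts: List[pd.DataFrame], names: List[str]) -> Dict[str, List[pd.Series]]:
    collected: Dict[str, List[pd.Series]] = defaultdict(list)

    def append_m(part: pd.DataFrame, name: str, is_valid: bool) -> None:
        # The guard under discussion: skip columns missing from this
        # partition and zero-row partitions, so that downstream stacking
        # never sees an empty sequence.
        if name in part.columns and part[name].shape[0] > 0:
            collected[name].append(part[name])

    for part in parts:
        for name in names:
            append_m(part, name, is_valid=False)  # is_valid kept to mirror the diff
    return collected
```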
Why get rid of the check `part[name].shape[0] > 0`?
The empty data tests are passing.
The empty-partition bug seems to be fixed in the nightly Spark build; I couldn't reproduce the error.
What, really? I will check it; please hold on.
We must add the check `part[feature_cols].shape[0] > 0`; otherwise `stack_series` will throw an exception. The latest change is OK.
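In miniature, why that guard matters, assuming `stack_series` ultimately calls something like `np.stack` on the per-row values (an assumption about its internals):

```python
import numpy as np
import pandas as pd

# A zero-row partition yields an empty sequence of rows.
empty = pd.DataFrame({"features": pd.Series([], dtype=object)})
rows = list(empty["features"])  # -> []
try:
    np.stack(rows)
except ValueError as err:
    print(err)  # "need at least one array to stack"
```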
One comment.
@rongou Have you seen this error before? https://buildkite.com/xgboost/xgboost-ci/builds/497#0183e6a1-75bf-4402-b6cd-8a50084f3067
Not sure; maybe the server didn't have time to start? The CI instances may be pretty overloaded. I wonder if we should add a wait in the test, along the lines sketched below.
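One possible shape for such a wait (an assumption, not the actual test change): poll the server port before the test proceeds.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> None:
    """Block until the server accepts TCP connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return  # server is up
        except OSError:
            time.sleep(0.5)  # not ready yet; retry
    raise TimeoutError(f"server at {host}:{port} did not start within {timeout}s")
```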
Sent PR #8351, which might help with this failure.
@trivialfis Databricks once suggested only enabling `feature_cols` when `use_gpu` is enabled. I think it's OK to enable it right now.
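A minimal sketch of the gating being discussed (names are assumptions, not the PR's code): reject the packed `feature_cols` input unless GPU training is requested.

```python
def validate_gpu_params(feature_cols, use_gpu: bool) -> None:
    # Hypothetical guard: feature_cols packs features into a GPU-friendly
    # layout, so only accept it when use_gpu is enabled.
    if feature_cols is not None and not use_gpu:
        raise ValueError("feature_cols is only supported when use_gpu is enabled")
```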
LGTM |
We will have to leave it to the next release. Making extensive tests for various combinations of parameters is not trivial.
Close #8341
@wbo4958 Could you please share why this is necessary? Is it possible to use the normal `features` column with `gpu_hist`, at least in theory (maybe with lower initialization performance)?

Referenced: xgboost/python-package/xgboost/spark/core.py, line 282 at 80e10e0