[Bug] DeviceQuantileDmatrix failing with memory errors on A100-80GB #6822
Comments
@ayushdg How did you install XGBoost? Did you build it from source?
Hmm .. somewhere inside thrust. Suspecting the
I just ran it with the master branch on an RTX 8000 and it works fine. What's your XGBoost version?
This is the one shipped with the rapids nightly containers:
I tested on a V100 and an A100 40GB and it worked fine on those GPUs as well. It was failing specifically on the A100 80GB (though it has more memory). Not sure why.
@ayushdg Thanks for sharing. I will test it on A100.
On large datasets, DeviceQuantileDMatrix fails with memory errors that look like some kind of overflow/ptr bug.
Reproducer:
The following script succeeds on a V100-32GB and an A100-40GB but fails on an A100-80GB.
Stacktrace:
Additional Info
The requested memory value from the stack trace (Requested memory: 18446744073675998463) is extremely close to the unsigned int64 limit.
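To make "extremely close" concrete, here is a small arithmetic sketch in plain Python. It uses only the value quoted from the stack trace; the variable names are illustrative and nothing here calls XGBoost. It shows how far the request sits below 2^64 and what the same bit pattern means when read as a signed 64-bit integer. This does not identify the root cause; it only suggests that a size computation may have gone negative (or a smaller value had a larger one subtracted from it) before being passed to the allocator.

```python
# Value copied from the stack trace in this issue.
requested = 18446744073675998463

# Distance from the wrap-around point of an unsigned 64-bit integer.
gap = 2**64 - requested
print("bytes below 2**64:", gap)
print("in MiB:", gap / 2**20)  # roughly 32 MiB

# Reinterpret the same 64-bit pattern as a signed (two's complement) value.
signed = requested - 2**64 if requested >= 2**63 else requested
print("as int64:", signed)
```

Read as a signed value, the request is a negative number of roughly 32 MiB in magnitude, which fits the suspicion of an overflow or underflow in a size or offset calculation rather than a genuine out-of-memory condition.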