[Bugfix] cuda error running llama 3.2 #11047
Conversation
Signed-off-by: Gene Su <e870252314@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀
cc @comaniac
LGTM
When and why would it fail?
When running Llama 3.2 in a Ray actor, a runtime error is thrown, which causes module loading to fail. But regardless, the way this is used now should probably never fail and should just return falsy values in those cases.
@GeneDer I'm concerned about this change, and would like to know why it fails. I'm afraid this might be caused by some incorrect setup on your infra side, because normally this works as long as you can run
@youkaichao There are literally no other changes on our end besides upgrading vLLM 😅 Also, you can see the offending PR calls those methods on module loading, which IMO is not supposed to fail outright and should just use the default values.
I think the right fix would be removing the function call on module loading, rather than changing the function.
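(As a rough, hypothetical sketch of this alternative, assuming the check is used to pick a kernel at call time rather than at import time; `pick_kernel`, `_supports_new_kernel`, and the `(8, 0)` threshold are illustrative placeholders, not vLLM's actual code.)

```python
# Hypothetical sketch only: defer the capability query from module scope
# into the function that needs it, so importing the module never touches CUDA.
import functools


@functools.lru_cache(maxsize=None)
def _supports_new_kernel() -> bool:
    # torch is imported lazily so that merely importing this module
    # never initializes CUDA.
    import torch

    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 0)  # example threshold


def pick_kernel() -> str:
    # The capability check runs on first call, not at module import time.
    return "new_kernel" if _supports_new_kernel() else "fallback_kernel"
```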
While that also works, I disagree with the approach. The caller (
This is true. That's why I want to know why it fails in your environment. Usually it means something is wrong with the NVIDIA setup; NVML should work for all NVIDIA datacenter hardware.
I don't think that's the case, unless it's always been set up incorrectly until now and the prefix prefill module just revealed the issue lol. But the environment has not changed between the upgrade; it's always been running the engine in a Ray actor on a GPU cluster.
Can you try to reproduce it, and see if you can get an error from
Re: https://github.com/vllm-project/vllm/pull/9850/files#diff-107fd4a59dcd0831ff802fefe9c49eac02432b6a6d1f508075a8b1809c1468b4R11-R15

Those `.get_device_capability` and `.has_device_capability` calls are now made on module loading of prefix prefill; however, they can throw errors when used with CUDA. This PR catches those unexpected runtime errors and returns the corresponding default values (`None` and `False`) in the failure cases so the module can be loaded successfully.
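A minimal sketch of the catch-and-default behavior described above, assuming the underlying query can raise `RuntimeError` in environments such as a Ray actor; `safe_get_device_capability` and `safe_has_device_capability` are hypothetical names, not the actual vLLM API:

```python
# Hypothetical sketch of the defensive pattern described in this PR;
# not the actual vLLM implementation.
from typing import Optional, Tuple


def safe_get_device_capability(device_id: int = 0) -> Optional[Tuple[int, int]]:
    try:
        import torch  # imported lazily to keep module import cheap

        return torch.cuda.get_device_capability(device_id)
    except RuntimeError:
        # e.g. CUDA/NVML not usable at import time inside a Ray actor
        return None


def safe_has_device_capability(minimum: Tuple[int, int], device_id: int = 0) -> bool:
    capability = safe_get_device_capability(device_id)
    return capability is not None and capability >= minimum
```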