
bug: nitro cuda windows low performance on machine has multiple GPUs - tested using Jan App #269

Closed
hiento09 opened this issue Dec 14, 2023 · 4 comments
Assignees
Labels
type: bug Something isn't working

Comments

@hiento09
Contributor

hiento09 commented Dec 14, 2023

Describe the bug
My Windows machine has 3 GPUs. When I enabled all 3 GPUs, the token speed was slow (6-9 tok/s) and it was not even able to load TinyLlama 1B. When I disabled 2 GPUs, leaving only 1 active, performance returned to normal.

Screenshots

  • 3 GPUs active
    • Low performance [screenshot]
    • TinyLlama load error [screenshot]
  • 1 GPU active only; performance back to normal [screenshot]

Desktop (please complete the following information):

  • OS: Windows 11
  • Nvidia driver: 531.18
  • CUDA version: 12.3
  • Nitro version: 0.1.27
  • GPUs:
    • 1x RTX 4070 Ti
    • 2x GTX 1660 Ti
@hiento09 hiento09 added the type: bug Something isn't working label Dec 14, 2023
@hiento09 hiento09 changed the title bug: nitro cuda windows not able to load tinyllama 1B bug: nitro cuda windows low performance on machine has multiple GPUs Dec 14, 2023
@hiento09 hiento09 changed the title bug: nitro cuda windows low performance on machine has multiple GPUs bug: nitro cuda windows low performance on machine has multiple GPUs - tested using Jan App Dec 14, 2023
@KossBoii

@hiento09 I have a feeling this problem is coming from the communication between the different GPUs. I'll look out for this while reading the codebase.

@linhtran174 linhtran174 self-assigned this Dec 15, 2023
@hiro-v
Contributor

hiro-v commented Dec 17, 2023

@KossBoii that's exactly the multi-GPU problem.
I tested again on that machine:

  • Using only the 4070 Ti => 55 tok/sec
  • Using either one of the two 1660 Tis => 28 tok/sec

The distributed inference requires:

  • Good bandwidth between GPUs
  • The performance discrepancy between the GPUs should not be too large (in this case the 4070 Ti has to wait for the 1660 Tis to finish computing). Also, this setup uses PCIe 3 and 4, not NVLink, so data has to be transmitted via the CPU to reach another GPU.
  • Explicitly setting the value for TP (tensor parallelism) in nitro.
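For reference, upstream llama.cpp (which nitro builds on) exposes explicit control over how tensors are split across devices; whether nitro forwards these flags is an assumption, and the binary name and model path below are placeholders:

```shell
# llama.cpp server: pin all work to GPU 0 and give the slower cards a zero
# share of the split. Flag names (--main-gpu, --tensor-split) are from
# upstream llama.cpp; nitro forwarding them is an assumption.
./server -m ./model.gguf --main-gpu 0 --tensor-split 1,0,0
```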

It depends, but I think the option to run a model on a single GPU with the help of CUDA_VISIBLE_DEVICES makes sense in this case (i.e. a hardware-sensing feature).
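A minimal sketch of that approach, assuming device index 0 is the 4070 Ti (setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes the CUDA indices follow nvidia-smi's ordering, so this should be verified against nvidia-smi first):

```shell
# Expose only one device to the CUDA runtime before launching the engine.
# Index 0 = RTX 4070 Ti is an assumption; check the ordering with nvidia-smi.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
echo "$CUDA_VISIBLE_DEVICES"
```

Any process launched from this shell will then only see the single fast GPU, sidestepping the cross-GPU synchronization cost entirely.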

@hiro-v hiro-v assigned hiento09 and unassigned linhtran174 Dec 18, 2023
@hiento09 hiento09 closed this as completed Jan 4, 2024
@hiento09 hiento09 reopened this Jan 4, 2024
@hiro-v
Contributor

hiro-v commented Mar 22, 2024

This should be properly supported with this instead: ggerganov/llama.cpp#6017

@0xSage
Contributor

0xSage commented Jul 1, 2024

Closing in favor of tracking this more granularly, now that we have various engines.

6 participants