sync : ggml (ggml-backend) #3548
Conversation
ggml-ci
Hello, I have some general questions about backends (they apply to any project, not just LLM inference). I have noticed that ggml-cuda and the other backends in llama.cpp differ from the implementation in the ggml repository; there are some kernels specific to llama.cpp. I am planning to implement the CUDA backend in stable-diffusion.cpp, but I have a lot of doubts. Could you please take some time to clarify them for me?
@ggerganov yes, I think we should merge this here to avoid conflicts in the future. I already tested the changes.
@FSSRepo most of the development of the backends happens in this repository, but the changes to ggml are regularly synced between the ggml and llama.cpp repositories. I am not sure if that answers your question; if you have other doubts, just ask.
sync : ggml (ggml-backend) (ggerganov#3548)
This change broke offloading Falcon-7B with ngl=34 (v cache on GPU but not k cache). I'm getting this assertion failure:
@cebtenzzre I think it was already broken before; this change just made it assert rather than generate nonsense. The source of the problem is that the packed wqkv weight doesn't play well with CUDA offloading: the Kcur portion is needed on the CPU, but currently I don't see a good way to copy a view back to the CPU. This should be fixed in the future with ggml-backend.
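To illustrate the situation, here is a simplified, hypothetical sketch (not the actual Falcon graph code; ctx, wqkv, inp, n_q and n_kv are placeholders, and the offsets are illustrative rather than the real layout). The point is that Q, K and V are views of one packed tensor, so the K portion cannot easily live on a different device than the rest:
struct ggml_tensor * cur  = ggml_mul_mat(ctx, wqkv, inp); // packed Q/K/V output, placed wherever wqkv is offloaded
struct ggml_tensor * Qcur = ggml_view_1d(ctx, cur, n_q,  0);
struct ggml_tensor * Kcur = ggml_view_1d(ctx, cur, n_kv, n_q          * ggml_element_size(cur));
struct ggml_tensor * Vcur = ggml_view_1d(ctx, cur, n_kv, (n_q + n_kv) * ggml_element_size(cur));
// if cur ends up on the GPU, Kcur is a view of GPU memory; with the partial-offload
// scheme at the time there was no clean way to copy just that view back to the CPU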
Maybe Falcon was already broken (don't remember if I checked), but LLaMA 7B with ngl=34 was definitely producing correct output before this PR: #3820 |
 // check if a tensor is allocated by this buffer
 static bool ggml_allocr_is_own(struct ggml_allocr * alloc, const struct ggml_tensor * tensor) {
-    void * ptr = tensor->data;
-    return ptr >= alloc->data && (char *)ptr < (char *)alloc->data + alloc->max_size;
+    return tensor->buffer == alloc->buffer;
 }
In train & finetune I create a new allocator, allocate tensors, free the allocator, and then repeat that for different groups of tensors.
Before this change the actual data pointer was used to check ownership. This pointer differed between successive allocators because each one used a different data buffer.
But the buffer member used here may not differ, because it is malloc'ed by ggml_backend_buffer_init. It can actually happen that the previous memory address is reused, because the previous allocator along with its buffer was freed.
So with this change ggml_allocr_is_own does not work correctly with the use case of a temporary allocator. The allocator must be kept around as long as the tensors themselves.
Small example:
std::vector<char> buf0;
std::vector<char> buf1;
buf0.resize(...);
buf1.resize(...);
struct ggml_tensor * a = ...;
struct ggml_tensor * b = ...;
struct ggml_allocr * alloc;
alloc = ggml_allocr_new(buf0.data(), buf0.size(), alignment);
ggml_allocr_alloc(alloc, a);
ggml_allocr_free(alloc);
alloc = ggml_allocr_new(buf1.data(), buf1.size(), alignment);
ggml_allocr_alloc(alloc, b);
// Now ggml_allocr_is_own(alloc, a) can be unexpectedly `true`
ggml_allocr_free(alloc);
This results in the allocator freeing tensors that are not actually owned by it.
I wrote that it 'can' happen, but I actually observe this behaviour in train & finetune; whether it manifests probably depends on malloc.
I can change the train & finetune programs to not free the allocators, which would solve the problem. But in the above example it is still a bit odd that ggml_allocr_is_own(alloc, a) can unexpectedly return true after the previous allocator was freed.
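A minimal sketch of that workaround, reusing buf0/buf1, a/b and alignment from the example above: keep both allocators alive until the tensors are no longer needed, and only free them at the end.
struct ggml_allocr * alloc0 = ggml_allocr_new(buf0.data(), buf0.size(), alignment);
ggml_allocr_alloc(alloc0, a);
struct ggml_allocr * alloc1 = ggml_allocr_new(buf1.data(), buf1.size(), alignment);
ggml_allocr_alloc(alloc1, b);
// ... use a and b ...
// both allocators (and their backing buffers) stay alive, so a and b remain valid
// and the buffer-based ownership check cannot collide with a freed buffer's address
ggml_allocr_free(alloc1);
ggml_allocr_free(alloc0);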
Maybe I missed other cases, but this should have been fixed in #4486.
The problem is that checking the addresses doesn't really work with measure allocators. The previous implementation used virtual memory as a stopgap, but this has other problems. Ultimately, the issue is that ggml_allocr_new creates a temporary ggml_backend_buffer that is freed when the allocator is freed, and from that point the tensors cannot be used anymore since they point to a freed buffer. When using ggml-backend it is possible to free the allocator if you supply your own buffer, but otherwise the ggml-alloc should be kept alive for the lifetime of the tensors.
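As a rough sketch of the "supply your own buffer" variant, assuming the ggml-backend/ggml-alloc API as it looked around this time (ggml_backend_alloc_buffer, ggml_allocr_new_from_buffer) and reusing the tensors a and b from the earlier example; buf_size is a hypothetical size:
ggml_backend_t backend = ggml_backend_cpu_init();
// the buffer owns the tensor memory; the allocator only assigns offsets into it
ggml_backend_buffer_t buffer = ggml_backend_alloc_buffer(backend, buf_size);
struct ggml_allocr * alloc = ggml_allocr_new_from_buffer(buffer);
ggml_allocr_alloc(alloc, a);
ggml_allocr_alloc(alloc, b);
ggml_allocr_free(alloc); // fine here: the data lives in `buffer`, which the allocator does not own
// ... use a and b ...
ggml_backend_buffer_free(buffer); // free the tensor data only once a and b are no longer needed
ggml_backend_free(backend);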
Indeed, this solves the issues somewhat for finetune, but there are some more cases. I will push a PR after I come back from visiting family for the holidays.
@slaren Should we merge what we already have for ggml-backend? If there isn't any breaking change, we might want to do so in order to keep things in sync more easily.