ggml : update mul_mat_id to use the same tensor for all the experts #6387
Conversation
I tried to update … Also, for grok, I don't even have enough free disk space to try that one.
Ok, that's good to know. I guess the same code should be usable for grok, but the memory usage may be a problem. For mixtral, …
How about making a new …?
I think it is better to convert the weights to the new format at load time than to maintain multiple versions of the op. I also plan to make further changes to …
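To illustrate the load-time conversion idea in the comment above, here is a minimal, hedged sketch (not the actual loader code; the function and parameter names are made up): each per-expert 2D weight matrix is copied into its own slice of a single merged buffer, which is the layout a combined 3D experts tensor would use.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical illustration: merge n_expert row-major 2D weight matrices
// (each n_rows x n_cols floats) into one contiguous buffer laid out as
// [expert][row][col], matching a merged "one tensor for all experts" layout.
static std::vector<float> merge_expert_weights(
        const std::vector<const float *> & experts, // split per-expert data
        size_t n_rows, size_t n_cols) {
    const size_t expert_size = n_rows * n_cols;
    std::vector<float> merged(experts.size() * expert_size);
    for (size_t e = 0; e < experts.size(); ++e) {
        // each expert occupies one contiguous slice of the merged buffer
        std::memcpy(merged.data() + e * expert_size, experts[e],
                    expert_size * sizeof(float));
    }
    return merged;
}
```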
IMHO, developers who use ggml would prefer backward compatibility.
We generally try to maintain backward compatibility to a good extent, but sometimes it can be too difficult. MoE functionality is on track to become standard, so we need good support for it. Maintaining duplicate kernels would significantly increase the required effort.
In this case, if the original …
Pros: …
Cons: …
Anyway, if a breaking change is unavoidable, then just do it as soon as possible.
If my understanding is correct, this only affects MoE models, right?
Yes
I'm sorry, but as a heavy user of Mixtral 8x7b Instruct, can someone tell me if: …
If you want to create new quants with an imatrix, you need to convert the model again from an hf or pth model; you cannot reuse the same fp16 gguf. You don't need to use any special flags.
Thanks for this work.
If you are fully offloading the model, there is very little advantage to converting the models again. For CPU and Metal, it allows using mmap, which can improve load times (especially if you are restarting llama.cpp repeatedly), but in most cases it won't really matter. For performance benefits, see #6505 (but that also works with older models).
I would say that is a slight understatement. In my case, for Mixtral 8x7B, the load time increased from a few seconds to 3 minutes or so. At first I thought my HW was dying, because the loading process was accompanied by a flood of kernel messages (related to Intel IOMMU*) so severe that it made the machine unresponsive and consumed all available space in /var/log within a few minutes. That turned out to be a likely unrelated issue, perhaps just randomly triggered by the way llama.cpp uses ROCm while loading the model.

Having ruled out a HW problem, it occurred to me to try an older build. Sure enough, the slow loading went away, and git bisect eventually led me to this commit. It may have been a good idea to print a warning to the log when a model starts being loaded in a "compatibility mode". That would have saved me a lot of frustration and time. :) (But I have only myself to be angry at, because I even remember seeing this PR and the notes about breaking compatibility...)

*) In case anyone else on ROCm and an Intel CPU also experiences a flood of messages like:
followed by a stack trace and other related info for a bug in …
It is impossible to account for the unpredictability of AMD drivers; this change doesn't cause any meaningful overhead when offloading an old model. Without offloading, and with Metal, it may cause the model to be evicted from the system cache if you don't have enough memory for two copies. This will increase the load time of a second run, but the time of the first load should be essentially the same.
Ah, I suppose that would be the main slowdown in my case. I did not realize the conversion does not happen in place. Thanks for the background. Agreed on the AMD drivers, though; they still have plenty of work to do to make the whole stack rock solid and reliable (e.g., the GPU in my new AMD laptop has crashed in three different ways in the past two months...)
If you are not offloading a large portion of the model while using a CUDA or HIP build, one difference is that it will try to allocate a pinned buffer for the CPU portion of the model (the same as what happens when disabling mmap). On some systems this can cause instability if there is too much pinned memory, since that memory is not available to other processes. You can disable this behavior by defining the environment variable …
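For background only (this is not llama.cpp's code, just a generic CUDA runtime illustration): pinned host memory is page-locked, so the OS cannot swap it out or hand it to other processes, which is why allocating most of RAM as a pinned staging buffer can starve the rest of the system.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n_bytes = size_t(1) << 30; // 1 GiB, purely for illustration
    void * buf = nullptr;
    // page-locked (pinned) allocation: faster host<->device transfers, but this
    // memory stays unavailable to other processes until it is freed
    if (cudaMallocHost(&buf, n_bytes) != cudaSuccess) {
        std::fprintf(stderr, "pinned allocation failed\n");
        return 1;
    }
    // ... stage weights here and copy them to the device asynchronously ...
    cudaFreeHost(buf);
    return 0;
}
```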
Now I realize I first misunderstood your comment about the model being evicted from the cache; I'm loading from a fast SSD, so the cache alone would not explain such a big slowdown. But what you say about pinned memory probably explains it: I'm offloading about 10 of 33 layers to the GPU, and the portion left to the CPU takes up almost all the RAM. So if the CPU portion is pinned, there is barely anything left to work with, and the slowdown is probably caused by swapping. Indeed, after setting …

EDIT: Now reading about memory pinning, the IOMMU errors may be related after all, since another, less common message present in the flood was "amdgpu: init_user_pages: Failed to get user pages: -1". And …
ggml : update mul_mat_id to use the same tensor for all the experts (#6387)

* ggml : update mul_mat_id to use the same tensor for all the experts
* update cuda
* minor
* update metal
* update test-backend-ops
* fix cuda
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update convert.py
* update convert-hf-to-gguf.py
* update convert.py for mixtral hf models
* Update convert-hf-to-gguf.py
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* cuda : support non-pow-2 number of experts
* allow quantize to work for split and merged experts models in the same way
* cleanup + disable mmap automatically with split tensors models
* update imatrix
* test-backend-ops : test qwen argsort
* update grok model loading
* llama : add merged experts tensors to the grok tensor map
* minor
* gguf : bump version
* fix quantizing of merged experts
* convert-hf-to-gguf.py : update grok (untested)
* make linter happy
* cuda/argsort : use shared memory instead of pool memory
* convert : fix grok tensor names
* metal : add support for non-pow-2 argsort
* llama : more loader cleanup, better error checking
* cuda : fix warning
* llama : still use mmap for loading old models, but copy the data to a host buffer
* add review note
* llama : remove ffn tensor counting + add sanity check (ggml-ci)
* convert : fix handling of n_experts == None (ggml-ci)
* imatrix : fix ncall counters
* llama : produce error if imatrix size does not match
* quantize : terminate on errors + trace logs (ggml-ci)
* metal : pad shared memory to 16 bytes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a very bad change, since lots of mixtral models need to be created again. Please consider backward compatibility, as llama.cpp is so popular.
It's possible I have found a problem with the MoE imatrix calculations as a result of this PR's changes, but I posted the info in #6515, as DBRX was causing me the problems. I don't really know what the intentions are with the weighting (should less-selected experts have lower importance now, etc.), and the actual quant code is very obtuse, but it looks a lot like the experts' MLP weights are all getting downscaled by a factor of …
Hi @jukofyork, thanks for looking into this. It's very possible that this PR introduced some issue in the imatrix generation of MoE models. I tried to maintain the previous behavior, but I don't know how it works, and the code is entirely uncommented, so I depend on code review.
Hi, I just tested this change to …
and it definitely isn't just a benign change of scale (the same model acts quite differently). The …

Is the intention that less-selected expert MLPs are to have their weights downscaled proportionally? I assume that under the old scheme the allocation of bits was done on a per-tensor basis, so this is a pretty big change: even with the fix above, bits are now allocated based on the proportion of times the top-k gating network selects an expert's MLP? I can see arguments both for and against this, though, so it's not 100% clear which would be best...

If the old behaviour is to be restored, then it's not going to be easy to pass a vector of … struct Stats { … could be adapted inside of …

There is probably a far better way to do both these fixes by altering …, with … needing a look at too. This actually causes the printout to be confusing too: …
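To make the bookkeeping being discussed concrete, here is a hedged sketch (not the actual imatrix code; the struct, field, and function names are hypothetical) of accumulating squared activations together with per-column counts, so that an expert's statistics are not implicitly down-weighted just because the router selected it less often.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-tensor statistics for an importance matrix, extended with
// per-column counts. Vectors are assumed to be pre-sized to the tensor's width.
struct ExpertAwareStats {
    std::vector<float> values; // running sums of squared activations, per column
    std::vector<int>   counts; // how many activations were accumulated, per column
};

// Accumulate one row of n activations belonging to the expert slice that starts
// at column `offset` of the merged tensor.
static void accumulate(ExpertAwareStats & s, const float * x, size_t n, size_t offset) {
    for (size_t j = 0; j < n; ++j) {
        s.values[offset + j] += x[j] * x[j];
        s.counts[offset + j] += 1;
    }
}

// A per-column mean is then values[j] / counts[j] rather than values[j] / ncall,
// which keeps the scaling "per expert" even when experts are selected unevenly.
```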
The intention was that during quantization, each expert is quantized separately, using the fragment of the imatrix that corresponds to that expert (llama.cpp lines 14894 to 14900 in 628b299).
But I may have gotten that wrong. Ultimately, the goal was to preserve the same behavior.
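As a rough illustration of that intent (a hedged sketch, not the code at the referenced lines; `quantize_rows` is a placeholder, not a real llama.cpp API): when quantizing a merged experts tensor, each expert's rows are handed only the slice of the imatrix that belongs to that expert.

```cpp
#include <cstddef>

// Sketch: quantize a merged [n_expert][n_rows][n_per_row] tensor, handing each
// expert only its own fragment of the importance matrix.
template <typename QuantizeRowsFn>
void quantize_experts(const float * data, char * dst, size_t dst_expert_size,
                      size_t n_expert, size_t n_rows, size_t n_per_row,
                      const float * imatrix, // n_expert * n_per_row entries, may be null
                      QuantizeRowsFn quantize_rows) {
    for (size_t e = 0; e < n_expert; ++e) {
        const float * src  = data + e * n_rows * n_per_row;
        const float * imat = imatrix ? imatrix + e * n_per_row : nullptr;
        // quantize_rows(source rows, destination, n_rows, n_per_row, per-column weights)
        quantize_rows(src, dst + e * dst_expert_size, n_rows, n_per_row, imat);
    }
}
```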
I think this should maintain the old "per-expert" scaling behaviour: …

It shouldn't require any breaking changes to the imatrix file format, it should work correctly when joining imatrix files of differing sample sizes, and the … It's O(d) currently (for simplicity), but it could easily be reduced to O(n_experts) later, as all the counts will be the same for each expert. I'll re-quant … It also fixes the weird block saving behaviour for MoEs: …
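The gist of the per-expert normalization described above, as a hedged sketch with hypothetical names (the real change may look quite different): divide each accumulated value by its own count instead of by a single global call counter, so the result no longer depends on how often each expert was routed to.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Turn accumulated sums of squared activations into per-column means using
// per-column counts (hypothetical fields; the real Stats layout may differ).
std::vector<float> column_means(const std::vector<float> & values,
                                const std::vector<int>   & counts) {
    std::vector<float> out(values.size());
    for (size_t j = 0; j < values.size(); ++j) {
        out[j] = values[j] / std::max(1, counts[j]); // guard against unused columns
    }
    return out;
}
```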
Sorry, missed your post. Looking some more here: …

then a change of scale shouldn't affect … BUT: I did get a different result earlier by multiplying all …

Perhaps as a test it might be worth rounding each of the 16 experts differently and then checking that all the complicated slicing loops are really dealing with the experts as expected, using asserts, up until they get quantized?
This commented-out bit of code didn't work: …

Just got … I'm still no wiser what … Luckily, by printing out some debugging info, I found this also works: …

(but it probably triggers the … I guess …
It took a lot of digging, but I think I can see where absolute scales matter: …

The important bit: …

So … and the non-linearity of …
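To make the scale question concrete, here is the general point in hedged form (a generic derivation, not a transcription of the quantization code): as long as the imatrix weights only enter a weighted least-squares fit of the block scale, a uniform rescaling cancels out; it stops cancelling once the weights are combined non-linearly with other per-value terms.

```latex
% Fitting a block scale d for quantized values q_j against originals x_j,
% with importance weights w_j:
%
%   d* = argmin_d sum_j w_j (x_j - d q_j)^2
%      = ( sum_j w_j x_j q_j ) / ( sum_j w_j q_j^2 )
%
% Replacing every w_j by c*w_j (c > 0) leaves d* unchanged, so a pure rescaling
% of the imatrix is benign at this step. Absolute scale starts to matter only if
% w_j is first combined non-linearly with other per-value terms.
\[
d^{*} = \arg\min_d \sum_j w_j \,(x_j - d\,q_j)^2
      = \frac{\sum_j w_j\,x_j\,q_j}{\sum_j w_j\,q_j^{2}},
\qquad
w_j \mapsto c\,w_j \;\Longrightarrow\; d^{*}\ \text{unchanged}.
\]
```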
I added pull request #7099, but it shouldn't just be pushed as-is: it unnecessarily doubles the memory overhead. Hopefully it can be used as a basis for testing and refactoring, etc.
I can confirm this is actually doing something useful, as before …
Changes the storage of experts in memory from a tensor per expert to a single 3D tensor with all the experts. This will allow us to support models with a large number of experts, such as qwen2moe.
Existing MoE model files (i.e. mixtral and grok) with split experts are still usable; however, for the CPU and Metal backends, the data will be copied to a buffer without mmap, which may increase load times slightly. Additionally, imatrices created after this change is merged cannot be used to quantize old models, and imatrices created with previous versions of llama.cpp cannot be used to quantize new models.
Fixes #6082
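To picture the layout change described above, here is a hedged illustration using public ggml calls (the sizes and the helper are placeholders, not the model's actual tensors): instead of one 2D tensor per expert, all experts live in a single 3D tensor with the expert index in the last dimension.

```cpp
#include "ggml.h"

// Sketch with hypothetical dimensions: n_embd x n_ff weights for n_expert experts.
struct ggml_tensor * make_merged_experts(struct ggml_context * ctx,
                                         int64_t n_embd, int64_t n_ff, int64_t n_expert) {
    // before: n_expert separate tensors, each created roughly like
    //   ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_ff);
    // after: one tensor holding every expert, with the expert index in dimension 2
    return ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, n_ff, n_expert);
}
```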