[User] Memory usage is extremely low when running 65B 4-bit models (only uses 5 GB) #864
Comments
@stuxnet147 There was a recent change that made the loader use mmap in order to avoid loading the entire model into memory up front; this helps users without enough memory to still run larger models. However, it negatively affects users who do have enough memory to hold the full model. There is an option you can pass to the program to counteract this and keep the model resident in memory: `--mlock`. Though even with that, the 65B model may still be slow.
Can I use the `--mlock` option on Windows?
@stuxnet147 I'm not sure, since I'm on Linux, but it should still work; try passing that option to the binary and see if it complains.
Once #801 has been merged it should be possible to provide the `--mlock` option on Windows as well.
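For context, here is a minimal sketch (not llama.cpp's actual loader code) of what a `--mlock`-style option amounts to on a POSIX system: the model file is mapped read-only with mmap() and then pinned with mlock() so the pager cannot evict it. On Windows the analogous primitives would be CreateFileMapping/MapViewOfFile plus VirtualLock; whether that is exactly what the PR above adds is an assumption here.

```cpp
// Sketch only: map a model file read-only and pin it in RAM, roughly what a
// --mlock style option does on POSIX systems. Not taken from llama.cpp.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *addr = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // Pin the mapping. This fails with ENOMEM/EPERM when RLIMIT_MEMLOCK is too
    // low, which is the usual way such an option "complains".
    if (mlock(addr, (size_t)st.st_size) != 0) {
        perror("mlock");   // the model still works, it just isn't pinned
    }

    // ... run inference over the mapped weights here ...

    munlock(addr, (size_t)st.st_size);
    munmap(addr, (size_t)st.st_size);
    close(fd);
    return 0;
}
```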
There is no negative impact. The current implementation of llama.cpp, whether it uses mmap or not, is not suitable for users without enough memory. To get to the question: seeing low process memory usage is the big benefit of using mmap() over malloc(). Overall the memory consumption is the same as before.
I believe you may be thinking of the scenario where one does random reads on disk, but in the case of llama.cpp (pre #613) the entire model was loaded into memory upon initialization. At that point it is faster than mmap, because mmap still has disk latency/performance to deal with; they are not the same, and mmap does come with a runtime performance penalty (after the initialization step, which is faster with mmap) precisely because it doesn't load everything into RAM. I'm not sure what you mean by OS memory vs. process memory. You can in fact load things that are larger than your available system memory by using mmap, because it doesn't load everything in at once, only what is actually being read (which in the case of llama.cpp depends on the prompt). You wouldn't be able to load it all at once, though, if that's what you meant when you said the memory consumption is the same as before, because in actuality it wouldn't be (unless you were very unlucky or kept the conversation going for long), since most prompts don't use more than some fraction of the model's weights.
I might have overlooked something, but the only thing that is different is that mmap does not preload the data, and that's a feature, not a problem; it's a tiny change to preload it. So no: mmap does not have to deal with more disk latency than no-mmap. It deals with exactly the same amount of latency, just with a different distribution of when that latency is experienced. Meaning: it happens once at the first inference instead of on load. Regarding OS memory management: you cannot "load larger than RAM" into process memory. You can reserve process memory for more than you can actually load. That's quite a difference.
I just gave the preloading a test; it's a bit more than one line, but it works with mmap.
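The commit behind that test isn't reproduced here. For illustration only, here is a sketch of one way preloading can be added to an mmap'd file on Linux: `MAP_POPULATE` at map time, or `madvise(MADV_WILLNEED)` on an existing mapping. The function name `map_and_preload` is made up for this example.

```cpp
// Illustrative sketch, not the referenced commit: mmap a file and ask the
// kernel to bring it into the page cache up front instead of on first touch.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

void *map_and_preload(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

#ifdef MAP_POPULATE
    // Linux: populate the page tables (and read the file) during mmap().
    void *addr = mmap(nullptr, (size_t)st.st_size, PROT_READ,
                      MAP_SHARED | MAP_POPULATE, fd, 0);
#else
    void *addr = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr != MAP_FAILED) {
        // Fallback hint: start read-ahead for the whole mapping.
        madvise(addr, (size_t)st.st_size, MADV_WILLNEED);
    }
#endif
    close(fd);  // the mapping stays valid after the descriptor is closed

    if (addr == MAP_FAILED) { perror("mmap"); return nullptr; }
    *size_out = (size_t)st.st_size;
    return addr;
}
```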
I understand what you're getting at, but the fact is that the current reason mmap is being used is for sparse loading of the model, not for the concurrent loading scenario you described above. Each inference can read different weights in the model, so it's not just the first inference that gets hit with a disk performance/latency penalty; each subsequent inference will load different parts of the file and result in further disk access. Until the entire file has been loaded into memory by mmap, it will be slower than reading straight from RAM (at least for inferences).
I recall a Reddit discussion where jart thought the model is sparse, but that was a fallacy stemming from misunderstanding the low process memory consumption. Or are you referring to something else? mmap() comes with a lot of improvements for llama.cpp, but "sparse inference" is not one of them.
Not sure what you mean; in my hours of testing with 30B/65B, that is exactly what it does. Are you speaking from theory, or do you have evidence? To be clear, I'm on Linux; I'm not sure if that makes a difference here.
To produce one single output token you need to load every single byte of the weights from disk. Linux vs. Windows makes no difference; it's the same model architecture.
@cmp-nct Consider that if that were true, mmap would copy the entire model into memory upon the first inference, which doesn't currently happen. You can verify this very easily by looking at the total system memory usage (including swap): it is significantly lower with mmap than without, even after the first inference. See: #638 (comment)
No, I cannot confirm that. The system memory consumption reaches 100% of the model before you see the first token appear.
P.S. I'd be happy to be proven wrong; I'm always glad to learn about a mistake. But that needs to point to an actual part of the code. The inference code is just a page, if you ignore the ggml background.
P.P.S. If you really see disk loading after the first inference, that would indicate a deeper flaw and a serious performance impact.
@cmp-nct Now, I may be mistaken about it loading additional weights with each additional inference; it could be that there are always some weights that aren't being used but are by default copied into memory anyway as part of the model loading process. This seems to be the behavior on my system (running Arch); for example, using the 65B model after the first inference, total system memory:
One possibility for why the fault rates are lower could be that the OS loads in multiple pages when a page fault is triggered, so fewer faults occur but the same amount of data is still loaded in. (That doesn't explain your RAM usage chart, though.)
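One way to test that read-ahead hypothesis (a sketch under assumed Linux semantics, not part of llama.cpp): turn read-ahead off for the mapping with `madvise(MADV_RANDOM)` and compare the fault counters from `getrusage()` around an inference. If kernel read-ahead is what keeps the fault count low, the counts should rise sharply while the amount of data read stays the same. The helper names below are invented for this example.

```cpp
// Sketch: helpers for checking whether read-ahead explains low fault counts.
#include <sys/mman.h>
#include <sys/resource.h>
#include <cstddef>
#include <cstdio>

// Print the process's page-fault counters (call before and after an inference).
void report_faults(const char *label) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    std::printf("%s: minor faults=%ld, major faults=%ld\n",
                label, ru.ru_minflt, ru.ru_majflt);
}

// MADV_RANDOM asks the kernel to skip read-ahead around faults on this
// mapping; MADV_NORMAL restores the default behaviour.
void set_readahead(void *addr, size_t len, bool enable) {
    madvise(addr, len, enable ? MADV_NORMAL : MADV_RANDOM);
}
```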
Yeah, it's just a memory-monitoring glitch or bug. I should add: I have seen such behavior during development of my linked commit.
@cmp-nct I'm not sure that explanation holds, because even during token generation, while the inference is still in progress, the memory usage looks like the above, unless you're implying it's being allocated/discarded faster than the tool can detect.
Yes, I'd say the tool is either faulty or too slow to report such things. Or both.
@cmp-nct Hmm, wouldn't mmap then be significantly slower on inference? I have noticed some slowdown, but it isn't of the order of magnitude we'd expect when going from 30 GB/s RAM to a 3 GB/s SSD. All of this is just theory, though; it would be interesting to see the actual numbers on how many allocates/deallocates are being done per nanosecond with mmap.
No memory copy is involved. All the OS does is set flags, that's my guess. In any case, you sadly do not have a 60% sparse model.
@cmp-nct What about mmap's cache? To my understanding it does cache whatever you read in memory, so that the next read doesn't need to re-fetch from disk.
"An application can determine which pages of a mapping are currently resident in the buffer/page cache using mincore(2)."
https://man7.org/linux/man-pages/man2/mmap.2.html
Though it may be specific to Linux? (https://biriukov.dev/docs/page-cache/2-essential-page-cache-theory/)
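That mmap(2) note points at mincore(2). As a sketch of that approach (assuming Linux; `resident_fraction` is a made-up helper), you can count how many pages of the mapped model are actually resident in RAM/page cache after an inference, independent of what a monitoring tool reports:

```cpp
// Sketch: report what fraction of an mmap'd region is resident in memory,
// using mincore(2) as suggested by the mmap(2) man page.
#include <sys/mman.h>
#include <unistd.h>
#include <vector>
#include <cstddef>
#include <cstdio>

double resident_fraction(void *addr, size_t length) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (length + (size_t)page - 1) / (size_t)page;
    std::vector<unsigned char> vec(npages);

    if (mincore(addr, length, vec.data()) != 0) {
        perror("mincore");
        return -1.0;
    }

    size_t resident = 0;
    for (unsigned char v : vec) {
        resident += v & 1;   // bit 0 set => the page is in RAM/page cache
    }
    return (double)resident / (double)npages;
}
```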
Dear llama.cpp team,
I am experiencing two issues with llama.cpp when using it with the following hardware:
The first issue is that although the model requires a total of 41478.18 MB of memory, my machine only uses 5 GB of memory when running the model. I would like to know whether this is normal behavior or whether something is wrong.
The second issue is related to the token generation speed of the model. Despite my powerful CPU, which consists of two Xeon Silver 4216 processors, I am only getting a token generation speed of 0.65 tokens/s. This seems slower than what I would expect from my hardware. Could you please advise on how to improve the token generation speed?
Here is the information you may need to help troubleshoot the issue:
[Software Env]
[Output]