Advise the kernel to preload the mapped memory #740
Conversation
cc @comex @danielzgtg for testing, since this is essentially your idea
For a pure experiment you need to use different files on disk between runs, because the previously used file still hangs around in the page cache and affects subsequent runs. No change for me. macOS Catalina, LLVM 16, Haswell, 2 cores.
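As an aside, on Linux one way to get a cold-cache run without swapping files is to evict the model's pages from the page cache first. A minimal sketch of such a helper (hypothetical, not part of this PR) using `posix_fadvise`; macOS has no `posix_fadvise`, so `purge` is the usual workaround there:

```c
/* drop_cache.c — best-effort eviction of a file from the Linux page cache
 * between benchmark runs. Hypothetical helper, not part of this PR. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Advisory: ask the kernel to drop cached pages for the whole file
     * (offset 0, length 0 = to end of file). Clean pages are discarded;
     * the call cannot force out pages still in use elsewhere. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0) {
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    }
    close(fd);
    return 0;
}
```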
Testing with: Linux HDD (before), Linux SSD (before), Linux HDD (after), Linux SSD (after).
Yes, I had this idea 13 hours ago in #693 (comment). No, I could not measure the improvement I predicted on Linux in #734 (comment). The data I gathered show the effect is not very statistically significant, and even ignoring that, the trends for HDD and SSD go in opposite directions for some reason. I'm certain that this fix will help and is necessary for Windows users. I could test on Windows, but it would take me a long time to set things up.
You may also want to try other madvise flags. It is also possible that the low performance on some OSes is actually caused by too much read-ahead when the data are used in a random order; in that case, experimenting with the advice flag could help.
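To make that experiment concrete, here is a rough sketch of how the advice flag could be made switchable at runtime; the `LLAMA_MADVISE` environment variable is a made-up knob for experimentation, not something this PR adds:

```c
// Sketch only: pick the madvise() advice at runtime so different read-ahead
// policies can be compared on the same machine. LLAMA_MADVISE is hypothetical.
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static void advise_mapping(void *addr, size_t length) {
    const char *mode = getenv("LLAMA_MADVISE");
    int advice = MADV_NORMAL;                                               // default kernel read-ahead
    if (mode != NULL) {
        if      (strcmp(mode, "random")     == 0) advice = MADV_RANDOM;     // disable read-ahead
        else if (strcmp(mode, "sequential") == 0) advice = MADV_SEQUENTIAL; // aggressive read-ahead
        else if (strcmp(mode, "willneed")   == 0) advice = MADV_WILLNEED;   // prefetch the range
    }
    madvise(addr, length, advice);                                          // purely advisory; ignore failures
}
```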
I observe that the macOS Catalina default, MADV_NORMAL, works best (Intel Haswell).
Try:
It was tried in the post above: #740 (comment)
Looks good. One thing I’d change: instead of running this at load time, run it before every eval. This way, if, say, you’re running an interactive session, and the kernel decides to page out the model while it’s waiting for input, it’ll get paged back in efficiently. (That won’t help if the kernel decides to page out the model in the middle of evaluation, but there’s no way to help that without mlock.)
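A rough sketch of that suggestion (names are hypothetical, not the PR's actual code): keep the mapping's address and size around and re-issue the hint at the start of every eval, not only at load time.

```c
// Sketch of re-advising before each evaluation so pages reclaimed while the
// process was idle get faulted back in efficiently. Hypothetical names.
#include <stddef.h>
#include <sys/mman.h>

struct mapped_model {
    void  *addr;    // base address returned by mmap()
    size_t length;  // size of the mapping in bytes
};

static void prefetch_model(const struct mapped_model *m) {
    // Hint the kernel that the whole mapping is about to be used again.
    madvise(m->addr, m->length, MADV_WILLNEED);
}

// Call prefetch_model(&model) at the top of the eval path instead of (or in
// addition to) calling it once at load time.
```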
I remembered a better option on Linux: we can use MAP_POPULATE. We might still want to keep the madvise call as well.
You're totally right, I almost forgot about it! I just tried it here on my ARM 4 GB board and the flash read speed rose from 65-89 MB/s to 92-120 MB/s! On my PC, however, it was the opposite: the load time doubled, from 9.9 s to 20 s. However, about 1.7 s of those 10 extra seconds were recovered in eval time, likely because the data were already where they were needed. But it's quite strange.
Updated the commit to use MAP_POPULATE on Linux:

```diff
 int64_t length = lseek(fd, 0, SEEK_END);
+#ifdef __linux__
+    void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+#else // MAP_POPULATE is only supported on Linux
     void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
+#endif
 close(fd);
```
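For reference, a sketch of how the two hints could live in one place — MAP_POPULATE where it exists (Linux), a madvise fallback elsewhere. This is an illustration under those assumptions, not the final code of this PR:

```c
// Sketch: populate the mapping eagerly on Linux, otherwise fall back to a
// prefetch hint. Assumes the fd/length handling shown in the diff above.
#include <sys/mman.h>

static void *map_model(int fd, size_t length) {
    int flags = MAP_SHARED;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;   // Linux: populate page tables up front (blocks until read)
#endif
    void *addr = mmap(NULL, length, PROT_READ, flags, fd, 0);
    if (addr == MAP_FAILED) {
        return NULL;
    }
#ifndef MAP_POPULATE
    madvise(addr, length, MADV_WILLNEED);   // other platforms: ask for read-ahead instead
#endif
    return addr;
}
```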
Does it hurt to keep both MAP_POPULATE and madvise?
From a commit that referenced this pull request:

- Support all three formats (ggml, ggmf, ggjt). (However, I didn't include the hack needed to support GPT4All files without conversion. Those can still be used after converting them with convert.py from my other PR.)
- Support both mmap and read (mmap is used by default, but can be disabled with `--no-mmap`, and is automatically disabled for pre-ggjt files or on platforms where mmap is not supported).
- Support multi-file models like before, but automatically determine the number of parts rather than requiring `--n_parts`.
- Improve validation and error checking.
- Stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields. This has no immediate benefit, but makes it easier to experiment with different formats, and should make it easier to support the new GPTQ-for-LLaMa models in the future (I have some work in progress on that front).
- Support VirtualLock on Windows (using the same `--mlock` option as on Unix).
- Indicate loading progress when using mmap + mlock. (Which led me to the interesting observation that on my Linux machine, with a warm file cache, mlock actually takes some time, whereas mmap without mlock starts almost instantly...) To help implement this, move mlock support from ggml to the loading code.
- madvise/PrefetchVirtualMemory support (based on #740).
- Switch from ifstream to the `fopen` family of functions to avoid unnecessary copying and, when mmap is enabled, allow reusing the same file descriptor for both metadata reads and mmap (whereas the existing implementation opens the file a second time to mmap).
- Quantization now produces a single-file output even with multi-file inputs (not really a feature as much as 'it was easier this way').

Implementation notes: I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.
- Exceptions. I don't even usually use exceptions when writing C++, and I can remove them if desired... but here they make the loading code much more succinct while still properly handling a variety of errors, ranging from API calls failing to integer overflow and allocation failure. The exceptions are converted to error codes at the API boundary.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
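The Windows counterpart mentioned in that commit message is `PrefetchVirtualMemory` (available since Windows 8). A minimal sketch of how it could be used, not the exact code from that rewrite:

```c
// Sketch of prefetching a mapped region on Windows with PrefetchVirtualMemory.
// Requires Windows 8+ (_WIN32_WINNT >= 0x0602); real code would probably load
// the function dynamically so older systems still run.
#ifdef _WIN32
#include <windows.h>

static void prefetch_mapping_win32(void *addr, size_t length) {
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = addr;
    range.NumberOfBytes  = (SIZE_T) length;
    // Advisory hint: ask the memory manager to page in the whole range.
    // Failure is non-fatal, so the return value can simply be ignored.
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}
#endif
```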
Has been merged as part of #801
- In the launcher, if an existing value is set for a file field (e.g. Model), use that file's directory as the initial directory when the file dialog is opened with 'Browse'.
- In the launcher, always set the initial directory for 'Load' to the current working directory.
Hopefully this helps with the loading times when using mmap() on Windows and Unix (Linux/macOS).
I tested only on macOS, where the load time of the 7B model decreased from 7 seconds to 2 seconds, with no change in inference performance.
This needs further testing, so I am opening this as a draft.
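In essence, the change is a prefetch hint issued right after the model file is mapped. A simplified sketch of the idea, assuming `MADV_WILLNEED` as the advice and with error handling trimmed, rather than a verbatim quote of the diff:

```c
// Simplified sketch of advising the kernel to preload the mapping right
// after mmap(); not the PR's exact diff.
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_model_file(const char *path, size_t *out_length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    off_t length = lseek(fd, 0, SEEK_END);
    void *addr = mmap(NULL, (size_t) length, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return NULL;
    // Advise the kernel that the whole file will be needed soon, so it can
    // start reading it in ahead of use instead of faulting page by page.
    madvise(addr, (size_t) length, MADV_WILLNEED);
    *out_length = (size_t) length;
    return addr;
}
```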
One possible improvement is to call `VirtualLock(addr, length)` on Windows to lock the specified region of the process's virtual address space into physical memory. But I need someone to test this for me and check whether it is needed and helpful.
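For whoever picks up the Windows side: `VirtualLock` is limited by the process's working-set quota, so a sketch of the idea would raise the working set first with `SetProcessWorkingSetSize`. This is an untested sketch with arbitrary slack values, not code from this PR:

```c
// Sketch of the VirtualLock idea. VirtualLock fails with a quota error if the
// requested range exceeds the process's minimum working set, so raise the
// working-set limits before locking.
#ifdef _WIN32
#include <stddef.h>
#include <windows.h>

static BOOL lock_mapping_win32(void *addr, size_t length) {
    SIZE_T min_ws = (SIZE_T) length + (64u  << 20);  // room for the rest of the process
    SIZE_T max_ws = (SIZE_T) length + (256u << 20);
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws, max_ws)) {
        return FALSE;   // could not raise the quota
    }
    return VirtualLock(addr, (SIZE_T) length);
}
#endif
```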