Fix model loading time through prefetching the file on another thread #734
Conversation
I created a separate thread to read the file, which improves loading time by prefetching the file from disk.
It would be better if we use
Very interesting that Windows is not issuing correctly sized reads by default. Which tool are you using to measure this? That said, this solution is really hacky. You probably want to use
On Windows I empty the standby list with the RAMMap utility, then run llama.cpp, open taskmgr, and look at the disk tab under Performance.
ggml.h
#ifndef _POSIX_THREADS
#if defined(_WIN32)
#include <windows.h>
#endif
typedef HANDLE pthread_t;
typedef DWORD thread_ret_t;
static int pthread_create(pthread_t* out, void* unused, thread_ret_t(*func)(void*), void* arg);
static int pthread_join(pthread_t thread, void* unused);
#endif
Why reinvent std::thread?
std::thread does not exist in MSVC
It seems to exist: https://learn.microsoft.com/en-us/cpp/standard-library/thread-class?view=msvc-170 . But I didn't test it, so feel free to resolve this thread if std::thread doesn't work on your machine.
std::thread does exist in MSVC
Gonna try PrefetchVirtualMemory before moving the PR from draft to non-draft.
PrefetchVirtualMemory is perfect and works properly on Windows. Gonna do some POSIX tests and update https://github.com/CoderRC/libmingw32_extended
I don't like the fact that this approach complicates things even for platforms that are not broken (i.e. not Windows). Can we get rid of everything else and keep just the
In other words, does this fix the issue with no further changes required?
diff --git a/llama.cpp b/llama.cpp
index 854bb89..78fbace 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -327,6 +327,11 @@ static void *mmap_file(const char *fname, uint64_t *mm_length) {
void *addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
CloseHandle(hMapping);
if (!addr) return 0;
+ // Prefetch the virtual memory range
+ WIN32_MEMORY_RANGE_ENTRY range;
+ range.VirtualAddress = addr;
+ range.NumberOfBytes = (SIZE_T)length;
+ PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
#else
int fd = open(fname, O_RDONLY);
    if (fd == -1) return 0;
madvise should speed things up for Linux too. It might not be as dramatic, but there should be a tiny improvement.
Came up with #740 - it speeds up the loading time of the 7B model 3.5 times on my MacBook.
Try it on Linux (it works) and compare speeds to #740.
You changed the end-of-line format (from Unix to Windows), so GitHub shows that you replaced all the lines. Please fix this. Also, don't merge master into your branch. What you should do is rebase: https://www.atlassian.com/git/tutorials/rewriting-history/git-rebase
…reate and pthread_join
Try to run tests, post the results below, and compare to #740.
With this PR my HDD's model loading speed went from 5 MB/s to 50 MB/s. It's not as fast as before mmap, but we're getting there.
I found that with the current llama.cpp, the first model load is slow because the OS doesn't know which parts of the file will be read, causing 4 KB requests to disk. Reading the file on a separate thread turns these into 1 MB requests. For example, 1,000,000 4 KB requests to disk become roughly 4,000 1 MB requests, which is a big difference.
Specifically solves #705
Right now it reads at 500 MB/s for me; after this patch it reads at 2.3 GB/s.
Try: git clone https://github.com/CoderRC/llama.cpp
to see the difference in first loading time (measure only after a restart to produce accurate results).
From:
llama_print_timings: load time = 31928.12 ms
To:
llama_print_timings: load time = 11478.47 ms