
Fix model loading time through prefetching the file on another thread #734

Closed
CoderRC wants to merge 11 commits

Conversation


@CoderRC CoderRC commented Apr 3, 2023

I found that with the current llama.cpp, the first model load is slow because the OS does not know which parts of the file will be read next, so it issues 4 KB requests to the disk. Reading the file on a separate thread turns these into 1 MB requests. For example, 1,000,000 4 KB requests to disk are reduced to 4,000 1 MB requests, which is a big difference.
Specifically solves #705.
Right now it reads at 500 MB/s for me; after this patch it reads at 2.3 GB/s.

Try: git clone https://github.com/CoderRC/llama.cpp
to see the difference in first loading time (only after a restart, to produce accurate results).

From:
llama_print_timings: load time = 31928.12 ms

To:
llama_print_timings: load time = 11478.47 ms
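
A minimal sketch of the idea (hypothetical helper, not the exact code in this PR): a background thread streams the model file into a scratch buffer in 1 MB chunks, so the OS page cache is already warm when the mapped weights are first touched.

// Sketch only: warm the page cache with sequential 1 MB reads on a background thread.
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

static void prefetch_file_async(const std::string & fname) {
    std::thread([fname] {  // the lambda copies the path so the thread owns it
        FILE * f = std::fopen(fname.c_str(), "rb");
        if (!f) {
            return;
        }
        std::vector<char> buf(1u << 20);  // 1 MB per read request
        while (std::fread(buf.data(), 1, buf.size(), f) == buf.size()) {
            // keep streaming until EOF or a short read
        }
        std::fclose(f);
    }).detach();  // fire and forget; the main thread keeps loading as usual
}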

CoderRC added 2 commits April 2, 2023 23:48
I created a separate thread to read the file, which improves loading time by prefetching the file from disk.
@danielzgtg

It would be better if we used madvise with MADV_WILLNEED on Linux and the equivalent on Windows. This way the OS will still read the pages from disk without needing a separate thread to copy them.
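
A minimal sketch of that suggestion (assuming the POSIX branch of the existing mmap_file() path): madvise(MADV_WILLNEED) asks the kernel to start reading the mapped range in the background.

// Sketch only: hint the kernel to start paging in the whole mapping.
#include <stddef.h>
#include <sys/mman.h>

static void prefetch_mapping(void * addr, size_t length) {
    // Best effort: if the hint fails, the mapping still works, just without readahead.
    madvise(addr, length, MADV_WILLNEED);
}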

@comex
Contributor

comex commented Apr 3, 2023

Very interesting that Windows is not issuing correctly sized reads by default. Which tool are you using to measure this?

That said, this solution is really hacky. You probably want to use PrefetchVirtualMemory on Windows and madvise on Unix.

@CoderRC
Author

CoderRC commented Apr 3, 2023

On Windows, I empty the standby list with the RAMMap utility, then run llama.cpp, open Task Manager, and look at the disk graphs on the Performance tab.

ggml.h Outdated
Comment on lines 779 to 787
#ifndef _POSIX_THREADS
#if defined(_WIN32)
#include <windows.h>
#endif
typedef HANDLE pthread_t;
typedef DWORD thread_ret_t;
static int pthread_create(pthread_t* out, void* unused, thread_ret_t(*func)(void*), void* arg);
static int pthread_join(pthread_t thread, void* unused);
#endif


Why reinvent std::thread?

Author

std::thread does not exist in msvc


Seems to exist at https://learn.microsoft.com/en-us/cpp/standard-library/thread-class?view=msvc-170 . But I didn't test, so feel free to resolve this thread if std::thread doesn't work on your machine


std::thread does exist in msvc
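
For reference, a minimal std::thread example that builds with MSVC (illustrative only; the file path is hypothetical and this is not code from the PR):

// Illustrative only: MSVC has shipped std::thread since VS2012, so a pthread-style
// shim is not strictly needed for a background loader thread.
#include <cstdio>
#include <string>
#include <thread>

static void load_worker(const std::string & fname) {
    std::printf("prefetching %s on a background thread\n", fname.c_str());
    // ... read the file here ...
}

int main() {
    std::thread t(load_worker, std::string("models/7B/ggml-model-q4_0.bin"));  // hypothetical path
    t.join();  // wait for the worker to finish
}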

@CoderRC CoderRC marked this pull request as draft April 3, 2023 04:59
@CoderRC
Author

CoderRC commented Apr 3, 2023

Going to try PrefetchVirtualMemory before moving this from draft to non-draft.

@CoderRC
Author

CoderRC commented Apr 3, 2023

PrefetchVirtualMemory is perfect and works properly on Windows. Going to do some POSIX tests and update https://github.com/CoderRC/libmingw32_extended

@prusnak
Collaborator

prusnak commented Apr 3, 2023

I don't like the fact that this approach also complicates things for platforms that are not broken (i.e. not Windows).

Can we get rid of everything else and keep just the PrefetchVirtualMemory call?

In other words, does this fix the issue with no further changes required?

diff --git a/llama.cpp b/llama.cpp
index 854bb89..78fbace 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -327,6 +327,11 @@ static void *mmap_file(const char *fname, uint64_t *mm_length) {
     void *addr = MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
     CloseHandle(hMapping);
     if (!addr) return 0;
+    // Prefetch the virtual memory range
+    WIN32_MEMORY_RANGE_ENTRY range;
+    range.VirtualAddress = addr;
+    range.NumberOfBytes = (SIZE_T)length;
+    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
 #else
     int fd = open(fname, O_RDONLY);
     if (fd == -1) return 0;
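
One caveat to keep in mind: PrefetchVirtualMemory only exists on Windows 8 / Server 2012 and later, so a build that has to run on older Windows could resolve it at runtime (hypothetical helper; assumes the SDK headers declare WIN32_MEMORY_RANGE_ENTRY):

// Sketch only: look up PrefetchVirtualMemory dynamically so older Windows versions
// simply skip the prefetch instead of failing to load the binary.
#include <windows.h>

typedef BOOL (WINAPI * prefetch_fn_t)(HANDLE, ULONG_PTR, PWIN32_MEMORY_RANGE_ENTRY, ULONG);

static void prefetch_range(void * addr, SIZE_T length) {
    HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
    prefetch_fn_t prefetch = kernel32
        ? (prefetch_fn_t) GetProcAddress(kernel32, "PrefetchVirtualMemory")
        : NULL;
    if (prefetch) {
        WIN32_MEMORY_RANGE_ENTRY range;
        range.VirtualAddress = addr;
        range.NumberOfBytes  = length;
        prefetch(GetCurrentProcess(), 1, &range, 0);  // best effort; ignore failures
    }
}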

@danielzgtg

not broken (i.e. not Windows).

madvise should speed things up for Linux too. It might not be as dramatic, but there should be a tiny improvement

@prusnak
Collaborator

prusnak commented Apr 3, 2023

madvise should speed things up for Linux too. It might not be as dramatic, but there should be a tiny improvement

Came up with #740 - it speeds up the loading time of the 7B model 3.5 times on my MacBook.

@CoderRC
Author

CoderRC commented Apr 3, 2023

Try it on Linux (it works) and compare the speeds to #740.

@prusnak
Collaborator

prusnak commented Apr 3, 2023

You changed the end-of-line format (from Unix to Windows), so GitHub shows that you replaced all the lines. Please fix this. Also, don't merge master into your branch; what you should do is rebase: https://www.atlassian.com/git/tutorials/rewriting-history/git-rebase

@CoderRC
Author

CoderRC commented Apr 3, 2023

Please run some tests, post the results below, and compare to #740.

@comex
Contributor

comex commented Apr 3, 2023

@CoderRC Can you please test #740 as well?

@x02Sylvie

x02Sylvie commented Apr 5, 2023

With this PR my HDD's model loading speed went from 5 MB/s to 50 MB/s.

It's not as fast as before mmap, but we're getting there.

@CoderRC
Author

CoderRC commented Apr 19, 2023

I am going to close this since #801 was merged, and that merge solved my model loading time on MSYS2 with mingw32. If you want to solve "Efficient preloading for mmap()", go to #869.

@CoderRC CoderRC closed this Apr 19, 2023