
Improve Linux scheduling and cache memory usage #101

Closed
kakra wants to merge 5 commits from the improve-linux-scheduling branch

Conversation

@kakra (Contributor) commented Oct 18, 2020

This is my attempt at fixing fossilize_replay's tendency to dominate the cache (and to force the kernel to swap out anonymous memory). It should utilize CPU cache lines better, reduce scheduling pressure (broken out into a separate PR), reduce cache and memory pressure, and improve the write-back behavior under concurrent desktop loads.

Currently this PR only hints database reads, as that is where hinting is most effective. The kernel has no real functionality for write hinting to get rid of dirty data in the cache, except using fsync(). But using fsync() may have the inverse effect of blocking other processes by locking metadata structures in the file system - btrfs being the most sensitive to it.

Maybe someone can instruct me how to set up a testing sandbox for this (one which doesn't require a virtual machine) so I could test the impact. This PR is currently crafted from theory only. I don't even know how the database is actually accessed, but I believe almost all IO requests happen only once, so we can hint the page cache after each read request.

If this shows positive effects, I would continue to work on it and look into the write case. There are two ways:

  1. We could use fadvise to tell the kernel that we are not going to reuse the dirty data in the cache, though man 2 posix_fadvise documents this as:

    The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed.

  2. We could use fsync() to sync every write. That may slow down fossilize_replay a lot, but usually results in far less cache thrashing. It may, however, also lock metadata in file systems and thus block other processes. It has a much higher potential of stopping the swapping introduced by the replayer, so it may be worth finding heuristics to use fsync() regularly but not on every write. It could also be combined with point 1 (a sketch follows below).
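
For point 2, a minimal sketch of how both options could combine on a raw file descriptor (the helper name is made up; doing the fsync() first matters because, as discussed below, DONTNEED silently skips pages that are still dirty):

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: drain dirty pages for this fd, then hint the kernel
// that the now-clean pages will not be reused and may be evicted.
static void linux_flush_and_drop(int fd)
{
    if (fsync(fd) == 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}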

Also, I didn't test whether it still compiles and works on Windows.

Feel free to cherry-pick the scheduling bits of this; I'll rebase the PR then.


@kakra kakra force-pushed the improve-linux-scheduling branch 2 times, most recently from fd36bde to 1ecb6f9 Compare October 20, 2020 23:44
@kakra kakra marked this pull request as ready for review October 21, 2020 00:15
@kakra (Contributor, Author) left a comment:

@HansKristian-Work I'm satisfied with the PR so far. The write-throttling part mentioned initially won't be implemented as part of this PR, because the writes do not actually happen inside the Fossilize code - they are injected by the GPU driver's shader compiler. We'd probably need some throttling heuristics around the actual shader calls, but that should become a separate PR.

Please review.

@HansKristian-Work (Collaborator) left a comment:

I think the IO prio patches for Linux/OSX and the self/exe thing are fine, but I need:

  • Please rebase those patches on a separate PR.
  • Please add links to relevant documentation on the magic IO-prio interfaces so anyone can verify it's correct. Hacking in manual defines like this is a slight red flag.

For posix_fadvise stuff, I'll need to study some things first.

fossilize_external_replayer_linux.hpp (outdated review thread, resolved)

@kakra (Contributor, Author) commented Oct 21, 2020

While reviewing, could you let me know whether all the posix_fadvise() calls make sense? I think as long as the read is issued by the background job, it's totally fine. But Fossilize also has a recorder mode which traces a Vulkan application to record its shaders. That code path should probably not discard reads from the cache, to maintain the best latency for the app being instrumented - unless it's a one-shot read.

Also, it looks like some fwrite() calls write big blobs of data which we should probably fsync() after writing, in an upcoming PR.

We should maybe add some performance counters which sum up calls, times, and amount of data, and maybe put them into a shared memory segment so we can read stats while the service is running - something that lets us find where the big reads and writes are going.

@kakra kakra force-pushed the improve-linux-scheduling branch from 1ecb6f9 to 6a61cb3 Compare October 21, 2020 19:37
@kakra (Contributor, Author) left a comment:

Requested changes added. When all is good, I'll split them out to separate PRs.

@@ -27,6 +27,15 @@
#include <fcntl.h>
#endif

#ifdef __linux__
@HansKristian-Work (Collaborator) left a comment:

Overall the fadvise calls seem reasonable, although I have some concerns. This is a pretty obscure corner of Linux, so:

  • How does DONTNEED affect other processes? Do all processes using a particular range of the file have to call DONTNEED or close the file for it to be effective?

  • The ranges here seem very narrow. Like a few bytes each time, and each blob tends to be rather small. Is it even meaningful to use fadvise on tiny byte ranges like this? Can Linux somehow link together these DONTNEED requests into complete blocks/pages?

  • Looking at the man-page there is POSIX_FADV_NOREUSE. Isn't this more appropriate in some scenarios?

  • Is there any way to observe the page cache behavior anywhere so we're not just adding this blindly? I think this is the key, since I can play around with this myself while I'm thinking about how to refactor this part of Fossilize into something more efficient. To make this work well, I feel a fair amount of tuning will be needed.

The access pattern is going to be something like this: the primary process first scans through the archive.

This is fully sequential. The entire archive isn't actually read, only the metadata, but a sequential scan should prime everything, and that seems to be the behavior we get by default on Linux. This step serves two purposes: figuring out how to subdivide work for the children, and warming the page cache for the child processes later.
https://github.com/ValveSoftware/Fossilize/blob/master/cli/fossilize_replay_linux.hpp#L455

Child processes come in and also scan through the archive in a strided fashion to figure out the metadata. This is technically redundant, and I have some ideas for how to fix this. We cannot rely on fork() on Windows, so most likely I will emit a prepared metadata block in SHM instead, so that child processes do not have to parse anything. If the disk cache is primed, the current implementation should be fast, although I can see a problem if the archive is just too large (many, many GB): it's going to thrash in some way.

After the initial scan, when replay begins, the access pattern turns quite random as blobs are read as needed and out of order. Not all blobs are read; in particular, pipeline objects are subdivided among child processes. This still works well if the entire archive is present in the page cache. I think the current code only emits DONTNEED hints when blobs are actually read, but we should probably ensure that unused parts of the file don't get cached? I'm not sure how SEQUENTIAL works here. If it implicitly de-prioritizes data once it's read, that's probably fine.

As for writes, Fossilize writes very little data on its own. The only writes I would worry about is the disk cache writers in the drivers.

@kakra (Contributor, Author) commented Oct 22, 2020

Overall the fadvise calls seem reasonable, although I have some concerns. This is a pretty obscure corner of Linux, so:

Yep.

  • How does DONTNEED affect other processes? Do all processes using a particular range of the file have to call DONTNEED or close the file for it to be effective?

DONTNEED operates on ranges in the cache. If there's dirty data in the range, the request is ignored; otherwise the cache is discarded - no matter which process was using it in the first place. But I don't think that's a concern here, because fossilize is the only user of the data at that time.

  • The ranges here seem very narrow. Like a few bytes each time, and each blob tends to be rather small. Is it even meaningful to use fadvise on tiny byte ranges like this? Can Linux somehow link together these DONTNEED requests into complete blocks/pages?

This is a valid concern. Maybe we'd need to ask the kernel guys or look at the kernel source. I could imagine that it discards all pages touched by the range - unless those pages are dirty, in which case it would fall back to ignoring the request.

  • Looking at the man-page there is POSIX_FADV_NOREUSE. Isn't this more appropriate in some scenarios?

In theory, yes. But (a) it seems to be made for the write case, and (b) it isn't implemented at all in current kernels. The man page actually states that, and it says the previous implementation's semantics were a bug. There's no cache hinting for writes in Linux.

  • Is there any way to observe the page cache behavior anywhere so we're not just adding this blindly? I think this is the key, since I can play around with this myself while I'm thinking about how to refactor this part of Fossilize into something more efficient. To make this work well, I feel a fair amount of tuning will be needed.

Given that you keep the rest of your system idle during a test, you can drop the caches to prepare a cold run and then watch egrep '(Cached|Dirty|Write):' /proc/meminfo before and after running fossilize-replay. That gives a rough idea of how effectively we freed the cache. You could also run fossilize-replay in a cgroup with memory accounting turned on and watch how much memory it allocates - this accounting also includes cache usage. The latter is what I did (though I looked at the Steam cgroup, so I didn't test fossilize in isolation), while also watching how fast swap usage increased. With these patches, my system stayed much more responsive during the 21 GB fossilize job for the Path of Exile shaders. Admittedly, responsiveness is hard to measure; it's subjective.

The access pattern is going to be something like this: the primary process first scans through the archive.

This is fully sequential. The entire archive isn't actually read, only the metadata, but a sequential scan should prime everything, and that seems to be the behavior we get by default on Linux. This step serves two purposes: figuring out how to subdivide work for the children, and warming the page cache for the child processes later.
https://github.com/ValveSoftware/Fossilize/blob/master/cli/fossilize_replay_linux.hpp#L455

Yeah, but actually we probably don't want to prime the cache - that makes for 9 GB of cache for the Path of Exile shaders - which is probably discarded early anyway if it is that huge. This doesn't even fit the average system with 8-12 GB of RAM. Instead, we could use fadvise to tell the kernel that we are doing sequential access (which will double the read-ahead window for this file), or that we are doing random access (which will set read-ahead to 0). And if we know we are not going to need that data again soon, we should discard it, because we are not a foreground process.
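
For illustration, those hints map onto posix_fadvise() roughly like this (a sketch around a stdio stream, in the spirit of the LINUX_fadvise() wrapper this PR adds; offset 0 with length 0 applies the advice to the whole file):

#include <cstdio>
#include <fcntl.h>

static void advise_stream(FILE *file, int advice)
{
#ifdef __linux__
    if (file)
        posix_fadvise(fileno(file), 0, 0, advice);
#else
    (void)file;
    (void)advice;
#endif
}

// advise_stream(f, POSIX_FADV_SEQUENTIAL); // double the readahead window
// advise_stream(f, POSIX_FADV_RANDOM);     // disable readahead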

Child processes come in and also scan through the archive in a strided fashion to figure out the metadata. This is technically redundant, and I have some ideas for how to fix this.

Yeah, I think that's a good slice of the problem.

We cannot rely on fork() on Windows,

Apparently. Makes me sad. :-(

so most likely I will emit a prepared metadata block in SHM instead, so that child processes do not have to parse anything. If the disk cache is primed, the current implementation should be fast, although I can see a problem if the archive is just too large (many, many GB): it's going to thrash in some way.

The biggest problem is when the processes have to copy that data into private memory - no matter whether it's shared or not. If we exhaust system memory that way, there's no point in trying to be as disk friendly as possible, because we are going to flush the cache from memory anyway. In that case, we don't need to prime anything in the first place. Also, the children are not latency sensitive, so I don't think priming serves any valuable purpose.

After the initial scan, when replay begins, the access pattern turns quite random as blobs are read as needed and out of order. Not all blobs are read; in particular, pipeline objects are subdivided among child processes.

In that case we should use FADV_RANDOM to turn off read-ahead: we don't need to read an additional 64 kB if we are only going to read some small, randomly placed blobs. If IO priority is lowered, the system can take care of reordering IO requests optimally, so we don't need to prime the cache to reduce seek overhead. At least on Linux - I think Windows works very differently here.

This still works well if the entire archive is present in the page cache. I think the current code only emits DONTNEED hints when blobs are actually read, but we should probably ensure that unused parts of the file don't get cached? I'm not sure how SEQUENTIAL works here. If it implicitly de-prioritizes data once it's read, that's probably fine.

As pointed out above: SEQUENTIAL doubles the read-ahead window (the kernel will read ahead 128 kB instead of 64 kB), essentially speeding up sequential workloads by reducing the IO request count. RANDOM sets read-ahead to 0, telling the kernel that we want to read exactly that particular data and no adjacent data which we probably don't need at all.

In strace, it looks like the file stream functions break big reads into small 4k pieces, so SEQUENTIAL made sense here.

As for writes, Fossilize writes very little data on its own. The only writes I would worry about is the disk cache writers in the drivers.

Yeah, it was hard for me to spot any writes at all in strace. All the writes seem to come from the NVIDIA driver lib. But it opens two file descriptors on behalf of the process, as can be seen in /proc/<fossilize_pid>/fd. It should actually be possible to issue fsync() or fadvise() on those descriptors. But for a first implementation, it should be enough to use fsync() every now and then (maybe by looking at the iostats of the process and syncing after every 4 MB written): it should block writes on the fd until the write-back is drained and thus throttle write-back to a sane rate. Background info: Linux tends to stall on writes because it considers blocked reads more perceivable to interactive users than blocked writes - after all, you'd expect the application to stall when you save your document. So if we put a file descriptor into a synchronized state, we should successfully throttle the accumulation of dirty data in the cache, preventing the kernel from swapping too early.
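
A rough sketch of that "sync after every ~4 MB written" idea, reading the per-process write counter from /proc/self/io where the kernel provides it (the helper name and the 4 MB threshold are illustrative; note that wchar counts all bytes this process passes to write(), not only the driver's):

#include <fstream>
#include <string>

// Bytes this process has written so far, from the wchar field of
// /proc/self/io; returns 0 if the file cannot be read.
static unsigned long long process_bytes_written()
{
    std::ifstream io("/proc/self/io");
    std::string key;
    unsigned long long value = 0;
    while (io >> key >> value)
        if (key == "wchar:")
            return value;
    return 0;
}

// Caller sketch: after each replayed pipeline, compare against the last
// synced value and fsync() the cache descriptors once ~4 MB have accrued.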

So, yeah, I also concluded that we only need to worry about the writes to files we cannot control. The accumulation of dirty data in the cache is one of the remaining problems I'm seeing (besides the duplication of memory anyway), and it is written back only slowly because we are reading a lot of stuff and Linux prefers reads over writes (this is kind of like letting people into an elevator before letting others out, so we need a guard that advises the other way around). At some point, the kernel will start blocking all processes doing IO just to get rid of the pile of dirty write-back data - and then your system stalls very visibly and may freeze for seconds or even minutes.

@kakra (Contributor, Author) left a comment:

BTW: I was thinking that we should not only consider HW threads for how many children to spawn, but maybe also consider ceil(sysmem * 0.5 / dbsize) and use the lower of both values, or some biased value like (2 * sysmem_count + hwthreads_count) / 3 (see the sketch below)... But the memory duplication should be fixed first. And we should also think about datasets which do not fit into system memory at all. The Path of Exile dataset seems to be a good example; it's 9 GB for me.
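
For illustration, that heuristic could look like this (the names mirror the formula above; nothing here is measured or part of the PR):

#include <algorithm>
#include <cmath>
#include <cstdint>

static unsigned suggested_child_count(unsigned hwthreads_count, uint64_t sysmem, uint64_t dbsize)
{
    if (dbsize == 0)
        return hwthreads_count;
    // Number of children whose working sets fit into roughly half of RAM.
    unsigned sysmem_count = (unsigned)std::ceil(0.5 * (double)sysmem / (double)dbsize);
    // Alternative floated above: a biased mix,
    // (2 * sysmem_count + hwthreads_count) / 3.
    return std::max(1u, std::min(hwthreads_count, sysmem_count));
}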

@kakra (Contributor, Author) left a comment:

  • The ranges here seem very narrow. Like a few bytes each time, and each blob tends to be rather small. Is it even meaningful to use fadvise on tiny byte ranges like this? Can Linux somehow link together these DONTNEED requests into complete blocks/pages?

This is a valid concern. Maybe we'd need to ask the kernel guys or look at the kernel source. I could imagine that it discards all pages touched by the range - unless those pages are dirty, in which case it would fall back to ignoring the request.

I've asked in IRC and, while I got no definitive answer, I was told this: unless the request returns EINVAL, it'll also discard partially requested pages from the cache. So, if we check the return value, we'll know.
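
Checking that is cheap, because posix_fadvise() returns the error number directly instead of setting errno - a sketch:

#include <cerrno>
#include <fcntl.h>

int err = posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
if (err == EINVAL)
{
    // The kernel rejected the request: the pages were not discarded.
}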

@kakra (Contributor, Author) commented Oct 22, 2020

@HansKristian-Work I've looked at the kernel. This is how it works: WILLNEED rounds the range outward to the enclosing page boundaries, while DONTNEED rounds it inward. So we cannot DONTNEED sub-page ranges.

It's here:
https://github.com/torvalds/linux/blob/96485e4462604744d66bf4301557d996d80b85eb/mm/fadvise.c#L68

Also, POSIX_FADV_{NORMAL,RANDOM,SEQUENTIAL} do not care about ranges at all - at least in the generic implementation; they just operate on the file handle.

Following the code paths further, I also see that you could prime the cache with WILLNEED instead of reading the complete file: the kernel will just start readahead of the range you give it. It runs a loop chunking the range into 2 MB slices and starts loading data into the cache without latency overhead - but only if the range exceeds the readahead window - and it respects the device's optimal IO size.

@kakra (Contributor, Author) left a comment:

Bear with me, I'm still in the process of understanding how it works. But the picture slowly becomes clearer. Thanks for your patience.

Yeah, I don't think the individual metadata blocks create a lot of pressure on their own; it's the sum of all the small pieces of a big file. DONTNEED at close time seems to help. And yes, the private copy currently seems to be the biggest problem, but I cannot help much here - you know the code much better.

I'd probably go for the write pressure introduced by the driver's own shader cache next. Could you point me to where the writes by the driver are probably injected? Should be somewhere around the compiler calls, right?

@HansKristian-Work (Collaborator) left a comment:

vkCreateGraphicsPipelines and vkCreateComputePipelines would be where the driver does something, but it's not going to be in the calling thread. All drivers have a "disk cache thread" or similar, so it's completely asynchronous when this happens. I think what you can look into in that case is to check the IO write-back cache size and flush it.

The ideal place for this would be a thread in the main process, assuming it's possible to control fsync behavior for the entire process tree like this. If not, then somewhere around here would be a good fit: https://github.com/ValveSoftware/Fossilize/blob/master/cli/fossilize_replay.cpp#L2086 We just kicked off some parsing work, and while that happens, we can perhaps poke at the IO cache a bit.

@HansKristian-Work (Collaborator) left a comment:

And yes, the private copy currently seems to be the biggest problem, but I cannot help much here - you know the code much better.

Yes, I'll look into this soon. Should be a pretty good win in general.

@kakra (Contributor, Author) left a comment:

When looking at htop while the replayer starts writing, you can press L on the process to see its open files: the process has open file handles on the NVIDIA shader cache, which I could access through /proc/self/fd and use fadvise() or fsync() on. Not sure about Mesa, though. While the writing is probably async, the handles are visible in our process.
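
A sketch of that idea - walking our own fd table and syncing anything that looks like a driver cache file. The "GLCache" substring is an assumption based on NVIDIA's ~/.nv/GLCache path; other drivers would need different patterns:

#include <dirent.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static void sync_driver_cache_fds()
{
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return;
    while (dirent *entry = readdir(dir))
    {
        if (entry->d_name[0] == '.')
            continue;
        char link_path[64], target[4096];
        snprintf(link_path, sizeof(link_path), "/proc/self/fd/%s", entry->d_name);
        ssize_t len = readlink(link_path, target, sizeof(target) - 1);
        if (len <= 0)
            continue;
        target[len] = '\0';
        // Only touch descriptors pointing into the driver's shader cache.
        if (strstr(target, "GLCache"))
            fsync(atoi(entry->d_name));
    }
    closedir(dir);
}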

@HansKristian-Work (Collaborator) left a comment:

Mesa uses one individual file per blob entry, so we won't be able to hook any individual FDs.

@HansKristian-Work (Collaborator) left a comment:

Requested changes added. When all is good, I'll split them out to separate PRs.

Ok. Once the nits I have are fixed, I think the three commits can be rebased out to a separate PR.

@kakra kakra force-pushed the improve-linux-scheduling branch from 6a61cb3 to 5d1b2ae Compare October 22, 2020 18:49
@kakra kakra marked this pull request as draft October 22, 2020 18:51
@kakra (Contributor, Author) commented Oct 22, 2020

Converted back to draft while breaking out finalized commits.

@kakra kakra mentioned this pull request Oct 22, 2020
@kakra (Contributor, Author) left a comment:

This now prefers disabling readahead over trying to discard tiny sub-page ranges. Instead, it discards the file from the cache at close time, which also covers write-back of the writes done by fossilize itself.

fossilize_db.cpp Outdated
@@ -739,6 +792,9 @@ struct StreamArchive : DatabaseInterface
if (!file)
return false;

/* we are seeking into the file, only reading tiny chunks: disable readahead */
LINUX_fadvise(file, 0, 0, POSIX_FADV_RANDOM);
@kakra (Contributor, Author) left a comment:

The new version of the commit now disables readahead instead of discarding the cache on each read.

@kakra (Contributor, Author) left a comment:

After letting it run overnight, it seems to be more memory/cache friendly, as swap usage only increased by 200 MB instead of 2 GB.

@HansKristian-Work (Collaborator) left a comment:

This seems like it would be brutal when reading a database from a cold HDD. Is there any issue with scanning through the metadata with SEQUENTIAL, then transitioning the file into RANDOM or NOREUSE?

@kakra (Contributor, Author) commented Oct 23, 2020

Transitioning should be fine... NOREUSE is a no-op, see my other comment.

But why brutal? It sets the readahead window to 0 for this file's contents only, without penalizing the OS's other file system activity, and we probably don't need 64 kB of readahead if the next seek is going to be far away. How big are the jumps between seeks anyway? That would tell us whether readahead is worth the additional cache pressure:

If the jumps are sub-page ranges, we end up reading the complete file anyway, readahead would be justified, and SEQUENTIAL makes sense as it doubles the readahead. But then again, WILLNEED for the whole file is probably the right thing to do. After a process is done, it should issue DONTNEED for the slice of the database it was working on, to reduce memory pressure on constrained systems. Currently, my patch DONTNEEDs the whole file at close time.

@HansKristian-Work (Collaborator) left a comment:

Well, I mean, if we start out from RANDOM, like this PR is doing, then the initial scan through the archive will be very slow. We observed this on Windows in particular, where O_SEQUENTIAL from a cold HDD really helped.

What I'm thinking is, the ideal way of doing this would be:

  • SEQUENTIAL, while scanning through. Ideally, any pages we have read through will be de-prioritized so that scanning sequentially over a huge archive does not dominate the cache.
  • RANDOM after the initial scan. If pages are dropped while we scan through the archive, we should not do any read-ahead while processing the archive.

@kakra (Contributor, Author) left a comment:

Okay - if the tests showed that, I'll conclude that the seeks are sub-page or sub-readahead-window sized, and SEQUENTIAL really helps here. So your plan sounds good.

While scanning through the entries sequentially, we could just DONTNEED whatever is behind the seek pointer (so we don't run into an issue with sub-page sizes), right? However, if the process starts working with the DB later, it may need to re-read blocks from disk. Apparently, DONTNEED doesn't de-prioritize the cached pages; it throws them away almost immediately (unless some CPU or thread still has a hold on them).

@HansKristian-Work (Collaborator) left a comment:

Hm, right. If DONTNEED immediately drops the pages, that's bad. We really want to keep the pages in cache if it fits. We just don't want to dominate other processes if it doesn't fit. I guess the best we can do is SEQUENTIAL -> RANDOM and leave it at that.

When I figure out a shared SHM metadata block, this initial scan is not necessary in child processes, and we should be able to just go RANDOM out of the gate.

@kakra (Contributor, Author) left a comment:

When I figure out a shared SHM metadata block

Maybe we should wait for that change first, then, before fiddling with this PR any further? It sounds like it has the best potential for reducing memory pressure. In the end, we may be fine with using just WILLNEED in the parent before spawning children, DONTNEED after the children are done, and RANDOM for the actual file access (sketched below). WILLNEED will trigger immediate readahead of the complete file using the device's optimal IO size.
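
In pseudo-chronological form, a sketch of that end state (fd stands for the archive's file descriptor; nothing here is in the PR yet):

// Parent, before spawning children: kick off readahead for the whole archive.
posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

// Parent and children, around the actual blob reads: no readahead.
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

// Parent, after the children are done: drop the archive from the page cache.
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);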

@HansKristian-Work (Collaborator) left a comment:

I guess we can park this PR until then, yes.

@kakra (Contributor, Author) left a comment:

I don't claim credit for the fadvise lines, so feel free to incorporate them into your code redesign where they make sense. I think the work here has given us enough clues and a better understanding of how the kernel behaves with fadvise.

@kakra (Contributor, Author) commented Oct 23, 2020

@HansKristian-Work

We cannot rely on fork() on Windows, so most likely I will emit a prepared metadata block in SHM instead

Why use fork() or execv() with SHM at all? Why does it need this complicated sub-process architecture when the standard CLI call does just multithreading? Is it about possible crashes?

My thoughts are the following: fork() behaves more or less like a thread whose address space has been duplicated. Threads in Linux are actually full processes; Linux doesn't really care about the difference - it still knows that a PID is a thread and may tweak some things, but basically it's a process that shares memory. So the closest we could get to fork() would be to use only threads on both Windows and Linux and deal with concurrent memory access. The explicit use of SHM in some places could then also be dropped. Multithreading is already implemented and works. So why the duplicated effort of maintaining both modes in the code? Especially given that on my Path of Exile test set, the multithreaded version uses 1.4 GB PSS (proportional set size) while the multiprocess version uses 200 MB + 8x 1.2 GB PSS after full init.

@HansKristian-Work (Collaborator) left a comment:

Why use fork() or execv() with SHM at all? Why does it need this complicated sub-process architecture when the standard CLI call does just multithreading? Is it about possible crashes?

Multi-processing is a must, since we must be able to recover from crashes and keep going in a new process. Safely recovering in a multi-threaded setup is impossible.

@HansKristian-Work (Collaborator) left a comment:

Merged your two other PRs, thanks! I guess this PR should be rebased.

@kakra kakra force-pushed the improve-linux-scheduling branch from d3edb60 to fc8aebc Compare October 23, 2020 08:37
@kakra (Contributor, Author) commented Oct 23, 2020

Multi-processing is a must, since we must be able to recover from crashes and keep going in a new process. Safely recovering in a multi-threaded setup is impossible.

So as long as we can ensure that the crash didn't happen due to awkward contents in the proposed SHM additions, we should stay safe in multiprocess mode.

fossilize_db.cpp (outdated review thread, resolved)
As a background service, we don't want to dominate the cache, so hint
the kernel when we no longer need the data in cache: This commit
removes the database from cache when we close it.

Also hint the kernel at database opening time that we're going to
read the database randomly to prevent excess cache usage introduced by
readahead.

Todo:

* This does not yet fix the cache pressure introduced by the shader
  cache in the graphics driver, nor does it fix an issue in the
  replayer that duplicates a lot of data in memory.

* It may be better to first set the readahead mode to sequential on
  initial load, then switch to random mode.

* When starting a game, fossilize-replay should probably pause or
  exit immediately (fossilize itself is not in control of that).

See-also: ValveSoftware#84
See-also: ValveSoftware#99
Signed-off-by: Kai Krakow <kai@kaishome.de>
This adds some thoughts, references, and future plans to the code so
contributors can more easily understand the rationale and the problems
behind this and do not trip over any pitfalls.

Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra kakra force-pushed the improve-linux-scheduling branch from 69740d8 to 6fb75a1 Compare December 20, 2020 19:00
@kakra (Contributor, Author) left a comment:

@HansKristian-Work This keeps writeback introduced by GPU drivers under control for my system. Even with fossilize working, Steam is now able to download data at full speed without lagging my desktop.

From the commit description:

This is quite a primitive attempt to keep write-back data under control. We simply use the current loadavg to estimate the number of outstanding write requests. If we get above a loadavg of 1 per active core (by dividing loadavg by worker threads), we throttle execution for a few milliseconds to give the disk time to write data back. The value of 20 ms per request was found experimentally (it matches around one revolution of a standard HDD plus some overhead). In my test setup it works quite well: it keeps the CPU mostly as busy as before, but the loadavg peaks at around 9.5 for an 8-core system instead of going to 15+.

We cannot expect fossilize to progress any further if the disks cannot keep up with the write-back amount from the GPU driver anyway, so there's no advantage in not waiting for small periods of time.

A better solution could measure the IO PSI data of the process, trying to keep the IO latency below a certain threshold. The problem with loadavg in Linux is that it measures everything in the system that busily waits for events, be it IO, memory allocation, other tasks, etc. - but we are mainly interested in keeping IO under control; everything else can be covered by the CPU scheduler.

Signed-off-by: Kai Krakow <kai@kaishome.de>
@kakra kakra force-pushed the improve-linux-scheduling branch from 6fb75a1 to 1e32c5d Compare December 21, 2020 07:55
@HansKristian-Work (Collaborator) left a comment:

I'm back in the office after New Year's; I'll get back to you then.

@kakra (Contributor, Author) left a comment:

@HansKristian-Work Happy new year!

Turns out there's a problem in btrfs (which I'm using) in kernel 5.4 (up to, and not including, 5.8 or 5.9): writing to large files involves a huge kernel-time overhead in locking and systime due to some internals, ultimately resulting in the desktop stalling for minutes and loadavg spiking. I'm now running 5.10, and while I haven't tested fossilize yet, other programs that trigger the problem no longer seem affected: my desktop runs well with almost no latency issues despite a quite high load (because I forced those programs to reprocess their data).

But this doesn't actually fix the underlying issue of how the GPU driver writes to its cache files; running on 5.4 just uncovered the problem for me and made it prominently visible. I'd assume this will be much less visible with 5.10. But we should probably still assume that there may be a huge performance overhead introduced by the write-back of a GPU driver, and we might want to keep that under control somehow - besides the fact that NVIDIA should probably look into why the driver always rewrites the cache data, no matter whether it was dirty or clean (at least that's how it looks from my perspective).

Next step: Testing a big Steam download and see how desktop latency is affected.

Comment on lines +2385 to +2392
if (current_load > num_worker_threads)
{
// Interprets the current load as number of outstanding IO requests
// 20ms is the time we can expect an IO request to finish
uint32_t throttle_ms = (uint32_t)(current_load * 20.0 / num_worker_threads);
LOGI("Throttling threads %d load %0.2f throttle_ms %d\n", num_worker_threads, current_load, throttle_ms);
std::this_thread::sleep_for(std::chrono::milliseconds(throttle_ms));
}
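
For context, a current_load value like the one used above could be sampled with getloadavg(), a glibc/BSD extension declared in <stdlib.h> (a minimal sketch; error handling kept trivial):

#include <cstdlib>

// Sample the 1-minute load average; getloadavg() returns the number of
// samples it retrieved, or -1 on failure.
static double query_loadavg()
{
    double load = 0.0;
    if (getloadavg(&load, 1) != 1)
        return 0.0;
    return load;
}
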
@HansKristian-Work (Collaborator) commented Jan 4, 2021

First of all, the PR should be rebased - although I'll probably implement this myself once I've hashed out details, since I'm considering if this can be implemented in the control process rather than per-child process, and that implementation would be a bit more complicated. We could use SIGSTOP/SIGCONT to handle throttling in a central location. It would be nice to dynamically throttle down the number of active processes instead of having all threads start throttling in unison.

Using loadavg (CPU) seems like a very brittle approach - what is this supposed to measure? If other tasks are running on the system, Fossilize will needlessly throttle itself. Driver disk caches generally run in a thread, so stalling in the compilation thread doesn't really accomplish that much. Using SIGSTOP/SIGCONT we could throttle any random driver thread.

I've studied the Linux PSI API a bit, and this seems very close to what we want. There are mostly details to be figured out.

From what I understand, the problem is that the disk write cache gets too large too quickly, but I don't know how we can measure that. Would that show up as a stall in any fashion? The kernel appears to have some tunable parameters that control this behavior:

  • /proc/sys/vm/dirty_ratio (the app will block here) (50% on my system)
  • /proc/sys/vm/dirty_background_ratio (the kernel starts background write-back here) (20% on my system)
  • /proc/sys/vm/dirty_expire_centisecs (how long pages can stay dirty) (30 seconds on my system)

If we don't do anything active, it seems like we might have to wait at least 30 seconds to see any disk activity unless we explode memory disk cache usage.

One idea there is to call sync() at regular intervals, but that feels very ugly as well. I cannot find any process-group wide sync() call in Linux, so it would potentially disrupt the system in some way. Monitoring /proc/meminfo for Dirty could be useful here.
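
A sketch of that monitoring idea, parsing the Dirty line from /proc/meminfo (the function name and the polling-watchdog idea are assumptions, not existing Fossilize code):

#include <fstream>
#include <string>

// Current amount of dirty page-cache data in kB, or 0 if unreadable.
static unsigned long long dirty_kb()
{
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line))
        if (line.compare(0, 6, "Dirty:") == 0)
            return std::stoull(line.substr(6));
    return 0;
}

// A watchdog thread could poll this and throttle (or sync) once the value
// crosses some budget, long before dirty_ratio blocks the writers.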

@kakra (Contributor, Author) commented Jan 4, 2021

First of all, the PR should be rebased

I thought I did that. I can check it later when I'm back home.

although I'll probably implement this myself once I've hashed out details, since I'm considering if this can be implemented in the control process rather than per-child process, and that implementation would be a bit more complicated. We could use SIGSTOP/SIGCONT to handle throttling in a central location

Yes, that's true. Currently it tends to throttle all processes at the same time, while in reality it should scale the number of parallel processes. My implementation is more a proof of concept of where the write-amplification happens and whether throttling would be a working solution - I think it accomplished that, and it maybe also found a problem in at least the NVIDIA driver, which seems to rewrite even clean cache entries back to disk. I'm totally relying on you to work out a better, central solution from that. I'm unsure how to do it, because the work seems to be distributed into equally sized chunks per thread instead of streaming and queuing data to a variable number of workers. In the latter case, one could just scale the number of workers dynamically by loadavg (or some other measurement).

I've studied the Linux PSI API a bit, and this seems very close to what we want. There are mostly details to be figured out.

I think it has an event interface a process can subscribe to, to be notified when a threshold has been hit. loadavg worked for me because my load is primarily dominated by IO. Even on other systems it should not be too bad a choice for throttling fossilize down when other processes need system resources; it's just that the throttle algorithm is too primitive (using sleep doesn't adjust to resource usage very well). If we mostly care about IO, PSI may be the right thing. We should keep in mind, though, that many systems out there may not have it yet.

The kernel appears to have some tunable parameters that control this behavior:

This is too difficult to tune; the defaults should be just fine, with the caveat that percentage-based defaults are probably too high for systems with 16+ GB of RAM. But whatever we do here cannot fix it - I already tried: increasing the dirty buffer will just increase latency for all processes once something triggers write-back, while decreasing it tends to give better interactivity at the cost of throughput, adding a lot of short stutters.

One idea there is to call sync() at regular intervals, but that feels very ugly as well. I cannot find any process-group wide sync() call in Linux

There's fsync(), but you'd need to pass file descriptors to it. We can actually see the NVIDIA file descriptors in our process space, but we have no knowledge of when and why to sync which of them. And for Mesa it would probably not work at all, since it creates many small files: that sounds like many short-lived file descriptors versus NVIDIA's long-lived ones. In any case, we don't want to call sync(). So the only solution is to find a way to properly throttle the processes generating the data that is written back to disk.

If we don't do anything active, it seems like we might have to wait at least 30 seconds to see any disk activity unless we explode memory disk cache usage.

I'm not sure how this works; it probably depends on a lot more than just this threshold. And as long as we don't start blocking/starving other processes for IO, everything should be fine - no matter whether the buffer holds 300 seconds or 3 seconds of data. If we can queue 5 GB of write-back and then close the file within 30 seconds, it should not matter at all for other processes; the flusher will write it back. But if we keep adding more data, we may start starving other processes for IO. This is probably more of an issue with btrfs, but the NVIDIA driver seems to add its own nuance to it.

Driver disk caches generally run in a thread, so stalling in the compilation thread doesn't really accomplish that much

At least it stopped fossilize from feeding more data into the driver when resource pressure rose too high. The effect was pretty much immediate; it doesn't really matter that the driver defers flushing data to disk in a secondary thread: at some point we eventually see latency going up, and if we throttle right then, that's enough to keep it under control. My loadavg throttler had a latency of 5 seconds, though. The bigger problem I had with needlessly throttling fossilize was that I was seeing write-back in both the hit and the miss branch, so I had to throttle both branches even though throttling just the miss branch should be enough. In a few tests, I didn't find it slowing down too much, even when the system had some other load: that's because fossilize runs niced anyway - it shouldn't be running when other purely CPU-bound processes increase the load. This theory matched my observations, so loadavg worked well enough.

@kakra (Contributor, Author) commented Jan 4, 2021

BTW:

Using loadavg (CPU) seems like a very brittle approach

loadavg is not just CPU; it's a bit different in Linux compared to traditional Unix. It measures the average number of processes in the running or blocking state (usually R and D; nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible), so it also counts processes waiting on IO (or even memory, swap, ...). In my primitive implementation I just pretended that we are IO-stalled if the load is above the number of CPU cores - this usually works quite well. To get a finer resolution than the 1-minute average, I inverted the loadavg function to calculate the last 5-second window (see the sketch below).
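
For reference, a sketch of that inversion, assuming the kernel's exponential moving average for the 1-minute figure (updated roughly every 5 seconds; the kernel uses fixed-point constants, so this is an approximation):

#include <cmath>

// The kernel updates the 1-minute loadavg roughly every 5 seconds as
//   avg_new = avg_old * e + n * (1 - e), with e = exp(-5.0 / 60.0),
// where n is the current number of runnable + uninterruptible tasks.
// Two samples taken 5 seconds apart (e.g. via getloadavg()) let us solve for n:
static double load_last_5s(double avg_prev, double avg_now)
{
    const double e = std::exp(-5.0 / 60.0);
    return (avg_now - avg_prev * e) / (1.0 - e);
}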

But yes, it's more of a hack than a correct solution - still, it worked surprisingly well.

@HansKristian-Work (Collaborator) left a comment:

Ok, I guess that makes it a little more usable, but I'm still confused by the case where --num-threads is smaller than the number of physical cores. In that case, it will almost always look to Fossilize as though it's overloading the system.

Does this mean loadavg can actually be larger than number of logical cores? Say 32 threads are running on 32 logical cores, and 3 threads are blocking on IO, would that count as 35.0 load?

I guess I'll need to add some things to fossilize-synth first so I can synthesize a truly massive FOZ archive which can be used for testing, since I cannot normally reproduce any IO thrashing at all ...

@kakra (Contributor, Author) commented Jan 5, 2021

Does this mean loadavg can actually be larger than number of logical cores? Say 32 threads are running on 32 logical cores, and 3 threads are blocking on IO, would that count as 35.0 load?

Given a purely CPU-bound load, a load of 1 saturates one core, a load of 2 saturates two cores, etc. Usual desktop loads rarely create a load above 1, no matter how many cores you have - unless you already have some heavy workload running. But in that case, you probably don't want fossilize to add any more anyway. So I concluded we can get away with this idea: if the load (expected to stay below 1.0 per worker for fossilize on an otherwise idle system) goes above the number of threads we are running, we are fully blocking other processes from using those particular cores. If that is due to IO blocking, we actually should stop generating new data that could end up in write-back; thus I throttle for a few ms (based on the load we "overshot").

As a rule of thumb, you could say that a load equal to the number of cores fully saturates the system's resources. That's not necessarily CPU only: you can see a load of 35 on a 32-core system with only 10% CPU usage. Also, the load is not a percentage; it's a function of processes waiting in the scheduler run queue. I've seen loadavgs of 300 on my 8-core system while CPU usage was only 30% - that was clearly due to IO wait. Such high values also mean that every process that needs to do IO now has a very high chance of blocking, so we want fossilize to throttle rather than create excessive load. Even if the CPU is mostly idle at a loadavg of 35, fossilize doesn't want to be the process that adds the tiny extra amount of IO that makes other processes block even more. Since fossilize indirectly creates lots of small IOs, it has a high potential of raising the load exponentially (or at least excessively) under already-high loads. That's the problem I was actually seeing: as long as the system stays around loadavg 8-9 with fossilize running, all is fine. But if we get above 10, it starts feeling sluggish, and if we get above 20, things start to "explode": any more IO added by fossilize only adds exponentially to the load, which quickly rises to 100 or 200 without even using much CPU any longer.

I'm pretty sure the loadavg approach is not that bad (since we want to compete neither for CPU nor for IO), but if we wanted to limit the throttling to IO pressure, PSI would probably be the better approach. Also, I support your suggestion that this would be better implemented in a central location.

To get back to your question: I don't think it's as simple as adding the numbers up. Unless those three extra processes block on IO, you won't see loadavg go beyond what it's currently showing. After all, IO is usually just dispatching a request and then waiting for a signal; nothing is blocking yet, and other processes can resume CPU usage meanwhile. But if we start blocking on IO (e.g., because file system metadata needs to sync - this has a high impact on btrfs), loadavg will increase.

@kakra (Contributor, Author) commented Jan 5, 2021

But hey, maybe I got the reasoning behind your question: yes, it may make sense to use loadavg > hw_threads instead of loadavg > num_threads. Actually, using num_threads was my tool to check whether it manages to stay at that loadavg given the number of threads - and it did, saturating the CPU cores at around 95-97% for each fossilize thread. If fossilize is purely CPU-bound on an otherwise idle system, it should never cross the loadavg > num_threads border, and thus it should never wait. If the system crosses this load, chances are fossilize would have had to wait anyway, so the sleep calculation seemed to work well. If you use loadavg > hw_threads, the sleep-duration formula probably needs more tuning (e.g., we should try to subtract the load probably not caused by ourselves when calculating the cost factor for the sleep function).

@kakra (Contributor, Author) commented Jan 5, 2021

we are fully blocking other processes from using those particular cores

Addendum: in many cases, "cores" in sentences like this can be read as "resources". loadavg is a mixed measure of resource contention - CPU, disk, memory, ...

@kakra (Contributor, Author) commented Jan 11, 2021

Since we've now got some great fairness improvements in fossilize, I'm going to rebase this one last time but I think the few bits left are not worthwhile to merge into fossilize.

@HansKristian-Work What do you think? Anything of importance left here?

@HansKristian-Work (Collaborator) left a comment:

I don't think there's much to add here. The gist of this PR is merged already.

@kakra (Contributor, Author) commented Jan 31, 2021

Latest Steam beta with updated fossilize works wonderfully, closing.

@kakra kakra closed this Jan 31, 2021