WIP: Write-back throttling
v1: This is quite a primitive attempt to keep write-back data under
control. We simply use the current loadavg to estimate the number of
outstanding write requests. If we get above a loadavg of 1 per active
core (by dividing loadavg by the number of worker threads), we throttle
execution for a few milliseconds to give the disk time to write data
back. The value of 20 ms per request was found experimentally (it
matches roughly one revolution of a standard HDD plus some overhead).
In my test setup this works quite well: it keeps the CPU mostly as busy
as before, but the loadavg peaks at around 9.5 on an 8-core system
instead of climbing to 15+.

We cannot expect fossilize to make any further progress if the disks
cannot keep up with the write-back volume from the GPU driver anyway,
so there is no harm in waiting for short periods of time.

A better solution could measure the IO PSI data of the process, trying
to keep the IO latency below a certain threshold. The problem with
loadavg on Linux is that it counts every task in the system that is
blocked waiting for events, be it IO, memory allocation, other tasks,
etc. We are mainly interested in keeping IO under control; everything
else is already covered by the CPU scheduler.

Signed-off-by: Kai Krakow <kai@kaishome.de>
kakra committed Dec 21, 2020
1 parent 941925c commit 1e32c5d
Showing 1 changed file with 48 additions and 0 deletions.
48 changes: 48 additions & 0 deletions cli/fossilize_replay.cpp
@@ -56,6 +56,10 @@
#include <map>
#include <assert.h>

#ifdef __linux__
#include <cmath>
#endif

#ifdef FOSSILIZE_REPLAYER_SPIRV_VAL
#include "spirv-tools/libspirv.hpp"
#endif
@@ -918,6 +922,7 @@ struct ThreadedReplayer : StateCreatorInterface
else
pipeline_cache_misses.fetch_add(1, std::memory_order_relaxed);
}
maybe_throttle();
}
else
{
@@ -1058,6 +1063,7 @@ struct ThreadedReplayer : StateCreatorInterface
else
pipeline_cache_misses.fetch_add(1, std::memory_order_relaxed);
}
maybe_throttle();
}
else
{
@@ -2083,6 +2089,7 @@ struct ThreadedReplayer : StateCreatorInterface

if (memory_index == 0)
{
// TODO Q: Maybe flush write-back here somehow? -> A: No, GPU driver writes do not come from this thread
work.push_back({ get_order_index(MAINTAIN_SHADER_MODULE_LRU_CACHE),
[this]() {
// Now all worker threads are drained for any work which needs shader modules,
@@ -2345,6 +2352,47 @@ struct ThreadedReplayer : StateCreatorInterface
queued_count[item.memory_context_index]++;
}

double m_prev_loadavg = 0.0;
inline void maybe_throttle()
{
#ifdef __linux__
double loadavg[1];
// TODO Maybe use PSI on modern systems to measure the current IO latency?
const int rv = ::getloadavg(loadavg, 1);
if (rv != 1)
{
LOGE("Failed to query load average\n");
return;
}

static const double load_exp = std::exp(-5.0 / 60.0);

// Taken from github.com/Zygo/bees:
// Averages are fun, but we want to know the load from the last 5 seconds.
// Invert the load average function:
// LA = LA * load_exp + N * (1 - load_exp)
// LA2 - LA1 = LA1 * load_exp + N * (1 - load_exp) - LA1
// LA2 - LA1 + LA1 = LA1 * load_exp + N * (1 - load_exp)
// LA2 - LA1 + LA1 - LA1 * load_exp = N * (1 - load_exp)
// LA2 - LA1 * load_exp = N * (1 - load_exp)
// LA2 / (1 - load_exp) - LA1 * load_exp / (1 - load_exp) = N
// (LA2 - LA1 * load_exp) / (1 - load_exp) = N
// except for rounding error which might make this just a bit below zero.
const double current_load = std::fmax(0.0, (loadavg[0] - m_prev_loadavg * load_exp) / (1.0 - load_exp));

m_prev_loadavg = loadavg[0];

if (current_load > num_worker_threads)
{
// Interpret the current load as the number of outstanding IO requests;
// 20 ms is the time we can expect one IO request to take.
uint32_t throttle_ms = (uint32_t)(current_load * 20.0 / num_worker_threads);
LOGI("Throttling threads %u load %0.2f throttle_ms %u\n", num_worker_threads, current_load, throttle_ms);
std::this_thread::sleep_for(std::chrono::milliseconds(throttle_ms));
}
#endif
}

unsigned num_worker_threads = 0;
unsigned loop_count = 0;

