
Use Envoy cpuset size to set the default number of worker threads #5975

Merged: 14 commits, Mar 4, 2019

Conversation

ipuustin
Member

Description:

If the --concurrency option is not specified, set the default number of worker threads based on the cpuset of the process. This means that the number of worker threads is controlled by the CPU allocation rather than the hardware thread count.

For example, if Envoy is started in a Docker container like this, only four worker threads are created when the --concurrency option is not specified:

$ docker run --cpuset-cpus="0-3" ...

The original behavior was that the number of CPU threads in the system would be used to determine the number of worker threads. This number could be very large for servers with multiple physical CPUs and hyperthreading enabled.

This change has a potentially large performance impact, since it changes the default number of threads assigned to Envoy. However, since the original intent of the default concurrency value was to have the number of worker threads match the HW capabilities, after this change Envoy's behavior is probably closer to what is expected. Also, the std::thread::hardware_concurrency() docs say that the returned value "should be considered only a hint".

There are no tests yet, because I wanted to get some input on whether a change like this would be desirable before finalizing the work. Another potential improvement would be monitoring the process cpuset and changing the number of worker threads dynamically.
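For illustration, the idea described above can be sketched by parsing the Cpus_allowed_list line of /proc/self/status. This is a hypothetical helper (the name and structure are made up, not Envoy's actual code), falling back to the hardware thread count when parsing fails:

```cpp
#include <algorithm>
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: count the CPUs listed in the "Cpus_allowed_list"
// line of /proc/self/status, e.g. "0-3,8" -> 5. Falls back to the
// hardware thread count (clamped to >= 1) if the file or line is missing.
unsigned int cpusetSizeFromProcStatus() {
  const unsigned int hw_threads = std::max(1u, std::thread::hardware_concurrency());
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("Cpus_allowed_list:", 0) != 0) {
      continue;
    }
    unsigned int count = 0;
    std::istringstream ranges(line.substr(line.find(':') + 1));
    std::string range;
    while (std::getline(ranges, range, ',')) { // each item is "N" or "N-M"
      const size_t dash = range.find('-');
      if (dash == std::string::npos) {
        count += 1;
      } else {
        const unsigned int lo = std::stoul(range.substr(0, dash));
        const unsigned int hi = std::stoul(range.substr(dash + 1));
        count += hi - lo + 1;
      }
    }
    if (count > 0) {
      return count;
    }
  }
  return hw_threads;
}
```

Run under `taskset -c 0-3`, a helper like this would report 4 regardless of how many hardware threads the machine has.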

Risk Level: Medium
Testing: manual testing with taskset and docker
Docs Changes: Yes, explained the change of behavior
Release Notes: Yes

@ipuustin
Member Author

After (re-)reading the commit guidelines, I created issue #5976 for discussion.

@mattklein123
Member

@ipuustin thanks for working on this. My feeling is that this is too risky to universally change. I'm thinking we can potentially add a new CLI option to enable this behavior? WDYT?

Also, I would prefer if we could factor out the Linux specific code into a dedicated CC file without any ifdefs, and then this file can be excluded on OSX/Windows, etc.

@mattklein123 self-assigned this Feb 15, 2019
@ipuustin
Member Author

@mattklein123 Thanks for the review comments. I understand the risk consideration very well, and I'm totally fine with adding a command line switch. Maybe --cpuset-threads, which, if enabled, would mean that the default number of worker threads is calculated based on the assigned cpuset? Having the Linux-specific code in a separate file also makes a lot of sense. I'll do the changes once we get the potential design somewhat nailed down.

@mattklein123
Member

I'll do the changes once we get the potential design somewhat nailed down.

This proposal sounds great to me.

@sbelair2
Contributor

@ipuustin, @mattklein123 There is a potential issue with this with the integration of Envoy and VPP. VPP comes up early and grabs some cores for DPDK and packet graph processing. I don't know how deterministic VPP's core selection is, but it is guaranteed that those cores can be pegged at 100% of CPU under load. I figured we'd be fine as long as we don't try to set core affinities for Envoy's cpuset. Is this the case?

@ipuustin
Member Author

@sbelair2 This PR doesn't really deal with how Envoy's cpuset might be assigned. It just tries to make the default number of worker threads smaller if Envoy is running on a limited cpuset.

If Envoy and VPP are running in the same container, they indeed share the same assigned cpuset. When VPP "grabs some cores" it probably limits itself to running on those, but Envoy's cpuset isn't limited unless there's some startup wrapper which makes Envoy run on the remainder of the original cpuset. I think the consideration there should just be making the cpuset large enough to accommodate both Envoy and VPP. It shouldn't matter too much if there are one or two Envoy worker threads too many because VPP keeps one or two cores at full load.

@sbelair2
Contributor

@ipuustin Thanks for the clarification. Btw, VPP will be running outside of any containers, but I see no issue as you state it.

@mattklein123 left a comment

Thanks this is great, some high level comments and then will take a deeper look on the next pass. Thank you!

/wait

@@ -76,6 +76,12 @@ following are the command line options that Envoy supports.
`connection` component to run at `trace` level, you should pass ``upstream:debug,connection:trace`` to
this flag. See ``ALL_LOGGER_IDS`` in :repo:`/source/common/common/logger.h` for a list of components.

.. option:: --cpuset-threads

*(optional)* This flag is used to control the number of worker threads if --concurrency option is not
Member

nit: I assume this is only a Linux thing? Can we potentially make that a little more clear, and possibly link to some definition / man page / etc. of cpuset?

Member Author

I added a mention that this is about Linux and a link to the kernel docs.


namespace Envoy {

uint32_t OptionsImplPlatform::getCpuCount() { return std::thread::hardware_concurrency(); }
Member

Should we print a warning here that the platform does not support cpuset so Envoy is defaulting to HW concurrency?

Member Author

I think that's a good idea (since it's obviously not working as the user intends when they try to use it). I added the warning.


namespace Envoy {

uint32_t OptionsImplPlatform::getCpuCount() {
Member

Can we get some test coverage of this code? Potentially a test that only runs in Linux? If you want to test it using mocks you can also consider using OsSysCalls (that might be good also to test the edge cases).

Member Author

I added some tests. I chose to refactor the getCpuCount() function into two to make the testing easier.


@htuch
Member

htuch commented Feb 21, 2019

@ipuustin can you merge master to pick up the compile_time_options CI fix?

if (threads > 0 && threads <= hw_threads) {
return threads;
}
return hw_threads;
Member

This might mean we return 0, which may surprise an unsuspecting caller?

Member Author

You're right. I added a check to make sure hw_threads >= 1.
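The guard discussed in this thread can be sketched as follows (the helper name is hypothetical; std::thread::hardware_concurrency() is allowed to return 0 when the value is not computable, hence the clamp):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Sketch of the clamping logic: prefer the cpuset-derived thread count when
// it is sane, otherwise fall back to the hardware thread count, which is
// itself clamped to >= 1 so callers never receive 0.
unsigned int clampThreadCount(unsigned int threads) {
  const unsigned int hw_threads = std::max(1u, std::thread::hardware_concurrency());
  if (threads > 0 && threads <= hw_threads) {
    return threads;
  }
  return hw_threads; // >= 1 by construction
}
```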


uint32_t OptionsImplPlatform::getCpuCount() {
unsigned int hw_threads = std::thread::hardware_concurrency();
std::string path("/proc/" + std::to_string(getpid()) + "/status");
Member

Nit: Would '/proc/self/status' work as well here? Perhaps slightly simpler?

Member Author

Makes sense, fixed.

@oschaaf
Member

oschaaf commented Feb 26, 2019

So I have been mulling this over for a bit, and one thing that crossed my mind is that at some point it may make sense to distinguish between optimizing for throughput and optimizing for latency when considering available hardware resources. Specific to this proposed feature, such a priority would make a difference in how to treat the logical CPUs resulting from hyperthreading, depending on what should be prioritized. If latency is important, then maybe you want to avoid using logical cores; if bandwidth is important, then maybe there's a factor we can derive that works best for Envoy (1.5 threads per HT-enabled CPU? 2? I'm not sure).

In any case, over at https://github.com/envoyproxy/envoy-perf/blob/master/nighthawk/source/common/utility.cc#L17, determineCpuCoresWithAffinity() relies on pthread_getaffinity_np() and CPU_COUNT. Unfortunately that also does not make a distinction between physical and logical cores, but I feel that relying on API calls to do a straightforward count of CPUs results in an implementation that is easy to grok.

However, I also peeked at libhwloc while implementing determineCpuCoresWithAffinity(). I think libhwloc does what this PR implements (and a lot more). As such, it may be worth taking a peek at their Linux implementation of this specific functionality, which looks similar to the change proposed here:

https://github.com/open-mpi/hwloc/blob/master/hwloc/topology-linux.c#L4941

Also, I noticed that libhwloc does not seem to use /proc/self/status. Not sure if it is related, but the man pages say something potentially interesting about it:

/proc/[pid]/status
              Provides much of the information in /proc/[pid]/stat and
              /proc/[pid]/statm in a format that's easier for humans to
              parse.

Considering that our purpose is to have a machine parse it, I wonder: is /proc/self/status the best source to use for our purpose?

(ps - As far as I can tell the libhwloc license is pretty permissive: https://github.com/open-mpi/hwloc/blob/master/COPYING)

@ipuustin
Member Author

@oschaaf Thanks for the insightful comment!

Having heuristics which try to estimate whether HT is enabled and derive the thread number from that might be very tricky, because I think the performance difference depends on the SMT implementation of the processor. Another way could be providing known-to-be-fast templates for different purposes: "If you want to optimize for XYZ, try starting with this configuration, and benchmark your workload..."

For a single system (or a class of systems) it could be possible to find good thread factors. I've heard (but haven't benchmarked this myself) that disabling HT on some cores can be pretty closely "emulated" by this algorithm: find out the HT pairs in the CPU topology (/proc/cpuinfo core ids), disable the other CPU in each pair (echo 0 > /sys/devices/system/cpu/cpuX/online), put the remaining CPU cores into a cpuset, and give that cpuset to the process. I don't know what the scope of envoy-perf is, but maybe it could use this method to find factors that work well for several workloads on both HT-enabled and HT-disabled systems?

If we can come up with a suitable "default" thread number factor, that factor can be made to scale with the number of CPUs available. For example, maybe there could be two worker threads if there was only one CPU, but 25 worker threads for 20 CPU cores? The current factor of 1 is pretty simplistic.

But coming back to this PR, what changes are you proposing? Should we add a libhwloc dependency and use that? Or use sched_getaffinity() / pthread_getaffinity_np() for getting the CPU count? I'm fine with both approaches.

@oschaaf
Member

oschaaf commented Feb 27, 2019

I feel adding libhwloc as a dependency for just this may be a bit heavy-weight if we only need a small subset of functionality out of it, but perhaps it could serve as inspiration when pursuing the current approach to implement this feature? For example, it also considers cgroups, which could be nice (and which is mentioned in the linked kernel docs)?

I'm not a lawyer, but I think even just copying some code out of it is OK as long as care is taken to properly attribute.

For just counting CPUs with affinity, imho sched_getaffinity() / pthread_getaffinity_np() may be easier to grok and test?
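The counting approach suggested here is indeed compact. A minimal Linux-only sketch (hypothetical helper name, not the exact code that landed in the PR):

```cpp
#include <cassert>
#include <sched.h>
#include <thread>

// Ask the kernel for this process's affinity mask and count the set bits
// with CPU_COUNT; fall back to the hardware thread count on failure.
unsigned int cpusetSizeFromAffinity() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  // pid 0 means "the calling thread"; sched_getaffinity returns -1 on failure.
  if (sched_getaffinity(0, sizeof(mask), &mask) == -1) {
    const unsigned int hw = std::thread::hardware_concurrency();
    return hw > 0 ? hw : 1;
  }
  const int count = CPU_COUNT(&mask);
  return count > 0 ? static_cast<unsigned int>(count) : 1;
}
```

Compared with parsing /proc/self/status, this avoids string handling entirely and is straightforward to mock behind a syscall interface for tests.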

@ipuustin
Member Author

@oschaaf : I added a patch changing the logic to use sched_getaffinity(). Please take a look if this is more what you had in mind.

Do you mean by cgroups that we would access the cpu cgroup and allocate the right number of threads when Envoy is running on all cores but with a limited CPU quota? I agree that the cpuset size calculation doesn't help with that, but I'm not sure if limiting the number of worker threads is even the right thing to do in that case.
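For illustration, the cgroup-quota idea could look roughly like this (a sketch only: the paths are the cgroup v1 defaults, which vary by system and are absent under cgroup v2, and the helper name is hypothetical):

```cpp
#include <cassert>
#include <fstream>
#include <thread>

// Derive an "effective" CPU count from cfs_quota_us / cfs_period_us,
// falling back to the hardware thread count when no quota is set
// (quota == -1) or the cgroup v1 files are absent.
unsigned int effectiveCpusFromQuota() {
  unsigned int hw = std::thread::hardware_concurrency();
  if (hw == 0) {
    hw = 1;
  }
  long quota = -1, period = -1;
  std::ifstream("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") >> quota;
  std::ifstream("/sys/fs/cgroup/cpu/cpu.cfs_period_us") >> period;
  if (quota <= 0 || period <= 0) {
    return hw; // no quota configured or files missing
  }
  // Round up so e.g. quota 150000 / period 100000 still gets 2 threads.
  const long effective = (quota + period - 1) / period;
  return effective > 0 ? static_cast<unsigned int>(effective) : 1;
}
```

Whether capping worker threads at the quota is the right policy is exactly the open question above, since a quota throttles time slices rather than restricting which cores the threads may run on.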

@mattklein123 left a comment

Thanks this looks great. A few small comments.

/wait

@@ -0,0 +1,24 @@
#pragma once
Member

Can you add something like:

#ifndef __LINUX__
#error "Linux platform included on non-Linux"
#endif 

(or something like that). Same for the impl.

Member Author

Ok, done.

namespace Envoy {
namespace Api {

class LinuxOsSysCallsImpl : public OsSysCalls, LinuxOsSysCalls {
Member

Is there value in embedding and calling through? It seems like we could just have 2 syscall impls that are used?

Member Author

Ok, I changed this to work the way you suggested. I was originally thinking that maybe it would make sense to have only one *OsSysCalls version for each platform, which would support all the system calls available on that platform. I'm totally fine with either way though.

// that can be known.
concurrency_ = OptionsImplPlatform::getCpuCount();
} else {
if (concurrency.isSet() && cpuset_threads_ && cpuset_threads.isSet()) {
Member

Do you have a test for this case?

Member Author

Added a test.

linux_os_syscalls.sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
if (result.rc_ == -1) {
// Fall back to number of hardware threads.
return hw_threads;
Member

Can you add a test for this case? The value in adding linux syscalls is now you can have a mock and unit test this function.

Member Author

Done.

ipuustin added 14 commits March 4, 2019 10:58
Add a function for finding the size of the cpuset of the process. In
non-Linux platforms, fall back to the number of HW threads in the
system.

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Add a new configuration option "--cpuset-threads". If that option is set
and the "--concurrency" option is not set, only start a number of worker
threads equal to the size of the available cpuset.

For example, if Envoy is started in a Docker container like this, only
four worker threads are created if only "--cpuset-threads" option is
set:

  $ docker run --cpuset-cpus="0-3" ...

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Also describe the changed server_info structure.

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
…nabled.

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Do not call-through to the non-Linux-specific system calls using the
Linux system call API object.

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
@mattklein123 left a comment

Thank you, looks awesome!

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #5975 (review) was submitted by @mattklein123.


@mattklein123 merged commit f59a30e into envoyproxy:master Mar 4, 2019
fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
…voyproxy#5975)

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@lizan mentioned this pull request Mar 6, 2019
srcs = ["os_sys_calls_impl.cc"],
hdrs = ["os_sys_calls_impl.h"],
srcs = ["os_sys_calls_impl.cc"] + select({
"@bazel_tools//src/conditions:linux_x86_64": ["os_sys_calls_impl_linux.cc"],
Member

FTR, this breaks the Google import since we don't have @bazel_tools conditional in this form internally. I'd prefer to stick with //bazel:BUILD conditionals, I will do an update PR for this.

unsigned int fake_cpuset_size = std::thread::hardware_concurrency();
unsigned int fake_hw_threads = fake_cpuset_size;

EXPECT_EQ(OptionsImplPlatformLinux::getCpuAffinityCount(fake_hw_threads), fake_cpuset_size);
Member

We're seeing this break inside sandboxed build environments. I think we will work around this by just disabling this feature, but if others notice it, it might be worth further investigation.

Member Author

I think there's a bug in the test design: if you run the test in an environment where the cpuset is reduced, the hardware thread count will not match the cpuset size. What we can say is that the actual cpuset size is <= the hw thread count. I need to think about whether the test can be fixed in a better way.
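One way to express the relaxed invariant is a sketch like this (hypothetical helper that counts the affinity mask inline rather than calling Envoy's getCpuAffinityCount(); not the actual Envoy test):

```cpp
#include <cassert>
#include <sched.h>
#include <thread>

// In a sandbox the cpuset may be smaller than the machine, so assert
// cpuset_size <= hw_threads rather than equality. hardware_concurrency()
// may legally return 0 ("not computable"), which is treated as unknown.
bool cpusetInvariantHolds() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    return false;
  }
  const unsigned int cpuset_size = static_cast<unsigned int>(CPU_COUNT(&mask));
  const unsigned int hw_threads = std::thread::hardware_concurrency();
  return cpuset_size >= 1 && (hw_threads == 0 || cpuset_size <= hw_threads);
}
```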

arkodg added a commit to arkodg/gateway that referenced this pull request Jul 28, 2023
* Use cpuset size instead of total hardware threads to determine
total worker threads.

* Can be overridden by explicitly setting `--concurrency` arg

More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads

Discussion in envoyproxy/envoy#5975

Relates to envoyproxy#1718

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
zirain pushed a commit to envoyproxy/gateway that referenced this pull request Jul 31, 2023
* Set `--cpuset-threads` in EnvoyProxy cmdline arg

* Use cpuset size instead of total hardware threads to determine
total worker threads.

* Can be overridden by explicitly setting `--concurrency` arg

More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads

Discussion in envoyproxy/envoy#5975

Relates to #1718

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* lint

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>