
Improve the calculation for the number of workers in the worker pool #5041

Merged: 1 commit merged into staging on Jun 18, 2024

Conversation

@andiflabs (Contributor) commented Jun 13, 2024

Summary

This commit changes the default value for `nodeWorkersMax` to the number of CPU cores, minus one, capped at 16. Before this change, `nodeWorkersMax` was hardcoded to 6, which resulted in limited performance on systems that have more than 6 cores.

As a consequence of this change, most users with an empty configuration will observe a larger number of workers spawned. Users who run on systems with SMT enabled and with a low number of cores may observe a lower number of workers.

For example: on a system with 2 cores, with SMT enabled (4 logical CPUs in total), and an empty configuration: before this change, the number of workers would have been 4 - 1 = 3; after this change, it is 2 - 1 = 1.

The rationale behind using the number of CPU cores rather than the number of logical CPUs is that the worker pool should mostly be used for CPU-bound, CPU-intensive workloads. For these kinds of workloads, SMT is known to be detrimental to performance (workloads that are primarily I/O-bound should not use the worker pool and would benefit from async code instead).

This commit also changes the default value for `nodeWorkers` to be based on the number of processing units available to the process, rather than the number of logical CPUs. On most systems, this will not result in any difference; the change only impacts users who limit the Node process through sandboxing techniques like CPU affinity masks or cgroups.

Some examples to understand the impact of this change (a minimal sketch of the calculation follows this list):

  • users with nodeWorkers set in their configuration: no change
  • users with nodeWorkers/nodeWorkersMax not set, 2 CPU cores, SMT enabled (4 CPU threads):
    number of workers before this change = min(4 - 1, 6) = 3
    number of workers after this change = min(4 - 1, 2 - 1, 16) = 1
  • users with nodeWorkers/nodeWorkersMax not set, 8 CPU cores, SMT enabled (16 CPU threads):
    number of workers before this change = min(16 - 1, 6) = 6
    number of workers after this change = min(16 - 1, 8 - 1, 16) = 7
  • users with nodeWorkers/nodeWorkersMax not set, 8 CPU cores, SMT disabled (8 CPU threads):
    number of workers before this change = min(8 - 1, 6) = 6
    number of workers after this change = min(8 - 1, 8 - 1, 16) = 7
  • users with nodeWorkers/nodeWorkersMax not set, 32 CPU cores, SMT enabled (64 CPU threads):
    number of workers before this change = min(64 - 1, 6) = 6
    number of workers after this change = min(64 - 1, 32 - 1, 16) = 16
  • users with nodeWorkers/nodeWorkersMax not set, 32 CPU cores, SMT enabled (64 CPU threads), scheduler affinity set to 4 CPUs:
    number of workers before this change = min(64 - 1, 6) = 6
    number of workers after this change = min(4 - 1, 32 - 1, 16) = 3
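
For reference, a minimal sketch of the calculation these examples walk through (function and parameter names are hypothetical, not the PR's actual code):

```typescript
// New default: min(availableParallelism - 1, physicalCores - 1, 16).
// `parallelism` is what os.availableParallelism() reports for the process;
// `physicalCores` is the physical (non-SMT) core count.
function defaultWorkerCount(parallelism: number, physicalCores: number): number {
  const hardCap = 16
  return Math.min(parallelism - 1, physicalCores - 1, hardCap)
}

console.log(defaultWorkerCount(16, 8))  // 8 cores, SMT on: min(15, 7, 16) = 7
console.log(defaultWorkerCount(4, 2))   // 2 cores, SMT on: min(3, 1, 16) = 1
console.log(defaultWorkerCount(64, 32)) // 32 cores, SMT on: min(63, 31, 16) = 16
```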

Testing Plan

Documentation

Does this change require any updates to the Iron Fish Docs (e.g. the RPC API Reference)? If yes, link a related documentation pull request for the website.

[ ] Yes

Breaking Change

Is this a breaking change? If yes, add notes below on why this is breaking and label it with breaking-change-rpc or breaking-change-sdk.

[ ] Yes

}
// TODO: ideally we should use `os.availableParallelism()` here, instead of
// `os.cpus().length`, but it seems like Typescript doesn't have the type
// bindings for it
Member:
I thought this too, but I think it was only added in 18.14.0: https://nodejs.org/api/os.html#osavailableparallelism and we technically support all versions of 18. You could check if it exists at runtime and call it, else fall back to os.cpus().length

andiflabs (Author):
Somehow I was convinced we jumped to Node 20, but no, you're right: we're still on Node 18
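
A minimal sketch of that suggested runtime check (helper name hypothetical; the `any` cast works around `@types/node` versions that do not yet declare the function):

```typescript
import * as os from 'os'

// Prefer os.availableParallelism() (added in Node 18.14.0) and fall back to
// os.cpus().length on earlier Node 18 releases.
function availableParallelism(): number {
  const fn = (os as any).availableParallelism
  return typeof fn === 'function' ? fn() : os.cpus().length
}
```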

@andiflabs force-pushed the andrea/better-num-workers branch from 9aef7cf to a7f81df on June 14, 2024 21:32
@andiflabs marked this pull request as ready for review June 15, 2024 21:13
@andiflabs requested a review from a team as a code owner June 15, 2024 21:13
@andiflabs force-pushed the andrea/better-num-workers branch from a7f81df to 17106cb on June 17, 2024 17:37
@@ -90,13 +90,24 @@ export type ConfigOptions = {
   /**
    * The number of CPU workers to use for long-running node operations, like creating
    * transactions and verifying blocks. 0 disables workers (this is likely to cause
-   * performance issues), and -1 auto-detects based on the number of CPU cores.
+   * performance issues), and a negative value resolves to the amount of parallelism units
+   * available to the process (as returned by `os.availableParallelism()`) minus one.
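
Put as code, the semantics described by the updated doc comment would be roughly this (function and parameter names are hypothetical, not taken from the PR):

```typescript
// Rough semantics of the doc comment above: 0 disables workers, a negative
// value resolves to the parallelism available to the process, minus one.
function resolveNodeWorkers(nodeWorkers: number, parallelism: number): number {
  return nodeWorkers < 0 ? parallelism - 1 : nodeWorkers
}
```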
Contributor:
While correct, node runners are often DevOps folks / generally technical people, whereas this documentation is engineering-focused. I think this is fine, but I try not to refer to Node documentation or anything DevOps people may struggle with.

andiflabs (Author):
I didn't realize this was user-visible documentation; I'll correct that

Contributor:
We usually copy these docs into the website, but it's manual and we don't need to. I think this is fine if we don't copy them.

@andiflabs force-pushed the andrea/better-num-workers branch from 17106cb to 0d4a81e on June 18, 2024 17:48
@NullSoldier (Contributor) commented Jun 18, 2024

This PR looks good, but I want to add some context for discussion, because I know node runners who won't be able to start their nodes now may come here with questions. Technically this PR makes perfect sense, but business-wise there are some considerations.

First, users have a wide variety of configurations. We have users with high mem/low CPU, high CPU/low mem, and low mem/low CPU. During our testnet phases we required users to run nodes to accumulate testnet activity points. This means that during our testnet phase, we had a large range of low-end VPSes being rented to run Iron Fish nodes that probably had very little memory, and often many more cores. On top of that, there were a few users with 100-core machines running nodes who would brag and show off their htop.

Now onto the next consideration: Node is a memory-hungry beast. Unfortunately, because JS is single-threaded, if you want parallelism you need to use a worker pool of forked Node processes. That means our worker pool essentially spawns a new Node runtime for each worker, which takes 100+ MB per worker spawned. If you have 10 physical cores, you'll need enough memory for the main node process plus 9 workers.
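
As rough arithmetic (the 100 MB figure is the estimate above, not a measurement):

```typescript
// Back-of-the-envelope memory overhead for the worker pool.
const physicalCores = 10
const workers = physicalCores - 1 // one core left for the main node process
const perWorkerMB = 100           // assumed cost of each forked Node runtime
console.log(`worker pool overhead: roughly ${workers * perWorkerMB}+ MB`) // 900+ MB
```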

Now back to the original purpose of maxNodeWorkers. During our testnet we had many users who simply could not start their nodes when we increased the amount of parallelism, because their systems would scale up and unexpectedly run out of memory. We introduced the cap because we made a trade-off: we decided, arbitrarily, that 6 workers was enough.

The reason we wanted to cap it was to make most users' nodes run without crashing, and because we wanted developers to develop with an understanding of what it feels like to run a low-end or mid-range node, and have sympathy for the user. We found that when we capped the node's performance to that of an expected average user's machine, engineers contributed optimizations much more frequently. We also no longer received reports of OOM crashes on startup from the worker pool spawning workers.

So this might not be a problem anymore, but I think the PR should still address the original reason maxNodeWorkers existed, and make the case either that it isn't needed anymore or that our end users won't be impacted by this change. I think some may no longer be able to start their nodes without tweaking the configuration, and we can decide that's OK.

@NullSoldier reopened this Jun 18, 2024
@andiflabs force-pushed the andrea/better-num-workers branch from 0d4a81e to 815c2ed on June 18, 2024 21:21
@andiflabs force-pushed the andrea/better-num-workers branch from 815c2ed to c237236 on June 18, 2024 21:24
@andiflabs merged commit c237236 into staging Jun 18, 2024
13 checks passed
@andiflabs deleted the andrea/better-num-workers branch June 18, 2024 22:07