Improve the calculation for the number of workers in the worker pool #5041
Conversation
ironfish/src/workerPool/pool.ts
Outdated
}
// TODO: ideally we should use `os.availableParallelism()` here, instead of
// `os.cpus().length`, but it seems like Typescript doesn't have the type
// bindings for it
I thought this too, but I think it was only added in 18.14.0 (https://nodejs.org/api/os.html#osavailableparallelism), and we technically support all versions of Node 18. You could check whether it exists at runtime and call it, falling back to os.cpus().length otherwise.
Somehow I was convinced we jumped to Node 20, but no, you're right: we're still on Node 18
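A minimal sketch of that runtime check, assuming an older @types/node that does not yet declare `availableParallelism` (the helper name and the cast are illustrative, not code from this PR):

```ts
import * as os from 'os'

// `os.availableParallelism()` only exists on Node >= 18.14.0, and older
// @types/node releases do not declare it, so probe for it at runtime and
// fall back to the logical CPU count.
function availableParallelism(): number {
  const maybeOs = os as typeof os & { availableParallelism?: () => number }
  if (typeof maybeOs.availableParallelism === 'function') {
    return maybeOs.availableParallelism()
  }
  return os.cpus().length
}
```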
Force-pushed from 9aef7cf to a7f81df
Force-pushed from a7f81df to 17106cb
ironfish/src/fileStores/config.ts
Outdated
@@ -90,13 +90,24 @@ export type ConfigOptions = {
   /**
    * The number of CPU workers to use for long-running node operations, like creating
    * transactions and verifying blocks. 0 disables workers (this is likely to cause
-   * performance issues), and -1 auto-detects based on the number of CPU cores.
+   * performance issues), and a negative value resolves to the amount of parallelism units
+   * available to the process (as returned by `os.availableParallelism()`) minus one.
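Read literally, a negative `nodeWorkers` value would then resolve roughly like this (a hypothetical sketch, not the actual implementation; it assumes Node >= 18.14 with matching @types/node):

```ts
import * as os from 'os'

// Hypothetical illustration of the documented behavior: 0 disables workers,
// a positive value is used as-is, and a negative value resolves to the
// parallelism available to the process, minus one.
function resolveNodeWorkers(configured: number): number {
  if (configured >= 0) {
    return configured
  }
  // Assumes Node >= 18.14.0; see the runtime-check discussion above.
  return os.availableParallelism() - 1
}

// The separate `nodeWorkersMax` cap still applies on top of this value.
```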
While this is correct, node runners are often dev-ops or ordinary technical people, whereas this documentation is engineering-focused. I think this is fine, but I try not to refer to Node documentation or anything dev-ops people may struggle with.
I didn't realize this was user-visible documentation; I'll correct that
We usually copy these docs into the website, but it's manual and we don't need to. I think this is fine if we don't copy them.
Force-pushed from 17106cb to 0d4a81e
This PR looks good, but I want to add some context for discussion, because I know node runners who won't be able to start their node now may come here with questions. Technically this PR makes perfect sense, but business-wise there are some considerations.

First, users have a wide variety of configurations. We have users with high mem/low cpu, high cpu/low mem, and low mem/low cpu. During our testnet phases we required users to run nodes to accumulate testnet activity points. This means that during our testnet phase, we had a large range of low-end VPSs being rented to run Iron Fish nodes that probably had very little memory, and often many more cores. On top of that, there were a few 100-core users running nodes who would brag and show off their htop.

Now onto the next consideration. Node is a memory-hungry beast. Unfortunately, because JS is single-threaded, if you want parallelism you need to use a worker pool of forked Node processes. That means our worker pool essentially spawns a new Node runtime for each worker, which takes 100+ MB per worker spawned. If you have 10 physical cores, you'll need enough memory for the main node process plus 9 workers.

Now back to the original point: the reason we wanted to cap the worker count was to make most users' nodes run without crashing, and because we wanted developers to develop with an understanding of what it felt like to run a low-end / medium node and have sympathy for the user. We found that when we capped the performance of the node to an expected average user's machine, engineers contributed optimizations much more frequently. We also no longer received reports of crashes from OOM on startup caused by the worker pool spawning workers.

So this might not be a problem anymore, but I think the PR should still address the original reason.
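To put rough numbers on that (using the ~100 MB per-worker figure above): spawning one worker per core on a 32-core machine would mean roughly 31 × 100 MB ≈ 3 GB of baseline memory for the worker pool alone, before the main node process or any actual workload is counted.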
Force-pushed from 0d4a81e to 815c2ed
Force-pushed from 815c2ed to c237236
Summary
This commit changes the default value for `nodeWorkersMax` to the number of CPU cores, minus one, capped at 16. Before this change, `nodeWorkersMax` was hardcoded to 6, which resulted in limited performance on systems that have more than 6 cores.

As a consequence of this change, most users with an empty configuration will observe a larger number of workers spawned. Users who run on systems with SMT enabled and with a low number of cores may observe a lower number of workers. For example, on a system with 2 cores, SMT enabled (4 logical CPUs in total), and an empty configuration: before this change, the number of workers would have been 4 - 1 = 3; after this change it is 2 - 1 = 1.

The rationale behind using the number of CPU cores rather than the number of logical CPUs is that the worker pool should mostly be used for CPU-bound and CPU-intensive workloads. For this kind of workload, SMT is known to be detrimental to performance (other workloads that are primarily I/O-bound should not use the worker pool, but would benefit from async code).

This commit also changes the default value for `nodeWorkers` to be based on the number of processing units available to the process, rather than the number of logical CPUs. On most systems, this will not result in any difference; it will only impact users that limit the Node process through sandboxing techniques like CPU affinity masks or cgroups.

Some examples to understand the impact of this change:

- users with `nodeWorkers` set in their configuration: no change
- users with `nodeWorkers`/`nodeWorkersMax` not set, 2 CPU cores, SMT enabled (4 CPU threads):
  number of workers before this change = min(4 - 1, 6) = 3
  number of workers after this change = min(4 - 1, 2 - 1, 16) = 1
- users with `nodeWorkers`/`nodeWorkersMax` not set, 8 CPU cores, SMT enabled (16 CPU threads):
  number of workers before this change = min(16 - 1, 6) = 6
  number of workers after this change = min(16 - 1, 8 - 1, 16) = 7
- users with `nodeWorkers`/`nodeWorkersMax` not set, 8 CPU cores, SMT disabled (8 CPU threads):
  number of workers before this change = min(8 - 1, 6) = 6
  number of workers after this change = min(8 - 1, 8 - 1, 16) = 7
- users with `nodeWorkers`/`nodeWorkersMax` not set, 32 CPU cores, SMT enabled (64 CPU threads):
  number of workers before this change = min(64 - 1, 6) = 6
  number of workers after this change = min(64 - 1, 32 - 1, 16) = 16
- users with `nodeWorkers`/`nodeWorkersMax` not set, 32 CPU cores, SMT enabled (64 CPU threads), scheduler affinity set to 4 CPUs:
  number of workers before this change = min(64 - 1, 6) = 6
  number of workers after this change = min(4 - 1, 32 - 1, 16) = 3
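As a rough sketch of the calculation above (the helper name is hypothetical, and detecting the physical core count is left to the caller, since Node does not expose it directly):

```ts
import * as os from 'os'

// Hypothetical sketch of the default worker-count calculation described above.
// `physicalCores` is supplied by the caller; detecting it is out of scope here.
function defaultWorkerCount(physicalCores: number): number {
  // Parallelism available to the process; assumes Node >= 18.14.0.
  const parallelism = os.availableParallelism()

  const nodeWorkers = parallelism - 1 // default for `nodeWorkers`
  const nodeWorkersMax = Math.min(physicalCores - 1, 16) // default for `nodeWorkersMax`
  return Math.min(nodeWorkers, nodeWorkersMax)
}

// Example: 8 physical cores with SMT enabled (16 threads)
// -> min(16 - 1, 8 - 1, 16) = 7 workers
```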
Testing Plan
Documentation
Does this change require any updates to the Iron Fish Docs (ex. the RPC API Reference)? If yes, link a related documentation pull request for the website.
Breaking Change
Is this a breaking change? If yes, add notes below on why this is breaking and label it with `breaking-change-rpc` or `breaking-change-sdk`.