Scale up on memory pressure #6826
Labels
adaptive
All things relating to adaptive scaling
discussion
Discussing a topic with no specific actions yet
memory
As of today, an adaptive cluster will start new workers if the amount of queued tasks on the scheduler vastly exceeds the amount of workers that are already online; in other words if it believes that the expected "good" CPU time per new worker will be justified compared to the overhead of starting and stopping it.
There is however a second use case that would benefit from scaling up, and it is when the cluster memory is saturated. It would make sense, in this case, to fire up extra workers even if the pending CPU load does not justify it.
The definition of "cluster memory is saturated" is tricky though.
managed / (memory_limit * target)
. This would include spilled memory and be advantageous when there's a lot of spill/unspill activity, but inefficient when there's long-standing unperturbed spilled data.distributed.worker.memory.scale_up: 0.55
, you may never reach it as a cluster-wide average if some workers are heavily saturated while others aren't; this should be solved by Rebalance during a moving cluster #4906.This ticket heavily interacts with #5999.
The text was updated successfully, but these errors were encountered: