
Scale up on memory pressure #6826

Open
crusaderky opened this issue Aug 4, 2022 · 1 comment
Labels: adaptive (All things relating to adaptive scaling), discussion (Discussing a topic with no specific actions yet), memory

Comments

crusaderky (Collaborator) commented Aug 4, 2022

As of today, an adaptive cluster will start new workers if the number of queued tasks on the scheduler vastly exceeds the number of workers that are already online; in other words, if it believes that the expected "good" CPU time per new worker will justify the overhead of starting and stopping it.
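
Roughly, paraphrasing the CPU side of Scheduler.adaptive_target (exact details may differ between versions), the current heuristic amounts to asking for enough workers to drain the known work in about target_duration:

import math

# Paraphrase only: request enough workers for the currently known work
# (scheduler.total_occupancy, in seconds of runtime) to complete within
# roughly target_duration seconds
cpu = math.ceil(scheduler.total_occupancy / target_duration)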

There is, however, a second use case that would benefit from scaling up: when the cluster memory is saturated. In this case it would make sense to fire up extra workers even if the pending CPU load alone does not justify it.

The definition of "cluster memory is saturated" is tricky though.

  • One option would be to simply track the cluster-wide ratio managed / (memory_limit * target). This would include spilled memory, which is advantageous when there's a lot of spill/unspill activity but inefficient when there's long-standing, unperturbed spilled data.
  • A variant would be to use managed_in_memory exclusively, thus ignoring the spilled data. Workloads that thrash the spill file should still benefit, as they won't start spilling until they reach target anyway (a rough sketch combining this with the paused-worker count follows the list).
  • Memory imbalances across workers may need to be considered, e.g. if you set a new threshold distributed.worker.memory.scale_up: 0.55, you may never reach it as a cluster-wide average if some workers are heavily saturated while others aren't; this should be solved by Rebalance during a moving cluster (#4906).
  • Another, more sophisticated option would be to monitor unspill (not spill) events - or, if you prefer, cache misses - over the last few seconds and build some heuristic on top of them (note that the scheduler currently does not receive these events; it will after Do not rebalance spilled keys (#6002)).
  • The number of paused workers could also be taken into account.
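
As a strawman, a memory-driven target based on managed_in_memory plus the paused-worker count could look something like the sketch below. Everything here is illustrative: memory_scale_target and the 0.55 threshold are hypothetical, and the sketch assumes the scheduler-side WorkerState attributes (memory_limit, memory.managed_in_memory, status) behave as they do today.

from distributed.core import Status

def memory_scale_target(scheduler, threshold: float = 0.55) -> int:
    # Hypothetical helper: number of workers suggested by memory pressure alone
    workers = scheduler.workers.values()
    limit = sum(ws.memory_limit for ws in workers)
    if limit == 0:
        return len(workers)

    # Managed memory actually resident in RAM, i.e. excluding spilled data
    # (second bullet above)
    in_memory = sum(ws.memory.managed_in_memory for ws in workers)

    # Paused workers are a strong hint that the cluster is memory-bound
    # (last bullet above)
    paused = sum(ws.status == Status.paused for ws in workers)

    if in_memory > threshold * limit or paused:
        # Same crude "double the cluster" response that the CPU heuristic uses
        return 2 * len(workers)
    return len(workers)

The final target would then be max(cpu_target, memory_scale_target(...)), mirroring how adaptive_target already combines its memory check with the CPU estimate.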

This ticket heavily interacts with #5999.

crusaderky added the discussion and memory labels on Aug 4, 2022
fjetter (Member) commented Aug 4, 2022

FYI, the current algorithm already implements a rough estimate of memory pressure; it does not look exclusively at CPU. See:

# Total memory limit and total stored bytes across all workers
limit = sum(ws.memory_limit for ws in self.workers.values())
used = sum(ws.nbytes for ws in self.workers.values())
memory = 0
# If more than 60% of the cluster memory is in use, ask for twice the
# current number of workers
if used > 0.6 * limit and limit > 0:
    memory = 2 * len(self.workers)
# Final target is the larger of the memory-based and CPU-based estimates
target = max(memory, cpu)

Of course, this logic may need some adjustment; I haven't tested it myself yet.
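
For context, this code path only runs once adaptive scaling is enabled on the cluster; a minimal example with a LocalCluster:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=1)
# Adaptive mode periodically asks the scheduler for a target worker count
# (Scheduler.adaptive_target, quoted above) and scales within these bounds
cluster.adapt(minimum=1, maximum=10)
client = Client(cluster)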

fjetter added the adaptive label on Aug 26, 2022