
Infinite loop of scaling in and out with HTEX #3696

Open
stevenstetzler opened this issue Nov 14, 2024 · 8 comments · May be fixed by #3721
@stevenstetzler

Describe the bug
I've encountered an infinite loop of scaling in and out with the high throughput executor. Blocks get scaled out only to be immediately scaled in as idle, followed by blocks getting scaled out again.

I believe the issue is that the block info used by the high throughput executor's scale-in logic has an infinite idle time (its initialization default). This seems related to #3353, where I can imagine un-started blocks not reporting an idle time.

The logic is here: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L706-L747

        @dataclass
        class BlockInfo:
            tasks: int  # sum of tasks in this block
            idle: float  # shortest idle time of any manager in this block

        # block_info will be populated from two sources:
        # the Job Status Poller mutable block list, and the list of blocks
        # which have connected to the interchange.

        def new_block_info():
            return BlockInfo(tasks=0, idle=float('inf'))

        block_info: Dict[str, BlockInfo] = defaultdict(new_block_info)

        for block_id, job_status in self._status.items():
            if job_status.state not in TERMINAL_STATES:
                block_info[block_id] = new_block_info()

        managers = self.connected_managers()
        for manager in managers:
            if not manager['active']:
                continue
            b_id = manager['block_id']
            block_info[b_id].tasks += manager['tasks']
            block_info[b_id].idle = min(block_info[b_id].idle, manager['idle_duration'])

        # The scaling policy is that longest idle blocks should be scaled down
        # in preference to least idle (most recently used) blocks.
        # Other policies could be implemented here.

        sorted_blocks = sorted(block_info.items(), key=lambda item: (-item[1].idle, item[1].tasks))

        logger.debug(f"Scale in selecting from {len(sorted_blocks)} blocks")
        if max_idletime is None:
            block_ids_to_kill = [x[0] for x in sorted_blocks[:blocks]]
        else:
            block_ids_to_kill = []
            for x in sorted_blocks:
                if x[1].idle > max_idletime and x[1].tasks == 0:
                    block_ids_to_kill.append(x[0])
                    if len(block_ids_to_kill) == blocks:
                        break

and this line in particular is the culprit: https://github.com/Parsl/parsl/blob/2024.11.11/parsl/executors/high_throughput/executor.py#L744

                if x[1].idle > max_idletime and x[1].tasks == 0:

Note that new_block_info() initializes idle to float('inf'); if that default value is still present when compared against max_idletime, the comparison is always true and the block is appended to the list of block IDs to kill.

A one-line fix to

                if x[1].idle != float('inf') and x[1].idle > max_idletime and x[1].tasks == 0:

has fixed the behavior for me.
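
For illustration, a minimal standalone sketch of that comparison (the BlockInfo fields mirror the excerpt above; max_idletime here is just an example threshold, and the surrounding executor machinery is omitted):

from dataclasses import dataclass

@dataclass
class BlockInfo:
    tasks: int
    idle: float

max_idletime = 120.0  # example threshold in seconds

# A pending block with no connected managers keeps its default idle time.
pending_block = BlockInfo(tasks=0, idle=float('inf'))

# Original condition: float('inf') > 120.0 is True, so the block is selected to kill.
print(pending_block.idle > max_idletime and pending_block.tasks == 0)   # True

# Guarded condition from the proposed one-line fix: the pending block survives.
print(pending_block.idle != float('inf')
      and pending_block.idle > max_idletime
      and pending_block.tasks == 0)                                     # False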

To Reproduce
I'm not sure how to reproduce this consistently, as it only appeared when moving to a new computing environment.

Expected behavior
I expect scaling to behave as normal.

Environment

  • OS: Rocky Linux 9
  • Python version: 3.11.8
  • Parsl version: 1.3.0-dev

Distributed Environment

  • Compute node
  • Compute nodes
@benclifford (Collaborator)

That code is in scale_in, which should only be called by the scaling strategy code once it has decided that there are too many blocks. In that case the newest unused (infinitely idle) blocks should be scaled in, deliberately in preference to any that are already running (although that choice is separately debatable).

What I'm interested in here, then, is why scale_in gets called. That gets called in several places in the scaling strategy, for example here:

executor.scale_in_facade(active_blocks - min_blocks)

Usually that happens when there isn't enough load (outstanding tasks) to need that number of blocks to exist.

So maybe you can have a look at what is happening around there - there are a bunch of debug log messages in strategy.py that are intended to help understand why parsl is making scaling decisions.
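
A minimal sketch of one way to surface those messages on the console, using only the standard logging module (the logger name parsl.jobs.strategy comes from the module path visible in the log excerpts later in this thread; the same messages also appear in the per-run parsl.log):

import logging

# Attach a stream handler and raise parsl.jobs.strategy to DEBUG so the
# scaling decisions ("Strategy case ...", "Slot ratio calculation ...")
# are printed as they are made.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))

strategy_logger = logging.getLogger("parsl.jobs.strategy")
strategy_logger.setLevel(logging.DEBUG)
strategy_logger.addHandler(handler)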

@stevenstetzler (Author)

@benclifford I took a closer look and think I have tracked it down. In the logs I see,

DEBUG: Strategy case 4b: more slots than tasks

i.e. there are more slots than tasks. Active slots include pending blocks, per:

running = sum([1 for x in status.values() if x.state == JobState.RUNNING])
pending = sum([1 for x in status.values() if x.state == JobState.PENDING])
active_blocks = running + pending
active_slots = active_blocks * tasks_per_node * nodes_per_block

However, PENDING blocks are also included in the set of blocks considered for scale in, since JobState.PENDING is not in TERMINAL_STATES:

def new_block_info():
    return BlockInfo(tasks=0, idle=float('inf'))

block_info: Dict[str, BlockInfo] = defaultdict(new_block_info)

for block_id, job_status in self._status.items():
    if job_status.state not in TERMINAL_STATES:
        block_info[block_id] = new_block_info()
So, scale_in is called since active_slots > active_tasks, and the only active slots come from a pending block. During scale in, a BlockInfo is added for that block with an infinite idle time. Since the block is pending, it is not included in connected_managers(), which would otherwise replace the infinite default with the block's true idle time:

managers = self.connected_managers()
for manager in managers:
    if not manager['active']:
        continue
    b_id = manager['block_id']
    block_info[b_id].tasks += manager['tasks']
    block_info[b_id].idle = min(block_info[b_id].idle, manager['idle_duration'])

The pending block is then selected for scale in and cancelled.

This is followed by another scale out, causing a loop that is only resolved if the block manages to start and connect before the scaling strategy runs again.
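
To make the cycle concrete, a small illustrative simulation of the interaction described above (the block size and helper structure are hypothetical; this mirrors the logic sketched in this thread, not the actual strategy code):

from dataclasses import dataclass

@dataclass
class Block:
    state: str              # "PENDING" or "RUNNING"
    idle: float = float('inf')
    tasks: int = 0

SLOTS_PER_BLOCK = 48        # hypothetical tasks_per_node * nodes_per_block
ACTIVE_TASKS = 14           # hypothetical outstanding task count

blocks = {}
for step in range(4):
    # Pending blocks count towards active_slots in the strategy pass.
    active_slots = SLOTS_PER_BLOCK * sum(
        1 for b in blocks.values() if b.state in ("PENDING", "RUNNING"))

    if active_slots < ACTIVE_TASKS:
        # "more tasks than slots": scale out a new (pending) block
        blocks[f"block-{step}"] = Block(state="PENDING")
        print(f"step {step}: scale out -> {len(blocks)} block(s)")
    elif active_slots > ACTIVE_TASKS:
        # "more slots than tasks": scale in, and the pending block is chosen
        # because its idle time is still the float('inf') default
        victim = max(blocks, key=lambda k: blocks[k].idle)
        del blocks[victim]
        print(f"step {step}: scale in {victim} -> {len(blocks)} block(s)")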

@benclifford (Collaborator)

@stevenstetzler this should be converging towards having the "correct" number of slots though: if there are more slots than tasks then (modulo some rounding problem?) blocks should be cancelled - whether they are active or pending.

The intention is that the number of slots (active or pending) converges towards the number of tasks pending or active.

What does this log line say, which is the raw data for the "more slots than tasks" decision?

logger.debug(f"Slot ratio calculation: active_slots = {active_slots}, active_tasks = {active_tasks}")

Even better, can you send me a complete parsl.log for one of these runs?

@benclifford (Collaborator)

(the question I have is not about why pending blocks are scaled in - it is why any blocks are being scaled in, if there is enough task load to scale them out 5 seconds earlier)

@stevenstetzler (Author)

Here is the parsl.log
parsl.log

You will see

1733887247.182073 2024-12-10 21:20:47 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887252.266406 2024-12-10 21:20:52 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14
1733887261.421944 2024-12-10 21:21:01 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887266.854923 2024-12-10 21:21:06 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14
1733887272.831117 2024-12-10 21:21:12 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 48, active_tasks = 14
1733887278.464345 2024-12-10 21:21:18 MainProcess-774602 JobStatusPoller-Timer-Thread-22583474931920-22583470077504 parsl.jobs.strategy:214 _general_strategy DEBUG: Slot ratio calculation: active_slots = 0, active_tasks = 14

as the block gets scaled in and out.

@benclifford (Collaborator)

ok, that's interesting. looks more like it's oscillating around the convergence point (of 14 slots) rather than converging to a fixed number. let me see if I can reproduce this in a test locally.

@benclifford (Collaborator)

ok, here's what I think is a reproducer https://github.com/Parsl/parsl/tree/benc-3696 - there is some suspicious rounding in the code that chooses how to head towards the target number of blocks. I'll flesh out some more testing and hopefully it is then a simple fix.
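
One guess at the shape of that rounding problem, purely as an illustration and not the actual strategy code: if the excess is computed in slots and converted to whole blocks by rounding up, a partially used block can be cancelled outright.

import math

slots_per_block = 48          # tasks_per_node * nodes_per_block from the attached log
active_tasks = 14
active_slots = 48             # one block's worth of slots

excess_slots = active_slots - active_tasks                  # 34
excess_blocks = math.ceil(excess_slots / slots_per_block)   # ceil(34/48) = 1

# Cancelling 1 block removes the only block, dropping active_slots to 0;
# the next strategy pass then sees fewer slots than tasks and scales out
# again, producing the oscillation visible in the log above.
print(excess_slots, excess_blocks)   # 34 1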

@benclifford linked a pull request on Dec 11, 2024 that will close this issue
@benclifford (Collaborator)

@stevenstetzler can you try out the fix in PR #3721?
