
limit task pool size #987

Closed
hjoliver opened this issue Jun 19, 2014 · 6 comments · Fixed by #3515
Labels
efficiency For notable efficiency improvements
Comments

@hjoliver
Member

Currently (on the iso8601 branch) the runahead limit is a suite-wide parameter that acts as follows: each task proxy spawns a 'waiting' successor when it enters the 'submitted' state. If a waiting task is beyond the runahead limit, it goes into a special pool that does not participate in dependency matching; tasks under the limit drop back into the main pool. So at least one proxy instance of every defined task exists at any one time, more if some instances are submitted or running at the time. Only those below the runahead limit impact scheduling performance (but all of them affect cylc's memory footprint). Note that tasks do not spawn multiple waiting instances out to the runahead limit.
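
To illustrate, a minimal sketch of that pool split (invented names, integer cycle points for simplicity; not the actual implementation):

```python
# Minimal, invented sketch (integer cycle points; not the real implementation).

class TaskProxy:
    def __init__(self, name, cycle_point):
        self.name = name
        self.cycle_point = cycle_point
        self.state = "waiting"

    def set_submitted(self, next_point):
        # On entering the 'submitted' state, spawn a single 'waiting'
        # successor at the next point (not a chain out to the limit).
        self.state = "submitted"
        return TaskProxy(self.name, next_point)


class TaskPool:
    def __init__(self, runahead_limit):
        self.runahead_limit = runahead_limit
        self.main_pool = []      # takes part in dependency matching
        self.runahead_pool = []  # held back from dependency matching

    def add(self, task, oldest_point):
        # Waiting tasks beyond the runahead limit go into the special pool.
        if task.cycle_point - oldest_point > self.runahead_limit:
            self.runahead_pool.append(task)
        else:
            self.main_pool.append(task)

    def release_runahead(self, oldest_point):
        # Tasks that drop back under the limit rejoin the main pool.
        for task in list(self.runahead_pool):
            if task.cycle_point - oldest_point <= self.runahead_limit:
                self.runahead_pool.remove(task)
                self.main_pool.append(task)
```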

@matthewrmshin has some ideas on how to avoid a suite-wide runahead limit...

[edit: it seems the focus of this issue is really on limiting the task pool size, which is one function of the runahead limit but not the only one - we may still need it as well (see below).]

@hjoliver added this to the later milestone on Jun 19, 2014
@matthewrmshin
Contributor

Moved this from the re-factor proposal:

The runahead limit is used to address two main problems:

We don't want to flood the disk. (Assuming that most tasks generate some data.)

  • Currently, the runahead limit addresses the question of when it is safe to start install tasks (or any tasks that eat up lots of disk space) for a cycle.
  • This should really be a suite design issue.
  • If we have tasks that eat up lots of disk space, they should always be paired with housekeep tasks.
  • A dependency of an install task on a housekeep task that clears data generated by a historical cycle should be sufficient to address the problem.

We don't want to flood the task pool.

  • Consider a suite's task pool. If it becomes inefficient at around 2000 task proxies (more in the future, but a limit will still exist), then a runahead limit based on cycle time or number of cycles does not make sense. (E.g. a 300 task-per-cycle suite can happily run 6 cycles ahead, but a 1500 task-per-cycle suite can only run one cycle ahead.) Instead, we should move to an architecture that lets us set the maximum size of the task pool directly (see the back-of-envelope sketch below).
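
A back-of-envelope version of that arithmetic (the 2000-proxy budget is purely illustrative):

```python
# Illustrative arithmetic only: how many cycles fit under a fixed proxy budget.
POOL_BUDGET = 2000  # point at which the pool is assumed to become inefficient

for tasks_per_cycle in (300, 1500):
    cycles_ahead = POOL_BUDGET // tasks_per_cycle
    print(f"{tasks_per_cycle} tasks/cycle -> about {cycles_ahead} cycle(s) ahead")
# 300 tasks/cycle -> about 6 cycle(s) ahead
# 1500 tasks/cycle -> about 1 cycle(s) ahead
```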

@hjoliver
Member Author

As discussed, limiting the task pool size will improve performance for much larger suites, and in principle it would allow us to drop the runahead limit. But in my opinion we will still need a runahead limit as well as limiting the pool size. Otherwise even small suites - if they contain quick-running data-retrieval tasks that are not constrained by clock triggers, or similar - could fill the task pool. This may not be a real problem unless the run-ahead tasks fill up the disk, but it will make the suite hard to monitor (graph view) and it may cause panic in users (OMG - my get_data task just ran out 100 years ahead!). If this is not considered a problem the user can just set the runahead limit very high.

@hjoliver changed the title from "rethink the runahead limit" to "limit task pool size" on Jun 25, 2014
@hjoliver
Member Author

Issue title changed from "rethink the runahead limit" to "limit task pool size".

As per my previous comment, the runahead limit, although related, is a different concept and will probably still be needed in addition to limiting the pool size.

@hjoliver
Member Author

hjoliver commented Jun 25, 2014

Limiting the pool size would require knowing which waiting tasks will be needed first, as excluding the wrong ones from dependency matching could stall the suite (see also #993). How to do this? I guess we'll need to use dependency information from the suite graph: if a task has reached the 'submitted' state or beyond, its downstream dependants could be needed soon, so they should be created (#993) and added to the task pool. (Related: this could also be used to ditch our long-standing free-for-all dependency matching - each waiting task really only needs to check the outputs of the few upstream tasks that the graph says it depends on - #108 (comment)) [done].

This would greatly reduce the size of the task pool as currently conceived (which contains waiting instances of almost all tasks), but we'll still need an absolute size limit as well, e.g. for suites with massive ensembles in which all members become ready to run at once. Maybe a FIFO queue or similar could manage adding the "needed next" tasks to the dependency-matching pool, to avoid stalling parts of the suite by continually excluding the same tasks from the pool.

Also: retention of succeeded tasks in the pool (and the cleanup algorithm that removes them) could be dispensed with if we just keep track of completed outputs instead [#1902].
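
A rough sketch of what I have in mind (invented names, not a proposed implementation): spawn only the downstream dependants of submitted tasks, feed them through a FIFO queue into a size-capped matching pool, and record completed outputs instead of retaining succeeded proxies:

```python
from collections import deque

# Invented names; a sketch only, not a proposed Cylc implementation.

MAX_POOL_SIZE = 2000  # illustrative absolute cap on the matching pool

class MatchingPool:
    def __init__(self, graph):
        self.graph = graph          # {task_id: [downstream task_ids]}
        self.ready_queue = deque()  # FIFO of spawned-but-not-yet-matched tasks
        self.pool = set()           # tasks that take part in dependency matching
        self.completed_outputs = set()

    def on_submitted(self, task_id):
        # Spawn only the downstream dependants of submitted tasks (#993), so
        # the pool never holds waiting instances of every task in the suite.
        for child in self.graph.get(task_id, []):
            self.ready_queue.append(child)
        self._fill()

    def on_succeeded(self, task_id):
        # Instead of retaining succeeded proxies for later matching, record
        # their completed outputs (#1902) and free the pool slot.
        self.pool.discard(task_id)
        self.completed_outputs.add(task_id)
        self._fill()

    def _fill(self):
        # FIFO release avoids stalling parts of the suite by repeatedly
        # excluding the same tasks from the matching pool.
        while self.ready_queue and len(self.pool) < MAX_POOL_SIZE:
            self.pool.add(self.ready_queue.popleft())
```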

@hjoliver
Member Author

hjoliver commented Jun 22, 2016

[meeting]

  • agreed we should move to spawning the immediate graph descendants of active tasks
  • cylc-8?
  • we'll need to be able to operate on tasks that are not in the pool yet or have already left it (we could have minimal "ghost" objects that can be promoted to real task proxies when needed - see the sketch below)
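
A minimal illustration of the "ghost" idea from the last bullet (hypothetical names only):

```python
# Hypothetical sketch only; names are invented for illustration.

class TaskProxy:
    """Full task proxy, used for dependency matching and job submission."""
    def __init__(self, name, cycle_point, held=False):
        self.name = name
        self.cycle_point = cycle_point
        self.held = held


class GhostTask:
    """Lightweight stand-in for a task not (or no longer) in the pool."""
    def __init__(self, name, cycle_point):
        self.name = name
        self.cycle_point = cycle_point
        self.held = False  # operations (e.g. hold) can still be recorded

    def promote(self):
        # Promote to a real proxy when the task enters the pool, carrying
        # over anything applied while it was still a ghost.
        return TaskProxy(self.name, self.cycle_point, held=self.held)
```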

@benfitzpatrick
Contributor

When we change the spawning behaviour (see the superseded issue #1538) so that a task's successor no longer depends implicitly on its submission, we will need to be able to report possible changes in behaviour in cylc validate.
