
cylc validate: expensive for large numbers of inter task dependencies #1776

Closed
arjclark opened this issue Apr 6, 2016 · 6 comments

arjclark commented Apr 6, 2016

This problem was encountered as a result of a user's suite that had dependencies between members of multiple large families.

It can be boiled down as follows:

Consider triggering of the type:

FAM1:succeed-any => FAM2

where FAM1 has N members and FAM2 has M members.

When cylc expands out this triggering, each of the M members of FAM2 has N prerequisites from FAM1, as:

fam1_member_1:succeed | ... | fam1_member_N:succeed => fam2_member_1
...
fam1_member_1:succeed | ... | fam1_member_N:succeed => fam2_member_M

As a result, two problems occur when validating such a suite:

  1. Validation can have an excessively large memory footprint
  2. Validation can take an impractically long time to complete

Problem 1) is hard to solve as the edges are generated by the graphviz library. Problem 2) is addressed in pull request #1777.

In the situation seen, the suite concerned had half a million edges in it as a result of the family construction (I'll update this issue with a reference example in the near future).

Some of the problem can be solved by rewriting the graph to reduce the number of prerequisites, so:

FAM1:succeed-any => FAM2

can be replaced with:

FAM1:succeed-any => dummy_marker_task => FAM2

which creates N+M dependencies rather than N*M.
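The scale of that saving is easy to sketch (a hypothetical Python illustration, not cylc code; the family sizes below are invented):

```python
def edge_counts(n, m):
    """Compare graph edge counts for FAM1 (n members) triggering FAM2
    (m members) directly versus via a single marker task."""
    direct = n * m       # every FAM2 member lists every FAM1 member
    via_marker = n + m   # FAM1 members -> marker, marker -> FAM2 members
    return direct, via_marker

# Two illustrative family sizes:
print(edge_counts(1000, 500))  # (500000, 1500)
print(edge_counts(30, 60))     # (1800, 90)
```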

When we revisit graphing/dependency handling in cylc, we may look to make this kind of efficiency saving internal to cylc rather than requiring the user to write it manually.


arjclark commented Apr 6, 2016

Cleaned up snippet of the suite.rc that highlighted the issue in the first place:

#!jinja2

{% set BUILD=true %}
{% set RECON=true %}
{% set NUM_MEMBERS_N768=30 %}
{% set NUM_MEMBERS_N512=60 %}
{% set NUM_MEMBERS_N48=50 %}
{% set NUM_SHARED_RECON_MEMBERS=480 %}

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[dependencies]]
         [[[ R1 ]]]
            graph = """
    {%- if BUILD == true %}
            fcm_make => fcm_make2 => \
    {%- endif %}
    {%- if RECON == true %}
            RECON:succeed-all =>  RECON_SHARED
    {%- endif %}
            RECON:succeed-all    => ATMOS_N768:succeed-all  => COMPARE_N768
            ATMOS_N768:start-any => ATMOS_N512:succeed-all  => COMPARE_N512
            ATMOS_N512:start-any => ATMOS_N48:succeed-all   => COMPARE_N48
            """
         [[[ R/2/P1 ]]]
           graph = """
           RECON_SHARED[-P1]:succeed-all => RECON_SHARED
           ATMOS[-P1]:succeed-all => ATMOS_N768:succeed-all => COMPARE_N768
           ATMOS_N768:start-any   => ATMOS_N512:succeed-all => COMPARE_N512
           ATMOS_N512:start-any   => ATMOS_N48:succeed-all  => COMPARE_N48
           ATMOS:succeed-all      => prune
           RECON_SHARED:succeed-all => prune
           """

I can't recommend running it for real, as all those tasks running at once (unconstrained) nuke my desktop with locally running jobs. In reality the tasks are put in various HPC queues, so the suite's progress is throttled by limits on those.


hjoliver commented Apr 7, 2016

It would be easy to auto-insert family done marker tasks into suite graphs already (there may be an even more "internal" solution than this longer term): if the LHS of a dependency pair is a family, simply substitute it with "family => family_done".
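That substitution could look something like the following sketch (hypothetical Python, not cylc internals; the function name, the marker naming scheme, and the trigger regex are all invented for illustration):

```python
import re

def insert_family_markers(graph_line, families):
    """If the LHS of a dependency pair is a family trigger, route it
    through a per-family marker task (illustrative rewrite only)."""
    lhs, sep, rhs = graph_line.partition("=>")
    if not sep:
        return graph_line  # not a dependency pair
    match = re.match(r"\s*(\w+):(\S+)\s*$", lhs)
    if match and match.group(1) in families:
        fam, trigger = match.groups()
        marker = "%s_%s_marker" % (fam, trigger.replace("-", "_"))
        return "%s:%s => %s => %s" % (fam, trigger, marker, rhs.strip())
    return graph_line

print(insert_family_markers("FAM1:succeed-any => FAM2", {"FAM1"}))
# FAM1:succeed-any => FAM1_succeed_any_marker => FAM2
```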


arjclark commented Apr 7, 2016

I've been pondering this overnight, and something along those lines had crossed my mind, though I'd personally want it hidden internally, as it would only confuse a user to find they'd gained an extra task somehow: "what's this task doing here? I didn't create it!"

We could probably finesse the dependency pairing substitution a bit too, so that a marker task is only inserted where a family triggers into another family. Converting:

FAM1:succeed-any => FAM2

to:

FAM1:succeed-any => FAM1_succeed_any_marker => FAM2

is useful, but:

FAM1:succeed-any => single_task

to:

FAM1:succeed-any => FAM1_succeed_any_marker => single_task

is actually more expensive than the original formulation.
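The family-to-family-only refinement above follows directly from the edge arithmetic (a hypothetical helper, not part of cylc):

```python
def marker_saves_edges(n_lhs, m_rhs):
    """A marker task replaces n*m edges with n+m, so it only pays off
    when both sides of the dependency expand to multiple members."""
    return n_lhs * m_rhs > n_lhs + m_rhs

print(marker_saves_edges(30, 60))  # family -> family: True
print(marker_saves_edges(30, 1))   # family -> single task: False
```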

Additionally, we'd need to be careful not to just insert a task proxy automatically, as it could have unintended impacts on existing suite design. For example:

FAM1:succeed-any => FAM2
FAM1:fail-all => recovery => !FAM1

converted to:

FAM1:succeed-any => FAM1_succeed_any_marker => FAM2
FAM1:fail-all => recovery => !FAM1  # assuming auto-added markers are skipped where the RHS is a single task

would no longer auto-recover in the same way.

I think the "best" solution would be one where, internally, cylc didn't expand out the FAMILY entries, but instead had an internal object representing the state of that namespace at a given cycle, which the dependencies would ultimately hang off. With that, we'd need to make cyclic dependency checking a bit smarter, so that for any triggering sequence where the FAMILY triggers aren't expanded out, it would do a check along the lines of (pseudocode):

for item in triggering_sequence:
    if item in subsequent_triggering_sequence_items or intersect(item, subsequent_triggering_sequence):
        raise CyclicDependencyError
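A runnable version of that kind of check, under the assumption that the unexpanded family triggers are kept as nodes in a small directed graph (the graph representation here is invented for illustration):

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: [successors]},
    using iterative depth-first search with three-colour marking."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}
    for start in edges:
        if colour[start] != WHITE:
            continue
        stack = [(start, iter(edges.get(start, ())))]
        colour[start] = GREY
        while stack:
            node, successors = stack[-1]
            for succ in successors:
                if colour.get(succ, WHITE) == GREY:
                    return True  # back edge: cyclic dependency
                if colour.get(succ, WHITE) == WHITE:
                    colour[succ] = GREY
                    stack.append((succ, iter(edges.get(succ, ()))))
                    break
            else:
                colour[node] = BLACK  # all successors explored
                stack.pop()
    return False

print(has_cycle({"FAM1": ["FAM2"], "FAM2": ["FAM1"]}))  # True
print(has_cycle({"FAM1": ["FAM2"], "FAM2": []}))        # False
```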

@benfitzpatrick

> I think the "best" solution would be one where, internally, cylc didn't expand out the FAMILY entries and instead had an internal object that represented the state of that namespace at a given cycle, which would be what the dependencies ultimately hung off.

👍 from me


hjoliver commented Apr 7, 2016

@arjclark - Yeah your "best" solution is the kind of thing I was (loosely) thinking of above with 'an even more "internal" solution longer term'. We should definitely try to do this. However, if it turns out that's not so easy to implement in the short term, the marker task solution would be very easy (with the refinements you mentioned above). I don't really buy the confused user argument - it would be a small number of tasks and they could be given very self-explanatory names such as "dummy_marker_FAM_done". Certainly looking at the suite graph would make their purpose pretty obvious.

@hjoliver

[meeting]
