
cylc validate: expensive for large numbers of inter task dependencies #1776

Closed
arjclark opened this issue Apr 6, 2016 · 6 comments

arjclark commented Apr 6, 2016

This problem was encountered as a result of a user's suite that had dependencies between members of multiple large families.

It can be boiled down as follows:

Consider triggering of the type:

FAM1:succeed-any => FAM2

where FAM1 has N members and FAM2 has M members.

When cylc expands out this triggering, each of the M members of FAM2 has N prerequisites from FAM1, as:

fam1_member_1:succeed | ... | fam1_member_N:succeed => fam2_member_1
...
fam1_member_1:succeed | ... | fam1_member_N:succeed => fam2_member_M

As a result, two problems occur when validating such a suite:

  1. Validation can have an excessively large memory footprint
  2. Validation can take an impractically long time to complete

Problem 1) is hard to solve as the edges are generated by the graphviz library. Problem 2) is addressed in pull request #1777.

In the situation seen, the suite concerned had half a million edges in it as a result of the family construction (I'll update this issue with a reference example in the near future).

Some of the problem can be solved by rewriting the graph to reduce the number of prerequisites, so:

FAM1:succeed-any => FAM2

can be replaced with:

FAM1:succeed-any => dummy_marker_task => FAM2

which creates N+M dependencies rather than N*M.
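The scale of that saving is easy to sketch (a hypothetical Python illustration, not cylc code; the family sizes below are invented):

```python
def edge_counts(n, m):
    """Compare graph edge counts for FAM1 (n members) triggering FAM2
    (m members) directly versus via a single marker task."""
    direct = n * m       # every FAM2 member lists every FAM1 member
    via_marker = n + m   # FAM1 members -> marker, marker -> FAM2 members
    return direct, via_marker

# Two illustrative family sizes:
print(edge_counts(1000, 500))  # (500000, 1500)
print(edge_counts(30, 60))     # (1800, 90)
```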

When we revisit graphing/dependency handling in cylc, we may look to make this kind of efficiency saving internal to cylc rather than requiring the user to write it manually.


arjclark commented Apr 6, 2016

Cleaned up snippet of the suite.rc that highlighted the issue in the first place:

#!jinja2

{% set BUILD=true %}
{% set RECON=true %}
{% set NUM_MEMBERS_N768=30 %}
{% set NUM_MEMBERS_N512=60 %}
{% set NUM_MEMBERS_N48=50 %}
{% set NUM_SHARED_RECON_MEMBERS=480 %}

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[dependencies]]
         [[[ R1 ]]]
            graph = """
    {%- if BUILD == true %}
            fcm_make => fcm_make2 => \
    {%- endif %}
    {%- if RECON == true %}
            RECON:succeed-all =>  RECON_SHARED
    {%- endif %}
            RECON:succeed-all    => ATMOS_N768:succeed-all  => COMPARE_N768
            ATMOS_N768:start-any => ATMOS_N512:succeed-all  => COMPARE_N512
            ATMOS_N512:start-any => ATMOS_N48:succeed-all   => COMPARE_N48
            """
         [[[ R/2/P1 ]]]
           graph = """
           RECON_SHARED[-P1]:succeed-all => RECON_SHARED
           ATMOS[-P1]:succeed-all => ATMOS_N768:succeed-all => COMPARE_N768
           ATMOS_N768:start-any   => ATMOS_N512:succeed-all => COMPARE_N512
           ATMOS_N512:start-any   => ATMOS_N48:succeed-all  => COMPARE_N48
           ATMOS:succeed-all      => prune
           RECON_SHARED:succeed-all => prune
           """

I can't recommend running it for real, as all those tasks running at once (unconstrained) nuke my desktop with locally running jobs. In reality the tasks are put in various HPC queues, so the suite's progress is throttled by limits on those.


hjoliver commented Apr 7, 2016

It would be easy to auto-insert family done marker tasks into suite graphs already (there may be an even more "internal" solution than this longer term): if the LHS of a dependency pair is a family, simply substitute it with "family => family_done".
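That substitution could look something like the following sketch (hypothetical Python, not cylc internals; the function name, the marker naming scheme, and the trigger regex are all invented for illustration):

```python
import re

def insert_family_markers(graph_line, families):
    """If the LHS of a dependency pair is a family trigger, route it
    through a per-family marker task (illustrative rewrite only)."""
    lhs, sep, rhs = graph_line.partition("=>")
    if not sep:
        return graph_line  # not a dependency pair
    match = re.match(r"\s*(\w+):(\S+)\s*$", lhs)
    if match and match.group(1) in families:
        fam, trigger = match.groups()
        marker = "%s_%s_marker" % (fam, trigger.replace("-", "_"))
        return "%s:%s => %s => %s" % (fam, trigger, marker, rhs.strip())
    return graph_line

print(insert_family_markers("FAM1:succeed-any => FAM2", {"FAM1"}))
# FAM1:succeed-any => FAM1_succeed_any_marker => FAM2
```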


arjclark commented Apr 7, 2016

I've been pondering this overnight, and something along those lines had crossed my mind, though I'd personally want it hidden internally, as it would only confuse a user to find they'd gained an extra task somehow: "what's this task doing here? I didn't create it!"

We could probably finesse the dependency pairing substitution a bit too, so that a marker task is only inserted where a family triggers into another family. Converting:

FAM1:succeed-any => FAM2

to:

FAM1:succeed-any => FAM1_succeed_any_marker => FAM2

is useful, but:

FAM1:succeed-any => single_task

to:

FAM1:succeed-any => FAM1_succeed_any_marker => single_task

is actually more expensive than the original formulation.
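The family-to-family-only refinement above follows directly from the edge arithmetic (a hypothetical helper, not part of cylc):

```python
def marker_saves_edges(n_lhs, m_rhs):
    """A marker task replaces n*m edges with n+m, so it only pays off
    when both sides of the dependency expand to multiple members."""
    return n_lhs * m_rhs > n_lhs + m_rhs

print(marker_saves_edges(30, 60))  # family -> family: True
print(marker_saves_edges(30, 1))   # family -> single task: False
```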

Additionally, we'd need to be careful not to just insert a task proxy automatically, as it could have unintended impacts on existing suite design. For example:

FAM1:succeed-any => FAM2
FAM1:fail-all => recovery => !FAM1

converted to:

FAM1:succeed-any => FAM1_succeed_any_marker => FAM2
FAM1:fail-all => recovery => !FAM1  # assuming auto-added markers are skipped where the RHS is a single task

would no longer auto-recover in the same way.

I think the "best" solution would be one where, internally, cylc didn't expand out the FAMILY entries, but instead had an internal object representing the state of that namespace at a given cycle, which the dependencies would ultimately hang off. With that, we'd need to make cyclic dependency checking a bit smarter, so that for any triggering sequence where the FAMILY triggers aren't expanded out, it would do a check along the lines of (pseudocode):

for item in triggering_sequence:
    if item in subsequent_triggering_sequence_items or intersect(item, subsequent_triggering_sequence):
        raise CyclicDependencyError
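A runnable version of that kind of check, under the assumption that the unexpanded family triggers are kept as nodes in a small directed graph (the graph representation here is invented for illustration):

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: [successors]},
    using iterative depth-first search with three-colour marking."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in edges}
    for start in edges:
        if colour[start] != WHITE:
            continue
        stack = [(start, iter(edges.get(start, ())))]
        colour[start] = GREY
        while stack:
            node, successors = stack[-1]
            for succ in successors:
                if colour.get(succ, WHITE) == GREY:
                    return True  # back edge: cyclic dependency
                if colour.get(succ, WHITE) == WHITE:
                    colour[succ] = GREY
                    stack.append((succ, iter(edges.get(succ, ()))))
                    break
            else:
                colour[node] = BLACK  # all successors explored
                stack.pop()
    return False

print(has_cycle({"FAM1": ["FAM2"], "FAM2": ["FAM1"]}))  # True
print(has_cycle({"FAM1": ["FAM2"], "FAM2": []}))        # False
```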

@benfitzpatrick

> I think the "best" solution would be one where, internally, cylc didn't expand out the FAMILY entries and instead had an internal object that represented the state of that namespace at a given cycle, which would be what the dependencies ultimately hung off.

👍 from me


hjoliver commented Apr 7, 2016

@arjclark - Yeah your "best" solution is the kind of thing I was (loosely) thinking of above with 'an even more "internal" solution longer term'. We should definitely try to do this. However, if it turns out that's not so easy to implement in the short term, the marker task solution would be very easy (with the refinements you mentioned above). I don't really buy the confused user argument - it would be a small number of tasks and they could be given very self-explanatory names such as "dummy_marker_FAM_done". Certainly looking at the suite graph would make their purpose pretty obvious.

@hjoliver

[meeting]
