Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

graph parsing: space removal bug #2534

Closed
hjoliver opened this issue Jan 9, 2018 · 20 comments · Fixed by #2644
Closed

graph parsing: space removal bug #2534

hjoliver opened this issue Jan 9, 2018 · 20 comments · Fixed by #2644
Labels
bug Something is wrong :(
Milestone

Comments

@hjoliver
Copy link
Member

hjoliver commented Jan 9, 2018

This:

foo => bar baz

should be an error, but it is parsed as equivalent to:

foo => barbaz

Not likely to cause a problem (so long as strict validation, or rose suite-run, is used) because barbaz would have to be defined under [runtime] ... but still a bug.

@hjoliver hjoliver added the bug Something is wrong :( label Jan 9, 2018
@hjoliver hjoliver added this to the soon milestone Jan 9, 2018
@matthewrmshin
Copy link
Contributor

@hjoliver
Copy link
Member Author

hjoliver commented Jan 9, 2018

Yes, for this reason:
https://github.com/cylc/cylc/blob/4e985c3dda52540fa2c2c5cc32accb65373f6b60/lib/cylc/graph_parser.py#L153-L155
(which sounds good in most respects, but it might be nasty to fix)

@matthewrmshin matthewrmshin changed the title graph parsing bug graph parsing: space removal bug Jan 18, 2018
@matthewrmshin
Copy link
Contributor

@hjoliver I am unable to think of a quick fix for this one. Are you able to do so? If not, I am tempted to drop this to a later milestone.

In the longer term, I'd like to take a good look at our parsers, and probably try to separate the logic into stages, e.g. by adding a lexical analyser for the raw input, before parsing the tokens into a data structure and for further processing.

@hjoliver
Copy link
Member Author

Same, no quick fix, sorry. At least users are unlikely to hit this bug (for the reason given in the description above).

@matthewrmshin matthewrmshin modified the milestones: soon, later Jan 26, 2018
@ColemanTom
Copy link
Contributor

I'm not going to explore this properly, but, could you just do something like:

    if not re.match("^((?!\w+\s+\w+).)*$", line):
        Throw an exception or spit out a warning or however you want to handle it
    # Apparently this is the fastest way to strip all whitespace!: 
    line = "".join(line.split())  

I don't know if any extra characters would need to be explored there (or how you would deal with something like

foo => bar \
    baz

which I think would have to rely on strict validation to catch unles you've joined lines in advance?

@hjoliver
Copy link
Member Author

@ColemanTom - thanks for the suggestion, we'll look it at. We don't need to consider line continuation markers - those lines are concatenated first up when the file is read in.

@matthewrmshin
Copy link
Contributor

Maybe I am over-thinking things here, but the issue is a bit more complicated because task names can contain non-alphanumeric characters. (My fault, I suppose. Should have left things simple.)

@ColemanTom
Copy link
Contributor

Ok. I assume there is a list of unacceptable characters? e.g. = > | & ( ) :?

So ^((?!\w+\s+\w+).)*$ could be ^((?![^=>|&():]+\s+[^=>|&():]+).)*$ adding any other unacceptable characters in? It feels like it should be that simple (unless of course there is a whole heap of extra complexity in the namespace definition/acceptance).

Tried it on

foo => bar baz #failed
foo => bar & baz #passed
foo => ba3$r& baz@!^ #passed
foo => (ba3$r baz@!^) #failed
foo => (ba3$r| baz@!^) #passed

@matthewrmshin
Copy link
Contributor

The good set of characters for a task name can be found in this module:
https://github.com/cylc/cylc/blob/master/lib/cylc/task_id.py

The problem is that we also have a lot of grammar in a graph node. E.g. We can have cycle offsets, parameters and trigger qualifiers. (The most problematic being the hyphen character -, which can be used in task names and parameters values, but is also used as the minus sign in a cycle offset, and in trigger qualifiers such as :finish-all.)

@ColemanTom
Copy link
Contributor

This may be a touch crude, but:
\A[\s(]*\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*(\s*=\s*>\s*[\s(]*\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*)*\Z

Or, a bit easier to follow...

Construct the regex string

import re
taskname_pattern = "\s*\w[\w\-+@%]*"
task_suffix_options = "(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?"
lhs_task_pattern =  standard_pattern + task_suffix_options
rhs_task_pattern = "\s*!?" + lhs_task_pattern
multiple_task_pattern = "[\s(]*{pattern}(\s*[&|]{pattern}[\s)]*)*"
lhs_multiple_task_pattern = multiple_task_pattern.format(pattern=lhs_task_pattern)
rhs_multiple_task_pattern = multiple_task_pattern.format(pattern=rhs_task_pattern)
final_pattern = "\A" + lhs_multiple_task_pattern + "(\s*=\s*>\s*" + rhs_multiple_task_pattern + ")*\Z"
checker = re.compile(final_pattern)

Functions used to test:

def check_pass():
  for p in pass_list:
      if not checker.match(p):
          print("Pass did not pass...:  " + p)

def check_fail():
  for f in fail_list:
      if checker.match(f):
          print("Fail should not have passed...:  " + f)

Syntax that you would expect to fail:

fails = """    foo => !
    foo => @
    foo => +
    foo => %
    foo => @b
    foo => +b
    foo => %b
    foo => bar baz
    get<run pets> => feed<run,pets>
    get<run,pets> => feed<run pets>
    OBS_GET<run>: => OBS_PROC<run>
    OBS_GET<run>:succeed-all => OBS_PROC<>
    foo[] => foo"""

Syntax you would expect to pass

passes = """    pre => init<run>
    pre => init< run>
    pre => init<run >
    get<run,pets> => feed<run,pets>
    get< run,pets> => feed<run,pets>
    get<run, pets> => feed<run,pets>
    get<run,pets > => feed<run,pets>
    get<run , pets> => feed<run,pets>
    get <run,pets> => feed<run,pets>
    get<run,pets-1> => get<run,pets>
    get<run,pets -1> => get<run,pets>
    get<run,pets- 1> => get<run,pets>
    get<run,pets - 1> => get<run,pets>
    get<run =1,pets=dogs> => boo
    get<run= 1,pets=dogs> => boo
    get<run = 1,pets=dogs> => boo
    get < run = 1 , pets = dogs > => boo
    get<run=1,pets=dogs>=>boo
    OBS_GET<run>:succeed-all => OBS_PROC<run>
    OBS_GET<run> :succeed-all => OBS_PROC<run>
    OBS_GET<run>: succeed-all => OBS_PROC<run>
    OBS_GET<run> : succeed-all => OBS_PROC<run>
    OBS_GET<run>:succeed -all => OBS_PROC<run>
    OBS_GET<run>:succeed- all => OBS_PROC<run>
    OBS_GET<run> : succeed - all => OBS_PROC<run>
    foo => bar
    foo => b-
    foo => b@
    foo => b+
    foo => b%
    foo => bar& baz
    foo => bar &baz
    foo => bar & baz
    foo => bar| baz
    foo => bar |baz
    foo => bar | baz
    model[-P1M] => model
    model[- P1M] => model
    model[-P1M ] => model
    model [ - P1M ] => model
    Analysis[+PT12H] => ObSensitivity
    Analysis[ +PT12H] => ObSensitivity
    Analysis[+ PT12H] => ObSensitivity
    Analysis[ + PT12H ] => ObSensitivity
    copy:expired => !proc
    copy:expired => ! proc
    foo
    foo => bar => test
    foo => _
    foo => bar_baz
    foo[-P2M]:out2 => baz
    foo[ - P2M ] : out2 => baz
    foo[ -P2M]:out2 => baz
    foo[- P2M]:out2 => baz
    foo[-P2M] :out2 => baz
    foo[-P2M]: out2 => baz
    foo[ -P2M ]:out2 => baz
    sim<r,m,c=2>[-P1Y] => sim<r,m,c=0> | bar<g,h,i=3>
    OBS_GET<run>:succeed - all&foo => OBS_PROC<run> => hello => test | boo"""  

Results:

>>> check_fail()
Fail should not have passed...:      get<run pets> => feed<run,pets>
Fail should not have passed...:      get<run,pets> => feed<run pets>
>>> check_pass()
>>>

So, not quite perfectly capturing the expected failures because I didn't put super logic inside the <...> or [....] portions of a task regex to account for param space param. It wouldn't be too hard to do, but I don't fully know the syntax permissible for a parameter name for example, or the time stuff inside the [...]. Those probably get captured fine anyway because a parameter wouldn't exist that is expected, or a time syntax wouldn't be accepted.

I'm sure there are other syntax I haven't accounted for, but following the format/functions above, it would be easy to create a function to do this stuff and add more strings to a pytest framework.

I know this isn't a high priority thing, but its a good excuse for me to practice and improve my regex knowledge.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 7, 2018

@ColemanTom - thanks for this. I haven't had time to look at it in detail, sorry, and am about to go away for a week. But we can probably use what you've done if you can accommodate those two failing cases (and if the massive regex doesn't create a perfomance problem... and maybe it won't, in context).

@ColemanTom
Copy link
Contributor

Sure thing. Feel free to assign this Issue to me if you like and I'll fork/send a merge request when I find the time to finish this and adding tests. I am guessing here - https://github.com/cylc/cylc/tree/master/tests/validate ?

@matthewrmshin
Copy link
Contributor

Hi @ColemanTom Please feel free to raise a pull request. Yes, please add a test under https://github.com/cylc/cylc/blob/master/tests/validate/ - e.g. https://github.com/cylc/cylc/blob/master/tests/validate/64-circular.t should give you a good starting point. (Tests are shell scripts that outputs results in the Test Anything Protocol.)

@ColemanTom
Copy link
Contributor

I've done more exploration of this. I'm very disappointed with the regex ability in python. No (natural) atomic groups or possessive quantifiers which would have made things a bit cleaner. But, with some pseudo-atomic groups which Python can handle, it is quite quick.

I have discovered two bugs in my logic which I still need to fix. And some behaviour in cylc that I was not expecting. I am wondering if it is intentional and I just didn't realise, or its a quirk?

graph = """
    a => b
        => c
"""

I expected that would fail as c is not following anything, but it works and is treated as a => b => c. I could not see that as an example in the user guide. So I'm not entirely sure how to handle that. It implies that on a given line you don't need a LHS node, so I could make that LHS optional I guess, change a + to a *, not hard, but I wanted to make sure it is expected and reasonable behaviour. If it were only the line => c, and no line above, cylc aborts. So I am guessing it is expected?

@ColemanTom
Copy link
Contributor

I can answer my own question, yes, so long as the => isn't the leading thing on the first line, it can be on subsequent lines. Noticed it in the code.

@matthewrmshin
Copy link
Contributor

@ColemanTom Yes, this is deliberate - to allow users to easily manipulate continuation lines.

@hjoliver
Copy link
Member Author

See #1900

@hjoliver
Copy link
Member Author

(graph lines can get very long, but backslash continuation is unsafe - trailing whitespace breaks it).

@ColemanTom
Copy link
Contributor

Another question, this time about triggering off of remote tasks: https://cylc.github.io/cylc/html/single/cug-html.html#12.26

It doesn't say explicitly in the docs, but the validation fails, so I'm assuming it cannot be applied directly with any other graph node suffix (e.g. time triggers, parameters, family triggers), they have to be completely independent? I just want to make sure my understanding of things is correct.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 21, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants