Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Regex replacement in sub command broken in WDL 1.0 #3990

Closed
rhpvorderman opened this issue Aug 9, 2018 · 13 comments
Closed

Regression: Regex replacement in sub command broken in WDL 1.0 #3990

rhpvorderman opened this issue Aug 9, 2018 · 13 comments

Comments

@rhpvorderman
Copy link
Collaborator

rhpvorderman commented Aug 9, 2018

version 1.0

workflow test_regex {
    call echo {
      input:
    }

}

task echo {
  input {
      String myFile = "dir/dir/bladiebla.fq.gz"
    }
  # Chop of .gz
  String name = sub(basename(myFile), "\\.gz$","")
   
  command {
    echo ${myFile}
    echo ${basename(myFile)}
    echo ${name}  
  }
  output {
    Array[String] result = read_lines(stdout())  
  }   
}

Run with cromwell version 34.
I expect name to be bladiebla.fq but unfortunately it is bladiebla.fq.gz
The regex used work in earlier versions of cromwell (31).
We noticed this when changing to WDL 1.0

@rhpvorderman rhpvorderman changed the title Regression: Regex replacement in sub command broken. Regression: Regex replacement in sub command broken in WDL 1.0 Aug 9, 2018
@cjllanwarne
Copy link
Contributor

Thanks for the report! Note that in the meantime you can use the "strip suffix" on basename to remove the .gz suffix without sub (since wanting to do that is such a common scenario)

@rhpvorderman
Copy link
Collaborator Author

Thanks a lot for pointing that out! That does not fix the issue we have in our tasks, but it will make our code look cleaner in cases where it applies.

@aednichols
Copy link
Collaborator

@rhpvorderman as a workaround it appears that making your pattern "\.gz$" - dropping an \ - yields the desired result.

We are looking into why this changed.

@rhpvorderman
Copy link
Collaborator Author

rhpvorderman commented Aug 10, 2018

Thanks @aednichols! Oh then it is not really a bug right. Then it is a feature ;-).
In a proper regex an actual dot is \. but in scala this needs to be double escaped as \\. .
It does not really matter as long as it works and the implementation is properly documented somewhere. I could not find anything about the regex implementation on the cromwell readthedocs. And sifting trough the code looking for the snippet that does this exact work was quite hard.

Regardless, changing regex implementations make WDL unportable, which is undesirable. I will make a pull request to the spec.

@aednichols
Copy link
Collaborator

aednichols commented Aug 10, 2018

I think the bug is actually in Cromwell's WDL draft-2 support.

Submitting a draft-2 workflow with PCRE pattern \.gz$ yields an Unrecognized token error on the backslash.

A work-around is to double-escape with \\.gz$, thus becoming non-standard and unexpectedly failing when you moved to 1.0.

@aednichols
Copy link
Collaborator

As an aside, here are the code locations where regex evaluation happens

As you can see the Scala code is identical, so something is going on in draft-2 that "consumes" a \ before it gets to the regex evaluator.

@cjllanwarne
Copy link
Contributor

I think this could be a case of "WDL 1.0 isn't evaluating escape codes in strings properly", per:
https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#whitespace-strings-identifiers-constants

@aednichols
Copy link
Collaborator

Does this mean that, because \\. is not on the list of accepted double-quote uses, it should be an error?

@aednichols
Copy link
Collaborator

@rhpvorderman to give you an update - I agree that there's something wrong here, but I can't identify which component requires updating - the WDL spec, the WDL grammar in spec repo, or Cromwell.

I will consult my teammate @cjllanwarne when he comes back from vacation in about 1.5 weeks.

@rhpvorderman
Copy link
Collaborator Author

@aednichols thanks for the update. We are currently in the process of updating the WDL spec at openwdl/wdl#243. This will hopefully make clear what sort of regular expression should be used. After that it will be a lot easier to decide whether the regular expression evaluation is broken or not.

@aednichols
Copy link
Collaborator

Ah, definitely missed that thread 🤦‍♂️ Thanks.

@aednichols
Copy link
Collaborator

Fun fact: the wdl_unescape function is never called in the WDL 1.0 parser.

@aednichols
Copy link
Collaborator

@rhpvorderman thanks for reporting, the fix is in Cromwell 35 (out now!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants