Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PatternValidator does not correctly validate input with newlines #495

Open
kmalski opened this issue Jan 12, 2022 · 13 comments
Open

PatternValidator does not correctly validate input with newlines #495

kmalski opened this issue Jan 12, 2022 · 13 comments

Comments

@kmalski
Copy link
Contributor

kmalski commented Jan 12, 2022

Hi,

I have found that neither java regex based (PatternValidatorJava) nor joni based (PatternValidatorEcma262) pattern validation does not work correctly with newlines.

Any of implementation does not correctly interpret ^ and $ anchors. I would expect that, when I use them at the start and end of pattern (eg. ^[a-z]{1,10}$) they would not allow to pass any trailing newline character (eg. abc\n should not be matched). There are separate problems with both implementation, so I will describe them individually.

  1. Joni
    The problem is with default configuration for ECMAScript syntax in Joni library, which has multiline matching by default enabled. From the json-schema-validator code:
  private boolean matches(String value) {
      if (compiledRegex == null) {
          return true;
      }

      byte[] bytes = value.getBytes();
      return compiledRegex.matcher(bytes).search(0, bytes.length, Option.NONE) >= 0;
  }

For the fast fix, the last line can be changed to:

      return compiledRegex.matcher(bytes).search(0, bytes.length, -Option.MULTILINE) >= 0;

but I have also rised issue in the Joni library, as I believe that this is not correct default (ECMAScript has disabled multiline matching by default).

What's more interesting, because of enabled multiline matching, currently this input \r\nab\nab\n will match this pattern ^[a-z]{1,10}$ and pass validation. We want to allow single character-only word, and the entire sentence passes.

  1. Java built-in regex find vs match
    Current implementation of matching with java regex looks like this:
  private boolean matches(String value) {
      return compiledPattern == null || compiledPattern.matcher(value).find();
  }

The problem is how the find method works. From the documentation: Attempts to find the next subsequence of the input sequence that matches the pattern. So this function matches subsequence of input and for the two JDKs, which I tried, returns true for input abab\n and pattern ^[a-z]{1,10}$, despite that I used ^ and $ anchors. So any ending newline character will be always allowed for such patterns.

Possible solution is to use matches method which attempts to match the entire region against the pattern, but this will result in implicit adding ^ and $ anchors to every pattern.

@stevehu
Copy link
Contributor

stevehu commented Jan 12, 2022

@kmalski Thanks a lot for raising this issue and providing detailed explanations and solutions. I would say that multiline support for pattern matching is very rare and that is why nobody found this issue before. What we can do is to make this configurable and default to the single-line. This will maintain backward compatibility and at the same time, allow users with multiline pattern matching to be enabled when necessary. What do you think?

Here is the class for the configuration and we just need to add a new entry with default as single-line.

https://github.com/networknt/json-schema-validator/blob/master/src/main/java/com/networknt/schema/SchemaValidatorsConfig.java

@kmalski
Copy link
Contributor Author

kmalski commented Jan 13, 2022

@stevehu Looks good for me. I found one more case in Joni library, when subtract of Option.MULTILINE is not enough, so I think it's reasonable to wait for answer (at least some time).

@stevehu
Copy link
Contributor

stevehu commented Jan 14, 2022

Thanks for the update. Let's keep this issue open until we have the proper solution.

@anro87
Copy link

anro87 commented Apr 21, 2023

We are facing the same issue within our project. Any plans to adjust the SchemaValidatorsConfig.java as mentioned by @stevehu above?

@stevehu
Copy link
Contributor

stevehu commented Apr 21, 2023

@anro87 I think @kmalski has documented the solution already. Would you like to submit a PR to get it implemented?

@anro87
Copy link

anro87 commented Apr 24, 2023

@stevehu You mean we should implement it based on @kmalski's solution and open a PR for it? Guess we can do that. ;)

@stevehu
Copy link
Contributor

stevehu commented Apr 24, 2023

Yes. Let us know if you need any help. Thanks.

@mriehl
Copy link

mriehl commented May 4, 2023

Hey @stevehu , I'm one of @anro87 's colleagues. I've just started writing a test for this feature but it quickly dawned on me that the problem was already fixed a few days ago in #737
The pattern validator is now using matches as outlined above and I can't reproduce the issue.

It's not configurable though, not sure what that means for a release since it would technically break existing behavior

Note: the Joni/ECMA262 side still exhibits this issue, it's the JDK-based matcher that has been fixed.

@mriehl
Copy link

mriehl commented May 9, 2023

Waiting on feedback how I should proceed. These are the options we have:

  • adjust the joni validator to behave similarly or wait for something to happen in Multiline Option with ^ and $ anchors  jruby/joni#57
    (my personal preference, there are probably lots of users out there who don't expect input with newlines to pass validation. There are probably also security implications)
  • introduce a config flag just for the joni validator and leave master as-is (non-joni validator uses match and not find)
  • introduce a config flag for both validators (what was initially discussed here).
    Not sure how that will affect the unit tests added in Enable unit-tests for ECMA 262 regular expressions #737 but still an option

@fdutton
Copy link
Contributor

fdutton commented May 22, 2023

@mriehl I would like to see your solution for both. #782 raises another issue with my solution when the regex is not anchored.

@mriehl
Copy link

mriehl commented May 22, 2023

Hm I didn't really think about the use case in #782 .
A configuration flag for implicit anchoring would solve this though it's not elegant.

What do you think about keeping it the way it is now (no implicit anchoring) and doing very simple detection if the regex wants to be anchored, in that case we would use matches instead of find. This seems like it would do the right thing in 99,99% of cases

fdutton pushed a commit that referenced this issue May 22, 2023
@justin-tay
Copy link
Contributor

Re-opening as the fix for this was reverted due to causing other issues in

I'm not familiar enough with the issue and it doesn't look like this is easy to fix so using GraalJS might be an option.

@justin-tay justin-tay reopened this Jun 13, 2024
@kmalski
Copy link
Contributor Author

kmalski commented Jun 14, 2024

Hi, great progress of the library from the time when I originally opened this issue. @justin-tay to make it easier for you, I have checked to see if this issue still exists and currently I can reproduce it for Joni and JDK implementations (GraalJS implementation seems to work fine).

@Test
void multilineString() {
    RegularExpression regex = new JoniRegularExpression("^[a-z]{1,10}$");
    assertTrue(regex.matches("abc\n"));
}
@Test
void multilineString() {
    RegularExpression regex = new JDKRegularExpression("^[a-z]{1,10}$");
    assertTrue(regex.matches("\nabc\n"));
}

Both test cases pass, but because we are using anchors here, I would expect them to fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants