-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added test for polynomial backtracking #2597
Added test for polynomial backtracking #2597
Conversation
For single character matches, you can subtract character classes by using a negative lookahead before the character class. E.g. for a word character that's not also a number, you can do However, I am a bit wary of introducing a performance overhead on every single Prism usage, by default, to prevent this, which is only a concern when highlighting unsafe content. |
I usually try to avoid negative lookaheads to fix backtracking issues for 3 reasons:
While I could do something about points 1 and 2 (tooling can be improved and we could transform our regexes to do these trivial optimizations as part of the build process), point 3 still remains.
I'd rather have Prism be safe by default or at least as save as we can manage. |
The alternative I suppose is to manually compute the subtraction result as a character class, but is that really better simplicity-wise?
At any cost? |
In some cases, it is but here... no. Should I write a little babel plugin that applies the lookaheads, so we don't have to worry as much about the performance overhead? At least the minified files should be fast. (see side note below)
No. If I wanted to guarantee O(n) runtime for Prism on all input, I'd have to reimplement everything using something like an LR parser. Yeah... let's not for the time being. Side noteEven the fixed version of the above example will take O(n^2) time for some strings: /(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i
.test("a".repeat(5_000)) // takes about 35 milliseconds on my computer To make it O(n) for all input, we'd have to add a lookbehind as well: /(?<!(?!\s)[_$a-z\d\xA0-\uFFFF])(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i
.test("a".repeat(5_000 * 1000)) // takes about 70 milliseconds on my computer After applying all the /(?<![^\s\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x40[\\\]`\x7b-\x9f^])[^\s\d\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x5e`\x7b-\x9f][^\s\x00-\x08\x0e-\x1f!"#\x25-\x2f\x3a-\x40[\\\]`\x7b-\x9f^]*(?=\s*=>)/i
.test("a".repeat(5_000 * 1000)) // takes about 16 milliseconds on my computer /Side note Side side noteWhile playing around, I found that the lookahead versions also have another problem: The stack. For long strings (>1MB), both Chrome and Firefox throw an error: /(?<!(?!\s)[_$a-z\d\xA0-\uFFFF])(?!\s)[_$a-z\xA0-\uFFFF](?:(?!\s)[$\w\xA0-\uFFFF])*(?=\s*=>)/i.test("a".repeat(10e6))
// Chrome 86
> Uncaught RangeError: Maximum call stack size exceeded
// FireFox 80
> Uncaught InternalError: too much recursion The version with applied lookaheads doesn't have this problem. The regex will happily test strings >100MB. Well, it will until we hit the hard limits of the browsers (about 500MB in Chrome and about 1GB in Firefox). /Side side note While it would really nice to have guaranteed O(n) runtime for all of our regexes, it is also impossible in general (e.g. The cost? Fixing 200-400 of Prism's regexes. @LeaVerou You're right. I'll use looaheads when there's no other way and be done with it. I was hoping that there was some better way that I overlook but unfortunately it seems like there isn't. Talking about it help me clear this up. Thank you. Side side side noteIf we ever wanted to try to get O(n) for all regexes (where it's possible), we can use the current test for polynomial backtracking. Basically, it's possible to exploit the linear backtracking of regex /Side side side note |
Thank you for the thoughtful analysis @RunDevelopment! |
True, but we already do that to some extend (#2296). I'm all for using tricks to optimize the minified files but I'm also keenly aware that this might go wrong someday. We could run all language tests against the minified files to prevent that thou.
How do we use XRegExp here? It's an amazing library but it doesn't optimize regexes. |
It allows building regexes from smaller regexes, may make it easier to subtract groups, and once we use it, it may also help simplify our grammars. |
Are you sure about that? There is surprisingly little tooling for regexes available in the JS ecosystem. |
Is this not a common use case for Prism? I know at least one very large organization that uses Prism for highlighting user input... and I know of several (and imagine many more) use Highlight.js in exactly such a fashion (chat rooms, message boards, etc)... all filled with content provided by users. Obviously one can imagine many scenarios where all the content is "safe" but I guess if you'd asked me off the top of my head I'd have guessed much more unsafe content than safe. |
This was branched from #2590. Blocked by #2590.
Very similar to #2590, I added a test that checks for polynomial backtracking in Prism's regexes. It only detects one kind of polynomial backtracking but even that yields hundreds of potential vulnerabilities. The problem this time around is that I don't know how to fix them all.
The problem
Let's look at one example from JavaScript to illustrate the problem:
/[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i
Do you see it? Here is the output of the test:
The problem is that
[$\w\xA0-\uFFFF]
and\s
are not disjoint, so we can just take any character that the two have in common and create an attack string. The test tells us that the available characters are[\xa0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
.Using this knowledge, we can create the attack string
"\xA0".repeat(n)
for anyn
. The above regex will take O(n^3) time to reject strings of that form. On my laptop, it takes over 1 second for/[_$a-z\xA0-\uFFFF][$\w\xA0-\uFFFF]*(?=\s*=>)/i.test("\xa0".repeat(2000))
to returnfalse
.Now that you understand the example, how do we fix this?
Well, we "just" have to make
[$\w\xA0-\uFFFF]
and\s
disjoint. The correct way to do it is to subtract\s
from[$\w\xA0-\uFFFF]
but how do we do that... The other is to subtract[$\w\xA0-\uFFFF]
from\s
.This question goes out to my fellow maintainers. How do we fix this?
There are a lot of other examples like this too. Just take a look at the TravisCI logs. There are literally hundreds of these cases and most are non-trivial to fix.
@Golmote @mAAdhaTTah @LeaVerou
How does the analysis work?
It's not necessary to understand the analysis in order to know that all the problems it finds are vulnerabilities that should be fixed. If you don't care about the implementation details, skip this section.
I go through all quantifiers that have an unbounded maximum (e.g.
+
,{2,}
). A quantifier's element is then converted into a character set such that the character set is guaranteed accepted by the quantifier's element (e.g.a+
->[a]
,(a|bb|c)*
->[ac]
,(\w\s*|a\b|foo)*
->[]
). The basic idea is that we now know all characters that the quantifier will trivially accept.To get polynomial backtracking, we just have to find a compatible quantifier and a path to get there. This is done by examining the regex AST nodes that can be reached from the initial quantifier and then recursively continuing the search until we cannot proceed or a compatible quantifier was found.
Examples:
a*a*
. The initial quantifier is the lefta*
with the character set[a]
. We then find the righta*
reachable from the lefta*
. The righta*
is compatible with[a]
, so we just found a match.The attack string is
"a"*n+$
. ($
is some suffix that causes the full regex to reject all inputs.)[bh\d]+[a-z][a-fA-F]+
: The initial quantifier is[bh\d]+
with the character set[0-9bh]
. We then find[a-z]
reachable from[bh\d]+
. To proceed, we have to narrow down the available character set from[0-9bh]
to[bh]
. From[a-z]
reachable is[a-fA-F]+
. We narrow down the available character set from[bh]
to[b]
to match the character set of[a-fA-F]+
. Since we still have available characters,[bh\d]+
and[a-fA-F]+
are compatible.The attack string is
"b"*n+$
.This is a very limited analysis but it's still quite good.