-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lint: add linter for unicode directional formatting characters #72287
Conversation
This PR adds a linter to disallow the use of directional formatting characters to help prevent them being used to get malicious code past code review. Ideally our code-review tool would highlight such characters for us since such characters might routinely appear in binary artifacts. See also: - https://www.trojansource.codes/ - https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html - golang/go#20209 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm neutral on this. It's pretty cheap and we don't currently have any legitimate use case for these characters in our codebase. But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example). If there were a well-maintained linter against unicode shenanigans (especially something based on data directly from the unicode consortium or a project like ICU), I'd be in favor of using it. But I don't really like maintaining our own ad-hoc blocklist of unicode characters.
@@ -633,6 +633,54 @@ func TestLint(t *testing.T) { | |||
} | |||
}) | |||
|
|||
t.Run("TestDisallowedCharacterSequences", func(t *testing.T) { | |||
t.Parallel() | |||
disallowedCharacterSequences := []string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this list of characters come from? We should include a link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters
I believe it also matches what rust added a check for in their advisory.
"\u202C", // POP-DIRECTIONAL-FORMATTING | ||
"\u202D", // LEFT-TO-RIGHT-OVERRIDE | ||
"\u202E", // RIGHT-TO-LEFT-OVERRIDE | ||
"\u2066", // LEFT-TO-RIGHT-ISOLATE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the "isolate" characters actually dangerous? I understand the problems with the override characters, but I've also had to use them in a real project to get bidi text to render correctly. The isolate characters were introduced more recently (2013, after the last time I had to deal with these issues) and were supposed to be a safer alternative.
This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code. That may be acceptable in CockroachDB since it is already our policy that source code and comments be in English, but it would not be good if this rule spread into compilers and other language-wide tools. If we could narrow the block list to only the override and embedding characters while allowing isolates, that would be a better outcome since it could address the security concerns while still allowing all languages to be properly represented (I think. I can't read any RTL languages so there's still quite a bit of conjecture here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trojan source paper (https://trojansource.codes/trojan-source.pdf) does have an example of malicious use of right-to-left isolate. (maybe first-strong-isolate is still safe, but I'm not sure).
I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code
A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).
I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.
This check is indeed pretty naive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example).
💯 There are lots of ways to subvert code review.
Now that GitHub has an explicit warning about this, I think the remaining value here is pretty minimal. I think, let's hold off to see if reviewable might be able to add a similar UI indication. If they can, that seems sufficient to me.
@@ -633,6 +633,54 @@ func TestLint(t *testing.T) { | |||
} | |||
}) | |||
|
|||
t.Run("TestDisallowedCharacterSequences", func(t *testing.T) { | |||
t.Parallel() | |||
disallowedCharacterSequences := []string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters
I believe it also matches what rust added a check for in their advisory.
"\u202C", // POP-DIRECTIONAL-FORMATTING | ||
"\u202D", // LEFT-TO-RIGHT-OVERRIDE | ||
"\u202E", // RIGHT-TO-LEFT-OVERRIDE | ||
"\u2066", // LEFT-TO-RIGHT-ISOLATE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code
A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).
I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.
This check is indeed pretty naive.
This PR adds a linter to disallow the use of directional formatting
characters to help prevent them being used to get malicious code past
code review.
Ideally our code-review tool would highlight such characters for us
since such characters might routinely appear in binary artifacts.
See also:
Release note: None