-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lint: add linter for unicode directional formatting characters #72287
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -633,6 +633,54 @@ func TestLint(t *testing.T) { | |
} | ||
}) | ||
|
||
t.Run("TestDisallowedCharacterSequences", func(t *testing.T) { | ||
t.Parallel() | ||
disallowedCharacterSequences := []string{ | ||
"\u202A", // LEFT-TO-RIGHT-EMBEDDING | ||
"\u202B", // RIGHT-TO-LEFT-EMBEDDING | ||
"\u202C", // POP-DIRECTIONAL-FORMATTING | ||
"\u202D", // LEFT-TO-RIGHT-OVERRIDE | ||
"\u202E", // RIGHT-TO-LEFT-OVERRIDE | ||
"\u2066", // LEFT-TO-RIGHT-ISOLATE | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are the "isolate" characters actually dangerous? I understand the problems with the override characters, but I've also had to use them in a real project to get bidi text to render correctly. The isolate characters were introduced more recently (2013, after the last time I had to deal with these issues) and were supposed to be a safer alternative. This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code. That may be acceptable in CockroachDB since it is already our policy that source code and comments be in English, but it would not be good if this rule spread into compilers and other language-wide tools. If we could narrow the block list to only the override and embedding characters while allowing isolates, that would be a better outcome since it could address the security concerns while still allowing all languages to be properly represented (I think. I can't read any RTL languages so there's still quite a bit of conjecture here). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The trojan source paper (https://trojansource.codes/trojan-source.pdf) does have an example of malicious use of right-to-left isolate. (maybe first-strong-isolate is still safe, but I'm not sure). I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).
This check is indeed pretty naive. |
||
"\u2067", // RIGHT-TO-LEFT-ISOLATE | ||
"\u2068", // FIRST-STRONG-ISOLATE | ||
"\u2069", // POP-DIRECTIONAL-ISOLATE | ||
} | ||
pattern := strings.Join(disallowedCharacterSequences, "|") | ||
cmd, stderr, filter, err := dirCmd( | ||
crdb.Dir, | ||
"git", | ||
"grep", | ||
"-nE", | ||
pattern, | ||
"--", | ||
":!*.woff2", | ||
":!*.png", | ||
":!*.tgz", | ||
":!pkg/ccl/importccl/testdata/avro/stock-10000.bjson", | ||
":!pkg/ccl/importccl/testdata/avro/stock-10000.ocf", | ||
) | ||
if err != nil { | ||
t.Fatal(err) | ||
} | ||
|
||
if err := cmd.Start(); err != nil { | ||
t.Fatal(err) | ||
} | ||
|
||
if err := stream.ForEach(filter, func(s string) { | ||
t.Errorf("\n%s <- forbidden use of disallowed character sequence.", s) | ||
}); err != nil { | ||
t.Error(err) | ||
} | ||
|
||
if err := cmd.Wait(); err != nil { | ||
if out := stderr.String(); len(out) > 0 { | ||
t.Fatalf("err=%s, stderr=%s", err, out) | ||
} | ||
} | ||
}) | ||
|
||
t.Run("TestInternalErrorCodes", func(t *testing.T) { | ||
t.Parallel() | ||
cmd, stderr, filter, err := dirCmd( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this list of characters come from? We should include a link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters
I believe it also matches what rust added a check for in their advisory.