lint: add linter for unicode directional formatting characters #72287

stevendanna · 2021-11-01T15:41:22Z

This PR adds a linter to disallow the use of directional formatting
characters to help prevent them being used to get malicious code past
code review.

Ideally our code-review tool would highlight such characters for us
since such characters might routinely appear in binary artifacts.

See also:

Release note: None

This PR adds a linter to disallow the use of directional formatting characters to help prevent them being used to get malicious code past code review. Ideally our code-review tool would highlight such characters for us since such characters might routinely appear in binary artifacts. See also: - https://www.trojansource.codes/ - https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html - golang/go#20209 Release note: None

cockroach-teamcity · 2021-11-01T15:41:28Z

This change is

stevendanna · 2021-11-01T23:45:19Z

GitHub makes this pretty clear now:

But reviewable does not.

dt · 2021-11-02T00:12:35Z

Nice!

Personally, I’d probably say the (new, I think?) GitHub banner is sufficient, but folks who still use reviewable might disagree, and I suppose we support that workflow. I don’t feel strongly either way so I’ll defer to @bdarnell or @knz.

bdarnell

I'm neutral on this. It's pretty cheap and we don't currently have any legitimate use case for these characters in our codebase. But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example). If there were a well-maintained linter against unicode shenanigans (especially something based on data directly from the unicode consortium or a project like ICU), I'd be in favor of using it. But I don't really like maintaining our own ad-hoc blocklist of unicode characters.

bdarnell · 2021-11-02T14:53:01Z

pkg/testutils/lint/lint_test.go

@@ -633,6 +633,54 @@ func TestLint(t *testing.T) {
 		}
 	})

+	t.Run("TestDisallowedCharacterSequences", func(t *testing.T) {
+		t.Parallel()
+		disallowedCharacterSequences := []string{


Where did this list of characters come from? We should include a link.

This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters

I believe it also matches what rust added a check for in their advisory.

bdarnell · 2021-11-02T15:18:31Z

pkg/testutils/lint/lint_test.go

+			"\u202C", // POP-DIRECTIONAL-FORMATTING
+			"\u202D", // LEFT-TO-RIGHT-OVERRIDE
+			"\u202E", // RIGHT-TO-LEFT-OVERRIDE
+			"\u2066", // LEFT-TO-RIGHT-ISOLATE


Are the "isolate" characters actually dangerous? I understand the problems with the override characters, but I've also had to use them in a real project to get bidi text to render correctly. The isolate characters were introduced more recently (2013, after the last time I had to deal with these issues) and were supposed to be a safer alternative.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code. That may be acceptable in CockroachDB since it is already our policy that source code and comments be in English, but it would not be good if this rule spread into compilers and other language-wide tools. If we could narrow the block list to only the override and embedding characters while allowing isolates, that would be a better outcome since it could address the security concerns while still allowing all languages to be properly represented (I think. I can't read any RTL languages so there's still quite a bit of conjecture here).

The trojan source paper (https://trojansource.codes/trojan-source.pdf) does have an example of malicious use of right-to-left isolate. (maybe first-strong-isolate is still safe, but I'm not sure).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code

A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This check is indeed pretty naive.

stevendanna

But it's also very narrowly targeted and I think there are plenty of other ways to subtly make code not do what it appears to (see the underhanded C contest for example).

💯 There are lots of ways to subvert code review.

Now that GitHub has an explicit warning about this, I think the remaining value here is pretty minimal. I think, let's hold off to see if reviewable might be able to add a similar UI indication. If they can, that seems sufficient to me.

stevendanna · 2021-11-02T16:47:29Z

pkg/testutils/lint/lint_test.go

@@ -633,6 +633,54 @@ func TestLint(t *testing.T) {
 		}
 	})

+	t.Run("TestDisallowedCharacterSequences", func(t *testing.T) {
+		t.Parallel()
+		disallowedCharacterSequences := []string{


This was all of the "explicit" formatting characters listed in: https://www.unicode.org/reports/tr9/#Directional_Formatting_Characters

I believe it also matches what rust added a check for in their advisory.

stevendanna · 2021-11-02T16:51:30Z

pkg/testutils/lint/lint_test.go

+			"\u202C", // POP-DIRECTIONAL-FORMATTING
+			"\u202D", // LEFT-TO-RIGHT-OVERRIDE
+			"\u202E", // RIGHT-TO-LEFT-OVERRIDE
+			"\u2066", // LEFT-TO-RIGHT-ISOLATE


This is important because banning all of these characters tells speakers of RTL languages that their language cannot be represented properly in source code

A great point. Further, even in CRDB, it might get in the way of non-english developers who want to leave notes to themselves in the source code while working on something (even if they plan to remove those notes before submitting the PR).

I also suspect that there's a way to combine homoglyph attacks with bidi manipulation without any explicit RTL control characters. For example, the arabic date separator U+060D can pass for a comma in some fonts and it is flagged for RTL mode.

This check is indeed pretty naive.

breml mentioned this pull request Nov 1, 2021

Add github.com/breml/bidichk linter golangci/golangci-lint#2330

Merged

stevendanna requested a review from dt November 1, 2021 23:45

bdarnell reviewed Nov 2, 2021

View reviewed changes

stevendanna commented Nov 2, 2021

View reviewed changes

stevendanna closed this Dec 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lint: add linter for unicode directional formatting characters #72287

lint: add linter for unicode directional formatting characters #72287

stevendanna commented Nov 1, 2021

cockroach-teamcity commented Nov 1, 2021

stevendanna commented Nov 1, 2021

dt commented Nov 2, 2021

bdarnell left a comment

bdarnell Nov 2, 2021

stevendanna Nov 2, 2021

bdarnell Nov 2, 2021

bdarnell Nov 2, 2021

stevendanna Nov 2, 2021 •

edited

Loading

stevendanna left a comment

stevendanna Nov 2, 2021

stevendanna Nov 2, 2021 •

edited

Loading

lint: add linter for unicode directional formatting characters #72287

lint: add linter for unicode directional formatting characters #72287

Conversation

stevendanna commented Nov 1, 2021

cockroach-teamcity commented Nov 1, 2021

stevendanna commented Nov 1, 2021

dt commented Nov 2, 2021

bdarnell left a comment

Choose a reason for hiding this comment

bdarnell Nov 2, 2021

Choose a reason for hiding this comment

stevendanna Nov 2, 2021

Choose a reason for hiding this comment

bdarnell Nov 2, 2021

Choose a reason for hiding this comment

bdarnell Nov 2, 2021

Choose a reason for hiding this comment

stevendanna Nov 2, 2021 • edited Loading

Choose a reason for hiding this comment

stevendanna left a comment

Choose a reason for hiding this comment

stevendanna Nov 2, 2021

Choose a reason for hiding this comment

stevendanna Nov 2, 2021 • edited Loading

Choose a reason for hiding this comment

stevendanna Nov 2, 2021 •

edited

Loading

stevendanna Nov 2, 2021 •

edited

Loading