Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mountinfo: simpify, speedup, and fix unescape #144

Merged
merged 2 commits into from
Sep 9, 2024

Conversation

kolyshkin
Copy link
Collaborator

This is an alternative to #143.

It has been pointed out that unescape implementation is unnecessarily
complex, and I tend to agree. Plus, we never expect any errors during
unescaping, and if there are any, we can always return the string as is.

So, drop the error return, and simplify the algorithm. Modify the test
case and the benchmark accordingly.

NOTE this also fixes the issue of trying to convert escape sequences
where the character code is >= utf8.RuneSelf (128 decimal, 0200 octal).
Adding those to the output may result in invalid utf-8 strings. The new
code doesn't try to unescape those (and the tests are amended to check
that).

Add some more test cases.

Here's benchstat output:

name               old time/op    new time/op    delta
Unescape-20           851ns ± 1%     358ns ± 3%  -57.94%  (p=0.008 n=5+5)

name               old alloc/op   new alloc/op   delta
Unescape-20            640B ± 0%      255B ± 0%  -60.16%  (p=0.000 n=5+4)

name               old allocs/op  new allocs/op  delta
Unescape-20            31.0 ± 0%      21.0 ± 0%  -32.26%  (p=0.008 n=5+5)

@kolyshkin
Copy link
Collaborator Author

kolyshkin commented Jul 26, 2024

Basically, the differences between this and #143 are:

In my defense, this time I really tried to make this as simple as possible.

@kolyshkin kolyshkin changed the title Unescape speedup mountinfo: simpify, speedup, and fix unescape Jul 26, 2024
Comment on lines 737 to 742
// NOTE Go interprets \NNN inside double quotes exactly
// the same way as the function being tested, so:
// - for input strings, we must either escape all backslash
// characters ("\\") or use backticks `\`;
// - for output strings, we can write something like "\040",
// which is identical to " ".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: perhaps put this above testCases as this doesn't apply to this test-case specifically

{"\\\\", "\\\\"},
{`\177`, "\177"}, // Max allowed ASCII value.
{`\222`, `\222`}, // Too large value.
{`Это\040камон\040какой-то`, "Это камон какой-то"}, // Some UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh; of course I was curious what this was; funny how translate is inconsistent on c or k for kamon. And now I'm curious what "kamon" means 😂

Screenshot 2024-07-29 at 10 29 34

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆 There's no such word, this is just a Russian approximation of how to say "come on".

The whole phrase is from an old MC Vspishkin & Nikiforovna song, in which he said something like "это какой-то чек вис аут! это комон какой-то!" in which "чек вис аут" and "комон" are approximation of English pronunciation of "check this out" and "come on". Overall it means nothing.

Here's a fragment with a link to the complete song: https://youtube.com/clip/UgkxjT1WPmAYApCJBD2N21iATuwL5aryYm7-?si=zwYUH4f21ugLPTPU

PS I just realized I wrote it as камон while he says it with a distinct "о" (комон). I'm taking it very seriously, so let me fix this.

Comment on lines +750 to +753
{`/path/"with'quotes,\040space,\011tab`, `/path/"with'quotes, space, tab`},
{`\12`, `\12`}, // Not enough digits.
{`\134`, `\`}, // Backslash.
{`"'"'"'`, `"'"'"'`},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering; as we're updating all of these, should we also use keys here? I know it would be a lot more verbose, but it's generally "best practice" (I think?).

We could even consider adding a doc field to describe test-cases (instead of adding a comment), but I realise most don't have a description if they have a specific purpose 🤔

Happy to hear what you think though; if it's too verbose?

	// NOTE Go interprets \NNN inside double quotes exactly
	// the same way as the function being tested, so:
	//  - for input strings, we must either escape all backslash
	//    characters ("\\") or use backticks `\`;
	//  - for output strings, we can write something like "\040",
	//    which is identical to " ".
	testCases := []struct {
		doc, input, output string
	}{
		{
			input:  "",
			output: "",
		},
		{
			input:  "/",
			output: "/",
		},
		{
			input:  "/some/longer/path",
			output: "/some/longer/path",
		},
		{
			input:  `/path\040with\040spaces`,
			output: "/path\040with\040spaces",
		},
		{
			input:  "/path/with\\134backslash",
			output: "/path/with\\backslash",
		},
		{
			input:  "/tab\\011in/path",
			output: "/tab\011in/path",
		},
		{
			input:  `/path/"with'quotes`,
			output: `/path/"with'quotes`,
		},
		{
			input:  `/path/"with'quotes,\040space,\011tab`,
			output: `/path/"with'quotes, space,	tab`,
		},
		{
			doc:    "Not enough digits",
			input:  `\12`,
			output: `\12`,
		},
		{
			doc:    "Backslash",
			input:  `\134`,
			output: `\`,
		},
		{
			input:  `"'"'"'`,
			output: `"'"'"'`,
		},
		{
			doc:    "Backslash with extra digit",
			input:  `/\1345`,
			output: `/\5`,
		},
		{
			input:  `/\12x`,
			output: `/\12x`,
		},
		{
			doc:    "Not enough digits",
			input:  `\0`,
			output: `\0`,
		},
		{
			doc:    "NUL (min allowed ASCII value)",
			input:  `\000\000`,
			output: "\000\000",
		},
		{
			input:  `\x`,
			output: `\x`,
		},
		{
			input:  "\\\\",
			output: "\\\\",
		},
		{
			doc:    "Max allowed ASCII value",
			input:  `\177`,
			output: "\177",
		},
		{
			doc:    "Too large value",
			input:  `\222`,
			output: `\222`,
		},
		{
			doc:    "Some UTF-8",
			input:  `Это\040камон\040какой-то`,
			output: "Это камон какой-то",
		},
	}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, with two fields only I think this will be overly verbose and taking too much vertical space.

Comment on lines +773 to +774
func BenchmarkUnescape(b *testing.B) {
testCases := []string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW; I copied these test cases from the other test. We could possibly consider making those test-cases a package-variable, so that we can use the same ones here as part of the benchmark (if we don't want to repeat the cases).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but:

  1. When we change that variable, we change benchmark's input, making it somewhat harder to compare before/after.
  2. Ultimately a test case and a benchmark should have different inputs. For test cases, we look for various ways to break things. For benchmarks, we look for either most typical or most complex (time consuming) cases, or both.

@kolyshkin
Copy link
Collaborator Author

Updated; PTAL @thaJeztah

thaJeztah and others added 2 commits August 13, 2024 23:05
    go test -v -test.benchmem -count=10 -run ^$ -bench BenchmarkUnescape .
    go: downloading golang.org/x/sys v0.1.0
    goos: linux
    goarch: arm64
    pkg: github.com/moby/sys/mountinfo
    BenchmarkUnescape
    BenchmarkUnescape-10     1000000      1068 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10      992292      1082 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1050 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1038 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1034 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1045 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1071 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1033 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1032 ns/op     640 B/op      31 allocs/op
    BenchmarkUnescape-10     1000000      1070 ns/op     640 B/op      31 allocs/op
    PASS
    ok  github.com/moby/sys/mountinfo 10.653s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It has been pointed out that unescape implementation is unnecessarily
complex, and I tend to agree. Plus, we never expect any errors during
unescaping, and if there are any, we can always return the string as is.

So, drop the error return, and simplify the algorithm. Modify the test
case and the benchmark accordingly.

NOTE this also fixes the issue of trying to convert escape sequences
where the character code is >= utf8.RuneSelf (128 decimal, 0200 octal).
Adding those to the output may result in invalid utf-8 strings. The new
code doesn't try to unescape those (and the tests are amended to check
that).

Finally, add some more test cases, and improve the documentation.

Here's benchstat output:
name               old time/op    new time/op    delta
Unescape-20           851ns ± 1%     358ns ± 3%  -57.94%  (p=0.008 n=5+5)

name               old alloc/op   new alloc/op   delta
Unescape-20            640B ± 0%      255B ± 0%  -60.16%  (p=0.000 n=5+4)

name               old allocs/op  new allocs/op  delta
Unescape-20            31.0 ± 0%      21.0 ± 0%  -32.26%  (p=0.008 n=5+5)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Collaborator Author

Rebased; check that golangci-linter v1.60.1 (from #147) produces no warnings.

@kolyshkin kolyshkin requested a review from thaJeztah September 6, 2024 21:19
@kolyshkin
Copy link
Collaborator Author

PTAL @thaJeztah, I think I've addressed all your comments.

Copy link
Contributor

@austinvazquez austinvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice results.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@thaJeztah thaJeztah merged commit a9393fd into moby:main Sep 9, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants