Fix encoding of non-ascii contents written to parameter files. #18972

zhengwei143 · 2023-07-18T15:15:32Z

When args are written to parameter files, non-ASCII values are wrongly encoded a second time as UTF-8. This does not appear to be related to Bazel's JDK 20 upgrade; it has always happened.

Repro:

$ cat encoding/BUILD
load("defs.bzl", "cat")

cat(
    name = "test_cat",
    out = "cat.txt",
    content = "привет",
)

sh_binary(
    name = "cat_bin",
    srcs = ["test_cat.sh"],
)

$ cat encoding/defs.bzl
def _test_cat_impl(ctx):
  args = ctx.actions.args()
  args.use_param_file("%s", use_always = True)
  args.add(ctx.attr.content)
  ctx.actions.run(
    inputs = [],
    outputs = [ctx.outputs.out],
    arguments = [args, ctx.outputs.out.path],
    executable = ctx.executable.cat_bin,
  )

cat = rule(
  implementation = _test_cat_impl,
  attrs = {
    "out": attr.output(mandatory = True),
    "content": attr.string(mandatory = True),
    "cat_bin": attr.label(
      executable = True,
      cfg = "exec",
      allow_files = True,
      default = Label("//encoding:cat_bin"),
    ),
  })

$ cat encoding/test_cat.sh
#!/bin/sh
cat "$1" >> "$2";

$ bazel build //encoding:test_cat
$ cat bazel-bin/encoding/cat.txt-0.params
'Ð¿Ñ�Ð¸Ð²ÐµÑ�'
$ cat bazel-bin/encoding/cat.txt
'Ð¿Ñ�Ð¸Ð²ÐµÑ�'
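The garbled output is what double encoding would be expected to produce. A minimal standalone sketch of the effect follows, assuming (this is not stated in the PR) that the string reaches the writer as a Latin-1 "bytestring" holding one char per UTF-8 byte. The class and method names are hypothetical and are not part of Bazel.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Bazel code.
final class DoubleEncodingDemo {
  /** Reinterprets the UTF-8 bytes of s as a Latin-1 string, one char per byte. */
  public static String toBytestring(String s) {
    return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
  }

  /** Encodes the bytestring as UTF-8 a second time (the suspected bug). */
  public static byte[] doubleEncode(String bytestring) {
    return bytestring.getBytes(StandardCharsets.UTF_8);
  }

  /** Writes one byte per char, which restores the original UTF-8 bytes. */
  public static byte[] writeRaw(String bytestring) {
    return bytestring.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    // The Cyrillic word from the repro, written with escapes to keep the source ASCII-only.
    String original = "\u043f\u0440\u0438\u0432\u0435\u0442";
    String bytestring = toBytestring(original);

    System.out.println(original.getBytes(StandardCharsets.UTF_8).length); // 12
    System.out.println(doubleEncode(bytestring).length); // 24
    System.out.println(
        new String(writeRaw(bytestring), StandardCharsets.UTF_8).equals(original)); // true
  }
}
```

Each of the 12 UTF-8 bytes is 0x80 or above, so re-encoding turns every one of them into a two-byte sequence, which a UTF-8 terminal then renders as the mojibake shown above.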

zhengwei143 · 2023-07-19T12:59:01Z

Potential fix for #18792

tetromino

This seems logically correct but inefficient: writeContentUtf8 is a private method which has exactly 1 call site, so we can certainly avoid double-recoding.

I would suggest reverting the change to writeContent(), and instead changing writeContentUtf8() to the following:

      ...
      if (stringUnsafe.getCoder(line) == StringUnsafe.LATIN1 && isAscii(bytes)) {
        outputStream.write(bytes);
      } else if (!StringUtil.decodeBytestringUtf8(line).equals(line)) {
        // We successfully decoded line from utf8 - meaning it was already encoded as utf8.
        // We do not want to double-encode.
        outputStream.write(bytes);
      } else {
        ByteBuffer encodedBytes = encoder.encode(CharBuffer.wrap(line));
        ...
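For readers without the Bazel sources at hand, the idea of "write the bytes as-is when the line already holds UTF-8, otherwise encode" can be sketched in isolation. This is only an illustration under stated assumptions: it does not use StringUnsafe or StringUtil.decodeBytestringUtf8, it detects an already-encoded line with a strict JDK CharsetDecoder instead, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch, not the code merged in this PR.
final class ParamFileEncodingSketch {
  /** Returns true if every char fits in one byte and those bytes form valid UTF-8. */
  public static boolean isUtf8Bytestring(String line) {
    for (int i = 0; i < line.length(); i++) {
      if (line.charAt(i) > 0xFF) {
        return false; // A real Unicode string, not a Latin-1 bytestring.
      }
    }
    byte[] bytes = line.getBytes(StandardCharsets.ISO_8859_1);
    try {
      // REPORT makes the decoder throw on malformed input instead of substituting U+FFFD.
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /** Returns the bytes to write for one line, avoiding a second UTF-8 encoding pass. */
  public static byte[] encodeLine(String line) {
    if (isUtf8Bytestring(line)) {
      // Already UTF-8 (or plain ASCII): emit one byte per char.
      return line.getBytes(StandardCharsets.ISO_8859_1);
    }
    return line.getBytes(StandardCharsets.UTF_8);
  }
}
```

With this sketch, a bytestring form of the repro string and the equivalent real Unicode string both come out as the same 12 UTF-8 bytes. A known limitation of any such heuristic is that a genuine Latin-1 string whose bytes happen to form valid UTF-8 would be treated as already encoded.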

zhengwei143 force-pushed the fix-encoding branch from 55ac165 to deef072 Compare July 18, 2023 15:40

Fix encoding of non-ascii contents written to parameter files.

0762612

zhengwei143 force-pushed the fix-encoding branch from deef072 to 0762612 Compare July 19, 2023 11:37

zhengwei143 added 2 commits July 19, 2023 13:47

Try testing on windows.

58d97e2

Remove testing on windows, the newly added test fails on windows as well.

a46c532

zhengwei143 requested a review from tetromino July 19, 2023 12:36

zhengwei143 marked this pull request as ready for review July 19, 2023 12:36

github-actions bot added the awaiting-review and team-Performance labels Jul 19, 2023

coeuvre approved these changes Jul 19, 2023

Merge branch 'master' into fix-encoding

8a92773

tetromino requested changes Jul 21, 2023

Avoid double recoding when dealing with an already utf8 encoded string.

a8d1c7e

tetromino approved these changes Jul 21, 2023

copybara-service bot closed this in dc80fa7 Jul 21, 2023

github-actions bot removed the awaiting-review label Jul 21, 2023
