*_binary args get splitted #12313

dmivankov · 2020-10-20T12:14:54Z

Description of the problem / feature request:

Arguments set by args attribute of *_binary get splitted

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

BUILD:

cc_binary(
  name = "main",
  srcs = [":main.c"],
)
cc_binary(
  name = "main2",
  srcs = [":main.c"],
  args = ["a", "b c"], # "b c" gets split
)
cc_binary(
  name = "main3",
  srcs = [":main.c"],
  args = ["a", "\"b c\""], # excessive quoting is needed
)
cc_binary(
  name = "main4",
  srcs = [":main.c"],
  args = ["a \"b c\""], # we could say singleton lists are preferred and are splitted
)

main.c:

#include <stdio.h>
int main(int argc, char** argv) {
	for (int i = 1; i < argc; ++i) {
		printf("%d: %s\n", i, argv[i]);
	}
	return 0;
}

Testing:

$ bazel run :main -- a "b c"
1: a
2: b c
$ bazel run :main2
1: a
2: b
3: c
$ bazel run :main3
1: a
2: b c
$ bazel run :main4
1: a
2: b c

What operating system are you running Bazel on?

Linux/NixOS

What's the output of `bazel info release`?

release 3.3.1- (@non-git)

If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.

Comes from NixOS packages
nix-env -iA nixpkgs.bazel

Any other information, logs, or outputs that you want to share?

Luckily

cc_binary(
  name = "main5",
  srcs = [":main.c"],
  args = ["a \"b c\" >/dev/null"],
)

treats ">/dev/null" as string rather than redirection. Could've been an injection vulnerability otherwise

Also tested with recent git version 6bf44ba

The text was updated successfully, but these errors were encountered:

dmivankov · 2020-10-20T16:59:43Z

extra info

$ bazel build :main
$ bazel run --shell_executable=$(pwd)/bazel-bin/main :main2 -- "d e"
1: -c
2: SOME_PATH/execroot/__main__/bazel-out/k8-fastbuild/bin/main2 a b c 'd e'

also

BUILD file args=[" 'x' "] -> argv = ["x"]
bazel run -- " 'x' " -> argv = [" 'x' "]

dmivankov · 2020-10-20T17:12:33Z

Looks like the issue is
https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java#L178
calling
https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/RunfilesSupport.java#L429
which has ".tokenized()" call inside

tetromino · 2020-10-22T16:11:49Z

Not sure who owns RunfilesSupport, but this looks like a real logic bug.

jwnimmer-tri · 2020-10-22T16:27:38Z

To my reading, https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args specifically says that args undergoes Bourne shell tokenization, which explaining the splitting. I don't understand what the bug is?

dmivankov · 2020-10-23T06:51:53Z

May be just a documentation issue then
https://docs.bazel.build/versions/master/be/common-definitions.html#sh-tokenization

Certain string attributes of some rules are split into multiple words according to the tokenization rules of the Bourne shell

while
https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args

args | List of strings;

Intuitively, when you supply array of strings, you don''t expect split and combine the lists behavior, also sh-tokenization is said to apply to string arguments, not to lists of string.

ulfjack · 2020-10-23T07:22:13Z

If I recall correctly, this has come up a few times before. I agree that the behavior should be changed to avoid shell tokenization; however, this is a difficult transition. It's not enough to have a flag because there's no way to write args in a way that is compatible with and without shell tokenization.

dmivankov · 2020-10-23T11:30:05Z

One way to migrate is

add a flag that fails the build if tokenization changes the args (splits or removes spaces), should be doable as tokenization is done by bazel
add a builtin function tokenize(): string(or string..?) -> [string], need to carefully check how it interacts with variable expansion though
then compatible way would be

args = ["--foo"] + tokenize(something) + ["abc"]
# or
args = ["--foo"] + tokenize(something1, something2) + ["abc"]

which would work regardless of flag given that there are no splits in free args

dmivankov · 2020-10-23T21:41:41Z

We also probably need builtin quote(string_with_spaces_or_quotes): string -> string
So canonical arguments would be

args = ["--foo", "abc", quote(" x "), quote("'"), quote("\'x\'")] + tokenize("a \"b b\"") + tokenize("c d", "e f")
# meaning
argv = ["--foo", "abc", " x ", "'", "\'x\'", "a", "b b", "c", "d", "e", "f"]

and rejected arguments would be

args = ["a b", " x ", "\"x\"", "\'y\'", "\"x y\"", "a \"b\" c" ] + tokenize("a \" b")

Note that currently unquoting seems to do funny and unpredictable things

args = ["a", " \"b c\" ", " \"\"b\" c\" ", "'a ' ", "'d'", "\"e\"", "\"\"x\"\"", "\"'y'\"", "''\"''\"\"\"z\"\"\"''\"''", "''zz''", "'''xx'''"]

results in

1: a
2: b c
3: b c
4: a 
5: d
6: e
7: x
8: 'y'
9: ''z''
10: zz
11: xx

aiuto · 2020-10-26T04:33:19Z

I'll do some thinking about this. To me, a list of args should be based 1:1 as the arg list of the called process. The whole concept of shell tokanization is wrong.
OTOH, I would guess there are a lot of shell genrules that might break if we change this, and windows does not quite always parse well without quotes, so this is not a simple bug fix.

My hope is that we can devise a mechanism so that binaries can indicate that they parse args correctly, and thus no tokenization is needed. What you list in args is what you get.

ulfjack · 2020-10-26T11:47:53Z

@dmivankov, if I understand correctly, you're suggesting to add a flag that changes how Bazel parses args and simultaneously changes how quote and tokenize behave internally?

You didn't say that second part explicitly. If we don't change how quote and tokenize behave internally, then this would cause double-tokenization unless we can find a way to signal inline that a string should not be tokenized inside Bazel. However, the downside of this approach is that it would be confusing that tokenize doesn't actually tokenize unless the flag is set.

I think the most straightforward way to do this is to add a tag tokenize-args (alternatively, use a full attribute). This would support a migration (add tokenize-args on all rules, flip the flag to not tokenize by default, migrate all rules to explicitly tokenize using a separate method, and remove the tag).

dmivankov · 2020-10-26T13:17:25Z

tokenize-args or even just argv=[..] that is never tokenized are also valid options.

With tokenize&quote we'd either need them to produce a tagged string/list of strings that indicates if it is wrapped(& into which of them) or not, or indeed make them behave differently depending on the flag. Technically we could also

make quote add '' when needed regardless of flag, so quote("a")=="a", quote("a b") == "'a b'"
make tokenize produce quote-ed elements
do double tokenization, but second run shouldn't be changing anything
flag only controls whether not using quote/tokenize is forbidden for args with spaces/special chars
and flag somehow only applies to user-provided strings, result of quote("a b") is "'a b'" doesn't fail the build but users can't specify such strings manually

trybka · 2022-01-14T03:35:32Z

As another data point, this is also present in the defines attribute of cc_* rules and friends.

It was added ostensibly for copts compatibility, but defines errors if you have >1 tokens, which makes this quite a bit less useful anyhow.

defines might be a whole lot easier to clean-up lending themselves to having a per-package feature or something, IDK?

gregestren · 2022-01-31T23:20:31Z

@aiuto how should we categorize this issue?

aiuto · 2022-02-01T00:42:22Z

I don't know.
This is one of those things that does not fit any of the teams. It is execution semantics, but neither local nor remote.
So, either local-exec or xproduct, where some day we will re-triage it.

Executable Starlark rules can use the new `arguments` parameter on `RunEnvironmentInfo` to specify the arguments that Bazel should pass on the command line with `test` or `run`. If set to a non-`None` value, this parameter overrides the value of the `args` attribute that is implicitly defined for all rules. This allows Starlark rules to implement their own version of this attribute which isn't bound to its proprietary processing (data label expansion and tokenization). Along the way, this commit adds test coverage and documentation for the interplay between `RunEnvironmentInfo`'s `environment` and `--test_env`. The value of the `arguments` field of `RunEnvironmentInfo` is intentionally not exposed to Starlark yet: It is not clear how these arguments should be represented and whether rules relying on the magic `args` attribute should also provide this field. Fixes bazelbuild#16076 Work towards bazelbuild#12313

Executable Starlark rules can use the new `arguments` parameter on `RunEnvironmentInfo` to specify the arguments that Bazel should pass on the command line with `test` or `run`. If set to a non-`None` value, this parameter overrides the value of the `args` attribute that is implicitly defined for all rules. This allows Starlark rules to implement their own version of this attribute which isn't bound to its proprietary processing (data label expansion and tokenization). Along the way, this commit adds test coverage and documentation for the interplay between `RunEnvironmentInfo`'s `environment` and `--test_env`. The value of the `arguments` field of `RunEnvironmentInfo` is intentionally not exposed to Starlark yet: It is not clear how these arguments should be represented and whether rules relying on the magic `args` attribute should also provide this field. RELNOTES: Executable starlark rules can use the `arguments` parameter of `RunEnvironmentInfo` to specify their command-line arguments with `bazel run` and `bazel test`. Fixes bazelbuild#16076 Work towards bazelbuild#12313

comius · 2022-11-17T20:44:42Z

I think the default implementation should be the obvious one. Parameters in the list should be passed to the binary without any changes.

Doing this might break others who had to workaround this. So anybody fixing the implementation should also put it behind an incompatible flag.

aiuto · 2022-11-23T03:13:53Z

+1. But the obvious thing is that args should be passed as is. Going back to the examples, main2 should print

1: a
2: b c

fmeum · 2022-11-23T07:07:06Z

What do you think should be the behavior of $(execpaths ...) with this flag? It would probably be more useful if that would still result in multiple arguments.

aiuto · 2022-11-23T15:43:35Z

I think all args in a list should be passed as single strings and never retokenized by blaze. So, if there are paths in execpaths that have, for example, spaces in them, they should be expanded in a way that preserves the spaces on invocation of the command line.

fmeum · 2022-11-23T15:56:38Z

@aiuto I agree, but what about $(execpaths ...) resolving to multiple file paths (note the plural form)? In that case you would certainly want the result to be 'file1' 'file2' 'file3' on the command line rather than 'file1 file2 file3', but the $(execpaths ...) would still appear in a single args list entry.

For unrelated reasons, I have thought about rewriting the LocationExpander logic to return a CustomCommandLine instead of strings. That could solve this issue.

aiuto · 2022-11-23T16:49:50Z

TBH. I'm not wrapping my head around that use case right now. That is, I don't have a mental model of when to use execpaths. I'm not sure I ever have used it. So, .... I don't know?

phst · 2024-01-15T13:27:39Z

The primary use case appears to be complex scripts in genrules, e.g.

bazel/scripts/packages/debian/BUILD

Lines 141 to 151 in df42879

    
               cmd = """ 
        
                   mkdir -p bazel/debian 
        
                   tar -xf $(location //:bazel-srcs) -C ./bazel 
        
                   for f in $(locations :debian-files); do 
        
                     cp $$f ./bazel/debian/ 
        
                   done 
        
                   cp $(location :changelog) ./bazel/debian 
        
                   dpkg-source -b ./bazel 
        
                   cp ./bazel_*.dsc $(location bazel.dsc) 
        
                   cp ./bazel_*.tar.gz $(location bazel.tar.gz) 
        
                   """,

or https://github.com/protocolbuffers/protobuf/blob/7b42f1c08b534f9e43629284e22acce6b13e7136/ruby/BUILD.bazel#L155-L174.

Ideally these would be converted to proper binaries with command-line arguments and custom actions using e.g. Args.add_all(ctx.attr.debian_files, format_each="--debian-file=%s"). That would resolve quoting/tokenization issues and obviate the need for shell quoting. So maybe we could consider any use of execpaths etc. a code smell?

fmeum · 2024-06-07T21:04:39Z

Based on the discussion so far and ignoring defines for the moment, I could see this being a reasonable path forward:

Faithfully implement

bazel/src/main/java/com/google/devtools/build/lib/shell/ShellUtils.java

Line 90 in 8ed7196

public static void tokenize(List<String> options, String optionString)

in Starlark and make it available in bazel-skylib. Since it can't perform location expansion, it would treat $(FOO) and $(location //foo:bar) as if they were tokens that don't contain any spaces.
Add a new --incompatible_no_args_tokenization that, if flipped, makes it so that all args values are treated literally with no postprocessing, with the exception of those targets that are tagged with tokenize-args.
Provide a migration script (buildozer command) that tags all targets with tokenize-args that have args values that could possibly be subject to tokenization.
Flip the flag, offering tokenize-args as a catch-all temporary and the tokenize function in Skylib as a permanent but not 100% compatible alternative.

@comius @lberki What do you think?

phst · 2024-10-02T12:17:38Z

Maybe in addition to the tokenize-args tag there should be a no-tokenize-args tag so that people can opt out eagerly of tokenization independent of the flag.

Or maybe tokenize-args should be a feature so that users can say repo(default_features = ['-tokenize_args'] in their REPO.bazel?

kjw-enf · 2024-10-23T20:34:09Z

I recently wrote a starlark rule that treats args without $(location) expansion, without escaping necessary (i.e. $$), and without environment variable expansion (at least not in starlark) and I couldn't be happier.

context: pytest which requires a "main" wrapper calling pytest.main()

def _wrapper_pytest_impl(ctx):
    # store pytest args in an expanded template of wrapper_pytest.py
    output = ctx.actions.declare_file(ctx.attr.name)
    _flags = '\"' + '\",\n  \"'.join(ctx.attr.args) + '\"'  # quoted
    _srcs = '\"' + '\",\n  \"'.join([x.path for x in ctx.files.srcs]) + '\"'  # quoted
    ctx.actions.write(
        output = output,
        content = """# -*- coding: utf-8 -*-
# standard libraries
import os
import re
import sys
# third party libraries
import pytest
flags = [
  """ + _flags + """
]
srcs = [
  """ + _srcs + """
] 
# n.b. single dollar signs in $ENV expansion only; no shell escaping required
# TODO(kjw): this regex sucks
expanded_flags = [re.sub(r'\\$(\\S+)', lambda mo: os.environ.get(mo.group(1), ''), x) for x in flags]
if __name__ == "__main__":
    sys.exit(pytest.main(expanded_flags + srcs + sys.argv[1:]))
""",
        is_executable = True,
    )
    return [DefaultInfo(files = depset([output]))]

wrapper_pytest = rule(
    implementation = _wrapper_pytest_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "args": attr.string_list(),
        "output": attr.output(),
    },
)

Caveat: I'm a starlark noob, so I'm sure that there are better ways (i.e. with an Args object) to do this.

Yes, this means that some our_py_test(args=) usage looks funny, but it's correct if you think of python subprocess.Popen([cmd]) usage. i.e.:

our_py_test(args=["-k=a and b"])

Add in the complication that not all cli parsers support (single char flags with equals (i.e. -k=). Preferentially one should switch to the --flag= format which typically does. otherwise you're left with ["-k", "a and b"] which is slightly harder to read, and people not used to strict interpretations of args will be confused:

"doesn't that need quotes?"
"why is it two args and not one?"
"why is it one arg and not two?"

aiuto added type: bug untriaged category: misc > misc team-Local-Exec Issues and PRs for the Execution (Local) team and removed category: misc > misc labels Oct 20, 2020

tetromino added team-Starlark team-Configurability platforms, toolchains, cquery, select(), config transitions and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Oct 22, 2020

gregestren assigned aiuto Oct 23, 2020

aiuto added P2 We'll consider working on this in future. (Assignee optional) and removed team-Starlark untriaged labels Oct 26, 2020

aiuto removed their assignment Jan 22, 2021

aiuto mentioned this issue Feb 21, 2021

sh_binary() / sh_test() do not quote arguments #13072

Closed

gregestren added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed P2 We'll consider working on this in future. (Assignee optional) labels Dec 3, 2021

aiuto removed P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions labels Jan 14, 2022

sventiffe added untriaged team-Local-Exec Issues and PRs for the Execution (Local) team labels Mar 16, 2022

meisterT added team-XProduct and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Apr 7, 2022

fmeum mentioned this issue Sep 11, 2022

Add arguments parameter to testing.TestEnvironment #16076

Open

fmeum mentioned this issue Oct 8, 2022

Add arguments parameter to RunEnvironmentInfo #16430

Open

brandjon added team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 4, 2022

comius added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Nov 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*_binary args get splitted #12313

*_binary args get splitted #12313

dmivankov commented Oct 20, 2020 •

edited

Loading

dmivankov commented Oct 20, 2020 •

edited

Loading

dmivankov commented Oct 20, 2020

tetromino commented Oct 22, 2020

jwnimmer-tri commented Oct 22, 2020

dmivankov commented Oct 23, 2020

ulfjack commented Oct 23, 2020

dmivankov commented Oct 23, 2020

dmivankov commented Oct 23, 2020

aiuto commented Oct 26, 2020

ulfjack commented Oct 26, 2020

dmivankov commented Oct 26, 2020

trybka commented Jan 14, 2022

gregestren commented Jan 31, 2022

aiuto commented Feb 1, 2022

comius commented Nov 17, 2022

aiuto commented Nov 23, 2022

fmeum commented Nov 23, 2022

aiuto commented Nov 23, 2022

fmeum commented Nov 23, 2022

aiuto commented Nov 23, 2022

phst commented Jan 15, 2024

fmeum commented Jun 7, 2024

phst commented Oct 2, 2024

kjw-enf commented Oct 23, 2024

*_binary args get splitted #12313

*_binary args get splitted #12313

Comments

dmivankov commented Oct 20, 2020 • edited Loading

Description of the problem / feature request:

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

What operating system are you running Bazel on?

What's the output of bazel info release?

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Any other information, logs, or outputs that you want to share?

dmivankov commented Oct 20, 2020 • edited Loading

dmivankov commented Oct 20, 2020

tetromino commented Oct 22, 2020

jwnimmer-tri commented Oct 22, 2020

dmivankov commented Oct 23, 2020

ulfjack commented Oct 23, 2020

dmivankov commented Oct 23, 2020

dmivankov commented Oct 23, 2020

aiuto commented Oct 26, 2020

ulfjack commented Oct 26, 2020

dmivankov commented Oct 26, 2020

trybka commented Jan 14, 2022

gregestren commented Jan 31, 2022

aiuto commented Feb 1, 2022

comius commented Nov 17, 2022

aiuto commented Nov 23, 2022

fmeum commented Nov 23, 2022

aiuto commented Nov 23, 2022

fmeum commented Nov 23, 2022

aiuto commented Nov 23, 2022

phst commented Jan 15, 2024

fmeum commented Jun 7, 2024

phst commented Oct 2, 2024

kjw-enf commented Oct 23, 2024

dmivankov commented Oct 20, 2020 •

edited

Loading

What's the output of `bazel info release`?

If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.

dmivankov commented Oct 20, 2020 •

edited

Loading