undocumented usage of perl for cc_test #4691

ahippler · 2018-02-23T15:38:07Z

Description of the problem / feature request:

cc_test uses a inline perl script for failed tests.

Line 153 in eb067ea

    
           perl -CSDA -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/?/g;' "$1" \

Feature requests: what underlying problem are you trying to solve with this feature?

The usage of perl is not documented.
Windows does not have Perl installed by default.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

execute a failing test on windows will print:

==================== Test output for //Test/Unit:Unit (shard 1 of 8):

[...]

external/bazel_tools/tools/test/test-setup.sh: line 152: perl: command not found
================================================================================

What operating system are you running Bazel on?

Windows 10

What's the output of `bazel info release`?

0.10.1

The perl script replaces invalid XML characters and invalid sequence in CDATA.
To get rid of perl bash or python could be used.

The text was updated successfully, but these errors were encountered:

aehlig · 2018-02-26T08:02:54Z

Categorizing as bug, as pulling in an documented run-time dependency can cause breakage for users not otherwise using perl. The main question is if we really need perl (then we should document it as a run-time dependency of bazel) or whether an appropriate set up can be achieved with more standard tools.

aehlig · 2018-02-26T08:03:59Z

It's the actual test rules that are problematic.

ulfjack · 2018-03-07T08:16:28Z

Note that this is the fallback code path, so if you make sure that your test framework generates a test.xml, that should workaround the issue.

Any suggestions on what to do here? We do need to escape the characters when generating the test.xml.

benjaminp · 2018-03-08T07:46:09Z

Perhaps rewrite test-setup.sh in a language like C++ or Python where text processing is easy to do portably? That might also be helpful for solving subtle test-setup.sh issues like #4608. (I can't imagine bash is very fun to run on Windows either.)

Relatedly, I would like the ability to customize the test setup executable. One can usually use --run_under, but then the user loses the ability to usefully specify their own --run_under.

ulfjack · 2018-03-08T09:32:34Z

We discussed that, but it means that we'd unconditionally need a C++ compiler in order to run tests. We could bundle a pre-compiled binary into Bazel, but what platform would it be compiled for? For cross-compilation, you need a binary for the target platform, not the host platform. We could bundle a binary and also ship the source, but then you still need a C++ compiler in order to do cross-compilation.

We could use Python, but then we require that everyone has python installed, even if they 'only' do web or C++ development.

We also discussed moving the fallback codepath into Bazel (i.e., have the test.xml generation in Bazel itself). However, that's also causing problems with remote execution, where we now would force test.xml generation onto the local machine. Except, of course, if we also require that remote execution provides X for some value of X.

It's not impossible to solve these problems, but there's no free lunch.

For now, I'd prefer we keep doing it in shell, but maybe there's a more standard tool than perl that we can use to do the escaping? awk? sed?

benjaminp · 2018-03-13T00:13:54Z

FWIW, Bazel doesn't work at all without a C++ compiler on the system:

$ bazel build
ERROR: in target '//external:cc_toolchain': no such package '@local_config_cc//': Traceback (most recent call last):
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/cc_configure.bzl", line 42
        configure_unix_toolchain(repository_ctx, cpu_value, overriden...)
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/unix_cc_configure.bzl", line 416, in configure_unix_toolchain
        find_cc(repository_ctx, overriden_tools)
    File "/home/benjamin/.cache/bazel/_bazel_root/887904812217cca9bc2b9adb875daf42/external/bazel_tools/tools/cpp/unix_cc_configure.bzl", line 407, in find_cc
        fail(("Cannot find gcc or CC%s, eithe...))
Cannot find gcc or CC, either correct your path or set the CC environment variable

ulfjack · 2018-03-13T07:44:24Z

That's true, and we should fix that independently of this problem. I'd rather not add another dependency on having a C++ compiler available.

laszlocsomor · 2018-03-13T07:52:43Z

That's true, and we should fix that independently of this problem.

+1

I'd rather not add another dependency on having a C++ compiler available.

I'd give another +1 if I weren't out of them already.

ahippler · 2018-03-13T11:41:51Z

We could use Python, but then we require that everyone has python installed, even if they 'only' do web or C++ development.

python is already required as per documentation:
https://docs.bazel.build/versions/master/install-ubuntu.html#1-install-required-packages
https://docs.bazel.build/versions/master/windows.html#software-requirements

laszlocsomor · 2018-03-13T14:34:58Z

@ahippler : Like with C++, that's true and we should fix that independently of this problem. In fact it'd be best if Bazel would only require toolchains that it needs for the build, i.e. if you don't build Python rules then Bazel shouldn't require a Python installation.

Only exception (if you could call it an exception) should be Java: the JDK is always available because we bundle one with Bazel, because Bazel itself needs one in order to run.

laszlocsomor · 2018-05-02T13:04:52Z

Thanks to @aehlig we now know what this command is supposed to do:

It means, read the string as sequence of uni-code characters, and replace every non-empty sequence of characters in the given ranges by a question mark symbol. (That's the first line. The ranges come from the esoteric definition of which unicode characters are and are not allowed in XML documents.)
The negation (^) makes sense. The allowed unicode characters are:
x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
And every non-empty sequence of other charachters is replaced by a question mark symbol.
The reason why perl was used was that the whitelist only applies after you interpreted the sequence of octets as a sequnce of (utf-8 encoded) unicode characters, and perl could do that on the fly (while the more traditional unix tools work on sequnce of octets).

ulfjack · 2018-05-14T13:54:21Z

(Note that any solution for test-setup needs to work with remote execution - this makes it very difficult to use pre-compiled binaries, because you don't know which platform the test action will actually run on.)

laszlocsomor · 2018-05-14T14:02:37Z

@ulfjack : True. Bazel could include a precompiled embedded binary for the host platform, and the remote execution platform author would have to provide one. Bazel would select the active one with toolchain rules. WDYT?

laszlocsomor · 2018-05-14T14:04:01Z

Let's not forget that the tests currently depend on Bash, so a remote execution worker would have to have Bash installed anyway.

ulfjack · 2018-05-14T14:04:29Z

That would make it much more difficult to change test-setup since all such changes would have to be rolled out to all remote execution systems.

laszlocsomor · 2018-05-14T14:04:59Z

Let's not forget that the tests currently depend on Bash, so a remote execution worker would have to have Bash installed anyway.

...meaning that requiring a test-setup binary would not make things worse. And we could provide a reference implementation in a GitHub repo.

ulfjack · 2018-05-14T14:05:10Z

I think one option would be to fork test-setup, and have different implementations for Linux/Mac and Windows.

laszlocsomor · 2018-05-14T14:06:16Z

Forking test-setup would also introduce the synchronization difficulty you alluded to.

laszlocsomor · 2018-05-15T08:52:54Z

(Note that any solution for test-setup needs to work with remote execution - this makes it very difficult to use pre-compiled binaries, because you don't know which platform the test action will actually run on.)

The options I see:

Use precompiled binary. Faster than a script and requires no runtime, but Bazel must either include the binary for all local and remote execution platforms, or the remote execution platform must provide its version of the binary and Bazel needs an appropriate toolchain rule. Need to maintain and keep in sync multiple source files.
Use scripts. Requires a runtime (Bash / Python / Powershell) but more portable than a precompiled binary. Need to maintain and keep in sync multiple script files.
Use one script. This is what we have now. Requires the same runtime (Bash / Python) on every platform (goes against Windows: break the dependency on Bash #4319 ). Need to maintain just one script.

The second option seems to be the best tradeoff. WDYT?

ulfjack · 2018-05-15T16:10:41Z

I agree.

Note that I've been working on splitting test-setup into two separate steps - right now it runs the test and then generates a test.xml file if there isn't one. The way it's done is triggering a code path on Linux and MacOS that has an inherent race condition. @agoulti proposed that we split up the two parts - run the test first and then run a separate action to generate the test.xml file if the test didn't generate one. That'll fix the race condition and potentially make the test-setup script a bit simpler.

ulfjack · 2018-05-15T16:12:15Z

There's some background in #4608.

benjaminp · 2018-05-15T16:24:41Z

The options I see:
...

There's also the option I've raised before of rewriting test-setup.sh in C++, requiring a target C++ toolchain to run tests, and compiling test-setup on the fly for the target platform. For simple use cases, a precompiled binary test-setup could be included with Bazel in the embedded tools. This has the advantage of letting test-setup use native APIs (managing processes properly in bash is ironically difficult) and not duplicating too much work across platforms. I understand the desire to avoid requiring a C++ toolchain, but it seems like the need for that ends up creeping in for any non-trivial project anyway.

laszlocsomor · 2018-07-25T15:44:44Z

@nlopezgi , another question: do you know how Bazel cancels a remotely running test action in case the user presses Ctrl+C? Does Bazel dispatch this to the remote service or does it just close the connection?

nlopezgi · 2018-07-25T16:00:14Z

re: compiler: I did not state I did not want to require it for tests, just that I was not sure its the right choice and don't want to be the one to make it w/o at least conferring with some other folks (I'll get back to you once I've confirmed). imo, if some tool is to be required for all tests, i'd rather it is the c++ compiler (instead of perl or python), but not sure what the trade-offs (effort/maintenance) between c++ vs a clever sed program would be.

re: canceling a test: not sure, you'd want to ask @ola-rozenfeld about api details

laszlocsomor · 2018-07-26T09:51:57Z

I was too optimistic with the "clever sed program". The task is to UTF-8-decode an octet-stream, test decoded characters if they fall into any of some disjoint ranges and replace them with "?", then UTF-8-encode the result again.

I'm not aware of an efficient way to do this without encoding and decoding.

ulfjack · 2018-07-26T10:09:46Z

You don't need to decode and re-encode - you can simply test on the utf-8 representation using seds regexp support.

laszlocsomor · 2018-07-26T10:18:28Z

How? Does sed support matching UTF-8 strings?

ulfjack · 2018-07-26T10:26:33Z

sed supports matching binary, and you know the utf-8 encoding, so you can check for specific utf-8 ranges, like so:

echo "ä ö ü" | LANG=C sed -e "s/[\xc0-\xdf][\x00-\xff]/?/g"

Here, I'm replacing all two-byte utf-8 sequences with a single '?' character. See https://en.wikipedia.org/wiki/UTF-8 for the multi-byte ranges.

ulfjack · 2018-07-26T11:47:55Z

After a lot of trial and error, I've come up with a sed script that - I think - does what we want:

cat test.log | LANG=C sed -E \
  -e 's/.*/& /g' \
  -e 's/(([\x9\xa\xd\x20-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xec][\x80-\xbf][\x80-\xbf]|[\xed][\x80-\x9f][\x80-\xbf]|[\xee-\xef][\x80-\xbf][\x80-\xbf]|[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf])*)./\1?/g' \
  -e 's/(.*)\?/\1/g'

ulfjack · 2018-07-26T11:50:06Z

First, add a single white space character (' ') to the end of each line. Second, replace all (possibly empty) sequences of legal characters followed by a character with the sequence of legal characters followed by a question mark character ('?'). Third, remove the trailing question mark character ('?') from each line.

laszlocsomor · 2018-07-26T12:28:41Z

Hats off, that's quite impressive.

I think you made a couple mistakes:

you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF), that is "e0 80 80 .. ed 9f bf", so that part of the regex should be [\xe0-\xed][\x80-\x9f][\x80-\xbf], whereas yours is split up into [\xe0-\xec][\x80-\xbf][\x80-\xbf] and [\xed][\x80-\x9f][\x80-\xbf]
the final allowed UTF-16 range is U+10000 .. U+10FFFF, that is "f0 80 80 80 .. f4 8f bf bf", so the regex should be [\xf0-\xf4][\x80-\x8f][\x80-\xbf][\x80-\xbf], but yours is [\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]

Could you compare your results with mine?

laszlocsomor · 2018-07-26T12:31:01Z

Sorry, I got confused, gimme a minute to correct this.

laszlocsomor · 2018-07-26T12:33:46Z

you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF)

This is wrong. I meant to say: you need to cover the U+80..U+7FF (two UTF-8 octets) and U+800..U+D7FF (three UTF-8 octets) ranges. The 2-octet domain is covered correctly ([\xc0-\xdf][\x80-\xbf]), but I think the 3-octet-matching regex is wrong in your solution.

ulfjack · 2018-07-26T12:42:53Z

Ok, how about this:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)

ulfjack · 2018-07-26T12:45:18Z

Wait, there's still one range missing. Gah!

laszlocsomor · 2018-07-26T12:47:00Z

Ah I see where I was wrong, you are right to match 0800-CFFF and match D000-D7FF separately.

ulfjack · 2018-07-26T13:00:56Z

Another try:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xee][\x80-\xbf][\x80-\xbf]                 <--- (E000-EFFF)
[\xef][\x80-\xbe][\x80-\xbf]                 <--- (F000-FFEF)
[\xef][\xbf][\x80-\xbd]                      <--- (FFF0-FFFD)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)

ulfjack · 2018-07-26T13:19:49Z

Ok, I wrote a small Java program to double-check the pattern, and it returned the expected ranges:

9-a,d-d,20-d7ff,e000-fffd,10000-10ffff

laszlocsomor · 2018-07-26T13:20:12Z

I'm glad we are free now.

ulfjack · 2018-07-26T13:41:31Z

Ok, I have a patch which conflicts with my other changes to test-setup.sh. Both are a bit risky, and we need to pick one to be merged first.

laszlocsomor · 2018-07-26T13:42:30Z

If you have changes lined up, merge those first.
This bug has been open for a long time, it's OK to wait another day.

ulfjack · 2018-08-03T08:54:12Z

Patch is here: https://bazel-review.googlesource.com/c/bazel/+/68711

There are still a couple possible issues with this that we need to look into.

My primary concern is what should happen if the default charset of the current machine is NOT UTF-8. The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

The new code intentionally does not perform any charset conversion.

Ideally, we'd do a charset conversion from the default charset to UTF-8 before running the sed script. Note that I override the LOCALE before running sed, which probably sets the default charset to ISO-8859-1 (this needs to be double-checked!).

find a way to do charset conversion from the default charset to utf-8; we might be able to get sed to do that for us, but we need to prevent a charset conversion on output
make sure to set the default charset temporarily to ISO-8859-1 for the sed invocation
make sure that all copies of this code are updated

ulfjack · 2018-08-03T08:54:53Z

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

laszlocsomor · 2018-08-03T09:04:07Z

Thanks!

The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

Please elaborate. How and where exactly are the conversions done?

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

Do you intend to finish this task yourself or to appoint it to someone else (and if so, whom)?

ulfjack · 2018-08-03T09:58:50Z

If it's still open when I'm back, I'll do it. If someone comes in and finishes my changes, I'm happy, too.

Perl does implicit charset conversion on every read and write from a file or stdin/stdout. I believe it converts from the platform default charset to UTF-8 internally. That's my reading of the docs, anyway.

ulfjack · 2018-11-26T14:59:54Z

I have a pending patch.

Profpatsch · 2018-12-19T13:35:18Z

Would be awesome to see that line go. It’s the last remaining use of perl in bazel from what I can see (I already removed the other use in #5999).

Go @ulfjack! 🍰 🏃‍♂️

Fixes bazelbuild#4691. PiperOrigin-RevId: 230308181

aehlig added type: bug P1 I'll work on this now. (Assignee required) category: misc > testing labels Feb 26, 2018

aehlig added category: rules > misc native and removed category: misc > testing labels Feb 26, 2018

dslomov added the platform: windows label Feb 27, 2018

laszlocsomor self-assigned this Mar 9, 2018

benjaminp mentioned this issue May 2, 2018

Bazel 0.13.0 adds a strict requirement to have a c compiler locally (even for RE) #5133

Closed

ulfjack assigned ulfjack and unassigned laszlocsomor Nov 26, 2018

Profpatsch mentioned this issue Dec 19, 2018

Add perl to nix-shell tweag/rules_haskell#507

Merged

bazel-io closed this as completed in 258f2b4 Jan 22, 2019

weixiao-huang pushed a commit to weixiao-huang/bazel that referenced this issue Jan 31, 2019

Avoid Perl dependency in test-setup.sh

92fffc1

Fixes bazelbuild#4691. PiperOrigin-RevId: 230308181

undocumented usage of perl for cc_test #4691

undocumented usage of perl for cc_test #4691

Comments

ahippler commented Feb 23, 2018 • edited by laszlocsomor Loading

Description of the problem / feature request:

Feature requests: what underlying problem are you trying to solve with this feature?

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

What operating system are you running Bazel on?

What's the output of bazel info release?

aehlig commented Feb 26, 2018

aehlig commented Feb 26, 2018

ulfjack commented Mar 7, 2018

benjaminp commented Mar 8, 2018

ulfjack commented Mar 8, 2018

benjaminp commented Mar 13, 2018

ulfjack commented Mar 13, 2018

laszlocsomor commented Mar 13, 2018

ahippler commented Mar 13, 2018

laszlocsomor commented Mar 13, 2018

laszlocsomor commented May 2, 2018

ulfjack commented May 14, 2018

laszlocsomor commented May 14, 2018

laszlocsomor commented May 14, 2018

ulfjack commented May 14, 2018

laszlocsomor commented May 14, 2018

ulfjack commented May 14, 2018

laszlocsomor commented May 14, 2018

laszlocsomor commented May 15, 2018

ulfjack commented May 15, 2018

ulfjack commented May 15, 2018

benjaminp commented May 15, 2018

laszlocsomor commented Jul 25, 2018

nlopezgi commented Jul 25, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Jul 26, 2018

ulfjack commented Jul 26, 2018 • edited Loading

ulfjack commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Jul 26, 2018 • edited Loading

ulfjack commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Jul 26, 2018

ulfjack commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Jul 26, 2018

laszlocsomor commented Jul 26, 2018

ulfjack commented Aug 3, 2018

ulfjack commented Aug 3, 2018

laszlocsomor commented Aug 3, 2018

ulfjack commented Aug 3, 2018

ulfjack commented Nov 26, 2018

Profpatsch commented Dec 19, 2018 • edited Loading

ahippler commented Feb 23, 2018 •

edited by laszlocsomor

Loading

What's the output of `bazel info release`?

ulfjack commented Jul 26, 2018 •

edited

Loading

ulfjack commented Jul 26, 2018 •

edited

Loading

Profpatsch commented Dec 19, 2018 •

edited

Loading