Testing zig solutions is slow on Exercism #63

Open
4 of 7 tasks
ee7 opened this issue Aug 24, 2023 · 8 comments

@ee7
Member

ee7 commented Aug 24, 2023

82b56af is a significant speedup for a `RUN zig test foo` command added at the bottom of the Dockerfile, but it doesn't seem to produce a speedup in production.

Some possible short-term ways to improve:

  1. Ensure that Zig can fully use the cache. Maybe some combination of:
    • Set USER (done in b8b9d02)
    • Create the cache later (done in 3f2c0c2)
    • Create cache via bin/run.sh, not zig test directly (done in 0636420)
    • Specify -target (done in b58066f)
    • Change some paths of solution files
    • Look closer at how Exercism runs the container
    • Set zig cache environment variables, or pass zig cache location options to zig test (see the sketch at the end of this comment)
  2. Put the cache on a tmpfs.
  3. Update Zig. Maybe commits like ziglang/zig@020105d could help.

Longer term:

  1. Use a bigger Zig cache.
  2. Wait for https://www.github.com/ziglang/zig/issues/16270.
  3. Use the x86_64 backend (currently marked as experimental).
  4. Use wasm, like https://playground.zigtools.org/ and https://github.com/zigtools/playground. Exercism doesn't support this for now.
  5. Wait for perf improvements in https://www.github.com/ziglang/zig/projects/6
  6. Consider trying the C backend (no longer marked as experimental), and compiling with a fast C compiler. Note that the issue for generating tcc-compatible code (ziglang/zig#13576) is currently marked for Zig 1.1.0. Aside: the Nim test runner currently compiles Nim to C, and then uses tcc.
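
For concreteness, the -target and cache-location knobs from item 1 above look roughly like this as plain invocations. The flag and environment variable names are the ones I believe Zig 0.11.0 supports (worth double-checking against `zig test --help`); the cache paths are placeholders, not what the image currently uses:

    # Pin the target and CPU so cached artifacts don't depend on the build machine:
    zig test -target x86_64-linux-musl -mcpu baseline test_acronym.zig

    # Point Zig at explicit cache locations, either via flags...
    zig test --cache-dir /tmp/zig-local-cache --global-cache-dir /home/ziggy/.cache/zig test_acronym.zig

    # ...or via environment variables:
    ZIG_LOCAL_CACHE_DIR=/tmp/zig-local-cache ZIG_GLOBAL_CACHE_DIR=/home/ziggy/.cache/zig \
        zig test test_acronym.zig
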
@ee7
Member Author

ee7 commented Aug 24, 2023

@ErikSchierboom is this and this the current and complete source of truth for how Exercism runs a container in production?

@ErikSchierboom
Member

Yes, that is correct.

ee7 added a commit that referenced this issue Aug 27, 2023
Better conform with Dockerfile best practices [1]:

    If a service can run without privileges, use `USER` to change to a
    non-root user.

See also the Docker docs on `USER` [2], and the Alpine Linux wiki page
on creating a new user [3].

The commit [4] that added a zig cache significantly speeds up a
`RUN zig test foo` command that's added at the bottom of the Dockerfile.
But it doesn't seem to be a speedup in production.

I suspect that the zig cache isn't found in production, due to
differences in how containers are run (see Exercism's terraform and
tooling-invoker repos [5][6]).

I suspect that this commit may be insufficient to fix the problem, but
it's a step in the right direction. Possible follow-up work includes:

- Setting zig cache environment variables, or passing zig cache
  location options to `zig test`

- Changing the paths of solution files.

- Updating Zig. I saw a recent commit [7] that touches Cache.zig.

Closes: #60
Closes: #62
Refs: #63

[1] https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
[2] https://docs.docker.com/engine/reference/builder/#user
[3] https://wiki.alpinelinux.org/wiki/Setting_up_a_new_user
[4] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[5] https://github.com/exercism/terraform/blob/3ff7b3b5a2f6/terraform/tooling_invoker/ami.sh#L68-L88
[6] https://github.com/exercism/tooling-invoker/blob/88a9d826a746/lib/tooling_invoker/job_processor/exec_docker.rb#L150-L162
[7] ziglang/zig@020105d0dde6
    2023-08-20, "Cache: Fix findPrefix when paths are slightly out of the ordinary"
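
(For reference, the `USER`-related part of that commit has roughly this shape in the Dockerfile. This is a sketch assuming Alpine's busybox `adduser`, not the exact diff; see b8b9d02 for the real change.)

    # Create an unprivileged user (-D: no password) and switch to it, per the
    # Dockerfile best-practices guide linked above. Details here are illustrative.
    RUN adduser -D ziggy
    USER ziggy
    WORKDIR /home/ziggy
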
ee7 added a commit to ee7/exercism-zig-test-runner that referenced this issue Aug 27, 2023
This uses a worse pattern [1], but it might help Zig use the cache.

Follow up from recent commits [2][3].

Refs: exercism#63

[1] https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#use-multi-stage-builds
[2] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[3] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
ee7 added a commit that referenced this issue Aug 27, 2023
Two recent commits [1][2] tried to speed up running user solutions in
production, but weren't sufficient.

Make the `ziggy` user, rather than the user in the earlier stage, create
the cache. This is a worse Dockerfile pattern [3], but it might help Zig
to use the cache.

A zig cache directory like `~/.cache/zig` contains an `h` subdirectory,
with `.txt` files like:

    0
    18137 11111111 1692020106000000000 0d5def4e330aca188b4480e29786a6c0 1 c.zig
    18137 11111111 1692020106000000000 0d5def4e330aca188b4480e29786a6c0 1 c.zig
    9163 22222222 1692020106000000000 514b9d62f95db9ea86d2cd5008c5d946 1 std/std.zig
    9163 22222222 1692020106000000000 514b9d62f95db9ea86d2cd5008c5d946 1 std/std.zig
    22609 33333333 1692020106000000000 0538267a6cdadf6d7dadf7693fce1bed 1 std/start.zig
    [...]

It looks like the format is [4]:

    manifest header (0), then rows of:

    {size} {inode} {mtime_ns}, {crypto.auth.siphash.SipHash128 hash} {prefix} {sub_path}

where the prefixes are commented as:

    /// A set of strings such as the zig library directory or project source root, which
    /// are stripped from the file paths before putting into the cache. They
    /// are replaced with single-character indicators. This is not to save
    /// space but to eliminate absolute file paths. This improves portability
    /// and usefulness of the cache for advanced use cases.

but maybe the cache didn't allow what we were doing before this
commit: creating it with one user at some location, then moving it to
another location to be used by a different user.

If this commit still isn't sufficient, we could try updating Zig.
Exercism currently uses the Zig 0.11.0 release, which doesn't include a
recent commit for the zig cache system [5].

Refs: #63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#use-multi-stage-builds
[4] https://github.com/ziglang/zig/blob/0.11.0/lib/std/Build/Cache.zig#L840-L851
[5] ziglang/zig@020105d0dde6
    2023-08-20, "Cache: Fix findPrefix when paths are slightly out of the ordinary"
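
(In Dockerfile terms, the idea of that commit is roughly the sketch below: the runtime user creates the cache in its final location, instead of inheriting one built by a different user in an earlier stage. The solution path is a placeholder; the real change is 3f2c0c2.)

    USER ziggy
    # The cache ends up under /home/ziggy/.cache/zig, owned by ziggy, at the path
    # it will have at runtime. (The solution path here is a placeholder.)
    RUN zig test /tmp/warmup/test_acronym.zig
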
ee7 added a commit to ee7/exercism-zig-test-runner that referenced this issue Aug 27, 2023
Three recent commits [1][2][3] tried to speed up running user solutions
in production, but weren't sufficient. Continue trying, by running
`zig test` indirectly.

Refs: exercism#63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] 3f2c0c2, 2023-08-27, "Dockerfile: create zig cache later"
ee7 added a commit to ee7/exercism-zig-test-runner that referenced this issue Aug 27, 2023
Three recent commits [1][2][3] tried to speed up running user solutions
in production, but weren't sufficient. Continue trying.

With Exercism's test runner architecture, the x86_64 CPU that creates
the cache is likely to have different features than the x86_64 CPU that
runs `zig test` in production. If that caused a cache miss, this commit
might help.

Refs: exercism#63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] 3f2c0c2, 2023-08-27, "Dockerfile: create zig cache later"
ee7 added a commit that referenced this issue Aug 27, 2023
Three recent commits [1][2][3] tried to speed up running user solutions
in production, but weren't sufficient. Continue trying, by running
`zig test` indirectly.

Extra advantage: we have a single source of truth for `zig test`
options, which helps if we want to change them later.

Disadvantage: `docker build` no longer fails if this line produces a
failing test. That's why I didn't do it this way originally.

Refs: #63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] 3f2c0c2, 2023-08-27, "Dockerfile: create zig cache later"
ee7 added a commit to ee7/exercism-zig-test-runner that referenced this issue Aug 27, 2023
Three recent commits [1][2][3] tried to speed up running user solutions
in production, but weren't sufficient. Continue trying.

With Exercism's test runner architecture, the x86_64 CPU that creates
the cache is likely to have different features than the x86_64 CPU that
runs `zig test` in production. If that caused a cache miss, this commit
might help.

Refs: exercism#63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] 3f2c0c2, 2023-08-27, "Dockerfile: create zig cache later"
ee7 added a commit that referenced this issue Aug 27, 2023
Four recent commits [1][2][3][4] tried to speed up running user
solutions in production, but weren't sufficient. Continue trying.

With Exercism's test runner architecture, the x86_64 CPU that creates
the cache is likely to have different features than the x86_64 CPU that
runs `zig test` in production. If that meant that Zig couldn't use the
cache, this commit might help.

This commit produces the below diff to a `builtin.zig` file in a
`/home/ziggy/.cache/zig/o/` subdirectory in the image. Note the reduced
set of CPU features.

    @@ -12,66 +12,22 @@ pub const single_threaded = false;
     pub const abi = std.Target.Abi.musl;
     pub const cpu: std.Target.Cpu = .{
         .arch = .x86_64,
    -    .model = &std.Target.x86.cpu.icelake_server,
    +    .model = &std.Target.x86.cpu.x86_64,
         .features = std.Target.x86.featureSet(&[_]std.Target.x86.Feature{
             .@"64bit",
    -        .adx,
    -        .aes,
    -        .allow_light_256_bit,
    -        .avx,
    -        .avx2,
    -        .avx512bw,
    -        .avx512cd,
    -        .avx512dq,
    -        .avx512f,
    -        .avx512vl,
    -        .bmi,
    -        .bmi2,
    -        .clflushopt,
             .cmov,
    -        .crc32,
    -        .cx16,
             .cx8,
    -        .ermsb,
    -        .f16c,
    -        .fast_15bytenop,
    -        .fast_gather,
    -        .fast_scalar_fsqrt,
    -        .fast_shld_rotate,
    -        .fast_variable_crosslane_shuffle,
    -        .fast_variable_perlane_shuffle,
    -        .fast_vector_fsqrt,
    -        .fma,
    -        .fsgsbase,
    -        .fsrm,
             .fxsr,
             .idivq_to_divl,
    -        .invpcid,
    -        .lzcnt,
             .macrofusion,
             .mmx,
    -        .movbe,
             .nopl,
    -        .pclmul,
    -        .popcnt,
    -        .prefer_256_bit,
    -        .prfchw,
    -        .rdrnd,
    -        .rdseed,
    -        .rtm,
    -        .sahf,
    +        .slow_3ops_lea,
    +        .slow_incdec,
             .sse,
             .sse2,
    -        .sse3,
    -        .sse4_1,
    -        .sse4_2,
    -        .ssse3,
             .vzeroupper,
             .x87,
    -        .xsave,
    -        .xsavec,
    -        .xsaveopt,
    -        .xsaves,
         }),
     };
     pub const os = std.Target.Os{
    @@ -79,14 +35,14 @@ pub const os = std.Target.Os{
         .version_range = .{ .linux = .{
             .range = .{
                 .min = .{
    -                .major = 5,
    -                .minor = 15,
    +                .major = 3,
    +                .minor = 16,
                     .patch = 0,
                 },
                 .max = .{
                     .major = 5,
    -                .minor = 15,
    -                .patch = 0,
    +                .minor = 10,
    +                .patch = 81,
                 },
             },

Refs: #63

[1] 82b56af, 2023-08-23, ".dockerignore, Dockerfile: add a zig cache"
[2] b8b9d02, 2023-08-27, "Dockerfile, bin(run-tests), tests: set USER"
[3] 3f2c0c2, 2023-08-27, "Dockerfile: create zig cache later"
[4] 0636420, 2023-08-27, "Dockerfile: create zig cache via run.sh, not zig test"
@ee7
Member Author

ee7 commented Aug 27, 2023

It looks like b58066f helped. See #71 (comment).

Next I'll try to compile more of the zig stdlib and add it to the image's zig cache. (Edit: that doesn't seem to produce a significant speedup).

@ee7
Member Author

ee7 commented Aug 28, 2023

I still don't understand why, if I locally:

  1. Copy a valid solution acronym.zig and its test_acronym.zig to the directory foo in the root dir of this repo
  2. Run bin/run-in-docker.sh acronym foo foo (timed as in the sketch at the end of this comment)

it takes about 3 seconds locally on a slow machine (definitely using the container's zig cache), but it takes Exercism about 7-8 seconds to run the tests for acronym from the online editor. The latter time is roughly how long step 2 takes locally if I comment out the zig cache creation steps in the Dockerfile. So either:

  • Exercism still doesn't use (or fully use) the zig cache
  • or the delay is due to the container startup time. But I thought we were getting 2-3 second test run times in production for interpreted languages in the past.

Any ideas?
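
(Roughly what I'm timing, as a copy-pasteable sketch; the source location of the solution files is a placeholder:)

    # Step 1: put a passing solution and its test into ./foo at the repo root.
    mkdir -p foo
    cp /path/to/acronym.zig /path/to/test_acronym.zig foo/   # placeholder source paths

    # Step 2: run the containerized test runner against it and time it.
    time bin/run-in-docker.sh acronym foo foo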

@ErikSchierboom
Member

It must be the container startup time, because https://github.com/exercism/zig-test-runner/actions/runs/5992634417/job/16252228546 indicates a successful deploy, which I can verify by going to https://exercism.org/maintaining/tracks/zig.

@ee7
Member Author

ee7 commented Aug 29, 2023

indicates a successful deploy, which I can verify by going to https://exercism.org/maintaining/tracks/zig.

Yeah, I know about this. But it's possible for bin/run-in-docker.sh acronym foo foo to be fast locally, but slow in production, without it being entirely due to the production container startup time.

For example, it seemed like b58066f really was a consistent 2x speedup over the immediately preceding state. I believe that Zig builds for the native CPU by default (even in debug mode, with the rationale that otherwise it'd give up performance when running debug builds, which is sometimes important), like passing -march=native for C. We have to explicitly specify something like -target x86_64-linux-musl or -mcpu baseline to assume only the minimum available CPU features.

Locally, the CPU at execution time was the same as the CPU at build time, so Zig could use the image's cache. But that wasn't true for the image that was deployed in production, because the GitHub Actions CPU features in general won't match those of the AWS machine.

So I'm trying to remove other possible differences in production that stop Zig from fully using the image's cache. For example, I noticed that the cache manifest contains inodes, but I haven't checked yet whether that means the cache cannot be used if transferred to another filesystem at the same path, or whether there's an extra cost if you do that. But it's designed to avoid absolute paths, at least.

I'll ask some Zig people if I can't find out myself.
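
(A quick way to check would be to poke at a manifest from inside the container, something like the sketch below. `zig env` reports where Zig thinks the lib and global cache directories are; the stdlib path in the `stat` line is a placeholder for wherever it actually lives in the image.)

    # Where does zig think the lib dir and global cache dir are?
    zig env

    # Look at a cache manifest: the columns are size, inode, mtime_ns, hash, prefix, sub_path.
    head -n 5 /home/ziggy/.cache/zig/h/*.txt

    # Compare the inode/mtime/size columns against a file the rows refer to:
    stat -c '%i %Y %s %n' /usr/local/lib/zig/std/std.zig   # placeholder stdlib path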

@ErikSchierboom
Member

Locally, the CPU at execution time was the same as the CPU at build time, so Zig could use the image's cache. But that wasn't true for the image that was deployed in production, because the GitHub Actions CPU features in general won't match those of the AWS machine.

Is there anything I can run in production that will figure out the right architecture?

@ee7
Member Author

ee7 commented Aug 30, 2023

Is there anything I can run in production that will figure out the right architecture?

Yes, but:

  • I don't want to assume anything about the CPU features of the AWS machine. That probably varies a lot.
  • Even if we could specify the exact CPUs we want for our AWS machines, for our use case, I don't think we'd get a meaningful speedup from lowering to non-baseline CPU instructions.
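
(For completeness, the kind of thing one could run in production purely to observe what the CPU reports, without assuming anything in the build:)

    # CPU model and a few interesting feature flags, as reported by the kernel.
    grep 'model name' /proc/cpuinfo | head -n 1
    grep '^flags' /proc/cpuinfo | head -n 1 | tr ' ' '\n' | grep -E 'avx|sse4|bmi2'

With b58066f in place this is informational only, since the image's cache is now built for the baseline x86_64 CPU anyway.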
