cml <sub-cmd> on cml runner instance fails #920

Closed
dacbd opened this issue Mar 19, 2022 · 16 comments · Fixed by iterative/terraform-provider-iterative#458

@dacbd
Contributor

dacbd commented Mar 19, 2022

I have tried a few iterations of this:

jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: deploy runner
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml runner \
            --single \
            --labels=test \
            --token=${{ secrets.DACBD_PAT }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-highcpu-2
  test:
    needs: [setup]
    runs-on: [self-hosted, test]
    steps:           
      - uses: actions/checkout@v2
      - run: |
          which cml
          cml ci

Fails with an odd message: Error: Cannot find module '/tmp/tmp.njlS8oR0sf/.cml/cml-ehsjbxeja3/_work/cml-pulse/cml-pulse/ci'

and:

...
  test:
    needs: [setup]
    runs-on: [self-hosted, test]
    steps:           
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - run: |
          which cml
          cml ci

Fails with an EEXIST: file already exists, symlink '../lib/node_modules/@dvcorg/cml/bin/cml.js' -> '/usr/bin/cml'


However, using --cml-version=github:iterative/cml.git#master to trigger the npm install -g way of setting up cml on the runner instance works fine.
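
For reference, a minimal sketch of the deploy step with that workaround applied (same flags as the failing run above; the added --cml-version flag is the only change):

cml runner \
  --single \
  --labels=test \
  --token=${{ secrets.DACBD_PAT }} \
  --cloud=gcp \
  --cloud-region=us-west \
  --cloud-type=e2-highcpu-2 \
  --cml-version=github:iterative/cml.git#master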

@dacbd
Contributor Author

dacbd commented Mar 19, 2022

Maybe there is a strange interaction with pkg and this:

cml/bin/cml.js

Lines 35 to 37 in f3406ac

const path = which.sync(`${basename(executable)}-${command}`);
const parameters = process.argv.slice(process.argv.indexOf(command) + 1); // HACK
process.exit(await pseudoexec(path, parameters));

?

It appears that if PKG_EXECPATH is set, then it fails.

@dacbd
Contributor Author

dacbd commented Mar 19, 2022

I think there is some strange process magic happening between:

I'm not sure what is setting that env var for the runner instance; you can see here (the key there has been revoked).

Bad

FROM debian:buster

RUN apt-get update && apt-get install -y curl
RUN mkdir -p /opt/cml/
RUN curl --location --url https://github.com/iterative/cml/releases/download/v0.12.0/cml-linux --output /opt/cml/cml-linux
RUN chmod +x /opt/cml/cml-linux
ENV PKG_EXECPATH=/opt/cml/cml-linux
RUN ln -s /opt/cml/cml-linux /usr/bin/cml

Good

FROM debian:buster

RUN apt-get update && apt-get install -y curl
RUN mkdir -p /opt/cml/
RUN curl --location --url https://github.com/iterative/cml/releases/download/v0.12.0/cml-linux --output /opt/cml/cml-linux
RUN chmod +x /opt/cml/cml-linux
RUN ln -s /opt/cml/cml-linux /usr/bin/cml
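
As a quick way to see the two images diverge (a sketch, assuming the Dockerfiles above are saved as Dockerfile.bad and Dockerfile.good):

docker build --tag cml-repro:bad --file Dockerfile.bad .
docker build --tag cml-repro:good --file Dockerfile.good .
docker run --rm cml-repro:bad cml --help    # fails: PKG_EXECPATH points at the pkg binary itself
docker run --rm cml-repro:good cml --help   # prints the normal help output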

@dacbd
Contributor Author

dacbd commented Mar 19, 2022

Actually, it looks like this might be the culprit in which: https://github.com/npm/node-which/blob/main/which.js#L21

Or not, because of this:

https://github.com/vercel/pkg#snapshot-filesystem

On the other hand, in order to access real file system at run time (pick up a user's external javascript plugin, json configuration or even get a list of user's directory) you should take process.cwd() or path.dirname(process.execPath).

@casperdcl casperdcl added the p0-critical Max priority (ASAP) label Mar 19, 2022
@DavidGOrtega DavidGOrtega self-assigned this Mar 22, 2022
@0x2b3bfa0
Member

All the mentions of node-pseudoexec are, at least, libellous. 😄 Fortunately, this error doesn't seem to have anything to do with the command plugin feature from #703.

When you (1) run cml runner --cloud=gcp, it creates a cloud instance and (2) runs cml runner inside it, which, in turn, (3) executes the GitHub Actions runner in charge of (4) running the workflow that (5) calls cml ci in the example above.

The cml standalone binary (#842 & iterative/terraform-provider-iterative#255) used for (2) is a pkg binary. As such, it sets the PKG_EXECPATH environment variable that propagates to all the child processes up to (5).

I don't understand yet why setting PKG_EXECPATH causes binaries bundled with pkg to fail, but this can be mitigated by unsetting this variable before calling the GitHub runner process; see #802 for a stale draft.
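
A minimal sketch of that mitigation at the shell level (hypothetical: it assumes the Actions runner is started through a wrapper script we control, and that RUNNER_WORKDIR points at the runner installation):

# strip the inherited variable before handing off to the GitHub Actions runner
env -u PKG_EXECPATH "$RUNNER_WORKDIR/run.sh"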

@dacbd
Contributor Author

dacbd commented Mar 23, 2022

libelous

🤣 /edit: that was the one strikethrough I missed; I have not revisited the rabbit hole I dug trying to find the root of the issue.

The cml standalone binary (#842 & iterative/terraform-provider-iterative#255) used for (2) is a pkg binary. As such, it sets the PKG_EXECPATH environment variable that propagates to all the child processes up to (5).

Does it set the env? In my basic docker reproduction running a cml command did not seem to set that ENV? .... Though...

I don't understand yet why setting PKG_EXECPATH causes binaries bundled with pkg to fail, but this can be mitigated by unsetting this variable before calling the GitHub runner process; see #802 for a stale draft.

I suspect you are correct that #802 would fix this.

@0x2b3bfa0
Member

0x2b3bfa0 commented Mar 23, 2022

Does it set the env? In my basic docker reproduction running a cml command did not seem to set that ENV?

Yes, it does: the cml binaries bundled with pkg set that environment variable for themselves and, by extension, for all the child processes; i.e. the GitHub runner and anything you run inside.

test:
  runs-on: self-hosted
  steps:           
  - run: env | grep PKG_EXECPATH

@0x2b3bfa0
Member

It took me a while to figure out why two binaries with the same hash (MD5 & SHA256 for good measure) behaved differently. 🤯 When a hard link solved the issue, it was easy to narrow it down to paths. 🙈

@dacbd
Contributor Author

dacbd commented Mar 23, 2022

I presume now that the pkg format sets and uses this when it "starts up" execution, and with the magic of child_process inheritance it persists.

Does it set the env? In my basic docker reproduction running a cml command did not seem to set that ENV?

Yes, it does:

test:
  needs: [setup]
  runs-on: [self-hosted, test]
  steps:           
    - run: env | grep PKG_EXECPATH

Stand-alone execution of the binary pkg format is fine, but when a subsequent execution happens, as in the case of the exec'd cml runner ..., the "start up" process breaks its "virtual filesystem" because that env is already set?

In my Docker example, if you execute cml and then env, the PKG_EXECPATH is not present.
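
For example (a sketch using the "good" image built from the Dockerfile above):

docker run --rm -it test:good /bin/bash
root@container:/# cml --help               # works
root@container:/# env | grep PKG_EXECPATH  # prints nothing: pkg only sets the variable
                                           # inside its own process, not in the calling shell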

@dacbd
Contributor Author

dacbd commented Mar 23, 2022

Playing around more, it seems it only fails if PKG_EXECPATH is set to the actual executable.

with Dockerfile:

FROM debian:buster

RUN apt-get update && apt-get install -y curl
RUN mkdir -p /opt/cml/
RUN curl --location --url https://github.com/iterative/cml/releases/download/v0.12.0/cml-linux --output /opt/cml/cml-linux
RUN chmod +x /opt/cml/cml-linux
RUN ln -s /opt/cml/cml-linux /usr/bin/cml

This installs cml the same way as TPI does.

docker run --rm -it test:good /bin/bash
root@e3cab72f74e3:/# cml --help
# Good
root@e3cab72f74e3:/# PKG_EXECPATH=/random/garbage cml --help
# Good
root@e3cab72f74e3:/# PKG_EXECPATH=/opt/cml/cml-linux cml --help
# Bad

So if PKG_EXECPATH is set to the exact pkg binary path, cml will fail.

@dacbd
Contributor Author

dacbd commented Mar 24, 2022

While "libelous" I feel validated with suspicion of child_process/spawn weirdness, also well done on the hunt! (vercel/pkg#897 (comment)) I glanced at that chunk but, I missed the value of this in my initial grep of the codebase after I had found the culprit env.

To summarize, I agree: I think the fix is solving #802, for which I have an idea beyond what is currently implemented in that branch; I intend to test it out soon.

@0x2b3bfa0
Member

As per vercel/pkg#897 (comment), I doubt that the current implementation of #802 can unset that environment variable, because it will be overridden by pkg again. 😅

cml/src/utils.js

Lines 10 to 26 in f3406ac

const exec = async (command) => {
  return new Promise((resolve, reject) => {
    require('child_process').exec(
      command,
      { ...process.env },
      (error, stdout, stderr) => {
        if (!process.stdout.isTTY) {
          stdout = stripAnsi(stdout);
          stderr = stripAnsi(stderr);
        }
        if (error) reject(new Error(`${command}\n\t${stdout}\n\t${stderr}`));
        resolve((stdout || stderr).slice(0, -1));
      }
    );
  });
};

cml/src/drivers/github.js

Lines 269 to 279 in f3406ac

await exec(
  `${resolve(
    workdir,
    'config.sh'
  )} --unattended --token "${await this.runnerToken()}" --url "${
    this.repo
  }" --name "${name}" --labels "${labels}" --work "${resolve(
    workdir,
    '_work'
  )}" ${single ? ' --ephemeral' : ''}`
);
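
In shell terms, a hedged sketch of why unsetting the variable from outside the binary is not enough:

env -u PKG_EXECPATH cml runner ...  # cml itself starts fine, but the pkg bootstrap
                                    # re-exports PKG_EXECPATH into its own environment,
                                    # so config.sh / run.sh (and the workflow steps) still inherit it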

@0x2b3bfa0
Member

0x2b3bfa0 commented Mar 24, 2022

It looks like iterative/terraform-provider-iterative#458 might do the trick, but it's just a huge hack to bypass the PKG_EXECPATH condition. 🪓

@dacbd
Contributor Author

dacbd commented Mar 24, 2022

It looks like iterative/terraform-provider-iterative#458 might do the trick, but it's just a huge hack to bypass the PKG_EXECPATH condition. 🪓

👀 🤔

As per vercel/pkg#897 (comment), I doubt that the current implementation of #802 can unset that environment variable, because it will be overridden by pkg again. 😅

I think what I have in mind shouldn't care (if it works); as far as I have seen, there is no issue with repeated invocations, just with the "re-invocation" within the inherited context of the first pkg cml runner call, which is the "parent" process of the GitHub Actions runner and thus of any subsequent use of the cml command.

Finished doing some reading and plan to try it out tomorrow evening 😴

@0x2b3bfa0
Member

🎉 Related to #721 (comment)

@casperdcl
Contributor

Just to confirm - does TPI v0.10.2 fix this?
