Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching between local_repository and git_repository can result in the local_repository directory being removed #6879

Closed
phb opened this issue Dec 10, 2018 · 14 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@phb
Copy link

phb commented Dec 10, 2018

Description of the problem / feature request:

When switching between local_repository and git_repository, if pressing CTRL+C at a bad time when a bazel command is started during the local_repository -> git_repository switch, the entire local repository folder can be removed.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Repository is in state where dependency in WORKSPACE is pointing to a local directory
  2. revert/stash changes in repository resulting in WORKSPACE pointing to a remote resource
  3. Trigger bazel command, realize that it's starting to download resource and press CTRL+C
    (sample output from my latest run

$ bazel test --local_test_jobs 2 ...
INFO: Invocation ID: 774a848a-89aa-43bc-aad1-98a379ad0920
Loading: 0 packages loaded
Fetching @xxx_products; fetching 6s
^C
ERROR: build interrupted
INFO: Elapsed time: 6.677s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)
Fetching @xxx_products; fetching 6s
$

  1. Change WORKSPACE back to point at local directory
  2. Run bazel command and realize it fails, check local git repository folder and it's completely empty (including . folders)

I'm guessing what's happening here is that if the clone is aborted at a bad time, instead of removing the partly checked out folder, it ends up removing the symlink and the target contents that points to the local repository.

I believe this is a regression and it seems it started happening with 0.19, but I'm not able to confirm this.

What operating system are you running Bazel on?

This happens in Mac OS X bazel and I have not been able to try and reproduce it on other platforms.

What's the output of bazel info release?

INFO: Invocation ID: 5e97dba3-24ec-4e37-bb69-e7d9a388db06
Build label: 0.20.0-homebrew
Build target: bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Nov 30 20:40:12 2018 (1543610412)
Build timestamp: 1543610412
Build timestamp as int: 1543610412

Any other information, logs, or outputs that you want to share?

@jin jin added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. untriaged labels Dec 10, 2018
@dslomov dslomov added type: bug P1 I'll work on this now. (Assignee required) and removed untriaged labels Dec 14, 2018
@phb
Copy link
Author

phb commented May 7, 2019

Just verified this still happens with bazel 0.25.0

@aehlig
Copy link
Contributor

aehlig commented May 20, 2019

Added the platform label because of the statement

This happens in Mac OS X bazel and I have not been able to try and reproduce it on other platforms

@phb
Copy link
Author

phb commented May 23, 2019

@aehlig In my case the local repository is not inside the workspace but a sibling directory to the workspace. My local_repository(..) references the directory with an absolute path in case that would make a difference.

@aehlig
Copy link
Contributor

aehlig commented May 23, 2019

@aehlig In my case the local repository is not inside the workspace but a sibling directory to the workspace. My local_repository(..) references the directory with an absolute path in case that would make a difference.

I still did not manage to reproduce the issue. Can you provide more information, or even a simple reproduction script?

I'm currently trying on a Mac Pro with bazel 0.25.2.

@phb
Copy link
Author

phb commented May 23, 2019

I haven't been able to reproduce it outside of my main repository unfortunately. I tried your script and no luck there as well.

It's pretty easy for me to reproduce manually, so I could compile bazel from head and add in some instrumentation. Do you have any ideas of where to start looking?

One potentially additional piece of information is that I also modify ~/.bazelrc whenever I switch between the two 'modes'. The .bazelrc switch is a build define
"build --define=build=local" which triggers changes in a toolchain.

@aehlig
Copy link
Contributor

aehlig commented May 23, 2019

It's pretty easy for me to reproduce manually, so I could compile bazel from head and add in some instrumentation. Do you have any ideas of where to start looking?

git_repository is a SkylarkRepository, so the interrupt happens somewhere in the fetch method there, actually in the line Object retValue = function.call(...); so somewhere up the call stack, the InterruptedException must cause the removal. BTW, does your git_repository use the strip_prefix feature? I'm asking as this causes the clone to happen in a different directory.

Do I understand correctly, that the interrupt is essential for this issue and it does not occur when you allow the fetching to complete? Also, is the local repository already gone before the final local build (step 5 in your original post), or does the final build stop do something to cause the issue?

@phb
Copy link
Author

phb commented May 23, 2019

I believe the interrupt is essential for the issue, I have never seen it without an interrupt.

Here is a redacted log from when I just reproduced it. This was not using bazel head but 0.25.2.

$ bazel run //a/b/c:blah
Loading: 0 packages loaded
    Fetching @xxx; Cloning xxx of Xxx 20s
^C
INFO: An error occurred during the fetch of repository 'xxx`'
INFO: Call stack for the definition of repository 'xxx':
 - /Users/johanbjork/DEV/yyy/WORKSPACE:14:1
ERROR: build interrupted
INFO: Elapsed time: 20.864s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)
MacBook-Pro:yyy johanbjork$ git stash pop
## The git pop changes WORKSPACE from the git_workspace to local_repository
MacBook-Pro:yyy johanbjork$ bazel run //a/b/c:blah
DEBUG: /private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b203c09601/external/bazel_skylib/lib.bzl:30:1: WARNING: lib.bzl is deprecated and will go away in the future, please directly load the bzl file(s) of the module(s) needed as it is more efficient.
INFO: Analyzed target //a/b/c:blah (294 packages loaded, 9664 targets configured).
INFO: Found 1 target...
Target //a/b/c:blah up-to-date:
  bazel-bin/...
INFO: Elapsed time: 1.959s, Critical Path: 0.01s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action

Loaded image ID: sha256:65b11ef19ad4d1ea8f9c8aefd7dc8e63bf957dfa964ee1b7e5f1be53320d5401
Send Ctrl-C to stop servers
^CGracefully stopping... (press Ctrl+C again to force)
Cleaning up...
Stopping cf_singlestate_product_1 ... done
Removing cf_singlestate_product_1 ... done
Removing network cf_singlestate_default

## Here I waited a few minutes as I went to do something else, ie, before running the next build command
MacBook-Pro:yyy johanbjork$ bazel run //a/b/c
ERROR: error loading package '': Encountered error while reading extension file 'workspace_dependencies/xxx_workspace_repository.bzl': no such package '@xxx//workspace_dependencies': No WORKSPACE file found in /private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b203c09601/external/xxx
ERROR: error loading package '': Encountered error while reading extension file 'workspace_dependencies/xxx_workspace_repository.bzl': no such package '@xxx//workspace_dependencies': No WORKSPACE file found in /private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b203c09601/external/xxx
INFO: Elapsed time: 0.216s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)


## The build command itself failed when attempting to read WORKSPACE dependencies from the external repository.
## I don't know whether the repository was removed as part of the final bazel run command, or whether
## it was removed in the background while I wasn't running any command

So trying manually to switch back and forth, I've managed to reproduce it once in 25 minutes, so I can't say I still understand exactly what happens.

We do not use strip_prefix.

@phb
Copy link
Author

phb commented May 23, 2019

Ok, did some more experimentation and I can reproduce it every time by just speeding up the whole thing

git stash; bazel run //a/b/c:d; git stash pop; bazel run //a/b/c:d
  • If I skip the second run, no files deleted
  • If I add a sleep between the stash pop and the run, no files deleted
  • The files starts to be deleted pretty much first thing during the second run, depending on how quickly they get removed the build fails at different stages (as the dependencies go away)

(update)

  • If interrupting after the git clone is complete, the problem does not appear

Watching the external project folder in
/private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b203c09601/external/xxx
during the build, I can see it becoming a real directory during the first run, then stays a real directory until the second bazel command starts where it gets modified into a symlink, then the deletion of files starts.

I don't know anything about bazel internals but is there some deferred action queue or similar that knows that the git clone is incomplete and begins to delete from the (now symlink) during the second command?

@aehlig
Copy link
Contributor

aehlig commented May 23, 2019

I don't know anything about bazel internals but is there some deferred action queue or similar that knows that the git clone is incomplete and begins to delete from the (now symlink) during the second command?

Bazel itself has no such queue used for external repositories; but it might be git itself. A git clone uses several processes and removes the directory to be cloned upon interruption. So it could be bazel not properly waiting for git to terminate, or be a missing waitpid in git, or bazel forcefully killing the git process waiting for its child it that takes too long, leaving an orphaned clean-up process.

@aehlig
Copy link
Contributor

aehlig commented May 24, 2019

Watching the external project folder in
/private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b203c09601/external/xxx
during the build, I can see it becoming a real directory during the first run, then stays a real directory until the second bazel command starts where it gets modified into a symlink, then the deletion of files starts.

Can you check between the two bazel runs which processes exist? In particular, it would be helpful to know if we some git process still exists that bazel did not wait for.

@phb
Copy link
Author

phb commented May 26, 2019

Your intuition is indeed correct, there are a few git processes still running when the second bazel invocation starts.

git clone https://... /private/var/tmp/_bazel_johanbjork/5bbfecb1113aee50602070b
/usr/local/Cellar/git/2.21.0/libexec/git-core/git-remote-https origin https://...
/usr/local/Cellar/git/2.21.0/libexec/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 6090 on
/usr/local/Cellar/git/2.21.0/libexec/git-core/git fetch-pack --stateless-rpc --stdin --lock-pack --thin --

@aehlig
Copy link
Contributor

aehlig commented May 27, 2019

Looking, with that knowledge, again at the git_repository code it's pretty obvious what is going on: we do not call git directly, instead the process we spawn is a bash. And the way we're invoking git from bash, a SIGTERM does not get forwarded.

A simple reproduction case to understand what's going on is the following.

aehlig@aehlig1:/tmp$ bash -c 'git clone https://github.com/bazelbuild/bazel >/dev/null 2>&1' &
[1] 116092
aehlig@aehlig1:/tmp$ ps aux | grep git
aehlig   116092  0.0  0.0  11256  2772 pts/6    S    11:42   0:00 bash -c git clone https://github.com/bazelbuild/bazel >/dev/null 2>&1
aehlig   116093  6.0  0.0  26280  4560 pts/6    Sl   11:42   0:00 git clone https://github.com/bazelbuild/bazel
aehlig   116094 14.2  0.0 172340 12488 pts/6    R    11:42   0:00 /usr/lib/git-core/git-remote-https origin https://github.com/bazelbuild/bazel
aehlig   116111 65.0  0.0  58944 17208 pts/6    R    11:42   0:01 /usr/lib/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 116093 on aehlig1.muc.corp.google.com --check-self-contained-and-connected
aehlig   116130  0.0  0.0  12788   992 pts/6    S+   11:42   0:00 grep --color=auto git
aehlig@aehlig1:/tmp$ kill 116092
[1]+  Terminated              bash -c 'git clone https://github.com/bazelbuild/bazel >/dev/null 2>&1'
aehlig@aehlig1:/tmp$ ps aux | grep git
aehlig   116093 11.1  0.0  26280  4560 pts/6    Sl   11:42   0:01 git clone https://github.com/bazelbuild/bazel
aehlig   116094 27.0  0.0 172340 12488 pts/6    S    11:42   0:04 /usr/lib/git-core/git-remote-https origin https://github.com/bazelbuild/bazel
aehlig   116111 80.1  0.0  83052 68008 pts/6    R    11:42   0:12 /usr/lib/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 116093 on aehlig1.muc.corp.google.com --check-self-contained-and-connected
aehlig   116194  0.0  0.0  12788   988 pts/6    S+   11:42   0:00 grep --color=auto git
aehlig@aehlig1:/tmp$ exit

A related, still pending, change is #8264 which tries to remove the uses of bash in favor of calling git directly.

@philwo, we discussed the possibility of process-wrapping the processes started from repository_context.execute to properly wait for child processes in the same way as we do for actions.

@aehlig
Copy link
Contributor

aehlig commented May 27, 2019

This problem is present on all Unix-like platforms, and not specific to macOS.

@phb
Copy link
Author

phb commented May 27, 2019

Thank you for investigating, looking forward to a fix. Also easy to avoid now that I'm aware what's causing it.

@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020
@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug
Projects
None yet
Development

No branches or pull requests

5 participants