Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks that call safe_mkdir(results_dir, clean=True) silently overwrite symlink with real directory #4137

Closed
mateor opened this issue Dec 13, 2016 · 13 comments · Fixed by #4139
Assignees
Labels

Comments

@mateor
Copy link
Member

mateor commented Dec 13, 2016

This will later cause the Exceptions when the cache_manager attempts to unlink the vt.results_dir, something like:

 Exception message: [Errno 1] Operation not permitted: '/Users/yic/workspace/source/.pants.d/resolve/node/252d64521cf9/<target.id>/current'.

This situation is somewhat rare - it can only happen when an invalid VT unexpectedly has contents inside the results_dir, perhaps from a failed resolve or an error during task execution.

Additionally, not only is the symlink silently replaced, but the actual clean is never executed - the clean will remove the symlink, but the unique current_results_dir is not dereferenced. So not only is the results_dir symlink bugged but the bad or failed state is preserved within the canonical path.

This is a limitation of safe_mkdir, in that it has no concept of the unique properties of symlinks. My instinct is that the the clean operation should be removed since the clean does not work on symlinks like the results_dir, which is the most likely directory for a task to interact with.

For now, a workaround within the cache_manager is to simply make sure that the current_results_dir that the symlink points to is always empty when passed to tasks. This assumes that the clean=True calls are always trying to clear old state, as opposed to happening mid-way through execution. This assumption is true for the call sites I am aware of, both inside our repo and pants itself.

This causes the clean operation to noop and therefore preserve the symlink. It also catches the additional failure state of the failed clean, since we have now forced it within the cache manager.

The intermediate workaround is going out soon, this issue merely tracks it.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

Repros of the above issues, and further discussion, can be seen: #4051

@mateor mateor added the bug label Dec 13, 2016
@mateor mateor self-assigned this Dec 13, 2016
@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

Okay, the scope of this is perhaps even wider than initially claimed. With the upcoming changes, safe_mkdir(results_dir, clean=True), will now be caught in the error case that we were aware of, a VT that is invalid for subsequent runs (due to a failed resolve or what have you).

But in most cases, the task execution will presumably succeed and the vt will be marked valid and the results_dir path never created an additional time.

In the above case, this bug appears more severe. Since the results_dir has been overwritten from a symlink to a concrete directory, the task will be adding its output to the now concrete vt.results_dir. Unfortunately, the local_artifact_cache and artifact_cache operate using the unique 'current_results_dir`, which is would no longer get the benefit of the task output.

This suggests to me that tasks calling safe_mkdir(results_dir, clean=True) may actually be breaking their ability to cache their work.

I will see if I can can get some insight through additional test coverage.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

It appears that the above guess is correct. My initial read was that the clean was a noop if the symlink pointed to an empty directory, but it actually removes the directory even when empty.

The fallout of this is that calling safe_mkdir(vt.results_dir, clean=True) silently breaks task caching, completely. This is because the consumption of task output has been split. Products consume from the symlinked vt.results_dir location, while the artifact caches consume from the unique results_dir.

When the symlink is clobbered, tasks will still complete successfully, since downstream tasks consume Products from the vt.results_dir. This is important to note - engineers may think there is a surprisingly invalidated target but otherwise this error state is silent.

The issue is that now that the vt.results_dir symlink has been replaced, tasks files are only being output to 'current' and never land in the canonical "unique" results_dir. That miss is in turn propagated to the local and remote artifact caches, which consume from the unique results_dir. The caches will still create tarballs and happily report cache hits - but the tarballs will only hold empty dirs.

The repro is dead simple. Add safe_mkdir(vt.results_dir, clean=True) to the top of your chosen cache-enabled task and take a look at the output under .pants.d and the local_artifact_cache.

So, the proposed fix of cleaning the results_dir for invalid tasks is largely incomplete. It removes much of the motivation for task developers to call clean=True, but since that is not going to be obvious from their end, the existing and future calls will still be totally broken.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

I am adding the following safety measures as a mitigation effort:

  1. cache_manager will clean and recreate results_dirs for invalid targets*
  2. cache_manager will refuse to update a vt to valid if the results_dir is no longer a symlink
  3. relative_symlink will provide a first-class error message upon attempting to overwrite an existing directory
  • useful for task correctness and, if known, removes the motivation to clean arbitrary results_dirs. But largely orthogonal to the issue at hand.

Hopefully those will mitigate the discovered issues. I personally believe that Pants will continue to experience pain from this issue while we are quietly passing a symlink instead of an actual dir.

My instinct is that it would be better to pass tasks the canonical link and coalesce all consumption to one canonical location for both products and caches. That would leave the symlinks as internal implementation details and we could more comfortably rely on the state.

If anyone has a motivating case for why passing the consumer a symlink is preferred, I would love the context.

@stuhood
Copy link
Member

stuhood commented Dec 13, 2016

If anyone has a motivating case for why passing the consumer a symlink is preferred, I would love the context.

The idea behind the stable symlink is to export a stable classpath location for things like ./pants export, and to give synthetic codegen targets stable target ids. Before the stable symlink, target ids as seen in IntelliJ would contain the hash, and would change after every compile.

@stuhood
Copy link
Member

stuhood commented Dec 13, 2016

relative_symlink will provide a first-class error message upon attempting to overwrite an existing directory

@mateor : I'm worried that this (your 3) is orthogonal, and likely to break other things. Items 1 and 2 seem like pure bugfixes. Thanks!

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

I knew why the stable location is useful, but I am still not sure why the tasks should prefer to interact with the stable one.

Changing simple_codegen to use the vt's unique results_dir solves almost all of that example with one word.

However, while it is not quite obvious to me why Products should use a different file pointer than the caches, I am okay with punting that larger issue until we see evidence that this is broken after my patch.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

#3 already raises an error today - without the Exception catch, it is raised by the system instead of Pants as detailed in the original issue:

    Exception message: [Errno 1] Operation not permitted: '/Users/yic/workspace/source/.pants.d/resolve/node/252d64521cf9/<target.id>/current'

ETA: The exception occurs because the relative_symlink method tries to unlink the link_path if it exists, without checking to be sure that it is a symlink beforehand.

If it silently succeeded, I may have never noticed the issue :)

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

Also, that last one will be as an individual followup, so it won't gate the other fixes 👍

@stuhood
Copy link
Member

stuhood commented Dec 13, 2016

#3 already raises an error today - without the Exception catch, it is raised by the system instead of Pants as detailed in the original issue:

Ok. So the idea is that we will force any tasks that currently think they need to clear the directory (and are doing it incorrectly) to stop, and get a clear error message. Fine with me.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

Yep, just some sanity checks around the issue. We had two instances of this live in our repo, and Pants had at least the one. So I am hoping to raise an Exception for any other outside callers.

@gmalmquist
Copy link
Contributor

Thanks! We've been seeing that pop up occasionally at Square for ages and I'd been unable/without enough time to track it down.

@mateor
Copy link
Member Author

mateor commented Dec 13, 2016

Us too :)

Once this lands, Stu is going to backport it and release 1.2.1, I will have a RB tonight.

mateor added a commit to mateor/pants that referenced this issue Dec 14, 2016
…sks.

Tasks that sometimes fail due to outside factors (download failures,
resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True)
in order to wipe possibly truncated or crufty state.

But since vt.results_dir is a symlink, that replaced it with a real dir.
That ended up breaking caching, since the task output was therefore
never making it into the artifact cache, and other weird
bugs.

This is a small change that deletes the existing directories if
a target is invalid, removing the need for tasks to wipe the
results_dir, and also now checks to make sure that the results_dir
is legal before marking valid and passing to the artifact_caches.

The majority of the change is added test coverage around the breaks.
I have a couple immediate followups that cover an additional failure
state, and reworks the cache_manager to remove some of the harder
to reason about bits.

Closes: pantsbuild#4137
mateor added a commit to foursquare/fsqio that referenced this issue Dec 17, 2017
* Remove clean from pants task setups.

Tasks that might leave partial results in a workdir were historically
left to the task to clean. But that became bugged when the workdirs
were moved to a symlink, since the clean algorithm didn't have symlink
awareness.

The consequence of this was that calling clean=True would silently overwrite
the symlink with a real dir and break the cache symantics.

For resolves, the fresh workspaces was still needed. I changed upstram
Pants to clean workdirs by default, so any failure now cleans.

pantsbuild/pants#4051
pantsbuild/pants#4137

* Fix linter and missed variables in the refactor

(sapling split of c8a374715e5d7eb31eac507c70a3fe9db476af01)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants