-
-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tasks that call safe_mkdir(results_dir, clean=True) silently overwrite symlink with real directory #4137
Comments
Repros of the above issues, and further discussion, can be seen: #4051 |
Okay, the scope of this is perhaps even wider than initially claimed. With the upcoming changes, But in most cases, the task execution will presumably succeed and the vt will be marked valid and the results_dir path never created an additional time. In the above case, this bug appears more severe. Since the results_dir has been overwritten from a symlink to a concrete directory, the task will be adding its output to the now concrete This suggests to me that tasks calling safe_mkdir(results_dir, clean=True) may actually be breaking their ability to cache their work. I will see if I can can get some insight through additional test coverage. |
It appears that the above guess is correct. My initial read was that the clean was a noop if the symlink pointed to an empty directory, but it actually removes the directory even when empty. The fallout of this is that calling When the symlink is clobbered, tasks will still complete successfully, since downstream tasks consume Products from the vt.results_dir. This is important to note - engineers may think there is a surprisingly invalidated target but otherwise this error state is silent. The issue is that now that the vt.results_dir symlink has been replaced, tasks files are only being output to 'current' and never land in the canonical "unique" results_dir. That miss is in turn propagated to the local and remote artifact caches, which consume from the unique results_dir. The caches will still create tarballs and happily report cache hits - but the tarballs will only hold empty dirs. The repro is dead simple. Add So, the proposed fix of cleaning the results_dir for invalid tasks is largely incomplete. It removes much of the motivation for task developers to call clean=True, but since that is not going to be obvious from their end, the existing and future calls will still be totally broken. |
I am adding the following safety measures as a mitigation effort:
Hopefully those will mitigate the discovered issues. I personally believe that Pants will continue to experience pain from this issue while we are quietly passing a symlink instead of an actual dir. My instinct is that it would be better to pass tasks the canonical link and coalesce all consumption to one canonical location for both products and caches. That would leave the symlinks as internal implementation details and we could more comfortably rely on the state. If anyone has a motivating case for why passing the consumer a symlink is preferred, I would love the context. |
The idea behind the stable symlink is to export a stable classpath location for things like |
@mateor : I'm worried that this (your 3) is orthogonal, and likely to break other things. Items 1 and 2 seem like pure bugfixes. Thanks! |
I knew why the stable location is useful, but I am still not sure why the tasks should prefer to interact with the stable one. Changing simple_codegen to use the vt's unique results_dir solves almost all of that example with one word. However, while it is not quite obvious to me why Products should use a different file pointer than the caches, I am okay with punting that larger issue until we see evidence that this is broken after my patch. |
#3 already raises an error today - without the Exception catch, it is raised by the system instead of Pants as detailed in the original issue:
ETA: The exception occurs because the relative_symlink method tries to unlink the link_path if it exists, without checking to be sure that it is a symlink beforehand. If it silently succeeded, I may have never noticed the issue :) |
Also, that last one will be as an individual followup, so it won't gate the other fixes 👍 |
Ok. So the idea is that we will force any tasks that currently think they need to clear the directory (and are doing it incorrectly) to stop, and get a clear error message. Fine with me. |
Yep, just some sanity checks around the issue. We had two instances of this live in our repo, and Pants had at least the one. So I am hoping to raise an Exception for any other outside callers. |
Thanks! We've been seeing that pop up occasionally at Square for ages and I'd been unable/without enough time to track it down. |
Us too :) Once this lands, Stu is going to backport it and release 1.2.1, I will have a RB tonight. |
…sks. Tasks that sometimes fail due to outside factors (download failures, resolve issues, etc) often would call safe_mkdir(vt.results_dir, clean=True) in order to wipe possibly truncated or crufty state. But since vt.results_dir is a symlink, that replaced it with a real dir. That ended up breaking caching, since the task output was therefore never making it into the artifact cache, and other weird bugs. This is a small change that deletes the existing directories if a target is invalid, removing the need for tasks to wipe the results_dir, and also now checks to make sure that the results_dir is legal before marking valid and passing to the artifact_caches. The majority of the change is added test coverage around the breaks. I have a couple immediate followups that cover an additional failure state, and reworks the cache_manager to remove some of the harder to reason about bits. Closes: pantsbuild#4137
* Remove clean from pants task setups. Tasks that might leave partial results in a workdir were historically left to the task to clean. But that became bugged when the workdirs were moved to a symlink, since the clean algorithm didn't have symlink awareness. The consequence of this was that calling clean=True would silently overwrite the symlink with a real dir and break the cache symantics. For resolves, the fresh workspaces was still needed. I changed upstram Pants to clean workdirs by default, so any failure now cleans. pantsbuild/pants#4051 pantsbuild/pants#4137 * Fix linter and missed variables in the refactor (sapling split of c8a374715e5d7eb31eac507c70a3fe9db476af01)
This will later cause the Exceptions when the cache_manager attempts to unlink the
vt.results_dir
, something like:This situation is somewhat rare - it can only happen when an invalid VT unexpectedly has contents inside the results_dir, perhaps from a failed resolve or an error during task execution.
Additionally, not only is the symlink silently replaced, but the actual clean is never executed - the clean will remove the symlink, but the unique current_results_dir is not dereferenced. So not only is the results_dir symlink bugged but the bad or failed state is preserved within the canonical path.
This is a limitation of safe_mkdir, in that it has no concept of the unique properties of symlinks. My instinct is that the the clean operation should be removed since the clean does not work on symlinks like the
results_dir
, which is the most likely directory for a task to interact with.For now, a workaround within the cache_manager is to simply make sure that the
current_results_dir
that the symlink points to is always empty when passed to tasks. This assumes that the clean=True calls are always trying to clear old state, as opposed to happening mid-way through execution. This assumption is true for the call sites I am aware of, both inside our repo and pants itself.This causes the clean operation to noop and therefore preserve the symlink. It also catches the additional failure state of the failed clean, since we have now forced it within the cache manager.
The intermediate workaround is going out soon, this issue merely tracks it.
The text was updated successfully, but these errors were encountered: