Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinstate the clean of vt.results_dir in NodeResolve #4051

Closed
mateor opened this issue Nov 14, 2016 · 4 comments
Closed

Reinstate the clean of vt.results_dir in NodeResolve #4051

mateor opened this issue Nov 14, 2016 · 4 comments
Labels

Comments

@mateor
Copy link
Member

mateor commented Nov 14, 2016

The NodeResolve task used to clean the vt.results_dir but that was recently removed in 8517486.

The argument was that safe_mkdir(results_dir, clean=True) was redundant and removing it stopped exceptions like Exception message: [Errno 1] Operation not permitted: '/Users/yic/workspace/source/.pants.d/resolve/node/252d64521cf9/<target.id>/current'.

I think that this was a misdiagnosis, and that the clean should be reinstated. The purpose of that call wasn't the mkdir, as much as the clean. The exceptions are below proposed as a consequence of the clean logic.

Motivation for reinstating the clean
The clean is actually useful if there has been a ctrl-c or error during a resolve - the directory could hold a halfway state and best be blown away. With a bad resolve, it is possibly vital to clean.

Proposed cause of the exceptions
The exceptions are from the times when the above scenario has occurred. For one reason or another, the vt is invalid and the results_dir had something in it.

In this case, the safe_mkdir call would correctly clean the directory and remake it. But the file pointer being passed by the SimpleCodegen superclass is the current results_dir - which is expected to be a symlink. When safe_mkdir recreated the directory, it converted it from a symlink into a real dir, so the next run will raise an Exception when the cache_manager unsuccessfully attempts to unlink current.

Disclaimer
Does bad state get cleaned somewhere else along the pipeline? Not that I see with the below repro

Repro POC
On pants master, drop a trace on in the invalidation block:

@@ -84,6 +84,7 @@ class NodeResolve(NodeTask):
         for vt in invalidation_check.all_vts:
           target = vt.target
           if not vt.valid:
+            import pdb; pdb.set_trace()

Repro is to touch a file under the vt.results_dir and clean with safe_mkdir. Then ctrl-C and re-run the resolve to see exception. Here is mine:

mateo:pants mateo$ ./pants resolve contrib/node:
<snip>
16:02:30 00:02     [node]
                   Invalidated 5 targets.> /Users/mateo/dev/pants/contrib/node/src/python/pants/contrib/node/tasks/node_resolve.py(88)execute()
-> resolver_for_target_type = self._resolver_for_target(target).global_instance()
(Pdb) from pants.util.dirutil import safe_mkdir
(Pdb) import os
(Pdb) pp vt.results_dir
u'/Users/mateo/dev/pants/.pants.d/resolve/node/252d64521cf9/contrib.node.examples.src.node.preinstalled-project.preinstalled-project/current'
(Pdb) os.listdir(vt.results_dir)
[u'mateo.txt']
(Pdb) safe_mkdir(vt.results_dir, clean=True)
(Pdb) os.listdir(vt.results_dir)
[]
(Pdb)
16:02:30 00:02       [install]
16:04:08 01:40   [complete]
               ABORTED
Interrupted by user.
mateo:pants mateo$ ./pants resolve contrib/node::
<snip>
Exception message: [Errno 1] Operation not permitted: '/Users/mateo/dev/pants/.pants.d/resolve/node/252d64521cf9/contrib.node.examples.src.node.preinstalled-project.preinstalled-project/current'

mateo:pants mateo$

Robust solutions
The ideal fix would probably be restore the clean and do one or both of:

  1. Ensure tasks pass around the stable file path for results_dir instead of the current symlink
  2. Give safe_mkdir an understanding of symlinks, so that it cleans the dereferenced directory.*

NB: (2) is a bit sticky, since it raises the philosophical question of whether that allows deleting directories outside the buildroot.

safe_mkdir will happily create directories outside the buildroot, although I do not believe it exercises that power. But creating a directory is not as scary as recursively deleting. My instinct would be to add a check to make sure that the path is under the pants_workdir if you went that route.

Quick Solutions
The first two are not ideal since they only solve this situation for NodeResolve and leave other tasks to fend for themselves, but easy to patch in quickly.

One of:
1.

           if not vt.valid:
+            safe_mkdir(os.path.realpath(vt.results_dir), clean=True)
             resolver_for_target_type = self._resolver_for_target(target).global_instance()
             resolver_for_target_type.resolve_target(self, target, vt.results_dir, node_paths)
  1. Manually do the delete with shutil.rmtree or something
  2. Perhaps change safe_mkdir(fp, clean=True) to only delete the contents instead of deleting the entire directory and recreating. Possibly useful in other ways, since silently recreating means that it could also silently change the directory owner or permissions.
@stuhood
Copy link
Member

stuhood commented Nov 14, 2016

If we're not appropriately cleaning results dirs before giving them to tasks, that should be fixed at a lower level... most likely here:

def create_results_dir(self, root_dir, allow_incremental):
"""Ensures that a results_dir exists under the given root_dir for this versioned target.
If incremental=True, attempts to clone the results_dir for the previous version of this target
to the new results dir. Otherwise, simply ensures that the results dir exists.
"""
# Generate unique and stable directory paths for this cache key.
current_dir = self._results_dir_path(root_dir, self.cache_key, stable=False)
self._current_results_dir = current_dir
stable_dir = self._results_dir_path(root_dir, self.cache_key, stable=True)
self._results_dir = stable_dir
if self.valid:
# If the target is valid, both directories can be assumed to exist.
return
# Clone from the previous results_dir if incremental, or initialize.
previous_dir = self._use_previous_dir(allow_incremental, root_dir, current_dir)
if previous_dir is not None:
self.is_incremental = True
self._previous_results_dir = previous_dir
shutil.copytree(previous_dir, current_dir)
else:
safe_mkdir(current_dir)
# Finally, create the stable symlink.
relative_symlink(current_dir, stable_dir)

@mateor
Copy link
Member Author

mateor commented Nov 14, 2016

I think I agree with that. But this task should be restored in the meantime, unless that is coming quickly.

And, that change would still leave safe_mkdir bugged for tasks in at least two ways.

  1. Can replace current symlink with a real dir for any existing calls so the above exception would still raise
  2. Can pretend to clean directories without actually doing so, since the clean=True does not follow symlinks.
19:32:05 00:01     [node]
                   Invalidated 5 targets.> /Users/mateo/dev/pants/contrib/node/src/python/pants/contrib/node/tasks/node_resolve.py(88)execute()
-> resolver_for_target_type = self._resolver_for_target(target).global_instance()
(Pdb) from pants.util.dirutil import safe_mkdir
(Pdb) import os
(Pdb) pp vt.results_dir
u'/Users/mateo/dev/pants/.pants.d/resolve/node/252d64521cf9/contrib.node.examples.src.node.preinstalled-project.preinstalled-project/current'
(Pdb) stable = os.path.realpath(vt.results_dir)
(Pdb) pp stable
u'/Users/mateo/dev/pants/.pants.d/resolve/node/252d64521cf9/contrib.node.examples.src.node.preinstalled-project.preinstalled-project/12b4ae6d686b'
(Pdb) os.listdir(vt.results_dir)
[u'mateo.txt']
(Pdb) os.path.islink(vt.results_dir)
True
(Pdb) safe_mkdir(vt.results_dir, clean=True)
(Pdb) os.path.islink(vt.results_dir)
False
(Pdb) os.listdir(vt.results_dir)
[]
(Pdb) os.listdir(stable)
[u'mateo.txt']

(ETA: Updated to actually add the final os.path.islink assertion)

@mateor
Copy link
Member Author

mateor commented Nov 14, 2016

I also have an additional safe_mkdir bug reported internally, but I haven't had time to validate. He says:
"Python's docs state "On some systems, mode is ignored. Where it is used, the current umask value is first masked out. If bits other than the last 9 (i.e. the last 3 digits of the octal representation of the mode) are set, their meaning is platform-dependent. On some platforms, they are ignored and you should call chmod() explicitly to set them."

It came with a patch that fixed it for his issue, but I am not sure it will be useful for everyone.

@mateor
Copy link
Member Author

mateor commented Jan 3, 2017

this is also now resolved with the default clean: #4137

@mateor mateor closed this as completed Jan 3, 2017
mateor added a commit to foursquare/fsqio that referenced this issue Dec 17, 2017
* Remove clean from pants task setups.

Tasks that might leave partial results in a workdir were historically
left to the task to clean. But that became bugged when the workdirs
were moved to a symlink, since the clean algorithm didn't have symlink
awareness.

The consequence of this was that calling clean=True would silently overwrite
the symlink with a real dir and break the cache symantics.

For resolves, the fresh workspaces was still needed. I changed upstram
Pants to clean workdirs by default, so any failure now cleans.

pantsbuild/pants#4051
pantsbuild/pants#4137

* Fix linter and missed variables in the refactor

(sapling split of c8a374715e5d7eb31eac507c70a3fe9db476af01)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants