Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to copy when rename causes cross-link error #2410

Merged
merged 1 commit into from
Aug 10, 2020
Merged

Fallback to copy when rename causes cross-link error #2410

merged 1 commit into from
Aug 10, 2020

Conversation

mckaymatt
Copy link
Contributor

This is the issue #1239. There is further discussion here. We came across it in our dockerized CI environment.

The fix for the bug is to fall back to copying when rename fails due to a cross-link error on Linux.
I'm not sure how a test could be created for this, but it can be reproduce in docker so long as dockerhub doesn't delete the images I used.

# REPRODUCE ERROR
# Build yesterday with nightly toolchain. "rustup update nightly" should fail.
# From https://hub.docker.com/layers/rustlang/rust/nightly-stretch/images/sha256-6e27094d819a2eeaaddbda329cc8ddcc27666c701d51eb9a3d35d3d261ef321d?context=explore
docker run --net=host -t --rm \
    rustlang/rust:nightly-stretch@sha256:6e27094d819a2eeaaddbda329cc8ddcc27666c701d51eb9a3d35d3d261ef321d \
    rustup update nightly

# BUILD FIX
docker run --net=host -t --rm \
    -v ${PWD}:/rustup \
    -w /rustup \
    rust:stretch@sha256:cd0d96614c172237a7b782fc090733f3c12922fb29e3c064e8fe34453ac01b99 \
    bash -c 'cargo build && RUSTUP_HOME=home CARGO_HOME=home target/debug/rustup-init --no-modify-path -y'


# TEST FIX
docker run --net=host -t --rm \
    -v ${PWD}/home/bin/rustup:/usr/local/cargo/bin/rustup \
    rustlang/rust:nightly-stretch@sha256:6e27094d819a2eeaaddbda329cc8ddcc27666c701d51eb9a3d35d3d261ef321d \
    rustup update nightly

Please let me know what you think.

@mckaymatt mckaymatt marked this pull request as ready for review July 9, 2020 20:54
@rbtcollins
Copy link
Contributor

Thank you for this patch.

As background, rustup has a transactional layer here which attempts to make sure that installation of components is atomic. To make that work we only unpack things in the same directory structure and explicitly do not support having multiple filesystems sharing RUSTUP_HOME.

The issue for your patch is by downgrading to copy, we no longer have that transactional protection and capability: if an IO error, or a memory error or an unexpected signal occurs during the middle of the copy (which will typically be of a couple hundreds of MB), then the installed toolchain location will have a partial write present (perhaps of a partial file, or perhaps of a subset of files of a directory).

And the issue with that is that we have no UI for detecting or recovering from those partial writes in rustup today, so if we introduce this class of error that cannot exist today, we'll have to deal with the inevitable occurrences of this, which will be a support burden for a small team.

Longer term @kinnison and I are discussing a different model for rustup where we can be eventually consistent, recover from partial writes and never write to a staging directory at all, avoiding this complication.

Finally, it makes very little sense to write a file to disk, attempt to rename it, then copy-and-delete it: this is strictly slower than unpacking and renaming as we do today, or unpacking directly to the desired location, which would address the issue with docker.

As a way forward, in the interim, I suggest either:

  • uninstalling rustup entirely and reinstalling it : this should be approximately as fast as doing an upgrade, and avoid the issue of doing upgrades in a separate docker layer.
  • extend this patch slightly, adding an unsupported variable that would enable this fallback mode (E.g. RUSTUP_PERMIT_COPY_RENAME)

By unsupported I mean:

  • we should document it
  • we should document that we may remove it at any point (but presumably only when we think rustup will work in additional docker layers without it)

This will mean that the current status quo will remain, where the default behaviour is very robust and it doesn't work with additional docker layers, but folk that search for the error can find docs on how to "fix" it. For a more polished approach, we could do something along these lines - https://github.com/sindresorhus/is-docker/blob/master/index.js.

I'll leave some thoughts on the code too.

@@ -615,6 +634,14 @@ where
notify_handler(Notification::RenameInUse(&src, &dest).into());
OperationResult::Retry(e)
}
io::ErrorKind::Other
if cfg!(target_os = "linux") && Some(18i32) == e.raw_os_error() =>
Copy link
Contributor

@rbtcollins rbtcollins Jul 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

written like this, this code path is compiled into every binary, linux or not; I think a #[cfg(linux)] block would be better.

Rather than having the magic number 18 here, please use https://docs.rs/libc/0.2.72/libc/constant.EXDEV.html

@@ -593,6 +593,25 @@ pub fn toolchain_sort<T: AsRef<str>>(v: &mut Vec<T>) {
});
}

fn slow_rename<'a, N>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest copy_and_delete as a name.

@mckaymatt
Copy link
Contributor Author

mckaymatt commented Jul 15, 2020

@rbtcollins Thanks for the detailed feedback. I came across this issue and it seemed like a quick fix, but I can now see why it's a bit more complicated. Still I think this is a common enough problem that the fallback could be useful, so I extended the patch.

For a more polished approach, we could do something along these lines - https://github.com/sindresorhus/is-docker/blob/master/index.js.

I'm a little weary of this because there are now multiple container runtimes besides docker and I assume they might have the same issue. It's also possible that docker won't even be a problem with rename depending on how redirct_dir is tuned.

@mckaymatt
Copy link
Contributor Author

@rbtcollins Can you please take a look when you have a chance.

// https://github.com/rust-lang/rustup/issues/1239
// This uses std::fs::copy() instead of the faster std::fs::rename() to
// avoid cross-device link errors.
if src.is_file() {
Copy link
Contributor

@rbtcollins rbtcollins Aug 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than checking is_file, check !is_dir() (or invert the order and check is_dir); that way you'll correctly fall back to a simple copy for symlinks

if src.is_file() {
copy_file(src, dest).and(remove_file(name, src))
} else {
copy_dir(src, dest, notify_handler).and(remove_dir(name, src, notify_handler))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_dir sends a notification that a directory is being removed; thats sensible for e.g. removing an installed toolchain, but here I'm not sure we want to do that, since we're emulating rename; I think calling raw::remove_dir, or with the modification I suggest above, remove_dir_all::remove_dir_all(path) directly, would be better

Copy link
Contributor

@rbtcollins rbtcollins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good, one last change please.

This feature is enabled by RUSTUP_PERMIT_COPY_RENAME and
is Linux only.
Copy link
Contributor

@rbtcollins rbtcollins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this will be awkward since it adds a dependency on nested docker :/. I think as it is an unsupported thing that that is probably ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants