Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.48 beta broken on mips due to copy_file_range EOVERFLOW #78979

Closed
infinity0 opened this issue Nov 12, 2020 · 46 comments
Closed

1.48 beta broken on mips due to copy_file_range EOVERFLOW #78979

infinity0 opened this issue Nov 12, 2020 · 46 comments
Assignees
Labels
C-bug Category: This is a bug. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta.
Milestone

Comments

@infinity0
Copy link
Contributor

infinity0 commented Nov 12, 2020

https://buildd.debian.org/status/fetch.php?pkg=rustc&arch=mipsel&ver=1.48.0%7Ebeta.8%2Bdfsg1-1%7Eexp1&stamp=1605149146&raw=0

error[E0107]: wrong number of type arguments: expected 3, found 2
   --> /usr/src/rustc-1.48.0/vendor/petgraph/src/graphmap.rs:580:16
    |
580 |     edges: &'a IndexMap<(N, N), E>,
    |                ^^^^^^^^^^^^^^^^^^^ expected 3 type arguments

error[E0107]: wrong number of type arguments: expected 2, found 1
   --> /usr/src/rustc-1.48.0/vendor/petgraph/src/matrix_graph.rs:925:22
    |
925 |     removed_ids: &'a IndexSet<usize>,
    |                      ^^^^^^^^^^^^^^^ expected 2 type arguments

error[E0107]: wrong number of type arguments: expected 3, found 2
  --> /usr/src/rustc-1.48.0/vendor/petgraph/src/graphmap.rs:61:12
   |
61 |     nodes: IndexMap<N, Vec<(N, CompactDirection)>>,
   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected 3 type arguments

error[E0107]: wrong number of type arguments: expected 3, found 2
  --> /usr/src/rustc-1.48.0/vendor/petgraph/src/graphmap.rs:62:12
   |
62 |     edges: IndexMap<(N, N), E>,
   |            ^^^^^^^^^^^^^^^^^^^ expected 3 type arguments

error[E0107]: wrong number of type arguments: expected 2, found 1
   --> /usr/src/rustc-1.48.0/vendor/petgraph/src/matrix_graph.rs:851:18
    |
851 |     removed_ids: IndexSet<usize>,
    |                  ^^^^^^^^^^^^^^^ expected 2 type arguments

error: aborting due to 5 previous errors; 87 warnings emitted

For more information about this error, try `rustc --explain E0107`.

Did not run successfully: exit code: 1
"/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1/bin/rustc" "--crate-name" "petgraph" "--edition=2018" "/<<PKGBUILDDIR>>/vendor/petgraph/src/lib.rs" "--error-format=json" "--json=diagnostic-rendered-ansi,artifacts" "--crate-type" "lib" "--emit=dep-info,metadata,link" "-C" "opt-level=3" "--cfg" "feature=\"default\"" "--cfg" "feature=\"graphmap\"" "--cfg" "feature=\"matrix_graph\"" "--cfg" "feature=\"stable_graph\"" "-C" "metadata=08c9d7a6b2939d49" "-C" "extra-filename=-08c9d7a6b2939d49" "--out-dir" "/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/deps" "--target" "mipsel-unknown-linux-gnu" "-C" "linker=mipsel-linux-gnu-gcc" "-L" "dependency=/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/deps" "-L" "dependency=/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/release/deps" "--extern" "fixedbitset=/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/deps/libfixedbitset-5dcd55a1fc6307c9.rmeta" "--extern" "indexmap=/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/deps/libindexmap-1f8402edc452918c.rmeta" "--cap-lints" "warn" "--cap-lints=warn" "-Clink-args=-Wl,-z,relro" "-Zmacro-backtrace" "-Zunstable-options" "-Wrustc::internal" "-Cprefer-dynamic" "-Zbinary-dep-depinfo" "-Wrust_2018_idioms" "-Wunused_lifetimes" "-Dwarnings" "--sysroot" "/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1" "--remap-path-prefix" "/<<PKGBUILDDIR>>=/usr/src/rustc-1.48.0" "-Z" "force-unstable-if-unmarked"
@infinity0 infinity0 added the C-bug Category: This is a bug. label Nov 12, 2020
@infinity0
Copy link
Contributor Author

Actually the real culprit would appear to be indexmap which was upgraded from 1.5.1 to 1.6.0 for rust 1.48 - breaking the API that petgraph depends on. @bluss @cuviper

@infinity0 infinity0 changed the title 1.48 beta broken on mips due to broken petgraph 1.48 beta broken on mips due to API-breaking indexmap upgrade (1.5.1 -> 1.6.0) Nov 12, 2020
@Mark-Simulacrum Mark-Simulacrum added the regression-from-stable-to-beta Performance or correctness regression from stable to beta. label Nov 12, 2020
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Nov 12, 2020
@Mark-Simulacrum Mark-Simulacrum added this to the 1.48.0 milestone Nov 12, 2020
@Mark-Simulacrum
Copy link
Member

So the problem is that indexmap's build script uses autocfg to try to discover the availability of some Rust features, and it is likely breaking on mips for some reason. See https://zulip-archive.rust-lang.org/182449tcompilerhelp/46175petgraphwrongnumberoftypearguments.html for some previous discussion.

I suspect the next step here is going to be identifying the cause of the problem -- you should be able to get logs from the autocfg build script somewhere in the build directory, along the lines of build/x86_64-unknown-linux-gnu/stage0-rustc/x86_64-unknown-linux-gnu/release/build/indexmap-617a24810b08e703/stderr but with the target triples and the hash in indexmap likely changed.

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

Actually the real culprit would appear to be indexmap which was upgraded from 1.5.1 to 1.6.0 for rust 1.48 - breaking the API that petgraph depends on.

There's no API change: indexmap-rs/indexmap@1.5.1...1.6.0

The error you're seeing is because no_std builds have IndexMap<K, V, S>, while std builds have default S = RandomState and add new and with_capacity using that default. We use autocfg to detect whether the target has std, and have done so since 1.3.0. There's a slight functional change in 1.5.2's new "std" feature, but that should only improve the situation by avoiding autocfg when the feature is set, and I don't think many dependents are setting that feature yet.

See indexmap-rs/indexmap#144 and indexmap-rs/indexmap#151 for more context. The autocfg probe just invokes rustc with the output set into the $OUT_DIR, which may be unusual but not wrong IMO. It seems to be an issue with the compiler writing or copying files to some filesystems, so I'd be interested to know what the error is in your case, and whether there are any other build environment differences that cause it to newly fail.

@infinity0
Copy link
Contributor Author

Thanks for the context, digging through the log I see:

[indexmap 1.6.0] error: could not copy "/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/build/indexmap-f26e0fc96efda7b8/out/probe0.probe0.3a1fbbbh-cgu.0.rcgu.ll" to "/<<PKGBUILDDIR>>/build/mipsel-unknown-linux-gnu/stage1-rustc/mipsel-unknown-linux-gnu/release/build/indexmap-f26e0fc96efda7b8/out/probe0.ll": Value too large for defined data type (os error 79)
[indexmap 1.6.0] 
[indexmap 1.6.0] error: aborting due to previous error
[indexmap 1.6.0] 
[indexmap 1.6.0] 
[indexmap 1.6.0] Did not run successfully: exit code: 1

The first occurence of this error "could not copy A to B: Value too large for defined data type (os error 79)" is for indexmap but it also happens for several other crates, so the problem is likely something else, I'll keep investigating..

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

If they're all in probeN.ll files, they're still likely via autocfg, but I think it will ultimately require a compiler fix.

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

EOVERFLOW is also a weird error here -- these are definitely not large files or anything...

@infinity0
Copy link
Contributor Author

It's the stage1 build of indexmap that is failing, so annoyingly I can't reproduce this outside of the rustc-build context with just indexmap since we only have stage0 (1.47) on hand. So it could indeed be some issue with 1.48.

@infinity0
Copy link
Contributor Author

@the8472 You were the only one to make changes to library/std/src/sys/unix/fs.rs specifically copy() for 1.48, any ideas? It is giving EOVERFLOW on mips.

@infinity0
Copy link
Contributor Author

I will try reverting 4ddedd5 to see if this fixes the problem, will report back in a few hours.

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

I wonder if mips (or your filesystem) doesn't like the request for usize::MAX bytes here:

let bytes_to_copy = cmp::min(max_len - written, usize::MAX as u64) as usize;

@infinity0
Copy link
Contributor Author

That was already there in 1.47. However the general gist of that commit seems to be to copy MAX bytes regardless of the size of the file, which perhaps is not allowed on mips or something? It seems like a weird thing to want to do, anyway.

@Mark-Simulacrum
Copy link
Member

I would be fine disabling copy_file_range on mips (or perhaps all platforms) for 1.48 at least since it seems like it's causing problems, presuming that fixes the bug here. I wouldn't want to revert just the latest commit though since it seems like it does fix real bugs (or at least claims to).

@infinity0
Copy link
Contributor Author

The commit description says "race conditions where a file is truncated while copying from it. if we blindly trusted the file size this would lead to an infinite loop" but this doesn't sound right since MAX is always greater than the actual file size, and we only read it once.

It could help in the situation {X: where a file is appended to while copying from it}, but again the previous behaviour would only copy fewer bytes than wanted without going into an infinite loop.

The other parts of the commit sound legit. I will report back whether the revert helps the issue, but I suspect changing the max_len part to the previous behaviour whilst leaving the rest of the commit in place, would fix all the issues described as well as mips, except [X] where it's debatable what the correct behaviour is anyway.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

The previous implementation queried the file size via stat. This has correctness issues (e.g. procfs reports sizes incorrectly) and is prone to races (files getting truncated while copying). The goal of the change was to instead copy until EOF is reached. That seemed to work on my 64bit linux.

Here's the generic_copy_file_checks

	/* Ensure offsets don't wrap. */
	if (pos_in + count < pos_in || pos_out + count < pos_out)
		return -EOVERFLOW;

That's comparing signed and unsigned types, I'm not familiar with the C integer rules but I guess that this just happens to work for the file offset of 0 and fails on a subsequent try if the first copy_file_range does a short copy, i.e. not the whole requested range. If that is correct then it could also fail on other platforms if a short copy is encountered. This might also be dependent on kernel versions since copy_file_range behavior changed several times.

I have an improved implementation that should (almost) handle this case in #75272

https://github.com/rust-lang/rust/blob/cd8cc05e1fc147f188a41d8c6b0d4d9c448c231a/library/std/src/sys/unix/kernel_copy.rs#L506-L507

I say almost because I assumed this could only happen with io::copy where file seek positions could be non-zero and assert that it doesn't happen in the fs::copy case.

I assume lifting that entire PR to beta would be inappropriate. If so I can extract the CopyResult type and fallback logic to a separate PR and also copy in "smaller" (2GB) chunks like the splice implementation so that overflow condition is less likely to occur.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

The commit description says "race conditions where a file is truncated while copying from it. if we blindly trusted the file size this would lead to an infinite loop" but this doesn't sound right since MAX is always greater than the actual file size, and we only read it once.

It's read in a loop. If the truncate happens between the stat call and any copy_file_range call then the old implementation wouldn't check for EOF and instead loop forever, expecting to be able to copy more bytes. And as mentioned above this isn't the only issue.

@infinity0
Copy link
Contributor Author

It's read in a loop.

I can't see this - 4ddedd5 removed one call to reader_metadata.len() but this was at the top-level of std::sys::fs::copy outside of any loops, and std::fs::copy calls this copy only once.

I am guessing this is the cause of EOVERFLOW but I have yet to confirm this, I should just write a short program to test it. Where is your generic_copy_file_checks from, glibc?

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

@infinity0

That was already there in 1.47.

The prior code was capped by the file length, whereas max_len is now u64::MAX

@the8472

Here's the generic_copy_file_checks

	/* Ensure offsets don't wrap. */
	if (pos_in + count < pos_in || pos_out + count < pos_out)
		return -EOVERFLOW;

That's comparing signed and unsigned types, I'm not familiar with the C integer rules

With 32-bit unsigned size_t len and 64-bit signed loff_t pos_*, the addition will promote to signed 64-bit integers, and then the comparison is signed.

With 64-bit size_t, I'm not sure off-hand...

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

I would be fine disabling copy_file_range on mips (or perhaps all platforms) for 1.48 at least since it seems like it's causing problems

I think we should either disable it on all platforms or add general EOVERFLOW handling and lower the chunk size so that an overflow is less likely to happen for anything but very large file offsets.

I can't see this - 4ddedd5 removed one call to reader_metadata.len() but this was at the top-level of std::sys::fs::copy outside of any loops, and std::fs::copy calls this copy only once.

The length is not "read" in the loop (I was confused by the word read, I thought you meant read(2)). But data from the file is read in a loop and the loop did not terminate until a sufficient amount of bytes was read. I added the EOF case with my change, but that alone is not sufficient due to the procfs and overlayfs issues that motivated that PR. You can't trust the stat results for the purpose of the copy operation.

Where is your generic_copy_file_checks from, glibc?

That's from the kernel sources.

@infinity0
Copy link
Contributor Author

The length is not "read" in the loop (I was confused by the word read, I thought you meant read(2)). But data from the file is read in a loop and the loop did not terminate until a sufficient amount of bytes was read.

Right, but your commit changes len to max_len (and the loop check becomes written < max_len) so I fail to see how this improves any supposed infinite loop - if there was an infinite loop previously, there would still be an infinite loop with this more relaxed condition.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

so I fail to see how this improves any supposed infinite loop

I did not say that that particular change solves the infinite loop case on its own. The change changes the semantics and that's why the name was changed from len to max_len. It now allows an upper bound for the bytes copied to be set, that's all it does. The happy path for termination (other than the upper bound) is reaching EOF. And probing for EOF does not require a size provided by stat.

With 32-bit unsigned size_t len and 64-bit signed loff_t pos_*, the addition will promote to signed 64-bit integers, and then the comparison is signed.

Odd, mipsel is 32bit, right? So this should never be able to overflow unless the file seek offset is in the range (i64::MAX - usize::MAX)..i64::MAX. Maybe that is on an older kernel with a different copy_file_range implementation.

@infinity0
Copy link
Contributor Author

It appears that size_t is signed on mips:

#define _GNU_SOURCE
#include <unistd.h>
#include <limits.h>
#include <stdio.h>

size_t max = INT_MAX + 1;

int main() {
  FILE *fa = fopen("a.txt", "rb");
  FILE *fb = fopen("b.txt", "wb");
  printf("calling copy_file_range with len %d\n", max);
  if (copy_file_range(fileno(fa), NULL, fileno(fb), NULL, max, 0) != 0) {
          perror(NULL);
  }
}
$ gcc copy_file_range.c && ./a.out && diff a.txt b.txt
copy_file_range.c:6:22: warning: integer overflow in expression of type 'int' results in '-2147483648' [-Woverflow]
    6 | size_t max = INT_MAX + 1;
      |                      ^
calling copy_file_range with len -2147483648
Invalid argument
1d0
< 12345

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

That would be a problem. The libc crate lists it as unsigned https://rust-lang.github.io/libc/mipsel-unknown-linux-gnu/libc/type.size_t.html

@infinity0
Copy link
Contributor Author

Sorry my bad, I miscoded, it should be ((size_t)INT_MAX) + 1 and size_t is unsigned. But for some reason copy_file_range doesn't like that value, whereas INT_MAX is fine.

#define _GNU_SOURCE
#include <unistd.h>
#include <limits.h>
#include <stdio.h>

size_t max = ((size_t)INT_MAX);

int main() {
  FILE *fa = fopen("a.txt", "rb");
  FILE *fb = fopen("b.txt", "wb");
  printf("calling copy_file_range with len %p\n", max);
  if (copy_file_range(fileno(fa), NULL, fileno(fb), NULL, max, 0) != 0) {
          perror(NULL);
  }
}
(experimental_mipsel-dchroot)infinity0@eller:~$ gcc copy_file_range.c && ./a.out && diff a.txt b.txt
calling copy_file_range with len 0x80000000
Invalid argument
1d0
< 12345
(experimental_mipsel-dchroot)infinity0@eller:~$ nano copy_file_range.c                                                                                                                                                                   
(experimental_mipsel-dchroot)infinity0@eller:~$ gcc copy_file_range.c && ./a.out && diff a.txt b.txt
calling copy_file_range with len 0x7fffffff
Success

@infinity0
Copy link
Contributor Author

"Invalid argument" is not EOVERFLOW however, so I'll have to investigate some more.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

It's good to know either way, then I'll lower the chunk size further to 1GB.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

The current implementation already deals with EINVAL by falling back to the userspace copy implementation. This isn't ideal but shouldn't be a release blocker. The question is under which circumstances EOVERFLOW happens.

@Mark-Simulacrum what's the release timeline? How much time do we have for a fix and how small should it be kept? Rendering copy_file_range inert would be a 1-line change. A proper fix takes some more effort but I have most of the needed code already in a different PR.

@Mark-Simulacrum
Copy link
Member

Any fix would ideally land in the next 72 hours, roughly. Please ping me and @pietroalbini on relevant PRs.

I suspect that given the timeframe just rendering this code inert is our best bet.

@cuviper cuviper changed the title 1.48 beta broken on mips due to API-breaking indexmap upgrade (1.5.1 -> 1.6.0) 1.48 beta broken on mips due to copy_file_range EOVERFLOW Nov 12, 2020
@infinity0
Copy link
Contributor Author

As a slight aside, @jrtc27 mentioned to me that mips64 does sign-extension by default, and the copy_file_range syscall does not have a compat version when running 32-bit user code on a 64-bit kernel so its size_t argument gets sign-extended as per this default behaviour. This does not really matter for this rust code in question, since the size parameter is secondary to the purpose of the logic, but this then get me thinking why not just set bytes_to_copy = u64::MAX as a constant - AFAICT there is no significance in this value anyway, all the logic is done by the surrounding code.

I'll launch a build to confirm that has_copy_file_range = false actually works, AIUI this will retain the other bugfixes from that commit.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

@infinity0 yeah right now the size has no relevance, that will change again with #75272

AIUI this will retain the other bugfixes from that commit.

Well, that's a funny way to put it, but yes.

@Mark-Simulacrum

I suspect that given the timeframe just rendering this code inert is our best bet.

Ok, will do.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

@rustbot claim

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

then I'll lower the chunk size further to 1GB.

This seems better than fully disabling copy_file_range -- just change the usize::MAX to 1GB?

but this then get me thinking why not just set bytes_to_copy = u64::MAX as a constant - AFAICT there is no significance in this value anyway, all the logic is done by the surrounding code.

I'm not sure what you mean -- bytes_to_copy is the size_t len of the syscall, no?

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

This seems better than fully disabling copy_file_range -- just change the usize::MAX to 1GB?

Well, setting the length to 1GB is mostly about avoiding the EINVAL that @infinity0 reported in #78979 (comment)

The actual issue is EOVERFLOW for which we haven't pinpointed the exact cause yet. It could also be due to 32bit calculations on mipsel or older kernel versions so it might happen for any large files too. I intend to handle this properly in #75272

We can try lowering the chunk size if you want, but that's leaves a bit more room for error than outright disabling it.

@the8472
Copy link
Member

the8472 commented Nov 12, 2020

Created two PRs against beta, pick one or the other.

For nightly I'll update #75272 to cover this case too and hope we'll get that merged soon(ish).

@cuviper
Copy link
Member

cuviper commented Nov 12, 2020

I intend to handle this properly in #75272

Keep an eye on when that lands -- if it's after the next branch date, it may need backporting to 1.49-beta.

@infinity0
Copy link
Contributor Author

The build on mips with disabling the flag succeeded https://buildd.debian.org/status/fetch.php?pkg=rustc&arch=mipsel&ver=1.48.0%7Ebeta.8%2Bdfsg1-1%7Eexp3&stamp=1605339163&raw=0

I'm still waiting on shell access on a similar machine.

bors added a commit to rust-lang-ci/rust that referenced this issue Nov 15, 2020
[beta] always disable copy_file_range to avoid EOVERFLOW errors

A bigger hammer as alternative to rust-lang#79007

Pro: will certainly fix the issue
Cons: will disable copy_file_range for everyone

Resolves rust-lang#78979
@infinity0
Copy link
Contributor Author

I couldn't reproduce the original EOVERFLOW issue on a mips machine with linux 4.19, the build succeeded even without patching away that flag. (So probably the existing EINVAL workaround was sufficient.) I'm trying to source a mips linux 5 box to try to reproduce it...

@Mark-Simulacrum
Copy link
Member

@infinity0 It would be helpful if you could confirm that current beta branch (likely no release quite yet) fixes this issue.

@infinity0
Copy link
Contributor Author

@Mark-Simulacrum with the flag set from true to false (as per #79008) the build succeeds, so I assume it works - sadly I don't have time to test the branch specifically before the Thursday release but I'm sure it'll be fine.

@Mark-Simulacrum Mark-Simulacrum modified the milestones: 1.48.0, 1.49.0 Nov 17, 2020
@Mark-Simulacrum
Copy link
Member

Setting milestone to 1.49, we need to land #79008 on that branch as well.

@spastorino
Copy link
Member

Assigning P-high as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

@spastorino spastorino added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 18, 2020
@Mark-Simulacrum
Copy link
Member

backported in #79838 -- I think this should be fixed? Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta.
Projects
None yet
Development

No branches or pull requests

6 participants