Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs::File reads first 4096 bytes only on Linux v5.9 and v5.10 #3803

Closed
br0adcast opened this issue May 19, 2021 · 30 comments
Closed

fs::File reads first 4096 bytes only on Linux v5.9 and v5.10 #3803

br0adcast opened this issue May 19, 2021 · 30 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs

Comments

@br0adcast
Copy link

Version
└── tokio v1.6.0
└── tokio-macros v1.2.0 (proc-macro)

Platform
Raspberry Pi 32-bit
Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux

Description

After upgrading tokio from 1.5 to 1.6 only 4096 bytes can be read from the file

I tried this code:

let mut file = tokio::fs::File::open("/some/file").await?;
let mut contents = vec![];
file.read_to_end(&mut contents).await?;
println!("File size = {}", contents.len());

and I get File size = 4096 for any file.

@br0adcast br0adcast added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels May 19, 2021
@Darksonn Darksonn added the M-fs Module: tokio/fs label May 19, 2021
@Darksonn
Copy link
Contributor

That sounds weird. I just tried the code you provided on my local Linux laptop and it seems to work fine. Does this happen on raspberry pi only?

@wmanley Thoughts? Your PR was released in 1.6.0, so it's possible that it is related to this.

@Darksonn
Copy link
Contributor

Does it fail on files shorter than 4096 bytes? Does it fail on files longer than 4096 bytes?

@wmanley
Copy link
Contributor

wmanley commented May 19, 2021

I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly return 0 when we're not EOF and I can't see it.

@br0adcast: Could you try reverting 39706b1 and see if that fixes it?

@wmanley
Copy link
Contributor

wmanley commented May 19, 2021

I've tried to reproduce this on my Raspberry Pi with kernel 4.14, but it works fine for me. I've confirmed using strace that it's correctly using preadv2 as well.

@carllerche
Copy link
Member

Looks like we need more info to repro.

@br0adcast
Copy link
Author

br0adcast commented May 19, 2021

I have not found this problem on the Linux systems I have, including the 32-bit ones i686 and 64-bit ARM. But I found that this problem is reproduced since tokio 1.0 on my Raspberry.
@wmanley so i think it is not 39706b1

How i reproduce it:

$ dd if=/dev/urandom of=file_450000 bs=1500 count=300
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ ./target/release/fs_test_tokio_1.0 file_450000 
File size = 4096
$ ./target/release/fs_test_tokio_0.2 file_450000 
File size = 450000
$ ./target/release/fs_test_tokio_1.0 file_450000 
File size = 450000

@br0adcast
Copy link
Author

I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly return 0 when we're not EOF and I can't see it.

@br0adcast: Could you try reverting 39706b1 and see if that fixes it?

Yes, reverting 39706b1 fixes the problem.

strace:

futex(0xb6f2f6f0, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
preadv2(9, [{iov_base="\365Dh\200\316\3376n\7n!'2g9\215\363\31\t\0300O\346\2\205(\203G\261~\231d", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\30$\3074\234\277\226\206K\332\265\177\257\263\1\343\312\322\254J\23\376\202=<-\330\251\310_\6K", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\211\27Ax\227\362\2\25\20c\352Ai\256\302}N\244\245\6\253 \335\212G\1\277\5\336\322M\352"..., iov_len=64}], 1, -1, RWF_NOWAIT) = 64
preadv2(9, [{iov_base="\263\364S95\7\341\4\333\354\212U\265\230\\N\346\221\214h\327\275J\0f\215/\342\24/\204Z"..., iov_len=128}], 1, -1, RWF_NOWAIT) = 128
preadv2(9, [{iov_base="ool\33\v\365\16\214P\357\372F\rh\3473\23\351\347\201\213\336D\235\310i\257\3442\355\303\332"..., iov_len=256}], 1, -1, RWF_NOWAIT) = 256
preadv2(9, [{iov_base="}\370M\6[\240\260\250\376\312Nm\246\217\233u\275\242\201\316\341\22z\\\235A\366*\207\255\n\251"..., iov_len=512}], 1, -1, RWF_NOWAIT) = 512
preadv2(9, [{iov_base="[\23\210*\314\256\356u\246\260\246\242I\356\27\211O{,i\242\25\312\352\257\264\320\277\2255\305\352"..., iov_len=1024}], 1, -1, RWF_NOWAIT) = 1024
preadv2(9, [{iov_base="AAi\177\257:G!\225\362\3\2742t\256\303\363\178\277\266\f(\366\246\212\305\236qF/\350"..., iov_len=2048}], 1, -1, RWF_NOWAIT) = 2048
preadv2(9, [{iov_base="", iov_len=4096}], 1, -1, RWF_NOWAIT) = 0
write(1, "File size = 4096\n", 17File size = 4096
)      = 17
close(9)

@wmanley
Copy link
Contributor

wmanley commented May 19, 2021

Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?

@br0adcast
Copy link
Author

Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?

ext4

@wmanley
Copy link
Contributor

wmanley commented May 20, 2021

@br0adcast: Would you consider bisecting the kernel to find out when this broke?

@br0adcast
Copy link
Author

@wmanley: The problem occurs after upgrading the kernel from 5.4.81 to 5.10.0 and higher. At least, I have found it in the following kernels:

  1. Linux raspberrypi 5.10.0-v7l+ Add AsyncWriteExt::shutdown #1382 SMP Tue Dec 15 18:23:34 GMT 2020 armv7l GNU/Linux
  2. Linux raspberrypi 5.10.17-v7l+ IntoRawFd implementations missing #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux
  3. Linux raspberrypi 5.10.36-v7l+ codec: a few suggestions #1418 SMP Thu May 13 18:17:17 BST 2021 armv7l GNU/Linux

@br0adcast
Copy link
Author

I just reproduced a bug in an i686 virtual machine with Arch Linux 32 + lts kernel Linux 5.10.32-1.0-lts #1 SMP Thu, 22 Apr 2021 07:53:37 +0000 i686 GNU/Linux

@wmanley
Copy link
Contributor

wmanley commented May 20, 2021

I've tried to reproduce this with qemu by following https://ldpreload.com/blog/git-bisect-run . Code:

use tokio::io::AsyncReadExt;
use std::ffi::CStr;

fn main() -> std::io::Result<()> {

    if std::process::id() == 1 {
        //std::fs::create_dir("/dev")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
    }

    let res = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(test());

    match res {
        Ok(true) => println!("Success\n"),
        Ok(false) => println!("Failed\n"),
        Err(e) => println!("Error: {:?}\n", e),
    }

    if std::process::id() == 1 {
        unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
    }
    Ok(())
}

async fn test() -> std::io::Result<bool> {
    let mut file = tokio::fs::File::open("init").await?;
    let mut contents = vec![];
    file.read_to_end(&mut contents).await?;
    println!("File size = {}", contents.len());

    Ok(contents.len() != 4096)
}

Compile with:

cargo build --target=i686-unknown-linux-musl

And build initrd with:

cp ../tokio-test/target/i686-unknown-linux-musl/debug/tokio-test initrd/init &&
cd initrd &&
echo init | cpio -H newc -o | gzip > initrd.gz

Kernel built with:

git checkout v5.10.32
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig && make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12

Run with:

qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz

The test passes. Maybe I need a real filesystem, and not just initrd.

@wmanley
Copy link
Contributor

wmanley commented May 20, 2021

Yes, it fails with ext4.

Created ext4 partition with file:

dd if=/dev/zero of=myextpart bs=1M count=1000
mkfs.ext4 myextpart
mkdir mnt
sudo mount -o loop myextpart $PWD/mnt
sudo dd if=/dev/urandom of=mnt/myfile bs=1M count=1
sudo umount mnt

Modified the test program to mount this fs and inspect this file:

use tokio::io::AsyncReadExt;
use std::ffi::CStr;

fn main() -> std::io::Result<()> {

    if std::process::id() == 1 {
        //std::fs::create_dir("/dev")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
        std::fs::create_dir("/myext")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"/dev/sda\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/myext\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"ext4\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
    }

    let res = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(test());

    match res {
        Ok(true) => println!("Success\n"),
        Ok(false) => println!("Failed\n"),
        Err(e) => println!("Error: {:?}\n", e),
    }

    if std::process::id() == 1 {
        unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
    }
    Ok(())
}

async fn test() -> std::io::Result<bool> {
    let mut file = tokio::fs::File::open("/myext/myfile").await?;
    let mut contents = vec![];
    file.read_to_end(&mut contents).await?;
    println!("File size = {}", contents.len());

    Ok(contents.len() != 4096)
}

Run with partition attached as drive:

qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart

Now to bisect...

@wmanley
Copy link
Contributor

wmanley commented May 20, 2021

Bisecting with script bisect-test.sh:

#!/bin/bash -e

make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig &&
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12 &&
qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart >output || exit 125

if grep -q ^Success output; then
    exit 0
fi

if grep -q ^Failed output; then
    exit 1
fi

exit 125

and commands:

git bisect start
git bisect bad v5.10
git bisect good v5.4
git bisect run ./bisect-test.sh

Points to torvalds/linux@efa8480: "fs: RWF_NOWAIT should imply IOCB_NOIO"

If that isn't a smoking gun then I don't know what is.

Hmm, I've just tested the recently released v5.12 and it succeeds there, so maybe it's already been fixed. I'll bisect to find the fix.

@wmanley
Copy link
Contributor

wmanley commented May 20, 2021

It was fixed in 5.11: torvalds/linux@06c0444 neither commit message mentions anything about bugs. The fix seems incidental.

I'll also run some tests to see if this is limited to 32-bit systems.

@Darksonn
Copy link
Contributor

Darksonn commented May 21, 2021

So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.

@br0adcast
Copy link
Author

If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug.
This code in 'preadv2_safe' seems to work:

if 0 == bytes_read {
    Err(std::io::Error::from_raw_os_error(libc::EAGAIN))
} else if bytes_read < 0 {
    Err(std::io::Error::last_os_error())
} else {
    /* preadv2 returns the number of bytes read, e.g. the number of bytes that have
     * written into `unfilled`. So it's safe to assume that the data is now
     * initialised */
    dst.assume_init(bytes_read as usize);
    dst.advance(bytes_read as usize);
    Ok(())
}

@wmanley
Copy link
Contributor

wmanley commented May 21, 2021

I've just tested and I can confirm that the bug exists on x86-64 too, so it's not just a 32-bit issue.

So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.

5.9 and 5.10 are broken according to my testing. Otherwise it works fine. I've only tested mainline linux, not any of the stable branches. I also can't speak to what's in the distro kernels. I'll run the test on a few stable kernels to see if they're affected too.

If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug.

That would certainly work around the bug, but would incur thread synchronisation overhead that #3518 was intended to avoid.

@wmanley
Copy link
Contributor

wmanley commented May 21, 2021

Stable kernel testing:

5.6.19: OK
5.7.19: OK
5.8.18: OK
5.9.16: BAD
5.10.38: BAD
5.11.22: OK

So there's nothing on the stable branch that affects this issue.

Regarding checking the kernel version, something like this should do the job:

fn has_buggy_preadv2() -> bool {
    let mut uts : libc::utsname;
    let res = unsafe{libc::uname(&mut uts as *mut utsname)};
    if res < 0 {
        // Getting kernel version failed, assume that it's buggy
        return true;
    }
    uts.release[..4] == b"5.9." || uts.release[..5] == b"5.10." 
}

But I would like to talk to the kernel devs first.

@Darksonn
Copy link
Contributor

Sure, let me know what the kernel devs have to say.

@praseodym
Copy link

praseodym commented May 26, 2021

I'm running into the same problem on my amd64 machine with kernel 5.10.0-6-amd64 (Debian 5.10.28-1) and (edit) using an XFS filesystem. I observed that when the file that is being read is already in the page cache (e.g. newly written files, or test by running md5sum on the file) everything works as expected. However, when the file is not in the page cache (e.g. after dropping the page cache using sync; echo 1 > /proc/sys/vm/drop_caches) only 4096 bytes get read.

@wmanley
Copy link
Contributor

wmanley commented May 26, 2021

Sure, let me know what the kernel devs have to say.

Here's the email: https://lore.kernel.org/linux-fsdevel/fea8b16d-5a69-40f9-b123-e84dcd6e8f2e@www.fastmail.com/T/#u . There's been no response yet, so I guess we should just work around it. I'll prepare a PR.

SergioBenitez added a commit to rwf2/Rocket that referenced this issue May 27, 2021
@carllerche
Copy link
Member

So how are we going to test for kernel versions?

@wmanley
Copy link
Contributor

wmanley commented May 28, 2021

@br0adcast: Would you mind changing the issue title to "fs::File reads first 4096 bytes only on Linux v5.9 and v5.10". That way it'll be easier for people to find this bug.

@br0adcast br0adcast changed the title fs::File reads first 4096 bytes only on the Raspberry Pi fs::File reads first 4096 bytes only on Linux v5.9 and v5.10 May 28, 2021
@cynecx
Copy link
Contributor

cynecx commented May 28, 2021

@carllerche by calling uname (syscall)?

EDIT: Oh, I didn't see @Darksonn's PR, which is exactly doing that.

@carllerche
Copy link
Member

Sorry, I meant, how are we going to add multiple kernel versions to CI to test for kernel specific regressions in the future :)

zonyitoo added a commit to shadowsocks/shadowsocks-rust that referenced this issue May 30, 2021
@Darksonn
Copy link
Contributor

Just to make sure everyone heard about it: We released 1.6.1 a few days ago, which does not have this bug. The 1.6.0 release was yanked.

@xuyang0410
Copy link

xuyang0410 commented Jul 30, 2021

@praseodym wonderful. It doesn't matter with 4096 bytes(I have verified it and use 32,64,128,,, 4096, they all return 0 if I drop cache before preadv2 with RWF_NOWAIT flag). Look kernel commit fs: RWF_NOWAIT should imply IOCB_NOIO, with the change allowing read-ahead for IOCB_NOWAIT, we changed the RWF_NOWAIT semantics of only doing cached reads. So if we use sync or echo 1 to drop cache interface, then we will get 0. I plan to write a regression test to ltp community for this.

mkerrisk pushed a commit to mkerrisk/man-pages that referenced this issue Aug 8, 2021
To save the next person before they fall foul of it.  See
<https://lore.kernel.org/linux-fsdevel/fea8b16d-5a69-40f9-b123-e84dcd6e8f2e@www.fastmail.com/T/#u>
and <tokio-rs/tokio#3803> for more information.

Signed-off-by: Will Manley <will@williammanley.net>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
@sameer
Copy link

sameer commented May 29, 2024

Can this be closed now? The commit in question was reverted in 3fbcf1b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants