Skip to content

Commit

Permalink
O_DIRECT reads unaligned page size filesize
Browse files Browse the repository at this point in the history
Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads
with the following test case:

dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79
truncate -s 40382 /local_zpool/file2
zpool export local_zpool
zpool import -d ~/ local_zpool
dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct

That led to following panic happening:

[  307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed
[  307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter()
[  307.788743] Showing stack for process 9665
[  307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted:
P           OE    --------- -  - 4.18.0-408.el8.x86_64 openzfs#1
[  307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21
10/08/2020
[  307.811682] Call Trace:
[  307.814131]  dump_stack+0x41/0x60
[  307.817449]  spl_panic+0xd0/0xe8 [spl]
[  307.821210]  ? irq_work_queue+0x9/0x20
[  307.824961]  ? wake_up_klogd.part.30+0x30/0x40
[  307.829407]  ? vprintk_emit+0x125/0x250
[  307.833246]  ? printk+0x58/0x6f
[  307.836391]  spl_assert.constprop.1+0x16/0x20 [zfs]
[  307.841438]  abd_fletcher_4_iter+0x6c/0x101 [zfs]
[  307.846343]  ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs]
[  307.851922]  abd_iterate_func+0xb1/0x170 [zfs]
[  307.856533]  abd_fletcher_4_impl+0x3f/0xa0 [zfs]
[  307.861334]  abd_fletcher_4_native+0x52/0x70 [zfs]
[  307.866302]  ? enqueue_entity+0xf1/0x6e0
[  307.870226]  ? select_idle_sibling+0x23/0x700
[  307.874587]  ? enqueue_task_fair+0x94/0x710
[  307.878771]  ? select_task_rq_fair+0x351/0x990
[  307.883208]  zio_checksum_error_impl+0xff/0x5f0 [zfs]
[  307.888435]  ? abd_fletcher_4_impl+0xa0/0xa0 [zfs]
[  307.893401]  ? spl_kmem_alloc_impl+0xce/0xf0 [spl]
[  307.898203]  ? __wake_up_common+0x7a/0x190
[  307.902300]  ? __switch_to_asm+0x41/0x70
[  307.906220]  ? __switch_to_asm+0x35/0x70
[  307.910145]  ? __switch_to_asm+0x41/0x70
[  307.914061]  ? __switch_to_asm+0x35/0x70
[  307.917980]  ? __switch_to_asm+0x41/0x70
[  307.921903]  ? __switch_to_asm+0x35/0x70
[  307.925821]  ? __switch_to_asm+0x35/0x70
[  307.929739]  ? __switch_to_asm+0x41/0x70
[  307.933658]  ? __switch_to_asm+0x35/0x70
[  307.937582]  zio_checksum_error+0x47/0xc0 [zfs]
[  307.942288]  raidz_checksum_verify+0x3a/0x70 [zfs]
[  307.947257]  vdev_raidz_io_done+0x4b/0x160 [zfs]
[  307.952049]  zio_vdev_io_done+0x7f/0x200 [zfs]
[  307.956669]  zio_execute+0xee/0x210 [zfs]
[  307.960855]  taskq_thread+0x203/0x420 [spl]
[  307.965048]  ? wake_up_q+0x70/0x70
[  307.968455]  ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[  307.974807]  ? taskq_lowest_id+0xc0/0xc0 [spl]
[  307.979260]  kthread+0x10a/0x120
[  307.982485]  ? set_kthread_struct+0x40/0x40
[  307.986670]  ret_from_fork+0x35/0x40

The reason this was occuring was because by doing the zpool export that
meant the initial read of O_DIRECT would be forced to go down to disk.
In this case it was still valid as bs=1M is still page size aligned;
howver, the file length was not. So when issuing the O_DIRECT read even
after calling make_abd_for_dbuf() we had an extra page allocated in the
original ABD along with the linear ABD attached at the end of the gang
abd from make_abd_for_dbuf().

This is an issue as it is our expectations with read that the block
sizes being read are page aligned. When we do our check we are only
checking the request but not the actual size of data we may read such as
the entire file.

In order to remedy this situation, I updated zfs_read() to attempt to
read as much as it can using O_DIRECT based on if the length is
page-sized aligned. Any additional bytes that are requested are then
read into the ARC. This still stays with our semantics that IO requests
must be page sized aligned.

There are a bit of draw backs here if there is only a single block being
read. In this case the block will be read twice. Once using O_DIRECT and
then using buffered to fill in the remaining data for the users request.
However, this should not be a big issue most of the time. In the normal
case a user may ask for a lot of data from a file and only the stray
bytes at the end of the file will have to be read using the ARC.

In order to make sure this case was completely covered, I added a new
ZTS test case dio_unaligned_filesize to test this out. The main thing
with that test case is the first O_DIRECT read will issue out two reads
two being O_DIRECT and the third being buffered for the remaining
requested bytes.

As part of this commit, I also updated stride_dd to take an additional
parameter of -e, which says read the entire input file and ingore the
count (-c) option. We need to use stride_dd for FreeBSD as dd does not
make sure the buffer is page aligned. This udpate to stride_dd allows us
to use it to test out this case in dio_unaligned_filesize for both Linux
and FreeBSD.

While this may not be the most elegant solution, it does stick with the
semanatics and still reads all the data the user requested. I am fine
with revisiting this and maybe we just return a short read?

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
  • Loading branch information
bwatkinson committed Sep 10, 2024
1 parent 6a34939 commit 4714a79
Show file tree
Hide file tree
Showing 7 changed files with 268 additions and 63 deletions.
46 changes: 40 additions & 6 deletions module/zfs/zfs_vnops.c
Original file line number Diff line number Diff line change
Expand Up @@ -380,14 +380,32 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
ssize_t chunk_size = zfs_vnops_read_chunk_size;
ssize_t n = MIN(zfs_uio_resid(uio), zp->z_size - zfs_uio_offset(uio));
ssize_t start_resid = n;
ssize_t dio_remaining_resid = 0;

/*
* All pages for an O_DIRECT request have already been mapped so
* there's no compelling reason to handle this uio is smaller chunks.
*/
if (uio->uio_extflg & UIO_DIRECT)
if (uio->uio_extflg & UIO_DIRECT) {
/*
* All pages for an O_DIRECT request ahve already been mapped
* so there's no compelling reason to handle this uio in
* smaller chunks.
*/
chunk_size = DMU_MAX_ACCESS;

/*
* In the event that the O_DIRECT request is reading the entire
* file, it is possible file's length is not page sized
* aligned. However, lower layers expect that the Direct I/O
* request is page-aligned. In this case, as much of the file
* that can be read using Direct I/O happens and the remaining
* amount will be read through the ARC.
*
* This is still consistent with the semantics of Direct I/O in
* ZFS as at a minimum the I/O request must be page-aligned.
*/
dio_remaining_resid = n - P2ALIGN(n, PAGE_SIZE);
if (dio_remaining_resid != 0)
n -= dio_remaining_resid;
}

while (n > 0) {
ssize_t nbytes = MIN(n, chunk_size -
P2PHASE(zfs_uio_offset(uio), chunk_size));
Expand Down Expand Up @@ -427,7 +445,23 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
n -= nbytes;
}

int64_t nread = start_resid - n;
int64_t nread = start_resid;
if (error == 0 && (uio->uio_extflg & UIO_DIRECT) &&
dio_remaining_resid != 0) {
/*
* Temporarily remove the UIO_DIRECT flag from the UIO so the
* remainder of the file can be read using the ARC.
*/
uio->uio_extflg &= ~UIO_DIRECT;
error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio,
dio_remaining_resid);
uio->uio_extflg |= UIO_DIRECT;

if (error != 0)
n -= dio_remaining_resid;
}
nread -= n;

dataset_kstats_update_read_kstats(&zfsvfs->z_kstat, nread);
out:
zfs_rangelock_exit(lr);
Expand Down
3 changes: 2 additions & 1 deletion tests/runfiles/common.run
Original file line number Diff line number Diff line change
Expand Up @@ -697,7 +697,8 @@ tags = ['functional', 'delegate']
tests = ['dio_aligned_block', 'dio_async_always', 'dio_async_fio_ioengines',
'dio_compression', 'dio_dedup', 'dio_encryption', 'dio_grow_block',
'dio_max_recordsize', 'dio_mixed', 'dio_mmap', 'dio_property',
'dio_random', 'dio_recordsize', 'dio_unaligned_block']
'dio_random', 'dio_recordsize', 'dio_unaligned_block',
'dio_unaligned_filesize']
tags = ['functional', 'direct']

[tests/functional/exec]
Expand Down
152 changes: 104 additions & 48 deletions tests/zfs-tests/cmd/stride_dd.c
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ static int if_o_direct = 0;
static int of_o_direct = 0;
static int skip = 0;
static int skipbytes = 0;
static int entire_file = 0;
static const char *execname = "stride_dd";

static void usage(void);
Expand All @@ -42,10 +43,10 @@ static void
usage(void)
{
(void) fprintf(stderr,
"usage: %s -i inputfile -o outputfile -b blocksize -c count \n"
"usage: %s -i inputfile -o outputfile -b blocksize [-c count]\n"
" [-s stride] [-k seekblocks] [-K seekbytes]\n"
" [-a alignment] [-d if_o_direct] [-D of_o_direct]\n"
" [-p skipblocks] [-P skipbytes]\n"
" [-p skipblocks] [-P skipbytes] [-e entire_file]\n"
"\n"
"Simplified version of dd that supports the stride option.\n"
"A stride of n means that for each block written, n - 1 blocks\n"
Expand All @@ -56,20 +57,23 @@ usage(void)
" inputfile: File to read from\n"
" outputfile: File to write to\n"
" blocksize: Size of each block to read/write\n"
" count: Number of blocks to read/write\n"
" count: Number of blocks to read/write (Required"
" unless -e is used)\n"
" stride: Read/write a block then skip (stride - 1) blocks"
"\n"
" seekblocks: Number of blocks to skip at start of output\n"
" seekbytes: Treat seekblocks as byte count\n"
" alignment: Alignment passed to posix_memalign() (default "
" alignment: Alignment passed to posix_memalign() (default"
" PAGE_SIZE)\n"
" if_o_direct: Use O_DIRECT with inputfile (default no O_DIRECT)"
"\n"
" of_o_direct: Use O_DIRECT with outputfile (default no "
" O_DIRECT)\n"
" skipblocks: Number of blocks to skip at start of input "
" (default 0)\n"
" skipbytes: Treat skipblocks as byte count\n",
" skipbytes: Treat skipblocks as byte count\n"
" entire_file: When used the entire inputfile will be read and"
" count will be ignored\n",
execname);
(void) exit(1);
}
Expand Down Expand Up @@ -103,7 +107,7 @@ parse_options(int argc, char *argv[])
extern char *optarg;
extern int optind, optopt;

while ((c = getopt(argc, argv, "a:b:c:dDi:o:s:k:Kp:P")) != -1) {
while ((c = getopt(argc, argv, "a:b:c:deDi:o:s:k:Kp:P")) != -1) {
switch (c) {
case 'a':
alignment = atoi(optarg);
Expand All @@ -121,6 +125,10 @@ parse_options(int argc, char *argv[])
if_o_direct = 1;
break;

case 'e':
entire_file = 1;
break;

case 'D':
of_o_direct = 1;
break;
Expand Down Expand Up @@ -172,25 +180,106 @@ parse_options(int argc, char *argv[])
}
}

if (bsize <= 0 || count <= 0 || stride <= 0 || ifile == NULL ||
ofile == NULL || seek < 0 || invalid_alignment(alignment) ||
skip < 0) {
if (bsize <= 0 || stride <= 0 || ifile == NULL || ofile == NULL ||
seek < 0 || invalid_alignment(alignment) || skip < 0) {
(void) fprintf(stderr,
"Required parameter(s) missing or invalid.\n");
(void) usage();
}

if (count <= 0 && entire_file == 0) {
(void) fprintf(stderr,
"Required parameter(s) missing or invalid.\n");
(void) usage();
}
}

static void
read_entire_file(int ifd, int ofd, void *buf)
{
int c;

do {
c = read(ifd, buf, bsize);
if (c < 0) {
perror("read");
exit(2);
} else if (c != 0) {
c = write(ofd, buf, bsize);
if (c < 0) {
perror("write");
exit(2);
}

}

if (stride > 1) {
if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("input lseek");
exit(2);
}
if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("output lseek");
exit(2);
}
}
} while (c != 0);
}

static void
read_on_count(int ifd, int ofd, void *buf)
{
int i;
int c;

for (i = 0; i < count; i++) {
c = read(ifd, buf, bsize);
if (c != bsize) {
if (c < 0) {
perror("read");
} else {
(void) fprintf(stderr,
"%s: unexpected short read, read %d "
"bytes, expected %d\n", execname,
c, bsize);
}
exit(2);
}

c = write(ofd, buf, bsize);
if (c != bsize) {
if (c < 0) {
perror("write");
} else {
(void) fprintf(stderr,
"%s: unexpected short write, wrote %d "
"bytes, expected %d\n", execname,
c, bsize);
}
exit(2);
}

if (stride > 1) {
if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("input lseek");
exit(2);
}
if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("output lseek");
exit(2);
}
}
}
}

int
main(int argc, char *argv[])
{
int i;
int ifd;
int ofd;
int ifd_flags = O_RDONLY;
int ofd_flags = O_WRONLY | O_CREAT;
void *buf;
int c;

parse_options(argc, argv);

Expand Down Expand Up @@ -241,44 +330,11 @@ main(int argc, char *argv[])
}
}

for (i = 0; i < count; i++) {
c = read(ifd, buf, bsize);
if (c != bsize) {
if (c < 0) {
perror("read");
} else {
(void) fprintf(stderr,
"%s: unexpected short read, read %d "
"bytes, expected %d\n", execname,
c, bsize);
}
exit(2);
}

c = write(ofd, buf, bsize);
if (c != bsize) {
if (c < 0) {
perror("write");
} else {
(void) fprintf(stderr,
"%s: unexpected short write, wrote %d "
"bytes, expected %d\n", execname,
c, bsize);
}
exit(2);
}
if (entire_file == 1)
read_entire_file(ifd, ofd, buf);
else
read_on_count(ifd, ofd, buf);

if (stride > 1) {
if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("input lseek");
exit(2);
}
if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) {
perror("output lseek");
exit(2);
}
}
}
free(buf);

(void) close(ofd);
Expand Down
1 change: 1 addition & 0 deletions tests/zfs-tests/tests/Makefile.am
Original file line number Diff line number Diff line change
Expand Up @@ -1474,6 +1474,7 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \
functional/direct/dio_random.ksh \
functional/direct/dio_recordsize.ksh \
functional/direct/dio_unaligned_block.ksh \
functional/direct/dio_unaligned_filesize.ksh \
functional/direct/dio_write_verify.ksh \
functional/direct/dio_write_stable_pages.ksh \
functional/direct/setup.ksh \
Expand Down
29 changes: 29 additions & 0 deletions tests/zfs-tests/tests/functional/direct/dio.kshlib
Original file line number Diff line number Diff line change
Expand Up @@ -300,3 +300,32 @@ function check_read # pool file bs count skip flags buf_rd dio_rd
log_fail "Direct reads $dio_rd_actual of $dio_rd_expect"
fi
}

function get_file_size
{
typeset filename="$1"

if is_linux; then
filesize=$(stat -c %s $filename)
else
filesize=$(stat -s $filename | awk '{print $8}' | grep -o '[0-9]\+')
fi

echo $filesize
}

function do_truncate_reduce
{
typeset filename=$1
typeset size=$2

filesize=$(get_file_size $filename)
eval "echo original filesize: $filesize"
if is_linux; then
truncate $filename -s $((filesize - size))
else
truncate -s -$size $filename
fi
filesize=$(get_file_size $filename)
eval "echo new filesize after truncate: $filesize"
}
Loading

0 comments on commit 4714a79

Please sign in to comment.