Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readahead reordering causes prefetcher resets #488

Closed
jamesbornholt opened this issue Aug 30, 2023 · 0 comments · Fixed by #556
Closed

Readahead reordering causes prefetcher resets #488

jamesbornholt opened this issue Aug 30, 2023 · 0 comments · Fixed by #556
Labels
bug Something isn't working

Comments

@jamesbornholt
Copy link
Member

Mountpoint for Amazon S3 version

mount-s3 1.0.0-unofficial+09f556e

AWS Region

No response

Describe the running environment

EC2

What happened?

I ran this command against my mount:

cat ~/mnt/test20MiB > /dev/null & cat ~/mnt/test20MiB2 > /dev/null & cat ~/mnt/test20MiB3 > /dev/null &

i.e., reading three separate files concurrently.

In the logs, I see:

2023-08-30T03:04:41.030144Z  INFO mountpoint_s3::metrics: prefetch.out_of_order: 6 (n=6)

so the prefetcher is resetting even though each cat is reading sequentially.

It looks like what we're seeing is readahead and multithreading conflicting. For example:

2023-08-30T03:04:39.541005Z DEBUG fuser::request: FUSE( 42) ino 0x0000000000000003 READ fh FileHandle(2), offset 131072, size 262144
2023-08-30T03:04:39.541049Z DEBUG fuser::request: FUSE( 44) ino 0x0000000000000003 READ fh FileHandle(2), offset 393216, size 262144
2023-08-30T03:04:39.541118Z TRACE read{req=44 ino=3 fh=2 offset=393216 size=262144}: mountpoint_s3::fs: fs:read with ino 3 fh 2 offset 393216 size 262144
2023-08-30T03:04:39.541144Z TRACE read{req=44 ino=3 fh=2 offset=393216 size=262144}: mountpoint_s3::prefetch: read offset=393216 length=262144 next_seq_offset=131072
2023-08-30T03:04:39.541153Z TRACE read{req=42 ino=3 fh=2 offset=131072 size=262144}: mountpoint_s3::fs: fs:read with ino 3 fh 2 offset 131072 size 262144
2023-08-30T03:04:39.541168Z TRACE read{req=44 ino=3 fh=2 offset=393216 size=262144}: mountpoint_s3::prefetch: out-of-order read, resetting prefetch expected=131072 actual=393216
2023-08-30T03:04:39.541213Z TRACE read{req=44 ino=3 fh=2 offset=393216 size=262144}: mountpoint_s3::prefetch: spawning request range=393216..1572864 size=1179648
[... 80 ms pass ...]
2023-08-30T03:04:39.624656Z TRACE read{req=42 ino=3 fh=2 offset=131072 size=262144}: mountpoint_s3::prefetch: read offset=131072 length=262144 next_seq_offset=655360
2023-08-30T03:04:39.624695Z TRACE read{req=42 ino=3 fh=2 offset=131072 size=262144}: mountpoint_s3::prefetch: out-of-order read, resetting prefetch expected=655360 actual=131072
2023-08-30T03:04:39.624731Z TRACE read{req=42 ino=3 fh=2 offset=131072 size=262144}: mountpoint_s3::prefetch: spawning request range=131072..1310720 size=117964

Here, the kernel issued two adjacent READ operations on the file handle, without waiting for the response to the first READ (this is the readahead part). Request 42 arrived first, but request 44 was served first, because we have at least two threads available for serving FUSE requests thanks to the auto-scaling from #411, and 44 won the race. So we reset the prefetcher for 44 because it was further along in the object than expected, and then when we finally got to serve 42 we had to reset it again because now it looks like we're seeking backwards. This reduces sequential read performance.

Separate but related issue: it also looks like we aren't actually telling the CRT to cancel the inflight GET request when this reset happens, so those requests are still running (and using bandwidth) even though we're never going to use the result.

Relevant log output

No response

@jamesbornholt jamesbornholt added the bug Something isn't working label Aug 30, 2023
jamesbornholt added a commit to jamesbornholt/mountpoint-s3 that referenced this issue Oct 13, 2023
Linux asynchronous readahead confuses our prefetcher by sometimes making
the stream appear to go backwards, even though the customer is actually
just reading sequentially (awslabs#488). The problem is that with parallel FUSE
threads, the two asynchronous read operations can arrive to the
prefetcher out of order.

This change allows us to tolerate a little bit of backwards seeking in a
prefetch stream. We keep around a little bit of previously read data and
can reload it in the event that a seek goes backwards. We do this by
creating a fake new request containing the rewound bytes, so that the
existing read logic will pick them up. I chose an arbitrary max for the
backwards seek buffer, big enough to handle Linux readahead.

This should fix the readahead issue: in my testing, I no longer saw slow
sequential reads, and the logs confirmed this seeking logic was being
triggered in both directions (forwards and backwards), consistent with
the readahead requests sometimes arriving out of order.
jamesbornholt added a commit to jamesbornholt/mountpoint-s3 that referenced this issue Oct 13, 2023
Linux asynchronous readahead confuses our prefetcher by sometimes making
the stream appear to go backwards, even though the customer is actually
just reading sequentially (awslabs#488). The problem is that with parallel FUSE
threads, the two asynchronous read operations can arrive to the
prefetcher out of order.

This change allows us to tolerate a little bit of backwards seeking in a
prefetch stream. We keep around a little bit of previously read data and
can reload it in the event that a seek goes backwards. We do this by
creating a fake new request containing the rewound bytes, so that the
existing read logic will pick them up. I chose an arbitrary max for the
backwards seek buffer, big enough to handle Linux readahead.

This should fix the readahead issue: in my testing, I no longer saw slow
sequential reads, and the logs confirmed this seeking logic was being
triggered in both directions (forwards and backwards), consistent with
the readahead requests sometimes arriving out of order.

Signed-off-by: James Bornholt <bornholt@amazon.com>
jamesbornholt added a commit to jamesbornholt/mountpoint-s3 that referenced this issue Oct 16, 2023
Linux asynchronous readahead confuses our prefetcher by sometimes making
the stream appear to go backwards, even though the customer is actually
just reading sequentially (awslabs#488). The problem is that with parallel FUSE
threads, the two asynchronous read operations can arrive to the
prefetcher out of order.

This change allows us to tolerate a little bit of backwards seeking in a
prefetch stream. We keep around a little bit of previously read data and
can reload it in the event that a seek goes backwards. We do this by
creating a fake new request containing the rewound bytes, so that the
existing read logic will pick them up. I chose an arbitrary max for the
backwards seek buffer, big enough to handle Linux readahead.

This should fix the readahead issue: in my testing, I no longer saw slow
sequential reads, and the logs confirmed this seeking logic was being
triggered in both directions (forwards and backwards), consistent with
the readahead requests sometimes arriving out of order.

Signed-off-by: James Bornholt <bornholt@amazon.com>
github-merge-queue bot pushed a commit that referenced this issue Oct 18, 2023
* Allow seeking forwards within the prefetch stream

Right now we reset the prefetcher any time it seeks forwards, even if
the distance it's seeking could be handled by inflight requests (in the
worst case, the bytes are already in our buffers, and we just throw them
away). That's expensive and slow!

This change allows us to seek forwards a limited distance into the
prefetch stream. When we see a seek of an acceptable distance, we
fast-forward through the stream to the desired target offset, dropping
the skipped bytes on the floor. We enforce a maximum seek distance,
which is a trade-off between streaming a lot of unnecessary bytes versus
an extra request's latency. I haven't put any careful thought into the
number.

This commit also sets us up to support backwards seeking, which will
come in the future.

Signed-off-by: James Bornholt <bornholt@amazon.com>

* Allow seeking backwards within a prefetch stream

Linux asynchronous readahead confuses our prefetcher by sometimes making
the stream appear to go backwards, even though the customer is actually
just reading sequentially (#488). The problem is that with parallel FUSE
threads, the two asynchronous read operations can arrive to the
prefetcher out of order.

This change allows us to tolerate a little bit of backwards seeking in a
prefetch stream. We keep around a little bit of previously read data and
can reload it in the event that a seek goes backwards. We do this by
creating a fake new request containing the rewound bytes, so that the
existing read logic will pick them up. I chose an arbitrary max for the
backwards seek buffer, big enough to handle Linux readahead.

This should fix the readahead issue: in my testing, I no longer saw slow
sequential reads, and the logs confirmed this seeking logic was being
triggered in both directions (forwards and backwards), consistent with
the readahead requests sometimes arriving out of order.

Signed-off-by: James Bornholt <bornholt@amazon.com>

* Fix Shuttle tests with new request size logic

The old test was hiding a bug because it used a hard coded part size of
8MB regardless of what the client used. #552 changed that and now this
test runs out of memory a lot because it degrades to doing 1 byte
requests. I don't think it's worth playing with the logic because it
requires a weird config to get there, so just fix the test.

Signed-off-by: James Bornholt <bornholt@amazon.com>

---------

Signed-off-by: James Bornholt <bornholt@amazon.com>
github-merge-queue bot pushed a commit that referenced this issue Dec 2, 2024
Update the CRT libraries to the latest releases. In particular, include:
* Amazon S3 introduces support for AWS Dedicated Local Zones.
([awslabs/aws-c-s3#465](awslabs/aws-c-s3#465))
 

<details>
  <summary>Full CRT changelog:</summary>
  
```
Submodule mountpoint-s3-crt-sys/crt/aws-c-auth 48d647bf..3982bd75:
  > Update CMake to 3.9 (#255)
Submodule mountpoint-s3-crt-sys/crt/aws-c-cal 2cb1d2ea..fbbe2612:
  > RSA PKCS1.5 SHA1 signing (#201)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#199)
  > Update CMake to 3.9 (#200)
Submodule mountpoint-s3-crt-sys/crt/aws-c-common f58e807d..be8ed873:
  > Fix test that made no sense (#1172)
  > Update CBMC proof tooling to latest releases (#1164)
  > Forward CMake variables to prebuilding dependencies (#1161)
  > Remove reliance on hardcoded user in ci (#1170)
  > Doc fix for cbor (#1171)
  > switch c compiler check to different cmake variable (#1169)
  > disable visibility hidden on old gcc (#1167)
  > fix empty xml node handling (#1168)
  > Unlink shutdown callback from ref count (#1166)
  > check if numa available or not before loading numa functions (#1163)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#1151)
  > Update CMake to 3.9 (#1159)
  > Support Swift CXX Interop  (#1160)
Submodule mountpoint-s3-crt-sys/crt/aws-c-compression f36d0167..c6c1191e:
  > Update CMake to 3.9 (#70)
Submodule mountpoint-s3-crt-sys/crt/aws-c-http 6068653e..fc3eded2:
  > Update for event loop API changes (#491)
  > Add cxx support (#490)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#486)
  > Update CMake to 3.9 (#489)
  > Tweak error message for AWS_ERROR_HTTP_RESPONSE_FIRST_BYTE_TIMEOUT (#488)
Submodule mountpoint-s3-crt-sys/crt/aws-c-io e3637404..fcb38c80:
  > Add an Option to disable retries (#694)
  > Update checksum based on previous PR changes (#695)
  > Add ML-KEM Support (#693)
  > Event loop public api (#691)
  > Add cxx support (#689)
  > Fix s2n cleanup (#687)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#671)
  > Update CMake to 3.9 (#686)
Submodule mountpoint-s3-crt-sys/crt/aws-c-s3 16701501..45894ed3:
  > Amazon S3 introduces support for AWS Dedicated Local Zones (#465)
  > Support trailing checksum with no signing (#459)
  > support if-none-match for upload (#462)
  > Use proper public event loop group API (#460)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#452)
  > Update CMake to 3.9 (#458)
  > Support header checksum (#454)
Submodule mountpoint-s3-crt-sys/crt/aws-c-sdkutils 4658412a..ce09f797:
  > Add cxx support (#48)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#45)
  > Update CMake to 3.9 (#47)
Submodule mountpoint-s3-crt-sys/crt/aws-checksums ce04ab00..3e4101b9:
  > fix predefines for bswap for old compilers (#99)
  > CRC big endian support (#97)
  > chore: Modified bug issue template to add checkbox to report potential regression. (#95)
  > Update CMake to 3.9 (#98)
Submodule mountpoint-s3-crt-sys/crt/aws-lc 8b2ebfcf..59828538:
  > Prepare for v1.40.0 release (#2019)
  > [EC] Use s2n-bignum point doubling for P-384 and P-521 (#2011)
  > Document TLS Server Renegotiation Behavior (#2018)
  > Fail FIPS rsa_keygen_pubexp on change (#2016)
  > Adding -verify and expanding -x509 options for our OpenSSL tool (#1951)
  > Upstream merge 2024-11-11 (#1985)
  > Implement PKCS7_encrypt and PKC7_decrypt (#1996)
  > [EC] Unify scalar_mul_public for ec_nistp curves (#2004)
  > Adding the OpenSSL s_client tool (#1959)
  > Add Clang 19 to CI (#1998)
  > [EC] Unify scalar_mul_base point for ec_nistp curves (#2003)
  > Add internal APIs for ML-DSA (#1999)
  > Test cleanup (#2000)
  > Minor improvement to DSA (ASN1) + DSA Tests (#1990)
  > Implement PKCS7_dataInit and PKCS7_dataFinal (#1816)
  > Addition of generic NIST-DSA PKEY and ASN1 to support ML-DSA (#1963)
  > Expose a bit of lhash/conf for Ruby (#1987)
  > Allow ASN1_get_object to parse indefinite and universal (#1994)
  > Added CRL tool to CLI (#1976)
  > Prepare release AWS-LC v1.39.0 (#1995)
  > Revert "Replace CONF's internal representation with something more typesafe" (#1986)
  > Add Cyrus-SASL to our CI (#1988)
  > Cleanup test File utilities (#1989)
  > Account for cipher auth with multiple cert slots (#1956)
  > Allocate 16k scratch on heap (#1991)
  > Add CRYPTO_sysrand benchmarks to speed.cc (#1978)
  > Update PQREADME to add link to the KEM readme file (#1973)
  > Avoid compiler warning (#1981)
  > Ruby Support - More EVP_PKEY_DSA (#1954)
  > Upstream merge 2024-10-23 (#1955)
  > CI gcc-4.8 - use 4.8.5 tag (#1980)
  > Fix sess_hits counter on the server (#1974)
  > Support Finished-based APIs for TLS 1.3 (#1952)
  > Fix i2d behavior for i2d_SSL_SESSION (#1966)
  > fix `-Wcast-function-type` build issues (#1972)
  > Prepare v1.38.0 release (#1975)
  > Expose AES_cfb1_encrypt and AES_cfb8_encrypt (#1967)
  > EDDSA PCT (#1968)
  > ML-KEM keygen Pairwise Consistency Test (#1964)
  > Coverity Fix Null Check (#1965)
  > Actually add support for SSL_get_server/peer_tmp_key (#1945)
  > Also test w/ gcc 4.8 (#1962)
  > Fixes for Coverity Alerts (#1960)
  > Add support for POINT_CONVERSION_HYBRID (#1936)
  > Ruby Support - DSA custom md (#1953)
  > Add PKCS7-internal BIO_f_md (#1886)
  > Add PKCS7-internal BIO_f_cipher (#1836)
  > Expand support for EVP_PKEY_HMAC (#1933)
  > Support encode or decode ∞ like OpenSSL (#1930)
  > Fix FIPS.md typo (#1950)
  > Missing functionality + Adding Nmap to our CI (#1915)
  > HKDF, HKDF_expand, and PBKDF Truncated SHA2-512 (#1946)
  > bump mysql CI to 9.1.0 (#1939)
  > PQ README (#1932)
  > Add p4p, bump up time (#1943)
  > Remove retries on PCT failure in EC and RSA key generation. (#1938)
  > Remove old Intel CPU types (#1942)
  > Upstream merge 2024 10 17 (#1934)
  > DH paramgen callback (#1928)
  > Add null check in dh testing (#1937)
  > Use illegal_parameter instead of decode_error for invalid key shares (#1923)
  > Also prune SSM documents from ec2-test-framework (#1925)
  > Marshalling/Unmarshalling DH public keys (#1916)
  > 800-131Ar1: length of the key-derivation key shall be at least 112 bits. (#1924)
  > Prepare 1.37.0 release (#1927)
  > Add 2024 FIPS and fix build issues on older arm FIPS (#1920)
  > Align X509 PARTIAL_CHAIN behavior with 1.1.1 (#1917)
  > P161732527 coverity cleanup (#1918)
  > build: fix pkgconfig files (#1913)
  > Avoid allocating EVP_PKEY on size checks (#1911)
  > Add EC_GROUP mutablility to custom curves (#1881)
  > Implement more EVP_PKEY_DH functionality (#1880)
  > ML-DSA parameter refactor (#1910)
  > Update FIPS docs w/ certs (#1900)
  > Handle Windows not supporting static array dimension (#1912)
  > Remove duplicate s2n-bignum prefix include option (#1909)
  > Add support for EVP_PKEY_CTX callback functions (#1905)
  > P159598331 coverity cleanup (#1908)
  > Add Alpine-Linux-x86 to GitHub Actions CI (#1753)
  > Upstream merge 2024 09 16 (#1862)
  > Update Dilithium from crystals upstream (#1894)
  > Create mutable EC_GROUP API for OpenSSL compatibility (#1860)
  > ML-KEM FIPS 203 destruction of intermediate values (#1883)
  > Remove special s2n-bignum symbol handling sauce from build (#1903)
Submodule mountpoint-s3-crt-sys/crt/s2n-tls ffe0bf42..493b7716:
  >  feat: Reworking cleanup behavior (#4871)
  > chore: broaden use of flaky mark (#4865)
  > chore: configure dependabot (#4861)
  > fix: fix open AF_INET sockets in s2n_self_talk_ktls_test.c (#4852)
  > chore: update github PR template (#4885)
  > feat: add new security policy `20241106` (#4874)
  > chore: remove unused benchmarks (#4869)
  > ci: Clean dup source tree for CRT (#4882)
  > ci: remove www.mozilla.com from well-known to unblock CI (#4880)
  > fix: move prelude inclusion as PRIVATE (#4876)
  > build: add s2n_prelude.h to consolidate defines (#4465)
  > chore: bindings release 0.3.6 (#4867)
  > doc: fix incorrect README references (#4863)
  > fix: typo in comment of s2n_self_talk_tls13_test (#4864)
  > fix: close all /dev/urandom open fds (#4835)
  > docs: update fips documentation to specify supported libcrypto (#4857)
  > fix(bindings): correct poll_flush implementation (#4859)
  > feat: Adds cleanup_final (#4853)
  > test(bindings): Consolidate test pems (#4858)
  > chore: bindings release 0.3.5 (#4860)
  > chore: grant duvet action more permissions (#4854)
  > (feat): Adds certificate match metrics API (#4844)
  > chore: Fix failing OIDC workflows; cleanup unused actions (#4848)
  > chore(GHA): Update duvet arguments (#4850)
  > chore: remove unused compile definition (#4815)
  > Add new MLKEM TLS Policies (#4830)
  > fix: fix opened AF_UNIX sockets that didn't call s2n_io_pair_close (#4833)
  > bindings: pin openssl crate to 0.10.66 (#4849)
  > chore: flip 2 GHAs to use short lived creds. (#4839)
  > fix: fix s2n_io_pair_close_one_end (#4841)
  > ci: Re-enable asan and ubsan for fuzz tests (#4840)
  > fix: some open AF_UNIX sockets in forked child processes (#4834)
  > Update FIPS rules for ML-KEM (#4829)
  > ci: update ubuntu versions (#4828)
  > Add initial support for MLKEM768 (without any new Security Policies) (#4816)
  > chore: Adds print statements to help debug s2n_dynamic_load_test (#4836)
  > ci: add more libcryptos for fuzz batch & follow cmake idioms (#4795)
  > feature: bump cert authorities max size to 20kb (#4832)
  > ci: Add ubuntu24 with a new cmake buildspec (#4824)
  > Add ML-KEM Feature Probe and Test (#4823)
  > docs: update stateful resumption doc (#4818)
  > chore: remove make fuzz and AFL fuzz (#4808)
```
</details>

---

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license and I agree to the terms of
the [Developer Certificate of Origin
(DCO)](https://developercertificate.org/).

Signed-off-by: Alessandro Passaro <alexpax@amazon.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant