SPR1-813: Handle truncated reads, socket timeouts and reset connections (aka glitches) in S3ChunkStore #363

Merged (26 commits) on May 4, 2023

Conversation

@ludwigschwardt (Contributor) commented Mar 9, 2023

This improves the robustness of katdal's archive access.

While glitches were typically picked up while reading the response header, the response body had no such checks, as is the wont of the Requests library (only discovered that now 😅). Since the bulk of the data resides in the body, this meant that chunk transfers were rarely retried.

Fix this by integrating retries into the response reading too. Maintain a single Retry object per session, which is passed to the request, retrieved from the response and updated in case the response content has issues. Completely overhaul S3ChunkStore.request. Replace TruncatedRead with the more standard IncompleteRead exception. Retries can now be overridden per request.

This gets rid of the clunky fake 555 status code that treats truncated reads as status errors (they should be read errors instead).

Ensure that glitches are also handled when downloading data in non-streaming fashion via response.content (e.g. when downloading RDB files). Add a whole bunch of unit tests that verify glitch behaviour.

This addresses #296 and #362.

Review tip: There was quite an evolution as I tried and abandoned various approaches. I would start with the full diff and work backwards to specific commits to make sense of it.

The S3 chunk store has a custom HTTP response reader (`_read_chunk`)
for performance reasons. Unfortunately this bypasses the urllib3 retry
machinery, also for read timeouts while streaming the response content.

We are already doing our own retries for status errors, so extend this
to deal with read timeouts too. The basic idea is to catch the socket
timeout (the old `socket.timeout` or newer `TimeoutError` since 3.10)
and turn it into an urllib3 `ReadTimeoutError` to trigger a retry.
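A minimal sketch of that translation, assuming a streamed Requests response
(the pool argument of `ReadTimeoutError` is left as None for illustration)::

  import socket
  from urllib3.exceptions import ReadTimeoutError

  def read_streamed(raw_response, url, nbytes):
      """Read a streamed response, turning socket timeouts into
      ReadTimeoutError so that urllib3's Retry machinery can act on them."""
      try:
          return raw_response.read(nbytes)
      except (socket.timeout, TimeoutError) as err:
          raise ReadTimeoutError(None, url, str(err)) from err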

The most important bit is that we now also test this :-) We use token
proxy suggestions to pause the reading mid-stream for a specified
period.

Report the number of retries when the request finally fails, as an
assurance that retries actually happened.

This addresses #362 and TESD-58 and maybe even SPR1-368.
The tests are divided into groups based on the error type, so put
the main description of each group above the methods.
The read error tests check both the header and main body sections of
the NPY chunks, since they use different mechanisms (read and readinto).

Keep the same number of tests but parameterise them to shorten the code.
This drops the connection early without even returning a status, by
sending a TCP reset (RST) packet. The end result is that the client
throws a socket error ("Connection reset by peer," errno 54 or 104).
These kinds of errors pop up when the chunk store server is overloaded.

Ideally, reset connections should be retried or turned into
ChunkNotFound if all else fails. This currently fails in the tests.
Check chunks when recovering from 50x server errors. Use more
descriptive assertion failure messages.
The initial `reset-connection` suggestion breaks off the conversation
very early, even before returning headers. The more common case is
a connection reset while the actual response data is being read.
Extend the read suggestions to include a reset after X bytes, but
also keep the original suggestion for now as it tests a different
problem.

Use the opportunity to clean up the response data writing a bit.
When testing the token S3 server, add a suggestion to truncate the
RDB file after 1000 bytes. (For the record, the RDB file is 4719 bytes
long.) This triggers the exact error reported 2 years ago to the day
in SPR1-813. The test fails with::

  "katsdptelstate.errors.RdbParseError: Invalid RDB file object"

As the ticket said: "Think of some unit tests to elicit this behaviour."
When the connection is reset before anything is returned, urllib3
and/or requests treat it as a connection error instead of a read error.
If the reset occurs after some data has been returned, it is a read
error.

Since this is a grey area anyway (see e.g. psf/requests#2230 and
urllib3/urllib3#470), produce the simplest code and worry about the
semantics later. The S3 chunk store is safe to retry under most
circumstances.

This updates the unit test, for the time being, to detect a connection error, but
it still fails.
Choose different numbers of retries for connect, read and status,
allowing a weak check of which one was invoked. This also shaves
a second off the run time.
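For illustration, a `Retry` with distinct budgets per error class (the exact
numbers here are hypothetical) looks like this::

  from urllib3.util.retry import Retry

  retries = Retry(connect=2, read=3, status=4,
                  status_forcelist=[500, 502, 503, 504],
                  backoff_factor=0.1)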
We are working towards the ability to override both timeouts and retries
per request. This means that we have to construct or normalise
a timeout tuple and `Retry` object in multiple places. Add two helper
functions to build and sanitise these objects. This also makes the
__init__ and request code more understandable.
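A rough sketch of what such helpers could look like (the names and exact
rules are illustrative, not katdal's actual code)::

  from urllib3.util.retry import Retry

  def _sanitise_retries(retries):
      """Turn an integer retry count into a Retry object; pass Retry through."""
      return retries if isinstance(retries, Retry) else Retry(total=retries)

  def _sanitise_timeout(timeout):
      """Return a (connect, read) timeout tuple; None means wait forever."""
      return timeout if isinstance(timeout, tuple) else (timeout, timeout)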

In the process, fix the docstring for `timeout`. We accept `None`
instead of a float and it means to wait forever (no timeout).
This happens to be the default in Requests, so it will "leave things
unchanged", but that description is misleading otherwise.
In preparation for retrying reset connections, restore the standalone
`_raise_for_status` function, after removing it 3 years ago (with
apologies to 1e4404b :-) ). Back then the concern was that it would
need access to more attributes of the store, but now we simply pass
those attributes -- and throw in `retries` to boot. The main benefit
is a cleaner `S3ChunkStore.request` function, which promotes insight
into the looming retry rework.

The `complete_request` method has no need for `chunk_name` anymore,
so absorb it into the kwargs. Change `chunk_name` from positional
to keyword-only in the process.
This changes the default action of the `process` parameter of the
`S3ChunkStore.complete_request` method.

If the response is not assigned to a variable or used, it is dropped
on the floor anyway, which is pretty much the prior behaviour. Maybe
hanging on will hog some resources?
Here is a recap of how our HTTP requests happen::

  -- S3ChunkStore.complete_request (<- proposed retry level)
  -> S3ChunkStore.request
  -> requests.sessions.Session.request
  -> requests.adapters.HTTPAdapter.send (<- retries currently set here)
  -> urllib3.connectionpool.HTTPConnectionPool.urlopen (do retries here)
  -> urllib3.response.HTTPResponse
  -> http.client.HTTPConnection / HTTPResponse

Requests does timeouts but not retries. Retries are a urllib3 thing.
Requests allows you to pass an urllib3 `Retry` object to `HTTPAdapter`
for use on all subsequent requests.
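For context, this is the standard way a `Retry` object is attached to a
Requests session (generic usage, not katdal-specific)::

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  session = requests.Session()
  adapter = HTTPAdapter(max_retries=Retry(connect=2, read=3, status=4))
  session.mount('http://', adapter)
  session.mount('https://', adapter)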

Instead of doing connect retries and some read retries inside the
urllib3 request and other read and status retries at a higher level,
switch off retries at the `HTTPAdapter` level and do them all in
`complete_request`. The `Retry` configuration makes more sense
because the process is driven by a single `Retry` instance. This also
simplifies overriding retries per request.

This is a game of exceptions. The trick is to turn higher-level
exceptions (typically from Requests) back down into the appropriate
mid-level exceptions (urllib3) that can interact with the `Retry`
object. Sorta like pluripotent stem cells.

- For connect retries we need urllib3's `ConnectTimeoutError`.
- For read retries we need `ReadTimeoutError` (useful for socket
  timeouts) or `ProtocolError` (which covers 104 resets + short reads).
- Status retries need an urllib3.HTTPResponse as input (and no error).

We can use the `_standard_errors` mechanism to translate exceptions
by setting up an `error_map` containing::

- `requests.exceptions.ConnectionError` -> `ConnectTimeoutError`
- The standard `ConnectionResetError` aka 104 -> `ProtocolError`

In addition, let `TruncatedRead` derive from `ProtocolError`, which
will automatically trigger a read retry when it's raised. There is no
more need to catch it and reraise it as another exception.
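An illustrative shape of this setup (the `_standard_errors` mechanism itself
is katdal-internal; the mapping and subclass below merely restate the two
points above)::

  import requests
  from urllib3.exceptions import ConnectTimeoutError, ProtocolError

  # Translate high-level exceptions back into retry-friendly urllib3 ones
  error_map = {
      requests.exceptions.ConnectionError: ConnectTimeoutError,  # connect retry
      ConnectionResetError: ProtocolError,                       # read retry
  }

  class TruncatedRead(ProtocolError):
      """Short read; deriving from ProtocolError triggers a read retry."""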

Since we rely on a raise_for_status mechanism, we have to shuttle the
response via an exception to the `Retry` object for status retries.
Use urllib3's own ResponseError for this and simply jam the high-level
Requests response unceremoniously into its args. This also gets rid of
the pesky fake 555 status (good riddance!).

The `S3ServerGlitch` reverts to a normal exception that is only raised
at the end to indicate a temporary server failure (read / status fail).

The unit tests for truncated reads and reset connections now all pass.
The only leftover is truncated RDB files.

This addresses #296, SR-2019, SPR1-1654 and possibly more.
In an epic about-turn, do the connect and status retries inside urllib3,
and only the response read retries on the outside. This avoids having
to reinvent the retry wheel (e.g. redirects??). Status retries can
finally go back into urllib3 because we got rid of the 555 fake status
code.

The trick is to override `max_retries` on the HTTPAdapter of the
Requests Session. Use the same `Retry` object throughout, retrieving it
from the response after each request, incrementing it on response errors
and passing it back to the adapter.
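A rough sketch of that retrieve-increment-pass-back cycle, in generic
Requests / urllib3 terms rather than katdal's exact code::

  from urllib3.exceptions import ProtocolError

  response = session.get(url, stream=True)
  try:
      data = response.raw.read(65536)
  except ProtocolError as err:
      # The urllib3 response carries the Retry object that governed the
      # request; increment() returns a new Retry or raises MaxRetryError.
      retries = response.raw.retries.increment(method='GET', url=url, error=err)
      session.get_adapter(url).max_retries = retries
      # ... reissue the request and read again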

To simplify the code, introduce a standalone `_request` function that
catches socket timeouts and reset connections on the response and turns
it into the appropriate urllib3 exceptions. This replaces the quirky
`S3ChunkStore._request`.

When the retries run out, catch either urllib3's `MaxRetryError` (raised
while reading the response) or the appropriate Requests exception. Map
these to S3ServerGlitch using the error map, for simplicity. A possible
caveat is `requests.exceptions.ConnectionError`, which now becomes
`StoreUnavailable` but this might be triggered by read failures (e.g.
too many ProtocolErrors while reading the headers).

Let the RDB reader use `complete_request`. This still fails in the
token-based unit test, because IncompleteReads are not handled directly
yet and TruncatedRead only applies to `read_array`.

We now want the default `raise_on_status=True` since urllib3 is handling
those retries.
Use the same parameters as for `test_persistent_reset_connections`.
However, thanks to Requests exception mangling this turns into a
ConnectionError, which becomes StoreUnavailable like before, even
though it is a read retry.
Since there is no more lower-level `S3ChunkStore.request`.
It is now safe to modify on the store object, since it will only be
used as default once a request is made. Previously it was buried inside
the session factory, which made it pointless to update.
This now raises urllib3's `IncompleteRead` instead of creating its
own version (`TruncatedRead`). It also wraps `readinto` just like it
did for `read`, to be more generic.

Part of the fun is figuring out how many bytes were read up to this
point, and how many bytes are left in the response (more useful than
the requested bytes in the actual read / readinto call). If we are at
the urllib3 level, this is simple. The faster (and lower) httplib level
does not track the bytes read and is unseekable, so track it ourselves
via `_content_length`.
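A minimal sketch of the `readinto` wrapping and byte tracking described here
(the actual `_DetectTruncation` machinery differs in detail; this just shows
the principle)::

  from urllib3.exceptions import IncompleteRead

  class TruncationDetector:
      """Wrap a raw response and raise IncompleteRead on short reads."""

      def __init__(self, raw, content_length):
          self._raw = raw
          self._content_length = content_length  # total bytes expected
          self._bytes_read = 0

      def readinto(self, buffer):
          n = self._raw.readinto(buffer)
          self._bytes_read += n
          if n == 0 and self._bytes_read < self._content_length:
              raise IncompleteRead(self._bytes_read,
                                   self._content_length - self._bytes_read)
          return n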

Turn IncompleteReads into ProtocolErrors in `S3ChunkStore._request`,
so that they can trigger read retries.
Check that server errors, truncated reads and reset connections
result in the expected lower-level errors (50x responses,
IncompleteReads and ConnectionResetErrors, respectively) which will
then trigger retries. This is more a test of the unit test machinery
itself, as well as a verification that IncompleteReads come up as
expected.
RDB files are large (several MB) and things can go wrong during their
retrieval. Extend the read retry mechanism to them too. Check for short
reads after downloading the content and raise an IncompleteRead before
it hits `telstate.load_from_file` and turns into an `RdbParseError`.
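A sketch of that short-read check on a non-streamed response (names are
illustrative)::

  from urllib3.exceptions import IncompleteRead

  content = response.content
  expected = int(response.headers.get('Content-Length', len(content)))
  if len(content) < expected:
      # Raise before the truncated bytes reach telstate.load_from_file
      raise IncompleteRead(len(content), expected - len(content))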

Extend the test setup so that the token also allows paths that start
with suggestions. This is necessary because the RDB URL is used to
find the chunk store URL, by going one level up.

For the `test_rdb_support` test we now have::

RDB: /please-truncate-read-after-1000-bytes-for-0.4-seconds/1234567890/1234567890_sdp_l0.rdb
store: /please-truncate-read-after-1000-bytes-for-0.4-seconds/

If the suggestion was in the middle of the RDB like before, it breaks
because the chunk store will be located inside the RDB bucket::

RDB: /1234567890/please-truncate-read-after-1000-bytes-for-0.4-seconds/1234567890_sdp_l0.rdb
store: /1234567890/

A side effect of this is that not just the RDB file, but all chunks
larger than 1000 bytes will be dutifully truncated and recovered, since
the suggestion appears in their URLs too :-)

This finally addresses @IanHeywood's comment in #362 two weeks ago
that sparked this whole PR...
RDB files are loaded as generic Ceph objects using `_read_object`
and a non-streaming request. This actually preloads the response
content as part of `Session.request` (see the end of `Session.send`).
The ConnectionResetError therefore happens even before the response
is returned.

The solution is to put the `session.request` with-statement inside
the `try...except` block in `S3ChunkStore._request`, so that we can
catch glitches inside the request itself.

The second part of the solution is to mangle the exceptions back to
urllib3's `ProtocolError`. When accessing the response `content`
property, ConnectionResetErrors will become ProtocolErrors, only to
be turned into ChunkedEncodingErrors by Requests. Revert that back
to a ProtocolError in order to trigger retries in the calling context.
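Illustratively, the unwrapping looks something like this (a sketch, not
katdal's exact code)::

  import requests
  from urllib3.exceptions import ProtocolError

  try:
      content = response.content
  except requests.exceptions.ChunkedEncodingError as err:
      # Requests wraps the underlying ProtocolError; restore it so the retry
      # loop in the calling context can treat it like any other read error.
      raise ProtocolError(str(err)) from err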

Expand the tests to check what happens when RDB loading finally fails,
and to check whether generic objects can survive some basic glitches.

For the record, I tried to use urllib3's `enforce_content_length`
property on `HTTPResponse`, but had no real luck with it in both
streaming and non-streaming cases, so leave the _DetectTruncation
machinery in place.
We want to test NPY files with version 2.0 headers, but the `np.save`
function picks version 1.0 by default unless your header size actually
exceeds 64 KiB. The relevant test therefore creates a 70 KB header
to force version 2.0 files.

NumPy >= 1.23.5 now complains if the header size is larger than 10 KB.
It is considered a risk to pass that to `literal_eval`. One way around
it is to increase the threshold via the `max_header_size` parameter
to the header reader functions, but that only exists on recent NumPy
versions. The crash is also avoided by allowing pickles, but that is
not appealing either.

The solution is to use the lower-level `write_array` function, which
allows us to set the header version explicitly. This also removes the
need to suppress a warning about version 2 compatibility.
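For example, a version 2.0 NPY file can be written explicitly like this::

  import io
  import numpy as np
  from numpy.lib import format as npy_format

  buffer = io.BytesIO()
  # Force a version 2.0 header regardless of the header size
  npy_format.write_array(buffer, np.zeros(10), version=(2, 0))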
On my laptop the reset connection originates from a connection error
("Connection reset by peer") while the same test elicits an
IncompleteRead on Jenkins. Better to remove this check if it is going
to be platform-dependent.
@richarms

In testing with both an isolated docker container, and from within a virtual environment, I get the following error consistently using the spr1-813-handle-read-errors branch, but not with the master branch:

  File "/home/kat/ve3/bin/mvftoms.py", line 916, in <module>
    main()
  File "/home/kat/ve3/bin/mvftoms.py", line 679, in main
    averager.average_visibilities(vis_data, weight_data, flag_data,
  File "/home/kat/ve3/lib/python3.8/site-packages/katdal/averager.py", line 149, in average_visibilities
    _average_visibilities(vis, weight, flag, timeav, chanav, flagav)
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
    reraise(type(e), e, None)
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function empty>) found for signature:

 >>> empty(Literal[int](128), dtype=Function(<class 'bool'>))

There are 2 candidate implementations:
   - Of which 2 did not match due to:
   Overload of function 'empty': File: numba/core/typing/npydecl.py: Line 504.
     With argument(s): '(int64, dtype=Function(<class 'bool'>))':
    No match.

During: resolving callee type: Function(<built-in function empty>)
During: typing of call at /home/kat/ve3/lib/python3.8/site-packages/katdal/averager.py (49)


File "../ve3/lib/python3.8/site-packages/katdal/averager.py", line 49:
def _average_visibilities(vis, weight, flag, timeav, chanav, flagav):
    <source elided>
        weight_sum = np.empty(bl_step, weight.dtype)
        flag_any = np.empty(bl_step, dtype=bool)
        ^

Any ideas what might be going on?

@ludwigschwardt (Contributor, Author)

That is a bit strange... I would have thought that this is a Numba + NumPy version issue, but if those are the same between the two branches, it's hard to say.

If I read the error message correctly, there is a signature mismatch on np.empty.

Try replacing the line

bl_step = 128

with

bl_step = np.int64(128)

@ludwigschwardt (Contributor, Author)

Thanks Richard! Could you please add some notes about the testing you did to the SPR1-813 ticket? And what happened to the Numba issue discussed above?
