SPR1-813: Handle truncated reads, socket timeouts and reset connections (aka glitches) in S3ChunkStore #363

Merged (26 commits) on May 4, 2023

Conversation

@ludwigschwardt (Contributor) commented Mar 9, 2023

This improves the robustness of katdal's archive access.

While glitches were typically picked up while reading the response header, the response body had no such checks, as is the wont of the Requests library (only discovered that now 😅). Since the bulk of the data resides in the body, this meant that chunk transfers were rarely retried.

Fix this by integrating retries into the response reading too. Maintain a single Retry object per session, which is passed to the request, retrieved from the response and updated in case the response content has issues. Completely overhaul S3ChunkStore.request. Replace TruncatedRead with the more standard IncompleteRead exception. Retries can now be overridden per request.

This gets rid of the clunky fake 555 status code that treats truncated reads as status errors (they should be read errors instead).

Ensure that glitches are also handled when downloading data in non-streaming fashion via response.content (e.g. when downloading RDB files). Add a whole bunch of unit tests that verify glitch behaviour.

This addresses #296 and #362.

Review tip: There was quite an evolution as I tried and abandoned various approaches. I would start with the full diff and work backwards to specific commits to make sense of it.

The S3 chunk store has a custom HTTP response reader (`_read_chunk`)
for performance reasons. Unfortunately this bypasses the urllib3 retry
machinery, also for read timeouts while streaming the response content.

We are already doing our own retries for status errors, so extend this
to deal with read timeouts too. The basic idea is to catch the socket
timeout (the old `socket.timeout` or newer `TimeoutError` since 3.10)
and turn it into an urllib3 `ReadTimeoutError` to trigger a retry.
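A minimal sketch of that translation, assuming a streamed Requests response
(the pool argument of `ReadTimeoutError` is left as None for illustration)::

  import socket
  from urllib3.exceptions import ReadTimeoutError

  def read_streamed(raw_response, url, nbytes):
      """Read a streamed response, turning socket timeouts into
      ReadTimeoutError so that urllib3's Retry machinery can act on them."""
      try:
          return raw_response.read(nbytes)
      except (socket.timeout, TimeoutError) as err:
          raise ReadTimeoutError(None, url, str(err)) from err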

The most important bit is that we now also test this :-) We use token
proxy suggestions to pause the reading mid-stream for a specified
period.

Report the number of retries when the request finally fails, as an
assurance that retries actually happened.

This addresses #362 and TESD-58 and maybe even SPR1-368.
The tests are divided into groups based on the error type, so put
the main description of each group above the methods.
The read error tests check both the header and main body sections of
the NPY chunks, since they use different mechanisms (read and readinto).

Keep the same number of tests but parameterise them to shorten the code.
This drops the connection early without even returning a status, by
sending a TCP reset (RST) packet. The end result is that the client
throws a socket error ("Connection reset by peer," errno 54 or 104).
These kinds of errors pop up when the chunk store server is overloaded.

Ideally, reset connections should be retried or turned into
ChunkNotFound if all else fails. This currently fails in the tests.
Check chunks when recovering from 50x server errors. Use more
descriptive assertion failure messages.
The initial `reset-connection` suggestion breaks off the conversation
very early, even before returning headers. The more common case is
a connection reset while the actual response data is being read.
Extend the read suggestions to include a reset after X bytes, but
also keep the original suggestion for now as it tests a different
problem.

Use the opportunity to clean up the response data writing a bit.
When testing the token S3 server, add a suggestion to truncate the
RDB file after 1000 bytes. (For the record, the RDB file is 4719 bytes
long.) This triggers the exact error reported 2 years ago to the day
in SPR1-813. The test fails with::

  "katsdptelstate.errors.RdbParseError: Invalid RDB file object"

As the ticket said: "Think of some unit tests to elicit this behaviour."
When the connection is reset before anything is returned, urllib3
and/or requests treat it as a connection error instead of a read error.
If the reset occurs after some data has been returned, it is a read
error.

Since this is a grey area anyway (see e.g. psf/requests#2230 and
urllib3/urllib3#470), produce the simplest code and worry about the
semantics later. The S3 chunk store is safe to retry under most
circumstances.

This updates the unit test, for the time being, to detect a connection error, but
it still fails.
Choose different numbers of retries for connect, read and status,
allowing a weak check of which one was invoked. This also shaves
a second off the run time.
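For illustration, a `Retry` with distinct budgets per error class (the exact
numbers here are hypothetical) looks like this::

  from urllib3.util.retry import Retry

  retries = Retry(connect=2, read=3, status=4,
                  status_forcelist=[500, 502, 503, 504],
                  backoff_factor=0.1)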
We are working towards the ability to override both timeouts and retries
per request. This means that we have to construct or normalise
a timeout tuple and `Retry` object in multiple places. Add two helper
functions to build and sanitise these objects. This also makes the
__init__ and request code more understandable.
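A rough sketch of what such helpers could look like (the names and exact
rules are illustrative, not katdal's actual code)::

  from urllib3.util.retry import Retry

  def _sanitise_retries(retries):
      """Turn an integer retry count into a Retry object; pass Retry through."""
      return retries if isinstance(retries, Retry) else Retry(total=retries)

  def _sanitise_timeout(timeout):
      """Return a (connect, read) timeout tuple; None means wait forever."""
      return timeout if isinstance(timeout, tuple) else (timeout, timeout)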

In the process, fix the docstring for `timeout`. We accept `None`
instead of a float and it means to wait forever (no timeout).
This happens to be the default in Requests, so it will "leave things
unchanged", but that description is misleading otherwise.
In preparation for retrying reset connections, restore the standalone
`_raise_for_status` function, after removing it 3 years ago (with
apologies to 1e4404b :-) ). Back then the concern was that it would
need access to more attributes of the store, but now we simply pass
those attributes -- and throw in `retries` to boot. The main benefit
is a cleaner `S3ChunkStore.request` function, which promotes insight
into the looming retry rework.

The `complete_request` method has no need for `chunk_name` anymore,
so absorb it into the kwargs. Change `chunk_name` from positional
to keyword-only in the process.
This changes the default action of the `process` parameter of the
`S3ChunkStore.complete_request` method.

If the response is not assigned to a variable or used, it is dropped
on the floor anyway, which is pretty much the prior behaviour. Maybe
hanging on will hog some resources?
Here is a recap of how our HTTP requests happen::

  -- S3ChunkStore.complete_request (<- proposed retry level)
  -> S3ChunkStore.request
  -> requests.sessions.Session.request
  -> requests.adapters.HTTPAdapter.send (<- retries currently set here)
  -> urllib3.connectionpool.HTTPConnectionPool.urlopen (do retries here)
  -> urllib3.response.HTTPResponse
  -> http.client.HTTPConnection / HTTPResponse

Requests does timeouts but not retries. Retries are a urllib3 thing.
Requests allows you to pass an urllib3 `Retry` object to `HTTPAdapter`
for use on all subsequent requests.
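For context, this is the standard way a `Retry` object is attached to a
Requests session (generic usage, not katdal-specific)::

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  session = requests.Session()
  adapter = HTTPAdapter(max_retries=Retry(connect=2, read=3, status=4))
  session.mount('http://', adapter)
  session.mount('https://', adapter)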

Instead of doing connect retries and some read retries inside the
urllib3 request and other read and status retries at a higher level,
switch off retries at the `HTTPAdapter` level and do them all in
`complete_request`. The `Retry` configuration makes more sense
because the process is driven by a single `Retry` instance. This also
simplifies overriding retries per request.

This is a game of exceptions. The trick is to turn higher-level
exceptions (typically from Requests) back down into the appropriate
mid-level exceptions (urllib3) that can interact with the `Retry`
object. Sorta like pluripotent stem cells.

- For connect retries we need urllib3's `ConnectTimeoutError`.
- For read retries we need `ReadTimeoutError` (useful for socket
  timeouts) or `ProtocolError` (which covers 104 resets + short reads).
- Status retries need an urllib3.HTTPResponse as input (and no error).

We can use the `_standard_errors` mechanism to translate exceptions
by setting up an `error_map` containing::

- `requests.exceptions.ConnectionError` -> `ConnectTimeoutError`
- The standard `ConnectionResetError` aka 104 -> `ProtocolError`

In addition, let `TruncatedRead` derive from `ProtocolError`, which
will automatically trigger a read retry when it's raised. There is no
more need to catch it and reraise it as another exception.
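An illustrative shape of this setup (the `_standard_errors` mechanism itself
is katdal-internal; the mapping and subclass below merely restate the two
points above)::

  import requests
  from urllib3.exceptions import ConnectTimeoutError, ProtocolError

  # Translate high-level exceptions back into retry-friendly urllib3 ones
  error_map = {
      requests.exceptions.ConnectionError: ConnectTimeoutError,  # connect retry
      ConnectionResetError: ProtocolError,                       # read retry
  }

  class TruncatedRead(ProtocolError):
      """Short read; deriving from ProtocolError triggers a read retry."""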

Since we rely on a raise_for_status mechanism, we have to shuttle the
response via an exception to the `Retry` object for status retries.
Use urllib3's own ResponseError for this and simply jam the high-level
Requests response unceremoniously into its args. This also gets rid of
the pesky fake 555 status (good riddance!).

The `S3ServerGlitch` reverts to a normal exception that is only raised
at the end to indicate a temporary server failure (read / status fail).

The unit tests for truncated reads and reset connections now all pass.
The only leftover is truncated RDB files.

This addresses #296, SR-2019, SPR1-1654 and possibly more.
In an epic about-turn, do the connect and status retries inside urllib3,
and only the response read retries on the outside. This avoids having
to reinvent the retry wheel (e.g. redirects??). Status retries can
finally go back into urllib3 because we got rid of the 555 fake status
code.

The trick is to override `max_retries` on the HTTPAdapter of the
Requests Session. Use the same `Retry` object throughout, retrieving it
from the response after each request, incrementing it on response errors
and passing it back to the adapter.
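A rough sketch of that retrieve-increment-pass-back cycle, in generic
Requests / urllib3 terms rather than katdal's exact code::

  from urllib3.exceptions import ProtocolError

  response = session.get(url, stream=True)
  try:
      data = response.raw.read(65536)
  except ProtocolError as err:
      # The urllib3 response carries the Retry object that governed the
      # request; increment() returns a new Retry or raises MaxRetryError.
      retries = response.raw.retries.increment(method='GET', url=url, error=err)
      session.get_adapter(url).max_retries = retries
      # ... reissue the request and read again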

To simplify the code, introduce a standalone `_request` function that
catches socket timeouts and reset connections on the response and turns
it into the appropriate urllib3 exceptions. This replaces the quirky
`S3ChunkStore._request`.

When the retries run out, catch either urllib3's `MaxRetryError` (raised
while reading the response) or the appropriate Requests exception. Map
these to S3ServerGlitch using the error map, for simplicity. A possible
caveat is `requests.exceptions.ConnectionError`, which now becomes
`StoreUnavailable` but this might be triggered by read failures (e.g.
too many ProtocolErrors while reading the headers).

Let the RDB reader use `complete_request`. This still fails in the
token-based unit test, because IncompleteReads are not handled directly
yet and TruncatedRead only applies to `read_array`.

We now want the default `raise_on_status=True` since urllib3 is handling
those retries.
Use the same parameters as for `test_persistent_reset_connections`.
However, thanks to Requests exception mangling this turns into a
ConnectionError, which becomes StoreUnavailable like before, even
though it is a read retry.
Since there is no more lower-level `S3ChunkStore.request`.
It is now safe to modify on the store object, since it will only be
used as default once a request is made. Previously it was buried inside
the session factory, which made it pointless to update.
This now raises urllib3's `IncompleteRead` instead of creating its
own version (`TruncatedRead`). It also wraps `readinto` just like it
did for `read`, to be more generic.

Part of the fun is figuring out how many bytes were read up to this
point, and how many bytes are left in the response (more useful than
the requested bytes in the actual read / readinto call). If we are at
the urllib3 level, this is simple. The faster (and lower) httplib level
does not track the bytes read and is unseekable, so track it ourselves
via `_content_length`.
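A minimal sketch of the `readinto` wrapping and byte tracking described here
(the actual `_DetectTruncation` machinery differs in detail; this just shows
the principle)::

  from urllib3.exceptions import IncompleteRead

  class TruncationDetector:
      """Wrap a raw response and raise IncompleteRead on short reads."""

      def __init__(self, raw, content_length):
          self._raw = raw
          self._content_length = content_length  # total bytes expected
          self._bytes_read = 0

      def readinto(self, buffer):
          n = self._raw.readinto(buffer)
          self._bytes_read += n
          if n == 0 and self._bytes_read < self._content_length:
              raise IncompleteRead(self._bytes_read,
                                   self._content_length - self._bytes_read)
          return n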

Turn IncompleteReads into ProtocolErrors in `S3ChunkStore._request`,
so that they can trigger read retries.
Check that server errors, truncated reads and reset connections
result in the expected lower-level errors (50x responses,
IncompleteReads and ConnectionResetErrors, respectively) which will
then trigger retries. This is more a test of the unit test machinery
itself, as well as a verification that IncompleteReads come up as
expected.
RDB files are large (several MB) and things can go wrong during their
retrieval. Extend the read retry mechanism to them too. Check for short
reads after downloading the content and raise an IncompleteRead before
it hits `telstate.load_from_file` and turns into an `RdbParseError`.
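A sketch of that short-read check on a non-streamed response (names are
illustrative)::

  from urllib3.exceptions import IncompleteRead

  content = response.content
  expected = int(response.headers.get('Content-Length', len(content)))
  if len(content) < expected:
      # Raise before the truncated bytes reach telstate.load_from_file
      raise IncompleteRead(len(content), expected - len(content))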

Extend the test setup so that the token also allows paths that start
with suggestions. This is necessary because the RDB URL is used to
find the chunk store URL, by going one level up.

For the `test_rdb_support` test we now have::

RDB: /please-truncate-read-after-1000-bytes-for-0.4-seconds/1234567890/1234567890_sdp_l0.rdb
store: /please-truncate-read-after-1000-bytes-for-0.4-seconds/

If the suggestion was in the middle of the RDB like before, it breaks
because the chunk store will be located inside the RDB bucket::

RDB: /1234567890/please-truncate-read-after-1000-bytes-for-0.4-seconds/1234567890_sdp_l0.rdb
store: /1234567890/

A side effect of this is that not just the RDB file, but all chunks
larger than 1000 bytes will be dutifully truncated and recovered, since
the suggestion appears in their URLs too :-)

This finally addresses @IanHeywood's comment in #362 two weeks ago
that sparked this whole PR...
RDB files are loaded as generic Ceph objects using `_read_object`
and a non-streaming request. This actually preloads the response
content as part of `Session.request` (see the end of `Session.send`).
The ConnectionResetError therefore happens even before the response
is returned.

The solution is to put the `session.request` with-statement inside
the `try...except` block in `S3ChunkStore._request`, so that we can
catch glitches inside the request itself.

The second part of the solution is to mangle the exceptions back to
urllib3's `ProtocolError`. When accessing the response `content`
property, ConnectionResetErrors will become ProtocolErrors, only to
be turned into ChunkedEncodingErrors by Requests. Revert that back
to a ProtocolError in order to trigger retries in the calling context.
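Illustratively, the unwrapping looks something like this (a sketch, not
katdal's exact code)::

  import requests
  from urllib3.exceptions import ProtocolError

  try:
      content = response.content
  except requests.exceptions.ChunkedEncodingError as err:
      # Requests wraps the underlying ProtocolError; restore it so the retry
      # loop in the calling context can treat it like any other read error.
      raise ProtocolError(str(err)) from err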

Expand the tests to check what happens when RDB loading finally fails,
and to check whether generic objects can survive some basic glitches.

For the record, I tried to use urllib3's `enforce_content_length`
property on `HTTPResponse`, but had no real luck with it in both
streaming and non-streaming cases, so leave the _DetectTruncation
machinery in place.
We want to test NPY files with version 2.0 headers, but the `np.save`
function picks version 1.0 by default unless your header size actually
exceeds 64 KiB. The relevant test therefore creates a 70 KB header
to force version 2.0 files.

NumPy >= 1.23.5 now complains if the header size is larger than 10 KB.
It is considered a risk to pass that to `literal_eval`. One way around
it is to increase the threshold via the `max_header_size` parameter
to the header reader functions, but that only exists on recent NumPy
versions. The crash is also avoided by allowing pickles, but that is
not appealing either.

The solution is to use the lower-level `write_array` function, which
allows us to set the header version explicitly. This also removes the
need to suppress a warning about version 2 compatibility.
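For example, a version 2.0 NPY file can be written explicitly like this::

  import io
  import numpy as np
  from numpy.lib import format as npy_format

  buffer = io.BytesIO()
  # Force a version 2.0 header regardless of the header size
  npy_format.write_array(buffer, np.zeros(10), version=(2, 0))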
On my laptop the reset connection originates from a connection error
("Connection reset by peer") while the same test elicits an
IncompleteRead on Jenkins. Better to remove this check if it is going
to be platform-dependent.
@richarms

In testing with both an isolated docker container, and from within a virtual environment, I get the following error consistently using the spr1-813-handle-read-errors branch, but not with the master branch:

  File "/home/kat/ve3/bin/mvftoms.py", line 916, in <module>
    main()
  File "/home/kat/ve3/bin/mvftoms.py", line 679, in main
    averager.average_visibilities(vis_data, weight_data, flag_data,
  File "/home/kat/ve3/lib/python3.8/site-packages/katdal/averager.py", line 149, in average_visibilities
    _average_visibilities(vis, weight, flag, timeav, chanav, flagav)
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
    reraise(type(e), e, None)
  File "/home/kat/ve3/lib/python3.8/site-packages/numba/core/utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function empty>) found for signature:

 >>> empty(Literal[int](128), dtype=Function(<class 'bool'>))

There are 2 candidate implementations:
   - Of which 2 did not match due to:
   Overload of function 'empty': File: numba/core/typing/npydecl.py: Line 504.
     With argument(s): '(int64, dtype=Function(<class 'bool'>))':
    No match.

During: resolving callee type: Function(<built-in function empty>)
During: typing of call at /home/kat/ve3/lib/python3.8/site-packages/katdal/averager.py (49)


File "../ve3/lib/python3.8/site-packages/katdal/averager.py", line 49:
def _average_visibilities(vis, weight, flag, timeav, chanav, flagav):
    <source elided>
        weight_sum = np.empty(bl_step, weight.dtype)
        flag_any = np.empty(bl_step, dtype=bool)
        ^

Any ideas what might be going on?

@ludwigschwardt (Contributor, Author)

That is a bit strange... I would have thought that this is a Numba + NumPy version issue, but if those are the same between the two branches, it's hard to say.

If I read the error message correctly, there is a signature mismatch on np.empty.

Try replacing the line

bl_step = 128

with

bl_step = np.int64(128)

@ludwigschwardt (Contributor, Author)

Thanks Richard! Could you please add some notes about the testing you did to the SPR1-813 ticket? And what happened to the Numba issue discussed above?
