mvftoms.py throws connection reset by peer error #296

Open
SpheMakh opened this issue Mar 26, 2020 · 19 comments

@SpheMakh

I'm using katdal 0.15 in an Ubuntu 18.04 Docker container.

scan   4 ( 599 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Added new field 1: 'J1939-6342' 19:39:25.03 -63:42:45.6
Wrote scan data (201912.166424 MiB) in 2285.384612 s (88.349316 MiBps)

scan   5 ( 602 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Traceback (most recent call last):
  File "/usr/local/bin/mvftoms.py", line 816, in <module>
    main()
  File "/usr/local/bin/mvftoms.py", line 566, in main
    scan_vis_data, scan_weight_data, scan_flag_data)
  File "/usr/local/bin/mvftoms.py", line 92, in load
    out=[vis, weights, flags])
  File "/usr/local/lib/python3.6/dist-packages/katdal/lazy_indexer.py", line 594, in get
    da.store(kept, out, lock=False)
  File "/usr/local/lib/python3.6/dist-packages/dask/array/core.py", line 951, in store
    result.compute(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/threaded.py", line 84, in get
    **kwargs
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.6/dist-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore.py", line 243, in get_chunk_or_zeros
    return self.get_chunk(array_name, slices, dtype)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 610, in get_chunk
    headers=headers, stream=True)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 587, in complete_request
    result = process(response)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 173, in _read_chunk
    chunk = read_array(data._fp)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 151, in read_array
    bytes_read = fp.readinto(memoryview(data.view(np.uint8)))
  File "/usr/lib/python3.6/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
@SpheMakh
Author

This is the ID of the data I'm trying to download: 1584577476

@SpheMakh
Author

SpheMakh commented Apr 7, 2020

@ludwigschwardt any ideas?

@ludwigschwardt
Contributor

I could not recreate the error. An error message like this typically means the server slammed down the phone on its side, indicating a temporary overload that should go away if you try again later.

I'm still a bit sad, since my latest improvements aim to catch these errors and turn them into flagged missing data without crashing the script. I'm keeping this open to remind me to catch this error too.
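
For illustration, a minimal sketch of that idea, in the spirit of katdal's existing get_chunk_or_zeros (the helper name and the returned flag here are hypothetical, not katdal's actual API):

import numpy as np

def get_chunk_or_flagged_zeros(store, array_name, slices, dtype):
    """Fetch a chunk; if the server hangs up, return zeros plus a 'lost' flag.

    Hypothetical helper for illustration only - katdal's real chunk store has
    its own error handling and flag conventions.
    """
    try:
        return store.get_chunk(array_name, slices, dtype), False
    except ConnectionResetError:
        # Server slammed down the phone mid-read: hand back flagged missing
        # data instead of letting the whole conversion crash.
        shape = tuple(s.stop - s.start for s in slices)
        return np.zeros(shape, dtype), True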

@SpheMakh
Author

SpheMakh commented Apr 8, 2020

I've tried a couple of times already, and it fails at the same point each time, at around 500 GB. But I'll give it another go.

@ludwigschwardt
Contributor

ludwigschwardt commented Apr 8, 2020

Interesting... I tried the following:

import katdal
from katdal.lazy_indexer import DaskLazyIndexer

d = katdal.open('...')
d.select(scans=5)   # which is where you are getting stuck
# Load dump 0 once to obtain output arrays of the right shape
v, w, f = DaskLazyIndexer.get([d.vis, d.weights, d.flags], 0)
# Then load each dump of the scan in turn, reusing those arrays
for n in range(602):
    print(n)
    DaskLazyIndexer.get([d.vis, d.weights, d.flags], n, out=[v, w, f])

It made it all the way to the end... Maybe try this on your setup.

@bennahugo
Contributor

bennahugo commented Apr 8, 2020 via email

@ludwigschwardt
Contributor

katdal 0.15 is pretty new (just pre-lockdown).

@SpheMakh
Author

This only happens when I'm running in a Docker container; it works fine outside a container. @ludwigschwardt, are there any containers that use mvftoms.py that you know of? Maybe I made a mistake in building mine.

@spassmoor
Contributor

I have not had this issue running on my Docker container that has mvftoms.py. I should also point out that I run it on one of the comm machines, which would mitigate any bad network problems.

@SpheMakh
Author

@spassmoor I'm running on com08. Can you share the Dockerfile?

@SpheMakh
Author

Thanks, I'll give it a go.

@ludwigschwardt
Contributor

Just remember that this is a public thread, in case your zip contains sensitive info :-)

@spassmoor
Contributor

Other than my work email address and my preferred version of tornado, I don't think there is anything sensitive in it.

@sarrvesh

sarrvesh commented Feb 25, 2021

Hey folks, I encountered a similar issue using katdal 0.17. mvftoms.py failed for me with a connection timeout error on my dataset (1596945366). I tried running @ludwigschwardt's script above and that failed with the same timeout error. Any ideas on how to solve this issue?

@bennahugo
Contributor

bennahugo commented Feb 25, 2021 via email

@sarrvesh

I get the same error in a container environment and in a normal virtualenv installation. It seems to fail right away with the following error:

StoreUnavailable: Chunk '1596945366-sdp-l0/correlator_data/00168_00000_00000': HTTPConnectionPool(host='archive-gw-1.kat.ac.za', port=7480): Max retries exceeded with url: /1596945366-sdp-l0/correlator_data/00168_00000_00000.npy (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f7263391860>, 'Connection to archive-gw-1.kat.ac.za timed out. (connect timeout=30)'))

My network otherwise works just fine.

@ludwigschwardt
Contributor

ludwigschwardt commented Feb 25, 2021

Hi @sarrvesh, a connection timeout indicates that you could not even start to talk to the archive server, i.e. the phone just rings and rings and nobody picks up. This differs from a connection reset (the topic of this issue), which is the server slamming down the phone in the middle of your conversation.

I see that you are trying to connect to port 7480. Are you on a machine in the CHPC cluster? If not, you'll need to connect to port 443, aka https, and use a token as provided by the RDB link button on the archive. So instead of

d = katdal.open('http://archive-gw-1.kat.ac.za:7480/1596945366/1596945366_sdp_l0.full.rdb')

try

d = katdal.open('https://archive-gw-1.kat.ac.za/1596945366/1596945366_sdp_l0.full.rdb?token=<your-token>')

I managed to download dump 168 with both methods just now, so the server is up and your dataset is intact. That's the good news 😄

This issue also occurs if you download the RDB file to your local disk and then open it via

d = katdal.open('1596945366_sdp_l0.full.rdb')

That trick only works on the CHPC cluster, or if you also copied all the data to your local disk, since the RDB file only contains the 7480 URL and won't know about the token.

@sarrvesh

Ah, interesting. Yeah, that works. Thanks very much.

@ludwigschwardt
Contributor

ludwigschwardt commented Feb 25, 2021

Pleasure!

I now remember that there's another option for feeding in the token with a local RDB file, which is to treat it like a URL:

d = katdal.open('1596945366_sdp_l0.full.rdb?token=<your-token>')

Although I'm not sure if that will go the https route...

ludwigschwardt added a commit that referenced this issue Mar 9, 2023
Here is a recap of how our HTTP requests happen::

  -- S3ChunkStore.complete_request (<- proposed retry level)
  -> S3ChunkStore.request
  -> requests.sessions.Session.request
  -> requests.adapters.HTTPAdapter.send (<- retries currently set here)
  -> urllib3.connectionpool.HTTPConnectionPool.urlopen (do retries here)
  -> urllib3.response.HTTPResponse
  -> http.client.HTTPConnection / HTTPResponse

Requests does timeouts but not retries. Retries are a urllib3 thing.
Requests allows you to pass an urllib3 `Retry` object to `HTTPAdapter`
for use on all subsequent requests.
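
For reference, hooking a Retry policy into an adapter looks roughly like this generic Requests/urllib3 snippet (not katdal's exact configuration):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Every request through this session now retries connect/read failures on its own.
session.mount('https://', HTTPAdapter(max_retries=Retry(connect=2, read=2)))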

Instead of doing connect retries and some read retries inside the
urllib3 request and other read and status retries at a higher level,
switch off retries at the `HTTPAdapter` level and do them all in
`complete_request`. The `Retry` configuration makes more sense
because the process is driven by a single `Retry` instance. This also
simplifies overriding retries per request.
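
A rough sketch of that shape, assuming a plain GET per chunk (the `process`
callback stands in for whatever reads the response body; this is an
illustration of the mechanism, not the actual katdal code):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import ConnectTimeoutError, ProtocolError

session = requests.Session()
# No retries inside urllib3 / Requests: the loop below is the single retry driver.
session.mount('https://', HTTPAdapter(max_retries=0))

def complete_request(url, process, retries=None):
    """GET `url` and run `process` on the response, under one Retry policy."""
    retries = retries if retries is not None else Retry(connect=2, read=2)
    while True:
        try:
            response = session.get(url, timeout=30, stream=True)
            response.raise_for_status()
            return process(response)
        except requests.exceptions.ConnectionError as err:
            # Could not reach (or lost) the server: count it as a connect failure.
            error = ConnectTimeoutError(str(err))
        except ConnectionResetError as err:
            # Errno 104 while reading the body: count it as a read failure.
            error = ProtocolError(str(err))
        # increment() hands back a decremented policy, or raises MaxRetryError
        # once the relevant budget (connect or read) is exhausted.
        retries = retries.increment(method='GET', url=url, error=error)
        retries.sleep()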

This is a game of exceptions. The trick is to turn higher-level
exceptions (typically from Requests) back down into the appropriate
mid-level exceptions (urllib3) that can interact with the `Retry`
object. Sorta like pluripotent stem cells.

- For connect retries we need urllib3's `ConnectTimeoutError`.
- For read retries we need `ReadTimeoutError` (useful for socket
  timeouts) or `ProtocolError` (which covers 104 resets + short reads).
- Status retries need an urllib3.HTTPResponse as input (and no error).

We can use the `_standard_errors` mechanism to translate exceptions
by setting up an `error_map` containing::

- `requests.exceptions.ConnectionError` -> `ConnectTimeoutError`
- The standard `ConnectionResetError` aka 104 -> `ProtocolError`

In addition, let `TruncatedRead` derive from `ProtocolError`, which
will automatically trigger a read retry when it's raised. There is no
more need to catch it and reraise it as another exception.
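
As an illustration of why that works (a hypothetical reader, not katdal's
actual read_array), a short read raised as a ProtocolError subclass is counted
against the same read-retry budget as a reset connection:

from urllib3.exceptions import ProtocolError

class TruncatedRead(ProtocolError):
    """The server closed the response before sending all the promised bytes."""

def read_exactly(fp, nbytes):
    # Loop over readinto() until the buffer is full, raising TruncatedRead
    # if the stream ends early; Retry.increment() then treats it as a read error.
    buf = bytearray(nbytes)
    view = memoryview(buf)
    read = 0
    while read < nbytes:
        n = fp.readinto(view[read:])
        if not n:
            raise TruncatedRead(f'Expected {nbytes} bytes, got only {read}')
        read += n
    return bytes(buf)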

Since we rely on a raise_for_status mechanism, we have to shuttle the
response via an exception to the `Retry` object for status retries.
Use urllib3's own ResponseError for this and simply jam the high-level
Requests response unceremoniously into its args. This also gets rid of
the pesky fake 555 status (good riddance!).

The `S3ServerGlitch` reverts to a normal exception that is only raised
at the end to indicate a temporary server failure (read / status fail).

The unit tests for truncated reads and reset connections now all pass.
The only leftover is truncated RDB files.

This addresses #296, SR-2019, SPR1-1654 and possibly more.