mvftoms.py throws connection reset by peer error #296

Open
SpheMakh opened this issue Mar 26, 2020 · 19 comments

@SpheMakh

I'm using katdal 0.15 in an Ubuntu 18.04 Docker container.

scan   4 ( 599 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Added new field 1: 'J1939-6342' 19:39:25.03 -63:42:45.6
Wrote scan data (201912.166424 MiB) in 2285.384612 s (88.349316 MiBps)

scan   5 ( 602 samples) loaded. Target: 'J1939-6342'. Writing to disk...
Traceback (most recent call last):
  File "/usr/local/bin/mvftoms.py", line 816, in <module>
    main()
  File "/usr/local/bin/mvftoms.py", line 566, in main
    scan_vis_data, scan_weight_data, scan_flag_data)
  File "/usr/local/bin/mvftoms.py", line 92, in load
    out=[vis, weights, flags])
  File "/usr/local/lib/python3.6/dist-packages/katdal/lazy_indexer.py", line 594, in get
    da.store(kept, out, lock=False)
  File "/usr/local/lib/python3.6/dist-packages/dask/array/core.py", line 951, in store
    result.compute(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 166, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/base.py", line 437, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dask/threaded.py", line 84, in get
    **kwargs
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/usr/local/lib/python3.6/dist-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.6/dist-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore.py", line 243, in get_chunk_or_zeros
    return self.get_chunk(array_name, slices, dtype)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 610, in get_chunk
    headers=headers, stream=True)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 587, in complete_request
    result = process(response)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 173, in _read_chunk
    chunk = read_array(data._fp)
  File "/usr/local/lib/python3.6/dist-packages/katdal/chunkstore_s3.py", line 151, in read_array
    bytes_read = fp.readinto(memoryview(data.view(np.uint8)))
  File "/usr/lib/python3.6/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 874, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
@SpheMakh
Author

This is the ID of the data I'm trying to download: 1584577476

@SpheMakh
Author

SpheMakh commented Apr 7, 2020

@ludwigschwardt any ideas?

@ludwigschwardt
Contributor

I could not recreate the error. An error message like this typically means the server slammed down the phone on its side, indicating a temporary overload that should go away if you try again later.

I'm still a bit sad, since my latest improvements aim to catch these errors and turn them into flagged missing data without crashing the script. I'm keeping this open to remind me to catch this error too.
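
For illustration, a minimal sketch of that idea, in the spirit of katdal's existing get_chunk_or_zeros (the helper name and the returned flag here are hypothetical, not katdal's actual API):

import numpy as np

def get_chunk_or_flagged_zeros(store, array_name, slices, dtype):
    """Fetch a chunk; if the server hangs up, return zeros plus a 'lost' flag.

    Hypothetical helper for illustration only - katdal's real chunk store has
    its own error handling and flag conventions.
    """
    try:
        return store.get_chunk(array_name, slices, dtype), False
    except ConnectionResetError:
        # Server slammed down the phone mid-read: hand back flagged missing
        # data instead of letting the whole conversion crash.
        shape = tuple(s.stop - s.start for s in slices)
        return np.zeros(shape, dtype), True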

@SpheMakh
Author

SpheMakh commented Apr 8, 2020

I've tried a couple of times already, and it fails at the same point each time, at around 500 GB. But I'll give it another go.

@ludwigschwardt
Contributor

ludwigschwardt commented Apr 8, 2020

Interesting... I tried the following:

import katdal
from katdal.lazy_indexer import DaskLazyIndexer

d = katdal.open('...')
d.select(scans=5)   # which is where you are getting stuck
# Load dump 0 once to obtain output arrays of the right shape
v, w, f = DaskLazyIndexer.get([d.vis, d.weights, d.flags], 0)
# Then load each dump of the scan in turn, reusing those arrays
for n in range(602):
    print(n)
    DaskLazyIndexer.get([d.vis, d.weights, d.flags], n, out=[v, w, f])

It made it all the way to the end... Maybe try this on your setup.

@bennahugo
Contributor

bennahugo commented Apr 8, 2020 via email

@ludwigschwardt
Contributor

katdal 0.15 is pretty new (just pre-lockdown).

@SpheMakh
Author

This only happens when I'm running in a Docker container; it works fine outside a container. @ludwigschwardt, are there any containers that use mvftoms.py that you know of? Maybe I made a mistake in building mine.

@spassmoor
Contributor

I have not had this issue running on my Docker container that has mvftoms.py. I should also point out that I run it on one of the comm machines, which would mitigate any bad network problems.

@SpheMakh
Author

@spassmoor I'm running on com08. Can you share the Dockerfile?

@SpheMakh
Author

Thanks, I'll give it a go.

@ludwigschwardt
Contributor

Just remember that this is a public thread, in case your zip contains sensitive info :-)

@spassmoor
Contributor

Other than my work email address and my preferred version of tornado, I don't think there is anything sensitive in it.

@sarrvesh

sarrvesh commented Feb 25, 2021

Hey folks, I encountered a similar issue using katdal 0.17. mvftoms.py failed for me with a connection timeout error on my dataset (1596945366). I tried running @ludwigschwardt's script above and that failed with the same timeout error. Any ideas on how to solve this issue?

@bennahugo
Contributor

bennahugo commented Feb 25, 2021 via email

@sarrvesh

I get the same error in a container environment and in a normal virtualenv installation. It seems to fail right away with the following error:

StoreUnavailable: Chunk '1596945366-sdp-l0/correlator_data/00168_00000_00000': HTTPConnectionPool(host='archive-gw-1.kat.ac.za', port=7480): Max retries exceeded with url: /1596945366-sdp-l0/correlator_data/00168_00000_00000.npy (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f7263391860>, 'Connection to archive-gw-1.kat.ac.za timed out. (connect timeout=30)'))

My network otherwise works just fine.

@ludwigschwardt
Contributor

ludwigschwardt commented Feb 25, 2021

Hi @sarrvesh, a connection timeout indicates that you could not even start to talk to the archive server, i.e. the phone just rings and rings and nobody picks up. This differs from a connection reset (the topic of this issue), which is the server slamming down the phone in the middle of your conversation.

I see that you are trying to connect to port 7480. Are you on a machine in the CHPC cluster? If not, you'll need to connect to port 443, aka https, and use a token as provided by the RDB link button on the archive. So instead of

d = katdal.open('http://archive-gw-1.kat.ac.za:7480/1596945366/1596945366_sdp_l0.full.rdb')

try

d = katdal.open('https://archive-gw-1.kat.ac.za/1596945366/1596945366_sdp_l0.full.rdb?token=<your-token>')

I managed to download dump 168 with both methods just now, so the server is up and your dataset is intact. That's the good news 😄

This issue also occurs if you download the RDB file to your local disk and then open it via

d = katdal.open('1596945366_sdp_l0.full.rdb')

That trick only works on the CHPC cluster, or if you also copied all the data to your local disk, since the RDB file only contains the 7480 URL and won't know about the token.

@sarrvesh

Ah, interesting. Yeah, that works. Thanks very much.

@ludwigschwardt
Contributor

ludwigschwardt commented Feb 25, 2021

Pleasure!

I now remember that there's another option for feeding in the token with a local RDB file, which is to treat it like a URL:

d = katdal.open('1596945366_sdp_l0.full.rdb?token=<your-token>')

Although I'm not sure if that will go the https route...

ludwigschwardt added a commit that referenced this issue Mar 9, 2023
Here is a recap of how our HTTP requests happen::

  -- S3ChunkStore.complete_request (<- proposed retry level)
  -> S3ChunkStore.request
  -> requests.sessions.Session.request
  -> requests.adapters.HTTPAdapter.send (<- retries currently set here)
  -> urllib3.connectionpool.HTTPConnectionPool.urlopen (do retries here)
  -> urllib3.response.HTTPResponse
  -> http.client.HTTPConnection / HTTPResponse

Requests does timeouts but not retries. Retries are a urllib3 thing.
Requests allows you to pass an urllib3 `Retry` object to `HTTPAdapter`
for use on all subsequent requests.
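
For reference, hooking a Retry policy into an adapter looks roughly like this generic Requests/urllib3 snippet (not katdal's exact configuration):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Every request through this session now retries connect/read failures on its own.
session.mount('https://', HTTPAdapter(max_retries=Retry(connect=2, read=2)))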

Instead of doing connect retries and some read retries inside the
urllib3 request and other read and status retries at a higher level,
switch off retries at the `HTTPAdapter` level and do them all in
`complete_request`. The `Retry` configuration makes more sense
because the process is driven by a single `Retry` instance. This also
simplifies overriding retries per request.
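
A rough sketch of that shape, assuming a plain GET per chunk (the `process`
callback stands in for whatever reads the response body; this is an
illustration of the mechanism, not the actual katdal code):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib3.exceptions import ConnectTimeoutError, ProtocolError

session = requests.Session()
# No retries inside urllib3 / Requests: the loop below is the single retry driver.
session.mount('https://', HTTPAdapter(max_retries=0))

def complete_request(url, process, retries=None):
    """GET `url` and run `process` on the response, under one Retry policy."""
    retries = retries if retries is not None else Retry(connect=2, read=2)
    while True:
        try:
            response = session.get(url, timeout=30, stream=True)
            response.raise_for_status()
            return process(response)
        except requests.exceptions.ConnectionError as err:
            # Could not reach (or lost) the server: count it as a connect failure.
            error = ConnectTimeoutError(str(err))
        except ConnectionResetError as err:
            # Errno 104 while reading the body: count it as a read failure.
            error = ProtocolError(str(err))
        # increment() hands back a decremented policy, or raises MaxRetryError
        # once the relevant budget (connect or read) is exhausted.
        retries = retries.increment(method='GET', url=url, error=error)
        retries.sleep()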

This is a game of exceptions. The trick is to turn higher-level
exceptions (typically from Requests) back down into the appropriate
mid-level exceptions (urllib3) that can interact with the `Retry`
object. Sorta like pluripotent stem cells.

- For connect retries we need urllib3's `ConnectTimeoutError`.
- For read retries we need `ReadTimeoutError` (useful for socket
  timeouts) or `ProtocolError` (which covers 104 resets + short reads).
- Status retries need an urllib3.HTTPResponse as input (and no error).

We can use the `_standard_errors` mechanism to translate exceptions
by setting up an `error_map` containing::

- `requests.exceptions.ConnectionError` -> `ConnectTimeoutError`
- The standard `ConnectionResetError` aka 104 -> `ProtocolError`

In addition, let `TruncatedRead` derive from `ProtocolError`, which
will automatically trigger a read retry when it's raised. There is no
more need to catch it and reraise it as another exception.
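
As an illustration of why that works (a hypothetical reader, not katdal's
actual read_array), a short read raised as a ProtocolError subclass is counted
against the same read-retry budget as a reset connection:

from urllib3.exceptions import ProtocolError

class TruncatedRead(ProtocolError):
    """The server closed the response before sending all the promised bytes."""

def read_exactly(fp, nbytes):
    # Loop over readinto() until the buffer is full, raising TruncatedRead
    # if the stream ends early; Retry.increment() then treats it as a read error.
    buf = bytearray(nbytes)
    view = memoryview(buf)
    read = 0
    while read < nbytes:
        n = fp.readinto(view[read:])
        if not n:
            raise TruncatedRead(f'Expected {nbytes} bytes, got only {read}')
        read += n
    return bytes(buf)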

Since we rely on a raise_for_status mechanism, we have to shuttle the
response via an exception to the `Retry` object for status retries.
Use urllib3's own ResponseError for this and simply jam the high-level
Requests response unceremoniously into its args. This also gets rid of
the pesky fake 555 status (good riddance!).

The `S3ServerGlitch` reverts to a normal exception that is only raised
at the end to indicate a temporary server failure (read / status fail).

The unit tests for truncated reads and reset connections now all pass.
The only leftover is truncated RDB files.

This addresses #296, SR-2019, SPR1-1654 and possibly more.