mvftoms.py throws connection reset by peer error #296
This is the ID of the data I'm trying to download: 1584577476. I'm using katdal 0.15 in an Ubuntu 18.04 docker container. |
@ludwigschwardt any ideas? |
I could not recreate the error. The server typically slams down the phone on its side with such an error message, indicating a temporary overload which should go away if you try again later. I'm still a bit sad, since my latest improvements aim to catch these errors and turn them into flagged missing data without crashing the script. I'm keeping this open to remind me to catch this error too. |
I've tried a couple of times already, and it fails at the same point each time, at around 500 GB. But I'll give it another go. |
Interesting... I tried the following:

```python
import katdal
from katdal.lazy_indexer import DaskLazyIndexer

d = katdal.open('...')
d.select(scans=5)  # which is where you are getting stuck
v, w, f = DaskLazyIndexer.get([d.vis, d.weights, d.flags], 0)
for n in range(602):
    print(n)
    DaskLazyIndexer.get([d.vis, d.weights, d.flags], n, out=[v, w, f])
```

It made it all the way to the end... Maybe try this on your setup. |
Update your katdal, Sphe. If you are coming in from Rhodes you need the latest and greatest! I had run into this network endpoint problem before.
|
katdal 0.15 is pretty new (just pre-lockdown). |
This only happens when I'm running in a docker container. It works fine outside a container. @ludwigschwardt, are there any containers that use mvftoms that you know of? Maybe I made a mistake in building mine. |
I have not had this issue running on my docker container that has mvftoms.py. I should also point out that I run it on one of the comm machines, which would mitigate any bad network problems. |
@spassmoor I'm running on com08. Can you share the Dockerfile? |
Thanks, I'll give it a go. |
Just remember that this is a public thread, in case your zip contains sensitive info :-) |
Other than my work email address and my preferred version of tornado, I don't think there is anything sensitive in it. |
Hey folks, I encountered a similar issue using katdal 0.17. mvftoms.py failed for me with a connection timeout error on my dataset (1596945366). I tried running @ludwigschwardt's script above and that failed with the same timeout error. Any ideas how to solve this issue? |
Need more info here.
Is this inside a container? If so, please check that your docker bridge is working properly by pinging or telnetting from inside your container.
If it is working, can you indicate whether it fails partway through the download or right at the start? There might be a misconfigured endpoint or something else not related to containerization.
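As a quick check, something along these lines could be run inside the container (a minimal sketch; the host and port below are assumptions, so substitute the archive endpoint you are actually downloading from):

```python
# Minimal connectivity probe to run inside the container.
# The endpoint below is an assumption, not taken from this thread.
import socket

host, port = 'archive-gw-1.kat.ac.za', 443
try:
    with socket.create_connection((host, port), timeout=10):
        print(f'TCP connection to {host}:{port} is fine')
except OSError as exc:
    print(f'Cannot reach {host}:{port}: {exc}')
```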
|
I get the same error in a container environment and in a normal virtualenv installation. It seems to fail right away with the following error:
My network otherwise works just fine. |
Hi @sarrvesh, a connection timeout indicates that you could not even start to talk to the archive server, i.e. the phone just rings and rings and nobody picks up. This differs from a connection reset (the topic of this issue), which is the server slamming down the phone in the middle of your conversation. I see that you are trying to connect to port 7480. Are you on a machine in the CHPC cluster? If not, you'll need to connect to port 443, aka https, and use a token as provided by the RDB link button on the archive. So instead of the direct port-7480 URL, try the https URL with a token, as sketched below.
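For illustration only (the capture block path and token here are placeholders, not values from this thread):

```python
import katdal

# Direct port-7480 URL, as stored in the RDB file (cluster-internal only):
d = katdal.open('http://archive-gw-1.kat.ac.za:7480/1596945366/1596945366_sdp_l0.full.rdb')

# https (port 443) URL with the token from the archive's RDB link button:
d = katdal.open('https://archive-gw-1.kat.ac.za/1596945366/1596945366_sdp_l0.full.rdb?token=<your token>')
```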
I managed to download dump 168 with both methods just now, so the server is up and your dataset is intact. That's the good news 😄 This issue also occurs if you download the RDB file to your local disk and then open it via
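(presumably something like this, with a placeholder filename)

```python
import katdal
d = katdal.open('1596945366_sdp_l0.full.rdb')  # hypothetical local RDB file, no token
```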
That trick only works on the CHPC cluster, or if you also copied all the data to your local disk, since the RDB file only contains the 7480 URL and won't know about the token. |
Ah, interesting. Yeah, that works. Thanks very much. |
Pleasure! I now remember that there's another option with the local RDB file to feed in the token - treat it like an URL:
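A hypothetical sketch of that option (filename and token are placeholders):

```python
import katdal
# Treat the local RDB file like an URL and append the token as a query string
d = katdal.open('1596945366_sdp_l0.full.rdb?token=<your archive token>')
```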
Although I'm not sure if that will go the https route... |
Here is a recap of how our HTTP requests happen:

```
S3ChunkStore.complete_request                         (<- proposed retry level)
-> S3ChunkStore.request
-> requests.sessions.Session.request
-> requests.adapters.HTTPAdapter.send                 (<- retries currently set here)
-> urllib3.connectionpool.HTTPConnectionPool.urlopen  (do retries here)
-> urllib3.response.HTTPResponse
-> http.client.HTTPConnection / HTTPResponse
```

Requests does timeouts but not retries. Retries are a urllib3 thing. Requests allows you to pass a urllib3 `Retry` object to `HTTPAdapter` for use on all subsequent requests.

Instead of doing connect retries and some read retries inside the urllib3 request, and other read and status retries at a higher level, switch off retries at the `HTTPAdapter` level and do them all in `complete_request`. The `Retry` configuration makes more sense because the process is driven by a single `Retry` instance. This also simplifies overriding retries per request.

This is a game of exceptions. The trick is to turn higher-level exceptions (typically from Requests) back down into the appropriate mid-level exceptions (urllib3) that can interact with the `Retry` object. Sorta like pluripotent stem cells.

- For connect retries we need urllib3's `ConnectTimeoutError`.
- For read retries we need `ReadTimeoutError` (useful for socket timeouts) or `ProtocolError` (which covers 104 resets + short reads).
- Status retries need an `urllib3.HTTPResponse` as input (and no error).

We can use the `_standard_errors` mechanism to translate exceptions by setting up an `error_map` containing:

- `requests.exceptions.ConnectionError` -> `ConnectTimeoutError`
- the standard `ConnectionResetError` (aka 104) -> `ProtocolError`

In addition, let `TruncatedRead` derive from `ProtocolError`, which will automatically trigger a read retry when it's raised. There is no more need to catch it and reraise it as another exception.

Since we rely on a raise_for_status mechanism, we have to shuttle the response via an exception to the `Retry` object for status retries. Use urllib3's own `ResponseError` for this and simply jam the high-level Requests response unceremoniously into its args. This also gets rid of the pesky fake 555 status (good riddance!).

The `S3ServerGlitch` reverts to a normal exception that is only raised at the end to indicate a temporary server failure (read / status failure).

The unit tests for truncated reads and reset connections now all pass. The only leftover is truncated RDB files.

This addresses #296, SR-2019, SPR1-1654 and possibly more.
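A minimal sketch of the scheme described above, assuming a plain Requests session; this is illustrative only, not katdal's actual implementation (`complete_request`, the retry budgets and the timeout are stand-ins):

```python
import requests
from urllib3.exceptions import ConnectTimeoutError, MaxRetryError, ProtocolError
from urllib3.util.retry import Retry

# Translate high-level (Requests / stdlib) exceptions down into the
# mid-level urllib3 ones that a Retry instance knows how to classify.
ERROR_MAP = {
    requests.exceptions.ConnectionError: ConnectTimeoutError,
    ConnectionResetError: ProtocolError,  # errno 104: connection reset by peer
}

def complete_request(session, method, url,
                     retries=Retry(connect=2, read=2, status=2,
                                   status_forcelist=[500, 502, 503, 504])):
    """Perform a request, driving all retries from a single Retry instance."""
    while True:
        error = None
        response = None
        try:
            response = session.request(method, url, timeout=10)
            if response.status_code not in retries.status_forcelist:
                return response  # success, or a status we don't retry
        except tuple(ERROR_MAP) as exc:
            # Demote the exception so that Retry can decide what to do with it
            for high_level, mid_level in ERROR_MAP.items():
                if isinstance(exc, high_level):
                    error = mid_level(str(exc))
                    break
        try:
            # increment() consumes one connect/read/status attempt and
            # raises MaxRetryError once the relevant budget is exhausted
            retries = retries.increment(
                method, url, error=error,
                response=response.raw if response is not None else None)
        except MaxRetryError:
            raise  # persistent server glitch: give up and let the caller flag it
        retries.sleep()  # honour backoff before the next attempt
```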