-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Connection reset by peer when uploading to S3 with image column #1948
Comments
I can't reproduce this with a single process, but it sounds like this can be generated if many parallel processes are running in parallel. Looking at upstream discussion (apache/arrow-rs#5378, apache/arrow-rs#5383), this might just be an issue with writing too quickly. But since any rate limiting mechanism we create will only affect a single process, it's not clear if we can fix this in Lance. I think my advice is to do less on a single node: either rate limit tasks or spread them across separate computers. |
I am able to reproduce this now if I try to write a file > 100GB to GCS from within GCP. Other issues have presented the solution as using limit store. But I am hitting this from a single upload. I suspect just uploading the 10 10 MB parts in parallel can trigger this. For now we have retries just for the upload part requests: lance/rust/lance-io/src/object_store/gcs_wrapper.rs Lines 254 to 279 in b39e8e8
|
I created a low-level test of I think the problem might be somewhere in Lance then. |
May need a retry loop with special backoff here.
Also investigate: could a very large blob trigger this?
The text was updated successfully, but these errors were encountered: