
Connection reset by peer when uploading to S3 with image column #1948

Open · wjones127 opened this issue on Feb 13, 2024 · 3 comments
Labels: bug (Something isn't working)
wjones127 commented Feb 13, 2024

May need a retry loop with special backoff here.

Also investigate: could a very large blob trigger this?

wjones127 added the bug label on Feb 13, 2024
wjones127 self-assigned this on Feb 13, 2024

wjones127 commented Feb 19, 2024

I can't reproduce this with a single process, but it sounds like it can happen when many processes are writing in parallel. Looking at upstream discussion (apache/arrow-rs#5378, apache/arrow-rs#5383), this might just be a matter of writing too quickly. But since any rate-limiting mechanism we create will only govern a single process, it's not clear we can fix this in Lance. I think my advice is to do less on a single node: either rate limit tasks or spread them across separate computers.
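One way to "rate limit tasks" within a single process is to cap how many upload tasks are in flight at once. A minimal std-only sketch of that idea (the `Semaphore` type and `run_uploads` helper are illustrative, not Lance APIs; real async code would more likely use something like `tokio::sync::Semaphore`):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A tiny counting semaphore built on std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.cv.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `tasks` dummy "uploads" with at most `max_in_flight` running at once;
/// returns how many completed.
fn run_uploads(max_in_flight: usize, tasks: usize) -> usize {
    let sem = Arc::new(Semaphore::new(max_in_flight));
    let done = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let done = Arc::clone(&done);
            thread::spawn(move || {
                sem.acquire();
                // ... a real task would PUT one multipart chunk here ...
                *done.lock().unwrap() += 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *done.lock().unwrap();
    n
}

fn main() {
    // 16 tasks submitted, but only 4 ever run concurrently.
    println!("{}", run_uploads(4, 16));
}
```

The caveat from the comment above still applies: this only bounds one process, so many processes on one node can still overwhelm the endpoint in aggregate.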

wjones127 commented

I am able to reproduce this now if I try to write a file larger than 100 GB to GCS from within GCP.

Other issues have presented the solution as wrapping the store in a LimitStore. But I am hitting this from a single upload. I suspect just uploading the ten 10 MB parts in parallel can trigger this.

For now we have retries just for the upload part requests:

Err(UploadPutError {
    source: OSError::Generic { source, .. },
    part_idx,
    buffer,
}) if source
    .to_string()
    .to_lowercase()
    .contains("connection reset by peer")
    && mut_self.connection_resets < 20 =>
{
    // Retry, but only up to 20 times.
    mut_self.connection_resets += 1;
    // Resubmit with random jitter
    let sleep_time_ms = rand::thread_rng().gen_range(2_000..8_000);
    let sleep_time = std::time::Duration::from_millis(sleep_time_ms);
    futures.push(Self::put_part(
        mut_self.path.clone(),
        mut_self.store.clone(),
        buffer,
        part_idx,
        multipart_id.clone(),
        Some(sleep_time),
    ));
}
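The retry-with-jitter idea in that match arm, factored into a standalone helper, looks roughly like the following sketch (not Lance code: `retry_with_jitter` is a hypothetical name, and the jitter here derives from the system clock because `rand` is not in std; the real snippet uses `rand::thread_rng().gen_range(2_000..8_000)`):

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Call `op` until it succeeds, retrying up to `max_retries` times with a
/// jittered delay between attempts.
fn retry_with_jitter<T, E>(
    max_retries: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    for attempt in 0..=max_retries {
        match op(attempt) {
            Ok(v) => return Ok(v),
            Err(_) if attempt < max_retries => {
                // Poor man's jitter: sub-second nanos from the clock,
                // folded into [0, base_delay) and added on top of base_delay.
                let nanos = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap()
                    .subsec_nanos() as u64;
                let jitter = Duration::from_millis(nanos % base_delay.as_millis().max(1) as u64);
                sleep(base_delay + jitter);
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("every iteration returns on the last attempt")
}

fn main() {
    // Simulate two "connection reset by peer" failures, then success.
    let mut calls = 0u32;
    let result: Result<&str, &str> = retry_with_jitter(20, Duration::from_millis(5), |_| {
        calls += 1;
        if calls < 3 {
            Err("connection reset by peer")
        } else {
            Ok("uploaded")
        }
    });
    println!("{:?} after {} calls", result, calls);
}
```

Jitter matters here because many parts fail at once; without it, all retries would hit the endpoint in the same instant and likely get reset again.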

wjones127 commented

I created a low-level test of object-store, and I'm no longer able to reproduce this with just one upload. Even if I set the concurrency to 60, it works flawlessly. It also works with a part size of 100 MB.

I think the problem might be somewhere in Lance itself, then.
