
Connection reset by peer when uploading to S3 with image column #1948

Open · wjones127 opened this issue on Feb 13, 2024 · 3 comments
Labels: bug (Something isn't working)
wjones127 commented Feb 13, 2024

May need a retry loop with special backoff here.

Also investigate: could a very large blob trigger this?

wjones127 added the bug label on Feb 13, 2024
wjones127 self-assigned this on Feb 13, 2024

wjones127 commented Feb 19, 2024

I can't reproduce this with a single process, but it sounds like it can happen when many processes are writing in parallel. Looking at upstream discussion (apache/arrow-rs#5378, apache/arrow-rs#5383), this might just be a matter of writing too quickly. But since any rate-limiting mechanism we create will only govern a single process, it's not clear we can fix this in Lance. I think my advice is to do less on a single node: either rate limit tasks or spread them across separate computers.
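One way to "rate limit tasks" within a single process is to cap how many upload tasks are in flight at once. A minimal std-only sketch of that idea (the `Semaphore` type and `run_uploads` helper are illustrative, not Lance APIs; real async code would more likely use something like `tokio::sync::Semaphore`):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A tiny counting semaphore built on std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.cv.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Run `tasks` dummy "uploads" with at most `max_in_flight` running at once;
/// returns how many completed.
fn run_uploads(max_in_flight: usize, tasks: usize) -> usize {
    let sem = Arc::new(Semaphore::new(max_in_flight));
    let done = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let done = Arc::clone(&done);
            thread::spawn(move || {
                sem.acquire();
                // ... a real task would PUT one multipart chunk here ...
                *done.lock().unwrap() += 1;
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *done.lock().unwrap();
    n
}

fn main() {
    // 16 tasks submitted, but only 4 ever run concurrently.
    println!("{}", run_uploads(4, 16));
}
```

The caveat from the comment above still applies: this only bounds one process, so many processes on one node can still overwhelm the endpoint in aggregate.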

wjones127 commented

I am able to reproduce this now if I try to write a file larger than 100 GB to GCS from within GCP.

Other issues have presented the solution as wrapping the store in a LimitStore. But I am hitting this from a single upload. I suspect just uploading the ten 10 MB parts in parallel can trigger this.

For now we have retries just for the upload part requests:

Err(UploadPutError {
    source: OSError::Generic { source, .. },
    part_idx,
    buffer,
}) if source
    .to_string()
    .to_lowercase()
    .contains("connection reset by peer")
    && mut_self.connection_resets < 20 =>
{
    // Retry, but only up to 20 times.
    mut_self.connection_resets += 1;
    // Resubmit with random jitter
    let sleep_time_ms = rand::thread_rng().gen_range(2_000..8_000);
    let sleep_time = std::time::Duration::from_millis(sleep_time_ms);
    futures.push(Self::put_part(
        mut_self.path.clone(),
        mut_self.store.clone(),
        buffer,
        part_idx,
        multipart_id.clone(),
        Some(sleep_time),
    ));
}
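The retry-with-jitter idea in that match arm, factored into a standalone helper, looks roughly like the following sketch (not Lance code: `retry_with_jitter` is a hypothetical name, and the jitter here derives from the system clock because `rand` is not in std; the real snippet uses `rand::thread_rng().gen_range(2_000..8_000)`):

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Call `op` until it succeeds, retrying up to `max_retries` times with a
/// jittered delay between attempts.
fn retry_with_jitter<T, E>(
    max_retries: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    for attempt in 0..=max_retries {
        match op(attempt) {
            Ok(v) => return Ok(v),
            Err(_) if attempt < max_retries => {
                // Poor man's jitter: sub-second nanos from the clock,
                // folded into [0, base_delay) and added on top of base_delay.
                let nanos = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap()
                    .subsec_nanos() as u64;
                let jitter = Duration::from_millis(nanos % base_delay.as_millis().max(1) as u64);
                sleep(base_delay + jitter);
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("every iteration returns on the last attempt")
}

fn main() {
    // Simulate two "connection reset by peer" failures, then success.
    let mut calls = 0u32;
    let result: Result<&str, &str> = retry_with_jitter(20, Duration::from_millis(5), |_| {
        calls += 1;
        if calls < 3 {
            Err("connection reset by peer")
        } else {
            Ok("uploaded")
        }
    });
    println!("{:?} after {} calls", result, calls);
}
```

Jitter matters here because many parts fail at once; without it, all retries would hit the endpoint in the same instant and likely get reset again.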

wjones127 commented

I created a low-level test of object-store, and I'm no longer able to reproduce this with just one upload. Even if I set the concurrency to 60, it works flawlessly. It also works with a part size of 100 MB.

I think the problem might be somewhere in Lance itself, then.
