
connection errors during watch do not cause a reconnect #53

Closed
butonic opened this issue Jul 17, 2024 · 2 comments

butonic commented Jul 17, 2024

AFAICT kubeBuilder.Build should reconnect when the watcher encounters an error:

kuberesolver/builder.go

Lines 177 to 183 in b382846

go until(func() {
    r.wg.Add(1)
    err := r.watch()
    if err != nil && err != io.EOF {
        grpclog.Errorf("kuberesolver: watching ended with error='%v', will reconnect again", err)
    }
}, time.Second, time.Second*30, ctx.Done())

But kResolver.watch() only returns an error during the initial connection establishment in watchEndpoints(). All other cases in the select return a nil error:

kuberesolver/builder.go

Lines 278 to 299 in b382846

func (k *kResolver) watch() error {
    defer k.wg.Done()
    // watch endpoints lists existing endpoints at start
    sw, err := watchEndpoints(k.ctx, k.k8sClient, k.target.serviceNamespace, k.target.serviceName)
    if err != nil {
        return err
    }
    for {
        select {
        case <-k.ctx.Done():
            return nil
        case <-k.t.C:
            k.resolve()
        case up, hasMore := <-sw.ResultChan():
            if hasMore {
                k.handle(up.Object)
            } else {
                return nil
            }
        }
    }
}

watchEndpoints sets up a streamWatcher via newStreamWatcher, whose receive() loop decodes events into the result channel and closes that channel when it encounters an error:

kuberesolver/stream.go

Lines 69 to 93 in b382846

func (sw *streamWatcher) receive() {
    defer close(sw.result)
    defer sw.Stop()
    for {
        obj, err := sw.Decode()
        if err != nil {
            // Ignore expected error.
            if sw.stopping() {
                return
            }
            switch err {
            case io.EOF:
                // watch closed normally
            case context.Canceled:
                // canceled normally
            case io.ErrUnexpectedEOF:
                grpclog.Infof("kuberesolver: Unexpected EOF during watch stream event decoding: %v", err)
            default:
                grpclog.Infof("kuberesolver: Unable to decode an event from the watch stream: %v", err)
            }
            return
        }
        sw.result <- obj
    }
}

until() only wraps panics ... and will silently stop restarting watch() after 30 seconds.

Am I missing something? How does the actual reconnect happen on intermittent connection errors after the 30 seconds? There is a timer that updates the endpoints every 30 minutes, but the watcher is just not restarted.

I'm asking because we saw the Kubernetes API become unavailable for several minutes during a cluster upgrade.


sercand (Owner) commented Jul 17, 2024

until() only wraps panics ... and will silently stop restarting watch() after 30 seconds.

No, this is wrong. until always calls the given function with a backoff and never stops re-calling it until the given stop channel is closed. The backoff sequence is 1, 2, 4, 8, 16, 1, 2, 4... seconds. This backoff logic was added in #40.
You are free to submit a PR to change this if you have experienced a problem caused by it.
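
For reference, here is a minimal sketch of an until-style helper matching the call site above, until(f, period, maxPeriod, stopCh). It is not the actual kuberesolver implementation (the real one also recovers from panics, per #40), and the doubling-then-reset delay is an assumption inferred from the 1, 2, 4, 8, 16, 1, 2, ... sequence described above:

// Sketch only: illustrates restart-with-backoff, not kuberesolver's real code.
package main

import (
    "fmt"
    "time"
)

// until keeps calling f until stopCh is closed. After each call it waits a
// delay that doubles from period up to maxPeriod and then starts over, so it
// never silently stops restarting f.
func until(f func(), period, maxPeriod time.Duration, stopCh <-chan struct{}) {
    delay := period
    for {
        f() // in kuberesolver this is the closure that runs r.watch()

        select {
        case <-stopCh:
            return
        case <-time.After(delay):
        }

        delay *= 2
        if delay > maxPeriod {
            delay = period // restart the backoff sequence instead of giving up
        }
    }
}

func main() {
    stop := make(chan struct{})
    go func() {
        time.Sleep(4 * time.Second)
        close(stop)
    }()
    until(func() { fmt.Println("watch() would run here") }, time.Second, 30*time.Second, stop)
}

The key point is that the loop only exits once stopCh is closed; a watch() that returns nil (for example after the stream watcher's result channel is closed) is simply called again after the next backoff interval.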


butonic (Author) commented Jul 18, 2024

Ah, I double-checked the code path of the channel handed to until. I had mixed it up with the cancel channel of the streamWatcher. Thanks for clarifying and sorry for the noise.

butonic closed this as completed Jul 18, 2024