
connection errors during watch do not cause a reconnect #53

Closed
butonic opened this issue Jul 17, 2024 · 2 comments

butonic commented Jul 17, 2024

AFAICT kubeBuilder.Build should reconnect when the watcher encounters an error:

kuberesolver/builder.go

Lines 177 to 183 in b382846

go until(func() {
    r.wg.Add(1)
    err := r.watch()
    if err != nil && err != io.EOF {
        grpclog.Errorf("kuberesolver: watching ended with error='%v', will reconnect again", err)
    }
}, time.Second, time.Second*30, ctx.Done())

But kResolver.watch() only returns an error during the initial connection establishment in watchEndpoints(). All other cases in the select return a nil error:

kuberesolver/builder.go

Lines 278 to 299 in b382846

func (k *kResolver) watch() error {
    defer k.wg.Done()
    // watch endpoints lists existing endpoints at start
    sw, err := watchEndpoints(k.ctx, k.k8sClient, k.target.serviceNamespace, k.target.serviceName)
    if err != nil {
        return err
    }
    for {
        select {
        case <-k.ctx.Done():
            return nil
        case <-k.t.C:
            k.resolve()
        case up, hasMore := <-sw.ResultChan():
            if hasMore {
                k.handle(up.Object)
            } else {
                return nil
            }
        }
    }
}

watchEndpoints sets up a streamWatcher via newStreamWatcher, whose receive() loop decodes events into the result channel and closes that channel when it encounters an error:

kuberesolver/stream.go

Lines 69 to 93 in b382846

func (sw *streamWatcher) receive() {
    defer close(sw.result)
    defer sw.Stop()
    for {
        obj, err := sw.Decode()
        if err != nil {
            // Ignore expected error.
            if sw.stopping() {
                return
            }
            switch err {
            case io.EOF:
                // watch closed normally
            case context.Canceled:
                // canceled normally
            case io.ErrUnexpectedEOF:
                grpclog.Infof("kuberesolver: Unexpected EOF during watch stream event decoding: %v", err)
            default:
                grpclog.Infof("kuberesolver: Unable to decode an event from the watch stream: %v", err)
            }
            return
        }
        sw.result <- obj
    }
}

until() only wraps panics ... and will silently stop restarting watch() after 30 seconds.

Am I missing something? How does the actual reconnect happen on intermittent connection errors after the 30 seconds? There is a timer that updates the endpoints every 30 minutes, but the watcher is just not restarted.

I'm asking because we saw the Kubernetes API become unavailable for several minutes during a cluster upgrade.


sercand (Owner) commented Jul 17, 2024

until() only wraps panics ... and will silently stop restarting watch() after 30 seconds.

No, this is wrong. until always calls the given function with a backoff and never stops re-calling it until the given stop channel is closed. The backoff sequence is 1, 2, 4, 8, 16, 1, 2, 4... seconds. This backoff logic was added in #40.
You are free to submit a PR to change this if you have experienced a problem caused by it.
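
For reference, here is a minimal sketch of an until-style helper matching the call site above, until(f, period, maxPeriod, stopCh). It is not the actual kuberesolver implementation (the real one also recovers from panics, per #40), and the doubling-then-reset delay is an assumption inferred from the 1, 2, 4, 8, 16, 1, 2, ... sequence described above:

// Sketch only: illustrates restart-with-backoff, not kuberesolver's real code.
package main

import (
    "fmt"
    "time"
)

// until keeps calling f until stopCh is closed. After each call it waits a
// delay that doubles from period up to maxPeriod and then starts over, so it
// never silently stops restarting f.
func until(f func(), period, maxPeriod time.Duration, stopCh <-chan struct{}) {
    delay := period
    for {
        f() // in kuberesolver this is the closure that runs r.watch()

        select {
        case <-stopCh:
            return
        case <-time.After(delay):
        }

        delay *= 2
        if delay > maxPeriod {
            delay = period // restart the backoff sequence instead of giving up
        }
    }
}

func main() {
    stop := make(chan struct{})
    go func() {
        time.Sleep(4 * time.Second)
        close(stop)
    }()
    until(func() { fmt.Println("watch() would run here") }, time.Second, 30*time.Second, stop)
}

The key point is that the loop only exits once stopCh is closed; a watch() that returns nil (for example after the stream watcher's result channel is closed) is simply called again after the next backoff interval.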


butonic (Author) commented Jul 18, 2024

Ah, I double-checked the code path of the channel handed to until. I had mixed it up with the cancel channel of the streamWatcher. Thanks for clarifying and sorry for the noise.

butonic closed this as completed Jul 18, 2024