WatchResponse.Canceled is not triggered to true #8231

mwf · 2017-07-07T16:23:14Z

Hi team!

You have some inconsistency between clientv3 Watch documentation and real behavior!

Docs says:

    // Canceled is used to indicate watch failure.
    // If the watch failed and the stream was about to close, before the channel is closed,
    // the channel sends a final response that has Canceled set to true with a non-nil Err().
    Canceled bool

But you receive response with Canceled = false in case of revision compaction! And then the channel is unexpectedly closed!

Try this (consider 1 as some compacted revision):

prefix := "foo"
watchCh := client.Watch(context.Background(), prefix, clientv3.WithPrefix(), clientv3.WithRev(1))
for resp := range watchCh {
	if resp.Canceled {
		log.Printf("watch channel is closing")
		panic(resp.Err())
	}
	if err := resp.Err(); err != nil {
		// there is some non-critical error without channel close, just log it
		log.Printf("error watching '%s': %s", prefix, err)
		log.Printf("canceled: %t", resp.Canceled)
		continue
	}
	<do something with event>
}
log.Println("exit")

The result will be:

2017/07/07 19:19:10 error watching 'foo': etcdserver: mvcc: required revision has been compacted
2017/07/07 19:19:10 canceled: false
2017/07/07 19:19:10 exit

According to documentation I expect Canceled to be set to true every time the response channel is going to be closed.

The problem exists both for etcd 3.1.x and 3.2.x clients.

What do you think about it?

The text was updated successfully, but these errors were encountered:

xiang90 · 2017-07-08T23:35:00Z

@heyitsanthony

I think the comment is wrong? User should rely on error instead? Or we switch to always set canceled to true when there is an error?

mwf · 2017-07-08T23:41:23Z

And also the question is, can there be some errors, which don't lead to channel close? If can - the Canceled flag is usefull.
If any error leads to channel close - then what is the purpose for Canceled flag? User can just check Err() and that's all.
Maybe the original thought was that there could be some "recoverable errors" without channel close, but how should the user react to them? Just log and continue the event stream processing?

heyitsanthony · 2017-07-10T16:36:57Z

There aren't any errors that won't also close the watch, but it might be nice to have them in the future. The server doesn't set Canceled when closing the channel on compact, it only sets the compact revision, and so the client watch response won't set cancel either.

It seems like the server should set both Canceled and the compact revision on watch compaction. This won't break any existing clients afaik, so that's probably the fix here.

mwf · 2017-07-10T21:05:09Z

BTW, Anthony, I've got another concern. You say it's server who closes the channel on compact. When does it do this? We just hit this issue several days ago, we had some long-lasting watches for service discovery and etcd compaction once per 24 hours. And suddenly services received watch channel close - though I played around and could not reproduce this case.

If you watch starting from some revision, then make compact to the server - the watch channel is still alive and receives the events! I couldn't make it closed by compact.

And what's the purpose to close the channel on compact? I don't care if the revision I start watching from is compacted or not - I just need the kv updates to start from particular revision.

So could you please describe the mechanics behind watch + compact in details?

mwf · 2017-07-11T09:45:08Z

@heyitsanthony ping :)

xiang90 · 2017-07-11T15:49:04Z

service discovery and etcd compaction once per 24 hours. And

if closes the connection if the watcher wants to watch (or watching on) compacted revisions.

mwf · 2017-07-11T16:01:09Z

if closes the connection if the watcher wants to watch (or watching on) compacted revisions.

Yes, it's OK to close the connection with error if you try to start the watch on compacted revision. And it does so.
But as I told above, it closes the connection after you already started the watch.

I could not reproduce it - I started watch on rev. 3000 for example, then made some inserts, compacted store to 3010, made another insert - and client received the event.
But the fact is that we received etcdserver: mvcc: required revision has been compacted error while the watch has already been started for several days. We saw errors in logs. The watch was not reinitialized - in that case we would see some additional lines about restarting watch on some reason.

That's why I'm asking, if such situation with compaction error during some active watch could possibly occur. Is it intended behaviour or kind of a bug? It's double strange, that we use this code for several months, and we hit such behaviour only now.

mwf · 2017-07-11T16:02:36Z

Because it's strange for me to close a watch just because the revision you start watching from is compacted. What's the point?

xiang90 · 2017-07-11T16:05:10Z

Because it's strange for me to close a watch just because the revision you start watching from is compacted. What's the point?

can you reproduce this? if yes, then it is a bug.

That's why I'm asking, if such situation with compaction error during some active watch could possibly occur.

it can happen if your watcher is slow and lagging. and you compact at the current head. if you believe this is not the case, either reproduce it or look into the code to fix it?

mwf · 2017-07-11T21:32:32Z

Yeap, I agree, it should be reproducible to be considered a bug. If I find a way to reproduce - I'll tell you. It was important for me only to ensure that you don't consider it as expected behaviour either :)

and you compact at the current head

But there is no other out-of-box way to compact, only at current head with auto-compaction-retention. BTW, do you have any plans to make it more flexible? For example some other flag auto-compaction-grace=24, to clean the revisions with timestamp >=24 hours from now? To keep last 24 hours history. Along with retention of 24, the period of 24-48 hours will be kept.
If you like the idea, I could create a separate issue for this and maybe implement it.

xiang90 · 2017-07-11T21:33:46Z

to clean the revisions with timestamp >=24 hours from now?

this is what etcd does not. it compacts 24 hours from now. keep 24 hours history basically.

mwf · 2017-07-11T21:42:20Z

Oh, great then. I thought it's just a time period for starting head compaction. Current documentation doesn't explain it well enough :)

xiang90 · 2017-07-11T21:43:21Z

@mwf maybe help us improve the documentation? thanks :P

mwf · 2017-07-11T22:21:53Z

Easily :)
I'll check tomorrow that it really keeps the historical revisions, and if it does - will improve the doc.

Fixes etcd-io#8231

newhook · 2017-08-22T22:37:20Z

I received this error today I believe -- I got an error:

Aug 21 19:14:34 staging-shard disco_daemonize[13226]: daemonize 2017/08/21 19:14:34.847623 etcd error: etcdserver: mvcc: required revision has been compacted

on a watch that had been active for several hours.

Here is the prior compaction in the etcd logs. I would have started the watch on revision 9.

2017-08-22 18:30:06.830156 N | compactor: Starting auto-compaction at revision 51
2017-08-22 18:30:06.831045 N | compactor: Finished auto-compaction at revision 51

Is this then not expected? The server is running v3.1.9, the client API is the latest code.

heyitsanthony · 2017-08-22T22:45:04Z

@newhook is there an easy way to reproduce this? I think what happened was the watch was waiting for an event with rev >= 9, the kv store was compacted to rev=51, the watch disconnected, then reconnected into a compacted error. Using WithProgressNotify() might help

newhook · 2017-08-22T22:58:54Z

That makes sense, is that expected to occur? This is occurring in a GCE private network so I doubt there are underlying network issues.

I can reproduce pretty easily I think.

If I don't use WithProgressNotify what is the best way to fix this issue?

What this code is doing is:

fetch configuration data.
watch for changes on the data.
if the data changes reconfigure.

So its implemented like so:

Fetch key xyz. Returns data, revision.
Load configuration data using data.
Watch key xyz starting at given revision+1.
If the data changes reload configuration.

The revision here ensures that I don't miss any changes after (1) and before (3).

A couple of ideas to fix the problem:

Refetch the initial set of keys and then re-create the watcher with the new revision.
Re-create the watcher using the CompactRevision in the previous WatchResponse.

heyitsanthony · 2017-08-23T17:17:44Z

Refetch the initial set of keys and then re-create the watcher with the new revision.

This is how it's usually done. It also covers the case where the client is partitioned and there is a compaction, which isn't helped by WithProgressNotify.

heyitsanthony added this to the v3.3.0 milestone Jul 18, 2017

heyitsanthony self-assigned this Jul 19, 2017

heyitsanthony pushed a commit to heyitsanthony/etcd that referenced this issue Jul 19, 2017

v3rpc: set canceled=true when stream is compacted

4506ba7

Fixes etcd-io#8231

heyitsanthony mentioned this issue Jul 19, 2017

v3rpc: set Canceled=true on compacted watch #8283

Merged

heyitsanthony pushed a commit to heyitsanthony/etcd that referenced this issue Jul 25, 2017

v3rpc: set canceled=true when stream is compacted

6fb0867

Fixes etcd-io#8231

heyitsanthony closed this as completed in #8283 Jul 26, 2017

visheshnp pushed a commit to visheshnp/etcd that referenced this issue Aug 3, 2017

v3rpc: set canceled=true when stream is compacted

c1dbb08

Fixes etcd-io#8231

philips unassigned heyitsanthony Aug 28, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WatchResponse.Canceled is not triggered to true #8231

WatchResponse.Canceled is not triggered to true #8231

mwf commented Jul 7, 2017 •

edited

Loading

xiang90 commented Jul 8, 2017

mwf commented Jul 8, 2017

heyitsanthony commented Jul 10, 2017

mwf commented Jul 10, 2017

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017 •

edited

Loading

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017 •

edited

Loading

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017

newhook commented Aug 22, 2017 •

edited

Loading

heyitsanthony commented Aug 22, 2017

newhook commented Aug 22, 2017 •

edited

Loading

heyitsanthony commented Aug 23, 2017

WatchResponse.Canceled is not triggered to true #8231

WatchResponse.Canceled is not triggered to true #8231

Comments

mwf commented Jul 7, 2017 • edited Loading

xiang90 commented Jul 8, 2017

mwf commented Jul 8, 2017

heyitsanthony commented Jul 10, 2017

mwf commented Jul 10, 2017

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017 • edited Loading

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017 • edited Loading

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017

xiang90 commented Jul 11, 2017

mwf commented Jul 11, 2017

newhook commented Aug 22, 2017 • edited Loading

heyitsanthony commented Aug 22, 2017

newhook commented Aug 22, 2017 • edited Loading

heyitsanthony commented Aug 23, 2017

mwf commented Jul 7, 2017 •

edited

Loading

mwf commented Jul 11, 2017 •

edited

Loading

mwf commented Jul 11, 2017 •

edited

Loading

newhook commented Aug 22, 2017 •

edited

Loading

newhook commented Aug 22, 2017 •

edited

Loading