-
Notifications
You must be signed in to change notification settings - Fork 335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(kuma-cp) subscription finalizer, rev 2 #2526
Conversation
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru> # Conflicts: # api/mesh/v1alpha1/dataplane_insight.pb.go # api/system/v1alpha1/zone_insight.pb.go # pkg/xds/server/callbacks/dataplane_status_sink_test.go # test/framework/universal_cluster.go
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru> # Conflicts: # pkg/sds/server/v3/reconciler.go
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru> # Conflicts: # test/framework/universal_cluster.go
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Codecov Report
@@ Coverage Diff @@
## master #2526 +/- ##
==========================================
+ Coverage 52.30% 52.33% +0.02%
==========================================
Files 867 869 +2
Lines 48182 48429 +247
==========================================
+ Hits 25204 25343 +139
- Misses 20942 21035 +93
- Partials 2036 2051 +15
Continue to review full report at Codecov.
|
if subscription.DisconnectTime == nil { | ||
subscription.DisconnectTime = now | ||
} | ||
} | ||
} | ||
|
||
func (ds *DataplaneInsight) GetLatestSubscription() (*DiscoverySubscription, *time.Time) { | ||
if len(ds.GetSubscriptions()) == 0 { | ||
// todo(lobkovilya): delete GetLatestSubscription, use GetLastSubscription instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be done in this PR or separate? If separate, do we have a issue / card for it?
@@ -270,11 +270,15 @@ metrics: | |||
enabled: true # ENV: KUMA_METRICS_DATAPLANE_ENABLED | |||
# How many latest subscriptions will be stored in DataplaneInsight object, if equals 0 then unlimited | |||
subscriptionLimit: 2 # ENV: KUMA_METRICS_DATAPLANE_SUBSCRIPTION_LIMIT | |||
# How long data plane proxy can stay Online without active xDS connection | |||
idleTimeout: 5m # ENV: KUMA_METRICS_DATAPLANE_IDLE_TIMEOUT | |||
zone: | |||
# Enables collecting metrics from Zone | |||
enabled: true # ENV: KUMA_METRICS_ZONE_ENABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking... now we rely on Insights to provide a stable product. Should we remove an option to disable insights?
@@ -107,6 +107,9 @@ type DiscoverySubscription struct { | |||
Status *DiscoverySubscriptionStatus `protobuf:"bytes,5,opt,name=status,proto3" json:"status,omitempty"` | |||
// Version of Envoy and Kuma dataplane | |||
Version *Version `protobuf:"bytes,6,opt,name=version,proto3" json:"version,omitempty"` | |||
// Generation is an integer number which is periodically increased by the | |||
// status sink | |||
Generation uint32 `protobuf:"varint,7,opt,name=generation,proto3" json:"generation,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to use Version from ResourceMeta instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 19 of 29 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: 21 of 30 files reviewed, 17 unresolved discussions (waiting on @lobkovilya)
api/helpers/insights.go, line 1 at r2 (raw file):
package helpers
Suggest that we call this package generic
or interface
. generic.Insight
reads nicely IMHO.
api/mesh/v1alpha1/dataplane_insight.proto, line 74 at r2 (raw file):
// Generation is an integer number which is periodically increased by the // status sink
Nit: end sentence with a full-stop. Here and elsewhere.
https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
api/mesh/v1alpha1/dataplane_insight_helpers.go, line 9 at r2 (raw file):
"google.golang.org/protobuf/types/known/timestamppb" "github.com/kumahq/kuma/api/helpers"
Since you already imported this package, WDYT about adding type assertions for the insight and subscription types?
api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):
} // todo(lobkovilya): delete GetLatestSubscription, use GetLastSubscription instead
File an issue for this TODO?
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 8 at r2 (raw file):
"google.golang.org/protobuf/types/known/timestamppb" "github.com/kumahq/kuma/api/helpers"
Same suggestion to add type assertions.
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):
return } discoverySubscription, ok := s.(*DiscoverySubscription)
This should never happen, right? Can we just let this panic?
pkg/gc/collector_test.go, line 66 at r2 (raw file):
Expect(err).ToNot(HaveOccurred()) mtxNow.Lock() core.Now = func() time.Time {
We should reset core.Now
in an AfterEach
.
pkg/gc/collector_test.go, line 69 at r2 (raw file):
return now } mtxNow.Unlock()
Why do we need the mutex here?
pkg/gc/components.go, line 66 at r2 (raw file):
} default: return errors.Errorf("unknown Kuma CP mode %v", rt.Config().Mode)
nit: should use %s
here
pkg/gc/finalizer.go, line 35 at r2 (raw file):
// 4. DPP status is Online whereas it should be Offline type descriptor struct {
Suggest this would be clearer if we called it subscription
.
pkg/gc/finalizer.go, line 40 at r2 (raw file):
} type insightMap map[core_model.ResourceKey]*descriptor
Do we need a pointer here, or just use a value directly.
pkg/gc/finalizer.go, line 54 at r2 (raw file):
insights := insightsByType{} for _, typ := range types { if err := validateType(typ); err != nil {
Suggest that we rename validateType
for clarity and just make it return bool
:
if !isInsightType(typ) {
return fmt.Errorf("%q type is not an Insight", typ)
}
pkg/gc/finalizer.go, line 90 at r2 (raw file):
func (f *subscriptionFinalizer) checkGeneration(typ core_model.ResourceType) error { // get all the insights for provided type ctx := context.Background()
Nit: this is only used once, let's just inline it.
pkg/gc/finalizer.go, line 97 at r2 (raw file):
// delete items from the map that don't exist in the upstream f.upstreamSync(insights)
Rename upstreamSync
to more obviously be named after what it is doing. Maybe removeDeletedInsights
?
pkg/gc/finalizer.go, line 120 at r2 (raw file):
} log.V(1).Info("mark subscription as disconnected")
Log the subscription ID and generation here?
pkg/gc/finalizer.go, line 129 at r2 (raw file):
if err != nil { log.Error(err, "unable to finalize subscription") }
Return the error here? If we return the error then we will also keep the map entry, which means that we will come back and retry.
pkg/gc/finalizer_test.go, line 41 at r2 (raw file):
mtxNow.RLock() _ = finalizer.Start(stop) mtxNow.RUnlock()
Not sure that we have to lock this over the whole run time. The BeforeEach
should always complete before we run anything, right?
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 21 of 30 files reviewed, 18 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)
api/helpers/insights.go, line 1 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Suggest that we call this package
generic
orinterface
.generic.Insight
reads nicely IMHO.
Done.
api/mesh/v1alpha1/dataplane_insight.proto, line 74 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Nit: end sentence with a full-stop. Here and elsewhere.
https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
because of the approach, proposed by @jakubdyszkiewicz, we can avoid extending API at all
api/mesh/v1alpha1/dataplane_insight.pb.go, line 112 at r2 (raw file):
Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…
would it be possible to use Version from ResourceMeta instead?
This is a really good idea, done.
api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):
Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…
Will this be done in this PR or separate? If separate, do we have a issue / card for it?
Done #2557
api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
File an issue for this TODO?
Done #2557
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
This should never happen, right? Can we just let this panic?
If a developer makes a mistake and passes the wrong subscription type to the UpdateSubscription
, will this panic immediately?
I can imagine the code that operates with subscriptions in a generic way (like finalizer.go) and may cause not an immediate panic, but panic later in production. So I'd better avoid panic here, maybe log some error?
pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):
Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…
I'm thinking... now we rely on Insights to provide a stable product. Should we remove an option to disable insights?
If you disable Insights you won't see correct statuses in GUI, that's how it worked before this PR. But if we want to remove enabled
option, I can do it
pkg/gc/collector_test.go, line 66 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
We should reset
core.Now
in anAfterEach
.
Done.
pkg/gc/collector_test.go, line 69 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Why do we need the mutex here?
Added the comment:
// The problem is SubscriptionFinalizer in the same package uses core.Now
// variable to set DisconnectTime when finalizes subscription.
// Test in the current file replaces core.Now for testing purposes. When running
// the whole suite with `-race` flag it constantly fails with `WARNING: DATA RACE`.
// In order to avoid that we protect core.Now with mtxNow mutex in the scope
// of this gc_test package.
pkg/gc/finalizer.go, line 35 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Suggest this would be clearer if we called it
subscription
.
now it's just version
, so got rid of this struct
pkg/gc/finalizer.go, line 40 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Do we need a pointer here, or just use a value directly.
same
pkg/gc/finalizer.go, line 54 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Suggest that we rename
validateType
for clarity and just make it returnbool
:if !isInsightType(typ) { return fmt.Errorf("%q type is not an Insight", typ) }
Done. Now isInsightType
could panic if obj, err := registry.Global().NewObject(typ)
fails, but I think it's acceptable
pkg/gc/finalizer.go, line 97 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Rename
upstreamSync
to more obviously be named after what it is doing. MayberemoveDeletedInsights
?
Done.
pkg/gc/finalizer.go, line 120 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Log the subscription ID and generation here?
no need since now we're relying on ResourceVersion
pkg/gc/finalizer.go, line 129 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Return the error here? If we return the error then we will also keep the map entry, which means that we will come back and retry.
Right! I missed that, thanks!
pkg/gc/finalizer_test.go, line 41 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Not sure that we have to lock this over the whole run time. The
BeforeEach
should always complete before we run anything, right?
yes it should, but on CI we run tests with -race
, and apparently it doesn't see any synchronization between BeforeEach
and this goroutine, so it constantly fails with race detected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 15 files at r3.
Reviewable status: 10 of 30 files reviewed, 9 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 29 files at r1, 12 of 15 files at r3, all commit messages.
Reviewable status: 29 of 30 files reviewed, 11 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):
Previously, lobkovilya (Ilya Lobkov) wrote…
If a developer makes a mistake and passes the wrong subscription type to the
UpdateSubscription
, will this panic immediately?
I can imagine the code that operates with subscriptions in a generic way (like finalizer.go) and may cause not an immediate panic, but panic later in production. So I'd better avoid panic here, maybe log some error?
Then can we propagate an error out of UpdateSubscription
? That can bubble up through the upsert and be handled in the finalizer.
pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):
Previously, lobkovilya (Ilya Lobkov) wrote…
If you disable Insights you won't see correct statuses in GUI, that's how it worked before this PR. But if we want to remove
enabled
option, I can do it
Shall we discuss this in a separate issue?
pkg/gc/finalizer.go, line 116 at r3 (raw file):
upsertInsight, _ := registry.Global().NewObject(typ) err := manager.Upsert(f.rm, core_model.MetaToResourceKey(item.GetMeta()), upsertInsight, func(_ core_model.Resource) {
You already have a local variable key
that you can use here.
pkg/gc/finalizer.go, line 117 at r3 (raw file):
upsertInsight, _ := registry.Global().NewObject(typ) err := manager.Upsert(f.rm, core_model.MetaToResourceKey(item.GetMeta()), upsertInsight, func(_ core_model.Resource) { upsertInsight.GetSpec().(generic.Insight).UpdateSubscription(insight.GetLastSubscription())
Here, aren't you supposed to update the parameter passed to the callback func?
pkg/kds/server/status_sink.go, line 60 at r3 (raw file):
var lastStoredState *system_proto.KDSSubscription forceUpdate := false
Move this into flush()
and drop the reset.
pkg/xds/server/callbacks/dataplane_status_sink.go, line 65 at r3 (raw file):
var lastStoredState *mesh_proto.DiscoverySubscription forceUpdate := false
If you make this local to flush()
then you don't need to reset it.
pkg/xds/server/callbacks/dataplane_status_sink.go, line 75 at r3 (raw file):
} if proto.Equal(currentState, lastStoredState) && !forceUpdate {
I'm am pretty sure that Kubernetes does not guarantee to increment the resource version if you do a no-op write.
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com> # Conflicts: # pkg/sds/server/v3/reconciler.go # pkg/xds/server/callbacks/dataplane_status_sink.go # pkg/xds/server/callbacks/dataplane_status_sink_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 29 of 30 files reviewed, 11 unresolved discussions (waiting on @jakubdyszkiewicz and @jpeach)
api/mesh/v1alpha1/dataplane_insight_helpers.go, line 9 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Since you already imported this package, WDYT about adding type assertions for the insight and subscription types?
Done.
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 8 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Same suggestion to add type assertions.
Done.
api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Then can we propagate an error out of
UpdateSubscription
? That can bubble up through the upsert and be handled in the finalizer.
Done.
pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):
Previously, jpeach (James Peach) wrote…
Shall we discuss this in a separate issue?
Yep, done #2565
pkg/gc/finalizer.go, line 116 at r3 (raw file):
Previously, jpeach (James Peach) wrote…
You already have a local variable
key
that you can use here.
Done.
pkg/gc/finalizer.go, line 117 at r3 (raw file):
Previously, jpeach (James Peach) wrote…
Here, aren't you supposed to update the parameter passed to the callback func?
That doesn't matter, the variable is the same
pkg/kds/server/status_sink.go, line 60 at r3 (raw file):
Previously, jpeach (James Peach) wrote…
Move this into
flush()
and drop the reset.
Right, thanks!
pkg/xds/server/callbacks/dataplane_status_sink.go, line 65 at r3 (raw file):
Previously, jpeach (James Peach) wrote…
If you make this local to
flush()
then you don't need to reset it.
Done.
pkg/xds/server/callbacks/dataplane_status_sink.go, line 75 at r3 (raw file):
Previously, jpeach (James Peach) wrote…
I'm am pretty sure that Kubernetes does not guarantee to increment the resource version if you do a no-op write.
Yes, that's true, revert last changes and restore 'generation' field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 14 files at r4, 20 of 21 files at r5, all commit messages.
Reviewable status: 36 of 37 files reviewed, 4 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)
pkg/gc/finalizer.go, line 117 at r3 (raw file):
Previously, lobkovilya (Ilya Lobkov) wrote…
That doesn't matter, the variable is the same
Is there a guarantee that it can't be concurrently written by the sink?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 36 of 37 files reviewed, 3 unresolved discussions (waiting on @jakubdyszkiewicz and @jpeach)
@lobkovilya No backporting this one? |
Summary
Based on #2444, e2e tests and subscription refactoring were taken from there, but the algorithm is different.
After the discussion with @jpeach we came up with the new algorithm:
generation
counter.generation
counter was changed.The current algorithm requires 2 times fewer UPDATE operations on the storage.
Full changelog
idleTimeout
property for kuma-cp configIssues resolved
Fix #2320
Documentation
Testing
Backwards compatibility
backport-to-stable
label if the code is backwards compatible. Otherwise, list breaking changes.This change is