Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kuma-cp) subscription finalizer, rev 2 #2526

Merged
merged 18 commits into from
Aug 16, 2021

Conversation

lobkovilya
Copy link
Contributor

@lobkovilya lobkovilya commented Aug 9, 2021

Summary

Based on #2444, e2e tests and subscription refactoring were taken from there, but the algorithm is different.

After the discussion with @jpeach we came up with the new algorithm:

  • Every N/2 mins StatusSink increments generation counter.
  • Every N mins Finalizer checks that generation counter was changed.
  • If counter didn't change after N minutes, Finalizer considers it as disconnected and sets DisconnectTime

The current algorithm requires 2 times fewer UPDATE operations on the storage.

Full changelog

  • add subscriptionFinalizer
  • enable resilient tests
  • new idleTimeout property for kuma-cp config

Issues resolved

Fix #2320

Documentation

Testing

  • Unit tests
  • E2E tests
  • Manual testing on Universal
  • Manual testing on Kubernetes

Backwards compatibility

  • Add backport-to-stable label if the code is backwards compatible. Otherwise, list breaking changes.

This change is Reviewable

Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>

# Conflicts:
#	api/mesh/v1alpha1/dataplane_insight.pb.go
#	api/system/v1alpha1/zone_insight.pb.go
#	pkg/xds/server/callbacks/dataplane_status_sink_test.go
#	test/framework/universal_cluster.go
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>

# Conflicts:
#	pkg/sds/server/v3/reconciler.go
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>

# Conflicts:
#	test/framework/universal_cluster.go
@lobkovilya lobkovilya requested a review from a team as a code owner August 9, 2021 11:09
lobkovilya and others added 2 commits August 10, 2021 15:35
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
@codecov-commenter
Copy link

codecov-commenter commented Aug 10, 2021

Codecov Report

Merging #2526 (f0754db) into master (7561bd0) will increase coverage by 0.02%.
The diff coverage is 43.08%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2526      +/-   ##
==========================================
+ Coverage   52.30%   52.33%   +0.02%     
==========================================
  Files         867      869       +2     
  Lines       48182    48429     +247     
==========================================
+ Hits        25204    25343     +139     
- Misses      20942    21035      +93     
- Partials     2036     2051      +15     
Impacted Files Coverage Δ
app/kuma-cp/cmd/run.go 24.79% <0.00%> (+0.59%) ⬆️
pkg/gc/collector.go 68.29% <ø> (ø)
pkg/kds/server/components.go 64.17% <0.00%> (-3.01%) ⬇️
pkg/kds/server/status_sink.go 0.00% <0.00%> (ø)
pkg/xds/server/v3/components.go 79.41% <0.00%> (-3.67%) ⬇️
...ilience/resilience_multizone_universal_postgres.go 0.00% <0.00%> (ø)
.../e2e/resilience/resilience_standalone_universal.go 0.00% <0.00%> (ø)
test/framework/interface.go 0.00% <0.00%> (ø)
test/framework/setup.go 0.00% <0.00%> (ø)
test/framework/universal_app.go 0.00% <0.00%> (ø)
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7561bd0...f0754db. Read the comment docs.

if subscription.DisconnectTime == nil {
subscription.DisconnectTime = now
}
}
}

func (ds *DataplaneInsight) GetLatestSubscription() (*DiscoverySubscription, *time.Time) {
if len(ds.GetSubscriptions()) == 0 {
// todo(lobkovilya): delete GetLatestSubscription, use GetLastSubscription instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be done in this PR or separate? If separate, do we have a issue / card for it?

@@ -270,11 +270,15 @@ metrics:
enabled: true # ENV: KUMA_METRICS_DATAPLANE_ENABLED
# How many latest subscriptions will be stored in DataplaneInsight object, if equals 0 then unlimited
subscriptionLimit: 2 # ENV: KUMA_METRICS_DATAPLANE_SUBSCRIPTION_LIMIT
# How long data plane proxy can stay Online without active xDS connection
idleTimeout: 5m # ENV: KUMA_METRICS_DATAPLANE_IDLE_TIMEOUT
zone:
# Enables collecting metrics from Zone
enabled: true # ENV: KUMA_METRICS_ZONE_ENABLED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking... now we rely on Insights to provide a stable product. Should we remove an option to disable insights?

@@ -107,6 +107,9 @@ type DiscoverySubscription struct {
Status *DiscoverySubscriptionStatus `protobuf:"bytes,5,opt,name=status,proto3" json:"status,omitempty"`
// Version of Envoy and Kuma dataplane
Version *Version `protobuf:"bytes,6,opt,name=version,proto3" json:"version,omitempty"`
// Generation is an integer number which is periodically increased by the
// status sink
Generation uint32 `protobuf:"varint,7,opt,name=generation,proto3" json:"generation,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to use Version from ResourceMeta instead?

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 19 of 29 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: 21 of 30 files reviewed, 17 unresolved discussions (waiting on @lobkovilya)


api/helpers/insights.go, line 1 at r2 (raw file):

package helpers

Suggest that we call this package generic or interface. generic.Insight reads nicely IMHO.


api/mesh/v1alpha1/dataplane_insight.proto, line 74 at r2 (raw file):

  // Generation is an integer number which is periodically increased by the
  // status sink

Nit: end sentence with a full-stop. Here and elsewhere.

https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences


api/mesh/v1alpha1/dataplane_insight_helpers.go, line 9 at r2 (raw file):

	"google.golang.org/protobuf/types/known/timestamppb"

	"github.com/kumahq/kuma/api/helpers"

Since you already imported this package, WDYT about adding type assertions for the insight and subscription types?


api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):

}

// todo(lobkovilya): delete GetLatestSubscription, use GetLastSubscription instead

File an issue for this TODO?


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 8 at r2 (raw file):

	"google.golang.org/protobuf/types/known/timestamppb"

	"github.com/kumahq/kuma/api/helpers"

Same suggestion to add type assertions.


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):

		return
	}
	discoverySubscription, ok := s.(*DiscoverySubscription)

This should never happen, right? Can we just let this panic?


pkg/gc/collector_test.go, line 66 at r2 (raw file):

		Expect(err).ToNot(HaveOccurred())
		mtxNow.Lock()
		core.Now = func() time.Time {

We should reset core.Now in an AfterEach.


pkg/gc/collector_test.go, line 69 at r2 (raw file):

			return now
		}
		mtxNow.Unlock()

Why do we need the mutex here?


pkg/gc/components.go, line 66 at r2 (raw file):

		}
	default:
		return errors.Errorf("unknown Kuma CP mode %v", rt.Config().Mode)

nit: should use %s here


pkg/gc/finalizer.go, line 35 at r2 (raw file):

// 4. DPP status is Online whereas it should be Offline

type descriptor struct {

Suggest this would be clearer if we called it subscription.


pkg/gc/finalizer.go, line 40 at r2 (raw file):

}

type insightMap map[core_model.ResourceKey]*descriptor

Do we need a pointer here, or just use a value directly.


pkg/gc/finalizer.go, line 54 at r2 (raw file):

	insights := insightsByType{}
	for _, typ := range types {
		if err := validateType(typ); err != nil {

Suggest that we rename validateType for clarity and just make it return bool:

if !isInsightType(typ) {
    return fmt.Errorf("%q type is not an Insight", typ)
}

pkg/gc/finalizer.go, line 90 at r2 (raw file):

func (f *subscriptionFinalizer) checkGeneration(typ core_model.ResourceType) error {
	// get all the insights for provided type
	ctx := context.Background()

Nit: this is only used once, let's just inline it.


pkg/gc/finalizer.go, line 97 at r2 (raw file):

	// delete items from the map that don't exist in the upstream
	f.upstreamSync(insights)

Rename upstreamSyncto more obviously be named after what it is doing. Maybe removeDeletedInsights?


pkg/gc/finalizer.go, line 120 at r2 (raw file):

		}

		log.V(1).Info("mark subscription as disconnected")

Log the subscription ID and generation here?


pkg/gc/finalizer.go, line 129 at r2 (raw file):

		if err != nil {
			log.Error(err, "unable to finalize subscription")
		}

Return the error here? If we return the error then we will also keep the map entry, which means that we will come back and retry.


pkg/gc/finalizer_test.go, line 41 at r2 (raw file):

			mtxNow.RLock()
			_ = finalizer.Start(stop)
			mtxNow.RUnlock()

Not sure that we have to lock this over the whole run time. The BeforeEach should always complete before we run anything, right?

Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Copy link
Contributor Author

@lobkovilya lobkovilya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 21 of 30 files reviewed, 18 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)


api/helpers/insights.go, line 1 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Suggest that we call this package generic or interface. generic.Insight reads nicely IMHO.

Done.


api/mesh/v1alpha1/dataplane_insight.proto, line 74 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Nit: end sentence with a full-stop. Here and elsewhere.

https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences

because of the approach, proposed by @jakubdyszkiewicz, we can avoid extending API at all


api/mesh/v1alpha1/dataplane_insight.pb.go, line 112 at r2 (raw file):

Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…

would it be possible to use Version from ResourceMeta instead?

This is a really good idea, done.


api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):

Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…

Will this be done in this PR or separate? If separate, do we have a issue / card for it?

Done #2557


api/mesh/v1alpha1/dataplane_insight_helpers.go, line 102 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

File an issue for this TODO?

Done #2557


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

This should never happen, right? Can we just let this panic?

If a developer makes a mistake and passes the wrong subscription type to the UpdateSubscription, will this panic immediately?
I can imagine the code that operates with subscriptions in a generic way (like finalizer.go) and may cause not an immediate panic, but panic later in production. So I'd better avoid panic here, maybe log some error?


pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):

Previously, jakubdyszkiewicz (Jakub Dyszkiewicz) wrote…

I'm thinking... now we rely on Insights to provide a stable product. Should we remove an option to disable insights?

If you disable Insights you won't see correct statuses in GUI, that's how it worked before this PR. But if we want to remove enabled option, I can do it


pkg/gc/collector_test.go, line 66 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

We should reset core.Now in an AfterEach.

Done.


pkg/gc/collector_test.go, line 69 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Why do we need the mutex here?

Added the comment:

// The problem is SubscriptionFinalizer in the same package uses core.Now
// variable to set DisconnectTime when finalizes subscription.
// Test in the current file replaces core.Now for testing purposes. When running
// the whole suite with `-race` flag it constantly fails with `WARNING: DATA RACE`.
// In order to avoid that we protect core.Now with mtxNow mutex in the scope
// of this gc_test package.

pkg/gc/finalizer.go, line 35 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Suggest this would be clearer if we called it subscription.

now it's just version, so got rid of this struct


pkg/gc/finalizer.go, line 40 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Do we need a pointer here, or just use a value directly.

same


pkg/gc/finalizer.go, line 54 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Suggest that we rename validateType for clarity and just make it return bool:

if !isInsightType(typ) {
    return fmt.Errorf("%q type is not an Insight", typ)
}

Done. Now isInsightType could panic if obj, err := registry.Global().NewObject(typ) fails, but I think it's acceptable


pkg/gc/finalizer.go, line 97 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Rename upstreamSyncto more obviously be named after what it is doing. Maybe removeDeletedInsights?

Done.


pkg/gc/finalizer.go, line 120 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Log the subscription ID and generation here?

no need since now we're relying on ResourceVersion


pkg/gc/finalizer.go, line 129 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Return the error here? If we return the error then we will also keep the map entry, which means that we will come back and retry.

Right! I missed that, thanks!


pkg/gc/finalizer_test.go, line 41 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Not sure that we have to lock this over the whole run time. The BeforeEach should always complete before we run anything, right?

yes it should, but on CI we run tests with -race, and apparently it doesn't see any synchronization between BeforeEach and this goroutine, so it constantly fails with race detected

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 15 files at r3.
Reviewable status: 10 of 30 files reviewed, 9 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 29 files at r1, 12 of 15 files at r3, all commit messages.
Reviewable status: 29 of 30 files reviewed, 11 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):

Previously, lobkovilya (Ilya Lobkov) wrote…

If a developer makes a mistake and passes the wrong subscription type to the UpdateSubscription, will this panic immediately?
I can imagine the code that operates with subscriptions in a generic way (like finalizer.go) and may cause not an immediate panic, but panic later in production. So I'd better avoid panic here, maybe log some error?

Then can we propagate an error out of UpdateSubscription? That can bubble up through the upsert and be handled in the finalizer.


pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):

Previously, lobkovilya (Ilya Lobkov) wrote…

If you disable Insights you won't see correct statuses in GUI, that's how it worked before this PR. But if we want to remove enabled option, I can do it

Shall we discuss this in a separate issue?


pkg/gc/finalizer.go, line 116 at r3 (raw file):

		upsertInsight, _ := registry.Global().NewObject(typ)
		err := manager.Upsert(f.rm, core_model.MetaToResourceKey(item.GetMeta()), upsertInsight, func(_ core_model.Resource) {

You already have a local variable key that you can use here.


pkg/gc/finalizer.go, line 117 at r3 (raw file):

		upsertInsight, _ := registry.Global().NewObject(typ)
		err := manager.Upsert(f.rm, core_model.MetaToResourceKey(item.GetMeta()), upsertInsight, func(_ core_model.Resource) {
			upsertInsight.GetSpec().(generic.Insight).UpdateSubscription(insight.GetLastSubscription())

Here, aren't you supposed to update the parameter passed to the callback func?


pkg/kds/server/status_sink.go, line 60 at r3 (raw file):

	var lastStoredState *system_proto.KDSSubscription
	forceUpdate := false

Move this into flush() and drop the reset.


pkg/xds/server/callbacks/dataplane_status_sink.go, line 65 at r3 (raw file):

	var lastStoredState *mesh_proto.DiscoverySubscription
	forceUpdate := false

If you make this local to flush() then you don't need to reset it.


pkg/xds/server/callbacks/dataplane_status_sink.go, line 75 at r3 (raw file):

		}

		if proto.Equal(currentState, lastStoredState) && !forceUpdate {

I'm am pretty sure that Kubernetes does not guarantee to increment the resource version if you do a no-op write.

kubernetes/kubernetes#67541

lobkovilya and others added 3 commits August 13, 2021 18:22
Signed-off-by: Ilya Lobkov <lobkovilya@yandex.ru>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

# Conflicts:
#	pkg/sds/server/v3/reconciler.go
#	pkg/xds/server/callbacks/dataplane_status_sink.go
#	pkg/xds/server/callbacks/dataplane_status_sink_test.go
Copy link
Contributor Author

@lobkovilya lobkovilya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 29 of 30 files reviewed, 11 unresolved discussions (waiting on @jakubdyszkiewicz and @jpeach)


api/mesh/v1alpha1/dataplane_insight_helpers.go, line 9 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Since you already imported this package, WDYT about adding type assertions for the insight and subscription types?

Done.


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 8 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Same suggestion to add type assertions.

Done.


api/mesh/v1alpha1/zone_ingress_insight_helpers.go, line 24 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Then can we propagate an error out of UpdateSubscription? That can bubble up through the upsert and be handled in the finalizer.

Done.


pkg/config/app/kuma-cp/kuma-cp.defaults.yaml, line 277 at r2 (raw file):

Previously, jpeach (James Peach) wrote…

Shall we discuss this in a separate issue?

Yep, done #2565


pkg/gc/finalizer.go, line 116 at r3 (raw file):

Previously, jpeach (James Peach) wrote…

You already have a local variable key that you can use here.

Done.


pkg/gc/finalizer.go, line 117 at r3 (raw file):

Previously, jpeach (James Peach) wrote…

Here, aren't you supposed to update the parameter passed to the callback func?

That doesn't matter, the variable is the same


pkg/kds/server/status_sink.go, line 60 at r3 (raw file):

Previously, jpeach (James Peach) wrote…

Move this into flush() and drop the reset.

Right, thanks!


pkg/xds/server/callbacks/dataplane_status_sink.go, line 65 at r3 (raw file):

Previously, jpeach (James Peach) wrote…

If you make this local to flush() then you don't need to reset it.

Done.


pkg/xds/server/callbacks/dataplane_status_sink.go, line 75 at r3 (raw file):

Previously, jpeach (James Peach) wrote…

I'm am pretty sure that Kubernetes does not guarantee to increment the resource version if you do a no-op write.

kubernetes/kubernetes#67541

Yes, that's true, revert last changes and restore 'generation' field

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 14 files at r4, 20 of 21 files at r5, all commit messages.
Reviewable status: 36 of 37 files reviewed, 4 unresolved discussions (waiting on @jakubdyszkiewicz, @jpeach, and @lobkovilya)


pkg/gc/finalizer.go, line 117 at r3 (raw file):

Previously, lobkovilya (Ilya Lobkov) wrote…

That doesn't matter, the variable is the same

Is there a guarantee that it can't be concurrently written by the sink?

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 36 of 37 files reviewed, 3 unresolved discussions (waiting on @jakubdyszkiewicz and @jpeach)

@lobkovilya lobkovilya merged commit 1343424 into master Aug 16, 2021
@lobkovilya lobkovilya deleted the fix/status-finalizer-with-counter branch August 16, 2021 10:00
@jpeach
Copy link
Contributor

jpeach commented Aug 16, 2021

@lobkovilya No backporting this one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataplane, Zone and ZoneIngress status problem
4 participants