Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add base xDS REST SD and kuma_sd implementation #8844

Merged
merged 5 commits into from
Jul 23, 2021

Conversation

austince
Copy link
Contributor

@austince austince commented May 18, 2021

Implementation of a Kuma Service Discovery mechanism via the xDS REST protocol.

Since this is new code, the protobuf library used is the official google.golang.com/protobuf lib, which is where the community would like to migrate to, as per this ML thread. The usage is entirely isolated from the prompb package where the gogo/proto package is currently used. When this transition is complete in Prometheus, it may make sense to include the Kuma MADS v1 .proto and compile it here, but since this is a "frozen" API, and the dependency on the official client already exists transitively, it did not seem too bad to include the compiled version here in the meantime.

Closes #7919

Screenshots

Screen Shot 2021-05-04 at 4 42 09 PM
Target synchronization

Screen Shot 2021-05-04 at 4 41 44 PM
Targets in Prometheus

Screen Shot 2021-05-04 at 4 43 21 PM
Scraped Envoy metrics

@austince austince changed the title Feat/discovery xds Add base xDS REST SD and kuma_sd implementation May 18, 2021
@austince austince force-pushed the feat/discovery-xds branch 5 times, most recently from f2a4b33 to 0344ae4 Compare May 18, 2021 21:23
@austince
Copy link
Contributor Author

I think the go version in go.mod is out of date (1.14) – I had to tidy with go 1.16 to pass the CI check. Did I miss this note somewhere?

@austince austince force-pushed the feat/discovery-xds branch from 0344ae4 to 91af23e Compare May 18, 2021 21:41
@roidelapluie
Copy link
Member

I think the go version in go.mod is out of date (1.14) – I had to tidy with go 1.16 to pass the CI check. Did I miss this note somewhere?

We did not update go.mod since we do not require specific go 1.15 or 1.16 issues.

We do not guarantee that Prometheus works properly with go 1.14 or less.
We also guarantee that some packages like TSDB can still be used with go 1.15.

@austince
Copy link
Contributor Author

@roidelapluie – makes sense, thanks!

Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have started reviewing this but I have a major question first:

Why do you use refresh? Refresh is used for simple service discoveries. Fully fledged service discoveries like this should directly implement the discoverer interface and not rely on refresh. Look at consul for an example.

discovery/refresh/refresh.go Outdated Show resolved Hide resolved
@austince
Copy link
Contributor Author

I have started reviewing this but I have a major question first:

Why do you use refresh? Refresh is used for simple service discoveries. Fully fledged service discoveries like this should directly implement the discoverer interface and not rely on refresh. Look at consul for an example.

Thanks for taking a look!

I used refresh as I thought it was preferred for simple polling-based SDs, like triton, docker, AWS, etc. Reading through the Consul SD now it seems like it's doing some sort of persistent watching + rpc calls on a few channels, sort of like a gRPC stream? Is that correct?

Since the xDS HTTP API is polling based, refresh seemed applicable. Are you concerned about the applicability of the "skip refresh" functionality I've added here?

@roidelapluie
Copy link
Member

Consul uses http long polling, which sounds like I read from the xds doc: Synchronous (long) polling via REST endpoints is also available for the xDS singleton APIs.

@roidelapluie
Copy link
Member

		VersionInfo:   rc.latestVersion,
		ResponseNonce: rc.latestNonce,

That seems to be like the watch ID in consul.

@austince
Copy link
Contributor Author

austince commented May 19, 2021

Consul uses http long polling, which sounds like I read from the xds doc: Synchronous (long) polling via REST endpoints is also available for the xDS singleton APIs.

That makes a lot of sense – is this is set up via the http.Transport? Never used long polling before.

IdleConnTimeout: 2 * time.Duration(watchTimeout),
DialContext: conntrack.NewDialContextFunc(
  conntrack.DialWithTracing(),
  conntrack.DialWithName("consul_sd"),
),

If so, that shouldn't be a huge change. If it's more complicated, happy to look more into it.

@roidelapluie
Copy link
Member

Long polling means that you make an HTTP request and you get the answer when the server decides (e.g. when there is an update).

It means that we do not need to do a request every 30s, and of there is an update after 10s, we directly get the update. Then, we start another query, which might return only after 2 minutes (the next xds change).

@austince
Copy link
Contributor Author

Got it – ok, I'll look into if there are any changes necessary to support this on the xDS server side and then update this to use long polling. Thanks for the feedback – please let me any other questions that come up in the meantime.

@austince austince marked this pull request as draft May 19, 2021 22:13
@austince austince force-pushed the feat/discovery-xds branch 4 times, most recently from f915a0b to 222c82a Compare June 17, 2021 21:14
@austince austince marked this pull request as ready for review June 17, 2021 21:37
@austince
Copy link
Contributor Author

@roidelapluie I've updated this PR to use HTTP long polling, which is available in the Kuma 1.2.0 release. Please have another look when you have a chance. Thanks!

@austince austince force-pushed the feat/discovery-xds branch from 222c82a to cc939ac Compare June 17, 2021 23:36
@austince austince requested a review from roidelapluie June 18, 2021 12:06
@roidelapluie
Copy link
Member

I want to thank you already! I will get to it ASAP!

@austince austince force-pushed the feat/discovery-xds branch from cc939ac to e381025 Compare June 28, 2021 13:59
@austince austince force-pushed the feat/discovery-xds branch 2 times, most recently from 0547e44 to 56ee89f Compare June 30, 2021 21:00
@austince
Copy link
Contributor Author

@roidelapluie I think I have addressed all your comments, namely:

  • removed unnecessary labels (versions, server)
  • removed version config options so that only protocol v3 and MADS v1 are implemented
    • the few places where ProtocolVersion is kept internally are mostly placeholders to not completely lock the implementation in to v3, though can remove those too if you'd like
  • defaulted client_id to the hostname if not specified
  • wrapped errors in kuma_sd and removed the confusing xds_sd prefixes

Thanks again + let me know if anything else shows up in your testing :)

@austince austince force-pushed the feat/discovery-xds branch from 56ee89f to 42e32b3 Compare July 6, 2021 15:39
@austince
Copy link
Contributor Author

austince commented Jul 6, 2021

Hey @roidelapluie, have you had time to continue with your testing?

@austince austince requested a review from roidelapluie July 6, 2021 15:59
Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few reviews. I am unsure we should change the metrics_path, but I am aware that some kubernetes SD change the scheme so I guess it is okay.

In general, I have the impression that there is a lot of debug level logging. We might want to see this down a bit.

Please verify that all comments start with a cap and end with a full stop.

discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/kuma.go Outdated Show resolved Hide resolved
discovery/xds/client_test.go Outdated Show resolved Hide resolved
discovery/xds/client.go Outdated Show resolved Hide resolved
discovery/xds/client.go Outdated Show resolved Hide resolved
discovery/xds/client.go Outdated Show resolved Hide resolved
@austince austince force-pushed the feat/discovery-xds branch 2 times, most recently from af8aa55 to 5743d4b Compare July 7, 2021 20:16
@austince austince force-pushed the feat/discovery-xds branch 2 times, most recently from 9348a27 to 1aab11d Compare July 20, 2021 20:20
Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, we are getting very close

util/osutil/hostname.go Outdated Show resolved Hide resolved
```yaml
# Address of the Kuma Control Plane's MADS xDS server.
server: <string>
# An arbitrary identifier to send to the MADS server, which should be unique to this instance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understamding we could remove this parameter for now and not expose it to the user.

discovery/xds/xds_test.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
discovery/xds/xds.go Outdated Show resolved Hide resolved
prometheus.MustRegister(kumaFetchDuration, kumaFetchSkipUpdateCount, kumaFetchFailuresCount)

// Register protobuf types that need to be marshalled/ unmarshalled.
_ = protoTypes.RegisterMessage((&v3.DiscoveryRequest{}).ProtoReflect().Type())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the empty assignment? if it is an error, should we deal with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, this should be more like a MustRegister..

discovery/xds/kuma.go Outdated Show resolved Hide resolved
discovery/xds/kuma.go Show resolved Hide resolved
discovery/xds/client.go Outdated Show resolved Hide resolved
@austince austince force-pushed the feat/discovery-xds branch 2 times, most recently from 27342db to 9617595 Compare July 21, 2021 16:48
austince added 5 commits July 21, 2021 12:55
Signed-off-by: austin ce <austin.cawley@gmail.com>
Signed-off-by: austin ce <austin.cawley@gmail.com>
Signed-off-by: austin ce <austin.cawley@gmail.com>
Signed-off-by: austin ce <austin.cawley@gmail.com>
Signed-off-by: austin ce <austin.cawley@gmail.com>
@austince austince force-pushed the feat/discovery-xds branch from 9617595 to 3593b20 Compare July 21, 2021 16:55
@austince austince requested a review from roidelapluie July 21, 2021 18:31
@austince
Copy link
Contributor Author

Thanks for your patience and good reviews @roidelapluie – I think I've addressed all the feedback now.

@roidelapluie roidelapluie merged commit 79d354a into prometheus:main Jul 23, 2021
@roidelapluie
Copy link
Member

Thanks!

@austince austince deleted the feat/discovery-xds branch July 23, 2021 14:12
@austince
Copy link
Contributor Author

Thank you so much @roidelapluie !

valyala added a commit to VictoriaMetrics/VictoriaMetrics that referenced this pull request Feb 23, 2023
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
  so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
  directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
  from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
  in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
  since these are public types.

Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/

Updates #3389

See prometheus/prometheus#7919
and prometheus/prometheus#8844
as a reference implementation in Prometheus
valyala added a commit to VictoriaMetrics/VictoriaMetrics that referenced this pull request Feb 23, 2023
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
  so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
  directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
  from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
  in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
  since these are public types.

Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/

Updates #3389

See prometheus/prometheus#7919
and prometheus/prometheus#8844
as a reference implementation in Prometheus
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Add Kuma service mesh SD mechanism
2 participants