telemetry: periodic reporting #2721

Open · wants to merge 4 commits into base: master

Conversation

kyessenov
Contributor

@kyessenov kyessenov commented Mar 10, 2023

Generalize interval reporting to all access logs.
This is useful for long duration streams; the setting would work for stdout access logs, OTel, etc., and not just Prometheus stats.
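For reference, a rough assembly of the proposed API surface, pieced together from the proto hunks quoted in the review below; indentation, field numbers, and surrounding context are approximate, not the authoritative diff:

  // Reporting interval allows configuration of the time between the access log
  // reports for the individual streams. Duration set to 0 disables the periodic
  // reporting, and only the ends of the streams are reported.
  message ReportingInterval {
    // Optional. Reporting interval for TCP streams. The default value is 5s.
    google.protobuf.Duration tcp = 1;

    // Optional. Reporting interval for HTTP streams. Not enabled by default.
    google.protobuf.Duration http = 2;
  }

  // Configuration for the interval reporting for the access logging and
  // metrics.
  ReportingInterval reporting_interval = 5;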

Fixes: istio/istio#43763

Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov kyessenov requested a review from a team as a code owner March 10, 2023 19:58
@istio-policy-bot

😊 Welcome @kyessenov! This is either your first contribution to the Istio api repo, or it's been a while since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 10, 2023
@kyessenov kyessenov added the release-notes-none Indicates a PR that does not require release notes. label Mar 10, 2023
@kyessenov
Contributor Author

/retest

@kyessenov
Contributor Author

/assign @zirain


// Configuration for the interval reporting for the access logging and
// metrics.
ReportingInterval reporting_interval = 5;
Member

Should this be at the top level or inside each struct? You may want different settings for each. For stdout logs we set the flush interval to 1s, for example, but that would be aggressive for writes to a remote server.

Does this override --file-flush-interval-msec?

Contributor Author

That's not supported. Most people have only one sink enabled (telemetry costs $$$) and we should cater to them and not complicate things for imaginary use cases.

This setting has nothing to do with IO flush.

Contributor Author

I think you might be misunderstanding periodic reporting. It's reporting per data stream, not per sink stream.

Member

If I am misunderstanding it, then surely our users will 🙂. Can we clarify the comments a bit so they're understandable for someone who doesn't have prior context?

Member

Also, if it's not related to IO flush, that seems like something to call out in the comments. I guess that means if I use Telemetry to turn on stdout logging and set this field, it has no impact?

Contributor Author

I mentioned that the IO flush rate is completely independent. Istio reports telemetry per stream, not per sink.

Member

Reading the .proto file I still have no clue what this does

  // Reporting interval allows configuration of the time between the access log
  // reports for the individual streams.

But you are saying it's NOT related to the IO flush rate, which is, by my understanding, the interval at which access logs are written to stdout?

In Envoy, is this config global or something? What happens if I have 2 metrics backends configured by 2 Telemetry with 2 different configs? Or 2 telemetry, 1 configuring metrics and 1 configuring access logs.

TBH I have read this proto quite a few times and I don't think I am any closer to understanding what it does, and how or why I should set it. Can we write the comments from the perspective of a user to help guide them?

Contributor Author

This is not a new feature - it has existed in Istio since 1.0 for TCP. The IO flush rate has nothing to do with telemetry production - it's the rate of delivering telemetry, not producing it. Users never complained about this being confusing :/

In Envoy, is this config global or something? What happens if I have 2 metrics backends configured by 2 Telemetry with 2 different configs? Or 2 telemetry, 1 configuring metrics and 1 configuring access logs.

It is a global setting per listener.

Contributor

Do we have a list of telemetry providers - and how many can implement this feature? From what I've seen, if we use OTel this will not be configurable per app - but the per-node or cluster collector will handle most of this.

For metric push - I see OTel has it, so I don't mind adding it, but for the others I would not.

Contributor Author

Do we have a list of telemetry providers - and how many can implement this feature ? From what I've seen, if we use OTel this will not be configurable per app - but the per-node or cluster collector will handle most of this.

I think all of them would be supported, but they need to be modified to properly report mid-stream telemetry. OTel would need a special field to indicate whether it's end of stream or mid-stream.

For metric push - I see OTel has it so I don't mind adding it, but for the others I would not.

Metrics are somewhat irrelevant actually; it's an artifact of Istio's quirky Stackdriver/Prometheus integration.
Standard Envoy metrics report immediately, and Istio metric sinks act like access loggers, so they get a snapshot of the data at the same rate as this value.

Member

@zirain zirain left a comment

LGTM

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Mar 13, 2023
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Mar 14, 2023
@istio-policy-bot

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@kyessenov
Contributor Author

Need TOC review.

@howardjohn @linsun @ericvn

@howardjohn howardjohn added the do-not-merge/hold Block automatic merging of a PR. label Mar 16, 2023
google.protobuf.Duration http = 2;
}

// Configuration for the interval reporting for the access logging and
Member

What does it mean to "report" an access log at a reporting_interval: 5s? Does that mean if I open a connection and send no data, it will report an access log every 5s?

Contributor Author

Yes, that's what Istio does today?

Member

As far as I know access logs are only printed once per connection...?

Contributor Author

Only the demo stdout log. The production Stackdriver log reports periodically.

Member

Can we note this in the proto? Stdout is not seen as "demo" either; it's pretty commonly used in production (https://12factor.net/logs).

Contributor Author

Does any vendor rely on stdout for production? It's "demo" because you need another system to collect the logs, and we don't include one in Istio.

The point of this PR is to generalize what we do for Stackdriver, as a good production practice, to every other logging sink, including OTel and stdout. Is that what you want me to say?

Member

Does any vendor rely on stdout for production?

Yes, we do 🙂

I want the PR to explain what the API does, and give guidance to a user on how they can configure it, which is the standard for API documentation.

I am still not sure I even understand. "including OTel and stdout" - I thought we were saying this does not apply to stdout?

Contributor Author

Google doesn't depend on stdout logging; I'm not sure why you claim so. It shouldn't, since stdout logging will hurt performance.

The setting is global - it applies to all Envoy telemetry sinks, including stdout and OTel. The SD (Stackdriver) implementation is an oddball, but that's Google's problem and will be fixed later.

The API is about enabling periodic reporting for long-living streams, and it's opt-in. What is not clear about it? Are you questioning the value of periodic reporting or the shape of the API?

Contributor

Depends - many apps report logs to stdout, and those get collected and pushed to Stackdriver.

I think we need to be compatible with (and adopt) OTel, including integration with their collectors if available - and not expose APIs that would not be implementable or needed.

Maybe in mesh config for the other integrations - but even for that I doubt we should expose them.

All APIs require testing and support, and are hard to deprecate.

@kyessenov kyessenov changed the title from "telemetry: reporting duration" to "telemetry: periodic reporting" Mar 16, 2023
// Reporting interval allows configuration of the time between the access log
// reports for the individual streams. Duration set to 0 disables the periodic
// reporting, and only the ends of the streams are reported.
message ReportingInterval {
Member

Should this be part of AccessLogging? IIUC, this config is only applicable to access logs. If access logging is not enabled, this config would not be useful, correct?

Member

Also, it's worth explaining why a user would want to customize this and for what types of workloads, and why they would consider disabling it.

Contributor

I think we need to have a discussion about what we plan to do about OTel and all the standards and integrations they are driving.

https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/ has the set of options OTel supports - OTEL_METRIC_EXPORT_INTERVAL is the equivalent setting that an app using otel natively would use.

In OTel it applies to all push-based metric exporters - logs and traces don't seem to have this (but have batch and other settings).

IMO we should consider the API surface of OTel and not try to have more complicated settings.

Contributor Author

@linsun AccessLogging is a list, and the setting is global in Envoy, hence I pulled it to the top. Users shouldn't generally touch this unless they want to disable it (because they want to limit telemetry). It cannot be made OTel-specific, since the log production is driven by Envoy and the push is driven by OTel.

@costinm That env var seems unrelated - again, it's about metric push, not production. See the discussion about IO flush - they are not related at all.

The closest semantic convention I could find is modeling stream events in OTel: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/semantic_conventions/events.md. We'd define an event.name for stream open/mid/close, and report the generic attributes associated with the stream.
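To make the opt-out concrete, a minimal sketch in textproto form (not the user-facing Telemetry YAML; the field names come from the hunks in this review): setting the duration to 0 disables periodic reporting, so only stream ends are reported.

  # Disable mid-stream reports for TCP; only report at stream end.
  reporting_interval {
    tcp { seconds: 0 }
  }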

// Optional. Reporting interval for TCP streams. The default value is 5s.
google.protobuf.Duration tcp = 1;

// Optional. Reporting interval for HTTP streams. Not enabled by default.
Member

Both HTTP/1 and HTTP/2?

Contributor

Not sure why this would be here or why it matters what protocol is used. You want 30 seconds - using whatever protocol is available.

Contributor Author

Any HTTP version; Envoy treats them all the same.

Specialization by protocol is for backwards compatibility only - Istio does periodic reporting for TCP only right now.

30s is too long - users consider 10s of extra latency unacceptable, and without any telemetry during that time, we can't debug.
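As an illustration of the opt-in direction (again a sketch in textproto form, not the user-facing Telemetry YAML; the values are examples only): keep the 5s TCP default and additionally enable mid-stream reports for HTTP streams.

  # Example values only: periodic reports every 5s for TCP, every 10s for HTTP.
  reporting_interval {
    tcp { seconds: 5 }
    http { seconds: 10 }
  }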

}

// Configuration for the interval reporting for the access logging and
// metrics. Telemetry is reported independently per each data stream, at
Member

Is this for metrics also? Earlier only access logs were mentioned.

Contributor Author

Some metric sinks are implemented as access loggers. It's deliberately vague since it's implementation-dependent.

@linsun
Member

linsun commented Mar 17, 2023

@kyessenov @zirain I think I agree the docs for the proto could be a bit clearer for users who are not very familiar with telemetry. It would be really helpful to explain to readers why something should be configured, in addition to the behavior. At a minimum, point readers to where they can find that info.

Also worth clarifying whether this is for access logs, metrics, or both.

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

@costinm OTel is all about the metric schema and the distribution. It's still up to Istio to define when we produce the telemetry, irrespective of the sink. OTel doesn't define a strict schema - it's a universal mapping from any other semantic model, but it doesn't specify what you map into it. We really need to support long-running streams better at the metric production level because 1) HBONE is a long-running stream; 2) debugging customer issues with long streams is painful.

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

kyessenov commented Mar 17, 2023

@costinm The logging data model in OTel is very loose. There are three parts to it:

  • event name and domain - domain is stream, names are open/close/mid;
  • resource - generally it's the proxy workload, a bag of string to string mappings;
  • attributes - a map from string to string where strings are "attribute names". These would be things about http.request, source.*, destination.*.

I'm not trying to design the exact set of attributes to produce in this PR. That's for @lei-tang, I think. The topic is making sure we can produce stream-open events in Envoy, because that's what we already do for OpenCensus, and OTLP is now upstream, so we must use upstream xDS to match the behavior.
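As a purely hypothetical sketch of the data model described above (not part of this PR; the attribute names and values are illustrative assumptions), a mid-stream event could surface as OTLP log record attributes along these lines:

  # Illustrative only: a mid-stream event expressed as OTLP log attributes.
  attributes { key: "event.domain" value { string_value: "stream" } }
  attributes { key: "event.name" value { string_value: "mid" } }
  attributes { key: "source.workload" value { string_value: "sleep" } }
  attributes { key: "destination.service" value { string_value: "reviews.default.svc.cluster.local" } }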

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

@costinm We cannot do it per-provider. Envoy has a global setting and it's hard to justify the complexity of running dual/distinctly configured providers at the same time.

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

I know we support multiple providers at the same time now. And Envoy has a global setting. But we can say that this feature is only supported for OTel, or require that multiple providers set it to the same value. And we can decide that for ambient we'll just support one provider - OTel/Prom - with collectors or agents handling additional sinks.

Explicitly handicapping this feature for other sinks doesn't seem like a good idea to me. It'd be nice to enable stdout with periodic reporting temporarily for debugging. Why should we block this support?

I looked into the gRPC OTel interceptor, and it's modeled as a "message" event in both directions. I think we could implement the exact same model in Envoy generically, but it's outside the scope of this PR since it's not periodic.

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

kyessenov commented Mar 17, 2023 via email

@costinm
Contributor

costinm commented Mar 17, 2023 via email

@kyessenov
Contributor Author

@costinm We cannot freeze telemetry API evolution while offering the EnvoyFilter approach. See the previous discussion on the original PR that added this, #2556. If we stop supporting the existing sidecar API, then we're going to get more and more EnvoyFilters doing the same thing.

@costinm
Contributor

costinm commented Mar 18, 2023 via email

@zirain
Member

zirain commented Jun 6, 2023

Now that #2800 is done, can we use an annotation to support a feature like this?

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Sep 7, 2023
@istio-testing
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Merging this pull request may close the linked issue: use standard Envoy interval reporting for telemetry.
7 participants