Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Migrate API Server Source to the Open Telemetry lib #4769

Closed
skonto opened this issue Jan 21, 2021 · 8 comments
Closed

[Spike] Migrate API Server Source to the Open Telemetry lib #4769

skonto opened this issue Jan 21, 2021 · 8 comments
Assignees
Labels
area/observability kind/feature-request lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@skonto
Copy link
Contributor

skonto commented Jan 21, 2021

Problem
As discussed here and as a sub-task of #3126 we should start exploring migrating to the new lib.

Persona:
Affects everyone consuming Knative metrics (ops etc)

Exit Criteria
Check if migration is viable/stable/performant and evaluate the new metrics model to drive decision making for next steps and full migration to the new stack.

@lionelvillard
Copy link
Member

@kthakar1990 FYI

@skonto skonto mentioned this issue Jan 25, 2021
@lberk lberk added this to the v0.22.0 milestone Jan 25, 2021
@lberk lberk added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 25, 2021
@skonto
Copy link
Contributor Author

skonto commented Jan 25, 2021

Apiserver source is recording metrics in three separate places one is memstats as usual eg.

# TYPE apiserversource_go_alloc gauge
apiserversource_go_alloc{name=""} 8.02932e+06

the other is the source specific metrics here eg.

# TYPE apiserversource_event_count counter
apiserversource_event_count{event_source="https://172.25.0.1:443",event_type="dev.knative.apiserver.resource.add",name="unknown",namespace_name="unknown",resource_group="unknown",response_code="202",response_code_class="2xx",response_error="",response_timeout=""} 730

and there are metrics that come from the controller reporter eg.

# TYPE apiserversource_client_latency histogram
apiserversource_client_latency_bucket{name="",le="1e-05"} 0

The plan for this spike is roughly to develop/migrate in parallel the code in knative/pkg that is called above following the new otel model as described here.
In the new model there is support for asynchronous calls (eg. Observer) and there are also plugins that can be used in this project. For example the memstats reporting can be replaced by this plugin, there is a trivial example here.

Regarding the architecture supported, we can try/test two different approaches:
a) use this example as a guide to directly expose metrics for Prometheus to scrape (pull)
b) emit metrics directly to a collector that can be scraped from Prometheus (push).

Goal is to compare with existing code and have enough data to evaluate performance.
For example otel introduces a resource (we already have our own definition for that in current implementation) and it is important to know to what extend we will face issues like here.

I will work on a PR.
/cc @aliok @vaikas @evankanderson @matzew

@kthakar1990
Copy link
Contributor

@skonto we have a requirement for couple of metrics for our source and I had presented about that in a community workgroup meeting today.

They asked me to connect with you to see if I can contribute in this work with you all and get our requirements added along with the same PR that you are working on. This is the issue for your reference.

Please let me know what's the best way to collaborate and have a quick discussion on this. Thank you :)

@skonto
Copy link
Contributor Author

skonto commented Jan 28, 2021

I did a quick PoC on the new lib for the metrics stuff and tried to emulate the memstats we have in knative/pkg, here is a sample of the latter. Here are the issues I spotted. I didnt want to jump in directly of porting stuff in knative.dev before we know what is working and what is not, what is wip etc.
I will continue with adding tests to check the oteltest lib and other stuff. Then we can migrate to complete this task if we see fit to do it.
/cc @aliok @bharattkukreja @kthakar1990 @evankanderson @vaikas

@bharattkukreja
Copy link
Contributor

@skonto can you share the branch with your changes? (I want to take up other areas for API Server Source migration). I am a little concerned about the issues because otel is supposed to be backwards compatible.

@skonto
Copy link
Contributor Author

skonto commented Feb 17, 2021

@bharattkukreja the branch is https://github.com/skonto/pkg/tree/otel, this is where we agreed to work on the api server source migration. The otel is backwards compatible at the shim layer wg. for tracing but not 100% check here. In general backwards compatibility for both open-tracing and opencensus is wip. I am sure things will improve in the future but in general we can skip at some point relying on the old opencensus lib.

@lberk
Copy link
Member

lberk commented Apr 12, 2021

@skonto looking at the date we first triaged the issue -- moving it back to the backlog for now -- feel free to move it once you get some spare cycles to work on it! Thanks!

@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/observability kind/feature-request lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

5 participants