skaffold trace wrapping of critical functions & skaffold trace exporters via SKAFFOLD_TRACE env var #5854
Codecov Report
@@ Coverage Diff @@
## master #5854 +/- ##
==========================================
- Coverage 70.92% 70.86% -0.07%
==========================================
Files 449 451 +2
Lines 16981 17301 +320
==========================================
+ Hits 12044 12260 +216
- Misses 4039 4141 +102
- Partials 898 900 +2
I think this is really cool.
Since this example isn't really standalone, what do you think about turning it into a tutorial instead? And rather than maintain the jaeger-all-in-one-yaml, why not use their Helm chart? https://github.com/jaegertracing/helm-charts
Remember to copy examples/ back into integration/examples/
The naming conventions could do with a revisit: the PR is often using the function name, but other times using something quite different. I think we'd like to get some traceability back to the source to see where time is going?
WDYT of something like "pkg", "Func[_Sub]" where Func is the function name (e.g., Dev, or doDev), and _Sub is added if we're only tracing part of the function (e.g., Dev_perArtifact).
We also need to be very careful with PII in the attributes. Let's chat about this offline.
EDIT: aaron-prindle - this has been addressed #5854 (comment)
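To make the proposed Func[_Sub] convention concrete, here is a minimal sketch in Go. The helper name spanName is hypothetical and is not part of skaffold's instrumentation package; it only illustrates how a name such as Dev_perArtifact would be composed from a function name and an optional sub-step.

```go
package main

import "fmt"

// spanName is a hypothetical helper (not skaffold's API) that builds a span
// name of the form Func[_Sub]: fn is the traced function's name, and sub is
// set only when just a part of that function is being traced.
func spanName(fn, sub string) string {
	if sub == "" {
		return fn
	}
	return fn + "_" + sub
}

func main() {
	fmt.Println(spanName("Dev", ""))                    // whole function
	fmt.Println(spanName("Dev", "perArtifact"))         // one part of Dev
	fmt.Println(spanName("Deploy", "WaitForDeletions")) // one part of Deploy
}
```

Building the name in one place keeps span names traceable back to the source function, which is the point of the convention.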
Regarding the naming convention, I have changed it to the recommended form. I also changed the pkg information to point directly to the file (s/pkg\/skaffold\/runner/pkg\/skaffold\/runner\/v1\/deploy.go, etc.). Cloud Trace does not currently seem to show or use the package information in its UI, so the wrapper now adds it as a span attribute.
ctx, endTrace = instrumentation.StartTrace(ctx, "Deploy_WaitForDeletions")
Do we need to worry that ctx (the argument) is actually the ctx from the previous StartTrace(), and not the ctx passed into this function? Does the endTrace() re-link the context?
I have changed the Deploy_ and doDev_ calls that had multiple StartTrace calls in a single function and passed ctx down the chain. Each call now creates a new ctx (childCtx) and passes that to the relevant children for each command, so that each direct trace in the function uses the initial root ctx.
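The difference between overwriting ctx and using a separate childCtx can be sketched without any tracing library. In the example below, startTrace is a hypothetical stand-in for instrumentation.StartTrace (its real signature takes no log argument); it stores the span name in the context and records which span each new span is parented to. The assumption, which matches how context-based tracers such as OpenTelemetry generally behave, is that ending a span does not restore the previous context.

```go
package main

import (
	"context"
	"fmt"
)

// spanKey is the context key under which the current span name is stored.
type spanKey struct{}

// root stands in for the ctx a caller would pass into Deploy or doDev.
var root = context.Background()

// startTrace is a hypothetical stand-in for a tracing wrapper. It appends
// "name <- parent" to log and returns a context that carries the new span.
// The returned end function does nothing to ctx, mirroring the assumption
// that ending a span does not re-link the context.
func startTrace(ctx context.Context, name string, log *[]string) (context.Context, func()) {
	parent, ok := ctx.Value(spanKey{}).(string)
	if !ok {
		parent = "root"
	}
	*log = append(*log, name+" <- "+parent)
	return context.WithValue(ctx, spanKey{}, name), func() {}
}

// doWork represents the callee that receives the span's context.
func doWork(ctx context.Context) {}

// chained overwrites ctx, so the second span is parented to the first.
func chained(ctx context.Context) []string {
	var log []string
	ctx, end := startTrace(ctx, "Deploy_A", &log)
	doWork(ctx)
	end()
	ctx, end = startTrace(ctx, "Deploy_B", &log)
	doWork(ctx)
	end()
	return log
}

// siblings leaves ctx untouched and hands childCtx to the callee, so both
// spans are parented to the ctx that was passed into the function.
func siblings(ctx context.Context) []string {
	var log []string
	childCtx, end := startTrace(ctx, "Deploy_A", &log)
	doWork(childCtx)
	end()
	childCtx, end = startTrace(ctx, "Deploy_B", &log)
	doWork(childCtx)
	end()
	return log
}

func main() {
	fmt.Println(chained(root))
	fmt.Println(siblings(root))
}
```

In chained, Deploy_B is recorded as a child of Deploy_A even though Deploy_A has already ended; in siblings, both are recorded as children of the root.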
@@ -0,0 +1,37 @@
### Example: Skaffold Command Tracing with Jaeger
Can you put a note here that this is experimental and may change without notice? And be sure to place it in integration/examples too.
Done
_**WARNING: If you're running this on a cloud cluster, this example will create a service and expose a webserver.
It's highly recommended that you only run this example on a local, private cluster like minikube or Kubernetes in Docker for Desktop.**_
Why don't we just link to the Jaeger Project's instructions for their all-in-one docker container?
https://www.jaegertracing.io/docs/getting-started/#all-in-one
docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
jaegertracing/all-in-one:1.22
It's less maintenance for us, since we don't need to shlep this jaeger-all-in-one-template.yaml, and there's less risk since the user's daemon is unlikely to be exposed to the internet.
Done
pkg/skaffold/build/docker/docker.go
	"github.com/GoogleContainerTools/skaffold/pkg/skaffold/output"
	latestV1 "github.com/GoogleContainerTools/skaffold/pkg/skaffold/schema/latest/v1"
	"github.com/GoogleContainerTools/skaffold/pkg/skaffold/util"
	"github.com/GoogleContainerTools/skaffold/pkg/skaffold/warnings"
)

func (b *Builder) Build(ctx context.Context, out io.Writer, a *latestV1.Artifact, tag string) (string, error) {
	instrumentation.AddAttributesToCurrentSpanFromContext(ctx, map[string]string{
		"BuildType": "docker",
It would be useful to add the "Context": instrumentation.PII(a.Workspace) and "Destination": tag too to the builders?
Done
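For readers unfamiliar with the idea of marking span attributes as PII, the sketch below shows one possible approach: replacing the raw value with a short digest before it is attached to a span. This is an assumption for illustration only. The redact helper is hypothetical, and the thread does not say how skaffold's instrumentation.PII actually treats its argument.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// redact is a hypothetical helper (NOT skaffold's instrumentation.PII). It
// replaces a potentially identifying value with a truncated SHA-256 digest,
// so equal inputs still group together without exposing the raw value.
// Hashing alone is not strong anonymization for low-entropy inputs.
func redact(value string) string {
	sum := sha256.Sum256([]byte(value))
	return "pii:" + hex.EncodeToString(sum[:8])
}

// buildAttributes assembles the attributes discussed in the review for a
// docker build, redacting the workspace path and the destination tag.
func buildAttributes(workspace, tag string) map[string]string {
	return map[string]string{
		"BuildType":   "docker",
		"Context":     redact(workspace),
		"Destination": redact(tag),
	}
}

func main() {
	attrs := buildAttributes("/home/alice/src/app", "gcr.io/example/app:v1")
	for _, key := range []string{"BuildType", "Context", "Destination"} {
		fmt.Printf("%s=%s\n", key, attrs[key])
	}
}
```

The workspace path and image tag used in main are made-up values.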
endTrace()

_, endTrace = instrumentation.StartTrace(ctx, "Deploy_execKptCommand")
cmd := exec.CommandContext(childCtx, "kpt", kptCommandArgs(applyDir, []string{"live", "apply"}, k.getKptLiveApplyArgs(), nil)...)
Is childCtx suitable to be used here? Won't it have been ended on line 181? Shouldn't we be using ctx? Or better yet, the context ignored on 187?
Thanks for catching this, this was a mistake. It now uses the context that was ignored on 187 (now set as childCtx), as was originally intended.
I think the destination tags should be marked as PII. With that, LGTM.
What is the problem being solved?
Part of #5756, adding OpenTelemetry trace information to skaffold commands. This PR adds trace information to specific performance-critical skaffold functions (identified in go/cloud-trace-skaffold) and adds four trace exporters: gcp-skaffold, gcp-adc, stdout, and jaeger. Tracing is enabled or disabled through an env var (SKAFFOLD_TRACE) for simplicity and to keep it hidden from users for now.
Why is this the best approach?
Using OpenTelemetry tracing is the obvious choice: we already use the OpenTelemetry libs for metrics, and it is becoming the metrics/tracing standard. Using an env var in this PR and integrating the flag setup later was considered optimal because, for now, skaffold tracing will be used for benchmarking and bottleneck identification in select use cases while the user-facing UX w/ jaeger, etc. is still being worked out.
What other approaches did you consider?
There was the possibility of building tracing directly into skaffold events, but given the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (e.g., how long it takes to hash a file), it makes sense not to integrate them.
What side effects will this approach have?
There shouldn't be any side effects w/ this approach: tracing defaults to "off" and has minimal user visibility for now, so it should only be used experimentally for select use cases. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold.
What future work remains to be done?
Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally, the ability to sample more sporadically (vs AlwaysSample) should be added. Some future work mentioned in PR review included:
- OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
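As a rough illustration of env-var-based exporter selection, the sketch below maps a SKAFFOLD_TRACE value to one of the four exporter names listed in the description, with an unset variable meaning tracing stays off. The function selectExporter, the "noop" label, and the error returned for unknown values are assumptions made for this example; the description does not say how skaffold handles unrecognized values.

```go
package main

import (
	"fmt"
	"os"
)

// selectExporter is a hypothetical helper (not skaffold's code). It maps the
// value of SKAFFOLD_TRACE to an exporter name. An empty value selects a
// no-op, matching the description's default of tracing being off.
// Rejecting unknown values with an error is an assumption of this sketch.
func selectExporter(value string) (string, error) {
	switch value {
	case "":
		return "noop", nil
	case "gcp-skaffold", "gcp-adc", "stdout", "jaeger":
		return value, nil
	default:
		return "", fmt.Errorf("unsupported SKAFFOLD_TRACE value %q", value)
	}
}

func main() {
	exporter, err := selectExporter(os.Getenv("SKAFFOLD_TRACE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("trace exporter:", exporter)
}
```

Running the program with SKAFFOLD_TRACE=jaeger would select the jaeger name, and leaving the variable unset would select the no-op.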