Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skaffold trace wrapping of critical functions & skaffold trace exporters via SKAFFOLD_TRACE env var #5854

Merged

Conversation

aaron-prindle
Copy link
Contributor

@aaron-prindle aaron-prindle commented May 17, 2021

What is the problem being solved?

Part of #5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold). Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger. This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly.

Why is this the best approach?

Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard. Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out.

What other approaches did you consider?

There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations (eg: how long to hash a file, etc.) it makes sense to not integrate them.

What side effects will this approach have?

There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally. I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold.

What future work remains to be done?

Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.). Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added. Some future work mentioned in PR review included:

  • OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
  • Setup gcp-skaffold exporter and allow users to opt-in to emitting trace data at a sampling rate that does not affect performance

@aaron-prindle aaron-prindle requested review from yuwenma and a team as code owners May 17, 2021 16:56
@aaron-prindle aaron-prindle requested a review from MarlonGamez May 17, 2021 16:56
@google-cla google-cla bot added the cla: yes label May 17, 2021
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 3 times, most recently from 467f01e to 492f1a0 Compare May 17, 2021 17:44
@codecov
Copy link

codecov bot commented May 17, 2021

Codecov Report

Merging #5854 (2c84a98) into master (f7289a5) will decrease coverage by 0.06%.
The diff coverage is 67.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5854      +/-   ##
==========================================
- Coverage   70.92%   70.86%   -0.07%     
==========================================
  Files         449      451       +2     
  Lines       16981    17301     +320     
==========================================
+ Hits        12044    12260     +216     
- Misses       4039     4141     +102     
- Partials      898      900       +2     
Impacted Files Coverage Δ
cmd/skaffold/skaffold.go 0.00% <0.00%> (ø)
pkg/skaffold/build/custom/build.go 0.00% <0.00%> (ø)
pkg/skaffold/build/gcb/cloud_build.go 0.00% <0.00%> (ø)
...affold/kubernetes/portforward/forwarder_manager.go 48.88% <0.00%> (-3.50%) ⬇️
pkg/skaffold/runner/v1/apply.go 0.00% <0.00%> (ø)
pkg/skaffold/build/cache/retrieve.go 61.81% <22.22%> (-3.53%) ⬇️
pkg/skaffold/instrumentation/new.go 33.33% <33.33%> (ø)
pkg/skaffold/runner/v1/new.go 62.37% <33.33%> (-1.92%) ⬇️
pkg/skaffold/instrumentation/export.go 63.45% <36.53%> (-9.66%) ⬇️
pkg/skaffold/build/cache/hash.go 70.75% <60.00%> (-0.54%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7289a5...2c84a98. Read the comment docs.

@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 2 times, most recently from 4cd958a to 5263d5a Compare May 17, 2021 19:08
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 3 times, most recently from 5893976 to fb37334 Compare May 17, 2021 21:10
cmd/skaffold/app/cmd/runner.go Outdated Show resolved Hide resolved
examples/jaeger-skaffold-trace/README.md Outdated Show resolved Hide resolved
examples/jaeger-skaffold-trace/README.md Outdated Show resolved Hide resolved
examples/jaeger-skaffold-trace/README.md Outdated Show resolved Hide resolved
examples/jaeger-skaffold-trace/README.md Outdated Show resolved Hide resolved
pkg/skaffold/build/cache/retrieve.go Outdated Show resolved Hide resolved
pkg/skaffold/graph/dependencies.go Outdated Show resolved Hide resolved
Copy link
Member

@briandealwis briandealwis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is really cool.

Since this example isn't really standalone, what do you think about turning it into a tutorial instead? And rather than maintain the jaeger-all-in-one-yaml, why not use their Helm chart? https://github.com/jaegertracing/helm-charts

Remember to copy examples/ back into integration/examples/

examples/jaeger-skaffold-trace/README.md Outdated Show resolved Hide resolved
Copy link
Member

@briandealwis briandealwis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming conventions could do with a revisit: the PR is often using the function name, but other times using something quite different. I think we'd like to get some traceability back to the source to see where time is going?

WDYT of something like "pkg", "Func[_Sub]" where Func is the function name (e.g., Dev, or doDev, and _Sub is added if we're only tracing part of the function (e.g., Dev_perArtifact).

We also need to be very careful with PII in the attributes. Let's chat about this offline.

EDIT: aaron-prindle - this has been addressed #5854 (comment)

pkg/skaffold/instrumentation/trace.go Outdated Show resolved Hide resolved
pkg/skaffold/kubernetes/manifest/images.go Outdated Show resolved Hide resolved
pkg/skaffold/runner/v1/apply.go Outdated Show resolved Hide resolved
pkg/skaffold/runner/v1/apply.go Outdated Show resolved Hide resolved
pkg/skaffold/runner/v1/deploy.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/export.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/export.go Show resolved Hide resolved
pkg/skaffold/instrumentation/export.go Outdated Show resolved Hide resolved
pkg/skaffold/graph/dependencies.go Outdated Show resolved Hide resolved
pkg/skaffold/deploy/kubectl/kubectl.go Outdated Show resolved Hide resolved
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 2 times, most recently from 3144dd9 to 3e8abc9 Compare May 19, 2021 06:47
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 3 times, most recently from 12ead39 to ad7fbb5 Compare May 19, 2021 16:34
@aaron-prindle
Copy link
Contributor Author

aaron-prindle commented May 19, 2021

The naming conventions could do with a revisit: the PR is often using the function name, but other times using something quite different. I think we'd like to get some traceability back to the source to see where time is going?

WDYT of something like "pkg", "Func[_Sub]" where Func is the function name (e.g., Dev, or doDev, and _Sub is added if we're only tracing part of the function (e.g., Dev_perArtifact).

We also need to be very careful with PII in the attributes. Let's chat about this offline.

Regarding the naming convention, I have modified this to be of the recommended form. Additionally I changed the pkg information to actually point to the file directly s/pkg\/skaffold\/runner/pkg\/skaffold\/runner\/v1\/deploy.go , etc. Also in Cloud Trace currently the package information is not shown/used in the UI it seems so made it so the wrapper adds it as an attribute called source_file

pkg/skaffold/build/cache/hash.go Outdated Show resolved Hide resolved
pkg/skaffold/build/cache/hash.go Outdated Show resolved Hide resolved
pkg/skaffold/build/scheduler.go Outdated Show resolved Hide resolved
pkg/skaffold/graph/dependencies.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/export.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/export.go Show resolved Hide resolved
pkg/skaffold/instrumentation/trace.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/trace.go Outdated Show resolved Hide resolved
pkg/skaffold/instrumentation/trace.go Outdated Show resolved Hide resolved
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 2 times, most recently from 3aec330 to e32b2ba Compare May 25, 2021 18:17
@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 5 times, most recently from 15dbc76 to 8f31513 Compare May 25, 2021 20:56

ctx, endTrace = instrumentation.StartTrace(ctx, "Deploy_WaitForDeletions")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry that ctx (the argument) is actually the ctx from the previous StartTrace(), and not the ctx passed into this function? Does the endTrace() re-link the context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the Deploy_ and doDev_ calls which had multiple StartTrace calls in a single function and passed ctx down the chain to properly create a new ctx (childCtx) and pass that to the relevant children for each command s.t. each direct trace in the function uses the initial root ctx

@@ -0,0 +1,37 @@
### Example: Skaffold Command Tracing with Jaeger

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put a note here that this is experimental and may change without notice. And be sure to place it in integration/examples too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 7 times, most recently from f97bdca to d740120 Compare May 27, 2021 16:44
Comment on lines +14 to +15
_**WARNING: If you're running this on a cloud cluster, this example will create a service and expose a webserver.
It's highly recommended that you only run this example on a local, private cluster like minikube or Kubernetes in Docker for Desktop.**_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just link to the Jaeger Project's instructions for their all-in-one docker container?
https://www.jaegertracing.io/docs/getting-started/#all-in-one

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.22

It's less maintenance for us, since we don't need to shlep this jaeger-all-in-one-template.yaml, and there's less risk since the user's daemon is unlikely to be exposed to the internet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"github.com/GoogleContainerTools/skaffold/pkg/skaffold/output"
latestV1 "github.com/GoogleContainerTools/skaffold/pkg/skaffold/schema/latest/v1"
"github.com/GoogleContainerTools/skaffold/pkg/skaffold/util"
"github.com/GoogleContainerTools/skaffold/pkg/skaffold/warnings"
)

func (b *Builder) Build(ctx context.Context, out io.Writer, a *latestV1.Artifact, tag string) (string, error) {
instrumentation.AddAttributesToCurrentSpanFromContext(ctx, map[string]string{
"BuildType": "docker",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to add the "Context": instrumentation.PII(a.Workspace) and "Destination": tag too to the builders?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

endTrace()

_, endTrace = instrumentation.StartTrace(ctx, "Deploy_execKptCommand")
cmd := exec.CommandContext(childCtx, "kpt", kptCommandArgs(applyDir, []string{"live", "apply"}, k.getKptLiveApplyArgs(), nil)...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is childCtx suitable to be used here? Won't it have been ended on line 181? Shouldn't we be using ctx? Or better yet, the context ignored on 187?

Copy link
Contributor Author

@aaron-prindle aaron-prindle Jun 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, this was a mistake. It now uses the context ignored which was ignored 187 (now set as childCtx) as was originally intended.

@aaron-prindle aaron-prindle force-pushed the skaff-trace branch 5 times, most recently from 56f8fd4 to 2babf99 Compare June 1, 2021 23:12
Copy link
Member

@briandealwis briandealwis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the destination tags should be marked as PII. With that, LGTM.

pkg/skaffold/build/custom/build.go Outdated Show resolved Hide resolved
pkg/skaffold/build/docker/docker.go Outdated Show resolved Hide resolved
pkg/skaffold/build/gcb/cloud_build.go Outdated Show resolved Hide resolved
pkg/skaffold/build/jib/build.go Outdated Show resolved Hide resolved
…rters

What is the problem being solved?
Part of GoogleContainerTools#5756, adding opentelemetry trace information to skaffold commands. Added trace information to specific performance critical skaffold functions (identified in go/cloud-trace-skaffold).  Also added 4 trace exporters - gcp-skaffold, gcp-adc, stdout, and jaeger.  This PR uses env var based enabling/disabling for the trace for simplicity and to hide it from users directly.

Why is this the best approach?
Using opentelemetry tracing is the obvious choice as we use open telemetry libs for metrics and it is becoming the metrics/tracing standard.  Using an env var in this PR and later integrating the flag setup was considered optimal as currently skaffold tracing will be used for benchmarking/bottleneck-identifying for select use cases while the user facing UX w/ jaeger, etc. is still being worked out.

What other approaches did you consider?
There was the possibility of building tracing directly into skaffold events but I think with the current wrapper setup in pkg/skaffold/instrumentation/trace.go (w/ the minimal code required) and the fact that many trace locations will not be event locations  (eg: how long to hash a file, etc.) it makes sense to not integrate them.

What side effects will this approach have?
There shouldn't be any side effects w/ this approach as the default "off" for tracing and the minimal user visibility for now should mean that it used only for select use cases experimentally.  I have done timing tests with the no-op/empty trace (SKAFFOLD_TRACE unset) and it does not change the performance of skaffold.

What future work remains to be done?
Future work includes wiring up a --trace flag through dev, build, deploy, etc. and working on how skaffold might be able to do distributed tracing w/ other tools (minikube, buildpacks, etc.).  Additionally the ability to allow for more sporadic sampling (vs AlwaysSample) should be added.  Some future work mentioned in PR review included:
- OTEL_TRACES_EXPORTER=* support (vs SKAFFOLD_TRACE)
@aaron-prindle aaron-prindle merged commit f95999f into GoogleContainerTools:master Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants