admin: add an endpoint to dump spawned Tokio tasks #595

Merged
merged 32 commits into from
Jul 31, 2020

hawkw
Contributor

@hawkw hawkw commented Jul 15, 2020

Motivation

When debugging proxy issues, it can be useful to inspect the list of
currently spawned Tokio tasks and their states. This can be used
similarly to the thread or coroutine dumps provided by other languages'
runtimes.

Solution

This branch adds a new endpoint to the proxy's admin server, /tasks,
that returns a dump of all tasks currently spawned on the Tokio runtime,
using the new Tracing instrumentation added in tokio-rs/tokio#2655, and
a work-in-progress tokio-trace crate that provides Tokio-specific
Tracing layers.

Currently, the /tasks admin endpoint records the following information
about each task:

  • Whether it is a normal, local, or blocking task (not relevant to us
    currently, since Linkerd does not use local or blocking tasks...but
    we might eventually!)
  • Whether the task is active (currently being polled) or idle (waiting
    to be polled)
  • The type of the future that was spawned
  • The Tracing span context from which the task was spawned
  • The total number of times the task has been polled
  • Timing statistics about the task, including:
    • The time in nanoseconds between when the task was spawned and when
      it was first polled (essentially, measuring the Tokio scheduler's
      latency)
    • The total time in nanoseconds the task has existed
    • The task's busy time in nanoseconds (time it was actively being
      polled)
    • The task's idle time in nanoseconds (time it was not being
      polled)
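
As a rough illustration of the per-task data listed above, here is a minimal sketch of what one record could look like. All type and field names (`TaskRecord`, `TaskKind`, `TaskState`, and so on) are hypothetical and are not taken from the tokio-trace crate. The sketch also assumes that idle time is simply total lifetime minus busy time, which this description does not state.

```rust
use std::time::Duration;

/// Hypothetical task kinds, mirroring the three kinds listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskKind {
    Normal,
    Local,
    Blocking,
}

/// Hypothetical task states: being polled right now, or waiting to be polled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Active,
    Idle,
}

/// Hypothetical record holding the fields described in the list above.
#[derive(Debug, Clone)]
pub struct TaskRecord {
    pub kind: TaskKind,
    pub state: TaskState,
    /// Type name of the spawned future.
    pub future_type: String,
    /// Tracing span context the task was spawned from.
    pub spawn_context: String,
    /// Total number of polls so far.
    pub polls: u64,
    /// Time between spawn and first poll; `None` if not polled yet.
    pub time_to_first_poll: Option<Duration>,
    /// Total time the task has existed.
    pub total: Duration,
    /// Time spent actively being polled.
    pub busy: Duration,
}

impl TaskRecord {
    /// Assumption: idle time is the part of the lifetime not spent in polls.
    pub fn idle(&self) -> Duration {
        self.total.saturating_sub(self.busy)
    }

    /// Idle time in nanoseconds, matching the unit used in the dump.
    pub fn idle_ns(&self) -> u128 {
        self.idle().as_nanos()
    }
}
```

Under that assumption, a task that has existed for 10,000 ns and was busy for 2,500 ns reports 7,500 ns of idle time.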

In the future, Tokio will likely expose additional Tracing information,
which we'll be able to collect as well.

The task dump can be accessed either as an HTML table or as JSON. JSON
is returned if the request has an Accept: application/json header, or
whenever the path /tasks.json is requested; otherwise, the data is
rendered as an HTML table. Like the /proxy-log-level endpoint, access
is denied to requests coming from sources other than localhost, to help
restrict access to authorized users (since a high volume of requests for
task dumps could be used to starve the proxy).
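
To make the format selection and the localhost check concrete, here is a minimal sketch using only the standard library. The function and type names (`choose_format`, `is_localhost`, `Format`) are hypothetical and this is not the code in this branch. It also assumes that "localhost" means a loopback peer address and that any `application/json` entry in the Accept header selects JSON, ignoring quality values.

```rust
use std::net::SocketAddr;

/// Output formats offered by the endpoint.
#[derive(Debug, PartialEq, Eq)]
pub enum Format {
    Json,
    Html,
}

/// JSON if the path is `/tasks.json` or the Accept header lists
/// `application/json`; HTML otherwise.
pub fn choose_format(path: &str, accept: Option<&str>) -> Format {
    let accepts_json = accept
        .map(|value| {
            value.split(',').any(|part| {
                // Drop parameters such as `;q=0.9` before comparing.
                part.split(';')
                    .next()
                    .map(|media| media.trim().eq_ignore_ascii_case("application/json"))
                    .unwrap_or(false)
            })
        })
        .unwrap_or(false);

    if path == "/tasks.json" || accepts_json {
        Format::Json
    } else {
        Format::Html
    }
}

/// Assumption: a request counts as local when its peer address is loopback
/// (127.0.0.0/8 or ::1).
pub fn is_localhost(peer: SocketAddr) -> bool {
    peer.ip().is_loopback()
}
```

With these rules, a request for `/tasks` with `Accept: text/html` gets the HTML table, while the same request from `10.0.0.7` would be rejected before any format is chosen.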

Example JSON output (in Firefox Dev Edition's extremely nice GUI
JSON viewer):

Screenshot_20200715_121938

Zoomed in on the timing data for a single task:
Screenshot_20200715_122047

And HTML:

Screenshot_20200715_143155

Because the task data is generated from Tracing spans emitted by Tokio,
the task spans must be enabled for it to be used. This can be done by
setting a trace filter that enables the trace level for the target
tokio::task, e.g.:

tokio::task=trace

or

tokio=trace
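
For intuition about why both filters work, here is a deliberately simplified model of how a `target=level` directive could match a span's target. It is not the `tracing-subscriber` implementation, whose filter syntax has more rules. The names `Directive`, `parse_directive`, and `enables` are hypothetical, and matching on `::` module boundaries is an assumption made for this sketch.

```rust
/// Verbosity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// A single `target=level` directive, e.g. `tokio::task=trace`.
pub struct Directive {
    target: String,
    level: Level,
}

/// Parses `target=level`; returns `None` for anything else.
pub fn parse_directive(s: &str) -> Option<Directive> {
    let (target, level) = s.split_once('=')?;
    let level = match level.trim().to_ascii_lowercase().as_str() {
        "error" => Level::Error,
        "warn" => Level::Warn,
        "info" => Level::Info,
        "debug" => Level::Debug,
        "trace" => Level::Trace,
        _ => return None,
    };
    Some(Directive {
        target: target.trim().to_string(),
        level,
    })
}

impl Directive {
    /// True when `target` is the directive's target or a module beneath it,
    /// and `level` is no more verbose than the directive allows.
    pub fn enables(&self, target: &str, level: Level) -> bool {
        let in_scope = target == self.target
            || target
                .strip_prefix(self.target.as_str())
                .map_or(false, |rest| rest.starts_with("::"));
        in_scope && level <= self.level
    }
}
```

In this model, `tokio::task=trace` enables only the task spans, while `tokio=trace` also enables everything else under `tokio`, which is noisier but has the same effect for the task dump.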

Notes

  • This branch depends on unreleased code from upstream, including a
    Tokio change that has merged to master but not been published, and my
    unreleased work-in-progress tokio-trace crate. Therefore, I've
    pinned these upstreams to fixed Git SHAs, to guard against
    dependencies changing under us unexpectedly.
  • I considered requiring a build-time feature flag to enable this
    feature, the way we do for the mock SO_ORIG_DST implementation for
    testing. However, this would make it harder to use task tracking to
    debug issues in proxies not built with the flag. I'm happy to change
    this code to be feature flagged if we think that's the right approach.

Closes linkerd/linkerd2#3803

Signed-off-by: Eliza Weisman eliza@buoyant.io

@hawkw hawkw requested review from olix0r and a team July 15, 2020 21:33
@hawkw hawkw self-assigned this Jul 15, 2020
@olix0r
Member

olix0r commented Jul 15, 2020

@hawkw re: feature-flagging: What's the "cost" of this? Is there any discernible difference in benchmarks, for instance?

@hawkw
Contributor Author

hawkw commented Jul 15, 2020

> @hawkw re: feature-flagging: What's the "cost" of this? Is there any discernible difference in benchmarks, for instance?

It should be pretty minimal when disabled at runtime, but I'll do a benchmark run to make sure, good call.

@hawkw hawkw force-pushed the eliza/tokio-trace branch from cb0aa92 to 03e8cb2 Compare July 23, 2020 18:11
hawkw added 24 commits July 23, 2020 11:43
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw hawkw force-pushed the eliza/tokio-trace branch from d2adde4 to 32051a4 Compare July 23, 2020 18:49
};
use tokio_trace::tasks::TaskList;
#[derive(Clone)]
pub struct Tasks {

Not a blocker (obviously) but do you think this can live in hawkw/tokio-trace behind a feature flag?

Contributor Author

PRs_welcome

Contributor

@kleimkuhler kleimkuhler left a comment

This works well for me!

I installed linkerd with this change as the proxy version and set up a port-forward for 4191 from the controller pod.

I can go to localhost:4191/tasks or localhost:4191/tasks.json and see all the tasks for the controller's proxy.

@@ -36,21 +36,21 @@ ARG PROXY_UNOPTIMIZED
ARG PROXY_FEATURES

RUN --mount=type=cache,target=/var/lib/apt/lists \
--mount=type=cache,target=/var/tmp \
apt update && apt install -y time cmake
--mount=type=cache,target=/var/tmp \
Contributor

Auto formatting here? Not a blocker but just wondering if this was intentional.

Contributor Author

yup, i think this change can be removed entirely — this was left over from an attempt at feature-flagging.

Member

Yeah, would prefer to revert this as it removes some indentation below

@olix0r
Member

olix0r commented Jul 24, 2020

> @hawkw re: feature-flagging: What's the "cost" of this? Is there any discernible difference in benchmarks, for instance?

@hawkw Have you run benchmarks since your updates?

@hawkw
Contributor Author

hawkw commented Jul 24, 2020

> > @hawkw re: feature-flagging: What's the "cost" of this? Is there any discernible difference in benchmarks, for instance?
>
> @hawkw Have you run benchmarks since your updates?

@olix0r here are benchmark results with the most recent commit to this branch. Looks like a little bit of overhead:
Screenshot_20200724_134832

@hawkw
Contributor Author

hawkw commented Jul 24, 2020

@ver my latest commit 3f404bb should decrease the overhead a little bit more:
Screenshot_20200724_152747
(except for in the x100 case, weirdly? maybe that was bench env noise?)

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
- Self::Json(level) => level.reload(filter)?,
- Self::Plain(level) => level.reload(filter)?,
+ LevelHandle::Json(level) => level.reload(filter)?,
+ LevelHandle::Plain(level) => level.reload(filter)?,
Member

Why change this from Self?

Contributor Author

I think these methods briefly moved to a different type and then came back. Will back that out.

hawkw added 2 commits July 28, 2020 16:47
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@olix0r
Member

olix0r commented Jul 30, 2020

@hawkw looks like this merge wasn't clean

@hawkw
Contributor Author

hawkw commented Jul 30, 2020

> @hawkw looks like this merge wasn't clean

yeah, I'm fixing that up right now!

hawkw added 2 commits July 30, 2020 11:41
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw hawkw merged commit d5fe0e6 into main Jul 31, 2020
@hawkw hawkw deleted the eliza/tokio-trace branch July 31, 2020 23:54
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Aug 5, 2020
This release enables a multi-threaded runtime. Previously, the proxy
would only ever use a single thread for data plane processing; now, when
the proxy is allocated more than 1 CPU share, the proxy allocates a
thread per available CPU. This has shown substantial latency
improvements in benchmarks, especially when the proxy is serving
requests for many concurrent connections.

---

* Add a `multicore` feature flag (linkerd/linkerd2-proxy#611)
* Add `multicore` to default features (linkerd/linkerd2-proxy#612)
* admin: add an endpoint to dump spawned Tokio tasks (linkerd/linkerd2-proxy#595)
* trace: roll `tracing` and `tracing-subscriber` dependencies (linkerd/linkerd2-proxy#615)
* stack: Add NewService::into_make_service (linkerd/linkerd2-proxy#618)
* trace: tweak tracing & test support for the multithreaded runtime (linkerd/linkerd2-proxy#616)
* Make FailFast cloneable (linkerd/linkerd2-proxy#617)
* Move HTTP detection & server into linkerd2_proxy_http (linkerd/linkerd2-proxy#619)
* Mark tap integration tests as flakey (linkerd/linkerd2-proxy#621)
* Introduce a SkipDetect layer to preempt detection (linkerd/linkerd2-proxy#620)
adleong pushed a commit to linkerd/linkerd2 that referenced this pull request Aug 6, 2020
Development

Successfully merging this pull request may close these issues.

Proxy task introspection
4 participants