Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2.2] Stream Envoy metrics to the cloud #4053

Merged
merged 34 commits into from
Feb 4, 2022
Merged

Conversation

douglascamata
Copy link
Contributor

@douglascamata douglascamata commented Jan 27, 2022

Description

This pull request intends to stream metrics from Envoy to Ambassador's cloud. We're interested the following metrics for both stable and canary clusters:

  • upstream_rq_total: total amount of requests served upstream
  • upstream_rq_5xx: total amount of requests served upstream with a 5xx status code (server error)
  • upstream_rq_time: histogram with request duration per cluster

The code currently drops all the other metrics in the agent. In the future, this might be done at Envoy's configuration to completely avoid all the extra network traffic between agent and Emissary Ingress pods.

The flow of the metrics is:

  1. Envoy's metrics server pushes to a K8s Service
  2. This K8s Service point to a gRPC server started by the agent
  3. This gRPC server streams the metrics to Ambassador's cloud

Related Issues

This is based in the work done at #3657.

Testing

I could test this manually in a local cluster and a server that behaves like Ambassador's cloud.

Checklist

  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any any special dev tricks I had to use to work on this code efficiently.

douglascamata and others added 14 commits January 26, 2022 11:44
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Flynn <flynn@datawire.io>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
@douglascamata douglascamata marked this pull request as ready for review January 28, 2022 11:57
rp4rk
rp4rk previously approved these changes Jan 28, 2022
Copy link

@rp4rk rp4rk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Flynn <flynn@datawire.io>
@khussey khussey changed the title Stream Envoy metrics to the cloud [2.2] Stream Envoy metrics to the cloud Feb 2, 2022
Douglas Camata added 2 commits February 2, 2022 17:59
Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very cool! We need to change the port number to 8006, sadly: sorry about that! Looking forward to getting this in. 🙂

pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/envoy_metrics_server.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
Douglas Camata added 3 commits February 3, 2022 10:54
Also add it to the table of ports being used in python/README.md

Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Douglas Camata added 2 commits February 3, 2022 14:14
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good! Thanks! 🙂

@kflynn kflynn self-requested a review February 3, 2022 15:22
Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh whoops -- that gotest failure looks like it might be real? 😐

kflynn
kflynn previously approved these changes Feb 3, 2022
Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll approve this again with the understanding that we must track down the race in the tests.

Alice-Lilith
Alice-Lilith previously approved these changes Feb 3, 2022

dlog.Infof(ctx, "metrics service listening on %s", listener.Addr().String())
s.logCtx = ctx
return grpcServer.Serve(listener)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use github.com/datawire/dlib/dhttp, rather than the google.golang.org/grpc HTTP server. (For an example of how to do this, see the cmd/example-envoy-metrics-sink/ which this PR also edits.)

port = int(parts[1])
else:
raise ValueError("too many colons")
return host, port
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will definitely raise an exception for IPv6.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there should be type annotations on the signature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplest way to do this is probably

from urllib.parse import urlparse
from typing import Tuple

def split_host_port(value: str) -> Tuple[str, int]:
    parsed = urlparse("//"+value)
    return parsed.hostname, int(parsed.port or 80)

if err := metricsServer.StartServer(ctx); err != nil {
dlog.Error(ctx, err)
}
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't ever launch a goroutine that you don't have a way to wait for it to shut down. github.com/datawire/dlib/dgroup can help with this, but you are free to use other solutions as well.

Copy link
Contributor

@LukeShu LukeShu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, 3 concerns:

  • the split_host_port routine
  • not keeping track of goroutines
  • using the google.golang.org/grpc HTTP server

I'm not super-opposed to merging this for -rc.0 and then fixing those later. But if we weren't pushing for an RC ASAP, this'd be a "request changes".

Douglas Camata added 3 commits February 3, 2022 19:46
Add some type checking on top

Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
@khussey khussey changed the title [2.2] Stream Envoy metrics to the cloud [v2.2] Stream Envoy metrics to the cloud Feb 4, 2022
Signed-off-by: Flynn <flynn@datawire.io>
@kflynn kflynn dismissed stale reviews from Alice-Lilith and themself via 2b7544a February 4, 2022 03:21
Flynn and others added 7 commits February 3, 2022 22:22
Signed-off-by: Flynn <flynn@datawire.io>
…gress/emissary into dcamata/agent-metrics-stream
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Douglas Camata <dcamata@datawire.io>
Signed-off-by: Flynn <flynn@datawire.io>
Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reapproving after fixing merge conflicts and pinning pytest to version 6.2.5 -- let's land this thing!!

@kflynn kflynn merged commit d3bd34f into master Feb 4, 2022
@kflynn kflynn deleted the dcamata/agent-metrics-stream branch February 4, 2022 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants