[v2.2] Stream Envoy metrics to the cloud #4053

douglascamata · 2022-01-27T17:46:46Z

Description

This pull request streams metrics from Envoy to Ambassador's cloud. We're interested in the following metrics for both stable and canary clusters:

- upstream_rq_total: total number of requests served upstream
- upstream_rq_5xx: total number of requests served upstream with a 5xx status code (server error)
- upstream_rq_time: histogram of request duration per cluster

The code currently drops all other metrics in the agent. In the future, this filtering might be done in Envoy's configuration to avoid the extra network traffic between the agent and Emissary Ingress pods entirely.
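
As a rough illustration of the filtering described above (not the agent's actual Go code), a suffix check could look like the following Python sketch. Only the three suffixes come from the list above; the example metric names, `WANTED_SUFFIXES`, `is_wanted`, and `filter_metrics` are hypothetical.

```python
from typing import Iterable, List

# Suffixes of the three metrics listed in the description.
# The leading dot avoids matching names that merely end in the same letters.
WANTED_SUFFIXES = (
    ".upstream_rq_total",
    ".upstream_rq_5xx",
    ".upstream_rq_time",
)


def is_wanted(metric_name: str) -> bool:
    """Return True if the metric name ends with one of the wanted suffixes."""
    # str.endswith accepts a tuple and stops at the first matching suffix.
    return metric_name.endswith(WANTED_SUFFIXES)


def filter_metrics(names: Iterable[str]) -> List[str]:
    """Keep only the metrics that should be streamed; drop everything else."""
    return [name for name in names if is_wanted(name)]


if __name__ == "__main__":
    # Hypothetical metric names, assumed to follow a "cluster.<name>.<stat>" shape.
    sample = [
        "cluster.stable.upstream_rq_total",
        "cluster.stable.upstream_cx_total",
        "cluster.canary.upstream_rq_5xx",
        "server.uptime",
    ]
    print(filter_metrics(sample))
```

Matching on a small fixed set of suffixes keeps the check cheap, which matters because it runs for every metric Envoy pushes.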

The flow of the metrics is:

- Envoy's metrics server pushes to a K8s Service
- This K8s Service points to a gRPC server started by the agent
- This gRPC server streams the metrics to Ambassador's cloud

Related Issues

This is based on the work done in #3657.

Testing

I was able to test this manually using a local cluster and a server that behaves like Ambassador's cloud.

Checklist

I made sure to update CHANGELOG.md.

Remember, the CHANGELOG needs to mention:
- Any new features
- Any changes to our included version of Envoy
- Any non-backward-compatible changes
- Any deprecations

This is unlikely to impact how Ambassador performs at scale.

Remember, things that might have an impact at scale include:
- Any significant changes in memory use that might require adjusting the memory limits
- Any significant changes in CPU use that might require adjusting the CPU limits
- Anything that might change how many replicas users should use
- Changes that impact data-plane latency/scalability

My change is adequately tested.

Remember when considering testing:
- Your change needs to be specifically covered by tests.
  - Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
- You also need to make sure that the entire area being changed has adequate test coverage.
  - If existing tests don't actually cover the entire area being changed, add tests.
  - This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
- We should lean on the bulk of code being covered by unit tests, but...
- ... an end-to-end test should cover the integration points

I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.

rp4rk

LGTM

kflynn

This looks very cool! We need to change the port number to 8006, sadly: sorry about that! Looking forward to getting this in. 🙂

pkg/agent/agent.go

pkg/agent/envoy_metrics_server.go

pkg/agent/agent.go

kflynn

I think this looks good! Thanks! 🙂

kflynn

Oh whoops -- that gotest failure looks like it might be real? 😐

kflynn

OK, I'll approve this again with the understanding that we must track down the race in the tests.

LukeShu · 2022-02-03T17:19:23Z

pkg/agent/envoy_metrics_server.go

+
+	dlog.Infof(ctx, "metrics service listening on %s", listener.Addr().String())
+	s.logCtx = ctx
+	return grpcServer.Serve(listener)


This should use github.com/datawire/dlib/dhttp, rather than the google.golang.org/grpc HTTP server. (For an example of how to do this, see the cmd/example-envoy-metrics-sink/ which this PR also edits.)

LukeShu · 2022-02-03T17:20:53Z

python/ambassador/envoy/v3/v3bootstrap.py

+        port = int(parts[1])
+    else:
+        raise ValueError("too many colons")
+    return host, port


This will definitely raise an exception for IPv6.

Also there should be type annotations on the signature.

The simplest way to do this is probably

    from urllib.parse import urlparse
    from typing import Tuple

    def split_host_port(value: str) -> Tuple[str, int]:
        parsed = urlparse("//"+value)
        return parsed.hostname, int(parsed.port or 80)
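
To check the IPv6 concern, the suggested approach can be run against a few inputs. The sketch below restates it with two assumptions that are not from the PR: the host is typed as `Optional[str]` (because `urlparse(...).hostname` can be `None`), and the fallback port is exposed as a hypothetical `default_port` parameter.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def split_host_port(value: str, default_port: int = 80) -> Tuple[Optional[str], int]:
    """Split "host[:port]" into (host, port), including bracketed IPv6 hosts."""
    # Prefixing "//" makes urlparse treat the whole value as a network location.
    parsed = urlparse("//" + value)
    # parsed.port is None when no port is given, so fall back to the default.
    return parsed.hostname, parsed.port or default_port


if __name__ == "__main__":
    for candidate in ("127.0.0.1:8006", "[::1]:8006", "example.com"):
        print(candidate, "->", split_host_port(candidate))
```

Note that `urlparse` strips the square brackets from an IPv6 host, so the host has to be re-bracketed before it is joined with a port again.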

LukeShu · 2022-02-03T17:28:46Z

cmd/agent/main.go

+		if err := metricsServer.StartServer(ctx); err != nil {
+			dlog.Error(ctx, err)
+		}
+	}()


Don't ever launch a goroutine without having a way to wait for it to shut down. github.com/datawire/dlib/dgroup can help with this, but you are free to use other solutions as well.

LukeShu

OK, 3 concerns:

- the split_host_port routine
- not keeping track of goroutines
- using the google.golang.org/grpc HTTP server

I'm not super-opposed to merging this for -rc.0 and then fixing those later. But if we weren't pushing for an RC ASAP, this'd be a "request changes".

kflynn

Reapproving after fixing merge conflicts and pinning pytest to version 6.2.5 -- let's land this thing!!

douglascamata and others added 14 commits January 26, 2022 11:44

Prototype to stream Envoy metrics to Ambassador's backend

ed0775f

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Stream Envoy metrics to the cloud

a298915

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Port over metrics sink to envoy v3

b869f5c

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Fix logs and metrics v2 import

a4f088f

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Mount the go build cache in the builder dockerfile

4d0d08b

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Update Helm chart with grpc port and service

d2025a2

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Mark agent's grpc service as required k8s config

9ca987f

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Update generated files

4fd1cbc

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Change all references of CEPC to DCP

1c51788

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Added release notes about streaming metrics

8982f80

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Stream only the metrics we need to the cloud

6753404

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Merge branch 'master' of github.com:emissary-ingress/emissary into dc…

b2c5378

…amata/agent-metrics-stream

Force BuildKit on in builder.mk

a42dae0

Signed-off-by: Flynn <flynn@datawire.io>

Fix mock client for grpc call

b4a346f

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

douglascamata marked this pull request as ready for review January 28, 2022 11:57

rp4rk previously approved these changes Jan 28, 2022

Properly break out of suffix loop if found match

7350a57

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

douglascamata dismissed rp4rk’s stale review via 7350a57 January 28, 2022 12:51

Merge branch 'master' into dcamata/agent-metrics-stream

45be7de

Signed-off-by: Flynn <flynn@datawire.io>

douglascamata requested a review from kflynn January 31, 2022 18:22

khussey changed the title ~~Stream Envoy metrics to the cloud~~ [2.2] Stream Envoy metrics to the cloud Feb 2, 2022

Douglas Camata added 2 commits February 2, 2022 17:59

Merge branch 'master' of github.com:emissary-ingress/emissary into dc…

0f8bd8a

…amata/agent-metrics-stream

Merge branch 'dcamata/agent-metrics-stream' of github.com:emissary-in…

af33779

…gress/emissary into dcamata/agent-metrics-stream

kflynn requested changes Feb 3, 2022

Douglas Camata added 3 commits February 3, 2022 10:54

Change Envoy metrics server from 8123 and 8006

e28a361

Also add it to the table of ports being used in python/README.md Signed-off-by: Douglas Camata <dcamata@datawire.io>

Upgrade to metrics v3 transport api

0a1fccb

Signed-off-by: Douglas Camata <dcamata@datawire.io>

Fix logic error in envoy metrics filtering

30e35b6

Signed-off-by: Douglas Camata <dcamata@datawire.io>

douglascamata requested a review from kflynn February 3, 2022 11:04

Douglas Camata added 2 commits February 3, 2022 14:14

Merge branch 'master' of github.com:emissary-ingress/emissary into dc…

63ae1a7

…amata/agent-metrics-stream

Update generated files

fb36a8c

Signed-off-by: Douglas Camata <dcamata@datawire.io>

kflynn approved these changes Feb 3, 2022

kflynn self-requested a review February 3, 2022 15:22

kflynn requested changes Feb 3, 2022

kflynn previously approved these changes Feb 3, 2022

Alice-Lilith previously approved these changes Feb 3, 2022

LukeShu reviewed Feb 3, 2022

Douglas Camata added 3 commits February 3, 2022 19:46

Improve python host/port parsing to work with ipv6

d5b19d4

Add some type checking on top Signed-off-by: Douglas Camata <dcamata@datawire.io>

Use dhttp package to start the metrics server

10a7796

Signed-off-by: Douglas Camata <dcamata@datawire.io>

Start the metrics server with dgroup

ee40b51

Signed-off-by: Douglas Camata <dcamata@datawire.io>

LukeShu added the status:needs-change label Feb 3, 2022

khussey changed the title ~~[2.2] Stream Envoy metrics to the cloud~~ [v2.2] Stream Envoy metrics to the cloud Feb 4, 2022

Merge branch 'master' into dcamata/agent-metrics-stream

2b7544a

Signed-off-by: Flynn <flynn@datawire.io>

kflynn dismissed stale reviews from Alice-Lilith and themself via 2b7544a February 4, 2022 03:21

Flynn and others added 7 commits February 3, 2022 22:22

Whoops, missed a commit for the CHANGELOG

902d455

Signed-off-by: Flynn <flynn@datawire.io>

Merge branch 'dcamata/agent-metrics-stream' of github.com:emissary-in…

0029397

…gress/emissary into dcamata/agent-metrics-stream

Merge branch 'master' of github.com:emissary-ingress/emissary into dc…

7344154

…amata/agent-metrics-stream

Return error from metrics-server group

d305b81

Signed-off-by: Douglas Camata <dcamata@datawire.io>

Fix logging in the envoy metrics server

41d620b

Signed-off-by: Douglas Camata <dcamata@datawire.io>

Fix Python type signature of split_host_port

128bce4

Signed-off-by: Douglas Camata <dcamata@datawire.io>

Pin pytest back to 6.2.5.

6473414

Signed-off-by: Flynn <flynn@datawire.io>

kflynn approved these changes Feb 4, 2022

kflynn merged commit d3bd34f into master Feb 4, 2022

kflynn deleted the dcamata/agent-metrics-stream branch February 4, 2022 16:42
