
stats: Consider communicating stats across hot-restart via RPC rather than shared memory #4974

Closed · jmarantz opened this issue Nov 6, 2018 · 22 comments · Fixed by #6850

Labels: design proposal (Needs design doc/proposal before implementation), no stalebot (Disables stalebot from closing an issue)
@jmarantz (Contributor) commented Nov 6, 2018:

Description:

The stats architecture has evolved, increasing in complexity and presenting barriers to people wishing to improve it. The stats-in-shared-memory feature allows continuity of gauges and counters across binary restarts, including binary upgrades. However, it enforces flat, fixed-size stat structures that don't scale well when there are large numbers of clusters. Truncation of dynamically generated stat names that exceed the command-line-specified shared-memory per-stat limit adds complexity as well. The current solution also requires alternate class implementations for stat memory.

An alternative approach, suggested by @mattklein123, is to always use heap memory for stats, and to use a paginated RPC to send current gauge/counter data from the old process to the new one during hot restart. This issue can be used to capture design ideas and tradeoffs around that.

One variation on this is to avoid sending counters, since it's the gauges that are important to keep accurate. In any case, we'll avoid sending stats whose delta is zero.

Relevant Links:
https://blog.envoyproxy.io/envoy-stats-b65c7f363342
https://github.com/envoyproxy/envoy/blob/master/source/docs/stats.md

mattklein123 added the design proposal (Needs design doc/proposal before implementation) and help wanted (Needs help!) labels on Nov 6, 2018
@bplotnick (Contributor) commented:

This seems related to #3804. Having a stats RPC protocol that enforces backwards compatibility (e.g. using proto3) should address the hot-restart problem, I think.

@jmarantz (Contributor, Author) commented Dec 28, 2018:

~~@mattklein123 is it necessary to use RPC? Or could we just write protobuf-serialized stats to a file from the old process and read it from the new one? We'd supply a filename on the command line, of course.~~

Either way, there will be a window of time where some stats will get dropped. Any other advantages of using RPC for this?

@jmarantz (Contributor, Author) commented Dec 28, 2018:

Ignore my previous comment; I just realized there's already a ton of RPC going on in the hot-restart transfer, so layering some stats data in there is pretty incremental, and using a file is not going to simplify anything. Also, some shared-memory locks are still needed; see https://blog.envoyproxy.io/envoy-hot-restart-1d16b14555b5

@mattklein123 (Member) commented:

@jmarantz Yeah, I was going to reply with something similar. We already do a periodic "get parent stats" exchange. IMO it's actually a pretty incremental change to fetch all the stats during this interval (the only complexity is that it will require pagination). IMO we should just get this done: the other stats machinery has become so complicated that it dwarfs the complexity of changing how we do hot-restart stats, and then we can delete a ton of code and focus on a single implementation.

@jmarantz (Contributor, Author) commented:

In your hot-restart blog post you indicated the importance of keeping gauges accurate, e.g. the outstanding connection count.

It seems to me that for that specific stat, the best strategy might be not to transfer the gauge at all, as it would actually be accurate for the new process if it just started fresh and didn't get adjusted based on the old process. Is that right?

For others that we might want to persist, should we just delay the paginated transfer of stats from the old process to the new one until the old one has retired all outstanding requests? The new process will be accumulating its own stats, but those could just be combined with the transferred stats as they stream in, so nothing would be lost.

@mattklein123 (Member) commented Dec 28, 2018:

The way it works currently is that every flush interval the new process asks the old process for stats. The way I would implement this is to ask the old process for all used counters and gauges. Both counters and gauges can be added to the local process's counters and gauges before flushing, and everything will be eventually consistent and correct. (Note that there are likely some gotchas here: there are some gauge stats where we assume the gauge is eventually consistent between both processes, such as the num-healthy gauges.)

This will work very well and relatively easily for all stats sinks that use flushing. For any pull/admin-based stats, it's a little more complicated. IMO the admin code should keep a shadow copy of the parent stats and add them to the local stats on demand. I think this will be the easiest thing to do, and anyone who is not hot restarting won't pay anything extra for this.
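
To make the flush-interval flow concrete, here is a rough sketch of the merge step, assuming a simple map-based store; `ParentStats`, `StatsMerger`, and `mergeParentStats` are hypothetical names, not Envoy's actual classes or RPC surface:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Minimal sketch of the per-flush-interval merge described above; all
// types and names are illustrative stand-ins.
struct ParentStats {
  std::unordered_map<std::string, uint64_t> counter_deltas;  // since last exchange
  std::unordered_map<std::string, uint64_t> gauge_values;    // full current values
};

class StatsMerger {
public:
  // Called once per flush interval, right after asking the parent
  // process for its stats and before flushing to sinks.
  void mergeParentStats(const ParentStats& parent) {
    // Counters: the parent sends only the delta accumulated since the
    // last exchange, so adding it keeps totals eventually consistent.
    for (const auto& [name, delta] : parent.counter_deltas) {
      counters_[name] += delta;
    }
    // Gauges: the parent sends its full current value; summing it with
    // the local value gives the combined live total (e.g. connections
    // open in both processes). Gauges written via set() rather than
    // inc()/dec() are the gotcha noted above and need a separate audit.
    for (const auto& [name, value] : parent.gauge_values) {
      flush_gauges_[name] = gauges_[name] + value;
    }
  }

private:
  std::unordered_map<std::string, uint64_t> counters_;
  std::unordered_map<std::string, uint64_t> gauges_;
  std::unordered_map<std::string, uint64_t> flush_gauges_;  // snapshot used for this flush
};
```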

@jmarantz (Contributor, Author) commented Jan 4, 2019:

I meant to ping you when you're active on Slack and I'm free... I didn't totally understand all of the above, but let me just ask some questions:

  • how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?
  • at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?
  • when you say "pull/admin based stats", do you mean when Envoy is configured without a stats sink, and thus all stats are expected to be read through the admin interface? Other than sinks and admin, how does 'pull' work?
  • "shadow copy of the parent stats" -- I have to admit I don't fully understand what "parent" means here. My exposure to the term "parent" in the context of stats was the collection of data from the thread-local histograms, and I wasn't thinking about it in terms of hot restart.

@ggreenway (Contributor) commented:

> how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?

This is configurable on the CLI. I have a use case (involving TCP proxying) where I have it set to multiple hours, because there's no graceful way to tell a tcp-proxied connection that it's time to close soon.

> at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?

Given the above (moderately long time windows), I think the stats should be sent periodically (probably with a configurable period), but no slower than once per minute.

@jmarantz (Contributor, Author) commented Jan 4, 2019:

@ggreenway thanks -- then I think we'd need to atomically zero the old-process stat values as we transmit them to the new process, to avoid double-counting them? Or alternatively we could keep the old-process stat values separate in the new process and only combine them when the old process is allowed to die. Other ideas?

@ggreenway (Contributor) commented:

Yeah, something along those lines. And you'd need to re-zero every time you send an updated value from the old process to the new. Or track the last value that was transmitted so you can compute deltas. Possibly use CounterImpl::pending_increment_ for counters. I'm not sure what to do with gauges.

@jmarantz (Contributor, Author) commented Jan 4, 2019:

Zeroing in the old process via std::atomic exchange (https://en.cppreference.com/w/cpp/atomic/atomic/exchange) feels like a good approach: it doesn't require any additional per-stat storage in the new process, it works for gauges, and the stats-sink/admin access to stats wouldn't need to change.
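
A minimal sketch of that latch-and-zero step, assuming a per-stat atomic (the names here are illustrative; `pending_` merely stands in for storage along the lines of CounterImpl::pending_increment_):

```cpp
#include <atomic>
#include <cstdint>

// Read the accumulated value and reset it to zero in a single atomic
// step, so increments racing in from the old process's worker threads
// are never lost or counted twice.
std::atomic<uint64_t> pending_{0};

uint64_t latchForTransfer() {
  // Any add() that lands after the exchange simply rides along in the
  // next transfer batch.
  return pending_.exchange(0, std::memory_order_relaxed);
}
```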

@mattklein123 wdyt? This diverges a bit from what you were suggesting.

@mattklein123 (Member) commented Jan 4, 2019:

> how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?

Per @ggreenway, this is configurable and can be a very long time. This is why we currently have the concept of a flush interval (which defaults to 5s). During every flush interval, we flush stats to output sinks in the primary process only, but before we do this we ask the parent process for certain stats, as well as merge histograms. It's during this configurable flush interval that I am suggesting asking the parent process for all stats.

> at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?

IMO during every configurable flush interval, as we do today.

> when you say "pull/admin based stats", do you mean when Envoy is configured without a stats sink, and thus all stats are expected to be read through the admin interface? Other than sinks and admin, how does 'pull' work?

Yes, I'm referring to people who read stats via the admin interface, whether via Prometheus, JSON, etc.

"shadow copy of the parent stats" -- I have to admit I don't fully understand what parent means
here. My exposure to the term "parent" in the context of stats was the collection of data from the
thread-local histograms, and I wasn't thinking about it in terms of hot restart.

Sorry this is my own language. I call the newest hot restart process the primary, the one that came before the parent, the one that came before the grandparent, etc. Does that makes sense?

> Yeah, something along those lines. And you'd need to re-zero every time you send an updated value from the old process to the new. Or track the last value that was transmitted so you can compute deltas. Possibly use CounterImpl::pending_increment_ for counters. I'm not sure what to do with gauges.

For gauges, I think we just send the entire gauge value over and add it in the primary process. As I said above, I think we need to audit all usages of gauges that call set() vs. inc()/dec(), since there may be a subtle behavior change here. I'm not sure it will matter, but we need to look. There should be very few gauges that are set() in the code base, so it shouldn't be too hard to audit.

For counters, I agree with @ggreenway that we can just use the pending increment. Since the parent process stops flushing, we can simply latch the pending increment when the primary asks the parent for its stats, and send the increment. This can then be added to the latched increment and total in the primary. (One small point is whether we send the old counter total to the new process on hot restart. This would be nice, but is something we can discuss.)

mattklein123 added this to the 1.10.0 milestone on Jan 11, 2019
@fredlas (Contributor) commented Jan 30, 2019:

I discussed this with @jmarantz this morning, and looked into the code. I see where pending_increment_ fits in and how it would be useful. Having the child keep a periodically updated shadow copy of the parent's gauges, which it combines with its own when presenting to the outside world, makes sense too.

All of that makes sense as things to be done for admin/pull-based stats. What isn't clear to me: what needs to be done for stat sinks? From the discussion ("We already do periodic 'get parent stats'"), it kind of sounds like they already handle hot restarts exactly in the way being discussed here.

> One small point is whether we send the old counter total to the new process on hot restart. This would be nice, but is something we can discuss.

I think sending the old counter total over should be straightforward. If counter communication is built around using pending_increment_ to send "how many more have there been since the last flush/poll", then, since the child starts off with an implicit 0 for all the parent's counters, we can just have the first "increment" the parent sends actually be the whole count.
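
A tiny sketch of that bootstrapping property (the struct is hypothetical, not a proposed API): because the last-sent value starts at 0, the first delta is automatically the full historical total.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical parent-side counter: by tracking the last value sent,
// the first "delta" naturally equals the full total, matching the
// child's implicit starting value of 0.
struct TransferableCounter {
  std::atomic<uint64_t> value_{0};
  uint64_t last_sent_{0};  // only touched from the transfer path

  uint64_t latchDelta() {
    const uint64_t current = value_.load(std::memory_order_relaxed);
    const uint64_t delta = current - last_sent_;
    last_sent_ = current;
    return delta;
  }
};
```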

> I think we need to audit all usages of gauges that call set()

I got briefly started on this. There are ~100 non-test uses of set() on gauges. I gave them a quick look-through, and it's kind of a grab bag: some are counts/totals that would clearly be legitimate to sum, some appear to be meant as booleans, some (like hot_restart_epoch) are meaningful numbers but not summable, and some are just completely opaque, like hashes.

I'm tempted to suggest introducing a new stat type for these non-numerical opaque states, since something like a hash seems like an abuse of the gauge mechanism -- for instance, it wouldn't even make sense to export a hash to Prometheus as a Prometheus gauge, right? So they're not even being used as "actual" gauges. That said, a new type wouldn't completely solve the current problem: integer values like hot_restart_epoch might reasonably seem gauge-worthy, but aren't summable.

@mattklein123 (Member) commented:

> All of that makes sense as things to be done for admin/pull-based stats. What isn't clear to me: what needs to be done for stat sinks? From the discussion ("We already do periodic 'get parent stats'"), it kind of sounds like they already handle hot restarts exactly in the way being discussed here.

Correct, nothing needs to be done. The push/sink model is basically: (a) get stats from the parent, (b) add them to local stats, and (c) push/flush.

> I'm tempted to suggest introducing a new stat type for these non-numerical opaque states, since something like a hash seems like an abuse of the gauge mechanism -- for instance, it wouldn't even make sense to export a hash to Prometheus as a Prometheus gauge, right? So they're not even being used as "actual" gauges. That said, a new type wouldn't completely solve the current problem: integer values like hot_restart_epoch might reasonably seem gauge-worthy, but aren't summable.

+1, this sounds like a great solution.

@fredlas (Contributor) commented Jan 31, 2019:

It's actually way fewer than 100; I think I must have read the wrong number from the code-search tool I was using. It's more like ~30 different gauges using set(), and 15 or 20 of them look like they wouldn't make sense to sum.

@mattklein123 (Member) commented:

> It's actually way fewer than 100; I think I must have read the wrong number from the code-search tool I was using. It's more like ~30 different gauges using set(), and 15 or 20 of them look like they wouldn't make sense to sum.

Yeah, I was very surprised by the original number; this sounds more like what I was expecting. Still, I think your plan of having different, more intentional stat types sounds perfect to me, and I don't think it would be very hard to implement.

@fredlas (Contributor) commented Feb 1, 2019:

😊 haha yeah, whoops. (A link for completeness: I listed those uses over in #5790).

@fredlas (Contributor) commented Feb 6, 2019:

Quick check-in before I get too far into implementing: it appears that there are currently just two things (memory_allocated and num_connections) that are communicated via RPC. So essentially all stats -- both admin- and sink-handled -- will need to be covered by what I'm building. Or am I missing RPCs happening somewhere else?

I see that the current RPC setup uses sendmsg/recvmsg; that is necessary to enable fd passing, yes? Additionally, it's kept nice and simple by assuming all requests and replies are always <= 4096 bytes, including type+length info. Would people be OK with switching the existing setup to handling only fd passing, and setting up a new RPC framework for everything else (including this new work)? (Edit: I ultimately settled on keeping everything on the existing 4096-byte sendmsg datagram communication.)

Here's what I'm envisioning: communicate over a domain socket. In each direction, the protocol is a series of

    [uint64 length]['length' bytes of EnvoyHotRestart{Request,Response}]

where EnvoyHotRestart{Request,Response} are overarching container protobufs, each containing a oneof of the various message types, like RpcShutdownAdminReply and RpcGetStatsReply.
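
For concreteness, a sketch of what writing one frame could look like, assuming a connected stream-style domain socket; `writeFramedMessage` is an illustrative helper, not part of the proposal:

```cpp
#include <cstdint>
#include <string>
#include <sys/socket.h>

// Length-prefixed framing as described above; partial-write handling is
// elided, and host byte order is fine since both processes run on the
// same machine.
bool writeFramedMessage(int fd, const std::string& serialized) {
  const uint64_t length = serialized.size();
  if (send(fd, &length, sizeof(length), 0) !=
      static_cast<ssize_t>(sizeof(length))) {
    return false;
  }
  // 'length' bytes of EnvoyHotRestart{Request,Response} follow.
  return send(fd, serialized.data(), length, 0) ==
         static_cast<ssize_t>(length);
}
```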

The stats would be passed either in explicitly named protobuf fields or in a statname -> statvalue map. I'm not clear on whether e.g. cm_stats_.active_clusters_.set(active_clusters_.size()); can be accessed via store.gauges("active_clusters").value(). (If so, using a map would obviously be way cleaner and more extensible.)

It could instead be gRPC, although I think that's kind of overkill here.

Of course, it's totally possible to do everything within the existing framework. But I think it will be much easier to get the implementation right with a nice flexible protobuf with repeated/map fields, rather than manually packing arbitrary, variable-size stats into raw memory, always being careful to move on to a new message before overrunning 4096 bytes.

@mattklein123 (Member) commented:

> Quick check-in before I get too far into implementing: it appears that there are currently just two things (memory_allocated and num_connections) that are communicated via RPC. So essentially all stats -- both admin- and sink-handled -- will need to be covered by what I'm building. Or am I missing RPCs happening somewhere else?

Nope, that's it. You can use the same app flow and just extend this sequence to transmit all stats between the parent and the new process, so the changes should be mostly at the RPC layer.

> Of course, it's totally possible to do everything within the existing framework. But I think it will be much easier to get the implementation right with a nice flexible protobuf with repeated/map fields, rather than manually packing arbitrary, variable-size stats into raw memory, always being careful to move on to a new message before overrunning 4096 bytes.

+1, we have talked about moving this to proto for quite some time, and I think a "gRPC light" is the right way to go here. Very excited for this work!!

mattklein123 removed the help wanted (Needs help!) label on Mar 19, 2019
mattklein123 modified the milestones: 1.10.0 → 1.11.0 on Mar 19, 2019
@mattklein123 (Member) commented:

Fixed!

@jmarantz (Contributor, Author) commented May 2, 2019:

:) And thanks for the callout on Twitter: https://twitter.com/mattklein123/status/1123742661469454341

@fredlas (Contributor) commented May 3, 2019:

Yes, thank you; glad the change made you that happy! It was also a nice overview of the motivation and result, since I was entirely focused on the mechanical details the whole time, and not really seeing the big picture.
