
stats: Consider communicating stats across hot-restart via RPC rather than shared memory #4974

Closed · jmarantz opened this issue Nov 6, 2018 · 22 comments · Fixed by #6850

Labels: design proposal (Needs design doc/proposal before implementation), no stalebot (Disables stalebot from closing an issue)
@jmarantz (Contributor) commented Nov 6, 2018:

Description:

The stats architecture has evolved, increasing in complexity and presenting barriers to people wishing to improve it. The stats-in-shared-memory feature allows continuity of gauges and counters across binary restarts, including binary upgrades. However, it enforces flat, fixed-size stat structures that don't scale well when there are large numbers of clusters. Truncation of dynamically generated stat names that exceed the command-line-specified shared-memory per-stat limit adds complexity as well. The current solution also requires alternate class implementations for stat memory.

An alternative approach, suggested by @mattklein123, is to always use heap memory for stats, and to use a paginated RPC to send current gauge/counter data from the old process to the new one during hot restart. This issue can be used to capture design ideas and tradeoffs around that.

One variation on this is to avoid sending counters, since it's the gauges that are important to keep accurate. In any case, we'll avoid sending stats whose delta is zero.

Relevant Links:
https://blog.envoyproxy.io/envoy-stats-b65c7f363342
https://github.com/envoyproxy/envoy/blob/master/source/docs/stats.md

mattklein123 added the design proposal (Needs design doc/proposal before implementation) and help wanted (Needs help!) labels on Nov 6, 2018
@bplotnick (Contributor) commented:

This seems related to #3804. Having a stats RPC protocol that enforces backwards compatibility (e.g. using proto3) should address the hot-restart problem, I think.

@jmarantz (Contributor, Author) commented Dec 28, 2018:

~~@mattklein123 is it necessary to use RPC? Or could we just write protobuf-serialized stats to a file from the old process and read it from the new one? We'd supply a filename on the command line, of course.~~

Either way, there will be a window of time where some stats will get dropped. Any other advantages of using RPC for this?

@jmarantz (Contributor, Author) commented Dec 28, 2018:

Ignore my previous comment; I just realized there's already a ton of RPC going on in the hot-restart transfer, so layering some stats data in there is pretty incremental, and using a file is not going to simplify anything. Also, some shared-memory locks are still needed; see https://blog.envoyproxy.io/envoy-hot-restart-1d16b14555b5

@mattklein123 (Member) commented:

@jmarantz Yeah, I was going to reply with something similar. We already do a periodic "get parent stats" exchange. IMO it's actually a pretty incremental change to fetch all the stats during this interval (the only complexity is that it will require pagination). IMO we should just get this done: the other stats machinery has become so complicated that it dwarfs the complexity of changing how we do hot-restart stats, and then we can delete a ton of code and focus on a single implementation.

@jmarantz (Contributor, Author) commented:

In your hot-restart blog post you indicated the importance of keeping gauges accurate, e.g. the outstanding connection count.

It seems to me that for that specific stat, the best strategy might be not to transfer the gauge at all, as it would actually be accurate for the new process if it just started fresh and didn't get adjusted based on the old process. Is that right?

For others that we might want to persist, should we just delay the paginated transfer of stats from the old process to the new one until the old one has retired all outstanding requests? The new process will be accumulating its own stats, but those could just be combined with the transferred stats as they stream in, so nothing would be lost.

@mattklein123 (Member) commented Dec 28, 2018:

The way it works currently is that every flush interval the new process asks the old process for stats. The way I would implement this is to ask the old process for all used counters and gauges. Both counters and gauges can be added to the local process's counters and gauges before flushing, and everything will be eventually consistent and correct. (Note that there are likely some gotchas here: there are some gauge stats where we assume the gauge is eventually consistent between both processes, such as the num-healthy gauges.)

This will work very well and relatively easily for all stats sinks that use flushing. For any pull/admin-based stats, it's a little more complicated. IMO the admin code should keep a shadow copy of the parent stats and add them to the local stats on demand. I think this will be the easiest thing to do, and anyone who is not hot restarting won't pay anything extra for this.
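
To make the flush-interval flow concrete, here is a rough sketch of the merge step, assuming a simple map-based store; `ParentStats`, `StatsMerger`, and `mergeParentStats` are hypothetical names, not Envoy's actual classes or RPC surface:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Minimal sketch of the per-flush-interval merge described above; all
// types and names are illustrative stand-ins.
struct ParentStats {
  std::unordered_map<std::string, uint64_t> counter_deltas;  // since last exchange
  std::unordered_map<std::string, uint64_t> gauge_values;    // full current values
};

class StatsMerger {
public:
  // Called once per flush interval, right after asking the parent
  // process for its stats and before flushing to sinks.
  void mergeParentStats(const ParentStats& parent) {
    // Counters: the parent sends only the delta accumulated since the
    // last exchange, so adding it keeps totals eventually consistent.
    for (const auto& [name, delta] : parent.counter_deltas) {
      counters_[name] += delta;
    }
    // Gauges: the parent sends its full current value; summing it with
    // the local value gives the combined live total (e.g. connections
    // open in both processes). Gauges written via set() rather than
    // inc()/dec() are the gotcha noted above and need a separate audit.
    for (const auto& [name, value] : parent.gauge_values) {
      flush_gauges_[name] = gauges_[name] + value;
    }
  }

private:
  std::unordered_map<std::string, uint64_t> counters_;
  std::unordered_map<std::string, uint64_t> gauges_;
  std::unordered_map<std::string, uint64_t> flush_gauges_;  // snapshot used for this flush
};
```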

@jmarantz (Contributor, Author) commented Jan 4, 2019:

I meant to ping you when you're active on Slack and I'm free... I didn't totally understand all of the above, but let me just ask some questions:

  • how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?
  • at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?
  • when you say "pull/admin based stats", do you mean when Envoy is configured without a stats sink, and thus all stats are expected to be read through the admin interface? Other than sinks and admin, how does 'pull' work?
  • "shadow copy of the parent stats" -- I have to admit I don't fully understand what "parent" means here. My exposure to the term "parent" in the context of stats was the collection of data from the thread-local histograms, and I wasn't thinking about it in terms of hot restart.

@ggreenway (Contributor) commented:

> how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?

This is configurable on the CLI. I have a use case (involving TCP proxying) where I have it set to multiple hours, because there's no graceful way to tell a tcp-proxied connection that it's time to close soon.

> at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?

Given the above (moderately long time windows), I think the stats should be sent periodically (probably with a configurable period), but no slower than once per minute.

@jmarantz (Contributor, Author) commented Jan 4, 2019:

@ggreenway thanks -- then I think we'd need to atomically zero the old-process stat values as we transmit them to the new process, to avoid double-counting them? Or alternatively we could keep the old-process stat values separate in the new process and only combine them when the old process is allowed to die. Other ideas?

@ggreenway (Contributor) commented:

Yeah, something along those lines. And you'd need to re-zero every time you send an updated value from the old process to the new. Or track the last value that was transmitted so you can compute deltas. Possibly use CounterImpl::pending_increment_ for counters. I'm not sure what to do with gauges.

@jmarantz (Contributor, Author) commented Jan 4, 2019:

Zeroing in the old process via std::atomic exchange (https://en.cppreference.com/w/cpp/atomic/atomic/exchange) feels like a good approach: it doesn't require any additional per-stat storage in the new process, it works for gauges, and the stats-sink/admin access to stats wouldn't need to change.
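
A minimal sketch of that latch-and-zero step, assuming a per-stat atomic (the names here are illustrative; `pending_` merely stands in for storage along the lines of CounterImpl::pending_increment_):

```cpp
#include <atomic>
#include <cstdint>

// Read the accumulated value and reset it to zero in a single atomic
// step, so increments racing in from the old process's worker threads
// are never lost or counted twice.
std::atomic<uint64_t> pending_{0};

uint64_t latchForTransfer() {
  // Any add() that lands after the exchange simply rides along in the
  // next transfer batch.
  return pending_.exchange(0, std::memory_order_relaxed);
}
```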

@mattklein123 wdyt? This diverges a bit from what you were suggesting.

@mattklein123 (Member) commented Jan 4, 2019:

> how long do both old and new processes tend to co-exist? Is this more than just the lame-duck time -- say O(1 minute)? During the time when both exist, how will stats be read?

Per @ggreenway, this is configurable and can be a very long time. This is why we currently have the concept of a flush interval (which defaults to 5s). During every flush interval, we flush stats to output sinks in the primary process only, but before we do this we ask the parent process for certain stats, as well as merge histograms. It's during this configurable flush interval that I am suggesting asking the parent process for all stats.

> at what point should we finally add the old process stat values to the new process stat values? I was thinking that should be done only after the old process completely stops serving requests -- but before it exits -- and I guess it would need to block exiting until all the stats are transferred. Is that what you have in mind?

IMO during every configurable flush interval, as we do today.

> when you say "pull/admin based stats", do you mean when Envoy is configured without a stats sink, and thus all stats are expected to be read through the admin interface? Other than sinks and admin, how does 'pull' work?

Yes, I'm referring to people who read stats via the admin interface, whether via Prometheus, JSON, etc.

"shadow copy of the parent stats" -- I have to admit I don't fully understand what parent means
here. My exposure to the term "parent" in the context of stats was the collection of data from the
thread-local histograms, and I wasn't thinking about it in terms of hot restart.

Sorry this is my own language. I call the newest hot restart process the primary, the one that came before the parent, the one that came before the grandparent, etc. Does that makes sense?

> Yeah, something along those lines. And you'd need to re-zero every time you send an updated value from the old process to the new. Or track the last value that was transmitted so you can compute deltas. Possibly use CounterImpl::pending_increment_ for counters. I'm not sure what to do with gauges.

For gauges, I think we just send the entire gauge value over and add it in the primary process. As I said above, I think we need to audit all usages of gauges that call set() vs. inc()/dec(), since there may be a subtle behavior change here. I'm not sure it will matter, but we need to look. There should be very few gauges that are set() in the code base, so it shouldn't be too hard to audit.

For counters, I agree with @ggreenway that we can just use the pending increment. Since the parent process stops flushing, we can simply latch the pending increment when the primary asks the parent for its stats, and send the increment. This can then be added to the latched increment and total in the primary. (One small point is whether we send the old counter total to the new process on hot restart. This would be nice, but is something we can discuss.)

mattklein123 added this to the 1.10.0 milestone on Jan 11, 2019
@fredlas (Contributor) commented Jan 30, 2019:

I discussed this with @jmarantz this morning, and looked into the code. I see where pending_increment_ fits in and how it would be useful. Having the child keep a periodically updated shadow copy of the parent's gauges, which it combines with its own when presenting to the outside world, makes sense too.

All of that makes sense as things to be done for admin/pull-based stats. What isn't clear to me: what needs to be done for stat sinks? From the discussion ("We already do periodic 'get parent stats'"), it kind of sounds like they already handle hot restarts exactly in the way being discussed here.

> One small point is whether we send the old counter total to the new process on hot restart. This would be nice, but is something we can discuss.

I think sending the old counter total over should be straightforward. If counter communication is built around using pending_increment_ to send "how many more have there been since the last flush/poll", then, since the child starts off with an implicit 0 for all the parent's counters, we can just have the first "increment" the parent sends actually be the whole count.
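
A tiny sketch of that bootstrapping property (the struct is hypothetical, not a proposed API): because the last-sent value starts at 0, the first delta is automatically the full historical total.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical parent-side counter: by tracking the last value sent,
// the first "delta" naturally equals the full total, matching the
// child's implicit starting value of 0.
struct TransferableCounter {
  std::atomic<uint64_t> value_{0};
  uint64_t last_sent_{0};  // only touched from the transfer path

  uint64_t latchDelta() {
    const uint64_t current = value_.load(std::memory_order_relaxed);
    const uint64_t delta = current - last_sent_;
    last_sent_ = current;
    return delta;
  }
};
```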

> I think we need to audit all usages of gauges that call set()

I got briefly started on this. There are ~100 non-test uses of set() on gauges. I gave them a quick look-through, and it's kind of a grab bag: some are counts/totals that would clearly be legitimate to sum, some appear to be meant as booleans, some (like hot_restart_epoch) are meaningful numbers but not summable, and some are just completely opaque, like hashes.

I'm tempted to suggest introducing a new stat type for these non-numerical opaque states, since something like a hash seems like an abuse of the gauge mechanism -- for instance, it wouldn't even make sense to export a hash to Prometheus as a Prometheus gauge, right? So they're not even being used as "actual" gauges. That said, a new type wouldn't completely solve the current problem: integer values like hot_restart_epoch might reasonably seem gauge-worthy, but aren't summable.

@mattklein123 (Member) commented:

> All of that makes sense as things to be done for admin/pull-based stats. What isn't clear to me: what needs to be done for stat sinks? From the discussion ("We already do periodic 'get parent stats'"), it kind of sounds like they already handle hot restarts exactly in the way being discussed here.

Correct, nothing needs to be done. The push/sink model is basically: (a) get stats from the parent, (b) add them to local stats, and (c) push/flush.

> I'm tempted to suggest introducing a new stat type for these non-numerical opaque states, since something like a hash seems like an abuse of the gauge mechanism -- for instance, it wouldn't even make sense to export a hash to Prometheus as a Prometheus gauge, right? So they're not even being used as "actual" gauges. That said, a new type wouldn't completely solve the current problem: integer values like hot_restart_epoch might reasonably seem gauge-worthy, but aren't summable.

+1, this sounds like a great solution.

@fredlas (Contributor) commented Jan 31, 2019:

It's actually way fewer than 100; I think I must have read the wrong number from the code-search tool I was using. It's more like ~30 different gauges using set(), and 15 or 20 of them look like they wouldn't make sense to sum.

@mattklein123 (Member) commented:

> It's actually way fewer than 100; I think I must have read the wrong number from the code-search tool I was using. It's more like ~30 different gauges using set(), and 15 or 20 of them look like they wouldn't make sense to sum.

Yeah, I was very surprised by the original number; this sounds more like what I was expecting. Still, I think your plan of having different, more intentional stat types sounds perfect to me, and I don't think it would be very hard to implement.

@fredlas (Contributor) commented Feb 1, 2019:

😊 haha yeah, whoops. (A link for completeness: I listed those uses over in #5790).

@fredlas (Contributor) commented Feb 6, 2019:

Quick check-in before I get too far into implementing: it appears that there are currently just two things (memory_allocated and num_connections) that are communicated via RPC. So essentially all stats -- both admin- and sink-handled -- will need to be covered by what I'm building. Or am I missing RPCs happening somewhere else?

I see that the current RPC setup uses sendmsg/recvmsg; that is necessary to enable fd passing, yes? Additionally, it's kept nice and simple by assuming all requests and replies are always <= 4096 bytes, including type+length info. Would people be OK with switching the existing setup to handling only fd passing, and setting up a new RPC framework for everything else (including this new work)? (Edit: I ultimately settled on keeping everything on the existing 4096-byte sendmsg datagram communication.)

Here's what I'm envisioning: communicate over a domain socket. In each direction, the protocol is a series of

    [uint64 length]['length' bytes of EnvoyHotRestart{Request,Response}]

where EnvoyHotRestart{Request,Response} are overarching container protobufs, each containing a oneof of the various message types, like RpcShutdownAdminReply and RpcGetStatsReply.
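
For concreteness, a sketch of what writing one frame could look like, assuming a connected stream-style domain socket; `writeFramedMessage` is an illustrative helper, not part of the proposal:

```cpp
#include <cstdint>
#include <string>
#include <sys/socket.h>

// Length-prefixed framing as described above; partial-write handling is
// elided, and host byte order is fine since both processes run on the
// same machine.
bool writeFramedMessage(int fd, const std::string& serialized) {
  const uint64_t length = serialized.size();
  if (send(fd, &length, sizeof(length), 0) !=
      static_cast<ssize_t>(sizeof(length))) {
    return false;
  }
  // 'length' bytes of EnvoyHotRestart{Request,Response} follow.
  return send(fd, serialized.data(), length, 0) ==
         static_cast<ssize_t>(length);
}
```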

The stats would be passed either in explicitly named protobuf fields or in a statname -> statvalue map. I'm not clear on whether e.g. cm_stats_.active_clusters_.set(active_clusters_.size()); can be accessed via store.gauges("active_clusters").value(). (If so, using a map would obviously be way cleaner and more extensible.)

It could instead be gRPC, although I think that's kind of overkill here.

Of course, it's totally possible to do everything within the existing framework. But I think it will be much easier to get the implementation right with a nice flexible protobuf with repeated/map fields, rather than manually packing arbitrary, variable-size stats into raw memory, always being careful to move on to a new message before overrunning 4096 bytes.

@mattklein123 (Member) commented:

> Quick check-in before I get too far into implementing: it appears that there are currently just two things (memory_allocated and num_connections) that are communicated via RPC. So essentially all stats -- both admin- and sink-handled -- will need to be covered by what I'm building. Or am I missing RPCs happening somewhere else?

Nope, that's it. You can use the same app flow and just extend this sequence to transmit all stats between the parent and the new process, so the changes should be mostly at the RPC layer.

> Of course, it's totally possible to do everything within the existing framework. But I think it will be much easier to get the implementation right with a nice flexible protobuf with repeated/map fields, rather than manually packing arbitrary, variable-size stats into raw memory, always being careful to move on to a new message before overrunning 4096 bytes.

+1, we have talked about moving this to proto for quite some time, and I think a "gRPC light" is the right way to go here. Very excited for this work!!

mattklein123 removed the help wanted (Needs help!) label on Mar 19, 2019
mattklein123 modified the milestones: 1.10.0 → 1.11.0 on Mar 19, 2019
@mattklein123 (Member) commented:

Fixed!

@jmarantz (Contributor, Author) commented May 2, 2019:

:) And thanks for the callout on Twitter: https://twitter.com/mattklein123/status/1123742661469454341

@fredlas (Contributor) commented May 3, 2019:

Yes, thank you; glad the change made you that happy! It was also a nice overview of the motivation and result, since I was entirely focused on the mechanical details the whole time, and not really seeing the big picture.
