Fiber tracing: asynchronous stack tracing #854
Conversation
This is amazing work! Even just the write-up is epic. I have lots of thoughts and I'll try to respond shortly, but just a quick hat-tip first. :-) Well done.
This looks awesome, looking forward to having this!
Some thoughts:
IMO, global disabling of tracing should result in a 0% degradation. Local disabling should result in at worst something like a 1-2% degradation, while fully-enabled maybe something like 3-4%. At least those are kind of the numbers I'm thinking. Obviously less is better, but this feels relatively attainable.
I think the bytecode is a primary source. :-)
I don't think it hurts to make it publicly visible if we have to. It's in the
Fascinating. I wonder if that's due to the inlining limits. Also that answers one of my review questions. We could get around this by using a macro in
Ah more of my review comments answered! Where does the cost come from when the tracing is disabled? Just the larger object header size?
Also I think we can actually not worry about the LRU part of this if we cache based on lambda class rather than lambda instance, since the number of distinct cache entries will be strictly bounded by the number of classes, which is static (and relatively small). Also regarding your example, that seems completely fine to me. We know tracing is going to suffer from these kinds of limitations.
We should explicitly call it out, but IMO it's completely fine. Tracing is an approximation, really, intended to give some information where, at present, we have none.
Ah yeah this is rough. I really don't like the fact that bracket is leveraging
Monad transformers in general are going to cause a problem for rabbit tracing, since the lambdas will basically all come from the transformer. Slug tracing obviously doesn't have this problem since it bypasses the cache altogether, but then you take the performance hit. At any rate, we're basically looking at heuristics either way. I think I had some random ideas on how we might do this in the spec gist, but ultimately nothing is really going to be fully general. I have no objections to starting with the class filtering approach and seeing how it works.
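To make the class-filtering idea concrete, here is a rough sketch of the heuristic (the package prefixes and names are illustrative assumptions, not anything in this PR):

```scala
object ClassFilter {
  // Hypothetical prefix list: lambdas whose classes come from these packages
  // (transformers, the effect library itself, the stdlib) are not worth tracing.
  private val ignoredPrefixes = List("cats.", "scala.")

  // Decide whether a lambda's class looks like user code.
  def shouldTrace(f: AnyRef): Boolean = {
    val name = f.getClass.getName
    !ignoredPrefixes.exists(p => name.startsWith(p))
  }
}
```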
This is fairly nice, and my ideas around exploiting the memory model basically just make the thread local approach faster. So, this is the right place to start with implementation, IMO.
There are a ton of them, and IMO none of the internal ones should be traced. Tracing is meant to show user-created structure, not implementation details.
See coop. :-)
Before 2.12, the singleton static instance was declared
Yeah, exactly. I'm certain the cost will be greater for the simpler
Ah, this makes a lot of sense. I thought the lambda reference is always a singleton, but this obviously isn't the case, especially if it closes over other variables in scope. 👍 to caching the class.
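A minimal sketch of the class-keyed cache being discussed, assuming a hypothetical `TraceFrame` type and names that are not the PR's actual internals. Keying on `getClass` bounds the cache by the number of lambda classes loaded, so no LRU eviction is needed:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Placeholder frame representation for the sketch.
final case class TraceFrame(stackTrace: List[StackTraceElement])

object TraceFrameCache {
  private val frames = new ConcurrentHashMap[Class[_], TraceFrame]()

  private val capture: JFunction[Class[_], TraceFrame] =
    new JFunction[Class[_], TraceFrame] {
      def apply(cls: Class[_]): TraceFrame =
        TraceFrame(new Throwable().getStackTrace.toList)
    }

  // One entry per lambda *class*: the key space is bounded by the number of
  // loaded lambda classes, which is static and relatively small.
  def frameFor(f: AnyRef): TraceFrame =
    frames.computeIfAbsent(f.getClass, capture)
}
```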
I came across a comment in the codebase about leveraging
This makes me question how much value rabbit tracing will provide. I think it's fair to say that if you're using transformers, it's likely that they permeate your entire codebase (this certainly has been the case for me). If the transformer lambdas/classes are cached, tracing is likely to produce incorrect results, at least past the immediate frame. I wonder if we should refocus the scope of rabbit tracing away from stack-based tracing and just expose run-time information (e.g. instructions, forks/joins, async boundaries, yielding, etc.). Slug mode could focus more on stack-based tracing, and we could offer options around caching and trace frame calculation that users can customize to their preference. Of course, the user would have to be well aware of the limitations of those options before turning them on, but I think that's reasonable since they are options. But the bottom line would be that the default, out-of-the-box behavior works correctly.
👍
It just surprises me that we need those approaches at all. I guess this is betraying my lack of understanding of the current runloop, but
Tracing in general is definitely lacking in value in a lot of common scenarios. I think the real advantage here is in providing any information, where currently there is none. What I'm currently thinking about in terms of use-case for rabbit tracing is: someone gets a production crash and/or a production livelock and they want to start diagnostics. They're going to turn on slug tracing locally, but they need to have some idea of how to reproduce the situation in the first place. Tracing at least gives some hints. (As a note on the livelock case, being able to dump the traces of all the fibers would be awesome, basically giving us the analogue of a thread dump.)

I don't think there's any particular way of getting around the fact that, in a lot of cases, the trace will just show more of the same. I feel like, as with all lossy introspection tools (think: heap dump analysis), people eventually learn the ins and outs of when it's helpful and what the hints usually mean in which context, and they learn to interpret the patterns in ways that the tool authors didn't anticipate. All of which is to say I think more information is better.

So while I agree with you that fast tracing (regardless of implementation details) is generally not going to be completely accurate in any application with interpreter stack layers (monad transformers, streaming stuff like fs2, etc), I do think there's still quite a bit of value in it just on the basis of giving you something to hang your hat on.
I finally got around to running some proper benchmarks for this. Some context:
The first column is the baseline; I just ran the benchmarks on
The good news is that disabled tracing is for the most part right behind the baseline. The bad news is that cached tracing is hurting pretty badly: I'm seeing up to 50% degradations compared to the baseline. I'll post an updated benchmark after I remove some more of these barriers.
Yeah that's a pretty serious hit in the cached case. I think we can optimize the caching more. Removing the read barriers will help for sure. I would also imagine we can optimize the table itself somewhat, depending on how sparse it's ending up being. We should also sanity-check
I totally missed 3 memory barriers. I'm omitting disabled tracing from this table.
A lot better than before. Next up is the CHM
@djspiewak I think I'm happy with the API now, feel free to pull it and try it out sometime. Before we merge, is it worth closing this PR, moving to a new branch, and reorganizing the commits?
I actually rather like having a comprehensive history, so I'm in favor of keeping it as-is, but I know other people have strong opinions on this stuff. :-) Overall, I'm 👍 for merging it now! There's definitely more to be done, but we can't boil the ocean in a single PR. Let's get it in master, get it into people's hands, and see how it behaves. (Build has to be fixed first though.)
🎉
Thanks so much! 😍
This PR contains a first pass for fiber tracing. Fiber tracing refers to the analysis of the execution graph produced by an `IO` program. The scope of this PR is to capture stack frames at well-defined points in execution and collect them into trace frames, which can later be printed out on demand. This is useful in both development and production settings where asynchronous programming is concerned, because fiber traces are much more useful for debugging and easier to understand than stack traces. In the future, fiber tracing can be expanded to cover additional use cases.

I make several claims about performance here, which are mostly backed by local runs of `MapCallsBenchmark`. This offers a soft upper bound on performance impact, but I need to run the other benchmarks and create some more comprehensive test suites so others can run and compare.
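For context, this is roughly the shape of work a map-call benchmark exercises (illustrative only, not the actual benchmark source): a long chain of `map` nodes, which is exactly the code path tracing instruments.

```scala
import cats.effect.IO

object MapChain {
  // Build and run a chain of `depth` map nodes.
  def run(depth: Int): Int = {
    var io = IO.pure(0)
    var i  = 0
    while (i < depth) {
      io = io.map(_ + 1)
      i += 1
    }
    io.unsafeRunSync()
  }
}
```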
Design constraints

There are several design constraints I wrote this implementation under, which forced some tradeoffs. We should probably clarify how important these are and whether we can relax them at all.

Tracing can be enabled and disabled at runtime for a running `IO` program. This makes tracing possible and appealing in production. The current implementation uses a `ThreadLocal`. This has no impact when tracing is turned off globally, but if it is enabled, every code path incurs the cost of `ThreadLocal` accesses. @djspiewak has some interesting ideas here that might be able to exploit the memory model to achieve thread-local state with non-volatile variables.

Limitations
Additionally, there were some limitations in the Scala language that forced me to make some code design decisions I'm not very happy about.
Implementation details
Several implementation details are described below. Hopefully we can have discussions around what the best approach is moving forward.
C2 knows the values of `static final` variables (or so-called just-in-time constants). With that information, it can eliminate code paths that are consequent to those constants. As described above, Scala object-level fields can't take advantage of this, so we're defining the global tracing flag in Java instead. One issue here is package visibility.

Ideally the tracing conditionals would live in the `IO.Builder` helper and `IOTracing` utility methods, but C2 can't inline those method calls in 2.12, so the conditionals are somewhat scattered around the codebase.
There are two ways to attach a trace frame: store it as an instance variable on every `IO` node, or introduce a new `IO` constructor that reifies an `IO` node with a trace frame. I took the latter approach because there's virtually no cost when tracing is disabled, and it doesn't pollute the constructor or interpreter logic for other `IO` instructions. However, there is a very steep cost (more than 30-40% IIRC) when tracing IS enabled, because it effectively doubles the number of interior nodes in the `IO` tree being traced. The instance variable approach, by contrast, is much cheaper during tracing, but incurs around a 4-5% performance penalty when it's disabled.
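A simplified, illustrative ADT (not the real `IO` internals) to make the two options concrete — option 2, the one taken here, reifies the trace frame as its own node, while option 1 would add a field to every node:

```scala
sealed trait Node[+A]
final case class Pure[A](value: A) extends Node[A]
final case class FlatMap[A, B](source: Node[A], f: A => Node[B]) extends Node[B]

// Option 2: a dedicated constructor that wraps another node with its trace frame.
// (Option 1 would instead add a nullable `frame` field to Pure, FlatMap, etc.)
final case class Trace[A](source: Node[A], frame: List[StackTraceElement]) extends Node[A]
```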
Trace frames are cached for the lambdas associated with `Map`, `Bind`, and `Async` nodes. I chose a global `ConcurrentHashMap` here, but we probably need to benchmark it more and compare it to alternatives like `ThreadLocal`. Also, this should probably be an LRU cache bounded in entry count. There is a snag with this caching approach where the same lambda reference could be used in multiple places (say, a function bound to a val `x` and passed to several combinators), but we still need some way to detect if we're looking at a lambda that is shared like this, otherwise caching could produce incorrect results. Maybe there's a way to compare the name of the lambda with reflection. Bytecode analysis is also another option here. But I personally haven't seen or used this pattern ever, so maybe we explicitly call it out as a limitation.
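As a stand-in for the elided inline example, here is the kind of sharing that trips up a reference-keyed (or class-keyed) cache; the names are hypothetical:

```scala
import cats.effect.IO

object SharedLambdaExample {
  // The same lambda instance is bound to a val and reused at two call sites.
  val x: Int => IO[Int] = n => IO.pure(n * 2)

  val a: IO[Int] = IO.pure(1).flatMap(x) // call site A: frame captured here first
  val b: IO[Int] = IO.pure(2).flatMap(x) // call site B: reuses A's cached frame
}
```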
`IOBracket` leverages the `Async` constructor and invokes a new execution of the `IORunLoop`. Since two executions of the run-loop can then be associated with the same fiber, we need some way to move state between each execution. My approach is to create a thread-safe `IOContext` class that holds a buffer of traces and gets passed along in the `Async` constructor. Ideally it would be mutable (with an internal ring buffer) because there's a cost to the synchronization, but the alternative was to significantly refactor the run-loop. I believe some of the work done for CE3 will simplify the `bracket` implementation so only one run-loop is invoked per `IO`.
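For reference, a fixed-size ring buffer of the kind mentioned above might look like this (a sketch, not the `IOContext` implementation in this PR):

```scala
final class RingBuffer[A <: AnyRef](capacity: Int) {
  private val buf   = new Array[AnyRef](capacity)
  private var index = 0

  // Overwrites the oldest entry once the buffer is full.
  def push(a: A): Unit = {
    buf(index % capacity) = a
    index += 1
  }

  // Returns the retained entries, oldest first.
  def toList: List[A] = {
    val start = math.max(0, index - capacity)
    (start until index).toList.map(i => buf(i % capacity).asInstanceOf[A])
  }
}
```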
`ThreadLocal` was the easy option here and the performance hit isn't terrible. Preserving `ThreadLocal` state across asynchronous boundaries is fairly straightforward: before invoking the continuation of an `Async`, save the current state. Restore that state when the continuation invokes the callback. Note that in the case of cancellation, we don't need to worry about resetting the `ThreadLocal` of that thread, because it will be reset when it starts a new run-loop or picks up a new task (which is re-entering the run loop from an asynchronous boundary).
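A minimal sketch of the save/restore idea for the thread-local tracing state (names are hypothetical, not this PR's actual internals):

```scala
object FiberContext {
  private val local = new ThreadLocal[List[String]] {
    override def initialValue(): List[String] = Nil
  }

  def push(frame: String): Unit = local.set(frame :: local.get)

  // Before invoking the continuation of an Async, capture the current state...
  def capture(): List[String] = local.get

  // ...and restore it on whichever thread ends up running the callback.
  def restore(saved: List[String]): Unit = local.set(saved)
}
```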
Questions

There are many uses of `IO.Async` throughout the library. Should we trace them all?

TODO

`TracedException`

Example

produces as output...
This is still a work-in-progress, so please leave your feedback here :)