Fiber tracing: asynchronous stack tracing #854
Conversation
This is amazing work! Even just the write-up is epic. I have lots of thoughts and I'll try to respond shortly, but just a quick hat-tip first. :-) Well done.
This looks awesome, looking forward to having this!
Some thoughts:
IMO, global disabling of tracing should result in a 0% degradation. Local disabling should result in at worst something like a 1-2% degradation, while fully-enabled maybe something like 3-4%. At least those are kind of the numbers I'm thinking. Obviously less is better, but this feels relatively attainable.
I think the bytecode is a primary source. :-)
I don't think it hurts to make it publicly visible if we have to. It's in the
Fascinating. I wonder if that's due to the inlining limits. Also that answers one of my review questions. We could get around this by using a macro in
Ah more of my review comments answered! Where does the cost come from when the tracing is disabled? Just the larger object header size?
Also I think we can actually not worry about the LRU part of this if we cache based on lambda class rather than lambda instance, since the number of distinct cache entries will be strictly bounded by the number of classes, which is static (and relatively small). Also regarding your example, that seems completely fine to me. We know tracing is going to suffer from these kinds of limitations.
We should explicitly call it out, but IMO it's completely fine. Tracing is an approximation, really, intended to give some information where, at present, we have none.
Ah yeah this is rough. I really don't like the fact that bracket is leveraging
Monad transformers in general are going to cause a problem for rabbit tracing, since the lambdas will basically all come from the transformer. Slug tracing obviously doesn't have this problem since it bypasses the cache altogether, but then you take the performance hit. At any rate, we're basically looking at heuristics either way. I think I had some random ideas on how we might do this in the spec gist, but ultimately nothing is really going to be fully general. I have no objections to starting with the class filtering approach and seeing how it works.
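To make the class-filtering idea concrete, here is a rough sketch of the heuristic (the package prefixes and names are illustrative assumptions, not anything in this PR):

```scala
object ClassFilter {
  // Hypothetical prefix list: lambdas whose classes come from these packages
  // (transformers, the effect library itself, the stdlib) are not worth tracing.
  private val ignoredPrefixes = List("cats.", "scala.")

  // Decide whether a lambda's class looks like user code.
  def shouldTrace(f: AnyRef): Boolean = {
    val name = f.getClass.getName
    !ignoredPrefixes.exists(p => name.startsWith(p))
  }
}
```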
This is fairly nice, and my ideas around exploiting the memory model basically just make the thread local approach faster. So, this is the right place to start with implementation, IMO.
There are a ton of them, and IMO none of the internal ones should be traced. Tracing is meant to show user-created structure, not implementation details.
See coop. :-)
Before 2.12, the singleton static instance was declared
Yeah, exactly. I'm certain the cost will be greater for the simpler
Ah, this makes a lot of sense. I thought the lambda reference is always a singleton, but this obviously isn't the case, especially if it closes over other variables in scope. 👍 to caching the class.
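A minimal sketch of the class-keyed cache being discussed, assuming a hypothetical `TraceFrame` type and names that are not the PR's actual internals. Keying on `getClass` bounds the cache by the number of lambda classes loaded, so no LRU eviction is needed:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Placeholder frame representation for the sketch.
final case class TraceFrame(stackTrace: List[StackTraceElement])

object TraceFrameCache {
  private val frames = new ConcurrentHashMap[Class[_], TraceFrame]()

  private val capture: JFunction[Class[_], TraceFrame] =
    new JFunction[Class[_], TraceFrame] {
      def apply(cls: Class[_]): TraceFrame =
        TraceFrame(new Throwable().getStackTrace.toList)
    }

  // One entry per lambda *class*: the key space is bounded by the number of
  // loaded lambda classes, which is static and relatively small.
  def frameFor(f: AnyRef): TraceFrame =
    frames.computeIfAbsent(f.getClass, capture)
}
```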
I came across a comment in the codebase about leveraging
This makes me question how much value rabbit tracing will provide. I think it's fair to say that if you're using transformers, it's likely that they permeate your entire codebase (this certainly has been the case for me). If the transformer lambdas/classes are cached, tracing is likely to produce incorrect results, at least past the immediate frame. I wonder if we should refocus the scope of rabbit tracing away from stack-based tracing and just expose run-time information (e.g. instructions, forks/joins, async boundaries, yielding, etc.). Slug mode could focus more on stack-based tracing, and we could offer options around caching and trace frame calculation that users can customize to their preference. Of course, the user would have to be well aware of the limitations of those options before turning them on, but I think that's reasonable since they are options. But the bottom line would be that the default, out-of-the-box behavior works correctly.
👍
It just surprises me that we need those approaches at all. I guess this is betraying my lack of understanding of the current runloop, but
Tracing in general is definitely lacking in value in a lot of common scenarios. I think the real advantage here is in providing any information, where currently there is none. What I'm currently thinking about in terms of use-case for rabbit tracing is: someone gets a production crash and/or a production livelock and they want to start diagnostics. They're going to turn on slug tracing locally, but they need to have some idea of how to reproduce the situation in the first place. Tracing at least gives some hints. (As a note on the livelock case, being able to dump the traces of all the fibers would be awesome, basically giving us the analogue of a thread dump.)

I don't think there's any particular way of getting around the fact that, in a lot of cases, the trace will just show more of the same. I feel like, as with all lossy introspection tools (think: heap dump analysis), people eventually learn the ins and outs of when it's helpful and what the hints usually mean in which context, and they learn to interpret the patterns in ways that the tool authors didn't anticipate. All of which is to say I think more information is better.

So while I agree with you that fast tracing (regardless of implementation details) is generally not going to be completely accurate in any application with interpreter stack layers (monad transformers, streaming stuff like fs2, etc), I do think there's still quite a bit of value in it just on the basis of giving you something to hang your hat on.
I finally got around to running some proper benchmarks for this. Some context:
The first column is the baseline; I just ran the benchmarks on
The good news is that disabled tracing is for the most part right behind the baseline. The bad news is that cached tracing is hurting pretty badly: I'm seeing up to 50% degradations compared to the baseline. I'll post an updated benchmark after I remove some more of these barriers.
Yeah that's a pretty serious hit in the cached case. I think we can optimize the caching more. Removing the read barriers will help for sure. I would also imagine we can optimize the table itself somewhat, depending on how sparse it's ending up being. We should also sanity-check
I totally missed 3 memory barriers. I'm omitting disabled tracing from this table.
A lot better than before. Next up is the CHM
@djspiewak I think I'm happy with the API now, feel free to pull it and try it out sometime. Before we merge, is it worth closing this PR, moving to a new branch, and reorganizing the commits?
I actually rather like having a comprehensive history, so I'm in favor of keeping it as-is, but I know other people have strong opinions on this stuff. :-) Overall, I'm 👍 for merging it now! There's definitely more to be done, but we can't boil the ocean in a single PR. Let's get it in master, get it into people's hands, and see how it behaves. (Build has to be fixed first though.)
🎉
Thanks so much! 😍
This PR contains a first pass for fiber tracing. Fiber tracing refers to the analysis of the execution graph produced by an `IO` program. The scope of this PR is to capture stack frames at well-defined points in execution and collect them into trace frames, which can later be printed out on demand. This is useful in both development and production settings where asynchronous programming is concerned, because fiber traces are much more useful for debugging and easier to understand than stack traces. In the future, fiber tracing can be expanded to cover additional use cases.

I make several claims about performance here, which are mostly backed by local runs of `MapCallsBenchmark`. This offers a soft upper bound on performance impact, but I need to run the other benchmarks and create some more comprehensive test suites so others can run and compare.
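For context, this is roughly the shape of work a map-call benchmark exercises (illustrative only, not the actual benchmark source): a long chain of `map` nodes, which is exactly the code path tracing instruments.

```scala
import cats.effect.IO

object MapChain {
  // Build and run a chain of `depth` map nodes.
  def run(depth: Int): Int = {
    var io = IO.pure(0)
    var i  = 0
    while (i < depth) {
      io = io.map(_ + 1)
      i += 1
    }
    io.unsafeRunSync()
  }
}
```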
Design constraints

There are several design constraints I wrote this implementation under, which forced some tradeoffs. We should probably clarify how important these are and whether we can relax them at all.

Tracing can be enabled and disabled at runtime for a running `IO` program. This makes tracing possible and appealing in production. The current implementation uses a `ThreadLocal`. This has no impact when tracing is turned off globally, but if it is enabled, every code path incurs the cost of `ThreadLocal` accesses. @djspiewak has some interesting ideas here that might be able to exploit the memory model to achieve thread-local state with non-volatile variables.

Limitations
Additionally, there were some limitations in the Scala language that forced me to make some code design decisions I'm not very happy about.
Implementation details
Several implementation details are described below. Hopefully we can have discussions around what the best approach is moving forward.
C2 knows the values of `static final` variables (or so-called just-in-time constants). With that information, it can eliminate code paths that are consequent to those constants. As described above, Scala object-level fields can't take advantage of this, so we're defining the global tracing flag in Java instead. One issue here is package visibility.

Ideally the tracing conditionals would live in the `IO.Builder` helper and `IOTracing` utility methods, but C2 can't inline those method calls in 2.12, so the conditionals are somewhat scattered around the codebase.
There are two ways to attach a trace frame: store it as an instance variable on every `IO` node, or introduce a new `IO` constructor that reifies an `IO` node with a trace frame. I took the latter approach because there's virtually no cost when tracing is disabled, and it doesn't pollute the constructor or interpreter logic for other `IO` instructions. However, there is a very steep cost (more than 30-40% IIRC) when tracing IS enabled, because it effectively doubles the number of interior nodes in the `IO` tree being traced. The instance variable approach, by contrast, is much cheaper during tracing, but incurs around a 4-5% performance penalty when it's disabled.
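A simplified, illustrative ADT (not the real `IO` internals) to make the two options concrete — option 2, the one taken here, reifies the trace frame as its own node, while option 1 would add a field to every node:

```scala
sealed trait Node[+A]
final case class Pure[A](value: A) extends Node[A]
final case class FlatMap[A, B](source: Node[A], f: A => Node[B]) extends Node[B]

// Option 2: a dedicated constructor that wraps another node with its trace frame.
// (Option 1 would instead add a nullable `frame` field to Pure, FlatMap, etc.)
final case class Trace[A](source: Node[A], frame: List[StackTraceElement]) extends Node[A]
```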
Trace frames are cached for the lambdas associated with `Map`, `Bind`, and `Async` nodes. I chose a global `ConcurrentHashMap` here, but we probably need to benchmark it more and compare it to alternatives like `ThreadLocal`. Also, this should probably be an LRU cache bounded in entry count. There is a snag with this caching approach where the same lambda reference could be used in multiple places (say, a function bound to a val `x` and passed to several combinators), but we still need some way to detect if we're looking at a lambda that is shared like this, otherwise caching could produce incorrect results. Maybe there's a way to compare the name of the lambda with reflection. Bytecode analysis is also another option here. But I personally haven't seen or used this pattern ever, so maybe we explicitly call it out as a limitation.
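As a stand-in for the elided inline example, here is the kind of sharing that trips up a reference-keyed (or class-keyed) cache; the names are hypothetical:

```scala
import cats.effect.IO

object SharedLambdaExample {
  // The same lambda instance is bound to a val and reused at two call sites.
  val x: Int => IO[Int] = n => IO.pure(n * 2)

  val a: IO[Int] = IO.pure(1).flatMap(x) // call site A: frame captured here first
  val b: IO[Int] = IO.pure(2).flatMap(x) // call site B: reuses A's cached frame
}
```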
`IOBracket` leverages the `Async` constructor and invokes a new execution of the `IORunLoop`. Since two executions of the run-loop can then be associated with the same fiber, we need some way to move state between each execution. My approach is to create a thread-safe `IOContext` class that holds a buffer of traces and gets passed along in the `Async` constructor. Ideally it would be mutable (with an internal ring buffer) because there's a cost to the synchronization, but the alternative was to significantly refactor the run-loop. I believe some of the work done for CE3 will simplify the `bracket` implementation so only one run-loop is invoked per `IO`.
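For reference, a fixed-size ring buffer of the kind mentioned above might look like this (a sketch, not the `IOContext` implementation in this PR):

```scala
final class RingBuffer[A <: AnyRef](capacity: Int) {
  private val buf   = new Array[AnyRef](capacity)
  private var index = 0

  // Overwrites the oldest entry once the buffer is full.
  def push(a: A): Unit = {
    buf(index % capacity) = a
    index += 1
  }

  // Returns the retained entries, oldest first.
  def toList: List[A] = {
    val start = math.max(0, index - capacity)
    (start until index).toList.map(i => buf(i % capacity).asInstanceOf[A])
  }
}
```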
`ThreadLocal` was the easy option here and the performance hit isn't terrible. Preserving `ThreadLocal` state across asynchronous boundaries is fairly straightforward: before invoking the continuation of an `Async`, save the current state. Restore that state when the continuation invokes the callback. Note that in the case of cancellation, we don't need to worry about resetting the `ThreadLocal` of that thread, because it will be reset when it starts a new run-loop or picks up a new task (which is re-entering the run loop from an asynchronous boundary).
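A minimal sketch of the save/restore idea for the thread-local tracing state (names are hypothetical, not this PR's actual internals):

```scala
object FiberContext {
  private val local = new ThreadLocal[List[String]] {
    override def initialValue(): List[String] = Nil
  }

  def push(frame: String): Unit = local.set(frame :: local.get)

  // Before invoking the continuation of an Async, capture the current state...
  def capture(): List[String] = local.get

  // ...and restore it on whichever thread ends up running the callback.
  def restore(saved: List[String]): Unit = local.set(saved)
}
```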
Questions

There are many uses of `IO.Async` throughout the library. Should we trace them all?

TODO

`TracedException`

Example

produces as output...
This is still a work-in-progress, so please leave your feedback here :)