
Fiber tracing: asynchronous stack tracing #854

Merged (80 commits) on Jul 11, 2020

Conversation

@RaasAhsan commented May 2, 2020

This PR contains a first pass for fiber tracing. Fiber tracing refers to the analysis of the execution graph produced by an IO program. The scope of this PR is to capture stack frames at well-defined points in execution and collect them into trace frames, which can later be printed out on demand. This is useful in both development and production settings where asynchronous programming is concerned because fiber traces are much more useful for debugging and easier to understand than stack traces. In the future, fiber tracing can be expanded to cover additional use cases.

I make several claims about performance here, mostly backed by local runs of MapCallsBenchmark. This offers a soft upper bound on the performance impact, but I still need to run the other benchmarks and create some more comprehensive test suites that others can run and compare.

Design constraints

There are several design constraints I wrote this implementation under, which forced some tradeoffs. We should probably clarify how important these are and whether we can relax them at all.

  1. If tracing is globally disabled, there should be negligible impact on application performance. The current implementation sees a 0% performance degradation; the JIT can dynamically eliminate all consequent code paths at runtime. But it comes at a steep cost when tracing is enabled.
  2. There should be optimal support for lexically scoped tracing, i.e. tracing only a subtree of an IO program. This makes tracing possible and appealing in production. The current implementation uses ThreadLocal. This has no impact when tracing is turned off globally, but when lexical tracing is enabled, every code path incurs the cost of ThreadLocal accesses. @djspiewak has some interesting ideas here that might exploit the memory model to achieve thread-local state with non-volatile variables.
  3. Support different tracing modes. A fast tracing mode would build trace frames, produce surface-level detail from the stack trace, and cache it all so that subsequent calls complete faster. A slower tracing mode would perform more detailed analysis. I haven't done much work here yet, but I imagine a lot is possible.
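
The lexically scoped approach from constraint 2 can be sketched roughly as follows. All names here are hypothetical, not the PR's actual API: a per-thread flag that is off by default, enabled only inside a region, and restored on exit so untraced code never pays for it.

```scala
// Rough sketch of lexically scoped tracing state (hypothetical names).
// The flag lives in a ThreadLocal, so it is off by default and only the
// region that opts in pays the tracing cost.
object LocalTracing {
  private val enabled: ThreadLocal[Boolean] = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  def isEnabled: Boolean = enabled.get()

  // Enable tracing for the given region only, restoring the prior state
  // afterwards so nesting and thread reuse both behave correctly.
  def traced[A](region: => A): A = {
    val previous = enabled.get()
    enabled.set(true)
    try region
    finally enabled.set(previous)
  }
}
```

The run-loop would consult `isEnabled` before capturing a frame; the save/restore discipline is the same one needed to carry the state across asynchronous boundaries.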

Limitations

Additionally, there were some limitations in the Scala language that forced me to make some code design decisions I'm not very happy about.

  1. In Scala 2.12, calls to singleton object methods cannot be inlined by the JIT. This was fixed in Scala 2.13.
  2. In general, Scala cross-module singleton variable accesses cost a volatile read. I don't have a primary source for this, but if you inspect the bytecode you will see it. The JIT can't optimize this out.
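
For contrast, here is the shape of the guard pattern in Scala, with hypothetical names. In the PR the flag itself is defined as a Java `static final boolean` precisely because a Scala `object` field like the one below compiles to a module access that C2 will not fold into a constant on 2.12:

```scala
// Hypothetical sketch of the guard pattern. In the actual PR the flag is a
// Java `static final boolean` (a just-in-time constant), because this Scala
// object field compiles to a module access that C2 cannot constant-fold.
object TracingFlag {
  final val isEnabled: Boolean =
    java.lang.Boolean.getBoolean("example.tracing.enabled")

  // Every traced construction site is guarded; with a true JIT constant the
  // disabled branch can be dead-code-eliminated entirely.
  def traceIfEnabled[A](value: A)(capture: () => Unit): A = {
    if (isEnabled) capture()
    value
  }
}
```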

Implementation details

Several implementation details are described below. Hopefully we can have discussions around what the best approach is moving forward.

  1. Global flag defined in Java scope. After class initialization, the HotSpot C2 compiler is aware of the values of all static final variables (so-called just-in-time constants). With that information, it can eliminate code paths that are consequent to those constants. As described above, Scala object-level fields can't take advantage of this, so we're defining the global tracing flag in Java instead. One issue here is package visibility.
  2. Global flag conditionals. Originally I wanted to hide the conditionals behind an IO.Builder helper and IOTracing utility methods, but C2 can't inline those method calls in 2.12. So the conditionals are somewhat scattered around the codebase.
  3. Trace frame capture. There were two approaches I identified here: either capture the trace frames as instance variables in each IO node, or introduce a new IO constructor that reifies an IO node with a trace frame. I took the latter approach because there's virtually no cost when tracing is disabled, and it doesn't pollute the constructor or interpreter logic for other IO instructions. However, there is a very steep cost (more than 30-40% IIRC) when tracing IS enabled, because it effectively doubles the number of interior nodes in the IO tree being traced. The instance-variable approach, by contrast, is much cheaper during tracing, but incurs around a 4-5% performance penalty when tracing is disabled.
  4. Trace frame cache. Trace frames are cached by lambda references as keys, so we can easily cache Map, Bind, and Async nodes. I chose a global ConcurrentHashMap here, but we probably need to benchmark it more and compare it to alternatives like ThreadLocal. Also, this should probably be an LRU cache bounded in entry count. There is a snag with this caching approach where the same lambda reference could be used in multiple places:
val x: Int => Int = (x: Int) => x + 1
Option(3)
  .map(x)
  .map(x)
     The Scala compiler won't generate a new lambda class for each invocation of x, but we still need some way to detect that we're looking at a lambda that is shared like this; otherwise caching could produce incorrect results. Maybe there's a way to compare the name of the lambda with reflection. Bytecode analysis is another option here. But I've personally never seen or used this pattern, so maybe we should explicitly call it out as a limitation.
  5. Async. The implementation of IOBracket leverages the Async constructor and invokes a new execution of the IORunLoop. Since two executions of the run-loop can then be associated with the same fiber, we need some way to move state between each execution. My approach is to create a thread-safe IOContext class that holds a buffer of traces and gets passed along in the Async constructor. Ideally it would be mutable (with an internal ring buffer) because there's a cost to the synchronization, but the alternative was to significantly refactor the run-loop. I believe some of the work done for CE3 will simplify the bracket implementation so only one run-loop is invoked per IO.
  6. Trace frame calculation. The current implementation filters out specific class prefixes from stack frames and takes the first one remaining. Obviously this won't work with constructs like monad transformers, so we need to figure out a smarter algorithm here. Maybe taking sequences of frames that might be relevant. Or the user can specify a list of packages they're interested in. We also need to clean up the class/lambda names a little.
  7. Lexically scoped tracing mode. ThreadLocal was the easy option here, and the performance hit isn't terrible. Preserving ThreadLocal state across asynchronous boundaries is fairly straightforward: before invoking the continuation of an Async, save the current state, and restore that state when the continuation invokes the callback. Note that in the case of cancellation, we don't need to worry about resetting the ThreadLocal of that thread, because it will be reset when it starts a new run-loop or picks up a new task (which is re-entering the run loop from an asynchronous boundary).
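
The trace-frame cache described above can be sketched as a `ConcurrentHashMap` keyed by the lambda reference, using `computeIfAbsent` so the relatively expensive stack walk happens once per key. All names here are hypothetical, not the PR's actual internals:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of the trace-frame cache: keyed by the lambda reference,
// so repeated executions of the same map/bind reuse one computed frame. A
// production version would bound this (e.g. an LRU) or key by class instead.
final case class TraceFrame(op: String, site: String)

object FrameCache {
  private val cache = new ConcurrentHashMap[AnyRef, TraceFrame]()

  // computeIfAbsent runs the (relatively expensive) stack walk only on a miss.
  def frameFor(key: AnyRef, op: String): TraceFrame =
    cache.computeIfAbsent(key, _ => TraceFrame(op, captureSite()))

  private def captureSite(): String =
    Thread.currentThread().getStackTrace
      .drop(2)                 // skip getStackTrace and captureSite itself
      .headOption
      .map(_.toString)
      .getOrElse("<unknown>")
}
```

On a cache hit, `frameFor` returns the previously computed frame instance without walking the stack at all, which is where the "cached" tracing mode gets its speed.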

Questions

  1. There are many constructor calls for IO.Async throughout the library. Should we trace them all?
  2. What should trace frames look like?

TODO

  • Frame cache: thread-local or a thread-safe global cache? LRU entry-count bounded?
  • Proper trace frame calculation
  • Add appropriate tracing to all primitive operations
  • Exception handling and TracedException
  • Shared lambda issue
  • Demangle class and anonfun names
  • Post benchmarks and create a more comprehensive test suite others can run
  • Tests
  • Documentation and comments
  • Cleanup

Example

def program2: IO[Unit] =
  for {
    _ <- IO.delay(println("7"))
    _ <- IO.delay(println("8"))
  } yield ()

def program: IO[Unit] =
  for {
    _ <- IO.delay(println("1"))
    _ <- IO.delay(println("2"))
    _ <- IO.shift
    _ <- IO.unit.bracket(_ =>
      IO.delay(println("3"))
        .flatMap(_ => program2)
    )(_ => IO.unit)
    _ <- IO.delay(println("4"))
    _ <- IO.delay(println("5"))
  } yield ()

override def run(args: List[String]): IO[ExitCode] =
  for {
    _ <- IO.suspend(program).rabbitTrace
    _ <- IO.delay("10")
    trace <- IO.backtrace
    _ <- IO.delay(trace.printTrace())
  } yield ExitCode.Success

produces the following output:

1
2
3
7
8
4
5
IOTrace
	map at org.simpleapp.example.Example$.$anonfun$program$8 (Example.scala:42)
	bind at org.simpleapp.example.Example$.$anonfun$program$7 (Example.scala:41)
	map at org.simpleapp.example.Example$.$anonfun$program2$1 (Example.scala:29)
	bind at org.simpleapp.example.Example$.program2 (Example.scala:28)
	bind at org.simpleapp.example.Example$.$anonfun$program$4 (Example.scala:39)
	async at org.simpleapp.example.Example$.$anonfun$program$3 (Example.scala:40)
	bind at org.simpleapp.example.Example$.$anonfun$program$3 (Example.scala:37)
	bind at org.simpleapp.example.Example$.$anonfun$program$2 (Example.scala:36)
	bind at org.simpleapp.example.Example$.$anonfun$program$1 (Example.scala:35)
	bind at org.simpleapp.example.Example$.program (Example.scala:34)

This is still a work-in-progress, so please leave your feedback here :)

@djspiewak (Member)

This is amazing work! Even just the write-up is epic. I have lots of thoughts and I'll try to respond shortly, but just a quick hat-tip first. :-) Well done.

@kubukoz (Member) commented May 4, 2020

This looks awesome, looking forward to having this!

@djspiewak (Member)

Some thoughts:

> If tracing is globally disabled, there should be negligible impact on application performance. The current implementation sees a 0% performance degradation; the JIT can dynamically eliminate all consequent code paths at runtime. But it comes at a steep cost when tracing is enabled.

IMO, global disabling of tracing should result in a 0% degradation. Local disabling should result in at worst something like a 1-2% degradation, while fully-enabled maybe something like 3-4%. At least those are kind of the numbers I'm thinking. Obviously less is better, but this feels relatively attainable.

> In general, Scala cross-module singleton variable accesses cost a volatile read. I don't have a primary source for this, but if you inspect the bytecode you will see it. The JIT can't optimize this out.

I think the bytecode is a primary source. :-)

> Global flag defined in Java scope. After class initialization, the HotSpot C2 compiler is aware of the values of all static final variables (so-called just-in-time constants). With that information, it can eliminate code paths that are consequent to those constants. As described above, Scala object-level fields can't take advantage of this, so we're defining the global tracing flag in Java instead. One issue here is package visibility.

I don't think it hurts to make it publicly visible if we have to. It's in the internals package and I'm perfectly happy to say it's out of scope for bincompat.

> Global flag conditionals. Originally I wanted to hide the conditionals behind an IO.Builder helper and IOTracing utility methods, but C2 can't inline those method calls in 2.12. So the conditionals are somewhat scattered around the codebase.

Fascinating. I wonder if that's due to the inlining limits. Also that answers one of my review questions. We could get around this by using a macro in IOTracing.

> Trace frame capture. There were two approaches I identified here: either capture the trace frames as instance variables in each IO node, or introduce a new IO constructor that reifies an IO node with a trace frame. I took the latter approach because there's virtually no cost when tracing is disabled, and it doesn't pollute the constructor or interpreter logic for other IO instructions. However, there is a very steep cost (more than 30-40% IIRC) when tracing IS enabled, because it effectively doubles the number of interior nodes in the IO tree being traced. The instance-variable approach, by contrast, is much cheaper during tracing, but incurs around a 4-5% performance penalty when tracing is disabled.

Ah more of my review comments answered!

Where does the cost come from when the tracing is disabled? Just the larger object header size?

> Trace frame cache. Trace frames are cached by lambda references as keys, so we can easily cache Map, Bind, and Async nodes. I chose a global ConcurrentHashMap here, but we probably need to benchmark it more and compare it to alternatives like ThreadLocal. Also, this should probably be an LRU cache bounded in entry count. There is a snag with this caching approach where the same lambda reference could be used in multiple places:

ConcurrentHashMap is very fast for read-biased workloads. Not like, perfectly optimal, but pretty good. It's certainly going to be faster than ThreadLocal.

Also I think we can actually not worry about the LRU part of this if we cache based on lambda class rather than lambda instance, since the number of distinct cache entries will be strictly bounded by the number of classes, which is static (and relatively small).
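The class-versus-instance distinction is easy to demonstrate: a capturing lambda produces a fresh instance each time its enclosing method runs, but every instance from the same definition site shares one class, so a class-keyed cache is bounded by the number of lambda definition sites in the program.

```scala
// A capturing lambda yields a new instance per call of the enclosing method,
// but all instances from one definition site share a single class.
object LambdaClassDemo {
  def adder(n: Int): Int => Int = (x: Int) => x + n
}
```

For example, `adder(1)` and `adder(2)` are distinct instances (`ne`), yet `getClass` returns the same class for both, so a class-keyed map gains no new entries as the program allocates more closures.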

Also regarding your example, that seems completely fine to me. We know tracing is going to suffer from these kinds of limitations.

> The Scala compiler won't generate a new lambda class for each invocation of x, but we still need some way to detect that we're looking at a lambda that is shared like this; otherwise caching could produce incorrect results. Maybe there's a way to compare the name of the lambda with reflection. Bytecode analysis is another option here. But I've personally never seen or used this pattern, so maybe we should explicitly call it out as a limitation.

We should explicitly call it out, but IMO it's completely fine. Tracing is an approximation, really, intended to give some information where, at present, we have none.

> Async. The implementation of IOBracket leverages the Async constructor and invokes a new execution of the IORunLoop. Since two executions of the run-loop can then be associated with the same fiber, we need some way to move state between each execution. My approach is to create a thread-safe IOContext class that holds a buffer of traces and gets passed along in the Async constructor. Ideally it would be mutable (with an internal ring buffer) because there's a cost to the synchronization, but the alternative was to significantly refactor the run-loop. I believe some of the work done for CE3 will simplify the bracket implementation so only one run-loop is invoked per IO.

Ah yeah this is rough. I really don't like the fact that bracket is leveraging Async, tbh. That definitely imposes a significant penalty on things. Thread safety in general imposes a high penalty, and particularly for a case like this where we know it should be synchronous… :-/ We probably need to resolve this.
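The mutable ring-buffer variant floated above (inside a hypothetical IOContext) might look like this sketch, assuming single-threaded access; the synchronization the PR pays for exists only because the Async-based bracket lets two run-loop executions touch the same buffer:

```scala
// Sketch of the mutable ring-buffer idea (hypothetical, single-threaded): a
// fixed-capacity buffer of trace events that overwrites the oldest entry once
// full, so a long-running fiber retains only the most recent frames.
final class TraceBuffer(capacity: Int) {
  private val buffer = new Array[String](capacity)
  private var index = 0
  private var count = 0

  def push(event: String): Unit = {
    buffer(index) = event
    index = (index + 1) % capacity
    if (count < capacity) count += 1
  }

  // Snapshot of the retained events, oldest first.
  def toList: List[String] = {
    val start = if (count < capacity) 0 else index
    List.tabulate(count)(i => buffer((start + i) % capacity))
  }
}
```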

> Trace frame calculation. The current implementation filters out specific class prefixes from stack frames and takes the first one remaining. Obviously this won't work with constructs like monad transformers, so we need to figure out a smarter algorithm here. Maybe taking sequences of frames that might be relevant. Or the user can specify a list of packages they're interested in. We also need to clean up the class/lambda names a little.

Monad transformers in general are going to cause a problem for rabbit tracing, since the lambdas will basically all come from the transformer. Slug tracing obviously doesn't have this problem since it bypasses the cache altogether, but then you take the performance hit.

At any rate, we're basically looking at heuristics either way. I think I had some random ideas on how we might do this in the spec gist, but ultimately nothing is really going to be fully general. I have no objections to starting with the class filtering approach and seeing how it works.
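The class-filtering approach under discussion can be sketched in a few lines. The prefix list below is illustrative, not the PR's actual list:

```scala
// Illustrative sketch of the filtering heuristic: drop frames whose class name
// starts with a known runtime/stdlib prefix, then take the first frame left,
// which is hopefully the user's call site. The prefix list is hypothetical.
object FrameFilter {
  private val dropPrefixes: List[String] =
    List("cats.effect.", "scala.", "java.", "jdk.", "sun.")

  def firstUserFrame(frames: List[StackTraceElement]): Option[StackTraceElement] =
    frames.find(f => !dropPrefixes.exists(p => f.getClassName.startsWith(p)))
}
```

As discussed above, this breaks down when the first surviving frame is itself a transformer or interpreter layer, which is exactly the monad-transformer problem.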

> Lexically scoped tracing mode. ThreadLocal was the easy option here, and the performance hit isn't terrible. Preserving ThreadLocal state across asynchronous boundaries is fairly straightforward: before invoking the continuation of an Async, save the current state, and restore that state when the continuation invokes the callback. Note that in the case of cancellation, we don't need to worry about resetting the ThreadLocal of that thread, because it will be reset when it starts a new run-loop or picks up a new task (which is re-entering the run loop from an asynchronous boundary).

This is fairly nice, and my ideas around exploiting the memory model basically just make the thread local approach faster. So, this is the right place to start with implementation, IMO.

> There are many constructor calls for IO.Async throughout the library. Should we trace them all?

There are a ton of them, and IMO none of the internal ones should be traced. Tracing is meant to show user-created structure, not implementation details.

> What should trace frames look like?

See coop. :-)

@RaasAhsan (Author)

> Fascinating. I wonder if that's due to the inlining limits. Also that answers one of my review questions. We could get around this by using a macro in IOTracing.

Before 2.12, the singleton static instance was declared final, so method calls could be inlined. As the link in the limitations section explains, it was made non-final in 2.12 to maintain compatibility with JDK 9 class files, but it's since been fixed in 2.13. I'll look into the macro approach.

> Where does the cost come from when the tracing is disabled? Just the larger object header size?

Yeah, exactly. I'm certain the cost will be greater for the simpler IO constructors like Pure and Delay.

> Also I think we can actually not worry about the LRU part of this if we cache based on lambda class rather than lambda instance, since the number of distinct cache entries will be strictly bounded by the number of classes, which is static (and relatively small).

Ah, this makes a lot of sense. I thought the lambda reference was always a singleton, but that obviously isn't the case, especially if it closes over other variables in scope. 👍 to caching the class.

> Ah yeah this is rough. I really don't like the fact that bracket is leveraging Async, tbh. ... We probably need to resolve this.

I came across a comment in the codebase about leveraging ContextSwitch to implement bracket. I haven't looked into feasibility, but do you think this is something worth pursuing? Another alternative is to allow the Async continuation to pass back state, but in such a way that it's not exposed to the public API. It would be a bit of internal refactoring.

> Monad transformers in general are going to cause a problem for rabbit tracing, since the lambdas will basically all come from the transformer.

This makes me question how much value rabbit tracing will provide. I think it's fair to say that if you're using transformers, it's likely they permeate your entire codebase (this has certainly been the case for me). If the transformer lambdas/classes are cached, tracing is likely to produce incorrect results, at least past the immediate frame.

I wonder if we should refocus the scope of rabbit tracing away from stack-based tracing and just expose run-time information (e.g. instructions, forks/joins, async boundaries, yielding, etc.). Slug mode could focus more on stack-based tracing, and we could offer options around caching and trace frame calculation that users can customize to their preference.

Of course, the user would have to be well-aware of what the limitations of those options are before turning them on, but I think that's reasonable since they are options. But the bottom line would be that the default, out-of-the-box behavior works correctly.

> There are a ton of them, and IMO none of the internal ones should be traced. Tracing is meant to show user-created structure, not implementation details.

👍

@djspiewak (Member)

> I came across a comment in the codebase about leveraging ContextSwitch to implement bracket. I haven't looked into feasibility, but do you think this is something worth pursuing? Another alternative is to allow the Async continuation to pass back state, but in such a way that it's not exposed to the public API. It would be a bit of internal refactoring.

It just surprises me that we need those approaches at all. I guess this is betraying my lack of understanding of the current run-loop, but bracket should in theory be implementable by maintaining a (possibly null) stack of finalizers, along with a boolean state indicating whether or not the current region is masked (or something isomorphic to this). I'm not sure I understand why either Async or ContextSwitch is strictly required.

> This makes me question how much value rabbit tracing will provide. I think it's fair to say that if you're using transformers, it's likely they permeate your entire codebase (this has certainly been the case for me). If the transformer lambdas/classes are cached, tracing is likely to produce incorrect results, at least past the immediate frame.

Tracing in general is definitely lacking in value in a lot of common scenarios. I think the real advantage here is in providing any information, where currently there is none.

What I'm currently thinking about in terms of use-case for rabbit tracing is someone gets a production crash and/or a production livelock and they want to start diagnostics. They're going to turn on slug tracing locally, but they need to have some idea of how to reproduce the situation in the first place. Tracing at least gives some hints. (as a note on the livelock case, being able to dump the traces of all the fibers would be awesome, basically giving us the analogue of a thread dump)

I don't think there's any particular way of getting around the fact that, in a lot of cases, the trace will just show "yet more OptionT.flatMap" and/or things like the fs2 interpreter internals. But that's still at least some information. We could refocus more on map/yield/etc, but that's basically what rabbit tracing is already showing, just with the addition of a (possibly misleading) stack frame.

I feel like, as with all lossy introspection tools (think: heap dump analysis), people eventually learn the ins and outs of when it's helpful and what the hints usually mean in which context, and they learn to interpret the patterns in ways that the tool authors didn't anticipate. All of which is to say I think more information is better.

So while I agree with you that fast tracing (regardless of implementation details) is generally not going to be completely accurate in any application with interpreter stack layers (monad transformers, streaming stuff like fs2, etc), I do think there's still quite a bit of value in it just on the basis of giving you something to hang your hat on.

@RaasAhsan (Author) commented Jul 9, 2020

I finally got around to running some proper benchmarks for this. Some context:

  1. Map fusion was removed.
  2. Delay and Suspend aren't being traced in cached mode. Tracing them is going to cause a perf hit for at least cached tracing, and maybe disabled tracing, so I'm going to think about it more.
  3. The traced combinator that marks a region of code for trace accumulation was removed. This means that all captured stack traces are pushed to the ring buffer. Arguably this makes benchmarking easier and the API simpler, but it carries a heavier cost.

The first column is the baseline; I just ran the benchmarks on master. The second is this PR with cached tracing, and the third is this PR with disabled tracing. I didn't bother running full tracing because it's going to be horrendous either way. Numbers measure throughput (ops/sec), so higher is better. Apologies for not making it clear who won :)

| Benchmark | master | cached tracing | disabled tracing |
| --- | --- | --- | --- |
| AsyncBenchmark.async | 119572.326 ± 413.003 | 79073.289 ± 2699.042 | 113184.161 ± 1329.851 |
| AsyncBenchmark.bracket | 29696.470 ± 506.753 | 23979.161 ± 379.111 | 28690.440 ± 543.955 |
| AsyncBenchmark.cancelBoundary | 149274.089 ± 1308.698 | 93234.194 ± 1462.561 | 117542.862 ± 1911.121 |
| AsyncBenchmark.cancelable | 76047.130 ± 1530.196 | 57926.565 ± 588.724 | 76606.612 ± 1249.202 |
| AsyncBenchmark.parMap2 | 5770.955 ± 424.211 | 3658.221 ± 419.610 | 3805.028 ± 1452.939 |
| AsyncBenchmark.race | 54208.527 ± 1046.410 | 47548.477 ± 351.995 | 50205.045 ± 1203.067 |
| AsyncBenchmark.racePair | 52858.627 ± 1348.588 | 46234.207 ± 401.834 | 49184.446 ± 1434.859 |
| AsyncBenchmark.start | 4613.861 ± 66.900 | 3938.927 ± 117.411 | 4235.612 ± 125.023 |
| AsyncBenchmark.uncancelable | 335959.872 ± 8007.988 | 215468.003 ± 55895.984 | 310715.990 ± 3577.594 |
| AttemptBenchmark.errorRaised | 2214.245 ± 30.976 | 1071.959 ± 37.659 | 2082.647 ± 39.953 |
| AttemptBenchmark.happyPath | 2093.844 ± 14.947 | 1687.826 ± 28.531 | 2103.024 ± 20.174 |
| DeepBindBenchmark.async | 341.085 ± 25.915 | 252.261 ± 13.372 | 334.080 ± 40.588 |
| DeepBindBenchmark.delay | 6769.888 ± 197.760 | 3865.844 ± 62.744 | 6589.597 ± 191.756 |
| DeepBindBenchmark.pure | 7486.475 ± 306.902 | 4166.002 ± 51.511 | 7350.192 ± 87.553 |
| ECBenchmark.app | 30.129 ± 0.134 | 24.933 ± 0.172 | 29.192 ± 0.256 |
| ECBenchmark.appWithCtx | 12.329 ± 0.540 | 8.804 ± 1.090 | 11.539 ± 1.177 |
| HandleErrorBenchmark.errorRaised | 2356.960 ± 13.261 | 1354.339 ± 20.241 | 2117.004 ± 21.843 |
| HandleErrorBenchmark.happyPath | 3264.041 ± 225.132 | 2450.450 ± 287.435 | 3211.480 ± 30.413 |
| MapCallsBenchmark.batch120 | 12016.605 ± 85.183 | 3764.943 ± 109.195 | 5975.144 ± 59.635 |
| MapCallsBenchmark.batch30 | 15716.561 ± 239.801 | 2956.037 ± 33.107 | 5927.286 ± 41.393 |
| MapCallsBenchmark.one | 5853.308 ± 28.972 | 5886166.788 ± 31165.093 | 46695239.188 ± 411206.001 |
| MapStreamBenchmark.batch120 | 3097.527 ± 54.732 | 1587.907 ± 22.619 | 2404.530 ± 45.621 |
| MapStreamBenchmark.batch30 | 1332.489 ± 15.626 | 632.612 ± 17.989 | 946.424 ± 18.011 |
| MapStreamBenchmark.one | 1447.147 ± 11.316 | 879.340 ± 25.421 | 1454.559 ± 18.920 |
| ShallowBindBenchmark.pure | 5998.937 ± 82.625 | 3022.161 ± 51.800 | 5996.171 ± 108.693 |
| ShallowBindBenchmark.async | 117.344 ± 7.443 | 87.645 ± 5.915 | 119.520 ± 11.301 |
| ShallowBindBenchmark.delay | 5008.061 ± 21.284 | 2668.011 ± 19.410 | 4919.652 ± 62.336 |

The good news is that disabled tracing is, for the most part, right behind master. The remaining difference mostly comes from the extra allocation in the Map, FlatMap, and Async constructors.

The bad news is that cached tracing is hurting pretty badly; I'm seeing up to 50% degradation compared to master in some benchmarks. I still need to remove those memory barriers, so I think it will look a lot better afterwards. I originally thought the CHM was the performance hit, but it's literally a single read barrier once the cache has converged.

I'll post an updated benchmark after I remove some more of these barriers.

@djspiewak (Member)

Yeah that's a pretty serious hit in the cached case. I think we can optimize the caching more. Removing the read barriers will help for sure. I would also imagine we can optimize the table itself somewhat, depending on how sparse it's ending up being. We should also sanity-check equals and hashCode on Class to make sure they aren't doing anything inane (I suspect they aren't).

@RaasAhsan (Author)

I totally missed 3 memory barriers in IOContext, so I removed those. I also replaced TraceTag with integer tags, since we can't really inline them in the IO class without allocating or defining them in Java. Honestly, we can probably remove the tags altogether, since we should be able to infer them from the stack trace. The IOTracing reference also can't be inlined into the IO class without allocation, but it isn't very costly either (in Scala 2.13 it can be inlined).

I'm omitting disabled tracing from this table.

| Benchmark | master | new cached tracing | old cached tracing |
| --- | --- | --- | --- |
| AsyncBenchmark.async | 119572.326 | 93693.168 | 79073.289 |
| AsyncBenchmark.bracket | 29696.470 | 27175.376 | 23979.161 |
| AsyncBenchmark.cancelBoundary | 149274.089 | 108202.295 | 93234.194 |
| AsyncBenchmark.cancelable | 76047.130 | 64941.074 | 57926.565 |
| AsyncBenchmark.parMap2 | 5770.955 | 3727.669 | 3658.221 |
| AsyncBenchmark.race | 54208.527 | 47109.698 | 47548.477 |
| AsyncBenchmark.racePair | 52858.627 | 46779.455 | 46234.207 |
| AsyncBenchmark.start | 4613.861 | 3996.665 | 3938.927 |
| AsyncBenchmark.uncancelable | 335959.872 | 251950.982 | 215468.003 |
| AttemptBenchmark.errorRaised | 2214.245 | 1403.077 | 1071.959 |
| AttemptBenchmark.happyPath | 2093.844 | 2120.155 | 1687.826 |
| DeepBindBenchmark.async | 341.085 | 284.715 | 252.261 |
| DeepBindBenchmark.delay | 6769.888 | 4824.286 | 3865.844 |
| DeepBindBenchmark.pure | 7486.475 | 5297.557 | 4166.002 |
| ECBenchmark.app | 30.129 | 25.984 | 24.933 |
| ECBenchmark.appWithCtx | 12.329 | 9.334 | 8.804 |
| HandleErrorBenchmark.errorRaised | 2356.960 | 1748.884 | 1354.339 |
| HandleErrorBenchmark.happyPath | 3264.041 | 2801.483 | 2450.450 |
| MapCallsBenchmark.batch120 | 12016.605 | 4622.263 | 3764.943 |
| MapCallsBenchmark.batch30 | 15716.561 | 3647.625 | 2956.037 |
| MapCallsBenchmark.one | 5853.308 | 6261480.811 | 5886166.788 |
| MapStreamBenchmark.batch120 | 3097.527 | 2048.182 | 1587.907 |
| MapStreamBenchmark.batch30 | 1332.489 | 760.250 | 632.612 |
| MapStreamBenchmark.one | 1447.147 | 1122.805 | 879.340 |
| ShallowBindBenchmark.pure | 5998.937 | 4122.696 | 3022.161 |
| ShallowBindBenchmark.async | 117.344 | 96.441 | 87.645 |
| ShallowBindBenchmark.delay | 5008.061 | 3319.394 | 2668.011 |

A lot better than before. Next up is the CHM.

@RaasAhsan changed the title from "WIP: Fiber tracing" to "Fiber tracing: asynchronous stack tracing" on Jul 10, 2020
@RaasAhsan (Author)

@djspiewak I think I'm happy with the API now; feel free to pull it and try it out sometime. Before we merge, is it worth closing this PR, moving to a new branch, and reorganizing the commits?

@djspiewak (Member)

> Before we merge, is it worth closing this PR, moving to a new branch, and reorganizing the commits?

I actually rather like having a comprehensive history, so I'm in favor of keeping it as-is, but I know other people have strong opinions on this stuff. :-)

Overall, I'm 👍 for merging it now! There's definitely more to be done, but we can't boil the ocean in a single PR. Let's get it in master, get it into people's hands, and see how it behaves. (build has to be fixed first though)

@djspiewak (Member) left a comment
🎉

@djspiewak djspiewak merged commit 99516fd into typelevel:master Jul 11, 2020
@LukaJCB (Member) commented Jul 11, 2020

Thanks so much! 😍

8 participants