OpenTelemetry Tracing API vs Tokio-Tracing API for Distributed Tracing #1571
Comments
Tagging @open-telemetry/rust-approvers |
If we were to use tracing as the API, this is the deviation between the existing tracing API and the OTel tracing API:
|
From the metrics perspective exemplars are also something to take into account. |
As requested in the community meeting: I would like to say that we should probably try to see if it's not possible to improve the inter-compatibility, as people will still try to use it directly. Questions that are open from my perspective:
We know that the Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway. |
As a heavy user of direct OpenTelemetry instrumentation (e.g., using Which interfaces specifically would be deprecated?
I suspect a lot of other OpenTelemetry users are also doing so in private repositories, so I agree that it's hard to measure. I would caution against inferring much from these public GitHub usage stats. |
We don't know exactly yet. The idea is to bridge the gap between the |
I vote for option 2, as there are challenges with other options:
Going with Option 2, we also need evaluation for introducing an extension API within OpenTelemetry. This is to effectively bridge the existing gaps between the OTel specifications and Tokio-Tracing's functionalities (e.g, Baggage support, Propagators). |
Direct consumption of opentelemetry-api could be for traces, metrics and logs, and I agree it is really hard to get the actual statistics for "traces" only :) |
OpenTelemetry comes from the merger of OpenCensus and OpenTracing. IDK if I have a say because I don't maintain OTel Rust, but I'd vote for Option 1 and invite the maintainers of IDK how much
I'm biased but I see OTel as the future for Observability signals. |
Tagging for more inputs. |
Just to provide context here. I think if we move to
|
I haven't had much time recently to work on open source, but my perspective is that option 3 is likely optimal in the near term. I suspect that expressing the full otel API via Option 3 could be done via clearer purposes for each API (e.g. low level "full" api via otel, or high level "limited but ergonomic and user-friendly" api via |
I'm on vacation, so I'll be brief and try to expand next week/summarize my thoughts from Slack: pulling a .NET (paying attention to the intent, not the letter of the spec) is very much possible, down to the fact that propagators remained in a dedicated OTEL library for 2.5 years. I think |
In practice, this is still the case! So is I'll also be on vacation for ~1 week. Once back, I'll write down more details on what option 2 could potentially look like. I didn't want to spend too much time exploring any of the options without observing which one the community as a whole would lean toward. It does not look like there are any clear winners so far, but part of the reason could be the lack of specifics on what each option would really entail. I'm not yet in a position to strongly support any option; however, I'll take a stab at exploring option 2 further. |
I guess I'll try to take a look at how we could go for option 3. From the top of my head use cases to look at:
Then some variant of the two where both Those will be "advanced cases", but honestly it might be more common than one might think. |
Comment/Discussion from Community Meeting for option 3: Test to validate option 3 A - uses tracing for producing span 3 spans It may not be feasible to ask users to use the same API for all 3, as they may not own/control some of them, e.g. B could be the reqwest crate. #1378 (comment) shows an example where logging and tracing (distributed tracing aka spans) are used, and correlation is broken when |
Took some time to get to this due to other priorities, but here are more details on one possible way to go with option2, including a prototype: |
👋🏻 I am not a Rust developer so am coming from a very different perspective. My take is that, to my knowledge, every other ecosystem has opted for Option 1 long-term, Option 3 near-term. Specifying the API in OTel was (I assume) a large effort and we have seen the API evolve as developers have battle-tested it and provided feedback (e.g., lack of a synchronous gauge instrument, which is now in the spec.) My impression is that spec evolution is a pretty collaborative process, which is nice to observe. In my view it would be a mistake to align on pre-existing instrumentation conventions as OTel's mission has been to provide a standard API that instrumentations across languages/systems can adhere to. This is particularly important as it provides a path for libraries to provide instrumentation hooks to, e.g. automatically generate traces and metrics as part of their own business logic, kind of like bpf kernel tracepoints or USDT. And those hooks are written according to a wider specification and hence less vulnerable to governance issues that tend to come up in external libraries from time to time. In the Go OTel SDK there are several "bridge" interfaces that help to close the gap b/w the OTel API and existing instrumentation libraries, e.g., the opencensus bridge. Perhaps this would be a way to pave the path towards wider OTel API adoption. /$0.02 🙇🏻 |
@hdost After re-reading this, I am not entirely sure if I understand the part where you said "I mentioned during the meeting we will still need to have "some" API anyway". I think @TommyCpp also mentioned this (in metrics context though). If you look at the prototype, it has tracing sdk only! No tracing api, i.e. there is no span/span.start()/end() etc. We'll need Could you check this. We can discuss in the next SIG call and figure out what the gaps in our understanding are. |
Update from July 30 OTel Rust Community Meeting: We recognize it’ll be a while before this can be fully sorted out. We continually see issues - upgrades are hard, otel-demo is broken, and users are unsure which versions are compatible and the list goes on. To mitigate the short/medium term pain, while also being not-too-far from the long term plans, it was decided to offer Note that this does not support interoperating both APIs for spans - either use tracing or otel tracing api, but mixing them up won't work. If option 3 is settled on, then this will need to be solved, but not part of the immediate release. This does not deprecate @TommyCpp will make the above happen and we are targeting to include it in the next release (~Aug 30) |
After some experiments using our custom tracing implementation, I reached the conclusion that it isn't worth the extra complexity. Most of the ecosystem is using tokio's tracing implementation, and the Rust OpenTelemetry WG is evaluating replacing their implementation with only the tracing layer[1]. Thus, I decided to replace it with a tracing integration setup. The setup is pretty standard, but the implementation uses a custom Layer to pass data from tracing to OpenTelemetry, continuing to use the background worker to do most of the heavy lifting. This makes the hot path that runs in the application loop more efficient. The implementation also uses a custom context to allow for faster retrieval of tracing information for propagation. The result is that kiso's users don't have to worry about OpenTelemetry crates unless they need dynamic attributes or links. Everything else is handled by the tracing crate. This commit contains only the necessary code for the migration. There are still some things to sort out, primarily performance improvements to implement. These will be done in other patches as this one is already too big to properly review. [1]: open-telemetry/opentelemetry-rust#1571
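To make the "hot path forwards, background worker does the heavy lifting" idea above concrete, here is a minimal std-only sketch. It is not the actual kiso code and does not use the real tracing or OpenTelemetry crates; `SpanRecord`, `ForwardingLayer`, `spawn_worker` and `run_demo` are hypothetical names, and the worker merely collects records where a real one would build and export OpenTelemetry data.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Minimal record captured on the hot path (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct SpanRecord {
    name: String,
    duration_micros: u64,
}

/// Stand-in for a custom layer: it only forwards records to a channel.
struct ForwardingLayer {
    tx: Sender<SpanRecord>,
}

impl ForwardingLayer {
    /// Called when a span closes; does no formatting or I/O itself.
    fn on_close(&self, name: &str, duration_micros: u64) {
        // Ignore send errors: the worker may already have shut down.
        let _ = self.tx.send(SpanRecord {
            name: name.to_string(),
            duration_micros,
        });
    }
}

/// Spawns the worker that would build and export telemetry.
/// Here it just collects the records so the result can be inspected.
fn spawn_worker() -> (ForwardingLayer, JoinHandle<Vec<SpanRecord>>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || rx.into_iter().collect::<Vec<_>>());
    (ForwardingLayer { tx }, handle)
}

/// Runs a tiny end-to-end example and returns what the worker received.
fn run_demo() -> Vec<SpanRecord> {
    let (layer, worker) = spawn_worker();
    layer.on_close("http.request", 1200);
    layer.on_close("db.query", 300);
    drop(layer); // dropping the only sender lets the worker loop end
    worker.join().expect("worker panicked")
}

fn main() {
    for record in run_demo() {
        println!("{record:?}");
    }
}
```

The only work left on the application thread is one allocation and a channel send; everything else can be batched on the worker.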
This work is delayed, and won't be part of the coming release (expected in a day). Will post new ETA for this soon. |
@cijothomas is there an update on when it will be implemented? |
I'd love to see smooth integration between (tokio's)tracing and OTEL. I recently sunk some time into setting up OTEL+Rust and went down a fairly substantial rabbit hole probing the different combinations of pieces to get things working properly for the different combinations of rust libraries & instrumentation options one might expect. Which is to say, I think option 3 is realistically what's happening, right now. Option 2 / deprecating OTEL in favour of tracing is bunk, because tracing doesn't cover the same set of use cases. I'm going to take the time to fully enumerate the "rough edges" between tracing/OTEL as it stands and report back. I think a pragmatic next step here would be providing concrete guidance on how to set them up together, and at the same time trying to smooth the edges out. @jtescher 's comment here resonates:
|
@scottgerring 3 is what is happening today, and that is the issue. If the two are arranged so that one is layered on top of the other, then that is totally fine (this is what Julian's comment is referring to as well, from what I can tell).
Thanks ❤️ ! Really appreciate the help. |
@scottgerring https://cloud-native.slack.com/archives/C069U408RNW is a slack channel created dedicated to discussing this. Feel free to join, and use that for discussions as well. |
Hey @cijothomas thanks for the enthusiastic welcome!
In my (likely rather incomplete) mental model, it feels like OTEL should sit "beneath" tracing, in the case where tracing is in-use, whilst still being able to stand on its own when it is not; I think this is what you and Julian are thinking too? |
I took advantage of my "beginners mind", and ported an application we have setup with OTEL-only integration to use I am using:
This should be indicative of the sort of common stack we should expect folks to use.
flowchart TD
A[Actix App] -->|Logs and spans| B[Tracing]
B -->|Subscriber: tracing-opentelemetry| C[tracing-opentelemetry]
B -->|Subscriber: opentelemetry-appender-tracing| D[opentelemetry-appender-tracing]
C -->|Pushes Traces| E[OpenTelemetry Traces API]
D -->|Pushes Logs| F[OpenTelemetry Logs API]
E -->|Forwarded to| G[OpenTelemetry Collector]
F -->|Forwarded to| G
This then feeds into Jaeger and Datadog to inspect.
Observations
1. Functional
Usability
Conclusion
Based on this, I reckon:
I suspect this is a rather long-winded way of ending up on the same page as everyone else on the thread, but I hope this can serve as a good summary :) Finally - enormous thanks to @julianocosta89 for entertaining my unending OTEL questions these last 2 weeks ❤️ |
Thanks for your findings @scottgerring . Just a comment on single crate:
If the motivation for having a single subscriber/bridge for traces and logs is to achieve correlation between logs and traces, I believe this may not be absolutely necessary. This was explored in #1394, specifically the comment here - #1394 (comment). One potential constraint is that both subscribers would need to depend on the same version of otel and otel-sdk. However, this limitation could be addressed once OpenTelemetry reaches a stable release. That said, having a single subscriber would indeed be a good or ideal solution. However, I wonder if it might introduce any additional burden for high-performance logging needs that do not require span context or correlation. (I may be incorrect, so I need to investigate further.) Also, there is a plan to make |
Let's not draw that conclusion. It should be up to users to decide whether they prefer SpanEvents or Logs (or both). |
@scottgerring Thank you for your analysis. One thing is not clear to me - which of the options listed are you suggesting? 2 or 3? |
I think the decision to go with option 2 or 3 should be left to the maintainers and contributors. The first and last options are probably not feasible or likely to happen. |
This is a great point! It also seems reasonable to expect that many users may use different exporters for different signals, which may place different version constraints on the otel-sdk pieces.
For what it's worth, my opinions on options - 1 and 4 seem like non-starters for reasons that have been extensively enumerated. Two - the gap is big as seen in @TommyCpp's message above. I think tracing has a much smaller remit - "in process tracing" primarily (I just discovered Three - I think this is the realistic option. There will be overlap, but it would support the existing community where they are, and allow new users to choose to use OTEL directly if they wish, without carrying an extra layer of indirection, via tracing. What would this mean in practice? I can see:
To be clear - I don't think my opinion should carry much weight - I'm very new to this long-running discussion, and am coming at this purely from an "it'd be cool if this played nicely together" perspective. I am happy to contribute in whatever way is useful to push whichever solution is chosen along. In the meantime - have a good weekend all! |
Whether |
One thing (In my view, this is likely the biggest thing to check out!) that we need to explore further is the Context part. For example, a user at the beginning of an incoming request puts UserId into baggage. Then Another Context related thing - is the suppression flag #1330 has made some attempts and has listed few challenges - I think the root issue there also is the lack of common agreement of what is the Context as we have |
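A rough illustration of the Context concern above: baggage set at the start of a request is only useful if code further down (a log enricher, say) reads the same "current context". The sketch below is std-only and purely hypothetical; `Context`, `attach`, `current` and `user_id_for_log` are illustrative names, not the opentelemetry or tracing crate APIs, and the `suppress_telemetry` field only marks where a suppression flag could live.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Toy context: baggage entries plus a suppression flag (hypothetical shape).
#[derive(Clone, Default, Debug)]
struct Context {
    baggage: HashMap<String, String>,
    suppress_telemetry: bool,
}

impl Context {
    /// Returns a copy of this context with one more baggage entry.
    fn with_baggage(&self, key: &str, value: &str) -> Context {
        let mut next = self.clone();
        next.baggage.insert(key.to_string(), value.to_string());
        next
    }
}

thread_local! {
    /// The single "current context" that every component on this thread sees.
    static CURRENT: RefCell<Context> = RefCell::new(Context::default());
}

/// Restores the previous context when dropped.
struct ContextGuard {
    previous: Option<Context>,
}

impl Drop for ContextGuard {
    fn drop(&mut self) {
        if let Some(prev) = self.previous.take() {
            CURRENT.with(|c| *c.borrow_mut() = prev);
        }
    }
}

/// Makes `ctx` current until the returned guard is dropped.
fn attach(ctx: Context) -> ContextGuard {
    let previous = CURRENT.with(|c| c.replace(ctx));
    ContextGuard { previous: Some(previous) }
}

/// Returns a copy of the current context.
fn current() -> Context {
    CURRENT.with(|c| c.borrow().clone())
}

/// Downstream code (e.g. a log enricher) reads baggage it did not set.
fn user_id_for_log() -> Option<String> {
    current().baggage.get("user.id").cloned()
}

fn main() {
    let _guard = attach(current().with_baggage("user.id", "42"));
    println!("{:?}", user_id_for_log());
}
```

If two APIs each kept their own equivalent of `CURRENT`, the value attached through one would not be visible through the other, which is the kind of mismatch this comment is pointing at.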
@cijothomas Understood - and thanks for taking the time to detail this. I'll try jump on the SIG call tomorrow for a quick chat and to see if there's any way I can make myself useful here. |
@cijothomas I've been talking a bit with @scottgerring about his woes when building the sample application, and my own when building a different sample. This has led me to dig around a lot in the current bridge implementations and various POCs that are referenced here. I think that the mismatch between the goals of OpenTelemetry and Tokio Tracing is too big for one of them to replace the other in any straightforward way, so option 3 with great interop between the APIs seems the logical path forward. The idea that I've been toying with (not unlike some of the proposals here) is to treat Tokio Tracing as two separate things:
To enable this I propose something similar to the Context Storage in Java. This would allow a bridge to implement a I've not fully figured out all the constraints on OpenTelemetry contexts within Tokio Tracing spans (will they be closed with the span they were created in etc.), but I'm planning to spend some time to prototype this. Also, this would allow the complete separation of the OpenTelemetry tracing logging appender, and the OpenTelemetry tracing tracing bridge (is that the correct name?), and you could pick and choose to enable context propagation if necessary. Let's discuss in the SIG call tomorrow. |
+1 to providing support to use/plugin external Context, irrespective of the option we go with. This is also somewhat specified in the specs:
Also, opentelemetry-cpp has a similar implementation - https://github.com/open-telemetry/opentelemetry-cpp/blob/31956f82ff990a870d2953c722666a782f672c35/api/include/opentelemetry/context/runtime_context.h#L155. It implements thread-local Context as the default, but allows users to bring their own. |
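For readers who want to picture the "thread-local by default, bring your own otherwise" idea from the last two comments, here is a small std-only sketch. Everything in it is an assumption made for illustration: `ContextStorage`, `ThreadLocalStorage`, `BridgeStorage` and `Runtime` are hypothetical names and not part of the opentelemetry crate, and `BridgeStorage` uses a mutex-protected slot where a real bridge might keep the context alongside its own active span.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

/// Toy context carrying only a trace id (hypothetical shape).
#[derive(Clone, Debug, PartialEq, Default)]
struct Context {
    trace_id: Option<u64>,
}

/// Hypothetical storage interface that a bridge could implement.
trait ContextStorage {
    /// Makes `ctx` current and returns the previously current context.
    fn set_current(&self, ctx: Context) -> Context;
    /// Returns a copy of the current context.
    fn current(&self) -> Context;
}

thread_local! {
    static SLOT: RefCell<Context> = RefCell::new(Context::default());
}

/// Default: keep the current context in a thread-local slot.
struct ThreadLocalStorage;

impl ContextStorage for ThreadLocalStorage {
    fn set_current(&self, ctx: Context) -> Context {
        SLOT.with(|s| s.replace(ctx))
    }
    fn current(&self) -> Context {
        SLOT.with(|s| s.borrow().clone())
    }
}

/// Alternative owned by a bridge; the mutex slot is only a stand-in.
struct BridgeStorage {
    slot: Mutex<Context>,
}

impl ContextStorage for BridgeStorage {
    fn set_current(&self, ctx: Context) -> Context {
        std::mem::replace(&mut *self.slot.lock().unwrap(), ctx)
    }
    fn current(&self) -> Context {
        self.slot.lock().unwrap().clone()
    }
}

/// API code only talks to the trait object, never to a concrete storage.
struct Runtime {
    storage: Box<dyn ContextStorage>,
}

impl Runtime {
    fn new() -> Self {
        Runtime { storage: Box::new(ThreadLocalStorage) }
    }
    fn with_storage(storage: Box<dyn ContextStorage>) -> Self {
        Runtime { storage }
    }
    fn current_trace_id(&self) -> Option<u64> {
        self.storage.current().trace_id
    }
}

fn main() {
    let rt = Runtime::new();
    rt.storage.set_current(Context { trace_id: Some(7) });
    println!("default storage: {:?}", rt.current_trace_id());

    let bridged = Runtime::with_storage(Box::new(BridgeStorage {
        slot: Mutex::new(Context { trace_id: Some(99) }),
    }));
    println!("bridge storage: {:?}", bridged.current_trace_id());
}
```

The design point is that the lookup path (`current_trace_id`) stays the same whichever storage is plugged in, so log appenders and trace bridges could share one notion of "current" without depending on each other.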
Background
The Rust ecosystem has two prominent tracing APIs: the OpenTelemetry Tracing API (OTel for short), delivered through the opentelemetry crate, and the Tokio tracing API, provided by the tracing crate. The OTel Tracing API adheres to the OpenTelemetry specification, ensuring alignment with OpenTelemetry Tracing implementations in other languages such as C++ and Java. Conversely, the Tokio tracing ecosystem, which predates OpenTelemetry, has widespread adoption, with many popular libraries already instrumented. The tracing-opentelemetry crate, maintained outside of the OpenTelemetry repositories, acts as a "bridge", enabling applications instrumented with tracing to work with OpenTelemetry.
The issue
The coexistence of the OTel Tracing API and Tokio-Tracing poses a dilemma, forcing end users to choose between two competing APIs. The choice is made harder by the absence of comprehensive documentation comparing the two options. A significant concern is the lack of tested interoperability between the APIs, which can result in issues, especially in applications where different layers use different tracing APIs, potentially leading to incomplete traces. This also impacts log correlation scenarios.
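One way to picture how mixing APIs across layers could yield incomplete traces is the toy model below. It is an assumption-laden sketch and not a description of how the opentelemetry or tracing crates are implemented: it simply supposes that each API tracks its own "current span", so a span started through one API cannot see a parent started through the other. `Api`, `Span`, `Tracker` and `nested_calls` are hypothetical names.

```rust
/// Which instrumentation API created a span (toy model, not the real crates).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Api {
    Tracing,
    Otel,
}

#[derive(Debug, PartialEq)]
struct Span {
    name: &'static str,
    parent: Option<&'static str>,
}

/// Tracks the "current span" either per API or in one shared stack.
struct Tracker {
    shared: bool,
    tracing_stack: Vec<&'static str>,
    otel_stack: Vec<&'static str>,
}

impl Tracker {
    fn new(shared: bool) -> Self {
        Tracker { shared, tracing_stack: Vec::new(), otel_stack: Vec::new() }
    }

    /// Picks the stack that the given API consults for its parent.
    fn stack(&mut self, api: Api) -> &mut Vec<&'static str> {
        if self.shared || api == Api::Tracing {
            &mut self.tracing_stack
        } else {
            &mut self.otel_stack
        }
    }

    /// Starts a span whose parent is whatever this API sees as current.
    fn start(&mut self, api: Api, name: &'static str) -> Span {
        let stack = self.stack(api);
        let parent = stack.last().copied();
        stack.push(name);
        Span { name, parent }
    }
}

/// A calls B calls C, where B is instrumented with a different API.
fn nested_calls(shared: bool) -> Vec<Span> {
    let mut t = Tracker::new(shared);
    vec![
        t.start(Api::Tracing, "A"),
        t.start(Api::Otel, "B"),
        t.start(Api::Tracing, "C"),
    ]
}

fn main() {
    println!("separate state: {:?}", nested_calls(false));
    println!("shared state:   {:?}", nested_calls(true));
}
```

With separate state, B has no parent and C attaches to A, so the A → B → C chain is lost; with shared state the chain is intact. Interoperability testing would need to establish which of these the real crates exhibit in each combination.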
A Comparison with OTel .NET
The OpenTelemetry .NET community encountered a similar challenge when the OTel Tracing API was introduced, as the .NET runtime library (shipped as the
DiagnosticSource package) already had a similar API in place. This issue was resolved through collaboration between OTel .NET maintainers and the .NET
runtime team, leading to the alignment of the .NET runtime's tracing API with the OTel specifications. This approach was later applied to the Metrics API as well. While the decision by OTel .NET to prioritize the .NET Runtime library's
API over its own for tracing/metrics has generally been successful, it has not been without its challenges. Despite declaring stability years ago, OTel .NET has yet to implement certain aspects of the OTel specification fully.
Although the outcomes in the .NET ecosystem might not directly forecast the success of similar efforts in Rust, they provide a valuable reference point.
Options for Consideration
1. Deprecate Tokio-Tracing: This approach would align Rust with the OpenTelemetry strategies adopted by other languages. However, considering the popularity and active maintenance of the tracing crate in the Rust ecosystem, this path has the highest friction and is highly improbable.
2. Deprecate OTel Tracing: Promoting Tokio-Tracing as the standard could be a feasible option, albeit requiring comprehensive evaluation. This strategy would cause OTel Rust to deviate from its counterparts in other languages. Potential alignment of Tokio-Tracing with OTel Tracing specifications could mitigate this concern but necessitates groundwork to identify gaps and propose solutions. Tokio-Tracing maintainers have shown willingness to accommodate reasonable changes, pending a clear set of requirements. This option does not eliminate the OTel Tracing API completely; it would remain to compensate for things missing from Tokio-Tracing. Only those APIs that overlap or compete with Tokio-Tracing need to be deprecated/removed.
3. Maintain Both APIs: This alternative emphasizes the importance of ensuring seamless interoperability between the two APIs, allowing users to choose based on preference or specific needs without compromising trace completeness. Achieving this goal requires significant effort to identify and bridge any existing gaps in the interoperability story. Users should be able to choose freely between them, without worrying about broken traces.
4. Do nothing: OTel Rust has made some special accommodations to help the tracing crate (and vice versa). We can just remove them and let each crate follow its own destiny. (Highly undesirable state, listed only for completeness.)
Are there more options? Please let us know in the comments!
Current State
The Rust tracing ecosystem is at a critical juncture. Active discussions between the OTel Rust team and the Tracing Rust team are taking place, with updates and deliberations shared on Cloud Native Slack. Interested individuals are encouraged to join the discussion on Slack (or right in this GitHub issue). All decisions and considerations will be posted on GitHub as well for wider visibility and to gather feedback.
Timeline
Resolving this issue is a prerequisite (though not the only one) for declaring the Tracing signal as GA (General Availability) for OTel Rust. Given the goal to achieve Tracing GA (alongside other milestones) soon, it's crucial that this issue is resolved promptly. A tentative deadline to reach a decision on the chosen path forward is set for April 30th, 2024, approximately 2 months from today.
Related issues
#1378 Tracing Propagation.
#1394 (comment)
Broken Trace example : #1690