-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tracing Framework] RFC - Sampling Strategy #8918
Comments
@reta @shwetathareja @Bukhtawar @suranjay your thoughts on this? |
@reta @backslasht @shwetathareja @Bukhtawar @suranjay reminder for your thoughts on this. |
@Gaganjuneja thanks a lot for the proposal (and my apologies for the delay) I think it makes sense to start with Head Based OpenSearch first since this is the most straightforward way to get traces out, OpenTelemetry has a good support for probabilistic sampling.
I would strongly -1 the per-action configuration for the following reasons (at least at this stage):
OTel has Tail Sampling Processor [1] that consolidates traces on the processor side and make the decision based on policies, have you seen it? It also covers traces depth mentioned in the On a general note, I would strongly advocate to make a decision for sampling strategies to output either consistent complete trace or no trace at all (no matter if it is head or tail sampling). Introducing any randomness or/and inconsistencies in the middle of the trace (like probabilistic spans dropping, level that we discussed previously) would render the feature unusable and confusing since each request in general would produce unique trace hierarchy. |
@reta Thanks for your reply.
+1 on this and totally aligned.
Yes, so we would be applying this strategy for sampling on the root span only and later on all the spans would respect the parent's decision using parent-based sampling. It works across nodes as well with tracestate flags.
I think most of the other actions should follow the default strategy except client facing actions like bulk and search. There should be a minimum number of actions which requires separate configuration.
Action based sampling would be applicable only for the root span and action could be anything like TransportAction or RestAction. We can change the name to operation if this is misleading.
Yes, but as the requirement is to have all the spans related to a trace on the same collector for the effective sampling decisions, I have proposed the above approaches to consolidate the spans to a single collector. Thereafter |
Thanks @Gaganjuneja
This is impossible to decide on the root level, as I mention the action name is materialized very late in the flow (we have a whole network layer before it that will have to be instrumented). Let us make it work first with the Otel sampling, that we could take any other decisions.
We are not building the distributing tracing infrastructure (I hope), our goal is instrumentation (and we just in the beginning of it). If we have the solution (like https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor) that solves the problem (that we don't need to solve), why do we need to invent our own? |
Thanks @Gaganjuneja for the proposal on the Sampling. |
Thanks @shwetathareja , this very easy to solve with tracing on demand, where for every search or indexing request in question user could add |
Thanks @reta & @shwetathareja providing your valuable comments. I concur that on-demand sampling would be effective for debugging requests. This approach is reactive, but certain use cases necessitate proactive data collection. For instance, establishing monitoring through traces and metrics, identifying performance bottlenecks, root cause analysis, and generating a cluster insights dashboard that highlights top resource-consuming queries. For such scenarios, we must sample 100% of traces using head-based sampling and subsequently make intelligent decisions through tail-based sampling. Regarding configuration maintenance, I agree that the proposed action-based schema is not sustainable. Instead, we should define appropriate default limits and allow for potential overrides using AffixSettings, similar to what we do for loggers. After reviewing the code, I also realized that the action materializes quite late. Can we do sampling based on the URI once the request has landed in the RequestHandler (both HTTP and Netty4)? In terms of a generic implementation, we can consider sampling based on the value of the "operation" or some attribute if the Span is a root span and that attribute is defined in the settings. Otherwise, the default limit should apply, while also taking into account the "trace=true" setting. |
It is very effective when you pinpointed the limited set of requests that are anomalies (usually coming out from adaptive/tail sampling based on criteria).
This is still too late, the trace would have been started already. But again, here adaptive sampling (tail based sampling) would help 100% - the action name / URI / etc could be used to make a decision to trace or not.
What kind of limits you are referring to? |
@reta & @shwetathareja, I am thinking of covering the following scenarios in the head sampling.
In all the above three scenarios these limits and overrides will be taken into account while creating the root span and following child spans will respect the parent sampling decision. Your thoughts? |
thanks @Gaganjuneja , I would suggest to focus on 1 & 2 now and keep others (3 & 4) as possible future improvements. The issue is that we have absolutely zero instrumentation now, once we have it, we could think about the gaps (in sampling, etc) to make it better. |
Agreed. I had listed these items in priority order. Instrumentation PR coming your way next week 😊 |
…ce header & probability (opensearch-project#8918)
…ce header & probability (opensearch-project#8918) Signed-off-by: Dev Agarwal <devagarwal1803@gmail.com>
…ce header & probability (opensearch-project#8918) Signed-off-by: Dev Agarwal <devagarwal1803@gmail.com>
What is Sampling?
Sampling is a statistical method employed to choose a subgroup that can effectively represent the entire population and be readily extrapolated.
Why we need Sampling?
As we consider the instrumentation of generic constructs such as rest controllers, transport actions, and task managers, a notable issue arises. Our system encompasses more than 200+ transport actions and a variety of rest actions, leading to the potential generation of numerous spans. However, each span incurs a non-negligible cost. Hence, it is crucial to develop a Sampling strategy that enables us to select a subset of spans from the total pool, effectively representing the entire population or system.
Different Sampling Options
There are two distinct sampling options available in OpenTelemetry.
Head Sampling: Head sampling takes place just before the creation of a span. Since the decision needs to be made as early as possible, it relies on arbitrary factors like randomly selecting a percentage of spans. While this is a straightforward technique to implement, it lacks request/trace-specific data, which may limit its ability to make intelligent decisions.
Tail Sampling: In contrast to head sampling, tail sampling makes its sampling decision at the end of the entire trace when data from all spans is available. This type of sampling occurs at the collector level. Here, sampling can be based on various criteria such as latency, attribute values, status (e.g., error, success), etc. One major advantage is static stability when we are especially measuring the resource consumption. One challenge with this strategy is that it necessitates having all spans sent to a single collector, which could be problematic for distributed systems like OpenSearch where sending all spans to a specific data node might be challenging. In the following sections, we will explore different strategies related to these sampling methods.
OpenSearch Sampling Requirements
To determine the most suitable sampling strategy, it is essential to consider the specific requirements and goals for implementing tracing in the system. Here are some common scenarios:
By carefully considering the specific use cases and objectives of tracing, a suitable sampling strategy can be chosen to strike the right balance between capturing enough data for analysis while minimizing the impact on performance and storage resources.
OpenSearch sampling strategy
For a distributed system like OpenSearch, an effective sampling strategy often combines both head and tail sampling techniques to achieve the desired tracing goals. Let's delve into the details of the proposed sampling strategy for OpenSearch:
Head Based OpenSearch sampling
Tail Sampling
As most requests in OpenSearch return 200 OK responses and stay within Service Level Agreements (SLAs), not all of these requests need to be traced. Tail sampling becomes valuable in this context.
Options for Tail Sampling in OpenSearch:
Send Spans as Part of Response: Spans are sent as part of the response back to the coordinator node, which can then export these spans to the local collector. The overall impact on performance should be assessed through performance runs.
Pros:
Cons:
Export Span to Coordinator Node Always: A custom span exporter can be created to directly export spans from the OpenSearch core to the collector running on the coordinator node. This might require opening the GRPC port for node intercommunication.
Pros:
Cons:
Export Span to Local Collector: Spans are initially exported to a local collector before being distributed to a single collector that handles all spans belonging to a particular trace. This could involve multiple levels of collectors.
Pros:
Cons:
In conclusion, tail sampling in OpenSearch requires careful consideration as there is no clear winner among the options discussed. Option 2 and Option 3 are already configurable in otel. Option 1 may introduce resource overhead and require feasibility test. Looking forward to feedback from the community on these tail sampling options.
Other limiting factors
Indeed, there are several other limiting factors and considerations when implementing tracing in a distributed system like OpenSearch:
By thoughtfully considering these limiting factors and incorporating the appropriate sampling techniques, level definitions, and span limits, the tracing implementation in OpenSearch can strike the right balance between capturing sufficient data for analysis and maintaining a performant and efficient distributed system. I am doing POC with couple of approaches and will update the results here.
The text was updated successfully, but these errors were encountered: