
[DOC] Distributed Tracing #3951

Closed
Gaganjuneja opened this issue May 3, 2023 · 14 comments · Fixed by #4964

@Gaganjuneja
Contributor

What do you want to do?

  • Request a change to existing documentation
  • [x] Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request and all versions that are affected.

  1. Describe the Feature.
  2. Configuration changes.
  3. Collector usage with different data source options.

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.
TBD

@Naarcha-AWS Naarcha-AWS added 1 - Backlog Issue: The issue is unassigned or assigned but not started and removed untriaged labels May 5, 2023
@Naarcha-AWS Naarcha-AWS self-assigned this May 5, 2023
@Naarcha-AWS
Collaborator

@Gaganjuneja: Any idea where in the roadmap we plan to implement this? Which OpenSearch version?

@rohin

rohin commented May 15, 2023

@Naarcha-AWS - we are targeting 2.8 release

@hdhalter hdhalter added this to the v2.8 milestone May 15, 2023
@hdhalter hdhalter added v2.8.0 and removed v-TBD labels May 15, 2023
@hdhalter
Contributor

Hi @rohin - I don't see this item on the roadmap or part of the unified project. Can you confirm this is in 2.8 and is there an issue for it (besides the RFC)? Thanks.

@cwillum
Contributor

cwillum commented Jun 1, 2023

@rohin Hi, I'm trying to learn whether this issue is still being considered for 2.8. The four issues associated with plans for a first-phase release in 2.8 are currently still open. (#7543, #7544, #7545, and #7546). The only PR I can find associated with any of these issues is #7648. But this appears to be open as well. Do you have any more information about status of this enhancement? Thanks.
cc: @Gaganjuneja

@rohin

rohin commented Jun 2, 2023 via email

@cwillum
Contributor

cwillum commented Jun 2, 2023

@rohin A big thanks for the response.

@cwillum cwillum added v2.9.0 and removed v2.8.0 labels Jun 2, 2023
@hdhalter hdhalter modified the milestones: v2.8, v2.9 Jun 2, 2023
@Naarcha-AWS Naarcha-AWS added 2 - In progress Issue/PR: The issue or PR is in progress. Sev2 High-medium priority. Upcoming release or incorrect information. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Jun 30, 2023
@hdhalter hdhalter modified the milestones: v2.9, v2.10 Jul 11, 2023
@hdhalter
Contributor

Moving to 2.10.

@reta
Contributor

reta commented Aug 29, 2023

Mentioning the new setting for 2.10: "telemetry.tracer.sampler.probability" (please see opensearch-project/OpenSearch#9522)

@Naarcha-AWS Naarcha-AWS added 1 - Backlog Issue: The issue is unassigned or assigned but not started and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Aug 31, 2023
@vagimeli vagimeli assigned vagimeli and unassigned Naarcha-AWS Aug 31, 2023
@vagimeli vagimeli added 2 - In progress Issue/PR: The issue or PR is in progress. and removed 1 - Backlog Issue: The issue is unassigned or assigned but not started labels Aug 31, 2023
@Gaganjuneja
Contributor Author

@Naarcha-AWS @hdhalter Providing the details below. Please let me know if further details are needed.

Distributed Tracing for OpenSearch Requests and Tasks

Distributed tracing for OpenSearch requests and tasks provides the ability to trace a request end to end, including detailed breakdowns at the task level. This yields fine-grained information about latencies and resource utilization and makes it possible to troubleshoot slow queries and indexing requests by pinpointing hotspots and performance bottlenecks, which in turn helps improve the efficiency of queries and indexing. Note that this feature is currently under development; upcoming releases will add tracing and spans at additional levels of granularity and along additional code paths.

RFC - opensearch-project/OpenSearch#6750

Enable the Distributed Tracing Feature Flag

The distributed tracing feature is experimental in this release. To use it, you must first enable the feature flag and then activate the tracer using the dynamic setting telemetry.tracer.enabled. Exercise caution when enabling this feature because it consumes additional system resources. Detailed information about enabling and configuring the feature, including on-demand debugging and request sampling, can be found in the sections below.

Enable on a node using a tarball install

The flag is toggled using a new JVM parameter that is set either in OPENSEARCH_JAVA_OPTS or in config/jvm.options.

OPTION 1: MODIFY JVM.OPTIONS

Add the following line to config/jvm.options before starting the OpenSearch process to enable the feature:

-Dopensearch.experimental.feature.telemetry.enabled=true

Run OpenSearch

./bin/opensearch

OPTION 2: ENABLE FROM AN ENVIRONMENT VARIABLE

As an alternative to directly modifying config/jvm.options, you can define the property using an environment variable. This can be done in a single command when you start OpenSearch or by defining the variable with export.

To add the flag inline when starting OpenSearch:

OPENSEARCH_JAVA_OPTS="-Dopensearch.experimental.feature.telemetry.enabled=true" ./opensearch-2.9.0/bin/opensearch

To define the environment variable separately before running OpenSearch:

export OPENSEARCH_JAVA_OPTS="-Dopensearch.experimental.feature.telemetry.enabled=true"
./bin/opensearch

Enable with Docker containers

If you’re running OpenSearch with Docker, add the following line to the environment section of the opensearch-node service in docker-compose.yml:

OPENSEARCH_JAVA_OPTS="-Dopensearch.experimental.feature.telemetry.enabled=true"
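
For reference, a minimal sketch of how this could look in docker-compose.yml; the service name, image tag, and the other environment entries are illustrative and should match your existing file:

services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    environment:
      - cluster.name=opensearch-cluster
      - OPENSEARCH_JAVA_OPTS=-Dopensearch.experimental.feature.telemetry.enabled=true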

Enable for OpenSearch development

To enable distributed tracing in an OpenSearch development build, add the correct property to run.gradle before building OpenSearch. See the developer guide for information about how to use Gradle to build OpenSearch.
Add the following property to run.gradle to enable the feature:

testClusters {
  runTask {
    testDistribution = 'archive'
    if (numZones > 1) numberOfZones = numZones
    if (numNodes > 1) numberOfNodes = numNodes
    systemProperty 'opensearch.experimental.feature.telemetry.enabled', 'true'
  }
}
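
With this property in place, you can start a local development cluster with Gradle. The command below is assumed from the standard OpenSearch developer workflow described in the developer guide:

./gradlew run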

Enable distributed tracing

Once you've enabled the feature flag, you can enable the tracer using the following dynamic setting. This setting can be adjusted dynamically to enable or disable tracing in the running cluster:

telemetry.tracer.enabled=true
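
Because telemetry.tracer.enabled is dynamic, it can be applied to a running cluster through the cluster settings API. A minimal sketch, assuming a local cluster listening on port 9200 with no authentication:

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "telemetry.tracer.enabled": true
  }
}'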

Install the OpenSearch OpenTelemetry plugin

OpenSearch's distributed tracing framework supports various telemetry solutions through plugins. Currently, an OpenTelemetry plugin for OpenSearch, named "telemetry-otel," is available and must be installed to enable tracing. Follow the plugin installation guide for instructions.
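
Assuming telemetry-otel installs like other core OpenSearch plugins (this command is an assumption based on the standard plugin installation workflow; consult the plugin installation guide for the authoritative steps):

./bin/opensearch-plugin install telemetry-otel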

Exporters

The distributed tracing feature generates traces and spans for requests and other cluster operations. These spans are initially held in memory by the OpenTelemetry BatchSpanProcessor and are then sent to an exporter based on the configured settings. There are two important components:

  1. Span processor - As spans conclude on the request path, OpenTelemetry hands them to the SpanProcessor for processing and export. OpenSearch's distributed tracing framework uses the BatchSpanProcessor, which batches spans for a configurable interval and then sends them to the exporter. The following settings are available for the BatchSpanProcessor (see the example configuration after this list):
    1. telemetry.otel.tracer.exporter.max_queue_size - Defines the maximum queue size; when the queue reaches this value, it is written to the exporter. The default value is 2048.
    2. telemetry.otel.tracer.exporter.delay - Defines the flush interval; if the queue does not fill to max_queue_size within this interval, the buffered spans are flushed anyway. The default delay is 2 seconds.
    3. telemetry.otel.tracer.exporter.batch_size - Configures the maximum batch size for each export in order to reduce I/O. This value should always be less than max_queue_size; the default is 512.
  2. Exporters - Exporters are responsible for persisting the data. OpenTelemetry provides several exporters out of the box, but OpenSearch currently supports the following:
    1. LoggingSpanExporter - Exports spans to a log file, generating a separate file named "_otel_traces.log" in the logs directory. This is the default configuration.
      telemetry.otel.tracer.span.exporter.class=io.opentelemetry.exporter.logging.LoggingSpanExporter
    2. OtlpGrpcSpanExporter - Exports spans over gRPC. To use this exporter, you need to install the otel-collector on the node and specify the endpoint using the setting telemetry.otel.tracer.exporter.endpoint. By default, it writes to the http://localhost:4317/ endpoint, but you can also configure it for HTTPS by following the OpenTelemetry documentation. The settings are as follows:
      telemetry.otel.tracer.span.exporter.class=org.opensearch.telemetry.tracing.exporter.OtlpGrpcSpanExporterProvider
      telemetry.otel.tracer.exporter.endpoint: https://localhost:4317
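
To illustrate how these settings fit together, here is a sketch of an opensearch.yml snippet that tunes the BatchSpanProcessor and selects the default logging exporter. The values and the duration format for the delay are illustrative assumptions, not recommendations:

telemetry.otel.tracer.exporter.max_queue_size: 4096
telemetry.otel.tracer.exporter.delay: 5s
telemetry.otel.tracer.exporter.batch_size: 1024
telemetry.otel.tracer.span.exporter.class: io.opentelemetry.exporter.logging.LoggingSpanExporter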
    
    

Sampling

Distributed tracing can generate a large number of spans, which consumes system resources unnecessarily. To reduce the number of traces, you can use sampling, which by default samples only 1% of all requests. Sampling can be of two types:

  1. Head sampling - Sampling decisions are made before initiating the root span of a request. OpenSearch supports two head sampling methods:
    1. Probabilistic - A blanket limit on incoming requests, dynamically adjustable with the telemetry.tracer.sampler.probability setting. This setting ranges between 0 and 1, with a default value of 0.01 (that is, 1% of incoming requests are sampled).
    2. On-demand - For debugging specific requests, users can send the trace=true attribute as part of the request header, causing those requests to be sampled regardless of the probabilistic sampling setting (see the example after this list).
  2. Tail-based sampling - Tail-based sampling can be configured by following the OpenTelemetry documentation, depending on the type of collector you choose.
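
To illustrate both head sampling controls, here is a minimal sketch assuming a local cluster on port 9200 without authentication; the exact header name used for on-demand tracing is an assumption based on the description above:

# Dynamically raise the probabilistic sampling rate to 5% of requests
curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "telemetry.tracer.sampler.probability": "0.05"
  }
}'

# Force sampling of a single request for debugging (header name assumed)
curl -X GET "http://localhost:9200/_search" -H 'trace: true'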

Ongoing work and more details can be found in the GitHub RFC at opensearch-project/OpenSearch#8918.

Collection of Spans

The SpanProcessor writes spans to the exporter, and the choice of exporter defines the endpoint, which can be logs or gRPC. To collect spans over gRPC, you need to run the collector as a sidecar process on each OpenSearch node. From the collector, these spans can be written to the sink of your choice, such as Jaeger, Prometheus, Grafana, or a file store, for further analysis.
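
As an illustration, a minimal OpenTelemetry Collector configuration for such a sidecar could receive spans over gRPC on the default endpoint and log them; the logging exporter here is a stand-in that you can swap for Jaeger, Prometheus, or another backend:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]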

@Gaganjuneja
Contributor Author

We have 2 PRs open and will update you once they're merged.

  1. Add Tracing Instrumentation at Network and Rest layer OpenSearch#9415
  2. [Telemetry-Otel] Added support for OtlpGrpcSpanExporter exporter OpenSearch#9666

@Gaganjuneja
Contributor Author

@reta, Please take a look and suggest if anything needs to be added/deleted/updated. Thanks!

@vagimeli
Contributor

vagimeli commented Sep 5, 2023

@Gaganjuneja @reta I'm the technical writer working on the documentation for 2.10 release. Thank you for providing the draft content. I'll tag you in the doc PR for your technical reviews.

@vagimeli vagimeli added 3 - Tech review PR: Tech review in progress and removed 2 - In progress Issue/PR: The issue or PR is in progress. labels Sep 6, 2023
@vagimeli vagimeli added v2.11.0 and removed v2.10.0 labels Sep 12, 2023
@vagimeli vagimeli modified the milestones: v2.10, v2.11 Sep 12, 2023
@hdhalter
Contributor

Hi @Gaganjuneja - Are we targeting 2.11 for this feature?

@Gaganjuneja
Contributor Author

Hello @hdhalter, yes.
