Exporter stops sending any spans to collector after timed out export #4406

Closed

aelmekeev opened this issue Jan 8, 2024 · 5 comments

Labels: bug · needs:author-response · priority:p2

aelmekeev commented Jan 8, 2024

What happened?

Steps to Reproduce

We have noticed that sometimes, after the collector fails to respond in time, the application silently stops sending any traces to the collector unless it is restarted.

I think this might be related to the issue introduced in #3958 that @Zirak has tried to address as part of #4287. Raising this to get some visibility from the maintainers.

Expected Result

The library should not stop sending spans to the collector.

Actual Result

The library stops sending spans to the collector.

Additional details

- name: OTEL_BSP_MAX_QUEUE_SIZE
  value: '2048'
- name: OTEL_BSP_MAX_EXPORT_BATCH_SIZE
  value: '512'

With our queue and batch size settings (above), this happens relatively often (4 times in the last 5 days) during peak traffic.

Although I must admit I don't understand why adding a finally can fix this, since in my understanding either then or catch is always triggered for a Promise. Bumping OTEL_BSP_MAX_QUEUE_SIZE did, however, help with this issue.
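
For context, here is a minimal TypeScript sketch (not the SDK's actual code, just a generic Promise pattern I'm assuming for illustration) of one case where cleanup placed in then/catch handlers can still be skipped while a finally always runs: with the two-argument form of then(onFulfilled, onRejected), an exception thrown inside onFulfilled is not handled by the onRejected of that same call.

// Illustrative only; handleExportResult and pendingExports are made-up names,
// standing in for "some work done after an export" and "a counter that gates
// further exports".
let pendingExports = 0

const handleExportResult = (): void => {
  // stand-in for whatever processes the export result; assume it can throw,
  // e.g. when the export timed out
  throw new Error('simulated failure while handling the export result')
}

function exportOnce(send: () => Promise<void>): Promise<void> {
  pendingExports++
  return send().then(
    () => {
      handleExportResult() // throws...
      pendingExports--     // ...so this cleanup is skipped and the counter stays stuck
    },
    () => {
      pendingExports--     // not called either: this handler only sees rejections of send() itself
    },
  )
}

function exportOnceWithFinally(send: () => Promise<void>): Promise<void> {
  pendingExports++
  return send()
    .then(() => handleExportResult())
    .catch(() => { /* swallow to avoid an unhandled rejection */ })
    .finally(() => {
      pendingExports-- // always runs, whatever happened above
    })
}

// The first variant leaves pendingExports stuck above zero; the second does not.
exportOnce(() => Promise.resolve()).catch(() => {})
exportOnceWithFinally(() => Promise.resolve())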

OpenTelemetry Setup Code

import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express'
import { GraphQLInstrumentation } from '@opentelemetry/instrumentation-graphql'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { Resource } from '@opentelemetry/resources'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { AlwaysOnSampler } from '@opentelemetry/core'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'

const serviceName = process.env['OTEL_SERVICE_NAME'] || 'unknown'
const serviceInstance = process.env['HOSTNAME'] || 'unknown'
const serviceVersion = process.env['OTEL_SERVICE_VERSION'] || process.env.npm_package_version || 'unknown'
const oltpTracesEndpoint = process.env['OTEL_EXPORTER_OTLP_TRACES_ENDPOINT']

const serviceResource = Resource.default().merge(
    new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
        [SemanticResourceAttributes.SERVICE_INSTANCE_ID]: serviceInstance,
        [SemanticResourceAttributes.SERVICE_VERSION]: serviceVersion,
    }),
)

if (oltpTracesEndpoint) {
    const tracerProvider = new NodeTracerProvider({
        sampler: new AlwaysOnSampler(),
        resource: serviceResource,
    })

    registerInstrumentations({
        tracerProvider,
        instrumentations: [
            new HttpInstrumentation(),
            new ExpressInstrumentation(),
            new GraphQLInstrumentation(),
        ],
    })

    const oltpTraceExporter = new OTLPTraceExporter({
        url: oltpTracesEndpoint,
    })

    tracerProvider.addSpanProcessor(
        new BatchSpanProcessor(oltpTraceExporter, {
            maxQueueSize: Number(process.env['OTEL_BSP_MAX_QUEUE_SIZE']) || 2000,
            maxExportBatchSize: Number(process.env['OTEL_BSP_MAX_EXPORT_BATCH_SIZE']) || 1000,
            scheduledDelayMillis: Number(process.env['OTEL_BSP_SCHEDULE_DELAY']) || 500,
            exportTimeoutMillis: Number(process.env['OTEL_BSP_EXPORT_TIMEOUT']) || 30000,
        }),
    )
    tracerProvider.register()
    console.log(`📡 Transmitting OpenTelemetry traces to ${oltpTracesEndpoint}`)
}

package.json

"dependencies": {
        "@opentelemetry/api": "^1.7.0",
        "@opentelemetry/core": "^1.18.1",
        "@opentelemetry/exporter-trace-otlp-http": "^0.45.1",
        "@opentelemetry/instrumentation": "^0.45.1",
        "@opentelemetry/instrumentation-express": "^0.33.3",
        "@opentelemetry/instrumentation-graphql": "^0.36.0",
        "@opentelemetry/instrumentation-http": "^0.45.1",
        "@opentelemetry/resources": "^1.18.1",
        "@opentelemetry/sdk-trace-base": "^1.18.1",
        "@opentelemetry/sdk-trace-node": "^1.18.1",
        "@opentelemetry/semantic-conventions": "^1.18.1"
    },
    "devDependencies": {
        "@microsoft/api-extractor": "^7.32.0",
        "npm-run-all": "^4.1.5",
        "typescript": "^4.8.4"
    },

Relevant log output

N/A
@aelmekeev aelmekeev added bug Something isn't working triage labels Jan 8, 2024
@pichlermarc pichlermarc added priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect and removed triage labels Jan 10, 2024

dgoscn commented Feb 2, 2024

Hi @aelmekeev. Do you have any news about this case? Thanks

aelmekeev (Author) commented

@dgoscn I'll try to give it a test next week, thank you!

pichlermarc (Member) commented

@aelmekeev I think this has been fixed by #4287, could you re-try with the latest version? 🙂

@pichlermarc pichlermarc added the needs:author-response waiting for author to respond label Feb 21, 2024
@pichlermarc pichlermarc self-assigned this Feb 22, 2024
aelmekeev (Author) commented

@pichlermarc just tested this and I believe the issue is resolved, hence I'm closing it. For context, this is what I've done:

  1. diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.ALL) (see the sketch after this list)
  2. Set OTEL_BSP_MAX_EXPORT_BATCH_SIZE and OTEL_BSP_MAX_QUEUE_SIZE to 10 locally
  3. Start the app and poke it so it sends traces to the collector
  4. Break the connection between the app and the collector (stop port forwarding in my case)
  5. Poke the app again and observe that no traces are received by the collector
  6. Fix the connection between the app and the collector
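
For reference, step 1 corresponds roughly to the snippet below; diag, DiagConsoleLogger and DiagLogLevel all come from @opentelemetry/api. I placed it before registering the tracer provider, though that placement is my own choice rather than a requirement.

// Step 1 above: verbose SDK self-diagnostics so exporter errors show up in the logs.
// diag, DiagConsoleLogger and DiagLogLevel are exported by @opentelemetry/api.
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api'

// DiagLogLevel.ALL logs everything, which is what surfaces the
// ECONNREFUSED errors quoted further down.
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.ALL)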

With the original versions:

        "@opentelemetry/api": "^1.7.0",
        "@opentelemetry/core": "^1.18.1",
        "@opentelemetry/exporter-metrics-otlp-http": "^0.45.1",
        "@opentelemetry/exporter-trace-otlp-http": "^0.45.1",
        "@opentelemetry/host-metrics": "^0.34.0",
        "@opentelemetry/instrumentation": "^0.45.1",
        "@opentelemetry/instrumentation-express": "^0.33.3",
        "@opentelemetry/instrumentation-graphql": "^0.36.0",
        "@opentelemetry/instrumentation-http": "^0.45.1",
        "@opentelemetry/resources": "^1.18.1",
        "@opentelemetry/sdk-metrics": "^1.18.1",
        "@opentelemetry/sdk-trace-base": "^1.18.1",
        "@opentelemetry/sdk-trace-node": "^1.18.1",
        "@opentelemetry/semantic-conventions": "^1.18.1"

  1. No errors related to traces are logged.
  2. The app does not connect back to the collector.

With the latest versions:

        "@opentelemetry/api": "^1.7.0",
        "@opentelemetry/core": "^1.21.0",
        "@opentelemetry/exporter-metrics-otlp-http": "^0.48.0",
        "@opentelemetry/exporter-trace-otlp-http": "^0.48.0",
        "@opentelemetry/host-metrics": "^0.35.0",
        "@opentelemetry/instrumentation": "^0.48.0",
        "@opentelemetry/instrumentation-express": "^0.35.0",
        "@opentelemetry/instrumentation-graphql": "^0.37.0",
        "@opentelemetry/instrumentation-http": "^0.48.0",
        "@opentelemetry/resources": "^1.21.0",
        "@opentelemetry/sdk-metrics": "^1.21.0",
        "@opentelemetry/sdk-trace-base": "^1.21.0",
        "@opentelemetry/sdk-trace-node": "^1.21.0",
        "@opentelemetry/semantic-conventions": "^1.21.0"

  1. Errors are observed while the app has no connection to the collector:
[serve] {"stack":"Error: connect ECONNREFUSED 127.0.0.1:8080\n    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1159:16)\n    at TCPConnectWrap.callbackTrampoline (internal/async_hooks.js:130:17)","message":"connect ECONNREFUSED 127.0.0.1:8080","errno":"-61","code":"ECONNREFUSED","syscall":"connect","address":"127.0.0.1","port":"8080","name":"Error"}
  2. The app is able to connect back to the collector!

Zirak (Contributor) commented Feb 24, 2024 via email
