Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otel collector container keeps crashing due to nil pointer dereference when using fluentforwardreceiver #27469

Closed
karthik-kasarapu opened this issue Oct 6, 2023 · 7 comments · Fixed by #29274
Labels

Comments

@karthik-kasarapu
Copy link

karthik-kasarapu commented Oct 6, 2023

Component(s)

receiver/fluentforward

What happened?

Description

The otel collector container running as a side car in ECS fargate to capture logs from app container that uses aws firelens log
driver is crashing with nil pointer de-reference error, as shown below

Steps to Reproduce

  1. Deploy the collector as a side car to ECS fargate and use aws firelens logDriver for main container
  2. Issue seems to be happening during high log volume. Noticed at around 1000 logs/sec

Expected Result

Expected to work without panic error

Actual Result

Container is crashing

Collector version

0.79.0, 0.85.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

extensions:
  # exposes http endpoint on port 13133 that can be used by k8s as a rediness probe
  health_check:
  
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/ballastextension
  memory_ballast:
    size_in_percentage: 30
  
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  fluentforward:
    endpoint: unix:///var/run/fluent.sock

  awsecscontainermetrics:

  filelog/1:
    # Grafana Loki limit
    max_log_size: 256KiB
    include: "${ECS_LOG_FILES}"

processors:
  batch:

  # should be first processor in pipelines; may need to play with settings
  memory_limiter:
    check_interval: 1s
    limit_percentage: 10
    spike_limit_percentage: 5

  resourcedetection:
    # a list of resource detectors to run, valid options are: "env", "system", "gce", "gke", "ec2", "ecs", "elastic_beanstalk", "eks", "azure"
    # env resource detection uses environment variable: OTEL_RESOURCE_ATTRIBUTES=<key1>=<value1>,<key2>=<value2>,...
    detectors: [ env, ecs ]

  # rename some attributes to match semconv
  # ref: https://opentelemetry.io/docs/reference/specification/logs/semantic_conventions/media/
  transform/logs:
    log_statements:
    - context: resource
      statements:
      - delete_key(attributes, "aws.log.group.names")
      - delete_key(attributes, "aws.log.group.arns")
      - delete_key(attributes, "aws.log.stream.names")
      - delete_key(attributes, "aws.log.stream.arns")
    - context: log
      statements:
      - set(attributes["log.iostream"], attributes["source"])
      - set(resource.attributes["container.id"], attributes["container_id"])
      - set(resource.attributes["container.name"], attributes["container_name"])
      - delete_key(attributes, "source")
      - delete_key(attributes, "container_id")
      - delete_key(attributes, "container_name")
      - delete_key(attributes, "fluent.tag")


exporters:
  logging:
    loglevel: warn

  otlp/logs:
    endpoint: "${COLLECTOR_ENV}-collector-logs.com:4317"
    tls:
      insecure: true
  otlp/metrics:
    endpoint: "${COLLECTOR_ENV}-collector-metrics.com:4317"
    tls:
      insecure: true
  otlp/traces:
    endpoint: "${COLLECTOR_ENV}-collector-traces.com:4317"
    tls:
      insecure: true

service:
  # leaving off zpages and pprof as they are for debugging and can be added as necessary
  extensions: [health_check,memory_ballast]

  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/traces]
    
    metrics:
      receivers: [otlp,awsecscontainermetrics]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/metrics]
    
    logs:
      receivers: [otlp,filelog/1]
      processors: [memory_limiter,resourcedetection,batch]
      exporters: [otlp/logs]

    logs/ecs:
      receivers: [fluentforward]
      processors: [memory_limiter,resourcedetection,transform/logs,batch]
      exporters: [otlp/logs]

Log output

For 0.79.0
--
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xabe9d6]
goroutine 132 [running]:
go.opentelemetry.io/collector/pdata/plog.ResourceLogsSlice.At(...)
	go.opentelemetry.io/collector/pdata@v1.0.0-rcv0012/plog/generated_resourcelogsslice.go:53
go.opentelemetry.io/collector/pdata/plog.Logs.LogRecordCount({0xc000241c20?})
	go.opentelemetry.io/collector/pdata@v1.0.0-rcv0012/plog/logs.go:38 +0x16
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).processEvents(0xc0008a5fb0, {0x452c8a0, 0xc00074f950})
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.79.0/collector.go:60 +0xe8
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).Start in goroutine 1
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.79.0/collector.go:39 +0x8c

For 0.85.0
--
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xacdeb6]
goroutine 138 [running]:
go.opentelemetry.io/collector/pdata/plog.ResourceLogsSlice.At(...)
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/plog/generated_resourcelogsslice.go:53
go.opentelemetry.io/collector/pdata/plog.Logs.LogRecordCount({0xc000bcaf00?})
go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/plog/logs.go:38 +0x16
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).processEvents(0xc0006a5bc0, {0x481e6c0, 0xc0001a7e50})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.85.0/collector.go:60 +0xe8
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver.(*Collector).Start in goroutine 1
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/fluentforwardreceiver@v0.85.0/collector.go:39 +0x8c

Additional context

No response

@karthik-kasarapu karthik-kasarapu added bug Something isn't working needs triage New item requiring triage labels Oct 6, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 6, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bryan-aguilar
Copy link
Contributor

Setting this as a P1 for now since panics should be protected against in components

@ms-jcorley
Copy link

Looking at the code here:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/fluentforwardreceiver/collector.go#L59

This seems to be the same root cause as this bug:
#29107

Calling LogRecordCount on a plog that was previously passed to a ConsumeLogs call.
Downstream a processor (such as batch) may be modifying the pLog asynchronously while LogRecordCount is attempting to loop through it.

Fix: call LogRecordCount before ConsumeLogs and cache the value for logging after. It is unsafe to call LogRecordCount after ConsumeLogs.

@bryan-aguilar
Copy link
Contributor

Thank's for following up @ms-jcorley! I added the comment then thought "I should look at the code first before I claim they are linked". I appreciate you doing the due diligence for me with a bit more investigation.

@ms-jcorley
Copy link

Thank's for following up @ms-jcorley! I added the comment then thought "I should look at the code first before I claim they are linked". I appreciate you doing the due diligence for me with a bit more investigation.

No problem! Also note that a quick (unreliable) search shows instances of this same pattern in several other places:
receiver/lokireceiver/loki.go
receiver/mongodbatlasreceiver/events.go
receiver/sqlqueryreceiver/logs_receiver.go
receiver/kafkareceiver/kafka_receiver.go
receiver/k8sobjectsreceiver/receiver.go
receiver/googlecloudpubsubreceiver/receiver.go
receiver/otlpjsonfilereceiver/file.go

@bryan-aguilar
Copy link
Contributor

We should consider opening up a larger meta issue track all of these once we confirm that this pattern will lead to a panic.

@djaglowski
Copy link
Member

I agree that this appears to be due to a race condition. I've opened #29274 with the quick fix of calling LogRecordCount before calling ConsumeLogs. I suggest we assume this fixes it for now and reopen the issue if it still occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants