Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filelogreceiver] CPU consumption increases (roughly) linearly with number of files watched #27404

Open
cwegener opened this issue Oct 3, 2023 · 13 comments
Labels
bug Something isn't working never stale Issues marked with this label will be never staled and automatically removed pkg/stanza priority:p2 Medium receiver/filelog release:required-for-ga Must be resolved before GA release

Comments

@cwegener
Copy link
Contributor

cwegener commented Oct 3, 2023

Component(s)

pkg/stanza, receiver/filelog

What happened?

Description

The CPU consumption of the filelog receiver increases (roughly) linearly as the number of files to be watched increases.

Steps to Reproduce

  1. Download otelcol-contrib (any recent release. tested on Linux AMD64 with different versions from 0.77 to 0.86)
  2. Create 100 small files with random content of about 1000 bytes in size. e.g. i=1; while [ $i -ne 100 ]; do dd if=/dev/urandom bs=1000 count=1 | base64 > /tmp/logs/$i.log; i=$(($i+1)); done
  3. Run OTel Collector with a simple pipeline of filelogreceiver -> loggingexporter and send the 100 files through the pipeline
  4. Observe CPU utilization of the otelcol-contrib process. On my test system (AMD Zen3 laptop CPU, 100 files uses ~5% of a single core to watch 100 files)
  5. Restart the process at step 2 and increase the number of files to be watched.
  6. Observe the CPU utilization increase. On my test system, at 1000 files watched, ~110% of one core is needed to watch the 1000 files

Expected Result

Good question.

I think that ultimately, the expected result is that the default configuration of the filelog receiver protects the administrator from putting unexpected burden on the Host CPU.

Actual Result

With a default configuration, the filelog receiver can cause significant increase in CPU utilization.

Collector version

v0.86.0

Environment information

Environment

OS: Archlinux x86_64,
Compiler(if manually compiled): N/A. Using binaries from github for testing.

OpenTelemetry Collector configuration

receivers:
  filelog/manyfiles:
    include:
      - /tmp/logs/*

exporters:
  logging:
    # verbosity: detailed ## if visual verification of pipeline is required

service:
  pipelines:
    logs/manyfiles:
      receivers:
        - filelog/manyfiles
      processors: []
      exporters:
        - logging

Log output

./otelcol-contrib --config minimal.yaml
2023-10-03T18:51:22.983+1100    info    service@v0.86.0/telemetry.go:84 Setting up own telemetry...
2023-10-03T18:51:22.983+1100    info    service@v0.86.0/telemetry.go:201        Serving Prometheus metrics      {"address": ":8888", "level": "Basic"}
2023-10-03T18:51:22.983+1100    info    exporter@v0.86.0/exporter.go:275        Deprecated component. Will be removed in future releases.       {"kind": "exporter", "data_type": "logs", "name": "logging"}
2023-10-03T18:51:22.985+1100    info    service@v0.86.0/service.go:138  Starting otelcol-contrib...     {"Version": "0.86.0", "NumCPU": 16}
2023-10-03T18:51:22.985+1100    info    extensions/extensions.go:31     Starting extensions...
2023-10-03T18:51:22.985+1100    info    adapter/receiver.go:45  Starting stanza receiver        {"kind": "receiver", "name": "filelog/manyfiles", "data_type": "logs"}
2023-10-03T18:51:22.988+1100    info    service@v0.86.0/service.go:161  Everything is ready. Begin running and processing data.
2023-10-03T18:51:23.195+1100    info    fileconsumer/file.go:196        Started watching file from end. To read preexisting logs, configure the argument 'start_at' to 'beginning'      {"kind": "receiver", "name": "filelog/manyfiles", "data_type": "logs", "component": "fileconsumer", "path": "/tmp/logs/1.log"}
2023-10-03T18:51:23.258+1100    info    fileconsumer/file.go:194        Started watching file   {"kind": "receiver", "name": "filelog/manyfiles", "data_type": "logs", "component": "fileconsumer", "path": "/tmp/logs/560.log"}
2023-10-03T18:51:23.307+1100    info    LogsExporter    {"kind": "exporter", "data_type": "logs", "name": "logging", "resource logs": 1, "log records": 100}
2023-10-03T18:51:23.307+1100    info    LogsExporter    {"kind": "exporter", "data_type": "logs", "name": "logging", "resource logs": 1, "log records": 100}

Additional context

A quick look at pprof shows that most CPU time is mostly spent on GC.

(pprof) top10 
Showing nodes accounting for 6420ms, 58.95% of 10890ms total
Dropped 280 nodes (cum <= 54.45ms)
Showing top 10 nodes out of 173
      flat  flat%   sum%        cum   cum%
    1700ms 15.61% 15.61%     1700ms 15.61%  runtime/internal/syscall.Syscall6
    1220ms 11.20% 26.81%     1220ms 11.20%  runtime.memclrNoHeapPointers
     960ms  8.82% 35.63%     2360ms 21.67%  runtime.scanobject
     950ms  8.72% 44.35%      950ms  8.72%  memeqbody
     360ms  3.31% 47.66%      540ms  4.96%  runtime.findObject
     300ms  2.75% 50.41%     2780ms 25.53%  runtime.gcDrain
     300ms  2.75% 53.17%      490ms  4.50%  runtime.greyobject
     220ms  2.02% 55.19%      220ms  2.02%  runtime.procyield
     210ms  1.93% 57.12%      210ms  1.93%  runtime.futex
     200ms  1.84% 58.95%      200ms  1.84%  encoding/base64.(*Encoding).Encode
@cwegener cwegener added bug Something isn't working needs triage New item requiring triage labels Oct 3, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski removed the needs triage New item requiring triage label Oct 3, 2023
@djaglowski
Copy link
Member

Performance improvements are always welcome.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

Pinging code owners for receiver/filelog: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

Pinging code owners for pkg/stanza: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@cwegener
Copy link
Contributor Author

cwegener commented Oct 3, 2023

Performance improvements are always welcome.

I'll see how far I can get with just some flamegraph analysis.

@djaglowski
Copy link
Member

Thanks @cwegener. Feel free to share any insights you discover here.

@cwegener
Copy link
Contributor Author

cwegener commented Oct 4, 2023

Thanks @cwegener. Feel free to share any insights you discover here.

My first instinct is that allocations are escaping to heap. I'll have a look which allocations that might be and if the code can be changed so that the compiler will use stack-based allocation instead.

@djaglowski
Copy link
Member

Looking at this again after some related discussions, I think I can explain what is going here.

Basically, we are remembering N file (roughly 3x what we find in a poll interval). Then, for each of the M files we find in a poll interval, we open it, read a fingerprint from it, and directly compare it against the N files we remember (unless we find a match, in which case we stop short).

There is clearly opportunity for improvement here and I think recent refactoring has put us closer to a trie-based solution which would give us much more efficient cross-referencing of files. Let's keep this open until we can make a substantial gain on this front.

Copy link
Contributor

github-actions bot commented Jan 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 1, 2024
@djaglowski djaglowski removed the Stale label Jan 8, 2024
@djaglowski djaglowski added the priority:p2 Medium label Jan 23, 2024
@djaglowski djaglowski added the release:required-for-ga Must be resolved before GA release label Feb 12, 2024
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 15, 2024
@djaglowski djaglowski added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Apr 15, 2024
@afoninsky
Copy link

Hello. Just a small comment/feedback.
In our environments this bug is a blocker, so we had to deprecate failog receiver and switch to another (non-opentelemetry related) solution for a specific case.

@djaglowski
Copy link
Member

Thank you for the additional feedback @afoninsky. We will take a serious look at this before moving the component to GA.

@r0mdau
Copy link

r0mdau commented Nov 3, 2024

The most impactful mitigation I've found is to lengthen the poll_interval to 1s.

With a value of 200ms my understanding is the system checks all files 5 times per second. With 1000+ files, this creates significant overhead.

With poll_interval: 200ms

(pprof) top10
Showing nodes accounting for 17500ms, 47.44% of 36890ms total
Dropped 530 nodes (cum <= 184.45ms)
Showing top 10 nodes out of 250
      flat  flat%   sum%        cum   cum%
    4250ms 11.52% 11.52%     4250ms 11.52%  runtime/internal/syscall.Syscall6
    3030ms  8.21% 19.73%     3030ms  8.21%  memeqbody
    2330ms  6.32% 26.05%     3020ms  8.19%  encoding/json.appendCompact
    2010ms  5.45% 31.50%     2010ms  5.45%  runtime.memclrNoHeapPointers
    1400ms  3.80% 35.29%     4620ms 12.52%  runtime.scanobject
    1150ms  3.12% 38.41%     4260ms 11.55%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/fileset.(*Fileset[go.shape.*uint8]).Match
     960ms  2.60% 41.01%     1270ms  3.44%  runtime.typePointers.next
     890ms  2.41% 43.43%     1920ms  5.20%  runtime.pcvalue
     860ms  2.33% 45.76%      860ms  2.33%  runtime.futex
     620ms  1.68% 47.44%      620ms  1.68%  encoding/json.stateInString

With poll_interval: 1s

(pprof) top10
Showing nodes accounting for 3650ms, 46.09% of 7920ms total
Dropped 238 nodes (cum <= 39.60ms)
Showing top 10 nodes out of 269
      flat  flat%   sum%        cum   cum%
     830ms 10.48% 10.48%      830ms 10.48%  runtime/internal/syscall.Syscall6
     790ms  9.97% 20.45%      790ms  9.97%  memeqbody
     460ms  5.81% 26.26%      580ms  7.32%  encoding/json.appendCompact
     310ms  3.91% 30.18%      930ms 11.74%  runtime.scanobject
     280ms  3.54% 33.71%      280ms  3.54%  runtime.memclrNoHeapPointers
     260ms  3.28% 36.99%     1050ms 13.26%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/fileset.(*Fileset[go.shape.*uint8]).Match
     230ms  2.90% 39.90%      450ms  5.68%  runtime.pcvalue
     170ms  2.15% 42.05%      930ms 11.74%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/fingerprint.Fingerprint.StartsWith (inline)
     160ms  2.02% 44.07%      160ms  2.02%  runtime.futex
     160ms  2.02% 46.09%      160ms  2.02%  runtime.step

The next mitigation will be to reduce the max_concurrent_files because each file is read concurrently.

Hope this could help mitigate readers here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working never stale Issues marked with this label will be never staled and automatically removed pkg/stanza priority:p2 Medium receiver/filelog release:required-for-ga Must be resolved before GA release
Projects
None yet
Development

No branches or pull requests

5 participants