enabled the memory buffer causes high memory usage #8998

Closed
wgb1990 opened this issue Sep 2, 2021 · 3 comments
Labels: domain: performance, type: bug


wgb1990 commented Sep 2, 2021

Vector Version

0.16.1

Vector Configuration File

[api]
enabled = true
address = "0.0.0.0:8686"

[sources]

  [sources.metrics]
  type = "internal_metrics"
  scrape_interval_secs = 2.0

  [sources.apps_log]
  type = "kafka"
  bootstrap_servers = ""
  group_id = "dw-log-forward-vector-gid"
  topics = [
    "dw-log-forward"
  ]

[transforms]

  [transforms.remap_message]
  type = "remap"
  inputs = [
    "apps_log"
  ]
  source = ". = parse_json!(string!(.message))\n.eventTime = to_int!(.timestamp)\n"

  [transforms.output_loki]
  type = "remap"
  inputs = [
    "remap_message"
  ]
  source = ".aggregator = get_env_var!(\"POD_NAME\")\ndel(.domain)\ndel(.eventTime)\n"

  [transforms.output]
  type = "route"
  inputs = [
    "remap_message"
  ]

    [transforms.output.route]
    java_gc_log = ".collectorType == \"gc\""
    java_error_log = ".level == \"ERROR\""

[sinks]

  [sinks.prometheus_exporter]
  type = "prometheus_exporter"
  inputs = [
    "metrics"
  ]
  address = "0.0.0.0:9598"
  default_namespace = "service"

  [sinks.kafka_gc]
  type = "kafka"
  inputs = [
    "output.java_gc_log"
  ]
  bootstrap_servers = ""
  compression = "gzip"
  topic = "java_gc_log"

    [sinks.kafka_gc.encoding]
    codec = "json"
    timestamp = "unix"

    [sinks.kafka_gc.healthcheck]
    enabled = true

  [sinks.kafka_error]
  type = "kafka"
  inputs = [
    "output.java_error_log"
  ]
  bootstrap_servers = ""
  compression = "gzip"
  topic = "exception_log_prd"

    [sinks.kafka_error.encoding]
    codec = "json"
    timestamp = "unix"

    [sinks.kafka_error.healthcheck]
    enabled = true

  [sinks.loki]
  type = "loki"
  inputs = [
    "output_loki"
  ]
  endpoint = ""
  tenant_id = "{{ tenantId }}"
  remove_label_fields = true
  out_of_order_action = "rewrite_timestamp"

    [sinks.loki.batch]
    max_bytes = 30490000
    max_events = 7000
    timeout_secs = 1

    [sinks.loki.buffer]
    type = "memory"
    max_events = 10240000

    [sinks.loki.request]
    concurrency = 10240000

    [sinks.loki.labels]
    service = "{{ service }}"
    hostname = "{{ hostname }}"
    level = "{{ level }}"
    collectorType = "{{ collectorType }}"
    aggregator = "{{ aggregator }}"

    [sinks.loki.encoding]
    codec = "json"
    timestamp_format = "rfc3339"

    [sinks.loki.healthcheck]
    enabled = true

Debug Output

Expected Behavior

Memory consumption stays within the normal range, and events are sent to Loki storage normally.

Actual Behavior

Events accumulate in the buffer and eventually cause the pod to run out of memory and restart.

Example Data

Vector plays the role of an aggregator here; three nodes handle about 6,000 events per second.

Additional INFO

Memory begins to grow at some point. It appears that the memory buffer accumulates a large number of events, which eventually causes sending to Loki to stall.
[Two screenshots: memory-usage graphs, values shown as percentages]

References


wgb1990 commented Sep 3, 2021

I suspect that Vector's input throughput is greater than its output throughput, resulting in the OOM. After expanding from 3 nodes to 9 nodes, the problem no longer appears.

wgb1990 changed the title from "enabled the memory buffer causes high memory consumption" to "enabled the memory buffer causes high memory usage" on Sep 6, 2021

jszwedko commented Nov 9, 2021

Hi @wgb1990 !

Apologies for the long delay in response.

A few things jumped out from your configuration:

    [sinks.loki.batch]
    max_bytes = 30490000
    max_events = 7000
    timeout_secs = 1

This will cause Vector to create batches of up to 7000 events or ~30 MB. The number of concurrent batches will be related to the number of partitions; for the loki sink, one partition is created per unique set of labels:

    [sinks.loki.labels]
    service = "{{ service }}"
    hostname = "{{ hostname }}"
    level = "{{ level }}"
    collectorType = "{{ collectorType }}"
    aggregator = "{{ aggregator }}"
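
As a rough, purely illustrative calculation (your actual label cardinality isn't shown here): if those five templated labels were to produce, say, 100 unique combinations, the in-flight batches alone could account for roughly

    100 partitions × ~30 MB max batch ≈ 3 GB

before the buffer is even considered.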

For:

    [sinks.loki.buffer]
    type = "memory"
    max_events = 10240000

Depending on your average event size, this could end up allocating a large amount of memory as well. For example, if we assume your average event is 1 KB, the buffer alone could grow to ~10 GB.
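
Not a tuned recommendation for your workload, but as a sketch (assuming the standard memory-buffer options, max_events and when_full), a much smaller bound combined with back-pressure keeps that worst case in the tens of megabytes instead:

    [sinks.loki.buffer]
    type = "memory"
    # ~10,000 events at ~1 KB each is on the order of 10 MB buffered
    max_events = 10000
    # when full, block (apply back-pressure to the Kafka source) rather than drop events
    when_full = "block"

If back-pressure into Kafka isn't acceptable, a disk buffer (type = "disk" with a max_size in bytes) trades memory for disk instead.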

Does this additional context help? Your graphs only show percentages, so I can't tell what Vector's RSS is in absolute terms.

jszwedko added the "domain: performance" label on Nov 9, 2021
jszwedko commented

Closing this due to lack of response to the last comment. Feel free to re-open though.
