
Otel receivers spawn on need #687

Merged: 16 commits merged into main from otel-receivers-spawn-on-need on Apr 30, 2024

Conversation

@wildum (Contributor) commented Apr 26, 2024

PR Description

Modify all otelcol receiver components to only start a receiver for a signal when the signal is needed downstream (when it's set in the output block).

This fixes an issue with the otelcol.receiver.kafka component where the wrong receiver could claim the topic, leading to decoding errors.

Which issue(s) this PR fixes

Fixes #644

Notes to the Reviewer

PR Checklist

Spawning a receiver for every signal for every receiver component was not needed and caused errors with the Kafka receivers. Now the otelcol receiver checks which signals are needed and spawns only the corresponding receivers.
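
To make the mechanism concrete, here is a minimal sketch of the spawn-on-need idea. The types and helper names (Consumer, Output, spawnFunc, startNeededReceivers) are hypothetical stand-ins, not the actual Alloy internals: a signal-specific receiver is only created when its consumer list in the output block is non-empty.

package kafkareceiver

import (
	"context"
	"fmt"
)

// Consumer is a stand-in for the real otelcol consumer type.
type Consumer interface{}

// Output mirrors the component's output block: one consumer list per signal.
type Output struct {
	Metrics []Consumer
	Logs    []Consumer
	Traces  []Consumer
}

// spawnFunc is a hypothetical constructor for one signal-specific receiver.
type spawnFunc func(ctx context.Context, consumers []Consumer) error

// startNeededReceivers spawns a signal-specific receiver only when the
// corresponding consumer list in the output block is non-empty.
func startNeededReceivers(ctx context.Context, out Output, spawnMetrics, spawnLogs, spawnTraces spawnFunc) error {
	if len(out.Metrics) > 0 {
		if err := spawnMetrics(ctx, out.Metrics); err != nil {
			return fmt.Errorf("starting metrics receiver: %w", err)
		}
	}
	if len(out.Logs) > 0 {
		if err := spawnLogs(ctx, out.Logs); err != nil {
			return fmt.Errorf("starting logs receiver: %w", err)
		}
	}
	if len(out.Traces) > 0 {
		if err := spawnTraces(ctx, out.Traces); err != nil {
			return fmt.Errorf("starting traces receiver: %w", err)
		}
	}
	return nil
}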
@wildum requested a review from ptodev on April 26, 2024 12:51
@rfratto (Member) left a comment

I'm not sure about this one, especially because checkOutputSignals isn't handling updates. I find this to be a little bit complicated and potentially fixing a symptom of a larger underlying issue.

If this is a bug with upstream, we should help them fix that bug upstream. If this isn't a bug with upstream, I'd like to understand what upstream does differently than we do.

@wildum (Contributor, author) commented Apr 26, 2024

If this is a bug with upstream, we should help them fix that bug upstream. If this isn't a bug with upstream, I'd like to understand what upstream does differently than we do.

The OTel Collector has pipelines. For Kafka, this looks like:

service:
  pipelines:
    traces:
      receivers:
        - kafka
      exporters:
        - otlp 

Because the kafka receiver is set in the traces pipeline, the otel collector will only create a TraceReceiver to claim the provided kafka topic.

In Alloy we currently don't have a way to specify which signal the receiver should handle, so by default we spawn all three receivers (metrics/logs/traces) for every receiver component.

otelcol.receiver.kafka "otelkafka" {
  brokers          = ["127.0.0.1:9094"]
  protocol_version = "2.8.0"
  encoding         = "otlp_proto"

  output {
    traces = [otelcol.processor.batch.otelkafka.input]
  }
}

At best, the useless receivers that we spawn don't have any impact (e.g. the metricsReceiver and the logsReceiver that we spawn for the Jaeger receiver don't do anything). But for Kafka, all active receivers try to claim the given topic.

The goal of this PR is to use the output block as an equivalent of the pipeline feature of OTel and only spawn the receivers that are needed. In this case, because it only outputs traces, the receiver should only create a TraceReceiver.

@wildum (Contributor, author) commented Apr 26, 2024

It's a bit confusing that everything is called a receiver, but that's how it's named in the code. An OTel receiver contains up to three signal receivers: metrics, logs, and traces.

@wildum (Contributor, author) commented Apr 26, 2024

I'm not sure about this one, especially because checkOutputSignals isn't handling updates.

Good point. If we go with this solution, I could try to move it, or make it an error in the validate func if we want to prevent this behavior for the Kafka receiver. Or we just keep the warning in the docs, and if users configure it wrong they will get a decoding error message.

@rfratto (Member) commented Apr 26, 2024

Because the kafka receiver is set in the traces pipeline, the otel collector will only create a TraceReceiver to claim the provided kafka topic.

But what happens upstream if you configure the kafka receiver in the logs and traces pipeline? Does it reject it?

@ptodev (Contributor) commented Apr 26, 2024

Because the kafka receiver is set in the traces pipeline, the otel collector will only create a TraceReceiver to claim the provided kafka topic.

But what happens upstream if you configure the kafka receiver in the logs and traces pipeline? Does it reject it?

It depends on whether you explicitly configured a Kafka topic. If you didn't, then it'll default to a different topic for every signal. If you did, it'll try to use that same topic for all signal types.

(I have never tried this with this receiver btw :) I'm just stating my expectations)

@wildum (Contributor, author) commented Apr 26, 2024

But what happens upstream if you configure the kafka receiver in the logs and traces pipeline? Does it reject it?

The behavior is the same as with Alloy: it starts the logsReceiver and the tracesReceiver, and one of them will claim the topic. If it's the tracesReceiver then it works. If it's the logsReceiver then it will fail to decode (it gets stuck in a loop: consumes a message -> fails to decode because the schema is wrong -> tries to consume the next message, and so on).
I ran it several times, and it's random which receiver claims the topic first.

@wildum (Contributor, author) commented Apr 26, 2024

Sorry, I didn't see your answer @ptodev, but you are indeed correct.

@rfratto (Member) commented Apr 26, 2024

The behavior is the same as with Alloy: it starts the logsReceiver and the tracesReceiver, and one of them will claim the topic. If it's the tracesReceiver then it works. If it's the logsReceiver then it will fail to decode (it gets stuck in a loop: consumes a message -> fails to decode because the schema is wrong -> tries to consume the next message, and so on).

So I think my statement stands here; I would prefer reporting it and helping fix that behavior upstream, since I think this is a symptom of a larger underlying issue rather than an issue with the framework we built out for creating OTel component wrappers.

@wildum (Contributor, author) commented Apr 26, 2024

So I think my statement stands here; I would prefer reporting it and helping fix that behavior upstream, since I think this is a symptom of a larger underlying issue rather than an issue with the framework we built out for creating OTel component wrappers.

It's two different things.
The problem with Otel is that if you put the kafka receiver in the wrong pipeline then it doesn't work. That's not a bug, at most it's a questionable design. If you need to use kafka for traces, metrics and logs, then you configure 3 receivers, assign each the correct topic and you put each one in the correct pipeline.

For Alloy this is a bug because we cannot specify the pipeline. At the moment, the only way to make the kafka receiver work consistently is to explicitly set the topic argument to an empty string. If you set the topic argument to any other value, it will not work. And this workaround relies on the topics being named with the correct default values. In other words, our otelcol.receiver.kafka is completely broken and only usable via a hacky workaround.

@rfratto (Member) commented Apr 26, 2024

The problem with Otel is that if you put the kafka receiver in the wrong pipeline then it doesn't work. That's not a bug, at most it's a questionable design. If you need to use kafka for traces, metrics and logs, then you configure 3 receivers, assign each the correct topic and you put each one in the correct pipeline.

I don't understand, how can a user put the Kafka receiver in a "wrong" pipeline in OpenTelemetry? Given that OpenTelemetry intends for a receiver to be re-used in any number of pipelines of any mix of telemetry types, I don't see how an instance of the Kafka receiver component could be placed in a "wrong" pipeline.

For example, based on what I understand about OpenTelemetry, this should be valid:

receivers:
  kafka: # ...

exporters:
  otlp: # ...

pipelines:
  metrics:
    receivers: [kafka]
    exporters: [otlp]
  logs:
    receivers: [kafka]
    exporters: [otlp]
  traces:
    receivers: [kafka]
    exporters: [otlp]

I don't see anything in the upstream documentation for the Kafka receiver that would imply this is invalid just for the Kafka receiver. Am I misunderstanding what you meant here?

@ptodev (Contributor) commented Apr 26, 2024

@rfratto The config you suggested is going to work fine, but only if you don't explicitly set a kafka topic in the OTel config.

If you do set a topic explicitly, then that topic will be used in every pipeline. This causes the kafka receiver to throw errors because all of the logs/metrics/traces receivers will be trying to interpret the incoming data as logs/metrics/traces respectively.

I'd have preferred if we could explicitly set a "log topic", a "traces topic", and a "metrics topic", similarly to how the otlp receiver can have traces_url_path, etc. So I do agree with you that there is scope to improve the component upstream. I don't see harm in merging the receiver.go changes in this PR though. There's no point in setting up receivers for telemetry types that we don't need.

@wildum (Contributor, author) commented Apr 26, 2024

The problem with Otel is that if you put the kafka receiver in the wrong pipeline then it doesn't work. That's not a bug, at most it's a questionable design. If you need to use kafka for traces, metrics and logs, then you configure 3 receivers, assign each the correct topic and you put each one in the correct pipeline.

I don't understand, how can a user put the Kafka receiver in a "wrong" pipeline in OpenTelemetry? Given that OpenTelemetry intends for a receiver to be re-used in any number of pipelines of any mix of telemetry types, I don't see how an instance of the Kafka receiver component could be placed in a "wrong" pipeline.

For example, based on what I understand about OpenTelemetry, this should be valid:

receivers:
  kafka: # ...

exporters:
  otlp: # ...

pipelines:
  metrics:
    receivers: [kafka]
    exporters: [otlp]
  logs:
    receivers: [kafka]
    exporters: [otlp]
  traces:
    receivers: [kafka]
    exporters: [otlp]

I don't see anything in the upstream documentation for the Kafka receiver that would imply this is invalid just for the Kafka receiver. Am I misunderstanding what you meant here?

The config is valid as long as you don't specify a topic. In this case you will get this:

  • kafka for metrics will create a metricsReceiver which will claim the topic otlp_metrics
  • kafka for traces will create a tracesReceiver which will claim the topic otlp_spans
  • kafka for logs will create a logsReceiver which will claim the topic otlp_logs

If you specify one topic (let's say "test_topic") in your kafka receiver config, they will all try to claim it:

  • kafka for metrics will create a metricsReceiver which will claim the topic test_topic
  • kafka for traces will create a tracesReceiver which will claim the topic test_topic
  • kafka for logs will create a logsReceiver which will claim the topic test_topic

Only one will get it. This is not good: it will work if you are lucky; otherwise you get decoding errors and no data is forwarded.

One thing you can do if you need to specify different topic names is this:

receivers:
  kafka/traces:
    protocol_version: 2.8.0
    brokers: ["kafka:9092"]
    topic: tracing_custom_topic
    encoding: otlp_proto
  kafka/metrics:
    protocol_version: 2.8.0
    brokers: ["kafka:9092"]
    topic: metrics_custom_topic
    encoding: otlp_proto

exporters:
  otlp:

service:
  pipelines:
    traces:
      receivers:
        - kafka/traces
      exporters:
        - otlp
    metrics:
      receivers:
        - kafka/metrics
      exporters:
        - otlp

With this I can configure different kafka receivers with different topics and assign them to the correct pipelines. By correct pipeline I mean that the topic "tracing_custom_topic", which contains traces, should only be consumed by the traces pipeline. This use-case is currently not possible in Alloy.

@rfratto (Member) commented Apr 26, 2024

Thanks, I see. Yes, in that case there would be a collision. However, I'd rather we not stray too far from the upstream behavior as we aim for config compatibility, so I don't think we should reject someone configuring the component this way. However, I do think it's OK for us to mark the component unhealthy if we detect this scenario (outputting to multiple telemetry types while the component uses the same topic for each type).

I've also noticed that we don't set the default topic the same way as upstream. Upstream sets the default topic at runtime based on the telemetry type the receiver is being instantiated for; we always set it to otlp_spans. We could fix this behaviour, but it might be a breaking change for anyone who is writing non-trace data to the otlp_spans topic. However, it might be worth fixing anyway, since I do think this behavior is incorrect.

Additionally, I understand now why we're changing all receivers to conditionally enable pipelines depending on where data is being routed. I think this needs a few tests though:

  • Add tests to assert the expected behavior when you're sending (e.g.) logs to a component that doesn't have logs enabled (i.e., it's not forwarding logs to another component).
    • I don't know what the expected behavior should be, but at the very least it shouldn't panic.
  • Add tests to ensure that the code change handles updates properly when enabling/disabling per-type instances of the underlying OTel component.

I also still think we should report this behavior upstream and ensure it's intended on their side. I also think we should help them modify their documentation to warn of this behaviour if it is intended. Otherwise there may be a way to fix it (such as finding a way to allow multiple subscribers to the same topic as long as they're all subscribing for different telemetry types).

@wildum (Contributor, author) commented Apr 26, 2024

We always set it to otlp_spans. We could fix this behaviour, but it might be a breaking change for anyone who is writing non-trace data to the otlp_spans topic.

Although this is a breaking change in theory, in practice it cannot be a breaking change, because if you use the default value of the topic arg it won't work. The only way to have our otelcol.receiver.kafka component working is to set it to an empty string. We only found out now because no one was using this component, I guess.

@wildum (Contributor, author) commented Apr 26, 2024

I also still think we should report this behavior upstream and ensure it's intended on their side. I also think we should help them modify their documentation to warn of this behaviour if it is intended.

I can follow up on this :)

@wildum (Contributor, author) commented Apr 26, 2024

Add tests to assert the expected behavior when you're sending (e.g.) logs to a component that doesn't have logs enabled (i.e., it's not forwarding logs to another component).
I don't know what the expected behavior should be, but at the very least it shouldn't panic.
Add tests to ensure that the code change handles updates properly when enabling/disabling per-type instances of the underlying OTel component.

I will add the tests
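
As a rough illustration of the second kind of test (update handling), here is a sketch reusing the hypothetical Output and Consumer types and a hypothetical signalsFor helper from the sketch in the PR description; the real tests exercise the actual component through its update path rather than a standalone helper.

import "testing"

// signalsFor is a hypothetical helper reporting which signal receivers
// should be active for a given output block.
func signalsFor(out Output) map[string]bool {
	return map[string]bool{
		"metrics": len(out.Metrics) > 0,
		"logs":    len(out.Logs) > 0,
		"traces":  len(out.Traces) > 0,
	}
}

func TestReceiversFollowOutputUpdates(t *testing.T) {
	// Start with an output block that only forwards traces...
	got := signalsFor(Output{Traces: []Consumer{struct{}{}}})
	if !got["traces"] || got["logs"] || got["metrics"] {
		t.Fatalf("expected only the traces receiver to be needed, got %v", got)
	}

	// ...then simulate an update that switches the output to logs only.
	got = signalsFor(Output{Logs: []Consumer{struct{}{}}})
	if !got["logs"] || got["traces"] || got["metrics"] {
		t.Fatalf("expected only the logs receiver to be needed after the update, got %v", got)
	}
}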

@rfratto (Member) commented Apr 26, 2024

Although this is a breaking change in theory, in practice it cannot be a breaking change, because if you use the default value of the topic arg it won't work. The only way to have our otelcol.receiver.kafka component working is to set it to an empty string.

Yeah, that's a good point. Though my understanding is that it's not that it never worked at all, just that it's a race condition so there may be some missing data, right?

We only found out now because no one was using this component, I guess.

Neither our usage stats for Flow nor Alloy show any usage of this component. That doesn't mean that there isn't someone using it that's just not reporting usage stats, but it does seem reasonable to say that this component has little-to-no usage.

I definitely think it's OK for us to fix the default value now without it breaking people 👍

@ptodev (Contributor) commented Apr 26, 2024

If I remember correctly, the Kafka receiver upstream used to also set the topic to "otlp_spans" by default, for metrics/logs/traces pipelines. They later changed it to have a different topic depending on the type of telemetry. The Alloy defaults are consistent with the old Otel Collector defaults.

We could change our default to "" and put in an explanation that it depends on the telemetry? Then it'd be consistent with the Collector.
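
For reference, a tiny sketch of how a per-signal default could be resolved when the topic argument is left empty. The helper name is hypothetical, but the default topic names match the upstream defaults listed earlier in this thread.

// defaultTopic returns the per-signal default topic used when no explicit
// topic is configured (hypothetical helper, not the actual implementation).
func defaultTopic(signal string) string {
	switch signal {
	case "metrics":
		return "otlp_metrics"
	case "logs":
		return "otlp_logs"
	default: // traces
		return "otlp_spans"
	}
}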

@wildum (Contributor, author) commented Apr 26, 2024

Yeah, that's a good point. Though my understanding is that it's not that it never worked at all, just that it's a race condition so there may be some missing data, right?

Yeah, saying that it "never worked" was a bit of a shortcut. You do have a chance of hitting the right receiver, although you need to roll the dice every time a session ends and the receivers need to re-claim the topic.

@wildum (Contributor, author) commented Apr 26, 2024

We could change our default to "" and put in an explanation that it depends on the telemetry? Then it'd be consistent with the Collector.

I'm OK with the plan. Should we explicitly say that if you leave it as "" then you can use all three signals for one receiver, but if you set a topic then you should only use one signal? Just as documentation, no log in the code.

And then hopefully we can change the behaviour in Otel

@rfratto (Member) commented Apr 26, 2024

I'm OK with the plan. Should we explicitly say that if you leave it as "" then you can use all three signals for one receiver, but if you set a topic then you should only use one signal? Just as documentation, no log in the code.

And then hopefully we can change the behaviour in Otel

Yeah, I think if the default is an empty string then it should use different topics internally, so it should be valid to use all three types for emitting telemetry. If you manually set a topic, then it's an invalid config and we should note that on the component health by marking it as unhealthy with a descriptive message. (We can mention this in the documentation as well.) Marking it in a log entry could work, but I think it's less persistent than the health status would be.

And then yeah hopefully we can work with upstream to see if there's any possible improvements 👍
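
A sketch of what such a validation could look like, on a hypothetical Arguments struct with Topic and Output fields; this is illustrative only, not the exact kafka.go code that was eventually merged.

// Arguments is a hypothetical subset of the component's arguments.
type Arguments struct {
	Topic  string
	Output Output
}

// Validate rejects a configuration that sets an explicit topic while
// outputting more than one signal, since all spawned receivers would
// otherwise fight over the same topic.
func (args Arguments) Validate() error {
	if args.Topic == "" {
		// With no explicit topic, each signal receiver can fall back to its
		// own default topic, so any combination of outputs is fine.
		return nil
	}

	signals := 0
	if len(args.Output.Metrics) > 0 {
		signals++
	}
	if len(args.Output.Logs) > 0 {
		signals++
	}
	if len(args.Output.Traces) > 0 {
		signals++
	}
	if signals > 1 {
		return fmt.Errorf("a topic is set but %d signals are configured in the output block; only one signal can be used with an explicit topic", signals)
	}
	return nil
}

Putting the check in a Validate function means a bad configuration is rejected when the config is loaded or updated, instead of surfacing later as runtime decoding errors.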

@ptodev (Contributor) commented Apr 26, 2024

Actually, I don't think we can change the default to "" because we'd be breaking backwards compatibility 😄 We could instead keep defaulting to "otlp_spans", and just document that setting the topic to "" has interesting side effects. And I do think we should merge the change this PR does to receiver.go

@rfratto (Member) commented Apr 26, 2024

Actually, I don't think we can change the default to "" because we'd be breaking backwards compatibility 😄 We could instead keep defaulting to "otlp_spans", and just document that setting the topic to "" has interesting side effects. And I do think we should merge the change this PR does to receiver.go

Have you seen the discussion about this above? We know that the component can't work properly today regardless of how you configure it, and that upstream has recently changed the defaults, so fixing this could fall under any one of these exceptions:

Specification errors: If a specification for a feature is found to be incomplete or inconsistent, fixing the specification may require a breaking change.

Bugs: If a bug is found that goes against the documented specification of that functionality, fixing the bug may require breaking compatibility for users who are relying on the incorrect behavior.

Upstream changes: Much of the functionality of Alloy is built on top of other software, such as OpenTelemetry Collector and Prometheus. If upstream software breaks compatibility, we may need to reflect this in Alloy.

We've also observed that the component has little-to-no usage (likely due to the bug), which makes it easier for us to justify an exception to the rule to fix the behavior as it's unlikely anyone would actually be broken.

@ptodev (Contributor) commented Apr 26, 2024

Yes, as you say I think we'd be ok with such a change in this case 👍

@wildum requested a review from ptodev on April 29, 2024 13:51
wildum and others added 4 commits April 30, 2024 09:52
@rfratto (Member) left a comment

LGTM, thanks William!

Review thread on internal/component/otelcol/receiver/kafka/kafka_test.go (outdated, resolved)
@wildum merged commit db18233 into main on Apr 30, 2024
13 checks passed
@wildum deleted the otel-receivers-spawn-on-need branch on April 30, 2024
hainenber pushed a commit to hainenber/alloy that referenced this pull request May 1, 2024
* Spawn otelcol receivers on need.

Spawning a receiver for every signal for every receiver component was not needed and caused errors with the Kafka receivers. Now the otelcol receiver checks which signals are needed and spawns only the corresponding receivers.

* changelog entry

* add warning when more than one signal is set for the output of the kafka receiver

* cleanup

* use validate instead of logging

* update doc

* add test to check that the receiver spawns the correct receivers on update

* doc typo

* remove default value in code

* update tests

* Update docs/sources/reference/components/otelcol.receiver.kafka.md

Co-authored-by: Paulin Todev <paulin.todev@gmail.com>

* Update internal/component/otelcol/receiver/kafka/kafka.go

Co-authored-by: Paulin Todev <paulin.todev@gmail.com>

* Update docs/sources/reference/components/otelcol.receiver.kafka.md

Co-authored-by: Paulin Todev <paulin.todev@gmail.com>

* adapt tests

* make test more readable

---------

Co-authored-by: Paulin Todev <paulin.todev@gmail.com>
polyrain pushed a commit to polyrain/alloy that referenced this pull request May 1, 2024
@rfratto added the backport-to-agent label (PR should be backported to the agent repo.) on May 3, 2024
The github-actions bot locked this PR as resolved and limited the conversation to collaborators on Jun 3, 2024
Successfully merging this pull request may close these issues:

Various "error":"proto: wrong wireType = 2 for field TimeUnixNano" messages when using otel.receiver.kafka (#644)