Consistent config interface for filtering external datasets #25161

dmitryax · 2023-08-11T00:10:07Z

Problem

We use many different filtering interfaces in the collector configuration where we need to provide users with an option to limit an external set of data. The interfaces differ pretty significantly, making it confusing for the end user. Moreover, developers may find it unclear which interface to adopt for their components moving forward.

Here we are talking only about filtering external sets, not pdata objects which are supposed to be fully handled by OTTL.

There are two main use cases:

1. Just filter external data sets prior to further processing.

This scenario arises when users need to limit the external dataset before the collector proceeds with specific processing.

1.1 List of glob file paths:

Example: pkg/stanza/fileconsumer/matcher.Include/Exclude taking slices of glob strings to match against files to watch

include:
  - /var/log/pods/*/*/*.log
exclude:
  - /var/log/pods/*/otel-collector/*.log

While effective for file paths, this method might not be universally applicable to other data sources.

1.2 Filtering by strict/regex rules provided explicitly

This method is widely employed in the filtering processor, soon to be substituted by OTTL. Additionally, it has been implemented in the hostmetrics receiver with this structure:

<include_devices|exclude_devices>:
  devices: [ <device name>, ... ]
  match_type: <strict|regexp>

It was introduced in open-telemetry/opentelemetry-collector#522 not to break the interfaces that were already built by that time. This will be fully replaced by OTTL in the filter processor and by #25134 in hostmetrics receiver.

1.3 Dynamically deducting regexp/glob/strict type from the string

This approach is used in dockerstats receiver exclude_images

excluded_images:
  - undesired-container
  - /.*undesired.*/
  - another-*-container

Strings wrapped with / represent regex, strings prefixed with ! represent negation, glob strings are automatically deducted if any the *?[]{}! characters are found in the string.

This approach is the least verbose but requires adding some additional escaping rules, e.g. if we want to do a strict match against /some-string/ or !string. Also, the unclear distinction between glob and regular strings can cause confusion.

2. Convert external maps into a new set of attributes

This functionality is needed when users desire a configuration interface to map external maps into resource attributes. Examples include:

A subset of Kubernetes pod/namespace annotations and labels mapped to resource attributes
A subset of container labels and environment variables mapped to resource attributes

The filtering part of this is the same as in the use case (1), but we also need to provide a way for users to remap keys from the original map to the attribute keys. This cannot be done in the downstream component because different sources (like pod annotations and labels) can have the same key, so users need a way to avoid keys from one map being overridden by another map. The mapping part can also be solved as a separate configuration option, but it's nice to do it at once, especially when we need to map regex-mapped groups. For example, k8sattributes processor currently provides the following interface for this:

extract:
  labels:
    - key_regex: (service-.*)
      from: pod
      tag_name: "k8s.pod.labels.$1"

Goal

Resolving this issue does not require changing all the existing filtering interfaces. The goal is to establish a prescribed way to define the filtering interfaces for external datasets for any future use cases. This issue would also be a prerequisite for resolving the following ones:

Possible solutions

Option A

Keep using approach 1.3 "Dynamically deducting regexp/glob/strict type from the string".

Pros

It's concise. We can support both use cases (1) and (2), where, for the use case (1), we provide a config interface as a simple list:

filter:
  - exact-string
  - /service-.*/

For the use case (2), we cannot use a map because we need to preserve the order of the filtering/mapping rules evaluation, so we need to keep it as a list of structs with two fields (filter_and_map is just a placeholder):

filter_and_map:
  - source_key: source-string
    target_key: target-string
  - source_key: /service-(.*)/
    target_key: $1

Cons

Require a way to escape of wrapping / for regexp and the ! negation prefix. Potentially, we can avoid supporting the negation ! sign because it can always be handled by a regexp. Also, we can recommend using regex for strings wrapped with /'s, e.g. /\/...\//.
Unclear distinction between strict and glob modes, e.g. users can add a single ? sign as part of the string they want to match, but it'll match any character. Potentially we can remove glob expression support and keep only /regex/ and strict-string matching

Option B

Make every filtering item a struct that would include a matching type as a field name: strict, glob regexp. Setting more than one of them shouldn't pass validation. To support use case (2), we can simply add another optional field, target_key, for example (note that filter and filter_and_map are just placeholders, they can be anything else like include_container_values):

filter:
  - strict: my-service
  - regexp: service-v1.(.*)

filter_and_map:
  # match only "source-key" key and assign its value to the "target-key" key in the target map
  - strict: source-key
    target_key: target-key
  # identify keys that match the regular expression 'service-(.*)' and assign 
  # their corresponding values to keys named after the matched group names in the target map
  - regexp: service-(.*)
    target_key: $1
  # identify keys matching the regular expression 'app-.*' 
  # and set the corresponding value under the same keys in the target map
  - regexp: app-.*

Pros

Explicit about a matching method

Cons

More verbose than Option A

Option C

Utilize OTTLP for the external datasets additionally to OTLP data. This would require introducing additional identifiers for the external items. It needs more discussion to determine its pros and cons

The text was updated successfully, but these errors were encountered:

TylerHelmuth · 2023-08-14T18:30:40Z

OTTL should definitely be the de facto solution when making decisions on data that has already be turned into pdata, but for situations where the filter configuration is modifying how the receiver collects data in the first place it gets trickier.

If we want a truly uniform filtering experience across all components making decision at any moment I believe the best solution is OTTL. It will be the most robust, allowing users to build any type of condition they need. If we use OTTL we won't need to worry about feature parity between 2 condition systems.

We have added a lot of plumbing to OTTL that will treat pcommon.Slice and []any as the same and pcommon.Map and map[string]any as the same, so I believe the primary work would be a new, extremely simple, "filter-specific" context and plumbing in internal/filterottl. I think it is possible that the context is a noop - there situations shouldn't actually transform and save any telemetry, we just need OTTL Conditions to help us make a decision.

But taking a step back, I believe we can separate the configuration API itself from the underlying implementation details. I am not interested in changing the glob pattern matching that filelogreceiver uses - it is a very natural and appropriate form of identifying file paths to collect. Also, I don't fully comprehend situation 2. I don't think receivers should be using a filter configuration to determine what data to grab and how to modify it at the same time. I'd rather the modification of any data be handled in the transformprocessor.

@dmitryax I think this would be a good topic for the SIG meeting.

dmitryax · 2023-08-14T20:45:27Z

I don't think receivers should be using a filter configuration to determine what data to grab and how to modify it at the same time. I'd rather the modification of any data be handled in the transformprocessor.

Sometimes it needs to be done in the same component. For example, we need to grab key/value pairs from k8s pod annotations AND labels making sure that key collisions are resolved. k8sattributes processor sets them with defferent prefies: k8s.pod.labels. and k8s.pod.annotations. by default, but it's configurable with regex groups. Potentially, we can simplify it and have just prefixes to be configurable.

TylerHelmuth · 2023-08-14T21:00:56Z

Reading through the k8sattributes config I now see what you mean. It make sense why it allows setting tag_name based on the regex key capture groups, but my preference would be incorporating OTTL somehow, but not if it would overcomplicate things. If feels reasonable to rewrite a configuration like

extract:
   labels:
     - tag_name: $$1
        key_regex: kubernetes.io/(.*)

as something like

extract:
   labels:
     # IDK what the actual syntax would look like for this as we havent written an extract function
     # We are adding named parameters tho which ensure clarity
     - extract(attributes, "$$1", "kubernetes.io/(.*)")

But again, I don't want to add overcomplexity where it isn't needed. Introducing OTTL functions and conditions gives a lot of flexibility, but I don't want the added complexity if it isn't needed.

Behind the scenes I still think it would be good to use OTTL, that way we aren't managing multiple filter frameworks.

TylerHelmuth · 2023-08-14T22:32:32Z

Considering non-OTTL options I lean towards option B, with the addition of a glob field.

kentquirk · 2023-08-16T16:34:25Z

As said in the meeting, please keep glob available for filenames; they're too useful and well-understood.

I favor option B as well.

github-actions · 2023-10-16T03:30:59Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

povilasv · 2023-12-05T12:26:03Z

I also like option B), it's pretty convenient and readable, you don't need to learn new syntax of / means regex, etc.

Though how would we support excluding elements? Two different fields with the same config structure, named include /exclude?

For example including certain resource attribute values:

resource_attributes:
  string.enum.resource.attr:
    description: Resource attribute with a known set of string values.
    type: string
    enum: [one, two]
    enabled: true
    include:
      - strict: one
      - regex: one|two

Example of excluding certain attribute values:

resource_attributes:
  string.enum.resource.attr:
    description: Resource attribute with a known set of string values.
    type: string
    enum: [one, two]
    enabled: true
    exclude:
      - strict: two

Another interesting complication from filtering resource attributes, is that they have a type: String, Slice, Map, Bytes. Would this work with maps / slice?

dmitryax · 2023-12-21T00:25:16Z

Though how would we support excluding elements? Two different fields with the same config structure, named include /exclude?

Sounds good to me. We should clearly document the precedence logic that is applied between them.

Another interesting complication from filtering resource attributes, is that they have a type: String, Slice, Map, Bytes. Would this work with maps/slice?

I think for other types, we should only support the strict mode.

andrzej-stencel · 2024-03-21T20:15:09Z

I also like option B. It's concise - brief enough, but comprehensible and unequivocal at the same time.

I would change the term strict to exact, seems this is the term usually used - "exact match", "regex match", etc.

dmitryax added enhancement New feature or request discussion needed Community discussion needed labels Aug 11, 2023

dmitryax changed the title ~~Consistent filtering interface for external slices and maps~~ Consistent filtering interface for external datasets Aug 11, 2023

bryan-aguilar mentioned this issue Aug 11, 2023

Allow configuring which resource attributes should be kept at the metric #22914

Closed

dmitryax mentioned this issue Aug 11, 2023

[processor/k8sattributes] Refactor FieldExtractConfig #25135

Open

dmitryax changed the title ~~Consistent filtering interface for external datasets~~ Consistent config interface for filtering external datasets Aug 11, 2023

TylerHelmuth added the pkg/ottl label Aug 14, 2023

TylerHelmuth mentioned this issue Aug 14, 2023

Initial commit for setting up a new component: k8slog receiver #24439

Open

TylerHelmuth mentioned this issue Oct 12, 2023

[receiver/k8sevents] Allow excluding namespaces #23991

Open

github-actions bot added the Stale label Oct 16, 2023

TylerHelmuth added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Oct 17, 2023

This was referenced Nov 3, 2023

[k8sclusterreceiver]: add node metadata as resource attributes #28848

Closed

[cmd/mdatagen] Ability to filter by resource attribute values #25134

Closed

evan-bradley mentioned this issue Dec 5, 2023

pkg/ottl: add section about design principles #29424

Merged

dmitryax mentioned this issue Jan 2, 2024

[mdatagen] allow filtering out metrics based on resource attributes #29769

Closed

TylerHelmuth mentioned this issue Feb 2, 2024

new component: deltatocumulative processor #30479

Closed

TylerHelmuth mentioned this issue Feb 23, 2024

Port k8s object receiver filters to the k8scluster receiver #31399

Open

dmitryax mentioned this issue Mar 6, 2024

[mdatagen] allow filtering out metrics based on resource attributes open-telemetry/opentelemetry-collector#9660

Merged

dmitryax mentioned this issue Apr 16, 2024

[mdatagen] Rename include/exclude config options open-telemetry/opentelemetry-collector#9960

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent config interface for filtering external datasets #25161

Consistent config interface for filtering external datasets #25161

dmitryax commented Aug 11, 2023 •

edited

Loading

TylerHelmuth commented Aug 14, 2023

dmitryax commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

kentquirk commented Aug 16, 2023

github-actions bot commented Oct 16, 2023

povilasv commented Dec 5, 2023 •

edited

Loading

dmitryax commented Dec 21, 2023

andrzej-stencel commented Mar 21, 2024

Consistent config interface for filtering external datasets #25161

Consistent config interface for filtering external datasets #25161

Comments

dmitryax commented Aug 11, 2023 • edited Loading

Problem

1. Just filter external data sets prior to further processing.

1.1 List of glob file paths:

1.2 Filtering by strict/regex rules provided explicitly

1.3 Dynamically deducting regexp/glob/strict type from the string

2. Convert external maps into a new set of attributes

Goal

Possible solutions

Option A

Pros

Cons

Option B

Pros

Cons

Option C

TylerHelmuth commented Aug 14, 2023

dmitryax commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

TylerHelmuth commented Aug 14, 2023

kentquirk commented Aug 16, 2023

github-actions bot commented Oct 16, 2023

povilasv commented Dec 5, 2023 • edited Loading

dmitryax commented Dec 21, 2023

andrzej-stencel commented Mar 21, 2024

dmitryax commented Aug 11, 2023 •

edited

Loading

povilasv commented Dec 5, 2023 •

edited

Loading