Large Vector users often require complex Vector topologies to facilitate the collection and processing of data from many different upstream sources. Currently, this results in large Vector configuration files that are hard to manage, especially across different teams. This RFC lays out the concept of pipelines, a structured way to organize configuration that makes Vector a better candidate for use cases involving widespread collaboration on configuration.
- RFC 2064: Event-driven observability.
- The definition of pipelines and their limitations
- How pipelines fit into Vector's configuration loading and topology
- Expected observability outputs related to pipelines
- Access control - The ability to control access to global resources (sources and sinks) within pipelines.
- Component reuse - The ability to define boilerplate for reuse across many different pipelines. This will likely align with Datadog’s “pipeline catalogue”.
- Connecting pipelines together - The ability to take input from another pipeline.
- Pipeline quotas - The ability to limit how much data a pipeline can send to a sink.
- Pre/post-processing of data - The ability to prepare and normalize data.
- How pipelines should be synchronized between Vector instances.
- Vector does not provide the ability to enforce any kind of organizational structure on configuration files, making large configurations painful.
- Because of this lack of structure, there is no clear path to achieving delegation and/or isolation of configuration subsections.
- There is no means for grouping individual components together for observability purposes.
This change will introduce the concept of pipelines to users. A pipeline is defined as:
- A collection of transforms defined together, outside of the top-level configuration file
- Able to draw input from and send output to components defined in the top-level configuration, but are isolated from other pipelines
- Having each contained component's internal metrics tagged with the `id` of the pipeline
Pipelines will be loaded from a `pipelines` sub-directory relative to the Vector configuration directory (e.g., `/etc/vector/pipelines`). Therefore, if a user changes the location of the Vector configuration directory, they will also change the `pipelines` directory path. They are coupled.
The `pipelines` directory will contain all pipelines represented as individual files. For simplicity, and to ensure users do not overcomplicate pipeline management, sub-directories/nesting are not allowed. This is inspired by Terraform's single-level directory nesting, which has been a net positive for simple management of large Terraform projects.
Each pipeline file is a processing subset of a larger Vector configuration file. Therefore, it follows the same syntax as Vector's configuration (`toml`, `yaml`, and `json`). Each pipeline will have an `id` derived from the name of the file without the extension. For example, the pipeline defined in `load-balancer.yml` will have `load-balancer` as its `id`.
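As a rough illustration, deriving the `id` needs nothing more than the file stem (the helper below is hypothetical, not part of Vector's codebase):

```rust
use std::path::Path;

// Hypothetical helper: derive a pipeline id from its file name, e.g.
// "/etc/vector/pipelines/load-balancer.yml" -> "load-balancer".
fn pipeline_id(path: &Path) -> Option<String> {
    path.file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
}
```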
Pipelines have access to any components defined in the root configuration directory. For example, if the transform `foo` is defined in `/etc/vector/bar.toml`, it will be accessible by the pipeline `/etc/vector/pipelines/pipeline.toml`, but if a transform `bar` is defined in `/etc/vector/pipelines/another-pipeline.toml`, it will not be accessible by other pipelines.
If no pipeline is defined, Vector behaves as if the feature didn't exist. This way, a configuration from a version without the `pipeline` feature will keep working. If a pipeline file is left empty, Vector behaves as if it doesn't exist.
If any of the following constraints are violated, Vector will error on boot:
- There cannot be several pipelines with the same `id` (for example, `load-balancer.yml` and `load-balancer.json`).
- The pipeline's configuration files should only contain transforms.
- A pipeline's transform cannot have the same name as any component from the root configuration.
- A pipeline cannot use another pipeline's component as input or output.
If the violation occurs during a reload, an error will be triggered and handled in the same fashion as other reload errors.
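A minimal sketch of how a couple of these boot-time checks could look, assuming the pipeline ids and their transform names have already been collected while loading the configuration directory (all names below are illustrative, not Vector's actual validation code):

```rust
use std::collections::HashSet;

// Illustrative boot-time validation: reject duplicate pipeline ids and
// pipeline transforms that shadow a component from the root configuration.
fn validate_pipelines(
    pipelines: &[(String, Vec<String>)], // (pipeline id, its transform names)
    root_ids: &HashSet<String>,          // component ids from the root configuration
) -> Result<(), String> {
    let mut seen = HashSet::new();
    for (id, transforms) in pipelines {
        if !seen.insert(id.as_str()) {
            return Err(format!("duplicate pipeline id: {id}"));
        }
        for name in transforms {
            if root_ids.contains(name) {
                return Err(format!(
                    "transform {name:?} in pipeline {id:?} conflicts with a root component"
                ));
            }
        }
    }
    Ok(())
}
```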
As mentioned in the previous section, a pipeline is just a set of transforms.
To be able to forward the events going through the pipeline to a sink, we'll add a new `outputs` option on the pipeline's transforms that will simply specify where the transform's events are redirected to. The `outputs` option is used solely to build the topology and represents an interface between the transform and the external sinks.
A pipeline will have the following internal representation before building the topology.
```rust
struct PipelineTransform {
    // The regular transform configuration, as in the root configuration.
    inner: TransformOuter,
    // The pipeline-specific option: which external components receive the events.
    outputs: Vec<String>,
}

struct Pipeline {
    // Derived from the file name, e.g. `load-balancer`.
    id: String,
    transforms: Map<String, PipelineTransform>,
}
```
Which corresponds to the following configuration file:
```toml
# /etc/vector/pipelines/pipeline.toml
[transforms.foo]
type = "remap"
inputs = ["from-root"]
outputs = ["dc1", "dc2"]
# ...

[transforms.bar]
type = "remap"
inputs = ["foo"]
outputs = ["dc-us", "dc-eu"]
# ...
```
The outputs
option is made to forward the events from inside the pipeline to an external component.
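As a self-contained sketch, a pipeline file could be parsed into this shape roughly as follows, using a plain map of TOML values as a stand-in for `TransformOuter` (the `Raw*` types and `load_pipeline` are assumptions for illustration, not the actual implementation):

```rust
use std::collections::BTreeMap;
use std::path::Path;

use serde::Deserialize;

// Stand-in for `PipelineTransform`: the pipeline-specific `outputs` field plus
// whatever else the transform defines (`type`, `inputs`, ...), kept untyped here.
#[derive(Deserialize)]
struct RawTransform {
    #[serde(default)]
    outputs: Vec<String>,
    #[serde(flatten)]
    inner: BTreeMap<String, toml::Value>,
}

// Stand-in for `Pipeline`, before the id (derived from the file name) is attached.
#[derive(Deserialize)]
struct RawPipeline {
    transforms: BTreeMap<String, RawTransform>,
}

fn load_pipeline(path: &Path) -> Result<RawPipeline, Box<dyn std::error::Error>> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}
```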
Looking deeper at the configuration building process, the configuration compiler will require the pipelines in order to build the final configuration.
To do so, we'll need to implement the `Pipeline` structure from the previous section. We'll then update the `compile` function to build a `Config` containing the required pipeline components. The compiler will load the pipelines' transforms and add the `outputs` to the corresponding `sinks`.
The components coming from the pipeline will be cloned into the final `Config`, in the `IndexMap` containing the `transforms`, and the `outputs` from the pipeline components will be added to the `inputs` field of the components they refer to.
For example, consider the following configuration and pipeline, and their equivalent once built:
```toml
# /etc/vector/vector.toml
[sources.in]
# ...

[sinks.out]
# ...
```

```toml
# /etc/vector/pipelines/foo.toml
[transforms.bar]
inputs = ["in"]
outputs = ["out"]
# ...
```

```toml
# equivalent once compiled
[sources.in]
# ...

[transforms."foo#bar"]
inputs = ["in"]

[sinks.out]
# the # notation is just a representation of the pipeline namespace
inputs = ["foo#bar"]
# ...
```
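A rough sketch of that compilation step, working on plain string ids and using the `pipeline#transform` notation from the example above (the types and function below are simplified assumptions, not Vector's actual compiler):

```rust
use std::collections::HashMap;

// Simplified view of the root configuration: only the wiring is modeled here.
struct SimplifiedConfig {
    transforms: HashMap<String, Vec<String>>,  // transform id -> inputs
    sink_inputs: HashMap<String, Vec<String>>, // sink id -> inputs
}

struct SimplifiedPipeline {
    id: String,
    // transform name -> (inputs, outputs)
    transforms: HashMap<String, (Vec<String>, Vec<String>)>,
}

fn compile(config: &mut SimplifiedConfig, pipelines: &[SimplifiedPipeline]) {
    for pipeline in pipelines {
        for (name, (inputs, outputs)) in &pipeline.transforms {
            // `bar` in pipeline `foo` becomes `foo#bar`. Inputs referring to
            // other transforms of the same pipeline would need the same
            // rescoping; this is omitted here for brevity.
            let scoped = format!("{}#{}", pipeline.id, name);
            config.transforms.insert(scoped.clone(), inputs.clone());
            // Wire the pipeline transform into the components listed in `outputs`.
            for output in outputs {
                config
                    .sink_inputs
                    .entry(output.clone())
                    .or_default()
                    .push(scoped.clone());
            }
        }
    }
}
```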
In order to avoid internal conflicts with the pipeline components' `id`s, the internal representation of component `id`s will be changed to the following struct:
```rust
struct ComponentId {
    name: String,
    scope: ComponentScope,
}

enum ComponentScope {
    // Defined in the root configuration.
    Global,
    // Defined in the pipeline with the given id.
    Pipeline(String),
}
```
That way, if a transform `foo` is defined in the pipeline `bar` and in the pipeline `baz`, they will not conflict.
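To keep the `pipeline#transform` notation from the compiled example and to expose the pipeline id for observability, the scoped id could carry a little helper logic. A sketch, repeating the definitions above with the derives and methods they would plausibly need (the method names are assumptions):

```rust
use std::fmt;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum ComponentScope {
    Global,
    Pipeline(String),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ComponentId {
    name: String,
    scope: ComponentScope,
}

impl ComponentId {
    // The pipeline id for pipeline-scoped components, `None` for components
    // defined in the root configuration.
    fn pipeline_id(&self) -> Option<&str> {
        match &self.scope {
            ComponentScope::Global => None,
            ComponentScope::Pipeline(id) => Some(id),
        }
    }
}

impl fmt::Display for ComponentId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.scope {
            ComponentScope::Global => write!(f, "{}", self.name),
            // Matches the `pipeline#transform` representation used earlier.
            ComponentScope::Pipeline(id) => write!(f, "{}#{}", id, self.name),
        }
    }
}
```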
Users should be able to observe and monitor individual pipelines.
This means relevant metrics coming from the `internal_metrics` source must contain a `pipeline_id` tag referring to the pipeline's `id`.
This approach would extend RFC 2064 by simply adding `pipeline_id` to the context.
In Vector, once the topology is built from the configuration, every component is encapsulated in a `Task` that intercepts an incoming event and processes it accordingly. This task also keeps track of its internal metrics and finally emits `internal_metrics` events.
To add the pipeline information to the task, we need to change the `name` parameter to `id: ComponentId` in the `Task::new` method.
```rust
pub struct Task {
    #[pin]
    inner: BoxFuture<'static, Result<TaskOutput, ()>>,
    // Replaces the previous `name: String` field, carrying the pipeline scope.
    id: ComponentId,
    typetag: String,
}

impl Task {
    pub fn new<S, Fut>(id: ComponentId, typetag: S, inner: Fut) -> Self
    where
        S: Into<String>,
        Fut: Future<Output = Result<TaskOutput, ()>> + Send + 'static,
    {
        Self {
            inner: inner.boxed(),
            id,
            typetag: typetag.into(),
        }
    }
}
```
That way, when Vector spawns a new transform task, it will be able to add the optional pipeline information to the span.
```rust
let span = error_span!(
    "transform",
    component_kind = "transform",
    component_name = %task.name(),
    component_type = %task.typetag(),
    pipeline_id = %task.pipeline_id(),
);
```
Doing so, each time the task emits an internal event, the event will be populated with the optional `pipeline_id`.
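Because root-scoped components have no pipeline, the field has to be recorded conditionally. One possible shape, using tracing's empty-field mechanism and the hypothetical `pipeline_id()` accessor sketched earlier:

```rust
use tracing::{error_span, field};

// Declare the field up front and fill it in only for pipeline-scoped components.
let span = error_span!(
    "transform",
    component_kind = "transform",
    component_name = %task.name(),
    component_type = %task.typetag(),
    pipeline_id = field::Empty,
);
if let Some(pipeline_id) = task.id().pipeline_id() {
    span.record("pipeline_id", pipeline_id);
}
```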
Why is this change worth it?
- These changes provide a way for Vector users to split their configuration in a way that improves the collaboration between ops and devs.
- This split will be made by allowing the creation of individual pipeline configuration files, intended to align with services and teams, enabling autonomous management.
- The ops will now configure their `sources`, `sinks`, and `transforms`, and expose them to be used by the devs.
- The devs will have the ability to consume the provided components without being able to change the common configuration.
- The users will be able to monitor their pipelines through the `internal_metrics` source.
- Helps Vector grow organically within an organization by allowing teams to adopt Vector at their own pace without heavy administrator involvement.
- Reduces the management overhead of devops/SREs by enabling teams to manage their own pipelines (spread the management load).
What is the impact of not doing this?
- This would force users to maintain complex configuration files and/or to duplicate component configuration across configuration files.
How does this position us for success in the future?
- With this representation, we'll be able to add access control by, for example, declaring the pipelines inside the configuration files to limit the reachable components. We would also be able to specify a quota for each pipeline.
- Why should we not do this?
- What kind of ongoing burden does this place on the team?
- Do nothing: we can already use several configuration files, and people could split their existing configuration.
This would imply some duplication if a transform is used in multiple configuration files.
Anybody who has write access to Vector's configuration folder could add a sink or source.
Adding a different folder allows separating concerns between a `root` config and a `pipeline`.
- Do nothing: write a tool that would generate a big configuration file where each pipeline starts with a dummy filter that we could monitor in the `internal_metrics`.
Writing a separate tool would increase the difficulty of using this feature.
Doesn't add access control regarding who can edit the `root` config.
- Evolve Vector to use a tag/filter model like our competitors; have a 'pipeline' be a 'tag'.
This doesn't allow adding internal metrics to specific transforms and monitoring them, other than by adding a dummy filter that we could monitor.
Doesn't add access control regarding who can edit the `root` config.
- Run a single Vector instance per 'pipeline' and support metric tagging to distinguish at the telemetry level.
Adds a lot of complexity and would add some constraints regarding resources that can only be used once. Doesn't prevent anyone from creating other sources/sinks.
- Create the Pipeline structure and parse a pipeline's configuration file
- Update compiler to take the pipelines into consideration during validation
- Update topology with pipeline's components
- Update the context to carry the pipeline information
- Implement a mechanism to prevent Vector from stopping when a pipeline is misconfigured. This could be done by simply logging the error and ignoring the pipeline.
- Allow customizing the pipelines location.