This RFC proposes adding an HTTP API for Vector's configuration reloading process. This will allow an enterprise control plane to more directly control a running instance and provide a path for synchronous runtime feedback that doesn't currently exist.
- RFC: Observability Pipelines Worker (internal)
- Observability Pipelines Worker Implementation (internal)
- Integrating an experimental API that can initiate a reload and return success or failure
- Refactoring Vector's main application loop to better support such an API
- Allowing configuration error messages to be passed back via the API
- Deciding on the future of Vector's API as a public-facing feature
- Finalizing the representation of items like configuration errors
- Changing the actual reloading semantics of Vector in any way
- Implementing APIs for any other type of control actions
- Updating k8s resource definitions as part of a reload process
Our enterprise control plane component needs to initiate configuration reloads in Vector and report back success or failure. Currently, the only way to do that would be to manipulate the config files on disk, ensure Vector is watching for changes, and parse the log output to determine whether or not the reload succeeded. This would be difficult to implement and likely unreliable.
At startup, Vector will look for an environment variable (`VECTOR_CONTROL_SOCKET_PATH`). If that variable is present and contains a valid path, it will start an HTTP control server listening on a Unix socket at that path.
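As a rough sketch of that startup check (assuming a tokio runtime; the hand-rolled response below is only a stand-in for a real HTTP implementation such as hyper, and the function name is hypothetical):

```rust
use tokio::io::AsyncWriteExt;
use tokio::net::UnixListener;

/// Hypothetical startup hook: start the control server only when the
/// environment variable is present.
async fn maybe_start_control_server() -> std::io::Result<()> {
    let path = match std::env::var("VECTOR_CONTROL_SOCKET_PATH") {
        Ok(path) => path,
        Err(_) => return Ok(()), // variable absent: no control server
    };

    let listener = UnixListener::bind(&path)?;
    tokio::spawn(async move {
        loop {
            if let Ok((mut stream, _addr)) = listener.accept().await {
                // Placeholder: a real implementation would parse the HTTP
                // request here and dispatch to the reload handler.
                let _ = stream
                    .write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 0\r\n\r\n")
                    .await;
            }
        }
    });

    Ok(())
}
```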
The control server will support a single route at `POST /config`, which will accept a JSON-encoded version of a Vector configuration (excluding global and otherwise non-reloadable fields). Similar to config received from a provider implementation, input globs will be expanded but environment variables will not be interpolated. The HTTP response will synchronously indicate the success or failure of the triggered reload, including any error messages scoped by component name.
An example request would be as follows:
```
POST /config

{
  "sources": {
    "in": {
      "type": "stdin"
    }
  },
  "sinks": {
    "out": {
      "inputs": [
        "in"
      ],
      "type": "console",
      "encoding": {
        "codec": "text"
      }
    }
  }
}
```
Such a request would receive a simple `200 OK` response when the reload succeeded, or a 400-level response with a JSON structure in the body describing any errors that occurred:
```
400 Bad Request

{
  "errors": {
    "sources": {
      "foo": [
        "unexpected field bar at quux",
        ...
      ]
    },
    "transforms": {},
    "sinks": {},
    "global": {}
  }
}
```
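For illustration, a handler along these lines could deserialize such a body and surface parse failures for the error response. This is a minimal sketch using serde; the `ConfigBuilder` here is a stand-in for Vector's real builder type, which already supports deserialization:

```rust
use serde::Deserialize;
use serde_json::Value;
use std::collections::HashMap;

// Stand-in for Vector's config builder; the real type deserializes full
// component configurations rather than raw JSON values.
#[derive(Deserialize, Default)]
#[serde(default, deny_unknown_fields)]
struct ConfigBuilder {
    sources: HashMap<String, Value>,
    transforms: HashMap<String, Value>,
    sinks: HashMap<String, Value>,
}

/// Turn the raw request body into a builder, or an error message suitable
/// for inclusion in the 400 response.
fn parse_request_body(body: &[u8]) -> Result<ConfigBuilder, String> {
    serde_json::from_slice(body).map_err(|e| e.to_string())
}
```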
This feature will initially be considered an experimental implementation detail of our control plane product; it will not be documented or otherwise intended for other use cases at this point.
The goal will be to implement a very simple, local-only API via which external processes can hook directly into Vector's reloading process. An important difference from existing solutions like providers is that the interaction will be synchronous: each reload attempt will receive a response indicating success or failure, along with any error messages.
Currently, all of Vector's reloading ultimately happens in the main run loop, where we watch for events like a signal, crash of a component, etc, and execute the corresponding changes to the topology (slightly different depending on the source of the signal). The first challenge will be to consolidate the logic of what needs to happen when a config changes (e.g. updating GraphQL API state, enterprise reporting, internal events, etc). A rough initial pass at that process is included with the RFC PR to help show dependencies. The next step would likely be to extract a `TopologyController` or similar struct that owns the topology and everything that needs to change along with it. That struct can then present a simple API for reloads that ensures everything is done in a consistent manner, no matter the source.
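One possible shape for that struct, as a sketch only: the stub types below stand in for Vector's real `RunningTopology` and config types, and the field set and method signature are assumptions.

```rust
// Stand-ins for Vector's real types.
struct RunningTopology;
struct Config;

/// Owns the running topology plus everything that must change with it.
struct TopologyController {
    topology: RunningTopology,
    // ...GraphQL API state, enterprise reporting handles, etc.
}

impl TopologyController {
    /// The single entry point for reloads, regardless of whether the
    /// trigger was a signal, a file watcher, a provider, or the control API.
    async fn reload(&mut self, _new_config: Config) -> Result<(), Vec<String>> {
        // 1. Build and validate the new topology components.
        // 2. Swap them into the running topology, rolling back on failure.
        // 3. Update API state, reporting, and emit internal events.
        Ok(())
    }
}
```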
Once that struct is in place, it can be wrapped in an `Arc<Mutex<_>>` or similar such that it can be shared across both the existing `Signal`-focused run loop and a new control server. The new control server itself can be a very simple task spawned prior to entering the main loop. As a local-only control API, it can differ from most HTTP servers in a few ways:
- By listening on a Unix socket instead of the network, it can rely on filesystem permissions for authentication rather than its own scheme.
- Because it's designed for use with a single local client, it does not need to support concurrent processing of any kind. If it can't immediately acquire the lock on the `TopologyController`, it can fail and tell the client to try again (sketched below).
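A sketch of that non-blocking behavior, reusing the `TopologyController` sketch above and assuming the controller is shared as an `Arc<tokio::sync::Mutex<_>>`:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

/// Called by the control server for each `POST /config` request.
async fn handle_reload(
    controller: Arc<Mutex<TopologyController>>,
    new_config: Config,
) -> Result<(), String> {
    // Never block on the lock: with a single local client, contention means
    // another reload is already in flight, so tell the client to retry.
    let mut controller = controller
        .try_lock()
        .map_err(|_| "reload already in progress; try again".to_string())?;

    controller
        .reload(new_config)
        .await
        .map_err(|errors| errors.join("\n"))
}
```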
With the server and controller in place, the path for synchronous feedback will exist. The next step is to adjust the implementation of `RunningTopology::reload_config_and_respawn` away from its current return value of `Result<bool, ()>` and towards something that actually returns error messages. This will involve a fair bit of work down into the reloading stack, where we currently tend to emit errors as they happen. The likely solution will be refactoring to build up an error report of some kind that can be reported and/or returned at the end of the whole process.
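One possible shape for that report, mirroring the per-scope grouping in the error response example above (the type and field names are assumptions, not existing Vector types):

```rust
use std::collections::HashMap;

/// Accumulated errors from a reload attempt, grouped the same way as the
/// JSON error response shown earlier.
#[derive(Default)]
struct ReloadErrors {
    sources: HashMap<String, Vec<String>>,
    transforms: HashMap<String, Vec<String>>,
    sinks: HashMap<String, Vec<String>>,
    global: HashMap<String, Vec<String>>,
}

impl ReloadErrors {
    fn is_empty(&self) -> bool {
        self.sources.is_empty()
            && self.transforms.is_empty()
            && self.sinks.is_empty()
            && self.global.is_empty()
    }
}

// The reload method's signature would then become something like:
// fn reload_config_and_respawn(&mut self, new: Config) -> Result<(), ReloadErrors>
```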
This approach is designed to be relatively minimal while still achieving all of the functionality needed for a good user experience. There are certainly deeper changes that could (and should) be made around the design of this part of Vector, and what we've set out here should be a good first step towards that future while providing immediate user value. This design should also be relatively easy to expand and evolve over time should we decide to adjust any aspect of it.
The main drawback is the overlap in functionality with existing pieces of Vector. These include:
- The GraphQL API: This is already "the API" and some confusion could result from introducing another (see alternatives section below for a discussion of why we're not reusing it).
- Providers: These are an existing mechanism for supplying configuration from somewhere other than the standard config files. Some, but not all, of the functionality here could be implemented as a provider.
- File-based configs: It could be confusing that this approach can be active alongside the normal config loading/reloading processes rather than being mutually exclusive with them. In theory, some strange interactions between the two are possible.
What's proposed here is similar to the APIs offered by Consul, Docker, and Caddy, in that all of them are relatively simple JSON-based HTTP APIs. Docker defaults to using a Unix socket just as we're proposing here, and the Caddy docs also specifically recommend a Unix socket for better security (though they don't default to using one).
HAProxy is more of a raw control socket than an API, which doesn't seem to provide any benefit for our use case. Rather than sending commands directly over a socket, we'd likely prefer the Docker model, where a client binary communicates over a JSON API.
Instead of adding a new control-specific API, we could work with the existing GraphQL API to add a mutation for updating the config. The main benefit to this approach would be to maintain only a single API within Vector, which would reduce the need for potentially duplicative configuration, documentation, etc. There are also some potential future benefits to GraphQL if we later decided to add some richer functionality to the API that would benefit from GraphQL's specific capabilities.
The main reason this RFC does not take that approach is related to code overhead and maintenance. While feature-rich, GraphQL is a relatively complex technology that very few on the Vector team have any experience with. There is a heavy reliance on proc macros for implementation of the server, and this complexity carries over into the client as well. It has proven to be a significant maintenance burden and it's unlikely that we want to continue investing in it for Vector's use case. By starting a new, simpler API foundation, we can set the stage to migrate current GraphQL API use cases towards technology more suited to the team maintaining it.
On the other hand, nothing in this RFC requires that such a migration happen. Since the code being added is isolated and relatively minimal, it should be easily compatible with any future decisions about Vector's API as a public-facing feature. This includes possibilities like being merged into the GraphQL API as a mutation, or even being removed entirely if the product moves in a different direction.
Likely the smallest change we could make would be to allow a feedback mechanism for providers. This would allow us to implement the control server as a provider with fewer structural changes to Vector's main run loop. Concretely, this could look something like adding an optional oneshot response channel to the `SignalTo::ReloadFromConfigBuilder` variant used by providers.
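Sketched concretely (the variant and field shapes here are assumptions based on the description above, not Vector's actual definitions):

```rust
use tokio::sync::oneshot;

// Stand-in for Vector's config builder type.
struct ConfigBuilder;

/// Signals consumed by the main run loop (other variants elided).
enum SignalTo {
    ReloadFromConfigBuilder {
        builder: ConfigBuilder,
        // Optional channel the run loop would use to report the reload's
        // outcome back to the provider that sent the signal.
        response: Option<oneshot::Sender<Result<(), Vec<String>>>>,
    },
}
```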
We don't take this route because the signal-based focus of reloading is already overloaded enough to make the process difficult to follow. It's not a good match for what we're trying to do and would likely further obscure the logic, making bugs more difficult to track down. In fact, given the limited uptake of the provider system in general, it's possible that it should be reworked to fit the same pattern as the new control server. This could be done as future work.
Another option would be to branch earlier in `Application::run`, make options related to the control server exclusive with normal options like `--config`, and have more of a separate mode of operation for this functionality. This would remove some concerns about potential strange interactions between multiple paths for reloading (e.g. watching a file while running a control server while running an HTTP provider) and potentially require less refactoring work than what's proposed in this RFC.
This approach wasn't chosen because the divergence for "normal" Vector operation would itself be a path for potentially confusing differences in behavior. It also would not move us closer to an improved overall design for this portion of Vector, and would create another path that we would later likely want to unify.
Instead of a REST-ish JSON API, we could use gRPC. We already use the technology in other areas of Vector (native source and sink), so it wouldn't have the same downside as GraphQL, while providing some of the benefits of an IDL.
This approach was not chosen for a few reasons:
- Protobuf adds complexity (i.e. code generation) to any client implementation.
- The ability to interact with the API from simple tools like cURL is lost.
- The only current endpoint accepts a Vector config, for which we already have a JSON Schema and we would not want to translate to Protobuf.
- Many of the benefits of gRPC are around high performance, load balancing, health checking, etc, none of which is relevant to a small local control API.
- What is the expected flow for setting global options at startup (especially those that can't be changed via reload)? Use normal file loading with only global options set?
  - For this use case, we can bootstrap with a config file containing global options and the enterprise internal metrics pipeline.
- Extract controller struct to consolidate things that change with reloads
- Write a simple Unix socket HTTP server task with a `POST /config` route that builds/validates config
- Wrap and share controller struct with server task so it can actually do reloads
- Refactor topology reload method to return errors rather than log directly
- Write an RFC for differentiating types of configuration (e.g. global vs component, reloadable vs not) and modeling them better within Vector. Then use that model to draw a clearer distinction between what should be present in a global config file vs what's loadable via the API, make the two appropriately exclusive, and reduce the number of possible weird interactions from running multiple config loading mechanisms concurrently.
- Investigate the needs and current implementation of `top` and `tap` to inform an RFC on the future of a unified Vector API.
- Improve the internal capabilities/modularity of reloads and expose that via the API, allowing for things like adding individual components rather than replacing the current config wholesale, inspecting the current config, etc.
- Add a command line interface that can operate Vector via this API (e.g. `vector reload`).