RFC 541 - 2022-12-09 - Reload API

This RFC proposes adding an HTTP API for Vector's configuration reloading process. This will allow an enterprise control plane to more directly control a running instance and provide a path for synchronous runtime feedback that doesn't currently exist.

Context

Cross cutting concerns

Scope

In scope

  • Integrating an experimental API that can initiate a reload and return success or failure
  • Refactoring Vector's main application loop to better support such an API
  • Allowing configuration error messages to be passed back via the API

Out of scope

  • Deciding on the future of Vector's API as a public-facing feature
  • Finalizing the representation of items like configuration errors
  • Changing the actual reloading semantics of Vector in any way
  • Implementing APIs for any other type of control actions
  • Updating k8s resource definitions as part of a reload process

Pain

Our enterprise control plane component needs to initiate configuration reloads in Vector and report back success or failure. Currently, the only way to do that would be to manipulate the config files on disk, ensure Vector is watching for changes, and parse the log output to determine whether or not the reload succeeded. This would be difficult to implement and likely unreliable.

Proposal

User Experience

At startup, Vector will look for an environment variable (VECTOR_CONTROL_SOCKET_PATH). If that variable is present and contains a valid path, it will start an HTTP control server listening on a Unix socket at that path.
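As a rough sketch of that startup gate (not Vector's actual code; the function name and error handling here are illustrative), the check could look something like this:

use std::env;
use std::os::unix::net::UnixListener;

// Only start the control server when the variable is present and non-empty;
// otherwise Vector behaves exactly as it does today.
fn maybe_bind_control_socket() -> std::io::Result<Option<UnixListener>> {
    let path = match env::var("VECTOR_CONTROL_SOCKET_PATH") {
        Ok(p) if !p.is_empty() => p,
        _ => return Ok(None),
    };
    // In Vector proper, this listener would be handed to an HTTP server task
    // serving the POST /config route described below.
    UnixListener::bind(&path).map(Some)
}

fn main() -> std::io::Result<()> {
    if let Some(_listener) = maybe_bind_control_socket()? {
        // spawn the control server task here
    }
    Ok(())
}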

The control server will support a single route at POST /config, which will accept a JSON-encoded version of a Vector configuration (excluding global and otherwise non-reloadable fields). Similar to config received from a provider implementation, input globs will be expanded but environment variables will not be interpolated. The HTTP response will synchronously indicate the success or failure of the triggered reload, including any error messages scoped by component name.

An example request would be as follows:

POST /config

{
  "sources": {
    "in": {
      "type": "stdin"
    }
  },
  "sinks": {
    "out": {
      "inputs": [
        "in"
      ],
      "type": "console",
      "encoding": {
        "codec": "text"
      }
    }
  }
}

Such a request would receive a simple 200 OK response if the reload succeeds, or a 400-level response with a JSON structure in the body describing any errors that occurred:

400 Bad Request

{
  "errors": {
    "sources": {
      "foo": [
        "unexpected field bar at quux",
        ...
      ]
    },
    "transforms": {},
    "sinks": {},
    "global": {}
  }
}

This feature will initially be considered an experimental implementation detail of our control plane product; it will not be documented or otherwise intended for other use cases at this point.

Implementation

The goal will be to implement a very simple, local-only API via which external processes can hook directly into Vector's reloading process. An important difference from existing solutions like providers is that the interaction will be synchronous, i.e. each reload attempt will receive a response indicating success or failure, along with any error messages.

Currently, all of Vector's reloading ultimately happens in the main run loop, where we watch for events like a signal, the crash of a component, etc., and execute the corresponding changes to the topology (slightly different depending on the source of the signal). The first challenge will be to consolidate the logic of what needs to happen when a config changes (e.g. updating GraphQL API state, enterprise reporting, internal events, etc.). A rough initial pass at that process is included with the RFC PR to help show dependencies. The next step would likely be to extract a TopologyController or similar struct that owns the topology and everything that needs to change along with it. That struct can then present a simple API for reloads that ensures everything is done in a consistent manner, no matter the source.
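As a rough illustration of the shape that struct could take (all names and fields here are placeholders rather than Vector's actual types):

struct RunningTopology;           // placeholder for Vector's running topology
struct Config;                    // placeholder for a built configuration
struct ReloadErrors(Vec<String>); // placeholder for an error report

// Everything that has to change together on a reload lives behind one struct,
// so every source of reloads (signal, file watch, control API) goes through
// the same code path.
struct TopologyController {
    topology: RunningTopology,
    // ...plus GraphQL API state, enterprise reporting handles, etc.
}

impl TopologyController {
    fn reload(&mut self, _new_config: Config) -> Result<(), ReloadErrors> {
        // 1. diff against the running topology and respawn changed components
        // 2. update API / reporting state to match the new config
        // 3. emit the relevant internal events
        Ok(())
    }
}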

Once that struct is in place, it can be wrapped in an Arc<Mutex<_>> or similar such that it can be shared across both the existing Signal-focused run loop and a new control server. The new control server itself can be a very simple task spawned prior to entering the main loop. As a local-only control API, it can differ from most HTTP servers in a few ways (a rough sketch follows the list below):

  • By listening on a Unix socket instead of the network, it can rely on filesystem permissions for authentication rather than its own scheme.
  • Because it's designed for use with a single local client, it does not need to support concurrent processing of any kind. If it can't immediately acquire the lock on the TopologyController, it can fail and tell the client to try again.
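A minimal sketch of that fail-fast behavior, assuming the controller is shared behind a tokio Mutex (status codes and names are illustrative):

use std::sync::Arc;
use tokio::sync::Mutex;

struct TopologyController; // stand-in for the struct described above

// The control server holds one clone of this Arc; the main run loop holds another.
async fn handle_config_request(shared: Arc<Mutex<TopologyController>>) -> u16 {
    match shared.try_lock() {
        Ok(_controller) => {
            // build and validate the config, then call the controller's reload
            200
        }
        // A reload is already in flight: tell the single local client to retry
        // rather than queueing concurrent requests.
        Err(_) => 503,
    }
}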

With the server and controller in place, the path for synchronous feedback will exist. The next step is to adjust the implementation of RunningTopology::reload_config_and_respawn away from the current return value of Result<bool, ()> and towards something that actually returns error messages. This will involve a fair bit of work down into the reloading stack, where we currently tend to emit errors as they happen. The likely solution will be refactoring to build up an error report of some kind that can be reported and/or returned at the end of the whole process.
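One possible shape for that richer return value, mirroring the error body shown in the User Experience section (type and field names here are hypothetical, not a final representation):

use std::collections::HashMap;

// Error messages accumulated during a reload attempt, keyed by component name
// within each section, matching the JSON error body shown earlier.
#[derive(Debug, Default)]
struct ReloadErrors {
    sources: HashMap<String, Vec<String>>,
    transforms: HashMap<String, Vec<String>>,
    sinks: HashMap<String, Vec<String>>,
    global: HashMap<String, Vec<String>>,
}

// Instead of Result<bool, ()>, the reload path would build up this report and
// return it to the caller, which can serialize it for the HTTP response.
type ReloadResult = Result<(), ReloadErrors>;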

Rationale

This approach is designed to be relatively minimal while still achieving all of the functionality needed for a good user experience. There are certainly deeper changes that could (and should) be made around the design of this part of Vector, and what we've set out here should be a good first step towards that future while providing immediate user value. This design should also be relatively easy to expand and evolve over time should we decide to adjust any aspect of it.

Drawbacks

The main drawback is the overlap in functionality with existing pieces of Vector. These include:

  • The GraphQL API: This is already "the API" and some confusion could result from introducing another (see alternatives section below for a discussion of why we're not reusing it).
  • Providers: These are an existing mechanism for supplying configuration from somewhere other than the standard config files. Some but not all of the functionality here could be implemented as a provider.
  • File-based configs: It could be confusing that this approach does not exclude the normal config loading/reloading processes from remaining active. In theory, some strange interactions between the two are possible.

Prior Art

What's proposed here is similar to Consul, Docker, and Caddy, in that they're relatively simple JSON-based HTTP APIs. Docker defaults to using a Unix socket just as we're proposing here, and the Caddy docs also specifically recommend a Unix socket for better security (though they don't default to using one).

HAProxy is more of a raw control socket than an API, which doesn't seem to provide any benefit for our use case. Rather than sending commands directly over a socket, we'd likely prefer the Docker model, where a client binary communicates over a JSON API.

Alternatives

Extend the GraphQL API

Instead of adding a new control-specific API, we could work with the existing GraphQL API to add a mutation for updating the config. The main benefit to this approach would be to maintain only a single API within Vector, which would reduce the need for potentially duplicative configuration, documentation, etc. There are also some potential future benefits to GraphQL if we later decided to add some richer functionality to the API that would benefit from GraphQL's specific capabilities.

The main reason this RFC does not take that approach is related to code overhead and maintenance. While feature-rich, GraphQL is a relatively complex technology that very few on the Vector team have any experience with. There is a heavy reliance on proc macros for implementation of the server, and this complexity carries over into the client as well. It has proven to be a significant maintenance burden and it's unlikely that we want to continue investing in it for Vector's use case. By starting a new, simpler API foundation, we can set the stage to migrate current GraphQL API use cases towards technology more suited to the team maintaining it.

On the other hand, nothing in this RFC requires that such a migration happen. Since the code being added is isolated and relatively minimal, it should be easily compatible with any future decisions about Vector's API as a public-facing feature, including possibilities like being merged into the GraphQL API as a mutation or even being removed entirely if the product moves in a different direction.

Make providers more synchronous

Likely the smallest change we could make would be to allow a feedback mechanism for providers. This would allow us to implement the control server as a provider with fewer structural changes to Vector's main run loop. Concretely, this could look something like adding an optional oneshot response channel to the SignalTo::ReloadFromConfigBuilder variant used by providers.
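Roughly, that could look like the following sketch, where the response channel is the hypothetical addition and the surrounding names follow the existing ones mentioned above:

use tokio::sync::oneshot;

struct ConfigBuilder; // placeholder for Vector's config builder type

enum SignalTo {
    ReloadFromConfigBuilder {
        builder: ConfigBuilder,
        // Hypothetical addition: when present, the run loop sends the outcome
        // of the reload (success, or error messages) back through this channel.
        respond_to: Option<oneshot::Sender<Result<(), Vec<String>>>>,
    },
    // ...other existing variants elided
}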

We don't take this route because the signal-based focus of reloading is already overloaded enough to make the process difficult to follow. It's not a good match for what we're trying to do and would likely further obscure the logic, making bugs more difficult to track down. In fact, given the limited uptake of the provider system in general, it's possible that it should be reworked to fit the same pattern as the new control server. This could be done as future work.

More isolated control socket mode

Another option would be to branch earlier in Application::run, make options related to the control server mutually exclusive with normal options like --config, and have more of a separate mode of operation for this functionality. This would remove some concerns about potential strange interactions between multiple paths for reloading (e.g. watching a file while running a control server while running an HTTP provider) and potentially require less refactoring work than what's proposed in this RFC.

This approach wasn't chosen because the divergence from "normal" Vector operation would itself be a path for potentially confusing differences in behavior. It also would not move us closer to an improved overall design for this portion of Vector, and would create another path that we would later likely want to unify.

gRPC instead of JSON

Instead of a REST-ish JSON API, we could use gRPC. We already use the technology in other areas of Vector (native source and sink), so it wouldn't have the same downside as GraphQL, while providing some of the benefits of an IDL.

This approach was not chosen for a couple of reasons:

  • Protobuf adds complexity (i.e. code generation) to any client implementation.
  • The ability to interact with the API from simple tools like cURL is lost.
  • The only current endpoint accepts a Vector config, for which we already have a JSON Schema that we would not want to translate to Protobuf.
  • Many of the benefits of gRPC are around high performance, load balancing, health checking, etc, none of which is relevant to a small local control API.

Outstanding Questions

  • What is the expected flow for setting global options at startup (especially those that can't be changed via reload)? Use normal file loading with only global options set?
    • For this use case, we can bootstrap with a config file containing global options and the enterprise internal metrics pipeline.

Plan Of Attack

  • Extract controller struct to consolidate things that change with reloads
  • Write simple Unix socket HTTP server task with POST /config route that builds/validates config
  • Wrap and share controller struct with server task so it can actually do reloads
  • Refactor topology reload method to return errors rather than log directly

Future Improvements

  • Write an RFC for differentiating types of configuration (e.g. global vs component, reloadable vs not) and modeling them better within Vector. Then use this to draw a clearer distinction between what should be present in a global config file vs what's loadable via the API, make the two appropriately exclusive, and reduce the number of possible weird interactions from running multiple config loading mechanisms concurrently.
  • Investigate the needs and current implementation of top and tap to inform an RFC on the future of a unified Vector API.
  • Improve the internal capabilities/modularity of reloads and expose that via the API, allowing for things like adding individual components rather than replacing current config wholesale, inspecting the current config, etc.
  • Add a command line interface that can operate Vector via this API (e.g. vector reload).