Schema metadata RFC #4599
Labels: `domain: processing`, `type: task`
As a first step towards supporting various schemas, we want to shift the event metadata into a Vector-specific namespace. This will solve a number of awkward situations where Vector's metadata clashes with the user's schema.
Examples
Transitioning from Logstash
Let's look at a simple example where a user is replacing Logstash with Vector. Vector would receive data from upstream beats over TCP in the following format:
Here we can see that the event is already enriched with metadata. When the event leaves the Vector `socket` source it'll be formatted as such:
Right off the bat this is awkward since we've constructed an event that does not resemble what they are expecting. To solve this the user must run the event through the `json_parser`, which would result in:
This is looking better, but notice we have a `timestamp` and an `@timestamp` field with different values. When this event is encoded within a sink, we'll use the wrong timestamp.
This should be enough to demonstrate the awkwardness of this approach.
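The flow described above can be sketched in a few lines of Python. This is an illustration, not Vector's implementation: `socket_source` and `json_parser` below are hypothetical stand-ins for the real components, the sample payload and its values are made up, and the parser is assumed to drop the parsed field and merge the result into the event root.

```python
import json

# Hypothetical upstream payload; field names and values are illustrative only.
upstream_line = json.dumps({
    "@timestamp": "2020-10-01T12:00:00Z",
    "host": "web-01",
    "message": "user logged in",
})


def socket_source(line, received_at):
    """Stand-in for a source that wraps the raw line and adds its own timestamp."""
    return {"message": line, "timestamp": received_at}


def json_parser(event, field="message"):
    """Stand-in for a parser that replaces `field` with its parsed JSON keys."""
    parsed = json.loads(event[field])
    merged = {key: value for key, value in event.items() if key != field}
    merged.update(parsed)
    return merged


event = socket_source(upstream_line, received_at="2020-10-01T12:00:05Z")
parsed = json_parser(event)

# Both timestamp fields now sit side by side at the root of the event.
print(parsed["timestamp"], parsed["@timestamp"])
```

The parsed event carries the receive time under `timestamp` and the upstream time under `@timestamp`, which is the clash the example above describes.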
Proposals
Below are a couple of proposals that have been discussed. As part of the RFC you'll need to choose the best approach and propose it.
Vector metadata namespace
Instead of polluting the user's namespace with Vector-specific fields, `timestamp` and `message` in our example above, we could shove them in a Vector-specific namespace:
This solves the key clashing issues in that we are no longer polluting the user's namespace.
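As a rough sketch of the idea, assuming the namespace is a single root-level key named `_vector` (the name used in the questions later in this issue), the Vector-owned fields could sit beside the user's fields without overwriting them. The helper `attach_vector_metadata` and the sample values are hypothetical.

```python
VECTOR_NAMESPACE = "_vector"  # assumed key name, taken from this issue's questions


def attach_vector_metadata(user_event, message, timestamp):
    """Return a copy of the user's event with Vector's fields nested under one key."""
    event = dict(user_event)
    event[VECTOR_NAMESPACE] = {"message": message, "timestamp": timestamp}
    return event


# A user schema that already defines its own `timestamp` and `message` fields.
user_event = {"timestamp": "2020-10-01T12:00:00Z", "message": "user logged in"}

event = attach_vector_metadata(
    user_event,
    message='{"raw": "line as received"}',
    timestamp="2020-10-01T12:00:05Z",
)

# The user's fields are untouched; Vector's live under the namespace.
print(event["timestamp"], event[VECTOR_NAMESPACE]["timestamp"])
```

A single reserved key keeps the collision surface to one name, at the cost of every component needing to know where to look for Vector's own fields.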
`raw` metadata key
In addition to the above, we could also move raw data into a specific `raw` key:
This strikes me as cleaner and much more flexible:
- In a `tcp` -> `tcp` pipeline we can cleanly pass this data through.
Alternatives & Prior Art
It is worth exploring Splunk forwarder's approach to this problem. From my understanding, they use a root-level `_raw` key that gets removed once the data is parsed. You can see this in their docs:
Outstanding questions
- Which `timestamp` key should we use within each sink? Could the user tell us this through configuration? Hint: we'll need some sort of schema knowledge that tells us where to look for fields.
- How do we prevent the `_vector` metadata key from being encoded? Ex: we could default the sink-level `encoding.except_fields` to `["_vector"]`, or we could hard-code ignoring this when we encode events.
- What should happen to the `*_key` options across our sources and sinks?
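To make the `encoding.except_fields` idea concrete, here is a minimal sketch of a sink-side encoder that skips the metadata key by default. `encode_event` and `DEFAULT_EXCEPT_FIELDS` are hypothetical names, the sample event is made up, and only top-level field names are handled; this is not how Vector's encoders are actually written.

```python
import json

# Default floated in this issue: hide the Vector namespace unless told otherwise.
DEFAULT_EXCEPT_FIELDS = ["_vector"]


def encode_event(event, except_fields=None):
    """Serialize an event to JSON, omitting any top-level fields listed in except_fields."""
    if except_fields is None:
        except_fields = DEFAULT_EXCEPT_FIELDS
    visible = {key: value for key, value in event.items() if key not in except_fields}
    return json.dumps(visible, sort_keys=True)


event = {
    "@timestamp": "2020-10-01T12:00:00Z",
    "host": "web-01",
    "_vector": {"timestamp": "2020-10-01T12:00:05Z"},
}

default_output = encode_event(event)                   # metadata key omitted
explicit_output = encode_event(event, except_fields=[])  # user opted to keep it
```

Making this a configurable default rather than a hard-coded rule would let a user pass `except_fields=[]` to keep the metadata, for instance when forwarding events from one Vector instance to another.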