Schema metadata RFC #4599

Closed
binarylogic opened this issue Oct 16, 2020 · 1 comment
Labels
domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) type: task Generic non-code related tasks

binarylogic commented Oct 16, 2020

As a first step towards supporting various schemas, we want to shift the event metadata into a Vector-specific namespace. This will solve a number of awkward situations where Vector's metadata clashes with the user's schema.

Examples

Transitioning from Logstash

Let's look at a simple example where a user is replacing Logstash with Vector. Vector would receive data from upstream beats over TCP in the following format:

{
        "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
     "@timestamp": "2013-12-11T08:01:45.000Z",
       "@version": "1",
           "host": "cadenza",
       "clientip": "127.0.0.1",
          "ident": "-",
           "auth": "-",
      "timestamp": "11/Dec/2013:00:01:45 -0800",
           "verb": "GET",
        "request": "/xampp/status.php",
    "httpversion": "1.1",
       "response": "200",
          "bytes": "3891",
       "referrer": "\"http://cadenza/xampp/navi.php\"",
          "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

Here we can see that the event is already enriched with metadata. When the event leaves the Vector socket source it'll be formatted as such:

{
  "timestamp": "...",
  "host": "...",
  "message": "{\n\"message\": \"127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \\\"GET /xampp/status.php HTTP/1.1\\\" 200 3891 \\\"http://cadenza/xampp/navi.php\\\" \\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\",\n\"@timestamp\": \"2013-12-11T08:01:45.000Z\",\n\"@version\": \"1\",\n\"host\": \"cadenza\",\n\"clientip\": \"127.0.0.1\",\n\"ident\": \"-\",\n\"auth\": \"-\",\n\"timestamp\": \"11/Dec/2013:00:01:45 -0800\",\n\"verb\": \"GET\",\n\"request\": \"/xampp/status.php\",\n\"httpversion\": \"1.1\",\n\"response\": \"200\",\n\"bytes\": \"3891\",\n\"referrer\": \"\\\"http://cadenza/xampp/navi.php\\\"\",\n\"agent\": \"\\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\"\n}\n"
}

Right off the bat this is awkward, since we've constructed an event that does not resemble what the user is expecting. To solve this, the user must run the event through the json_parser, which would result in:

{
        "timestamp": "...",
        "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
     "@timestamp": "2013-12-11T08:01:45.000Z",
       "@version": "1",
           "host": "cadenza",
       "clientip": "127.0.0.1",
          "ident": "-",
           "auth": "-",
      "timestamp": "11/Dec/2013:00:01:45 -0800",
           "verb": "GET",
        "request": "/xampp/status.php",
    "httpversion": "1.1",
       "response": "200",
          "bytes": "3891",
       "referrer": "\"http://cadenza/xampp/navi.php\"",
          "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

This is looking better, but notice we have a timestamp and @timestamp field with different values. When this event is encoded within a sink, we'll use the wrong timestamp.

This should be enough to demonstrate the awkwardness of this approach.
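The clash can be reproduced in a few lines. The following is a minimal Python sketch, not Vector's implementation: `wrap_raw` and `parse_json_message` are hypothetical helpers that mimic the source and json_parser behavior described above, and the sketch assumes that parsed fields are merged into the root and overwrite keys that already exist. The upstream payload is shortened for readability.

```python
import json
from datetime import datetime, timezone


def wrap_raw(raw: str, peer_host: str) -> dict:
    """Mimic a source that wraps each received line in Vector's own fields."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": peer_host,
        "message": raw,
    }


def parse_json_message(event: dict) -> dict:
    """Mimic a JSON parsing step: replace `message` with its parsed fields."""
    parsed = json.loads(event["message"])
    merged = {k: v for k, v in event.items() if k != "message"}
    merged.update(parsed)  # assumption: parsed keys overwrite existing ones
    return merged


# A shortened version of the upstream event from the example above.
raw = json.dumps({
    "message": "GET /xampp/status.php",
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "host": "cadenza",
    "timestamp": "11/Dec/2013:00:01:45 -0800",
})

event = parse_json_message(wrap_raw(raw, peer_host="10.0.0.5"))

# Vector's ingest time and peer host are gone: the user's fields replaced them.
print(event["timestamp"])
print(event["@timestamp"])
print(event["host"])
```

Under the opposite assumption (existing keys win), the user's `timestamp` and `host` would be lost instead, so either merge rule loses data.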

Proposals

Below are a couple of proposals that have been discussed. As part of the RFC you'll need to choose the best approach and propose it.

Vector metadata namespace

Instead of polluting the user's namespace with Vector-specific fields (timestamp and message in our example above), we could shove them into a Vector-specific namespace:

{
   "_vector": {
     "timestamp": "...",
     "host": "..."
   },
   "message": "..."
}

This solves the key clashing issues in that we are no longer polluting the user's namespace.
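To illustrate why the namespace removes the clash, here is a small Python sketch under stated assumptions. The helper names `wrap_raw` and `parse_json_message` are hypothetical and only model the proposal; they are not Vector APIs. The `_vector` key name comes from the example above.

```python
import json

VECTOR_NAMESPACE = "_vector"


def wrap_raw(raw: str, ingest_timestamp: str, peer_host: str) -> dict:
    """Keep Vector's own fields under the namespace, not at the root."""
    return {
        VECTOR_NAMESPACE: {"timestamp": ingest_timestamp, "host": peer_host},
        "message": raw,
    }


def parse_json_message(event: dict) -> dict:
    """Promote the parsed user fields to the root and carry the namespace over."""
    merged = dict(json.loads(event["message"]))
    # Assumption: the namespace is reserved, so a user key with the same
    # name would be overwritten here.
    merged[VECTOR_NAMESPACE] = event[VECTOR_NAMESPACE]
    return merged


raw = json.dumps({
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "host": "cadenza",
    "timestamp": "11/Dec/2013:00:01:45 -0800",
})

event = parse_json_message(wrap_raw(raw, "2020-10-16T00:00:00Z", "10.0.0.5"))

# Both timestamps survive, each in its own namespace.
print(event["timestamp"])
print(event[VECTOR_NAMESPACE]["timestamp"])
```

The only remaining collision is a user field literally named `_vector`, which is why a reserved, unlikely key name matters.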

raw metadata key

In addition to the above, we could also move raw data into a specific raw key:

{
   "_vector": {
     "timestamp": "...",
     "host": "...",
     "raw": "..."
   }
}

This strikes me as cleaner and much more flexible:

  1. We retain the raw data which I'm sure is useful in some use cases.
  2. We know if the event is explicitly structured or not. In the tcp -> tcp pipeline we can cleanly pass this data through.
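Both points can be sketched in Python. This is an illustration under assumptions, not a proposed implementation: `from_raw`, `is_structured`, and `encode_passthrough` are hypothetical names, and "structured" is taken to mean "has at least one root-level key other than `_vector`".

```python
import json

VECTOR_NAMESPACE = "_vector"


def from_raw(raw: str, ingest_timestamp: str, peer_host: str) -> dict:
    """Build an event whose only content is Vector metadata plus the raw bytes."""
    return {
        VECTOR_NAMESPACE: {
            "timestamp": ingest_timestamp,
            "host": peer_host,
            "raw": raw,
        }
    }


def is_structured(event: dict) -> bool:
    """Assumption: any root-level key besides the namespace means 'structured'."""
    return any(key != VECTOR_NAMESPACE for key in event)


def encode_passthrough(event: dict) -> str:
    """Forward untouched raw data as-is; otherwise encode the user's fields."""
    if not is_structured(event):
        return event[VECTOR_NAMESPACE]["raw"]
    user_fields = {k: v for k, v in event.items() if k != VECTOR_NAMESPACE}
    return json.dumps(user_fields, sort_keys=True)


unparsed = from_raw("plain text line", "2020-10-16T00:00:00Z", "10.0.0.5")
parsed = dict(unparsed, verb="GET", response="200")

print(encode_passthrough(unparsed))
print(encode_passthrough(parsed))
```

In a tcp -> tcp pipeline with no transforms, the first branch returns exactly what was received, which is the clean pass-through described in point 2.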

Alternatives & Prior Art

It is worth exploring Splunk forwarder's approach to this problem. From my understanding, they use a root-level _raw key that gets removed once the data is parsed. You can see this in their docs:

If events do not have a _raw field, they'll be serialized to JSON prior to being sent.

Outstanding questions

  1. Even though we solved the problem of polluting the user's namespace, how would Vector know which timestamp key to use within each sink? Could the user tell us this through configuration? Hint: we'll need some sort of schema knowledge that tells us where to look for fields.
  2. How do we prevent the _vector metadata key from being encoded? For example, we could default the sink-level encoding.except_fields to ["_vector"], or we could hard-code ignoring this key when we encode events.
  3. What should we do with all of the *_key options across our sources and sinks?
  4. How do we preserve backward compatibility?
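One possible shape for questions 1 and 2, sketched in Python purely for discussion. The `timestamp_key` parameter, the dotted-path syntax, and the `lookup` and `encode` helpers are all hypothetical; only `except_fields` and its suggested `["_vector"]` default come from the questions above.

```python
import json


def lookup(event: dict, path: str):
    """Resolve a dotted path such as "_vector.timestamp"; None if absent.

    Assumption: field names do not themselves contain dots.
    """
    current = event
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def encode(event: dict,
           timestamp_key: str = "_vector.timestamp",
           except_fields: tuple = ("_vector",)):
    """Pick the timestamp the sink should use, then encode without metadata."""
    timestamp = lookup(event, timestamp_key)
    body = {k: v for k, v in event.items() if k not in except_fields}
    return timestamp, json.dumps(body, sort_keys=True)


event = {
    "_vector": {"timestamp": "2020-10-16T00:00:00Z", "host": "10.0.0.5"},
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "verb": "GET",
}

# Default: use Vector's ingest time and hide the namespace from the output.
default_ts, default_body = encode(event)

# A user migrating from Logstash points the sink at their own field instead.
user_ts, user_body = encode(event, timestamp_key="@timestamp")

print(default_ts, default_body)
print(user_ts, user_body)
```

A per-sink option like this would answer question 1 through configuration, but it multiplies the `*_key` options that question 3 is worried about, which is the argument for shared schema knowledge instead.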

jszwedko commented Aug 4, 2022

Closing in lieu of #12187

jszwedko closed this as completed Aug 4, 2022