# Investigate datadog_agent -> remap -> datadog_logs (#8512)
This issue is a record of and parent to the investigation of vector's performance when configured like so:

Inbound `http_gen` is configured with:

Much like in #8263 our method will be to measure inbound throughput and outbound throughput primarily. The outbound target is `http_blackhole`, also from the lading project, and all three components are executed as follows until otherwise stated:

---
I might suggest the remap config be adjusted slightly. I think users commonly want to parse and merge JSON objects, so that would be more representative.
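A plausible sketch of the kind of parse-and-merge remap the comment has in mind -- an assumption on my part, not the exact config suggested:

```toml
[transforms.parse_message]
type = "remap"
inputs = ["datadog_agent"]
source = '''
  # Parse the JSON string held in `message` and merge its fields
  # into the event itself.
  . = merge(., object!(parse_json!(.message)))
'''
```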
At the close of #8263 we observed that vector in a similar configuration -- dropping the remap step -- is able to accomplish 210Mb/s of throughput. Building vector with the following string:

we find that:
The remap transform does not appear to instrument …
@jszwedko your proposed configuration change results in:
It's worth noting, maybe, that some of the …
So...
Err... however, in the case that …
Yeah, it does seem like …
Here's the structure that's being arbitrarily generated and fed into vector as a Datadog logs JSON blob. The important part is the `message` field: it's always a string. Sometimes the string happens to be valid JSON and sometimes it's not, but it's never an integer, unless remap is doing auto-coercion when the string is number-y.
This might be useful for debugging …
I still think, for the purposes of this soak test, that having it always send JSON objects of a similar shape and size would be closer to the common user case.
@jszwedko For what it's worth, I patterned the input off an enterprise customer's setup: part unstructured, part structured input and all. This ticket is also not the soak test, necessarily. Considering the observed regression introduced by the small example of remap, I'm satisfied with the configuration outlined in the original comment for the purpose of providing a useful optimization target, absent any compelling use case from a user that needs immediate work. If this ticket makes fast progress I'd be glad to entertain a more elaborate remap transform, focus on a different input pattern, etc.
Continuing on from this comment, a perf run indicates that memmove, malloc, and free are the primary cost centers of the run -- 8.96%, 6.59%, and 5.44% respectively -- which coincides well with our understanding that events are cloned into remap for processing. The CPU timeline is relatively full; see screenshot.

Flamegraph analysis suggests that we're spending a non-trivial portion of time buffering events by partially serializing them -- known from #8263 -- but 21% of runtime in … We know from #8263 that the cost of the source is not a limit up to higher throughput than is observed presently, which suggests we should focus on remap primarily.
The stack through vector into VRL is somewhat crowded, making it hard to find optimization targets. I wrote a more restricted program that mimics how the transform itself behaves:

```rust
use bytes::Bytes;
use lookup::LookupBuf;
use shared::TimeZone;
use std::collections::BTreeMap;
use vrl::{Program, Runtime, Target, Value};

const JSON: &[u8] = r#"{"key": "value"}"#.as_bytes();
const PROGRAM: &str = r#". = parse_json!(.message)"#;

#[derive(Debug, Clone, Copy, Default)]
struct StaticTarget {}

impl Target for StaticTarget {
    #[inline]
    fn insert(&mut self, _path: &LookupBuf, _value: Value) -> Result<(), String> {
        // nothing, intentionally
        Ok(())
    }

    #[inline]
    fn get(&self, path: &LookupBuf) -> Result<Option<Value>, String> {
        if path.is_empty() {
            Ok(Some(Value::Object(BTreeMap::default())))
        } else {
            Ok(Some(Value::Bytes(Bytes::from_static(JSON))))
        }
    }

    #[inline]
    fn remove(&mut self, _path: &LookupBuf, _compact: bool) -> Result<Option<Value>, String> {
        unimplemented!()
    }
}

fn main() {
    let program: Program = vrl::compile(PROGRAM, &vrl_stdlib::all()).unwrap();
    // this loop simulates the call of `<Remap as FunctionTransform>::transform`
    let timezone = TimeZone::default();
    loop {
        let mut runtime = Runtime::default();
        let mut target = StaticTarget::default();
        let _ = runtime.resolve(&mut target, &program, &timezone).unwrap();
    }
}
```

The key difference, to start, is the …
I have updated the experimental program to include an `Event` and the `VrlTarget` conversion the transform performs:

```rust
use bytes::Bytes;
use shared::TimeZone;
use std::collections::BTreeMap;
use vector_core::event::{self, Event, EventMetadata, LogEvent, VrlTarget};
use vrl::{Program, Runtime};

const JSON: &[u8] = r#"{"key": "value"}"#.as_bytes();
const PROGRAM: &str = r#". = parse_json!(.message)"#;

fn main() {
    let program: Program = vrl::compile(PROGRAM, &vrl_stdlib::all()).unwrap();
    // this loop simulates the call of `<Remap as FunctionTransform>::transform`
    let timezone = TimeZone::default();
    let mut map: BTreeMap<String, event::Value> = BTreeMap::new();
    map.insert(
        "message".to_owned(),
        event::Value::Bytes(Bytes::from_static(JSON)),
    );
    let metadata = EventMetadata::default();
    let log_event = LogEvent::from_parts(map, metadata);
    let event = Event::Log(log_event);
    loop {
        let mut runtime = Runtime::default();
        let mut target: VrlTarget = event.clone().into();
        let _ = runtime.resolve(&mut target, &program, &timezone).unwrap();
    }
}
```

The main cost centers here appear to be … Backing out some, the remap transform itself is stateless but is a single choke point in our topology. It seems likely that we can run multiple copies of it in parallel.
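To make the parallelization idea concrete, here is a toy sketch -- emphatically not vector's topology code -- in which several workers share one input channel and each runs the stateless transform independently. Note that output ordering is not preserved:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

// Stand-in for the stateless remap transform; the real one would run a
// compiled VRL program against the event.
fn transform(event: String) -> String {
    event.to_uppercase()
}

#[tokio::main]
async fn main() {
    let (in_tx, in_rx) = mpsc::channel::<String>(128);
    let (out_tx, mut out_rx) = mpsc::channel::<String>(128);
    // Workers share one receiver; whichever worker is free takes the next event.
    let in_rx = Arc::new(Mutex::new(in_rx));

    for _ in 0..4 {
        let rx = Arc::clone(&in_rx);
        let tx = out_tx.clone();
        tokio::spawn(async move {
            loop {
                // Hold the lock only long enough to pull one event.
                let event = {
                    let mut guard = rx.lock().await;
                    guard.recv().await
                };
                match event {
                    Some(e) => {
                        if tx.send(transform(e)).await.is_err() {
                            break; // output side went away
                        }
                    }
                    None => break, // input channel closed
                }
            }
        });
    }
    drop(out_tx); // only the worker clones remain

    for i in 0..8 {
        in_tx.send(format!("event-{i}")).await.unwrap();
    }
    drop(in_tx); // lets the workers shut down

    // Results arrive in completion order, not submission order.
    while let Some(event) = out_rx.recv().await {
        println!("{event}");
    }
}
```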
Could that introduce risk of getting events passed out of order?
They would absolutely be passed out of order. I was under the impression we didn't guarantee ordering today, and we should not if it can be helped at all.
I don't think we've explicitly guaranteed it, but we do implicitly... for example, the `loki` sink currently depends on events arriving in order.
Gotcha, that makes sense. To be more explicit: not guaranteeing in-order processing is an important optimization opportunity. If we can process events out of order we can do things in parallel without coordination -- like I'm thinking for the remap transform -- or shunt slow-to-process events behind fast-to-process events, etc.
100%. The loki problem should be solved on their side soon, and I'm not aware of anything else we have that has ordering as a requirement -- I just wanted to make sure we didn't unintentionally break loki when we're also trying to polish and improve that sink this quarter.
It's a good call-out; I appreciate it. Do we have a ticket tracking this on our end? I think it means we have to preserve our accidental in-order processing and/or build a mechanism to re-order in the sink before we can make the transforms parallel. Or provide a knob to make them sequential, and I hate new knobs.
@blt a few tickets regarding loki ordering and performance...
and the epic: #8460 |
We do currently preserve message ordering within a given source-dependent context; for example, for the …

When we've previously discussed introducing transform parallelization, we've talked about doing it such that the transforms would be pushed up into the sources, so that they would execute concurrently using whatever unit of concurrency the source does. For example, this would let us run remap in parallel across TCP connections to the …
In-order out from the source is a-okay by me.
Yep. That RFC's end state is the ideal, especially because we could elide any buffer overhead by jamming transforms up the topology. Much like the way we tackled #8263, I'd like to take little steps that are cheap to throw out if they aren't worthwhile. It seems to me -- and I'm getting more intimately familiar with the topology code now -- that we could make the remap transform itself concurrent internally and still work our way toward the Architecture Revisit vision without having to take a big leap there. That is, assuming that inserting a transform guaranteed to destroy ordering wouldn't break anything downstream.
As a part of #8512 I need to have a much more clear understanding of the topology construct of vector than I presently do. I have an eye toward accomplishing some of the aims of the Architecture Revisit RFC but would like to get there in small steps. This commit does not functionally change vector. It moves some test code around and adjusts the documentation to be more, I hope, clear about what's what. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit removes a `span!` call with `Level::ERROR` setting from `FunctionCall::resolve`. My understanding here is that this span did not terminate into any collector and will not impact user-observable remap behavior. This does, however, add +9Mb/s throughput to the configuration described in #8512. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
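For a sense of the pattern, a minimal sketch assuming the `tracing` crate -- not the actual `FunctionCall::resolve` code: a span constructed and entered on a hot path can cost something per call even when no subscriber ever consumes it.

```rust
use tracing::{span, Level};

fn resolve() {
    // The kind of per-call span the commit removes: with no collector
    // attached this can still pay interest-check and construction
    // overhead on every invocation.
    let span = span!(Level::ERROR, "remap_resolve");
    let _enter = span.enter();
    // ... expression resolution would happen here ...
}

fn main() {
    // Hot loop standing in for per-event expression resolution.
    for _ in 0..1_000_000 {
        resolve();
    }
}
```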
This commit moves the `TransformConfig` down into vector-core. The ambition I've got is to work closer to the architecture described in Architecture Revisit and to allow for me to experiment with transforms that directly produce into streams etc etc. Motivation here being an observation in #8512 that we spend a good deal of time making small vecs, which happens in the topology. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Following on from #8584 I attempted to extend the … While not fully exhausted, any scope for improving the single-threaded throughput of this particular remap configuration is limited, seemingly, to a small subset of the flamegraph. Unfortunately the perf data this is derived from is too large to attach.
Repeating with an unchanged experimental setup for e876d40, and again for 403007e, we find that:

These numbers remain stable. We have noticed that warp factors heavily into stack traces for the datadog agent and intend to remove it from the source.
This commit removes a lazy_static and a call to a lazy_static from the DatadogAgentSource. As a part of #8512 I spent some time trying to remove warp from the agent source to see its effect and found that, on the whole, it would not really be worth all that much. warp might bloat our stack traces somewhat but is lean internally. That said, I did find that I could remove some serialization from the source, primarily by migrating globally retrieved types into the source struct itself. See also #8472. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
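Per the PR (#8789), the migrated globals were `API_KEY_MATCHER` and `log_schema()`. A hedged sketch of the shape of that migration -- the regex pattern and field names below are illustrative, not vector's actual code:

```rust
use regex::Regex;

// Hypothetical, trimmed-down source struct; the real DatadogAgentSource
// carries considerably more state.
struct DatadogAgentSource {
    api_key_matcher: Regex,
    message_key: &'static str,
}

impl DatadogAgentSource {
    fn new() -> Self {
        Self {
            // Built once at source construction rather than fetched from a
            // lazily-initialized global on the request hot path.
            api_key_matcher: Regex::new(r"^/v1/input/(?P<api_key>[[:alnum:]]+)")
                .expect("static regex compiles"),
            message_key: "message",
        }
    }

    fn extract_api_key(&self, path: &str) -> Option<String> {
        self.api_key_matcher
            .captures(path)
            .and_then(|caps| caps.name("api_key"))
            .map(|m| m.as_str().to_owned())
    }
}

fn main() {
    let source = DatadogAgentSource::new();
    assert_eq!(
        source.extract_api_key("/v1/input/DEADBEEF").as_deref(),
        Some("DEADBEEF")
    );
    println!("message key: {}", source.message_key);
}
```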
While I've been focused on the remap step in the config, it has struck me that the sink contributes to a halving of throughput. We very likely can be more explicit about the format we're encoding to in the sink, so I'll tackle that next. As noted in #8789, reducing the stack by removing warp is interesting but not particularly feasible without either building our own in-house setup or making the agent source unique. This is not a bottleneck presently so there's no need.
This PR rethinks how our Datadog logs sink works with an eye toward applying a similar transformation to each of our sinks. The work here is driven by the observation in #8512 that we spend a great deal of time in vector polling futures. After tracing out the path of an `Event` through our previous stack it's clear why we're doing that:

![Image from iOS](https://user-images.githubusercontent.com/3579/130370869-8dbc68dd-a000-41f1-ad27-c144ce0ff30a.jpg)

The top portion of this diagram details the type hierarchy at play, and the bottom the actual code path in practice for an event, borne out by examining live stack traces. None of these calls are inlined with one another -- consistent with the amount of boxing and dynamic dispatch going on -- so we do in fact pay for all the polls. It's also quite hard to understand this hierarchy and make changes to it. So, this PR is a rethink of the whole setup as it affects the Datadog logs sink. Highlights:

* the logs sink is now a _stream_ processor
* the logs sink is responsible for batching by API key itself
* serialization of payloads is done as late as possible and over a complete batch in one swoop

This approach is at least twice as fast as our present approach. I expect the patch to be code positive but only because of inlined rustdocs. Things missing as of this writing:

- [x] acks are not currently performed -- resolved in 225cfe2
- [x] telemetry is broken considering that is part of the generic type hierarchy abandoned here -- resolved in b6832a4
- [x] concurrency, rate-limiting and back-pressure need to be enabled yet
- [x] it's possible that payload sizes cross the 5Mb boundary the API sets -- see [this comment](#8825 (comment))

With regard to the last point, these are accomplished today by use of tower-based layering. I haven't decided whether the `HttpClient` should be responsible for participating in these concerns -- leaving the stream processor ignorant of them -- or if the stream processor should itself be a `Service`.

## Method

The basic method at play in the previous hierarchy was inheritance. A `DatadogLogsSink` is a type of `HttpSink` that exposes a few methods and is called by some obscure mechanism elsewhere. What I'm pushing for here is composition. A `LogApi` is a stream processor that is composed with an http client that knows about back-pressure, concurrency limits, etc., and clicks this together with buffering logic specific to the logs API.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
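The batching-by-API-key idea is simple enough to sketch. A minimal, synchronous illustration of the shape described above -- not the actual `LogApi` code; the names and the batch bound are placeholders:

```rust
use std::collections::HashMap;

// Placeholder event type; the real sink works over vector's `Event`.
struct LogEvent {
    api_key: String,
    message: String,
}

// Stand-in for the 5Mb payload bound the logs API imposes.
const MAX_BATCH: usize = 1_000;

fn run(events: impl Iterator<Item = LogEvent>) {
    let mut batches: HashMap<String, Vec<LogEvent>> = HashMap::new();
    for event in events {
        let key = event.api_key.clone();
        let batch = batches.entry(key.clone()).or_default();
        batch.push(event);
        if batch.len() >= MAX_BATCH {
            let full = std::mem::take(batch);
            flush(&key, &full);
        }
    }
    // Flush whatever remains per key at shutdown.
    for (key, batch) in batches {
        if !batch.is_empty() {
            flush(&key, &batch);
        }
    }
}

// Serialization happens here, once, over the complete batch.
fn flush(api_key: &str, batch: &[LogEvent]) {
    let lines: Vec<&str> = batch.iter().map(|e| e.message.as_str()).collect();
    println!("POST /v1/input key={} events={}", api_key, lines.len());
}

fn main() {
    let events = (0..3).map(|i| LogEvent {
        api_key: "DEADBEEF".into(),
        message: format!("log line {i}"),
    });
    run(events);
}
```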
Now that #8825 is merged we have to slightly update the vector config used in this investigation:

```toml
data_dir = "/tmp/data/vector"

##
## Sources
##

[sources.internal_metrics]
type = "internal_metrics"

[sources.datadog_agent]
type = "datadog_agent"
acknowledgements = false
address = "0.0.0.0:8282"

##
## Transforms
##

[transforms.parse_message]
type = "remap"
inputs = ["datadog_agent"]
source = '''
pyld, err = parse_json(.message)
if err == null {
  .message = pyld.mineral
}
'''

##
## Sinks
##

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal_metrics"]
address = "0.0.0.0:9598"

[sinks.dd_logs_egress]
type = "datadog_logs"
inputs = ["parse_message"]
endpoint = "http://localhost:8080/v1/input"
default_api_key = "DEADBEEF"
```

Repeating our experiment for a32c7fd we find:
Remap utilization in this configuration is 95%, with logs sink utilization at 32%, implying a serious bottleneck in the remap transform. Removing remap from the center of this configuration we find that throughput is 92MB/s, confirming this suspicion.
This commit speeds up the `rename_key` function of `LogEvent` by 57%, avoiding the use of path iteration code to achieve this. This results in +2MB/s in #8512. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
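The shape of the optimization is easy to illustrate. A hedged sketch of the direct-rename idea over a flat map -- not the actual `LogEvent` implementation, which operates on vector's event `Value` type:

```rust
use std::collections::BTreeMap;

// Move a value from one top-level key to another without invoking the
// general path-iteration machinery used for nested lookups.
fn rename_key(fields: &mut BTreeMap<String, String>, from: &str, to: &str) {
    if let Some(value) = fields.remove(from) {
        fields.insert(to.to_string(), value);
    }
}

fn main() {
    let mut fields = BTreeMap::new();
    fields.insert("msg".to_string(), "hello".to_string());
    rename_key(&mut fields, "msg", "message");
    assert_eq!(fields.get("message").map(String::as_str), Some("hello"));
}
```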
This commit introduces a `ready_chunks` into the pipeline for function transforms. This combinator polls the underlying stream until either it signals Pending or the chunk limit is reached, sending a vector of items downstream. In this commit we pull at most 128 `Event` instances off the input stream for processing, improving locality. I have not done the same for task transforms, though I think it could be done. Without changing the trait type we'd need the inverse of `ready_chunks` to move from a `Stream<Item = Vec<Event>>` into a `Stream<Item = Event>`. This is worth +7MB/s in #8512 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
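For reference, a minimal demonstration of the combinator (`futures` 0.3), separate from vector's pipeline code:

```rust
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Stands in for the transform's input stream of events.
    let events = stream::iter(0..10);

    // Poll until Pending or until 4 items are buffered, then emit the
    // chunk; downstream work now happens over a Vec with good locality.
    let mut chunked = events.ready_chunks(4);
    while let Some(chunk) = chunked.next().await {
        println!("processing {} events together: {:?}", chunk.len(), chunk);
    }
}
```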
Repeating our experiment with the config outlined here for be8ceaf we find that:
Remap utilization is now 10%, …
This configuration is now part of our regular soak tests. We measure approximately 20Mb/s per core as of this writing.