Investigate datadog_agent -> remap -> datadog_logs #8512

Closed · blt opened this issue Jul 29, 2021 · 31 comments

Labels: domain: performance · domain: vrl · transform: remap
@blt (Contributor) commented Jul 29, 2021

This issue is a record of and parent to the investigation of vector's performance when configured like so:

data_dir = "/tmp/data/vector"

##
## Sources
##

[sources.internal_metrics]
type = "internal_metrics"

[sources.datadog_agent]
type = "datadog_agent"
acknowledgements = false
address = "0.0.0.0:8282"

##
## Transforms
##

[transforms.parse_message]
type = "remap"
inputs = ["datadog_agent"]
source = '''
pyld, err = parse_json(.message)
if err == null {
  .message = pyld.mineral
}
'''

##
## Sinks
##

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal_metrics"]
address = "0.0.0.0:9598"

[sinks.dd_logs_egress]
type = "datadog_logs"
inputs = ["parse_message"]
endpoint = "http://localhost:8080/v1/input"
default_api_key = "DEADBEEF"

buffer.type = "memory"
buffer.max_events = 40000 # buffer 40 payloads at a time

Inbound http_gen is configured with:

worker_threads = 2
prometheus_addr = "0.0.0.0:9001"

[targets.simple]
target_uri = "http://localhost:8282/v1/input"
bytes_per_second = "500 Mb"
parallel_connections = 10
method.type = "Post"
method.variant = "DatadogLog"
method.maximum_prebuild_cache_size_bytes = "1 Gb"

[targets.simple.headers]
dd-api-key = "DEADBEEF"

Much like in #8263, our method will primarily be to measure inbound and outbound throughput. The outbound target is http_blackhole, also from the lading project, and all three components are executed as follows until otherwise stated:

> firejail --noprofile --rlimit-as=2g --cpu=1,2,3,4,5,6,7,8 ./target/release/vector --config ~/projects/us/troutwine/lading/configs/vector-ddlogs-remap.toml
> firejail --rlimit-as=2g  --noprofile ./target/release/http_blackhole --worker-threads 16 --concurrent-requests-max 512 --binding-addr "0.0.0.0:8080" --prometheus-addr "0.0.0.0:9002"
> firejail --rlimit-as=8g --cpu=11,12 --noprofile ./target/release/http_gen --config-path configs/http_gen.toml
blt added the domain: performance, domain: vrl, and transform: remap labels Jul 29, 2021
@jszwedko (Member) commented:

I might suggest the remap config be adjusted slightly to:

pyld, err = parse_json(.message)
if err == null {
  .message = del(pyld.mineral)
  . |= object!(pyld)
}

I think users commonly want to parse and merge JSON objects so that should be more representative.

@blt commented Jul 29, 2021

At the close of #8263 we observed that vector in a similar configuration -- minus the remap step -- is able to achieve 210Mb/s throughput. Building vector with the following command:

> CARGO_PROFILE_RELEASE_DEBUG=2 cargo build --release --no-default-features --features "sources-datadog,sources-internal_metrics,transforms-remap,transforms-log_to_metric,sinks-blackhole,sinks-datadog,sinks-prometheus"

we find that:

| SHA | http_gen generation (Mb/s) | datadog_logs bytes processed (Mb/s) | http_blackhole receipt (Mb/s) |
| --- | --- | --- | --- |
| b54bd88 | 31.9 | 29.4 | 29.4 |

The remap transform does not appear to instrument vector_processed_bytes_total -- though this is present in the documentation -- so I have not included it in the table. At present the remap transform appears to regress throughput by 180.6Mb/s compared to the same configuration without it.

@blt commented Jul 29, 2021

@jszwedko your proposed configuration change results in:

➜  vector git:(master) ✗ firejail --noprofile --rlimit-as=2g --cpu=1,2,3,4,5,6,7,8 ./target/release/vector --config ~/projects/us/troutwine/lading/configs/vector-ddlogs-remap.toml
Parent pid 457712, child pid 457713
Child process initialized in 4.72 ms
Jul 29 10:13:57.559  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info" enable_datadog_tracing=false
Jul 29 10:13:57.559  INFO vector::app: Loading configs. paths=["/home/blt/projects/us/troutwine/lading/configs/vector-ddlogs-remap.toml"]
Jul 29 10:13:57.565  INFO vector::topology: Running healthchecks.
Jul 29 10:13:57.565  INFO vector::topology: Starting source. name="datadog_agent"
Jul 29 10:13:57.565  INFO vector::topology: Starting source. name="internal_metrics"
Jul 29 10:13:57.565  INFO vector::topology: Starting transform. name="parse_message"
Jul 29 10:13:57.565  INFO vector::topology: Starting sink. name="dd_logs_egress"
Jul 29 10:13:57.566  INFO source{component_kind="source" component_name=datadog_agent component_type=datadog_agent}: vector::sources::util::http: Building HTTP server. address=0.0.0.0:8282
Jul 29 10:13:57.566  INFO vector::topology: Starting sink. name="prometheus"
Jul 29 10:13:57.567  INFO vector::topology::builder: Healthcheck: Passed.
Jul 29 10:13:57.567  INFO vector: Vector has started. debug="true" version="0.16.0" arch="x86_64" build_id="none"
Jul 29 10:13:57.568  INFO vector::topology::builder: Healthcheck: Passed.
Jul 29 10:14:03.215  WARN transform{component_kind="transform" component_name=parse_message component_type=remap}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"object\" at (88:101): expected \"object\", got \"integer\"" internal_log_rate_secs=30
Jul 29 10:14:03.234  WARN transform{component_kind="transform" component_name=parse_message component_type=remap}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
Jul 29 10:14:33.242  WARN transform{component_kind="transform" component_name=parse_message component_type=remap}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 3176 times.
Jul 29 10:14:33.243  WARN transform{component_kind="transform" component_name=parse_message component_type=remap}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"object\" at (88:101): expected \"object\", got \"integer\"" internal_log_rate_secs=30
Jul 29 10:14:33.274  WARN transform{component_kind="transform" component_name=parse_message component_type=remap}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.

It's worth noting, maybe, that some of the message fields in the datadog log payloads lading generates are just unstructured strings, some are structured json. I can't read remap well enough yet to understand what your suggested modification there actually does.

@spencergilbert (Contributor) commented Jul 29, 2021

pyld, err = parse_json(.message)
if err == null {
  .message = del(pyld.mineral)
  . |= object!(pyld) # Here is where the error is surfacing:
                     # you can only merge an object, and `pyld` is sometimes apparently just an integer
}

So...

pyld, err = parse_json(.message)
if err == null {
  .message = del(pyld.mineral)
  if is_object(pyld) {
    . |= object!(pyld)
  }
}

@spencergilbert commented:

Err... however, in the case that pyld is just an integer... .message = del(pyld.mineral) would replace the message with null?

@jszwedko commented Jul 29, 2021

Yeah, it does seem like pyld is sometimes an integer. To match the common user case, I think pyld should always be a valid JSON string representing an object.

@blt commented Jul 29, 2021

Here's the structure that's being arbitrarily generated and fed into vector as a datadog logs json blob:

https://github.com/blt/lading/blob/fdb574383c2dea5f86131bebc30b2d68dc35798f/lading_common/src/payload/datadog_logs.rs#L166-L183

The important part is Message, which is here:

https://github.com/blt/lading/blob/fdb574383c2dea5f86131bebc30b2d68dc35798f/lading_common/src/payload/datadog_logs.rs#L136-L164

It's always a string. Sometimes the string happens to be valid JSON derived from

https://github.com/blt/lading/blob/fdb574383c2dea5f86131bebc30b2d68dc35798f/lading_common/src/payload/datadog_logs.rs#L101-L108

and sometimes it's an instance of

https://github.com/blt/lading/blob/fdb574383c2dea5f86131bebc30b2d68dc35798f/lading_common/src/payload/common.rs#L8-L11

but it's never an integer, unless remap is doing auto-coercion if the string is number-y.

@spencergilbert commented:

💭

pyld, err = parse_json(.message)
if err == null {
  log(pyld)
  .message = del(pyld.mineral)
  . |= object!(pyld)
}

This might be useful for debugging

@jszwedko commented:

but it's never an integer, unless remap is doing auto-coercion if the string is number-y.

parse_json would convert it to an integer if it is a valid JSON string representing a number.

I still think, for the purposes of this soak test, that having it always send JSON objects of a similar shape and size would be closer to the common user case.
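To illustrate the coercion mentioned above, here is a minimal Rust sketch using serde_json directly (an assumption for illustration; VRL's parse_json surfaces JSON's own typing the same way, and the "quartz" value is made up):

use serde_json::Value;

fn main() {
    // A document that is just a bare number parses to a JSON number --
    // this mirrors pyld sometimes coming back as an integer.
    let n: Value = serde_json::from_str("123").unwrap();
    assert!(n.is_i64());

    // Only a braced document yields an object that a merge could accept.
    let o: Value = serde_json::from_str(r#"{"mineral": "quartz"}"#).unwrap();
    assert!(o.is_object());
}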

@blt commented Jul 29, 2021

@jszwedko For what it's worth I patterned the input off an enterprise customer's setup: part unstructured, part structured input and all. This ticket is also not the soak test, necessarily.

Considering the regression introduced by even this small remap example, I'm satisfied with the configuration outlined in the original comment as a useful optimization target, absent any compelling use-case from a user that needs immediate work. If this ticket makes fast progress I'd be glad to entertain a more elaborate remap transform, a focus on a different input pattern, etc.

@blt commented Jul 29, 2021

Continuing on from this comment, a perf run indicates that memmove, malloc, and free are the primary cost centers of the run -- 8.96%, 6.59%, and 5.44% respectively -- which coincides well with our understanding that events are cloned into remap for processing. The CPU timeline is relatively full, see screenshot:

[screenshot: CPU utilization timeline]

Flamegraph analysis suggests that we're spending a non-trivial portion of time buffering events by partially serializing them -- known from #8263 -- but also that 21% of runtime is spent in Remap::transform:

[screenshot: flamegraph of the run]

We know from #8263 that the source does not become a bottleneck until well above the throughput observed presently, which suggests we should focus primarily on remap.

@blt commented Jul 30, 2021

The stack through vector into VRL is somewhat crowded, making it hard to find optimization targets. I wrote a more restricted program that mimics how the transform itself behaves:

use bytes::Bytes;
use lookup::LookupBuf;
use shared::TimeZone;
use std::collections::BTreeMap;
use vrl::{Program, Runtime, Target, Value};

const JSON: &[u8] = r#"{"key": "value"}"#.as_bytes();
const PROGRAM: &str = r#". = parse_json!(.message)"#;

#[derive(Debug, Clone, Copy, Default)]
struct StaticTarget {}

impl Target for StaticTarget {
    #[inline]
    fn insert(&mut self, _path: &LookupBuf, _value: Value) -> Result<(), String> {
        // nothing, intentionally
        Ok(())
    }

    #[inline]
    fn get(&self, path: &LookupBuf) -> Result<Option<Value>, String> {
        if path.is_empty() {
            Ok(Some(Value::Object(BTreeMap::default())))
        } else {
            Ok(Some(Value::Bytes(Bytes::from_static(JSON))))
        }
    }

    #[inline]
    fn remove(&mut self, _path: &LookupBuf, _compact: bool) -> Result<Option<Value>, String> {
        unimplemented!()
    }
}

fn main() {
    let program: Program = vrl::compile(PROGRAM, &vrl_stdlib::all()).unwrap();

    // this loop simulates the call of `<Remap as FunctionTransform>::transform`

    let timezone = TimeZone::default();

    loop {
        let mut runtime = Runtime::default();
        let mut target = StaticTarget::default();
        let _ = runtime.resolve(&mut target, &program, &timezone).unwrap();
    }
}

The key difference, to start, is the StaticTarget. It's as cheap as possible so as to show up as little as possible in perf reports: success. As expected, the main cost center of PROGRAM is in serde_json, implying that, all other things being zero-cost, serde_json is our bottleneck in this program, if not in vector. So far so good.

[screenshot: perf report dominated by serde_json]

@blt commented Jul 30, 2021

I have updated the experimental program to include an Event.

use bytes::Bytes;
use shared::TimeZone;
use std::collections::BTreeMap;
use vector_core::event::{self, Event, EventMetadata, LogEvent, VrlTarget};
use vrl::{Program, Runtime};

const JSON: &[u8] = r#"{"key": "value"}"#.as_bytes();
const PROGRAM: &str = r#". = parse_json!(.message)"#;

fn main() {
    let program: Program = vrl::compile(PROGRAM, &vrl_stdlib::all()).unwrap();

    // this loop simulates the call of `<Remap as FunctionTransform>::transform`

    let timezone = TimeZone::default();
    let mut map: BTreeMap<String, event::Value> = BTreeMap::new();
    map.insert(
        "message".to_owned(),
        event::Value::Bytes(Bytes::from_static(JSON)),
    );
    let metadata = EventMetadata::default();

    let log_event = LogEvent::from_parts(map, metadata);
    let event = Event::Log(log_event);

    loop {
        let mut runtime = Runtime::default();
        let mut target: VrlTarget = event.clone().into();
        let _ = runtime.resolve(&mut target, &program, &timezone).unwrap();
    }
}

The main cost centers here appear to be vector_core::event::value::Value::insert, serde_json, and vector_core::event::value::Value::get, in line with the findings from #8331. The path iteration code currently lacks bail-out opportunities for 'flat' paths, so even single-segment lookups incur iteration overhead.
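As a hedged sketch of what such a bail-out could look like (hypothetical types standing in for vector_core::event::Value and the real path machinery):

use std::collections::BTreeMap;

// Hypothetical stand-in for vector_core::event::Value.
enum Value {
    Bytes(Vec<u8>),
    Object(BTreeMap<String, Value>),
}

// A "flat" path has exactly one segment; bailing out to a single map
// lookup skips the general path-iteration machinery entirely.
fn get<'a>(root: &'a BTreeMap<String, Value>, path: &[&str]) -> Option<&'a Value> {
    match path {
        [] => None,
        [key] => root.get(*key), // flat-path fast path: one lookup, no iteration
        [first, rest @ ..] => {
            let mut current = root.get(*first)?;
            for segment in rest {
                match current {
                    Value::Object(map) => current = map.get(*segment)?,
                    _ => return None,
                }
            }
            Some(current)
        }
    }
}

fn main() {
    let mut root = BTreeMap::new();
    root.insert("message".to_string(), Value::Bytes(b"hello".to_vec()));
    assert!(get(&root, &["message"]).is_some()); // takes the fast path
}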

Backing out a bit: the remap transform itself is stateless but is a single choke point in our topology. It seems likely that we could run multiple copies in parallel.

@spencergilbert commented:

Could that introduce risk of getting events passed out of order?

@blt commented Jul 30, 2021

They would absolutely be passed out of order. I was under the impression we didn't guarantee ordering today and we should not if it can be helped at all.

@spencergilbert commented:

They would absolutely be passed out of order. I was under the impression we didn't guarantee ordering today and we should not if it can be helped at all.

I don't think we've explicitly guaranteed it, but we do implicitly... for example the loki sink requires events in order, and we set concurrency to 1 to enforce we don't accidentally deliver events out-of-order. Just something to keep in mind 😄

@blt commented Jul 30, 2021

They would absolutely be passed out of order. I was under the impression we didn't guarantee ordering today and we should not if it can be helped at all.

I don't think we've explicitly guaranteed it, but we do implicitly... for example the loki sink requires events in order, and we set concurrency to 1 to enforce we don't accidentally deliver events out-of-order. Just something to keep in mind 😄

Gotcha, that makes sense. To be more explicit: not guaranteeing in-order processing is an important optimization opportunity. If we can process events out of order we can do things in parallel without coordination -- like I'm thinking for the remap transform -- or shunt slow-to-process events behind fast-to-process events, etc.
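A minimal sketch of the second idea, using buffer_unordered from the futures crate (integers stand in for events; this is illustrative, not vector's code):

use futures::stream::{self, StreamExt};
use std::time::Duration;

// Requires tokio with the rt, macros, and time features.
#[tokio::main]
async fn main() {
    let events = stream::iter(1u64..=20);
    let mut out = events
        .map(|n| async move {
            // simulate variable per-event processing cost
            tokio::time::sleep(Duration::from_millis((n % 5) * 10)).await;
            n
        })
        .buffer_unordered(8); // up to 8 events in flight at once
    while let Some(n) = out.next().await {
        // completion order differs from input order: fast events are
        // not stuck behind slow ones
        println!("finished event {n}");
    }
}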

@spencergilbert commented:

100% the loki problem should be solved on their side soon, and I'm not aware of anything else we have that has it as a requirement - just wanted to make sure we didn't unintentionally break loki when we're also trying to polish and improve that sink this quarter

@blt commented Jul 30, 2021

100% the loki problem should be solved on their side soon, and I'm not aware of anything else we have that has it as a requirement - just wanted to make sure we didn't unintentionally break loki when we're also trying to polish and improve that sink this quarter

It's a good call out; I appreciate it. Do we have a ticket tracking this on our end? I think it means we have to preserve our accidental in-order processing and/or build a mechanism to re-order in the sink before we can make the transforms parallel. Or provide a knob to make them sequential -- and I hate new knobs.

@spencergilbert commented:

@blt a few tickets regarding loki ordering and performance...

and the epic: #8460

@jszwedko commented:

We do currently preserve message ordering within a given source-dependent context. For example, for the socket or syslog source using tcp, all messages from a given connection will be passed in-order downstream; for a given HTTP request to the http source, events would be pushed downstream in-order; for a given S3 object, the lines will be sent in-order. We can discuss if that is a guarantee we want to continue supplying though.

When we've previously discussed introducing transform parallelization, we've talked about doing it such that the transforms would be pushed up into the sources such that they would execute concurrently using whatever unit of concurrency the source does. For example, this would let us run remap in parallel across TCP connections to the socket source, but not in parallel for a given TCP connection (or HTTP request etc.). https://github.com/timberio/vector/blob/master/rfcs/2020-06-18-2625-architecture-revisit.md#topology-design talks about this in more detail.
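A rough sketch of that shape (hypothetical code, not vector's: the transform runs inside each connection's task, so ordering holds within a connection while connections proceed in parallel):

use tokio::io::{AsyncBufReadExt, BufReader};
use tokio::net::TcpListener;

// Stand-in for a stateless transform like remap.
fn transform(line: String) -> String {
    line.to_uppercase()
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8282").await?;
    loop {
        let (socket, _) = listener.accept().await?;
        // One task per connection: the transform executes with the
        // source's own unit of concurrency.
        tokio::spawn(async move {
            let mut lines = BufReader::new(socket).lines();
            while let Ok(Some(line)) = lines.next_line().await {
                // events from this connection are processed in order;
                // other connections run concurrently on other tasks
                let _out = transform(line);
            }
        });
    }
}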

@blt commented Jul 30, 2021

We do currently preserve message ordering within a given source-dependent context. For example, for the socket or syslog source using tcp, all messages from a given connection will be passed in-order downstream; for a given HTTP request to the http source, events would be pushed downstream in-order; for a given S3 object, the lines will be sent in-order. We can discuss if that is a guarantee we want to continue supplying though.

In-order out from the source is a-okay by me.

When we've previously discussed introducing transform parallelization, we've talked about doing it such that the transforms would be pushed up into the sources such that they would execute concurrently using whatever unit of concurrency the source does. For example, this would let us run remap in parallel across TCP connections to the socket source, but not in parallel for a given TCP connection (or HTTP request etc.). https://github.com/timberio/vector/blob/master/rfcs/2020-06-18-2625-architecture-revisit.md#topology-design talks about this in more detail.

Yep. That RFC's end state is the ideal, especially because we could elide any buffer overhead by jamming transforms up the topology. Much like with the way we tackled #8263 I'd like to take little steps that are cheap to throw out if they aren't worthwhile. Seems to me -- and I'm getting more intimately familiar with the topology code now -- we could make the remap transform itself concurrent internally and still work our way toward the Architecture Revisit vision without having to take a big leap there.

That is, assuming inserting a guaranteed order-destroying transform wouldn't break anything downstream.

blt added a commit that referenced this issue Jul 31, 2021
As a part of #8512 I need to have a much clearer understanding of the
topology construct of vector than I presently do. I have an eye toward
accomplishing some of the aims of the Architecture Revisit RFC but would like to
get there in small steps.

This commit does not functionally change vector. It moves some test code around
and adjusts the documentation to be more, I hope, clear about what's what.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt commented Aug 3, 2021

I have re-run the experiment set out here for f132b0a and report no change in the numbers presented.

blt added a commit that referenced this issue Aug 3, 2021
This commit removes a `span!` call with `Level::ERROR` setting from
`FunctionCall::resolve`. My understanding here is that this span did not
terminate into any collector and will not impact user observable remap
behavior. This does, however, add +9Mb/s throughput to the configuration
described in #8512.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Aug 3, 2021
This commit moves the `TransformConfig` down into vector-core. The ambition I've
got is to work closer to the architecture described in Architecture Revisit and
to allow for me to experiment with transforms that directly produce into streams
etc etc. Motivation here being an observation in #8512 that we spend a good deal
of time making small vecs, which happens in the topology.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt commented Aug 4, 2021

Following on from #8584 I attempted to extend the FunctionTransform trait to return an Iterator instance, but owing to the boxed-dyn nature of our current setup I could not figure out how to square the types for yet another dynamic type. So, instead, I put a SmallVec instance into the topology builder in place of our current Vec. While this does cut down on some Vec allocations it does not improve throughput. I take this as a sign that neither the dynamic dispatch in our Transform nor these incidental allocations are, at present, on the critical path.
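For reference, a minimal sketch of the SmallVec idea (assuming the smallvec crate; a u64 stands in for Event):

use smallvec::SmallVec;

// Hypothetical sketch: a function transform emits into a stack-backed
// buffer. Most transforms produce 0 or 1 events per input, so an inline
// capacity of 1 avoids a heap allocation on the hot path.
fn transform_one(input: u64) -> SmallVec<[u64; 1]> {
    let mut out: SmallVec<[u64; 1]> = SmallVec::new();
    out.push(input * 2); // stays inline; spills to the heap only past capacity
    out
}

fn main() {
    let out = transform_one(21);
    assert!(!out.spilled()); // no heap allocation occurred
    println!("{:?}", out);
}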

While not fully exhausted, the scope for improving the single-threaded throughput of this particular remap configuration seems limited to a small subset of the flamegraph:

[screenshot: flamegraph]

Unfortunately the perf data this is derived from is too large to attach.

@blt commented Aug 4, 2021

Repeating with an unchanged experimental setup for 403007e we find that:

| SHA | http_gen generation (Mb/s) | datadog_logs bytes processed (Mb/s) | http_blackhole receipt (Mb/s) |
| --- | --- | --- | --- |
| 403007e | 41 | 37.9 | 37.9 |

The primary improvement in these numbers is the result of #8579: +8.5Mb/s as measured at the blackhole.

blt added a commit that referenced this issue Aug 4, 2021
* Move transform configuration into vector-core

This commit moves the `TransformConfig` down into vector-core. The ambition I've
got is to work closer to the architecture described in Architecture Revisit and
to allow for me to experiment with transforms that directly produce into streams
etc etc. Motivation here being an observation in #8512 that we spend a good deal
of time making small vecs, which happens in the topology.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Add env-test-util dep for Windows

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Satisfy clippy

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt self-assigned this Aug 16, 2021
@blt commented Aug 18, 2021

Repeating with an unchanged experimental setup for e876d40, and re-running 403007e, we find that:

| SHA | http_gen generation (Mb/s) | datadog_logs bytes processed (Mb/s) | http_blackhole receipt (Mb/s) |
| --- | --- | --- | --- |
| 403007e | 41.6 | 38.2 | 38.2 |
| e876d40 | 41.8 | 38.5 | 38.5 |

These numbers remain stable. We have noticed that warp factors heavily into stack traces for the datadog agent and intend to remove it from the HttpSource to reduce the steepness of the stack, with no expectation that this will appreciably affect throughput. It'll just make analysis simpler.

blt added a commit that referenced this issue Aug 18, 2021
This commit removes a lazy_static and call to a lazy_static from the
DatadogAgentSource. As a part of #8512 I spent some time trying to remove warp
from the agent source to see its effect and found that, on the whole, it would
not really be worth all that much. warp might bloat our stack traces somewhat
but is lean internally. That said, I did find that I could remove some
serialization from the source, primarily by migrating globally retrieved types
into the source struct itself.

See also #8472

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt commented Aug 19, 2021

While I've been focused on the remap step in the config, it has struck me that the sink accounts for roughly half of the throughput loss. We can very likely be more explicit about the format we're encoding to in the sink, so I'll tackle that next. As noted in #8789, reducing the stack by removing warp is interesting but not particularly feasible without either building our own in-house setup or making the agent source unique. It is not a bottleneck presently, so there's no need.

blt added a commit that referenced this issue Aug 19, 2021
* Migrate API_KEY_MATCHER and log_schema() into DatadogAgentSource (#8789)

This commit removes a lazy_static and call to a lazy_static from the
DatadogAgentSource. As a part of #8512 I spent some time trying to remove warp
from the agent source to see its effect and found that, on the whole, it would
not really be worth all that much. warp might bloat our stack traces somewhat
but is lean internally. That said, I did find that I could remove some
serialization from the source, primarily by migrating globally retrieved types
into the source struct itself.

See also #8472

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Update src/sources/datadog/agent.rs

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

Co-authored-by: Jesse Szwedko <jesse@szwedko.me>
blt added a commit that referenced this issue Sep 16, 2021
This PR rethinks how our Datadog logs sink works with an eye toward applying a similar transformation to each of our sinks. The work here is driven by the observation in #8512 that we spend a great deal of time in vector polling futures. After tracing out the path of an `Event` through our previous stack it's clear why we're doing that:

![Image from iOS](https://user-images.githubusercontent.com/3579/130370869-8dbc68dd-a000-41f1-ad27-c144ce0ff30a.jpg)

The top portion of this diagram details the type hierarchy at play and the bottom the actual code path in practice for an event, borne out by examining live stack traces. None of these calls are inlined with one another -- consistent with the amount of boxing and dynamic dispatch going on -- so we do in fact pay for all the polls. It's also quite hard to understand this hierarchy and make changes to it.

So, this PR is a rethink of the whole setup as it affects the Datadog logs sink. Highlights:

* the logs sink is now a _stream_ processor
* the logs sink is responsible for batching by API key itself 
* serialization of payloads is done as late as possible and over a complete batch in one swoop

This approach is at least twice as fast as our present approach. I expect the patch to be code positive but only because of inlined rustdocs.  

Things missing as of this writing:

- [x] acks are not currently performed -- resolved in 225cfe2
- [x] telemetry is broken considering that is part of the generic type hierarchy abandoned here -- resolved in b6832a4
- [x] concurrency, rate-limiting and back-pressure need to be enabled yet
- [x] it's possible that payload sizes cross the 5Mb boundary the API sets -- see [this comment](#8825 (comment))

With regard to the last point these are accomplished today by use of tower-based layering. I haven't decided whether the `HttpClient` should be responsible for participating in these concerns -- leaving the stream processor ignorant of them -- or if the stream processor should itself be a `Service`. 

- - - 

## Method

The basic method at play in the previous hierarchy was inheritance. A `DatadogLogsSink` is a type of `HttpSink` that exposes a few methods and is called by some obscure mechanism elsewhere. What I'm pushing for here is composition. A `LogApi` is a stream processor that is composed with an http client that knows about back-pressure, concurrency limits etc and clicks this together with buffering logic specific to the logs API. 

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
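A rough sketch of the composed, stream-processor shape described above (hypothetical types and thresholds; the real LogApi differs in detail and also handles the HTTP client, acks, and limits):

use futures::executor::block_on;
use futures::stream::{self, StreamExt};
use std::collections::HashMap;

// Hypothetical stand-in for a log event carrying its API key.
struct LogEvent {
    api_key: String,
    message: String,
}

const MAX_BATCH: usize = 1_000;

// The sink is a stream processor: it owns the event stream, batches by
// API key itself, and serializes a complete batch in one pass as late
// as possible before handing it to the HTTP client (elided here).
async fn run_log_api(mut events: impl futures::Stream<Item = LogEvent> + Unpin) {
    let mut batches: HashMap<String, Vec<LogEvent>> = HashMap::new();
    while let Some(event) = events.next().await {
        let batch = batches.entry(event.api_key.clone()).or_default();
        batch.push(event);
        if batch.len() >= MAX_BATCH {
            let ready = std::mem::take(batch);
            let payload: Vec<&str> = ready.iter().map(|e| e.message.as_str()).collect();
            println!("flushing batch of {} events", payload.len());
        }
    }
    // final partial batches would be flushed on timeout/stream end (elided)
}

fn main() {
    let events = stream::iter((0..2_500).map(|i| LogEvent {
        api_key: "DEADBEEF".to_string(),
        message: format!("event {i}"),
    }));
    block_on(run_log_api(events));
}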
@blt commented Sep 17, 2021

Now that #8825 is merged we have to slightly update our vector config used in this investigation:

data_dir = "/tmp/data/vector"

##
## Sources
##

[sources.internal_metrics]
type = "internal_metrics"

[sources.datadog_agent]
type = "datadog_agent"
acknowledgements = false
address = "0.0.0.0:8282"

##
## Transforms
##

[transforms.parse_message]
type = "remap"
inputs = ["datadog_agent"]
source = '''
pyld, err = parse_json(.message)
if err == null {
  .message = pyld.mineral
}
'''

##
## Sinks
##

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal_metrics"]
address = "0.0.0.0:9598"

[sinks.dd_logs_egress]
type = "datadog_logs"
inputs = ["parse_message"]
endpoint = "http://localhost:8080/v1/input"
default_api_key = "DEADBEEF"

Repeating our experiment for a32c7fd we find:

| SHA | http_gen generation (Mb/s) | datadog_logs bytes processed (Mb/s) | http_blackhole receipt (Mb/s) |
| --- | --- | --- | --- |
| a32c7fd | 27.6 | 27.9 | 26 |

Remap utilization in this configuration is 95%, with logs sink utilization at 32%, implying a serious bottleneck in the remap transform. Removing remap from the center of this configuration, we find that throughput is 92MB/s, confirming this suspicion.

@blt commented Sep 17, 2021

Follow-up investigation strongly suggests that #8579 has increased remap's share of runtime in this soak to approximately 30%, compared to the last measurement -- tracking issue #9141 -- and that use of emit! makes vector::internal_events::emit a surprisingly high contributor to runtime, at 14%.

blt added a commit that referenced this issue Sep 18, 2021
This commit speeds up the `rename_key` function of `LogEvent` by 57%, avoiding
the use of path iteration code to achieve this. This results in +2MB/s in #8512.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Sep 18, 2021
This commit introduces a `ready_chunks` into the pipeline for function
transforms. This combinator polls the underlying stream until either it signals
Pending or the chunk limit is reached, sending a vector of items downstream. In
this commit we pull at most 128 `Event` instances off the input stream for
processing, improving locality.

I have not done the same for task transforms, though I think it could be
done. Without changing the trait type we'd need the inverse of `ready_chunks` to
move from a `Stream<Item = Vec<Event>>` into a `Stream<Item = Event>`.

This is worth +7MB/s in #8512

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
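A minimal, self-contained sketch of the combinator described in this commit (using the futures crate directly; integers stand in for Event):

use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    block_on(async {
        // ready_chunks(128) drains up to 128 already-ready items per
        // wake, yielding Vec batches; it never waits to fill a chunk,
        // so it signals Pending only when the inner stream does.
        let events = stream::iter(0..300);
        let mut batched = events.ready_chunks(128);
        while let Some(batch) = batched.next().await {
            println!("processing a batch of {} events", batch.len());
        }
    });
}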
@blt commented Sep 20, 2021

Repeating our experiment with the config outlined here for be8ceaf we find that:

| SHA | http_gen generation (Mb/s) | datadog_logs bytes processed (Mb/s) | http_blackhole receipt (Mb/s) |
| --- | --- | --- | --- |
| be8ceaf | 35.5 | 35.5 | 32.7 |

Remap utilization is now 10%, datadog_agent utilization 40%, consistent with the behavior identified in #9228.

@blt commented Nov 23, 2021

This configuration is now part of our regular soak tests. We measure approximately 20Mb/s per core as of this writing.
