make the OTLP exporters non exclusive #147

Geal · 2021-11-16T09:57:15Z

this allows users to choose which exporter they want without recompiling the router, at the price of a larger binary (unstripped binary goes from 59MB to 66MB)

I started on this to see if I could reduce the time spent in xtask test in CI, but I think it makes more sense anyway to have all the exporters available in the released version

This PR also includes CI tuning to reduce the time spent, mainly by:

building the router and router-core only once
caching the xtask build step
running xtask lint only on linux

The result:

build from scratch (without cache, ie anytime Cargo.lock changes): 18mn https://app.circleci.com/pipelines/github/apollographql/router/237/workflows/c3e173d9-ff51-40a5-b816-b5d500115eca
build with cache: 6mn49s https://app.circleci.com/pipelines/github/apollographql/router/237/workflows/ba8f4055-b5a0-4182-9a45-8aee9a9e6ce0
in comparison, on the main branch, all builds are around 16mn: https://app.circleci.com/pipelines/github/apollographql/router?branch=main

o0Ignition0o · 2021-11-16T10:00:03Z

Great job!

I think this will reduce CI build times by a huge amount, and it will allow for more flexibility.

It might even fix one of @prasek s concerns, when he set up federation-demo with grpc.

huge +1 for me, as soon as the cargo xtask commands are up to date.

Geal · 2021-11-16T10:31:44Z

with 9180bf4 the total CI time (with cached cargo dir) is now 12mn44s (down from 16mn) https://app.circleci.com/pipelines/github/apollographql/router/204/workflows/18c48a92-70b6-4769-97b0-d572d5e87e8d
This is still dominated by the OSX build

this allows users to choose which exporter they want without recompiling the router, at the price of a larger binary (unstripped binary goes from 59MB to 66MB) updates the opentelemetry dependency to get access to SpanExporterBuilder that was not public before

since the features are not mutually exclusive now, we can do all tests in one pass, and reduce the CI time

- pre build the xtask binary so it can be cached - deactivate circleci/rust cache since we already have a global cache step - run the main compilation and tests with the same feature as xtask test to avoid a full rebuild

this is far too slow on other platforms than linux

Geal · 2021-11-16T17:48:39Z

I hereby declare this PR open for review and will avoid messing with the CI for a while now 😬

circleci tries to find a cache for one of the keys, in order, by prefix matching. In order: - we look for a cache for the same branch(or PR) and the same Cargo.lock. This is the key under which we will save a cache if it is not present, to have an up to date cache for the current PR and make checks faster - we look for a cache from the main branch with the same Cargo.lock, which means it will have all the dependencies already downloaded and compiled - if Cargo.lock changed, we start from the latest cache generated on main, it is likely it will have some dependencies in common

cecton · 2021-11-17T08:42:26Z

I a few questions here so please bear with me 🙏

Correct me if I'm wrong but afaiu tonic, grpcio and http all communicate to the same endpoint using the same protocol. The only difference is that they use different libraries:
- http uses none (maybe just reqwest) but does not support https
- grpcio uses grpcio, which depends on grpcio-sys which is a wrapper around the C/C++ lib
- tonic uses tonic: a pure Rust implementation with optionally tls backed by rustls (also pure rust)
So there doesn't seem to be a point of releasing one binary that have them all. On the contrary, some of the linked dependencies (grpcio) might not be available on the system or undesirable.
You're not testing the features separately anymore but altogether. I think this is just a new path to build the thing but it does not ensure that the build works on each and every one of them separately.

For example: it does not ensures that building without grpcio but with tonic would work. It only checks that the whole thing together work. So you could say the test is not 100% complete.
I personally didn't mind shortening the CI build on PRs by keeping only one simple build path (without any of these feature flags) and checking the others only on main. This would have improved the speed of the feedback loop in PR but shifted some of the checks to main and would require us to fix them from time to time.

To me it was acceptable because this is not the kind of code we touch often. But @abernix had a different opinion on this and thought it would be better to test everything properly in the CI at every PR. In my opinion it was not an issue to keep the CI slow like this as I can run the test quickly locally anyway. It's not like I'm in a hurry to get things merged and I prefer to fix it upfront rather than merging and risking other people to merge their stuff on top of mine and accumulating problems. So I thought he had a good argument and I don't mind anyway.

I understand that you do mind about the CI build time but if this requires to reduce the coverage of our tests I want to make sure everybody understand the terms and conditions and sign here at the end of the document 😁

Geal · 2021-11-17T10:45:24Z

http uses none (maybe just reqwest) but does not support https

HTTP is using OTLP/HTTP which is not the same protocol as OTLP/gRPC. It can support HTTPS since it's based on reqwest. If we want to pass specific parameters (like certificate authority or client cert) we can do that by creating a reqwest client manually and using the with_http_client method

So there doesn't seem to be a point of releasing one binary that have them all. On the contrary, some of the linked dependencies (grpcio) might not be available on the system or undesirable.

It does not make sense to tell users that to use OTLP/gRPC they can download the released binary, but for OTLP:HTTP they need to build from source. We could make different binaries for each feature set. Or make them all available at once as is done in this PR.
On the issue of missing dependencies, you saw in #150 that I'd actually like to remove grpcio, because there's no good reason to keep 2 different gRPC implementations.

You're not testing the features separately anymore but altogether. I think this is just a new path to build the thing but it does not ensure that the build works on each and every one of them separately.

I want to reduce the number of paths. Right now, we have otlp-tonic, otlp-http, otlp-grpcio and tls, which modifies otlp-tonic. We already have a case where building a feature separately from the others results in a compilation error: building only with tls, because it should depend on otlp-tonic. Increasing the number of features is increasing the risk that some of them are not properly tested.

What I want is:

otlp-http: builds with otlp-http like before
otlp-grpc: build with tonic and tls
those features are additive, not exclusive like we're currently doing (cargo features are supposed to be additive even if that's not really an issue here)
both features are available by default

So you could say the test is not 100% complete.

We're only testing that configuration files are deserialized properly, the tests are already not 100% complete.

I personally didn't mind shortening the CI build on PRs by keeping only one simple build path (without any of these feature flags) and checking the others only on main. This would have improved the speed of the feedback loop in PR but shifted some of the checks to main and would require us to fix them from time to time.

My goal here is exactly that: one simple build path. The difference is that for me that build path should exercise all the features.

Also we are not in a case with multiple features modifying graphql interpretation in potentially incompatible ways, where I'd want the CI to test various combinations. We're in a case where we have multiple incompletely tested telemetry options that will not conflict because we choose only one at runtime. And I trust that if we change that behaviour later it will be obvious in code reviews.

I understand that you do mind about the CI build time but if this requires to reduce the coverage of our tests I want to make sure everybody understand the terms and conditions and sign here at the end of the document

The test coverage stays the same, all of the tests are executed, I even added more.

I get that you don't feel affected too much by the build times, but in my experience, improving build times locally and in CI has a great impact on development velocity. That means we can run a lot more tests, have a quicker feedback loop between a change and tests passing, and switch faster between branches, without having to wait.
Right now the CI takes 15mn to pass. Our target with @o0Ignition0o is 2 to 3mn. With a 3mn CI run you can read a comment on a PR, fix the code, push and get the result while staying in the same conversation. That changes the experience radically.

cecton · 2021-11-17T12:04:15Z

I guess I'm not too impacted either way so... if @BrynCooke agrees with this PR, let's go for it

o0Ignition0o · 2021-11-17T17:21:55Z

There is a way for us to fan out in CI and run each variant of the features in parallel. This would allow us to check each valid permutation of features and keep a steady build time.

However, this doesn't seem to be our concerns here, I would have expected otlp-grpc and otlp-http to be enough, and our binaries to be released with --all-features. If this is not the case, we would need to generate a matrix of build artefacts, which we don't.

Would it be worth opening an issue or a discussion? I feel like this is more of a business decision.

Geal · 2021-11-17T18:17:26Z

Multiple points to address here:

leveraging the CI platform instead of xtask to test feature combinations is the good approach IMO. We have to assume that someone will clone the repository and try to run it with a random set of features. It should work
we can have multiple additive features, but there should be only one version of the router per platform, otherwise it will be very hard to support
we should strive to keep the number of features low so we don't get combinatorial explosion
but also it's fine to add short term features for tests or quick fixes that we will remove soon after (like the post-processing)

cecton · 2021-11-18T10:01:13Z

leveraging the CI platform instead of xtask to test feature combinations is the good approach IMO. We have to assume that someone will clone the repository and try to run it with a random set of features. It should work

The point of xtask is exactly to make it work right out of box. Right now the repositories builds properly locally when you select the set of feature manually using cargo build but it also works properly when you do the full permutation tests using cargo xtask test. No one needs to open the yaml file yo figure out how to run things. That's the beauty of it.

Geal · 2021-11-18T10:26:17Z

xtask is doing too much or not enough right now. It can be the source of truth for feature permutations, and have an option to build them all in sequence, but it should allow a way to parallelize for CI.
@o0Ignition0o is it possible to have a step in circleci that calls xtask so that it returns a serie of commands to run in parallel?

o0Ignition0o · 2021-11-18T10:56:29Z

@Geal yes, there is. I've opened #160 so we can go back to the requirements / problem space and define the expected role for xtask and the ci script, which I can then implement.

This will hopefully allow us to keep this pull request scoped on merging the features, and land it in a reasonable amount of time, while getting buy in on what xtask and the script should do.

Edit: Although I'm all for merging the features, we might wanna make sure everyone else also is (cc @abernix @BrynCooke)

cecton · 2021-11-18T12:00:30Z

Ok I have answered in the discussion then

Geal · 2021-11-18T13:17:20Z

a proposal: I shuffle my tree and make a separate PR with only the changes to configuration/OTLP and #153 (which is based on this PR), and we put off the decision about what to do or not with features, CI and caches to @o0Ignition0o's PR on CI that goes into much more detail than what I did.
(also I'll make sure the tls feature requires the otlp-tonic feature to avoid that issue)

o0Ignition0o · 2021-11-18T19:12:47Z

Oh sorry I thought I had replied before, this sounds good to me!

Geal · 2021-11-22T17:51:11Z

closed in favor of #153

…rgo/regex-1.6.0 chore(deps): bump regex from 1.5.4 to 1.6.0

Geal force-pushed the merge-grpc branch 14 times, most recently from 2778c65 to 5c5b527 Compare November 16, 2021 16:51

Geal added 6 commits November 16, 2021 18:03

run xtask test with all features

51dbacc

since the features are not mutually exclusive now, we can do all tests in one pass, and reduce the CI time

only build router and router-core in xtask test

5d0f09a

Reduce the time spent in xtask in CI

2115f2c

- pre build the xtask binary so it can be cached - deactivate circleci/rust cache since we already have a global cache step - run the main compilation and tests with the same feature as xtask test to avoid a full rebuild

cache the .cargo directory

1cfcfac

do not run xtask lint on OSX and Windows

b224431

this is far too slow on other platforms than linux

Geal force-pushed the merge-grpc branch from 5c5b527 to b224431 Compare November 16, 2021 17:04

Geal marked this pull request as ready for review November 16, 2021 17:47

Geal requested review from cecton and o0Ignition0o November 16, 2021 17:47

Geal mentioned this pull request Nov 17, 2021

do we need 2 different GRPC implementations? #150

Closed

Geal mentioned this pull request Nov 17, 2021

CI performance improvement #157

Closed

o0Ignition0o mentioned this pull request Nov 18, 2021

set up cargo-deny #89

Merged

Geal mentioned this pull request Nov 19, 2021

Remove grpcio #153

Merged

Geal closed this Nov 22, 2021

cecton deleted the merge-grpc branch November 24, 2021 08:34

Geal self-assigned this Dec 1, 2021

tinnou pushed a commit to Netflix-Skunkworks/router that referenced this pull request Oct 16, 2023

Merge pull request apollographql#147 from apollographql/dependabot/ca…

2b0a97a

…rgo/regex-1.6.0 chore(deps): bump regex from 1.5.4 to 1.6.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make the OTLP exporters non exclusive #147

make the OTLP exporters non exclusive #147

Geal commented Nov 16, 2021 •

edited

Loading

o0Ignition0o commented Nov 16, 2021

Geal commented Nov 16, 2021

Geal commented Nov 16, 2021

cecton commented Nov 17, 2021

Geal commented Nov 17, 2021

cecton commented Nov 17, 2021

o0Ignition0o commented Nov 17, 2021

Geal commented Nov 17, 2021

cecton commented Nov 18, 2021

Geal commented Nov 18, 2021

o0Ignition0o commented Nov 18, 2021 •

edited

Loading

cecton commented Nov 18, 2021

Geal commented Nov 18, 2021

o0Ignition0o commented Nov 18, 2021

Geal commented Nov 22, 2021

make the OTLP exporters non exclusive #147

make the OTLP exporters non exclusive #147

Conversation

Geal commented Nov 16, 2021 • edited Loading

o0Ignition0o commented Nov 16, 2021

Geal commented Nov 16, 2021

Geal commented Nov 16, 2021

cecton commented Nov 17, 2021

Geal commented Nov 17, 2021

cecton commented Nov 17, 2021

o0Ignition0o commented Nov 17, 2021

Geal commented Nov 17, 2021

cecton commented Nov 18, 2021

Geal commented Nov 18, 2021

o0Ignition0o commented Nov 18, 2021 • edited Loading

cecton commented Nov 18, 2021

Geal commented Nov 18, 2021

o0Ignition0o commented Nov 18, 2021

Geal commented Nov 22, 2021

Geal commented Nov 16, 2021 •

edited

Loading

o0Ignition0o commented Nov 18, 2021 •

edited

Loading