Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make the OTLP exporters non exclusive #147

Closed
wants to merge 7 commits into from
Closed

make the OTLP exporters non exclusive #147

wants to merge 7 commits into from

Conversation

Geal
Copy link
Contributor

@Geal Geal commented Nov 16, 2021

this allows users to choose which exporter they want without recompiling the router, at the price of a larger binary (unstripped binary goes from 59MB to 66MB)

I started on this to see if I could reduce the time spent in xtask test in CI, but I think it makes more sense anyway to have all the exporters available in the released version

This PR also includes CI tuning to reduce the time spent, mainly by:

  • building the router and router-core only once
  • caching the xtask build step
  • running xtask lint only on linux

The result:

@o0Ignition0o
Copy link
Contributor

Great job!

I think this will reduce CI build times by a huge amount, and it will allow for more flexibility.

It might even fix one of @prasek s concerns, when he set up federation-demo with grpc.

huge +1 for me, as soon as the cargo xtask commands are up to date.

@Geal
Copy link
Contributor Author

Geal commented Nov 16, 2021

with 9180bf4 the total CI time (with cached cargo dir) is now 12mn44s (down from 16mn) https://app.circleci.com/pipelines/github/apollographql/router/204/workflows/18c48a92-70b6-4769-97b0-d572d5e87e8d
This is still dominated by the OSX build

@Geal Geal force-pushed the merge-grpc branch 14 times, most recently from 2778c65 to 5c5b527 Compare November 16, 2021 16:51
this allows users to choose which exporter they want without
recompiling the router, at the price of a larger binary (unstripped
binary goes from 59MB to 66MB)

updates the opentelemetry dependency to get access to
SpanExporterBuilder that was not public before
since the features are not mutually exclusive now, we can do all tests
in one pass, and reduce the CI time
- pre build the xtask binary so it can be cached
- deactivate circleci/rust cache since we already have a global cache
  step
- run the main compilation and tests with the same feature as xtask test
  to avoid a full rebuild
this is far too slow on other platforms than linux
@Geal Geal marked this pull request as ready for review November 16, 2021 17:47
@Geal
Copy link
Contributor Author

Geal commented Nov 16, 2021

I hereby declare this PR open for review and will avoid messing with the CI for a while now 😬

circleci tries to find a cache for one of the keys, in order, by prefix
matching.

In order:
- we look for a	cache for the same branch(or PR) and the same
  Cargo.lock. This is the key under which we will save a cache if it is
not present, to have an up to date cache for the current PR and make
checks faster
- we look for a cache from the main branch with the same Cargo.lock,
  which means it will have all the dependencies already downloaded and
compiled
- if Cargo.lock changed, we start from the latest cache generated on
  main, it is likely it will have some dependencies in common
@cecton
Copy link
Contributor

cecton commented Nov 17, 2021

I a few questions here so please bear with me 🙏

  1. Correct me if I'm wrong but afaiu tonic, grpcio and http all communicate to the same endpoint using the same protocol. The only difference is that they use different libraries:

    • http uses none (maybe just reqwest) but does not support https
    • grpcio uses grpcio, which depends on grpcio-sys which is a wrapper around the C/C++ lib
    • tonic uses tonic: a pure Rust implementation with optionally tls backed by rustls (also pure rust)

    So there doesn't seem to be a point of releasing one binary that have them all. On the contrary, some of the linked dependencies (grpcio) might not be available on the system or undesirable.

  2. You're not testing the features separately anymore but altogether. I think this is just a new path to build the thing but it does not ensure that the build works on each and every one of them separately.

    For example: it does not ensures that building without grpcio but with tonic would work. It only checks that the whole thing together work. So you could say the test is not 100% complete.

  3. I personally didn't mind shortening the CI build on PRs by keeping only one simple build path (without any of these feature flags) and checking the others only on main. This would have improved the speed of the feedback loop in PR but shifted some of the checks to main and would require us to fix them from time to time.

    To me it was acceptable because this is not the kind of code we touch often. But @abernix had a different opinion on this and thought it would be better to test everything properly in the CI at every PR. In my opinion it was not an issue to keep the CI slow like this as I can run the test quickly locally anyway. It's not like I'm in a hurry to get things merged and I prefer to fix it upfront rather than merging and risking other people to merge their stuff on top of mine and accumulating problems. So I thought he had a good argument and I don't mind anyway.

    I understand that you do mind about the CI build time but if this requires to reduce the coverage of our tests I want to make sure everybody understand the terms and conditions and sign here at the end of the document 😁

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021

http uses none (maybe just reqwest) but does not support https

HTTP is using OTLP/HTTP which is not the same protocol as OTLP/gRPC. It can support HTTPS since it's based on reqwest. If we want to pass specific parameters (like certificate authority or client cert) we can do that by creating a reqwest client manually and using the with_http_client method

So there doesn't seem to be a point of releasing one binary that have them all. On the contrary, some of the linked dependencies (grpcio) might not be available on the system or undesirable.

It does not make sense to tell users that to use OTLP/gRPC they can download the released binary, but for OTLP:HTTP they need to build from source. We could make different binaries for each feature set. Or make them all available at once as is done in this PR.
On the issue of missing dependencies, you saw in #150 that I'd actually like to remove grpcio, because there's no good reason to keep 2 different gRPC implementations.

You're not testing the features separately anymore but altogether. I think this is just a new path to build the thing but it does not ensure that the build works on each and every one of them separately.

I want to reduce the number of paths. Right now, we have otlp-tonic, otlp-http, otlp-grpcio and tls, which modifies otlp-tonic. We already have a case where building a feature separately from the others results in a compilation error: building only with tls, because it should depend on otlp-tonic. Increasing the number of features is increasing the risk that some of them are not properly tested.

What I want is:

  • otlp-http: builds with otlp-http like before
  • otlp-grpc: build with tonic and tls
  • those features are additive, not exclusive like we're currently doing (cargo features are supposed to be additive even if that's not really an issue here)
  • both features are available by default

So you could say the test is not 100% complete.

We're only testing that configuration files are deserialized properly, the tests are already not 100% complete.

I personally didn't mind shortening the CI build on PRs by keeping only one simple build path (without any of these feature flags) and checking the others only on main. This would have improved the speed of the feedback loop in PR but shifted some of the checks to main and would require us to fix them from time to time.

My goal here is exactly that: one simple build path. The difference is that for me that build path should exercise all the features.

Also we are not in a case with multiple features modifying graphql interpretation in potentially incompatible ways, where I'd want the CI to test various combinations. We're in a case where we have multiple incompletely tested telemetry options that will not conflict because we choose only one at runtime. And I trust that if we change that behaviour later it will be obvious in code reviews.

I understand that you do mind about the CI build time but if this requires to reduce the coverage of our tests I want to make sure everybody understand the terms and conditions and sign here at the end of the document

The test coverage stays the same, all of the tests are executed, I even added more.

I get that you don't feel affected too much by the build times, but in my experience, improving build times locally and in CI has a great impact on development velocity. That means we can run a lot more tests, have a quicker feedback loop between a change and tests passing, and switch faster between branches, without having to wait.
Right now the CI takes 15mn to pass. Our target with @o0Ignition0o is 2 to 3mn. With a 3mn CI run you can read a comment on a PR, fix the code, push and get the result while staying in the same conversation. That changes the experience radically.

@cecton
Copy link
Contributor

cecton commented Nov 17, 2021

I guess I'm not too impacted either way so... if @BrynCooke agrees with this PR, let's go for it

@Geal Geal mentioned this pull request Nov 17, 2021
@o0Ignition0o
Copy link
Contributor

There is a way for us to fan out in CI and run each variant of the features in parallel. This would allow us to check each valid permutation of features and keep a steady build time.

However, this doesn't seem to be our concerns here, I would have expected otlp-grpc and otlp-http to be enough, and our binaries to be released with --all-features. If this is not the case, we would need to generate a matrix of build artefacts, which we don't.

Would it be worth opening an issue or a discussion? I feel like this is more of a business decision.

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021

Multiple points to address here:

  • leveraging the CI platform instead of xtask to test feature combinations is the good approach IMO. We have to assume that someone will clone the repository and try to run it with a random set of features. It should work
  • we can have multiple additive features, but there should be only one version of the router per platform, otherwise it will be very hard to support
  • we should strive to keep the number of features low so we don't get combinatorial explosion
  • but also it's fine to add short term features for tests or quick fixes that we will remove soon after (like the post-processing)

@cecton
Copy link
Contributor

cecton commented Nov 18, 2021

leveraging the CI platform instead of xtask to test feature combinations is the good approach IMO. We have to assume that someone will clone the repository and try to run it with a random set of features. It should work

The point of xtask is exactly to make it work right out of box. Right now the repositories builds properly locally when you select the set of feature manually using cargo build but it also works properly when you do the full permutation tests using cargo xtask test. No one needs to open the yaml file yo figure out how to run things. That's the beauty of it.

@Geal
Copy link
Contributor Author

Geal commented Nov 18, 2021

xtask is doing too much or not enough right now. It can be the source of truth for feature permutations, and have an option to build them all in sequence, but it should allow a way to parallelize for CI.
@o0Ignition0o is it possible to have a step in circleci that calls xtask so that it returns a serie of commands to run in parallel?

@o0Ignition0o o0Ignition0o mentioned this pull request Nov 18, 2021
@o0Ignition0o
Copy link
Contributor

o0Ignition0o commented Nov 18, 2021

@Geal yes, there is. I've opened #160 so we can go back to the requirements / problem space and define the expected role for xtask and the ci script, which I can then implement.

This will hopefully allow us to keep this pull request scoped on merging the features, and land it in a reasonable amount of time, while getting buy in on what xtask and the script should do.

Edit: Although I'm all for merging the features, we might wanna make sure everyone else also is (cc @abernix @BrynCooke)

@cecton
Copy link
Contributor

cecton commented Nov 18, 2021

Ok I have answered in the discussion then

@Geal
Copy link
Contributor Author

Geal commented Nov 18, 2021

a proposal: I shuffle my tree and make a separate PR with only the changes to configuration/OTLP and #153 (which is based on this PR), and we put off the decision about what to do or not with features, CI and caches to @o0Ignition0o's PR on CI that goes into much more detail than what I did.
(also I'll make sure the tls feature requires the otlp-tonic feature to avoid that issue)

@o0Ignition0o
Copy link
Contributor

Oh sorry I thought I had replied before, this sounds good to me!

@Geal Geal mentioned this pull request Nov 19, 2021
@Geal
Copy link
Contributor Author

Geal commented Nov 22, 2021

closed in favor of #153

@Geal Geal closed this Nov 22, 2021
@cecton cecton deleted the merge-grpc branch November 24, 2021 08:34
@Geal Geal self-assigned this Dec 1, 2021
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this pull request Oct 16, 2023
…rgo/regex-1.6.0

chore(deps): bump regex from 1.5.4 to 1.6.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants