Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do we need 2 different GRPC implementations? #150

Closed
Geal opened this issue Nov 16, 2021 · 9 comments · Fixed by #153
Closed

do we need 2 different GRPC implementations? #150

Geal opened this issue Nov 16, 2021 · 9 comments · Fixed by #153
Assignees

Comments

@Geal
Copy link
Contributor

Geal commented Nov 16, 2021

Opentelemetry is set up to have 3 different exporters: HTTP, tonic(grpc) and grpcio( grpc)

I'd like to remove grpcio because building with this library greatly increases the compilation time and the size of the build directory, and we already have the tonic GRPC implementation.
On my machine, building apollo-router in debug mode:

  • cargo build --features tls,otlp-http,otlp-tonic -p apollo-router: builds in 47s, target/ weights at 2.4GB
  • cargo build --features tls,otlp-http,otlp-tonic,otlp-grpcio -p apollo-router: builds in 1mn29s, target/ weights at 3.4GB

They apparently have different feature sets:

grpcio:

  • setup credentials
  • add additional headers
  • config compression
  • select whether to use TLS
  • set the number of GRPC worker threads to poll queues

tonic:

  • add additional metadata
  • set tls config(with tls feature enabled)
  • bring custom channel

Which features do we actually need? Can we contribute the missing ones in tonic? Can we get away with using tonic and a local collector thant handles credentials, custom headers, etc?

@cecton WDYT?

@Geal Geal added the triage label Nov 16, 2021
@cecton
Copy link
Contributor

cecton commented Nov 17, 2021

I don't think anything. 😁 They can all be gone for all I care. 🤷‍♀️ But I do think @BrynCooke actually preferred the grpcio implementation probably because it's the most compatible one. I think I remember he said it's the one what would be most used.

Thanks for looking up and summarizing all the features. 🙏 I suppose this is more of a product kinda decision.

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021

I'll debug #149 to see if tonic is the issue

@BrynCooke
Copy link
Contributor

BrynCooke commented Nov 17, 2021

Disclaimer: I'm not an expert on anything, and this is just stuff that I picked up when browsing the internet.

https://crates.io/crates/opentelemetry-otlp#grpc-libraries-comparison

The thing that makes me wary of tonic is that it uses rustls which is rather new vs openssl that is battle tested.
It's also worth checking the user requirements here:

  • openssl is the sort of thing that gets audited and upgraded in images. By using tonic/rustls the user would have to bump the router version?
  • How to use native certificates with tonic/rustls? https://github.com/rustls/rustls-native-certs. If users are relying on a centralized image building that contains cert management then openssl will work out of the box.

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021

So, I tested CI timings in #153, based on #147 (which is already a bit faster than what we currently have):

About openssl, I disagree strongly and would rather we remove it entirely:

  • while it has been used for a longer time than rustls, it's been plagued by vulnerabilities, especially memory flaws. There exists a stripped down version of openssl, called boringssl, that we could use, but I'm not sure all our deps would support it. On the other hand, rustls is built almost entirely in Rust, avoiding those memory flaws (the non Rust bits are crypto algorithms in assembly extracted from boringssl)
  • for now, we are only using TLS to connect to OTLP or to subgraphs, which are controlled by the user, so the risk is much lower than exposing a TLS server to internet
  • it's debatable, but base image and dependencies upgrades are not very reliable, a lot of images are used with obsolete libraries. With the router we can have a proper release cycle where we warn our users about security updates
  • rustls can use the certificate authorities installed by the OS, and it's supported by most libraries (example: reqwest)

For me, the biggest argument against rustls is actually that it only supports TLS 1.2 and 1.3. Which is not a big issue because they have been officially deprecated, and for internal OTLP tools I'd suspect the configuration is modern enough. For requests to subgraphs (the other place where we use TLS) there might still be services using old versions? I am strongly against supporting TLS 1.0 and 1.1 in a new product in 2021 though.

@BrynCooke
Copy link
Contributor

If rustls can use the OS supplied certificates then this sounds fine.

As long as it is easy for an organisation to invalidate/replace certificates using whatever their standard process is and not have to deal with router specific configuration then it's a plus 1 from me.

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021

I'd suspect a lot of organisations will want a local collector in a sidecar anyway and that's where certificates will be managed

@BrynCooke
Copy link
Contributor

Hashicorp vault sidecar is likely to be something we come across.
It does beg the question, what happens when certs are updated, will we need to restart the router pod?

@Geal
Copy link
Contributor Author

Geal commented Nov 17, 2021 via email

@abernix
Copy link
Member

abernix commented Nov 17, 2021

I'm not opposed to dropping grpcio — at the very least, for the time being — since I think we lack enough concrete feedback on what protocols users necessitate in this regard right now. Plus, I think a fair bit of the OTLP protocol has only recently been stabilized if I'm not mistaken. We probably won't have a good pulse on what protocols are desired for a while, so it seems reasonable to remove it and reintroduce it if we find it necessary in the future.

@Geal Geal mentioned this issue Nov 19, 2021
@Geal Geal self-assigned this Nov 22, 2021
@Geal Geal added 2021-11 and removed triage labels Nov 22, 2021
@Geal Geal closed this as completed in #153 Nov 23, 2021
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this issue Oct 16, 2023
using s3 for this because codesandbox just randomly deleted the
`federation-internals` package from their registry.

i assure you this is only temporary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants