Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Cilium kvstore and support multiple policies. #33

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

deverton-godaddy
Copy link
Contributor

@deverton-godaddy deverton-godaddy commented Nov 13, 2023

Feature or Problem

This change modifies the code base to use the Cilium kvstore package to manage netreap policy data and its interactions with Cilium. This allows netreap to match the configuration of the Cilium agent on the host and enables the introduction of multiple policies. Configuration to access the kvstore can still be overridden but default will be pulled from the Cilium agent config.

Multiple policies can now also be managed under a prefix. Any values within the prefix will be treated as Cilium policy documents and passed to the local agent. This enables much larger policy sets than can be stored in a single key within Consul.

Consumer Impact

Note that this is a breaking change as the location of policy documents in the kvstore has changed from netreap.io/policy to netreap/policies/v1 to match the patterns of Cilium.

Also this change currently depends on Cilium 1.15 as it uses API parameters of the PutPolicy call that were unavailable in earlier versions.

Testing

We've been running this for a while now with no known issues across a few thousand policies. Though this was with a pure etcd solution as we no longer use Consul.

Built on platform(s)

  • x86_64-linux
  • aarch64-linux

Tested on platform(s)

  • x86_64-linux
  • aarch64-linux

Unit Test(s)

Adds some unit tests for the policies reaper .

Copy link

netlify bot commented Nov 13, 2023

Deploy Preview for netreap canceled.

Name Link
🔨 Latest commit aa690c3
🔍 Latest deploy log https://app.netlify.com/sites/netreap/deploys/65a0ae8f0f0815000951cae3

@deverton-godaddy
Copy link
Contributor Author

Happy to rework this PR if needed as it is pretty big and breaks backwards compatibility. It may be possible to run this on earlier versions of Cilium if you don't use multiple policies but I haven't tested that.

@protochron protochron self-requested a review November 15, 2023 14:34
@protochron
Copy link
Contributor

@deverton-godaddy apologies for not getting a chance to really look at this before now! I should be able to get you some feedback today or tomorrow

deverton-godaddy added a commit to deverton-godaddy/netreap that referenced this pull request Jan 12, 2024
Cilium uses logrus instead of zap for logging. This change introducs a simple bridge that should funnel any logs from the Cilium codebase through the zap logger in netreap.

This is part of breaking up cosmonic-labs#33.
This change modifies the code base to use the Cilium kvstore package to
manage netreap policy data and its interactions with Cilium. This allows
netreap to match the configuration of the Cilium agent on the host and
enables the introduction of multiple policies. Configuation can still
be overridden but default will be requires from the Cilium agent.

Multiple policies can now also be managed under a prefix. Any values
within the prefix will be treated as Cilium policy documents and passed
to the local agent.

Note that this is a breaking change as the location of policy documents
in the kvstore has changed from `netreap.io/policy` to `netreap/policies/v1`
to match the patterns of Cilium.

Also this change currently depends on Cilium 1.15 as it uses API parameters
of the `PutPolicy` call that were unavailable in earlier versions.
@deverton-godaddy
Copy link
Contributor Author

It's also worth considering this in the context of #35 with the suggestion of Nomad Variables as a policy source. Using the Cilium kvstore abstraction would make it more difficult to do so though not impossible as you'd could implement the kvstore interface. The question though is that the direction you want to take this project?

I think ideally long term this could be get refactored to the point it could be merged with the Cilium code base as a "Nomad Operator" package. That would suggest hewing close the Cilium way of doing things if possible.

@protochron
Copy link
Contributor

Cilium 1.15 is out, so we can now depend on it which I think was the big blocker towards getting this code merged.

The branch is now out of date, but I think the approach is sound. I agree that in an ideal world we could merge some of this code upstream in Cilium to have native Nomad support, but we'd need to work with them on that. In the meantime I'm definitely not opposed to a kvstore implementation backed by Nomad variables, but I don't think the API has a way to subscribe to updates so I think it would need to poll.

This change fixes up some differences between the RC and the final release of CIlium 1.15.0
@deverton-godaddy
Copy link
Contributor Author

I've updated this branch with the release version of Cilium 1.15 which required a couple of minor changes.

Regarding integrating with Cilium upstream, I think we could definitely get closer by refactoring towards using their Hive/Cell infrastructure. That would provide some consistency with how Cilium operates internally and make it closer to the Kubernetes experience.

The other change I think would be valuable would be to unify the event handling so there's a single event consumer that distributes the work. This should reduce the load on Nomad and again is similar to how the Cilium operator works.

The other other thing that's worth keeping an eye on is how more functions are moving in to the Cilium Operator, e.g. cilium/cilium#30356 which is moving ID generation from the agent to the operator. Matching that function in netreap could be a challenge.

This change renitroduces the old leader election logic but ported
on top of the Cilium kvstore. This should be more reliable than the
previous attempt.
Also fixes a silly bug I introduced where the reaper would get stuck when Nomad sent a bad event.
To reduce the size of the kvstore PR this change removes various unrelated changes.
We've found that it's better to exit immediately on any error from the Nomad event stream. Errors can occur any time but typically during a rolling update of the Nomad servers. Attempting to ignore some error types results in an infinite loop of

```
2024-03-06T04:11:05.089Z	WARN	reapers/endpoints.go:89	Ignoring invalid events payload from Nomad	{"error": "invalid character 's' looking for beginning of value"}
github.com/cosmonic-labs/netreap/reapers.(*EndpointReaper).Run.func1
	/app/netreap/reapers/endpoints.go:89
```

The Nomad job definition for netreap should result in a restart of the daemon process.
The cluster name is used to construct the paths in the kvstore where Cilium stores data. If unset it will be read from the Cilium agent configuration at startup. Can also be provided explcitily on the command line.
Allows managing Cilium clusters with a name other than `default`
When reconciling endpoints in Cilium, this change will now delete any endpoints that are not associated with an allocation in Nomad.
This change modifies the endpoint patching process so that it only patches the Cilium endpoint when security relevant identifiers change. This reduces churn on the endpoint when labels are changing for non-Cilium reasons.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants