Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable IPSec encryption of tunnel traffic #209

Merged
merged 3 commits into from
Dec 27, 2019
Merged

Enable IPSec encryption of tunnel traffic #209

merged 3 commits into from
Dec 27, 2019

Conversation

jianjuns
Copy link
Contributor

@jianjuns jianjuns commented Dec 9, 2019

  1. Create a separate OVS tunnel port for each remote Node.
  2. Change L3 forwarding flows to support outputting to the IPSec tunnel port.
  3. Add yamls to enable IPSec encryption and add a new container in the Agent DS to run IPSec daemons.

Will add e2e tests in the same PR.

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests. This command can only be run by members of the vmware-tanzu organization
  • /skip-e2e: to skip e2e tests. This command can only be run by members of the vmware-tanzu organization

@jianjuns
Copy link
Contributor Author

jianjuns commented Dec 9, 2019

@wenyingd: I changed the L3 and L2 tables as we discussed offline. Please check.

pkg/agent/openflow/client.go Outdated Show resolved Hide resolved
pkg/agent/agent.go Outdated Show resolved Hide resolved
@@ -37,7 +37,7 @@ func installNodeFlows(ofClient Client, cacheKey string) (int, error) {
gwMAC, _ := net.ParseMAC("AA:BB:CC:DD:EE:FF")
IP, IPNet, _ := net.ParseCIDR("10.0.1.1/24")
peerNodeIP := net.ParseIP("192.168.1.1")
err := ofClient.InstallNodeFlows(hostName, gwMAC, IP, *IPNet, peerNodeIP)
err := ofClient.InstallNodeFlows(hostName, gwMAC, IP, *IPNet, peerNodeIP, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you just want to not block original UT, you could use types.DefaultTunOFPort for parameter of ofport, or the case might fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the test does not fail with 0. But I could change to types.DefaultTunOFPort.

@jianjuns jianjuns force-pushed the ipsec branch 2 times, most recently from 2c95a45 to f812877 Compare December 10, 2019 21:45
@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@jianjuns jianjuns changed the title WIP: Enable IPSec encryption of tunnel traffic Enable IPSec encryption of tunnel traffic Dec 20, 2019
@jianjuns
Copy link
Contributor Author

I still plan to add e2e tests before merging the PR, but the current comments are ready for review.

@jianjuns jianjuns force-pushed the ipsec branch 2 times, most recently from 57be6d7 to c81ce7c Compare December 22, 2019 07:06
@jianjuns
Copy link
Contributor Author

/test-e2e

@jianjuns
Copy link
Contributor Author

/test-e2e

pkg/agent/agent.go Outdated Show resolved Hide resolved
pkg/agent/controller/noderoute/node_route_controller.go Outdated Show resolved Hide resolved
pkg/agent/controller/noderoute/node_route_controller.go Outdated Show resolved Hide resolved
pkg/agent/controller/noderoute/node_route_controller.go Outdated Show resolved Hide resolved
@@ -45,6 +48,8 @@ const (
maxRetryDelay = 300 * time.Second
// Default number of workers processing a node change
defaultWorkers = 4

ovsExternalIDNodeName = "node-name"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now ovs external ID constants are spread in files, so do ParseXXXInterfaceConfig. Do you think we could put them together and closer to interfaceCache?

Copy link
Contributor Author

@jianjuns jianjuns Dec 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then where you suggest to put them (and buildXXXExternalIDs)? I do not think externalIDs belong to InterfaceStore, which is just a data store for interface data, and should not know anything about OVS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion, just felt a place having all external ID constants and the parsing functions might be convenient to know what data antrea persist in ovsdb and save class like AgentInitializer to depend on all packages that have parsing function. I didn't mean putting them in interfaceCache, but a separate package or file closer to interfaceCache as they are the things stored and read in interfaceCache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally do not like to create too many small pkgs, but do not have a good idea which existing pkg for them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering a separate file in pkg/agent/interfacestore. But I'm not sure if it makes sense in other perspectives like the interface scope you mentioned. I'm fine with keeping them as they are until we are clear that the other way were more appropriate.

pkg/agent/interfacestore/types.go Outdated Show resolved Hide resolved
// multiple calls to GenerateContainerInterfaceName with the same parameters
// return the same value). The output has the length of interfaceNameLength(15).
// The probability of collision should be neglectable.
func GenerateContainerInterfaceName(podName string, podNamespace string) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it cause existing interface names change? Would upgrading be affected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should first ask which format you like.

To handle upgrade, we could change to use Pod name + Namespace name to be the key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not supporting upgrade make users hard to use new features and get bugs in their clusters fixed? I think handling upgrade in code to consider previously used key might be complicated and hard to describe the compatibility in doc when we want to remove it. Do you think we can just use major release number to indicate compatibility like some software do? If it's 0.2.0 to 0.3.0, maybe we should avoid changes that break compatibility and leave it to major release change if it's necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on how many deployments will use 0.2 and 0.1 and would expect upgrade without impact to existing Pods. If we want to make sure nothing breaks across versions, guess we need to add upgrade tests.

But for this change there is another way to handle it - if we believe not safe to use the generated interface name as the key we could instead use Pod + NS or Node name to be the key for InterfaceStore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree it's ok to ignore existing Pods at this stage.
If it's just changing key of InterfaceStore, I think it doesn't break compatibility? My concern was GenerateContainerInterfaceName is used in setupInterfaces for host veth name, not sure what would happen if it changes. Anyway, I agree it's fine to not handle upgrade case until we declare it and test it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought over. Maybe let me revert the name change in this PR, and we could make a separate one to change it.
Do you support the new format I used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the difference keeping "-" in the name and different hashing value? For the former, I'm not sure why @wenyingd removed it and whether it's related to compatibility when names contain "-". If not, I'm fine with both.
@wenyingd could you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restored the original format.

@jianjuns
Copy link
Contributor Author

/test-e2e

For each remote Node, create a separate tunnel port and interface on
the OVS bridge, and set the remote IP and PSK options to the
interface.
Refactore InterfaceStore structs and funcs to add tunnel interface
attributes and group InterfaceConfig fields by interface types.

For packets to a tunnel, change the l3Forwarding table flow to set the
output port to the tunnel interface, and resubmit to the
l2ForwardingOutput table (bypassing the l2ForwardingCalcTable and the
ingress rules tables).
When IPSec is enabled, for every remote Node set the output port to
the Node's tunnel interface, and also add a separate classification
flow in the in port classification table for the tunnel.
Change Makefile to generate antrea-ipsec.yml that deploys Antrea with
the IPSec encryption enabled.
Add a config option to antrea-agent.conf for enabling IPSec
encryption.
Add yaml patches for creating a K8s Secret to save IKE PSK and adding
an environment variable to the antrea-agent container to pass the
Secret value.
@jianjuns
Copy link
Contributor Author

/test-e2e

@jianjuns
Copy link
Contributor Author

/test-e2e

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a question about duplicate error logs.

if err := c.ovsBridgeClient.DeletePort(interfaceConfig.PortUUID); err != nil {
klog.Errorf("Failed to delete OVS tunnel port %s for Node %s: %v",
interfaceConfig.InterfaceName, nodeName, err)
return fmt.Errorf("failed to delete OVS tunnel port for Node %s", nodeName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not unified before, but want to discuss it here, finally these errors will be logged in processNextWorkItem:

klog.Errorf("Error syncing Node %s, requeuing. Error: %v", key, err)

Do you think we can remove all klog.Errorf to avoid duplicate error logs? I'm thinking maybe we should only use klog.Errorf when the code is going to swallow the error and not pop up it, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought about it too. Just feel might be easier to debug if log closer to where the failure happens, and probably log with more details there.
For example, if we do not log here, should we always append the previously returned error to the new error and return to the caller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if not log here, previously returned error should be appended, like L246 (I think it's the most common way to return error in Go). That would guarantee no information will be missing. I raised it because there were only little difference between the error logs and the returned errors, while there should be no reason to not include details in returned errors too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Here I still feel easier to debug if log closer to the failure. But no strong opinion. If you believe we should always log in the upper layer, I could change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just personal opinion, not strong too. Feel feel to ignore.

@jianjuns
Copy link
Contributor Author

/test-conformance

@jianjuns
Copy link
Contributor Author

/test-e2e

@edwardbadboy
Copy link
Contributor

/test-conformance
/test-networkpolicy

@edwardbadboy
Copy link
Contributor

/test-e2e

1 similar comment
@edwardbadboy
Copy link
Contributor

/test-e2e

Add a test that redeloys Antrea with IPSec tunnel enabled, and tests
connectivity between two Pods on two different Nodes.
@jianjuns
Copy link
Contributor Author

/test-e2e

@tnqn
Copy link
Member

tnqn commented Dec 26, 2019

/test-all

@jianjuns jianjuns merged commit a5d9c42 into master Dec 27, 2019
@jianjuns jianjuns deleted the ipsec branch December 27, 2019 00:21
@jianjuns jianjuns mentioned this pull request Jan 22, 2020
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants