Add OFSwitch connection check to Agent's liveness probes #4126
LGTM
Codecov Report
@@ Coverage Diff @@
## main #4126 +/- ##
==========================================
- Coverage 65.96% 61.99% -3.97%
==========================================
Files 304 309 +5
Lines 46625 47736 +1111
==========================================
- Hits 30754 29596 -1158
- Misses 13461 15802 +2341
+ Partials 2410 2338 -72
/test-all
path: /livez
port: api
scheme: HTTPS
initialDelaySeconds: 10
Question: why do we increase it to 10?
It typically takes more than 2 seconds to connect to OVS, according to the agent logs in my testbed. I think 5 seconds may cause a liveness failure at startup in production clusters when nodes are under resource pressure. The liveness probe is a backup solution in case reconnection doesn't work as expected, so it doesn't need to be too aggressive.
Besides, all Kubernetes components use an initialDelaySeconds of 10 seconds.
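To make the discussion concrete, here is a minimal sketch of a /livez handler that fails while the OpenFlow switch connection is down. It is not the code from this PR: the connChecker interface, fakeSwitch, checkOFSwitch, livezHandler, and probe are hypothetical names, and only the idea of an IsConnected check comes from the PR description. It uses the Go standard library only.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connChecker is a hypothetical stand-in for the agent's OpenFlow client.
// Only the IsConnected idea is taken from the PR; the interface is assumed.
type connChecker interface {
	IsConnected() bool
}

// fakeSwitch is an illustrative implementation backed by an atomic flag,
// so reads from the probe handler do not race with connection updates.
type fakeSwitch struct{ connected atomic.Bool }

func (f *fakeSwitch) IsConnected() bool { return f.connected.Load() }

// checkOFSwitch returns an error while the switch connection is down.
func checkOFSwitch(c connChecker) error {
	if !c.IsConnected() {
		return errors.New("ofswitch is not connected")
	}
	return nil
}

// livezHandler answers 200 when the check passes and 500 otherwise,
// which is what a kubelet HTTP liveness probe evaluates.
func livezHandler(c connChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if err := checkOFSwitch(c); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, "ok")
	}
}

// probe issues one in-memory GET /livez and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/livez", nil))
	return rec.Code
}

func main() {
	sw := &fakeSwitch{}
	h := livezHandler(sw)
	fmt.Println(probe(h)) // switch not yet connected: 500
	sw.connected.Store(true)
	fmt.Println(probe(h)) // switch connected: 200
}
```

With a check like this, a probe that starts before the agent has connected to OVS would fail, which is the reason given above for a longer initialDelaySeconds.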
1f23fcf to 2071cbc
The unit test failures in Test_ofPacketOutBuilder_Done occurred because the test relies on specific pseudo-random numbers. Created #4148 to fix it.
78839f5 to f184dc0
This helps automatic recovery if some issues cause OFSwitch reconnection to not work properly. It also fixes a race condition between the IsConnected and SwitchConnected methods of OFBridge and makes necessary changes to the constructor of APIServer to allow testing. For antrea-io#4092 Signed-off-by: Quan Tian <qtian@vmware.com>
f184dc0 to 834dc34
/test-all
secureServing.BindAddress = bindAddress
secureServing.BindPort = o.config.APIPort
secureServing.CipherSuites = o.tlsCipherSuites
secureServing.MinTLSVersion = o.config.TLSMinVersion
In the original code, secureServing.MinTLSVersion was set to cipher.TLSVersionMap[o.config.TLSMinVersion]. Is that different from the current version?
They are different. The original code sets the GenericAPIServer's attributes s.SecureServingInfo.CipherSuites and s.SecureServingInfo.MinTLSVersion directly after the server is created, which is bad from the perspective of encapsulation. The right way to configure the GenericAPIServer is to set the options used to create it, which are SecureServingOptions.CipherSuites and SecureServingOptions.MinTLSVersion in this case. The options are translated to s.SecureServingInfo.CipherSuites and s.SecureServingInfo.MinTLSVersion when the GenericAPIServer is created:
https://github.com/kubernetes/kubernetes/blob/8206c9d458e321d7ad22ea9fc2e21a890790fc09/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L274-L286
Basically, the previous code passed empty options when creating the APIServer, translated the options itself with code duplicated in Antrea, and mutated the APIServer directly.
if err != nil {
	return fmt.Errorf("invalid TLSMinVersion: %v", err)
}
trimmedTLSCipherSuites := strings.ReplaceAll(o.config.TLSCipherSuites, " ", "")
Lines 186-192 are similar to cipher.GenerateCipherSuitesList(); the difference is that the new code has a trimmedTLSCipherSuites != "" check. Why not use this version to rewrite GenerateCipherSuitesList?
As explained above, the code in pkg/util/cipher/ duplicates the code in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/cli/flag/ciphersuites_flag.go and https://github.com/kubernetes/kubernetes/blob/8206c9d458e321d7ad22ea9fc2e21a890790fc09/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L274-L286. The right way to set the TLS configuration is to set SecureServingOptions with raw strings instead of calculating the results and mutating the APIServer directly. The difference is not only the check, but also the data type of the slice: we are passing []string to SecureServingOptions, not []uint16 to GenericAPIServer.
pkg/util/cipher/ will be removed after all APIServers are changed to use upstream code. I don't want to touch the antrea-controller and flow-aggregator code in this PR, so I am keeping it for now.
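For reference, the string handling discussed in this thread can be sketched as below. The helper name splitCipherSuites is hypothetical and the comma separator is an assumption; only the space trimming with strings.ReplaceAll and the empty-string check come from the diff. The check matters because strings.Split on an empty string returns a slice containing one empty element, which would then be treated as an (invalid) cipher suite name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCipherSuites is a hypothetical helper: it turns a comma-separated
// config value into the []string form that the serving options expect.
func splitCipherSuites(raw string) []string {
	trimmed := strings.ReplaceAll(raw, " ", "")
	if trimmed == "" {
		// strings.Split("", ",") would yield [""], so return nil instead
		// and let the server fall back to its default cipher suites.
		return nil
	}
	return strings.Split(trimmed, ",")
}

func main() {
	fmt.Printf("%q\n", splitCipherSuites("TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384"))
	fmt.Println(splitCipherSuites("   ") == nil)
}
```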
LGTM
This helps automatic recovery if some issues cause OFSwitch reconnection to
not work properly.
It also fixes a race condition between the IsConnected and SwitchConnected
methods of OFBridge and makes necessary changes to the constructor of
APIServer to allow testing.
For #4092
Signed-off-by: Quan Tian <qtian@vmware.com>