Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add available check for northd enpoint #2799

Merged
merged 1 commit into from
May 16, 2023
Merged

add available check for northd enpoint #2799

merged 1 commit into from
May 16, 2023

Conversation

hongzhen-ma
Copy link
Collaborator

@hongzhen-ma hongzhen-ma commented May 12, 2023

What type of this PR

  • Bug fixes

Which issue(s) this PR fixes:

Fixes #2703

WHAT

🤖 Generated by Copilot at 618b17d

Improve ovn-northd leader check by using TCP probe. Modify pkg/ovn_leader_checker/ovn.go to implement the probe and configure its parameters.

🤖 Generated by Copilot at 618b17d

Oh we are the crew of the ovn-northd ship
And we need to check if our leader is fit
We don't just ping the IP, we probe the TCP
With probe_interval and probe_timeout set

HOW

🤖 Generated by Copilot at 618b17d

  • Add or modify constants and a global variable to configure the TCP connection check for the ovn-northd leader (link)
  • Implement a new function checkNorthdEpAvailable to test the TCP connection to the ovn-northd leader using the endpoint IP and port, and return false if the connection fails more than MaxFailCount times (link)
  • Modify the function checkNorthdEpAlive to call checkNorthdEpAvailable after getting the endpoint IP from the Kubernetes API, and return the result of the TCP connection check instead of just the IP existence (link)
  • Modify the function doOvnLeaderCheck to call checkNorthdEpAlive instead of checkNorthdSvcValidIP when checking the ovn-northd leader status, making the leader check more robust and reliable (link)

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This might affect the performance of the system, and it should be reviewed if this change is necessary or not.
  • The MaxFailCount constant has been added without any explanation of its purpose. It should be documented in a comment above the constant declaration.
  • The failCount variable has been added without any explanation of its purpose. It should be documented in a comment above the variable declaration.
  • The function checkNorthdEpAvailable has been added to check if the OVN northd endpoint is available. However, it does not handle errors properly when the maximum number of failures is reached. If the endpoint is unavailable, it should return false immediately instead of waiting for more failures.
  • The function checkNorthdSvcValidIP has been renamed to checkNorthdEpAlive, which is more accurate. However, the function still returns true even if the endpoint is not available. It should call the checkNorthdEpAvailable function and return its result instead.

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This change should be reviewed to ensure that it does not negatively impact the performance of the system.
  • The MaxFailCount constant has been added, but it is not clear what this value represents or how it is used. This should be documented in the code or in comments.
  • The stealLock function now takes a cfg parameter, which is not used in the function. This parameter should either be used or removed to avoid confusion.
  • The checkNorthdEpAvailable function increments the failCount variable when a connection attempt fails. However, this variable is not reset when a successful connection is made. This could lead to false positives if the maximum fail count is reached and subsequent successful connections are ignored. The failCount variable should be reset to 0 when a successful connection is made.
  • The getNorthdEpIP function returns an empty string if the endpoint cannot be found or has no assigned IP address. However, this function is called without checking for an empty return value, which could cause errors later in the code. The calling functions should check for an empty return value and handle it appropriately.

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This change should be reviewed to ensure that it does not negatively impact the performance of the system.
  • The MaxFailCount constant has been added, indicating a maximum number of failed attempts to connect to the northd leader before considering it unavailable. However, this value is hardcoded and may not be suitable for all environments. Consider making this configurable.
  • The stealLock() function now takes in additional parameters, including the configuration and namespace. This change should be reviewed to ensure that it does not break any existing functionality.
  • The checkNorthdEpAvailable() function has been added to check if the northd endpoint is available. However, this function does not handle errors properly and may cause unexpected behavior. Consider adding proper error handling to this function.
  • The getNorthdEpIP() function has been added to retrieve the IP address of the northd endpoint. However, this function does not handle errors properly and may cause unexpected behavior. Consider adding proper error handling to this function.

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This change might affect the performance of the system, and it should be reviewed to ensure that it does not cause any issues.
  • The MaxFailCount constant has been added, but there is no explanation of what it does. It should be documented to make it clear what its purpose is.
  • The stealLock function has been modified to use the IP address of the northd endpoint instead of always using 127.0.0.1. However, the getNorthdEpIP function used to retrieve the IP address can return an empty string, which would cause fmt.Sprintf to panic. This issue should be addressed to prevent potential crashes.
  • The checkNorthdEpAvailable function has been added to check if the northd endpoint is available. However, it increments a global failCount variable without any synchronization mechanism, which could lead to race conditions. This issue should be fixed to ensure that the variable is accessed safely.
  • The checkNorthdEpAlive function has been modified to use checkNorthdEpAvailable to check if the northd endpoint is available. However, it still returns true even if the endpoint is not available, which could cause issues later on. It should be updated to return false if the endpoint is not available.

pkg/ovn_leader_checker/ovn.go Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This change may affect the performance of the system, and it is important to ensure that this new value does not cause any issues. It would be helpful to have more information on why this change was made and if any testing was done to validate the new value.
  • The MaxFailCount constant has been added without any explanation or context. It is unclear what this constant represents and how it affects the system. It would be helpful to add comments explaining its purpose and usage.
  • The failCount variable has been added without any explanation or context. It is unclear what this variable represents and how it affects the system. It would be helpful to add comments explaining its purpose and usage.
  • The checkNorthdEpAvailable function has been added to check if the northd endpoint is available. However, it does not handle errors properly and may cause the program to panic. It would be better to return an error instead of panicking and handle the error in the calling function.
  • The getOvnSBEpIP function returns an empty string if the endpoint is not found or has no addresses assigned. However, it does not return an error or log any messages, making it difficult to debug issues related to missing endpoints or addresses. It would be better to return an error or log a message when an endpoint or address is not found.

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This might affect the performance of the system, and it should be reviewed if this change is necessary or not.
  • The MaxFailCount constant has been added, but there is no explanation about its purpose. It should be documented to make it clear why this value was chosen.
  • The stealLock function now receives two parameters, but only one of them is being used. The namespace parameter is not being used, and it should be removed to avoid confusion.
  • The checkNorthdEpAvailable function is new and seems to be a good addition to check if the northd endpoint is available. However, it should be renamed to something more descriptive, like checkNorthdEndpoint.
  • The checkNorthdSvcValidIP function has been modified to use the checkNorthdEpAvailable function instead of checking if the IP is valid. This change makes sense, but the function name is misleading and should be updated to reflect the new behavior.

@github-actions
Copy link
Contributor

  • The DefaultProbeInterval constant has been changed from 15 to 5. This change should be reviewed to ensure that it does not negatively impact the performance of the system. If this change was made to improve performance, then it should be documented in the code comments.
  • The MaxFailCount constant has been added without any explanation or documentation. It is unclear what this constant represents and how it affects the behavior of the system. This should be documented in the code comments.
  • The failCount variable has been added without any explanation or documentation. It is unclear what this variable represents and how it affects the behavior of the system. This should be documented in the code comments.
  • The checkNorthdEpAvailable function has been added to check if the northd endpoint is available. However, it does not handle errors properly and may cause the program to panic. This function should be updated to handle errors gracefully.
  • The checkNorthdSvcValidIP function has been renamed to checkNorthdEpAlive and now checks if the northd endpoint is alive instead of checking if it has a valid IP address. This change should be documented in the code comments to avoid confusion.

@oilbeater
Copy link
Collaborator

Need backport to 1.11

@hongzhen-ma hongzhen-ma merged commit 2ba3846 into master May 16, 2023
@hongzhen-ma hongzhen-ma deleted the northd branch May 16, 2023 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accelerate northd leader transfer when the node is shut down
2 participants