Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify healing - determinate reselect state for whole chain #1454

Closed
5 tasks done
denis-tingaikin opened this issue May 3, 2023 · 3 comments
Closed
5 tasks done
Assignees

Comments

@denis-tingaikin
Copy link
Member

denis-tingaikin commented May 3, 2023

Steps to reproduce

  1. deploy nsm
  2. deploy nse1 with service ns1
  3. deploy nse2 with service ns1
  4. deploy nsc
  5. nsc connects to nse1
  6. nse1 dies
  7. nsc reselects nse
  8. client's nsmgr misses close from reselect

Expected Behavior

All downstream components correctly handle reselect on next request from the nsc.

Current Behavior

All downstream components do something non-deterministic. (i.e. some component like forwarder may try to restore connection, some component may just refresh the connection)

Context

See at


See at

Initial ideas

  1. Remove here path
    request.GetConnection().Mechanism = nil
    and then all downstream components will think that it is a new connection
  2. Add something into request proto to determidate reselect state by downstream components

Defenition of done

  • Investigate the problem and suggest options on resolve the problem with pros/cons.
  • Get approve from @edwarnicke
  • Implement the solution.
  • Pass linters/code review ~2h
  • Pass integration tests. ~14h
@d-uzlov d-uzlov moved this to In Progress in Release v1.10.0 May 10, 2023
@d-uzlov d-uzlov moved this to Moved to next release in Release v1.9.0 May 11, 2023
@d-uzlov

This comment was marked as outdated.

@d-uzlov
Copy link
Contributor

d-uzlov commented May 16, 2023

Current behavior

If Close reaches all apps in path, it clears all caches, and reselect works as expected.

If Close doesn't reach some apps, those apps try to restore old connection to avoid resource leaks.
Therefore, reselect doesn't work.

Solution 1

Changes:

  • Clear request path.

When path is empty, updatePath will create new IDs.
All the caches that prevent reselect from working are linked to path segment ID.
With new ID we are guaranteed to get reselect.

Pros:

  • Very simple
  • Fully backward-compatible

Cons:

  • The only way to clear leaked resources is using timeout

Solution 2 (Best solution)

Changes:

If Close from client has reached all apps in the connection path, everything will work as before.
If local Nsmgr has restarted, it's likely we go to the same forwarder, and forwarder will close the old connection.
If local forwarder has restarted, but we go to the same remote node, the remote forwarder or nsmgr will clear resources.

It is still possible to have resource leaks if we can't reach some apps in the path.
But this solutions will fix many common cases.

Pros:

  • "Best effort" resource cleanup
  • Still simple

Cons:

  • A small change in API

Example when NSMgr restarted:
image

@denis-tingaikin
Copy link
Member Author

Seems like done

@github-project-automation github-project-automation bot moved this from Moved to next release to Done in Release v1.9.0 Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Status: Done
Development

No branches or pull requests

2 participants