Resources leak until timeout if response fails to return to the Client #1020

Open · Bolodya1997 opened this issue Jul 14, 2021 · 14 comments

Labels: bug (Something isn't working), good first issue (Good for newcomers), stability (The problem is related to system stability)


Bolodya1997 commented Jul 14, 2021

Expected Behavior

All allocated resources should either be accessible from the Client or be cleaned up within a short time.

Current Behavior

We can end up with resources that are allocated but cleaned up only when a timeout happens:

  1. The NSC requests the icmp-responder with request timeout T.
  2. The Request blocks in discover until an NSE registers.
  3. The NSE registers after duration T - t.
  4. Within the remaining time t, the Request reaches the NSE, allocates IP addresses and starts returning.
  5. The request timeout T expires.
  6. The IP addresses allocated in step 4 are not accessible from the Client and will be cleaned up only on the (NSMgr -> NSE) token timeout.

The issue happens in real life:

https://github.com/networkservicemesh/integration-k8s-packet/actions/runs/1029599216

Bolodya1997 added the bug (Something isn't working) label on Jul 14, 2021
Bolodya1997 (Author) commented:

So here is a basic problem:

... -> hop-a (id-a) -> hop-b (id-b) -> ...

If a Request allocates some resources in hop-b and doesn't return the id-b ID to hop-a, hop-a is unable to Close these resources. Even if it tries to Close, it will end up closing an id-b' Connection, which doesn't exist on hop-b.

I don't think this can be solved without adding backward monitoring from hop-b to hop-a - in that case hop-b would notice the broken monitor stream and close the id-b Connection.

@edwarnicke @denis-tingaikin
Thoughts?

Bolodya1997 added the question (Further information is requested) label on Jul 14, 2021
denis-tingaikin (Member) commented Jul 14, 2021

It seems to me we are missing a Close call. The monitor server knows when the stream breaks, so it can produce an event to close the resources.

Bolodya1997 (Author) commented Jul 16, 2021

> It seems to me we are missing a Close call. The monitor server knows when the stream breaks, so it can produce an event to close the resources.

So here is a problem with a monitor server:

  1. The monitor server doesn't actually know which Connection the left side was requesting, because monitoring is a separate API that is not part of the Request/Close chain.
  2. Even if it could know it, we would introduce a huge restriction - "if you request monitor events, please hold this stream for your whole lifetime, otherwise the whole left side will be closed" - which affects our current solution with cmd-nsc-init and cmd-nsc and prevents non-owners of the Connection from monitoring it.

Bolodya1997 (Author) commented:

Actually, by backward monitoring here I don't mean sending monitor updates from the right side to the left side. The only thing we need is a streaming request from the left side to the right side that is part of the Request/Close chain and has the lifetime of the corresponding Connection.

@edwarnicke
Thoughts?

edwarnicke (Member) commented:

@Bolodya1997 This isn't a leak as long as things are cleaned up when the expireTime expires. That said, please see networkservicemesh/deployments-k8s#906

Bolodya1997 (Author) commented:

@edwarnicke
Generally you are right, but it can lead to problematic issues like networkservicemesh/sdk-vpp#315. This issue was initially filed because of a failure in our integration tests caused by an exhausted IP pool in IPAM for the same reason.

Bolodya1997 (Author) commented:

@edwarnicke
Please take a look, here are 2 new solutions:

Monitor based solution

We can add a client chain element before the networkservice.Client that creates a monitor stream on initial Request failure. If there is a corresponding Connection on the remote side, we should close it (a rough sketch follows below).

  • This solution doesn't require any new API or backward monitoring.
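A minimal sketch of such a client chain element, assuming simplified stand-in types rather than the real networkservicemesh API/SDK interfaces (the element name and the Monitor helper are hypothetical):

```go
// Hypothetical sketch only: types below are simplified stand-ins for the
// networkservicemesh API/SDK interfaces, not the real ones.
package monitorcleanup

import (
	"context"
	"log"
)

type Connection struct{ ID string }
type NetworkServiceRequest struct{ Connection *Connection }

// Client is a stand-in for a networkservice client chain element.
type Client interface {
	Request(ctx context.Context, request *NetworkServiceRequest) (*Connection, error)
	Close(ctx context.Context, conn *Connection) error
}

// Monitor is a stand-in for the monitor stream: it answers whether the next
// hop currently holds a Connection with the given ID.
type Monitor interface {
	HasConnection(ctx context.Context, id string) (bool, error)
}

// monitorCleanupClient closes a remote Connection that was created even though
// the initial Request never made it back to us.
type monitorCleanupClient struct {
	next    Client
	monitor Monitor
}

func (c *monitorCleanupClient) Request(ctx context.Context, request *NetworkServiceRequest) (*Connection, error) {
	conn, err := c.next.Request(ctx, request)
	if err == nil {
		return conn, nil
	}
	// The initial Request failed on the way back: the request context may
	// already be dead, so use a fresh one to check and clean up the remote side.
	cleanupCtx := context.Background()
	exists, mErr := c.monitor.HasConnection(cleanupCtx, request.Connection.ID)
	if mErr == nil && exists {
		if cErr := c.next.Close(cleanupCtx, request.Connection); cErr != nil {
			log.Printf("cleanup Close for %s failed: %v", request.Connection.ID, cErr)
		}
	}
	return nil, err
}

func (c *monitorCleanupClient) Close(ctx context.Context, conn *Connection) error {
	return c.next.Close(ctx, conn)
}
```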

Backward monitoring solution

We can add an establish API (a rough interface sketch follows this list):

  1. The Client requests a stream from the Server on the Request context and sends an initial event with its connection ID.
  2. The Client sends the Request to the Server.
  3. The Server receives the Request, already has a stream matching it, and calls next.Request.
  4. If the path wasn't modified during the Request, the Server closes the stream.
  5. The Server returns the Connection.
  6. The Client sends a close event and closes the stream.
  7. The Server receives the close event and also closes the stream.
  8. If the Server doesn't receive a close event but the stream closes with a context timeout, it closes the Connection on its side.
  • This solution doesn't affect healing in any way, because it works only with initial Requests.
  • This solution will work in case of a sudden network failure (the first solution simply can't send a Close in that case).
  • But it needs a new API and backward monitoring.
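A hedged sketch of what the proposed establish API could look like, in the same simplified stand-in style (none of these names exist in NSM today; they only illustrate the steps above):

```go
// Hypothetical sketch only: the establish API proposed above does not exist in
// NSM; these stand-in types just illustrate the intended message flow.
package establish

import "context"

// Event is sent over the establish stream: once at the start with the
// connection ID (step 1) and once as an explicit close event (step 6).
type Event struct {
	ConnectionID string
	Close        bool
}

// Stream is the client-side view of the establish stream.
type Stream interface {
	Send(ev *Event) error // steps 1 and 6
	CloseSend() error     // step 6, after the close event
}

// Establisher opens the stream on the Request context before the Request is
// sent (steps 1-2); the Server matches an incoming Request to an open stream
// (step 3) and, if the stream dies with a context timeout instead of a close
// event, closes the Connection on its side (step 8).
type Establisher interface {
	Establish(ctx context.Context, connectionID string) (Stream, error)
}
```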

glazychev-art self-assigned this on Aug 3, 2021
denis-tingaikin (Member) commented:

@edwarnicke Do you agree with the solution from @Bolodya1997 and @glazychev-art in #1051?

To me it looks a bit complicated, but I'm fine with it if you are OK with it.

Bolodya1997 (Author) commented:

If this gets fixed, we need to set the request timeout in the heal tests back to 15s - networkservicemesh/deployments-k8s#2541.

Bolodya1997 (Author) commented:

@edwarnicke
Should we think about something other than a timeout to close leaked connections?

Bolodya1997 (Author) commented:

This can (and sometimes does) lead to very painful bugs with kernel interfaces in the VPP Forwarder - networkservicemesh/sdk-vpp#315.

denis-tingaikin (Member) commented Sep 1, 2021

@Bolodya1997
The currently proposed solution is not better than simply using the timeout.
So I am moving this into the backlog until we get the motivation to come back to it and find a simpler solution.

denis-tingaikin removed the question (Further information is requested) label on Sep 1, 2021
denis-tingaikin added the stability (The problem is related to system stability) label on Sep 7, 2021
denis-tingaikin (Member) commented Sep 29, 2021

@edwarnicke Should we do this before the release? I think we can try testing a behaviour that handles ctx.Done and calls next.Server(ctx).Close(ctx, ...).
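A minimal sketch of that idea, again with simplified stand-in types rather than the real SDK interfaces (in the actual SDK the next hop would be reached via next.Server(ctx)):

```go
// Hypothetical sketch only: a server chain element that notices the request
// context is already done while the Connection is travelling back, and closes
// it downstream so the resources don't have to wait for the token timeout.
package ctxcleanup

import "context"

type Connection struct{ ID string }
type NetworkServiceRequest struct{ Connection *Connection }

// Server is a stand-in for a networkservice server chain element.
type Server interface {
	Request(ctx context.Context, request *NetworkServiceRequest) (*Connection, error)
	Close(ctx context.Context, conn *Connection) error
}

type cleanupServer struct {
	next Server // in the real SDK this would be next.Server(ctx)
}

func (s *cleanupServer) Request(ctx context.Context, request *NetworkServiceRequest) (*Connection, error) {
	conn, err := s.next.Request(ctx, request)
	if err != nil {
		return nil, err
	}
	select {
	case <-ctx.Done():
		// The Client has already given up on this Request, so the Connection
		// will never reach it; close it downstream with a fresh context.
		_ = s.next.Close(context.Background(), conn)
		return nil, ctx.Err()
	default:
		return conn, nil
	}
}

func (s *cleanupServer) Close(ctx context.Context, conn *Connection) error {
	return s.next.Close(ctx, conn)
}
```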

edwarnicke (Member) commented:

@denis-tingaikin I think this is probably a good idea... I'd suggest we do it in its own chain element, and place it before 'begin' (as it will then clean up begin). Should be pretty quick and easy :)
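For illustration, the placement could look roughly like this; chain.NewNetworkServiceServer and begin.NewServer are existing sdk constructors, while the cleanup element itself is hypothetical and is passed in as a parameter here:

```go
// Hypothetical placement sketch: the new cleanup element goes before the
// sdk's 'begin' element so that it also cleans up begin's state.
package sketch

import (
	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"github.com/networkservicemesh/sdk/pkg/networkservice/common/begin"
	"github.com/networkservicemesh/sdk/pkg/networkservice/core/chain"
)

// newEndpointChain shows only the ordering; the rest of the endpoint chain is elided.
func newEndpointChain(cleanupElement networkservice.NetworkServiceServer, rest ...networkservice.NetworkServiceServer) networkservice.NetworkServiceServer {
	elements := append(
		[]networkservice.NetworkServiceServer{
			cleanupElement,    // hypothetical new element handling ctx.Done / failed return path
			begin.NewServer(), // existing 'begin' chain element
		},
		rest...,
	)
	return chain.NewNetworkServiceServer(elements...)
}
```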
