Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: Disable probes in remote agent #164

Merged
merged 6 commits into from
Mar 21, 2023

Conversation

Eneuman
Copy link
Contributor

@Eneuman Eneuman commented Feb 11, 2023

This PR disables probes from beeing cloned to the remote agent pod when the remote agent is deployed.
Existing probes are restored when EndpointManager disconnects.

You can enable cloning of probes by adding the following line to the KubernetesLocalProcessConfig.yaml file.

enableFeatures:
  - Probes  

fixes #114
fixes #111

@Eneuman Eneuman requested review from a team, Tatsinnit and sabbour as code owners February 11, 2023 17:42
@Eneuman
Copy link
Contributor Author

Eneuman commented Feb 13, 2023

@elenavillamil I just noticed that the existing code in GetStatefulSetJsonPatch doesn't handle StartupProbes. Do you want me to fix this in this PR as well?

Copy link
Contributor

@elenavillamil elenavillamil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for your contribution :) Regarding the comment on _GetStatefulSetJsonPathch, I am trying to think what is best way forward for backwards compatibility. I can think of two options:

  1. Remove the flag and bridge always removes probes for all code paths by default. A bit scared as to why old team did not do that from the begining.
  2. Keep flag and make it default true. So even if user doesnt add it we will remove it, but it will be user option to keep probes it if they want to.

I am leaning towards option 2 since it adds more flexibility for not that much more work. All we would have to do is add a new section (DisabledFeatures) instead of adding this new setting under EnabledFeatures on the configuration yaml file.

@elenavillamil
Copy link
Contributor

elenavillamil commented Feb 13, 2023

@elenavillamil I just noticed that the existing code in GetStatefulSetJsonPatch doesn't handle StartupProbes. Do you want me to fix this in this PR as well?

I think I was in the middle of reviewing this PR as you commented this :) I just posted the review, let me know if you have questions after reading the comments.

@Eneuman
Copy link
Contributor Author

Eneuman commented Feb 13, 2023

Thank you again for your contribution :) Regarding the comment on _GetStatefulSetJsonPathch, I am trying to think what is best way forward for backwards compatibility. I can think of two options:

  1. Remove the flag and bridge always removes probes for all code paths by default. A bit scared as to why old team did not do that from the begining.
  2. Keep flag and make it default true. So even if user doesnt add it we will remove it, but it will be user option to keep probes it if they want to.

I am leaning towards option 2 since it adds more flexibility for not that much more work. All we would have to do is add a new section (DisabledFeatures) instead of adding this new setting under EnabledFeatures on the configuration yaml file.

@elenavillamil Sure, sound good to me, but shouldn't probes be a "enable" feature in that case?
Like this?

version: 0.1
enableFeatures:
  - ManagedIdentity
  - Probes

@elenavillamil
Copy link
Contributor

elenavillamil commented Feb 13, 2023

Thank you again for your contribution :) Regarding the comment on _GetStatefulSetJsonPathch, I am trying to think what is best way forward for backwards compatibility. I can think of two options:

  1. Remove the flag and bridge always removes probes for all code paths by default. A bit scared as to why old team did not do that from the begining.
  2. Keep flag and make it default true. So even if user doesnt add it we will remove it, but it will be user option to keep probes it if they want to.

I am leaning towards option 2 since it adds more flexibility for not that much more work. All we would have to do is add a new section (DisabledFeatures) instead of adding this new setting under EnabledFeatures on the configuration yaml file.

@elenavillamil Sure, sound good to me, but shouldn't probes be a "enable" feature in that case? Like this?

version: 0.1
enableFeatures:
  - ManagedIdentity
  - Probes

@Eneuman This is another way of thinking of it. I like it. If Probes is enabled then Bridge keeps the configuration for the probes. Else it will remove them. Once this is checked in and released we will update the public documentation.

@sabbour
Copy link
Collaborator

sabbour commented Feb 14, 2023

Looks good to me. Are there any scenarios where a user would want to have the probes enabled by default in the remote agent that would break with this release?
Also is there a risk of failing to restore probes when the EndpointManager disconnects? Can we add a unit test for this?

@elenavillamil
Copy link
Contributor

elenavillamil commented Feb 14, 2023

Its also worth noting that B2K code has handle to return 200 for probes here. Personally, I think just ignoring the probes is a simpler and less error prone solution than trying to intercept probe requests and return 200. But wanted to drop this here for discussion. @Eneuman and @hsubramanianaks as FYI for thoughts :)

@Eneuman
Copy link
Contributor Author

Eneuman commented Feb 16, 2023

Its also worth noting that B2K code has handle to return 200 for probes here. Personally, I think just ignoring the probes is a simpler and less error prone solution than trying to intercept probe requests and return 200. But wanted to drop this here for discussion. @Eneuman and @hsubramanianaks as FYI for thoughts :)

@elenavillamil @hsubramanianaks
My experience is that the todays handling of probe requests sometime works but mostly not. Now that I disabled the probes it feels more solid. If the pod is always replaced with the devagent then I cannot think of a scenario where probes are needed.

Maby we should add a condition to the code you referenced to @elenavillamil so It also gets disabled when probes are not enabled?

@elenavillamil
Copy link
Contributor

Its also worth noting that B2K code has handle to return 200 for probes here. Personally, I think just ignoring the probes is a simpler and less error prone solution than trying to intercept probe requests and return 200. But wanted to drop this here for discussion. @Eneuman and @hsubramanianaks as FYI for thoughts :)

@elenavillamil @hsubramanianaks My experience is that the todays handling of probe requests sometime works but mostly not. Now that I disabled the probes it feels more solid. If the pod is always replaced with the devagent then I cannot think of a scenario where probes are needed.

Maby we should add a condition to the code you referenced to @elenavillamil so It also gets disabled when probes are not enabled?

Yes, I think so. I like the idea of associating that code with the flag we are introducing with this change. This will remove forwarding complexity in the general case and it will make it cleaner if we every want to drop that flag and code path.

@Eneuman
Copy link
Contributor Author

Eneuman commented Feb 16, 2023

Its also worth noting that B2K code has handle to return 200 for probes here. Personally, I think just ignoring the probes is a simpler and less error prone solution than trying to intercept probe requests and return 200. But wanted to drop this here for discussion. @Eneuman and @hsubramanianaks as FYI for thoughts :)

@elenavillamil @hsubramanianaks My experience is that the todays handling of probe requests sometime works but mostly not. Now that I disabled the probes it feels more solid. If the pod is always replaced with the devagent then I cannot think of a scenario where probes are needed.
Maby we should add a condition to the code you referenced to @elenavillamil so It also gets disabled when probes are not enabled?

Yes, I think so. I like the idea of associating that code with the flag we are introducing with this change. This will remove forwarding complexity in the general case and it will make it cleaner if we every want to drop that flag and code path.

I took a second look about this and since the remoteContainerConnectionDetails won't collect the probes from the spec, the probe information will not be send to AgentExecutorHub in the dev agent, so the RunReversePortForward command will not have any probes to check IsProbeRequest for :)

Maby I should just add a comment about this in the code?

I havn't tested the code with DAPR since I dont have enough knowledge about that technology but I know DAPR is a sidecar that uses probes by default (https://docs.dapr.io/developing-applications/building-blocks/observability/sidecar-health/).
Maby someone with knowledge about that could also test this PR before it gets released :)

Copy link
Contributor

@elenavillamil elenavillamil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Eneuman Regarding DAPR, I have checked out our manual testing for DAPR and the sample application we use doesnt have probes. Thus, I am pretty sure probes are not a requirement for DAPR.

Yes, please add comment so in future we know that may be code we can drop if after deploying this we dont see any usages of the Probes flag.

I am sending this to our testing team next and I will keep this updated with any findings.

@elenavillamil
Copy link
Contributor

elenavillamil commented Feb 23, 2023

@Eneuman Our testing team found that when running VSIX with this change restore job cannot complete. This is using stats-api from todo-app in samples folder. I can reproduce locally as well. If I add the Probes under enableFeatures in the yaml file, then it runs fine. I have to look at restore job logic to better understand what is happening. I tried to get logs but I couldnt get anything useful, pod went into error state and restarted. The logs I got before didnt have much. I will keep trying to get logs of what causes pod to go into error state. We may have this in our telemetry.
Screenshot 2023-02-22 at 8 46 24 PM

@Eneuman
Copy link
Contributor Author

Eneuman commented Feb 23, 2023

@elenavillamil Could you ask them to do a descibe on the pod and check the events?

@elenavillamil
Copy link
Contributor

elenavillamil commented Feb 23, 2023

@elenavillamil Could you ask them to do a descibe on the pod and check the events?

@Eneuman Ya sorry I didnt paste in my comment but i did describe as well and didnt see anything interesting. Output:
Events:
Type Reason Age From Message


Normal Scheduled 106s default-scheduler Successfully assigned todo-app/stats-api-restore-eaf0d-q5l9d to aks-nodepool1-00911511-vmss000002
Normal Pulled 102s kubelet Successfully pulled image "mindarostage.azurecr.io/lpkrestorationjob:1.0.0" in 4.17886243s
Normal Pulling 13s (x2 over 106s) kubelet Pulling image "mindarostage.azurecr.io/lpkrestorationjob:1.0.0"
Normal Created 13s (x2 over 102s) kubelet Created container lpkrestorationjob
Normal Started 13s (x2 over 102s) kubelet Started container lpkrestorationjob
Normal Pulled 13s kubelet Successfully pulled image "mindarostage.azurecr.io/lpkrestorationjob:1.0.0" in 251.670113ms

Can you repo on your end? Updating BRIDGE_BUILD_PATH to point to a build with this change should get you the repo on disconnect. I can repo on Mac use VSCode. Our testing team can repo on WS linux using VSCode.

@Eneuman
Copy link
Contributor Author

Eneuman commented Mar 3, 2023

I think I now know what's going on. Will try and fix this tomorrow.

@elenavillamil
Copy link
Contributor

I think I now know what's going on. Will try and fix this tomorrow.

@Eneuman Sounds good! Let us know if you need help.

@elenavillamil
Copy link
Contributor

I think I now know what's going on. Will try and fix this tomorrow.

@Eneuman Sounds good! Let us know if you need help.

Hi @Eneuman, happy to take over this pull request if you got busy. Just let us know, I dont want to over step :)

@Eneuman
Copy link
Contributor Author

Eneuman commented Mar 15, 2023

@elenavillamil
Please have a look if you got the time :)

I think the problem might have something to do with using json patch together with removing the probes but I can be wrong here.

@Eneuman
Copy link
Contributor Author

Eneuman commented Mar 20, 2023

@elenavillamil I did some testing yesterday and I can't reproduce this exception on my machine any more :(

I tried using the latest extensions in both VS Code and VS Pro, both set to my local build of DSC.

@elenavillamil
Copy link
Contributor

@Eneuman, yes I was testing this Friday and I could not repo either. I sent it off this morning to our testing team for final validation. The only thing I noticed is, if you look at restore logs when restore is happening is below warning. Which just indicates restore is being quicked from outside restore pod which is okay.

2023-03-20T17:12:05.2072004Z | RestorationJob | WARNG | Found 2 pods for deployment todo-app/stats-api but expected 1
2023-03-20T17:12:05.2087181Z | RestorationJob | TRACE | Couldn't get agent endpoint

Copy link
Collaborator

@hsubramanianaks hsubramanianaks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Eneuman for these changes and @elenavillamil for reviewing it. Only thing I would call out is either in the README or document update, please have something similar By default bridge would remove the probes (liveness, readiness and startup) for the remote agent pod deployed in cluster, to enable them please use EnableFeatureProbes to true. Because at first glance I thought this is disabling probes (PR description says that) and got confused. But if this flag is not set or set to any value other than true, bridge would disable probes. Thanks.

@elenavillamil
Copy link
Contributor

Thanks @Eneuman for these changes and @elenavillamil for reviewing it. Only thing I would call out is either in the README or document update, please have something similar By default bridge would remove the probes (liveness, readiness and startup) for the remote agent pod deployed in cluster, to enable them please use EnableFeatureProbes to true. Because at first glance I thought this is disabling probes (PR description says that) and got confused. But if this flag is not set or set to any value other than true, bridge would disable probes. Thanks.

Correct, plan is to update public docs once this is released. @sabbour / @qpetraroia as FYI. I will reach out once this is released for this doc update.

@elenavillamil elenavillamil merged commit 847cad8 into Azure:main Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote agent container fails readiness probes Liveliness/readiness probe - bridge container being recycled
4 participants