
Set some default resource requests on the workspace pod #698

Closed
blampe opened this issue Oct 1, 2024 · 7 comments
Labels: kind/enhancement (Improvements or new features), resolution/fixed (This issue was fixed)

Comments

@blampe (Contributor)

blampe commented Oct 1, 2024

The manager already has resource limits set, so it currently has guaranteed QoS.

Related to #694 and probably a prerequisite: set a small resource request to give the workspace pod burstable QoS.

Additional considerations:

  • Use GOMEMLIMIT to tell the agent not to use more than the requested memory for itself. Consider setting it using the Downward API. Avoid setting it on the child processes (use SetMemoryLimit in code?).
  • Check for zombie sub-processes.
@cleverguy25's comment has been minimized.
@pulumi-bot pulumi-bot added the needs-triage Needs attention from the triage team label Oct 1, 2024
@blampe blampe added this to the 0.111 milestone Oct 1, 2024
@blampe blampe removed the needs-triage Needs attention from the triage team label Oct 1, 2024
@EronWright (Contributor)

Baseline stats for random-yaml with a 1-minute resync interval:
(image: baseline resource usage)

@EronWright (Contributor) commented Oct 2, 2024

Zombie processes do seem to accumulate in the workspace pod, given a per-minute resync:

pulumi@random-yaml-workspace-0:/$ ps auxwww 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pulumi       1  0.0  0.3 1248856 14268 ?       Ssl  16:07   0:01 /share/agent serve --workspace /share/workspace --skip-install
pulumi      46  0.0  0.0      0     0 ?        Z    16:07   0:00 [pulumi-language] <defunct>
pulumi      75  0.0  0.0      0     0 ?        Z    16:07   0:00 [pulumi-language] <defunct>
pulumi     236  0.0  0.0      0     0 ?        Z    16:07   0:00 [pulumi-language] <defunct>
pulumi     256  0.0  0.0      0     0 ?        Z    16:07   0:00 [pulumi-resource] <defunct>
pulumi     271  0.0  0.0      0     0 ?        Z    16:07   0:00 [pulumi-resource] <defunct>
pulumi     400  0.0  0.0      0     0 ?        Z    16:08   0:00 [pulumi-language] <defunct>
pulumi     415  0.0  0.0      0     0 ?        Z    16:08   0:00 [pulumi-resource] <defunct>
pulumi     431  0.0  0.0      0     0 ?        Z    16:08   0:00 [pulumi-resource] <defunct>
pulumi     563  0.0  0.0      0     0 ?        Z    16:09   0:00 [pulumi-language] <defunct>
pulumi     579  0.0  0.0      0     0 ?        Z    16:09   0:00 [pulumi-resource] <defunct>
pulumi     594  0.0  0.0      0     0 ?        Z    16:09   0:00 [pulumi-resource] <defunct>
pulumi     724  0.0  0.0      0     0 ?        Z    16:10   0:00 [pulumi-language] <defunct>
pulumi     739  0.0  0.0      0     0 ?        Z    16:10   0:00 [pulumi-resource] <defunct>
pulumi     753  0.0  0.0      0     0 ?        Z    16:10   0:00 [pulumi-resource] <defunct>
pulumi     886  0.0  0.0      0     0 ?        Z    16:11   0:00 [pulumi-language] <defunct>
pulumi     901  0.0  0.0      0     0 ?        Z    16:11   0:00 [pulumi-resource] <defunct>
pulumi     917  0.0  0.0      0     0 ?        Z    16:11   0:00 [pulumi-resource] <defunct>
pulumi    1044  0.0  0.0      0     0 ?        Z    16:12   0:00 [pulumi-language] <defunct>
pulumi    1059  0.0  0.0      0     0 ?        Z    16:12   0:00 [pulumi-resource] <defunct>
pulumi    1075  0.0  0.0      0     0 ?        Z    16:12   0:00 [pulumi-resource] <defunct>
pulumi    1205  0.0  0.0      0     0 ?        Z    16:13   0:00 [pulumi-language] <defunct>
pulumi    1220  0.0  0.0      0     0 ?        Z    16:13   0:00 [pulumi-resource] <defunct>
pulumi    1236  0.0  0.0      0     0 ?        Z    16:13   0:00 [pulumi-resource] <defunct>
pulumi    1368  0.0  0.0      0     0 ?        Z    16:14   0:00 [pulumi-language] <defunct>
pulumi    1383  0.0  0.0      0     0 ?        Z    16:14   0:00 [pulumi-resource] <defunct>
...

@justinvp (Member) commented Oct 2, 2024

Likely related to pulumi/pulumi#17361

@EronWright (Contributor) commented Oct 2, 2024

These measurements were made after the "zombie" process issue was fixed.

After another hour of periodic execution:
(image: resource usage after one hour)

And another:
(image: resource usage after two hours)

@EronWright (Contributor)

A case of failed updates causing a lot more interactions with the workspace:
(image: resource usage during failed updates)

@EronWright (Contributor)

With all fixes:
(image: resource usage with all fixes)

@mikhailshilkov mikhailshilkov added the kind/enhancement Improvements or new features label Oct 7, 2024
EronWright added a commit that referenced this issue Oct 7, 2024

### Proposed changes

Implements good defaults for the workspace resources, using a
["burstable"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#burstable)
approach.
Since a workspace pod's utilization is bursty, with low resource usage
while idle and high resource usage during deployment operations, the pod
requests a small amount of resources (64Mi memory, 100m CPU) so that it
can idle. A deployment operation is able to use much more memory, up to
all available memory on the host.
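In pod-spec terms, the burstable defaults described above would look roughly like the following sketch (container name and exact values are assumptions; requests are set but no memory limit, which is what yields the Burstable QoS class):

```yaml
spec:
  containers:
    - name: pulumi
      resources:
        requests:
          memory: 64Mi   # enough to idle between deployment ops
          cpu: 100m
        # no memory limit: deployment ops may burst up to the
        # node's available memory
```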

Users may customize the resources, e.g. to apply different requests
and/or limits. For large or complex Pulumi apps, it might make sense to
reserve more memory and/or use
#694.

The agent takes some pains to stay within the requested amount, using a
programmatic form of the
[GOMEMLIMIT](https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications)
environment variable. The agent detects the requested amount via the
Downward API. We don't set the `GOMEMLIMIT` environment variable itself,
both to avoid propagating it to sub-processes and because the value is
formatted as a Kubernetes 'quantity'.
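One way to expose the memory request to the agent is a Downward API volume; a sketch, with volume name and path as assumptions (note that `resourceFieldRef` with `divisor: "1"` yields a plain integer byte count):

```yaml
volumes:
  - name: podinfo
    downwardAPI:
      items:
        - path: mem_request
          resourceFieldRef:
            containerName: pulumi
            resource: requests.memory
            divisor: "1"   # expose the request as bytes
```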

It was observed that zombie processes weren't being reaped, which was
leading to resource exhaustion. This was fixed by using
[tini](https://github.com/krallin/tini/) as the entrypoint process (PID
1).
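The tini fix amounts to running a minimal init as PID 1 so it can reap the `<defunct>` children seen in the `ps` output above. A hypothetical Dockerfile excerpt (base image and paths are assumptions, not the actual workspace image):

```dockerfile
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y tini

# tini runs as PID 1, forwards signals, and reaps zombie sub-processes
# (pulumi-language, pulumi-resource) left behind by the agent.
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/share/agent", "serve", "--workspace", "/share/workspace"]
```

An alternative without changing the image is to set `shareProcessNamespace: true` on the pod, in which case the pause container reaps zombies.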

### Related issues (optional)

Closes #698
@EronWright EronWright added the resolution/fixed This issue was fixed label Oct 7, 2024