
feat: Adding CPU / RAM configurations to helm network deployments #8786

Merged: 6 commits merged into master from srp/helm-resource-limits on Sep 26, 2024

Conversation

@stevenplatt (Collaborator) commented Sep 25, 2024

Change 1: CPU/RAM Limits for node deployments

This PR assigns resource configurations to nodes that are part of helm network deployments.
Adding these resource configurations helps Kubernetes schedule and balance aztec nodes across the cluster.

These initial values are chosen based on historical usage of the currently deployed `devnet` environment in AWS ([Grafana Dashboard](https://grafana.aztec.network/d/cdtxao66xa1ogc/aztec-dashboard?orgId=1&refresh=1m&var-network=devnet&var-instance=All&var-protocol_circuit=All&var-min_block_build=20m&var-system_res_interval=$__auto_interval_system_res_interval&var-sequencer=All&var-prover=All&from=now-7d&to=now)).

**Definitions**
`requests:` the minimum resources that must be available on the underlying server before Kubernetes will schedule the component there.
`limits:` after deployment, the component's usage may flex up and down, but never above this cap. Using a limit keeps the shared infra stable when there are memory leaks or unexpected application behavior; components that exceed their assigned limit are terminated and redeployed.
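
For illustration, the resource block attached to each node's container spec has this shape; the CPU/memory numbers below are placeholders, not the values merged in this PR:

```yaml
# Hypothetical resource block for a node container; the figures are
# illustrative, not the values chosen in this PR.
resources:
  requests:
    cpu: "500m"    # schedule only on a node with at least 0.5 CPU unreserved
    memory: "1Gi"  # and at least 1 GiB of memory unreserved
  limits:
    cpu: "2"       # usage above 2 CPUs is throttled
    memory: "4Gi"  # usage above 4 GiB triggers an OOM kill and redeploy
```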

Change 2: Options for bots and public networks

Additionally, this PR adds configuration to turn bots, as well as public access, on or off at the time of the helm deployment. This can be used with the following helm syntax:

```
helm upgrade --install <installation name> . -n <kubernetes namespace> \
  --set network.public=true --set network.enableBots=true
```

By default, `network.public` is `false`, since enabling it deploys load balancers, which are not available when running a Kubernetes cluster on a local machine or in CI environments.
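
A minimal sketch of how such a toggle is typically wired into a chart's Service template; the helper, selector, and port names here are illustrative, not necessarily the ones used in this chart:

```yaml
# Sketch: values-driven Service type. LoadBalancer provisions an external
# (e.g. AWS) load balancer; ClusterIP keeps the service cluster-internal
# for local and CI environments.
apiVersion: v1
kind: Service
metadata:
  name: {{ include "aztec-network.fullname" . }}-boot-node
spec:
  type: {{ if .Values.network.public }}LoadBalancer{{ else }}ClusterIP{{ end }}
  selector:
    app: boot-node
  ports:
    - port: 8080
      targetPort: 8080
```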


These resource configurations have been tested by deploying the parent helm chart to the spartan Kubernetes cluster in AWS.

@stevenplatt stevenplatt enabled auto-merge (squash) September 25, 2024 15:16
@stevenplatt stevenplatt marked this pull request as draft September 25, 2024 18:33
@ludamad (Collaborator) left a comment:


LGTM, just the question if we want to set CPU limits vs just rely on scheduler. I can get this in without a CI pass if you ping me, too (no need to undraft)

@just-mitch (Contributor) left a comment:


Nice. Side note, when it is LoadBalancer, does EKS automatically set that up? How/Where do you find the public endpoints?

@stevenplatt stevenplatt marked this pull request as ready for review September 26, 2024 16:15
@stevenplatt stevenplatt enabled auto-merge (squash) September 26, 2024 16:15
@stevenplatt (Collaborator, Author) replied:

> Nice. Side note, when it is LoadBalancer, does EKS automatically set that up? How/Where do you find the public endpoints?

Yes, EKS automatically deploys a load balancer within AWS (outside of the cluster) when it is defined in the helm chart, and automatically deletes it when `helm uninstall` is used. The public endpoint then appears as the Service's external address (e.g., via `kubectl get svc`).

@stevenplatt stevenplatt merged commit 7790ede into master Sep 26, 2024
50 checks passed
@stevenplatt stevenplatt deleted the srp/helm-resource-limits branch September 26, 2024 16:49
TomAFrench added a commit that referenced this pull request Sep 26, 2024
* master:
  feat: make shplemini proof constant (#8826)
  feat: Adding CPU / RAM configurations to helm network deployments (#8786)
  chore: removing hack commitment from eccvm (#8825)
  feat: Handle epoch proofs on L1 (#8704)
Rumata888 pushed a commit that referenced this pull request Sep 27, 2024
Rumata888 pushed a commit that referenced this pull request Sep 27, 2024
Rumata888 pushed a commit that referenced this pull request Sep 27, 2024
stevenplatt added a commit that referenced this pull request Oct 2, 2024
…8923)

This PR includes two changes:
- Adds persistent storage for Aztec nodes running in the Spartan cluster
- Repairs previously merged load balancer configurations

# Persistent Storage

Nodes that were previously configured with mounted volumes now use
`volumeClaimTemplates`. Rather than directly configuring a
`PersistentVolumeClaim`, a `volumeClaimTemplate` automatically appends
an index suffix to each claim as replicas increase, so that replicas do
not conflict over storage.
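
For reference, a `volumeClaimTemplates` entry on a StatefulSet looks roughly like this; the claim name and size are illustrative, not the values used in the chart:

```yaml
# Sketch: one PVC is derived per replica from this template
# (e.g. node-data-<statefulset>-0, node-data-<statefulset>-1, ...).
volumeClaimTemplates:
  - metadata:
      name: node-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 8Gi
```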

## Persistent Storage for Grafana

The currently bundled Grafana instance uses a standard
`PersistentVolumeClaim`, since it is not expected to be deployed with
replicas. Grafana also has an OS-level user defined in its container,
which assumes ownership of the volume once it is mounted. To allow
remounting, this user has to be declared in the helm chart; this is done
using a `securityContext` in the Grafana YAML template.
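
A minimal sketch of the kind of `securityContext` involved, assuming the official Grafana image's default UID/GID of 472:

```yaml
# Sketch: pod-level securityContext so the mounted volume stays accessible
# to Grafana's container user across remounts. 472 is the UID/GID of the
# official Grafana image; adjust if a different image is used.
securityContext:
  runAsUser: 472
  runAsGroup: 472
  fsGroup: 472  # volume files are group-owned by this GID at mount time
```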

# Repaired Load Balancer Config

PR #8786 previously made network interfaces *either* internal or
external. This meant that when the network was set as public, certain
references to internal network interfaces were no longer reachable,
specifically items that address a node port
([bootNodeURL](https://github.com/AztecProtocol/aztec-packages/blob/master/spartan/aztec-network/templates/_helpers.tpl#L62),
for example).

This PR adds the load balancer as a second interface, without modifying
the original.
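
Conceptually, the fix renders a second, external Service alongside the unchanged internal one rather than switching its type. A rough sketch, with illustrative names:

```yaml
# Sketch: the internal Service is always rendered elsewhere in the chart;
# this extra LoadBalancer Service is added only for public networks.
{{ if .Values.network.public }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "aztec-network.fullname" . }}-boot-node-lb
spec:
  type: LoadBalancer
  selector:
    app: boot-node
  ports:
    - port: 8080
      targetPort: 8080
{{ end }}
```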

# Testing

Code in this PR has been tested by deploying the updated helm
configurations to the Spartan cluster using the command:

`helm upgrade --install staging . -n staging --set network.public=true`

As part of this change, replica counts have also been validated to work
without causing conflicts for volume mounts, network interfaces, or
other resources.