Add Tolerations and Affinity support #204

Merged 5 commits on Sep 11, 2024

vlameiras
Contributor

Hi there 👋

We were testing the project internally and noticed that support for tolerations and affinity would be useful.

Hopefully this PR will do the trick.

Thanks for the great project!

operator: "Equal"
value: "present"
effect: "NoSchedule"
affinity:

We try to keep our default resource profile cloud-agnostic, so I would prefer to remove the affinity rule.

Agreed - we can keep the tolerations though

Ah yeah true. Let's keep the toleration.

I think we can build a resource profile that works across most popular cluster types by relying on the fact that affinity nodeSelectorTerms are OR'd together rather than AND'd. I will open a new issue to track this. For now, we should still remove the affinity and node selector here.
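
To illustrate the OR behaviour mentioned above: within `nodeSelectorTerms`, a node qualifies if it satisfies any one term, while the `matchExpressions` inside a single term must all match. A minimal sketch, assuming two illustrative GPU node labels (the label keys are assumptions chosen for the example, not taken from this PR):

```yaml
# Sketch only: the label keys below are assumptions for illustration.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: matches nodes that carry a GKE-style accelerator label.
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists
        # Term 2: matches nodes labelled by a GPU discovery component.
        # A node only needs to satisfy term 1 OR term 2.
        - matchExpressions:
            - key: nvidia.com/gpu.present
              operator: In
              values: ["true"]
```

Had both expressions been placed under a single term, a node would need both labels at once, which is why separate terms are what would make a profile portable across cluster types.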

charts/kubeai/values.yaml (outdated, resolved)
@samos123
Contributor

Thanks for the PR. We do need this since adding taints for GPU nodes is common.

I did an initial review but let's wait for @nstogner to review too.
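
For context on the taint use case: a tainted GPU node repels any pod that lacks a matching toleration. A minimal sketch of a pod-level toleration, assuming a taint of `nvidia.com/gpu=present:NoSchedule` (the key is an assumption for illustration; the diff excerpt in this conversation does not show it):

```yaml
# Sketch only: the key is assumed; operator, value, and effect mirror
# the excerpt from charts/kubeai/values.yaml shown in the review thread.
tolerations:
  - key: "nvidia.com/gpu"   # assumed taint key
    operator: "Equal"       # key and value must both match the taint
    value: "present"
    effect: "NoSchedule"    # tolerate taints that block scheduling
```

A toleration only permits scheduling onto the tainted nodes; it does not attract pods to them, which is where affinity or a node selector comes in.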

@samos123
Contributor

The quickstart e2e test CI is failing because KubeAI scaled down to 0:

[events -w] 0s          Normal   Killing                   pod/kubeai-56485ff664-mqpjg                Stopping container kubeai
[events -w] 0s          Normal   SuccessfulDelete          replicaset/kubeai-56485ff664               Deleted pod: kubeai-56485ff664-mqpjg
[events -w] 0s          Normal   ScalingReplicaSet         deployment/kubeai                          Scaled down replica set kubeai-56485ff664 to 0 from 1

@nstogner
Contributor

@vlameiras Thanks for the PR! We were missing unit tests for the modified function, so I opened a PR against your branch that adds them and updates the integration tests: vlameiras#1

@vlameiras
Contributor Author

Thanks for the quick replies and for the detailed comments!

@samos123
Contributor

Please ignore my previous comment about the e2e quickstart CI failing. The test was flaky, and I have fixed the flakiness here: #205

Simplify equality check and add tests
@nstogner
Contributor

I think we are good to merge once the changes to the default values.yaml file are reverted: #204 (comment)

@samos123 samos123 merged commit 13b3270 into substratusai:main Sep 11, 2024
5 checks passed
3 participants