Calico integration Tests #1906

Merged (7 commits), Mar 8, 2022

@haouc (Contributor) commented Mar 4, 2022

What type of PR is this?
This PR fixes the failing weekly Calico integration tests, replacing them with a Helm installation and a test suite based on the STAR network policy demo.

bug
Which issue does this PR fix:
Fixes #1874

What does this PR do / Why do we need it:
We have deprecated support for the Calico charts in this repo, but we still need to run routine integration tests of our VPC CNI against a specific or the latest Calico version. This PR contains the following changes/additions:

  1. Helm installation of the Calico operator at a given version of Calico
  2. STAR tests with various network policy settings
  3. Calico tests no longer use bash scripts; they are implemented as a Ginkgo suite
  4. New ENVs to configure the tests: RUN_LATEST_CALICO_VERSION (boolean) indicates whether to fetch the latest released version from the Calico repo; CALICO_VERSION (string) sets the Calico version to test when RUN_LATEST_CALICO_VERSION == false
  5. If DEPROVISION == false, EC2 instances will be terminated so that the Calico tests do not affect future tests in the cluster
  6. Reorganized the VPC CNI tests on the current image and the Calico tests to avoid interference
  7. Because the Calico images and STAR resources being installed are amd64 based, ARM64 nodes are made unschedulable during the tests. They are restored to a schedulable state after the Calico tests finish
  8. To make use of the recently added metrics infra, the Calico tests also add a new metric, calico_test_status
  9. If the ENV RUN_CALICO_TEST_WITH_PD is NOT set, the script will also run the Calico tests with PD (Prefix Delegation) by default
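
The interplay of the two version ENVs in item 4 can be sketched roughly as follows. This is only an illustration, not the code from scripts/run-integration-tests.sh: the function names resolve_calico_version and fetch_latest_calico_version are hypothetical, the way the latest release is fetched is not shown in this description and is stubbed out, and failing when no version is given is an assumption.

```shell
# Hypothetical placeholder: the real script fetches the latest released
# version from the Calico project; that lookup is not reproduced here.
fetch_latest_calico_version() {
    echo "unknown-latest"
}

# Print the Calico version to test, based on the ENVs described in item 4.
resolve_calico_version() {
    if [ "${RUN_LATEST_CALICO_VERSION:-false}" = true ]; then
        # Latest release requested: CALICO_VERSION is ignored.
        fetch_latest_calico_version
    elif [ -n "${CALICO_VERSION:-}" ]; then
        # Pinned version, e.g. CALICO_VERSION=3.21.0.
        echo "$CALICO_VERSION"
    else
        # Assumption: the sketch fails instead of picking a default.
        echo "CALICO_VERSION must be set unless RUN_LATEST_CALICO_VERSION=true" >&2
        return 1
    fi
}
```

With CALICO_VERSION=3.21.0 and RUN_LATEST_CALICO_VERSION unset, calling resolve_calico_version prints the pinned version, which matches the first test run shown below.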

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

Using specified Calico version:

$ RUN_CALICO_TEST=true CALICO_VERSION=3.21.0 ./scripts/run-integration-tests.sh
...
2022-03-04 07:49:42 [✔]  EKS cluster "cni-test-16533" in "us-west-2" region is ready
ok.
TIMELINE: Upping test cluster took 1079 seconds.
...
=== RUN   TestIntegration
Mar  4 07:49:44.618: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
Running Suite: Amazon VPC CNI Integration Tests
...
Ran 2 of 2 Specs in 31.384 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestIntegration (31.38s)
...
TIMELINE: Current image integration tests took 21 seconds.
Starting Helm installing Tigera operator and running Calico STAR tests
~/environment/go/src/amazon-vpc-cni-k8s/test ~/environment/go/src/amazon-vpc-cni-k8s
Using Calico version 3.21.0 to test   <-----
Running Suite: Calico with VPC CNI e2e Test Suite
=================================================
Random Seed: 1646380248
Will run 22 of 22 specs
...
STEP: Restore ARM64 Nodes Schedulability
 
Ran 22 of 22 Specs in 379.131 seconds
SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS
 
Ginkgo ran 1 suite in 6m24.072209201s
Test Suite Passed
~/environment/go/src/amazon-vpc-cni-k8s
Deleting cluster  (this may take ~10 mins) ... ok.
TIMELINE: Down processes took 420 seconds.

Using latest Calico released version (Automatically fetching from Calico)

% RUN_CALICO_TEST=true RUN_LATEST_CALICO_VERSION=true ./scripts/run-integration-tests.sh
...
TIMELINE: Current image integration tests took 39 seconds.
Starting Helm installing Tigera operator and running Calico STAR tests
~/go/src/amazon-vpc-cni-k8s/test ~/go/src/amazon-vpc-cni-k8s
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19996    0 19996    0     0  43094      0 --:--:-- --:--:-- --:--:-- 43094
Using Calico version v3.22.1 to test   <-----
Running Suite: Calico with VPC CNI e2e Test Suite
=================================================
Random Seed: 1646377066
Will run 22 of 22 specs

...
STEP: Restore ARM64 Nodes Schedulability
 
Ran 22 of 22 Specs in 373.879 seconds
SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS
 
Ginkgo ran 1 suite in 6m24.177285169s
Test Suite Passed
~/go/src/amazon-vpc-cni-k8s
Deleting cluster  (this may take ~10 mins) ... ok.
TIMELINE: Down processes took 442 seconds.

Prefix Delegation tests

STEP: Restore ARM64 Nodes Schedulability

Ran 22 of 22 Specs in 378.191 seconds
SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 6m24.845380702s
Test Suite Passed
~/go/src/amazon-vpc-cni-k8s
Run Calico tests with Prefix Delegation enabled
daemonset.apps/aws-node env updated
{
    "TerminatingInstances": [
        {
            "CurrentState": {
                "Code": 32,
                "Name": "shutting-down"
            },
            "InstanceId": "i-0341454b54c510205",
            "PreviousState": {
                "Code": 16,
                "Name": "running"
            }
        }
    ]
}
Waiting 15 minutes for new nodes being ready

Starting Helm installing Tigera operator and running Calico STAR tests
~/go/src/amazon-vpc-cni-k8s/test ~/go/src/amazon-vpc-cni-k8s
Using Calico version 3.22.0 to test
Running Suite: Calico with VPC CNI e2e Test Suite
=================================================
Random Seed: 1646727202
Will run 22 of 22 specs
...
STEP: Restore ARM64 Nodes Schedulability

Ran 22 of 22 Specs in 370.725 seconds
SUCCESS! -- 22 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 6m18.087117999s
Test Suite Passed

Automation added to e2e:

yes
Will this PR introduce any new dependencies?:

no
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
no

Does this change require updates to the CNI daemonset config files to work?:

no
Does this PR introduce any user-facing change?:

no


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haouc haouc requested review from jayanthvn and vikasmb March 4, 2022 17:47
@haouc haouc requested a review from a team as a code owner March 4, 2022 17:47
@haouc haouc changed the title from "Calico integration" to "Calico integration Tests" Mar 4, 2022
ginkgo -v e2e/calico -- --cluster-kubeconfig=$KUBECONFIG --cluster-name=$CLUSTER_NAME --aws-region=$AWS_DEFAULT_REGION --aws-vpc-id=$VPC_ID --calico-version=$calico_version
popd

if [[ "$DEPROVISION" == false ]]; then
Contributor:

I think DEPROVISION is set to false only when we want to log in to the cluster nodes and debug. As such, automated termination of nodes may not be needed here. Thoughts?

Contributor Author:

Are we planning to recycle the cluster? @jayanthvn

Contributor Author:

I feel we should keep this in case the Calico tests unknowingly interfere with other tests in the future.

Contributor:

We don't need to reuse the cluster since this is run only weekly.
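
For readers following this thread, the behaviour under discussion could look roughly like the sketch below. It is not the code from this PR: the function name terminate_nodes_if_kept is hypothetical, and selecting instances by the kubernetes.io/cluster/<name>=owned tag is an assumption about how the worker nodes are tagged.

```shell
# Terminate the cluster's running EC2 instances when the cluster is kept.
# Usage: terminate_nodes_if_kept <DEPROVISION value> <cluster name>
terminate_nodes_if_kept() {
    local deprovision="$1" cluster_name="$2" ids

    # Only act when DEPROVISION is false, as in the reviewed condition.
    [ "$deprovision" = false ] || return 0

    # Assumption: worker nodes carry the tag kubernetes.io/cluster/<name>=owned.
    ids=$(aws ec2 describe-instances \
        --filters "Name=tag:kubernetes.io/cluster/${cluster_name},Values=owned" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text)

    # Nothing to terminate.
    [ -n "$ids" ] || return 0

    # Word splitting is intended: ids is a whitespace-separated list of IDs.
    # shellcheck disable=SC2086
    aws ec2 terminate-instances --instance-ids $ids
}
```

Calling terminate_nodes_if_kept "$DEPROVISION" "$CLUSTER_NAME" after the Calico suite would then be a no-op for any DEPROVISION value other than false.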

@@ -24,6 +24,9 @@ import (
type InstallationManager interface {
	InstallCNIMetricsHelper(image string, tag string) error
	UnInstallCNIMetricsHelper() error
	InstallTigeraOperator(version string) error
	UninstallTigeraOperator() error
	//AddAndUpdateRepository(entry repo.Entry)
Contributor:

Can we remove this commented-out code?

Contributor Author:

Thanks for catching this. Will remove it.

Expect(err).ToNot(HaveOccurred())

By("Patching ARM64 node unschedulable")
err = updateNodesSchedulability("beta.kubernetes.io/arch", "arm64", true)
Contributor:

Can we use "kubernetes.io/arch" instead? I know we still tag with the beta label, but it is better to use "kubernetes.io/arch".

Contributor Author:

Makes sense. I will change this to not use the beta label.

err := f.InstallationManager.InstallTigeraOperator(tigeraVersion)
Expect(err).ToNot(HaveOccurred())

By("Patching ARM64 node unschedulable")
Contributor:

Is there a reason we mark it unschedulable?

@jayanthvn (Contributor), Mar 7, 2022:

Oh is it because star policy is still not supported for ARM instance types? Maybe we can check this - projectcalico/calico#3717 (comment)

Contributor Author:

We cordon these ARM nodes so that the Calico test pods are not deployed to them. For now, the provided test containers are built for amd64 only.
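
As a command-line illustration of the cordoning described here, the sketch below does the equivalent with kubectl. The PR itself does this in Go via updateNodesSchedulability; the shell function name set_arch_schedulable is hypothetical, and the kubernetes.io/arch label follows the suggestion made earlier in this review.

```shell
# Cordon or uncordon every node of the given CPU architecture.
# Usage: set_arch_schedulable <arch> <true|false>
set_arch_schedulable() {
    local arch="$1" schedulable="$2" action

    if [ "$schedulable" = true ]; then
        action=uncordon
    else
        action=cordon
    fi

    # Select nodes by the stable architecture label rather than the beta one.
    kubectl "$action" -l "kubernetes.io/arch=${arch}"
}
```

Running set_arch_schedulable arm64 false before the suite and set_arch_schedulable arm64 true afterwards mirrors the "Patching ARM64 node unschedulable" and "Restore ARM64 Nodes Schedulability" steps in the test output.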

	Container(uiContainer).
	Replicas(1).
	PodLabel("role", "management-ui").
	NodeSelector("beta.kubernetes.io/arch", "amd64").
Contributor:

Same as above.

@jayanthvn (Contributor):

Can we also add these tests with PD enabled? We can run the test with secondary IP mode, enable PD, ASG can be reduced to 0 and again back to few nodes and then run the test again.

@haouc (Contributor Author) commented Mar 8, 2022:

> Can we also add these tests with PD enabled? We can run the test with secondary IP mode, enable PD, ASG can be reduced to 0 and again back to few nodes and then run the test again.

PD tests were added.
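
The resulting flow (secondary IP mode first, then Prefix Delegation) might be sketched as below. This is an illustration under assumptions, not the merged script: run_calico_suite and recycle_nodes are hypothetical stand-ins, and treating any RUN_CALICO_TEST_WITH_PD value other than true as "skip the PD pass" is a guess; the PR description only states that the PD pass runs when the ENV is not set.

```shell
# Hypothetical stand-ins for steps whose real implementation is not shown here.
run_calico_suite() { echo "run Calico STAR suite"; }
recycle_nodes()    { echo "recycle nodes"; }

run_calico_tests() {
    # First pass: default secondary IP mode.
    run_calico_suite

    # Second pass with Prefix Delegation; runs by default when the ENV is unset.
    if [ "${RUN_CALICO_TEST_WITH_PD:-true}" = true ]; then
        # Enable PD on the VPC CNI daemonset.
        kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
        # Replace the existing nodes so that new nodes start with PD enabled.
        recycle_nodes
        run_calico_suite
    fi
}
```

Recycling the nodes between the two passes corresponds to the instance termination and the wait for new nodes visible in the Prefix Delegation test log above.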

@haouc haouc requested a review from jayanthvn March 8, 2022 16:11
@jayanthvn (Contributor) left a comment:

lgtm

@vikasmb (Contributor) left a comment:

lgtm

@haouc haouc merged commit 1de5da4 into aws:master Mar 8, 2022
@haouc haouc deleted the calico-integration branch March 8, 2022 22:00
sushrk pushed a commit to sushrk/amazon-vpc-cni-k8s that referenced this pull request Mar 9, 2022
* Adding Tigera operator installation, Stars resource installations and tests

* Calico test pods can only work on amd64 nodes.

* Need cordon the arm64 nodes to test amd version calico and stars

* we should run new image CNI test and then calico tests

* Enable metrics for calico test

* Fix make format

* updates for node label and adding PD tests
Development

Successfully merging this pull request may close these issues.

Fix calico tests in the weekly test suite
3 participants