Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIG Cloud Provider KEP: Reporting Conformance Test Results to Testgrid #2224

Merged
merged 1 commit into from
Jul 9, 2018

Conversation

andrewsykim
Copy link
Member

KEP for reporting conformance test results from all cloud providers.

@AishSundar @BenTheElder (sig-testing)
@calebamiles @jdumars (sig-release)

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. labels Jun 7, 2018
@k8s-ci-robot k8s-ci-robot requested review from bgrant0607 and jbeda June 7, 2018 12:58
@andrewsykim andrewsykim force-pushed the kep-provider-e2e branch 2 times, most recently from d6b676b to 7b8c697 Compare June 7, 2018 13:09

There are various scenarios where cloud providers may mistakenly upload incorrect conformance tests results. One example being uploading results for the wrong Kubernetes version.

Mitigation: TODO
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AishSundar @BenTheElder would love some feedback on how to mitigate this going forward

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with flaky tests below

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the providers own their own GCS bucket, then they can go into the appropriate folder and delete data corresponsing to the run / move it to the right directory in the bucket etc. If the GCS bucket is owned by GKE EngProd, then they can mail gke-kubernetes-engprod@google.com, with details of the run and bucket, for fixing the data.

@BenTheElder wdyt? accurate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if testgrid will remove these @michelle192837 if we remove a GCS entry that has already shown up in testgrid will it be removed from testgrid?

Also, since we've moved to creating a bucket for each provider we can probably just grant the service account with permissions to delete objects, but I don't think we'd be unwilling to help if emailed about a removal either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So TestGrid processes X results back until it stops finding changes, but that means that it varies how many columns back it looks. If you remove a GCS entry and want that run to not show up in TestGrid, the best way to enforce that is to force TestGrid to re-collect results for that test group.

To force TestGrid to re-collect, you can:

  • Increase the test group's days_of_results field, [Recommended]
  • Change the test group's query or column_header fields, or
  • Change the test group name (which TestGrid will treat as a new test group).

If you need to wipe something urgently, you can also request that I or someone else on the team delete the state file for the test group, which will remove the results until TestGrid updates that group again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small note: the changes mentioned above would be to the testgrid config file in test-infra, by a PR.
thanks @michelle192837 :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good to know, I'll add a section about this.

Assuming that you have a publicly readable bucket provided by the GKE team, you should have been provided a service account JSON file which you can use with [gcloud](https://cloud.google.com/sdk/downloads) to authenticate with your GCS bucket.

```
$ gcloud auth activate-service-account --key-file /path/to/k8s-conformance-serivce-accout.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one note: the uploader will automate this if you instead supply the file to it with the same flag, and also remove the service account and revert to the previously active GCP account if any after.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh.. that's quite handy. Is that the preferred method going forward?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either works fine, but I think for some setups this will be easier? It also means the cloud providers just need to have the gcloud tools available, they don't necessarily need to be aware of how to use them


#### Flaky Tests

Tests can fail for various reasons in any cloud environment and may raise false negatives for the release team.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect some level of flaking to occur, but the release team will probably not move/keep jobs into the release blocking dashboards if they are very flaky. (they will still be in other dashboards of course). for the moment which jobs are in the release dashboard is ~up to the release team, and they may choose to ignore other dashboards.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. One of the major criteria in promoting a e2e test to conformance is it should flake <1% of the time. If for some reason this isnt met we will remove the test from conformance after investigation of rootcause and fix.


Here are some more concrete examples of how other cloud providers are running conformance tests and uploading results to testgrid:
* Open Stack
* [OpenLan zuul job for running/uploading testgrid results](https://github.com/theopenlab/openlab-zuul-jobs/tree/master/playbooks/cloud-provider-openstack-acceptance-test-e2e-conformance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/OpenLan/OpenLab/


#### How to Run E2E Conformance Tests

[Sonobuoy](https://github.com/heptio/sonobuoy) is the recommended tool to run conformance tests on your cluster. At this point we will assume that you have a running cluster and that `kubectl` is configured to point to that cluster with admin access.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a major downside to using Sonobuoy vs building the e2e tests with kubernetes in continuous integration in particular, AFAIK. If you use the pre-released Sonobuoy images then you will not be testing the latest conformance tests as they are developed. Running these directly or with kubetest avoids this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this comment. In the past whenever I wished to run conformance on the master to get latest coverage or test latest changes I used kubetest. If Sonobuoy supports this it will be good to call that out here. If not, plz include instructions on how to achieve this using Kubetest or the hack scripts. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will update to use kubetest, I was in favour of Sonobuoy originally because it was easier to run since it doesn't require setting up a development environment but I don't think the tradeoff is worth it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not suggesting to remove Sonobuoy completely from here. But plz do mention Kubetest also as a means of triggering tests, esp if you want to test on latest changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't want to say that we necessarily need to mention any particular tool, I've mentioned Sonobuoy and kubetest in the testgrid conformance usage guide.
It should also be possible to run gingko / the conformance tests directly or via the hack scripts in kubernetes, but we should probably mention at least one way that involves using freshly built e2e tests, and how to use it correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All we need todo is push some of the details from https://github.com/heptio/kube-conformance into an upstream artifact that is cleanly factored (does just 1 thing). Some of the test containers from upstream, last I looked, rolled in a bunch of other stuff which from a community perspective can be confusing and unnecessary. @mml and I wanted todo this several releases ago, but just never got to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sonobuoy itself is a more elaborate runner to gather multiple artifacts in a more reproduciable manner. You don't need to rebuild those containers and they are widely used, vetted, versioned and tested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there aren't any other artifacts to gather in this case...?
But you should use the latest e2es in some cases (continuous integration...) and I don't see how that will work without rebuilding https://github.com/heptio/kube-conformance/blob/c43a9d2d1f096ccff8d5b069691a523b1b517b8e/Dockerfile#L24 to match the newest binaries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on what they want todo. A lot of folks create their own custom tests that they collect together. Also, it's a reproducible artifact that you can handoff. Can't do that with the e2es today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref: kubernetes/test-infra#8299 issue for sonobuoy automation instructions with the CNCF config + the testgrid uploader.

@BenTheElder
Copy link
Member

I expect providers will want to perform two kinds of testing:

  1. My current hosted product / platform is conformant with the released tests, IE passes sonobuoy

  2. Continuous integration where I've built kubernetes from a release branch / master, gotten it running on my platform, and tested it with the tests built with it.

The first is obviously important to providers, but the release team and myself are going to be far more interested in 2) so we can validate:

  • the core components in any ongoing development to kubernetes
  • the conformance tests themselves as they are developed.

I would like to push towards 2), which does have a bit more overhead and needs a bit more clarity regarding versions.

For OpenStack the tests run now are 2) by checking eg release-1.11 branch, building k8s + tests, then running.
For GCE they are 2) the same as OpenStack, and also another set of tests checking out the latest tagged release, IE 1.10.2. The OpenStack type are the most valuable signal for the release team, the tagged release signal is also useful but less important, imho.

Also for 2) making sure we're actually using a current build of kubernetes and the tests is a concern.

@andrewsykim
Copy link
Member Author

I definitely agree that pushing for 2) would be ideal so we can catch bugs before they're released. Because of the overhead I had felt that settling for 1) as a requirement and 2) as strongly recommended seemed like a good middle ground (at least for now). My gut feeling is that most providers will end up doing 2) anyways for the reasons you brought up.

@BenTheElder
Copy link
Member

BenTheElder commented Jun 7, 2018 via email

test_group_name: cloud-provider-demo-e2e-conformance-release-v1.10
```

Once you've made the following changes, open a PR against the test-infra repo. Once your PR merges you should be able to view your results on https://k8s-testgrid.appspot.com/ which should be ready to be consumed by the necessary stakeholders (sig-release, sig-testing, etc).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be suggest sending the PR to '@kubernetes/sig-testing-pr-reviews' and CCing /sig testing ?

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jun 7, 2018
@andrewsykim andrewsykim force-pushed the kep-provider-e2e branch 2 times, most recently from d195837 to 30ad875 Compare June 8, 2018 18:32
@andrewsykim andrewsykim force-pushed the kep-provider-e2e branch 2 times, most recently from 71ac076 to 837245f Compare June 12, 2018 00:47
@andrewsykim
Copy link
Member Author

andrewsykim commented Jun 12, 2018

Comments addressed, PTAL @AishSundar @BenTheElder. Note that this is just a first pass at the KEP so there are still some TODOs which we will address in future PRs.

Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 12, 2018
@BenTheElder
Copy link
Member

er habit there on the hold, not necessary.
This looks pretty good to me, we should probably fill out the TODO though :-)
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2018
@karataliu
Copy link

@justaugustus I'm following this


## Summary

This is a KEP outlining the motivation behind why cloud providers should frequently upload E2E conformance test results to [Testgrid](https://github.com/kubernetes/test-infra/tree/master/testgrid) and how a cloud provider can go about doing this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'frequently' implies high infra cost, cloud provider test vendors shall have control over the frequency.

'periodically' could be an alternate word.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, definitely. frequently is great, but... any results at all can be useful. periodically SGTM :-)


At this point we will assume that you have a running cluster and that `kubectl` is configured to point to that cluster with admin access.

#### Sonobuoy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just put a link for those testing tools, since the step details are subject to change.

If we're to go with latest master change (of conformance test) , shall it mention 'Sonobuoy' for now? It is not used for running against latest master change, thus it'll be confusing here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, "latest master change" to me implies we should use kubetest since Sonobuoy does not always let us run the latest set of tests in master as you mentioned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, I think it's worth mentioning both here as long as we highlight the differences (see paragraph above). Running conformance tests on the latest master is going to be preferred going forward but not strictly a requirement yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sonobuoy does intend to support the latest test binaries better in the future (see previous threads here) AFAIK, and for tagged/stable k8s releases most likely the official conformance config should be used? (Though underneath it's all setting up ginkgo + e2e.test and filtering on tests tagged [Conformance])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm thinking about is sometimes it might not makes sense to run against fixed e2e version and upload periodically.
For example:
Since conformance test is considered as important, some provider would run conformance test (say e2e version-1) as pre-submit check.
Later, the periodical run also validates 'cloud-provider master' against 'e2e version-1', then for sure the uploaded report would be all pass. The report will be less useful if it is intrinsically always pass. (suppose it is not used for identifying flaky tests)

If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later, the periodical run also validates 'cloud-provider master' against 'e2e version-1', then for sure the uploaded report would be all pass. The report will be less useful if it is intrinsically always pass. (suppose it is not used for identifying flaky tests)

There's no reason to assume the cloud provider AND tests are not flaky. The tests are not supposed to be, but they have occasion been so.

If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?

I'm not sure if this was @andrewsykim's intention here, but we are leveraging something like this now and it has caught regressions. Validating conformance tests / core components from kubernetes master across providers helps catch regresssions.

For GCE we're running a matrix of these periodically with kubernetes + e2es built from the release branches at HEAD, last tagged release version within each supported minor release, and against master at HEAD. OpenStack also has a smaller matrix, and Gardener.
Providers will obviously have to choose what kind of matrix coverage they want, but getting the ongoing development particularly on master tested widely is definitely helpful for the project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?

Yes, this was the original intention. In the future as @BenTheElder had mentioned we may want something more elaborate but this seems like a good starting point.

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 6, 2018
Copy link
Contributor

@nckturner nckturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question regarding whether conformance tests are good enough or if we should be asking for tests that are specific to each cloud provider.


#### How to Run E2E Conformance Tests

This KEP outlines two ways of running conformance tests, the first using [Sonobuoy](https://github.com/heptio/sonobuoy) and the second using [kubetest](https://github.com/kubernetes/test-infra/tree/master/kubetest). Though Sonobuoy is easier to setup, it does not guarantee that you will run the latest set of conformance tests. Kubetest though requiring a bit more work to setup, can ensure that you are running the latest set of conformance tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the set of tests included in the default e2e run skips quite a few tests that might be relevant for cloud provider to consider their cloud controller manager healthy. You might want to mention that the cloud provider should spend some effort determining which existing tests are relevant and whether or not more tests need to be added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically for conformance testing the provider should not be selecting the set of tests, they should be using the common conformance tests. for further testing (which is outside of this PR's scope AFAICT) they should run more tests separately, and may want to pick a wider suite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah of course, my general question was whether or not conformance test is the correct thing for this KEP to require, and it sounds like for now it should be scoped to conformance tests exclusively.

* [Testgrid Configuration](https://github.com/kubernetes/test-infra/tree/master/testgrid#testgrid)
* [Display Conformance Tests with Testgrid](https://github.com/kubernetes/test-infra/blob/master/testgrid/conformance/README.md)

#### How to Run E2E Conformance Tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the goal of this is to find issues with cloud provider implementation specific stuff. Do you need a bit more information about the set of binaries that the cloud provider should ensure are under test here (at least CCM, but are CSI impls included? Anything else?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests need (IIRC):

  • ginkgo
  • e2e.test
  • kubectl

Anything else will be specific to how the cloud provider handles cluster spinup, which is somewhat out of scope, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

export KUBERNETES_CONFORMANCE_TEST=y
kubetest \
# conformance tests aren't supposed to be aware of providers
--provider=skeleton \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is skeleton what we want? Or is my previous comment incorrect re: the goal is testing cloud provider specific implementations? I should take a closer look at what tests are actually run, but my hunch is we might want a custom set of tests vs conformance for each cloud provider to actually catch issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, provider=skeleton is necessary for conformance testing, otherwise the tests will sniff the provider and work differently / skip themselves etc. which defeats the purpose of conformance testing.
Providers will likely want to run more tests separately to get better coverage, and conformance testing really ought to get more coverage as well.

@BenTheElder
Copy link
Member

If the conformance tests aren't good enough we should probably make them better? That seems orthogonal.

Seperately, providers may also want to run and optionally report other tests suites, kubernetes/kubernetes CI runs many, many, many of these in a matrix.

@andrewsykim
Copy link
Member Author

@nckturner The initial goal here is actually to get cloud providers reporting something and the standard set of conformance tests seem like a good start. Once we have the infrastructure in place, incrementally increasing coverage like @BenTheElder had mentioned will be significantly easier.

@hogepodge
Copy link

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 9, 2018
@andrewsykim
Copy link
Member Author

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 9, 2018
@justaugustus
Copy link
Member

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2018
Copy link
Member

@justaugustus justaugustus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to ask CPs to submit conformance test results, I think that we should also authoritatively prescribe which tool we will expect conformance results to come from.

Based on what's currently here, I'd imagine some would choose to use Sonobuoy and others may choose to instead use kubetest.

Can we decide what the one acceptable tool is before merging this?

(Also, sorry for the last-minute /hold.)

@BenTheElder
Copy link
Member

BenTheElder commented Jul 9, 2018

If we're going to ask CPs to submit conformance test results, I think that we should also authoritatively prescribe which tool we will expect conformance results to come from.

The results come from the output of the e2e.test binary / ginkgo. Sonobuoy and kubetest orchestrate this and help copy the files.

Edit: specifically the JUnit structured results and the test log. The junit being most important and wholly produced by ginkgo / e2e.test, just cped by the tools above.

For some use cases specifically https://github.com/cncf/k8s-conformance really ought to be used as it's the official config for submitting results to the CNCF, but for testing more recent builds something else may be necessary.

Similarly there's no requirement to use the upload_e2e.py script, just some tool the effectively produces the same results.

@justaugustus
Copy link
Member

Thanks for clarifying, @BenTheElder!
/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2018
@k8s-ci-robot k8s-ci-robot merged commit 5786e1f into kubernetes:master Jul 9, 2018
@nckturner
Copy link
Contributor

@andrewsykim @BenTheElder That all makes sense. This KEP is scoped to just getting conformance tests run on a cloud provider (in any way possible) and the results reported back. Out of scope is testing the cloud provider specific stuff, like the CCM binary, or any tests outside of conformance.

And tangentially, as a community, we should improve conformance tests where they fall short when possible.

Sounds like a great start.

calebamiles pushed a commit to calebamiles/community that referenced this pull request Sep 5, 2018
SIG Cloud Provider KEP: Reporting Conformance Test Results to Testgrid
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.