-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIG Cloud Provider KEP: Reporting Conformance Test Results to Testgrid #2224
Conversation
d6b676b
to
7b8c697
Compare
|
||
There are various scenarios where cloud providers may mistakenly upload incorrect conformance tests results. One example being uploading results for the wrong Kubernetes version. | ||
|
||
Mitigation: TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AishSundar @BenTheElder would love some feedback on how to mitigate this going forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with flaky tests below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the providers own their own GCS bucket, then they can go into the appropriate folder and delete data corresponsing to the run / move it to the right directory in the bucket etc. If the GCS bucket is owned by GKE EngProd, then they can mail gke-kubernetes-engprod@google.com, with details of the run and bucket, for fixing the data.
@BenTheElder wdyt? accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if testgrid will remove these @michelle192837 if we remove a GCS entry that has already shown up in testgrid will it be removed from testgrid?
Also, since we've moved to creating a bucket for each provider we can probably just grant the service account with permissions to delete objects, but I don't think we'd be unwilling to help if emailed about a removal either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So TestGrid processes X results back until it stops finding changes, but that means that it varies how many columns back it looks. If you remove a GCS entry and want that run to not show up in TestGrid, the best way to enforce that is to force TestGrid to re-collect results for that test group.
To force TestGrid to re-collect, you can:
- Increase the test group's
days_of_results
field, [Recommended] - Change the test group's
query
orcolumn_header
fields, or - Change the test group name (which TestGrid will treat as a new test group).
If you need to wipe something urgently, you can also request that I or someone else on the team delete the state file for the test group, which will remove the results until TestGrid updates that group again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note: the changes mentioned above would be to the testgrid config file in test-infra, by a PR.
thanks @michelle192837 :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really good to know, I'll add a section about this.
Assuming that you have a publicly readable bucket provided by the GKE team, you should have been provided a service account JSON file which you can use with [gcloud](https://cloud.google.com/sdk/downloads) to authenticate with your GCS bucket. | ||
|
||
``` | ||
$ gcloud auth activate-service-account --key-file /path/to/k8s-conformance-serivce-accout.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one note: the uploader will automate this if you instead supply the file to it with the same flag, and also remove the service account and revert to the previously active GCP account if any after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh.. that's quite handy. Is that the preferred method going forward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either works fine, but I think for some setups this will be easier? It also means the cloud providers just need to have the gcloud tools available, they don't necessarily need to be aware of how to use them
|
||
#### Flaky Tests | ||
|
||
Tests can fail for various reasons in any cloud environment and may raise false negatives for the release team. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we expect some level of flaking to occur, but the release team will probably not move/keep jobs into the release blocking dashboards if they are very flaky. (they will still be in other dashboards of course). for the moment which jobs are in the release dashboard is ~up to the release team, and they may choose to ignore other dashboards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this. One of the major criteria in promoting a e2e test to conformance is it should flake <1% of the time. If for some reason this isnt met we will remove the test from conformance after investigation of rootcause and fix.
|
||
Here are some more concrete examples of how other cloud providers are running conformance tests and uploading results to testgrid: | ||
* Open Stack | ||
* [OpenLan zuul job for running/uploading testgrid results](https://github.com/theopenlab/openlab-zuul-jobs/tree/master/playbooks/cloud-provider-openstack-acceptance-test-e2e-conformance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/OpenLan/OpenLab/
|
||
#### How to Run E2E Conformance Tests | ||
|
||
[Sonobuoy](https://github.com/heptio/sonobuoy) is the recommended tool to run conformance tests on your cluster. At this point we will assume that you have a running cluster and that `kubectl` is configured to point to that cluster with admin access. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually a major downside to using Sonobuoy vs building the e2e tests with kubernetes in continuous integration in particular, AFAIK. If you use the pre-released Sonobuoy images then you will not be testing the latest conformance tests as they are developed. Running these directly or with kubetest avoids this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this comment. In the past whenever I wished to run conformance on the master to get latest coverage or test latest changes I used kubetest. If Sonobuoy supports this it will be good to call that out here. If not, plz include instructions on how to achieve this using Kubetest or the hack scripts. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will update to use kubetest, I was in favour of Sonobuoy originally because it was easier to run since it doesn't require setting up a development environment but I don't think the tradeoff is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not suggesting to remove Sonobuoy completely from here. But plz do mention Kubetest also as a means of triggering tests, esp if you want to test on latest changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't want to say that we necessarily need to mention any particular tool, I've mentioned Sonobuoy and kubetest in the testgrid conformance usage guide.
It should also be possible to run gingko / the conformance tests directly or via the hack scripts in kubernetes, but we should probably mention at least one way that involves using freshly built e2e tests, and how to use it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All we need todo is push some of the details from https://github.com/heptio/kube-conformance into an upstream artifact that is cleanly factored (does just 1 thing). Some of the test containers from upstream, last I looked, rolled in a bunch of other stuff which from a community perspective can be confusing and unnecessary. @mml and I wanted todo this several releases ago, but just never got to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sonobuoy itself is a more elaborate runner to gather multiple artifacts in a more reproduciable manner. You don't need to rebuild those containers and they are widely used, vetted, versioned and tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there aren't any other artifacts to gather in this case...?
But you should use the latest e2es in some cases (continuous integration...) and I don't see how that will work without rebuilding https://github.com/heptio/kube-conformance/blob/c43a9d2d1f096ccff8d5b069691a523b1b517b8e/Dockerfile#L24 to match the newest binaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on what they want todo. A lot of folks create their own custom tests that they collect together. Also, it's a reproducible artifact that you can handoff. Can't do that with the e2es today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref: kubernetes/test-infra#8299 issue for sonobuoy automation instructions with the CNCF config + the testgrid uploader.
I expect providers will want to perform two kinds of testing:
The first is obviously important to providers, but the release team and myself are going to be far more interested in 2) so we can validate:
I would like to push towards 2), which does have a bit more overhead and needs a bit more clarity regarding versions. For OpenStack the tests run now are 2) by checking eg Also for 2) making sure we're actually using a current build of kubernetes and the tests is a concern. |
I definitely agree that pushing for 2) would be ideal so we can catch bugs before they're released. Because of the overhead I had felt that settling for 1) as a requirement and 2) as strongly recommended seemed like a good middle ground (at least for now). My gut feeling is that most providers will end up doing 2) anyways for the reasons you brought up. |
Totally agree!
Though I really hope we can push to get plenty up to speed on 2) to catch Kubernetes-wide conformance problems quickly like kubernetes/kubernetes#64359, kubernetes/kubernetes#64362.
We should also try to make 1) vs 2) clear, since 2) is more likely to fail due to new tests etc. and does not necessarily mean that the provider is non-conformant. One way could just be standardizing how the TestGrid entries are named to clarify (?) I'm not sure what the best solution is just yet though.
…On Thu, Jun 7, 2018 at 1:40 PM Andrew Sy Kim ***@***.***> wrote:
I definitely agree that pushing for 2) would be ideal so we can catch bugs
before they're released. Because of the overhead I had felt that settling
for 1) as a requirement and 2) as strongly recommended seemed like a good
middle ground (at least for now). My gut feeling is that most providers
will end up doing 2) anyways for the reasons you brought up.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2224 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA4BqwAZz99fyPDS_67nuDIU860gS_YGks5t6Y-6gaJpZM4UeW6y>
.
|
7b8c697
to
30b597d
Compare
test_group_name: cloud-provider-demo-e2e-conformance-release-v1.10 | ||
``` | ||
|
||
Once you've made the following changes, open a PR against the test-infra repo. Once your PR merges you should be able to view your results on https://k8s-testgrid.appspot.com/ which should be ready to be consumed by the necessary stakeholders (sig-release, sig-testing, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be suggest sending the PR to '@kubernetes/sig-testing-pr-reviews' and CCing /sig testing ?
d195837
to
30ad875
Compare
71ac076
to
837245f
Compare
Comments addressed, PTAL @AishSundar @BenTheElder. Note that this is just a first pass at the KEP so there are still some TODOs which we will address in future PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/hold
er habit there on the hold, not necessary. |
@justaugustus I'm following this |
|
||
## Summary | ||
|
||
This is a KEP outlining the motivation behind why cloud providers should frequently upload E2E conformance test results to [Testgrid](https://github.com/kubernetes/test-infra/tree/master/testgrid) and how a cloud provider can go about doing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'frequently' implies high infra cost, cloud provider test vendors shall have control over the frequency.
'periodically' could be an alternate word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, definitely. frequently is great, but... any results at all can be useful. periodically
SGTM :-)
|
||
At this point we will assume that you have a running cluster and that `kubectl` is configured to point to that cluster with admin access. | ||
|
||
#### Sonobuoy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just put a link for those testing tools, since the step details are subject to change.
If we're to go with latest master change (of conformance test) , shall it mention 'Sonobuoy' for now? It is not used for running against latest master change, thus it'll be confusing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something, "latest master change" to me implies we should use kubetest since Sonobuoy does not always let us run the latest set of tests in master as you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless, I think it's worth mentioning both here as long as we highlight the differences (see paragraph above). Running conformance tests on the latest master is going to be preferred going forward but not strictly a requirement yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sonobuoy does intend to support the latest test binaries better in the future (see previous threads here) AFAIK, and for tagged/stable k8s releases most likely the official conformance config should be used? (Though underneath it's all setting up ginkgo + e2e.test and filtering on tests tagged [Conformance]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm thinking about is sometimes it might not makes sense to run against fixed e2e version and upload periodically.
For example:
Since conformance test is considered as important, some provider would run conformance test (say e2e version-1) as pre-submit check.
Later, the periodical run also validates 'cloud-provider master' against 'e2e version-1', then for sure the uploaded report would be all pass. The report will be less useful if it is intrinsically always pass. (suppose it is not used for identifying flaky tests)
If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later, the periodical run also validates 'cloud-provider master' against 'e2e version-1', then for sure the uploaded report would be all pass. The report will be less useful if it is intrinsically always pass. (suppose it is not used for identifying flaky tests)
There's no reason to assume the cloud provider AND tests are not flaky. The tests are not supposed to be, but they have occasion been so.
If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?
I'm not sure if this was @andrewsykim's intention here, but we are leveraging something like this now and it has caught regressions. Validating conformance tests / core components from kubernetes master across providers helps catch regresssions.
For GCE we're running a matrix of these periodically with kubernetes + e2es built from the release branches at HEAD, last tagged release version within each supported minor release, and against master at HEAD. OpenStack also has a smaller matrix, and Gardener.
Providers will obviously have to choose what kind of matrix coverage they want, but getting the ongoing development particularly on master tested widely is definitely helpful for the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the periodical run is against (e2e version-master), then some conformance test change in master would reveal and may show up as failure in report. Was that the original intention?
Yes, this was the original intention. In the future as @BenTheElder had mentioned we may want something more elaborate but this seems like a good starting point.
837245f
to
4038fcd
Compare
4038fcd
to
6b2e77d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question regarding whether conformance tests are good enough or if we should be asking for tests that are specific to each cloud provider.
|
||
#### How to Run E2E Conformance Tests | ||
|
||
This KEP outlines two ways of running conformance tests, the first using [Sonobuoy](https://github.com/heptio/sonobuoy) and the second using [kubetest](https://github.com/kubernetes/test-infra/tree/master/kubetest). Though Sonobuoy is easier to setup, it does not guarantee that you will run the latest set of conformance tests. Kubetest though requiring a bit more work to setup, can ensure that you are running the latest set of conformance tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the set of tests included in the default e2e run skips quite a few tests that might be relevant for cloud provider to consider their cloud controller manager healthy. You might want to mention that the cloud provider should spend some effort determining which existing tests are relevant and whether or not more tests need to be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specifically for conformance testing the provider should not be selecting the set of tests, they should be using the common conformance tests. for further testing (which is outside of this PR's scope AFAICT) they should run more tests separately, and may want to pick a wider suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah of course, my general question was whether or not conformance test is the correct thing for this KEP to require, and it sounds like for now it should be scoped to conformance tests exclusively.
* [Testgrid Configuration](https://github.com/kubernetes/test-infra/tree/master/testgrid#testgrid) | ||
* [Display Conformance Tests with Testgrid](https://github.com/kubernetes/test-infra/blob/master/testgrid/conformance/README.md) | ||
|
||
#### How to Run E2E Conformance Tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the goal of this is to find issues with cloud provider implementation specific stuff. Do you need a bit more information about the set of binaries that the cloud provider should ensure are under test here (at least CCM, but are CSI impls included? Anything else?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests need (IIRC):
- ginkgo
- e2e.test
- kubectl
Anything else will be specific to how the cloud provider handles cluster spinup, which is somewhat out of scope, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
export KUBERNETES_CONFORMANCE_TEST=y | ||
kubetest \ | ||
# conformance tests aren't supposed to be aware of providers | ||
--provider=skeleton \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is skeleton
what we want? Or is my previous comment incorrect re: the goal is testing cloud provider specific implementations? I should take a closer look at what tests are actually run, but my hunch is we might want a custom set of tests vs conformance for each cloud provider to actually catch issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, provider=skeleton is necessary for conformance testing, otherwise the tests will sniff the provider and work differently / skip themselves etc. which defeats the purpose of conformance testing.
Providers will likely want to run more tests separately to get better coverage, and conformance testing really ought to get more coverage as well.
If the conformance tests aren't good enough we should probably make them better? That seems orthogonal. Seperately, providers may also want to run and optionally report other tests suites, kubernetes/kubernetes CI runs many, many, many of these in a matrix. |
@nckturner The initial goal here is actually to get cloud providers reporting something and the standard set of conformance tests seem like a good start. Once we have the infrastructure in place, incrementally increasing coverage like @BenTheElder had mentioned will be significantly easier. |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to ask CPs to submit conformance test results, I think that we should also authoritatively prescribe which tool we will expect conformance results to come from.
Based on what's currently here, I'd imagine some would choose to use Sonobuoy and others may choose to instead use kubetest.
Can we decide what the one acceptable tool is before merging this?
(Also, sorry for the last-minute /hold
.)
The results come from the output of the Edit: specifically the JUnit structured results and the test log. The junit being most important and wholly produced by For some use cases specifically https://github.com/cncf/k8s-conformance really ought to be used as it's the official config for submitting results to the CNCF, but for testing more recent builds something else may be necessary. Similarly there's no requirement to use the |
Thanks for clarifying, @BenTheElder! |
@andrewsykim @BenTheElder That all makes sense. This KEP is scoped to just getting conformance tests run on a cloud provider (in any way possible) and the results reported back. Out of scope is testing the cloud provider specific stuff, like the CCM binary, or any tests outside of conformance. And tangentially, as a community, we should improve conformance tests where they fall short when possible. Sounds like a great start. |
SIG Cloud Provider KEP: Reporting Conformance Test Results to Testgrid
KEP for reporting conformance test results from all cloud providers.
@AishSundar @BenTheElder (sig-testing)
@calebamiles @jdumars (sig-release)