
Proposal: Testing infrastructure #303

Open
smothiki opened this issue Aug 9, 2016 · 16 comments

@smothiki
Contributor

smothiki commented Aug 9, 2016

After working on the test suite for a good amount of time and getting help from fellow folks, here are some issues I'm thinking of proposing to make the suite better. During our last retro, we already talked about running smoke tests.

Some proposals for proceeding further with the current CI/CD infrastructure.

View of the current Deis architecture

The control plane:

  • Controller, Builder, Registry, Database, Minio
    Any change made to these components, the Deis CLI, or the controller-sdk-go repository will affect Workflow functionality and should run the full test suite.

The Logging and Monitoring stack:

  • Fluentd, Logger, Grafana, NSQ, InfluxDB.
    A PR change to this stack will not affect any CLI functionality or Workflow features related to the control plane, except deis logs.

Sub-components:

  • Slugbuilder, slugrunner, dockerfilebuilder
    A PR change to these will only affect a few tests in workflow-e2e; instead of running the entire test suite, a git push test with different app types should be sufficient.
Ideas
  • Each component in the architecture should have its own dedicated test suite and cluster.
  • For example:
  • The control plane should have a helm chart or manifests that install the control plane components into the deis namespace. The cluster should already be running the logging and monitoring stack.
  • The logging and monitoring stack should have its own dedicated cluster where the control plane is already present; a helm chart or manifests are needed to install this stack on top of the existing control plane in the deis namespace.
  • Sub-components like slugbuilder, slugrunner, dockerfile builder, etc. should run specific tests in e2e or have a dedicated test suite that runs specific tests to check that these components are working.
  • Note: this point is a complete outlier, but if implemented it would cut GCS costs. We should also look at installing Deis in different namespaces within the same cluster. This might require some code change in each component (see the sketch after this list), but it would save cluster costs and also give users a chance to install Deis in a custom namespace.
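A minimal sketch of the kind of per-component change the custom-namespace idea implies, assuming a hypothetical DEIS_NAMESPACE environment variable (the variable name and the helper below are illustrative, not an existing convention in the components):

```go
package config

import "os"

// Namespace returns the Kubernetes namespace a component should operate in.
// DEIS_NAMESPACE is a hypothetical override for this sketch; components fall
// back to the hard-coded "deis" namespace they use today.
func Namespace() string {
	if ns := os.Getenv("DEIS_NAMESPACE"); ns != "" {
		return ns
	}
	return "deis"
}
```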

Please provide feedback and comments so we can proceed further with this proposal and come up with action items we can work on to make Jenkins a green place.

@smothiki smothiki self-assigned this Aug 9, 2016
@bacongobbler
Member

Great write-up @smothiki. Going to sit down and think about this one overnight.

@bacongobbler
Member

Some great thoughts in here.

On the view of the current architecture, one of our most recent projects is registry-proxy, and it was not mentioned here. I think registry-proxy should be considered as a sub-component, since it's just a dumb reverse proxy for the registry.

When you say

Each component in the architecture should have its own dedicated test suite and cluster.

what would this look like? Would workflow-e2e become strictly for the control plane, and then we would implement separate logging-e2e and monitoring-e2e suites? Would the sub-components just run local functional tests like deis/postgres's test suite, or something different?

@mboersma
Member

mboersma commented Aug 10, 2016

WRT the last point, I think beefy unit tests and local functional tests are the place to start for a component-specific suite. Ideally we can get to a point where we test all visible interfaces thoroughly enough with mocks that workflow-e2e is not required to merge a PR.
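A minimal sketch of what a beefier, mock-based unit test could look like in a component repo (the ObjectStore interface, SaveSlug function, and names are hypothetical, purely to illustrate testing a visible interface without a live cluster):

```go
package storage

import "testing"

// ObjectStore is a hypothetical interface a component might depend on.
type ObjectStore interface {
	Put(key string, data []byte) error
}

// SaveSlug is the hypothetical unit under test: it writes a built slug
// under a per-app key.
func SaveSlug(s ObjectStore, app string, slug []byte) error {
	return s.Put("slugs/"+app, slug)
}

// fakeStore records writes in memory so the test never needs a live Minio.
type fakeStore struct {
	objects map[string][]byte
}

func (f *fakeStore) Put(key string, data []byte) error {
	f.objects[key] = data
	return nil
}

func TestSaveSlug(t *testing.T) {
	store := &fakeStore{objects: map[string][]byte{}}
	if err := SaveSlug(store, "myapp", []byte("tarball")); err != nil {
		t.Fatalf("SaveSlug returned error: %v", err)
	}
	if _, ok := store.objects["slugs/myapp"]; !ok {
		t.Error("expected slug to be written under slugs/myapp")
	}
}
```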

@sgoings sgoings changed the title Porposal: Testing infrastructure Proposal: Testing infrastructure Aug 10, 2016
@vdice
Member

vdice commented Aug 10, 2016

This definitely represents a smarter approach for when and how to utilize the e2e suite(s); thanks, @smothiki.

Here are my ideas of possible immediate next steps that can be pursued:

  1. No change for 'control plane' CI pipelines at this point (keep running the full e2e suite as usual against changes). This could (and should) definitely be revisited as functional test coverage is added/improved in a given repo.
  2. Break the logging/monitoring stack out of the full e2e run by identifying the current e2e specs that touch this functionality and 'tagging' or otherwise labeling them so that the e2e stage of the CI pipeline only runs those (see the sketch after this list). This would involve an organizational change in workflow-e2e and a similar update to the pipeline job(s) in jenkins-jobs, and it can be done while unit/functional tests are beefed up in these repos. Eventually, the CI pipeline would add a 'functional' stage of tests between component build and the (minimal, sub-suite) e2e.
  3. Similarly for the other sub-components: break them out of the full e2e run by identifying the current specs that touch their functionality, in parallel with unit/functional test coverage improvements.
  4. Build momentum on functional test efforts in all the component repos above by creating the necessary tickets and prioritizing the work.
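A rough sketch of what such tagging could look like, assuming the suite keeps its Ginkgo structure (the [logging] tag and the spec wording are illustrative):

```go
package tests

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// Tagging the spec description lets the CI pipeline select a sub-suite with
// ginkgo's --focus flag, e.g.:
//
//   ginkgo --focus="\[logging\]" tests/
var _ = Describe("deis logs [logging]", func() {
	It("returns recent log lines for a deployed app", func() {
		// ...drive the CLI or API against the running cluster here...
		Expect(true).To(BeTrue()) // placeholder assertion
	})
})
```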

@smothiki
Contributor Author

@mboersma

WRT the last point, I think beefy unit tests and local functional tests are the place to start for a ....

I forgot to mention this one among the ideas I discussed with @vdice. We will still run the entire test suite like the current e2e, but not for every PR; instead it would run as a cron job four times a day on a dedicated cluster.

@bacongobbler
Member

We will still run the entire test suite like the current e2e, but not for every PR; instead it would run as a cron job four times a day on a dedicated cluster.

Will we be doing this as well as the entire test suite for control plane PRs?

@smothiki
Contributor Author

Just had a discussion with @vdice; thanks for his suggestions.
We would love to proceed with control plane testing, and below are the changes we would like to make to the existing suite.

  • Create a separate helm chart for the control plane, which has all the logging and monitoring charts as manifests instead of templates, and use helm keep.
  • Don't reap the cluster; instead do a helm uninstall without force, which will persist all helm keep manifests.

So for every PR to control plane components we have a cluster which already has logging and monitoring and the helm keep manifests installed. Just doing helmc install chart-name will install only the control plane components. If we have consensus on this we would love to proceed to implementation on this front.

Pros: turnaround test time is optimized.
Cons: bookkeeping for helm keep manifest changes and changes in the logging stack.

@smothiki
Contributor Author

@bacongobbler as per @vdice's suggestion, cron jobs are not helping much in identifying bugs, so I'm thinking of a better approach as of now.

@jchauncey
Member

If all you are testing is the control plane, there is no benefit to installing the monitoring and logging stack. Second, you gain no benefit from keeping the cluster around.

There is also no reason to stand up the control plane if you are testing the monitoring stack.

I'm not sure I understand the benefits here.

@smothiki
Contributor Author

smothiki commented Aug 10, 2016

@jchauncey as far as I know, we need to install the logging stack at least to check deis logs, which is an important feature of Workflow and should be included in the e2e tests. That's the reason we are keeping a dedicated monitoring and logging stack.

@jchauncey
Member

Then why the hassle of keeping it around? It literally takes seconds for that to start. You spend more time waiting for the controller to come up than the logging stack.

@smothiki
Contributor Author

If time is not an issue we can install the entire chart.

@jchauncey
Member

jchauncey commented Aug 10, 2016

For testing the control plane, if you are doing e2e tests there is no reason not to stand up the entire stack. For testing pieces that are independently shippable, we should just stand up those parts without anything else and test.

@jchauncey
Member

jchauncey commented Aug 10, 2016

This proposal comes down to how we test the parts in such a way as to maximize time and coverage.

In almost all cases we need a suite of tests that exercise API calls against a live system. This supplants much of the current e2e suite.

We then take the current e2e tests and move them into CLI functional tests against a real system.

This means component PRs only run their functional tests for validation, and then we use a smoke suite to test that everything comes up correctly.

We can rely on a larger suite of e2e tests for validation of a release, but we should be more than confident at that point.
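A minimal sketch of such an API-driven test against a live controller, using plain net/http (the /v2/apps/ endpoint is part of the controller's v2 API; the environment variable names and the exact Authorization header format here are assumptions to verify against the docs):

```go
package api_test

import (
	"net/http"
	"os"
	"testing"
)

// TestListApps exercises the controller's v2 API directly rather than going
// through the CLI. DEIS_CONTROLLER_URL and DEIS_TOKEN are hypothetical
// environment variables for this sketch, and the "Authorization: token ..."
// header format should be double-checked against the controller docs.
func TestListApps(t *testing.T) {
	controller := os.Getenv("DEIS_CONTROLLER_URL")
	token := os.Getenv("DEIS_TOKEN")
	if controller == "" || token == "" {
		t.Skip("no live controller configured")
	}

	req, err := http.NewRequest("GET", controller+"/v2/apps/", nil)
	if err != nil {
		t.Fatal(err)
	}
	req.Header.Set("Authorization", "token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		t.Fatalf("controller unreachable: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200 listing apps, got %d", resp.StatusCode)
	}
}
```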

@jchauncey
Member

jchauncey commented Aug 11, 2016

So after thinking about this all day I would like to counter with this proposal that is just a slight variation on @smothiki's.

  1. We should have suites of tests dedicated for each component. These tests (for now) would exercise the app in such a way as to cover that component in a significant way. However, we should be diligent in moving those tests from using the CLI to being completely API driven.
  2. We move the current workflow-e2e tests into the workflow-cli repo
    1. we stop cutting a chart for every release of workflow (I'm not sure why we do this anyways)
    2. we move the e2e tests from being a replication controller with a shared artifact pod to being a Job resource (see the sketch after this list)
  3. We start working on a framework for feature toggles.
    1. This framework allows us to ship features without having to worry about having all the tests in place first.
    2. We screen every PR with a set of smoke tests that verify that workflow will come up and at least do some of the most basic commands.
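On point 2.2, a rough sketch of the e2e runner expressed as a Kubernetes Job, written against current client-go API types purely for illustration (the names, namespace, and backoff value are assumptions):

```go
package e2e

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// e2eJob sketches the e2e runner as a Job rather than a replication
// controller: it runs to completion with bounded retries, and the pipeline
// simply waits on the Job's success or failure instead of babysitting a pod.
func e2eJob(image string) *batchv1.Job {
	backoff := int32(1) // illustrative retry budget
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "workflow-e2e", Namespace: "deis"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "e2e", Image: image},
					},
				},
			},
		},
	}
}
```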

The pipeline for this would look like the following:

submit PR -> unit tests -> functional tests against real component -> smoke tests -> workflow-cli e2e suite with feature toggles enabled (if applicable)

The last job might only need to happen every so often, so we could make that a manual job for now if we want.
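A minimal sketch of the smoke-test stage, driving the CLI from Go (deis version and deis apps:list are real CLI commands; the test layout itself is illustrative):

```go
package smoke_test

import (
	"os/exec"
	"testing"
)

// TestBasicCommands runs a few read-only CLI commands against whatever
// cluster the environment is already logged in to. It is intentionally
// shallow: the goal is "did Workflow come up", not full e2e coverage.
func TestBasicCommands(t *testing.T) {
	commands := [][]string{
		{"deis", "version"},
		{"deis", "apps:list"},
	}
	for _, cmd := range commands {
		out, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
		if err != nil {
			t.Fatalf("%v failed: %v\n%s", cmd, err, out)
		}
	}
}
```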

@bacongobbler
Member

bacongobbler commented Aug 11, 2016

We should have suites of tests dedicated for each component. These tests (for now) would exercise the app in such a way as to cover that component in a significant way. However, we should be diligent in moving those tests from using the CLI to being completely API driven.

+1, I believe that's the idea behind the sub-component projects. It becomes a little harder to do that for the controller or for builder at the moment, but I imagine it would be nice to get to that point for those components (the control plane, specifically) as well.

we stop cutting a chart for every release of workflow (I'm not sure why we do this anyways)

How would users run the latest release of Workflow without a new chart? What would the new process look like? Remember that we don't want to break userspace; users are already comfortable with the idea of helm fetch && helm generate && helm install. Any additional steps after 2.0 and I feel we are breaking (package management/deployment) compatibility, which hurts the user experience.

We move the current workflow-e2e tests into the workflow-cli repo

So only workflow-cli gets tested end-to-end on changes? That doesn't make too much sense to me as we'd like to have end-to-end tests for the controller, at the very least (for example, the migration from replication controllers to deployments caught a few bugs).

we move the e2e tests from being a replication controller with a shared artifact pod to being a job resource

If this is necessary for this shift then I'm okay with this. I'd like to err on the side of "try not to bite off more than you can chew". These small tasks eventually add up.

We start working on a framework for feature toggles.

This feels slightly in-but-out of scope for the topic of "revisiting how we test". Are you suggesting feature-flagging the test suite or the feature itself? For significant features like deployments we already did the latter.

I'm all for whatever comes out of this proposal, but for me I'd like to see (as a developer):

  • more consistent test results on pull requests
  • smoother release cycles (having to wait for flaky e2e to pass so I can cut an image sucks)
  • a smooth transition over to the "new" test infrastructure (don't want development downtime due to breakages in our CI process)

How we end up achieving those three bullet points does not matter to me so much. Bonus points for a visible reduction in our GCE costs.
