
Release Shepherd Rotation

The Release Shepherd is a 50% rotation responsible for cutting Terraform Provider Google (TPG) releases (https://github.com/hashicorp/terraform-provider-google/wiki/Release-Process) and for maintaining test environments, so that merging contributions and making releases stay as low-friction as possible.

The current schedule can be viewed in PagerDuty.

Once-per-week responsibilities

Monday morning

  • Check the release history for any patch releases and confirm they have been cherrypicked or otherwise handled (see the sketch after this list)
    • Rarely, changes will already be present in the release branch (e.g. due to early patch releases) or won't make sense to migrate forward. In these cases, confirm explicitly with last week's oncall that they've been handled and that the new release will not regress.
  • Run the release based on last week's shepherd's release cut: https://github.com/hashicorp/terraform-provider-google/wiki/Release-Process#on-monday
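A quick way to check whether a patch-release fix is present on the branch you are about to release is to look for the commit (or its cherrypicked copy) from the command line. This is only a sketch; the remote name, release branch name, and SHA below are placeholders, not the actual values for a given week.

```sh
# Remote, branch, and SHA names below are placeholders.
git fetch upstream

# If the fix was merged directly, list the branches that already contain it.
# The release branch you're about to ship should appear in the output.
git branch -r --contains <sha-of-patch-fix>

# Cherrypicked commits get new SHAs, so search the release branch by commit
# message (e.g. the PR number or title) instead.
git log upstream/release-5.44.0 --oneline --grep "<PR number or keyword>"
```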

Tuesday midday

  • Check recently filed bugs for issues caused by the new release and flag them with the oncall. Evaluate whether a patch release is needed, as discussed in the incident response policy (see the query sketch after this list).
    • If the oncall has not picked up the issue, consider attempting to resolve it yourself.
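One way to scan for newly filed bugs is a GitHub CLI query along these lines; this is a sketch, and the date and label name should be adjusted to the actual release day and the repo's label conventions.

```sh
# Open bugs filed since the release went out (date is illustrative).
gh issue list \
  --repo hashicorp/terraform-provider-google \
  --search "is:open label:bug created:>=2024-10-21" \
  --limit 50
```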

Wednesday midday

Daily responsibilities

  • Triage nightly test failures for the previous night
  • Resolve test-failure tickets for service/terraform (or opportunistically for other services if you can identify high-impact, low-effort fixes); see the query sketch after this list
  • Check 2-3 recent PRs for unrelated or recurrent VCR failures and resolve them, filing a test-failure issue if you are unable to.
  • If other responsibilities have been addressed, find old bug or persistent-bug issues and resolve them.
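For finding the tickets mentioned above, a GitHub CLI query can save some clicking. The label names here follow the conventions referenced elsewhere on this page, but treat them as assumptions and adjust to whatever the repo actually uses.

```sh
# Open test-failure tickets owned by service/terraform.
gh issue list \
  --repo hashicorp/terraform-provider-google \
  --search "is:open label:test-failure label:service/terraform" \
  --limit 50

# Oldest open persistent-bug issues, for when the daily items are done.
gh issue list \
  --repo hashicorp/terraform-provider-google \
  --search "is:open label:persistent-bug sort:created-asc" \
  --limit 20
```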

Triage nightly test failures

Go through every test failure in the nightly builds for both GA and Beta:

To see all test failures for a nightly test run, look at the Change Log tab for the project:

Find the commit you want to triage. Click on the commit SHA and navigate to the Problems & Tests tab on the new page. The page will list all test failures linked to that change.

  • Search for the test name (and/or error message and/or resource) in Issues and create an issue if none exists (see the sketch after this list).
    • Set percentile labels based on how frequently the test is failing: test-failure-100 for consistent failures (at least 3 days in a row & not flaky), test-failure-50 for a 50%-99% failure rate, test-failure-10 for 10%-49%, and test-failure-0 for 0%-9%. These labels are used to analyze & prioritize the test failure.
    • In the "Affected resources" section, put the resource under test rather than the resource where the failure originated. This is because the responsibility for resolving the test failure falls on the team that owns the resource under test.
    • After creation, wait a moment and remove the forward/review label.
  • Mark the test failure on TeamCity as "investigating..." and add a link to the created issue in the comment section.
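A sketch of the search / create / label flow using the GitHub CLI; the test name, issue number, and body text are placeholders, and the label names should match what already exists in the repo.

```sh
# 1. Search for an existing issue mentioning the failing test (test name is a placeholder).
gh issue list \
  --repo hashicorp/terraform-provider-google \
  --state all \
  --search "TestAccExampleResource_basic in:title,body"

# 2. If none exists, create one with the appropriate percentile label.
gh issue create \
  --repo hashicorp/terraform-provider-google \
  --title "Failing test: TestAccExampleResource_basic" \
  --body "Error message, affected resource, and a link to the nightly run." \
  --label test-failure-100

# 3. After creation, remove the forward/review label (replace 12345 with the new issue number).
gh issue edit 12345 \
  --repo hashicorp/terraform-provider-google \
  --remove-label "forward/review"
```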

For tests that started failing since the last release cut, take a closer look to confirm whether the cause is a change in the provider. New bugs (including non-destructive permadiffs) should be handled per the Incident Response Policy.
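To see which provider changes landed since the last release cut (and could therefore be the cause of a newly failing test), a commit diff against the previous release tag is a reasonable starting point; the tag name below is a hypothetical example.

```sh
# Commits on main that are not in the previous release (tag name is illustrative).
git fetch upstream --tags
git log --oneline v5.44.0..upstream/main
```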

FAQ

How long should I spend on this rotation?

As a 50% rotation, this should take around half of your time-at-desk. If you'll be unable to spend at least 8 hours on the responsibilities listed above, consider trading shifts with someone who will be able to. Conversely, if you're required to spend 16+ hours working on these responsibilities (outside of exceptional events like weeks where multiple patch releases are required), flag that with the team so that we can bring the time commitment back within expectations.

Who should handle patch releases for GCP outages?

Patch releases for current GCP outages are handled by the Google Oncall as defined in the incident response policy. However, if they determine that additional help is required, they may enlist the release shepherd to drive the patch.

On the other hand, cherrypicks are generally handled by the release shepherd to ensure that they remain the owner of the weekly minor release branch (see the sketch after this list):

  • If the oncall determines that a change doesn't need a patch but we will want to cherrypick it (for example, if there's a major outage on a Friday), the release shepherd will own cherrypicking it.
  • If the oncall makes a patch release, they'll work with the release shepherd to ensure that it's included in the next minor release.
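A minimal cherrypick sketch, assuming a release branch named as shown; the branch name and SHA are placeholders, and the Release Process page remains the authoritative set of steps.

```sh
# Branch name and SHA are placeholders.
git fetch upstream
git checkout -b cherrypick-my-fix upstream/release-5.44.0
git cherry-pick -x <sha-of-fix-on-main>   # -x records the original commit SHA
git push origin cherrypick-my-fix         # then open a PR against the release branch
```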

Should I block the next minor release on an upcoming patch release?

Regressions and incidents should not generally block the next minor release. However, if rolling out the new release without resolving a regression introduced in the last release would break additional users, we should cancel the release; regressions that break the entire provider (such as provider initialization issues that impact many users) likely warrant a freeze until they are resolved ASAP through a patch. In case of ambiguity, discuss with the oncall, mutually agree on a resolution plan, and communicate it in chat. If the oncall is unavailable, substitute a TL.

Note: In general, patch releases cut after midday Thursday are rare; fixes at that point should be cherrypicked into the upcoming minor release rather than released as a patch.

Reading TeamCity results

  1. Go to the "Change Log" page for the provider (GA, Beta) and find the target commit based on its merge date, then click on the commit SHA on the right side of the row.

  2. Click "Problems and Tests" on the page that you navigated to. Verify that the start of the "Revision" value matches the short SHA from the build.

  3. Scroll to the "X Failed Tests, Y new" table, click "Show all X items" at the bottom of the table and then "Expand All" at the top to view the results for all packages at once.
