
Mini Retro for Qubes 4.1 SDW Release [20220713]

Cory Francis Myers edited this page Jul 18, 2022 · 5 revisions

Topic: Reproducible wheels CI test failure

When: Wednesday, July 13 1:00–1:25pm (PT)

Facilitator: @creviera

Note-taker: @gonzalo-bulnes

Why are we having this retro?

This retro will be done synchronously. This is not a blameless postmortem. There wasn't a major incident that happened on prod. This meeting's purpose is to broaden our understanding of https://github.com/freedomofpress/securedrop-debian-packaging/issues/350, which was quickly fixed on release day. None of us fully understood at the time why the reproducible wheels test was failing, but we were able to come up with a fix nonetheless. Now that we have time for deeper inspection into the issue, let's discuss and find out more about what we still don't know.

Reconstructed event timeline: what happened when?

2022-06-22 : reprotest-wheels first fails while reviewing #349 (decide to capture in a separate issue)

2022-06-23 : Issue #350 opened to capture failure

2022-07-06 : Begin releasing packages for Qubes 4.1

2022-07-07 : #350 became a release blocker and was ultimately fixed

2022-07-07 : Finish releasing packages for Qubes 4.1

Open Questions

(See answers in the notes section below.)

  • Could we start off with a recap on how issue #350 was discovered?

  • How was this issue fixed?

  • What did we not understand at the time?

  • What do we know now?

Notes

  • This is not a PIR. Goal: learning from experience, from the people who did the investigation and made the trade-offs.
  • @cfm: Issue arose while reviewing #349 (which is unrelated). That PR included CI changes, but no Python toolchain changes. It was approved on what looked like passing CI, but CI was actually failing in a non-obvious way. Had seen flakiness recently, but it became clear that python-dateutil specifically was failing. Opened #350. Merge blocker, but not yet a release blocker at that point. The failure may have started in the previous days. Initial repro: build-sync-wheels failing for securedrop-client specifically. Decision made at that point to investigate, to allow packages to be built for R4.1 support. Decision made to make it a release blocker, because the packages being built didn't seem to be related to the issue and under release conditions we wanted CI to be passing. Key insight: the error didn't happen locally under Py3.9+Bullseye. Unblocked by switching to @legoktm's Bullseye+Py3.9 Docker image.
  • @legoktm: Had created the image for... fun! :P Built the image as part of getting ready for Bullseye. Hadn't started using the image yet.
  • @cfm: Hadn't used the image for anything at that point in the week, and the question was asked whether it had been tested enough to start using it on short notice. Answer: yes. We worked around the issue by using that Bullseye image. After release: built a pair of Dockerfiles modeled on the CI script, both based on Alpine (not Debian), one with Py3.9 and one with Py3.7, that would continue on failure to gather more information. The result matched the expectation: the packages failed under Py3.7 and not under Py3.9. That dismissed concerns that Debian was involved.
  • @cfm: There are similar issues upstream about building various build tools from source. setuptools-scm introduced a circular dependency with importlib-metadata for Python versions below 3.8. Since we build everything from source, we can't break the circular dependency loop, and so the build broke. We were not able to identify precisely when we started pulling the version of setuptools-scm that introduced the issue. Current status: may need to look into the newest reproduction steps to ensure they can actually give us the answer to that question.
  • @legoktm: Could reproduce the issue locally, but there were enough layers involved to make it difficult to reason about. Suggests simplifying the wheels build process: the CI job should be split into one job per wheel.
  • @eloquence: Similar Python build problems recur in our setup, so it is worth investigating and understanding this sample well for insights into making the process more reliable. Hint for future work in that direction: (KM) "--no-binary just means it doesn't install wheels, I think the complexity is really in building Python packages from source."
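
To make the circular dependency described in the notes more concrete, here is a minimal sketch of how a build-from-source loop can be detected in a build-dependency graph. The graphs below are hypothetical simplifications written for illustration (the real edges depend on the exact versions being resolved), and `find_cycle` is an illustrative helper, not part of our tooling or of pip.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified build-time dependency graphs (assumptions for
# illustration only). Under Python < 3.8, setuptools-scm needs
# importlib-metadata, which in turn needs setuptools-scm to build from source.
BUILD_REQUIRES_PY37: Dict[str, List[str]] = {
    "python-dateutil": ["setuptools-scm"],
    "setuptools-scm": ["importlib-metadata"],
    "importlib-metadata": ["setuptools-scm"],
}

# Under Python >= 3.8, importlib.metadata is in the standard library, so the
# backport (and with it the loop) drops out of this simplified graph.
BUILD_REQUIRES_PY39: Dict[str, List[str]] = {
    "python-dateutil": ["setuptools-scm"],
    "setuptools-scm": [],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of package names, or None."""
    path: List[str] = []  # packages on the current depth-first path
    done = set()          # packages fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in path:
            # Reached a package already on the current path: that is a loop.
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(node)
        return None

    for name in graph:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print("py3.7:", find_cycle(BUILD_REQUIRES_PY37))
    print("py3.9:", find_cycle(BUILD_REQUIRES_PY39))
```

When every package must be built from source, a cycle like the one in the first graph has no valid build order, which is consistent with the failure appearing under Py3.7 and not under Py3.9.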

Action items

  • @creviera will schedule the next mini retro for more than 25 minutes (or will plan ahead by first asking the subject-matter expert how long they think it will take)
  • @creviera to create an issue around documenting how and why we maintain images of our build containers for nightlies and the process we use to test a new container before pushing it to quay.io (DONE: See https://github.com/freedomofpress/securedrop-debian-packaging/issues/363)
  • @cfm and @legoktm will continue investigating how Python builds packages from source and why just pinning the correct version of setuptools doesn't fix the problem, and will report findings in #350. (Once the #350 investigation work is complete, it would be great to see a follow-up knowledge share on Python bootstrapping problems and dependency resolution, and issue(s) to simplify/improve our tests and builder.)