
Retrospective for October 2020 releases #181

Closed
adamfarley opened this issue Oct 23, 2020 · 37 comments

Comments

@adamfarley
Contributor

adamfarley commented Oct 23, 2020

Topics for the retrospective should include:

  • How rebuilds were required on Docker platforms due to the introduction of the ActiveNodeTimeout feature the previous week.
  • How the test jobs didn't launch within the parallel Groovy runTests method (openjdk_build_pipeline.groovy) if the openjdk-jenkins-helper library wasn't loaded prior to the build (or, perhaps, at least outside the parallel code section).
  • How the test jobs being unable to launch somehow didn't cause build failure.
  • Why the Windows 64bit build here failed after complaining: "warning: failed to remove openj9/test/functional: Directory not empty". Raise issue or dismiss?
  • Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?
  • Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC), such as 1-2-1 calls? Is it fair for these to occur when the call has been mentioned in advance in a public channel?
@aahlenst
Contributor

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

@adamfarley
Contributor Author

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

Agreed. Or automation. Or an automated checklist. Let's discuss during the retrospective meeting.

@karianna karianna added this to the October 2020 milestone Oct 23, 2020
@karianna
Member

We should run only the main 3 platforms first for both OpenJ9 and Hotspot, then run pipelines for the secondary platforms.

We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.

@sxa
Member

sxa commented Oct 23, 2020

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

I switched off /testing/ of the nightlies via the default checkboxes in the openjdkxx-pipeline jobs (since that's the bit that's generally disruptive), but it seemed to get re-enabled somehow - maybe by a pipeline regeneration? I queried what the best way to do it this time would be in https://adoptopenjdk.slack.com/archives/C09NW3L2J/p1602789550128600

The entire build can be stopped by adjusting the triggerSchedule in pipelines/jobs/configurations/jdk*.groovy; to switch off just the tests, the lines in the jdk*_pipeline_config.groovy files need to be modified to have false in the test fields.
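
For illustration, here's a hedged sketch of the kind of edits described above; the exact keys and map layout are assumptions based on the description, so treat this as an outline rather than a verbatim patch:

```groovy
// Hypothetical excerpt of pipelines/jobs/configurations/jdk15u.groovy:
// an empty triggerSchedule stops the nightly pipeline being scheduled at all.
triggerSchedule = ''            // was a cron-style string such as '0 2 * * *'

// Hypothetical excerpt of a jdk*_pipeline_config.groovy build map:
// test: false keeps the build but stops the test jobs being launched.
def buildConfigurations = [
    x64Linux: [
        os  : 'linux',
        arch: 'x64',
        test: false             // was 'default' or a list of test targets
    ]
]
```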

@sxa
Member

sxa commented Oct 23, 2020

We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.

If we're going down that route we should implement a formal platform tier proposal (which could lead to interesting discussions, but I'm guessing you're thinking about x86 win/mac/linux as primaries for now?). Being devil's advocate, is there a specific problem you see that means those should be kicked off first? Obviously the others aren't competing for the same resources (unless we push the OpenJ9 XL ones out of "primary")

@sxa
Member

sxa commented Oct 23, 2020

Retrospective item: I feel a lot of discussions over the last 18 hours have happened outside the #release channel in Slack. We need to make sure the current status of release-related activity is kept in one place (including the initiation of any calls) to make sure we're all up to date and pulling in the same direction.

@karianna
Member

Despite commenting out the default weekly map, there were still instances in the jdk_pipeline_config.groovy files which stacked up weekly tests on platforms that didn't have enough hardware to support the run (e.g. Java 11 aarch64).

The default map is comprehensive; we should use it going forward and simply bring our infra support up to match.

A secondary concern is that we should be more explicit that we're using a default weekly map in the jdk_pipeline_config.groovy files - a naive engineer may get confused on seeing an empty map in most cases.
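
As a purely hypothetical sketch of what "being more explicit" could look like (the constant name and map shape are invented for illustration; the real config may differ):

```groovy
// Today an empty map silently implies "use the default weekly test map":
weeklyTests = [:]

// Naming the fallback makes the intent visible to the next reader:
weeklyTests = defaultWeeklyTestMap   // hypothetical shared constant holding the default weekly targets
```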

@andrew-m-leonard

"Handover situations": If builds go on for several days for whatever reasons, it is not necessarily the case the same person will be handling a given release. We need to make such handovers easier, rather than trying to figure out from numerous slack messages in various channels. A more focussed/managed release checklist with status ? (@smlambert I know you've mentioned this previously)

@smlambert

re: #181 (comment) - yes @andrew-m-leonard, see #178 for a WIP checklist that is intended to make it more obvious what has already occurred and by whom.

@adamfarley
Contributor Author

Issue: Job generation doesn't appear to be reliably thread-safe, especially the concurrent test job generation we do at the end of a build.

Evidence: Groovy's struggle to load the same library in multiple concurrent threads (runTests() in openjdk_build_pipeline.groovy), and the non-fatal "No suitable checks publisher found" issue that springs up in many test runs (Slack thread: https://adoptopenjdk.slack.com/archives/CLCFNV2JG/p1603464619103400)

Potential solution: If there's a way to launch jobs in a non-blocking way, we could loop over the job-generation step for each test job we want to run after a build (in a single thread), and then "check" for job results in a second loop. Once we have "results" for each test job we generated, the second loop breaks out and we continue.
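
As a rough sketch of that two-loop shape (assumptions: a hypothetical testJobNames list, the standard Jenkins build step, and result lookups via the Jenkins API, which would need script approvals and probably an @NonCPS helper in practice):

```groovy
// Sketch only: queue each generated test job serially, then poll for results.
def testJobNames = ['sanity.openjdk_x64_linux', 'extended.system_x64_linux']  // hypothetical names

// First loop: fire-and-forget so the generation/launch step never blocks.
testJobNames.each { name ->
    build job: name, wait: false
}

// Second loop: keep checking until every job has a result.
// (Assumes lastBuild is the run we just queued, which isn't guaranteed on a busy instance.)
def pending = testJobNames as Set
while (pending) {
    pending.removeAll { name ->
        def run = jenkins.model.Jenkins.instance.getItemByFullName(name)?.lastBuild
        run != null && !run.isBuilding() && run.result != null
    }
    if (pending) {
        sleep time: 1, unit: 'MINUTES'
    }
}
```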

@smlambert

re: #181 (comment) - what are you intending to solve? Is it meant to address the question "How the test jobs being unable to launch somehow didn't cause build failure"?

If so, perhaps some background:

  • we originally reported and failed build pipelines on child job failures (including test jobs launched from main pipeline)
  • build pipelines were refactored/rewritten
  • since we can never get through a build pipeline without something failing (thus the focus on fixing issues found in triage), we made a conscious choice to ignore test failures to allow the build pipeline to complete.
  • if we want to change the earlier conscious decision, then we could choose to do something other than simply print the failure

But maybe I misunderstand what your comment is targeting...

@smlambert

Why the Windows 64bit build here failed after complaining: "warning: failed to remove openj9/test/functional: Directory not empty".

This is a known, long-standing, problematic issue that appears to have triggered the raising of many infra issues in the past, where Jenkins jobs are unable to clean out the previous workspace (or their own at the end of their run) and other jobs fail with AccessDeniedExceptions.

All of these issues relate to the same core issue:
adoptium/infrastructure#1573
adoptium/infrastructure#1396
adoptium/infrastructure#1527
adoptium/infrastructure#1419
adoptium/infrastructure#1410
adoptium/infrastructure#1394
adoptium/infrastructure#1379
adoptium/infrastructure#1376
adoptium/infrastructure#1339
adoptium/infrastructure#1328
adoptium/infrastructure#1310
adoptium/infrastructure#1086
adoptium/infrastructure#962
adoptium/infrastructure#810
adoptium/infrastructure#784
adoptium/infrastructure#736
adoptium/infrastructure#706
adoptium/infrastructure#477
adoptium/infrastructure#417
adoptium/infrastructure#23

We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

@adamfarley
Contributor Author

adamfarley commented Oct 23, 2020

re: #181 (comment) - what are you intending to solve?

The problems in "Evidence", which could perhaps be renamed to "Symptoms". Now we have two issues that could be traced back to us trying to use concurrency and build generation together. I was spitballing a simplistic way for us to achieve multiple concurrent jobs, while generating them in a serial manner (possibly avoiding the non-thread-safe(?) build generation).

The test jobs failing to run is a symptom. The fact that their failure didn't cause the build to fall over is either a non-issue or a separate issue.

@adamfarley
Contributor Author

re: #181 (comment) - We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?

@smlambert

smlambert commented Oct 23, 2020

Can you find the discussion and indicate what is meant by 'nuking'? There appears to be a great many comments in the open/closed infra issues listed above.

From a test pipeline perspective, the best nuking we can do is to call cleanWs(). We used to do so at the start of each test run.

Then, due to pressure not to take up space on machines, we moved it to the end of every run (adoptium/aqa-tests#314).

We could call cleanWs() both at the start and end of each run (taking a small hit by adding some execution minutes), but the core issue is that the cleanWs() call sometimes fails to work when run on Windows machines, no matter when or how frequently you call it.
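
For reference, a minimal sketch of that shape in a scripted pipeline, assuming the Workspace Cleanup plugin's cleanWs() step and a hypothetical node label:

```groovy
node('windows') {          // hypothetical label
    cleanWs()              // best-effort clean of whatever the previous run left behind
    try {
        stage('Test') {
            // ... run the test targets here ...
        }
    } finally {
        cleanWs()          // clean again at the end; on Windows this can still fail on locked files
    }
}
```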

All of this is perhaps a non-issue if we spin up fresh machines on the fly, but we are not really there (and not sure if that is in our infra goals or not).

@sxa
Member

sxa commented Oct 24, 2020

Shenandoah was not enabled for the JDK11u release - fix in adoptium/temurin-build#2177

@sxa
Member

sxa commented Oct 24, 2020

Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?

If the directories are somehow locked in a way that means they cannot be deleted, that won't achieve anything.

We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

I think @Willsparker has dealt with more of these recently (so may have an idea of how to fix it properly, but let's go into that in a separate infra issue) and has been able to diagnose some of the locked workspaces, but I agree it is probably our most common recurring issue and we need to understand it, resolve it, and try to write some automated mitigation going forward.

@sxa
Member

sxa commented Oct 24, 2020

apt installers for 8u272 suffer a gap in update time which affects end users - infra#1647

@sxa
Member

sxa commented Oct 24, 2020

Getting into the realm of solutions here already, but here's something I've been doing in the infrastructure repo that I think we should roll out to at least the build one.

  • Wherever possible the person who creates any PR should merge it (where they have the authority to do so)
  • If they do not have authority, agree with someone with authority when it will go in

Both of these support the following:

  • The person who created the PR is then responsible for making sure it has the desired effect, either by a full test build run, or by paying close attention to the next nightly

I think this would make the merging process less error-prone and avoid "fire and forget" PRs going in without being verified, which we seem to have had quite a lot of in recent months. I'm loath to add an extra "verify" step to the workflow, but maybe we do need to say that something shouldn't be moved to "Done" until its effect has been confirmed.

@sxa
Member

sxa commented Oct 25, 2020

Also I suggest we split out the "promote release" issues into HotSpot and OpenJ9 ones for various reasons:

  • To ensure there is no confusion amongst the comments as to which one is being referred to
  • To allow the issues to be closed (and if necessary re-opened) separately
  • To ensure that TSC approvals to ship are clearer in the historical record (we should formalise the format of such an approval)

@Willsparker

Willsparker commented Oct 26, 2020

FYI, adoptium/infrastructure#1573 is where I'm looking at the Windows workspace-based issues. The main issue is leftover java.exe processes stopping Jenkins from deleting workspaces.

@andrew-m-leonard

Proposal: "Dry-run" Release
How about on the monday before the tuesday release, we do a "dry-run" Release run-through, without the obvious "Publish" at the end ?

@adamfarley
Contributor Author

adamfarley commented Oct 26, 2020

re: #181 (comment) - Can you find the discussion and indicate what is meant by 'nuking'?

I think I meant just running cleanWs() at the start of each run, though as Stewart says:

re: #181 (comment) - If the directories are somehow locked in a way that means they cannot be deleted, that won't achieve anything.

So perhaps one way forward is to run cleanWs() at the start and end of each run, as Shelley suggests, and to answer every instance of issues like "locked folder" with a fix in cleanWs() that makes it more effective.

@adamfarley
Contributor Author

adamfarley commented Oct 26, 2020

re: #181 (comment) - Proposal: "Dry-run" release
How about, on the Monday before the Tuesday release, we do a "dry-run" release run-through, without the obvious "Publish" at the end?

Seems reasonable. We should also aim to cut down build/test/etc repo contributions during the "dry-run & release" period, so we can avoid new issues sneaking in after the dry-run but before the release.

@sxa
Member

sxa commented Oct 26, 2020

How about, on the Monday before the Tuesday release, we do a "dry-run" release run-through, without the obvious "Publish" at the end?

We could also do it as soon as we enforce build repo lockdown, which varies but is usually on the Thursday/Friday before release. That way nothing else should be going in. Of course it depends how quickly we think we can fix things if they are faulty :-)

@karianna karianna modified the milestones: October 2020, November 2020 Nov 5, 2020
@sxa
Member

sxa commented Nov 10, 2020

The Docker release of arm32v7 had not appeared by today (10th November) despite being shipped about 14 days ago.

@mbien

mbien commented Nov 18, 2020

A high-level suggestion from an observer: a possible mitigation for these kinds of issues would be to have Adopt build RC builds too. OpenJDK has RC builds at least a month before release. If Adopt built them too (as if they were a release), potential issues could be noticed much earlier and likely solved before release. This might cause a boring release week though ;)

@sxa
Member

sxa commented Nov 18, 2020

@mbien Haha a boring release week sounds like bliss! I think we will end up doing some sort of pre-release trial. A month is possibly a bit too far for us because a lot can happen in the month before GA when we're on a three month release cycle (and it's quite rare that a code issue from openjdk trips us up) :-)

Thanks for the input

@sxa
Member

sxa commented Nov 19, 2020

Issue with macos packaging missing JREs: AdoptOpenJDK/homebrew-openjdk#495 (comment)

@sxa
Member

sxa commented Nov 19, 2020

Multiple issues relating to the 11.0.9.1+1 version string which we had to address both in the build repository and the API:

@sxa
Member

sxa commented Nov 27, 2020

Summary of everything above (a.k.a. an easy-to-use agenda for the meeting to be held on Monday at 1400 GMT/UTC). The initials of the person who raised each item in the conversations above are in [].

One-off things (likely don't need much discussion)

  • [AF] ActiveNodeTimeout introduction caused docker-based builds to fail
  • [AF] Test jobs sequencing issue with openjdk-jenkins-helper not being loaded
  • [AF] Directory not empty issues on Windows

Issues:

Questions:

  • [AF] Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC)?
  • [MV] Should we prioritise platforms, e.g. run Windows/x64, Linux/x64, macOS/x64 pipelines first?
  • [AL] How can we make handovers easy from one team member to another during releases if required?
  • [SA] Should we have separate HotSpot/OpenJ9 release issues (to avoid platform confusion and allow closing each separately)?
  • [AF] Should we have a "dry-run" release without publish, e.g. on the Monday before OpenJDK's release date?
  • [MB] Should we do "RC" builds as upstream OpenJDK does?
  • [AA] Can we have a better visible release status like https://gist.github.com/aahlenst/bbb8ca9c87353e0c8928633961047340? With all the different branches/release dates (think ARM on 8), it's super hard to track.

References:

@adamfarley
Contributor Author

adamfarley commented Dec 1, 2020

Meeting Results:

(Note: See the next comment for a concise list of Actions)

One-off things (likely don't need much discussion)

  • [AF] ActiveNodeTimeout introduction caused docker-based builds to fail
    To be addressed with discussion over “release warmup” later.
  • [AF] Test jobs sequencing issue with openjdk-jenkins-helper not being loaded
    Ditto.
  • [AF] Directory not empty issues on Windows
    Already fixed, and job modified to clear up after itself, so shouldn’t happen again.
  • [AF] Job generation not thread safe. "No suitable checks publisher found" warnings.
    AF to raise build issue to resolve.

Issues:

  • [**] Nightly builds were not fully stopped during release (SXA to document based on slack discussion)
    Skipped as per note.
  • [MV] Weekly runs should also be disabled during the release
    Not enabled yet. Will be discussed later on.
  • [SA] Discussions on releases happening outside #release makes it hard to keep track of release activity
    Summaries and key links to be placed in #release. Discuss in #build or #test if you want, but be sure to link threads, URLs, etc.
  • [SA] apt installers for 8u272 suffer a gap in update time which affects end users (doc so releasers are aware?)
    George & Stewart volunteer to raise an issue to develop documentation for this.
  • [SA] Shenandoah not enabled for the JDK11u release
    Adding shenandoahTest as part of a set of smoke tests (Add first batch of smoke tests for AdoptOpenJDK, adoptium/aqa-tests#2067).
  • [SA] When things are merged, ensure someone is responsible for verification to avoid breakage
    Yes. Improvements to the PR tester are pending, to ensure this happens automatically for some PRs.
  • [SA] Docker arm32 release took a long time to get published
    Release tagging in aarch src repos can be slow, and this slows the release. Dino recommends having a script to tag these repos automatically. Bharath is said to be working on this.
  • [SA] Macos packaging missing JREs
    This is the one George is working on. Installer repo PR tester has been updated to detect this in the future.
  • [SA] Build repo lockdown had some "leakage" which broke Solaris/SPARC (& others?)
    Stewart to announce this on Slack.
  • [SA] 11.0.9.1+1: Various issues with the patch number in build and API Windows installer version numbers and sorting
    Andrew to raise an issue to discuss a solution to this.
  • [SA] Releasing doc missing info on updating tags
    An issue will be raised to resolve this.

Questions:

  • [AF] Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC)
    If possible, yes. Advise starting all calls in public channels. If that doesn't happen, then at least post a summary in a public channel.
  • [MV] Should we prioritise platforms e.g. run Windows/x64, Linux/x64, Macos/x64 pipelines first?
    Proposal to separate top-level build pipeline runs per major release into “important platforms” and “other platforms” (one top-level execution each). Issue to be raised to discuss this proposal.
    Related: Establish criteria for inclusion / exclusion of various platform/version builds #186 and Assess test target execution time & define test schedule adoptium/aqa-tests#2037
    Stewart will raise an issue for this discussion in the build repo.
  • [AL] How can we make handovers easy from one team member to another during releasing if required?
    Reducing the number of manual steps will make handover easier (Sample Release Checklist to improve Release Automation #178)
  • [SA] Should we have separate HotSpot/OpenJ9 release issues (to avoid platform confusion and allow closing each separately)?
    Yes. George volunteers to create a PR to make these changes to the template.
  • [AF] Should we have a "dry-run" without publish e.g. on the Monday before OpenJDK's release date?
    Probably. AF to raise an issue to debate this. Wide-ranging, so TSC repo. To include identifying scope of variable control (e.g. code freezes for repos), identifying buy-in, etc.
  • [MB] Should we do "RC" builds as upstream OpenJDK does
    Ditto.
  • [AA] Can we have a better visible release status like https://gist.github.com/aahlenst/bbb8ca9c87353e0c8928633961047340? With all the different branches/release dates (think ARM on 8), it's super hard to track.
    Yes. George volunteers to host a call to discuss this further.
  • [SL] Is there a benefit to creating an actual release checklist? Are there release steps that we can automate? Sample Release Checklist to improve Release Automation
    This was determined to be covered by other items earlier in the retrospective.

References:

Releasing document in the build repository
WIP release checklist document based on the releasing doc

@adamfarley
Contributor Author

adamfarley commented Dec 1, 2020

Actions list:

Adam Farley:

  • Raise a build issue to resolve the "test job generation not thread safe" issue and the "No suitable checks publisher found" warnings. Issue raised.
  • Raise a TSC issue to debate a "dry-run without publish" (e.g. on the Monday before OpenJDK's release date).
    To include identifying the scope of variable control (e.g. code freezes for repos), making the change to the release docs, methods, etc. Issue raised.

George & Stewart

  • Raise issue to develop documentation for this problem: apt installers for 8u272 suffer a gap in update time which affects end users.

George Adams

Stewart Addison

Andrew Leonard

  • Raise an issue to discuss a solution for the various issues with the patch number in build and API Windows installer version numbers and sorting (re: 11.0.9.1+1).

@adamfarley
Contributor Author

Since the actions for this will be chased independently by their respective owners, this issue will now be closed.

Thank you everyone for participating.

@sxa
Member

sxa commented Dec 1, 2020

@adamfarley I feel this probably shouldn't be closed until we have issues covering them, otherwise the work looks complete even though there are no outstanding issues with owners for several of these.

@adamfarley
Contributor Author

Will reopen if you think that will encourage folks to follow up.

My thought was that it'd be easier to close this and simply copy the actions into January's retrospective issue, reviewing the results then.

I think the right way forward is to reopen it as you suggest, and to copy the actions once people have had a chance to update them with links to their issues.

@adamfarley
Contributor Author

Note: Any unresolved actions have been folded into the next retrospective for review. Link.

If any have been unintentionally missed, feel free to add them.
