Retrospective for October 2020 releases #181
Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists. |
Agreed. Or automation. Or an automated checklist. Let's discuss during the retrospective meeting. |
We should run only the main 3 platforms first for both OpenJ9 and Hotspot, then run pipelines for the secondary platforms. We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms. |
I switched off /testing/ of the nightlies via the default checkboxes in the Entire build can be stopped by adjusting the |
If we're going down that route we should implement a formal platform tier proposal (which could lead to interesting discussions, but I'm guessing you're thinking about x86 win/mac/linux as primaries for now?). Being devil's advocate, is there a specific problem you see that means those should be kicked off first? Obviously the others aren't competing for the same resources (unless we push the OpenJ9 XL ones out of "primary") |
Retrospective item: I feel a lot of discussions over the last 18 hours seem to have happened outside the #release channel in Slack. We need to make sure the current status of release-related activity is tracked in one place (including initiation of any calls) to make sure we're all up to date and pulling in the same direction. |
Despite commenting out the default weekly map there were still instances in the jdk_pipeline_config.groovy files which stacked up weekly tests on platforms that didn't have enough hardware to support the run (e.g. Java 11 aarch64). The default map is comprehensive and we should use that going forward and simply get our infra support up. A secondary concern is that we should be more explicit that we're using a default weekly map in jdk_pipeline_config.groovy - a naive engineer may get confused on seeing an empty map in most cases. |
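To illustrate that second point, here is a purely hypothetical sketch of how a config could spell out the default-weekly-map inheritance. The field and target names below are invented for illustration and are not the real jdk_pipeline_config.groovy schema:

```groovy
// Hypothetical sketch only - names are illustrative, not the real schema.
class JdkPipelineConfig {
    // The comprehensive default weekly test map, defined once centrally.
    static final List<String> DEFAULT_WEEKLY_TARGETS = [
        'sanity.openjdk', 'extended.openjdk', 'special.functional'
    ]

    Map<String, Map> buildConfigurations = [
        x64Linux: [
            os    : 'linux',
            arch  : 'x64',
            // Explicitly inherit the default weekly map instead of leaving an
            // unexplained empty map that a new engineer has to puzzle over.
            weekly: DEFAULT_WEEKLY_TARGETS
        ],
        aarch64Linux: [
            os    : 'linux',
            arch  : 'aarch64',
            // Hardware-constrained platforms could visibly trim the defaults.
            weekly: DEFAULT_WEEKLY_TARGETS - ['special.functional']
        ]
    ]
}
```

The point is only that the inheritance (or any trimming of it) is visible in the file, rather than implied by an empty map. |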
"Handover situations": If builds go on for several days for whatever reasons, it is not necessarily the case the same person will be handling a given release. We need to make such handovers easier, rather than trying to figure out from numerous slack messages in various channels. A more focussed/managed release checklist with status ? (@smlambert I know you've mentioned this previously) |
re: #181 (comment) - yes @andrew-m-leonard, see #178 for a WIP checklist that is intended to make it more obvious what has already occurred and by whom. |
Issue: Job generation doesn't appear to be reliably thread-safe, especially the concurrent test job generation we do at the end of a build. Evidence: Groovy's struggle to load the same library in multiple concurrent threads (runTests() in openjdk_build_pipeline.groovy), and the non-fatal "No suitable checks publisher found" issue that springs up in many test runs [(Slack thread)](https://adoptopenjdk.slack.com/archives/CLCFNV2JG/p1603464619103400). Potential solution: If there's a way to launch jobs in a non-blocking way, we could loop over the job-generation step for each test job we want to run after a build (in a single thread), and then "check" for job results in a second loop. Once we have "results" for each test job we generated, the second loop breaks out and we continue. |
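As a rough sketch of that two-loop idea (the target list and job names are hypothetical, and it assumes the Pipeline Build Step plugin's wait: false / waitForStart: true combination hands back a RunWrapper once the downstream job has started):

```groovy
// Sketch only: launch test jobs serially and non-blocking, then poll for results.
def testTargets = ['sanity.openjdk', 'extended.openjdk', 'sanity.system'] // hypothetical
def launched = [:]

// Loop 1: generate/launch each test job one at a time, without blocking.
testTargets.each { target ->
    launched[target] = build(
        job         : "Test_openjdk11_hs_${target}_x64_linux", // hypothetical job name
        propagate   : false,
        wait        : false,
        waitForStart: true  // assumed to return a RunWrapper for the started job
    )
}

// Loop 2: "check" for results until every launched job has one.
while (launched.values().any { it.getResult() == null }) {
    sleep(time: 30, unit: 'SECONDS')
}

launched.each { target, run ->
    echo "${target} finished with result ${run.getResult()}"
}
```

Whether that actually sidesteps the thread-safety problems in the generation code would still need verifying, but it keeps all job generation on a single thread. |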
re: #181 (comment) - what are you intending to solve? Is it meant to address the question of how the test jobs being unable to launch somehow didn't cause a build failure? If so, perhaps some background:
But maybe I misunderstand what your comment is targeting... |
This is a known, long-standing, problematic issue that appears to have triggered the raising of many infra issues in the past, where Jenkins jobs are unable to clean out the previous workspace (or their own at the end of their run) and other jobs fail with AccessDeniedExceptions. All of these relate to the same core problem: we should find a way to address it with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied. |
The problems in "Evidence" (which could perhaps be renamed to "Symptoms") are what I'm aiming to solve. We now have two issues that could be traced back to us trying to use concurrency and build generation together. I was spitballing a simplistic way for us to achieve multiple concurrent jobs while generating them in a serial manner (possibly avoiding the non-thread-safe(?) build generation). The test jobs failing to run is a symptom. The fact that their failure didn't cause the build to fall over is either a non-issue or a separate issue. |
Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to? |
Can you find the discussion and indicate what is meant by 'nuking'? There appears to be a great many comments in the open/closed infra issues listed above. From a test pipeline perspective, the best nuking we can do is to call cleanWs(). We used to do so at the start of each test run. Then, due to pressure not to take up space on machines, we moved it to the end of every run, adoptium/aqa-tests#314. We could call cleanWs() both at the start and end of each run (taking a small hit in added execution minutes), but the core issue is that the cleanWs() call sometimes fails to work when run on Windows machines, no matter when or how frequently you call it. All of this is perhaps a non-issue if we spin up fresh machines on the fly, but we are not really there (and not sure if that is in our infra goals or not). |
Shenandoah was not enabled for the JDK11u release - fix in adoptium/temurin-build#2177 |
If the directories are somehow locked in a way that they cannot be deleted, that won't achieve anything.
I think @Willsparker has dealt with more of these recently (so may have an idea of how to fix this properly, but let's go into that in a separate infra issue) and has been able to diagnose some of the locked workspaces, but I agree it is probably our most common recurring issue and we need to understand and resolve it, and try to write some automated mitigation going forward. |
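One hedged idea for the diagnosis side (hypothetical, not existing Adopt code): when cleanWs() fails on a Windows node, log which process is still holding handles under the workspace before anyone resorts to a reboot. This assumes the Sysinternals handle tool is installed on the node and that a failed cleanWs() surfaces as a catchable error:

```groovy
// Hypothetical sketch: diagnose locked workspaces instead of rebooting blind.
def cleaned = true
try {
    cleanWs(deleteDirs: true, disableDeferredWipeout: true)
} catch (err) {
    cleaned = false
    echo "cleanWs() failed: ${err}"
}
if (!cleaned && !isUnix()) {
    // List processes holding handles under the workspace (assumes handle.exe
    // from Sysinternals is on the PATH); returnStatus avoids failing the build.
    bat(returnStatus: true,
        script: "handle.exe -accepteula -nobanner \"${env.WORKSPACE}\"")
}
```

Capturing that output in the build log would at least tell us which process to chase, rather than just observing the AccessDeniedException. |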
apt installers for 8u272 suffer a gap in update time which affects end users - infra#1647 |
Getting into the realm of solutions here already, but this is something I've been doing in the infrastructure repo and I think we should roll out to at least the build one.
Both of these support the following:
I think this would make the merging process less error-prone and avoid "fire and forget" PRs going in without being verified, which we seem to have had quite a lot of in recent months. I'm loath to add an extra "verify" step to the workflow but maybe we do need to say that something shouldn't be moved to "Done" until it has been confirmed. |
Also I suggest we split out the "promote release" issues into HotSpot and OpenJ9 ones for various reasons:
|
FYI, adoptium/infrastructure#1573 is where I'm looking at the Windows workspace based issues. The main issue is leftover |
Proposal: "Dry-run" Release |
I think I meant just running cleanWs() at the start of each run, though as Stewart says:
So perhaps one way forward is to run cleanWs() at the start and end of each run, as Shelley suggests, and to answer every instance of issues like "locked folder" with a fix in cleanWs() that makes it more effective. |
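For concreteness, a minimal sketch of that "clean at both ends" shape (the node label is a placeholder and the cleanWs() options are suggestions, not the current openjdk_build_pipeline.groovy code):

```groovy
// Sketch: clean the workspace at both the start and the end of every run.
node('build-machine-label') {   // placeholder label
    // Start-of-run clean: protects against leftovers from an aborted prior run.
    cleanWs(notFailBuild: true, deleteDirs: true, disableDeferredWipeout: true)
    try {
        stage('build-and-test') {
            checkout scm
            // ... existing build and test steps ...
        }
    } finally {
        // End-of-run clean: keeps disk usage down on shared machines.
        cleanWs(notFailBuild: true, deleteDirs: true, disableDeferredWipeout: true)
    }
}
```

With notFailBuild: true a failed cleanup is logged rather than killing the run, which is the trade-off if we want the "locked folder" cases to show up as diagnostics rather than failures. |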
Seems reasonable. We should also aim to cut down build/test/etc repo contributions during the "dry-run & release" period, so we can avoid new issues sneaking in after the dry-run but before the release. |
We could also do it as soon as we enforce build repo lockdown, which varies but is usually on the Thursday/Friday before release. That way nothing else should be going in. Of course it depends how quickly we think we can fix things if they are faulty :-) |
Docker release of arm32v7 had not appeared by today (10th November) despite being shipped about 14 days ago |
A high-level suggestion from an observer: a possible mitigation for these kinds of issues would be to let Adopt produce RC builds too. OpenJDK has RC builds at least a month before release. If Adopt built them too (as if they were a release), potential issues could be noticed much earlier and likely solved before the release. This might cause a boring release week though ;) |
@mbien Haha a boring release week sounds like bliss! I think we will end up doing some sort of pre-release trial. A month is possibly a bit too far for us because a lot can happen in the month before GA when we're on a three month release cycle (and it's quite rare that a code issue from openjdk trips us up) :-) Thanks for the input |
Issue with macos packaging missing JREs: AdoptOpenJDK/homebrew-openjdk#495 (comment) |
Multiple issues relating to the
|
Summary of everything above (a.k.a. an easy-to-use agenda for the meeting to be held on Monday at 1400 GMT/UTC). The initials of the person who raised it in the conversations above are in
One-off things (likely don't need much discussion)
Issues:
Questions:
References: |
Meeting Results: (Note: See the next comment for a concise list of Actions)
One-off things (likely don't need much discussion)
Issues:
Questions:
References: Releasing document in the build repository |
Actions list:
Adam Farley:
George & Stewart
George Adams
Stewart Addison
Andrew Leonard
|
Since the actions for this will be chased independently by their respective owners, this issue will now be closed. Thank you everyone for participating. |
@adamfarley I feel this probably shouldn't be closed until we have issues covering them, otherwise the work looks complete even though there are no outstanding issues for several of these with owners |
Will reopen if you think that will encourage folks to follow up. My thought was that it'd be easier to close this and simply copy the actions into January's retrospective issue, reviewing the results then. I think the right way forward is to reopen it as you suggest, and to copy the actions once people have had a chance to update them with links to their issues. |
Note: Any unresolved actions have been folded into the next retrospective for review. Link. If any have been unintentionally missed, feel free to add them. |
Topics for the retrospective should include: