Recovering from a timeout #97

ewillink · 2024-07-07T08:26:00Z

https://ci.eclipse.org/modisco/job/justj-promoter/43/display/redirect has just failed with:

!MESSAGE Unable to read repository at https://ci.eclipse.org/modisco/job/modisco-master/355/artifact/org.eclipse.modisco.updatesite/target/repository/plugins/org.eclipse.modisco.infra.query.source_1.5.0.v20240707-0725.jar.
!STACK 0
java.net.SocketTimeoutException: Read timed out

Sadly, this kind of EF infrastructure failure is not rare and, on a Sunday, may take quite a few hours to be rectified.

Anyway, JustJ processing has been undermined. (The failed build has been marked as keep forever to facilitate developer investigation.) The last time this happened there was a bad repo entry, and my attempts to supersede it with later builds or to tweak the contents failed. Fortunately the problem was a bad nightly, so the brute-force solution of deleting the entire nightly tree was available. Had it occurred for a release build, however, the brute-force approach would not have been acceptable.

Is there a recommended way to recover from the EF infrastructure breakdown? Should a JustJ bug be raised to improve resilience?

ewillink (Author) · 2024-07-07T08:38:45Z

Before migrating to JustJ, my bash scripts built the new content in a new directory, so that anything could go wrong only during the brief window of the final

mv target-dir old-dir
mv new-dir target-dir
rm -rf old-dir

If it did go wrong, the result was simply that the new content was lost. The brief window made it very unlikely that a third-party build could catch the 'latest' build in a transient state. Currently I get the occasional build for which either platform or EMF is in a transient state.
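
The swap described above can be sketched as a small bash function. This is a minimal sketch rather than the original script: the function name `publish_by_swap`, the `.old.$$` suffix, and the rollback step are assumptions added for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: publish_by_swap NEW_DIR TARGET_DIR
# Swaps a fully built NEW_DIR into place as TARGET_DIR.
publish_by_swap() {
  local new_dir=$1 target_dir=$2
  local old_dir="${target_dir}.old.$$"   # assumed naming for the parked tree

  # Refuse to touch the live tree if the staged content is missing.
  [ -d "$new_dir" ] || { echo "missing staged content: $new_dir" >&2; return 1; }

  # Park the live tree (if there is one), then move the new tree into place.
  if [ -e "$target_dir" ]; then
    mv -- "$target_dir" "$old_dir"
  fi
  if ! mv -- "$new_dir" "$target_dir"; then
    # Roll back so that the previous content stays published.
    if [ -e "$old_dir" ]; then
      mv -- "$old_dir" "$target_dir"
    fi
    return 1
  fi

  # Only now discard the previous content.
  rm -rf -- "$old_dir"
}
```

A call would look like `publish_by_swap new-dir target-dir`. The window stays brief only when both directories are on the same filesystem, where `mv` is a rename rather than a copy.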

ewillink (Author) · 2024-07-07T08:43:55Z

Apologies. This should be a JustJ discussion. I can only transfer within OOMPH. Hopefully someone has greater power / skill.

merks (Collaborator) · 2024-07-07T08:46:49Z

There are several steps in the promotion process.

First we "mirror" a subset of the entire update site from the server to the client local structure:
- /usr/bin/rsync -avsh --exclude *.zip --exclude *.tar.gz --exclude */features --exclude */plugins --exclude */binary --exclude */.blobstore --exclude *.html --exclude */downloads --exclude */archive genie.modisco@projects-storage.eclipse.org:/home/data/httpd/download.eclipse.org/modeling/mdt/modisco/builds/ /home/jenkins/agent/workspace/justj-promoter/justj-sync/builds
  receiving incremental file list
Then we mirror the site to be promoted into the client local structure:
- Mirroring 'https://ci.eclipse.org/modisco/job/modisco-master/355/artifact/org.eclipse.modisco.updatesite/target/repository' to '/home/jenkins/agent/workspace/justj-promoter/justj-sync/builds/nightly/N202407070725'
Then we generate the indexes for all the sites into the client local structure.
And finally we send the new stuff back to the server and do some final cleanup:
- /usr/bin/rsync -avsh /home/jenkins/agent/workspace/justj-promoter/justj-sync/builds/ genie.modisco@projects-storage.eclipse.org:/home/data/httpd/download.eclipse.org/modeling/mdt/modisco/builds/
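
The ordering of these steps can be sketched in bash as follows. This is not the JustJ promoter's code: all four step functions are hypothetical stand-ins that only record that they ran, and the `MIRROR_OK` variable is an assumption used to simulate a mirroring failure such as the read timeout above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Records which steps ran, in order.
steps_run=()

# Hypothetical stand-ins for the four promotion steps.
sync_indexes_from_server() { steps_run+=("sync-down"); }
mirror_build_to_local()    { steps_run+=("mirror"); [ "${MIRROR_OK:-yes}" = yes ]; }
generate_indexes()         { steps_run+=("index"); }
sync_back_to_server()      { steps_run+=("sync-up"); }

# Each step runs only if the previous one succeeded, so the server is
# written to only after everything has been prepared locally.
promote() {
  steps_run=()
  sync_indexes_from_server &&
    mirror_build_to_local &&
    generate_indexes &&
    sync_back_to_server
}
```

With `MIRROR_OK=no`, `promote` stops after the mirror step and the stand-in for the final upload is never called, which is the situation in the failed build.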

In your failed build, p2's mirror infrastructure failed while mirroring the update site to be promoted into the local client structure, and that terminated the process. So there is really nothing to recover: no content on the server was changed. You simply need to run the promotion again once the servers are in better shape. Only if the final rsync step failed partway through sending content to the server would there be the potential for broken content on the server.
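
Since rerunning is safe as long as the failure happens before the final rsync, the rerun could in principle be automated. The following is a minimal sketch of a generic retry wrapper in bash, not something JustJ provides: the `retry` function and its doubling delay are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 3 600 some-command` would make up to three attempts, waiting 10 and then 20 minutes between them, where `some-command` is a placeholder for whatever launches the promotion. A wrapper like this only makes sense around steps that leave the server untouched on failure. It should not blindly wrap the final rsync.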

Note that if the download server itself was in bad shape, we'd fail in the first step...

I'm not sure if you saw this:

https://git.eclipse.org/r/c/modisco/org.eclipse.modisco/+/207294

1 reply

ewillink Jul 7, 2024
Author

No, I missed this. I guess Gerrit is now broken for Modisco. I use the trustworthy Bugzilla functionality.

The new commit seems to just revert a previous Gerrit change that I also missed. The current nightly.target built ok this morning.

Once the dust settles I can use https://download.eclipse.org/oomph/archive/p2-index/ to see whether the repos are damaged. https://download.eclipse.org/modeling/mdt/modisco/builds/index.html suggests that no promotion occurred. https://download.eclipse.org/justj/?file=modeling/mdt/modisco/builds/nightly suggests that there is no partial promotion for a next JustJ promotion to activate.

It seems this failure is 'safe' unlike my earlier failure. A rebuild and we're ok.
