Predictability during memo file regeneration and general evolution #148

Open

chris-allan opened this issue Aug 5, 2024 · 0 comments · May be fixed by #150

Since direct, real time translation of image data via Bio-Formats was added to OMERO as "OMERO.fs" [1] in OMERO 5.0, the issue of how to handle "memo files" has become increasingly important. Each time OMERO.server is released, a statement such as this is included in the release notes [2]:

> Note that the Bio-Formats Memoizer cache will be invalidated on upgrade from earlier OMERO.server versions.

These memo files are essential to the performant operation of OMERO.server and this microservice. In this microservice, each and every rendering operation creates a Bio-Formats reader, reads the memo file, performs the operation, and then closes the reader. A current and usable memo file is a hard requirement of the current implementation; if one needs to be created, it will always be created in real time, on the fly.
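As a concrete illustration, here is a minimal sketch of that per-request reader lifecycle using the public Bio-Formats `Memoizer` API; the cache directory path and the rendering step are illustrative, not this microservice's actual configuration:

```java
import java.io.File;

import loci.formats.IFormatReader;
import loci.formats.ImageReader;
import loci.formats.Memoizer;

public class RenderOnce {
    public static void main(String[] args) throws Exception {
        // Wrap the stock ImageReader in a Memoizer so that reader state is
        // deserialized from an existing memo file or, on a cache miss,
        // produced by full initialization and serialized for next time.
        IFormatReader reader = new Memoizer(
                new ImageReader(),
                Memoizer.DEFAULT_MINIMUM_ELAPSED,
                new File("/OMERO/BioFormatsCache"));  // illustrative cache directory
        try {
            // setId() is where the memo file is read or regenerated; with a
            // current memo file this is fast, without one it can be very slow.
            reader.setId(args[0]);
            // ... perform the single rendering operation, e.g. read a plane ...
            byte[] plane = reader.openBytes(0);
        } finally {
            // The reader is closed and discarded after every operation.
            reader.close();
        }
    }
}
```

Every request pays the `setId()` cost, which is why a stale or missing memo file turns a routine rendering call into a full, potentially very long initialization.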

Allowing these memo files to be regenerated in real time is risky as full Bio-Formats reader initialization must take place. This can be especially time consuming for multi-image or complex filesets. To combat this, about 5 years ago (#47), we expanded on the image region infrastructure to allow us to regenerate memo files without requiring continuous access to a running OMERO.server. This was expanded further, about 4 years ago (#60), with a control script which includes parallelization and reporting tools.

5 years on, however, some of the same core issues remain:

  1. The time required for the regeneration of individual memo files is hard to predict, making accurate downtime estimation during an upgrade difficult if not impossible.

  2. It is very difficult to ascertain if or why a change in any reader will cause the current memo files created by that reader to be invalidated. Furthermore, adding a new reader to Bio-Formats invalidates all memo files. Consequently, a very high level of discipline is required at the Bio-Formats level to minimize memo file invalidation. The impact on OMERO is justifiably not always the primary concern when making changes to Bio-Formats that may affect memo files.

    Which readers a given change may affect is also hard to ascertain and hard to test for. Thus we act very conservatively and regularly ensure that all memo files for a given OMERO instance are regenerated.

  3. Regenerating all memo files requires visiting every file on the system. This can be exceptionally unpredictable in cases where a large amount of data has been in-place imported from a storage subsystem that is (a) a filesystem view on object storage such as that provided by Amazon S3 File Gateway [3]; (b) storage whose tiering algorithms depend on access time; or (c) otherwise performing poorly.

With these issues in mind, the current recommendation is to:

  • Run memo file regeneration offline well in advance of an upgrade
  • In accordance with [2], always assume an upgrade will require a full regeneration of memo files
  • Consider [2] and [3] against the workflow, file format use, and storage realities of each deployment when deciding whether full regeneration will be beneficial or counterproductive

However, this does not mean we cannot consider improvements. A few that might be achievable in the short term have been discussed:

  1. The queries that currently feed data into the memo file regenerator via the parallelization and batching scripts produce a reverse chronological list of filesets to be processed. Because each fileset is treated equally, there is no attempt to batch intelligently, which can create situations of poor resource utilization where one batch finishes quickly and one or more others do not. Can we use the output of omero fs importtime (new CLI subcommand: fs importtime ome/openmicroscopy#5791) to add per-fileset weights that more accurately balance resource utilization and so achieve more uniform batch execution times? (A sketch of such weighted batching follows below.)
  2. Can we use the output of omero fs importtime to estimate runtime?

(omero fs importtime --cache may need further work to allow admins to run it on behalf of other users in a multi-user environment)
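For item 1, a hypothetical sketch of weighted batching, assuming per-fileset elapsed times (for example, as reported by omero fs importtime) are available as (fileset id, seconds) pairs; the greedy longest-processing-time-first heuristic below is one standard way to even out batch execution times:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class WeightedBatcher {

    /** A fileset and its expected (re)initialization cost in seconds. */
    record Fileset(long id, double seconds) {}

    /**
     * Greedy longest-processing-time-first assignment: place each fileset,
     * heaviest first, into the currently lightest batch so that total batch
     * execution times come out roughly uniform.
     */
    static List<List<Fileset>> batch(List<Fileset> filesets, int batchCount) {
        List<Fileset> sorted = new ArrayList<>(filesets);
        sorted.sort(Comparator.comparingDouble(Fileset::seconds).reversed());

        List<List<Fileset>> batches = new ArrayList<>();
        for (int i = 0; i < batchCount; i++) {
            batches.add(new ArrayList<>());
        }

        // Min-heap of batch indices ordered by their accumulated weight.
        double[] weights = new double[batchCount];
        PriorityQueue<Integer> lightest = new PriorityQueue<>(
                Comparator.comparingDouble(i -> weights[i]));
        for (int i = 0; i < batchCount; i++) {
            lightest.add(i);
        }

        for (Fileset fs : sorted) {
            int i = lightest.poll();  // lightest batch so far
            batches.get(i).add(fs);
            weights[i] += fs.seconds();
            lightest.add(i);          // re-insert with its updated weight
        }
        return batches;
    }
}
```

The heaviest batch total also yields a first-order wall-clock estimate for the whole regeneration run, which speaks to item 2.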

Some longer term items are also worth considering, several of which @kkoz, @sbesson, @melissalinkert, and I have discussed at various times over the years:

  1. Work to better localize, assess, and document the impact of reader changes on memo files (this is already much better than it used to be)
  2. In the spirit of some client side improvements (Block UI for some long running tasks ome/omero-web#543) move further or completely towards asynchronous handling of potentially long running, risky tasks like memo file regeneration
  3. Separate metadata rich Bio-Formats reader initialization from binary data only initialization and maintain multiple sets of memo files; these can take wildly different amounts of time (see the sketch after this list)
  4. Depart from the "include all" Bio-Formats ImageReader use after import
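For item 3, a sketch of what maintaining two memo sets could look like using the existing Bio-Formats MetadataOptions API; the cache directory layout and naming here are assumptions for illustration, not an agreed design:

```java
import java.io.File;

import loci.formats.IFormatReader;
import loci.formats.ImageReader;
import loci.formats.Memoizer;
import loci.formats.in.DefaultMetadataOptions;
import loci.formats.in.MetadataLevel;

public class TwoMemoSets {

    /**
     * Build a reader whose memo file lives in a cache directory keyed on the
     * requested metadata level, so the cheap binary-data-only memo set can be
     * regenerated independently of the expensive metadata-rich one.
     */
    static IFormatReader readerFor(MetadataLevel level) {
        ImageReader wrapped = new ImageReader();
        wrapped.setMetadataOptions(new DefaultMetadataOptions(level));
        File cacheDir = new File("/OMERO/BioFormatsCache/" + level);  // assumed layout
        return new Memoizer(wrapped, Memoizer.DEFAULT_MINIMUM_ELAPSED, cacheDir);
    }

    public static void main(String[] args) throws Exception {
        // Binary-data-only initialization: typically much faster.
        try (IFormatReader binary = readerFor(MetadataLevel.MINIMUM)) {
            binary.setId(args[0]);
        }
        // Metadata-rich initialization: can take wildly longer.
        try (IFormatReader full = readerFor(MetadataLevel.ALL)) {
            full.setId(args[0]);
        }
    }
}
```

Keying the memo directory on metadata level keeps the two memo sets independent, since Memoizer derives memo file locations from its cache directory.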

/cc @stick, @atTODO

Footnotes

  1. https://omero.readthedocs.io/en/stable/developers/Server/FS.html

  2. https://www.openmicroscopy.org/2022/12/13/omero-5-6-6.html

  3. https://aws.amazon.com/storagegateway/file/s3/
