Predictability during memo file regeneration and general evolution #148

Open

chris-allan opened this issue Aug 5, 2024 · 0 comments · May be fixed by #150

Since direct, real time translation of image data via Bio-Formats was added to OMERO as "OMERO.fs" [1] in OMERO 5.0, the issue of how to handle "memo files" has become increasingly important. Each time OMERO.server is released, a statement such as this is included in the release notes [2]:

> Note that the Bio-Formats Memoizer cache will be invalidated on upgrade from earlier OMERO.server versions.

These memo files are essential to the performant operation of OMERO.server and this microservice. In this microservice, each and every rendering operation creates a Bio-Formats reader, reads the memo file, performs the operation, and then closes the reader. A current and usable memo file is a hard requirement of the current implementation; if one needs to be created, it will always be created in real time, on the fly.
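As a concrete illustration, here is a minimal sketch of that per-request reader lifecycle using the public Bio-Formats `Memoizer` API; the cache directory path and the rendering step are illustrative, not this microservice's actual configuration:

```java
import java.io.File;

import loci.formats.IFormatReader;
import loci.formats.ImageReader;
import loci.formats.Memoizer;

public class RenderOnce {
    public static void main(String[] args) throws Exception {
        // Wrap the stock ImageReader in a Memoizer so that reader state is
        // deserialized from an existing memo file or, on a cache miss,
        // produced by full initialization and serialized for next time.
        IFormatReader reader = new Memoizer(
                new ImageReader(),
                Memoizer.DEFAULT_MINIMUM_ELAPSED,
                new File("/OMERO/BioFormatsCache"));  // illustrative cache directory
        try {
            // setId() is where the memo file is read or regenerated; with a
            // current memo file this is fast, without one it can be very slow.
            reader.setId(args[0]);
            // ... perform the single rendering operation, e.g. read a plane ...
            byte[] plane = reader.openBytes(0);
        } finally {
            // The reader is closed and discarded after every operation.
            reader.close();
        }
    }
}
```

Every request pays the `setId()` cost, which is why a stale or missing memo file turns a routine rendering call into a full, potentially very long initialization.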

Allowing these memo files to be regenerated in real time is risky as full Bio-Formats reader initialization must take place. This can be especially time consuming for multi-image or complex filesets. To combat this, about 5 years ago (#47), we expanded on the image region infrastructure to allow us to regenerate memo files without requiring continuous access to a running OMERO.server. This was expanded further, about 4 years ago (#60), with a control script which includes parallelization and reporting tools.

5 years on, however, some of the same core issues remain:

  1. The time required for the regeneration of individual memo files is hard to predict, making accurate downtime estimation during an upgrade difficult if not impossible.

  2. It is very difficult to ascertain if or why a change in any reader will cause the current memo files created by that reader to be invalidated. Furthermore, adding a new reader to Bio-Formats invalidates all memo files. Consequently, a very high level of discipline is required at the Bio-Formats level to minimize memo file invalidation. The impact on OMERO is justifiably not always the primary concern when making changes to Bio-Formats that may affect memo files.

    Which readers a given change may affect is also hard to ascertain and hard to test for. Thus we act very conservatively and regularly ensure that all memo files for a given OMERO instance are regenerated.

  3. Regenerating all memo files requires visiting every file on the system. This can be exceptionally unpredictable in cases where a large amount of data has been in-place imported from a storage subsystem that is (a) a filesystem view on object storage such as that provided by Amazon S3 File Gateway [3]; (b) storage whose tiering algorithms depend on access time; or (c) otherwise performing poorly.

With these issues in mind, the current recommendation is to:

  • Run memo file regeneration offline well in advance of an upgrade
  • In accordance with [2], always assume an upgrade will require a full regeneration of memo files
  • Consider [2] and [3] against the workflow, file format use, and storage realities of each deployment when deciding whether full regeneration will be beneficial or counterproductive

However, this does not mean we cannot consider improvements. A few that might be achievable in the short term have been discussed:

  1. The queries that currently feed data into the memo file regenerator via the parallelization and batching scripts produce a reverse chronological list of filesets to be processed. Because each fileset is treated equally, there is no attempt to batch intelligently, which can create situations of poor resource utilization where one batch finishes quickly and one or more others do not. Can we use the output of omero fs importtime (new CLI subcommand: fs importtime ome/openmicroscopy#5791) to add per-fileset weights that more accurately balance resource utilization and so achieve more uniform batch execution times? (A sketch of such weighted batching follows below.)
  2. Can we use the output of omero fs importtime to estimate runtime?

(omero fs importtime --cache may need further work to allow admins to run it on behalf of other users in a multi-user environment)
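For item 1, a hypothetical sketch of weighted batching, assuming per-fileset elapsed times (for example, as reported by omero fs importtime) are available as (fileset id, seconds) pairs; the greedy longest-processing-time-first heuristic below is one standard way to even out batch execution times:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class WeightedBatcher {

    /** A fileset and its expected (re)initialization cost in seconds. */
    record Fileset(long id, double seconds) {}

    /**
     * Greedy longest-processing-time-first assignment: place each fileset,
     * heaviest first, into the currently lightest batch so that total batch
     * execution times come out roughly uniform.
     */
    static List<List<Fileset>> batch(List<Fileset> filesets, int batchCount) {
        List<Fileset> sorted = new ArrayList<>(filesets);
        sorted.sort(Comparator.comparingDouble(Fileset::seconds).reversed());

        List<List<Fileset>> batches = new ArrayList<>();
        for (int i = 0; i < batchCount; i++) {
            batches.add(new ArrayList<>());
        }

        // Min-heap of batch indices ordered by their accumulated weight.
        double[] weights = new double[batchCount];
        PriorityQueue<Integer> lightest = new PriorityQueue<>(
                Comparator.comparingDouble(i -> weights[i]));
        for (int i = 0; i < batchCount; i++) {
            lightest.add(i);
        }

        for (Fileset fs : sorted) {
            int i = lightest.poll();  // lightest batch so far
            batches.get(i).add(fs);
            weights[i] += fs.seconds();
            lightest.add(i);          // re-insert with its updated weight
        }
        return batches;
    }
}
```

The heaviest batch total also yields a first-order wall-clock estimate for the whole regeneration run, which speaks to item 2.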

Some longer term items are also worth considering, several of which @kkoz, @sbesson, @melissalinkert, and I have discussed at various times over the years:

  1. Work to better localize, assess, and document the impact of reader changes on memo files (this is already much better than it used to be)
  2. In the spirit of some client side improvements (Block UI for some long running tasks ome/omero-web#543) move further or completely towards asynchronous handling of potentially long running, risky tasks like memo file regeneration
  3. Separate metadata rich Bio-Formats reader initialization from binary data only initialization and maintain multiple sets of memo files; these can take wildly different amounts of time (see the sketch after this list)
  4. Depart from the "include all" Bio-Formats ImageReader use after import
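For item 3, a sketch of what maintaining two memo sets could look like using the existing Bio-Formats MetadataOptions API; the cache directory layout and naming here are assumptions for illustration, not an agreed design:

```java
import java.io.File;

import loci.formats.IFormatReader;
import loci.formats.ImageReader;
import loci.formats.Memoizer;
import loci.formats.in.DefaultMetadataOptions;
import loci.formats.in.MetadataLevel;

public class TwoMemoSets {

    /**
     * Build a reader whose memo file lives in a cache directory keyed on the
     * requested metadata level, so the cheap binary-data-only memo set can be
     * regenerated independently of the expensive metadata-rich one.
     */
    static IFormatReader readerFor(MetadataLevel level) {
        ImageReader wrapped = new ImageReader();
        wrapped.setMetadataOptions(new DefaultMetadataOptions(level));
        File cacheDir = new File("/OMERO/BioFormatsCache/" + level);  // assumed layout
        return new Memoizer(wrapped, Memoizer.DEFAULT_MINIMUM_ELAPSED, cacheDir);
    }

    public static void main(String[] args) throws Exception {
        // Binary-data-only initialization: typically much faster.
        try (IFormatReader binary = readerFor(MetadataLevel.MINIMUM)) {
            binary.setId(args[0]);
        }
        // Metadata-rich initialization: can take wildly longer.
        try (IFormatReader full = readerFor(MetadataLevel.ALL)) {
            full.setId(args[0]);
        }
    }
}
```

Keying the memo directory on metadata level keeps the two memo sets independent, since Memoizer derives memo file locations from its cache directory.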

/cc @stick, @atTODO

Footnotes

  1. https://omero.readthedocs.io/en/stable/developers/Server/FS.html

  2. https://www.openmicroscopy.org/2022/12/13/omero-5-6-6.html

  3. https://aws.amazon.com/storagegateway/file/s3/
