Restrax #1074

Merged
merged 61 commits into from
Feb 24, 2023

JoranAngevaare
Contributor

@JoranAngevaare JoranAngevaare commented Aug 21, 2022

Add restrax

In this PR, we change the bootstrax workflow. Instead of writing the final data to disk directly, we will write the data twice: once using fast compressors, and once using arbitrary compressors.

We would change the workflow as follows:
[image: diagram of the proposed workflow]

Can you briefly describe how it works?

I've given a presentation here. Also see the dedicated documentation.

Current

Bootstrax rechunks certain datatypes live (i.e. in memory) and applies different levels of compression. Admix makes a 1:1 copy for rucio. In this scheme, we don’t rechunk the lowest-level datatypes, which results in many files (one per chunk of live data). Therefore, we set the chunk size to 21 s for background data.

Advantage:

  • Efficient, only write data once
  • Works under many different data rates

Disadvantage:

  • Many “small” files, since you don’t want to keep too many datatypes in memory
  • 21-second chunks (dictated by not rechunking raw records) are (too) slow for SNEWS; shorter chunks make the online monitor data more live
  • The desired compressor has to be decided right at the start of processing
  • Many configs are needed for the different sources, which we use to “guess” what chunk size we should use

Proposed

Bootstrax first writes all datatypes to disk. An extra process, restrax, then compresses and rechunks whatever data to whatever size. In this case we can set the chunk size to any duration (e.g. 3 seconds).

Advantage:

  • Arbitrary filesize of each datatype
  • Arbitrary compressor of each datatype
  • 5-second chunks possible for all data rates (different calibration sources)
  • In case of high rates, restrax could software-veto by cutting out raw-records based on posrec/peak size. This would require much more development.
  • If restrax writes to datamanager, we are getting a free lunch.

Disadvantage:

  • Roughly twice the I/O at the eventbuilders (unless restrax writes to datamanager). Some extra CPU is needed for (de)compressing, and the extra memory is a few times the set target_size_mb; there are no surprises here and memory usage is very predictable.
  • More complex workflow
  • Not tested, needs (some) development. Restrax would be ~100 lines of extra code, mostly for bookkeeping the data field in the rundoc.
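
The rechunking step described above can be illustrated with a minimal sketch. This is not the restrax implementation: the function name `rechunk` and the representation of chunks as plain sizes in MB are assumptions made for illustration only; `target_size_mb` is borrowed from the text above. It groups consecutive small chunks until their combined size reaches the target.

```python
from typing import Iterable, Iterator, List


def rechunk(chunk_sizes_mb: Iterable[float],
            target_size_mb: float) -> Iterator[List[float]]:
    """Group consecutive chunks until the group reaches target_size_mb.

    Hypothetical helper: chunks are represented only by their size in MB.
    """
    buffer: List[float] = []
    buffered_mb = 0.0
    for size in chunk_sizes_mb:
        buffer.append(size)
        buffered_mb += size
        if buffered_mb >= target_size_mb:
            # Enough data collected: emit one larger chunk and start over
            yield buffer
            buffer, buffered_mb = [], 0.0
    if buffer:
        # Emit whatever is left at the end of the run
        yield buffer


# Ten small chunks of 40 MB each, merged towards a 200 MB target
merged = list(rechunk([40.0] * 10, target_size_mb=200.0))
```

Because only one group is buffered at a time, the memory needed scales with the target size rather than with the length of the run, which matches the predictable memory usage mentioned above.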

Additional changes:

  • Update the bootstrax fields used to communicate with admix to allow this intermediate step
  • Move some duplicate functions from bootstrax into daq_core.py to prevent duplication, thereby paving the way for Refactor bootstrax #479 to be implemented
  • Rename args.execute in ajax to args.production for consistency

Further requirements:

  • Test this framework!
  • More log statements in restrax
  • We might want to make Restrax.get_compressor_and_size fancier after discussing with computing. Its current implementation should probably already improve their lives a lot
  • Don't infer the compressor for raw-records anymore. Only use blosc (fastest) or lz4 in case of >2 GB chunks.
  • Merge Rechunker using Mailbox AxFoundation/strax#710; this feature makes restrax 5-10x faster
  • Add documentation to the cited page


coveralls commented Aug 21, 2022

Coverage Status

Coverage: 93.527% (-0.01%) from 93.538% when pulling f2cca33 on restraxer into 234b787 on master.

@JoranAngevaare JoranAngevaare mentioned this pull request Feb 9, 2023
JoranAngevaare and others added 2 commits February 9, 2023 11:16
Co-authored-by: Joran Angevaare <j.angevaare@nikhef.nl>
@JoranAngevaare JoranAngevaare marked this pull request as draft February 16, 2023 14:44
@JoranAngevaare
Contributor Author

Thanks @mflierm and @cfuselli for the great reviews and excellent ideas. I've implemented all of them (unless explicitly stated otherwise).

You can update the configs via (assuming you are on one of the ebs):

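from straxen import daq_core

# Connect to the DAQ databases
db = daq_core.DataBases()
# Update fields of the restrax config document
db.daq_db['restrax_config'].update_one({'name': 'restrax_config'},
                                       {'$set': {'user': 'angevaare', 'last_modified': daq_core.now()}})
# Read the document back to check the update
db.daq_db['restrax_config'].find_one()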

We could consider making a page on the website for this

@JoranAngevaare JoranAngevaare marked this pull request as ready for review February 17, 2023 08:48
@JoranAngevaare JoranAngevaare merged commit dfa021d into master Feb 24, 2023
@JoranAngevaare JoranAngevaare deleted the restraxer branch March 3, 2023 13:32
@JoranAngevaare
Contributor Author

For bookkeeping:
two additional commits were added to solve an issue in compression:
