Rechunker using Mailbox #710

Merged
merged 11 commits into from
Feb 13, 2023

JoranAngevaare
Contributor

@JoranAngevaare JoranAngevaare commented Feb 8, 2023

What is the problem / what does the code in this PR do
Rechunking of our data is much sought after by computing, as our framework creates many small files that are cumbersome to handle. We therefore added a tool that does this in #686. We are now expanding its use cases by integrating it into the DAQ workflow in XENONnT/straxen#1074.

For XENONnT/straxen#1074 I noticed rather poor performance when all files are rechunked serially. Therefore, I looked into parallelizing the load -> save + rechunk approach. I reused the mailbox paradigm for this, as it was easier than writing a home-grown algorithm. Parallelizing here makes sense because the task is about as I/O-bound as it is CPU-bound (but not at the same time), so you might be waiting on I/O while the CPU is idle, and vice versa.
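To illustrate the reasoning above, here is a minimal sketch in plain Python. It is not the code in this PR: `load_chunk`, `recompress`, and the two `process_*` functions are hypothetical stand-ins, and zlib is used in place of the real compressors. It only shows how a thread pool lets the load step of one chunk overlap with the compression of another, while the results stay in the original order.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def load_chunk(i):
    """Hypothetical stand-in for reading one small file (I/O-bound step)."""
    return bytes([i % 256]) * 10_000


def recompress(raw):
    """Hypothetical stand-in for the CPU-bound step (zlib, for illustration)."""
    return zlib.compress(raw)


def load_and_recompress(i):
    return recompress(load_chunk(i))


def process_serial(n_chunks):
    # One chunk at a time: the CPU idles during loads, the disk idles during compression.
    return [load_and_recompress(i) for i in range(n_chunks)]


def process_threaded(n_chunks, max_workers=4):
    # Several chunks in flight at once; Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_and_recompress, range(n_chunks)))


if __name__ == "__main__":
    assert process_threaded(8) == process_serial(8)
    print("threaded and serial results are identical")
```

How much this helps depends on how the time splits between I/O and CPU on a given machine, which is consistent with the different speedups reported in this PR.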

Additionally, I incorporated #709's idea of having a progress bar. I think it's quite neat.

Can you briefly describe how it works?
Use the mailbox system to load and save chunks of data. This makes it easy to run with multiple threads or multiple cores.
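As a rough picture of the idea, the sketch below uses a bounded `queue.Queue` from the standard library as a stand-in for a mailbox. It is not the strax `Mailbox` API: the names `loader`, `rechunking_saver`, `rechunk`, `target_size`, and `max_messages` are all hypothetical, and chunks are plain `bytes` objects. A loader thread pushes small chunks into the queue while the saver merges them until a target size is reached.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the chunk stream


def loader(chunks, box):
    """Producer: push chunks into the bounded queue, then signal the end."""
    for chunk in chunks:
        box.put(chunk)  # blocks while the queue is full, which bounds memory use
    box.put(_DONE)


def rechunking_saver(box, target_size, saved):
    """Consumer: merge incoming chunks until they reach target_size bytes."""
    buffer = b""
    while True:
        chunk = box.get()
        if chunk is _DONE:
            break
        buffer += chunk
        if len(buffer) >= target_size:
            saved.append(buffer)  # stand-in for writing one larger file
            buffer = b""
    if buffer:
        saved.append(buffer)  # flush the final, possibly smaller, chunk


def rechunk(chunks, target_size, max_messages=4):
    """Run loader and saver concurrently, connected by a bounded queue."""
    box = queue.Queue(maxsize=max_messages)
    saved = []
    producer = threading.Thread(target=loader, args=(chunks, box))
    producer.start()
    rechunking_saver(box, target_size, saved)
    producer.join()
    return saved


if __name__ == "__main__":
    small = [b"x" * 100 for _ in range(10)]
    merged = rechunk(small, target_size=350)
    print([len(m) for m in merged])  # [400, 400, 200]
```

The bounded queue is the design choice that matters here: the loader can run ahead of the saver, but only by a fixed number of chunks, so neither side waits longer than necessary and memory stays limited.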

Can you give a minimal working example (or illustrate with a figure)?
On one of the event builder servers of the DAQ, the speedup with a modest number of threads is ~5-10x.

On my own machine (I/O limited) the speedup is not as large:

for PAPA in process True False; 
   do echo ==== $PAPA ====; 
   rechunker --source 050347-raw_records-rfzvpzj4mf/ --dest test-rr --profile_memory --parallel $PAPA --max_workers 6 --rechunk True --target_size_mb 500 --compressor blosc; 
   echo "n files:";
   ls test-rr/050347-raw_records-rfzvpzj4mf/ | wc -l; 
   echo ======= ; 
done
==== process ====
Will write to test-rr and make sub-folder 050347-raw_records-rfzvpzj4mf
  0%|                                                              | 0/1007 [00:00<?, ?it/s]Rechunking 050347-raw_records-rfzvpzj4mf/ to test-rr/050347-raw_records-rfzvpzj4mf
Removing data in test-rr/050347-raw_records-rfzvpzj4mf to overwrite
100%|███████████████████████████████████████| 1007/1007 [03:07<00:00,  5.38it/s, 139.7 MB/s]
Re-compressed 050347-raw_records-rfzvpzj4mf/
...
Memory profiler says peak RAM usage was: 1308.0 MB
Took 190.2 s = 0.05 h
Bye, bye
n files:
52
=======
==== True ====
Will write to test-rr and make sub-folder 050347-raw_records-rfzvpzj4mf
  0%|                                                              | 0/1007 [00:00<?, ?it/s]Rechunking 050347-raw_records-rfzvpzj4mf/ to test-rr/050347-raw_records-rfzvpzj4mf
Removing data in test-rr/050347-raw_records-rfzvpzj4mf to overwrite
100%|███████████████████████████████████████| 1007/1007 [02:28<00:00,  6.78it/s, 175.8 MB/s]
Re-compressed 050347-raw_records-rfzvpzj4mf/
...
Memory profiler says peak RAM usage was: 1993.3 MB
Took 150.6 s = 0.04 h
Bye, bye
n files:
52
=======
==== False ====
Will write to test-rr and make sub-folder 050347-raw_records-rfzvpzj4mf
  0%|                                                              | 0/1007 [00:00<?, ?it/s]Rechunking 050347-raw_records-rfzvpzj4mf/ to test-rr/050347-raw_records-rfzvpzj4mf
Removing data in test-rr/050347-raw_records-rfzvpzj4mf to overwrite
100%|███████████████████████████████████████| 1007/1007 [03:45<00:00,  4.47it/s, 115.9 MB/s]
Re-compressed 050347-raw_records-rfzvpzj4mf/
..
Memory profiler says peak RAM usage was: 1341.5 MB
Took 226.3 s = 0.06 h
Bye, bye
n files:
52
=======
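
For reference, the speedup on this machine follows directly from the "Took ... s" lines above. A small sketch of the arithmetic, using only the wall times printed in the three runs (the dictionary keys are the `--parallel` values passed to the rechunker):

```python
# Wall times in seconds, copied from the three runs above.
wall_time_s = {"process": 190.2, "True": 150.6, "False": 226.3}

# Speedup relative to the serial run (--parallel False).
baseline = wall_time_s["False"]
speedup = {mode: baseline / t for mode, t in wall_time_s.items()}

for mode, factor in speedup.items():
    print(f"--parallel {mode}: {factor:.2f}x")
# --parallel process: 1.19x
# --parallel True: 1.50x
# --parallel False: 1.00x
```

So on this I/O-limited machine the gain is roughly 1.2x to 1.5x, well below the ~5-10x seen on the DAQ event builder.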

@coveralls

coveralls commented Feb 8, 2023

Coverage Status

Coverage: 92.164% (+0.4%) from 91.718% when pulling 06a3426 on mailbox_rechunker into 28553a7 on master.

@JoranAngevaare JoranAngevaare mentioned this pull request Feb 8, 2023
6 tasks
Contributor

@jmosbacher jmosbacher left a comment

Sorry, I don't have time to really test this, but I read all the changes and it's well written and organized. Given that this is a relatively isolated tool, I think it's safe to merge.

@JoranAngevaare
Contributor Author

Thanks Yossi, I agree. Apart from the unit tests, I tested it extensively on the DAQ.

The main use case, XENONnT/straxen#1074, comes with extra safeguards that should catch any failures before any data is integrated into the pipeline. So far, with the current data scheme (run 050411), everything is running very stably.

@JoranAngevaare JoranAngevaare merged commit 67f4df9 into master Feb 13, 2023
@JoranAngevaare JoranAngevaare deleted the mailbox_rechunker branch February 13, 2023 13:52
3 participants