Alter the default behavior for organizing outputs #367

mabruzzo · 2024-01-26T18:26:43Z

Overview

This PR changes the way that Cholla organizes output files.

Previously all outputs were written into a single output directory.

This is fine when you have a small number of outputs.
However, when you have a lot of outputs, the directory listing can be hard to read (and it is hard to identify the name of the most recent output).
Plus, it's a little tedious to selectively delete files.

This PR changes the default output behavior. Now we group all outputs written at a single simulation cycle into a single directory. To use the older behavior you can set the new legacy_flat_outdir runtime parameter to 1 (in the future, I think it would be good to remove the older behavior completely).

More details

Output file paths traditionally followed this template:

    "{outdir}{nfile}{pre_extension_suffix}{extension}[.{proc_id}]"

where each curly-braced token represents a different variable. In detail:
- {outdir} is the parameter from the parameter file. In the historical behavior (which we currently maintain), if this is non-empty, then all characters following the last '/' are treated as a prefix to the output file name (if there aren't any '/' characters, then the whole string is effectively a prefix).
- {nfile} is the current file-output count.
- {pre_extension_suffix} is the suffix that immediately precedes the file extension (i.e. {extension}).
- {extension} is the filename extension. Examples include ".h5" or ".bin" or ".txt".
- {proc_id} represents the process-id that held the data that will be written to this file. In non-MPI runs, this will be omitted.
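
To make the legacy template concrete, here is a minimal Python sketch of how such a path could be assembled. It is not Cholla's actual (C++) implementation: the function name `legacy_output_path`, the example suffix "_particles", and the assumption that {nfile} is written as a plain integer are all illustrative.

```python
def legacy_output_path(outdir, nfile, pre_extension_suffix, extension, proc_id=None):
    """Format a path following the legacy (flat) template described above.

    Hypothetical helper for illustration only.
    """
    # {outdir} is concatenated verbatim, so any characters after its last '/'
    # end up acting as a prefix on the file name.
    path = f"{outdir}{nfile}{pre_extension_suffix}{extension}"
    # The process id is only appended in MPI runs (proc_id=None models a
    # non-MPI run).
    if proc_id is not None:
        path += f".{proc_id}"
    return path


# "run_" follows the last '/' in outdir, so it becomes a filename prefix.
print(legacy_output_path("out/run_", 5, "_particles", ".h5", proc_id=0))
# prints: out/run_5_particles.h5.0
```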

The new default behavior is to create paths that follow this template:

    "{outdir}/{nfile}/{nfile}{pre_extension_suffix}{extension}[.{proc_id}]"

where the significance of each curly-braced token is largely unchanged. There are 2 things worth noting:
- all files written at a single simulation-cycle are now grouped in a single directory
- {outdir} never specifies a file prefix. When {outdir} is empty, it is treated as "./". Otherwise, we effectively append '/' to the end of {outdir}
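
The same kind of sketch for the new layout is shown here. Again, this is an illustration rather than the real implementation: `per_cycle_output_path` is a hypothetical name, and the choice not to double a trailing '/' that is already present in {outdir} is an assumption about what "effectively append" means.

```python
def per_cycle_output_path(outdir, nfile, pre_extension_suffix, extension, proc_id=None):
    """Format a path following the new per-cycle template described above.

    Hypothetical helper for illustration only.
    """
    if outdir == "":
        # An empty {outdir} is treated as the current directory.
        base = "./"
    elif outdir.endswith("/"):
        # Assumption: an existing trailing '/' is not doubled.
        base = outdir
    else:
        # {outdir} is always a directory now, never a filename prefix.
        base = outdir + "/"
    # Every file written at cycle {nfile} lives in its own {nfile}/ directory.
    path = f"{base}{nfile}/{nfile}{pre_extension_suffix}{extension}"
    if proc_id is not None:
        path += f".{proc_id}"
    return path


print(per_cycle_output_path("", 5, "", ".h5"))
# prints: ./5/5.h5
print(per_cycle_output_path("out", 12, "_slice", ".h5", proc_id=3))
# prints: out/12/12_slice.h5.3
```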

Implementation Details

While doing all of this, I took the opportunity to consolidate the logic for formatting output filenames. Previously the logic was scattered throughout the codebase. Now, the logic is encapsulated by a class called FnameTemplate.

I acknowledge that it looks a little funny that we construct an FnameTemplate instance everywhere we want to determine an output filename (rather than just defining a function). I mostly did this in anticipation of some future refactoring of the io section of the codebase that I plan to do (basically it involves storing the output-related parameters outside of the global Parameter struct). In the short term, I think it would actually be nice to pass around a single FnameTemplate instance to all of the functions that use it, but I'm trying not to change function interfaces too much in the PR (I need to backport this change to the particle-feedback branch I have, which branched near the start of last summer). If you strongly dislike this design choice, I can change it.
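
For readers who want a feel for the design, here is a rough Python analogue of the idea: a small immutable object that captures the output-related parameters once and can then be passed around to format any output path. The real FnameTemplate is a C++ class whose interface is not shown in this description, so the class name `FnameTemplateSketch`, its fields, and its `format` method are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FnameTemplateSketch:
    """Hypothetical stand-in for FnameTemplate (illustration only)."""

    outdir: str
    legacy_flat_outdir: bool = False  # mirrors the new runtime parameter

    def format(self, nfile: int, pre_extension_suffix: str, extension: str,
               proc_id: Optional[int] = None) -> str:
        if self.legacy_flat_outdir:
            # Legacy layout: outdir is concatenated verbatim (may act as a prefix).
            stem = f"{self.outdir}{nfile}"
        else:
            # New layout: outdir is a directory and each cycle gets a subdirectory.
            if self.outdir == "":
                base = "./"
            elif self.outdir.endswith("/"):
                base = self.outdir
            else:
                base = self.outdir + "/"
            stem = f"{base}{nfile}/{nfile}"
        # The process-id suffix is omitted in non-MPI runs (proc_id=None).
        proc_suffix = "" if proc_id is None else f".{proc_id}"
        return f"{stem}{pre_extension_suffix}{extension}{proc_suffix}"


# One instance can be shared by every function that writes output.
template = FnameTemplateSketch(outdir="out")
print(template.format(3, "", ".h5", proc_id=0))
# prints: out/3/3.h5.0
```

Because the instance is immutable, passing it to several writer functions cannot change the layout midway through a run, which is one argument for an object over a free function.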

Other Thoughts:

Currently the output filenames differ between a simulation compiled with MPI and run with a single process and a simulation compiled without MPI. All of the filenames written by the MPI version end in .h5.0, while the non-MPI version writes filenames ending in .h5. Going forward, I actually think it would be better to terminate the filenames with .h5.0 in the non-MPI case (for the sake of consistency). But I think that's a topic for another day.


brantr · 2024-01-26T18:32:15Z

This PR affects all downstream analysis. I agree it's "better", but it will have substantial downstream impact (either analysis scripts will be changed to reflect the output structure, or scripts will need to be used to move files around).

I think the change to serial file naming is terrific, if that matters.

evaneschneider · 2024-01-26T18:39:47Z

> This PR affects all downstream analysis. I agree it's "better", but it will have substantial downstream impact (either analysis scripts will be changed to reflect the output structure, or scripts will need to be used to move files around).
>
> I think the change to serial file naming is terrific, if that matters.

Yes for sure! This was on the CAAR to-do list for a long time and we never got around to it. But since @mabruzzo added a flag to maintain the legacy behavior for now, I think the impact should be minimal for current users. This is a good reminder though that we should double-check the concatenation and plotting examples in the python scripts directory to make them consistent.

bcaddy · 2024-01-26T18:43:04Z

I think that this will break the concatenation scripts. I'm happy to help/advise on how to update them for this but I think it should be part of the same PR or at least those changes should be in flight before this is merged.

mabruzzo · 2024-01-26T18:45:37Z

If it would be less controversial, we could make the legacy behavior the default choice. With that said, it would probably hinder efforts to gradually phase out the old system.

It's unclear why the continuous integration failed. There don't seem to be any useful error messages when I click "Details".

bcaddy · 2024-01-26T18:50:01Z

If the CI stuff fails randomly, try just rerunning it. Sometimes errors like that are just the file system or Jenkins having issues.

evaneschneider · 2024-01-26T18:56:12Z

> I think that this will break the concatenation scripts. I'm happy to help/advise on how to update them for this but I think it should be part of the same PR or at least those changes should be in flight before this is merged.

It should just be a matter of changing a single line of code in each of them, right? If so, I'm fine with just doing that as part of this PR. Although it's actually not clear to me from a quick glance that the concatenation scripts themselves would break, since they take the source directory as an input. It seems like it's the dask templates that would need to be edited.

mabruzzo · 2024-01-26T18:58:52Z

> > I think that this will break the concatenation scripts. I'm happy to help/advise on how to update them for this but I think it should be part of the same PR or at least those changes should be in flight before this is merged.
>
> It should just be a matter of changing a single line of code in each of them, right? If so, I'm fine with just doing that as part of this PR. Although it's actually not clear to me from a quick glance that the concatenation scripts themselves would break, since they take the source directory as an input. It seems like it's the dask templates that would need to be edited.

Things will break since they try to iterate over all of the outputs. But fixing that is easy enough. I'll make a point of doing that.
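
A rough sketch of what such a fix could look like in Python is given here. It is not the code that ended up in the concatenation scripts: the helper names `detect_layout` and `snapshot_files` are hypothetical, and the sketch assumes that per-cycle directories are named with plain integers and that {outdir} carries no filename prefix.

```python
import re
from pathlib import Path


def detect_layout(outdir):
    """Return "per_cycle" if outdir holds integer-named subdirectories, else "flat"."""
    for entry in Path(outdir).iterdir():
        if entry.is_dir() and entry.name.isdigit():
            return "per_cycle"
    return "flat"


def snapshot_files(outdir, nfile):
    """List every file that belongs to output number nfile, in either layout."""
    root = Path(outdir)
    # In the new layout the files for cycle nfile live in outdir/nfile/.
    search_dir = root / str(nfile) if detect_layout(root) == "per_cycle" else root
    # Match names that start with nfile followed by a non-digit, so that
    # output 5 does not accidentally pick up files from output 50.
    pattern = re.compile(rf"^{nfile}(?!\d)")
    return sorted(
        p for p in search_dir.iterdir() if p.is_file() and pattern.match(p.name)
    )
```

A script that previously globbed the source directory directly could call `snapshot_files(source_dir, n)` instead and work unchanged with both layouts.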

mabruzzo · 2024-01-26T18:59:37Z

> If the CI stuff fails randomly, try just rerunning it. Sometimes errors like that are just the file system or Jenkins having issues.

Is there any way to do that without pushing another commit?

evaneschneider · 2024-01-26T19:01:01Z

> > > I think that this will break the concatenation scripts. I'm happy to help/advise on how to update them for this but I think it should be part of the same PR or at least those changes should be in flight before this is merged.
> >
> > It should just be a matter of changing a single line of code in each of them, right? If so, I'm fine with just doing that as part of this PR. Although it's actually not clear to me from a quick glance that the concatenation scripts themselves would break, since they take the source directory as an input. It seems like it's the dask templates that would need to be edited.
>
> Things will break since they try to iterate over all of the outputs. But fixing that is easy enough. I'll make a point of doing that.

Ah yes good point, I'm not as familiar with the newer behavior. Thanks!

bcaddy

Code looks good to me. Excellent use of std::filesystem.


mabruzzo added 2 commits January 26, 2024 12:34

we now write all files from different simulation cycles to separate directories.

e37035e

Add a runtime parameter, legacy_flat_outdir, that lets users opt into the older behavior where all outputs are written to a single file.

8066f3a

bcaddy self-requested a review February 1, 2024 18:17

bcaddy approved these changes Feb 1, 2024


mabruzzo added 6 commits February 1, 2024 14:25

updated the python concatenation scripts to automatically detect whether cholla files were written in a flat directory-structure or if files from different simulation cycles were written to separate directories

4f695aa

minor bugfix

d47d8dd

Modified the filenames that the system-tests search for.

472c001

fixing another test.

ca6019e

Fixing a small oversight with determining file names in system_tester

5fe2c5c

Merge branch 'dev' into individual_cycle_outdir

e951118

evaneschneider merged commit 3db8f10 into cholla-hydro:dev Feb 2, 2024
10 checks passed
