Investigate fire-and-forget mode for async flush #531

Open
adammoody opened this issue Feb 14, 2023 · 0 comments


adammoody commented Feb 14, 2023

After an async flush has started, the application must make another SCR call to finalize that flush. Even after the async flush has copied all files, the output set is not valid until the flush has been finalized. The calls that finalize async flushes are SCR_Start_output, SCR_Complete_output, and SCR_Finalize.

If an application using async flush writes checkpoints infrequently, a failure is fairly likely to occur after all files have been copied but before the flush has been finalized. In that case, SCR will roll back to an earlier checkpoint when restarting the application. This is a shame, since the hard work of copying all of the files is already done.

It would be nice to extend SCR_Init so that SCR can detect an async flush that has finished copying but has not yet been marked as complete. To do this, we could check the size of each file (if we trust POSIX semantics), or we could have each rank write an additional "done" flag to the file system. On restart, SCR_Init could look for these markers and update the status of the checkpoint if it finds that all files had been successfully copied. Something similar could be added to scavenge.
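As a rough illustration of the two detection options, here is a minimal serial sketch in plain POSIX C. It is not SCR code: the `flush_file` struct, the function names, and the `flush_rank_<N>.done` marker naming are all hypothetical, and a real implementation would run the checks per rank and combine the results across the job.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical record of one file the async flush was asked to copy. */
typedef struct {
    const char *path;    /* destination path on the parallel file system */
    off_t expected_size; /* size of the source file when the flush started */
} flush_file;

/* Option 1: trust POSIX file sizes. Every destination file must exist,
 * be a regular file, and have exactly the expected size. */
bool flush_sizes_match(const flush_file *files, size_t nfiles)
{
    assert(files != NULL || nfiles == 0);
    for (size_t i = 0; i < nfiles; i++) {
        struct stat st;
        if (stat(files[i].path, &st) != 0) {
            return false; /* missing or inaccessible */
        }
        if (!S_ISREG(st.st_mode) || st.st_size != files[i].expected_size) {
            return false; /* wrong type or partially copied */
        }
    }
    return true;
}

/* Build the (hypothetical) marker path for a rank; false if it does not fit. */
static bool marker_path(char *buf, size_t len, const char *dir, int rank)
{
    int n = snprintf(buf, len, "%s/flush_rank_%d.done", dir, rank);
    return n > 0 && (size_t)n < len;
}

/* Option 2, write side: a rank creates an empty marker file after it has
 * copied its last file. */
bool write_done_marker(const char *dir, int rank)
{
    char path[4096];
    if (!marker_path(path, sizeof(path), dir, rank)) {
        return false;
    }
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        return false;
    }
    return fclose(fp) == 0;
}

/* Option 2, restart side: the flush counts as done only if a marker exists
 * for every rank in [0, nranks). */
bool all_markers_present(const char *dir, int nranks)
{
    for (int rank = 0; rank < nranks; rank++) {
        char path[4096];
        struct stat st;
        if (!marker_path(path, sizeof(path), dir, rank)) {
            return false;
        }
        if (stat(path, &st) != 0) {
            return false; /* this rank never reported completion */
        }
    }
    return true;
}
```

The marker approach does not depend on file sizes being reported accurately, but it adds one small file per rank. The sketch also leaves out syncing the copied data to storage before the marker is written, which a real implementation would likely need.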

In the meantime, it could be useful to add checks to calls like SCR_Need_checkpoint and SCR_Should_exit, which an application may call more frequently. In that case, one might need to configure how often SCR checks, since polling for completion may be expensive on some systems. For example, if time steps are short compared to the polling cost, we would not want to poll after every time step.
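One way to bound the polling cost is a simple time-based throttle, sketched below. The `poll_throttle` type and its functions are hypothetical and not part of SCR. The caller passes in the current time, which in an MPI application might come from `MPI_Wtime()`, and only performs the actual completion check when the throttle reports that the interval has elapsed.

```c
#include <stdbool.h>

/* Hypothetical throttle that limits how often an expensive check runs. */
typedef struct {
    double interval_secs; /* minimum time between polls */
    double last_poll;     /* time of the most recent poll */
    bool has_polled;      /* false until the first poll happens */
} poll_throttle;

void poll_throttle_init(poll_throttle *t, double interval_secs)
{
    t->interval_secs = interval_secs;
    t->last_poll = 0.0;
    t->has_polled = false;
}

/* Returns true if the caller should poll now, and records the poll time.
 * The first call always returns true. `now` is in seconds on any
 * monotonically increasing clock. */
bool poll_throttle_due(poll_throttle *t, double now)
{
    if (!t->has_polled || now - t->last_poll >= t->interval_secs) {
        t->last_poll = now;
        t->has_polled = true;
        return true;
    }
    return false;
}
```

With a 10-second interval, an application that calls SCR_Need_checkpoint after every sub-second time step would trigger at most one completion check every 10 seconds. Whether the interval should be a configuration parameter, and what its default should be, is left open here.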
