
Lock in DistributedMemoryRunner #1879

Conversation


@joshua-cogliati-inl joshua-cogliati-inl commented Jul 7, 2022


Pull Request Description

What issue does this change request address?

Closes #1881

What are the significant changes in functionality due to this change request?

Adds a lock in DistributedMemoryRunner to prevent two threads from stepping on each other.
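The pattern behind the fix can be sketched as follows. This is a hypothetical stand-in, not the actual RAVEN code: the class and method names mirror the JobHandler methods discussed in this PR (fillJobQueue, terminateJobs, cleanJobQueue), but the bodies are illustrative only. The idea is that a single threading.Lock guards every method that touches the running queue, so a terminate cannot interleave with a fill or a clean.

```python
import threading

class RunnerQueueSketch:
    """Hypothetical stand-in for the queue handling this PR locks down;
    names echo the RAVEN methods but the implementation is illustrative."""

    def __init__(self):
        # one lock shared by every queue-mutating method
        self._queueLock = threading.Lock()
        self.running = []

    def fillJobQueue(self, jobs):
        with self._queueLock:          # writers hold the lock...
            self.running.extend(jobs)

    def terminateJobs(self, ids):
        with self._queueLock:          # ...so a terminate cannot interleave
            self.running = [j for j in self.running if j not in ids]

    def cleanJobQueue(self):
        with self._queueLock:          # cleaners see a stable snapshot
            return list(self.running)
```

Using `with self._queueLock:` (rather than explicit acquire/release) guarantees the lock is released even if a method raises.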


For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.) - see the wiki for details.
  • 4. Automated Tests should pass, including run_tests, pylint, manual building and xsd tests. If there are changes to Simulation.py or JobHandler.py the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added setting, in XML block, the node <internalParallel> to True.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed or added, is the analytic documentation updated or added?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.

@joshua-cogliati-inl joshua-cogliati-inl changed the title Lock in dist mem runner Lock in DistributedMemoryRunner Jul 7, 2022
@joshua-cogliati-inl
Contributor Author

Things to do before merging:

  1. add a test that uses internal parallel and an optimizer that calls JobHandler.terminateJobs
  2. Decide whether this PR (Lock in DistributedMemoryRunner #1879) or Lock more in cleanJobQueue #1883 is the better fix.

@@ -977,6 +977,8 @@ def terminateJobs(self, ids):
@ In, ids, list(str), job prefixes to terminate
@ Out, None
"""
#XXX terminateJobs modifies the running queue, which
# cleanJobQueue, and fillJobQueue assume can't happen
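The hazard this code comment warns about has a simple single-threaded analog: mutating a Python list while scanning it silently skips elements. The snippet below is illustrative only (the names are made up for this sketch); it mimics a cleanJobQueue-style scan with a terminateJobs-style removal happening mid-scan.

```python
# Illustrative only: shows why concurrent mutation of the running
# queue breaks a scan that assumes the queue is stable.
queue = ["job1", "job2", "job3"]
removed = []
for job in queue:        # cleanJobQueue-style scan over the queue
    queue.remove(job)    # terminateJobs-style mutation during the scan
    removed.append(job)
# "job2" is silently skipped: removing "job1" shifts the list under
# the iterator, so the next index lands on "job3".
```

With a shared lock, the mutation cannot happen while the scan holds the queue, so every job is seen exactly once.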
Collaborator

does this comment need to be rephrased due to the fixes in this PR?

Contributor Author

Possibly it needs to be changed to a warning.

Collaborator

@PaulTalbot-INL PaulTalbot-INL left a comment


Code changes are good, but one comment should be considered, and a test is needed to cover the changes in #1883 and this PR.

@PaulTalbot-INL
Collaborator

PaulTalbot-INL commented Jul 18, 2022

@joshua-cogliati-inl a basis for a RrR test might be raven/tests/framework/CodeInterfaces/RAVEN/basic.xml. This test samples various upper and lower bounds for the inner run, which samples the simple attenuation model using those bounds.

Rather than sampling them, you might optimize the lower bound to maximize or minimize the mean_ans metric that is returned by the inner. I would consider reducing the optimization threshold to a fairly high tolerance to avoid long run times. An example optimization workflow with these settings is in raven/tests/framework/Optimizers/GradientDescent/converge_gradient.xml.

Alternatively, the RrR test raven/tests/framework/CodeInterfaces/RAVEN/rom.xml samples the fitness of some ROM tuning parameters, which could be changed to an optimization as well, but might be less trivial to converge.

@joshua-cogliati-inl
Contributor Author

Possible test (currently running on regression machines) 9c159b8
@PaulTalbot-INL Does this look reasonable as a test for this change? (I verified that terminateJobs is called by temporarily throwing an exception in it.)

@PaulTalbot-INL
Collaborator

Possible test (currently running on regression machines) 9c159b8

That looks good to me! Note the <TestInfo> of the new outer optimization file probably needs updating.

@joshua-cogliati-inl
Contributor Author

Possible test (currently running on regression machines) 9c159b8

That looks good to me! Note the <TestInfo> of the new outer optimization file probably needs updating.

Thanks, I will update TestInfo in #1899

@joshua-cogliati-inl
Contributor Author

New merge request #1899 was created and used instead.

@joshua-cogliati-inl joshua-cogliati-inl deleted the lock_in_dist_mem_runner branch July 20, 2022 21:29
Successfully merging this pull request may close these issues.

[DEFECT] too many threads modifying variable in DistributedMemoryRunner