
Lock in DistributedMemoryRunner #1879

Conversation


@joshua-cogliati-inl joshua-cogliati-inl commented Jul 7, 2022


Pull Request Description

What issue does this change request address?

Closes #1881

What are the significant changes in functionality due to this change request?

Adds a lock in DistributedMemoryRunner to prevent two threads from stepping on each other.
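The pattern behind the fix can be sketched as follows. This is a hypothetical stand-in, not the actual RAVEN code: the class and method names mirror the JobHandler methods discussed in this PR (fillJobQueue, terminateJobs, cleanJobQueue), but the bodies are illustrative only. The idea is that a single threading.Lock guards every method that touches the running queue, so a terminate cannot interleave with a fill or a clean.

```python
import threading

class RunnerQueueSketch:
    """Hypothetical stand-in for the queue handling this PR locks down;
    names echo the RAVEN methods but the implementation is illustrative."""

    def __init__(self):
        # one lock shared by every queue-mutating method
        self._queueLock = threading.Lock()
        self.running = []

    def fillJobQueue(self, jobs):
        with self._queueLock:          # writers hold the lock...
            self.running.extend(jobs)

    def terminateJobs(self, ids):
        with self._queueLock:          # ...so a terminate cannot interleave
            self.running = [j for j in self.running if j not in ids]

    def cleanJobQueue(self):
        with self._queueLock:          # cleaners see a stable snapshot
            return list(self.running)
```

Using `with self._queueLock:` (rather than explicit acquire/release) guarantees the lock is released even if a method raises.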


For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.) - see the wiki for details.
  • 4. Automated Tests should pass, including run_tests, pylint, manual building and xsd tests. If there are changes to Simulation.py or JobHandler.py the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added setting, in XML block, the node <internalParallel> to True.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed or added, is the analytic documentation updated or added?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.

@joshua-cogliati-inl joshua-cogliati-inl changed the title Lock in dist mem runner Lock in DistributedMemoryRunner Jul 7, 2022
@joshua-cogliati-inl
Contributor Author

Things to do before merging:

  1. add a test that uses internal parallel and an optimizer that calls JobHandler.terminateJobs
  2. Decide whether this PR (Lock in DistributedMemoryRunner #1879) or Lock more in cleanJobQueue #1883 is the better fix.

@@ -977,6 +977,8 @@ def terminateJobs(self, ids):
@ In, ids, list(str), job prefixes to terminate
@ Out, None
"""
#XXX terminateJobs modifies the running queue, which
# cleanJobQueue, and fillJobQueue assume can't happen
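The hazard this code comment warns about has a simple single-threaded analog: mutating a Python list while scanning it silently skips elements. The snippet below is illustrative only (the names are made up for this sketch); it mimics a cleanJobQueue-style scan with a terminateJobs-style removal happening mid-scan.

```python
# Illustrative only: shows why concurrent mutation of the running
# queue breaks a scan that assumes the queue is stable.
queue = ["job1", "job2", "job3"]
removed = []
for job in queue:        # cleanJobQueue-style scan over the queue
    queue.remove(job)    # terminateJobs-style mutation during the scan
    removed.append(job)
# "job2" is silently skipped: removing "job1" shifts the list under
# the iterator, so the next index lands on "job3".
```

With a shared lock, the mutation cannot happen while the scan holds the queue, so every job is seen exactly once.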
Collaborator

does this comment need to be rephrased due to the fixes in this PR?

Contributor Author

Possibly it needs to be changed to a warning.

Collaborator

@PaulTalbot-INL PaulTalbot-INL left a comment


Code changes are good, but one comment should be considered, and a test is needed to cover the changes in #1883 and this PR.

@PaulTalbot-INL
Collaborator

PaulTalbot-INL commented Jul 18, 2022

@joshua-cogliati-inl a basis for a RrR test might be raven/tests/framework/CodeInterfaces/RAVEN/basic.xml. This test samples various upper and lower bounds for the inner run, which samples the simple attenuation model using those bounds.

Rather than sampling them, you might optimize the lower bound to maximize or minimize the mean_ans metric that is returned by the inner. I would consider reducing the optimization threshold to a fairly high tolerance to avoid long run times. An example optimization workflow with these settings is in raven/tests/framework/Optimizers/GradientDescent/converge_gradient.xml.

Alternatively, the RrR test raven/tests/framework/CodeInterfaces/RAVEN/rom.xml samples the fitness of some ROM tuning parameters, which could be changed to an optimization as well, but might be less trivial to converge.

@joshua-cogliati-inl
Contributor Author

Possible test (currently running on regression machines) 9c159b8
@PaulTalbot-INL Does this look reasonable as a test for this change? (I verified that terminateJobs is called by temporarily throwing an exception in it.)

@PaulTalbot-INL
Collaborator

Possible test (currently running on regression machines) 9c159b8

That looks good to me! Note the <TestInfo> of the new outer optimization file probably needs updating.

@joshua-cogliati-inl
Contributor Author

Possible test (currently running on regression machines) 9c159b8

That looks good to me! Note the <TestInfo> of the new outer optimization file probably needs updating.

Thanks, I will update TestInfo in #1899

@joshua-cogliati-inl
Contributor Author

New merge request #1899 was created and used instead.

@joshua-cogliati-inl joshua-cogliati-inl deleted the lock_in_dist_mem_runner branch July 20, 2022 21:29
Successfully merging this pull request may close these issues.

[DEFECT] too many threads modifying variable in DistributedMemoryRunner