Fix various cluster issues #1807

Merged
merged 8 commits into idaholab:devel from parallel_error
Apr 14, 2022

Contributor

@joshua-cogliati-inl joshua-cogliati-inl commented Apr 11, 2022


Pull Request Description

What issue does this change request address?

Closes #1809

What are the significant changes in functionality due to this change request?

Checks that establish_conda_env.sh exists before sourcing it.
Runs the same version of Python in the RavenRunsRaven Code test (and adds a feature that makes this possible).
Exports RAVEN_FRAMEWORK_DIR for MPILegacySimulationMode.
Switches from ray.wait to ray.get, which works better empirically.
Adds some extra debugging information.
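
As a rough illustration of the first item, the guard could look like the sketch below. The helper name `guardedSourceCommand` and the example path are hypothetical and chosen only for illustration; the actual check in this PR may be written directly in shell.

```python
import shlex

def guardedSourceCommand(scriptPath):
  """
    Builds a shell snippet that sources scriptPath only if the file exists.
    (Hypothetical helper, not part of RAVEN.)
    @ In, scriptPath, str, path to the script to source
    @ Out, command, str, shell command guarded by an existence test
  """
  # Quote the path so that spaces or shell metacharacters are passed literally
  quoted = shlex.quote(scriptPath)
  # "[ -f path ]" is true only for an existing regular file;
  # "." is the POSIX spelling of "source"
  return f"if [ -f {quoted} ]; then . {quoted}; fi"

# Example path is an assumption for illustration only
print(guardedSourceCommand("scripts/establish_conda_env.sh"))
# prints: if [ -f scripts/establish_conda_env.sh ]; then . scripts/establish_conda_env.sh; fi
```

With a guard of this kind, a missing script is skipped silently instead of making the shell report an error when it tries to source a nonexistent file.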


For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.); see the wiki for details.
  • 4. Automated Tests should pass, including run_tests, pylint, manual building and xsd tests. If there are changes to Simulation.py or JobHandler.py the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added setting, in XML block, the node <internalParallel> to True.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed/added, is the analytic documentation updated/added?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) has been changed, the associated documentation must be reviewed to ensure the text matches the example.

@moosebuild

Job Mingw Test on cb12a1d : invalidated by @joshua-cogliati-inl

failed in fetch

@moosebuild

Job Mingw Test on e76a2bc : invalidated by @joshua-cogliati-inl

failed in fetch

@joshua-cogliati-inl joshua-cogliati-inl changed the title from "Adding check for existence of establish_conda_env.sh before sourcing." to "Fix various cluster issues" Apr 12, 2022
PaulTalbot-INL previously approved these changes Apr 12, 2022
Collaborator

@PaulTalbot-INL PaulTalbot-INL left a comment

Code changes look good. Note that adding profiling to RrR jobs may slow them down a bit, but the feedback is probably worth it.

@PaulTalbot-INL
Collaborator

PaulTalbot-INL commented Apr 13, 2022

After many re-tries, it appears there's an issue on the Windows testing machine, specifically with removing an "out" file from the rom trainer test: raven/tests/framework/CodeInterfaceTests/RAVEN/ROM/FirstMRun/4/out~test_rom_trainer

@PaulTalbot-INL
Collaborator

If tests pass without further modifications, this is approved for merge.

Collaborator

@wangcj05 wangcj05 left a comment

@joshua-cogliati-inl Thanks for fixing the cluster tests. I have some minor comments for your consideration.

#self.thread in ray.wait([self.thread], timeout=waitTimeOut)[0]
#which ran slower in ray 1.9
else:
  self.thread.finished
Collaborator

Should this line be return self.thread.finished

Contributor Author

@joshua-cogliati-inl joshua-cogliati-inl Apr 13, 2022

Yes. (Wait, you mean this is Python, not LISP?)
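
The issue raised in this thread is easy to miss because a bare expression statement is valid Python: the value is computed and then discarded, so the method implicitly returns None. A minimal sketch of the difference, using hypothetical stand-in classes (`FakeThread`, `RunnerWithoutReturn`, `RunnerWithReturn`) and a hypothetical method name `isDone` rather than the actual RAVEN runner code:

```python
class FakeThread:
  """Hypothetical stand-in for the object stored in self.thread."""
  def __init__(self, finished):
    self.finished = finished

class RunnerWithoutReturn:
  """Mirrors the reviewed line: the attribute is read but never returned."""
  def __init__(self, thread):
    self.thread = thread

  def isDone(self):
    # Evaluated and discarded; the method falls off the end and returns None
    self.thread.finished

class RunnerWithReturn:
  """Mirrors the reviewer's suggestion: the attribute value is returned."""
  def __init__(self, thread):
    self.thread = thread

  def isDone(self):
    return self.thread.finished

finishedThread = FakeThread(finished=True)
print(RunnerWithoutReturn(finishedThread).isDone())  # prints: None
print(RunnerWithReturn(finishedThread).isDone())     # prints: True
```

Because None is falsy, a caller that polls the first version in a condition would treat a finished job as still running.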

@@ -1,5 +1,5 @@
 <?xml version="1.0" ?>
-<Simulation verbosity="debug">
+<Simulation verbosity="debug" profile="jobs">
Collaborator

I think TestInfo needs to be added, and the revisions node needs to be updated to reflect the python command change.

Collaborator

FYI, this isn't a test file, it's a file run by a test file. When we first added this test we had a discussion about it, and decided that the "inner" of the RrR tests should not be considered the test file; rather the "outer" should. I think the philosophical idea was that the "outer" is actually the test, while the "inner" doesn't get seen by the testing harness, and it would be confusing if this data was loaded as if it were a separate test into the regression test documentation.

Collaborator

sounds good to me.

@moosebuild

Job Mingw Test on 28c7722 : invalidated by @joshua-cogliati-inl

failed in fetch

@moosebuild

Job Mingw Test on e76a2bc : invalidated by @joshua-cogliati-inl

failed in fetch

wangcj05 previously approved these changes Apr 13, 2022
@moosebuild

Job Test qsubs sawtooth on e2dab00 : invalidated by @joshua-cogliati-inl

FAILED:
Failed tests/cluster_tests/test_mpi
Diff tests/cluster_tests/test_pbsdsh
Diff tests/cluster_tests/test_mpiqsub_parameters
Diff tests/cluster_tests/AdaptiveSobol/test_parallel_adaptive_sobol
Diff tests/cluster_tests/test_mpi_forced
Diff tests/cluster_tests/test_mpiqsub_nosplit
Diff tests/cluster_tests/RavenRunsRaven/ROM
Diff tests/cluster_tests/RavenRunsRaven/Code

@moosebuild

Job Test qsubs sawtooth on 6008ff6 : invalidated by @joshua-cogliati-inl

testing to see if running on /scratch

@@ -14,7 +14,7 @@
 """
 Created on Mar 5, 2013

-@author: alfoa, cogljj, crisr
+@author: alfoa, cogljj, crisr, talbpw, maljdp
Collaborator

don't you blame me for this 😁

@PaulTalbot-INL PaulTalbot-INL merged commit f831851 into idaholab:devel Apr 14, 2022
@joshua-cogliati-inl joshua-cogliati-inl deleted the parallel_error branch April 14, 2022 21:20
@wangcj05 wangcj05 added the RAVENv2.2 for RAVENv2.2 Release label Jun 16, 2022
Labels
RAVENv2.2 for RAVENv2.2 Release
Development

Successfully merging this pull request may close these issues.

[DEFECT] Cluster tests fail randomly
4 participants