Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix defect ensemble model (with Code) and genetic algorithm #2317

Merged
merged 19 commits into from
May 16, 2024

Conversation

alfoa
Copy link
Collaborator

@alfoa alfoa commented May 8, 2024


Pull Request Description

What issue does this change request address? (Use "#" before the issue to link it, i.e., #42.)

Closes #2304

What are the significant changes in functionality due to this change request?

The EnsembleModel (with code) uses a different parallelization strategy. The batching mode has been enabled in that strategy (parallelMode ==2)

I also created an issue #2318 to better re-design the batching system


For Change Control Board: Change Request Review

The following review must be completed by an authorized member of the Change Control Board.

  • 1. Review all computer code.
  • 2. If any changes occur to the input syntax, there must be an accompanying change to the user manual and xsd schema. If the input syntax change deprecates existing input files, a conversion script needs to be added (see Conversion Scripts).
  • 3. Make sure the Python code and commenting standards are respected (camelBack, etc.) - See on the wiki for details.
  • 4. Automated Tests should pass, including run_tests, pylint, manual building and xsd tests. If there are changes to Simulation.py or JobHandler.py the qsub tests must pass.
  • 5. If significant functionality is added, there must be tests added to check this. Tests should cover all possible options. Multiple short tests are preferred over one large test. If new development on the internal JobHandler parallel system is performed, a cluster test must be added setting, in XML block, the node <internalParallel> to True.
  • 6. If the change modifies or adds a requirement or a requirement based test case, the Change Control Board's Chair or designee also needs to approve the change. The requirements and the requirements test shall be in sync.
  • 7. The merge request must reference an issue. If the issue is closed, the issue close checklist shall be done.
  • 8. If an analytic test is changed/added is the the analytic documentation updated/added?
  • 9. If any test used as a basis for documentation examples (currently found in raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.

…trained/metaModelWithCodeAndFunctionsAndGenetic/decayConstantB.py
mandd
mandd previously approved these changes May 8, 2024
Copy link
Collaborator

@wangcj05 wangcj05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question, changes are good to me.

optimize
</Sequence>
<WorkingDir>metaModelWithCodeAndFunctionsAndGenetic</WorkingDir>
<batchSize>1</batchSize>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested it with different batch size?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ops. Not yet. let me test it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangcj05 I fixed the parallel execution and converted the test into a parallel (batchSize > 1) to test it.

Ready for re-review

@alfoa alfoa requested review from wangcj05 and mandd May 9, 2024 18:37
Copy link
Collaborator

@wangcj05 wangcj05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are good. @alfoa Could you document how you define job '' prefix'' and 'uniqueHandler' inside Ensemble? This will help for future rework on the job submission function.

type = 'RavenFramework'
input = 'continuous/unconstrained/test_ensembleModel_withGA_Code_and_Functions.xml'
[./csv]
type = UnorderedCsv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alfoa FYI, you need to change 'UnorderedCsv' to 'UnorderedCSV'. There is an inconsistent naming in the rook test system.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done and added an error msg in ROOK (to avoid a blind crash)

@moosebuild
Copy link

Job Test mac on 54f43a4 : invalidated by @wangcj05

RuntimeError: Found incorrect download: libaec. Aborting

@alfoa
Copy link
Collaborator Author

alfoa commented May 9, 2024

The changes are good. @alfoa Could you document how you define job '' prefix'' and 'uniqueHandler' inside Ensemble? This will help for future rework on the job submission function.

I can add the info in the Issue I created?

@alfoa alfoa closed this May 9, 2024
@moosebuild
Copy link

Job Test Ubuntu 20-2 Optional on a522493 : invalidated by @wangcj05

failed at fetch

@moosebuild
Copy link

Job Test qsubs sawtooth on a522493 : invalidated by @alfoa

Set python environment taking 6+ hrs?

@alfoa
Copy link
Collaborator Author

alfoa commented May 13, 2024

Job Test qsubs sawtooth on a522493 : invalidated by @alfoa

Set python environment taking 6+ hrs?

@wangcj05 There is a problem with HPC testing (set up enviroment) that is not related to this PR. (It does not seem to be a spurious problem since I invalidated the job multiple times and it gets stuck at the same step)

@alfoa
Copy link
Collaborator Author

alfoa commented May 13, 2024

@wangcj05 @joshua-cogliati-inl It seems that now in devel/main there is a cascade of failures (https://civet.inl.gov/branch/2903/) after merge of PR #2309 (I dnk if it is related...probably it is not)

rook/main.py Outdated Show resolved Hide resolved
@alfoa
Copy link
Collaborator Author

alfoa commented May 13, 2024

@wangcj05 now it fails because cvxpy is missing?

@@ -54,7 +54,7 @@
# some bad actors can't use the metadata correctly
# and so need special treatment
# -> see findLibAndVersion
metaExceptions = ['pyside2', 'AMSC', 'PIL']
metaExceptions = ['pyside2', 'AMSC', 'PIL', 'cvxpy']
Copy link
Collaborator Author

@alfoa alfoa May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'cvxpy' is added via FARM plugin. It cannot be checked with the metadata. I followed the approach in the library_handler but this should be revised since it is not easily maintainable (overall for plugins). @mandd @wangcj05 @joshua-cogliati-inl . I wonder why the regression system (for PR) did not catch this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is due to an update of the library in conda-forge 2 days ago (see https://anaconda.org/conda-forge/cvxpy-base/files).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'cvxpy' is added via FARM plugin. It cannot be checked with the metadata. I followed the approach in the library_handler but this should be revised since it is not easily maintainable (overall for plugins). @mandd @wangcj05 @joshua-cogliati-inl . I wonder why the regression system (for PR) did not catch this.

I changed the approach..if a library through a metadata is not found, we try the "slow approach".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip_check is another way to fix this. Example: <hdf5 skip_check='True'/> I have no objection to fixing it in metaExceptions however.

@moosebuild
Copy link

Job Mingw Test on 7ff7659 : invalidated by @alfoa

@alfoa alfoa mentioned this pull request May 14, 2024
9 tasks
@alfoa
Copy link
Collaborator Author

alfoa commented May 14, 2024

@wangcj05 the Mingw Test are failing and I don't see a correlation with this MR.

@moosebuild
Copy link

Job Mingw Test on 7ff7659 : invalidated by @alfoa

@joshua-cogliati-inl
Copy link
Contributor

@wangcj05 the Mingw Test are failing and I don't see a correlation with this MR.

How many tests are failing? (And if only a few, which ones?) Thanks.

@wangcj05
Copy link
Collaborator

@wangcj05 the Mingw Test are failing and I don't see a correlation with this MR.

How many tests are failing? (And if only a few, which ones?) Thanks.

FYI, there are around 33 tests failed in previous MingW test. Many for optimizations, such as simulated annealing. I have checked the new tests right now, it seems the runs are ok for now. Let's see if the test on Mingw can pass or not. @joshua-cogliati-inl

@alfoa
Copy link
Collaborator Author

alfoa commented May 14, 2024

@wangcj05 the Mingw Test are failing and I don't see a correlation with this MR.

How many tests are failing? (And if only a few, which ones?) Thanks.

FYI, there are around 33 tests failed in previous MingW test. Many for optimizations, such as simulated annealing. I have checked the new tests right now, it seems the runs are ok for now. Let's see if the test on Mingw can pass or not. @joshua-cogliati-inl

@wangcj05 @joshua-cogliati-inl unfortunately they are still failing...now in the TSA, PostProcessors, Datamining etc.

@wangcj05
Copy link
Collaborator

@joshua-cogliati-inl Let's try to test it on devel first. Can you pull the changes for cvxpy and test it on your trivial white space branch?

@joshua-cogliati-inl
Copy link
Contributor

@joshua-cogliati-inl Let's try to test it on devel first. Can you pull the changes for cvxpy and test it on your trivial white space branch?

Hm, which set of changes for cvxpy? d3c7069 ?

@alfoa
Copy link
Collaborator Author

alfoa commented May 14, 2024

@wangcj05 the Mingw Test are failing and I don't see a correlation with this MR.

How many tests are failing? (And if only a few, which ones?) Thanks.

FAILED:

Timeout tests\framework\PostProcessors\SubdomainBasicStatistics\subdomainSensitivity

Timeout tests\framework\PostProcessors\SubdomainBasicStatistics\subdomainTimeDepStats

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\KMeans

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\MiniBatchKMeans

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\DBSCAN

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\MeanShift

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\AffinityPropogation

Failed tests\framework\PostProcessors\TemporalDataMiningPostProcessor\Clustering\SpectralClustering

Failed tests\framework\PostProcessors\TSACharacterizer\Basic

Failed tests\framework\PostProcessors\TopologicalPostProcessor\knn

Timeout tests\framework\PostProcessors\TopologicalPostProcessor\relaxed_beta_skeleton

Failed tests\framework\PostProcessors\TopologicalPostProcessor\beta_skeleton

Failed tests\framework\PostProcessors\TSACharacterizer\RWD

Failed tests\framework\PostProcessors\Validation\test_validation_gate_probabilistic_time_dep

Failed tests\framework\PostProcessors\Validation\test_validation_gate_probabilistic

Timeout tests\framework\PostProcessors\Validation\test_validation_gate_pcm

Diff tests\framework\ROM\FeatureSelection\DMDcRFEScoringSubgroup

Failed tests\framework\ROM\FeatureSelection\DMDcRFEScoringOnlyOutput

Failed tests\framework\ROM\FeatureSelection\DMDcRFEApplyClustering

Failed tests\framework\ROM\FeatureSelection\DMDcRFESubgroupCrossCorrelation

Timeout tests\framework\ROM\FeatureSelection\StaticSklearnVarianceThreshold

Failed tests\framework\ROM\FeatureSelection\DMDcVarianceThreshold

Failed tests\framework\ROM\MSR\cosine

Timeout tests\framework\ROM\MSR\biweight

Failed tests\framework\ROM\MSR\Epanechnikov

Failed tests\framework\ROM\MSR\exponential

Timeout tests\framework\ROM\MSR\logistic

Timeout tests\framework\ROM\MSR\Gaussian

Timeout tests\framework\ROM\MSR\Silverman

Failed tests\framework\ROM\MSR\triangular

Failed tests\framework\ROM\MSR\tricube

Failed tests\framework\ROM\MSR\triweight

Timeout tests\framework\ROM\MSR\uniform

@alfoa
Copy link
Collaborator Author

alfoa commented May 14, 2024

@joshua-cogliati-inl Let's try to test it on devel first. Can you pull the changes for cvxpy and test it on your trivial white space branch?

Hm, which set of changes for cvxpy? d3c7069 ?

7ff7659

@joshua-cogliati-inl
Copy link
Contributor

@joshua-cogliati-inl Let's try to test it on devel first. Can you pull the changes for cvxpy and test it on your trivial white space branch?

Hm, which set of changes for cvxpy? d3c7069 ?

7ff7659

Thanks: #2113

@moosebuild
Copy link

Job Mingw Test on 7ff7659 : invalidated by @alfoa

Copy link
Collaborator

@wangcj05 wangcj05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes are good.

@wangcj05
Copy link
Collaborator

Changes are good, PR checklist is good. PR can be merged.

@wangcj05 wangcj05 merged commit 782ccc0 into devel May 16, 2024
12 checks passed
@wangcj05 wangcj05 deleted the alfoa/genetic_ensemble_model branch May 16, 2024 04:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DEFECT] Genetic Algorithm crashes with Ensemble model and <functions>
5 participants