Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247
Comments
FYI |
Looking at the behavior of this test, it looks like the problem is rounding issues. Looking at the pass/fail criteria in the file:
it shows:
Well, the problem is that with Intel it converges in 16 iterations instead of 18. Note that this is not the only Intel build where this test converges early and fails. See, for example, the build
Can we just change the convergence criterion to <= 18 iterations instead of exactly 18 iterations? Also, can we add a better error message that shows why the test is failing? The current output makes no sense. I can create a PR that does this. |
With the PR I am going to submit, the updated code uses the macros in Teuchos_TestingHelpers.hpp and produces the output:
Now you can clearly see that the checks:
failed and therefore that is why the test failed. I will submit a PR that adds this better output and then we can go from there on how to fix this test. |
This was needed for clear output for the test in NOX Thyra_Heq.C. I just added a simple usage of this macro. But we really need better unit tests for all of these macros.
Before, if any of the three criteria failed, it would just print "Test failed". But now it prints why it failed, with details. This test currently fails with Intel compilers (see trilinos#2247) but at least now it shows you why (which was not clear at all before). I also used the default FancyOStream to avoid logic about which process you are on when deciding whether to print. That is the best way to handle parallel output and gives better control over test output. SQUASH AGAINST 'Update to clealry show why the test passes for fails (trilinos#2247)'
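The idea of reporting each criterion with its values, rather than a bare "Test failed", can be sketched in standalone C++ as below. This is only an illustration: `checkEqual` is a hypothetical helper, the output format is made up, and a plain `std::ostream` stands in for the FancyOStream mentioned in the commit. It is not the actual macro from Teuchos_TestingHelpers.hpp or the code in Thyra_Heq.C.

```cpp
#include <cassert>   // only needed by callers that assert on the result
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for an equality-check macro: prints the names and
// values being compared plus the verdict, and clears `success` on failure.
template <typename T>
void checkEqual(const std::string& lhsName, const T& lhs,
                const std::string& rhsName, const T& rhs,
                std::ostream& out, bool& success)
{
  const bool passed = (lhs == rhs);
  out << lhsName << " = " << lhs << " == " << rhsName << " = " << rhs
      << " : " << (passed ? "passed" : "FAILED") << "\n";
  if (!passed) success = false;  // never reset to true, so failures accumulate
}
```

Called once per criterion with a shared `success` flag, a helper like this lets the final "Test passed" or "Test failed" line be preceded by a record of which comparison broke and what values were seen.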
…ces (trilinos#2247) This is needed because the Intel compiler 17.4 seems to produce different roundoff than GCC compilers. With Intel, the NOX solver converges in 16 iterations instead of 18 iterations. So this is good, right? But to make sure that some iterations are done, I changed the pass/fail to require 14 or more iterations to make sure that a solve is performed.
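A relaxed criterion of this kind can be sketched as a bounded check instead of an exact match. The helper name `itersInRange` is hypothetical, and the upper bound of 18 used in the note below is an assumption taken from the "<= 18" suggestion earlier in this thread. The actual change in Thyra_Heq.C may look different.

```cpp
#include <cassert>

// Hypothetical helper (not part of NOX or Teuchos): accept any iteration
// count inside [minIters, maxIters] instead of requiring one exact value.
bool itersInRange(int numIters, int minIters, int maxIters)
{
  assert(minIters <= maxIters);  // guard against swapped bounds
  return numIters >= minIters && numIters <= maxIters;
}
```

With bounds of 14 and 18, both the 18-iteration and the 16-iteration runs reported above would pass, while a run that stops after only a few iterations would still fail.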
FYI: I created PR #2310 to clearly show why this test passes or fails. It would be good to merge this to 'develop' and let it run tomorrow so that we can see what this test looks like when it passes and what it looks like when it fails on all of the machines. That might tell us how to go about fixing this so that it passes. |
@atoth1 - could you take a look at this? |
Yep, I'll see what I can figure out this evening. My initial thought is that it makes sense that roundoff differences could cause differences in iteration counts, considering that the condition number for the least-squares problem is getting up near 10e14. The fact that it's also getting slightly different answers (diff->norm() = 1.4614e-07 in Ross's post above) from the initial solve and the restarted solve is curious though. |
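As a small illustration of how roundoff can differ between builds: floating-point addition is not associative, so two compilers (or two optimization levels) that evaluate the same sum in a different order can get different results. The sketch below is generic C++ and is not taken from NOX. Whether evaluation order is the actual cause of the difference in this test is not established in this thread.

```cpp
#include <cassert>

// Evaluate the same three-term sum with two different groupings.
double sumLeftToRight(double a, double b, double c) { return (a + b) + c; }
double sumRightToLeft(double a, double b, double c) { return a + (b + c); }
```

With a = 1.0, b = 1e20 and c = -1e20, the first grouping loses the 1.0 when it is added to 1e20 and returns 0.0, while the second cancels the large terms first and returns 1.0. When a problem is badly conditioned, small differences like this are amplified, which is consistent with the iteration counts changing.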
Okay, with the PR #2310 merged, we are seeing the detailed pass/fail criteria this morning at: For example, you can see one passing test output at: which shows:
Could we try tightening the ||f(x)|| tolerance somewhat and see if that fixes the problem? It is currently at |
That might help, but I'm thinking the best thing to do would probably be to just knock the storage depth parameter down to 2. It's currently set to 10, and a depth of 10 vs 2 doesn't really test anything different. With storage depth 2 the condition number for the least-squares problem stays below 10^4, so roundoff has a negligible effect. This knocks the iteration count down to 11, and with both of those changes the test passes for me on shiller with both the gcc and intel compilers. |
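To put the condition numbers above in perspective, a common rule of thumb is that a solve in double precision loses roughly log10(kappa) of its approximately 16 decimal digits, where kappa is the condition number. The sketch below encodes that estimate. It is a heuristic under that assumption, not a rigorous bound, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Rule-of-thumb estimate (an assumption, not a rigorous bound): the number of
// decimal digits that remain trustworthy is about -log10(kappa * eps), where
// eps is the double-precision machine epsilon.
double estimatedAccurateDigits(double kappa)
{
  assert(kappa >= 1.0);  // condition numbers are at least 1
  const double eps = std::numeric_limits<double>::epsilon();  // about 2.2e-16
  return -std::log10(kappa * eps);
}
```

For kappa = 1e4 this gives between 11 and 12 digits, while for kappa = 1e12 it gives between 3 and 4, which is why the smaller storage depth leaves little room for compiler-dependent roundoff to change the result.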
@atoth1, sounds good to me. This does not sound like it would weaken the testing of this algorithm too much, and by dropping the condition number you get fewer portability problems with this test. @rppawlo, is the change @atoth1 is suggesting okay with you? |
Can we go ahead and pull the trigger on the change described above to fix this test? This is now the only broken test holding up the promotion for the ATDM builds. Otherwise, we just need to disable this test for these two Intel builds. |
@bartlettroscoe - yes this change is fine. |
Okay, I will create a PR for the change and test for these Intel builds on shiller. |
…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247).
I tried reducing the storage depth from 10 to 2 in the commit 3de4732 in my branch and re-ran the test with the
Is this correct? Should we set the numIterations criteria to 11? |
Yep, I mentioned the change in the iteration count. It was converging in 11 for me as well. |
…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247). Also had to change the number of expected iterations from 18 to 11. I tested this with Intel and GNU builds and they both passed:
Enabled Packages: NOX
Enabled all Forward Packages
1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min)
2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)
Change expected iters from 18 to 11 (trilinos#2247)
The PR is #2388. It is running in auto PR testing now. |
…e-condition-number Reduce storage depth from 10 to 2 to reduce condition number and make portable (#2247)
The auto testing for PR #2288 passed so I clicked the merge button. Hopefully we will see this test get fixed tomorrow. Putting in review. |
This test is passing in all of the "ATDM" builds today shown at: This includes the builds:
Closing as complete. Thanks for the help @atoth1! |
CC: @trilinos/nox
Description
The test NOX_Thyra_Heq_MPI_1 fails in the new ATDM build Trilinos-atdm-hansen-shiller-intel-opt-serial on hansen as shown yesterday at:
and for the build Trilinos-atdm-hansen-shiller-intel-opt-openmp yesterday at:
In both cases, the end of the test shows:
Steps to Reproduce
Anyone with access to the SNL test bed machines hansen (SON) or shiller (SRN) using the instructions linked to from the page:
should be able to reproduce.
The link from there to the README file:
should provide the info. But in short, once you clone Trilinos on hansen or shiller into your home directory (pointed to by env var TRILINOS_DIR), you should be able to reproduce with:
I ran the above on hansen just now for the Trilinos version:
and it produced the same failure.