Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

Closed
bartlettroscoe opened this issue Feb 16, 2018 · 17 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Nonlinear Solvers Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area pkg: NOX

Comments

@bartlettroscoe
Copy link
Member

CC: @trilinos/nox

Description

The test NOX_Thyra_Heq_MPI_1 fails in the new ATDM build Trilinos-atdm-hansen-shiller-intel-opt-serial on hansen as shown yesterday at:

and for the build Trilinos-atdm-hansen-shiller-intel-opt-openmp yesterday at:

In both cases, the end of the test shows:

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 6.749e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 5.011e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************
Test failed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42973,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Steps to Reproduce

Anyone with access to the SNL test bed machines hansen (SON) or shiller (SRN) using the instructions linked to from the page:

should be able to reproduce.

The link from there to the README file:

should provide the info. But in short, once you clone Trilinos on hansen or shiller into your home directory (pointed to by env var TRILINOS_DIR), you should be able to reproduce with:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-serial

$ cmake \
  -GNina \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON \
  $TRILINOS_DIR

$ make NP=16  NOX_Thyra_Heq

$ ctest -R NOX_Thyra_Heq_MPI_1

I ran the above on hansen just now for the Trilinos version:

a39e44b "Merge branch 'develop' of github.com:trilinos/Trilinos into develop"
Author: Curtis C. Ober <ccober@sandia.gov>
Date:   Fri Feb 16 11:38:11 2018 -0700 (19 minutes ago)

and it produced the same failure.

@bartlettroscoe bartlettroscoe added the client: ATDM Any issue primarily impacting the ATDM project label Feb 23, 2018
@bartlettroscoe
Copy link
Member Author

@rppawlo

FYI

@bartlettroscoe
Copy link
Member Author

Looking at the behavior of this test, it looks like the problem is rounding issues. Looking at the pass/fail criteria in the file:

  • Trilinos/packages/nox/test/epetra/Thyra/Thyra_Heq.C

it shows:

    // 1. Convergence
    if (solvStatus != NOX::StatusTest::Converged)
      status = 1;
    // 2. Number of iterations
    if (const_cast<Teuchos::ParameterList&>(solver->getList()).sublist("Output").get("Nonlinear Iterations", 0) != 18)
      status = 2;
    // 3. Same reset solution
    if (diff->norm() >= 1.0e-14)
      status = 3;

    success = status==0;

    if (Comm.MyPID() == 0) {
      if (success)
        std::cout << "Test passed!" << std::endl;
      else
        std::cout << "Test failed!" << std::endl;
    }
  }
  TEUCHOS_STANDARD_CATCH_STATEMENTS(verbose, std::cerr, success);

Well the problem is that on Intel, it converges in 16 iterations instead of 18.

Note that this is not the only Intel where this test is converging early and failing. See, for example, the build Linux-intel-17.0.1-MPI_Release_intel_17.0.1_openmpi_1.8.7_DEV that ran yesterday at:

Can we just change the convergence criteria to <= 18 iteration instead of exactly 18 iterations? Also, can we add better error message that shows why the test failing because the current output makes no sense.

I can create a PR that does this.

@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Feb 28, 2018

The updated code with the PR I am going to submit using the macros in Teuchos_TestingHelpers.hpp produces the output:

************************************************************************
-- Nonlinear Solver Step 16 -- 
||F|| = 6.500e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 3.250e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 2.810e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 16 == 18 : FAILED
diff->norm() = 1.4614e-07 <= 1e-14 : FAILED

Test failed!

Now you can clearly see that the checks:

numIterations = 16 == 18 : FAILED
diff->norm() = 1.4614e-07 <= 1e-14 : FAILED

failed and therefore that is why the test failed.

I will submit a PR that adds this better output and then we can go from there on how to fix this test.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 28, 2018
This was need for clear output for the test in NOX Thyra_Heq.C.

I just added a simple usage of this macro.  But we really need better unit
tests for all of thse macros.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 28, 2018
Before, if any of the three criteria failed, it would jsut print "Test
failed".  But now it prints why it failed with details.

This test currently fails with Intel compilers (see trilinos#2247) but at least now it
shows you why (which was not clear at all before).

I also made usage of the default FancyOStream to avoid logic about what
process you are on for when you should be printing or not.  That is the best
way to handle parallel output and better test output control.

SQUASH AGAINST 'Update to clealry show why the test passes for fails (trilinos#2247)'
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 28, 2018
…ces (trilinos#2247)

This is needed because the Intel compiler 17.4 seems to produce different
roundoff than GCC compilers.  With Intel, the NOX solver converges in 16
iterations instead of 18 iterations.  So this is good, right?

But to make sure that some iterations are done, I changed the pass/fail to
require 14 or more iterations to make sure that a solve is performed.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Feb 28, 2018
Before, if any of the three criteria failed, it would jsut print "Test
failed".  But now it prints why it failed with details.

This test currently fails with Intel compilers (see trilinos#2247) but at least now it
shows you why (which was not clear at all before).

I also made usage of the default FancyOStream to avoid logic about what
process you are on for when you should be printing or not.  That is the best
way to handle parallel output and better test output control.
@bartlettroscoe
Copy link
Member Author

FYI: I created PR #2310 to clearly show why this test passes or fails. It would be good to merge this to 'develop' and let it run tomorrow so that we can see what this test looks like when it passes and what it looks like when it fails on all of the machines. That might tell us how to go about fixing this so that it passes.

@rppawlo
Copy link
Contributor

rppawlo commented Feb 28, 2018

@atoth1 - could you take a look at this?

@atoth1
Copy link
Contributor

atoth1 commented Mar 1, 2018

Yep, I'll see what I can figure out this evening. My initial thought is that it makes sense that roundoff differences could cause differences in iteration counts, considering that the condition number for the least-squares problem is getting up near 10e14. The fact that it's also getting slightly different answers (diff->norm() = 1.4614e-07 in Ross's post above) from the initial solve and the restarted solve is curious though.

@bartlettroscoe
Copy link
Member Author

Okay, with the PR #2310 merged, we are seeing the detailed pass/fail criteria this morning at:

For example, you can see one passing test output at:

which shows:

************************************************************************
-- Nonlinear Solver Step 16 -- 
||F|| = 6.345e-07  step = 1.000e+00
************************************************************************

    R condition number estimate (10) = 1.09022e+12

************************************************************************
-- Nonlinear Solver Step 17 -- 
||F|| = 2.154e-07  step = 1.000e+00
************************************************************************

    R condition number estimate (10) = 1.01819e+12

************************************************************************
-- Nonlinear Solver Step 18 -- 
||F|| = 7.825e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 3.912e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 3.397e-06 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 18 == 18 : passed
diff->norm() = 0 <= 1e-14 : passed

Test passed!

Could we try tightening down the ||f(x)|| to tolerance some and see if that fixes the problem? It is currently at 1e-8 so could we try 1e-9 to see if this fixes the problem?

@atoth1
Copy link
Contributor

atoth1 commented Mar 1, 2018

That might help, but I'm thinking the best thing to do would probably be to just knock the storage depth parameter down to 2. It's currently set to 10, and 10 vs 2 doesn't really test anything different. With storage depth 2 the condition number for the least squares problem stays below 10^4 so roundoff has negligible effect. This knocks the iteration count down to 11, and changing both those get the test to pass for me on shiller with both gcc and intel compilers.

@bartlettroscoe
Copy link
Member Author

That might help, but I'm thinking the best thing to do would probably be to just knock the storage depth parameter down to 2. It's currently set to 10, and 10 vs 2 doesn't really test anything different. With storage depth 2 the condition number for the least squares problem stays below 10^4 so roundoff has negligible effect. This knocks the iteration count down to 11, and changing both those get the test to pass for me on shiller with both gcc and intel compilers.

@atoth1, sounds good to me. This does not sound like it would weaken the testings for this algorithm too much and by dropping the condition number, you get less portability problems with this test.

@rppawlo, is this change @atoth1 is suggesting okay with you?

@bartlettroscoe
Copy link
Member Author

Can we go ahead and pull the trigger on the change described above to fix this test? This this is now the only broken test holding up the promotion for the ATDM builds Trilinos-atdm-hansen-shiller-intel-opt-openmp and Trilinos-atdm-hansen-shiller-intel-opt-openmp the "ATDM" CDash group/track as shown at:

Otherwise, we just need to disable this test for these two Intel builds.

@rppawlo
Copy link
Contributor

rppawlo commented Mar 14, 2018

@bartlettroscoe - yes this change is fine.

@bartlettroscoe
Copy link
Member Author

yes this change is fine.

Okay, I will create a PR for the change and test for these Intel builds on shiller.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 14, 2018
…portable (trilinos#2247)

This change reduces the condition number of the least-squares problem from
10^12 to 10^4 and results in more stable floating point computations and
better portability of the test.

This fixes the failures with the Intel compiler 17.0.1 with optimized compiler
flags (see trilinos#2247).
@bartlettroscoe
Copy link
Member Author

I tried reducing the storage depth from 10 to 2 in the commit 3de4732 in my branch and re-ran the test with the intel-opt-openmp configuration and it did not pass. Instead, it gave:

...

************************************************************************
-- Nonlinear Solver Step 10 -- 
||F|| = 3.667e-06  step = 1.000e+00
************************************************************************

    R condition number estimate (2) = 8384.82

************************************************************************
-- Nonlinear Solver Step 11 -- 
||F|| = 2.439e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 1.220e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 9.015e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 11 == 18 : FAILED
diff->norm() = 0 <= 1e-14 : passed

Test failed!

Is this correct? Should we set the numIterations criteria to 11?

@atoth1
Copy link
Contributor

atoth1 commented Mar 14, 2018

Yep, I mentioned the change in the iteration count. It was converging in 11 for me as well.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 14, 2018
…portable (trilinos#2247)

This change reduces the condition number of the least-squares problem from
10^12 to 10^4 and results in more stable floating point computations and
better portability of the test.

This fixes the failures with the Intel compiler 17.0.1 with optimized compiler
flags (see trilinos#2247).

Also has to change number of expected iterations from 18 to 11.

I tested this with Intel and GNU builds and they both passed:

Enabled Packages: NOX
Enabled all Forward Packages

1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min)
2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)

Change expected iters from 18 to 11 (trilinos#2247)
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 14, 2018
…portable (trilinos#2247)

This change reduces the condition number of the least-squares problem from
10^12 to 10^4 and results in more stable floating point computations and
better portability of the test.

This fixes the failures with the Intel compiler 17.0.1 with optimized compiler
flags (see trilinos#2247).

Also has to change number of expected iterations from 18 to 11.

I tested this with Intel and GNU builds and they both passed:

Enabled Packages: NOX
Enabled all Forward Packages

1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min)
2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)
@bartlettroscoe
Copy link
Member Author

The PR is #2388. It is running in auto PR testing now.

bartlettroscoe added a commit that referenced this issue Mar 15, 2018
…e-condition-number

Reduce storge depth from 10 to 2 to reduce condition number and make portable (#2247)
@bartlettroscoe
Copy link
Member Author

The auto testing for PR #2288 passed so I clicked the merge button. Hopefully we will see this test get fixed tomorrow. Putting in review.

@bartlettroscoe bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 15, 2018
@bartlettroscoe
Copy link
Member Author

This test is passing in all of the "ATDM" builds today shown at:

This includes the builds:

  • Trilinos-atdm-hansen-shiller-cuda-debug, Passed, 1.96, 2018-03-15T13:20:55 UTC
  • Trilinos-atdm-hansen-shiller-cuda-opt, Passed, 1.39, 2018-03-15T14:54:18 UTC
  • Trilinos-atdm-hansen-shiller-gnu-debug-openmp, Passed, 1.23, 2018-03-15T08:59:09 UTC
  • Trilinos-atdm-hansen-shiller-gnu-debug-serial, Passed, 2.39, 2018-03-15T07:11:18 UTC
  • Trilinos-atdm-hansen-shiller-gnu-opt-openmp, Passed, 1.44, 2018-03-15T14:08:22 UTC
  • Trilinos-atdm-hansen-shiller-gnu-opt-serial, Passed, 1.54, 2018-03-15T07:15:45 UTC
  • Trilinos-atdm-hansen-shiller-intel-debug-openmp, Passed, 1.04, 2018-03-15T10:55:33 UTC
  • Trilinos-atdm-hansen-shiller-intel-debug-serial, Passed, 1.83, 2018-03-15T07:47:56 UTC
  • Trilinos-atdm-hansen-shiller-intel-opt-openmp, Passed, 4.93, 2018-03-15T11:45:23 UTC
  • Trilinos-atdm-hansen-shiller-intel-opt-serial, Passed, 1.7, 2018-03-15T09:39:15 UTC
  • Trilinos-atdm-sems-gcc-7-2-0, Passed, 0.81, 2018-03-15T11:53:05 UTC
  • Trilinos-atdm-white-ride-cuda-debug, Passed, 3.08, 2018-03-15T08:17:32 UTC
  • Trilinos-atdm-white-ride-cuda-opt, Passed, 1.99, 2018-03-15T10:57:59 UTC
  • Trilinos-atdm-white-ride-cuda-opt, Passed, 2.72, 2018-03-15T08:25:59 UTC
  • Trilinos-atdm-white-ride-gnu-opt-openmp, Passed, 1.42, 2018-03-15T07:05:17 UTC
  • Trilinos-atdm-white-ride-gnu-opt-openmp, Passed, 1.12, 2018-03-15T10:13:33 UTC

Closing as complete.

Thanks for the help @atoth1!

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 15, 2018
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
…portable (trilinos#2247)

This change reduces the condition number of the least-squares problem from
10^12 to 10^4 and results in more stable floating point computations and
better portability of the test.

This fixes the failures with the Intel compiler 17.0.1 with optimized compiler
flags (see trilinos#2247).

Also has to change number of expected iterations from 18 to 11.

I tested this with Intel and GNU builds and they both passed:

Enabled Packages: NOX
Enabled all Forward Packages

1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min)
2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)
@bartlettroscoe bartlettroscoe added the PA: Nonlinear Solvers Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area label Nov 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Nonlinear Solvers Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area pkg: NOX
Projects
None yet
Development

No branches or pull requests

3 participants