Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

bartlettroscoe · 2018-02-16T19:53:06Z

CC: @trilinos/nox

Description

The test NOX_Thyra_Heq_MPI_1 fails in the new ATDM build Trilinos-atdm-hansen-shiller-intel-opt-serial on hansen as shown yesterday at:

https://testing-vm.sandia.gov/cdash/testDetails.php?test=43469004&build=3316785

and for the build Trilinos-atdm-hansen-shiller-intel-opt-openmp yesterday at:

https://testing-vm.sandia.gov/cdash/testDetails.php?test=43472316&build=3317072

In both cases, the end of the test shows:

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 6.749e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 5.011e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************
Test failed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42973,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Steps to Reproduce

Anyone with access to the SNL test bed machines hansen (SON) or shiller (SRN) using the instructions linked to from the page:

https://snl-wiki.sandia.gov/display/CoodinatedDevOpsATDM/ATDM+Builds+of+Trilinos

should be able to reproduce.

The link from there to the README file:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

should provide the info. But in short, once you clone Trilinos on hansen or shiller into your home directory (pointed to by env var TRILINOS_DIR), you should be able to reproduce with:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh intel-opt-serial

$ cmake \
  -GNina \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON \
  $TRILINOS_DIR

$ make NP=16  NOX_Thyra_Heq

$ ctest -R NOX_Thyra_Heq_MPI_1

I ran the above on hansen just now for the Trilinos version:

a39e44b "Merge branch 'develop' of github.com:trilinos/Trilinos into develop"
Author: Curtis C. Ober <ccober@sandia.gov>
Date:   Fri Feb 16 11:38:11 2018 -0700 (19 minutes ago)

and it produced the same failure.

The text was updated successfully, but these errors were encountered:

bartlettroscoe · 2018-02-23T20:43:42Z

@rppawlo

FYI

bartlettroscoe · 2018-02-28T15:22:30Z

Looking at the behavior of this test, it looks like the problem is rounding issues. Looking at the pass/fail criteria in the file:

Trilinos/packages/nox/test/epetra/Thyra/Thyra_Heq.C

it shows:

    // 1. Convergence
    if (solvStatus != NOX::StatusTest::Converged)
      status = 1;
    // 2. Number of iterations
    if (const_cast<Teuchos::ParameterList&>(solver->getList()).sublist("Output").get("Nonlinear Iterations", 0) != 18)
      status = 2;
    // 3. Same reset solution
    if (diff->norm() >= 1.0e-14)
      status = 3;

    success = status==0;

    if (Comm.MyPID() == 0) {
      if (success)
        std::cout << "Test passed!" << std::endl;
      else
        std::cout << "Test failed!" << std::endl;
    }
  }
  TEUCHOS_STANDARD_CATCH_STATEMENTS(verbose, std::cerr, success);

Well the problem is that on Intel, it converges in 16 iterations instead of 18.

Note that this is not the only Intel where this test is converging early and failing. See, for example, the build Linux-intel-17.0.1-MPI_Release_intel_17.0.1_openmpi_1.8.7_DEV that ran yesterday at:

https://testing.sandia.gov/cdash/testDetails.php?test=44959403&build=3409827

Can we just change the convergence criteria to <= 18 iteration instead of exactly 18 iterations? Also, can we add better error message that shows why the test failing because the current output makes no sense.

I can create a PR that does this.

bartlettroscoe · 2018-02-28T17:09:54Z

The updated code with the PR I am going to submit using the macros in Teuchos_TestingHelpers.hpp produces the output:

************************************************************************
-- Nonlinear Solver Step 16 -- 
||F|| = 6.500e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 3.250e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 2.810e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 16 == 18 : FAILED
diff->norm() = 1.4614e-07 <= 1e-14 : FAILED

Test failed!

Now you can clearly see that the checks:

numIterations = 16 == 18 : FAILED
diff->norm() = 1.4614e-07 <= 1e-14 : FAILED

failed and therefore that is why the test failed.

I will submit a PR that adds this better output and then we can go from there on how to fix this test.

This was need for clear output for the test in NOX Thyra_Heq.C. I just added a simple usage of this macro. But we really need better unit tests for all of thse macros.

Before, if any of the three criteria failed, it would jsut print "Test failed". But now it prints why it failed with details. This test currently fails with Intel compilers (see trilinos#2247) but at least now it shows you why (which was not clear at all before). I also made usage of the default FancyOStream to avoid logic about what process you are on for when you should be printing or not. That is the best way to handle parallel output and better test output control. SQUASH AGAINST 'Update to clealry show why the test passes for fails (trilinos#2247)'

…ces (trilinos#2247) This is needed because the Intel compiler 17.4 seems to produce different roundoff than GCC compilers. With Intel, the NOX solver converges in 16 iterations instead of 18 iterations. So this is good, right? But to make sure that some iterations are done, I changed the pass/fail to require 14 or more iterations to make sure that a solve is performed.

Before, if any of the three criteria failed, it would jsut print "Test failed". But now it prints why it failed with details. This test currently fails with Intel compilers (see trilinos#2247) but at least now it shows you why (which was not clear at all before). I also made usage of the default FancyOStream to avoid logic about what process you are on for when you should be printing or not. That is the best way to handle parallel output and better test output control.

bartlettroscoe · 2018-02-28T17:50:09Z

FYI: I created PR #2310 to clearly show why this test passes or fails. It would be good to merge this to 'develop' and let it run tomorrow so that we can see what this test looks like when it passes and what it looks like when it fails on all of the machines. That might tell us how to go about fixing this so that it passes.

rppawlo · 2018-02-28T19:32:33Z

@atoth1 - could you take a look at this?

atoth1 · 2018-03-01T00:07:03Z

Yep, I'll see what I can figure out this evening. My initial thought is that it makes sense that roundoff differences could cause differences in iteration counts, considering that the condition number for the least-squares problem is getting up near 10e14. The fact that it's also getting slightly different answers (diff->norm() = 1.4614e-07 in Ross's post above) from the initial solve and the restarted solve is curious though.

bartlettroscoe · 2018-03-01T14:29:50Z

Okay, with the PR #2310 merged, we are seeing the detailed pass/fail criteria this morning at:

https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&limit=0&filtercount=1&showfilters=1&field1=testname&compare1=61&value1=NOX_Thyra_Heq_MPI_1

For example, you can see one passing test output at:

https://testing.sandia.gov/cdash/testDetails.php?test=44996843&build=3412123

which shows:

************************************************************************
-- Nonlinear Solver Step 16 -- 
||F|| = 6.345e-07  step = 1.000e+00
************************************************************************

    R condition number estimate (10) = 1.09022e+12

************************************************************************
-- Nonlinear Solver Step 17 -- 
||F|| = 2.154e-07  step = 1.000e+00
************************************************************************

    R condition number estimate (10) = 1.01819e+12

************************************************************************
-- Nonlinear Solver Step 18 -- 
||F|| = 7.825e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 3.912e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 3.397e-06 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 18 == 18 : passed
diff->norm() = 0 <= 1e-14 : passed

Test passed!

Could we try tightening down the ||f(x)|| to tolerance some and see if that fixes the problem? It is currently at 1e-8 so could we try 1e-9 to see if this fixes the problem?

atoth1 · 2018-03-01T18:25:44Z

That might help, but I'm thinking the best thing to do would probably be to just knock the storage depth parameter down to 2. It's currently set to 10, and 10 vs 2 doesn't really test anything different. With storage depth 2 the condition number for the least squares problem stays below 10^4 so roundoff has negligible effect. This knocks the iteration count down to 11, and changing both those get the test to pass for me on shiller with both gcc and intel compilers.

bartlettroscoe · 2018-03-01T21:15:07Z

That might help, but I'm thinking the best thing to do would probably be to just knock the storage depth parameter down to 2. It's currently set to 10, and 10 vs 2 doesn't really test anything different. With storage depth 2 the condition number for the least squares problem stays below 10^4 so roundoff has negligible effect. This knocks the iteration count down to 11, and changing both those get the test to pass for me on shiller with both gcc and intel compilers.

@atoth1, sounds good to me. This does not sound like it would weaken the testings for this algorithm too much and by dropping the condition number, you get less portability problems with this test.

@rppawlo, is this change @atoth1 is suggesting okay with you?

bartlettroscoe · 2018-03-14T14:26:52Z

Can we go ahead and pull the trigger on the change described above to fix this test? This this is now the only broken test holding up the promotion for the ATDM builds Trilinos-atdm-hansen-shiller-intel-opt-openmp and Trilinos-atdm-hansen-shiller-intel-opt-openmp the "ATDM" CDash group/track as shown at:

https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-14&filtercombine=and&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-hansen-shiller-intel-&field2=status&compare2=62&value2=Passed&field3=status&compare3=62&value3=notrun

Otherwise, we just need to disable this test for these two Intel builds.

rppawlo · 2018-03-14T14:29:39Z

@bartlettroscoe - yes this change is fine.

bartlettroscoe · 2018-03-14T14:33:36Z

yes this change is fine.

Okay, I will create a PR for the change and test for these Intel builds on shiller.

…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247).

bartlettroscoe · 2018-03-14T19:53:51Z

I tried reducing the storage depth from 10 to 2 in the commit 3de4732 in my branch and re-ran the test with the intel-opt-openmp configuration and it did not pass. Instead, it gave:

...

************************************************************************
-- Nonlinear Solver Step 10 -- 
||F|| = 3.667e-06  step = 1.000e+00
************************************************************************

    R condition number estimate (2) = 8384.82

************************************************************************
-- Nonlinear Solver Step 11 -- 
||F|| = 2.439e-08  step = 1.000e+00 (Converged!)
************************************************************************

************************************************************************
-- Final Status Test Results --
Converged....OR Combination -> 
  **...........Finite Number Check (Two-Norm F) = Finite
  Converged....AND Combination -> 
    Converged....F-Norm = 1.220e-09 < 1.000e-08
                 (Length-Scaled Two-Norm, Absolute Tolerance)
    Converged....WRMS-Norm = 9.015e-05 < 1
  ??...........Number of Iterations = -1 < 100
************************************************************************

Check for test pass/fail:
solvStatus = Converged.... == Converged.... : passed
numIterations = 11 == 18 : FAILED
diff->norm() = 0 <= 1e-14 : passed

Test failed!

Is this correct? Should we set the numIterations criteria to 11?

atoth1 · 2018-03-14T22:34:44Z

Yep, I mentioned the change in the iteration count. It was converging in 11 for me as well.

…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247). Also has to change number of expected iterations from 18 to 11. I tested this with Intel and GNU builds and they both passed: Enabled Packages: NOX Enabled all Forward Packages 1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min) 2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min) Change expected iters from 18 to 11 (trilinos#2247)

…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247). Also has to change number of expected iterations from 18 to 11. I tested this with Intel and GNU builds and they both passed: Enabled Packages: NOX Enabled all Forward Packages 1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min) 2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)

bartlettroscoe · 2018-03-15T01:44:59Z

The PR is #2388. It is running in auto PR testing now.

…e-condition-number Reduce storge depth from 10 to 2 to reduce condition number and make portable (#2247)

bartlettroscoe · 2018-03-15T01:53:43Z

The auto testing for PR #2288 passed so I clicked the merge button. Hopefully we will see this test get fixed tomorrow. Putting in review.

bartlettroscoe · 2018-03-15T17:23:47Z

This test is passing in all of the "ATDM" builds today shown at:

https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-03-15&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=65&value1=Trilinos-atdm-&field2=testname&compare2=61&value2=NOX_Thyra_Heq_MPI_1

This includes the builds:

Trilinos-atdm-hansen-shiller-cuda-debug, Passed, 1.96, 2018-03-15T13:20:55 UTC
Trilinos-atdm-hansen-shiller-cuda-opt, Passed, 1.39, 2018-03-15T14:54:18 UTC
Trilinos-atdm-hansen-shiller-gnu-debug-openmp, Passed, 1.23, 2018-03-15T08:59:09 UTC
Trilinos-atdm-hansen-shiller-gnu-debug-serial, Passed, 2.39, 2018-03-15T07:11:18 UTC
Trilinos-atdm-hansen-shiller-gnu-opt-openmp, Passed, 1.44, 2018-03-15T14:08:22 UTC
Trilinos-atdm-hansen-shiller-gnu-opt-serial, Passed, 1.54, 2018-03-15T07:15:45 UTC
Trilinos-atdm-hansen-shiller-intel-debug-openmp, Passed, 1.04, 2018-03-15T10:55:33 UTC
Trilinos-atdm-hansen-shiller-intel-debug-serial, Passed, 1.83, 2018-03-15T07:47:56 UTC
Trilinos-atdm-hansen-shiller-intel-opt-openmp, Passed, 4.93, 2018-03-15T11:45:23 UTC
Trilinos-atdm-hansen-shiller-intel-opt-serial, Passed, 1.7, 2018-03-15T09:39:15 UTC
Trilinos-atdm-sems-gcc-7-2-0, Passed, 0.81, 2018-03-15T11:53:05 UTC
Trilinos-atdm-white-ride-cuda-debug, Passed, 3.08, 2018-03-15T08:17:32 UTC
Trilinos-atdm-white-ride-cuda-opt, Passed, 1.99, 2018-03-15T10:57:59 UTC
Trilinos-atdm-white-ride-cuda-opt, Passed, 2.72, 2018-03-15T08:25:59 UTC
Trilinos-atdm-white-ride-gnu-opt-openmp, Passed, 1.42, 2018-03-15T07:05:17 UTC
Trilinos-atdm-white-ride-gnu-opt-openmp, Passed, 1.12, 2018-03-15T10:13:33 UTC

Closing as complete.

Thanks for the help @atoth1!

…portable (trilinos#2247) This change reduces the condition number of the least-squares problem from 10^12 to 10^4 and results in more stable floating point computations and better portability of the test. This fixes the failures with the Intel compiler 17.0.1 with optimized compiler flags (see trilinos#2247). Also has to change number of expected iterations from 18 to 11. I tested this with Intel and GNU builds and they both passed: Enabled Packages: NOX Enabled all Forward Packages 1) intel-opt-openmp => passed: passed=105,notpassed=0 (2.65 min) 2) gnu-opt-openmp => passed: passed=105,notpassed=0 (8.81 min)

bartlettroscoe added the pkg: NOX label Feb 16, 2018

bartlettroscoe added this to the Initial cleanup of new ATDM builds of Trilinos milestone Feb 16, 2018

bartlettroscoe added the client: ATDM Any issue primarily impacting the ATDM project label Feb 23, 2018

nolta mentioned this issue Mar 9, 2018

test failures compiling 12.12.1 w/ intel compilers 2018.1 #2357

Closed

atoth1 mentioned this issue Mar 14, 2018

NOX: Adding pre/post line search user operations for LineSearchBased … #2377

Merged

9 tasks

bartlettroscoe mentioned this issue Mar 14, 2018

Reduce storge depth from 10 to 2 to reduce condition number and make portable (#2247) #2388

Merged

bartlettroscoe added a commit that referenced this issue Mar 15, 2018

Merge pull request #2388 from bartlettroscoe/2247-nox-thyra-heq-reduc…

0162b1c

…e-condition-number Reduce storge depth from 10 to 2 to reduce condition number and make portable (#2247)

bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 15, 2018

bartlettroscoe closed this as completed Mar 15, 2018

bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 15, 2018

bartlettroscoe added the PA: Nonlinear Solvers Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area label Nov 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

bartlettroscoe commented Feb 16, 2018

bartlettroscoe commented Feb 23, 2018

bartlettroscoe commented Feb 28, 2018

bartlettroscoe commented Feb 28, 2018 •

edited

Loading

bartlettroscoe commented Feb 28, 2018

rppawlo commented Feb 28, 2018

atoth1 commented Mar 1, 2018

bartlettroscoe commented Mar 1, 2018

atoth1 commented Mar 1, 2018

bartlettroscoe commented Mar 1, 2018

bartlettroscoe commented Mar 14, 2018

rppawlo commented Mar 14, 2018

bartlettroscoe commented Mar 14, 2018

bartlettroscoe commented Mar 14, 2018

atoth1 commented Mar 14, 2018

bartlettroscoe commented Mar 15, 2018

bartlettroscoe commented Mar 15, 2018

bartlettroscoe commented Mar 15, 2018

Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

Test NOX_Thyra_Heq_MPI_1 failing in new Trilinos-atdm-hansen-shiller-intel-opt-serial and XXX-openmp builds #2247

Comments

bartlettroscoe commented Feb 16, 2018

Description

Steps to Reproduce

bartlettroscoe commented Feb 23, 2018

bartlettroscoe commented Feb 28, 2018

bartlettroscoe commented Feb 28, 2018 • edited Loading

bartlettroscoe commented Feb 28, 2018

rppawlo commented Feb 28, 2018

atoth1 commented Mar 1, 2018

bartlettroscoe commented Mar 1, 2018

atoth1 commented Mar 1, 2018

bartlettroscoe commented Mar 1, 2018

bartlettroscoe commented Mar 14, 2018

rppawlo commented Mar 14, 2018

bartlettroscoe commented Mar 14, 2018

bartlettroscoe commented Mar 14, 2018

atoth1 commented Mar 14, 2018

bartlettroscoe commented Mar 15, 2018

bartlettroscoe commented Mar 15, 2018

bartlettroscoe commented Mar 15, 2018

bartlettroscoe commented Feb 28, 2018 •

edited

Loading