Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nc_test fails when optimization greater than -O0 is invoked. #1983

Closed
WardF opened this issue Apr 12, 2021 · 17 comments · Fixed by #2633
Closed

nc_test fails when optimization greater than -O0 is invoked. #1983

WardF opened this issue Apr 12, 2021 · 17 comments · Fixed by #2633

Comments

@WardF
Copy link
Member

WardF commented Apr 12, 2021

A failure in nc_test/nc_test was initially reported on the ppc64le platform as part of the tests for conda-forge/libnetcdf-feedstock release of 4.8.0. Subsequent testing has revealed an issue with some, but not all, compilers. More details to follow, opening an issue to capture this.

Update 1

Issue has been observed on M1 hardware, both in OSX, and emulated ppc64le (via qemu). Issue does not manifest on M1 hardware in virtualized ubuntu, or on intel-based OSX. I'll update with compiler versions (which is a big piece of the puzzle).

@dopplershift
Copy link
Member

I can confirm that replacing -O3 with -O0 over in conda-forge/libnetcdf-feedstock#119 allowed the failing test to pass.

@dopplershift
Copy link
Member

Some relevant build env information from over there. Host env:

    _libgcc_mutex:   0.1-conda_forge                conda-forge
    _openmp_mutex:   4.5-1_gnu                      conda-forge
    bzip2:           1.0.8-h4e0d66e_4               conda-forge
    c-ares:          1.17.1-h4e0d66e_1              conda-forge
    ca-certificates: 2020.12.5-h1084571_0           conda-forge
    curl:            7.76.0-h86bdb50_0              conda-forge
    hdf4:            4.2.13-h9a582f1_3              defaults   
    hdf5:            1.10.6-mpi_mpich_h588be55_1014 conda-forge
    jpeg:            9d-h339bb43_0                  conda-forge
    krb5:            1.17.2-h2004cfc_0              conda-forge
    libcurl:         7.76.0-hf726fd8_0              conda-forge
    libedit:         3.1.20191231-h41a240f_2        conda-forge
    libev:           4.33-h6eb9509_1                conda-forge
    libgcc-ng:       9.3.0-hc7a5eb4_18              conda-forge
    libgfortran-ng:  9.3.0-hb3da1e1_18              conda-forge
    libgfortran5:    9.3.0-hb3da1e1_18              conda-forge
    libgomp:         9.3.0-hc7a5eb4_18              conda-forge
    libnghttp2:      1.43.0-h42039ad_0              conda-forge
    libssh2:         1.9.0-ha5a9321_6               conda-forge
    libstdcxx-ng:    9.3.0-h22bc1b8_18              conda-forge
    mpi:             1.0-mpich                      conda-forge
    mpich:           3.4.1-hb1f5ef7_104             conda-forge
    ncurses:         6.2-hea85c5d_4                 conda-forge
    openssl:         1.1.1k-h4e0d66e_0              conda-forge
    tk:              8.6.10-h38e1d09_1              conda-forge
    zlib:            1.2.11-h6eb9509_1010           conda-forge

Build env:

    _libgcc_mutex:                 0.1-conda_forge         conda-forge
    _openmp_mutex:                 4.5-1_gnu               conda-forge
    binutils_impl_linux-ppc64le:   2.35.1-h5836da8_2       conda-forge
    binutils_linux-ppc64le:        2.35-hcfcb53e_30        conda-forge
    bzip2:                         1.0.8-h4e0d66e_4        conda-forge
    c-ares:                        1.17.1-h4e0d66e_1       conda-forge
    ca-certificates:               2020.12.5-h1084571_0    conda-forge
    cmake:                         3.20.1-hf140c13_0       conda-forge
    expat:                         2.3.0-h3b9df90_0        conda-forge
    gcc_impl_linux-ppc64le:        9.3.0-ha8fcf76_18       conda-forge
    gcc_linux-ppc64le:             9.3.0-h2662597_30       conda-forge
    gxx_impl_linux-ppc64le:        9.3.0-h6db6fac_18       conda-forge
    gxx_linux-ppc64le:             9.3.0-hdc9eec7_30       conda-forge
    kernel-headers_linux-ppc64le:  3.10.0-hbe0c576_10      conda-forge
    krb5:                          1.17.2-h2004cfc_0       conda-forge
    ld_impl_linux-ppc64le:         2.35.1-ha35d02b_2       conda-forge
    libcurl:                       7.76.0-hf726fd8_0       conda-forge
    libedit:                       3.1.20191231-h41a240f_2 conda-forge
    libev:                         4.33-h6eb9509_1         conda-forge
    libgcc-devel_linux-ppc64le:    9.3.0-h9f60fdd_18       conda-forge
    libgcc-ng:                     9.3.0-hc7a5eb4_18       conda-forge
    libgomp:                       9.3.0-hc7a5eb4_18       conda-forge
    libnghttp2:                    1.43.0-h42039ad_0       conda-forge
    libssh2:                       1.9.0-ha5a9321_6        conda-forge
    libstdcxx-devel_linux-ppc64le: 9.3.0-h7901c9a_18       conda-forge
    libstdcxx-ng:                  9.3.0-h22bc1b8_18       conda-forge
    libuv:                         1.41.0-h4e0d66e_0       conda-forge
    lz4-c:                         1.9.3-h3b9df90_0        conda-forge
    make:                          4.3-hf817498_1          conda-forge
    ncurses:                       6.2-hea85c5d_4          conda-forge
    openssl:                       1.1.1k-h4e0d66e_0       conda-forge
    pkg-config:                    0.29.2-h339bb43_1008    conda-forge
    rhash:                         1.4.1-h4e0d66e_0        conda-forge
    sysroot_linux-ppc64le:         2.17-h8b29623_10        conda-forge
    tk:                            8.6.10-h38e1d09_1       conda-forge
    xz:                            5.2.5-h6eb9509_1        conda-forge
    zlib:                          1.2.11-h6eb9509_1010    conda-forge
    zstd:                          1.4.9-h65c4b1a_0        conda-forge

@WardF
Copy link
Member Author

WardF commented Apr 13, 2021

Surprisingly, no failure on a raspberry pi ARM device (gcc 10.2).

@edwardhartnett
Copy link
Contributor

I would suggest you try on standard unix box with valgrind, and with address sanitizer.

@WardF
Copy link
Member Author

WardF commented Apr 13, 2021

The address sanitizer hadn't turned anything up, but valgrind was a good idea. Valgrind detected a few undefined variables that could have been the issue, but addressing them in the source .m4 files hasn't made a difference to the observed error (although valgrind no longer reports any errors). Stepping through in lldb now.

@c-ney
Copy link

c-ney commented Jun 9, 2021

This (or a similar) issue has recently popped up in Opensuse Tumbleweed i586, x86_64 and aarch64 (some logs at [0]) when they switched gcc from 10 to 11 by default (it was a huge update, so might be coming from something else, but this seems the most noteworthy change in my eyes). I only tested x86_64 locally, compiling with -O0 or -O1 works fine, but -O2 will result in nc_test failing.

test-suite.log

Haven't had the time to look any further yet.

Edit: To clarify, the version at [0] is still 4.7.4, but my local tests were with 4.8.0.

[0] https://build.opensuse.org/package/show/science/netcdf

@WardF
Copy link
Member Author

WardF commented Jun 14, 2021

Thanks for percolating this back up; I'll pick this back up and see if I can figure out what's going on.

@c-ney
Copy link

c-ney commented Jul 8, 2021

Egbert Eich from the SUSE team was able to pin this down to strict aliasing requirements in GCC11. After disabling it in [0], the package (now 4.8.0) builds with successful tests [1].

[0] https://build.opensuse.org/request/show/904468
[1] https://build.opensuse.org/package/show/science/netcdf

@WardF
Copy link
Member Author

WardF commented Jul 8, 2021

Wow, thank you, I've been working on this and a different issue in an effort to get our next release out, this is a great help. I will sort through his results and get this addressed!

@DennisHeimbigner
Copy link
Collaborator

What is the exact fix: -Wnomaybe-unitialized ?

@WardF
Copy link
Member Author

WardF commented Jul 8, 2021

@DennisHeimbigner I believe the temporary fix, pending the broader evaluation of our aliasing, is to pass -fno-strict-aliasing for gcc >= 11. Testing this fix now.

@WardF
Copy link
Member Author

WardF commented Jul 8, 2021

@DennisHeimbigner still encountering the issue on OSX w/ gcc, so continue to move forward working on this.

@edwardhartnett
Copy link
Contributor

Can we fix the code instead of adding a flag? What is the exact code issue that is causing the problem, do we know?

@WardF
Copy link
Member Author

WardF commented Jul 9, 2021

@edhartnett Fixing the code would be ideal, but the code, and the fix, still need tbd. We need to get 4.8.1 out the door and if a flag will let us implement a work around until the fix is determined, that may be the route we end up taking. But we're still sorting things out on our end.

@ArchangeGabriel
Copy link
Contributor

On ArchLinux (x86_64), we have the same issue but I can confirm that -fno-strict-aliasing solves it for us.

@WardF WardF modified the milestones: 4.8.1, 4.8.2 Jul 29, 2021
@WardF
Copy link
Member Author

WardF commented Jul 29, 2021

This error has been temporarily addressed w/ the configuration flags so that it is no longer blocking 4.8.1, and a 'proper' fix will be investigated for 4.8.2.

@WardF WardF modified the milestones: 4.8.2, 4.9.1 Jun 15, 2022
@WardF WardF modified the milestones: 4.9.1, 4.9.2 Feb 13, 2023
@WardF
Copy link
Member Author

WardF commented Feb 21, 2023

Thanks to the renewed interest, some additional info/reading provided by @dopplershift, and an expanded knowledge of -fsanitize, I think we have a fix for this. It also looks like we have a lot of additional work to do cleaning up some operations that are undefined, but consistent. But first things first. I'll be testing the potential fix for this shortly.

WardF added a commit to WardF/netcdf-c that referenced this issue Feb 21, 2023
…running nc_test/nc_test, in support of Unidata#1983, amongst others.  The cast as was being used was undefined behavior, and had to be worked around in a style approximating C++'s 'reinterpret_cast'
ZedThree added a commit to ZedThree/netcdf-c that referenced this issue Mar 7, 2023
* main: (1776 commits)
  Invert solution as discussed at Unidata#2618
  Correct a potential null pointer dereference.
  Fix a logic error that was resulting in an easy-to-miss error when running configure.
  Fix issue with dangling test file not getting cleaned up.
  Turn nczarr zip support off by default in cmake, add a status line indicating whether nczarr-zip-support is available, in libnetcdf.settings.
  Update the version of the cache action used by github action from v2 to v3.
  Explicit cast to unsigned char.
  More issues returned by sanitizer, related to attempts to assign MAX_UNSIGNED_CHAR (255) to a variable of type char.
  Fixed an issue where memcpy was potentially passed a null pointer.
  Correct another uninitialized issue.
  Correct undefined variable error.
  Fixing issues uncovered by compiling with '-fsanitize=undefined' and running nc_test/nc_test, in support of Unidata#1983, amongst others.  The cast as was being used was undefined behavior, and had to be worked around in a style approximating C++'s 'reinterpret_cast'
  Remove a stray character at the head file.
  Fix a distcheck failure with nczarr_test/run_interop.sh
  Turn benchmarks off by default. They require netcdf4, additional logic is required in order to have them on by default.
  Add execute bit to test scripts
  Fix missing endif statement
  Add generated parallel tests for nc_perf, cmake-based build system.
  Correct typo in CMakeLists.txt
  Wiring performance benchmarks into cmake, cleaned up serial compression performance test dependency on MPI.
  ...
clrpackages pushed a commit to clearlinux-pkgs/netcdf that referenced this issue Mar 22, 2023
Dennis Heimbigner (9):
      Modify H5FDhttp.c to work with HDF5 1.14.0
      update release notes
      Enable ACCEPT_ENCODING on DAP requests
      Update Release notes
      Fix a distcheck failure with nczarr_test/run_interop.sh
      Extend the dispatch table for H5FD back to version 1.13.2
      update RELEASENOTES
      Missed one occurrence of 1,14,0
      Fix byterange handling of some URLS

Greg Sjaardema (1):
      Missing `goto`

Magnus Ulimoen (1):
      Fix setting dest for non-m4 path

Mathieu Westphal (1):
      Avoid optionnaly depends on zip for NCZarr

Sergey Kosukhin (1):
      Fix macro usage

Ward Fisher (35):
      Update release notes, prepare to merge back upstream into development branch.
      Escape command symbol in doxygen template.
      Incorporate fix in support of Unidata/netcdf-c#2437 (comment)
      Update nc-config in support of Unidata/netcdf-c#2274
      Added benchmarking option to cmake-based builds, turned on unit-testing by default
      Wiring performance benchmarks into cmake, cleaned up serial compression performance test dependency on MPI.
      Correct typo in CMakeLists.txt
      Add generated parallel tests for nc_perf, cmake-based build system.
      Fix missing endif statement
      Add execute bit to test scripts
      Turn benchmarks off by default. They require netcdf4, additional logic is required in order to have them on by default.
      Remove a stray character at the head file.
      Fixing issues uncovered by compiling with '-fsanitize=undefined' and running nc_test/nc_test, in support of Unidata/netcdf-c#1983, amongst others.  The cast as was being used was undefined behavior, and had to be worked around in a style approximating C++'s 'reinterpret_cast'
      Correct undefined variable error.
      Correct another uninitialized issue.
      Fixed an issue where memcpy was potentially passed a null pointer.
      More issues returned by sanitizer, related to attempts to assign MAX_UNSIGNED_CHAR (255) to a variable of type char.
      Explicit cast to unsigned char.
      Update the version of the cache action used by github action from v2 to v3.
      Turn nczarr zip support off by default in cmake, add a status line indicating whether nczarr-zip-support is available, in libnetcdf.settings.
      Fix issue with dangling test file not getting cleaned up.
      Fix a logic error that was resulting in an easy-to-miss error when running configure.
      Correct a potential null pointer dereference.
      Invert solution as discussed at Unidata/netcdf-c#2618
      Bumped SO version.
      Update release notes.
      Bump version strings to reflect next development version.
      Correct jsonconvention map with netcdf version.
      Add hdf5 1.14.0 to GitHub CI.
      Expand CI testing with HDF5 1.14.0
      Set version strings.
      Adjust reference nczarr file for v4.9.2 release.
      Update .gitignore to allow generated files into version control for 4.9.2 release.
      Added generated files for release.
      Added generated test files for a stand-alone release.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants