libomp 12.0.0 introduces assertion failure when running OpenMP from a background thread #49923

Closed
0x6e mannequin opened this issue Jun 4, 2021 · 19 comments
Assignees
Labels
bugzilla Issues migrated from bugzilla openmp platform:macos

Comments

@0x6e
Mannequin

0x6e mannequin commented Jun 4, 2021

Bugzilla Link 50579
Version unspecified
OS MacOS X
Blocks #51489
Attachments Minimum working example
CC @milianw,@jprotze,@shiltian,@tstellar

Extended Description

Overview:
Our application runs OpenMP from a background thread, and multiple threads may end up doing so. This worked fine with OpenMP 11.0.0 from Homebrew. However, upgrading to OpenMP 12.0.0 introduced OMP: Error #13: Assertion failure at kmp_runtime.cpp(3689).

Steps to reproduce:
I have attached a minimum working example.

To ease switching between OpenMP versions using brew I performed the following steps:
$ brew unlink libomp
$ cp -R /usr/local/Cellar/libomp /usr/local/Cellar/libomp@12.0.0
$ mv /usr/local/Cellar/libomp /usr/local/Cellar/libomp@11.0.0
$ rm -r /usr/local/Cellar/libomp@12.0.0/11.0.0
$ rm -r /usr/local/Cellar/libomp@11.0.0/12.0.0
$ brew link libomp@11.0.0

Compiling simply with:
$ tar xvf nested-openmp.tgz && cd nested-openmp
$ mkdir build && cd build
$ cmake ..
$ make

Running with libomp@11.0.0 causes no issues. Switching to libomp@12.0.0 results in the crash shown below:
$ ./nested-openmp
$ brew unlink libomp@11.0.0 && brew link libomp@12.0.0
$ ./nested-openmp
OMP: Error #13: Assertion failure at kmp_runtime.cpp(3689).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[1] 15285 abort ./nested-openmp
$ brew unlink libomp@11.0.0 &&

@jprotze
Collaborator

jprotze commented Aug 20, 2021

I cannot reproduce the issue on my Linux system.

Looking at the code location for the assertion, my suspicion is that the issue might come from the hidden helper threads.
@Shilei, do you know whether this is fixed in the main/release13 branch?

https://github.com/llvm/llvm-project/blob/llvmorg-12.0.0/openmp/runtime/src/kmp_runtime.cpp#L3689

Nathan, can you please test with:
env LIBOMP_USE_HIDDEN_HELPER_TASK=0 LIBOMP_NUM_HIDDEN_HELPER_THREADS=0 ./nested-openmp

Also, which C++ library do you use: libstdc++ or libc++ (see the output of ldd ./nested-openmp)? If you use libstdc++, which gcc version?
The behavior for std::async might be different.

@0x6e
Mannequin Author

0x6e mannequin commented Aug 20, 2021

> Nathan, can you please test with:
> env LIBOMP_USE_HIDDEN_HELPER_TASK=0 LIBOMP_NUM_HIDDEN_HELPER_THREADS=0 ./nested-openmp

With those environment variables, the test application works as expected.

> Also, which C++ library do you use: libstdc++ or libc++ (see the output of ldd ./nested-openmp)? If you use libstdc++, which gcc version?
> The behavior for std::async might be different.

I am using libc++:
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 905.6.0)
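
For anyone who wants to apply the workaround confirmed above from a launcher script rather than by hand, here is a minimal sketch. The two variable names come from the comment above; the helper names `env_without_hidden_helpers` and `run_repro` are hypothetical, and the default binary path simply mirrors the reproducer in this report.

```python
import os
import subprocess

# Variable names are taken from the suggestion in this thread.
HIDDEN_HELPER_OFF = {
    "LIBOMP_USE_HIDDEN_HELPER_TASK": "0",
    "LIBOMP_NUM_HIDDEN_HELPER_THREADS": "0",
}


def env_without_hidden_helpers(base=None):
    """Return a copy of the environment with hidden helper threads disabled."""
    env = dict(os.environ if base is None else base)
    env.update(HIDDEN_HELPER_OFF)
    return env


def run_repro(binary="./nested-openmp"):
    """Run the reproducer (hypothetical default path) and return its exit code."""
    return subprocess.run([binary], env=env_without_hidden_helpers()).returncode
```

Launching a child process this way ensures the variables are in place before libomp starts. Setting them inside an already running process is presumably only effective if it happens before the runtime initializes, so that case is an assumption to verify.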

@shiltian
Contributor

I cannot reproduce the issue on Linux either. Is it on macOS as I see Homebrew was being used?

@shiltian
Contributor

> I cannot reproduce the issue on Linux either. Is it on macOS as I see Homebrew was being used?

Oh, yeah, it is macOS. I’ll try locally on my Mac.

@tstellar
Collaborator

mentioned in issue #51489

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@asl asl added this to the LLVM 13.0.1 release milestone Dec 12, 2021
@tstellar
Collaborator

The deadline for requesting fixes for the release has passed. This bug is being removed from the LLVM 13.0.1 release milestone. If you have a fix or think this bug is important enough to block the release, please explain why in a comment and add the bug back to the LLVM 13.0.1 release milestone.

@tstellar tstellar removed this from the LLVM 13.0.1 release milestone Dec 21, 2021
@shiltian shiltian self-assigned this Jan 5, 2022
@shiltian
Contributor

shiltian commented Jan 6, 2022

I cannot reproduce the failure on macOS with trunk either. Besides, hidden helper threads should be disabled on macOS. I don't know whether the latest Homebrew version already covers the newer code base. Please let me know if the problem still exists.

@zychen423

Hi @shiltian,

I can confirm this problem still exists on my Mac. My setup is:

  • macOS 11.2.3
  • Homebrew 3.3.10-67-g537036a
  • Homebrew/homebrew-core (git revision 55aa98ff208; last commit 2022-01-17)
  • Homebrew/homebrew-cask (git revision 74952a4cee; last commit 2022-01-17)

With that setup I encountered a segmentation fault, just like microsoft/LightGBM#4229 and dmlc/xgboost#7106. In the end I downgraded libomp to 11.1.0 to get rid of the issue.

Please let me know if I should provide more information.

Thank you.

@shiltian
Contributor

shiltian commented Jan 17, 2022

Still cannot reproduce the failure on macOS 12.1. Here is what I did:

➜  nested-openmp brew install libomp
Running `brew update --preinstall`...
==> Auto-updated Homebrew!
Updated 1 tap (homebrew/core).
==> Updated Formulae
Updated 1 formula.

==> Downloading https://ghcr.io/v2/homebrew/core/libomp/manifests/13.0.0
######################################################################## 100.0%
==> Downloading https://ghcr.io/v2/homebrew/core/libomp/blobs/sha256:fe1a6935e1da396268818c653a8fa8c56f34fca1e46636dd95605110cbf8446c
==> Downloading from https://pkg-containers.githubusercontent.com/ghcr1/blobs/sha256:fe1a6935e1da396268818c653a8fa8c56f34fca1e46636dd95605110cbf8446c?se
######################################################################## 100.0%
==> Pouring libomp--13.0.0.monterey.bottle.tar.gz
🍺  /usr/local/Cellar/libomp/13.0.0: 9 files, 1.6MB
==> Running `brew cleanup libomp`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
brew install libomp  4.59s user 4.65s system 101% cpu 9.121 total
➜  nested-openmp ll /usr/local/Cellar/libomp/13.0.0
total 80
drwxr-xr-x  8 shiltian  admin   256B Jan 17 16:02 ./
drwxr-xr-x  3 shiltian  admin    96B Jan 17 16:02 ../
drwxr-xr-x  3 shiltian  admin    96B Sep 24 12:18 .brew/
-rw-r--r--  1 shiltian  admin   917B Jan 17 16:02 INSTALL_RECEIPT.json
-rw-r--r--  1 shiltian  admin    19K Sep 24 12:18 LICENSE.TXT
-rw-r--r--  1 shiltian  admin    14K Sep 24 12:18 README.rst
drwxr-xr-x  5 shiltian  admin   160B Sep 24 12:18 include/
drwxr-xr-x  4 shiltian  admin   128B Sep 24 12:18 lib/
➜  nested-openmp $DEPLOY_ROOT/llvm/release/bin/clang++ -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -std=c++17 -fopenmp -I/usr/local/Cellar/libomp/13.0.0/include -L/usr/local/Cellar/libomp/13.0.0/lib  main.cpp -o main -g -lpthread
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.12s user 0.75s system 1230% cpu 0.234 total
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.10s user 0.64s system 1293% cpu 0.212 total
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.10s user 0.74s system 1307% cpu 0.217 total
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.11s user 0.72s system 1284% cpu 0.221 total
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.16s user 0.70s system 1290% cpu 0.222 total
➜  nested-openmp ./main
Sum is: 1480000000
./main  2.13s user 0.74s system 1322% cpu 0.217 total

I also tried building OpenMP from scratch using the Homebrew formula, and I cannot see the crash that way either.
The CMake command I was using is:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=$DEPLOY_ROOT/llvm/release/bin/clang -DCMAKE_CXX_COMPILER=$DEPLOY_ROOT/llvm/release/bin/clang++ -DLIBOMP_ENABLE_SHARED=OFF -DLIBOMP_INSTALL_ALIASES=OFF -DOPENMP_ENABLE_LIBOMPTARGET=OFF -DOPENMP_FILECHECK_EXECUTABLE=$DEPLOY_ROOT/llvm/release/bin/FileCheck -DOPENMP_NOT_EXECUTABLE=$DEPLOY_ROOT/llvm/release/bin/not -B $BUILD_ROOT/openmp/debug -S $HOME/Documents/code/llvm-project/openmp --install-prefix=$DEPLOY_ROOT/openmp/debug

The way I compile the test case is:

$DEPLOY_ROOT/llvm/release/bin/clang++ -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -L$DEPLOY_ROOT/openmp/debug/lib -Wl,-rpath,$DEPLOY_ROOT/openmp/debug/lib -std=c++17 -fopenmp -I$DEPLOY_ROOT/openmp/debug/include  main.cpp -o main -g -lpthread

@shiltian
Contributor

@zychen423 I suspect the failure might only happen with a specific number of threads. To try to reproduce it, I need to know how many cores and threads your Mac has.
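
One way to probe the thread-count hypothesis above without access to a different machine is to sweep `OMP_NUM_THREADS` over a range of values. The sketch below is only an approximation: `OMP_NUM_THREADS` caps the default team size but does not change the hardware concurrency the runtime detects, so it may not reproduce everything an actual 8-core machine does. The function names and the `./main` default are hypothetical.

```python
import os
import subprocess


def thread_count_envs(counts, base=None):
    """Yield (n, env) pairs, each env setting OMP_NUM_THREADS to n."""
    env0 = dict(os.environ if base is None else base)
    for n in counts:
        env = dict(env0)
        env["OMP_NUM_THREADS"] = str(n)
        yield n, env


def sweep(binary="./main", counts=range(1, 17)):
    """Run the binary once per thread count; collect non-zero exit codes."""
    failures = []
    for n, env in thread_count_envs(counts):
        rc = subprocess.run([binary], env=env).returncode
        if rc != 0:
            # A negative return code means the child was killed by a signal.
            failures.append((n, rc))
    return failures
```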

@zychen423

@shiltian, I am on an M1 MacBook Pro with 8 cores and 8 threads:

❯ sysctl -n hw.ncpu
8
❯ sysctl hw.physicalcpu
hw.physicalcpu: 8
❯ sysctl hw.logicalcpu
hw.logicalcpu: 8
❯ python -c 'import multiprocessing as mp; print(mp.cpu_count())'
8

By the way, the segmentation fault happened when I used xgboost in my code. Here is a minimal example that works under libomp 11 but crashes with libomp 12 and 13:

import logging
import numpy as np
import xgboost as xgb

def xgboost_unit_tests():
	feature_arr = np.random.uniform(size=(100,5))
	label_arr = np.random.randint(low=0, high=10, size=(100,1))
	data_arr = np.concatenate((feature_arr, label_arr), axis=1)
	#Run Attribute Rank with -O 2
	run_xgboost(data_arr,data_arr)
	logging.info("XGBoost complete")
	return


def run_xgboost(trainarr, valarr):
	trainfeats = trainarr[:,:-1]
	trainlabels = trainarr[:,-1]
	# first seg fault
	dtrain = xgb.DMatrix(trainfeats, label=trainlabels)
	valfeats = valarr[:,:-1]
	vallabels = valarr[:,-1]
	# second seg fault
	dval = xgb.DMatrix(valfeats, label=vallabels)
	return


# Press the green button in the gutter to run the script.
if __name__ == '__main__':
	xgboost_unit_tests()

I would like to see if you can run this script. Thanks in advance!
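
The original report describes OpenMP being entered from several background threads, whereas the script above calls xgboost from the main thread only. As a rough sketch of how the two could be combined, the hypothetical helper below submits a callable to a pool of worker threads. It assumes the callable (for example `xgboost_unit_tests` from the script above) is what ends up entering OpenMP; it is not the attached C++ reproducer.

```python
from concurrent.futures import ThreadPoolExecutor


def run_from_background_threads(fn, n_threads=4, calls_per_thread=1):
    """Call fn from a pool of worker threads; return results in submission order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(fn) for _ in range(n_threads * calls_per_thread)]
        # result() re-raises any Python exception raised inside fn.
        return [f.result() for f in futures]
```

A call such as `run_from_background_threads(xgboost_unit_tests)` would exercise the script from worker threads. Note that a native abort or segmentation fault terminates the whole process rather than raising a Python exception, so the helper cannot catch it.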

@shiltian
Contributor

shiltian commented Jan 19, 2022

@zychen423 I can run the script but cannot reproduce the crash. My Mac has 16 cores, so I created a VM w/ 8C8T. I installed libomp via Homebrew and xgboost via pip3. I checked the dependencies of libxgboost.dylib and confirmed that everything looks good.

$ otool -L libxgboost.dylib 
libxgboost.dylib:
	@rpath/libxgboost.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/local/opt/libomp/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 904.4.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

I ran the Python script, and it simply exits w/o any error.
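
For comparing setups across reporters, the dependency check above can be automated. The following is a small sketch that parses `otool -L` text in the format shown above and picks out the OpenMP runtime and the C++ standard library. The function names are hypothetical, and the parser assumes the usual `path (compatibility version ...)` line layout.

```python
import re


def parse_otool_libs(output):
    """Return the install names listed by `otool -L`, skipping the header line."""
    libs = []
    for line in output.splitlines()[1:]:
        m = re.match(r"\s+(\S+) \(compatibility version", line)
        if m:
            libs.append(m.group(1))
    return libs


def summarize(output):
    """Pick out the OpenMP runtime and C++ standard library entries."""
    libs = parse_otool_libs(output)
    return {
        "libomp": [p for p in libs if "libomp" in p],
        "cxx_runtime": [p for p in libs if "libc++" in p or "libstdc++" in p],
    }
```

This shows which libomp path a binary links against, but as the `current version 5.0.0` entry above suggests, the listing does not by itself reveal which LLVM release that libomp came from.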

@JonChesterfield
Collaborator

The M1 is an ARM chip. If this is LLVM running natively, then it may be more vulnerable to concurrency bugs than x64. (TSO hides various mistakes that ARM in general, and I guess Apple's M1, are not as tolerant of.)

@zychen423
Copy link

@shiltian I see. I've asked others who encounter this issue to take a look. Is there anything more I can do to help the situation? Thank you

@shiltian
Contributor

@zychen423 If the issue is ARM-specific, like Jon suggested, I can't do much for now because my machine is Intel-based. If I knew how to reproduce it on an Intel-based processor, that would be great. Thanks again for your info.

@andy-brainome

As I just posted on xgboost#7039, I have a very simple reproduction for a libomp/xgboost segmentation fault at
https://github.com/brainome/xgboost-macos-libomp-seg-fault/blob/main/main.py

@shiltian
Contributor

shiltian commented Jan 21, 2022

@andy-brainome Thanks for the info, but unfortunately it exits w/o any error on my Mac. Does your Mac have an M1 processor? And how many cores and threads does it have?

@andy-brainome

> @andy-brainome Thanks for the info, but unfortunately it exits w/o any error on my Mac. Does your Mac have an M1 processor? And how many cores and threads does it have?

No M1 here - I have a 2019 MacBook Pro running Big Sur 11.2.3 on a 6 core Intel 1.7 with 12 threads.
I upgraded/unpinned libomp to version 13.0.0 and was able to pass my unit tests where they had failed previously.
libomp v13.0 also passed pip distro brainome==1.7.108 yay!
I'll give it a good going over when I port my build to cibuildwheel.

I'm glad this crash got someone's attention - as I recall, the execution stack was slightly off and python was trying to execute the string parameter of the adjacent logging statement just prior to returning. Didn't know python could do that. Very nasty hole in the vm.

If someone could post a url for brewing libomp v12.0.0, I'll plug it into my ci/cd and reproduce the unit tests.

esc added a commit to esc/numba that referenced this issue Feb 18, 2022
The `llvm-openmp` package has a Numba relevant bug that only manifests
when using that package on M1 silicon (new Apple chips) in version 12.
This will allow building Numba (the headers are needed) but disallow
running it with `llvm-openmp` in version 12.* on Apple M1 silicon.

The `llvm-openmp` bug is here:

llvm/llvm-project#49923

This will also be future proof and allow `llvm-openmp` to be installed
in version 13 as soon as that becomes available from `main`/`defaults` or
if people want to use `conda-forge`, since that already has version 13.
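
The kind of guard this commit message describes could be sketched as follows. This is not Numba's actual implementation; `is_affected` is a hypothetical helper that only mirrors the scope stated in the commit message (version 12.* on Apple M1), even though other reports in this thread involve Intel Macs.

```python
import platform


def is_affected(libomp_version, machine=None, system=None):
    """Return True for a 12.* OpenMP runtime on Apple silicon macOS.

    The machine and system arguments default to the current platform.
    """
    machine = platform.machine() if machine is None else machine
    system = platform.system() if system is None else system
    major = int(libomp_version.split(".")[0])
    return system == "Darwin" and machine == "arm64" and major == 12
```

A Python interpreter running under Rosetta reports `x86_64` from `platform.machine()`, so a check like this would treat it as an Intel machine.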
@shiltian
Contributor

shiltian commented Feb 25, 2022

Version 12 is no longer maintained, so I am closing this issue. Feel free to reopen it if anyone can still observe the issue with the latest version.

devernay added a commit to NatronGitHub/Natron that referenced this issue Jun 14, 2022
…M side

llvm has moved to GitHub issues; here's the corresponding GitHub issue:
llvm/llvm-project#49923
which was closed because they couldn't repro.
see also microsoft/LightGBM#4229 (comment)
let us keep the pin until someone confirms it's fixed
7 participants