Segmentation fault on MacOS with libomp 12.0.0 #7039

Closed
fractor opened this issue Jun 14, 2021 · 13 comments · Fixed by #7621
Comments

@fractor

fractor commented Jun 14, 2021

The following code results in a Python segmentation fault when executed with Python 3.x on MacOS with libomp 12.0.0. Downgrading to a previous version of libomp solves it.

# xgboost macos seg fault when libomp==12.0.0
# passes with libomp==11.1.0

"""
	verify libomp version
		brew list --version libomp
		libomp 12.0.0
	to install libomp 11.1.0
		wget https://raw.githubusercontent.com/chenrui333/homebrew-core/0094d1513ce9e2e85e07443b8b5930ad298aad91/Formula/libomp.rb
		brew unlink libomp
		brew install --build-from-source ./libomp.rb
		brew list --version libomp
"""

import logging
import numpy as np
import xgboost as xgb

def xgboost_unit_tests():
	feature_arr = np.random.uniform(size=(100,5))
	label_arr = np.random.randint(low=0, high=10, size=(100,1))
	data_arr = np.concatenate((feature_arr, label_arr), axis=1)
	#Run Attribute Rank with -O 2
	run_xgboost(data_arr,data_arr)
	logging.info("XGBoost complete")
	return


def run_xgboost(trainarr, valarr):
	trainfeats = trainarr[:,:-1]
	trainlabels = trainarr[:,-1]
	# first seg fault
	dtrain = xgb.DMatrix(trainfeats, label=trainlabels)
	valfeats = valarr[:,:-1]
	vallabels = valarr[:,-1]
	# second seg fault
	dval = xgb.DMatrix(valfeats, label=vallabels)
	return


# Press the green button in the gutter to run the script.
if __name__ == '__main__':
	xgboost_unit_tests()
@andy-brainome

Just to be clear: when everything was pared down to these 40-some lines of code, the seg fault 11 was replaced with the dumping of source code to the screen, i.e. "warnings.warn(". This line of code has shifted up or down a couple of times.

andys@MacBook-Pro:~/work/xgboost-macos-seg-fault-11$ python3 main.py
/Users/andys/.pyenv/versions/3.8.7/lib/python3.8/site-packages/xgboost/data.py:104: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
  warnings.warn(
andys@MacBook-Pro:~/work/xgboost-macos-seg-fault-11$

We believe that this code leakage is what is producing the seg fault in our much more complex program, which can be installed with
pip install brainome

@andy-brainome

The last workaround we figured out: importing xgboost immediately at startup clears the macOS seg fault 11 issue. Our program had been importing it only when needed, very late in the execution cycle.
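
For anyone who wants to try the same workaround, here is a minimal sketch of what "import early" could look like. The helper name preload is made up for illustration and is not part of xgboost; it only forces the import to happen at startup and reports whether it succeeded.

```python
import importlib


def preload(module_names):
    """Import each module eagerly; return {name: True/False} for whether it loaded.

    Hypothetical helper for illustration only, not an xgboost API.
    """
    loaded = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded[name] = True
        except ImportError:
            # The module is not installed (or failed to import); record and continue.
            loaded[name] = False
    return loaded


# Intended use, as the very first statement of the program's entry point,
# before any other heavy imports:
#     preload(["xgboost"])
```

Why this helped in our case is not established in this thread, so treat it as a workaround rather than a fix.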

@trivialfis
Member

trivialfis commented Jun 14, 2021

We noticed it in our GitHub Actions tests. It seems to be a bug in libomp: https://bugs.llvm.org/show_bug.cgi?id=50579

  • Related:

@andy-brainome

Thank you. Hopefully this will document the workarounds for the next folks who fall into this trap.

@rajivshah3

If anyone else runs into this issue, I have a Homebrew tap for libomp 11.1.0. You can install it with brew install rajivshah3/libomp-tap/libomp@11.1.0

@zychen423

@trivialfis

The maintainers of libomp have noticed this issue, but they cannot reproduce the crash with the information I provided. Maybe you could take a look at this thread and provide more useful information (than mine😢)

Thank you

@trivialfis
Member

trivialfis commented Jan 20, 2022

My apologies, I don't have a Mac and cannot provide any further information. If anyone is watching this thread, please assist the libomp maintainers.

@andy-brainome

I have a really simple test harness that reproduces the segmentation fault at
https://github.com/brainome/xgboost-macos-libomp-seg-fault/blob/main/main.py

@hcho3
Collaborator

hcho3 commented Jan 29, 2022

Hello everyone, I tested libomp 13.0.0 today on my MacBook Pro and I was able to run the example script without any segfault.

I'll close this issue once I test libomp 13.0.0 in CI.

P.S. Some findings:

  • XGBoost built using libomp 12.0.0 from Homebrew => segfaults
  • XGBoost built using libomp 12.0.0 from Conda => no segfault. Users who installed XGBoost using Conda package manager were not affected by the issue.
  • libomp 12.0.0 from Conda applies a patch (https://reviews.llvm.org/D105308), whereas libomp 12.0.0 from Homebrew applies no patch. So we can conclude that the patch fixes the segfault.
  • The patch is part of libomp 13.0.0.

@hcho3
Collaborator

hcho3 commented Jan 30, 2022

Update: I ran into another segfault, this time using libomp 13.0.0. #7618 is hanging too.

It appears that two different versions of libomp are being loaded in, one from Conda (llvm-openmp), and another one from Homebrew.

Stack trace:

* thread #9, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000011e62e54f /usr/local/Caskroom/miniconda/base/lib/libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 47
    frame #1: 0x000000011ef33f00 /usr/local/opt/libomp/lib/libomp.dylib`kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 1440
    frame #2: 0x000000011ef30e97 /usr/local/opt/libomp/lib/libomp.dylib`__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 151
    frame #3: 0x000000011ef3362c /usr/local/opt/libomp/lib/libomp.dylib`__kmp_fork_barrier(int, int) + 445
    frame #4: 0x000000011ef1bf5a /usr/local/opt/libomp/lib/libomp.dylib`__kmp_launch_thread + 194
    frame #5: 0x000000011ef48934 /usr/local/opt/libomp/lib/libomp.dylib`__kmp_launch_worker(void*) + 278
    frame #6: 0x00007fff6a968109 libsystem_pthread.dylib`_pthread_start + 148
    frame #7: 0x00007fff6a963b8b libsystem_pthread.dylib`thread_start + 15

The undefined behavior must have been going on for a while, but went unnoticed until now.
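
To check whether a process ends up with more than one OpenMP runtime, as in the stack trace above, something like the following sketch may help. The function names are hypothetical. The macOS branch assumes the dyld functions _dyld_image_count and _dyld_get_image_name can be reached through ctypes, and the Linux branch assumes /proc/self/maps is readable.

```python
import ctypes
import os
import re
import sys

# Basenames of common OpenMP runtimes: LLVM (libomp), Intel (libiomp5), GNU (libgomp).
_OPENMP_PATTERN = re.compile(r"^lib(omp|iomp5|gomp)\b.*\.(dylib|so)(\.\d+)*$")


def find_openmp_runtimes(paths):
    """Return the sorted, de-duplicated paths that look like OpenMP runtimes."""
    return sorted({p for p in paths if _OPENMP_PATTERN.match(os.path.basename(p))})


def loaded_library_paths():
    """Best-effort list of shared libraries loaded into the current process."""
    if sys.platform == "darwin":
        # Assumption: the dyld image functions are resolvable from the global namespace.
        dyld = ctypes.CDLL(None)
        dyld._dyld_image_count.restype = ctypes.c_uint32
        dyld._dyld_get_image_name.restype = ctypes.c_char_p
        dyld._dyld_get_image_name.argtypes = [ctypes.c_uint32]
        count = dyld._dyld_image_count()
        return [dyld._dyld_get_image_name(i).decode() for i in range(count)]
    if sys.platform.startswith("linux"):
        # Each mapped file appears as the last field of a line in /proc/self/maps.
        with open("/proc/self/maps") as maps:
            return sorted({line.split()[-1] for line in maps if "/" in line})
    return []
```

After importing the packages under suspicion, calling find_openmp_runtimes(loaded_library_paths()) and getting back more than one path would suggest the same situation as above: one copy from Conda and another from Homebrew.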

@hcho3
Collaborator

hcho3 commented Jan 30, 2022

I'm inclined to adopt the fix from scikit-learn/scikit-learn#22109. The idea is to bundle version 11.1.0 of libomp into the Python wheel. This solution is known to work in the presence of other packages (scikit-learn/scikit-learn#21227 (comment)), and it's battle-tested.

In the long term, the Python community will need a robust method to keep only one copy of OpenMP runtime running (scikit-learn/scikit-learn#21227 (review)), but for now this workaround is good enough.

@pallasathena92

pallasathena92 commented Jun 10, 2022

A data point. My unit test shows:

 /Users/yifeliu/venv_3.8/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py:14: UserWarning: Cannot load onnxruntime.capi. Error: 'dlopen(/Users/yifeliu/venv_3.8/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so, 2): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
    Referenced from: /Users/yifeliu/venv_3.8/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so
    Reason: image not found'.
    warnings.warn("Cannot load onnxruntime.capi. Error: '{0}'.".format(str(e)))

So I ran brew install libomp, which installed libomp 14.0.4, and that gave me "segmentation fault 11".
After downgrading to 11.1.0, it works.

@AlainOUYANG

AlainOUYANG commented Aug 8, 2022

The problem still seems to exist for libomp v14.0.4.

Here is my solution (8 Jul 2022):

  1. Uninstall all xgboost packages, whether they were installed by conda or pip.

  2. Build libomp from source (the commands from @fractor ):

    1. wget https://raw.githubusercontent.com/chenrui333/homebrew-core/0094d1513ce9e2e85e07443b8b5930ad298aad91/Formula/libomp.rb
    2. brew unlink libomp
    3. brew install --build-from-source ./libomp.rb
    4. brew list --version libomp
  3. Install xgboost using pip. The conda one (v1.5.1) is not working now (I don't know why).
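
The version check in step 2.4 can also be scripted. This is only a sketch: the function names are made up, the expected version 11.1.0 is simply the one reported to work in this thread, and it calls brew list --versions (the steps above spell the flag --version).

```python
import re
import subprocess

# Version reported in this thread as working; an assumption, not an official recommendation.
EXPECTED_VERSION = (11, 1, 0)


def parse_brew_version(output):
    """Parse output such as 'libomp 12.0.0' into a tuple like (12, 0, 0), or None."""
    match = re.search(r"libomp\s+(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_libomp_version():
    """Ask Homebrew for the installed libomp version; None if brew or libomp is missing."""
    try:
        result = subprocess.run(
            ["brew", "list", "--versions", "libomp"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # Homebrew is not installed or not on PATH.
        return None
    return parse_brew_version(result.stdout)


def is_expected_version(version):
    """True if the parsed version tuple equals EXPECTED_VERSION."""
    return version == EXPECTED_VERSION
```

For example, is_expected_version(installed_libomp_version()) should be true after step 2.3 if the downgrade took effect.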

Socrats added a commit to Socrats/EGTTools that referenced this issue Jan 20, 2023
- if the numerical module is loaded before numpy, a segmentation fault is produced with EXC_BAD_ACCESS.
- This is a known issue (see dmlc/xgboost#7039).
- We fix this by loading numpy and then deleting it
- in the future we need to look into why exactly this happens