Julia crashes for multithreaded Stack for some non-Julia models #783

Open
ablaom opened this issue Jun 9, 2022 · 3 comments

Comments

@ablaom
Member

ablaom commented Jun 9, 2022

Context: #767 adds support for an option acceleration=CPUThreads() in composite model types defined by "exporting" learning networks, and implements this option for Stack. I have been carrying out MLJ ecosystem integration tests of the new Stack with a large number of models as base models in the stack. If the base model comes from one of the non-Julia packages ScikitLearn.jl, XGBoost.jl or LIBSVM.jl, and I include CPUThreads() in the testing, then Julia crashes. I have not been able to reliably reproduce the crashes with a "minimal example", but the following seems to do the job on my machine:

using Pkg
Pkg.activate(temp=true)
Pkg.add(
    url="https://github.com/JuliaAI/MLJBase.jl",
    rev="stack_cache_and_acceleration",
)
Pkg.add(
    url = "https://github.com/JuliaAI/MLJTestIntegration.jl",
    rev= "multi-threading",
)
Pkg.add("NearestNeighborModels")
Pkg.add("MLJLIBSVMInterface")
Pkg.add("XGBoost")
Pkg.instantiate()

julia> Pkg.status()
      Status `/private/var/folders/4n/gvbmlhdc8xj973001s6vdyw00000gq/T/jl_wRKoZO/Project.toml`          
  [a7f614a8] MLJBase v0.20.2 `https://github.com/JuliaAI/MLJBase.jl#stack_cache_and_acceleration`        
  [61c7150f] MLJLIBSVMInterface v0.2.0
  [697918b4] MLJTestIntegration v0.1.0 `https://github.com/JuliaAI/MLJTestIntegration.jl#multi-threading`
  [636a865e] NearestNeighborModels v0.2.0
  [009559a3] XGBoost v1.5.2

using MLJBase
using NearestNeighborModels
using MLJLIBSVMInterface
using MLJTestIntegration
using XGBoost

model = EpsilonSVR()

models = (knn1=KNNRegressor(K=4),
          knn2=KNNRegressor(K=6),
          model=model)

metalearner = KNNRegressor()
measure = LPLoss(2)

# mini Boston:
y, X = unpack(MLJBase.load_boston(), ==(:MedV), col->col in [:LStat, :Rm])
data = (X, y)

mystack = Stack(
    ; metalearner,
    resampling=CV(;nfolds=3),
    acceleration=CPUThreads(),
    models...)

julia> MLJTestIntegration.test_single_target_regressors(
    [(name="EpsilonSVR", package_name="LIBSVM"),],
    level=4,
    verbosity=2
)
┌ Info: 
└ Testing EpsilonSVR from LIBSVM
[ Info: [:model_type] Loading model type ✓
[ Info: [:model_instance] Instantiating default model ✓
[ Info: [:fitted_machine] Fitting machine ✓
[ Info: [:operations] Calling `predict`, `transform` and/or `inverse_transform` ✓
[ Info: [evaluation] Evaluating model performance using with 1 resources. ✓
Internal repeatability tests, 50 of 50 trials complete ✓ Repeatable.
[ Info: Testing with 5 threads. 
[ Info: [:accelerated_evaluation] Evaluating model performance using with 2 resources. ✓
[ Info: [:tuned_pipe_evaluation] Evaluating perfomance in a tuned pipeline ✓
[ Info: [:ensemble_prediction] Ensembling ✓
[ Info: [stack_evaluation] Evaluating a stack containing model with 1 resources. ✓

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b82aca3)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:43
unknown function (ip: 0x10b80f59c)
Allocations: 279946573 (Pool: 279865905; Big: 80668); GC: 248
...

Interestingly, if I remove MLJXGBoostInterface from the environment and drop the `using XGBoost` line, there are no issues and the tests pass.

I do not seem to have problems with any pure Julia models.

While trying to isolate the problem, I have encountered various errors, such as:

OMP: Error #13: Assertion failure at kmp_csupport.cpp(540).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.

signal (6): Abort trap: 6
in expression starting at REPL[2]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 105303846 (Pool: 105260636; Big: 43210); GC: 106

julia(70986,0x70000783d000) malloc: *** error for object 0x7ff0725333e0: pointer being freed was not allocated
julia(70986,0x70000783d000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/sandbox/crash.jl:46

signal (11): Segmentation fault: 11
in expression starting at /Users/anthony/sandbox/crash.jl:46
Allocations: 279191441 (Pool: 279111122; Big: 80319); GC: 222

julia(90542,0x7000079c6000) malloc: Incorrect checksum for freed object 0x7f8da2b121a8: probably modified after being freed.
Corrupt value: 0x7f8da2b1b4c0
julia(90542,0x7000079c6000) malloc: *** set a breakpoint in malloc_error_break to debug

signal (6): Abort trap: 6
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

signal (4): Illegal instruction: 4
in expression starting at /Users/anthony/MLJ/MLJTestIntegration/examples/bigtest/notebook.jl:35

I am running with 5 threads.

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
@pazzo83
Collaborator

pazzo83 commented Jun 9, 2022

Interesting: I also get these problems (intermittently) on an M1 Mac with non-Julia models (XGBoost, LightGBM, etc.), but in my case they occur when I do cross-validation (calling `evaluate`) with multithreading enabled. It is similarly hard for me to produce a minimal example, but I get the same exceptions and segfaults that you do.

@olivierlabayle
Collaborator

Same thing here: a simple loop with only an SVM in the Stack produces the error on my side, if that helps:

# X, y as defined in the first example
metalearner = EpsilonSVR()
models = (model=EpsilonSVR(),)
mystack = Stack(
    ; metalearner,
    resampling=CV(;nfolds=3),
    cache=false,
    acceleration=CPUThreads(),
    models...)

# Fit the stack repeatedly (verbosity = 0)
for i in 1:3
    fitresult, _, _ = fit(mystack, 0, X, y)
end

I noticed LIBSVM also has internal multithreading; could that be related?

@OkonSamuel
Member

OkonSamuel commented Jul 1, 2022

It appears LIBSVM isn't thread-safe: see JuliaML/LIBSVM.jl#60.
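If the root cause is that the wrapped C library is not thread-safe, one possible workaround is to serialize every call into the library with a single global lock. This is a sketch under that assumption, not something MLJ or LIBSVM.jl currently provides. With this approach, `CPUThreads()` can still parallelize the Julia-side work, but only one thread at a time enters the C code. In the example, `unsafe_libcall!` is a hypothetical pure-Julia stand-in for a non-reentrant C routine that mutates shared state. `LIB_LOCK` and `guarded_libcall!` are likewise illustrative names.

```julia
using Base.Threads

# Hypothetical stand-in for a non-thread-safe C routine: it reads and
# writes shared global state, and `yield()` widens the window in which
# another task could interleave if the call were not guarded.
const SHARED_STATE = Ref(0)

function unsafe_libcall!()
    current = SHARED_STATE[]
    yield()
    SHARED_STATE[] = current + 1
    return nothing
end

# One lock shared by every caller of the (hypothetical) library.
const LIB_LOCK = ReentrantLock()

# Serialize access: only one task at a time executes the guarded call.
function guarded_libcall!()
    lock(LIB_LOCK) do
        unsafe_libcall!()
    end
end

# Simulate many concurrent calls, as parallel fold fits might issue.
@threads for _ in 1:1000
    guarded_libcall!()
end

println("calls recorded: ", SHARED_STATE[])  # prints "calls recorded: 1000"
```

The trade-off is that calls into the wrapped library no longer run in parallel. Any speed-up from `CPUThreads()` would then come only from the Julia-side work, such as a pure-Julia metalearner like `KNNRegressor`. If the model interface cannot be modified, the simpler option is presumably to fall back to `acceleration=CPU1()` for stacks that contain these models.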

Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants