Bus error 10 on 1.8.2 aarch64 #47193

Closed
evetion opened this issue Oct 17, 2022 · 14 comments · Fixed by yeesian/ArchGDAL.jl#352
Labels
system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips

Comments

@evetion
Member

evetion commented Oct 17, 2022

Since 1.8.2 on Mac M1 (aarch64) the following snippet will crash Julia:

using GeoArrays
ga = GeoArray(rand(24, 24))
GeoArrays.write("test_cog.tif", ga, shortname="COG")

with

signal (10): Bus error: 10
in expression starting at test_crash.jl:3
unknown function (ip: 0x14aae0b08)
Allocations: 16899091 (Pool: 16892413; Big: 6678); GC: 12

In 1.8.1 (and in 1.8.2 on other (CI) platforms) this works. I will try to debug this further; under the hood (via ArchGDAL, GDAL, GDAL_jll) this ccalls an external C++ library.

Might relate to #47171.

@giordano giordano added the system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips label Oct 17, 2022
@evetion
Member Author

evetion commented Oct 17, 2022

@gbaraldi
Member

It does seem to come from inside libnetcdf, so it might not be Julia.

@evetion
Member Author

evetion commented Oct 17, 2022

I agree the backtrace looks suspect, but this code, and the same library, works correctly on previous Julia versions.

@gbaraldi
Member

gbaraldi commented Oct 17, 2022

The same code works on 1.8.1 on an aarch64 Mac but not on 1.8.2.
I was able to reproduce the failure. Will bisect.

@gbaraldi
Member

Bisected to #45173. @JeffBezanson

@gbaraldi
Member

On master I don't get a bus error, but I do get a segfault.

@KristofferC KristofferC added this to the 1.9 milestone Oct 25, 2022
@gbaraldi
Member

I went a lot deeper into this, and the issue seems to be that on 1.8.1 we used to make this ccall with the following arguments:

Hit breakpoint:
In gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7) at /Users/gabrielbaraldi/.julia/packages/GDAL/TCvFp/src/libgdal.jl:6232
 6232  function gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7)
>6233      aftercare(
 6234          ccall(
 6235              (:GDALCreateCopy, libgdal),
 6236              GDALDatasetH,
 6237              (

About to run: (Base.cconvert)(Ptr{Nothing}, Ptr{Nothing} @0x0000600001eeca90)
1|debug> fr
[1] gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7) at /Users/gabrielbaraldi/.julia/packages/GDAL/TCvFp/src/libgdal.jl:6232
  | arg1::Ptr{Nothing} = Ptr{Nothing} @0x0000600001eeca90
  | arg2::String = "test_cog.tif"
  | arg3::Ptr{Nothing} = Ptr{Nothing} @0x0000000135e32140
  | arg4::Bool = false
  | arg5::Vector{String} = String[]
  | arg6::Ptr{Nothing} = Ptr{Nothing} @0x0000000000000000
  | arg7::Ptr{Nothing} = Ptr{Nothing} @0x0000000000000000

Note that arg6 is NULL.
On 1.8.2 we now get:

1|debug> n
In gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7) at /Users/gabrielbaraldi/.julia/packages/GDAL/TCvFp/src/libgdal.jl:6232
 6232  function gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7)
>6233      aftercare(
 6234          ccall(
 6235              (:GDALCreateCopy, libgdal),
 6236              GDALDatasetH,
 6237              (

About to run: <(JuliaInterpreter.CompiledCalls.var"##compiled_ccall#347")(Ptr{Nothing} @0x0000600001b2c270, Cstring(...>
1|debug> fr
[1] gdalcreatecopy(arg1, arg2, arg3, arg4, arg5, arg6, arg7) at /Users/gabrielbaraldi/.julia/packages/GDAL/TCvFp/src/libgdal.jl:6232
  | arg1::Ptr{Nothing} = Ptr{Nothing} @0x0000600001b2c270
  | arg2::String = "test_cog.tif"
  | arg3::Ptr{Nothing} = Ptr{Nothing} @0x000000013d625180
  | arg4::Bool = false
  | arg5::Vector{String} = String[]
  | arg6::Ptr{Nothing} = Ptr{Nothing} @0x0000000151cf4b08
  | arg7::Ptr{Nothing} = Ptr{Nothing} @0x0000000000000000

Note that arg6 is now non-NULL.

From the macOS crash reporter I get:

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGBUS)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x0000000151cf4b08
Exception Codes:       0x0000000000000002, 0x0000000151cf4b08

Termination Reason:    Namespace SIGNAL, Code 10 Bus error: 10
Terminating Process:   exc handler [17200]

The faulting address is exactly the pointer that we passed as arg6.

@evetion
Member Author

evetion commented Dec 19, 2022

I believe the sixth argument is the user-provided progress callback function, which doesn't work on M1. So we did the following:

https://github.com/yeesian/ArchGDAL.jl/blob/49e5a32986b8d209e63d9cfc6b566d49f9f01276/src/utils.jl#L231

@gbaraldi
Member

gbaraldi commented Dec 19, 2022

I'm confused as to why I get a NULL on 1.8.1 and an actual pointer on 1.8.2.

Is passing a NULL to GDAL expected, @evetion?

@visr
Contributor

visr commented Dec 19, 2022

Passing a NULL is valid if you don't want a progress callback. That is probably what the workaround in ArchGDAL should enforce, rather than the dummy cfunction it gets now.
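
To illustrate the difference between the two options, here is a minimal sketch in plain Julia with no GDAL involved. The names `dummy_progress`, `dummy_ptr`, `no_progress` and `report` are hypothetical stand-ins, not ArchGDAL or GDAL code; `report` only mimics a C library that skips the callback when it is handed NULL.

```julia
# Hypothetical stand-in for a GDAL-style progress callback (not ArchGDAL's code).
# In this sketch, returning 1 means "keep going".
function dummy_progress(frac::Cdouble, msg::Cstring, data::Ptr{Cvoid})::Cint
    return Cint(1)
end

# A plain (non-closure) @cfunction yields a C-callable function pointer.
dummy_ptr = @cfunction(dummy_progress, Cint, (Cdouble, Cstring, Ptr{Cvoid}))

# "No progress callback" is expressed as a NULL pointer.
no_progress = C_NULL

# Hypothetical caller that mimics a C library: it only invokes the
# callback when the pointer is non-NULL.
function report(cb::Ptr{Cvoid}, frac::Float64)
    cb == C_NULL && return Cint(1)
    return ccall(cb, Cint, (Cdouble, Cstring, Ptr{Cvoid}), frac, "working", C_NULL)
end

report(no_progress, 0.5)  # callback skipped
report(dummy_ptr, 0.5)    # callback invoked through the pointer
```

Either pointer is safe to hand to such a caller; what is not safe is a non-NULL pointer that does not point at callable code, which is what the crash report above shows.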

@gbaraldi
Member

Oh, so it was a bug that it was getting a NULL pointer then, and that change made it so it actually got the correct results in 1.8.2, which broke it :)

@evetion
Member Author

evetion commented Dec 20, 2022

Hmm, odd, because on 1.8.1 I do get a pointer back:

julia> using ArchGDAL
julia> @cfunction(
                   ArchGDAL.GDAL.gdaldummyprogress,
                   Cint,
                   (Cdouble, Cstring, Ptr{Cvoid})
               )
Ptr{Nothing} @0x000000013fae007c
julia> Sys.ARCH == :aarch64
true

@gbaraldi
Member

You have to get it from deep in there, by modifying the code, which is why I think there was a bug.
Add a print in gdalcreatecopy or ArchGDAL.copy.
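
Something along these lines is enough for that kind of print-debugging. This is a rough sketch: `describe_ptr` is a hypothetical helper, not part of GDAL.jl or ArchGDAL, and where exactly you call it from is up to you.

```julia
# Hypothetical helper: format a pointer argument and flag whether it is NULL,
# so it can be printed just before the ccall is made.
function describe_ptr(name::AbstractString, p::Ptr)
    state = p == C_NULL ? "NULL" : "non-NULL"
    return string(name, " = ", p, " (", state, ")")
end

# Example with a NULL pointer standing in for the progress callback argument.
println(describe_ptr("arg6", C_NULL))
```

Temporarily calling it on the progress argument inside the wrapper shows whether the callback pointer that reaches the ccall is NULL or not.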

@KristofferC KristofferC removed this from the 1.9 milestone Dec 20, 2022
@evetion
Member Author

evetion commented Dec 20, 2022

Thanks for digging in! Nasty bug, I've made a PR over at ArchGDAL to fix it.

My summary:

  • ArchGDAL has a buggy macro for :aarch64 that should return a cfunction pointer, but didn't (it returned just Ptr{Nothing} @0x0000000000000000).
  • Julia >=1.8.2 changed the behavior of the buggy macro so that it returns an actual (non-zero) pointer to somewhere, which causes the crash reported in this issue.
