
implement "engine" for managing inference/codegen #54816

Merged
merged 3 commits into from
Jun 23, 2024
Conversation

vtjnash
Member

@vtjnash vtjnash commented Jun 15, 2024

Continuing from previous PRs to make CodeInstance the primary means of tracking compilation, this introduces an "engine" which externally keeps track of whether a particular inference result is in progress and where. At present, it handles unexpected cycles by permitting both threads to work on the result. This is likely to be optimal most of the time for now, until we have the ability to do work-stealing of the results.

To assist with that, CodeInstance is now primarily allocated by jl_engine_reserve, which also tracks that this is being currently inferred. This creates a sort of per-(MI,owner) tuple lock mechanism, which can be used with the double-check pattern to see if inference was completed while waiting on that. The world value is not included since that is inferred later, so there is a possibility that a thread waits only to discover that the result was already invalid before it could use it (though this should be unlikely).

The process then can notify when it has finished and wants to release the reservation lock on that identity pair. When doing so, it may also provide source code, allowing the process to potentially begin a threadpool to compile that result while the main thread continues with the job of inference.

Includes fix for #53434, by ensuring SOURCE_MODE_ABI results in the item going into the global cache.

Fixes #53433, as inInference is computed by the engine and protected by a lock, which also fixes #53680.

(n.b. somehow the finalizers_inhibited counter appears to be getting corrupted by the worlds and staged tests, so maybe it will be necessary to shift exactly how that is done if we do want to prevent finalizers from occurring here; e.g. JULIA_CPU_THREADS=1 ../julia -p1 runtests.jl worlds int causes an assertion failure. I suspect this is because it is Task-local state, but if something fails and we forget to reset it, then it will be subtracted from ptls2 instead, which might not be the same.)

@vtjnash vtjnash requested a review from Keno June 15, 2024 01:39
@vtjnash vtjnash force-pushed the jn/engine branch 2 times, most recently from 2b86701 to 6bdf8a4 Compare June 15, 2024 20:11
@vtjnash vtjnash marked this pull request as ready for review June 15, 2024 20:11
@gbaraldi
Member

Do we want to add something to the devdocs like:
The life of a CodeInstance, all the way from someone requesting an MI to it being codegened?

@nsajko nsajko added compiler:codegen Generation of LLVM IR and native code compiler:inference Type inference bugfix This change fixes an existing bug labels Jun 17, 2024
@Keno
Member

Keno commented Jun 19, 2024

Overall, I think I like the direction. The only conceptual thing I flagged is that this starts assuming that owner is an egal-comparable cache key. In further development (#54373) I ended up finding it useful to separate the egal-comparable key data from the non-volatile data. That said, I'm happy to keep going with this and figure out how to make that distinction later; in the currently implemented code, it doesn't really matter.

@vtjnash
Member Author

vtjnash commented Jun 20, 2024

this starts assuming that owner is an egal-comparable cache key

I speculate that what will happen here (eventually, if we need more flexibility) is that you would use a particular egal-comparable key for the engine reservation, but if you later decide you want to refine that, the code would be allowed to do so, as long as it follows conceptually the same sort of narrowing and lifecycle management that applies to the other fields. It is potentially very similar to how we deal with Pkg precompile locks, for example, where the pidfile is simply the package name rather than a guarantee that it is precisely the final content hash, and we move it into place later.

@vtjnash
Member Author

vtjnash commented Jun 20, 2024

Oh, also, a secondary note is that this is permitted to be reimplemented in Julia for anything that won't go through the JIT. The only reason this needs to be in C for the stdlib is that it needs to manage the state of things in the JIT before Julia is compiled, which adds some fairly strict concurrency constraints. For anything that merely reuses the inference machinery from Julia (i.e. does not cause recursion through the JIT), it is permitted to reimplement this in Julia in any similar way, using any alternate key-comparator strategy.

@Keno
Member

Keno commented Jun 20, 2024

Yeah, let's try it. As I said, directionally correct, so let's get the implementation in and then we can figure out what tweaks may be required for the new features that we want.

vtjnash added 3 commits June 22, 2024 18:56
Neither a safepoint nor, equivalently, a safepoint region transition is permitted while holding a lock that might be needed by the GC, as that forms a cycle. We must ensure that this thread is permitted to keep running, so that it can reach the condition wait call and release the lock, keeping it synchronized with the notify_all in the sweep. Alternatively we could use a timed_wait and continuously poll the GC (to ensure we actually run GC periodically), but I would rather that be an internal GC policy than an external forcing factor here. This prevents the GC from recruiting this thread (via a signal) to help with GC even though it is observably sleeping, but we might later consider integrating a way for the GC to notify a set of condition variables upon starting, to wake them and recruit them temporarily via a subsequent safepoint, such as (in Julia-syntax pseudocode):

    gc_safe_enter()
    lock(condvar) do
        while !condmet
            wait(condvar)
            while gc_running
                unlock(condvar)
                gc_unsafe_enter()
                gc_safepoint()
                gc_unsafe_leave()
                lock(condvar)
            end
        end
    end
    gc_safe_leave()
This bug was fixed in C17, but we are using gcc 9.1, which caused misalignment here on 32-bit platforms. We necessarily have this behind a lock, so it could just be a boolean, but that would need some more code cleanup.
@vtjnash
Member Author

vtjnash commented Jun 22, 2024

Ended up just using ~/julia/julia --project=.. bughunt.jl 'https://buildkite.com/julialang/julia-master/builds/37504#019032a2-da34-49fd-8b24-295852b1c908', and that pretty quickly let me identify the compiler bug (technically a C11 standards bug) in the version of gcc being used that was causing this to fail.

@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Jun 22, 2024
@vtjnash vtjnash merged commit 2bf4750 into master Jun 23, 2024
8 checks passed
@vtjnash vtjnash deleted the jn/engine branch June 23, 2024 00:28
@IanButterworth IanButterworth removed the merge me PR is reviewed. Merge when all tests are passing label Jun 23, 2024
simeonschaub added a commit to JuliaDebug/Cthulhu.jl that referenced this pull request Sep 1, 2024
I'm not sure whether this is the correct way to do it, but this fixes #586 locally for me.
maleadt added a commit to JuliaGPU/GPUCompiler.jl that referenced this pull request Sep 17, 2024
maleadt added a commit to JuliaGPU/GPUCompiler.jl that referenced this pull request Sep 17, 2024
maleadt added a commit to JuliaGPU/GPUCompiler.jl that referenced this pull request Sep 17, 2024
Labels
bugfix - This change fixes an existing bug
compiler:codegen - Generation of LLVM IR and native code
compiler:inference - Type inference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

llvmcall requires the compiler
regression in thread-safety with data-race on mi->inInference field
6 participants