Skip to content

Commit

Permalink
Allocation Profiler: Types for all allocations (JuliaLang#50337)
Browse files Browse the repository at this point in the history
Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes JuliaLang#43688.
Fixes JuliaLang#45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
  • Loading branch information
2 people authored and RAI CI (GitHub Action Automation) committed May 28, 2024
1 parent 073c43a commit 01bcde2
Show file tree
Hide file tree
Showing 12 changed files with 199 additions and 59 deletions.
90 changes: 83 additions & 7 deletions doc/src/manual/profile.md
Original file line number Diff line number Diff line change
Expand Up @@ -338,15 +338,91 @@ argument can be passed to speed it up by making it skip some allocations.
Passing `sample_rate=1.0` will make it record everything (which is slow);
`sample_rate=0.1` will record only 10% of the allocations (faster), etc.

!!! note
!!! compat "Julia 1.11"

Older versions of Julia could not capture types in all cases. In older versions of
Julia, if you see an allocation of type `Profile.Allocs.UnknownType`, it means that
the profiler doesn't know what type of object was allocated. This mainly happened when
the allocation was coming from generated code produced by the compiler. See
[issue #43688](https://github.com/JuliaLang/julia/issues/43688) for more info.

Since Julia 1.11, all allocations should have a type reported.

For more details on how to use this tool, please see the following talk from JuliaCon 2022:
https://www.youtube.com/watch?v=BFvpwC8hEWQ

##### Allocation Profiler Example

The current implementation of the Allocations Profiler _does not
capture types for all allocations._ Allocations for which the profiler
could not capture the type are represented as having type
`Profile.Allocs.UnknownType`.
In this simple example, we use PProf to visualize the alloc profile. You could use another
visualization tool instead. We collect the profile (specifying a sample rate), then we visualize it.
```julia
using Profile, PProf
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.0001 my_function()
PProf.Allocs.pprof()
```

Here is a more in-depth example, showing how we can tune the sample rate. A
good number of samples to aim for is around 1 - 10 thousand. Too many, and the
profile visualizer can get overwhelmed, and profiling will be slow. Too few,
and you don't have a representative sample.

You can read more about the missing types and the plan to improve this, here:
[issue #43688](https://github.com/JuliaLang/julia/issues/43688).

```julia-repl
julia> import Profile
julia> @time my_function() # Estimate allocations from a (second-run) of the function
0.110018 seconds (1.50 M allocations: 58.725 MiB, 17.17% gc time)
500000
julia> Profile.Allocs.clear()
julia> Profile.Allocs.@profile sample_rate=0.001 begin # 1.5 M * 0.001 = ~1.5K allocs.
my_function()
end
500000
julia> prof = Profile.Allocs.fetch(); # If you want, you can also manually inspect the results.
julia> length(prof.allocs) # Confirm we have expected number of allocations.
1515
julia> using PProf # Now, visualize with an external tool, like PProf or ProfileCanvas.
julia> PProf.Allocs.pprof(prof; from_c=false) # You can optionally pass in a previously fetched profile result.
Analyzing 1515 allocation samples... 100%|████████████████████████████████| Time: 0:00:00
Main binary filename not available.
Serving web UI on http://localhost:62261
"alloc-profile.pb.gz"
```
Then you can view the profile by navigating to http://localhost:62261, and the profile is saved to disk.
See PProf package for more options.

##### Allocation Profiling Tips

As stated above, aim for around 1-10 thousand samples in your profile.

Note that we are uniformly sampling in the space of _all allocations_, and are not weighting
our samples by the size of the allocation. So a given allocation profile may not give a
representative profile of where most bytes are allocated in your program, unless you had set
`sample_rate=1`.

Allocations can come from users directly constructing objects, but can also come from inside
the runtime or be inserted into compiled code to handle type instability. Looking at the
"source code" view can be helpful to isolate them, and then other external tools such as
[`Cthulhu.jl`](https://github.com/JuliaDebug/Cthulhu.jl) can be useful for identifying the
cause of the allocation.

##### Allocation Profile Visualization Tools

There are several profiling visualization tools now that can all display Allocation
Profiles. Here is a small list of some of the main ones we know about:
- [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
- [ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl)
- VSCode's built-in profile visualizer (`@profview_allocs`) [docs needed]
- Viewing the results directly in the REPL
- You can inspect the results in the REPL via [`Profile.Allocs.fetch()`](@ref), to view
the stacktrace and type of each allocation.

#### Line-by-Line Allocation Tracking

Expand Down
1 change: 1 addition & 0 deletions src/gc-alloc-profiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t

extern int g_alloc_profile_enabled;

// This should only be used from _deprecated_ code paths. We shouldn't see UNKNOWN anymore.
#define jl_gc_unknown_type_tag ((jl_datatype_t*)0xdeadaa03)

static inline void maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t *typ) JL_NOTSAFEPOINT {
Expand Down
19 changes: 17 additions & 2 deletions src/gc.c
Original file line number Diff line number Diff line change
Expand Up @@ -1030,13 +1030,20 @@ STATIC_INLINE jl_value_t *jl_gc_big_alloc_inner(jl_ptls_t ptls, size_t sz)
return jl_valueof(&v->header);
}

// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc(jl_ptls_t ptls, size_t sz)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
maybe_record_alloc_to_profile(val, sz, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc_instrumented(jl_ptls_t ptls, size_t sz, jl_value_t *type)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
maybe_record_alloc_to_profile(val, sz, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_big_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_big_alloc_inner`
Expand Down Expand Up @@ -1341,14 +1348,22 @@ STATIC_INLINE jl_value_t *jl_gc_pool_alloc_inner(jl_ptls_t ptls, int pool_offset
return jl_valueof(v);
}

// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
int osize)
{
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
maybe_record_alloc_to_profile(val, osize, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc_instrumented(jl_ptls_t ptls, int pool_offset,
int osize, jl_value_t* type)
{
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
maybe_record_alloc_to_profile(val, osize, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_pool_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_pool_alloc_inner`
Expand Down
2 changes: 2 additions & 0 deletions src/jl_exported_funcs.inc
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,7 @@
XX(jl_gc_alloc_3w) \
XX(jl_gc_alloc_typed) \
XX(jl_gc_big_alloc) \
XX(jl_gc_big_alloc_instrumented) \
XX(jl_gc_collect) \
XX(jl_gc_conservative_gc_support_enabled) \
XX(jl_gc_counted_calloc) \
Expand Down Expand Up @@ -185,6 +186,7 @@
XX(jl_gc_new_weakref_th) \
XX(jl_gc_num) \
XX(jl_gc_pool_alloc) \
XX(jl_gc_pool_alloc_instrumented) \
XX(jl_gc_queue_multiroot) \
XX(jl_gc_queue_root) \
XX(jl_gc_safepoint) \
Expand Down
9 changes: 5 additions & 4 deletions src/llvm-final-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -187,12 +187,13 @@ Value *FinalLowerGC::lowerSafepoint(CallInst *target, Function &F)
Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
{
++GCAllocBytesCount;
assert(target->arg_size() == 2);
assert(target->arg_size() == 3);
CallInst *newI;

IRBuilder<> builder(target);
builder.SetCurrentDebugLocation(target->getDebugLoc());
auto ptls = target->getArgOperand(0);
auto type = target->getArgOperand(2);
Attribute derefAttr;

if (auto CI = dyn_cast<ConstantInt>(target->getArgOperand(1))) {
Expand All @@ -203,19 +204,19 @@ Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
if (offset < 0) {
newI = builder.CreateCall(
bigAllocFunc,
{ ptls, ConstantInt::get(T_size, sz + sizeof(void*)) });
{ ptls, ConstantInt::get(T_size, sz + sizeof(void*)), type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sz + sizeof(void*));
}
else {
auto pool_offs = ConstantInt::get(Type::getInt32Ty(F.getContext()), offset);
auto pool_osize = ConstantInt::get(Type::getInt32Ty(F.getContext()), osize);
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize });
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), osize);
}
} else {
auto size = builder.CreateZExtOrTrunc(target->getArgOperand(1), T_size);
size = builder.CreateAdd(size, ConstantInt::get(T_size, sizeof(void*)));
newI = builder.CreateCall(allocTypedFunc, { ptls, size, ConstantPointerNull::get(Type::getInt8PtrTy(F.getContext())) });
newI = builder.CreateCall(allocTypedFunc, { ptls, size, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sizeof(void*));
}
newI->setAttributes(newI->getCalledFunction()->getAttributes());
Expand Down
46 changes: 29 additions & 17 deletions src/llvm-late-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2348,22 +2348,6 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
IRBuilder<> builder(CI);
builder.SetCurrentDebugLocation(CI->getDebugLoc());

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it doesn't set the tag.
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, T_size, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false)
});
newI->takeName(CI);

// LLVM alignment/bit check is not happy about addrspacecast and refuse
// to remove write barrier because of it.
// We pretty much only load using `T_size` so try our best to strip
Expand Down Expand Up @@ -2401,7 +2385,35 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
builder.CreateAlignmentAssumption(DL, tag, 16);
}
}
// Set the tag.

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it specializes the call based on the constant
// size of the object to allocate, to save one indirection, and doesn't set
// the type tag. (Note that if the size is not a constant, it will call
// gc_alloc_obj, and will redundantly set the tag.)
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, T_size, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false),
builder.CreatePtrToInt(tag, T_size),
});
newI->takeName(CI);

// Now, finally, set the tag. We do this in IR instead of in the C alloc
// function, to provide possible optimization opportunities. (I think? TBH
// the most recent editor of this code is not entirely clear on why we
// prefer to set the tag in the generated code. Providing optimziation
// opportunities is the most likely reason; the tradeoff is slightly
// larger code size and increased compilation time, compiling this
// instruction at every allocation site, rather than once in the C alloc
// function.)
auto &M = *builder.GetInsertBlock()->getModule();
StoreInst *store = builder.CreateAlignedStore(
tag, EmitTagPtr(builder, tag_type, T_size, newI), M.getDataLayout().getPointerABIAlignment(0));
Expand Down
14 changes: 8 additions & 6 deletions src/llvm-pass-helpers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -151,7 +151,9 @@ namespace jl_intrinsics {
auto intrinsic = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), T_size },
{ Type::getInt8PtrTy(ctx),
T_size,
T_size }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_BYTES_NAME);
Expand Down Expand Up @@ -236,8 +238,8 @@ namespace jl_intrinsics {
}

namespace jl_well_known {
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc);
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc_instrumented);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc_instrumented);
static const char *GC_QUEUE_ROOT_NAME = XSTR(jl_gc_queue_root);
static const char *GC_ALLOC_TYPED_NAME = XSTR(jl_gc_alloc_typed);

Expand All @@ -251,7 +253,7 @@ namespace jl_well_known {
auto bigAllocFunc = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), T_size },
{ Type::getInt8PtrTy(ctx), T_size , T_size},
false),
Function::ExternalLinkage,
GC_BIG_ALLOC_NAME);
Expand All @@ -267,7 +269,7 @@ namespace jl_well_known {
auto poolAllocFunc = Function::Create(
FunctionType::get(
T_prjlvalue,
{ Type::getInt8PtrTy(ctx), Type::getInt32Ty(ctx), Type::getInt32Ty(ctx) },
{ Type::getInt8PtrTy(ctx), Type::getInt32Ty(ctx), Type::getInt32Ty(ctx), T_size },
false),
Function::ExternalLinkage,
GC_POOL_ALLOC_NAME);
Expand Down Expand Up @@ -301,7 +303,7 @@ namespace jl_well_known {
T_prjlvalue,
{ Type::getInt8PtrTy(ctx),
T_size,
Type::getInt8PtrTy(ctx) },
T_size }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_TYPED_NAME);
Expand Down
31 changes: 31 additions & 0 deletions stdlib/Profile/test/allocs.jl
Original file line number Diff line number Diff line change
Expand Up @@ -121,3 +121,34 @@ end
@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == String]) >= 1
end

@testset "alloc profiler catches allocs from codegen" begin
@eval begin
struct MyType x::Int; y::Int end
Base.:(+)(n::Number, x::MyType) = n + x.x + x.y
foo(a, x) = a[1] + x
wrapper(a) = foo(a, MyType(0,1))
end
a = Any[1,2,3]
# warmup
wrapper(a)

@eval Allocs.@profile sample_rate=1 wrapper($a)

prof = Allocs.fetch()
Allocs.clear()

@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == MyType]) >= 1
end

@testset "alloc profiler catches allocs from buffer resize" begin
a = Int[]
Allocs.@profile sample_rate=1 for _ in 1:100; push!(a, 1); end

prof = Allocs.fetch()
Allocs.clear()

@test length(prof.allocs) >= 1
@test length([a for a in prof.allocs if a.type == Profile.Allocs.BufferType]) >= 1
end
16 changes: 8 additions & 8 deletions test/llvmpasses/alloc-opt-gcframe.ll
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,17 @@ target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
; CHECK-NOT: @julia.gc_alloc_obj

; TYPED: %current_task = getelementptr inbounds {}*, {}** %gcstack, i64 -12
; TYPED-NEXT: [[ptls_field:%.*]] = getelementptr inbounds {}*, {}** %current_task, i64 16
; TYPED: [[ptls_field:%.*]] = getelementptr inbounds {}*, {}** %current_task, i64 16
; TYPED-NEXT: [[ptls_load:%.*]] = load {}*, {}** [[ptls_field]], align 8, !tbaa !0
; TYPED-NEXT: [[ppjl_ptls:%.*]] = bitcast {}* [[ptls_load]] to {}**
; TYPED-NEXT: [[ptls_i8:%.*]] = bitcast {}** [[ppjl_ptls]] to i8*
; TYPED-NEXT: %v = call noalias nonnull dereferenceable({{[0-9]+}}) {} addrspace(10)* @ijl_gc_pool_alloc(i8* [[ptls_i8]], i32 [[SIZE_T:[0-9]+]], i32 16)
; TYPED-NEXT: %v = call noalias nonnull dereferenceable({{[0-9]+}}) {} addrspace(10)* @ijl_gc_pool_alloc_instrumented(i8* [[ptls_i8]], i32 [[SIZE_T:[0-9]+]], i32 16, i64 {{.*}} @tag {{.*}})
; TYPED: store atomic {} addrspace(10)* @tag, {} addrspace(10)* addrspace(10)* {{.*}} unordered, align 8, !tbaa !4

; OPAQUE: %current_task = getelementptr inbounds ptr, ptr %gcstack, i64 -12
; OPAQUE-NEXT: [[ptls_field:%.*]] = getelementptr inbounds ptr, ptr %current_task, i64 16
; OPAQUE: [[ptls_field:%.*]] = getelementptr inbounds ptr, ptr %current_task, i64 16
; OPAQUE-NEXT: [[ptls_load:%.*]] = load ptr, ptr [[ptls_field]], align 8, !tbaa !0
; OPAQUE-NEXT: %v = call noalias nonnull dereferenceable({{[0-9]+}}) ptr addrspace(10) @ijl_gc_pool_alloc(ptr [[ptls_load]], i32 [[SIZE_T:[0-9]+]], i32 16)
; OPAQUE-NEXT: %v = call noalias nonnull dereferenceable({{[0-9]+}}) ptr addrspace(10) @ijl_gc_pool_alloc_instrumented(ptr [[ptls_load]], i32 [[SIZE_T:[0-9]+]], i32 16, i64 {{.*}} @tag {{.*}})
; OPAQUE: store atomic ptr addrspace(10) @tag, ptr addrspace(10) {{.*}} unordered, align 8, !tbaa !4

define {} addrspace(10)* @return_obj() {
Expand Down Expand Up @@ -270,11 +270,11 @@ L3:
}
; CHECK-LABEL: }{{$}}

; TYPED: declare noalias nonnull {} addrspace(10)* @ijl_gc_pool_alloc(i8*,
; TYPED: declare noalias nonnull {} addrspace(10)* @ijl_gc_big_alloc(i8*,
; TYPED: declare noalias nonnull {} addrspace(10)* @ijl_gc_pool_alloc_instrumented(i8*,
; TYPED: declare noalias nonnull {} addrspace(10)* @ijl_gc_big_alloc_instrumented(i8*,

; OPAQUE: declare noalias nonnull ptr addrspace(10) @ijl_gc_pool_alloc(ptr,
; OPAQUE: declare noalias nonnull ptr addrspace(10) @ijl_gc_big_alloc(ptr,
; OPAQUE: declare noalias nonnull ptr addrspace(10) @ijl_gc_pool_alloc_instrumented(ptr,
; OPAQUE: declare noalias nonnull ptr addrspace(10) @ijl_gc_big_alloc_instrumented(ptr,
declare void @external_function()
declare {}*** @julia.get_pgcstack()
declare noalias nonnull {} addrspace(10)* @julia.gc_alloc_obj({}**, i64, {} addrspace(10)*)
Expand Down
Loading

0 comments on commit 01bcde2

Please sign in to comment.