Throw exceptions with arguments without creating a GC frame #11284

Closed
wants to merge 1 commit into from

yuyichao
Contributor

This pull request tries to address the issue brought up in #11244, and more generally the problem that throwing an error with arguments can hurt the performance of the non-error path because of the GC frame it forces the function to create.

As mentioned in the comments on that pull request, the idea here is NOT to inline the constructor of the error, but to create a specialized function for the arguments in order to avoid boxing.

This function can probably be used in many other places as well, but for now I'm only using it for SimpleVector to benchmark it before getting more feedback. (Edit: if this PR is accepted, I will probably switch a number of other places over to this function, either in this PR or in a new one.)

  • Not sure if the function name is a good choice (as a matter of fact, I'm bad at naming).
  • Maybe this can be exported if it is useful for user code as well.
  • The manual specialization is necessary because a vararg function currently still requires boxing the arguments and creating a tuple, which defeats the purpose of these functions. Related: "Make f(args...) as efficient as f(args)" (#11248).
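
A minimal sketch of the idea described above, written in current Julia syntax rather than the 0.4-dev syntax this PR targets. The name throw_with_args is taken from the "After" IR in this comment; the method bodies, the use of @noinline, and the checked_get caller are assumptions for illustration, not the code in this PR. The fixed-arity methods stand in for the manual specialization mentioned in the last bullet.

```julia
# Hypothetical helper: construct and throw the error in a separate,
# non-inlined function so the caller's hot path stays small.
# One method per argument count instead of a vararg method.
@noinline throw_with_args(T, a) = throw(T(a))
@noinline throw_with_args(T, a, b) = throw(T(a, b))

# Example caller (also hypothetical): the error branch is a single call.
function checked_get(v::Vector{Int}, i::Int)
    (1 <= i <= length(v)) || throw_with_args(BoundsError, v, i)
    return @inbounds v[i]
end
```

Whether this actually removes a GC frame depends on the Julia version and its codegen; the sketch only shows the shape of the helper.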

Benchmark:
Code

sv = Base.svec(1, 2)

@code_llvm getindex(sv, 1)

function time_func(f::Function, args...)
    println(f)
    f(args...)
    gc()
    @time for i in 1:100000000
        f(args...)
    end
    gc()
end

f(v, i) = v[i]
time_func(f, sv, 1)

Before

define %jl_value_t* @julia_getindex_67410(%jl_value_t*, i64) {
top:
  %2 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 2
  %4 = bitcast [3 x %jl_value_t*]* %2 to i64*
  store i64 2, i64* %4, align 8
  %5 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 1
  %6 = bitcast %jl_value_t** %5 to %jl_value_t***
  %7 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %7, %jl_value_t*** %6, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %8 = icmp slt i64 %1, 1
  br i1 %8, label %if2, label %L1

L1:                                               ; preds = %top
  %9 = bitcast %jl_value_t* %0 to i64*
  %10 = load i64* %9, align 8
  %phitmp8 = icmp slt i64 %10, %1
  br i1 %phitmp8, label %if2, label %L4

if2:                                              ; preds = %L1, %top
  %11 = call %jl_value_t* @alloc_2w()
  %12 = getelementptr inbounds %jl_value_t* %11, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139927411285136 to %jl_value_t*), %jl_value_t** %12, align 8
  %13 = getelementptr inbounds %jl_value_t* %11, i64 0, i32 0
  store %jl_value_t* %0, %jl_value_t** %13, align 8
  %14 = getelementptr inbounds %jl_value_t* %11, i64 1, i32 0
  store %jl_value_t* null, %jl_value_t** %14, align 8
  store %jl_value_t* %11, %jl_value_t** %3, align 8
  %15 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %15, %jl_value_t** %14, align 8
  %16 = icmp eq %jl_value_t* %15, null
  br i1 %16, label %cont3, label %wb_not_null

wb_not_null:                                      ; preds = %if2
  %17 = bitcast %jl_value_t** %12 to i64*
  %18 = load i64* %17, align 8
  %19 = and i64 %18, 1
  %20 = icmp eq i64 %19, 0
  br i1 %20, label %cont3, label %wb_may_trigger

wb_may_trigger:                                   ; preds = %wb_not_null
  %21 = getelementptr inbounds %jl_value_t* %15, i64 -1, i32 0
  %22 = bitcast %jl_value_t** %21 to i64*
  %23 = load i64* %22, align 8
  %24 = and i64 %23, 1
  %25 = icmp eq i64 %24, 0
  br i1 %25, label %wb_trigger, label %cont3

wb_trigger:                                       ; preds = %wb_may_trigger
  call void @gc_queue_root(%jl_value_t* %11)
  br label %cont3

cont3:                                            ; preds = %wb_trigger, %wb_may_trigger, %wb_not_null, %if2
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %11, i32 302)
  br label %L4

L4:                                               ; preds = %cont3, %L1
  %26 = ptrtoint %jl_value_t* %0 to i64
  %27 = shl i64 %1, 3
  %28 = add i64 %27, %26
  %29 = inttoptr i64 %28 to i8**
  %30 = load i8** %29, align 1
  %31 = icmp eq i8* %30, null
  br i1 %31, label %if5, label %L7

if5:                                              ; preds = %L4
  call void @jl_throw_with_superfluous_argument(%jl_value_t* inttoptr (i64 139927411359824 to %jl_value_t*), i32 305)
  br label %L7

L7:                                               ; preds = %if5, %L4
  %32 = bitcast i8* %30 to %jl_value_t*
  %33 = load %jl_value_t*** %6, align 8
  store %jl_value_t** %33, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %32
}
f
elapsed time: 1.8844901 seconds (0 bytes allocated)

After

define %jl_value_t* @julia_getindex_44351(%jl_value_t*, i64) {
top:
  %2 = icmp slt i64 %1, 1
  br i1 %2, label %if2, label %L1

L1:                                               ; preds = %top
  %3 = bitcast %jl_value_t* %0 to i64*
  %4 = load i64* %3, align 8
  %phitmp7 = icmp slt i64 %4, %1
  br i1 %phitmp7, label %if2, label %L3

if2:                                              ; preds = %L1, %top
  %5 = load %jl_value_t** inttoptr (i64 139845508162200 to %jl_value_t**), align 8
  call void @julia_throw_with_args3419(%jl_value_t* %5, %jl_value_t* %0, i64 %1)
  br label %L3

L3:                                               ; preds = %if2, %L1
  %6 = ptrtoint %jl_value_t* %0 to i64
  %7 = shl i64 %1, 3
  %8 = add i64 %7, %6
  %9 = inttoptr i64 %8 to i8**
  %10 = load i8** %9, align 1
  %11 = icmp eq i8* %10, null
  br i1 %11, label %if4, label %L6

if4:                                              ; preds = %L3
  call void @jl_throw_with_superfluous_argument(%jl_value_t* inttoptr (i64 139845508137040 to %jl_value_t*), i32 319)
  br label %L6

L6:                                               ; preds = %if4, %L3
  %12 = bitcast i8* %10 to %jl_value_t*
  ret %jl_value_t* %12
}
f
elapsed time: 1.7220829 seconds (0 bytes allocated)

Note: the benchmark time fluctuates for both versions, but the new version (1.72-1.82s) is consistently faster than the original one (1.80-1.92s). The impact of a GC frame is probably hard to quantify, but it should at least be clear from the LLVM IR that the function no longer creates one.

@yuyichao
Contributor Author

This will obviously cause more functions to be compiled, but IMHO the overall amount of generated code shouldn't be more than when the constructor is inlined. (The ::ANY on the first argument (the type) should help reduce the generated code as well.)

@ScottPJones
Contributor

This issue also affected me, where I wanted to have improved error messages for Unicode encoding problems... I made my own separate function that took the 3 arguments, which then created and threw a UnicodeError exception... and that apparently didn’t trigger creating a GC frame...
👍

@yuyichao
Contributor Author

Yeah, the main point is to avoid creating a wrapper function for each error type.

This should have the same effect as declaring the constructors of all error types this way (specialized on the arguments but not inlined), but IMHO it is easier to tell users to call a special function to throw an error without overhead than to explain the better way to construct an error (especially for user-defined error types).

@mbauman
Member

mbauman commented May 15, 2015

I agree that something is needed here, but I'm a little hesitant to make workarounds like this nicer and more useful. It makes it more likely that they'll be picked up outside of Base as a cargo-cult optimization in places where there's already a GC frame anyways.

It really feels like we're close to having all the optimizations in place for the GC frame to be automatically omitted. f(x,y) = throw(BoundsError(x,y)) does not create a GC frame if x and y are already boxed Julia objects, because there are no allocations that occur between the creation of the BoundsError object and the error itself. But when the BoundsError needs to box its arguments, there are multiple atomic allocations, and the GC frame is necessary. … Now that I've stepped through this myself, the solution seems farther away.

I assume there are other major pitfalls to making BoundsError parametric so its arguments don't need to be boxed?
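
To make the question concrete, here is a rough sketch of what a parametric bounds error could look like, in current Julia syntax. TypedBoundsError is a hypothetical type invented for this illustration; Base.BoundsError is not defined this way.

```julia
# Hypothetical parametric error type; Base.BoundsError is NOT defined like this.
struct TypedBoundsError{A,I} <: Exception
    a::A   # the indexed container, stored with its concrete type
    i::I   # the offending index, stored inline when I is a bits type
end

# Every distinct (A, I) pair is a separate concrete type, so more
# specialized code may be generated for constructing and showing errors.
err = TypedBoundsError([1, 2, 3], 4)
```

With concretely typed fields the index no longer has to be stored as a boxed value, at the price of one concrete error type per combination of container and index type.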

@yuyichao
Contributor Author

Well, unless we no longer need GC frames, or at least they no longer cause a performance penalty, I think this kind of optimization is necessary, and I don't see why it is a problem to let people outside Base use it.

I think that if x and y are already boxed (and rooted somewhere else, e.g. when they are passed in as arguments), the GC frame will not be created. The only issue is when they are not, and I suppose that in a lot of performance-sensitive code at least one of them very likely is not (e.g. an integer or a range).

Also, I think this is not only about BoundsError. There are other exceptions that take arguments, and @ScottPJones's UnicodeError mentioned above is a good example. In general I think Julia's error reporting is quite poor right now (InexactError may be an example), and AFAIK one of the concerns about including more information in errors is that doing so introduces GC frames and affects the performance of unrelated paths. IMHO, if we make it easy to have good error reports without any performance penalty outside the error path, this can encourage people to make exceptions and error messages more informative and useful.

As for a parametrized BoundsError, I guess it will cause more code to be generated (since there will be more types)? Apart from that, it only solves the issue for BoundsError (which can also be done by making the constructor specialized and not inlined), not the error reporting issue in general.

@mbauman
Member

mbauman commented May 15, 2015

My point is simply that I see it as a workaround that ideally wouldn't be necessary.

But I am afraid that a general solution is farther away than I had hoped, and it is good to DRY up some of these workarounds. There's also the throw_setindex_mismatch, which is a tougher case since it interpolates a string.

@carnaval
Contributor

the throw builtin could be special cased in codegen to generate a gc frame local to the current control flow block only. It would then be a bit more costly when you throw the exception every time but our exception stuff is not very fast anyway.

@ScottPJones
Contributor

What causes the exception handling to be not very fast? Does it at least incur no penalty if no exceptions are raised? That was a very important point of the way that exceptions were handled in CLU... [CLU even had a signaled exception cost the same as a normal return... there was nothing special about exceptions].

@yuyichao
Contributor Author

@carnaval
Is it preferable to add these special cases in codegen rather than using a Julia helper function? Also, can this (generating a per-block GC frame) be used in general (possibly merging frames where necessary)?

I guess what I'm really wondering is what the long-term solution to this problem should be: mainly how far away it is (so that we don't waste time on a workaround if a better solution is almost there), but also how easy it would be to undo a workaround.

@mbauman
Member

mbauman commented May 15, 2015

I could imagine a @gcframe macro that fences its enclosed code block with a GC Frame. But I am not nearly knowledgeable enough to be proposing solutions here.

@ScottPJones - this PR is an attempt to remove the only extra overhead (besides code size and potential cache misses) that an untaken error branch adds to a function. This is only an issue in very small inner functions that normally do no allocations, but of course need to allocate an error object in an error branch. The creation of that error object needs a GC frame. If there is no other need for that frame, it'd be nice to limit its creation to the error branch. I can't speak to exception performance once they're thrown, except to say that I don't think you'd want to write Pythonic duck-typed error catching control flow code in Julia.

@carnaval
Contributor

@yuyichao I think I'd prefer to add the special case to codegen under the general rule of pushing the ugliness down to the C code where it can possibly be removed later without bothering the julia side of the fence. I'm not sure how much work this special case would involve. I think the general optimization of pushing the gc frame down in the control flow graph as far as possible is interesting in general.

Exceptions are slower than they could be for several reasons. One is that inference does not reason about the types of thrown values (I'm incidentally working on something that could help, no guarantees). The other is that the way we lower them makes it impossible for llvm to reason about them. So even if you write try; throw(); catch; end; it won't be optimized.

I don't think it's that bad though since we generally don't use those as a control flow mechanism, only for error reporting for which the speed impact is negligible. Of course getting it so that the exceptional case does not slow down the general one is important, so the gc frame opt is interesting in that regard.

@ScottPJones
Contributor

@carnaval Does Julia at least use the technique (which I believe was first introduced by CLU), of using the IP to determine whether there is a catch that should be taken?

@vtjnash
Member

vtjnash commented May 16, 2015

> the throw builtin could be special cased in codegen to generate a gc frame local to the current control flow block only

the throw builtin already doesn't use a gc frame. the problem noted here is that constructing the BoundsError forced the creation of a gc frame in the code leading up the eventual call to throw.

thus a more satisfying fix for this particular issue is:

diff --git a/base/essentials.jl b/base/essentials.jl
index 45b737b..c694c66 100644
--- a/base/essentials.jl
+++ b/base/essentials.jl
@@ -6,9 +6,14 @@ typealias Callable Union(Function,DataType)

 const Bottom = Union()

+macro _noinline()
+    _expr(:meta, :noinline)
+end
+
 # constructors for Core types in boot.jl
 call(T::Type{BoundsError}) = Core.call(T)
-call(T::Type{BoundsError}, args...) = Core.call(T, args...)
+call(T::Type{BoundsError}, arg1) = (@_noinline; Core.call(T, arg1))
+call(T::Type{BoundsError}, arg1, arg2) = (@_noinline; Core.call(T, arg1, arg2))
 call(T::Type{DivideError}) = Core.call(T)
 call(T::Type{DomainError}) = Core.call(T)
 call(T::Type{OverflowError}) = Core.call(T)

a more general fix for this would be to allow VarArgs methods to benefit from specsig during codegen.

but it would be sufficient to allow merging #11244 now:

julia> @code_llvm getindex(1:1:1, 1)

define i64 @julia_getindex_20438(%StepRange.4, i64) {
top:
  %2 = icmp slt i64 %1, 1
  br i1 %2, label %L4, label %L1

L1:                                               ; preds = %top
  %3 = call i64 @julia_length2311(%StepRange.4 %0)
  %phitmp = icmp slt i64 %3, %1
  br i1 %phitmp, label %L4, label %L5

L4:                                               ; preds = %L1, %top
  %4 = load %jl_value_t** inttoptr (i64 4465733512 to %jl_value_t**), align 8
  %5 = call %jl_value_t* @julia_call_20439(%jl_value_t* %4, %StepRange.4 %0, i64 %1)
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 350)
  unreachable

L5:                                               ; preds = %L1
  %6 = extractvalue %StepRange.4 %0, 0
  %7 = add i64 %1, -1
  %8 = extractvalue %StepRange.4 %0, 1
  %9 = mul i64 %7, %8
  %10 = add i64 %9, %6
  ret i64 %10
}
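
For readers on a current Julia version, the constructor-level approach in the diff above can be approximated as follows. This is only a sketch under assumptions: SlotError and get_slot are hypothetical names, and the @noinline inner constructor is used as the modern counterpart of the (@_noinline; Core.call(...)) methods, not as a description of how Base defines BoundsError today.

```julia
# Hypothetical error type with a non-inlined, fixed-arity constructor,
# so that boxing of the arguments is left to the constructor call
# rather than being inlined into the caller.
struct SlotError <: Exception
    container::Any
    index::Any
    @noinline SlotError(container, index) = new(container, index)
end

# Hypothetical caller: the hot path is a range check plus a tuple load.
function get_slot(t::NTuple{3,Int}, i::Int)
    (1 <= i <= 3) || throw(SlotError(t, i))
    return t[i]
end
```

Compared with a shared helper function, this keeps the ordinary throw(SomeError(...)) spelling at the call site, but has to be repeated for each error type.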

@yuyichao
Contributor Author

> the throw builtin already doesn't use a gc frame. the problem noted here is that constructing the BoundsError forced the creation of a gc frame in the code leading up the eventual call to throw.

If I understand correctly, what @carnaval means is to make throw(...) generate its own local GC frame for the evaluation of its arguments instead of using the frame of the current function (at least when the current function doesn't otherwise have one). This should indeed solve the problem, and it should be relatively easy compared to pushing down the creation of GC frames in general, since throw does not return.

The problem that I want to address in this PR is to avoid the extra cost of better error reporting in general (although this should indeed help #11244 to be merged, I hope at least=) ).

IMHO, the solutions for this problem are the following,

  1. Use a non-inlined specialized function to create (or create and throw) the exception.

    This can be done either in the constructor (what you proposed) or in a helper function. I personally prefer the helper function, since it can more easily be used with other exception types as well, and other people seem to be using the same approach to solve this problem. (It might also have the minor advantage of generating less code if different exceptions can share the same code, and of a smaller body for the original function (by one function call to throw...).)

  2. Improve the codegen: either special-case throw (or the block that ends with a throw) or improve the creation of GC frames in general. I agree that this pushes the dirty part down to C/C++; my only concern is that it might be too much work for a temporary solution. (On the other hand, reorganizing how GC frames are managed in codegen might benefit other GC improvements in the future, so it might not be totally wasted work either. It is still much more work and probably requires even more planning...)

@carnaval
Contributor

@yuyichao got it right, what I'm proposing was a heuristic of pushing down the gc frame setup inside the same basic block as throw when possible.

The more general optimization would be very interesting IMO: find a set (or maybe just one, as a more reasonable starting point) of "root" basic blocks such that 1) they strictly dominate every store to the gc frame and 2) no basic block containing such a store has more than one "root" BB as an ancestor.
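
A toy illustration of the single-root variant of this idea, in current Julia syntax. It is not Julia's codegen and not a proposed implementation: the CFG is a hand-written simplification of the "Before" IR for getindex posted earlier in this thread, all function names are made up for the sketch, and it uses plain (non-strict) dominance and picks the deepest block that dominates every block containing a GC frame store.

```julia
# Toy CFG as successor lists. Block names follow the "Before" IR for
# getindex, with the write-barrier blocks folded into :if2 for brevity.
const SUCCS = Dict(
    :top   => [:if2, :L1],
    :L1    => [:if2, :L4],
    :if2   => [:cont3],
    :cont3 => [:L4],
    :L4    => [:if5, :L7],
    :if5   => [:L7],
    :L7    => Symbol[],
)

# Iterative dominator sets: dom[b] is b plus the intersection of dom[p]
# over all predecessors p of b, repeated until nothing changes.
function dominators(succs, entry)
    blocks = collect(keys(succs))
    preds = Dict(b => Symbol[] for b in blocks)
    for (b, ss) in succs, s in ss
        push!(preds[s], b)
    end
    dom = Dict(b => Set(blocks) for b in blocks)
    dom[entry] = Set([entry])
    changed = true
    while changed
        changed = false
        for b in blocks
            (b == entry || isempty(preds[b])) && continue
            d = copy(dom[first(preds[b])])
            for p in preds[b]
                intersect!(d, dom[p])
            end
            push!(d, b)
            if d != dom[b]
                dom[b] = d
                changed = true
            end
        end
    end
    return dom
end

# Deepest block that dominates every block storing to the GC frame.
function frame_root(succs, entry, store_blocks)
    dom = dominators(succs, entry)
    common = reduce(intersect, [dom[b] for b in store_blocks])
    root = entry
    for b in common
        # Common dominators form a chain; the deepest has the largest set.
        length(dom[b]) > length(dom[root]) && (root = b)
    end
    return root
end
```

With stores only in :if2, frame_root(SUCCS, :top, [:if2]) returns :if2, meaning the frame setup could move out of top and into the error branch. If a store also occurred in :if5, the only common dominator would be :top and nothing could be pushed down with a single root.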

Forgetting about this for now, I don't have a strong opinion on either of the two approaches. I'm fine with using one of those in a few chosen places as a stopgap. What I really don't want is people to start using it as a "the general way to have fast error handling" because we definitely could have better codegen for this.

@ScottPJones No, we do all of this using the simple setjmp/longjmp stuff. It would be interesting to try something else but it's not high on anyone's todo list since we don't use handlers in any perf sensitive places. From what I understand the CLU approach requires close cooperation with the backend. To go this way the simplest would probably be to reuse llvm's infrastructure for the zero-overhead cxx exception handling. I'm not familiar with it but it does seem like a pain for no clear gain for now.

@yuyichao
Contributor Author

> What I really don't want is people to start using it as a "the general way to have fast error handling" because we definitely could have better codegen for this.

OK, I see your point here. I just hope that we don't need to delay better error reporting until then. Each of those attempts seems either to cause a performance regression (currently my PR) or to introduce exactly this kind of function for a specific type (@mbauman and maybe @ScottPJones?).

On the other hand, as you can see in the llvm IR I posted above, the function body is also much smaller. Do you know how much this affects performance, and whether it justifies using a special form / function to throw an exception?

(Now that I think about this, maybe we can just use type inference to do the transformation?)

@ScottPJones
Contributor

@carnaval Well, setjmp/longjmp are incredibly expensive!
I developed an exception system for our C code (C++ didn't even exist yet), that used setjmp/longjmp,
and it was always something that we had to be careful about in performance critical areas.
I actually got "Try {} Catch {} Finally {}" syntax into CachéObjectScript, using CLU's trick of using the IP to find the correct Catch {} handler with zero overhead when not taken, and it was a big performance win,
as well as it got people to actually write more robust code... (people had learned quickly that the old error handling, from Mumps, was very expensive, much like setjmp in C/C++, and avoided using it).
From my experience, having zero-overhead exceptions is a huge gain!

@yuyichao
Contributor Author

I thought a big part of the overhead was from collecting the backtrace every time. I didn't know that setjmp/longjmp are expensive too...

@ScottPJones
Contributor

@yuyichao Think about all the register state that has to be saved on a modern machine... I remember that the Sparc architecture was particularly horrendous with setjmp/longjmp.
@carnaval I'd also like to see as general as possible optimizations in this area... and in other places,
I don't know if LLVM can help optimizing away setting up the frame for code paths that don't need it...
(many times you have a quick test, for example in string functions, if the length is 0, and just return immediately, and it is good not to have to set up the stack frame, etc.)

@vtjnash
Member

vtjnash commented May 16, 2015

Setjmp and longjmp are each about half a dozen assembly instructions. At some point we might switch to "zero-cost" exception frames based on llvm's c++ support, which makes entering a try block free, at the cost of making a throw much more expensive.

The real performance loss for exceptions happens in the backtrace collection.

@vtjnash
Member

vtjnash commented May 16, 2015

> Think about all the register state that has to be saved on a modern machine

Yes, but that's the same state that has to be saved on every function call, and much less than a syscall or other context switch has to save.

> I don't know if LLVM can help optimizing away setting up the frame for code paths that don't need it

We turn it off because it does not play very nice with profiling, backtraces, or debugging, while the benefit to performance is not particularly clear.

@yuyichao
Contributor Author

@vtjnash Are we collecting the backtrace of an exception just for debugging? (Note: I'm not saying this is not a good enough reason to do that, just curious.) And do we still collect a full backtrace (instead of only up to the latest catch)?

I don't know how the library used to collect the backtrace works, but would it be more efficient to generate the backtrace ourselves (using a global list, for example), or would that have the same issue with creating GC frames?

@carnaval
Contributor

I think he was talking about the gc frame not the stack frame. I entirely agree that if the stack frame setup is a performance problem for you, you probably should be inlining the function anyway.

LLVM itself cannot help us with the gc frame since we leak it to global state right away. Making this optimization would require either a custom llvm pass or modifying codegen.

For the exception cost stuff, we probably all agree that :

  • throwing one will be slowish forever due to the unwinding
  • setting up a handler could be a bit faster than setjmp

I'm not sure the implementation cost vs perf reward is worth it here. Setjmp makes it so simple and we have so many other places where we could improve perf in a much more user visible way.

@ScottPJones
Contributor

@vtjnash I don't know what happens in Julia so much, because it seems that it inlines rather aggressively, but we always did debugging builds with the frames, production builds without, and at least for us, it made a big difference [most of it was because of using a large dispatch table to handle the ops for the VM, we didn't do any JIT compilation, and even many of the ops in C were fairly small... (the truly heavily used ones I implemented in assembly, and directly dispatched to the next instruction)]
I feel there should be a trivial way of enabling/disabling generating the frame if it isn't needed...
I also don't think it is true that it is the same state that is saved on every function call... at least from what I recall, you have to save all the registers that aren't just for scratch values, whereas on a function call,
the called function preserves only those registers that it actually needs to use (often none, these days, with lots of registers, and parameters passed in registers).

@ScottPJones
Contributor

@carnaval I'm not saying that it should be done immediately, but I definitely think it should be done before 1.0 😀 Even the LLVM doc recommends using the zero-overhead approach vs. the SJLJ approach.

@yuyichao
Contributor Author

OK. This version uses type inference to replace calls to throw with calls to the helper function. In terms of keeping as little dirty code as possible on the Julia side, this should be the closest thing to doing something similar in codegen (and it still has the advantage of a smaller function body). I had a look at the codegen for GC frames and it looks like too much for me. (Although it is probably easier for an expert, and it would be nice to know how the codegen for these works as well.)

@yuyichao
Contributor Author

Somehow getindex(::SimpleVector, ::Int) starts to generate a GC frame even if I remove all the throws in the function (i.e. no error checking at all), although it is still slightly faster (2-5%) after this change.

New benchmark (time_func defined same as above)

@code_llvm checkbounds("", 1)

f(a, b) = chr2ind(a, b)

time_func(f, "a", 1)

Before

define i1 @julia_checkbounds_44384(%jl_value_t*, i64) {
top:
  %2 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 2
  %4 = bitcast [3 x %jl_value_t*]* %2 to i64*
  store i64 2, i64* %4, align 8
  %5 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 1
  %6 = bitcast %jl_value_t** %5 to %jl_value_t***
  %7 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %7, %jl_value_t*** %6, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %8 = icmp slt i64 %1, 1
  br i1 %8, label %L3, label %L1

L1:                                               ; preds = %top
  %9 = bitcast %jl_value_t* %0 to %jl_array_t**
  %10 = load %jl_array_t** %9, align 8
  %11 = getelementptr inbounds %jl_array_t* %10, i64 0, i32 1
  %12 = load i64* %11, align 8
  %13 = icmp slt i64 %12, %1
  br i1 %13, label %L3, label %if2

if2:                                              ; preds = %L1
  %14 = load %jl_value_t*** %6, align 8
  store %jl_value_t** %14, %jl_value_t*** @jl_pgcstack, align 8
  ret i1 true

L3:                                               ; preds = %L1, %top
  %15 = call %jl_value_t* @alloc_2w()
  %16 = getelementptr inbounds %jl_value_t* %15, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 139853858757776 to %jl_value_t*), %jl_value_t** %16, align 8
  %17 = getelementptr inbounds %jl_value_t* %15, i64 0, i32 0
  store %jl_value_t* %0, %jl_value_t** %17, align 8
  %18 = getelementptr inbounds %jl_value_t* %15, i64 1, i32 0
  store %jl_value_t* null, %jl_value_t** %18, align 8
  store %jl_value_t* %15, %jl_value_t** %3, align 8
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %1)
  store %jl_value_t* %19, %jl_value_t** %18, align 8
  %20 = icmp eq %jl_value_t* %19, null
  br i1 %20, label %cont4, label %wb_not_null

wb_not_null:                                      ; preds = %L3
  %21 = bitcast %jl_value_t** %16 to i64*
  %22 = load i64* %21, align 8
  %23 = and i64 %22, 1
  %24 = icmp eq i64 %23, 0
  br i1 %24, label %cont4, label %wb_may_trigger

wb_may_trigger:                                   ; preds = %wb_not_null
  %25 = getelementptr inbounds %jl_value_t* %19, i64 -1, i32 0
  %26 = bitcast %jl_value_t** %25 to i64*
  %27 = load i64* %26, align 8
  %28 = and i64 %27, 1
  %29 = icmp eq i64 %28, 0
  br i1 %29, label %wb_trigger, label %cont4

wb_trigger:                                       ; preds = %wb_may_trigger
  call void @gc_queue_root(%jl_value_t* %15)
  br label %cont4

cont4:                                            ; preds = %wb_trigger, %wb_may_trigger, %wb_not_null, %L3
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 148)
  call void @llvm.trap()
  unreachable
}
f
   1.791 seconds     

After

define i1 @julia_checkbounds_67755(%jl_value_t*, i64) {
top:
  %2 = icmp slt i64 %1, 1
  br i1 %2, label %L3, label %L1

L1:                                               ; preds = %top
  %3 = bitcast %jl_value_t* %0 to %jl_array_t**
  %4 = load %jl_array_t** %3, align 8
  %5 = getelementptr inbounds %jl_array_t* %4, i64 0, i32 1
  %6 = load i64* %5, align 8
  %7 = icmp slt i64 %6, %1
  br i1 %7, label %L3, label %if2

if2:                                              ; preds = %L1
  ret i1 true

L3:                                               ; preds = %L1, %top
  %8 = load %jl_value_t** inttoptr (i64 139976283090536 to %jl_value_t**), align 8
  %9 = icmp eq %jl_value_t* %8, null
  br i1 %9, label %err, label %ok

err:                                              ; preds = %L3
  call void @jl_undefined_var_error(%jl_value_t* inttoptr (i64 139984968716432 to %jl_value_t*))
  unreachable

ok:                                               ; preds = %L3
  call void @julia_throw_with_args4956(%jl_value_t* %8, %jl_value_t* %0, i64 %1)
  ret i1 undef
}
f
   1.554 seconds     

P.S. Does anyone know how to turn on the verbose output of @time that includes allocations etc. now? (Edit: found it, it's @timev.)

@yuyichao
Contributor Author

> Somehow getindex(::SimpleVector, ::Int) starts to generate GC frame even if I remove all the throws in the function (i.e. no error checking at all) although it is still slightly faster (2-5%) after this change.

The GC frame is generated because of #11304. (And it was generated for the version without checking because I forgot to add the Base. prefix to a non-exported symbol when defining it in a script.) This should be fixed by #11306.

@ScottPJones
Contributor

@yuyichao I hope you liked the new @timev macro! 😀

@yuyichao
Contributor Author

@ScottPJones I liked it by looking at the implementation. Unfortunately (or fortunately?) I haven't used it to benchmark anything that gives me any non-zero result other than the time yet :P

@ScottPJones
Contributor

It helped me see a lot of why the string conversions were so slow... and even with my fixes that I'm hoping will get merged in soon (#11004) there are some extra allocations that I have to track down for the next round of speeding up Julia string handling 😀

@yuyichao
Contributor Author

Rebased due to conflict with #11274

@yuyichao
Contributor Author

yuyichao commented Jun 5, 2015

This PR is probably superseded by #11508 now. See #11508 (comment) for benchmarks

@yuyichao yuyichao closed this Jun 25, 2015
@yuyichao yuyichao deleted the throw-error branch July 15, 2015 21:55