Skip to content
This repository has been archived by the owner on Dec 6, 2019. It is now read-only.

Towards a semi-performant recursive interpreter #37

Merged
merged 40 commits into from
Jan 27, 2019
Merged

Towards a semi-performant recursive interpreter #37

merged 40 commits into from
Jan 27, 2019

Conversation

timholy
Copy link
Member

@timholy timholy commented Jan 20, 2019

This is a pretty big overhaul of this package. I've tried to keep its original functionality, but perhaps a split should be the next step (see #32). The first 5 commits get the tests passing, with the exception of the ui tests which I didn't even look at (CC @staticfloat re #34, #36; @pfitzseb re #33).

The rest is much more ambitious. I worked a lot on the performance, since an "easy" way to add robust breakpoints, etc, is via running your code in the interpreter. And of course there's interest in using the interpreter to circumvent compile-time cost. But then performance matters.

The most important changes here are focused on reducing the cost of dynamic dispatch (or its equivalent here, "dynamic lowered code lookup"). Some highlights:

  • At the lowest level, all calls are to Builtins and IntrinsicFunctions. I used the tfunc data in Core.Inference to auto-generate an evaluator that resolve all calls with fixed number of arguments. A straightforward extension would be resolve all calls with bounded numbers of arguments (e.g., those that have between 2 and 4 arguments).
  • To reduce the overhead of which, this adds "local method tables," one per :call Expr. These are cached by exact type, since isa(x, Int) is fast but isa(x, Integer) is slow. So it uses MethodInstance comparisons rather than Method signature comparisons, even though it might look up the same lowered code. I think a slightly more elegant way to do this would be to add a new type,
struct LocalMethodTable
    call_expr::Expr
    knownmethods::TypeMapEntry
end

to the list of valid types in a CodeInfo. (I did it this way at first but it breaks things like basic-block computation and ssa-usage analysis. So I resorted to storing this info in a separate fields.)

For a simple summation test, I'm getting about 15us per iteration. Compiled code is about 5ns, so this is still dirt-slow. But just getting it to this point was quite a major overhaul.

CC @JeffBezanson, @vtjnash, @StefanKarpinski.

@jpsamaroo
Copy link

Awesome! What's that summation test look like? The compiled version could possibly be compiling down to a static result (i.e. removing the loop and just returning the result directly), since 5ns is wicked fast. That wouldn't therefore be a good measure of this PR's performance gains.

@timholy
Copy link
Member Author

timholy commented Jan 20, 2019

The summation test is here. 5ns/iteration is expected for a GHz CPU (it would be faster still if I had added @inbounds and @simd). So 15us/iteration implies a few thousand CPU/cycles per iteration, which is pretty slow. It would be faster if we didn't care about stashing the result of computations in JuliaStackFrames, but that's essential for this package's utility as a debugger.

end
print(io,
"""
$head name == :$fname

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way to do this is to evaluate the function (args[1]) and then compare directly against the builtin objects, f === Core.tuple, f === Core.ifelse, etc. We should also pre-process the code to replace constant GlobalRefs with quoted values to avoid lookups in the common case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I thought of that and then talked myself out of it for fears of introducing the interpreter-equivalent of JuliaLang/julia#265. But now I'm not so sure that's a risk, and in any event I currently clear the cache in between @interpret invocations.

@KristofferC KristofferC reopened this Jan 21, 2019
@KristofferC
Copy link
Member

KristofferC commented Jan 21, 2019

Thanks a lot for this. I pushed a commit to update the Project + Manifest to get CI running (hope you don't mind).

I'll look at the UI tests.

@KristofferC
Copy link
Member

So the UI tests pass as long as there is a good way to set the fullpath variable to false in the JuliaFrameCode. Previously, this was done by a keyword to the JuliaStackFrame. Of course I could hack something in but do you have any preference how to expose this option @timholy, or perhaps we could tweak the tests themselves?

Manifest.toml Outdated
uuid = "67417a49-6d77-5db2-98c7-c13144130cd2"
version = "0.1.2+"

[[DebuggingUtilities]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I added this while I was debugging my own work on this PR---@showln is very useful. But we don't need it to be part of this package long-term. I'll trust it's OK with you if I overwrite this.

@vtjnash
Copy link

vtjnash commented Jan 21, 2019

to auto-generate an evaluator that resolve all calls with fixed number of arguments

That all sounds very confusing. Any reason not to just execute them directly? They are just normal methods, albeit without source code accessible to reflection (and that block overloading).

@timholy
Copy link
Member Author

timholy commented Jan 21, 2019

OK, this fixes what I think are the remaining problems. This enhances the performance on my loop tests a teeny bit more (down to 13us per iteration), presumably because of resolving GlobalRefs during optimize!. I also fixed a breakage on master due to the very nice JuliaLang/julia#30641.

The most important new commit is 19c1492, which fixed a serious bug that caused it to sometimes look up the wrong lowered code for generated functions. Now this seems to behave quite robustly; for example, I can run the subarray tests. They're not fast, but not much insanely slower than in compiled mode (300s for compiled, 900s for interpreted). Most of the time is spent on one line to eval %new expressions.

@KristofferC, I also added kwargs support back to the JuliaStackFrame constructor, so hopefully you can get the ui tests passing. Thanks for tackling that! ❤️

I think this can be merged. (I can't do that, however.) And I do think we should plan on splitting out the DebuggerFramework functionality out very shortly afterwards. There's plenty of reason to be interested in an interpreter independent of its usage for a IDE or REPL debugger. So perhaps before hitting merge we should think a bit about what that will look like (and what the package names should be).

@KristofferC
Copy link
Member

One thing that I know have been discussed is to rename this package (or whatever is the package that a user will finally interact with). In 0.6 Gallium took that place (simply reexporting ASTInterpreter2) but I wonder if we should just come up with something fresh.

@timholy
Copy link
Member Author

timholy commented Jan 21, 2019

That all sounds very confusing. Any reason not to just execute them directly? They are just normal methods, albeit without source code accessible to reflection (and that block overloading).

That is what it does, see https://github.com/timholy/ASTInterpreter2.jl/blob/teh/localmt/src/builtins.jl. Rather than typing those all out, though, I just check the tfunc tables in Core.Compiler to see how many args are supported. They are generated by code in https://github.com/timholy/ASTInterpreter2.jl/blob/teh/localmt/src/generate_builtins.jl, which is about half the size of the finished product, and would have the bonus that if any of the builtins or intrinsics ever change, this should pretty much auto-update.

One thing I didn't do: we might want to consider adding if nargs == cases for arrayref and arrayset for low dimensionality.

@timholy
Copy link
Member Author

timholy commented Jan 21, 2019

I've thought about CodeInfoInterpreter.jl or LoweredInterpreter.jl or just Interpreter.jl.

@Keno
Copy link
Member

Keno commented Jan 21, 2019

Awesome work. I'll let @KristofferC take this forward in detail. My only concern is that ASTInterpreter2 was supposed to be very small and maintainable, so I'm afraid to add too much here that would make it more complicated. On the other hand, perhaps that's offset by more people looking at that. Also seconded on @vtjnash's question why generating the source file is necessary.

src/builtins.jl Outdated
return Some{Any}(Core._apply_pure(getargs(args, frame)...))
elseif f === Core._expr
return Some{Any}(Core._expr(getargs(args, frame)...))
elseif f === Core._typevar
Copy link
Member Author

@timholy timholy Jan 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this isn't present on 1.0. One wonders if we should avoid committing builtins.jl and just generate it during Pkg.build?

Copy link
Member

@KristofferC KristofferC Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either that or maybe @static check their existence.

 @static isdefined(Core, :_typevar) ? f === Core._typevar : false

During build time seems likely to be better though since otherwise we might miss things if we remove some builtins in the future.

@timholy
Copy link
Member Author

timholy commented Jan 21, 2019

Also seconded on @vtjnash's question why generating the source file is necessary.

We don't have to do it that way, although Core._typevar points out some potential advantages to autogeneration (if it's sufficiently robust...you could imagine it going either way as far as maintainability goes).

@timholy
Copy link
Member Author

timholy commented Jan 21, 2019

My only concern is that ASTInterpreter2 was supposed to be very small and maintainable, so I'm afraid to add too much here that would make it more complicated. On the other hand, perhaps that's offset by more people looking at that.

I think the biggest reason to do it is that if we can get reasonable performance with the interpreter it becomes feasible to support breakpoints in the short term. (Supporting breakpoints is easy if all the code is running in the interpreter.)

I don't think even a loving author could say that the performance of this PR makes it "reasonable," but it is a ~60x improvement over where I started. If I could get another 10x I'd be really happy. But I'm skeptical that we can get there without some additional higher-level analysis, most important probably being a certain amount of lowered-code inlining to reduce the number of created stackframes. (That seems likely to require a limited form of type-inference...) Julia's polymorphism makes it much more difficult to write a fast interpreter for than, say, languages where for i = 1:100 is a construct of the language itself (e.g., https://dzone.com/articles/adventures-in-jit-compilation-part-1-an-interptete, where even "simpleinterp" is about 180x faster than this). We have to recurse into every call to iterate.

@vtjnash
Copy link

vtjnash commented Jan 21, 2019

Most of the time is spent on one line to eval %new expressions.

It'd probably be much faster to ccall that function directly, rather than having eval do it. I've seen calls to that ccall elsewhere in this PR already...

Rather than typing those all out

That's good to hear, since that list is large, but f isa Core.Intrinsic && f(args...) should run faster than checking for the values 1 to 100 in separate if statements. If you want to skip _apply, hard coding for length instead should be fast and general (e.g. unroll each call site for 1-N args, where N is a small integer of about 6)

@timholy
Copy link
Member Author

timholy commented Jan 22, 2019

Doesn't actually work out that way:

julia> using BenchmarkTools, ASTInterpreter2

julia> function runifbuiltin(qf, qx)
           f, x = qf.value, qx.value
           if isa(f, Core.Builtin)
               return Some{Any}(f(x))
           end
           return qx
       end
runifbuiltin (generic function with 1 method)

julia> function runifintrinsic(qf, qx)
           f, x = qf.value, qx.value
           if isa(f, Core.IntrinsicFunction)
               return Some{Any}(f(x))
           end
           return qx
       end
runifintrinsic (generic function with 1 method)

julia> qf = QuoteNode(abs)
:($(QuoteNode(abs)))

julia> qx = QuoteNode(3)
:($(QuoteNode(3)))

julia> @btime runifbuiltin(qf, qx)
  64.096 ns (0 allocations: 0 bytes)
:($(QuoteNode(3)))

julia> @btime runifintrinsic(qf, qx)
  11.302 ns (0 allocations: 0 bytes)
:($(QuoteNode(3)))

julia> ex = Expr(:call, qf, qx)
:(($(QuoteNode(abs)))($(QuoteNode(3))))

julia> @btime ASTInterpreter2.maybe_evaluate_builtin(nothing, ex)
  16.826 ns (0 allocations: 0 bytes)
:(($(QuoteNode(abs)))($(QuoteNode(3))))

julia> qf = QuoteNode(Core.sizeof)
:($(QuoteNode(Core.sizeof)))

julia> @btime runifbuiltin(qf, qx)
  96.828 ns (1 allocation: 16 bytes)
Some(8)

julia> @btime runifintrinsic(qf, qx)
  14.321 ns (0 allocations: 0 bytes)
:($(QuoteNode(3)))

julia> ex = Expr(:call, qf, qx)
:(($(QuoteNode(Core.sizeof)))($(QuoteNode(3))))

julia> @btime ASTInterpreter2.maybe_evaluate_builtin(nothing, ex)
  30.768 ns (1 allocation: 16 bytes)
Some(8)

julia> qf = QuoteNode(Base.neg_int)
:($(QuoteNode(neg_int)))

julia> @btime runifbuiltin(qf, qx)
  113.218 ns (1 allocation: 16 bytes)
Some(-3)

julia> @btime runifintrinsic(qf, qx)
  56.686 ns (1 allocation: 16 bytes)
Some(-3)

julia> ex = Expr(:call, qf, qx)
:(($(QuoteNode(neg_int)))($(QuoteNode(3))))

julia> @btime ASTInterpreter2.maybe_evaluate_builtin(nothing, ex)
  45.757 ns (1 allocation: 16 bytes)
Some(-3)

I've also tried a middle ground, checking isa(f, Core.IntrinsicFunction) and then "dispatching" by number of arguments, and that's slower. I've not found any approach that's faster than the one in this PR.

@timholy
Copy link
Member Author

timholy commented Jan 22, 2019

It'd probably be much faster to ccall that function directly, rather than having eval do it.

Good call! 900s->590s.

@KristofferC
Copy link
Member

KristofferC commented Jan 22, 2019

I updated the UI tests (and their dependencies) so tests now pass on 1.1 and nightly. 1.0 fails due to the already mentioned _typevar issue.

Project.toml Outdated

[targets]
test = ["Test", "TerminalRegressionTests", "VT100"]
build = ["InteractiveUtils"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no build target.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended approach? Just make it a dependency of the package?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@timholy
Copy link
Member Author

timholy commented Jan 22, 2019

OK, I've modified this to generate the file at build time. I expect tests to pass. Before merging, I think it would be best to wait a bit in case anyone wants to give this a detailed review.

There are also a couple of issues to discuss. First, I recognize the excitement to get debugging working in Juno again. But I'd urge a bit of patience to tackle a couple of things that should be done first:

  • any package splitting/reorganization/renaming
  • integration with Revise's ability to correct line numbers. A debugger is nice for, well, fixing bugs. But if this results in line number changes in methods that don't get recompiled, the next time you try to step into code you'll presumably be at the wrong line number of the source file, and users will be really confused.
  • a couple of days for me to experiment with this for interpreting code in module-scope. I think it should work, but I haven't tried to play with it. Revise will evaluate changing its internal organization to use a stripped-down ASTInterpreter2 (just the interpreter part) to more robustly find method signatures and add "lowering backedges" (see How do you feel about hammers? #32).

The point being that I don't think it would be great to reintroduce a stepping-debugger to the Julia world and then break it only a couple of days later.

@KristofferC
Copy link
Member

Hm, it seems test only dependencies that are using a checked out branch in the Manifest aren't too happy when they get put in the test section. Will investigate.

@timholy
Copy link
Member Author

timholy commented Jan 22, 2019

Also one technical issue to highlight: before merging, perhaps I should experiment with moving the local method tables from the JuliaFrameCode to the JuliaStackFrame (the first being the "reusable component" and the second being the "instance for this particular call"). This may have performance advantages, if I can do it in a way that doesn't throw out the method tables for any recursive calls.

@timholy
Copy link
Member Author

timholy commented Jan 22, 2019

Surprisingly, moving the local method tables led to a 15% slowdown. So I say let's leave these as they are. So the only barriers I know of to merging are (1) any review comments, and (2) fixing the Pkg issues. Then after merging we can discuss splitting the package.

@timholy
Copy link
Member Author

timholy commented Jan 23, 2019

To ensure that this package can be hacked on by more than a small handful of people, I've added a bunch of docstrings and some Documenter narrative, including a somewhat gentle introduction to Julia's lowered AST representation. I haven't implemented deployment yet since I think it would be better to do that once we're basically ready to release. But you can view the docs locally by navigating to the docs directory and running julia make.jl and then opening build/index.html in a browser.

timholy and others added 16 commits January 27, 2019 14:38
This also necessitates changing how Builtins are handled, but this is
more robust than the previous module/symbol check.
The generator can in principle return arbitrary expressions depending
on the exact type.
This also makes minor code cleanups.
At least module-scope code can use Ints for :gotoifnot expressions.
Because we run the optimizer we don't want to contaminate the
original.
Separates out the `JuliaFrameCode` constructor, generalizes
`moduleof`, adds pc utilities
fixup some toml files and .travis
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants