Performance concerns with ReinterpretArray #25014

Closed
timholy opened this issue Dec 10, 2017 · 1 comment · Fixed by #28707

timholy commented Dec 10, 2017

Ref https://discourse.julialang.org/t/big-overhead-with-the-new-lazy-reshape-reinterpret/7635

@ExpandingMan

Still seeing this problem.

I’m getting a median time of 2.6 ns for accessing an array created with unsafe_wrap, and a median time of 10.0 ns for a reinterpreted array.
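A comparison of this kind can be sketched with Base alone. This is only an illustration, not the benchmark from the Discourse thread: the helper names `sum_elements`, `min_time_ns` and `compare_access` are made up here, and the timings it reports will vary by machine and Julia version.

```julia
# Sum every element through plain indexing, so the cost of getindex dominates.
function sum_elements(A::AbstractVector{T}) where {T}
    s = zero(T)
    @inbounds for i in eachindex(A)
        s += A[i]
    end
    return s
end

# Smallest wall-clock time (nanoseconds) over `reps` runs of f(A).
# The results are accumulated in `sink` so the calls cannot be optimized away.
function min_time_ns(f, A, reps)
    best = typemax(UInt64)
    sink = zero(eltype(A))
    for _ in 1:reps
        t0 = time_ns()
        sink += f(A)
        best = min(best, time_ns() - t0)
    end
    return best, sink
end

# Build one byte buffer and view it as Int64 in two ways:
# a lazy reinterpret wrapper and an Array created with unsafe_wrap.
function compare_access(n::Integer; reps::Integer = 100)
    bytes = zeros(UInt8, n * sizeof(Int64))
    lazy = reinterpret(Int64, bytes)
    lazy .= 1:n                         # writes through to `bytes`
    result = GC.@preserve bytes begin   # `bytes` must outlive `wrapped`
        wrapped = unsafe_wrap(Array, convert(Ptr{Int64}, pointer(bytes)), Int(n))
        (lazy_sum = sum_elements(lazy),
         wrapped_sum = sum_elements(wrapped),
         lazy_ns = first(min_time_ns(sum_elements, lazy, reps)),
         wrapped_ns = first(min_time_ns(sum_elements, wrapped, reps)))
    end
    return result
end

r = compare_access(1024)
println("reinterpret: ", r.lazy_ns, " ns, unsafe_wrap: ", r.wrapped_ns, " ns")
```

Both views read the same memory, so the two sums must agree; only the timings are expected to differ.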

From what I gather on Discourse, it seems that very few people have the expertise to tackle this.

Keno added a commit that referenced this issue May 23, 2018
When I originally wrote the new ReinterpretArray code, I made sure that LLVM was able
to optimize reinterpret(::Array) back to a single memory access with appropriate TBAA
and alignment info. Somewhere along the line LLVM lost that ability. While we should
try to recover that capability in LLVM, its loss showed that this is a relatively
brittle optimization for a very simple operation. So this patch takes a different approach:
We add two new intrinsics `tbaa_pointerref` and `tbaa_pointerset` that behave like
their non-TBAA variants, but additionally take a type to use as the TBAA tag. This
allows us to write a special case for `reinterpret(T, ::Array)` that directly emits
the correct pointer access. It's also a model for what a post-1.0 pure Julia
implementation of `Array` (e.g. on top of a buffer type) may look like.

Fixes #25014
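To make "a single memory access" concrete, here is a rough sketch of what reading one reinterpreted element amounts to. It uses the public `unsafe_load` rather than the `tbaa_pointerref` intrinsic proposed in the commit message, so it carries no TBAA information; `load_direct` is a hypothetical helper name, not part of Base.

```julia
# Read the i-th value of type T directly from the memory of `a`.
# Assumes T is a bits type with nonzero size and that the data is suitably aligned.
function load_direct(::Type{T}, a::Array, i::Integer) where {T}
    isbitstype(T) || throw(ArgumentError("T must be a bits type"))
    sizeof(T) > 0 || throw(ArgumentError("T must have nonzero size"))
    1 <= i <= sizeof(a) ÷ sizeof(T) || throw(BoundsError(a, i))
    # Keep `a` rooted while a raw pointer into it is in use.
    return GC.@preserve a unsafe_load(convert(Ptr{T}, pointer(a)), i)
end

a = Float32[1.0, 2.0, 3.0, 4.0]
println(load_direct(UInt32, a, 3) == reinterpret(UInt32, a)[3])
```

The value matches what `reinterpret(UInt32, a)[3]` returns; the difference the issue is about lies in how much work the compiler has to see through to get there.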
Keno added a commit that referenced this issue Aug 16, 2018
This fixes #25014 by making it more obvious what's going on to LLVM.
Instead of a memcpy loop, we use a new intrinsic that puts an actual
llvm.memcpy into the IR, which is enough for LLVM to fold everything
away. In the benchmark from #25014, we still see some regressions from
0.6, but that is because it needs to dereference through the pointers
in the reinterpret and reshape wrappers. In any real code, that
dereferencing should be loop-invariantly moved out of the inner loop.
Keno added a commit that referenced this issue Aug 16, 2018
This fixes #25014 by making it more obvious what's going on to LLVM.
Instead of a memcpy loop, we use a ccall to :memcpy and turn this into
llvm.memcpy at the IR level, which is enough for LLVM to fold everything
away. In the benchmark from #25014, we still see some regressions from
0.6, but that is because it needs to dereference through the pointers
in the reinterpret and reshape wrappers. In any real code, that
dereferencing should be loop-invariantly moved out of the inner loop.
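The memcpy-based read described above might look roughly like the following. This is a simplified sketch, not the code from the commit: `load_via_memcpy` is a hypothetical name, it handles only a `Vector{UInt8}` source, and whether LLVM folds the copy into a plain load depends on the Julia and LLVM versions in use.

```julia
# Copy sizeof(T) bytes starting at a zero-based byte `offset` into a Ref{T},
# then read the value back out. The offset does not need to be aligned for T.
function load_via_memcpy(::Type{T}, bytes::Vector{UInt8}, offset::Integer) where {T}
    isbitstype(T) || throw(ArgumentError("T must be a bits type"))
    (0 <= offset && offset + sizeof(T) <= length(bytes)) ||
        throw(BoundsError(bytes, offset + 1))
    out = Ref{T}()
    GC.@preserve bytes out begin
        dst = Base.unsafe_convert(Ptr{T}, out)
        src = pointer(bytes) + offset
        ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              dst, src, sizeof(T))
    end
    return out[]
end

bytes = UInt8[0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]
# An aligned read agrees with the lazy reinterpret wrapper.
println(load_via_memcpy(UInt32, bytes, 4) == reinterpret(UInt32, bytes)[2])
```

Going through a copy rather than a typed pointer load is what lets the same code path serve unaligned offsets, for example `load_via_memcpy(UInt32, bytes, 1)`.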
KristofferC pushed a commit that referenced this issue Aug 19, 2018
This fixes #25014 by making it more obvious what's going on to LLVM.
Instead of a memcpy loop, we use a ccall to :memcpy and turn this into
llvm.memcpy at the IR level, which is enough for LLVM to fold everything
away. In the benchmark from #25014, we still see some regressions from
0.6, but that is because it needs to dereference through the pointers
in the reinterpret and reshape wrappers. In any real code, that
dereferencing should be loop-invariantly moved out of the inner loop.

(cherry picked from commit 777810b)