
Performance improvements for PointVector, RayVector, SubObjectIterator #3369

Merged: 1 commit into master, Feb 16, 2024

Conversation


@fingolfin (Member) commented Feb 15, 2024

Add type assertions to their `getindex` methods to improve type stability.
Similarly, ensure that the return value of `size` is recognized as `Tuple{Int}`.
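For concreteness, the annotations are of roughly this shape. This is a sketch rather than the exact diff: the field name `p` and the one-row `MatElem` layout match the `PointVector` definition discussed below, but the precise `getindex` signature is my paraphrase.

```julia
# Sketch of the annotations this PR adds (field `p` holds the
# underlying 1-row MatElem; exact signatures in the PR may differ).

# Assert the element type on getindex: `po.p` has the abstract type
# MatElem{U}, so without the assertion the result is inferred as Any.
Base.getindex(po::PointVector{U}, i::Base.Integer) where {U} = po.p[1, i]::U

# Assert that size returns Tuple{Int}, for the same reason:
Base.size(po::PointVector) = (size(po.p, 2)::Int,)
```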

Silly microbenchmark, before:

julia> v = PointVector{ZZRingElem}(matrix(ZZ, [1 2 3 ; ]));

julia> @btime prod(v)
  1.217 μs (21 allocations: 336 bytes)

After:

julia> @btime prod(v);
  288.449 ns (5 allocations: 80 bytes)

@benlorenz (Member) commented:

Thanks for working on this! It does seem to introduce a few new invalidations, though; can we do something about those?

I am a bit unsure which of these are really needed and why; `eltype`, for example, is inherited from `AbstractVector{U}` and works fine.
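For context, the fallback `eltype` comes from Julia Base and already returns the type parameter for any `AbstractArray` subtype, which is why a dedicated method is redundant (quoted from memory, so the exact form in Base may differ slightly):

```julia
# Base's generic definition, roughly: for any AbstractArray{E} subtype,
# the element type is simply the type parameter E.
eltype(::Type{<:AbstractArray{E}}) where {E} = @isdefined(E) ? E : Any

# Hence for PointVector{ZZRingElem} <: AbstractVector{ZZRingElem},
# eltype already yields ZZRingElem with no extra method needed.
```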

I tried this in the REPL on master, without the changes here, and it seems that it is mostly the hint for `size` that matters for this speed improvement:

julia> v = PointVector{ZZRingElem}(matrix(ZZ, [1 2 3 ; ]));

julia> @btime prod(v)
  687.593 ns (21 allocations: 336 bytes)
6

julia> Base.size(po::PointVector) = (size(po.p, 2)::Int,)

julia> @btime prod(v)
  118.364 ns (5 allocations: 80 bytes)
6

I am a bit confused why this type cannot be inferred automatically; we just use the `size` of the underlying `MatrixElem`.

PS: What Julia version / what machine is that? It seems a lot slower than on my laptop.

@benlorenz (Member) left a comment:


thanks!


fingolfin commented Feb 16, 2024

Thank you for the feedback, Benjamin!

Thanks for working on this, this does seem to introduce a few new invalidations, can we do something about these?

I think they were a fluke, or caused by an update of a dependency; at least they seem to be gone now. (I pushed an update to this branch earlier.)

I am a bit unsure which of these are really needed and why, eltype for example is inherited from AbstractVector{U} and works fine.

Indeed, my mistake -- I overlooked the inheritance and just added it; I should have checked whether it really has an effect! Anyway, I removed those now.

I tried this in the shell on master without the changes here and it seems that mostly the hint for size is important for this speed improvement:

Yes, that's the main part; but together they are still better than just the `size` change.

The code where this came up initially used a `for` loop; I first noticed that it was type unstable and then changed `getindex`. Then I thought "let's rewrite this using `prod`" and discovered it was much slower -- hence the `size` and `eltype` changes.

Anyway, here is a new microbenchmark using a `for` loop:

# multiply all entries of v with an explicit loop (a manual `prod`)
function f(v)
  x = v[1]
  for i in 2:length(v)
    x *= v[i]
  end
  x
end

And now some measurements:

Before this PR:

julia> @btime f(v);
  648.773 ns (11 allocations: 224 bytes)

julia> @btime prod(v);
  1.267 μs (21 allocations: 336 bytes)

With just the size change:

julia> @btime f(v);
  364.333 ns (5 allocations: 80 bytes)

julia> @btime prod(v);
  410.000 ns (5 allocations: 80 bytes)

With just the getindex change:

julia> @btime f(v);
  647.610 ns (11 allocations: 224 bytes)

julia> @btime prod(v);
  1.262 μs (21 allocations: 336 bytes)

With the full PR:

julia> @btime f(v);
  351.807 ns (5 allocations: 80 bytes)

julia> @btime prod(v);
  369.335 ns (5 allocations: 80 bytes)

So the `getindex` change alone is useless, but the combination is still slightly ahead of the `size` change alone.

A bit confused why this type cannot be inferred automatically, we just use the size for MatrixElem.

Because the member `p` has type `MatElem{U}`, which is not concrete; and the generic method is implemented as

size(x::MatrixElem{T}) where T <: NCRingElement = (nrows(x), ncols(x))

but the compiler cannot infer the return type of `nrows(x)`.

That suggests two things to me:

  1. we probably have some type-unstable `nrows` methods (a quick check with `Base.return_types` suggests 2 out of 17)
  2. even if we fix those, perhaps we can add a `::Int` type annotation to that generic `size` method, just to compensate for similar issues in the future.

So hopefully the `::Int` annotation in this PR can eventually be made redundant; in the meantime it doesn't hurt.
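To make point 2 concrete, the compensating annotation on the generic method would be a one-line change. This is a sketch of the idea, not a committed patch:

```julia
# Hypothetical hardened version of the generic method quoted above:
# the assertions let callers infer Tuple{Int,Int} even if some
# individual nrows/ncols method is type unstable.
size(x::MatrixElem{T}) where T <: NCRingElement = (nrows(x)::Int, ncols(x)::Int)

# The "2 out of 17" check can be reproduced along these lines, asking
# the compiler for the inferred return type for a given matrix type:
Base.return_types(nrows, (typeof(matrix(ZZ, [1 2; 3 4])),))
# a type-stable method should yield [Int64]; anything wider signals a problem
```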

PS: What Julia version / what machine is that? It seems a lot slower than on my laptop.

Julia 1.10.1 on my MacBook Pro M1 Max.

@fingolfin
Member Author

Timings with Julia 1.10.1 on an AMD EPYC 9554 64-Core Processor (which, despite the high core count, is currently the fastest machine I have access to, in terms of single-core performance):

Before this PR:

julia> @btime f(v);
  293.838 ns (11 allocations: 224 bytes)

julia> @btime prod(v);
  713.264 ns (21 allocations: 336 bytes)

With this PR:

julia> @btime f(v);
  139.726 ns (5 allocations: 80 bytes)

julia> @btime prod(v);
  139.512 ns (5 allocations: 80 bytes)

@benlorenz
Member

Thanks for the explanations!

It makes sense to do this here; it doesn't hurt, and we can try to fix `nrows` later.
The invalidations might have come from the `eltype` change.

The i7-1165G7 in my laptop boosts to 4.7GHz and with the full PR I get:

julia> @btime f(v);
  101.805 ns (5 allocations: 80 bytes)

julia> @btime prod(v);
  102.918 ns (5 allocations: 80 bytes)

This microbenchmark seems surprisingly slow on the M1; maybe Julia is missing some optimization there.

@benlorenz benlorenz enabled auto-merge (squash) February 16, 2024 12:45

codecov bot commented Feb 16, 2024

Codecov Report

Merging #3369 (bda943d) into master (6d3e523) will decrease coverage by 0.02%.
Report is 9 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3369      +/-   ##
==========================================
- Coverage   82.09%   82.07%   -0.02%     
==========================================
  Files         556      556              
  Lines       74065    74218     +153     
==========================================
+ Hits        60800    60914     +114     
- Misses      13265    13304      +39     
Files Coverage Δ
src/PolyhedralGeometry/iterators.jl 89.37% <100.00%> (ø)

... and 5 files with indirect coverage changes

@benlorenz benlorenz merged commit 138d8d8 into oscar-system:master Feb 16, 2024
36 of 44 checks passed
@fingolfin fingolfin deleted the mh/poly-iter branch February 16, 2024 13:15
@lgoettgens lgoettgens added the backport 1.0.x Should be backported to the release 1.0 branch label Feb 16, 2024
@fingolfin
Copy link
Member Author

Yeah, I agree -- usually my MacBook is twice as fast as the older servers, and only the AMD beats it, and not by much (it doesn't clock as high as yours, of course). But in this case the old servers are as fast or faster.

For fun, here are 8 machines (including the M1 munk in the lower right) running on master (with this PR here merged) with Julia 1.10.1:

[Screenshot, 2024-02-16: `@btime` timings on the 8 machines]

The top left is a machine at home with an ancient i7-3820, and it still blows away the M1. Wow.

benlorenz pushed a commit that referenced this pull request Feb 16, 2024
Performance improvements for PointVector, RayVector, SubObjectIterator (#3369)

@benlorenz benlorenz mentioned this pull request Feb 16, 2024
33 tasks
@benlorenz benlorenz removed the backport 1.0.x Should be backported to the release 1.0 branch label Feb 19, 2024
benlorenz added a commit that referenced this pull request Feb 23, 2024
### Backported PRs

- [x] #3367
- [x] #3379 
- [x] #3369
- [x] #3291
- [x] #3325
- [x] #3350 
- [x] #3351
- [x] #3365 
- [x] #3366
- [x] #3382
- [x] #3373
- [x] #3341
- [x] #3346
- [x] #3381
- [x] #3385
- [x] #3387 
- [x] #3398 
- [x] #3399 
- [x] #3374 
- [x] #3406 
- [x] #2823
- [x] #3298
- [x] #3386 
- [x] #3412 
- [x] #3392 
- [x] #3415 
- [x] #3394
- [x] #3391