
MAX_TUPLETYPE_LEN and cartesian arrayref performance (DON'T MERGE) #5393

Closed · wants to merge 1 commit
Conversation

@timholy (Member) commented Jan 14, 2014

(This is not really a PR, it's a bug/issue report disguised as a PR to make it easier to talk about and for others to do their own testing.)

Thanks to some great work by Jeff, Julia's built-in arrayref, which implements both linear and cartesian indexing of arrays, now has amazing performance: in most cases cartesian indexing is indistinguishable from linear indexing, even when the linear index can be computed efficiently because the access order follows a pattern. To illustrate, I ran test/arrayperf in two configurations: one against current master, which uses linear indexing to implement getindex for arrays, and one against this PR, in which the linear-indexing code is deleted and getindex falls back to the AbstractArray implementation (which uses cartesian indexing). The complete getindex results are posted in this gist.

There are a few oddities (noise due to garbage-collection?), but to me the overall pattern suggests we don't need linear indexing. But there's one consistent exception, illustrated for this example but common to all of the tests:

Slicing with contiguous blocks (cartesian case):
Small arrays:
1 dimensions (2500000 repeats, 10000000 operations): elapsed time: 0.625764791 seconds (299991840 bytes allocated)
2 dimensions (2500000 repeats, 10000000 operations): elapsed time: 0.97220765 seconds (420500120 bytes allocated)
3 dimensions (625000 repeats, 10000000 operations): elapsed time: 1.096706901 seconds (419835400 bytes allocated)
4 dimensions (625000 repeats, 10000000 operations): elapsed time: 1.048247011 seconds (431500132 bytes allocated)
5 dimensions (156250 repeats, 10000000 operations): elapsed time: 0.467072104 seconds (207361056 bytes allocated)
6 dimensions (156250 repeats, 10000000 operations): elapsed time: 0.496604407 seconds (210810456 bytes allocated)
7 dimensions (39063 repeats, 10000128 operations): elapsed time: 0.3508091 seconds (121212592 bytes allocated)
8 dimensions (39063 repeats, 10000128 operations): elapsed time: 2.864820923 seconds (1322590144 bytes allocated)
9 dimensions (9766 repeats, 10000384 operations): elapsed time: 3.00026783 seconds (1377019176 bytes allocated)
10 dimensions (9766 repeats, 10000384 operations): elapsed time: 3.102930591 seconds (1457233652 bytes allocated)

Notice that the performance of arrayref falls off a cliff at 8 dimensions. Is this something that can be fixed? Or is this really hard?

@simonster (Member)

I think performance drops because MAX_TUPLETYPE_LEN in inference.jl is 8, which appears to prevent getindex from getting inlined when it is passed >8 arguments.
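One way to check this hypothesis is to compare the typed code for indexing with 8 versus 9 subscripts. This is a sketch, not from the thread: the probe functions `g8`/`g9` are hypothetical, and the 8-argument threshold (MAX_TUPLETYPE_LEN in 2014-era inference.jl) is version-dependent, so the exact output will vary with the Julia version.

```julia
# Hypothetical probes: identical indexing code, differing only in the
# number of subscripts passed to getindex.
A8 = rand(2, 2, 2, 2, 2, 2, 2, 2)       # 8-dimensional array
A9 = rand(2, 2, 2, 2, 2, 2, 2, 2, 2)    # 9-dimensional array

g8(A) = A[1, 1, 1, 1, 1, 1, 1, 1]        # 8 index arguments
g9(A) = A[1, 1, 1, 1, 1, 1, 1, 1, 1]     # 9 index arguments

# Inspect whether the getindex call survives as an explicit call
# (not inlined) in the 9-argument case but not the 8-argument case.
@code_typed g8(A8)
@code_typed g9(A9)
```

If the tuple-length limit is the culprit, the lowered code for `g9` should show an un-inlined `getindex` call where `g8` reduces to a direct memory access.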

@timholy (Member, Author) commented Jan 15, 2014

Good bet, @simonster, thanks.

@lindahua (Contributor)

Actually, 0.97 sec for 2 dimensions vs. 0.62 sec for 1 dimension is still a quite noticeable difference.

@lindahua (Contributor)

I did a benchmark specifically to test the performance of linear indexing vs cartesian indexing. The code is here on gist.

Here are the results I got (I ran it multiple times; the results are quite consistent):

Scanning matrix of size (8,8) for 10000000 times:
1D:  0.36637 sec
2D:  0.43078 sec
Scanning matrix of size (4,4,4) for 10000000 times:
1D:  0.36851 sec
3D:  0.48609 sec
Scanning matrix of size (4,4,2,2) for 10000000 times:
1D:  0.36736 sec
4D:  0.52755 sec

The difference is still clearly noticeable (though not very large).

I think a reasonable guideline here is:

  • When cartesian indexing leads to considerably more concise and generic code, it is definitely a reasonable choice. Except in the most performance-critical cases, manually unrolling cartesian indexing is not recommended.
  • When linear indexing is not difficult to implement (e.g., when working with contiguous arrays or array blocks), linear indexing is still preferable.
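The two styles in the guideline above can be sketched as follows. This is illustrative code, not from the thread; the function names are made up, and the cartesian version uses `CartesianIndices`, which did not exist in 2014-era Julia (the original code would have used generated loops via Base.Cartesian).

```julia
# Linear indexing: a single running index. Fast and simple for
# contiguous Arrays, but awkward for strided views or generic
# AbstractArrays.
function sum_linear(A::Array)
    s = zero(eltype(A))
    for i in 1:length(A)
        s += A[i]
    end
    return s
end

# Cartesian indexing: one subscript per dimension. Generic across
# dimensionalities and AbstractArray types, at the cost of carrying
# N loop counters instead of one.
function sum_cartesian(A::AbstractArray)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end
```

Both traverse the array in memory order for a column-major `Array`; the benchmarks in this thread measure the overhead of the extra index bookkeeping in the cartesian case.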

@timholy (Member, Author) commented Jan 16, 2014

That's very interesting.

Actually, 0.97 sec for 2-dimension vs 0.62 sec for 1-dimension are still quite noticeable in some sense.

Well, the actual arrays were different, not just the indexing pattern. What I intended was to compare the two files in that gist against each other (not very convenient, I know, but I was being lazy and leveraging a script I wrote long ago and put in test/). In particular, that test also includes allocation of the output, not just traversal.

Your test is more focused on just traversal, which is of course extremely useful. While I get slightly different numbers on my laptop, they're mostly consistent with your results (except I don't really see a difference between 1D and 2D):

Scanning matrix of size (8,8) for 10000000 times:
1D:  0.51942 sec
2D:  0.52862 sec
Scanning matrix of size (4,4,4) for 10000000 times:
1D:  0.51715 sec
3D:  0.71273 sec
Scanning matrix of size (4,4,2,2) for 10000000 times:
1D:  0.51848 sec
4D:  0.82086 sec

If I make the array too big to fit in L1 cache, so that repeated traversal will generate cache misses, here's what I get:

Scanning matrix of size (512,512) for 2441 times:
1D:  0.44576 sec
2D:  0.45418 sec
Scanning matrix of size (64,64,64) for 2441 times:
1D:  0.44576 sec
3D:  0.53878 sec
Scanning matrix of size (16,32,16,32) for 2441 times:
1D:  0.44502 sec
4D:  0.50972 sec

Not very dramatic, but still noticeable.

So you are right: there is a difference, and it grows with dimensionality. In retrospect that makes total sense, which is why I was surprised not to have seen it in my original tests.

It seems very reasonable to have algorithms on arrays use linear indexing when it's easy, and use cartesian indexing for AbstractArray implementations. There are also some algorithms, like pairwise summation, that would probably be hard to implement using cartesian indexing.
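To make the pairwise-summation point concrete, here is a minimal sketch (illustrative names, not Base's actual implementation; the block-size cutoff of 128 is arbitrary). The algorithm recursively splits a linear index range in half, which has no natural cartesian analogue:

```julia
# Pairwise (cascade) summation over a linear index range. Splitting
# [lo, hi] at its midpoint relies on the indices forming a single
# ordered range -- exactly what linear indexing provides and
# cartesian indexing does not.
function pairwise_sum(A::Array, lo::Int=1, hi::Int=length(A))
    if hi - lo < 128
        # Small block: plain sequential loop.
        s = zero(eltype(A))
        for i in lo:hi
            s += A[i]
        end
        return s
    else
        # Recurse on the two halves of the index range; this bounds
        # the accumulated floating-point error at O(log n).
        mid = (lo + hi) >>> 1
        return pairwise_sum(A, lo, mid) + pairwise_sum(A, mid + 1, hi)
    end
end
```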

@jiahao jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f Compare October 11, 2014 22:06
@simonster (Member)

I believe this is related to @vtjnash's comment here. I managed to hit this in real code trying to broadcast a 7 dimensional array and a vector.

@JeffBezanson (Member)

Closing this in favor of other issues/PRs related to cartesian indexing.
