Optimizations to local inhibition #769

ghost · 2020-02-05T01:13:25Z

Check far fewer neighbors in spatialpooler.cpp->local inhibition->wraparound;
Other miscellaneous optimizations resulting in fewer instructions in the compiled code.

As per https://discourse.numenta.org/t/spatial-pooler-local-inhibition-in-htm-core/7130

breznak · 2020-02-19T08:58:45Z

Wow, a speedup to local inhibition has long been sought after! Thanks! 👍 #123

I'll review the PR and the linked discussion. I just hope the smaller neighborhood won't result in worse results (?)

breznak

Initial quick review, I'll have to look at the forum thread. I like some of the optimizations, some I'm not sure are worth the trade off in readability / speed.

src/htm/algorithms/SpatialPooler.cpp

src/htm/types/Types.hpp

breznak · 2020-02-19T09:10:52Z

src/htm/types/Types.hpp

@@ -156,6 +157,10 @@ static const htm::Real32 Epsilon = htm::Real(1e-6);
 #endif


+  static const UInt SizeInt = sizeof(Int);
+  static const UInt shftInt = (SizeInt*CHAR_BIT)-1;


ideally won't be needed. if it is, define it only in the file where needed.

ideally won't be needed. if it is, define it only in the file where needed.

Ok.

src/htm/utils/Topology.cpp

breznak · 2020-02-19T09:19:07Z

I’ve picked some of the low hanging fruit in the local inhibition code of htm.core, and, using the dynamic_hotgym program as a benchmark (only benchmark I used), got a reduction in the local inhibition time reported by program (“SP (l)”) from ~136s to ~67s.

this is quite significant! 😮 Thank you.

c++ MNIST example is another good benchmark. We'll see the gains on more platforms here.

ghost · 2020-02-19T18:46:58Z

I'd appreciate it if you could let me know if the speed improvements I saw on my machine are reproducible.

I just hope the smaller neighborhood won't result in worse results (?)

There should be absolutely no change in results. Here's my reasoning (let me know if I'm wrong):

The way the original code is structured, the program runs through each neighbor of a column, keeping count of how many neighbors that column has (numNeighbors) and of how many of those neighbors have a larger overlap than the column being examined (numBigger). After having run through all the neighbors, the program checks whether numBigger < (numNeighbors +1)*(C), where C is some predetermined number. However, in a wrapAround neighborhood, the number of neighbors of a column (the final value of numNeighbors) is always determined by the radius (because the neighborhood wraps around, a column is never "near an edge"). Thus we can know the final value of numNeighbors without counting neighbors. What the pull-request code does is calculate the final value of numNeighbors before checking the neighbors of a column; then, as it checks each neighbor, it tests whether numBigger >= (numNeighbors +1)*(C) (the negation of the previously mentioned condition), and stops examining neighbors if that condition is true. This produces results that are equivalent to those of the original code because if the condition numBigger >= (numNeighbors +1)*(C) is true at any point, then it would still be true after counting all subsequent neighbors (as numBigger never decreases), and if the condition is not true at a given point, the program will continue to examine neighbors, as the original code would have.
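The equivalence argument above can be sketched roughly as follows. This is only an illustration, not the actual htm.core code: `survivesFullScan` and `survivesEarlyExit` are hypothetical names, the neighbor overlaps are passed in as a plain vector whose size plays the role of the pre-computed numNeighbors, and `c` stands for the constant C. Any further details of the real inhibition loop are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original structure (as described above): count every neighbor,
// then compare numBigger against (numNeighbors + 1) * c at the end.
bool survivesFullScan(const std::vector<float>& neighborOverlaps,
                      float overlap, double c) {
  assert(c >= 0.0);
  std::size_t numNeighbors = 0;
  std::size_t numBigger = 0;
  for (float o : neighborOverlaps) {
    ++numNeighbors;
    if (o > overlap) ++numBigger;
  }
  return static_cast<double>(numBigger) < (numNeighbors + 1) * c;
}

// Early-exit structure: the neighbor count is known before the loop
// (true for wrap-around neighborhoods), so the limit is fixed up front
// and the loop can stop as soon as numBigger reaches it.
bool survivesEarlyExit(const std::vector<float>& neighborOverlaps,
                       float overlap, double c) {
  assert(c >= 0.0);
  const double limit = (neighborOverlaps.size() + 1) * c;
  std::size_t numBigger = 0;
  for (float o : neighborOverlaps) {
    if (o > overlap) {
      ++numBigger;
      // numBigger never decreases, so the outcome is already decided.
      if (static_cast<double>(numBigger) >= limit) return false;
    }
  }
  return static_cast<double>(numBigger) < limit;
}
```

Both functions should return the same answer for any input; the second one simply stops looking at neighbors once the column is known to be inhibited.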

c++ MNIST example is another good benchmark. We'll see the gains on more platforms here.

I'm not familiar with that c++ MNIST one. Is that one the preferred benchmark then? I'll check to see if the pull-request has any effect on it.

Thank you for your work, breznak.

breznak · 2020-02-20T07:01:30Z

However, in a wrapAround neighborhood, the number of neighbors of a column (the final value of numNeighbors) will always be predetermined by the radius (because the neighborhood wraps around, and therefore a column will never be "near an edge"). Thus we can know the final value of numNeighbors without counting neighbors. What the pull-request code does is calculate the final value of numNeighbors before checking the neighbors of a column

Thank you for confirming the reasoning; I really like what the PR does (however the optimizations turn out).
Later, probably after testing the results, we may consider whether wrapAround should default to always ON (or be removed altogether).
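For readers following along, here is a rough sketch of how the neighbor count could be derived from the radius alone in a wrap-around topology. The function name `wrapAroundNeighborCount` and the exact formula are assumptions for illustration (the window along each dimension is taken to be capped at the dimension size); the PR's own code may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: number of neighbors of any column, excluding the
// column itself, when the neighborhood wraps around in every dimension.
// Assumption: the window along each dimension spans 2*radius + 1 columns,
// capped at the dimension size so that no column is counted twice.
unsigned wrapAroundNeighborCount(const std::vector<unsigned>& dimensions,
                                 unsigned radius) {
  const unsigned diameter = 2 * radius + 1;
  unsigned count = 1;
  for (unsigned dim : dimensions) {
    assert(dim > 0);
    count *= std::min(diameter, dim);
  }
  return count - 1;  // remove the center column
}
```

Because the result depends only on the dimensions and the radius, it is the same for every column and does not need to be recounted per column.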

This produces results that are equivalent to those of the original code because if the condition numBigger >= (numNeighbors +1)*(C) is true at any point, then it would still be true after counting all subsequent neighbors (as numBigger never decreases), and if the condition is not true at a given point, the program will continue to examine neighbors, as the original code would have.

perfect! Yes, the program behaves correctly.

I'm not familiar with that c++ MNIST one. Is that one the preferred benchmark then?

You can run it in ./build/Release/bin/mnist_sp. It's an image classification benchmark (on the MNIST digits dataset) and heavily uses SP.

you'd have to toggle globalInhibition = false in ./src/examples/mnist/MNIST_SP.cpp to use local inhibition (and set wrap around True)

Is that one the preferred benchmark then

I'd say both are, we figured the "benchmark_hotgym" is sometimes too artificial, and average results on real data are different. (MNIST takes quite a while to run, though).

I'm comparing the speeds on Release / -O3 now

breznak · 2020-02-20T07:27:10Z

Benchmarks of this PR (on my machine)

system: OS ubuntu 18.04, gcc 8.3, c++17, 64bit, CPU i7-8565u

master

benchmark_hotgym (avg of 5 runs): 30.3973
runs:

==============TIMERS============
Init:   0.108607
Random: 0.236848
Encode: 0.0773281
SP (l): 27.5694
SP (g): 1.03938
TM:     6.22407
AN:     0.604613
outputs match
Total elapsed time = 35 seconds

==============TIMERS============
Init:   0.097677
Random: 0.263707
Encode: 0.0885447
SP (l): 30.7112
SP (g): 1.38886
TM:     8.12415
AN:     0.667723
outputs match
Total elapsed time = 41 seconds

==============TIMERS============
Init:   0.144493
Random: 0.27206
Encode: 0.0919194
SP (l): 31.3234
SP (g): 1.44896
TM:     8.52284
AN:     0.739849
outputs match
Total elapsed time = 42 seconds 

==============TIMERS============
Init:   0.0861506
Random: 0.272363
Encode: 0.0931327
SP (l): 31.3647
SP (g): 1.44982
TM:     8.53969
AN:     0.763044
outputs match
Total elapsed time = 42 seconds 

==============TIMERS============
Init:   0.0673932
Random: 0.269475
Encode: 0.0913874
SP (l): 31.0178
SP (g): 1.43296
TM:     8.45865
AN:     0.719061
outputs match
Total elapsed time = 41 seconds

PR

avg (of 5 runs): 24.2280 (which is about 20% speedup 👍 👍 )
runs:

==============TIMERS============
Init:   0.076063
Random: 0.276282
Encode: 0.0924121
SP (l): 25.6381
SP (g): 1.47253
TM:     8.64936
AN:     0.787938
outputs match
Total elapsed time = 36 seconds 


==============TIMERS============
Init:   0.125774
Random: 0.264073
Encode: 0.0893386
SP (l): 24.0384
SP (g): 1.43205
TM:     8.30734
AN:     0.766189
outputs match
Total elapsed time = 34 seconds 

==============TIMERS============
Init:   0.0529755
Random: 0.268904
Encode: 0.089577
SP (l): 24.2729
SP (g): 1.45011
TM:     8.4342
AN:     0.759945
outputs match
Total elapsed time = 35 seconds

==============TIMERS============
Init:   0.0554682
Random: 0.255424
Encode: 0.0870571
SP (l): 23.3842
SP (g): 1.40471
TM:     8.18585
AN:     0.692181
outputs match
Total elapsed time = 34 seconds

==============TIMERS============
Init:   0.057346
Random: 0.261227
Encode: 0.0876368
SP (l): 23.8062
SP (g): 1.42717
TM:     8.28385
AN:     0.742549
outputs match
Total elapsed time = 34 seconds

I've confirmed your results from this PR on my machine, the avg speedup is about 20% (24s from 31s), which is pretty significant!! Great work, thank you 💯
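The averages quoted above can be re-derived from the "SP (l)" lines of the ten runs. A small sketch (the helper names `mean` and `relativeReduction` and the constant names are made up for this example; the timings are copied from the timer output above):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// "SP (l)" timings in seconds, copied from the runs listed above.
const std::vector<double> kMasterRuns{27.5694, 30.7112, 31.3234, 31.3647, 31.0178};
const std::vector<double> kPrRuns{25.6381, 24.0384, 24.2729, 23.3842, 23.8062};

// Arithmetic mean of a non-empty list of timings.
double mean(const std::vector<double>& values) {
  assert(!values.empty());
  return std::accumulate(values.begin(), values.end(), 0.0) / values.size();
}

// Fraction of time saved when going from `before` to `after`.
double relativeReduction(double before, double after) {
  assert(before > 0.0);
  return (before - after) / before;
}
```

With these inputs, `mean(kMasterRuns)` is about 30.40 s, `mean(kPrRuns)` is about 24.23 s, and `relativeReduction` of the two means is about 0.20, matching the roughly 20% figure.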

Now I'd like to see the effect of the micro-optimizations vs. the change in how numNeighbors is computed.

ghost · 2020-02-20T15:03:53Z

Okay, turns out I'm a failure. Every single one of the "micro-optimizations" actually made the code slower without the DEBUG flag on. My fault. I tried to revert those changes by making two additional commits to restore the code to the original format (current master branch), leaving only the "fewer neighbors" change. I'm not sure if I did that correctly (I've never used git or github before). Should I instead close this request and open a new one, to make things cleaner?

Thanks again for everything, breznak, including the clarifications. I'll look into mnist_sp.

breznak · 2020-02-20T15:11:06Z

Okay, turns out I'm a failure

What?! SP local is now around 12s! (compared to our master, and to the earlier state of this PR benchmarked above, #769 (comment))
That's freaking 3x faster! I wouldn't call that a failure 😄 This PR just got even better. 💯

Every single one of the "micro-optimizations" actually made the code slower without the DEBUG flag on.

Yep, just remember "premature optimization is the root of all evil" :) Compilers are pushed to the limits by really smart people, so we mostly aim for the algorithmic improvements (as you did here) that yield huge changes. Code micro-optimizations are better left to the compiler.

Should I instead close this request and open a new one, to make things cleaner?

No need, this PR is fine now as is. Really good job, thank you!

src/htm/algorithms/SpatialPooler.cpp

ghost · 2020-02-20T15:40:59Z

SP local is now around 12s!

I'm pretty sure it isn't. The removal of the false "micro-optimizations" should result in a drop of just 1-3s max.

breznak · 2020-02-20T16:08:00Z

I'm pretty sure it isn't. The removal of the false "micro-optimizations" should result in a drop of just 1-3s max.

ok, that must have been an effect of the load on my machine. Now I'm getting 19-25s on master, and 10-13s in this PR, which is about a 2x speedup.

breznak · 2020-02-20T16:12:04Z

@kineyev can you find some toggle in the PR here like "Allow maintainers to push changes to your branch"? I've made some cleanups and would attach that to this PR.

ghost · 2020-02-20T16:17:42Z

@kineyev can you find some toggle in the PR here like "Allow maintainers to push changes to your branch"? I've made some cleanups and would attach that to this PR.

"Allow edits from maintainers." ? It's checked. Is there anything else I need to do?

breznak

@kineyev this is good as is, really good work, thank you! 👍

Tell me, would you like to try pushing the speed even further? Here are 2 ideas:

- The code that computes predN depends only on inhibitionRadius_, which only changes at isUpdateRound(). You could use that info and update numNeighbors only then. I'm not sure if that's worth it or if it'd give any improvement.
- I'm not really a fan of the *Neighborhood classes; see if the code could use a Topology_t (DefaultTopology) class instead (that still uses a Neighborhood, but is a better API for changing topologies). CC @ctrl-z-9000-times what do you think of this?
- I'd like to try trading memory for speed by pre-computing Neighborhood(column) for each col and storing it in a map.
  - for global inhibition this depends only on the column, so it should work.
  - for local it depends on (col, inh radius); see the first bullet.
  - this would depend on whether the function advance_() used in operator++ in Neighborhoods takes a considerable amount of time. Could you look at that?
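To make the "trade memory for speed" idea a bit more concrete, here is a rough sketch of a neighbor cache that is rebuilt only when the radius changes. It is an illustration under simplifying assumptions, not a proposal for the final API: `NeighborCache` is a hypothetical class, the columns form a 1-D ring rather than a general N-D topology, and nothing here uses the existing Neighborhood classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache of wrap-around neighbor lists for a 1-D ring of columns.
class NeighborCache {
public:
  explicit NeighborCache(std::size_t numColumns) : numColumns_(numColumns) {}

  // Neighbors of `column` for `radius`. The whole table is rebuilt only
  // when the radius differs from the one used for the previous build.
  const std::vector<std::size_t>& neighbors(std::size_t column,
                                            std::size_t radius) {
    assert(column < numColumns_);
    if (!valid_ || radius != radius_) rebuild(radius);
    return table_[column];
  }

  std::size_t rebuilds() const { return rebuilds_; }

private:
  void rebuild(std::size_t radius) {
    table_.assign(numColumns_, std::vector<std::size_t>{});
    const bool coversRing = 2 * radius + 1 >= numColumns_;
    for (std::size_t col = 0; col < numColumns_; ++col) {
      if (coversRing) {
        // Window wraps onto itself: every other column is a neighbor.
        for (std::size_t other = 0; other < numColumns_; ++other)
          if (other != col) table_[col].push_back(other);
      } else {
        for (std::size_t d = 1; d <= radius; ++d) {
          table_[col].push_back((col + numColumns_ - d) % numColumns_);
          table_[col].push_back((col + d) % numColumns_);
        }
      }
    }
    radius_ = radius;
    valid_ = true;
    ++rebuilds_;
  }

  std::size_t numColumns_;
  std::size_t radius_ = 0;
  bool valid_ = false;
  std::size_t rebuilds_ = 0;
  std::vector<std::vector<std::size_t>> table_;
};
```

The memory cost is roughly (number of columns) x (neighbors per column), so whether this pays off would depend on how much time the iterator functions actually take and on how often the radius changes.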

breznak · 2020-02-20T16:53:03Z

CC @dkeeney @ctrl-z-9000-times this PR does a great job at optimizing SP local inhibition performance for wrapAround=true (2x faster now!).

Do you think it's worth the trouble of breaking the API and dropping the wrapAround param?

- it'd allow us to further simplify the code
- drop 1 of the 2 Neighborhood classes
- we've never observed better results with/without the wrap-around.
- if one wanted to keep the non-wrapped functionality, they could put the feature in a separate dimension, or add padding (empty bits) to the SDR.

ghost · 2020-02-20T16:55:46Z

Tell me, would you like to try pushing the speed even further? Here are 2 ideas:

I would like to help, sure, but I can't say for sure when I'll get around to doing it, since I won't have much free time in the next few weeks. I'll see what I can do though.

the code that computes predN only depends on inhibitionRadius_ which only changes at isUpdateRound(). You could use that info and update the numNeighbors only then. I'm not sure if that's worth it or if it'd give any improvement.

I'll check it out. Should be easy enough to verify.

I'm not really a fan of the *Neighborhood classes, see if the code could use a Topology_t (DefaultTopology) class instead. (that still uses a Neighborhood, but is a better API for changing topologies). CC @ctrl-z-9000-times what do you think of this?

I'll look into it. I'm not familiar with that Topology_t class.

I'd like to try trading memory for speed and pre-computing Neighborhood(column) for each col and storing it in a map.

for global inhibition this depends only on column, so it should work.

for local depends on (col, inh radius); see the first bullet.

this would depend on whether the function advance_() used in operator++ in Neighborhoods takes a considerable amount of time. Could you look at that?

I'll look into it as well, but I'll look at the Topology_t class first, otherwise it could be wasted effort. There is a significant amount of time spent on both the "operator*()" and "advance_()" functions of wrapping neighborhoods.

I'll let you know as soon as I have something.

breznak

I'm really happy with the code and speed improvements now. Thank you @kineyev !

Possible follow-ups for discussion:

drop the wrapAround param from SP (default to true) //cleanup, faster code
explore the possibility of caching the Neighborhood results in memory
use Topology instead of (directly) Neighborhood in SP

breznak · 2020-02-20T17:08:21Z

I'll look into it. I'm not familiar with that Topology_t class.

it's in the src/htm/utils/Topology.hpp

There is a significant amount of time spent on the on both the "operator*()" and "advance_()" functions of wrapping neighborhoods.

good to know, so that would be worth it!

I'll look into it as well, but I'll look at the Topology_t class first, otherwise it could be wasted effort

The Topology class would not save us here, the DefaultTopology uses *Neighborhoods. It's an improved API built atop the Neighborhood classes.

The time/gain ratio is, imho:

"isUpdateRound" (easy, not a huge gain)
"pre-cached Neighborhood" (hard, big potential)
"topology_t" (better API)

dkeeney · 2020-02-21T01:01:59Z

Really good job on this PR.

Do you think it's worth the trouble of breaking the API and dropping the wrapAround param?

Let's wait until someone has time to look into this before we decide if it is worth it.

Optimizations to local inhibition

2200d26

ghost requested a review from breznak February 8, 2020 12:09

Merge branch 'master' into master

9c59ed0

breznak added optimization SP labels Feb 19, 2020

breznak requested changes Feb 19, 2020

root added 2 commits February 20, 2020 06:49

Undoing modifications

691aab9

Undoing modifications

7f0dbf7

Readability and correction (it's actually min, not max)

470312f

breznak reviewed Feb 20, 2020

code cleanup

d3c51d6

breznak added 3 commits February 20, 2020 16:42

Merge remote-tracking branch 'pr/master' into pr_speedup_sp_local

6a826d2

small cleanup

2ef6bc1

drop unneeded changes to Types.hpp

9af63c0

breznak reviewed Feb 20, 2020

breznak approved these changes Feb 20, 2020

breznak merged commit b7f206f into htm-community:master Feb 20, 2020

breznak mentioned this pull request Feb 20, 2020

Improve speed of SP local inhibition, 70x slower than global #123

Open

This was referenced Feb 20, 2020

pull-request opened by mistake numenta/nupic.core-legacy#1459

Closed

Naive implementations of optimizations suggested by breznak on pull-request n. 769 #776

Closed
