Optimizations to local inhibition #769

Merged
merged 9 commits into from Feb 20, 2020

@ghost ghost commented Feb 5, 2020

Check far fewer neighbors in spatialpooler.cpp->local inhibition->wraparound;
Other miscellaneous optimizations resulting in fewer instructions in the compiled code.

As per https://discourse.numenta.org/t/spatial-pooler-local-inhibition-in-htm-core/7130

@ghost ghost requested a review from breznak February 8, 2020 12:09

breznak commented Feb 19, 2020

Wow, a speedup to local inhibition has long been sought after! Thanks! 👍 #123

I'll review the PR and the linked discussion. I just hope the smaller neighborhood won't result in worse results (?)

@breznak breznak left a comment

Initial quick review, I'll have to look at the forum thread. I like some of the optimizations, some I'm not sure are worth the trade off in readability / speed.

src/htm/types/Types.hpp
@@ -156,6 +157,10 @@ static const htm::Real32 Epsilon = htm::Real(1e-6);
#endif


static const UInt SizeInt = sizeof(Int);
static const UInt shftInt = (SizeInt*CHAR_BIT)-1;

Member

ideally won't be needed. if it is, define it only in the file where needed.

Author

ideally won't be needed. if it is, define it only in the file where needed.

Ok.


breznak commented Feb 19, 2020

I’ve picked some of the low hanging fruit in the local inhibition code of htm.core, and, using the dynamic_hotgym program as a benchmark (only benchmark I used), got a reduction in the local inhibition time reported by program (“SP (l)”) from ~136s to ~67s.

this is quite significant! 😮 Thank you.

c++ MNIST example is another good benchmark. We'll see the gains on more platforms here.

ghost commented Feb 19, 2020

I'd appreciate it if you could let me know if the speed improvements I saw on my machine are reproducible.

I just hope the smaller neighborhood won't result in worse results (?)

There should be absolutely no change in results. Here's my reasoning (let me know if I'm wrong):

The way the original code is structured, the program will run through each neighbor of a column, keeping count of how many neighbors that column has (numNeighbors) and keeping count of how many of those neighbors have a larger overlap than the column being examined (numBigger). After having run through all the neighbors, the program checks if numBigger < (numNeighbors +1)*(C), where C is some predetermined number. However, in a wrapAround neighborhood, the number of neighbors of a column (the final value of numNeighbors) will always be predetermined by the radius (because the neighborhood wraps around, and therefore a column will never be "near an edge"). Thus we can know the final value of numNeighbors without counting neighbors. What the pull-request code does is calculate the final value of numNeighbors before checking the neighbors of a column, and then afterward, as it checks each neighbor, the code verifies if numBigger >= (numNeighbors +1)*(C) (the negation of the previously mentioned condition), and stops counting neighbors if that condition is true. This produces results that are equivalent to those of the original code because if the condition numBigger >= (numNeighbors +1)*(C) is true at any point, then it would still be true after counting all subsequent neighbors (as numBigger never decreases), and if the condition is not true at a given point, the program will continue to examine neighbors, as the original code would have.
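
To make the equivalence argument concrete, here is a minimal sketch of the two strategies side by side. It is not the code from this PR: it assumes a 1-D ring of columns and a non-empty overlaps vector, and the names survivesFullScan, survivesEarlyExit and density (standing in for C) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: columns sit on a 1-D ring, so every column has the same
// neighborhood size. overlaps must be non-empty.

// Baseline: visit every neighbor, counting neighbors and "bigger" neighbors.
bool survivesFullScan(const std::vector<int>& overlaps, std::size_t column,
                      std::size_t radius, double density) {
  const std::size_t n = overlaps.size();
  const std::size_t span = std::min(2 * radius + 1, n);  // window incl. center
  std::size_t numNeighbors = 0;
  std::size_t numBigger = 0;
  for (std::size_t k = 0; k < span; ++k) {
    const std::size_t neighbor = (column + n - span / 2 + k) % n;
    if (neighbor == column) continue;
    ++numNeighbors;
    if (overlaps[neighbor] > overlaps[column]) ++numBigger;
  }
  return static_cast<double>(numBigger) <
         density * static_cast<double>(numNeighbors + 1);
}

// Variant: the neighbor count is known up front, so stop at the first
// moment the negated condition holds.
bool survivesEarlyExit(const std::vector<int>& overlaps, std::size_t column,
                       std::size_t radius, double density) {
  const std::size_t n = overlaps.size();
  const std::size_t span = std::min(2 * radius + 1, n);
  const std::size_t numNeighbors = span - 1;  // fixed by the radius on a ring
  const double limit = density * static_cast<double>(numNeighbors + 1);
  std::size_t numBigger = 0;
  for (std::size_t k = 0; k < span; ++k) {
    const std::size_t neighbor = (column + n - span / 2 + k) % n;
    if (neighbor == column) continue;
    if (overlaps[neighbor] > overlaps[column] &&
        static_cast<double>(++numBigger) >= limit) {
      return false;  // numBigger never decreases, so the verdict is final
    }
  }
  return static_cast<double>(numBigger) < limit;
}
```

Both functions return the same verdict for every column; the second one simply visits fewer neighbors whenever a column loses early.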

c++ MNIST example is another good benchmark. We'll see the gains on more platforms here.

I'm not familiar with that c++ MNIST one. Is that one the preferred benchmark then? I'll check to see if the pull-request has any effect on it.

Thank you for your work, breznak.

breznak commented Feb 20, 2020

However, in a wrapAround neighborhood, the number of neighbors of a column (the final value of numNeighbors) will always be predetermined by the radius (because the neighborhood wraps around, and therefore a column will never be "near an edge"). Thus we can know the final value of numNeighbors without counting neighbors. What the pull-request code does is calculate the final value of numNeighbors before checking the neighbors of a column

Thank you for confirming the reasoning. I really like what the PR does (however the optimizations turn out).
We may consider (probably later, after testing results) whether wrapAround should default to always ON, or whether the parameter should be removed altogether.

This produces results that are equivalent to those of the original code because if the condition numBigger >= (numNeighbors +1)*(C) is true at any point, then it would still be true after counting all subsequent neighbors (as numBigger never decreases), and if the condition is not true at a given point, the program will continue to examine neighbors, as the original code would have.

perfect! Yes, the program behaves correctly.

I'm not familiar with that c++ MNIST one. Is that one the preferred benchmark then?

You can run it with ./build/Release/bin/mnist_sp. It's an image classification benchmark (on the MNIST digits dataset) and heavily uses SP.

  • you'd have to toggle globalInhibition = false in ./src/examples/mnist/MNIST_SP.cpp to use local inhibition (and set wrap around True)

Is that one the preferred benchmark then

I'd say both are, we figured the "benchmark_hotgym" is sometimes too artificial, and average results on real data are different. (MNIST takes quite a while to run, though).

I'm comparing the speeds on Release / -O3 now

breznak commented Feb 20, 2020

Benchmarks of this PR (on my machine)

  • system: OS ubuntu 18.04, gcc 8.3, c++17, 64bit, CPU i7-8565u

master

  • benchmark_hotgym, SP (l) time (avg of 5 runs): 30.3973

  • runs:

==============TIMERS============
Init:   0.108607
Random: 0.236848
Encode: 0.0773281
SP (l): 27.5694
SP (g): 1.03938
TM:     6.22407
AN:     0.604613
outputs match
Total elapsed time = 35 seconds

==============TIMERS============
Init:   0.097677
Random: 0.263707
Encode: 0.0885447
SP (l): 30.7112
SP (g): 1.38886
TM:     8.12415
AN:     0.667723
outputs match
Total elapsed time = 41 seconds

==============TIMERS============
Init:   0.144493
Random: 0.27206
Encode: 0.0919194
SP (l): 31.3234
SP (g): 1.44896
TM:     8.52284
AN:     0.739849
outputs match
Total elapsed time = 42 seconds 

==============TIMERS============
Init:   0.0861506
Random: 0.272363
Encode: 0.0931327
SP (l): 31.3647
SP (g): 1.44982
TM:     8.53969
AN:     0.763044
outputs match
Total elapsed time = 42 seconds 

==============TIMERS============
Init:   0.0673932
Random: 0.269475
Encode: 0.0913874
SP (l): 31.0178
SP (g): 1.43296
TM:     8.45865
AN:     0.719061
outputs match
Total elapsed time = 41 seconds

PR

  • SP (l) time, avg (of 5 runs): 24.2280 (which is about a 20% speedup 👍 👍 )

  • runs:

==============TIMERS============
Init:   0.076063
Random: 0.276282
Encode: 0.0924121
SP (l): 25.6381
SP (g): 1.47253
TM:     8.64936
AN:     0.787938
outputs match
Total elapsed time = 36 seconds 


==============TIMERS============
Init:   0.125774
Random: 0.264073
Encode: 0.0893386
SP (l): 24.0384
SP (g): 1.43205
TM:     8.30734
AN:     0.766189
outputs match
Total elapsed time = 34 seconds 

==============TIMERS============
Init:   0.0529755
Random: 0.268904
Encode: 0.089577
SP (l): 24.2729
SP (g): 1.45011
TM:     8.4342
AN:     0.759945
outputs match
Total elapsed time = 35 seconds

==============TIMERS============
Init:   0.0554682
Random: 0.255424
Encode: 0.0870571
SP (l): 23.3842
SP (g): 1.40471
TM:     8.18585
AN:     0.692181
outputs match
Total elapsed time = 34 seconds

==============TIMERS============
Init:   0.057346
Random: 0.261227
Encode: 0.0876368
SP (l): 23.8062
SP (g): 1.42717
TM:     8.28385
AN:     0.742549
outputs match
Total elapsed time = 34 seconds

I've confirmed your results from this PR on my machine: the avg speedup is about 20% (24s, down from 30s), which is pretty significant!! Great work, thank you 💯

  • now I'd like to see the effect of the micro-optimizations vs. the change in how numNeighbors is computed.
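
For anyone re-checking the arithmetic, the averages can be reproduced with a few lines. This is only a sketch: the "SP (l)" values are copied from the ten timer dumps above, and the helper names mean and relativeReduction are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// "SP (l)" timings in seconds, copied from the five runs per branch above.
const std::vector<double> kMasterSpLocal = {27.5694, 30.7112, 31.3234,
                                            31.3647, 31.0178};
const std::vector<double> kPrSpLocal = {25.6381, 24.0384, 24.2729,
                                        23.3842, 23.8062};

// Arithmetic mean of a non-empty list of timings.
double mean(const std::vector<double>& xs) {
  return std::accumulate(xs.begin(), xs.end(), 0.0) /
         static_cast<double>(xs.size());
}

// Fraction of the baseline time that was saved (0.20 means 20% less time).
double relativeReduction(double before, double after) {
  return (before - after) / before;
}
```

With these inputs, mean(kMasterSpLocal) is 30.3973 and mean(kPrSpLocal) is about 24.2280, so relativeReduction gives roughly 0.20, matching the figures quoted in this comment.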

ghost commented Feb 20, 2020

Okay, turns out I'm a failure. Every single one of the "micro-optimizations" actually made the code slower without the DEBUG flag on. My fault. I tried to revert those changes by making two additional commits to restore the code to the original format (current master branch), leaving only the "fewer neighbors" change. I'm not sure if I did that correctly (I've never used git or github before). Should I instead close this request and open a new one, to make things cleaner?

Thanks again for everything, breznak, including the clarifications. I'll look into mnist_sp.

breznak commented Feb 20, 2020

Okay, turns out I'm a failure

What?! SP local is now around 12s! (Compared to our master and to the earlier state of this PR; see #769 (comment) above.)
That's freaking 3x faster! I wouldn't call that a failure 😄 This PR just got even better. 💯

Every single one of the "micro-optimizations" actually made the code slower without the DEBUG flag on.

Yep, just remember "premature optimization is the root of all evil" :) Compilers are pushed to the limits by really smart people, so we mostly aim for the algorithmic improvements (as you did here) that yield huge changes. Code micro-optimizations are better left to the compiler.

Should I instead close this request and open a new one, to make things cleaner?

No need, this PR is fine now as is. Really good job, thank you!

ghost commented Feb 20, 2020

SP local is now around 12s!

I'm pretty sure it isn't. The removal of the false "micro-optimizations" should result in a drop of just 1-3s max.

breznak commented Feb 20, 2020

I'm pretty sure it isn't. The removal of the false "micro-optimizations" should result in a drop of just 1-3s max.

ok, that must have been an effect of the load on my machine. Now I'm getting 19-25s on master and 10-13s with this PR, which is about a 2x speedup.

breznak commented Feb 20, 2020

@kineyev can you find some toggle in the PR here like "Allow maintainers to push changes to your branch"? I've made some cleanups and would attach that to this PR.

ghost commented Feb 20, 2020

@kineyev can you find some toggle in the PR here like "Allow maintainers to push changes to your branch"? I've made some cleanups and would attach that to this PR.

"Allow edits from maintainers." ? It's checked. Is there anything else I need to do?

@breznak breznak left a comment

@kineyev this is good as is, really good work, thank you! 👍

Tell me, would you like to try pushing the speed even further? Here are 2 ideas:

  • the code that computes predN only depends on inhibitionRadius_, which only changes at isUpdateRound(). You could use that info and update the numNeighbors only then. I'm not sure if that's worth it or if it'd give any improvement.
  • I'm not really a fan of the *Neighborhood classes; see if the code could use a Topology_t (DefaultTopology) class instead. (That still uses a Neighborhood, but is a better API for changing topologies.) CC @ctrl-z-9000-times what do you think of this?
  • I'd like to try trading memory for speed and pre-computing Neighborhood(column) for each col and storing it in a map.
    • for global inhibition this depends only on column, so it should work.
    • for local it depends on (col, inh radius); see the first bullet.
    • this would depend on whether the function advance_() used in operator++ in Neighborhoods takes a considerable amount of time. Could you look at that?
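
A rough sketch of the first idea, recomputing the neighbor count only when the radius changes. Everything here is hypothetical: wrappedNeighborCount and NeighborCountCache are illustration names, not htm.core API, and the sketch assumes a wrapping neighborhood covers min(2*radius+1, dim) positions along each dimension.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed formula: neighbors of any column in a wrapping N-D grid,
// with the center column itself excluded.
std::size_t wrappedNeighborCount(const std::vector<std::size_t>& dims,
                                 std::size_t radius) {
  std::size_t count = 1;
  for (const std::size_t dim : dims) count *= std::min(2 * radius + 1, dim);
  return count - 1;
}

// Remembers the count for the last radius seen and recomputes it only
// when a different radius is requested.
class NeighborCountCache {
public:
  explicit NeighborCountCache(std::vector<std::size_t> dims)
      : dims_(std::move(dims)) {}

  std::size_t get(std::size_t radius) {
    if (!valid_ || radius != cachedRadius_) {
      cachedCount_ = wrappedNeighborCount(dims_, radius);
      cachedRadius_ = radius;
      valid_ = true;
      ++recomputations_;
    }
    return cachedCount_;
  }

  // Exposed only so the saving can be observed in a test or benchmark.
  std::size_t recomputations() const { return recomputations_; }

private:
  std::vector<std::size_t> dims_;
  std::size_t cachedRadius_ = 0;
  std::size_t cachedCount_ = 0;
  std::size_t recomputations_ = 0;
  bool valid_ = false;
};
```

Whether this pays off depends on how expensive the count is relative to the neighbor loop itself; the product over dimensions is cheap, so the gain may well be small, as noted above.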

breznak commented Feb 20, 2020

CC @dkeeney @ctrl-z-9000-times this PR does a great job at optimizing SP local inhibition performance for wrapAround=true (2x faster now!).

Do you think it's worth the trouble to break the API and drop the wrapAround param?

  • it'd allow us to further simplify the code
  • drop 1 of the 2 Neighborhood classes
  • we've never observed better results with/without the wrap-around.
  • if one wanted to keep the NOT wrapped functionality, they'd put the feature in a separate dimension. Or add a padding (empty bits) to the SDR.

ghost commented Feb 20, 2020

Tell me, would you like to try pushing the speed even further? Here are 2 ideas:

I would like to help, sure, but I can't say for sure when I'll get around to doing it, since I won't have much free time in the next few weeks. I'll see what I can do though.

  • the code that computes predN only depends on inhibitionRadius_, which only changes at isUpdateRound(). You could use that info and update the numNeighbors only then. I'm not sure if that's worth it or if it'd give any improvement.

I'll check it out. Should be easy enough to verify.

  • I'm not really a fan of the *Neighborhood classes; see if the code could use a Topology_t (DefaultTopology) class instead. (That still uses a Neighborhood, but is a better API for changing topologies.) CC @ctrl-z-9000-times what do you think of this?

I'll look into it. I'm not familiar with that Topology_t class.

  • I'd like to try trading memory for speed and pre-computing Neighborhood(column) for each col and storing it in a map.

    • for global inhibition this depends only on column, so it should work.
    • for local it depends on (col, inh radius); see the first bullet.
    • this would depend on whether the function advance_() used in operator++ in Neighborhoods takes a considerable amount of time. Could you look at that?

I'll look into it as well, but I'll look at the Topology_t class first, otherwise it could be wasted effort. There is a significant amount of time spent in both the "operator*()" and "advance_()" functions of wrapping neighborhoods.

I'll let you know as soon as I have something.

@breznak breznak left a comment

I'm really happy with the code and speed improvements now. Thank you @kineyev !

Possible follow-ups for discussion:

  • drop the wrapAround param from SP (default to true) //cleanup, faster code
  • explore the possibility of caching the Neighborhood results in memory
  • use Topology instead of (directly) Neighborhood in SP
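
As a starting point for the caching follow-up, a memory-for-speed table could look like the sketch below. It is a simplification under stated assumptions: a 1-D ring instead of the real N-D Neighborhood classes, and hypothetical names (NeighborTable, buildRingNeighborTable). The table is only valid for the radius it was built with, so it would need rebuilding whenever the inhibition radius changes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// table[column] lists that column's neighbors, center excluded.
using NeighborTable = std::vector<std::vector<std::size_t>>;

// Precompute every column's wrapped neighborhood on a 1-D ring once,
// so the hot loop can index into the table instead of iterating a
// neighborhood object.
NeighborTable buildRingNeighborTable(std::size_t numColumns,
                                     std::size_t radius) {
  NeighborTable table(numColumns);
  const std::size_t span = std::min(2 * radius + 1, numColumns);
  for (std::size_t column = 0; column < numColumns; ++column) {
    table[column].reserve(span > 0 ? span - 1 : 0);
    for (std::size_t k = 0; k < span; ++k) {
      const std::size_t neighbor =
          (column + numColumns - span / 2 + k) % numColumns;
      if (neighbor != column) table[column].push_back(neighbor);
    }
  }
  return table;
}
```

Memory grows with numColumns times the neighborhood size, so for large radii the table may cost more than it saves; that trade-off would need measuring.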

breznak commented Feb 20, 2020

I'll look into it. I'm not familiar with that Topology_t class.

it's in the src/htm/utils/Topology.hpp

There is a significant amount of time spent in both the "operator*()" and "advance_()" functions of wrapping neighborhoods.

good to know, so that would be worth it!

I'll look into it as well, but I'll look at the Topology_t class first, otherwise it could be wasted effort

The Topology class would not save us here; the DefaultTopology uses *Neighborhoods. It's an improved API built atop the Neighborhood classes.

As for the time/gain ratio, imho:

  • "isUpdateRound" (easy, not a huge gain)
  • "pre-cached Neighborhood" (hard, big potential)
  • "topology_t" (better API)

dkeeney commented Feb 21, 2020

Really good job on this PR.

Do you think it's worth the trouble to break the API and drop the wrapAround param?

Let's wait until someone has time to look into this before we decide if it is worth it.
