Comment #26
I don't really understand why you've decided to add this comment, but you're right. If you want to see applications, look at https://github.com/dnbaker/frp for orthogonal JL transforms and Gaussian Kernel Projections. FALCONN-LIB uses it here for an LSH, as you've mentioned.
This is the nearest thing to a WHT central station, so I'll discuss it a little. I'm interested in adding soft associative memory to neural networks, of a kind they can actually learn to use (ordinary RAM being very difficult for a network to learn to use.) You can also use locality sensitive hashing to switch different blocks of weight memory in and out of the associative memory, giving truly vast memory systems that are still fast yet have soft characteristics. Or use LSH to switch between modules in a deep neural network to boost efficiency. One simple thing you can try is to apply a WHT or random projection after the final layer of a neural network; that can help the output neurons cooperate more efficiently.
I recommend this paper for adaptive random spinners. There isn't support for an FHT layer in PyTorch yet, but it could be quite useful. The gradient is simply the FHT again; however, there is the issue of normalization, which this library does not perform. (It's smart to avoid it, though; if you apply the transform repeatedly, you can normalize post hoc.)
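For concreteness, here is a rough sketch of what such a layer could look like. The `fwht` below is a plain tensor implementation of the unnormalized transform, not this library's kernel, and the class name is purely illustrative:

```python
import torch

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform over the last dimension.
    Assumes that dimension has power-of-two length."""
    shape = x.shape
    n = shape[-1]
    y = x.reshape(-1, n)
    h = 1
    while h < n:
        # Butterfly stage: split each block of 2h elements into two halves
        # and replace them with their sum and difference.
        y = y.reshape(-1, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(shape)

class FWHTFunction(torch.autograd.Function):
    # The unnormalized Hadamard matrix is symmetric, so the backward pass
    # is just the same transform applied to the incoming gradient.
    @staticmethod
    def forward(ctx, x):
        return fwht(x)

    @staticmethod
    def backward(ctx, grad_output):
        return fwht(grad_output)

# Usage: y = FWHTFunction.apply(x). Dividing by sqrt(n) once at the end
# restores orthonormal scaling if you need it.
```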
Thanks for the random spinners link.
Then there are a few less useful things.

If you feed the output of a vector-to-vector random projection back to its input, you get a kind of oscillator in which information reverberates around. You can probably understand that with digital signal processing math, and you can use it in reservoir computing. If you connect several of those together, the stored information spreads out, much as entropy does in a physical system.

The intermediate calculations in the (out-of-place) WHT algorithm are wavelet-like. If you normalize each layer of the calculation, you can pick out the highest energy/magnitude entry in the entire system, note its value and location, remove it by setting its value to zero, propagate the effect of that zeroing to every other layer, and then repeat a number of times.

Anyway, the random projections are related to random point picking on a hypersphere (a small sketch follows this comment): http://mathworld.wolfram.com/HyperspherePointPicking.html

This website is slightly messed up due to injected advertising, however there is some information about using WHT random projections with smoothing for compressive sensing:

I don't know how useful that was. It is information largely in the form of hints that you can follow up or not, depending on your disposition. There is a book that has some information about the WHT:
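On the hypersphere point picking link above: the recipe is just a normalized Gaussian vector. A tiny NumPy sketch (the function name is mine, for illustration only):

```python
import numpy as np

def random_point_on_hypersphere(n, rng):
    # MathWorld's recipe: n independent Gaussians, normalized to unit
    # length, give a point uniformly distributed on the (n-1)-sphere.
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
print(random_point_on_hypersphere(4, rng))
```

Each (normalized) output direction of a Gaussian-like random projection behaves much like such a point, which is one way to see the connection.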
Edit (changed link): |
The central limit theorem applies not only to sums, it also applies to sums and differences (e.g. the WHT.) Especially where there is any kind of randomness involved, the output of the WHT will tend toward the Gaussian distribution. That can be quite awkward sometimes. You can get the uniform distribution by normalizing the vector length of the output of the WHT and applying the Gaussian CDF to each element; the vector elements then range between 0 and 1. Alternatively, with no need to normalize, you can apply atan2(x,y) to 2 elements at a time and get a uniform distribution over the range of that function (both tricks are sketched below.)

Sometimes the WHT, at a time cost of O(n log n), is too slow. You could use what I call the hstep function if you only need different mixtures of (windows onto) an input vector, say for associative memory.

The problem with the out-of-place algorithm is that it hammers memory bandwidth: you read data, you do +-, you write data. Even the CPU caches don't have enough bandwidth for that; the algorithm is all data movement. You can use the in-place algorithm to address that issue.
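A rough NumPy/SciPy sketch of those two Gaussian-to-uniform tricks (function names are mine, just for illustration):

```python
import numpy as np
from scipy.stats import norm

def to_uniform_cdf(y):
    # Scale so each element is roughly N(0, 1), then push it through the
    # Gaussian CDF; the results lie in (0, 1) and are roughly uniform.
    y = np.asarray(y, dtype=float)
    y = y * np.sqrt(len(y)) / np.linalg.norm(y)
    return norm.cdf(y)

def to_uniform_atan2(y):
    # No normalization needed: atan2 of consecutive element pairs is
    # uniform over (-pi, pi] when the elements are zero-mean Gaussian
    # with equal variance.
    y = np.asarray(y, dtype=float)
    return np.arctan2(y[0::2], y[1::2])
```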
Fixed Filter Bank Neural Networks: using the fast Walsh-Hadamard transform as the fixed filter bank.
ReLU is a literal switch. An electrical switch is n volts in, n volts out when on. Zero volts out when off. |
If you apply a predetermined random pattern of sign flipping to an input array, followed by the fast (Walsh) Hadamard transform, you get a random projection of the input data. You can use that for unbiased dimension reduction or increase. Repeat the sign-flip-and-transform process for a better quality projection.
The outputs of the random projection strongly follow the Gaussian distribution because the central limit theorem applies not only to sums but also to sums and differences.
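A minimal NumPy sketch of that sign-flip-then-transform projection (the `fwht` here is a plain textbook implementation, not this library's API, and the names are mine):

```python
import numpy as np

def fwht(x):
    # Textbook unnormalized fast Walsh-Hadamard transform
    # (power-of-two length).
    x = np.array(x, dtype=float)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def random_projection(x, sign_patterns):
    # Each row of sign_patterns is a predetermined random +/-1 vector.
    # Dividing by sqrt(n) each round keeps the vector length unchanged.
    for signs in sign_patterns:
        x = fwht(signs * x) / np.sqrt(len(x))
    return x

rng = np.random.default_rng(0)
n = 16
sign_patterns = rng.choice([-1.0, 1.0], size=(2, n))  # two rounds
y = random_projection(rng.standard_normal(n), sign_patterns)
```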
Anyway, if you binarize the outputs of the random projection you have a locality sensitive hash. If you interpret the output bits as +1,-1 you can multiply each bit by a weight and sum to get a recalled value. To train: recall, calculate the error, divide it by the number of bits, then add or subtract that amount from each weight (according to the sign of its hash bit) to make the error zero. In that way you have created an associative memory.
Because the error has been distributed, a non-similar input will see the error fragments multiplied by arbitrary +1,-1 hash bit values. Again you can invoke the central limit theorem to see that the error fragments sum to zero-mean, low-level Gaussian noise.
The memory capacity is just short of the number of bits. If you use the system under capacity, you get some repetition-code error correction.
Basically, when you store a new memory in that system, all the previous memories get contaminated by a little Gaussian noise. However, for an under-capacity set of training data, you can drive that noise to zero by repeated presentation.
This also provides a means to understand extreme learning machines and reservoir computing as associative memory.
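Putting that whole recipe together, here is a rough sketch (class and function names are mine, and the `fwht` is the same plain textbook helper as above, not this library's API):

```python
import numpy as np

def fwht(x):
    # Textbook unnormalized fast Walsh-Hadamard transform (power-of-two length).
    x = np.array(x, dtype=float)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

class AssociativeMemory:
    def __init__(self, n, rounds=2, seed=0):
        rng = np.random.default_rng(seed)
        self.signs = rng.choice([-1.0, 1.0], size=(rounds, n))
        self.weights = np.zeros(n)

    def _hash(self, x):
        # Sign-flip + WHT random projection, binarized to +/-1 hash bits.
        for s in self.signs:
            x = fwht(s * x)
        return np.where(x >= 0.0, 1.0, -1.0)

    def recall(self, x):
        return float(self._hash(x) @ self.weights)

    def store(self, x, target):
        bits = self._hash(x)
        error = target - bits @ self.weights
        # Spread the correction evenly over all the hash bits; this
        # (x, target) pair is now recalled exactly, while other stored
        # pairs only see a little zero-mean noise.
        self.weights += bits * (error / len(bits))

# Repeated presentation of an under-capacity training set drives the
# residual noise toward zero.
memory = AssociativeMemory(n=64)
rng = np.random.default_rng(1)
data = [(rng.standard_normal(64), float(t)) for t in range(8)]
for _ in range(20):
    for x, target in data:
        memory.store(x, target)
print([round(memory.recall(x), 3) for x, _ in data])
```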