This project is the result of my first dive into machine learning. I decided to create my own basic feed-forward neural network from scratch (still using NumPy for matrix multiplication) so that I could learn how machine learning actually works. I'm also going to explain some of the concepts here to further develop my understanding of neural networks. If you notice anything that is incorrect, please let me know.
A basic feed-forward neural network is composed of just nodes, sometimes referred to as neurons, and weights. It consists of an input layer, an output layer, and some number of hidden layers. The weights connect the nodes in one layer to the neurons in the next layer, with each node having its own connection to each neuron.
The output (z) of a neuron can be represented as follows, where x is the input vector, w is the neuron's weight vector, and b is an optional bias: $$z=\sum_{i=1}^{n}x_iw_i+b$$ This summation can be more simply represented for all neurons in a layer as a matrix multiplication, where W is the layer's weight matrix and b its bias vector: $$z = Wx + b$$
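In NumPy, the matrix form above is a single matrix-vector product. A minimal sketch with made-up weights (the layer sizes and values here are purely illustrative):

```python
import numpy as np

# Hypothetical layer: 3 inputs feeding 2 neurons.
x = np.array([0.5, -1.0, 2.0])           # input vector
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])          # weight matrix, one row per neuron
b = np.array([0.1, -0.2])                # bias vector, one entry per neuron

# z = Wx + b computes every neuron's weighted sum at once
z = W @ x + b
```

Each row of `W` holds one neuron's weights, so `W @ x` performs the summation for every neuron in the layer simultaneously.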
So far in this project, I have been working exclusively with the sigmoid function as my activation function, especially for my XOR gate implementation.
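The sigmoid and its derivative (which is needed later for backpropagation) can be sketched as follows; the function names are mine:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

A convenient property of the sigmoid is that its derivative can be written in terms of its own output, so the forward-pass activations can be reused when computing gradients.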
Neural networks use a technique called gradient descent to minimize a cost function. In this case, it's the squared error of the output (the expected output $\hat{y}$ minus the actual output $y$, squared): $$C = (\hat{y} - y)^2$$
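As a quick sketch of this cost (the function name is mine, and summing over the output vector is assumed):

```python
import numpy as np

def squared_error(y_hat, y):
    """Squared error between expected output y_hat and actual output y."""
    return np.sum((y_hat - y) ** 2)
```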
Gradient descent is the practice of adjusting the weights in small steps in order to find the minimum of the cost function. This is done by computing the gradient of the cost with respect to each weight, $\frac{\partial C}{\partial W}$.
To compute this gradient we simply use the chain rule. First we must compute the error term (delta) of the output layer, dropping the constant factor of 2 from differentiating the squared error: $$\delta^{(L)} = (y - \hat{y})\,\sigma'(z^{(L)})$$
Note that this formula only applies to the output layer; for the rest of the layers we must compute the delta using the following formula: $$\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$$
From this, we can compute the gradient for both the weights and biases: $$\frac{\partial C}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^T$$ For the biases, the gradient is simply the delta itself: $$\frac{\partial C}{\partial b^{(l)}} = \delta^{(l)}$$
Finally, we can do the "descent" part. To do this we simply multiply our gradients by our learning rate $\eta$ and subtract them from the current weights and biases: $$W^{(l)} \leftarrow W^{(l)} - \eta\,\frac{\partial C}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta\,\delta^{(l)}$$
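Putting the deltas, gradients, and update rule together, a single descent step for a small two-layer network might look like the sketch below. The layer sizes, learning rate, and variable names are illustrative, not taken from the project's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
eta = 0.1                                 # learning rate (assumed value)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)

x = np.array([1.0, 0.0])                  # example input
y_hat = np.array([1.0])                   # expected output

# Forward pass, keeping the pre-activations z for the backward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
cost_before = np.sum((y_hat - a2) ** 2)

# Backward pass: output-layer delta, then propagate back through W2
delta2 = (a2 - y_hat) * sigmoid_prime(z2)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Descent: weight gradients are outer products, bias gradients are the deltas
W2 -= eta * np.outer(delta2, a1)
b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1
```

One step along the negative gradient should lower the cost for this input, which is the whole idea of the descent.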
My first implementation was a model that predicts the output of an XOR gate. For this I used a 2-2-1 structure, with squared error as my cost function and sigmoid activations for every node. The script to run that implementation can be found here.
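A minimal training loop for a 2-2-1 XOR network along these lines could be sketched as follows. The initialization, seed, epoch count, and learning rate are my own assumptions, not the project's actual hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR truth table: inputs and expected outputs
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # hidden layer (2 neurons)
W2, b2 = rng.normal(size=(1, 2)), np.zeros((1, 1))   # output layer (1 neuron)
eta = 0.5
losses = []

for epoch in range(5000):
    # Forward pass over the whole batch of 4 examples
    A1 = sigmoid(X @ W1.T + b1)
    A2 = sigmoid(A1 @ W2.T + b2)
    losses.append(np.sum((A2 - Y) ** 2))

    # Backward pass (sigmoid derivative written as a * (1 - a))
    d2 = (A2 - Y) * A2 * (1 - A2)
    d1 = (d2 @ W2) * A1 * (1 - A1)

    # Gradient descent step, summing bias gradients over the batch
    W2 -= eta * d2.T @ A1
    b2 -= eta * d2.sum(axis=0, keepdims=True)
    W1 -= eta * d1.T @ X
    b1 -= eta * d1.sum(axis=0, keepdims=True)
```

Note that a 2-2-1 sigmoid network can occasionally get stuck in a poor local minimum on XOR depending on the initialization, so different seeds may train to different final losses.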
This implementation classifies handwritten digits from the well-known MNIST dataset. It simply flattens the image matrix and feeds it into the network, which contains two hidden layers. The structure is 784 (flattened 28×28 pixel matrix) - 128 - 64 - 10 (output vector). The input layer as well as the hidden layers use the sigmoid activation function, but the output layer uses the softmax activation function so that the output can be read as probabilities. The MNIST implementation script can be found here.
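The softmax on the output layer can be sketched as below; this is the standard formulation (with the usual max-subtraction trick for numerical stability), not necessarily copied from the project's code:

```python
import numpy as np

def softmax(z):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)
```

Because the 10 outputs sum to 1, the predicted digit is simply the index of the largest probability.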