Concise from-scratch implementations of fundamental machine learning algorithms from Stanford's CS229 course.
The purpose of this project was to deepen my understanding of the most commonly used ML algorithms.
- Visualizations
- Disclaimer
- Tests
- Prerequisites
- Installation
- Linear regression
- Logistic regression
- Naive Bayes
- Support Vector Machine
- Batch Gradient Descent
- Resources and useful links
Because looking at code is not the most interesting thing, here are some visualizations that I made with my implementations of the algorithms.
These are very simple, primitive implementations of the algorithms found in popular ML packages such as scikit-learn, which have been refined and optimized by experts for years, so the implementations here should only be used for learning purposes.
All algorithms and helper functions are tested with a coverage of:
To run this project you will need to have Python installed on your machine and (preferably) a virtual environment.
Installation steps:
- activate your virtual environment
- install necessary libraries:
pip install -r requirements.txt
- Because of the way the files are structured, you have to run them as modules:
python -m path.to.file
For example, running algorithms/linear_regression/linear_regression.py would look like this:
python -m algorithms.linear_regression.linear_regression
Linear regression is a statistical model which tries to capture the relationship between the dependent and independent variables.
It is mostly used for predicting continuous values and rarely for classification, as it is very sensitive to outliers.
Although a closed-form solution exists for linear regression, which would give the optimal parameter values directly, I still used gradient descent to gain a deeper understanding of the algorithm.
The error metric we are trying to minimize is the mean squared error (MSE) or its root (RMSE):
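For reference, a standard way to write these (my notation, with $n$ samples, targets $y_i$ and predictions $\hat{y}_i$):

```math
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
```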
Running the algorithm for 10k iterations with a learning rate of 0.003, the parameters almost match the ones calculated by sklearn:
Since MSE is a convex function, any local optimum is also the global one, so my algorithm would eventually converge given enough iterations.
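As a rough illustration of the idea (not the exact code in this repo; the function and variable names are my own, with defaults mirroring the 10k iterations and 0.003 learning rate mentioned above), a batch-gradient-descent fit for linear regression can look like this:

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.003, n_iters=10_000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = X @ w + b
        error = y_pred - y
        # Gradients of the MSE with respect to the weights and the bias
        grad_w = (2 / n_samples) * (X.T @ error)
        grad_b = (2 / n_samples) * error.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```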
Logistic regression is a statistical model which works by applying the logistic function to a linear combination of the input features, weights and bias. It is mostly used for classification with a given threshold (usually 0.5): values returned by the logistic function greater than or equal to the threshold are classified as 1, and values below the threshold are classified as 0.
Logistic regression is most commonly trained by minimizing the negative of the log likelihood:
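Written out (my notation, with the sigmoid $\sigma$ and labels $y_i \in \{0, 1\}$), the quantity being minimized is:

```math
-\ell(w, b) = -\sum_{i=1}^{n}\Big[\, y_i \log \sigma(w^\top x_i + b) + (1 - y_i)\,\log\big(1 - \sigma(w^\top x_i + b)\big) \Big]
```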
Training my model and then comparing it to sklearn yields similar results:
As you can see, the weights are off by a little bit. This is because my algorithm hasn't converged yet. The log likelihood is a concave function, meaning any local optimum is also the global one.
This means my algorithm would eventually reach the same weights as sklearn given enough iterations.
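A minimal sketch of this kind of training loop (again, the names are my own and not necessarily the ones used in the repo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.01, n_iters=10_000):
    """Fit a binary classifier by gradient descent on the negative log likelihood."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)   # predicted probabilities
        error = p - y            # derivative of the loss w.r.t. the logits
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b

def predict(X, w, b, threshold=0.5):
    """Classify as 1 when the predicted probability reaches the threshold."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```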
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem and the naive assumption that the features are independent. Although the features are rarely truly independent, Naive Bayes is nevertheless used in real situations because of its speed and the fact that it gives surprisingly good results. Naive Bayes differs from most ML algorithms in that training can use a closed-form solution rather than an expensive iterative algorithm such as gradient descent. I implemented a Gaussian Naive Bayes classifier, which assumes that the features are independent, continuous variables that follow a normal distribution.
For this you only need to calculate the mean, variance and prior probability of each class (here I used Polars):
After this, any new prediction can be made by plugging these values into the probability density function and returning the label with the highest probability:
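A compact sketch of both steps (the repo uses Polars for the statistics; to keep this self-contained I use NumPy here, and all names are my own):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per class: mean and variance of each feature, plus the class prior."""
    stats = {}
    for label in np.unique(y):
        X_c = X[y == label]
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + 1e-9   # small epsilon to avoid division by zero
        prior = len(X_c) / len(X)
        stats[label] = (mean, var, prior)
    return stats

def predict_gaussian_nb(x, stats):
    """Plug a new point into each class's Gaussian PDF, return the most likely label."""
    best_label, best_log_prob = None, -np.inf
    for label, (mean, var, prior) in stats.items():
        # Log of the Gaussian PDF, summed over the (assumed independent) features
        log_pdf = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        log_prob = np.log(prior) + log_pdf
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label
```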
Support vector machines are max-margin models. The algorithm tries to find a hyperplane which best separates the two classes. For this it uses the support vectors (the points of each class closest to the hyperplane) and maximizes the margin (the distance between those points and the hyperplane). SVMs are widely adopted in the industry because they can classify data that is not linearly separable by using the kernel trick, which casts the points into a higher-dimensional space, where they might be linearly separable, without actually calculating the point coordinates in that space. There are two types of SVMs:
- Linear SVM
- Non-linear SVM
I've implemented a linear SVM with a simple linear kernel. I followed the same principles as with logistic regression, but here I needed to minimize the hinge loss.
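A sketch of what minimizing the regularized hinge loss with batch (sub)gradient descent can look like; the labels are assumed to be encoded as -1/+1, and the names and regularization constant are my own choices, not necessarily the repo's:

```python
import numpy as np

def fit_linear_svm(X, y, lr=0.001, lam=0.01, n_iters=1_000):
    """Minimize lam * ||w||^2 + mean(hinge loss) by batch (sub)gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violating = margins < 1   # points inside the margin or misclassified
        # Subgradient of the objective: only margin violators contribute to the hinge term
        grad_w = 2 * lam * w - (y[violating][:, None] * X[violating]).sum(axis=0) / n_samples
        grad_b = -y[violating].sum() / n_samples
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```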
For us to understand gradient descent, first we need to know what the gradient is. Wikipedia defines the gradient of a scalar-valued differentiable function of several variables as the vector field whose value at a point gives the direction and rate of the fastest increase of the function.
In (kind of) human-readable terms: the gradient is a vector whose length matches the number of variables in the function it is derived from. Each component is the derivative of the function with respect to one variable, treating all others as constants, and indicates how altering that specific variable leads to the greatest change in the function's value. Knowing the gradient, we can step in the opposite direction (hence the name descent) to reach the minimum of the function, which is usually our objective when training an ML model. This goes on until the algorithm has converged, which is usually checked by comparing either the loss or the gradient between iterations: if the change between steps is only a very small value, the algorithm has converged.
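In symbols (my notation), for parameters $\theta = (\theta_1, \dots, \theta_n)$ and learning rate $\alpha$, the gradient and the descent update are:

```math
\nabla f(\theta) = \left(\frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_n}\right),
\qquad
\theta \leftarrow \theta - \alpha\,\nabla f(\theta)
```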
There are several types of gradient descent:
- Batch gradient descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
Here I will only talk about batch gradient descent, which is the one I implemented for my algorithms. Batch gradient descent works by calculating the gradient for the whole dataset and then updating the parameters accordingly. This is good because the algorithm will always converge, and bad (and this is why it is rarely used in real-world scenarios) because calculating the gradient for the whole dataset is very slow, especially if you do it thousands of times. There is no single implementation of gradient descent, since calculating it is different for every function, but here is my implementation for calculating it for MSE: