Concise from-scratch implementations of fundamental machine learning algorithms from Stanford's CS229 course.
The purpose of this project was to deepen my understanding of the most commonly used ML algorithms.
- Visualizations
- Disclaimer
- Tests
- Prerequisites
- Installation
- Linear regression
- Logistic regression
- Naive Bayes
- Support Vector Machine
- Batch Gradient Descent
- Resources and useful links
Because looking at code is not the most interesting thing, here are some visualizations that I made with my implementations of the algorithms.
These are very simple, primitive implementations of the algorithms found in popular ML packages such as scikit-learn, which have been refined and optimized by experts for years, so the implementations here should only be used for learning purposes.
All algorithms and helper functions are tested with a coverage of:
To run this project you will need to have Python installed on your machine and (preferably) a virtual environment.
Installation steps:
- activate your virtual environment
- install necessary libraries:
pip install -r requirements.txt
- Because of the way the files are structured, you have to run them as modules:
python -m path.to.file
For example, running algorithms/linear_regression/linear_regression.py would look like this:
python -m algorithms.linear_regression.linear_regression
Linear regression is a statistical model which tries to capture the relationship between the dependent and independent variables.
It is mostly used for predicting continuous values and rarely for classification, as it is very sensitive to outliers.
Although a closed-form solution exists for linear regression, which would give the optimal parameter values directly, I still used gradient descent to gain a deeper understanding of the algorithm.
The error metric we are trying to minimize is the mean squared error (MSE) or its root (RMSE):
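For reference, a standard way to write these (my notation, with $n$ samples, targets $y_i$ and predictions $\hat{y}_i$):

```math
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
```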
Running the algorithm for 10k iterations with a learning rate of 0.003, the parameters almost match the ones calculated by sklearn:
Since MSE is a convex function, any local optimum is also the global one, so my algorithm would eventually converge given enough iterations.
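As a rough illustration of the idea (not the exact code in this repo; the function and variable names are my own, with defaults mirroring the 10k iterations and 0.003 learning rate mentioned above), a batch-gradient-descent fit for linear regression can look like this:

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.003, n_iters=10_000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = X @ w + b
        error = y_pred - y
        # Gradients of the MSE with respect to the weights and the bias
        grad_w = (2 / n_samples) * (X.T @ error)
        grad_b = (2 / n_samples) * error.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```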
Logistic regression is a statistical model which works by applying the logistic function to a linear combination of the input features, weights and bias. It is mostly used for classification with a given threshold (usually 0.5): values returned by the logistic function greater than or equal to the threshold are classified as 1, and values below the threshold are classified as 0.
Logistic regression is most commonly trained by minimizing the negative of the log likelihood:
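Written out (my notation, with the sigmoid $\sigma$ and labels $y_i \in \{0, 1\}$), the quantity being minimized is:

```math
-\ell(w, b) = -\sum_{i=1}^{n}\Big[\, y_i \log \sigma(w^\top x_i + b) + (1 - y_i)\,\log\big(1 - \sigma(w^\top x_i + b)\big) \Big]
```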
Training my model and then comparing it to sklearn yields similar results:
As you can see, the weights are off by a little bit. This is because my algorithm hasn't converged yet. The log likelihood is a concave function, meaning any local optimum is also the global one.
This means my algorithm would eventually reach the same weights as sklearn given enough iterations.
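A minimal sketch of this kind of training loop (again, the names are my own and not necessarily the ones used in the repo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.01, n_iters=10_000):
    """Fit a binary classifier by gradient descent on the negative log likelihood."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)   # predicted probabilities
        error = p - y            # derivative of the loss w.r.t. the logits
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b

def predict(X, w, b, threshold=0.5):
    """Classify as 1 when the predicted probability reaches the threshold."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```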
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem and the naive assumption that the features are independent. Although the features are rarely truly independent, Naive Bayes is nevertheless used in real situations because of its speed and the fact that it gives surprisingly good results. Naive Bayes differs from most ML algorithms in that training can use a closed-form solution rather than an expensive iterative algorithm such as gradient descent. I implemented a Gaussian Naive Bayes classifier, which assumes that the features are independent, continuous variables that follow a normal distribution.
For this you only need to calculate the mean, variance and prior probability of each class (here I used Polars):
After this, any new prediction can be made by plugging these values into the probability density function and returning the label with the highest probability:
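A compact sketch of both steps (the repo uses Polars for the statistics; to keep this self-contained I use NumPy here, and all names are my own):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per class: mean and variance of each feature, plus the class prior."""
    stats = {}
    for label in np.unique(y):
        X_c = X[y == label]
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + 1e-9   # small epsilon to avoid division by zero
        prior = len(X_c) / len(X)
        stats[label] = (mean, var, prior)
    return stats

def predict_gaussian_nb(x, stats):
    """Plug a new point into each class's Gaussian PDF, return the most likely label."""
    best_label, best_log_prob = None, -np.inf
    for label, (mean, var, prior) in stats.items():
        # Log of the Gaussian PDF, summed over the (assumed independent) features
        log_pdf = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        log_prob = np.log(prior) + log_pdf
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label
```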
Support vector machines are max-margin models. The algorithm tries to find a hyperplane which best separates the two classes. For this it uses the support vectors (the points of each class closest to the hyperplane) and maximizes the margin (the distance between those points and the hyperplane). SVMs are widely adopted in the industry because they can classify data that is not linearly separable by using the kernel trick, which casts the points into a higher-dimensional space, where they might be linearly separable, without actually calculating the point coordinates in that space. There are two types of SVMs:
- Linear SVM
- Non-linear SVM
I've implemented a linear SVM with a simple linear kernel. I followed the same principles as with logistic regression, but here I needed to minimize the hinge loss.
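A sketch of what minimizing the regularized hinge loss with batch (sub)gradient descent can look like; the labels are assumed to be encoded as -1/+1, and the names and regularization constant are my own choices, not necessarily the repo's:

```python
import numpy as np

def fit_linear_svm(X, y, lr=0.001, lam=0.01, n_iters=1_000):
    """Minimize lam * ||w||^2 + mean(hinge loss) by batch (sub)gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violating = margins < 1   # points inside the margin or misclassified
        # Subgradient of the objective: only margin violators contribute to the hinge term
        grad_w = 2 * lam * w - (y[violating][:, None] * X[violating]).sum(axis=0) / n_samples
        grad_b = -y[violating].sum() / n_samples
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```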
For us to understand gradient descent, first we need to know what the gradient is. Wikipedia defines the gradient of a scalar-valued differentiable function of several variables as the vector field whose value at a point gives the direction and rate of the fastest increase of the function.
In (kind of) human-readable terms: the gradient is a vector whose length matches the number of variables in the function it is derived from. Each component is the derivative of the function with respect to one variable, treating all others as constants, and indicates how altering that specific variable leads to the greatest change in the function's value. Knowing the gradient, we can step in the opposite direction (hence the name descent) to reach the minimum of the function, which is usually our objective when training an ML model. This goes on until the algorithm has converged, which is usually checked by comparing either the loss or the gradient between iterations: if the change between steps is only a very small value, the algorithm has converged.
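In symbols (my notation), for parameters $\theta = (\theta_1, \dots, \theta_n)$ and learning rate $\alpha$, the gradient and the descent update are:

```math
\nabla f(\theta) = \left(\frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_n}\right),
\qquad
\theta \leftarrow \theta - \alpha\,\nabla f(\theta)
```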
There are several types of gradient descent:
- Batch gradient descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
Here I will only talk about batch gradient descent, which is the one I implemented for my algorithms. Batch gradient descent works by calculating the gradient for the whole dataset and then updating the parameters accordingly. This is good because the algorithm will always converge, and bad (and this is why it is rarely used in real-world scenarios) because calculating the gradient for the whole dataset is very slow, especially if you do it thousands of times. There is no single implementation of gradient descent, since calculating it is different for every function, but here is my implementation for calculating it for MSE: