Back To Index

Gradient Descent Tutorial

Background:

Should you have an interest in an unstable and computationally expensive algorithm called gradient descent, then this tutorial is for you. For many optimization/'machine learning' problems there are better algorithms; what you are probably interested in is the Normal Equation. However, gradient descent perhaps has merit as a stepping stone to algorithms that address large parameter optimization problems (thousands to millions of parameters). Motivation aside, the material below is my attempt to explain this algorithm.

Notation:

Firstly, the notation to be used is listed below, along with the respective dimensions.

This assumes m data samples have been collected and that the model has n parameters. y represents the output data that has been measured. theta is the vector of model parameters to be estimated. r is the residual vector and is explained in the following section.
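Summarizing the dimensions (inferred from the example code at the end of this tutorial):

| Symbol | Dimension | Description |
| ------ | --------- | ----------- |
| y      | m x 1     | Measured output data |
| yHat   | m x 1     | Predicted data (model output) |
| theta  | n x 1     | Model parameters to be estimated |
| r      | m x 1     | Residual vector |
| J      | m x n     | Jacobian of the residual |
| alpha  | scalar    | Step length |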

Step 1: Calculate r

In optimization, the residual vector is the vector of errors: the difference between the observed data and the predicted data. yHat is our predicted data, or model output.

Our model output is some function f of theta.
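In equation form, writing yHat as $\hat{y}$:

$$
r = y - \hat{y}, \qquad \hat{y} = f(\theta)
$$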

Step 2: Calculate the Jacobian.

The Jacobian is the partial derivative of the residual vector with respect to the model parameters.

If the Jacobian is independent of the model parameters, this is a linear convex optimization problem: there is a global optimal solution, and the Jacobian matrix is constant.
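Element-wise, the Jacobian is:

$$
J_{ij} = \frac{\partial r_i}{\partial \theta_j}
$$

For the linear case used later, where $f(\theta) = -J\theta$, the residual is $r = y + J\theta$, so its Jacobian is indeed the constant matrix $J$.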

Step 3: Calculate the gradient.

Next, we calculate the gradient of the squared-residual objective with respect to the model parameters.
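Taking the objective to be half the sum of squared residuals (the 1/2 is a convention I am assuming here; it cancels the factor of 2 from differentiation), the chain rule gives:

$$
\nabla_\theta \left( \tfrac{1}{2} r^\top r \right) = \left( \frac{\partial r}{\partial \theta} \right)^{\!\top} r = J^\top r
$$

This is the `(J')*r` term that appears in the code below.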

Step 4: Calculate the next theta.

Gradient descent requires an initial guess of theta. From there, theta is updated each iteration. The idea is to approximate the slope (gradient) of the objective and travel some fixed step size in the direction opposite the gradient (downhill). Steps are often notated in optimization as p. Here, our step is the normalized gradient multiplied by the step size alpha.
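Written as an update rule, with k as the iteration index:

$$
\theta_{k+1} = \theta_k - \alpha \, \frac{J^\top r_k}{\lVert J^\top r_k \rVert}
$$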

This is then calculated iteratively and theta is recalculated each step:

```matlab
iterations = 10000;          % Fixed iteration count
alpha = 0.01;                % Step length
theta = zeros(n,1);          % Initial guess (n = number of model parameters)
for i = 2:iterations
    r = yMeasured - (-J*theta(:,i-1));                        % Calculate residual
    theta(:,i) = theta(:,i-1) - alpha*(J')*r/norm((J')*r);    % Update theta
end
```

Note the above is for a linear convex estimation problem. In such a problem f(theta) = -J*theta.

Limitations:

An initial guess of theta must be provided to the algorithm. Convergence depends on how far off the initial guess is relative to the step length alpha. Consequently, to use this algorithm in practice, the initial guess and alpha must be manually tuned until convergence is observed. Naturally, this is a significant drawback of the algorithm. Also, if the step length is too small, the algorithm may not converge in our lifetime or may run into numerical issues; if it is too large, the iterates can oscillate around the solution or diverge.

Convergence:

One item not covered yet is convergence. One method is simply to run for a fixed number of iterations and hope for the best. A slightly better method is to set a maximum iteration limit but stop early once convergence is observed. Gradient descent converges at best linearly, if at all. Consequently, a practical stopping rule is to halt once the norm of the gradient (or the change in theta between iterations) falls below a chosen tolerance.
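As a minimal sketch, here is a variant of the gradDescent routine from the Example section below with such a tolerance check added; the tol argument and the early-exit test are my additions, not part of the original routine:

```matlab
function theta = gradDescentTol(J,y,theta,alpha,itLimit,tol)
for i = 1:itLimit
    r = y - (-J*theta);              % Calculate residual
    g = (J')*r;                      % Gradient of the squared-residual objective
    if norm(g) < tol                 % Converged: gradient is sufficiently small
        break;
    end
    theta = theta - alpha*g/norm(g); % Update theta
end
end
```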

Example:

Consider the following cubic model, sampled at x values between 0 and 5 in 0.1 increments. The measured output is the model with normally distributed noise added to it.
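Reconstructing the model from the code at the end of this section, the true and measured outputs are:

$$
y = -0.5\,x^3 + 2\,x^2 + 2, \qquad y_{\text{measured}} = y + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\,1)
$$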

The following is a basic MATLAB/Octave gradient descent algorithm:

```matlab
function theta = gradDescent(J,y,theta,alpha,itLimit)
for i = 1:itLimit
    r = y - (-J*theta);                          % Calculate residual
    theta = theta - alpha*(J')*r/norm((J')*r);   % Update theta
end
end
```

Using this basic algorithm operating on this data one may observe a plot similar to the following:

Note here a comparison was made against least squares. Least squares will always produce a result at least as optimal, potentially at lower computational cost. Furthermore, least squares is guaranteed to produce a solution, whereas gradient descent depends on step size selection and iteration limits.
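For reference, the least squares result in the comparison is the closed-form normal equation solution, which for this model ($\hat{y} = -J\theta$) works out to:

$$
\theta_{\text{LSQ}} = -\left(J^\top J\right)^{-1} J^\top y
$$

This matches the `thetaLSQ` line in the code below (with `pinv` used in place of an explicit inverse).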

Here is the code which generated the above plot:

```matlab
x = (0:0.1:5)';                        % Sample points
y = -0.5*x.^3 + 2*x.^2 + 2;            % True cubic model
noise = randn(length(x),1);            % Standard normal noise
yMeasured = noise + y;                 % Noisy measurements
iterations = 100000;                   % Fixed iteration count

J = -1.*[x.^3, x.^2, x.^1, x.^0];      % Jacobian: one column per basis function
theta = zeros(size(J,2),1);            % Initial guess of the parameters

alpha = 0.05;                          % Step length

theta = gradDescent(J,yMeasured,theta,alpha,iterations);
thetaLSQ = -1*pinv((J')*J)*(J')*yMeasured;   % Closed-form least squares solution

fig1 = figure(1);
clf(fig1);
hold on
scatter(x,yMeasured);
plot(x,y)
plot(x,(-J)*theta)
plot(x,(-J)*thetaLSQ)
grid on
set(gca,'FontSize',10,'FontWeight','bold');
set(gcf,'Units','Pixels');
set(gcf,'Position',[2500, 500, 750, 450]);
legend('Measured y','True y','Gradient Descent','Least Squares')
title('Gradient Descent')
```

Back To Index