final_report.tex

\documentclass[a4paper]{article}
\usepackage{xeCJK}
\setCJKmainfont{IPAMincho}
\usepackage{amsmath}
\usepackage{amssymb}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\matr}[1]{\mathbf{#1}}

\usepackage{graphicx}
\usepackage{url}
\usepackage{hyperref}
\usepackage{subcaption}

\author{Peter Spalthoff, 18M30829, 情報工学コース}
\title{Final report}

\begin{document}
\maketitle
\section{Problem 1}
I implemented both the batch steepest gradient and the Newton optimization for
binary classification using linear logistic regression. The code can be found at
\url{https://github.com/derpda/machine_learning_class.git}.

Figure~\ref{fig:weights_data} shows the result of the fitting for a data set of
200 data points created the same way as the provided dataset II\@. Training is
done for 100 iterations over the data set. The resulting class separation varies
slightly between the two algorithms, but both achieved an accuracy of 85\% on
this data set.

Training was also tried with mini-batches of size 16. For some of these
executions, the loss function increased. This happened particularly when the
initial weights were already close to optimal. When training for more epochs,
the loss function seems to stabilize at a point higher than previously attained
minimum values. This suggests that the applied methods do not reliably seek a
minimum of the loss function. Maybe, there is some error or numerical
instability in the calculation of the loss function. The behavior is as
expected for most executions and the values are in the same general range.

The prediction accuracy increased even in the cases where the loss function
increased. This is surprising and may be due to an error in the code. However, I
was not able to find the source of this issue. Training with all data points at
the same time did not show the same issue. Thus, all results presented are drawn
from training where all data points where used for a single weight update.

\begin{figure}[hbt]
    \begin{subfigure}{.5\textwidth}
        \includegraphics[width=\linewidth]{./tex_subfiles/weights_data_before.pdf}
    \end{subfigure}%
    \begin{subfigure}{.5\textwidth}
        \includegraphics[width=\linewidth]{./tex_subfiles/weights_data_after.pdf}
    \end{subfigure}
    \caption{The used data set along with the prediction border. The left figure
    shows the initial state and the right figure shows the state after 100
    iterations over the data for the respective optimization methods.}\label{fig:weights_data}
\end{figure}

Figure~\ref{fig:sgd_vs_newton} shows the development of the value of the loss
function for the iterations over the data (epochs). Clearly, the Newton
optimization method yields much faster convergence than the batch SGD version.
For some created data, the SGD method seems to converge faster in contradiction
to the theory presented in the lecture. The Newton approach seems to be unstable
in some situations.

\begin{figure}[hbt]
    \includegraphics[width=\linewidth]{./tex_subfiles/sgd_vs_newton.pdf}
    \caption{The value of the loss function at each epoch plotted for batch SGD
    and Newton optimization.}\label{fig:sgd_vs_newton}
\end{figure}

\section{Problem 3}
\subsection{Theory}
We start from the primal problem
\begin{equation*}
    \argmin_{\vec{w}}\left(\sum_{i=1}^{n}\max(0,1-y_i\vec{w}^T\vec{x}_i)
    + \lambda\rVert\vec{w}\lVert_2^2\right)
\end{equation*}
and introduce the slack variable $\epsilon_i=\max(0,1-y_i\vec{w}^T\vec{x}_i)$.
It has the constraints $\epsilon_i\geq0$ and
$y_i\vec{w}^T\vec{x}_i-1+\epsilon_i\geq0$.
Thus the new primal problem is
\begin{equation*}
    \argmin_{\vec{w}} \sum_{i=1}^{n}\epsilon_i + \lambda\rVert\vec{w}\lVert_2^2
\end{equation*}
We now formulate the Lagrangian as
\begin{equation}\label{eq:primal_lagrange}
    L(\vec{w}, \vec{\epsilon}) =
    \sum_{i=1}^{n}\epsilon_i
    + \lambda\rVert\vec{w}\lVert_2^2
    - \sum_{i=1}^{n}\alpha_i(y_i\vec{w}^T\vec{x}_i - 1 + \epsilon_i)
    - \sum_{i=1}^{n}\beta_i\epsilon_i
\end{equation}
Now we evaluate the first Karush-Khan-Tucker condition. We start with the
derivative for $\vec{w}$
\begin{equation*}
    \frac{\partial L}{\partial\vec{w}} =
    2\lambda\vec{w}
    - \sum_{i=1}^{n}\alpha_iy_i\vec{x}_i \stackrel{!}{=} 0
\end{equation*}
from which it is clear that
\begin{equation}\label{eq:w_representation}
    \hat{\vec{w}} = \frac{1}{2\lambda}\sum_{i=1}^{n}\alpha_iy_i\vec{x}_i
\end{equation}
which is the solution to the second sub-problem.

We continue with the derivative for $\vec{\epsilon}$
\begin{equation*}
    \frac{\partial L}{\partial\vec{\epsilon}} =
    \vec{1} - \vec{\alpha} - \vec{\beta} \stackrel{!}{=} \vec{0}
\end{equation*}
where $\vec{0}$ is a vector with only 0's and $\vec{1}$ accordingly a vector
with only 1's.

From this, we can derive
\begin{equation}\label{eq:a_representation}
    \alpha_i = 1 - \beta_i
\end{equation}
Also, with the KKT-conditions, we know $\vec{\alpha}\geq\vec{0}$ and
$\vec{\beta}\geq\vec{0}$ and thus
\begin{equation}\label{eq:a_condition}
    \vec{0} \geq \vec{\alpha} \geq \vec{1}
\end{equation}
Now we slightly rewrite the Lagrangian of the primal problem (see
equation~\ref{eq:primal_lagrange}) as
\begin{equation*}
    L(\vec{w}, \vec{\epsilon}) =
    \sum_{i=1}^{n}(1-\beta_i)\epsilon_i
    - \sum_{i=1}^{n}\alpha_i\epsilon_i
    + \lambda\rVert\vec{w}\lVert_2^2
    - \sum_{i=1}^{n}\alpha_iy_i\vec{w}^T\vec{x}_i
    + \sum_{i=1}^{n}\alpha_i
\end{equation*}
and continue by inserting the results from equation~\ref{eq:w_representation}
and equation~\ref{eq:a_representation} (while taking care of correct sum
indices) to get
\begin{equation*}
\begin{split}
    L(\vec{\alpha}, \vec{\beta}) =
    &\sum_{i=1}^{n}(1-\beta_i)\epsilon_i
    - \sum_{i=1}^{n}(1-\beta_i)\epsilon_i
    + \lambda\left\rVert\
        \frac{1}{2\lambda}\sum_{i=1}^{n}\alpha_iy_i\vec{x}_i
    \right\lVert_2^2\\
    &- \sum_{i=1}^{n}\alpha_iy_i
    \left(\frac{1}{2\lambda}
        \sum_{j=1}^{n}\alpha_jy_j\vec{x}_j^T
    \right)\vec{x}_i
    + \sum_{i=1}^{n}\alpha_i
\end{split}
\end{equation*}
The first two terms eliminate each other. We can rewrite the third term by
explicitly writing the L2 norm as $\lVert\vec{w}\rVert_2^2=\vec{w}^T\vec{w}$,
and thus
\begin{equation*}
\begin{split}
    L(\vec{\alpha}) &= \frac{1}{4\lambda}
    \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_iy_iy_j\vec{x}_i^T\vec{x}_j\alpha_j
    - \frac{1}{2\lambda}
    \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_iy_iy_j\vec{x}_i^T\vec{x}_j\alpha_j
    + \sum_{i=1}^{n}\alpha_i\\
    &= \sum_{i=1}^{n}\alpha_i
    - \frac{1}{4\lambda}
    \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_iy_iy_j\vec{x}_i^T\vec{x}_j\alpha_j
\end{split}
\end{equation*}
Note that this term does no longer depend on $\vec{\beta}$. This is the dual
problem of the SVM\@. As the primal problem was a minimization, the dual problem
is a maximization problem. Writing this in matrix-vector form with
$\matr{K}=y_iy_j\vec{x}_i^T\vec{x}_j$ yields, together with the condition from
equation~\ref{eq:a_condition}, the desired result
\begin{equation*}
    \argmax_{\vec{\alpha}}\left(
        - \frac{1}{4\lambda}\vec{\alpha}^T\matr{K}\vec{\alpha}
        + \vec{\alpha}^T\vec{1}
    \right)
\end{equation*}
\subsection{Implementation}
The code of the implementation can be found in the repository on
\url{https://github.com/derpda/machine_learning_class.git}.
Data was created using the same function as was used for the first problem (data
set II). 50 iterations over the 200 data points produced the primal loss
function and dual Lagrange function scores in figure~\ref{fig:svm_loss}.
After 50 iterations, there is still a gap between the functions. However, when
running for 1000 iterations, the relative difference between the primal loss and
the dual function score is less than 1\%.
\begin{figure}[hbt]
    \includegraphics[width=\linewidth]{./tex_subfiles/svm_loss.pdf}
    \caption{The score of the primal loss function with regularization (blue)
    and the score of the negative dual Lagrange function (orange) for the
    iterations over the data.}\label{fig:svm_loss}
\end{figure}

\section{Constructive criticism of the lecture}
I thought the general idea of the lecture, focusing heavily on theory and deeper
understanding of basic machine learning techniques, was good. However, there
were often critical errors on the slides that sometimes made it very hard to
follow the lecture. It might be good to have a second pair of eyes, such as a
teaching assistant, look over the slides to make sure there are no major
errors.

As for the homework, again, I enjoyed the focus on theory. However, I personally
have a much easier time understanding and remembering the theory if I implement
actual code directly based on the theory (like in this assignment!). I think
there was only one (optional) part of one weeks assignment where an
implementation was asked. Encouraging students to implement the theory that was
taught in the lecture might help them get a deeper and longer lasting
understanding of machine learning.
\end{document}