
Commit

Updates from Overleaf
AndreasMadsen committed Nov 13, 2019
1 parent 3081851 commit e2fb3e9
Showing 4 changed files with 51 additions and 27 deletions.
3 changes: 1 addition & 2 deletions paper/appendix/nalu-author-comments.tex
@@ -1,4 +1,3 @@

\section{Additional notes}

\subsection{Language to Number Translation Tasks}
@@ -37,4 +36,4 @@ \subsection{Language to Number Translation Tasks}
\end{bmatrix}^T &\left(I(x_t \not= \texttt{100})\begin{bmatrix}100 \\ 1 \\ 1\end{bmatrix}\right)
\end{aligned}
\label{eq:langauge-to-numbers-lstm}
\end{equation}
\end{equation}
41 changes: 33 additions & 8 deletions paper/appendix/sequential-mnist.tex
@@ -1,22 +1,47 @@
\section{Sequential MNIST}

\subsection{Task and evaluation criteria}
The simple function task is a purely synthetic task, that doesn't require a deep network. As such it doesn't tests if an arithmetic layer prevents the networks ability to be optimized using gradient decent.
The simple function task is purely synthetic and does not require a deep network. As such, it does not test whether an arithmetic layer inhibits the network's ability to be optimized using gradient descent.

The sequential MNIST task, takes the numerical value of a sequence of MNIST digits and then applies a binary operation recursively. That is $t_i = Op(t_{i-1}, z_t)$, where $z_t$ is the MNIST digit's numerical value.
The sequential MNIST task takes the numerical value of a sequence of MNIST digits and applies a binary operation recursively, such that $t_i = Op(t_{i-1}, z_i)$, where $z_i$ is the numerical value of the $i$'th MNIST digit.
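To make the recursion concrete, here is a minimal Python sketch of how the targets could be constructed (assuming NumPy; the function and variable names are illustrative, not taken from the paper's code):

import numpy as np

def sequential_targets(digit_values, op):
    # digit_values: numerical values z_1, ..., z_T of the MNIST digits in the sequence
    # op: binary operation, e.g. np.add for sequential addition or np.multiply for the product task
    t = digit_values[0]
    targets = [t]
    for z in digit_values[1:]:
        t = op(t, z)  # t_i = Op(t_{i-1}, z_i)
        targets.append(t)
    return np.array(targets)

# example: sequential addition of the digit values 3, 1, 4 gives the targets [3, 4, 8]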

As the performance of this task depends on the quality of the image-to-scalar network, as well as the arithmetic layer itself. As threshold has to be determined from an empirical baseline. This is done by letting the arithmetic layer be solved, such that only the image-to-scalar is learned. By learning this over multiple seeds an an upper bound for an MSE threshold can be set. In our experiment we use the 1\% one-sided upper confidence-interval, assuming a student-t distribution.
The performance of this task depends on the quality of the image-to-scalar network and on the arithmetic layer's ability to model the scalar. We use the mean squared error (MSE) to evaluate the joint image-to-scalar and arithmetic layer model. To determine an MSE threshold that corresponds to a correct prediction, we use an empirical baseline: the arithmetic layer is fixed to the correct solution, such that only the image-to-scalar network is learned. By training this baseline over multiple seeds, an upper bound for the MSE threshold can be set. In our experiment we use the 1\% one-sided upper confidence-interval, assuming a student-t distribution.
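As a rough illustration of how such a threshold could be computed, the following sketch (assuming SciPy; the exact procedure in the experiment code may differ) estimates the 1% one-sided upper confidence-interval of the baseline MSEs over seeds:

import numpy as np
import scipy.stats

def mse_threshold(baseline_mses, alpha=0.01):
    # baseline_mses: one MSE per seed for the baseline where the arithmetic layer is solved
    n = len(baseline_mses)
    mean = np.mean(baseline_mses)
    sem = np.std(baseline_mses, ddof=1) / np.sqrt(n)
    # one-sided upper confidence bound, assuming a student-t distribution
    return mean + scipy.stats.t.ppf(1 - alpha, df=n - 1) * sem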

A success-criteria is again used, as reporting the MSE is not interpretable, and models that don't converge will obscure the mean. Furthermore, because the operation is applied recursively, natural error from the dataset will accumulate over time, thus exponentially increasing the MSE. Using a baseline model and reporting the successfulness solves this issue.
Similar to the simple function task, we use a success-criterion, as reporting the MSE is not interpretable and models that do not converge would obscure the mean. Furthermore, because the operation is applied recursively, natural error from the dataset accumulates over time, thus exponentially increasing the MSE. Using a baseline model and reporting the success-rate solves this interpretation challenge.

\subsection{Without the \texorpdfstring{$\mathrm{R}_z$}{R\_z} regularizer}
\subsection{Addition of sequential MNIST}

Figure \ref{fig:sequential-mnist-sum} shows results for sequential addition of MNIST digits. This experiment is identical to the MNIST Digit Addition Test from \citet[section 4.2]{trask-nalu}. The models are trained on sequences of 10 digits and evaluated on sequences of between 1 and 1000 MNIST digits.

Note that the NAU model includes the $\mathrm{R}_z$ regularizer, similar to the ``Product of sequential MNIST'' experiment in section \ref{section:results:cumprod_mnist}. To provide a fair comparison, a variant of $\mathrm{NAC}_{+}$ that also uses this regularizer is included; this variant is called $\mathrm{NAC}_{+, R_z}$. Section \ref{sec:appendix:sequential-mnist-sum:ablation} provides an ablation study of the $\mathrm{R}_z$ regularizer.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{paper/results/sequential_mnist_sum_long.pdf}
\caption{Shows the ability of each model to learn the arithmetic operation of addition and to backpropagate through the arithmetic layer in order to learn an image-to-scalar mapping for MNIST digits. The models are tested by extrapolating to longer sequence lengths than those used during training. The NAU and $\mathrm{NAC}_{+,R_z}$ models use the $\mathrm{R}_z$ regularizer from section \ref{section:results:cumprod_mnist}.}
\label{fig:sequential-mnist-sum}
\end{figure}

\subsection{Sequential addition without the \texorpdfstring{$\mathrm{R}_z$}{R\_z} regularizer}
\label{sec:appendix:sequential-mnist-sum:ablation}

As an ablation study of the $\mathrm{R}_z$ regularizer, figure \ref{fig:sequential-mnist-sum-ablation} shows the NAU model without the $\mathrm{R}_z$ regularizer. Removing the regularizer causes a reduction in the success-rate. The reduction is likely larger than for sequential multiplication because the sequence length used for training is longer. The loss function is most sensitive to the 10th output in the sequence, as this has the largest scale. This causes some of the model instances to simply learn the mean, which becomes passable for very long sequences; this is why the success-rate increases for longer sequences. However, this is not a valid solution. A well-behaved model should be successful independently of the sequence length.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{paper/results/sequential_mnist_sum_long_ablation.pdf}
\caption{Same as figure \ref{fig:sequential-mnist-sum}, but where the NAU model does not use the $\mathrm{R}_z$ regularizer.}
\label{fig:sequential-mnist-sum-ablation}
\end{figure}

\subsection{Sequential multiplication without the \texorpdfstring{$\mathrm{R}_z$}{R\_z} regularizer}
\label{sec:appendix:sequential-mnist:ablation}

As an ablation study of just the $\mathrm{R}_z$ regularizer, figure \ref{fig:sequential-mnist-prod-ablation} shows the NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models without the $\mathrm{R}_z$ regularizer. The success-rate is somewhat similar. However, as seen in the ``sparsity error'' plot, the solution is quite different.
As an ablation study of the $\mathrm{R}_z$ regularizer, figure \ref{fig:sequential-mnist-prod-ablation} shows the NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models without the $\mathrm{R}_z$ regularizer. The success-rate is somewhat similar to that in figure \ref{fig:sequential-mnist-prod-results}. However, as seen in the ``sparsity error'' plot, the solution is quite different.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth,trim={0 0.5cm 0 0},clip]{results/sequential_mnist_prod_long_ablation.pdf}
\caption{Shows the ability of each model to backpropergation and extrapolate to larger sequence lengths. The NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models do not use the $\mathrm{R}_z$ regularizer.}
\caption{Shows the ability of each model to learn the arithmetic operation of multiplication and to backpropagate through the arithmetic layer in order to learn an image-to-scalar mapping for MNIST digits. The models are tested by extrapolating to longer sequence lengths than those used during training. The NMU and $\mathrm{NAC}_{\bullet,\mathrm{NMU}}$ models do not use the $\mathrm{R}_z$ regularizer.}
\label{fig:sequential-mnist-prod-ablation}
\end{figure}
\end{figure}
32 changes: 16 additions & 16 deletions paper/appendix/simple-function-task.tex
@@ -120,7 +120,7 @@ \subsection{Effect of dataset parameter}

To stress test the models on the multiplication task, we vary the dataset parameters one at a time while keeping the others at their default values (defaults in table \ref{tab:simple-function-task-defaults}). Each configuration is run for 50 experiments with different seeds. The results are visualized in figure \ref{fig:simple-function-static-dataset-parameters-boundary}.

In figure \ref{fig:simple-function-static-theoreical-claims-experiment}, the inperpolation-range is changed, therefore the the extrapolation-range needs to be changed such it doesn't overlap. For each interpolation-range the following extrapolation-range is used: ${\mathrm{U}[-2,-1] \text{ uses } \mathrm{U}[-6,-2]}$, ${\mathrm{U}[-2,2] \text{ uses } \mathrm{U}[-6,-2] \cup \mathrm{U}[2,6]}$, ${\mathrm{U}[0,1] \text{ uses } \mathrm{U}[1,5]}$, ${\mathrm{U}[0.1,0.2] \text{ uses } \mathrm{U}[0.2,2]}$, ${\mathrm{U}[1.1,1.2] \text{ uses } \mathrm{U}[1.2,6]}$, ${\mathrm{U}[1,2] \text{ uses } \mathrm{U}[2,6]}$, ${\mathrm{U}[10, 20] \text{ uses } \mathrm{U}[20, 40]}$.
In figure \ref{fig:simple-function-static-theoreical-claims-experiment} the interpolation-range is changed; the extrapolation-range therefore needs to be changed such that the two do not overlap. For each interpolation-range the following extrapolation-range is used: ${\mathrm{U}[-2,-1] \text{ uses } \mathrm{U}[-6,-2]}$, ${\mathrm{U}[-2,2] \text{ uses } \mathrm{U}[-6,-2] \cup \mathrm{U}[2,6]}$, ${\mathrm{U}[0,1] \text{ uses } \mathrm{U}[1,5]}$, ${\mathrm{U}[0.1,0.2] \text{ uses } \mathrm{U}[0.2,2]}$, ${\mathrm{U}[1.1,1.2] \text{ uses } \mathrm{U}[1.2,6]}$, ${\mathrm{U}[1,2] \text{ uses } \mathrm{U}[2,6]}$, ${\mathrm{U}[10, 20] \text{ uses } \mathrm{U}[20, 40]}$.

\begin{figure}[h]
\centering
@@ -134,26 +134,26 @@ \subsection{Effect of dataset parameter}
\subsection{Gating convergence experiment}
\label{sec:appendix:nalu-gate-experiment}

In the interest of adding some understand of what goes wrong in the NALU gate, and the shared weight choice that NALU employs to fix this, we introduce the following experiment.
In the interest of adding some understanding of what goes wrong in the NALU gate, and of the shared-weight choice that NALU employs to remedy this, we introduce the following experiment.

We train two models to fit the arithmetic task. Both use $\mathrm{NAC}_{+}$ in the first layer and NALU in the second layer. The only difference is that one model shares the weights between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ in the NALU, while the other treats them as two separate units with separate weights. In both cases NALU should gate between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ and choose the appropriate operation. Note that this NALU model is different from the one presented elsewhere in this paper, including the original NALU paper \cite{trask-nalu}. The typical NALU model is just two NALU layers with shared weights.

Furthermore, we also introduce a new gated unit that simply gates between our proposed NMU and NAU, using the same sigmoid gating-mechanism in NALU. This also used seperate weights, as NMU and NAU uses different weight constrains and can therefore not be shared.
Furthermore, we also introduce a new gated unit that simply gates between our proposed NMU and NAU, using the same sigmoid gating-mechanism as in the NALU. This combination is done with separate weights, as NMU and NAU use different weight constraints and their weights can therefore not be shared.
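A minimal sketch of such a sigmoid-gated combination, written as a PyTorch-style module (the class name is ours, and the add_unit/mul_unit sub-layers, e.g. an NAU and an NMU with separate weights, are assumed to be defined elsewhere):

import torch

class SigmoidGatedUnit(torch.nn.Module):
    def __init__(self, in_features, out_features, add_unit, mul_unit):
        super().__init__()
        self.add_unit = add_unit  # additive sub-unit, e.g. NAU or NAC+
        self.mul_unit = mul_unit  # multiplicative sub-unit, e.g. NMU or NAC*, with separate weights
        self.G = torch.nn.Parameter(torch.empty(out_features, in_features))
        torch.nn.init.xavier_uniform_(self.G)

    def forward(self, x):
        g = torch.sigmoid(torch.nn.functional.linear(x, self.G))  # gate-value in (0, 1)
        return g * self.add_unit(x) + (1 - g) * self.mul_unit(x)

In the shared-weight NALU variant the two sub-units additionally use the same weight matrix; in the separated variant, and in the NAU/NMU combination above, they do not.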

The models are trained for 100 different seeds, on the multiplication and addition task. A histogram of the gate-value is presented in figure \ref{fig:simple-function-static-nalu-gate-graph} and table \ref{tab:simple-function-static-nalu-gate-table} contains a summary. Some noteworthy observations:
The models are trained and evaluated over 100 different seeds on the multiplication and addition tasks. A histogram of the gate-values for all seeds is presented in figure \ref{fig:simple-function-static-nalu-gate-graph}, and table \ref{tab:simple-function-static-nalu-gate-table} contains a summary. Some noteworthy observations:

\vspace{-0.3cm}\begin{enumerate}
\item When the NALU weights are separated, far more trials converge to select $\mathrm{NAC}_{+}$, for both the addition and multiplication task. Sharing the weights between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$, makes it less likely for addition to be selected.
\item When the NALU weights are separated, far more trials converge to select $\mathrm{NAC}_{+}$, for both the addition and multiplication tasks. Sharing the weights between $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ makes the gating less likely to converge for addition.
\item The performance on the addition task is dependent on NALU selecting the right operation. In the multiplication task, when the right gate is selected, $\mathrm{NAC}_{\bullet}$ does not converge consistently, unlike our NMU, which converges more consistently.
\item Which gate is selected appears to be mostly random and independent of the task. This issues caused by the sigmoid gating-mechanism and thus exists independent of the used sub-units.
\item Which operation the gate converges to appears to be mostly random and independent of the task. This issue is caused by the sigmoid gating-mechanism and thus exists independently of the sub-units used.
\end{enumerate}

These observations validates that the NALU gating-mechanism does not converge as intended. This becomes a critical issues when more gates are present, as is normally the case.
These observations validate that the NALU gating-mechanism does not converge as intended. This becomes a critical issue when more gates are present, as is normally the case, e.g.\ when stacking multiple NALU layers.

\begin{figure}[h]
\centering
\includegraphics[width=0.98\linewidth]{results/function_task_static_nalu.pdf}
\caption{Shows the gating-value in the NALU layer and a variant that uses NAU/NMU instead of $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$. Separate/shared, refers to the weights in $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$ used in NALU.}
\caption{Shows the gating-value in the NALU layer and a variant that uses NAU/NMU instead of $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$. Separate/shared refers to the weights in $\mathrm{NAC}_{+}$/$\mathrm{NAC}_{\bullet}$ used in NALU.}
\label{fig:simple-function-static-nalu-gate-graph}
\end{figure}

@@ -162,43 +162,43 @@ \subsection{Gating convergence experiment}
\subsection{Regularization}
\label{sec:appendix:simple-function-task:regualization}

The $\lambda_{start}$ and $\lambda_{end}$ are simply selected based on how much time it takes for the model to converge. The sparsity regularizer should not be used during early optimization, as this part of the optimization is simply about getting each weight on the right side of $\pm 0.5$.
The $\lambda_{\mathrm{start}}$ and $\lambda_{\mathrm{end}}$ parameters are selected based on how long it takes for the model to converge. The sparsity regularizer should not be applied during early optimization, as this part of the optimization is exploratory and concerns finding the right solution by getting each weight on the right side of $\pm 0.5$.

In figure \ref{fig:simple-fnction-static-regularizer-add}, \ref{fig:simple-fnction-static-regularizer-sub}, and \ref{fig:simple-fnction-static-regularizer-mul} the scaling factor $\hat{\lambda}_{\mathrm{sparse}}$ is optimized.
In figures \ref{fig:simple-fnction-static-regularizer-add}, \ref{fig:simple-fnction-static-regularizer-sub}, and \ref{fig:simple-fnction-static-regularizer-mul} the scaling factor $\hat{\lambda}_{\mathrm{sparse}}$ is optimized, where the sparsity regularizer is scaled according to
\begin{equation}
\lambda_{\mathrm{sparse}} = \hat{\lambda}_{\mathrm{sparse}} \max(\min(\frac{t - \lambda_{\mathrm{start}}}{\lambda_{\mathrm{end}} - \lambda_{\mathrm{start}}}, 1), 0)
\end{equation}
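For reference, this warm-up schedule can be written directly as a small Python sketch (the variable names are ours):

def sparsity_scale(t, lambda_hat, lambda_start, lambda_end):
    # linearly ramps the sparsity regularizer from 0 at iteration lambda_start
    # up to lambda_hat at iteration lambda_end, and keeps it constant afterwards
    ramp = (t - lambda_start) / (lambda_end - lambda_start)
    return lambda_hat * max(min(ramp, 1.0), 0.0)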

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 1.3cm 0 0},clip]{results/simple_function_static_regualization_add.pdf}
\caption{Shows effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NAU, on the arithmetic dataset for the $\bm{+}$ operation.}
\caption{Shows the effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NAU on the arithmetic dataset for the $\bm{+}$ operation.}
\label{fig:simple-fnction-static-regularizer-add}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 1.3cm 0 0},clip]{results/simple_function_static_regualization_sub.pdf}
\caption{Shows effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NAU, on the arithmetic dataset for the $\bm{-}$ operation.}
\caption{Shows the effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NAU on the arithmetic dataset for the $\bm{-}$ operation.}
\label{fig:simple-fnction-static-regularizer-sub}
\end{figure}

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth,trim={0 1.3cm 0 0},clip]{results/simple_function_static_regualization_mul.pdf}
\caption{Shows effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NMU, on the arithmetic dataset for the $\bm{\times}$ operation.}
\caption{Shows the effect of $\hat{\lambda}_{\mathrm{sparse}}$ in NMU on the arithmetic dataset for the $\bm{\times}$ operation.}
\label{fig:simple-fnction-static-regularizer-mul}
\end{figure}

\subsection{Comparing all models}
\label{sec:appendix:comparison-all-models}

Table \ref{tab:function-task-static-defaults-all} compares all models on all operations used in NALU \cite{trask-nalu}. All variations of model and operation, are trained for 100 different seeds. Some noteworthy observations are:
Table \ref{tab:function-task-static-defaults-all} compares all models on all operations used in NALU \cite{trask-nalu}. All combinations of model and operation are trained for 100 different seeds to build confidence intervals. Some noteworthy observations are:

\begin{enumerate}
\item Division does not work for any model, including the $\mathrm{NAC}_{\bullet}$ and NALU models. This may seem surprising but is actually in line with the results from the NALU paper (\citet{trask-nalu}, table 1), where there is a large error given the interpolation range. The extrapolation range has a smaller error, but this is an artifact of their evaluation method where they normalize with a random baseline. Since a random baseline with have a higher error for the extrapolation range, a similar error will appear to be smaller. A correct solution to division should have both a small interpolation and extrapolation error.
\item Division does not work for any model, including the $\mathrm{NAC}_{\bullet}$ and NALU models. This may seem surprising but is actually in line with the results from the NALU paper (\citet{trask-nalu}, table 1), where there is a large error given the interpolation range. The extrapolation range has a smaller error, but this is an artifact of their evaluation method, where they normalize with a random baseline. Since a random baseline will have a higher error for the extrapolation range, a similar error will appear to be smaller. A correct solution to division should have both a small interpolation and a small extrapolation error.
\item $\mathrm{NAC}_{\bullet}$ and NALU are barely able to learn $\sqrt{z}$, with just 2\% success-rate for NALU and 7\% success-rate for $\mathrm{NAC}_{\bullet}$.
\item NMU is fully capable of learning $z^2$. It learns this by learning the same subset twice in the NAU layer, which is also how $\mathrm{NAC}_{\bullet}$ learns $z^2$.
\end{enumerate}

\input{results/function_task_static_all.tex}
\input{results/function_task_static_all.tex}
2 changes: 1 addition & 1 deletion paper/sections/results.tex
@@ -64,7 +64,7 @@ \subsubsection{Evaluating theoretical claims}
\end{figure}

\subsection{Product of sequential MNIST}

\label{section:results:cumprod_mnist}
To investigate whether a deep neural network can be learned when backpropagating through an arithmetic layer, the arithmetic layers are used as a recurrent unit applied to a sequence of MNIST digits, where the target is to fit the cumulative product. This task is similar to the ``MNIST Counting and Arithmetic Tasks'' in \citet{trask-nalu}\footnote{The same CNN is used, \url{https://github.com/pytorch/examples/tree/master/mnist}.}, but uses multiplication rather than addition. Each model is trained on sequences of length 2, and then tested on sequences of up to 20 MNIST digits.

We define the success-criterion by comparing the MSE of each model with that of a baseline model that has a correct solution for the arithmetic layer. If the MSE of a model is less than the upper 1\% MSE-confidence-interval of the baseline model, then the model is considered successfully converged.
