Neural Networks: Backward Propagation

Suppose we have M training examples $x^{(m)}, m = 1, .. M$ with known outcomes ("labels") $y^{(m)}, m = 1, .. M$. Each label y is a K-vector with all zeros except one 1 at the index k when the example belongs to the class k. Using the forward propagation we compute the predictions $h_\Theta(x^{(m)}), m = 1, .. M$. The measure for the deviation for one example $m$ is based on the logistic regression cost that is summed over all K classes:

$$ C^{(m)}(\Theta) = - \sum_{k=1}^K \left[y_k^{(m)} \log (h_\Theta (x^{(m)})_k) + (1 - y_k^{(m)})\log (1 - h_\Theta(x^{(m)})_k)\right] \tag{1} $$

The total cost is the average cost of all examples plus a regularization term $R(\Theta)$:

$$ \begin{align} C(\Theta)& = \frac{1}{M} \sum_{m=1}^M C^{(m)}(\Theta) + R(\Theta) \tag{2} \\ R(\Theta)& = \frac{\lambda}{2M} \sum_{i,j,l;j \ne 0} (\Theta_{ij}^{(l)})^2 \tag{3} \end{align} $$

Our goal is to find the parameters $\Theta$ that minimize the cost. For this purpose we need to compute the partial derivatives of $C(\Theta)$ with respect to all parameters $\Theta_{ij}^{(l)}$:

$$ D_{ij}^{(l)} = \dfrac{\partial}{\partial \Theta_{ij}^{(l)}}C(\Theta) \tag{4} $$

First we will neglect the regularization term. Since the total cost is a linear function (sum) of the costs of the individual examples we need only to compute the partial derivatives for a specific example $x^{(m)}$. In the following we will drop the superscript $(m)$ to simplify the notation.

Notice that a change of $\Theta_{ij}^{(l)}$ propagates to the next level only to $z_i^{(l+1)}$. So, the chain rule for derivation gives us:

$$ \begin{align} \dfrac{\partial C}{\partial \Theta_{ij}^{(l)}} & = \dfrac{\partial C}{\partial z_i^{(l+1)}} \dfrac{\partial z_i^{(l+1)}}{\partial \Theta_{ij}^{(l)}} \\ & = \delta^{(l+1)}_i \dfrac{\partial z_i^{(l+1)}}{\partial \Theta_{ij}^{(l)}} \tag{5} \\ & = \delta^{(l+1)}_i a_j^{(l)} ; \text{ where: } i \ge 1, j \ge 0, l \lt L \end{align} $$

where we introduced the auxiliary terms:

$$\delta^{(l)}_i = \dfrac{\partial C}{\partial z_i^{(l)}} \tag{6}$$

These terms can be computed backward starting with the output layer L. The cost for one example can be written as:

$$ C = - \sum_{k=1}^K \left[y_k \log (a_k^{(L)}) + (1 - y_k)\log (1 - a_k^{(L)})\right] \tag{7} $$

We also need the derivative of the sigmoid function $g(z)$:

$$ g'(z) = \dfrac {d g(z)} {d z} = g(z) (1 - g(z)) \tag{8} $$

Now we can compute $\delta^{(L)}_i$ for the output layer as follows:

$$ \begin{align} \delta^{(L)}_i & = \dfrac{\partial C}{\partial z_i^{(L)}} \\ & = \dfrac{\partial C}{\partial a_i^{(L)}} \dfrac{\partial a_i^{(L)}}{\partial z_i^{(L)}} \\ & = \dfrac{\partial C}{\partial a_i^{(L)}} g'(z_i^{(L)}) \\ & = \left[ - \dfrac{y_i}{a_i^{(L)}} + \dfrac{1 - y_i}{1 - a_i^{(L)}} \right] a_i^{(L)} (1 - a_i^{(L)}) \\ & = a_i^{(L)} - y_i \tag{9} \end{align} $$

The equation (9) can be written in vector form:

$$ \delta^{(L)} = a^{(L)} - y \tag{10} $$

For inner layers $l, l \lt L $ we use the multivariate chain rule ($i \ge 1$):

$$ \begin{align} \delta^{(l)}_i & = \dfrac{\partial C}{\partial z_i^{(l)}} \\ & = \sum_{j \ge 1} \dfrac{\partial C}{\partial z_j^{(l+1)}} \dfrac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} \\ & = \sum_{j \ge 1} \delta_j^{(l+1)} \dfrac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} \tag{11} \end{align} $$

Because

$$ \begin{align} z_j^{(l+1)} & = \sum_{k \ge 0} \Theta_{jk}^{(l)} a_k^{(l)} \\ & = \Theta_{j0} + \sum_{k \ge 1} \Theta_{jk}^{(l)} g(z_k^{(l)}) \tag{12} \end{align} $$

we get (i ≥ 1, j ≥ 1):

$$ \dfrac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} = \Theta_{ji}^{(l)} g'(z_i^{(l)}) \tag{13} $$

Substituing the expression (13) into (11) we get (i ≥ 1):

$$ \delta^{(l)}_i = \sum_{j \ge 1} \Theta_{ji}^{(l)} \delta_j^{(l+1)} g'(z_i^{(l)}) \tag{14} $$

The formula (14) can be written in vector form as:

$$ [0; \delta^{(l)}] = (\Theta^{(l)})^T \delta^{(l+1)} \ .*\ [0; g'(z^{(l)})] \tag{15} $$

Here we used the Hadamard product ".*" of two vectors that is defined as the vector whose elements are the products of the elements of the respective vectors ("elementwise product"). We also had to add the first zero elements to $\delta^{(l)}$ and $g'(z^{(l)})$ to cope with the fact that the summation in (14) starts with $j = 1$ whereas the matrix $\Theta^{(l)}$ has a first column $\Theta_{i0}^{(l)}$.

We have derived all ingredients to compute the partial derivatives $\partial C^{(m)} / \partial \Theta_{ij}^{(l)}$. The computation of the derivatives of the regularization term $R(\Theta)$ is trivial:

$$ \dfrac{\partial R(\Theta)}{\partial \Theta_{ij}^{(l)}} = \begin{cases} \dfrac{\lambda}{M} \Theta_{ij}^{(l)} , & \text {if } j \ne 0 \tag{16} \\ 0 , & \text {if j = 0} \end{cases} $$

Algorithm

Set up the matrices $D^{(l)}, l = 1, 2, .., L-1$ that correspond to the partial derivatives (Eq. 4). The dimension of the matrix $D^{(l)}$ is $N_l \times (N_{l-1}+1)$. Initialize all these matrices with 0.
For the training examples $x^{(m)}, m = 1, 2, .., M$ do:
1. Set $a^{(1)} = x^{(1)}$. This is an N-vector.
2. Perform forward propagation to compute $a^{(l)}, l = 2, 3, ..,L$. The vector $a^{(l)}$ has the dimension $N_l$.
3. Compute the difference between the predicted values and the labeled values:
  $\delta^{(L)} = a^{(L)} - y$.
  All these vectors have the dimension $N_L = K$, where K is the number of classes.
4. Compute $\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(1)}$ using (Eq. 16):
  $ [0; \delta^{(l)}] = (\Theta^{(l)})^T \delta^{(l+1)} \ .*\ [0; g'(z^{(l)})] $
  The dimensions are:
  $\dim(\delta^{(l)}) = N_l$
  $\dim((\Theta^{(l)})^T) = (N_l + 1) \times N_{l+1} $
  $\dim(\delta^{(l+1)}) = N_{l+1}$
  $\dim(g'(z^{(l)})) = N_l$
5. Add the partial derivatives of the example $m$ to the accummulated partial derivatives according to (Eq. 5) written in matrix form:
  $D^{(l)} := D^{(l)} + \delta^{(l+1)} (a^{(l)})^T, l = 1, \dots L-1$
End the loop over the M examples.
Compute the average of the accumated values and add the regularization term:
$D^{(l)} := \dfrac{1}{M} D^{(l)} + \dfrac{\lambda}{M} \hat{\Theta}^{(l)} , l = 1, \dots L-1$
The term $\hat{\Theta}^{(l)}$ denotes the matrix $\Theta^{(l)}$ with the first column set to 0.

Neural Networks: Backward Propagation

Algorithm

Links: