Backpropagation: Derivation (Matrix Form)
Input is the column vector $x_0$
Layer 1 weight matrix is $W_1$
Layer 1 output is $x_1 = f_1(W_1 x_0)$, where $f_1$ is the activation function for layer 1
There are 4 layers (including the input layer), so there are three weight matrices $W_1, W_2, W_3$
Hence, network output is $x_3 = f_3(W_3x_2)$
Assuming the MSE loss function, with $t$ as the target vector: $E = \frac{1}{2}\|x_3 - t\|_2^2$
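The setup above can be sketched numerically. This is a minimal NumPy sketch, not from the notes: the layer sizes, the random initialization, and the choice of $\tanh$ for every activation are all illustrative assumptions; only the depth and the equations $x_k = f_k(W_k x_{k-1})$, $E = \frac{1}{2}\|x_3 - t\|_2^2$ come from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths are illustrative assumptions; the derivation only fixes the depth.
sizes = [4, 5, 5, 3]  # x0 -> x1 -> x2 -> x3
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

f = np.tanh  # assumed activation for every layer
x0 = rng.standard_normal((sizes[0], 1))   # input column vector
t = rng.standard_normal((sizes[-1], 1))   # target vector

# Forward pass: x_k = f_k(W_k x_{k-1})
x1 = f(W[0] @ x0)
x2 = f(W[1] @ x1)
x3 = f(W[2] @ x2)

# E = 1/2 * ||x3 - t||_2^2
E = 0.5 * np.sum((x3 - t) ** 2)
print(E)
```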
$$
\begin{align*}
\frac{\partial E}{\partial W_3} &= (x_3 - t)\frac{\partial x_3}{\partial W_3} \\
&= [(x_3 - t) \circ f_3'(W_3 x_2)] \frac{\partial (W_3 x_2)}{\partial W_3} \\
&= [(x_3 - t) \circ f_3'(W_3 x_2)]\, x_2^T \\
\textrm{Let } \delta_3 &= (x_3 - t) \circ f_3'(W_3 x_2) \textrm{, so} \\
\frac{\partial E}{\partial W_3} &= \delta_3 x_2^T
\end{align*}
$$
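The output-layer result $\frac{\partial E}{\partial W_3} = \delta_3 x_2^T$ can be checked against a finite difference. A sketch assuming a $\tanh$ activation for $f_3$ and small illustrative dimensions (both are assumptions, not part of the notes); `*` is NumPy's elementwise product, playing the role of $\circ$:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh                               # assumed activation f_3
fprime = lambda z: 1.0 - np.tanh(z) ** 2  # its derivative

W3 = rng.standard_normal((3, 5))
x2 = rng.standard_normal((5, 1))
t = rng.standard_normal((3, 1))

x3 = f(W3 @ x2)
delta3 = (x3 - t) * fprime(W3 @ x2)  # Hadamard product, as in the derivation
grad_W3 = delta3 @ x2.T              # dE/dW3 = delta3 x2^T

# Central finite difference on one entry of W3
E = lambda W: 0.5 * np.sum((f(W @ x2) - t) ** 2)
eps, i, j = 1e-6, 1, 2
Wp = W3.copy(); Wp[i, j] += eps
Wm = W3.copy(); Wm[i, j] -= eps
numeric = (E(Wp) - E(Wm)) / (2 * eps)
print(abs(numeric - grad_W3[i, j]))  # close to zero
```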
$$
\begin{align*}
\frac{\partial E}{\partial W_2} &= (x_3 - t)\frac{\partial x_3}{\partial W_2} \\
&= [(x_3 - t) \circ f_3'(W_3 x_2)] \frac{\partial (W_3 x_2)}{\partial W_2} \\
&= \delta_3 \frac{\partial (W_3 x_2)}{\partial W_2} \\
&= W_3^T \delta_3 \frac{\partial x_2}{\partial W_2} \\
&= [(W_3^T \delta_3) \circ f'_2(W_2 x_1)] \frac{\partial (W_2 x_1)}{\partial W_2} \\
\textrm{Let } \delta_2 &= (W_3^T \delta_3) \circ f'_2(W_2 x_1) \textrm{, so} \\
\frac{\partial E}{\partial W_2} &= \delta_2 x_1^T
\end{align*}
$$
$$
\begin{align*}
\frac{\partial E}{\partial W_1} &= (x_3 - t)\frac{\partial x_3}{\partial W_1} \\
&= \delta_2 \frac{\partial (W_2 x_1)}{\partial W_1} \\
&= W_2^T \delta_2 \frac{\partial f_1(W_1 x_0)}{\partial W_1} \\
&= [(W_2^T \delta_2) \circ f'_1(W_1 x_0)] \frac{\partial (W_1 x_0)}{\partial W_1} \\
\textrm{Let } \delta_1 &= (W_2^T \delta_2) \circ f'_1(W_1 x_0) \textrm{, so} \\
\frac{\partial E}{\partial W_1} &= \delta_1 x_0^T
\end{align*}
$$
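Putting the three results together: each $\delta_k$ is obtained from $\delta_{k+1}$ by one multiplication with $W_{k+1}^T$ and one Hadamard product, and each weight gradient is the outer product $\delta_k x_{k-1}^T$. The sketch below verifies the full recursion numerically; $\tanh$ activations, the layer widths, and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
f = np.tanh                           # assumed activation for all layers
fp = lambda z: 1.0 - np.tanh(z) ** 2  # its derivative

sizes = [4, 5, 5, 3]
W1, W2, W3 = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x0 = rng.standard_normal((4, 1))
t = rng.standard_normal((3, 1))

def forward(W1, W2, W3):
    x1 = f(W1 @ x0); x2 = f(W2 @ x1); x3 = f(W3 @ x2)
    return x1, x2, x3

x1, x2, x3 = forward(W1, W2, W3)

# Backward pass: exactly the boxed results of the derivation
delta3 = (x3 - t) * fp(W3 @ x2)
delta2 = (W3.T @ delta3) * fp(W2 @ x1)
delta1 = (W2.T @ delta2) * fp(W1 @ x0)
g3 = delta3 @ x2.T   # dE/dW3
g2 = delta2 @ x1.T   # dE/dW2
g1 = delta1 @ x0.T   # dE/dW1

# Central finite difference on one entry of W1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
Ep = 0.5 * np.sum((forward(Wp, W2, W3)[2] - t) ** 2)
Em = 0.5 * np.sum((forward(Wm, W2, W3)[2] - t) ** 2)
print(abs((Ep - Em) / (2 * eps) - g1[0, 0]))  # close to zero
```

Note the shapes confirm the outer-product form: each gradient $g_k$ has exactly the shape of $W_k$.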