Gradient Computation
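The gradient of the cross-entropy loss function is simple and elegant to compute. For a single sample with input $x$, label $y$, and error $E = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$, we first apply the chain rule:

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_j},$$

where $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $z = w^\top x = \sum_k w_k x_k$.

The derivative of $E$ with respect to $\hat{y}$ is:

$$\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}.$$

The derivative of the sigmoid function with respect to $z$ is:

$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y}).$$

The derivative of $z$, the dot product, with respect to any weight $w_j$ is:

$$\frac{\partial z}{\partial w_j} = x_j.$$

Substituting these three derivatives into the chain rule, the $\hat{y}(1 - \hat{y})$ factors cancel:

$$\frac{\partial E}{\partial w_j} = (\hat{y} - y)\, x_j.$$

Summing over all $N$ data samples, indexed by $i$, the derivative of the total error with respect to any weight is:

$$\frac{\partial E}{\partial w_j} = \sum_{i=1}^{N} \left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}.$$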
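The derivation can be sanity-checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a binary cross-entropy loss summed (not averaged) over the samples, no bias term, and small random data. The function names (`predict`, `loss`, `gradient`, `numerical_gradient`) are illustrative and are not taken from the original text. It compares the analytic gradient, the sum over samples of (prediction minus label) times the input feature, against a central finite-difference estimate.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted probability: sigmoid of the dot product of w and x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def loss(w, X, y):
    """Binary cross-entropy, summed over all samples (assumed convention)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total -= yi * math.log(p) + (1.0 - yi) * math.log(1.0 - p)
    return total


def gradient(w, X, y):
    """Analytic gradient: for each weight j, sum of (prediction - label) * x_j."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict(w, xi) - yi
        for j, xj in enumerate(xi):
            grad[j] += err * xj
    return grad


def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += eps
        down = list(w)
        down[j] -= eps
        grad.append((loss(up, X, y) - loss(down, X, y)) / (2 * eps))
    return grad


# Small synthetic data set: 20 samples, 3 features, random 0/1 labels.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]
y = [float(random.randint(0, 1)) for _ in range(20)]
w = [random.gauss(0, 1) for _ in range(3)]

analytic = gradient(w, X, y)
numeric = numerical_gradient(w, X, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numerical gradient:", max_diff)
```

If the loss is averaged over the samples instead of summed, both gradients pick up the same factor of 1/N, so the comparison still holds.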