Gradient Computation

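The gradient of the cross-entropy loss function turns out to be simple and elegant. For a single data sample with label $y$ and prediction $a$, write the loss as $E = -[y \ln a + (1 - y) \ln(1 - a)]$. First, we apply the chain rule:

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_j}$$

where $a = \sigma(z)$ and $z = \mathbf{w} \cdot \mathbf{x}$.

The derivative of $E$ with respect to $a$ is just:

$$\frac{\partial E}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}$$

The derivative of the sigmoid function with respect to $z$ is just:

$$\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z)) = a(1 - a)$$

The derivative of $z$, the dot product, with respect to any weight $w_j$ is just:

$$\frac{\partial z}{\partial w_j} = x_j$$

Putting those three derivatives into the chain rule, the $a(1 - a)$ factors cancel:

$$\frac{\partial E}{\partial w_j} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) \cdot x_j = (a - y) x_j$$

Adding the summation over all the data samples, the derivative of the error with respect to any weight is:

$$\frac{\partial E}{\partial w_j} = \sum_{i} (a^{(i)} - y^{(i)}) x_j^{(i)}$$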
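
As a sanity check on this derivation, the following minimal sketch compares the analytic gradient with a finite-difference estimate. It assumes a model with no bias term, sigmoid predictions computed from the dot product of weights and inputs, and a cross-entropy loss summed (not averaged) over samples. The function names, the random data, and the step size `eps` are all illustrative choices, not something taken from the text.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    # Cross-entropy loss summed over all samples (assumed form of the loss).
    a = sigmoid(X @ w)
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def gradient(w, X, y):
    # Analytic gradient: for each weight j, sum over samples of
    # (prediction - label) * x_j, written as one matrix product.
    a = sigmoid(X @ w)
    return X.T @ (a - y)

def numerical_gradient(w, X, y, eps=1e-6):
    # Central finite differences, one weight at a time.
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        loss_plus = cross_entropy(w + step, X, y)
        loss_minus = cross_entropy(w - step, X, y)
        grad[j] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

# Small made-up dataset: 5 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w = rng.normal(size=3)

analytic = gradient(w, X, y)
numeric = numerical_gradient(w, X, y)
print(np.allclose(analytic, numeric, atol=1e-5))
```

If the derivation holds, the two gradients should agree to within numerical tolerance. Writing the sum over samples as `X.T @ (a - y)` is only a vectorized form of the per-weight summation. An explicit loop over samples and weights would give the same values.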
