For any element $x_{i_1 \dots i_p}$ of a tensor $X \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_p}$ and for a dimension $k$ of size $n_k$, the forward softmax activation layer applies the transform

$$y_{i_1 \dots i_p} = \frac{\exp\left(x_{i_1 \dots i_p}\right)}{\sum_{j=1}^{n_k} \exp\left(x_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_p}\right)}$$
The softmax function is also known as the normalized exponential; see [Bishop2006] for an exact definition.
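The forward transform above can be sketched in NumPy. This is a hedged illustration of the formula, not the library's actual implementation; the function name and signature are my own:

```python
import numpy as np

def softmax_forward(x, k):
    """Softmax of tensor x along dimension k.

    Subtracting the per-slice maximum before exponentiating is a
    standard numerical-stability trick; it leaves the result unchanged
    because the factor cancels between numerator and denominator.
    """
    shifted = x - x.max(axis=k, keepdims=True)
    e = np.exp(shifted)
    # Each slice along dimension k is normalized to sum to 1.
    return e / e.sum(axis=k, keepdims=True)
```

For example, applying it along the last dimension of a 2-D tensor produces rows that each sum to 1.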
The backward softmax layer for dimension $k$ of size $n_k$ computes the value

$$z_{i_1 \dots i_p} = y_{i_1 \dots i_p}\left(g_{i_1 \dots i_p} - \sum_{j=1}^{n_k} g_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_p}\; y_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_p}\right),$$

where $g_{i_1 \dots i_p}$ is the input gradient computed on the preceding layer.
Given $p$-dimensional tensors of size $n_1 \times n_2 \times \dots \times n_p$:

$G = (g_{i_1 \dots i_p})$, the gradient computed on the preceding layer

$Y = (y_{i_1 \dots i_p})$, the output of the forward softmax layer

the problem is to compute the $p$-dimensional tensor $Z = (z_{i_1 \dots i_p})$ of size $n_1 \times n_2 \times \dots \times n_p$ such that

$$z_{i_1 \dots i_p} = y_{i_1 \dots i_p}\left(g_{i_1 \dots i_p} - \sum_{j=1}^{n_k} g_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_p}\; y_{i_1 \dots i_{k-1}\, j\, i_{k+1} \dots i_p}\right).$$
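The backward computation can likewise be sketched in NumPy, assuming $Y$ and $G$ are given as arrays of the same shape. Again a minimal sketch with names of my own choosing, not the library's API:

```python
import numpy as np

def softmax_backward(g, y, k):
    """Gradient of the softmax layer along dimension k.

    Implements z = y * (g - sum_j g_j * y_j), where the sum runs over
    index j of dimension k; this is the vector-Jacobian product of the
    softmax, since its Jacobian is diag(y) - y y^T along that dimension.
    """
    return y * (g - (g * y).sum(axis=k, keepdims=True))
```

As a quick sanity check: for a 1-D slice with $y = (0.5, 0.5)$ and $g = (1, 0)$, the inner sum is $0.5$, giving $z = (0.25, -0.25)$.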