so what is softmax? softmax is an activation function often used in the final layer of a multi class neural network. it basically tell us given an example what is the probability of it belonging to each class.that is P(y=j|x)
so for instance if we have a dog, a cat and a cow classes and we give it an example of a cat it will give us a probabilty like this [.1,.6,.3]. which is intutive for us, it’s 60% sure it’s a cat, 30% sure it’s a cow and 10% sure it’s a dog. which is okay, maybe the cat is alittle fat, no judging.
now that we have established that its output is intuitive for us to understand and that it’s mathematically convient? let’s deep dive of how it works
Softmax is defined as:
\[\text{Softmax}(z_{i}) = \frac{e^{z_i}}{\sum_j e^{z_j}}\]hmmm.math. how does it work? Don’t worry we will break it up in a minute
remember that typically in a layer, we compute the linear part \(Z^{[L]}\) then we apply the activation function. \(Z^{[L]}= W^{[L]}.a^{L-1}+b^L\) let’s say it looks like this \(Z{[L]}=\begin{bmatrix} 5\\\ 2 \\\ -1 \\\ 3 \end{bmatrix}\)
to apply the activation we do the following
take the elementwise exponent of the vector \(\begin{bmatrix} e^5\\\ e^2 \\\ e^-1 \\\ e^3 \end{bmatrix} =\begin{bmatrix} 148.4 \\\ 7.4 \\\ .4 \\\ 20.1 \end{bmatrix}\)
and then devide by the sum of the exponentiated vector (which is 176.3) \(a^{[L]}=\begin{bmatrix} .842 \\\ .042 \\\ .002 \\\ .114 \end{bmatrix}\)
now let’s make sure of our results using tensorflow
inputs = tf.constant([[5, 2 ,-1 , 3 ]],dtype=float)
outputs = tf.keras.activations.softmax(inputs)
print(f'output of softmax function: {outputs}')
output of softmax function: [[0.84203357 0.04192238 0.00208719 0.11395685]]
hooray. now we have this nice probablities that we understand well. but you might be wondering, i get the normalization part but why do we have to exponentiate first? that’s because the exponent ensures that
- large numbers are mapped to larger outputs and small numbers to smaller output
- probabilites can’t be negative so it gets rid of them