Softmax
Softmax is an activation function that converts a vector of logits into a probability-like output. It is generally given in the form

p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

A common variant introduces a temperature parameter T, which scales the logits before exponentiation:

p_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}
Where:
- p_i is the probability for the i-th class
- z_i is the i-th logit (input to softmax)
- T is the temperature parameter
- K is the number of classes
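As a minimal sketch, the temperature softmax can be written in a few lines of NumPy. The function name and the max-subtraction trick for numerical stability are choices made here, not part of the definition above:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Return softmax probabilities for logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    # Subtracting the max logit before exponentiating avoids overflow;
    # the shift cancels in the ratio, so the result is unchanged.
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```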
Key points about this equation:
- The temperature parameter T appears in both the numerator and denominator, dividing each logit z_i.
- As T approaches 0, the distribution becomes more peaked (harder), concentrating most of the probability mass on the largest logit.
- As T increases, the distribution becomes more uniform (softer), spreading probability mass more evenly across all classes.
- When T = 1, this reduces to the standard softmax function.
- The term e^{z_i/T} is equivalent to (e^{z_i})^{1/T}, which shows how the temperature acts as an exponent on the exponential term.
By tuning the temperature parameter, you can control the "peakiness" or "softness" of the output probability distribution, which is useful in applications such as knowledge distillation and sampling from generative models.
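To illustrate, a quick comparison using the softmax_with_temperature sketch above (the specific logits are arbitrary):

```python
logits = [2.0, 1.0, 0.1]

for T in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, T=T)
    print(f"T={T}: {np.round(probs, 3)}")

# Lower T concentrates mass on the largest logit; higher T flattens the
# distribution toward uniform.
```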
Key properties of the softmax function include:
- Normalization: The output values sum to 1, making them interpretable as probabilities.
- Exponentiation: The use of e^{z_i} ensures all outputs are positive.
- Relative scale: Larger input values result in larger probabilities, while preserving the relative ordering of the inputs.
- Non-linear transformation: The softmax function introduces non-linearity, which is crucial for modeling complex relationships in neural networks.
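These properties can be checked numerically, again assuming the softmax_with_temperature sketch above is in scope:

```python
logits = [3.0, -1.0, 0.5]
probs = softmax_with_temperature(logits)

# Normalization: outputs sum to 1 (up to floating-point error).
assert np.isclose(probs.sum(), 1.0)

# Exponentiation: every output is strictly positive.
assert np.all(probs > 0)

# Relative scale: the probabilities preserve the ordering of the logits.
assert np.array_equal(np.argsort(probs), np.argsort(logits))
```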
But is softmax in its present form always the right choice? Here is some research indicating that some modifications may be preferred.