Paper: arXiv
Author: Ken M. Nakanishi
Date of Publication: 31st January 2025

Overview

Softmax, as used in scaled dot-product attention, causes attention scores to flatten as the context length increases. This reduces the model’s ability to prioritise key information in the context and to generalise to longer contexts not seen during training. To fix this, the author modifies softmax and introduces Scalable-Softmax (SSMax), defined as:

$$\mathrm{SSMax}(z)_i = \frac{n^{s z_i}}{\sum_{j=1}^{n} n^{s z_j}},$$

where $n$ is the context length (the size of the input vector) and $s$ is a learnable scaling parameter.
Results show that SSMax performs much better at information-retrieval tasks and also generalises better to longer contexts, even without being trained on them.

Problem with Softmax

Softmax transforms an input vector $z$ of size $n$ into a vector that can be interpreted as a probability distribution: all elements are non-negative and sum to one.
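Concretely, the standard softmax over an input vector $z$ of size $n$ is:

$$\mathrm{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \dots, n.$$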

In the attention layers of a transformer, the input vector size $n$ grows with the context length. Softmax plays a critical role in computing attention scores over all tokens in the context, determining how much ‘attention’ is allocated to each token. As $n$ grows, the denominator of the softmax increases while the numerator remains independent of $n$. As a result, the output distribution becomes increasingly flat, which the author calls attention fading.
This reduces the model’s ability to focus on key tokens in the context and also hurts generalisation to longer contexts.
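A quick way to see attention fading numerically (my own illustrative sketch, not from the paper): keep the logits in a fixed range and watch the maximum softmax weight shrink as $n$ grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
for n in [16, 256, 4096, 65536]:
    z = rng.uniform(0.0, 1.0, size=n)  # logits confined to a fixed range
    z[0] = 1.0                         # one "key" token with the largest score
    p = softmax(z)
    print(f"n={n:6d}  max attention weight = {p[0]:.4f}")
# The weight on the key token decays roughly like 1/n: attention fading.
```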

Scalable-Softmax (SSMax)

SSMax is defined as:

$$\mathrm{SSMax}(z)_i = \frac{n^{s z_i}}{\sum_{j=1}^{n} n^{s z_j}} = \frac{e^{(s \log n)\, z_i}}{\sum_{j=1}^{n} e^{(s \log n)\, z_j}}.$$

Like softmax, SSMax transforms the input vector into a probability distribution, as is clear from the definition. The key difference is that the base of the exponential depends on the input vector size $n$. This design mitigates attention fading, so the attention scores remain focused on the key tokens even as the context grows.
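To make the contrast concrete, here is an illustrative sketch (mine, not the paper’s code) comparing the weight that softmax and SSMax put on the most relevant token as $n$ grows, using $s = 0.5$ purely as an example value.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s):
    n = len(z)
    # SSMax(z) = Softmax(s * log(n) * z): the exponential base becomes n**s
    return softmax(s * np.log(n) * z)

rng = np.random.default_rng(0)
s = 0.5  # example value for the learnable scaling parameter
for n in [16, 256, 4096, 65536]:
    z = rng.uniform(0.0, 1.0, size=n)
    z[0] = 5.0  # one clearly relevant token
    print(f"n={n:6d}  softmax: {softmax(z)[0]:.4f}   SSMax: {ssmax(z, s)[0]:.4f}")
# Softmax's weight on the relevant token fades as n grows; SSMax's stays near 1.
```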
The author provides a nice justification for this design.

Rationale Behind the Design of SSMax

To investigate the optimal variant of softmax, the author replaced softmax at every layer and head with the following function:

$$z \;\mapsto\; \frac{e^{(a_{l,h}\, p_n + b_{l,h})\, z_i}}{\sum_{j=1}^{n} e^{(a_{l,h}\, p_n + b_{l,h})\, z_j}},$$

where $a_{l,h}$ and $b_{l,h}$ are learnable parameters unique to each layer $l$ and head $h$, and $p_n$ is a set of learnable parameters shared across all layers and heads that depends solely on the input vector size $n$; $p_n$ is learned for each $n$ up to $n_{\text{train}}$, the context length used during training. The author trains a model with this function replacing softmax in scaled dot-product attention.
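A minimal single-head sketch of this exploratory parametrization, based on my reconstruction above (the names `n_train`, `a`, `b`, `p` are assumptions, not the paper’s code):

```python
import torch
import torch.nn as nn

class LearnedScaleAttentionScores(nn.Module):
    """Attention scores computed with exponent (a * p_n + b) * z instead of plain z."""

    def __init__(self, n_train: int):
        super().__init__()
        self.a = nn.Parameter(torch.ones(1))   # scale, per layer/head in the paper
        self.b = nn.Parameter(torch.zeros(1))  # bias, per layer/head in the paper
        # p_n: one parameter per input size n = 1..n_train
        # (shared across all layers and heads in the paper; only one head here).
        self.p = nn.Parameter(torch.ones(n_train))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., n) pre-softmax attention logits for one query
        n = z.shape[-1]
        scale = self.a * self.p[n - 1] + self.b
        return torch.softmax(scale * z, dim=-1)

# Example: attention scores over a context of 128 tokens
scores = LearnedScaleAttentionScores(n_train=1024)(torch.randn(128))
```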

Results of training with $p_n$

After training, the learned $p_n$ followed a logarithmic relationship of the form

$$p_n \approx \alpha \log n + \beta.$$

This finding suggested that the softmax in the attention mechanism could benefit from being reformulated as

$$z \;\mapsto\; \frac{e^{(s_{l,h} \log n + b_{l,h})\, z_i}}{\sum_{j=1}^{n} e^{(s_{l,h} \log n + b_{l,h})\, z_j}},$$

where $s_{l,h}$ and $b_{l,h}$ are layer- and head-specific learnable parameters; $b_{l,h}$ is referred to as the bias. Based on further evaluation (as we will see in Results), omitting the bias turns out to be better. With $b_{l,h} = 0$ the exponent becomes $(s_{l,h} \log n)\, z_i$, so the function reduces to $n^{s_{l,h} z_i} / \sum_j n^{s_{l,h} z_j}$, and thus the author arrived at SSMax.

Justification for the Design of SSMax

Let $z$ be an input vector of size $n$. Let $z_{\max}$, $z_{2\mathrm{nd}}$ and $z_{\min}$ denote its maximum, second-maximum and minimum elements, respectively. Let $k$ be the index of the maximum element.

When $z$ is processed by softmax, its maximum element is transformed as

$$\mathrm{Softmax}(z)_k = \frac{e^{z_{\max}}}{\sum_{j=1}^{n} e^{z_j}}.$$

We can replace the denominator by $e^{z_{\max}} + (n-1)\, e^{z_{\min}}$ (every non-maximal term is at least $e^{z_{\min}}$) to obtain an upper bound:

$$\mathrm{Softmax}(z)_k \le \frac{e^{z_{\max}}}{e^{z_{\max}} + (n-1)\, e^{z_{\min}}}.$$

Multiplying and dividing the RHS by $e^{-z_{\max}}$ we get

$$\mathrm{Softmax}(z)_k \le \frac{1}{1 + (n-1)\, e^{-(z_{\max} - z_{\min})}}.$$

As we can clearly see, as long as the values of $z$ stay within a bounded range, this upper bound goes to zero as $n \to \infty$, so the maximum element of the output vector produced by softmax approaches zero.





On the other hand, when $z$ is processed by SSMax, its maximum element is transformed as

$$\mathrm{SSMax}(z)_k = \frac{n^{s z_{\max}}}{\sum_{j=1}^{n} n^{s z_j}}.$$

Assuming $s > 0$ (and $n \ge 2$), every non-maximal term in the denominator is at least $n^{s z_{\min}}$, so we can obtain an upper bound for the RHS:

$$\mathrm{SSMax}(z)_k \le \frac{n^{s z_{\max}}}{n^{s z_{\max}} + (n-1)\, n^{s z_{\min}}} = \frac{1}{1 + (n-1)\, n^{-s (z_{\max} - z_{\min})}}.$$

Similarly, since every non-maximal term is at most $n^{s z_{2\mathrm{nd}}}$, we can obtain a lower bound:

$$\mathrm{SSMax}(z)_k \ge \frac{n^{s z_{\max}}}{n^{s z_{\max}} + (n-1)\, n^{s z_{2\mathrm{nd}}}} = \frac{1}{1 + (n-1)\, n^{-s (z_{\max} - z_{2\mathrm{nd}})}}.$$

Now we have

$$\frac{1}{1 + (n-1)\, n^{-s (z_{\max} - z_{2\mathrm{nd}})}} \;\le\; \mathrm{SSMax}(z)_k \;\le\; \frac{1}{1 + (n-1)\, n^{-s (z_{\max} - z_{\min})}}.$$

The maximum element output by SSMax exhibits the following properties (note that $(n-1)\, n^{-s\Delta} \approx n^{1 - s\Delta}$ for a gap $\Delta$):

If $z_{\max} - z_{2\mathrm{nd}} > 1/s$, the term $(n-1)\, n^{-s(z_{\max} - z_{2\mathrm{nd}})}$ vanishes as $n \to \infty$, so the lower bound approaches $1$ and the output of SSMax approaches $1$. Meaning attention is focused on the element with the highest value.
If $z_{\max} - z_{\min} < 1/s$, the term $(n-1)\, n^{-s(z_{\max} - z_{\min})}$ grows without bound, so the upper bound approaches $0$ and the output of SSMax approaches $0$. Meaning attention is distributed across all the elements.

Thus, SSMax ensures that attention is focused on elements whose values exceed the others by approximately $1/s$ or more, while distributing attention when all values lie within a range of approximately $1/s$.
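A small numerical check of this $1/s$ threshold (an illustrative sketch of mine, not the paper’s code):

```python
import numpy as np

def ssmax(z, s):
    n = len(z)
    w = np.exp(s * np.log(n) * (z - z.max()))  # n**(s*z), shifted for stability
    return w / w.sum()

s = 0.5                   # example scaling parameter, so the threshold 1/s = 2.0
n = 100_000
base = np.zeros(n)

focused = base.copy()
focused[0] = 3.0          # gap 3.0 > 1/s  -> attention concentrates
spread = base.copy()
spread[0] = 1.0           # gap 1.0 < 1/s  -> attention stays diffuse

print("gap > 1/s:", ssmax(focused, s)[0])   # close to 1
print("gap < 1/s:", ssmax(spread, s)[0])    # close to 0
```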

We can very easily convert softmax attention to SSMax attention by scaling the pre-softmax attention logits by $s \log n$, since $\mathrm{SSMax}(z) = \mathrm{Softmax}\big((s \log n)\, z\big)$.
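As a sketch of what that looks like in practice, here is a minimal PyTorch implementation of mine, assuming one learnable $s$ per head and using the per-query effective context size under causal masking; details may differ from the paper’s implementation.

```python
import math
import torch
import torch.nn as nn

class SSMaxCausalAttention(nn.Module):
    """Causal scaled dot-product attention with Scalable-Softmax (SSMax)."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable scaling parameter s per head, initialised to 1.
        self.s = nn.Parameter(torch.ones(num_heads))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        d, n = q.shape[-1], k.shape[-2]
        logits = q @ k.transpose(-2, -1) / math.sqrt(d)
        # SSMax(z) = Softmax(s * log(n) * z). Under causal masking, query i only
        # sees i + 1 keys, so its effective input size is i + 1.
        sizes = torch.arange(1, n + 1, device=q.device, dtype=logits.dtype)
        logits = logits * self.s.view(1, -1, 1, 1) * sizes.log().view(1, 1, n, 1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
        logits = logits.masked_fill(mask, float("-inf"))
        return torch.softmax(logits, dim=-1) @ v

# Example usage
q = k = v = torch.randn(2, 4, 128, 64)   # batch=2, heads=4, seq=128, head_dim=64
out = SSMaxCausalAttention(num_heads=4)(q, k, v)
```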

Results

Results of SSMax

The loss during training. The author reports that SSMax also achieves better (lower) loss during regular training.


Results of SSMax being used in extended context lengths

The grey dotted line marks the context window at which the models were trained. The models’ context length was extended by simply increasing the base $\theta$ of RoPE, with no additional training.
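For context, this is roughly what that knob does (an illustrative sketch of mine; the $\theta$ values below are placeholders, not the paper’s settings): RoPE rotates pairs of query/key dimensions at frequencies derived from a base $\theta$, and a larger $\theta$ slows the rotations so positions further apart remain distinguishable.

```python
import numpy as np

def rope_frequencies(head_dim: int, theta: float) -> np.ndarray:
    """Inverse frequencies used by RoPE: theta ** (-2i / head_dim)."""
    return theta ** (-np.arange(0, head_dim, 2) / head_dim)

# Larger base theta -> slower rotations -> usable at longer context lengths
# (placeholder values for illustration only).
for theta in (10_000.0, 500_000.0):
    freqs = rope_frequencies(64, theta)
    print(f"theta={theta:>9.0f}  slowest rotation period ~ {2 * np.pi / freqs[-1]:,.0f} tokens")
```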


Results of SSMax in Needle in a Haystack
Results when evaluated on Needle-in-a-haystack.

Overall, results look very good with barely any extra learnable parameters. Big if true.

Thoughts

I really liked the paper. The intuition behind SSMax makes sense. The results look pretty good too, with almost zero additional compute cost, and it is very easy to implement. Big if true.
Combining this with differential attention may improve Needle-in-a-Haystack results even further. It would be nice to try that out.