Paper: arXiv
Authors: Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, Shi Li
Date of Publication: 29th January 2025

Overview

This paper introduces a new transformer architecture based on the Differential Transformer. The authors claim that a major drawback of the Differential Transformer is that the rows of its attention matrix aren’t normalised (they don’t sum up to one), which causes numerical instability. They also claim that the Differential Transformer lacks global context modeling (they don’t explain why or how, they just declare it). To solve these issues they come up with a new architecture involving an ‘integration’ mechanism that apparently fixes both.

Differential Attention

The original Differential Transformer is a very good paper (a must read; I’ll maybe make a post and link it here in the future). I’ll briefly recap the architecture here. The authors of that paper introduce the differential attention mechanism: they calculate two attention matrices instead of one and subtract one attention map from the other (they take the difference, hence the name Differential Transformer). They came up with this as a way to reduce noise in the attention matrix. The intention was that, by allowing the model to directly subtract one attention matrix from another, it can learn to cancel out the unnecessary noise. It was pretty popular at the time.

Specifically, given an input X, it is projected to Q, K, V matrices as usual. But the query and key projections are split in two: we get 2 query matrices (Q1, Q2) and 2 key matrices (K1, K2), and we calculate 2 different attention matrices that get subtracted:

  DiffAttn(X) = (softmax(Q1 K1^T / sqrt(d)) − λ · softmax(Q2 K2^T / sqrt(d))) V

  λ = exp(λ_q1 · λ_k1) − exp(λ_q2 · λ_k2) + λ_init

where λ_q1, λ_k1, λ_q2, λ_k2 are learnable vectors and λ_init is a constant used for initialization.
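As a minimal sketch of the above (single head, no causal mask; the function name, shapes, and weight layout here are my own, not the paper’s):

import math
import torch
import torch.nn.functional as F

def diff_attn(X, W_q, W_k, W_v, lam):
    # W_q and W_k each project to two groups; chunk splits them into Q1/Q2 and K1/K2
    Q1, Q2 = (X @ W_q).chunk(2, dim=-1)
    K1, K2 = (X @ W_k).chunk(2, dim=-1)
    V = X @ W_v
    s = 1.0 / math.sqrt(Q1.shape[-1])                        # scaling factor 1/sqrt(d)
    A1 = F.softmax(Q1 @ K1.transpose(-1, -2) * s, dim=-1)    # first attention map
    A2 = F.softmax(Q2 @ K2.transpose(-1, -2) * s, dim=-1)    # second ("noise") attention map
    return (A1 - lam * A2) @ V                               # subtract the two maps, then apply to V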


(Figure: Differential Attention)

GroupNorm (essentially a per-head LayerNorm) is applied to each head’s output (I’ll go a bit more in depth later while explaining DINT). A fixed multiplier (1 − λ_init) is used after GroupNorm, which aligns the gradient flow with the standard Transformer. Then there’s a linear layer after multi-head Differential Attention to project back to d_model.
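A minimal sketch of that output path, using an RMS-style norm as a stand-in for the per-head GroupNorm and treating λ_init as a given constant (all names here are mine):

import torch

def head_norm(x, eps=1e-5):
    # stand-in for the per-head GroupNorm (RMS-style normalization over the feature dim)
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def project_heads(head_outputs, W_o, lam_init):
    # head_outputs: list of per-head DiffAttn outputs, each of shape (n, d_head)
    heads = [(1 - lam_init) * head_norm(h) for h in head_outputs]   # fixed multiplier after GroupNorm
    return torch.cat(heads, dim=-1) @ W_o                           # concat heads, project back to d_model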

DINT Attention

We had differential attention with no differential equations. Now we have differential integral attention with neither differential equations nor integration.

(Figure: DINT Attention)

(Figure: the DintAttn algorithm)

In DINT, we calculate the two attention matrices of Differential Attention as usual. Let’s call them A1 and A2.

The ‘integral’ component of DINT computes the average attention scores of A1’s columns. Essentially, it builds a row vector a3 where each element is the average value of the corresponding column of A1.

We just repeat the row vector a3 n times and stack the copies to create the n × n matrix A3.

We set the final attention matrix as A = λ·A3 + A1 − λ·A2 (the output is then A·V). With this combination, the rows of A are normalised and sum up to 1.
To be honest, I don’t get why this works, but it just works: the rows are normalised and sum up to 1. I have checked it numerically.
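Here is that numeric check as a minimal sketch (scalar λ assumed; the comments spell out the row-sum arithmetic):

import torch

n, lam = 5, 0.7
A1 = torch.softmax(torch.randn(n, n), dim=-1)   # rows of A1 each sum to 1
A2 = torch.softmax(torch.randn(n, n), dim=-1)   # rows of A2 each sum to 1
a3 = A1.mean(dim=0, keepdim=True)               # column averages of A1; its entries also sum to 1
A3 = a3.expand(n, n)                            # repeat the row vector n times
A = lam * A3 + A1 - lam * A2                    # row sums: lam*1 + 1 - lam*1 = 1
print(A.sum(dim=-1))                            # a vector of ones, up to floating-point error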

The authors claim that A3 captures the globally important features.

Multi-head DINT Attention

In multi-head DINT (pseudocode below), λ is shared between all the heads (that is the case in differential attention too), W_o is a learnable projection matrix, and RMS Norm is used for each head.

Even after headwise normalization, there is a group norm for more stable training.

import math
import torch
import torch.nn.functional as F

def DintAttn(X, W_q, W_k, W_v, lam):                         # lam = λ, shared across heads
    Q1, Q2 = (X @ W_q).chunk(2, dim=-1)                      # split the query projection into two groups
    K1, K2 = (X @ W_k).chunk(2, dim=-1)                      # split the key projection into two groups
    V = X @ W_v                                              # value matrix
    s = 1.0 / math.sqrt(Q1.shape[-1])                        # scaling factor 1/sqrt(d)
    A1 = F.softmax(Q1 @ K1.transpose(-1, -2) * s, dim=-1)    # attention map 1
    A2 = F.softmax(Q2 @ K2.transpose(-1, -2) * s, dim=-1)    # attention map 2
    A3 = A1.mean(dim=-2, keepdim=True).expand_as(A1)         # column-average of A1, repeated for every row
    return (lam * A3 + A1 - lam * A2) @ V                    # combined attention map (rows sum to 1) times V

import torch

def GroupNorm(x, eps=1e-5):
    # per-head normalization; an RMS-style norm over the feature dim, per the description above
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def MultiHeadDINT(X, W_q, W_k, W_v, W_o, lam):
    h = W_q.shape[0]                                         # number of heads (weights stacked head-wise)
    O = []                                                   # outputs of each head
    for i in range(h):                                       # iterate through heads
        W_qi, W_ki, W_vi = W_q[i], W_k[i], W_v[i]            # the i-th head's projection slices
        O_i = GroupNorm(DintAttn(X, W_qi, W_ki, W_vi, lam))  # DINT attention for the i-th head, then GroupNorm
        O.append(O_i)
    return torch.cat(O, dim=-1) @ W_o                        # concatenate head outputs and project with W_o
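A quick shape sanity check for the two functions above; the dimensions and the head-wise stacking of the weight tensors are my own assumptions, not something the paper specifies:

import torch

torch.manual_seed(0)
n, h, d_model = 8, 4, 64
d_head = d_model // h

X = torch.randn(n, d_model)
W_q = torch.randn(h, d_model, 2 * d_head)   # two query groups per head
W_k = torch.randn(h, d_model, 2 * d_head)   # two key groups per head
W_v = torch.randn(h, d_model, d_head)       # one value group per head
W_o = torch.randn(h * d_head, d_model)      # output projection back to d_model

out = MultiHeadDINT(X, W_q, W_k, W_v, W_o, lam=0.5)
print(out.shape)                            # torch.Size([8, 64])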

DINT Transformer Layer

Pretty standard transformer decoder layer: RMS Norm is used instead of LayerNorm, SwiGLU is used as the activation function in the feed-forward block, and of course there’s DINT attention.
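A rough sketch of one such layer, assuming pre-norm residual blocks (the layout, the RMSNorm implementation, and the SwiGLU sizing are my assumptions; only “RMS norm + SwiGLU + DINT attention” comes from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

class DINTLayer(nn.Module):
    def __init__(self, d_model, d_ff, attn):
        super().__init__()
        self.attn = attn                                      # any callable implementing multi-head DINT attention
        self.norm1 = RMSNorm(d_model)                         # RMS norm instead of LayerNorm
        self.norm2 = RMSNorm(d_model)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)    # SwiGLU gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)      # SwiGLU up projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)    # projection back to d_model

    def forward(self, x):
        x = x + self.attn(self.norm1(x))                                # DINT attention block with residual
        h = self.norm2(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))  # SwiGLU feed-forward block with residual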

Results

As per the authors, this new architecture beats the Differential Transformer on every single benchmark. They test on context-based tasks like needle-in-a-haystack, in-context learning, summarization, and question answering. They evaluate scalability by comparing the loss of DINT against the Differential Transformer as parameters and tokens are scaled, and they analyse the attention matrices.

Pretty much on every single metric it beats the Differential Transformer. Read the paper for specific details.