Paper: ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing (arXiv)
Authors: Ziteng Wang, Jianfei Chen, Jun Zhu

Overview

Classical MoE models with the softmax-and-top-K gating mechanism suffer from non-differentiability (top-K is a non-differentiable function), which makes them difficult to train. This paper replaces softmax and top-K entirely with ReLU: a ReLU router selects the experts. ReLU naturally zeros out all negative activations, and the experts corresponding to positive activations are chosen for the forward pass (explained more formally later). The authors use a clever combined sparsity and load-balancing loss to make sure the router has the desired properties.

Preliminaries

Throughout this post: $X \in \mathbb{R}^{T \times D}$, where $T$ is the total number of tokens (context length) and $D$ is the dimension of the residual stream (d_model).

MoE

The MoE layer computes

$$y_t^l = \sum_{e=1}^{E} g_{t,e}^l \, \mathrm{FFN}_e^l(x_t^l)$$

where $E$ is the total number of experts and:

$y_t^l$: the output vector at layer $l$ for the token at position $t$. It is the result of the MoE layer's computation.

$x_t^l$: the input vector to the MoE layer at layer $l$ for the token at position $t$.

$g_{t,e}^l$: the routing weight or importance assigned by the router to the $e$-th expert for the current input $x_t^l$. The router essentially determines to what extent each expert should contribute to the final output.

$\mathrm{FFN}_e^l$: the $e$-th Feed-Forward Network (FFN) expert. $D_{\mathrm{ffn}}$ is the intermediate size of each expert, usually $4 \times D$.
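
To make the notation concrete, here is a minimal PyTorch sketch of a generic MoE layer (class and argument names like `SimpleMoELayer` are illustrative, not from the paper). The `router` argument stands for whichever gating function we plug in, and `gates` plays the role of $g_{t,e}^l$:

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Generic MoE layer: a gate-weighted sum of expert FFN outputs (illustrative sketch)."""
    def __init__(self, d_model, d_ffn, num_experts, router):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])
        self.router = router  # maps (T, d_model) -> gates of shape (T, num_experts)

    def forward(self, x):                       # x: (T, d_model), the x_t vectors
        gates = self.router(x)                  # (T, E), plays the role of g_{t,e}
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Dense loop for clarity; real implementations only dispatch tokens with gate > 0.
            out = out + gates[:, e:e + 1] * expert(x)
        return out                              # y_t = sum_e g_{t,e} * FFN_e(x_t)
```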

Top-K Routing

In Top-K routing the router is defined as:

$$g_{t,e}^l = \mathrm{TopK}\big(\mathrm{Softmax}(W_r^l\, x_t^l)\big)_e$$

where $W_r^l \in \mathbb{R}^{E \times D}$ is the router weight matrix, and $\mathrm{TopK}(\cdot)$ retains the top $k$ values while setting the rest to zero.
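
A minimal sketch of this Top-K router, assuming a plain linear router weight $W_r^l$ (class and variable names are mine, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Softmax + Top-K gating: keep the k largest probabilities per token, zero out the rest."""
    def __init__(self, d_model, num_experts, k):
        super().__init__()
        self.w_r = nn.Linear(d_model, num_experts, bias=False)  # the router weight W_r
        self.k = k

    def forward(self, x):                                 # x: (T, d_model)
        probs = F.softmax(self.w_r(x), dim=-1)            # (T, E), sums to 1 per token
        topk_vals, topk_idx = probs.topk(self.k, dim=-1)
        gates = torch.zeros_like(probs)
        gates.scatter_(-1, topk_idx, topk_vals)           # non-top-k entries stay exactly 0
        return gates                                      # exactly k non-zeros per token
```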

[Figure: Top-K routing vs. ReLU routing]
Analysing differentiability

But the problem with the top-K function is that it is clearly discontinuous, with a jump discontinuity at the $k$-th largest value, as seen in the figure. The jump discontinuity is fully eliminated with ReLU: experts transition between being active and inactive continuously at $0$.

ReMoE

The ReLU routing function is defined as:

$$g_{t,e}^l = \mathrm{ReLU}(W_r^l\, x_t^l)_e$$

with $1 - \frac{k}{E}$ being the target sparsity, where $k$ is the (average) number of active experts.

In regular TopK routing, the Softmax outputs sum to 1, representing the probabilities of selecting each expert. Only the $k$ highest are retained and the rest are eliminated. But in ReMoE, ReLU naturally acts as a gate at the point zero: the outputs of the router represent the weights assigned to each expert, which can include 0. This allows the router to learn which experts to activate (i.e., when to produce 0s) in a fully differentiable manner.

Another key difference between them is that TopK routing is always forced to choose exactly $k$ experts, whereas ReMoE can dynamically choose the number of active experts per token, since it is not hard-coded. This possibly allows more compute to be dedicated to tokens that are difficult to process, as the sketch below illustrates.
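
Here is a corresponding sketch of the ReLU router (again, names are illustrative); note that the number of non-zero gates varies from token to token, unlike Top-K:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURouter(nn.Module):
    """ReLU gating: gates = ReLU(W_r x); a zero gate means the expert is inactive."""
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.w_r = nn.Linear(d_model, num_experts, bias=False)  # the router weight W_r

    def forward(self, x):                    # x: (T, d_model)
        return F.relu(self.w_r(x))           # (T, E); no softmax, no top-k

# The number of active experts is not fixed and can differ per token:
router = ReLURouter(d_model=16, num_experts=8)
gates = router(torch.randn(4, 16))
print((gates > 0).sum(dim=-1))               # e.g. tensor([3, 5, 2, 4]) -- varies by token
```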

But the key question is: how do we regulate sparsity and balance the load between experts?

Controlling Sparsity via L1 Regularization

To regulate sparsity and achieve the desired sparsity of $1 - \frac{k}{E}$, the authors introduce a regularization loss $\mathcal{L}_{\mathrm{reg}}$ along with the language modeling loss $\mathcal{L}_{\mathrm{LM}}$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda_i\, \mathcal{L}_{\mathrm{reg}}$$

where,

$$\lambda_i = \lambda_{i-1}\cdot \alpha^{\,\mathrm{sign}\left(1 - \frac{k}{E} - S_{i-1}\right)}$$

$\lambda_0$ and $\alpha > 1$ are hyperparameters chosen at the beginning, and from then on $\lambda_i$ is adaptively changed at every step.

$S_i$ denotes the average sparsity of all router outputs at step $i$:

$$S_i = 1 - \frac{1}{TLE}\sum_{l=1}^{L}\sum_{t=1}^{T}\sum_{e=1}^{E}\mathbb{1}\left[g_{t,e}^l > 0\right]$$

where $L$ is the number of MoE layers. The key intuition: $T \cdot L \cdot E$ is the total number of expert activations possible in a forward pass, i.e. the maximum, while $\sum_{l,t,e}\mathbb{1}[g_{t,e}^l > 0]$ is the actual number of active experts in that forward pass. Their ratio is the fraction of active experts, so one minus this ratio, $S_i$, denotes the average sparsity.
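
As a sketch, $S_i$ can be measured directly as the fraction of inactive gates, assuming `gates_per_layer` holds the router outputs $g^l$ of one batch, one tensor of shape (T, E) per layer:

```python
import torch

def average_sparsity(gates_per_layer):
    """S_i: fraction of zero gates over all layers, tokens, and experts in the batch."""
    total = sum(g.numel() for g in gates_per_layer)              # T * L * E entries in total
    active = sum((g > 0).sum().item() for g in gates_per_layer)  # actually active experts
    return 1.0 - active / total                                  # 1 - active ratio = sparsity
```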


Back to the update rule $\lambda_i = \lambda_{i-1}\cdot \alpha^{\,\mathrm{sign}(1 - k/E - S_{i-1})}$:
$\mathrm{sign}(1 - k/E - S_{i-1})$ is positive if the measured sparsity $S_{i-1}$ is lower than the desired sparsity $1 - k/E$ (there are more active experts than desired), and $\lambda_i$ is increased by a factor of $\alpha$.
Similarly, it is negative if $S_{i-1}$ is greater than $1 - k/E$ (there are fewer active experts than desired), so $\lambda_i$ is reduced by a factor of $\alpha$.


The regularization term uses the $L_1$-norm of the router outputs (which are non-negative, so the norm is simply their sum):

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{TL}\sum_{l=1}^{L}\sum_{t=1}^{T}\sum_{e=1}^{E} g_{t,e}^l$$

With this $\mathcal{L}_{\mathrm{reg}}$ and the adaptive $\lambda_i$, we can control the sparsity around the desired level of $1 - \frac{k}{E}$. A key implication is that, on average, ReMoE routes each token to $k$ experts per layer, maintaining the same FLOPs as a regular TopK MoE. So the model has full control over how many experts to activate for an individual token, as long as the average number of active experts stays at the desired level. Indeed, we see that the number of active experts varies across different layers.
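
Putting the pieces together, here is a rough sketch of how the $L_1$ regularization and the adaptive $\lambda_i$ update could sit in a training step (it reuses `average_sparsity` from the sketch above; hyperparameter values are placeholders, not the paper's settings):

```python
import torch

def l1_reg(gates_per_layer):
    """L_reg = 1/(T*L) * sum over layers, tokens, experts of the (non-negative) gates."""
    return torch.stack([g.sum(dim=-1).mean() for g in gates_per_layer]).mean()

def update_lambda(lmbda, sparsity, target_sparsity, alpha=1.2):
    """lambda_i = lambda_{i-1} * alpha ** sign(target_sparsity - measured_sparsity)."""
    diff = target_sparsity - sparsity
    sign = (diff > 0) - (diff < 0)   # +1 if too dense, -1 if too sparse, 0 if exactly on target
    return lmbda * (alpha ** sign)

# Inside a (hypothetical) training step, with gates_per_layer collected from the forward pass:
# loss = lm_loss + lmbda * l1_reg(gates_per_layer)
# loss.backward(); optimizer.step()
# lmbda = update_lambda(lmbda, average_sparsity(gates_per_layer), target_sparsity=1 - k / E)
```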

Integrating Load Balancing into Regularization

To address load balancing, the authors modify the regularization term introduced above into a weighted $L_1$-norm:

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{TL}\sum_{l=1}^{L}\sum_{t=1}^{T}\sum_{e=1}^{E} r_{l,e}\, g_{t,e}^l, \qquad r_{l,e} = \frac{\frac{1}{T}\sum_{t=1}^{T}\mathbb{1}\left[g_{t,e}^l > 0\right]}{k/E}$$

$r_{l,e}$ is treated as a non-differentiable constant and represents the average activation ratio of expert $e$ in layer $l$, relative to the desired ratio $\frac{k}{E}$. This mechanism penalizes experts receiving more tokens by driving their router outputs toward zero more rapidly.
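
A sketch of this load-balancing-aware regularization, following the reconstruction above; the per-expert weight `r` is detached so it acts as the non-differentiable constant $r_{l,e}$:

```python
import torch

def balanced_l1_reg(gates_per_layer, k):
    """Weighted L1 regularization: experts that currently serve more tokens are penalized harder."""
    terms = []
    for g in gates_per_layer:                          # g: (T, E) router outputs of one layer
        E = g.shape[-1]
        active_ratio = (g > 0).float().mean(dim=0)     # fraction of tokens hitting each expert
        r = (active_ratio / (k / E)).detach()          # r_{l,e}: ratio relative to desired k/E
        terms.append((g * r).sum(dim=-1).mean())       # (1/T) * sum_t sum_e r_e * g_{t,e}
    return torch.stack(terms).mean()                   # average over layers -> 1/(T*L) overall
```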

Three Stages of ReMoE Training

Authors observe three stages during training.

[Figure: The three stages of ReMoE training]

The first stage is the warm-up stage, or the dense stage. During this stage, the regularization term $\lambda_i \mathcal{L}_{\mathrm{reg}}$ is small, while the language modeling loss $\mathcal{L}_{\mathrm{LM}}$ is large and decreases rapidly. Training at this stage is nearly equivalent to training the dense counterpart with the same total number of parameters. Each expert processes more than half of the tokens, allowing the experts to diversify from their random initializations.

The second stage is the sparsifying stage, or the dense-to-sparse stage. At this point, the sparse regularization term becomes significant, causing the routers to activate fewer experts. This forces the experts to become more diverse without causing an increase in the language modeling loss $\mathcal{L}_{\mathrm{LM}}$.

The third stage is the stable stage, or the sparse stage. In this phase, the sparsity stabilizes at the preset target. During this stage, $\mathcal{L}_{\mathrm{LM}}$ is optimized while being softly guided along the sparse subspace by the regularization. Both $\lambda_i$ and $\mathcal{L}_{\mathrm{reg}}$ change very slowly, with $\mathcal{L}_{\mathrm{reg}}$ gradually decreasing and $\lambda_i$ gradually increasing. However, the overall regularization term, $\lambda_i \mathcal{L}_{\mathrm{reg}}$, remains relatively constant.

Results

[Figure: Results of ReMoE]
[Figure: Scaling results of ReMoE]

The authors perform extensive testing. They provide many details, so the results should be fairly easy to replicate.

Discussion

Dynamic expert allocation

The authors claim that the model dynamically allocates compute, activating fewer experts when predicting common tokens and more when predicting rare ones.

[Figure: Dynamic allocation of compute in ReMoE]

Role of load balancing

[Figure: Load balancing in ReMoE]

The white boxes are the experts that were activated by fewer tokens than a small threshold. It's interesting how few experts are needed in the earlier layers compared to the later layers.

[Figure: Sparsity across layers in ReMoE]

The average sparsity stays around the desired level, but the earlier layers are far sparser than the later layers.

Domain specialised experts

[Figure: Domain specialisation in ReMoE]

Also, the experts in ReMoE are more domain-specialised when compared to a regular TopK-routed MoE.