Paper: Autonomy-of-Experts (arXiv)
Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
Overview
In a traditional MoE there is an explicit router/gating mechanism that processes every token and routes it to a few selected experts. The idea is that, instead of using all of the model's parameters, having a small set of experts process each token massively reduces compute cost at inference. In this paper, the authors argue that this router may be unaware of the capabilities of the individual experts, so relying on it may not select the best experts for a given token. Rather than a dedicated router, they let the experts themselves decide which of them should process the token. But any such mechanism would need the experts' own weights to make the selection, while the whole point of MoE is to keep the active parameter count low at inference. To deal with this, the authors low-rank factorize each expert's weight matrix and compute the L2 norm of the expert's (low-rank) activations. They argue that the higher this norm, the better the expert is at processing the token, and they select experts based on it.
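For contrast, here is a minimal sketch of the conventional learned router that AoE removes. This is my own illustration with made-up names and sizes, not the paper's code: a linear gate scores every expert from the token alone, and the top-K scores decide where the token goes.

```python
# Minimal sketch (not from the paper) of a conventional MoE router:
# a learned linear gate scores every expert, top-K scores pick the experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 512, 8, 2  # illustrative sizes

class Router(nn.Module):
    def __init__(self):
        super().__init__()
        # The router never looks inside the experts; it only sees the token.
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                       # x: (batch, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, expert_ids = scores.topk(top_k, dim=-1)
        return weights, expert_ids              # which experts process each token

x = torch.randn(4, d_model)
weights, expert_ids = Router()(x)
print(expert_ids)  # e.g. tensor([[3, 5], [0, 2], ...])
```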
How do experts work?
In this paper, an expert is essentially a standard FFN block, $E(x) = W_{\text{down}}\,\sigma(W_{\text{up}}\,x)$ (ignoring gating/bias details).
The authors cite Geva et al. (2021), according to which such an expert can be interpreted as a key-value memory network: the input is projected into a "key" vector $\sigma(W_{\text{up}}\,x)$, and this "key" vector retrieves knowledge stored in the parameters through a key-value matching mechanism, i.e. multiplication by $W_{\text{down}}$.
So if the expert can handle the token effectively, the "key" vector must be highly activated, i.e. its norm must be big. This is the main principle behind their work in this paper.
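In symbols (my notation, not necessarily the paper's), the key-value reading of an expert looks like this:

```latex
% Key-value memory view of an FFN expert (notation is mine, not the paper's).
% The "key" activation k weights the "value" vectors stored in the columns of W_down.
\[
E(x) = W_{\text{down}}\,\sigma(W_{\text{up}}\,x),
\qquad
k := \sigma(W_{\text{up}}\,x),
\qquad
E(x) = \sum_{j} k_j\,\big(W_{\text{down}}\big)_{:,\,j}.
\]
```

A large $\|k\|_2$ therefore means the token strongly activates the values stored in that expert, which is exactly the signal the authors route on.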
To test this hypothesis, they ignore the gating mechanism of a regular MoE and simply select, for each token, the top-K experts whose key activations have the highest L2 norm. The performance drop was small, and the models still did pretty well.
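A minimal sketch of what that sanity check could look like, assuming simple ReLU experts and made-up sizes (my reconstruction, not the authors' code). Note that every expert is still fully up-projected here, so this only tests the routing signal, not efficiency:

```python
# Router-free MoE: run every expert's up-projection, keep the top-K experts
# whose "key" activation has the largest L2 norm, and only those contribute.
import torch
import torch.nn as nn

d_model, d_ff, n_experts, top_k = 512, 1024, 8, 2  # illustrative sizes

W_up   = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)
W_down = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)

def norm_routed_moe(x):                                        # x: (batch, d_model)
    keys = torch.relu(torch.einsum("efd,bd->bef", W_up, x))    # (batch, experts, d_ff)
    norms = keys.norm(dim=-1)                                  # (batch, experts)
    _, topk_ids = norms.topk(top_k, dim=-1)                    # most activated experts
    out = torch.zeros_like(x)
    for b in range(x.shape[0]):                                # per-token loop for clarity
        for e in topk_ids[b]:
            out[b] += W_down[e] @ keys[b, e]                   # value retrieval per selected expert
    return out

print(norm_routed_moe(torch.randn(4, d_model)).shape)  # torch.Size([4, 512])
```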
Autonomy-of-Experts (AoE)
The authors low-rank factorize each expert's $W_{\text{up}}$ into $W_{\text{up}} = W_B\,W_A$, where $W_A \in \mathbb{R}^{d_{\text{low}} \times d}$ is a low-rank projection and $W_B \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{low}}}$ is an up-projection, with $d_{\text{low}} \ll d_{\text{ff}}$.
So now each expert looks like $E(x) = W_{\text{down}}\,\sigma(W_B\,W_A\,x)$.
We can think of the forward pass within an expert as two distinct steps (sketched in code below):
Step 1: Every expert computes its activation only up to $z = W_A\,x$, and these low-rank activations are cached for all experts. The top-K experts whose cached $z$ has the highest L2 norm are selected.
Step 2: The selected top-K experts continue the forward pass from the cached activations and finish the expert computation: $E(x) = W_{\text{down}}\,\sigma(W_B\,z)$.
The cached activations of the unselected experts are dropped.
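Here is a minimal sketch of the two steps; the names W_A, W_B, d_low and the per-token loop are my own illustrative choices, not the paper's exact notation or implementation:

```python
# AoE-style routing: Step 1 computes cheap low-rank activations for all experts
# and routes by their L2 norm; Step 2 finishes the forward pass only for the top-K.
import torch
import torch.nn as nn

d_model, d_ff, d_low, n_experts, top_k = 512, 1024, 128, 8, 2  # illustrative sizes

# W_up is factorized as W_B @ W_A: W_A is the low-rank projection,
# W_B projects back up to the full hidden size d_ff.
W_A    = nn.Parameter(torch.randn(n_experts, d_low, d_model) * 0.02)
W_B    = nn.Parameter(torch.randn(n_experts, d_ff, d_low) * 0.02)
W_down = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)

def aoe_layer(x):                                      # x: (batch, d_model)
    # Step 1: every expert computes (and caches) only its low-rank activation.
    z = torch.einsum("eld,bd->bel", W_A, x)            # (batch, experts, d_low)
    _, topk_ids = z.norm(dim=-1).topk(top_k, dim=-1)   # route by L2 norm, no gate

    # Step 2: only the selected experts finish the forward pass from the cache.
    out = torch.zeros_like(x)
    for b in range(x.shape[0]):
        for e in topk_ids[b]:
            h = torch.relu(W_B[e] @ z[b, e])           # reconstruct the full key activation
            out[b] += W_down[e] @ h                    # value retrieval / down-projection
    return out                                         # unselected caches are simply discarded

print(aoe_layer(torch.randn(4, d_model)).shape)  # torch.Size([4, 512])
```

Note that Step 1 still runs for every expert, which is where the extra cached-activation memory and routing compute discussed below come from.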
This already seems like a bad idea due to (i) the increase in memory from caching activations for every expert, and (ii) the increase in compute spent on routing.
(i) Their best-performing model takes up 16% more memory than a regular MoE. The authors just brush it off lol.
(ii) To make the comparison fair, they also scale up the compute of the regular gating mechanism in their baseline MoE. AoE still performs a bit better.
They do extensive testing by varying the dimensions of the low-rank projection matrix $W_A$ and the up-projection matrix $W_B$. They also do ablation studies with various expert-selection techniques other than simple top-K.
They also run many experiments on load balancing, comparing models trained with a load-balancing loss against models trained without one.
They do provide extensive details about their experiments, and it should be fairly easy to reproduce.
Results
Meh. They weren't impressive.
I don't think the extra 1 percent of accuracy is worth a 16% increase in memory usage.
They also analyse the load distribution across all experts and compare it with the baseline MoE. AoE does just a tad better than a regular MoE, and the load is distributed a bit more uniformly.
Thoughts
Pretty cool paper. But I'm not that impressed by the performance gains when compared to the extra VRAM needed.