
Deep dive into mixture of experts

Sandra Baker

Introduction

While Mistral's Mixtral 8x7B model has been making headlines for several months, we're admittedly a bit late to the discussion. However, with the recent announcement of Pixtral, Mistral's first multimodal model, it feels like the perfect time to revisit the basics. In this article, we'll explore MoE from its early foundations to its current relevance in the AI world, particularly focusing on Mixtral 8x7B, a breakthrough in large-scale, efficient model design.

MoE's resurgence in the deep learning landscape highlights its ability to handle large, complex tasks while optimizing computational resources. It has become a key enabler for developing scalable and adaptive models in natural language processing (NLP), computer vision, and beyond. Our goal here is to break down the concepts of MoE in a way that's accessible to students and AI enthusiasts, without sacrificing the technical depth needed to truly understand its impact.

And who knows? In a future article, we might dive into Pixtral AI, exploring how Mistral is taking MoE into the world of multimodal learning.

History of Mixture of Experts

Origins

The history of Mixture of Experts (MoE) traces back to the early 1990s and spans key developments in machine learning and deep learning. Below is a chronological overview of the most significant milestones:

1991: Initial Concept by Jacobs, Jordan, Nowlan, and Hinton

Paper: Adaptive Mixtures of Local Experts

The concept of Mixture of Experts was introduced by researchers Robert Jacobs, Michael I. Jordan, Steven Nowlan, and Geoffrey Hinton. They were among the first to propose this divide-and-conquer strategy in machine learning, where complex tasks are divided into subtasks. In their approach, multiple "expert" models were trained to handle specific parts of the data or problem, and a gating network was used to determine which expert was most suitable for each input.

This marked the foundation of the MoE framework.

Key Concepts of Mixture of Experts:

  • Input: The system receives input data, which could be anything from images to text, depending on the task.
  • Expert Networks: Each expert network is a separate model that specializes in solving certain parts of the problem. (In practice, experts tend to learn fine-grained patterns rather than whole domains, so calling them "experts" has sometimes been seen as misleading.)
  • Gating Network: The gating network is crucial. It doesn't solve the problem itself but decides which expert should handle the current input. It does this by looking at the input and assigning probabilities or weights to each expert based on how well it thinks that expert can solve the task. The gating network can "learn" over time which experts are better for which types of input.
  • Stochastic One-Out-of-N Selector: Based on the output of the gating network, one of the experts is chosen randomly, but with a bias toward the expert that the gating network believes is best. The selection is not purely random; it is guided by the probabilities produced by the gating network. (A code sketch of these pieces follows this list.)
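To make these pieces concrete, here is a minimal PyTorch sketch of an MoE layer (the class name, sizes, and expert architecture are illustrative, not taken from the 1991 paper). For simplicity it blends every expert's output by the gate probabilities instead of stochastically sampling a single expert as in the original formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal dense mixture-of-experts layer: every expert processes the
    input, and the gating network decides how much each output counts."""

    def __init__(self, dim_in, dim_out, num_experts=4, hidden=64):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(), nn.Linear(hidden, dim_out))
            for _ in range(num_experts)
        )
        # The gating network maps the input to one score per expert.
        self.gate = nn.Linear(dim_in, num_experts)

    def forward(self, x):                              # x: (batch, dim_in)
        weights = F.softmax(self.gate(x), dim=-1)      # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim_out)
        # Combine the expert outputs, weighted by the gate's probabilities.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```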

2016: Revival of MoE in Deep Learning

Paper: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

By 2016, Mixture of Experts (MoE) had evolved, thanks to work by researchers such as Eigen, Ranzato, and Sutskever, from standalone models into integrated components of larger deep networks. Instead of being treated as separate models, MoEs were now layers within deep neural networks, where the experts could be thought of as complex neurons. While this integration increased the power of the models, it also added considerable complexity. To address this, Conditional Computation, a concept championed by Yoshua Bengio, was employed. This allowed only certain parts of the neural network to be activated based on the input, minimizing unnecessary computation.

To achieve this, the authors devised a mechanism that avoids triggering the entire set of experts for every input. The gating function G(x) selects only the most relevant experts by driving the gate values of the remaining experts to zero, significantly reducing the computational burden. But this raises an important question: how does the network decide which experts should be activated?

The answer lies in Noisy Top-k Gating, implemented through the following steps:

The function KeepTopK(v,k) selects only the top k elements from a set of values v, where the top k experts are retained, and the rest are set to negative infinity. This ensures that only the most relevant experts are passed through while ignoring the rest.

Next, a noisy gating mechanism H(x) is introduced. This function incorporates noise into the gating process to add an element of randomness, preventing the model from repeatedly selecting the same experts. It has two components: the first term, (x · W_g)_i, is the clean score produced by the learned gating weights, while the second term adds Gaussian noise whose scale is controlled by a Softplus transformation of a second learned projection. This noise encourages a more balanced selection of experts across inputs.

Finally, the G(x) function applies Softmax on the top k experts chosen by the KeepTopK function. This converts the selected values into probabilities, determining how much each expert should contribute to the final computation. By only activating the top experts, the model ensures computational efficiency and sparsity.
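Putting the three steps together, the gating equations from the paper look roughly as follows (reconstructed here because the original figures are not reproduced in this version):

```latex
G(x) = \mathrm{Softmax}\bigl(\mathrm{KeepTopK}(H(x),\, k)\bigr)

H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\bigl((x \cdot W_{\mathrm{noise}})_i\bigr)

\mathrm{KeepTopK}(v, k)_i =
\begin{cases}
  v_i & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\
  -\infty & \text{otherwise}
\end{cases}
```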

To further mitigate the risk of overusing certain experts once the top k experts are selected through the gating mechanism, an auxiliary term was added to the loss function. This term penalizes the model when specific experts are selected too frequently, encouraging the gating function G(x) to distribute the incoming data more evenly across all experts. It works by tracking the accumulated gating values over each batch of data. If the distribution becomes skewed, with some experts receiving significantly more weight than others, the loss increases, pushing the model toward a more balanced workload.
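A sketch of such a balancing term, in the spirit of the paper's importance loss (the function name and the weight value are illustrative):

```python
import torch

def importance_aux_loss(gate_probs, weight=0.01):
    """Penalise an uneven distribution of gate mass across experts by taking
    the squared coefficient of variation of the per-expert 'importance'
    (the summed gate values over a batch).

    gate_probs: (num_tokens, num_experts) output of the gating softmax.
    """
    importance = gate_probs.sum(dim=0)                             # per-expert total gate mass
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return weight * cv_squared
```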

2021: Switch Transformer – Scaling MoE for NLP

Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

The Sparsely Gated Mixture of Experts (MoEs) made significant strides by introducing sparsity and selective computation, but they also brought new complexities. As researchers aimed to scale these models beyond hundreds of billions of parameters, especially given the demonstrated success of increasing parameter count in Transformers, they faced issues such as instabilities during training, high communication overheads, and imbalanced expert workloads. To address these, Switch Transformers were developed, offering a more efficient approach that allowed models to scale up to trillion-parameter sizes.

In traditional MoE models, each input activated two experts, leading to increased computation and communication overheads as the models grew larger. The Switch Transformer, however, simplified this process by activating only one expert per input, hence the name "Switch."

This simplification brought several advantages:

  • Lower computation costs
  • Reduced communication overhead
  • Simplified routing mechanisms

One of the challenges faced by sparse models is the uneven workload across experts, where some experts receive too many inputs while others remain idle. To mitigate this, the Switch Transformer introduces the concept of expert capacity.

  • Expert capacity refers to the number of tokens an expert can process in one batch.
  • The capacity factor is a tunable parameter, typically set between 1 and 1.25, which allows each expert to process slightly more tokens than the average (see the worked example below).
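To make the arithmetic concrete, here is an illustrative calculation (the batch size and expert count are made-up numbers, not values from the paper):

```python
# Each expert may accept slightly more than its "fair share" of tokens.
tokens_per_batch = 4096
num_experts = 8
capacity_factor = 1.25

expert_capacity = int((tokens_per_batch / num_experts) * capacity_factor)
# 4096 / 8 = 512 tokens per expert on average, so capacity = 640 tokens.
```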

At the core of the Switch Transformer is its gating mechanism, which routes input tokens to experts. For each token, the router computes a score for each expert, calculated as the dot product between the token representation and that expert's learned router weights. The expert with the highest score is then chosen to process the token.

When an expert reaches its capacity, any additional tokens are either routed to the next layer via residual connections or dropped. This ensures that no expert is overloaded, improving overall efficiency.
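A simplified sketch of this top-1 routing with a capacity check is shown below (function and parameter names are illustrative, and the real implementation dispatches tokens across devices rather than looping over experts):

```python
import torch
import torch.nn.functional as F

def switch_route(tokens, router_weights, experts, expert_capacity):
    """Simplified Switch-style routing: each token goes to its single
    highest-scoring expert, and tokens beyond an expert's capacity simply
    pass through unchanged (i.e. via the residual path).

    tokens:         (num_tokens, d_model)
    router_weights: (d_model, num_experts) learned router parameters
    experts:        list of callables, one per expert
    """
    logits = tokens @ router_weights                # dot-product scores
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)            # top-1 expert per token

    output = tokens.clone()                         # default: pass through
    for e, expert in enumerate(experts):
        idx = torch.nonzero(expert_idx == e).squeeze(-1)
        idx = idx[:expert_capacity]                 # tokens past capacity are not processed
        if idx.numel() > 0:
            # Scale the expert output by its gate value.
            output[idx] = gate[idx].unsqueeze(-1) * expert(tokens[idx])
    return output
```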

To further balance the workload, the Switch Transformer introduces a load-balancing auxiliary loss. This additional loss penalizes the model when certain experts are overused and others underused, encouraging a more even distribution of tokens across all experts.
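A minimal sketch of such an auxiliary loss, following the form described in the Switch Transformer paper (the coefficient alpha and the helper name are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_probs, expert_idx, num_experts, alpha=0.01):
    """Auxiliary loss of the form alpha * N * sum_i(f_i * P_i), where f_i is
    the fraction of tokens routed to expert i and P_i is the mean router
    probability assigned to expert i. It is smallest when both are uniform.

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    expert_idx:   (num_tokens,) integer index of the chosen expert per token.
    """
    dispatch = F.one_hot(expert_idx, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)   # f_i
    mean_probs = router_probs.mean(dim=0)      # P_i
    return alpha * num_experts * (tokens_per_expert * mean_probs).sum()
```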

In addition to addressing load balancing, the Switch Transformer also optimizes training efficiency. Training large models at full 32-bit precision is computationally expensive, both in terms of memory and processing time. To optimize efficiency, it employs selective precision during training, reducing memory usage without compromising performance.

  • Experts are trained using bfloat16 precision, reducing the memory footprint and computation required.
  • Routers, which handle sensitive computations like exponentiation, are trained with full 32-bit precision to ensure accuracy (a rough sketch of this split follows below).
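Here is what that split might look like in code (a sketch only; the function name is made up and the actual implementation handles casting inside the training framework):

```python
import torch

def router_with_selective_precision(tokens, router_weights):
    """Keep the bulk of the model in bfloat16, but run the router's
    numerically sensitive softmax/exponentiation in float32."""
    # Router path: upcast for the exponentiation inside the softmax.
    logits = tokens.to(torch.float32) @ router_weights.to(torch.float32)
    probs = torch.softmax(logits, dim=-1)
    # Expert path: feed the experts bfloat16 activations to save memory.
    expert_inputs = tokens.to(torch.bfloat16)
    return probs, expert_inputs
```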

This selective application of precision allows Switch Transformers to achieve faster training times and lower memory consumption, without sacrificing the model's final quality or performance.

Before diving into Mixtral, let's briefly recap. MoE models, introduced in the 90s, follow a "divide and conquer" approach, breaking problems into smaller parts. Predictions are made by combining expert outputs, guided by a gating network. A key innovation was Top-K routing, where only the top-K experts are used per token, greatly improving efficiency. The Switch Transformer advanced this further by setting K = 1, selecting just one expert per token, which simplified computation and enabled scaling up to 128 experts for even better performance.

2023: Mistral AI and the Mixtral 8x7B Model

Paper: Mixtral of Experts

The move from Switch Transformers to Mixtral reflects an ongoing pursuit of more efficient and scalable models in the MoE domain. While Switch Transformers simplified the process by selecting just one expert per token (K=1), allowing models to reach unprecedented scales, they also introduced limitations in flexibility. Mixtral addresses these limitations, striking a balance between maintaining the scale of the model and refining how experts handle each token for better performance.

The Mixtral of Experts model builds on the Mistral 7B architecture by replacing each feedforward layer with a set of eight feedforward blocks (experts). While the name might suggest the model behaves like an ensemble of eight separate 7-billion-parameter models, the actual parameter count attributable to each expert is closer to 5.6 billion. This means the total number of parameters amounts to around 45 billion, because the non-expert parameters are shared rather than duplicated for each block.

Together, the experts account for a staggering 45 billion parameters. However, there's a twist: despite this massive figure, the model operates using far fewer parameters for any given token. What truly sets Mixtral apart from Switch Transformers comes down to two critical elements: the expert routing mechanism and the type of experts used.

Expert Routing:

  • Switch Transformer: Routes each token to a single expert (K=1).
  • Mixtral: Routes each token to its top two experts (K=2); see the sketch after this list.
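Here is a rough sketch of that top-2 routing (names are illustrative, not taken from the Mixtral code): the two highest-scoring experts are picked per token, their logits are renormalised with a softmax, and the two expert outputs are mixed with those weights.

```python
import torch
import torch.nn.functional as F

def top2_route(x, gate_weights, experts):
    """Sketch of Mixtral-style top-2 routing.

    x:            (num_tokens, d_model)
    gate_weights: (d_model, num_experts)
    experts:      list of callables (e.g. SwiGLU feed-forward blocks)
    """
    logits = x @ gate_weights                       # (num_tokens, num_experts)
    top_vals, top_idx = logits.topk(2, dim=-1)      # two experts per token
    weights = F.softmax(top_vals, dim=-1)           # softmax over the chosen two only

    out = torch.zeros_like(x)
    for slot in range(2):                           # first and second choice
        for e, expert in enumerate(experts):
            mask = top_idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```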

With only two experts selected per token, just a fraction of those parameters, around 12 to 13 billion, is active at any given time. As for the expert type:

Expert Type:

  • Switch Transformer: Typically uses standard ReLU or GELU activations.
  • Mixtral: Uses SwiGLU as the expert's activation.

Mistral's Mixtral 8x7B leverages SwiGLU (Swish-Gated Linear Unit) as the activation function within its expert layers.

SwiGLU combines the Swish activation function (x * sigmoid(x)) with a GLU (Gated Linear Unit), giving the block stronger non-linear representational capacity. It is also particularly well suited to sparse models like MoE, as it helps maintain numerical stability and improves training dynamics when only a few experts are active at a time.
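A minimal sketch of such a SwiGLU expert block (dimensions and names are illustrative; Mixtral's actual sizes differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """SwiGLU feed-forward block: a SiLU (Swish) branch gates a parallel
    linear branch before the output projection."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gated branch
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # linear branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x):
        # SwiGLU(x) = (SiLU(x W1) * (x W3)) W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Plugging eight such blocks into the top-2 routing sketched earlier gives a rough picture of a single Mixtral MoE layer.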

Conclusion

Mixtral is undeniably promising, and the intrigue only deepens when you consider that no details have been disclosed about the size of the dataset used for pretraining, its composition, or the specific preprocessing techniques employed. This lack of information, far from diminishing its appeal, actually fuels anticipation. It leaves us eagerly awaiting further insights: how did Mixtral achieve such remarkable efficiency and performance without the transparency we're accustomed to in model development? The mystery surrounding its data pipeline suggests there may be innovative methods at play, making us all the more excited to learn what lies beneath the surface when these details are eventually revealed.

References

I want to extend a heartfelt thank you to all the incredible people who have generously shared their work and knowledge online. Your explanations, time, and effort have made complex topics so much more accessible for so many of us. A special thanks to Maarten Grootendorst: your visual guides were nothing short of mind-blowing; they made reading about intricate concepts not only fun but also intuitive, and I couldn't resist borrowing some of them, so I'm giving credit where it's due. For anyone reading this, please go check out Maarten's work; it's incredibly thorough. And if you want even more fantastic visualizations related to LLMs, I highly recommend checking out his book.

