Study of Mixture of Experts (MoE)

AI Summary

MoE is an efficient architecture that selectively activates only a few experts, increasing model capacity while keeping the actual computation roughly constant. By distributing tokens through a gate network, it delivers both performance and cost efficiency, though it raises memory requirements and can suffer from training imbalance. As a result, MoE has established itself as a key architecture that addresses the scalability and efficiency of LLMs simultaneously.

We start with the question, “Does a larger model always mean better performance?”

Throughout the history of deep learning, the simplest yet most powerful scaling law has been “the more parameters, the better the performance.” However, training a massive model like GPT-4 requires months and millions of GPU-hours. Faced with the reality of exponentially increasing computational costs, researchers had to find a paradoxical solution: “increase model capacity while keeping actual computational load constant.” Mixture-of-Experts (MoE) is precisely that solution.

The Concept of MoE

First proposed by Jacobs et al. in 1991, MoE originated from the idea that multiple small models (experts) should each be activated only when needed. When a single model handles all situations, overfitting and interference frequently occur, but by dividing the data into regions and assigning them to different experts, much more efficient learning is possible. In practice, when a specific token (word) is input, the gate network decides, "Let's send this token to experts A and B," while the remaining experts stay idle, preventing an increase in computational load.

The role of the ‘gate’ is simpler than one might think.

The gate network typically consists of a thin linear layer followed by a Softmax function. By retaining only the two highest probability values (Top-2) from the Softmax output and masking the rest to 0, 'sparse routing' is achieved. This way, even with 8 experts, the number of activated parameters stays far below the total. For example, Mixtral 8×7B holds about 46.7 billion parameters in VRAM in total, but when processing a single token, only about 12.9 billion of them participate in the multiplications.
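
The routing described above (thin linear layer, Softmax, keep the Top-2, renormalize) can be sketched in a few lines. This is a toy illustration with made-up sizes and random weights, not Mixtral's real dimensions:

```python
# Toy Top-2 sparse MoE layer: a thin linear gate + softmax, keep the two
# highest-probability experts, renormalize their weights, and run only those.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2      # toy sizes, not Mixtral's

W_gate = rng.normal(size=(d_model, n_experts))                 # gate layer
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ W_gate                    # (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over experts
    top = np.argsort(probs)[-top_k:]       # indices of the Top-2 experts
    weights = probs[top] / probs[top].sum()  # renormalize the kept probs
    # Only the two selected experts run; the other six stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Note that regardless of how many experts exist, each token multiplies against only `top_k` of them, which is exactly why the active parameter count barely grows with the expert count.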

How do the experts differ from each other?

From the moment the gates distribute tokens, each expert naturally specializes in specific patterns. According to Google's Switch Transformer experiments, some experts primarily processed punctuation and numbers, while others focused on specific language fragments. A similar phenomenon was observed in DeepSeek's 2024 DeepSeek-V2 study, demonstrating that "if routing is done properly, the learning process naturally divides tasks."

Avoiding the ‘load imbalance’ trap.

What happens when all tokens are concentrated on popular experts? The popular experts' GPUs overheat while the remaining GPUs idle, slowing down the overall pipeline. To prevent this, the 'Load-Balancing Loss' technique was introduced. If the gate selects a specific expert too frequently, the loss function imposes a penalty to force distribution. The follow-up ST-MoE work added a Router Z-Loss that penalizes large router logits, significantly improving training stability.
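
The load-balancing penalty can be sketched in the Switch-Transformer style: the auxiliary loss is n_experts · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for expert i. The batch shapes below are illustrative:

```python
# Sketch of a Switch-Transformer-style auxiliary load-balancing loss.
# It is minimized (value 1.0) when tokens spread evenly across experts.
import numpy as np

def load_balancing_loss(router_probs):
    """router_probs: (n_tokens, n_experts) softmax outputs of the gate."""
    n_tokens, n_experts = router_probs.shape
    assignments = router_probs.argmax(axis=1)              # top-1 expert/token
    f = np.bincount(assignments, minlength=n_experts) / n_tokens  # f_i
    P = router_probs.mean(axis=0)                          # P_i
    return n_experts * float(np.sum(f * P))

# Evenly spread gate: token t prefers expert t % 8.
balanced = np.full((1024, 8), 0.1 / 7)
balanced[np.arange(1024), np.arange(1024) % 8] = 0.9
# Collapsed gate: every token goes to expert 0.
skewed = np.zeros((1024, 8))
skewed[:, 0] = 1.0

print(load_balancing_loss(balanced))   # ~1.0, the minimum
print(load_balancing_loss(skewed))     # 8.0, heavily penalized
```

During training this term is added to the task loss with a small coefficient, so the gate learns to spread tokens without sacrificing routing quality.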

Advantages and Disadvantages of MoE

First, the advantages: 

First, the reduced computational cost speaks for itself. With the same FLOPs, you can run much larger models, which directly saves companies on training costs and electricity bills.

Second, latency per token is lower than with a dense model of the same total size. In services like RAG (Retrieval-Augmented Generation), where a single call must return quickly, MoE makes a noticeable difference.

On the other hand, the disadvantages are also clear. 

Since all expert parameters must be loaded into VRAM, memory requirements are much higher than for a dense model with comparable per-token compute. Additionally, with limited training data, each expert sees only a fraction of the tokens, so individual experts can be under-trained and overall performance may drop.
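
The memory-versus-compute gap is easy to see with back-of-the-envelope arithmetic, using the commonly cited Mixtral 8×7B figures (~46.7B total, ~12.9B active per token) and fp16/bf16 weights; real deployments need KV-cache and activation memory on top of this:

```python
# Rough VRAM-vs-compute arithmetic for a Top-2 MoE like Mixtral 8x7B.
total_params = 46.7e9        # every expert must sit in memory
active_params = 12.9e9       # only the Top-2 experts multiply per token
bytes_per_param = 2          # fp16 / bf16

vram_gb = total_params * bytes_per_param / 1e9
active_fraction = active_params / total_params
print(f"Weights alone: ~{vram_gb:.0f} GB of VRAM")         # ~93 GB
print(f"Active per token: {active_fraction:.0%} of params")  # ~28%
```

In other words, you pay the compute bill of a ~13B model but the memory bill of a ~47B model, which is exactly the trade-off described above.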

The Evolution of MoE from Classic to Modern

In the 1990s, Hierarchical MoE stacked gates in a tree structure to refine “contextual decision-making.” The next generation, DMoE (Deep MoE), repeatedly stacked gates and experts by layer to explosively increase “effective computation paths.” 

In 2017, Google's Sparsely-Gated MoE demonstrated that even when model capacity was increased by 1,000 times, computation volume could be kept in check by simply activating the top k experts.

In 2023, the Hugging Face blog post “Mixture of Experts Explained” announced the arrival of the “ready-to-use open MoE” era through the Mixtral 8×7B case study. 

MoE is particularly effective in recommendation systems!

Recommendation models that need to simultaneously optimize multiple objectives such as clicks, dwell time, and purchase conversion inherently require multi-task learning. However, using only a shared-bottom structure causes task conflicts. MMoE (Multi-gate Mixture-of-Experts) solves this problem by sharing the experts but keeping a separate gate for each task. In actual industrial settings, a common setup shares a single expert pool while giving each of three goals (user engagement, satisfaction, and sales) its own gate, with good results.
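
The MMoE wiring (one shared expert pool, one softmax gate per task) can be sketched with toy sizes and random weights; the three "tasks" here are just placeholders for engagement, satisfaction, and sales:

```python
# Toy MMoE: all tasks share the experts, but each task mixes their outputs
# with its own gate, so tasks can weight the shared experts differently.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_tasks = 16, 4, 3           # toy sizes

experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # shared pool
gates = [rng.normal(size=(d, n_experts)) for _ in range(n_tasks)]  # per task
towers = [rng.normal(size=(d, 1)) for _ in range(n_tasks)]      # per task head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mmoe_forward(x):
    expert_outs = np.stack([x @ E for E in experts])   # (n_experts, d)
    preds = []
    for g, tower in zip(gates, towers):
        w = softmax(x @ g)                 # task-specific expert weights
        mixed = w @ expert_outs            # weighted sum of shared experts
        preds.append((mixed @ tower).item())
    return preds                           # one score per task

scores = mmoe_forward(rng.normal(size=d))
print(len(scores))  # 3
```

Because only the gates and towers are task-specific, tasks that conflict can route around each other while still amortizing the cost of the shared experts.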

Why the future is even more promising.

MoE researchers are already exploring three paths. 

First, distilling large MoE models into dense models for deployment on mobile devices. 

Second, extreme quantization (QMoE) technology that compresses parameters down to about 1 bit each.

Third, a SaaS model where experts themselves are exchanged via a "marketplace."

At this point, the serverless AI era, where “only the experts you need are called upon when needed, and costs are paid based on usage,” becomes a reality.

ChainShift provides customized solutions that enhance customers' AI search visibility by integrating MoE technology into AEO·GEO services. We are actively adopting newly released models such as GPT-OSS to quickly experiment with and apply the optimal ‘Mixture of Experts (MoE)’ combinations, and we will continue research and updates until search visibility and click-through rates are actually improved. In the AI-first era, with ChainShift, your content will be at the forefront of Google, Bing, and AI search engines.

ChainShift Amy

Reference

Switch Transformer:

https://arxiv.org/abs/2101.03961

Mixture of Experts Explained:

https://huggingface.co/blog/moe

DeepSeek-V2 paper:

https://arxiv.org/abs/2405.04434

MoE for RecSys:

https://blog.reachsumit.com/posts/2023/04/moe-for-recsys/

NVIDIA Tech Blog:

https://developer.nvidia.com/ko-kr/blog/applying-mixture-of-experts-in-llm-architectures/


© 2025 ChainShift. All rights reserved. Unauthorized reproduction and redistribution prohibited.

