awesome-mixture-of-experts

A collection of AWESOME things about mixture-of-experts

This repo is a collection of AWESOME things about mixture-of-experts, including papers, code, etc. Feel free to star and fork.

Open Models

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [Jan 2024] Repo Paper
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training [Dec 2023] Repo
Mixtral of Experts [Dec 2023] Repo Paper
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023] Repo Paper
Efficient Large Scale Language Modeling with Mixtures of Experts [Dec 2021] Repo Paper
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Feb 2021] Repo Paper

Papers

Must Read

I list my favorite MoE papers here. I think they can greatly help newcomers to MoE get to know the topic.

A Review of Sparse Expert Models in Deep Learning [4 Sep 2022]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
Scaling Vision with Sparse Mixture of Experts [NeurIPS 2021]
ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
Mixture-of-Experts with Expert Choice Routing [NeurIPS 2022]
Brainformers: Trading Simplicity for Efficiency [ICML 2023]
From Sparse to Soft Mixtures of Experts [2 Aug 2023]
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023]

MoE Model

Publication

Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [ICML 2023]
Robust Mixture-of-Expert Training for Convolutional Neural Networks [ICCV 2023]
Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [EMNLP 2023]
PAD-Net: An Efficient Framework for Dynamic Networks [ACL 2023]
Brainformers: Trading Simplicity for Efficiency [ICML 2023]
On the Representation Collapse of Sparse Mixture of Experts [NeurIPS 2022]
StableMoE: Stable Routing Strategy for Mixture of Experts [ACL 2022]
Taming Sparsely Activated Transformer with Stochastic Experts [ICLR 2022]
Go Wider Instead of Deeper [AAAI 2022]
Hash layers for large sparse models [NeurIPS 2021]
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning [NeurIPS 2021]
Scaling Vision with Sparse Mixture of Experts [NeurIPS 2021]
BASE Layers: Simplifying Training of Large, Sparse Models [ICML 2021]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [ICLR 2017]
CPM-2: Large-scale cost-effective pre-trained language models [AI Open]
Mixture of experts: a literature survey [Artificial Intelligence Review]

arXiv

Demystifying the Compression of Mixture-of-Experts Through a Unified Framework [4 Jun 2024] Repo Paper
Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [23 May 2024] Repo Paper
MoEC: Mixture of Expert Clusters [19 Jul 2022]
No Language Left Behind: Scaling Human-Centered Machine Translation [6 Jul 2022]
Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners [8 Jun 2022]
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts [6 Jun 2022]
Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation [5 Jun 2022]
Interpretable Mixture of Experts for Structured Data [5 Jun 2022]
Task-Specific Expert Pruning for Sparse Mixture-of-Experts [1 Jun 2022]
Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [28 May 2022]
AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [24 May 2022]
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT [24 May 2022]
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code [12 May 2022]
SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach [26 Apr 2022]
Residual Mixture of Experts [20 Apr 2022]
Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [16 Apr 2022]
MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [15 Apr 2022]
Mixture-of-experts VAEs can disregard variation in surjective multimodal data [11 Apr 2022]
Efficient Language Modeling with Sparse all-MLP [14 Mar 2022]
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [2 Mar 2022]
Mixture-of-Experts with Expert Choice Routing [18 Feb 2022]
ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
Designing Effective Sparse Expert Models [17 Feb 2022]
Unified Scaling Laws for Routed Language Models [2 Feb 2022]
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [28 Jan 2022]
One Student Knows All Experts Know: From Sparse to Dense [26 Jan 2022]
Dense-to-Sparse Gate for Mixture-of-Experts [29 Dec 2021]
Efficient Large Scale Language Modeling with Mixtures of Experts [20 Dec 2021]
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition [10 Dec 2021]
SpeechMoE2: Mixture-of-Experts Model with Improved Routing [23 Nov 2021]
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [23 Nov 2021]
Towards More Effective and Economic Sparsely-Activated Model [14 Oct 2021]
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [8 Oct 2021]
Sparse MoEs meet Efficient Ensembles [7 Oct 2021]
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [5 Oct 2021]
Cross-token Modeling with Conditional Computation [5 Sep 2021]
M6-T: Exploring Sparse Expert Models and Beyond [31 May 2021]
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [7 May 2021]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
Exploring Routing Strategies for Multilingual Mixture-of-Experts Models [28 Sep 2020]

MoE System

Publication

Pathways: Asynchronous Distributed Dataflow for ML [MLSys 2022]
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [OSDI 2022]
FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models [PPoPP 2022]
BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores [PPoPP 2022]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [ICLR 2021]

arXiv

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts [29 Nov 2022]
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [28 Mar 2022]
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System [20 Mar 2022]
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [14 Jan 2022]
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [29 Sep 2021]
FastMoE: A Fast Mixture-of-Expert Training System [24 Mar 2021]

MoE Application

Publication

Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields [2 Feb 2023]

arXiv

Spatial Mixture-of-Experts [24 Nov 2022]
A Mixture-of-Expert Approach to RL-based Dialogue Management [31 May 2022]
Pluralistic Image Completion with Probabilistic Mixture-of-Experts [18 May 2022]
ST-ExpertNet: A Deep Expert Framework for Traffic Prediction [5 May 2022]
Mixture of Experts for Biomedical Question Answering [15 Apr 2022]
Build a Robust QA System with Transformer-based Mixture of Experts [20 Mar 2022]
