Awesome Adaptive Computation

Awesome Adaptive Computation is a curated list of Adaptive Computation papers, models, explainers and libraries for Machine Learning.

About

Adaptive Computation (sometimes called Dynamic Compute) is the ability of a machine learning system to adjust its function and compute budget for each example.

Adaptive Computation techniques include Mixture of Experts (decoupling model capacity and model compute) and Early Exiting (saving compute on easy inputs) as well as sampling techniques.


In this repo, links are organised by topic and have explanations so you can decide what you would like to read. Especially recommended links are starred 🌟

Star this repository to see the latest developments in this research field.

We accept contributions! We strongly encourage researchers & practitioners to make pull requests with papers, approaches and explanations that they feel others in the community would benefit from πŸ€—

Mixture of Experts (Sparse MoE)

The Mixture of Experts (MoE) paradigm uses a routing layer to choose a limited number of parameters to apply to a given input rather than using all the available parameters.

This conditional computation allows model capacity to increase without also scaling the compute required for each forward pass. This is useful because bigger models are more sample-efficient and more compute-efficient to train.
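As a rough illustration of the idea, here is a minimal sketch of a top-k token-choice MoE layer in PyTorch (the class, names and shapes are ours for illustration, not taken from any particular paper): a small router scores the experts for each token and only the k highest-scoring experts are run on it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal token-choice MoE sketch: each token is processed by its k highest-scoring experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)           # (tokens, experts)
        topk_probs, topk_idx = gate_probs.topk(self.k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = (topk_idx == e).any(dim=-1)                 # tokens routed to expert e
            if chosen.any():
                weight = (topk_probs * (topk_idx == e)).sum(-1, keepdim=True)[chosen]
                out[chosen] = out[chosen] + weight * expert(x[chosen])
        return out
```

Real implementations batch tokens per expert and add capacity limits and load-balancing losses, but the routing idea is the same.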

MoE models are also useful for compartmentalising knowledge and avoiding negative interference from irrelevant computation. Mixtral-8x7B is an open-weights MoE model which is comparable to much larger models. Google DeepMind similarly show that their Gemini 1.5 Pro, based on an MoE architecture, is competitive with their much larger Gemini 1 Ultra. Databricks/Mosaic's DBRX is another powerful MoE model, and it seems that MoE is now the go-to architecture for large models.

JetMoE, based on the ModuleFormer MoE conception, shows that MoEs can also be effective at smaller scales.

D2DMoE: Dense to Dynamic-k Mixture-of-Experts Conversion, Szatkowski et al. (2024), pdf code

While MoE models are mostly used for scaling up the parameter count, recently MoEfication has shown that static dense models can be converted to MoEs to improve execution time. D2DMoE makes further progress in improving the efficiency of these dense-to-MoE converted models by: 1) showing that the efficiency of the resulting model can be significantly enhanced by enforcing activation sparsity in the base model; 2) proposing Expert Contribution Routing, a novel objective for the training of the gating networks, which are now tasked to predict the output norm of each expert for the given input, enabling approximation of each expert's relative contribution; 3) introducing dynamic-k gating, which allows the model to appropriately distribute its computational budget between easy and hard inputs; 4) extending the proposed conversion scheme to any linear layers such as multi-head attention projections.

Skywork-MoE, Skywork (2024) pdf

An open-source MoE in the style of Switch Transformer. They detail two training tricks for getting better MoE performance. Firstly, they normalise the routing logits before they go through the softmax in order to reduce the entropy of the router and make it more decisive. Secondly, they use a different auxiliary loss coefficient for each layer, tuned during training depending on how many tokens were dropped at that layer. This helps to reduce the impact of the auxiliary loss as the router becomes more balanced and confident.
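A hedged sketch of the first trick as we understand it (the function name and scale parameter are illustrative): standardise the routing logits per token before the softmax so the router can produce a sharper, more decisive distribution.

```python
import torch

def normalised_routing(logits: torch.Tensor, scale: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    # logits: (num_tokens, num_experts); standardise per token, then sharpen with a scale factor
    normed = (logits - logits.mean(-1, keepdim=True)) / (logits.std(-1, keepdim=True) + eps)
    return torch.softmax(scale * normed, dim=-1)
```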

DynMoE, CUHK: Guo et al (2024) pdf code

One problem with the token choice MoE approach is that every token is allocated the same number of experts. Ideally for true adaptive computation this would be variable depending on how difficult the token is. The authors introduce a routing mechanism which allows for a variable number of experts per token as well as a procedure for dynamically changing the number of experts during training. This allows for better performance with less hyperparameter sweeping.

Also see Dynamic Routing in MoEs

DeepSeek-v2, DeepSeek (2024) pdf

A large (236B), performant and open MoE with lots of details about training and checkpoints available. Useful for understanding a modern MoE recipe.

MoEs for Deep-RL, Google DeepMind: Obando-Ceron et al (2024) pdf

The authors show that MoEs can be used to improve the sample efficiency of popular RL systems such as DQN and Rainbow. The authors show that using MoEs (in particular the SoftMoE variant) improves the ultimate performance of the RL systems. Previously, scaling up the underlying models in RL systems was often wasteful in parameters, but the authors show using MoEs they can get predictable performance improvements with scale. This suggests that scaling laws for Deep RL systems could be possible.

MoE Design Choices, EPFL: Fan et al (2024) pdf

The authors ablate some design decisions for MoEs and show their benefits compared to vanilla transformers. Unfortunately they only study very small models, so similar analysis for larger models would likely be useful for the community.

🌟 Routers in Vision MoEs, DeepMind: Liu et al (2024) pdf

Compares the performance of different routing mechanisms in MoEs trained on Vision tasks. They show that Language Model routers can adapt well to Vision and that for Vision tasks (where the task isn't autoregressive), Soft MoE outperforms. They also reframe some previous routing methods mathematically to more clearly detail the differences. Worth a read for anyone deciding which MoE approach to choose for their application.

MoE-LLaVA, Peking University: Lin et al (2024) pdf code

Whilst MoEs have had much success in ViTs and LLMs, the authors also show that they can be effective in LVLMs (Large Vision Language Models). By exploiting the sparsity and increased parameter count of MoEs whilst maintaining FLOPs, we get the expected boost in both performance and hallucination avoidance.

Mixtral of Experts, Mistral (2024) pdf official code

The paper describing Mistral's state-of-the-art LLM, Mixtral, based on the MoE paradigm.

BlackMamba (MoE-Mamba), Zyphra: Anthony et al (2024) pdf code models

The authors combine the MoE paradigm with a recent SSM-based architecture, Mamba. Mamba provides Transformer-like performance and scaling properties whilst using a sub-quadratic alternative to attention, allowing for much larger sequence lengths. Here we see that this architecture can additionally be combined with MoEs to increase performance, similarly to MoEs for transformers or RNNs previously. For a general explainer on Mamba see here.

🌟 Offloading for Fast MoE Inference, Moscow: Eliseev & Mazur (2023) pdf official code

Despite some open-weights MoEs being available, they are not the most popular models used for inference due to the large memory footprint required at inference time. To address this the authors propose architecture-aware quantisation, an LRU cache for experts (to exploit the fact that experts are more likely than chance to repeat for two adjacent tokens) and a speculative expert loading algorithm. Since the inputs (x) to each layer only differ iteratively by a small amount (due to the residual stream carrying information from layer to layer), they note that by applying the routing function for the subsequent layer at the current layer, you can get a good guess for which experts to load. The upshot of this is that MoE models, like Mixtral, can be run on consumer grade hardware with much increased generation speed. This is a huge win for inference efficiency in the memory constrained and single batch regime. MoE-Infinity is a similar offloading paradigm with code here
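A rough sketch of the speculative-loading idea (the cache object and its prefetch method are hypothetical): apply the next layer's router to the current hidden state and start fetching the guessed experts while the current layer is still computing.

```python
import torch

def prefetch_next_layer_experts(hidden: torch.Tensor, next_router, expert_cache, k: int = 2):
    # hidden: (num_tokens, d_model); next_router maps hidden -> (num_tokens, num_experts) logits
    with torch.no_grad():
        guessed = next_router(hidden).topk(k, dim=-1).indices.unique().tolist()
    for expert_id in guessed:
        expert_cache.prefetch(expert_id)  # hypothetical: async copy to GPU; an LRU policy evicts cold experts
    return guessed
```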

🌟 QMoE, ISTA: Frantar & Alistarh (2023) pdf code

Generally MoEs require a larger memory footprint but fewer FLOPs compared to a dense model which achieves similar performance. In the quest to reduce the memory footprint, we might seek to perform quantisation. The authors present a compression method which takes advantage of the inherent sparsity to compress the model at a 20x compression rate whilst retaining most performance. This compresses each fp16 weight to the equivalent of less than one bit. For the first time it's possible to run a trillion parameter model on consumer hardware.

SparseMixer - Sparse Backpropagation for MoE Training, Microsoft: Liu et al (2023) pdf

One of the most important parts of an MoE is the router which allows the experts to specialise well. Unfortunately, typical MoE training gives suboptimal routers as suggested by Hash routers performing almost as well as more principled routing mechanisms. This paper suggests the reason is due to MoE training ignoring parts of the gradient and suggests a midpoint-rule based gradient approximation which substantially improves training.

MoV/MoLoRA, Cohere For AI: Zadouri et al (2023) pdf, official Jax code

Introduces parameter efficient MoE models where instead of routing between entire FFN layers, we route between adapters such as LoRAs or $(IA)^3$ with the same base model. This allows for much of the benefits of the (Soft) MoE paradigm but without the huge memory footprint (particularly compared to previous upscaling methods). The HydraMoE project also take a similar approach.

Soft Merging of Experts (SMEAR), UNC: Muqeeth et al (2023) pdf

Takes the opposite approach to Soft-MoE and averages the expert weights rather than the tokens. This is interesting given model-merging approaches, which show that linear combinations of models can perform well on tasks that either model was trained for. Note that the averaging operation can become prohibitively expensive if we use different experts for each token (similar FLOPs to forward passes on all experts and ensembling). Hence the method relies on a Task-MoE approach of picking the same expert configuration on a per-example rather than a per-token basis.

AutoMoE, UBC/Microsoft: Jawahar et al (2023) pdf, official PyTorch code

One of the promises of MoE is being able to apply different amounts of compute to each token. Generally, this has been achieved by different tokens being processed and dropped by different numbers of experts per layer. AutoMoE also uses differently sized experts to achieve more heterogeneity. They perform an architecture search for optimal architectures given computational constraints.

🌟 Expert Choice MoEs, Google: Zhou et al (2022) pdf, blog, PyTorch code

Introduces a principled, truly compute-adaptive MoE model. In traditional MoE models the tokens select the top experts that they would most like to be processed by. In Expert Choice routing however, the experts choose the top tokens that they would like to process. Hence multiple experts can pick the same token and give it lots of compute, and similarly all experts can ignore a token so it is skipped for that layer. As well as improving training efficiency, this approach also has the benefits that it helps with load balancing and eliminates the need for auxiliary loss functions.
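A minimal sketch of the expert-choice assignment (shapes and names are illustrative): instead of tokens picking experts, each expert takes its own top-c tokens, so the number of experts applied to a given token naturally varies.

```python
import torch

def expert_choice_assignment(router_logits: torch.Tensor, capacity: int):
    # router_logits: (num_tokens, num_experts)
    affinity = torch.softmax(router_logits, dim=0)       # normalise over tokens, per expert
    weights, token_idx = affinity.topk(capacity, dim=0)  # each column = one expert's chosen tokens
    return token_idx, weights                            # both (capacity, num_experts)
```

A token appearing in several columns gets extra compute; a token appearing in none is skipped for that layer.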

🌟 Task Level MoEs, Various (2022) DeMix pdf, Task-MoE pdf c-BTM code

Instead of routing each token separately these approaches use the same Expert for entire documents based on the task (which is supplied to the network). Instead of learning the routing, we supply the routing based on what we know about the tasks, inducing our own inductive bias. Also note that this offers memory-footprint benefits at inference time - if inference is for a limited set of tasks, we only need enough GPU memory for those experts. ELMForest - Branch, Train, Merge (BTM), c-BTM and Branch, Train, Mix (BTX) are follow-ups which use ensembling approaches from multiple LMs trained independently in a continual learning approach.

No Language Left Behind, Meta (2022) pdf, official PyTorch code

Translation is a natural setting for MoEs - some but not all of the parameters for English-to-Chinese translation may be relevant for English-to-French translation as well. But using all of the English-to-Chinese knowledge might confuse the model. MoE therefore has useful inductive biases to allow this model to use only the relevant parts. Here the researchers show that the MoE approach scales even for extremely low-resource languages. Translation may be a natural environment for task/document-level rather than token-level routing.

Hash Routing, Meta: Roller et al (2021) pdf

Uses a static, hash-based routing per input token rather than a learned router and shows similar results to more principled routing methods in some regimes. Suggests that previous routing methods may be somewhat under-optimised.

Switch Transformers, Google: Fedus et al (2021) pdf, review paper, PyTorch code, model

Simplifies the MoE routing algorithm with top-1 routing. Shows that we can exploit scaling laws with parameters as well as with compute, and develops a distributed-systems approach to MoEs.

WideNet (Go Wider Instead of Deeper), NUS: Xue et al (2021) pdf

Suggests a parameter sharing approach using a single layer of multiple MoEs repeated multiple times as transformer blocks (similar to the Universal Transformer but with MoEs). This results in a deep model which has O(expert_num) instead of O(layer_depth) parameters. They achieve SoTA results with fewer parameters than previous models. More recently, Apple's One Wide Feedforward paper details the amount of redundancy across layers. This suggests this approach is increasingly fruitful for on-device models.

🌟 Outrageously Large Neural Networks (aka The Sparse MoE Layer), Google: Shazeer et al (2017) pdf

Introduces Mixture of Experts models in their modern form using a Sparsely-Gated MoE layer and a trainable gating network. They use RNNs as this is pre "Transformers Eating The World".

Other Modular Architectures

Stylus Diffusion Adapter Selection, UC Berkeley: Luo et al (2024) pdf, repo

Diffusion model users often use adapters rather than full finetunes to achieve models which perform well on a particular style. The authors here automatically select and compose relevant adapters for the prompt using a model routing approach. A nice application of inference-time routing which we might expect to become more commonplace in the future.

MoDE - CLIP Data Experts via Clustering, Meta: Ma et al (2024) pdf, pytorch code

The authors apply a Task-specific MoE based on training parallel CLIP models on restricted domains and ensembling these together. They show increased performance with less compute and note the ability to add in new "experts" for new domains asynchronously and after initial training as a Continual Learning play.

🌟 Mixture of Depths, DeepMind: Raposo et al (2024) pdf

The traditional Early-Exit formulation allows "easier" tokens to skip the rest of the network, reducing compute, but it has a couple of problems. Firstly, it's not clear that the layers you want to skip are necessarily the final ones (perhaps an easy token should skip some middle layers instead) and secondly, Early Exit might not in practice be that much faster on GPUs due to the variable compute graph. In an attempt to align Early-Exit work with the Hardware Lottery, the authors enforce a compute budget with a fixed computational graph while allowing dynamic allocation of FLOPs across the tokens in a sequence, optimising the allocation at each layer across the model depth. Because they opt for an expert-choice routing mechanism, they also introduce novel sampling methods to ensure validity for autoregressive generation, which seem broadly applicable to other MoE models.

This builds on the LayerDrop approach to structured dropout.
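A hedged sketch of a Mixture-of-Depths style block (our simplification, not the paper's code): a per-block router keeps only the top-capacity tokens for the block's computation, and everything else rides the residual stream, so the compute graph stays static.

```python
import torch
import torch.nn as nn

def mod_block(x: torch.Tensor, router: nn.Linear, block: nn.Module, capacity: int) -> torch.Tensor:
    # x: (num_tokens, d_model); router maps d_model -> 1 score per token
    scores = router(x).squeeze(-1)                        # (num_tokens,)
    top_idx = scores.topk(capacity).indices               # tokens that get the block's compute
    out = x.clone()                                       # every token keeps the residual path
    gate = torch.sigmoid(scores[top_idx]).unsqueeze(-1)   # keeps the router in the gradient path
    out[top_idx] = x[top_idx] + gate * block(x[top_idx])
    return out
```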

Fast FeedForward (FFF), ETH Zurich: Belcak et al (2023) pdf1, pdf2, official pytorch code1, official pytorch code2, pytorch code

Instead of the usual FeedForward Network, the authors propose a balanced tree structure where, depending on your path through the tree, a different function is applied to the input. Inputs go either left or right through the tree depending on the result of a dot product with a learned discriminating vector. This approach achieves encouraging performance with somewhat limited inference FLOPs, but it requires high training FLOPs and yields unstructured inference sparsity that must be applied sequentially, which falls foul of the Hardware Lottery and doesn't parallelise nicely on GPUs.

🌟 Soft-MoE, Google DeepMind: Puigcerver et al (2023) pdf, pytorch code

Instead of Sparse MoE models, each expert uses its router to compute weights for a weighted average of the input tokens that it wants to process. They show SoTA results on image recognition tasks. Note that since the approach relies on Expert Choice, it doesn't yet generalise to autoregressive generation.
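A rough sketch of the Soft-MoE dispatch step (illustrative, not the official implementation): each expert slot consumes a learned soft combination of all input tokens, so routing is fully differentiable and no token is dropped.

```python
import torch

def soft_moe_dispatch(x: torch.Tensor, slot_logits: torch.Tensor):
    # x: (num_tokens, d_model); slot_logits: (num_tokens, num_slots)
    dispatch = torch.softmax(slot_logits, dim=0)  # for each slot, a distribution over tokens
    slot_inputs = dispatch.T @ x                  # (num_slots, d_model): soft mixtures of tokens
    return slot_inputs, dispatch                  # experts then process slot_inputs
```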

Early Exit: End-to-End Adaptive Computation

Early Exit approaches ask whether we can get the output of a neural network without going through all the layers, particularly when faced with an easier example. This is typically done by learning an exit probability at each layer.

EE-LLM, Alibaba: Chen et al (2024) pdf pytorch code

The authors extend Megatron into a library which natively supports the Early Exit paradigm taking full advantage of 3D parallelism. Other contributions include methods to efficiently facilitate backprop even when some layers may be unused and methods to handle the fact that using naive early exit would result in missing KV-caches for some tokens. They find that limiting the Early Exit layers to a few intermediate layers substantially improves performance.

Sparse Universal Transformer (SUT), MILA: Tan et al (2023) pdf

Combines the Universal Transformer approach (RNN with transformer blocks) with the Mixture of Experts paradigm (multiple experts instead of a single FFN layer). They also use a new stick-breaking-based dynamic halting mechanism. This brings all the benefits of Sparse MoEs (such as less inference compute whilst having a lot of parameters) and the benefits of Universal Transformer (parameter efficiency, Turing-completeness and generalization ability) together.

AdaTape, Google: Xue et al (2023) pdf, blog, official jax code

Extends the ACT method by giving the model a "tape" which contains some additional inputs which may be useful for encoding. For each token the model may append a variable number of tape tokens to the input, which allows it to regulate how much additional compute we add. The paper shows impressive performance on image classification tasks and the 'parity' task on long sequences.

Dataset Pruning Using Early Exit Networks, Görmez et al (2023) pdf

Early Exit Networks naturally learn which input examples are "easy" (can be exited early) or "difficult" (require all the layers of the network). The authors use this property to prune datasets to use for training and finetuning. The algorithm EEPrune achieves SOTA performance for dataset pruning in some regimes.

🌟 PonderNet, DeepMind: Banino et al (2021) pdf, PyTorch code

Allows the model to exit after each transformer layer if it's confident in the answer. It introduces a stable probabilistic policy for halting which provides low-variance, unbiased gradient updates. This can also be combined with the SkipNet paradigm where, instead of exiting directly, we skip to the final few layers so that our universal computation (applied to all inputs) is at the end as well as the start of the network.

PaBEE, DeepMind: Zhou et al (2020) pdf, official PyTorch code

Introduces patience-based early exiting. Whilst ACT has a learned exit probability, PaBEE instead looks at the output class if it were to exit, and exits if the intermediate outputs are the same over multiple consecutive layers. Interestingly they suggest that the reason for doing this isn't just speed: early stopping can improve performance due to a lower risk of "overthinking" (analogous to stopping training earlier to prevent overfitting). F-PaBEE presents a slightly more flexible approach based on similarity scores.
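A toy sketch of patience-based exiting (illustrative; the per-layer classifiers and patience value are ours): exit as soon as the intermediate prediction has stayed the same for a few consecutive layers.

```python
import torch

def pabee_forward(layers, classifiers, x: torch.Tensor, patience: int = 3):
    prev_pred, streak = None, 0
    for layer, clf in zip(layers, classifiers):
        x = layer(x)
        pred = clf(x).argmax(dim=-1)
        streak = streak + 1 if prev_pred is not None and torch.equal(pred, prev_pred) else 1
        prev_pred = pred
        if streak >= patience:   # prediction has been stable: exit early
            break
    return prev_pred
```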

Universal Transformer, Google: Dehghani et al (2019) pdf

Reuses the transformer block recurrently across multiple layers with an ACT-like halting mechanism. RNNs can be better than transformers at length extrapolation but here we get the best of the Transformer (training parallelizability) and the best of the RNN (recurrent inductive bias). Universal Transformers can also be shown to be Turing-complete.

Adaptive Computation Time (ACT) for RNNs, Google: Graves (2016) pdf

Introduces the ACT approach for models to learn how many computational steps they should take before returning an output. This approach is built on and refined in many later papers such as PonderNet.
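A simplified sketch of ACT-style halting (illustrative; it omits the remainder term and ponder cost from the paper): accumulate a halting probability each step and stop once it is nearly exhausted, returning the probability-weighted mixture of intermediate states.

```python
import torch

def act_forward(step_fn, halt_fn, state: torch.Tensor, max_steps: int = 10, eps: float = 0.01):
    # state: (batch, d); step_fn is the recurrent update; halt_fn maps state -> per-example halt prob in (0, 1)
    total_halt = torch.zeros(state.shape[0])
    output = torch.zeros_like(state)
    for _ in range(max_steps):
        state = step_fn(state)
        p = halt_fn(state).squeeze(-1) * (total_halt < 1 - eps)  # already-halted examples contribute nothing
        output = output + p.unsqueeze(-1) * state                # weighted mixture of intermediate states
        total_halt = total_halt + p
        if (total_halt >= 1 - eps).all():
            break
    return output
```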

More Compute Per Output Token

Masked Diffusion Language Models, Cornell: Sahoo et al (2024) pdf, pytorch code video

Autoregressive models sample a single token at a time, regardless of how difficult this token is to predict. Diffusion models however have the benefit that the number of steps from the noised input to the final output can be varied which acts as a knob controlling the amount of compute applied. The authors introduce a simplified method for Diffusion Language models based on BERT which achieves better perplexity than previous Diffusion Language models (though not at autoregressive model levels). This avenue provides a different approach to varying compute per output token.

🌟 Quiet-STaR, Stanford: Zelikman et al (2024) pdf,

One of the core motivations of Adaptive Computation is noting that for difficult tokens we should spend more compute. There are prompting-based ways to do this (e.g. Chain of Thought) and recurrent ways to do this (e.g. Universal Transformers), but ideally we'd want the LLM to just start writing down more tokens, using all its faculties on difficult tokens, without being told when to apply this technique, in a natural next-token-prediction way that takes advantage of its pretraining. Quiet-STaR is exactly that. The model generates hidden rationale tokens which it can use to reason but which don't get shown to the user (or loss function). This generalises the previous STaR work by the same authors and the Pause Token results in a way that is much more generally effective.

This is the real deal folks! I'm extremely excited about this approach. And it's a cracked team too; they start the paper with a quote from the Danish philosopher Søren Kierkegaard. An excellent formulation and one of those papers that makes you realise why you got into this field.

Adaptive Computation for Black-box models

For black box pre-trained models, perhaps those behind an API, there are some techniques for using Adaptive Computation. These are promising techniques for those with limited compute budgets.

Prompting techniques such as Reflexion, Debate, Chain of Thought, Tree of Thought and Chain of Verification can also be used to improve performance for Black-box models.

Online Speculative Decoding, Berkeley: Liu et al (2024) pdf, pytorch code

The authors propose an Active Learning approach to choosing draft models for speculative decoding. In downtime when the GPUs are not maxed out for inference, they use the capacity to instead finetune a draft model on the recent outputs from the large teacher model. In this way the system can respond to distribution shift and still stay performant by accepting more tokens from the draft model. Depending on the distribution shift, this can result in latency reductions going from 1.22x (naive speculative decoding under distribution shift) to 3.06x (their method).

🌟 Scaling LLM Test-Time Compute, DeepMind: Snell et al (2024) pdf

They test two strategies for using test-time compute: (1) searching against dense, process-based verifier reward models in a tree-like fashion and (2) utilising Dynamic Evaluation-style updating of the model's distribution at test time given the prompt. They find that using these strategies they're able to achieve a 4x improvement over using best-of-N with the same compute budget. Even more strikingly, they find a 14x improvement over a FLOP-matched larger model. This is a huge win for Adaptive Compute-style approaches. On a different note, the paper's motivating setup and styling are well executed, which makes it a nice read.

🌟 Jacobi Consistency Large Language Models (CLLMs), SJTU: Kou et al (2024) pdf, blog

An interesting approach to the multi-token prediction problem. They give a language model a prompt and k random tokens to come next. They then run a forward pass: this will give the "correct" next token (i.e. directly after the prompt) but there's some chance it also updates one of the following tokens to being correct as well. They use a consistency loss (similar to that in Diffusion models) to improve the trajectory from [k random tokens] --> [k correct tokens]. Note that this can take at most k forward passes but can ideally be done in fewer.

The aim of the game is to repeatedly put the k tokens through the model until they reach a fixed point (i.e. a forward pass doesn't change them). (Also note that on this final forward pass we also get the k+1th token returned.) This is an interesting approach that fundamentally changes the language modelling objective from autoregressive prediction to predicting trajectories for multi-token sequences.
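A rough sketch of the Jacobi-style iteration (all the helper names here are hypothetical): start from k guess tokens, re-predict every position in one parallel forward pass, and stop when a fixed point is reached.

```python
def jacobi_decode(model, prompt, k: int = 8, max_iters: int = 8):
    guess = model.random_tokens(k)                        # hypothetical: k arbitrary starting tokens
    for _ in range(max_iters):
        refined = model.predict_positions(prompt, guess)  # hypothetical: one parallel forward pass over all k positions
        if refined == guess:                              # fixed point: a forward pass changes nothing
            return refined
        guess = refined
    return guess
```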

🌟 Multi-Token Prediction, Meta: Gloeckle et al (2024) pdf

Traditionally LLMs predict one token at a time. This is somewhat inhuman and inefficient because often, once the start of a word/phrase is predicted, the end is trivial. The authors here treat subsequent-token prediction as an auxiliary task and train additional heads to predict further tokens. The real benefit of this approach isn't in inference though, but in training. Pre-training with this auxiliary task is more sample efficient and forces the LM to learn better representations for medium-term dependencies. In generative tasks such as coding, these models substantially outperform traditional LLMs with lower latency.

Many-Shot In-Context Learning, Google DeepMind: Agarwal et al (2024) pdf

It has been long observed that language models can learn how to do a new task from examples of inputs, reasoning chains and outputs. This is known as few-shot Chain of Thought (CoT). Historically, the number of examples has been limited by the context window though. In this work, the authors suggest that using hundreds or thousands of examples (typically model generated) can aid performance and out-of-distribution robustness via the In-Context Learning mechanism. In other words, they formalise another way to turn inference time compute into better performance.

This can be added to your DSPy program for long-context models.

🌟 Martian LLM Router, Martian: Hu et al (2024) pdf, blog

Martian provide the first LLM router, which dynamically routes queries to the best LLM in real-time, to achieve higher performance and lower cost than any individual API. They're able to choose models which might be better at a single task and to route away from powerful expensive models when a cheaper one will suffice. In order to choose which model to use, they use a new interpretability technique known as model mapping. Worth paying attention to.

Chip Huyen discusses model routing approaches here

EAGLE, Peking/Microsoft: Li et al (2024) pdf, PyTorch code

An improvement to speculative decoding which uses the fact that upper layers in the model have good features for multiple tokens ahead to predict future tokens from the current one without using all the layers. This approach is typically 50% faster than previous single-model speculative decoding efforts and 3x faster than vanilla decoding.

🌟 Contrastive Decoding, Stanford: Li et al (2023) pdf, pdf2

A small helper model generates tokens alongside the main model. Tokens are up-weighted if the large model finds them proportionally much more plausible than the small model. This approach improves the quality of open-ended generations and reasoning ability. To extend this method towards additionally adaptive computation, smaller contrastive models could be applied conditionally depending on the input.

🌟 Speculative Sampling, DeepMind: Chen et al (2023) pdf, pdf2, blog, PyTorch code PyTorch blog

A smaller model generates multiple tokens autoregressively and then a larger model checks the smaller model against what it would have generated (all in one go). We accept only the tokens where the two models agree (by some acceptance criteria) and then the larger model's next token. This gives exactly the same output as the larger model would have but with significantly reduced sampling time. This takes advantage of the fact that we can parallelise evaluation whilst generation happens token by token. Additionally Online Speculative Decoding suggests we can use any excess compute (at inference time) to retrain the small model online on the query distribution with teacher-student distillation. Note that the small model need not be a transformer: Recurrent Drafter from Apple suggests using a fast RNN for speculative decoding, and large n-gram models could also be used as a non-parametric approach. Indeed REST suggests retrieving follow-on tokens from the web for the speculative decoding head. See also Accelerated Speculative Sampling (ASpS) with Tree Monte Carlo (or video) for further improvements to this method.
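A simplified greedy sketch of the idea (the method names are hypothetical, and the real algorithm accepts draft tokens probabilistically rather than by exact match): the draft model proposes k tokens, the target model scores them all in one pass, and we keep the agreeing prefix plus the target model's own next token.

```python
def speculative_step(draft_model, target_model, prompt, k: int = 4):
    draft = draft_model.generate(prompt, max_new_tokens=k)            # hypothetical API: k proposed tokens
    checks = target_model.predict_next_at_each_prefix(prompt, draft)  # hypothetical API: k+1 predictions in one pass
    accepted = []
    for proposed, checked in zip(draft, checks):
        if proposed != checked:       # first disagreement: stop accepting draft tokens
            break
        accepted.append(proposed)
    accepted.append(checks[len(accepted)])                            # the target model's own next token
    return accepted
```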

FrugalGPT, Stanford: Chen et al (2023) pdf

Details various approaches for fully black box adaptive computation (i.e. from an API where you don't even get logits). They use an LLM Cascade strategy where given a prompt they select n models to try sampling with, in order of increasing parameter count. The first model samples and we check the generation with a scoring function. If the generation is rejected then we generate with a more capable model. We continue this process until we accept the generation or are using the largest model. Interestingly this approach provides some shielding against inverse scaling problems. They also use completion caching.
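An illustrative cascade in this spirit (the models list and scoring function are placeholders): try models from cheapest to most capable and return the first answer the scorer accepts.

```python
def llm_cascade(prompt, models, score_fn, threshold: float = 0.8):
    # models: callables ordered by increasing cost/capability; score_fn judges a (prompt, answer) pair
    answer = None
    for model in models:
        answer = model(prompt)
        if score_fn(prompt, answer) >= threshold:
            return answer
    return answer  # fall back to the largest model's answer
```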

Beam Search, Google: Sutskever et al (2014) pdf

Beam search allows LMs to explore the probabilities of choosing a few tokens at a time, building out a tree before selecting one. Increasing the number of beams increases the number of options explored downstream and hence the amount of compute per token.
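A minimal beam search sketch (the next-token scoring callable is a placeholder): keep the beam_width best-scoring partial sequences at each step instead of committing greedily.

```python
def beam_search(next_logprobs, start_token, beam_width: int = 4, max_len: int = 20):
    # next_logprobs(seq) -> iterable of (token, logprob) pairs for the next position (placeholder)
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in next_logprobs(seq):
                candidates.append((score + logp, seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])[1]  # highest-scoring completed sequence
```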

Continual Learning

🌟 Lifelong-MoE, Google DeepMind: Chen et al (2023) pdf

Trains a language model for multiple tasks by training for one task, freezing those weights and then adding some additional layers which help to train the next task (in combination with the frozen layers). This treats pretrained weights more like an API (which you can use but not edit) when training a model to do a new task. This helps to eliminate the catastrophic forgetting that can happen with naive finetuning.

Sparse Upcycling, Google Research: Komatsuzaki et al (2023) pdf

Shows that you can use pre-trained dense model checkpoints as an initialisation for training sparse MoEs. This reduces the overall compute budget needed and reduces the sunk costs for already-trained models. Sparse upcycling can be viewed as an efficient form of finetuning which converts a pretrained dense model to a sparse model for inference.
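A hedged sketch of the upcycling initialisation (illustrative; the real recipe also handles the non-FFN weights and router placement choices): each expert in the new MoE layer starts as a copy of the dense checkpoint's FFN, while the router is newly initialised.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, num_experts: int):
    # every expert begins as an identical copy of the dense FFN; training then specialises them
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)  # fresh routing layer on top of the copied experts
    return experts, router
```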

MuNet, Google: Gesmundo et al (2022-23) pdf, pdf2, pdf3, pdf4, official jax code

Defines an evolutionary algorithm which adds different tasks onto an existing base model by (1) inserting adapter layers, (2) changing hyperparameters, (3) freezing layers and (4) copying layers to retrain. An interesting sketch of what Adaptive Computation could look like in the future.

Tools & Agents

One way of varying compute is, on some tokens, to call out to an external API for parts of completions.

SWE-Agent, Princeton: Yang et al (2024) code demo

An AI software engineer (à la Devin) which takes a GitHub issue and autonomously tries to fix it. It operates fast (a couple of minutes) and performs well on the SWE-bench benchmark. One of the first AI agents to actually work in the real world, and it's open-source.

🌟 LLM-Powered Autonomous Agents, OpenAI: Lilian Weng (2023) blog

An overview of agents as general problem-solvers powered by LLMs, such as AutoGPT and GPT-Engineer. Agents can typically act within the world and are augmented with the ability to do explicit long-term planning (via decomposing goals into sub-goals and learning from their mistakes), long-term memory (via a vector database) and tool use (calling external APIs).

ChatGPT Plugins, OpenAI (2023) blog, demo

GPT-4 has access to plugins for tasks where it would be better suited to call an API. Examples include Code Interpreter, web browser and Wolfram Alpha.

Toolformer, Meta: Schick et al (2023) pdf, pdf2

Trains models to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Effectively the LMs teach themselves how to use tools. In the limit case of this we simply require LMs/agents to be able to ask the right questions, know where to ask them and possibly be able to interpret the answers they receive. In other words, we offload the actual computation to external APIs (which may themselves be ML models) and use much smaller base models.

Games

🌟 Libratus: heads-up no-limit poker, CMU: Brown and Sandholm (2017) pdf, pdf2, video

The first AI to beat humans at heads-up Texas Hold 'Em poker. An important part of the approach was computing real-time responses to opponent moves, spending more compute on less obvious moves.

AlphaGo/AlphaZero, DeepMind: Silver et al (2016) pdf, pdf2, film, blog

This result needs no introduction. In terms of Adaptive Computation, the depth of the Monte Carlo Tree Search (MCTS) was allowed to vary.

Pre-cursors to Adaptive Computation

Dynamic Evaluation blog

Dynamic evaluation is an inference-time finetuning approach which allows for online learning to increase performance on a given task. This was popular for RNN approaches but has fallen out of favour with the preference for simple-to-deploy models behind an API and the rise of In-Context Learning. Similar approaches have seen some success on the ARC challenge. See also the Jack Cole interview.

Attention and The Transformer, Vaswani et al (2017) pdf pdf2

Although we don't normally think of it this way, attention can be viewed as a conditional computation mechanism: the matrix which is applied to the input is dependent on the incoming data.

Conditional Computation, Bengio et al. (2016) pdf

They use Reinforcement Learning (policy gradients) to train a policy that decides which parts of the network to activate, in effect learning a dropout policy for sparsity.

Adaptive Mixtures of Local Experts, Jacobs et al (1991) pdf

Collaborative, learned Mixture of Experts approaches to handle subsets of the training set are proposed. It's remarkable how close current approaches are to the original gating network. They also show intuitive expert specialisation on the task of vowel discrimination.

Open Source Libraries

🌟 DSPy, Stanford: Khattab et al (2023) pdf, code

A framework which allows AI engineers to build LLM pipelines in code. Here we can also algorithmically optimize LM prompts and weights using their compilation tools. Within this framework pipelines are written like PyTorch code and engineers can write control flows to allow for Adaptive Computation.

git_theta, UNC: Kandpal et al (2023) pdf, official framework-agnostic code

A git extension which allows tracking and merging changes to model checkpoints like git does with code. With git_theta you can see diffs in parameter groups and merge model finetuning branches with merging approaches. It's also efficient with low-rank changes to parameter groups.

🌟 MegaBlocks, Stanford/Databricks (2022) pdf pytorch code

A lightweight library for training MoE models which is well integrated with Megatron-LM. Maintained by Databricks and used by Mistral, it's becoming the standard in MoE training. The core of the library is implementing "dropless MoE" efficiently.

🌟 DeepSpeed-MoE, Microsoft: Rajbhandari et al (2022) blog, pdf, official PyTorch code

Training and inference solution for distributed MoE models. They also present a new MoE architecture PR-MoE which has more experts in higher layers and a method for distilling expert models into dense 'student models'.

AI Safety

With adaptive computation, models can choose to use more compute on harder problems.

For problems where we're concerned about systems failing by not being able to do sufficient computation, Adaptive Computation is very positive for Alignment. We should expect fewer mistakes from a model utilising Adaptive Computation, even on more difficult problems. Additionally, Adaptive Computation-based systems are less susceptible to Adversarial Attacks. That is to say, Adaptive Computation makes models more robust.

However, for problems where we're concerned about systems being deceptive or mesa-optimising, increasing the amount of inference-time compute increases their ability to do so. Here the failure is not a "mistake" but entirely intentional from the system's perspective. Inference-time search is one way that a model could implement deceptive alignment, for example.

Scaling Laws

Toward Inference-Optimal MoEs, UCSD: Yun et al (2024) pdf

The Chinchilla scaling laws focused on how to allocate compute to get the best model for a given amount of training compute. Since then Llama and others have focused on optimising for inference compute as well as training compute. For MoEs there are additional considerations here - how many experts should you use for a given parameter count, given that at inference time the cost depends on the active parameters? The authors find that fewer experts are more efficient at inference time but more experts are more efficient at training time.

Knowledge Capacity Scaling Laws, Meta: Allen-Zhu & Li (2024) pdf

The authors examine the Physics of Language Models and how much data they can store per parameter. They find that models can typically store around 2 bits per parameter, and this doesn't reduce much with MoE models. This confirms (since MoEs are typically much larger in parameter count) that these models can store a lot more information than traditional models. It also suggests that knowledge capacity is relatively independent of forward-pass compute, giving a natural (if imprecise) cleaving of intelligence, represented in the compute applied, and knowledge, represented in the parameters.

Sparse Scaling Laws, DeepMind: Frantar et al (2023) pdf

Scaling laws paper in the style of the Chinchilla paper. Details the optimal sparsity for a model given the inference FLOPs and training budget. They suggest that sparsity is especially important for larger models, which see diminishing returns past Chinchilla optimality. See also Unified Scaling Laws.

Further Scaling Laws For Fine-Grained MoEs suggest ways to optimally select the trade-off between the number and size of experts.

Scaling Scaling Laws with Board Games, Andy Jones (2021) pdf

The Bitter Lesson suggests that there are two general techniques that work well in Machine Learning - search and learning. This paper suggests that these can be traded off against one another - that is, instead of additional learning you could add capable search to achieve similar performance. We can trade off train-time and test-time compute depending on our requirements.

Other

Blending Is All You Need, Cambridge: Lu et al (2024) pdf

Mark this under "bizarre". They have a multi-model setup where they completely randomly and uniformly select a model to answer each query in a conversation. The authors report higher user engagement and retention metrics using this approach than with each individual model. One hypothesis is that each model can influence the models that answer afterwards through the conditioning on previous tokens, and there might be some implicit benefit to a model seeing tokens which are slightly off-distribution from what it would have produced. Perhaps the models also have different refusal policies. It's not clear why this should work, but it provides a floor for more sophisticated model-routing procedures.

Buffer Overflow in MoEs, DeepMind: Hayes et al (2024) pdf

Typically with token-choice routing methods in MoEs there are implicit cross-batch dependencies (i.e. the same token could be routed to different experts if, in its batch, most of the other tokens also wanted to go to its preferred expert). The authors show that this batch dependency can be used as an attack surface. They present a few solutions - mostly this shouldn't be a problem if batch sizes are very large (as in inference at a big AI lab) but it's an interesting one to watch out for. We might expect ML systems security to become an increasingly large field of research.

FLOPs are all you need, Emin Orhan (2023) blog

Short post detailing how the success of deep learning models correlates with how efficiently they use compute per parameter and how they share parameters.

Review Paper: Dynamic Neural Networks Survey, Han et al (2022) pdf

A review of Adaptive Computation approaches.




Thanks for reading, if you have any suggestions or corrections please submit a pull request! And please hit the star button to show your appreciation.
