Code for the paper 'Masked Mixers for Language Generation and Retrieval', which you can read here. Datasets and trained models will be added soon.
For a less formal version of this work, written as a technical blog post, see this page.
Motivation: Transformers represent their inputs with poor accuracy, whereas MLP-mixers adapted for causal language modeling (aka masked mixers) retain much more accurate input representations.
Finding: Masked mixers learn language generation approximately as efficiently as transformers but are far superior for retrieval.
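The core architectural idea, roughly: the inter-token (sequence-mixing) linear layer of an MLP-mixer is masked to be lower-triangular, so each position only mixes information from earlier positions and the model can be trained for causal language modeling. The module below is an illustrative simplification of that masking idea, not the implementation found in src (the class name and shapes are placeholders):

```python
import torch
import torch.nn as nn

class MaskedTokenMixer(nn.Module):
    """Sketch of a causally masked token-mixing layer (illustration only)."""

    def __init__(self, seq_len: int):
        super().__init__()
        self.proj = nn.Linear(seq_len, seq_len, bias=False)
        # lower-triangular mask so position t only mixes positions s <= t
        self.register_buffer("mask", torch.tril(torch.ones(seq_len, seq_len)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden); mix across the sequence dimension
        masked_weight = self.proj.weight * self.mask
        return torch.einsum("bsh,ts->bth", x, masked_weight)
```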
Unless you want to replicate a specific experiment, use the src
directory to train, run, and evaluate mixers and other related models.
The transfixer implementation is tightly bound to the Hugging Face Llama implementation and may be found here as a branch of version 4.42.2 of the transformers library.
Two directories are provided for experimental replication: pc contains the code used on the 1x Nvidia RTX 3060 node, and server contains the code used on the 4x V100 node (compatible with DDP).
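For reference, "compatible with DDP" means the server code follows PyTorch's standard multi-GPU distributed setup. The sketch below is a generic illustration of that setup, not the repo's actual training script (the script name, model, and hyperparameters are placeholders); it would typically be launched with something like `torchrun --nproc_per_node=4 train.py`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (and the other rendezvous env vars) per process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # placeholder model standing in for a masked mixer
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # ... training loop over a DistributedSampler-backed DataLoader ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```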