Metaseq originated as a fork of fairseq that merged FSDP with Megatron's tensor parallelism in order to train a 175B-parameter model on 1k 80GB A100 GPUs.
To enable faster iteration, we have removed most of the features offered by fairseq, leaving only the bare minimum needed to work at 175B scale. We have also renamed many of the Fairseq* classes, mostly to Base* or Metaseq* prefixes. The full list of renamed classes follows:
- Training internals renaming (optimizer-related changes and dropout):
  - FairseqOptimizer → BaseOptimizer
  - LegacyFairseqOptimizer → LegacyOptimizer
  - FairseqLRScheduler → BaseLRScheduler
  - FairseqCriterion → BaseCriterion
  - FairseqIncrementalState → IncrementalState
  - FairseqAdam → MetaseqAdam
  - FairseqAdamConfig → MetaseqAdamConfig
  - FairseqSGDW → MetaseqSGDW
  - FairseqDropout → Dropout
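For downstream code, these renames are mostly mechanical. As a hedged sketch (not a confirmed API): assuming `metaseq.optim` mirrors fairseq.optim's layout, with a `register_optimizer` decorator and the `optimizer_config` property carried over from FairseqOptimizer, a custom optimizer migrates like this (the `# was:` comments mark the old fairseq names; the constructor signature has changed across fairseq versions, so check your checkout):

```python
# Hypothetical before/after for a custom optimizer. Assumes metaseq.optim
# mirrors fairseq.optim's layout; verify import paths before relying on this.
import torch

from metaseq.optim import BaseOptimizer, register_optimizer
# was: from fairseq.optim import FairseqOptimizer, register_optimizer


@register_optimizer("my_sgd")
class MySGD(BaseOptimizer):  # was: class MySGD(FairseqOptimizer)
    def __init__(self, cfg, params):
        super().__init__(cfg)
        self._optimizer = torch.optim.SGD(params, **self.optimizer_config)

    @property
    def optimizer_config(self):
        # fairseq convention: lr arrives as a list; torch.optim wants a float.
        return {"lr": self.cfg.lr[0], "momentum": self.cfg.momentum}
```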
- Model architecture renaming:
  - FairseqDecoder → BaseDecoder
  - FairseqEncoder → BaseEncoder
  - DistributedFairseqModel → DistributedModel
  - BaseFairseqModel → BaseModel
  - FairseqEncoderDecoderModel → EncoderDecoderModel (to be ripped out; only affected tests)
  - FairseqLanguageModel → LanguageModel
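Similarly, a custom model that subclassed FairseqLanguageModel and FairseqDecoder would now target the renamed base classes. This is a minimal sketch that assumes `metaseq.models` keeps fairseq's registry layout (`register_model`, a `build_model` classmethod); the class and module names here are illustrative, not taken from the metaseq codebase:

```python
# Hypothetical sketch of the model-side renames, assuming metaseq.models
# mirrors fairseq.models. Verify the exact import paths against your checkout.
import torch

from metaseq.models import (  # was: from fairseq.models import ...
    BaseDecoder,    # was: FairseqDecoder
    LanguageModel,  # was: FairseqLanguageModel
    register_model,
)


class TinyDecoder(BaseDecoder):
    def __init__(self, dictionary, embed_dim=8):
        super().__init__(dictionary)
        self.embed = torch.nn.Embedding(len(dictionary), embed_dim)
        self.out = torch.nn.Linear(embed_dim, len(dictionary))

    def forward(self, prev_output_tokens, **kwargs):
        # Decoder contract: return (logits, extra).
        return self.out(self.embed(prev_output_tokens)), None


@register_model("tiny_lm")
class TinyLM(LanguageModel):  # was: class TinyLM(FairseqLanguageModel)
    @classmethod
    def build_model(cls, args, task):
        return cls(TinyDecoder(task.target_dictionary))
```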
- Config and circuitry renaming:
  - FairseqTask → BaseTask
  - LegacyFairseqTask → LegacyTask
  - FairseqDataclass → MetaseqDataclass
  - FairseqConfig → MetaseqConfig
  - FairseqDataset → BaseDataset
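The task and dataclass renames follow the same pattern. A minimal sketch, assuming metaseq keeps fairseq's `register_task` decorator and dataclass-driven configs (the module paths `metaseq.tasks` and `metaseq.dataclass` are assumptions based on fairseq's layout):

```python
# Hypothetical migration of a custom task from fairseq to metaseq.
from dataclasses import dataclass, field

from metaseq.dataclass import MetaseqDataclass  # was: fairseq.dataclass.FairseqDataclass
from metaseq.tasks import BaseTask, register_task  # was: FairseqTask, register_task


@dataclass
class ToyTaskConfig(MetaseqDataclass):  # was: class ToyTaskConfig(FairseqDataclass)
    data: str = field(default="", metadata={"help": "path to data directory"})


@register_task("toy_task", dataclass=ToyTaskConfig)
class ToyTask(BaseTask):  # was: class ToyTask(FairseqTask)
    @classmethod
    def setup_task(cls, cfg, **kwargs):
        return cls(cfg)
```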
- Module renaming:
  - fairseq → metaseq
  - fairseq_cli → metaseq_cli
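Because the top-level packages were renamed too, existing fairseq-based code mostly needs mechanical identifier rewrites. Below is a minimal, hypothetical migration helper (pure standard library) that applies the rename table above to a source tree. Note that FairseqDropout → Dropout can shadow torch.nn.Dropout, so review the resulting diff rather than applying it blindly:

```python
#!/usr/bin/env python
# Hypothetical migration helper: rewrites the renamed identifiers from the
# table above across a source tree. Whole-word matching only; review the diff.
import pathlib
import re
import sys

RENAMES = {
    "fairseq": "metaseq",
    "fairseq_cli": "metaseq_cli",
    "FairseqOptimizer": "BaseOptimizer",
    "LegacyFairseqOptimizer": "LegacyOptimizer",
    "FairseqLRScheduler": "BaseLRScheduler",
    "FairseqCriterion": "BaseCriterion",
    "FairseqIncrementalState": "IncrementalState",
    "FairseqAdam": "MetaseqAdam",
    "FairseqAdamConfig": "MetaseqAdamConfig",
    "FairseqSGDW": "MetaseqSGDW",
    "FairseqDropout": "Dropout",
    "FairseqDecoder": "BaseDecoder",
    "FairseqEncoder": "BaseEncoder",
    "DistributedFairseqModel": "DistributedModel",
    "BaseFairseqModel": "BaseModel",
    "FairseqEncoderDecoderModel": "EncoderDecoderModel",
    "FairseqLanguageModel": "LanguageModel",
    "FairseqTask": "BaseTask",
    "LegacyFairseqTask": "LegacyTask",
    "FairseqDataclass": "MetaseqDataclass",
    "FairseqConfig": "MetaseqConfig",
    "FairseqDataset": "BaseDataset",
}

# Longest names first, so e.g. "fairseq_cli" is not partially rewritten
# by the bare "fairseq" rule.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True)) + r")\b"
)


def migrate(root: str) -> None:
    for path in pathlib.Path(root).rglob("*.py"):
        src = path.read_text()
        dst = PATTERN.sub(lambda m: RENAMES[m.group(1)], src)
        if dst != src:
            path.write_text(dst)
            print(f"rewrote {path}")


if __name__ == "__main__":
    migrate(sys.argv[1] if len(sys.argv) > 1 else ".")
```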