mbzuai-nlp/ArTST

ArTST

This repository contains the implementation of the paper:

ArTST: Arabic Text and Speech Transformer

     


* equal contribution   1 MBZUAI  

ArabicNLP 2023

ArTST is a pre-trained Arabic text and speech transformer that supports open-source speech technologies for the Arabic language. The model architecture in this first edition follows SpeechT5, the unified-modal framework recently released for English, and focuses on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification.

Update

  • December 2024: Fine-tuning notebook using the Hugging Face Trainer on Colab
  • October 2024: ArTST v2 ASR model card for Hugging Face Transformers
  • October 2024: Released ArTST v2 base, which covers 17 dialects in pre-training
  • October 2024: ArTST v1 ASR model card for Hugging Face Transformers
  • February 2024: Released ArTST TTS on Hugging Face Transformers
  • February 2024: Bug fix for checkpoint loading
  • December 2023: Released ArTST ASR demo on HF Spaces
  • November 2023: Released ArTST TTS demo on HF Spaces
  • October 2023: Open-sourced the model weights on Hugging Face
  • October 2023: ArTST was accepted at ArabicNLP 2023 (co-located with EMNLP 2023).

Checkpoints

Pre-Trained Models

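| Model         | Pre-train Dataset | Checkpoint   | Tokenizer    |
|---------------|-------------------|--------------|--------------|
| ArTST v1 base | MGB2              | Hugging Face | Hugging Face |
| ArTST v2 base | Dialects          | Hugging Face | Hugging Face |
| ArTST v3 base | Multilingual      | soon         | soon         |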

Finetuned Models

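| Model        | Finetune Dataset | Checkpoint                 | Tokenizer    |
|--------------|------------------|----------------------------|--------------|
| ArTST v1 ASR | MGB2             | Hugging Face               | Hugging Face |
| ArTST v1 TTS | ClArTTS          | Hugging Face               | Hugging Face |
| ArTST* TTS   | ClArTTS          | Hugging Face               | Hugging Face |
| ArTST v2 ASR | QASR             | Hugging Face - safetensors | Hugging Face |
| ArTST v2 ASR | Dialects         | soon                       | soon         |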

Environment & Installation

Python version: 3.8+

  1. Clone this repo
git clone https://github.com/mbzuai-nlp/ArTST
cd ArTST
conda create -n artst python=3.8
conda activate artst
pip install -r requirements.txt
  2. Install fairseq
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
python setup.py build_ext --inplace
  3. Download Checkpoints
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/MBZUAI/ArTST

Loading Model

With HuggingFace Transformers

import torch
from transformers import (
    SpeechT5ForSpeechToText,
    SpeechT5Processor,
    SpeechT5Tokenizer,
)

# Use the GPU when available; PyTorch device strings are lowercase.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "mbzuai/artst-v2-asr"  # or "mbzuai/artst_asr" for v1

# Load the tokenizer first and pass it to the processor explicitly.
tokenizer = SpeechT5Tokenizer.from_pretrained(model_id)
processor = SpeechT5Processor.from_pretrained(model_id, tokenizer=tokenizer)
model = SpeechT5ForSpeechToText.from_pretrained(model_id).to(device)
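
Once the processor and model are loaded as above, transcription could look roughly like the sketch below. It follows the usual SpeechT5 speech-to-text pattern in Transformers (`processor(audio=...)`, `model.generate`, `processor.batch_decode`). The helper name `transcribe`, the 16 kHz sampling rate, and the `max_length` value are assumptions made for illustration, not settings taken from this repo.

```python
def transcribe(waveform, processor, model, device="cpu",
               sampling_rate=16000, max_length=150):
    """Transcribe one mono waveform (1-D sequence of float samples).

    `sampling_rate` and `max_length` are illustrative defaults (assumptions);
    the audio is expected to already be resampled to `sampling_rate`.
    """
    # Turn raw audio into model inputs (PyTorch tensors).
    inputs = processor(audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    # Move the inputs to the same device as the model.
    inputs = inputs.to(device)
    # Autoregressively generate token ids for the transcript.
    predicted_ids = model.generate(**inputs, max_length=max_length)
    # Convert token ids back to text, dropping special tokens.
    texts = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return texts[0]
```

A call such as `transcribe(audio_array, processor, model, device=device)` would then return a single string, where `audio_array` is a hypothetical 1-D array of samples you have loaded yourself.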

With Fairseq

import torch
from artst.tasks.artst import ArTSTTask
from artst.models.artst import ArTSTTransformerModel

# Load the fairseq checkpoint (weights plus the stored training config).
checkpoint = torch.load('checkpoint.pt')

# Select the task and point the config at the local data folder.
checkpoint['cfg']['task'].t5_task = 't2s' # or "s2t" for asr
checkpoint['cfg']['task'].data = 'path-to-folder-with-checkpoints'
task = ArTSTTask.setup_task(checkpoint['cfg']['task'])

# Build the model from the stored config, then restore its weights.
model = ArTSTTransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])

Data Preparation

Speech

For pretraining, follow the steps for preparing the wav2vec 2.0 manifest here and the HuBERT labels here.
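
For orientation, a wav2vec 2.0 style manifest in fairseq is a TSV file whose first line is the audio root directory and whose remaining lines hold a relative path and a sample count separated by a tab. The sketch below writes such a file using only the standard library, so it handles uncompressed WAV files only; fairseq's own script reads audio through `soundfile`. The function name `write_manifest` is hypothetical.

```python
import os
import wave


def write_manifest(audio_root, manifest_path, ext=".wav"):
    """Write a wav2vec 2.0 style manifest and return the number of files.

    Line 1 is the root directory; each later line is
    `relative_path<TAB>num_samples`.
    """
    audio_root = os.path.abspath(audio_root)
    rows = []
    for dirpath, _dirs, files in os.walk(audio_root):
        for name in files:
            if not name.lower().endswith(ext):
                continue
            full_path = os.path.join(dirpath, name)
            # The frame count of a PCM WAV file is its number of samples.
            with wave.open(full_path, "rb") as wav_file:
                num_samples = wav_file.getnframes()
            rows.append((os.path.relpath(full_path, audio_root), num_samples))
    rows.sort()  # deterministic order across runs
    with open(manifest_path, "w", encoding="utf-8") as out:
        out.write(audio_root + "\n")
        for rel_path, num_samples in rows:
            out.write(f"{rel_path}\t{num_samples}\n")
    return len(rows)
```

For example, `write_manifest("data/wavs", "data/train.tsv")` would list every `.wav` file under the hypothetical `data/wavs` folder.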

For finetuning on the TTS task, the speech manifest file needs an extra column for the speaker embedding. We use SpeechBrain to generate the speaker embeddings. Here is a sample DATA_ROOT folder structure that contains manifest samples.
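
As a rough illustration of generating a speaker embedding with SpeechBrain, the sketch below uses `EncoderClassifier.encode_batch`. The x-vector model `speechbrain/spkrec-xvect-voxceleb` and the L2 normalization are assumptions borrowed from common SpeechT5 examples; this README does not state which encoder or post-processing was used, and how the result is stored in the manifest column is not shown here. Both helper names are hypothetical.

```python
import math


def load_xvector_encoder(source="speechbrain/spkrec-xvect-voxceleb", device="cpu"):
    """Load a pretrained SpeechBrain speaker encoder (model choice is an assumption)."""
    try:
        # SpeechBrain >= 1.0
        from speechbrain.inference.speaker import EncoderClassifier
    except ImportError:
        # Older SpeechBrain releases
        from speechbrain.pretrained import EncoderClassifier
    return EncoderClassifier.from_hparams(source=source, run_opts={"device": device})


def speaker_embedding(waveform, encoder):
    """Return an L2-normalized speaker embedding as a list of floats.

    `waveform` is a batch with one utterance, shape (1, num_samples),
    sampled at the rate the encoder expects (16 kHz for the model above).
    """
    # encode_batch returns a tensor of shape (batch, 1, embedding_dim).
    values = encoder.encode_batch(waveform).detach().squeeze().tolist()
    # Guard against an all-zero vector to avoid division by zero.
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]
```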

Text

Pretrain:

Please use fairseq-preprocess to generate the index and bin files for the text data. We use sentencepiece to pre-process the text, and we provide our SPM models and dictionary in this repo. First process the text with the SPM model, then run fairseq-preprocess with the provided dictionary to get the index and bin files. Note that after SPM processes the sentences, the resulting text should have individual characters separated by spaces.
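
The SPM step described above might be sketched as follows, using the `sentencepiece` Python package. The helper names and the file name in the usage note are hypothetical; the only behavior assumed is that the provided SPM model splits text into pieces which are then joined with single spaces.

```python
def load_spm(model_path):
    """Load a SentencePiece model from disk."""
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)


def spm_encode_lines(lines, sp):
    """Encode each non-empty line into space-separated SentencePiece pieces."""
    encoded = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # out_type=str returns the pieces as strings instead of ids.
        pieces = sp.encode(line, out_type=str)
        encoded.append(" ".join(pieces))
    return encoded
```

With `sp = load_spm("spm_model.model")` (a placeholder file name), the encoded lines can be written to a text file and passed to fairseq-preprocess, for instance with its `--only-source`, `--trainpref`, `--srcdict`, and `--destdir` options.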

For finetuning, a simple text file with the corresponding text on each line suffices. See here for a sample manifest. Normalize the texts as we did for training/evaluation using this script.

Training

The bash files contain the parameters and hyperparameters used for pretraining and finetuning. Find more details on the training arguments here.

Pretrain

bash /scripts/pretrain/train.sh

Finetune

ASR

bash /scripts/ASR/finetune.sh

TTS

bash /scripts/TTS/finetune.sh

Inference

ASR

bash /scripts/ASR/inference.sh

TTS

bash /scripts/TTS/inference.sh

Acknowledgements

ArTST is built on the SpeechT5 architecture. If you use any of the ArTST models, please cite:

@inproceedings{toyin2023artst,
  title={ArTST: Arabic Text and Speech Transformer},
  author={Toyin, Hawau and Djanibekov, Amirbek and Kulkarni, Ajinkya and Aldarmaki, Hanan},
  booktitle={Proceedings of ArabicNLP 2023},
  pages={41--51},
  year={2023}
}