Readme [5/N] #34

Merged
merged 31 commits, Sep 18, 2024
Changes from 14 commits
134 changes: 120 additions & 14 deletions README.md
@@ -1,9 +1,85 @@
# moshi
# Moshi: a speech-text foundation model for real-time dialogue

![precommit badge](https://github.com/kyutai-labs/moshi/workflows/precommit/badge.svg)
![rust ci badge](https://github.com/kyutai-labs/moshi/workflows/rust_ci/badge.svg)

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).
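
To see where the 1.1 kbps figure comes from, here is a quick back-of-the-envelope sketch. The codebook count and size used below (8 codebooks of 2048 entries each) are assumptions for illustration, not values stated in this README:

```python
import math

# Assumed Mimi configuration, for illustration only.
frame_rate_hz = 12.5    # frames per second
num_codebooks = 8       # codebooks kept per frame
codebook_size = 2048    # entries per codebook -> 11 bits per token

bits_per_token = math.log2(codebook_size)        # 11.0
bits_per_frame = num_codebooks * bits_per_token  # 88.0
bitrate_bps = frame_rate_hz * bits_per_frame     # 1100.0 bits/s

print(f"bitrate: {bitrate_bps / 1000:.2f} kbps")          # ~1.10 kbps
print(f"frame size: {1000 / frame_rate_hz:.0f} ms")       # 80 ms
```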

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the user's stream is taken from the audio input,
while Moshi's own stream is sampled from the model's output. Alongside the audio, Moshi predicts text tokens corresponding to its own speech,
which greatly improves the quality of its generation. A small depth Transformer models inter-codebook dependencies within a given time step,
while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.
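
The latency figure is simple arithmetic on the numbers above; the sketch below just spells it out:

```python
frame_ms = 80            # Mimi frame size (12.5 Hz)
acoustic_delay_ms = 80   # delay between Moshi's text/semantic tokens and acoustic tokens

theoretical_latency_ms = frame_ms + acoustic_delay_ms
print(f"theoretical latency: {theoretical_latency_ms} ms")  # 160 ms

# On real hardware, model compute and network time come on top of this;
# the README quotes a practical overall latency as low as ~200 ms.
```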

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limit the number of auto-regressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality
despite its low bitrate.
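
To make the frame-rate argument concrete, the small sketch below compares the autoregressive steps implied by the different rates; the 24 kHz input sample rate is an assumption for illustration:

```python
sample_rate = 24_000          # Hz, assumed input sample rate for illustration
mimi_frame_rate = 12.5        # Hz
other_codec_frame_rate = 50   # Hz, e.g. the codecs cited above
text_token_rate = 3.5         # Hz, rough average rate of text tokens

hop_length = sample_rate / mimi_frame_rate
print(f"samples per Mimi frame: {hop_length:.0f} "
      f"({1000 / mimi_frame_rate:.0f} ms at {sample_rate} Hz)")  # 1920 samples, 80 ms

seconds = 10
for name, rate in [("Mimi (12.5 Hz)", mimi_frame_rate), ("50 Hz codec", other_codec_frame_rate)]:
    print(f"{name}: {rate * seconds:.0f} autoregressive steps for {seconds} s of audio")
print(f"text tokens over the same span: ~{text_token_rate * seconds:.0f}")
```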

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>


## Organisation of the repository

There are three separate versions of the moshi inference stack in this repo.
- The python version using PyTorch is in the `moshi` directory.
- The python version using MLX is in the `moshi_mlx` directory.
- The rust version used in production is in the `rust` directory.
- The python version using PyTorch is in the [`moshi/`](moshi/) directory.
- The python version using MLX for M series Macs is in the [`moshi_mlx/`](moshi_mlx/) directory.
- The rust version used in production is in the [`rust/`](rust/) directory.

Finally, the code for the live demo is provided in the [`client/`](client/) directory.

## Requirements

You will need at least Python 3.10. For the Rust backend, you will need a recent version of
the [Rust toolchain](https://rustup.rs/). For specific requirements, please check the individual backend
directories. You can install the PyTorch and MLX clients with the following:

```bash
pip install moshi # moshi PyTorch, from PyPI
pip install moshi_mlx # moshi MLX, from PyPI
# Or the bleeding edge versions for Moshi and Moshi-MLX.
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization
for the PyTorch version, so you will need a GPU with a significant amount of memory (24 GB).
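
If you are unsure whether your GPU is large enough, a quick check with plain PyTorch (nothing Moshi-specific) can save a failed run:

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU detected; the unquantized PyTorch backend will not run.")
else:
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB of VRAM")
    if total_gb < 24:
        print("Likely not enough memory for the unquantized PyTorch checkpoint (~24 GB recommended).")
```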


## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the root of the clone of the repo
pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install
```

## Python (PyTorch)

@@ -15,18 +91,25 @@ run the model, you can then use either the web UI or a command line client.

Start the server with:
```bash
PYTHONPATH=moshi python -m moshi.server
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998).

If the server is running on a remote box, you may want to forward the 8998 port
via your ssh connection so as to be able to access the web UI locally.
And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note however that, unlike the web browser client, this one is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Python (MLX) for local inference on macOS

You can either compile and install the `rustymimi` extension or install it via
@@ -40,13 +123,14 @@ maturin dev -r -m rust/mimi-pyo3/Cargo.toml

Then the model can be run with:
```bash
PYTHONPATH=moshi_mlx python -m moshi_mlx.local \
--model ~/tmp/moshiko_mlx_301e30bf@120.q8.safetensors \
--mimi ~/tmp/tokenizer-e351c8d8-checkpoint125.safetensors \
--quantized 8
python -m moshi_mlx.local -q 4 # weights quantized to 4 bits
python -m moshi_mlx.local -q 8 # weights quantized to 8 bits
```

This uses a command line interface, alternatively you can use `local_web` to use
This uses a command line interface, which is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

Alternatively you can use `python -m moshi_mlx.local_web` to use
the web UI; the connection is via http on [localhost:8998](http://localhost:8998).

## Rust
@@ -102,3 +186,25 @@ npm run build
```

The web UI can then be found in the `client/dist` directory.

## License

The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.
The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
23 changes: 23 additions & 0 deletions client/LICENSE
@@ -0,0 +1,23 @@
Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
17 changes: 17 additions & 0 deletions client/README.md
@@ -14,3 +14,20 @@ Frontend for the demo.
## Skipping the queue
To skip the queue for standalone use, once the project is running go to `/?worker_addr={WORKER_ADDR}` where `WORKER_ADDR` is your worker instance address.
For example: `https://localhost:5173/?worker_addr=0.0.0.0:8088`

## License

The present code is provided under the MIT license.

## Citation

If you use this client as part of your research, please cite the following paper.

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```
Binary file added mimi.png
Binary file added moshi.png
File renamed without changes.
7 changes: 7 additions & 0 deletions moshi/MANIFEST.in
@@ -0,0 +1,7 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi/py.typed
include mimi.png
include moshi.png
132 changes: 131 additions & 1 deletion moshi/README.md
@@ -1 +1,131 @@
# moshi - pytorch
# Moshi - PyTorch

See the [top-level README.md](../README.md) for more information.
This provides the PyTorch implementation.

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the user's stream is taken from the audio input,
while Moshi's own stream is sampled from the model's output. Alongside the audio, Moshi predicts text tokens corresponding to its own speech,
which greatly improves the quality of its generation. A small depth Transformer models inter-codebook dependencies within a given time step,
while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limit the number of auto-regressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality
despite its low bitrate.

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>


## Requirements

You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
It was tested with PyTorch 2.2 or 2.4. If you need a specific CUDA version, please make sure
to have PyTorch properly installed before installing Moshi.

```bash
pip install moshi # moshi PyTorch, from PyPI
# Or the bleeding edge versions for Moshi
pip install -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).


## Usage

This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).
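
As a rough illustration of what a streaming-tokenizer workflow might look like, here is a minimal sketch. The `loaders` helpers, repository constants, and method names below are assumptions about this package's layout and may differ from the actual API; treat it as a starting point rather than a reference:

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders  # assumed module layout

# Download and load the Mimi codec weights (repo/file constants assumed for illustration).
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi uses the first 8 codebooks

wav = torch.randn(1, 1, mimi.sample_rate * 10)  # [batch, channels=1, samples]
with torch.no_grad():
    codes = mimi.encode(wav)          # discrete tokens at 12.5 Hz, [batch, 8, frames]
    reconstructed = mimi.decode(codes)
```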

In order to run in interactive mode, you need to start a server which will
run the model; you can then use either the web UI or a command line client.

Start the server with:
```bash
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note however that, unlike the web browser client, this one is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the current folder (e.g. `moshi/`)
pip install -e '.[dev]'
pre-commit install
```

Once locally installed, Mimi can be tested with the following command, from **the root** of the repository,
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python scripts/mimi_test.py
```

Similarly, Moshi can be tested (with a GPU) with
```bash
python scripts/moshi_benchmark.py
```


## License

The present code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
Binary file added moshi/mimi.png
Binary file added moshi/moshi.png
2 changes: 1 addition & 1 deletion moshi/moshi/server.py
@@ -247,7 +247,7 @@ async def handle_root(_):
if setup_tunnel is not None:
tunnel = setup_tunnel('localhost', args.port, tunnel_token, None)
log("info", f"Tunnel started, if executing on a remote GPU, you can use {tunnel}.")
log("info", f"Note that this tunnel goes through the US and you might experience high latency in Europe.")
log("info", "Note that this tunnel goes through the US and you might experience high latency in Europe.")
web.run_app(app, port=args.port)


15 changes: 0 additions & 15 deletions moshi/moshi/testing.md

This file was deleted.

4 changes: 4 additions & 0 deletions moshi/setup.cfg
@@ -4,3 +4,7 @@ max-line-length = 120
[flake8]
max-line-length = 120
ignore = E203,E704
exclude =
dist
build
