Readme [5/N] #34

Merged
merged 31 commits, Sep 18, 2024
Changes from 14 commits
134 changes: 120 additions & 14 deletions README.md
@@ -1,9 +1,85 @@
# moshi
# Moshi: a speech-text foundation model for real-time dialogue

![precommit badge](https://github.com/kyutai-labs/moshi/workflows/precommit/badge.svg)
![rust ci badge](https://github.com/kyutai-labs/moshi/workflows/rust_ci/badge.svg)

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).
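
To see where the 1.1 kbps figure comes from, here is a quick back-of-the-envelope sketch. The codebook count and size used below (8 codebooks of 2048 entries each) are assumptions for illustration, not values stated in this README:

```python
import math

# Assumed Mimi configuration, for illustration only.
frame_rate_hz = 12.5    # frames per second
num_codebooks = 8       # codebooks kept per frame
codebook_size = 2048    # entries per codebook -> 11 bits per token

bits_per_token = math.log2(codebook_size)        # 11.0
bits_per_frame = num_codebooks * bits_per_token  # 88.0
bitrate_bps = frame_rate_hz * bits_per_frame     # 1100.0 bits/s

print(f"bitrate: {bitrate_bps / 1000:.2f} kbps")          # ~1.10 kbps
print(f"frame size: {1000 / frame_rate_hz:.0f} ms")       # 80 ms
```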

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the user's stream is taken from the audio input,
while Moshi's own stream is sampled from the model's output. Alongside the audio, Moshi predicts text tokens corresponding to its own speech,
which greatly improves the quality of its generation. A small depth Transformer models inter-codebook dependencies within a given time step,
while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.
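
The latency figure is simple arithmetic on the numbers above; the sketch below just spells it out:

```python
frame_ms = 80            # Mimi frame size (12.5 Hz)
acoustic_delay_ms = 80   # delay between Moshi's text/semantic tokens and acoustic tokens

theoretical_latency_ms = frame_ms + acoustic_delay_ms
print(f"theoretical latency: {theoretical_latency_ms} ms")  # 160 ms

# On real hardware, model compute and network time come on top of this;
# the README quotes a practical overall latency as low as ~200 ms.
```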

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limit the number of auto-regressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality
despite its low bitrate.
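
To make the frame-rate argument concrete, the small sketch below compares the autoregressive steps implied by the different rates; the 24 kHz input sample rate is an assumption for illustration:

```python
sample_rate = 24_000          # Hz, assumed input sample rate for illustration
mimi_frame_rate = 12.5        # Hz
other_codec_frame_rate = 50   # Hz, e.g. the codecs cited above
text_token_rate = 3.5         # Hz, rough average rate of text tokens

hop_length = sample_rate / mimi_frame_rate
print(f"samples per Mimi frame: {hop_length:.0f} "
      f"({1000 / mimi_frame_rate:.0f} ms at {sample_rate} Hz)")  # 1920 samples, 80 ms

seconds = 10
for name, rate in [("Mimi (12.5 Hz)", mimi_frame_rate), ("50 Hz codec", other_codec_frame_rate)]:
    print(f"{name}: {rate * seconds:.0f} autoregressive steps for {seconds} s of audio")
print(f"text tokens over the same span: ~{text_token_rate * seconds:.0f}")
```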

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>


## Organisation of the repository

There are three separate versions of the moshi inference stack in this repo.
- The python version using PyTorch is in the `moshi` directory.
- The python version using MLX is in the `moshi_mlx` directory.
- The rust version used in production is in the `rust` directory.
- The python version using PyTorch is in the [`moshi/`](moshi/) directory.
- The python version using MLX for M series Macs is in the [`moshi_mlx/`](moshi_mlx/) directory.
- The rust version used in production is in the [`rust/`](rust/) directory.

Finally, the code for the live demo is provided in the [`client/`](client/) directory.

## Requirements

You will need at least Python 3.10. For the Rust backend, you will need a recent version of
the [Rust toolchain](https://rustup.rs/). For specific requirements, please check the individual backend
directories. You can install the PyTorch and MLX clients with the following:

```bash
pip install moshi # moshi PyTorch, from PyPI
pip install moshi_mlx # moshi MLX, from PyPI
# Or the bleeding edge versions for Moshi and Moshi-MLX.
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization
for the PyTorch version, so you will need a GPU with a significant amount of memory (24 GB).
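
If you are unsure whether your GPU is large enough, a quick check with plain PyTorch (nothing Moshi-specific) can save a failed run:

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU detected; the unquantized PyTorch backend will not run.")
else:
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB of VRAM")
    if total_gb < 24:
        print("Likely not enough memory for the unquantized PyTorch checkpoint (~24 GB recommended).")
```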


## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the root of the clone of the repo
pip install -e 'moshi[dev]'
pip install -e 'moshi_mlx[dev]'
pre-commit install
```

## Python (PyTorch)

@@ -15,18 +91,25 @@ run the model, you can then use either the web UI or a command line client.

Start the server with:
```bash
PYTHONPATH=moshi python -m moshi.server
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998).

If the server is running on a remote box, you may want to forward the 8998 port
via your ssh connection so as to be able to access the web UI locally.
And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note however that, unlike the web browser client, this one is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Python (MLX) for local inference on macOS

You can either compile and install the `rustymimi` extension or install it via
@@ -40,13 +123,14 @@ maturin dev -r -m rust/mimi-pyo3/Cargo.toml

Then the model can be run with:
```bash
PYTHONPATH=moshi_mlx python -m moshi_mlx.local \
--model ~/tmp/moshiko_mlx_301e30bf@120.q8.safetensors \
--mimi ~/tmp/tokenizer-e351c8d8-checkpoint125.safetensors \
--quantized 8
python -m moshi_mlx.local -q 4 # weights quantized to 4 bits
python -m moshi_mlx.local -q 8 # weights quantized to 8 bits
```

This uses a command line interface, alternatively you can use `local_web` to use
This uses a command line interface, which is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

Alternatively you can use `python -m moshi_mlx.local_web` to use
the web UI; the connection is via http on [localhost:8998](http://localhost:8998).

## Rust
@@ -102,3 +186,25 @@ npm run build
```

The web UI can then be found in the `client/dist` directory.

## License

The present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.
The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
23 changes: 23 additions & 0 deletions client/LICENSE
@@ -0,0 +1,23 @@
Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
17 changes: 17 additions & 0 deletions client/README.md
@@ -14,3 +14,20 @@ Frontend for the demo.
## Skipping the queue
To skip the queue for standalone use, once the project is running go to `/?worker_addr={WORKER_ADDR}` where `WORKER_ADDR` is your worker instance address.
For example: `https://localhost:5173/?worker_addr=0.0.0.0:8088`

## License

The present code is provided under the MIT license.

## Citation

If you use this client as part of your research, please cite the following paper.

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```
Binary file added mimi.png
Binary file added moshi.png
File renamed without changes.
7 changes: 7 additions & 0 deletions moshi/MANIFEST.in
@@ -0,0 +1,7 @@
include LICENSE*
include *.md
include *.cfg
include requirements.txt
include moshi/py.typed
include mimi.png
include moshi.png
132 changes: 131 additions & 1 deletion moshi/README.md
@@ -1 +1,131 @@
# moshi - pytorch
# Moshi - PyTorch

See the [top-level README.md](../README.md) for more information.
This provides the PyTorch implementation.

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz, and compresses
audio down to 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),
yet performs better than existing non-streaming codecs such as
[SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) (50 Hz, 4 kbps) or [SemantiCodec](https://github.com/haoheliu/SemantiCodec-inference) (50 Hz, 1 kbps).

Moshi models **two streams of audio**: one corresponds to Moshi, and one to the user.
At inference, the user's stream is taken from the audio input,
while Moshi's own stream is sampled from the model's output. Alongside the audio, Moshi predicts text tokens corresponding to its own speech,
which greatly improves the quality of its generation. A small depth Transformer models inter-codebook dependencies within a given time step,
while a large, 7B-parameter Transformer models the temporal dependencies. Moshi achieves a theoretical latency
of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms.
[Talk to Moshi](https://moshi.chat) now on our live demo.

<p align="center">
<img src="./moshi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>

Mimi builds on previous neural audio codecs such as [SoundStream](https://arxiv.org/abs/2107.03312)
and [EnCodec](https://github.com/facebookresearch/encodec), adding a Transformer both in the encoder and decoder,
and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the
average frame rate of text tokens (~3-4 Hz), and limit the number of auto-regressive steps in Moshi.
Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match
a self-supervised representation from [WavLM](https://arxiv.org/abs/2110.13900). Interestingly, while
Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well,
without introducing any delays. Finally, and similarly to [EBEN](https://arxiv.org/pdf/2210.14090), Mimi
uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality
despite its low bitrate.

<p align="center">
<img src="./mimi.png" alt="Schema representing the structure Moshi. Moshi models two streams of audio:
one corresponds to Moshi, and one to the user. At inference, the one from the user is taken from the audio input,
and the one for Moshi is sampled from. Along that, Moshi predicts text tokens corresponding to its own speech
for improved accuracy. A small depth transformer models inter codebook dependencies for a given step."
width="800px"></p>


## Requirements

You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
It was tested with PyTorch 2.2 or 2.4. If you need a specific CUDA version, please make sure
to have PyTorch properly installed before installing Moshi.

```bash
pip install moshi # moshi PyTorch, from PyPI
# Or the bleeding edge versions for Moshi
pip install -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).


## Usage

This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).
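
As a rough illustration of what a streaming-tokenizer workflow might look like, here is a minimal sketch. The `loaders` helpers, repository constants, and method names below are assumptions about this package's layout and may differ from the actual API; treat it as a starting point rather than a reference:

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders  # assumed module layout

# Download and load the Mimi codec weights (repo/file constants assumed for illustration).
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi uses the first 8 codebooks

wav = torch.randn(1, 1, mimi.sample_rate * 10)  # [batch, channels=1, samples]
with torch.no_grad():
    codes = mimi.encode(wav)          # discrete tokens at 12.5 Hz, [batch, 8, frames]
    reconstructed = mimi.decode(codes)
```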

In order to run in interactive mode, you need to start a server which will
run the model; you can then use either the web UI or a command line client.

Start the server with:
```bash
python -m moshi.server [--gradio_tunnel]
```

And then access the web UI on [localhost:8998](http://localhost:8998). If your GPU is on a distant machine
with no direct access, `--gradio_tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
Alternatively, you might want to use SSH to redirect your connection.

Accessing a server that is not localhost via http may cause issues around using
the microphone in the web UI (in some browsers this is only allowed using
https).

A local client is also available, as
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note however that, unlike the web browser client, this one is bare-bones. It doesn't do any echo cancellation,
nor does it try to compensate for a growing lag by skipping frames.

## Development

If you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:
```bash
# From the current folder (e.g. `moshi/`)
pip install -e '.[dev]'
pre-commit install
```

Once locally installed, Mimi can be tested with the following command, from **the root** of the repository,
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python scripts/mimi_test.py
```

Similarly, Moshi can be tested (with a GPU) with
```bash
python scripts/moshi_benchmark.py
```


## License

The present code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@article{defossez2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
journal={arXiv:TBC},
year={2024},
}
```

[moshi]: https://arxiv.org/
Binary file added moshi/mimi.png
Binary file added moshi/moshi.png
2 changes: 1 addition & 1 deletion moshi/moshi/server.py
@@ -247,7 +247,7 @@ async def handle_root(_):
if setup_tunnel is not None:
tunnel = setup_tunnel('localhost', args.port, tunnel_token, None)
log("info", f"Tunnel started, if executing on a remote GPU, you can use {tunnel}.")
log("info", f"Note that this tunnel goes through the US and you might experience high latency in Europe.")
log("info", "Note that this tunnel goes through the US and you might experience high latency in Europe.")
web.run_app(app, port=args.port)


15 changes: 0 additions & 15 deletions moshi/moshi/testing.md

This file was deleted.

4 changes: 4 additions & 0 deletions moshi/setup.cfg
@@ -4,3 +4,7 @@ max-line-length = 120
[flake8]
max-line-length = 120
ignore = E203,E704
exclude =
dist
build
