v0.3
Hey everyone,
We're officially making Ultravox 0.3 available today. The weights have been pushed to Hugging Face (along with updated datasets for training), and the model training code has been updated as well. We’re also opening up early preview access to our Ultravox APIs through our managed service. For more information on that, please go here: https://fixie-ai.github.io/ultradox/
v0.3 demonstrates substantially improved speech understanding. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher is better):
| | Ultravox 0.2 | Ultravox 0.3 |
|---|---|---|
| en_de | 12.07 | 22.68 |
| es_en | 15.17 | 24.10 |
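For readers unfamiliar with BLEU, here is a minimal sketch of how a corpus-level score on the 0-100 scale is computed. It is illustrative only and is not the evaluation code used for the numbers above: it assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so it will not reproduce the scores of a standard BLEU tool. The function names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis.

    Assumption: whitespace tokenization and no smoothing, so results
    differ from standard BLEU tooling.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(f"BLEU: {corpus_bleu(hyps, refs):.2f}")
```

In a speech translation setting, each hypothesis would be the model's text output for an audio clip and each reference the human translation of that clip.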
This version of Ultravox uses a frozen Llama 3.1 8B pre-trained core. The speech adapter was trained on 2.5k hours of speech from both LibriSpeech and CommonVoice. The training time on 8xH100s is roughly 80 minutes. We expect to increase the size of our training sets by 1-2 orders of magnitude over the next few months. For comparison, 0.2 was trained on ~1.5k hours of audio.
In addition to increasing the overall size of the training set, v0.3 also introduces two other important changes. The first is that we’re augmenting the ASR datasets with synthetic data in the form of generated continuations. The second is that we’ve migrated to a Knowledge Distillation approach for calculating loss. Together, these changes result in much better speech-to-text alignment in the adapter. You can learn more in their respective papers.
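To make the Knowledge Distillation idea concrete, here is a minimal pure-Python sketch of a KL-divergence loss between two next-token distributions. It is not the Ultravox training code. It assumes the teacher logits come from the frozen LLM reading the text transcript and the student logits come from the same LLM reading the audio through the adapter; that pairing, the KL direction, the `temperature` parameter, and the names `log_softmax` and `kd_loss` are all assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    Each argument is a list of per-position logit lists over the same
    vocabulary. Assumption: teacher = text input, student = audio input.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")

    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        total += sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))
    return total / len(teacher_logits)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [0.0, 0.5, 2.0]]
    print(f"KD loss: {kd_loss(teacher, student):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution at every position, and positive otherwise, which is why minimizing it pulls the audio-conditioned predictions toward the text-conditioned ones.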
The key benefit of better adapter alignment is easier customization: Ultravox can extend any pre-trained LLM (including fine-tuned versions) with speech capabilities while retaining that model's core capabilities across modalities. If this is something that interests you, please get in touch.
We’d love feedback on the model, so please let us know what works well and what doesn’t. To make testing easier, we built a new Gradio demo. To run it, simply run `just gradio` inside of the Ultravox folder.
What's Changed
- Remove legacy directory by @farzadab in #1
- Improved Evaluations by @farzadab in #2
- Audio Encoder to bfloat16 by @farzadab in #4
- Whisper encoder + No 30 second padding by @farzadab in #5
- Optionally include "passage" in BoolQ samples by @juberti in #6
- Add tts_tool, for converting a HF dataset to audio by @juberti in #12
- Add logging code by @juberti in #19
- Fixes the default HF model name by @cezarc1 in #13
- Update Hugging Face link by @simonw in #17
- Don't run tests on docs changes by @juberti in #21
- Local tokenizer and processor for more consistent CI by @farzadab in #16
- Tool for uploading to HF Hub by @farzadab in #15
- Remove mlflow dependency by @juberti in #23
- Switch from Pip to Poetry by @juberti in #24
- Tool for adding new synthetic columns by @farzadab in #14
- entails -> provides a rationale for by @farzadab in #27
- Add @file syntax to ds_tool by @juberti in #28
- datasets: Handle converting `int16` audio data in `VoiceSample`. by @shaper in #26
- Allow for toggling training and eval on/off by @farzadab in #29
- Add Eleven and Fireworks support to ds_tool by @juberti in #31
- Don't fail basic inference due to missing OAI key by @juberti in #34
- BoolQ for Training and Eval by @farzadab in #30
- Extending `ds_tool` for SODA conversational dataset by @farzadab in #32
- Add streaming support, using HF TextStreamer by @juberti in #46
- Minor fixes to ds_tool and infer_tool by @juberti in #36
- SODA Dataset for Training by @farzadab in #35
- HF pipeline to run Ultravox independent of Ultravox repo by @farzadab in #49
- Runs Tags for filtering by @farzadab in #51
- More validations by @farzadab in #48
- CoVoST 2 dataset by @farzadab in #53
- Speech Translation Evals by @farzadab in #54
- Update ds_tool.py by @zqhuang211 in #52
- Llama3.1 by @farzadab in #56
- Make so infer_tools works with a single arg for filename by @cdiddy77 in #55
- HF Model loading fixes by @farzadab in #59
- Separate files for eval logs by @farzadab in #61
- Add "without any explanation" to ST prompt by @farzadab in #60
- Support KL loss by @zqhuang211 in #63
- [ds_tool] Tools with Audio by @farzadab in #62
- Add basic data_processing test by @juberti in #64
- Add generic dataset by @zqhuang211 in #67
- Fix TypeError: non-default argument 'template' follows default argument, and filter out audio by @liPatrick in #69
- Filter out audio in map sample by @liPatrick in #72
- Update caching to use prefix by @liPatrick in #76
- Add weighted sampling in InterleaveDataset by @zqhuang211 in #70
- Update default config to ultravox_v0.3 by @zqhuang211 in #84
New Contributors
- @cezarc1 made their first contribution in #13
- @simonw made their first contribution in #17
- @shaper made their first contribution in #26
- @cdiddy77 made their first contribution in #55
Full Changelog: https://github.com/fixie-ai/ultravox/commits/v0.3