
Release v4.46.0

@LysandreJik released this 24 Oct 08:15

New model additions

Moshi

The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez,
Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.

Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a
text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec,
while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of
explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text
tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of
generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first
real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

Zamba

Zamba-7B-v1 is a hybrid between state-space models (specifically Mamba) and transformers, and was trained using
next-token prediction. Zamba applies a shared transformer layer after every 6 Mamba blocks. It uses the Mistral
v0.1 tokenizer. We arrived at this architecture after a series of ablations at small scales. Zamba-7B-v1 was
pre-trained on 1T tokens of text and code data.
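
As a rough illustration of the layout described above, the sketch below builds the order in which blocks would run when one shared transformer layer is applied after every 6 Mamba blocks. The function name, the block labels and the total of 12 Mamba blocks are assumptions made for the example; they are not taken from the Zamba implementation.

```python
from typing import List


def zamba_layer_schedule(num_mamba_blocks: int, shared_every: int = 6) -> List[str]:
    """Return the block order: the shared transformer layer runs after every `shared_every` Mamba blocks."""
    schedule = []
    for i in range(1, num_mamba_blocks + 1):
        schedule.append(f"mamba_{i}")
        if i % shared_every == 0:
            # The same layer (same weights) is reused at every call site.
            schedule.append("shared_transformer")
    return schedule


# Hypothetical depth of 12 Mamba blocks, chosen only to keep the output short.
schedule = zamba_layer_schedule(12)
print(schedule)
```

Because the transformer layer is shared, it adds parameters once regardless of how many times it appears in the schedule.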

GLM

The GLM model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team,
THUDM & ZhipuAI.

The abstract from the paper starts with the following:

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This
report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.

Idefics 3

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

  • It uses Llama3 for the text model.
  • It uses an updated processing logic for the images.
  • It removes the perceiver.

PhiMoE

The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

This model is very similar to Mixtral. The main difference is Phi3LongRoPEScaledRotaryEmbedding, which is
used to extend the context of the rotary embeddings. The query, key and value projections are fused, and the
MLP’s up and gate projection layers are also fused.
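
To make "fused" concrete, here is a minimal sketch in plain Python, assuming tiny made-up weight matrices: the query, key and value weights are stacked into one matrix, a single matrix-vector product is computed, and the result is sliced back into the three parts. The names `w_q`, `w_k`, `w_v` and `matvec` are illustrative only and do not come from the PhiMoE code.

```python
from typing import List

Matrix = List[List[float]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical projections: hidden size 2, each of q/k/v produces 2 features.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

# Fusing stacks the rows, so one product yields q, k and v concatenated.
w_qkv = w_q + w_k + w_v

x = [3.0, 4.0]
fused = matvec(w_qkv, x)
q, k, v = fused[0:2], fused[2:4], fused[4:6]
print(q, k, v)
```

The fused product gives the same values as three separate projections while issuing one larger matrix multiplication instead of three smaller ones.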

Watermarking

This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of making the watermark harder to detect). The release also comes with the code to train and run the corresponding detector, which is itself a machine learning model.

from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)

Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor
Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector

  • Add SynthID (watermerking by Google DeepMind) by @gante in #34350

Quantization

BitNet

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces the traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinear that use ternary precision (or even binary, in the initial version).

  • FEAT : Adding BitNet quantization method to HFQuantizer by @MekkCyber in #33410
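
The 1.58 figure is log2(3), the information content of a three-valued parameter. The sketch below shows one way to map floating-point weights to {-1, 0, 1}, assuming an "absmean" scheme (scale by the mean absolute weight, round, then clip). This is an illustrative assumption and is not necessarily what the HFQuantizer integration does internally; the function name and sample weights are made up.

```python
import math
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Quantize weights to {-1, 0, 1}, returning the ternary values and the scale."""
    # Assumed scheme: scale by the mean absolute value of the weights.
    scale = max(sum(abs(w) for w in weights) / len(weights), eps)
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


weights = [0.9, -0.05, -1.3, 0.4, 0.0, -0.6]  # hypothetical weights
quantized, scale = ternary_quantize(weights)
bits_per_param = math.log2(3)

print(quantized)                 # [1, 0, -1, 1, 0, -1]
print(round(bits_per_param, 2))  # 1.58
```

Multiplying the ternary values by `scale` gives a coarse approximation of the original weights.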

GGUF loading in transformers

More architectures are now supported in our GGUF loader; GGUF files saved with these architectures can now
be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize
the models after further training has been done.
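
A minimal sketch of loading such a file, assuming the `gguf_file` keyword of `from_pretrained` described in the Transformers GGUF documentation. The helper name `load_gguf_checkpoint` is made up for this example, and both arguments are placeholders to be replaced with a real repository and file name.

```python
def load_gguf_checkpoint(repo_id: str, gguf_filename: str):
    """Load a GGUF checkpoint as a regular PyTorch model plus its tokenizer.

    `repo_id` is a Hub repository (or local directory) that contains the
    `.gguf` file named `gguf_filename`. Both are supplied by the caller.
    """
    # Imported lazily so defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_filename)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_filename)
    return tokenizer, model
```

The returned model can then be fine-tuned like any other checkpoint and saved with `save_pretrained`, after which llama.cpp tooling can convert it back to GGUF.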

Notable improvements and additions

Pipeline API synchronisation

We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.

Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!

ExecuTorch compatibility

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

We are collaborating with the ExecuTorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.

Gradient accumulation bugfix

  • Fix Gradient Accumulation issue by @ArthurZucker in #34191
  • Enable users to use their own loss functions + deal with prefetching for grad accum by @muellerzr in #34198
  • Enable Gradient Accumulation fix across all models + trainer fully in forward() by @muellerzr in #34283
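
The PR titles above do not describe the bug itself. As we understand it, the issue is that averaging a token-level loss inside each micro-batch and then averaging across accumulation steps does not match the loss of one large batch when micro-batches contain different numbers of tokens. The arithmetic sketch below uses made-up per-token losses to show the mismatch and a normalization over the total token count that removes it; it is an illustration, not the Trainer code.

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical per-token losses for two micro-batches of unequal length.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [8.0]]

# Naive accumulation: mean per micro-batch, then mean over accumulation steps.
naive = mean([mean(mb) for mb in micro_batches])  # (2.0 + 8.0) / 2

# Reference: one large batch, mean over every token.
reference = mean([loss for mb in micro_batches for loss in mb])  # 16.0 / 5

# Fix: sum per micro-batch and divide by the total token count.
total_tokens = sum(len(mb) for mb in micro_batches)
fixed = sum(sum(mb) for mb in micro_batches) / total_tokens

print(naive, reference, fixed)  # 5.0 3.2 3.2
```

In the naive version the single token in the short micro-batch carries as much weight as the four tokens in the long one, which is why the two results diverge.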

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @manuelsh
    • adding positional encoder changes and tests (#32600)
  • @ArthurZucker
    • [MllamaProcessor] Update errors and API with multiple image (#33715)
    • [clean_up_tokenization_spaces] Pl bart was failing, updating (#33735)
    • [MllamaImageProcessing] Update doc (#33747)
    • [modular] fixes! (#33820)
    • add setter for trainer processor (#33911)
    • [PR run-slow] (#33939)
    • hot fix self.position_embeddings->self.position_embedding (#33958)
    • fix red check-copies (#33964)
    • [Red CIs] Fix hub failures (#34001)
    • properly fix and RUN_SLOW (#33965)
    • [pytes collection] Fix flax test collection (#34004)
    • Add support for all and potentilly deleting functions (#33859)
    • [Patch helper] update to not have to checkout main (#34006)
    • Add documentation for docker (#33156)
    • Fix Gradient Accumulation issue (#34191)
    • Fix-red-ci (#34230)
  • @molbap
    • Fix position embeddings singular/plural (#33678)
    • Uniformize model processors (#31368)
  • @vasqu
    • Update Albumentations Versions (#33704)
    • [TF] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
    • Mistral-related models for QnA (#34045)
  • @VladOS95-cyber
    • Add gguf support for bloom (#33473)
    • Bug fix gguf qwen2moe (#33940)
    • Add gguf support for StableLM (#33793)
    • Add gguf support for gpt2 (#34044)
    • Add GGUF for starcoder2 (#34094)
  • @ydshieh
    • Add Slow CI reminder bot (#33506)
    • post reminder comment only once (#33848)
    • Avoid using context that is not accessable from external contributors (#33866)
    • Don't run reminder bot for now (#33883)
    • Update SSH workflow file (#34084)
    • avoid many failures for ImageGPT (#34071)
    • Avoid many test failures for LlavaNextVideoForConditionalGeneration (#34070)
    • Ping team members for new failed tests in daily CI (#34171)
  • @amyeroberts
    • Repo consistency fix after #33339 (#33873)
    • Trainer - deprecate tokenizer for processing_class (#32385)
  • @ylacombe
    • [Tests] Diverse Whisper fixes (#33665)
    • Fix distil whisper segment computation (#33920)
    • [TESTS] ASR pipeline (#33925)
    • Fix DAC slow tests (#34088)
    • Moshi integration (#33624)
  • @ringohoffman
    • Remove logits.float() (#33902)
    • Default synced_gpus to True when using FullyShardedDataParallel (#33483)
    • Only cast logits to float when computing loss (#34147)
  • @garg-amit
  • @pglorio
  • @tomlimi
    • [WIP] Add Tokenizer for MyT5 Model (#31286)
  • @yijun-lee
    • 🌐 [i18n-KO] Translated gguf.md to Korean (#33764)
    • 🌐 [i18n-KO] Translated audio_utils.md to Korean (#33802)
    • 🌐 [i18n-KO] Translated esm.md to Korean (#33796)
    • 🌐 [i18n-KO] Translated time_series_utils.md to Korean (#33806)
    • 🌐 [i18n-KO] Translated pipelines_utils.md to Korean (#33809)
    • 🌐 [i18n-KO] Translated trainer.md to Korean (#33797)
    • 🌐 [i18n-KO] Translated chameleon.md to Korean (#33799)
    • 🌐 [i18n-KO] Translated gemma.md to Korean (#33936)
    • 🌐 [i18n-KO] Translated feature_extractor.md to Korean (#33775)
    • 🌐 [i18n-KO] Translated tokenization_utils.md to Korean (#33813)
    • 🌐 [i18n-KO] Translated file_utils.md to Korean (#33803)
    • 🌐 [i18n-KO] Translated openai-gpt.md to Korean (#33801)
    • 🌐 [i18n-KO] Translated biogpt.md to Korean (#33773)
    • 🌐 [i18n-KO] Translated image_processing_utils.md to Korean (#33804)
    • 🌐 [i18n-KO] Translated modular_transformers.md to Korean (#33772)
    • 🌐 [i18n-KO] Translated modeling_utils.md to Korean (#33808)
    • 🌐 [i18n-KO] Translated text_generation.md to Korean (#33777)
    • 🌐 [i18n-KO] Translated generation_utils.md to Korean (#33818)
    • 🌐 [i18n-KO] Translated gemma2.md to Korean (#33937)
    • 🌐 [i18n-KO] Translated trainer_utils.md to Korean (#33817)
  • @fabxoe
    • 🌐 [i18n-KO] Translated main_classes/quantization.md to Korean (#33959)
    • 🌐 [i18n-KO] Translated main_classes/configuration.md to Korean (#33952)
    • 🌐 [i18n-KO] Translated model_doc/mamba.md to Korean (#33626)
    • 🌐 [i18n-KO] Translated model_doc/autoformer.md to Korean (#33574)
    • 🌐 [i18n-KO] Translated model_doc/patchtsmixer.md to Korean (#33587)
    • 🌐 [i18n-KO] Translated model_doc/clip.md to Korean (#33610)
    • 🌐 [i18n-KO] Translated model_doc/paligemma.md to Korean (#33612)
    • 🌐 [i18n-KO] Translated model_doc/llama3.md to Korean (#33635)
    • 🌐 [i18n-KO] Translated model_doc/mistral.md to Korean (#33648)
    • 🌐 [i18n-KO] Translated model_doc/cohere.md to Korean (#33885)
    • 🌐 [i18n-KO] Translated model_doc/dbrx.md to Korean (#33951)
    • 🌐 [i18n-KO] Translated model_doc/deberta-v2.md to Korean (#33968)
    • 🌐 [i18n-KO] Translated main_classes/onnx.md to Korean (#33601)
    • 🌐 [i18n-KO] Translated model_doc/bart.md to Korean (#33893)
    • 🌐 [i18n-KO] Translated model_doc/deberta.md to Korean (#33967)
    • 🌐 [i18n-KO] Translated main_classes/keras_callbacks.md to Korean (#33955)
    • 🌐 [i18n-KO] Translated model_doc/mamba2.md to Korean (#33629)
    • 🌐 [i18n-KO] Translated main_classes/model.md to Korean (#33606)
    • 🌐 [i18n-KO] Translated model_doc/trajectory_transformer.md to Korean (#33597)
    • 🌐 [i18n-KO] Translated model_doc/time_series_transformer.md to Korean (#33596)
    • 🌐 [i18n-KO] Translated model_doc/informer.md to Korean (#33585)
    • 🌐 [i18n-KO] Translated model_doc/graphormer.md to Korean (#33569)
    • 🌐 [i18n-KO] Translated main_classes/data_collator.md to Korean (#33954)
    • 🌐 [i18n-KO] Translated model_doc/patchtst.md to Korean (#33589)
  • @MekkCyber
    • FEAT : Adding BitNet quantization method to HFQuantizer (#33410)
    • Fix data_seed unused (#33731)
    • Small Fix to modular converter (#34051)
  • @AhmedAlmaghz
    • Add Translate docs into Arabic - section files CONCEPTUAL GUIDES (#33982)
  • @alex-bene
    • Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies (#32550)