Releases · huggingface/transformers

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution data. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.

Distil-Whisper copies the entire encoder from Whisper, meaning it retains Whisper's robustness to different audio conditions. It only copies 2 decoder layers, which significantly reduces the time taken to auto-regressively generate text tokens:

Distil-Whisper is MIT licensed and directly available in the Transformers library with chunked long-form inference, Flash Attention 2 support, and Speculative Decoding. For details on using the model, refer to the following instructions.

Joint work from @sanchit-gandhi, @patrickvonplaten and @srush.

[Assistant Generation] Improve Encoder Decoder by @patrickvonplaten in #26701
[WhisperForCausalLM] Add WhisperForCausalLM for speculative decoding by @patrickvonplaten in #27195
[Whisper, Bart, MBart] Add Flash Attention 2 by @patrickvonplaten in #27203

Fuyu

The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.

The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.

By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.

Joint work from @molbap, @pcuenca, @amyeroberts, @ArthurZucker

Add fuyu model by @molbap in #26911
Fuyu: improve image processing by @molbap in #27007

SeamlessM4T

The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.

Add Seamless M4T model by @ylacombe in #25693

Kosmos-2

The KOSMOS-2 model was proposed in Kosmos-2: Grounding Multimodal Large Language Models to the World by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.

KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale dataset of grounded image-text pairs GRIT. The spatial coordinates of the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective entity text spans (for example, a snowman followed by <patch_index_0044><patch_index_0863>). The data format is similar to “hyperlinks” that connect the object regions in an image to their text span in the corresponding caption.

Add Kosmos-2 model by @ydshieh in #24709

Owl-v2

OWLv2 was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up OWL-ViT using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection.

Add OWLv2, bis by @NielsRogge in #26668

🚨🚨🚨 Safetensors by default for `torch` serialization 🚨🚨🚨

Version v4.35.0 now puts safetensors serialization by default. This is a significant change targeted at making users of the Hugging Face Hub, transformers, and any downstream library leveraging it safer.

The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).

It was already the default loading mechanism since v4.30.0 and would therefore already default to loading model.safetensors files instead of pytorch_model.bin if these were present in the repository.

With v4.35.0, any call to save_pretrained for torch models will now save a safetensors file. This safetensors file is in the PyTorch format, but can be loaded in TensorFlow and Flax models alike.

⚠️ If you run into any issues with this, please let us know ASAP in the issues so that we may help you. Namely, the following errors may indicate something is up:

Loading a safetensors file and having a warning mentioning missing weights unexpectedly
Obtaining completely wrong/random results at inference after loading a pretrained model that you have saved in safetensors

If you wish to continue saving files in the .bin format, you can do so by specifying safe_serialization=False in all your save_pretrained calls.

Safetensors serialization by default by @LysandreJik in #27064

Chat templates

Chat templates have been expanded with the addition of the add_generation_prompt argument to apply_chat_template(). This has also enabled us to rework the ConversationalPipeline class to use chat templates. Any model with a chat template is now automatically usable through ConversationalPipeline.

Add add_generation_prompt argument to apply_chat_template by @Rocketknight1 in #26573
Conversation pipeline fixes by @Rocketknight1 in #26795

Guides

Two new guides on LLMs were added the library:

[docs] LLM prompting guide by @MKhalusova in #26274
[docs] Optimizing LLMs by @patrickvonplaten in #26058

Quantization

Exllama-v2 integration

Exllama-v2 provides better GPTQ kernel for higher throughput and lower latency for GPTQ models. The original code can be found here.

add exllamav2 arg by @SunMarc in #26437
Add exllamav2 better by @SunMarc in #27111

You will need the latest versions of optimum and auto-gptq. Read more about the integration here.

AWQ integration

AWQ is a new and popular quantization scheme, already used in various libraries such as TGI, vllm, etc. and known to be faster than GPTQ models according to some benchmarks. The original code can be found here and here you can read more about the original paper.

We support AWQ inference with original kernels as well as kernels provided through autoawq package that you can simply install with pip install autoawq.

[core / Quantization ] AWQ integration by @younesbelkada in #27045

We also provide an example script on how to push quantized weights on the hub on the original repository.

Read more about the benchmarks and the integration here

GPTQ on CPU !

You can now run GPTQ models on CPU using the latest version of auto-gptq thanks to @vivekkhandelwal1 !

Add support for loading GPTQ models on CPU by @vivekkhandelwal1 in #26719

Attention mask refactor

We refactored the attention mask logic for major models in transformers. For instance, we removed padding_mask argument which was ambiguous for some users

Remove ambiguous padding_mask and instead use a 2D->4D Attn Mask Mapper by @patrickvonplaten in #26792
[Attention Mask] Refactor all encoder-decoder attention mask by @patrickvonplaten in #27086

Flash Attention 2 for more models + quantizat...

Contributors

srush, dotneet, and 118 other contributors

Assets 2

18 Oct 21:15

ArthurZucker

v4.34.1

acc394c

Patch release: v4.34.1

A patch release was made for the following three commits:

Add add_generation_prompt argument to apply_chat_template (#26573)
Fix backward compatibility of Conversation (#26741)
[Tokenizer] Fix slow and fast serialization (#26570)

Assets 2

03 Oct 15:00

LysandreJik

v4.34.0

b71f20a

v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor

New models

Mistral

Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:

Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.

[Mistral] Mistral-7B-v0.1 support by @Bam4d in #26447

Persimmon

The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.

[Persimmon] Add support for persimmon by @ArthurZucker in #26042

BROS

BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encode relative spatial information instead of using absolute spatial information.

Add BROS by @jinhopark8345 in #23190

ViTMatte

ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.

Add ViTMatte by @NielsRogge in #25843

Nougat

Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.

Add Nougat by @NielsRogge and @molbap in #25942

Prompt templating

We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.

Overhaul Conversation class and prompt templating by @Rocketknight1 in #25323

🚨🚨 Tokenizer refactor

[Tokenizer] attemp to fix add_token issues by @ArthurZucker in #23909
Nit-added-tokens by @ArthurZucker in #26538 adds some fix to #23909 .

🚨Workflow Changes 🚨:

These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes so we highlight them below.

unique_no_split_tokens attribute removed and not used in the internal logic
sanitize_special_tokens() follows a deprecation cycle and does nothing
All attributes in SPECIAL_TOKENS_ATTRIBUTES are stored as AddedTokens and no strings.
loading a slow from a fast or a fast from a slow will no longer raise and error if the tokens added don't have the correct index. This is because they will always be added following the order of the added_tokens but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
the length of a tokenizer is now max(set(self.get_vocab().keys())) accounting for holes in the vocab. The vocab_size no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
Adding a token using tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)]) now takes into account rstrip, lstrip, normalized information.
added_tokens_decoder holds AddedToken, not strings.
add_tokens() for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
initializing a tokenizer form scratch will now add missing special tokens to the vocab.
stripping is not always done for special tokens! 🚨 Only if the AddedToken has lstrip=True and rstrip=True
fairseq_ids_to_tokens attribute removed for Barthez (was not used)

➕ Most visible features:

printing a tokenizer now shows tokenizer.added_tokens_decoder for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
faster from_pretrained, faster add_tokens because special and non special can be mixed together and the trie is not always rebuilt.
faster encode/decode with caching mechanism for added_tokens_decoder/encoder.
information is fully saved in the tokenizer_config.json

For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.

Flash Attention 2

FA2 support added to transformers for most popular architectures (llama, mistral, falcon) architectures actively being contributed in this issue (#26350). Simply pass use_flash_attention_2=True when calling from_pretrained

In the future, PyTorch will support Flash Attention 2 through torch.scaled_dot_product_attention, users would be able to benefit from both (transformers core & transformers + SDPA) implementations of Flash Attention-2 with simple changes (model.to_bettertransformer() and force-dispatch the SDPA kernel to FA-2 in the case of SDPA)

[core ] Integrate Flash attention 2 in most used models by @younesbelkada in #25598

For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: #26557

Lazy import structure

Support for lazy loading integration libraries has been added. This will drastically speed up importing transformers and related object from the library.

Example before this change:

2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel"  3.31s user 3.06s system 220% cpu 2.893 total

After this change:

python3 -c "from transformers import CLIPTextModel"  1.70s user 1.49s system 220% cpu 1.447 total

[Core] Add lazy import structure to imports by @patrickvonplaten in #26090

Bugfixes and improvements

Fix typo by @susnato in #25966
Fix Detr CI by @ydshieh in #25972
Fix test_load_img_url_timeout by @ydshieh in #25976
nn.Identity is not required to be compatible with PyTorch < 1.1.0 as the minimum PyTorch version we currently support is 1.10.0 by @statelesshz in #25974
Add Pop2Piano space demo. by @susnato in #25975
fix typo by @kai01ai in #25981
Use main in conversion script by @ydshieh in #25973
[doc] Always call it Agents for consistency by @julien-c in #25958
Update RAG README.md with correct path to examples/seq2seq by @tleyden in #25953
Update training_args.py to remove the runtime error by @sahel-sh in #25920
Trainer: delegate default generation values to generation_config by @gante in #25987
Show failed tests on CircleCI layout in a better way by @ydshieh in #25895
Patch with accelerate xpu by @abhilash1910 in #25714
PegasusX add _no_split_modules by @andreeahedes in #25933
Add TFDebertaV2ForMultipleChoice by @raghavanone in #25932
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler by @pacman100 in #25863
[Wav2Vec2 Conformer] Fix inference float16 by @sanchit-gandhi in #25985
Add LLaMA resources by @eenzeenee in #25859
[CI] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
[VITS] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
Fix Mega chunking error when using decoder-only model by @tanaymeh in #25765
save space when converting hf model to megatron model. by @flower-with-safe in #25950
Update README.md by @NinoRisteski in #26003
Falcon: fix revision propagation by @LysandreJik in #26006
TF-OPT attention mask fixes by @Rocketknight1 in #25238
Fix small typo README.md by @zspo in #25934
🌐[i18n-KO] Translated llm_tutorial.md to Korean by @harheem in #25791
Remove Falcon from undocumented list by @Rocketknight1 in #26008
modify context length for GPTQ + version bump by @SunMarc in #25899
Fix err with FSDP by @muellerzr in #25991
fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 by @kai01ai in #26024
Fix CircleCI config by @ydshieh in #26023
Add tgs speed metrics by @CokeDong in #25858
[VITS] Fix nightly tests by @sanchit-gandhi in #25986
Added HerBERT to README.md by @Muskan011 in #26020
Fix vilt config docstring parameter to match value in init by @raghavanone in #26017
Punctuation fix by @kwonmha in #26025
Try to fix training Loss inconsistent after resume from old checkpoint by @dumpmemory in #25872
Fix Dropout Implementation in Graphormer by @alexanderkrauck in #24817
Update missing docs on activation_dropout and fix DropOut docs for SEW-D by @gau-nernst in #26031
Skip warning if tracing with dynamo by @angelayi in #25581
🌐 [i18n-KO] Translated llama.md to Korean by @harheem in #26044
[CodeLlamaTokenizerFast] Fix fix set_infilling_processor to properly reset by @ArthurZucker in #26041
[CITests] skip failing tests until #26054 is merged by @ArthurZucker in #26063
only main process should call _save on deepspeed zero3 by @zjjMaiMai in #25959
docs: update link huggingface map by @pphuc25 in #26077
docs: add space to docs by @pphuc25 in #26067
[core] Import tensorflow inside relevant methods...

Contributors

jbochi, tleyden, and 92 other contributors

Assets 2

27 Sep 15:09

LysandreJik

v4.33.3

bffac92

Patch release: v4.33.3

A patch release was made for the following three commits:

DeepSpeed ZeRO-3 handling when resizing embedding layers (#26259)
[doc] Always call it Agents for consistency (#25958)
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)

Assets 2

15 Sep 20:24

LysandreJik

v4.33.2

6da93f5

Patch release: v4.33.2

A patch release was done for these two commits:

Fix pad to multiple of (#25732)
fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 (#26024)

Assets 2

06 Sep 21:14

LysandreJik

v4.33.1

fa6107c

Falcon, Code Llama, ViTDet, DINO v2, VITS

Falcon

Falcon is a class of causal decoder-only models built by TII. The largest Falcon checkpoints have been trained on >=1T tokens of text, with a particular emphasis on the RefinedWeb corpus. They are made available under the Apache 2.0 license.

Falcon’s architecture is modern and optimized for inference, with multi-query attention and support for efficient attention variants like FlashAttention. Both ‘base’ models trained only as causal language models as well as ‘instruct’ models that have received further fine-tuning are available.

Falcon port #24523 by @Rocketknight1
Falcon: Add RoPE scaling by @gante in #25878
Add proper Falcon docs and conversion script by @Rocketknight1 in #25954
Put Falcon back by @LysandreJik in #25960
[Falcon] Remove SDPA for falcon to support earlier versions of PyTorch (< 2.0) by @younesbelkada in #25947

Code Llama

Code Llama, is a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.

[CodeLlama] Add support for CodeLlama by @ArthurZucker in #25740
[CodeLlama] Fix CI by @ArthurZucker in #25890

ViTDet

ViTDet reuses the ViT model architecture, adapted to object detection.

Add ViTDet by @NielsRogge in #25524

DINO v2

DINO v2 is the next iteration of the DINO model. It is added as a backbone class, allowing it to be re-used in downstream models.

[DINOv2] Add backbone class by @NielsRogge in #25520

VITS

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.

add VITS model by @hollance in #24085

Breaking changes:

🚨🚨🚨 [Refactor] Move third-party related utility files into integrations/ folder 🚨🚨🚨 by @younesbelkada in #25599

Moves all third party libs (outside HF ecosystem) related utility files inside integrations/ instead of having them in transformers directly.

In order to get the previous usage you should be changing your call to the following:

- from transformers.deepspeed import HfDeepSpeedConfig
+ from transformers.integrations import HfDeepSpeedConfig

Bugfixes and improvements

[DOCS] MusicGen Docs Update by @xNul in #25510
[MINOR:TYPO] by @cakiki in #25646
Pass the proper token to PEFT integration in auto classes by @sgugger in #25649
Put IDEFICS in the right section of the doc by @sgugger in #25650
TF 2.14 compatibility by @Rocketknight1 in #25630
Fix bloom add prefix space by @ArthurZucker in #25652
removing unnecesssary extra parameter by @rafaelpadilla in #25643
Adds TRANSFORMERS_TEST_BACKEND by @vvvm23 in #25655
stringify config by @AleksanderWWW in #25637
Add input_embeds functionality to gpt_neo Causal LM by @gaasher in #25659
Update doc toctree by @ydshieh in #25661
Add Llama2 resources by @wonhyeongseo in #25531
[SPM] Patch spm Llama and T5 by @ArthurZucker in #25656
[GPTNeo] Add input_embeds functionality to gpt_neo Causal LM by @ArthurZucker in #25664
fix wrong path in some doc by @ydshieh in #25658
Remove utils/documentation_tests.txt by @ydshieh in #25680
Prevent Dynamo graph fragmentation in GPTNeoX with torch.baddbmm fix by @norabelrose in #24941
⚠️ [CLAP] Fix dtype of logit scales in init by @sanchit-gandhi in #25682
Sets the stalebot to 10 AM CEST by @LysandreJik in #25678
Fix pad_token check condition by @ydshieh in #25685
[DOCS] Added docstring example for EpsilonLogitsWarper #24783 by @sanjeevk-os in #25378
correct resume training steps number in progress bar by @pphuc25 in #25691
Generate: general test for decoder-only generation from inputs_embeds by @gante in #25687
Fix typo in configuration_gpt2.py by @susnato in #25676
fix ram efficient fsdp init by @pacman100 in #25686
[LlamaTokenizer] make unk_token_length a property by @ArthurZucker in #25689
Update list of persons to tag by @sgugger in #25708
docs: Resolve typos in warning text by @tomaarsen in #25711
Fix failing test_batch_generation for bloom by @ydshieh in #25718
[PEFT] Fix peft version by @younesbelkada in #25710
Fix number of minimal calls to the Hub with peft integration by @sgugger in #25715
[AutoGPTQ] Add correct installation of GPTQ library + fix slow tests by @younesbelkada in #25713
Generate: nudge towards do_sample=False when temperature=0.0 by @gante in #25722
[from_pretrained] Simpler code for peft by @ArthurZucker in #25726
[idefics] idefics-9b test use 4bit quant by @stas00 in #25734
ImageProcessor - check if input pixel values between 0-255 by @amyeroberts in #25688
[from_pretrained] Fix failing PEFT tests by @younesbelkada in #25733
[ASR Pipe Test] Fix CTC timestamps error message by @sanchit-gandhi in #25727
🌐 [i18n-KO] Translated visual_question_answering.md to Korean by @wonhyeongseo in #25679
[PEFT] Fix PeftConfig save pretrained when calling add_adapter by @younesbelkada in #25738
fixed typo in speech encoder decoder doc by @asusevski in #25745
Add FlaxCLIPTextModelWithProjection by @pcuenca in #25254
Generate: add missing logits processors docs by @gante in #25653
[DOCS] Add example for HammingDiversityLogitsProcessor by @jessthebp in #25481
Generate: logits processors are doctested and fix broken doctests by @gante in #25692
[CLAP] Fix logit scales dtype for fp16 by @sanchit-gandhi in #25754
[Sentencepiece] make sure legacy do not require protobuf by @ArthurZucker in #25684
fix encoder hook by @SunMarc in #25735
Docs: fix indentation in HammingDiversityLogitsProcessor by @gante in #25756
Add type hints for several pytorch models (batch-3) by @nablabits in #25705
Correct attention mask dtype for Flax GPT2 by @liutianlin0121 in #25636
fix a typo in docsting by @statelesshz in #25759
[idefics] small fixes by @stas00 in #25764
Add docstrings and fix VIVIT examples by @Geometrein in #25628
[LlamaFamiliy] add a tip about dtype by @ArthurZucker in #25794
Add type hints for several pytorch models (batch-2) by @nablabits in #25557
Add type hints for pytorch models (final batch) by @nablabits in #25750
Add type hints for several pytorch models (batch-4) by @nablabits in #25749
[idefics] fix vision's hidden_act by @stas00 in #25787
Arde/fsdp activation checkpointing by @arde171 in #25771
Fix incorrect Boolean value in deepspeed example by @tmm1 in #25788
fixing name position_embeddings to object_queries by @Lorenzobattistela in #24652
Resolving Attribute error when using the FSDP ram efficient feature by @pacman100 in #25820
[Docs] More clarifications on BT + FA by @younesbelkada in #25823
fix register by @zspo in #25779
Minor wording changes for Code Llama by @osanseviero in #25815
[LlamaTokenizer] tokenize nits. by @ArthurZucker in #25793
fix warning trigger for embed_positions when loading xglm by @MattYoon in #25798
🌐 [i18n-KO] Translated peft.md to Korean by @nuatmochoi in #25706
🌐 [i18n-KO] model_memory_anatomy.md to Korean by @mjk0618 in #25755
Error with checking args.eval_accumulation_steps to gather tensors by @chaumng in #25819
Tests: detect lines removed from "utils/not_doctested.txt" and doctest ALL generation files by @gante in #25763
🌐 [i18n-KO] Translated add_new_pipeline.md to Korean by @heuristicwave in #25498
🌐 [i18n-KO] Translated community.md to Korean by @sim-so in #25674
🤦update warning to If you want to use the new behaviour, set `legacy=… by @ArthurZucker in #25833
update remaining Pop2Piano checkpoints by @susnato in #25827
[AutoTokenizer] Add data2vec to mapping by @sanchit-gandhi in #25835
MaskFormer,Mask2former - reduce memory load by @amyeroberts in #25741
Support loading base64 images in pipelines by @InventivetalentDev in #25633
Update README.md by @NinoRisteski in #25834
Generate: models with custom generate() return True in can_generate() by @gante in #25838
Update README.md by @NinoRisteski in #25832
minor typo fix in PeftAdapterMixin docs by @tmm1 in #25829
Add flax installation in daily doctest workflow by @ydshieh in #25860
Add Blip2 model in VQA pipeline by @jpizarrom in #25532
Remote tools are turned off by @LysandreJik in #25867
Fix imports by @ydshieh in #25869
fix max_memory for bnb by @SunMarc in #25842
Docs: fix example failing doctest in generation_strategies.md by @gante in #25874
pin pandas==2.0.3 by @ydshieh in #25875
Reduce CI output by @ydshieh in #25876
[ViTDet] Fix doc tests by @NielsRogge in #25880
For xla tensors, use an alternative way to get a unique id by @qihqi in #25802
fix ds z3 checkpointing when stage3_gather_16bit_weights_on_model_save=False by @pacman100 in #25817
Modify efficient GPU training doc with now-available adamw_bnb_8bit optimizer by @veezbo in #25807
[TokenizerFast] can_save_slow_tokenizer as a property for when vocab_file's folder was removed by @ArthurZucker in #25626
Save image_processor while saving pipeline (ImageSegmentationPipeline) by @raghavanone in #25884
[InstructBlip] FINAL Fix instructblip test by @younesbelkada in #25887
Add type hints for tf models batch 1 by @nablabits in #25853
Update setup.py by @ydshieh in #25893
Smarter check for is_tensor by @sgugger in #25871
remove torch_dtype override by @SunMarc in #25894
fix FSDP model resume optimizer & schedu...

Contributors

tmm1, jpizarrom, and 51 other contributors

Assets 2

28 Aug 12:48

LysandreJik

v4.32.1

ccb92be

Patch release: v4.32.1

Patch release including several patches from v4.31.0, listed below:

Put IDEFICS in the right section of the doc (#25650)
removing unnecesssary extra parameter (#25643)
[SPM] Patch spm Llama and T5 (#25656)
Fix bloom add prefix space (#25652)
Generate: add missing logits processors docs (#25653)
[idefics] small fixes (#25764)

Assets 2

22 Aug 13:11

LysandreJik

v4.32.0

41aef33

IDEFICS, GPTQ Quantization

IDEFICS

The IDEFICS model was proposed in OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

IDEFICS is the first open state-of-the-art visual language model at the 80B scale!

The model accepts arbitrary sequences of image and text and produces text, similarly to a multimodal ChatGPT.

Blogpost: hf.co/blog/idefics
Playground: HuggingFaceM4/idefics_playground

new model: IDEFICS via HuggingFaceM4 by @stas00 in #24796

MPT

MPT has been added and is now officially supported within Transformers. The repositories from MosaicML have been updated to work best with the model integration within Transformers.

[MPT] Add MosaicML's MPT model to transformers by @ArthurZucker & @younesbelkada in #24629

GPTQ Integration

GPTQ quantization is now supported in Transformers, through the optimum library. The backend relies on the auto_gptq library, from which we use the GPTQ and QuantLinear classes.

See below for an example of the API, quantizing a model using the new GPTQConfig configuration utility.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_name = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer,  group_size=128, desc_act=False)
# works also with device_map (cpu offload works but not disk offload)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, quantization_config=config)

Most models under TheBloke namespace with the suffix GPTQ should be supported, for example, to load a GPTQ quantized model on TheBloke/Llama-2-13B-chat-GPTQ simply run (after installing latest optimum and auto-gptq libraries):

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "TheBloke/Llama-2-13B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

For more information about this feature, we recommend taking a look at the following announcement blogpost: https://huggingface.co/blog/gptq-integration

GPTQ integration by @SunMarc in #25062

Pipelines

A new pipeline, dedicated to text-to-audio and text-to-speech models, has been added to Transformers. It currently supports the 3 text-to-audio models integrated into transformers: SpeechT5ForTextToSpeech, MusicGen and Bark.

See below for an example:

from transformers import pipeline

classifier = pipeline(model="suno/bark")
output = pipeline("Hey it's HuggingFace on the phone!")

audio = output["audio"]
sampling_rate = output["sampling_rate"]

Add Text-To-Speech pipeline by @ylacombe in #24952

Classifier-Free Guidance decoding

Classifier-Free Guidance decoding is a text generation technique developed by EleutherAI, announced in this paper. With this technique, you can increase prompt adherence in generation. You can also set it up with negative prompts, ensuring your generation doesn't go in specific directions. See its docs for usage instructions.

add CFG for .generate() by @Vermeille in #24654

Task guides

A new task guide going into Visual Question Answering has been added to Transformers.

VQA task guide by @MKhalusova in #25244

Model deprecation

We continue the deprecation of models that was introduced in #24787.

By deprecating, we indicate that we will stop maintaining such models, but there is no intention of actually removing those models and breaking support for them (they might one day move into a separate repo/on the Hub, but we would still add the necessary imports to make sure backward compatibility stays). The main point is that we stop testing those models. The usage of the models drives this choice and aims to ease the burden on our CI so that it may be used to focus on more critical aspects of the library.

Deprecate unused OpenLlama architecture by @tomaarsen in #24922

Translation Efforts

There are ongoing efforts to translate the transformers' documentation in other languages. These efforts are driven by groups independent to Hugging Face, and their work is greatly appreciated further to lower the barrier of entry to ML and Transformers.

If you'd like to kickstart such an effort or help out on an existing one, please feel free to reach out by opening an issue.

🌐 [i18n-KO] Translatedtasks/document_question_answering.md to Korean by @jungnerd in #24588
🌐 [i18n-KO] Fixed Korean and English quicktour.md by @wonhyeongseo in #24664
🌐 [i18n-KO] Updated Korean serialization.md by @wonhyeongseo in #24686
🌐 [i18n-KO] Translated performance.md to Korean by @augustinLib in #24883
🌐 [i18n-KO] Translated testing.md to Korean by @Sunmin0520 in #24900
🌐 [i18n-KO] Translated perf_train_cpu.md to Korean by @seank021 in #24911
🌐 [i18n-KO] Translated <tf_xla>.md to Korean by @54data in #24904
🌐 [i18n-KO] Translated perf_hardware.md to Korean by @augustinLib in #24966
🌐 [i18n-KO] Translated hpo_train.md to Korean by @harheem in #24968
🌐 [i18n-KO] Translated perf_infer_cpu.md to Korean by @junejae in #24920
🌐 [i18n-KO] Translated pipeline_webserver.md to Korean by @kihoon71 in #24828
🌐 [i18n-KO] Translated transformers_agents.md to Korean by @sim-so in #24881
🌐 [i18n-KO] Translated perf_infer_gpu_many.md to Korean by @heuristicwave in #24943
🌐 [i18n-KO] Translated perf_infer_gpu_one.md to Korean by @eenzeenee in #24978
🌐 [i18n-KO] Translated add_tensorflow_model.md to Korean by @keonju2 in #25017
🌐 [i18n-KO] Translated perf_train_cpu_many.md to Korean by @nuatmochoi in #24923
🌐 [i18n-KO] Translated add_new_model.md to Korean by @mjk0618 in #24957
🌐 [i18n-KO] Translated model_summary.md to Korean by @0525hhgus in #24625
🌐 [i18n-KO] Translated philosophy.md to Korean by @TaeYupNoh in #25010
🌐 [i18n-KO] Translated perf_train_tpu_tf.md to Korean by @0525hhgus in #25433
🌐 [i18n-KO] Translated docs: ko: pr_checks.md to Korean by @sronger in #24987

Explicit input data format for image processing

Addition of input_data_format argument to image transforms and ImageProcessor methods, allowing the user to explicitly set the data format of the images being processed. This enables processing of images with non-standard number of channels e.g. 4 or removes error which occur when the data format was inferred but the channel dimension was ambiguous.

import numpy as np
from transformers import ViTImageProcessor

img = np.random.randint(0, 256, (4, 6, 3))
image_processor = ViTImageProcessor()
inputs = image_processor(img, image_mean=0, image_std=1, input_data_format="channels_first")

Input data format by @amyeroberts in #25464
Add input_data_format argument, image transforms by @amyeroberts in #25462

Documentation clarification about efficient inference through `torch.scaled_dot_product_attention` & Flash Attention

Users are not aware that it is possible to force dispatch torch.scaled_dot_product_attention method from torch to use Flash Attention kernels. This leads to considerable speedup and memory saving, and is also compatible with quantized models. We decided to make this explicit to users in the documentation.

[Docs / BetterTransformer ] Added more details about flash attention + SDPA : #25265

In a nutshell, one can just run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to("cuda")

# convert the model to BetterTransformer
model.to_bettertransformer()

input_text = "Hello my dog is cute and"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

to enable Flash-attenion in their model. However, this feature does not support padding yet.

FSDP and DeepSpeed Changes

Users will no longer encounter CPU RAM OOM when using FSDP to train very large models in multi-gpu or multi-node multi-gpu setting.
Users no longer have to pass fsdp_transformer_layer_cls_to_wrap as the code now use _no_split_modules by default which is available for most of the popular models. DeepSpeed Z3 init now works properly with Accelerate Launcher + Trainer.

add util for ram efficient loading of model when using fsdp by @pacman100 in #25107
fix fsdp checkpointing issues by @pacman100 in #24926
fsdp fixes and enhancements by @pacman100 in #24980
fix deepspeed load best model at end when the model gets sharded by @pacman100 in #25057
resolving zero3 init when using accelerate config with Trainer by @pacman100 in #25227
fix z3 init when using accelerate launcher by @pacman100 in #25589