This repository gathers the notes, code snippets, and resources I collected while learning Natural Language Processing (NLP). The main goal is to have a reference guide for future projects and to share knowledge with others.
├── basics <- Hugging Face basic concepts and examples
│ ├── hf_pipeline <- NLP pipeline examples
│ ├── hf_inference <- Breakdown of pipeline components
│ ├── hf_model_creation <- Instantiate models
│ ├── hf_tokenizers <- Tokenizers basics
│ ├── hf_processing_data <- Loading dataset from Hub
│ ├── hf_finetuning <- Basic fine-tuning task
│ ├── hf_datasets <- Dataset operations
│ ├── hf_tokenizers_training <- Adapt tokenizers to new data
│ └── hf_tokenizers_offsets <- Tokenizers offset mapping
│
└── mains_tasks <- Tackling main NLP tasks
├── sequence_classification <- Classify sequences of tokens
├── token_classification <- Set labels for each token
├── masked_language_modeling <- Filling blanks for domain adaptation
├── causal_language_modeling <- Predict next token
└── semantic_search <- Retrieve similar documents
Only the most important files and directories are listed above.
NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.
The following is a list of common NLP tasks, with some examples of each:
- Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
- Generating a new sentence from an input text: Translating a text into another language, summarizing a text
NLP isn’t limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.
The `AutoModel` architecture contains only the base Transformer module: given some inputs, it outputs what we'll call *hidden states*, also known as *features*. For each model input, we'll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.
```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I love learning NLP!", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state has shape (batch_size, sequence_length, hidden_size)
```
The model heads take the high-dimensional vectors of hidden states as input and project them onto a different dimension.
There are many different architectures available in Hugging Face Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:
- `*Model` (retrieves the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others
If, for example, we need a model with a sequence classification head to classify sentences as positive or negative, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`.
```python
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)  # same tokenized inputs as above
# outputs.logits has shape (batch_size, num_labels)
```
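To turn those logits into probabilities, we can apply a softmax and map each index to its label via the model configuration. A small sketch building on the code above:

```python
import torch

# Convert raw logits into probabilities over the labels
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Map each column index to a human-readable label (for this checkpoint: NEGATIVE/POSITIVE)
print(model.config.id2label)
print(predictions)
```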
When using `from_pretrained()`, the weights are downloaded and cached (so future calls to the method won't re-download them) in the cache folder, which defaults to `~/.cache/huggingface/transformers`. You can customize the cache folder by setting the `HF_HOME` environment variable.
When using `datasets.load_dataset()`, the datasets are downloaded and cached in the cache folder, which defaults to `~/.cache/huggingface/datasets`. You can customize the cache folder by setting the `HF_HOME` environment variable.
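For example, a minimal sketch of redirecting both caches via `HF_HOME` from Python (the path is just an illustration, and the variable must be set before the libraries are imported):

```python
import os

# Must be set before importing transformers/datasets (illustrative path)
os.environ["HF_HOME"] = "/mnt/storage/hf_cache"

from transformers import AutoModel
from datasets import load_dataset
```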
Model | BPE | WordPiece | Unigram |
---|---|---|---|
Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |
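Whatever the algorithm, a tokenizer can be adapted to new data by retraining it from an existing one with `train_new_from_iterator()`. A minimal sketch (the corpus and vocabulary size are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Any text corpus works here; wikitext is just an illustration
corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]["text"]

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Keeps the original algorithm (BPE for GPT-2) but learns new merges and vocabulary
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("my-new-tokenizer")
```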
Check out `basics/hf_pipeline.ipynb` for examples of using Hugging Face's `pipeline` for NLP tasks. There are several tasks you can try out of the box without much effort.
Check the available pipelines here.
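As a quick taste, a sentiment-analysis pipeline takes only a couple of lines (a default checkpoint is downloaded if none is specified):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love learning NLP with Hugging Face!"))
# [{'label': 'POSITIVE', 'score': ...}]
```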
For many NLP tasks, we can rely on pre-trained models that have been trained on large datasets. Leveraging zero-shot or few-shot capabilities of these models can provide a strong baseline for many tasks.
For example, to build an email spam classifier, we can just use a zero-shot classifier from the Hugging Face Transformers library.
```python
from transformers import pipeline

oracle = pipeline(model="facebook/bart-large-mnli")
oracle(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
"""
Output:
{
    'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
    'labels': ['urgent', 'phone', 'not urgent', 'tablet', 'computer'],
    'scores': [0.504131, 0.479352, 0.013123621, 0.003235, 0.0022361]
}
"""
```
The models in the Hub are not limited to Hugging Face Transformers or even NLP. There are models from Flair and AllenNLP for NLP, Asteroid and pyannote for speech, and timm for vision, to name a few.
You can train your own model and share it on the Hugging Face Hub. You can find the instructions here and here. Also, there is information about building a model card here.
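As a minimal sketch of sharing a model, both the model and its tokenizer expose `push_to_hub()` (the repository name is hypothetical, and you need to be logged in first, e.g. via `huggingface-cli login`):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# ... fine-tune the model ...

model.push_to_hub("my-awesome-model")       # hypothetical repository name
tokenizer.push_to_hub("my-awesome-model")
```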
We can easily supercharge our training loop with the Hugging Face Accelerate library. Check out an example at `basics/hf_finetuning_pytorch_accelerate.ipynb`.
Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`. This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`.
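A minimal sketch of how those pieces fit together in a plain PyTorch training loop (the model, optimizer, dataloader, and number of epochs are assumed to be defined as usual, and batches are assumed to include labels so the model returns a loss):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Wrap the objects so the loop runs on whatever hardware the script is launched on
train_dataloader, model, optimizer = accelerator.prepare(train_dataloader, model, optimizer)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)       # no manual batch.to(device) needed
        loss = outputs.loss
        accelerator.backward(loss)     # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```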
In order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the `padding="max_length"` and `max_length` arguments of the tokenizer.
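For example, a hedged sketch of such a tokenization step (the `tokenizer`, the `raw_datasets` object, the `"text"` column, and the `max_length` value are all assumptions for illustration):

```python
def tokenize_function(examples):
    # Fixed-length padding keeps tensor shapes static, which TPUs prefer
    return tokenizer(
        examples["text"],
        padding="max_length",
        max_length=128,   # illustrative value
        truncation=True,
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```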
Putting this in a `train.py` script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

```
accelerate config
```

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

```
accelerate launch train.py
```

which will launch the distributed training. You can find more examples in the Hugging Face Accelerate repo.
The `Dataset.map()` method applies a custom processing function to each row in the dataset. The function should return a dictionary: if the returned item has new columns/keys, they are added to the dataset; if it has the same keys, the existing values are replaced. We can also use the map method in batch mode with `Dataset.map(function, batched=True)`. For example:
```python
def lowercase_title(example):
    return {"title": example["title"].lower()}

my_dataset = my_dataset.map(lowercase_title)
```
The `Dataset.map()` method takes a `batched` argument that, if set to `True`, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, a map function that unescapes HTML entities in every row takes a while to run one example at a time; we can speed this up by processing several elements at the same time using a list comprehension.

When you specify `batched=True`, the function receives a dictionary with the fields of the dataset, but each value is now a list of values instead of a single value. The return value of the function should have the same shape: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. For example, here is another way to lowercase all titles, but using `batched=True`:
```python
def lowercase_title(batch):
    return {"title": [title.lower() for title in batch["title"]]}

my_dataset = my_dataset.map(lowercase_title, batched=True)
```
When using a model that requires a specific data collator, we can create a custom data collator. You can find an example in this notebook about masked language modeling and whole word masking.
A data collator is just a function that takes a list of samples and converts them into a batch.
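To make that concrete, here is a minimal sketch of a hand-rolled collator (not the whole word masking collator from the notebook): it pads variable-length `input_ids` and stacks the labels. The padding value of 0 is an assumption; in practice you would use `tokenizer.pad_token_id`.

```python
import torch

def simple_data_collator(features):
    # `features` is a list of dicts, e.g. [{"input_ids": [...], "labels": 0}, ...]
    input_ids = [torch.tensor(f["input_ids"]) for f in features]
    batch = {
        "input_ids": torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=0  # assumes 0 is the pad token id
        ),
        "labels": torch.tensor([f["labels"] for f in features]),
    }
    return batch

# Hypothetical usage: Trainer(..., data_collator=simple_data_collator)
```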
By default, the `Trainer` will remove any columns that are not part of the model's `forward()` method. This means that if, for example, you're using the whole word masking collator, you'll also need to set `remove_unused_columns=False` to ensure you don't lose the `word_ids` column during training.
- ML CO2 Impact: Website to calculate the carbon footprint of your machine learning models. It is integrated with Hugging Face's Model Hub. To learn more about this, you can read this blog post, which shows how to generate an `emissions.csv` file with an estimate of the footprint of your training (as sketched just below), as well as the Hugging Face Transformers documentation addressing this topic.
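One way to generate such a file locally is with the `codecarbon` package; a minimal sketch (the training code itself is a placeholder):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # writes emissions.csv in the working directory by default
tracker.start()

# ... run your training here (placeholder) ...

emissions = tracker.stop()     # estimated emissions in kg CO2-eq
print(f"Estimated emissions: {emissions} kg CO2-eq")
```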
- Gradient Accumulation: Accumulate gradients over multiple steps before performing an optimization step. This is useful when the model is too large and you have to use a smaller batch size but still want a large effective batch size for stable training (see the `TrainingArguments` sketch after this list).
- Gradient Checkpointing: Trade compute for memory by recomputing the forward pass of the model during backpropagation. This is useful when the model is too large to fit in memory.
- Mixed Precision Training: Use half-precision floating point arithmetic to reduce memory usage and speed up training.
- LoRA: There are methods that focus on fine-tuning large models by adding adapter layers to the model during fine-tuning, which can reduce the memory footprint of training. After fine-tuning, the adapter layers can be merged into the model, introducing no additional parameters or computational overhead at inference.
- Quantization: There are tools like Unsloth that provide quantized models for training and inference, which can be used to reduce the memory footprint of the model.
- Freeze Layers: Freeze the weights of some layers during training to reduce the memory footprint of the model.
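As a rough sketch of how several of these options are exposed through `TrainingArguments` (the values are illustrative, not recommendations):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,    # small physical batch size to fit in memory
    gradient_accumulation_steps=8,    # effective batch size of 4 * 8 = 32
    gradient_checkpointing=True,      # recompute activations to save memory
    fp16=True,                        # mixed precision training
)
```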
Tool | Description | Tags |
---|---|---|
vLLM | vLLM is a fast and easy-to-use library for LLM inference and serving. | LLM - Serving |
LoRAX | Serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. | LLM - LoRA - Serving |
LLM Compressor | Easy-to-use library for optimizing models for deployment with vLLM. | LLM - Compression |
Ludwig | Ludwig is a low-code framework for building custom AI models like LLMs and other deep neural networks. | LLM - Fine-Tuning - Low-Code |
Axolotl | Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures. | LLM - Fine-Tuning - Low-Code |
LitGPT | High-performance LLMs with recipes to pretrain, finetune, deploy at scale | LLM - Fine-Tuning - Low-Code |
Distilabel | Synthesize data for AI and add feedback on the fly! | Data - Synthetic |
- Dataset loading documentation: Guide on how to load a dataset from the Hub without a dataset loading script, a local loading script, local files, in-memory data, offline, or a specific slice of a split; covers audio, image, and text datasets.
- Sequence Classification
- BERT
- TF-IDF
- Named Entity Recognition (Token Classification)
- Fine-tuning a masked language model
- Translation
- Summarization
- Question Answering
- Causal Language Modeling
- Semantic Search
- Retrieval Augmented Generation (RAG)
- Visual Document Understanding - Parsing: Donut Tutorial
- Review Unsloth Documentation and Projects
- Chat Templates
- Reward Modeling
- & More - Create a folder with examples and projects
- Instruction-Following SFT: Hermes 3 proposes a completions-only cross-entropy loss
- Completions Only
- DPO
- ORPO
- Improved sequence classification: Compare the performance of a base model finetuned on IMDB with a model previously finetuned with MLM on the same dataset and later finetuned on IMDB for sequence classification.
- Large Language Model Course - Maxime Labonne
- Book | LLM Engineer's Handbook - Maxime Labonne
- Book | Building LLMs for Production - Louis-François Bouchard
- Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth: A comprehensive overview of supervised fine-tuning.
- Unsloth: UnslothAI is a parameter-efficient fine-tuning library for LLMs that accelerates fine-tuning by 2-5 times while using 70% less memory.
- RAG using Llama 3 by Meta AI - Lightning AI: A studio building a completely self-hosted "Chat with your Docs" RAG application using Llama-3, served locally through Ollama.
- Create synthetic datasets with Llama 3.1 - Lightning AI: Leverage Llama 3.1 models and Distilabel, an open-source framework for AI engineers, to create and evaluate synthetic data.
- Finetune and Deploy Llama 3.1 8B - Lightning AI: A studio that shows how to finetune and deploy a Llama 3.1 8B model using LitGPT.
- Prompt Engineering vs Finetuning vs RAG: Pros and cons of each technique. This is important because it will help you to understand when and how to use these techniques effectively.
- Safeguarding LLMs with Guardrails: Given that the open-ended nature of LLM-driven applications can produce responses that may not align with an organization’s guidelines or policies, a set of safety measurements and actions are becoming table stakes for maintaining trust in generative AI.
- The Ultimate Guide to LLM Fine Tuning: Best Practices & Tools: Explore various Large Language Models fine tuning methods and learn about their benefits and limitations.
- Evaluating Large Language Models: Methods, Best Practices & Tools: Explore 7 effective methods, best practices, and evolving frameworks for assessing LLMs' performance and impact across industries.