NLPJourney

Welcome Illustration

This repository is a collection of notes, code snippets, and resources I gathered throughout my journey learning Natural Language Processing (NLP). The main goal is to have a reference guide for future projects and to share the knowledge with others.

Index

Project Overview

├── basics       <- Hugging Face basic concepts and examples
│   ├── hf_pipeline               <- NLP pipelines examples
│   ├── hf_inference              <- Breakdown pipeline components
│   ├── hf_model_creation         <- Instantiate models
│   ├── hf_tokenizers             <- Tokenizers basics
│   ├── hf_processing_data        <- Loading dataset from Hub
│   ├── hf_finetuning             <- Basic fine-tuning task
│   ├── hf_datasets               <- Dataset operations
│   ├── hf_tokenizers_training    <- Adapt tokenizers to new data
│   └── hf_tokenizers_offsets     <- Tokenizers offset mapping
│
└── mains_tasks  <- Tackling main NLP tasks
    ├── sequence_classification   <- Classify sequences of tokens
    ├── token_classification      <- Set labels for each token
    ├── masked_language_modeling  <- Filling blanks for domain adaptation
    ├── causal_language_modeling  <- Predict next token
    └── semantic_search           <- Retrieve similar documents

Only the most important files and directories are listed above.

What is NLP?

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

The following is a list of common NLP tasks, with some examples of each:

  • Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
  • Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
  • Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
  • Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
  • Generating a new sentence from an input text: Translating a text into another language, summarizing a text

NLP isn’t limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.

Learnings

Model heads: Making sense out of numbers

The AutoModel architecture contains only the base Transformer module: given some inputs, it outputs what we'll call hidden states, also known as features. For each model input, we retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state has shape (batch_size, sequence_length, hidden_size)

The model heads take the high-dimensional hidden states as input and project them onto a different dimension, for example the number of labels in a classification task.
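As a rough sketch of the idea (this is not DistilBERT's exact head, which also includes a pre-classifier layer and dropout), a sequence classification head is essentially a small module that projects the hidden state of the first token to one logit per label:

import torch.nn as nn

class ClassificationHeadSketch(nn.Module):
    """Illustrative only: projects hidden states to class logits."""

    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        # last_hidden_state: (batch_size, sequence_length, hidden_size)
        cls_vector = last_hidden_state[:, 0]  # hidden state of the first ([CLS]) token
        return self.classifier(cls_vector)    # (batch_size, num_labels)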

HF Models Breakdown

There are many different architectures available in Hugging Face Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

  • *Model (retrieve the hidden states)
  • *ForCausalLM
  • *ForMaskedLM
  • *ForMultipleChoice
  • *ForQuestionAnswering
  • *ForSequenceClassification
  • *ForTokenClassification
  • and others

If, for example, we need a model with a sequence classification head to classify sentences as positive or negative, we won't actually use the AutoModel class, but AutoModelForSequenceClassification.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")
outputs = model(**inputs)
# outputs.logits has shape (batch_size, num_labels)

Hugging Face Caching

Models

When using from_pretrained(), the weights are downloaded and cached (so future calls to the method won't re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize the cache folder by setting the HF_HOME environment variable.

Data

When using datasets.load_dataset(), datasets are downloaded and cached in the cache folder, which defaults to ~/.cache/huggingface/datasets. You can customize the cache folder by setting the HF_HOME environment variable.
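Both APIs also accept a cache_dir argument to override the location per call. A minimal sketch (the paths, checkpoint, and dataset names below are placeholder assumptions):

from transformers import AutoModel
from datasets import load_dataset

model = AutoModel.from_pretrained(
    "distilbert-base-uncased",
    cache_dir="/data/hf_cache/models",    # per-call override of the model cache
)
dataset = load_dataset(
    "imdb",
    cache_dir="/data/hf_cache/datasets",  # per-call override of the dataset cache
)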

Tokenizers Algorithms

BPE
  • Training: Starts from a small vocabulary and learns rules to merge tokens
  • Training step: Merges the tokens corresponding to the most common pair
  • Learns: Merge rules and a vocabulary
  • Encoding: Splits a word into characters and applies the merges learned during training

WordPiece
  • Training: Starts from a small vocabulary and learns rules to merge tokens
  • Training step: Merges the tokens corresponding to the pair with the best score, based on the frequency of the pair and privileging pairs where each individual token is less frequent
  • Learns: Just a vocabulary
  • Encoding: Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word

Unigram
  • Training: Starts from a large vocabulary and learns rules to remove tokens
  • Training step: Removes all the tokens in the vocabulary that minimize the loss computed on the whole corpus
  • Learns: A vocabulary with a score for each token
  • Encoding: Finds the most likely split into tokens, using the scores learned during training
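A hedged sketch of adapting a tokenizer to new data, as covered in hf_tokenizers_training (the corpus and vocabulary size here are placeholder assumptions): a fast tokenizer loaded with AutoTokenizer exposes train_new_from_iterator(), which retrains the same algorithm (BPE, WordPiece, or Unigram, depending on the checkpoint) on your own texts.

from transformers import AutoTokenizer

# Any iterator of texts works; this tiny corpus is just a stand-in.
corpus = ["def add(a, b): return a + b", "for i in range(10): print(i)"]

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses byte-level BPE
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=2000)

print(new_tokenizer.tokenize("def multiply(a, b): return a * b"))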

Toolbox

Pipeline

Check out basics/hf_pipeline.ipynb for examples of using Hugging Face's pipeline for NLP tasks. There are several tasks you can try out of the box without too much effort.

Check available pipelines here

Easy Baseline & Labeling

For many NLP tasks, we can rely on pre-trained models that have been trained on large datasets. Leveraging zero-shot or few-shot capabilities of these models can provide a strong baseline for many tasks.

For example, to build a spam or email classifier, we can just use a zero-shot classification pipeline from the Hugging Face Transformers library.

from transformers import pipeline

oracle = pipeline(model="facebook/bart-large-mnli")
oracle(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
"""
Output:
{
 'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'not urgent', 'tablet', 'computer'],
 'scores': [0.504131, 0.479352, 0.013123621, 0.003235, 0.0022361]
}
"""

The Hugging Face Hub: Models

The models in the Hub are not limited to Hugging Face Transformers or even NLP. There are models from Flair and AllenNLP for NLP, Asteroid and pyannote for speech, and timm for vision, to name a few.

Sharing a Model at Hugging Face

You can train your own model and share it on Hugging Face. You can find the instructions here and here. You can also find information about building a model card here.

Accelerate

We can easily supercharge our training loop with the Hugging Face Accelerate library. Check out an example at basics/hf_finetuning_pytorch_accelerate.ipynb.

The main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This wraps those objects in the proper container to make sure distributed training works as intended. The remaining changes are removing the line that puts the batch on the device (if you want to keep it, just change it to use accelerator.device) and replacing loss.backward() with accelerator.backward(loss).
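A minimal sketch of such a loop, assuming model, optimizer, and train_dataloader are already defined as in the notebook:

from accelerate import Accelerator

accelerator = Accelerator()

# Wrap the objects so they work on any distributed setup (CPU, single/multi GPU, TPU).
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    # No manual batch.to(device): accelerator.prepare() handles device placement.
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()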

In order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the padding="max_length" and max_length arguments of the tokenizer.
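For instance, a hedged example (the texts and the max_length of 128 are arbitrary placeholders, and tokenizer is assumed to be already loaded):

# Fixed-length padding keeps tensor shapes static, which TPUs handle best.
inputs = tokenizer(
    ["first example", "a slightly longer second example"],
    padding="max_length",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)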

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

accelerate config

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

accelerate launch train.py

which will launch the distributed training. You can find more examples in the Hugging Face Accelerate repo.

Dataset map() batched

The map() method applies a custom processing function to each row in the dataset. The function should return a dictionary: keys that don't exist yet are added as new columns, and keys that match existing columns replace their values. We can also use map() in batch mode with Dataset.map(function, batched=True). For example:

def lowercase_title(example):
    return {"title": example["title"].lower()}

my_dataset = my_dataset.map(lowercase_title)

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). Processing rows one at a time can be slow for heavier functions (you can read the time taken from the progress bars); we can speed things up by processing several elements at the same time, for example with a list comprehension.

When you specify batched=True, the function receives a dictionary with the fields of the dataset, but each value is now a list of values instead of a single value. The return value should have the same form: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. For example, here is another way to lowercase all titles, but using batched=True:

def lowercase_title(batch):
    return {"title": [title.lower() for title in batch["title"]]}

my_dataset = my_dataset.map(lowercase_title, batched=True)
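Where batched=True really pays off is with fast tokenizers, which can tokenize a whole list of texts in parallel. A hedged sketch, assuming the dataset has a "text" column and reusing a DistilBERT tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_batch(batch):
    # With batched=True, batch["text"] is a list of strings.
    return tokenizer(batch["text"], truncation=True)

my_dataset = my_dataset.map(tokenize_batch, batched=True)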

Custom Data Collator

When using a model that requires a specific data collator, we can create a custom data collator. You can find an example in this notebook about masked language modeling and whole word masking.

A data collator is just a function that takes a list of samples and converts them into a batch.
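As a minimal sketch (the tokenizer checkpoint and field names are assumptions, not the whole word masking collator from the notebook), a custom collator can be as simple as padding a list of tokenized samples into tensors:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def simple_collator(features):
    # features: a list of dicts like {"input_ids": [...], "attention_mask": [...], "label": 0}
    labels = [f.pop("label") for f in features] if "label" in features[0] else None
    batch = tokenizer.pad(features, padding=True, return_tensors="pt")
    if labels is not None:
        batch["labels"] = torch.tensor(labels)
    return batch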

By default, the Trainer will remove any columns that are not part of the model's forward() method. This means that if, for example, you're using the whole word masking collator, you'll also need to set remove_unused_columns=False to ensure the word_ids column isn't dropped during training.

Environment Impact

  • ML CO2 Impact: a website to calculate the carbon footprint of your machine learning models. It is integrated with Hugging Face's Model Hub. To learn more, you can read this blog post, which shows how to generate an emissions.csv file with an estimate of the footprint of your training, as well as the Hugging Face Transformers documentation addressing this topic.

Training Tricks

Memory Efficient Training

  • Gradient Accumulation: Accumulate gradients over multiple steps before performing an optimization step. This is useful when the model is too large and you have to use a smaller batch size, but still want a large effective batch size for stable training (see the sketch after this list).
  • Gradient Checkpointing: Trade compute for memory by recomputing the forward pass of the model during backpropagation. This is useful when the model is too large to fit in memory.
  • Mixed Precision Training: Use half-precision floating point arithmetic to reduce memory usage and speed up training.
  • LoRA: Fine-tune large models by adding small adapter layers and training only those, which greatly reduces the number of trainable parameters and the memory footprint of fine-tuning. After fine-tuning, the adapter layers can be merged into the model, introducing no additional parameters or computational overhead at inference.
  • Quantization: There are tools like Unsloth that provide quantized models for training and inference, which can be used to reduce the memory footprint of the model.
  • Freeze Layers: Freeze the weights of some layers during training to reduce the memory footprint of the model.
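A hedged sketch of how several of these tricks look with the Trainer API (the values are illustrative, and model is assumed to be an already loaded Transformers model):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,   # small per-step batch to fit in memory
    gradient_accumulation_steps=8,   # effective batch size = 4 * 8 = 32
    gradient_checkpointing=True,     # trade compute for activation memory
    fp16=True,                       # mixed precision (or bf16=True on supported hardware)
)

# Freezing layers: disable gradients for the base model and train only the head.
for param in model.base_model.parameters():
    param.requires_grad = False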

Tools Table

  • vLLM: A fast and easy-to-use library for LLM inference and serving. Tags: LLM, Serving
  • LoRAX: Serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. Tags: LLM, LoRA, Serving
  • LLM Compressor: Easy-to-use library for optimizing models for deployment with vLLM. Tags: LLM, Compression
  • Ludwig: A low-code framework for building custom AI models like LLMs and other deep neural networks. Tags: LLM, Fine-Tuning, Low-Code
  • Axolotl: A tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures. Tags: LLM, Fine-Tuning, Low-Code
  • LitGPT: High-performance LLMs with recipes to pretrain, fine-tune, and deploy at scale. Tags: LLM, Fine-Tuning, Low-Code
  • Distilabel: Synthesize data for AI and add feedback on the fly! Tags: Data, Synthetic

Useful Links

  • Dataset loading documentation: guide on how to load a dataset from the Hub without a dataset loading script, from a local loading script, from local files, from in-memory data, offline, or from a specific slice of a split, covering audio, image, and text datasets.

To-Do

Main NLP Tasks

Advanced NLP Tasks

Mini Projects

  • Improved sequence classification: Compare the performance of a base model finetuned on IMDB with a model previously finetuned with MLM on the same dataset and later finetuned on IMDB for sequence classification.

Other

Credits

Courses

Posts & Articles
