Fixed links in multimodal models transfer learning introduction #305

Merged
18 changes: 9 additions & 9 deletions chapters/en/unit4/multimodal-models/transfer_learning.mdx
@@ -4,7 +4,7 @@ In the preceding sections, we've delved into the fundamental concepts of multimo

There are several approaches to how you can adapt multimodal models to your use case:

1. **Zero/few-shot learning**. Zero/few-shot learning involves leveraging large pretrained models capable of solving tasks that were not present in their training data. These approaches can be useful when there is little labeled data for a task (5-10 examples) or none at all. [Unit 11](../Unit%2011%20%20-%20Zero%20Shot%20Computer%20Vision/1.mdx) will delve deeper into this topic.
1. **Zero/few-shot learning**. Zero/few-shot learning involves leveraging large pretrained models capable of solving tasks that were not present in their training data. These approaches can be useful when there is little labeled data for a task (5-10 examples) or none at all. [Unit 11](https://huggingface.co/learn/computer-vision-course/unit11/1) will delve deeper into this topic.

2. **Training the model from scratch**. When pre-trained model weights are unavailable or the model's dataset substantially differs from your own, this method becomes necessary. Here, we initialize model weights randomly (or via more sophisticated methods like [He initialization](https://arxiv.org/abs/1502.01852)) and proceed with the usual training. However, this approach demands substantial amounts of training data.
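As a minimal illustration of the training-from-scratch option above (this sketch is not part of the diffed file, and the model architecture and layer sizes are arbitrary), He initialization can be applied in PyTorch like this:

```python
import torch.nn as nn

# A small, arbitrary model trained from scratch (no pretrained weights).
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

def init_weights(module: nn.Module) -> None:
    # He (Kaiming) initialization is well suited to ReLU networks.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)  # then train as usual on your full dataset
```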

@@ -38,12 +38,12 @@ However, despite its advantages, transfer learning has some challenges that shou

## Transfer Learning Applications

We'll explore practical applications of transfer learning across various tasks. Navigate to the Jupyter notebook relevant to your task of interest from the provided table.
We'll explore practical applications of transfer learning across various tasks. The table below describes tasks that can be solved with multimodal models, along with examples of how you can fine-tune them on your own data.

| Task | Description | Model | Notebook |
| ----------- | ---------------------------------------------------------------- | ------------------------------------------------- | ----------- |
| Fine-tune CLIP | Fine-tuning CLIP on a custom dataset | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) | [CLIP notebook](https://) |
| VQA | Answering a question in natural <br/> language based on an image | [dandelin/vilt-b32-finetuned-vqa](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) | [VQA notebook](https://) |
| Image-to-Text | Describing an image in natural language | [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large) | [Text 2 Image notebook](https://) |
| Open-set object detection | Detecting objects from natural language input | [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) | [Grounding DINO notebook](https://) |
| Assistant (GPT-4V like) | Instruction tuning in the multimodal field | [LLaVA](https://github.com/haotian-liu/LLaVA) | [LLaVa notebook](https://) |
| Task | Description | Model |
| ----------- | ---------------------------------------------------------------- | ------------------------------------------------- |
| [Fine-tune CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb)| Fine-tuning CLIP on a custom dataset | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| [VQA](https://huggingface.co/docs/transformers/main/en/tasks/visual_question_answering#train-the-model) | Answering a question in natural <br/> language based on an image | [dandelin/vilt-b32-mlm](https://huggingface.co/dandelin/vilt-b32-mlm) |
| [Image-to-Text](https://huggingface.co/docs/transformers/main/en/tasks/image_captioning) | Describing an image in natural language | [microsoft/git-base](https://huggingface.co/microsoft/git-base) |
| [Open-set object detection](https://docs.ultralytics.com/models/yolo-world/) | Detecting objects from natural language input | [YOLO-World](https://huggingface.co/papers/2401.17270) |
| [Assistant (GPT-4V like)](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) | Instruction tuning in the multimodal field | [LLaVA](https://huggingface.co/docs/transformers/model_doc/llava) |
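To illustrate the first row of the updated table (fine-tuning CLIP on a custom dataset), a minimal sketch with the Transformers `CLIPModel` could look like the following; the two dummy image-caption pairs are placeholders for a real dataset, and the linked notebook remains the full walkthrough:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy image-caption pairs stand in for a real custom dataset.
images = [Image.new("RGB", (224, 224), color=c) for c in ("red", "blue")]
captions = ["a red square", "a blue square"]

model.train()
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs, return_loss=True)  # symmetric image-text contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```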