From d7dfcaf484c0088c15ffa30baae429062f2b3501 Mon Sep 17 00:00:00 2001 From: Johannes Date: Fri, 12 Apr 2024 16:49:26 +0200 Subject: [PATCH 1/4] smaller additions and adjustments --- .../en/Unit 0 - Welcome/TableOfContents.mdx | 84 +++++++++---------- chapters/en/Unit 0 - Welcome/welcome.mdx | 4 +- .../introduction.mdx | 6 ++ 3 files changed, 50 insertions(+), 44 deletions(-) diff --git a/chapters/en/Unit 0 - Welcome/TableOfContents.mdx b/chapters/en/Unit 0 - Welcome/TableOfContents.mdx index c5a342b1b..111dedeee 100644 --- a/chapters/en/Unit 0 - Welcome/TableOfContents.mdx +++ b/chapters/en/Unit 0 - Welcome/TableOfContents.mdx @@ -1,47 +1,45 @@ # Table of Contents for Notebooks -Welcome to the Community Computer Vision Course! πŸ€— -Join us as we delve into the fundamentals and recent developments in computer vision. -Our goal is to offer a beginner-friendly resource. -Let's dive into the chapters for a wealth of knowledge! +Here you can find a list of notebooks that contain accompanying and hands-on material to the chapters you find in this course. +Feel free to browse them at your own speed and interest. -| Chapter Title | Notebooks | Colabs | -|--------------------------------------------------------|----------------------------------------------------------|----------| -| Unit 0 - Welcome | No Notebook | No Colab | -| Unit 1 - Fundamentals | No Notebook | No Colab | -| Unit 2 - Convolutional Neural Networks | [Transfer Learning with VGG19](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/transfer_learning_vgg19.ipynb) | [Transfer Learning with VGG](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/transfer_learning_vgg19.ipynb) | -| | [Using ResNet with timm](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/timm_Resnet.ipynb) | [timm_Resnet](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/timm_Resnet.ipynb) | -| Unit 3 - Vision Transformers | [Detection Transformer (DETR)](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/DETR.ipynb) | [Detection Transformer (DETR)](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/DETR.ipynb) | -| | [Fine-tuning Vision Transformers for Object Detection](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Fine-tuning%20Vision%20Transformers%20for%20Object%20detection.ipynb) | [Fine-tuning Vision Transformers for Object Detection](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Fine-tuning%20Vision%20Transformers%20for%20Object%20detection.ipynb) | -| | [Knowledge Distillation](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb) | [Knowledge Distillation](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb) | -| | [LoRA Fine-tuning for Image 
Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/LoRA-Image-Classification.ipynb) | [LoRA Fine-tuning for Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/LoRA-Image-Classification.ipynb) | -| | [Fine-tuning for Multilabel Image Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb) | [Fine-tuning for Multilabel Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb) | -| | [Transfer Learning for Image Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb) | [Transfer Learning for Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb) | -| | [Transfer Learning for Image Segmentation](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) | [Transfer Learning for Image Segmentation](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) | -| | [Swin Transformer](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Swin.ipynb) | [Swin Transformer](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Swin.ipynb) | -| Unit 4 - Multimodal Models | [Clip Crop](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/ClipCrop.ipynb) | [Clip Crop](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/ClipCrop.ipynb) | -| | [Fine-tuning CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb) | [Fine-tuning CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb) | -| | [Clustering with CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Clustering%20with%20CLIP.ipynb) | [Clustering with CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Clustering%20with%20CLIP.ipynb) | -| | [Image Classification with CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image%20classification%20with%20CLIP.ipynb) | [Image Classification with CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image%20classification%20with%20CLIP.ipynb) | -| | [Image Retrieval with 
Prompts](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_retrieval_with_prompts.ipynb) | [Image Retrieval with Prompts](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_retrieval_with_prompts.ipynb) | -| | [Image Similarity](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_similarity.ipynb) | [Image Similarity](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_similarity.ipynb) | -| Unit 5 - Generative Models | No Notebook | No Colab | -| Unit 6 - Basic CV Tasks | [Fine-tune SAM on Custom Dataset](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%206%20-%20Basic%20CV%20Tasks/Fine_tune_SAM_(Segment_Anything_Model)_on_Custom_Dataset.ipynb) | [Fine-tune SAM on Custom Dataset](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%206%20-%20Basic%20CV%20Tasks/Fine_tune_SAM_(Segment_Anything_Model)_on_Custom_Dataset.ipynb) | -| Unit 7 - Video and Video Processing | No Notebook | No Colab | -| Unit 8 - 3D Vision, Scene Rendering, and Reconstruction| No Notebook | No Colab | -| Unit 9 - Model Optimization | [Edge TPU](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/edge_tpu.ipynb) | [Edge TPU](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/edge_tpu.ipynb) | -| | [ONNX](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/onnx.ipynb) | [ONNX](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/onnx.ipynb) | -| | [OpenVINO](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/openvino.ipynb) | [OpenVINO](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/openvino.ipynb) | -| | [Optimum](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/optimum.ipynb) | [Optimum](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/optimum.ipynb) | -| | [TensorRT](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tensorrt.ipynb) | [TensorRT](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tensorrt.ipynb) | -| | [TMO](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tmo.ipynb) | [TMO](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tmo.ipynb) | -| | [Torch](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/torch.ipynb) | [Torch](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/torch.ipynb) | -| Unit 10 - Synthetic Data Creation | [Dataset 
Labeling with OWLv2](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/OWLV2_labeled_image_dataset_with_annotations.ipynb) | [Dataset Labeling with OWLv2](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/OWLV2_labeled_image_dataset_with_annotations.ipynb) | -| | [Generating Synthetic Lung Images](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/Synthetic_lung_images_hf_course.ipynb) | [Generating Synthetic Lung Images](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/Synthetic_lung_images_hf_course.ipynb) | -| | [BlenderProc Examples](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/blenderproc_examples.ipynb) | [BlenderProc Examples](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/blenderproc_examples.ipynb) | -| | [Image Labeling with BLIP-2](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/image_labeling_BLIP_2.ipynb) | [Image Labeling with BLIP-2](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/image_labeling_BLIP_2.ipynb) | -| | [Synthetic Data Creation with SDXL Turbo](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/synthetic_data_creation_sdxl_turbo.ipynb) | [Synthetic Data Creation with SDXL Turbo](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/synthetic_data_creation_sdxl_turbo.ipynb) | -| Unit 11 - Zero Shot Computer Vision | No Notebook | No Colab | -| Unit 12 - Ethics and Biases | No Notebook | No Colab | -| Unit 13 - Outlook | No Notebook | No Colab | +| Chapter Title | Notebooks | Colabs | +| ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Unit 0 - Welcome | No Notebook | No Colab | +| Unit 1 - Fundamentals | No Notebook | No Colab | +| Unit 2 - Convolutional Neural Networks | [Transfer Learning with VGG19](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/transfer_learning_vgg19.ipynb) | [Transfer Learning with VGG](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/transfer_learning_vgg19.ipynb) | +| | [Using ResNet with timm](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/timm_Resnet.ipynb) | 
[timm_Resnet](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%202%20-%20Convolutional%20Neural%20Networks/timm_Resnet.ipynb) | +| Unit 3 - Vision Transformers | [Detection Transformer (DETR)](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/DETR.ipynb) | [Detection Transformer (DETR)](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/DETR.ipynb) | +| | [Fine-tuning Vision Transformers for Object Detection](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Fine-tuning%20Vision%20Transformers%20for%20Object%20detection.ipynb) | [Fine-tuning Vision Transformers for Object Detection](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Fine-tuning%20Vision%20Transformers%20for%20Object%20detection.ipynb) | +| | [Knowledge Distillation](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb) | [Knowledge Distillation](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/KnowledgeDistillation.ipynb) | +| | [LoRA Fine-tuning for Image Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/LoRA-Image-Classification.ipynb) | [LoRA Fine-tuning for Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/LoRA-Image-Classification.ipynb) | +| | [Fine-tuning for Multilabel Image Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb) | [Fine-tuning for Multilabel Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/fine-tuning-multilabel-image-classification.ipynb) | +| | [Transfer Learning for Image Classification](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb) | [Transfer Learning for Image Classification](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-image-classification.ipynb) | +| | [Transfer Learning for Image Segmentation](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) | [Transfer Learning for Image Segmentation](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning-segmentation.ipynb) | +| | [Swin Transformer](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Swin.ipynb) | [Swin Transformer](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/Swin.ipynb) | +| Unit 4 - Multimodal Models | [Clip Crop](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/ClipCrop.ipynb) | [Clip 
Crop](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/ClipCrop.ipynb) | +| | [Fine-tuning CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb) | [Fine-tuning CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/Clip_finetune.ipynb) | +| | [Clustering with CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Clustering%20with%20CLIP.ipynb) | [Clustering with CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Clustering%20with%20CLIP.ipynb) | +| | [Image Classification with CLIP](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image%20classification%20with%20CLIP.ipynb) | [Image Classification with CLIP](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image%20classification%20with%20CLIP.ipynb) | +| | [Image Retrieval with Prompts](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_retrieval_with_prompts.ipynb) | [Image Retrieval with Prompts](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_retrieval_with_prompts.ipynb) | +| | [Image Similarity](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_similarity.ipynb) | [Image Similarity](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%204%20-%20Multimodal%20Models/CLIP%20and%20relatives/Image_similarity.ipynb) | +| Unit 5 - Generative Models | No Notebook | No Colab | +| Unit 6 - Basic CV Tasks | [Fine-tune SAM on Custom Dataset]() | [Fine-tune SAM on Custom Dataset]() | +| Unit 7 - Video and Video Processing | No Notebook | No Colab | +| Unit 8 - 3D Vision, Scene Rendering, and Reconstruction | No Notebook | No Colab | +| Unit 9 - Model Optimization | [Edge TPU](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/edge_tpu.ipynb) | [Edge TPU](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/edge_tpu.ipynb) | +| | [ONNX](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/onnx.ipynb) | [ONNX](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/onnx.ipynb) | +| | [OpenVINO](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/openvino.ipynb) | [OpenVINO](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/openvino.ipynb) | +| | [Optimum](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/optimum.ipynb) | 
[Optimum](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/optimum.ipynb) | +| | [TensorRT](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tensorrt.ipynb) | [TensorRT](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tensorrt.ipynb) | +| | [TMO](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tmo.ipynb) | [TMO](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/tmo.ipynb) | +| | [Torch](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/torch.ipynb) | [Torch](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%209%20-%20Model%20Optimization/torch.ipynb) | +| Unit 10 - Synthetic Data Creation | [Dataset Labeling with OWLv2](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/OWLV2_labeled_image_dataset_with_annotations.ipynb) | [Dataset Labeling with OWLv2](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/OWLV2_labeled_image_dataset_with_annotations.ipynb) | +| | [Generating Synthetic Lung Images](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/Synthetic_lung_images_hf_course.ipynb) | [Generating Synthetic Lung Images](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/Synthetic_lung_images_hf_course.ipynb) | +| | [BlenderProc Examples](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/blenderproc_examples.ipynb) | [BlenderProc Examples](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/blenderproc_examples.ipynb) | +| | [Image Labeling with BLIP-2](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/image_labeling_BLIP_2.ipynb) | [Image Labeling with BLIP-2](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/image_labeling_BLIP_2.ipynb) | +| | [Synthetic Data Creation with SDXL Turbo](https://github.com/johko/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/synthetic_data_creation_sdxl_turbo.ipynb) | [Synthetic Data Creation with SDXL Turbo](https://colab.research.google.com/github/fariddinar/computer-vision-course/blob/main/notebooks/Unit%2010%20-%20Synthetic%20Data%20Creation/synthetic_data_creation_sdxl_turbo.ipynb) | +| Unit 11 - Zero Shot Computer Vision | No Notebook | No Colab | +| Unit 12 - Ethics and Biases | No Notebook | No Colab | +| Unit 13 - Outlook | No Notebook | No Colab | diff --git a/chapters/en/Unit 0 - Welcome/welcome.mdx b/chapters/en/Unit 0 - Welcome/welcome.mdx index bad520ca2..6e5b4cea0 100644 --- a/chapters/en/Unit 0 - Welcome/welcome.mdx +++ b/chapters/en/Unit 0 - Welcome/welcome.mdx @@ -132,7 +132,6 @@ Our goal was to create a computer vision course that is beginner-friendly and th - 
Reviewers: [Ratan Prasad](https://github.com/ratan), [William Bonvini](https://github.com/WilliamBonvini), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace0) - Writers: [John Fozard](https://github.com/jfozard), [Vasu Gupta](https://github.com/vasugupta9), [Psetinek](https://github.com/psetinek) - **Unit 9 - Model Optimization** - Reviewers: [Ratan Prasad](https://github.com/ratan), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace) @@ -158,4 +157,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th - Reviewers: [Ratan Prasad](https://github.com/ratan), [Ameed Taylor](https://github.com/atayloraerospace), [Mohammed Hamdy](https://github.com/mmhamdy) - Writers: [Farros Alferro](https://github.com/farrosalferro), [Mohammed Hamdy](https://github.com/mmhamdy), [Louis Ulmer](https://github.com/lulmer), [Dario Wisznewer](https://github.com/dariowsz), [gonzachiar](https://github.com/gonzachiar) +**Organisation Team** +[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko) + We are happy to have you here, let's get started! diff --git a/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx b/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx index dea7740b4..66217794d 100644 --- a/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx +++ b/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx @@ -1,5 +1,11 @@ # Introduction to Convolutional Neural Networks +In the last unit we learned about the fundamentals of vision, images and Computer Vision. We also explored visual features as a crucial part of analyzing images with the help of computers. + +The approaches we discussed are today often referred to as "classical" Computer Vision. While working fine on many small and restrained datasets and settings, classical methods have their limits that come to light when looking at bigger scale real-world datasets. + +In this unit, we will learn about Convolutional Neural Networks, an important step forward in terms of scale and performance of Computer Vision. + ## Convolution: Basic Ideas Convolution is an operation used to extract features from data. The data can be 1D, 2D or 3D. We'll explain the operation with a solid example. All you need to know now is that the operation simply takes a matrix made of numbers, moves it through the data, and takes the sum of products between the data and that matrix. This matrix is called kernel or filter. You might say, "What does it have to do with the feature extraction, and how am I supposed to apply it? From a0a75c09d5b6675ee26495d834a0012243f6d2fa Mon Sep 17 00:00:00 2001 From: Johannes Date: Fri, 12 Apr 2024 22:22:10 +0200 Subject: [PATCH 2/4] smooth out multimodal part --- ... 
Transformers for Image Classification.mdx | 11 ----- .../introduction.mdx | 37 ++++++++++++++++ .../CLIP and relatives/Introduction.mdx | 11 +++-- ...ntroduction.mdx => a_multimodal_world.mdx} | 42 +++++++++++-------- .../Unit 4 - Multimodal Models/pre-intro.mdx | 42 +++++++++++-------- .../supplementary-material.mdx | 5 ++- chapters/en/_toctree.yml | 10 ++--- 7 files changed, 104 insertions(+), 54 deletions(-) create mode 100644 chapters/en/Unit 3 - Vision Transformers/introduction.mdx rename chapters/en/Unit 4 - Multimodal Models/{introduction.mdx => a_multimodal_world.mdx} (88%) diff --git a/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx b/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx index fe8e05df3..89fcbd10c 100644 --- a/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx @@ -4,17 +4,6 @@ As the Transformers architecture scaled well in Natural Language Processing, the same architecture was applied to images by creating small patches of the image and treating them as tokens. The result was a Vision Transformer (Vision Transformers). Before we get started with transfer learning / fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers. -### CNN vs Vision Transformers: Inductive Bias - -Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far. - -Here's a couple of inductive biases we observe in CNNs: - -- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features. - -- Locality: pixels in an image interact mainly with its surrounding pixels to form features. - -These are lacking in Vision Transformers. Then how do they perform so well? It's because they're highly scalable and they're trained on massive amounts of images. Hence, they overcome the need for these inductive biases. - ### Using pre-trained Vision Transformers It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available models from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending). diff --git a/chapters/en/Unit 3 - Vision Transformers/introduction.mdx b/chapters/en/Unit 3 - Vision Transformers/introduction.mdx new file mode 100644 index 000000000..8b647e1b5 --- /dev/null +++ b/chapters/en/Unit 3 - Vision Transformers/introduction.mdx @@ -0,0 +1,37 @@ +# Introduction to Vision Transformers + +In the previous unit we learned about Convolutional Neural Networks (CNNs) and some of their use cases. +We saw how some of the most prominent CNNs can be implemented and fine-tuned on custom data. + +CNNs have been the go-to choice for Computer Vision practitioners for many tasks ever since the advent of the AlexNet architecture. +However, the wheel of time keeps turning, and new architectures that challenge the supremacy of CNNs are proposed on a regular basis. +While many of them fade away or are only used in niche domains, recent years have seen one competitor emerge that is on its way to taking +the crown as the favourite choice for Computer Vision practitioners: Vision Transformers (ViTs).
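To make the core idea of ViTs more concrete (cutting an image into fixed-size patches and treating each patch as a token), here is a minimal, illustrative sketch in PyTorch. It is not taken from the course notebooks; the image size, patch size and embedding dimension are assumptions chosen to mirror the common ViT-Base configuration.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16                      # 16x16 pixel patches, as in ViT-Base
embed_dim = 768                      # dimensionality of each patch token

# A convolution with kernel_size == stride cuts the image into non-overlapping
# patches and linearly projects each one in a single step.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(image)                 # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
print(tokens.shape)
```

Each of the 196 resulting vectors plays the same role as a word token in an NLP Transformer, which is why the rest of the architecture can be reused almost unchanged.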
+ +## Transformers +The Transformer architecture was originally proposed in the 2017 paper ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. +The novel architecture quickly caught interest in the Natural Language Processing community, which had previously mainly been building its +applications on Recurrent Neural Network (RNN) architectures like Long Short-Term Memory (LSTM). + +Transformers promised a wide array of new application possibilities, as they surpassed RNN architectures in many respects, e.g. parallelization, +scalability and flexibility. + +## Self-Attention +While the new architecture has many mechanisms that play together nicely, the most important of them might be Self-Attention, +a new way of taking into account the dependency between parts of the input. + +For every token, the model derives query, key and value vectors from the input embeddings; the output for a token is a weighted sum of all value vectors, with the weights given by a softmax over the similarity between that token's query and every key. In this way, each part of the input can directly attend to every other part, no matter how far apart they are in the sequence. + + +### CNN vs Vision Transformers: Inductive Bias + +Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far. + +Here's a couple of inductive biases we observe in CNNs: + +- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features. +- Locality: pixels in an image interact mainly with its surrounding pixels to form features. + +These are lacking in Vision Transformers. Then how do they perform so well? It's because they're highly scalable and they're trained on massive amounts of images. Hence, they overcome the need for these inductive biases. + + diff --git a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/Introduction.mdx b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/Introduction.mdx index 7765ca545..ab1f595d0 100644 --- a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/Introduction.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/Introduction.mdx @@ -1,6 +1,6 @@ -# Introduction +# CLIP and Relatives -This section provides an overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks. +So far we have learned about the fundamentals of multimodality with a special spotlight on Vision Language Models. This chapter provides a short overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks. It sets the stage for a high-level exploration of key multimodal models that have emerged before and after CLIP, showcasing their significant contributions to the advancement of multimodal AI. ## Pre-CLIP @@ -9,9 +9,11 @@ In this part, we explore the innovative attempts in multimodal AI before CLIP. The focus is on influential papers that used deep learning to make significant strides in the field: 1. **"Multimodal Deep Learning" by Ngiam et al. (2011):** This paper demonstrated the use of deep learning for multimodal inputs, emphasizing the potential of neural networks in integrating different data types. It laid the groundwork for future innovations in multimodal AI. + - [Multimodal Deep Learning](https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf) 2.
**"Deep Visual-Semantic Alignments for Generating Image Descriptions" by Karpathy and Fei-Fei (2015):** This study presented a method for aligning textual data with specific image regions, enhancing the interpretability of multimodal systems and advancing the understanding of complex visual-textual relationships. + - [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf) 3. **"Show and Tell: A Neural Image Caption Generator" by Vinyals et al. (2015):** This paper marked a significant step in practical multimodal AI by showing how CNNs and RNNs could be combined to transform visual information into descriptive language. @@ -22,12 +24,15 @@ The focus is on influential papers that used deep learning to make significant s The emergence of CLIP brought new dimensions to multimodal models, as illustrated by the following developments: 1. **CLIP:** OpenAI's CLIP was a game-changer, learning from a vast array of internet text-image pairs and enabling zero-shot learning, contrasting with earlier models. + - [CLIP](https://openai.com/blog/clip/) 2. **GroupViT:** Innovating in segmentation and semantic understanding, GroupViT combined these aspects with language, showing advanced integration of language and vision. + - [GroupViT](https://arxiv.org/abs/2202.11094) 3. **BLIP:** BLIP introduced bidirectional learning between vision and language, pushing the boundaries for generating text from visual inputs. + - [BLIP](https://arxiv.org/abs/2201.12086) 4. **OWL-VIT:** Focusing on object-centric representations, OWL-VIT advanced the understanding of objects within images in context with text. @@ -41,4 +46,4 @@ These developments highlight the evolving methods of processing multimodal data The upcoming sections will delve into the "Losses" aspect, focusing on various loss functions and self-supervised learning crucial for training multimodal models. The "Models" section will provide a deeper understanding of CLIP and its variants, exploring their designs and functionalities. Finally, the "Practical Notebooks" section will offer hands-on experience, addressing challenges like data bias and applying these models in tasks such as image search engines and visual question answering systems. -These sections aim to deepen your knowledge and practical skills in the multifaceted world of multimodal AI. \ No newline at end of file +These sections aim to deepen your knowledge and practical skills in the multifaceted world of multimodal AI. diff --git a/chapters/en/Unit 4 - Multimodal Models/introduction.mdx b/chapters/en/Unit 4 - Multimodal Models/a_multimodal_world.mdx similarity index 88% rename from chapters/en/Unit 4 - Multimodal Models/introduction.mdx rename to chapters/en/Unit 4 - Multimodal Models/a_multimodal_world.mdx index 361e66ca4..54c5acb70 100644 --- a/chapters/en/Unit 4 - Multimodal Models/introduction.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/a_multimodal_world.mdx @@ -1,18 +1,19 @@ -# Introduction +# A Multimodal World -Welcome to the chapter on Fusion of Text and Vision. This chapter builds the foundation for the later sections of the unit. We will explore: -- The notion of multimodality, and different sensory inputs humans use for efficient decision making. +Welcome to the chapter on the fundamentals of multimodality. This chapter builds the foundation for the later sections of the unit. We will explore: + +- The notion of multimodality, and different sensory inputs humans use for efficient decision making. 
- Why is it important for making innovative applications and services through which we can interact and make lives easier. - Multimodality in context to Deep Learning, data, tasks, and models. - Related applications like multimodal emotion recognition and multimodal search. So let's begin πŸ€— -## What is multimodality? πŸ“ΈπŸ“πŸŽ΅ +## What is Multimodality? πŸ“ΈπŸ“πŸŽ΅ -A modality means a medium or a way in which something exists or is done. In our daily lives, we come across many scenarios where we have to make decisions and perform tasks. For this, we use our 5 sense organs (eyes to see, ears to hear, nose to smell, tongue to taste, and skin to touch). Based on the information from all sense organs, we assess our environment, perform tasks, and make decisions for our survival. Each of these 5 sense organs is a different modality through which information comes to us and thus the word multimodality or multimodal. +A modality means a medium or a way in which something exists or is done. In our daily lives, we come across many scenarios where we have to make decisions and perform tasks. For this, we use our 5 sense organs (eyes to see, ears to hear, nose to smell, tongue to taste, and skin to touch). Based on the information from all sense organs, we assess our environment, perform tasks, and make decisions for our survival. Each of these 5 sense organs is a different modality through which information comes to us and thus the word multimodality or multimodal. -Think about this scenario for a moment, on a windy night you hear an eerie sound while you are on your bed πŸ‘»πŸ˜¨. You feel a bit scared, as you are unaware about the source of the sound. You try to gather some courage and check your environment but you are unable to figure this out 😱. Daringly, you turn on the lights and you find out that it was just your window which was half-opened through which the wind was blowing and making the sound in the first place πŸ˜’. +Think about this scenario for a moment, on a windy night you hear an eerie sound while you are on your bed πŸ‘»πŸ˜¨. You feel a bit scared, as you are unaware about the source of the sound. You try to gather some courage and check your environment but you are unable to figure this out 😱. Daringly, you turn on the lights and you find out that it was just your window which was half-opened through which the wind was blowing and making the sound in the first place πŸ˜’. So what just happened here? Initially you had restricted understanding of the situation due to your limited knowledge of the environment. This limited knowledge was due to the fact because you were just relying on your ears (the eerie sound), to make sense. But as soon as you turned on the lights in the room and looked around through your eyes (added another sense organ), you had a better understanding about the whole situation. As we kept on adding modalities our understanding of the situation became better and clearer than before, for the same scenario, this suggests that adding more modalities to the same situation assist each other and improves the information content. 
Even while taking this course and moving ahead, would you not like cool infographics, accompanied by video content explaining minute concepts instead of just plain textual content πŸ˜‰ @@ -20,7 +21,7 @@ Here you go: ![Multimodality Notion](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/multimodal_fusion_text_vision/multimodal_elephant.png) -*An infographic on multimodality and why it is important to capture the overall sense of data through different modalities. The infographic is multimodal as well (image + text).* +_An infographic on multimodality and why it is important to capture the overall sense of data through different modalities. The infographic is multimodal as well (image + text)._ Many times communication between 2 people gets really awkward in textual mode, slightly improves when voices are involved but greatly improves when you are able to visualize body language and facial expressions as well. This has been studied in detail by the American Psychologist, Albert Mehrabian who stated this as the 7-38-55 rule of communication, the rule states: "In communication, 7% of the overall meaning is conveyed through verbal mode (spoken words), 38% through voice and tone and 55% through body language and facial expressions." @@ -30,22 +31,25 @@ Many times communication between 2 people gets really awkward in textual mode, s To be more general, in the context of AI, 7% of the meaning conveyed is through textual modality, 38% through audio modality and 55% through vision modality. Within the context of deep learning, we would refer each modality as a way data arrives to a deep learning model for processing and predictions. The most commonly used modalities in deep learning are: vision, audio and text. Other modalities can also be considered for specific use cases like LIDAR, EEG Data, eye tracking data etc. -Unimodal models and datasets are purely based on a single modality, and have been studied for long with many tasks and benchmarks but are limited in their capabilities. Relying on a single modality might not give us the complete picture, and combining more modalities will increase the information content and reduce the possibility of missing cues that might be in them. +Unimodal models and datasets are purely based on a single modality, and have been studied for long with many tasks and benchmarks but are limited in their capabilities. Relying on a single modality might not give us the complete picture, and combining more modalities will increase the information content and reduce the possibility of missing cues that might be in them. For the machines around us to be more intelligent, better at communicating with us and have enhanced interpretation and reasoning capabilities, it is important to build applications and services around models and datasets that are multimodal in nature. Because, multimodality can give us a clearer and more accurate representation of the world around us enabling us to develop applications that are closer to the real-world scenarios. **Common combinations of modalities and real life examples:** + - Vision + Text : Infographics, Memes, Articles, Blogs. - Vision + Audio: A Skype call with your friend, dyadic conversations. -- Vision + Audio + Text: Watching YouTube videos or movies with captions, social media content in general is multimodal. +- Vision + Audio + Text: Watching YouTube videos or movies with captions, social media content in general is multimodal. 
- Audio + Text: Voice notes, music files with lyrics ## Multimodal Datasets + A dataset consisting of multiple modalities is a multimodal dataset. Out of the common modality combinations let us see some examples: + - Vision + Text: [Visual Storytelling Dataset](https://visionandlanguage.net/VIST/), [Visual Question Answering Dataset](https://visualqa.org/download.html), [LAION-5B Dataset](https://laion.ai/blog/laion-5b/). - Vision + Audio: [VGG-Sound Dataset](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [RAVDESS Dataset](https://zenodo.org/records/1188976), [Audio-Visual Identity Database (AVID)](https://www.avid.wiki/Main_Page). - Vision + Audio + Text: [RECOLA Database](https://diuf.unifr.ch/main/diva/recola/), [IEMOCAP Dataset](https://sail.usc.edu/iemocap/). -Now let us see what kind of tasks can be performed using a multimodal dataset? There are many examples, but we will focus generally on tasks that contains the visual and textual +Now let us see what kind of tasks can be performed using a multimodal dataset? There are many examples, but we will focus generally on tasks that contains the visual and textual A multimodal dataset will require a model which is able to process data from multiple modalities, such a model is a multimodal model. ## Multimodal Tasks and Models @@ -56,40 +60,44 @@ models specifically designed for these tasks. So tasks and models go hand in han Hugging Face supports a wide variety of multimodal tasks. Let us look into some of them. **Some multimodal tasks supported by πŸ€— and their variants:** + 1. Vision + Text: + - [Visual Question Answering or VQA](https://huggingface.co/tasks/visual-question-answering): Aiding visually impaired persons, efficient image retrieval, video search, Video Question Answering, Document VQA. - [Image to Text](https://huggingface.co/tasks/image-to-text): Image Captioning, Optical Character Recognition (OCR), Pix2Struct. - [Text to Image](https://huggingface.co/tasks/text-to-image): Image Generation - [Text to Video](https://huggingface.co/tasks/text-to-video): Text-to-video editing, Text-to-video search, Video Translation, Text-driven Video Prediction. -2. Audio + Text: +2. Audio + Text: + - [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) (or Speech to Text): Virtual Speech Assistants, Caption Generation. - [Text to Speech](https://huggingface.co/tasks/text-to-speech): Voice assistants, Announcement Systems. -πŸ’‘An amazing usecase of multimodal task is Multimodal Emotion Recognition (MER). The MER task involves recognition of emotion from two or more modalities like audio+text, text+vision, audio+vision or vision+text+audio As we discussed in the example, MER is more efficient than unimodal emotion recognition and gives clear insight into the +πŸ’‘An amazing usecase of multimodal task is Multimodal Emotion Recognition (MER). The MER task involves recognition of emotion from two or more modalities like audio+text, text+vision, audio+vision or vision+text+audio As we discussed in the example, MER is more efficient than unimodal emotion recognition and gives clear insight into the emotion recognition task. Check out more on MER with [this repository](https://github.com/EvelynFan/AWESOME-MER). + ![Multimodal model flow](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/multimodal_fusion_text_vision/Multimodal.jpg) -A multimodal model, is a model that can be used to perform multimodal tasks by processing data coming from multiple modalities at the same time. 
These models combine the uniqueness and strengths of different modalities to make a complete representation of data enhancing the performance on multiple tasks. Multimodal models are trained to integrate and process data from sources like images, videos, text, audio etc. -The process of combining these modalities begins with multiple unimodal models. The outputs of these unimodal models (encoded data) are then fused using a strategy by the fusion module. The strategy of fusion can be early fusion, late fusion or hybrid fusion. The overall task of the fusion module is to make a combined representation of the encoded data from the unimodal models. Finally, a classification network takes up the fused representation to make predictions. +A multimodal model, is a model that can be used to perform multimodal tasks by processing data coming from multiple modalities at the same time. These models combine the uniqueness and strengths of different modalities to make a complete representation of data enhancing the performance on multiple tasks. Multimodal models are trained to integrate and process data from sources like images, videos, text, audio etc. +The process of combining these modalities begins with multiple unimodal models. The outputs of these unimodal models (encoded data) are then fused using a strategy by the fusion module. The strategy of fusion can be early fusion, late fusion or hybrid fusion. The overall task of the fusion module is to make a combined representation of the encoded data from the unimodal models. Finally, a classification network takes up the fused representation to make predictions. A detailed section on multimodal tasks and models with a focus on Vision and Text, will be discussed in the next chapter. ## An application of multimodality: Multimodal Search πŸ”ŽπŸ“²πŸ’» Internet search was the one key advantage Google had, but with the introduction of ChatGPT by OpenAI, Microsoft started out with -powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence. +powering up their Bing search engine so that they can crush the competition. It was only restricted initially to LLMs, looking into large corpus of text data but the world around us, mainly social media content, web articles and all possible form of online content is largely multimodal. When we search about an image, the image pops up with a corresponding text to describe it. Won't it be super cool to have another powerful multimodal model which involved both Vision and Text at the same time? This can revolutionize the search landscape hugely, and the core tech involved in this is multimodal learning. 
We know that many companies also have a large database which is multimodal and mostly unstructured in nature. Multimodal models might help companies with internal search, interactive documentation (chatbots), and many such use cases. This is another domain of Enterprise AI where we leverage AI for organizational intelligence. Vision Language Models (VLMs) are models that can understand and process both vision and text modalities. The joint understanding of both modalities leads VLMs to perform various tasks efficiently like Visual Question Answering, Text-to-image search etc. VLMs thus can serve as one of the best candidates for multimodal search. So overall, VLMs should find some way to map text and image pairs to a joint embedding space where each text-image pair is present as an embedding. We can perform various downstream tasks using these embeddings, which can also be used for search. The idea of such a joint space is that image and text embeddings that are similar in meaning will lie close together, enabling us to do searches for images based on text (text-to-image search) or vice-versa. -πŸ’‘Meta released first multimodal AI model to bind information from 6 different modalities: images and videos, audio, text, depth, thermal, and inertial measurement units (IMUs). Learn more about it [here](https://imagebind.metademolab.com/). +πŸ’‘Meta released the first multimodal AI model to bind information from 6 different modalities: images and videos, audio, text, depth, thermal, and inertial measurement units (IMUs). Learn more about it [here](https://imagebind.metademolab.com/). + After going through the fundamentals for multimodality, let's now take a look into different multimodal tasks and models available in πŸ€— and their applications via cool demos and Spaces. diff --git a/chapters/en/Unit 4 - Multimodal Models/pre-intro.mdx b/chapters/en/Unit 4 - Multimodal Models/pre-intro.mdx index 652c548be..935c7a5b8 100644 --- a/chapters/en/Unit 4 - Multimodal Models/pre-intro.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/pre-intro.mdx @@ -1,33 +1,41 @@ # Exploring Multimodal Text and Vision Models: Uniting Senses in AI -Welcome to the Multimodal Text and Vision Models unit! πŸŒπŸ“šπŸ‘οΈ In this journey, we'll dive into the world where computers understand images, videos and text in a way how people use their senses together to understand things in the world around them. +Welcome to the Multimodal Text and Vision Models unit! πŸŒπŸ“šπŸ‘οΈ + +In the last unit we learned about the Transformer architecture, which revolutionized Natural Language Processing, but did not stop at the text modality. +As we have seen, it has begun to conquer the field of Vision (including image and video), bringing with it a wide array of new research and applications. + +In this unit, we'll focus on the data fusion possibilities that this modality-overlapping use of Transformers has enabled, and on the tasks and models that benefit from it. ## Exploring Multimodality πŸ”ŽπŸ€”πŸ’­ -Our adventure begins with understanding why blending text and images is crucial, exploring the history of multimodal models, and discovering how self-supervised learning unlocks the power of multimodality. The unit discusses about different modalities with a focus on text and vision. In this unit we will encounter: +Our adventure begins with understanding why blending text and images is crucial, exploring the history of multimodal models, and discovering how self-supervised learning unlocks the power of multimodality.
The unit discusses about different modalities with a focus on text and vision. In this unit we will encounter three main topics: -**1. Fusion of Text and Vision** -This chapter serves as a foundation, enabling learners to understand the significance of multimodal data, its representation, and its diverse applications laying the groundwork for the fusion of text and vision within AI models. +**1. A Multimodal World + Introduction to Vision Language Models** +These chapter serve as a foundation, enabling learners to understand the significance of multimodal data, its representation, and its diverse applications laying the groundwork for the fusion of text and vision within AI models. In this chapter, you will: - - Understand the nature of real-world multimodal data coming from various sensory inputs that are important for human decision-making. - - Explore practical applications of multimodality in robotics, search , Visual Reasoning etc., showcasing their functionality and diverse applications. - - Learn about diverse multimodal tasks and models focusing on Image to Text, Text to Image, VQA, Document VQA, Captioning, Visual Reasoning etc. - - Conclude with an introduction on Vision Language Models and cool applications including multimodal chatbots. + +- Understand the nature of real-world multimodal data coming from various sensory inputs that are important for human decision-making. +- Explore practical applications of multimodality in robotics, search , Visual Reasoning etc., showcasing their functionality and diverse applications. +- Learn about diverse multimodal tasks and models focusing on Image to Text, Text to Image, VQA, Document VQA, Captioning, Visual Reasoning etc. +- Conclude with an introduction on Vision Language Models and cool applications including multimodal chatbots. **2. CLIP and Relatives** -Moving ahead, this chapter talks about the popular CLIP model and similar vision language models. +Moving ahead, this chapter talks about the popular CLIP model and similar vision language models. In this chapter you will: - - Dive deep into CLIP's magic, from theory to practical applications, and explore its variations. - - Discover relatives like Image-bind, BLIP, and others, along with their real-world implications and challenges. - - Explore the functionality of CLIP, its applications in search, zero-shot classification, and generation models like DALL-E. - - Understand contrastive and non-contrastive losses and exploring the self-supervised learning techniques. + +- Dive deep into CLIP's magic, from theory to practical applications, and explore its variations. +- Discover relatives like Image-bind, BLIP, and others, along with their real-world implications and challenges. +- Explore the functionality of CLIP, its applications in search, zero-shot classification, and generation models like DALL-E. +- Understand contrastive and non-contrastive losses and explore the self-supervised learning techniques. **3. Transfer Learning: Multimodal Text and Vision** In the final chapter of the unit you will: - - Explore diverse multimodal model applications in specific tasks, including one-shot, few-shot, training from scratch, and transfer learning, setting the stage for an exploration of transfer learning's advantages and practical applications in Jupyter notebooks. 
- - Engage in detailed practical implementations within Jupyter notebooks, covering tasks such as CLIP fine-tuning, Visual Question Answering, Image-to-Text, Open-set object detection, and GPT-4V-like Assistant models, focusing on task specifics, datasets, fine-tuning methods, and inference analyses.
- - Conclude by comparing previous sections, discussing benefits, challenges, and offering insights into potential future advancements in multimodal learning.
+
+- Explore diverse multimodal model applications in specific tasks, including one-shot, few-shot, training from scratch, and transfer learning, setting the stage for an exploration of transfer learning's advantages and practical applications in Jupyter notebooks.
+- Engage in detailed practical implementations within Jupyter notebooks, covering tasks such as CLIP fine-tuning, Visual Question Answering, Image-to-Text, Open-set object detection, and GPT-4V-like Assistant models, focusing on task specifics, datasets, fine-tuning methods, and inference analyses.
+- Conclude by comparing previous sections, discussing benefits, challenges, and offering insights into potential future advancements in multimodal learning.
 ## Your Journey Ahead 🏃🏻‍♂️🏃🏻‍♀️🏃🏻
@@ -35,6 +43,6 @@ Get ready for a captivating experience! We'll explore the mechanisms behind mult
 By the end of this unit, you'll possess a solid understanding of multimodal tasks, hands-on-experience with multimodal models, build cool applications based on them, and the evolving landscape of multimodal learning.
-Join us as we navigate the fascinating domain where text and vision converge, unlocking the possibilities of AI understanding the world in a more human-like manner.
+Join us as we navigate the fascinating domain where text and vision converge, unlocking the possibilities of AI understanding the world in a more human-like manner. 
 Let's begin 🚀🤗✨
diff --git a/chapters/en/Unit 4 - Multimodal Models/supplementary-material.mdx b/chapters/en/Unit 4 - Multimodal Models/supplementary-material.mdx
index 2714737be..b7194e587 100644
--- a/chapters/en/Unit 4 - Multimodal Models/supplementary-material.mdx
+++ b/chapters/en/Unit 4 - Multimodal Models/supplementary-material.mdx
@@ -6,5 +6,8 @@ We hope that you found the unit on multimodal models exciting. If you'd like to
 - [**11-777 MMML**](https://cmu-multicomp-lab.github.io/mmml-course/fall2022/) course on multimodal machine learning by CMU. You can find the video lectures [**here**](https://www.youtube.com/@LPMorency/playlists).
 - [**Blog on Multimodality and LLMs by Chip Huyen**](https://huyenchip.com/2023/10/10/multimodal.html) provides a comprehensive overview of multimodality, large multimodal models, systems like BLIP, CLIP, etc.
 - [**Awesome Multimodal ML**](https://github.com/pliang279/awesome-multimodal-ml), a GitHub repository containing papers, courses, architectures, workshops, tutorials etc.
-- [**Awesome Multimodal Large Language Models**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models), a GitHub repository containing papers and datasets related to multimodal LLMs.
+- [**Awesome Multimodal Large Language Models**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models), a GitHub repository containing papers and datasets related to multimodal LLMs. 
 - [**EE/CS 148, Caltech**](https://gkioxari.github.io/teaching/cs148/) course on Large Language and Vision Models.
+
+In the next unit, we will take a look at another kind of Neural Network model that has been revolutionized by multimodality in recent years: **Generative Neural Networks**.
+Get your paint brush ready and join us on another exciting adventure in the realm of Computer Vision 🤠
diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index ae95abedc..4011a04d0 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -66,13 +66,15 @@
 local: "Unit 3 - Vision Transformers/KnowledgeDistillation"
 - title: Unit 4 - Multimodal Models
 sections:
- - title: Introduction
- local: "Unit 4 - Multimodal Models/introduction"
 - title: Exploring Multimodal Text and Vision Models - Uniting Senses in AI
 local: "Unit 4 - Multimodal Models/pre-intro"
+ - title: A Multimodal World
+ local: "Unit 4 - Multimodal Models/a_multimodal_world.mdx"
+ - title: Introduction to Vision Language Models
+ local: "Unit 4 - Multimodal Models/vlm-intro"
 - title: Multimodal Tasks and Models
 local: "Unit 4 - Multimodal Models/tasks-models-part1"
- - title: Introduction to CLIP
+ - title: CLIP and Relatives
 local: "Unit 4 - Multimodal Models/CLIP and relatives/Introduction"
 - title: Losses
 local: "Unit 4 - Multimodal Models/CLIP and relatives/losses"
@@ -84,8 +86,6 @@
 local: "Unit 4 - Multimodal Models/CLIP and relatives/owl_vit"
 - title: Transfer Learning of Multimodal Models
 local: "Unit 4 - Multimodal Models/transfer_learning"
- - title: Introduction to Vision Language Models
- local: "Unit 4 - Multimodal Models/vlm-intro"
 - title: Supplementary Reading and Resources
 local: "Unit 4 - Multimodal Models/supplementary-material"
 - title: Unit 5 - Generative Models

From 6960a5d466e847affd56631a26575608012948c0 Mon Sep 17 00:00:00 2001
From: Johannes
Date: Fri, 12 Apr 2024 22:28:51 +0200
Subject: [PATCH 3/4] remove vit introduction and reinsert inductive bias part

---
 ... Transformers for Image Classification.mdx | 11 ++++++
 .../introduction.mdx                          | 37 -------------------
 2 files changed, 11 insertions(+), 37 deletions(-)
 delete mode 100644 chapters/en/Unit 3 - Vision Transformers/introduction.mdx

diff --git a/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx b/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx
index 89fcbd10c..fe8e05df3 100644
--- a/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx
+++ b/chapters/en/Unit 3 - Vision Transformers/Vision Transformers for Image Classification.mdx
@@ -4,6 +4,17 @@
 As the Transformers architecture scaled well in Natural Language Processing, the same architecture was applied to images by creating small patches of the image and treating them as tokens. The result was a Vision Transformer (Vision Transformers). Before we get started with transfer learning / fine-tuning concepts, let's compare Convolutional Neural Networks (CNNs) with Vision Transformers.
+### CNN vs Vision Transformers: Inductive Bias
+
+Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.
+
+Here are a couple of inductive biases we observe in CNNs:
+
+- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features (see the short sketch below).
+- Locality: pixels in an image interact mainly with its surrounding pixels to form features.
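To make the Translational Equivariance bullet a bit more tangible, here is a tiny, self-contained PyTorch sketch (purely illustrative, not part of the course notebooks): shifting the input of a convolution shifts its output in the same way, so the same feature is detected wherever it appears.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

# A single "feature" (a bright pixel) placed at position (2, 2), then shifted by one pixel.
x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0
x_shifted = torch.roll(x, shifts=(1, 1), dims=(2, 3))

with torch.no_grad():
    y = conv(x)
    y_shifted = conv(x_shifted)

# Up to border effects, shifting the input is the same as shifting the output:
print(torch.allclose(torch.roll(y, shifts=(1, 1), dims=(2, 3)), y_shifted, atol=1e-6))  # True
```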
+
+Both of these inductive biases are lacking in Vision Transformers. Then how do they perform so well? It's because they're highly scalable and they're trained on massive amounts of images. Hence, they overcome the need for these inductive biases.
+
 ### Using pre-trained Vision Transformers

 It's not feasible for everyone to train a Vision Transformer on millions of images to get good performance. Instead, one can use openly available models from places such as the [Hugging Face Hub](https://huggingface.co/models?sort=trending).
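As a quick illustration of the paragraph above, one way (among many) to use such an openly available checkpoint is the 🤗 Transformers `pipeline`; the model id and the image URL below are just example choices:

```python
from transformers import pipeline

# Any image-classification checkpoint from the Hub can be plugged in here.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# The pipeline accepts a URL, a local path, or a PIL image.
predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
for prediction in predictions:
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```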
diff --git a/chapters/en/Unit 3 - Vision Transformers/introduction.mdx b/chapters/en/Unit 3 - Vision Transformers/introduction.mdx
deleted file mode 100644
index 8b647e1b5..000000000
--- a/chapters/en/Unit 3 - Vision Transformers/introduction.mdx
+++ /dev/null
@@ -1,37 +0,0 @@
-# Introduction to Vision Transformers
-
-In the recent unit we learned about Convolutional Neural Networks (CNNs) and some of their use cases.
-We saw how some of the most prominent CNNs can be implemented and fine-tuned on custom data.
-
-CNNs have been the go-to choice for Computer Vision practicioners for many tasks ever since the advent of the AlexNet architecture.
-However, the wheel of time keeps turning and new architectures are proposed on a regular basis, that challenge the supremacy of CNNs.
-While many of them fade away or are only used in niche domains, the recent years had emerge one competitor who is on the way to take
-the crown as the favourite choice for Computer Vision practicioners: Vision Transformers (ViTs).
-
-## Transformers
-The Transfomer architecture was originally proposed in the 2017 paper []"Attention is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al.
-The novel architecture quickly caught interest in the Natural Language Processing community, which before has mainly been building its
-applications on Recurrent Neural Network (RNN) architectures like Long Short-Term Memory(LSTM).
-
-Transformers promised a wide array of new application possibilities, as they surpassed RNN architectures in many terms, e.g. Parallelization,
-Scalability and Flexibility.
-
-## Self-Attention
-While the new architecture has many mechanisms that play together nicely, the most important of them might be Self-Attention,
-a new way of taking into account the dependency between parts of the input.
-
-.....
-
-
-### CNN vs Vision Transformers: Inductive Bias
-
-Inductive bias is a term used in machine learning to describe the set of assumptions that a learning algorithm uses to make predictions. In simpler terms, inductive bias is like a shortcut that helps a machine learning model make educated guesses based on the information it has seen so far.
-
-Here's a couple of inductive biases we observe in CNNs:
-
-- Translational Equivariance: an object can appear anywhere in the image, and CNNs can detect its features.
-- Locality: pixels in an image interact mainly with its surrounding pixels to form features.
-
-

From d2b4eda34ef646d3ee97289f14718f6c288f6c84 Mon Sep 17 00:00:00 2001
From: Johannes
Date: Sun, 14 Apr 2024 21:24:46 +0200
Subject: [PATCH 4/4] fix toctree

---
 chapters/en/_toctree.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index 60f041b62..d6ded4e20 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -69,7 +69,7 @@
 - title: Exploring Multimodal Text and Vision Models - Uniting Senses in AI
 local: "Unit 4 - Multimodal Models/pre-intro"
 - title: A Multimodal World
- local: "Unit 4 - Multimodal Models/a_multimodal_world.mdx"
+ local: "Unit 4 - Multimodal Models/a_multimodal_world"
 - title: Introduction to Vision Language Models
 local: "Unit 4 - Multimodal Models/vlm-intro"
 - title: Multimodal Tasks and Models