From 227e279d1c5873d1c6bad50e00d443d9b8a48fa2 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Fri, 12 Apr 2024 17:26:17 +0200 Subject: [PATCH 01/52] Punctuation updated and broken image link restored --- chapters/en/Unit 0 - Welcome/welcome.mdx | 14 +++++++------- .../feature-extraction/feature-matching.mdx | 8 ++++---- .../image_and_imaging/examples-preprocess.mdx | 8 ++++---- .../image_and_imaging/imaging.mdx | 2 +- .../convnext.mdx | 16 ++++++++++------ .../googlenet.mdx | 8 ++++---- .../introduction.mdx | 2 +- .../Common Vision Transformers - DETR.mdx | 8 ++++---- .../Convolutional Vision Transformer.mdx | 8 ++++---- .../Unit 3 - Vision Transformers/MobileVIT.mdx | 2 +- .../Swin Transformer.mdx | 8 ++++---- ...ision Transformer for Objection Detection.mdx | 6 +++--- .../Unit 4 - Multimodal Models/introduction.mdx | 4 ++-- 13 files changed, 49 insertions(+), 45 deletions(-) diff --git a/chapters/en/Unit 0 - Welcome/welcome.mdx b/chapters/en/Unit 0 - Welcome/welcome.mdx index 7a77c780e..5e35746f2 100644 --- a/chapters/en/Unit 0 - Welcome/welcome.mdx +++ b/chapters/en/Unit 0 - Welcome/welcome.mdx @@ -12,8 +12,8 @@ On this page, you can find how to join the learners community, make a submission To obtain your certification for completing the course, complete the following assignments: -1. Training/fine-tuning a Model -2. Building an application and hosting it on Hugging Face Spaces +1. Training/fine-tuning a Model. +2. Building an application and hosting it on Hugging Face Spaces. ### Training/fine-tuning a Model @@ -21,7 +21,7 @@ There are notebooks under the Notebooks/Vision Transformers section. As of now, The model repository needs to have the following: -1. A properly filled Model Card [you can check out here for more information](https://huggingface.co/docs/hub/en/model-cards) +1. A properly filled Model Card: [you can check out here for more information](https://huggingface.co/docs/hub/en/model-cards). 2. If you trained a model with transformers and pushed it to Hub, the model card will be generated. In that case, edit the card and fill in more details. 3. Add the dataset’s ID to the model card to link the model repository to the dataset repository. @@ -29,12 +29,12 @@ The model repository needs to have the following: In this assignment section, you'll be building a Gradio-based application for your computer vision model and sharing it on 🤗 Spaces. Learn more about these tasks using the following resources: -- [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio) -- [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt) +- [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio). +- [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt). ## Certification 🥇 -Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/JaSYEf1pEZ4HtNKGA) with your name, email, and links to your model and Space repositories to receive your certificate +Once you've finished the assignments — Training/fine-tuning a Model and Creating a Space — please complete the [form](https://forms.gle/JaSYEf1pEZ4HtNKGA) with your name, email, and links to your model and Space repositories to receive your certificate. ## Join the community! 
@@ -52,7 +52,7 @@ As a computer vision course learner, you may find the following set of channels - `#computer-vision`: a catch-all channel for everything related to computer vision. - `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions. -- `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision +- `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision. If you are interested in generative AI, we also invite you to join all channels related to the Diffusion Models: #core-announcements, #discussions, #dev-discussions, and #diff-i-made-this. diff --git a/chapters/en/Unit 1 - Fundamentals/feature-extraction/feature-matching.mdx b/chapters/en/Unit 1 - Fundamentals/feature-extraction/feature-matching.mdx index 313bdf04e..805919dbd 100644 --- a/chapters/en/Unit 1 - Fundamentals/feature-extraction/feature-matching.mdx +++ b/chapters/en/Unit 1 - Fundamentals/feature-extraction/feature-matching.mdx @@ -8,7 +8,7 @@ Imagine you have a giant box of puzzle pieces, and you're trying to find a speci Now that we have an intuitive idea of how brute-force matches are found, let's dive into the algorithms. We are going to use the descriptors that we learned about in the previous chapter to find the matching features in two images. -First install and load libraries +First install and load libraries. ```bash !pip install opencv-python @@ -137,13 +137,13 @@ We also create a dictionary to specify the maximum leafs to visit as follows. search_params = dict(checks=50) ``` -Initiate SIFT detector +Initiate SIFT detector. ```python sift = cv.SIFT_create() ``` -Find the keypoints and descriptors with SIFT +Find the keypoints and descriptors with SIFT. ```python kp1, des1 = sift.detectAndCompute(img1, None) @@ -259,7 +259,7 @@ Fm, inliers = cv2.findFundamentalMat(mkpts0, mkpts1, cv2.USAC_MAGSAC, 0.5, 0.999 inliers = inliers > 0 ``` -Finally, we can visualize the matches +Finally, we can visualize the matches. ```python draw_LAF_matches( diff --git a/chapters/en/Unit 1 - Fundamentals/image_and_imaging/examples-preprocess.mdx b/chapters/en/Unit 1 - Fundamentals/image_and_imaging/examples-preprocess.mdx index 7bb4cbbec..e6c108346 100644 --- a/chapters/en/Unit 1 - Fundamentals/image_and_imaging/examples-preprocess.mdx +++ b/chapters/en/Unit 1 - Fundamentals/image_and_imaging/examples-preprocess.mdx @@ -5,10 +5,10 @@ Now that we have seen what are images, how they are acquired, and their impact, ## Operations in Digital Image Processing In digital image processing, operations on images are diverse and can be categorized into: -- Logical -- Statistical -- Geometrical -- Mathematical +- Logical. +- Statistical. +- Geometrical. +- Mathematical. - Transform operations. Each category encompasses different techniques, such as morphological operations under logical operations or fourier transforms and principal component analysis (PCA) under transforms. In this context, we refer to morphology as the group of operations tha use structuring elements to generate images of the same size by looking into the values of the pixel neighborhood. Understanding the distinction between element-wise and matrix operations is important in image manipulation. Element-wise operations, such as raising an image to a power or dividing it by another image, involve processing each pixel individually. This pixel-based approach contrasts with matrix operations, which utilize matrix theory for image manipulation. 
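To make the distinction concrete, here is a minimal NumPy sketch (the pixel values are made up for illustration):

```python
import numpy as np

# A tiny 2x2 grayscale "image" with made-up pixel values
img = np.array([[10.0, 20.0],
                [30.0, 40.0]])

# Element-wise operations: every pixel is processed independently
squared = img**2           # raise the image to a power
halved = img / 2.0         # divide the image by a scalar
ratio = img / (img + 1.0)  # divide one image by another, pixel by pixel

# Matrix operation: rows and columns interact, following matrix theory
mixed = img @ np.array([[0.0, -1.0],
                        [1.0, 0.0]])  # matrix multiplication, not pixel-wise

print(squared, halved, ratio, mixed, sep="\n\n")
```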
Having said that, you can do whatever you want with images, as they are matrices containing numbers! diff --git a/chapters/en/Unit 1 - Fundamentals/image_and_imaging/imaging.mdx b/chapters/en/Unit 1 - Fundamentals/image_and_imaging/imaging.mdx index d718071ae..59f497341 100644 --- a/chapters/en/Unit 1 - Fundamentals/image_and_imaging/imaging.mdx +++ b/chapters/en/Unit 1 - Fundamentals/image_and_imaging/imaging.mdx @@ -16,7 +16,7 @@ The core of digital image formation is the function \\(f(x,y)\\), which is deter In transmission-based imaging, such as X-rays, transmissivity takes the place of reflectivity. The digital representation of an image is essentially a matrix or array of numerical values, each corresponding to a pixel. The process of transforming continuous image data into a digital format is twofold: -- Sampling, which digitizes the coordinate values +- Sampling, which digitizes the coordinate values. - Quantization, which converts amplitude values into discrete quantities. The resolution and quality of a digital image significantly depend on the following: diff --git a/chapters/en/Unit 2 - Convolutional Neural Networks/convnext.mdx b/chapters/en/Unit 2 - Convolutional Neural Networks/convnext.mdx index 97a462d22..9fc79ddae 100644 --- a/chapters/en/Unit 2 - Convolutional Neural Networks/convnext.mdx +++ b/chapters/en/Unit 2 - Convolutional Neural Networks/convnext.mdx @@ -9,15 +9,17 @@ ConvNext represents a significant improvement to pure convolution models by inco ## Key Improvements The author of the ConvNeXT paper starts building the model with a regular ResNet (ResNet-50), then modernizes and improves the architecture step-by-step to imitate the hierarchical structure of Vision Transformers. The key improvements are: -- Training Techniques -- Macro Design -- ResNeXt-ify -- Inverted Bottleneck -- Large Kernel Sizes -- Micro Design +- Training Techniques. +- Macro Design. +- ResNeXt-ify. +- Inverted Bottleneck. +- Large Kernel Sizes. +- Micro Design. + We will go through each of the key improvements. These designs are not novel in itself. However, you can learn how researchers adapt and modify designs systematically to improve existing models. To show the effectiveness of each improvement, we will compare the model's accuracy before and after the modification on ImageNet-1K. + [Block Comparison](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/block_comparison.png) @@ -64,6 +66,7 @@ One common idea in every Transformer block is the usage of an inverted bottlenec This idea has also been used and popularized in Computer Vision by MobileNetV2. ConvNext adopts this idea, having input layers with 96 channels and increasing the hidden layers to 384 channels. By using this technique, it improves the model accuracy from 80.5% to 80.6%. + [Inverted Bottleneck Comparison](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inverted_bottleneck.png) @@ -74,6 +77,7 @@ However, before adjusting the kernel size, it is necessary to reposition the dep This repositioning enables the 1x1 layers to efficiently handle computational tasks, while the depthwise convolution layer functions as a more non-local receptor. With this, the network can harness the advantages of incorporating bigger kernel-sized convolutions. Implementing a 7x7 kernel size maintains the accuracy at 80.6% but reduces the overall FLOPs efficiency of the model. 
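As a rough sketch of the block structure described above (a depthwise 7x7 convolution followed by the 96 to 384 to 96 inverted bottleneck), the following PyTorch code illustrates the idea. It is a simplification rather than the official ConvNeXt implementation: normalization details and layer scale are approximated or omitted.

```python
import torch
import torch.nn as nn


class ConvNeXtStyleBlock(nn.Module):
    """Simplified block: depthwise 7x7 conv first, then an inverted bottleneck."""

    def __init__(self, dim: int = 96):
        super().__init__()
        # Depthwise convolution with a large 7x7 kernel (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Stand-in for the channel-wise LayerNorm used in the paper
        self.norm = nn.GroupNorm(1, dim)
        # Inverted bottleneck: expand 96 -> 384, activate, project back 384 -> 96
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        return x + residual


block = ConvNeXtStyleBlock(dim=96)
out = block(torch.randn(1, 96, 56, 56))  # (batch, channels, height, width)
print(out.shape)  # torch.Size([1, 96, 56, 56])
```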
+ [Moving up the Depth Conv Layer](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/depthwise_moveup.png) diff --git a/chapters/en/Unit 2 - Convolutional Neural Networks/googlenet.mdx b/chapters/en/Unit 2 - Convolutional Neural Networks/googlenet.mdx index 9086921bc..8b3c9832e 100644 --- a/chapters/en/Unit 2 - Convolutional Neural Networks/googlenet.mdx +++ b/chapters/en/Unit 2 - Convolutional Neural Networks/googlenet.mdx @@ -21,7 +21,7 @@ The Inception Module insists on applying convolution filters of different kernel ![inception_naive](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_naive.png) -Figure 1: Naive Inception Module +Figure 1: Naive Inception Module. As we can see applying multiple convolutions at multiple scales with bigger kernel sizes, like 5x5, can increase the number of parameters drastically. This problem is pronounced as the input feature size (channel size) increases. So as we go deep in the network stacking these "Inception Modules", the computation will increase drastically. The simple solution is to reduce the number of features wherever computational requirements seem to increase. The major pain points of high computation are the convolution layers. The feature dimension is reduced by a computationally inexpensive $1 \times 1$ convolution just before the 3x3 and 5x5 convolution. Let's see it with an example. @@ -31,7 +31,7 @@ We would also want to reduce the output features of max pooling before concatena ![inception_reduced](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_reduced.png) -Figure 2: Inception Module +Figure 2: Inception Module. Also, because of the parallel operations of convolutions at multiple scales, we are ensuring more operations without going deeper into the network, essentially mitigating the vanishing gradient problem. @@ -68,14 +68,14 @@ These auxiliary classifiers are removed at inference time. However, minimal gain ![googlenet_aux_clf](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/googlenet_auxiliary_classifier.jpg) -Figure 3: An Auxiliary Classifier +Figure 3: An Auxiliary Classifier. ### Architecture - GoogLeNet The complete architecture of GoogLeNet is shown in Figure below. All convolutions, including inside the inception block, use ReLU activation. It starts with two convolution(s) and max-pooling blocks. This is followed by a block of two inception modules (3a and 3b) and a max pooling. This follows a block of 5 inception blocks (4a, 4b, 4c, 4d, 4e) and a max pooling after. The auxiliary classifiers are taken out from outputs of 4a and 4d. Two inception blocks follow (5a and 5b). After this, an average pooling and a fully connected layer of 128 units are used. ![googlenet_arch](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/googlenet_architecture.png) -Figure 4: Complete GoogLeNet Architecture +Figure 4: Complete GoogLeNet Architecture. ### Code diff --git a/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx b/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx index dea7740b4..51a58eea3 100644 --- a/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx +++ b/chapters/en/Unit 2 - Convolutional Neural Networks/introduction.mdx @@ -64,7 +64,7 @@ The feature map will be the same size as the original data. The result of the co If we keep applying the convolution, we get the following feature map.
- 2D Feature Map + 2D Feature Map
Which shows us the horizontal changes (the edges). This filter is actually called the Prewitt Filter. diff --git a/chapters/en/Unit 3 - Vision Transformers/Common Vision Transformers - DETR.mdx b/chapters/en/Unit 3 - Vision Transformers/Common Vision Transformers - DETR.mdx index f0bfdf98e..3114d3292 100644 --- a/chapters/en/Unit 3 - Vision Transformers/Common Vision Transformers - DETR.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/Common Vision Transformers - DETR.mdx @@ -138,12 +138,12 @@ class DETR(nn.Module): ``` ### Going line by line in the `forward` function: **Backbone** -The input image is first put through a ResNet backbone and then a convolution layer, which reduces the dimension to the `hidden_dim` +The input image is first put through a ResNet backbone and then a convolution layer, which reduces the dimension to the `hidden_dim`. ```python x = self.backbone(inputs) h = self.conv(x) ``` -they are declared in the `__init__` function +they are declared in the `__init__` function. ```python self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2]) self.conv = nn.Conv2d(2048, hidden_dim, 1) @@ -171,7 +171,7 @@ self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) ``` **Resize** -Before going into the transformer, the features with size `(batch size, hidden_dim, H, W)` are reshaped to `(hidden_dim, batch size, H*W)`. This makes them a sequential input for the transformer +Before going into the transformer, the features with size `(batch size, hidden_dim, H, W)` are reshaped to `(hidden_dim, batch size, H*W)`. This makes them a sequential input for the transformer. ```python h.flatten(2).permute(2, 0, 1) ``` @@ -185,7 +185,7 @@ In the end, the outputs, which is a tensor of size `(query_pos_dim, batch size, ```python return self.linear_class(h), self.linear_bbox(h).sigmoid() ``` -The first of which predicts the class. An additional class is added for the `No Object` class +The first of which predicts the class. An additional class is added for the `No Object` class. ```python self.linear_class = nn.Linear(hidden_dim, num_classes + 1) ``` diff --git a/chapters/en/Unit 3 - Vision Transformers/Convolutional Vision Transformer.mdx b/chapters/en/Unit 3 - Vision Transformers/Convolutional Vision Transformer.mdx index 740bd952e..33136fc64 100644 --- a/chapters/en/Unit 3 - Vision Transformers/Convolutional Vision Transformer.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/Convolutional Vision Transformer.mdx @@ -50,7 +50,7 @@ The four main highlights of CvT that helped achieve superior performance and com Time to get hands-on! Let's explore how to code each major blocks of the CvT architecture in PyTorch shown in the official implementation [[8]](#cvt-imp). -1. Importing required libraries +1. Importing required libraries. ```python from collections import OrderedDict @@ -121,7 +121,7 @@ The method takes several parameters related to a convolutional layer (such as in The rearrangement of dimensions is performed using the `Rearrange` operation, which reshapes the input tensor. The resulting projection block is then returned. -3. Implementation of **Convolutional Token Embedding** +3. Implementation of **Convolutional Token Embedding**. 
```python class ConvEmbed(nn.Module): @@ -161,7 +161,7 @@ This code defines a ConvEmbed module that performs patch-wise embedding on an in In summary, this module is designed for patch-wise embedding of images, where each patch is processed independently through a convolutional layer, and optional normalization is applied to the embedded features. -4. Implementation of **Vision Transformer** Block +4. Implementation of **Vision Transformer** Block. ```python class VisionTransformer(nn.Module): @@ -277,7 +277,7 @@ This code defines a Vision Transformer module. Here's a brief overview of the co - **Forward Method:** The forward method processes the input through the patch embedding, rearranges the dimensions, adds the classification token if present, applies dropout, and then passes the data through the stack of transformer blocks. Finally, the output is rearranged back to the original shape, and the classification token (if present) is separated from the rest of the sequence before returning the output. -5. Implementation of Convolutional Vision Transformer Block (**Hierarchy of Transformers**) +5. Implementation of Convolutional Vision Transformer Block (**Hierarchy of Transformers**). ```python class ConvolutionalVisionTransformer(nn.Module): diff --git a/chapters/en/Unit 3 - Vision Transformers/MobileVIT.mdx b/chapters/en/Unit 3 - Vision Transformers/MobileVIT.mdx index f3bee3942..f5492e5fa 100644 --- a/chapters/en/Unit 3 - Vision Transformers/MobileVIT.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/MobileVIT.mdx @@ -23,7 +23,7 @@ A diagram of the MobileViT Block is shown below: Okay, that's a lot to take in. Let's break that down. - The block takes in an image with multiple channels. Let's say for an RGB image 3 channels, so the block takes in a three channeled image. -- It then performs a N by N convolution on the channels appending them to the existing channels +- It then performs a N by N convolution on the channels appending them to the existing channels. - The block then creates a linear combination of these channels and adds them to the existing stack of channels. - For each channel these images are unfolded into flattened patches. - Then these flattened patches are passed through a transformer to project them into new patches. diff --git a/chapters/en/Unit 3 - Vision Transformers/Swin Transformer.mdx b/chapters/en/Unit 3 - Vision Transformers/Swin Transformer.mdx index 3a9f748b4..6d52f8709 100644 --- a/chapters/en/Unit 3 - Vision Transformers/Swin Transformer.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/Swin Transformer.mdx @@ -40,7 +40,7 @@ Key parts of the [implementation of Swin from the original paper](https://github 1. **Initialize Parameters**. Among various other dropout and normalization parameters, these parameters include: - `window_size`: Size of the windows for local self-attention. - `ape (bool)`: If True, add absolute position embedding to the patch embedding. - - `fused_window_process`: Optional hardware optimization + - `fused_window_process`: Optional hardware optimization. 2. **Apply Patch Embedding**: Similar to ViT, Images are split into non-overlapping patches and linearly embedded using `Conv2D`. @@ -52,7 +52,7 @@ Key parts of the [implementation of Swin from the original paper](https://github - The model is composed of multiple layers (`BasicLayer`) of `SwinTransformerBlock`s, each downsampling the feature map for hierarchical processing using `PatchMerging`. - The dimensionality of features and resolution of feature maps change across layers. -7. 
**Classification Head**: Similar to ViT, it uses an Multi-Layer Perceptron (MLP) head for classification tasks, as defined in `self.head`, as the last step +7. **Classification Head**: Similar to ViT, it uses an Multi-Layer Perceptron (MLP) head for classification tasks, as defined in `self.head`, as the last step. ```python class SwinTransformer(nn.Module): @@ -379,9 +379,9 @@ The feature map is partitioned into windows via `window_partition`. A **cyclic s Cyclic shift allows the model to capture relationships between adjacent windows, enhancing its ability to learn spatial contexts beyond the local scope of individual windows. -2. **Windowed attention**: Perform attention using window-based multi-head self attention (W-MSA) module +2. **Windowed attention**: Perform attention using window-based multi-head self attention (W-MSA) module. -3. **Merge Patches**: Patches are merged via `PatchMerging` +3. **Merge Patches**: Patches are merged via `PatchMerging`. 4. **Reverse cyclic shift**: After attention is done, the window partitioning is undone via `reverse_window`, and the cyclic shift operation is reversed, so that the feature map retains its original form. diff --git a/chapters/en/Unit 3 - Vision Transformers/Vision Transformer for Objection Detection.mdx b/chapters/en/Unit 3 - Vision Transformers/Vision Transformer for Objection Detection.mdx index 1d25c6037..7ed782a34 100644 --- a/chapters/en/Unit 3 - Vision Transformers/Vision Transformer for Objection Detection.mdx +++ b/chapters/en/Unit 3 - Vision Transformers/Vision Transformer for Objection Detection.mdx @@ -15,17 +15,17 @@ This section will describe how object detection tasks are achieved using Vision Object detection is a computer vision task that involves identifying and localizing objects within an image or video. It consists of two main steps: -- First, recognizing the types of objects present (such as cars, people, or animals), +- First, recognizing the types of objects present (such as cars, people, or animals). - Second, determining their precise locations by drawing bounding boxes around them. These models typically receive images (static or frames from videos) as their inputs, with multiple objects present in each image. For example, consider an image containing several objects such as cars, people, bicycles, and so on. Upon processing the input, these models produce a set of numbers that convey the following information: -- Location of the object (XY coordinates of the bounding box) +- Location of the object (XY coordinates of the bounding box). - Class of the object. There are a lot of of applications around object detection. One of the most significant examples is in the field of autonomous driving, where object detection is used to detect different objects (like pedestrians, road signs, traffic lights, etc) around the car that become one of the inputs for taking decisions. -To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/chapters/en/Unit%206%20-%20Basic%20CV%20Tasks/object_detection.mdx) on Object Detection 🤗 +To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/chapters/en/Unit%206%20-%20Basic%20CV%20Tasks/object_detection.mdx) on Object Detection 🤗. 
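To get a feel for the inputs and outputs described above, here is a minimal sketch using the 🤗 `transformers` object detection pipeline. The checkpoint and image URL are only examples, and DETR-based checkpoints may additionally require `pip install timm`.

```python
from transformers import pipeline

# Any object detection checkpoint from the Hub can be used here
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

results = detector("http://images.cocodataset.org/val2017/000000039769.jpg")

for detection in results:
    # Each detection carries a class label, a confidence score, and box coordinates
    print(detection["label"], round(detection["score"], 3), detection["box"])
```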
### The Need to Fine-tune Models in Object Detection 🤔 diff --git a/chapters/en/Unit 4 - Multimodal Models/introduction.mdx b/chapters/en/Unit 4 - Multimodal Models/introduction.mdx index 361e66ca4..c69d4686e 100644 --- a/chapters/en/Unit 4 - Multimodal Models/introduction.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/introduction.mdx @@ -37,7 +37,7 @@ For the machines around us to be more intelligent, better at communicating with - Vision + Text : Infographics, Memes, Articles, Blogs. - Vision + Audio: A Skype call with your friend, dyadic conversations. - Vision + Audio + Text: Watching YouTube videos or movies with captions, social media content in general is multimodal. -- Audio + Text: Voice notes, music files with lyrics +- Audio + Text: Voice notes, music files with lyrics. ## Multimodal Datasets A dataset consisting of multiple modalities is a multimodal dataset. Out of the common modality combinations let us see some examples: @@ -59,7 +59,7 @@ Hugging Face supports a wide variety of multimodal tasks. Let us look into some 1. Vision + Text: - [Visual Question Answering or VQA](https://huggingface.co/tasks/visual-question-answering): Aiding visually impaired persons, efficient image retrieval, video search, Video Question Answering, Document VQA. - [Image to Text](https://huggingface.co/tasks/image-to-text): Image Captioning, Optical Character Recognition (OCR), Pix2Struct. -- [Text to Image](https://huggingface.co/tasks/text-to-image): Image Generation +- [Text to Image](https://huggingface.co/tasks/text-to-image): Image Generation. - [Text to Video](https://huggingface.co/tasks/text-to-video): Text-to-video editing, Text-to-video search, Video Translation, Text-driven Video Prediction. 2. Audio + Text: From 87a9456435add4bcb0529239bc9a2ea0ae1345ab Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Mon, 15 Apr 2024 17:56:22 +0200 Subject: [PATCH 02/52] Review punctuation and fixed some broken links --- .../blenderProc.mdx | 32 +++++++------ .../datagen-diffusion-models.mdx | 2 +- .../point_clouds.mdx | 16 +++---- .../synthetic-lung-images.mdx | 20 ++++---- .../synthetic_datasets.mdx | 8 ++-- .../conclusion.mdx | 2 +- chapters/en/Unit 13 - Outlook/hyena.mdx | 6 +-- .../CLIP and relatives/clip.mdx | 6 +-- .../tasks-models-part1.mdx | 4 +- .../Unit 4 - Multimodal Models/vlm-intro.mdx | 10 ++-- .../Introduction - Diffusions.mdx | 30 ++++++------ .../Diffusion models/stable_diffusion.mdx | 4 +- .../GANs & VAEs/StyleGAN.mdx | 28 +++++------ .../Introduction/Introduction.mdx | 8 ++-- .../Ethical Issues.mdx | 8 ++-- .../en/Unit 5 - Generative Models/gans.mdx | 6 +-- .../variational_autoencoders.mdx | 4 +- .../3D Vision/NVS.mdx | 16 +++---- .../3d_measurements_stereo_vision.mdx | 46 +++++++++---------- .../Introduction/brief_history.mdx | 8 ++-- .../Terminologies and Basics/CameraModels.mdx | 8 ++-- .../LinearAlgebra.mdx | 4 +- .../nerf.mdx | 8 ++-- .../intro_to_model_optimization.mdx | 8 ++-- .../tools_and_frameworks.mdx | 42 ++++++++--------- 25 files changed, 169 insertions(+), 165 deletions(-) diff --git a/chapters/en/Unit 10 - Synthetic Data Creation/blenderProc.mdx b/chapters/en/Unit 10 - Synthetic Data Creation/blenderProc.mdx index 8110ad70e..962ed313f 100644 --- a/chapters/en/Unit 10 - Synthetic Data Creation/blenderProc.mdx +++ b/chapters/en/Unit 10 - Synthetic Data Creation/blenderProc.mdx @@ -103,24 +103,28 @@ You can install BlenderProc via pip: Alternately, you can clone the official [BlenderProc repository](https://github.com/DLR-RM/BlenderProc) from GitHub using Git: 
-`git clone https://github.com/DLR-RM/BlenderProc` +```bash +git clone https://github.com/DLR-RM/BlenderProc +``` BlenderProc must be run inside the blender python environment (bpy), as this is the only way to access the Blender API. -`blenderproc run ` +```bash +blenderproc run +``` You can check out this notebook to try BlenderProc in Google Colab, demos the basic examples provided [here](https://github.com/DLR-RM/BlenderProc/tree/main/examples/basics). Here are some images rendered with the basic example: -![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png) -![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png) -![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png) +![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png). +![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png). +![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png). ## Blender Resources -- [User Manual](https://docs.blender.org/manual/en/latest/0) -- [Awesome-blender -- Extensive list of resources](https://awesome-blender.netlify.app) -- [Blender Youtube Channel](https://www.youtube.com/@BlenderOfficial) +- [User Manual](https://docs.blender.org/manual/en/latest/0). +- [Awesome-blender -- Extensive list of resources](https://awesome-blender.netlify.app). +- [Blender Youtube Channel](https://www.youtube.com/@BlenderOfficial). ### The following video explains how to render a 3D syntehtic dataset in Blender: @@ -132,15 +136,15 @@ Here are some images rendered with the basic example: ## Papers / Blogs -- [Developing digital twins of multi-camera metrology systems in Blender](https://iopscience.iop.org/article/10.1088/1361-6501/acc59e/pdf_) -- [Generate Depth and Normal Maps with Blender](https://www.saifkhichi.com/blog/blender-depth-map-surface-normals) -- [Object detection with synthetic training data](https://medium.com/rowden/object-detection-with-synthetic-training-data-f6735a5a34bc) +- [Developing digital twins of multi-camera metrology systems in Blender](https://iopscience.iop.org/article/10.1088/1361-6501/acc59e/pdf_). +- [Generate Depth and Normal Maps with Blender](https://www.saifkhichi.com/blog/blender-depth-map-surface-normals). +- [Object detection with synthetic training data](https://medium.com/rowden/object-detection-with-synthetic-training-data-f6735a5a34bc). ## BlenderProc Resources -- [BlenderProc Github Repo](https://github.com/DLR-RM/BlenderProc) -- [BlenderProc: Reducing the Reality Gap with Photorealistic Rendering](https://elib.dlr.de/139317/1/denninger.pdf) -- [Documentation](https://dlr-rm.github.io/BlenderProc/) +- [BlenderProc Github Repo](https://github.com/DLR-RM/BlenderProc). +- [BlenderProc: Reducing the Reality Gap with Photorealistic Rendering](https://elib.dlr.de/139317/1/denninger.pdf). +- [Documentation](https://dlr-rm.github.io/BlenderProc/). 
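To complement the resources above, here is a rough sketch of the kind of script that `blenderproc run <your_script.py>` executes, modeled on the basic examples linked earlier. Treat the exact function names as assumptions and double-check them against the BlenderProc documentation.

```python
import blenderproc as bproc
import numpy as np

bproc.init()

# Create a simple object and a point light (replace with your own assets as needed)
obj = bproc.object.create_primitive("MONKEY")
light = bproc.types.Light()
light.set_location([2, -2, 0])
light.set_energy(300)

# Place the camera in front of the object
cam_pose = bproc.math.build_transformation_mat([0, -5, 0], [np.pi / 2, 0, 0])
bproc.camera.add_camera_pose(cam_pose)

# Render the scene and write the result (colors, and optionally depth/normals) to HDF5
data = bproc.renderer.render()
bproc.writer.write_hdf5("output/", data)
```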
### The following video provides an overview of the BlenderProc pipeline: diff --git a/chapters/en/Unit 10 - Synthetic Data Creation/datagen-diffusion-models.mdx b/chapters/en/Unit 10 - Synthetic Data Creation/datagen-diffusion-models.mdx index fc8772f03..b9ad71c77 100644 --- a/chapters/en/Unit 10 - Synthetic Data Creation/datagen-diffusion-models.mdx +++ b/chapters/en/Unit 10 - Synthetic Data Creation/datagen-diffusion-models.mdx @@ -59,7 +59,7 @@ This means we have many tools under our belt to generate synthetic data! ## Approaches to Synthetic Data Generation -There are generally three cases for needing synthetic data, +There are generally three cases for needing synthetic data: **Extending an existing dataset:** diff --git a/chapters/en/Unit 10 - Synthetic Data Creation/point_clouds.mdx b/chapters/en/Unit 10 - Synthetic Data Creation/point_clouds.mdx index e5ec8007c..a373a6371 100644 --- a/chapters/en/Unit 10 - Synthetic Data Creation/point_clouds.mdx +++ b/chapters/en/Unit 10 - Synthetic Data Creation/point_clouds.mdx @@ -22,19 +22,19 @@ The 3D Point Data is mainly used in self-driving capabilities, but now other AI ## Generation and Data Representation -We will be using the python library [point-cloud-utils](https://github.com/fwilliams/point-cloud-utils), and [open-3d](https://github.com/isl-org/Open3D), which can be installed by +We will be using the python library [point-cloud-utils](https://github.com/fwilliams/point-cloud-utils), and [open-3d](https://github.com/isl-org/Open3D), which can be installed by: ```bash pip install point-cloud-utils ``` -We will be also using the python library open-3d, which can be installed by +We will be also using the python library open-3d, which can be installed by: ```bash pip install open3d ``` -OR a Smaller CPU only version +OR a Smaller CPU only version: ```bash pip install open3d-cpu @@ -45,7 +45,7 @@ Now, first we need to understand the formats in which these point clouds are sto **Why?** - `point-cloud-utils` supports reading common mesh formats (PLY, STL, OFF, OBJ, 3DS, VRML 2.0, X3D, COLLADA). -- If it can be imported into [MeshLab](https://github.com/cnr-isti-vclab/meshlab), we can read it! (from their readme) +- If it can be imported into [MeshLab](https://github.com/cnr-isti-vclab/meshlab), we can read it! (from their readme). The type of file is inferred from its file extension. Some of the extensions supported are: @@ -53,13 +53,13 @@ The type of file is inferred from its file extension. Some of the extensions sup - A simple PLY object consists of a collection of elements for representation of the object. It consists of a list of (x,y,z) triplets of a vertex and a list of faces that are actually indices into the list of vertices. - Vertices and faces are two examples of elements and the majority of the PLY file consists of these two elements. -- New properties can also be created and attached to the elements of an object, but these should be added in such a way that old programs do not break when these new properties are encountered +- New properties can also be created and attached to the elements of an object, but these should be added in such a way that old programs do not break when these new properties are encountered. ** STL (Standard Tessellation Language) ** - This format approximates the surfaces of a solid model with triangles. - These triangles are also known as facets, where each facet is described by a perpendicular direction and three points representing the vertices of the triangle. 
-- However, these files have no description of Color and Texture +- However, these files have no description of Color and Texture. ** OFF (Object File Format) ** @@ -77,11 +77,11 @@ The type of file is inferred from its file extension. Some of the extensions sup - X3D is an XML based 3D graphics file format for presentation of 3D information. It is a modular standard and is defined through several ISO specifications. - The format supports vector and raster graphics, transparency, lighting effects, and animation settings including rotations, fades, and swings. -- X3D has the advantage of encoding color information (unlike STL) that is used during printing the model on a color 3D printer +- X3D has the advantage of encoding color information (unlike STL) that is used during printing the model on a color 3D printer. ** DAE (Digital Asset Exchange) ** - This is an XML schema which is an open standard XML schema, from which DAE files are built. -- This file format is based on the COLLADA (COLLAborative Design Activity) XML schema which is an open standard XML schema for the exchange of digital assets among graphics software applications +- This file format is based on the COLLADA (COLLAborative Design Activity) XML schema which is an open standard XML schema for the exchange of digital assets among graphics software applications. - The format's biggest selling point is its compatibility across multiple platforms. - COLLADA files aren't restricted to one program or manufacturer. Instead, they offer a standard way to store 3D assets. diff --git a/chapters/en/Unit 10 - Synthetic Data Creation/synthetic-lung-images.mdx b/chapters/en/Unit 10 - Synthetic Data Creation/synthetic-lung-images.mdx index 3a2cb2e90..d02f48d2c 100644 --- a/chapters/en/Unit 10 - Synthetic Data Creation/synthetic-lung-images.mdx +++ b/chapters/en/Unit 10 - Synthetic Data Creation/synthetic-lung-images.mdx @@ -12,22 +12,22 @@ The generator has the following model architecture: - The input is a vector a 100 random numbers and the output is a image of size 128*128*3. - The model has 4 convolutional layers: - - Conv2D layer - - Batch Normalization layer - - ReLU activation -- Conv2D layer with Tanh activation + - Conv2D layer. + - Batch Normalization layer. + - ReLU activation. +- Conv2D layer with Tanh activation. The discriminator has the following model architecture: - The input is an image and the output is a probability indicating whether the image is fake or real. - The model has one convolutional layer: - - Conv2D layer - - Leaky ReLU activation + - Conv2D layer. + - Leaky ReLU activation. - Three convolutional layers with: - - Conv2D layer - - Batch Normalization layer - - Leaky ReLU activation -- Conv2D layer with Sigmoid + - Conv2D layer. + - Batch Normalization layer. + - Leaky ReLU activation. +- Conv2D layer with Sigmoid. 
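The sketch below translates the two layer lists above into PyTorch. It is an approximation only: the kernel sizes, strides, channel counts, and the use of transposed convolutions for upsampling in the generator are assumptions rather than the exact configuration used for the lung images.

```python
import torch
import torch.nn as nn


def gen_block(in_ch, out_ch):
    # "Conv2D + Batch Normalization + ReLU" from the generator list, upsampling by 2
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, kernel_size=8, stride=1, padding=0),  # 100-dim noise -> 8x8 maps
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
    gen_block(512, 256),  # 8 -> 16
    gen_block(256, 128),  # 16 -> 32
    gen_block(128, 64),   # 32 -> 64
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),  # 64 -> 128, 3 channels
    nn.Tanh(),
)


def disc_block(in_ch, out_ch):
    # "Conv2D + Batch Normalization + Leaky ReLU" from the discriminator list
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )


discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),  # first conv, no batch norm
    nn.LeakyReLU(0.2, inplace=True),
    disc_block(64, 128),
    disc_block(128, 256),
    disc_block(256, 512),
    nn.Conv2d(512, 1, kernel_size=8, stride=1, padding=0),  # 8x8 maps -> single score
    nn.Sigmoid(),
)

noise = torch.randn(4, 100, 1, 1)    # a batch of 4 random noise vectors
fake_images = generator(noise)       # -> (4, 3, 128, 128)
scores = discriminator(fake_images)  # -> (4, 1, 1, 1), probability of being real
print(fake_images.shape, scores.shape)
```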
**Data Collection** diff --git a/chapters/en/Unit 10 - Synthetic Data Creation/synthetic_datasets.mdx b/chapters/en/Unit 10 - Synthetic Data Creation/synthetic_datasets.mdx index cff284d2f..f2cc8a417 100644 --- a/chapters/en/Unit 10 - Synthetic Data Creation/synthetic_datasets.mdx +++ b/chapters/en/Unit 10 - Synthetic Data Creation/synthetic_datasets.mdx @@ -39,8 +39,8 @@ Semantic segmentation is vital for autonomous vehicles to interpret and navigate | Name | Year | Description | Paper | | Additional Links | |---------------------|--------------|-------------|----------------|---------------------|---------------------| -| Virtual KITTI 2 | 2020 | Virtual Worlds as Proxy for Multi-Object Tracking Analysis | [Virtual KITTI 2](https://arxiv.org/pdf/2001.10773.pdf) | | [Website](https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds/) | -| ApolloScape | 2019 | Compared with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labeling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labeling, lane-mark labeling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities, and daytimes | [The ApolloScape Open Dataset for Autonomous Driving and its Application](https://arxiv.org/abs/1803.06184) | | [Website](https://apolloscape.auto/) | +| Virtual KITTI 2 | 2020 | Virtual Worlds as Proxy for Multi-Object Tracking Analysis. | [Virtual KITTI 2](https://arxiv.org/pdf/2001.10773.pdf) | | [Website](https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds/) | +| ApolloScape | 2019 | Compared with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labeling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labeling, lane-mark labeling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities, and daytimes. | [The ApolloScape Open Dataset for Autonomous Driving and its Application](https://arxiv.org/abs/1803.06184) | | [Website](https://apolloscape.auto/) | | Driving in the Matrix | 2017 | The core idea behind "Driving in the Matrix" is to use photo-realistic computer-generated images from a simulation engine to produce annotated data quickly. | [Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?](https://arxiv.org/pdf/1610.01983.pdf) | | [GitHub](https://github.com/umautobots/driving-in-the-matrix) ![GitHub stars](https://img.shields.io/github/stars/umautobots/driving-in-the-matrix.svg?style=social&label=Star) | | CARLA | 2017 | **CARLA** (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. Technically, it operates similarly to, as an open source layer over Unreal Engine 4 that provides sensors in the form of RGB cameras (with customizable positions), ground truth depth maps, ground truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation). 
| [CARLA: An Open Urban Driving Simulator](https://arxiv.org/pdf/1711.03938v1.pdf) | | [Website](https://carla.org/) | | Synthia | 2016 | A large collection of synthetic images for semantic segmentation of urban scenes. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: misc, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist, lane-marking. | [The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.html) | | [Website](https://synthia-dataset.net/) | @@ -55,8 +55,8 @@ Navigating indoor environments can be challenging due to their complexity. These | Name | Year | Description | Paper | Additional Links | |--------------|--------------|-------------|----------------|--------------| |Habitat | 2023 | An Embodied AI simulation platform for studying collaborative human-robot interaction tasks in home environments. | [HABITAT 3.0: A CO-HABITAT FOR HUMANS, AVATARS AND ROBOTS](https://ai.meta.com/static-resource/habitat3) | [Website](https://aihabitat.org/habitat3/) | -| Minos | 2017 | Multimodal Indoor Simulator | [MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments](https://arxiv.org/pdf/1712.03931.pdf) | [GitHub](https://github.com/minosworld/minos) ![GitHub stars](https://img.shields.io/github/stars/minosworld/minos.svg?style=social&label=Star) | -| House3D | 2017 (archived in 2021) | A Rich and Realistic 3D Environment | [Building generalisable agents with a realistic and rich 3D environment](https://arxiv.org/pdf/1801.02209v2.pdf) | [GitHub](https://github.com/facebookresearch/House3D) ![GitHub stars](https://img.shields.io/github/stars/facebookresearch/House3D.svg?style=social&label=Star) | +| Minos | 2017 | Multimodal Indoor Simulator. | [MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments](https://arxiv.org/pdf/1712.03931.pdf) | [GitHub](https://github.com/minosworld/minos) ![GitHub stars](https://img.shields.io/github/stars/minosworld/minos.svg?style=social&label=Star) | +| House3D | 2017 (archived in 2021) | A Rich and Realistic 3D Environment. | [Building generalisable agents with a realistic and rich 3D environment](https://arxiv.org/pdf/1801.02209v2.pdf) | [GitHub](https://github.com/facebookresearch/House3D) ![GitHub stars](https://img.shields.io/github/stars/facebookresearch/House3D.svg?style=social&label=Star) | ### Human Action Recognition and Simulation diff --git a/chapters/en/Unit 12 - Ethics and Biases/conclusion.mdx b/chapters/en/Unit 12 - Ethics and Biases/conclusion.mdx index 5e08e413a..6cd518d88 100644 --- a/chapters/en/Unit 12 - Ethics and Biases/conclusion.mdx +++ b/chapters/en/Unit 12 - Ethics and Biases/conclusion.mdx @@ -67,7 +67,7 @@ This is work that highlights and explores techniques for making machine learning ### 🧑‍🤝‍🧑 Inclusive These are projects which broaden the scope of who builds and benefits in the machine learning world. Some examples: -- Curating diverse datasets that increase the representation of underserved groups +- Curating diverse datasets that increase the representation of underserved groups. - Training language models on languages that aren't yet available on the Hugging Face Hub. - Creating no-code and low-code frameworks that allow non-technical folk to engage with AI. 
diff --git a/chapters/en/Unit 13 - Outlook/hyena.mdx b/chapters/en/Unit 13 - Outlook/hyena.mdx index 7599888f7..c0a934293 100644 --- a/chapters/en/Unit 13 - Outlook/hyena.mdx +++ b/chapters/en/Unit 13 - Outlook/hyena.mdx @@ -91,8 +91,8 @@ Some work has been conducted to speed up this computation like FastFFTConv based ![nd_hyena.png](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/outlook_hyena_images/nd_hyena.png) In essence, Hyena can be performed in two steps: -1. Compute a set of N+1 linear projections similarly of attention (it can be more than 3 projections) -2. Mixing up the projections: The matrix \\(H(u)\\) is defined by a combination of matrix multiplications +1. Compute a set of N+1 linear projections similarly of attention (it can be more than 3 projections). +2. Mixing up the projections: The matrix \\(H(u)\\) is defined by a combination of matrix multiplications. ## Why Hyena Matters @@ -113,7 +113,7 @@ Hyena has been applied to N-Dimensional data with the Hyena N-D layer and can be here is a noticeable enhancement in GPU memory efficiency with the increase in the number of image patches. Hyena Hierarchy facilitates the development of larger, more efficient convolution models for long sequences. -The potential for Hyena type models for computer vision would be a more efficient GPU memory consumption of patches, that would allow : +The potential for Hyena type models for computer vision would be a more efficient GPU memory consumption of patches, that would allow: - The processing of larger, higher-resolution images - The use of smaller patches, allowing a fine-graine feature representation diff --git a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx index 8186b5b9e..348db6de5 100644 --- a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx @@ -28,7 +28,7 @@ CLIP, can be leveraged for a variety of applications. Here are some notable use For practical applications, one typically uses an image, and pre-defined classes as input. The provided Python example demonstrates how to use the transformers library for running CLIP. In this example, we want to zero-shot classify the image below between `dog` or `cat`. -![A photo of cats](http://images.cocodataset.org/val2017/000000039769.jpg) +![A photo of cats](https://images.cocodataset.org/val2017/000000039769.jpg) ```python from PIL import Image @@ -55,8 +55,8 @@ probs = logits_per_image.softmax(dim=1) ``` After executing this code, we got the following probabilities: -- "a photo of a cat": 99.49% -- "a photo of a dog": 0.51% +- "a photo of a cat": 99.49%. +- "a photo of a dog": 0.51%. ## Limitations diff --git a/chapters/en/Unit 4 - Multimodal Models/tasks-models-part1.mdx b/chapters/en/Unit 4 - Multimodal Models/tasks-models-part1.mdx index bc1a87fce..7f652c134 100644 --- a/chapters/en/Unit 4 - Multimodal Models/tasks-models-part1.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/tasks-models-part1.mdx @@ -187,7 +187,7 @@ Learn more about how to train and use DocVQA models in HuggingFace `transformers *Example of Input (Image) and Output (Text) for the Image Captioning Model. [[1]](#pretraining-paper)* - **Inputs:** - Image: Image in various formats (e.g., JPEG, PNG). 
- - Pre-trained image feature extractor (optional): A pre-trained neural network that can extract meaningful features from images, such as a convolutional neural network (CNN) + - Pre-trained image feature extractor (optional): A pre-trained neural network that can extract meaningful features from images, such as a convolutional neural network (CNN). - **Outputs:** Textual captions: Single Sentence or Paragraph that accurately describe the content of the input images, capturing objects, actions, relationships, and overall context. See the above example for the reference. - **Task:** To automatically generate natural language descriptions of images. This involves: (1) Understanding the visual content of the image (objects, actions, relationships). (2) Encoding this information into a meaningful representation. (3) Decoding this representation into a coherent, grammatically correct, and informative sentence or phrase. @@ -350,7 +350,7 @@ You can try out the Grounding DINO model in the Google Colab [here](https://cola Now, let's how can we use text-image generation models in HuggingFace. -Install `diffusers` library +Install `diffusers` library: ```bash pip install diffusers --upgrade ``` diff --git a/chapters/en/Unit 4 - Multimodal Models/vlm-intro.mdx b/chapters/en/Unit 4 - Multimodal Models/vlm-intro.mdx index c3c00b1e2..db4ca5fc4 100644 --- a/chapters/en/Unit 4 - Multimodal Models/vlm-intro.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/vlm-intro.mdx @@ -1,11 +1,11 @@ # Introduction to Vision Language Models What will you learn from this chapter: -- A brief introduction to multimodality -- Introduction to Vision Language Models -- Various learning strategies -- Common datasets used for VLMs -- Downstream tasks and evaluation +- A brief introduction to multimodality. +- Introduction to Vision Language Models. +- Various learning strategies. +- Common datasets used for VLMs. +- Downstream tasks and evaluation. ## Our World is Multimodal Humans explore the world through diverse senses: sight, sound, touch, and scent. A complete grasp of our surroundings emerges by harmonizing insights from these varied modalities. diff --git a/chapters/en/Unit 5 - Generative Models/Diffusion models/Introduction - Diffusions.mdx b/chapters/en/Unit 5 - Generative Models/Diffusion models/Introduction - Diffusions.mdx index 15434c11d..4907aae9b 100644 --- a/chapters/en/Unit 5 - Generative Models/Diffusion models/Introduction - Diffusions.mdx +++ b/chapters/en/Unit 5 - Generative Models/Diffusion models/Introduction - Diffusions.mdx @@ -2,17 +2,17 @@ What you will learn from this chapter: -- What are diffusion models and how do they differ from GANs -- Major sub categories of Diffusion models -- Use cases of Diffusion models -- Drawback in Diffusion models +- What are diffusion models and how do they differ from GANs. +- Major sub categories of Diffusion models. +- Use cases of Diffusion models. +- Drawback in Diffusion models. ## Diffusion Models and their Difference from GANs Diffusion models are a new and exciting area in computer vision that has shown impressive results in creating images. These generative models work on two stages, a forward diffusion stage and a reverse diffusion stage: first, they slightly change the input data by adding some noise, and then they try to undo these changes to get back to the original data. This process of making changes and then undoing them helps generate realistic images. 
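As a small illustration of the forward (noise-adding) stage, the snippet below applies the standard DDPM closed form \\(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\\), with an assumed linear noise schedule.

```python
import torch

# Assumed linear beta schedule; the exact schedule is a design choice
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t from q(x_t | x_0): a noisier version of x0 as t grows."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise


x0 = torch.rand(3, 64, 64)  # a toy "image" with values in [0, 1]
slightly_noisy = add_noise(x0, t=50)      # early step: still close to the original
almost_pure_noise = add_noise(x0, t=999)  # late step: essentially Gaussian noise
print(slightly_noisy.std().item(), almost_pure_noise.std().item())
```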
-These generative models raised the bar to a new level in the area of generative modeling, particularly referring to models such as [Imagen](https://imagen.research.google/) and [Latent Diffusion Models](https://arxiv.org/abs/2112.10752)(LDMs). For instance consider the below images generated via such models +These generative models raised the bar to a new level in the area of generative modeling, particularly referring to models such as [Imagen](https://imagen.research.google/) and [Latent Diffusion Models](https://arxiv.org/abs/2112.10752)(LDMs). For instance consider the below images generated via such models. ![Example images generated using diffusion models](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/diffusion-eg.png) @@ -33,9 +33,9 @@ In diffusion models, Gaussian noise is added step-by-step to the training images ## Major Variants of Diffusion models -There are 3 major diffusion modelling frameworks +There are 3 major diffusion modelling frameworks: - Denoising diffusion probabilistic models (DDPMs): - - DDPMs are models that employ latent variables to estimate the probability distribution. From this point of view, DDPMs can be viewed as a special kind of variational auto-encoders (VAEs), where the forward diffusion stage corresponds to the encoding process inside VAE, while the reverse diffusion stage corresponds to the decoding process + - DDPMs are models that employ latent variables to estimate the probability distribution. From this point of view, DDPMs can be viewed as a special kind of variational auto-encoders (VAEs), where the forward diffusion stage corresponds to the encoding process inside VAE, while the reverse diffusion stage corresponds to the decoding process. - Noise conditioned score networks (NCSNs): - It is based on training a shared neural network via score matching to estimate the score function (defined as the gradient of the log density) of the perturbed data distribution at different noise levels. - Stochastic differential equations (SDEs): @@ -46,15 +46,15 @@ There are 3 major diffusion modelling frameworks ## Use Cases of Diffusion Models Diffusion is used in a variety of tasks including, but not limited to: -- Image generation - Generating images based on prompts -- Image super-resolution - Increasing resolution of images -- Image inpainting - Filling up a degraded portion of an image based on prompts +- Image generation - Generating images based on prompts. +- Image super-resolution - Increasing resolution of images. +- Image inpainting - Filling up a degraded portion of an image based on prompts. - Image editing - Editing specific/entire part of the image without losing its visual identity. -- Image-to-image translation - This includes changing background, attributes of the location etc -- Learned Latent representation from diffusion models can also be used for - - Image segmentation - - Classification - - Anomaly detection +- Image-to-image translation - This includes changing background, attributes of the location etc. +- Learned Latent representation from diffusion models can also be used for. + - Image segmentation. + - Classification. + - Anomaly detection. Want to play with diffusion models? No worries, Hugging Face's [Diffusers](https://huggingface.co/docs/diffusers/index) library comes to rescue. You can use almost all recent diffusion SOTA models for almost any task. 
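For example, generating an image from a text prompt takes only a few lines with Diffusers. The checkpoint below is just one popular choice, and a CUDA GPU is assumed; on CPU, drop the `torch_dtype` argument and the `.to("cuda")` call.

```python
import torch
from diffusers import DiffusionPipeline

# Example Stable Diffusion checkpoint; any text-to-image checkpoint on the Hub works
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at sunrise").images[0]
image.save("lighthouse.png")
```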
diff --git a/chapters/en/Unit 5 - Generative Models/Diffusion models/stable_diffusion.mdx b/chapters/en/Unit 5 - Generative Models/Diffusion models/stable_diffusion.mdx index 8dae27d25..7e353a374 100644 --- a/chapters/en/Unit 5 - Generative Models/Diffusion models/stable_diffusion.mdx +++ b/chapters/en/Unit 5 - Generative Models/Diffusion models/stable_diffusion.mdx @@ -3,8 +3,8 @@ This chapter introduces the building blocks of Stable Diffusion which is a gener [Stability AI](https://stability.ai/), [RunwayML](https://runwayml.com/) and CompVis Group at LMU Munich following the [paper](https://arxiv.org/pdf/2112.10752.pdf). What will you learn from this chapter? -- Fundamental components of Stable Diffusion -- How to use `text-to-image`, `image2image`, inpainting pipelines +- Fundamental components of Stable Diffusion. +- How to use `text-to-image`, `image2image`, inpainting pipelines. ## What Do We Need for Stable Diffusion to Work? To make this section interesting we will try to answer some questions to understand the basic components of the Stable Diffusion process. diff --git a/chapters/en/Unit 5 - Generative Models/GANs & VAEs/StyleGAN.mdx b/chapters/en/Unit 5 - Generative Models/GANs & VAEs/StyleGAN.mdx index 9d2339025..72a0d3d9f 100644 --- a/chapters/en/Unit 5 - Generative Models/GANs & VAEs/StyleGAN.mdx +++ b/chapters/en/Unit 5 - Generative Models/GANs & VAEs/StyleGAN.mdx @@ -2,11 +2,11 @@ What you will learn in this chapter: -- What is missing in Vanilla GAN -- StyleGAN1 components and benifits -- Drawback of StyleGAN1 and the need for StyleGAN2 -- Drawback of StyleGAN2 and the need for StyleGAN3 -- Usecases of StyleGAN +- What is missing in Vanilla GAN. +- StyleGAN1 components and benifits. +- Drawback of StyleGAN1 and the need for StyleGAN2. +- Drawback of StyleGAN2 and the need for StyleGAN3. +- Usecases of StyleGAN. ## What is missing in Vanilla GAN Generative Adversarial Networks(GANs) are a class of generative models that produce realistic images. But it is very evident that you don't have any control over how the images are generated. In Vanilla GANs, you have two networks (i) A Generator, and (ii) A Discriminator. A Discriminator takes an image as input and returns whether it is a real image or a synthetically generated image by the generator. A Generator takes in noise vector (generally sampled from a multivariate Gaussian) and tries to produce images that look similar but not exactly the same as the ones available in the training samples, initially, it will be a junk image but in a long run the aim of the Generator is to fool the Discriminator into believing that the images generated by the generator are real. @@ -22,15 +22,15 @@ TL DR; StyleGAN is a special modification made to the architectural style of the Let us just dive into the special components introduced in StyleGAN that give StyleGAN the power which we described above. Don't get intimidated by the figure above, it is one of the simplest yet powerful ideas which you can easily understand. As I already said, StyleGAN only modifies Generator and the Discriminator remains the same, hence it is not mentioned above. Diagram (a) corresponds to the structure of ProgessiveGAN. ProgessiveGAN is just a Vanilla GAN, but instead of generating images of a fixed resolution, it progressively generates images of higher resolution in aim of generating realistic high resolution images, i.e., block 1 of generator generates image of resolution 4 by 4, block 2 of generator generates image of resolution 8 by 8 and so on. 
-Diagram (b) is the proposed StyleGAN architecture. It has the following main components;
-1. A mapping network
-2. AdaIN (Adaptive Instance Normalisation)
-3. Concatenation of Noise vector
+Diagram (b) is the proposed StyleGAN architecture. It has the following main components:
+1. A mapping network.
+2. AdaIN (Adaptive Instance Normalisation).
+3. Concatenation of Noise vector.
 
 Let's break it down one by one.
 
 ### Mapping Network
-Instead of passing the latent code (also known as the noise vector) z directly to the generator as done in traditional GANs, now it is mapped to w by a series of 8 MLP layers. The produced latent code w is not just passed as input to the first layer of the Generator, like in ProgessiveGAN, rather it is passed on to each block of the Generator Network (In StyleGAN terms, it is called a Synthesis Network). There are two major ideas here;
+Instead of passing the latent code (also known as the noise vector) z directly to the generator as done in traditional GANs, it is now mapped to w by a series of 8 MLP layers. The produced latent code w is not just passed as input to the first layer of the Generator, like in ProgressiveGAN; rather, it is passed on to each block of the Generator Network (in StyleGAN terms, called the Synthesis Network). There are two major ideas here:
 - Mapping the latent code from z to w disentangles the feature space. By disentanglement we mean that, for a latent code of dimension 512, if you change just one of its feature values (say, out of 512 values, you just increase or decrease the 4th value), then ideally, in a disentangled feature space, only one real-world feature should change. If the 4th feature value corresponds to the real-world feature 'smile', then changing the 4th value of the 512-dimension latent code should generate images that are smiling/not smiling/something in between.
 - Passing the latent code to each layer has a profound effect on the kind of real-world features controlled. For instance, passing latent code w to the lower blocks of the Synthesis network controls high-level aspects such as pose, general hairstyle, face shape, and eyeglasses, while passing latent code w to the higher-resolution blocks of the synthesis network controls smaller-scale facial features, hairstyle, eyes open/closed, etc.
@@ -44,7 +44,7 @@ AdaIN modifies the instance Normalization by allowing the normalization paramete
 In StyleGAN, the latent code is not passed directly to the synthesis network; rather, an affine transformation of w, i.e. y, is passed to the different blocks. y is called the 'style' representation. Here, \\(y_{s,i}\\) and \\(y_{b,i}\\) are the scale and bias derived from the style representation y, and \\(mu(x_i)\\) and \\(sigma(x_i)\\) are the mean and standard deviation of the feature map x.
 
-AdaIN enables the generator to modulate its behavior during the generation process dynamically. This is particularly useful in scenarios where different parts of the generated output may require different styles or characteristics
+AdaIN enables the generator to modulate its behavior during the generation process dynamically. This is particularly useful in scenarios where different parts of the generated output may require different styles or characteristics.
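For intuition, here is a minimal PyTorch sketch of an AdaIN block in this spirit: a learned affine layer maps w to a per-channel scale and bias, which re-style the instance-normalised feature map. The dimensions and layer names are assumptions for illustration, not the exact StyleGAN implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of Adaptive Instance Normalisation: w -> affine -> (y_s, y_b),
    then y_s * (x - mu(x)) / sigma(x) + y_b, applied per channel."""

    def __init__(self, w_dim, num_channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * num_channels)  # produces [y_s, y_b]
        self.norm = nn.InstanceNorm2d(num_channels)

    def forward(self, x, w):
        y = self.affine(w)               # (B, 2C)
        y_s, y_b = y.chunk(2, dim=1)     # per-channel scale and bias
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b

x = torch.randn(2, 64, 16, 16)   # feature map from a synthesis block
w = torch.randn(2, 512)          # intermediate latent from the mapping network
print(AdaIN(w_dim=512, num_channels=64)(x, w).shape)  # torch.Size([2, 64, 16, 16])
```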
### Concatenation of Noise vector @@ -67,7 +67,7 @@ You can see the blob structure in the above image, which the authors claim to ha ![Demodulation](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/stylegan2_demod.png) -(ii) Fixing strong location preference artifact in Progessive GAN structure +(ii) Fixing strong location preference artifact in Progessive GAN structure. ![Phase Artifact](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/progress.png) @@ -77,7 +77,7 @@ A skip generator and a residual discriminator was used to overcome the issue, wi There are also other changes introduced in StyleGAN2, but the above two are important to know at first hand. -## Drawbacks of StyleGAN2 and the need for StyleGAN3; +## Drawbacks of StyleGAN2 and the need for StyleGAN3 The same set of authors of StyleGAN2 figured out the dependence of the synthesis network on absolute pixel coordinates in an unhealthy manner. This leads to the phenomenon called the aliasing effect. ![Animation of aliasing](https://huggingface.co/datasets/hwaseem04/Documentation-files/resolve/main/CV-Course/MP4%20to%20GIF%20conversion.gif) @@ -99,7 +99,7 @@ StyleGAN's ability to generate photorealistic images has opened doors for divers **Creative explorations** -- Generating fashion designs: StyleGAN can be used to generate realistic and diverse fashion designs +- Generating fashion designs: StyleGAN can be used to generate realistic and diverse fashion designs. - Creating immersive experiences: StyleGAN can be used to create realistic virtual environments for gaming, education, and other applications. For instance, Stylenerf: A style-based. 3d aware generator for high-resolution image synthesis. These are just a non-exhaustive list. diff --git a/chapters/en/Unit 5 - Generative Models/Introduction/Introduction.mdx b/chapters/en/Unit 5 - Generative Models/Introduction/Introduction.mdx index 4d2a7dbfd..d540798a8 100644 --- a/chapters/en/Unit 5 - Generative Models/Introduction/Introduction.mdx +++ b/chapters/en/Unit 5 - Generative Models/Introduction/Introduction.mdx @@ -10,9 +10,9 @@ these tasks can be expanded into more complex processes such as semantic segment For the sake of brevity, in this chapter, we will consider generative models that solve these tasks: -* noise to image (DCGAN) -* text to image (diffusion models) -* image to image (StyleGAN, cycleGAN, diffusion models) +* noise to image (DCGAN). +* text to image (diffusion models). +* image to image (StyleGAN, cycleGAN, diffusion models). This section will cover 2 kinds of generative models. GAN-based models, and diffusion-based models. @@ -28,7 +28,7 @@ Some other metrics you might come across are SSIM, PSNR, IS(Inception Score), an * PSNR (peak signal-to-noise ratio) can be interpreted almost as mean-squared-error. Generally, values from [25,34] are okay results while 34+ is very good. -* SSIM (Structural Similarity Index) is a metric in the range [0, 1] where 1 is a perfect match. The final index is calculated from 3 components: luminance, contrast, and structure. [this paper](https://arxiv.org/pdf/2006.13846.pdf) analyzes SSIM and its components if you're really interested +* SSIM (Structural Similarity Index) is a metric in the range [0, 1] where 1 is a perfect match. The final index is calculated from 3 components: luminance, contrast, and structure. [this paper](https://arxiv.org/pdf/2006.13846.pdf) analyzes SSIM and its components if you're really interested. 
* Inception score was introduced in [Improved Techniques for Training GANs](https://arxiv.org/pdf/1606.03498.pdf). It is calculated using the features on the inceptionv3 model. The higher the better. It is a mathematically very interesting metric, but has recently fallen out of favor. diff --git a/chapters/en/Unit 5 - Generative Models/PRACTICAL APPLICATIONS & CHALLENGES/Ethical Issues.mdx b/chapters/en/Unit 5 - Generative Models/PRACTICAL APPLICATIONS & CHALLENGES/Ethical Issues.mdx index 3683907ae..436e9666f 100644 --- a/chapters/en/Unit 5 - Generative Models/PRACTICAL APPLICATIONS & CHALLENGES/Ethical Issues.mdx +++ b/chapters/en/Unit 5 - Generative Models/PRACTICAL APPLICATIONS & CHALLENGES/Ethical Issues.mdx @@ -4,9 +4,9 @@ The widespread adoption of AI-powered image editing tools raises significant con What you will learn from this chapter: -- Impact of such AI images/videos on society -- Current approaches to tackle the issues -- Future scope +- Impact of such AI images/videos on society. +- Current approaches to tackle the issues. +- Future scope. ## Impact on Society The ability to effortlessly edit and alter images has the potential to: @@ -27,6 +27,6 @@ The future of AI-edited images will likely involve: - **Advanced detection and mitigation techniques:** Researchers will ideally develop more advanced techniques for detecting and mitigating the harms associated with AI-edited images. But is like a cat-and-mouse game where one group develops sophisticated realistic images generation algorithms, whereas another group develops methods to identify them. - **Public awareness and education:** Public awareness campaigns and educational initiatives will be crucial in promoting responsible use of AI-edited images and combating the spread of misinformation. -- **Protecting rights of image artist:** Companies like OpenAI, Google, StabiltyAI that trains large text-to-image models are facing slew of lawsuits because of scraping works of artists from internet without crediting them in anyway. Techniques like image poisoning is an emerging research problem where an artists' image is added with human-eye-invisible noise-like pixel changes before uploading on internet. This potentially corrupts the training data and hence model's image generation capability if scraped directly. You can read about this more from - [here](https://www.technologyreview.com/2023/10/23/1082189/data-poisoning-artists-fight-generative-ai/), and [here](https://arxiv.org/abs/2310.13828) +- **Protecting rights of image artist:** Companies like OpenAI, Google, StabiltyAI that trains large text-to-image models are facing slew of lawsuits because of scraping works of artists from internet without crediting them in anyway. Techniques like image poisoning is an emerging research problem where an artists' image is added with human-eye-invisible noise-like pixel changes before uploading on internet. This potentially corrupts the training data and hence model's image generation capability if scraped directly. You can read about this more from - [here](https://www.technologyreview.com/2023/10/23/1082189/data-poisoning-artists-fight-generative-ai/), and [here](https://arxiv.org/abs/2310.13828). This is a rapidly evolving field, and it is crucial to stay informed about the latest developments. 
\ No newline at end of file diff --git a/chapters/en/Unit 5 - Generative Models/gans.mdx b/chapters/en/Unit 5 - Generative Models/gans.mdx index c6a842636..b92472a40 100644 --- a/chapters/en/Unit 5 - Generative Models/gans.mdx +++ b/chapters/en/Unit 5 - Generative Models/gans.mdx @@ -2,7 +2,7 @@ ## Introduction Generative Adversarial Networks (GANs) are a class of deep learning models introduced by [Ian Goodfellow](https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en) and his colleagues in 2014. The core idea behind GANs is to train a generator network to produce data that is indistinguishable from real data, while simultaneously training a discriminator network to differentiate between real and generated data. -* **Architecture overview:** GANs consist of two main components: `the generator` and `the discriminator` +* **Architecture overview:** GANs consist of two main components: `the generator` and `the discriminator`. * **Generator:** The generator takes random noise \\(z\\) as input and generates synthetic data samples. Its goal is to create data that is realistic enough to deceive the discriminator. * **Discriminator:** The discriminator, akin to a detective, evaluates whether a given sample is real (from the actual dataset) or fake (generated by the generator). Its objective is to become increasingly accurate in distinguishing between real and generated samples. @@ -19,12 +19,12 @@ GANs and VAEs are both popular generative models in machine learning, but they h * **Example:** A GAN-generated image of a bedroom is likely to be indistinguishable from a real one, while a VAE-generated bedroom might appear blurry or have unrealistic lighting. ![Example of GAN-Generated bedrooms taken from Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/bedroom.png) - **VAEs:** - * **Strengths:** Easier to train and more stable than GANs + * **Strengths:** Easier to train and more stable than GANs. * **Weaknesses:** May generate blurry, less detailed images with unrealistic features. * **Other Tasks:** - **GANs:** * **Strengths:** Can be used for tasks like super-resolution and image-to-image translation. - * **Weaknesses:** May not be the best choice for tasks that require a smooth transition between data points + * **Weaknesses:** May not be the best choice for tasks that require a smooth transition between data points. - **VAEs:** * **Strengths:** Widely used for tasks like image denoising and anomaly detection. * **Weaknesses:** May not be as effective as GANs for tasks that require high-quality image generation. diff --git a/chapters/en/Unit 5 - Generative Models/variational_autoencoders.mdx b/chapters/en/Unit 5 - Generative Models/variational_autoencoders.mdx index e1be4a308..6c045f94b 100644 --- a/chapters/en/Unit 5 - Generative Models/variational_autoencoders.mdx +++ b/chapters/en/Unit 5 - Generative Models/variational_autoencoders.mdx @@ -1,7 +1,7 @@ # Variational Autoencoders ## Introduction to Autoencoders -Autoencoders are a class of neural networks primarily used for unsupervised learning and dimensionality reduction. The fundamental idea behind autoencoders is to encode input data into a lower-dimensional representation and then decode it back to the original data, aiming to minimize the reconstruction error. 
The basic architecture of an autoencoder consists of two main components - `the encoder` and `the decoder` +Autoencoders are a class of neural networks primarily used for unsupervised learning and dimensionality reduction. The fundamental idea behind autoencoders is to encode input data into a lower-dimensional representation and then decode it back to the original data, aiming to minimize the reconstruction error. The basic architecture of an autoencoder consists of two main components - `the encoder` and `the decoder`. * **Encoder:** The encoder is responsible for transforming the input data into a compressed or latent representation. It typically consists of one or more layers of neurons that progressively reduce the dimensions of the input. * **Decoder:** The decoder, on the other hand, takes the compressed representation produced by the encoder and attempts to reconstruct the original input data. Like the encoder, it often consists of one or more layers, but in the reverse order, gradually increasing the dimensions. @@ -24,7 +24,7 @@ In the context of Vanilla Autoencoders (AE), the smile feature is encapsulated a ## Mathematics Behind VAEs Understanding the mathematical concepts behind VAEs involves grasping the principles of probabilistic modeling and variational inference. ![Variational Autoencoder - Lilian Weng Blog](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/generative_models/vae.png) -* **Probabilistic Modeling:** In VAEs, the latent space is modeled as a probability distribution, often assumed to be a multivariate Gaussian. This distribution is parameterized by the mean and standard deviation vectors, which are outputs of the probabilistic encoder \\( q_\phi(z|x) \\). This comprosises of our learned representation z which is further used to sample from the decoder as \\(p_\theta(x|z) \\) +* **Probabilistic Modeling:** In VAEs, the latent space is modeled as a probability distribution, often assumed to be a multivariate Gaussian. This distribution is parameterized by the mean and standard deviation vectors, which are outputs of the probabilistic encoder \\( q_\phi(z|x) \\). This comprosises of our learned representation z which is further used to sample from the decoder as \\(p_\theta(x|z) \\). * **Loss Function:** The loss function for VAEs comprises two components: the reconstruction loss (measuring how well the model reconstructs the input) similar to the vanilla autoencoders and the KL divergence (measuring how closely the learned distribution resembles a chosen prior distribution, usually gaussian). The combination of these components encourages the model to learn a latent representation that captures both the data distribution and the specified prior. * **Encouraging Meaningful Latent Representations:** By incorporating the KL divergence term into the loss function, VAEs are encouraged to learn a latent space where similar data points are closer, ensuring a meaningful and structured representation. The autoencoder's loss function aims to minimize both the reconstruction loss and the latent loss. A smaller latent loss implies a limited encoding of information that would otherwise enhance the reconstruction loss. Consequently, the Variational Autoencoder (VAE) finds itself in a delicate balance between the latent loss and the reconstruction loss. This equilibrium becomes pivotal, as a `smaller latent loss` tends to result in generated images closely resembling those present in the training set but lacking in visual quality. 
Conversely, a `smaller reconstruction loss` leads to well-reconstructed images during training but hampers the generation of novel images that deviate significantly from the training set. Striking a harmonious balance between these two aspects becomes imperative to achieve desirable outcomes in both image reconstruction and generation. diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3D Vision/NVS.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3D Vision/NVS.mdx index b3ddfbfe3..4ff173dff 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3D Vision/NVS.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3D Vision/NVS.mdx @@ -49,10 +49,10 @@ A model was trained separately on each class of object (e.g. planes, benches, ca ![Input image of a chair](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png) ![Rotating gif animation of rendered novel views](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif) -image from https://alexyu.net/pixelnerf +Image from https://alexyu.net/pixelnerf. -The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf) +The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf). ### Related methods @@ -60,7 +60,7 @@ In the [ObjaverseXL](https://arxiv.org/pdf/2307.05663.pdf) paper, PixelNeRF was See also - [Generative Query Networks](https://deepmind.google/discover/blog/neural-scene-representation-and-rendering/), [Scene Representation Networks](https://www.vincentsitzmann.com/srns/), -[LRM](https://arxiv.org/pdf/2311.04400.pdf) +[LRM](https://arxiv.org/pdf/2311.04400.pdf). ## Zero123 (or Zero-1-to-3) @@ -74,20 +74,20 @@ Zero123 is built upon the [Stable Diffusion](https://arxiv.org/abs/2112.10752) a However, it adds a few new twists. The model actually starts with the weights from [Stable Diffusion Image Variations](https://huggingface.co/spaces/lambdalabs/stable-diffusion-image-variations), which uses the CLIP image embeddings (the final hidden state) of the input image to condition the diffusion U-Net, instead of a text prompt. However, here these CLIP image embeddings are concatenated with the relative viewpoint transformation between the input and novel views. -(This viewpoint change is represented in terms of spherical polar coordinates.) +(This viewpoint change is represented in terms of spherical polar coordinates). ![Zero123](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png) -image from https://zero123.cs.columbia.edu +image from https://zero123.cs.columbia.edu. The rest of the architecture is the same as Stable Diffusion. However, the latent representation of the input image is concatenated channel-wise with the noisy latents before being input into the denoising U-Net. -To explore this model further, see the [Live Demo](https://huggingface.co/spaces/cvlab/zero123-live) +To explore this model further, see the [Live Demo](https://huggingface.co/spaces/cvlab/zero123-live). ### Related methods [3DiM](https://3d-diffusion.github.io/) - X-UNet architecture, with cross-attention between input and noisy frames. -[Zero123-XL](https://arxiv.org/pdf/2311.13617.pdf) - Trained on the larger objaverseXL dataset. See also [Stable Zero 123](https://huggingface.co/stabilityai/stable-zero123) +[Zero123-XL](https://arxiv.org/pdf/2311.13617.pdf) - Trained on the larger objaverseXL dataset. 
See also [Stable Zero 123](https://huggingface.co/stabilityai/stable-zero123). -[Zero123++](https://arxiv.org/abs/2310.15110) - Generates 6 new fixed views, at fixed relative positions to the input view, with reference attention between input and generated images +[Zero123++](https://arxiv.org/abs/2310.15110) - Generates 6 new fixed views, at fixed relative positions to the input view, with reference attention between input and generated images. diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3d_measurements_stereo_vision.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3d_measurements_stereo_vision.mdx index ac287651b..f4e2c8b28 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3d_measurements_stereo_vision.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/3d_measurements_stereo_vision.mdx @@ -15,8 +15,8 @@ Figure 1: Image formation using single camera ## Solution Let's assume we are given the following information: -1. Single image of a scene point P -2. Pixel coordinates of point P in the image +1. Single image of a scene point P. +2. Pixel coordinates of point P in the image. 3. Position and orientation of the camera used to capture the image. For simplicity, we can also place an XYZ coordinate system at the location of the pinhole, with the z-axis perpendicular to the image place and the x-axis, and y-axis parallel to the image plane like in Figure 1. 4. Internal parameters of the camera, such as focal length and location of principal point. The principal point is where the optical axis intersects the image plane. Its location in the image plane is usually denoted as (Ox,Oy). @@ -31,9 +31,9 @@ With the information provided above, we can find a 3D line that originates from Given 2 lines in 3D, there are are three possibilities for their intersection: -1. Intersect at exactly 1 point -2. Intersect at infinite number of points -3. Do not intersect +1. Intersect at exactly 1 point. +2. Intersect at infinite number of points. +3. Do not intersect. If both images (with original and new camera positions) contain point P, we can conclude that the 3D lines must intersect at least once and that the intersection point is point P. Furthermore, we can envision infinite points where both lines intersect only if the two lines are collinear. This is achievable if the pinhole at the new camera position lies somewhere on the original 3D line. For all other positions and orientations of the new camera location, the two 3D lines must intersect precisely at one point, where point P lies. @@ -44,7 +44,7 @@ Since there are many different positions and orientations for the camera locatio ![Figure 2: Image formation using 2 cameras](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true) -Figure 2: Image formation using 2 cameras +Figure 2: Image formation using 2 cameras. 1. Origin of the coordinate system is placed at the pinhole of the first camera which is usually the left camera. 2. Z axis of the coordinate system is defined perpendicular to the image plane. @@ -52,10 +52,10 @@ Figure 2: Image formation using 2 cameras 4. We also have X and Y directions in a 2D image. X is the horizontal direction and Y is the vertical direction. We will refer to these directions in the image plane as u and v respectively. Therefore, pixel coordinates of a point are defined using (u,v) values. 5. 
X axis of the coordinate system is defined as the u direction / horizontal direction in the image plane. 6. Similarly Y axis of the coordinate system is defined as the v direction / vertical direction in the image plane. -7. Second camera (more precisely the pinhole of the second camera) is placed at a distance b called baseline in the positive x direction to the right of the first camera. Therefore, x,y,z coordinates of pinhole of second camera are (b,0,0) +7. Second camera (more precisely the pinhole of the second camera) is placed at a distance b called baseline in the positive x direction to the right of the first camera. Therefore, x,y,z coordinates of pinhole of second camera are (b,0,0). 5. Image plane of the second camera is oriented parallel to the image plane of the first camera. -6. u and v directions in the image plane of second/right camera are aligned with the u and v directions in the image plane of the first/left camera -7. Both left and right cameras are assumed to have the same intrinsic parameters like focal length and location of principal point +6. u and v directions in the image plane of second/right camera are aligned with the u and v directions in the image plane of the first/left camera. +7. Both left and right cameras are assumed to have the same intrinsic parameters like focal length and location of principal point. With the above configuration in place, we have the below equations which map a point in 3D to the image plane in 2D. @@ -68,12 +68,12 @@ With the above configuration in place, we have the below equations which map a p 2. \\(v\_right = f\_y * \frac{y}{z} + O\_y\\) Different symbols used in above equations are defined below: -* \\(u\_left\\), \\(v\_left\\) refer to pixel coordinates of point P in the left image -* \\(u\_right\\), \\(v\_right\\) refer to pixel coordinates of point P in the right image +* \\(u\_left\\), \\(v\_left\\) refer to pixel coordinates of point P in the left image. +* \\(u\_right\\), \\(v\_right\\) refer to pixel coordinates of point P in the right image. * \\(f\_x\\) refers to the focal length (in pixels) in x direction and \\(f\_y\\) refers to the focal length (in pixels) in y direction. Actually, there is only 1 focal length for a camera which is the distance between the pinhole (optical center of the lens) to the image plane. However, pixels may be rectangular and not perfect squares, resulting in different fx and fy values when we represent f in terms of pixels. -* x,y,z are 3D coordinates of the point P (any unit like cm, feet, etc can be used) -* \\(O\_x\\) and \\(O\_y\\) refer to pixel coordinates of the principal point -* b is called the baseline and refers to the distance between the left and right cameras. Same units are used for both b and x,y,z coordinates (any unit like cm, feet, etc can be used) +* x,y,z are 3D coordinates of the point P (any unit like cm, feet, etc can be used). +* \\(O\_x\\) and \\(O\_y\\) refer to pixel coordinates of the principal point. +* b is called the baseline and refers to the distance between the left and right cameras. Same units are used for both b and x,y,z coordinates (any unit like cm, feet, etc can be used). We have 4 equations above and 3 unknowns - x, y and z coordinates of a 3D point P. Intrinsic camera parameters - focal lengths and principal point are assumed to be known. Equations 1.2 and 2.2 indicate that the v coordinate value in the left and right images is the same. 
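Subtracting equation 2.1 from equation 1.1 gives \\(u\_left - u\_right = f\_x * \frac{b}{z}\\), so depth follows from the disparity, and x and y then follow from equations 1.1 and 1.2. A small helper consistent with these equations is sketched below; the pixel coordinates in the example call are made-up values, while the intrinsics and the 7.5 cm baseline mirror the worked example later in this section.

```python
def triangulate(u_left, v_left, u_right, fx, fy, ox, oy, b):
    """Recover (x, y, z) of point P from a rectified stereo pair.
    x, y, z come out in the same units as the baseline b."""
    disparity = u_left - u_right      # eq. 1.1 minus eq. 2.1: disparity = fx * b / z
    z = fx * b / disparity
    x = (u_left - ox) * z / fx        # invert eq. 1.1
    y = (v_left - oy) * z / fy        # invert eq. 1.2
    return x, y, z

# Made-up pixel coordinates; intrinsics as in the OAK-D Lite example below.
print(triangulate(u_left=350.0, v_left=250.0, u_right=300.0,
                  fx=452.9, fy=452.9, ox=298.85, oy=245.52, b=7.5))
```

With these numbers a 50-pixel disparity corresponds to a depth of roughly 68 cm, illustrating the inverse relationship between depth and disparity used throughout the example.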
@@ -97,10 +97,10 @@ We'll work through an example, capture some images, and perform some calculation ### Raw Left and Right Images The left and right cameras in OAK-D Lite are oriented similarly to the geometry of the simplified solution detailed above. The baseline distance between the left and right cameras is 7.5cm. Left and right images of a scene captured using this device are shown below. The figure also shows these images stacked horizontally with a red line drawn at a constant height (i.e. at a constant v value ). We'll refer to the horizontal x-axis as u and the vertical y-axis as v. -Raw Left Image +Raw Left Image. ![Raw Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true) -Raw Right Image +Raw Right Image. ![Raw Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true) ![Raw Stacked Left and Right Images ](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true) @@ -114,27 +114,27 @@ Let's focus on a single point - the top left corner of the laptop. As per equati ### Rectified Left and Right Images We can perform image rectification/post-processing to correct for differences in intrinsic parameters and orientations of the left and right cameras. This process involves performing 3x3 matrix transformations. In the OAK-D Lite API, a stereo node performs these calculations and outputs the rectified left and right images. Details and source code can be viewed [here](https://github.com/luxonis/depthai-experiments/blob/master/gen2-stereo-on-host/main.py). In this specific implementation, correction for intrinsic parameters is performed using intrinsic camera matrices, and correction for orientation is performed using rotation matrices(part of calibration parameters) for the left and right cameras. The rectified left image is transformed as if the left camera had the same intrinsic parameters as the right one. Therefore, in all our following calculations, we'll use the intrinsic parameters for the right camera i.e. focal length of 452.9 and principal point at (298.85, 245.52). In the rectified and stacked images below, notice that the red line at constant v touches the top-left corner of the laptop in both the left and right images. -Rectified Left Image +Rectified Left Image. ![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true) -Rectified Right Image +Rectified Right Image. ![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true) -Rectified and Stacked Left and Right Images +Rectified and Stacked Left and Right Images. ![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true) Let's also overlap the rectified left and right images to see the difference. We can see that the v values for different points remain mostly constant in the left and right images. However, the u values change, and this difference in the u values helps us find the depth information for different points in the scene, as shown in Equation 6 above. 
This difference in 'u' values \\(u\_left - u\_right\\) is called disparity, and we can notice that the disparity for points near the camera is greater compared to points further away. Depth z and disparity \\(u\_left - u\_right\\) are inversely proportional, as shown in equation 6. -Rectified and Overlapped Left and Right Images +Rectified and Overlapped Left and Right Images. ![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true) ### Annotated Left and Right Rectified Images Let's find the 3D coordinates for some points in the scene. A few points are selected and manually annotated with their (u,v) values, as shown in the figures below. Instead of manual annotations, we can also use template-based matching, feature detection algorithms like SIFT, etc for finding corresponding points in left and right images. -Annotated Left Image +Annotated Left Image. ![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true) -Annotated Right Image +Annotated Right Image. ![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true) ### 3D Coordinate Calculations @@ -168,7 +168,7 @@ We can also compute 3D distances between different points using their (x,y,z) va | d5(9-10) | 16.9 | 16.7 | 1.2 | | d6(9-11) | 23.8 | 24 | 0.83 | -Calculated Dimension Results +Calculated Dimension Results. ![Calculated Dimension Results](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/calculated_dim_results.png?download=true) ## Conclusion diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Introduction/brief_history.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Introduction/brief_history.mdx index f50f6ca04..39876dabc 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Introduction/brief_history.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Introduction/brief_history.mdx @@ -2,17 +2,17 @@ ## 1838: Stereoscopy -- **Inventor**: Sir Charles Wheatstone +- **Inventor**: Sir Charles Wheatstone. - **Technique**: Presenting offset images to each eye through a stereoscope, creating depth perception. ## 1853: Anaglyph 3D -- **Pioneer**: Louis Ducos du Hauron +- **Pioneer**: Louis Ducos du Hauron. - **Method**: Using glasses with colored filters to separate images in complementary colors, creating a depth illusion. ## 1936: Polarized 3D -- **Developer**: Edwin H. Land +- **Developer**: Edwin H. Land. - **Approach**: Utilizing polarized light technology in 3D movies, with glasses that filter light in specific directions. ## 1960s: Virtual Reality @@ -22,7 +22,7 @@ ## 1979: Autostereograms (Magic Eye Images) -- **Creator**: Christopher Tyler +- **Creator**: Christopher Tyler. - **Concept**: 2D patterns that allow viewers to see 3D images without special glasses. 
## 1986: IMAX 3D diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/CameraModels.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/CameraModels.mdx index 529873eb1..e6049019b 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/CameraModels.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/CameraModels.mdx @@ -13,11 +13,11 @@ There are a number of different conventions for the direction of the camera axes ### Pinhole camera coordinate transformation ![Pinhole transformation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole_transform.png) -Each point in 3D space maps to a single point on the 2D plane. To find the map between 3D and 2D coordinates, we first need to know the intrinsics of the camera, which for a pinhole camera are - - the focal lengths, \\(f_x\\) and \\(f_y\\) +Each point in 3D space maps to a single point on the 2D plane. To find the map between 3D and 2D coordinates, we first need to know the intrinsics of the camera, which for a pinhole camera are: + - the focal lengths, \\(f_x\\) and \\(f_y\\). - the coordinates of the principle point, \\(c_x\\)and \\(c_y\\), which is the optical centre of the image. This point is where the optical axis intersects the image plane. -Using these intrinsic parameters, we construct the camera matrix +Using these intrinsic parameters, we construct the camera matrix: $$ K = \begin{pmatrix} @@ -29,7 +29,7 @@ $$ In order to apply this to a point \\(p=[x,y,z]\\) to a point in 3D space, we multiply the point by the camera matrix $K @ p$ to give a new 3x1 vector \\([u,v,w]\\). This is a homogeneous vector in 2D, but where the last component isn't 1. To find the position of the point in the image plane we have to divide the first two coordinates by the last one, to give the point \\([u/w, v/w]\\). -Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix +Whilst this is the textbook definition of the camera matrix, if we use the Blender camera convention it will flip the image left to right and up-down (as points in front of the camera will have negative z-values). One potential way to fix this is to change the signs of some of the elements of the camera matrix: $$ K = \begin{pmatrix} diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/LinearAlgebra.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/LinearAlgebra.mdx index 3f52d9046..e89e1b239 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/LinearAlgebra.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/Terminologies and Basics/LinearAlgebra.mdx @@ -135,7 +135,7 @@ The output should look something like this: Rotations around an axis are another commonly used transformation. There are a number of different ways of representing rotations, including Euler angles and quaternions, which can be very useful in some applications. Again, libraries such as Pytorch3d include a wide range of functionalities for performing rotations. 
However, as a simple example, we will just show how to construct rotations about each of the three axes. -- Rotation around the X-axis +- Rotation around the X-axis: $$ R_x(\alpha) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha & 0 \\ 0 & \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$ @@ -170,7 +170,7 @@ The output should look something like this: ![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png) -- Rotation around the Y-axis +- Rotation around the Y-axis: $$ R_y(\beta) = \begin{pmatrix} \cos\beta & 0 & \sin\beta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\beta & 0 & \cos\beta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$ diff --git a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/nerf.mdx b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/nerf.mdx index 687fc273a..6e9ef2492 100644 --- a/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/nerf.mdx +++ b/chapters/en/Unit 8 - 3D Vision, Scene Rendering and Reconstruction/nerf.mdx @@ -8,7 +8,7 @@ Furthermore, it allows us to store large scenes with a smaller memory footprint ## Short History 📖 The field of NeRFs is relatively young with the first publication by [Mildenhall et al.](https://www.matthewtancik.com/nerf) appearing in 2020. Since then, a vast number of papers have been published and fast advancements have been made. -Since 2020, more than 620 preprints and publications have been released, with more than 250 repositories on GitHub. *(as of Dec 2023, statistics from [paperswithcode.com](https://paperswithcode.com/method/nerf))* +Since 2020, more than 620 preprints and publications have been released, with more than 250 repositories on GitHub. *(as of Dec 2023, statistics from [paperswithcode.com](https://paperswithcode.com/method/nerf))*. Since the first formulation of NeRFs requires long training times (up to days on beefy GPUs), there have been a lot of advancements towards faster training and inference. An important leap was NVIDIA's [Instant-ngp](https://nvlabs.github.io/instant-ngp/), which was released in 2022. @@ -18,7 +18,7 @@ This novel approach was faster to train and query while performing on par qualit [Mipnerf-360](https://jonbarron.info/mipnerf360/), which was also released in 2022, is also worth mentioning. Again, the model architecture is the same as for most NeRFs, but the authors introduced a novel scene contraction that allows us to represent scenes that are unbounded in all directions, which is important for real-world applications. [Zip-NeRF](https://jonbarron.info/zipnerf/), released in 2023, combines recent advancements like the encoding from [Instant-ngp](https://nvlabs.github.io/instant-ngp/) and the scene contraction from [Mipnerf-360](https://jonbarron.info/mipnerf360/) to handle real-world situation whilst decreasing training times to under an hour. -*(this is still measured on beefy GPUs to be fair)* +*(this is still measured on beefy GPUs to be fair)*. Since the field of NeRFs is rapidly evolving, we added a section `sota` at the end where we will tease the latest research and the possible future direction of NeRFs. @@ -32,7 +32,7 @@ A simple NeRF pipeline can be summarized with the following picture: ![nerf_pipeline](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png) -Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf) +Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf). 
**(a)** Sample points and viewing directions along camera rays and pass them through the network. @@ -144,7 +144,7 @@ visualize_grid(grid, encoded_grid, resolution) The output should look something like the image below: -![encoding](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/nerf_encodings.png) +![encoding](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_encodings.png) The second trick worth mentioning is that most methods use smart approaches to sample points in space. Essentially, we want to avoid sampling in regions where the scene is empty. diff --git a/chapters/en/Unit 9 - Model Optimization/intro_to_model_optimization.mdx b/chapters/en/Unit 9 - Model Optimization/intro_to_model_optimization.mdx index ac39fe428..5e21693b8 100644 --- a/chapters/en/Unit 9 - Model Optimization/intro_to_model_optimization.mdx +++ b/chapters/en/Unit 9 - Model Optimization/intro_to_model_optimization.mdx @@ -16,12 +16,12 @@ As we already know, optimizing the model is important in before the deployment s There are several techniques in the model optimization, which will be explained in the next section. However, this section will briefly describe several types: 1. Pruning: Pruning is the process of eliminating redundant or unimportant connections in the model. This aims to reduce model size and complexity. -![Pruning](https://huggingface.co/datasets/hf-vision/course-assets/raw/main/pruning.png) +![Pruning](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/pruning.png) 2. Quantization: Quantization means converting model weights from high-precision formats (e.g., 32-bit floating-point) to lower-precision formats (e.g., 16-bit floating-point or 8-bit integers) to reduce memory footprint and increase inference speed. 3. Knowledge Distillation: Knowledge distillation aims to transfer knowledge from a complex and larger model (teacher model) to a smaller model (student model) by mimicking the behavior of the teacher model. -![Knowledge Distillation](https://huggingface.co/datasets/hf-vision/course-assets/raw/main/knowledge_distillation.png) +![Knowledge Distillation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/knowledge_distillation.png) 4. Low-rank approximation: Approximates large matrices with small ones, reducing memory consumption and computational costs. 5. Model compression with hardware accelerators: This process is like pruning and quantization. But, running on specific hardware such as NVIDIA GPUs and Intel Hardware. @@ -34,9 +34,9 @@ A trade-off exists between accuracy, performance, and resource usage when deploy Image below shows a common computer vision model in terms of model size, accuracy, and latency. Bigger model has high accuracy, but needs more time for inference and big size. -![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/raw/main/model_size_vs_accuracy.png) +![Model Size VS Accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/model_size_vs_accuracy.png) -![Accuracy VS Latency](https://huggingface.co/datasets/hf-vision/course-assets/raw/main/accuracy_vs_latency.png) +![Accuracy VS Latency](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/accuracy_vs_latency.png) These are the three things we must consider: where do we focus on the model we trained? For example, focusing on high accuracy will result in a slower model during inference or require extensive resources. 
To overcome this, we apply one of the optimization methods as explained so that the model we get can maximize or balance the trade-off between the three components mentioned above. diff --git a/chapters/en/Unit 9 - Model Optimization/tools_and_frameworks.mdx b/chapters/en/Unit 9 - Model Optimization/tools_and_frameworks.mdx index 985ac7392..159004522 100644 --- a/chapters/en/Unit 9 - Model Optimization/tools_and_frameworks.mdx +++ b/chapters/en/Unit 9 - Model Optimization/tools_and_frameworks.mdx @@ -7,7 +7,7 @@ The TensorFlow Model Optimization Toolkit is a suite of tools for optimizing machine learning models for deployment. The TensorFlow Lite post-training quantization tool enable users to convert weights to 8 bit precision which reduces the trained model size by about 4 times. The tools also include API for pruning and quantization during training is post-training quantization is insufficient. -These help user to reduce latency and inference cost, deploy models to edge devices with restricted resources and optimized execution for existing hardware or new special purpose accelerators +These help user to reduce latency and inference cost, deploy models to edge devices with restricted resources and optimized execution for existing hardware or new special purpose accelerators. ### Setup guide @@ -18,7 +18,7 @@ pip install -U tensorflow-model-optimization ### Hands-on guide -For a hands-on guide on how to use the Tensorflow Model Optimization Toolkit, refer this [notebook](https://colab.research.google.com/drive/1t1Tq6i0JZbOwloyhkSjg8uTTVX9iUkgj#scrollTo=D_MCHp6cwCFb) +For a hands-on guide on how to use the Tensorflow Model Optimization Toolkit, refer this [notebook](https://colab.research.google.com/drive/1t1Tq6i0JZbOwloyhkSjg8uTTVX9iUkgj#scrollTo=D_MCHp6cwCFb). ## Pytorch Quantization ### Overview @@ -40,7 +40,7 @@ import torch.quantization ``` ## Hands-on guide -For a hands-on guide on how to use the Pytorch Quantization, refer this [notebook](https://colab.research.google.com/drive/1toyS6IUsFvjuSK71oeLZZ51mm8hVnlZv +For a hands-on guide on how to use the Pytorch Quantization, refer this [notebook](https://colab.research.google.com/drive/1toyS6IUsFvjuSK71oeLZZ51mm8hVnlZv). ## ONNX Runtime @@ -49,12 +49,12 @@ For a hands-on guide on how to use the Pytorch Quantization, refer this [noteboo ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries. ONNX Runtime can be used with models from PyTorch, Tensorflow/Keras, TFLite, scikit-learn, and other frameworks. The benefits of using ONNX Runtime for Inferencing are as follows: -- Improve inference performance for a wide variety of ML models -- Run on different hardware and operating systems -- Train in Python but deploy into a C#/C++/Java app -- Train and perform inference with models created in different frameworks +- Improve inference performance for a wide variety of ML models. +- Run on different hardware and operating systems. +- Train in Python but deploy into a C#/C++/Java app. +- Train and perform inference with models created in different frameworks. -For more details on ONNX Runtime, see [here](https://onnxruntime.ai/docs/) +For more details on ONNX Runtime, see [here](https://onnxruntime.ai/docs/). 
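As a minimal sketch of what inference with ONNX Runtime looks like in practice, the snippet below loads an exported model and runs it on a dummy input. The file name, input shape, and execution provider are placeholders.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for any exported computer vision model.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed NCHW input
outputs = session.run(None, {input_name: dummy_image})
print(outputs[0].shape)
```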
### Setup guide @@ -72,7 +72,7 @@ pip install onnxruntime-gpu ### Hands-on guide -For a hands-on guide on how to use the ONNX Runtime, refer this [notebook](https://colab.research.google.com/drive/1A-qYPX52V2q-7fXHaLeNRJqPUk3a4Qkd) +For a hands-on guide on how to use the ONNX Runtime, refer this [notebook](https://colab.research.google.com/drive/1A-qYPX52V2q-7fXHaLeNRJqPUk3a4Qkd). ## TensorRT @@ -88,11 +88,11 @@ TensorRT is available as a pip package, `tensorrt`. To install the package, run ``` pip install tensorrt ``` -for other installation methods, see [here](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#install) +for other installation methods, see [here](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#install). ### Hands-on guide -For a hands-on guide on how to use the TensorRT, refer this [notebook](https://colab.research.google.com/drive/1b8ueEEwgRc9fGqky1f6ZPx5A2ak82FE1) +For a hands-on guide on how to use the TensorRT, refer this [notebook](https://colab.research.google.com/drive/1b8ueEEwgRc9fGqky1f6ZPx5A2ak82FE1). ## OpenVINO @@ -100,10 +100,10 @@ For a hands-on guide on how to use the TensorRT, refer this [notebook](https://c The OpenVINO™ toolkit enables user to optimize a deep learning model from almost any framework and deploy it with best-in-class performance on a range of Intel® processors and other hardware platforms. The benefits of using OpenVINO includes: -- link directly with OpenVINO Runtime to run inference locally or use OpenVINO Model Server to serve model inference from a separate server or within Kubernetes environment +- link directly with OpenVINO Runtime to run inference locally or use OpenVINO Model Server to serve model inference from a separate server or within Kubernetes environment. - Write an application once, deploy it anywhere on your preferred device, language and OS. -- has minimal external dependencies -- Reduces first-inference latency by using the CPU for initial inference and then switching to another device once the model has been compiled and loaded to memory +- has minimal external dependencies. +- Reduces first-inference latency by using the CPU for initial inference and then switching to another device once the model has been compiled and loaded to memory. ### Setup guide @@ -112,11 +112,11 @@ Openvino is available as a pip package, `openvino`. To install the package, run pip install openvino ``` -For other installation methods, see [here](https://docs.openvino.ai/2023.2/openvino_docs_install_guides_overview.html?VERSION=v_2023_2_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE) +For other installation methods, see [here](https://docs.openvino.ai/2023.2/openvino_docs_install_guides_overview.html?VERSION=v_2023_2_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE). ### Hands-on guide -For a hands-on guide on how to use the OpenVINO, refer this [notebook](https://colab.research.google.com/drive/1FWD0CloFt6gIEd0WBSMBDDKzA7YUE8Wz) +For a hands-on guide on how to use the OpenVINO, refer this [notebook](https://colab.research.google.com/drive/1FWD0CloFt6gIEd0WBSMBDDKzA7YUE8Wz). ## Optimum @@ -142,11 +142,11 @@ Optimum is available as a pip package, `optimum`. To install the package, run th pip install optimum ``` -For installation of accelerator-specific features, see [here](https://huggingface.co/docs/optimum/installation) +For installation of accelerator-specific features, see [here](https://huggingface.co/docs/optimum/installation). 
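Before the hands-on notebook, here is a rough sketch of dynamic INT8 quantization with Optimum's ONNX Runtime backend. The model id and save directory are assumptions, and the exact configuration helpers may differ between Optimum versions.

```python
from optimum.onnxruntime import ORTModelForImageClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export an assumed Hub checkpoint to ONNX, then quantize it dynamically.
model = ORTModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="vit-quantized-onnx", quantization_config=qconfig)
```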
### Hands-on guide -For a hands-on guide on how to use Optimum for quantization, refer this [notebook](https://colab.research.google.com/drive/1tz4eHqSZzGlXXS3oBUc2NRbuRCn2HjdN) +For a hands-on guide on how to use Optimum for quantization, refer this [notebook](https://colab.research.google.com/drive/1tz4eHqSZzGlXXS3oBUc2NRbuRCn2HjdN). ## EdgeTPU @@ -154,12 +154,12 @@ For a hands-on guide on how to use Optimum for quantization, refer this [noteboo Edge TPU is Google’s purpose-built ASIC designed to run AI at the edge. It delivers high performance in a small physical and power footprint, enabling the deployment of high-accuracy AI at the edge. The benefits of using EdgeTPU includes: -- Complements Cloud TPU and Google Cloud services to provide an end-to-end, cloud-to-edge, hardware + software infrastructure for AI-based solutions deployment -- High performance in a small physical and power footprint +- Complements Cloud TPU and Google Cloud services to provide an end-to-end, cloud-to-edge, hardware + software infrastructure for AI-based solutions deployment. +- High performance in a small physical and power footprint. - Combined custom hardware, open software, and state-of-the-art AI algorithms to provide high-quality, easy to deploy AI solutions for the edge. For more details on EdgeTPU, see [here](https://cloud.google.com/edge-tpu) -For guide on how to setup and use EdgeTPU, refer this [notebook](https://colab.research.google.com/drive/1aMEZE2sI9aMLLBVJNSS37ltMwmtEbMKl) +For guide on how to setup and use EdgeTPU, refer this [notebook](https://colab.research.google.com/drive/1aMEZE2sI9aMLLBVJNSS37ltMwmtEbMKl). From aee83e5ddb446bb4301730686b01ec7786b2f2e4 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Tue, 23 Apr 2024 15:46:17 +0200 Subject: [PATCH 03/52] Restored HTTP image link --- .../en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx index 348db6de5..60e1b35d6 100644 --- a/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx +++ b/chapters/en/Unit 4 - Multimodal Models/CLIP and relatives/clip.mdx @@ -28,7 +28,7 @@ CLIP, can be leveraged for a variety of applications. Here are some notable use For practical applications, one typically uses an image, and pre-defined classes as input. The provided Python example demonstrates how to use the transformers library for running CLIP. In this example, we want to zero-shot classify the image below between `dog` or `cat`. 
-![A photo of cats](https://images.cocodataset.org/val2017/000000039769.jpg) +![A photo of cats](http://images.cocodataset.org/val2017/000000039769.jpg) ```python from PIL import Image From 7efc70c10ad04e4decbc301b14feebcd8947f994 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Fri, 26 Apr 2024 10:21:55 +0200 Subject: [PATCH 04/52] Update chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx Co-authored-by: Merve Noyan --- chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx index 72a0d3d9f..8ed1aa56e 100644 --- a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx +++ b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx @@ -3,7 +3,7 @@ What you will learn in this chapter: - What is missing in Vanilla GAN. -- StyleGAN1 components and benifits. +- StyleGAN1 components and benefits - Drawback of StyleGAN1 and the need for StyleGAN2. - Drawback of StyleGAN2 and the need for StyleGAN3. - Usecases of StyleGAN. From e84231b3985d188856c2e6b2c3561bd942e7552b Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Fri, 26 Apr 2024 10:22:14 +0200 Subject: [PATCH 05/52] Update chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx Co-authored-by: Merve Noyan --- chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx index 8ed1aa56e..ebf4e21a1 100644 --- a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx +++ b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx @@ -6,7 +6,7 @@ What you will learn in this chapter: - StyleGAN1 components and benefits - Drawback of StyleGAN1 and the need for StyleGAN2. - Drawback of StyleGAN2 and the need for StyleGAN3. -- Usecases of StyleGAN. +- Use cases of StyleGAN ## What is missing in Vanilla GAN Generative Adversarial Networks(GANs) are a class of generative models that produce realistic images. But it is very evident that you don't have any control over how the images are generated. In Vanilla GANs, you have two networks (i) A Generator, and (ii) A Discriminator. A Discriminator takes an image as input and returns whether it is a real image or a synthetically generated image by the generator. A Generator takes in noise vector (generally sampled from a multivariate Gaussian) and tries to produce images that look similar but not exactly the same as the ones available in the training samples, initially, it will be a junk image but in a long run the aim of the Generator is to fool the Discriminator into believing that the images generated by the generator are real. 
From 5a071d48d859848a4710a3efffe05f6d131d0363 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Fri, 26 Apr 2024 10:28:55 +0200 Subject: [PATCH 06/52] Updated suggestions and broken link --- .../image_and_imaging/examples-preprocess.mdx | 10 +++++----- chapters/en/unit10/blenderProc.mdx | 18 +++++++++--------- chapters/en/unit10/point_clouds.mdx | 2 +- chapters/en/unit10/synthetic-lung-images.mdx | 16 ++++++++-------- chapters/en/unit2/cnns/googlenet.mdx | 8 ++++---- chapters/en/unit3/vision-transformers/cvt.mdx | 10 +++++----- ...ion-transformer-for-objection-detection.mdx | 2 +- .../generative-models/gans-vaes/stylegan.mdx | 12 ++++++------ .../en/unit8/3d_measurements_stereo_vision.mdx | 14 +++++++------- chapters/en/unit9/tools_and_frameworks.mdx | 14 +++++++------- 10 files changed, 53 insertions(+), 53 deletions(-) diff --git a/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx b/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx index 7a0b7d73f..05f58658e 100644 --- a/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx +++ b/chapters/en/unit1/image_and_imaging/examples-preprocess.mdx @@ -5,11 +5,11 @@ Now that we have seen what are images, how they are acquired, and their impact, ## Operations in Digital Image Processing In digital image processing, operations on images are diverse and can be categorized into: -- Logical. -- Statistical. -- Geometrical. -- Mathematical. -- Transform operations. +- Logical +- Statistical +- Geometrical +- Mathematical +- Transform operations Each category encompasses different techniques, such as morphological operations under logical operations or fourier transforms and principal component analysis (PCA) under transforms. In this context, we refer to morphology as the group of operations that use structuring elements to generate images of the same size by looking into the values of the pixel neighborhood. Understanding the distinction between element-wise and matrix operations is important in image manipulation. Element-wise operations, such as raising an image to a power or dividing it by another image, involve processing each pixel individually. This pixel-based approach contrasts with matrix operations, which utilize matrix theory for image manipulation. Having said that, you can do whatever you want with images, as they are matrices containing numbers! diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index 962ed313f..69bf2b423 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -122,9 +122,9 @@ Here are some images rendered with the basic example: ## Blender Resources -- [User Manual](https://docs.blender.org/manual/en/latest/0). -- [Awesome-blender -- Extensive list of resources](https://awesome-blender.netlify.app). -- [Blender Youtube Channel](https://www.youtube.com/@BlenderOfficial). +- [User Manual](https://docs.blender.org/manual/en/latest/0) +- [Awesome-blender -- Extensive list of resources](https://awesome-blender.netlify.app) +- [Blender Youtube Channel](https://www.youtube.com/@BlenderOfficial) ### The following video explains how to render a 3D syntehtic dataset in Blender: @@ -136,15 +136,15 @@ Here are some images rendered with the basic example: ## Papers / Blogs -- [Developing digital twins of multi-camera metrology systems in Blender](https://iopscience.iop.org/article/10.1088/1361-6501/acc59e/pdf_). -- [Generate Depth and Normal Maps with Blender](https://www.saifkhichi.com/blog/blender-depth-map-surface-normals). 
-- [Object detection with synthetic training data](https://medium.com/rowden/object-detection-with-synthetic-training-data-f6735a5a34bc). +- [Developing digital twins of multi-camera metrology systems in Blender](https://iopscience.iop.org/article/10.1088/1361-6501/acc59e/pdf_) +- [Generate Depth and Normal Maps with Blender](https://www.saifkhichi.com/blog/blender-depth-map-surface-normals) +- [Object detection with synthetic training data](https://medium.com/rowden/object-detection-with-synthetic-training-data-f6735a5a34bc) ## BlenderProc Resources -- [BlenderProc Github Repo](https://github.com/DLR-RM/BlenderProc). -- [BlenderProc: Reducing the Reality Gap with Photorealistic Rendering](https://elib.dlr.de/139317/1/denninger.pdf). -- [Documentation](https://dlr-rm.github.io/BlenderProc/). +- [BlenderProc Github Repo](https://github.com/DLR-RM/BlenderProc) +- [BlenderProc: Reducing the Reality Gap with Photorealistic Rendering](https://elib.dlr.de/139317/1/denninger.pdf) +- [Documentation](https://dlr-rm.github.io/BlenderProc/) ### The following video provides an overview of the BlenderProc pipeline: diff --git a/chapters/en/unit10/point_clouds.mdx b/chapters/en/unit10/point_clouds.mdx index a373a6371..199148daa 100644 --- a/chapters/en/unit10/point_clouds.mdx +++ b/chapters/en/unit10/point_clouds.mdx @@ -45,7 +45,7 @@ Now, first we need to understand the formats in which these point clouds are sto **Why?** - `point-cloud-utils` supports reading common mesh formats (PLY, STL, OFF, OBJ, 3DS, VRML 2.0, X3D, COLLADA). -- If it can be imported into [MeshLab](https://github.com/cnr-isti-vclab/meshlab), we can read it! (from their readme). +- If it can be imported into [MeshLab](https://github.com/cnr-isti-vclab/meshlab), we can read it! (from their readme) The type of file is inferred from its file extension. Some of the extensions supported are: diff --git a/chapters/en/unit10/synthetic-lung-images.mdx b/chapters/en/unit10/synthetic-lung-images.mdx index d02f48d2c..1527306db 100644 --- a/chapters/en/unit10/synthetic-lung-images.mdx +++ b/chapters/en/unit10/synthetic-lung-images.mdx @@ -12,21 +12,21 @@ The generator has the following model architecture: - The input is a vector a 100 random numbers and the output is a image of size 128*128*3. - The model has 4 convolutional layers: - - Conv2D layer. - - Batch Normalization layer. - - ReLU activation. + - Conv2D layer + - Batch Normalization layer + - ReLU activation - Conv2D layer with Tanh activation. The discriminator has the following model architecture: - The input is an image and the output is a probability indicating whether the image is fake or real. - The model has one convolutional layer: - - Conv2D layer. - - Leaky ReLU activation. + - Conv2D layer + - Leaky ReLU activation - Three convolutional layers with: - - Conv2D layer. - - Batch Normalization layer. - - Leaky ReLU activation. + - Conv2D layer + - Batch Normalization layer + - Leaky ReLU activation - Conv2D layer with Sigmoid. **Data Collection** diff --git a/chapters/en/unit2/cnns/googlenet.mdx b/chapters/en/unit2/cnns/googlenet.mdx index 8b3c9832e..9086921bc 100644 --- a/chapters/en/unit2/cnns/googlenet.mdx +++ b/chapters/en/unit2/cnns/googlenet.mdx @@ -21,7 +21,7 @@ The Inception Module insists on applying convolution filters of different kernel ![inception_naive](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_naive.png) -Figure 1: Naive Inception Module. 
+Figure 1: Naive Inception Module As we can see applying multiple convolutions at multiple scales with bigger kernel sizes, like 5x5, can increase the number of parameters drastically. This problem is pronounced as the input feature size (channel size) increases. So as we go deep in the network stacking these "Inception Modules", the computation will increase drastically. The simple solution is to reduce the number of features wherever computational requirements seem to increase. The major pain points of high computation are the convolution layers. The feature dimension is reduced by a computationally inexpensive $1 \times 1$ convolution just before the 3x3 and 5x5 convolution. Let's see it with an example. @@ -31,7 +31,7 @@ We would also want to reduce the output features of max pooling before concatena ![inception_reduced](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inception_reduced.png) -Figure 2: Inception Module. +Figure 2: Inception Module Also, because of the parallel operations of convolutions at multiple scales, we are ensuring more operations without going deeper into the network, essentially mitigating the vanishing gradient problem. @@ -68,14 +68,14 @@ These auxiliary classifiers are removed at inference time. However, minimal gain ![googlenet_aux_clf](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/googlenet_auxiliary_classifier.jpg) -Figure 3: An Auxiliary Classifier. +Figure 3: An Auxiliary Classifier ### Architecture - GoogLeNet The complete architecture of GoogLeNet is shown in Figure below. All convolutions, including inside the inception block, use ReLU activation. It starts with two convolution(s) and max-pooling blocks. This is followed by a block of two inception modules (3a and 3b) and a max pooling. This follows a block of 5 inception blocks (4a, 4b, 4c, 4d, 4e) and a max pooling after. The auxiliary classifiers are taken out from outputs of 4a and 4d. Two inception blocks follow (5a and 5b). After this, an average pooling and a fully connected layer of 128 units are used. ![googlenet_arch](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/googlenet_architecture.png) -Figure 4: Complete GoogLeNet Architecture. +Figure 4: Complete GoogLeNet Architecture ### Code diff --git a/chapters/en/unit3/vision-transformers/cvt.mdx b/chapters/en/unit3/vision-transformers/cvt.mdx index 33136fc64..281754573 100644 --- a/chapters/en/unit3/vision-transformers/cvt.mdx +++ b/chapters/en/unit3/vision-transformers/cvt.mdx @@ -50,7 +50,7 @@ The four main highlights of CvT that helped achieve superior performance and com Time to get hands-on! Let's explore how to code each major blocks of the CvT architecture in PyTorch shown in the official implementation [[8]](#cvt-imp). -1. Importing required libraries. +1. Importing required libraries ```python from collections import OrderedDict @@ -61,7 +61,7 @@ from einops import rearrange from einops.layers.torch import Rearrange ``` -2. Implementation of **Convolutional Projection**. +2. Implementation of **Convolutional Projection** ```python def _build_projection(self, dim_in, dim_out, kernel_size, padding, stride, method): @@ -121,7 +121,7 @@ The method takes several parameters related to a convolutional layer (such as in The rearrangement of dimensions is performed using the `Rearrange` operation, which reshapes the input tensor. The resulting projection block is then returned. -3. Implementation of **Convolutional Token Embedding**. +3. 
Implementation of **Convolutional Token Embedding** ```python class ConvEmbed(nn.Module): @@ -161,7 +161,7 @@ This code defines a ConvEmbed module that performs patch-wise embedding on an in In summary, this module is designed for patch-wise embedding of images, where each patch is processed independently through a convolutional layer, and optional normalization is applied to the embedded features. -4. Implementation of **Vision Transformer** Block. +4. Implementation of **Vision Transformer** Block ```python class VisionTransformer(nn.Module): @@ -277,7 +277,7 @@ This code defines a Vision Transformer module. Here's a brief overview of the co - **Forward Method:** The forward method processes the input through the patch embedding, rearranges the dimensions, adds the classification token if present, applies dropout, and then passes the data through the stack of transformer blocks. Finally, the output is rearranged back to the original shape, and the classification token (if present) is separated from the rest of the sequence before returning the output. -5. Implementation of Convolutional Vision Transformer Block (**Hierarchy of Transformers**). +5. Implementation of Convolutional Vision Transformer Block (**Hierarchy of Transformers**) ```python class ConvolutionalVisionTransformer(nn.Module): diff --git a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx index 7ed782a34..7dcfecd46 100644 --- a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx +++ b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx @@ -25,7 +25,7 @@ These models typically receive images (static or frames from videos) as their in There are a lot of of applications around object detection. One of the most significant examples is in the field of autonomous driving, where object detection is used to detect different objects (like pedestrians, road signs, traffic lights, etc) around the car that become one of the inputs for taking decisions. -To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/chapters/en/Unit%206%20-%20Basic%20CV%20Tasks/object_detection.mdx) on Object Detection 🤗. +To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/learn/computer-vision-course/unit6/basic-cv-tasks/object_detection) on Object Detection 🤗. ### The Need to Fine-tune Models in Object Detection 🤔 diff --git a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx index ebf4e21a1..0dfeda2f7 100644 --- a/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx +++ b/chapters/en/unit5/generative-models/gans-vaes/stylegan.mdx @@ -2,10 +2,10 @@ What you will learn in this chapter: -- What is missing in Vanilla GAN. +- What is missing in Vanilla GAN - StyleGAN1 components and benefits -- Drawback of StyleGAN1 and the need for StyleGAN2. -- Drawback of StyleGAN2 and the need for StyleGAN3. +- Drawback of StyleGAN1 and the need for StyleGAN2 +- Drawback of StyleGAN2 and the need for StyleGAN3 - Use cases of StyleGAN ## What is missing in Vanilla GAN @@ -23,9 +23,9 @@ Let us just dive into the special components introduced in StyleGAN that give St As I already said, StyleGAN only modifies Generator and the Discriminator remains the same, hence it is not mentioned above. 
Diagram (a) corresponds to the structure of ProgessiveGAN. ProgessiveGAN is just a Vanilla GAN, but instead of generating images of a fixed resolution, it progressively generates images of higher resolution in aim of generating realistic high resolution images, i.e., block 1 of generator generates image of resolution 4 by 4, block 2 of generator generates image of resolution 8 by 8 and so on. Diagram (b) is the proposed StyleGAN architecture. It has the following main components: -1. A mapping network. -2. AdaIN (Adaptive Instance Normalisation). -3. Concatenation of Noise vector. +1. A mapping network +2. AdaIN (Adaptive Instance Normalisation) +3. Concatenation of Noise vector Let's break it down one by one. diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index f4e2c8b28..4ad2112a6 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -114,27 +114,27 @@ Let's focus on a single point - the top left corner of the laptop. As per equati ### Rectified Left and Right Images We can perform image rectification/post-processing to correct for differences in intrinsic parameters and orientations of the left and right cameras. This process involves performing 3x3 matrix transformations. In the OAK-D Lite API, a stereo node performs these calculations and outputs the rectified left and right images. Details and source code can be viewed [here](https://github.com/luxonis/depthai-experiments/blob/master/gen2-stereo-on-host/main.py). In this specific implementation, correction for intrinsic parameters is performed using intrinsic camera matrices, and correction for orientation is performed using rotation matrices(part of calibration parameters) for the left and right cameras. The rectified left image is transformed as if the left camera had the same intrinsic parameters as the right one. Therefore, in all our following calculations, we'll use the intrinsic parameters for the right camera i.e. focal length of 452.9 and principal point at (298.85, 245.52). In the rectified and stacked images below, notice that the red line at constant v touches the top-left corner of the laptop in both the left and right images. -Rectified Left Image. +Rectified Left Image ![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true) -Rectified Right Image. +Rectified Right Image ![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true) -Rectified and Stacked Left and Right Images. +Rectified and Stacked Left and Right Images ![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true) Let's also overlap the rectified left and right images to see the difference. We can see that the v values for different points remain mostly constant in the left and right images. However, the u values change, and this difference in the u values helps us find the depth information for different points in the scene, as shown in Equation 6 above. This difference in 'u' values \\(u\_left - u\_right\\) is called disparity, and we can notice that the disparity for points near the camera is greater compared to points further away. 
Depth z and disparity \\(u\_left - u\_right\\) are inversely proportional, as shown in equation 6. -Rectified and Overlapped Left and Right Images. +Rectified and Overlapped Left and Right Images ![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true) ### Annotated Left and Right Rectified Images Let's find the 3D coordinates for some points in the scene. A few points are selected and manually annotated with their (u,v) values, as shown in the figures below. Instead of manual annotations, we can also use template-based matching, feature detection algorithms like SIFT, etc for finding corresponding points in left and right images. -Annotated Left Image. +Annotated Left Image ![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true) -Annotated Right Image. +Annotated Right Image ![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true) ### 3D Coordinate Calculations @@ -168,7 +168,7 @@ We can also compute 3D distances between different points using their (x,y,z) va | d5(9-10) | 16.9 | 16.7 | 1.2 | | d6(9-11) | 23.8 | 24 | 0.83 | -Calculated Dimension Results. +Calculated Dimension Results ![Calculated Dimension Results](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/calculated_dim_results.png?download=true) ## Conclusion diff --git a/chapters/en/unit9/tools_and_frameworks.mdx b/chapters/en/unit9/tools_and_frameworks.mdx index 159004522..62012207d 100644 --- a/chapters/en/unit9/tools_and_frameworks.mdx +++ b/chapters/en/unit9/tools_and_frameworks.mdx @@ -100,10 +100,10 @@ For a hands-on guide on how to use the TensorRT, refer this [notebook](https://c The OpenVINO™ toolkit enables user to optimize a deep learning model from almost any framework and deploy it with best-in-class performance on a range of Intel® processors and other hardware platforms. The benefits of using OpenVINO includes: -- link directly with OpenVINO Runtime to run inference locally or use OpenVINO Model Server to serve model inference from a separate server or within Kubernetes environment. -- Write an application once, deploy it anywhere on your preferred device, language and OS. -- has minimal external dependencies. -- Reduces first-inference latency by using the CPU for initial inference and then switching to another device once the model has been compiled and loaded to memory. +- link directly with OpenVINO Runtime to run inference locally or use OpenVINO Model Server to serve model inference from a separate server or within Kubernetes environment +- Write an application once, deploy it anywhere on your preferred device, language and OS +- has minimal external dependencies +- Reduces first-inference latency by using the CPU for initial inference and then switching to another device once the model has been compiled and loaded to memory ### Setup guide @@ -154,9 +154,9 @@ For a hands-on guide on how to use Optimum for quantization, refer this [noteboo Edge TPU is Google’s purpose-built ASIC designed to run AI at the edge. It delivers high performance in a small physical and power footprint, enabling the deployment of high-accuracy AI at the edge. 
The benefits of using EdgeTPU includes: -- Complements Cloud TPU and Google Cloud services to provide an end-to-end, cloud-to-edge, hardware + software infrastructure for AI-based solutions deployment. -- High performance in a small physical and power footprint. -- Combined custom hardware, open software, and state-of-the-art AI algorithms to provide high-quality, easy to deploy AI solutions for the edge. +- Complements Cloud TPU and Google Cloud services to provide an end-to-end, cloud-to-edge, hardware + software infrastructure for AI-based solutions deployment +- High performance in a small physical and power footprint +- Combined custom hardware, open software, and state-of-the-art AI algorithms to provide high-quality, easy to deploy AI solutions for the edge For more details on EdgeTPU, see [here](https://cloud.google.com/edge-tpu) From 018cef9e4e2ba14b91d88cc7ef18074968209985 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:46:55 +0900 Subject: [PATCH 07/52] Update chapters/en/unit0/welcome/welcome.mdx Co-authored-by: Merve Noyan --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index 6220a5c40..6f87792d6 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -12,7 +12,7 @@ On this page, you can find how to join the learners community, make a submission To obtain your certification for completing the course, complete the following assignments: -1. Training/fine-tuning a Model. +1. Training/fine-tuning a model 2. Building an application and hosting it on Hugging Face Spaces. ### Training/fine-tuning a Model From 2e333288e11c443b7cf0001b395b8f0799e8180d Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:47:16 +0900 Subject: [PATCH 08/52] Update chapters/en/unit0/welcome/welcome.mdx Co-authored-by: Merve Noyan --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index 6f87792d6..951d1ca17 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -13,7 +13,7 @@ On this page, you can find how to join the learners community, make a submission To obtain your certification for completing the course, complete the following assignments: 1. Training/fine-tuning a model -2. Building an application and hosting it on Hugging Face Spaces. +2. Building an application and hosting it on Hugging Face Spaces ### Training/fine-tuning a Model From 75167928fa13134bc8f9c10c5a2067f610632707 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:47:39 +0900 Subject: [PATCH 09/52] Update chapters/en/unit0/welcome/welcome.mdx Co-authored-by: Merve Noyan --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index 951d1ca17..a9b1786cb 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -30,7 +30,7 @@ The model repository needs to have the following: In this assignment section, you'll be building a Gradio-based application for your computer vision model and sharing it on 🤗 Spaces. 
Learn more about these tasks using the following resources: -- [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio). +- [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio) - [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt). ## Certification 🥇 From b6be1d82e2ca6b1feae93ad1193381716399adae Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:47:48 +0900 Subject: [PATCH 10/52] Update chapters/en/unit0/welcome/welcome.mdx Co-authored-by: Merve Noyan --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index a9b1786cb..ca0b958ec 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -31,7 +31,7 @@ The model repository needs to have the following: In this assignment section, you'll be building a Gradio-based application for your computer vision model and sharing it on 🤗 Spaces. Learn more about these tasks using the following resources: - [Getting started with Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt#introduction-to-gradio) -- [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt). +- [How to share your application on 🤗 Spaces](https://huggingface.co/learn/nlp-course/chapter9/4?fw=pt) ## Certification 🥇 From 3953ca74e546b872b72ef892ecbc54d6c499d229 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:49:02 +0900 Subject: [PATCH 11/52] Update chapters/en/unit0/welcome/welcome.mdx Co-authored-by: Merve Noyan --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index ca0b958ec..b2a54b4eb 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -53,7 +53,7 @@ As a computer vision course learner, you may find the following set of channels - `#computer-vision`: a catch-all channel for everything related to computer vision. - `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions. -- `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision. +- `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision If you are interested in generative AI, we also invite you to join all channels related to the Diffusion Models: #core-announcements, #discussions, #dev-discussions, and #diff-i-made-this. From efd687e40d1f2fc64838c7018c6e17fe5f4b4fe0 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 09:52:18 +0900 Subject: [PATCH 12/52] Updated punctuation --- chapters/en/unit0/welcome/welcome.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index b2a54b4eb..6d6820b9a 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -51,8 +51,8 @@ There are many channels focused on various topics on our Discord server. You wil As a computer vision course learner, you may find the following set of channels particularly relevant: -- `#computer-vision`: a catch-all channel for everything related to computer vision. 
-- `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions. +- `#computer-vision`: a catch-all channel for everything related to computer vision +- `#cv-study-group`: a place to exchange ideas, ask questions about specific posts and start discussions - `#3d`: a channel to discuss aspects of computer vision specific to 3D computer vision If you are interested in generative AI, we also invite you to join all channels related to the Diffusion Models: #core-announcements, #discussions, #dev-discussions, and #diff-i-made-this. From 17f63fe5eafab48c2947c693338639ad4eb1da36 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:52:56 +0900 Subject: [PATCH 13/52] Update chapters/en/unit10/blenderProc.mdx Co-authored-by: Merve Noyan --- chapters/en/unit10/blenderProc.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index 69bf2b423..e238adf67 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -116,7 +116,7 @@ blenderproc run You can check out this notebook to try BlenderProc in Google Colab, demos the basic examples provided [here](https://github.com/DLR-RM/BlenderProc/tree/main/examples/basics). Here are some images rendered with the basic example: -![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png). +![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png) ![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png). ![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png). From 90d441e32fc585361b7cf84052808ac58d829c4f Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:53:17 +0900 Subject: [PATCH 14/52] Update chapters/en/unit10/blenderProc.mdx Co-authored-by: Merve Noyan --- chapters/en/unit10/blenderProc.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index e238adf67..30898a99a 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -117,7 +117,7 @@ You can check out this notebook to try BlenderProc in Google Colab, demos the ba Here are some images rendered with the basic example: ![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png) -![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png). +![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png) ![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png). 
## Blender Resources From 50a4fecbfd8689d0c4a8bc53609c2ed15ff53345 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:53:57 +0900 Subject: [PATCH 15/52] Update chapters/en/unit10/blenderProc.mdx Co-authored-by: Merve Noyan --- chapters/en/unit10/blenderProc.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index 30898a99a..d907609f9 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -118,7 +118,7 @@ Here are some images rendered with the basic example: ![colors](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/colors.png) ![normals](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/normals.png) -![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png). +![depth](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/synthetic-data-creation-PBR/depth.png) ## Blender Resources From 3857b6bb38993ab24f868d52cc2129e3ef71b281 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:54:25 +0900 Subject: [PATCH 16/52] Update chapters/en/unit10/point_clouds.mdx Co-authored-by: Merve Noyan --- chapters/en/unit10/point_clouds.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/point_clouds.mdx b/chapters/en/unit10/point_clouds.mdx index 199148daa..74c0558b1 100644 --- a/chapters/en/unit10/point_clouds.mdx +++ b/chapters/en/unit10/point_clouds.mdx @@ -25,7 +25,7 @@ The 3D Point Data is mainly used in self-driving capabilities, but now other AI We will be using the python library [point-cloud-utils](https://github.com/fwilliams/point-cloud-utils), and [open-3d](https://github.com/isl-org/Open3D), which can be installed by: ```bash - pip install point-cloud-utils +pip install point-cloud-utils ``` We will be also using the python library open-3d, which can be installed by: From 7082512e4b3764d4a1911872437ce5db90ab7aa1 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:55:41 +0900 Subject: [PATCH 17/52] Update chapters/en/unit10/synthetic_datasets.mdx Co-authored-by: Merve Noyan --- chapters/en/unit10/synthetic_datasets.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/synthetic_datasets.mdx b/chapters/en/unit10/synthetic_datasets.mdx index f2cc8a417..9ba820f87 100644 --- a/chapters/en/unit10/synthetic_datasets.mdx +++ b/chapters/en/unit10/synthetic_datasets.mdx @@ -39,7 +39,7 @@ Semantic segmentation is vital for autonomous vehicles to interpret and navigate | Name | Year | Description | Paper | | Additional Links | |---------------------|--------------|-------------|----------------|---------------------|---------------------| -| Virtual KITTI 2 | 2020 | Virtual Worlds as Proxy for Multi-Object Tracking Analysis. | [Virtual KITTI 2](https://arxiv.org/pdf/2001.10773.pdf) | | [Website](https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds/) | +| Virtual KITTI 2 | 2020 | Virtual Worlds as Proxy for Multi-Object Tracking Analysis | [Virtual KITTI 2](https://arxiv.org/pdf/2001.10773.pdf) | | [Website](https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds/) | | ApolloScape | 2019 | Compared with existing public datasets from real scenes, e.g. 
KITTI [2] or Cityscapes [3], ApolloScape contains much large and richer labeling including holistic semantic dense point cloud for each site, stereo, per-pixel semantic labeling, lane-mark labeling, instance segmentation, 3D car instance, high accurate location for every frame in various driving videos from multiple sites, cities, and daytimes. | [The ApolloScape Open Dataset for Autonomous Driving and its Application](https://arxiv.org/abs/1803.06184) | | [Website](https://apolloscape.auto/) | | Driving in the Matrix | 2017 | The core idea behind "Driving in the Matrix" is to use photo-realistic computer-generated images from a simulation engine to produce annotated data quickly. | [Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?](https://arxiv.org/pdf/1610.01983.pdf) | | [GitHub](https://github.com/umautobots/driving-in-the-matrix) ![GitHub stars](https://img.shields.io/github/stars/umautobots/driving-in-the-matrix.svg?style=social&label=Star) | | CARLA | 2017 | **CARLA** (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. Technically, it operates similarly to, as an open source layer over Unreal Engine 4 that provides sensors in the form of RGB cameras (with customizable positions), ground truth depth maps, ground truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation). | [CARLA: An Open Urban Driving Simulator](https://arxiv.org/pdf/1711.03938v1.pdf) | | [Website](https://carla.org/) | From 63f11a99727ad3303d21d057b496cb014b73bdce Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:57:05 +0900 Subject: [PATCH 18/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 904d0c4aa..49ffe226e 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -9,7 +9,7 @@ ConvNext represents a significant improvement to pure convolution models by inco ## Key Improvements The author of the ConvNeXT paper starts building the model with a regular ResNet (ResNet-50), then modernizes and improves the architecture step-by-step to imitate the hierarchical structure of Vision Transformers. The key improvements are: -- Training Techniques. +- Training techniques - Macro Design. - ResNeXt-ify. - Inverted Bottleneck. 
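Several of the ConvNeXt ingredients listed above (the large 7x7 depthwise kernel, the inverted bottleneck, and the micro-design choice of LayerNorm with a single GELU) come together in a single residual block. The sketch below is an illustrative approximation, not the official implementation; the channel width of 96 and the 4x expansion ratio are assumptions used only for the example.

```python
# Illustrative approximation of a ConvNeXt-style residual block (not the
# official code). Assumed sizes: width 96, 4x inverted-bottleneck expansion.
import torch
import torch.nn as nn


class ConvNeXtStyleBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Large kernel: depthwise 7x7 convolution (groups=dim gives one filter per channel).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Micro design: LayerNorm (applied channels-last) instead of BatchNorm.
        self.norm = nn.LayerNorm(dim)
        # Inverted bottleneck: expand to 4x the width, then project back down.
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()  # single activation per block
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                # input is (B, C, H, W)
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)   # to (B, H, W, C) for LayerNorm and Linear
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)   # back to (B, C, H, W)
        return residual + x         # residual connection


block = ConvNeXtStyleBlock(dim=96)
print(block(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```

The published model includes further refinements (for example, stochastic depth) that are left out of this sketch for brevity.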
From b531fd3286ae6364a81babeaa98a02471979afd6 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:57:27 +0900 Subject: [PATCH 19/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 49ffe226e..3c22564b6 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -10,7 +10,7 @@ ConvNext represents a significant improvement to pure convolution models by inco The author of the ConvNeXT paper starts building the model with a regular ResNet (ResNet-50), then modernizes and improves the architecture step-by-step to imitate the hierarchical structure of Vision Transformers. The key improvements are: - Training techniques -- Macro Design. +- Macro design - ResNeXt-ify. - Inverted Bottleneck. - Large Kernel Sizes. From 9e346487d57eb74614959dc2824cf0a4cf9ce8d3 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:58:08 +0900 Subject: [PATCH 20/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 3c22564b6..7e4a893dc 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -11,7 +11,7 @@ The author of the ConvNeXT paper starts building the model with a regular ResNet The key improvements are: - Training techniques - Macro design -- ResNeXt-ify. +- ResNeXt-ify - Inverted Bottleneck. - Large Kernel Sizes. - Micro Design. From 5ba57b43645e8ffe42da11f4800fe1993b78d0be Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:58:48 +0900 Subject: [PATCH 21/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 7e4a893dc..1fe9b0ed6 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -12,7 +12,7 @@ The key improvements are: - Training techniques - Macro design - ResNeXt-ify -- Inverted Bottleneck. +- Inverted bottleneck - Large Kernel Sizes. - Micro Design. From 6a17a6044203033d56af0144c2843a9b9a019fd1 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:59:18 +0900 Subject: [PATCH 22/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 1fe9b0ed6..4da0f7916 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -13,7 +13,7 @@ The key improvements are: - Macro design - ResNeXt-ify - Inverted bottleneck -- Large Kernel Sizes. +- Large kernel sizes - Micro Design. We will go through each of the key improvements. 
From d987aebf0c6416b6db0030331228019bb187c434 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 09:59:58 +0900 Subject: [PATCH 23/52] Update chapters/en/unit2/cnns/convnext.mdx Co-authored-by: Merve Noyan --- chapters/en/unit2/cnns/convnext.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 4da0f7916..03d36dfee 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -14,7 +14,7 @@ The key improvements are: - ResNeXt-ify - Inverted bottleneck - Large kernel sizes -- Micro Design. +- Micro design We will go through each of the key improvements. These designs are not novel in itself. However, you can learn how researchers adapt and modify designs systematically to improve existing models. From 93cc5dc89ad46b1ffdc7a814551fd324446aefb8 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:01:06 +0900 Subject: [PATCH 24/52] Update chapters/en/unit3/vision-transformers/detr.mdx Co-authored-by: Merve Noyan --- chapters/en/unit3/vision-transformers/detr.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit3/vision-transformers/detr.mdx b/chapters/en/unit3/vision-transformers/detr.mdx index 3114d3292..75548178a 100644 --- a/chapters/en/unit3/vision-transformers/detr.mdx +++ b/chapters/en/unit3/vision-transformers/detr.mdx @@ -143,7 +143,7 @@ The input image is first put through a ResNet backbone and then a convolution la x = self.backbone(inputs) h = self.conv(x) ``` -they are declared in the `__init__` function. +They are declared in the `__init__` function. ```python self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2]) self.conv = nn.Conv2d(2048, hidden_dim, 1) From ccbd8ca1edf3109ee1cb966e44b50753aaaf96d0 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:01:45 +0900 Subject: [PATCH 25/52] Update chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx Co-authored-by: Merve Noyan --- .../vision-transformer-for-objection-detection.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx index 7dcfecd46..9b341fc9c 100644 --- a/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx +++ b/chapters/en/unit3/vision-transformers/vision-transformer-for-objection-detection.mdx @@ -25,7 +25,7 @@ These models typically receive images (static or frames from videos) as their in There are a lot of of applications around object detection. One of the most significant examples is in the field of autonomous driving, where object detection is used to detect different objects (like pedestrians, road signs, traffic lights, etc) around the car that become one of the inputs for taking decisions. -To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](/learn/computer-vision-course/unit6/basic-cv-tasks/object_detection) on Object Detection 🤗. +To deepen your understanding of the ins-and-outs of object detection, check out our [dedicated chapter](https://huggingface.co/learn/computer-vision-course/unit6/basic-cv-tasks/object_detection) on Object Detection 🤗. 
### The Need to Fine-tune Models in Object Detection 🤔 From 16c7f5e3a6396d7829d718769055fb23d8f21ae0 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:02:11 +0900 Subject: [PATCH 26/52] Update chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx Co-authored-by: Merve Noyan --- chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx index 60e1b35d6..2e9a2f7a5 100644 --- a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx +++ b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx @@ -55,7 +55,7 @@ probs = logits_per_image.softmax(dim=1) ``` After executing this code, we got the following probabilities: -- "a photo of a cat": 99.49%. +- "a photo of a cat": 99.49% - "a photo of a dog": 0.51%. ## Limitations From 011d71427ecb973e72fc8285206ad48a27f4e594 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:02:28 +0900 Subject: [PATCH 27/52] Update chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx Co-authored-by: Merve Noyan --- chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx index 2e9a2f7a5..0eb51e3a3 100644 --- a/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx +++ b/chapters/en/unit4/multimodal-models/clip-and-relatives/clip.mdx @@ -56,7 +56,7 @@ probs = logits_per_image.softmax(dim=1) After executing this code, we got the following probabilities: - "a photo of a cat": 99.49% -- "a photo of a dog": 0.51%. +- "a photo of a dog": 0.51% ## Limitations From fb6a7c61a995860b464d9cbbae2995e9ef6905ff Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:03:03 +0900 Subject: [PATCH 28/52] Update chapters/en/unit5/generative-models/diffusion-models/introduction.mdx Co-authored-by: Merve Noyan --- .../unit5/generative-models/diffusion-models/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index 4907aae9b..236d971b8 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -2,7 +2,7 @@ What you will learn from this chapter: -- What are diffusion models and how do they differ from GANs. +- What are diffusion models and how do they differ from GANs - Major sub categories of Diffusion models. - Use cases of Diffusion models. - Drawback in Diffusion models. 
From da4143394c28caba9ef7324d91cff4dcfb6329ba Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:03:34 +0900 Subject: [PATCH 29/52] Update chapters/en/unit5/generative-models/diffusion-models/introduction.mdx Co-authored-by: Merve Noyan --- .../unit5/generative-models/diffusion-models/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index 236d971b8..b7b1d8b63 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -3,7 +3,7 @@ What you will learn from this chapter: - What are diffusion models and how do they differ from GANs -- Major sub categories of Diffusion models. +- Major sub categories of diffusion models - Use cases of Diffusion models. - Drawback in Diffusion models. From a23ecc0e480a506339a450a41f11005bc16b15cf Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:04:17 +0900 Subject: [PATCH 30/52] Update chapters/en/unit5/generative-models/diffusion-models/introduction.mdx Co-authored-by: Merve Noyan --- .../unit5/generative-models/diffusion-models/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index b7b1d8b63..106d51f00 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -4,7 +4,7 @@ What you will learn from this chapter: - What are diffusion models and how do they differ from GANs - Major sub categories of diffusion models -- Use cases of Diffusion models. +- Use cases of diffusion models - Drawback in Diffusion models. From 56ad3569948fc1eaca6a31027cb3ddb58f9ce380 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:04:38 +0900 Subject: [PATCH 31/52] Update chapters/en/unit5/generative-models/diffusion-models/introduction.mdx Co-authored-by: Merve Noyan --- .../unit5/generative-models/diffusion-models/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index 106d51f00..a5df2dee9 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -5,7 +5,7 @@ What you will learn from this chapter: - What are diffusion models and how do they differ from GANs - Major sub categories of diffusion models - Use cases of diffusion models -- Drawback in Diffusion models. 
+- Drawback in diffusion models ## Diffusion Models and their Difference from GANs From c8e9a55888273cd0a064c944d7b8e81498ae5d29 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:05:13 +0900 Subject: [PATCH 32/52] Update chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx Co-authored-by: Merve Noyan --- .../generative-models/diffusion-models/stable-diffusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx index 7e353a374..e3d32d0ed 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx @@ -3,7 +3,7 @@ This chapter introduces the building blocks of Stable Diffusion which is a gener [Stability AI](https://stability.ai/), [RunwayML](https://runwayml.com/) and CompVis Group at LMU Munich following the [paper](https://arxiv.org/pdf/2112.10752.pdf). What will you learn from this chapter? -- Fundamental components of Stable Diffusion. +- Fundamental components of Stable Diffusion - How to use `text-to-image`, `image2image`, inpainting pipelines. ## What Do We Need for Stable Diffusion to Work? From 279c80805745810be9b6937da8b84a0ac0fa7415 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:06:07 +0900 Subject: [PATCH 33/52] Update chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx Co-authored-by: Merve Noyan --- .../generative-models/diffusion-models/stable-diffusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx index e3d32d0ed..1016015f4 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx @@ -4,7 +4,7 @@ This chapter introduces the building blocks of Stable Diffusion which is a gener What will you learn from this chapter? - Fundamental components of Stable Diffusion -- How to use `text-to-image`, `image2image`, inpainting pipelines. +- How to use `text-to-image`, `image2image`, inpainting pipelines ## What Do We Need for Stable Diffusion to Work? To make this section interesting we will try to answer some questions to understand the basic components of the Stable Diffusion process. From 9e4eb7b08a679cfcc22cd8c388a2737106fa8a75 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:07:02 +0900 Subject: [PATCH 34/52] Update chapters/en/unit5/generative-models/introduction/introduction.mdx Co-authored-by: Merve Noyan --- .../en/unit5/generative-models/introduction/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/introduction/introduction.mdx b/chapters/en/unit5/generative-models/introduction/introduction.mdx index 870f3a9ec..16b0f8bd5 100644 --- a/chapters/en/unit5/generative-models/introduction/introduction.mdx +++ b/chapters/en/unit5/generative-models/introduction/introduction.mdx @@ -15,7 +15,7 @@ these tasks can be expanded into more complex processes such as semantic segment For the sake of brevity, in this chapter, we will consider generative models that solve these tasks: -* noise to image (DCGAN). 
+* noise to image (DCGAN) * text to image (diffusion models). * image to image (StyleGAN, cycleGAN, diffusion models). From 034d3fb24b8a55d0f6a60bb2d724201e2fe2900c Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:07:25 +0900 Subject: [PATCH 35/52] Update chapters/en/unit5/generative-models/introduction/introduction.mdx Co-authored-by: Merve Noyan --- .../en/unit5/generative-models/introduction/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/introduction/introduction.mdx b/chapters/en/unit5/generative-models/introduction/introduction.mdx index 16b0f8bd5..17a2d2b32 100644 --- a/chapters/en/unit5/generative-models/introduction/introduction.mdx +++ b/chapters/en/unit5/generative-models/introduction/introduction.mdx @@ -16,7 +16,7 @@ these tasks can be expanded into more complex processes such as semantic segment For the sake of brevity, in this chapter, we will consider generative models that solve these tasks: * noise to image (DCGAN) -* text to image (diffusion models). +* text to image (diffusion models) * image to image (StyleGAN, cycleGAN, diffusion models). This section will cover 2 kinds of generative models. GAN-based models, and diffusion-based models. From 2937fbcbcfe4a6741aa97d872d7c10e17fdd60f3 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:07:48 +0900 Subject: [PATCH 36/52] Update chapters/en/unit5/generative-models/introduction/introduction.mdx Co-authored-by: Merve Noyan --- .../en/unit5/generative-models/introduction/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/introduction/introduction.mdx b/chapters/en/unit5/generative-models/introduction/introduction.mdx index 17a2d2b32..ad5dbe5a4 100644 --- a/chapters/en/unit5/generative-models/introduction/introduction.mdx +++ b/chapters/en/unit5/generative-models/introduction/introduction.mdx @@ -17,7 +17,7 @@ For the sake of brevity, in this chapter, we will consider generative models tha * noise to image (DCGAN) * text to image (diffusion models) -* image to image (StyleGAN, cycleGAN, diffusion models). +* image to image (StyleGAN, cycleGAN, diffusion models) This section will cover 2 kinds of generative models. GAN-based models, and diffusion-based models. From df87c303a321c31b65f5cb5b6e590f6d9fb7e1cf Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:08:17 +0900 Subject: [PATCH 37/52] Update chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx Co-authored-by: Merve Noyan --- .../generative-models/practical-applications/ethical-issues.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx index b318b2ee4..0ae477d0a 100644 --- a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx +++ b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx @@ -6,7 +6,7 @@ What you will learn from this chapter: - Impact of such AI images/videos on society. - Current approaches to tackle the issues. -- Future scope. 
+- Future scope ## Impact on Society From c134cb8e9eebe89a8af4ca91c9ded3c9171247d1 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:08:39 +0900 Subject: [PATCH 38/52] Update chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx Co-authored-by: Merve Noyan --- .../generative-models/practical-applications/ethical-issues.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx index 0ae477d0a..bd2298086 100644 --- a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx +++ b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx @@ -5,7 +5,7 @@ The widespread adoption of AI-powered image editing tools raises significant con What you will learn from this chapter: - Impact of such AI images/videos on society. -- Current approaches to tackle the issues. +- Current approaches to tackle the issues - Future scope ## Impact on Society From 8d927af2205515bdb9420e6204f1297185e995bf Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:09:11 +0900 Subject: [PATCH 39/52] Update chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx Co-authored-by: Merve Noyan --- .../generative-models/practical-applications/ethical-issues.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx index bd2298086..d2a976022 100644 --- a/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx +++ b/chapters/en/unit5/generative-models/practical-applications/ethical-issues.mdx @@ -4,7 +4,7 @@ The widespread adoption of AI-powered image editing tools raises significant con What you will learn from this chapter: -- Impact of such AI images/videos on society. +- Impact of such AI images/videos on society - Current approaches to tackle the issues - Future scope From 16a17a0086919632b596be0869383877801ff5d5 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:09:55 +0900 Subject: [PATCH 40/52] Update chapters/en/unit8/3d_measurements_stereo_vision.mdx Co-authored-by: Merve Noyan --- chapters/en/unit8/3d_measurements_stereo_vision.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index 584aae2e4..ac8c0a8f2 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -16,7 +16,7 @@ We aim to solve the problem of determining the 3D structure of objects. In our p ## Solution Let's assume we are given the following information: -1. Single image of a scene point P. +1. Single image of a scene point P 2. Pixel coordinates of point P in the image. 3. Position and orientation of the camera used to capture the image. For simplicity, we can also place an XYZ coordinate system at the location of the pinhole, with the z-axis perpendicular to the image place and the x-axis, and y-axis parallel to the image plane like in Figure 1. 4. Internal parameters of the camera, such as focal length and location of principal point. The principal point is where the optical axis intersects the image plane. 
Its location in the image plane is usually denoted as (Ox,Oy). From f74dce23df015135a356e60d950fd7a475b7f377 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:10:31 +0900 Subject: [PATCH 41/52] Update chapters/en/unit8/3d_measurements_stereo_vision.mdx Co-authored-by: Merve Noyan --- chapters/en/unit8/3d_measurements_stereo_vision.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index ac8c0a8f2..dcf1d727b 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -17,7 +17,7 @@ We aim to solve the problem of determining the 3D structure of objects. In our p Let's assume we are given the following information: 1. Single image of a scene point P -2. Pixel coordinates of point P in the image. +2. Pixel coordinates of point P in the image 3. Position and orientation of the camera used to capture the image. For simplicity, we can also place an XYZ coordinate system at the location of the pinhole, with the z-axis perpendicular to the image place and the x-axis, and y-axis parallel to the image plane like in Figure 1. 4. Internal parameters of the camera, such as focal length and location of principal point. The principal point is where the optical axis intersects the image plane. Its location in the image plane is usually denoted as (Ox,Oy). From 99450b6272f2364af7135c73f45c8abe996a80da Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:11:30 +0900 Subject: [PATCH 42/52] Update chapters/en/unit8/3d_measurements_stereo_vision.mdx Co-authored-by: Merve Noyan --- chapters/en/unit8/3d_measurements_stereo_vision.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index dcf1d727b..0c4ec7a8d 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -73,7 +73,7 @@ Different symbols used in above equations are defined below: * \\(u\_left\\), \\(v\_left\\) refer to pixel coordinates of point P in the left image. * \\(u\_right\\), \\(v\_right\\) refer to pixel coordinates of point P in the right image. * \\(f\_x\\) refers to the focal length (in pixels) in x direction and \\(f\_y\\) refers to the focal length (in pixels) in y direction. Actually, there is only 1 focal length for a camera which is the distance between the pinhole (optical center of the lens) to the image plane. However, pixels may be rectangular and not perfect squares, resulting in different fx and fy values when we represent f in terms of pixels. -* x,y,z are 3D coordinates of the point P (any unit like cm, feet, etc can be used). +* x, y, z are 3D coordinates of the point P (any unit like cm, feet, etc can be used). * \\(O\_x\\) and \\(O\_y\\) refer to pixel coordinates of the principal point. * b is called the baseline and refers to the distance between the left and right cameras. Same units are used for both b and x,y,z coordinates (any unit like cm, feet, etc can be used). 
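The hunk above touches the chapter's stereo-triangulation symbols (u_left, u_right, v_left, f_x, f_y, O_x, O_y, and the baseline b), so a small numerical sketch of how they combine may be useful alongside the patch. This is a minimal illustration assuming a rectified stereo pair and the standard pinhole relations z = f_x * b / (u_left - u_right), x = z * (u_left - O_x) / f_x, y = z * (v_left - O_y) / f_y; the pixel coordinates, focal lengths, and baseline used below are made-up values, not numbers taken from the chapter.

```python
# Hypothetical illustration of rectified-stereo triangulation using the
# symbols from the hunk above. All numeric values are made up.

def triangulate(u_left, v_left, u_right, f_x, f_y, O_x, O_y, b):
    """Recover (x, y, z) of a scene point P from a rectified stereo pair.

    Assumes the standard pinhole relations and a horizontal baseline b,
    so the disparity is simply u_left - u_right.
    """
    disparity = u_left - u_right
    z = f_x * b / disparity        # depth along the optical axis
    x = z * (u_left - O_x) / f_x   # from u_left = f_x * x / z + O_x
    y = z * (v_left - O_y) / f_y   # from v_left = f_y * y / z + O_y
    return x, y, z


# Example: 700 px focal lengths, principal point at (320, 240), 10 cm baseline.
print(triangulate(u_left=400, v_left=260, u_right=330,
                  f_x=700, f_y=700, O_x=320, O_y=240, b=10.0))
# disparity = 70 px -> z = 700 * 10 / 70 = 100 cm, x ~ 11.4 cm, y ~ 2.9 cm
```

In this made-up example a 70 px disparity puts the point about 100 cm in front of the cameras, illustrating the inverse relationship between disparity and depth.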
From 1c5551460a7c2ae101e9731b9597ad43d11a8978 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:13:29 +0900 Subject: [PATCH 43/52] Update chapters/en/unit8/introduction/brief_history.mdx Co-authored-by: Merve Noyan --- chapters/en/unit8/introduction/brief_history.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/introduction/brief_history.mdx b/chapters/en/unit8/introduction/brief_history.mdx index 39876dabc..2fd3fa880 100644 --- a/chapters/en/unit8/introduction/brief_history.mdx +++ b/chapters/en/unit8/introduction/brief_history.mdx @@ -7,7 +7,7 @@ ## 1853: Anaglyph 3D -- **Pioneer**: Louis Ducos du Hauron. +- **Pioneer**: Louis Ducos du Hauron - **Method**: Using glasses with colored filters to separate images in complementary colors, creating a depth illusion. ## 1936: Polarized 3D From 39299fa1bd2367e97e525f64f3d082c33eaea755 Mon Sep 17 00:00:00 2001 From: Sergio Paniego Blanco Date: Wed, 15 May 2024 10:14:11 +0900 Subject: [PATCH 44/52] Update chapters/en/unit8/introduction/brief_history.mdx Co-authored-by: Merve Noyan --- chapters/en/unit8/introduction/brief_history.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/introduction/brief_history.mdx b/chapters/en/unit8/introduction/brief_history.mdx index 2fd3fa880..6d9aa63ab 100644 --- a/chapters/en/unit8/introduction/brief_history.mdx +++ b/chapters/en/unit8/introduction/brief_history.mdx @@ -22,7 +22,7 @@ ## 1979: Autostereograms (Magic Eye Images) -- **Creator**: Christopher Tyler. +- **Creator**: Christopher Tyler - **Concept**: 2D patterns that allow viewers to see 3D images without special glasses. ## 1986: IMAX 3D From 7e8bbc098cd932ef6bf129770f49a24f8779f995 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 13:13:25 +0900 Subject: [PATCH 45/52] Modified pip install spacing --- chapters/en/unit10/point_clouds.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit10/point_clouds.mdx b/chapters/en/unit10/point_clouds.mdx index 74c0558b1..6a470c3f7 100644 --- a/chapters/en/unit10/point_clouds.mdx +++ b/chapters/en/unit10/point_clouds.mdx @@ -31,13 +31,13 @@ pip install point-cloud-utils We will be also using the python library open-3d, which can be installed by: ```bash - pip install open3d +pip install open3d ``` OR a Smaller CPU only version: ```bash - pip install open3d-cpu +pip install open3d-cpu ``` Now, first we need to understand the formats in which these point clouds are stored in, and for that, we need to look at mesh cloud. 
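The hunk above only re-indents the `pip install open3d` command, but a brief sketch of what the installed library is then used for may help readers who jump straight to this patch. This is a minimal example, assuming a local point cloud file named `example.ply`; the filename is a placeholder, not a file shipped with the course.

```python
import open3d as o3d

# Load a point cloud from disk; "example.ply" is a placeholder path and
# any .ply/.pcd file containing point data will work.
pcd = o3d.io.read_point_cloud("example.ply")
print(pcd)  # e.g. "PointCloud with N points."

# Downsample with a voxel grid and estimate normals before visualising.
pcd = pcd.voxel_down_sample(voxel_size=0.02)
pcd.estimate_normals()
o3d.visualization.draw_geometries([pcd])
```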
From 71233b6081b472b1c43d6999e93617d07c5e8076 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 13:17:12 +0900 Subject: [PATCH 46/52] Extended spacing update of pip install instructions across units --- chapters/en/unit10/blenderProc.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit10/blenderProc.mdx b/chapters/en/unit10/blenderProc.mdx index d907609f9..8ff6d2a8c 100644 --- a/chapters/en/unit10/blenderProc.mdx +++ b/chapters/en/unit10/blenderProc.mdx @@ -98,7 +98,7 @@ It is specifically created to help in the generation of realistic looking images You can install BlenderProc via pip: ```bash - pip install blenderProc +pip install blenderProc ``` Alternately, you can clone the official [BlenderProc repository](https://github.com/DLR-RM/BlenderProc) from GitHub using Git: From 6107b34a71be65ec88d05169402c2e53d4a0ce71 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 13:22:25 +0900 Subject: [PATCH 47/52] Removed punctuations in table following standards --- chapters/en/unit10/synthetic_datasets.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit10/synthetic_datasets.mdx b/chapters/en/unit10/synthetic_datasets.mdx index 9ba820f87..1d912689c 100644 --- a/chapters/en/unit10/synthetic_datasets.mdx +++ b/chapters/en/unit10/synthetic_datasets.mdx @@ -55,8 +55,8 @@ Navigating indoor environments can be challenging due to their complexity. These | Name | Year | Description | Paper | Additional Links | |--------------|--------------|-------------|----------------|--------------| |Habitat | 2023 | An Embodied AI simulation platform for studying collaborative human-robot interaction tasks in home environments. | [HABITAT 3.0: A CO-HABITAT FOR HUMANS, AVATARS AND ROBOTS](https://ai.meta.com/static-resource/habitat3) | [Website](https://aihabitat.org/habitat3/) | -| Minos | 2017 | Multimodal Indoor Simulator. | [MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments](https://arxiv.org/pdf/1712.03931.pdf) | [GitHub](https://github.com/minosworld/minos) ![GitHub stars](https://img.shields.io/github/stars/minosworld/minos.svg?style=social&label=Star) | -| House3D | 2017 (archived in 2021) | A Rich and Realistic 3D Environment. 
| [Building generalisable agents with a realistic and rich 3D environment](https://arxiv.org/pdf/1801.02209v2.pdf) | [GitHub](https://github.com/facebookresearch/House3D) ![GitHub stars](https://img.shields.io/github/stars/facebookresearch/House3D.svg?style=social&label=Star) | +| Minos | 2017 | Multimodal Indoor Simulator | [MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments](https://arxiv.org/pdf/1712.03931.pdf) | [GitHub](https://github.com/minosworld/minos) ![GitHub stars](https://img.shields.io/github/stars/minosworld/minos.svg?style=social&label=Star) | +| House3D | 2017 (archived in 2021) | A Rich and Realistic 3D Environment | [Building generalisable agents with a realistic and rich 3D environment](https://arxiv.org/pdf/1801.02209v2.pdf) | [GitHub](https://github.com/facebookresearch/House3D) ![GitHub stars](https://img.shields.io/github/stars/facebookresearch/House3D.svg?style=social&label=Star) | ### Human Action Recognition and Simulation From c652dfb9499ac4adc682dcd081810480d9e5ea9f Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 13:25:19 +0900 Subject: [PATCH 48/52] Removed unneeded new lines --- chapters/en/unit2/cnns/convnext.mdx | 3 --- 1 file changed, 3 deletions(-) diff --git a/chapters/en/unit2/cnns/convnext.mdx b/chapters/en/unit2/cnns/convnext.mdx index 03d36dfee..4d67f0cc9 100644 --- a/chapters/en/unit2/cnns/convnext.mdx +++ b/chapters/en/unit2/cnns/convnext.mdx @@ -20,7 +20,6 @@ We will go through each of the key improvements. These designs are not novel in itself. However, you can learn how researchers adapt and modify designs systematically to improve existing models. To show the effectiveness of each improvement, we will compare the model's accuracy before and after the modification on ImageNet-1K. - ![Block Comparison](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/block_comparison.png) @@ -68,7 +67,6 @@ This idea has also been used and popularized in Computer Vision by MobileNetV2. ConvNext adopts this idea, having input layers with 96 channels and increasing the hidden layers to 384 channels. By using this technique, it improves the model accuracy from 80.5% to 80.6%. - ![Inverted Bottleneck Comparison](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/inverted_bottleneck.png) @@ -80,7 +78,6 @@ This repositioning enables the 1x1 layers to efficiently handle computational ta With this, the network can harness the advantages of incorporating bigger kernel-sized convolutions. Implementing a 7x7 kernel size maintains the accuracy at 80.6% but reduces the overall FLOPs efficiency of the model. 
- ![Moving up the Depth Conv Layer](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/depthwise_moveup.png) From 1c20a24300d48296ee7f3a0c2ba2124593179cc2 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Wed, 15 May 2024 13:31:38 +0900 Subject: [PATCH 49/52] Updated some broken links pointing to the course --- .../vision-transformers-for-image-segmentation.mdx | 2 +- .../generative-models/diffusion-models/stable-diffusion.mdx | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx b/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx index d4cd6c662..d807fb5ac 100644 --- a/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx +++ b/chapters/en/unit3/vision-transformers/vision-transformers-for-image-segmentation.mdx @@ -40,7 +40,7 @@ The architecture is composed of three components: **Segmentation Module**: Generates class probability predictions and mask embeddings for each segment using a linear classifier and a Multi-Layer Perceptron (MLP), respectively. The mask embeddings are used in combination with per-pixel embeddings to predict binary masks for each segment. -The model is trained with a binary mask loss, the same one as [DETR](https://github.com/johko/computer-vision-course/blob/9ad9b01f2383377ac9482dcbe02c91465b573b0b/chapters/en/Unit%203%20-%20Vision%20Transformers/Common%20Vision%20Transformers%20-%20DETR.mdx), and a cross-entropy classification loss per predicted segment. +The model is trained with a binary mask loss, the same one as [DETR](https://huggingface.co/learn/computer-vision-course/unit3/vision-transformers/detr), and a cross-entropy classification loss per predicted segment. ### Panoptic Segmentation Inference Example with Hugging Face Transformers diff --git a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx index 1016015f4..f7c783070 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/stable-diffusion.mdx @@ -25,7 +25,7 @@ Latent diffusion models address the high computational demands of processing lar - How are we fusing texts with images since we are using prompts? We know that during inference time, we can feed in the description of an image we'd like to see and some pure noise as a starting point, and the model does its best to 'denoise' the random input into something that matches the caption. -SD leverages a pre-trained transformer model based on something called [CLIP](https://github.com/johko/computer-vision-course/blob/main/chapters/en/Unit%204%20-%20Mulitmodal%20Models/CLIP%20and%20relatives/clip.mdx). CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt. 
+SD leverages a pre-trained transformer model based on something called [CLIP](https://huggingface.co/learn/computer-vision-course/unit4/multimodal-models/clip-and-relatives/clip). CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt. - How can we add-in good inductive biases? From a39c49df09c49e51859d4222e42fbc0c289b53c0 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Thu, 16 May 2024 09:22:25 +0900 Subject: [PATCH 50/52] Added new line to separate sentences --- chapters/en/unit8/3d-vision/nvs.mdx | 1 + 1 file changed, 1 insertion(+) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index be9cbff8a..f3112f680 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -93,6 +93,7 @@ To explore this model further, see the [Live Demo](https://huggingface.co/spaces ### Related methods [3DiM](https://3d-diffusion.github.io/) - X-UNet architecture, with cross-attention between input and noisy frames. + [Zero123-XL](https://arxiv.org/pdf/2311.13617.pdf) - Trained on the larger objaverseXL dataset. See also [Stable Zero 123](https://huggingface.co/stabilityai/stable-zero123). [Zero123++](https://arxiv.org/abs/2310.15110) - Generates 6 new fixed views, at fixed relative positions to the input view, with reference attention between input and generated images. From 3f73731863e296da301cde10142b5edbfb617fd2 Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Thu, 16 May 2024 09:26:49 +0900 Subject: [PATCH 51/52] Removed unneeded puntuaction --- chapters/en/unit8/3d_measurements_stereo_vision.mdx | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index 0c4ec7a8d..75a40bc61 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -32,9 +32,9 @@ With the information provided above, we can find a 3D line that originates from Given 2 lines in 3D, there are are three possibilities for their intersection: -1. Intersect at exactly 1 point. -2. Intersect at infinite number of points. -3. Do not intersect. +1. Intersect at exactly 1 point +2. Intersect at infinite number of points +3. Do not intersect If both images (with original and new camera positions) contain point P, we can conclude that the 3D lines must intersect at least once and that the intersection point is point P. Furthermore, we can envision infinite points where both lines intersect only if the two lines are collinear. This is achievable if the pinhole at the new camera position lies somewhere on the original 3D line. For all other positions and orientations of the new camera location, the two 3D lines must intersect precisely at one point, where point P lies. 
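Returning to the CLIP text-conditioning paragraph whose link was fixed in the stable-diffusion patch above, a short sketch can make the 77-token conditioning tensor concrete. It assumes the 🤗 Transformers implementation of the SD 1.x text encoder, `openai/clip-vit-large-patch14`; that checkpoint choice and the example prompt are illustrative assumptions, not something specified by the patch.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# SD 1.x-style text encoder; chosen here purely for illustration.
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompt = "a watercolor painting of a lighthouse at sunset"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    conditioning = text_encoder(**tokens).last_hidden_state

print(conditioning.shape)  # torch.Size([1, 77, 768]) for SD 1.x-style encoders
```

For SD 2.x the analogous OpenCLIP text encoder has a hidden size of 1024, which is where the 77x1024 conditioning shape mentioned in the patched sentence comes from.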
From 1c6923355c2e94c7a5b69478a7858ff9a1f78a7d Mon Sep 17 00:00:00 2001 From: sergiopaniego Date: Thu, 16 May 2024 09:36:25 +0900 Subject: [PATCH 52/52] Removed unneeded punctuation --- chapters/en/unit4/multimodal-models/vlm-intro.mdx | 10 +++++----- .../diffusion-models/introduction.mdx | 6 +++--- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/chapters/en/unit4/multimodal-models/vlm-intro.mdx b/chapters/en/unit4/multimodal-models/vlm-intro.mdx index db4ca5fc4..c3c00b1e2 100644 --- a/chapters/en/unit4/multimodal-models/vlm-intro.mdx +++ b/chapters/en/unit4/multimodal-models/vlm-intro.mdx @@ -1,11 +1,11 @@ # Introduction to Vision Language Models What will you learn from this chapter: -- A brief introduction to multimodality. -- Introduction to Vision Language Models. -- Various learning strategies. -- Common datasets used for VLMs. -- Downstream tasks and evaluation. +- A brief introduction to multimodality +- Introduction to Vision Language Models +- Various learning strategies +- Common datasets used for VLMs +- Downstream tasks and evaluation ## Our World is Multimodal Humans explore the world through diverse senses: sight, sound, touch, and scent. A complete grasp of our surroundings emerges by harmonizing insights from these varied modalities. diff --git a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx index a5df2dee9..f53b7a5be 100644 --- a/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx +++ b/chapters/en/unit5/generative-models/diffusion-models/introduction.mdx @@ -52,9 +52,9 @@ Diffusion is used in a variety of tasks including, but not limited to: - Image editing - Editing specific/entire part of the image without losing its visual identity. - Image-to-image translation - This includes changing background, attributes of the location etc. - Learned Latent representation from diffusion models can also be used for. - - Image segmentation. - - Classification. - - Anomaly detection. + - Image segmentation + - Classification + - Anomaly detection Want to play with diffusion models? No worries, Hugging Face's [Diffusers](https://huggingface.co/docs/diffusers/index) library comes to rescue. You can use almost all recent diffusion SOTA models for almost any task.
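To close the section, the Diffusers pointer in the final hunk can be paired with a minimal text-to-image sketch. It assumes a CUDA-capable GPU and uses `stabilityai/stable-diffusion-2-1` purely as an example checkpoint; any text-to-image pipeline on the Hub could be substituted, and the prompt is made up.

```python
import torch
from diffusers import DiffusionPipeline

# Example checkpoint; any text-to-image pipeline on the Hub can be swapped in.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

prompt = "an astronaut riding a horse on the moon, digital art"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("astronaut.png")
```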