
First Batch for Better Transitions #250

Merged 5 commits on Apr 22, 2024
84 changes: 41 additions & 43 deletions chapters/en/Unit 0 - Welcome/TableOfContents.mdx

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion chapters/en/Unit 0 - Welcome/welcome.mdx
@@ -132,7 +132,6 @@ Our goal was to create a computer vision course that is beginner-friendly and th
- Reviewers: [Ratan Prasad](https://github.com/ratan), [William Bonvini](https://github.com/WilliamBonvini), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace0)
- Writers: [John Fozard](https://github.com/jfozard), [Vasu Gupta](https://github.com/vasugupta9), [Psetinek](https://github.com/psetinek)


**Unit 9 - Model Optimization**

- Reviewers: [Ratan Prasad](https://github.com/ratan), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace)
@@ -158,4 +157,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
- Reviewers: [Ratan Prasad](https://github.com/ratan), [Ameed Taylor](https://github.com/atayloraerospace), [Mohammed Hamdy](https://github.com/mmhamdy)
- Writers: [Farros Alferro](https://github.com/farrosalferro), [Mohammed Hamdy](https://github.com/mmhamdy), [Louis Ulmer](https://github.com/lulmer), [Dario Wisznewer](https://github.com/dariowsz), [gonzachiar](https://github.com/gonzachiar)

**Organisation Team**
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)

Suggested change
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)
We'd like to thank [Maria Khalusova](https://huggingface.co/MariaK) for her thorough reviews.


We are happy to have you here. Let's get started!
@@ -1,5 +1,11 @@
# Introduction to Convolutional Neural Networks

In the last unit, we learned about the fundamentals of vision, images, and Computer Vision. We also explored visual features as a crucial part of analyzing images with the help of computers.

The approaches we discussed are today often referred to as "classical" Computer Vision. While they work well on many small, constrained datasets and settings, classical methods have limits that come to light on larger, real-world datasets.

In this unit, we will learn about Convolutional Neural Networks, an important step forward in the scale and performance of Computer Vision.

## Convolution: Basic Ideas

Convolution is an operation used to extract features from data. The data can be 1D, 2D, or 3D. We'll explain the operation with a concrete example. All you need to know for now is that the operation simply takes a matrix made of numbers, moves it over the data, and takes the sum of products between the data and that matrix at each position. This matrix is called a kernel or filter. You might ask, "What does this have to do with feature extraction, and how am I supposed to apply it?"
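To make the sliding sum-of-products concrete, here is a minimal NumPy sketch; the toy image and kernel values are made up purely for illustration:

```python
import numpy as np

def convolve2d(data: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `data`, taking the sum of products at each position.
    (Strictly speaking this is cross-correlation; true convolution flips the
    kernel first, but deep learning libraries usually do it this way.)"""
    kh, kw = kernel.shape
    out_h = data.shape[0] - kh + 1
    out_w = data.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the kernel with the current patch, summed.
            out[i, j] = np.sum(data[i : i + kh, j : j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)  # toy 5x5 "image"
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])  # simple vertical-edge filter
print(convolve2d(image, edge_kernel))  # 3x3 feature map
```

Note how the output is smaller than the input: a 3x3 kernel over a 5x5 image yields a 3x3 feature map, since the kernel only visits positions where it fits entirely inside the image.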
@@ -1,6 +1,6 @@
# Introduction
# CLIP and Relatives

This section provides an overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks.
So far we have learned about the fundamentals of multimodality, with a special spotlight on Vision Language Models. This chapter provides a short overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks.
It sets the stage for a high-level exploration of key multimodal models that have emerged before and after CLIP, showcasing their significant contributions to the advancement of multimodal AI.

## Pre-CLIP
@@ -9,9 +9,11 @@ In this part, we explore the innovative attempts in multimodal AI before CLIP.
The focus is on influential papers that used deep learning to make significant strides in the field:

1. **"Multimodal Deep Learning" by Ngiam et al. (2011):** This paper demonstrated the use of deep learning for multimodal inputs, emphasizing the potential of neural networks in integrating different data types. It laid the groundwork for future innovations in multimodal AI.

- [Multimodal Deep Learning](https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf)

2. **"Deep Visual-Semantic Alignments for Generating Image Descriptions" by Karpathy and Fei-Fei (2015):** This study presented a method for aligning textual data with specific image regions, enhancing the interpretability of multimodal systems and advancing the understanding of complex visual-textual relationships.

- [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)

3. **"Show and Tell: A Neural Image Caption Generator" by Vinyals et al. (2015):** This paper marked a significant step in practical multimodal AI by showing how CNNs and RNNs could be combined to transform visual information into descriptive language.
@@ -22,12 +24,15 @@ The focus is on influential papers that used deep learning to make significant s
The emergence of CLIP brought new dimensions to multimodal models, as illustrated by the following developments:

1. **CLIP:** OpenAI's CLIP was a game-changer, learning from a vast array of internet text-image pairs and enabling zero-shot learning, contrasting with earlier models (see the sketch after this list).

- [CLIP](https://openai.com/blog/clip/)

2. **GroupViT:** Innovating in segmentation and semantic understanding, GroupViT combined these aspects with language, showing advanced integration of language and vision.

- [GroupViT](https://arxiv.org/abs/2202.11094)

3. **BLIP:** BLIP introduced bidirectional learning between vision and language, pushing the boundaries for generating text from visual inputs.

- [BLIP](https://arxiv.org/abs/2201.12086)

4. **OWL-ViT:** Focusing on object-centric representations, OWL-ViT advanced the understanding of objects within images in context with text.
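
To make CLIP's zero-shot classification concrete, here is a minimal sketch using the 🤗 Transformers implementation of CLIP. The checkpoint name, image URL, and candidate labels are illustrative choices, not part of the original text:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image will do; this COCO URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels the model was never explicitly trained to classify.
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

Because CLIP scores the image against arbitrary text prompts, swapping in a new label list requires no retraining, which is precisely what sets it apart from the pre-CLIP models above.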
@@ -41,4 +46,4 @@ These developments highlight the evolving methods of processing multimodal data
The upcoming sections will delve into the "Losses" aspect, focusing on various loss functions and self-supervised learning crucial for training multimodal models.
The "Models" section will provide a deeper understanding of CLIP and its variants, exploring their designs and functionalities.
Finally, the "Practical Notebooks" section will offer hands-on experience, addressing challenges like data bias and applying these models in tasks such as image search engines and visual question answering systems.
These sections aim to deepen your knowledge and practical skills in the multifaceted world of multimodal AI.