
First Batch for Better Transitions #250

Merged 5 commits on Apr 22, 2024
84 changes: 41 additions & 43 deletions chapters/en/Unit 0 - Welcome/TableOfContents.mdx

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion chapters/en/Unit 0 - Welcome/welcome.mdx
@@ -132,7 +132,6 @@ Our goal was to create a computer vision course that is beginner-friendly and th
- Reviewers: [Ratan Prasad](https://github.com/ratan), [William Bonvini](https://github.com/WilliamBonvini), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace0)
- Writers: [John Fozard](https://github.com/jfozard), [Vasu Gupta](https://github.com/vasugupta9), [Psetinek](https://github.com/psetinek)


**Unit 9 - Model Optimization**

- Reviewers: [Ratan Prasad](https://github.com/ratan), [Mohammed Hamdy](https://github.com/mmhamdy), [Adhi Setiawan](https://github.com/adhiiisetiawan), [Ameed Taylor](https://github.com/atayloraerospace)
@@ -158,4 +157,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th
- Reviewers: [Ratan Prasad](https://github.com/ratan), [Ameed Taylor](https://github.com/atayloraerospace), [Mohammed Hamdy](https://github.com/mmhamdy)
- Writers: [Farros Alferro](https://github.com/farrosalferro), [Mohammed Hamdy](https://github.com/mmhamdy), [Louis Ulmer](https://github.com/lulmer), [Dario Wisznewer](https://github.com/dariowsz), [gonzachiar](https://github.com/gonzachiar)

**Organisation Team**
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)

Suggested change
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)
[Merve Noyan](https://github.com/merveenoyan), [Adam Molnar](https://github.com/lunarflu), [Johannes Kolbe](https://github.com/johko)
We'd like to thank [Maria Khalusova](https://huggingface.co/MariaK) for her thorough reviews.


We are happy to have you here. Let's get started!
@@ -1,5 +1,11 @@
# Introduction to Convolutional Neural Networks

In the last unit, we learned about the fundamentals of vision, images, and Computer Vision. We also explored visual features as a crucial part of analyzing images with the help of computers.

The approaches we discussed are today often referred to as "classical" Computer Vision. While they work well on many small, constrained datasets and settings, classical methods have limits that come to light on larger, real-world datasets.

In this unit, we will learn about Convolutional Neural Networks, an important step forward in the scale and performance of Computer Vision.

## Convolution: Basic Ideas

Convolution is an operation used to extract features from data. The data can be 1D, 2D, or 3D. We'll explain the operation with a concrete example. All you need to know for now is that the operation simply takes a matrix made of numbers, moves it over the data, and takes the sum of products between the data and that matrix at each position. This matrix is called a kernel or filter. You might ask, "What does this have to do with feature extraction, and how am I supposed to apply it?"
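To make the sliding sum-of-products concrete, here is a minimal NumPy sketch; the toy image and kernel values are made up purely for illustration:

```python
import numpy as np

def convolve2d(data: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `data`, taking the sum of products at each position.
    (Strictly speaking this is cross-correlation; true convolution flips the
    kernel first, but deep learning libraries usually do it this way.)"""
    kh, kw = kernel.shape
    out_h = data.shape[0] - kh + 1
    out_w = data.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the kernel with the current patch, summed.
            out[i, j] = np.sum(data[i : i + kh, j : j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)  # toy 5x5 "image"
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])  # simple vertical-edge filter
print(convolve2d(image, edge_kernel))  # 3x3 feature map
```

Note how the output is smaller than the input: a 3x3 kernel over a 5x5 image yields a 3x3 feature map, since the kernel only visits positions where it fits entirely inside the image.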
@@ -1,6 +1,6 @@
# Introduction
# CLIP and Relatives

This section provides an overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks.
So far we have learned about the fundamentals of multimodality, with a special spotlight on Vision Language Models. This chapter provides a short overview of CLIP and similar models, highlighting their unique features and applicability to various machine learning tasks.
It sets the stage for a high-level exploration of key multimodal models that have emerged before and after CLIP, showcasing their significant contributions to the advancement of multimodal AI.

## Pre-CLIP
@@ -9,9 +9,11 @@ In this part, we explore the innovative attempts in multimodal AI before CLIP.
The focus is on influential papers that used deep learning to make significant strides in the field:

1. **"Multimodal Deep Learning" by Ngiam et al. (2011):** This paper demonstrated the use of deep learning for multimodal inputs, emphasizing the potential of neural networks in integrating different data types. It laid the groundwork for future innovations in multimodal AI.

- [Multimodal Deep Learning](https://people.csail.mit.edu/khosla/papers/icml2011_ngiam.pdf)

2. **"Deep Visual-Semantic Alignments for Generating Image Descriptions" by Karpathy and Fei-Fei (2015):** This study presented a method for aligning textual data with specific image regions, enhancing the interpretability of multimodal systems and advancing the understanding of complex visual-textual relationships.

- [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/cvpr2015.pdf)

3. **"Show and Tell: A Neural Image Caption Generator" by Vinyals et al. (2015):** This paper marked a significant step in practical multimodal AI by showing how CNNs and RNNs could be combined to transform visual information into descriptive language.
@@ -22,12 +24,15 @@ The focus is on influential papers that used deep learning to make significant s
The emergence of CLIP brought new dimensions to multimodal models, as illustrated by the following developments:

1. **CLIP:** OpenAI's CLIP was a game-changer, learning from a vast array of internet text-image pairs and enabling zero-shot learning, contrasting with earlier models (see the sketch after this list).

- [CLIP](https://openai.com/blog/clip/)

2. **GroupViT:** Innovating in segmentation and semantic understanding, GroupViT combined these aspects with language, showing advanced integration of language and vision.

- [GroupViT](https://arxiv.org/abs/2202.11094)

3. **BLIP:** BLIP introduced bidirectional learning between vision and language, pushing the boundaries for generating text from visual inputs.

- [BLIP](https://arxiv.org/abs/2201.12086)

4. **OWL-ViT:** Focusing on object-centric representations, OWL-ViT advanced the understanding of objects within images in context with text.
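
To make CLIP's zero-shot classification concrete, here is a minimal sketch using the 🤗 Transformers implementation of CLIP. The checkpoint name, image URL, and candidate labels are illustrative choices, not part of the original text:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image will do; this COCO URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels the model was never explicitly trained to classify.
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

Because CLIP scores the image against arbitrary text prompts, swapping in a new label list requires no retraining, which is precisely what sets it apart from the pre-CLIP models above.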
@@ -41,4 +46,4 @@ These developments highlight the evolving methods of processing multimodal data
The upcoming sections will delve into the "Losses" aspect, focusing on various loss functions and self-supervised learning crucial for training multimodal models.
The "Models" section will provide a deeper understanding of CLIP and its variants, exploring their designs and functionalities.
Finally, the "Practical Notebooks" section will offer hands-on experience, addressing challenges like data bias and applying these models in tasks such as image search engines and visual question answering systems.
These sections aim to deepen your knowledge and practical skills in the multifaceted world of multimodal AI.