Image recognition and classification has long been a complex problem to solve with technology. Deep learning architectures such as Convolutional Neural Networks (CNNs) have demonstrated high accuracy in such tasks. This project demonstrates how transfer learning techniques using feature extraction and data augmentation tackle problems whose complexity increases drastically, especially those that demand the classification of very similar images belonging to completely different contexts.
Since the 2010s, the deep learning field has progressed by giant steps at long-standing tasks such as object classification, speech recognition, text processing, and image generation. AI competitions such as the ImageNet challenge made convolutional architectures hugely popular for object recognition and classification problems because of the accuracy levels they reached: from around 70% in 2011 to more than 95% (better than humans, even!) in 2015[1].
Figure 1. Camouflaged owl[2]
Recognizing very similar objects is not a new task at all, because in nature it happens all the time: we can observe how animals use camouflage to survive, hiding from predators or concealing themselves from their prey in order to hunt them. However, in this report we won't study camouflage, but objects that look alike despite belonging to completely different contexts: labradoodles and fried chicken, dogs and bagels, sheepdogs and mops, and of course chihuahuas and muffins!
Figure 2. Labradoodle vs fried chicken[3]
Figure 3. Dog vs bagel[3]
Figure 4. Sheepdog vs mop[3]
Figure 5. Chihuahua vs muffin[3]
As we pointed out in the abstract, our objective is to classify very similar objects that belong to very different contexts. Given the wide range of known (and unknown) examples, we will focus on one of the most popular cases: the chihuahua vs. muffin problem.
To tackle this kind of problem we will use a Convolutional Neural Network (CNN) architecture known as VGG19, configuring its parameters in a similar way as E. Togootogtokh and A. Amartuvshin did in their paper "Deep Learning Approach for Very Similar Objects Recognition Application on Chihuahua and Muffin Problem"[4]. This implementation is therefore inspired by the work of both authors (all credit to them); they proposed this state-of-the-art solution in 2018, and it can be found on arxiv.org.
Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. In deep learning, these layered representations are learned via models called neural networks, structured in literal layers stacked on top of each other[1].
Nowadays there are numerous neural network architectures aimed at different purposes. However, as mentioned in the abstract, we will focus only on Convolutional Neural Networks (Deep Convolutional Networks in the diagram) because of their effectiveness in classification tasks.
Figure 6. Neural Networks Architectures[5]
A Convolutional Neural Network (CNN) is characterized by convolution layers, which learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. Convolutions work by sliding these windows of size 3 × 3 or 5 × 5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features. Here a feature map can be understood as follows: every channel along the depth axis encodes a feature[1] (a small Keras sketch follows the figures below).
Figure 7. How convolution works[1]
Figure 8. CNN diagram example, VGG16 Architecture[1]
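To make the sliding-window idea concrete, here is a minimal Keras sketch; the input size, filter count, and padding choice are illustrative assumptions, not values taken from the VGG papers:

```python
# A single convolution layer sliding 3 x 3 windows over a 3D input
# feature map (height, width, channels), followed by max pooling.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(112, 112, 3))        # RGB image: the depth axis has 3 features
x = layers.Conv2D(filters=32, kernel_size=3,     # 32 learned 3 x 3 windows
                  strides=1, padding="same",     # stride 1; padding preserves resolution
                  activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)          # 2 x 2 windows, stride 2: halves the map
model = keras.Model(inputs, x)
model.summary()                                  # output feature map: (56, 56, 32)
```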
Created by the Visual Geometry Group at the University of Oxford, this architecture takes ideas from its predecessors (such as AlexNet) and improves them so significantly that in 2014 it outshone other state-of-the-art models, and it is still preferred for many challenging problems[6].
VGG19 is a variant of the VGG model which, in short, consists of 19 weight layers (16 convolution layers and 3 fully connected layers), complemented by 5 MaxPool layers and 1 SoftMax layer. There are other variants of VGG, such as VGG11 and VGG16. VGG19 requires 19.6 billion floating-point operations (FLOPs). The main purpose for which VGG was designed was to win the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)[6].
Figure 9. VGG19 Architecture[7]
Brief explanation of how the VGG19 architecture works:
- A fixed-size RGB image (224 × 224 originally, 112 × 112 for this project) is given as input to the network, which means the input matrix has shape (224, 224, 3).
- The only preprocessing done was subtracting from each pixel the mean RGB value computed over the whole training set.
- Kernels of size 3 × 3 with a stride of 1 pixel were used, which enabled the network to cover the whole extent of the image.
- Spatial padding was used to preserve the spatial resolution of the image.
- Max pooling was performed over 2 × 2 pixel windows with stride 2.
- This was followed by the Rectified Linear Unit (ReLU) to introduce non-linearity, improving both classification performance and computational time; previous models used tanh or sigmoid functions, and ReLU proved much better than those.
- Three fully connected layers were implemented: the first two of size 4096, followed by a layer with 1000 channels for the 1000-way ILSVRC classification, and a final softmax layer[6] (a quick way to inspect this stack with Keras is sketched below).
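As a sanity check, the layer stack described above can be listed directly from the pretrained VGG19 that ships with Keras. This is a minimal sketch, assuming a TensorFlow/Keras installation:

```python
# Load the pretrained VGG19 and print its layers to verify the
# 16 convolution, 5 max-pooling and 3 fully connected layer stack.
from tensorflow.keras.applications import VGG19

model = VGG19(weights="imagenet")  # downloads the ImageNet weights on first use
model.summary()                    # lists every conv, pooling, fc and softmax layer
```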
The loss function takes the predictions of the network and the true target (what you wanted the network to output) and computes a distance score, capturing how well the network has done on a specific example[1]. These are the commonly used loss functions:
- CategoricalCrossentropy
- SparseCategoricalCrossentropy
- BinaryCrossentropy
- MeanSquaredError
- KLDivergence
- CosineSimilarity
Given the type of problem to be solved, sparse_categorical_crossentropy is the chosen option, since each output belongs to one of the two following classes: "Chihuahua" or "Muffin".
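A short sketch of why the sparse variant fits: the labels can stay as plain integers instead of one-hot vectors (the 0 = "Chihuahua", 1 = "Muffin" mapping below is an illustrative assumption):

```python
# Sparse categorical crossentropy consumes integer class labels directly.
import tensorflow as tf

y_true = tf.constant([0, 1])              # 0 = chihuahua, 1 = muffin (illustrative mapping)
y_pred = tf.constant([[0.9, 0.1],         # confident "chihuahua" prediction
                      [0.4, 0.6]])        # hesitant "muffin" prediction
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(loss_fn(y_true, y_pred)))     # mean distance score over the two examples
```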
The optimizer determines how the network will be updated based on the loss function; it implements a specific variant of stochastic gradient descent (SGD). These are the commonly used optimizers (a compile sketch combining the chosen loss with one of them appears after the list):
- SGD (with or without momentum)
- RMSprop
- Adam
- Adagrad
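Putting the chosen loss together with one of these optimizers, a minimal compile sketch; the tiny placeholder model, the RMSprop choice, and the learning rate are illustrative assumptions, not values fixed by this report:

```python
import tensorflow as tf

# Placeholder model just to demonstrate the compile step; the real
# network (VGG19 plus a new classifier) is built in the next section.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(112, 112, 3)),
    tf.keras.layers.Dense(2, activation="softmax"),  # two classes: chihuahua, muffin
])
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),  # illustrative choice
    loss="sparse_categorical_crossentropy",                     # matches the loss chosen above
    metrics=["accuracy"],
)
```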
Transfer learning focuses on reusing the knowledge gained by a previously trained Machine Learning system so that another system can learn how to solve similar tasks involving partially or completely different data[8].
Figure 10. Transfer learning diagram[8]
Feature extraction consists of using the representations learned by a previously trained model to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch[1].
Figure 11. How feature extraction works[1]
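A minimal sketch of feature extraction with a frozen VGG19 convolutional base and a new classifier on top; the 112 × 112 input size matches this project, while the 256-unit dense layer is an illustrative assumption:

```python
# Feature extraction: reuse VGG19's pretrained convolutional base,
# freeze it, and train only the new classifier stacked on top.
import tensorflow as tf
from tensorflow.keras.applications import VGG19

conv_base = VGG19(weights="imagenet",
                  include_top=False,          # drop the original 1000-way classifier
                  input_shape=(112, 112, 3))  # this project's input size
conv_base.trainable = False                   # keep the pretrained representations intact

model = tf.keras.Sequential([
    conv_base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),   # new classifier; size is illustrative
    tf.keras.layers.Dense(2, activation="softmax"),  # chihuahua vs muffin
])
```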
This technique consists of generating more instances from an image dataset by applying transformations[1] (a Keras sketch appears after Figure 12) such as:
- rotation
- zoom in / zoom out
- crop
- grayscale
- flip
Figure 12. Data augmentation example[1]
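A sketch of these transformations using Keras' ImageDataGenerator; the parameter values are illustrative assumptions, and the train folder path is the one created by setup.py:

```python
# Data augmentation: produce extra training instances by randomly
# transforming existing images. Parameter values are illustrative.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values
    rotation_range=40,        # random rotations
    zoom_range=0.2,           # random zoom in / zoom out
    width_shift_range=0.2,    # random shifts, a crop-like effect
    height_shift_range=0.2,
    horizontal_flip=True,     # random flips
)
train_generator = train_datagen.flow_from_directory(
    "chihuahua_vs_muffin/train",  # folder created by setup.py
    target_size=(112, 112),
    batch_size=20,
    class_mode="sparse",          # integer labels, matching sparse_categorical_crossentropy
)
```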
All images used for this project belong to third-party sources such as:
- Open Images
- Oxford's pets dataset [9]
- Imagenet dataset [10]
- Adobe Stock
- iStock Photo
- Getty Images
- Pexels
- Unsplash
- python setup.py: creates the chihuahua_vs_muffin folder, which contains the test, train, and validation datasets
- python training.py: trains the CNN, saves the model in vgg19_chihuahua_vs_muffin.h5, and queries the model
When querying the model, the program first asks you to choose a folder:
- Muffin folder
- Chihuahua folder
- Exit from program

Then it asks you to enter a valid instance id:
- 1-500 for muffin
- 1-900 for chihuahua
Figure 13. Querying the model after 4 epochs
NOTE #1: Two extra graphs will appear before querying; they correspond to the accuracy metrics shown in the following sections.
NOTE #2: To run Keras with GPU support on Windows you will have to set up some configurations (running this program on CPU will be extremely slow). I would recommend following this tutorial: https://lifewithdata.com/2022/01/16/how-to-install-tensorflow-and-keras-with-gpu-support-on-windows/
Figure 14. Training and validation accuracy after 100 epochs
Figure 15. Training and validation loss after 100 epochs
Figure 16. Test accuracy and loss
Figure 17. Querying chihuahua_232 after 100 epochs
Figure 18. Querying muffin_138 after 100 epochs
As we saw, training accuracy and validation accuracy reached around 93% and 97% respectively (sometimes these percentages are higher), and test accuracy reached 94%. In addition, these metrics can vary depending on the combination of loss functions, new dataset instances, changes to the number of neurons in the very last layers, etc.
On the other hand, using data augmentation to tackle this kind of problem proved to be a wise decision: according to the paper[4], an accuracy of 95% was reached as well, a percentage at which this very specific problem can be considered solved. This small project demonstrates the advantages in time and efficiency achieved by implementing transfer learning techniques.
Since only 1000 images were analyzed (500 chihuahuas and 500 muffins), it cannot be concluded that we will always get such accuracy percentages for this kind of problem: dataset instances, computing resources, and so on have a big influence on the final results.
[1]Chollet, F., 2022. Deep Learning With Python. 2nd ed. Greenwich, USA: Manning Publications.
[2]S. Gettle, "CAMOUFLAGE IN NATURE - Steve Gettle Nature Photography", Steve Gettle Nature Photography, 2022. [Online]. Available: http://stevegettle.com/2008/10/08/camouflage-in-nature/. [Accessed: 19- May- 2022].
[3]A. Gri, "Puppies Or Food? 12 Pics That Will Make You Question Reality", Bored Panda, 2022. [Online]. Available: https://www.boredpanda.com/dog-food-comparison-bagel-muffin-lookalike-teenybiscuit-karen-zack/?utm_source=google&utm_medium=organic&utm_campaign=organic. [Accessed: 19- May- 2022].
[4]E. Togootogtokh and A. Amartuvshin, "Deep Learning Approach for Very Similar Objects Recognition Application on Chihuahua and Muffin Problem", arXiv, 2018. Available: https://arxiv.org/abs/1801.09573. [Accessed 19 May 2022].
[5]"Neural Networks: Chapter 6 - Neural Architectures", Chronicles of AI, 2022. [Online]. Available: https://chroniclesofai.com/neural-networks-chapter-6-neural-architectures/. [Accessed: 20- May- 2022].
[6]A. Kaushik, "Understanding the VGG19 Architecture", OpenGenus IQ: Computing Expertise & Legacy, 2022. [Online]. Available: https://iq.opengenus.org/vgg19-architecture/. [Accessed: 21- May- 2022].
[7]Y. Zheng, C. Yang and A. Merkulov, "Breast cancer screening using convolutional neural network and follow-up digital mammography", Computational Imaging III, 2018. Available: 10.1117/12.2304564 [Accessed 21 May 2022].
[8]K. Shah, "A Quick Overview to the Transfer Learning and it’s Significance in Real World Applications", Medium, 2022. [Online]. Available: https://medium.com/towards-tech-intelligence/a-quick-overview-to-the-transfer-learning-and-its-significance-in-real-world-applications-790fb57debad. [Accessed: 22- May- 2022].
[9]"Oxford Pet Animal Dataset", Visual Geometry Group, University of Oxford. [Online]. Available: http://www.robots.ox.ac.uk/~vgg/data/pets/
[10]"ImageNet". [Online]. Available: http://www.image-net.org/