This repository contains TensorFlow implementations of two models for semantic segmentation that are known to give high accuracy:
- U-net (https://arxiv.org/abs/1505.04597): The network can be found in `u_net.py`. The architecture follows the diagram in the paper, with a few minor differences in my implementation: I have used 'same' padding to simplify things, and for the upsampling I have simply used the `tf.image.resize_images` function (see `layers_unet.py`; a sketch of such a layer appears after this list). A full transpose convolution (deconvolution) layer is implemented for the FCN described next.
- FCN with global convolution (https://arxiv.org/abs/1703.02719): The network can be found in `fcn_gcn_net.py`. The architecture again follows the diagram in the paper, with a few minor differences in my implementation; in particular, I have used a VGG-style encoder instead of ResNet blocks. All the layers/blocks used in the architecture (including the deconvolution layer) can be found in `layers_fcn_gcn_net.py`; a sketch of the global convolution block also appears after this list.
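
For reference, here is a minimal sketch of a resize-based upsampling layer in the TF 1.x style used by this repo. The function name `upsample`, the trailing 3x3 convolution, and the assumption of static spatial shapes are illustrative choices and may differ from the actual helper in `layers_unet.py`:

```python
import tensorflow as tf

def upsample(x, factor=2, name='upsample'):
    """Bilinear resize (the tf.image.resize_images default) plus a smoothing conv (sketch)."""
    with tf.variable_scope(name):
        # Assumes static spatial dimensions; real code may read them dynamically.
        h, w = x.get_shape().as_list()[1:3]
        x = tf.image.resize_images(x, [h * factor, w * factor])
        # A 3x3 'same' convolution after resizing refines the interpolated features.
        x = tf.layers.conv2d(x, x.get_shape().as_list()[-1], 3,
                             padding='same', activation=tf.nn.relu)
    return x
```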
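Similarly, here is a minimal sketch of the two building blocks the FCN-GCN item refers to: the global convolution block from the paper (two separable large-kernel branches summed together) and a stride-2 transpose-convolution upsampler. The function names, kernel sizes, and the absence of batch normalization are assumptions here, not necessarily what `layers_fcn_gcn_net.py` does:

```python
import tensorflow as tf

def global_conv_block(x, filters, k=7, name='gcn'):
    """Global convolution: approximate a dense k x k kernel with two
    separable branches (k x 1 -> 1 x k and 1 x k -> k x 1), then sum them."""
    with tf.variable_scope(name):
        left = tf.layers.conv2d(x, filters, [k, 1], padding='same')
        left = tf.layers.conv2d(left, filters, [1, k], padding='same')
        right = tf.layers.conv2d(x, filters, [1, k], padding='same')
        right = tf.layers.conv2d(right, filters, [k, 1], padding='same')
        return left + right

def deconv_upsample(x, filters, name='deconv'):
    """Learned 2x upsampling via a stride-2 transpose (de-)convolution."""
    with tf.variable_scope(name):
        return tf.layers.conv2d_transpose(x, filters, 4, strides=2, padding='same')
```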
I applied these models to one of the Kaggle competitions, where the background behind the object (in this case: cars) had to be removed. More details can be found here: Kaggle: Carvana Image Masking Challenge. Due to lack of time and resources, I ended up making only a single submission and got a score of 99.2% (the winning solution scored 99.7%). For this particular challenge, since there is only one class, U-net is the better model choice. Here is a sample result when U-net is applied to a test image:
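
The competition scores a submission by the mean Dice coefficient between predicted and ground-truth car masks, which is what the 99.2% / 99.7% figures above refer to. Here is a minimal sketch of how a single mask pair would be scored; the helper name and the epsilon smoothing are illustrative:

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between two binary masks: 2*|A & B| / (|A| + |B|)."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    # eps keeps the score defined when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
```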
Scope for improvement: There are several strategies that could have improved the score, which I did not use due to lack of time:
- Image preprocessing and data augmentation
- Use higher resolution images: I was using AWS, which wasn't fast enough to handle high-resolution images, so I had to scale the images down considerably, which leads to lower accuracy.
- Tiling sub-regions of the high-resolution image: This strategy ensures that each tile fits in GPU memory, but is obviously more time-consuming (see the sketch after this list).
- Apply Conditional Random Field post-processing
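
For the tiling idea, a minimal sketch of splitting an image into fixed-size tiles might look like the following. The function name and the assumption that the image dimensions divide evenly by the tile size are illustrative; real code would pad or overlap tiles and stitch the predicted masks back together:

```python
import numpy as np

def tile_image(img, tile_h, tile_w):
    """Split an (H, W, C) image into non-overlapping tiles (sketch)."""
    h, w = img.shape[:2]
    tiles = []
    for top in range(0, h, tile_h):
        for left in range(0, w, tile_w):
            # Each tile is small enough to fit in GPU memory on its own.
            tiles.append(img[top:top + tile_h, left:left + tile_w])
    return tiles
```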