Disclaimer: this is a personal perspective; I could have missed a lot of things and cannot promise that there are no mistakes.
The marks for papers [?/10] are also totally subjective and based on their usability for me. A rough translation of these marks: >=9 - must read; >=6 - must read if you are working in the specified area. If a paper title is crossed out, it means I decided not to go into it (even though I was interested during the main conference; the reason may be that the area is very far from my current interests). If a paper goes without comments, I will probably add them later.
Over 7,500 participants, 4 days of main conference + 60 workshops and 12 tutorials. 1075 accepted papers (10% orals) in the main conference alone.
All of the papers can be found at CVF open access. The official videos from the oral sessions will be available at the CVF youtube channel.
It was absolutely infeasible to track everything, so I almost completely skipped the following topics:
- Autonomous driving
- 3D, Point clouds
- Video analysis
- Computer Vision in medical images
- Captioning, Visual Grounding, Visual Question Answering
I spent less time on:
- Domain adaptation, zero-shot, few-shot, unsupervised, self-supervised, semi-supervised. TL;DR motivation - to learn as fast as humans do, from few or no examples. In practice it still works rather poorly and cannot compete with supervised methods. At the current stage we cannot do this well enough, but when we can, it will be a giant step forward.
- Knowledge distillation, federated learning. TL;DR many papers have controversial results - sometimes it works, sometimes it doesn't; sometimes it is very useful, sometimes useless. You can try, but do not expect much.
- Deepfakes in images and videos. TL;DR you cannot completely trust any digital image/video anymore. There is a huge movement in the area and several datasets are already available. The problem is: when you know the "deepfake attack" method and have trained on data produced with this method, you can get ~70-95% accuracy (which is itself not much), but when you don't know the method, your deepfake detector may be close to random (50%).
I took a closer look at:
- Semantic and instance segmentation, object detection
- New architectures, modules, losses, augmentations, optimization methods
- Neural architecture search
- Interpretability
- Text detection and recognition
- Network compression
- GANs, style transfer
- Avoid imagenet pretraining for transfer learning. Self-supervised techniques seem to work better. It seems (and is shown in several papers) that it is probably much better to train from scratch on your data; initialization from pretrained weights may speed up convergence but does not guarantee that the final metrics will be better. Instead of plain training from scratch you can also try to combine your training with self-supervised methods.
- Efficient layers instead of convolutions. Several layers were proposed as drop-in replacements for vanilla convolutions. The most notable example is probably OctConv. The problem with such new layers is that vanilla convolution has highly optimized implementations, which is not true for these newly proposed layers, even if they require less computation in theory.
- Efficient loss functions. Several new loss functions were proposed instead of CE, and it seems they should be a default choice now: they are easy to implement, outperform Center Loss or OHEM strategies, sometimes provide clearer class separation in the embedding space, and even work well for imbalanced classification. See the Losses section for details.
- Going from anchor-based object detection to dense predictions. Several papers propose to go from anchor-based detectors to dense predictions; see the Instance Segmentation and Object Detection sections below for more details.
- Generative models from a single image. This includes better deepfakes, neural talking heads and GANs trained on a single image (SinGAN and InGAN).
- Revival of auxiliary intermediate classifiers. I especially liked this paper where the authors apply distillation between the final classifier and intermediate classifiers, and this improved results a lot.
- Attempts at fashion generation and try-on. A lot of works in this field, but they do not really work for now; however, it is just a matter of time.
- Preregistration Workshop. The current scheme for ML conferences is not how science normally works: you should have a hypothesis and then run experiments to prove/disprove it, but in ML papers the pipeline is the opposite: having the results, you hypothesize to explain them, which is known as HARKing (Hypothesizing After the Results are Known). The consequences are high positive bias, SOTA-hacking (a paper is much more likely to be accepted if it claims to beat SOTA) and, more importantly, poor generalization of the results (say, you conducted experiments on 10 classification datasets and achieved SOTA on 3 of them; you publish a paper with these 3 and completely omit the other 7, which is absolutely terrible for science). The idea of preregistration is to fight this problem by separating hypothesis generation and hypothesis validation. It looks like a very promising direction despite having lots of problems. This is probably the most important idea I heard at the entire conference. Preregistration workshop link. Here is the link to the video which clearly explains the problem (I will post it when it becomes available).
- [7/10][They are independent] Are Adversarial Robustness and Common Perturbation Robustness Independent Attributes?
- [8/10][Crop a patch from one image and paste it into another + mix ground truth labels proportionally to the areas] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features SOTA augmentation strategy, also improves transfer learning.
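A minimal sketch of the CutMix idea (my own illustration, not the authors' code; `alpha` is the Beta-distribution parameter):

```python
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    # Paste a random box from a shuffled copy of the batch and mix the
    # labels proportionally to the pasted area.
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    H, W = images.shape[2:]
    # Sample a box covering roughly (1 - lam) of the image area.
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip([cy - rh // 2, cy + rh // 2], 0, H)
    x1, x2 = np.clip([cx - rw // 2, cx + rw // 2], 0, W)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lambda from the actual (clipped) box area.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    return images, labels, labels[perm], lam

# The training step then mixes the losses accordingly:
# loss = lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)
```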
- [7/10][Find the best augmentation policy along with network training] Online Hyper-parameter Learning for Auto-Augmentation Strategy Gives slight improvements in quality, but is in fact less time-efficient for the search than this
- [10/10][OctConv is both faster and more accurate; drop-in replacement for vanilla conv] Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution. The trick is that the final architecture has to be optimized (in terms of framework matrix operations), otherwise it will actually be slower. Gives up to 30% speed-up with better accuracy. The idea of the paper is to explicitly decompose features into high-frequency (H, W, C_h) and low-frequency (H // octave, W // octave, C_l) parts, process them separately and then exchange the information.
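A minimal sketch of the octave convolution structure (assuming octave factor 2, even spatial dims, and the same low-frequency ratio `alpha` on input and output; a naive implementation, not the optimized one the note above asks for):

```python
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    # Four cross-frequency convolutions with up/downsampling to exchange
    # information between the high- and low-frequency branches.
    def __init__(self, in_ch, out_ch, k=3, alpha=0.5):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        p = k // 2
        self.hh = nn.Conv2d(in_h, out_h, k, padding=p)  # high -> high
        self.hl = nn.Conv2d(in_h, out_l, k, padding=p)  # high -> low
        self.lh = nn.Conv2d(in_l, out_h, k, padding=p)  # low  -> high
        self.ll = nn.Conv2d(in_l, out_l, k, padding=p)  # low  -> low

    def forward(self, x_h, x_l):
        h = self.hh(x_h) + F.interpolate(self.lh(x_l), scale_factor=2)
        l = self.ll(x_l) + self.hl(F.avg_pool2d(x_h, 2))
        return h, l
```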
- [9/10][Suspicious layer which surprisingly improves both accuracy and speed and is a drop-in replacement for vanilla convolution] Dynamic Multi-scale Filters for Semantic Segmentation Replaces vanilla conv with the following 2-branch structure: the first branch computes a KxK kernel via adaptive_pool(KxK) -> conv1x1; the second branch applies a 1x1 conv to the features; then the 2 branches are merged via a depthwise conv whose kernel comes from the first branch, followed by an additional 1x1 conv. The ablation study shows that it can give +7% mIoU compared to vanilla conv. Now why is it suspicious? The computed kernel's top-left element is essentially taken from the image's top-left part, and the same holds for the bottom-right kernel element (it is essentially the image's bottom-right part). And these very different elements are applied to very similar local features. My intuition fails to explain why this makes sense; maybe we should add an extra global pooling for the kernel and convolve the kernel with it first.
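A sketch of that 2-branch structure as I understood it (my reading, not the authors' code; the per-sample depthwise conv uses the standard `groups` trick):

```python
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilter(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        self.kernel_gen = nn.Conv2d(ch, ch, 1)  # branch 1: 1x1 after adaptive pool
        self.feat = nn.Conv2d(ch, ch, 1)        # branch 2: 1x1 on the features
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Branch 1: a KxK depthwise kernel computed from the input itself.
        kernel = self.kernel_gen(F.adaptive_avg_pool2d(x, self.k))  # (B, C, K, K)
        feats = self.feat(x)
        # Depthwise conv with a per-sample kernel via grouped convolution.
        feats = feats.reshape(1, b * c, h, w)
        kernel = kernel.reshape(b * c, 1, self.k, self.k)
        y = F.conv2d(feats, kernel, padding=self.k // 2, groups=b * c)
        return self.out(y.reshape(b, c, h, w))
```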
- [6/10][Essentially local self-attention with inefficient (not optimized) computation] Local Relation Networks for Image Recognition
- [4/10][Learned pooling works slightly better but slower] LIP: Local Importance-based Pooling
- [4/10][Learned pooling with global features; slightly better but slower] Global Feature Guided Local Pooling
- [9/10][SOTA on cityscapes-val, proposed global-local context module to aggregate multidimensional features] Adaptive Context Network for Scene Parsing
- [8/10][Fast and efficient feature upsampling with very little overhead] CARAFE: Content-Aware ReAssembly of FEatures
- [5/10][Local visual attention computed in autoregressive order] AttentionRNN: Structured Spatial Attention Mechanism
- [10/10][Non-uniform downsampling of high-resolution images] Efficient Segmentation: Learning Downsampling Near Semantic Boundaries. A network creates a non-uniform downsampling grid aimed at giving more space to semantic boundaries. The results are noticeably better than with uniform downsampling. 3 steps: 1) non-uniform downsampling (the image is downsampled to a very small resolution (32x32 or 64x64, for example); at this resolution the downsampling network is trained, with ground truth derived from a reasonable optimization problem on the ground truth segmentation map) == very fast stage; 2) the main segmentation network runs on the non-uniformly downsampled image; 3) the result is upsampled (which can be done since we know the downsampling strategy).
- [9/10][Replace the global discriminator with a "gambler" that bets CE-weights on pixels where the segmenter is wrong] I Bet You Are Wrong: Gambling Adversarial Networks for Structured Semantic Segmentation Instead of a global discriminator between the ground truth and predicted segmentation maps they use a "gambler" which maps (image, predicted segmap) to CE-weights so as to maximize sum(weights * CELoss). Seems to improve performance a lot compared to previous adversarial training approaches. An additional benefit is that the gambler never sees the GT, so it is less sensitive to errors in the GT.
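A sketch of the two objectives as I understood them (hypothetical `segnet`/`gambler` modules; the gambler is assumed to output per-pixel betting weights normalized to sum to 1):

```python
import torch.nn.functional as F

def gambling_losses(segnet, gambler, image, target):
    logits = segnet(image)                                  # (B, C, H, W)
    ce = F.cross_entropy(logits, target, reduction='none')  # (B, H, W)
    bets = gambler(image, logits.softmax(1).detach())       # (B, H, W)
    gambler_loss = -(bets * ce.detach()).sum()  # gambler maximizes weighted CE
    segnet_loss = (bets.detach() * ce).sum()    # segmenter minimizes it
    return segnet_loss, gambler_loss
```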
- [6/10][Self-attention on ASPP features, flattened + concatenated] Asymmetric Non-local Neural Networks for Semantic Segmentation Instead of global self-attention (which is very costly) they 1) use ASPP; 2) flatten all ASPP maps; 3) concatenate the resulting 1x1 maps; 4) apply attention over these concatenated features (which means you can select 0.1 * global (1x1) pool + 0.3 * 2x2pool[0,0] + 0.01 * 2x2pool[0,1] + ...). The module clearly improves the final metric.
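A rough sketch of the asymmetry (my interpretation: the pooled positions serve as keys/values, so the cost drops from O((HW)^2) to O(HW * S), with S the small number of pooled positions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricAttention(nn.Module):
    def __init__(self, ch, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        # Keys/values come only from a small pooling pyramid.
        kv = torch.cat([F.adaptive_avg_pool2d(x, s).flatten(2)
                        for s in self.pool_sizes], dim=2)  # (B, C, S)
        k = self.k(kv.unsqueeze(-1)).squeeze(-1)           # (B, C', S)
        v = self.v(kv.unsqueeze(-1)).squeeze(-1)           # (B, C, S)
        attn = torch.softmax(q @ k, dim=-1)                # (B, HW, S)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)   # (B, C, HW)
        return x + out.reshape(b, c, h, w)
```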
- [8/10][Per-class center computation (based on coarse segmentation) + attention on the centers -> fine segmentation] ACFNet: Attentional Class Feature Network for Semantic Segmentation
- [8/10][Separate (cheap) shape stream from image gradients and dual-task learning (shape + segmentation)] Gated-SCNN: Gated Shape CNN for Semantic Segmentation. 1) a very cheap 3-layer shape stream which takes image gradients + 1st-layer CNN features and exchanges information with the main backbone via a gating mechanism; 2) a dual loss (edge detection + semantic segmentation) + a consensus regularization penalty (checks that the semantic segmentation output is consistent with the predicted edges)
- [8/10][Another approach to using boundaries: first learn the boundary as an (N+1)-th class, then introduce UAGs and some crazy stuff] Boundary-Aware Feature Propagation for Scene Segmentation
- EGNet: Edge Guidance Network for Salient Object Detection
- Selectivity or Invariance: Boundary-Aware Salient Object Detection
- Stacked Cross Refinement Network for Edge-Aware Salient Object Detection
- [6/10][Detect unknown objects using optical flow] Towards Segmenting Everything That Moves
- [5/10][Typically works better and established SOTA, but obviously slower; nothing surprising] Recurrent U-Net for Resource-Constrained Segmentation. Recurrence in several layers close to the lowest-resolution ones.
- [???][Reformulate the loss for convex objects; I didn't understand it; looks like a computational geometry thing, so I can't say how useful it is with NNs] Convex Shape Prior for Multi-Object Segmentation Using a Single Level Set Function
- [9/10][Learn prototypes and coefficients to combine them; can be 3-10x faster than MaskRCNN with comparable accuracy] YOLACT: Real-time Instance Segmentation Each anchor predicts bbox + classes + prototype weights. A separate branch predicts the prototypes.
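The mask assembly itself is a single linear combination; a minimal sketch:

```python
import torch

def assemble_masks(prototypes, coeffs):
    # prototypes: (K, H, W) from the prototype branch,
    # coeffs: (N_det, K) predicted per detection.
    # Each detection's mask is a weighted sum of prototypes + sigmoid.
    return torch.sigmoid(torch.einsum('nk,khw->nhw', coeffs, prototypes))
```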
- [9/10][Backbone + point proposal -> mask of the object at that point] AdaptIS: Adaptive Instance Selection Network The proposed network can generate an instance mask given a point on that instance. The backbone extracts features. Features + point -> small net with AdaIN (where the norm statistics are computed from the point info) -> instance mask. To get all objects in the image, the authors trained a separate "point proposal" branch, which is trained after everything else is frozen and predicts the binary label "will this point be good for object mask prediction?". From this branch the top k% of points are sampled and used for predicting objects.
- [7/10][Dense object detection by simply predicting bbox coordinates. Simple and efficient.] FCOS: Fully Convolutional One-Stage Object Detection Directly predict the 4D distances to the object (top, left, bottom, right) in each foreground pixel + NMS -> SOTA + simplicity + easy integration with depth prediction + no anchor computations (i.e. IoU) and no anchor hyperparameters. Details: 1) each object is predicted on only one feature map during training, and at test time, if multiple resolutions fire for an object, only the smallest is chosen; 2) a proposed "centerness" branch predicts the normalized ([0, 1]) distance between a pixel and the object center; its output is multiplied by the classification score in NMS.
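The centerness target is simple to compute from the four predicted distances; a sketch:

```python
import torch

def centerness(l, t, r, b):
    # 1 at the box center, approaches 0 near its edges; used to
    # down-weight low-quality detections far from object centers.
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) *
                      (torch.min(t, b) / torch.max(t, b)))
```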
- [6/10][Global scale predicted in each resnet block, dilations are selected based on that] POD: Practical Object Detection with Scale-Sensitive Network
- [6/10][To improve detection of objects at different scales, replace some convs with 3 conv branches with shared params but different dilations] Scale-Aware Trident Networks for Object Detection
- [7/10][Change of target: bbox -> reppoints, arbitrary points whose outline locates the object accurately] RepPoints: Point Set Representation for Object Detection These reppoints may be iteratively refined during prediction; they are learned via localization and classification losses.
- [6/10][CycleGAN (clean image <-> image with pathology) on small regions specified by masks] Generative Modeling for Small-Data Object Detection
- [5/10][2-stage detector with a smaller backbone - works close to real-time on GPU] ThunderNet: Towards Real-time Generic Object Detection on Mobile Devices
- SNIDER: Single Noisy Image Denoising and Rectification for Improving License Plate Recognition
- State-of-the-Art in Action: Unconstrained Text Detection
- Convolutional Character Networks
- Large-Scale Tag-Based Font Retrieval with Generative Feature Learning
- Chinese Street View Text: Large-Scale Chinese Reading with Partially Supervised Learning
- TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting
- Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
- What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis
- Towards Unconstrained End-to-End Text Spotting
- Controllable Artistic Text Style Transfer via Shape-Matching GAN
- [10/10 BEST PAPER ICCV2019] SinGAN: Learning a Generative Model from a Single Natural Image Uses generators and discriminators at multiple resolutions and trains on patches of a single image. Multiple applications without additional training, including super-resolution, image editing, single-image animation, paint2image. video
- [8/10] InGAN: Capturing and Retargeting the "DNA" of a Natural Image A GAN trained on patches of a single image, able to produce similar images of different shapes.
- [6/10][Single-net adversarial attack for multiple target classes] Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once A single model produces adversarial examples towards any class (surprisingly, all previous works use one model per class; in this work - one model for all classes)
- [8/10] FUNIT: Few-Shot Unsupervised Image-to-Image Translation
- [7/10] Lifelong GAN: Continual Learning for Conditional Image Generation Given access to previous models and new data, the task is to be able to generate both new and old classes.
- [8/10] PuppetGAN: Cross-Domain Image Manipulation by Demonstration Manipulate separate attributes of an image (e.g. mouth, rotation, lighting, etc.) from a target image
- [7/10][A couple of tricks to make "aging" more personalized] S2GAN: Sharing Aging Factors Across Ages and Sharing Aging Trends Among Individuals
- [7/10][The user slightly edits the image with a sketch in a certain place -> realistic edited image] SC-FEGAN: Face Editing Generative Adversarial Network with User's Sketch and Color
- [10/10][Adapt a pretrained GAN to new classes and domains (even with a 100-sample dataset) by training only batch statistics] Image Generation From Small Datasets via Batch Statistics Adaptation With a large pretrained generator (e.g. BigGAN), train only the BatchNorm params (gamma and beta) and that's it - works even on very small datasets, and the results look very good!
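A sketch of the recipe (assumptions mine; the real BigGAN uses conditional BatchNorm, which this simplification ignores):

```python
import torch
from torch import nn

def adapt_batchnorm_only(generator: nn.Module, lr=1e-4):
    # Freeze every weight of the pretrained generator ...
    for p in generator.parameters():
        p.requires_grad = False
    # ... except the BatchNorm affine params (gamma/beta),
    # which get fine-tuned on the small target dataset.
    trainable = []
    for m in generator.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.requires_grad = True  # gamma
            m.bias.requires_grad = True    # beta
            trainable += [m.weight, m.bias]
    return torch.optim.Adam(trainable, lr=lr)
```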
- [9/10][Increased stability + SOTA GAN metrics; a different WGAN term to optimize the quadratic Wasserstein distance] Wasserstein GAN With Quadratic Transport Cost
- [9/10][Spectral regularization >> spectral norm (in terms of both stability and final results)] Spectral Regularization for Combating Mode Collapse in GANs
- [9/10] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models video
- Markov Decision Process for Video Generation
- [source person -> pose; target person + source pose -> synthesis] Dance Dance Generation: Motion Transfer for Internet Videos
- Everybody Dance Now (University of California) video
(also SinGAN and InGAN)
- Boundless: Generative Adversarial Network for Image Extension
- Very Long Natural Scenery Image Prediction by Outpainting
- [examples really look like a simple color transform] Photorealistic Style Transfer via Wavelet Transforms
- A Closed-Form Solution to Universal Style Transfer
- Understanding Whitening and Coloring Transform for Universal Style Transfer
- [5/10][Style transfer on the entire image + semantic segmentation masks = style transfer for selected object classes] Class-Based Styling: Real-Time Localized Style Transfer with Semantic Segmentation
In general all these methods still work quite poorly, but at least they work somehow.
- FW-GAN: Flow-Navigated Warping GAN for Video Virtual Try-on
- Personalized Fashion Design (Cong Yu et al.)
- [9/10][Sampling from a random graph model gives results comparable with current SOTA NAS, which suggests that what current NAS methods do is not really better than random search] Exploring Randomly Wired Neural Networks for Image Recognition
- [8/10][Most classification loss functions can be derived from a parametrized loss function with 2 params; the authors searched for the optimal values of these params] AM-LFS: AutoML for Loss Function Search
- [9/10][Improvements vs handcrafted GANs on Cifar10 in terms of IS] AutoGAN: Neural Architecture Search for Generative Adversarial Networks
- [8/10][An evaluator predicts how likely a model is to have a lower validation score] One-Shot Neural Architecture Search via Self-Evaluated Template Network
My knowledge of compression techniques is quite limited, so do not really trust these quality marks.
- [8/10] Automated Multi-Stage Compression of Neural Networks - tensor decompositions, two repetitive steps: compression and fine-tuning; 10-15x compression rate with 1-2% metric drop (depending on the dataset). pytorch code
- [8/10][4-bit quantization + finetuning] DSConv: Efficient Convolution Operator
- [6/10][A YOLO compression success story, known techniques applied properly] SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications
- Accelerate CNN via Recursive Bayesian Pruning
- [6/10][Speed-quality tradeoff without retraining, but the results are worse than SOTA] Adaptive Inference Cost With Convolutional Neural Mixture Models The idea is to work with a mixture of nets (each layer may be applied or skipped). The inference cost is O(N*(N-1)/2), where N is the number of layers - which is relatively slow. As in pruning, we omit some layers, thus gaining some speedup. The main benefit is that the net does not need to be retrained, but the approach seems complicated to implement and works worse in quality compared to SOTA.
- Workshop: Compact and Efficient Feature Representation and Learning in Computer Vision 2019
- Real-Time Aerial Suspicious Analysis (ASANA): System for Identification and Re-Identification of Suspicious Individuals in Crowds Using the Bayesian ScatterNet Hybrid Network
- Detecting the Unexpected by Image Resynthesis
- Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection
- [visual sound localization and separation] The Sound of Motions
- [sound separation by predicting M mixture components, then subtracting their masks one by one, then refining the separate sounds] Recursive Visual Sound Separation Using Minus-Plus Net
- [9/10][Loss for classification+clustering designed specifically for imbalanced data; good separation of embeddings in the vector space] Gaussian Affinity for Max-Margin Class Imbalanced Learning
- [7/10][A 3-player adversarial game between a convex generator, a multi-class classifier network, and a real/fake discriminator to perform oversampling in deep learning systems. The convex generator generates new samples from the minority classes as convex combinations of existing instances, aiming to fool both the discriminator and the classifier into misclassifying the generated samples] Generative Adversarial Minority Oversampling
- [9/10][Amazing! Auxiliary intermediate classifiers are revived, and distillation among them improves quality by 0.8-4%] Be Your Own Teacher: Improve the Performance of CNNs via Self-Distillation
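A sketch of the distillation objective (hyperparameters `T` and `alpha` are my placeholders, not the paper's exact values):

```python
import torch.nn.functional as F

def self_distillation_loss(inter_logits, final_logits, target, T=3.0, alpha=0.3):
    # Hard-label CE for the final classifier ...
    loss = F.cross_entropy(final_logits, target)
    # ... while each auxiliary head learns from both the labels and the
    # softened predictions of the deepest (final) classifier.
    soft = F.softmax(final_logits.detach() / T, dim=1)
    for logits in inter_logits:
        ce = F.cross_entropy(logits, target)
        kd = F.kl_div(F.log_softmax(logits / T, dim=1), soft,
                      reduction='batchmean') * T * T
        loss = loss + (1 - alpha) * ce + alpha * kd
    return loss
```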
- [8/10][More accurate networks are often worse teachers; the solution is to stop training the teacher model early; the results generalize across models and datasets] On the Efficacy of Knowledge Distillation
- [10/10][Self-supervised pretraining > Imagenet pretraining; comprehensive benchmarking of self-supervised approaches on different datasets] Scaling and Benchmarking Self-Supervised Visual Representation Learning
- [9/10][SOTA on semi-supervised] S4L: Self-Supervised Semi-Supervised Learning
- [9/10][Dynamic cross-entropy smoothing; very simple to implement and seems to work slightly better than OHEM, Center Loss and others] Anchor Loss: Modulating Loss Scale Based on Prediction Difficulty
- [8/10][Softmax with multiple centers per class > TripletLoss (no need to sample triplets and better results); proof that smoothed softmax loss minimization == soft triplet loss minimization, so it is easier to optimize the former as it does not require sampling] SoftTriple Loss: Deep Metric Learning Without Triplet Sampling
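A simplified sketch of the multiple-centers idea (hard max over a class's centers instead of the paper's smoothed aggregation; dimensions are illustrative):

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiCenterSoftmax(nn.Module):
    def __init__(self, dim, n_classes, centers_per_class=5, scale=20.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, centers_per_class, dim))
        self.scale = scale

    def forward(self, emb, target):
        emb = F.normalize(emb, dim=1)
        centers = F.normalize(self.centers, dim=2)
        # Cosine similarity to every center, then max over each class's centers.
        sim = torch.einsum('bd,ckd->bck', emb, centers).max(dim=2).values
        return F.cross_entropy(self.scale * sim, target)
```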
- [6/10][Single-side loss function overestimation] Continual Learning by Asymmetric Loss Approximation With Single-Side Overestimation
- Explaining Neural Networks Semantically and Quantitatively
- Fooling Network Interpretation in Image Classification
- Seeing What a GAN Cannot Generate
- Subspace Structure-Aware Spectral Clustering for Robust Subspace Clustering
- Invariant Information Clustering for Unsupervised Image Classification and Segmentation
- GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions
- Deep Comprehensive Mining for Image Clustering
- [9/10][Human uncertainty for classification labels (0.6 dog, 0.4 cat) works better than human-free smoothing methods such as label smoothing] Human Uncertainty Makes Classification More Robust
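Training against such soft human labels is just cross-entropy with a distribution target; a sketch:

```python
import torch.nn.functional as F

def soft_label_loss(logits, human_probs):
    # human_probs: (B, C) rows of annotator label frequencies summing to 1,
    # e.g. [0.6 dog, 0.4 cat], used instead of one-hot targets.
    return -(human_probs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```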
- [5/10][Forcing machines to look at the same regions as humans helps, but what about the annotation cost?] Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
- Seeing Motion in the Dark
- Learning to See Moving Objects in the Dark
- [8/10][Imagenet pretraining may improve convergence speed but does not necessarily lead to better results; training from scratch is better] Rethinking ImageNet Pre-training
- [9/10][Learn multiple prototypes per class to detect noisy labels; train on both noisy and pseudo labels; no need to clean the data; no assumptions on a specific noise distribution; SOTA for noisy classification] Deep Self-Learning From Noisy Labels
- [9/10][Fast second-order optimizer (backward cost ~ 2-3x the forward cost)] Small Steps and Giant Leaps: Minimal Newton Solvers for Deep Learning Converges better than Adam, SGD & co.
- Selective Sparse Sampling for Fine-Grained Image Recognition
- Dynamic Anchor Feature Selection for Single-Shot Object Detection
- VideoBERT: A Joint Model for Video and Language Representation Learning
- PR Product: A Substitute for Inner Product in Neural Networks
- Deep Meta Metric Learning
- [9/10][Slow net on 1/N of the frames, fast net on the other (N-1)/N] SlowFast Networks for Video Recognition
- [6/10][One of many works on domain adaptation] Self-Training With Progressive Augmentation for Unsupervised Person Re-Identification
- Learning to Paint With Model-Based Deep Reinforcement Learning
- Joint Demosaicing and Denoising by Fine-Tuning of Bursts of Raw Images
- Improving CNN Classifiers by Estimating Test-Time Priors
- Joint Acne Image Grading and Counting via Label Distribution Learning
- [RANSAC-like algorithm to fit arbitrary shapes in arbitrary counts] Progressive-X: Efficient, Anytime, Multi-Model Fitting Algorithm
- Noise Flow: Noise Modeling With Conditional Normalizing Flows
- [comic colorization] Tag2Pix: Line Art Colorization Using Text Tag With SECat and Changing Loss
- Learning Lightweight Lane Detection CNNs by Self Attention Distillation
- Transductive Learning for Zero-Shot Object Detection
- Book "Explainable AI: Interpreting, explaining and visualizing deep learning"