Disclaimer: this is a personal perspective; I could have missed a lot of things and cannot promise that there are no mistakes.
The marks for papers [?/10] are also totally subjective and based on their usability for me. A rough translation of these marks: >=9 - must read; >=6 - must read if you are working in the specified area. If a paper title is crossed out, it means I decided not to go into it (even though I was interested during the main conference; the reason may be that the area is very far from my current interests). If a paper goes without comments, I will probably add them later.
Over 7,500 participants, 4 days of main conference + 60 workshops and 12 tutorials. 1075 accepted papers (10% orals) in the main conference alone.
All of the papers can be found at CVF open access. The official videos from the oral sessions will be available at the CVF youtube channel.
It was absolutely infeasible to track everything, so I almost completely skipped the following topics:
- Autonomous driving
- 3D, Point clouds
- Video analysis
- Computer Vision in medical images
- Captioning, Visual Grounding, Visual Question Answering
I spent less time on:
- Domain adaptation, zero-shot, few-shot, unsupervised, self-supervised, semi-supervised. TL;DR motivation - to learn as fast as humans do, from few or no examples. In practice it still works rather poorly and cannot compete with supervised methods. At the current stage we cannot do this well enough, but when we can, it will be a giant step forward.
- Knowledge distillation, federated learning. TL;DR many papers have controversial results - sometimes it works, sometimes it doesn't; sometimes it is very useful, sometimes useless. You can try, but do not expect much.
- Deepfakes in images and videos. TL;DR you cannot completely trust any digital image/video anymore. There is a huge movement in the area and several datasets are already available. The problem is: when you know the "deepfake attack" method and have trained on data produced with this method, you can get ~70-95% accuracy (which is itself not much), but when you don't know the method, your deepfake detector may be close to random (50%).
I took a closer look at:
- Semantic and instance segmentation, object detection
- New architectures, modules, losses, augmentations, optimization methods
- Neural architecture search
- Interpretability
- Text detection and recognition
- Network compression
- GANs, style transfer
- Avoid imagenet pretraining for transfer learning. Self-supervised techniques seem to work better. It seems (and is shown in several papers) that it is probably much better to train from scratch on your data; initialization from pretrained weights may speed up convergence but does not guarantee that the final metrics will be better. Instead of plain training from scratch you can also try to combine your training with self-supervised methods.
- Efficient layers instead of convolutions. Several layers were proposed as drop-in replacements for vanilla convolutions. The most notable example is probably OctConv. The problem with such new layers is that vanilla convolution has highly optimized implementations, which is not true for these newly proposed layers, even if they require less computation in theory.
- Efficient loss functions. Several new loss functions were proposed instead of CE, and it seems they should be a default choice now: they are easy to implement, outperform Center Loss or OHEM strategies, sometimes provide clearer class separation in the embedding space, and even work well for imbalanced classification. See the Losses section for details.
- Going from anchor-based object detection to dense predictions. Several papers propose to go from anchor-based detectors to dense predictions; see the Instance Segmentation and Object Detection sections below for more details.
- Generative models from a single image. This includes better deepfakes, neural talking heads and GANs trained on a single image (SinGAN and InGAN).
- Revival of auxiliary intermediate classifiers. I especially liked this paper where the authors apply distillation between the final classifier and intermediate classifiers, and this improved results a lot.
- Attempts at fashion generation and try-on. A lot of works in this field, but they do not really work for now; however, it is just a matter of time.
- Preregistration Workshop. The current scheme for ML conferences is not how science normally works: you should have a hypothesis and then run experiments to prove/disprove it, but in ML papers the pipeline is the opposite: having the results, you hypothesize to explain them, which is known as HARKing (Hypothesizing After the Results are Known). The consequences are high positive bias, SOTA-hacking (a paper is much more likely to be accepted if it claims to beat SOTA) and, more importantly, poor generalization of the results (say, you conducted experiments on 10 classification datasets and achieved SOTA on 3 of them; you publish a paper with these 3 and completely omit the other 7, which is absolutely terrible for science). The idea of preregistration is to fight this problem by separating hypothesis generation and hypothesis validation. It looks like a very promising direction despite having lots of problems. This is probably the most important idea I heard at the entire conference. Preregistration workshop link. Here is the link to the video which clearly explains the problem (I will post it when it becomes available).
- [7/10][They are independent] Are Adversarial Robustness and Common Perturbation Robustness Independent Attributes?
- [8/10][Crop a patch from one image and paste it into another + mix ground truth labels proportionally to the areas] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features SOTA augmentation strategy, also improves transfer learning.
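A minimal sketch of the CutMix idea (my own illustration, not the authors' code; `alpha` is the Beta-distribution parameter):

```python
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    # Paste a random box from a shuffled copy of the batch and mix the
    # labels proportionally to the pasted area.
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    H, W = images.shape[2:]
    # Sample a box covering roughly (1 - lam) of the image area.
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip([cy - rh // 2, cy + rh // 2], 0, H)
    x1, x2 = np.clip([cx - rw // 2, cx + rw // 2], 0, W)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lambda from the actual (clipped) box area.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    return images, labels, labels[perm], lam

# The training step then mixes the losses accordingly:
# loss = lam * ce(logits, y_a) + (1 - lam) * ce(logits, y_b)
```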
- [7/10][Find the best augmentation policy along with network training] Online Hyper-parameter Learning for Auto-Augmentation Strategy Gives slight improvements in quality, but is in fact less time-efficient for the search than this
- [10/10][OctConv is both faster and more accurate; drop-in replacement for vanilla conv] Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution. The trick is that the final architecture has to be optimized (in terms of framework matrix operations), otherwise it will actually be slower. Gives up to 30% speed-up with better accuracy. The idea of the paper is to explicitly decompose features into high-frequency (H, W, C_h) and low-frequency (H // octave, W // octave, C_l) parts, process them separately and then exchange the information.
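A minimal sketch of the octave convolution structure (assuming octave factor 2, even spatial dims, and the same low-frequency ratio `alpha` on input and output; a naive implementation, not the optimized one the note above asks for):

```python
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    # Four cross-frequency convolutions with up/downsampling to exchange
    # information between the high- and low-frequency branches.
    def __init__(self, in_ch, out_ch, k=3, alpha=0.5):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        p = k // 2
        self.hh = nn.Conv2d(in_h, out_h, k, padding=p)  # high -> high
        self.hl = nn.Conv2d(in_h, out_l, k, padding=p)  # high -> low
        self.lh = nn.Conv2d(in_l, out_h, k, padding=p)  # low  -> high
        self.ll = nn.Conv2d(in_l, out_l, k, padding=p)  # low  -> low

    def forward(self, x_h, x_l):
        h = self.hh(x_h) + F.interpolate(self.lh(x_l), scale_factor=2)
        l = self.ll(x_l) + self.hl(F.avg_pool2d(x_h, 2))
        return h, l
```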
- [9/10][Suspicious layer which surprisingly improves both accuracy and speed and is a drop-in replacement for vanilla convolution] Dynamic Multi-scale Filters for Semantic Segmentation Replaces vanilla conv with the following 2-branch structure: the first branch computes a KxK kernel via adaptive_pool(KxK) -> conv1x1; the second branch applies a 1x1 conv to the features; then the 2 branches are merged via a depthwise conv whose kernel comes from the first branch, followed by an additional 1x1 conv. The ablation study shows that it can give +7% mIoU compared to vanilla conv. Now why is it suspicious? The computed kernel's top-left element is essentially taken from the image's top-left part, and the same holds for the bottom-right kernel element (it is essentially the image's bottom-right part). And these very different elements are applied to very similar local features. My intuition fails to explain why this makes sense; maybe we should add an extra global pooling for the kernel and convolve the kernel with it first.
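A sketch of that 2-branch structure as I understood it (my reading, not the authors' code; the per-sample depthwise conv uses the standard `groups` trick):

```python
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilter(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        self.kernel_gen = nn.Conv2d(ch, ch, 1)  # branch 1: 1x1 after adaptive pool
        self.feat = nn.Conv2d(ch, ch, 1)        # branch 2: 1x1 on the features
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Branch 1: a KxK depthwise kernel computed from the input itself.
        kernel = self.kernel_gen(F.adaptive_avg_pool2d(x, self.k))  # (B, C, K, K)
        feats = self.feat(x)
        # Depthwise conv with a per-sample kernel via grouped convolution.
        feats = feats.reshape(1, b * c, h, w)
        kernel = kernel.reshape(b * c, 1, self.k, self.k)
        y = F.conv2d(feats, kernel, padding=self.k // 2, groups=b * c)
        return self.out(y.reshape(b, c, h, w))
```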
- [6/10][Essentially local self-attention with inefficient (not optimized) computation] Local Relation Networks for Image Recognition
- [4/10][Learned pooling works slightly better but slower] LIP: Local Importance-based Pooling
- [4/10][Learned pooling with global features; slightly better but slower] Global Feature Guided Local Pooling
- [9/10][SOTA on cityscapes-val, proposed global-local context module to aggregate multidimensional features] Adaptive Context Network for Scene Parsing
- [8/10][Fast and efficient feature upsampling with very little overhead] CARAFE: Content-Aware ReAssembly of FEatures
- [5/10][Local visual attention computed in autoregressive order] AttentionRNN: Structured Spatial Attention Mechanism
- [10/10][Non-uniform downsampling of high-resolution images] Efficient Segmentation: Learning Downsampling Near Semantic Boundaries. A network creates a non-uniform downsampling grid aimed at giving more space to semantic boundaries. The results are noticeably better than with uniform downsampling. 3 steps: 1) non-uniform downsampling (the image is downsampled to a very small resolution (32x32 or 64x64, for example); at this resolution the downsampling network is trained, with ground truth derived from a reasonable optimization problem on the ground truth segmentation map) == very fast stage; 2) the main segmentation network runs on the non-uniformly downsampled image; 3) the result is upsampled (which can be done since we know the downsampling strategy).
- [9/10][Replace the global discriminator with a "gambler" that bets CE-weights on pixels where the segmenter is wrong] I Bet You Are Wrong: Gambling Adversarial Networks for Structured Semantic Segmentation Instead of a global discriminator between the ground truth and predicted segmentation maps they use a "gambler" which maps (image, predicted segmap) to CE-weights so as to maximize sum(weights * CELoss). Seems to improve performance a lot compared to previous adversarial training approaches. An additional benefit is that the gambler never sees the GT, so it is less sensitive to errors in the GT.
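A sketch of the two objectives as I understood them (hypothetical `segnet`/`gambler` modules; the gambler is assumed to output per-pixel betting weights normalized to sum to 1):

```python
import torch.nn.functional as F

def gambling_losses(segnet, gambler, image, target):
    logits = segnet(image)                                  # (B, C, H, W)
    ce = F.cross_entropy(logits, target, reduction='none')  # (B, H, W)
    bets = gambler(image, logits.softmax(1).detach())       # (B, H, W)
    gambler_loss = -(bets * ce.detach()).sum()  # gambler maximizes weighted CE
    segnet_loss = (bets.detach() * ce).sum()    # segmenter minimizes it
    return segnet_loss, gambler_loss
```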
- [6/10][Self-attention on ASPP features, flattened + concatenated] Asymmetric Non-local Neural Networks for Semantic Segmentation Instead of global self-attention (which is very costly) they 1) use ASPP; 2) flatten all ASPP maps; 3) concatenate the resulting 1x1 maps; 4) apply attention over these concatenated features (which means you can select 0.1 * global (1x1) pool + 0.3 * 2x2pool[0,0] + 0.01 * 2x2pool[0,1] + ...). The module clearly improves the final metric.
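A rough sketch of the asymmetry (my interpretation: the pooled positions serve as keys/values, so the cost drops from O((HW)^2) to O(HW * S), with S the small number of pooled positions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricAttention(nn.Module):
    def __init__(self, ch, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        # Keys/values come only from a small pooling pyramid.
        kv = torch.cat([F.adaptive_avg_pool2d(x, s).flatten(2)
                        for s in self.pool_sizes], dim=2)  # (B, C, S)
        k = self.k(kv.unsqueeze(-1)).squeeze(-1)           # (B, C', S)
        v = self.v(kv.unsqueeze(-1)).squeeze(-1)           # (B, C, S)
        attn = torch.softmax(q @ k, dim=-1)                # (B, HW, S)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)   # (B, C, HW)
        return x + out.reshape(b, c, h, w)
```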
- [8/10][Per-class center computation (based on coarse segmentation) + attention on the centers -> fine segmentation] ACFNet: Attentional Class Feature Network for Semantic Segmentation
- [8/10][Separate (cheap) shape stream from image gradients and dual-task learning (shape + segmentation)] Gated-SCNN: Gated Shape CNN for Semantic Segmentation. 1) a very cheap 3-layer shape stream which takes image gradients + 1st-layer CNN features and exchanges information with the main backbone via a gating mechanism; 2) a dual loss (edge detection + semantic segmentation) + a consensus regularization penalty (checks that the semantic segmentation output is consistent with the predicted edges)
- [8/10][Another approach to using boundaries: first learn the boundary as an (N+1)-th class, then introduce UAGs and some crazy stuff] Boundary-Aware Feature Propagation for Scene Segmentation
- EGNet: Edge Guidance Network for Salient Object Detection
- Selectivity or Invariance: Boundary-Aware Salient Object Detection
- Stacked Cross Refinement Network for Edge-Aware Salient Object Detection
- [6/10][Detect unknown objects using optical flow] Towards Segmenting Everything That Moves
- [5/10][Typically works better and established SOTA, but obviously slower; nothing surprising] Recurrent U-Net for Resource-Constrained Segmentation. Recurrence in several layers close to the lowest-resolution ones.
- [???][Reformulate the loss for convex objects; I didn't understand it; looks like a computational geometry thing, so I can't say how useful it is with NNs] Convex Shape Prior for Multi-Object Segmentation Using a Single Level Set Function
- [9/10][Learn prototypes and coefficients to combine them; can be 3-10x faster than MaskRCNN with comparable accuracy] YOLACT: Real-time Instance Segmentation Each anchor predicts bbox + classes + prototype weights. A separate branch predicts the prototypes.
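The mask assembly itself is a single linear combination; a minimal sketch:

```python
import torch

def assemble_masks(prototypes, coeffs):
    # prototypes: (K, H, W) from the prototype branch,
    # coeffs: (N_det, K) predicted per detection.
    # Each detection's mask is a weighted sum of prototypes + sigmoid.
    return torch.sigmoid(torch.einsum('nk,khw->nhw', coeffs, prototypes))
```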
- [9/10][Backbone + point proposal -> mask of the object at that point] AdaptIS: Adaptive Instance Selection Network The proposed network can generate an instance mask given a point on that instance. The backbone extracts features. Features + point -> small net with AdaIN (where the norm statistics are computed from the point info) -> instance mask. To get all objects in the image, the authors trained a separate "point proposal" branch, which is trained after everything else is frozen and predicts the binary label "will this point be good for object mask prediction?". From this branch the top k% of points are sampled and used for predicting objects.
- [7/10][Dense object detection by simply predicting bbox coordinates. Simple and efficient.] FCOS: Fully Convolutional One-Stage Object Detection Directly predict the 4D distances to the object (top, left, bottom, right) in each foreground pixel + NMS -> SOTA + simplicity + easy integration with depth prediction + no anchor computations (i.e. IoU) and no anchor hyperparameters. Details: 1) each object is predicted on only one feature map during training, and at test time, if multiple resolutions fire for an object, only the smallest is chosen; 2) a proposed "centerness" branch predicts the normalized ([0, 1]) distance between a pixel and the object center; its output is multiplied by the classification score in NMS.
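The centerness target is simple to compute from the four predicted distances; a sketch:

```python
import torch

def centerness(l, t, r, b):
    # 1 at the box center, approaches 0 near its edges; used to
    # down-weight low-quality detections far from object centers.
    return torch.sqrt((torch.min(l, r) / torch.max(l, r)) *
                      (torch.min(t, b) / torch.max(t, b)))
```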
- [6/10][Global scale predicted in each resnet block, dilations are selected based on that] POD: Practical Object Detection with Scale-Sensitive Network
- [6/10][To improve detection of objects at different scales, replace some convs with 3 conv branches with shared params but different dilations] Scale-Aware Trident Networks for Object Detection
- [7/10][Change of target: bbox -> reppoints, arbitrary points whose outline locates the object accurately] RepPoints: Point Set Representation for Object Detection These reppoints may be iteratively refined during prediction; they are learned via localization and classification losses.
- [6/10][CycleGAN (clean image <-> image with pathology) on small regions specified by masks] Generative Modeling for Small-Data Object Detection
- [5/10][2-stage detector with a smaller backbone - works close to real-time on GPU] ThunderNet: Towards Real-time Generic Object Detection on Mobile Devices
- SNIDER: Single Noisy Image Denoising and Rectification for Improving License Plate Recognition
- State-of-the-Art in Action: Unconstrained Text Detection
- Convolutional Character Networks
- Large-Scale Tag-Based Font Retrieval with Generative Feature Learning
- Chinese Street View Text: Large-Scale Chinese Reading with Partially Supervised Learning
- TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting
- Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
- What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis
- Towards Unconstrained End-to-End Text Spotting
- Controllable Artistic Text Style Transfer via Shape-Matching GAN
- [10/10 BEST PAPER ICCV2019] SinGAN: Learning a Generative Model from a Single Natural Image Uses generators and discriminators at multiple resolutions and trains on patches of a single image. Multiple applications without additional training, including super-resolution, image editing, single-image animation, paint2image. video
- [8/10] InGAN: Capturing and Retargeting the "DNA" of a Natural Image A GAN trained on patches of a single image, able to produce similar images of different shapes.
- [6/10][Single-net adversarial attack for multiple target classes] Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once A single model produces adversarial examples towards any class (surprisingly, all previous works use one model per class; in this work - one model for all classes)
- [8/10] FUNIT: Few-Shot Unsupervised Image-to-Image Translation
- [7/10] Lifelong GAN: Continual Learning for Conditional Image Generation Given access to previous models and new data, the task is to be able to generate both new and old classes.
- [8/10] PuppetGAN: Cross-Domain Image Manipulation by Demonstration Manipulate separate attributes of an image (e.g. mouth, rotation, lighting, etc.) from a target image
- [7/10][A couple of tricks to make "aging" more personalized] S2GAN: Sharing Aging Factors Across Ages and Sharing Aging Trends Among Individuals
- [7/10][The user slightly edits the image with a sketch in a certain place -> realistic edited image] SC-FEGAN: Face Editing Generative Adversarial Network with User's Sketch and Color
- [10/10][Adapt a pretrained GAN to new classes and domains (even with a 100-sample dataset) by training only batch statistics] Image Generation From Small Datasets via Batch Statistics Adaptation With a large pretrained generator (e.g. BigGAN), train only the BatchNorm params (gamma and beta) and that's it - works even on very small datasets, and the results look very good!
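A sketch of the recipe (assumptions mine; the real BigGAN uses conditional BatchNorm, which this simplification ignores):

```python
import torch
from torch import nn

def adapt_batchnorm_only(generator: nn.Module, lr=1e-4):
    # Freeze every weight of the pretrained generator ...
    for p in generator.parameters():
        p.requires_grad = False
    # ... except the BatchNorm affine params (gamma/beta),
    # which get fine-tuned on the small target dataset.
    trainable = []
    for m in generator.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.requires_grad = True  # gamma
            m.bias.requires_grad = True    # beta
            trainable += [m.weight, m.bias]
    return torch.optim.Adam(trainable, lr=lr)
```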
- [9/10][Increased stability + SOTA GAN metrics; a different WGAN term to optimize the quadratic Wasserstein distance] Wasserstein GAN With Quadratic Transport Cost
- [9/10][Spectral regularization >> spectral norm (in terms of both stability and final results)] Spectral Regularization for Combating Mode Collapse in GANs
- [9/10] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models video
- Markov Decision Process for Video Generation
- [source person -> pose; target person + source pose -> synthesis] Dance Dance Generation: Motion Transfer for Internet Videos
- Everybody Dance Now (University of California) video
(also SinGAN and InGAN)
- Boundless: Generative Adversarial Network for Image Extension
- Very Long Natural Scenery Image Prediction by Outpainting
- [examples really look like a simple color transform] Photorealistic Style Transfer via Wavelet Transforms
- A Closed-Form Solution to Universal Style Transfer
- Understanding Whitening and Coloring Transform for Universal Style Transfer
- [5/10][Style transfer on the entire image + semantic segmentation masks = style transfer for selected object classes] Class-Based Styling: Real-Time Localized Style Transfer with Semantic Segmentation
In general all these methods still work quite poorly, but at least they work somehow.
- FW-GAN: Flow-Navigated Warping GAN for Video Virtual Try-on
- Personalized Fashion Design (Cong Yu et al.)
- [9/10][Sampling from a random graph model gives results comparable with current SOTA NAS, which suggests that what current NAS methods do is not really better than random search] Exploring Randomly Wired Neural Networks for Image Recognition
- [8/10][Most classification loss functions can be derived from a parametrized loss function with 2 params; the authors searched for the optimal values of these params] AM-LFS: AutoML for Loss Function Search
- [9/10][Improvements vs handcrafted GANs on Cifar10 in terms of IS] AutoGAN: Neural Architecture Search for Generative Adversarial Networks
- [8/10][An evaluator predicts how likely a model is to have a lower validation score] One-Shot Neural Architecture Search via Self-Evaluated Template Network
My knowledge of compression techniques is quite limited, so do not really trust these quality marks.
- [8/10] Automated Multi-Stage Compression of Neural Networks - tensor decompositions, two repetitive steps: compression and fine-tuning; 10-15x compression rate with 1-2% metric drop (depending on the dataset). pytorch code
- [8/10][4-bit quantization + finetuning] DSConv: Efficient Convolution Operator
- [6/10][A YOLO compression success story, known techniques applied properly] SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications
- Accelerate CNN via Recursive Bayesian Pruning
- [6/10][Speed-quality tradeoff without retraining, but the results are worse than SOTA] Adaptive Inference Cost With Convolutional Neural Mixture Models The idea is to work with a mixture of nets (each layer may be applied or skipped). The inference cost is O(N*(N-1)/2), where N is the number of layers - which is relatively slow. As in pruning, we omit some layers, thus gaining some speedup. The main benefit is that the net does not need to be retrained, but the approach seems complicated to implement and works worse in quality compared to SOTA.
- Workshop: Compact and Efficient Feature Representation and Learning in Computer Vision 2019
- Real-Time Aerial Suspicious Analysis (ASANA): System for Identification and Re-Identification of Suspicious Individuals in Crowds Using the Bayesian ScatterNet Hybrid Network
- Detecting the Unexpected by Image Resynthesis
- Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection
- [visual sound localization and separation] The Sound of Motions
- [sound separation by predicting M mixture components, then subtracting their masks one by one, then refining the separate sounds] Recursive Visual Sound Separation Using Minus-Plus Net
- [9/10][Loss for classification+clustering designed specifically for imbalanced data; good separation of embeddings in the vector space] Gaussian Affinity for Max-Margin Class Imbalanced Learning
- [7/10][A 3-player adversarial game between a convex generator, a multi-class classifier network, and a real/fake discriminator to perform oversampling in deep learning systems. The convex generator generates new samples from the minority classes as convex combinations of existing instances, aiming to fool both the discriminator and the classifier into misclassifying the generated samples] Generative Adversarial Minority Oversampling
- [9/10][Amazing! Auxiliary intermediate classifiers are revived, and distillation among them improves quality by 0.8-4%] Be Your Own Teacher: Improve the Performance of CNNs via Self-Distillation
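A sketch of the distillation objective (hyperparameters `T` and `alpha` are my placeholders, not the paper's exact values):

```python
import torch.nn.functional as F

def self_distillation_loss(inter_logits, final_logits, target, T=3.0, alpha=0.3):
    # Hard-label CE for the final classifier ...
    loss = F.cross_entropy(final_logits, target)
    # ... while each auxiliary head learns from both the labels and the
    # softened predictions of the deepest (final) classifier.
    soft = F.softmax(final_logits.detach() / T, dim=1)
    for logits in inter_logits:
        ce = F.cross_entropy(logits, target)
        kd = F.kl_div(F.log_softmax(logits / T, dim=1), soft,
                      reduction='batchmean') * T * T
        loss = loss + (1 - alpha) * ce + alpha * kd
    return loss
```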
- [8/10][More accurate networks are often worse teachers; the solution is to stop training the teacher model early; the results generalize across models and datasets] On the Efficacy of Knowledge Distillation
- [10/10][Self-supervised pretraining > Imagenet pretraining; comprehensive benchmarking of self-supervised approaches on different datasets] Scaling and Benchmarking Self-Supervised Visual Representation Learning
- [9/10][SOTA on semi-supervised] S4L: Self-Supervised Semi-Supervised Learning
- [9/10][Dynamic cross-entropy smoothing; very simple to implement and seems to work slightly better than OHEM, Center Loss and others] Anchor Loss: Modulating Loss Scale Based on Prediction Difficulty
- [8/10][Softmax with multiple centers per class > TripletLoss (no need to sample triplets and better results); proof that smoothed softmax loss minimization == soft triplet loss minimization, so it is easier to optimize the former as it does not require sampling] SoftTriple Loss: Deep Metric Learning Without Triplet Sampling
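A simplified sketch of the multiple-centers idea (hard max over a class's centers instead of the paper's smoothed aggregation; dimensions are illustrative):

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiCenterSoftmax(nn.Module):
    def __init__(self, dim, n_classes, centers_per_class=5, scale=20.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, centers_per_class, dim))
        self.scale = scale

    def forward(self, emb, target):
        emb = F.normalize(emb, dim=1)
        centers = F.normalize(self.centers, dim=2)
        # Cosine similarity to every center, then max over each class's centers.
        sim = torch.einsum('bd,ckd->bck', emb, centers).max(dim=2).values
        return F.cross_entropy(self.scale * sim, target)
```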
- [6/10][Single-side loss function overestimation] Continual Learning by Asymmetric Loss Approximation With Single-Side Overestimation
- Explaining Neural Networks Semantically and Quantitatively
- Fooling Network Interpretation in Image Classification
- Seeing What a GAN Cannot Generate
- Subspace Structure-Aware Spectral Clustering for Robust Subspace Clustering
- Invariant Information Clustering for Unsupervised Image Classification and Segmentation
- GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions
- Deep Comprehensive Mining for Image Clustering
- [9/10][Human uncertainty for classification labels (0.6 dog, 0.4 cat) works better than human-free smoothing methods such as label smoothing] Human Uncertainty Makes Classification More Robust
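Training against such soft human labels is just cross-entropy with a distribution target; a sketch:

```python
import torch.nn.functional as F

def soft_label_loss(logits, human_probs):
    # human_probs: (B, C) rows of annotator label frequencies summing to 1,
    # e.g. [0.6 dog, 0.4 cat], used instead of one-hot targets.
    return -(human_probs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```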
- [5/10][Forcing machines to look at the same regions as humans helps, but what about the annotation cost?] Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
- Seeing Motion in the Dark
- Learning to See Moving Objects in the Dark
- [8/10][Imagenet pretraining may improve convergence speed but does not necessarily lead to better results; training from scratch is better] Rethinking ImageNet Pre-training
- [9/10][Learn multiple prototypes per class to detect noisy labels; train on both noisy and pseudo labels; no need to clean the data; no assumptions on a specific noise distribution; SOTA for noisy classification] Deep Self-Learning From Noisy Labels
- [9/10][Fast second-order optimizer (backward cost ~ 2-3x the forward cost)] Small Steps and Giant Leaps: Minimal Newton Solvers for Deep Learning Converges better than Adam, SGD & co.
- Selective Sparse Sampling for Fine-Grained Image Recognition
- Dynamic Anchor Feature Selection for Single-Shot Object Detection
- VideoBERT: A Joint Model for Video and Language Representation Learning
- PR Product: A Substitute for Inner Product in Neural Networks
- Deep Meta Metric Learning
- [9/10][Slow net on 1/N of the frames, fast net on the other (N-1)/N] SlowFast Networks for Video Recognition
- [6/10][One of many works on domain adaptation] Self-Training With Progressive Augmentation for Unsupervised Person Re-Identification
- Learning to Paint With Model-Based Deep Reinforcement Learning
- Joint Demosaicing and Denoising by Fine-Tuning of Bursts of Raw Images
- Improving CNN Classifiers by Estimating Test-Time Priors
- Joint Acne Image Grading and Counting via Label Distribution Learning
- [RANSAC-like algorithm to fit arbitrary shapes in arbitrary counts] Progressive-X: Efficient, Anytime, Multi-Model Fitting Algorithm
- Noise Flow: Noise Modeling With Conditional Normalizing Flows
- [comic colorization] Tag2Pix: Line Art Colorization Using Text Tag With SECat and Changing Loss
- Learning Lightweight Lane Detection CNNs by Self Attention Distillation
- Transductive Learning for Zero-Shot Object Detection
- Book "Explainable AI: Interpreting, explaining and visualizing deep learning"