Contrastive Multiview Coding

Yonglong Tian
MIT CSAIL
Dilip Krishnan
Google
Phillip Isola
MIT CSAIL

Code [GitHub]
arXiv 2019 [Paper]

Figure 1: Examples of multiview datasets and representations learned from them. Dotted lines represent the contrastive objective that encourages congruent views to be brought together in representation-space.

Abstract

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the left ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We hypothesize that a powerful representation is one that models view-invariant factors. Based on this hypothesis, we investigate a contrastive coding scheme, in which a representation is learned that aims to maximize mutual information between different views but is otherwise compact. Our approach scales to any number of views and is view-agnostic. The resulting learned representations achieve state-of-the-art performance on downstream tasks such as object classification, outperform formulations based on predictive learning or single-view reconstruction, and improve as more views are added.

Highlights

(1) Representation quality as a function of the number of contrasted views
We find that the more views we train with, the better the representation of each single view.

(2) Contrastive objective vs. predictive objective
We compare the contrastive objective to cross-view prediction and find an advantage for the contrastive approach.

(3) Unsupervised vs. supervised
A ResNet-101 trained with our unsupervised CMC objective surpasses AlexNet trained with full supervision on ImageNet classification.

Method

Our CMC learns by contrasting congruent and incongruent pairs of views. For each set of views, a deep representation is learned by bringing views of the same scene together in embedding space while pushing views of different scenes apart. Intuitively, this can be interpreted as discriminating samples from the joint distribution over views from samples drawn from the product of the marginals; in other words, CMC learns representations by maximizing mutual information between views.

Figure 2: Here we show an example of learning from two views: the luminance channel (L) of an image and the ab-color channel. The strawberry's L and ab channels embed to nearby points whereas the ab channel of a different image (a photo of blueberries) embeds to a far away point.
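
To make this concrete, below is a minimal PyTorch sketch of a two-view contrastive loss of the kind described above, where each anchor's congruent pair is the other view of the same image and the remaining images in the minibatch provide incongruent pairs. This is an illustrative sketch under stated assumptions, not the code from the repository linked above; the encoder names f_L and f_ab, the embedding size, and the temperature value are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoViewContrastiveLoss(nn.Module):
    """Batch-softmax (InfoNCE-style) contrastive loss between two views.

    For anchor embedding z1[i], the congruent pair is z2[i]; every other
    z2[j] in the batch acts as an incongruent (negative) pair.
    """
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, z1, z2):
        # z1, z2: (batch, dim) embeddings of two views of the same images
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # Pairwise cosine similarities, scaled by the temperature
        logits = z1 @ z2.t() / self.temperature
        # Congruent pairs sit on the diagonal of the similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        # Symmetrize: contrast view 1 against view 2 and vice versa
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

# Hypothetical usage with one encoder per view (e.g., L and ab channels):
#   loss = TwoViewContrastiveLoss()(f_L(l_batch), f_ab(ab_batch))

In practice, each congruent pair can be contrasted against a much larger set of negatives (e.g., drawn from a memory bank); the in-batch negatives above are the simplest stand-in.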

Acknowledgements

Thanks to Devon Hjelm for providing implementation details of Deep InfoMax, and to Zhirong Wu and Richard Zhang for helpful discussions and comments. This material is based upon work supported by Google Cloud.