What Makes for Good Views for Contrastive Learning?

Yonglong Tian1   Chen Sun2   Ben Poole2   Dilip Krishnan2   Cordelia Schmid2   Phillip Isola1  

1 MIT CSAIL   2 Google Research  




Figure 1: Constructing views for contrastive learning is important, but what makes a good view? We hypothesize that good views should share only the label information relevant to the downstream task, while throwing away nuisance factors; we call this the InfoMin principle.

Abstract

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification ($73\%$ top-1 linear readout with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training.
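
Stated a bit more formally, the InfoMin principle can be summarized by the following objective (our paraphrase and notation: $v_1, v_2$ are the two views of an input $x$, $y$ is the downstream label, and $I(\cdot;\cdot)$ denotes mutual information):

$$(v_1^{*}, v_2^{*}) = \arg\min_{v_1, v_2} I(v_1; v_2) \quad \text{s.t.} \quad I(v_1; y) = I(v_2; y) = I(x; y)$$

That is, the optimal views share as little information as possible with each other, while each still retains all the information the input carries about the label.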

Code and Models

Methods

(1) Synthesizing Views: we use an unsupervised adversarial objective to minimize the mutual information between views, while using a supervised objective (on only a small amount of labeled data) to retain task-relevant information.


Figure 2: Schematic of contrastive representation learning with a learned view generator. An input image is split into two views using an invertible view generator. To learn the view generator, we optimize the losses in yellow: minimizing the mutual information between views while ensuring we can classify the object from each view. The encoders used to estimate mutual information are always trained to maximize the InfoNCE lower bound. After learning the view generator, we reset the weights of the encoders, and train with a fixed view generator without the additional supervised classification losses.
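
Below is a minimal PyTorch-style sketch of this training loop, assuming in-batch negatives for the InfoNCE estimator. The module names (`view_generator`, `encoder1`, `encoder2`, `classifier1`, `classifier2`) and the alternating update scheme are illustrative assumptions, not the exact released implementation.

```python
import math
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """InfoNCE lower bound on I(v1; v2) with in-batch negatives.

    z1, z2: (N, D) L2-normalized embeddings of the two views.
    """
    logits = z1 @ z2.t() / temperature                    # (N, N) similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return math.log(z1.size(0)) - F.cross_entropy(logits, labels)

def training_step(x, y, view_generator, encoder1, encoder2,
                  classifier1, classifier2, opt_enc, opt_gen):
    # 1) Encoders maximize the InfoNCE bound (views detached, so only the
    #    encoders are updated in this step).
    v1, v2 = view_generator(x)
    bound = info_nce(encoder1(v1.detach()), encoder2(v2.detach()))
    opt_enc.zero_grad()
    (-bound).backward()
    opt_enc.step()

    # 2) The view generator adversarially minimizes the re-estimated bound,
    #    reducing I(v1; v2), while supervised losses on the small labeled
    #    subset keep label information in each view.
    v1, v2 = view_generator(x)
    bound = info_nce(encoder1(v1), encoder2(v2))
    sup = F.cross_entropy(classifier1(v1), y) + F.cross_entropy(classifier2(v2), y)
    opt_gen.zero_grad()
    (bound + sup).backward()
    opt_gen.step()
```

After this view-learning stage, the encoder weights are reset and contrastive training is repeated with the view generator frozen and the supervised classification losses removed, as described in the caption above.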

(2) InfoMin Data Augmentation on ImageNet


Figure 3: The augmentation policy that we manually designed following the InfoMin principle. As can be seen from the left plot, lower $I_{\mathrm{NCE}}$ typically leads to higher accuracy, up to a turning point (which we may not have reached yet).
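
For reference, a hand-designed, InfoMin-style augmentation pipeline for ImageNet might look like the torchvision sketch below. The exact crop scale, jitter strength, and probabilities here are illustrative guesses rather than the released recipe; please consult the code for the actual values.

```python
from torchvision import transforms

# Illustrative InfoMin-style augmentation: stronger augmentation lowers the
# mutual information shared by two crops of the same image.
infomin_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Two independently augmented views of the same image form a positive pair:
# view1, view2 = infomin_aug(img), infomin_aug(img)
```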

Related Publications