Contrastive Representation Distillation

Yonglong Tian
MIT CSAIL
Dilip Krishnan
Google
Phillip Isola
MIT CSAIL

Code [GitHub]
arXiv 2019 [Paper]


Drop-in replacement for knowledge distillation (KD), with state-of-the-art performance. Example usage:

loss_crd = CRDLoss(feat_student, feat_teacher, pos_index, neg_index)
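
As a rough illustration of how such a loss could slot into an ordinary distillation loop, here is a minimal PyTorch-style sketch. The names and signatures (CRDLoss as a callable, return_feat, pos_index, neg_index, the loss weights) follow the line above and are assumptions for illustration, not the repository's exact API:

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, labels,
                      crd_loss, pos_index, neg_index,
                      alpha=1.0, beta=0.8, T=4.0):
    # One training step combining the task loss, the standard KD term, and a
    # CRD-style contrastive term. `return_feat=True` is an assumed hook that
    # also returns penultimate-layer features.
    with torch.no_grad():
        logits_t, feat_teacher = teacher(images, return_feat=True)  # teacher is frozen
    logits_s, feat_student = student(images, return_feat=True)

    loss_ce = F.cross_entropy(logits_s, labels)                      # task loss
    loss_kd = F.kl_div(F.log_softmax(logits_s / T, dim=1),           # Hinton et al. KD
                       F.softmax(logits_t / T, dim=1),
                       reduction="batchmean") * (T * T)
    # pos_index / neg_index stand in for the indices of positive and contrast
    # (negative) samples, e.g. into a memory buffer in an NCE-based implementation.
    loss_crd = crd_loss(feat_student, feat_teacher, pos_index, neg_index)
    return loss_ce + alpha * loss_kd + beta * loss_crd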




Figure 1. The three distillation settings we consider: (a) compressing a model, (b) transferring knowledge from one modality (e.g., RGB) to another (e.g., depth), (c) distilling an ensemble of nets into a single network.

Highlights

(1) A contrastive objective for transferring knowledge between deep networks (a simplified sketch follows this list).

(2) Forges a connection between knowledge distillation and representation learning.

(3) Applications to model compression, cross-modal transfer, and ensemble distillation.

(4) A benchmark of 12 recent distillation methods; CRD outperforms all of them, with a 57% average relative improvement over the second-best method, which, surprisingly, is the original KD (Hinton et al.).
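
To make highlight (1) concrete, below is a minimal, batch-only sketch of the contrastive idea: the student's and teacher's embeddings of the same input form a positive pair, while embeddings of different inputs act as negatives. The paper's actual objective uses an NCE formulation with a large buffer of negatives and projection heads to a common embedding space; this simplified stand-in assumes the features are already projected to matching dimensions:

import torch
import torch.nn.functional as F

def contrastive_distill_loss(feat_student, feat_teacher, temperature=0.1):
    # Simplified in-batch InfoNCE: row i of the student should match row i of
    # the teacher (positive pair); every other row serves as a negative.
    s = F.normalize(feat_student, dim=1)              # (N, D) student embeddings
    t = F.normalize(feat_teacher.detach(), dim=1)     # (N, D) teacher embeddings, no gradient
    logits = s @ t.t() / temperature                  # (N, N) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)           # diagonal entries are the positives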

Benchmark Results

(1) Teacher and student are of the same architectural type.


Table 1. Benchmarking various distillation methods with teachers and students of the same architectural type. Red and green arrows denote underperformance and outperformance relative to the KD baseline, respectively.

(2) Teacher and student are of different architectural types.


Table 2. Benchmarking various distillation methods with teachers and students of different architectural types. Red and green arrows denote underperformance and outperformance relative to the KD baseline, respectively.

Abstract

Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation.
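
For reference, the KL-divergence objective mentioned in the abstract (the standard KD loss of Hinton et al.) can be written in a few lines of PyTorch. This is a generic sketch of that baseline, not code from the paper:

import torch.nn.functional as F

def kd_loss(logits_student, logits_teacher, T=4.0):
    # Match the temperature-softened teacher and student output distributions.
    # The T*T factor is the usual scaling that keeps gradient magnitudes
    # comparable across temperatures.
    log_p_s = F.log_softmax(logits_student / T, dim=1)
    p_t = F.softmax(logits_teacher / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)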