VA-RED2: Video Adaptive Redundancy Reduction

Bowen Pan1, Rameswar Panda2, Camilo Fosco1, Chung-Ching Lin3, Alex Andonian1, Yue Meng2, Kate Saenko2,4, Aude Oliva1,2, Rogerio Feris2

1 MIT CSAIL,   2 MIT-IBM Watson AI Lab,   3 Microsoft,   4 Boston University

[Paper]       [Code (coming soon)]


Our VA-RED2 framework dynamically reduces the redundancy in two dimensions. Example 1 (left) shows a case where the input video has little movement. The features in the temporal dimension are highly redundant, so our framework fully computes a subset of features, and reconstructs the rest with cheap linear operations. In the second example, we show that our framework can reduce computational complexity by performing a similar operation over channels: only part of the features along the channel dimension are computed, and cheap operations are used to generate the rest.


Main Idea

Our main goal is to automatically decide which feature maps to compute for each input video so that it is classified correctly with minimal computation. The intuition behind our method is that many feature maps are similar along the temporal and channel dimensions. For each video instance, we estimate the ratio of feature maps that need to be fully computed along the temporal and channel dimensions. We then reconstruct the remaining feature maps from the fully computed ones using cheap linear operations.
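The idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the released implementation: `expensive_op` stands in for a full convolution, `cheap_linear_op` is a hypothetical per-element affine map, and `keep_every` is a hypothetical parameter that controls how many frames are fully computed.

```python
from typing import List, Tuple

Vector = List[float]


def expensive_op(frame: Vector) -> Vector:
    """Hypothetical stand-in for a full convolution: squares each value."""
    return [x * x for x in frame]


def cheap_linear_op(feature: Vector, scale: float = 1.0, shift: float = 0.0) -> Vector:
    """Hypothetical cheap linear operation: per-element scale and shift."""
    return [scale * x + shift for x in feature]


def adaptive_temporal_features(
    frames: List[Vector], keep_every: int
) -> Tuple[List[Vector], int]:
    """Fully compute every `keep_every`-th frame and reconstruct the others.

    Returns the per-frame features and the number of expensive calls made.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    outputs: List[Vector] = []
    full_calls = 0
    last_full: Vector = []
    for t, frame in enumerate(frames):
        if t % keep_every == 0:
            # This frame is fully computed with the expensive operation.
            last_full = expensive_op(frame)
            full_calls += 1
            outputs.append(last_full)
        else:
            # This frame is reconstructed from the latest fully computed one.
            outputs.append(cheap_linear_op(last_full))
    return outputs, full_calls
```

With four frames and `keep_every=2`, only two frames go through `expensive_op`; the other two are derived from their predecessors by the cheap operation.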


Method


An illustration of dynamic convolution along the temporal dimension (a) and the channel dimension (b). $\phi_t$ and $\phi_s$ denote the temporal and spatial cheap operations, respectively. In (a), we multiply the temporal stride $S$ by the factor $R = 2^{p_t}$ to reduce computation, where $p_t$ is the temporal policy output by the soft modulation gate. In (b), we fully compute only a fraction $r = (1/2)^{p_c}$ of the output features, where $p_c$ is the channel policy.
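As a small worked example of the two factors in the caption, the sketch below maps a temporal policy to an effective stride S * 2^p_t and a channel policy to the number of fully computed output channels, C * (1/2)^p_c. The function names are hypothetical, and the policies are assumed to be non-negative integers; the rounding rule for channels is also an assumption.

```python
def temporal_stride(base_stride: int, p_t: int) -> int:
    """Effective temporal stride S * R, with R = 2 ** p_t."""
    if p_t < 0:
        raise ValueError("p_t is assumed to be a non-negative integer")
    return base_stride * 2 ** p_t


def full_channel_count(out_channels: int, p_c: int) -> int:
    """Number of output channels computed in full, using r = (1/2) ** p_c.

    Rounds down and keeps at least one channel (an assumption of this sketch).
    """
    if p_c < 0:
        raise ValueError("p_c is assumed to be a non-negative integer")
    return max(1, int(out_channels * 0.5 ** p_c))
```

For instance, with a base stride of 1 and `p_t = 2` the effective stride is 4, and a 64-channel layer with `p_c = 1` computes 32 channels in full and leaves the rest to the cheap operation.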


Qualitative Results


Qualitative examples on Mini-Kinetics-200. For each category, we plot the two input video clips that consume the most and the least computation, respectively. We run inference on these clips with an 8-frame dynamic R(2+1)D-18 model trained on Mini-Kinetics-200; the percentage indicates the ratio of the actual computational cost of the 2D convolutions to that of the original fixed model.
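The percentage in the caption is a cost ratio, which can be illustrated as follows. This is a simplified sketch, not the accounting used for the figure: the per-layer costs and the fraction of each layer that is fully computed are hypothetical inputs, and the cost of the cheap operations is ignored.

```python
from typing import Sequence


def cost_percentage(fixed_costs: Sequence[float], kept_fractions: Sequence[float]) -> float:
    """Dynamic cost as a percentage of the fixed model's cost.

    fixed_costs[i] is the cost of layer i in the fixed model, and
    kept_fractions[i] is the fraction of that layer computed in full.
    """
    if len(fixed_costs) != len(kept_fractions):
        raise ValueError("one kept fraction is required per layer")
    fixed_total = sum(fixed_costs)
    if fixed_total <= 0:
        raise ValueError("total fixed cost must be positive")
    dynamic_total = sum(c * f for c, f in zip(fixed_costs, kept_fractions))
    return 100.0 * dynamic_total / fixed_total
```

A clip whose policy keeps all of a 100-unit layer and half of a 300-unit layer would be reported at 62.5% of the fixed model's cost under these assumptions.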


Reference

B. Pan, R. Panda, C. Fosco, C. Lin, A. Andonian, Y. Meng, K. Saenko, A. Oliva, and R. Feris. VA-RED2: Video Adaptive Redundancy Reduction. ICLR 2021.