In many vision problems, we want to infer two (or more) hidden factors that interact to produce our observations. We may want to disentangle illuminant color from object color in color constancy; rendering conditions from surface shape in shape-from-shading; face identity from head pose in face recognition; or font from letter class in character recognition. We refer to these two factors generically as ``style'' and ``content''.
We introduce a general framework for analyzing the style of a multimedia signal. We assume that we can observe a training signal under several different styles; this information is often available or can be generated. We then fit those data with a bilinear model that explicitly represents the two-factor nature of the observations. The result is a modular representation of the signal that allows independent manipulation of the two factors, style and content.
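As a minimal sketch of how such a fit can work: in the asymmetric form of a bilinear model, each observation is the product of a style-specific linear map and a content class vector, and both factors can be recovered jointly from a truncated SVD of the style-stacked observation matrix. The data, dimensions, and variable names below are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Synthetic data: S styles, C content classes, K-dimensional observations,
# model dimensionality J (all values hypothetical).
rng = np.random.default_rng(0)
S, C, K, J = 4, 6, 10, 3

# Ground-truth bilinear structure plus a little noise.
A_true = rng.normal(size=(S, K, J))   # style-specific linear maps
B_true = rng.normal(size=(J, C))      # content class vectors
Y = np.stack([A_true[s] @ B_true for s in range(S)])  # (S, K, C)
Y += 0.01 * rng.normal(size=Y.shape)

# Fit the asymmetric bilinear model y_sc ~ A_s b_c: stack the per-style
# observation matrices vertically and take a rank-J SVD truncation.
Y_stacked = Y.reshape(S * K, C)
U, sv, Vt = np.linalg.svd(Y_stacked, full_matrices=False)
A_fit = (U[:, :J] * sv[:J]).reshape(S, K, J)  # recovered style maps
B_fit = Vt[:J]                                # recovered content vectors

# For near-bilinear data the rank-J reconstruction should be close.
Y_hat = np.stack([A_fit[s] @ B_fit for s in range(S)])
err = np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)
print(err)
```

The recovered factors are only defined up to an invertible linear transformation between the style and content spaces, which is why the fit is naturally posed as a low-rank matrix approximation.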
We focus on three kinds of tasks: extrapolating the style of data to unseen content classes, classifying data with known content under a novel style, and translating two sets of data, generated in different styles and with distinct content, into each other's styles. We show examples from color constancy, face pose estimation, shape-from-shading, typography and speech.
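The extrapolation task can be sketched in the same bilinear setting: if the content vectors are already known from a prior fit, a new style's linear map can be solved for by least squares from observations of a few known content classes, and then applied to the unseen classes. The setup below is a self-contained toy example with hypothetical names; for brevity it uses the ground-truth content vectors in place of fitted ones, and the data are noiseless:

```python
import numpy as np

rng = np.random.default_rng(1)
S, C, K, J = 4, 6, 10, 3  # hypothetical dimensions

# Bilinear ground truth: S training styles plus one held-out style.
A_true = rng.normal(size=(S + 1, K, J))
B_true = rng.normal(size=(J, C))
B = B_true  # content vectors assumed known from a prior fit

# The new style is observed only on the first C-2 content classes.
seen = slice(0, C - 2)
Y_new_seen = A_true[S] @ B[:, seen]

# Solve Y = A_new B for the new style map by pseudoinverse (least squares).
A_new = Y_new_seen @ np.linalg.pinv(B[:, seen])

# Extrapolate: synthesize the unseen content classes in the new style.
Y_extrap = A_new @ B[:, C - 2:]
Y_truth = A_true[S] @ B[:, C - 2:]
err = np.linalg.norm(Y_extrap - Y_truth) / np.linalg.norm(Y_truth)
print(err)
```

With noiseless bilinear data and enough observed content classes (at least J of them), the least-squares solve recovers the new style map exactly, so the synthesized unseen classes match the ground truth.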
Style extrapolation in typography. The training data were all letters of the five fonts shown at left. The test data were all the Monaco letters except those shown at right. The synthesized Monaco letters compare well with the held-out ones.