FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations


Figure 1: Three illustrative examples of our fast concept learning task. Our model learns from three naturally occurring data streams: (a) looking at images and reading sentences that describe the objects in the scene, and (b) interpreting supplemental sentences that relate the novel concept to other concepts. (c) The acquired novel concepts (e.g., red and white-eyed vireo) transfer to downstream tasks, such as visual reasoning. We present a meta-learning framework to solve this task.


Figure 2: Overview of FALCON-G. FALCON-G starts by extracting object-centric representations and parsing input sentences into semantic programs. It executes the programs to locate the objects being referred to. The example objects, together with the reconstructed concept graphs, are fed into two GNNs sequentially to predict the concept embedding for "white-eyed vireo". The derived concept embedding can be used in downstream tasks such as visual reasoning.
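To make the two-stage read-out in Figure 2 concrete, below is a minimal sketch of how example objects and a concept graph could be pooled into an embedding for the novel concept. The class name `ConceptGraphGNN`, the tensor names `obj_feats` and `neighbor_emb`, and the specific MLP message-passing layers are illustrative assumptions, not FALCON-G's actual architecture.

```python
import torch
import torch.nn as nn

class ConceptGraphGNN(nn.Module):
    """Illustrative two-stage read-out: pool example objects, then refine
    with messages from known concepts related to the novel one.

    Shapes (assumed): obj_feats is (num_examples, dim); neighbor_emb holds
    the embeddings of related known concepts, shape (num_neighbors, dim).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.example_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.relation_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, obj_feats: torch.Tensor,
                neighbor_emb: torch.Tensor) -> torch.Tensor:
        # Stage 1: aggregate messages from the referred example objects.
        h = self.example_mlp(obj_feats).mean(dim=0)
        # Stage 2: aggregate messages from related known concepts,
        # conditioning each message on the stage-1 summary.
        msgs = self.relation_mlp(
            torch.cat([neighbor_emb, h.expand_as(neighbor_emb)], dim=-1))
        # Predicted embedding for the novel concept.
        return h + msgs.mean(dim=0)
```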

Abstract

We present a meta-learning framework for learning new visual concepts quickly, from just one or a few examples, guided by multiple naturally occurring data streams: simultaneously looking at images, reading sentences that describe the objects in the scene, and interpreting supplemental sentences that relate the novel concept to other concepts. The learned concepts support downstream applications, such as answering questions by reasoning about unseen images. Our model, named FALCON, represents individual visual concepts, such as colors and shapes, as axis-aligned boxes in a high-dimensional space (the "box embedding space"). Given an input image and its paired sentence, our model first resolves the referential expression in the sentence and associates the novel concept with particular objects in the scene. Next, our model interprets supplemental sentences to relate the novel concept to other known concepts, such as "X has property Y" or "X is a kind of Y". Finally, it infers an optimal box embedding for the novel concept that jointly 1) maximizes the likelihood of the observed instances in the image, and 2) satisfies the relationships between the novel concept and the known ones. We demonstrate the effectiveness of our model on both synthetic and real-world datasets.
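As a rough illustration of the box-embedding idea above, the sketch below scores (1) whether an object's feature vector falls inside a concept's axis-aligned box, and (2) how much one box is contained in another, which could model an "X is a kind of Y" relation. The corner parameterization, the softened membership score, and the softplus-based soft volume are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

class Box:
    """Axis-aligned box in R^d, parameterized by its two corners
    (an assumed parameterization, for illustration only)."""

    def __init__(self, lower: torch.Tensor, upper: torch.Tensor):
        self.lower, self.upper = lower, upper

def membership_logit(point: torch.Tensor, box: Box, tau: float = 0.1) -> torch.Tensor:
    # Soft per-axis indicator of lying between the two corners;
    # the sum of logs treats the axes as independent.
    inside = torch.sigmoid((point - box.lower) / tau) \
           * torch.sigmoid((box.upper - point) / tau)
    return inside.log().sum(-1)

def log_soft_volume(box: Box, tau: float = 0.1) -> torch.Tensor:
    # Softplus smooths the side lengths so degenerate boxes still give gradients.
    return F.softplus((box.upper - box.lower) / tau).log().sum(-1)

def entailment_score(child: Box, parent: Box, tau: float = 0.1) -> torch.Tensor:
    """Score for "child is a kind of parent": log of the fraction of the
    child's soft volume that lies inside the parent, i.e. log P(parent | child)."""
    inter = Box(torch.maximum(child.lower, parent.lower),
                torch.minimum(child.upper, parent.upper))
    return log_soft_volume(inter, tau) - log_soft_volume(child, tau)
```

Under this reading, learning a new concept amounts to choosing box corners that maximize the membership logits of the observed example objects while keeping the entailment scores against the related known boxes high, mirroring the joint objective described in the abstract.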


