MultiModal Action Conditioned Video Generation

ICCV 2025

Abstract

We present a framework for generating multi-sensory videos that capture realistic visual, auditory, and tactile signals from physical interactions. Our approach learns a unified multimodal representation that enables action-conditioned video generation across different sensory modalities. By training on paired multi-sensory data, our model learns to synthesize coherent videos where the visual content, accompanying sounds, and tactile feedback are temporally aligned and physically plausible. We demonstrate that our multimodal representation effectively captures the correlations between different sensory modalities, enabling applications such as cross-modal generation, sensory completion, and action-conditioned simulation of multi-sensory experiences.

Method Overview

Our framework learns a shared latent space for multiple sensory modalities, enabling coherent multi-sensory video generation conditioned on actions. The model captures temporal dynamics and cross-modal correlations to produce physically plausible multi-sensory outputs.

Visual Encoder

Extracts rich visual features from video frames capturing appearance and motion.

Audio Encoder

Processes audio spectrograms to capture sound characteristics and temporal patterns.

Tactile Encoder

Encodes tactile sensor readings to represent touch and physical interactions.
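
A minimal sketch of how the three encoders could be structured is shown below. The backbones, embedding dimension, and input shapes are illustrative assumptions (simple convolutional and MLP projections into a shared 512-dimensional space); the page does not specify the actual architectures.

# Hypothetical modality encoders mapping video frames, audio spectrograms, and
# tactile readings into a shared D-dimensional space (architectures are
# illustrative assumptions, not the paper's actual backbones).
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """(B, T, 3, H, W) video frames -> (B, T, D) per-frame embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, frames):
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, D)
        return feats.view(b, t, -1)

class AudioEncoder(nn.Module):
    """(B, T, n_mels) spectrogram frames -> (B, T, D) embeddings."""
    def __init__(self, n_mels=80, dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, spec):
        return self.proj(spec)

class TactileEncoder(nn.Module):
    """(B, T, n_sensors) tactile readings -> (B, T, D) embeddings."""
    def __init__(self, n_sensors=16, dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_sensors, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, readings):
        return self.proj(readings)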

Unified Multimodal Representation

Cross-Modal Alignment

Our representation learning approach aligns features from different sensory modalities into a shared embedding space. This enables the model to understand the inherent correlations between what we see, hear, and feel during physical interactions.

  • Contrastive learning aligns paired multi-sensory data (see the sketch after this list)
  • Temporal synchronization preserves event timing
  • Physics-aware constraints ensure plausibility
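
The contrastive objective above can be illustrated with a symmetric InfoNCE loss over a batch of paired clips from two modalities. This is a minimal sketch assuming time-pooled (B, D) embeddings and a fixed temperature; it is not necessarily the exact alignment loss used by the authors.

# Symmetric InfoNCE between two modalities (illustrative sketch).
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) time-pooled embeddings of paired clips from two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Each clip should match its own pair in both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

One common way to extend such a pairwise loss to three modalities is to apply it to each pair (vision-audio, vision-touch, audio-touch) and average the results.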

Semantic Binding

The learned representation captures semantic relationships between modalities—understanding that a hammer striking wood produces both a specific visual deformation and a characteristic impact sound, along with corresponding tactile vibrations.

  • Object-centric representations bind modalities
  • Material properties influence all modalities
  • Action context shapes expected sensory outputs

Action-Conditioned Generation

Diffusion-Based Video Synthesis

We employ a latent diffusion model that operates in the unified multimodal space, enabling generation of coherent multi-sensory videos. The model is conditioned on actions to simulate realistic physical interactions.

Given an initial state and action sequence, our model iteratively denoises latent representations to produce temporally consistent video frames, synchronized audio waveforms, and corresponding tactile signals.
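
One way to realize this conditioning is a denoiser that receives the noisy multimodal latent, the diffusion timestep, and the action sequence, as in the sketch below. The transformer backbone, additive conditioning, and dimensions are assumptions for illustration; the page does not describe the actual architecture.

# Hypothetical action-conditioned denoiser over multimodal latents.
import torch
import torch.nn as nn

class ActionConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=512, action_dim=32, n_layers=4, n_heads=8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim))
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_t, t, actions):
        """z_t: (B, T, D) noisy latents; t: (B,) timesteps; actions: (B, T, A)."""
        cond = self.action_proj(actions) + self.time_embed(t.float()[:, None, None])
        h = self.backbone(z_t + cond)   # inject timestep and action conditioning additively
        return self.out(h)              # predicted noise, same shape as z_t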

Diffusion Process

  1. Encode Initial State: extract multimodal features from the initial observation frame.
  2. Action Conditioning: inject action embeddings to guide the generation process.
  3. Iterative Denoising: progressively refine latent codes across all modalities.
  4. Decode Outputs: generate synchronized video, audio, and tactile signals.
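
The four steps map onto a standard diffusion sampling loop. The sketch below uses a deterministic DDIM-style update and assumes a denoiser like the one sketched above plus per-modality decoders passed in as a dict; the actual sampler, noise schedule, and step count are not specified in the text.

# Schematic sampling loop following the four steps above (sampler details are assumptions).
import torch

@torch.no_grad()
def generate(denoiser, decoders, encode_initial_state, obs0, actions, alphas_cumprod):
    # Step 1: encode the initial observation into the multimodal latent space.
    z_init = encode_initial_state(obs0)                     # (B, T, D)
    z = torch.randn_like(z_init)                            # start from pure noise

    for t in reversed(range(len(alphas_cumprod))):
        t_batch = torch.full((z.size(0),), t, device=z.device)
        # Steps 2-3: predict noise conditioned on the action sequence, then take one
        # deterministic denoising step. The initial-state features z_init would also
        # be injected as conditioning in practice; omitted here for brevity.
        eps = denoiser(z, t_batch, actions)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
        z0_hat = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean latent
        z = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps

    # Step 4: decode synchronized video, audio, and tactile outputs from the final latent.
    return {name: decode(z) for name, decode in decoders.items()}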

Training Pipeline

Stage 1: Representation Learning

Train modality-specific encoders with contrastive objectives to align representations across vision, audio, and tactile modalities in a shared embedding space.

Contrastive Loss · Cross-Modal · Self-Supervised

Stage 2: Generative Training

Train the diffusion model to generate multi-sensory outputs conditioned on actions, leveraging the pre-trained multimodal representations for coherent synthesis.

Diffusion Loss · Action-Conditioned · Temporal
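
A single Stage 2 training step could then look like the standard epsilon-prediction objective below, computed on clean latents produced by the pre-trained Stage 1 encoders. The noise schedule and the choice of an MSE epsilon-prediction loss are illustrative assumptions.

# Illustrative Stage 2 step: denoising loss on action-conditioned multimodal latents.
import torch
import torch.nn.functional as F

def stage2_training_step(denoiser, z0, actions, alphas_cumprod):
    """z0: (B, T, D) clean multimodal latents from the pre-trained Stage 1 encoders."""
    b = z0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=z0.device)
    a_t = alphas_cumprod.to(z0.device)[t].view(b, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * noise    # forward diffusion
    eps_pred = denoiser(z_t, t, actions)                # action-conditioned denoiser
    return F.mse_loss(eps_pred, noise)                  # denoising loss (epsilon form)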

Training Objectives

  • L_align: cross-modal contrastive alignment loss for representation learning
  • L_diffusion: denoising score matching loss for video generation
  • L_sync: temporal synchronization loss across modalities
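
Assuming the three terms are combined as a weighted sum (the weighting coefficients are not stated on this page), the overall training objective can be written as

\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{align}} \, \mathcal{L}_{\mathrm{align}} + \lambda_{\mathrm{diff}} \, \mathcal{L}_{\mathrm{diffusion}} + \lambda_{\mathrm{sync}} \, \mathcal{L}_{\mathrm{sync}}

where the \lambda coefficients balance representation alignment, generation fidelity, and cross-modal synchronization.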

Applications

Cross-Modal Generation

Generate missing modalities from available ones—produce audio from silent video or tactile from visual.

Action Simulation

Simulate multi-sensory outcomes of actions for robotics training and interactive applications.

Video Prediction

Predict future frames with associated audio and tactile signals given past observations.

VR/AR Content

Create immersive multi-sensory experiences for virtual and augmented reality applications.

Robot Learning

Train robots with simulated multi-sensory feedback for manipulation and interaction tasks.

Material Understanding

Infer material properties from multi-sensory observations for physics-based reasoning.

Results

Our model generates coherent multi-sensory videos with aligned visual, audio, and tactile modalities. Below we show qualitative results demonstrating cross-modal generation capabilities.

Action → Multi-Sensory Generation

BibTeX

@inproceedings{li2025multimodal,
  title={MultiModal Action Conditioned Video Generation},
  author={Li, Yichen and Torralba, Antonio},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}