MultiModal Action Conditioned Video Generation
Abstract
We present a framework for generating multi-sensory videos that capture realistic visual, auditory, and tactile signals from physical interactions. Our approach learns a unified multimodal representation that enables action-conditioned video generation across different sensory modalities. By training on paired multi-sensory data, our model learns to synthesize coherent videos where the visual content, accompanying sounds, and tactile feedback are temporally aligned and physically plausible. We demonstrate that our multimodal representation effectively captures the correlations between different sensory modalities, enabling applications such as cross-modal generation, sensory completion, and action-conditioned simulation of multi-sensory experiences.
Method Overview
Our framework learns a shared latent space for multiple sensory modalities, enabling coherent multi-sensory video generation conditioned on actions. The model captures temporal dynamics and cross-modal correlations to produce physically plausible multi-sensory outputs.
Visual Encoder
Extracts rich visual features from video frames, capturing appearance and motion.
Audio Encoder
Processes audio spectrograms to capture sound characteristics and temporal patterns.
Tactile Encoder
Encodes tactile sensor readings to represent touch and physical interactions.
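To make the encoder stage concrete, below is a minimal PyTorch-style sketch of three modality-specific encoders projecting frames, spectrograms, and tactile readings into a shared embedding dimension. The class names (VisualEncoder, AudioEncoder, TactileEncoder), layer choices, and sizes are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch of the three modality encoders mapping into a shared
# embedding dimension; architectures and sizes are illustrative, not the
# paper's exact design.
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed shared embedding size

class VisualEncoder(nn.Module):
    """Encodes video frames (B, T, 3, H, W) into per-frame embeddings."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.backbone = nn.Sequential(          # small per-frame CNN
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
    def forward(self, frames):                  # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))
        return feats.view(b, t, -1)             # (B, T, D)

class AudioEncoder(nn.Module):
    """Encodes log-mel spectrograms (B, 1, mel_bins, time) into embeddings."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
    def forward(self, spec):                    # (B, 1, M, T)
        return self.net(spec)                   # (B, D)

class TactileEncoder(nn.Module):
    """Encodes a sequence of tactile sensor readings (B, T, sensor_dim)."""
    def __init__(self, sensor_dim=64, embed_dim=EMBED_DIM):
        super().__init__()
        self.rnn = nn.GRU(sensor_dim, embed_dim, batch_first=True)
    def forward(self, tactile):                 # (B, T, sensor_dim)
        _, h = self.rnn(tactile)
        return h[-1]                            # (B, D)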
Unified Multimodal Representation
Cross-Modal Alignment
Our representation learning approach aligns features from different sensory modalities into a shared embedding space. This enables the model to understand the inherent correlations between what we see, hear, and feel during physical interactions.
- Contrastive learning aligns paired multi-sensory data
- Temporal synchronization preserves event timing
- Physics-aware constraints ensure plausibility
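As a concrete instance of the contrastive objective, the following sketch implements a symmetric InfoNCE-style loss over paired embeddings from two modalities (e.g., visual and audio features of the same clip). The temperature value and the one-positive-pair-per-batch-element setup are our assumptions, not necessarily the paper's exact formulation.

# Sketch of a symmetric contrastive alignment loss between two modalities.
# Temperature and the one-positive-per-row assumption are illustrative.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) paired embeddings; row i of each comes from the same clip."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: match a->b and b->a against the diagonal positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))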
Semantic Binding
The learned representation captures semantic relationships between modalities—understanding that a hammer striking wood produces both a specific visual deformation and a characteristic impact sound, along with corresponding tactile vibrations.
- Object-centric representations bind modalities
- Material properties influence all modalities
- Action context shapes expected sensory outputs
Action-Conditioned Generation
Diffusion-Based Video Synthesis
We employ a latent diffusion model that operates in the unified multimodal space, enabling generation of coherent multi-sensory videos. The model is conditioned on actions to simulate realistic physical interactions.
Given an initial state and action sequence, our model iteratively denoises latent representations to produce temporally consistent video frames, synchronized audio waveforms, and corresponding tactile signals.
1. Encode Initial State: extract multimodal features from the initial observation frame
2. Action Conditioning: inject action embeddings to guide the generation process
3. Iterative Denoising: progressively refine latent codes across all modalities
4. Decode Outputs: generate synchronized video, audio, and tactile signals
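The sketch below mirrors these four steps as a single sampling routine, assuming a standard DDPM-style update with a linear beta schedule; the component names (encoder, action_embed, denoiser, decoders) and the scheduler details are illustrative placeholders rather than the paper's exact implementation.

# Sketch of action-conditioned multi-sensory sampling, mirroring the four
# steps above. Component names and the DDPM-style update are assumptions.
import torch

@torch.no_grad()
def generate(initial_obs, actions, encoder, action_embed, denoiser, decoders,
             num_steps=50, latent_shape=(16, 64)):
    device = initial_obs.device
    # 1. Encode Initial State: multimodal features of the first observation.
    ctx = encoder(initial_obs)                               # (B, D)
    # 2. Action Conditioning: embed the action sequence as guidance.
    act = action_embed(actions)                              # (B, T, D)
    # 3. Iterative Denoising: refine a joint latent shared by all modalities,
    #    using a simple DDPM update with a linear beta schedule.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(ctx.size(0), *latent_shape, device=device)
    for t in reversed(range(num_steps)):
        t_batch = torch.full((ctx.size(0),), t, device=device, dtype=torch.long)
        eps = denoiser(z, t_batch, context=ctx, actions=act)  # predicted noise
        mean = (z - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + betas[t].sqrt() * noise
    # 4. Decode Outputs: modality-specific decoders read the same latent.
    return decoders["video"](z), decoders["audio"](z), decoders["tactile"](z)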
Training Pipeline
Stage 1: Representation Learning
Train modality-specific encoders with contrastive objectives to align representations across vision, audio, and tactile modalities in a shared embedding space.
Stage 2: Generative Training
Train the diffusion model to generate multi-sensory outputs conditioned on actions, leveraging the pre-trained multimodal representations for coherent synthesis.
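For concreteness, here is a minimal sketch of one Stage 2 training step under a standard epsilon-prediction (denoising score matching) objective; the module names and noise-schedule handling are assumptions about a typical implementation, not a description of the paper's code.

# Sketch of a single Stage 2 training step: noise a clean multimodal latent,
# predict the noise conditioned on actions, and regress it (epsilon prediction).
import torch
import torch.nn.functional as F

def diffusion_training_step(z0, actions, ctx, denoiser, action_embed, alpha_bars):
    """z0: clean multimodal latent (B, ...); alpha_bars: (num_steps,) cumulative alphas."""
    b = z0.size(0)
    t = torch.randint(0, alpha_bars.size(0), (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(b, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    eps_pred = denoiser(z_t, t, context=ctx, actions=action_embed(actions))
    return F.mse_loss(eps_pred, noise)                     # denoising score matching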
Training Objectives
- ℒ_align: cross-modal contrastive alignment loss for representation learning
- ℒ_diffusion: denoising score matching loss for video generation
- ℒ_sync: temporal synchronization loss across modalities
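Assuming the three terms are combined with scalar weights (the weighting scheme is our assumption; the paper may balance the terms differently), the overall training objective takes a form like:

ℒ_total = ℒ_align + λ_diff · ℒ_diffusion + λ_sync · ℒ_sync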
Applications
Cross-Modal Generation
Generate missing modalities from available ones: produce audio from silent video, or tactile signals from visual input.
Action Simulation
Simulate multi-sensory outcomes of actions for robotics training and interactive applications.
Video Prediction
Predict future frames with associated audio and tactile signals given past observations.
VR/AR Content
Create immersive multi-sensory experiences for virtual and augmented reality applications.
Robot Learning
Train robots with simulated multi-sensory feedback for manipulation and interaction tasks.
Material Understanding
Infer material properties from multi-sensory observations for physics-based reasoning.
Results
Our model generates coherent multi-sensory videos with aligned visual, audio, and tactile modalities. Below we show qualitative results demonstrating cross-modal generation capabilities.
Action → Multi-Sensory Generation
BibTeX
@inproceedings{li2025multimodal,
title={MultiModal Action Conditioned Video Generation},
author={Li, Yichen and Torralba, Antonio},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}