Contact
Email: ganchuang [at] csail (dot) mit (dot) edu
News
Research Highlight
A Multi-Modal Interactive Physical Simulation Platform for
Computer Vision, Robotics and Cognitive Science
Publications (by date / by topic)
2023
![]() |
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
arXiv:2311.0145 |
![]() |
DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models
NeurIPS 2023 (Oral) |
![]() |
3D-LLM: Injecting the 3D World into Large Language Models
NeurIPS 2023 (Spotlight) |
![]() |
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
NeurIPS 2023 (Spotlight) |
![]() |
DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics
NeurIPS 2023 |
![]() |
Adaptive Online Replanning with Diffusion Models
NeurIPS 2023 |
![]() |
NeurIPS 2023 Dataset Track |
![]() |
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
ICCV 2023 |
![]() |
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
ICCV 2023 |
![]() |
Learning Vision-and-Language Navigation from YouTube Videos
ICCV 2023 |
![]() |
EMNLP 2023 |
![]() |
ModuleFormer: Learning Modular Large Language Models From Uncurated Data
Pre-print 2023 |
![]() |
Building Cooperative Embodied Agents Modularly with Large Language Models
Pre-print 2023 |
![]() |
Learning Neural Constitutive Laws from Motion Observations for Generalizable PDE Dynamics
ICML 2023 |
![]() |
Reparameterized Policy Learning for Multimodal Trajectory Optimization
ICML 2023 (Oral) |
![]() |
On the Forward Invariance of Neural ODEs
ICML 2023 |
![]() |
RoboNinja: Learning an Adaptive Cutting Policy for Multi-material Objects
RSS 2023 |
![]() |
JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions
ACL 2023 (Findings) |
![]() |
Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners
CVPR 2023 |
![]() |
3D Concept Learning and Reasoning from Multi-View Images
CVPR 2023 |
![]() |
EC²: Emergent Communication for Embodied Control
CVPR 2023 |
![]() |
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
CVPR 2023 |
![]() |
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
CVPR 2023 |
![]() |
Masked Motion Encoding for Self-Supervised Video Representation Learning
CVPR 2023 |
![]() |
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation
ICLR 2023 (Spotlight) |
![]() |
ICLR 2023 (Spotlight) |
![]() |
SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments
ICLR 2023 |
![]() |
Planning with Large Language Models for Code Generation
ICLR 2023 |
![]() |
ICLR 2023 |
![]() |
Hyper-Decision Transformer for Efficient Online Policy Adaptation
ICLR 2023 |
2022
![]() |
Learning Neural Acoustic Fields
NeurIPS 2022 |
![]() |
Learning Physical Dynamics with Subequivariant Graph Neural Networks
NeurIPS 2022 (Spotlight) |
![]() |
3D Concept Grounding on Neural Fields
NeurIPS 2022 |
![]() |
Learning Active Camera for Multi-Object Navigation
NeurIPS 2022 (Spotlight) |
![]() |
Weakly-supervised Multi-granularity Map Learning for Vision-and-Language Navigation
NeurIPS 2022 (Spotlight) |
![]() |
On-Device Training Under 256KB Memory
NeurIPS 2022 |
![]() |
SNAKE: Shape-aware Neural 3D Keypoint Field
NeurIPS 2022 (Spotlight) |
![]() |
Noisy Agents: Self-supervised Exploration by Predicting Auditory Events
IROS 2022 |
![]() |
CoRL 2022 |
![]() |
Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation
CoRL 2022 |
![]() |
Weakly Supervised Grounding for VQA in Vision-Language Transformers
ECCV 2022 (Oral) |
![]() |
Prompting Decision Transformer for Few-shot Policy Generalization
ICML 2022 |
![]() |
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
CVPR 2022 |
![]() |
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction
CVPR 2022 |
![]() |
ICRA 2022 |
![]() |
ICLR 2022 (Oral) |
![]() |
ICLR 2022 |
![]() |
Linking Emergent and Natural Languages via Corpus Transfer
ICLR 2022 (Spotlight) |
![]() |
ComPhy: Compositional Physical Reasoning of Objects and Events from Videos
ICLR 2022 |
![]() |
Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics
ICLR 2022 (Spotlight) |
![]() |
Network Augmentation for Tiny Deep Learning
ICLR 2022 |
![]() |
ICLR 2022 |
2021
![]() |
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
NeurIPS Dataset 2021 (Oral) |
![]() |
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
NeurIPS 2021 |
![]() |
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
NeurIPS 2021 |
![]() |
When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning?
NeurIPS 2021 |
![]() |
STAR: A Benchmark for Situated Reasoning in Real-World Videos
NeurIPS Dataset 2021 |
![]() |
Curious Representation Learning for Embodied Intelligence
ICCV 2021 |
![]() |
OPEn: An Open-ended Physics Environment for Learning Without a Task
IROS 2021 |
![]() |
AGENT: A Benchmark for Core Psychological Reasoning
ICML 2021 |
![]() |
Temporal and Object Quantification Networks
IJCAI 2021 |
![]() |
PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics
ICLR 2021 (Spotlight) |
![]() |
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
ICLR 2021 |
![]() |
Learning Task Decomposition with Order-Memory Policy Network
ICLR 2021 |
2020
![]() |
Foley Music: Learning to Generate Music from Videos
ECCV 2020 |
![]() |
Music Gesture for Visual Sound Separation
CVPR 2020 |
![]() |
Dense Regression Network For Video Grounding
CVPR 2020 |
![]() |
TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
NeurIPS 2020 |
![]() |
MCUNet: Tiny Deep Learning on IoT Devices
NeurIPS 2020 (Spotlight) |
![]() |
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
ICLR 2020 (Spotlight) |
![]() |
Deep Audio Priors Emerge From Harmonic Convolutional Networks
ICLR 2020 |
![]() |
Once for All: Train One Network and Specialize it for Efficient Deployment
ICLR 2020 |
![]() |
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
ICRA 2020 |
2019
![]() |
Self-supervised Moving Vehicle Tracking with Stereo Sound
ICCV 2019 |
![]() |
ICCV 2019 |
![]() |
TSM: Temporal Shift Module for Efficient Video Understanding
ICCV 2019 |
![]() |
Graph Convolutional Networks for Temporal Action Localization
ICCV 2019 |
![]() |
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement
NeurIPS 2019 (Spotlight) |
![]() |
Visual Concept-Metaconcept Learning
NeurIPS 2019 |
![]() |
Cross-channel Communication Networks
NeurIPS 2019 |
![]() |
ICLR 2019 (Oral) |
![]() |
Defensive Quantization: When Efficiency Meets Robustness
ICLR 2019 |
2018
![]() |
Weakly Supervised Dense Event Captioning in Videos
NeurIPS 2018 |
![]() |
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
NeurIPS 2018 (Spotlight) |
![]() |
ECCV 2018 |
![]() |
Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency
ECCV 2018 |
![]() |
Geometry-Guided CNNs for Self-supervised Video Representation Learning
CVPR 2018 |
![]() |
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
CVPR 2018 |
![]() |
End-to-End Learning of Motion Representation for Video Understanding
CVPR 2018 (Spotlight) |
![]() |
Sparse, Smart Contours to Represent and Edit Images
CVPR 2018 |
![]() |
Video Captioning with Multi-Faceted Attention
TACL 2018 |
2017
![]() |
StyleNet: Generating Attractive Visual Captions with Styles
CVPR 2017 |
![]() |
Semantic Compositional Networks for Visual Captioning
CVPR 2017 (Spotlight) |
![]() |
ICCV 2017 |
![]() |
Recurrent Topic-Transition GAN for Visual Paragraph Generation
ICCV 2017 |
2016
![]() |
Automatic Concept Discovery from Parallel Text and Visual Corpora
ICCV 2015 |
Embodied Intelligence
![]() |
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
NeurIPS Dataset 2021 (Oral) |
![]() |
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation
ICLR 2023 (Spotlight) |
![]() |
ICLR 2023 (Spotlight) |
![]() |
SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments
ICLR 2023 |
![]() |
ICLR 2023 |
![]() |
Hyper-Decision Transformer for Efficient Online Policy Adaptation
ICLR 2023 |
![]() |
Learning Active Camera for Multi-Object Navigation
NeurIPS 2022 (Spotlight) |
![]() |
Weakly-supervised Multi-granularity Map Learning for Vision-and-Language Navigation
NeurIPS 2022 (Spotlight) |
![]() |
CoRL 2022 |
![]() |
Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation
CoRL 2022 |
![]() |
Prompting Decision Transformer for Few-shot Policy Generalization
ICML 2022 |
![]() |
ICRA 2022 |
![]() |
ICLR 2022 (Oral) |
![]() |
ICLR 2022 |
![]() |
Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics
ICLR 2022 (Spotlight) |
![]() |
OPEn: An Open-ended Physics Environment for Learning Without a Task
IROS 2021 |
![]() |
Curious Representation Learning for Embodied Intelligence
ICCV 2021 |
![]() |
PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics
ICLR 2021 (Spotlight) |
![]() |
Learning Task Decomposition with Order-Memory Policy Network
ICLR 2021 |
![]() |
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement
NeurIPS 2019 (Spotlight) |
Audio-Visual Scene Analysis
![]() |
Learning Neural Acoustic Fields
NeurIPS 2022 |
![]() |
Noisy Agents: Self-supervised Exploration by Predicting Auditory Events
IROS 2022 |
![]() |
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
CVPR 2022 |
![]() |
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
ICRA 2020 |
![]() |
Foley Music: Learning to Generate Music from Videos
ECCV 2020 |
![]() |
Music Gesture for Visual Sound Separation
CVPR 2020 |
![]() |
ECCV 2018 |
![]() |
Self-supervised Moving Vehicle Tracking with Stereo Sound
ICCV 2019 |
![]() |
ICCV 2019 |
Visual Commonsense Reasoning
![]() |
Learning Physical Dynamics with Subequivariant Graph Neural Networks
NeurIPS 2022 (Spotlight) |
![]() |
Planning with Large Language Models for Code Generation
ICLR 2023 |
![]() |
3D Concept Grounding on Neural Fields
NeurIPS 2022 |
![]() |
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction
CVPR 2022 |
![]() |
Weakly Supervised Grounding for VQA in Vision-Language Transformers
ECCV 2022 (Oral) |
![]() |
ComPhy: Compositional Physical Reasoning of Objects and Events from Videos
ICLR 2022 |
![]() |
Linking Emergent and Natural Languages via Corpus Transfer
ICLR 2022 (Spotlight) |
![]() |
ICLR 2022 |
![]() |
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
NeurIPS 2021 |
![]() |
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
NeurIPS 2021 |
![]() |
STAR: A Benchmark for Situated Reasoning in Real-World Videos
NeurIPS Dataset 2021 |
![]() |
AGENT: A Benchmark for Core Psychological Reasoning
ICML 2021 |
![]() |
Temporal and Object Quantification Networks
IJCAI 2021 |
![]() |
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
ICLR 2021 |
![]() |
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
ICLR 2020 (Spotlight) |
![]() |
Dense Regression Network For Video Grounding
CVPR 2020 |
![]() |
ICLR 2019 (Oral) |
![]() |
Visual Concept-Metaconcept Learning
NeurIPS 2019 |
![]() |
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
NeurIPS 2018 (Spotlight) |
![]() |
ICCV 2017 |
Visual Representation Learning
![]() |
On-Device Training Under 256KB Memory
NeurIPS 2022 |
![]() |
SNAKE: Shape-aware Neural 3D Keypoint Field
NeurIPS 2022 (Spotlight) |
![]() |
When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning?
NeurIPS 2021 |
![]() |
TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
NeurIPS 2020 |
![]() |
MCUNet: Tiny Deep Learning on IoT Devices
NeurIPS 2020 (Spotlight) |
![]() |
Once for All: Train One Network and Specialize it for Efficient Deployment
ICLR 2020 |
![]() |
Cross-channel Communication Networks
NeurIPS 2019 |
![]() |
TSM: Temporal Shift Module for Efficient Video Understanding
ICCV 2019 |
![]() |
Graph Convolutional Networks for Temporal Action Localization
ICCV 2019 |
![]() |
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
CVPR 2018 |
![]() |
End-to-End Learning of Motion Representation for Video Understanding
CVPR 2018 (Spotlight) |
![]() |
DevNet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting
CVPR 2015 |
Learning from Unlabeled Videos
![]() |
Geometry-Guided CNNs for Self-supervised Video Representation Learning
CVPR 2018 |
![]() |
You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images
CVPR 2016 (Spotlight) |
![]() |
Recognizing an Action Using Its Name: A Knowledge-Based Approach
IJCV 2016 |
![]() |
Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames
ECCV 2016 |
Generative Models for Vision and Language
![]() |
Weakly Supervised Dense Event Captioning in Videos
NeurIPS 2018 |
![]() |
Video Captioning with Multi-Faceted Attention
TACL 2018 |
![]() |
StyleNet: Generating Attractive Visual Captions with Styles
CVPR 2017 |
![]() |
Semantic Compositional Networks for Visual Captioning
CVPR 2017 (Spotlight) |
![]() |
Recurrent Topic-Transition GAN for Visual Paragraph Generation
ICCV 2017 |
![]() |
Automatic Concept Discovery from Parallel Text and Visual Corpora
ICCV 2015 |
![]() |
Sparse, Smart Contours to Represent and Edit Images
CVPR 2018 |
Domain Adaptation
![]() |
Learning Attributes Equals Multi-Source Domain Generalization
CVPR 2016 (Spotlight) |
![]() |
Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency
ECCV 2018 |
Competitions
• Ranked 1st in ActivityNet AVA Challenge 2018
• Ranked 1st in ActivityNet Kinetics Challenge 2017
• Ranked 1st in NIST TRECVID MED and MER 2014
• Ranked 2nd in Moments in Time 2018
• Ranked 3rd in YouTube-8M Challenge 2017
• Ranked 3rd in ActivityNet Classification Challenge 2016
Data & Software
• NS-VQA. Neural-Symbolic Visual Reasoning.
• WSDEC. Weakly-supervised Dense Event Captioning.
• The Sound of Pixels. Listen to the sound of pixels.
• Smart Contours. Edit images using contours.
• Attention Clusters. Multiple and diverse attention for video classification.
• SCN. Semantic compositional networks for image and video captioning.
• VQS. Visual question segmentation.
• TVNet. End-to-end video motion learning.
• YouTube-8M. Temporal modeling for video classification.
Honors
• Outstanding Doctoral Thesis Award at Tsinghua University (2018)
• Excellent Graduate Student at Tsinghua University (2018)
• Top Talented Graduate Student at Tsinghua University (2017)
• Academic Rising Star Finalist at Tsinghua University (2016, 2017)
• Microsoft Fellowship (2016)
• Baidu Fellowship (2016)
• National Scholarship, Ministry of Education of China (2015)