Learning Articulated Motions
From Visual Demonstration
Sudeep Pillai, Matthew R. Walter and Seth Teller
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139 USA
Email: {spillai, mwalter, teller}@csail.mit.edu
Abstract—Many functional elements of human homes and
workplaces consist of rigid components which are connected
through one or more sliding or rotating linkages. Examples
include doors and drawers of cabinets and appliances; laptops;
and swivel office chairs. A robotic mobile manipulator would
benefit from the ability to acquire kinematic models of such
objects from observation. This paper describes a method by
which a robot can acquire an object model by capturing depth
imagery of the object as a human moves it through its range of
motion. We envision that, in the future, a machine newly introduced to an
environment could be shown the articulated objects particular to that
environment by its human user, and could infer from these “visual
demonstrations” enough information to actuate each object independently
of the user.
Our method employs sparse (markerless) feature tracking,
motion segmentation, component pose estimation, and articu-
lation learning; it does not require prior object models. Using
the method, a robot can observe an object being exercised, infer
a kinematic model incorporating rigid, prismatic and revolute
joints, then use the model to predict the object’s motion from
a novel vantage point. We evaluate the method’s performance,
and compare it to that of a previously published technique, for
a variety of household objects.
I. INTRODUCTION
A long-standing challenge in robotics is to endow robots
with the ability to interact effectively with the diversity of
objects common in human-made environments. Existing ap-
proaches to manipulation often assume that objects are simple
and drawn from a small set. The models are then either pre-
defined or learned from training, for example requiring fiducial
markers on object parts, or prior assumptions about object
structure. Such requirements may not scale well as the number
and variety of objects increases. This paper describes a method
with which robots can learn kinematic models for articulated
objects in situ, simply by observing a user manipulate the
object. Our method learns open kinematic chains that involve
rigid linkages, and prismatic and revolute motions, between
parts.
There are three primary contributions of our approach that
make it effective for articulation learning. First, we propose
a feature tracking algorithm designed to perceive articulated
motions in unstructured environments, avoiding the need to
embed fiducial markers in the scene. Second, we describe a
motion segmentation algorithm that uses kernel-based clus-
tering to group feature trajectories arising from each object
part. A subsequent optimization step recovers the 6-DOF pose of each
object part.

Fig. 1: The proposed framework reliably learns the underlying kinematic
model of multiple articulated objects from user-provided visual
demonstrations, and subsequently predicts their motions at future
encounters.

Third, the method enables use of the
learned articulation model to predict the object’s motion when
it is observed from a novel vantage point. Figure 1 illustrates
a scenario where our method learns kinematic models for
a refrigerator and microwave from separate user-provided
demonstrations, then predicts the motion of each object in
a subsequent encounter. We present experimental results that
demonstrate the use of our method to learn kinematic models
for a variety of everyday objects, and compare our method’s
performance to that of the current state of the art.
II. RELATED WORK
Providing robots with the ability to learn models of ar-
ticulated objects requires a range of perceptual skills such
as object tracking, motion segmentation, pose estimation,
and model learning. It is desirable for robots to learn these
models from demonstrations provided by ordinary users. This
necessitates the ability to deal with unstructured environments
and estimate object motion without requiring tracking markers.
Traditional tracking algorithms, such as KLT [2] or those
based on SIFT [15], depend on sufficient object texture and
may be susceptible to drift when employed over an object’s
full range of motion. Alternatives such as large-displacement
optical flow [4] or particle video methods [19] tend to be more
accurate but require substantially more computation.
Fig. 2: Articulation learning architecture.
Articulated motion understanding generally requires a com-
bination of motion tracking and segmentation. Existing motion
segmentation algorithms use feature based trackers to construct
spatio-temporal trajectories from sensor data, and cluster these
trajectories based on rigid-body motion constraints. Recent
work by Brox and Malik [3] in segmenting feature trajectories
has shown promise in analyzing and labeling motion profiles
of objects in video sequences in an unsupervised manner.
Recent work by Elhamifar and Vidal [5] has proven effective
at labeling object points based purely on motion visible in a
sequence of standard camera images. Our framework employs
similar techniques, and introduces a segmentation approach for
features extracted from RGB-D data.
Researchers have studied the problem of learning models
from visual demonstration. Yan and Pollefeys [24] and Huang
et al. [10] employ structure from motion techniques to segment
the articulated parts of an object, then estimate the prismatic
and rotational degrees of freedom between these parts. These
methods are sensitive to outliers in the feature matching step,
resulting in significant errors in pose and model estimates.
Closely related to our work, Katz et al. [13] consider the
problem of extracting segmentation and kinematic models
from interactive manipulation of an articulated object. They
take a deterministic approach, first assuming that each object
linkage is prismatic and proceeding to fit a rotational degree-of-
freedom only if the residual is above a specified threshold.
Katz et al. learn from observations made in clean, clutter-
free environments and primarily consider objects in close
proximity to the RGB-D sensor. Recently, Katz et al. [14]
propose an improved learning method that has equally good
performance with reduced algorithmic complexity. However,
the method does not explicitly reason over the complexity of
the inferred kinematic models, and tends to over-fit to observed
motion. In contrast, our algorithm targets in situ learning in
unstructured environments with probabilistic techniques that
provide robustness to noise. Our method builds on the work of
Sturm et al. [22], which uses a probabilistic approach to reason
over the likelihood of the observations while simultaneously
penalizing complexity in the kinematic model. Their work
differs from ours in two main respects: they require that
fiducial markers be placed on each object part in order to
provide nearly noise-free observations; and they assume that
the number of unique object parts is known a priori.
III. ARTICULATION LEARNING FROM VISUAL
DEMONSTRATION
This section introduces the algorithmic components of our
method. Figure 2 illustrates the steps involved.
Our approach consists of a training phase and a prediction
phase. The training phase proceeds as follows: (i) Given RGB-
D data, a feature tracker constructs long-range feature trajec-
tories in 3-D. (ii) Using a relative motion similarity metric,
clusters of rigidly moving feature trajectories are identified.
(iii) The 6-DOF motion of each cluster is then estimated
using 3-D pose optimization. (iv) Given a pose estimate for
each identified cluster, the most likely kinematic structure and
model parameters for the articulated object are determined.
Figure 3 illustrates the steps involved in the training phase
with inputs and outputs for each component.
Fig. 3: The training phase. (Pipeline components: Initialize Features
(GFTT); Compute Dense Optical Flow; Propagate & Match Features;
Construct Feature Trajectories; Motion Segmentation; Pose Estimation;
Articulation Learning; DB.)
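For concreteness, the training phase can be summarized as the composition of the four components above. The following Python sketch is purely illustrative: the argument names (track_features, segment_motion, estimate_poses, learn_articulation) are placeholders for the components described in the remainder of this section, not functions from our actual implementation.

    def train_from_demonstration(rgbd_frames, track_features, segment_motion,
                                 estimate_poses, learn_articulation):
        # Compose the four training-phase components (steps i-iv above).
        trajectories = track_features(rgbd_frames)      # (i)   Sec. III-A
        clusters = segment_motion(trajectories)         # (ii)  Sec. III-B
        poses = estimate_poses(trajectories, clusters)  # (iii) Sec. III-C
        return learn_articulation(poses)                # (iv)  kinematic model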
Once the kinematic model of an articulated object is learned,
our system can predict the motion trajectory of the object
during future encounters. In the prediction phase: (i) Given RGB-D data,
the description of the objects in the scene, $D_{\text{query}}$, is
extracted using SURF [1] descriptors. (ii) Given a set of descriptors
$D_{\text{query}}$, the best-matching object and its kinematic model,
$\hat{G}$ and $\hat{M}_{ij}$, $(ij) \in \hat{G}$, are retrieved; and
(iii) from these correspondences and the kinematic model parameters of
the matching object, the object's articulated motion is predicted.
Figure 4 illustrates the steps involved in the prediction phase.
Fig. 4: The prediction phase. (Pipeline components: RGB-D Image; Object
Description (SURF); Query DB; Motion Prediction.)
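The database query step admits a compact illustration. The sketch below is an assumption-laden example rather than our implementation: it presumes a database db that maps object names to (SURF descriptors, learned kinematic model) pairs gathered during training, and that SURF is available through an opencv-contrib build of OpenCV.

    import cv2

    def query_database(gray_image, db, ratio=0.75):
        # Extract SURF descriptors for the query scene.
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
        _, d_query = surf.detectAndCompute(gray_image, None)
        if d_query is None:
            return None, None

        # Score each stored object by the number of distinctive matches
        # (Lowe-style ratio test) against its training descriptors.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        best_name, best_count = None, 0
        for name, (d_train, _model) in db.items():
            count = 0
            for pair in matcher.knnMatch(d_query, d_train, k=2):
                if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                    count += 1
            if count > best_count:
                best_name, best_count = name, count

        # Return the best-matching object and its learned kinematic model.
        return best_name, (db[best_name][1] if best_name else None)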
A. Spatio-Temporal Feature Tracking
The first step in articulation learning from visual demon-
stration involves visually observing and tracking features on
the object while it is being manipulated. We focus on unstruc-
tured environments without fiducial markers. Our algorithm
combines interest-point detectors and feature descriptors with
traditional optical flow methods to construct long-range feature
trajectories. We employ Good Features To Track (GFTT) [20]
to initialize up to 1500 salient features with a quality level of
0.04 or greater, across multiple image scales. Once the features
are detected, we populate a mask image that captures regions
where interest points are detected at each pyramid scale. We
use techniques from previous work on dense optical flow [7] to
predict each feature at the next timestep. Our implementation
also employs median filtering as suggested by Wang et al. [23]
to reduce false positives.
We bootstrap the detection and tracking steps with a feature
description step that extracts and learns the description of
the feature trajectory. At each image scale, we compute the
SURF descriptor [1] over features that were predicted from
the previous step, denoted as $\hat{f}_t$, and compare them with the
description of the detected features at time $t$, denoted as $f_t$.
Subsequently, detected features $f_t$ that are sufficiently close to
predicted features $\hat{f}_t$ and that successfully meet a desired
match score are added to the feature trajectory, while
the rest are pruned. To combat drift, we use the detection
mask as a guide to reinforce feature predictions with feature
detections. Additionally, we incorporate flow failure detection
techniques [12] to reduce drift in feature trajectories.
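A stripped-down propagate-and-match step is sketched below. Only the GFTT initialization (quality level 0.04), the dense-optical-flow prediction, and the mask-based replenishment correspond directly to the description above; the specific Farneback parameters and mask radius are assumptions, prev_pts is assumed to be an N x 2 array of (x, y) pixel locations, and the descriptor-consistency and flow-failure checks are omitted for brevity.

    import cv2
    import numpy as np

    def track_step(prev_gray, curr_gray, prev_pts, max_feats=1500):
        # Predict each tracked feature at the next timestep by sampling
        # the dense optical flow field at its previous location.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        xs = np.clip(prev_pts[:, 0].astype(int), 0, flow.shape[1] - 1)
        ys = np.clip(prev_pts[:, 1].astype(int), 0, flow.shape[0] - 1)
        pred = prev_pts + flow[ys, xs]

        # Mask out neighborhoods of existing features, then add new GFTT
        # detections only in unoccupied regions to keep the count constant.
        mask = np.full(curr_gray.shape, 255, np.uint8)
        for x, y in pred:
            cv2.circle(mask, (int(x), int(y)), 9, 0, -1)
        need = max_feats - len(pred)
        if need > 0:
            new = cv2.goodFeaturesToTrack(curr_gray, need, qualityLevel=0.04,
                                          minDistance=9, mask=mask)
            if new is not None:
                pred = np.vstack([pred, new.reshape(-1, 2)])
        return pred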
Like other feature-based methods [14], our method requires
visual texture. In typical video sequences, some features are
continuously tracked, while other features are lost due to
occlusion or lack of image saliency. To provide rich trajectory
information, we continuously add features to the scene as
needed. We maintain a constant number of tracked feature trajectories
by adding newly detected features in regions that are
not yet occupied. From RGB-D depth information, image-
space feature trajectories can be easily extended to 3-D. As
a result, each feature key-point is represented by its normal-
ized image coordinates $(u, v)$, position $p \in \mathbb{R}^3$, and
surface normal $n$, represented as $(p, n) \in \mathbb{R}^3 \times SO(2)$.
We denote $F = \{F_1, \ldots, F_n\}$ as the resulting set of feature
trajectories constructed, where $F_i = \{(p_1, n_1), \ldots, (p_t, n_t)\}$.
To combat noise inherent in our consumer-grade RGB-D sensor, we
post-process the point cloud with a fast bilateral filter [18] with
parameters $\sigma_s = 20$ px and $\sigma_r = 4$ cm.
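The 3-D lifting step amounts to back-projecting each tracked pixel through the camera intrinsics using the (filtered) depth image. The sketch below assumes pixel coordinates and placeholder intrinsics for a typical consumer RGB-D sensor, with depth in meters; surface normal estimation is omitted.

    import numpy as np

    FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5   # placeholder intrinsics

    def backproject(features_uv, depth):
        # Lift (u, v) pixel locations to 3-D points using the pinhole model.
        pts = []
        for u, v in features_uv:
            z = depth[int(v), int(u)]
            if z > 0 and np.isfinite(z):
                pts.append(((u - CX) * z / FX, (v - CY) * z / FY, z))
            else:
                pts.append((np.nan, np.nan, np.nan))  # invalid depth reading
        return np.array(pts)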
B. Motion Segmentation
To identify the kinematic relationships among parts in an
articulated object, we first distinguish the trajectory taken
by each part. In particular, we analyze the motions of the
object parts with respect to each other over time, and infer
whether or not pairs of object parts are rigidly attached. To
reason over candidate segmentations, we formulate a clustering
problem to identify the different motion subspaces in which
the object parts lie. After clustering, similar labels imply rigid
attachment, while dissimilar labels indicate non-rigid relative
motion between parts.
If two features in $\mathbb{R}^3 \times SO(2)$ belong to the same rigid
part, the relative displacement and angle between the features will be
consistent over the common span of their trajectories. The distribution
over the relative change in displacement vectors and angle subtended is
modeled as a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$, where $\Sigma$
is the expected noise covariance for rigidly-connected feature pairs.
The similarity of two
feature trajectories can then be defined as:
$$ L(i, j) = \frac{1}{T} \sum_{t \,\in\, t_i \cap t_j} \exp\!\left( -\gamma \left( d(x^t_i, x^t_j) - \mu_{d_{ij}} \right)^2 \right) \qquad (1) $$
where $t_i$ and $t_j$ are the observed time instances of feature
trajectories $i$ and $j$, respectively, $T = |t_i \cap t_j|$, and
$\gamma$ is a parameter characterizing the relative motion of the two
trajectories. For a pair of 3-D key-point features $p_i$ and $p_j$, we
estimate the mean relative displacement between a pair of points moving
rigidly together as:

$$ \mu_{d_{ij}} = \frac{1}{T} \sum_{t \,\in\, t_i \cap t_j} d(p^t_i, p^t_j) \qquad (2) $$

where $d(p_i, p_j) = \| p_i - p_j \|$. For 3-D key-points, we use
$\gamma = \frac{1}{2\ \mathrm{cm}}$ in Eqn. 1. Figure 5 illustrates an
example of rigid and non-rigid motions of feature trajectory pairs, and
their corresponding distribution of relative displacements.
For a pair of surface normals $n_i$ and $n_j$, we define the mean
distance as

$$ \mu_{d_{ij}} = \frac{1}{T} \sum_{t \,\in\, t_i \cap t_j} d(n^t_i, n^t_j), \qquad (3) $$

where $d(n_i, n_j) = 1 - n_i \cdot n_j$. In this case, we use
$\gamma = \frac{1}{1 - \cos(15^\circ)}$ in Eqn. 1.
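A direct transcription of Eqns. 1-3 for a pair of trajectories is straightforward. In the sketch below, each trajectory is assumed to be a dictionary mapping a timestep to a 3-D numpy point (a representation chosen only for this illustration), and the default gamma corresponds to 1/(2 cm) with distances in meters; the normal-based variant simply swaps in the distance of Eqn. 3.

    import numpy as np

    def similarity(traj_i, traj_j, gamma=1.0 / 0.02):
        # Overlapping timesteps of the two feature trajectories.
        common = sorted(set(traj_i) & set(traj_j))
        if not common:
            return 0.0
        # Relative displacement at each common timestep.
        d = np.array([np.linalg.norm(traj_i[t] - traj_j[t]) for t in common])
        mu_dij = d.mean()                                        # Eqn. 2
        return float(np.exp(-gamma * (d - mu_dij) ** 2).mean())  # Eqn. 1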
Since the bandwidth parameter γ for a pair of feature trajec-
tories can be intuitively predicted from the expected variance
in relative motions of trajectories, we employ DBSCAN [6],
a density-based clustering algorithm, to find rigidly associated
feature trajectories. The resulting cluster assignments are de-
noted as $C = \{C_1, \ldots, C_k\}$, where cluster $C_i$ consists of a
set of rigidly-moving feature trajectories.
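Given the pairwise similarities of Eqn. 1, the clustering step can be realized with an off-the-shelf DBSCAN implementation. The conversion from similarity to distance and the eps / min_samples settings below are illustrative assumptions, not tuned values from our experiments.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_trajectories(S, eps=0.3, min_samples=5):
        # S is a symmetric matrix of pairwise similarities in [0, 1]
        # computed with Eqn. 1; convert it to a distance for DBSCAN.
        D = 1.0 - np.clip(S, 0.0, 1.0)
        np.fill_diagonal(D, 0.0)
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(D)
        return labels   # label -1 marks unassigned (noise) trajectories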
C. Multi-Rigid-Body Pose Optimization
Given the cluster label assignment for each feature trajec-
tory, we subsequently determine the 6-DOF motion of each
cluster. We define $Z^t_i$ as the set of features belonging to cluster
$C_i$ at time $t$. Additionally, we define $X = X_1, \ldots, X_k$ as the