In short, the strategy is to run mean-shift only on the last and current
frames. I reuse the grid representation from my previous paper: for
video streams, an (x, y, t, I) grid in which I compute the point density
and label the grid nodes to form clusters. Since I consider
only two frames, the t dimension spans only (t0-1) and t0, with t0 the
time of the current frame. I assume that the past frame (t0-1) has been
processed and that its density and labels are known. I compute the
density at t0 using the formula given in the paper as pseudo-code. The
next two steps describe how I label the grid nodes at t0.
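The density computation itself follows the paper's pseudo-code. As a stand-in, here is a minimal sketch that bins each pixel of one frame to its nearest grid node; the cell sizes `sx` and `si` and the nearest-node binning are my illustrative assumptions, not the paper's kernel:

```python
import numpy as np

def frame_density(frame, sx=8, si=16):
    """Accumulate the pixels of one frame into an (x, y, I) grid slice.

    Stand-in for the paper's density formula: each pixel is binned to
    its nearest grid node. The cell sizes `sx` (spatial) and `si`
    (intensity) are illustrative, as is the nearest-node binning.
    """
    h, w = frame.shape
    gx, gy, gi = w // sx + 1, h // sx + 1, 256 // si + 1
    density = np.zeros((gx, gy, gi))
    ys, xs = np.mgrid[0:h, 0:w]
    # np.add.at accumulates correctly even when several pixels fall
    # into the same grid node (unlike plain fancy-index assignment).
    np.add.at(density,
              (xs.ravel() // sx, ys.ravel() // sx, frame.ravel() // si),
              1.0)
    return density
```

Stacking the slice for (t0-1) and the slice for t0 gives the two-frame grid on which the two labeling steps below operate.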
Step 1: Propagating the labels forward in time. For each local maximum
of density at (t0-1) with a label L (there is a trick here but I'll come
back to it later), I check whether there is an adjacent grid node at t0
with higher density. If so, I follow the steepest slope to reach a
local maximum at t0 and assign the label L to that node. In terms of
topology and the Morse-Smale complex, I am extending the ascending
paths created at the previous frame to account for the current frame.
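Step 1 can be sketched as follows; `maxima_prev`, the dict-based labels, and the `neighbors` callback are hypothetical interfaces for illustration, not the paper's actual data structures:

```python
def propagate_labels(density_prev, density_cur, maxima_prev, labels_prev,
                     neighbors):
    """Step 1 sketch: extend ascending paths from (t0-1) into t0.

    `maxima_prev` lists the grid nodes that are local maxima at (t0-1),
    `labels_prev` maps them to their labels, and `neighbors(node)` yields
    the adjacent nodes within one frame slice.
    """
    labels_cur = {}
    for m in maxima_prev:
        # Candidate entry points at t0: the node with the same (x, y, I)
        # coordinates and its neighbors.
        start = max([m] + list(neighbors(m)), key=lambda n: density_cur[n])
        if density_cur[start] <= density_prev[m]:
            continue  # no adjacent t0 node is denser: the path ends here
        node = start
        while True:  # follow the steepest slope within the t0 slice
            best = max(neighbors(node), key=lambda n: density_cur[n])
            if density_cur[best] <= density_cur[node]:
                break  # reached a local maximum at t0
            node = best
        labels_cur[node] = labels_prev[m]  # carry the label forward
    return labels_cur
```

The returned dictionary seeds the watershed-like pass of Step 2: those maxima keep their inherited labels instead of spawning new ones.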
Step 2: Assigning labels at t0. I use the same watershed-like algorithm
as in my previous paper with minor variations. Some local maxima have
been labeled in the previous step; in this case, I do not create a new
label. When I analyze a neighborhood, I consider the grid nodes at both
t0 and (t0-1). Everything then happens as before: if no neighbor is
labeled, I create a new label; if exactly one label appears, I copy it;
if two or more labels appear, I mark the node as a boundary. There
is one more thing to consider: if a node is a local maximum among the t0
nodes but there is a (t0-1) node with higher density, then I mark this
node as "not a local maximum" so that I do not start an ascending path
in the first step when I process the next frame. One last point: to
save time, I skip the per-pixel classification and simply assign each
boundary node the label of its densest neighbor.
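Putting Step 2 together, a sketch under stated assumptions: integer labels, a decreasing-density sweep (a common watershed ordering), and `nbs_cur`/`nbs_prev` neighbor callbacks are illustrative choices, not the paper's implementation.

```python
def label_frame(density_cur, density_prev, labels_prev, seeds,
                nbs_cur, nbs_prev, n_nodes):
    """Step 2 sketch: watershed-like labeling of the t0 grid nodes.

    `seeds` holds the labels placed by Step 1; `nbs_cur(n)` and
    `nbs_prev(n)` enumerate the neighbors of node n at t0 and at (t0-1).
    Labels are assumed to be integers; BOUNDARY is a sentinel.
    """
    BOUNDARY = -1
    labels = dict(seeds)
    used = set(labels_prev.values()) | set(seeds.values())
    next_label = max(used, default=-1) + 1
    not_max = set()
    for node in sorted(range(n_nodes), key=lambda n: density_cur[n],
                       reverse=True):
        # A t0 local maximum shadowed by a denser (t0-1) node must not
        # start an ascending path when the next frame is processed.
        if all(density_cur[n] <= density_cur[node] for n in nbs_cur(node)) \
           and any(density_prev[n] > density_cur[node]
                   for n in nbs_prev(node)):
            not_max.add(node)
        if node in labels:
            continue  # seeded by Step 1: keep the inherited label
        seen = {labels[n] for n in nbs_cur(node)
                if n in labels and labels[n] != BOUNDARY}
        seen |= {labels_prev[n] for n in nbs_prev(node) if n in labels_prev}
        if not seen:
            labels[node] = next_label   # unseen peak: new cluster
            next_label += 1
        elif len(seen) == 1:
            labels[node] = seen.pop()   # one label in the neighborhood
        else:
            labels[node] = BOUNDARY     # several labels meet here
    # No per-pixel classification: each boundary node takes the label
    # of its densest t0 neighbor.
    for node, lab in list(labels.items()):
        if lab == BOUNDARY:
            best = max(nbs_cur(node), key=lambda n: density_cur[n])
            labels[node] = labels.get(best, BOUNDARY)
    return labels, not_max
```

The `not_max` set is what implements the "not a local maximum" flag: those nodes are excluded from the maxima fed to Step 1 at the next frame.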