In short, the strategy is to run mean-shift only on the previous and current frames. I use the grid representation of my previous paper: for video streams, an (x, y, t, I) grid in which I compute the point density and label the grid nodes to form clusters. Since I consider only two frames, the t dimension spans only (t0-1) and t0, where t0 is the time of the current frame. I assume that the past frame (t0-1) has already been processed, so its density and labels are known. I compute the density at t0 using the formula given as pseudo-code in the paper. The next two steps describe how I label the grid nodes at t0.

Step 1: Propagating the labels forward in time. For each local maximum of density at (t0-1) with a label L (there is a trick here, but I'll come back to it later), I check whether there is an adjacent grid node at t0 with higher density. If so, I follow the steepest slope to reach a local maximum at t0 and assign the label L to that node. In terms of topology and the Morse-Smale complex, I am extending the ascending paths created at the previous frame to account for the current frame.

Step 2: Assigning labels at t0. I use the same watershed-like algorithm as in my previous paper, with minor variations. Some local maxima were labeled in the previous step; in that case, I do not create a new label. When I analyze a neighborhood, I consider the grid nodes at both t0 and (t0-1). Everything else happens as before: no labeled neighbor, create a new label; one label, copy it; two or more labels, mark the node as a boundary. There is one more thing to consider: if a node is a local maximum among the t0 nodes but some (t0-1) node has higher density, I mark it as "not a local maximum" so that I do not start an ascending path from it in Step 1 when I process the next frame. The last point is that I do not run the per-pixel classification, to save time: boundary nodes are simply assigned the label of their densest neighbor.
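
To make the two steps concrete, here is a minimal sketch of the labeling pass. It assumes a 1D spatial grid for brevity (so a node's neighbors are just its left and right nodes), two density slices `prev_density` (t0-1) and `cur_density` (t0), and fully known labels and local-maximum flags for the past frame. The helper names (`spatial_neighbors`, `label_current_frame`) are hypothetical, not from the paper:

```python
import numpy as np

BOUNDARY = -1  # temporary label for watershed boundary nodes

def spatial_neighbors(i, n):
    """Spatial adjacency within one frame (1D grid for brevity)."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def label_current_frame(prev_density, prev_labels, prev_is_max, cur_density):
    """Label the t0 slice, given the already-processed (t0-1) slice."""
    n = len(cur_density)
    labels = np.zeros(n, dtype=int)    # 0 = unlabeled
    is_max = np.zeros(n, dtype=bool)   # "local maximum" flags kept for the next frame

    # Step 1: propagate each (t0-1) local maximum's label forward in time.
    for i in np.flatnonzero(prev_is_max):
        cands = [i] + spatial_neighbors(i, n)        # adjacent nodes at t0
        j = max(cands, key=lambda k: cur_density[k])
        if cur_density[j] > prev_density[i]:         # a higher node exists at t0
            while True:                              # steepest ascent within t0
                k = max(spatial_neighbors(j, n), key=lambda m: cur_density[m])
                if cur_density[k] <= cur_density[j]:
                    break                            # reached a t0 local maximum
                j = k
            labels[j] = prev_labels[i]

    # Step 2: watershed-like pass over the t0 nodes, densest first, with a
    # neighborhood spanning both the t0 and (t0-1) slices.
    next_label = int(max(prev_labels.max(), labels.max())) + 1
    for i in np.argsort(-cur_density, kind="stable"):
        t0_nb = spatial_neighbors(i, n)
        t0_higher = [j for j in t0_nb if cur_density[j] > cur_density[i]]
        prev_higher = [j for j in [i] + t0_nb if prev_density[j] > cur_density[i]]
        if labels[i] > 0:                  # already labeled in Step 1
            is_max[i] = not prev_higher    # flag only if no (t0-1) node dominates
            continue
        seen = {labels[j] for j in t0_higher if labels[j] > 0}
        seen |= {prev_labels[j] for j in prev_higher if prev_labels[j] > 0}
        if not t0_higher and not prev_higher:
            is_max[i] = True               # genuine spatio-temporal maximum
            labels[i] = next_label
            next_label += 1
        elif len(seen) == 1:
            labels[i] = seen.pop()         # density flows toward a single cluster
        elif len(seen) >= 2:
            labels[i] = BOUNDARY           # caught between two or more clusters
        else:
            labels[i] = BOUNDARY           # only boundary nodes above (rare case)

    # No per-pixel classification: each boundary node takes the label of its
    # densest neighbor at t0.
    for i in np.flatnonzero(labels == BOUNDARY):
        j = max(spatial_neighbors(i, n), key=lambda k: cur_density[k])
        labels[i] = labels[j]
    return labels, is_max

# Toy example: in the past frame, cluster 1 peaks at node 1 and cluster 2 at
# node 5; in the current frame, cluster 1's peak has moved to node 2.
prev_d = np.array([1.0, 3.0, 1.0, 0.5, 1.0, 2.0, 1.0])
prev_l = np.array([1, 1, 1, 1, 2, 2, 2])
prev_m = np.array([False, True, False, False, False, True, False])
cur_d  = np.array([1.0, 2.0, 4.0, 1.0, 0.8, 1.5, 1.0])
labels, is_max = label_current_frame(prev_d, prev_l, prev_m, cur_d)
```

In this toy run, cluster 1's label follows the peak to node 2, while cluster 2's t0 peak (node 5) is dominated by its denser (t0-1) counterpart, so it inherits label 2 through the spatio-temporal neighborhood and is flagged "not a local maximum". Processing nodes densest-first guarantees that every higher neighbor at t0 has already been labeled when a node is examined, which is what the original watershed pass relies on.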