\documentclass[10pt,twocolumn,letterpaper]{article} 

\usepackage{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}

% Include other packages here, before hyperref.

% If you comment hyperref and then uncomment it, you should delete 
% egpaper.aux before re-running latex.  (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}


\cvprfinalcopy % *** Uncomment this line for the final submission

%\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}

% Pages are numbered in submission mode, and unnumbered in camera-ready
%\ifcvprfinal\pagestyle{empty}\fi
\begin{document}

%%%%%%%%% TITLE
\title{Improving Super-Resolution Enhancement of Video by using Optical Flow}

\author{Chris Crutchfield\\
MIT\\
{\tt\small ccrutch@mit.edu}
}

\maketitle
\thispagestyle{empty}

%%%%%%%%% ABSTRACT
\begin{abstract}
In the literature there has been much research into two methods of attacking the super-resolution problem: using optical flow-based techniques to align low-resolution images as samples of a target high-resolution image, and using learning-based techniques to estimate perceptually-plausible high frequency components of a low-resolution image.  Both of these approaches have been naturally extended to apply to image sequences from video, yet heretofore there have been no investigations into combining these methods to obviate problems associated with each method individually.  We show how to merge these two disparate approaches to attack two problems associated with super-resolution for video: removing temporal artifacts (``flicker'') and improving image quality.
\end{abstract}

%%%%%%%%% BODY TEXT
\section{Introduction}

Super-resolution enhancement of images has been a well-studied topic in the literature, with a wide variety of solutions.  All of these methods attempt to solve the same problem: to increase the resolution (the number of pixels) of a given image while also estimating the missing high frequency content of the resized image.  

There have been three major approaches to increasing image resolution.  The first method involves interpolating a single image to a higher resolution, and then boosting its high frequencies by applying a deconvolution filter~\cite{schultz}.  The second method uses several low-resolution, aligned images as samples of a high-resolution image, which it then attempts to estimate~\cite{chiang}.  The third method uses learning-based techniques to infer perceptually-plausible high frequencies for a low-resolution image~\cite{freeman}.

However, despite the fact that super-resolution for images has been a well-studied topic, there have been comparatively few investigations into applying these techniques to video (in effect, generalizing the problem to three dimensions --- two in space and one temporal --- where the goal is to increase the spatial resolution by making use of the information provided by the additional temporal dimension).  The first investigation into this domain was an extension of Chiang and Boult~\cite{chiang}, involving the use of optical flow to align successive frames~\cite{bakerkanade}.  The second approach was an extension of the methods of Freeman et al.~\cite{freeman} to apply their VISTA algorithm to each frame individually~\cite{bishop}.

\subsection{Super-Resolution}

The task of super-resolution seems nearly impossible at first: to extract information from an image that is simply not present.  Although all the methods listed in this section claim to achieve this task, there is no algorithm that solves this problem exactly.  Since many very different high-resolution images may all map to the same low-resolution image, we have no hope of recovering the initial image in all cases (see Figure \ref{fig:downsample}).  However, we may \emph{estimate} or \emph{infer} what the high-resolution image most likely looks like using several techniques.  In this section I will review the three main super-resolution techniques applied in the literature.

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/downsample.jpg}
\end{center}
   \caption{The images on the right are very different in terms of frequency content, yet they both map to the same image when blurred and downsampled.}
\label{fig:downsample}
\end{figure}

Several papers (Schultz et al.~\cite{schultz}, Chiang et al.~\cite{chiang2}) have addressed the task of boosting high frequencies present in an image by deblurring using a deconvolution filter (typically Wiener deconvolution).  There are a host of problems associated with this task; however, they are mostly concerned with estimating the blur kernel that has been applied to the image, in order to deconvolve and remove the blur (and therefore boost up the missing high-frequency components).  This is certainly no trivial task, since estimating the blur of an image is an inexact science.  The conclusion of Chiang et al.~\cite{chiang2} was that more robust methods are needed in practice in order for deconvolution alone to be feasible.
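
As a reminder of what such a filter looks like, Wiener deconvolution is commonly written in the frequency domain as below.  The notation is ours and is not taken from the cited papers: $G$ denotes the Fourier transform of the observed blurry image, $H$ that of the (estimated) blur kernel, and $K$ a constant approximating the noise-to-signal power ratio.
\begin{equation}
\hat{F}(u,v) = \frac{H^{*}(u,v)}{|H(u,v)|^{2} + K} \, G(u,v)
\end{equation}
When $K = 0$ this reduces to plain inverse filtering, $\hat{F} = G / H$, which illustrates why an inaccurate estimate of $H$ translates directly into errors in the restored image.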

\begin{figure}
\begin{center}
\includegraphics[width=1.1\linewidth]{Images/bakerkanade2.png}
\end{center}
   \caption{Super-resolution using image sequences.}
\label{fig:bakerkanade2}
\end{figure}

Another approach discussed in Chiang et al.~\cite{chiang} is to reconstruct a high-resolution image from a sequence of low-resolution images that are pre-aligned (see Figure \ref{fig:bakerkanade2} for a pictorial representation).  In this ``registration'' step, each pixel in the high-resolution image is assigned a point (which may be a subpixel location) in each low-resolution image.  The assumption is that the registration is known \emph{a priori}, so that we then only concern ourselves with combining these images to produce the high-resolution output.  In order to combine the images, each low-resolution image is warped into the coordinate frame of the high-resolution image (using the registration information).  There are several methods for performing this task, which involves interpolating the values at the subpixel locations in each image using the registration (typically one may use nearest-neighbor, bilinear, or bicubic interpolation).  The result of these computations is a stack of high-resolution images, which may then be fused together by taking a robust mean to produce a composite high-resolution image.  This image may then be deblurred by applying the Wiener deconvolution filter mentioned above.
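
Under assumed notation (not that of Chiang et al.), the fusion step can be summarized as follows.  Let $I_1, \ldots, I_K$ be the low-resolution images and let $W_k$ denote warping by the known registration of image $k$ into the high-resolution coordinate frame, including interpolation at subpixel locations.  The composite estimate at each high-resolution pixel $p$ is then
\begin{equation}
\hat{S}(p) = \operatorname*{rmean}_{k = 1, \ldots, K} \; (W_k I_k)(p),
\end{equation}
where $\operatorname{rmean}$ stands for whichever robust mean is chosen, for example the median.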

In the seminal paper by Freeman, Pasztor and Carmichael~\cite{freeman}, a \emph{learning-based} algorithm, VISTA, for solving the super-resolution problem was developed.  In this paper, they demonstrate how to extract perceptually-plausible high-frequency components from a low-resolution image.  They do so by constructing a training set out of a sequence of high-resolution images.  By pairing image patches from the high-resolution image to their low-resolution counterparts (by blurring and downsampling to remove the high-frequency components), one can infer from a given low-resolution image patch what the most likely high-resolution patch would be.

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/freeman.png}
\end{center}
   \caption{Markov Random Field for images.  The observations $y_i$ are the low-resolution patches in the input image.  The nodes we wish to estimate $x_i$ represent the high-resolution patches in the output.  Each $x_i$ is connected to its associated $y_i$ (enforcing that the medium-frequency components of $x_i$ are ``close'' to the medium-frequency components of $y_i$).  In addition, each $x_i$ is connected to its neighbor with some compatibility criterion, ensuring that neighboring high-resolution patches ``stitch'' together well.}
\label{fig:mrf}
\end{figure}

In particular, in order to solve this problem for the entire image, they construct a Markov Random Field (see Figure \ref{fig:mrf}) for the image.  By applying Bayesian Belief Propagation to this network, they then can reconstruct the maximum likelihood high-resolution output, conditioned on the low-resolution input.
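
Schematically, and in notation we adopt from Figure~\ref{fig:mrf} rather than quote from the original paper, the Markov Random Field assigns the joint probability
\begin{equation}
P(\mathbf{x}, \mathbf{y}) \propto \prod_{(i,j)} \Psi(x_i, x_j) \prod_{i} \Phi(x_i, y_i),
\end{equation}
where the first product runs over pairs of neighboring patches, $\Psi$ is the compatibility between neighboring high-resolution patches, and $\Phi$ is the compatibility between a high-resolution patch and its low-resolution observation.  Belief propagation passes messages along the edges of this graph to approximate the most probable assignment of the $x_i$ given the $y_i$.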

%\begin{equation} \mathcal{L}_k(\mathbf{x}) = \| \mathbf{x} - \mathbf{x}_k \|^2 \end{equation}

%\begin{equation} \mathcal{E}_k^{(\alpha)}(\mathbf{x}) = \mathcal{L}_k(\mathbf{x}) + \alpha \mathcal{V}(\mathbf{y}_k, \{\mathbf{y}_{k'} \, : \, k' \in \mathcal{N}_\mathbf{x}^{(t)}\}) \label{eqn:cost} \end{equation}


\subsection{Super-Resolution for Video}

The motivations for applying super-resolution to video are quite apparent.  Videos require such a large amount of storage space that they are often of much smaller resolution than the devices used to display them.  In particular, in the case of videos streamed over the internet, the space requirements are even more stringent (indeed, they then become bandwidth requirements).  This raises the following question: what if we could design an algorithm that allows videos to be streamed to the user at a relatively low resolution, but with some processing we could boost the video to the higher resolution of their display?  As mentioned above, there have been several approaches to answering this question.

The work of Baker and Kanade~\cite{bakerkanade} extended the results of Chiang et al.~\cite{chiang} to apply to video.  By computing the \emph{optical flow} of the video sequence using Lucas-Kanade~\cite{lucaskanade} or a similar approach, we can then warp the video frames surrounding a particular frame so that they all are aligned in the same coordinate frame.  By repeating this process for each frame, we obtain a collection of low-resolution, aligned images for each frame of the video.  We can then apply the techniques of Chiang et al. to extract a high-resolution estimate of this frame from the collection of aligned images (see Figure \ref{fig:bakerkanade} for more details).  Although this technique works well for carefully chosen examples, when applied to real-world data, optical flow algorithms fail to provide the precision necessary to extract a quality high-resolution output.
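
For reference, a standard textbook form of the Lucas-Kanade estimate is sketched below; the symbols are assumptions of this sketch.  Under the brightness constancy assumption, the flow $(u, v)$ at a pixel is taken to be the minimizer, over a small window $\Omega$ around that pixel, of
\begin{equation}
\sum_{q \in \Omega} \bigl( I_x(q)\, u + I_y(q)\, v + I_t(q) \bigr)^2 ,
\end{equation}
where $I_x$, $I_y$ and $I_t$ are the spatial and temporal derivatives of the image sequence.  The linearization behind this expression is only valid for small displacements, which is one reason the method loses precision when the motion between frames is large.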

\begin{figure}
\begin{center}
\includegraphics[width=1.1\linewidth]{Images/bakerkanade.png}
\end{center}
   \caption{Super-resolution optical flow algorithm: (1) Bilinearly interpolate each frame individually, (2a) Compute optical flow to neighboring frames, (2b) Warp each frame to its neighbors, (2c) Compute a robust mean of the collection of frames, (3) Deblur the result using a Wiener deconvolution filter.}
\label{fig:bakerkanade}
\end{figure}

The work of Bishop, Blake and Marthi~\cite{bishop} extended the results of Freeman et al.~\cite{freeman} to video.  They noticed that if you apply the VISTA algorithm individually to each frame of the video, the result is visually unappealing due to many temporal artifacts.  They noted that these ``distracting scintillations'' were caused by the lack of temporal consistency from applying the VISTA algorithm independently to each frame.  Since each frame is processed independently, a high-resolution image patch applied in frame $i$ might not be the same as the high-resolution image patch applied in frame $i+1$ (even though both patches might be identical or very close in low-resolution).  Their solution involved the addition of a regularization parameter $\beta$ to the cost function for selecting patches,  %(see Equation \eqref{eqn:cost})
in order to favor re-selecting the same patch for successive frames.  
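
Schematically, if $D_k^{(t)}$ denotes the cost of choosing candidate patch $k$ at a given location in frame $t$ without any temporal term, and $k^{(t-1)}$ is the patch chosen at that location in the previous frame, the regularized cost takes a form such as
\begin{equation}
E_k^{(t)} = D_k^{(t)} - \beta \, \mathbf{1}\bigl[ k = k^{(t-1)} \bigr],
\end{equation}
where $\mathbf{1}[\cdot]$ is a binary-valued indicator function.  This is a simplified sketch in our own notation, not the exact cost function of Bishop et al.; it shows only that a larger $\beta$ makes re-selecting the previous patch cheaper.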

%\begin{eqnarray*} 
%\mathcal{E}_k^{(\alpha, \beta)}(t) & = & \|\mathbf{x}^{(t)} - \mathbf{x}_k\|^2 \\
%& & + \alpha \mathcal{V}(\mathbf{y}_k, \{\mathbf{y}_{k'} \, : \, k' \in \mathcal{N}_\mathbf{x}^{(t)}\}) \\
%& & - \beta I(\mathbf{y} = \mathbf{y}_k(t-1)) \\
%\end{eqnarray*}

%Where $I(\cdot)$ is a binary-valued indicator function.  By including this in the cost function they were able to reduce the amount of %high-frequency scintillations in the resulting super-resolution video (See Figure \ref{fig:bishop}).

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/bishop1.png} \\
\hbox{}
\includegraphics[width=0.9\linewidth]{Images/bishop2.png} 
\end{center}
   \caption{(Above) Average absolute error between the super-resolution video and the ground truth video averaged across all frames, before applying the regularization parameter.  (Below) After.}
\label{fig:bishop}
\end{figure}

\subsection{Our Approach}

In this paper we propose new methods for combining the work of Baker et al.~\cite{bakerkanade} and Bishop et al.~\cite{bishop} to use both optical flow-based techniques as well as learning-based techniques for the problem of super-resolution for video.  We divide the paper into two sections with two different goals.  First, we wish to develop an algorithm that uses optical-flow techniques to reduce the temporal flickering associated with applying the VISTA algorithm to each frame individually.  Second, we wish to come up with a method for using optical-flow techniques for video sequences which are exactly low-resolution samples of some high-resolution image, shifted around randomly by some subpixel amounts, in order to extract high-resolution outputs that are of better quality than simply applying the VISTA algorithm to each frame individually.

\subsection{Viewing the Results}

Since the subject of this paper deals with the perceptual quality of video to the human eye, our results do not lend themselves well to being displayed in a static format (e.g., as figures in this paper).  Therefore I have posted screen captures of each frame of the video, side by side, on my personal webspace for viewing.  In addition I have posted MPEG files for each video, but due to the compression associated with this format they do not display very well.  These files will be made available at \url{http://people.csail.mit.edu/cyc/6.869/project/superres.html}.

\section{Reducing Temporal Flicker}

The goal of this algorithm is to eliminate the high-frequency flicker present in the approach of applying VISTA individually to each frame, while still retaining the desired perceptually-plausible high frequencies.  By applying the optical flow techniques of Baker et al.~\cite{bakerkanade} we hope to gather a collection of VISTA-enhanced samples for each frame.  Since the desired high frequencies will be present in each frame of the video, whereas the undesired high-frequencies (the noise) will vary from frame to frame, by taking a robust mean of these collections for each frame the hope is that the desired high frequencies will constructively interfere, whereas the undesired high frequencies will destructively interfere.

\subsection{A Super-Resolution Optical Flow Algorithm} \label{myalg}

This algorithm was adapted from the Super-Resolution Optical Flow algorithm of Baker et al.~\cite{bakerkanade}.

\begin{enumerate}
\item Apply VISTA~\cite{freeman} individually to each frame of the video (which uses bicubic interpolation to double the resolution of the image, and then adds in perceptually-plausible high frequencies).
\item For each frame of the video, iterate the following steps until convergence (in practice, 5--10 iterations are usually enough).  Note that for the first, second, second-to-last, and last frames you may simply leave them be.
\begin{enumerate}
\item For frame $X_i$, compute the optical flow from $X_i$ to $X_{i-1}$ and $X_{i-2}$, as well as to $X_{i+1}$ and $X_{i+2}$.  For the sake of efficiency, we use the Lucas-Kanade optical flow algorithm~\cite{lucaskanade}.
\item Warp the frames $X_{i-2}, X_{i-1}, X_{i+1}, X_{i+2}$ into the coordinate frame of $X_i$ to create a collection of aligned images.
\item Let $X_i'$ be a robust mean of $X_{i-2}, X_{i-1}, X_i, X_{i+1}, X_{i+2}$ (in practice this is usually just the arithmetic mean or the median).  Replace $X_i$ with $X_i'$ for the next iteration.
\end{enumerate}
\item (Optional) Deblur each frame using a Wiener deconvolution filter.
\end{enumerate}
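
In symbols (our notation), let $X_i^{(n)}$ be frame $i$ after $n$ iterations, with $X_i^{(0)}$ the VISTA output, and let $W_{j \to i}$ denote warping frame $j$ into the coordinate frame of frame $i$ using the estimated optical flow, with $W_{i \to i}$ the identity.  One iteration of step 2 then computes
\begin{equation}
X_i^{(n+1)} = \operatorname*{rmean}_{d \in \{-2, \ldots, 2\}} \; W_{i+d \to i}\, X_{i+d}^{(n)} .
\end{equation}
Whether neighbors are read from iteration $n$ or updated in place is not specified above; the synchronous form shown here is an assumption of this sketch.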

Note that in the last step, we may decide to deblur each frame of the resulting high-resolution video to remove any blur that may have been caused by imprecise optical flow.  Since the techniques used for estimating optical flow and for warping images according to this flow are not perfect, they may cause the resulting image frame to become blurry as a result of this imprecision (if the optical flow or warping is off by even a subpixel amount, this may cause some perceptible blurriness in the output).  Therefore in order to remove this artifact it may be necessary to deblur the image (in practice we use a Wiener deconvolution filter with a Gaussian blur kernel).

\subsection{Results and Analysis}

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/flicker_vista.jpg} \\
\hbox{}
\includegraphics[width=0.9\linewidth]{Images/flicker_myalg.jpg} 
\end{center}
   \caption{(Above) Average absolute error between the VISTA super-resolution video and the ground truth video averaged across all frames.  (Below) Average absolute error between the output of the algorithm in Section \ref{myalg} and the ground truth video averaged across all frames.  Note that much of the error density around the eyeglasses (where the flickering is most apparent in the video) has been reduced.}
\label{fig:flicker}
\end{figure}

It is difficult to objectively quantify the amount of ``flicker'' present in a video, since it is largely a measure based on human perception.  Thus for the purposes of comparing our algorithm to the output of VISTA applied individually to each frame we need to come up with some quantitative measure of flicker content.  In order to do this we create our input video by blurring and downsampling our ``ground truth'' video.  By doing so, we can compare our output to the ground truth to determine our error.  For the analysis of this algorithm we used an image sequence created from two pictures: one of a man with a neutral expression on his face, and the other of the same man with a smile on his face.  By computing the optical flow of these two images and temporally interpolating the intermediate frames, we created a test dataset of 32 frames of resolution $350 \times 350$.  
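
Concretely, writing $G_t$ for frame $t$ of the ground truth and $X_t$ for the corresponding frame of an algorithm's output (our notation), the per-pixel error map averaged over the $T = 32$ frames is
\begin{equation}
E(p) = \frac{1}{T} \sum_{t=1}^{T} \bigl| X_t(p) - G_t(p) \bigr| .
\end{equation}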

In Figure \ref{fig:flicker} we compare the output of applying VISTA individually to each frame to the output of our algorithm.  If one watches the associated video of the output of VISTA\footnote{Available at \url{http://people.csail.mit.edu/cyc/6.869/project/flicker_vista.mpg}}, one notices that most of the flickering occurs around the rims of the man's eyeglasses.  This is largely due to the fact that these are the most prominent edges in the image sequence, and hence the areas to which the VISTA algorithm tries to add the most high-frequency content (in order to preserve the shape of the edge in the high-resolution output).  In the output of our algorithm, this is somewhat reduced, as one can see that the dark areas around the man's eyeglasses are noticeably lighter.

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/noiseplot.jpg}
\end{center}
   \caption{Plot of the measure of flicker averaged over the pixels of the output of VISTA (shown in blue) and the output of the algorithm in Section \ref{myalg} (in red).}
\label{fig:noiseplot}
\end{figure}

However, Figure \ref{fig:flicker} only measures the average error between the ground truth and the outputs of the two algorithms, which is not what we desire to measure.  We wish to have some quantitative measurement of the ``flicker content'' of both outputs.  Therefore in Figure \ref{fig:noiseplot} we compare the high-frequency content of each video (the output of VISTA and the output of our algorithm).  The measure of high-frequency content was produced by applying a fifth-order highpass Butterworth filter to each frame of the video, and then averaging the absolute value of the result over all pixels.  This measure is plotted for each frame of the sequence.  The reduction in flicker content is real but modest, as one can see from the scale of the $y$-axis.
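
One way to write this measure, under assumptions we state explicitly, is as follows.  A Butterworth highpass filter of order $n$ is often defined in the frequency domain by
\begin{equation}
B(u,v) = \frac{1}{1 + \bigl( D_0 / D(u,v) \bigr)^{2n}},
\end{equation}
where $D(u,v)$ is the distance from the origin of the frequency plane and $D_0$ is the cutoff frequency; here $n = 5$, while $D_0$ is a free parameter whose value is not reported above.  (Some references instead define the filter by the square root of this expression.)  With $b$ the corresponding spatial filter, $X_t$ frame $t$ of the video being measured, and $N$ the number of pixels, the value plotted for frame $t$ is
\begin{equation}
M_t = \frac{1}{N} \sum_{p} \bigl| (b * X_t)(p) \bigr| .
\end{equation}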

Despite the reduction in flicker content, the output of our algorithm still maintains the high-frequency detail of the output of VISTA\footnote{Due to space constraints, the output cannot be displayed in this paper.  However, it may be viewed at \url{http://people.csail.mit.edu/cyc/6.869/project/superres.html}}.

\subsubsection{Misalignment Due to Poor Optical Flow}

\begin{figure}
\begin{center}
\includegraphics[width=0.9\linewidth]{Images/good.jpg} \\
\hbox{}
\includegraphics[width=0.9\linewidth]{Images/bad.jpg} 
\end{center}
   \caption{(Above) Correct ground truth image.  (Below) Output of the algorithm in Section \ref{myalg}.  Images taken from the dataset of the PhD thesis of Hedvig Kjellstr\"{o}m~\cite{hedvig}.}
\label{fig:hedvig}
\end{figure}

The above algorithm relies heavily on having precise optical flow information.  In practice this means that the video framerate must be extremely high so that we can reliably compute the Lucas-Kanade optical flow between frames.  However, when we apply our technique to a different dataset that does not have such a high framerate, we find that our algorithm fails.  As shown in Figure \ref{fig:hedvig}, we have an image sequence of a woman walking on a paved road.  Since the images were only sampled at a rate of about 15 frames per step, the optical flow algorithm we use has a hard time tracking the figure.  As a result, it cannot reliably warp successive frames into a common coordinate frame, and the result of averaging these misaligned frames is a very blurry figure.  This highlights the importance of having reliable, precise optical flow for \emph{any} optical flow-based approach to solving the super-resolution problem.  As mentioned in Zhao and Sawhney~\cite{zhao}, nearly all of the prior optical flow-based approaches relied on near-perfect alignment of images, and when tested on actual real-world examples they break down due to misalignment.  They conclude that ``errors resulting from traditional flow algorithms may render super-resolution\footnote{To be clear, they are referring only to optical flow-based methods.} infeasible''.

\section{Using Video Frames as Samples} 

In this section we consider a variant of the algorithm of Section \ref{myalg}.  The case we wish to consider is when our input video sequence is a randomly-ordered set of shifted downsamplings of a ground truth high-resolution image.  This may occur, for example, if we have a low-resolution video of a static subject, with random, nearly imperceptible perturbations of the camera (and therefore, the camera reads in a randomly shifted downsampling of the subject for each frame).  The rationale behind this is to come up with a contrived example where a set of frames in the video yields a lot of information about the underlying ground truth image.  In fact, if one processed all of the frames, one could entirely reconstruct the ground truth image in this model.  The hope is that by applying optical flow techniques to this simpler model, we can come up with higher-quality outputs than just applying VISTA individually to each frame.

\begin{figure}
\begin{center}
\includegraphics[width=\linewidth]{Images/Frames/teapot.jpg}
\end{center}
   \caption{Test image used for this algorithm.  In order to create the image sequence of the video, this image was blurred and (twice) downsampled by random offsets.}
\label{fig:teapot}
\end{figure}

In this toy model, we take our ground truth image (see Figure \ref{fig:teapot}) and construct a 16-frame sequence by selecting a random offset and then downsampling by 4 (see Figure \ref{fig:offsets} for the offsets used in our test cases).  Since the input is downsampled by a factor of 4 and each application of VISTA doubles the resolution, we first apply the algorithm of Section \ref{myalg}, and then apply VISTA to the results.  We compare the result to the output of applying VISTA twice to the downsampled input set.
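
In symbols, and under the assumption that the blur is a fixed convolution with some kernel $b$ (our notation), frame $k$ of the input sequence is
\begin{equation}
Y_k(m, n) = (b * G)(4m + r_k,\; 4n + c_k),
\end{equation}
where $G$ is the ground truth image and $(r_k, c_k) \in \{0, 1, 2, 3\}^2$ is the offset used for frame $k$.  Since the 16 frames use each of the 16 possible offsets exactly once, interleaving them recovers every sample of $b * G$, which is why the full sequence carries so much information about the underlying image in this model.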

\begin{figure}
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
8 & 15 & 2 & 16 \\
\hline
10 & 7 & 4 & 3 \\
\hline
11 & 14 & 6 & 12 \\
\hline
9 & 5 & 13 & 1 \\
\hline
\end{tabular}
\end{center}
\caption{Offsets used in the creation of the 16-frame input sequence.  A high-resolution image was downsampled by picking an offset in the order shown above and selecting every fourth pixel.}
\label{fig:offsets}
\end{figure}

\subsection{Results and Analysis}

\begin{figure}
\begin{center}
\includegraphics[width=\linewidth]{Images/flicker_diff.jpg}
\end{center}
   \caption{Absolute difference between the output of VISTA and our approach, averaged over all 16 frames.}
\label{fig:diff}
\end{figure}

In Figure \ref{fig:diff} we compare the average absolute difference between the output of VISTA and our approach.  We notice that most of the difference seems to occur around the edges in the image, which may be a result of our approach cleaning up much of the high-frequency noise that VISTA includes.

In Figure \ref{fig:closeups} we compare Frame 3 of the output for both algorithms.  The output of our approach has far fewer high-frequency artifacts than the VISTA approach, while still maintaining the sharp definition of edges.

Although we have no real way of measuring it, the approach outlined above appears to give a better output for this toy model.  Though this may simply be due to the ability of our algorithm to suppress the high-frequency noise that VISTA produces, it might be the case that our algorithm is extracting some additional information about the ground truth image from neighboring frames in the video.  Therefore I feel that this approach may be worth looking into in the future.

\section{Conclusion}

In this paper we have shown how to use optical flow-based techniques to alleviate some of the problems associated with applying VISTA individually to each frame of the video.  Although it relies heavily on the precision of the optical flow algorithm, this seems like a promising approach for producing a better super-resolution algorithm for video.  Such techniques may eventually yield better video compression algorithms for websites such as YouTube, for which bandwidth concerns are paramount.

\begin{figure*}
\begin{center}
\includegraphics[width=0.45\linewidth]{Images/Frames/sampling_vista03.jpg} \,\,\, \includegraphics[width=0.45\linewidth]{Images/Frames/sampling_myalg03.jpg} \\
\hbox{}
\includegraphics[width=0.45\linewidth]{Images/vista_closeup.jpg} \,\,\, \includegraphics[width=0.45\linewidth]{Images/myalg_closeup.jpg} \\
\end{center}
   \caption{Side-by-side comparison of Frame 3 of the dataset.  On the left is the result of applying VISTA twice to a blurred and (twice) downsampled version of Figure \ref{fig:teapot}.  On the right is the result of applying our algorithm.  Note that in the image on the right, edges are less noisy and features are clearer.}
\label{fig:closeups}
\end{figure*}

{\small
\bibliographystyle{ieee}
\bibliography{paper}
}

\end{document}
