Review of least-squares polynomial fitting
------------------------------------------

We consider a discrete 1-D "patch" of image intensity, I(x), of size 2w+1,
where x is an integer in [-w,w].

From calculus we know that I(x) can be decomposed into the following Taylor
series expansion about x=0:

    I(x) = I(0) + x*I'(0) + x^2/2*I''(0) + ...

We'll take the first (n+1) terms of the Taylor series to get an n-th degree
approximation of the full series:

    I(x) = I(0) + x*I'(0) + x^2/2*I''(0) + ... + x^n/n!*I^(n)(0)

In other words, we're approximating the function I(x) using an n-th degree
polynomial. The polynomial coefficients are determined by the value and
derivatives of the intensity at the origin, I(0), I'(0), ..., I^(n)(0), but
we don't actually know these in advance. Fitting them to the data will give
us the polynomial we're looking for.

[ Note: why not just calculate these derivatives directly from I(x)?
  Because the image is given at discrete points only, we can't say much
  about the derivatives without assuming a particular underlying model for
  smoothness ... but that very model (we assume a polynomial) is what we're
  trying to fit! ]

[ Note: we are re-estimating I(0), even though its value is already given!
  The curve we fit doesn't necessarily pass through this data point --
  rather, we find the best-fit polynomial that minimizes inconsistency with
  *all* of the data jointly. ]

For a given value of x, we can therefore write:

    I(x) = [ 1  x  1/2*x^2  ...  1/n!*x^n ] * [ I(0)     ]
                                              [ I'(0)    ]
                                              [ ...      ]
                                              [ I^(n)(0) ]

But we have 2w+1 different values of x, each giving one such equation with
the same unknown vector d. So we can stack them all together into a linear
system:

       I      =       X       *      d
    (2w+1)x1    (2w+1)x(n+1)     (n+1)x1
      known         known        unknown

This is just a linear system of the form A*x=b. Solving the equation X*d=I
in the least-squares sense (e.g. using the pseudoinverse or SVD) gives the
vector d minimizing ||X*d - I||^2.


Example with Gaussian weighting added
-------------------------------------

Let's take a concrete example:

    n = 2   (2nd-order approximation, i.e. fitting a quadratic)
    w = 4   (9x1 pixel image patch)

In this case X is

    [ 1  -4  16/2 ]
    [ 1  -3   9/2 ]
    [ 1  -2   4/2 ]
    [ 1  -1   1/2 ]
    [ 1   0   0   ]
    [ 1   1   1/2 ]
    [ 1   2   4/2 ]
    [ 1   3   9/2 ]
    [ 1   4  16/2 ]

and the unknown d is

    [ d_0 ]
    [ d_1 ]
    [ d_2 ]

Suppose that I is

    [ 2  3  4  1.1  0  0.9  4  5  4 ]^T

(draw on coord axes)

Note: near the origin it fits I(x)=x^2 almost perfectly, which would
correspond to d = [0 0 2]^T. But solving X*d=I takes *all* the data into
account, and so gives us something quite different:

    \hat{d} = [ 1.7576 ]
              [ 0.2300 ]
              [ 0.2727 ]

(sketch it, a slight curve closer to a horizontal line)

Now consider a weighted version of this, to favor the center pixels more.
Define the weights according to:

    \Omega(x) = exp(-x^2)    (un-normalized Gaussian, centered at the origin)

For this example, x = -4,...,4 and \Omega(x) is

    [ 0.0000  0.0001  0.0183  0.3679  1.0000  0.3679  0.0183  0.0001  0.0000 ]^T

(The extreme values aren't zero, just very small!)

Then the new linear system we need to solve is:

    W*I = W*X*d

where W is the (2w+1)x(2w+1) diagonal matrix formed by placing all the
weights, [ \Omega(-w)  \Omega(-w+1)  ...  \Omega(w) ], along the diagonal.

The least-squares solution to this (weighted) system gives the vector d
minimizing the norm

    ||W*I - W*X*d||^2 = ||W*(I - X*d)||^2

So the W matrix just has the effect of scaling the regular error, i.e. the
difference between the image I and the best-fit polynomial reconstruction
X*d, differently for each pixel. Pixels with higher weight (for this
example, pixels near the center) will contribute relatively more to the
total error.
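For reference, the minimizer of ||W*(I - X*d)||^2 can also be characterized
by the weighted normal equations (standard weighted least squares, stated
here without derivation):

    (X^T * W^T * W * X) * \hat{d} = X^T * W^T * W * I

(and since W is diagonal, W^T*W is just W^2). The solution returned by the
backslash solve (W*X)\(W*I) in the appendix satisfies exactly this system;
Matlab just uses a more numerically stable method than forming it
explicitly.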
For this example, solving the weighted equation gives:

    \hat{d} = [  0.0000 ]
              [ -0.0990 ]
              [  2.0000 ]

which is very close to I(x)=x^2, just as we'd expect!


More generally ...
------------------

Minimizing a quadratic error functional leads to a linear system of
equations. We sketched out the proof of this earlier using matrices, but
here's a more basic derivation. Let's start from a completely *general*
quadratic function defined for an n-dimensional vector x:

    E(x) = x^T*A*x + b^T*x + c
         = \sum_i \sum_j A_{ij}*x_i*x_j + \sum_i b_i*x_i + c

Note how this function includes all the possible quadratic terms (x_i^2),
cross terms (x_i*x_j, for i \neq j), and linear terms (x_i), as well as a
constant term.

Differentiating with respect to a particular component x_k, we obtain

    dE(x)/dx_k = \sum_i [ A_{ik}*x_i ] + \sum_i [ A_{ki}*x_i ] + b_k

The first term dots the k-th column of A with x, the second term dots the
k-th row of A with x, and the last term selects the k-th element of b.
Verify that this indeed does the right thing for the x_k^2 term!

The derivative is zero at extreme values, and a quadratic function has only
one such value (assuming A + A^T is invertible). Therefore we can find the
extremum by solving:

    dE(x)/dx_k = 0

    [ A_{1k} ... A_{nk} ]*x + [ A_{k1} ... A_{kn} ]*x + b_k = 0

    ( [ A_{1k} ... A_{nk} ] + [ A_{k1} ... A_{kn} ] )*x = -b_k

which is a simple linear equation in x. But the whole argument above
applies to any component x_k, so we get n different linear equations, one
per component of x. This is an (n x n) linear system, which can be written
simply as:

    (A + A^T)*x = -b

When the quadratic function satisfies E(x) >= 0, e.g. when it is an error
function of the form E(x) = ||x - y||^2, the extremum we get from solving
the linear system above will always be a *minimum* (can you see why?).


APPENDIX: Matlab source code
----------------------------

xx = [-4:4]                          % sample locations x = -w..w, with w = 4
X  = [xx.^0 ; xx.^1 ; xx.^2/2]'      % design matrix, rows [1  x  x^2/2]
I  = [2 3 4 1.1 0 0.9 4 5 4]'        % the 9x1 image patch
d1 = X\I                             % unweighted least-squares fit

Omega = exp(-xx.^2)                  % Gaussian weights \Omega(x)
W  = diag(Omega)                     % diagonal weighting matrix
d2 = (W*X)\(W*I)                     % weighted least-squares fit

xx_fine = linspace(-4,4,100);        % dense grid for plotting the fits
X_fine  = [xx_fine.^0 ; xx_fine.^1 ; xx_fine.^2/2]';
Omega_fine = exp(-xx_fine.^2);

figure(1); clf; hold on;
h = plot(xx,I,'.-'); set(h,'MarkerSize',20)
plot(xx_fine,X_fine*d1,'r-')
plot(xx_fine,9*Omega_fine,'k--')     % weighting function, scaled up for display
plot(xx_fine,X_fine*d2,'g-')
axis([-4 4 0 10])
legend('1D image patch','quadratic fit','weighting function, \Omega','weighted quadratic fit')
print -dpng polyfit
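The general result above can also be sanity-checked numerically against the
concrete example: writing the unweighted error E(d) = ||X*d - I||^2 in the
general quadratic form gives A = X^T*X and b = -2*X^T*I (the constant
c = I^T*I doesn't affect the minimizer), so solving (A + A^T)*d = -b should
reproduce the unweighted fit d1. A minimal sketch, meant to be run after
the script above (the names A_q, b_q, d_check are just illustrative):

A_q = X'*X;                          % quadratic term:  d'*(X'*X)*d
b_q = -2*(X'*I);                     % linear term:     b_q'*d = -2*I'*X*d
d_check = (A_q + A_q') \ (-b_q);     % solve (A + A^T)*d = -b
disp(max(abs(d_check - d1)))         % should be ~0 (up to round-off)

Since A_q is symmetric here, (A_q + A_q')\(-b_q) reduces to the familiar
normal equations (X'*X)*d = X'*I.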