Progress

From LadypackWiki

Table of contents

Friday, September 22, 2006 - progress pause

This project is on pause for a bit (has been since late July) while I focus on helping with the MIT Urban Challenge team and finishing up a book I'm working on. Hopefully, things will calm down in a month or so, and I'll be able to pick this up again.

Tuesday, July 18, 2006 - BBF, DGC, SFM

Implemented the Best Bin First (BBF) algorithm for approximate nearest neighbor search, as described in [1].

Spent a week or so working primarily on DGC. Picking out cameras and such.

Now looking at Structure From Motion methods. Plan on implementing some of the methods described in Hartley & Zisserman and applying them to the ladybug sequences.

[1] J.S. Beis and D.G. Lowe. Shape indexing using approximate nearest-neighbor search in highdimensional spaces. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 1000--1006, 1997.

Wednesday, June 28, 2006 - tracking SIFT features

Computed SIFT features on a video sequence using David Lowe's binary. Ran exhaustive nearest-neighbor search to match features in one frame to next using the metric described in Lowe's binary distribution (match to nearest neighbor if the distance to the strongest match is at most 0.6x the distance of the second strongest match).

Plotted the number of features successfully tracked over time. It's very easy to tell by visual inspection where I was standing still, and where I started moving again. We were hoping that there would be some nice valleys at scene changes (e.g. when passing through a doorway), and other statistical hints, but it might be harder than that.

It would be interesting to:

  • profile the types of features that get successfully tracked and those that don't.
  • account for velocity somehow and use that as a hint. haven't really thought this one through.

Image:Tracked-sift-all-cameras.png

 Excel spreadsheet

Monday, June 26, 2006 - caching SIFT features

This week I plan on running a bunch of experiments with SIFT. The first hurdle I'm encountering is that it takes quite a while to compute. Currently, on my 2.8 GHz dual processor Pentium Xeon workstation, David Lowe's Linux binary takes 4.0 seconds to operate on a single 1024x768 image from one of the ladybug2 cameras. Multiply this by 6 and factor in a few seconds for I/O operations, and we're looking at 30 seconds per frame. Since I'm capturing at 30 fps, a single 3 minute video sequence will have 30 * 60 * 3 = 5400 frames. SIFT will need 5400 * 30 = 162000 seconds == 45 hours to process 3 minutes of video. That's not very cool.

Optimized versions of SIFT do exist, so speedups are possible, along with algorithm improvements (e.g. running 6 in parallel?), but for now I'm just going to cache the results. Blech.

known implementations of SIFT

Friday, June 23, 2006 - CVPR

I'm back from CVPR. This was the first time I've attended a conference with more than a couple hundred people, and it was quite the experience.

CVPR 2006 notes

Sunday, June 18, 2006 - notes from CVPR2006

what is heteroscesdastic? kernel density estimation? RANSAC scale / optimal scale? M-estimator

Tuesday, June 6, 2006 - faster laptop hard disks

We've been using the ladybug2 in the compressed 30fps mode. This results in slightly more than 1 MB per frame. At 30fps, we can round it up to 35 MB/s. That's how fast we need to write the images to disk.

Problem is that the current laptop can't write the data fast enough. The longest data set we've taken so far is 4 minutes, before it runs out of buffer space and craps out.

I've been poking around looking for hard disks that can handle this kind of load, but haven't seen much yet. Seems like the fastest 7200 rpm 3.5" hard disks top out at around 25 MB/s. Paul at Point Grey suggests using a RAID0 configuration. I don't have much experience with that, so am collecting links.

http://www.tomshardware.com/2001/09/06/raid_without_additional_hardware/page11.html

http://groups.google.com/group/comp.sys.laptops/browse_thread/thread/c86483ffeddb5a7/24b9287809b25938?lnk=st&q=laptop+raid+support&rnum=2#24b9287809b25938

http://hardware.slashdot.org/article.pl?sid=05/08/25/1620204

Windows Software RAID Guide (http://www.techimo.com/articles/index.pl?photo=149)

Paul sez Alienware and http://carillondirect.com/ have laptops with hardware RAID solutions.

Thursday, June 1, 2006 - visiting University of Washington

I asked Seth if he knew a good way for me to find a legitimate excuse to hang out in Seattle for the summer, and he contacted Steve Seitz (http://www.cs.washington.edu/homes/seitz/), who was kind enough to give me some desk space for a few months. I'll still be working on my MIT projects, but will be doing so while in Seattle. So I'm technically a visiting graduate student here for a few weeks each month.

Chatted with Aseem Agarwala (http://agarwala.org/) yesterday, who's interested in portable mounts for firewire cameras. He wants to do some sort of real-time interactive Rephotography (http://en.wikipedia.org/wiki/Rephotography) assistant that helps the user align the camera in exactly the right spot.

Aseem pointed me to Voodoo Camera Tracker (http://www.digilab.uni-hannover.de/docs/manual.html), which apparently is useful in Structure From Motion computation.

And also Speeded Up Robust Features (http://www.vision.ee.ethz.ch/publications/get_abstract.cgi?procs=392&mode=&lang=en) (Herbert Bay, Tinne Tuytelaars, Luc Van Gool - ECCV 2006)

http://www.vision.ee.ethz.ch/~surf/

Wednesday, May 17, 2006 - bilinear constraint is a poor norm

The bilinear constraint of the egomotion equations is a poor norm with which to evaluate the consistency of an optical flow vector with the estimates of camera translation and rotation. It's poor because it's highly susceptible to random image noise, and can quickly classify good estimates as poor estimates and vice versa. I've been trying to come up with a more robust norm, without success. I attribute this to not knowing the depth of scene points, and evaluating the estimates would be much easier with some kind of reasonable estimate on the depth of tracked scene points.

There are two ways I could get depth of a tracked scene point. One is by doing stereo correspondence in a single frame. Since the ladybug cameras have significant overlap, and they are well calibrated, it would be possible to roughly estimate the depth of tracked features that are visible in more than one camera. Determining which tracked features are good candidates for this is pretty easy, as I could just apply a mask based on the location of the tracked feature in the image.

Another way to extract depth is by doing short sequence structure from motion. This would be a sort of iterative process that optimizes the scene depth of tracked features along with the camera rotation and translation vectors.

Since these two approaches are independent of each other, it's also possible to combine them...

Wednesday, May 17, 2006 - associating labels with visual cues

I put together a proof-of-concept system for associating a transcribed speech label with a visual cue identified by the laser pointer.

link to xvid video (http://people.csail.mit.edu/albert/ladypack/media/lp1-20060428-013715-speechlabel.xvid.avi). This is the result of automatic speech recognition and laser pointer detection. Speech is manually segmented by the user during the data collection phase (toggle button on the cell phone marks the start and end of an utterance). Laser dot is only searched for during the frames corresponding to an utterance, which solves the "is it present or not" problem.

Right now, the visual cue is simply the laser dot. Ideally, we'd want to be able to use some kind of robust visual descriptor, like a collection of SIFT features. In order to do that, the image would need to be segmented into foreground/background, where foreground is the object to be described. One way to do this is to give segmentation hints in the audio stream. For example, if I say "This is the door to Russ's office", then it could run a door detector and fit a model of a door to the area pointed to. That may not be terribly scalable, so another option is to give more general hints that merely describe the shape and size of the object being described. These could range from coarse hints like, "this small thing is the fish tank" and "this big thing is the door to Russ's office", to very specific hints like, "This is the door to Russ's office. It is 6 feet tall and 3 feet wide." I'm not sure which is the best way to go.

Monday, May 8, 2006 - Segmenting Speech

One problem that I had overlooked early on was the problem of segmenting an audio stream into utterances. Speech recognizers operate on individual utterances, and don't handle any larger speech structures than that, so the first task in speech recognition is to segment your audio stream. According to Lee Hetherington, this is quite an active research area with no agreed upon solution. The naive way to approach the problem is by using what are called energy-based methods, which roughly equate noise with speech - if the energy of an audio signal is above a certain threshold, then the user is saying something. These approaches, while simple, are susceptible to poor calibration (where do you set the threshold?), dynamic environments (moving from a quiet office room to a noisy cafeteria), and spurious noise (tapping the microphone or aspirated consonants).

In the end, I decided that segmenting an audio stream is not a problem that I want to approach or even to outsource, so I modified my cell phone to serve as a manual speech segmenter. The cell phone already controls when the overall recording starts and stops, so it was a fairly simple matter to add a couple more controls to mark the beginning and end of an utterance. This puts a bit more onus on the user to correctly segment audio streams, but it's not too hard to tell someone, "push this button before you start speaking, and again once you stop".

Monday, May 8, 2006 - Derivation of optical flow equations for a spherical camera

Derivation of optical flow equations for a spherical camera

Sunday, May 7, 2006 - Finding a Laser Dot

Finding a Laser Dot

Wednesday, Apr 19, 2006

The error metric I was using to evaluate the estimate of camera translation T and rotation Ω is lousy, now that I think about it some more. The problem is that it only works if there's an implicit assumption about the depth of imaged feature points. Specifically, it's perfect if all imaged points have a 5-meter radius from the camera, and performance deteriorates the greater the deviation from that depth. That's no good in general.

Thursday, Apr 13, 2006

Spoke to Jim Glass, who says they have some new speech recognition software that should simplify the process of doing basic speech recognition. It's based on an XML-RPC approach, where you package both the grammar and the audio data into an XML-RPC request, and the return result is an n-best list. Lee Hetherington sent me some sample code to get started, which I'll be taking a look at soon.

Also todo, read through Matt Antone's PhD thesis (http://groups.csail.mit.edu/graphics/pubs/thesis_antone.pdf).

Peter Sand sent me a compiled binary to do high quality optical flow computation using the Brox method. Ideally, I'd replace the KLT feature tracker with this code, but it's ridiculously slow (minutes per image), and already written in C++, so there's absolutely no way it will perform as fast as I need it to. Regardless, it would be nice to have a Linux version (Peter's is win32), so I'll try porting it to Linux when I have a chance.

Monday, Apr 10, 2006

I've packaged up a first round of data files and sent them to Ed Olson, who will be putting them through his SLAM solver (http://rvsn.csail.mit.edu/graphoptim) and see how they look.

Will spend the next couple days getting a Galaxy (http://groups.csail.mit.edu/sls//technologies/galaxy.shtml) / SpeechBuilder (http://groups.csail.mit.edu/sls/technologies/speechbuilder.html) installation up and running to do automatic speech recognition.

Wednesday, Apr 5, 2006

My first implementation of egomotion estimation started by using the IMU that's rigidly attached to the Ladybug to provide an initial guess for the camera's rotational velocity. Ω and T were then iteratively optimized. In the E step, an estimate is made of the expected optical flow at each feature point. This uses equation (1) from yesterday, and furthermore assumes that the world point corresponding to each feature point has constant depth from the camera. This was, of course, an awful assumption. The compatibility between each observed optical flow vector α and expected optical flow vector β was measured as

w = \frac{ \frac{1}{2}(||\alpha||^2 + ||\beta||^2) + \alpha \cdot \beta}{||\alpha||^2 + ||\beta||^2}

This was a pretty arbitrary function I just came up with and has the property that for | | α | | = | | β | | , it takes on value 0 when the two vectors are exactly opposite, value 1 when they're coincident, and 0.5 when they're orthogonal. The idea was to give difference in direction more importance than difference in magnitude, while still giving some weight to difference in magnitude.

After a number of trials, I came to the conclusion that the iterative optimization of Ω and T wasn't doing very well. I still need to figure out why...

I ended up not estimating Ω at all, because the IMU estimate was already quite good (by visual inspection). By holding Ω fixed, I could get a reasonable estimate of T.

Re-weighting the optical flow vectors hasn't had much of an effect, but I haven't given this much of a chance yet.

Since egomotion estimation can only produce the direction of T, I'm integrating the linear acceleration of the IMU to get a velocity measurement. I then take the magnitude of that velocity and apply it to T. This works fine for a short period of time, but the error can grow without bound. To compensate for that, I use a threshold based on what I'm calling the T field mean. The idea is to subtract the estimated rotational component from the observed optical flow. What's left should be optical flow generated as a direct result of camera translation. If the mean vector of this residual flow field has magnitude less than a certain threshold, then I clamp the estimate of T to zero. This worked pretty well for a few short (< 60s) data sets.

For now, I'm going to leave the algorithm as it is and begin putting something together to feed to Ed's SLAM solver.

One improvement I could make is to incorporate more constraints from the IMU. The next signal to incorporate is the absolute roll and pitch of the IMU, as the gravitometers are quite good at providing good estimates of those. This means that the rotational velocity update would only update the yaw (heading).

Tuesday, Apr 4, 2006

Given an optical flow field, try to estimate the camera's motion through the world (egomotion estimation). To simplify the problem, make the very strong assumption that the world is static and the camera is the only moving object in the scene. Use a camera-centric reference frame, so that the position of any point in the world is given with the camera at the origin. If this static-world assumption holds, and the camera has translational and rotational velocities T and Ω, then the motion \dot{P} of a given point P in the world will be

(1) \dot{P} = - T - \Omega \times P

Most cameras cannot directly observe P, and only the projection \hat{P} of P onto the camera's image plane. For a spherical camera (the ladybug), we have

(2) \hat{P} = \frac{P}{||P||}

Gluckman and Nayar adapted Horn's equations of motion by taking the derivative of \hat{P} with respect to time, and substituting in the first equation. After a bunch of algebraic manipulation, you get to

(3) U(\hat{P}) = \frac{d}{dt}\hat{P} = \frac{1}{||P||}((T\cdot\hat{P})\hat{P} - T) - \Omega \times \hat{P}

If T is zero (no translation), then the equation simplifies to

(4) U(\hat{P}) = - \Omega \times \hat{P}

If T is non-zero, then we can (still following Gluckman & Nayar) take the cross product with \hat{P} and the dot product with T to arrive at

(5) T \cdot ( \hat{P} \times (U + (\Omega \times \hat{P}))) = 0

These are the equations that I started with to do egomotion estimation for the Ladybug camera. Originally, Seth suggested some form of EM optimization. The reasoning here is that the original assumption of a static world just doesn't hold in practice, and many optical flow vectors will not be a direct result of the camera's motion. That, combined with noise and imperfect optical flow computation, results in data that's difficult to work with. The purpose of EM would be to iteratively estimate T and Ω (M step), while also estimating the likelihood that each optical flow vector was generated by the camera's motion (E step). Flow vectors would be weighted during the M step to assist in the egomotion estimation.

So what we start with is a long list of optical flow vectors. Dense optical flow would be computationally prohibitive, so for now i'm tracking Harris corner features with stb's KLT tracker. I'm limiting the feature detector to 200 features per Ladybug image, for a total of 1200 features per frame (6 images per frame). Not all of the features are successfully tracked across frames, so we'll typically have around 1100 optical flow vectors per frame. The Ladybug camera calibration is known, so the position of each of these features can be mapped to a sphere in the Ladybug-centric reference frame. Using the terminology above, we have \hat{P}_i and Ui for i = 0 \ldots 1100.

Now we have two unknowns T and Ω to estimate. The idea here is to do some sort of gradient descent. First, hold Ω fixed and optimize T. Then, hold T fixed and optimize Ω.

To optimize T while holding Ω fixed, we can rewrite (5) by replacing the cross product with a matrix multplication and arrive at

M = \begin{bmatrix}     0 & \hat{P}_3 & -\hat{P}_2 \\    \hat{P}_3 & 0 & -\hat{P}_1 \\    -\hat{P}_2 & \hat{P}_1 & 0 \end{bmatrix}

T \cdot (M^2 \Omega) = T \cdot ( M U )

With 1100 of these, we have 1100 linear equations in two unknowns. This system is homogeneous, though, and has a degenerate solution of T = 0, so to get a non-zero estimate, we need to add the constraint | | T | | = 1. SVD can then be used to solve for T.

It's not always the case that we'll want a non-zero T. There's a skeezy hack to get around that.

To optimize Ω while holding T fixed, we only need to solve a non-homogeneous system of linear equations. If T is zero, then (4) can be used directly in a linear least squares fashion to solve for Ω. If T is non-zero, then (5) can be used in one system of linear equations to produce a single estimate for Ω.

Older log entries

Navigation