Some tips for coming up with a project:
- Find a big data set and implement an algorithm from class
on that data set. If you want to do more:
Compare the algorithm from class to
a linear time algorithm? Try a few versions of the sublinear time
algorithm? Invent a new version of the class algorithm and try it out?
Perhaps the sublinear algorithm isn't actually faster because of
the memory model in the machine, e.g. disk scans are faster
than jumping around and sampling -- can you take
advantage of sampling + fast scans?
- Can you get a better dependence on n or epsilon for
some algorithm in class? (check with me before you work too hard on that..
I may already know a better bound) Can you get a better lower bound?
Can you get a better dependence for a special class of inputs?
(say subgraphs of the grid? planar graphs? expanding graphs? random graphs?)
-
What if you add a stronger assumption on the form of the input?
For example, maybe the input is a set of points in sorted order?
(in this case, you can do binary search and a number of other things)
What can you do that is interesting in this case for other
problems that we considered in class?
Another way of viewing this question -- what if I give you access
to some sort of famous data structure for the input (say range queries,
nearest neighbor queries, or something like that) -- can you do anything
interesting with that?
-
What if I give you a weaker assumption on the form of the input -- maybe
you can only get random examples of the input rather than query access.
Can you solve any interesting property testing problems in this case?
For example, for estimating the diameter of a graph or a point set?
For estimating the average degree of a graph?
-
Most of our distance related results are in Hamming distance or L1 distance.
Is it easier or harder for a different distance metric -- e.g. EMD, or
any other favorite distance?
e.g., are there sublinear time tests for whether the data is well approximated by a histogram under EMD (as opposed to L1,L2 distances, which has been studied).
- Some specific questions that may or may not be doable:
(come see me for explanations)
-
Sampling correctors for distributions that are one of (bimodal,
juntas, monotone over the plane).
-
Local computation algorithms for any job shop scheduling problem.
-
LCAs for sparse subgraphs that aren't so sparse (i.e., average degree
10) (improving on general questions of Levi, Ron, Rubinfeld
which look for sparse subgraphs that are (1+eps) average degree)
A few links to research that came out of previous projects in this class: