Sun Jan 30 2005

This directory contains the second release of the Linux binaries for
the new version of GMTK. Specifically, you will find in this directory:

New versions:

gmtkJT - new version of score (prob(evidence))
gmtkEMtrainNew - new EM trainer
gmtkViterbiNew - new Viterbi inference engine
gmtkTriangulate - core GMTK triangulation engine
gmtkDTindex - create a decision tree index file
gmtkParmConvert - convert parameters
gmtkTFmerge - merge two or more triangulation files
gmtkTime - compute work done in given amount of time

There is currently no documentation for the new versions. The new
version is much faster and more powerful, however, so I am making
these files available for Linux now, until the actual source and
documentation are available.

The versions here also support language model files, and a number of
other features.

The names of the new programs eventually will change as will the front
end Viterbi program interface. The old versions (prior to 2004) are no
longer being maintained.
   
Again, there is *no* documentation at the moment (this is a pre-alpha
version), but we are happy to answer questions as best we can. The
new version now supports short utterances (of template length or
template length + 1) and disconnected networks. Because of the
speedups, my group and a few others have found the new versions
indispensable to getting their work done.

Using very simple triangulations (see below), measured wall-clock
inference speedups (time_old_version/time_new_version) have so far
ranged from about 4.5 at minimum to about 700 at maximum (almost 3
orders of magnitude), using a combination of system and algorithmic
speedups. If you try the new version and you are not getting a
speedup, or if things are slow, it is very probably because you are
using a poor triangulation. Memory use drops by similar factors
(from about 3.5 up to 100). The new GMTK is not slow, but the new
GMTK with a poor triangulation can be. It is therefore important
to try to find a good triangulation!

In the new version, all triangulation is done offline via
gmtkTriangulate. All the new inference programs require a ".trifile".
Getting a good triangulation at the moment takes some art (a better
way is in the works).

For now, here is a quick and dirty triangulation that usually works
pretty well when you have much determinism in your graph (xxx is
your structure file):

   gmtkTriangulate -strF xxx -rePart T -findBest T -triangulation completed

0) Note that '-rePart T' implies '-reTri T'.
1) If you can afford it, it is often very good to keep '-findBest'
   turned on. If it is taking a long time, you might also try
   '-force L' or '-force R', whichever completes faster. This can
   give a big speedup, depending on the graph of course, and
   including the appropriate '-force' option should make the delay
   tolerable. Note, however, that '-findBest T' runs an exponential
   algorithm, so it might run for weeks (this is not an infinite
   loop). If you are seeing this on any of your graphs, try both:
           "-findBest F -force L"
   and
           "-findBest F -force R"
   and use whichever resulting .trifile runs faster. Also, use
   '-verbose 30' to have it print out what is going on. If you see
   memory growing slowly, this is not a memory leak; rather, it is
   memoizing previous cases. If you want to turn off memoizing, use
   the '-noBoundaryMemoize' option.
 
2) In general, it's good to keep backup triangulation files around.
   Triangulation files can take a long time to generate and are easy
   to delete, so gmtkTriangulate's default is to keep 10 backup
   trifiles. This has saved me much time in the past, at the cost of
   a bit of directory messiness.

3) It's probably good to use the default trifile name (i.e.,
   .str.trifile), since the gmtkJT, gmtkViterbiNew, etc. command lines
   are then shorter. That is, if you don't give an '-outputTri' option
   for a structure file foo.str, the trifile will be named
   foo.str.trifile (which is the default name for the inference
   programs).

4) Currently, the 'completed' triangulation seems in general to work
   well when there is much determinism in the graph. We know that
   better ones exist (for esoteric reasons having to do with the E
   partition), but they can be difficult to find.
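
The '-force L' versus '-force R' comparison from point 1 can be
scripted. The following is only a rough sketch: it assumes a structure
file named foo.str (a placeholder), and the '.forceL'/'.forceR' output
names are our own convention, not a GMTK default. It uses only the
options mentioned above and falls back to a dry run (printing the
commands) when gmtkTriangulate is not on your PATH.

```shell
#!/bin/sh
# Structure file to triangulate; 'foo.str' is a placeholder.
STR=${STR:-foo.str}

# Build the command line for one '-force' side (L or R). Each result goes
# to its own file via '-outputTri' so the two triangulations do not
# overwrite each other (the file naming here is hypothetical).
force_cmd() {
    echo "gmtkTriangulate -strF $STR -rePart T -findBest F -force $1" \
         "-triangulation completed -outputTri $STR.force$1.trifile"
}

for side in L R; do
    cmd=$(force_cmd "$side")
    echo "+ $cmd"
    # Run only if the binary is available; otherwise this is a dry run.
    if command -v gmtkTriangulate >/dev/null 2>&1; then
        start=$(date +%s)
        $cmd    # unquoted on purpose: split the string into arguments
        end=$(date +%s)
        echo "force $side: triangulation took $((end - start)) s"
    fi
done
```

The script reports only how long each triangulation took to produce.
To decide which trifile to keep, you would still run your inference
program with each one and compare the run times.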

When you do not have any determinism, use the following:

   gmtkTriangulate -strF xxx -rePart T -findBest T -anyTime 60s

The program will run for 60 seconds and output the best triangulation
it found in that amount of time. If you want to see what it is doing,
run it with the '-verbose 20' option.
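
Since the inference programs need a trifile, it can be handy to
triangulate only when one is missing. The sketch below picks one of
the two recipes above and skips the work if the default trifile is
already there. It assumes the default name is the structure file name
plus '.trifile' (as in foo.str.trifile); the function names and the
"yes"/"no" determinism argument are made up for illustration.

```shell
#!/bin/sh
# Print the triangulation command for a structure file.
#   $1 = structure file
#   $2 = "yes" if the graph has much determinism, anything else otherwise
tri_recipe() {
    if [ "$2" = "yes" ]; then
        echo "gmtkTriangulate -strF $1 -rePart T -findBest T -triangulation completed"
    else
        echo "gmtkTriangulate -strF $1 -rePart T -findBest T -anyTime 60s"
    fi
}

# Triangulate only if <structure file>.trifile does not exist yet.
ensure_trifile() {
    if [ -f "$1.trifile" ]; then
        echo "using existing $1.trifile"
    else
        cmd=$(tri_recipe "$1" "$2")
        echo "+ $cmd"
        # Run only if the binary is available; otherwise just show the command.
        if command -v gmtkTriangulate >/dev/null 2>&1; then
            $cmd
        fi
    fi
}

# Example call with a placeholder structure file name.
ensure_trifile foo.str yes
```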


Other points:

1) The new versions (inference and triangulation) have a '-verbose'
   option that takes a number between 1 and 100. Higher numbers print
   out many more messages (inference with '-verbose 100' prints out
   every step!!). The default is 10. Good values for debugging
   your output include:
         -verb 50,  prints partition messages
         -verb 60,  prints clique insertions
         -verb 70,  prints variable iterations w/o parent values
         -verb 80,  prints variable iterations w parent values

2) In the new version, you need to make sure that each decision tree
   maps *all* possible parent values to a valid child value. The
   aurora tutorial on the web has a bug where some sets of parent
   values are mapped to values out of the range [0:card-1] of the
   child. This was never a problem in the old version as that case
   just happened to occur with zero probability (so was pruned away
   before it occurred), but in the new version, for some
   triangulations you'll get an error.

   Also, note that decision tree leaf node formulas are now surrounded
   by curly braces, {}. In other words, your DT leaf formulas must look
   like:
   
       -1  { p0 }
   
   rather than the old version:

       -1  ( p0 )

   The good news is that GMTK now supports arbitrary integer formulas in 
   DT leaf nodes, so you can do things like:

       -1  { min(p0,3+p1>>2) + 5 } 

   Many different integer operations are defined. Documentation forthcoming.

3) Related to 2 above, the new version can benefit significantly (in
   both speed and memory) from using RV cardinalities that are as
   small as possible. For example, if you have a binary transition
   variable, make sure the card is 2. While it is not incorrect to
   make it larger than 2, keeping it at 2 will help in some cases,
   and the benefit can be significant.

4) The new version has changed the names 'GC_IN_FILE' to 'MC_IN_FILE'
   ('mixture component' rather than 'Gaussian component') and has changed
   'MG_IN_FILE' to 'MX_IN_FILE' ('mixture' rather than 'mixture Gaussian').

5) In gmtkEMtrainNew, you'll see a few new beam pruning options, specifically:
    
     -cbeam is analogous to the beam pruning that existed in the old
            version; it prunes entries at the clique level.

     -sbeam is new, and it prunes at the separator level. Removing a
            separator entry potentially corresponds to removing many
            clique entries.

     '-ckbeam k', where k is an integer, is also new. It prunes
            cliques by keeping only the best k clique entries.

     '-crbeam p', where 0 < p <= 1, retains the fraction p of the
            clique entries after pruning (kind of like a combination
            of -cbeam and -ckbeam).

     You can try -cbeam, -ckbeam, or -sbeam individually, or use them
     all together. Early studies show that -cbeam and -sbeam have
     similar effects, although this depends highly on the
     distributions. -ckbeam can be very effective for high "entropy"
     cliques.

     -ebeam is also new, and it prunes clique entries during EM
            training. If you have very high dimensional Gaussians
            and/or you wish to prune away "outliers" during EM
            training, then -ebeam will do the trick.

     -cpbeam is newer still, and it prunes partially completed clique
            entries in the C partition based on an estimate of what
            the maximum scoring clique entry will be. This is good if
            you want to reduce both run time and memory (values are
            similar to those for -cbeam).



6) If you see the following, please ignore it for now; it is harmless
   (and will be fixed soon).

      WARNING: Can't close pipe 'cpp foo.str.trifile'

   where 'foo' is the name of your structure file.
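
The syntax changes in points 2 and 4 above are mechanical, so files
written for the old version can be updated with sed. What follows is a
sketch under assumptions, not an official conversion tool: it assumes
each old-style leaf is on one line of the form "-1 ( formula )" with
nothing after the closing parenthesis, and that the old keywords
appear only where they should be renamed. The function names are made
up for illustration. Check the output before using it.

```shell
#!/bin/sh
# Point 4: rename the old keywords to the new ones.
rename_keywords() {
    sed -e 's/GC_IN_FILE/MC_IN_FILE/g' -e 's/MG_IN_FILE/MX_IN_FILE/g'
}

# Point 2: rewrite one-line leaf formulas "-1 ( ... )" as "-1 { ... }".
# Lines that do not match exactly (for example, leaves followed by other
# text, or leaves that are plain integers) are passed through unchanged.
convert_dt_leaves() {
    sed -E 's/^([[:space:]]*-1[[:space:]]+)\((.*)\)([[:space:]]*)$/\1{\2}\3/'
}

# Small demonstration on literal input lines.
printf '%s\n' 'GC_IN_FILE inline' | rename_keywords
printf '%s\n' '       -1  ( p0 )' | convert_dt_leaves
```

Both functions read standard input and write standard output, so a
file can be converted with, e.g., "convert_dt_leaves < old > new",
where 'old' and 'new' are whatever file names you use.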
  

Feedback and any bugs you discover would be much appreciated.


