======================================================================
======================================================================

This is BOOSTREE v.0.1, bundled with the OMLET library v.0.96

Contact : 

Xavier Carreras  : carreras --at-- csail.mit.edu
Lluis Padro      : padro --at-- lsi.upc.edu


See also the homepage of OMLET+FRIES: 

  http://www.lsi.upc.edu/~nlp/omlet+fries/
	

September 2007

======================================================================
======================================================================
DESCRIPTION


-- This package provides code that implements the AdaBoost.MH learning
algorithm, with a decision tree learner as weak learner.  This
learning algorithm is a boosting algorithm for multilabel
classification problems, and was introduced by Schapire and Singer in
1999 [1] (specifically, see section 7 of [1]).

-- In multilabel classification, there exists a number of predefined
categories or labels. Examples may belong to zero or more categories.

-- At training time, the algorithm receives a collection of labeled
examples (i.e. the training set) and learns a classifier.

-- At testing time, the algorithm receives a classifier and new
examples. For every example, it produces a prediction score for each
category. These scores are interpreted as follows: if the score
associated to some category is positive, the classifier predicts that
the example belongs to the category; if the score is negative, the
classifier predicts that the example does not belong to the
category. In addition, the magnitude of the score (that is, its
absolute value) can be interpreted as the confidence of the
prediction. Higher magnitudes are interpreted as more confident
predictions (it can be shown experimentally that, in fact, the
magnitude of a prediction is a good estimate of the confidence of that
prediction).

-- Note that binary classification can be seen as a particular case of
multilabel classification. In the binary case, there exists one single
category, and examples can belong or not to that category. Therefore,
running the multilabel AdaBoost.MH classifier on binary data is
equivalent to running the the binary version of AdaBoost discussed in
the first sections of the article by Schapire and Singer.

-- This algorithm can be used as well for multiclass classification,
where each example belongs to one and exactly one category. In this
setting, the predicted category for a test example would be the
category that receives higher score. However, it has to be noted that
the AdaBoost.MH algorithm is designed for multi-label problems, where
labels do not "compete" between each other. In the literature there
exist many algorithms designed specifically for multiclass
classification.

-- The code assumes a representation of examples based on binary
features. Moreover, the code assumes sparse representations, in which
only a small fraction of features are active in an example.

-- This algorithm has given very good results in many NLP problems. In
some cases, NLP systems based on this algorithm have obtained the
state-of-the-art results. The following papers make use of the code
distributed here: [2], [3], [4], [5], [6], [7]. 


REFERENCES


[1] Schapire, Robert and Yoram Singer. "Improved boosting algorithms
    with confidence-rated predictions". Machine Learning 37:3,
    pp. 297-336, 1999.

[2] X. Carreras and Ll. Màrquez, "Boosting Trees for Clause Splitting",
    In Proc. of CoNLL-2001 Shared Task. 2001.

[3] X. Carreras and Ll. Màrquez, "Boosting Trees for Anti-Spam Email 
    Filtering", In Proc. of the Conference on Recent Advances in NLP
    (RANLP'01). 2001.

[4] X. Carreras, Ll. Màrquez and Ll. Padró, "Named Entity Extraction
    Using AdaBoost", In Proc. of CoNLL-2002 Shared Task, 2002.

[5] X. Carreras, Ll. Màrquez and Ll. Padró, "A Simple Named Entity
    Extractor Using AdaBoost", In Proc. of CoNLL-2003 Shared Task, 2003.

[6] Ll. Màrquez, P. Comas, J. Giménez and N. Català, "Semantic Role
    Labeling as Sequential Tagging", In Proc. of CoNLL-2005 Shared
    Task. 2005.

[7] M. Surdeanu and J. Turmo, "Semantic Role Labeling Using Complete
    Syntactic Analysis", In Proc. of CoNLL-2005 Shared Task, 2005.


======================================================================
INSTALLATION AND PLATFORM DETAILS

The code has been developed on GNU/Linux-x86 machines, using g++
v3.2. Moreover, it has been succesfully tested on OSX platforms.

To compile and run the code, the following packages are required : 
  - gcc 3.x or gcc 4.x
  - automake
  - Perl (only to use the evaluation program)

To install the code, unpack the distribution package in some
directory. In that directory, run the following commands :

   $  ./configure
   $  make
   
You will find the programs in the src/bin directory.


======================================================================
TRAINING AND TESTING CLASSIFIERS 

This package offers a program named "boostree". This program is used
for training a classifier from data, and for testing the learned
classifer with new examples. After compiling the package, "boostree"
is found under directory src/bin.

In addition, the package offers an evaluation script named
"evaluate.pl", which computes standard evaluation metrics such as
classification accuracy, precision, recall or the F-measure. The
script is also found under the src/bin directory, requires
Perl, and does not require compilation.


DATA FORMATS 
------------

An examples consists of two lists (both can be null): 
  - The list of categories that the example belongs to. 
  - The list of boolean features that are active in the example

In the programs, categories and features are represented by
integers. In a multilabel problem with L possible categories, the
categories take values from 0 to L-1. On the other hand, in a feature
space of D features, feature indices take values from 1 to D.

A data file is a collection of examples. Each example is represented
with two lines. The first line encodes the positive categories for the
example, separated by spaces. The second line encodes the active
features in the example, saparated by spaces. Either lines may be
empty, if this corresponds. 

For testing, the format of the data files is the same as described. If
the categories of the test examples are not known, the line encoding
the categories shall be left empty.

TRAINING A CLASSIFIER 
---------------------

The following is a typical command to train a classifier : 

$ boostree --train -l <labels> -t <rounds> -d <depth> training_file model_file

where : 

 training_file : file containing the training examples
 model_file    : file where the learned model will be saved to
 -l int        : number of labels in the data (required parameter)
 -t int        : number of boosting rounds (required parameter)
 -d int        : depth of the weak decision trees; a value of 0
                  indicates decision stumps (defalult option); 
		  greater values indicate deeper decision trees

Other available options for training : 
 -e float      : smoothing value (see paper of Schapire and Singer '99)
 -v int        : verbosity level, 0 no verbose, 1 verbose (default)


The following is a typical command to test a classifier : 

$ boostree --classify -L <labels> -T <rounds> model_file test_file 

where : 

 model_file    : file containing the model
 test_file     : file containing the test examples
                  if not given, examples are read from the standard input
 -l int        : number of labels in the data (required parameter)
 -t int        : number of weak rules considered for testing
                  by default, all weak rules are considered
 -v int        : verbosity level, 0 no verbose, 1 verbose (default)

When testing, the program produces one line per testing example, with
two fields per category. The first field is the prediction score
produced by the classifier on that category. The second field is a
sign indicating the membership of the example to the category (this
information is taken from the input test file). 


======================================================================
AN EXAMPLE WITH TEXT CATEGORIZATION 

This package includes toy datasets to illustrate the use of the
programs for learning and testing classifiers. The task associated
with this data is text categorization. There are 10 defined categories
or topics. The goal of the task is to determine the topics addressed
in a given text or document. Examples are represented with a
bag-of-words, as is standard in the problem. Each binary feature
corresponds to the occurence of a word in the document. 

The data included in the package is part of the "Reuters RCV1" corpus,
and has been extracted from the free distribution included in the
online appendices of the following article:

Lewis, David D., Y. Yang, T. Rose, and F. Li. "RCV1: A New Benchmark
Collection for Text Categorization Research". Journal of Machine
Learning Research, 5:361-397, 2004. 


The data can be found under directory "data" of the package. This
directory contains the folloging files:

- CATS           : the list of the 10 categories; each line contains
	            the identifier of a category, the label of the 
                    category in the Reuters corpus, and the occurrences
                    of the category in the training file.
- FEATS          : the list of bag-of-words features considered for the 
	            problem; each line contains the identifier of a 
                    feature, the associated word, and the occurrences of 
                    the feature in training.	            
- TRAIN.original : training set before mapping categories and features
                    to integers. It contains 1000 examples. 
- TRAIN          : training set after mapping categories and features.
- TEST.original  : test set before mapping categories and features
                    to integers. It contains 200 examples. 
- TEST           : test set after mapping categories and features.
- map-data.pl    : a Perl script to map data files from string labels 
	            to integer identifiers. Is it used as follows : 
	            $ map-data.pl CATS FEATS < original_file > mapped_file
	            where CATS and FEATS are dictionaries mapping 
	            categorory/feature labels to integer identifiers as 
	            those given in this package. 
                    Use option "-u" to unmap, i.e. to map from integer 
                    identifiers to string labels
- AB.D*          : various classifiers learned on the training data
- OUTPUT.D*      : the output of such classifiers on the test data

The following commands are used to train four classifiers, each with
an increasing depth of weak rules. In all cases we train for up to 500
weak rules. In real-sized datasets, the boosting process is usually
run for a few thousands of iterations. Here are the commands (we
assume that the comands are in the path, and the files are in the
current directory):

$ boostree --train -l 10 -t 500 -d 0 TRAIN AB.D0 
$ boostree --train -l 10 -t 500 -d 1 TRAIN AB.D1
$ boostree --train -l 10 -t 500 -d 2 TRAIN AB.D0 
$ boostree --train -l 10 -t 500 -d 3 TRAIN AB.D3

Hint : use option "-v 0" to disable printing to standard error.

We can now use the classifiers to make predictions on new examples.
For instance, here is a command to use AB.D0 on test : 

$ boostree --classify -l 10 AB.D0 TEST > OUTPUT.D0

The script "evaluate.pl" will compute evaluation metrics for this
output. By default, it computes metrics for every category, as well as
micro-averaged measures across all the categories. Here's the command,
together with the output it produces:

$ evaluate.pl OUTPUT.D0
#_file OUTPUT.D0             +as+    +as-    -as+    -as-      acc   recall    prec.       f1
#--------------------------------------------------------------------------------------------
label.0                        55      16      18     111    83.00    77.46    75.34    76.39
label.1                        49       9       6     136    92.50    84.48    89.09    86.73
label.2                        44      13       8     135    89.50    77.19    84.62    80.73
label.3                        13      12       6     169    91.00    52.00    68.42    59.09
label.4                        18      20       8     154    86.00    47.37    69.23    56.25
label.5                         8       4       6     182    95.00    66.67    57.14    61.54
label.6                        12       6       1     181    96.50    66.67    92.31    77.42
label.7                        12       9       5     174    93.00    57.14    70.59    63.16
label.8                         9      10       4     177    93.00    47.37    69.23    56.25
label.9                         3      11       0     186    94.50    21.43   100.00    35.29
#--------------------------------------------------------------------------------------------
micro-avg                     223     110      62    1605    91.40    66.97    78.25    72.17


We can evaluate several output files at the same time. Here, we only
show micro-avg metrics:

$ evaluate.pl OUTPUT.D0 OUTPUT.D1 OUTPUT.D2 OUTPUT.D3 | grep micro
#_                           +as+    +as-    -as+    -as-      acc   recall    prec.       f1
#--------------------------------------------------------------------------------------------
micro-avg                     223     110      62    1605    91.40    66.97    78.25    72.17
micro-avg                     232     101      54    1613    92.25    69.67    81.12    74.96
micro-avg                     225     108      49    1618    92.15    67.57    82.12    74.14
micro-avg                     231     102      44    1623    92.70    69.37    84.00    75.99


The evaluation program can read from the standard input. As an example,
the following commands evaluate the accuracy of AB.D3 in terms of the
number of boosting rounds (we only show micro-avg metrics):

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 50| evaluate.pl -stdin | grep micro
#_                           +as+    +as-    -as+    -as-      acc   recall    prec.       f1
#--------------------------------------------------------------------------------------------
micro-avg                     227     106      58    1609    91.80    68.17    79.65    73.46

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 100| evaluate.pl -stdin | grep micro
micro-avg                     228     105      43    1624    92.60    68.47    84.13    75.50

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 150| evaluate.pl -stdin | grep micro
micro-avg                     230     103      44    1623    92.65    69.07    83.94    75.78

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 200| evaluate.pl -stdin | grep micro
micro-avg                     229     104      44    1623    92.60    68.77    83.88    75.58

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 250| evaluate.pl -stdin | grep micro
micro-avg                     230     103      43    1624    92.70    69.07    84.25    75.91

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 300| evaluate.pl -stdin | grep micro
micro-avg                     233     100      44    1623    92.80    69.97    84.12    76.39

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 350| evaluate.pl -stdin | grep micro
micro-avg                     232     101      45    1622    92.70    69.67    83.75    76.07

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 400| evaluate.pl -stdin | grep micro
micro-avg                     233     100      44    1623    92.80    69.97    84.12    76.39

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 450| evaluate.pl -stdin | grep micro
micro-avg                     232     101      43    1624    92.80    69.67    84.36    76.32

$ boostree --class -l 10 AB.D3 TEST -v 0 -t 500| evaluate.pl -stdin | grep micro
micro-avg                     231     102      44    1623    92.70    69.37    84.00    75.99


======================================================================