====================================================================== ====================================================================== This is BOOSTREE v.0.1, bundled with the OMLET library v.0.96 Contact : Xavier Carreras : carreras --at-- csail.mit.edu Lluis Padro : padro --at-- lsi.upc.edu See also the homepage of OMLET+FRIES: http://www.lsi.upc.edu/~nlp/omlet+fries/ September 2007 ====================================================================== ====================================================================== DESCRIPTION -- This package provides code that implements the AdaBoost.MH learning algorithm, with a decision tree learner as weak learner. This learning algorithm is a boosting algorithm for multilabel classification problems, and was introduced by Schapire and Singer in 1999 [1] (specifically, see section 7 of [1]). -- In multilabel classification, there exists a number of predefined categories or labels. Examples may belong to zero or more categories. -- At training time, the algorithm receives a collection of labeled examples (i.e. the training set) and learns a classifier. -- At testing time, the algorithm receives a classifier and new examples. For every example, it produces a prediction score for each category. These scores are interpreted as follows: if the score associated to some category is positive, the classifier predicts that the example belongs to the category; if the score is negative, the classifier predicts that the example does not belong to the category. In addition, the magnitude of the score (that is, its absolute value) can be interpreted as the confidence of the prediction. Higher magnitudes are interpreted as more confident predictions (it can be shown experimentally that, in fact, the magnitude of a prediction is a good estimate of the confidence of that prediction). -- Note that binary classification can be seen as a particular case of multilabel classification. In the binary case, there exists one single category, and examples can belong or not to that category. Therefore, running the multilabel AdaBoost.MH classifier on binary data is equivalent to running the the binary version of AdaBoost discussed in the first sections of the article by Schapire and Singer. -- This algorithm can be used as well for multiclass classification, where each example belongs to one and exactly one category. In this setting, the predicted category for a test example would be the category that receives higher score. However, it has to be noted that the AdaBoost.MH algorithm is designed for multi-label problems, where labels do not "compete" between each other. In the literature there exist many algorithms designed specifically for multiclass classification. -- The code assumes a representation of examples based on binary features. Moreover, the code assumes sparse representations, in which only a small fraction of features are active in an example. -- This algorithm has given very good results in many NLP problems. In some cases, NLP systems based on this algorithm have obtained the state-of-the-art results. The following papers make use of the code distributed here: [2], [3], [4], [5], [6], [7]. REFERENCES [1] Schapire, Robert and Yoram Singer. "Improved boosting algorithms with confidence-rated predictions". Machine Learning 37:3, pp. 297-336, 1999. [2] X. Carreras and Ll. Màrquez, "Boosting Trees for Clause Splitting", In Proc. of CoNLL-2001 Shared Task. 2001. [3] X. Carreras and Ll. Màrquez, "Boosting Trees for Anti-Spam Email Filtering", In Proc. of the Conference on Recent Advances in NLP (RANLP'01). 2001. [4] X. Carreras, Ll. Màrquez and Ll. Padró, "Named Entity Extraction Using AdaBoost", In Proc. of CoNLL-2002 Shared Task, 2002. [5] X. Carreras, Ll. Màrquez and Ll. Padró, "A Simple Named Entity Extractor Using AdaBoost", In Proc. of CoNLL-2003 Shared Task, 2003. [6] Ll. Màrquez, P. Comas, J. Giménez and N. Català, "Semantic Role Labeling as Sequential Tagging", In Proc. of CoNLL-2005 Shared Task. 2005. [7] M. Surdeanu and J. Turmo, "Semantic Role Labeling Using Complete Syntactic Analysis", In Proc. of CoNLL-2005 Shared Task, 2005. ====================================================================== INSTALLATION AND PLATFORM DETAILS The code has been developed on GNU/Linux-x86 machines, using g++ v3.2. Moreover, it has been succesfully tested on OSX platforms. To compile and run the code, the following packages are required : - gcc 3.x or gcc 4.x - automake - Perl (only to use the evaluation program) To install the code, unpack the distribution package in some directory. In that directory, run the following commands : $ ./configure $ make You will find the programs in the src/bin directory. ====================================================================== TRAINING AND TESTING CLASSIFIERS This package offers a program named "boostree". This program is used for training a classifier from data, and for testing the learned classifer with new examples. After compiling the package, "boostree" is found under directory src/bin. In addition, the package offers an evaluation script named "evaluate.pl", which computes standard evaluation metrics such as classification accuracy, precision, recall or the F-measure. The script is also found under the src/bin directory, requires Perl, and does not require compilation. DATA FORMATS ------------ An examples consists of two lists (both can be null): - The list of categories that the example belongs to. - The list of boolean features that are active in the example In the programs, categories and features are represented by integers. In a multilabel problem with L possible categories, the categories take values from 0 to L-1. On the other hand, in a feature space of D features, feature indices take values from 1 to D. A data file is a collection of examples. Each example is represented with two lines. The first line encodes the positive categories for the example, separated by spaces. The second line encodes the active features in the example, saparated by spaces. Either lines may be empty, if this corresponds. For testing, the format of the data files is the same as described. If the categories of the test examples are not known, the line encoding the categories shall be left empty. TRAINING A CLASSIFIER --------------------- The following is a typical command to train a classifier : $ boostree --train -l -t -d training_file model_file where : training_file : file containing the training examples model_file : file where the learned model will be saved to -l int : number of labels in the data (required parameter) -t int : number of boosting rounds (required parameter) -d int : depth of the weak decision trees; a value of 0 indicates decision stumps (defalult option); greater values indicate deeper decision trees Other available options for training : -e float : smoothing value (see paper of Schapire and Singer '99) -v int : verbosity level, 0 no verbose, 1 verbose (default) The following is a typical command to test a classifier : $ boostree --classify -L -T model_file test_file where : model_file : file containing the model test_file : file containing the test examples if not given, examples are read from the standard input -l int : number of labels in the data (required parameter) -t int : number of weak rules considered for testing by default, all weak rules are considered -v int : verbosity level, 0 no verbose, 1 verbose (default) When testing, the program produces one line per testing example, with two fields per category. The first field is the prediction score produced by the classifier on that category. The second field is a sign indicating the membership of the example to the category (this information is taken from the input test file). ====================================================================== AN EXAMPLE WITH TEXT CATEGORIZATION This package includes toy datasets to illustrate the use of the programs for learning and testing classifiers. The task associated with this data is text categorization. There are 10 defined categories or topics. The goal of the task is to determine the topics addressed in a given text or document. Examples are represented with a bag-of-words, as is standard in the problem. Each binary feature corresponds to the occurence of a word in the document. The data included in the package is part of the "Reuters RCV1" corpus, and has been extracted from the free distribution included in the online appendices of the following article: Lewis, David D., Y. Yang, T. Rose, and F. Li. "RCV1: A New Benchmark Collection for Text Categorization Research". Journal of Machine Learning Research, 5:361-397, 2004. The data can be found under directory "data" of the package. This directory contains the folloging files: - CATS : the list of the 10 categories; each line contains the identifier of a category, the label of the category in the Reuters corpus, and the occurrences of the category in the training file. - FEATS : the list of bag-of-words features considered for the problem; each line contains the identifier of a feature, the associated word, and the occurrences of the feature in training. - TRAIN.original : training set before mapping categories and features to integers. It contains 1000 examples. - TRAIN : training set after mapping categories and features. - TEST.original : test set before mapping categories and features to integers. It contains 200 examples. - TEST : test set after mapping categories and features. - map-data.pl : a Perl script to map data files from string labels to integer identifiers. Is it used as follows : $ map-data.pl CATS FEATS < original_file > mapped_file where CATS and FEATS are dictionaries mapping categorory/feature labels to integer identifiers as those given in this package. Use option "-u" to unmap, i.e. to map from integer identifiers to string labels - AB.D* : various classifiers learned on the training data - OUTPUT.D* : the output of such classifiers on the test data The following commands are used to train four classifiers, each with an increasing depth of weak rules. In all cases we train for up to 500 weak rules. In real-sized datasets, the boosting process is usually run for a few thousands of iterations. Here are the commands (we assume that the comands are in the path, and the files are in the current directory): $ boostree --train -l 10 -t 500 -d 0 TRAIN AB.D0 $ boostree --train -l 10 -t 500 -d 1 TRAIN AB.D1 $ boostree --train -l 10 -t 500 -d 2 TRAIN AB.D0 $ boostree --train -l 10 -t 500 -d 3 TRAIN AB.D3 Hint : use option "-v 0" to disable printing to standard error. We can now use the classifiers to make predictions on new examples. For instance, here is a command to use AB.D0 on test : $ boostree --classify -l 10 AB.D0 TEST > OUTPUT.D0 The script "evaluate.pl" will compute evaluation metrics for this output. By default, it computes metrics for every category, as well as micro-averaged measures across all the categories. Here's the command, together with the output it produces: $ evaluate.pl OUTPUT.D0 #_file OUTPUT.D0 +as+ +as- -as+ -as- acc recall prec. f1 #-------------------------------------------------------------------------------------------- label.0 55 16 18 111 83.00 77.46 75.34 76.39 label.1 49 9 6 136 92.50 84.48 89.09 86.73 label.2 44 13 8 135 89.50 77.19 84.62 80.73 label.3 13 12 6 169 91.00 52.00 68.42 59.09 label.4 18 20 8 154 86.00 47.37 69.23 56.25 label.5 8 4 6 182 95.00 66.67 57.14 61.54 label.6 12 6 1 181 96.50 66.67 92.31 77.42 label.7 12 9 5 174 93.00 57.14 70.59 63.16 label.8 9 10 4 177 93.00 47.37 69.23 56.25 label.9 3 11 0 186 94.50 21.43 100.00 35.29 #-------------------------------------------------------------------------------------------- micro-avg 223 110 62 1605 91.40 66.97 78.25 72.17 We can evaluate several output files at the same time. Here, we only show micro-avg metrics: $ evaluate.pl OUTPUT.D0 OUTPUT.D1 OUTPUT.D2 OUTPUT.D3 | grep micro #_ +as+ +as- -as+ -as- acc recall prec. f1 #-------------------------------------------------------------------------------------------- micro-avg 223 110 62 1605 91.40 66.97 78.25 72.17 micro-avg 232 101 54 1613 92.25 69.67 81.12 74.96 micro-avg 225 108 49 1618 92.15 67.57 82.12 74.14 micro-avg 231 102 44 1623 92.70 69.37 84.00 75.99 The evaluation program can read from the standard input. As an example, the following commands evaluate the accuracy of AB.D3 in terms of the number of boosting rounds (we only show micro-avg metrics): $ boostree --class -l 10 AB.D3 TEST -v 0 -t 50| evaluate.pl -stdin | grep micro #_ +as+ +as- -as+ -as- acc recall prec. f1 #-------------------------------------------------------------------------------------------- micro-avg 227 106 58 1609 91.80 68.17 79.65 73.46 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 100| evaluate.pl -stdin | grep micro micro-avg 228 105 43 1624 92.60 68.47 84.13 75.50 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 150| evaluate.pl -stdin | grep micro micro-avg 230 103 44 1623 92.65 69.07 83.94 75.78 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 200| evaluate.pl -stdin | grep micro micro-avg 229 104 44 1623 92.60 68.77 83.88 75.58 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 250| evaluate.pl -stdin | grep micro micro-avg 230 103 43 1624 92.70 69.07 84.25 75.91 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 300| evaluate.pl -stdin | grep micro micro-avg 233 100 44 1623 92.80 69.97 84.12 76.39 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 350| evaluate.pl -stdin | grep micro micro-avg 232 101 45 1622 92.70 69.67 83.75 76.07 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 400| evaluate.pl -stdin | grep micro micro-avg 233 100 44 1623 92.80 69.97 84.12 76.39 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 450| evaluate.pl -stdin | grep micro micro-avg 232 101 43 1624 92.80 69.67 84.36 76.32 $ boostree --class -l 10 AB.D3 TEST -v 0 -t 500| evaluate.pl -stdin | grep micro micro-avg 231 102 44 1623 92.70 69.37 84.00 75.99 ======================================================================