======================================================================

 This is PHRECO, a software package for PHrase RECOgnition tasks.

 Author: Xavier Carreras
         carreras --at-- lsi.upc.edu
         Technical University of Catalonia (UPC)

 Last revision : October 2007
 Created       : January 2005

 Released under GNU General Public License (GPL)
 http://www.gnu.org/licenses

======================================================================

INTRODUCTION

This code implements the Filtering-Ranking algorithms presented in [1]
by Carreras, Màrquez and Castro. These algorithms apply to phrase
recognition tasks; i.e., tasks in which the goal is to find phrases of
some type in a given sentence. In particular, these algorithms have
been used to develop systems for "syntactic chunking", "clause
identification", "named entity extraction" and "semantic role
labeling"; see papers [1-5]. To read more about these tasks, please
refer to the descriptions of the CoNLL Shared Tasks of 2000-2005.

The code in this package can be used to reproduce the results
presented in [1]. The package also provides the trained models for
chunking and clause identification that are evaluated in that article.
However, it may not be possible to reproduce the results of earlier
publications, due to changes and corrections that were later made to
the code.

Please cite the article of Carreras, Màrquez and Castro [1] in any
publication containing results derived with this software.

======================================================================

DISCLAIMER

This package is released mainly to show, for research purposes, how
the results can be reproduced. The code itself is a prototype, and it
is not suitable for use as a final product. In particular, some data
structures of the current implementation are inefficient, and result
in fairly expensive calculations and slow running time.
While alternative solutions exist that would make the programs run
much faster, there are no plans to implement them or to otherwise
improve this project.

Documentation is provided only for testing the trained models. No
information is provided about training new models, or about using
parts of the code in other projects. Please contact the author for
specific questions only.

======================================================================

INSTALLATION

Follow these instructions to use the code :

 1. Unpack the package.

    A directory named "phreco-0.1" should be created, with the
    contents of the package in it.

 2. Enter the package directory and type "make".

    The C++ source code will be compiled, and a library of objects
    will be created.

 3. Edit these two files :

       src/bin/fr-chunker
       src/bin/fr-clauser

    In the header, change the value of the variable $ROOT so that it
    points to the path of the package. For example, if the package is
    in directory "/home/x/soft/phreco-0.1", then the header block
    should look like :

       BEGIN {
          $VERSION = "0.1";
          # REPLACE THE VALUE OF THIS ASSIGNMENT SO THAT
          # $ROOT POINTS TO THE PATH OF PHRECO
          $ROOT = "/home/x/soft/phreco-0.1";
       }

======================================================================

USING THE "MLJ" MODELS

The current package includes two models that are already trained, and
whose experimental performance is reported in [1].

The first model is for text chunking, and was trained on the
CoNLL-2000 data. The following command runs the chunker on the test
set of the CoNLL-2000 task, and saves the predicted chunks in a file :

  $ zcat data/conll00-test.gz | src/bin/fr-chunker -m models/mlj-chunk | gzip > data/chunker-test.gz

The output file can then be evaluated using the CoNLL-2000 evaluation
script :

  $ zcat data/chunker-test.gz | perl data/conlleval-00

This produces the results reported in Table 5 of [1] :

  processed 47377 tokens with 23852 phrases; found: 23622 phrases; correct: 22245.
  accuracy:  95.56%; precision:  94.17%; recall:  93.26%; FB1:  93.71
       ADJP: precision:  83.24%; recall:  68.04%; FB1:  74.87    358
       ADVP: precision:  83.72%; recall:  78.41%; FB1:  80.98    811
      CONJP: precision:  50.00%; recall:  44.44%; FB1:  47.06      8
       INTJ: precision: 100.00%; recall:  50.00%; FB1:  66.67      1
        LST: precision:   0.00%; recall:   0.00%; FB1:   0.00      0
         NP: precision:  94.48%; recall:  94.33%; FB1:  94.41  12402
         PP: precision:  96.58%; recall:  97.88%; FB1:  97.22   4876
        PRT: precision:  78.57%; recall:  62.26%; FB1:  69.47     84
       SBAR: precision:  91.76%; recall:  79.07%; FB1:  84.94    461
         VP: precision:  94.07%; recall:  93.32%; FB1:  93.70   4621

The second model is for clause identification, and was trained on the
CoNLL-2001 data. The following commands run the clause parser on the
development (testa) and test (testb) data of the CoNLL-2001 task, and
save the output to files :

  $ zcat data/conll01-testa.gz | src/bin/fr-clauser -m models/mlj-clauses | gzip > data/clauser-testa.gz
  $ zcat data/conll01-testb.gz | src/bin/fr-clauser -m models/mlj-clauses | gzip > data/clauser-testb.gz

The output files can then be evaluated using the CoNLL-2001 evaluation
script :

  $ zcat data/clauser-testa.gz | perl data/conlleval-01
  $ zcat data/clauser-testb.gz | perl data/conlleval-01

This produces the results reported in Table 3 of [1] :

  Test a, development:

  processed 47377 tokens with 5418 phrases; found: 5129 phrases; correct: 4645.
  accuracy:  96.36%; precision:  90.56%; recall:  85.73%; FB1:  88.08

  Test b, test:

  processed 40039 tokens with 4856 phrases; found: 4522 phrases; correct: 3987.
  accuracy:  95.50%; precision:  88.17%; recall:  82.10%; FB1:  85.03

======================================================================

REFERENCES

[1] Xavier Carreras, Lluís Màrquez and Jorge Castro
    "Filtering-Ranking Perceptron Learning for Partial Parsing"
    Machine Learning, Volume 60, Number 1-3, pp. 41-71, September 2005.

[2] Xavier Carreras and Lluís Màrquez
    "Phrase Recognition by Filtering and Ranking with Perceptrons"
    In Proceedings of RANLP-2003.

[3] Xavier Carreras and Lluís Màrquez
    "Online Learning via Global Feedback for Phrase Recognition"
    In Proceedings of NIPS-2003.

[4] Xavier Carreras, Lluís Màrquez and Grzegorz Chrupała
    "Hierarchical Recognition of Propositional Arguments with Perceptron"
    In Proceedings of the Shared Task of CoNLL-2004.

[5] Xavier Carreras, Lluís Màrquez and Lluís Padró
    "Learning a Perceptron-Based Named Entity Chunker via Online
    Recognition Feedback"
    In Proceedings of the Shared Task of CoNLL-2003.
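
======================================================================

As a quick sanity check on the evaluation output shown in USING THE
"MLJ" MODELS, the overall precision, recall and FB1 can be recomputed
from the three counts on the "processed" line. The sketch below is not
part of the package: the function name "prf" is made up for
illustration, and it assumes the usual definitions (precision =
correct/found, recall = correct/phrases, FB1 = harmonic mean of both).
It does not guard against zero counts.

```shell
# prf PHRASES FOUND CORRECT
# Hypothetical helper (not part of PHRECO): recompute the overall scores
# from the counts printed on the "processed ..." line of the evaluation.
prf() {
  awk -v gold="$1" -v found="$2" -v correct="$3" 'BEGIN {
    p = 100 * correct / found    # precision, in percent
    r = 100 * correct / gold     # recall, in percent
    f = 2 * p * r / (p + r)      # FB1: harmonic mean of the two
    printf "precision: %.2f%%; recall: %.2f%%; FB1: %.2f\n", p, r, f
  }'
}

# Counts from the chunking run on the CoNLL-2000 test set
prf 23852 23622 22245
```

For the chunking counts this prints precision 94.17%, recall 93.26%
and FB1 93.71, which agrees with the overall line reported by the
evaluation script. The same call with the clause identification counts
(for example "prf 5418 5129 4645") can be used to check those runs.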