An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

Abstract

Dictionary-based entity extraction has attracted much attention from the database community recently, which locates substrings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. Firstly, they need to tune many parameters to achieve a high performance. Secondly, they are inefficient for large editdistance thresholds. We propose a trie-based method to address these problems. We partition each entity into a set of segments. We prove that if a substring of the document is similar to an entity, it must contain a segment of the entity. Based on this observation, we first search segments from the document, and then extend the matching segments in both entities and the document to find similar pairs. To facilitate searching segments, we use a trie structure to index segments and develop an efficient trie-based algorithm. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. The experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

Publication

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints
[Paper] [PPT][Poster]

Codes

Overview:

We are happy to release our binary codes of taste. We have two codes here. The difference between these two binary codes is that 'taste-all' outputs all the result within edit distance threshold while 'taste-best' only outputs the 'best' result for each matching. Please refer our paper for the definition of 'best' matching. The two codes accept the same input arguments as followings.

Input

Run taste-all on linux from command line:

chmod +x taste-all
./taste-all edth dict doc > output

Description:

edth is the edit-distance threshold.

dict is the path to the dictionary, where entities are separated by '\n'.

doc is the path to your document file where documents are seperated by '\n', too

output stores the result pairs.

Output

taste prints three lines for each similar pair (entity, s):

entity_id document_id start_position length distance
##entity$$
##substring$$

%Here is an example
35568 5 126 146 1
## george a. rovithakis$$
##george a. rovithakis$$

Description: The first line consists of the line id of entity in dictionary file, the line id of document in document file(all start from 0), the approximate matching start position, the approximate matching substring length, and the edit distance between the approximate matching substring and entity. The second line is the approximate matching substring content and the third line is the entity which are all append a prefix of '##' and a suffix of '$$'. One example output for the approximate matching substring and entity is shown on the bottom.

Requirement:

Have GCC 4.2.4 or higher

Download:

Contact

If you have any questions about this study, please feel free to contact Dong Deng.

Comment