Abstract
String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamicprogramming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
Publication
Top-k String Similarity Search with Edit-Distance Constraints.
[Paper] [Slides] [Poster]
Codes

Overview:

We are happy to release our binary code of Topksearch. For further information please send an email to Dong Deng.

Input

Run topkSearch on 64-bit machine from command line:
chmod +x topkSearch
./topkSearch Dataset Queries [output] > output
Description:

Dataset is the path to the dataset file. Records are separated by '\n'.

Queries is the path to query file. Each query consists of a query string and the number of most similar results wanted, which is k.

output contains some statistic information. Including the largest edit distance between the results and corresponding query and the total time elasped.

Output

======== QUERY: some query ======
1 : id1 record1 ed1
2 : id2 record2 ed2
...
k : idk recordk edk
========= END OF QUERY ==========


# Example
======== QUERY: e. becherer 1000 ======
1 : 139765 e. becherer 0
2 : 139874 e. scherer 2
...
...
1000 : 503834 s. schreiner 6
============= END OF QUERY ============

Description: topkSearch prints the query string and followed by k most similar strings in dataset with their edit distance.

Requirement:

Have GCC 4.4.3 or higher, a 64-bit machine

Download:

Contact
If you have any questions about this study, please feel free to contact Dong Deng.