*** STRUCTURE BASED COMPUTATIONAL ASPECTS 
                    OF SIMILARITY AND ANALOGY IN NATURAL LANGUAGE ***

   Doctoral Research Proposal by Zvika Marx

   Supervisors: 
      Prof. Eli Shamir, The Hebrew University of Jerusalem
      Dr. Ido Dagan, Bar-Ilan University, Ramat-Gan


Abstract

Automatic identification of similarity between objects is an important theme 
in several scientific disciplines.  We propose to develop a robust computational 
procedure for the identification of similarities between complex object 
representations that were extracted automatically from texts.  The proposed 
research aims to fill a gap in current computational models of similarity and 
analogy.  On the one hand, cognitive and semantic models are typically 
based on complex representations for systems and objects, using logical or 
graph-based methods.  These models are usually demonstrated for manually 
constructed examples or applied to texts in very narrow domains.  On the other 
hand, there are robust statistical and neural models for similarity and clustering, 
which were applied with some success to unrestricted texts.  These models use 
shallow feature-vector representations for words and documents and fail to capture 
similarities that can be expressed only by richer representations.

We recognize that current language analysis methods, employing tools such as robust 
shallow parsing, morphological canonization and term-level similarity, can reliably 
extract structured textual representations that are significantly richer than 
feature-vector representations.  These structured representations resemble symbolic 
models in the cognitive and semantic literature in that they contain information 
about relations among the terms of a text fragment.  Given a large volume of such 
automatically extracted representations, our procedure will collect rich 
quantitative information about frequencies of various components.  Accordingly, 
our algorithms will combine structural and symbolic evidence with statistical 
evidence when judging the degree of similarity between text fragments.
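
As an illustration only, the combination described above can be sketched in a 
few lines of Python.  This is a minimal sketch under stated assumptions, not the 
proposed procedure: it assumes that each text fragment has already been reduced 
to a set of (head, relation, dependent) triples, and it swaps in a simple 
rarity-weighted Jaccard overlap as the similarity measure.  The names 
triple_weights and similarity, and the toy fragments, are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumed representation: a fragment is a set of
# (head, relation, dependent) triples, e.g. from a shallow parser.
Triple = Tuple[str, str, str]


def triple_weights(fragments: List[Set[Triple]]) -> Dict[Triple, float]:
    """Statistical evidence: weight each triple by its rarity (IDF-style)."""
    n = len(fragments)
    frag_freq = Counter(t for frag in fragments for t in frag)
    return {t: math.log((1 + n) / (1 + f)) + 1.0 for t, f in frag_freq.items()}


def similarity(a: Set[Triple], b: Set[Triple],
               weights: Dict[Triple, float]) -> float:
    """Structural evidence: weighted Jaccard overlap of shared triples."""
    union = a | b
    if not union:
        return 0.0
    shared = sum(weights.get(t, 1.0) for t in a & b)
    total = sum(weights.get(t, 1.0) for t in union)
    return shared / total


# Toy fragments (invented for illustration).
f1 = {("buy", "subj", "company"), ("buy", "obj", "share")}
f2 = {("buy", "subj", "company"), ("buy", "obj", "stock")}
f3 = {("sell", "subj", "company"), ("sell", "obj", "share")}

w = triple_weights([f1, f2, f3])
print(similarity(f1, f2, w))  # shares one relational triple
print(similarity(f1, f3, w))  # same terms, but no shared relation
```

In this sketch f1 and f3 contain the same terms yet share no relation, so a 
bag-of-words measure would see them as similar while the relational measure does 
not.  Weighting by rarity lets frequent, uninformative relations count for less 
than rare ones.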

The combination of statistical and structural models is an attractive 
characteristic of the proposed research.  We expect this combination to yield an 
important enhancement of similarity and clustering algorithms in the areas of 
natural language processing in particular and statistical pattern matching in 
general.  The proposed computational procedure, which applies cumulative 
acquisition of relational conceptual knowledge, may also influence models of 
analogy in the cognitive sciences.  At the practical level, the research addresses 
the growing need for better information access tools (for instance, on the World 
Wide Web) by contributing to the development of information retrieval and 
information extraction systems that employ high-level representations.