*** STRUCTURE BASED COMPUTATIONAL ASPECTS OF SIMILARITY AND ANALOGY IN NATURAL LANGUAGE ***

Doctoral Research Proposal by Zvika Marx

Supervisors:
Prof. Eli Shamir, The Hebrew University of Jerusalem
Dr. Ido Dagan, Bar-Ilan University, Ramat-Gan

Abstract

Automatic identification of similarity between objects is an important theme in several scientific disciplines. We propose to develop a robust computational procedure for identifying similarities between complex object representations that are extracted automatically from texts.

The proposed research aims to fill a gap in current computational models of similarity and analogy. On the one hand, cognitive and semantic models are typically based on complex representations of systems and objects, using logical or graph-based methods. These models are usually demonstrated on manually constructed examples or applied to texts in very narrow domains. On the other hand, there are robust statistical and neural models for similarity and clustering, which have been applied with some success to unrestricted texts. These models use shallow feature-vector representations for words and documents and do not adequately capture similarities that can be expressed only by richer representations.

We recognize that current language analysis methods, employing tools such as robust shallow parsing, morphological canonization and term-level similarity, can robustly extract structured textual representations that are significantly richer than feature-vector representations. These structured representations resemble symbolic models in the cognitive and semantic literature in that they contain information about relations among the terms of a text fragment. Given a large volume of such automatically extracted representations, our procedure will collect rich quantitative information about the frequencies of their various components.
Accordingly, our algorithms will combine structural and symbolic evidence with statistical evidence when judging the degree of similarity between text fragments. This combination of statistical and structural models is an attractive characteristic of the proposed research. We expect it to yield an important enhancement to similarity and clustering algorithms in natural language processing in particular and in statistical pattern matching in general. The proposed computational procedure, which applies cumulative acquisition of relational conceptual knowledge, may also influence models of analogy in the cognitive sciences. At the practical level, the research addresses the growing need for better information access tools (for instance, on the World Wide Web) by contributing to the development of information retrieval and information extraction systems that employ high-level representations.