David Alvarez-Melis

Postdoctoral Researcher, Microsoft Research New England

1 Memorial Drive, Cambridge MA

firstsurname.secondsurname@microsoft.com

Towards a Theory of Word Embeddings

A theoretical framework to understand the semantic properties of word embeddings.

TL;DR: Word embeddings have intriguing semantic properties (think King - Man + Woman = Queen). How do we make sense of them? Cognitive Science and Metric Recovery yield a possible answer.
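To make the analogy example concrete, here is a minimal sketch of how such queries are typically answered with off-the-shelf embeddings: add and subtract the word vectors and return the vocabulary word closest (in cosine similarity) to the result. The `vectors` dictionary and the `solve_analogy` helper are illustrative assumptions, not part of the method described here; any set of unit-normalized pre-trained vectors would do.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" by vector arithmetic.

    `vectors` is assumed to map words to unit-normalized numpy arrays
    (e.g. pre-trained GloVe or word2vec vectors); this helper is illustrative.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)

    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # by convention, never return one of the query words
        score = float(np.dot(vec, target))  # cosine similarity (unit-norm vectors)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# "King - Man + Woman = Queen" corresponds to:
#   solve_analogy("man", "king", "woman", vectors)  ->  "queen"
```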

Abstract

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with a Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with state-of-the-art word embedding and manifold learning methods. Finally, we complement the recent focus on analogies by constructing two new inductive reasoning datasets (series completion and classification) and demonstrate that word embeddings can be used to solve them as well.
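As a rough illustration of the metric-recovery view, one can fit embedding coordinates so that squared Euclidean distances track negative log co-occurrences, up to per-word bias terms that absorb word frequency. The sketch below is a least-squares toy in this spirit only; `fit_metric_embedding`, its learning rate, and the bias parameterization are illustrative assumptions, not the estimator from the TACL paper.

```python
import numpy as np

def fit_metric_embedding(log_counts, dim=2, lr=0.05, n_iters=2000, seed=0):
    """Toy metric recovery: fit points so that squared Euclidean distances
    approximate negative (bias-shifted) log co-occurrence counts.

    `log_counts` is an (n, n) symmetric matrix of log co-occurrences.
    This least-squares sketch only illustrates the idea; the actual
    metric regression estimator in the paper is different.
    """
    rng = np.random.default_rng(seed)
    n = log_counts.shape[0]
    X = 0.1 * rng.standard_normal((n, dim))  # embedding coordinates
    b = np.zeros(n)                          # per-word biases (absorb frequency)

    for _ in range(n_iters):
        diff = X[:, None, :] - X[None, :, :]               # pairwise differences
        sq = (diff ** 2).sum(-1)                           # squared distances
        resid = (-sq + b[:, None] + b[None, :]) - log_counts
        np.fill_diagonal(resid, 0.0)                       # ignore self-pairs

        # gradients of 0.5 * sum_ij resid_ij^2 (log_counts assumed symmetric)
        grad_X = 4.0 * (resid[:, :, None] * -diff).sum(axis=1)
        grad_b = 2.0 * resid.sum(axis=1)
        X -= lr * grad_X / n
        b -= lr * grad_b / n
    return X, b
```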

The relation between semantics and word co-occurrence representations predates neural approaches by a few decades. This diagram depicts inductive reasoning in semantic spaces as proposed by Sternberg and Gardner (1983). A, B, C are given, I is the ideal point and D are the choices.
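In the ideal-point formulation, each inductive reasoning task reduces to constructing an ideal point I from the given words and picking the choice nearest to it. The sketch below is an embedding-space paraphrase of that setup, with illustrative formulas for the ideal point rather than code from the paper.

```python
import numpy as np

def ideal_point_choice(ideal, choices, vectors):
    """Return the candidate word whose embedding lies closest to the ideal point I."""
    return min(choices, key=lambda d: np.linalg.norm(vectors[d] - ideal))

# One natural way to build the ideal point from the given words A, B, C
# (an illustration in the spirit of Sternberg & Gardner, 1983):
#   analogy (A : B :: C : ?):        I = vectors["C"] + (vectors["B"] - vectors["A"])
#   series completion (A, B, C, ?):  I = vectors["C"] + (vectors["C"] - vectors["B"])
#   classification (which choice fits the category {A, B, C}?):
#                                    I = (vectors["A"] + vectors["B"] + vectors["C"]) / 3
```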
Our theory says that word embedding algorithms can be understood as manifold learning methods, and empirical results confirm this. Here we show dimensionality reduction of MNIST digits using both types of methods (word embedding and manifold learning), with performance quantified as the percentage of 5-nearest neighbors sharing the same digit label. The resulting embeddings show that metric regression is highly effective at this task, outperforming metric SNE and beaten only by t-SNE (91% cluster purity), a visualization method specifically designed to preserve cluster separation. All word embedding methods, including SVD (68%), embed the MNIST digits remarkably well and outperform the classic manifold learning baselines PCA (48%) and Isomap (49%).
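The cluster-purity numbers above are straightforward to compute; a small sketch is given below (the evaluation in the paper may differ in details such as tie-breaking or subsampling, and `knn_label_purity` is an illustrative helper name).

```python
import numpy as np

def knn_label_purity(embedding, labels, k=5):
    """Average fraction of each point's k nearest neighbors sharing its label.

    `embedding` is an (n, d) array, `labels` an (n,) integer array.
    Uses O(n^2) memory, so it is meant for a few thousand points at most.
    """
    sq = ((embedding[:, None, :] - embedding[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(sq, axis=1)[:, :k]  # indices of the k nearest neighbors
    return float((labels[neighbors] == labels[:, None]).mean())
```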

Relevant Publications:

  1. Hashimoto, Alvarez-Melis and Jaakkola. “Word, graph and manifold embedding from Markov processes”, NIPS 2015 Workshop on Nonparametric Methods for Large Scale Representation Learning.
  2. Hashimoto, Alvarez-Melis and Jaakkola. “Word Embeddings as Metric Recovery in Semantic Spaces”, TACL’16.