Understanding the functionality and dimensionality of vector embeddings : the distributional hypothesis, the pairwise inner product loss and its applications


Abstract
Vector embedding is a foundational building block of many deep learning models, especially in natural language processing. We present a theoretical framework for understanding the effect of dimensionality on vector embeddings. We observe that the distributional hypothesis, a governing principle of statistical semantics, requires a natural unitary invariance for vector embeddings. Motivated by this observation, we propose the Pairwise Inner Product (PIP) loss, a unitary-invariant metric of the similarity between two embeddings. We demonstrate that the PIP loss captures the difference in functionality between embeddings, and that it is tightly connected with two basic properties of vector embeddings, namely similarity and compositionality.

By formulating vector embedding algorithms as matrix factorizations under noise, we reveal a fundamental bias-variance trade-off in the dimensionality selection process which, as in many signal processing problems, is between the signal and noise power. This trade-off sheds light on empirical observations that have not previously been thoroughly explained, for example the existence of an optimal dimensionality. Moreover, we discover two new properties of vector embeddings, namely their robustness against over-parametrization and their forward stability. The bias-variance trade-off of the PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings.

The combination of vector embeddings and the PIP loss can be applied to quantify the diachronic evolution and domain adaptation of language. We develop the Global Anchor Method (GAM), a method for comparing embeddings trained on two different corpora. The dissimilarity between corpora determines model transferability, that is, whether a model trained on one corpus will work on another. We use GAM to analyze diachronic and domain effects in English corpora, using the Google Books and arXiv datasets as examples, revealing interesting sociological and academic factors that affect the usage of language.
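The PIP loss described in the abstract can be illustrated with a minimal NumPy sketch, assuming the standard definition from the dimensionality-selection literature: for an embedding matrix E with one row per word, the PIP matrix is E·Eᵀ, and the PIP loss between two embeddings over the same vocabulary is the Frobenius norm of the difference of their PIP matrices. Variable names below are illustrative, not drawn from the thesis.

```python
import numpy as np

def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    """Frobenius norm of the difference between the pairwise inner
    product (PIP) matrices of two embeddings (n_words x dim each).
    The two embeddings may have different dimensionalities; both
    PIP matrices are n_words x n_words."""
    return float(np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro"))

# Unitary invariance: right-multiplying an embedding by an orthogonal
# matrix Q leaves E @ E.T, and hence the PIP loss, unchanged, since
# (E @ Q) @ (E @ Q).T = E @ Q @ Q.T @ E.T = E @ E.T.
rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
assert pip_loss(E, E @ Q) < 1e-8
```

This invariance is why the PIP loss is a natural comparison metric: embeddings that differ only by a rotation or reflection of the coordinate axes are functionally identical under the distributional hypothesis, and the PIP loss assigns them zero distance.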

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Yin, Zi
Degree supervisor Prabhakar, Balaji, 1967-
Thesis advisor Prabhakar, Balaji, 1967-
Thesis advisor Rosenblum, Mendel
Thesis advisor Weissman, Tsachy
Degree committee member Rosenblum, Mendel
Degree committee member Weissman, Tsachy
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Zi Yin.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Zi Yin
