Understanding the functionality and dimensionality of vector embeddings: the distributional hypothesis, the pairwise inner product loss and its applications
Abstract/Contents
- Abstract
- Vector embedding is a foundational building block of many deep learning models, especially in natural language processing. We present a theoretical framework for understanding the effect of dimensionality on vector embeddings. We observe that the distributional hypothesis, a governing principle of statistical semantics, requires a natural unitary invariance for vector embeddings. Motivated by this observation, we propose the Pairwise Inner Product (PIP) loss, a unitary-invariant metric of the similarity between two embeddings. We demonstrate that the PIP loss captures the difference in functionality between embeddings, and that it is tightly connected with two basic properties of vector embeddings, namely similarity and compositionality.
- By formulating vector embedding algorithms as matrix factorizations under noise, we reveal a fundamental bias-variance trade-off in the dimensionality selection process which, as in many signal processing problems, is a trade-off between signal and noise power. This bias-variance trade-off sheds light on empirical observations that had not been thoroughly explained, for example the existence of an optimal dimensionality. Moreover, we establish two new results about vector embeddings, namely their robustness against over-parametrization and their forward stability. The bias-variance trade-off of the PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings.
- The combination of vector embeddings and the PIP loss can be applied to quantify the diachronic evolution and domain adaptation of language. We develop the Global Anchor Method (GAM), a method for comparing embeddings trained on two different corpora. Corpus dissimilarity determines model transferability, namely, whether a model trained on one corpus will work on another. We use GAM to analyze diachronic and domain effects in English corpora, using the Google Books and arXiv datasets as examples, revealing interesting sociological and academic factors that affect the usage of language.
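The abstract defines the PIP loss only by its name (a pairwise-inner-product comparison) and its key property (unitary invariance). A minimal sketch consistent with that description follows; the function names are illustrative, and the choice of the Frobenius norm for the matrix distance is an assumption, not taken from the abstract.

```python
import numpy as np


def pip_matrix(E):
    """Pairwise Inner Product matrix: entry (i, j) is the inner
    product of embedding vectors i and j, i.e. E @ E.T."""
    return E @ E.T


def pip_loss(E1, E2):
    """Distance between the PIP matrices of two embeddings
    (Frobenius norm assumed here for illustration).

    Unitary-invariant: for any orthogonal U, (E1 @ U) @ (E1 @ U).T
    equals E1 @ E1.T, so rotating an embedding leaves the loss
    unchanged.
    """
    return np.linalg.norm(pip_matrix(E1) - pip_matrix(E2), ord="fro")


# Demonstrate the invariance: rotate E1 by a random orthogonal matrix
# and check that its PIP loss against E2 does not change.
rng = np.random.default_rng(0)
E1 = rng.standard_normal((5, 3))   # 5 "words", 3-dimensional embedding
E2 = rng.standard_normal((5, 3))
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal U
assert np.isclose(pip_loss(E1, E2), pip_loss(E1 @ U, E2))
```

The invariance matters because, under the distributional hypothesis, an embedding is only identified up to a unitary transform, so a meaningful comparison metric must not depend on the particular rotation an algorithm happens to produce.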
Description
Field | Value
---|---
Type of resource | text
Form | electronic resource; remote; computer; online resource
Extent | 1 online resource.
Place | California
Place | [Stanford, California]
Publisher | [Stanford University]
Copyright date | 2018; ©2018
Publication date | 2018
Issuance | monographic
Language | English
Creators/Contributors
Role | Name
---|---
Author | Yin, Zi
Degree supervisor | Prabhakar, Balaji, 1967-
Thesis advisor | Prabhakar, Balaji, 1967-
Thesis advisor | Rosenblum, Mendel
Thesis advisor | Weissman, Tsachy
Degree committee member | Rosenblum, Mendel
Degree committee member | Weissman, Tsachy
Associated with | Stanford University, Department of Electrical Engineering.
Subjects
Field | Value
---|---
Genre | Theses
Genre | Text
Bibliographic information
Field | Value
---|---
Statement of responsibility | Zi Yin.
Note | Submitted to the Department of Electrical Engineering.
Thesis | Thesis (Ph.D.)--Stanford University, 2018.
Location | electronic resource
Access conditions
- Copyright
- © 2018 by Zi Yin