Understanding the functionality and dimensionality of vector embeddings : the distributional hypothesis, the pairwise inner product loss and its applications


Abstract
Vector embedding is a foundational building block of many deep learning models, especially in natural language processing. We present a theoretical framework for understanding the effect of dimensionality on vector embeddings. We observe that the distributional hypothesis, a governing principle of statistical semantics, requires a natural unitary invariance for vector embeddings. Motivated by this observation, we propose the Pairwise Inner Product (PIP) loss, a unitary-invariant metric of the similarity between two embeddings. We demonstrate that the PIP loss captures the difference in functionality between embeddings, and that it is tightly connected with two basic properties of vector embeddings, namely similarity and compositionality.

By formulating vector embedding algorithms as matrix factorizations under noise, we reveal a fundamental bias-variance trade-off in the dimensionality selection process which, as in many signal processing problems, is between the signal and noise power. This trade-off sheds light on empirical observations that have not previously been thoroughly explained, for example the existence of an optimal dimensionality. Moreover, we discover two new properties of vector embeddings, namely their robustness against over-parametrization and their forward stability. The bias-variance trade-off of the PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings.

The combination of vector embeddings and the PIP loss can be applied to quantify the diachronic evolution and domain adaptation of language. We develop the Global Anchor Method (GAM), a method for comparing embeddings trained on two different corpora. The dissimilarity between corpora determines model transferability, that is, whether a model trained on one corpus will work on another. We use GAM to analyze diachronic and domain effects in English corpora, using the Google Books and arXiv datasets as examples, revealing interesting sociological and academic factors that affect the usage of language.
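The PIP loss described in the abstract can be illustrated with a minimal NumPy sketch, assuming the standard definition from the dimensionality-selection literature: for an embedding matrix E with one row per word, the PIP matrix is E·Eᵀ, and the PIP loss between two embeddings over the same vocabulary is the Frobenius norm of the difference of their PIP matrices. Variable names below are illustrative, not drawn from the thesis.

```python
import numpy as np

def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    """Frobenius norm of the difference between the pairwise inner
    product (PIP) matrices of two embeddings (n_words x dim each).
    The two embeddings may have different dimensionalities; both
    PIP matrices are n_words x n_words."""
    return float(np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro"))

# Unitary invariance: right-multiplying an embedding by an orthogonal
# matrix Q leaves E @ E.T, and hence the PIP loss, unchanged, since
# (E @ Q) @ (E @ Q).T = E @ Q @ Q.T @ E.T = E @ E.T.
rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
assert pip_loss(E, E @ Q) < 1e-8
```

This invariance is why the PIP loss is a natural comparison metric: embeddings that differ only by a rotation or reflection of the coordinate axes are functionally identical under the distributional hypothesis, and the PIP loss assigns them zero distance.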

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place [Stanford, California]
Publisher [Stanford University]
Copyright date ©2018
Publication date 2018
Issuance monographic
Language English

Creators/Contributors

Author Yin, Zi
Degree supervisor Prabhakar, Balaji, 1967-
Thesis advisor Prabhakar, Balaji, 1967-
Thesis advisor Rosenblum, Mendel
Thesis advisor Weissman, Tsachy
Degree committee member Rosenblum, Mendel
Degree committee member Weissman, Tsachy
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Zi Yin.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2018.
Location electronic resource

Access conditions

Copyright
© 2018 by Zi Yin
