Learning with N-grams: from massive scales to compressed representations


Abstract/Contents

Abstract
Machine learning has established itself as an important driver of industrial progress and scientific discovery. The quest to expand its usage to ever deeper questions and harder problems places particular emphasis on building sophisticated, statistically rigorous models that can handle the deluge of information being generated. The stakes are higher than ever: the success of global, billion-dollar initiatives that can fundamentally change the landscape of human health rests on the existence of machine learning tools that can extract intricate relationships at unprecedented scales. In turn, machine learning paradigms are constantly evolving to address these needs, and some of the greatest advances have come from integrating combinatorial ideas with classical statistics, such as the ability to perform principled feature selection using the Lasso. The underlying perspective of this thesis is that machine learning must rely on the algorithms and data structures that classically form the underpinnings of theoretical computer science in order to fully harness the potential of these combinatorial ideas. To this end, we contribute two advances to machine learning based on N-gram features, a feature representation for strings that has stood the test of time and continues to provide state-of-the-art results in natural language processing and genomics.

The first addresses the computational and statistical issues of learning with long, and possibly all, N-grams in a document corpus. Our main result leverages suffix trees to provide a quadratic improvement in memory and processing time over current machine learning systems by virtue of a fast matrix-vector multiplication routine whose computational requirements are at worst linear in the length of the underlying document corpus. Since the majority of machine learning algorithms rely on, and are bottlenecked by, matrix-vector multiplication, our routine can speed up almost any learning system simply by replacing its multiplication routine with ours. The practical savings are substantial, including an efficiency gain of four orders of magnitude for DNA sequence data, and open a new realm of possibilities for N-gram models. The routine also has significant statistical implications: suffix trees perform a quadratic dimensionality reduction that substantially increases the robustness of machine learning systems when the appropriate level of data representation granularity is unknown. Finally, we provide an efficient persistent data storage system, based on our algorithms, that screens N-gram features according to a multitude of statistical criteria and produces data structures optimized for multiplication.
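For concreteness, the following toy sketch (a brute-force illustration only, not the thesis's suffix-tree routine; the corpus, the vector v, and all names below are invented for this example) enumerates every N-gram column of the document-by-N-gram count matrix X for a tiny corpus, groups identical columns, and computes X^T v with one dot product per group. The suffix-tree routine exploits the same redundancy without ever enumerating the quadratically many distinct N-grams.

# Brute-force toy, for illustration only (the thesis uses suffix trees instead):
# build the count column of every N-gram, group identical columns, and compute
# X^T v with one dot product per group. All inputs are invented for this example.
from collections import defaultdict

docs = ["abracadabra", "abraabra", "cadabra"]  # toy corpus
v = [1.0, -2.0, 0.5]                           # toy vector, one entry per document

# columns[s][d] = number of occurrences of N-gram s in document d
columns = defaultdict(lambda: [0] * len(docs))
for d, doc in enumerate(docs):
    for i in range(len(doc)):
        for j in range(i + 1, len(doc) + 1):
            columns[doc[i:j]][d] += 1

# Group N-grams whose columns are identical; all members of a group receive
# the same coordinate of X^T v, so one dot product per group suffices.
groups = defaultdict(list)
for s, col in columns.items():
    groups[tuple(col)].append(s)

xtv = {}
for col, members in groups.items():
    value = sum(c * vi for c, vi in zip(col, v))
    for s in members:
        xtv[s] = value

print(len(columns), "distinct N-grams,", len(groups), "distinct columns")
print("(X^T v)['abra'] =", xtv["abra"])

The number of distinct columns is at most proportional to the corpus length, while the number of distinct N-grams can grow quadratically; this gap is the quadratic dimensionality reduction described above.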
Our second contribution looks to classical ideas from compression to devise a new form of combinatorial Deep Learning for text, termed Dracula. Dracula is based on a generalization of the compression criterion underlying dictionary-based compressors such as Lempel-Ziv 78. It learns a dictionary of N-grams that efficiently compresses a text corpus and then recursively compresses its own dictionary for additional space savings. In doing so, it selects N-grams that are useful features for learning and induces a graph-based regularizer that orders the N-grams into low- and high-frequency components. Importantly, solving Dracula can be expressed as a binary linear program that may be further relaxed to a linear program, allowing a wide range of tools from optimization and computer science to be used to analyze its properties. Computationally, Dracula is NP-complete, but it exhibits substantial problem structure that allows approximate algorithms to scale to large datasets. Statistically, we show how Dracula can learn a multitude of representations to accommodate an underlying storage cost model, and we identify parameters that control the behavior of its solutions in meaningful ways. We also demonstrate that Dracula is amenable to fine-tuning by proving that its solutions evolve in a predictable way as the storage cost model varies. We demonstrate the utility of Dracula's features through experiments across a variety of problem domains, including natural language processing and bioinformatics.
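As a rough illustration of the kind of program involved, the sketch below sets up a toy, one-level dictionary-selection problem in the same spirit; it is not Dracula's actual formulation (in particular, it does not recursively compress the dictionary), and the corpus, cost values, and names are invented. Each occurrence of a candidate N-gram is a potential pointer, every character of every document must be covered by a selected pointer, a pointer may be used only if its N-gram is in the dictionary, and the objective trades the number of pointers against dictionary storage. The binary program is relaxed to a linear program over [0, 1], echoing the relaxation noted above.

# Toy dictionary-selection LP, for illustration only (not Dracula's formulation).
import numpy as np
from scipy.optimize import linprog

docs = ["abcabc", "bcabca"]  # toy corpus
max_len = 3                  # longest candidate N-gram
lam = 0.5                    # per-character cost of storing an N-gram in the dictionary

# Candidate N-grams and every occurrence ("pointer") of each one.
ngrams = sorted({d[i:i + k] for d in docs for k in range(1, max_len + 1)
                 for i in range(len(d) - k + 1)})
pointers = [(di, i, g) for di, d in enumerate(docs) for g in ngrams
            for i in range(len(d) - len(g) + 1) if d[i:i + len(g)] == g]
n_p, n_g = len(pointers), len(ngrams)
gidx = {g: j for j, g in enumerate(ngrams)}

# Variables x = [pointer uses (n_p), dictionary inclusions (n_g)], relaxed to [0, 1].
c = np.concatenate([np.ones(n_p), lam * np.array([len(g) for g in ngrams], float)])

A_ub, b_ub = [], []
# Coverage: every character of every document is covered by at least one pointer.
for di, d in enumerate(docs):
    for pos in range(len(d)):
        row = np.zeros(n_p + n_g)
        for p, (dj, i, g) in enumerate(pointers):
            if dj == di and i <= pos < i + len(g):
                row[p] = -1.0
        A_ub.append(row)
        b_ub.append(-1.0)
# Linking: a pointer can be used only if its N-gram is kept in the dictionary.
for p, (_, _, g) in enumerate(pointers):
    row = np.zeros(n_p + n_g)
    row[p] = 1.0
    row[n_p + gidx[g]] = -1.0
    A_ub.append(row)
    b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0, 1), method="highs")
dictionary = [g for j, g in enumerate(ngrams) if res.x[n_p + j] > 0.5]  # heuristic rounding
print("relaxed objective:", round(res.fun, 3))
print("selected dictionary:", dictionary)

Imposing integrality on the same variables (for instance via scipy.optimize.milp) recovers a binary program of the kind described above; the relaxation is what makes standard linear-programming machinery applicable.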

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Paskov, Hristo Spassimirov
Associated with Stanford University, Computer Science Department.
Primary advisor Hastie, Trevor
Primary advisor Mitchell, John
Thesis advisor Ré, Christopher

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Hristo Spassimirov Paskov.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Hristo Spassimirov Paskov
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
