On String-Similarity Estimation Problems

Placeholder Show Content

Abstract/Contents

Abstract

In this thesis, we study two widely used measures of string similarity: edit distance and dynamic time warping (DTW). We investigate alignment algorithms, metric embeddings, fast (approximation) algorithms, and communication complexity.

Results on Edit Distance: Edit distance is a fundamental distance metric between strings with applications throughout computer science. While the problem of estimating edit distance has been studied extensively, the equally important question of producing an alignment (i.e., the sequence of edits) has received far less attention. Somewhat surprisingly, we show that any algorithm to estimate edit distance can be used in a black-box fashion to produce an approximate alignment of strings, with modest loss in approximation factor and small loss in run time. Plugging in a previous result, we obtain an alignment that is a $(\log n)^{O(1/\varepsilon^2)}$ approximation in time $\tilde{O}(n^{1 + \varepsilon})$.

Closely related to the study of approximation algorithms is the study of metric embeddings. We present three results which together demonstrate the effectiveness of min-hash techniques in designing edit distance embeddings: (1) An embedding from Ulam distance (edit distance over permutations) to Hamming space that matches the best known distortion of $O(\log n)$ and also implicitly encodes sequences of edits; (2) Given an upper bound $K$ on edit distance, we show that embeddings of edit distance into Hamming space with distortion $f(n)$ can be modified in a black-box fashion to give distortion $O(f(\operatorname{poly}(K)))$ for a class of periodic-free strings; (3) A randomized dimension-reduction map with contraction $c$ and asymptotically optimal expected distortion $O(c)$, improving on a previous result.

Results on Dynamic Time Warping: In the second half of the thesis, we turn our attention to dynamic time warping (DTW), a close cousin of edit distance whose primary applications are as a similarity measure between time series.

We present the first strongly subquadratic algorithm for DTW in the low-distance regime. Our algorithm runs in time $O(n \cdot \dtw(x, y))$ for $x$ and $y$ of length $n$ over an arbitrary metric. Building on this, we present an approximation algorithm for $\dtw$ over an arbitrary $2$-hierarchically well-separated tree metric, providing a $\alpha$-approximation in time $\tilde{O}(n^2/\alpha)$. Additionally, we obtain the first approximation algorithm for edit distance to work over an arbitrary metric space, providing an $\alpha$-approximation in time $\tilde{O}(n^2 / \alpha)$, with high probability.

Finally, we provide near tight bounds on the one-way communication complexity of DTW -- an upper bound of $\tilde{O}(n / \alpha)$ bits for efficiently computing an $\alpha$-approximation of DTW between strings of length $n$, and a lower bound of $\Omega(n / \alpha)$ bits for the same problem.

Description

Type of resource text
Date created [ca. 2018]

Creators/Contributors

Author Kuszmaul, William
Primary advisor Charikar Moses
Degree granting institution Stanford University, Department of Computer Science

Subjects

Subject Stanford School of Engineering
Subject Department of Computer Science
Subject Algorithm
Subject Edit Distance
Subject Approximation Algorithm
Subject Embedding
Genre Thesis

Bibliographic information

Access conditions

Use and reproduction
User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Preferred citation

Preferred Citation
Kuszmaul, William. (2018). On String-Similarity Estimation Problems. Stanford Digital Repository. Available at: https://purl.stanford.edu/yn198kk9738

Collection

Undergraduate Theses, School of Engineering

View other items in this collection in SearchWorks

Contact information

Also listed in

Loading usage metrics...