Hierarchical clustering with global objectives: approximation algorithms and hardness results


Abstract
Hierarchical Clustering is an important tool for data analysis, used in diverse areas ranging from Phylogenetics in Biology to YouTube video recommendations and everything in between. The term "hierarchical" is used to contrast with "flat" clustering approaches, like k-means or k-center, which produce only one specific partitioning of a data set. Hierarchical Clustering is a way of understanding the overall structure of large data sets by visualizing them as a collection of successively smaller clusters, capturing different levels of granularity simultaneously. The end result is a tree whose internal nodes represent nested clusters and whose leaves represent the data points. Despite the plethora of applications and heuristics, the theory behind Hierarchical Clustering was underdeveloped, since no "global" objective function was associated with the final output. A well-formulated objective allows us to compare different algorithms, measure their quality, explain their success or failure, and enables continuous optimization methods with gradient descent. This lack of objectives is in stark contrast with the flat clustering literature, where objectives like k-means, k-median or k-center have been studied intensively since the 1950s, leading to a comprehensive theory of clustering. In this thesis, we study approximation algorithms and their limitations, making progress towards building such a theory for objective-based Hierarchical Clustering (HC). For the minimization cost function with pairwise similarities introduced recently by Dasgupta (2016) and its maximization variants, we show how standard tools like Sparsest Cut, Balanced Cut, Max Cut or Max Uncut Bisection can be used in a black-box manner to produce provably good HC, and we also complement our algorithms with inapproximability results.
Escaping from worst-case analyses, we also study scenarios where extra structure is imposed on the data points (e.g., feature vectors) or qualitative information is provided (e.g., the common triplet or quartet constraints in Phylogenetics). Our results are inspired by geometric ideas (semidefinite programming relaxations, spreading metrics, random projections) and novel connections with ordering Constraint Satisfaction Problems. Overall, this thesis provides state-of-the-art approximation and hardness results for optimizing global HC objectives, reveals unexpected connections with common linkage or graph-cut heuristics, and advances our understanding of old Phylogenetics problems. As such, we hope it serves the community as a foundation for objective-based Hierarchical Clustering and for future improvements.
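To make the objective mentioned above concrete: Dasgupta's (2016) cost of a hierarchy charges each similar pair by the size of the smallest cluster that still contains both, i.e., the number of leaves under the pair's lowest common ancestor. The sketch below (illustrative only, not code from the thesis; the tree encoding as nested tuples and the `dasgupta_cost` name are our own choices) evaluates this cost for a given tree and similarity weights.

```python
def leaves(t):
    """Return the set of leaf ids under tree t.

    A leaf is an int (data point id); an internal node is a tuple
    of child subtrees.  This encoding is an assumption for the sketch.
    """
    if isinstance(t, int):
        return {t}
    s = set()
    for child in t:
        s |= leaves(child)
    return s


def dasgupta_cost(t, w):
    """Dasgupta's cost: sum over pairs {i, j} of w[{i, j}] times the
    number of leaves under their lowest common ancestor in t.

    w maps frozenset({i, j}) to a nonnegative similarity weight;
    missing pairs are treated as weight 0.
    """
    if isinstance(t, int):
        return 0
    child_leaves = [leaves(c) for c in t]
    size = sum(len(s) for s in child_leaves)
    # Recurse into subtrees first.
    cost = sum(dasgupta_cost(c, w) for c in t)
    # Pairs split apart at this node have this node as their LCA,
    # so each such pair pays w(i, j) * (leaves under this node).
    for a in range(len(t)):
        for b in range(a + 1, len(t)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += w.get(frozenset((i, j)), 0) * size
    return cost
```

For example, on the unit-weight triangle {0, 1, 2}, the tree ((0, 1), 2) pays 2 for the pair {0, 1} (LCA has 2 leaves) and 3 each for {0, 2} and {1, 2} (LCA is the root, 3 leaves), for a total cost of 8. Minimizing this quantity over all trees is the objective the thesis approximates via recursive graph cuts such as Sparsest Cut.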

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2020; ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Chatziafratis, Evangelos
Degree supervisor Charikar, Moses
Degree supervisor Roughgarden, Tim
Thesis advisor Charikar, Moses
Thesis advisor Roughgarden, Tim
Thesis advisor Tan, Li-Yang
Thesis advisor Valiant, Gregory
Degree committee member Tan, Li-Yang
Degree committee member Valiant, Gregory
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Evangelos Chatziafratis
Note Submitted to the Computer Science Department
Thesis Thesis (Ph.D.)--Stanford University, 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Evangelos Chatziafratis
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
