Hierarchical clustering with global objectives: approximation algorithms and hardness results


Abstract
Hierarchical Clustering is an important tool for data analysis, used in diverse areas ranging from Phylogenetics in Biology to YouTube video recommendations and everything in between. The term "hierarchical" is used to contrast with "flat" clustering approaches, like k-means or k-center, which produce only one specific partitioning of a data set. Hierarchical Clustering is a way of understanding the overall structure of large data sets by visualizing them as a collection of successively smaller clusters, capturing different levels of granularity simultaneously. The end result is a tree whose internal nodes represent nested clusters and whose leaves represent the data points. Despite the plethora of applications and heuristics, the theory behind Hierarchical Clustering was underdeveloped, since no "global" objective function was associated with the final output. A well-formulated objective allows us to compare different algorithms, measure their quality, explain their success or failure, and enables continuous optimization methods with gradient descent. This lack of objectives is in stark contrast with the flat clustering literature, where objectives like k-means, k-median or k-center have been studied intensively since the 1950s, leading to a comprehensive theory of clustering. In this thesis, we study approximation algorithms and their limitations, making progress towards building such a theory for objective-based Hierarchical Clustering (HC). For the minimization cost function with pairwise similarities introduced recently by Dasgupta (2016) and its maximization variants, we show how standard tools like Sparsest Cut, Balanced Cut, Max Cut or Max Uncut Bisection can be used in a black-box manner to produce provably good HC, and we also complement our algorithms with inapproximability results.
Escaping from worst-case analyses, we also study scenarios where extra structure is imposed on the data points (e.g., feature vectors) or qualitative information is provided (e.g., the common triplet or quartet constraints in Phylogenetics). Our results are inspired by geometric ideas (semidefinite programming relaxations, spreading metrics, random projections) and novel connections with ordering Constraint Satisfaction Problems. Overall, this thesis provides state-of-the-art approximation and hardness results for optimizing global HC objectives, reveals unexpected connections with common linkage or graph-cut heuristics, and advances our understanding of old Phylogenetics problems. As such, we hope it serves the community as a foundation for objective-based Hierarchical Clustering and for future improvements.
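To make the objective mentioned above concrete: Dasgupta's (2016) cost of a hierarchy charges each similar pair by the size of the smallest cluster that still contains both, i.e., the number of leaves under the pair's lowest common ancestor. The sketch below (illustrative only, not code from the thesis; the tree encoding as nested tuples and the `dasgupta_cost` name are our own choices) evaluates this cost for a given tree and similarity weights.

```python
def leaves(t):
    """Return the set of leaf ids under tree t.

    A leaf is an int (data point id); an internal node is a tuple
    of child subtrees.  This encoding is an assumption for the sketch.
    """
    if isinstance(t, int):
        return {t}
    s = set()
    for child in t:
        s |= leaves(child)
    return s


def dasgupta_cost(t, w):
    """Dasgupta's cost: sum over pairs {i, j} of w[{i, j}] times the
    number of leaves under their lowest common ancestor in t.

    w maps frozenset({i, j}) to a nonnegative similarity weight;
    missing pairs are treated as weight 0.
    """
    if isinstance(t, int):
        return 0
    child_leaves = [leaves(c) for c in t]
    size = sum(len(s) for s in child_leaves)
    # Recurse into subtrees first.
    cost = sum(dasgupta_cost(c, w) for c in t)
    # Pairs split apart at this node have this node as their LCA,
    # so each such pair pays w(i, j) * (leaves under this node).
    for a in range(len(t)):
        for b in range(a + 1, len(t)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += w.get(frozenset((i, j)), 0) * size
    return cost
```

For example, on the unit-weight triangle {0, 1, 2}, the tree ((0, 1), 2) pays 2 for the pair {0, 1} (LCA has 2 leaves) and 3 each for {0, 2} and {1, 2} (LCA is the root, 3 leaves), for a total cost of 8. Minimizing this quantity over all trees is the objective the thesis approximates via recursive graph cuts such as Sparsest Cut.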

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2020; ©2020
Publication date 2020
Issuance monographic
Language English

Creators/Contributors

Author Chatziafratis, Evangelos
Degree supervisor Charikar, Moses
Degree supervisor Roughgarden, Tim
Thesis advisor Charikar, Moses
Thesis advisor Roughgarden, Tim
Thesis advisor Tan, Li-Yang
Thesis advisor Valiant, Gregory
Degree committee member Tan, Li-Yang
Degree committee member Valiant, Gregory
Associated with Stanford University, Computer Science Department.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Evangelos Chatziafratis
Note Submitted to the Computer Science Department
Thesis Thesis (Ph.D.)--Stanford University, 2020
Location electronic resource

Access conditions

Copyright
© 2020 by Evangelos Chatziafratis
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
