Topics in unsupervised learning : feature selection and multi-modality
- In the unsupervised setting, one often clusters data in an attempt to learn the unobserved latent class variable. Proper inference requires determining both the correct number of clusters and the subset of features that depend on the class variable. In the supervised setting, prediction error guides the decision-making process. An analog for unsupervised data is prediction strength (Tibshirani and Walther, 2005), which estimates this error by measuring cluster stability. Although prediction strength was originally proposed as a method for determining the number of clusters, we show that it can also be used for feature selection. Additionally, when feature selection is posed as a model selection problem, one can compute the likelihood that a feature depends on the latent variable. As the dimensionality of the problem grows, sampling models must be approached with care, which motivates a survey of various sampling methods. The second part of the thesis considers low-dimensional projections of the data via principal curves (Hastie and Stuetzle, 1989) as a vehicle for determining the number of clusters. In the low-dimensional setting (often a single dimension), the investigation of multi-modality is simplified, resulting in flexible estimation of the actual number of clusters.
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|2010, c2011; 2010
|Ahmed, Murat Omer
|Stanford University, Department of Statistics
|Lai, T. L
|Statement of responsibility
|Murat Ömer Ahmed.
|Submitted to the Department of Statistics.
|Thesis (Ph.D.)--Stanford University, 2011.
- © 2011 by Murat Omer Ahmed
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).