Rethinking single-cell RNA-Seq analysis

Abstract/Contents

Abstract
Since the Human Genome Project was completed in 2003, scientists have developed technologies for measuring the RNA content of a single cell. In the last decade, the number of individual cells profiled per study has grown exponentially to over 1,000,000 cells. In this thesis, I will discuss some of the computational and statistical challenges associated with the analysis of such large single-cell datasets. After introducing background information, the thesis covers three main works. The first work introduces a novel, interpretable clustering framework designed with the biologist end user in mind. The framework also addresses the clustering subjectivity issue by justifying its results based on a rigorous definition of cell type. This allows us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. The second work considers a novel approach for representing single-cell RNA-Seq data. We argue that gene or transcript expression vectors, while intuitive, are not the optimal way to represent single-cell genomic profiles. Rather than counting the number of reads that come from each transcript, which requires resolving the ambiguity associated with read multimapping, we count the number of reads that come from each transcript set. We show that these new representations are both more computationally efficient to obtain and more information-rich. The third and perhaps most interesting work first observes a post-selection inference problem in standard single-cell computational pipelines. Standard pipelines perform differential analysis after clustering on the same dataset, and this reuse of the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem.
In summary, we discuss multiple works for drawing key insights from single-cell RNA-Seq data: a clustering method that emphasizes interpretability of results, a representation of single cells that retains more information from read data, and a framework for correcting the selection bias from standard analysis pipelines.
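
The post-selection inference problem described in the abstract can be illustrated with a toy simulation. The sketch below is not the framework from the thesis; it only shows the effect under simplifying assumptions: a single simulated "gene" drawn from one homogeneous population, a median split standing in for a clustering step, and a normal-approximation two-sample test. The function names `two_sample_p_value` and `simulate` are hypothetical.

```python
import math
import random
import statistics


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_cells=1000, seed=0):
    rng = random.Random(seed)
    # Assumption: one homogeneous population, so there are no true subgroups.
    expression = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    # Stand-in for clustering: split the cells at the median of the same gene.
    cutoff = statistics.median(expression)
    low = [x for x in expression if x <= cutoff]
    high = [x for x in expression if x > cutoff]
    # Testing on the data that defined the groups reuses the same information.
    naive_p = two_sample_p_value(low, high)

    # Control: group labels assigned independently of the expression values.
    shuffled = expression[:]
    rng.shuffle(shuffled)
    half = n_cells // 2
    control_p = two_sample_p_value(shuffled[:half], shuffled[half:])
    return naive_p, control_p


naive_p, control_p = simulate()
print(f"p-value after splitting on the same data: {naive_p:.3g}")
print(f"p-value with data-independent labels:     {control_p:.3g}")
```

Because the two groups are defined by the very values being tested, the first p-value is extremely small even though no real subpopulations exist, while the control with data-independent labels behaves like an ordinary null test.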

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019; 2019
Issuance monographic
Language English

Creators/Contributors

Author Zhang, Jesse Min
Degree supervisor Tse, David
Thesis advisor Tse, David
Thesis advisor Nishimura, Dwight George
Thesis advisor Zou, James
Degree committee member Nishimura, Dwight George
Degree committee member Zou, James
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Jesse Min Zhang.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Jesse Min Zhang
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
