Rethinking single-cell RNA-Seq analysis
Abstract/Contents
- Abstract
- Since the Human Genome Project was completed in 2003, scientists have developed technologies for measuring the RNA content of a single cell. In the last decade, the number of individual cells profiled per study has grown exponentially to over 1,000,000 cells. In this thesis, I will discuss some of the computational and statistical challenges associated with the analysis of such large single-cell datasets. After introducing background information, the thesis covers three main works. The first work introduces a novel, interpretable clustering framework designed with the biologist end user in mind. The framework also addresses the subjectivity of clustering by justifying its results with a rigorous definition of cell type. This allows us to cluster using feature selection and uncover multiple levels of biologically meaningful populations in the data. The second work considers a novel approach for representing single-cell RNA-Seq data. We argue that gene or transcript expression vectors, while intuitive, are not the optimal way to represent single-cell genomic profiles. Rather than counting the number of reads that come from each transcript, which requires resolving the ambiguity associated with read multimapping, we count the number of reads that come from each transcript set. We show that these new representations are both more computationally efficient to obtain and more information-rich. The third, and perhaps most interesting, work first observes a post-selection inference problem in standard single-cell computational pipelines. Standard pipelines perform differential analysis after clustering on the same dataset, and this reuse of the data generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework that corrects for this problem.
In summary, we discuss multiple works for drawing key insights from single-cell RNA-Seq data: a clustering method that emphasizes interpretability of results, a representation of single cells that retains more information from read data, and a framework for correcting the selection bias from standard analysis pipelines.
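The post-selection inference problem described in the abstract can be illustrated with a small simulation. The sketch below is a toy example and not the thesis's framework. It assumes a single simulated "gene" drawn from one homogeneous population, uses a median split as a stand-in for clustering, and uses a normal approximation for the two-sample test. The helper `two_sample_p_value` and all variable names are hypothetical.

```python
import math
import random
import statistics


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(0)

# One homogeneous population: a single simulated "gene" with no real subgroups.
expression = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Stand-in for clustering (an assumption): split the cells at the median
# of the very values that will be tested afterwards.
cutoff = statistics.median(expression)
low = [x for x in expression if x <= cutoff]
high = [x for x in expression if x > cutoff]
p_clustered = two_sample_p_value(low, high)

# Control: group labels assigned without looking at the expression values.
shuffled = expression[:]
rng.shuffle(shuffled)
p_random = two_sample_p_value(shuffled[:100], shuffled[100:])

print(f"p-value after clustering on the same data: {p_clustered:.3g}")
print(f"p-value with data-independent labels:      {p_random:.3g}")
```

Because the groups are defined from the same values that are then tested, the first p-value is extremely small even though no true subpopulations exist. With data-independent labels, the test behaves as expected under the null hypothesis.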
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2019; ©2019 |
Publication date | 2019 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Zhang, Jesse Min |
---|---|
Degree supervisor | Tse, David |
Thesis advisor | Tse, David |
Thesis advisor | Nishimura, Dwight George |
Thesis advisor | Zou, James |
Degree committee member | Nishimura, Dwight George |
Degree committee member | Zou, James |
Associated with | Stanford University, Department of Electrical Engineering. |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Jesse Min Zhang. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis Ph.D. Stanford University 2019. |
Location | electronic resource |
Access conditions
- Copyright
- © 2019 by Jesse Min Zhang
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).