Inference on the generalization error of machine learning algorithms and the design of hierarchical medical term embeddings
- This dissertation comprises three papers that address important challenges in applying statistical and machine learning techniques in biomedical research, ranging from valid statistical inference on evaluating algorithm performance via general cross-validation, to a new embedding method for biomedical terms based on their hierarchical structure for better downstream applications, to improving model performance by balancing the prediction accuracy and the cost of collecting relevant prediction features.
The first paper introduces a novel fast bootstrap method to estimate the standard error of cross-validation estimates. Cross-validation helps avoid the optimism bias in error estimates, which can be significant for models built using complex statistical learning algorithms. However, since the cross-validation estimate is a random value dependent on observed data, it is essential to accurately quantify the uncertainty associated with this estimate. This is especially important when comparing the performance of two models, as one must determine whether differences in error estimates are a result of chance fluctuations. Although various methods have been developed for making inferences on cross-validation estimates, they often have many limitations, such as stringent model assumptions or constraints on the form of the loss function. This paper proposes an accelerated bootstrap method that quickly estimates the standard error of the cross-validation estimate and produces valid confidence intervals for a population parameter measuring average model performance. Our method overcomes the computational challenge inherent in bootstrapping a cross-validation estimate by estimating the variance component via fitting a random effects model. To showcase the effectiveness of our approach, we employ comprehensive simulations and real data analysis across three diverse applications.
The second paper presents a novel biomedical term representation model fine-tuned on hierarchical structures. Electronic health records contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing of clinical notes can use observed frequencies of clinical terms as predictive features for various downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low dimensional embeddings as more informative feature vectors for those applications. Fine-tuning pre-trained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a biomedical term representation model trained on additional data sources containing hierarchical structures for various biomedical terms. We modify existing contrastive loss functions to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns the pair-wise distance from hierarchical information, resulting in substantially more informative embeddings for further biomedical applications.
The third paper proposes a dynamic prediction rule for clinical decision-making, aiming to optimize the order of acquiring prediction features. Physicians today have access to a wide array of tests for diagnosing and prognosticating medical conditions. Ideally, they would apply a high-quality prediction model, utilizing all relevant features as input, to facilitate appropriate decision-making regarding treatment selection or risk assessment. However, not all features used in these prediction models are readily available without incurring some costs. In practice, predictors are typically gathered as needed in a sequential manner, while the physician dynamically evaluates this information. This process continues until sufficient information is acquired, and the physician gains reasonable confidence in making a decision. Importantly, the prospective information to collect may differ for each patient and depend on the predictor values already known. Our method aims to address these challenges, with the objective of maximizing the prediction accuracy while minimizing the costs associated with measuring prediction features for individual subjects. To achieve this, we employ a reinforcement learning algorithm, where the agent must decide on the best action at each step: either making a clinical decision with available information or continuing to collect new predictors based on the current state of knowledge. To evaluate the efficacy of the proposed dynamic prediction strategy, we conducted extensive simulation studies. Additionally, we provide two real data examples to illustrate the practical application of our method.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Degree committee member
|Degree committee member
|Stanford University, School of Engineering
|Stanford University, Computer Science Department
|Statement of responsibility
|Submitted to the Computer Science Department.
|Thesis Ph.D. Stanford University 2023.
- © 2023 by Bryan Cai
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).