Pre-finetuning methods for domain and task adaptation, with applications to discourse and translation
Abstract/Contents
- Abstract
- A recently adopted standard technique for training models for NLP applications is to pretrain a model on a large generic dataset and then finetune the model on the target-domain training data. However, the pretraining task often differs greatly from the target task. Target tasks can use data from different domains, incorporate custom labels, and require modifications to base models, such as adding parameters. Performance on downstream tasks can be improved by bridging the gap between the pretraining and finetuning phases. An intermediate phase, called pre-finetuning, can be used to adapt to the target domain prior to finetuning. Introducing this intermediate phase incorporates domain knowledge into the model, resulting in better performance after finetuning. We study applications to discourse and translation, both settings where high-quality in-domain labels are expensive to acquire. We present a theoretical framework for language model adaptation and three applications of pre-finetuning, each an exemplar of pre-finetuning focused on the data, the model, and the annotations, respectively: (1) data selection for machine translation, (2) sentence-level objectives for discourse performance of language models, and (3) weakly-supervised discourse relation recognition. We present empirical results on data selection for neural machine translation and show that pre-finetuning on the subset of pretraining data most similar to the target domain improves the performance of the final model. However, we show that while trivially selecting the most similar data improves performance, the optimal setting requires finding data that complements the target-domain data rather than mirroring it. Then, we present Conpono, a novel objective introduced during pre-finetuning. This inter-sentence objective models discourse coherence and the distance between sentences.
We show that by pre-finetuning a pretrained language model with Conpono, the model improves on the previous state of the art on discourse representation evaluation benchmarks. Lastly, we introduce DiscoMtB, a method for discourse representation learning that discovers discourse structures to serve as pseudo-labels in a text corpus. The weakly supervised discourse relations are used both to create new pre-finetuning training data for sentence relation classification tasks and to augment text generation models, acting as an interpretable knob that introduces more diversity into the generation space while maintaining discourse coherence in the generated text. This dissertation presents the benefits of a three-phase training process and three applications that bridge the gap between pretraining and finetuning during pre-finetuning by adapting the data, the model architecture, and the labels, respectively.
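The data-selection idea in the abstract can be illustrated with a minimal toy sketch: rank pretraining examples by similarity to the target domain and keep the top-k for an intermediate pre-finetuning phase. Note this is a hypothetical illustration using simple bag-of-words cosine similarity; the dissertation's actual selection criteria are more sophisticated (and, as the abstract notes, favor data that complements rather than mirrors the target domain). The function and corpus names below are invented for the example.

```python
from collections import Counter
import math

def bow_vector(text):
    # Bag-of-words counts as a crude stand-in for a domain representation.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_for_prefinetuning(pretrain_corpus, target_corpus, k):
    """Rank pretraining examples by similarity to the target domain
    and keep the top k for the intermediate pre-finetuning phase."""
    target_vec = bow_vector(" ".join(target_corpus))
    ranked = sorted(pretrain_corpus,
                    key=lambda s: cosine(bow_vector(s), target_vec),
                    reverse=True)
    return ranked[:k]

# Toy corpora: generic pretraining data vs. a financial/trade target domain.
pretrain = [
    "the cat sat on the mat",
    "stock prices rose sharply today",
    "the parliament passed the trade bill",
    "a recipe for chocolate cake",
]
target = [
    "trade ministers discussed stock tariffs",
    "prices of exported goods rose",
]

selected = select_for_prefinetuning(pretrain, target, k=2)
print(selected)  # the two in-domain-looking sentences rank highest
```

The selected subset would then be used for continued training of the pretrained model before finetuning on the (small) target-domain data, giving the three-phase pipeline the abstract describes.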
Description
Type of resource | text
---|---
Form | electronic resource; remote; computer; online resource
Extent | 1 online resource
Place | California
Place | [Stanford, California]
Publisher | [Stanford University]
Copyright date | ©2022
Publication date | 2022
Issuance | monographic
Language | English
Creators/Contributors
Author | Iter, Dan
---|---
Degree supervisor | Jurafsky, Dan, 1962-
Thesis advisor | Jurafsky, Dan, 1962-
Thesis advisor | Hashimoto, Tatsunori
Thesis advisor | Liang, Percy
Degree committee member | Hashimoto, Tatsunori
Degree committee member | Liang, Percy
Associated with | Stanford University, Computer Science Department
Subjects
Genre | Theses
---|---
Genre | Text
Bibliographic information
Statement of responsibility | Dan Iter
---|---
Note | Submitted to the Computer Science Department.
Thesis | Thesis (Ph.D.)--Stanford University, 2022.
Location | https://purl.stanford.edu/xn665xc5858
Access conditions
- Copyright
- © 2022 by Dan Iter
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).