Pre-finetuning methods for domain and task adaptation, with applications to discourse and translation
Abstract/Contents
- Abstract
- A recently adopted standard technique for training models for NLP applications is to pretrain a model on a large generic dataset and then finetune the model on the target-domain training data. However, the pretraining task often differs greatly from the target task. Target tasks can use data from different domains, incorporate custom labels, and require modifications to base models, such as adding parameters. Performance on downstream tasks can be improved by bridging the gap between the pretraining and finetuning phases. An intermediate phase, called pre-finetuning, can be used to adapt to the target domain prior to finetuning. Introducing this intermediate phase incorporates domain knowledge into the model, resulting in better performance after finetuning. We study applications to discourse and translation, both settings where high-quality in-domain labels are expensive to acquire. We present a theoretical framework for language model adaptation and three applications of pre-finetuning, each an exemplar of pre-finetuning focused on the data, the model, and the annotations, respectively: (1) data selection for machine translation, (2) sentence-level objectives for discourse performance of language models, and (3) weakly-supervised discourse relation recognition. We present empirical results on data selection for neural machine translation and show that pre-finetuning on the subset of pretraining data most similar to the target domain improves the performance of the final model. However, we show that while trivially selecting the most similar data improves performance, the optimal setting requires finding data that complements the target-domain data rather than mirroring it. Then, we present Conpono, a novel objective introduced during pre-finetuning. This inter-sentence objective models discourse coherence and the distance between sentences.
We show that by pre-finetuning a pretrained language model with Conpono, the model improves on the previous state of the art on discourse representation evaluation benchmarks. Lastly, we introduce DiscoMtB, a method for discourse representation learning that discovers discourse structures to serve as pseudo-labels in a text corpus. The weakly supervised discourse relations are used both to create new pre-finetuning training data for sentence relation classification tasks and to augment text generation models, acting as an interpretable knob that introduces more diversity into the generation space while maintaining discourse coherence in the generated text. This dissertation presents the benefits of a three-phase training process and three applications that bridge the gap between pretraining and finetuning during pre-finetuning by adapting the data, the model architecture, and the labels, respectively.
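The data-selection idea in the abstract can be illustrated with a minimal toy sketch: rank pretraining examples by similarity to the target domain and keep the top-k for an intermediate pre-finetuning phase. Note this is a hypothetical illustration using simple bag-of-words cosine similarity; the dissertation's actual selection criteria are more sophisticated (and, as the abstract notes, favor data that complements rather than mirrors the target domain). The function and corpus names below are invented for the example.

```python
from collections import Counter
import math

def bow_vector(text):
    # Bag-of-words counts as a crude stand-in for a domain representation.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_for_prefinetuning(pretrain_corpus, target_corpus, k):
    """Rank pretraining examples by similarity to the target domain
    and keep the top k for the intermediate pre-finetuning phase."""
    target_vec = bow_vector(" ".join(target_corpus))
    ranked = sorted(pretrain_corpus,
                    key=lambda s: cosine(bow_vector(s), target_vec),
                    reverse=True)
    return ranked[:k]

# Toy corpora: generic pretraining data vs. a financial/trade target domain.
pretrain = [
    "the cat sat on the mat",
    "stock prices rose sharply today",
    "the parliament passed the trade bill",
    "a recipe for chocolate cake",
]
target = [
    "trade ministers discussed stock tariffs",
    "prices of exported goods rose",
]

selected = select_for_prefinetuning(pretrain, target, k=2)
print(selected)  # the two in-domain-looking sentences rank highest
```

The selected subset would then be used for continued training of the pretrained model before finetuning on the (small) target-domain data, giving the three-phase pipeline the abstract describes.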
Description
Type of resource | text
---|---
Form | electronic resource; remote; computer; online resource
Extent | 1 online resource
Place | California
Place | [Stanford, California]
Publisher | [Stanford University]
Copyright date | ©2022
Publication date | 2022
Issuance | monographic
Language | English
Creators/Contributors
Author | Iter, Dan
---|---
Degree supervisor | Jurafsky, Dan, 1962-
Thesis advisor | Jurafsky, Dan, 1962-
Thesis advisor | Hashimoto, Tatsunori
Thesis advisor | Liang, Percy
Degree committee member | Hashimoto, Tatsunori
Degree committee member | Liang, Percy
Associated with | Stanford University, Computer Science Department
Subjects
Genre | Theses
---|---
Genre | Text
Bibliographic information
Statement of responsibility | Dan Iter
---|---
Note | Submitted to the Computer Science Department.
Thesis | Thesis (Ph.D.)--Stanford University, 2022.
Location | https://purl.stanford.edu/xn665xc5858
Access conditions
- Copyright
- © 2022 by Dan Iter
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).