Internationalization of task-oriented dialogue systems

Abstract/Contents

Abstract
Virtual assistants and task-oriented dialogue (ToD) agents are increasingly prevalent because of their utility in daily tasks. Despite the world's linguistic diversity, these digital assistants support only a few dominant languages. This restriction stems from the high cost and manual effort required to produce the large, hand-annotated datasets used to train these agents. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice substantial accuracy for data efficiency and largely fail to produce a usable dialogue agent. This thesis introduces a novel solution that automatically creates ToD agents in new languages by leveraging dialogue data in the source language and neural machine translation. The approach is based on automatic entity-aware translation of training data, a concise dialogue data representation that enables effective zero-shot training, and a scalable and robust approach for creating high-quality end-to-end few-shot, validation, and test data with minimal manual effort.
To address data scarcity, we use neural machine translation to translate the training dataset from the source language to the target language. We show that a naive application of this approach does not yield good performance, because entities in the input can be mistranslated, transliterated, or omitted, so that they no longer match those in the annotation. We propose a series of techniques to improve dataset quality by (1) leveraging word alignments from the neural translation model's cross-attention weights to preserve entities and (2) applying automatic data filtering based on textual semantic similarity to exclude poor translations. Using this approach, we create multilingual versions of Schema2QA, a single-turn question-answering dataset, in 10 different languages. Agents trained on our automatically translated data improve upon the previous state of the art by 30-40% and come within 5-8% of the original English agent.
Translation is inherently noisy and poses a special challenge in the end-to-end dialogue setting, where the amount of natural language encoded grows with each turn. Accumulated errors can prevent a correct parse for the rest of the dialogue. To address this, we introduce a new distilled dialogue data representation that significantly reduces the amount of natural language encoded and decoded by the model. On the BiToD dataset, this representation yields a 14% improvement in Dialogue Success Rate (DSR) in the few-shot setting.
The lack of a high-quality, realistic testbed for multilingual ToD evaluation has impeded accurate measurement of research progress on the topic. Prior work employed human translators to either translate a dataset or post-edit an automatically translated one. However, this was done for only one or two subtasks of a dialogue agent, so training an interactive end-to-end agent was not possible. To address this, we initiated a global effort to extend a large-scale multi-domain dataset, RiSAWOZ (initially in Chinese), to several new languages: English, Korean, French, Hindi, and code-mixed English-Hindi. To ensure the best quality and fluency, we used human post-editing only for the few-shot, validation, and test data. The challenges encountered in creating this dataset at scale led us to create a toolset that makes post-editing for a new language much faster and cheaper. Experiments show that few-shot training achieves 63-88% of the performance of the original full-shot training. The remaining gap motivates further research on multilingual ToD.
Overall, this thesis provides a new methodology and framework for extending the current capabilities of ToD systems to new languages cost-effectively. The vast number of languages and the ongoing evolution of machine learning models mean that there is a continuum of exploration ahead. This thesis lays the groundwork for a broader endeavor to democratize virtual assistants for all languages and cultures.
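The entity-preservation idea mentioned in the abstract (using cross-attention weights as word alignments) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `aligned_target_span` and `restore_entity`, the toy sentence, and the hand-written attention matrix are all hypothetical, and a real system would read the weights from a translation model and work on subword tokens. The sketch assumes `attn[t][s]` holds the attention weight from target token `t` to source token `s`.

```python
from typing import List, Sequence, Tuple


def aligned_target_span(attn: Sequence[Sequence[float]],
                        src_span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a source span [start, end) to a target span [lo, hi).

    For each source token in the span, pick the target token that attends
    to it most strongly, then cover all picked positions with one span.
    """
    start, end = src_span
    positions = []
    for s in range(start, end):
        column = [row[s] for row in attn]  # weights on source token s
        positions.append(max(range(len(column)), key=column.__getitem__))
    return min(positions), max(positions) + 1


def restore_entity(tgt_tokens: List[str],
                   attn: Sequence[Sequence[float]],
                   src_span: Tuple[int, int],
                   entity_tokens: List[str]) -> List[str]:
    """Overwrite the aligned target span with the annotated entity tokens."""
    lo, hi = aligned_target_span(attn, src_span)
    return tgt_tokens[:lo] + entity_tokens + tgt_tokens[hi:]


# Toy example (hypothetical data): the entity "Joe's Pizza" was reordered
# and partly translated, so it no longer matches the annotation.
src = ["find", "Joe's", "Pizza", "nearby"]
tgt = ["trouve", "Pizza", "de", "Joe", "proche"]
attn = [
    [0.90, 0.03, 0.03, 0.04],  # trouve
    [0.05, 0.10, 0.80, 0.05],  # Pizza
    [0.10, 0.40, 0.40, 0.10],  # de
    [0.05, 0.85, 0.05, 0.05],  # Joe
    [0.05, 0.05, 0.05, 0.85],  # proche
]
restored = restore_entity(tgt, attn, (1, 3), src[1:3])
print(restored)
```

Taking the per-token argmax and covering the results with a single span is only one simple alignment heuristic; it would over-extend the span if a single attention weight were spurious, which is one reason a subsequent quality filter on the translated data is useful.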

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2023; ©2023
Publication date 2023; 2023
Issuance monographic
Language English

Creators/Contributors

Author Moradshahi, Mehrad
Degree supervisor Lam, Monica
Thesis advisor Lam, Monica
Thesis advisor Boneh, Dan
Thesis advisor Sadigh, Dorsa
Degree committee member Boneh, Dan
Degree committee member Sadigh, Dorsa
Associated with Stanford University, School of Engineering
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Mehrad Moradshahi.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2023.
Location https://purl.stanford.edu/kg582jk7231

Access conditions

Copyright
© 2023 by Mehrad Moradshahi
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
