Honors Thesis 2022 - Daniil Huryn

Framework for Automatic Generation of Large-scale Dialogue Data from Online Forums

Daniil Huryn

Highest Honor in Computer Science

Abstract

Unsupervised machine learning models have taken the natural language processing world by storm. Transformers, currently the most popular unsupervised models, utilize vast amounts of data and deliver performance far beyond what could have been achieved only a few years ago. As good as these models are, they have one major requirement: a lot of data. One of the first transformers, BERT, was trained on 3.3 billion words, and later models such as GPT-3 have used even more. This presents unsupervised dialogue models with a problem: there is not much high-quality dialogue data available, certainly not on the scale required. Because dialogue is far harder to find online than posts, articles, and similar text, high-quality datasets are usually very limited in size (Switchboard, DailyDialog), while high-quantity datasets (OpenSubtitles, Reddit Corpus) are either low quality or of a very specific type, for instance movie subtitles. One of the main mitigations of this issue has been to first train models on large amounts of low-quality data and then fine-tune them on small amounts of high-quality data. In this paper, we propose to create a high-quantity, medium-quality, multi-turn dataset that will allow for far better model training. To do this, we take a more computational approach to dialogue creation: we build dialogues from a set of Reddit posts and their respective comments, blending them in a way that creates a new conversation out of a disjointed online forum post. By utilizing the nature of Reddit threads and a variety of natural language processing metrics, we intend to first construct and then thoroughly filter conversations to automatically create a large dataset of high-quality dialogues.

Department / School

Computer Science / Emory University

Degree / Year

BS / Spring 2022

Committee

Jinho D. Choi, Computer Science and QTM, Emory University (Chair)
Ting Li, Computer Science, Oxford College of Emory University
Jonathan Hulgan, Mathematics, Oxford College of Emory University

Links

Anthology | Paper | Presentation

Daniil Huryn (top-left), Jinho Choi (top-right), Jonathan Hulgan (bottom-left), Ting Li (bottom-right)