Automatic Generation of Multi-turn Dialogues from Reddit

Mack Hutsell

Highest Honors in Computer Science


Abstract

High-quality multi-turn dialogue datasets are a scarce commodity in Natural Language Processing, and with the recent rise of chatbots powered by seq2seq models trained on such datasets, they have become more important than ever. This thesis describes a model built to deconstruct Reddit posts and sequence the fragments into high-quality, multi-turn, topic-specific conversations. The model uses a post's content as the framework for one speaker's statements in a conversation, filling in the second speaker's utterances with comments left on the same post. A dataset of 951 dialogues, HuHu, was generated with this method, spanning two topics: movies and books. HuHu was then manually evaluated against DailyDialog, Topical-Chat, and MultiWOZ, three high-quality datasets of roughly 10,000 dialogues each, constructed in varying ways. Our generated dialogues were judged more natural than the comparison dialogues in 46% of cases and at least as natural in 73% of comparisons. This is a striking result given that the model can generate millions of dialogues across any number of topics, limited only by the availability of related Reddit posts. Future work on dialogue assembly models appears promising and could yield dialogues of near-human quality in the near future.
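The assembly idea described above can be sketched roughly as follows. This is an illustrative toy, not the thesis's actual pipeline: the function name, the in-order pairing of sentences with comments, and the sample data are all assumptions (the real model presumably selects and orders comments by relevance rather than pairing them naively).

```python
def assemble_dialogue(post_sentences, comments):
    """Interleave a post's sentences (Speaker A's framework) with
    comments left on the post (Speaker B's utterances)."""
    dialogue = []
    for i, sentence in enumerate(post_sentences):
        dialogue.append(("A", sentence))
        # Fill in Speaker B's turn with a comment, if one is available.
        if i < len(comments):
            dialogue.append(("B", comments[i]))
    return dialogue

# Illustrative example with made-up post and comment text.
post = ["Just finished Dune and loved it.", "Should I read the sequels?"]
comments = ["What did you like most about it?", "Messiah is worth reading."]
for speaker, utterance in assemble_dialogue(post, comments):
    print(f"{speaker}: {utterance}")
```

The point of the sketch is only the two-speaker structure: the post supplies one side of the conversation, and the comment section supplies the other.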

Department / School

Computer Science / Emory University

Degree / Year

BS / Spring 2022

Committee

Jinho D. Choi, Computer Science and QTM, Emory University (Chair)
Lauren Klein, English and QTM, Emory University
Ting Li, Computer Science, Oxford College of Emory University

Links

Anthology | Paper | Presentation

Mack Hutsell (top-left), Jinho Choi (top-right), Ting Li (bottom-left), Lauren Klein (bottom-right)