Conversation Data Generation from Discussion Forums

Daniil Huryn

Date: 2021-11-05 / 3:00 ~ 4:00 PM


A major roadblock to the success of unsupervised dialogue models is the scarcity of high-quality multi-turn dialogue data. The highest-quality datasets, such as the Switchboard Corpus, contain few samples, while the largest datasets, such as OpenSubtitles or the Project Gutenberg dataset, contain samples of lower quality. To date, most dialogue datasets involve little computational manipulation: data is simply collected and presented.

Recently, advances such as Blender have been powered by a new kind of dataset, created through intelligent use of computational processes. The data for Blender, for example, was built by taking Reddit data and extracting millions of two-turn conversations, along with “persona profiles” for the associated speakers. Such work opens the door to more sophisticated computational approaches to dataset creation. We take advantage of numerous recent advances in Natural Language Processing, focusing them on constructing a large, high-quality, multi-turn conversational dialogue dataset. We intend to use Reddit data in a more intelligent manner, creating conversations significantly longer than two turns, using methods that map comments to sections of posts (with a QA model or a dialog ranking model), Reddit-specific features that let users quote others when replying to them, and a variety of other methods we are considering.
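
As a rough illustration of the quote-based idea, the sketch below pairs each quoted passage in a reply with the text written directly under it, then keeps only the pairs whose quote actually appears in the parent post. It is a minimal sketch under stated assumptions, not the pipeline described in the talk: the function names (split_quoted_reply, quote_to_turns) are hypothetical, comments are assumed to be raw markdown in which quoted lines start with ">", and plain substring matching stands in for the QA or dialog ranking model.

```python
from typing import List, Tuple


def split_quoted_reply(body: str) -> List[Tuple[str, str]]:
    """Split a markdown comment into (quoted text, reply text) pairs.

    Hypothetical helper: assumes quoted lines begin with '>' and that the
    reply to a quote is the unquoted text that follows it.
    """
    pairs: List[Tuple[str, str]] = []
    quote_lines: List[str] = []
    reply_lines: List[str] = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            # A new quote starts: store the previous quote/reply pair, if any.
            if quote_lines and reply_lines:
                pairs.append((" ".join(quote_lines), " ".join(reply_lines)))
                quote_lines, reply_lines = [], []
            quote_lines.append(stripped.lstrip(">").strip())
        elif stripped and quote_lines:
            # Unquoted text after a quote is treated as the reply to it.
            reply_lines.append(stripped)
    if quote_lines and reply_lines:
        pairs.append((" ".join(quote_lines), " ".join(reply_lines)))
    return pairs


def quote_to_turns(parent: str, comment: str) -> List[List[str]]:
    """Return [parent excerpt, reply] exchanges for quotes found in the parent.

    Assumption: case-insensitive substring matching is used here only as a
    stand-in for a learned comment-to-section mapping.
    """
    exchanges: List[List[str]] = []
    parent_lower = parent.lower()
    for quote, reply in split_quoted_reply(comment):
        if quote and quote.lower() in parent_lower:
            exchanges.append([quote, reply])
    return exchanges


if __name__ == "__main__":
    parent_post = "I think GPT models are overrated. Also, data quality matters most."
    reply_comment = (
        "> data quality matters most\n\n"
        "Agreed, but quantity helps too.\n\n"
        "> GPT models are overrated\n\n"
        "Hard disagree."
    )
    for turn_pair in quote_to_turns(parent_post, reply_comment):
        print(turn_pair)
```

Each exchange aligns a reply with the specific part of the post it answers, so a single comment that quotes several passages yields several aligned pairs; chaining such pairs along a reply thread is one conceivable way to reach conversations longer than two turns.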