Date: 2022-02-25 / 4:00 ~ 5:00 PM
Location: MSC E306 (https://emory.zoom.us/j/99364825782)
A major roadblock to successful unsupervised dialogue models is the scarcity of high-quality multi-turn dialogue data. The highest-quality datasets, such as the Switchboard-1 Corpus, contain few samples, while the largest datasets, such as OpenSubtitles or the Project Gutenberg dataset, contain samples of lower quality. To date, most dialogue datasets involve little computational manipulation: data is simply collected and presented. New approaches open the door to more sophisticated computational methods of dataset creation. We take advantage of recent advances in Natural Language Processing, applying them to construct a large, high-quality, multi-turn conversational dialogue dataset. We take Reddit data and use a variety of metrics to construct each conversation turn by turn. We further enhance our conversations by exploiting the conversational structure of Reddit comment threads. Finally, we use a variety of metrics to filter the resulting conversations, focusing especially on conversational coherence.