Date: 2022-11-18 / 4:00 ~ 5:00 PM
Location: MSC E306 (https://emory.zoom.us/j/99364825782)
There is much concern in the field of dialogue systems over how to evaluate new dialogue approaches in a reliable and reproducible manner. Human evaluation is widely accepted as the standard for dialogue evaluation; however, there is still much debate over how human evaluations should be conducted. The perceived subjectivity, and resulting lack of reproducibility, of Likert-style evaluations has encouraged the adoption of alternative methods, although little experimentation has been done to quantify the magnitude of these limitations. In addition, the trade-offs between interactive and external evaluations of dialogue systems are not presently well understood. This work explores the division between interactive and external evaluations in an effort to better understand their suitability for the task of dialogue system evaluation. The results present implications for the reliability and comparability of Likert-style dialogue evaluations, and suggest recommendations on employing interactive versus external evaluations relative to specific evaluation goals.