PhD Dissertation 2017 - Tomasz Jurczyk

Improving Question Answering by Bridging Linguistic Structures and Statistical Learning

Tomasz Jurczyk


Abstract

Question answering (QA) has recently attracted considerable interest from both academic and industrial research. Whatever the question, search engine users expect machines to provide answers instantly, without having to search through relevant websites themselves. While a significant portion of these questions ask for concise, well-known facts, more complex questions also exist, and answering them robustly and accurately often requires dedicated approaches.

This thesis explores linguistically oriented approaches to factoid and non-factoid question answering and to cross-genre text applications. The contributions include new annotation schemes for question answering corpora, methods for extracting and matching linguistic structures, and an early exploration of applications to conversational dialog text.

For sentence-based factoid question answering, a multi-stage crowdsourcing annotation scheme is presented. Next, a subtree matching algorithm that aims to capture the semantic similarity between two sentences in open-domain text is introduced and combined with a neural network architecture. Then, various factoid question answering corpora are thoroughly analyzed and cross-tested to improve the performance of QA systems.

This thesis also explores two complex scenarios of non-factoid question answering. In the first, a semantic knowledge graph that is built on top of linguistic structures is presented and applied to arithmetic questions using verb polarity classification. In the second, a system that combines lexical, syntactic, and semantic text representations with statistical learning is presented and evaluated on event-based question answering.

The last part of this thesis focuses on the cross-genre aspect of text, in which the misalignment between dialog and formal writing is the main challenge. First, an approach that combines semantic structure extraction with statistical learning is presented and used to improve performance on the document retrieval task. Next, an exploration of the passage completion task is presented: a crowdsourcing annotation scheme is carried out and a new corpus is created. A multi-gram convolutional neural network with attention is then compared to several state-of-the-art approaches for reading comprehension.

Department / School

Computer Science and Informatics / Emory University

Degree / Year

PhD / Fall 2017

Committee

Jinho D. Choi, Computer Science and QTM, Emory University (Chair)
Eugene Agichtein, Computer Science, Emory University
Li Xiong, Computer Science, Emory University
Marsal Gavalda, Square Inc.
