Embodied tasks that require active perception are key to improving language grounding models and to building holistic social agents. In this talk we explore two multi-modal embodied perception tasks, each requiring an agent to localize itself or navigate in an unknown 3D space with limited information about the environment. First, we present the Where Are You? (WAY) dataset, which contains over 6k dialogs of two humans performing a localization task. On top of this dataset, we propose the task of Localization from Embodied Dialog (LED): given a natural language dialog between two agents -- an observer and a locator -- the goal is to predict the location of the observer agent. The second task we examine is Vision-and-Language Navigation (VLN), in which an agent navigates by following natural language instructions. For both tasks, we aim to improve model accuracy and demonstrate that this can be achieved using passive data, which introduces more semantically rich and diverse information during training than additional interaction data does. We additionally introduce a novel analysis pipeline for both tasks that diagnoses and reveals limitations and failure modes common to these types of multi-modal models.
Meera Hahn is a Research Scientist at Google Research working on multi-modal modeling of vision and natural language for applications in artificial intelligence. Her long-term research goal is to develop multi-modal systems capable of supporting robotic or AR assistants that can seamlessly interact with humans. She recently completed her PhD in Computer Science at the Georgia Institute of Technology, advised by Dr. James M. Rehg. Her research at Georgia Tech focused on training embodied agents (in simulation) to perform complex semantic grounding tasks.
Date: 2022-12-02 / 4:30 ~ 5:30 PM