Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents

Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi


Abstract

Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these, we propose a RAG-QA framework for internal enterprise use, consisting of: (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1–5 scale ratings from both human and LLM judges. The system was deployed internally for pilot testing and received positive feedback from employees.

Venue / Year

arXiv -> Under review at the ACM International Conference on Information and Knowledge Management (CIKM) / 2025

Links

Anthology | Paper | BibTeX | GitHub