Honors Thesis 2025 - Michelle Kim

Leveraging Large Language Models for Loneliness Detection and Analysis

Michelle Kim

Highest Honor in Computer Science


Abstract

This research investigates the application of Large Language Models (LLMs) to measuring and analyzing loneliness in caregiver and non-caregiver populations, with the goal of building diverse social media datasets to study loneliness across the two populations and better understand their experiences of loneliness.

First, this research applies GPT-4o, GPT-5-nano, and GPT-5 to evaluate and detect high-quality Reddit posts from 15 subreddits. We introduce an expert-developed framework to measure loneliness and an expert-informed cause-of-loneliness typology framework to identify and categorize causes of loneliness across populations. The complete data processing pipeline is validated with human annotation. For a given post, it judges relevance, measures the author’s loneliness, extracts and categorizes the author’s cause of loneliness, and extracts demographic information.

We find that LLMs can be successfully applied to measure loneliness via a psychologically grounded framework in the caregiver and non-caregiver populations, achieving 76.09% and 79.78% average accuracy, respectively. Additionally, we find that LLMs can effectively apply the cause-of-loneliness categorization framework to high-quality Reddit posts, achieving micro-F1 scores of 0.825 and 0.8 in the caregiver and non-caregiver populations, respectively. The distribution of cause categories differs strongly across the two populations, suggesting that our dataset and framework capture differences between them. In particular, the perceived causes of loneliness differ markedly, with caregivers’ loneliness predominantly originating from their role as caregivers, demonstrating that the loneliness experiences of the two populations are distinct. By applying these validated frameworks, we created a dataset of high-quality posts for both populations. Through demographic data extraction, we find that Reddit data is viable for building a diverse dataset across 6 demographic categories in the caregiver population. This work contributes to understanding caregiver and non-caregiver loneliness by establishing an LLM-based data processing pipeline for sourcing high-quality and diverse social media data and by demonstrating the successful application of LLMs to analyze differences in loneliness between the two populations.
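For readers unfamiliar with the micro-F1 metric reported above, the following is a minimal sketch of how it can be computed when each post may carry one or more cause categories. It is illustrative only and is not the thesis's evaluation code: the function name `micro_f1`, the assumption that labels are compared as sets per post, and the example category names are all hypothetical.

```python
from typing import Sequence, Set


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all posts and all categories, then compute a single F1 score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of posts")

    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        tp += len(gold_labels & pred_labels)   # categories correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not annotated
        fn += len(gold_labels - pred_labels)   # annotated but not predicted

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical category names, used only to illustrate the calculation.
gold = [{"caregiving role"}, {"social isolation", "bereavement"}, {"relocation"}]
pred = [{"caregiving role"}, {"social isolation"}, {"health"}]

score = micro_f1(gold, pred)  # tp=2, fp=1, fn=2, so F1 = 4/7
```

Because counts are pooled before the score is computed, frequent categories weigh more heavily in micro-F1 than rare ones, which is worth keeping in mind when cause distributions differ between populations.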

Department / School

Computer Science / Emory University

Degree / Year

BS / Fall 2025

Committee

Jinho D. Choi, Computer Science, Emory University (Chair)
Joyce C. Ho, Computer Science, Emory University
Jane Chung, School of Nursing, Emory University

Links

Anthology | Paper | Presentation

Jane Chung, Michelle Kim, Joyce Ho, Jinho D. Choi