This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology of loneliness causes, both designed for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively, while the cause categorization framework achieved micro-F1 scores of 0.825 and 0.80. We observe substantial differences in the distribution of loneliness causes between the two groups: caregivers’ loneliness is predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating that the two groups experience loneliness differently. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in loneliness.
EACL Workshop on Linguistic Analysis for Health (HeaLing) / 2026