Abstract
Objective:
Behavioral and social factors (BSFs) substantially influence the risk, onset, and progression of Alzheimer’s disease and related dementias (ADRD). A systematic representation of their interplay is essential for advancing prevention and targeted interventions. However, BSF-related knowledge is scattered across heterogeneous sources, limiting scalable evidence synthesis and computational analysis. To address this, we created a Behavioral Social Data and Knowledge Ontology for ADRD (BSO-AD) to represent and integrate BSFs with respect to ADRD.
Material and Methods:
BSO-AD was developed following established ontology design principles, prioritizing reuse of existing ontology elements to ensure semantic interoperability. It was built upon the Social Determinants of Health Ontology (SDoHO) and the Drug Repurposing-Oriented Alzheimer’s Disease Ontology (DROADO). BSF-related classes were enriched with ICD-10-CM Z55–Z65 codes and ADRD-related classes with AD-Onto. Relationships between BSFs and ADRD were derived through literature mining. Ontology quality was evaluated through Hootation-based expert review and an LLM-assisted framework assessing structural coverage and semantic coherence.
Results:
BSO-AD contains 2,275 classes, 153 object properties, and 49 data properties. Expert review demonstrated strong rational agreement (0.95), with disagreements resolved through discussion. LLM-based evaluation showed high category coverage rates (≥ 0.97) and robust semantic alignment with the relevant literature (average completeness = 0.78; conciseness = 0.81).
Discussion and Conclusion:
BSO-AD is, to our knowledge, the first ontology to systematically represent BSFs and hierarchically model their interrelationships in ADRD. It establishes a semantic backbone for computational analysis and knowledge integration. The LLM-assisted evaluation framework demonstrates the feasibility of scalable, automated ontology assessment.
Keywords: ontology, behavioral social factors, Alzheimer’s disease and related dementias, large language models
INTRODUCTION
Alzheimer’s disease and related dementias (ADRD) constitute a major and escalating public health crisis [1]. In 2025, an estimated 7.2 million Americans aged 65 and older, about 1 in 9 people in this age group, are living with Alzheimer’s dementia (AD). This number is projected to nearly double to 13.8 million by 2060 [2]. Along with increasing prevalence and mortality, ADRD imposes tremendous caregiving and financial burdens on patients, families, and healthcare systems [3]. Accumulating evidence demonstrates that behavioral and social factors (BSFs) play a pivotal role in shaping the risk, progression, and outcomes of ADRD across populations [4]. BSFs encompass behavioral and social determinants of health (e.g., physical activity, smoking, and social isolation) that influence neurocognitive resilience and aging trajectories [5]. Importantly, BSFs contribute to observed disparities in ADRD incidence, diagnosis, and care by reflecting differences in healthcare access and socioeconomic conditions [6]. Integrating BSF knowledge into ADRD prevention and intervention frameworks is critical for informing the development of targeted health policies and enhancing the quality of life for affected individuals and families [7,8].
However, systematic integration of BSF knowledge into ADRD research remains a major challenge due to the heterogeneity of data sources and variability in their representation. In structured data settings, BSF information, derived from surveys, electronic health records (EHRs), and mobile apps, is often encoded using inconsistent standards and value sets. Unstructured sources, including clinical narratives and scientific literature, contain rich contextual information but pose substantial challenges for extraction, normalization, and computational analysis. Moreover, integrating BSF and ADRD data, both structured and unstructured, requires cross-disciplinary alignment, as differences across domain-specific conceptual systems further impede effective aggregation. These challenges necessitate a robust, interoperable ontological infrastructure to harmonize heterogeneous data sources and enable standardized, AI-driven analysis of BSFs in ADRD.
Several AD-specific ontologies have been developed, each with distinct focuses, structures, and applications. For example, the Alzheimer’s Disease Ontology (ADO) provides a fine-grained hierarchy capturing biological and clinical dimensions [9] while the Drug Repurposing-Oriented Alzheimer’s Disease Ontology (DROADO) [10] integrates AD-related drugs, genes, targets, and pathways to support computational drug repurposing. Parallel efforts to standardize BSFs using ontologies have emerged. The Ontology of Medically Related Social Entities (OMRSE) captures health-related societal entities [11], while the Semantic Mining of Activity, Social, and Health data (SMASH) ontology models biomarkers, social activities, and physical activities associated with sustained weight loss [12]. The Social Determinants of Health Ontology (SDoHO) further defines core SDoH factors and their relationships within a structured, measurable framework [13]. Despite these advances, existing ontologies remain largely siloed, focusing either on the biological-clinical aspect of ADRD or on BSFs in isolation. This fragmentation represents a critical unmet need for an integrated ontology that systematically unifies the representation of BSFs and ADRD within a coherent, forward-looking framework.
To address this need, we developed the Behavioral Social Data and Knowledge Ontology for ADRD (BSO-AD) to harmonize the conceptual systems of these two domains by reusing and extending established ontologies and standards. To capture the multifaceted relationships between BSFs and ADRD, we modeled both direct and mechanistic associations. Direct associations were synthesized through literature review, whereas mechanistic associations were formalized via intermediate biological entities, including genes, pathways, and pathological processes, realizing multi-level, biologically grounded representations. BSO-AD was assessed through expert review of its semantics, using the ontology evaluation tool Hootation [14]. Furthermore, we developed a large language model (LLM)-assisted, domain-informed evaluation pipeline for automated ontology assessment. By bridging traditionally siloed behavioral-social, biological, and medical domains within a unified semantic framework, BSO-AD provides a robust knowledge infrastructure for interoperable data integration and scalable AI-driven applications, ultimately advancing evidence-based strategies for ADRD prevention and intervention.
The main contributions of this study are as follows:
To the best of our knowledge, BSO-AD represents the first ontology that systematically formalizes BSFs in the context of ADRD.
We designed a multi-layer relationship framework to capture both direct associations between BSFs and ADRD and indirect relationships mediated by underlying biological mechanisms and interactions.
We developed an LLM-assisted, data-driven framework to enable scalable, automated ontology evaluation.
METHODS
Ontology Development
BSO-AD was represented in the Web Ontology Language (OWL2) [15]. We used ROBOT [16] for modular assembly and template-based ontology format conversion, followed by manual curation in Protégé 5.6.4 [17].
Principles
The development of BSO-AD followed the best practices in biomedical ontology engineering, including the Open Biological and Biomedical Ontologies (OBO) Foundry principles [18], the ACCELERATE ontology development guidelines [19], and the FAIR principles (Findable, Accessible, Interoperable, and Reusable) [20]. Collectively, these frameworks emphasize openness, community collaboration, interoperability, and reusability. A central design principle of BSO-AD is ontology reuse with modular extension, prioritizing reuse and alignment with established ontologies and controlled vocabularies. This strategy enhances semantic interoperability and minimizes redundancy [21,22]. New classes and properties are introduced only when gaps are identified, and no suitable ones are available in established ontologies or standards.
Class Design
The construction of the BSO-AD class hierarchy followed a combination of ontology reuse [21,22], top-down [23], and bottom-up [23] approaches. The top-down approach begins with broad domain concepts that are progressively specialized, while the bottom-up approach identifies specific classes from data or literature and groups them into higher-level categories [23].
The BSO-AD was designed to represent knowledge of BSFs in the context of ADRD. Three existing ontologies were incorporated based on domain relevance and modeling scope: SDoHO [13], DROADO [10], and the Time Event Ontology (TEO) [24]. SDoHO was reused to represent the BSFs, as it provides a standardized and data-driven framework for modeling SDoH factors and their interrelationships [25]. DROADO [10] was integrated to capture ADRD-related biological knowledge, including molecular and pharmacological mechanisms. In addition, TEO [24] was included to formally represent the temporal dimension within the ADRD context.
Although SDoHO provides a comprehensive conceptual framework for modeling BSFs, it does not incorporate standard clinical coding systems such as ICD-10-CM, which are essential for interoperability with EHRs. To address this limitation, we incorporated SDoH-related ICD-10-CM Z codes (Z55–Z65) [26], which document patients’ health hazards related to socioeconomic, occupational, and psychosocial circumstances [27]. Following the bottom-up approach [23], these concepts were either merged with existing classes or created as new subclasses under appropriate branches of the BSF hierarchy. Notably, ICD includes residual categories containing terms such as “other” or “unspecified” (e.g., “Other problems related to education and literacy” and “Unemployment, unspecified”). To align with the principle of clear and unambiguous naming advocated by the OBO Foundry [18] and the ACCELERATE ontology development guidelines [19], we mapped these concepts to their parent classes with well-defined semantic meanings, using the annotation property skos:narrower. For example, “Z56.0 Unemployment, unspecified” was mapped to the ontology class “Unemployment” with the annotation “skos:narrower: Z56.0”, accompanied by sub-annotations including “dc:source: ICD-10-CM” and “Description: Unemployment, unspecified”. Details of these concepts and their hierarchical placement within the ontology (i.e., parent classes) are provided in Supplementary 1.
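The handling of residual ICD categories described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `OntologyClass` container, the `is_residual` and `attach_residual_code` helpers, and the dictionary layout of an annotation are hypothetical names introduced only for illustration. The annotation keys and the Z56.0 example follow the text.

```python
from dataclasses import dataclass, field

# Markers of ICD residual categories named in the text.
RESIDUAL_MARKERS = ("other", "unspecified")


@dataclass
class OntologyClass:
    """Hypothetical stand-in for an ontology class with its annotations."""
    label: str
    annotations: list = field(default_factory=list)


def is_residual(description: str) -> bool:
    """Return True if an ICD description contains a residual-category marker."""
    words = description.lower().replace(",", " ").split()
    return any(marker in words for marker in RESIDUAL_MARKERS)


def attach_residual_code(parent: OntologyClass, code: str, description: str) -> None:
    """Record a residual code on its parent class instead of creating a new class."""
    if not is_residual(description):
        raise ValueError(f"{code} is not a residual category")
    parent.annotations.append({
        "property": "skos:narrower",
        "value": code,
        # Sub-annotations, as listed in the text.
        "dc:source": "ICD-10-CM",
        "Description": description,
    })


unemployment = OntologyClass("Unemployment")
attach_residual_code(unemployment, "Z56.0", "Unemployment, unspecified")
```

In an OWL serialization, the sub-annotations would typically be expressed as annotations on the annotation assertion itself; the flat dictionary here only mirrors the information content.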
While DROADO captures molecular and pharmacological knowledge relevant to ADRD, it provides limited coverage of neuropsychological assessments, which are critical for ADRD diagnosis and monitoring. Therefore, we incorporated branches from AD-Onto [28], a computational ontology focusing on the neuropsychological tests derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data collection. Specifically, we reused the “Examination” and “StandardizedAssessmentItem” branches, which represent various clinical evaluation types (e.g., neuropsychological and vascular risk assessments) and cognitive assessment components (e.g., MMSE subitems such as orientation, recall, and registration). Additionally, following the bottom-up approach [23], ADRD-related concepts from ICD-9-CM and ICD-10-CM (Supplementary 2, Table S1) were integrated to expand coverage of ADRD concepts.
Property Design
Object property
In OWL ontologies, object properties describe binary semantic relationships between two individuals (or instances) [29], thereby encoding knowledge that extends beyond taxonomic hierarchies. We reused the necessary object properties from the source ontologies (SDoHO, DROADO, TEO, and AD-Onto) to preserve the important semantic relationships among the domain concepts.
Systematic identification and representation of the potential semantic relationships between BSFs and ADRD was a high priority in the design of object properties for BSO-AD. We conducted a targeted PubMed literature search (query detailed in Supplementary 2, Table S2) and manually reviewed the retrieved abstracts (n = 216) to summarize potential relationships. These relationships were subsequently aligned with relation types defined in the Unified Medical Language System (UMLS) Semantic Network [30] to ensure semantic standardization and interoperability. The UMLS Semantic Network provides a well-defined multi-level hierarchy of semantic relation types, structured under the top-level category “associatedWith”. Under “associatedWith”, the hierarchy is structured into broad semantic domains such as “functionallyRelatedTo” and “temporallyRelatedTo”, which are further divided into fine-grained subrelations (e.g., “affects”, “causes”, “limits”). This hierarchical organization served as a conceptual backbone for organizing the BSF-ADRD relationship hierarchy.
Furthermore, beyond the direct associations, a substantial body of research has reported biological mechanisms by which BSFs affect ADRD [31–33]. To uncover the mechanistic pathways and support multi-level knowledge representations, we incorporated three key biological entity types, i.e., genes, pathways, and pathologies, as intermediate nodes. Relationships among BSFs, these three intermediate node types, and ADRD were modeled based on evidence derived from the literature and refined through consultation with domain experts.
Annotation and Data Property
Annotation properties provide additional information for describing and labeling concepts in an ontology [34]. Several standardized IDs, such as UMLS Concept Unique Identifiers (CUIs), ICD-10-CM codes, and ICD-9-CM codes, were added as annotation properties of BSO-AD to facilitate semantic interoperability with external terminologies and clinical coding systems. In addition, annotation properties from source ontologies, including comments, definitions, and provenance information, were also reused.
Data properties link individuals to literal values (e.g., strings, numbers) [15]. For BSO-AD, we primarily reused the data properties from source ontologies, including SDoHO, DROADO, AD-Onto, and TEO.
Ontology Evaluation
Hootation-Based Expert Review
First, we evaluated the ontology semantics through expert review. The ontology was transformed into natural language sentences using the ontology evaluation tool Hootation [14]. Two domain experts independently assessed each sentence to determine whether the corresponding ontology assertion (i.e., every child class is a type of its parent class) was semantically valid. Two agreement metrics were calculated: (1) Inter-evaluator agreement, defined as the proportion of statements for which both evaluators assigned the same label; and (2) Rational agreement, defined as the proportion of statements jointly judged as rational (i.e., the number of statements labeled as rational by both evaluators divided by the total number of statements) [13]. Disagreements were resolved through discussion to achieve consensus.
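The two agreement metrics defined above reduce to simple proportions. The following Python sketch computes both from two reviewers' labels; the function name and the boolean encoding (True for "rational") are assumptions made for illustration.

```python
def agreement_metrics(labels_a, labels_b):
    """Return (inter_evaluator_agreement, rational_agreement).

    labels_a, labels_b: equal-length lists of booleans, one per statement,
    where True means the reviewer judged the statement rational.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Statements on which both reviewers assigned the same label.
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    # Statements both reviewers judged rational.
    both_rational = sum(a and b for a, b in zip(labels_a, labels_b))
    return same / n, both_rational / n


# Toy labels for four statements (illustrative only, not study data).
reviewer_1 = [True, True, False, True]
reviewer_2 = [True, False, False, True]
inter_agreement, rational_agreement = agreement_metrics(reviewer_1, reviewer_2)
```

Because every jointly rational statement is also a statement with matching labels, rational agreement can never exceed inter-evaluator agreement.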
LLM-Assisted, Data-Driven Framework for Automated Evaluation
To enable automated and scalable evaluation, we developed a domain-informed, LLM-assisted framework to assess BSO-AD. By integrating multiple open-source LLMs (Llama4 [35], Qwen3 [36], and MedGemma [37]) within a unified pipeline for systematic literature selection and entity extraction, our framework evaluated representational breadth and conceptual coherence within the BSF-ADRD domain, grounded in evidence derived from the literature (Figure 1).
Figure 1:
LLM-assisted, Data-Driven Framework for Automated Ontology Evaluation. (A) Construction of the targeted domain corpus; (B) LLM-assisted and ontology-guided concept extraction, resulting in the BSF category-level coverage metric; (C) LLM-annotated concept aggregation, resulting in embedding-based semantic coherence evaluation metrics: Completeness, Conciseness, Parent-Child Similarity Score (PSS), and Child Similarity Score (CSS).
Domain Literature Corpus Construction
To create a highly relevant and representative corpus for evaluation, we retrieved publications from PubMed focused on the intersection of ADRD and behavioral and social science research using a combination of Medical Subject Headings (MeSH) [38] terms and keyword-based queries (Supplementary 2, 2.1.1). Two general-purpose LLMs (Llama4 and Qwen3) were used to independently screen each article’s title and abstract against pre-defined inclusion criteria to maximize coverage of articles relevant to the ontology scope (Supplementary 2, 2.1.2). Only articles that both LLMs judged to meet the inclusion criteria were retained (Figure 1A).
Concept Extraction and Category Coverage Evaluation
To assess how effectively BSO-AD captures concepts from the literature, we designed a two-stage, ontology-guided framework for concept extraction, classification, and coverage evaluation (Figure 1B). In the first stage, three LLMs, including one medical-specific model (MedGemma) and two general-purpose models (Llama4 and Qwen3), were used to identify BSFs within and beyond the medical domain. These models independently reviewed the title and abstract of selected articles (from Step 1 Domain Literature Corpus Construction) to determine relevance across the seven BSF categories and to extract ADRD-related entities. In the second stage, each model was provided with definitions and few-shot examples from the corresponding BSF subcategories to verify relevance and assign a specific subclass to each extracted entity. Entities that could not be classified were labeled as “Undefined” (UNK). This process was applied independently across all BSF categories. The prompts for this process are provided in Supplementary 2, 2.2.
Category coverage rate was defined as the proportion of extracted entities that were successfully assigned or categorized to defined ontology classes (i.e., not labeled as Undefined), providing a measure of how comprehensively BSO-AD represents the conceptual breadth of the domain. This metric was computed based on entities identified and classified by MedGemma, given its domain specialization.
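$$\text{Category coverage rate} = \frac{\text{number of extracted entities assigned to a defined ontology class}}{\text{total number of extracted entities}} \tag{1}$$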
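The coverage rate defined above can be sketched in a few lines of Python. The function name, the list-of-labels input, and the example class names are hypothetical; the "UNK" label for unclassified entities follows the text.

```python
def category_coverage_rate(assigned_labels, undefined_label="UNK"):
    """Proportion of extracted entities assigned to a defined ontology class.

    assigned_labels: one label per extracted entity; entities that could not
    be classified carry `undefined_label`.
    """
    if not assigned_labels:
        raise ValueError("no extracted entities")
    classified = sum(1 for label in assigned_labels if label != undefined_label)
    return classified / len(assigned_labels)


# Toy example (illustrative labels, not study data): 3 of 4 entities classified.
labels = ["Diet", "Physical_Activity", "UNK", "Smoking"]
rate = category_coverage_rate(labels)
```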
Semantic Coherence Evaluation
To assess BSO-AD’s semantic adequacy beyond explicit class mappings, we adopted and extended the ontology evaluation framework proposed in [39] (Figure 1C). This framework evaluates two complementary dimensions: (1) Completeness and conciseness, quantifying the conceptual overlap between ontology classes and literature-derived domain concepts; and (2) Correctness and consistency, measuring the contextual similarity among hierarchically or relationally connected ontology classes. In contrast to the original framework, which relied on a pre-trained named entity recognition (NER) model for concept extraction, we used the LLM-based, ontology-guided extraction and classification framework described above. This modification enables broader and more flexible capture of domain concepts, particularly those spanning both clinical and socio-behavioral dimensions relevant to the ontology. All evaluations were performed independently across the seven BSF sub-categories using the domain-adapted embedding model ClinicalBERT [40].
To ensure comprehensive capture of domain concepts, the entity extraction and classification process was conducted by each of the three LLMs individually. Concepts identified as BSF-relevant by any LLM were aggregated and deduplicated using a FAISS-based [41] nearest-neighbor search with cosine similarity in the embedding space. Entity pairs with cosine similarity ≥ 0.90 were treated as semantically equivalent and merged. The resulting unique set represented the domain concept set. In addition, a shared concept was defined as an ontology concept that has at least one nearest domain concept (cosine similarity ≥ 0.75) in the embedding space, indicating meaningful semantic overlap despite lexical variations.
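As a rough illustration of the deduplication step, the Python sketch below merges concepts whose embeddings exceed the 0.90 cosine-similarity threshold. It makes several assumptions that are not taken from the study: a brute-force pairwise comparison replaces the FAISS index, toy two-dimensional vectors replace ClinicalBERT embeddings, and the greedy "keep the first concept seen" merge policy is one possible choice among several.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def deduplicate(concepts, embeddings, threshold=0.90):
    """Greedy deduplication: a concept is kept only if no previously kept
    concept has cosine similarity >= threshold with it."""
    kept = []
    for name, vec in zip(concepts, embeddings):
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]


# Toy vectors standing in for real embeddings (illustrative only).
names = ["social isolation", "socially isolated", "smoking"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
domain_concepts = deduplicate(names, vectors)
```

A brute-force loop is quadratic in the number of concepts, which is why an approximate or indexed nearest-neighbor search is the more practical choice at corpus scale.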
Completeness and Conciseness:
Completeness is the proportion of domain concepts captured by the ontology, whereas conciseness is the proportion of ontology classes that correspond to domain concepts. These two metrics are defined as follows [39]:
$$\text{Completeness} = \frac{|S|}{|D|}, \qquad \text{Conciseness} = \frac{|S|}{|O|} \tag{2}$$
where O denotes ontology concepts, D denotes domain concepts extracted from literature using LLMs, and S denotes shared concepts between them. Together, these measures assess both the breadth and the specificity of the ontology with respect to the domain.
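To make the two measures concrete, the Python sketch below identifies shared concepts with the 0.75 threshold and then computes both ratios. It assumes that completeness is |S|/|D| and conciseness is |S|/|O|, which is our reading of the verbal definitions above. The function names and toy three-dimensional vectors are hypothetical, and a brute-force search stands in for the embedding index.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shared_concepts(ontology, domain, threshold=0.75):
    """Ontology concepts whose nearest domain concept reaches the threshold.

    ontology, domain: dicts mapping concept name -> embedding vector.
    """
    shared = set()
    for o_name, o_vec in ontology.items():
        best = max((cosine(o_vec, d_vec) for d_vec in domain.values()), default=0.0)
        if best >= threshold:
            shared.add(o_name)
    return shared


def completeness_conciseness(ontology, domain, threshold=0.75):
    """Return (|S| / |D|, |S| / |O|) under the assumed form of the equation."""
    s = len(shared_concepts(ontology, domain, threshold))
    return s / len(domain), s / len(ontology)


# Toy embeddings (illustrative only, not study data).
ontology = {"Diet": [1, 0, 0], "Smoking": [0, 1, 0], "Housing": [0, 0, 1]}
domain = {
    "dietary pattern": [0.9, 0.1, 0.0],
    "tobacco use": [0.1, 0.9, 0.0],
    "healthy eating": [0.8, 0.2, 0.0],
    "loneliness": [0.5, 0.5, 0.1],
}
completeness, conciseness = completeness_conciseness(ontology, domain)
```

In this toy case "Housing" has no sufficiently close domain concept, so it lowers conciseness without affecting the count of domain concepts.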
Correctness and Consistency:
Semantic correctness and consistency were evaluated following the original formulation in [39]. Correctness was assessed through the Child Similarity Score (CSS) and Parent-Child Similarity Score (PSS), while internal hierarchical consistency was evaluated through the Parent-Child Difference Agreement (PDA). Detailed mathematical definitions of these metrics are presented in Supplementary 2, 2.3.
For ease of reference, all abbreviations are listed in Supplementary 2, Table S5.
RESULTS
Ontology
BSO-AD provides a comprehensive framework for representing and standardizing BSF knowledge in the context of ADRD, with a well-organized class hierarchy and clearly defined object, data, and annotation properties. The current version of BSO-AD comprises 2,275 classes, 153 object properties, 49 data properties, and 41 annotation properties with 3,151 logical axioms and 2,588 declaration axioms.
Classes
The current BSO-AD includes 16 root classes that represent BSFs and ADRD-related data and knowledge, such as “Element_Relevant_to_Behavioral_Social_Factor”, “Biological_Process”, and “Disease” (Figure 2).
Figure 2:
The Hierarchy of BSO-AD
Within “Element_Relevant_to_Behavioral_Social_Factor”, eight major branches represent the core BSFs, such as “Element_Relevant_to_Food” and “Element_Relevant_to_Neighborhood”, which were imported from SDoHO. In addition, a total of 142 relevant concepts were imported from ICD-10-CM codes Z55–Z65 (Supplementary 1) and subsequently mapped and merged into the behavioral-social hierarchies.
ADRD-related knowledge is represented by 14 branches, of which 12 were imported from DROADO, and two (“Examination” and “StandardizedAssessmentItem”) from AD-Onto. Furthermore, 21 ADRD-related concepts (Supplementary 2, Table S1) from ICD-9-CM and ICD-10-CM were included. Finally, four branches capture temporal dimensions, all of which were adopted from the TEO ontology.
Properties
The BSO-AD defines 153 object properties, 49 data properties, and 41 annotation properties, which together formalize the semantic relationships and descriptive characteristics within and across BSFs and ADRD domains.
We defined a multi-layer relational framework to capture both the direct and indirect associations of BSFs with ADRD. Specifically, 32 object properties were defined to represent direct associations and organized into a multi-level hierarchy grounded in the UMLS Semantic Network (Table 1). They are aligned with the UMLS top-level “associatedWith” category and further structured into two major semantic relationships: “functionallyRelatedTo” and “temporallyRelatedTo”. “functionallyRelatedTo” captures function-based associations and expands into the “affects” and “predicts” subrelations. “affects” is specialized into “affectsRiskOf”, “affectsDiseaseCourse”, and “affectsDiseaseBurden” to model the effects of BSFs on disease risk, progression, and disease-related burden. Notably, “affectsDiseaseBurden” is further differentiated across individual-level effects (e.g., “benefits” and “reducesBurdenFor”, indicating improved outcomes or reduced burden) and population-level effects (e.g., “hasHigherIncidenceOf” and “hasHigherPrevalenceOf”). “temporallyRelatedTo” describes time-dependent associations, such as “associatedWithShorterTime”, supporting the representation of temporal dynamics (e.g., earlier onset or delayed progression). To facilitate downstream natural language processing (NLP) tasks, each property is enriched with a set of lexical variants and synonyms (e.g., “increasesRiskOf”: “raisesRiskOf”, “escalatesRiskOf”) drawn from literature-derived expressions.
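The property hierarchy and lexical variants described above lend themselves to a simple lookup structure. The Python sketch below is illustrative only: the property names come from the text, but the placement of “increasesRiskOf” under “affectsRiskOf” and of “associatedWithShorterTime” directly under “temporallyRelatedTo” are assumptions, and the `PARENT`, `VARIANTS`, `normalize`, and `ancestors` names are hypothetical.

```python
# Child property -> parent property (a fragment of the hierarchy).
PARENT = {
    "functionallyRelatedTo": "associatedWith",
    "temporallyRelatedTo": "associatedWith",
    "affects": "functionallyRelatedTo",
    "predicts": "functionallyRelatedTo",
    "affectsRiskOf": "affects",
    "affectsDiseaseCourse": "affects",
    "affectsDiseaseBurden": "affects",
    # Assumed placements, not stated explicitly in the text:
    "associatedWithShorterTime": "temporallyRelatedTo",
    "increasesRiskOf": "affectsRiskOf",
}

# Lexical variant -> canonical property (examples given in the text).
VARIANTS = {
    "raisesRiskOf": "increasesRiskOf",
    "escalatesRiskOf": "increasesRiskOf",
}


def normalize(prop):
    """Map a lexical variant to its canonical property name."""
    return VARIANTS.get(prop, prop)


def ancestors(prop):
    """Return the chain of parent properties up to the root."""
    chain = []
    current = normalize(prop)
    while current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain


chain = ancestors("raisesRiskOf")
```

A lookup of this kind lets an NLP pipeline map a surface expression to a canonical property and then generalize it to any coarser relation in the hierarchy.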
Table 1.
Direct relationships between BSFs and ADRD
In parallel, we built indirect relationships to represent how BSFs may influence ADRD through intermediate biological nodes, including pathways, genes, and pathology (Figure 3, Supplementary 2 Table S3). For example, a BSF may modulate a gene, the gene may regulate a pathway, the pathway may perturb a pathology, and the pathology may be associatedWith ADRD. By modeling these multi-level mechanistic linkages, we intended to uncover the underlying mechanisms through which BSFs contribute to the development and progression of ADRD.
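The mechanistic chain in the example above can be viewed as a path through a set of subject-predicate-object triples. The following Python sketch enumerates such paths with a depth-limited search. The entity names (`Physical_Activity`, `GENE_X`, `PATHWAY_Y`, `PATHOLOGY_Z`) are placeholders rather than ontology content, and the `find_paths` helper is a hypothetical illustration, not part of BSO-AD.

```python
from collections import defaultdict


def find_paths(triples, start, goal, max_hops=4):
    """Enumerate simple paths from start to goal with at most max_hops edges.

    triples: iterable of (subject, predicate, object).
    Returns a list of paths, each a list of triples.
    """
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))

    paths = []

    def walk(node, path, visited):
        if node == goal:
            paths.append(path)
            return
        if len(path) >= max_hops:
            return
        for pred, nxt in graph.get(node, []):
            if nxt not in visited:  # avoid revisiting nodes (no cycles)
                walk(nxt, path + [(node, pred, nxt)], visited | {nxt})

    walk(start, [], {start})
    return paths


# Placeholder triples mirroring the BSF -> gene -> pathway -> pathology -> ADRD chain.
triples = [
    ("Physical_Activity", "modulates", "GENE_X"),
    ("GENE_X", "regulates", "PATHWAY_Y"),
    ("PATHWAY_Y", "perturbs", "PATHOLOGY_Z"),
    ("PATHOLOGY_Z", "associatedWith", "ADRD"),
]
paths = find_paths(triples, "Physical_Activity", "ADRD")
predicates = [pred for _, pred, _ in paths[0]]
```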
Figure 3.
Indirect relationships connecting BSF and ADRD through intermediate biological nodes
The annotation properties include standard metadata elements such as comments, definitions, and provenance information. In addition, interoperability is strengthened by adding annotation properties such as skos:exact, skos:narrower, dc:source to connect CUIs and ICD-10-CM codes, facilitating integration with external terminologies and standards.
Evaluation Results
Hootation-Based Expert Review
We evaluated the semantic quality of the ontology using the Hootation tool [14], with assessments performed independently by two domain experts. The rational agreement was 0.95, indicating that 95% of statements were judged as semantically valid by both experts. The inter-evaluator agreement was 0.96, reflecting a high level of consensus between the two reviewers. Disagreements were resolved through discussion, resulting in consensus across all classes. Identified issues were addressed during ontology refinement. For example, the statement “every Non-smoker is a type of Smoker” reflected a semantic conflict. Accordingly, we restructured “Non-smoker” and “Smoker” as sibling classes rather than as parent and child. Similarly, for the problematic statement “every Absence_of_Family_Member is a type of Family_Support”, we introduced a new parent class, “Problem_Related_to_Family_Support”, to accommodate “Absence_of_Family_Member”, and positioned both under the high-level class “Problem_Related_to_Social_and_Community_Context”.
LLM-based Automated Evaluation
As shown in Table 2, category coverage rates were uniformly high across all BSF domains (≥ 0.97), indicating that most literature-derived concepts were successfully classified into the ontology’s high-level classes. Detailed category-wise results are provided in Supplementary 2, 2.4. Completeness averaged 0.78, reflecting substantial alignment between ontology concepts and literature-derived domain concepts. Meanwhile, Conciseness averaged 0.81, suggesting that a large proportion of ontology classes correspond to semantically matched concepts from the literature.
Table 2.
LLM-based automated evaluation results
| Primary Category of BSFs | Category Coverage Rate | Completeness | Conciseness | CSS | PSS | PDA |
|---|---|---|---|---|---|---|
| Behavior and Lifestyle | 0.97 | 0.70 | 0.86 | 0.73 | 0.72 | 0.94 |
| Economic Stability | 1.00 | 0.84 | 0.83 | 0.73 | 0.75 | 0.94 |
| Education and Literacy | 1.00 | 0.91 | 0.96 | 0.73 | 0.70 | 0.94 |
| Food | 1.00 | 0.56 | 0.79 | 0.71 | 0.69 | 0.98 |
| Healthcare | 0.99 | 0.93 | 0.77 | 0.71 | 0.72 | 0.96 |
| Neighborhood | 1.00 | 0.72 | 0.65 | 0.80 | 0.78 | 0.96 |
| Social and Community | 1.00 | 0.79 | 0.88 | 0.72 | 0.74 | 0.93 |
| Overall (Average) | 0.99 | 0.78 | 0.81 | 0.73 | 0.73 | 0.95 |
Abbreviations: CSS: Child Similarity Score; PSS: Parent-Child Similarity Score; PDA: Parent-Child Difference Agreement.
Semantic similarity metrics showed stable and coherent hierarchical organization across BSF categories, with CSS ranging from 0.71 to 0.80 and PSS ranging from 0.69 to 0.78. PDA was uniformly high (≥ 0.93), indicating consistent parent-child similarity patterns within concept families. Specifically, the Neighborhood category exhibited the strongest performance (CSS = 0.80, PSS = 0.78), reflecting a well-defined and semantically cohesive sub-ontology. In contrast, the Food category showed the lowest correctness scores (CSS = 0.71, PSS = 0.69), suggesting that its subclasses are semantically heterogeneous but equally distant from the parent concept. This pattern points to potential areas for structural enrichment, such as introducing intermediate classes to better distinguish parent and child classes. To demonstrate how correctness metrics can inform refinement, we examined the Food category using a CSS-PSS quadrant plot (Figure 4). Within this category, the Diet family (CSS = 0.81, PSS = 0.63) falls in the second-quadrant region, where the children cluster tightly but the parent is overly broad, suggesting the need for an intermediate category. In contrast, Food Insecurity and Lack of Adequate Food show moderately high PSS but low CSS, indicating that these underspecified families require either additional subclasses or consolidation.
Figure 4.
Child Similarity Score (CSS) - Parent-Child Similarity Score (PSS) quadrant plot for Food sub-category
Note: The x- and y-axis thresholds are set to 0.7.
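The quadrant reading described above can be expressed as a small classification rule. The Python sketch below uses the 0.7 threshold from the figure note; the quadrant labels, the `quadrant` function name, and the choice to treat a score exactly at the threshold as "high" are assumptions for illustration. Only the two mixed quadrants have interpretations stated in the text.

```python
def quadrant(css, pss, threshold=0.7):
    """Classify a concept family by its CSS and PSS relative to a threshold."""
    high_css = css >= threshold
    high_pss = pss >= threshold
    if high_css and high_pss:
        return "high_css_high_pss"
    if high_css:
        # Children cluster tightly but the parent is broad:
        # the text suggests considering an intermediate category.
        return "high_css_low_pss"
    if high_pss:
        # Children are heterogeneous: the text suggests adding
        # subclasses or consolidating the family.
        return "low_css_high_pss"
    return "low_css_low_pss"


# The Diet family reported in the Results (CSS = 0.81, PSS = 0.63).
diet_quadrant = quadrant(0.81, 0.63)
```

Applying the same rule to every parent class in a category yields a ranked list of families to review first.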
DISCUSSION
BSO-AD is, to our knowledge, the first specialized ontology designed to standardize and harmonize ADRD-related BSF knowledge from heterogeneous sources. It provides a formally structured semantic representation of BSFs tailored for ADRD by integrating elements from existing ontologies, including SDoHO, DROADO, TEO, and AD-Onto. In addition, BSO-AD incorporates standardized SDoH codes (ICD-10-CM Z55–Z65) and ADRD-related concepts from ICD-9-CM and ICD-10-CM. By reusing and extending existing ontologies, BSO-AD promotes interoperability and improves applicability in downstream tasks such as EHR-based knowledge integration.
BSO-AD introduces a multi-level hierarchy of semantic relations (object properties) that represent both functional and temporal associations between BSFs and ADRD (e.g., “causes”, “increasesRiskOf”, and “associatedWithShorterTime”). In addition to these direct relationships, the ontology captures indirect associations through intermediate biological entities, enabling representation of underlying biological mechanisms. These fine-grained relations support modeling how BSFs influence ADRD onset, progression, and outcomes across both behavioral-social and biological dimensions, facilitating the integration of multiscale real-world evidence to inform medical decision-making.
The evaluation results highlight several notable characteristics of BSO-AD. First, at the overall ontology level, the high values observed for both LLM-based coverage and embedding-based completeness indicate that the ontology represents a stable set of concepts that recurs across both literature extractions and semantic similarity modeling. Despite the variability in terminology and granularity across the literature, concepts identified by LLMs consistently map to ontology classes in the embedding space. This alignment suggests that BSO-AD is robust to linguistic variation and captures the key constructs underlying BSF-ADRD research. Second, the correctness metrics (CSS and PSS) fall in a moderate range, reflecting the inherent structural properties of behavioral and social science concepts. Unlike domains with rigid hierarchical distinctions (e.g., Anatomy or Disease), BSF constructs often exhibit overlapping conceptual boundaries. Related classes, such as Social isolation and Lives alone, tend to share contextual usage while representing distinct aspects of a broader construct. The observed similarity values (child-child and parent-child) reflect meaningful differentiation without excessive divergence, indicating that the ontology captures the expected conceptual proximity. Collectively, these findings support the ontology’s ability to represent BSFs in a manner that is semantically coherent, structurally stable, and well-suited for downstream computational applications.
The evaluation results also inform directions for further ontology enhancement and enrichment. For instance, the Healthcare domain shows high completeness but low conciseness, indicating that some of its classes are underrepresented in the ADRD-focused socio-behavioral literature used for evaluation. This may reflect limitations in the scope of the literature query, or it may indicate that these concepts are better captured in alternative data sources such as EHRs. In addition, the quadrant plot (Figure 4) presented in the Results section highlights areas within the Food hierarchy that exhibit insufficient granularity or overly heterogeneous groupings. This analytic approach can be applied to systematically identify and prioritize targets for ontology refinement.
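To make the "high completeness, low conciseness" pattern concrete, consider the small Python sketch below. It rests on assumptions that are not confirmed by the text: completeness is taken as the share of literature concepts matched to at least one ontology class, conciseness as the share of ontology classes matched by at least one literature concept, and a match as a similarity score at or above a threshold. The class names, concept names, scores, and the 0.8 threshold are all hypothetical.

```python
def coverage_metrics(concepts, classes, scores, threshold=0.8):
    """Return (completeness, conciseness) under the assumed definitions.

    scores maps (literature concept, ontology class) -> similarity score.
    """
    matched_concepts = set()
    matched_classes = set()
    for (concept, cls), score in scores.items():
        if score >= threshold:
            matched_concepts.add(concept)
            matched_classes.add(cls)
    completeness = len(matched_concepts & set(concepts)) / len(concepts)
    conciseness = len(matched_classes & set(classes)) / len(classes)
    return completeness, conciseness


# Hypothetical Healthcare-like example: every literature concept finds a
# class, but half of the classes are never mentioned in the literature.
concepts = ["insurance coverage", "healthcare access"]
classes = ["Health insurance", "Access to care",
           "Medical transport", "Telehealth access"]
scores = {
    ("insurance coverage", "Health insurance"): 0.93,
    ("healthcare access", "Access to care"): 0.88,
    ("healthcare access", "Telehealth access"): 0.41,
}
completeness, conciseness = coverage_metrics(concepts, classes, scores)
```

Here both literature concepts are matched while only two of the four classes are, so completeness is high and conciseness is low. Classes left unmatched in this way are natural candidates for checking against other sources such as EHRs.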
Looking ahead, we plan to enrich BSO-AD through large-scale literature- and EHR-based data mining and by incorporating elements from additional ontologies and standards, such as the ICD-11 “Factors influencing health status” category. Compared with ICD-10, ICD-11 adopts a more interoperable approach, with enhanced capabilities for integrating with modern digital health systems and supporting global data exchange [42]. Incorporating ICD-11 SDoH codes will enhance the breadth and depth of the ontology, enabling more precise representation and harmonization of BSF data. In parallel, we will refine the LLM-driven evaluation pipeline by integrating human-in-the-loop review, developing human-annotated gold-standard corpora for key stages, and experimenting with advanced proprietary LLMs such as OpenAI’s GPT series and Google’s Gemini, to further improve its accuracy, effectiveness, and generalizability.
CONCLUSION
BSO-AD establishes a unified, semantically rich knowledge representation of BSFs and ADRD concepts and their interrelationships. By integrating standardized vocabularies, curated relationships, and an LLM-assisted evaluation pipeline, BSO-AD lays a robust foundation for scalable harmonization and analysis of BSFs within the context of ADRD, potentially supporting advanced computational modeling and knowledge discovery across interdisciplinary studies.
Supplementary Material
Supplementary 1.xlsx
Supplementary 2.doc
Supplementary 3.xlsx
FUNDING
This study was supported by the National Institute on Aging under U01AG088076, R01AG072799, and U24AG088019, and the National Institute of Mental Health under U24MH136069.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
BSO-AD will be made publicly available on GitHub at https://github.com/Tao-AI-group/BSO-AD after the paper is published.
CODE AVAILABILITY
The Python code for the ontology development and evaluation pipeline will be made publicly available on GitHub at https://github.com/Tao-AI-group/BSO-AD after the paper is published.
REFERENCES
- 1. Rebok GW, Gallo JJ, Thorpe RJ. Advancing Alzheimer’s Disease and Related Dementias Research: The Johns Hopkins Alzheimer’s Disease Resource Center for Minority Aging Research. J Aging Health. 2025;37:3S–8S. doi: 10.1177/08982643241308448
- 2. Rajan KB, Weuve J, Barnes LL, et al. Population estimate of people with clinical Alzheimer’s disease and mild cognitive impairment in the United States (2020–2060). Alzheimers Dement J Alzheimers Assoc. 2021;17:1966–75. doi: 10.1002/alz.12362
- 3. 2025 Alzheimer’s disease facts and figures. Alzheimers Dement. 2025;21:e70235. doi: 10.1002/alz.70235
- 4. Reducing the Impact of Dementia in America: A Decadal Survey of the Behavioral and Social Sciences. Washington, D.C.: National Academies Press; 2021.
- 5. Adkins-Jackson PB, George KM, Besser LM, et al. The structural and social determinants of Alzheimer’s disease related dementias. Alzheimers Dement J Alzheimers Assoc. 2023;19:3171–85. doi: 10.1002/alz.13027
- 6. Wong W. Economic burden of Alzheimer disease and managed care considerations. Am J Manag Care. 2020;26:S177–83. doi: 10.37765/ajmc.2020.88482
- 7. Stites SD, Midgett S, Mechanic-Hamilton D, et al. Establishing a Framework for Gathering Structural and Social Determinants of Health in Alzheimer’s Disease Research Centers. The Gerontologist. 2022;62:694–703. doi: 10.1093/geront/gnab182
- 8. Edwards GA III, Gamez N, Escobedo G Jr., et al. Modifiable Risk Factors for Alzheimer’s Disease. Front Aging Neurosci. 2019;11:146. doi: 10.3389/fnagi.2019.00146
- 9. Malhotra A, Younesi E, Gündel M, et al. ADO: a disease ontology representing the domain knowledge specific to Alzheimer’s disease. Alzheimers Dement J Alzheimers Assoc. 2014;10:238–46. doi: 10.1016/j.jalz.2013.02.009
- 10. Li F, Wang M, Pham HA, et al. Systematic Design of Drug Repurposing-Oriented Alzheimer’s Disease Ontology. 2019 IEEE International Conference on Healthcare Informatics (ICHI). 2019:1–5.
- 11. Hicks A, Hanna J, Welch D, et al. The ontology of medically related social entities: recent developments. J Biomed Semant. 2016;7:47. doi: 10.1186/s13326-016-0087-8
- 12. Phan N, Dou D, Wang H, et al. Ontology-based Deep Learning for Human Behavior Prediction with Explanations in Health Social Networks. Inf Sci. 2017;384:298–313. doi: 10.1016/j.ins.2016.08.038
- 13. Dang Y, Li F, Hu X, et al. Systematic design and data-driven evaluation of social determinants of health ontology (SDoHO). J Am Med Inform Assoc JAMIA. 2023;30:1465–73. doi: 10.1093/jamia/ocad096
- 14. Amith M, Manion FJ, Harris MR, et al. Expressing Biomedical Ontologies in Natural Language for Expert Evaluation. Stud Health Technol Inform. 2017;245:838–42.
- 15. OWL 2 Web Ontology Language Primer (Second Edition). https://www.w3.org/TR/owl2-primer/ (accessed 15 July 2025)
- 16. Jackson RC, Balhoff JP, Douglass E, et al. ROBOT: A Tool for Automating Ontology Workflows. BMC Bioinformatics. 2019;20. doi: 10.1186/s12859-019-3002-3
- 17. Musen MA. The protégé project: a look back and a look forward. AI Matters. 2015;1:4–12. doi: 10.1145/2757001.2757003
- 18. OBO Foundry. https://obofoundry.org/principles/fp-019-term-stability.html
- 19. Ontology development guideline_v2.docx. https://mctools-my.sharepoint.com/:w:/g/personal/yu_yue1_mayo_edu/EWyHooWTW4BIkrL_Caadhm4Bg4IFkZ3-nXjOwjFtajpxTQ?wdOrigin=TEAMS-MAGLEV.p2p_ns.rwc&wdExp=TEAMS-TREATMENT&wdhostclicktime=1759244886895&web=1 (accessed 14 October 2025)
- 20. Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18
- 21. Carriero VA, Daquino M, Gangemi A, et al. The Landscape of Ontology Reuse Approaches. Applications and Practices in Ontology Design, Extraction, and Reasoning. IOS Press; 2020:21–38.
- 22. Halper Michael, N S, Brochhausen Mathias, et al. Guidelines for the reuse of ontology content. Appl Ontol. Published Online First: 25 April 2023. doi: 10.3233/AO-230275
- 23. Noy NF, McGuinness DL. Ontology Development 101: A Guide to Creating Your First Ontology.
- 24. Li F, Du J, He Y, et al. Time event ontology (TEO): to support semantic representation and reasoning of complex temporal relations of clinical events. J Am Med Inform Assoc JAMIA. 2020;27:1046–56. doi: 10.1093/jamia/ocaa058
- 25. Dang Y, Li F, Hu X, et al. Systematic design and data-driven evaluation of social determinants of health ontology (SDoHO). J Am Med Inform Assoc JAMIA. 2023;30:1465–73. doi: 10.1093/jamia/ocad096
- 26. 2025 ICD-10-CM Codes Z55-Z65: Persons with potential health hazards related to socioeconomic and psychosocial circumstances. https://www.icd10data.com/ICD10CM/Codes/Z00-Z99/Z55-Z65
- 27. Guo Y, Chen Z, Xu K, et al. International Classification of Diseases, Tenth Revision, Clinical Modification social determinants of health codes are poorly used in electronic health records. Medicine (Baltimore). 2020;99:e23818. doi: 10.1097/MD.0000000000023818
- 28. Taglino F, Cumbo F, Antognoli G, et al. An ontology-based approach for modelling and querying Alzheimer’s disease data. BMC Med Inform Decis Mak. 2023;23:153. doi: 10.1186/s12911-023-02211-6
- 29. OWL class restrictions — Ontology101Tutorial 1.0 documentation. https://ontology101tutorial.readthedocs.io/en/latest/OWL_ClassRestrictions.html (accessed 2 October 2025)
- 30. UMLS Semantic Network. https://uts.nlm.nih.gov/uts/umls/semantic-network/root (accessed 2 October 2025)
- 31. Munteanu C, Popescu C, Vlădulescu-Trandafir A-I, et al. Alcohol-Induced Dysregulation of Hydrogen Sulfide Signaling in Alzheimer’s Disease-Narrative Mechanistic Synthesis Review. Int J Mol Sci. 2026;27:1595. doi: 10.3390/ijms27031595
- 32. Xie Y, Wang H, Zhang Y, et al. Cigarette smoke-induced PPAR signaling dysregulation accelerates Alzheimer’s disease pathogenesis and cognitive decline in 5xFAD mice. Food Chem Toxicol Int J Publ Br Ind Biol Res Assoc. 2025;203:115596. doi: 10.1016/j.fct.2025.115596
- 33. Xiong C, Lu R, Bui Q, et al. Person-specific digital measurements of air pollutant exposure and biomarkers of Alzheimer’s disease: Findings from a pilot study. J Alzheimers Dis JAD. 2025;107:778–88. doi: 10.1177/13872877251362667
- 34. Zhan C, Ren S, Zhang Y, et al. MIO: An ontology for annotating and integrating medical knowledge in myocardial infarction to enhance clinical decision making. Comput Biol Med. 2025;190:110107. doi: 10.1016/j.compbiomed.2025.110107
- 35. Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and Efficient Foundation Language Models. 2023.
- 36. Yang A, Li A, Yang B, et al. Qwen3 Technical Report. 2025.
- 37. Team G, Mesnard T, Hardin C, et al. Gemma: Open Models Based on Gemini Research and Technology. 2024.
- 38. Medical Subject Headings - Home Page. https://www.nlm.nih.gov/mesh/meshhome.html (accessed 17 March 2026)
- 39. Zaitoun A, Sagi T, Hose K. Automated Ontology Evaluation: Evaluating Coverage and Correctness using a Domain Corpus. Companion Proceedings of the ACM Web Conference 2023. Austin TX USA: ACM; 2023:1127–37.
- 40. Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. 2020.
- 41. Douze M, Guzhva A, Deng C, et al. The Faiss Library. IEEE Trans Big Data. 2025;1–17. doi: 10.1109/TBDATA.2025.3618474
- 42. International Classification of Diseases (ICD). https://www.who.int/standards/classifications/classification-of-diseases (accessed 9 December 2025)