Abstract
Background
Digital phenotyping provides passive monitoring of behavioural health but faces implementation challenges in translating complex multimodal data into actionable clinical insights. Digital navigators, healthcare staff who interpret patient data and relay findings to clinicians, provide a solution, but workforce limitations restrict scalability.
Objective
This study provides one of the first systematic evaluations of large language model performance in interpreting simulated psychiatric digital phenotyping data, establishing baseline accuracy metrics for this emerging application.
Methods
We evaluated GPT-4o and GPT-3.5-turbo across 153 test cases covering various clinical scenarios, timeframes and data quality levels using simulated test datasets currently employed in training human digital navigators. Performance was assessed on the models’ capacity to identify clinical patterns relative to human digital navigation experts.
Findings
GPT-4o demonstrated 52% accuracy (95% CI 46.5% to 57.6%) in identifying clinical patterns based on standard test cases, significantly outperforming GPT-3.5-turbo (12%, 95% CI 8.4% to 15.6%). When analysing GPT-4o’s performance across different scenarios, strongest results were observed for worsening depression (100%) and worsening anxiety (83%) patterns while weakest performance was seen for increased home time with improving symptoms (6%). Accuracy declined with decreasing data quality (69% for high-quality data vs 39% for low-quality data) and shorter timeframes (60% for 3-month data vs 43% for 3-week data).
Conclusions
GPT-4o’s 52% accuracy in zero-shot interpretation of psychiatric digital phenotyping data establishes a meaningful baseline, though performance gaps and occasional hallucinations confirm human oversight in digital navigation tasks remains essential. The significant performance variations across models, data quality levels and clinical scenarios highlight the need for careful implementation.
Clinical implications
Large language models could serve as assistive tools that augment human digital navigators, potentially addressing workforce limitations while maintaining necessary clinical oversight in psychiatric digital phenotyping applications.
Keywords: PSYCHIATRY, Machine Learning, Depression & mood disorders
WHAT IS ALREADY KNOWN ON THIS TOPIC
WHAT THIS STUDY ADDS
This study provides one of the first systematic evaluations of how large language models perform when interpreting psychiatric digital phenotyping data. The findings reveal both promising capabilities and important limitations that must be considered when developing technological supports for digital navigation workflows.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
Healthcare systems could implement hybrid human-AI approaches to digital navigation where algorithms scale the reach of limited human resources. This model would allow digital navigators to focus on complex cases and clinical judgement while automated systems assist with routine data processing, potentially enabling wider adoption of digital phenotyping in clinical practice.
Introduction
Psychiatry faces a measurement challenge. Traditional assessments and episodic clinical encounters cannot capture the dynamic, multifaceted nature of mental health conditions, leaving significant gaps in understanding patients’ daily experiences.1 To address this, the field is increasingly recognising the need for more comprehensive, frequent and objective measurements to supplement clinical judgement and improve diagnosis, treatment and early intervention.
Digital phenotyping offers a promising solution by collecting data from consumer digital devices to quantify behaviour and mental states in real time.2 Smartphones and wearables can collect both active inputs (self-report surveys, ecological momentary assessments (EMAs)) and passive data streams (Global Positioning System (GPS) location, accelerometer readings and device usage patterns).3 These data streams can then be used to generate clinically relevant features. In psychiatry, changes in mobility, phone usage or biobehavioural signals (eg, sleep, physical activity, social connectivity) may help identify early signs of depressive relapse or heightened anxiety.4 Integrating this longitudinal data into care has the potential to advance personalised psychiatry by improving symptom monitoring, enabling earlier interventions and tailoring treatments to individual needs.
Despite the promise of digital phenotyping, its clinical utility is dependent on how effectively raw data are transformed into clinical insights that fit within existing clinical workflows and can be shared with both clinicians and patients. The volume and complexity of multimodal data streams can overwhelm busy clinicians and patients alike, creating a significant barrier to implementation. Digital navigators have emerged as a new member of the healthcare team, serving a crucial intermediary role in improving the clinical relevance of digital phenotyping data by translating digital data streams into actionable insights.5 In mental healthcare, a digital navigator serves as a bridge between technology and patient care, addressing the ‘last mile’ implementation challenges of digital health tools in clinical practice.6 By summarising digital data into a concise format, digital navigators facilitate clinicians’ ability to incorporate that information into treatment planning and boost patient engagement and trust in digital health programmes, the latter of which have been historical challenges in this space.7
Implementing digital navigator programmes faces several challenges. One major barrier is the lack of established funding sources or insurance reimbursement mechanisms.8 Workforce shortages also present significant constraints, as individual navigators can support only a limited number of patients and clinicians. This limitation creates a ripple effect: existing navigators become stretched thin, many practices operate without adequate support, clinicians struggle to integrate digital tools effectively and patients ultimately experience delays in receiving timely assistance.
The advent of large language models (LLMs) provides one potential solution to this growing problem. LLMs like GPT,9 Claude,10 LLaMA11 and Gemini,12 with their vast encodings of medical and psychiatric knowledge, have already been explored as tools with the potential to aid diagnosis, suggest treatments and even engage in therapeutic dialogues.13 However, none of the existing works14–17 has investigated the use of LLMs on psychiatric digital phenotyping data and compared them with human expert assessments. Moreover, in most mental health contexts, there are significant trust and safety concerns that limit LLM use in practice.18 19 These models sometimes produce incorrect or fabricated information with great confidence, a phenomenon known as hallucination. Additionally, clinicians and regulators frequently cite the opaque reasoning of LLMs and the lack of accountability as barriers to regulated status (eg, software as a medical device; SaMD) and wider adoption.20 Existing works have also highlighted the limitations of LLMs on mental health assessment tasks using other forms of data modalities such as online text data.14 We seek to investigate whether such limitations are also present within psychiatric digital phenotyping data and further advance research on aligning LLM-based evaluation with human assessments.19 21 Instead of using LLMs to offer therapy or treatments, we propose the need to investigate how LLMs may augment clinician support and help interpret data such as the digital phenotyping streams that digital navigators currently handle.
LLM-assisted digital navigation offers a compelling, scalable solution to overcome current expansion limitations in digital navigator programmes. Rather than replacing human digital navigators, this approach aims to amplify their impact by automating routine data interpretation tasks. By handling these time-intensive aspects, LLMs would enable existing navigators to support more clinicians and patients, ultimately extending the reach of digital mental health tools while preserving the essential human element in care delivery.
Methods
Testing dataset
For our evaluation, we used a simulated test dataset established at Beth Israel Deaconess Medical Center (BIDMC; Boston, Massachusetts) for training and assessing human psychiatric digital navigators. This dataset consists of test scenarios designed to simulate specific clinical cases that digital navigators are expected to identify from digital phenotyping data. All test cases were entirely simulated and contained no human subject data.
The simulated datasets include two primary categories of digital phenotyping data: (1) active EMAs and (2) passive data collection aggregated into 24-hour buckets. Table 1 presents the complete feature set from these simulated test datasets.
Table 1. Features included in simulated digital phenotyping datasets.
| Feature category | Feature name | Description |
|---|---|---|
| Active assessments | PHQ-9 | Patient Health Questionnaire-9 for depression screening |
| | GAD-7 | Generalized Anxiety Disorder-7 for anxiety screening |
| | UCLA-Q | UCLA Loneliness Scale |
| | ISI | Insomnia Severity Index |
| | SDS | Sheehan Disability Scale for functional impairment |
| | ESA | Emotional Self-Awareness Scale |
| Passive sensing | Home time | Time spent at primary residence |
| | Screen time | Duration of smartphone usage |
| | Location entropy | Measure of movement variability |
| | Incoming call number | Number of incoming calls |
| | Outgoing call number | Number of outgoing calls |
| | Call degree | Number of unique call contacts |
| | Incoming text number | Number of incoming text messages |
| | Outgoing text number | Number of outgoing text messages |
| | Text degree | Number of unique text contacts |
| | Steps | Physical activity measurement |
| | Battery level | Device usage proxy |
| | Sleep duration | Hours of sleep per night |
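The 24-hour aggregation of passive streams can be illustrated with a minimal stdlib sketch; the feature values and timestamps below are invented for illustration and this is not the LAMP-cortex implementation.

```python
from collections import defaultdict
from datetime import datetime

def bucket_daily(samples):
    """Aggregate (timestamp, value) samples into per-day sums.

    `samples` is a list of (ISO-8601 timestamp string, numeric value)
    pairs for a single passive feature, e.g. step counts.
    """
    buckets = defaultdict(float)
    for ts, value in samples:
        day = datetime.fromisoformat(ts).date().isoformat()
        buckets[day] += value
    return dict(buckets)

# Hypothetical raw step-count samples spanning two days:
steps = [
    ("2025-03-24T09:00:00", 1200.0),
    ("2025-03-24T18:30:00", 3400.0),
    ("2025-03-25T08:15:00", 2500.0),
]
daily = bucket_daily(steps)
# daily == {"2025-03-24": 4600.0, "2025-03-25": 2500.0}
```

Count-style features (calls, texts) would be summed the same way, while features such as battery level or sleep duration would instead use a mean or last-observed value per bucket.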
Each test case in this dataset contains digital phenotyping data that follows the LAMP-cortex data standard, a dominant framework in psychiatric digital phenotyping research.22 The cases feature daily active and passive sensing data structured to simulate real-world implementation scenarios. These test cases were specifically designed to assess a digital navigator’s ability to recognise particular clinical patterns, listed in table 2, which they frequently need to identify and communicate to clinicians and patients in their workflows.
Table 2. Clinical scenarios simulated in digital phenotyping datasets.
| Category | Scenario | Description |
|---|---|---|
| General patterns | Worsening symptoms | Progressive increases in clinical scores and correlated passive feature changes over time |
| | Improving symptoms | Progressive decreases in clinical scores and correlated passive feature changes over time |
| Depression patterns | Worsening depression | Progressive increase in PHQ-9 scores over time |
| | Improving depression | Progressive decrease in PHQ-9 scores over time |
| Anxiety patterns | Worsening anxiety | Progressive increase in GAD-7 scores over time |
| | Improving anxiety | Progressive decrease in GAD-7 scores over time |
| Insomnia patterns | Worsening insomnia | Progressive increase in ISI scores over time |
| | Improving insomnia | Progressive decrease in ISI scores over time |
| Screen time correlations | Increased screen time, worse symptoms | Increased screen time with corresponding increases in GAD-7 and PHQ-9 scores |
| | Increased screen time, better symptoms | Increased screen time with corresponding decreases in GAD-7 and PHQ-9 scores |
| Physical activity correlations | Increased daily steps, better symptoms | Increased step count associated with decreased GAD-7 and PHQ-9 scores |
| | Increased daily steps, worse symptoms | Increased step count associated with increased GAD-7 and PHQ-9 scores |
| Sleep correlations | Increased sleep, better symptoms | Increased sleep duration associated with decreased GAD-7, PHQ-9 and ISI scores |
| Home time correlations | Increased home time, worse symptoms | Increased time at home associated with higher GAD-7 and PHQ-9 scores |
| | Increased home time, better symptoms | Increased time at home associated with lower GAD-7 and PHQ-9 scores |
| Location entropy correlations | Increased location entropy, worse symptoms | Increased location entropy associated with higher GAD-7 and PHQ-9 scores |
| Social connectivity | Increased social connectivity, better symptoms | Increased call and text degree associated with reduced UCLA-Q scores |
GAD-7, Generalized Anxiety Disorder-7; ISI, Insomnia Severity Index; PHQ-9, Patient Health Questionnaire-9; UCLA-Q, UCLA Loneliness Scale.
To evaluate the impact of data volume on interpretation capabilities, the test dataset includes cases with varying time periods, mirroring the range that digital navigators typically work with: short-term (3 weeks), medium-term (6 weeks) and long-term (3 months) datasets.
The test dataset also incorporates varying levels of data quality to assess robustness to real-world data collection challenges:
- High-quality data: at least 85% of rows contain passive data and survey non-adherence is rare
- Medium-quality data: between 60% and 70% of rows contain passive data and survey non-adherence is moderate
- Low-quality data: less than 50% of rows contain passive data and there is frequent random non-adherence to surveys
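The completeness thresholds above can be expressed as a rough classifier. Note that the text leaves the 50%–60% and 70%–85% ranges undefined, so the exact cut-offs used in this sketch for those gaps are assumptions made for illustration.

```python
def quality_tier(passive_row_fraction: float) -> str:
    """Map the fraction of rows containing passive data to a quality tier.

    Thresholds at 0.85 and 0.60 follow the dataset definitions; the
    labelling of fractions falling in the undefined gaps (0.50-0.60,
    0.70-0.85) is an assumption for illustration only.
    """
    if not 0.0 <= passive_row_fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if passive_row_fraction >= 0.85:
        return "high"
    if passive_row_fraction >= 0.60:
        return "medium"
    return "low"

# e.g. a case with passive data in 65% of rows:
print(quality_tier(0.65))  # medium
```

In practice survey non-adherence would be classified alongside passive completeness, making the tier a function of both dimensions.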
These quality variations in the test dataset reflect the common challenges of missing data encountered in digital phenotyping in clinical practice, where sensor data collection can fail for a period due to technical issues, battery depletion or device non-use.
Altogether, the testing dataset spans 17 clinical scenarios, 3 time periods and 3 data quality levels, resulting in 153 cases requiring evaluation. By utilising these established test questions, we ensure that our evaluation of GPT’s capabilities directly corresponds to the skills and pattern recognition abilities required of human digital navigators in clinical settings.
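The size of the test suite follows directly from crossing the three factors; a sketch of the enumeration (the scenario labels are abbreviated placeholders for the 17 scenarios in table 2):

```python
from itertools import product

# Placeholder labels for the 17 clinical scenarios defined in table 2.
scenarios = [f"scenario_{i:02d}" for i in range(1, 18)]
timeframes = ["3 weeks", "6 weeks", "3 months"]
qualities = ["high", "medium", "low"]

# Full factorial test matrix: every scenario at every timeframe and quality.
test_matrix = list(product(scenarios, timeframes, qualities))
assert len(test_matrix) == 17 * 3 * 3 == 153
```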
Assessment
To evaluate GPT’s capabilities in psychiatric digital phenotyping interpretation, we developed a validation framework targeting three core digital navigator competencies: pattern identification, anomaly detection and clinical summarisation. We tested GPT-4o (V.2025-03-27) and GPT-3.5-turbo (V.0125) via the OpenAI API against our simulated test dataset.
Our evaluation employed a multioutput prompting strategy with two distinct prompt scaffolds. The detailed prompt template included a specific persona, comprehensive data definitions explaining the digital phenotyping features and structured output requirements. In contrast, the simple prompt contained minimal instructions, requesting only a summary of key points relevant to clinicians with a structured output requirement.
For systematic evaluation, we implemented a dual assessment approach where two independent human experts (both digital navigation specialists with over a year of experience training and evaluating digital navigators at BIDMC) and an automated GPT-4o classifier independently scored model outputs against ground truth scenarios using a binary classification rubric that required identification of the specific correlation and trend from the source dataset. The reviewers were blinded to other reviewer assessments and evaluated cases in randomised order. We calculated Krippendorff’s alpha to measure inter-rater reliability between human and automated reviewers. Building on recent work showing model-based grading can match expert assessment,23 we utilised the automated GPT-4o classifier to evaluate over 600 unique outputs across our multidimensional test matrix, analysing performance across scenario types, timeframes, data quality levels, model types and prompt structures.
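For two raters with complete binary ratings, Krippendorff’s alpha reduces to a short computation over the coincidence matrix. The sketch below implements the standard nominal-data formula and is not code from the study.

```python
from collections import Counter

def krippendorff_alpha_binary(pairs):
    """Nominal Krippendorff's alpha for two raters, no missing data.

    `pairs` is a list of (rating_a, rating_b) tuples with values in {0, 1}.
    Builds the coincidence matrix o_ck from ordered within-unit pairs,
    then computes alpha = 1 - D_observed / D_expected.
    """
    o = Counter()
    for a, b in pairs:
        o[(a, b)] += 1
        o[(b, a)] += 1
    n = sum(o.values())  # 2 * number of units
    n_c = Counter()
    for (c, _k), count in o.items():
        n_c[c] += count
    d_obs = sum(count for (c, k), count in o.items() if c != k) / n
    d_exp = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp  # d_exp is 0 only if all ratings are identical

# Perfect agreement across both categories yields alpha = 1.0
print(krippendorff_alpha_binary([(1, 1), (0, 0), (1, 1), (0, 0)]))  # 1.0
```

Raw per cent agreement can overstate reliability when one label dominates; alpha corrects for the agreement expected by chance, which is why both statistics are reported in the results.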
Detailed implementation specifics and evaluation methodologies are available in online supplemental material A and prompts in online supplemental material B.
Results
Human and model review alignment
In our validation of the automated evaluation approach, we analysed 306 test cases where both human reviewers and GPT-4o assessed whether digital navigator outputs correctly identified the underlying scenario. The two human reviewers demonstrated 92.5% (283/306) raw agreement with a Krippendorff’s α of 0.85, indicating strong agreement and establishing reliable ground truth. GPT-4o’s automated assessments achieved 86.0% (263/306) agreement with reviewer 1 (Krippendorff’s α=0.72) and 83.7% (256/306) with reviewer 2 (Krippendorff’s α=0.67), with all three raters reaching consensus in 81.1% (248/306) of cases. When both human reviewers agreed on an assessment, GPT-4o aligned with this consensus in 87.7% of instances (248 of 283 consensus cases).
The automated evaluator identified correct scenario detection in 51.6% of cases, compared with 58.5% and 55.2% for the human reviewers. This slightly more conservative pattern from the automated system provides reassurance that our use of GPT for both generation and evaluation did not inflate performance estimates. The substantial agreement levels confirmed that the automated classifier provided a reliable method for assessing model performance across our extensive test suite. For detailed analysis of reviewer agreement patterns and scenario-specific performance, see online supplemental material D and online supplemental figures 1–4.
Systematic evaluation results
After establishing strong agreement between human expert scoring and our automated GPT-4o assessment system, we proceeded with a comprehensive analysis across all dimensions of our test matrix. Figure 1 summarises our systematic evaluation of model performance across key factors affecting digital phenotyping interpretation. Figure 1a presents overall scenario detection accuracy comparing GPT-4o and GPT-3.5-turbo across all test cases, where we observed a substantial performance gap between models. Figure 1b illustrates how data quality impacts each model’s performance, demonstrating the expected performance degradation as data becomes less complete. Figure 1c demonstrates the effect of different timeframes on detection accuracy, revealing improved pattern recognition with longer monitoring periods. Figure 1d compares performance between detailed and simple prompts for both models, highlighting the importance of robust prompt engineering in clinical applications. Finally, Figure 1e provides a comprehensive breakdown of scenario detection accuracy across all 17 clinical scenarios, revealing which patterns were most easily identified by each model and where performance challenges exist. These multidimensional analyses allow us to systematically evaluate the strengths and limitations of language models in identifying meaningful patterns in digital phenotyping data.
Figure 1. Scenario detection accuracy of GPT-4o and GPT-3.5-turbo across test cases, stratified by model, data quality, timeframe, prompt type, and clinical scenario. Results highlight GPT-4o’s four-fold improvement over GPT-3.5-turbo, with performance varying by data conditions and scenario type.
Sample case
For our sample case, we selected a commonly seen pattern in digital phenotyping data: worsening anxiety. We selected the 3-week time window and high data quality to facilitate legibility. In this case, the patient’s Generalized Anxiety Disorder 7-item scale (GAD-7) scores progressively increased from the mild range (4–6) to the moderate range (9–11) over 3 weeks, while other measures remained stable (full dataset available in online supplemental material C.1).
When presented with this data and the detailed prompt, GPT-4o identified the anxiety trend in its second observation, stating: ‘The GAD-7 scores started at a mild level (4-6) but increased to moderate anxiety levels (10-11) towards the end of the study. This pattern suggests a gradual increase in anxiety symptoms which might require monitoring. The spike beginning on March 24th could potentially be linked with external stressors or challenges’. In its recommendations, the model appropriately prioritised anxiety regulation as the primary intervention target, saying: ‘Given the increase in GAD-7 scores, consider exploring potential triggers for anxiety, such as recent life events or lifestyle changes, and implement stress-reducing strategies or interventions’ (complete model output is found in online supplemental material C.2).
Notably, we also observed evidence of hallucination in the model’s fourth observation for this case, where it incorrectly claimed ‘there’s a noticeable decrease in daily steps alongside relatively stable yet significant screen time’ despite no such pattern existing in the input data.
Discussion
Our mixed qualitative and systematic evaluation documents that frontier LLMs like GPT-4o have developed measurable capabilities for interpreting longitudinal psychiatric digital phenotyping data, representing a fourfold improvement over GPT-3.5-turbo and establishing viability for specific pattern detection tasks. Human digital navigators are required to correctly identify clinical patterns in test scenarios with 75% accuracy before being permitted to operate unsupervised. With an overall accuracy of 52% on the same test cases (compared with 12% for GPT-3.5-turbo), GPT-4o is approaching our existing competence threshold, though it has not yet reached human-level performance standards. Importantly, this performance was achieved in a fully zero-shot setting, with neither examples provided in the prompts nor domain-specific model fine-tuning, and unlike human test takers, the models were provided only the raw data and no accompanying visuals. However, our findings also reveal important limitations. An accuracy of 52% is still far from the level that would likely be required for an SaMD use case and remains below the expected proficiency for human digital navigator certification. We also observed occasional hallucinations, biases towards certain clinical measures and sensitivity to low data quality, details that indicate human oversight is necessary in any pilot clinical implementations.
This mirrors findings in other healthcare applications of LLMs, where hallucinations and biases represent a significant challenge to reliable implementation.14 24 Despite the limitations, our results echo previous findings that LLMs’ capacity for mental health assessment has improved across the years.15 25 These results suggest that rather than replacing human digital navigators, LLM-based tools could eventually serve as force multipliers that handle routine data processing when data quality is high, there is a detailed prompt, and data collection is substantially longitudinal, while allowing human navigators to focus on complex cases and final clinical interpretations. This is already a topic of active research by large technology companies including Apple and Google.26 27
Bias in clinical attention and interpretation
With our prompts, GPT-4o appeared to favour EMA and survey data in its summaries over engaging with the passive data, possibly reflecting biases in clinical practice that prioritise direct patient reporting over passive behavioural measures.28 This tendency to emphasise self-report over behavioural data may prevent LLMs from leveraging the full potential of digital phenotyping data, which derives much of its value from passive, continuous monitoring of behaviour.
GPT-4o struggled with interpreting non-zero-based scales, often misinterpreting UCLA-Q scores in the 20–30 range as elevated loneliness when this actually indicates low loneliness on the 20–80 bounded scale. Such misinterpretations could lead to incorrect clinical conclusions and underscore the need for domain-specific training, fine-tuning and explicit inclusion of scale information in prompts or few-shot examples.
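The misreading is easy to see once raw scores are placed on the instrument’s bounded range; a brief sketch (the UCLA-Q bounds of 20–80 are taken from the text, while the comparison with a wrongly assumed zero floor is illustrative):

```python
def scale_position(score: float, low: float, high: float) -> float:
    """Position of a raw score within a bounded scale: 0 = floor, 1 = ceiling."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    return (score - low) / (high - low)

# UCLA-Q is bounded at 20-80, so a raw score of 25 sits near the floor:
print(scale_position(25, 20, 80))  # ~0.083: low loneliness, not elevated
# If the floor is wrongly assumed to be 0, the same score appears much higher:
print(scale_position(25, 0, 80))   # ~0.31
```

Supplying the scale bounds explicitly in prompts, as this function requires them explicitly as arguments, is one simple mitigation.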
As shown in figure 1e, GPT-4o prioritised depression and anxiety as the core of what is ‘clinically relevant’, even when other survey-based or passive data-based trends might be more significant. This mirrors clinical practice where these measures often serve as primary mental health indicators.29 This bias is also present in our test cases, with PHQ-9 and GAD-7 being the main diagnostic surveys utilised, and most scenarios involving at least one of these two data streams.
GPT-4o consistently showed bias towards identifying worsening symptoms over improving ones despite symmetrical test dataset design. This vigilance bias is evident in detection accuracy disparities: worsening anxiety (83%) versus improving anxiety (56%), worsening depression (100%) versus improving depression (61%) and increased screen time with worse symptoms (44%) versus better symptoms (11%). The notable exception was increased daily steps scenarios, where the model better identified improvements (61%) than deteriorations (44%). This asymmetry suggests a risk-averse approach prioritising detection of clinical deterioration, with potential implications for real-world monitoring applications. The reversal of this pattern specifically for increased physical activity scenarios suggests the model may have internalised common clinical and cultural assumptions about exercise being beneficial for mental health, demonstrating how broader knowledge representations within these models can influence their pattern recognition in domain-specific tasks.
Impact of data quality and volume
Our systematic evaluation confirmed that data quality significantly impacts LLM digital navigation performance, mirroring what we observe in existing clinical analyses and workflows. Higher data quality and more complete datasets consistently yielded better scenario detection accuracy performance across models and scenarios, reinforcing the importance of robust data collection in digital phenotyping implementations. For GPT-4o, accuracy declined from 69% with high-quality data to 39% with low-quality data, a reduction of 30 percentage points. This finding suggests that investments in improving data collection adherence could yield significant returns in the quality of LLM-generated insights, in much the same way as other traditional analysis methods. Additionally, the higher performance of GPT-4o on low-quality data when compared with GPT-3.5-turbo on high-quality data indicates that model improvements may progressively decrease the impact of low data quality on summarisation results.
The performance improvement observed with longer timeframes demonstrates GPT-4o’s capability to effectively synthesise extended monitoring periods, which conversely tend to be more challenging for human digital navigators to review and consolidate. This suggests that the more advanced models benefit from having more longitudinal data to identify persistent patterns and distinguish them from short-term fluctuations. This finding supports the value of comprehensive longitudinal monitoring in digital phenotyping applications, as more extended data collection appears to enhance rather than hinder LLM-based pattern recognition. In contrast, GPT-3.5-turbo showed consistently low accuracy (12%–13%) across all timeframes, neither benefiting from nor being hindered by additional data. Several non-mutually exclusive factors may underlie this performance gap, including architectural and training-corpus differences between GPT-3.5-turbo and GPT-4o, context-length handling limits and a reduced ability to filter clinically relevant signal from noise in long prompts.
Prompt engineering considerations
Both GPT-4o and GPT-3.5-turbo showed better performance with detailed prompts compared with simple ones, with GPT-3.5-turbo demonstrating a more dramatic improvement (19% vs 5% accuracy) than GPT-4o (57% vs 46%). While the accuracy metrics for GPT-4o between prompt types appear relatively similar, qualitative review revealed substantial differences in output utility: detailed prompts produced more clinically relevant observations with stricter format adherence, explicit trend identification and clearer prioritisation of information. The detailed prompt specifically instructed consideration of temporal patterns, resulting in directional trend analysis rather than the vague ‘fluctuations’ often described with simple prompts. These structured outputs more closely resembled documentation from experienced digital navigators and would better facilitate reliable downstream processing in automated clinical workflows, demonstrating that evaluation metrics focused solely on pattern identification may miss important qualitative differences in clinical utility.
Limitations
Our study faces methodological constraints inherent to modelling clinical implementation through simulated data. While our test scenarios incorporated realistic missingness patterns and variability based on actual digital phenotyping deployments, simulated data inevitably simplifies the complex, idiosyncratic nature of real patient experiences.30 Random variations in our simulated test cases also created occasional unintentional patterns—a challenge that mirrors real-world interpretation but potentially complicates model evaluation.
These findings also reflect the capabilities of specific models (GPT-4o and GPT-3.5-turbo) at a particular developmental stage, evaluated using default API parameters without optimisation. The exclusive use of OpenAI models limits generalisability, as other foundation models might exhibit different behaviour patterns. We also did not compare LLM performance against traditional machine learning approaches or rule-based systems, which could potentially achieve comparable or superior performance for certain pattern recognition tasks while avoiding hallucination risks.
Our evaluation framework, while comprehensive, relied on a limited number of human evaluators for validation and did not include demographic fairness analysis, which is necessary before piloting clinical deployments. It also did not include a systematic quantitative assessment of hallucination rates or types. The use of closed-source commercial APIs also raises significant privacy concerns for real patient data, though our proof-of-concept used only simulated datasets. Additionally, our binary evaluation metric (correct/incorrect scenario identification) may not capture nuanced differences in clinical utility or the potential harm of different types of errors.
The study’s scope was deliberately constrained to establish baseline feasibility, leaving numerous optimisation opportunities unexplored. We did not investigate domain-specific fine-tuning, retrieval-augmented generation to ground outputs in validated examples, systematic hyperparameter optimisation or advanced prompting techniques that could substantially improve performance. These limitations reflect our proof-of-concept approach but highlight the gap between current capabilities and clinical readiness.
Future directions
This preliminary analysis establishes a foundation for multiple research directions that could advance LLM-assisted digital navigation toward clinical viability.
Visualisation generation for clinical decision support
Future research should explore whether LLMs can generate meaningful data visualisations to enhance digital navigator workflows. Automated visualisation generation could provide rapid visual summaries of complex longitudinal data, potentially serving as a safeguard against hallucinations where discrepancies between textual summaries and visual representations could be automatically flagged for human review. Systematic evaluation of LLM-generated visualisations for quality, clinical relevance and usability would determine their value in augmenting human interpretation of digital phenotyping patterns.
Contextual integration and demographic fairness
While our study focused on pattern recognition in isolated datasets, real-world digital phenotyping interpretation requires integration of contextual factors including demographics, life events, medication changes and comorbidities. Future work must evaluate how LLMs handle such contextual information, including their ability to recognise time-dependent causal relationships and lag effects that are crucial in longitudinal psychiatric care. Research must also examine whether model outputs exhibit systematic biases across age, gender, race, socioeconomic status and their intersections. Understanding these contextual dependencies is necessary to develop fair and clinically appropriate systems that can provide personalised insights while avoiding perpetuation of healthcare disparities.
Model architecture and optimisation strategies
Expanding evaluation beyond GPT models is critical for identifying optimal architectures for digital phenotyping tasks. Other proprietary models such as Anthropic’s Claude or Google’s Gemini may exhibit significantly different behaviour when presented with digital phenotyping data; reasoning-specialised models (eg, o3, o4) may better handle multistep temporal pattern analysis; and open-source alternatives (LLaMA, Mistral) could perform comparably while enabling privacy-preserving local deployment. Systematic hyperparameter optimisation, particularly temperature settings between 0.0 and 0.3 for clinical applications, could reduce hallucinations while maintaining pattern recognition capabilities. Future work should also develop standardised methods for quantifying hallucination rates, establishing clear criteria for different error types (fabrication vs misinterpretation) and evaluating their relative clinical risks. Advanced prompting strategies, including chain-of-thought reasoning, few-shot learning with validated examples and constitutional AI approaches, also warrant investigation.
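One simple standardised criterion for fabrication could be: any number quoted in a model summary that appears nowhere in the source series counts as fabricated (misinterpretation, by contrast, would require checking grounded numbers against the claimed pattern, as in trend checks). The Python sketch below illustrates that criterion only; the function name, tolerance and naive number extraction are hypothetical assumptions, not methods from this study.

```python
import re

def classify_numeric_claims(summary_text, source_values, tolerance=0.5):
    """Split the numbers an LLM quotes in a summary into 'grounded'
    (within tolerance of some source value) and 'fabricated'
    (matching nothing in the source). Number extraction is naive:
    non-negative integers and decimals only."""
    quoted = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", summary_text)]
    grounded, fabricated = [], []
    for q in quoted:
        if any(abs(q - v) <= tolerance for v in source_values):
            grounded.append(q)
        else:
            fabricated.append(q)
    return {
        "grounded": grounded,
        "fabricated": fabricated,
        "fabrication_rate": len(fabricated) / len(quoted) if quoted else 0.0,
    }
```

Running such a check over many summaries would yield a per-model fabrication rate, one ingredient of the standardised hallucination metrics proposed above.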
Hybrid system architectures
Future work should explore agentic architectures where LLMs orchestrate traditional analytical tools rather than directly performing pattern recognition on structured data. In such systems, LLMs could leverage their natural language understanding to interpret clinical questions, select appropriate statistical methods and synthesise results into coherent narratives while preserving their key advantage of flexible, open-ended data exploration by delegating numerical analysis to specialised tools. This approach would maintain the flexibility that makes LLMs valuable for reviewing diverse digital phenotyping data without predefined analysis pipelines, while potentially reducing hallucination risks and improving accuracy on quantitative tasks. The LLM’s role would shift from direct data interpretation to high-level supervision of incoming data and intelligent orchestration of validated analytical methods, combining the reliability of traditional approaches with the adaptability and natural language capabilities that make LLMs promising for digital navigation workflows.
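The division of labour described above can be sketched as a small tool registry: deterministic functions perform the arithmetic, and the LLM’s only job would be to produce a plan naming which tool to run on which data stream. In this self-contained illustration the plan is hard-coded where an LLM would supply it; all names are hypothetical.

```python
import statistics

# Deterministic "tools" that handle the numerical work the LLM would
# otherwise attempt (and potentially hallucinate) itself.
def weekly_mean(values, days_per_week=7):
    """Mean of each consecutive week of daily samples."""
    return [statistics.fmean(values[i:i + days_per_week])
            for i in range(0, len(values), days_per_week)]

def missingness(values):
    """Fraction of missing (None) samples, a basic data-quality check."""
    return sum(v is None for v in values) / len(values)

TOOLS = {"weekly_mean": weekly_mean, "missingness": missingness}

def run_plan(plan, data):
    """Execute a list of (tool_name, data_key) steps. In a full agentic
    system the plan would be generated by an LLM from a clinician's
    natural-language question; here it is supplied directly."""
    return {tool: TOOLS[tool](data[key]) for tool, key in plan}
```

The LLM would then narrate `run_plan`’s numerically reliable output rather than compute the statistics itself, which is the hallucination-reduction argument made above.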
Clinical validation and integration
Only after fundamental safety and technical concerns, including demographic bias evaluation, fairness testing and the optimisation strategies outlined above, have been addressed should research progress to clinical validation. Real-world evaluation must then occur in two phases. First, validation on real-world digital phenotyping datasets is essential to address complexities absent from simulated data, including irregular sampling intervals, device changes, seasonal effects and the full spectrum of data quality issues encountered in practice. Privacy-preserving evaluation methods, such as federated learning or differential privacy techniques, could enable such validation while protecting patient confidentiality. Second, prospective clinical integration studies should assess how LLM-assisted digital navigation performs in actual workflows. Key questions include: How do clinicians perceive and trust AI-generated summaries? Does LLM assistance improve digital navigator efficiency or patient outcomes? What safeguards and review processes optimise human-AI collaboration? These studies must also address regulatory pathways, liability considerations and integration with existing electronic health records.
Conclusion
In this first systematic evaluation of LLM performance on this task, GPT-4o achieved 52% accuracy in identifying clinical patterns in simulated psychiatric digital phenotyping cases, as measured against human expert assessments. While this approaches the 75% threshold for human digital navigator certification, it remains below the standard required for unsupervised practice, particularly given observed hallucinations, biases towards certain patterns and struggles with non-zero-based scales. The substantial performance gap between GPT-4o and GPT-3.5-turbo (12% accuracy) suggests that advances in model architectures may yield significant improvements, with next-generation models potentially reaching the performance levels needed to effectively augment digital navigation workflows. Performance variations across data quality levels, prompt types and clinical scenarios indicate that successful implementation would require careful attention to these factors. This benchmark data will enable tracking of LLM progress towards clinical viability in digital navigation workflows, with the fourfold improvement from GPT-3.5 to GPT-4o suggesting that rapid advancement may continue. As digital phenotyping generates increasingly complex datasets, this proof-of-concept study provides an empirical foundation for understanding both the possibilities and current limitations of human-AI collaboration in behavioural health monitoring.
Supplementary material
Footnotes
Funding: This research was supported by a grant from the Shifting Gears Foundation. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Provenance and peer review: Not commissioned; externally peer-reviewed.
Patient consent for publication: Not applicable.
Ethics approval: Not applicable.
Data availability statement
No data are available.