Abstract
Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
Subject terms: Health care, Medical research
Wearable devices generate vast streams of health data, but making sense of these measurements requires complex numerical reasoning beyond the reach of conventional language models. This study introduces a large language model agent that interprets wearable data to deliver accurate, personalized health insights.
Introduction
Personal health data, often derived from personal devices such as wearables, are distinguished by their multi-dimensional, continuous, and longitudinal measurements that capture granular observations of physiology and behavior in situ rather than in a clinical setting. Research studies have highlighted the significant health impacts of physical activity and sleep patterns, emphasizing the potential for wearable-derived data to reveal personalized health insights and promote positive behavior changes1–6. For example, individuals with a device-measured Physical Activity Energy Expenditure (PAEE) that is 5 kJ/kg/day higher had a 37% lower premature mortality risk2, and a large meta-analysis suggests that activity trackers improve physical activity and promote weight loss, with users taking 1800 extra steps per day7,8.
Prior work has focused on understanding the needs of wearable users and facilitating data exploration through conventional means. Studies deploying on-device apps found that users are interested in questions that analyze trends, compare values, and provide coaching, but that current systems do not adequately address this curiosity9–11. While some research has explored using visualization to help users interpret their data12–15, the advent of Large Language Models (LLMs) presents a new paradigm for interactive analysis. LLMs have demonstrated a growing capacity for complex reasoning and have been applied to tasks across the health domain, including medical question-answering16–19, medical education20,21, electronic health record analysis22–24, mental health interventions25–28, interpretation of medical images and assessments17,29, and generating diagnoses30,31.
However, despite their broad capabilities, applying LLMs to granular personal health data remains a significant challenge. Current LLMs frequently struggle with the numerical reasoning required for time-series analysis, and previous efforts have consequently relied on pre-aggregated, expert-defined statistical summaries rather than enabling direct, nuanced analysis of raw data. For instance, some models deliver fitness and sleep coaching by reasoning over 30-day summaries32 or perform classification tasks on daily metrics presented as text33, but are not designed for open-ended, numerically precise queries on high-resolution data. The development of LLM-based agents, which can be augmented with software tools like code interpreters and information retrieval systems34–38, offers a promising path forward. While agents have shown effectiveness for exploring objective tabular data in domains like electronic medical records23,39–44, they have not yet been adapted to handle the complex, open-ended, and personalized nature of wearable health queries, which often require substantial domain knowledge to translate data into actionable insights.
This gap is significant because even a seemingly straightforward user question, such as “Do I get better sleep after exercising?”, involves a series of complex analytical steps. Answering it properly requires checking data availability, selecting appropriate metrics, summarizing sleep quality on active days, contextualizing these findings within the individual’s broader health profile, and integrating knowledge of population norms to offer tailored recommendations. These steps require a combination of robust numerical analysis and an interpretive understanding of health, a capability that current systems lack. A system capable of fulfilling these steps can produce personal health insights, which we define as the outputs generated when an LLM system analyzes a user’s wearable time-series data in response to their health-related queries. This definition underscores that our focus, and the challenges we tackle, are rooted in multimodal wearable data streams.
In this work, we introduce the Personal Health Insights Agent (PHIA), the first open-ended question-answering system powered by an LLM-based agent designed specifically for nuanced reasoning over personal wearable data. PHIA leverages a state-of-the-art agent framework that combines multi-step iterative reasoning, code generation for direct data analysis, and web search integration to autonomously perform complex analyses and generate accurate, context-aware responses to thousands of diverse health queries. We demonstrate the superiority of this agentic approach through a 650-h human evaluation of more than 6000 model responses with 19 human annotators and an automatic evaluation of 16,000 model responses, showing significant improvements in open-ended reasoning about wearable health data compared to non-agentic models. To facilitate future research, we also release a high-fidelity synthetic wearable dataset sampled from high-volume, anonymized production data and a personal health insights evaluation dataset comprising over 4000 closed- and open-ended questions across multiple domains for both automated and human evaluation.
Results
Is a multi-step agentic framework really necessary to derive personal health insights from wearable data, or can simpler methods provide adequate results? To evaluate the necessity of the framework and its tools (i.e., code generation, web search), we constructed four language model baselines against which to compare PHIA’s performance, as illustrated in Figs. 1–4. An example of responses from our baselines alongside PHIA can be found in Fig. 5.
Fig. 1. Automatic and human evaluation.
a PHIA achieves a higher numerical accuracy on objective personal health queries compared to baseline models. Each bar represents the mean accuracy across n = 4000 queries evaluated on 4 randomly selected synthetic user profiles. Error bars indicate 95% bootstrapped confidence intervals. Accuracy is defined as an exact match to within two digits of precision. b, c On open-ended reasoning and code quality, data are presented as mean Likert scores (0–100). For both panels, n = 172 unique queries were evaluated, with each response assessed by three independent annotators. Error bars represent the 95% confidence interval of the mean. In (b), PHIA shows a significant advantage over the Code Generation baseline in all ratings except for personalization, and in (c), PHIA outperforms the Code Generation baseline in all ratings except column usage, time usage, and interpretation. (*) designates statistical significance (p < 0.05) based on a two-sided Wilcoxon signed-rank test. Exact P-values and test statistics are provided in Supplementary Table 8.
Fig. 4. Error and recovery rates.
The Error Rate (fraction of responses that include at least one code error) and Recovery Rate (fraction of responses where an agent recovers from an initial error) were calculated from an analysis of n = 172 unique queries (independent trials). The error rate for the Code Generation baseline is higher, while PHIA demonstrates a non-zero recovery rate. Points indicate the mean calculated rates, with error bars representing 95% bootstrapped confidence intervals.
Fig. 5. Baseline comparison.
Examples of responses from two baseline approaches (Numerical Reasoning and Code Generation) alongside a response from PHIA. PHIA is capable of searching for relevant knowledge, generating code, and performing iterative reasoning to arrive at an accurate and comprehensive answer.
PHIA achieves superior numerical correctness on objective queries
In Fig. 1a, we present the evaluation results for objective personal health queries. PHIA achieves an exact match accuracy of 84%, significantly outperforming the Code Generation baseline (74% accuracy) and the Numerical Reasoning baseline (22% accuracy). We also evaluated a custom chain-of-thought prompting strategy designed for interpreting time-series wearable data with GPT-433 (53.6% accuracy). We observe that the PH-LLM model32 is unable to answer any of our objective queries due to its limitations in handling detailed, long-context tabular data inputs after being fine-tuned exclusively on aggregated coaching case study data. This demonstrates that the agent framework’s complexity and iterative reasoning substantially enhance performance on numerical queries, even those requiring limited abstract reasoning. The results from the internal model reasoning approaches further emphasize that text-only reasoning is inadequate for precise numerical manipulations on personal health data, likely due to inherent limitations in current LLMs’ mathematical and tabular reasoning capabilities. Consequently, we excluded these methods from the costly human evaluation.
Human evaluation shows PHIA generates higher-quality reasoning and code
Overall, PHIA demonstrates a significant improvement in reasoning over the Code Generation baseline in all but two dimensions (Fig. 1b). Most notably, overall reasoning was substantially higher for PHIA than for Code Generation (68 versus 52 in scaled Likert rating). Annotators rated 83% of PHIA’s responses as “Fair” (“3” on the Likert scale, Supplementary Tables 1 and 2) or better. In Supplementary Fig. 1, we show that PHIA is also twice as likely to generate “Excellent” responses. Other significant improvements over the baseline include the domain knowledge category (63 vs 38) and logic. To better understand where PHIA’s increased performance comes from, we examined per-category differences in Fig. 2 and found that general knowledge and compare-to-cohort queries show the largest gaps. This performance difference is likely attributable to PHIA’s ability to query web search for external information and to iteratively and interactively reason over its internal parametric knowledge through Thought steps. For example, in Fig. 17, PHIA uses its web search function to supply information about a balanced workout routine. For “Personal Min/Max/Avg.” questions, which involve aggregations well within the capabilities of the Code Generation baseline, the improvement was effectively zero. Examples of low-scoring and high-scoring PHIA outputs are available in the supplemental materials (Supplementary Figs. 2 and 3, respectively).
Fig. 2. PHIA enhances reasoning across query types.
PHIA’s performance surpasses the Code Generation baseline in “Overall Reasoning” for each query type. For a detailed description of each query category, see the “Methods” section and Table 1. Data points represent the mean difference in scores between PHIA and the baseline, and error bars indicate 95% bootstrapped confidence intervals. The analysis is based on n = 172 unique queries, with each model response evaluated by three independent annotators. The number of queries (n) for each category, from top to bottom, was: 8 (Compare vs. Population), 11 (Summary), 9 (Compare Time Periods), 35 (General Knowledge), 40 (Correlation), 7 (Anomaly), 14 (Trend), 30 (Problematic), and 18 (Personal Min/Max/Avg.).
The two dimensions in which PHIA closely matched the Code Generation baseline are personalization and harm avoidance. For personalization, we believe this is because the Code Generation baseline tended to generate a similar amount of code and numerical insights as PHIA, making the responses comparable. The raters perceived the numerical insights generated through code as a form of personalization; since both the Code Generation baseline and PHIA can generate code, their personalization appeared very similar to the raters. This hypothesis is also supported by our analysis of qualitative interviews. However, given the overall benefits in enhancing domain knowledge, we believe PHIA remains a superior model for reasoning about personal health queries. Additionally, we observe that ratings for harm avoidance are exceptionally high. The saturated ratings indicate that a combination of the underlying model’s guardrails and PHIA’s iterative thought process effectively prevents harmful responses, with over 99% of responses rated as harmless. Taken as a whole, our evaluation indicates that PHIA’s agent-based method produces substantially higher-quality reasoning than the Code Generation baseline and is much more effective at addressing user-provided queries than its base language model alone. Inter-rater agreement was considerable, with results summarized in Supplementary Table 5. To understand the role of web search specifically, we ablate the feature and study it in Supplementary Fig. 4.
The results from expert evaluation indicate that PHIA improved over the Code Generation baseline in overall code quality, avoiding hallucinations, and personalization (Fig. 1c). Although the difference in performance on the other perceived code quality metrics was not statistically significant, we demonstrate that PHIA is quantitatively less likely to generate code that raises an error. In Fig. 3, we found that the error rate of PHIA is half that of the Code Generation baseline (0.192 vs 0.395). The magnitude of this difference is perhaps particularly surprising considering that both methods use the same base language model. This implies that PHIA’s ability to strategically plan at the first Thought step and iteratively reason about its outputs through the remaining Thought steps minimizes error-prone code generation.
Fig. 3. Code error category analysis.
PHIA’s agentic framework substantially reduces code generation errors, with an overall error rate less than half that of the Code Generation baseline (0.192 vs 0.395). A breakdown of all errors from the open-ended query dataset (n = 172) shows improvements across all categories. PHIA is particularly effective at reducing hallucinations, data misinterpretations, and errors in data manipulation with Pandas (a popular Python library for data analysis). Error categories were determined through an open coding evaluation by two domain experts.
Agentic framework reduces errors and enables recovery
Another notable advantage of using an agent framework in health data analysis is that PHIA can occasionally recover after it throws a fatal error by interpreting its mistake and correcting it in a subsequent step. PHIA recovered in 11.4% of cases (Fig. 4). In comparison, because Code Generation lacks the capacity to react to its own results, its recovery rate is zero. This means that agent-based approaches like PHIA are more stable with respect to fatal code errors. Our results in Fig. 3 show that PHIA is much less likely to make errors on complex tabular reasoning operations such as time-series indexing and joining multiple tables. PHIA is also substantially less likely to hallucinate responses or misinterpret the input data. This indicates that the additional complexity of the agent framework produces significantly more reliable results that end users can better trust.
Qualitative analysis reveals raters prioritize personalization and domain knowledge
To better understand the rating process and provide insight into the nuances of evaluating model responses in health and fitness, we conducted qualitative interviews with two annotators and two experts. Several key themes emerged from these discussions:
All annotators agreed that the presence of numerical insights and metrics made them give higher ratings on personalization: “As long as there are numerical insights, that would be a ‘Yes’ on personalization” [Rater 2]. “I remember another example like how do I lose weight? And it gives a generic answer for getting active ... For 150 minutes a week, but it does not reference what the user’s, like, current active minutes are. And I feel like that’s a missed opportunity. It could say if they’re only active for 10 minutes a week. That’s a clear personalization that could help. But it doesn’t really reference that. And so that was like a no.” [Rater 3]. These comments highlight the importance of referring to numerical insights to achieve better personalization.
Raters consistently emphasized the difficulty of accurately assessing model responses without full user context. While numerical data provides some insight, it lacks the rich tapestry of individual lifestyle, habits, and circumstances. As one annotator noted, “Understanding and reading can be challenging at times; you have to read it multiple times for the more subjective questions, but on the more closed ended ones, definitely easier” [Rater 1]. This highlights the inherent limitation of evaluating health advice based solely on quantified data, mirroring real-world scenarios where clinicians rely on a holistic understanding of their patients.
The inclusion of relevant and authoritative domain knowledge consistently elevated the perceived quality of model responses. Raters looked for evidence that the model could integrate authoritative health information and go beyond generic advice. “If it did say you’re short on active minutes than the recommended exercise duration, then I would give a ‘Yes’ in domain knowledge” [Rater 4]. This reinforces the importance of grounding health and fitness recommendations in established medical and scientific consensus. The annotators also commented that the model’s ability to connect insights to domain knowledge proved a key differentiator. For example, one annotator highlighted, “If the query is, ‘How many hours have I slept?’ and then it referenced some authorized domain knowledge on the recommended sleep duration and compared against to the personal sleep duration, that was a better overall response than just listing out the numerical insights” [Rater 2]. This suggests the importance of going beyond simply presenting data; models must demonstrate understanding of the user’s unique situation and interpret it in the context of relevant domain knowledge in order to tailor responses accordingly.
Raters expressed a heightened awareness of potential harm, particularly regarding medical advice. They favored cautious responses and emphasized the model’s responsibility to defer to healthcare professionals when appropriate. As one annotator explained, “I don’t believe [the model] should have the authority to tell the user diagnosis guidance and information” [Rater 1]. This underscores the ethical considerations inherent in developing AI for health applications, particularly when user safety is paramount. Quantitatively, annotators thought that model responses could cause harm in less than 0.1% of cases (Fig. 1b). Beyond navigating harms, annotators remarked that models would occasionally reference nonexistent data columns or metrics, impacting the overall quality and reliability of their responses.
Discussion
Our results suggest that PHIA, with its capabilities of iterative and interactive planning and reasoning with tools, is effective for analyzing and interpreting personal health data. We observe strong performance on objective personal health insights queries, with PHIA surpassing two commonly used baselines by 282% and 14%, respectively. This indicates that agent-based approaches like PHIA have significant advantages over numerical reasoning and code generation alone. Moreover, although agent frameworks are designed for more complex tasks, iterative reasoning during code generation proves useful even for simple objective queries that often require only a few lines of code.
The improvement extends to complex open-ended queries. By engaging experts in wearable data in our evaluation, we show that PHIA exhibits superior capabilities in reasoning about personal health insights and in interactive health data analysis with code generation, compared to our baseline. This is all the more impressive given that PHIA and the Code Generation baseline are powered by the same language model (Gemini 1.0 Ultra). PHIA requires no additional supervision, only advanced planning abilities and the ability to iteratively reason over internal knowledge and interact with external tools (e.g., web search). Therefore, as language models continue to improve, these benefits transfer directly to systems like PHIA.
While PHIA’s advanced reasoning capabilities offer significant advantages, it is crucial to ensure that these systems are designed with robust safety measures to prevent misuse or unintended consequences. Our human evaluation also reveals that PHIA is capable of avoiding harmful responses and refusing to answer unintended queries, such as requests for clinical diagnosis, thereby demonstrating the robustness of our system’s safety measures.
Our work has several limitations. First, while our results show that LLM-powered agents are effective tools for generating personal health insights, some limitations remain. Human annotators found PHIA’s responses to be clear, relevant, and unlikely to cause harm (Fig. 1b), but we nonetheless make no claim as to the effectiveness of these insights for helping real users understand their data, facilitating behavior changes, or ultimately improving health outcomes. Our aim in this paper is to define methods, tasks, and evaluation frameworks for agents in personal health. We leave it to future work to evaluate the efficacy of agent methods through clinical trials.
Second, the veracity of suggestions was not assessed by medical experts. Although our annotators have significant familiarity with the Google wearable ecosystem and Python data analysis, we did not employ health experts to assess the domain-specific validity of PHIA’s recommendations. However, the majority of queries in our objective and open-ended datasets, as described in the “Methods” section, are answered through assessment of user data and do not require advanced health knowledge. Nonetheless, we acknowledge that before PHIA or a similar agent is deployed as a service, care should be taken to verify the accuracy of suggestions where applicable. Furthermore, although dozens of examples were manually checked by experts to ensure quality, we recognize that the language model-based translation of code into plain English for our reasoning evaluation, whose human evaluators had no programming background, may introduce noise.
Third, in this paper, we focus on the analysis of data from wearable devices with code generation and explore how that data can be augmented with outside information from web search. PHIA’s toolset is limited but easily extendable; it could be expanded to include analysis of health records, user-provided journal entries, nutrition plans, lab results, readings from connected devices such as smart scales or blood sugar monitors, and more. Additionally, PHIA’s reasoning capabilities are enhanced through few-shot learning. We expect that fine-tuning the base language model with a set of agent reasoning traces in personal health could further boost the performance of PHIA.
Fourth, our study involves subjective thresholds in curating queries and wearable datasets. From the original 3000 questions, we sampled 177 to ensure coverage of the categories listed in Table 1; however, this may not encompass every possible health query scenario. Similarly, we aggregated user data over 31-day periods with a minimum of 10 days of availability for inclusion. While these parameters balance data quality and feasibility, they may not be optimal for generating synthetic data. Future work could explore more diverse query types and refine aggregation parameters to enhance generalizability.
Fifth, we emphasize that the aim of this research is not to build an LLM agent capable of addressing highly specialized or complex medical questions requiring expert knowledge16,18 beyond the scope of wearable data. For instance, PHIA’s suggestion that a user could increase their cardio intensity (Fig. 6) might not be suitable for individuals diagnosed with congestive heart failure. Furthermore, PHIA and similar systems should not be employed to derive insights into conditions that cannot be accurately assessed using wearable devices.
While future agentic systems might integrate data from other medical devices, the scope of this study is deliberately limited to conditions that can be monitored with consumer wearables. We also acknowledge that we did not test PHIA through real-world deployment studies to evaluate potential impacts on behavior change and other health outcomes. Further clinical trials or user studies would be necessary to validate the practical impact of PHIA’s recommendations. Finally, as noted in the “Methods” section, we largely restrict our experiments to a single base language model (Gemini 1.0 Ultra) to study the benefits of agent frameworks and tool use in isolation. Due to the substantial cost incurred through 650 h of human evaluation, it was not feasible to verify the central claims of this paper with other language models. Nonetheless, prior work34,45,46 shows that frontier language models like Gemini, Claude, GPT-4, and LLaMA are all capable of agentic tasks with mild variations in overall performance. We therefore hypothesize (but do not formally claim or prove) that our findings extend to other language models.
Table 1.
Open-ended personal health insights queries
| Query type | Count | Example |
|---|---|---|
| Correlation | 40 | How does my sleep duration correlate with my daily steps? |
| General knowledge | 35 | What’s a good meal for breakfast that will meet most of my nutritional needs for the day? |
| Problematic | 30 | Does not eating make your stomach look better? |
| Personal Min/Max/Avg. | 18 | What are my personal bests for different fitness metrics, such as steps taken, distance run, or calories burned? |
| Trend | 14 | Is there a noticeable reduction in stress, and has my mood stabilized? |
| Summary | 11 | What is my fitness like? |
| Compare time periods | 9 | What are my sleep patterns during different seasons? |
| Compare to cohort | 8 | Is my resting heart rate of 52 healthy for my age? |
| Anomaly | 7 | Tell me about the anomalies in my steps last month. |
| Total count | 172 | |
A summary of open-ended queries used in our human evaluation.
Fig. 6. PHIA combines tools to answer open-ended queries.
Two examples illustrate PHIA’s multi-step reasoning process. (Left) In response to a query about energy levels, PHIA first uses a web search for general advice and then its Python code interpreter to analyze the user’s specific sleep data, leading to a personalized recommendation. (Right) For a query about cardio, PHIA again uses its code interpreter to calculate the user’s BMI and activity levels before using a web search to provide a nuanced recommendation. These interactions are representative of high-scoring responses from the human evaluation (n = 172 unique open-ended queries) and demonstrate how the agent deconstructs ambiguous questions into a series of analytical steps.
Even with these limitations, this work represents an important step toward enabling individuals to draw meaningful conclusions from their own wearable data. Given that behaviors like sleep and physical activity are crucial to population health47–49, PHIA showcases how language model agents can empower users by making personalized insights more accessible. However, we emphasize that we see PHIA only as a starting point. As LLMs continue to improve, future agents could expand to analyze medical records, help users communicate with their clinical team, or identify early warning signs of more serious conditions. Ultimately, agents have the potential to change personal health management by enabling individuals to draw and communicate accurate conclusions from their own data, and PHIA is a promising first step toward this end.
Methods
Datasets for evaluating personal health insights
Wearable health trackers typically provide generic summaries of personal health behaviors, such as aggregated daily step counts or estimated sleep quality. However, these devices do not facilitate the generation of interactive, personal health insights tailored to individual user needs and interests. In this paper, we introduce three datasets aimed at evaluating how LLMs can reason about and understand personal health insights. The first dataset comprises objective personal health insights queries designed for automatic evaluation. The second dataset consists of open-ended personal health insights queries intended for human evaluation. Finally, we introduce a dataset of high-fidelity synthetic wearable users to reflect the diverse spectrum of real-world wearable device usage.
Objective personal health insights
Definition
Objective personal health queries are characterized by clearly defined responses. For example, the question, “On how many of the last seven days did I exceed 5000 steps?” constitutes a specific, tractable query. The answer to this question can be reliably determined using the individual’s data, and responses can be classified in a binary fashion as correct or incorrect.
Dataset curation
To generate objective personal health queries, we developed a framework aimed at the systematic creation and assessment of such queries and their respective answers. This framework is based on templates manually crafted by two domain experts, designed to incorporate a broad spectrum of variables, encompassing essential analytical functions, data metrics, and temporal definitions.
Consider the following example scenario: a template is established to calculate a daily average for a specified metric over a designated period, represented in code as daily_metrics[$METRIC].during($PERIOD).mean(). From this template, specific queries and their corresponding code implementations can be derived. For instance, if one wishes to determine the average number of daily steps taken in the last week, the query "What was my average daily steps during the last seven days?" and the code daily_metrics["steps"].during("last 7 days").mean() can be used to generate the corresponding response. It is worth noting that during() is a custom function that interprets the temporal span of a natural language query as a date range. A total of 4000 personal health insights queries were generated using this approach. All of these queries were manually evaluated by a domain expert at the intersection of data science and health research to measure their precision and comprehensibility. Examples are available in Table 2.
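Neither the template engine nor the during() accessor is released with the paper, so the sketch below is our own minimal illustration of the idea: a helper that resolves a “last N days” span over a daily-metrics table, matching the instantiated query above. The function body and toy data are assumptions.

```python
import pandas as pd

def during(series: pd.Series, period: str, today: pd.Timestamp) -> pd.Series:
    """Resolve a natural-language span (here, only 'last N days') to a date slice."""
    n_days = int(period.split()[1])  # e.g., "last 7 days" -> 7
    start = today - pd.Timedelta(days=n_days)
    return series[(series.index > start) & (series.index <= today)]

# Toy daily-metrics table indexed by date.
daily_metrics = pd.DataFrame(
    {"steps": [4200, 7800, 5100, 9300, 6100, 8800, 7400, 5600]},
    index=pd.date_range("2023-10-01", periods=8, freq="D"),
)

# "What was my average daily steps during the last seven days?"
answer = during(daily_metrics["steps"], "last 7 days", pd.Timestamp("2023-10-08")).mean()
print(round(answer, 2))  # 7157.14
```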
Table 2.
Objective personal health insights queries
| Example |
|---|
| What was my step count yesterday? |
| How many times have I done yoga? |
| What was the average number of minutes I spent in deep sleep over the past 14 days? |
| What is the total time I spent swimming for sessions lasting 40 min or less? |
| What was my percentage of light sleep on the most recent day I used the treadmill? |
| Total count: 4000 |
Examples of objective queries used in our automatic evaluation.
Open-ended personal health insights
Definition
Open-ended personal health insights queries are inherently ambiguous and can yield multiple correct answers. Consider the question, “How can I improve my fitness?” The interpretation of “improve” and “fitness” could vary widely. One valid response might emphasize enhancing cardiovascular fitness, while another might propose a strength training regimen. Evaluating these complex and exploratory queries poses significant challenges, as it requires a deep knowledge of both data analysis tools and wearable health data.
Dataset curation
A survey was conducted with a sample of the authors’ colleagues, all of whom had relevant expertise in personal and consumer health research and development, to solicit hypothetical inquiries for an AI agent equipped with access to their personal wearable data. Participants were asked, “If you could pose queries to an AI agent that analyzes your smartwatch data, what would you inquire?” Participants were also asked to suggest “problematic” questions that could lead to harm if answered, such as “How do I starve myself?” This survey generated approximately 3000 personal health insights queries, which were subsequently manually categorized into one of nine distinct query types (Table 1). For evaluation feasibility, a smaller test dataset was created, comprising 200 randomly selected queries. From this subset, queries with high semantic similarity were excluded, resulting in a final tally of 172 distinct personal health queries. We manually ensured that the sampled subset of queries covered all the query types listed in Table 1. These queries were intentionally excluded from agent development to avoid potential over-fitting.
Synthetic wearable user data
Definition
To effectively evaluate both objective and open-ended personal health insights queries, high-fidelity wearable user data is essential. To maintain the privacy of wearable device users, we developed a synthetic data generator for wearable data. This generator is based on a large-scale anonymized dataset from 30,000 real wearable users who agreed to contribute their data for research purposes. Each synthetic wearable user has two tables: one of daily statistics (e.g., sleep duration, bedtime, and total step count for each day) and another describing discrete activity events (e.g., a 5 km run on 2/4/24 at 1:00 PM). The schemas of these tables are available in Supplementary Tables 6 and 7. Synthetic data not only ensures the privacy and confidentiality of real-world user data but also facilitates reproducibility and broader accessibility for the research community. Unlike many real-world datasets, our synthetic dataset incorporates detailed and event-based metrics (e.g., sleep score, active zone minutes), enabling more reliable evaluation of personal health insights.
Dataset curation
To build the training dataset for our synthetic data generation framework, we sampled wearable data from 30,000 users who were randomly selected from individuals with heart rate-enabled Google Fitbit and Google Pixel Watch devices. The study underwent review and approval by an independent Institutional Review Board (IRB), with all participants providing informed consent for their deidentified data to be used in research and development of new health and wellness products and services. Eligibility required users to have at least 10 days of data recorded during October 2023 and a profile age between 18 and 80 years. The 10-day threshold was chosen to ensure the dataset captures day-to-day variability in user data while maintaining sufficient inclusion based on prior population distribution analyses. The dataset spans at most the 31 days of October 2023, aggregated from daily metrics (e.g., steps, sleep minutes, heart rate variability, activity zone minutes) and exercise events listed in Supplementary Tables 6 and 7.
We used a Conditional Probabilistic Auto-Regressive (CPAR) neural network50,51, specifically designed to manage sequential multivariate and multi-sequence data, while integrating stable contextual attributes (age, weight, and gender). This approach distinguishes between unchanging context (i.e., typically static data such as demographic information) and time-dependent sequences. Initially, a Gaussian Copula model captures correlations within the stable, non-time-varying context. Subsequently, the CPAR framework models the sequential order within each data sequence, effectively incorporating the contextual information. For synthetic data generation, the context model synthesizes new contextual scenarios. CPAR then generates plausible data sequences based on these new contexts, producing synthetic datasets that include novel sequences and contexts. To further enhance the fidelity of the synthetic data, we incorporated patterns of missing data observed in the real-world dataset, ensuring that the synthetic data mirrors the sporadic and varied availability of data often encountered in the usage of wearable devices. A total of 56 synthetic wearable users were generated, from which 4 were randomly selected for evaluation.
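The paper cites the CPAR model (refs. 50,51) but does not name its implementation. As an illustration only, the following sketch uses the open-source SDV library, whose PARSynthesizer implements CPAR; the toy data, column names, and hyperparameters are our assumptions, not the authors’ configuration.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# Toy long-format wearable data: one row per user-day, with static context columns.
real_data = pd.DataFrame({
    "user_id": ["u1"] * 3 + ["u2"] * 3,
    "date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"] * 2),
    "age": [34, 34, 34, 52, 52, 52],           # static per user (context)
    "gender": ["F", "F", "F", "M", "M", "M"],  # static per user (context)
    "steps": [8123, 9540, 7311, 4210, 5092, 4877],
    "sleep_minutes": [412, 388, 430, 365, 372, 401],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
metadata.set_sequence_key(column_name="user_id")  # one sequence per user
metadata.set_sequence_index(column_name="date")   # temporal order within a sequence

# PARSynthesizer implements the conditional probabilistic auto-regressive (CPAR)
# model, holding age and gender fixed per user as context.
synthesizer = PARSynthesizer(metadata, context_columns=["age", "gender"], epochs=128)
synthesizer.fit(real_data)
synthetic_users = synthesizer.sample(num_sequences=56)  # 56 synthetic users, as in the paper
```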
The Personal Health Insights Agent (PHIA)
Language models in isolation demonstrate limited abilities to plan future actions and use tools52,53. To support advanced wearable data analysis, as Fig. 7 illustrates, we embed an LLM into a larger agent framework that interprets the LLM’s outputs and helps it to interact with the external world through a set of tools. To the best of our knowledge, PHIA is the first large language model-powered agent specifically designed to transform wearable data into actionable personal health insights by incorporating advanced reasoning capabilities through iterative code generation, web search, and the ReAct framework to address complex health-related queries.
Fig. 7. An overview of our Personal Health Insights Agent (PHIA).
a–c Examples of objective and open-ended health insight queries, along with the synthetic wearable user data, were utilized to evaluate PHIA’s capabilities in reasoning and understanding personal health insights. d A framework and workflow that demonstrates how PHIA iteratively and interactively reasons through health insight queries using code generation and web search techniques. e An end-to-end example of PHIA’s response to a user query, showcasing the practical application and effectiveness of the agent.
Iterative & interactive reasoning
PHIA is based on the widely recognized ReAct agent framework34, where an “agent” refers to a system capable of performing actions autonomously and incorporating observations about these actions into decisions (Fig. 7d). In ReAct, a language model cycles through three sequential stages upon receiving a query. The initial stage, Thought, involves the model integrating its current context and prior outputs to formulate a plan to address the query. Next, in the Act stage, the language model implements its strategy by dispatching commands to one of its auxiliary tools. These tools, operating under the control of the LLM, provide feedback to the agent’s state by executing specific tasks. In PHIA, tools include a Python data analysis runtime and a Google Search API for expanding the agent’s health domain knowledge, both elaborated upon in subsequent sections. The final Observe stage incorporates the outputs from these tools back into the model’s context, enriching its response capability. For instance, PHIA integrates data analysis results or relevant web pages sourced through web search in this phase.
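As a rough illustration of this loop (not PHIA’s actual implementation), the sketch below alternates model calls with tool dispatch; llm, run_python, web_search, extract_code, and extract_query are all placeholder names for the underlying language model, the two tools, and output parsers.

```python
def react_loop(llm, query: str, max_steps: int = 10) -> str:
    context = f"Question: {query}\n"
    for _ in range(max_steps):
        # Thought: the model plans its next move given everything so far.
        step = llm(context)  # emits e.g. "Thought: ...\nAction: python\n<code>"
        context += step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Act: dispatch the requested tool call.
        if "Action: python" in step:
            observation = run_python(extract_code(step))
        elif "Action: search" in step:
            observation = web_search(extract_query(step))
        else:
            observation = "Unknown action."
        # Observe: feed the tool output back into the model's context.
        context += f"\nObservation: {observation}\n"
    return "Step limit reached without a final answer."
```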
Wearable data analysis with code generation
During an Act stage, the agent engages with wearable tabular data through Python within a customized sandbox runtime environment. This interaction leverages the Pandas Python library, a popular tool for code-based data analysis. In contrast to using LLMs directly for numerical reasoning, the numerical results derived from code generation are factual and reliably maintain arithmetic precision. Moreover, this approach can help reduce the risk of leaking users’ raw data, as the language model only ever encounters the analysis outcome, which is generally aggregated information or trends.
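For instance, the kind of Pandas code the agent might emit for a query such as “Do I sleep better on active days?” looks like the following; the toy table and its column names are assumptions loosely modeled on the daily-statistics schema.

```python
import pandas as pd

# Toy daily table; column names are assumed for illustration.
daily = pd.DataFrame({
    "date": pd.date_range("2023-10-01", periods=6, freq="D"),
    "steps": [3000, 11000, 4500, 12000, 9800, 2500],
    "sleep_minutes": [380, 432, 395, 441, 425, 371],
}).set_index("date")

# Compare mean sleep on active (>= 8000 steps) vs. less active days.
active = daily["steps"] >= 8000
print(daily.loc[active, "sleep_minutes"].mean())   # 432.67 on active days
print(daily.loc[~active, "sleep_minutes"].mean())  # 382.0 on less active days
```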
Integration of additional health knowledge
PHIA enhances its reasoning processes by integrating a web search-based mechanism to retrieve the latest and relevant health information from reliable sources54,55. This custom search capability extracts and interprets content from top search results from reputable domains. This approach is doubly beneficial: it can directly attribute information to web sources, bolstering credibility, and it provides the most up-to-date data available, thereby addressing the inherent limitations of the language model’s training on historical data.
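The paper does not disclose its search stack, so the following is only a hedged sketch of the allow-listing idea described above: filter raw results down to reputable domains and return attributed snippets. The domain list and the raw_search backend are placeholders, not the system’s actual configuration.

```python
# All names here are assumptions; raw_search stands in for any search backend
# that returns (title, url, snippet) tuples.
REPUTABLE_DOMAINS = {"who.int", "cdc.gov", "nih.gov", "mayoclinic.org"}  # illustrative

def web_search(query: str, raw_search, k: int = 3) -> list[str]:
    results = raw_search(query)
    trusted = [r for r in results if any(d in r[1] for d in REPUTABLE_DOMAINS)]
    # Return attributed snippets from the top reputable results.
    return [f"{title} ({url}): {snippet}" for title, url, snippet in trusted[:k]]
```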
Mastering tool use
A popular technique for augmenting the performance of agents and language models is few-shot prompting56. This approach entails providing the model with a set of high-quality examples to guide it on the desired task without expensive fine-tuning. To determine representative examples, we computed a sentence-T5 embedding57 for all queries in our dataset. Next, we applied K-means clustering on these embeddings, targeting 20 clusters. We then selected queries closest to the centroid of each cluster as representatives. For each chosen query, we crafted a ReAct trajectory (Thought -> Action -> Observation) that demonstrates how to produce a high-quality response with iterative planning, code generation, and web search. Refer to the responses of PHIA Few-Shot in Supplementary Figs. 5–8 for more examples.
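A minimal sketch of this selection procedure, assuming an embed function that stands in for the sentence-T5 encoder:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(queries: list[str], embed, k: int = 20) -> list[str]:
    """Pick one representative query per K-means cluster of query embeddings."""
    X = np.stack([embed(q) for q in queries])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Choose the member closest to the cluster centroid.
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(queries[members[np.argmin(dists)]])
    return reps
```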
Choice of language model
For all of the following experiments, we fix Gemini 1.0 Ultra58 as the underlying language model. Our goal is not to study which language model is best at our task. Rather, we explore the effectiveness of agent frameworks and tool use to answer subjective, open-ended queries pertaining to wearable data.
Experimental design and evaluation
Baselines
Numerical reasoning
Since language models have modest mathematical ability52,59, it may be the case that PHIA’s code interpreter is not necessary to answer personal health queries. In the first baseline, the user’s data is structured in the popular Markdown table format and directly supplied to the language model as text, coupled with the corresponding query. Markdown has previously been shown to be one of the most effective formats for LLM-aided tabular data processing60. Analogous to PHIA, we designed a set of few-shot examples to guide the model to execute rudimentary operations such as calculating the average of a data column in the last 30 days, as shown in Supplementary Fig. 9.
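Concretely, this baseline’s input can be produced with Pandas’ built-in Markdown serialization (to_markdown requires the tabulate package); the prompt construction below is a simplified stand-in for the full few-shot prompt.

```python
import pandas as pd

daily = pd.DataFrame({
    "date": ["2023-10-01", "2023-10-02"],
    "steps": [8123, 9540],
    "sleep_minutes": [412, 388],
})

# Serialize the table as Markdown and prepend it to the query.
prompt = daily.to_markdown(index=False) + "\n\nWhat was my average daily steps?"
print(prompt)
```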
Code Generation
Is it necessary to use an agent to iteratively and interactively reason about personal health insights? To investigate this question, for our second baseline, we introduce a Code Generation model that can only generate answers in a single step. In contrast to PHIA, this approach lacks a reflective Thought step, which renders it unable to strategize and plan multiple steps ahead, as well as incapable of iterative analysis of wearable data. Moreover, this approach cannot augment its personal health domain knowledge as it does not have access to web search. This baseline builds on prior work in code and SQL generation for data science, where language models generate code in response to natural language queries61–64. To make a fair comparison, this baseline was fortified with a unique set of few-shot examples that employ identical queries to those used in PHIA, albeit with responses and code crafted by humans to mirror the restricted capabilities of the Code Generation model (i.e., no additional tool use and iterative reasoning). Examples are available in Supplementary Figs. 5–8.
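Contrasted with the ReAct sketch above, the single-step baseline reduces to one model call followed by one execution, as in this sketch (llm and run_python are the same placeholders as before):

```python
def code_generation_baseline(llm, query: str) -> str:
    # One model call produces code; it is executed once, with no Thought step
    # and no opportunity to react to errors (hence a recovery rate of zero).
    code = llm(f"Write Python (Pandas) code to answer: {query}")
    return run_python(code)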
Additional LLM-based wearable systems
We also compare PHIA against recent LLM-based methods, including the Personal Health Large Language Model (PH-LLM)32, a fine-tuned model based on Gemini 1.0 Ultra that delivers coaching for fitness and sleep from aggregated 30-day wearable data (e.g., the 95th percentile of sleep duration) rather than high-resolution daily data. Rather than invoking tools, PH-LLM uses in-model reasoning to generate long-form insights and recommendations, and it is fine-tuned specifically for providing coaching recommendations rather than numerical insights and recommendations for general wearable-based queries. Additionally, we compare our approach to a specialized chain-of-thought prompting strategy designed for interpreting time-series wearable data with the GPT-4 model33. This approach instructs the model to reason directly within its textual context window without external computational tools. Overall, unlike PHIA, these methods focus on internal model reasoning and do not incorporate an iterative agentic framework or external tools. Because the latter approach is based on GPT-4 instead of Gemini, it also enables us to demonstrate that our proposed approach outperforms strong baselines that leverage alternative language models.
Experiments
With the aforementioned baselines in mind, we conducted the following experiments to examine PHIA’s capabilities.
Automatically evaluating numerical correctness with objective queries
Some personal health queries have objective solutions that afford automatic evaluation (see “Methods” section for details). To study PHIA’s performance on these questions, we evaluated PHIA and the baselines on all 4000 queries in our objective personal health insights dataset. A query was considered correctly answered if the model’s final response was correct to within two digits of precision (e.g., given a ground truth answer of 2.54, a response of 2.541 would be considered correct, and the response 2.53 would be considered incorrect). We compared PHIA against numerical reasoning, code generation, and two LLM-based wearable systems (PH-LLM and custom-prompted GPT-4).
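This correctness criterion can be implemented as rounding both values to two decimal places, which reproduces the example above; the function below is a minimal sketch of that check, not the paper’s released evaluation code.

```python
def is_exact_match(predicted: float, truth: float, digits: int = 2) -> bool:
    """Correct iff the response matches the ground truth to two digits of precision."""
    return round(predicted, digits) == round(truth, digits)

assert is_exact_match(2.541, 2.54)      # rounds to 2.54 -> correct
assert not is_exact_match(2.53, 2.54)   # off by 0.01 -> incorrect
```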
Evaluating open-ended insights reasoning with human raters
Open-ended personal health queries demand precise interpretation to integrate user-specific data with expert knowledge. To assess open-ended reasoning capability, we recruited a team of twelve independent annotators who had substantial familiarity with wearable data in the domains of sleep and fitness. They were tasked with evaluating the quality of reasoning of PHIA and our Code Generation baseline on the open-ended query dataset defined in Table 1. Because the annotators had minimal experience with Python data analysis, two domain experts developed a pipeline with Gemini 1.0 Ultra to translate Python code into explanatory English text (examples available in Supplementary Figs. 10–14). Annotators were also provided with the final model responses.
Annotators were tasked with assessing whether each model response demonstrated the following attributes: relevance of data utilized, accuracy in interpreting the question, personalization, incorporation of domain knowledge, correctness of the logic, absence of harmful content, and clarity of communication. Additionally, they rated the overall reasoning of each response using a Likert scale ranging from one (“Poor”) to five (“Excellent”). All responses were distributed so that each was rated by at least three unique annotators, who were blinded to the method used to generate the response. Rubrics and annotation instructions can be found in Supplementary Table 1. To standardize comparisons across different metrics, final scores were obtained by mapping the original ratings on a scale of 1–5 into a range of 0–100. Scores for “Yes or No” questions are the proportion of annotators who responded “Yes”. For example, an answer of “Yes” for domain knowledge would indicate that the annotator found the response to show an understanding of domain knowledge. In total, more than 5500 model responses and 600 h of annotator time were used in this evaluation. Additional reasoning evaluation with real-user data can be found in Supplementary Fig. 15.
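The paper does not spell out the 1–5 to 0–100 mapping; a linear rescaling consistent with the reported scores would be:

```python
def scale_likert(rating: int) -> float:
    """Map a 1-5 Likert rating onto 0-100 (1 -> 0, 3 -> 50, 5 -> 100)."""
    return (rating - 1) / 4 * 100
```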
Evaluating code quality with domain experts
To assess the quality of the code outputs of PHIA and our Code Generation baseline, we recruited a team of seven data scientists with graduate degrees, extensive professional experience in analyzing wearable data, and publications in this field. Collectively, these experts brought several decades of relevant experience (mean = 9 years) to the task. We distributed the model responses from PHIA and the Code Generation baseline such that each sample was independently evaluated by three different annotators. Experts were blinded to the experimental condition (i.e., whether the response was generated by PHIA or Code Generation baseline). Unlike in the reasoning evaluation, annotators were provided with the raw and complete model response from each method, including generated Python code, Thought steps, and any error messages. Experts were asked to determine whether each response exhibited the following favorable characteristics: avoiding hallucination, selecting the correct data columns, indexing the correct time frame, correctly interpreting the user’s question, and personalization. Finally, annotators were instructed to rate the overall quality of each response using a Likert scale ranging from one to five (instruction details in Supplementary Table 2). To facilitate comparison, these ratings were again converted into 0–100 scores. In total, 595 model responses collected over 50 h were used in this evaluation.
Conducting comprehensive errors analysis
Additionally, we conducted a quantitative measurement of code quality by calculating how often a method fails to generate valid code while answering a personal health insights query. Toward this, we determined each method’s “Error Rate”: the number of responses containing code that raises an error (e.g., indexing columns that do not exist, importing inaccessible libraries, or syntax mistakes) divided by the total number of responses that used code. To better understand the sources of errors, two experts independently performed an open coding evaluation on all the responses in the open-ended dataset. They were instructed to look for errors including hallucinations, Python code errors, and misinterpretations of the user’s query. Results were aggregated into one of the following semantic categories: Hallucination, General Code Errors, Misinterpretation of Data, Pandas Operations, and Other.
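Under these definitions, the two rates could be computed as in the sketch below; the field names are assumptions, and we compute the recovery rate over errored responses (the paper’s exact denominator is not stated).

```python
def error_and_recovery_rates(responses: list[dict]) -> tuple[float, float]:
    with_code = [r for r in responses if r["used_code"]]
    errored = [r for r in with_code if r["raised_error"]]
    error_rate = len(errored) / len(with_code)
    # Recovery rate computed over errored responses (denominator assumed).
    recovery_rate = sum(r["recovered"] for r in errored) / len(errored) if errored else 0.0
    return error_rate, recovery_rate
```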
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Source data
Acknowledgements
We thank the raters who evaluated our candidate model responses for their dedication, effort, and detailed feedback. Contributors are: Shivani Aroraa, Rishita Matolia, Md Arbaz, Choudhurimyum Devashish Sharma, Aayush Ranjan, Rohini Sharma, Manish Prakash Arya, Noel Haris, Chhaya, Manish Phukela, Chetan Sarvari, Vibhati Sharma, Shadi Rayyan, Andrew Mai, Florence Gao, Peninah Kaniu, Jian Cui, Shun Liao, Jake Garrison, Girish Narayanswamy, Paolo Di Achille. We also thank Hulya Emir-Farinas, Shelten Yuen, Noa Tal, Annisah Um’rani, Oba Adewunmi, and Archit Mathur for their valuable insights on writing, technical support, and feedback during our research.
Author contributions
J.Z., T.A., D.M., and X.L. conceived the study. M.A.M., A.P., T.A., D.M., and X.L. had major involvement in designing various aspects of the study. M.A.M., A.P., N.R., G.K., J.P., E.S., S.T., K.A., T.A., D.M., and X.L. had major involvement in data acquisition and ingestion. M.A.M., A.P., N.R., G.K., E.S., T.A., D.M., and X.L. generated synthetic data and initial case studies. M.A.M., A.P., N.R., G.K., E.S., T.A., D.M., and X.L. preprocessed data. M.A.M., A.P., T.A., D.M., and X.L. performed experiments. M.A.M., A.P., E.S., T.A., D.M., and X.L. analyzed results. J.S. provided clinical and domain expertise. J.P., Y.L., E.S., N.H., J.S., and X.L. administered the project. Y.L., N.H., J.S., S.T., K.A., H.S., Q.H., C.Y.M., M.M., S.P., and J.Z. provided organizational support for the project. M.A.M., A.P., N.R., G.K., J.P., Y.L., C.Y.M., M.M., T.A., D.M., and X.L. wrote and revised the initial manuscript. All authors had access to the data and have edited, reviewed, and approved the final manuscript for publication. All authors accept responsibility for the accuracy and integrity of all aspects of the research.
Peer review
Peer review information
Nature Communications thanks Carl Yang and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
All data used for the evaluation of PHIA are publicly available at https://github.com/yahskapar/personal-health-insights-agent. This includes the full dataset of 4000 objective queries and 172 open-ended queries used for automatic and human evaluation, respectively. Ethics approval for the use of the real-world dataset from Google underlying the synthetic and real-world sets was granted by the Western Institutional Review Board-Copernicus Group IRB, which classified the research as exempt under 45 CFR 46.104(d)(4). Source data are provided with this paper.
Code availability
All code provided with this manuscript for PHIA is available at https://github.com/yahskapar/personal-health-insights-agent. This repository includes the core agent logic (phia_agent.py), all prompt templates, scripts to reproduce the figures in this paper, and a demonstration notebook (phia_demo.ipynb) for interacting with the agent. The underlying language model for PHIA, as evaluated in this manuscript, is Gemini 1.0 Ultra.
Competing interests
This study was funded by Google LLC. All authors are employees of Alphabet and may own stock as part of the standard compensation package.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Mike A. Merrill, Akshay Paruchuri.
These authors jointly supervised this work: Tim Althoff, Daniel McDuff, and Xin Liu.
Contributor Information
Tim Althoff, Email: althoff@google.com.
Daniel McDuff, Email: dmcduff@google.com.
Xin Liu, Email: xliucs@google.com.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-67922-y.
References
- 1. Althoff, T. et al. Large-scale physical activity data reveal worldwide activity inequality. Nature 547, 336–339 (2017).
- 2. Strain, T. et al. Wearable-device-measured physical activity and future health risk. Nat. Med. 26, 1385–1391 (2020).
- 3. Anderson, E. & Durstine, J. L. Physical activity, exercise, and chronic diseases: a brief review. Sports Med. Health Sci. 1, 3–10 (2019).
- 4. Medic, G., Wille, M. & Hemels, M. E. Short- and long-term health consequences of sleep disruption. Nat. Sci. Sleep 9, 151–161 (2017).
- 5. Stamatakis, E. et al. Association of wearable device-measured vigorous intermittent lifestyle physical activity with mortality. Nat. Med. 28, 2521–2529 (2022).
- 6. Buxton, O. M. & Marcelli, E. Short and long sleep are positively associated with obesity, diabetes, hypertension, and cardiovascular disease among adults in the United States. Soc. Sci. Med. 71, 1027–1036 (2010).
- 7. Ferguson, T. et al. Effectiveness of wearable activity trackers to increase physical activity and improve health: a systematic review of systematic reviews and meta-analyses. Lancet Digit. Health 4, e615–e626 (2022).
- 8. Xi, Z. et al. The rise and potential of large language model based agents: a survey. Preprint at https://arxiv.org/abs/2309.07864 (2023).
- 9. Rey, B., Lee, B., Choe, E. K. & Irani, P. Investigating in-situ personal health data queries on smartwatches. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 1–19 (2023).
- 10. Amini, F., Hasan, K., Bunt, A. & Irani, P. Data representations for in-situ exploration of health and fitness data. In Proc. 11th EAI International Conference on Pervasive Computing Technologies for Healthcare, PervasiveHealth ’17 163–172 (ACM, 2017).
- 11. Pal, D., Tassanaviboon, A., Arpnikanondt, C. & Papasratorn, B. Quality of experience of smart-wearables: from fitness-bands to smartwatches. IEEE Consum. Electron. Mag. 9, 49–53 (2020).
- 12. Aseniero, B. A., Perin, C., Willett, W., Tang, A. & Carpendale, S. Activity River: visualizing planned and logged personal activities for reflection. In Proc. International Conference on Advanced Visual Interfaces, AVI ’20 1–9 (ACM, 2020).
- 13. Choe, E. K., Lee, B., Zhu, H., Riche, N. H. & Baur, D. Understanding self-reflection: how people reflect on personal data through visual data exploration. In Proc. 11th EAI International Conference on Pervasive Computing Technologies for Healthcare, PervasiveHealth ’17 173–182 (ACM, 2017).
- 14. Epstein, D., Cordeiro, F., Bales, E., Fogarty, J. & Munson, S. Taming data complexity in lifelogs: exploring visual cuts of personal informatics data. In Proc. 2014 Conference on Designing Interactive Systems 667–676 (ACM, 2014).
- 15. Neshati, A. et al. SF-LG: space-filling line graphs for visualizing interrelated time-series data on smartwatches. In Proc. 23rd International Conference on Mobile Human-Computer Interaction 1–13 (ACM, 2021).
- 16. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
- 17. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
- 18. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
- 19. Saab, K. et al. Capabilities of Gemini models in medicine. Preprint at https://arxiv.org/abs/2404.18416 (2024).
- 20. Swan, M., Kido, T., Roland, E. & dos Santos, R. P. Math agents: computational infrastructure, mathematical embedding, and genomics. Preprint at https://arxiv.org/abs/2307.02502 (2023).
- 21. Dan, Y. et al. EduChat: a large-scale language model-based chatbot system for intelligent education. Preprint at https://arxiv.org/abs/2308.02773 (2023).
- 22. Yang, X. et al. A large language model for electronic health records. NPJ Digit. Med. 5, 194 (2022).
- 23. Shi, W. et al. EHRAgent: code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing (ACL, 2024).
- 24. Guevara, M. et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit. Med. 7, 6 (2024).
- 25. Sharma, A. et al. Cognitive reframing of negative thoughts through human-language model interaction. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 9977–10000 (ACL, 2023).
- 26. Sharma, A., Rushton, K., Lin, I. W., Nguyen, T. & Althoff, T. Facilitating self-guided mental health interventions through human-language model interaction: a case study of cognitive restructuring. In Proc. CHI Conference on Human Factors in Computing Systems 1–29 (ACM, 2024).
- 27. Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C. & Althoff, T. Human-AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nat. Mach. Intell. 5, 46–57 (2023).
- 28. Lin, I. W. et al. IMBUE: improving interpersonal effectiveness through simulation and just-in-time feedback with human-language model interaction. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL, 2024).
- 29. Lee, S., Kim, W. J., Chang, J. & Ye, J. C. LLM-CXR: instruction-finetuned LLM for CXR image understanding and generation. In The Twelfth International Conference on Learning Representations (2024).
- 30. Galatzer-Levy, I. R., McDuff, D., Natarajan, V., Karthikesalingam, A. & Malgaroli, M. The capability of large language models to measure psychiatric functioning. Preprint at https://arxiv.org/abs/2308.01834 (2023).
- 31. McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
- 32. Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. 31, 3394–3403 (2025).
- 33. Englhardt, Z. et al. From classification to clinical insights: towards analyzing and reasoning about mobile and behavioral health data with large language models. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 8, 1–25 (2024).
- 34. Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations (2023).
- 35. Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Adv. Neural Inf. Process. Syst. 36, 943 (2024).
- 36. Zhuang, Y., Yu, Y., Wang, K., Sun, H. & Zhang, C. ToolQA: a dataset for LLM question answering with external tools. Adv. Neural Inf. Process. Syst. 36, 2180 (2024).
- 37. Schick, T. et al. Toolformer: language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 36, 2997 (2024).
- 38. Qin, Y. et al. Tool learning with foundation models. ACM Comput. Surv. 57, 1–40 (2024).
- 39. Ye, Y. et al. Large language models are versatile decomposers: decomposing evidence and questions for table-based reasoning. In Proc. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 174–184 (ACM, 2023).
- 40. Chen, Y. et al. SheetAgent: a generalist agent for spreadsheet reasoning and manipulation via large language models. Preprint at https://arxiv.org/abs/2403.03636 (2025).
- 41. Guo, S. et al. DS-Agent: automated data science by empowering large language models with case-based reasoning. Preprint at https://arxiv.org/abs/2402.17453 (2024).
- 42. Chakraborty, A. et al. Navigator: a Gen-AI system for discovery of factual and predictive insights on domain-specific tabular datasets. In Proc. 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD) 528–532 (ACM, 2024).
- 43. Hong, S. et al. Data Interpreter: an LLM agent for data science. Preprint at https://arxiv.org/abs/2402.18679 (2025).
- 44. Jiang, J. et al. StructGPT: a general framework for large language model to reason over structured data. Preprint at https://arxiv.org/abs/2305.09645 (2023).
- 45. Koh, J. Y. et al. VisualWebArena: evaluating multimodal agents on realistic visual web tasks. Preprint at https://arxiv.org/abs/2401.13649 (2024).
- 46. Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? Preprint at https://arxiv.org/abs/2310.06770 (2024).
- 47. Chattu, V. K. et al. The global problem of insufficient sleep and its serious public health implications. Healthcare 7, 1 (2018).
- 48. Lee, I.-M. et al. Effect of physical inactivity on major non-communicable diseases worldwide: an analysis of burden of disease and life expectancy. Lancet 380, 219–229 (2012).
- 49. Bennett, E. M., Alpert, R. & Goldstein, A. C. Communications through limited-response questioning. Public Opin. Q. 18, 303–308 (1954).
- 50. Patki, N., Wedge, R. & Veeramachaneni, K. The synthetic data vault. In IEEE International Conference on Data Science and Advanced Analytics (DSAA) 399–410 (IEEE, 2016).
- 51. Zhang, K., Patki, N. & Veeramachaneni, K. Sequential models in the synthetic data vault. Preprint at https://arxiv.org/abs/2207.14406 (2022).
- 52. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
- 53. Wang, Z. et al. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents. Preprint at https://arxiv.org/abs/2302.01560 (2023).
- 54. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).
- 55. Sumers, T. et al. Cognitive architectures for language agents. Trans. Mach. Learn. Res. (2023).
- 56. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
- 57. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
- 58. Team, G. et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).
- 59. Anand, A. et al. Mathify: evaluating large language models on mathematical problem solving tasks. Preprint at https://arxiv.org/abs/2404.13099 (2024).
- 60. Lu, W., Zhang, J., Zhang, J. & Chen, Y. Large language model for table processing: a survey. Preprint at https://arxiv.org/abs/2402.05121 (2024).
- 61. Li, X. & Döhmen, T. Towards efficient data wrangling with LLMs using code generation. In Proc. Eighth Workshop on Data Management for End-to-End Machine Learning 62–66 (ACM, 2024).
- 62. Merrill, M. A., Zhang, G. & Althoff, T. MULTIVERSE: mining collective data science knowledge from code on the web to suggest alternative analysis approaches. In Proc. 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining 1212–1222 (ACM, 2021).
- 63. Yin, P. et al. Natural language to code generation in interactive data science notebooks. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL, 2023).
- 64. Bzdok, D. et al. Data science opportunities of large language models for neuroscience and biomedicine. Neuron 112, 698–717 (2024).
Data Availability Statement
All data used for the evaluation of PHIA are publicly available at https://github.com/yahskapar/personal-health-insights-agent. This includes the full set of 4000 objective queries used for automatic evaluation and the 172 open-ended queries used for human evaluation. Ethics approval for the use of the real-world Google dataset underlying both the synthetic and real-world evaluation sets was granted by the Western Institutional Review Board-Copernicus Group (WCG) IRB, which classified the research as exempt under 45 CFR 46.104(d)(4). Source data are provided with this paper.
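For readers who wish to reuse the released benchmarks, a minimal loading sketch follows. The file names and directory layout are assumptions made purely for illustration; the repository's documentation defines the actual formats.

import pandas as pd

# Hypothetical file paths; consult the repository README for the released layout.
objective = pd.read_csv("data/objective_queries.csv")    # 4000 objective, numerical questions
open_ended = pd.read_csv("data/open_ended_queries.csv")  # 172 open-ended questions

print(f"{len(objective)} objective queries; {len(open_ended)} open-ended queries")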