Abstract
Most genetic conditions are described in paediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorised based on age-related characteristics). Results showed that the LLMs provided appropriate age-based responses in both child and adult outputs, based on Correctness and Completeness scores graded by clinicians. A sub-analysis of metabolic conditions, including those that typically present neonatally with acute crises, also showed age-appropriate LLM responses. However, 70b and GPT obtained lower Correctness and Completeness scores when producing management plans (55-66% for 70b and a wider range, 50-90%, for GPT), suggesting that LLMs still have limitations in clinical applications.
Subject terms: Genetics, Diseases
Introduction
There are currently well over 6000 genetic conditions with identified causes, and extraordinary progress continues in identifying the specific etiologies, biological underpinnings and manifestations of these disorders1. Unlike some common health conditions, which have been comprehensively studied throughout the lifespan, many genetic conditions are typically described in paediatric populations, focusing on early and obvious manifestations like birth defects, severe biochemical or immunological perturbations, or developmental delay and other neurobehavioral findings2. There are multiple reasons for this paediatric focus. One major driver is the fact that most clinical geneticists who diagnose and manage such patients are additionally trained in paediatrics3. Since these clinicians primarily see, diagnose, and manage children given their scope of training, they tend to report findings in the medical literature and other outlets that pertain to the paediatric timeframe4. Another explanation involves the severe nature of many genetic conditions, which frequently affects survival, especially until more recent decades, when improvements in supportive care and the availability of direct therapies for some conditions have helped enable longer lives for affected individuals5. That is, many individuals with severe genetic conditions did not survive into adulthood, and thus the longer-term sequelae of these conditions were less well described. Finally, at least in some geographic areas, inequitable insurance policies mean that genetic testing is unequally covered in paediatric and adult populations, such that achieving precise diagnosis is often much more difficult in the adult population6,7. As a result of these factors, gaps exist in understanding the clinical features, outcomes and optimal management of genetic conditions as patients age.
As generative artificial intelligence (AI) continues to advance, it is rapidly transforming many biomedical disciplines, including through image generators like generative adversarial networks for realistic image construction as well as large language models (LLMs)8–10. For image generation, producing high-quality outputs that are clinically accurate would likely require finetuning available pre-trained models. In contrast, many pre-trained LLMs can be used for a variety of medical purposes without finetuning10–12. Due to this versatility, the adoption of LLMs may increase accessibility to medical information across many clinical specialties by responding to queries from both medical and non-medical users9,13. For example, a person with a genetic condition may ask a pre-trained LLM (e.g. GPT-4, Bard/Gemini, etc.) about future expectations, available treatments or recurrence risks; a physician seeing a patient with a suspected genetic disorder may use these types of LLMs to better understand pertinent pathophysiology or to help generate a differential diagnosis14. Going beyond single-turn conversations (a single query and response) with pre-trained LLMs, recent studies show that LLMs can be further developed into multi-turn chatting agents geared specifically towards medical domains and can communicate with appropriate levels of empathy9,13,14. For example, PaLM2 was finetuned on generated patient vignettes to produce a multi-turn AI chatbot with a diagnostic accuracy exceeding that of primary care physicians15.
Numerous studies have examined the response quality of LLMs across a spectrum of medical specialties, such as ophthalmology, orthopaedics, dermatology, obstetrics, oncology and others12,15,16. However, none of these studies has specifically focused on genetic conditions with age-related manifestations and management plans, or on whether there can be age-related biases in LLM responses. This is important, as LLMs may perform differently in the field of medical genomics versus other areas of medicine due to the individual rarity of genetic disorders and the relative scarcity of medical literature about many conditions17. Given these considerations, we aimed to explore aspects of LLM performance in the context of medical genomics: our objective was to investigate potential age-related biases in the identification and management of genetic conditions. To perform this study, we evaluated the open-source Llama-2-70b-chat model and the proprietary GPT-3.5 model on their proficiency at generating accurate medical vignettes encompassing 282 genetic conditions selected based on prevalence (including subgroups such as metabolic disorders). We also generated and evaluated the LLMs’ age-appropriate patient-geneticist dialogues and their performance in answering age-specific questions regarding management with respect to the generated vignettes on a subset of these conditions (for dataset selection, condition categorisation and the generation and scoring of medical vignettes, dialogues and management plans, see details in ‘Methods’). Overall, we found that the LLMs perform well with respect to age-related differences across our 282 conditions.
Results
Data curation and selection output
We selected conditions from Orphanet’s dataset of rare diseases based on prevalence (November 2023 version 2). As shown in Fig. 1, after removing 121 duplicates, four clinicians assessed the list of remaining conditions according to the exclusion criteria and removed: (a) conditions without known Mendelian/monogenic genetic causes (e.g. congenital syphilis); (b) benign conditions that may involve a measurable phenotype but whose clinical impact is unclear (e.g. iminoglycinuria); (c) multifactorial conditions without a clear, known monogenic cause or for which only susceptibility loci have been identified; (d) low or unclear penetrance conditions, or findings that may appear as clinical features in many different genetic conditions as well as in an isolated fashion (e.g. Hirschsprung disease); (e) conditions caused by somatic genetic variants, such as forms of cancer (as many such conditions do not typically fall under the purview of clinical geneticists); and (f) conditions involving variable cytogenomic changes, including microdeletions or microduplications (unless clearly associated with a well-known genetic syndrome such as 22q11.2 deletion syndrome). From the initial list of 793 conditions, after applying these criteria, we were left with 282 conditions for analysis (see Supplementary File 1). For our analyses, we divided these 282 conditions into five mutually exclusive categories based on the flowchart in Fig. 2, using information in public databases of genetic conditions including Orphanet (https://www.orpha.net/), GeneReviews (https://www.ncbi.nlm.nih.gov/books/NBK1116/) and OMIM (https://www.omim.org/), all reviewed between 02/02/2024 and 04/16/2024. These five categories are: (1) Disorders limited to childhood (age < 18 years), either because most patients do not survive into adulthood (e.g. rhizomelic chondrodysplasia punctata) or the manifestations subside before adulthood (e.g. 
glycogen storage disease due to liver phosphorylase kinase deficiency) (n = 33); (2) Disorders limited to adulthood (age > 18 years), where manifestations present later in life, although in some rare scenarios findings can appear in adolescence (e.g. GNE myopathy) (n = 13); (3) Disorders that manifest in childhood and/or adulthood with changes in management (with or without presentation changes) across the lifespan. The change in management might be linked to starting management at an older age, as one symptom of the disease only becomes apparent with aging, prompting treatment at that stage (e.g. homocystinuria, where adults typically require additional management considerations related to their increased risk of medical issues like strokes and osteoporosis), or the change in management could occur without a change in how the disease presents itself but because a medication is approved only for adults (e.g. the use of retinoids for adults with syndromic recessive X-linked ichthyosis compared to the use of mild emollients for children, even though some paediatric dermatologists might use retinoids off label) (n = 53); (4) Disorders that manifest in childhood and/or adulthood and involve changes in presentation across the lifespan, but with no specific change in management (e.g. Rubinstein-Taybi syndrome, where children have growth failure but adults have obesity) (n = 33); (5) Disorders with no changes in presentation or management across the lifespan (e.g. Treacher-Collins syndrome) (n = 150). For brevity, in the subsequent sections, tables and figures, we will denote these categories as (1) Limited to Childhood; (2) Limited to Adulthood; (3) Management Change; (4) Presentation Change; (5) No Change. Most of the conditions have a neonatal onset, but many have a wider age range of onset (see Fig. 3 for the distribution of age of onset based on data from www.orpha.net).
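The categorisation flowchart in Fig. 2 amounts to a sequence of mutually exclusive checks. As a hedged sketch, the Boolean flags below are our own illustration of the decision points, not fields from the source databases:

```python
# Illustrative sketch of the Fig. 2 categorisation logic. The flag names
# are hypothetical; each would be set by clinician review of Orphanet,
# GeneReviews and OMIM entries for the condition.
def categorise(limited_to_childhood: bool,
               limited_to_adulthood: bool,
               management_changes: bool,
               presentation_changes: bool) -> str:
    """Map a condition's age-related properties onto the five categories."""
    if limited_to_childhood:
        return "Limited to Childhood"   # category 1
    if limited_to_adulthood:
        return "Limited to Adulthood"   # category 2
    if management_changes:
        return "Management Change"      # category 3 (takes precedence over 4)
    if presentation_changes:
        return "Presentation Change"    # category 4
    return "No Change"                  # category 5
```

Note that, as in the systemic primary carnitine deficiency example of Fig. 2, a management change takes precedence: a condition with both management and presentation changes lands in category 3.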
Fig. 1. Diseases selection process, based on the PRISMA schema.
This schema was adapted from the approach used for systematic reviews and is used with appropriate citation as described in the guidelines41. Sources used in our analyses include Orphanet (https://www.orpha.net/), GeneReviews (https://www.ncbi.nlm.nih.gov/books/NBK1116/) and OMIM (https://www.omim.org/).
Fig. 2. Flowchart showing the categorisation process used to classify each of the 282 conditions into one of the five categories.
Categories abbreviated as: (1) Limited to Childhood; (2) Limited to Adulthood; (3) Management Change; (4) Presentation Change; (5) No Change. e.g. Glycogen storage disease due to acid maltase deficiency does not fit into category 1 (Limited to Childhood) or category 2 (Limited to Adulthood). Due to varying management plans across the lifespan (regardless of the presentation), it falls under category 3 (Management Change). Systemic primary carnitine deficiency also presents across the lifespan, so it does not fit into category 1 or 2. Children with this condition are more likely to experience hypoglycemia and lethargy triggered by fasting or stress, while adults are more likely to develop arrhythmias or sudden cardiac death. However, since lifelong carnitine supplementation is the standard treatment for all ages, it does not fall under category 3 but instead fits category 4 (Presentation Change). We also emphasise that there can be exceptions and atypical presentations, though we attempted to adhere to the condition descriptions based on Orphanet and OMIM.
Fig. 3. Distribution of age of onset for the 282 genetic conditions included in this study.
Many conditions have a range of onset; therefore, a single condition may be represented in multiple age periods (e.g. propionic acidemia may have clinical onset during either the neonatal or infantile timeframe). Orphanet version 1.0.9 served as source for age of onset except for two cancer conditions: Gardner and Gorlin syndromes, which are now considered to be part of broader cancer disorder spectrums: familial adenomatous polyposis (FAP) and basal cell naevus syndrome (BCNS), respectively.
Evaluating generated vignettes
In this section and onward, for brevity, we will denote GPT-3.5 as GPT and the experiment settings Llama-2-70b-chat with and without in-context prompts as 70b and 70b Context, respectively (see ‘Methods’ for details).
For the 282 genetic conditions, the Correctness, Completeness, Total score and Accuracy of the vignettes generated by 70b, 70b Context and GPT were tallied using the described clinical grading rubric (Supplementary Table 1). Each vignette was assessed using three metrics: Correctness, Completeness and Conciseness. Correctness evaluates whether the information presented is factual and free of ‘hallucinations’. Completeness measures whether the response includes the essential clinical and laboratory features relevant to the genetic disorder, considering variability in presentation among affected individuals. Since all vignettes scored 1 for Conciseness, this factor was removed from scoring considerations (for examples and details on scoring criteria and vignette evaluation, refer to the ‘Methods’ section).
Correctness and Completeness combine to form the Total score, while Accuracy is a measure of total vignette success: the Accuracy score is counted as 1 only if both the Correctness and Completeness scores receive a 1. To compare two outcomes, we conducted t-tests (paired or unpaired, depending on the groups being compared) and applied Bonferroni correction to the standard 5% false-positive threshold.
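As a concrete sketch of this scoring and testing pipeline: the score arrays below are invented for illustration, and `scipy` is assumed for the paired t-test (the paper does not specify its statistical software).

```python
from scipy import stats

def total_and_accuracy(correctness, completeness):
    """Total = Correctness + Completeness (0-2); Accuracy = 1 only if both are 1."""
    totals = [c1 + c2 for c1, c2 in zip(correctness, completeness)]
    accuracy = [int(c1 == 1 and c2 == 1)
                for c1, c2 in zip(correctness, completeness)]
    return totals, accuracy

# Paired t-test on matched child/adult scores for the same conditions,
# with a Bonferroni-corrected significance threshold (alpha divided by
# the number of comparisons, e.g. 0.05 / 10 = 0.005).
child_scores = [1, 1, 0, 1, 1, 0, 1, 1]   # fabricated example data
adult_scores = [0, 1, 0, 0, 1, 0, 1, 0]
t_stat, p_value = stats.ttest_rel(child_scores, adult_scores)
n_comparisons = 10
significant = p_value < 0.05 / n_comparisons
```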
We evaluated whether age-bias exists (i.e. comparing child to adult group) for each LLM setting with respect to each of the five disease categories: ‘Limited to Childhood’, ‘Limited to Adulthood’, ‘Management Change’, ‘Presentation Change’ and ‘No Change’ (Figs. 4 and 5).
Fig. 4. Correctness and Completeness score breakdowns for vignettes in ‘Limited to Childhood’ and ‘Limited to Adulthood’ categories.
All vignettes were clinician-graded and assigned a score of either 0 or 1 for Correctness and Completeness, for a total of 2 possible points. One marker (*) indicates significance at the α = 0.05 level between child and adult vignettes for the Correctness score only. Two markers (**) indicate significance at the α = 0.05 level between child and adult scores for both the Correctness and Completeness scores. In this chart, significance is only demonstrated for comparisons between child and adult vignettes for the same model (see Supplementary Tables 2 and 3 for statistical comparisons between 70b, 70b Context and GPT). The Bonferroni threshold for significance is p < 0.005. This chart shows the combined Correctness and Completeness scores; see Supplementary Figs. 1 and 2 for individual Correctness and Completeness graphs with variance.
Fig. 5. Correctness and Completeness score breakdowns for vignettes in ‘Management Change’, ‘Presentation Change’ and ‘No Change’ categories.
All vignettes were clinician-graded and assigned a score of either 0 or 1 for Correctness and Completeness, for a total of 2 possible points. One marker (*) indicates significance at the α = 0.05 level between child and adult vignettes for the Correctness score only. Two markers (**) indicate significance at the α = 0.05 level between child and adult scores for both the Correctness and Completeness scores. In this chart, significance is only demonstrated for comparisons between child and adult vignettes for the same model (see Supplementary Tables 2 and 3 for statistical comparisons between 70b, 70b Context and GPT). The Bonferroni threshold for significance is p < 0.005. This chart shows the combined Correctness and Completeness scores; see Supplementary Figs. 1 and 2 for individual Correctness and Completeness graphs with variance.
For the Correctness score, when conditioned on the same LLM setting (70b, 70b Context or GPT) and averaging over diseases ‘Limited to Childhood’, the generated child vignettes had statistically higher Correctness scores than the adult vignettes (p = 0.00123, p = 1.61 × 10−8 and p = 8.37 × 10−10 in 70b, 70b Context and GPT, respectively). This is expected, because LLMs are unlikely to generate reliable descriptions of adult patients for such diseases. The expected and opposite trend was observed for diseases ‘Limited to Adulthood’, where vignettes for adults on average obtained higher Correctness scores than vignettes for the child group. Two of three comparisons between child and adult vignettes in ‘Limited to Adulthood’ are statistically significant (p = 0.00281, p = 3.21 × 10−6 and p = 0.0124 in 70b, 70b Context and GPT, respectively) (see Fig. 4; see Supplementary Fig. 1 for individual Correctness graphs).
Specifically, for conditions ‘Limited to Childhood’ and ‘Limited to Adulthood’, the LLMs sometimes generated a description of a person whose age did not align with the associated condition category, which was graded 0 for Correctness (e.g. a generated vignette describing a 10-year-old child with amyotrophic lateral sclerosis (ALS), a progressive neurodegenerative disease that typically affects adults, or a vignette describing a 35-year-old adult with rhizomelic chondrodysplasia punctata, a condition in which it is rare for an affected individual to live past age 10).
In terms of the Correctness score for conditions in the other categories (‘Management Change’, ‘Presentation Change’ and ‘No Change’), there was overall good performance, particularly for 70b Context and GPT-generated vignettes (with scores ranging between 68% and 85%), with no statistically significant differences between the child and adult vignettes’ scores among all three experiment settings (70b, 70b Context, GPT) (see Fig. 5; see Supplementary Fig. 1 for individual Correctness graphs).
Considering the Completeness score, there were statistically significant differences between child and adult vignettes for the ‘Limited to Childhood’ category only; two of the three comparisons were statistically significant (p = 0.000743 for 70b Context and p = 0.00915 for GPT). There was no significant difference in Completeness score between the child and adult vignettes for the ‘Limited to Adulthood’, ‘Management Change’, ‘Presentation Change’ and ‘No Change’ disease categories, with overall good performance, particularly for 70b Context and GPT-generated vignettes (Completeness scores ranging from 73% to 94% for those two models) (see Fig. 4; see Supplementary Fig. 2 for individual Completeness graphs).
The Accuracy score was only counted as 1 if both the Correctness and Completeness scores received a 1. Statistically significant differences were observed between the Accuracy of child and adult vignettes for two of three comparisons (70b Context, GPT) made in the ‘Limited to Childhood’ category (p = 1.99 × 10−7 for 70b Context and p = 3.93 × 10−9 for GPT). A statistically significant difference was observed between child and adult vignettes for one of three comparisons (70b Context) made in the ‘Limited to Adulthood’ category (p = 3.21 × 10−6 for 70b Context) (Fig. 6). There was no statistically significant difference between the Accuracy of child and adult vignettes for the disease categories ‘Management Change’, ‘Presentation Change’ and ‘No Change’ (Fig. 7).
Fig. 6. Accuracy scores for vignettes in ‘Limited to Childhood’ and ‘Limited to Adulthood’ categories.
Statistically significant differences (*) in Accuracy scores are indicated at the α = 0.05 level between child and adult vignettes. In this chart, significant differences are only demonstrated for comparisons between child and adult vignettes for the same model (See Supplementary Tables 2 and 3 for statistical comparisons between 70b, 70b Context and GPT). The Bonferroni threshold for significance is p < 0.005.
Fig. 7. Vignette Accuracy score for vignettes in ‘Management Change’, ‘Presentation Change’, and ‘No Change’ categories.
Statistically significant differences (*) in Accuracy scores are indicated at the α = 0.05 level between child and adult vignettes. In this chart, significant differences are only observed for comparisons between child and adult vignettes for the same model (see Supplementary Tables 2 and 3 for statistical comparisons between 70b, 70b Context and GPT). The Bonferroni threshold for significance is p < 0.005.
In-context prompting has been shown to improve LLM performance18. Due to the cost of GPT, we applied in-context prompting only to Llama-2-70b-chat. The in-context prompts were constructed using two publicly available clinical genetics databases: Orphanet (https://www.orpha.net/) and GeneReviews (https://www.ncbi.nlm.nih.gov/books/NBK1116/) (more detail in Methods). The general trend shows that leveraging in-context prompts improves Llama-2-70b-chat, especially for Completeness score. For child vignettes, 70b Context scored significantly higher than 70b in Completeness in four of five total disease categories (p = 1.58 × 10−5 for the ‘Limited to Childhood’ category, p = 4.28 × 10−10 for the ‘Management Change’ category, p = 1.02 × 10−5 for the ‘Presentation Change’ category, and p = 8.66 × 10−17 for the ‘No Change’ category). Similarly, with the adult vignettes, 70b Context also scores significantly higher than 70b in Completeness in four of five total disease categories (p = 0.000893 for the ‘Limited to Adulthood’ category, p = 3.86 × 10−8 for the ‘Management Change’ category, p = 1.61 × 10−8 for the ‘Presentation Change’ category, and p = 7.76 × 10−13 for the ‘No Change’ category) (See Supplementary Table 2). This implies that Llama-2-70b-chat may not have been fully trained specifically on the data related to these criteria; hence, without in-context prompting, it does not perform consistently well.
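The mechanics of in-context prompting can be illustrated as follows; the prompt template and helper name are our own illustration, not the study's exact wording, which is described in 'Methods':

```python
# Hedged sketch of in-context prompting: prepend reference text (e.g. a
# condition summary drawn from Orphanet or GeneReviews) to the instruction
# so the model grounds its output in curated clinical information.
def build_prompt(condition: str, reference_text: str, age_group: str) -> str:
    return (
        f"Background on {condition} from a clinical genetics database:\n"
        f"{reference_text}\n\n"
        f"Using the background above, write a short medical vignette "
        f"describing a hypothetical {age_group} patient with {condition}."
    )

prompt = build_prompt(
    "GNE myopathy",
    "An adult-onset distal myopathy caused by variants in GNE...",  # placeholder text
    "adult",
)
```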
GPT-3.5, however, obtains very high Accuracy scores without any in-context prompting. While the Llama-2-70b-chat in-context prompting accuracies have higher average scores than GPT-3.5, these differences are non-significant (Supplementary Table 3). This implies that (1) in practice, although Llama-2-70b-chat is free and open-source, it may be easier to use GPT-3.5 to avoid sourcing in-context prompting information, and (2) with proper prompting, Llama-2-70b-chat performance can rival or even outperform GPT-3.5.
Overall, 70b, 70b Context and GPT comparisons between child and adult vignettes exceeded our initial expectations and did not show significant age-bias in generating descriptions of genetic conditions.
Due to the potential challenges of categorising the conditions in categories 3–5, an additional analysis was performed on these three groups after removing 23 conditions that we term ‘potential cross-category conditions’ (PCCs). These conditions, listed in Supplementary Table 4, were selected for removal because different forms of a disease might arguably belong in different categories. For example, glycogen storage disease due to acid maltase deficiency has different presentations, such as infantile-onset and late-onset forms, though these are lumped together in the Orphanet source we used to identify conditions for the study. It might be argued that these forms should be categorised separately, even though our categorisation schema treated them as a single condition in the ‘Management Change’ category. We therefore manually assessed all conditions and, as described below, performed additional analyses after removing the subset that could present such categorisation challenges.
Analyses after removing 23 PCCs, with 259 remaining diseases, showed similar trends to the original analyses. There were no significant differences between child and adult vignettes for Correctness, Completeness, Accuracy and Total score in the categories ‘Management Change’, ‘Presentation Change’ and ‘No Change’ after removing PCCs (Supplementary Figs. 3–6). Leveraging in-context prompting also improves Llama-2-70b-chat for the Completeness and Accuracy metrics (Supplementary Table 5). As in the initial analyses, there were no significant differences between 70b Context and GPT-3.5 for any of the categories after removing PCCs (Supplementary Table 6).
Mean age presented in vignettes for the ‘Limited to Childhood’ and ‘Limited to Adulthood’ categories
We performed a sub-analysis to examine mean age of the hypothetical child and adult patient in the ‘Limited to Childhood’ and ‘Limited to Adulthood’ vignettes generated by the LLMs (See Fig. 8). Comparing 70b to 70b Context for child vignettes in conditions ‘Limited to Childhood’, the mean age decreased from 3.2 years to 1.4 years when in-context prompting was added, which was a statistically significant difference (p = 2.32 × 10−5). Additionally, the mean age decreased from 3.2 years to 2.2 years when comparing child vignettes generated through 70b and GPT, respectively; this was also statistically significant (p = 0.005401).
Fig. 8. Mean age in years presented in vignette output was calculated for ‘Limited to Childhood’ and ‘Limited to Adulthood’ conditions.
A ‘Limited to Childhood’ condition is defined as one in which survival is limited due to disease severity or in which manifestations may resolve prior to the second decade of life. A ‘Limited to Adulthood’ condition is defined as one that presents in adolescence or later. The standard deviation of age is presented in the error bars, shown in the positive direction only. Significance (*) is indicated at the α = 0.05 level between 70b, 70b Context and GPT for each age group. The Bonferroni threshold for significance is p < 0.00833.
For adult vignettes in the ‘Limited to Adulthood’ category, a significant difference in mean age was found for one comparison: 70b Context vs GPT. The mean age for adult vignettes is 40.8 years for 70b; when context is added, there is a nonsignificant change, with the mean age decreasing to 40.2 years. Using GPT, the mean age for adult vignettes is 44 years, which is significantly higher than 70b Context (p = 0.00311) but not significantly higher than 70b (p = 0.0863).
For conditions in the ‘Limited to Adulthood’ category, 70b, 70b Context and GPT could still (albeit inaccurately) generate vignettes for young patients (age < 18 years). However, these vignettes contain patients in a higher age range (mean age: 7.7–11.2 years) compared to child vignettes within the ‘Limited to Childhood’ category (mean age: 1.4–3.2 years). This likely reflects that the LLMs have been trained on data, such as case reports or other descriptions in the medical literature, that describe these conditions in older individuals.
Similarly, in the ‘Limited to Childhood’ category, adult vignettes (patients aged > 18 years) were inaccurately generated for conditions where most patients either die in childhood or the manifestations resolve prior to adulthood. These incorrect vignettes were generated across all three experiments (70b, 70b Context, GPT); however, the hypothetical patients in these vignettes had a relatively younger adult age (mean age: 32.5–33.2 years) compared to the adult vignettes for the ‘Limited to Adulthood’ category (mean age: 40.2–44 years). This again reflects that the LLMs are at least partially trained on data indicating that these conditions are most likely to present at younger ages.
In two instances, GPT appropriately refused to provide such contradictory vignettes. When asked to generate a child vignette for ALS, a condition from the ‘Limited to Adulthood’ category, it responded: ‘I’m sorry, but it’s not possible for a child to have ALS. ALS is a progressive neurodegenerative disease that typically affects adults, and its onset is rare in individuals younger than 20 years of age…’. When asked to generate an adult vignette for a condition from the ‘Limited to Childhood’ category, it responded: ‘Infantile myofibromatosis is a rare condition that typically affects infants and young children. However, in rare cases, it can also affect adults. Here is a brief medical vignette of an adult with infantile myofibromatosis…’.
Mean age presented in child vignettes for metabolic conditions
From the original cohort of 282 conditions (disregarding the 5 disease categories), we evaluated LLM performance on the child vignettes conditioned only on inherited metabolic diseases (sometimes called ‘inborn errors of metabolism’ or ‘biochemical genetic disorders’), which are genetic conditions that result from a missing or defective enzyme in the body, disrupting how the body makes or uses proteins, fats, or carbohydrates. These conditions typically present very early in life and are important because several of them can lead to death at a very young age.
We analysed the mean age presented in the child vignettes for three groups of metabolic diseases: (1) ‘all metabolic conditions’ contained within our original cohort, n = 74; (2) a sub-group of ‘metabolic conditions listed within the Recommended Uniform Screening Panel (RUSP) for newborn screening (NBS)’ as of April 22, 2024 (e.g. phenylketonuria), n = 21 (https://www.hrsa.gov/advisory-committees/heritable-disorders/rusp); and (3) a smaller sub-group of ‘metabolic conditions with acute neonatal crisis’, in which these acute events are critical medical issues that tend to present soon after birth (e.g. maple syrup urine disease (MSUD)), n = 11 (see Supplementary File 1).
We intentionally focused on analysing child vignettes rather than adult vignettes, as most of these metabolic conditions have childhood onset. The results showed a statistically significant decrease in mean age generated by 70b Context, from 3.9 years to 0.8 years, between the ‘all metabolic conditions’ group and the group of ‘metabolic conditions with acute neonatal crisis’ (p = 0.006481). Additionally, there was a similar statistically significant decrease in mean age generated by GPT, from 3.9 years to 1.2 years, between the two groups (p = 0.002258). We also observed a statistically significant decrease in mean age for the ‘metabolic conditions with acute neonatal crisis’ (from 2.1 years to 0.8 years) when comparing 70b to 70b Context (p = 0.00565) (see Fig. 9).
Fig. 9. Mean child age presented in vignette output for metabolic conditions.
Subgroupings of metabolic conditions were selected from the larger cohort of 282 conditions. The largest subgrouping contains ‘all metabolic conditions’ within this study. The next grouping is only those ‘metabolic conditions listed within the RUSP for NBS’. The smallest grouping is that of ‘metabolic conditions with acute neonatal crisis’. Significance (*) is indicated at the α = 0.05 level, between Acute Neonatal Crisis conditions, Metabolic NBS conditions and Metabolic conditions. The Bonferroni threshold for significance is p < 0.00566. NBS: Newborn Screening.
Sex bias observations in LLM vignette output according to LLM and prompt type
From each vignette, sex signifiers such as ‘her’ or ‘his’ were used to assign sex to the described individual. Of the 1692 generated vignettes, 27 did not have sex signifiers and instead used the wording ‘the patient’. Compared to 70b and 70b Context, GPT was less likely to provide sex signifiers for the child prompt, with 23 of 282 (8.1%) vignettes lacking sex descriptions. Overall, sex ratios were heavily skewed toward male, particularly for 70b child prompts (7.3:1 male to female ratio; see Supplementary Table 7).
Adult vignettes yielded a more even distribution of sex ratios (e.g. GPT child 4.4:1 compared to GPT adult 2.7:1). When context was applied in 70b, more female vignettes were generated, either lowering the male to female ratio for the child vignettes or skewing the adult male to female ratio more heavily female (1.5:1 to 1:1.8). When removing conditions that disproportionately affect one sex (Supplementary Table 8), we did not observe any significant change in the male to female ratios. As expected, vignettes generated for conditions that disproportionately affect males demonstrated a male bias, and those that disproportionately affect females overall showed a female bias, though not to the same degree as the male conditions (i.e. male to female ratios ranging from 3.9:1 to 32:1 for male conditions and 1:1 to 1:11 for female conditions; Supplementary Table 9). We emphasise that we extracted sex terms from the LLM output and used those to assess conditions that may have different effects in individuals of different biological sexes, such as those related to X-linked inheritance patterns.
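Sex-signifier extraction of this kind can be sketched with a simple pronoun search; the patterns and first-match tie-breaking rule below are illustrative assumptions, not the study's exact procedure:

```python
import re

# Look for the first gendered term in the vignette text; vignettes that
# only refer to 'the patient' fall through to 'unspecified'. Real vignettes
# may need more careful handling (e.g. pronouns referring to relatives).
MALE = re.compile(r"\b(he|his|him|boy|man)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers|girl|woman)\b", re.IGNORECASE)

def assign_sex(vignette: str) -> str:
    m = MALE.search(vignette)
    f = FEMALE.search(vignette)
    if m and (not f or m.start() < f.start()):
        return "male"
    if f:
        return "female"
    return "unspecified"
```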
Mode of inheritance of conditions included in the study
The Orphanet database provides the mode of inheritance for each of the 282 conditions in this study. Forty-five (16.0%) of the conditions had multiple modes of inheritance listed, whereas 229 (81.2%) had a single mode of inheritance and 8 (2.8%) are considered to typically occur sporadically. Supplementary Fig. 7 shows the distribution of the modes of inheritance for the 282 conditions, with autosomal recessive being the most common inheritance pattern. To determine the impact of age of onset of a condition on the distribution of modes of inheritance, we compared ‘Limited to Childhood’ conditions to ‘Limited to Adulthood’ conditions (Supplementary Fig. 8). Interestingly, there is a shift of the predominant mode of inheritance from autosomal recessive in ‘Limited to Childhood’ to autosomal dominant in ‘Limited to Adulthood’ conditions.
Evaluating generated dialogues for the ‘Management Change’ and ‘Presentation Change’ categories
Using Llama-2-70b-chat, we designed a self-conversation environment to simulate a medical dialogue between a patient/family and a geneticist, with the generated vignettes as input background information (see Methods). Dialogues were only generated for disorders in two categories: Management Change and Presentation Change. Furthermore, within these categories, dialogues were only generated for conditions whose vignettes scored 1 for both Correctness and Completeness (i.e. accurate vignettes) when graded by our clinicians, because these vignettes are used as the initial prompt to provide background information to the LLM when generating the dialogues. In other words, the patient in each generated vignette becomes the patient visiting a geneticist in our dialogues, and the LLM builds the patient-geneticist conversation from the background information in this vignette. A total of 41 disorders fit these criteria and had dialogues generated for them.
For the generated dialogues, we observed trends similar to those for the vignettes; that is, for both of the studied categories (‘Management Change’ and ‘Presentation Change’), the Correctness, Completeness and Compassion scores were not statistically different with respect to age (p = 0.352, p = 0.548 and p = 0.903, respectively). The mean Total score was 13/15 (87%) for child dialogues and 12.9/15 (86%) for adult dialogues, a nonsignificant difference with respect to age (p = 0.838) (see Table 1). Additionally, the open-source Llama-2-70b-chat demonstrated high performance in terms of quality of communication, with Compassion scores ranging from 90% to 99% across all generated dialogues.
Table 1.
Mean dialogue scores and statistical significance in Correctness, Completeness, Compassion and Total metrics for dialogues generated in Management Change (n = 26) and Presentation Change (n = 15) with respect to age
| | Correctness (0–5) | | Completeness (0–5) | | Compassion (0–5) | | Total (0–15) | |
|---|---|---|---|---|---|---|---|---|
| | Child | Adult | Child | Adult | Child | Adult | Child | Adult |
| Management Change (mean) | 4.00 | 4.27 | 4.23 | 4.04 | 4.54 | 4.50 | 12.8 | 12.8 |
| Management Change (p-value) | 0.327 | | 0.434 | | 0.830 | | 1 | |
| Presentation Change (mean) | 3.93 | 4.07 | 4.27 | 4.27 | 4.93 | 4.93 | 13.1 | 13.3 |
| Presentation Change (p-value) | 0.743 | | 1 | | 1 | | 0.857 | |
| All dialogues (mean) | 4.20 | 3.98 | 4.12 | 4.24 | 4.67 | 4.68 | 13.0 | 12.9 |
| All dialogues (p-value) | 0.352 | | 0.548 | | 0.903 | | 0.838 | |
Dialogues were only generated using Llama-2-70b-chat if the vignette for the associated condition and age group was accurate. Each dialogue was graded on a Likert scale, with 5 points available per metric and 15 total points available. Paired t-tests were conducted between the child and adult dialogues for the Correctness, Completeness, Compassion and Total scores. Significant differences at the α = 0.05 significance level are indicated with a marker (*), and all p-values were assessed against a Bonferroni-corrected threshold (p < 0.00417).
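The Bonferroni-corrected threshold quoted in the footnote (p < 0.00417) is consistent with dividing α = 0.05 by 12 comparisons (four metrics across three row groupings); a minimal sketch of that correction, assuming 12 tests:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    # Per-test significance threshold after Bonferroni correction
    return alpha / n_tests

def flag_significant(p_values, alpha: float = 0.05):
    # True where a p-value survives the corrected threshold
    thr = bonferroni_threshold(alpha, len(p_values))
    return [p < thr for p in p_values]
```

With 12 tests, `bonferroni_threshold(0.05, 12)` gives approximately 0.00417, matching the footnote.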
Evaluating generated management plans for the ‘Management Change’ category
Unlike the observations for vignettes and dialogues, management plans generated by Llama-2-70b-chat (n = 53) scored lower (see Table 2), ranging from 55% to 66% for Correctness and Completeness and from 83% to 89% for Conciseness, with no statistically significant difference between age groups for Correctness, Completeness or Conciseness (p = 0.642, p = 0.182 and p = 0.371, respectively). This may be expected because we only provided the patient's age and sex as in-context data and did not provide any other details about the diseases.
Table 2.
Summary of management scores and statistical comparison of child and adult management plans generated by Llama-2-70b-chat and GPT-3.5
| | Correctness | | Completeness | | Conciseness | | Accuracy | |
|---|---|---|---|---|---|---|---|---|
| | Child | Adult | Child | Adult | Child | Adult | Child | Adult |
| 70b Mean score (n = 53) | 0.55 | 0.58 | 0.66 | 0.65 | 0.83 | 0.89 | 0.34 | 0.26 |
| 70b p-value (n = 53) | 0.642 | | 0.182 | | 0.371 | | 0.322 | |
| Low-scoring 70b Mean score (n = 10) | 0.20 | 0.20 | 0.10 | 0.10 | 0.10 | 0.20 | 0 | 0 |
| GPT-3.5 Mean score (n = 10) | 0.90 | 0.90 | 0.80 | 0.50 | 1 | 1 | 0.70 | 0.40 |
| GPT-3.5 p-value (n = 10) | 1 | | 0.193 | | 1 | | 0.0811 | |
| Low-scoring 70b vs GPT-3.5 p-value (n = 10) | 0.000953* | 0.00132* | 0.00132* | 0.0368 | 1.00 × 10⁻⁵* | 0.000953* | 0.00132* | 0.0368 |
Paired t-tests were conducted between the child and adult scores for Correctness, Completeness, Conciseness and Accuracy of clinician-graded management plans generated by Llama-2-70b-chat (n = 53). Only management plans that scored a maximum of 1 out of 3 points with Llama-2-70b-chat were repeated using GPT-3.5 (n = 10). Because of the very low sample size (n = 10), permutation tests were conducted between the child and adult scores for Correctness, Completeness, Conciseness and Accuracy of clinician-graded management plans generated by GPT-3.5. Permutation tests were also conducted between the low-scoring Llama-2-70b-chat entries and GPT-3.5 for Correctness, Completeness, Conciseness and Accuracy scores in the child and adult age groups (n = 10). Significant differences at the α = 0.05 significance level are indicated with a marker (*), and all p-values were assessed against a Bonferroni-corrected threshold (p < 0.00833).
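A paired permutation test of the kind mentioned in the footnote can be sketched as sign-flipping of the paired differences. This is an illustrative implementation under our own assumptions (two-sided test, 10,000 permutations), not the authors' code:

```python
import random
from statistics import mean

def paired_permutation_test(child, adult, n_perm=10000, seed=0):
    # Two-sided paired permutation test: randomly flip the sign of each
    # paired (child - adult) difference and count how often the permuted
    # mean difference is at least as extreme as the observed one.
    rng = random.Random(seed)
    diffs = [c - a for c, a in zip(child, adult)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            hits += 1
    return hits / n_perm
```

With only n = 10 pairs there are 2¹⁰ = 1024 distinct sign assignments, so an exact enumeration would also be feasible; random sampling is shown for generality.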
Since the three individual performance metrics were low, the Accuracy score was also low, ranging from 26% to 34%, with no statistically significant difference between the child and adult age groups (p = 0.322). GPT was used to generate child and adult management plans for the conditions that scored low with 70b, which showed statistically significant improvements in Correctness for child plans (p = 0.000953) and adult plans (p = 0.00132). Completeness scores for GPT-generated child and adult plans improved compared to the low 70b scores (child: 10% to 80%; adult: 10% to 50%). The improvement in child plans was significant (p = 0.00132), while the improvement in adult plans was not statistically significant. This non-significant difference for the adult plans could be attributed to GPT's significantly higher Conciseness score, which might have lowered Completeness as a consequence of presenting the information concisely. As a result of this non-significant improvement for the adult plans, the adult Accuracy score did not show an overall significant improvement (p = 0.0368). Conversely, for the child group, GPT resulted in significantly improved Accuracy (p = 0.00132).
Discussion
LLMs and related methods are increasingly being used in clinical practice, and it is imperative to understand how well they work in different scenarios. LLMs can produce unintentional age- or sex-related biases in their responses simply because their training datasets are disproportionately distributed with respect to these attributes19,20. This study examines whether age biases exist in LLM responses for genetic conditions whose manifestations and management plans may change with age. In our study, the tested LLM models were Llama-2-70b-chat and GPT-3.5, and the generated outputs were medical vignettes, patient-doctor dialogues and management plans. The evaluation covered 282 prevalent genetic conditions (259 when removing the PCCs) and a subgroup of 74 metabolic conditions. Very rare disorders were excluded from this study, and we plan to expand the number of diseases in future work.
Our original assumption was that, for most conditions, due to the focus on paediatric presentations of disease (including the importance of early diagnosis and intervention), we would observe an effect of the preponderance of written records describing clinical findings and treatment plans for the child group versus the adult group. Hence, we originally expected LLM responses to be more realistically plausible when generating medical vignettes, patient-doctor dialogues and management plans for the child group than for the adult group. To our surprise, Llama-2-70b-chat (with and without in-context prompting) and GPT-3.5 (without in-context prompting) did not show any obvious age-related biases.
For the medical vignettes, among the diseases in ‘Management Change’, ‘Presentation Change’ and ‘No Change’, Figs. 4–7 and Supplementary Figs. 1 and 2 did not show statistically significant differences in the Correctness and Completeness scores between the child and adult groups. The LLMs only failed at generating medical vignettes of an adult patient for diseases in ‘Limited to Childhood’ (and, likewise, child vignettes for diseases in ‘Limited to Adulthood’); however, this behaviour is fully expected.
Besides being unaffected by obvious age biases, 70b (with in-context prompting) and GPT (without in-context prompting) both achieved surprisingly high performance at generating medical vignettes: both Correctness and Completeness scores were at least 0.68 out of 1, with most scores close to 0.8 out of 1, averaged over the 282 diseases (Supplementary Figs. 1 and 2).
These findings suggest that, contrary to our initial suspicions, 70b Context and GPT were well-trained with respect to age distribution and thus produce plausible descriptions of disease manifestations (e.g. vignettes) for both the child and adult group. This is particularly helpful due to the lack of formal genetics training/curriculum in many internal medicine and primary care residency programmes despite the increasing need and interest among those trainees in genetics as they continue to take care of more adult individuals with an underlying genetic condition, and as more people with genetic conditions survive into adulthood or are diagnosed later in life21–24. Nowadays, many of these clinical trainees, as well as others, use those LLMs on a frequent basis whether during their clinical service or while studying for medical exams and board certifications to retrieve information and learn more about the clinical features, mode of inheritance and outcomes of these genetic conditions25–27.
The mean age of a child generated by the LLM across all child vignette outputs was 5 ± 3.4 years. We also evaluated LLM performance on the child vignettes conditioned only on metabolic diseases (Fig. 9), as certain kinds of metabolic diseases are expected to manifest at different stages of life. We considered three groups of metabolic diseases: (1) ‘all metabolic conditions’ contained within our original cohort; (2) a sub-group of ‘metabolic conditions listed within the RUSP for NBS’; and (3) a smaller sub-group of ‘metabolic conditions with acute neonatal crisis’, for which these acute events are a very important medical issue (e.g. MSUD). Results showed a statistically significant decrease in the mean age generated by 70b Context, from 3.9 years for the ‘all metabolic conditions’ group to 0.8 years for the ‘metabolic conditions with acute neonatal crisis’ group (p = 0.00648). With GPT, an even more significant difference was seen, with mean ages of 3.9 years versus 1.2 years for the same two disease sets, respectively (p = 0.00226). This likely reflects differences in training data and may point towards the knowledge of these LLM models about the very early onset of these ‘metabolic conditions with acute neonatal crisis’, which tend to present soon after birth. This significant difference in mean age was not observed when comparing the ‘all metabolic conditions’ group to the group of ‘metabolic conditions listed within the RUSP for NBS’, possibly because several of these conditions have insidious onset and, if treated early in life, can result in an increased frequency of later-onset information related to longer-term sequelae (e.g. for phenylketonuria and homocystinuria). Thus, when we examine subsets of disease, we can see that the LLMs assign ages that correlate with clinical expectations based on important medical issues.
We also analysed the mode of inheritance in the generated medical vignettes with respect to the 282 selected conditions. This analysis does not strictly address age-related biases; however, we found it an interesting question to help us understand how genetic conditions were described in relation to another important area in medical genomics. We found a shift in inheritance mode from autosomal recessive in ‘Limited to Childhood’ to autosomal dominant in ‘Limited to Adulthood’ conditions (Supplementary Fig. 8). A possible explanation is that autosomal recessive conditions tend to be more severe than autosomal dominant conditions. This is likely because disease genes with recessive inheritance are under different selective pressures than those with dominant inheritance and thus may, in general, manifest disease at different ages (and become ‘Limited to Childhood’). One example is autosomal recessive polycystic kidney disease (ARPKD), which typically presents much earlier and is more severe than autosomal dominant polycystic kidney disease (ADPKD)28–30. We note that this is a generalisation and does not take into account scenarios such as deleterious de novo variants; this caution about generalisations should be considered for all of our analyses. Our analysis of the PCCs is motivated by a similar line of thought and underscores that it is important to bear in mind that patients may not always follow traditional textbook patterns (Supplementary Table 4).
From the generated vignettes, we further generated dialogues between a hypothetical clinical geneticist and a hypothetical patient, using these vignettes as in-context prompts. This simulation aims to reveal insights about how an actual patient may converse with the LLM. Overall, Llama-2-70b-chat could generate realistically plausible dialogues, with mean Total scores of 13/15 (87%) and 12.9/15 (86%) for the child and adult groups, respectively (Table 1). In terms of quality of communication, the Compassion scores were highest, ranging from 4.5 to 4.93/5 (90–99%) across all generated dialogues. This is especially important given the potential use of these LLMs and chatbots for conversations by patients as well as physicians, including to draft responses to patients or medical colleagues; however, further exploration of this technology is warranted given the high-risk nature of clinical communication. In this paper, we only graded each generated dialogue in its entirety; that is, we focused on how plausible the entire dialogue appears. However, some issues emerge upon close observation. For example, there can be a robotic quality to these generated dialogues, which would hopefully not occur in actual patient-doctor conversations. Future studies will focus on analysing these types of nuances.
Unlike the generated vignettes and dialogues, management plans generated by Llama-2-70b-chat obtained low Correctness and Completeness scores (55–66%) but better Conciseness (83–89%) for both the child and adult age groups (Table 2). One potential explanation is that Llama-2-70b-chat was not provided with comprehensive in-context information, but rather only the patient's age and sex as in-context prompts. When GPT-3.5 was used to generate management plans, the Accuracy score significantly improved for the child group (p = 0.00132) but showed a smaller, non-significant improvement for the adult group (p = 0.0368) after multiple-testing correction. These results generally support the view that these LLM models encode a wealth of semantic knowledge about genetic conditions and have high conversational abilities with quality communication, but are still not ready or safe for critical decisions related to management plans, particularly as they lack real-world implementation evidence31,32. We emphasise the following main difference between the management plans and the dialogues. Although the dialogues were also assessed for the appropriateness of the management recommendations conveyed to the patient during the conversation, those recommendations were assessed from a more general viewpoint and with different expectations. The dialogues described an initial evaluation during a first encounter between the patient and the geneticist and focused on discussing possible expected diagnoses, possible clinical outcomes and initial recommendations, including sending testing to confirm suspected diagnoses, referring to other subspecialists and scheduling follow-up visits to review testing results and provide more specific recommendations in subsequent visits. Conversely, when generating management plans, the LLMs were asked specific management questions for a specific known diagnosis, and therefore a more comprehensive and specific output was expected.
Our future studies will focus on more detailed analyses of the dialogues and management plans. For example, if certain phenotypes were excluded during the conversation, we would evaluate how the final disease diagnosis may change. For management plans, we could include or exclude certain test results and then observe how the LLM response changes. Moreover, we generated vignettes and dialogues with respect to two broad age groups, child and adult. In future work, we will focus on narrower age categories, where vignettes, dialogues and management plans are generated and graded with respect to many age brackets (e.g. neonatal, infancy, childhood, adolescence, adult, elderly).
Regarding model limitations, we applied Llama-2-70b-chat and GPT-3.5 without editing any default parameters, such as top_k and top_p, which can affect the generation process. Moreover, one could finetune an LLM specifically on medical datasets or on the generated vignettes and dialogues15. This would require a large amount of well-curated data (whether curated automatically or manually). For example, to finetune their model, Tu et al. generated and auto-evaluated many hypothetical vignettes and dialogues, though most of these were not geared specifically towards genetic conditions with potential age-related manifestations and management plans. However, as mentioned previously, we encountered both computational cost and annotation problems when following the Tu et al. auto-evaluation protocol (e.g. few-shot in-context learning).
However, in many situations, the average user would not finetune an LLM on their own datasets. Rather, we suspect that the average user would rely on useful prompts to guide LLM output. For these reasons, in this paper we evaluated the performance of only pre-trained LLMs on diseases whose manifestations and management may differ with respect to age. Our approach is likely to reveal more plausible insights about how LLMs are being used in practice by both clinicians and patients regarding these genetic conditions.
In-context prompting continues to be an effective strategy for improving the performance of LLMs in a specific knowledge area33. Its primary limitation is each LLM's token limit: Llama-2-70b-chat has a limit of 4096 tokens, and after incorporating information from both Orphanet and GeneReviews, this limit is almost reached. Other studies have investigated the use of bulk scientific literature (including online searches outside of medical databases) to guide their own LLMs15. However, due to computational limitations and the token limit of the open-source Llama-2-70b-chat, we were unable to incorporate much more than the Orphanet and GeneReviews descriptions. Despite this, Llama-2-70b-chat still performed impressively compared to GPT-3.5 at generating medical vignettes, considering the model size difference (see Supplementary Tables 2 and 3). In future work, we plan to deploy the experiment on real-life clinical notes and/or conversations, where there can be much more variability among the input context for the same disease. Such settings would enable the LLM-generated vignettes and dialogues to be much more diverse.
The LLM token limit can also affect the complexity of the generation process. For example, previous work implemented a ‘critic agent’ that knows the ground-truth and provides feedback to the ‘doctor agent’ after the first chat session15. This feedback and all the first chat session data would then be used as an in-context prompt for the second chat session. Hence, the second chat session would tend to provide higher quality dialogues. With the token limit of Llama-2-70b-chat, we were only able to implement the ‘patient’ and ‘doctor’ agent without the ‘critic’ (i.e. without feedback to improve the ‘doctor agent’ in the second chat session). Moreover, we could only generate the dialogues for one single chat session to avoid using all the tokens.
While LLMs are being used and studied in a variety of ways in medicine, one area of particular interest in medical genetics involves identifying or diagnosing unknown conditions or recommending genetic testing based on a patient's manifestations, such as through assessing clinical notes. Much of this will likely take place through the type of multi-turn conversations we used in this study. For example, a clinician may have an LLM analyse clinical notes and then give further prompts so that the LLM helps provide recommendations regarding diagnosis, testing, management or other areas. Similar scenarios will occur where patients directly access an LLM; they may take their medical records and ask similar questions of the model. The ability of LLMs to perform such tasks depends on how well they ‘portray’ genetic conditions in their outputs, including via the types of multi-turn conversations we analysed. Against this background, the motivation of this study was to explore how LLMs portray genetic conditions in different types of patients (based on age) and across different categories of condition. This type of analysis is important for understanding whether there are populations for which LLMs perform variably. For example, LLMs used to identify diagnoses, suggest genetic testing or provide management plans from clinic notes might work much better for patients of a certain age group or with certain conditions; our paper aimed to explore these types of questions. Thus, one application of our paper relates to methods and approaches to help assess and benchmark LLMs in clinical situations, whether the LLM is being used by a clinician or a patient. This could lead to efforts to improve the performance of LLMs in areas where they underperform (or where optimal performance is especially critical). Beyond this, another potential application involves education.
We anticipate that LLMs may be useful to help train clinicians (including future clinicians such as medical students) in hypothetical patient scenarios. Through an LLM (e.g. via multi-turn dialogues), clinicians and trainees could practise interacting with realistic patient situations, and our assessment framework could give them structured ways to evaluate both the output and their own performance. Naturally, it would also be necessary to ensure that LLMs work well across different populations and conditions in this regard.
In summary, the tested LLMs exceeded expectations in addressing genetic conditions, even in adult scenarios. However, despite these impressive capabilities, at least open-source LLMs (and likely all LLMs) still have significant limitations in providing age-related management decisions, and their outputs should be treated with significant caution given the wide use of these LLMs by patients, medical providers and providers in training.
Methods
Generating and scoring vignettes
We generated medical vignettes for each genetic condition using two LLMs: Llama-2-70b-chat and GPT-3.5. With the open-source Llama-2-70b-chat, we used Biowulf, NIH’s high-performance server. Even with this server, due to the Llama-2-70b-chat model’s size, we needed to run its 8-bit C++ version to efficiently optimise memory usage and speed. The closed-source GPT-3.5 was accessed through a private Azure OpenAI instance, provided via the NIH, at a discounted institutional cost.
Two vignettes for each condition were generated by querying the LLMs: one for a patient (we use this term to refer to an individual affected by a genetic condition) during childhood and one for a patient during adulthood. First, these vignettes were generated using a ‘simple prompt’, without any in-context prompting. Our simple prompt was: ‘Generate a brief medical vignette of a [child/adult] with [genetic condition]. Incorporate the most important clinical features’, where we varied the age group of the patient (i.e. child or adult) and the name of the genetic condition.
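The ‘simple prompt’ above is a template with two slots (age group and condition name); a trivial sketch of filling it (the function name is ours, not from the paper):

```python
def simple_prompt(age_group: str, condition: str) -> str:
    # Reproduces the paper's 'simple prompt', varying age group and condition
    return (
        f"Generate a brief medical vignette of a {age_group} with "
        f"{condition}. Incorporate the most important clinical features"
    )
```

For example, `simple_prompt("child", "Marfan syndrome")` yields the child-group query for that condition.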
Second, due to the cost of GPT-3.5, we applied in-context prompting only to the Llama-2-70b-chat model to generate two additional vignettes for each condition. Therefore, for each condition a total of 4 vignettes were generated by Llama-2-70b-chat (child and adult vignettes with and without in-context prompt) and 2 vignettes were generated by GPT-3.5 (child and adult vignettes without in-context prompt).
The in-context prompts for Llama-2-70b-chat were constructed using two publicly available clinical genetics databases: Orphanet (https://www.orpha.net/) and GeneReviews (https://www.ncbi.nlm.nih.gov/books/NBK1116/). We extracted information from these sources using the Orphanet Scientific Knowledge Files (https://www.orphadata.com/orphanet-scientific-knowledge-files/) and the GeneReviews NLM Open Access Subset (https://ftp.ncbi.nlm.nih.gov/pub/litarch/ca/84/), respectively. From these sources, we downloaded the relevant information for genetic conditions in XML format from both Orphanet and GeneReviews. The XML files were then parsed to retrieve two key descriptions: the Orphanet ‘Clinical description’ section and the GeneReviews ‘Clinical characteristics’ section. Some conditions in our cohort (n = 282) did not have a corresponding page in GeneReviews; in these cases, only the ‘Clinical description’ section from Orphanet was extracted, and the GeneReviews portion was omitted. These data were then prepended to the ‘simple prompt’ to create the full prompt with in-context data, which we refer to as the ‘in-context prompt’ (example prompts in Supplementary File 1). For brevity, we refer to the vignettes generated with in-context prompts for the child group as ‘in-context child’, and likewise ‘in-context adult’ for the adult group (see Supplementary File 1 for simple prompts, in-context prompts and all generated vignettes).
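The parse-and-prepend step can be sketched as follows. Note the element names below are simplified stand-ins of our own; the real Orphanet and GeneReviews XML schemas use different tags and nesting:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a database record (illustrative schema only)
EXAMPLE_XML = """<Disorder>
  <Name>Example condition</Name>
  <TextSection>
    <TextSectionType>Clinical description</TextSectionType>
    <Contents>Affected individuals typically present with early-onset features.</Contents>
  </TextSection>
</Disorder>"""

def extract_section(xml_text: str, section_type: str) -> str:
    # Walk the parsed tree and return the contents of the requested section
    root = ET.fromstring(xml_text)
    for section in root.iter("TextSection"):
        if section.findtext("TextSectionType") == section_type:
            return section.findtext("Contents", default="").strip()
    return ""

def in_context_prompt(context: str, simple_prompt: str) -> str:
    # The extracted descriptions are prepended to the simple prompt
    return f"{context}\n\n{simple_prompt}"
```

When a condition lacks a GeneReviews page, only the Orphanet section would be passed as `context`, mirroring the fallback described above.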
Other resources exist besides Orphanet and GeneReviews. However, the Llama-2-70b-chat token limit is 4096 tokens, which constrains the total amount of in-context input data. We also did not preprocess the Orphanet ‘Clinical description’ and GeneReviews ‘Clinical characteristics’ information, for example by removing editorial comments about specific clinical nuances that are not strictly required for the vignette generation process. Our rationale is that the average user would likely use the in-context data as-is. Moreover, having more related information in the in-context prompt typically improves the LLM output34,35.
Previous studies have introduced approaches for using an LLM to grade its own output (or the output of another LLM)15,36. For example, one can auto-evaluate the generated vignettes by using in-context learning with few-shot examples (e.g. 5 in-context examples)15. In-context learning ideally requires accurate human manual grading for several vignette instances of the same disease; moreover, these instances should contain both good and bad vignettes37. However, because it can require a significant amount of time and expertise to grade a single vignette, and because we aimed to evaluate LLMs on many conditions under various settings (e.g. paediatric versus adult, with and without in-context prompting), we decided to focus on grading a single vignette from many different diseases, rather than grading many different vignettes from the same disease.
Second, in practice, for the same prompt we did not observe a wide range of variability among the vignettes generated using different random seeds with Llama-2-70b-chat (see Supplementary Table 10 for the results of repeated runs). A RoBERTa vector embedding was used to compare the first generated vignette against each subsequent repeated run with a different random seed38. Other kinds of document embedding exist; however, RoBERTa was used successfully in similar studies33. Supplementary Table 10 shows high similarity scores among the vignettes generated with the same prompt but with different random seeds.
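A common way to score similarity between a reference vignette and a repeated run is the cosine between their document embeddings; the sketch below shows only that final step, assuming the embedding vectors have already been produced (e.g. by a RoBERTa encoder, which is not included here):

```python
import math

def cosine_similarity(a, b) -> float:
    # Cosine of the angle between two embedding vectors;
    # 1.0 means identical direction, 0.0 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Scores near 1.0 across seeds would correspond to the low run-to-run variability reported in Supplementary Table 10.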
This low variability implies that, for future studies, we would need to carefully design different in-context prompts for the same disease so that the corresponding vignettes would be at varying levels of accuracy. Ideally, in-context prompts should not be fully unrelated to the disease descriptions; thus, we need to carefully determine which plausible phenotypes to include or exclude from the in-context prompt. This is a difficult task because accuracy can be equally affected by omitting rare phenotypes (manifestations) and by omitting common ones. For example, Wilson disease includes the rare phenotype ‘Kayser-Fleischer ring’, which might be considered a ‘pathognomonic’ feature (i.e. a feature that greatly helps in identifying a specific disease). Conversely, Rett syndrome includes the common phenotypes developmental delay, mobility issues and repetitive hand movements; thus, for Rett syndrome, it is not immediately obvious which phenotype or set of phenotypes should be included or excluded. Moreover, the grading criteria are further complicated by the fact that some conditions have many more associated phenotypes than others.
Third, most open-source models do not have a large token limit, and thus cannot handle many vignette examples for in-context learning. Proprietary models with large token limit would require significant additional computation cost. For these reasons, we manually graded each vignette, and reserve the topic of auto-evaluation for future studies.
Manual scoring of the vignettes was performed by 3 clinicians (a paediatric geneticist, a general practitioner and a genetic counsellor; all clinicians were able to use any available resources to help with scoring, and all have been involved in clinical genetics research for at least several years). Each graded 6 different vignettes for approximately one-third of the 282 conditions: child (70b), adult (70b), child (70b Context), adult (70b Context), child (GPT) and adult (GPT). A scoring rubric was used to grade the vignettes to help ensure consistency. There are multiple ways to score LLM output; we used a version of a previously described method (https://docs.rungalileo.io/galileo/gen-ai-studio-products/galileo-evaluate), as we felt it adequately provides a way to judge LLM output. We assessed each vignette for Correctness, Completeness and Conciseness, assigning a score of 0 (not correct, complete or concise) or 1 (correct, complete or concise), yielding a maximum Total score of 3. Since all vignettes received scores of 1 for Conciseness, we removed this consideration from vignette scoring (see Supplementary Table 2 for the vignette scoring rubric). The Correctness score measures whether a given response is factual; that is, we want to avoid ‘hallucinations’, which are presentations of information that are logically consistent and could easily be believable to a nonexpert but are factually incorrect (e.g. a vignette describing an individual affected with campomelic dysplasia as having long limbs, whereas campomelic dysplasia is actually associated with short limbs). The Completeness score measures how thoroughly the LLM response covered important features relevant to the genetic disorder.
Considering that not all individuals affected by a genetic condition display all the textbook characteristics, but instead demonstrate varying degrees of severity and different manifestations, we considered a vignette complete if it exhibited the most essential features (based on the sources we used) along with sufficient additional clinical and laboratory indicators for a given condition. For example, a vignette of a patient with Friedreich ataxia would be considered incomplete if the description did not include impaired muscle coordination (ataxia) and impaired speech (dysarthria). To enable additional scoring, a separate metric, ‘Accuracy’, was marked as 1 if a vignette received 1 for both Correctness and Completeness.
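Under this rubric, the derived Accuracy flag is simply the conjunction of the two binary scores. A minimal sketch (the function name and dictionary layout are our own illustration, not taken from the study code):

```python
def score_vignette(correctness: int, completeness: int) -> dict:
    """Binary rubric scores (0 or 1 each); Accuracy requires both to be 1."""
    return {
        "Correctness": correctness,
        "Completeness": completeness,
        "Total": correctness + completeness,
        "Accuracy": int(correctness == 1 and completeness == 1),
    }

# A vignette that is factual but omits essential features:
# correct yet incomplete, so Accuracy is 0.
result = score_vignette(1, 0)
```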
To ensure consistency and agreement between the three clinician evaluators, independent scoring of ten conditions per pairing (30 conditions total) was done by each pair of clinicians (Clinicians 1&2, Clinicians 1&3 and Clinicians 2&3). The conditions, shown in Supplementary Table 11, were intentionally selected to overrepresent categories 3–5, since these were somewhat more challenging to categorise. Further, we intentionally included different types of genetic conditions, including inborn errors of metabolism, multi-malformation disorders, neurologic and neuromuscular conditions, haematologic disorders, oncologic conditions, dermatologic conditions and others. Though these were not selected to proportionally represent all genetic conditions, we did intentionally attempt to ensure representation of a variety of conditions affecting different organ systems. The kappa statistic was used to measure inter-rater reliability, which showed no significant disagreement among the 3 clinicians39 (see Supplementary Table 12).
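Inter-rater agreement on binary rubric scores of this kind can be quantified with Cohen's kappa. A minimal standard-library sketch with hypothetical scores from two clinicians (the study's actual computation may have used different tooling):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical Correctness scores (0/1) from two clinicians on 10 conditions.
a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(a, b)  # ~0.737: substantial agreement
```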
To assess other possible biases within the LLM-generated vignettes, we extracted identifiers (namely age and sex) from the vignettes and compared them to each condition’s characteristic age of onset and sex-specific prevalence, using the Orphanet database (https://www.orpha.net/en/disease) as the source. Age bias was assessed by analysing the hypothetical patient’s age in the medical vignettes generated by the LLM. For the categories ‘Limited to Childhood’ and ‘Limited to Adulthood’, the mean age was calculated separately for child and adult vignettes by averaging all the ages generated by the LLM for each respective group.
Sex bias was identified when the variables isolated from the vignettes did not match the conditions’ specific characteristics. While fewer than 1% of vignettes presented explicit biological sex identifiers (e.g. 46,XY), 96.8% did provide suggestive sex identifiers (e.g. ‘his’ and ‘her’), so we extrapolated female/male based on these identifiers. Given that the general population is roughly 50% male and 50% female, deviations from this ratio in vignette sex output were considered sex bias. Conditions that preferentially affect one sex were also considered when examining sex bias.
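A deviation from the expected 50/50 sex ratio can be checked with an exact binomial test. The sketch below uses hypothetical counts, and the paper does not specify this particular test; it sums the probabilities of all outcomes no more likely than the observed count:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test p-value: the total probability of
    all outcomes no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return min(1.0, sum(pr for pr in probs if pr <= observed + 1e-12))

# Hypothetical: 40 of 53 vignettes read as male, for conditions with
# no known sex-specific prevalence -- a significant skew.
p_value = binom_two_sided_p(40, 53)
```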
After PCC were removed from the original cohort of conditions, the vignette scores for the remaining 259 conditions were re-analysed using the same scoring criteria. T-tests were conducted to assess statistical differences between child and adult vignette scores, as well as between the various LLMs.
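A child-versus-adult score comparison of this kind can be sketched as a two-sample t-test. The function below implements Welch's variant from the standard library, with hypothetical scores; the paper does not state which t-test variant or software was used:

```python
from statistics import mean, variance
from math import sqrt

def welch_t(x, y):
    """Welch's t statistic and degrees of freedom for two independent
    samples (e.g. child vs adult vignette Total scores)."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)   # sample variances (n-1 denominator)
    se2 = vx / nx + vy / ny
    t = (mean(x) - mean(y)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2**2 / ((vx / nx)**2 / (nx - 1) + (vy / ny)**2 / (ny - 1))
    return t, df

# Hypothetical Total scores (0-2 after dropping Conciseness).
child = [2, 2, 1, 2, 0, 2, 1, 2, 2, 1]
adult = [2, 1, 1, 2, 0, 1, 1, 2, 1, 1]
t, df = welch_t(child, adult)
```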
Generating and scoring dialogues
The same Llama-2-70b-chat instance plays both the role of the patient/family (Agent 1) and the geneticist (Agent 2), in a multi-turn conversation with itself. Through the Llama-2-70b-chat infrastructure, the ‘system’ input argument, which is not part of the prompt, provided the LLM instructions for both roles. Figure 10 illustrates how these instructions were given to each chat agent.
Fig. 10. Instructions given to each agent in the self-conversation process to generate a dialogue using Llama-2-70b-chat, based on an accurate medical vignette.
Instructions were presented to each instance of Llama-2-70b-chat, for a maximum of 6 turns to produce each dialogue.
With the generated medical vignettes as prompts, the patient/family agent describes the clinical manifestations and reasons for visiting the geneticist and asks relevant medical questions. The geneticist agent then offers guidance towards potential differential diagnoses and management options. The geneticist agent was specifically instructed to ask the patient/family agent pertinent questions to reveal more relevant medical information.
Figure 11 summarises this multi-turn conversation procedure with an example of a child vignette about the condition Stickler syndrome. The dialogue starts with Llama-2-70b-chat acting as the patient/family agent initiating the conversation, using the vignette as an in-context prompt to summarise their clinical features and ask for guidance. Llama-2-70b-chat then acts as the geneticist agent, and its corresponding reply is appended to the dialogue history. Due to the token limit, the conversation continues for a maximum of 6 turns, allowing 3 responses per agent. All previous outputs are appended to the dialogue history and used as in-context information for the next output. However, the patient vignette is only visible to the LLM when the patient/family agent is responding, as referential information about the condition. Throughout the dialogue, the geneticist role does not have access to the patient vignette; it only has access to the information provided by the patient/family agent.
Fig. 11. Llama-2-70b-chat example self-conversation process to generate a dialogue between the patient/family and geneticist, based on an accurate medical vignette.
See Supplementary File 1 for full dialogues and vignettes. The bracket indicates a shortened representation of the input; for example, [medical vignette] represents the entire generated vignette of a child with Stickler syndrome. For each instance of the geneticist agent, the [full conversation up to this point] does not contain the [medical vignette].
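The multi-turn self-conversation can be sketched as a simple alternating loop. Here, `chat` is a stand-in stub rather than the actual Llama-2-70b-chat call, and the system prompts are paraphrased from Fig. 10 rather than quoted:

```python
def chat(system_prompt, history):
    # Stand-in stub for a Llama-2-70b-chat call; a real implementation
    # would send system_prompt plus the formatted history to the model.
    return f"(turn {len(history) + 1}) reply as: {system_prompt[:30]}..."

def generate_dialogue(vignette, max_turns=6):
    """Alternate patient/family and geneticist turns. The vignette is
    visible only on patient/family turns, never to the geneticist role."""
    patient_sys = ("Act as the patient/family described in this vignette "
                   "and ask the geneticist for guidance: " + vignette)
    geneticist_sys = ("Act as a clinical geneticist; ask pertinent questions "
                      "and suggest differential diagnoses and management.")
    history = []  # shared dialogue history, appended after every turn
    for turn in range(max_turns):
        role = "patient/family" if turn % 2 == 0 else "geneticist"
        system = patient_sys if role == "patient/family" else geneticist_sys
        history.append((role, chat(system, history)))
    return history

dialogue = generate_dialogue("A 5-year-old boy with features of Stickler syndrome.")
```

With a 6-turn cap, each role contributes exactly 3 utterances, matching the procedure described above.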
Considering the auto-evaluation difficulties confronted when scoring vignettes, grading of the generated dialogues was also done manually, and by only one clinician, since inter-rater reliability from the vignette scoring showed no significant disagreement based on the kappa statistic (see Supplementary Table 12). Because dialogues were longer and more detailed than the vignettes, we applied a scoring rubric to assess three metrics: (1) Correctness, which assessed how realistic and appropriate the clinical description and management recommendations were relative to the genetic diagnosis, patient age and sex; (2) Completeness, which assessed whether the dialogue provided a complete list of the important clinical and laboratory features and management options relevant to the genetic disorder; (3) Compassion, which assessed whether the conversation showed appropriate empathy and reassurance, provided resources, and avoided medical jargon or other terminology that would be very challenging for a layperson to understand. Each of the three metrics was graded on a 5-point Likert scale (ranging from 1 for strongly disagree to 5 for strongly agree) to match and capture the overall length of the dialogue (see Supplementary Table 13 for the dialogue scoring rubric).
Generating and scoring management plans
Considering the easy accessibility of LLMs to the public, including their use by patients, as well as the wide reliance of physicians on online resources to aid in medical decision making and guide patient management11,12,16,40, we decided to specifically test LLM ability in answering direct questions about management recommendations for genetic conditions while considering different age groups. For this purpose, we selected conditions in the ‘Management Change’ category (n = 53) and generated two responses for each of these conditions, one for a child and one for an adult. We prompted Llama-2-70b-chat to provide a management plan for a patient with each condition. This hypothetical patient was assigned the same age and sex as in the generated vignette. We note that we did not provide the full generated vignettes or any other data as an in-context prompt. We simply asked the LLM to provide appropriate recommendations, including medications, genetic testing and specialist referrals. This was done to simulate a more realistic usage of an LLM by the average user: it is unlikely that a patient or family member of someone with a genetic condition would source and use highly specialised in-context prompting when using an LLM.
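Constructing such a zero-shot query, with no vignette or other in-context data, might look like the following; the exact prompt wording and the example condition are illustrative, not taken from the study:

```python
def management_prompt(condition, age, sex):
    """Zero-shot prompt: no vignette or other in-context data is supplied,
    mirroring how an average user might query the model directly."""
    return (
        f"Provide a management plan for a {age}-year-old {sex} "
        f"with {condition}. Include recommended medications, "
        f"genetic testing and specialist referrals."
    )

# Hypothetical example condition, age and sex.
prompt = management_prompt("Marfan syndrome", 34, "female")
```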
These management plans were manually scored by one clinician using the same rubric used for scoring the vignettes (See Supplementary Table 1), with Correctness, Completeness and Conciseness all considered, for a maximum of 3 total points. For conditions whose management plans received a low score (0 or 1 out of 3 total points) from Llama-2-70b-chat, we used GPT-3.5 to generate alternative answers. The goal was to observe whether GPT-3.5 would provide more accurate results for the cases in which Llama-2-70b-chat performed poorly.
Acknowledgements
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. This work utilised the computational resources of the NIH HPC Biowulf cluster.
Author contributions
A.A.O., K.A.F., D.D., R.L.W. and B.D.S. were responsible for conceiving and conducting experiments, analysing the results and writing and reviewing the manuscript. S.L.H. and P.H. were responsible for conducting experiments, analysing the results and reviewing the manuscript.
Data availability
All data are available in the main manuscript, accompanying figures, tables and supplementary materials.
Code availability
Code to download the model, generate Llama-2-70b-chat responses (i.e. vignettes, dialogues, management plans) and run statistical tests can be accessed at: https://github.com/flahartyka/LLM-aging-in-genetic-conditions.
Competing interests
B.D.S. is the editor-in-chief of the American Journal of Medical Genetics, and receives textbook royalties from Wiley, Inc.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Amna A. Othman, Kendall A. Flaharty.
Contributor Information
Amna A. Othman, Email: amna.othman@nih.gov
Benjamin D. Solomon, Email: solomonb@mail.nih.gov
Supplementary information
The online version contains supplementary material available at 10.1038/s41514-025-00226-z.
References
- 1. Solomon, B. D., Nguyen, A. D., Bear, K. A. & Wolfsberg, T. G. Clinical genomic database. Proc. Natl. Acad. Sci. USA 110, 9851–9855 (2013).
- 2. Solomon, A. M. & Solomon, B. D. Age-related survey of clinical genetics literature and related resources. Am. J. Med. Genet. C Semin. Med. Genet. 193, 103–108 (2023).
- 3. Duong, D. et al. Neural networks for classification and image generation of aging in genetic syndromes. Front. Genet. 13, 864092 (2022).
- 4. Jenkins, B. D. et al. The 2019 US medical genetics workforce: a focus on clinical genetics. Genet. Med. 23, 1458–1464 (2021).
- 5. Singh, H. et al. Cystic fibrosis-related mortality in the United States from 1999 to 2020: an observational analysis of time trends and disparities. Sci. Rep. 13, 15030 (2023).
- 6. Sabo, A. et al. Community-based recruitment and exome sequencing indicates high diagnostic yield in adults with intellectual disability. Mol. Genet. Genom. Med. 8, e1439 (2020).
- 7. Drivas, T. & Gold, J. O38: Universal exome sequencing in critically ill adults: a diagnostic yield of 25% and race-based disparities in access to genetic testing. Genet. Med. Open 2, 101024 (2024).
- 8. Waikel, R. L. et al. Recognition of genetic conditions after learning with images created using generative artificial intelligence. JAMA Netw. Open 7, e242609 (2024).
- 9. Duong, D. & Solomon, B. D. Analysis of large-language model versus human performance for genetics questions. Eur. J. Hum. Genet. 32, 466–468 (2024).
- 10. Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
- 11. Huang, A. S., Hirabayashi, K., Barna, L., Parikh, D. & Pasquale, L. R. Assessment of a large language model’s responses to questions and cases about glaucoma and retina management. JAMA Ophthalmol. 142, 371–375 (2024).
- 12. Haver, H. L. et al. Appropriateness of breast cancer prevention and screening recommendations provided by ChatGPT. Radiology 307, e230424 (2023).
- 13. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
- 14. Zheng, Y. et al. Rare and complex diseases in focus: ChatGPT’s role in improving diagnosis and treatment. Front. Artif. Intell. 7, 1338433 (2024).
- 15. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature, 10.1038/s41586-025-08866-7 (2025).
- 16. Wilhelm, T. I., Roos, J. & Kaczmarczyk, R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J. Med. Internet Res. 25, e49324 (2023).
- 17. Alkuraya, I. F. Is artificial intelligence getting too much credit in medical genetics? Am. J. Med. Genet. C Semin. Med. Genet. 193, e32062 (2023).
- 18. Flaharty, K. A. et al. Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions. Am. J. Hum. Genet. 111, 1819–1833 (2024).
- 19. Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W. Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2979–2989 (Association for Computational Linguistics, 2017).
- 20. Díaz, M., Johnson, I., Lazar, A., Piper, A. M. & Gergle, D. Addressing age-related bias in sentiment analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, paper 412, 1–14 (Association for Computing Machinery, 2018).
- 21. Riegert-Johnson, D. L. et al. Outline of a medical genetics curriculum for internal medicine residency training programs. Genet. Med. 6, 543–547 (2004).
- 22. Falah, N., Umer, A., Warnick, E., Vallejo, M. & Lefeber, T. Genetics education in primary care residency training: satisfaction and current barriers. BMC Prim. Care 23, 156 (2022).
- 23. Chenbhanich, J. et al. Increased family history documentation in internal medicine resident continuity clinic at a community hospital through resident-led structured genetic education program. J. Community Genet. 13, 347–354 (2022).
- 24. Karam, P. E. et al. Genetic literacy among primary care physicians in a resource-constrained setting. BMC Med. Educ. 24, 140 (2024).
- 25. Munaf, U., Ul-Haque, I. & Arif, T. B. ChatGPT: a helpful tool for resident physicians? Acad. Med. 98, 868–869 (2023).
- 26. Wong, R. S., Ming, L. C. & Raja Ali, R. A. The intersection of ChatGPT, clinical medicine, and medical education. JMIR Med. Educ. 9, e47274 (2023).
- 27. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
- 28. Furney, S. J., Albà, M. M. & López-Bigas, N. Differences in the evolutionary history of disease genes affected by dominant or recessive mutations. BMC Genom. 7, 165 (2006).
- 29. Bergmann, C. Early and severe polycystic kidney disease and related ciliopathies: an emerging field of interest. Nephron 141, 50–60 (2019).
- 30. Bergmann, C. Genetics of autosomal recessive polycystic kidney disease and its differential diagnoses. Front. Pediatr. 5, 221 (2017).
- 31. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
- 32. Ahn, M. et al. Do as I can, not as I say: grounding language in robotic affordances. In Proceedings of the 6th Conference on Robot Learning 287–318 (PMLR, 2022).
- 33. Jimenez Gutierrez, B. et al. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In Findings of the Association for Computational Linguistics: EMNLP 2022, 4497–4512 (Association for Computational Linguistics, 2022).
- 34. Zhou, H. et al. Teaching algorithmic reasoning via in-context learning. Preprint at arXiv:2211.09066 (2022).
- 35. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
- 36. Huang, H., Qu, Y., Liu, J., Yang, M. & Zhao, T. An empirical study of LLM-as-a-judge for LLM evaluation: fine-tuned judge models are task-specific classifiers. Preprint at arXiv:2403.02839 (2024).
- 37. Min, S. et al. Rethinking the role of demonstrations: what makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 11048–11064 (Association for Computational Linguistics, 2022).
- 38. Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at arXiv:1907.11692 (2019).
- 39. McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).
- 40. Fisch, U., Kliem, P., Grzonka, P. & Sutter, R. Performance of large language models on advocating the management of meningitis: a comparative qualitative study. BMJ Health Care Inform. 31, 10.1136/bmjhci-2023-100978 (2024).
- 41. Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, n71 (2021).