Abstract
Objective:
To compare the performance of three artificial intelligence (AI) classification strategies against manually classified National Institutes of Health (NIH) cardiac arrest (CA) grants, with the goal of developing a publicly available tool to track CA research funding in the United States.
Methods:
Three AI strategies—traditional machine learning (ML), large language model (LLM) zero-shot learning, and LLM few-shot learning—were compared to manually categorized CA grant abstracts from NIH RePORTER (2007–2021). Traditional ML used a regularized logistic regression model trained on embedding vectors generated by OpenAI’s text-embedding-3-small model. Zero-shot learning, using GPT-4o-mini, classified grants based on task descriptions without labeled examples. Few-shot learning included six example grants. Models were evaluated on a balanced 20% holdout test set using accuracy, precision (positive predictive value), recall (sensitivity), and F1 score (harmonic mean of precision and recall).
Results:
Out of 1,505 grants categorized, 378 (25%) were identified as CA research, yielding 302 grants in the holdout test set, 76 of which were CA research. The few-shot approach performed best, achieving the highest accuracy (0.90) and the best balance of precision and recall (F1 score 0.82). In contrast, traditional ML had the lowest accuracy (0.87) and the highest precision (0.89) but suffered from poor recall, with approximately 2.5 times more false negatives than either generative approach. The zero-shot approach outperformed traditional ML in accuracy (0.88) and recall (0.86) but had lower precision (0.72).
Conclusion:
AI can rapidly identify CA grants with excellent accuracy and very good precision and recall, making it a promising tool for tracking research funding.
Keywords: machine learning, cardiac arrest, research funding, National Institutes of Health
Introduction
For more than a decade, federal funding for cardiac arrest (CA) research has been low compared to other leading causes of death and disability in the United States.[1, 2] This disparity is likely multifactorial, stemming from a small number of investigators submitting relatively few applications, a further constrained pool of senior resuscitation scientists to mentor junior colleagues or serve on federal grant review committees, and finite available federal resources.
A challenge to the growth of resuscitation science as a field is the relative lack of transparency regarding the annual NIH investment. Although the NIH provides detailed accounting for more than 350 disease categories in its annual categorical spending report, CA is not listed.[3] As a result, quantifying grants and funding requires manual review and identification of grant abstracts from NIH RePORTER,[1] which is resource intensive and requires field-specific expertise.
As an alternative to these methods, artificial intelligence (AI) may offer an efficient, accurate, and transparent solution for identifying CA grants. Therefore, we sought to compare the performance of three AI classification strategies against manually classified grants from 2007 to 2021, with the goal of developing a publicly available tool for researchers, stakeholders, advocacy groups, funding agencies, and members of the public to access CA funding characteristics.
Methods
Using our previously published search strategy (querying the terms "cardiac arrest," "cardiopulmonary resuscitation," "heart arrest," "circulatory arrest," "pulseless electrical activity," "ventricular fibrillation," and "resuscitation"), grant abstracts from NIH RePORTER from 2007 to 2021 were individually reviewed and categorized as CA research (yes/no) using predefined inclusion and exclusion criteria.[1] This manual (entirely human-based) categorization process provided the ground-truth labels for the present study and has a published kappa of 0.86,[1] indicating excellent inter-rater reliability. Three AI classification strategies were then employed to automate the identification of CA grants: traditional machine learning (ML),[4] large language model (LLM) zero-shot learning,[5] and LLM few-shot learning.[5]
All models were evaluated on an identical 20% (302 abstracts) holdout test set chosen with stratified random sampling to mirror the class balance of the entire dataset (25% CA research, 75% not CA research). We chose this 80/20 train-test ratio because traditional ML requires training (fitting) on a relatively large sample. By contrast, zero-shot learning requires no task-specific training data, and few-shot learning requires only a few (in this study, six) task-specific examples. To compare performance on the same data for each approach, the 20% holdout set was used to judge each method's accuracy, precision (positive predictive value, PPV), recall (sensitivity), and F1 score (harmonic mean of precision and recall). For accuracy, precision, and recall, exact 95% confidence intervals were calculated using the Clopper-Pearson method, treating each classification as a Bernoulli trial. For the F1 score, the delta method was used to approximate the standard error and compute the 95% confidence interval. Common probability-based metrics were not used because the zero- and few-shot approaches return only a category label.
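As a concrete illustration, the point estimates for these metrics follow directly from a 2x2 confusion matrix. The sketch below uses counts consistent with the published few-shot results (n=302 holdout abstracts, 76 CA research); the counts are reconstructed from the reported precision, recall, and accuracy rather than copied from Figure 1.

```python
# Evaluation metrics from a 2x2 confusion matrix, as used in this study.
# The counts below are reconstructed to match the reported few-shot
# results (an assumption, not a verbatim transcription of Figure 1).
def metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)   # positive predictive value (PPV)
    recall = tp / (tp + fn)      # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=66, fp=19, fn=10, tn=207)
# accuracy ≈ 0.904, precision ≈ 0.776, recall ≈ 0.868, F1 ≈ 0.820
```

The confidence intervals in Table 1 would then follow from the Clopper-Pearson method applied to each count-based proportion and from a delta-method approximation for F1.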
For the traditional ML approach, we employed a regularized logistic regression model (elastic-net with an L1 ratio of 0.5, balancing the L1 and L2 penalties while combating overfitting[4]). Abstracts were converted into embedding vectors using OpenAI's (San Francisco, CA) text-embedding-3-small model. This process converts each abstract into a numeric vector that projects the document into a learned semantic vector space. The values in the embedding matrix used in this conversion are learned as part of training larger text-generation models such as OpenAI's GPT-4o. A regularized logistic regression model was then fitted to the 80% training set, using the vectors as explanatory data and a target variable of 1 if the grant was expert-labeled as CA research and 0 otherwise. This model was then used to predict labels (with a probability threshold of 0.5) for the 20% holdout set.
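A minimal sketch of this pipeline is shown below. To keep the example self-contained, random vectors stand in for the OpenAI text-embedding-3-small outputs and the labels are synthetic; the elastic-net settings mirror the Methods, while the remaining hyperparameters (C, max_iter) are illustrative assumptions.

```python
# Sketch: elastic-net logistic regression over document embedding vectors.
# Random vectors stand in for text-embedding-3-small outputs so the
# example runs without an API key; labels are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_docs, dim = 500, 64  # study scale: 1,505 abstracts, 1,536-dim embeddings
X = rng.normal(size=(n_docs, dim))               # placeholder "embeddings"
y = (X @ rng.normal(size=dim) > 0).astype(int)   # placeholder CA / not-CA labels

# Stratified 80/20 split, mirroring the holdout design in the Methods
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Elastic-net penalty with an L1 ratio of 0.5, as described above;
# C and max_iter are illustrative choices, not taken from the paper
clf = LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                         solver="saga", C=1.0, max_iter=5000)
clf.fit(X_tr, y_tr)

# Classify the holdout set with a 0.5 probability threshold
pred = (clf.predict_proba(X_te)[:, 1] >= 0.5).astype(int)
holdout_accuracy = float((pred == y_te).mean())
```

In the actual study, `X` would be the matrix of embedding vectors returned by the embeddings API, one row per grant abstract.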
Next, for the zero-shot learning approach, using the GPT-4o-mini model from OpenAI, we provided a prompt (text input) describing the classification task without any labeled examples. That is, we gave the model, in plain text, the same instructions provided to expert labelers in our original publication,[1] using the exact description of CA research copied from that manuscript.[1] These instructions and each holdout abstract were provided to the model one at a time. Each time, the model generated a label based solely on this plain-text task description; that is, the model responded 302 times with either the text "CA research" or "not CA research" in response to our instructions plus the individual abstract to be labeled. We made each request to the model separately, so there was no possibility of "memory" or of the model considering abstracts in context with one another. Zero-shot learning is so named because it involves no training or learning for the specific task being performed,[6] relying instead on the general-purpose training of the model being used (in this case, GPT-4o-mini).
Finally, for the LLM few-shot learning model, we added six randomly selected example grants from the 80% training set (i.e., not included in the test set; three labeled as CA research and three as not) to the same prompt used in the zero-shot experiment to guide the model's classification.[6] In the same fashion as the zero-shot experiment, the model then returned, one at a time, the label "CA research" or "not CA research" for each of the holdout abstracts.
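The zero- and few-shot experiments differ only in whether labeled examples precede the abstract to be classified. The sketch below shows one way such prompts could be assembled; the message structure, helper names, and label-parsing rule are our assumptions for illustration, the actual instruction text is the published CA-research definition[1] (not reproduced here), and the API call itself appears only as a comment.

```python
# Hypothetical sketch of zero- vs few-shot prompt assembly. Helper names
# and message structure are illustrative assumptions; only the overall
# pattern (task description, optional labeled examples, one abstract per
# independent request) follows the Methods.
LABELS = ("CA research", "not CA research")

def build_messages(task_description, abstract, examples=()):
    """Assemble a chat prompt; an empty `examples` tuple gives zero-shot,
    a populated one (six pairs in this study) gives few-shot."""
    msgs = [{"role": "system", "content": task_description}]
    for ex_abstract, ex_label in examples:  # few-shot: labeled demonstrations
        msgs.append({"role": "user", "content": ex_abstract})
        msgs.append({"role": "assistant", "content": ex_label})
    msgs.append({"role": "user", "content": abstract})
    return msgs

def parse_label(response_text):
    """Map the model's free-text reply onto one of the two category labels."""
    text = response_text.strip().lower()
    return LABELS[0] if text.startswith("ca research") else LABELS[1]

# Each abstract would be sent in its own request (no shared context), e.g.:
# reply = client.chat.completions.create(model="gpt-4o-mini",
#             messages=build_messages(task, abstract, examples))
```

Because every abstract is sent in a fresh request, no classification can be influenced by a previously seen abstract, matching the "no memory" design described above.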
Results
The initial search strategy returned 4,648 grants, many of which were duplicates (e.g., grant continuation or renewal applications). After removal of duplicates, a total of 1,505 individual grants were characterized, 378 (25%) of which were CA research, yielding 302 grants in the holdout test set, 76 of which were CA research. Results of the three approaches are summarized in Table 1. The few-shot approach performed best, with the highest accuracy (0.90) and the best balance between precision (PPV) and recall (sensitivity), as indicated by an F1 score of 0.82. In contrast, traditional ML had the worst accuracy (0.87) and recall (0.54) but the highest precision (0.89), with approximately 2.5 times as many false negatives as either generative approach. The zero-shot approach outperformed traditional ML on accuracy (0.88) and recall (0.86) but had worse precision (0.72). Figure 1 presents the confusion matrix for each of the three experiments.
Table 1.
Performance characteristics of three classification experiments in the identification of NIH-funded cardiac arrest grants. (95% CI)
| | Zero Shot | Few Shot | Traditional Machine Learning |
|---|---|---|---|
| Accuracy | 0.881 (0.839, 0.915) | 0.904 (0.865, 0.935) | 0.868 (0.824, 0.904) |
| Precision (positive predictive value) | 0.722 (0.618, 0.812) | 0.777 (0.673, 0.860) | 0.891 (0.764, 0.964) |
| Recall (sensitivity) | 0.855 (0.756, 0.926) | 0.868 (0.771, 0.935) | 0.539 (0.421, 0.654) |
| F1 Score | 0.783 (0.719, 0.847) | 0.820 (0.760, 0.880) | 0.672 (0.582, 0.763) |
Figure 1.

Confusion matrix for each of the three experiments: zero-shot, few-shot, and traditional machine learning. Clockwise from top left (dark blue) are True Negatives, False Positives, True Positives, False Negatives.
Discussion
These results suggest a promising method for rapidly and accurately identifying CA grants and tracking the annual NIH investment in resuscitation science research. The strong relative performance of few-shot learning, which requires no task-specific model training (only a handful of labeled examples), suggests little to no machine learning expertise is needed to achieve reasonable automated performance on text-labeling tasks in the era of large language models.[7] Current NIH reporting does not include CA but does feature more than 350 disease categories, ranging from broad scopes such as "cancer" to highly specific ones like "Cooley's Anemia".[3] Adding new categories requires requests from Congress, the White House, advocacy groups, or NIH leadership.[8] Given the significant disease burden associated with cardiac arrest,[9–11] it warrants official recognition as a reported disease entity. Until then, the AI models presented in this analysis can help bridge the gap.
Access to these grants provides critical insights, including total funding, common grant mechanisms, funding institutes, potential collaborators and mentors, and the young investigator pipeline, which are essential for advocacy and the advancement of resuscitation science. One of the current challenges facing this field of research is the modest overall funding[1, 2] and the relatively small number of funded young investigators.[12] Shining a light on NIH investments through grant identification and tracking will empower groups like the American Heart Association to further advocate for increased support and growth of resuscitation science.
Although our focus was on CA research, this tool has broad applicability as it could be used to identify grants for any unlisted disease category without requiring intensive training on large datasets or extensive AI expertise. Efforts to optimize the tool, enhance the user interface, and make it publicly available are currently underway.
Key limitations of our approach include the fixed set of six examples for few-shot learning. A more robust approach would average over additional and larger combinations of examples for few-shot learning, or possibly fine-tune an LLM for specific use cases. Additionally, we did not explore the parameter space for traditional machine learning or the space of other potential model types. A natural language processing and machine learning expert may be able to train a special-purpose model that would outperform all approaches tried here. However, we have demonstrated that an easy-to-use approach with few to no task-specific training examples can achieve reasonable performance. We have focused on ease of use, model training speed, and rapid-cycle innovation, with the aim of developing a publicly available and widely adoptable tool in the near future.
Conclusion
In conclusion, AI can rapidly identify CA grants with excellent accuracy and very good precision and recall. While the classifications are not perfect, grants were generally correctly classified using an LLM, which achieved the best results when provided with relatively few (six) labeled examples. Additional work is needed to further optimize this tool and explore the added value of generative classification for other disease categories.
SOURCES OF FUNDING
Dr. Coute’s effort was funded by NHLBI K23H166692 and American Heart Association https://doi.org/10.58275/AHA.24SCEFIA1248727.pc.gr.193948
Footnotes
Conflict of Interest:
All other authors: none to report.
DISCLOSURES
RAC: NHLBI K23H166692, University of Alabama at Birmingham Comprehensive Cardiovascular Center and Integrative Center for Aging Research Pilot Award, and American Heart Association https://doi.org/10.58275/AHA.24SCEFIA1248727.pc.gr.193948
Declaration of generative AI and AI-assisted technologies
The authors used GPT-4o by OpenAI to proofread the initial draft. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
REFERENCES
- [1] Coute RA, Panchal AR, Mader TJ, Neumar RW. National Institutes of Health-Funded Cardiac Arrest Research: A 10-Year Trend Analysis. J Am Heart Assoc. 2017;6.
- [2] Coute RA, Kurz MC, Mader TJ. National Institutes of Health investment into cardiac arrest research: A study for the CARES Surveillance Group. Resuscitation. 2021;162:271–3.
- [3] Estimates of Funding for Various Research, Condition, and Disease Categories (RCDC). National Institutes of Health. Accessed November 4, 2024. https://report.nih.gov/funding/categorical-spending#/
- [4] Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer; 2009.
- [5] Howell MD, Corrado GS, DeSalvo KB. Three Epochs of Artificial Intelligence in Health Care. JAMA. 2024;331:242–4.
- [6] Mishra S, Khashabi D, Baral C, Choi Y, Hajishirzi H. Reframing instructional prompts to GPTk's language. arXiv. 2021. doi:10.48550/arxiv.2109.07830.
- [7] Godwin RC, Tung A, Berkowitz DE, Melvin RL. Transforming Physiology and Healthcare through Foundation Models. Physiology (Bethesda). 2025.
- [8] Frequently Asked Questions. Research, Condition, and Disease Categories (RCDC). National Institutes of Health. Accessed December 14, 2024. https://report.nih.gov/funding/categorical-spending/rcdc-faqs
- [9] Coute RA, Nathanson BH, Panchal AR, Kurz MC, Haas NL, McNally B, et al. Disability-Adjusted Life Years Following Adult Out-of-Hospital Cardiac Arrest in the United States. Circ Cardiovasc Qual Outcomes. 2019;12:e004677.
- [10] Coute RA, Nathanson BH, DeMasi S, Mader TJ, Kurz MC, CARES Surveillance Group. Disability-Adjusted Life Years Due to Pediatric Out-of-Hospital Cardiac Arrest in the United States: A CARES Surveillance Group Study. Circ Cardiovasc Qual Outcomes. 2023;16:e009786.
- [11] Coute RA, Nathanson BH, Kurz MC, Mader TJ, Jackson EA, American Heart Association's Get With The Guidelines-Resuscitation Investigators. Disability-Adjusted Life-Years After Adult In-Hospital Cardiac Arrest in the United States. Am J Cardiol. 2023;195:3–8.
- [12] Coute RA, Huebinger R, Perman SM, Del Rios M, Kurz MC. Evaluating the National Institutes of Health Pipeline for Resuscitation Science Investigators. J Am Heart Assoc. 2024;13:e035854.
