Author manuscript; available in PMC: 2026 Jan 1.
Published in final edited form as: Headache. 2024 Dec 10;65(1):180–190. doi: 10.1111/head.14880

Artificial Intelligence Terminology, Methodology and Critical Appraisal: A Primer for Headache Clinicians and Researchers

Gina M Dumkrieger 1, Chia-Chun Chiang 2, Pengfei Zhang 3, Mia T Minen 4, Fred Cohen 5, Jennifer A Hranilovich 6
PMCID: PMC11840968  NIHMSID: NIHMS2038524  PMID: 39658951

Abstract

Objective:

The goal is to provide an overview of Artificial Intelligence and Machine Learning methodology and appraisal, tailored to clinicians and researchers in the headache field, to facilitate interdisciplinary communication and research.

Background:

The application of Artificial Intelligence to the study of headache and other healthcare challenges is growing rapidly. It is critical that these findings be accurately interpreted by headache specialists, but this can be difficult for non-Artificial Intelligence specialists.

Methods:

This paper is a narrative review of the fundamentals required to understand Machine Learning/Artificial Intelligence headache research. Using guidance from key leaders in the field of headache medicine and Artificial Intelligence, important references were reviewed and cited to provide a comprehensive overview of the terminology, methodology, applications, pitfalls, and bias of Artificial Intelligence.

Results:

We review how Artificial Intelligence models are created, common model types, methods for evaluation, and examples of their application to headache medicine. We also highlight potential pitfalls to watch for when consuming Artificial Intelligence research, and discuss ethical issues of bias, privacy, and abuse related to Artificial Intelligence. Additionally, we highlight recent research from across headache-related applications.

Conclusion:

Many promising current and future applications of Machine Learning and Artificial Intelligence exist in the field of headache medicine. An understanding of the fundamentals of Artificial Intelligence allows readers to interpret and critically appraise Artificial Intelligence-related research findings in their proper context. This paper will increase the reader’s comfort in consuming Artificial Intelligence/Machine Learning-based research and will prepare them to think critically about related research developments.

Keywords: artificial intelligence, machine learning, headache, metrics

Plain Language Summary

Artificial Intelligence (AI) is increasingly applied in headache research and in healthcare. It is important for people who are involved in headache research to be able to understand the quality and impact of AI research. In this review, we cover core concepts of AI methodology and terminology, such as machine learning, and we provide relevant examples from the headache literature; this should be a useful reference for clinicians and researchers who are interested in learning more about AI.

Introduction

Application of Artificial Intelligence (AI) and Machine Learning (ML) to healthcare has exploded in recent years, resulting in a rising number of collaborations between headache researchers and AI scientists. Research has explored using AI to facilitate the diagnosis of headache disorders, forecast migraine attacks1, predict response to therapies2, and study the association between conditions3, among other applications.

As AI moves increasingly into patient care, fluency in this emerging technology should not be left to data scientists alone. To have a voice in policy making, clinicians must be able to properly evaluate AI models and their potential impacts and utility. Therefore, readers of the headache literature should be comfortable with AI methodology and understand the research being described.

In this article, we explain the basics of AI and ML; discuss, in general and non-mathematical terms, how the methods work; describe the major families of models and the types of problems each can address; illustrate each family with examples; and cover approaches to evaluating the performance of common AI/ML models along with common pitfalls to watch out for.

Methods

This is a narrative review of common AI and ML approaches used in headache literature. Most common terms and concepts necessary to understand the related literature are reviewed. Examples are from published primary research and indexed in PubMed. Concept and terminology definitions are from textbooks and reports, written in English. Relevant online publications were included when known to the authors and when representing primary source material.

What is Artificial Intelligence?

The definition of AI has evolved over time and still varies slightly, even among its major users and stakeholders. The U.S. legislature defines AI as “a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments” (H.R.6216)4. On the other hand, MIT Technology Review, written in part for technologically minded business innovators and investors, defines AI as “the quest to build machines that can reason, learn, and act intelligently”5. The difference in emphasis is functional: for regulatory agencies the outcome is more important, whereas technology investors focus on the potential for continual progress. Our definition is more akin to those supplied by academics in biology: we define artificial intelligence (AI) as a broad term encompassing a variety of techniques for learning from structured and unstructured data, and ML as a narrower term describing the use of computers to understand data6.

Creating an AI model

Although the details are project/investigation dependent, AI modeling generally involves the following steps. First, the data are collected and cleaned; it may be necessary to transform and/or normalize the data and to perform feature engineering and/or selection. Next, a model type is chosen. The data are then used to train, tune, test, and evaluate the model. The final step is to share the results.

We will discuss each of the above steps in depth. The details of these steps differ based on the needs of each project.

A note on nomenclature

Because of simultaneous development or later redevelopment of ideas in different fields, there are often multiple terms for the same concept in AI/ML. For example, the field of statistics most commonly uses the word variables (see also independent variable or covariate) to describe the pieces of data that are the inputs of a model. In ML these are often described as features or attributes. Feature is more often used than variable to describe variables that are extracted or engineered, i.e., developed from the original raw data set. For example, if a dataset contains weight and height, a Body Mass Index feature could be “extracted”, i.e., calculated, and included as an input variable. Extracted features can also encompass more complex concepts like the principal component vectors7 of a data set or the volume of a brain region8. Embeddings is a related term that describes how non-numerical concepts are translated into meaningful computer-comprehensible representations. For example, in a Natural Language Processing (NLP) model, words have to be converted into embeddings before the computer can understand how words like “pulsating” and “throbbing” relate to each other.
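To make the idea of an extracted feature concrete, the following is a minimal Python sketch (using the pandas library; the column names weight_kg and height_m are hypothetical) of calculating a Body Mass Index feature from raw weight and height columns:

```python
import pandas as pd

# Hypothetical raw data; the column names are illustrative only.
raw = pd.DataFrame({
    "weight_kg": [70.0, 85.5, 62.0],
    "height_m": [1.75, 1.80, 1.60],
})

# "Extract" a BMI feature from the raw weight and height columns.
raw["bmi"] = raw["weight_kg"] / raw["height_m"] ** 2
print(raw)
```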

Data Preparation

Data cleaning is the step where data are prepared for use in training a model. Some experts estimate that 60–80% of the time on an AI project is spent on data cleaning. However, data cleaning is vital, as low-quality training data can result in low-quality, ungeneralizable models.

Even in a well-run clinical trial, invalid data may find their way into the data set, or there may be duplicate records that need to be removed. Records should be reviewed to ensure their validity before input. When using data from multiple studies, variables may need to be harmonized across all sources. For example, one study may record the variable sex as male=‘1’ and female=‘0’, while another study might code male=‘0’ and female=‘1’, or ‘M’/’F’, or ‘1’/’2’. This is not only applicable to categorical variables: a “ten point” scale may be 0–10 or 1–10, and a temperature may be Fahrenheit or Celsius, measured on the skin or at the core.
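As an illustration, a minimal pandas sketch of harmonizing a sex variable across two hypothetical studies that use different codings might look like the following (the column name and codings are assumptions for illustration):

```python
import pandas as pd

study_a = pd.DataFrame({"sex": [1, 0, 1]})        # study A: 1 = male, 0 = female
study_b = pd.DataFrame({"sex": ["M", "F", "F"]})  # study B: 'M' / 'F'

# Map both codings to a single convention before combining the data sets.
study_a["sex"] = study_a["sex"].map({1: "male", 0: "female"})
study_b["sex"] = study_b["sex"].map({"M": "male", "F": "female"})
combined = pd.concat([study_a, study_b], ignore_index=True)
print(combined)
```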

Longitudinal data require standardization of the timing of events. For example, when combining multiple data sources, researchers should define the elapsed time permissible between two measurements for them to be considered a single time point. These issues are likely to be encountered when working with electronic health record data. Data transformation also happens at this stage. Variables may be transformed by applying a function, such as taking the square root of each value. Normalization, where variables are transformed to have a mean of zero and variance of one9, 10, is a common transformation.
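A minimal sketch of this kind of normalization (z-scoring), here using scikit-learn’s StandardScaler on a hypothetical feature column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[3.0], [7.0], [5.0], [9.0], [1.0]])   # hypothetical raw feature values
X_norm = StandardScaler().fit_transform(X)           # transformed to mean 0, variance 1
print(X_norm.mean(), X_norm.var())
```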

This is also the stage where the approach to handling missing data begins to be considered. Incomplete records may be dropped, or missing data may be imputed by different means. As some modeling approaches readily handle missing data, these decisions should be made in conjunction with model selection. Choices about how missing data are handled can impact the validity or conclusions of the study and how applicable the resulting study may be to other circumstances. As discussed below, it is also important to exclude certain data that could bias the model.
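One simple imputation option is mean imputation, sketched below with scikit-learn on a hypothetical feature matrix; whether this is appropriate depends on the missingness mechanism and the chosen model:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value (np.nan).
X = np.array([[27.0, 6.0],
              [31.0, np.nan],
              [45.0, 12.0]])

# Replace each missing value with the mean of its column.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)
```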

In summary, readers of AI literature should look for information on how study data were collected, what preprocessing or cleaning was performed, and how missing data were handled.

Choosing the model type

The choice of model type will depend on the question being asked and the type and amount of data available. If there are enough data after cleaning, they should be divided into training, validation, and test sets, or into training and test sets. The subset of data to be used for model testing, or “test data,” should be held completely separate from the validation and training data. If there are multiple entries per individual, and the goal is, for example, to predict outcomes for new individuals, all of an individual’s data should fall entirely within the training, validation, or test set. Having data from a single individual in both the test and training sets will lead to overestimating the performance of the model on entirely new individuals.
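A minimal sketch of a split that keeps all of an individual’s entries in a single set, using scikit-learn’s GroupShuffleSplit (the data and group labels are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(10, 2)                      # 10 hypothetical diary entries
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])          # outcome for each entry
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])     # individual each entry belongs to

# Split so that every individual's entries fall entirely in train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
assert not set(groups[train_idx]) & set(groups[test_idx])   # no individual in both sets
```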

Training

The process of creating an AI model is referred to as training or fitting. In this step the clean training data are used to find a model which optimizes the objective function of that model.

Tuning

Some model types have “hyperparameters,” which are essentially control settings of the model-fitting process. For example, the minimum allowable number of samples in a branch after a split in a tree model (see below) is a hyperparameter. Tuning is the iterative process of training a model with specific hyperparameter values, evaluating the model’s performance on a validation set, and then repeating with different hyperparameter values to find the optimal ones. Test data should not be used in the tuning process.

Once the optimal hyperparameter values are selected, the final model is fit and is then ready for testing and evaluation.
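As a sketch, scikit-learn’s GridSearchCV can tune a tree hyperparameter such as the minimum samples per leaf; here cross-validation on the training data stands in for a separate validation set, and the held-out test set is used only at the end (the data are simulated):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Try several values of the hyperparameter on the training data only.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 20]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)            # best hyperparameter value found
print(grid.score(X_test, y_test))   # final evaluation on the untouched test set
```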

Evaluation

There are many different metrics for evaluating the performance of AI models in addition to the methods below. Ideally there are enough data that a model can be evaluated on the held-out test set. If that is not possible other methods such as cross-validation may be used.
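A minimal sketch of 5-fold cross-validation with scikit-learn on simulated data: each fold is held out once for scoring while the model trains on the remaining folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, random_state=0)

# One score per fold; the mean estimates out-of-sample performance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```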

Metrics

Appropriate metrics for the evaluation of an AI model will correspond with the task being performed. Regression models output a number; classification models report a discrete class, such as treatment responders/nonresponders11; clustering models report which subgroup a sample belongs to, as with headache phenotypes7, 8. In the field of Natural Language Processing, models perform field-specific tasks on written words, such as retrieval12 or text generation. Similarly, there are AI models with very specific tasks like brain magnetic resonance imaging (MRI) segmentation13. Some tasks have their own specific metrics, such as the Jaccard or DICE indices used in image segmentation. Many metrics are not covered here, and the state of the art may change quickly14. Here, we review common metrics for regression and binary classification tasks.

Regression Metrics

When used with linear regression, the mean square error (MSE) is the mean of the squared differences between each observed value and the value predicted by the regression line, i.e., the difference between the actual and the predicted. Lower MSE values indicate a better model15. The square root of the MSE can also be reported as the Root Mean Square Error (RMSE). The coefficient of determination, R2, measures the amount of variance explained by the model in a linear regression. Higher R2 is better. However, R2 can be misleading, as it never decreases with the addition of more variables and can be artificially high when unneeded variables are included in a model. Adjusted R2 takes into account the number of variables in a model16. The Akaike Information Criterion17 (AIC) and Bayesian Information Criterion (BIC) are additional fit metrics that account for model complexity, with lower AIC and BIC values generally indicating better models18. AIC and BIC are among the metrics that can be used to select between models. Note that R2 cannot be used with logistic regression; however, AIC and BIC can.
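A minimal sketch of computing MSE, RMSE, and R2 with scikit-learn, using hypothetical observed and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([4.0, 7.0, 9.0, 12.0])   # hypothetical observed values
y_pred = np.array([5.0, 6.5, 9.5, 11.0])   # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, r2)
```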

Classification metrics

For a binary classification task, the performance of the model in classifying the test (or training) set is often reported in a confusion matrix. A confusion matrix9 is a simple grid that tabulates how many of each class were correctly or incorrectly classified. (Table 1). From this, a number of other metrics can be generated (Table 2).

Table 1:

Confusion Matrix

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive          False Negative
Actual Negative      False Positive         True Negative

Table 2:

Common Performance Metrics for Binary Classification

Metric Name | Description | Formula
Accuracy | How many cases were correctly identified out of the total number of cases? | (True Positive + True Negative) / Total Cases
Balanced accuracy | What is the balance between specificity and recall? | (Specificity + Sensitivity) / 2
Precision (Positive Predictive Value, PPV) | How many of the cases labeled positive were truly positive? | True Positive / (True Positive + False Positive)
Sensitivity (true positive rate, recall) | How many of the actually positive cases were correctly identified out of the total number of actually positive cases? | True Positive / (True Positive + False Negative)
Specificity (true negative rate) | How many of the actually negative cases were correctly identified out of the number of actually negative cases? | True Negative / (True Negative + False Positive)
F1 | What is the balance between precision and recall? | 2 × (Precision × Recall) / (Precision + Recall)
AUC* (AUROC^) | How well does this model balance specificity and sensitivity over all possible classification threshold values? | See text
Kappa (κ) | How much better than chance is this model? | See text
False Positive Rate | How many of the actually negative cases were incorrectly labeled as positive? | 1 − Specificity = False Positive / (True Negative + False Positive)
False Negative Rate | How many of the actually positive cases were incorrectly labeled as negative? | False Negative / (True Positive + False Negative)
* AUC: Area Under the Curve
^ AUROC: Area Under the Receiver Operating Characteristic curve

Accuracy is the most basic metric for a classification problem but can be misleading when the classes are unbalanced, meaning there are significantly more samples in one class than the other.9 For example, if a data set has 95 positive samples and five negative samples, 95% accuracy could be achieved by classifying all samples as positive. Balanced accuracy, F1 and Cohen’s kappa 19 (κ) are classification metrics useful when the classes are unbalanced. F1 balances performance of the positive and negative classes. κ is a measure of inter-rater reliability often used with classification models. It accounts for how well the samples could have been classified by chance. A higher κ value is better. F1 pays special attention to the positive class, while balanced accuracy and κ do not.
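A minimal sketch of computing several of these metrics with scikit-learn, using small hypothetical label vectors with unbalanced classes:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, confusion_matrix, f1_score)

# Hypothetical labels: 9 actual positives and 3 actual negatives.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))          # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))   # more informative with unbalanced classes
print(f1_score(y_true, y_pred))
print(cohen_kappa_score(y_true, y_pred))         # agreement beyond chance
```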

ROC and AUC

For logistic regression and some other binary classifiers, the model estimates the probability that a sample belongs to the positive class. The final classification of a sample depends on how that probability compares to a chosen classification threshold. Samples with a predicted probability greater than the classification threshold are classified as positive and the others as negative. A default classification threshold of 0.5 is typical16. Other classification thresholds can be used, and changing the threshold changes how samples are classified as well as the associated metrics. To create a Receiver Operating Characteristic (ROC) curve, the sensitivity and specificity are calculated across the range of possible threshold values. Plotting the true positive rate (i.e., sensitivity) vs the false positive rate (1 − specificity) for these threshold values defines the ROC curve (Figure 1). Multiple models can be compared by showing their ROC curves on the same plot. Curves from better models will be closer to the upper left corner. The line representing a model that is no better than chance will fall on the diagonal.

Figure 1: Illustration of Receiver Operating Characteristic Curve

The area under the ROC curve (AUC) gives the model’s performance across classification threshold values and the tradeoff between specificity and sensitivity9. It can be used to compare between models. A perfect model has an AUC of 1, while a model with an AUC of 0.5 performs no better than chance.
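A minimal sketch of sweeping classification thresholds to obtain an ROC curve and its AUC with scikit-learn, using hypothetical labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                      # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55])   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_prob))               # area under that curve
```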

SHAP Values:

Rather than evaluating the overall performance of a model, SHAP (SHapley Additive exPlanations) values20 explain how much a specific feature appears to contribute to the classification of a sample or group of samples. In other words, they help decipher which features of a model are important and improve interpretability. They are frequently used in conjunction with deep learning models, where the contributions of individual variables to the model are not immediately apparent.
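A minimal sketch using the open-source shap package with a gradient-boosted tree model (the data are simulated, and the xgboost and shap packages are assumed to be installed):

```python
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = xgboost.XGBClassifier().fit(X, y)

# Each SHAP value estimates how much a feature pushed one prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)   # feature-importance style summary across samples
```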

Model Families

For each objective, there are multiple applicable modeling approaches/families. Across families, models are generally trained by trying to mathematically optimize some objective function, often through iteration. The process through which the optimization is achieved is the learning algorithm16. See Figure 2.

Figure 2: Artificial Intelligence Model Families

Regression

Regression is a fundamental statistical technique. Here, it is discussed as a supervised ML tool to make future predictions16, rather than as a statistical tool to learn about a population.

Models in this family take in numeric data, or categorical data that has been coded as numeric data, and output a numeric value. Linear regression and logistic regression are the two most common variants. Linear regression can be used when trying to predict a numeric value. Logistic regression is used when the response variable is binary, allowing classification of the output into one of two classes, such as classifying a patient as having migraine or non-migraine headache. Generalized linear models (GLM) describe a wider family of regression models. Required assumptions and interpretations differ by model type. The learning algorithm and the function being optimized can also differ. For instance, when using ordinary least squares (OLS) regression to fit a linear regression model, the least squares approach finds coefficient estimates that minimize the sum of squared errors of the model16, i.e., it creates the line that best fits the data when graphed.

These models have the benefit of being familiar to many. The format of the regression equations can make it easy to understand the impact of each of the variables on the outcome.
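A minimal sketch of fitting a logistic regression classifier with scikit-learn on simulated tabular data standing in for patient features and a binary diagnosis:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.coef_)                       # one coefficient per input variable
print(clf.predict_proba(X_test)[:5])   # predicted class probabilities
print(clf.score(X_test, y_test))       # accuracy on held-out data
```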

Example:

Forecasting migraine attacks with AI has been of great interest to researchers. A study by Houle et al. developed a migraine forecasting model based on electronic headache diary information and stress levels obtained from the Daily Stress Inventory questionnaire from 95 patients and 4626 days of diary entries, or up to 3 months per individual, with enrollment over a greater than 4-year period. The generalized linear mixed-effect model achieved an AUC of 0.65 (95% CI 0.6–0.67) in the out-of-sample test set using leave-one-out cross-validation21. Because the model was based on an individual’s preceding 24 hours of headache and stress data, it was not sensitive to missing data. Migraine attack forecasting research is in its early stages, and many complicating issues remain to be addressed, such as the appropriate method of dealing with missing data and the correct way to account for seasonal or weather conditions, other illnesses, and experiences or triggers that may impact the accuracy of the predictions being made.

Trees

Tree models begin with all data in one pool and then use one or more variables to split the training data into branches9. Each branch is then further split until some stopping condition is reached. At each split the model will choose to split on the variable that maximizes the purity (or other measure) of the resulting branches. Multiple measures of purity, such as the Gini index and entropy, exist and can be used to determine the split. Tree models can be used for regression or classification. Small trees have the benefit of being easily interpreted by humans.
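A minimal sketch of fitting and printing a shallow decision tree with scikit-learn on simulated data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Splits are chosen to maximize purity (Gini impurity by default).
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # small trees can be printed and read directly
```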

Ensemble methods

Ensemble models combine multiple ML models into one model to improve overall performance. For example, random forests are ensembles of multiple tree models. Bagging and boosting are two techniques describing how individual models may be combined into ensemble models9.
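A minimal sketch contrasting a bagging-style ensemble (random forest) with a boosting-style ensemble (gradient boosting) in scikit-learn on simulated data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees trained on bootstrapped samples, predictions averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
# Boosting: trees added sequentially, each correcting the errors of the previous ones.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test), gbm.score(X_test, y_test))
```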

Examples:

Stubberud et al. developed a random forest model to forecast migraine attacks based on a mobile phone diary and physiologic data from a wearable sensor in 18 participants followed for four weeks. The AUC on a holdout test set was 0.621. As with the model developed by Houle et al., data from the preceding 24 hours were used to predict a headache, but here the inputs included self-reported premonitory symptoms and physiologic data in addition to headache diary data, and stress data were not directly collected. Their SHAP summary plot showed, interestingly, that a premonitory feeling of swelling was a particularly high-value feature of the model. Both this model and the one developed by Houle et al. show the strength of using self-reported headache symptoms to develop models with similar AUC for predicting headache the following day.

AI has also been used to predict treatment responses to headache therapeutics with the goal of advancing precision treatment. A random forest ensemble model and online calculator were developed to predict treatment response to calcitonin gene-related peptide (CGRP) monoclonal antibodies (mAbs), based mainly on pre-treatment and post-treatment (at 3 months) headache frequency and Headache Impact Test-6 (HIT-6) score. It predicted treatment response at time points including 6, 9, and 12 months11. The AUCs of models at different post-treatment timepoints were high and ranged from 0.87 to 0.98, which is not surprising, as the post-treatment (3 months) headache frequency, migraine frequency, and HIT-6 data were used to predict the headache frequency at subsequent timepoints.

Additionally, response to verapamil for patients with cluster headache was predicted using clinical and imaging features in an ensemble tree model, XGBoost (Extreme Gradient Boosting)7. A model using clinical features alone had a training AUC of 0.669. A model including both clinical and extracted imaging features yielded a test AUC of 0.621.

Another ensemble tree model, LightGBM (Light Gradient Boosting Machine), was used to classify patients with headache into diagnostic categories including migraine, tension-type headache, trigeminal autonomic cephalalgias, other primary headaches, and secondary headaches22. The model, developed using detailed headache symptom descriptions from 4000 patient questionnaires, was accurate in its diagnostic capacity, with average (across the five diagnostic categories) accuracy, recall, specificity, precision, and F values of 76.25%, 56.26%, 92.16%, 61.24%, and 56.88%, respectively, in the holdout test dataset. Additionally, it improved the diagnostic accuracy of five clinicians who were not headache specialists, when tested on a different group of patients, from 46% to 83%, demonstrating the potential of AI to increase access to accurate headache evaluation.

Another recent study utilized pre-treatment clinical characteristics and detailed migraine descriptions to predict treatment response to seven different types of commonly used migraine preventive medications. Multiple model types were tested. Features used to construct the ML models included demographics, headache frequency, pain intensity and disability level, migraine-associated symptoms, and migraine attack triggers, gathered from a Headache Intake Form administered through clinical practice at a tertiary headache center. Notably, the model predicted treatment response, defined as a 30% reduction in headache frequency from baseline, to CGRP mAbs with an AUC of 0.83 and 80% accuracy using a Gradient Boosting Machine (GBM). The AUCs of models predicting treatment response to other types of medications ranged from 0.58 to 0.67. The most important predictors for each type of medication were also identified.

Support vector machines

Support vector machines (SVM) classify samples into groups by finding the equation of a barrier that best separates the groups23. Imagine trying to separate a scattering of red dots from blue dots in a two-dimensional field. When trying to separate two groups (red and blue) based on two variables, a simple linear SVM draws a straight line between the two groups if it can. The line then acts as the classifier: points on one side of the line belong to one class, and points on the other side belong to the other class. Linear SVMs maximize the distance between the linear boundary and the nearest points in the data set. More advanced SVMs, capable of separating groups that are not linearly separable, are possible with different kernel functions that map the data into a higher-dimensional space.
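A minimal sketch of linear and kernel SVMs with scikit-learn on simulated two-dimensional data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Linear SVM: finds the separating line with the widest margin.
linear_svm = SVC(kernel="linear").fit(X, y)
# RBF kernel: implicitly maps the data into a higher-dimensional space,
# allowing a nonlinear boundary in the original two dimensions.
rbf_svm = SVC(kernel="rbf").fit(X, y)
print(linear_svm.score(X, y), rbf_svm.score(X, y))
```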

Example:

Messina et al. developed an SVM model to classify cluster headache patients vs migraine patients using functional brain MRI patterns8. The accuracy of the imaging-only model was 78%, whereas a model utilizing both imaging and clinical data achieved an accuracy of 99%.

Clustering models

Clustering describes unsupervised ML approaches that attempt to find natural groupings within a data set24. In k-means clustering, k centroids are found and each point is grouped with its nearest centroid. K-means is a centroid-based clustering approach, but there are also density-based and hierarchical approaches.
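A minimal sketch of k-means clustering with scikit-learn on simulated, unlabeled feature data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix (e.g., symptom scores); no outcome labels are needed.
X = np.random.default_rng(0).normal(size=(100, 3))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment for the first 10 samples
print(kmeans.cluster_centers_)    # the three centroids
```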

Examples:

Clustering has been used to derive headache phenotypes in patients with cluster headache, as well as to define sub-phenotypes within chronic migraine7, 25.

Deep learning models

Deep learning (DL) models are what many people think of when they think of AI. Deep learning models are a type of Artificial Neural Network (ANN).

Artificial neural networks are models inspired by biology26. They are made of an input layer, one or more hidden layers, and an output layer. Each layer is made of nodes and the nodes are connected between and within layers. Deep learning describes ANNs with multiple hidden layers27. The “architecture” of a DL model describes the type and arrangement of the nodes, connections, and layers in the model.
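A minimal sketch of this layered structure, written here in PyTorch (an assumption; any deep learning framework would do): an input layer of 10 features, two hidden layers, and a single output node.

```python
import torch
from torch import nn

# Input layer of 10 features -> two hidden layers -> one output node.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),    # hidden layer 1
    nn.Linear(32, 16), nn.ReLU(),    # hidden layer 2
    nn.Linear(16, 1), nn.Sigmoid(),  # output layer (e.g., probability of the positive class)
)

x = torch.randn(4, 10)   # a batch of 4 hypothetical samples with 10 features each
print(model(x))
```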

Deep learning is itself a broad class of models/algorithms. These models can be used for virtually any task. In addition to standard classification tasks based on tabular data, deep learning models are also commonly used for visual analysis tasks such as classifying the contents of an image or outlining the margins of a tumor. Convolutional Neural Networks28 (CNN) and Recurrent Neural Networks (RNN) are DL models common in medical research and are primarily used for visual analysis26 and NLP29 tasks, respectively. Cutting-edge generative models such as ChatGPT and DALL-E are transformer-based30 DL models.

A note on nomenclature: in deep learning, well-known models are often given names that are not always helpful for understanding what type of model they are. For example, ResNet31, 32 is the name of a specific CNN model. ChatGPT33 is a Generative Pretrained Transformer30 model designed for chat, where “transformer” describes the DL architecture.

Deep learning models can use virtually any form of input and massive amounts of data. However, acquiring sufficient training data can be one of the challenges of developing DL models. These models may also be financially costly to train. Therefore, it is often more efficient to take a pre-trained model (a model that has already been trained on some other data set) and fine-tune it to a specific task. DL models suffer from a lack of interpretability (aka explainability), meaning the internal reasoning of the models is not visible to the creators or users; hence their frequent description as “black boxes”22, 34–36.

Examples:

Deep learning has been employed in neuroimaging to identify biomarkers for headache disorders and predict headache improvement. One study differentiated individuals with migraine from healthy controls with 75% accuracy, 66.7% sensitivity, and 83.3% specificity; differentiated acute post-traumatic headache with 75% accuracy, 66.7% sensitivity, and 83.3% specificity; and differentiated persistent post-traumatic headache from healthy controls with 91.7% accuracy, 100% sensitivity, and 83.3% specificity32. Deep learning was also used in conjunction with XGBoost by Katsuki et al.37 to predict the number of hourly headache reports in a population of individuals with migraine and non-migraine headaches based on weather data; R2 was 84.9% and RMSE was 8.3.

In a further example of the application of a DL model to headache research, a team applied a CNN that had previously been developed to predict the probability of atrial fibrillation to a group of individuals with migraine and found that those with migraine with aura had a significantly higher predicted probability of atrial fibrillation than those with migraine without aura, even after adjusting for vascular comorbidities.

State-of-the-art DL NLP large language models (LLM) can be used to retrieve desired information from electronic health records (EHR). For example, an LLM-based model has been developed to extract headache frequency, a key parameter for evaluating headache progression and treatment response to preventive medications, from free-text clinical notes12. Such information would traditionally require manual chart review to obtain, which is time-consuming and labor-intensive. LLMs and generative AI are evolving quickly, with “few-shot” or “zero-shot” models now able to accurately perform tasks such as data retrieval with little or no task-specific training.

Things to watch out for

Having discussed the methods above, we will now turn to important considerations when consuming AI literature.

When reviewing new headache AI literature, it is important to ask whether the metrics provided are appropriate and most relevant to the question being asked38. For example, if for the purposes of the model it is vital that all cases labeled positive are truly positive then PPV may be more important than other metrics. In cases where classes are unbalanced with few positive cases, sensitivity may be a more relevant metric particularly if the consequences of failing to identify a positive patient are comparatively severe.

Overfitting occurs when ML models are fit too closely to the training data and perform very well on the training data but not on test data. Compared to evaluating model performance on the training set, use of a test set or cross-validation identifies overfitting and gives a better estimate of how the model would perform in the real world.

There should not be leakage or cross-contamination between the test and training sets. If any information from the test set is used prior to the test phase, this can artificially inflate the performance on the test set. For readers, this can be difficult to determine if the authors do not specifically address it.

Generalizability is the ability of a model to perform as well under an external validation condition as it does in its home condition39. A lack of generalizability in AI models is a common problem even when no obvious issue can be identified. In the same way that the results of a clinical trial that enrolled only individuals with high-frequency migraine may not be applicable to the general migraine population, heavily restricted training sets can limit the generalizability of AI models. If a headache-related model is trained exclusively on notes from headache specialists, it is unlikely to work as well on data from a broader range of physicians.

Participant bias can also skew results. For example, in the study investigating the effects of weather on headache occurrence using a smartphone application and AI, weather-sensitive individuals were more likely to record headaches when weather changes occurred37 due to the design of the application.

Another way generalizability can be negatively impacted is by inappropriate data inclusions in the training set. For example, in what’s known as the Clever Hans phenomenon, variables in the training set inappropriately signal the outcome. With tabular data and non-deep learning models, these issues are easier to identify either prior to training or by looking at the results. There may be no sign of these issues when a deep learning model is used unless steps are taken to specifically look for these problems.

An excellent review of a DL imaging project illustrates this issue40. Here the authors created a CNN for identifying pneumonia using x-rays from several hospital departments. The model performed better on the internal data set than on a test set from another hospital, indicating a lack of generalizability. Investigation showed that the model was able to identify which department training images came from. The inpatient unit, which had a much higher rate of pneumonia, used a portable radiograph machine, while the emergency department, with a much lower rate of pneumonia, did not. Images from the portable machine were color inverted and included unique text markers. Because the unbalanced rate of pneumonia was tied to the department, which was indicated by the equipment type, the model could guess pneumonia status with reasonable accuracy from the appearance of the x-ray alone, which is not useful in practice or at outside facilities. The authors knew there was an issue because of the external validation set and were able to identify the problem by visually inspecting the images. However, deep learning models can “see” things that humans cannot, so visual review may not identify these issues. Inputs need to be handled carefully, as do unbalanced data.

Bias

In addition to statistical bias, applications of AI may mitigate or exacerbate social inequities. On the one hand, ML can reduce the impact of bias on healthcare quality in various fields of medicine, such as ophthalmology, where clinics with fewer resources may be able to screen for common diseases41; critical care, where disparities may be noted in clinical documentation through topic analysis42; and pain, where racial disparities in the graded severity of knee osteoarthritis may be reduced43. However, AI models can also reinforce bias. The contents of the training data shape the model. If those data are not representative, be it in age, race, or disease severity distribution, the model may not perform well on the underrepresented group(s). AI models can also identify race, whether or not we wish them to do so, and may take race into account in generating diagnoses. This raises the specter that known current inequities in headache care may be perpetuated or even magnified by the models we develop based on current practice.

Privacy concerns

Privacy/data sharing is another major concern with AI. The culture of Machine Learning promotes the sharing of new advances across institutions44, which can reduce the initial cost of adopting AI. In addition, it can improve the model by exposing it to diverse data. The ability to share data has made many studies feasible and is in line with the NIH’s emphasis on data sharing45, 46. However, this has the potential to expose patients to a loss of confidentiality.

Unfortunately, systems and regulations have not yet been developed to facilitate and regulate anonymized and de-identified data sharing across institutions and countries. Informed consent processes do not currently allow for the necessary sharing of data in this way. Cybersecurity measures will be increasingly important to address the risks of inappropriate use of datasets, inaccurate or inappropriate disclosures, and the limitations of de-identification47.

Ethical considerations

Given predictive models’ potential to identify high risk conditions prior to their onset, they could be improperly used to expose patients with underlying conditions and deny them care. For example, prediction of high-risk conditions has been demonstrated in gestational diabetes where models may help predict the risk of complications48.

As such, there is the potential for abuse by AI technology developers who could develop systems designed to increase profits for certain drugs, tests, and/or devices without clinical users being made aware49. With future developments in AI in healthcare, their use must be regularly reevaluated.

Legal responsibility is also an emerging question with AI/ML: Who is ultimately liable if a patient suffers an adverse event due to AI-based technology? There may be a shift in the physician’s sense of personal responsibility49, 50, 51.

Other uses of Generative AI in Headache

The use of AI and LLMs to triage and respond to patient portal messages is an emerging field of study. To our knowledge, there is currently no commercially available patient portal software designed specifically for headache practitioners; however, there is commercial software applying LLMs to patient messages in general. Published scholarly research on using AI in patient messaging portals broadly consists of two classes: (1) the classification of patient messages and (2) leveraging LLMs to draft EMR messages. In the former case, uses of AI in EMR messages include identifying patients with transportation barriers, determining the authorship of the message, discerning the topic of concern, and, of course, triaging messages for their medical complexity. For the latter case, it is worth noting that LLMs have the potential not only to assist with provider responses but also to generate patients’ messages to providers. Given the volume of patient messaging in medicine, we expect LLMs’ involvement in patient portals to be an evolving field.

AI has also made its way into scientific writing. Various LLMs, such as ChatGPT, have even appeared as authors in publications. Gao et al. conducted a study in which 50 ChatGPT-generated medical research abstracts were evaluated using a plagiarism checker, an AI output detector, and a group of medical researchers. The plagiarism checker found 100% originality, while the AI detector and human reviewers identified 66% and 68% of the AI-generated abstracts, respectively, with the reviewers misidentifying 14% of genuine abstracts as AI-generated. In response, several scientific journals, including Nature52, JAMA53, and Science54, have established policies stating that AI, language models, and machine learning systems are not eligible for authorship. Academic journals are publishing guidelines regarding the use of AI in manuscript generation and the eligibility of AI for authorship; these guidelines should be reviewed before using AI to write a manuscript. Authors using AI for research should be aware of its tendency to confabulate citations and should verify any information provided by AI themselves.

Conclusion

Artificial Intelligence is likely to play an increasingly important role in headache medicine and headache research going forward. In this time of growth, it is important for those in the headache field to be able to understand AI developments in their proper context. By understanding how AI models are developed and being aware of potential pitfalls in creating and evaluating AI models, readers can become more informed AI users and research consumers.

Acknowledgements:

Jennifer Hranilovich received funding from NIH/NINDS 1K23NS130143-01A1 and Children’s Hospital Colorado Child Health Research Enterprise, Child Health Bridge Funding supplement award. Contents are the authors’ sole responsibility and do not necessarily represent official NIH views.

We gratefully acknowledge Dr. Danielle Kellier for her contributions to the original presentation at the 2023 American Headache Society Annual Scientific Meeting, which inspired this paper.

There was no financial support for this project.

Conflict of Interest:

Gina M. Dumkrieger has received research funding from the American Brain Foundation/the American Academy of Neurology and Amgen, with funds paid to her institution.

Chia-Chun Chiang has served on the advisory board for Satsuma and eNeura and receives research support from the American Heart Association with funds paid to her institution.

Pengfei Zhang has received honoraria from Alder Biopharmaceuticals, Board Vitals, and Fieve Clinical Research. He collaborates with Headache Science Incorporated without receiving financial support. He has ownership interest in Cymbeline LLC.

Mia T. Minen contributed to developing intellectual property for the RELAXaHEAD application that is co-owned by NYU and IRODY. If the research is successful, NYU and IRODY may benefit from the outcome.

Fred Cohen has received honoraria from Abbvie (consulting, speaking), Eli Lilly (speaking), and Pfizer (consulting, speaking). He serves as an assistant editor for Headache. He has received honoraria from Medlink Neurology and Springer Nature.

Jennifer A. Hranilovich holds stock in Stryker and Amgen.

Abbreviations:

AI: Artificial Intelligence
ML: Machine Learning
NLP: Natural Language Processing
MRI: magnetic resonance imaging
MSE: mean square error
RMSE: root mean square error
AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
GLM: generalized linear model
OLS: ordinary least squares
DL: deep learning
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LLM: large language model
PPV: Positive Predictive Value
CGRP: calcitonin gene-related peptide
mAbs: monoclonal antibodies

Contributor Information

Gina M. Dumkrieger, Department of Neurology, Mayo Clinic, Phoenix, AZ, USA.

Chia-Chun Chiang, Department of Neurology, Mayo Clinic, Rochester, MN, USA.

Pengfei Zhang, Rutgers Robert Wood Johnson Medical School, New Brunswick, NJ, USA.

Mia T Minen, Department of Neurology, Department of Population Health, NYU Langone Health, NY, NY, USA.

Fred Cohen, Department of Neurology, Department of Medicine, Mount Sinai Hospital, Icahn School of Medicine at Mount Sinai, New York, New York, USA.

Jennifer A Hranilovich, Division of Child Neurology, Department of Pediatrics, University of Colorado School of Medicine, Aurora, Colorado, USA.

References

  • 1.Stubberud A, Ingvaldsen SH, Brenner E, et al. Forecasting migraine with machine learning based on mobile phone diary and wearable data. Cephalalgia 2023;43:1–10. [DOI] [PubMed] [Google Scholar]
  • 2.Stubberud A, Gray R, Tronvik E, Matharu M, Nachev P. Machine prescription for chronic migraine. Brain Commun 2022;4:fcac059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Chiang C-C, Chhabra N, Chao C-J, et al. Migraine with aura associates with a higher artificial intelligence: ECG atrial fibrillation prediction model output compared to migraine without aura in both women and men. Headache: The Journal of Head and Face Pain 2022;62:939–951. [DOI] [PubMed] [Google Scholar]
  • 4.116th Congress (2019–2021). H.R.6216 National Artificial Intelligence Initiative Act of 2020. 2020.
  • 5.MIT Technology Review. Artificial Intelligence [online]. Available at: https://cdn.technologyreview.com/artificial-intelligence/. Accessed Mar 23 2024.
  • 6.Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40–55. [DOI] [PubMed] [Google Scholar]
  • 7.Tso AR, Brudfors M, Danno D, et al. Machine phenotyping of cluster headache and its response to verapamil. Brain 2021;144:655–664. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Messina R, Sudre CH, Wei DY, Filippi M, Ourselin S, Goadsby PJ. Biomarkers of Migraine and Cluster Headache: Differences and Similarities. Ann Neurol 2023;93:729–742. [DOI] [PubMed] [Google Scholar]
  • 9.Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Boston, MA: Pearson Education, Inc., 2006. [Google Scholar]
  • 10.Montgomery DC, Runger GC, Hubele NF. Engineering Statistics, 2nd ed. USA: John Wiley & Sons, Inc, 2001. [Google Scholar]
  • 11.Gonzalez-Martinez A, Pagan J, Sanz-Garcia A, et al. Machine-learning-based approach for predicting response to anti-calcitonin gene-related peptide (CGRP) receptor or ligand antibody treatment in patients with migraine: A multicenter Spanish study. Eur J Neurol 2022;29:3102–3111. [DOI] [PubMed] [Google Scholar]
  • 12.Chiang CC, Luo M, Dumkrieger G, et al. A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records. Headache 2024;64:400–409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Shimron E, Perlman O. AI in MRI: Computational Frameworks for a Faster, Optimized, and Automated Imaging Workflow. Bioengineering (Basel) 2023;10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection and tool. BMC Medical Imaging 2015;15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.AlShafeey M, Csáki C. Evaluating neural network and linear regression photovoltaic power forecasting models based on different input methods. Energy Reports 2021;7:7601–7614. [Google Scholar]
  • 16.Myers RH, Montgomery DC, Vining GG, Robinson TJ. Generalized Linear Models with Applications in Engineering and the Sciences, 2nd ed. Hoboken, NJ, U.S.A.: John Wiley & Sons, Inc, 2010. [Google Scholar]
  • 17.Akaike H A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 1973;AC-19:716–723. [Google Scholar]
  • 18.Dunn PK, Smyth GK. Generalized Linear Models with Examples in R. New York, NY, U.S.A.: Springer Science+Business Media, 2018. [Google Scholar]
  • 19.McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica 2012;22:276–282. [PMC free article] [PubMed] [Google Scholar]
  • 20.Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, et al. , ed. 31st Conference on Neural Information Processing Systems; 2017; Long Beach, CA, USA. [Google Scholar]
  • 21.Houle TT, Turner DP, Golding AN, et al. Forecasting Individual Headache Attacks Using Perceived Stress: Development of a Multivariable Prediction Model for Persons With Episodic Migraine. Headache 2017;57:1041–1050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Katsuki M, Shimazu T, Kikui S, et al. Developing an artificial intelligence-based headache diagnostic model and its utility for non-specialists’ diagnostic accuracy. Cephalalgia 2023;43:3331024231156925. [DOI] [PubMed] [Google Scholar]
  • 23.Bishop CM. Sparse Kernel Machines. Pattern Recognition and Machine Learning. New York, New York: Springer, 2006. [Google Scholar]
  • 24.Ghosh J Scalable Clustering. In: Ye N, ed. The Handbook of Data Mining, 1 ed. Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., 2003: 247–277. [Google Scholar]
  • 25.Woldeamanuel YW, Sanjanwala BM, Peretz AM, Cowan RP. Exploring Natural Clusters of Chronic Migraine Phenotypes: A Cross-Sectional Clinical Study. Sci Rep 2020;10:2804. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Bishop CM. Pattern Recognition and Machine Learning. New York, New York: Springer, 2006. [Google Scholar]
  • 27.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–444. [DOI] [PubMed] [Google Scholar]
  • 28.LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 1998;86:2278–2324. [Google Scholar]
  • 29.Lauriola I, Lavelli A, Aiolli F. An introduction to Deep Learning in Natural Language Processing: Models, techniques, and tools. Neurocomputing 2022;470:443–456. [Google Scholar]
  • 30.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA 2017. [Google Scholar]
  • 31.He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016: pp. 770–778. [Google Scholar]
  • 32.Siddiquee RMM, Shah J, Chong C, et al. Headache classification and automatic biomarker extraction from structural MRIs using deep learning. Brain Commun 2023;5:fcac311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.OpenAI. Introducing ChatGPT [online]. Available at: https://openai.com/index/chatgpt/. Accessed 7/15/2024.
  • 34.Cowan RP, Rapoport AM, Blythe J, et al. Diagnostic accuracy of an artificial intelligence online engine in migraine: A multi-center study. Headache 2022;62:870–882. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Kwon J, Lee H, Cho S, Chung CS, Lee MJ, Park H. Machine learning-based automated classification of headache disorders using patient-reported questionnaires. Sci Rep 2020;10:14062. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Combi C, Amico B, Bellazzi R, et al. A manifesto on explainability for artificial intelligence in medicine. Artificial Intelligence in Medicine 2022;133:102423. [DOI] [PubMed] [Google Scholar]
  • 37.Katsuki M, Tatsumoto M, Kimoto K, et al. Investigating the effects of weather on headache occurrence using a smartphone application and artificial intelligence: A retrospective observational cross-sectional study. Headache 2023;63:585–600. [DOI] [PubMed] [Google Scholar]
  • 38.Leisman DE, Harhay MO, Lederer DJ, et al. Development and Reporting of Prediction Models: Guidance for Authors From Editors of Respiratory, Sleep, and Critical Care Journals. Crit Care Med 2020;48:623–633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Goetz L, Seedat N, Vandersluis R, van der Schaar M. Machine learning generalizability across healthcare settings: insights from multi-site COVID-19 screening. npj Digit Med 2022;7:1–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 2018;15:1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Campbell JP, Mathenge C, Cherwek H, et al. Artificial Intelligence to Reduce Ocular Health Disparities: Moving From Concept to Implementation. Transl Vis Sci Technol 2021;10:1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Chen IY, Szolovits P, Ghassemi M. Can AI Help Reduce Disparities in General Medical and Mental Health Care? AMA J Ethics 2019;21:E167–179. [DOI] [PubMed] [Google Scholar]
  • 43.Pierson E, Cutler DM, Leskovec J, Mullainathan S, Obermeyer Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nature Medicine 2021;27:136–140. [DOI] [PubMed] [Google Scholar]
  • 44.Curth A, Thoral P, van den Wildenberg W, et al. Transferring Clinical Prediction Models Across Hospitals and Electronic Health Record Systems. In: Cellier P, Driessens K, ed. Machine Learning and Knowledge Discovery in Databases; 2020. 2020//; Cham: Springer International Publishing: 605–621. [Google Scholar]
  • 45.Krumholz HM, Ross JS, Otto CM. Will research preprints improve healthcare for patients? BMJ 2018;362:k3628. [DOI] [PubMed] [Google Scholar]
  • 46.Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific Data 2016;3:160035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Networking and Information Technology Research and Development (NITRD) Program. The National Artificial Intelligence Research and Development Strategic Plan. 2016. [Google Scholar]
  • 48.Cooray SD, Boyle JA, Soldatos G, Wijeyaratne LA, Teede HJ. Prognostic prediction models for pregnancy complications in women with gestational diabetes: a protocol for systematic review, critical appraisal and meta-analysis. Systematic Reviews 2019;8:270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nature Medicine 2019;25:30–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Tang A, Tam R, Cadrin-Chênevert A, et al. Canadian Association of Radiologists White Paper on Artificial Intelligence in Radiology. Canadian Association of Radiologists Journal 2018;69:120–135. [DOI] [PubMed] [Google Scholar]
  • 51.Geis JR, Brady A, Wu CC, et al. Ethics of artificial intelligence in radiology: summary of the joint European and North American multisociety statement. Insights into Imaging 2019;10:101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Nature Portfolio Editorial policies: Artificial Intelligence (AI) [online]. Available at: https://www.nature.com/nature-portfolio/editorial-policies/ai. Accessed 2024.10.21.
  • 53.Editorial Policies for Authors: Authorship Criteria and Contributions [online]. Available at: https://jamanetwork.com/journals/jama/pages/instructions-for-authors. Accessed 2024.10.21.
  • 54.Science Journals: Editorial Policies [online]. Available at: https://www.science.org/content/page/science-journals-editorial-policies#authorship. Accessed 2024.10.21.
