Abstract
Natural language processing (NLP) is a key technique for developing medical artificial intelligence (AI) systems that leverage electronic health record data to build diagnostic and prognostic models. NLP enables the conversion of unstructured clinical text into structured data that can be fed into AI algorithms. The emergence of transformer architecture and large language models (LLMs) has led to advances in NLP for various healthcare tasks, such as entity recognition, relation extraction, sentence similarity, text summarization, and question-answering. In this article, we review the major technical innovations that underpin modern NLP models and present state-of-the-art NLP applications that employ LLMs in radiation oncology research. However, it is crucial to recognize that LLMs are prone to hallucinations, biases, and ethical violations, which necessitate rigorous evaluation and validation prior to clinical deployment. As such, we propose a comprehensive framework for assessing the NLP models based on their purpose and clinical fit, technical performance, bias and trust, legal and ethical implications, and quality assurance prior to implementation in clinical radiation oncology. Our article aims to provide guidance and insights for researchers and clinicians who are interested in developing and using NLP models in clinical radiation oncology.
Keywords: artificial intelligence, radiation oncology, natural language processing, large language models, personalized medicine
Introduction
Artificial intelligence (AI) is transforming healthcare by improving patient outcomes, optimizing clinical workflows, and reducing costs.1 A key area where AI is rapidly evolving is precision medicine, which tailors medical care to patients’ genes, environments, and lifestyles.2 Machine learning-based algorithms for the detection of disease-causing genetic mutations3 and for individualized treatment protocols4 are being developed and tested. Advanced AI models are being constructed within medicine for personalized disease prevention, care, and even prophylaxis. The ongoing trend of increased healthcare data collection will further aid innovative applications of these technologies in the future.
Natural language processing (NLP) is a subclass of AI that enables machines to understand and interpret human language. The goal of NLP is to develop algorithms and models that are capable of processing, analysing, and generating natural language text and speech. The history of natural language processing dates back to the 1950s when early language translation programs were developed.5 However, limited computing power and lack of data slowed progress.6 The introduction of statistical methods in the 1980s led to more sophisticated language models.7 The development of neural networks and deep learning in the 2010s led to significant advancements in NLP, culminating in the development of the famous transformer architecture in 2017.8 With the capabilities of the transformer architecture, a new breed of models emerged, referred to as large language models (LLMs). Notable LLMs include Bidirectional Encoder Representations from Transformers (BERT),9 GPT-4, and ChatGPT. The development of these technologies has accelerated language processing and opened new possibilities for interacting with machines using natural language. Despite these advancements, LLMs are not without limitations, including biases that may lead to inequitable predictions and hallucinations where models produce incorrect or nonsensical outputs. These issues highlight the importance of an effective evaluation framework to ensure the reliability and fairness of these models in clinical applications.
The transformer architecture has paved the way for the development of LLMs, which are now at the forefront of NLP research. These models have been shown to perform well in language modelling,10 machine translation,11 and sentiment analysis.12 However, training such models is a complex and resource-intensive process that requires large amounts of data and computing power. Researchers have explored the use of transfer learning to address this challenge,13 a technique that allows pre-trained models to be fine-tuned on specific tasks with limited data. Indeed, the real-world applications of NLP often begin with fine-tuning these pre-trained models to suit the specific needs of the task. Furthermore, zero-shot learning has also emerged as a promising method for generalizing to new tasks without explicit training.14 This makes deployment easier, although evaluating model performance is becoming more challenging as input data grows more expansive and complex.15 As a result, researchers and practitioners in the field must find ways to accurately assess the quality of these models when applied to prospective real-world data.16
Adequate testing of these models will be an issue in radiation oncology, which utilizes cutting-edge technology to guide precise radiation to the target cancer. Language processing and research applications in radiation oncology have been discussed in the literature.17–22 Highlights include enhanced insights into disease staging, treatment options, and patient outcomes,23 and entity extraction from unstructured clinical notes24,25 to aid clinical decision support. The lack of clinical evaluation and proper validation for many of these models poses a significant challenge to their widespread adoption. The clinical evaluation of these models involves rigorous testing based on well-defined metrics and benchmark datasets. This includes assessing the potential risks and benefits in the prospective use of these models and a thorough evaluation of the impact on patient outcomes. The application of NLP models to the field of clinical radiation oncology has the potential to enhance its outcomes and efficiency, provided that the models demonstrate validity, safety, fairness, and reliability as quantified by a rigorous evaluation framework. This requires collaboration between clinicians, data scientists, and regulatory bodies to develop detailed and robust evaluation frameworks that can ensure that these models are integrated into clinical workflows safely and effectively.26
In the sections below, we review the transformer architecture that has been foundational to the development of LLMs. We also provide an overview of training and fine-tuning for LLMs and present recent research applications in radiation oncology. Lastly, we will identify the existing challenges and limitations for the clinical adoption of LLMs and propose a framework that serves as a preliminary guideline to facilitate the safe, fair, and effective application of these algorithms in clinical settings. A glossary of technical terms is provided in Table 1 to assist readers in understanding the computational aspects discussed in the following section.
Table 1.
Glossary of technical terms.
| Term | Explanation |
|---|---|
| Natural Language Processing (NLP) | A field of artificial intelligence focused on enabling machines to understand, interpret, and generate human language. |
| Large Language Models (LLMs) | Advanced neural network models designed to process and generate text by analysing sequences of words, often with billions of parameters. |
| Transformer Architecture | A neural network architecture that uses self-attention mechanisms to process sequential data efficiently, allowing for parallel computation and long-range dependency modelling. |
| Parameters (for LLMs) | Numerical values in a neural network that are learned during training and define the behaviour of the model. LLMs can have billions (eg, 1B-405B) of parameters. |
| Tokenization | The process of breaking down text into smaller units, such as words or subwords, for model input. |
| Embedding | The representation of words or tokens as dense numerical vectors, capturing semantic meaning. |
| Positional Encoding | Adds information about the order of tokens in a sequence, allowing transformers to process sequential data effectively. |
| Transfer Learning | A method where pre-trained models are adapted for specific tasks by fine-tuning on task-specific data. |
| Fine-Tuning | Adjusting the parameters of a pre-trained model on a smaller, domain-specific dataset to improve performance for specific tasks. |
| Zero-Shot Learning | The ability of a model to perform tasks without explicit training on them, relying on pre-trained knowledge. |
| Few-Shot Learning | The ability of a model to learn and generalize from a small amount of labelled data for a new task. |
| Encoder-Only Models | Models like BERT that analyse entire input texts at once and excel in context-based tasks. |
| Decoder-Only Models | Models like GPT that generate text by predicting the next word based on preceding words. |
| Encoder-Decoder Models | Models like T5 and BART that combine encoding and decoding capabilities for both understanding and generating text. |
| Named Entity Recognition (NER) | An NLP task where specific entities (eg, names, dates) are identified and classified in text. |
| Self-Attention Mechanism | A technique used in transformers that calculates the relevance of each word in a sequence to every other word, enabling context understanding. |
| Scaled Dot-Product Attention | A mathematical operation in transformers that calculates the importance of tokens relative to others in a sequence. |
| Residual Connections | Techniques used in neural networks to improve gradient flow and training stability by adding the input of a layer to its output. |
| Layer Normalization | A method to stabilize and accelerate training by normalizing layer inputs. |
| Adversarial Testing | A technique to evaluate model robustness by testing on edge cases or perturbed inputs. |
| BLEU (Bilingual Evaluation Understudy) | A metric used to evaluate the quality of machine-generated text by comparing it to human-generated references. |
| ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | Metrics that evaluate the overlap between machine-generated and human-generated summaries to assess summarization quality. |
| Hugging Face Transformers Library | A popular open-source library for NLP that provides pre-trained models and tools for fine-tuning. |
| Google TensorFlow | An open-source machine learning framework for building and deploying AI models. |
| Meta PyTorch | An open-source machine learning library known for its flexibility and efficiency in building deep learning models. |
Materials and methods
Literature review
We employed a 2-pronged approach to the literature review to ensure a comprehensive review of modern NLP techniques and their applicability in radiation oncology. First, we utilized ArXiv to capture the most recent advancements in the NLP field, recognizing that cutting-edge research is often disseminated through this preprint server before formal peer review. Search terms used on ArXiv included combinations of “Natural Language Processing,” “Transformer Architecture,” “Large Language Models,” and “Clinical Applications.” Subsequently, we explored PubMed to obtain NLP research applications in Radiation Oncology. Our search terms for PubMed encompassed “Natural Language Processing,” “Radiation Oncology,” “Electronic Health Records,” and “Clinical AI Applications.” This literature was critical for summarizing the foundational elements of modern NLP pipelines and their use in radiation oncology research, as well as for considerations for the design of the implementation framework of these models in clinical radiation oncology. Below, we summarize the key foundational elements of the modern NLP models.
Recurrent neural networks
Recurrent neural networks (RNNs) were foundational in early NLP for processing sequential data and capturing temporal dynamics. However, they struggle with long-term dependencies due to vanishing or exploding gradients.27 Long short-term memory (LSTM) networks were introduced to address these limitations with mechanisms like gating and internal feedback loops, enhancing their ability to retain and learn long-term dependencies. LSTMs became the backbone for various NLP tasks such as machine translation and sentiment analysis. Despite these improvements, LSTMs faced challenges with sequential computation, limiting parallelization and efficiency. This led to the development of the transformer architecture, which enables parallel computation and utilizes self-attention mechanisms to capture long-range dependencies more effectively.8
Transformer architecture
The transformer architecture, introduced in the 2017 paper “Attention is All You Need” by Vaswani et al,8 revolutionized NLP by replacing RNNs with self-attention mechanisms, enabling parallel computation and capturing long-range dependencies more effectively. The transformer comprises an encoder and a decoder: the encoder processes the input sequence into a context-rich representation, which the decoder uses to generate the output.28
Transformers surpass previous NLP models like LSTM in 2 major ways: they excel at contextual learning via multiheaded attention,29 and they facilitate parallelized computations, speeding up training.30 The transformer processes input text by breaking it into tokens, which are converted into dense vector representations through input embedding. Positional encoding is added to these embeddings to maintain the order of the tokens.
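The input pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses whitespace tokenization (real LLMs use subword tokenizers), a random, untrained embedding table, and the sinusoidal positional encoding from Vaswani et al.8 The function names `tokenize`, `positional_encoding`, and `embed`, the dimension `d_model=8`, and the example sentence are hypothetical choices made for this sketch.

```python
import math
import random

def tokenize(text):
    """Toy whitespace tokenizer (assumption: real LLMs use subword tokenizers)."""
    return text.lower().split()

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def embed(tokens, d_model=8, seed=0):
    """Look up a random (untrained) vector per token and add position information."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(d_model)]
    pe = positional_encoding(len(tokens), d_model)
    return [[e + p for e, p in zip(table[tok], pe[pos])]
            for pos, tok in enumerate(tokens)]

# Hypothetical example sentence, invented for illustration.
tokens = tokenize("Patient received 60 Gy in 30 fractions")
vectors = embed(tokens)  # one d_model-dimensional vector per token
```

In a trained model the embedding table is learned rather than random; the point here is only that each token becomes a vector and that the position signal is added to it element-wise.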
As shown in Figure 1, the architecture consists of layers with 2 main components: a self-attention mechanism and a feed-forward network. The self-attention mechanism calculates attention scores between tokens, normalizes them with a softmax function, and computes a weighted sum of the input sequence, emphasizing important segments. The feed-forward network refines the output, with each component augmented by residual connections and layer normalization to ensure robust gradient flow during training.
Figure 1.

Illustration of the transformer architecture by Vaswani et al.8 First, input words are embedded and encoded based on their position in the input sequence (a). In the encoder, the multi-head attention enables each word to attend to other words in the sequence, capturing relationships and dependencies (b). Then, the original input embeddings are added to the self-attention outputs and normalized to integrate attended information while preserving the input (c). Next, a feed-forward neural network captures complex patterns and interactions (d) before representations are added and normalized. In the decoder, masked self-attention allows each position to capture dependencies among the previously generated outputs (e). The decoder’s second attention block (f) is an encoder-decoder (cross-) attention rather than a self-attention: it attends to the encoded input sequence, allowing the decoder to access the information from the input and align it with the current decoding position. Similar to the encoder, outputs are further processed with another feed-forward neural network (g), added, and normalized. Finally, the decoder applies a linear transformation (h) followed by a softmax activation function to generate the probability distribution over the vocabulary to select the next token.
Using a scaled dot-product attention mechanism, the transformer captures long-range dependencies by calculating similarity scores between tokens, normalizing them, and weighting the input sequence accordingly. This process, repeated across multiple layers, allows the transformer to derive context and has positioned it as a foundational tool in NLP.10
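The attention computation described above (similarity scores, softmax normalization, weighted sum) can be written out directly. The sketch below is a minimal pure-Python version for a single attention head. As a simplifying assumption it feeds the same toy matrix `X` in as queries, keys, and values, whereas a real transformer first applies learned linear projections. The function names and the toy input are ours.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns (outputs, attention_weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: three tokens with two-dimensional representations.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
```

Each row of `weights` sums to 1 and indicates how strongly one token attends to every other token. Multi-head attention repeats this computation with several independent projections and concatenates the results.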
Large language models
The transformer architecture underpins the development of LLMs, which promise substantial performance improvements. LLMs have evolved from simpler statistical models into complex neural networks that analyse word sequences to predict subsequent words within a given context. OpenAI’s GPT series ranges from the original GPT (117 million parameters) to GPT-4 (estimated at hundreds of billions of parameters), illustrating the rapid increase in model scale and complexity. Meta's LLaMA models,31–33 which come in versions 1, 2, and 3, include parameter sizes ranging from 1 billion to 405 billion, with intermediary models at 3, 8, 13, 70, and 90 billion parameters. While 13 billion and 90 billion models are tailored for multimodal applications, the rest focus on textual tasks, highlighting their versatility and scalability. Other notable models, such as Mistral34 and Qwen,35 further enrich the landscape of LLM development, showcasing diverse architectures and application potentials.
LLMs fall into 3 key architectures36:
Encoder-only models (eg, BERT) process entire texts at once, excelling in context-based tasks.37
Decoder-only models (eg, GPT) predict the next word based on preceding words, ideal for text generation.
Encoder-decoder models (eg, T538 and BART39) combine both approaches, balancing context understanding and sequence generation.
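One practical difference between encoder-only and decoder-only models is which tokens each position is allowed to attend to. The following sketch, using a hypothetical helper `attention_mask`, illustrates this: an encoder sees the whole sequence at once, whereas a decoder uses a causal mask so that each token only sees itself and the tokens before it.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # every token sees every token
decoder_mask = attention_mask(4, causal=True)   # token i sees tokens 0..i only

for row in decoder_mask:
    print(["x" if allowed else "." for allowed in row])
```

Encoder-decoder models combine the two: the encoder side is unmasked, and the decoder side is causal with additional attention over the encoder output.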
Despite their capabilities, LLMs require task-specific fine-tuning to optimize performance.40 This process involves exposing the model to relevant data and adjusting its parameters. LLMs, especially GPT-3 and beyond, exhibit potential for “zero-shot” or “few-shot” learning, handling tasks without explicit training, thanks to extensive pre-training on vast datasets. This broad applicability allows models like ChatGPT to generate responses across numerous tasks, though specific fine-tuning may still yield superior results.
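The difference between zero-shot and few-shot use comes down to what is placed in the prompt. The sketch below assembles both kinds of prompt as plain text. The helper `build_prompt`, the prompt layout, and the clinical snippets are hypothetical and invented for illustration; they are not taken from any cited study, and no particular model API is assumed.

```python
def build_prompt(task, examples, query):
    """Assemble a plain-text prompt; an empty `examples` list gives a zero-shot prompt."""
    lines = [task, ""]
    for text, label in examples:
        lines += [f"Note: {text}", f"Answer: {label}", ""]
    lines += [f"Note: {query}", "Answer:"]
    return "\n".join(lines)

# Synthetic snippets, invented for illustration only.
task = "Does the note report radiation dermatitis? Answer yes or no."
examples = [
    ("Grade 2 dermatitis over the chest wall.", "yes"),
    ("No skin changes in the treatment field.", "no"),
]
query = "Patient denies skin irritation."

zero_shot = build_prompt(task, [], query)        # task description only
few_shot = build_prompt(task, examples, query)   # task plus labelled examples
print(few_shot)
```

Fine-tuning, by contrast, changes the model parameters themselves rather than the prompt.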
The rise of LLMs has spurred novel applications, including advanced language generation and chatbots, redefining human-machine interaction. Their accessibility is enhanced by platforms like Hugging Face’s transformers library,41 Google’s TensorFlow,42 and Meta's PyTorch.43
In medicine, LLMs like Google’s Med-PaLM44 have made significant strides, surpassing the pass mark on USMLE-style questions. Med-PaLM 245 further improved accuracy in medical exams and health queries. The Med-PaLM M model,22 integrating text and imaging data, advances applications like radiomics for tumour characterization and therapy response prediction.
Deploying LLMs demands high-performance hardware (eg, GPUs, TPUs, or specialized accelerators), extensive memory, storage, and high-bandwidth networking to handle massive datasets. These requirements incur significant operational costs and environmental impacts. Organizations are increasingly exploring smaller model architectures and more efficient hardware to reduce these burdens while ensuring sustainable AI development.
Robust version control is also essential in clinical workflows to ensure transparency, reproducibility, and regulatory compliance. Using model-agnostic approaches allows for the seamless integration of updated models, while rigorous sandbox testing, regular audits, and clear documentation help maintain system stability and clinician trust.
Radiation oncology research applications
Recent LLM applications in radiation oncology
NLP has been explored for cancer applications,19 promising improved care through big data from EHRs and oncology information systems.46 Reviews by Yim et al17 and Bitterman et al18 discuss various NLP applications in radiation oncology, including information extraction from clinical notes,47 standardizing treatment planning structures,48 and identifying treatment locations.49 NLP also aids in toxicity data extraction from clinical notes; however, its performance degrades when symptoms are negated.50 Advanced NLP methods have also been shown to enhance cancer registries by extracting relevant clinical information directly from clinical text.51 Notably, Khanmohammadi et al52,53 introduced a student-teacher LLM architecture for localized toxicity extraction and iterative prompt refinement, significantly improving accuracy in extracting symptoms and treatments from clinical notes. Their approach leverages automatic prompt optimization through iterative refinement, achieving notable gains in precision and recall while maintaining data privacy and enhancing clinical NLP applications. These applications are summarized in Table 2 and sections “Summarization and retrieval of electronic health records,” “De-identification of sensitive medical data,” “Clinical text mining and decision support,” and “Education and knowledge expansion.”
Table 2.
Summary of NLP and LLM research applications in radiation oncology.
| Application | Task | Details | References |
|---|---|---|---|
| Information extraction | Extracting data from clinical notes | Enhances insights into disease staging, treatment options, and patient outcomes | 17 , 18 , 47 , 52 , 53 |
| Standardizing treatment planning | Structuring treatment plans | Focused on improving consistency and automation in planning structures | 48 |
| Identifying treatment locations | Locating treatment areas | NLP identifies anatomical sites for precise treatment targeting | 49 |
| Toxicity data extraction | Analysing clinical notes for toxicities | Challenges include performance issues with negated symptoms | 50 |
| Enhancing cancer registries | Extracting registry data | Directly extracts clinical information from unstructured text | 51 |
| EHR summarization and retrieval | Summarizing and retrieving data | SPeC pipeline improves summary reliability; adapted LLMs outperform experts in reducing documentation burden | 54 , 55 |
| De-identification of sensitive data | Removing private information from text | DeID-GPT masks private information while preserving text meaning and structure | 56 |
| Clinical text mining and decision support | Mining text and supporting decision-making | ChatGPT used for named entity recognition and synthetic data generation; multimodal AI enhances decision-making | 57 , 58 |
| Education and knowledge expansion | Expanding knowledge and training | ChatGPT excels in radiation oncology exams; RadOnc-GPT and OncoGPT enhance diagnostic descriptions and advice | 59–62 |
Summarization and retrieval of electronic health records
LLMs can enhance electronic health record (EHR) summarization in radiation oncology. The Soft Prompt-Based Calibration (SPeC) pipeline by Chuang et al54 addresses output variance, providing reliable summaries. Van Veen et al55 found that adapted LLMs outperform medical experts in clinical text summarization, reducing clinicians’ documentation burden.
De-identification of sensitive medical data
LLMs, like ChatGPT and GPT-4, offer advanced named entity recognition (NER) capabilities for de-identifying sensitive medical data. Liu et al56 proposed DeID-GPT, a framework effectively masking private information while preserving text meaning and structure.
Clinical text mining and decision support
LLMs provide insights through clinical text mining. Tang et al57 used ChatGPT for NER and relation extraction, generating high-quality synthetic data to fine-tune local models. Ferber et al58 demonstrated that multimodal AI systems enhance decision-making by deploying specialized medical AI tools, supporting LLMs as clinical assistants.
Education and knowledge expansion
LLMs, like ChatGPT, show promise in medical education. Gilson et al59 evaluated ChatGPT’s performance on radiation oncology exams, highlighting its potential for expanding knowledge. RadOnc-GPT,60 fine-tuned on Mayo Clinic data, excelled in generating treatment regimens and diagnostic descriptions. Yalamanchili et al61 found LLM responses to care questions were on par or superior to expert answers, though readability needed improvement. Jia et al62 developed OncoGPT, enhancing accuracy in oncology advice through domain-specific fine-tuning.
Current shortcomings of NLP models
The rise of LLMs has enabled the development of conversational AI with widespread applications. However, these models often generate hallucinations—irrelevant or incorrect outputs—which is a significant concern in healthcare, where such errors can jeopardize patient safety. Ensuring the reliability and accuracy of LLMs in medical contexts is crucial.
LLMs also exhibit embedded biases,63 necessitating collaboration between health professionals and data scientists to prevent encoding historical health disparities. Straw and Callison-Burch64 evaluated biases in NLP models used in psychiatry, highlighting significant biases in religion, race, gender, nationality, and sexuality within GloVe65 and Word2Vec66 embeddings. They emphasized cross-disciplinary collaboration to mitigate health inequalities.
Another issue is model toxicity—producing offensive or discriminatory content.67 Evaluations using toxic benchmark datasets are vital to prevent biased outputs and misinformation.68 Legal concerns around AI liability in decision-making also necessitate creating reliable, safe, and legally compliant models.69 These concerns underscore the need for a robust framework of accountability and transparency in deploying AI tools, ensuring proper review by healthcare providers.
While LLMs have become more accessible and easier to deploy, their comprehensive outputs pose challenges in performance evaluation compared to previous NLP algorithms.70 Training and evaluating LLMs require significant computational resources, limiting their use in low-resource environments. Evaluating LLMs’ quality is challenging due to the nuanced nature of natural languages, particularly in clinical settings.
Rigorous evaluation and validation are essential to ensure that NLP models are reliable and safe for clinical use. The rapid development and deployment of these technologies necessitate a comprehensive evaluation before clinical implementation. The next section proposes a checklist for step-wise evaluation of NLP models before their deployment in radiation oncology.
Framework for clinical implementation
This section presents a comprehensive framework for the clinical implementation of NLP systems in radiation oncology. The framework consists of 3 main components: (1) evaluation of the purpose and clinical fit of the NLP system, (2) commissioning of the NLP system, and (3) quality assurance of the NLP system. The commissioning section includes sub-sections on technical performance, bias and trust, and legal and ethical scope. A graphical overview of the framework and a checklist of relevant questions for the clinical commissioning team are provided in Figure 2.
Figure 2.
Framework for assessing natural language processing algorithms before they are deployed in clinical settings. The framework consists of several categories and questions that can help evaluate the objectives, outcomes, limitations, reliability, validity, and quality control measures of the algorithms, as well as their implications for clinical practice and patient safety.
Purpose and clinical integration
Clinical implementation of an NLP model begins with clearly defining its purpose by outlining the clinical problem, context, chosen solution, and anticipated benefits. In radiation oncology, innovations are first evaluated for their potential to improve efficiency and efficacy. For example, an NLP model might extract tumour staging data from EHRs to support treatment planning or generate concise patient record summaries for tumour board discussions. Although the problem and solution can be broadly defined, they must include concrete, testable components to ensure reliability before routine use.
Beyond purpose, it is essential to articulate the expected impact. An NLP tool that automates toxicity data extraction, for instance, should be compared against manual reviews to verify accuracy. Similarly, automating patient record summarization can reduce clinicians’ review time, streamline workflows, and allow more focus on direct patient care. Effective planning should also address how the technology will augment clinical expertise through comprehensive education and training, ensuring that users can validate and fully leverage the system.
Commissioning
Technical performance
The technical performance of an NLP algorithm is evaluated on development datasets that should reflect the expected clinical use. The development dataset is usually divided into 3 subsets: train, development/validation, and test. The train set is used to train the initial algorithm, the validation set is used to tune the algorithm’s hyperparameters for specific tasks, and the test set is used to measure the algorithm performance. The implementation team should confirm that the testing subset is independent of the training and tuning subsets. The level of this independence can be categorized as external validation, internal validation, and cross-validation. External validation, where the testing subset comes from a different source than the training and tuning subsets, is the most rigorous evaluation method. Internal validation, where the testing subset is separated from the training and tuning subsets within a single source dataset, is the next best method. Cross-validation, where the same dataset is repeatedly partitioned so that each sample is used for training in some folds and for testing in others, is a weaker method due to potential bias, particularly when the same folds are used both to tune and to evaluate the model.71 Nested cross-validation serves as a more conservative alternative to traditional cross-validation, reducing the risk of information leakage among different sample sub-cohorts.72 In this method, an inner cross-validation loop is performed within the training set to choose the model parameters, and an outer cross-validation loop is used to estimate the error of the chosen model.
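The structure of nested cross-validation can be made concrete with a small pure-Python sketch. Everything in it is a stand-in chosen for brevity: the "model" is a one-dimensional threshold rule whose threshold is the training mean plus a hyperparameter `shift`, the data are synthetic, and the folds are formed without shuffling. A real study would substitute its own model, would typically shuffle and stratify, and would split at the patient level.

```python
def k_fold(indices, k):
    """Yield (train, test) index lists for k folds (no shuffling, for clarity)."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def fit(xs, train_idx, shift):
    """Toy 'training': the decision threshold is the training mean plus a shift."""
    return sum(xs[i] for i in train_idx) / len(train_idx) + shift

def accuracy(threshold, xs, ys, idx):
    """Fraction of samples in idx where 'x >= threshold' matches the label."""
    return sum((xs[i] >= threshold) == (ys[i] == 1) for i in idx) / len(idx)

def nested_cv(xs, ys, shifts, outer_k=3, inner_k=2):
    """Inner loop selects the hyperparameter; outer loop estimates performance."""
    outer_scores = []
    for outer_train, outer_test in k_fold(list(range(len(xs))), outer_k):
        def inner_score(shift):
            # Uses only outer-training data, so the outer test fold stays unseen.
            scores = [accuracy(fit(xs, tr, shift), xs, ys, val)
                      for tr, val in k_fold(outer_train, inner_k)]
            return sum(scores) / len(scores)
        best_shift = max(shifts, key=inner_score)
        threshold = fit(xs, outer_train, best_shift)
        outer_scores.append(accuracy(threshold, xs, ys, outer_test))
    return sum(outer_scores) / len(outer_scores)

# Synthetic, well-separated data invented for illustration.
xs = [0.1, 1.1, 0.2, 1.2, 0.3, 1.3, 0.4, 1.4, 0.0, 1.0, 0.35, 1.25]
ys = [0, 1] * 6
estimate = nested_cv(xs, ys, shifts=[-0.2, 0.0, 0.2])
print(f"Nested CV accuracy estimate: {estimate:.2f}")
```

The key property is that the outer test fold plays no part in choosing `best_shift`, which is what keeps the outer estimate from being optimistically biased by tuning.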
Once the dataset selection is validated, the algorithm performance can be appraised. Generally, performance evaluation is closely associated with the NLP task at hand. NLP tasks can be broadly categorized into classification tasks, NER, entity abstraction, summarization tasks, and question-answering tasks.73 For classification tasks, as well as for NER and entity abstraction, performance is assessed with sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, area under the receiver operating characteristic curve, and F-measures (harmonic mean of precision and recall), with likelihood ratios (positive and negative) providing additional clinical context. For summarization tasks, the ROUGE74 metrics measure the overlap between machine-generated and human-generated summaries. Generated answers in question-answering tasks can be evaluated using the Bilingual Evaluation Understudy (BLEU)75 metric, which was originally developed to assess machine-translated text by comparing it to a set of high-quality reference translations. Choosing the appropriate performance metric is essential for benchmarking and ensuring clinical relevance.
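As a worked illustration of the classification metrics named above, the sketch below derives them from the four confusion-matrix counts, and adds a deliberately simplified unigram overlap score in the spirit of ROUGE-1 recall (no stemming, no multiple references). The function names, the labels, and the two text snippets are synthetic and chosen only for this example.

```python
from collections import Counter

def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def classification_metrics(y_true, y_pred):
    """Binary metrics from confusion-matrix counts (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sens, spec = _ratio(tp, tp + fn), _ratio(tn, tn + fp)
    ppv, npv = _ratio(tp, tp + fp), _ratio(tn, tn + fn)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "f1": _ratio(2 * ppv * sens, ppv + sens),
        "lr_positive": _ratio(sens, 1 - spec),
        "lr_negative": _ratio(1 - sens, spec),
    }

def rouge1_recall(reference, candidate):
    """Simplified ROUGE-1 recall: share of reference unigrams found in the candidate."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    return _ratio(overlap, sum(ref.values()))

# Synthetic labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
overlap = rouge1_recall("tumour staged as t2 n0 m0", "tumour is t2 n0")
```

For reporting, established implementations of ROUGE and BLEU should be preferred over hand-rolled versions, since tokenization and smoothing choices affect the scores.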
Bias and trust
Technically, bias can be minimized by optimizing for the lowest generalization error, which is measured by the out-of-sample error and the gap between ground truth and prediction.76 This gap can be caused by model inaccuracy, sampling error, or noise. The generalization error can be reduced by choosing the right algorithm, tuning it well, and using large cohorts of diverse data. However, tuning should be balanced, because a model that overfits the noise in its training data has high variance, which also results in high generalization error and poor outputs.77 Therefore, the optimal model should capture the data’s meaningful patterns without being overfitted, as demonstrated in Figure 3. Beyond the technical principle of minimizing generalization error, the algorithm should have a low bias in the domains of statistical bias, algorithmic bias, and societal bias.78
Figure 3.

This figure illustrates the bias-variance tradeoff as a function of model capacity. The x-axis represents model capacity, ranging from low to high, while the y-axis represents prediction error, which quantifies the expected discrepancy between model predictions and true values (eg, mean squared error [MSE]). Three key trends are displayed: squared bias (bias², descending curve), variance (ascending curve), and generalization error (U-shaped curve). In the area of low capacity, the model is prone to underfitting, represented by high bias and low variance. As we move towards higher capacities, the model starts to overfit, demonstrated by low bias but high variance. The optimal model capacity is depicted where the generalization error is minimized, balancing bias and variance. This point represents the most effective model complexity for preventing both underfitting and overfitting, thus achieving optimal performance on unseen data.
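For squared-error loss, the three curves in Figure 3 correspond to the standard decomposition of the expected prediction error at a point $x$. The notation here is introduced for illustration and is not taken from the cited works: we assume $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and write $\hat{f}$ for a model fitted on a random training sample, with expectations taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing model capacity typically lowers the first term and raises the second, which is why their sum is U-shaped; the noise term sets a floor that no model can go below.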
Statistical bias arises when training data does not represent the broader population—for example, a cancer outcome model trained mainly on urban hospital data may underperform for rural patients. This can be mitigated by using diverse datasets with sufficient variance and samples to capture rare conditions. Algorithmic bias stems from design flaws—such as word embeddings linking “doctor” mostly with male pronouns and “nurse” with female pronouns—reflecting historical oversights in considering biological, environmental, and social factors.79 Minimizing this bias requires carefully designed training objectives, fairness-aware algorithms, and debiasing techniques.80 Lastly, societal bias reflects deep-rooted biases in the data that can lead to unequal outcomes by associating conditions disproportionately with certain demographic groups.81 Addressing all these domains is critical to ensure that NLP models perform fairly across diverse patient populations.78
To address these biases in practice, several tools and frameworks have been developed for detecting and mitigating bias in NLP models. TCAV (Testing with Concept Activation Vectors)82 is a Google-developed framework that helps to interpret and analyse the behaviour of machine learning models, while audit-AI (https://github.com/pymetrics/audit-ai) and IBM's AI Fairness 360 (AIF360)83 examine machine learning models to detect race, gender, location, and other sources of bias and discrimination. While these tools may be helpful overall, their applicability to narrow use cases may be limited. If the algorithm under consideration has resulted in manuscripts and clinical trials, the CLAIM (Checklist for Artificial Intelligence in Medical Imaging),84 CONSORT-AI, and CLAMP (Clinical Language Annotation, Modeling, and Processing)85 publications can be used and adapted to ensure adequate study organization, scientific reporting, and robust performance testing, including sub-group representation.80 Lastly, it is important to remember that the original data used for the development and testing of NLP pipelines may not be representative of an institution’s local data. Therefore, the clinic must ensure that the algorithm remains fair when applied to its own diverse patient population; the tools and concepts mentioned above can aid in this step of clinical implementation.
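A local sub-group check of the kind described above can be sketched in a few lines of plain Python. The example below does not use the API of audit-AI or AIF360; the function names, the group labels, and the toy records are all hypothetical. It reports recall and the rate of positive predictions per sub-group, together with the ratio of the lowest to the highest value across groups.

```python
from collections import defaultdict


def subgroup_report(records):
    """Per-group recall and selection rate.

    records: iterable of (group, y_true, y_pred) with binary labels 0/1.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "pred_pos": 0, "n": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += int(y_pred == 1)
        if y_true == 1:
            c["tp"] += int(y_pred == 1)
            c["fn"] += int(y_pred == 0)
    report = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        report[group] = {
            # Recall is undefined for a group with no true positives.
            "recall": c["tp"] / positives if positives else None,
            "selection_rate": c["pred_pos"] / c["n"],
        }
    return report


def min_max_ratio(report, key):
    """Ratio of the lowest to the highest group value (1.0 means parity)."""
    values = [r[key] for r in report.values() if r[key] is not None]
    if not values or max(values) == 0:
        return None
    return min(values) / max(values)


# Hypothetical predictions for two patient sub-groups.
records = [
    ("urban", 1, 1), ("urban", 1, 1), ("urban", 0, 0), ("urban", 0, 1),
    ("rural", 1, 1), ("rural", 1, 0), ("rural", 0, 0), ("rural", 0, 0),
]
report = subgroup_report(records)
recall_ratio = min_max_ratio(report, "recall")
```

A ratio well below 1 would suggest that the model serves one sub-group worse than another and would warrant closer review; the threshold at which this becomes unacceptable is a clinical and institutional decision rather than something the code can determine.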
Clinician trust in the NLP algorithm should be evaluated next on the path towards clinical implementation. Trust is a psychological mechanism that deals with the uncertainty between known and unknown, where algorithm transparency, predictability, and fairness can play a large role in earning the trust of clinicians.86 Algorithm fairness has been discussed above, where minimization of bias is critical. The transparency of the algorithm is related to the explainability and interpretability of the results, so that the output can be mapped back to a selection of inputs.87 In general, the more advanced an algorithm is, the lower its explainability.88 However, to promote trust, explainability and interpretability are increasingly being incorporated into the more advanced algorithms.89 Beyond explainability, trust in the algorithm is largely based on the predictability of outcomes, especially when faced with conflicting inputs.
Similar to interpretability approaches in machine learning more broadly, NLP methods fall into the categories of model-specific and model-agnostic interpretability.90 For instance, the attention mechanism itself provides some level of model-specific interpretability. Attention weights offer a glimpse into the workings of the model by illustrating the importance of different words or phrases in the input for the model’s decision-making.8 This way, clinicians can potentially understand which parts of a patient's history or report the algorithm deemed significant. On the model-agnostic side, techniques such as LIME (Local Interpretable Model-agnostic Explanations; https://c3.ai/glossary/data-science/lime-local-interpretable-model-agnostic-explanations/) and SHAP (SHapley Additive exPlanations)91 have also been applied to NLP. For example, LIME can provide insight into the model's decisions by perturbing the input and observing the model's output, thereby explaining individual predictions.92 SHAP values can be aggregated to provide a global view of feature importance across all predictions, demonstrating how much each feature (or word, in a textual modality) contributes to the model's decisions. For example, a study by Khanmohammadi et al93 utilized SHAP for interpretability in clinical NLP tasks, demonstrating the most significant sound features in predicting foetal biological sex using phonocardiogram signals. These methods enhance the transparency of NLP algorithms and provide clinicians with a better understanding of how the algorithms arrive at their conclusions. Published NLP algorithms are designed to work on specific tasks, with the embedded assumption that the training and test data are generated from a similar statistical distribution. This assumption may easily be violated in clinical scenarios, where input data may not match the statistical assumptions, and the algorithm's stability under these inputs will be directly associated with the clinician's trust in it.
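The perturbation idea that underlies model-agnostic explanation can be illustrated with a deliberately simplified sketch. The code below is neither LIME nor SHAP: LIME fits a local surrogate model to many perturbed samples, and SHAP averages contributions over feature coalitions, whereas this example only removes one word at a time and records how much the score changes. The scoring function and its keyword weights are hypothetical stand-ins for a real clinical NLP model.

```python
def toy_toxicity_score(text):
    """Hypothetical stand-in for a model's probability-like output."""
    weights = {"dermatitis": 0.5, "fatigue": 0.3, "severe": 0.2}
    tokens = text.lower().split()
    return min(1.0, sum(weights.get(token, 0.0) for token in tokens))


def occlusion_importance(text, predict):
    """Rank tokens by the drop in score observed when each one is removed."""
    tokens = text.split()
    baseline = predict(text)
    importances = []
    for i, token in enumerate(tokens):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        importances.append((token, baseline - predict(perturbed)))
    # Largest drop first: these tokens mattered most to the prediction.
    return sorted(importances, key=lambda pair: pair[1], reverse=True)


ranked = occlusion_importance(
    "patient reports severe dermatitis", toy_toxicity_score
)
```

For the example sentence, the word "dermatitis" ranks first because removing it lowers the toy score the most. With a real model, the same loop would call the deployed pipeline in place of `toy_toxicity_score`, although dedicated libraries handle token interactions and sampling far more carefully.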
To test for the predictability of outcomes, and an algorithm's stability, the concept of adversarial testing can be used to estimate algorithm performance under unstable inputs.94 There are several methods for performing adversarial testing, with the evasion method perhaps being the most suitable for testing the edges of the NLP model.95 This testing would encompass actively modifying input data that represents the most extreme clinical scenario and analysing the algorithm output for (1) transparency—does the model explain or interpret the outputs to specific inputs? (2) predictability—does it show the same result every time? (3) fairness—does the output represent a sensible answer that is rooted in the representation of all sub-groups within the clinical setting?96
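Two of these checks, predictability and stability under modified inputs, lend themselves to a simple automated harness. The sketch below is a minimal illustration under stated assumptions: the keyword classifier is a hypothetical stand-in for a real pipeline, and the only perturbation used is a single adjacent-character swap, which is far milder than the clinically extreme inputs that a full evasion test would construct.

```python
import random


def swap_adjacent_chars(text, rng):
    """Introduce one typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def is_repeatable(predict, text, runs=5):
    """Predictability check: identical input should give identical output."""
    outputs = {predict(text) for _ in range(runs)}
    return len(outputs) == 1


def flip_rate(predict, text, n_perturbations=20, seed=0):
    """Fraction of perturbed inputs whose label differs from the original."""
    rng = random.Random(seed)  # fixed seed keeps the test itself reproducible
    original = predict(text)
    flips = sum(
        predict(swap_adjacent_chars(text, rng)) != original
        for _ in range(n_perturbations)
    )
    return flips / n_perturbations


def toy_classifier(text):
    """Hypothetical keyword model standing in for a real NLP pipeline."""
    return "toxicity" if "dermatitis" in text.lower() else "none"


repeatable = is_repeatable(toy_classifier, "Grade 2 dermatitis noted")
typo_flip_rate = flip_rate(toy_classifier, "Grade 2 dermatitis noted")
```

A non-zero flip rate indicates that small, clinically meaningless changes to the input can alter the output, which is the kind of instability that erodes trust. The transparency and fairness questions still require human review of the explanations and of sub-group outputs.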
Legal and ethical considerations
Legal and regulatory frameworks ensure the safe and ethical use of AI algorithms in healthcare. These address the potential risks and challenges and cover 3 main aspects of AI development and deployment, namely, how medical devices are regulated, how health data privacy is protected, and how liability is assigned for any harm caused by faulty, erroneous, or unsafe algorithm recommendations.97 In the United States, algorithm regulation follows Food and Drug Administration (FDA) guidelines, and significant progress has been made by the regulatory agency in defining standards and guiding principles for AI algorithms. NLP algorithms, being a subclass of AI, will fall under the category of Software as a Medical Device (SaMD). In 2019, the FDA proposed a regulatory framework for these devices (https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf), which outlines a risk-based approach to regulation based on the intended use of the device and the patient’s risk from inaccurate output. In 2021, the FDA proposed an action plan and guiding principles in collaboration with Canada and the UK to ensure safe, effective, and high-quality SaMD use (https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device, https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles). Recent publications have discussed the regulation of LLM-based chatbots as medical devices.98 Therefore, it is critical to understand the level of regulation that applies to the device under consideration and its regulatory implications to ensure effective clinical use.
Health data privacy is critical as LLMs become more integrated into clinical workflows. While HIPAA provides a foundational framework for protecting patient health information, it has limitations with modern AI technologies. As noted by Rezaeikhonakdar et al,99 many AI tools and chatbots are not classified as HIPAA-covered entities or business associates, meaning patient interactions with systems like ChatGPT or Google Gemini may fall outside HIPAA’s protections. Even when these systems claim HIPAA compliance, risks such as re-identification of de-identified data remain, as highlighted in cases like Dinerstein v. Google,100 and these risks are compounded when large tech companies cross-reference extensive data resources, potentially compromising patient confidentiality.101,102 Moreover, Marks and Haupt103 emphasize that HIPAA compliance alone may not protect patient data from misuse; chatbots can prompt users to disclose sensitive information, which may then be leveraged for unauthorized data sharing, targeted advertising, or discriminatory practices in insurance and employment. To mitigate these risks, developers must adopt rigorous data governance and transparent practices in data usage. Clinicians also play a key role by limiting the input of sensitive data into these tools and ensuring they act as responsible data stewards, understanding how these systems manage and process information. Informed consent is essential, with patients needing to know how their data will be used, stored, and protected, thereby building trust and upholding ethical principles of autonomy and respect for patients.104 Additionally, robust liability systems must be established to address any harm caused by NLP algorithms, ensuring accountability and patient safety.102,105
Quality assurance
The technical and clinical performance of the model is estimated during the commissioning process. A Quality Management Program (QMP) details the performance tests, frequency of testing, expected outputs, and plan of action for inadequate performance. Further, it should include a quality improvement section, where inadequate performance of the algorithm under routine use can be evaluated and discussed in detail to ensure safe and quality clinical care. As such, the QMP can be divided into routine quality assurance and case-specific testing for quality improvement.106 Routine quality assurance must be performed periodically, using the same stable reference dataset every time, to ensure the stability of outputs. Ideally, the reference data is a subset of the data that is utilized during the commissioning process and is representative of routine clinical data and use cases. This step should also be used for the clinical release of a device after downtime or minor changes, such as changes in computation hardware. A comprehensive data logging system is recommended for the structured collection of algorithm input, output, and stability. This should be a pivotal piece of the quality improvement component, whereby unexpected performance by the algorithm can be traced back to the inputs, and root cause analysis can be performed to uphold safe and quality-driven clinical care. A review of case-specific performance further allows for the identification of model limitations that can facilitate future model revisions.106 We believe that new and emergent use cases for the algorithm should be evaluated fully by testing purpose, clinical fit, commissioning, and quality assurance plan as outlined in section “Framework for clinical implementation.” This rigorous step will ensure that the algorithm performance is suitable for safe and high-quality clinical care.
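The routine quality assurance step can be sketched as a small regression check that replays a fixed reference set and produces a structured log record. Everything named in the example is a hypothetical illustration: the case identifiers, the toy model, the `max_failure_rate` parameter, and the record layout are assumptions, and a clinical QMP would define its own tolerances and logging schema.

```python
import json
from datetime import datetime, timezone


def run_routine_qa(predict, reference_cases, max_failure_rate=0.0):
    """Replay a fixed reference set and compare against commissioned outputs."""
    if not reference_cases:
        raise ValueError("reference set must contain at least one case")
    failures = []
    for case in reference_cases:
        output = predict(case["input"])
        if output != case["expected"]:
            # Keep enough detail to trace a failure back to its input.
            failures.append(
                {"id": case["id"], "expected": case["expected"], "got": output}
            )
    failure_rate = len(failures) / len(reference_cases)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "n_cases": len(reference_cases),
        "failure_rate": failure_rate,
        "passed": failure_rate <= max_failure_rate,
        "failures": failures,
    }


def toy_model(text):
    """Hypothetical keyword model standing in for the commissioned pipeline."""
    return "toxicity" if "dermatitis" in text.lower() else "none"


# Hypothetical reference cases with outputs recorded at commissioning.
REFERENCE_CASES = [
    {"id": "ref-001", "input": "Grade 2 dermatitis in the treatment field",
     "expected": "toxicity"},
    {"id": "ref-002", "input": "Skin intact, patient feels well",
     "expected": "none"},
]

record = run_routine_qa(toy_model, REFERENCE_CASES)
log_line = json.dumps(record)  # one JSON line per QA run, appended to a log
```

Running the same check after downtime or a hardware change, and retaining each record, gives the traceable history that root cause analysis depends on.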
Real-world validation of the framework for clinical implementation
The proposed framework is validated by research studies employing similar methodologies. Walker et al49 developed an NLP tool to standardize free-text treatment site documentation in radiation oncology EMRs, using domain-specific dictionaries and error correction to achieve superior precision and recall, thereby aligning with the framework’s emphasis on defining purpose and clinical fit. Similarly, Hong et al50 demonstrated the commissioning process with an NLP pipeline for automating toxicity data abstraction, achieving high accuracy for toxicities like radiation dermatitis and fatigue while addressing challenges with negated symptoms to guide iterative improvements in performance metrics, transparency, and fairness. Furthermore, Mathew et al107 integrated NLP and machine learning into an incident learning system, streamlining workflows, promoting a safety culture, and supporting scalability through open-source pipelines and standard taxonomies, which reinforces the framework’s focus on routine and case-specific quality assurance. Together, these studies confirm that a structured approach encompassing purpose, clinical integration, commissioning, and quality assurance can successfully deploy NLP systems in radiation oncology, enhancing both operational efficiency and clinical outcomes while addressing key implementation challenges.
Conclusion
In this article, we have discussed the recent advances and applications of NLP in radiation oncology. NLP is a powerful tool that can transform unstructured clinical narratives into structured data for medical AI systems. NLP models utilizing LLMs and based on self-attention transformer architecture can perform multiple domain-specific tasks through transfer learning, which reduces the need for large annotated data sets and training burdens. These models demonstrate good performance. However, before these models can be implemented and used in routine clinical care, they need to be rigorously evaluated for their validity, functionality, viability, safety, and ethical use. We have also proposed a framework that radiation oncology clinicians can use to assess the suitability of NLP models for their needs. The checklist aptly discusses key areas of algorithm training, tuning, transparency and interpretability, bias, and fairness, as well as legal and ethical concerns. Overall, these novel NLP techniques can enable the creation of more advanced AI models, which can improve patient outcomes and expedite the progress of precision medicine in radiation oncology under appropriate ethical and technical constraints.
Contributor Information
Reza Khanmohammadi, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, United States.
Mohammad M Ghassemi, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, United States.
Kyle Verdecchia, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States.
Ahmed I Ghanem, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States; Alexandria Department of Clinical Oncology, Alexandria University, Alexandria 21561, Egypt.
Bing Luo, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States.
Indrin J Chetty, Department of Radiation Oncology, Cedars Sinai Medical Center, Los Angeles, CA 90048, United States.
Hassan Bagher-Ebadian, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States; Departments of Radiology and Osteopathic Medicine, Michigan State University, East Lansing, MI 48824, United States; Department of Physics, Oakland University, Rochester, MI 48326, United States.
Farzan Siddiqui, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States.
Mohamed Elshaikh, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States.
Benjamin Movsas, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States.
Kundan Thind, Department of Radiation Oncology, Henry Ford Health, Detroit, MI 48202, United States; Department of Medicine, Michigan State University, East Lansing, MI 48824, United States.
Funding
None declared.
Conflicts of interest
None declared.
References
- 1. Bohr A, Memarzadeh K. The rise of artificial intelligence in healthcare applications. Artif Intell Healthc. 2020;25-60. 10.1016/B978-0-12-818438-7.00002-2
- 2. Johnson KB, Wei WQ, Weeraratne D, et al. Precision medicine, AI, and the future of personalized health care. Clin Transl Sci. 2021;14:86-93. 10.1111/cts.12884
- 3. Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep. 2021;11:12281. 10.1038/S41598-021-91656-8
- 4. Eghbali N, Alhanai T, Ghassemi MM. Patient-specific sedation management via deep reinforcement learning. Front Digit Health. 2021;3:608893. 10.3389/fdgth.2021.608893
- 5. Nwagwu W. RETRACTED: the rise and rise of natural language processing research. 2022;1958-2021. 10.21203/rs.3.rs-2265814/v1
- 6. Thompson NC, Greenewald K, Lee K, Manso GF. The computational limits of deep learning. 2020. 10.21428/bf6fb269.1f033948
- 7. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349:261-266. 10.1126/science.aaa8685
- 8. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates Inc.; 2017:6000-6010.
- 9. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2019:4171-4186. 10.18653/v1/N19-1423
- 10. Zhao WX, Zhou K, Li J, et al. A survey of large language models. 한국컴퓨터종합학술대회 논문집. 2023. Accessed February 18, 2025. https://arxiv.org/abs/2303.18223v15
- 11. Wang L, Lyu C, Ji T, et al. Document-level machine translation with large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2023:16646-16661. 10.18653/v1/2023.emnlp-main.1036
- 12. Susnjak T. Applying BERT and ChatGPT for sentiment analysis of lyme disease in scientific literature. Methods Mol Biol. 2024;2742:173-183. 10.1007/978-1-0716-3561-2_14
- 13. Alyafeai Z, AlShaibani MS, Ahmad I. A survey on transfer learning in natural language processing. arXiv, 2020;6523-6541, Accessed February 18, 2025. https://arxiv.org/abs/2007.04239v1, preprint: not peer reviewed.
- 14. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are Zero-Shot reasoners. Adv Neural Inf Process Syst. 2022;35. Accessed February 18, 2025. https://arxiv.org/abs/2205.11916v4
- 15. Abro AA, Talpur MSH, Jumani AK. Natural language processing challenges and issues: a literature review. Gazi Univ J Sci. 2023;36:1522-1536. 10.35378/gujs.1032517
- 16. Bhatt S, Jain R, Dandapat S, Sitaram SA. Case study of efficacy and challenges in practical human-in-loop evaluation of NLP systems using checklist. Online. 2021:120-130. Accessed February 18, 2025. https://aclanthology.org/2021.humeval-1.14/
- 17. Yim WW, Yetisgen M, Harris WP, Sharon WK. Natural language processing in oncology: a review. JAMA Oncol. 2016;2:797-804. 10.1001/jamaoncol.2016.0213
- 18. Bitterman DS, Miller TA, Mak RH, Savova GK. Clinical natural language processing for radiation oncology: a review and practical primer. Int J Radiat Oncol Biol Phys. 2021;110:641-655. 10.1016/j.ijrobp.2021.01.044
- 19. Kehl KL, Xu W, Lepisto E, et al. Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clin Cancer Inform. 2020;4:680-690. 10.1200/CCI.20.00020
- 20. Santoro M, Strolin S, Paolani G, et al. Recent applications of artificial intelligence in radiotherapy: where we are and beyond. Appl Sci (Switzerland). 2022;12:3223. 10.3390/app12073223
- 21. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst. 2023;47:33. 10.1007/s10916-023-01925-4
- 22. Tu T, Azizi S, Driess D, et al. Towards generalist biomedical AI. NEJM AI. 2024;1:AIoa2300138. 10.1056/AIoa2300138
- 23. Netherton TJ, Cardenas CE, Rhee DJ, Court LE, Beadle BM. The emergence of artificial intelligence within radiation oncology treatment planning. Oncology. 2021;99:124-134. 10.1159/000512172
- 24. Wahid KA, Glerean E, Sahlsten J, et al. Artificial intelligence for radiation oncology applications using public datasets. Semin Radiat Oncol. 2022;32:400-414. 10.1016/j.semradonc.2022.06.009
- 25. Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194-199. 10.1038/s41746-022-00742-2
- 26. Parkinson C, Matthams C, Foley K, Spezi E. Artificial intelligence in radiation oncology: a review of its current status and potential application for the radiotherapy workforce. Radiography. 2021;27 Suppl 1:S63-S68. 10.1016/j.radi.2021.07.012
- 27. Mardikoraem M, Wang Z, Pascual N, Woldring D. Generative models for protein sequence modeling: recent advances and future directions. Brief Bioinform. 2023;24:1-19. 10.1093/bib/bbad358
- 28. Khanmohammadi R, Mirshafiee MS, Jouryabi YR, Mirroshandel SA. Prose2Poem: the blessing of transformers in translating prose to Persian poetry. ACM Trans Asian Low-Resour Lang Inf Process. 2023;22:1. 10.1145/3592791
- 29. Hernández A, Amigó JM. Attention mechanisms and their applications to complex systems. Entropy (Basel). 2021;23:283. 10.3390/e23030283
- 30. Zhuang B, Wu Q, Shen C, Reid I, Hengel AVD. Parallel attention: a unified framework for visual object discovery through dialogs and queries. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2018;14:4252-4261. 10.1109/cvpr.2018.00447
- 31. Touvron H, Lavril T, Izacard G, et al. LLaMA: open and efficient foundation language models. arXiv.org. 10.48550/arxiv.2302.13971, 2023, preprint: not peer reviewed.
- 32. Touvron H, Martin L, Stone KR, et al. Llama 2: open foundation and fine-tuned chat models. arXiv.org. 10.48550/arxiv.2307.09288, 2023, preprint: not peer reviewed.
- 33. Dubey A, Jauhri A, Pandey A, et al. The Llama 3 Herd of Models. arXiv.org. 10.48550/ARXIV.2407.21783, 2024, preprint: not peer reviewed.
- 34. Jiang AQ, Sablayrolles A, Mensch A, et al. Mistral 7B. arXiv.org. 10.48550/arxiv.2310.06825, 2023, preprint: not peer reviewed.
- 35. Bai J, Bai S, Chu Y, et al. Qwen Technical Report. arXiv.org. 10.48550/arxiv.2309.16609, 2023, preprint: not peer reviewed.
- 36. Fu Z, Lam W, Yu Q, et al. Decoder-only or encoder-decoder? Interpreting language model as a regularized encoder-decoder. April 8, 2023. Accessed February 18, 2025. https://arxiv.org/abs/2304.04052v1
- 37. Khanmohammadi R, Mirshafiee MS, Allahyari M. COPER: a Query-Adaptable semantics-based search engine for persian COVID-19 articles. In: Proceedings of the 2021 7th International Conference on Web Research (ICWR). IEEE; 2021:64-70. 10.1109/ICWR51868.2021.9443151
- 38. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21:1-67. 10.5555/3455716.345585634305477
- 39. Lewis M, Liu Y, Goyal N, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020:7871-7880. 10.18653/v1/2020.acl-main.703
- 40. Gupta N. A pre-trained vs fine-tuning methodology in transfer learning. J Phys: Conf Ser. 2021;1947:012028. 10.1088/1742-6596/1947/1/012028
- 41. Wolf T, Debut L, Sanh V, et al. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics; 2020:38-45. 10.18653/v1/2020.emnlp-demos.6
- 42. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. OSDI’16. USENIX Association; 2016:265-283.
- 43. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. 2019. 10.5555/3454287.3455008
- 44. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172-180. 10.1038/s41586-023-06291-2
- 45. Singhal K, Tu T, Gottweis J, et al. Towards expert-level medical question answering with large language models. May 16, 2023. Accessed February 18, 2025. https://arxiv.org/abs/2305.09617v1
- 46. Wu H, Wang M, Wu J, et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. Npj Digit Med. 2022;5:1-15. 10.1038/s41746-022-00730-6
- 47. Wang L, Luo L, Wang Y, Wampfler JA, Yang P, Liu H. Information extraction for populating lung cancer clinical research data. In: 2019 IEEE International Conference on Healthcare Informatics, ICHI 2019. 2019;1-2. 10.1109/ichi.2019.8904601
- 48. Syed K, Sleeman W, Ivey K, et al. Integrated natural language processing and machine learning models for standardizing radiotherapy structure names. Healthcare (Basel). 2020;8:120. 10.3390/healthcare8020120
- 49. Walker G, Soysal E, Xu H. Development of a natural language processing tool to extract radiation treatment sites. Cureus. 2019;11:e6010. 10.7759/cureus.6010
- 50. Hong JC, Fairchild AT, Tanksley JP, Palta M, Tenenbaum JD. Natural language processing for abstraction of cancer treatment toxicities: accuracy versus human experts. JAMIA Open. 2020;3:513-517. 10.1093/jamiaopen/ooaa064
- 51. Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today. 2020;25:689-705. 10.1016/j.drudis.2020.01.020
- 52. Khanmohammadi R, Ghanem AI, Verdecchia K, et al. A novel localized student-teacher LLM for enhanced toxicity extraction in radiation oncology. Int J Radiat Oncol Biol Phys. 2024;120:e632-e633. 10.1016/j.ijrobp.2024.07.1392
- 53. Khanmohammadi R, Ghanem A, Verdecchia K, et al. Iterative prompt refinement for radiation oncology symptom extraction using Teacher-Student large language models. arXiv.org. 10.48550/arxiv.2402.04075, 2024, preprint: not peer reviewed.
- 54. Chuang YN, Tang R, Jiang X, Hu X. SPeC: a soft prompt-based calibration on performance variability of large language model in clinical notes summarization. J Biomed Inform. 2024;151:104606. 10.1016/j.jbi.2024.104606
- 55. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30:1134-1142. 10.1038/s41591-024-02855-5
- 56. Liu ZL, Huang Y, Yu XX, et al. DeID-GPT: Zero-shot medical text De-Identification by GPT-4. arXiv.org. 10.48550/arxiv.2303.11032, 2023, preprint: not peer reviewed.
- 57. Tang R, Han X, Jiang X, Hu X. Does synthetic data generation of LLMs help clinical text mining? March 8, 2023. Accessed February 18, 2025. https://arxiv.org/abs/2303.04360v2
- 58. Ferber D, Nahhas OE, Wölflein G, et al. Autonomous artificial intelligence agents for clinical decision making in oncology. April 6, 2024. Accessed February 18, 2025. https://arxiv.org/abs/2404.04667v1
- 59. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. 10.2196/45312
- 60. Liu Z, Wang P, Li Y, et al. RadOnc-GPT: a large language model for radiation oncology. September 18, 2023. Accessed February 18, 2025. https://arxiv.org/abs/2309.10160v3
- 61. Yalamanchili A, Sengupta B, Song J, et al. Quality of large language model responses to radiation oncology patient care questions. JAMA Netw Open. 2024;7:e244630. 10.1001/jamanetworkopen.2024.4630 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Jia F, Liu X, Deng L, et al. OncoGPT: a medical conversational model tailored with oncology domain expertise on a large language model meta-AI (LLaMA). February 26, 2024. Accessed February 18, 2025. https://arxiv.org/abs/2402.16810v1
- 63.Liang P, Wu C, Morency LP, et al. Towards understanding and mitigating social biases in language models. In: Proceedings of the 38th International Conference on Machine Learning. PMLR; 2021:6565-6576. https://proceedings.mlr.press/v139/liang21a.html
- 64. Straw I, Callison-Burch C. Artificial intelligence in mental health and the biases of language based models. PLoS One. 2020;15:e0240376-12. 10.1371/journal.pone.0240376 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In: EMNLP 2014—2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2014;1532-1543. 10.3115/v1/d14-1162 [DOI]
- 66. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv.org. 10.48550/arxiv.1301.3781, 2013, preprint: not peer reviewed. [DOI]
- 67. Welbl J, Glaese A, Uesato J, et al. Challenges in detoxifying language models. In: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021. 2021;2447-2469. 10.18653/v1/2021.findings-emnlp.210 [DOI]
- 68.Park YA, Rudzicz F. Detoxifying language models with a toxic corpus. In: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion. Association for Computational Linguistics; 2022:41-46. 10.18653/v1/2022.ltedi-1.6 [DOI]
- 69. Rodrigues R. Legal and human rights issues of AI: Gaps, challenges and vulnerabilities. J Responsib Technol. 2020;4:100005. 10.1016/j.jrt.2020.100005 [DOI] [Google Scholar]
- 70. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of stochastic parrots: Can language models Be too big? In: Conference on Fairness, Accountability and Transparency. 2021;610-623. 10.1145/3442188.3445922 [DOI]
- 71. Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245-247. 10.1016/j.jclinepi.2015.04.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72. Wainer J, Cawley G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl. 2021;182:115222. 10.1016/j.eswa.2021.115222 [DOI] [Google Scholar]
- 73. Patwardhan N, Marrone S, Sansone C. Transformers in the real world: a survey on NLP applications. Information. 2023;14:242. 10.3390/info14040242
- 74. Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL Workshop. Association for Computational Linguistics; 2004:74-81. https://aclanthology.org/W04-1013/
- 75. Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002:311-318. 10.3115/1073083.1073135
- 76. Vokinger KN, Feuerriegel S, Kesselheim AS. Mitigating bias in machine learning for medicine. Commun Med (Lond). 2021;1:25. 10.1038/s43856-021-00028-w
- 77. Montesinos López OA, Montesinos López A, Crossa J. Overfitting, model tuning, and evaluation of prediction performance. In: Multivariate Statistical Machine Learning Methods for Genomic Prediction. Springer; 2022:109-139. 10.1007/978-3-030-89010-0_4
- 78. Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: a call for open science. Patterns. 2021;2:100347. 10.1016/j.patter.2021.100347
- 79. McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2:e221-e223. 10.1016/S2589-7500(20)30065-0
- 80. El Naqa IM, Hu Q, Chen W, et al. Lessons learned in transitioning to AI in the medical imaging of COVID-19. J Med Imag. 2021;8. 10.1117/1.JMI.8.S1.010902
- 81. Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6:e12-e22. 10.1016/S2589-7500(23)00225-X
- 82. Kim B, Wattenberg M, Gilmer J, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In: Proceedings of the 35th International Conference on Machine Learning. PMLR; 2018:2673-2682. http://proceedings.mlr.press/v80/kim18d.html
- 83. Bellamy R, Dey K, Hind M, et al. AI Fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv.org. 2018, preprint: not peer reviewed.
- 84. Tejani AS, Klontzas ME, Gatti AA, et al. Checklist for artificial intelligence in medical imaging (CLAIM): 2024 update. Radiol Artif Intell. 2024;6:e240300. 10.1148/ryai.240300
- 85. Soysal E, Wang J, Jiang M, et al. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc. 2018;25:331-336. 10.1093/jamia/ocx132
- 86. Asan O, Bayrak AE, Choudhury A. Artificial intelligence and human trust in healthcare: Focus on clinicians. J Med Internet Res. 2020;22:e15154. 10.2196/15154
- 87. Linardatos P, Papastefanopoulos V, Kotsiantis S. Explainable AI: a review of machine learning interpretability methods. Entropy (Basel). 2020;23:1-45. 10.3390/e23010018
- 88. Herm LV, Heinrich K, Wanner J, Janiesch C. Stop ordering machine learning algorithms by their explainability! a user-centered investigation of performance and explainability. Int J Inf Manage. 2023;69:102538. 10.1016/j.ijinfomgt.2022.102538
- 89. Shin D. The effects of explainability and causability on perception, trust, and acceptance: implications for explainable AI. Int J Hum Comput Stud. 2021;146:102551. 10.1016/j.ijhcs.2020.102551
- 90. Carrillo A, Cantú LF, Noriega A. Individual explanations in machine learning models: a survey for practitioners. arXiv.org. 10.48550/arxiv.2104.04144, 2021, preprint: not peer reviewed.
- 91. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates Inc.; 2017:4768-4777.
- 92. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; 2016:1135-1144. 10.1145/2939672.2939778
- 93. Khanmohammadi R, Mirshafiee MS, Ghassemi M, et al. Fetal gender identification using machine and deep learning algorithms on phonocardiogram signals. arXiv.org. 10.48550/arxiv.2110.06131, 2021, preprint: not peer reviewed.
- 94. Goodfellow I, McDaniel P, Papernot N. Making machine learning robust against adversarial inputs. Commun ACM. 2018;61:56-66. 10.1145/3134599
- 95. Biggio B, Corona I, Maiorca D, et al. Evasion attacks against machine learning at test time. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2013). Lecture Notes in Computer Science. 2013;8190:387-402. 10.1007/978-3-642-40994-3_25
- 96. Borkar J, Chen PY. Simple transparent adversarial examples. ICLR; 2021. 10.48550/arXiv.2105.09685
- 97. Drabiak K, Kyzer S, Nemov V, El Naqa I. AI and ML ethics, law, diversity, and global impact. Br J Radiol. 2023;96. 10.1259/bjr.20220934
- 98. Gilbert S, Harvey H, Melvin T, Vollebregt E, Wicks P. Large language model AI chatbots require approval as medical devices. Nat Med. 2023;29:2396-2398. 10.1038/s41591-023-02412-6
- 99. Rezaeikhonakdar D. AI chatbots and challenges of HIPAA compliance for AI developers and vendors. J Law Med Ethics. 2023;51:988-995. 10.1017/jme.2024.15
- 100. Dinerstein v. Google, LLC, No. 20-3134 (7th Cir. 2023). Justia. Accessed February 18, 2025. https://law.justia.com/cases/federal/appellate-courts/ca7/20-3134/20-3134-2023-07-11.html
- 101. Smith H. Clinical AI: opacity, accountability, responsibility and liability. AI & Soc. 2021;36:535-545. 10.1007/s00146-020-01019-6
- 102. Smith H, Fotheringham K. Artificial intelligence in clinical decision-making: rethinking liability. Med Law Int. 2020;20:131-154. 10.1177/0968533220945766
- 103. Marks M, Haupt CE. AI chatbots, health privacy, and challenges to HIPAA compliance. JAMA. 2023;330:309-310. 10.1001/jama.2023.9458
- 104. Naik N, Hameed BMZ, Shetty DK, et al. Legal and ethical consideration in artificial intelligence in healthcare: who takes responsibility? Front Surg. 2022;9:862322. 10.3389/fsurg.2022.862322
- 105. Gerke S, Minssen T, Cohen IG. Ethical and legal challenges of artificial intelligence-driven healthcare. SSRN J. 2020. 10.2139/ssrn.3570129
- 106. Vandewinckele L, Claessens M, Dinkla A, et al. Overview of artificial intelligence-based applications in radiotherapy: recommendations for implementation and quality assurance. Radiother Oncol. 2020;153:55-66. 10.1016/j.radonc.2020.09.008
- 107. Mathew F, Wang H, Montgomery L, Kildea J. Natural language processing and machine learning to assist radiation oncology incident learning. J Appl Clin Med Phys. 2021;22:172-184. 10.1002/acm2.13437

