Abstract
Deep learning (DL), a subfield of machine learning, has made remarkable strides across various aspects of medicine. This review examines DL’s applications in hematology, spanning from molecular insights to patient care. The review begins by providing a straightforward introduction to the basics of DL tailored for those without prior knowledge, touching on essential concepts, principal architectures, and prevalent training methods. It then discusses the applications of DL in hematology, concentrating on elucidating the models’ architecture, their applications, performance metrics, and inherent limitations. For example, at the molecular level, DL has improved the analysis of multi-omics data and protein structure prediction. For cells and tissues, DL enables the automation of cytomorphology analysis, interpretation of flow cytometry data, and diagnosis from whole slide images. At the patient level, DL’s utility extends to analyzing curated clinical data, electronic health records, and clinical notes through large language models. While DL has shown promising results in various hematology applications, challenges remain in model generalizability and explainability. Moreover, the integration of novel DL architectures into hematology has been relatively slow in comparison to that in other medical fields.
Keywords: Deep Learning, Hematology, Artificial Intelligence, Whole Slide Images, Large Language Models
Introduction
The public release of ChatGPT, an artificial intelligence (AI) system based on a deep learning (DL) architecture, has sparked intense discussion on the impacts of AI. This latest sensation highlights the tremendous progress made in DL over the past decade. With roots tracing back to the 1940s, when researchers first aimed to mimic human neuron interactions,1 deep learning, utilizing neural networks, has rapidly risen to prominence since the mid-2000s, owing to increases in computing power and improvements in mathematical techniques.2 Today, DL underpins the transformative capabilities in the two major fields of AI – natural language processing (NLP) and computer vision (CV). Moreover, progress in DL has been increasingly integrated into the biomedical field, enhancing various aspects of research and clinical applications.3
The essence of machine learning (ML) is learning the underlying distribution of data – uncovering the intricate patterns and complex rules that govern the data.4 While AI and ML are broader concepts, DL is a subclass of ML that utilizes multi-layer neural networks to learn such distributions from vast volumes of data. In neural networks, a layer is a computational module that takes in data, performs certain mathematical operations, and then outputs the transformed data. When multiple layers are stacked, they work together to recapitulate the underlying distribution of the data. The ‘deep’ in Deep Learning refers to having many such layers, enabling the network to learn very complex patterns.
This review is designed to explain DL concepts and common DL models at a high level, aiming to assist hematologists in more critically appraising studies that incorporate DL techniques. It will also provide a comprehensive overview of the recent advancements in applying DL in the field of hematology, spanning from molecular to patient levels. (Figure 1) We hope to provide hematologists with a practical understanding of the field’s current capabilities and limitations.
Deep Learning Models
Basic process of deep learning
In tasks like predicting the next word in a sentence or identifying an image’s content, the first step is to convert the input, be it words or images, into a digital form. This is done through data encoding, where words or sub-words (also called tokens) are represented by unique numbers, and images are broken down into pixels, also represented numerically. (Figure 2A) Each word or pixel becomes a node, also known as a neuron, the fundamental unit of neural networks. Next, the network combines these nodes using linear transformations, where each node is assigned a weight and summed up to create a new node. Multiple different sets of weights can be applied to the initial nodes, thus generating multiple new nodes, mimicking different ways that information can be combined. Together, the original and resultant nodes form a linear layer, the basic computational unit in all DL models. (Figure 2A) By stacking multiple such layers or their variants (discussed below), a deep network is created. The network’s effectiveness is evaluated using a loss function, which measures the discrepancy between the network’s output value and the actual value (ground truth). Training a DL network therefore consists of slightly adjusting the weights with each iteration to minimize the loss function, thereby refining the network’s predictive accuracy over time.
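To make this concrete, the sketch below implements the loop just described in PyTorch: a single linear layer, a loss function measuring the discrepancy from the ground truth, and iterative weight updates. The data, layer size, and learning rate are illustrative assumptions rather than details from any model discussed in this review.

```python
import torch
import torch.nn as nn

x = torch.randn(64, 10)      # 64 encoded inputs, each with 10 features
y = torch.randn(64, 1)       # the corresponding ground-truth values

layer = nn.Linear(10, 1)     # one linear layer: weighted sums of the inputs
loss_fn = nn.MSELoss()       # measures output-vs-ground-truth discrepancy
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)

for step in range(100):      # each iteration slightly adjusts the weights
    prediction = layer(x)
    loss = loss_fn(prediction, y)
    optimizer.zero_grad()
    loss.backward()          # compute how each weight should change
    optimizer.step()         # update the weights to reduce the loss
```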
Key modules in deep learning models
Built on linear layers, the multi-layer perceptron (MLP), convolutional blocks, and self-attention blocks are the three most widely used modules in DL models. (Figure 2B-2D) At a high level, convolutional blocks are mostly for image-based inputs, where they excel in extracting localized features like textures and patterns. On the other hand, self-attention blocks are tailored for sequence-based inputs, adept at identifying and emphasizing relationships and dependencies between different parts of the sequence. In contrast, MLPs serve as versatile processors, typically employed to synthesize and interpret these extracted features, effectively integrating and translating them into meaningful outputs.
The MLP, also widely known as the feedforward network (FFN) or fully connected layer, is essentially a series of linear layers interconnected with a nonlinear element known as the activation function. (Figure 2B) It is understood that each layer is capable of learning distinct characteristics of the input data. However, in the absence of activation functions, the network would fundamentally be a linear model, thereby constraining its capacity to handle more intricate data sets. The key purpose of the activation function is to incorporate non-linearity, enabling deep learning models to identify and learn complex patterns within the data. As a cornerstone in almost all deep learning models, the MLP plays a pivotal role in the generalization of data patterns.
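As an illustrative sketch (in PyTorch, with arbitrary layer sizes), an MLP is simply a stack of linear layers separated by activation functions; removing the activations would collapse the stack into a single linear transformation.

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),           # the non-linearity that lets the model learn complex patterns
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10),   # e.g., scores for 10 hypothetical output classes
)
```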
Convolutional blocks are engineered for extracting features from images. The vast number of pixels in an image makes it impractical to apply a linear layer to each individual pixel. To address this, convolutional blocks use a kernel, which is a small-scale linear layer applied to small patches. This process, known as convolution (Figure 2C), integrates local information within these patches in a linear fashion. Similar to the way we scan an image to take in the whole scene, sliding the kernel across the entire image allows the convolutional layer to extract local features from different regions. Pooling, a variant of the convolution process, outputs either the maximum or average value within a patch, rather than a linear combination. This approach enhances robustness to minor positional variations. By stacking multiple convolutional blocks, the visual information of an image can be efficiently condensed into a compact form.
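A minimal convolutional block might look as follows in PyTorch; the channel counts and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A kernel slides across the image, followed by max pooling, which keeps
# only the largest value in each local patch.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # halves the spatial resolution
)

image = torch.randn(1, 3, 224, 224)   # one RGB image
features = block(image)               # shape: (1, 16, 112, 112)
```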
Self-attention blocks are designed to effectively process sequence-type data, like sentences. The key idea of self-attention is to emphasize the intrinsic relationships within a sequence. Take the sentence “She is examining blood smears” as an example. In this context, the word “examining” should have a stronger semantic connection, or more “attention”, to the word “smear” than to “blood”. This is achieved by updating the value of each word as a weighted linear combination of the values of all other words in the sentence, assigning greater weight (“attention”) to word pairs with closer relationships (Figure 2D). As a fundamental component in large language models (LLMs), self-attention is also gaining traction in computer vision tasks.
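The computation behind self-attention can be sketched in a few lines. Here the word embeddings are random, and the learned query, key, and value projections of a real model are omitted for simplicity.

```python
import torch
import torch.nn.functional as F

d = 8                                  # illustrative embedding size
words = torch.randn(5, d)              # one embedding per word in a 5-word sentence

q, k, v = words, words, words          # simplest case: no learned projections
scores = q @ k.T / d**0.5              # pairwise relatedness between words
attention = F.softmax(scores, dim=-1)  # per word, the weights over all words sum to 1
updated = attention @ v                # each word becomes a weighted combination of
                                       # all words, emphasizing closely related pairs
```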
DL models at a high level
At a high level, all DL models can be simplified to have an encoder and a decoder – the encoder takes the input data and condenses it into a representation which captures the essential features, while the decoder works to translate this representation into the desired output, whether it be a classification label, the next word of a sentence, or any other form of interpretable result. (Figure 3A) The encoder can be likened to the human process of learning, wherein we acquire new knowledge by distilling complex information into fundamental concepts and principles. Conversely, the decoder mirrors our application of this acquired knowledge, utilizing the simplified rules to execute specific tasks. In this context, the three aforementioned modules can be integrated to form various encoder and decoder structures. A carefully designed encoder can lead to more effective learning of the data, which is often a primary focus in deep learning models. The complexity of the task dictates the structure of the decoder – for simpler tasks such as classification or next-word prediction, a simple MLP would suffice. However, for more complex tasks like image segmentation or language translation, a combination of different modules is typically employed.
In the following sections, we will introduce key DL models in the two major fields of artificial intelligence – natural language processing (NLP) and computer vision (CV). This is particularly relevant since DL applications in hematology predominantly stem from advancements in these two fields.
DL models in natural language processing
At its core, NLP involves processing sequence-type data, as a sequence of words forms a sentence. Traditionally, the Recurrent Neural Network (RNN) was the go-to method for encoding sequences, until the advent of the Transformer model. The fundamental concept of a simple, or “vanilla”, RNN is to devise a method for passing information through a sequence as each component is processed sequentially. This is achieved by employing a set of evolving values, known as the hidden state. The hidden state retains the information from all previously processed components and updates itself with each new component of the sequence. This integration of the previous hidden state and the current sequence component, facilitated through an MLP-like structure, generates the new hidden state. Therefore, as an encoder, an RNN effectively encodes the entire sequence into this final hidden state. (Figure 3B) An improved variant of the vanilla RNN, known as the Long Short-Term Memory network (LSTM), has gained popularity for its enhanced ability to handle longer sequences.5
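The sequential update of the hidden state can be sketched as follows (PyTorch; all dimensions are illustrative).

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=16, hidden_size=32)

sequence = torch.randn(10, 1, 16)    # 10 steps, batch of 1, 16 features each
hidden = torch.zeros(1, 32)          # initial hidden state

for step in sequence:                # components are processed one at a time
    hidden = cell(step, hidden)      # integrate the new input with the past state
# `hidden` now encodes the entire sequence
```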
Since 2017, the field of NLP has undergone a significant transformation with the introduction of the Transformer model.6 Traditional RNN models encode sequence-type data slowly, as they integrate information one component at a time through updating the hidden state. The Transformer model addresses this limitation by applying the self-attention module to all sequence components simultaneously, allowing for parallel rather than sequential integration of information. Selectively utilizing core elements of the Transformer architecture, which principally consists of stacks of self-attention modules linked to an MLP, the Generative Pre-trained Transformer (GPT) models and Bidirectional Encoder Representations from Transformers (BERT) models stand out as two of the most prominent large language models (LLMs).7,8 (Figure 3C) Another key factor contributing to the success of LLMs has been the advancement of graphics processing units (GPUs), enabling large-scale parallel training.9 Empirical evidence suggests that the effectiveness of LLMs depends not only on the model size (number of trainable weights) but also on the volume of data used for training.10 Modern LLMs typically boast tens to hundreds of billions of parameters and are trained on vast corpora encompassing hundreds of billions of words.
DL models in computer vision
In CV, the encoder part of DL models is typically referred to as the backbone network, which is dedicated to extracting image features. This backbone is then integrated with MLPs to perform simple downstream tasks, such as classification. More complex tasks, such as object detection with bounding boxes and image segmentation, require extensive postprocessing steps or a dedicated decoder structure.11,12
Traditionally, the backbone of DL models in CV has been convolution module-based deep neural networks, commonly known as Convolutional Neural Networks (CNNs or ConvNets). Variations in the arrangement and the total number of stacked convolution modules differentiate well-known vanilla CNNs, such as AlexNet and VGG.13,14 An important advancement in CNNs is the development of the Residual Network (ResNet), which employs a unique mechanism called residual connections that allow the output of one layer to skip some layers and be added directly to the output of a later layer.15 (Figure 3D) This approach enables the training of much deeper models by ensuring efficient flow of information through the network.
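A minimal residual block can be sketched as follows; the channel count and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)    # the skip connection: input added to output

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
y = block(x)                         # same shape as the input
```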
Since the introduction of the Transformer model in NLP, self-attention modules have attracted significant interest as a potential backbone structure. However, the high pixel count in images poses a computational challenge for calculating self-attention across all pixels. To address this, the Vision Transformer (ViT) model segments an image into hundreds of patches. This approach allows the application of self-attention among individual patches rather than to each pixel, thereby reducing the computational load.16 (Figure 3E) However, ViT requires extensive data for training to outperform CNN-based models. Another attention-based model is the Shifted Window Transformer (Swin Transformer).17 Drawing inspiration from Convolutional Neural Networks, the Swin Transformer initially divides the original image into small patches. It then applies self-attention within each patch, akin to how a kernel operates on patches in CNNs. Subsequently, to amalgamate information from different patches, the Swin Transformer uses shifted windows and progressively combines smaller patches into larger ones. This approach facilitates a multi-scale representation, mirroring the hierarchical structure typical of CNNs.
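The patching step that makes ViT computationally tractable can be sketched as follows, using the common 16 x 16 patch size and a 768-dimensional embedding for illustration.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                    # one RGB image

# Split the image into a 14 x 14 grid of 16 x 16 patches and flatten each patch.
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)    # (1, 3, 14, 14, 16, 16)
patches = patches.reshape(1, 3, 196, 256)
patches = patches.permute(0, 2, 1, 3).reshape(1, 196, 3 * 16 * 16)

embed = nn.Linear(3 * 16 * 16, 768)
tokens = embed(patches)    # (1, 196, 768): 196 patch "words" for self-attention
```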
Image segmentation, delineating objects of interest within an image down to the pixel level, is particularly useful in hematology because it isolates blood cells from the noisy background in smear or biopsy samples for the subsequent identification of cell morphology. This task calls for pixel-level prediction, and a decoder structure is typically required. A widely used network for image segmentation is the U-Net, which utilizes a CNN encoder to decrease the spatial size, followed by a decoder that progressively restores the CNN output to the original size by reversing the operation of convolution (known as transposed convolution).12 (Figure 3F) This upscaling process maps the features learned by the CNN back onto the image’s original pixel grid, producing per-pixel predictions that determine whether each pixel is part of the object of interest. The recently introduced Segment Anything Model (SAM) features the ViT as its encoder and employs a combination of attention modules and transposed convolution in its decoder.18 Trained on an extensive dataset, this model has outperformed networks based on the U-Net architecture.
Training DL models
The objective of training a DL model is to minimize the difference between the network’s output and the actual value (the ground truth) for a given input, i.e., the loss function, by adjusting the model’s weights. This ground truth can be manually annotated, such as a classification label for an image or an object’s segmentation border, a process characteristic of supervised learning. However, due to the extensive human effort involved, these datasets tend to be small. Unsupervised learning, also known as self-supervised learning, instead focuses on predicting the data point itself rather than a manually assigned label; each data point effectively serves as its own “ground truth”, guiding the model to uncover the underlying structure of the dataset. A prominent unsupervised learning architecture is the autoencoder. It typically uses MLPs to compress the original data into a smaller, condensed form, known as the latent space. The autoencoder then uses another set of MLPs to expand the compressed data back into its original form. The compression process encourages the model to learn the most salient features of the data, much like how we learn new information by summarizing key points.
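A minimal autoencoder can be sketched as follows; the input and latent dimensions are illustrative assumptions (e.g., a 784-pixel image compressed into a 32-dimensional latent space).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)
latent = encoder(x)                    # condensed representation
reconstruction = decoder(latent)
loss = F.mse_loss(reconstruction, x)   # the input is its own "ground truth"
```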
In essence, various training methodologies in unsupervised learning revolve around how to effectively formulate a pretext task. In autoencoders, the pretext task is to minimize the difference between the input and its reconstruction. In NLP models such as BERT, the pretext task involves masking certain words in a sentence and training the model to predict these hidden words based on the surrounding context.8 This method allows language models to grasp the underlying rules of a language, akin to how cloze tests facilitate language learning in humans. Unsupervised learning in CV is predominantly achieved via contrastive learning, which rests on the idea that differently augmented views of the same original image should be labeled as the same, or “positive samples”, serving as the ground truth, while views of different images should be labeled as “negative samples”. The two classic training methods in CV contrastive learning are called MoCo and SimCLR, where the pretext task is to distinguish between pairs of similar (positive) and dissimilar (negative) images.19,20 An improvement upon this method involves using only the positive samples, with the pretext task re-designed to minimize the difference between two differently augmented views of the same image. Example techniques include BYOL and DINO.21,22 Lastly, inspired by the success of NLP models like BERT, which use masked word prediction as a pretext task, a similar concept has been adapted in CV. For instance, the masked autoencoder (MAE) technique involves randomly masking a significant portion of an input image and training a model to predict and reconstruct the missing patches.23
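As a sketch of the contrastive idea (in the style of SimCLR’s loss, with illustrative batch and embedding sizes), each embedding is pulled toward the other augmented view of the same image and pushed away from views of all other images.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    # z1[i] and z2[i] are embeddings of two augmented views of image i.
    z = F.normalize(torch.cat([z1, z2]), dim=1)    # 2N embeddings
    sim = z @ z.T / temperature                    # pairwise similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)           # positive pair as the "label"

loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```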
However, the pretext tasks in unsupervised learning models, like reconstructing missing patches in an image, typically differ from the desired end tasks, such as identifying objects within an image. Nonetheless, the encoder part of these models has inadvertently learned efficient methods for feature extraction. Consequently, we can repurpose these encoders by pairing them with various decoders tailored to specific downstream tasks. In such scenarios, it is often sufficient to train only the decoder. This approach of leveraging a pre-trained model for new applications is commonly referred to as transfer learning. When a pre-trained model is large-scale and demonstrates strong performance across a variety of transfer learning tasks, it is often referred to as a foundation model, such as GPT. A concept closely related to transfer learning is fine-tuning, where instead of training just the decoder, the entire model, including the pre-trained encoder, undergoes additional training to better adapt to the specific requirements of the new task.
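In code, transfer learning often amounts to freezing a pre-trained encoder and training only a new head. The sketch below uses an ImageNet-pretrained ResNet-18 from torchvision, with a hypothetical 5-class downstream task.

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")
for param in backbone.parameters():
    param.requires_grad = False      # keep the pre-trained features fixed

backbone.fc = nn.Linear(backbone.fc.in_features, 5)    # new trainable head
# For fine-tuning, one would instead leave all parameters trainable,
# typically with a smaller learning rate.
```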
For CV models, another popular training method is known as weakly supervised learning, which has been adopted mostly in analyzing whole slide images (WSIs) in pathology. Given the large size of a WSI, annotating every single cell or every patch within a slide is extremely labor intensive, but the label for the whole slide is often known. To learn the labels of individual patches, a commonly used method in weakly supervised learning is multiple instance learning (MIL). This approach aggregates the information from the small patches in a WSI to predict the overall label of the slide.24,25 Even with only the WSI label available, this training approach is still effective in learning the labels of individual patches. This is because, for instance, when training a model to distinguish between slides with blasts and those without, the model learns the characteristics of “no blasts” from all the patches in a WSI labeled as “no blasts”, since none of the patches should contain blasts.
Deep Learning at the Molecular Level
DL in genome
The inherent complexity and large volume of genomics data render them particularly suitable for DL models. The primary tasks in genomics include identifying non-coding regulatory elements, such as promoters, enhancers, and transcription factor binding sites, as well as interpreting the effects of non-coding single-nucleotide polymorphisms (SNPs). In 2015, two seminal studies and their proposed models, named DeepBind and DeepSEA, aimed to tackle these problems, respectively.26,27 These models laid the foundation for many current DL-based approaches by employing a shared methodology. First, the one-dimensional DNA sequence was converted into a two-dimensional representation, akin to a “picture”, where the added dimension comprised four “pixels”; each pixel symbolized one of the four nucleotide bases (ACTG) at a specific position. Subsequently, a convolutional neural network was deployed to extract sequence features, which were then linked to an MLP to formulate predictions. (Figure 4A) Both models outperformed non-DL based tools at the time. Recently, the application of Transformer-based methods to DNA sequences, inspired by their success in processing human language data, has been explored. The DNA-BERT model, for instance, interprets groups of 3 to 6 adjacent nucleotides as a single “word”. In this approach, the primary task of unsupervised learning is to predict these “words” when they are masked in a DNA sequence. After undergoing fine-tuning for specific downstream tasks, DNA-BERT demonstrated improved performance over CNN-based models such as DeepBind and DeepSEA, across a range of metrics.28 However, the direct application of DL models to predict variant effects in hematology is limited for several reasons. First, the experimental validation of causal variants continues to be the gold standard and can be readily conducted when the SNP data are not extensive.29 Second, recently developed machine learning methods, such as regBase, which integrates predictive outcomes from a variety of non-DL and DL-based models, have yielded superior results compared to employing a single DL-based model alone.30 Consequently, these integrated approaches are more frequently utilized.31
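The shared encoding step of these models can be sketched as follows; the sequence, filter count, and kernel size are illustrative and do not reproduce the published architectures.

```python
import torch
import torch.nn as nn

# One-hot encode a DNA sequence: each position becomes four "pixels",
# one per base.
bases = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = "ACGTTGCA"
onehot = torch.zeros(1, 4, len(seq))       # (batch, channels, length)
for i, b in enumerate(seq):
    onehot[0, bases[b], i] = 1.0

# A 1-D convolution then scans the sequence for motif-like local features.
motif_scanner = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=4)
features = motif_scanner(onehot)           # (1, 16, len(seq) - 3)
```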
DL in karyotyping
Karyotyping through chromosome banding analysis remains the definitive method for detecting cytogenetic abnormalities, despite being both time-consuming and labor-intensive. The advent of DL in automating karyotyping reflects its broader progress within the field of CV. Initially, before CNNs were introduced, the process required extensive manual annotations and significant domain knowledge to extract chromosome features manually, which were then classified using an MLP.32 However, the emergence of AlexNet marked a turning point, enabling CNN-based models to achieve over 90% accuracy in classifying normal chromosomes.33 More recent advancements, particularly through the implementation of residual connections and deeper CNN architectures, have further improved accuracy, pushing it beyond 95% in classifying normal chromosomes.34 The workflow typically starts with software-assisted automatic preprocessing of metaphase images, involving segmentation and organization into karyograms. The processed images are subsequently analyzed by a CNN for the classification of chromosomes. Despite these advancements, detecting chromosome aberrations remains challenging due to their complexity and the rarity of some aberrations in the training data. Recently, models based on self-attention mechanisms, such as the ViT, have been employed to address this challenge. By initially pre-training on a large dataset focused on classifying normal chromosomes, and subsequently fine-tuning on a smaller dataset containing aberrant chromosomes, ViT-based models have achieved accuracies exceeding 95% in identifying chromosomal aberrations.35
DL in transcriptomics
Gene expression profiling (GEP) data, derived from bulk RNA-sequencing or microarray techniques, are inherently “high dimensional.” This is because each gene’s expression level introduces a unique “dimension” to the analysis, making the dataset well-suited for machine learning techniques. In this scenario where the data structure is relatively straightforward, traditional machine learning methods, such as Lasso regression (a specialized form of linear regression), tree-based algorithms (including random forest and gradient boosting trees), and Support Vector Machines (SVM), often perform comparably or even better than DL models like MLP.36 For instance, a study aimed at distinguishing between acute myeloid leukemia and other forms of leukemia using peripheral blood GEP data demonstrated that both classical machine learning techniques and neural networks could achieve accuracy rates exceeding 95%.37
However, data obtained from single-cell RNA sequencing (scRNA-seq) encompass RNA expression information from thousands of individual cells, presenting both massive scale and complexity that make them ideal for DL-based methods. One application of DL in scRNA-seq is in data processing. scRNA-seq data are inherently noisy, not only because current techniques capture less than 30% of all transcripts, leading to dropout events for specific genes, but also because the data exhibit variability from batch to batch. This variability introduces the well-known batch effect, further complicating the analysis.38 Various DL models have been developed to tackle these problems.39 Among these, scVI stands out as a widely adopted tool that employs a variational autoencoder (VAE) to learn a low-dimensional latent representation of the data, effectively capturing its key patterns.40 The normalized distribution characteristic of the latent space in a VAE enables it to manage missing values and dropout events, while simultaneously mitigating batch effects, because it smooths out variations that arise from different batches.40
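For orientation, a typical scVI workflow with the scvi-tools package might look like the sketch below; the synthetic counts and the small number of training epochs are purely illustrative.

```python
import numpy as np
import anndata
import scvi

# Synthetic counts purely for illustration: 200 cells x 100 genes, 2 batches.
counts = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
adata = anndata.AnnData(counts)
adata.obs["batch"] = ["batch1"] * 100 + ["batch2"] * 100

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)                # a VAE under the hood
model.train(max_epochs=10)
latent = model.get_latent_representation()    # batch-corrected latent space
```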
In addition to data processing, DL methods are also highly effective at modeling cell behavior based on gene expression. For example, one study aimed to identify the counterpart of hematopoietic stem cells (HSCs) within induced pluripotent stem cells (iPSCs).41 In that research, an MLP was trained to identify HSCs from human fetal liver cells based on the differential expression of thousands of genes. Once trained, this MLP model could then be applied to pinpoint the HSC population within iPSCs, utilizing the expression of the same gene set.
Building upon the success of foundational models in NLP, recent initiatives have sought to develop large transformer-based models tailored to scRNA-seq data, treating genes and cells in a manner analogous to words and sentences.42–44 Inspired by unsupervised training techniques used in NLP transformers, these “single-cell foundation models” are trained on expression data from tens of millions of individual cells. During the pre-training stage, the models learn to predict masked genes and their relative expression levels. (Figure 4B) Much like how LLMs learn word relationships and grammar, these cell models develop an understanding of gene interactions and biological patterns. For instance, the Geneformer model, when fine-tuned with a specialized dataset of diseased cardiomyocytes, successfully identified genes whose alterations could lead to cardiomyopathy.43 While this concept is intriguing and the preliminary results are promising, the efficacy of these models compared to existing scRNA-seq analysis methods warrants further evaluation.45 As of now, their use in hematology has not been documented. However, they hold potential for various applications, such as discovering unique cell groups, identifying gene expression patterns specific to diseases, predicting how cells might respond to treatments, and revealing new cell states associated with disease development.
DL in protein structure predictions
DL has revolutionized the field of protein structure prediction, with the success of AlphaFold2 (AF2)46 and other models inspired by AF2, including RoseTTAFold and ESMFold.47,48 Leveraging the achievements of preceding models, AF2 integrates modules and design techniques proven to enhance protein prediction, resulting in a complex architecture. (Figure 4C) AF2 begins by constructing a multiple sequence alignment (MSA), a widely used method in protein prediction tasks. An MSA aligns homologous protein sequences across different species. This is helpful in protein structure prediction because amino acid residues in close spatial proximity tend to co-evolve across species. Simultaneously, a pairwise input is initiated, which is a 2-dimensional (2D) table representing the spatial distances between each pair of amino acid residues within a protein. Next, the encoder of AF2 employs self-attention modules to process relational information between amino acids. This integration occurs in two domains: the sequence space (1D) derived from the MSA, and the structural space (2D) derived from the pairwise input. These self-attention modules allow AF2 to understand the relationships between amino acids in both their sequence and spatial arrangements. Additionally, the encoder incorporates geometric rules to ensure the encoding of physically plausible protein structures. The AF2 decoder integrates the encoded sequence, pairwise data, and initial protein backbone coordinates (3D information) to determine the 3D coordinates of the backbone and side chains. This process involves synthesizing 1D, 2D, and 3D information with geometric rules for precise protein structure modeling.
Although AF2 can achieve sub-atomic resolution accuracy, it faces several challenges. These include its inability to predict structures of multimeric proteins, proteins with post-translational modifications, or those associated with ions or cofactors; additionally, AF2 performs less effectively in predicting proteins that have mutations or disordered regions.49,50 Moreover, it has poor performance in modeling ligand and drug binding sites. This is likely because AF2, being primarily designed for protein structure prediction, may not capture the subtle but critical features of the protein’s active site where ligands bind.51 These limitations have restricted the application of AF2 in fields like drug discovery and studying the impact of protein mutations. Notably, recent studies have proposed improved models based on AF2 to address these problems. These include models that can predict the structure of protein-nucleic acid complexes, identify the effects of pathogenic mutations, or model multimeric protein structures.52–54 However, these models require further validation across diverse datasets. For instance, Chabane et al. evaluated AlphaMissense,53 essentially a version of AF2 fine-tuned to detect pathogenic variants, on sequencing data from 686 samples of patients with hematological malignancies.55 Out of 853 variants known to be pathogenic from the literature, AlphaMissense correctly identified 80% of them.55 Therefore, given their current performance, while these tools are promising for generating hypotheses, experimental verification remains essential for confirmation.
DL-generated protein structure prediction has been utilized in hematology to elucidate biological functions. For instance, Frunt et al. used AF2 to show how Factor XII, which lacks a known crystal structure, binds to anionic surfaces and exposes its activation site.56 In a separate study, Renella et al. discovered a novel germline mutation in SEPT6, associated with severe neutropenia and dysmyelopoiesis in an infant.57 To investigate the mutation’s pathogenic role, they used AF2 to illustrate how this mutation alters the structure and impacts the dimerization of the SEPT6 protein.57
Deep Learning at the Cell Level
DL in cytomorphology
Automated analysis of peripheral blood smears (PBS) or bone marrow smears (BMS) is an early application of DL in hematology. Several deep learning-based digital cell morphology systems, like CellaVision DM96 and Scopio Labs X100, have gained US FDA Class II medical device approval for PBS analysis.58 These systems typically achieve over 90% accuracy in classifying normal white blood cells (WBCs).59,60 In the clinical workflow of identifying WBCs, these systems start by scanning a blood smear, specifically targeting the monolayer region where cells are spaced closely but not overlapping. The systems then segment the WBCs in this region into patches based on manually engineered features. These patches containing individual WBCs are displayed on a screen, and the system “pre-classifies” the cells as normal or abnormal. Finally, trained technicians verify the pre-classifications. Both CellaVision and Scopio utilize color and shape-based segmentation to isolate WBCs. CellaVision then applies image processing techniques to extract key features from each cell, such as size, shape, and color, before employing an MLP for the final step of classification. On the other hand, Scopio opts for CNN-based methods to classify the cells.61 The specific details of their model architectures remain undisclosed due to proprietary considerations. Nonetheless, a study showed that various pre-trained CNN models can all achieve approximately 90% accuracy in WBC identification.62 However, CellaVision struggles with accurately identifying rare abnormal cells in peripheral blood, such as plasma cells and lymphoblasts, due to insufficient training data.63
Automated BMS analysis is more complex than PBS analysis due to several factors. BMS contains a wider range of cell types, including both normal and abnormal cells, and faces challenges in cell segmentation because of the variable cell sizes, cell adhesion, and artifacts like dye impurities. Consequently, larger annotated datasets are necessary for effective training. Moreover, an additional module is typically needed to segment cells of interest. In one of the most extensively tested systems, Morphogo, nucleated cells are segmented using a traditional machine learning method, a decision tree based on the distribution of color range. These segmented cells are then input into a 27-layer CNN connected to an MLP for label generation.64 (Figure 5A) The system can reach over 95% accuracy in identifying normal mature and immature granulocytes and erythrocytes, as well as blasts.65,66 Other BMS analysis models employ pre-existing DL-based segmentation tools which have already been widely used in computer vision tasks. Tools such as YOLO (You Only Look Once) and Faster R-CNN (Region-based Convolutional Neural Networks) are utilized to precisely detect and segment target cells.67–72 A particularly challenging scenario involves analyzing cell morphology in bone marrow biopsy samples, where cells are densely clustered. In a study by Sirinukunwattana et al., which focused on differentiating various myeloproliferative neoplasms (MPNs) through megakaryocyte morphology in bone marrow trephines, the U-Net was employed for pixel-level segmentation of megakaryocytes from the surrounding tissue.73 This approach yielded an impressive AUC of 0.98, distinguishing between reactive and MPN samples. This method has also been adopted by other studies to classify bone marrow cells based on morphology.74,75 While the models mentioned previously all utilized CNNs as their primary network for feature extraction, the ViT (Vision Transformer) has also been explored. In a study employing a hybrid model that combines CNN and ViT as the backbone, the prediction accuracy for classifying BMS cells surpassed that of other CNN-based models.76
Beyond classifying cell types in BMS, these techniques can also differentiate cells of the same type with varying morphologies. This is particularly relevant for identifying mutations in acute myeloid leukemia (AML) blasts, which can present distinctive morphological features. For instance, blasts with NPM1/FLT3-ITD mutations often exhibit unique cup-shaped nuclei.77 Thus, DL models hold potential for predicting specific mutations by analyzing the morphology of blasts alone. Eckardt et al. implemented Faster R-CNN for the segmentation of nucleated cells, followed by using a ResNet model to predict NPM1-mutated blasts, achieving an accuracy of 0.86.78 Meanwhile, Kockwelp et al. sought to classify five distinct AML types: CBFB::MYH11, NPM1 mutation, FLT3-ITD mutation, AML with myelodysplastic changes, and a fifth category, favorable risk AML. The first four categories are associated with specific morphological features – such as atypical eosinophils, cup-shaped nuclei (with and without NPM1 mutation), and dysplastic changes, respectively – while the fifth lacks uniform morphological characteristics.79 Despite the emphasis on high-quality segmentation of blasts, their classification model was relatively straightforward, employing an 18-layer ResNet. This approach led to high accuracy for CBFB::MYH11 (AUC 0.9) and NPM1 mutations (AUC 0.88), but the performance for the other three categories was lower, with AUCs ranging from 0.6 to 0.7.79 Although these results are promising, their applicability is limited to mutations with distinctive morphologies, and further validation using external datasets is necessary.
A special use case of automated cytomorphology recognition is in imaging flow cytometry (IFC), which enhances traditional flow cytometry by integrating cameras. This allows for the capture of brightfield, darkfield, and fluorescent images of individual cells.80 Since the cell images are already individually segmented as cells move past the camera one by one, DL-based models can be directly applied to these images.81 In a study aiming to identify WBC subtypes using stain-free IFC images, both traditional ML and CNN-based models achieved comparably accurate results.82
DL in cytometry
The analysis of multiparameter flow cytometry (MFC) or mass cytometry (CyTOF) requires substantial expertise, and the results are not always reproducible.83 Moreover, MFC raw data, essentially a vast table whose rows and columns represent different cells and the fluorescent intensities of various markers, respectively, are high-dimensional and well-suited for ML methods. To identify individual cell labels, either a linear layer or an MLP can be utilized to integrate the information from each marker, mirroring the process of determining cell types through the combination of CD markers. Once trained, the neural network can be applied to the whole sample to classify each cell. Therefore, this method can be used to determine minimal residual disease (MRD). In a study aiming to detect chronic lymphocytic leukemia (CLL) MRD, a three-layer MLP was trained, which had over 99% sensitivity and specificity for identifying CLL cells from normal lymphocytes.84 A key limitation of this approach is the need for manual annotation of individual cell labels to train the model, which is extremely labor-intensive.
Another common method for training DL models in cytometry involves weakly supervised learning, where only the sample-wide label is available (e.g., leukemia vs. no leukemia). This sets up a multiple instance learning (MIL) situation, where the information of individual cells needs to be combined to determine the label of the whole sample. In the CellCnn model, each cell’s markers undergo a linear transformation (convolution) with multiple kernels, after which a max pooling layer aggregates cells’ information by selecting the maximum value across all cells to predict the sample’s label.85 (Figure 5B) This method enables the model to differentiate between samples from healthy BM and those from an AML patient with an MRD of 0.01%.85 A related model, DeepCellCNN, employs two convolutional layers instead of one, as in CellCnn, resulting in marginally better outcomes.86 Performance can be further enhanced by adopting a new prediction objective: instead of predicting a binary label (e.g., leukemia vs. no leukemia), the model predicts the percentage of events, such as the proportion of leukemia cells within a sample.87 Another variation of the traditional CellCnn model incorporates an attention module to aggregate cell information instead of max pooling.88 This adaptation has achieved over 90% accuracy in diagnosing acute leukemia and distinguishing between various types of acute leukemia. However, a notable limitation of this approach is that calculating attention scores across hundreds of thousands of cells is computationally demanding and resource-intensive.
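The core of the CellCnn-style approach can be sketched as follows; the marker, filter, and cell counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_markers, n_filters, n_cells = 10, 3, 100_000

sample = torch.randn(1, n_cells, n_markers)     # one cytometry sample
cell_filter = nn.Linear(n_markers, n_filters)   # the same "filter" scores every cell

cell_scores = torch.relu(cell_filter(sample))   # (1, n_cells, n_filters)
pooled = cell_scores.max(dim=1).values          # max pooling across all cells
logit = nn.Linear(n_filters, 1)(pooled)         # sample-level label (e.g., MRD+)
```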
Deep Learning at the Tissue Level
Challenges in Whole Slide Image Application
Compared to traditional computer vision tasks, applying DL models to interpret whole slide images (WSIs) presents unique challenges. First, WSIs are exceptionally large, typically measuring around 100,000 x 100,000 pixels,89 in stark contrast to the much smaller input size of 224 x 224 pixels used in CV datasets like ImageNet.90 To achieve computational efficiency, DL models often necessitate dividing WSIs into smaller patches, also known as tiles, containing only hundreds to thousands of pixels in each dimension. This allows pixel-level calculations to be conducted on each individual patch. Consequently, most studies on WSIs adopt a two-stage methodology: initially, a feature extractor, typically a CNN, analyzes the pixels within individual patches to generate patch embeddings. Subsequently, these embeddings are integrated using aggregation algorithms for WSI-level predictions. (Figure 5C) A CNN model pre-trained on a general image dataset, like a ResNet with ImageNet, can be effectively transferred for feature extraction from patch pixels of pathology images.25 Interestingly, CNNs trained specifically on histopathology datasets show only marginal enhancement in feature extraction compared to those trained on general image datasets like ImageNet.91,92 Nevertheless, these histopathology-specific feature extractors may be trained either in a fully supervised fashion,92 where labels for each patch are required, or more commonly through unsupervised training using contrastive learning.91,93 Furthermore, self-attention-based feature extractors, such as the Vision Transformer (ViT) and Swin Transformer, have recently been applied in WSI analysis for patch feature extraction.94,95
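Stage one of this pipeline can be sketched as follows, using an ImageNet-pretrained ResNet-50 as the patch feature extractor; the number and size of the patches are illustrative, and real pipelines add steps such as tissue detection and stain normalization.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(weights="IMAGENET1K_V1")
extractor = nn.Sequential(*list(resnet.children())[:-1])    # drop the classifier
extractor.eval()

patches = torch.randn(100, 3, 224, 224)    # 100 tiles from one WSI
with torch.no_grad():
    embeddings = extractor(patches).squeeze(-1).squeeze(-1)    # (100, 2048)
# Stage two aggregates these patch embeddings into one slide-level prediction.
```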
The second challenge in applying DL to WSIs is the scarcity of curated training samples. The expertise required for WSI annotation limits the number of qualified annotators, making the process challenging. Initially, training models necessitated annotations for every single patch, a process that was exceedingly labor-intensive.96,97 This issue has been partially addressed through weakly-supervised training methods, which rely on slide-level rather than patch-level annotations, greatly reducing the annotation burden. In recent years, there have been significant efforts to create publicly accessible histopathology datasets, facilitated by challenges like PANDA98 and CAMELYON,99 or through open datasets such as TCGA.100 Additionally, there have been innovative attempts to curate data on social media platforms, like X, where clinicians have shared over 200,000 de-identified histopathologic images, contributing to the growing availability of data for research and model training.101 In hematology, there is a notable scarcity of large datasets of bone marrow WSIs, possibly because WSIs serve only auxiliary roles in the diagnosis of most hematologic malignancies. In practice, typically only hundreds of bone marrow WSIs are available for training DL models, highlighting the challenge of limited data availability in this specific area of medical research.102
Third, WSIs are inherently patchy – only certain sections of a slide might show pathological changes, while the rest could appear normal. This scenario fits into multiple instance learning (MIL), where the diagnosis for the whole slide is based on a subset of these patches. Therefore, various techniques have been employed to aggregate features from individual patches. A commonly used method is mean pooling, which involves calculating the average features of all patches to make a single prediction. However, this approach struggles with imbalanced instances, where the majority may be normal and only a few patches show pathological changes, because it dilutes the significance of abnormal patches, overshadowing key pathological information with predominant normal findings. A solution is top-K pooling, selecting the top K patches with the highest feature scores to label the slide.96,103 However, this approach trains the model using only a few patches per slide (K patches), necessitating more WSIs to match the performance of fully supervised models.103 A more refined approach involves assigning varying weights, referred to as attention scores, to different patches. These scores are analogous to their diagnostic importance and can be learned through training.24 (Figure 5D) This method, known as attention-based MIL, effectively integrates the features from all patches. Another benefit of this method is interpretability: the weights indicate how much each patch contributes to the diagnosis, and mapping a heatmap of these weights onto the spatial locations of the original patches visually demonstrates the significance of each region to the overall slide-level diagnosis. A model using attention-based MIL, CLAM, achieved an AUC exceeding 0.95 in classifying subtypes of various solid tumors, even when trained on fewer than a thousand samples.25 However, a limitation of the attention-based MIL approach is its lack of context awareness: each patch processes information independently without access to the contextual data of adjacent patches. This limitation is critical in scenarios like hypoplastic myelodysplastic syndrome (MDS). In such cases, patches containing dysplastic cells may indicate MDS, but an accurate diagnosis requires combining this feature with the context of the surrounding cellularity. Information from other patches can be incorporated through RNNs103,104 or self-attention-based models95,105,106 to address the issue of context awareness in attention-based MIL. Self-attention can be directly applied to all patches, as in ViT, but this method is highly computationally demanding due to the vast number of patches involved.106 One strategy to mitigate this is increasing the pixel count per patch, thereby reducing the total number of patches.105 However, this adjustment might compromise the level of detail in feature extraction from the patches. An alternative and more efficient method employs a hierarchical structure that aggregates patches from small regions to medium-sized windows and finally to the entire slide level.95 (Figure 5E) This context-aware model demonstrates enhanced performance compared to traditional MIL models, though it still comes with a markedly increased computational cost.
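The attention-based pooling at the heart of this approach can be sketched as follows; the dimensions are illustrative, and real implementations such as CLAM add further components.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim=2048, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, patch_embeddings):                  # (n_patches, dim)
        weights = torch.softmax(self.score(patch_embeddings), dim=0)
        slide = (weights * patch_embeddings).sum(dim=0)   # weighted average
        return slide, weights    # the weights double as an interpretability map

pooling = AttentionPooling()
slide_vector, attention_map = pooling(torch.randn(500, 2048))
```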
DL in histopathology
Lymph node (LN) biopsy and bone marrow (BM) biopsy are the two most common histopathological samples in hematology (Table 1). These samples exhibit unique characteristics compared to biopsies from solid tumors. First, the presence of lymphoma in a LN or leukemia in a BM tends to be more homogenous, making a patch-level representation often sufficient for classifying the WSI. Second, cellular morphology in LN and BM samples plays a more significant role in disease diagnosis than it does in solid tumors, requiring models to place greater emphasis on morphological features. Furthermore, the cell distribution in BM biopsies can be particularly indicative of certain diseases, such as aplastic anemia and myeloproliferative diseases, with changes in cellularity and disruption of the normal architecture being key diagnostic criteria. Therefore, DL models in hematology have been tailored to focus on these characteristics.
Table 1. Studies using deep learning in hematology whole slide imaging interpretation.
| Biopsy sample | Clinical Task | Training size | DL model: Patch Feature Extractor | DL model: Patch Feature Aggregator | Testing dataset | Testing results | References |
|---|---|---|---|---|---|---|---|
| LN | Differentiate DLBCL, BL, SLL, and benign | 128 | CNN on manually cropped area | None | Internal | Accuracy 95% | Achi et al., 2019107 |
| LN | Differentiate DLBCL from various benign and malignant LN samples | 1,754 | Majority-voting of 17 CNNs on manually cropped area | None | External | Accuracy >99% | Li et al., 2020108 |
| LN | Differentiate DLBCL, FL, and benign | 388 | CNN on manually cropped area | None | Internal | Accuracy 90%, AUC 0.95 | Miyoshi et al., 2020109 |
| LN | Differentiate DLBCL, SLL, and benign | 629 | CNN on manually cropped area | None | External | Accuracy 96% | Steinbuss et al., 2021110 |
| LN and other biopsy sites | Predict MYC rearrangement on H&E stained DLBCL WSIs | 287 | CNN | Not clearly specified | External | Accuracy 74%, AUC 0.83 | Swiderska-Chadaj et al., 2021111 |
| LN | Differentiate FL and benign hyperplasia | 378 | CNN | Mean pooling | External | AUC 0.66 | Syrykh et al., 2020112 |
| Skin | Annotate CD30+ regions on CD30-stained WSIs to diagnose CD30+ LPD | 28 | CNN | Local self-attention, sum pooling | Internal | Accuracy 96%, AUC 0.99 | Zheng et al., 2023113 |
| BM | Predict mutations on H&E stained MDS WSIs | 236 | Pretrained CNN | Mean pooling | Internal | AUC varies by mutation, as high as 0.94 | Bruck et al., 2021114 |
| BM | Differentiate AML, CML, ALL, CLL, and MM | 129 | Pretrained CNN | Attention | External | Accuracy 94%, AUC 0.97 | Wang et al., 2022115 |
| BM | Differentiate ET and prePMF | 226 | Pretrained CNN | Attention | Internal | Accuracy 92%, AUC 0.90 | Srisuwananukorn et al., 2023116 |
| BM | Differentiate AL, MM, LPD, and normal | 556 | Pretrained YOLO for cell detection and feature extraction | Attention | Internal | Average F1 score 0.57 | Mu et al., 2023117 |
DL, deep learning. LN, lymph nodes. DLBCL, diffuse large B-cell lymphoma. BL, Burkitt’s lymphoma. SLL, small lymphocytic lymphoma. CNN, convolutional neural network. FL, follicular lymphoma. AUC, area under curve. WSI, whole slide image. LPD, lymphoproliferative disease. BM, bone marrow. MDS, myelodysplastic syndrome. AML, acute myeloid leukemia. CML, chronic myeloid leukemia. ALL, acute lymphoblastic leukemia. CLL, chronic lymphocytic leukemia. MM, multiple myeloma. ET, essential thrombocythemia. prePMF, prefibrotic primary myelofibrosis. AL, acute leukemia.
In the realm of DL tasks in LN-derived WSIs, the primary focus of most studies is to differentiate among various types of lymphomas and related conditions. This includes distinguishing aggressive lymphomas such as diffuse large B-cell lymphoma (DLBCL) and Burkitt’s lymphoma (BL), from indolent lymphomas like follicular lymphoma (FL) and small lymphocytic lymphoma (SLL), as well as from reactive hyperplasia or normal lymph nodes, using hematoxylin and eosin (H&E) stained slides.107–110 In a departure from this common objective, one study sought to predict MYC rearrangement in DLBCL WSIs using H&E staining but achieved low accuracy.111 The unique cytomorphology of different lymphomas means that features extracted from just a single patch can often accurately diagnose the WSI. Indeed, most studies have applied a CNN to a manually selected patch, achieving diagnostic accuracies over 90%. In one study, 17 CNN models were fine-tuned to differentiate between DLBCL and non-DLBCL samples using cropped images of approximately 1,000x1,000 pixels.108 To improve the results, a “majority voting” scheme was used, wherein each model’s individual prediction contributed to a final diagnosis based on the majority consensus among the models. Only one published study employs the conventional feature extractor-aggregator framework for analyzing WSIs. That research aimed to differentiate between FL and benign follicular hyperplasia (FH) using H&E-stained WSIs. The study began by training a CNN to distinguish FL and FH at the patch level, then implemented mean pooling to assign a label to the entire WSI.112 However, the model’s performance on an external testing dataset resulted in an AUC of just 0.66, indicating limited generalization ability.
Another study focused on differentiating lymphomatoid papulosis from primary cutaneous anaplastic large-cell lymphoma using CD30-stained skin WSIs based on the extent of CD30-positive cell involvement.113 To effectively incorporate information from adjacent patches, the authors implemented a local self-attention mechanism. This technique allowed for integrating the feature vector from the central patch with those from surrounding patches. Consequently, the overall percentage of CD30-positive regions within the WSI was determined by aggregating all the positively identified patches.
Several studies have also explored the use of DL in interpreting BM WSIs, covering a variety of tasks from distinguishing between different disease types to predicting mutations through morphological features.114–117 Commonly, these studies employ a CNN as a feature extractor, followed by an aggregator to compile patch-level features into slide labels. In a work focused on predicting mutations associated with MDS, patch features were extracted directly using CNN models pre-trained on the ImageNet dataset without any fine-tuning for histopathological data.114 The feature vectors from each patch were then condensed into a single value using an MLP tailored for various mutations. The overall label for the WSI — indicating the presence or absence of specific mutations — was determined by averaging these values across all patches. Despite the simplicity of this model architecture, it achieved high AUC scores, exceeding 0.90 for certain mutations, such as ASXL1 and TET2. Additionally, attention-based MIL methods have also been applied. In a study to distinguish between hematologic malignancies using bone marrow smear WSIs, patch features were extracted using a CNN model pre-trained on ImageNet. This was followed by the application of the CLAM framework to assign slide-level labels.115 This approach demonstrated a 94% accuracy rate in identifying various hematologic malignancies using an external test dataset. Another study aimed to distinguish essential thrombocythemia from prefibrotic primary myelofibrosis. It first used a CNN pre-trained on histopathological images to extract features, then applied the CLAM framework to integrate the features of individual patches.116 This model achieved 92% accuracy in differentiating the two conditions. One study employed an attention-based aggregator different from CLAM to differentiate acute leukemia, multiple myeloma, and lymphoproliferative disease from bone marrow WSIs.117 Utilizing the YOLO object detection model, individual cells were segmented and their features extracted. Next, an attention-based aggregating algorithm, known as Hopfield pooling,118 was applied to integrate these features by assigning weights to individual cell images. However, the performance was modest: with internal testing datasets, the F1 score, an accuracy indicator, was only 0.57.
Overall, the field of digital pathology has witnessed significant advancements, paving the way for innovative applications in hematology. Despite these achievements, the integration of DL techniques in hematology primarily relies on established methodologies. While studies demonstrate the potential of DL in analyzing various hematological conditions, the adoption of newer, more sophisticated DL models is still in its nascent stages. Moreover, challenges in model generalization and modest performance in external datasets highlight the need for ongoing research and development.
Deep Learning at the Patient Level
DL in curated clinical data
DL models, particularly MLPs, can be utilized to predict clinical outcomes in hematology using curated patient data. For example, one study employed an MLP to combine patient demographics with laboratory test results to predict the likelihood of successful donor hematopoietic stem cell mobilization.119 Another study trained an MLP to predict the survival status at the last follow-up of patients with DLBCL based on 740 gene expression profiles.120 Despite the capabilities of DL models, comparisons with traditional regression and classic ML methods reveal minimal improvements in prediction accuracy, and, in some instances, DL models perform worse than classic ML methods. This discrepancy probably arises because, although DL models can process a wide range of variables, only a few significantly impact clinical outcomes. Consequently, the advantage of DL models in handling complex data types remains underutilized. For example, in a study predicting 100-day non-relapse mortality for over 25,000 patients undergoing allogeneic hematopoietic stem cell transplantation, logistic regression, tree-based classic ML methods, and an MLP were used with 23 selected variables.121 The study demonstrated that all methods achieved similar AUC scores, and that incorporating just 3 to 5 key variables was sufficient to reach near-maximal AUCs, underscoring the limited benefit of DL models in this context.
DL in electronic health records
An alternative approach to predicting clinical events applies DL models to non-curated, patient-level electronic health record (EHR) data. In this method, each clinical encounter is treated as a data point comprising structured medical codes, such as International Classification of Disease (ICD) diagnosis codes, medication codes, procedure codes, and laboratory codes. The collection of clinical encounters forms sequence-type data encapsulating a patient's medical history. Unlike the analysis of curated data, this approach additionally leverages the temporal and longitudinal information in the EHR, providing a comprehensive view of patient health over time. The analysis typically involves three steps. First, the medical codes associated with each clinical encounter, along with their timestamps, are encoded into numerical representations. Next, a DL model processes these sequences, mapping them into a latent space; this step is analogous to summarizing a patient's medical history or clinical trajectory. Finally, a prediction mechanism, usually an MLP, operates on this latent representation to produce the prediction. (Figure 6) In the embedding step, although many models adopt simple one-hot encoding, learned embeddings, in which similar medical concepts have closely related representations, can enhance model performance.122 For transforming embeddings into latent representations, RNNs are commonly used because of their proficiency with sequence-type data.123 For instance, the DoctorAI model feeds medical codes from past encounters into an RNN, creating a contextualized representation of the patient's medical history that is then used to predict the diagnosis and medication codes of the subsequent visit.124 Another study expanded this approach by including clinical notes, tokenizing each word in the free text and combining the tokens with medical codes.125 This enriched input was used to predict in-hospital mortality, readmission rates, and length of hospital stay, demonstrating the potential of integrating diverse data types for more accurate predictions. The advent of the Transformer architecture has led to a shift towards self-attention-based DL models in predictive modeling for EHR.126 A notable example is the BEHRT model, which treats the diagnosis codes from each visit as words in a sentence.127 Its pre-training objective involves predicting masked diagnosis codes, mirroring BERT's training methodology; an MLP prediction head is then trained on the model's outputs for tasks such as predicting diagnosis codes of future visits.
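The following PyTorch sketch illustrates the embed-summarize-predict pipeline described above. It deliberately simplifies: a patient's coded history is flattened into a single code sequence, whereas real encounters group multiple codes per visit and carry timestamps. The vocabulary size, dimensions, and class names are our assumptions, not those of DoctorAI or BEHRT.

```python
import torch
import torch.nn as nn

class EHRSequenceModel(nn.Module):
    """Sketch of the three-step EHR pipeline: embed medical codes, summarize
    the sequence with an RNN, and predict an outcome with an MLP head."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, n_outputs=1):
        super().__init__()
        # Step 1: learned embeddings for medical codes (index 0 reserved for padding).
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Step 2: a GRU summarizes the encounter sequence into a latent patient state.
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Step 3: an MLP head maps the latent state to the prediction logits.
        self.head = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_outputs))

    def forward(self, code_ids):           # code_ids: (batch, seq_len) code indices
        x = self.embed(code_ids)           # (batch, seq_len, embed_dim)
        _, h = self.rnn(x)                 # h: (1, batch, hidden_dim) latent summary
        return self.head(h.squeeze(0))     # (batch, n_outputs) prediction logits

# Example: a batch of 4 patients, each a padded sequence of 60 code indices.
model = EHRSequenceModel()
logits = model(torch.randint(1, 5000, (4, 60)))
```

A Transformer-based variant such as BEHRT replaces the GRU with self-attention layers and pre-trains by masking codes, but the overall embed-summarize-predict structure is the same.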
These methods largely remain in the proof-of-concept phase, and their deployment in hematology has so far been limited. Notably, a study predicting the two-year survival of patients with AML from the first six months of laboratory and bone marrow histological data employed a heterogeneous graph transformer model.128 This approach achieved an AUC of 0.76 on an external testing dataset, performance comparable to predictions based on the European LeukemiaNet (ELN) 2022 criteria, even without incorporating molecular and cytogenetic information.129
However, these studies have several limitations. First, while the models often show strong performance within their training datasets, with AUC values frequently exceeding 0.90, testing on external independent datasets is seldom conducted; a systematic review found that only 3 of 81 studies (3.7%) performed external testing.130 This scarcity of external validation raises questions about the models' generalizability. Second, models trained on private datasets often do not disclose their parameters for privacy reasons, complicating external evaluation. Third, the structured medical codes used to train these models are frequently criticized for inaccuracies and lack of granularity,131 so validation against the original medical records is crucial to verify results. Fourth, and most importantly, some prediction tasks are clinically implausible because clinical events are complex and multifaceted, influenced by numerous unmeasured variables not captured by medical codes. For instance, it is unrealistic to predict the specific reason for a future hospital admission with high confidence based solely on past medical encounters. Moreover, certain clinical events, such as the onset of pancreatic cancer, are sporadic and minimally influenced by a patient's medical history; despite known associations between several non-specific environmental risk factors (e.g., smoking and obesity) and an increased risk of pancreatic cancer,132 no definitive clinical factor has been identified as a direct cause of the cancer. A study attempting to predict pancreatic cancer occurrence up to 36 months in advance using DL models trained on medical codes illustrates this point: the models exhibited very low precision and recall, barely reaching 1%.133 Although the specificity was reported to be near 100%, this likely resulted from imbalanced labels in the training datasets, in which the vast majority of patients did not have pancreatic cancer. This could have led the models to "cheat" by predicting that almost no patients had pancreatic cancer. In such a scenario, the models would correctly identify most patients without pancreatic cancer, yielding high specificity, but would fail to identify the few patients who actually had the disease, yielding low precision and recall.
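The effect of label imbalance on these metrics can be seen with simple arithmetic. The numbers below are hypothetical, chosen only to illustrate how a model that flags almost no one can report near-perfect specificity alongside roughly 1% precision and recall.

```python
# Hypothetical cohort: 10,000 patients, 100 (1%) of whom develop pancreatic cancer.
# A model that almost always predicts "no cancer" catches 1 true case and
# raises 99 false alarms among the 9,900 unaffected patients.
tp, fn = 1, 99       # true positives, missed cases
fp, tn = 99, 9801    # false alarms, correctly cleared patients

specificity = tn / (tn + fp)   # 9801 / 9900 = 0.99  -> "near 100%"
recall      = tp / (tp + fn)   # 1 / 100    = 0.01  -> 1%
precision   = tp / (tp + fp)   # 1 / 100    = 0.01  -> 1%

print(f"specificity={specificity:.2%}, recall={recall:.2%}, precision={precision:.2%}")
```

Because the negative class dominates the denominator of specificity, that metric stays high regardless of whether the model finds any true cases, which is why precision and recall are the more informative measures here.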
DL in clinical notes
Before Transformer-based LLMs became prevalent, RNNs and CNNs were commonly employed for semantic analysis of clinical notes. In a study aimed at identifying bleeding events from EHR clinical notes, both a CNN and an RNN were used to assess individual sentences for descriptions of bleeding, achieving an accuracy of 90%.134 The introduction of LLMs has enabled more complex tasks. Notably, state-of-the-art models such as GPT-4 and Med-PaLM 2 can answer US Medical Licensing Exam (USMLE)-style questions with accuracies exceeding 85%,135,136 underscoring their capacity to comprehend and analyze complex medical scenarios. Various LLM-assisted clinical tasks have been proposed.137 LLMs are notably effective at generating clinical notes from doctor-patient conversations; one study reported that notes generated by GPT-4 were preferred as often as those written by humans.138 LLMs have also shown proficiency in identifying patient eligibility for clinical trials from clinical notes, with a study using GPT-3.5 reporting accuracies of 86% and 84% for matching inclusion and exclusion criteria, respectively.139 Additionally, LLMs have been leveraged to enhance the readability of clinical documentation: a study showed that the systematic implementation of ChatGPT in a hospital significantly improved the readability of informed consent documents, such as those for bone marrow biopsies, making them more accessible to the average American reader.140 Lastly, LLMs can support decision-making based on clinical notes. A study compared the responses of hematologists and various LLMs regarding hematopoietic stem cell transplantation eligibility, donor selection, and conditioning regimens across six clinical cases of patients with hematological malignancies.141 The LLMs performed well in determining eligibility and selecting donors but fell short in recommending appropriate conditioning regimens.
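To give a flavor of how such trial-eligibility screening can be wired up, the sketch below sends a clinical note and a single criterion to a chat-completion endpoint. It assumes the OpenAI Python client (version 1 or later); the note, criterion, and prompt wording are invented for illustration and do not reproduce the cited study's protocol.

```python
# Hedged sketch of LLM-based trial-eligibility screening.
# Assumes: openai>=1.0 installed and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

note = "62-year-old with relapsed DLBCL after R-CHOP; ECOG 1; creatinine 1.1 mg/dL."
criterion = ("Inclusion: relapsed/refractory DLBCL after at least one "
             "prior line of systemic therapy.")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You assess clinical trial eligibility. Answer MET, NOT MET, "
                    "or UNCLEAR, followed by a one-sentence rationale."},
        {"role": "user",
         "content": f"Clinical note:\n{note}\n\nCriterion:\n{criterion}"},
    ],
)
print(response.choices[0].message.content)
```

In practice, each inclusion and exclusion criterion would be checked in turn and the structured answers aggregated, with a clinician reviewing any MET/NOT MET call before enrollment decisions.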
Beyond tasks centered on clinical notes, LLMs hold potential across patient care, including medical chatbots for triage and question answering, medication management, automated medical history taking, and generation of medical reports from scans or histopathological images.142,143 These applications will likely be developed first in general medical settings before being adapted to specialized fields like hematology. While promising, future research must address the robustness, explainability, and ethical implications of using LLMs in healthcare.144,145
Conclusion
DL has demonstrated diverse applications across various domains of hematology. At the molecular level, DL models have significantly advanced multi-omics data analysis and protein structure predictions. For cells and tissues, DL techniques enable the automation of cytomorphology analysis, interpretation of flow cytometry data, and diagnosis from whole slide images. Additionally, DL shows promise in predicting clinical outcomes using patient data and electronic health records. The advent of LLMs further facilitates complex tasks such as generating clinical notes and supporting decision-making processes.
Despite these advancements, DL faces specific challenges, including the need for larger, curated datasets, enhanced model interpretability, and improved generalizability. These challenges are particularly pronounced in hematology, where the adoption of new DL models is notably slower than in other medical fields. Future endeavors should develop hematology-tailored models, integrate multimodal data, and ensure generalizability. Interdisciplinary collaboration between hematologists, computer scientists, and regulatory bodies is vital to unlocking DL’s full potential in transforming hematological research and clinical care.
We stand on the brink of a transformative period, marking the advent of a more profound integration of DL into the standard practices of hematology. This integration could significantly enhance patient care by providing more accurate diagnoses, personalized treatment plans, and improved patient outcomes. However, to fully realize these benefits, it is imperative that the upcoming generation of hematologists not only become adept at employing these advanced technologies but also gain a comprehensive understanding of the underlying principles of DL.
Conflicts of Interest
None
Author Contributions
Concept and design: All authors
Manuscript writing: All authors
Final approval of manuscript: All authors
Acknowledgments
None
Funding Statement
None
References
- McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics. 1943;5(4):115–33. doi:10.1007/BF02478259.
- Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54. doi:10.1162/neco.2006.18.7.1527.
- Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z, et al. Scientific discovery in the age of artificial intelligence. Nature. 2023;620(7972):47–60. doi:10.1038/s41586-023-06221-2.
- LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. doi:10.1038/nature14539.
- Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997;9(8):1735–80. doi:10.1162/neco.1997.9.8.1735.
- Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: Advances in Neural Information Processing Systems; 2017.
- Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165. 2020.
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics; 2019.
- Dean J, Corrado GS, Monga R, Chen K, Devin M, Le QV, et al. Large Scale Distributed Deep Networks. In: Advances in Neural Information Processing Systems; 2012.
- Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, et al. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556. 2022.
- Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872. 2020. doi:10.1007/978-3-030-58452-8_13.
- Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv preprint arXiv:1505.04597. 2015. doi:10.1007/978-3-319-24574-4_28.
- Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2012;60:84–90. doi:10.1145/3065386.
- Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556. 2014.
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 770–8. doi:10.1109/CVPR.2016.90.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929. 2020.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 9992–10002. doi:10.1109/ICCV48922.2021.00986.
- Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment Anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 3992–4003. doi:10.1109/ICCV51070.2023.00371.
- He K, Fan H, Wu Y, Xie S, Girshick RB. Momentum Contrast for Unsupervised Visual Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020. p. 9726–35. doi:10.1109/CVPR42600.2020.00975.
- Chen T, Kornblith S, Norouzi M, Hinton GE. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709. 2020.
- Grill JB, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv preprint arXiv:2006.07733. 2020.
- Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, et al. Emerging Properties in Self-Supervised Vision Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 9630–40. doi:10.1109/ICCV48922.2021.00951.
- He K, Chen X, Xie S, Li Y, Dollár P, Girshick RB. Masked Autoencoders Are Scalable Vision Learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 15979–88. doi:10.1109/CVPR52688.2022.01553.
- Ilse M, Tomczak JM, Welling M. Attention-based Deep Multiple Instance Learning. In: Proceedings of the 35th International Conference on Machine Learning; 2018.
- Lu MY, Williamson DFK, Chen TY, Chen RJ, Barbieri M, Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng. 2021;5(6):555–70. doi:10.1038/s41551-020-00682-w.
- Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–8. doi:10.1038/nbt.3300.
- Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4. doi:10.1038/nmeth.3547.
- Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20. doi:10.1093/bioinformatics/btab083.
- Song H, Liu Y, Tan Y, Zhang Y, Jin W, Chen L, et al. Recurrent noncoding somatic and germline WT1 variants converge to disrupt MYB binding in acute promyelocytic leukemia. Blood. 2022;140(10):1132–44. doi:10.1182/blood.2021014945.
- Zhang S, He Y, Liu H, Zhai H, Huang D, Yi X, et al. regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Res. 2019;47(21):e134. doi:10.1093/nar/gkz774.
- Flerlage JE, Myers JR, Maciaszek JL, Oak N, Rashkin SR, Hui Y, et al. Discovery of novel predisposing coding and noncoding variants in familial Hodgkin lymphoma. Blood. 2023;141(11):1293–307. doi:10.1182/blood.2022016056.
- Cho J, Ryu SY, Woo SH. A study for the hierarchical artificial neural network model for Giemsa-stained human chromosome classification. Conf Proc IEEE Eng Med Biol Soc. 2004;2004:4588–91. doi:10.1109/IEMBS.2004.1404272.
- Hu X, Yi W, Jiang L, Wu S, Zhang Y, Du J, et al. Classification of Metaphase Chromosomes Using Deep Convolutional Neural Network. J Comput Biol. 2019;26(5):473–84. doi:10.1089/cmb.2018.0212.
- Vajen B, Hänselmann S, Lutterloh F, Käfer S, Espenkötter J, Beening A, et al. Classification of fluorescent R-Band metaphase chromosomes using a convolutional neural network is precise and fast in generating karyograms of hematologic neoplastic cells. Cancer Genetics. 2022;260-261:23–9. doi:10.1016/j.cancergen.2021.11.005.
- Shamsi Z, Bryant DH, Wilson JM, Qu X, Dubey KA, Kothari K, et al. Karyotype AI for Precision Oncology.
- Alharbi F, Vakanski A. Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. Bioengineering (Basel). 2023;10(2). doi:10.3390/bioengineering10020173.
- Warnat-Herresthal S, Perrakis K, Taschler B, Becker M, Bassler K, Beyer M, et al. Scalable Prediction of Acute Myeloid Leukemia Using High-Dimensional Machine Learning and Blood Transcriptomics. iScience. 2020;23(1):100780. doi:10.1016/j.isci.2019.100780.
- Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. doi:10.1038/ncomms14049.
- Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics Proteomics Bioinformatics. 2022;20(5):814–35. doi:10.1016/j.gpb.2022.11.011.
- Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053–8. doi:10.1038/s41592-018-0229-2.
- Fidanza A, Stumpf PS, Ramachandran P, Tamagno S, Babtie A, Lopez-Yrigoyen M, et al. Single-cell analyses and machine learning define hematopoietic progenitor and HSC-like cells derived from human PSCs. Blood. 2020;136(25):2893–904. doi:10.1182/blood.2020006229.
- Yang F, Wang W, Wang F, Fang Y, Tang D, Huang J, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence. 2022;4(10):852–66. doi:10.1038/s42256-022-00534-z.
- Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618(7965):616–24. doi:10.1038/s41586-023-06139-9.
- Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024. doi:10.1101/2023.04.30.538439.
- Kedzierska KZ, Crawford L, Amini AP, Lu AX. Assessing the limits of zero-shot foundation models in single-cell biology. bioRxiv. 2023. doi:10.1101/2023.10.16.561085.
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. doi:10.1038/s41586-021-03819-2.
- Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–6. doi:10.1126/science.abj8754.
- Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. doi:10.1126/science.ade2574.
- Yang Z, Zeng X, Zhao Y, Chen R. AlphaFold2 and its applications in the fields of biology and medicine. Signal Transduct Target Ther. 2023;8(1):115. doi:10.1038/s41392-023-01381-z.
- Buel GR, Walters KJ. Can AlphaFold2 predict the impact of missense mutations on structure? Nature Structural & Molecular Biology. 2022;29(1):1–2. doi:10.1038/s41594-021-00714-2.
- Karelina M, Noh JJ, Dror RO. How accurately can one predict drug binding modes using AlphaFold models? bioRxiv (Cold Spring Harbor Laboratory).
- Baek M, McHugh R, Anishchenko I, Jiang H, Baker D, DiMaio F. Accurate prediction of protein-nucleic acid complexes using RoseTTAFoldNA. Nat Methods. 2024;21(1):117–21. doi:10.1038/s41592-023-02086-5.
- Cheng J, Novati G, Pan J, Bycroft C, Zemgulyte A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492. doi:10.1126/science.adg7492.
- Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021. doi:10.1101/2021.10.04.463034.
- Chabane K, Charlot C, Gugenheim D, Simonet T, Armisen D, Viailly PJ, et al. Real life evaluation of AlphaMissense predictions in hematological malignancies. Leukemia. 2024;38(2):420–3. doi:10.1038/s41375-023-02116-3.
- Frunt R, El Otmani H, Gibril Kaira B, de Maat S, Maas C. Factor XII Explored with AlphaFold - Opportunities for Selective Drug Development. Thromb Haemost. 2023;123(2):177–85. doi:10.1055/a-1951-1777.
- Renella R, Gagne K, Beauchamp E, Fogel J, Perlov A, Sola M, et al. Congenital X-linked neutropenia with myelodysplasia and somatic tetraploidy due to a germline mutation in SEPT6. Am J Hematol. 2022;97(1):18–29. doi:10.1002/ajh.26382.
- Kratz A, Lee SH, Zini G, Riedl JA, Hur M, Machin S, et al. Digital morphology analyzers in hematology: ICSH review and recommendations. Int J Lab Hematol. 2019;41(4):437–47. doi:10.1111/ijlh.13042.
- Kratz A, Bengtsson HI, Casey JE, Keefe JM, Beatrice GH, Grzybek DY, et al. Performance evaluation of the CellaVision DM96 system: WBC differentials by automated digital image analysis supported by an artificial neural network. Am J Clin Pathol. 2005;124(5):770–81. doi:10.1309/XMB9K0J41LHLATAY.
- Katz BZ, Feldman MD, Tessema M, Benisty D, Toles GS, Andre A, et al. Evaluation of Scopio Labs X100 Full Field PBS: The first high-resolution full field viewing of peripheral blood specimens combined with artificial intelligence-based morphological analysis. Int J Lab Hematol. 2021;43(6):1408–16. doi:10.1111/ijlh.13681.
- Lin E, Fuda F, Luu HS, Cox AM, Fang F, Feng J, et al. Digital pathology and artificial intelligence as the next chapter in diagnostic hematopathology. Semin Diagn Pathol. 2023;40(2):88–94. doi:10.1053/j.semdp.2023.02.001.
- Tseng TR, Huang HM. Classification of peripheral blood neutrophils using deep learning. Cytometry A. 2023;103(4):295–303. doi:10.1002/cyto.a.24698.
- Rollins-Raval MA, Raval JS, Contis L. Experience with CellaVision DM96 for peripheral blood differentials in a large multi-center academic hospital system. J Pathol Inform. 2012;3:29. doi:10.4103/2153-3539.100154.
- Jin H, Fu X, Cao X, Sun M, Wang X, Zhong Y, et al. Developing and Preliminary Validating an Automatic Cell Classification System for Bone Marrow Smears: a Pilot Study. J Med Syst. 2020;44(10):184. doi:10.1007/s10916-020-01654-y.
- Fu X, Fu M, Li Q, Peng X, Lu J, Fang F, et al. Morphogo: An Automatic Bone Marrow Cell Classification System on Digital Images Analyzed by Artificial Intelligence. Acta Cytol. 2020;64(6):588–96. doi:10.1159/000509524.
- Lv Z, Cao X, Jin X, Xu S, Deng H. High-accuracy morphological identification of bone marrow cells using deep learning-based Morphogo system. Sci Rep. 2023;13(1):13364. doi:10.1038/s41598-023-40424-x.
- Kutlu H, Avci E, Ozyurt F. White blood cells detection and classification based on regional convolutional neural networks. Med Hypotheses. 2020;135:109472. doi:10.1016/j.mehy.2019.109472.
- Huang D, Cheng J, Fan R, Su ZW, Ma Q, Li J. Bone Marrow Cell Recognition: Training Deep Object Detection with A New Loss Function. In: 2021 IEEE International Conference on Imaging Systems and Techniques (IST); 2021. p. 1–6. doi:10.1109/IST50367.2021.9651340.
- Lewis JE, Shebelut CW, Drumheller BR, Zhang X, Shanmugam N, Attieh M, et al. An Automated Pipeline for Differential Cell Counts on Whole-Slide Bone Marrow Aspirate Smears. Mod Pathol. 2023;36(2):100003. doi:10.1016/j.modpat.2022.100003.
- Mori J, Kaji S, Kawai H, Kida S, Tsubokura M, Fukatsu M, et al. Assessment of dysplasia in bone marrow smear with convolutional neural network. Sci Rep. 2020;10(1):14734. doi:10.1038/s41598-020-71752-x.
- Wang CW, Huang SC, Lee YC, Shen YJ, Meng SI, Gaol JL. Deep learning for bone marrow cell detection and classification on whole-slide images. Med Image Anal. 2022;75:102270. doi:10.1016/j.media.2021.102270.
- Tayebi RM, Mu Y, Dehkharghanian T, Ross C, Sur M, Foley R, et al. Automated bone marrow cytology using deep learning to generate a histogram of cell types. Commun Med (Lond). 2022;2:45. doi:10.1038/s43856-022-00107-6.
- Sirinukunwattana K, Aberdeen A, Theissen H, Sousos N, Psaila B, Mead AJ, et al. Artificial intelligence-based morphological fingerprinting of megakaryocytes: a new tool for assessing disease in MPN patients. Blood Adv. 2020;4(14):3284–94. doi:10.1182/bloodadvances.2020002230.
- Hagos YB, Lecat CSY, Patel D, Mikolajczak A, Castillo SP, Lyon EJ, et al. Deep Learning Enables Spatial Mapping of the Mosaic Microenvironment of Myeloma Bone Marrow Trephine Biopsies. Cancer Res. 2024;84(3):493–508. doi:10.1158/0008-5472.CAN-22-2654.
- Lee N, Jeong S, Park MJ, Song W. Deep learning application of the discrimination of bone marrow aspiration cells in patients with myelodysplastic syndromes. Sci Rep. 2022;12(1):18677. doi:10.1038/s41598-022-21887-w.
- Tripathi S, Augustin AI, Sukumaran R, Dheer S, Kim E. HematoNet: Expert level classification of bone marrow cytology morphology in hematological malignancy with deep learning. Artificial Intelligence in the Life Sciences. 2022;2:100043. doi:10.1016/j.ailsci.2022.100043.
- Chen W, Konoplev S, Medeiros LJ, Koeppen H, Leventaki V, Vadhan-Raj S, et al. Cuplike nuclei (prominent nuclear invaginations) in acute myeloid leukemia are highly associated with FLT3 internal tandem duplication and NPM1 mutation. Cancer. 2009;115(23):5481–9. doi:10.1002/cncr.24610.
- Eckardt JN, Middeke JM, Riechert S, Schmittmann T, Sulaiman AS, Kramer M, et al. Deep learning detects acute myeloid leukemia and predicts NPM1 mutation status from bone marrow smears. Leukemia. 2022;36(1):111–8. doi:10.1038/s41375-021-01408-w.
- Kockwelp J, Thiele S, Bartsch J, Haalck L, Gromoll J, Schlatt S, et al. Deep learning predicts therapy-relevant genetics in acute myeloid leukemia from Pappenheim-stained bone marrow smears. Blood Adv. 2024;8(1):70–9. doi:10.1182/bloodadvances.2023011076.
- Rees P, Summers HD, Filby A, Carpenter AE, Doan M. Imaging flow cytometry: a primer. Nat Rev Methods Primers. 2022;2. doi:10.1038/s43586-022-00167-x.
- Doan M, Barnes C, McQuin C, Caicedo JC, Goodman A, Carpenter AE, et al. Deepometry, a framework for applying supervised and weakly supervised deep learning to imaging cytometry. Nat Protoc. 2021;16(7):3572–95. doi:10.1038/s41596-021-00549-7.
- Lippeveld M, Knill C, Ladlow E, Fuller A, Michaelis LJ, Saeys Y, et al. Classification of Human White Blood Cells Using Machine Learning for Stain-Free Imaging Flow Cytometry. Cytometry A. 2020;97(3):308–19. doi:10.1002/cyto.a.23920.
- Kalina T. Reproducibility of Flow Cytometry Through Standardization: Opportunities and Challenges. Cytometry A. 2020;97(2):137–47. doi:10.1002/cyto.a.23901.
- Salama ME, Otteson GE, Camp JJ, Seheult JN, Jevremovic D, Holmes DR 3rd, et al. Artificial Intelligence Enhances Diagnostic Flow Cytometry Workflow in the Detection of Minimal Residual Disease of Chronic Lymphocytic Leukemia. Cancers (Basel). 2022;14(10). doi:10.3390/cancers14102537.
- Arvaniti E, Claassen M. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat Commun. 2017;8:14825. doi:10.1038/ncomms14825.
- Hu Z, Tang A, Singh J, Bhattacharya S, Butte AJ. A robust and interpretable end-to-end deep learning model for cytometry data. Proc Natl Acad Sci U S A. 2020;117(35):21373–80. doi:10.1073/pnas.2003026117.
- Robles EE, Jin Y, Smyth P, Scheuermann RH, Bui JD, Wang HY, et al. A cell-level discriminative neural network model for diagnosis of blood cancers. Bioinformatics. 2023;39(10). doi:10.1093/bioinformatics/btad585.
- Lewis JE, Cooper LAD, Jaye DL, Pozdnyakova O. Automated Deep Learning-Based Diagnosis and Molecular Characterization of Acute Myeloid Leukemia Using Flow Cytometry. Mod Pathol. 2024;37(1):100373. doi:10.1016/j.modpat.2023.100373.
- Tizhoosh HR, Pantanowitz L. Artificial Intelligence and Digital Pathology: Challenges and Opportunities. J Pathol Inform. 2018;9:38. doi:10.4103/jpi.jpi_53_18.
- Deng J, Dong W, Socher R, Li LJ, Kai L, Li FF. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2009. doi:10.1109/CVPR.2009.5206848.
- Ciga O, Xu T, Martel AL. Self supervised contrastive learning for digital histopathology. Machine Learning with Applications. 2022;7:100198. doi:10.1016/j.mlwa.2021.100198.
- Li X, Cen M, Xu J, Zhang H, Xu XS. Improving feature extraction from histopathological images through a fine-tuning ImageNet model. J Pathol Inform. 2022;13:100115. doi:10.1016/j.jpi.2022.100115.
- Wang X, Du Y, Yang S, Zhang J, Wang M, Zhang J, et al. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Med Image Anal. 2023;83:102645. doi:10.1016/j.media.2022.102645.
- Wang X, Yang S, Zhang J, Wang M, Zhang J, Yang W, et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. 2022;81:102559. doi:10.1016/j.media.2022.102559.
- Chen RJ, Chen C, Li Y, Chen TY, Trister AD, Krishnan RG, et al. Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022.
- Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyo D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24(10):1559–67. doi:10.1038/s41591-018-0177-5.
- Wang D, Khosla A, Gargeya R, Irshad H, Beck AH. Deep Learning for Identifying Metastatic Breast Cancer. arXiv preprint arXiv:1606.05718. 2016.
- Bulten W, Kartasalo K, Chen PC, Strom P, Pinckaers H, Nagpal K, et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med. 2022;28(1):154–63. doi:10.1038/s41591-021-01620-2.
- Bándi P, Geessink O, Manson Q, Dijk MV, Balkenhol M, Hermsen M, et al. From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge. IEEE Transactions on Medical Imaging. 2019;38(2):550–60. doi:10.1109/TMI.2018.2867350.
- Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20. doi:10.1038/ng.2764.
- Huang Z, Bianchi F, Yuksekgonul M, Montine TJ, Zou J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat Med. 2023;29(9):2307–16. doi:10.1038/s41591-023-02504-3.
- Srisuwananukorn A, Salama ME, Pearson AT. Deep learning applications in visual data for benign and malignant hematologic conditions: a systematic review and visual glossary. Haematologica. 2023;108(8):1993–2010. doi:10.3324/haematol.2021.280209.
- Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25(8):1301–9. doi:10.1038/s41591-019-0508-1.
- Iizuka O, Kanavati F, Kato K, Rambeau M, Arihiro K, Tsuneki M. Deep Learning Models for Histopathological Classification of Gastric and Colonic Epithelial Tumours. Sci Rep. 2020;10(1):1504. doi:10.1038/s41598-020-58467-9.
- Li Z, Cong Y, Chen X, Qi J, Sun J, Yan T, et al. Vision transformer-based weakly supervised histopathological image analysis of primary brain tumors. iScience. 2023;26(1):105872. doi:10.1016/j.isci.2022.105872.
- Shao Z, Bian H, Chen Y, Wang Y, Zhang J, Ji X, et al. TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification. In: Advances in Neural Information Processing Systems; 2021.
- Achi HE, Belousova T, Chen L, Wahed A, Wang I, Hu Z, et al. Automated Diagnosis of Lymphoma with Digital Pathology Images Using Deep Learning. Ann Clin Lab Sci. 2019;49(2):153–60.
- Li D, Bledsoe JR, Zeng Y, Liu W, Hu Y, Bi K, et al. A deep learning diagnostic platform for diffuse large B-cell lymphoma with high accuracy across multiple hospitals. Nat Commun. 2020;11(1):6004. doi:10.1038/s41467-020-19817-3.
- Miyoshi H, Sato K, Kabeya Y, Yonezawa S, Nakano H, Takeuchi Y, et al. Deep learning shows the capability of high-level computer-aided diagnosis in malignant lymphoma. Lab Invest. 2020;100(10):1300–10. doi:10.1038/s41374-020-0442-3.
- Steinbuss G, Kriegsmann M, Zgorzelski C, Brobeil A, Goeppert B, Dietrich S, et al. Deep Learning for the Classification of Non-Hodgkin Lymphoma on Histopathological Images. Cancers (Basel). 2021;13(10). doi:10.3390/cancers13102419.
- Swiderska-Chadaj Z, Hebeda KM, van den Brand M, Litjens G. Artificial intelligence to detect MYC translocation in slides of diffuse large B-cell lymphoma. Virchows Arch. 2021;479(3):617–21. doi:10.1007/s00428-020-02931-4.
- Syrykh C, Abreu A, Amara N, Siegfried A, Maisongrosse V, Frenois FX, et al. Accurate diagnosis of lymphoma on whole-slide histopathology images using deep learning. NPJ Digit Med. 2020;3:63. doi:10.1038/s41746-020-0272-0.
- Zheng T, Zheng S, Wang K, Quan H, Bai Q, Li S, et al. Automatic CD30 scoring method for whole slide images of primary cutaneous CD30(+) lymphoproliferative diseases. J Clin Pathol. 2022. doi:10.2139/ssrn.4029432.
- Bruck OE, Lallukka-Bruck SE, Hohtari HR, Ianevski A, Ebeling FT, Kovanen PE, et al. Machine Learning of Bone Marrow Histopathology Identifies Genetic and Clinical Determinants in Patients with MDS. Blood Cancer Discov. 2021;2(3):238–49. doi:10.1158/2643-3230.BCD-20-0162.
- Wang C, Wei XL, Li CX, Wang YZ, Wu Y, Niu YX, et al. Efficient and Highly Accurate Diagnosis of Malignant Hematological Diseases Based on Whole-Slide Images Using Deep Learning. Front Oncol. 2022;12:879308. doi:10.3389/fonc.2022.879308.
- Srisuwananukorn A, Loscocco GG, Kuykendall AT, Dolezal JM, Santi R, Zhang L, et al. Interpretable Artificial Intelligence (AI) Differentiates Prefibrotic Primary Myelofibrosis (prePMF) from Essential Thrombocythemia (ET): A Multi-Center Study of a New Clinical Decision Support Tool. Blood. 2023;142:901. doi:10.1182/blood-2023-173877.
- Mu Y, Tizhoosh HR, Dehkharghanian T, Campbell CJV. Whole slide image representation in bone marrow cytology. Comput Biol Med. 2023;166:107530. doi:10.1016/j.compbiomed.2023.107530.
- Ramsauer H, Schafl B, Lehner J, Seidl P, Widrich M, Gruber L, et al. Hopfield Networks is All You Need. arXiv preprint arXiv:2008.02217. 2020.
- Xiang J, Shi M, Fiala MA, Gao F, Rettig MP, Uy GL, et al. Machine learning-based scoring models to predict hematopoietic stem cell mobilization in allogeneic donors. Blood Adv. 2022;6(7):1991–2000. doi:10.1182/bloodadvances.2021005149.
- Carreras J, Hiraiwa S, Kikuti YY, Miyaoka M, Tomita S, Ikoma H, et al. Artificial Neural Networks Predicted the Overall Survival and Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using a Pancancer Immune-Oncology Panel. Cancers (Basel). 2021;13(24). doi:10.3390/cancers13246384.
- Shouval R, Labopin M, Unger R, Giebel S, Ciceri F, Schmid C, et al. Prediction of Hematopoietic Stem Cell Transplantation Related Mortality - Lessons Learned from the In-Silico Approach: A European Society for Blood and Marrow Transplantation Acute Leukemia Working Party Data Mining Study. PLoS One. 2016;11(3):e0150637. doi:10.1371/journal.pone.0150637.
- Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost JE, et al. Multi-layer Representation Learning for Medical Concepts. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. doi:10.1145/2939672.2939823.
- Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;25(10):1419–28. doi:10.1093/jamia/ocy068.
- Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR Workshop Conf Proc. 2016;56:301–18.
- Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1:18. doi:10.1038/s41746-018-0029-1.
- Wang J, Luo J, Ye M, Wang X, Zhong Y, Chang A, et al. Recent Advances in Predictive Modeling with Electronic Health Records. arXiv preprint arXiv:2402.01077. 2024. doi:10.24963/ijcai.2024/914.
- Li Y, Rao S, Solares JRA, Hassaine A, Ramakrishnan R, Canoy D, et al. BEHRT: Transformer for Electronic Health Records. Sci Rep. 2020;10(1):7155. doi:10.1038/s41598-020-62922-y.
- Hu Z, Dong Y, Wang K, Sun Y. Heterogeneous Graph Transformer. In: Proceedings of The Web Conference (WWW); 2020.
- Sinha R, Schwede M, Viggiano B, Kuo D, Henry S, Wood D, et al. Harnessing Artificial Intelligence for Risk Stratification in Acute Myeloid Leukemia (AML): Evaluating the Utility of Longitudinal Electronic Health Record (EHR) Data Via Graph Neural Networks. Blood. 2023;142(Supplement 1):960. doi:10.1182/blood-2023-190151.
- Carrasco-Ribelles LA, Llanes-Jurado J, Gallego-Moll C, Cabrera-Bean M, Monteagudo-Zaragoza M, Violan C, et al. Prediction models using artificial intelligence and longitudinal data from electronic health records: a systematic methodological review. J Am Med Inform Assoc. 2023;30(12):2072–82. doi:10.1093/jamia/ocad168.
- Wornow M, Xu Y, Thapa R, Patel B, Steinberg E, Fleming S, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 2023;6(1):135. doi:10.1038/s41746-023-00879-8.
- GBD 2017 Pancreatic Cancer Collaborators. The global, regional, and national burden of pancreatic cancer and its attributable risk factors in 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet Gastroenterol Hepatol. 2019;4(12):934–47. doi:10.1016/S2468-1253(19)30347-4.
- Placido D, Yuan B, Hjaltelin JX, Zheng C, Haue AD, Chmura PJ, et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nat Med. 2023;29(5):1113–22. doi:10.1038/s41591-023-02332-5.
- Pedersen JS, Laursen MS, Rajeeth Savarimuthu T, Hansen RS, Alnor AB, Bjerre KV, et al. Deep learning detects and visualizes bleeding events in electronic health records. Res Pract Thromb Haemost. 2021;5(4):e12505. doi:10.1002/rth2.12505.
- Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv preprint arXiv:2305.09617. 2023.
- Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems. arXiv preprint arXiv:2303.13375. 2023.
- Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3(1):141. doi:10.1038/s43856-023-00370-1.
- Giorgi J, Toma A, Xie R, Chen SS, An KR, Zheng GX, et al. WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models. 2023.
- Jin Q, Wang Z, Floudas CS, Sun J, Lu Z. Matching Patients to Clinical Trials with Large Language Models. arXiv preprint. 2023.
- Mirza FN, Tang OY, Connolly ID, Abdulrazeq HA, Lim RK, Roye GD, et al. Using ChatGPT to Facilitate Truly Informed Medical Consent. NEJM AI. 2024;1(2):AIcs2300145. doi:10.1056/AIcs2300145.
- Civettini I, Zappaterra A, Ramazzotti D, Granelli BM, Rindone G, Aroldi A, et al. Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making. Blood. 2023;142(Supplement 1):3726. doi:10.1182/blood-2023-185854.
- Shah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866–9. doi:10.1001/jama.2023.14217.
- Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40. doi:10.1038/s41591-023-02448-8.
- Li H, Moon JT, Purkayastha S, Celi LA, Trivedi H, Gichoya JW. Ethics of large language models in medicine and medical research. Lancet Digit Health. 2023;5(6):e333–e5. doi:10.1016/S2589-7500(23)00083-3.
- Omiye JA, Lester JC, Spichak S, Rotemberg V, Daneshjou R. Large language models propagate race-based medicine. NPJ Digit Med. 2023;6(1):195. doi:10.1038/s41746-023-00939-z.