Abstract
BACKGROUND:
Clinical babesiosis is diagnosed, and parasite burden is determined, by microscopic inspection of a thick or thin Giemsa-stained peripheral blood smear. However, quantitative analysis by manual microscopy is subject to error. As such, methods for the automated measurement of percent parasitemia in digital microscopic images of peripheral blood smears could improve clinical accuracy, relative to the predicate method.
METHODS:
Individual erythrocyte images were manually labeled as “parasite” or “normal” and were used to train a model for binary image classification. The best model was then used to calculate percent parasitemia from a clinical validation dataset, and values were compared to a clinical reference value. Lastly, model interpretability was examined using integrated gradients to identify the pixels most likely to influence classification decisions.
RESULTS:
The precision and recall of the model during development testing were 0.92 and 1.00, respectively. In clinical validation, the model returned increasing positive signal with increasing mean reference value. However, the model returned 2 highly erroneous false positive values. Further, the model incorrectly assessed 3 cases relative to the clinical threshold of 10%. Integrated gradients analysis suggested potential sources of false positives, identifying rouleaux formations, cell boundaries, and precipitate as deterministic factors in negative erythrocyte images.
CONCLUSIONS:
While the model demonstrated highly accurate single cell classification and correctly assessed most slides, several false positives were highly incorrect. This project highlights the need for integrated testing of machine learning-based models, even when models in the development phase perform well.
Babesiosis is a haemoprotozoan disease that is most commonly transmitted from animals to humans by invertebrate vectors (e.g., Ixodes scapularis, the black-legged deer tick) (1). In the United States, 95% of cases occur in the Northeast and upper Midwest states, primarily between May and October. In the state of Connecticut, seroprevalence has been shown to range between 0.3% and 17.8%, with approximately 44 reported cases per 100 000 (2). Disease severity ranges from asymptomatic to severe, the latter of which may be life-threatening. Severe disease is more common in specific at-risk populations, including those who are post-splenectomy, immunocompromised, or older than 50 years of age. The all-cause mortality of babesiosis has been estimated at <1% for clinical cases and approximately 10% for iatrogenic cases (e.g., transfusion-transmitted) (2).
The diagnostic gold standard for babesiosis is microscopic inspection of thick or thin Giemsa-stained peripheral blood smear (1). If Babesia spp. is identified, the degree of parasitemia is used to guide patient management strategies. For mild disease, or minimal parasitemia, antimicrobials are the preferred therapy. However, the American Society for Apheresis guidelines state that severe babesiosis is a category II indication for red blood cell (RBC) exchange. Severe disease is determined both by clinical and laboratory criteria including significant parasitemia (e.g., >10%), the presence of comorbidities (e.g., asplenia), or severe symptoms such as disseminated intravascular coagulation or multiorgan failure (2). While there is no consensus on when to discontinue RBC exchange, it is recommended that patients with severe babesiosis be monitored closely, with parasitized erythrocytes quantified daily alongside continued RBC exchange until parasite burden decreases below 5% (2, 3).
Percent parasitemia is the number of parasite-infected erythrocytes divided by the total number of erythrocytes counted. To derive this in a clinical laboratory, a medical laboratory scientist (MLS) typically counts a large number of erythrocytes (e.g., 1000) using a 100× oil-immersion objective. While this process requires minimal laboratory equipment, it does require an experienced MLS to ensure optimal accuracy and reproducibility for serial measurement purposes (1). In addition, quantitative analysis by manual microscopy is subject to observer bias, slide distribution errors, statistical sampling error, and recording errors, and is inherently burdensome from time management and workflow efficiency standpoints (4, 5). Such limitations can mislead or delay therapeutic decision-making, particularly in the context of therapeutic RBC exchange. Accordingly, there remains a substantial need for automated methods that optimize the cost, efficiency, and accuracy of quantitative analysis.
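The calculation itself is simple; a minimal sketch (function and variable names are illustrative, not from the study):

```python
def percent_parasitemia(n_infected: int, n_total: int) -> float:
    """Percent parasitemia: infected erythrocytes over total erythrocytes counted."""
    if n_total == 0:
        raise ValueError("at least one erythrocyte must be counted")
    return 100.0 * n_infected / n_total

# e.g., 84 infected cells among 1000 counted gives 8.4% parasitemia
```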
The progress made in computer vision and machine learning (ML) technology over the last decade has encouraged a corresponding increase in their implementation in the clinical laboratory (6). With the decreasing availability of experienced medical laboratory scientists, evaluating ML-based software capabilities without expert operator review remains an important consideration in study design (7, 8). To this end, we sought to develop and evaluate the accuracy of an ML-based method for the automated measurement of percent parasitemia in digital microscopic images of peripheral blood smears. Specifically, we sought to describe the accuracy of parasitemia measurements, as determined by ML-based software, relative to an MLS-derived reference standard (MLS-RS). We hypothesized that results generated by the ML-based software would show superior precision to MLS-RS while achieving clinically comparable numerical results to the mean MLS-RS.
Materials and Methods
DATASET CURATION
This study was reviewed and approved by the Yale University Internal Review Board (IRB# 2000020244). Clinical blood samples were originally collected as part of routine clinical workflow in lavender-top (EDTA) tubes for screening and quantification of Babesia spp. Slides and concomitant digital images of the associated peripheral blood smears, which were found to be positive for Babesia spp. and negative for malaria (BinaxNOW Malaria; Abbott), were flagged for inclusion using previously described methods (9–11). Slides and the concomitant digital images of Babesia-negative samples were collected from the routine clinical workflow throughout the study period and reviewed by a clinical pathologist for the absence of Babesia spp. prior to inclusion.
Slides and images were separated into 2 distinct groups, representing separate patient cohorts: (a) the model development dataset and (b) the clinical validation dataset (Table 1). The model development dataset was used for training, validation, and testing of the cell classification model. The clinical validation dataset was used as a second, “external” validation dataset to evaluate how the model would perform in a clinical implementation workflow, as compared to a predicate method-based reference standard.
Table 1.
Definitions of common machine learning vernacular and terms used in this report.
| Term | Definition^a |
|---|---|
| Precision | Mathematical value describing the proportion of the algorithm’s positive predictions that are truly positive. Synonymous with positive predictive value. Calculated as: TP/(TP + FP) |
| Recall | Mathematical value describing the algorithm’s ability to correctly identify all true positive outcomes. Synonymous with sensitivity. Calculated as: TP/(TP + FN) |
| Accuracy | Mathematical value describing the overall correctness of an algorithm, irrespective of the type of true prediction (true positive or true negative). Calculated as: (TP + TN)/(TP + TN + FP + FN), i.e., correct predictions over all predictions (n) |
| Epoch | One complete pass of the training dataset through the model. The model trains on training data and checks its accuracy against validation data in each epoch. When performance in the current epoch no longer exceeds performance in the previous epoch, model training may be complete. |
| Training | The act of moving data through a predictive model, allowing the model to recognize patterns in the image, and saving subsequent updates to model parameters. |
| Validation | The act of evaluating prediction accuracy of any given version of a model throughout model training. Model training is complete when model performance on validation data is sufficient. |
| Testing | The act of evaluating prediction accuracy of the final version of the model, after model training is complete and weights are locked. This is done on data that was not presented to the model during training or validation. |
| Training dataset | Data used in the initial training of a model. Training is the first step of model development, when weights are initialized and adjusted, according to prediction accuracy per epoch. |
| Testing dataset | Data used to evaluate the prediction accuracy of the final version of the model, after weights are locked. This data was not presented to the model during training. |
| Model development dataset | Erythrocyte images that were used in the development of the classification model (Fig. 1). Model development involved training, validation, and testing experiments (Fig. 3). |
| Clinical validation dataset | Erythrocyte images that were used in the clinical validation of a model. Clinical validation took place after model weights were locked and development was complete. |
| Image segmentation | A process commonly employed in computer vision, which attempts to partition images into sets of pixels or regions of interest. This is done by evaluating the individual pixels of the image. |
| Model classification | ML models applied to the process of classifying something according to a set of attributes or properties. Model classification may be binary (2 classes predicted) or multi-class (greater than 2 classes predicted) in practice. |
| Integrated gradients | The process of identifying pixels within each image that most heavily influence a model’s prediction. Pixel contribution is derived from the gradient (i.e., slope or derivative) of the prediction function relative to each feature (i.e., pixel). |
^a Abbreviations: FN, false negative; FP, false positive; TN, true negative; TP, true positive.
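The three summary metrics in Table 1 follow directly from confusion-matrix counts; a dependency-free sketch (the true-positive and true-negative counts used in the example comment are illustrative, not the study's):

```python
def precision(tp: int, fp: int) -> float:
    """Positive predictive value: fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Sensitivity: fraction of true positives the model identifies."""
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall fraction of correct predictions, positive or negative."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g., 230 true positives alongside 20 false positives and 0 false negatives
# reproduces precision 0.92 and recall 1.00 (the 230 is illustrative).
```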
All peripheral blood smears were created and imaged on a DI-60 Integrated Slide Processing System (Cellavision AB). The DI-60 uses a 100× objective and a 0.5× magnifier prior to imaging, rendering an effective magnification of 50×. Images are 3-channel RGB, with a resolution of 5 pixels per micron. In the model development dataset, slide images had a mean height and width of 2884 pixels (95% CI: 2882–2885) and 2867 pixels (95% CI: 2865–2868) (Fig. 1, A). Slides included in the model development dataset were imaged a single time. Slides included in the clinical validation dataset were imaged 3 times on the same scanner to enable assessment of within-instrument precision for the quantification of Babesia spp. during subsequent portions of the study.
Fig. 1. Flow diagram of model development process.

(A), Slides included in the model development dataset were imaged a single time by the Cellavision DI-60 and uploaded to a custom-built web application for label annotation; (B), Central (x, y) coordinates of infected (red) and non-infected (blue) erythrocytes were marked on the slide-level images; (C), Central (x, y) coordinates were used to crop individual erythrocytes into 70 × 70 pixel, 3-channel arrays and paired with the corresponding label of either parasite (red) or normal (blue); (D), Labeled erythrocyte images were collectively divided 80:20 into training and testing datasets, respectively. The training dataset was further subdivided 70:30 into training and validation datasets, respectively; (E), The training and validation datasets were used to train the image classification model; (F), Following completion of training, the best model was used to evaluate model performance using the test dataset. See color figure online at clinchem.org.
CELL LABELING FOR MODEL DEVELOPMENT
Slide-level images from the model development dataset were uploaded to a custom-built web application for labeling of individual erythrocytes using one of two labels: “parasite” or “normal.” Using the web application, annotators marked central (x, y) coordinates of infected and non-infected erythrocytes (Fig. 1, B). The (x, y) coordinates of cell centers were then used to crop individual erythrocytes from the slide-level parent image into 70 × 70 pixel, 3-channel image arrays. These 70 × 70 × 3 images were then paired with their corresponding label of either parasite or normal (Fig. 1, C). The labeling process was performed by a single laboratory medicine attending and author of this manuscript (T.J.S. Durant). As a post-processing step, following completion of the annotation process, any (x, y) coordinate that was within 140 pixels of another (x, y) coordinate was removed from the dataset. This ensured that no cropped images overlapped in the final development dataset; overlapping crops could allow part of an image to be represented in both the training dataset and the validation or testing datasets, leading to overfitting or an over-optimistic estimate of model performance.
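The crop-and-deduplicate step can be sketched as follows (a simplified illustration, not the authors' code; note that the 140-pixel rule is twice the 70-pixel crop width, which conservatively guarantees non-overlapping crops):

```python
import numpy as np

def remove_close_pairs(coords: np.ndarray, min_dist: float = 140.0) -> np.ndarray:
    """Drop every coordinate lying within min_dist of another coordinate.

    coords: (n, 2) array of (x, y) cell centers. Both members of a close
    pair are removed, mirroring the described post-processing step.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)            # ignore self-distances
    keep = (dist >= min_dist).all(axis=1)
    return coords[keep]

def crop_cell(slide: np.ndarray, x: int, y: int, size: int = 70) -> np.ndarray:
    """Crop a size x size x 3 patch centered on (x, y) from the slide image."""
    half = size // 2
    return slide[y - half:y + half, x - half:x + half, :]
```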
Ultimately, the final dataset used for model development consisted of non-overlapping, individual erythrocyte images (shape: 70 × 70 × 3) with an associated label of parasite or normal. These data were split and used to train, validate, and test the image classification model. The model development dataset was divided 80:20 into training and testing datasets, respectively (Fig. 1, D). The training dataset was further subdivided 70:30 into training and validation datasets, respectively. The training and validation datasets were used during the training of the image classification model (Fig. 1, E). The testing dataset was used to evaluate model performance following completion of training (Fig. 1, F).
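The nested 80:20 and 70:30 splits can be sketched in a few lines (illustrative; the authors' exact tooling is described in the Supplemental Material):

```python
import random

def nested_split(items, seed=0):
    """80:20 train/test split, then 70:30 train/validation on the 80% portion."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_test = int(0.2 * len(items))
    test, trainval = items[:n_test], items[n_test:]
    n_val = int(0.3 * len(trainval))
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

# Net fractions: train 0.8 * 0.7 = 56%, validation 0.8 * 0.3 = 24%, test 20%.
```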
For model development and image classification, we implemented DenseNet121 as the base model, initialized with pretrained weights from ImageNet. More information on hardware specifications for ML training, ML neural network implementation, and model development protocol are provided in the online Supplemental Material.
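A rough sketch of how such a base model might be assembled in Keras follows; the classification head, optimizer, and metric choices shown here are assumptions for illustration, as the actual development protocol is in the Supplemental Material:

```python
import tensorflow as tf

def build_classifier(input_shape=(70, 70, 3)) -> tf.keras.Model:
    # Base model: DenseNet121 pretrained on ImageNet, without its top classifier.
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    # Single sigmoid unit for the binary parasite-vs-normal decision
    # (an assumed head; not specified in the main text).
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.AUC()])
    return model
```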
CLINICAL VALIDATION PROTOCOL
Following model development, a separate set of peripheral blood smear slides was used to assess the accuracy of the model in a simulated clinical workflow. Due to the inherent variability seen with quantitative analysis by microscopy, a clinical reference standard consisting of multiple measurements was compiled for comparisons between the model and the predicate method. Accordingly, each glass slide in the clinical validation dataset was independently evaluated by 3 MLSs with 26, 6, and 4 years of experience for MLS A, B, and C, respectively. The clinical validation slides were shuffled, specimen numbers on the glass slides were covered, and a box containing the clinical validation slides was given to each MLS for independent evaluation. Each MLS evaluated all clinical validation slides 3 separate times (Fig. 2, A). In total, this process generated 9 results of percent parasitemia for each slide in the clinical validation dataset. These data were used to calculate the mean percent parasitemia across all 9 reads, which was used as the MLS-RS for each case/sample (Fig. 2, B). Of note, the lower limit of quantification for percent parasitemia in the clinical laboratory at our institution is 1%, and results below this value are reported out as <1% in routine practice. For the purposes of this study, MLSs were asked to record the precise parasitemia value, including those below 1%, to allow for a completely empirical comparison against the model.
Fig. 2. Flow diagram of clinical validation process.

(A), Each peripheral blood smear was evaluated 3 times, in a blinded fashion, by each MLS; (B), This process yielded a total of 9 parasitemia results for each slide in the clinical validation dataset. These data were used to calculate the mean parasitemia across all 9 reads, which was used as the clinical reference standard for each case; (C), Each glass slide in the clinical validation dataset was imaged 3 separate times by the Cellavision DI-60; (D), Contour-based cell segmentation was used to extract individual erythrocytes from the DI-60 slide-level images as 70 × 70 × 3 cropped images; (E), Individually cropped erythrocytes were independently evaluated by the best model to yield a predicted class (i.e., parasite or normal); (F), The number of cells with the predicted label of parasite was divided by the number of total cells classified to yield the parasitemia result. This process was done once for each DI-60 image. With 3 images per specimen, this yielded a total of 3 parasitemia results per slide, which were used to calculate a mean parasitemia result for each specimen. See color figure online at clinchem.org.
For the model-based method, as mentioned, each slide in the clinical validation dataset was scanned 3 separate times by the DI-60 (Fig. 2, C). A custom cell-segmentation script was then used to crop individual erythrocytes from the peripheral blood smear image (Fig. 2, D). Cell segmentation was implemented in OpenCV (version: 4.2.0.34) using the contour-based cv.findContours() function. Cells were then cropped (shape: 70 × 70 × 3) from the peripheral blood smear image based on the center (x, y) coordinates of detected objects. Individual erythrocytes were then provided as input to the best model, as defined in the development protocol, to yield a predicted class (i.e., parasite or normal) for each individually cropped erythrocyte (Fig. 2, E). Following classification of individual erythrocytes, the number of cells with the predicted label of parasite was divided by the number of total cells classified to yield the percent parasitemia. This process was done once for each image; with 3 images per specimen, this yielded a total of 3 parasitemia results per slide (Fig. 2, F).
Method-to-method comparisons between the model and MLS-RS percent parasitemia were made using a variety of approaches: (a) bar plot visualization; (b) regression and Bland–Altman plots; (c) quantitative agreement of model percent parasitemia in relation to ±2 SD of the mean MLS-RS percent parasitemia (n = 9) for each case in the clinical validation dataset; (d) categorical agreement of percent parasitemia bins; (e) categorical agreement around the clinical decision threshold of 10%; and (f) unpaired t-tests for pair-wise comparisons between MLS and model-derived percent parasitemia, with the Holm–Šídák method used to account for multiple comparisons. Precision was assessed using the coefficient of variation, which was calculated on a case-wise basis across the MLS (n = 9) and model (n = 3) results.
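Two of these statistical pieces can be sketched in a dependency-light way (illustrative implementations, not the authors' code):

```python
import numpy as np

def coefficient_of_variation(values) -> float:
    """CV (%) = sample SD / mean * 100."""
    v = np.asarray(values, dtype=float)
    return float(np.std(v, ddof=1) / np.mean(v) * 100.0)

def holm_sidak(p_values) -> np.ndarray:
    """Holm-Sidak step-down adjustment for multiple comparisons."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak correction with step-down exponent, monotonicity enforced.
        candidate = 1.0 - (1.0 - p[idx]) ** (m - rank)
        running_max = max(running_max, candidate)
        adj[idx] = min(running_max, 1.0)
    return adj
```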
MODEL INTERPRETABILITY
In an effort to examine the relationship between model predictions and image features, we implemented an explainable artificial intelligence technique based on axiomatic attribution for deep networks, known as integrated gradients (IG) (12). While the details of IG are outside the scope of this report, its general purpose is to identify the pixels within each image that most heavily influence the model’s prediction; pixel contributions are derived from the gradient (i.e., slope or derivative) of the prediction function relative to each feature (i.e., pixel). For the purposes of this report, we provide representative samples of what we observed when reviewing images derived from an IG implementation, applied to the test images in the model development dataset.
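Although the full method is beyond this report's scope, the core computation is compact: attributions are the input-minus-baseline difference times the average gradient along a straight-line path from baseline to input. A minimal numpy sketch on a toy differentiable model (a logistic unit standing in for the CNN; all names are illustrative):

```python
import numpy as np

# Toy differentiable "model": a logistic unit over 16 flattened "pixels".
rng = np.random.default_rng(0)
w = rng.normal(size=16)

def predict(x):
    return 1.0 / (1.0 + np.exp(-x @ w))

def grad(x):
    p = predict(x)
    return p * (1.0 - p) * w          # d sigmoid(w.x) / dx

def integrated_gradients(x, baseline, steps=1000):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = rng.normal(size=16)
attr = integrated_gradients(x, baseline=np.zeros(16))
# Completeness axiom: attributions sum to f(x) - f(baseline).
```

For the study's CNN, the same computation is done with automatic differentiation, and the per-pixel attributions are rendered as the masks shown in Figs. 5 and 6.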
Results
DATASET CURATION
A total of 96 unique slides were included in this study. Of these, 71 slides were included in the development dataset, 28 of which were found to be positive for Babesia spp. by routine clinical workflow. A total of 14 633 individual erythrocyte images were initially labeled. Of those, 2019 images that had overlapping cells were removed, yielding a final development dataset of 11 388 erythrocytes labeled as normal and 1226 with a parasite. The mean number of labeled cells per unique slide was 178 (63), range 1 to 286. Of the slide-level images that were Babesia-positive, the mean parasitemia was 6.5 (4.5)%, range 1.0% to 20.0%. The clinical validation dataset consisted of the remaining 25 slides, of which 64% (n = 16) were Babesia-positive. The mean parasitemia among the Babesia-positive slides in the clinical validation dataset was 8.9 (9.4)%, range 1.0% to 29.2%.
MODEL DEVELOPMENT
The cell classification model was trained 3 separate times. Each training replicate consisted of 50 epochs (iterations). Learning rates decayed following validation loss plateau across all training replicates, with the final value ranging from 1 × 10−8 to 1 × 10−9. Minimum validation loss was observed following completion of training epoch 22, 22, and 31 for each of the training replicates, with a mean binary cross-entropy of 0.024 (0.003). Binary cross-entropy loss was plotted and inspected for positive divergence of validation loss, relative to training loss, as an empirical indicator of overfitting. This was observed minimally in the later training epochs (see online Supplemental Fig. 1, A). Precision, recall, and area under the receiver operating characteristic curve asymptotically approached model performance limits, which were concordant with plateaus of validation loss, indicating that further model improvement was unlikely with additional training iterations (see online Supplemental Fig. 1, B–D). Training replicate 3 achieved the lowest validation loss during training (0.021) and was subsequently used for evaluation of the test and clinical validation datasets. Model predictions on the test dataset resulted in 20 false positives and zero false negatives. The precision and recall were 0.92 and 1.00, respectively (Fig. 3, A). The binary classification accuracy was 0.99. The distribution of predicted probabilities for erythrocytes in the test dataset was visualized and demonstrated a predominantly bimodal distribution between the predicted classes (Fig. 3, B).
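The described schedule, with learning-rate decay on validation-loss plateau and retention of the lowest-validation-loss weights, maps onto standard Keras callbacks; a configuration sketch in which the factor, patience, and filename values are assumptions:

```python
import tensorflow as tf

callbacks = [
    # Decay the learning rate when validation loss plateaus, down to ~1e-9.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.1, patience=3, min_lr=1e-9),
    # Keep the weights from the epoch with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_loss", save_best_only=True),
]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```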
Fig. 3. Model classification results on test dataset.

(A), Confusion matrix of actual vs predicted labels; (B), Per-cell probability distribution of the model-predicted class with actual labels depicted in color (red = parasite; blue = normal). The probability of the predicted class being parasite is on the x-axis; the y-axis is a random jitter value between 0 and 1 assigned to each cell to aid visualization of the data points. Green dotted line: decision threshold for the prediction label of parasite—i.e., cells with a predicted probability of ≥0.5 are labeled as parasite. See color figure online at clinchem.org.
CLINICAL VALIDATION OF THE MODEL-BASED METHOD
A total of 25 unique slides were identified for evaluation in the clinical validation set, 16 of which were found to be positive for Babesia spp. by routine clinical workflow. Of those 16, one (case 15) was excluded from analysis per the consensus recommendation of the participating MLSs, due to excessive artifact, Howell–Jolly bodies, and only rare, dying parasites. The remaining slides were evaluated in 3 separate instances by each of the MLSs, with mean parasitemia ranging from <0.1% to 38.5% (see online Supplemental Table 1 and Supplemental Fig. 2).
The custom cell-segmentation module identified a mean of 2098 (25th–75th percentiles: 1924–2357) individual cells in each peripheral blood smear image (i.e., Fig. 2, C and D). Model classification of individual cells demonstrated an increasing positive signal (i.e., higher parasite count) with respect to the MLS-RS; however, the automated model also demonstrated spurious positive signal with the negative cases (cases 16–25). In addition, the model returned highly erroneous false positive signal on cases 11 and 16, relative to the MLS-RS (Fig. 4). A simple linear regression was performed to evaluate the concordance between the MLS-RS and the model predictions. The regression equation was determined as 4.78 + 0.55x with a correlation coefficient (r2) of 0.244 (see online Supplemental Fig. 3, A). With cases 11 and 16 removed, the regression equation was calculated as 1.68 + 0.68x with an r2 of 0.916. Bland–Altman plots were also assessed for bias trends, and similarly demonstrated erroneously high positive signal on the low end and erroneously low positive signal on the high end (see online Supplemental Fig. 3, B and C).
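The sensitivity of r² to a small number of gross outliers is easy to reproduce in form with ordinary least squares; an illustrative sketch on synthetic numbers (not the study data):

```python
import numpy as np

def fit_line(x, y):
    """OLS slope/intercept and coefficient of determination r^2."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r ** 2

# Synthetic example: near-linear data plus two gross false-positive outliers,
# analogous in form to cases 11 and 16.
x = np.array([1.0, 2.0, 4.0, 8.0, 12.0, 20.0, 1.5, 0.5])
y = np.array([1.2, 2.1, 3.9, 7.5, 11.0, 18.0, 25.0, 30.0])  # last two: outliers

_, _, r2_all = fit_line(x, y)
_, _, r2_clean = fit_line(x[:-2], y[:-2])
# r2_clean is far higher than r2_all, mirroring the 0.916 vs 0.244 pattern.
```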
Fig. 4.

Bar plot of mean percent parasitemia for the MLS-RS (n = 9) and the model-based method (n = 3). Error bars represent 1 SD. See color figure online at clinchem.org.
Of the 14 positive cases included in the clinical validation dataset, 10 were within 2 SD of the MLS-RS mean. However, only 7 were concordant between the model and MLS-RS with regard to the percent parasitemia bins. In addition, there were 3 major errors by the model-based method, which were defined as discordance around the clinical decision point of 10% parasitemia. Of the 14 positive cases, the MLS-RS CV was <20% in only 3 cases, whereas the model CV was <20% for 10 of the cases (see online Supplemental Table 2).
MODEL INTERPRETABILITY
Cells from the test dataset and the clinical validation dataset were evaluated using the IG approach to visualize feature pixel-level activation patterns. Cells from the test dataset generally demonstrated activation of pixels that were near the intraerythrocytic parasite (Fig. 5). Cells from case 25, a negative case in the clinical validation set, were also examined and demonstrated erroneous activation on non-parasitic features. Some of these features included erythrocyte abnormalities (e.g., target cell contours), precipitate, and overlying platelets. In some cases, the model appeared to be focusing on background pixels, which may be indicative of overfitting in some aspects of the model (Fig. 6).
Fig. 5.

Integrated gradient visualizations including the original image, the pixel-wise IG attribution mask, and the overlay of the two. Images are from the model development test dataset. (A and B), Representative examples from the parasite class; (C and D), Representative examples from the normal class. See color figure online at clinchem.org.
Fig. 6.

Integrated gradient visualizations including the original image and an overlay of the pixel-wise integrated gradient attribution mask and the original image. Images are from case 25 of the clinical validation dataset and are those that were predicted as belonging to the parasite class. See color figure online at clinchem.org.
Discussion
In this study, we found that in isolated testing of the model, performance metrics demonstrated highly accurate results. Training and validation loss curves demonstrated minimally appreciable divergence towards the end of training iterations, which would imply that there is negligible overfitting with the cell classification model (see online Supplemental Fig. 1, A). The sigmoid activation function used for the classification layer of the model demonstrated good separation between the parasite class and the non-parasite class, with only 20 false positive cells in the test dataset (Fig. 3, A). However, when the model was integrated with contour-based cell segmentation and applied to the clinical validation dataset, method comparison studies demonstrated suboptimal concordance between the model-based method and the MLS-RS. Simple linear regression between the two methods had a calculated correlation coefficient (r2) of 0.244 and 0.916 with and without outliers, respectively. In addition, only 7 of the 14 positive cases were concordant between the model-based method and MLS-RS when grouped by percent parasitemia bins. Lastly, there were 3 major errors by the model-based method, which were defined as discordance around the clinical decision point of 10% parasitemia (see online Supplemental Table 2).
The root cause analysis of the discrepant results observed in the clinical validation phase revealed several likely causative factors. The model-based method returned highly erroneous positive signal with cases 11 and 16, relative to the MLS-RS (Fig. 4). These errors were likely driven, in part, by the quality of the blood smears in these cases, which contained a substantial amount of precipitate and rouleaux formations. For blood smear images where there was minimal to no rouleaux formation, a mean of approximately 2000 cells were presented to the model for prediction, and visual inspection of contour-based cell segmentation suggested adequate performance (see online Supplemental Fig. 4). However, in the context of substantial rouleaux formation, cell segmentation resulted in approximately 300 to 800 individual cells identified for evaluation (see online Supplemental Figs. 5 and 6). In combination with overlying precipitate, which can be mistaken for intraerythrocyte parasites, this resulted in a high numerator (i.e., false positives) and a low denominator (i.e., fewer individually segmented cells), which led to erroneous elevations in parasitemia quantification for cases 11 and 16, relative to the MLS-RS.
Cell segmentation did not appear, however, to be a contributing factor to the false-positive and false-negative results observed in other cases where minimal to no erythrocyte clumping was observed. To this end, IG interpretability experiments helped develop an intuitive sense of what was causing the observed model behavior. A limitation of IG is that this method only provides an indication of feature importance on individual images and does not offer insight across the entire dataset. In addition, it also only explains individual feature contributions and does not examine how feature interactions may contribute to predictions (13). Nonetheless, IG experiments in this study revealed that model predictions of the target class, parasite, were generally most impacted by pixels spatially related to intraerythrocytic ring forms (Fig. 5). However, there were instances wherein pixel-wise activation patterns were found to be localized outside of the erythrocyte and corresponding to background noise, in some instances appearing to contribute to false-negatives (see online Supplemental Fig. 7). This would suggest that there is some degree of overfitting that is not obviously appreciable through visual inspection of the train and validation loss curves, or noisy signal outside of the region of interest (i.e., intraerythrocytic region), which may be inadvertently causing network activation.
IG also provided some insight into factors that predispose the model to false-positive predictions in the clinical validation dataset. Cells classified as parasite from case 25 demonstrated pixel-wise activation patterns suggesting that the model prediction of the target class is susceptible to features that share similarities with ring-form parasites. Examples of these microscopic features that were associated with localized pixel activation included variations in erythrocyte morphology (e.g., target cell contours) and overlying precipitate or platelets (Fig. 6). While images from the clinical validation dataset can be inspected for false positives and false negatives without IG, visualizing pixel-wise activation patterns is useful for identifying which image features are the likely source of erroneous signal and, from these insights, guiding the appropriate corrective action in subsequent development efforts.
In general, model misclassification errors may be remedied by increasing the number of class examples during training. In doing so, the model input space is likely to be more representative of the heterogeneity the model is expected to encounter with real-world data. However, in the context of training classification models in healthcare, particularly those that rely on cases of low prevalence diseases, increasing the number of training examples can be prohibitive. There are techniques that can be implemented to expand the size of the training dataset artificially (e.g., label-preserving image transformations) and are meant to improve model performance and generalizability. These label-preserving data transformations were performed when training the algorithm in this study. However, these techniques are limited in terms of their performance benefits and cannot portray inherent intra-class variability that is not already represented in the existing training dataset, as demonstrated by the persistence of false negatives and false positives (e.g., Fig. 6) in the clinical validation phase of this study. Ultimately, post-hoc visual analysis of clinical validation images with and without IG overlays suggests that the number of training images was not sufficient to fully represent the heterogeneity in the input image space. However, in some instances, additional images may not be helpful such as when precipitate and parasites are hard to distinguish without additional magnification or z-plane focus optimization.
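Label-preserving transformations of the kind mentioned above (flips and right-angle rotations leave a "parasite" crop a valid "parasite" crop) can be sketched without any framework; this is an illustrative example, not the study's augmentation pipeline:

```python
import numpy as np

def augment(cell: np.ndarray) -> list:
    """Return the 8 dihedral (flip/rotate) variants of a 70 x 70 x 3 crop.

    Each transform preserves the class label, multiplying the effective
    number of training examples eightfold.
    """
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(cell, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # plus a horizontal mirror of each
    return variants
```

As the text notes, such transforms cannot introduce intra-class variability (e.g., unseen precipitate patterns) that is absent from the original training images.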
There are several lessons learned from this study. Software testing traditionally includes phases in which individual components are tested in isolation (i.e., unit testing) and are subsequently combined and tested as a group (i.e., integrated testing). ML-based software should be no exception in following this standard software engineering approach. As seen in this study, unit testing of the model demonstrated acceptable results; however, when tested in an integrated fashion, additional points of failure were recognized. In this work, integrated testing revealed the need to consider alternative approaches to cases with clumps of erythrocytes. Future research may consider leveraging ML for cell segmentation by using algorithms that combine object detection and classification. However, ML-based object detection would require additional annotation of the training dataset (e.g., bounding box annotation around target objects) that sufficiently captures the heterogeneity found among clumped erythrocyte forms in clinical practice. A reasonable alternative for handling cases with clumped erythrocytes is to automatically detect them as part of a pre-analytic quality assurance step, treat them as an “interferent,” and route these cases to a workflow that involves manual review. Quality assurance modules such as these could be implemented during the slide scanning step, but this type of optimization was outside the scope of this study.
In addition, integrated testing also highlighted the potential benefit of including additional “hold-out” datasets. Traditionally, unit testing of ML models involves a single hold-out dataset that is commonly referred to as the “test dataset.” However, there is likely some benefit in having additional external datasets for clinical validation. Datasets that are collected at a different time relative to the original train–test dataset, or collected at a different study location, may provide a better understanding of model robustness and generalizability (14). In addition, by testing the model against more “unseen” data, particularly in the context of relatively small training datasets, under-represented features in the input space of the training dataset may be detected through post-hoc visual inspection of model predictions. In this study, training data and clinical validation data were temporally separated, and the clinical validation dataset was relatively large (approximately 150 000 cells) compared to the training dataset. Through visual inspection of model predictions from the clinical validation dataset, with and without IG overlays, potential interferents causing false-positive and false-negative predictions were able to be identified. Further, these cells, which were identified as false negatives or false positives, could be used to retrain the algorithm to iteratively improve performance.
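The temporal separation described above amounts to partitioning cases by collection date rather than at random. A trivial sketch, in which the case identifiers, dates, and cutoff are all hypothetical:

```python
from datetime import date

# Hypothetical case records: (case_id, specimen collection date)
cases = [
    ("case-01", date(2019, 6, 1)),
    ("case-02", date(2019, 8, 15)),
    ("case-03", date(2020, 7, 3)),
    ("case-04", date(2021, 6, 20)),
]

# Train on earlier transmission seasons; hold later seasons out entirely
cutoff = date(2020, 1, 1)
train = [cid for cid, d in cases if d < cutoff]
clinical_validation = [cid for cid, d in cases if d >= cutoff]

print(train)                # earlier cases only
print(clinical_validation)  # temporally separated hold-out cases
```

Splitting at the case level (not the cell level) also prevents cells from the same slide leaking into both partitions.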
Conclusion
In summary, the discrepancies between the model-based method and the MLS-RS are likely due to (a) under-represented variability of noisy artifacts in the training dataset (e.g., stain precipitate, overlying platelets, variability of image quality, and variations in intraerythrocytic morphology); (b) variable resolution of intracellular parasites; and (c) inadequate handling of clumped erythrocytes by the cell-segmentation module. Performance issues due to the under-represented variability of noisy artifacts could be remedied by investigating several approaches, such as up-sampling known artifact examples during training and modifying the loss function to penalize difficult false positives (15). However, increasing the volume of annotated examples in the training dataset would likely be the most efficient, definitive, and robust approach. Optimization of slide scanning is outside the scope of this report, but improved resolution of intracellular parasites through increased magnification or inclusion of z-plane images would likely assist both annotator and model in distinguishing between parasite and precipitate. Potential optimizations or alternative scanning methods should be considered in the context of how well those solutions integrate with current or future potential workflows in the clinical laboratory. Cell segmentation, as previously mentioned, could be optimized by empirically testing alternative computer vision methods for object detection, or slides with excess erythrocyte clumping could be excluded from analysis if the frequency of this occurrence is sufficiently rare in clinical practice.
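The loss-function modification cited above (focal loss; ref 15) can be sketched as follows. The probabilities and labels are hypothetical, and this illustrates the published formulation rather than code from this study:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al., ref 15): the (1 - p_t)^gamma factor
    down-weights easy, well-classified examples so training gradient is
    dominated by hard ones, e.g., confident false positives."""
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balance weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.6, 0.9])  # predicted probability of "parasite"
y = np.array([1, 1, 0])        # third example is a confident false positive
losses = focal_loss(p, y)
print(losses)  # the confident false positive dominates the per-example loss
```

With gamma = 0 and alpha = 0.5 this reduces (up to a constant factor) to standard cross-entropy; increasing gamma progressively suppresses the contribution of easy examples.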
Indeed, with the increasing breadth of ML technologies, there are multiple avenues to pursue for parasite quantification. This work highlights the need to interrogate the performance of ML-based technology beyond the train–test development cycle: to pursue integrated system testing, to include additional hold-out datasets where possible, and to explore explainable artificial intelligence solutions that identify potential sources of erroneous model predictions. Future research is needed to delineate which methods of parasite quantification are most robust, scalable, and most easily implemented into clinical workflows, as well as to address data quality for ML implementation in microscopic image-based computer analysis.
Supplementary Material
Supplemental material is available at Clinical Chemistry online.
Acknowledgments:
We would like to acknowledge Lisa Mehlin, Holly Base, and Laura Pires for volunteering their time to quantify Babesia parasites for the purposes of the clinical validation portion of this study. We would also like to acknowledge John Errico and Cai Mayberry for their administrative support of this work.
Research Funding:
T. Durant received funding from the Academy of Clinical Laboratory Physicians and Scientists: Paul E. Strandjord Young Investigator Research Grant.
Role of Sponsor:
The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, preparation of manuscript, or final approval of manuscript.
Nonstandard Abbreviations
- RBC
red blood cell
- MLS
medical laboratory scientist
- ML
machine learning
- MLS-RS
medical laboratory scientist-reference standard
- IG
integrated gradient
Footnotes
Authors’ Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:
Author Declaration: A version of this article was previously posted as a preprint on medRxiv as https://www.medrxiv.org/content/10.1101/2021.04.27.21256115v1.
Employment or Leadership: None declared.
Consultant or Advisory Role: T. Durant, Roche and Instrumentation Laboratories; W. Schulz, Hugo Health, Instrumentation Laboratories, and Interpace Diagnostics.
Stock Ownership: W. Schulz, Refactor Health; R. Torres, Applikate Technologies.
Honoraria: None declared.
Expert Testimony: None declared.
Patents: None declared.
References
- 1. Miller JM, Binnicker MJ, Campbell S, Carroll KC, Chapin KC, Gilligan PH, et al. A guide to utilization of the microbiology laboratory for diagnosis of infectious diseases: 2018 update by the Infectious Diseases Society of America and the American Society for Microbiology. Clin Infect Dis 2018;67:e1–94.
- 2. Padmanabhan A, Connelly-Smith L, Aqui N, Balogun RA, Klingel R, Meyer E, et al. Guidelines on the use of therapeutic apheresis in clinical practice - evidence-based approach from the Writing Committee of the American Society for Apheresis: The Eighth Special Issue. J Clin Apher 2019;34:171–354.
- 3. Wormser GP, Dattwyler RJ, Shapiro ED, Halperin JJ, Steere AC, Klempner MS, et al. The clinical assessment, treatment, and prevention of Lyme disease, human granulocytic anaplasmosis, and babesiosis: clinical practice guidelines by the Infectious Diseases Society of America. Clin Infect Dis 2006;43:1089–134.
- 4. Pierre RV. Peripheral blood film review. The demise of the eyecount leukocyte differential. Clin Lab Med 2002;22:279–97.
- 5. Rümke CL. Imprecision of ratio-derived differential leukocyte counts. Blood Cells 1985;11:315.
- 6. Florin L, Maelegheer K, Muyldermans A, Van Esbroeck M, Nulens E, Emmerechts J. Evaluation of the CellaVision DM96 advanced RBC application for screening and follow-up of malaria infection. Diagn Microbiol Infect Dis 2018;90:253–6.
- 7. Garcia E, Kundu I, Kelly M, Soles R. The American Society for Clinical Pathology’s 2018 vacancy survey of medical laboratories in the United States. Am J Clin Pathol 2019;152:155–68.
- 8. Garcia E, Kundu I, Ali A, Soles R. The American Society for Clinical Pathology’s 2016–2017 vacancy survey of medical laboratories in the United States. Am J Clin Pathol 2018;149:387–400.
- 9. McPadden J, Warner F, Young HP, Hurley NC, Pulk RA, Singh A, et al. Clinical characteristics and outcomes for 7,995 patients with SARS-CoV-2 infection. PLoS One 2021;16:e0243291.
- 10. Durant TJS, Gong G, Price N, Schulz WL. Bridging the collaboration gap: Real-time identification of clinical specimens for biomedical research. J Pathol Inform 2020;11:14.
- 11. McPadden J, Durant TJ, Bunch DR, Coppi A, Price N, Rodgerson K, et al. Health care and precision medicine research: analysis of a scalable data science platform. J Med Internet Res 2019;21:e13043.
- 12. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. PMLR; 2017.
- 13. TensorFlow. Integrated gradients. 2020. https://www.tensorflow.org/tutorials/interpretability/integrated_gradients (Accessed November 2021).
- 14. Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 2018;286:800–9.
- 15. Lin T-Y, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 2020;42:318–27.
