Journal of Clinical Microbiology
2021 Jan 21;59(2):e02236-20. doi: 10.1128/JCM.02236-20

Deep Neural Networks Offer Morphologic Classification and Diagnosis of Bacterial Vaginosis

Zhongxiao Wang b,#, Lei Zhang a,#, Min Zhao h, Ying Wang a, Huihui Bai g, Yufeng Wang a, Can Rui i, Chong Fan i, Jiao Li j, Na Li j, Xinhuan Liu k, Zitao Wang f, Yanyan Si n, Andrea Feng m, Mingxuan Li c,d, Qiongqiong Zhang a,q, Zhe Yang o, Mengdi Wang l, Wei Wu c,d, Yang Cao c,d, Lin Qi e, Xin Zeng i, Li Geng k, Ruifang An j, Ping Li i, Zhaohui Liu g, Qiao Qiao f, Weipei Zhu e, Weike Mo c,d,p, Qinping Liao a,q, Wei Xu b
Editor: Erik Munson
PMCID: PMC8111127  PMID: 33148709


KEYWORDS: bacterial vaginosis, application of AI to diagnostic microbiology, automation in clinical microbiology

ABSTRACT

Bacterial vaginosis (BV) is caused by the excessive and imbalanced growth of bacteria in the vagina and affects 30 to 50% of women. Gram staining followed by Nugent scoring of bacterial morphotypes under the microscope is considered the gold standard for BV diagnosis, but this method is labor-intensive and time-consuming, and results vary from person to person. We developed and optimized a convolutional neural network (CNN) model and evaluated its ability to automatically identify and classify the three categories of Nugent scores from microscope images. The CNN model was first established with a panel of microscopic images with Nugent scores determined by experts. The model was trained by minimizing a cross-entropy loss function and optimized with a momentum optimizer. Separate test sets of images collected from three hospitals were then evaluated by the CNN model. The CNN model consisted of 25 convolutional layers, 2 pooling layers, and a fully connected layer. The model obtained 82.4% sensitivity and 96.6% specificity on the 5,815 validation images when altered vaginal flora and BV were considered the positive samples, which was better than the rates achieved by top-level technologists and obstetricians in China. The model also generalized well, achieving 75.1% accuracy for the three categories of Nugent scores on an independent test set of 1,082 images, 6.6% higher than the average of three technologists, who hold bachelor's degrees in medicine and are qualified to make diagnostic decisions. When three technologists each read the same specimens (run in triplicate), the precision for the three categories of Nugent scores was 54.0%. One hundred three samples diagnosed by two technologists on different days showed a repeatability of 90.3%. The CNN model outperformed human health care practitioners in both accuracy and stability for three-category Nugent score diagnosis. The deep learning model may offer translational applications in automating the diagnosis of bacterial vaginosis with proper supporting hardware.

INTRODUCTION

Abnormal vaginal discharge and odor are vaginitis symptoms that affect millions of women globally and are among the most common reasons for women to visit clinics. Bacterial vaginosis (BV) (40 to 50%), vulvovaginal candidiasis (VVC) (20 to 25%), and trichomonas vaginitis (TV) (15 to 20%) are the leading types of vaginitis. BV represents a dysbiosis of the vaginal microbiome that is associated with significant adverse health outcomes, including preterm labor resulting in low birth weight, pelvic inflammatory disease, acquisition of human immunodeficiency virus, and increased susceptibility to sexually transmitted infections (1–8). In the United States, women have a high BV prevalence, 29.2%, which varies with race: African-American women, 51%; Hispanic women, 32%; and white women, 23% (9). In China, the prevalence of BV in the few cities with survey data was between 15 and 20%, representing more than 100 million women (10).

Different methods have been developed for diagnosing BV, but microscopy with Gram staining is considered the gold standard (11). Clinical criteria (Amsel's criteria) for the diagnosis of BV have a sensitivity of only 60 to 72% (12, 13). Other methods, including enzymatic tests such as the OSOM BV Blue test and molecular methods such as the BD MAX vaginal panel, have higher sensitivities, 91.7% and 90.7%, respectively, for the diagnosis of BV compared to microscopy using Nugent scores (14, 15). The WHO Guidelines Group recommended Gram staining followed by microscopy using the Hay-Ison criteria as the current best practice for diagnosing BV in women because it is easier and quicker to use in clinical practice, but the Nugent score has remained the gold standard for studies. The two schemes align closely: Hay-Ison grades I, II, and III correspond to Nugent scores of 0 to 3, 4 to 6, and 7 to 10, respectively (11).

In 1991, Nugent et al. (16) reported the use of a numerical score to diagnose BV by semiquantification of Gram-positive rods, Gram-negative coccobacilli, and curved Gram-negative rods after Gram staining. These morphotypes were thought to represent Lactobacillus spp., Gardnerella vaginalis, and Mobiluncus spp., respectively. Nugent scoring has since become the gold standard for laboratory diagnosis of BV (7, 8, 17). On the Nugent scale, scores of 0 to 3 indicate normal vaginal flora (Lactobacillus dominant), scores of 4 to 6 indicate altered vaginal flora (mixed morphotypes), and scores of 7 to 10 indicate BV (absence of lactobacilli and predominance of the other two morphotypes). Alternative diagnostic methods, such as molecular diagnostic assays, enzymatic assays, and chromogenic point-of-care tests, have been compared to the Nugent criteria (18). However, determination of a Nugent score by a microbiologist is time-consuming and easily influenced by the individual's skill (17, 18). In addition, the number of experienced microbiologists or technologists performing the microscopic work is insufficient in some countries and districts (19). Therefore, a more efficient method to classify Nugent scores is needed.

Here, we provide a proof of concept for a deep-learning-based model to quantify Gram stains and, hence, automate the classification of Nugent scores. Recently, a traditional image processing method was developed for automatic bacterial vaginosis diagnosis; however, its sensitivity (58.3%) and specificity (79.1%) were poor in comparison to the results obtained by experts, due to limitations in the image processing algorithm (17). Deep learning methods, especially convolutional neural network (CNN) models, have demonstrated excellent performance on computer vision tasks, including image classification, image semantic segmentation, and image object detection. Numerous CNN models, including LeNet-5 (20), AlexNet (21), VGGNet (22), ResNet (23), GoogLeNet (24), Xception (25), FCN (26), PSPNet (27), and U-Net (28), were developed with steadily increasing performance in natural image recognition. Many models have proved effective for medical image processing, including identifying diabetic retinopathy in retinal fundus photographs (29–31), interpreting endoscopic images (32), and recognizing microorganisms (33–36). We hypothesized that CNN-based deep learning models could diagnose BV by Nugent score classification efficiently and accurately. First, we developed several CNN models to learn from images that had been previously evaluated and curated by obstetricians and microbiologists. Second, a trained model was tested on a separate image set from the same hospital and evaluated for accuracy against expert classification. Finally, three independent test sets collected from three different medical institutions were used to verify the versatility of our CNN model.

MATERIALS AND METHODS

Data collection.

A total of 29,095 microscopic images and the associated medical records from January 2018 to September 2019 at Beijing Tsinghua Changgung Hospital were retrieved. For each slide, the field under the microscope that the image represents was selected by technologists with more than 10 years of clinical experience; the Nugent score of the selected field represented the whole slide. One-fifth of all samples were randomly selected as the validation set (5,815 samples), and the rest (23,280 samples) were used as the training set. The resolution of the samples was 1,024 by 768 pixels.

To verify our model's accuracy in a broader setting and to compare it with determinations made by various experts, three independent test data sets (A, B, and C) were constructed. Set A comprised 427 images collected from Beijing Tsinghua Changgung Hospital, set B included 359 images collected from the Second Affiliated Hospital of Soochow University, and set C contained 296 images from the Affiliated Hospital of Inner Mongolia Medical University. The resolution of the samples in sets B and C was 1,280 by 1,024 pixels. To standardize the images, the center 1,280 by 960 pixels were cropped and resized to 1,024 by 768 pixels.

Diagnoses were made by experts from the National Committee of Gynecological Infection of the Chinese Medical Association, including two chief obstetricians and three microbiologists, based on the microscopic images. Each sample was first evaluated by two microbiologists; a concordant result was considered the ground truth. Discordant samples were reviewed by a chief obstetrician. If the chief obstetrician agreed with one of the microbiologists, the chief obstetrician's result was taken as the ground truth; if the chief obstetrician agreed with neither microbiologist, the sample was removed. The Nugent scores obtained ranged from 0 to 10 and were divided into three groups: normal vaginal flora (score of 0 to 3), altered vaginal flora (score of 4 to 6), and BV (score of 7 to 10).
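
The labeling protocol above amounts to a simple consensus rule. The following minimal Python sketch is illustrative only; the function and category names are ours, not part of the study's software.

```python
def nugent_category(score):
    """Map a 0-10 Nugent score to the three study categories."""
    if score <= 3:
        return "normal"      # normal vaginal flora
    if score <= 6:
        return "altered"     # altered vaginal flora
    return "BV"              # bacterial vaginosis (7-10)

def ground_truth(micro1, micro2, chief):
    """Consensus rule described above; returns None if the sample is removed."""
    if micro1 == micro2:           # the two microbiologists agree
        return micro1
    if chief in (micro1, micro2):  # the chief obstetrician breaks the tie
        return chief
    return None                    # three-way disagreement: discard the sample
```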

Three representative images and their distributions of Nugent classifications from the three different medical institutions mentioned above are shown in Fig. 1. The training set of 23,280 images contained 16,490 images with normal vaginal flora, 4,660 images with altered vaginal flora, and 2,130 images of flora indicative of BV. The validation set of 5,815 samples included 4,120 images scored 0 to 3, 1,164 images scored 4 to 6, and 531 images scored 7 to 10. Set A had 160 images indicating normal vaginal flora, 206 indicating altered vaginal flora, and 61 BV images. Set B had 109 images showing normal vaginal flora, 149 showing altered vaginal flora, and 101 BV images. Set C had 158 images of normal vaginal flora, 123 showing altered vaginal flora, and 15 BV images.

FIG 1

Information on the data set used. (a) Three typical samples of (i) normal vaginal flora, (ii) altered vaginal flora, and (iii) BV collected from Beijing Tsinghua Changgung Hospital. (b) Three typical samples collected from (i) Beijing Tsinghua Changgung Hospital, (ii) the Second Affiliated Hospital of Soochow University, and (iii) the Affiliated Hospital of Inner Mongolia Medical University. (c) Distribution of the data set. (d) Distribution of the three independent test sets: set A from Beijing Tsinghua Changgung Hospital, set B from the Second Affiliated Hospital of Soochow University, and set C from the Affiliated Hospital of Inner Mongolia Medical University.

In short, the training set (23,280 images) and validation set (5,815 images) were collected from the same hospital with the same hardware, and the test set (1,082 images), consisting of set A (427 images), set B (359 images), and set C (296 images), was collected from three clinical settings with three different types of hardware (see Table S1 in the supplemental material). The study protocol was approved by the ethics committees of participating hospitals.

Development of CNN models for BV analysis.

Neural networks consist of many interconnected computational units (neurons) loosely modeled on biological neurons. In the training process, the training data were used to update the connection strengths (connection weights) between neurons. A convolutional neural network is a neural network specialized for processing image information. We developed four CNN models with different network widths (different channel numbers in each layer) to predict Nugent scores from microscope images. The residual module used in ResNet was employed in all models (23). During training, color jittering, scale jittering, and horizontal/vertical flips were used as data augmentation methods. By comparing the AUC (area under the receiver operating characteristic [ROC] curve) values of the four models, the model with the best network width was selected.
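
The augmentation steps named above correspond to standard image transforms. A minimal sketch using torchvision; the jitter ranges and crop scales are assumptions, as the study does not report them.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for the 1,024-by-768 training images;
# the exact parameter values used in the study are not reported.
train_transform = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color jittering
    T.RandomResizedCrop((768, 1024), scale=(0.8, 1.0)),           # scale jittering
    T.RandomHorizontalFlip(p=0.5),                                # horizontal flip
    T.RandomVerticalFlip(p=0.5),                                  # vertical flip
    T.ToTensor(),
])
```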

In clinical practice, the microbiologist or technologist inspects multiple fields of view for each sample under the microscope, and each field of view is given a Nugent score. A final diagnostic result is based on the collective Nugent scores from the various fields, and a representative image of the diagnostic result is selected and saved. Our model was applied to these representative images to produce an automated Nugent score and, hence, an automated diagnosis.

Classic classification convolutional neural networks such as VGG, GoogLeNet, ResNet, and DenseNet (16–19) were unsuitable for processing our microscope images at a resolution of 1,024 by 768 pixels because they are designed for inputs with a resolution of 224 by 224 pixels. The Gardnerella vaginalis organism was only 4 to 9 pixels wide in our images; if we directly used the mature classification CNN models, the image would be compressed to 224 by 224 pixels, the width of the organism would shrink to 1 to 3 pixels, and much information about the organism would be lost. Therefore, we developed a new CNN model, named NugentNet, with more convolutional layers, including two downsample convolutional layers, to adapt to the resolution of our images and extract detailed information about the bacteria. Our model scaled the network depth to match the resolution of our images, similar to the way EfficientNet (the state-of-the-art model on ImageNet [37]) scales network depth and resolution starting from ResNet-18 (Fig. S1).
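
The exact NugentNet configuration is given in Fig. S1; as a rough illustration of the kind of building block described (a ResNet-style residual module with stride-2 downsampling convolutions), consider the following PyTorch sketch, in which all layer sizes are assumptions rather than the published architecture.

```python
import torch.nn as nn

class DownsampleResBlock(nn.Module):
    """Residual block that halves spatial resolution while changing channels.

    Blocks like this allow a deep network to reduce a 1,024-by-768 input
    gradually instead of resizing it to 224 by 224 up front.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 projection so the skip connection matches the new shape
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=2, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))
```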

The model was trained to minimize a cross-entropy loss function given by

$$J(\theta) = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{m} y_{ji}^{\text{label}}\log\left(y_{ji}^{\text{prediction}}\right)$$

where $m$ and $n$ are the number of classes and the batch size, respectively, and $y^{\text{label}}$ is the one-hot encoded vector of the label. $y_j^{\text{prediction}} = f(\theta; x_j)$ is the vector of predicted class probabilities, obtained by applying a softmax function after the last fully connected layer of the CNN model; $x_j$ is the input data, and $\theta$ denotes the trainable parameters. A momentum optimizer (38) was used to train the basic model on the labeled images.
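
In code, the loss and optimizer described above correspond to a standard supervised training step. The following PyTorch sketch uses a trivial stand-in model and dummy data; the learning rate and momentum values are illustrative, as the study does not report them.

```python
import torch
import torch.nn as nn

# Stand-in model and data; in the study these would be NugentNet and the
# labeled training images. All hyperparameters here are assumptions.
model = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 3))
criterion = nn.CrossEntropyLoss()  # cross-entropy over the 3 Nugent categories
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

images = torch.randn(4, 3, 768, 1024)  # a dummy batch of 4 RGB images
labels = torch.randint(0, 3, (4,))     # dummy three-category labels

optimizer.zero_grad()
logits = model(images)                 # CrossEntropyLoss applies log-softmax itself
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```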

The training set had only 23,280 samples, fewer than the basic model needed for training, and the basic model overfitted the data set. Therefore, we further developed three compression models, which need fewer training samples than the basic model, to find the most suitable model for our data set.

The compression models (1/2 NugentNet, 1/4 NugentNet, and 1/8 NugentNet) were derived from the basic model. The input and output channels (the values of Cin and Cout in Fig. S1) for every convolutional layer were reduced, to C values of 32 for 1/2 NugentNet, 16 for 1/4 NugentNet, and 8 for 1/8 NugentNet. The compression models reduce the number of model parameters by reducing the width of the network, a common practice for model compression; fewer channels simplify the neural network and hence reduce the use of computing resources. Therefore, the compression models were faster than the basic model.
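
A short sketch of the width-scaling idea, assuming a base channel count of 64 (consistent with the quoted C values of 32, 16, and 8). Because the parameter count of a convolutional layer scales with the product of its input and output channels, halving the width roughly quarters the parameters, which matches the relationship between 1/4 NugentNet and 1/8 NugentNet noted in the Results.

```python
# Illustrative width scaling; the base width of 64 is an assumption.
WIDTH_FACTORS = {"NugentNet": 1.0, "1/2 NugentNet": 0.5,
                 "1/4 NugentNet": 0.25, "1/8 NugentNet": 0.125}

def scaled_channels(base, factor):
    return max(1, int(base * factor))

for name, f in WIDTH_FACTORS.items():
    c = scaled_channels(64, f)
    # a 3x3 convolution has 9 * C_in * C_out weights, so parameters scale as f**2
    print(f"{name}: C = {c}, relative parameter count ~ {f**2:.4f}")
```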

Image preprocessing for CNN models.

The three independent test sets (A, B, and C) were generated with the different microscope imaging hardware used at the three hospitals mentioned above (Table S2). Three typical samples from the three hospitals are shown in Fig. 1b: set A from Beijing Tsinghua Changgung Hospital (Fig. 1b, image i), set B from the Second Affiliated Hospital of Soochow University (Fig. 1b, image ii), and set C from the Affiliated Hospital of Inner Mongolia Medical University (Fig. 1b, image iii). The pixel distributions of these three types of samples were significantly different because the images were generated in different settings, by different people, and with different hardware. The main differences among the three test sets were the actual physical area represented by an image and the brightness of the image (Table S2). The actual physical area of images in set B was twice that of images in the training set, and images in set B were also brighter than images in the training set. In contrast, the images in set C represented only half the actual physical area and were darker than the training set images.

Preprocessing was used to eliminate sample differences among the test sets in the inference process. The preprocessing included two steps: standardizing the actual physical area of the images and adjusting their brightness (Fig. 2). The preprocessing for test set B is shown in Fig. 2a. First, the center 656 by 492 pixels were cropped and resized to 1,024 by 768 pixels. Second, all pixel values were increased by 32 for the red channel, decreased by 9 for the green channel, and decreased by 21 for the blue channel. As shown in Fig. 2b, the preprocessing of test set C started by resizing to 798 by 598 pixels, followed by edge expansion. Three typical edge expansion methods, replicate, wrap, and reflect, were tried (Fig. 2c); our results showed that the reflect method was the best. All pixel values of images in set C were increased by 109 for the red channel, by 61 for the green channel, and by 32 for the blue channel (Fig. 2b).
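
The set C pipeline described above can be written compactly with NumPy and Pillow. This is a sketch under the stated numbers (resize to 798 by 598 pixels, reflect-pad back to 1,024 by 768 pixels, then shift the R, G, and B channels by +109, +61, and +32); the function name and the clipping behavior are our assumptions.

```python
import numpy as np
from PIL import Image

def preprocess_set_c(img: Image.Image) -> Image.Image:
    small = img.resize((798, 598))                  # match the physical area
    arr = np.asarray(small).astype(np.int16)        # int16 avoids uint8 overflow
    pad_w = (1024 - 798) // 2                       # 113 pixels on each side
    pad_h = (768 - 598) // 2                        # 85 pixels on each side
    arr = np.pad(arr, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)), mode="reflect")
    arr += np.array([109, 61, 32], dtype=np.int16)  # brightness shift per channel
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```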

FIG 2

Preprocessing for test sets B (a) and C (b). (c) Three typical edge expansion methods used for panel b, image ii.

Analysis of diagnostic performance using metric methods.

The performance of diagnostic methods is usually measured by sensitivity and specificity in comparison to a standard method. In this study, for the purpose of screening, we considered normal vaginal flora (scores of 0 to 3) the negative samples and altered vaginal flora (scores of 4 to 6) and BV (scores of 7 to 10) the positive samples.

The performance of our models was illustrated by the AUC (area under the receiver operating characteristic [ROC] curve). The ROC curve is a graphic plot that illustrates the diagnostic capability of a binary classifier system as its discrimination threshold is varied (39). The ROC curve was created by plotting the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold settings (40). The true-positive rate is the sensitivity, and the false-positive rate is equal to 1 − specificity.
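
For reference, the ROC curve and its AUC can be computed with scikit-learn. A minimal sketch with dummy inputs; the real inputs would be the screening labels and the model's predicted positive-class probabilities.

```python
from sklearn.metrics import roc_curve, auc

# Dummy screening labels (1 = altered flora or BV, 0 = normal flora) and
# predicted positive-class probabilities; placeholders for the real data.
y_true  = [0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))  # TPR = sensitivity; FPR = 1 - specificity
```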

To show more performance details of our models and human readers, the confusion matrix was employed to illustrate the prediction results of all three Nugent groups. In the confusion matrix (Fig. 3), each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa) (40). The accuracy of the three categories of Nugent scores (three-classification accuracy) is also provided.
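
A matching sketch for the three-category evaluation; the class encoding (0 = normal, 1 = altered, 2 = BV) is our assumption, and the arrays are placeholders for the real labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 2, 1, 2, 0]  # dummy actual three-category labels
y_pred = [0, 1, 1, 2, 1, 1, 0]  # dummy predicted labels

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)  # in scikit-learn, rows are actual classes and columns are predictions
print("three-classification accuracy:", np.trace(cm) / cm.sum())
```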

FIG 3

The confusion matrix at the best operating point of the 1/4 NugentNet on the validation set.

Comparison to human readers.

Data sets A, B, and C were then evaluated by our CNN model and five independent health care providers (HCPs), including three technologists and two obstetricians. All the human readers, like our model, evaluated the images only; they did not use slide information. The five HCPs were from four representative teaching hospitals and a leading private hospital in China: the First Affiliated Hospital of Xi'an Jiaotong University, the Second Affiliated Hospital of Soochow University, Beijing HarMoniCare Women's and Children's Hospital, Binzhou Medical University Hospital, and Nanjing Medical University Affiliated Nanjing Maternity and Child Health Care Hospital. The diagnostic results were compared using sensitivity, specificity, and accuracy.

RESULTS

Development and selection of the best CNN model for Nugent scoring.

Image data from Beijing Tsinghua Changgung Hospital were used to select the best CNN model for BV diagnosis. Four CNN models were generated, trained on the 23,280-image training set, and then evaluated on the 5,815 validation images. For the purpose of screening, samples showing altered vaginal flora and BV were considered positive and those showing normal vaginal flora were considered negative. All models had AUCs greater than 0.95, and the 1/4 NugentNet model showed the best performance, with an AUC of 0.978 (Fig. S2). NugentNet and 1/2 NugentNet had more training variables than the best model; these two models overfitted our data set and hence gave lower AUCs than 1/4 NugentNet (Fig. S2). The number of training variables of 1/8 NugentNet was only a quarter of that of 1/4 NugentNet, and its AUC was slightly lower (by 0.004) than that of 1/4 NugentNet (Fig. S2); it was therefore likely underfitted on our training data set. The 1/4 NugentNet was chosen for subsequent study as the best CNN model for Nugent scoring and diagnosis of BV.

To better understand the performance of the best model on BV classification, the three-category Nugent score results (confusion matrix) at the best operating point on the ROC curve of the 1/4 NugentNet were plotted against the labeled results of the validation set. The best point obtained 89.3% accuracy on the three-category classification, which was 5.6% higher than the microscopists' performance and comparable to the performance of top experts in China (17). The results showed that only 3.8% (20/531) of actual BV samples yielded false-negative results (normal vaginal flora) and 0.1% (6/4,120) of actual normal vaginal flora samples yielded false-positive results (BV).

Performance comparison between CNN model and human practitioners.

The performance of the 1/4 NugentNet model was comparable to that of top Chinese health care providers in classifying the three categories of Nugent scores from microscopic images. To compare our best model with top-level health care providers, the five human readers described above and 1/4 NugentNet independently classified test set A (427 images from Beijing Tsinghua Changgung Hospital). Our model achieved an AUC of 0.9746, outperforming four of the human readers and matching the best human practitioner (Fig. 4).

FIG 4

Comparison of the results of the best model and five human readers on independent test set A.

The 1/4 NugentNet model had better three-category Nugent score accuracy than the human practitioners. Because the Nugent scores are categorized into three classes, the AUC of the binary screening task does not fully reflect diagnostic accuracy in clinical settings. The three-category accuracy was therefore calculated for 1/4 NugentNet and the 5 human readers. The accuracy of 1/4 NugentNet was 80.3%, which was 10.3% higher than the average result for the human readers and 0.6% higher than the best human result (Table S3). The average performance of the human readers was a sensitivity of 94.3% and a specificity of 73.1%. At the best operating point on the ROC curve, our model had a sensitivity of 91.4% and a specificity of 91.3%. If we moved the operating point along the ROC curve to a sensitivity of 94.3%, matching the human readers' average, our specificity was 89.4%, 16.3% higher than the readers' 73.1% average. Overall, our CNN model achieved high accuracy on the three categories of Nugent scores from the high-quality images obtained at Tsinghua Changgung Hospital.
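
Matching a target sensitivity, as done above with the human readers' 94.3% average, amounts to selecting the first ROC operating point whose true-positive rate reaches the target. A sketch with dummy inputs standing in for the model's real outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 1])            # dummy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.7])  # dummy scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
target = 0.943                        # human readers' average sensitivity
idx = int(np.argmax(tpr >= target))   # first ROC point reaching the target
print("decision threshold:", thresholds[idx])
print("specificity at matched sensitivity:", 1 - fpr[idx])
```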

Performance of CNN model on images from different clinical settings.

Since images from different clinical settings differed in size, brightness, and color (Table S2), the preprocessing procedure was essential to improve the accuracy of our CNN model. With the preprocessing steps, we obtained much better performance on image sets B and C (Table S4). Set B had larger actual physical areas and brighter samples than the training data set obtained from Tsinghua Changgung Hospital. Adjusting the physical area improved the AUC from 0.8552 to 0.9375, and adjusting brightness improved the AUC from 0.8552 to 0.8626; combined, the two adjustments gave our CNN model an overall AUC of 0.9396 on set B (Table S4). The test results for set C showed a similar pattern, in that adjusting both brightness and physical area greatly improved performance, increasing the AUC from 0.5137 to 0.9450 (Table S4). Since images in set C had a smaller physical area than the training images, we had to expand the edges of the images to match the actual physical area expected by our model. Of the three common methods tried, the reflect method was the best edge expansion method, increasing the AUC from 0.5137 to 0.7136 (Table S5). Adjusting the physical area and adjusting the brightness both greatly improved model performance on image sets B and C, which came from different clinical settings.

Comparison between the CNN model and humans on images from three clinical settings.

Our model was comparable to human practitioners for three-category Nugent score diagnosis using images from various clinical settings. Data sets A, B, and C were combined into an independent test data set of 1,082 images. When the 1/4 NugentNet model was applied to this set, it achieved an AUC of 0.7007 on the original images, 0.7884 after adjustment of physical area, 0.8917 after adjustment of brightness, and 0.9400 after both types of preprocessing (Fig. 5). Comparison of these results with the results from the 5 experts who evaluated the Nugent scores on the same image set showed that our model outperformed three of them, as judged by sensitivity and specificity (Fig. 5). The average performance of the technologists was 96.5% sensitivity and 62.2% specificity. When the sensitivity was set to be equal, our model had a specificity of 71%, 8.8% higher than the averaged human results; when the specificity was set to be equal, the sensitivity of our model was 2.1% higher than the human average. When the accuracy of the three categories of Nugent scores was used as a better readout of Nugent scoring performance, our 1/4 NugentNet model achieved a total accuracy of 75.1%, again better than that of the three technologists and 2.1% higher than the human average (Table 1). The 1,082-image set was evaluated by three technologists from different hospitals, and the precision for the three categories of Nugent scores was 54.0% (Table 2). One hundred three samples were evaluated by two technologists on different days; the precision levels of the two individuals were 85.4% and 95.1%, respectively, and the average precision for the three categories of Nugent scores was 90.3% (Table S6). These results showed that our model performed at a high level when preprocessing was applied to images from different settings and that it outperformed human health care practitioners in accuracy and stability.

FIG 5

Comparison of the results of the best model and five human readers on the total independent test sets with 1,082 samples.

TABLE 1.

Results of the best model and five human readers for the total independent test set with 1,082 samples^a

Method or reader        No. of samples correctly classified, by Nugent score    Sensitivity (%)   Specificity (%)   Three-category accuracy (%)
                        0–3 (negative)   4–10 (positive)   4–6     7–10
Gold standard           427              655               478     177          NA                NA                NA
Best point (AI)         363              583               356     94           89.0              85.0              75.1
High sensitivity (AI)   296              637               379     94           97.3              69.3              71.1
Technologist 1          256              635               338     140          97.0              60.0              67.8
Technologist 2          260              624               335     135          95.3              60.9              67.5
Technologist 3          280              637               342     139          97.3              65.6              70.3
Obstetrician 1          401              618               375     99           94.4              93.9              80.9
Obstetrician 2          396              592               359     96           90.4              92.7              78.7
Avg (technologists)     265.3            632               338.3   138          96.5              62.2              68.5
Avg (human readers)     318.6            621.2             349.8   121.8        94.9              74.6              73.0

^a "Best point" refers to maximizing the sum of sensitivity and specificity; "high sensitivity" refers to setting sensitivity equal to the best result of the human readers; AI, our best model; NA, 100% by definition (the gold standard is the reference standard).

TABLE 2.

Precision of three technologists^a

Technologist        No. of samples assigned, by Nugent score    Total no.
                    0–3      4–6      7–10
1                   276      487      319       1,082
2                   291      527      264       1,082
3                   298      501      283       1,082
Precision result    159      285      140       584 (54.0%)

^a Each specimen was run in triplicate (read by all three technologists); the precision result row gives the number of samples with concordant classifications.

DISCUSSION

In recent years, CNN methods have shown many successful applications in medical image processing. Typical examples include automatic classification of 26 common skin conditions (41); using pixels and disease labels as inputs to classify skin lesions (42); identifying prostate cancer in biopsy specimens and detecting breast cancer metastasis in sentinel lymph nodes (43); automated classification of Gram stains of blood samples (33); automatic diagnosis of Helicobacter pylori infection (32); and predicting cardiovascular risk factors from retinal fundus images (31). CNN methods are therefore well suited to medical image processing. As a type of medical image, the microscopic image contains a great deal of local detail and global information: local details include the various types of bacteria, while global information includes the distribution density and the ratios of the various bacteria. In the diagnostic process, all this information should be considered for an accurate result. The local details can be extracted by the first few layers of a CNN model, while the global information can be extracted by the last few layers; the last fully connected layer can then use all the information extracted by the convolutional layers before it to produce the three-category prediction (27, 28, 44). The CNN model was therefore very suitable for automatic diagnosis with the three categories of Nugent scores. A summary of the workflow for training, validation, and testing of our CNN model is provided in Fig. 6.

FIG 6

Summary of CNN training, validation, and testing. For CNN training, 23,280 samples were collected. The 5,815 validation samples were randomly selected and used to validate models. The test results of the best model and human readers on the single test set (427 samples) from Tsinghua Changgung Hospital are shown in Fig. 4 and Table S3. Finally, a total test set (1,082 samples) collected from three hospitals was used to evaluate the best model's generalization ability; the test results of the best model and human readers on the total test set are shown in Fig. 5 and Table 1.

The 1/4 NugentNet CNN model processed microscopic images for three-category Nugent score BV diagnosis better than the traditional image processing method. Traditional automatic image processing methods required three difficult steps to reach a diagnosis (17). The first step involved segmentation of the bacterial area, which required a series of hand-designed algorithms to extract the foreground. In the second step, overlapping clumps in the bacterial area were split so that the morphotypes of individual bacteria could be obtained. In the third step, the features of the bacterial morphotypes were extracted and classified using traditional machine learning methods. In our CNN model, the features of the microscope images are extracted automatically and the diagnosis is produced directly. The sensitivity, specificity, and three-category accuracy of the traditional method were 58.3%, 87.1%, and 79.1%, respectively (17). Our 1/4 NugentNet model performed better on all three diagnostic indicators: a 24.1% increase in sensitivity, a 9.5% increase in specificity, and a 10.2% increase in three-category accuracy. Furthermore, our model can trade sensitivity against specificity by adjusting the prediction probability threshold. The diagnostic performance of the traditional automatic methods cannot be improved further, whereas our model's performance can be improved further with more training data.

The three-category Nugent score diagnosis of the 1/4 NugentNet CNN is comparable to that of human experts, with greater consistency and flexibility. The test data set of 1,082 images was labeled by top experts, with each result agreed on by at least two of them. When the set was evaluated independently by our CNN model and 5 human experts, 1/4 NugentNet achieved a three-category accuracy of 75.1%, 2.1% higher than the experts' average. Although the accuracies seem comparable, the human readers showed considerable variability in sensitivity, from 90.4% to 97.3%, and a large variation in specificity, from 60.0% to 93.9% (Table 1). At the most accurate point of our CNN model, the specificity and sensitivity were 85.0% and 89.0%, respectively. Applying the CNN model in clinical settings could minimize the diagnostic variation seen among practitioners. By moving along the ROC curve, we can also select an operating point with a specific sensitivity for a given clinical purpose: for screening, sensitivity can be increased to identify as many positive patients as possible, and for confirmatory testing, specificity can be increased to minimize false diagnoses. The CNN model could thus be deployed with different settings for different clinical roles.

Our model performed extremely well and outperformed the technologists when the samples were standardized by preprocessing. Standardizing the actual physical area and the brightness of the samples allowed our model to perform very well on variable samples from different settings. We further studied the impact of image clarity and training set size on the accuracy of our models. The results showed that an image-sharpening method did not improve the model's performance on the independent test sets, and performance decreased when the samples were blurred; the clarity of the samples was thus already sufficient for our models. To investigate the performance of our best model with different training sample sizes, we trained it with five sample sizes: 5,000, 10,000, 15,000, 20,000, and 23,000 images. All the AUCs were greater than 0.95 (Fig. 7). As expected, a larger training set produced a model with better performance, but once the training set exceeded 15,000 samples, performance improved only marginally as the number of samples increased.

FIG 7

The performance of the best model with different training sample sizes: 5,000 images, 10,000 images, 15,000 images, 20,000 images, and 23,000 images.

The CNN model can increase the speed of BV diagnosis in clinical settings. We used an NVIDIA GeForce GTX 1080Ti graphics processing unit (GPU) for training and inference. The best model was faster than NugentNet and 1/2 NugentNet in both training and inference. In training, the 1/4 NugentNet model converged within 10,000 iterations, completing in 2.4 h on one GPU; in contrast, it takes months to years to train a proficient inspector. In inference, our model diagnosed 100 images in 2.4 s, while the traditional automatic diagnostic method needed 30 s to obtain the diagnosis result for a single microscope image (17). On the same hardware, our model could diagnose 5 microscopic images per second, an inference speed more than 150 times that of the traditional automatic diagnostic method.

In conclusion, our study provides the first description of a deep learning technique to diagnose bacterial vaginosis from microscopic images. We constructed convolutional neural network models for automatic BV diagnosis. For image-level BV diagnosis, our models achieved better three-category Nugent score accuracy (89.3%) than experts and traditional automated diagnostic methods. A limitation of this research is that our model was developed only for image-level diagnosis; with the help of an automatic scanning microscope, however, our model could replace manual Nugent scoring entirely. In addition, many gynecological lower genital tract infections, including aerobic vaginitis (AV), vulvovaginal candidiasis (VVC), and trichomonas vaginitis (TV), are diagnosed from the same type of microscope images, and our model could be further modified to diagnose all three. It could be developed into an automatic diagnostic device that is more precise, more efficient, and more stable than manual diagnosis and would standardize the diagnostic process.

Data sharing statement.

The data that support the findings of this study are available from the corresponding author upon request.

Supplementary Material

Supplemental file 1
JCM.02454-20-s0001.xls (206.5KB, xls)
Supplemental file 2
JCM.02454-20-s0002.xls (42.5KB, xls)
Supplemental file 3
JCM.02454-20-s0003.pdf (846.5KB, pdf)

ACKNOWLEDGMENT

Thanks go to Suzhou Turing Microbial Technologies Co., Ltd., and Beijing Turing Microbial Technologies Co., Ltd., for technical support.

This work was supported by the National Natural Science Foundation of China (NSFC) (grants no. 81671409 and 61532001), Beijing Municipal Administration of Hospitals Clinical Medicine Development of Special Funding (grant no. XMLX201605), the Tsinghua Initiative Research Program (grant no. 20151080475), and Ant Financial and Nanjing Turing AI Institute.

Zhongxiao Wang, L. Zhang, Z. Liu, Q. Liao, and W. Xu proposed the research; Z. Liu, R. An, P. Li, L. Geng, Q. Qiao, W. Zhu, X. Zeng, and Q. Liao led the multicenter study; Yufeng Wang, Zitao Wang, and L. Qi collected data; Ying Wang, H. Bai, and M. Zhao performed the data annotation; Zhongxiao Wang, Z. Yang, Y. Cao, M. Li, and W. Wu wrote the deep learning code and performed the experiments; J. Li, N. Li, C. Rui, C. Fan, X. Liu, Y. Si, L. Qi, and A. Feng evaluated the algorithm; Zhongxiao Wang, L. Zhang, and Q. Zhang wrote the manuscript; and M. Wang, W. Mo, Q. Liao, and W. Xu reviewed the manuscript.

W. Mo, Y. Cao, W. Wu, and M. Li were employed by Suzhou Turing Microbial Technologies Co., Ltd., and Beijing Turing Microbial Technologies Co., Ltd. They may use the AI model in this paper to develop new products. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

Supplemental material is available online only.

REFERENCES

1. Wang J. 2000. Bacterial vaginosis. Prim Care Update Ob Gyns 7:181–185. doi: 10.1016/s1068-607x(00)00043-3.
2. Leitich H, Bodner-Adler B, Brunbauer M, Kaider A, Egarter C, Husslein P. 2003. Bacterial vaginosis as a risk factor for preterm delivery: a meta-analysis. Am J Obstet Gynecol 189:139–147. doi: 10.1067/mob.2003.339.
3. Hillier SL, Krohn MA, Cassen E, Easterling TR, Rabe LK, Eschenbach DA. 1995. The role of bacterial vaginosis and vaginal bacteria in amniotic fluid infection in women in preterm labor with intact fetal membranes. Clin Infect Dis 20:S276–S278. doi: 10.1093/clinids/20.Supplement_2.S276.
4. Peipert JF, Ness RB, Blume J, Soper DE, Holley R, Randall H, Sweet RL, Sondheimer SJ, Hendrix SL, Amortegui A, Trucco G, Bass DC, Pelvic Inflammatory Disease Evaluation and Clinical Health Study Investigators. 2001. Clinical predictors of endometritis in women with symptoms and signs of pelvic inflammatory disease. Am J Obstet Gynecol 184:856–863. doi: 10.1067/mob.2001.113847.
5. Hillier SL, Kiviat NB, Hawes SE, Hasselquist MB, Hanssen PW, Eschenbach DA, Holmes KK. 1996. Role of bacterial vaginosis-associated microorganisms in endometritis. Am J Obstet Gynecol 175:435–441. doi: 10.1016/s0002-9378(96)70158-8.
6. Martin HL, Richardson BA, Nyange PM, Lavreys L, Hillier SL, Chohan B, Mandaliya K, Ndinya-Achola JO, Bwayo J, Kreiss J. 1999. Vaginal lactobacilli, microbial flora, and risk of human immunodeficiency virus type 1 and sexually transmitted disease acquisition. J Infect Dis 180:1863–1868. doi: 10.1086/315127.
7. Moodley P, Connolly C, Sturm AW. 2002. Interrelationships among human immunodeficiency virus type 1 infection, bacterial vaginosis, trichomoniasis, and the presence of yeasts. J Infect Dis 185:69–73. doi: 10.1086/338027.
8. Klebanoff MA, Hillier SL, Nugent RP, MacPherson CA, Hauth JC, Carey JC, Harper M, Wapner RJ, Trout W, Moawad A, Leveno KJ, Miodovnik M, Sibai BM, Vandorsten JP, Dombrowski MP, O'Sullivan MJ, Varner M, Langer O, National Institute of Child Health and Human Development Maternal-Fetal Medicine Units Network. 2005. Is bacterial vaginosis a stronger risk factor for preterm birth when it is diagnosed earlier in gestation? Am J Obstet Gynecol 192:470–477. doi: 10.1016/j.ajog.2004.07.017.
9. Koumans EH, Sternberg M, Bruce C, McQuillan G, Kendrick J, Sutton M, Markowitz LE. 2007. The prevalence of bacterial vaginosis in the United States, 2001–2004; associations with symptoms, sexual behaviors, and reproductive health. Sex Transm Dis 34:864–869. doi: 10.1097/OLQ.0b013e318074e565.
10. Liao QP, Zhang D. 2011. Diagnosis, treatment and research status of female reproductive tract infection in China. J Int Obstet Gynecol 38:469–471. http://www.gjfckx.ac.cn/CN/Y2011/V38/I6/469.
11. Sherrard J, Wilson J, Donders G, Mendling W, Jensen JS. 2018. 2018 European (IUSTI/WHO) International Union against Sexually Transmitted Infections (IUSTI) World Health Organisation (WHO) guideline on the management of vaginal discharge. Int J STD AIDS 29:1258–1272. doi: 10.1177/0956462418785451.
12. Gallo MF, Jamieson DJ, Cu-Uvin S, Rompalo A, Klein RS, Sobel JD. 2011. Accuracy of clinical diagnosis of bacterial vaginosis by human immunodeficiency virus infection status. Sex Transm Dis 38:270–274.
13. Singh RH, Zenilman JM, Brown KM, Madden T, Gaydos C, Ghanem KG. 2013. The role of physical examination in diagnosing common causes of vaginitis: a prospective study. Sex Transm Infect 89:185–190. doi: 10.1136/sextrans-2012-050550.
14. Paavonen J, Brunham RC. 2018. Bacterial vaginosis and desquamative inflammatory vaginitis. N Engl J Med 379:2246–2254. doi: 10.1056/NEJMra1808418.
15. Gaydos CA, Begaj S, Schwebke JR, Lebed J, Smith B, Davis TE, Fife KH, Nyirjesy P, Spurrell T, Furgerson D, Coleman J, Paradis S, Cooper CK. 2017. Clinical validation of a test for the diagnosis of vaginitis. Obstet Gynecol 130:181–189. doi: 10.1097/AOG.0000000000002090.
16. Nugent RP, Krohn MA, Hillier SL. 1991. Reliability of diagnosing bacterial vaginosis is improved by a standardized method of Gram stain interpretation. J Clin Microbiol 29:297–301. doi: 10.1128/JCM.29.2.297-301.1991.
17. Song Y, He L, Zhou F, Chen S, Ni D, Lei B, Wang T. 2017. Segmentation, splitting, and classification of overlapping bacteria in microscope images for automatic bacterial vaginosis diagnosis. IEEE J Biomed Health Inform 21:1095–1104. doi: 10.1109/JBHI.2016.2594239.
18. Coleman JS, Gaydos CA. 2018. Molecular diagnosis of bacterial vaginosis: an update. J Clin Microbiol 56:e00342-18. doi: 10.1128/JCM.00342-18.
19. Dimopoulos S, Mayer CE, Rudolf F, Stelling J. 2014. Accurate cell segmentation in microscopy images using membrane patterns. Bioinformatics 30:2644–2651. doi: 10.1093/bioinformatics/btu302.
20. LeCun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324. doi: 10.1109/5.726791.
21. Krizhevsky A, Sutskever I, Hinton G. 2012. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1106–1114.
22. Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556. https://arxiv.org/abs/1409.1556.
23. He K, Zhang X, Ren S, Sun J. 2016. Deep residual learning for image recognition, p 770–778. In IEEE Conference on Computer Vision and Pattern Recognition.
24. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. 2015. Going deeper with convolutions, p 1–9. In IEEE Conference on Computer Vision and Pattern Recognition.
25. Chollet F. 2017. Xception: deep learning with depthwise separable convolutions, p 1800–1807. In IEEE Conference on Computer Vision and Pattern Recognition.
26. Long J, Shelhamer E, Darrell T. 2017. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39:640–651. doi: 10.1109/TPAMI.2016.2572683.
27. Zhao H, Shi J, Qi X, Wang Q, Jia J. 2017. Pyramid scene parsing network, p 6230–6239. In IEEE Conference on Computer Vision and Pattern Recognition.
28. Ronneberger O, Fischer P, Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation, p 234–241. In Navab N, Hornegger J, Wells W, Frangi A (ed), Medical image computing and computer-assisted intervention—MICCAI 2015. Lecture notes in computer science, vol 9351. Springer, Cham, Switzerland. doi: 10.1007/978-3-319-24574-4_28.
29. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, Kim R, Raman R, Nelson PC, Mega JL, Webster DR. 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316:2402–2410. doi: 10.1001/jama.2016.17216.
30. Maji D, Santara A, Mitra P, Sheet D. 2016. Ensemble of deep convolutional neural networks for learning to detect retinal vessels in fundus images. arXiv 1603.04833. https://arxiv.org/abs/1603.04833.
31. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. 2018. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng 2:158–164. doi: 10.1038/s41551-018-0195-0.
32. Shichijo S, Nomura S, Aoyama K, Nishikawa Y, Miura M, Shinagawa T, Takiyama H, Tanimoto T, Ishihara S, Matsuo K, Tada T. 2017. Application of convolutional neural networks in the diagnosis of Helicobacter pylori infection based on endoscopic images. EBioMedicine 25:106–111. doi: 10.1016/j.ebiom.2017.10.014.
33. Smith KP, Kang AD, Kirby JE. 2017. Automated interpretation of blood culture Gram stains by use of a deep convolutional neural network. J Clin Microbiol 56:e01521-17. doi: 10.1128/JCM.01521-17.
34. Racsa LD, Gander RM, Southern PM, TeKippe EM, Doern C, Luu HS. 2015. Detection of intracellular parasites by use of the CellaVision DM96 analyzer during routine screening of peripheral blood smears. J Clin Microbiol 53:167–171. doi: 10.1128/JCM.01783-14.
35. Mathison BA, Kohan JL, Walker JF, Smith RB, Ardon O, Couturier MR. 2020. Detection of intestinal protozoa in trichrome-stained stool specimens by use of a deep convolutional neural network. J Clin Microbiol 58:e02053-19. doi: 10.1128/JCM.02053-19.
36. Quiblier C, Jetter M, Rominski M, Mouttet F, Bottger EC, Keller PM, Hombach M. 2016. Performance of Copan WASP for routine urine microbiology. J Clin Microbiol 54:585–592. doi: 10.1128/JCM.02577-15.
37. Tan M, Le QV. 2019. EfficientNet: rethinking model scaling for convolutional neural networks. arXiv 1905.11946. https://arxiv.org/abs/1905.11946.
38. Sutskever I, Martens J, Dahl G, Hinton G. 2013. On the importance of initialization and momentum in deep learning, p 1139–1147. In Proceedings of the 30th International Conference on Machine Learning.
39. MathWorks. 2020. Detector performance analysis using ROC curves. https://www.mathworks.com/help/phased/ug/detector-performance-analysis-using-roc-curves.html.
40. Powers DMW. 2008. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Mach Learn Technol 2:37–63.
41. Liu Y, Jain A, Eng C, Way DH, Lee K, Bui P, Kanada K, Oliveira Marinho G, Gallegos J, Gabriele S, Gupta V, Singh N, Natarajan V, Hofmann-Wellenhof R, Corrado GS, Peng LH, Webster DR, Ai D, Huang SJ, Liu Y, Dunn RC, Coz D. 2020. A deep learning system for differential diagnosis of skin diseases. Nat Med 26:900–908. doi: 10.1038/s41591-020-0842-3.
42. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542:115–118. doi: 10.1038/nature21056.
43. Litjens G, Sanchez C, Timofeeva N, Hermsen M, Nagtegaal I, Kovacs I, Kaa CH, Bult P, Ginneken B, Laak J. 2016. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci Rep 6:26286. doi: 10.1038/srep26286.
44. Goodfellow I, Bengio Y, Courville A. 2016. Deep learning, p 316–356. MIT Press, Cambridge, MA.
