The Journal of the Acoustical Society of America 150(6), 4118–4127 (2021). doi: 10.1121/10.0007272

Investigating training-test data splitting strategies for automated segmentation and scoring of COVID-19 lung ultrasound images a)

Roshan Roshankhah 1,b), Yasamin Karbalaeisadegh 2, Hastings Greer 3, Federico Mento 4,c), Gino Soldati 5, Andrea Smargiassi 6, Riccardo Inchingolo 6, Elena Torri 7, Tiziano Perrone 8, Stephen Aylward 3, Libertario Demi 4,d), Marie Muller 1,e)

Abstract

Ultrasound in point-of-care lung assessment is becoming increasingly relevant. This is further reinforced in the context of the COVID-19 pandemic, where rapid decisions on the lung state must be made for staging and monitoring purposes. The lung structural changes caused by severe COVID-19 modify the way ultrasound propagates in the parenchyma. This is reflected by changes in the appearance of the lung ultrasound images. In abnormal lungs, vertical artifacts known as B-lines appear and can evolve into white lung patterns in the more severe cases. Currently, these artifacts are assessed by trained physicians, and the diagnosis is qualitative and operator dependent. In this article, an automatic segmentation method using a convolutional neural network is proposed to automatically stage the progression of the disease. A total of 1863 B-mode images from 203 videos, obtained from 14 asymptomatic individuals, 14 confirmed COVID-19 cases, and 4 suspected COVID-19 cases, were used. Signs of lung damage, such as the presence and extent of B-lines and white lung areas, are manually segmented and scored from zero to three (most severe). These manually scored images are considered as ground truth. Different training-test splitting strategies are evaluated in this study. The results shed light on efficient approaches and on common challenges associated with automatic segmentation methods.

I. INTRODUCTION

The outbreak of the coronavirus disease 2019 (COVID-19) has spread across the world at a high rate, with over 25 × 10⁶ cases of infection in the U.S. as of January 28, 2021.1 Although mild cases manifest a variety of symptoms, including fever, dry cough, fatigue, and loss of taste, the disease can rapidly progress to a more severe stage with life-threatening respiratory symptoms and pulmonary abnormalities.2–4 Similar to the Middle East Respiratory Syndrome, COVID-19 can cause acute respiratory distress, resembling acute respiratory distress syndrome (ARDS).5 The more severe cases may require oxygen therapy because of serious alveolar damage in the lungs and are associated with high mortality rates.3 Furthermore, postinfection or during treatment, it can be necessary to monitor the recovery or worsening of the lung condition. Therefore, lung imaging tools can be of critical importance to detect COVID-related pneumonia and assist in treatment decision-making. Using imaging to assist in and improve on the existing assessment techniques can be particularly important when taxing workloads are imposed on health care workers.

X-ray computed tomography (CT) is an imaging modality commonly used to evaluate the damage to the respiratory system in COVID-19 patients.6 Based on initial analyses, bilateral and peripheral ground-glass opacity on chest CTs and the emergence of areas of consolidation in the lungs are typical signs among COVID-19 patients.7 Frequent monitoring of the lungs would assist in detecting imaging patterns, associating these patterns with the pathophysiology of the infection, and predicting the course of the disease and possible future complications. However, there are significant difficulties associated with using CT to assess the lung during the COVID-19 pandemic. These difficulties include cost, lack of availability, lack of portability, and complex disinfection processes. Additionally, the radiation exposure associated with CT scanning makes it less suitable for frequent monitoring.

Ultrasound is an attractive alternative to CT, especially in low-resource conditions and for more vulnerable populations, such as pregnant women and children. In addition to involving only nonionizing radiation, ultrasound is portable, less costly, and easier to disinfect than CT, which makes it an ideal bedside monitoring modality for the critically ill. Ultrasound has been used as a tool to assess the lung since the 1990s.8,9 Several studies in recent years have addressed the characterization of lung tissue using ultrasound imaging as well as quantitative and semiquantitative (i.e., based on numerical scores derived from qualitative artifact analysis) ultrasound methods.10–18 Conventional ultrasound imaging of the lung is extremely challenging because the healthy, air-filled alveoli act as strong scatterers for ultrasound waves; however, the structure-dependent artifactual images can be exploited to evaluate the lung.14 In a healthy lung, these artifacts appear as reverberations between the pleural line and the ultrasound probe in the form of several horizontal lines known as A-lines.14 Alveolar damage and a subsequent increase in the subpleural tissue density are associated with the disappearance of A-lines and the appearance of vertical artifacts. These vertical artifacts, known as B-lines and white lung, are attributed to a variety of respiratory conditions, including pneumonia, edema, lung contusion, and extravascular lung water.10,19,20 Their presence and appearance are a qualitative indication of an abnormal lung condition. As mentioned above, one of the complications induced by the COVID-19 infection is similar to ARDS and is associated with increased lung density, which is visible on ultrasound, to the point that the B-mode images can exhibit large bright areas (known as white lung).12 Another phenomenon observed in pneumonia, advanced ARDS, and bronchiolitis cases is the consolidation of certain regions in the lung resulting from collapse and severe alveolar damage.12,21,22

Consequently, there is interest in developing techniques capable of automatically detecting and localizing these artifactual patterns.12,14,23–28 Although the detection of B-lines in lung ultrasound (LUS) images can be used to identify pathological lung conditions, the approach has clinical limitations, such as operator dependence and the subjectivity of image interpretation. Several automatic B-line detection methods have been proposed. Brattain et al.23 first introduced a technique for automatic B-line detection using angular features and thresholding. Moshavegh et al.24 proposed a B-line detection algorithm using alternate sequential filtering. Anantrasirichai et al.25 proposed an inverse-problem-based method for automatic line detection in ultrasound images. In the context of COVID-19, a nonconvex regularization was applied to LUS data from COVID-19 patients,26 showing better detection accuracy than the earlier inverse-problem-based method. Van Sloun and Demi used a weakly supervised deep learning method to detect B-lines in lung-mimicking phantoms and patients.28 Roy et al.,29 using the scoring system for lung damage severity proposed by Soldati et al.,14 developed a deep learning method capable of automatically assessing the severity of the lung condition based on the labels provided for LUS frames and videos. That study demonstrated the feasibility of using deep learning models to automatically score lung disease severity.

In the present study, we implement an automatic segmentation approach using a customized convolutional neural network (CNN) to automatically segment the B-lines and differentiate between the artifactual features, including the A-lines, B-lines, white lung pattern, and consolidation areas, using lung B-mode images from 14 healthy and symptomless individuals, 14 COVID-19 infected individuals [confirmed by reverse transcription–polymerase chain reaction (RT-PCR) tests], and 4 symptomatic suspected COVID-19 individuals. Automatic segmentation has the potential to significantly facilitate the diagnosis and monitoring of COVID-19 induced pulmonary abnormalities. Moreover, automated segmentation directly enables semiquantitative assessment and longitudinal comparison while minimizing inter- and intra-operator variability. In this study, we aim to demonstrate the impact of training-test data splitting strategies.

II. METHODS

A. Ultrasound data acquisition

The LUS data are acquired in multiple clinical facilities: BresciaMed, Brescia, Italy; Valle del Serchio General Hospital, Lucca, Italy; Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy; Fondazione Policlinico Universitario San Matteo IRCCS, Pavia, Italy; and Tione General Hospital, Tione (TN), Italy. The data are acquired using linear as well as convex transducers and several different scanners: Mindray DC-70 Exp® (Shenzhen, China), Esaote MyLabAlpha® (Genoa, Italy), Toshiba Aplio XV® (Tokyo, Japan), and Wi-Fi Ultrasound Probes, ATL (Advanced Technology Laboratories, Inc., Bothell, WA). Each frame of a LUS video is manually segmented and assigned a numeric score (0–3, as defined next) to categorize the case severity. To minimize individual rater subjectivity, the labeling process is arranged in four different levels: (i) four master's students with an ultrasound background assign scores to each image; (ii) the assigned scores are then validated by a Ph.D. student with expertise in the field of LUS; (iii) revalidation of the assigned scores is performed by a biomedical engineer with 10 yr of experience in LUS; and (iv) the final validation is performed by clinicians with 10 yr of experience in LUS.31 The imaging protocol and scoring method described above have been clinically validated, and their prognostic value has been demonstrated.30,31

The complete dataset consists of 277 LUS videos from 35 patients, corresponding to 58 924 labeled frames. However, a smaller subset of frames that are both labeled and segmented is used in the present study. This subset covers 14 healthy and symptomless individuals, 14 COVID-19 infected cases (confirmed by RT-PCR tests), and 4 suspected COVID-19 positive cases, i.e., 32 of the 35 patients. It contains 203 videos and 1863 frames with a fixed gray scale intensity range between 0 and 255.

B. Scoring system

Lung ultrasonography takes advantage of artifacts that are caused by changes in the tissue structures. The present study uses four different scores to categorize lung parenchyma, as proposed by Soldati et al.14

  • Score 0: In a healthy lung, the shape, size, and geometrical disposition of the alveoli are responsible for the bright continuous pleural line and reverberated horizontal A-lines in the B-mode ultrasound images. The A-lines are attributed to multiple reflections between an ultrasound probe and the lung surface.14

  • Score 1: The pleural line is indented, and vertical white artifacts (B-lines) are visible. These B-lines are attributed to local alterations in the lung tissue and gradual replacement of the lung volume previously occupied by air with media acoustically similar to the intercostal tissues.14

  • Score 2: Discontinuity of the pleural line is observed, accompanied by consolidation areas (darker areas) on top of white areas (white lung). The consolidation areas form as a result of alveolar damage and local tissue alterations; however, at this stage, the lung is not fully deaerated and demonstrates some highly scattering zones, encompassing the white vertical lines.14

  • Score 3: These cases depict large and dense areas of white lung, resulting from extensively damaged alveoli. This score corresponds to the most severe lung damage.14

Because an increase in score is associated with exacerbation of the lung condition, the overall score for images with two or more different labels is defined as the maximum of those label values. Figure 1 illustrates typical score 1, 2, and 3 images and their corresponding manually segmented regions. It is worth noting that for score 0, although available in the dataset, no segmentation is used in this study for training and testing the neural network model. Avoiding extra segmentation masks for score 0 results in faster training.
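As an illustration, the maximum-label rule can be written compactly. The following is a minimal sketch, assuming a per-pixel integer label mask; the function name and array layout are our own, not the authors' code.

```python
import numpy as np

def image_score(mask: np.ndarray) -> int:
    # Image-level severity score: the maximum label present in the
    # manually segmented mask (0 = healthy pattern, 3 = most severe).
    return int(mask.max())

# Example: a frame containing both a score 1 and a score 2 region
# receives the overall score 2.
mask = np.zeros((128, 128), dtype=np.int64)
mask[40:60, 30:40] = 1   # small B-line region
mask[80:110, 50:90] = 2  # white lung region
assert image_score(mask) == 2
```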

FIG. 1.

(Color online) The manually segmented LUS B-mode images for the (a) score 1, (b) score 2, and (c) score 3 cases.

C. Neural network overview

To automatically score the state of lung deterioration as detected by B-mode ultrasound imaging, we seek to emulate the existing clinical process. We detect and segment lung features in three classes of severity (scores 1, 2, and 3) in addition to the class score 0. The image is then scored based on the worst feature present. This allows our network to train on a modest quantity of COVID-19 ultrasound images because the U-Net architecture used for our neural network is much less data hungry than networks with fully connected heads, such as ResNet or VGG19. By framing our scoring problem as a segmentation problem, we are able to apply an existing state-of-the-art technique to a new domain. Specifically, we produced our segmentations using the U-Net architecture with hyperparameters closely following a previous work32 on segmenting bone in spinal ultrasound.

Compared to early U-Net papers, such as the work by Ronneberger et al.33 in 2015, our U-Net has a larger number of down-sampling and up-sampling stages with only one convolution layer per stage. To enable data flow between the pixels in the up-sampling half of the network with only one convolution per up-sampling, we use 4 × 4 convolutions following our up-sampling layers. We first saw this style of U-Net in a previous work concerning the manipulation of conventional photographs.34 We believe it should be more widely used in medical imaging, as opposed to the three-convolutions-per-up-sampling style seen in Ronneberger et al.,33 which is still prevalent in medically oriented papers. In our experience, the style of network presented here yields more consistent training as a result of better communication across the image enabled by the deep up-sampling and the prevention of vanishing gradients by more frequent skip connections. The details on the number and type of layers, features per layer, and positioning of the skip connections can be found in Fig. 2.
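The following sketch illustrates this style of U-Net in Keras: a single convolution per down-sampling stage and a 4 × 4 convolution after each up-sampling layer, with a skip connection at every level. The depth and filter counts here are illustrative placeholders, not the exact configuration of Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(128, 128, 1), n_classes=4, depth=4, base_filters=16):
    inputs = tf.keras.Input(shape=input_shape)
    x, skips = inputs, []
    # Down-sampling half: a single 3 x 3 convolution per stage.
    for d in range(depth):
        x = layers.Conv2D(base_filters * 2**d, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(base_filters * 2**depth, 3, padding="same", activation="relu")(x)
    # Up-sampling half: 4 x 4 convolutions after each up-sampling layer
    # enable data flow between pixels with only one convolution per stage;
    # a skip connection at every level helps prevent vanishing gradients.
    for d in reversed(range(depth)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[d]])
        x = layers.Conv2D(base_filters * 2**d, 4, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```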

FIG. 2.

(Color online) The schematic representation of the U-Net neural network (Ref. 34).

D. Training process and data preparation

We obtained 1863 segmented LUS images, which had been previously labeled based on the proposed scoring system14 described in Sec. II B. The score data from the images are one-hot encoded into four labels. The data are split into ultrasonic images as the neural network input on one hand and labeling masks indicating regions within the images as the neural network outputs on the other hand. As mentioned above, for all score 0 cases, no segmentation is used, and the corresponding masks are blank. The input data are normalized to the range [0,1] and resized to 160 × 160 pixels. Then, random 128 × 128 crops are generated to increase the population size and improve the performance of the neural network model. We train this network using an Adam optimizer with a learning rate of 0.001 and a categorical cross-entropy loss. Starting from the entire set of frames, we first applied a randomly assigned, simple 90%–10% train-set test-set split. The training and evaluation process ran on a single GeForce GTX 1650 (Nvidia, Santa Clara, CA). The assessment of the test data after training results in 2775 images of 128 × 128 pixels, which are used for the statistical analysis.
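A minimal sketch of this preparation and training setup is given below, reusing the build_unet function sketched in Sec. II C. The placeholder arrays stand in for the segmented LUS dataset, and the batch size and epoch count are assumptions, as they are not reported in the text.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the labeled frames and masks, already
# normalized to [0, 1] and resized to 160 x 160 (random values for illustration).
frames = np.random.rand(32, 160, 160).astype(np.float32)
masks = np.random.randint(0, 4, size=(32, 160, 160))

def random_crop(frame, mask, size=128):
    # Identical random 128 x 128 crop for the image and its label mask.
    top = np.random.randint(0, frame.shape[0] - size + 1)
    left = np.random.randint(0, frame.shape[1] - size + 1)
    return (frame[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])

crops = [random_crop(f, m) for f, m in zip(frames, masks)]
x = np.stack([c[0] for c in crops])[..., None]                         # (N, 128, 128, 1)
y = tf.keras.utils.to_categorical(np.stack([c[1] for c in crops]), 4)  # one-hot labels

# Frame-level, randomly assigned 90%-10% train-test split (the first strategy).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=0)

model = build_unet()  # the U-Net sketched in Sec. II C
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=8,
          validation_data=(x_test, y_test))
```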

Next, to evaluate the impact of the training-test data splitting, we repeat the process by performing the split between the training and test data at the patient level. This ensures a more accurate analysis, as it eliminates the possibility of similar frames from a given patient appearing in both the test and training datasets.

We used the data from three symptomatic COVID-19 patients and three healthy and symptomless cases for testing, and the rest of the images were used for training. The total numbers of images used for training and testing were 1565 and 498, respectively. Augmentation techniques, such as randomly cropping and rotating the frames, were used to increase the image population in the test and training datasets and improve the performance of the neural network model. Random cropping increases the size of the dataset and lets the CNN train on smaller portions of the images. By passing randomly selected subregions of the images to the network, we expose the model to as many characteristics of the ultrasound images as possible, without bias toward the location of these features. The rotation of the augmented frames is applied at a randomized angle with a maximum of 15 deg in both directions. The statistical analysis of the test dataset after training was performed on 7470 cropped frames after implementing the augmentation techniques. To further improve the methodology, because the artifacts belonging to scores 1, 2, and 3 are vertically shaped, we used rectangular crops (128 × 64 pixels) instead of cropping the data into square frames (128 × 128 pixels). Moreover, we increased the image size (256 × 256 pixels) to improve the resolution of the input for the CNN, and we manually removed the region above the lung pleural line, which is associated with the chest muscles and fat tissue.
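A sketch of the patient-level split is shown below; the helper name and the toy patient identifiers are hypothetical and illustrate only that frames from a held-out patient never appear in the training set.

```python
import numpy as np

def patient_level_split(patient_ids, test_patients):
    # Split frame indices so that no patient contributes frames to both
    # the training and test sets. patient_ids[i] is the patient of frame i.
    ids = np.asarray(patient_ids)
    in_test = np.isin(ids, list(test_patients))
    return np.flatnonzero(~in_test), np.flatnonzero(in_test)

# Toy example: hold out one COVID-19 case ("c1") and one healthy case ("h1").
train_idx, test_idx = patient_level_split(
    ["c1", "c1", "c2", "h1", "h2", "h2"], test_patients={"c1", "h1"})
```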

E. Test output scoring process

To evaluate the method, the auto-segmented test results need to be scored and compared with the data manually segmented by the experts. In the ground-truth and auto-segmented images, pixels with a zero value represent class 0 (score 0 pattern and everything that is not score 1, 2, or 3), whereas any abnormality is depicted by pixels with values 1, 2, or 3. Depending on the extent in pixels of the auto-segmented mask, a different approach is followed. If the mask is larger than ten pixels, the rounded average of all nonzero pixels in the auto-segmented output is compared with the ground-truth score; otherwise, the rounded average of all pixels in the output is compared with the ground truth. This rule is justified by the fact that, when the detected mask is relatively small but significant in size, averaging over the entire image would round to zero and wrongly dismiss a mask that the CNN has in fact correctly detected. Conversely, averaging only the nonzero pixels without considering the size of the mask would let a handful of spurious pixels produce an erroneous score.
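A minimal sketch of this rule follows, assuming an integer-valued auto-segmented mask; the strict "larger than ten pixels" threshold handling is our reading of the text.

```python
import numpy as np

def predicted_score(pred_mask: np.ndarray, min_pixels: int = 10) -> int:
    # Image-level score from an auto-segmented mask: if the detected mask
    # exceeds min_pixels, round the mean of the nonzero pixels; otherwise
    # round the mean over the entire image.
    nonzero = pred_mask[pred_mask > 0]
    if nonzero.size > min_pixels:
        return int(round(float(nonzero.mean())))
    return int(round(float(pred_mask.mean())))
```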

F. Evaluating the location of the auto-segmented regions

The Szymkiewicz-Simpson overlap coefficient35 is calculated to compare the manually segmented images and the corresponding segmented image from the test results. This coefficient calculates the similarity between the two sets by comparing the intersection of the sets with the smaller set. It is calculated for two sets of data (A,B) as

S(A,B) = |A ∩ B| / min(|A|, |B|).  (1)

The overlap coefficient is equal to one when either one of the sets is a subset of the other.
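For binary masks, the coefficient of Eq. (1) can be computed as in the sketch below; returning zero for an empty mask is our convention, as the text does not cover that case.

```python
import numpy as np

def overlap_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    # Szymkiewicz-Simpson coefficient of Eq. (1): |A ∩ B| / min(|A|, |B|).
    a, b = a.astype(bool), b.astype(bool)
    denom = min(a.sum(), b.sum())
    return float(np.logical_and(a, b).sum() / denom) if denom > 0 else 0.0
```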

III. RESULTS

A total of 1863 LUS images from the scans of 14 healthy and symptomless individuals, 14 confirmed COVID-infected patients (PCR positive), and 4 patients presenting symptoms and suspected to be infected (no PCR) are obtained and manually segmented. After normalizing and resizing the images, crops of 128 × 128 pixels are generated through simple sliding techniques to improve the performance of the neural network model. Of the data, 90% are used to train the U-Net neural network, and the rest is used for testing. Figure 3 depicts four different images with scores 0–3, which are used for testing. Because score 0 is generally associated with healthy lungs and, in this work, is labeled with no segmented area, the corresponding test output for those cases is also not segmented [Fig. 3(a)].

FIG. 3.

(Color online) The LUS images, manually segmented (red) area, and automatically segmented (blue) area for (a) score 0, (b) score 1 (S = 0.93), (c) score 2 (S = 0.97), and (d) score 3 (S = 0.95).

The overlap coefficients are calculated through comparison between the test output results and the manually segmented regions for the scores 1, 2, and 3 cases (N = 1754). Figure 4 demonstrates the distribution of S values for the COVID-infected/suspected cases (mean = 0.89, median = 0.94).

FIG. 4.

The distribution of the overlap coefficient S for the score 1, 2, and 3 cases (mean = 0.89, median = 0.94).

To evaluate the classification capabilities of the model, a confusion matrix with four output and target classes is formed (Fig. 5). The confusion matrix is used for a detailed analysis of the accuracy, errors, and misclassifications. The rows correspond to the predicted class, i.e., the automatically segmented data. The columns correspond to the ground truth, i.e., the manually segmented data (target class). The number in each cell indicates the size of the population and its percentage with respect to the total number of frames. The observations that are accurately predicted are located in the diagonal cells. The column on the far right shows the percentages of the cases predicted to belong to each class that are correctly (green) and incorrectly (red) classified, known as the precision and false discovery rate, respectively. The bottom row shows the percentages of all of the cases belonging to each target class that are correctly (green) and incorrectly (red) classified, known as the true positive rate and false negative rate, respectively. The accuracy of the whole model (95%) is indicated in the cell on the bottom right in Fig. 5.
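The quantities in Fig. 5 can be reproduced from per-frame scores as sketched below. Note that scikit-learn places the ground truth in the rows, the transpose of the layout used in the figure, and the score values shown here are placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 2, 3, 2, 0])  # placeholder ground-truth scores
y_pred = np.array([0, 1, 2, 2, 2, 0])  # placeholder predicted scores

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)  # per predicted class
recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)     # true positive rate per target class
accuracy = np.trace(cm) / cm.sum()                       # overall accuracy
```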

FIG. 5.

(Color online) The confusion matrix for the auto-segmentation data (prediction class) and manually segmented data (target class).

As mentioned in Sec. II, to avoid having similar frames in the training and test datasets, we implemented a splitting strategy based on patients rather than frames and further improved the model through image augmentation techniques. The results are summarized in Fig. 6, which compares the confusion matrices and overlap coefficient histograms for two trainings in which the image augmentation was performed using square (128 × 128 pixels) and rectangular (128 × 64 pixels) crops of the original images. It is important to note that the crop size used for data augmentation changes the number of frames with score 0 or score 1 because of the difference in the width of the cropped frames. Because score 1 masks are relatively small, decreasing the frame width from 128 to 64 pixels affects the total number of class 0 and class 1 frames in the confusion matrices. Having larger masks, the score 2 and score 3 frame counts change less than those for scores 0 and 1.

FIG. 6.

(Color online) The confusion matrix for the auto-segmentation data (prediction class) and manually segmented data (target class) along with the overlap coefficient distribution for the vertical and square frames.

As indicated in Fig. 6, using vertical rectangular frames (128 × 64 pixels) instead of square frames (128 × 128 pixels) increases the accuracy of the trained model from 62.7% to 68.7% for the same training dataset. We also increased the size of the images by doubling the resolution of the frames from 128 to 256 pixels and manually removing the area above the pleural line (muscles and fat tissue surrounding the lungs) by cropping the raw frames before using them for training the model. Increasing the image size to 256 × 256 pixel frames did not improve the model significantly.

The precision values for score 1 were low (<6%) in all of the trainings that we performed. The smaller size of the labeled areas associated with score 1 is a possible explanation for this observation. Considering the subjective nature of the manual segmentation process used as the ground truth, the small population of score 1 augmented cropped frames (3% of the total images for the case with rectangular frames in Fig. 6), and the low precision for score 1 in the previous confusion matrices, we chose to remove all of the score 1 images and retrain the model with only three scores: 0, 2, and 3. This led to an increase in the accuracy (Fig. 7). We also attempted fusing scores 0 and 1 into a single category for a training run. To do so, the code treats score 1 segmentations as score 0 segmentations, in addition to the healthy lung pictures, so that scores 0 and 1 are merged into one class (a sketch follows). Even when fusing scores 0 and 1, the overall accuracy of the model stayed at 65%, which is in a range similar to the previous model trainings.
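The fusion experiment amounts to relabeling before one-hot encoding, as in this sketch; the remapping of the remaining labels to contiguous class indices is our assumption about the bookkeeping, not stated in the text.

```python
import numpy as np

masks = np.random.randint(0, 4, size=(4, 128, 128))  # placeholder label masks
masks[masks == 1] = 0           # treat score 1 segmentations as score 0
remap = np.array([0, 0, 1, 2])  # map remaining labels {0, 2, 3} to {0, 1, 2}
masks = remap[masks]            # contiguous classes for one-hot encoding
```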

FIG. 7.

(Color online) The confusion matrix for the auto-segmentation data (prediction class) and manually segmented data (target class) with the improved blind test. The (A) square 256 × 256-pixel frames and (B) rectangular 128 × 64-pixel frames are shown.

Comparing Figs. 6 and 7, the accuracy and precision increase when we train the model to detect only scores 0, 2, and 3, whether we use square frames (256 × 256 pixels) or rectangular images (128 × 64 pixels). As in Fig. 6, the change in the width of the augmented images results in a change in the total count for each target class in the two confusion matrices.

IV. DISCUSSION

Artifactual features of LUS, such as horizontal A-lines, representing healthy tissue, and vertical B-lines, associated with various medical conditions, such as pneumonia and pulmonary edema,10,19 have proven effective for the assessment of the lung. In the context of the COVID-19 outbreak, accurate, real-time, and automatic assessment of the effects of the disease on the respiratory system is highly desirable. Conventional B-mode imaging of the lung is only semiquantitative and highly operator dependent. Machine-learning tools are highly relevant in this context. The detection of B-lines as indicators of respiratory abnormalities has been successfully used to train and test a weakly supervised deep learning model.28 In the present study, we have used a two-dimensional (2D) convolutional neural network to automatically segment LUS images and evaluated the results through a comparison with the manually segmented data. Different training-test split strategies are compared, leading to a better understanding of automated segmentation techniques and of their probable drawbacks.

To assess the location of the segmented area quantitatively, the Szymkiewicz-Simpson overlap coefficient (S) is used. Figure 3 shows typical lung images with scores ranging from 0 to 3. A good agreement between the target and predicted segmented areas is observed and validated quantitatively with high S values. The performance of the CNN is, however, affected by the training-test datasets and by the fact that multiple similar frames associated with the same individuals are repeated in both the training and test data. The same overestimated performance is observed in the confusion matrix (Fig. 5), which is obtained to evaluate the classification process. The confusion matrix (Fig. 5) indicates high true positive rates for all of the cases and relatively high precision values for classes 0, 2, and 3. The fact that the segmented regions associated with score 1 in our data are mostly smaller than those with scores 2 and 3 might have led to a number of misclassifications. Score 0 and score 3 cases, with distinct A-line features and large white lung artifacts, respectively, show greater precision. This was expected, as scores 0 and 3 present very clear, unambiguous features, characterized by the presence of A-lines (score 0) and large consolidation areas (score 3).

Inefficient splitting of the training and test datasets is an issue that is often overlooked. Different splitting strategies are addressed in the present study. To explain the discrepancies between these initial results (accuracy = 95% and precision above 85%) and results from similar studies (e.g., Roy et al.29 report a precision of 70%), we investigated the train/test process. The training/test datasets were initially split on an image basis, meaning that similar frames from one patient could exist in both the training and test pools of images. This resulted in an overestimation of the accuracy of the designed segmentation system.

To avoid using frames associated with one patient in both the training and test datasets, we separated the two sets based on patients rather than images. The test data consisted of images from three healthy individuals and three symptomatic patients, and the rest of the images were used for training. Although the results were in accordance with similar studies,36 they showed a reduction in accuracy (<75%) compared to the case without a patient-based split between the test and training sets (accuracy = 95%).

Another important aspect of this methodology is the augmentation of the frames by cropping the images for training and testing the neural network. Although augmentation techniques have proved valuable for CNNs, the fact that score 1 usually has a small segmentation affects the class counts for trainings with different crop sizes. This is especially true when vertical frames are used for cropping the augmented data. For the same test sets, the change in the counts can affect the precision and accuracy values for scores 0 and 1. The main reason that only a small percentage (3%) of the augmented data consists of score 1 frames is that a small score 1 mask is less likely to be included in a cropped frame, which reduces the score 1 count in the confusion matrices. Because scores 2 and 3 have larger segments than score 1, the narrower crops do not decrease the score 2 and 3 counts in the augmented dataset.

The fact that the scoring system is defined based on qualitative characteristics makes the segmentation more difficult and subjective for the raters scoring the frames. Due to the qualitative nature of the labeling process, the ground truth, even though it is obtained through a structured approach, is not free from potential ambiguities in the assigned scores (especially between score 1 and score 2), which may, in the end, impair the learning process. Frames exhibiting similar qualitative characteristics but subjectively assigned different scores by different raters can negatively affect the training of the neural network. This, in turn, limits the achievable performance of the method. Qualitative features and subjective ground-truth databases tend to be more problematic for neural network models and, therefore, make the trained model less robust and predictive. Of course, a higher number of images with varied characteristics would also help the model and increase its reliability.

The differentiation between the scores is a challenging task, and several limitations will need to be addressed in the future. In spite of the high overlap coefficient values and high overall accuracy obtained with an initial population of 1863 lung images, we observed that the auto-segmentation can, in some instances, detect the wrong class. We believe that further improving the neural network performance through optimization of the split strategies and augmentation techniques will correct segmentation irregularities such as those shown in Fig. 8. Moreover, it would be preferable to evaluate consecutive sets of frames forming short videos rather than comparing single frames. These modifications could be addressed in future studies.

FIG. 8.

(Color online) (a) The manually segmented region (yellow line) indicates score 2, whereas the auto-segmented region indicates score 3 and has an abnormal shape (green dashed line). (b) The abnormal shape of the detected score 3 (red line) is used to show the variability of the patterns associated with score 3, which can also be categorized as a score 2 image.

V. CONCLUSION

A deep learning approach based on a convolutional neural network model is implemented to automatically segment the lung images of healthy and COVID-19 infected cases. Although previous studies have mostly depended on the characterization of healthy and diseased lungs through the exclusive detection of B-lines,28 we have successfully been able to automatically classify different score ranks based on the detection of multiple features. Machine-learning tools are extremely beneficial in terms of providing a real-time quantitative assessment of the lungs29,36 and allow for regular monitoring of COVID-19 affected patients to track the course of the disease. However, it is worth noting that the performance of artificial intelligence (AI) methods in image auto-segmentation is highly dependent on the test and training populations. Overfitting the neural network can be difficult to detect and can result in misleadingly high accuracies. It is generally not straightforward to define the size of the dataset required to reach a solid and reproducible conclusion on the performance of a given AI approach. Most likely, this depends strongly on the task to be performed by the AI as well as on the quality of the data used during the training phase. To this end, future studies will focus on the link between dataset size and quality and the achieved performance.

ACKNOWLEDGMENT

This work was supported by the European Institute of Innovation and Technology, Project UltraOn EIT Digital 2021. Funding for this project was partially provided by the Fondazione Valorizzazione Ricerca, Trentino, Italy (VRT Foundation), COVID-19 call 2020 Grant No. 1. It was also supported by the National Institutes of Health (NIH) with Grant No. R01EB021396. Y.K. and R.R. had equal contributions to this study and are both first authors of this publication.

a)

This paper is part of a special issue on Lung Ultrasound.

References

  • 1. CDC, "Cases in U.S.," available at https://covid.cdc.gov/covid-data-tracker/#datatracker-home (Last viewed 28 January 2021).
  • 2. Shi Y., Wang Y., Shao C., Huang J., Gan J., Huang X., Bucci E., Piacentini M., Ippolito G., and Melino G., "COVID-19 infection: The perspectives on immune responses," Cell Death Differ. 27, 1451–1454 (2020). 10.1038/s41418-020-0530-3
  • 3. Wang D., Hu B., Hu C., Zhu F., Liu X., Zhang J., Wang B., Xiang H., Cheng Z., Xiong Y., Zhao Y., Li Y., Wang X., and Peng Z., "Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China," JAMA 323(11), 1061–1069 (2020). 10.1001/jama.2020.1585
  • 4. Villalba N. L., Maouche Y., Alonso Ortiz M. B., Cordoba Sosa Z., Chahbazian J. B., Syrovatkova A., Pertoldi P., Andres E., and Zulfiqar A.-A., "Anosmia and dysgeusia in the absence of other respiratory diseases: Should COVID-19 infection be considered?," Eur. J. Case Rep. Intern. Med. 7(4), 001641 (2020). 10.12890/2020_001641
  • 5. Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y., Zhang L., Fan G., Xu J., Gu X., Cheng Z., Yu T., Xia J., Wei Y., Wu W., Xie X., Yin W., Li H., Liu M., Xiao Y., Gao H., Guo L., Xie J., Wang G., Jiang R., Gao Z., Jin Q., Wang J., and Cao B., "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China," Lancet 395(10223), 497–506 (2020). 10.1016/S0140-6736(20)30183-5
  • 6. Pan Y., Guan H., Zhou S., Wang Y., Li Q., Zhu T., Hu Q., and Xia L., "Initial CT findings and temporal changes in patients with the novel coronavirus pneumonia (2019-nCoV): A study of 63 patients in Wuhan, China," Eur. Radiol. 30(6), 3306–3309 (2020). 10.1007/s00330-020-06731-x
  • 7. Yi K., Yu L., Tong H., Tian Y., Li X., Li R., and Fang J., "CT features and clinical manifestations of ordinary pneumonia with COVID-19 infection: A multi-center study," preprint available at 10.21203/rs.3.rs-23394/v1 (Last viewed 28 January 2021).
  • 8. Lichtenstein D. A. and Menu Y., "A bedside ultrasound sign ruling out pneumothorax in the critically ill," Chest 108(5), 1345–1348 (1995). 10.1378/chest.108.5.1345
  • 9. Lichtenstein D. and Mezière G., "A lung ultrasound sign allowing bedside distinction between pulmonary edema and COPD: The comet-tail artifact," Intensive Care Med. 24(12), 1331–1334 (1998). 10.1007/s001340050771
  • 10. Demi L., Egan T., and Muller M., "Lung ultrasound imaging, a technical review," Appl. Sci. 10(2), 462 (2020). 10.3390/app10020462
  • 11. Mohanty K., Blackwell J., Egan T., and Muller M., "Characterization of the lung parenchyma using ultrasound multiple scattering," Ultrasound Med. Biol. 43(5), 993–1003 (2017). 10.1016/j.ultrasmedbio.2017.01.011
  • 12. Soldati G., Smargiassi A., Inchingolo R., Buonsenso D., Perrone T., Briganti D. F., Perlini S., Torri E., Mariani A., Mossolani E. E., Tursi F., Mento F., and Demi L., "Is there a role for lung ultrasound during the COVID-19 pandemic?," J. Ultrasound Med. 39(7), 1459–1462 (2020). 10.1002/jum.15284
  • 13. Soldati G., Smargiassi A., Demi L., and Inchingolo R., "Artifactual lung ultrasonography: It is a matter of traps, order, and disorder," Appl. Sci. 10(5), 1570 (2020). 10.3390/app10051570
  • 14. Soldati G., Demi M., Smargiassi A., Inchingolo R., and Demi L., "The role of ultrasound lung artifacts in the diagnosis of respiratory diseases," Expert Rev. Respir. Med. 13(2), 163–172 (2019). 10.1080/17476348.2019.1565997
  • 15. Mohanty K., Blackwell J., Masuodi S. B., Ali M. H., Egan T., and Muller M., "1-Dimensional quantitative micro-architecture mapping of multiple scattering media using backscattering of ultrasound in the near-field: Application to nodule imaging in the lungs," Appl. Phys. Lett. 113(3), 033704 (2018). 10.1063/1.5038005
  • 16. Demi L., Demi M., Prediletto R., and Soldati G., "Real-time multi-frequency ultrasound imaging for quantitative lung ultrasound—First clinical results," J. Acoust. Soc. Am. 148(2), 998–1006 (2020). 10.1121/10.0001723
  • 17. Mento F., Soldati G., Prediletto R., Demi M., and Demi L., "Quantitative lung ultrasound spectroscopy applied to the diagnosis of pulmonary fibrosis: First clinical study," IEEE Trans. Ultrason. Ferroelectr. Freq. Control 67, 2265–2273 (2020). 10.1109/TUFFC.2020.3012289
  • 18. Demi L., "Lung ultrasound: The future ahead and the lessons learned from COVID-19," J. Acoust. Soc. Am. 148(2), 2146–2150 (2020). 10.1121/10.0002183
  • 19. Bouhemad B., Zhang M., Lu Q., and Rouby J.-J., "Clinical review: Bedside lung ultrasound in critical care practice," Crit. Care 11(1), 205 (2007). 10.1186/cc5668
  • 20. Jambrik Z., Monti S., Coppola V., Agricola E., Mottola G., Miniati M., and Picano E., "Usefulness of ultrasound lung comets as a nonradiologic sign of extravascular lung water," Am. J. Cardiol. 93(10), 1265–1270 (2004). 10.1016/j.amjcard.2004.02.012
  • 21. Lichtenstein D., Mezière G., and Seitz J., "The dynamic air bronchogram," Chest 135(6), 1421–1425 (2009). 10.1378/chest.08-2281
  • 22. Moro F., Buonsenso D., Moruzzi M. C., Inchingolo R., Smargiassi A., Demi L., Larici A. R., Scambia G., Lanzone A., and Testa A. C., "How to perform lung ultrasound in pregnant women with suspected COVID-19," Ultrasound Obstet. Gynecol. 55(5), 593–598 (2020). 10.1002/uog.22028
  • 23. Brattain L. J., Telfer B. A., Liteplo A. S., and Noble V. E., "Automated B-line scoring on thoracic sonography," J. Ultrasound Med. 32(12), 2185–2190 (2013). 10.7863/ultra.32.12.2185
  • 24. Moshavegh R., Hansen K. L., Sørensen H. M., Hemmsen M. C., Ewertsen C., Nielsen M. B., and Jensen J. A., "Novel automatic detection of pleura and B-lines (comet-tail artifacts) on in vivo lung ultrasound scans," Proc. SPIE 9790, 97900K (2016). 10.1117/12.2216499
  • 25. Anantrasirichai N., Hayes W., Allinovi M., Bull D., and Achim A., "Line detection as an inverse problem: Application to lung ultrasound imaging," IEEE Trans. Med. Imag. 36(10), 2045–2056 (2017). 10.1109/TMI.2017.2715880
  • 26. Karakuş O., Anantrasirichai N., Aguersif A., Silva S., Basarab A., and Achim A., "Detection of line artifacts in lung ultrasound images of COVID-19 patients via nonconvex regularization," IEEE Trans. Ultrason. Ferroelectr. Freq. Control 67(11), 2218–2229 (2020). 10.1109/TUFFC.2020.3016092
  • 27. Soldati G., Smargiassi A., Inchingolo R., Buonsenso D., Perrone T., Briganti D. F., Perlini S., Torri E., Mariani A., Mossolani E. E., Tursi F., Mento F., and Demi L., "Proposal for international standardization of the use of lung ultrasound for patients with COVID-19: A simple, quantitative, reproducible method," J. Ultrasound Med. 39(7), 1413–1419 (2020). 10.1002/jum.15285
  • 28. van Sloun R. J. G. and Demi L., "Localizing B-lines in lung ultrasonography by weakly supervised deep learning, in-vivo results," IEEE J. Biomed. Health Inform. 24(4), 957–964 (2020). 10.1109/JBHI.2019.2936151
  • 29. Roy S., Menapace W., Oei S., Luijten B., Fini E., Saltori C., Huijben I., Chennakeshava N., Mento F., Sentelli A., Peschiera E., Trevisan R., Maschietto G., Torri E., Inchingolo R., Smargiassi A., Soldati G., Rota P., Passerini A., van Sloun R. J. G., Ricci E., and Demi L., "Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound," IEEE Trans. Med. Imag. 39(8), 2676–2687 (2020). 10.1109/TMI.2020.2994459
  • 30. Mento F., Perrone T., Macioce V. N., Tursi F., Buonsenso D., Torri E., Smargiassi A., Inchingolo R., Soldati G., and Demi L., "On the impact of different lung ultrasound imaging protocols in the evaluation of patients affected by coronavirus disease 2019: How many acquisitions are needed?," J. Ultrasound Med. 40(10), 2235–2238 (2020). 10.1002/jum.15580
  • 31. Perrone T., Soldati G., Padovini L., Fiengo A., Lettieri G., Sabatini U., Gori G., Lepore F., Garolfi M., Palumbo I., Inchingolo R., Smargiassi A., Demi L., Mossolani E. E., Tursi F., Klersy C., and Di Sabatino A., "A new lung ultrasound protocol able to predict worsening in patients affected by severe acute respiratory syndrome coronavirus 2 pneumonia," J. Ultrasound Med. 40(8), 1627–1635 (2021). 10.1002/jum.15548
  • 32. Ungi T., Greer H., Sunderland K. R., Wu V., Baum Z. M., Schlenger C., Oetgen M., Cleary K., Aylward S. R., and Fichtinger G., "Automatic spine ultrasound segmentation for scoliosis visualization and measurement," IEEE Trans. Biomed. Eng. 67(11), 3234–3241 (2020). 10.1109/TBME.2020.2980540
  • 33. Ronneberger O., Fischer P., and Brox T., U-Net: Convolutional Networks for Biomedical Image Segmentation (Springer, Cham, 2015), pp. 234–241.
  • 34. Liu G., Reda F. A., Shih K. J., Wang T.-C., Tao A., and Catanzaro B., "Image inpainting for irregular holes using partial convolutions," in European Conference on Computer Vision (ECCV 2018), pp. 89–105.
  • 35. Vijaymeena M. K. and Kavitha K., "A survey on similarity measures in text mining," Mach. Learn. Appl. Int. J. 3(1), 19–28 (2016). 10.5121/mlaij.2016.3103
  • 36. Carrer L., Donini E., Marinelli D., Zanetti M., Mento F., Torri E., Smargiassi A., Inchingolo R., Soldati G., Demi L., Bovolo F., and Bruzzone L., "Automatic pleural line extraction and COVID-19 scoring from lung ultrasound data," IEEE Trans. Ultrason. Ferroelectr. Freq. Control 67, 2207–2217 (2020). 10.1109/TUFFC.2020.3005512
