PLOS One. 2021 Nov 4;16(11):e0258760. doi: 10.1371/journal.pone.0258760

Accuracy of deep learning-based computed tomography diagnostic system for COVID-19: A consecutive sampling external validation cohort study

Tatsuyoshi Ikenoue 1,*,#, Yuki Kataoka 2,3,#, Yoshinori Matsuoka 4, Junichi Matsumoto 5, Junji Kumasawa 6, Kentaro Tochitatni 7, Hiraku Funakoshi 8, Tomohiro Hosoda 9, Aiko Kugimiya 10, Michinori Shirano 11, Fumiko Hamabe 12, Sachiyo Iwata 13, Shingo Fukuma 1; Japan COVID-19 AI team
Editor: Haoran Xie 14
PMCID: PMC8568139  PMID: 34735458

Abstract

Ali-M3, an artificial intelligence program, analyzes chest computed tomography (CT) images and estimates the likelihood of coronavirus disease (COVID-19) as a score ranging from 0 to 1. However, Ali-M3 has not been externally validated. Our aim was to evaluate the accuracy of Ali-M3 for detecting COVID-19 and discuss its clinical value. We evaluated the external validity of Ali-M3 using consecutively sampled Japanese data. In this retrospective cohort study, COVID-19 infection probabilities for 617 symptomatic patients were determined using Ali-M3. At 11 Japanese tertiary care facilities, these patients underwent reverse transcription-polymerase chain reaction (RT-PCR) testing and chest CT for the diagnosis of COVID-19. Of the 617 patients, 289 (46.8%) were RT-PCR-positive. The area under the curve (AUC) of Ali-M3 for predicting a COVID-19 diagnosis was 0.797 (95% confidence interval: 0.762–0.833), and the goodness-of-fit test gave P = 0.156. With the cut-off for the Ali-M3 probability of COVID-19 set at 0.5, the sensitivity and specificity were 80.6% and 68.3%, respectively. A cut-off of 0.2 yielded a sensitivity and specificity of 89.2% and 43.2%, respectively. Among the 223 patients who required oxygen, the AUC was 0.825, and the sensitivity at cut-offs of 0.5 and 0.2 was 88.7% and 97.9%, respectively. Although sensitivity was lower when fewer days had elapsed since symptom onset, it increased at both cut-off values after 5 days. We evaluated Ali-M3 by external validation with symptomatic patient data from Japanese tertiary care facilities. As Ali-M3 showed sufficient sensitivity, despite lower specificity, it could be useful for excluding a diagnosis of COVID-19.

Introduction

A proper triage system is critical during the COVID-19 pandemic [1, 2]. An improper triage system may be disadvantageous to patients and lead to a waste of personal protective equipment (PPE). An increase in hospital infections through the admission of infected patients to healthcare facilities could result in the collapse of the medical system. Although reverse transcription-polymerase chain reaction (RT-PCR) tests have been developed, the delay in receiving RT-PCR results could hamper appropriate triage.

Computed tomography (CT) is a fast and useful diagnostic tool. Several studies have reported characteristic COVID-19 findings on chest CT images [3–8], and the interpretation of chest CT images by radiologists has shown high diagnostic performance for COVID-19. However, radiologists’ interpretations vary greatly depending on their familiarity with COVID-19 CT findings [9]. Therefore, using CT as a diagnostic tool in general clinical practice is challenging in the current pandemic environment.

Diagnostic support systems using artificial intelligence (AI) have the potential to replace many of the routine detection, characterization, and quantification tasks that radiologists currently perform using their own cognitive abilities [10]. AI can also reduce inter- and intra-reader diagnostic inconsistencies. In China, where the COVID-19 outbreak originated, many AI systems have been developed to diagnose COVID-19 from chest CT images [11–15]. One such system, Ali-M3, estimates the likelihood of COVID-19 as a score from 0 to 1 and has been reported to have excellent detection performance, with an accuracy, sensitivity, and specificity of 99.0%, 98.5%, and 99.2%, respectively. However, Ali-M3 was developed with a virtual population consisting of 3,067 examinations for COVID-19, 1,996 for community-acquired pneumonia, and 1,975 for non-pneumonia. Because these virtual examinations differed from a general population, its accuracy could be overestimated [16].

To use Ali-M3 to exclude a diagnosis of COVID-19, its external validity must be evaluated against the distribution of disease in a real-world setting. We therefore conducted a retrospective cohort study to evaluate the external validity of Ali-M3, using consecutively sampled Japanese data from patients who underwent both RT-PCR tests and chest CT for the diagnosis of COVID-19.

Materials and methods

Study design

This retrospective cohort study was conducted at 11 Japanese tertiary care facilities that provided treatment for COVID-19 in different regions of the country. The institutions from which the medical data were obtained are listed in S1 Table. We collected data from the medical records of each institution between April 15 and May 31, 2020. We partially followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) Statement to plan and report this study (S2 Table) [17]. The Institutional Review Board of each facility approved the study. The requirement for written informed consent was waived because this was deemed an urgent study with public health implications. The accuracy and reliability of the data were confirmed by the Pharmaceuticals and Medical Devices Agency (PMDA) during the approval process of Ali-M3.

Participants

We included patients who underwent both RT-PCR and chest CT for the diagnosis of COVID-19. Potentially eligible participants were identified by their physicians, who confirmed that both an RT-PCR test and chest CT had been performed when the patient presented with symptoms or was suspected of having COVID-19. Detailed inclusion criteria are shown in S3 Table. Patients were selected by consecutive sampling between January 1 and April 15, 2020. RT-PCR results were extracted from the patients’ medical records at each facility. Patients were excluded when the interval between chest CT and the first RT-PCR assay exceeded 7 days.

All available data in the database was used to maximize the power and generalizability of the results.

Chest CT protocols

All images were obtained using one of the five types of CT systems with the patient in the supine position. The details of the scanning parameters and systems are listed in S4 Table.

Image analysis

We used Ali-M3, a three-dimensional deep learning framework, to detect COVID-19 infection [16]. Details of the model are provided in S1 File, and the characteristics of the development population, taken from the datasheet, are shown in S5 Table. Training of Ali-M3 was frozen before this evaluation. We set the cut-off point for the model output at 0.5, as this cut-off point was used during the development stage. The investigators who entered the CT image data into Ali-M3 were blinded to the RT-PCR results.

Reference standard

The diagnosis of COVID-19 was established by an RT-PCR test, which detects the nucleic acid of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in sputum, throat swabs, or lower respiratory tract secretions [18]. RT-PCR was used as the main reference standard. Although chest CT findings interpreted by radiologists were part of the reference standard in the derivation study, we did not include them in the reference standard in the present study.

Statistical analysis

Statistical analysis was performed using R statistical software, version 3.6.3 (R Foundation for Statistical Computing). Data were analyzed using a complete case dataset. Continuous variables are presented as means (standard deviation) and categorical variables as counts and percentages. Using the RT-PCR results as the reference, we calculated the AUC, sensitivity, specificity, positive predictive value, and negative predictive value of the COVID-19 likelihood derived from Ali-M3’s analysis of the chest CT images. 95% confidence intervals (CIs) were calculated using the Wilson score method. Goodness-of-fit was assessed using the le Cessie–van Houwelingen normal test statistic for the unweighted sum of squared errors.
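To make the accuracy measures above concrete, the following minimal Python sketch computes sensitivity, specificity, and predictive values from a 2×2 table against the RT-PCR reference, each with a Wilson score 95% interval. It is an illustration, not the authors’ R code; the class and function names are hypothetical, a score at or above the cut-off is assumed to count as Ali-M3 positive, and the counts in the example (233 of 289 RT-PCR-positive and 224 of 328 RT-PCR-negative patients at the 0.5 cut-off) are taken from the Results. Other interval methods give slightly different bounds.

```python
import math
from dataclasses import dataclass

Z95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_interval(successes: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score confidence interval for the binomial proportion successes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


@dataclass
class TwoByTwo:
    """Ali-M3 call (assumed: score >= cut-off) cross-tabulated against RT-PCR."""
    tp: int  # Ali-M3 positive, RT-PCR positive
    fp: int  # Ali-M3 positive, RT-PCR negative
    fn: int  # Ali-M3 negative, RT-PCR positive
    tn: int  # Ali-M3 negative, RT-PCR negative

    def metrics(self) -> dict[str, tuple[float, float, float]]:
        """(estimate, lower, upper) for each measure, with Wilson 95% CIs."""
        counts = {
            "sensitivity": (self.tp, self.tp + self.fn),
            "specificity": (self.tn, self.tn + self.fp),
            "ppv": (self.tp, self.tp + self.fp),
            "npv": (self.tn, self.tn + self.fn),
        }
        return {name: (k / n, *wilson_interval(k, n)) for name, (k, n) in counts.items()}


# Counts at the 0.5 cut-off, derived from the Results (233/289 and 224/328).
table = TwoByTwo(tp=233, fp=104, fn=56, tn=224)
for name, (est, lo, hi) in table.metrics().items():
    print(f"{name}: {est:.3f} ({lo:.3f} to {hi:.3f})")
```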

Sensitivity analysis

Moving cut-off point

The objective of this study was to determine whether the AI model could be used as a screening tool for COVID-19 in the real world. In clinical triage, missing a patient with COVID-19 is more harmful than a false-positive result, so physicians prioritize sensitivity over specificity. For the sensitivity analysis, we therefore lowered the cut-off point stepwise and examined the resulting sensitivities and specificities, with the aim of minimizing the number of missed COVID-19 patients.
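The cut-off sweep described above can be sketched as follows. This is a hypothetical Python illustration, not the authors’ code: the labels encoding (1 = RT-PCR positive), the rule that a score at or above the cut-off is positive, and the toy scores are all assumptions. It also reports Youden’s J (sensitivity + specificity − 1), one common way to pick a single balanced cut-off.

```python
from typing import Sequence


def sens_spec(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> tuple[float, float]:
    """Sensitivity and specificity when a score >= cutoff is called positive.

    labels: 1 = RT-PCR positive, 0 = RT-PCR negative (hypothetical encoding).
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def sweep(scores, labels, cutoffs=(0.5, 0.4, 0.3, 0.2, 0.1)):
    """Return (cutoff, sensitivity, specificity, Youden J) for each cut-off."""
    rows = []
    for c in cutoffs:
        se, sp = sens_spec(scores, labels, c)
        rows.append((c, se, sp, se + sp - 1))
    return rows


# Hypothetical toy data (not study data): 5 RT-PCR positives, then 5 negatives.
toy_scores = [0.9, 0.7, 0.45, 0.25, 0.15, 0.6, 0.35, 0.22, 0.12, 0.05]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rows = sweep(toy_scores, toy_labels)
for c, se, sp, j in rows:
    print(f"cut-off {c:.2f}: sensitivity {se:.2f}, specificity {sp:.2f}, J {j:.2f}")
best = max(rows, key=lambda r: r[3])  # cut-off maximizing Youden's J
```

Lowering the cut-off can only keep or raise sensitivity and keep or lower specificity, which is the trade-off tabulated in Table 2.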

Simulation of imperfect reference

In the main analysis, we assumed RT-PCR to be a perfect reference (100% sensitivity and 100% specificity). In the real world, however, RT-PCR is not a perfect reference; its sensitivity has been estimated to be 0‒80% [19]. To evaluate the effect of this imperfect reference, we recalculated the sensitivity, specificity, and AUC of Ali-M3 while varying the assumed sensitivity of RT-PCR, using the methods and R code described in S2 File. The specificity of RT-PCR was fixed at 100% [20].
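The following Python sketch shows, in simplified form, why an imperfect reference with perfect specificity leaves the index test’s sensitivity unchanged but depresses its apparent specificity. It is not the R code in S2 File: it uses a standard algebraic correction that assumes Ali-M3 and RT-PCR errors are conditionally independent given true disease status, and the RT-PCR sensitivity of 0.8 in the example is a hypothetical input.

```python
def corrected_accuracy(tp: int, fp: int, fn: int, tn: int,
                       ref_sens: float) -> tuple[float, float]:
    """Ali-M3 sensitivity/specificity after correcting for an imperfect RT-PCR reference.

    Assumptions: RT-PCR specificity is 1.0, its sensitivity is `ref_sens`, and
    Ali-M3 and RT-PCR errors are conditionally independent given true status.
    Counts are Ali-M3 results cross-tabulated against RT-PCR results.
    """
    if not 0 < ref_sens <= 1:
        raise ValueError("ref_sens must lie in (0, 1]")
    ref_pos = tp + fn
    # Every RT-PCR positive is truly diseased (reference specificity = 1),
    # so the sensitivity estimated among RT-PCR positives is unbiased.
    sens = tp / ref_pos
    # Expected number of truly diseased patients hidden among RT-PCR negatives.
    missed = ref_pos * (1 - ref_sens) / ref_sens
    true_neg_total = (fp + tn) - missed
    if true_neg_total <= 0:
        raise ValueError("ref_sens is too low for the observed counts")
    # Remove the expected Ali-M3-negative share of the hidden cases from tn.
    spec = (tn - missed * (1 - sens)) / true_neg_total
    return sens, spec


naive = corrected_accuracy(233, 104, 56, 224, ref_sens=1.0)
corrected = corrected_accuracy(233, 104, 56, 224, ref_sens=0.8)  # hypothetical value
print(naive, corrected)
```

With the hypothetical value ref_sens = 0.8, sensitivity stays at about 0.806 while specificity rises from about 0.683 to about 0.821; this is an illustration of the direction of the bias, not the simulated result reported in this study.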

Effect of the number of days after symptom onset

The number of days since symptom onset affects the presence of antibodies and the performance of RT-PCR tests in COVID-19 patients [19, 21]. However, it is not clear whether it also affects CT images in these patients. Sensitivity and specificity were therefore calculated for the patients whose symptom onset dates were known: in 2-day strata from day 0 to day 13 after symptom onset, and in a single stratum for patients 14 or more days after symptom onset.
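The 2-day stratification described above can be sketched as follows. This is a hypothetical Python illustration, not the authors’ R code; the record layout (days since onset, Ali-M3 score, RT-PCR result), the function names, and the rule that a score at or above the cut-off counts as positive are assumptions.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def onset_stratum(days: int) -> str:
    """Map days since symptom onset to a 2-day stratum (0-1, ..., 12-13) or '14+'."""
    if days < 0:
        raise ValueError("days since onset cannot be negative")
    if days >= 14:
        return "14+"
    start = (days // 2) * 2
    return f"{start}-{start + 1}"


def sensitivity_by_stratum(
    records: Iterable[Tuple[Optional[int], float, bool]], cutoff: float = 0.5
) -> dict:
    """Sensitivity per onset stratum from (days_since_onset, score, rtpcr_positive).

    Records with an unknown onset date (None) are skipped. Specificity can be
    computed the same way from the RT-PCR-negative records.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [detected, RT-PCR positive]
    for days, score, positive in records:
        if days is None or not positive:
            continue
        cell = counts[onset_stratum(days)]
        cell[1] += 1
        if score >= cutoff:
            cell[0] += 1
    return {stratum: hit / total for stratum, (hit, total) in counts.items()}
```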

Effect of symptom severity

Imaging is not routinely used as a screening test for COVID-19 in asymptomatic individuals [22], although CT is used to assess disease severity. We therefore evaluated performance separately in patients who required oxygen therapy, as a marker of severity, and in patients who were asymptomatic at the time of CT.

Effect of reconstruction slice

The thickness of the reconstruction slice can affect diagnostic performance [23]. We split the main-analysis dataset at a reconstruction slice thickness of 3 mm because the slice thicknesses in our dataset showed a gap between 3 mm and 4 mm. We then calculated the performance of the model in each subset.

Results

Study population characteristics

Fig 1 shows the patient flow diagram. Data from 749 patients were analyzed. In this validation study, we assessed 617 symptomatic patients. The characteristics of the study population for the main datasets are listed in Table 1. Overall, 289 patients (46.8%) were diagnosed with COVID-19 using RT-PCR. Thirteen patients required more than two RT-PCR tests before being diagnosed with COVID-19. The major symptoms were dry cough (37.6%), fever (33.5%), and sore throat (25.8%).

Fig 1. Patient flow.


Abbreviations: CT, computed tomography; RT-PCR, reverse transcription-polymerase chain reaction; DICOM, digital imaging and communications in medicine.

Table 1. Demographics of patient characteristics.

| Variable | Symptomatic patients | Patients using oxygen* | Asymptomatic patients |
|---|---|---|---|
| N | 617 | 223 | 86 |
| Age (years)+ | 59.6 (19.2) | 68.3 (16.4) | 54.5 (22.4) |
| Sex (male) | 377 (61.2) | 158 (70.9) | 40 (46.5) |
| Real-time PCR test (positive) | 289 (46.8) | 97 (43.5) | 37 (43.0) |
| Body temperature (≥ 37 °C) | 391 (66.5) | 143 (69.8) | |
| Systolic blood pressure (≤ 90 mmHg) | 18 (3.2) | 11 (5.2) | |
| Pulse (≥ 120 bpm) | 48 (8.2) | 22 (10.2) | |
| Respiratory rate (≥ 25/minute) | 92 (20.5) | 64 (38.3) | |
| Saturation of percutaneous oxygen (≤ 92%) | 105 (17.7) | 62 (28.7) | |
| Oxygen use | 223 (36.1) | 223 (100.0) | |
| Vasopressor use | 14 (2.3) | 14 (6.3) | |
| Distribution of symptoms reported | | | |
| Dry cough | 232 (37.6) | 67 (30.0) | |
| Chills | 91 (14.7) | 40 (17.9) | |
| Sore throat | 159 (25.8) | 38 (17.0) | |
| Diarrhea | 66 (10.7) | 17 (7.6) | |
| Joint or muscle pain | 46 (7.5) | 12 (5.4) | |
| Conjunctivitis | 30 (4.9) | 9 (4.0) | |
| Loss of smell or taste | 55 (8.9) | 21 (9.4) | |
| Exposure history | | | |
| No | 484 (78.4) | 191 (85.7) | 62 (72.1) |
| Within family | 39 (6.3) | 11 (4.9) | 6 (7.0) |
| Other persons | 94 (15.2) | 21 (9.4) | 18 (20.9) |
| Any international travel | 44 (7.1) | 6 (2.7) | 9 (10.5) |
| Current smoking | 99 (16.0) | 41 (18.4) | 11 (12.8) |
| Past medical history | | | |
| Cardiac artery disease | 46 (7.5) | 24 (10.8) | 4 (4.7) |
| Stroke | 60 (9.7) | 34 (15.2) | 2 (2.3) |
| Chronic heart failure | 69 (11.2) | 43 (19.3) | 4 (4.7) |
| Chronic kidney disease | 58 (9.4) | 33 (14.8) | 7 (8.1) |
| Chronic obstructive pulmonary disease | 69 (11.2) | 34 (15.2) | 7 (8.1) |
| Malignancy | 105 (17.0) | 62 (27.8) | 8 (9.3) |
| Immune deficiency | 32 (5.2) | 17 (7.6) | 1 (1.2) |
| Hypertension | 119 (19.3) | 71 (31.8) | 11 (12.8) |
| Diabetes | 116 (18.8) | 64 (28.7) | 13 (15.1) |
| Any other disease | 188 (30.5) | 73 (32.7) | 29 (33.7) |

PCR, polymerase chain reaction; bpm, beats per minute

*Patients using oxygen are a subset of the symptomatic patients.

+ Continuous variable, expressed as mean (SD); all other variables are counts, expressed as number (percentage).

Model performance

The performance of the confidence score among the symptomatic patients is shown in Fig 2. The goodness-of-fit test gave P = 0.156, and the AUC was 0.797 (95% CI: 0.762‒0.833). The relationship between the score and the predicted probability is shown in Fig 2C. The optimal cut-off point, maximizing both sensitivity and specificity, was 0.5. At this cut-off, the sensitivity and specificity were 80.6% (233 of 289; 95% CI: 75.6‒85.0%) and 68.3% (224 of 328; 95% CI: 62.9‒73.2%), respectively.
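An AUC such as the 0.797 reported above can be read as the probability that a randomly chosen RT-PCR-positive patient receives a higher Ali-M3 score than a randomly chosen RT-PCR-negative patient, with ties counted as one half. The sketch below computes it directly from that definition (the normalized Mann–Whitney U statistic); it is an illustrative Python implementation, not the R routine used in the study, and the scores in the example are hypothetical.

```python
from typing import Sequence


def auc_mann_whitney(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC = P(score_pos > score_neg) + 0.5 * P(tie), the normalized Mann-Whitney U.

    O(n * m) pairwise comparison; adequate for cohorts of a few hundred patients.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: 3 of the 4 positive-negative pairs are correctly ordered.
print(auc_mann_whitney([0.9, 0.3], [0.5, 0.1]))  # 0.75
```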

Fig 2. Differential performance of Ali-M3 for coronavirus disease in symptomatic patients.


(A) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score. The area under the receiver operating characteristic curve is 0.797 and the Youden index is 0.50. (B) A plot of test sensitivity, specificity, positive predictive value (PV+), and negative predictive value (PV-) (y-coordinate) versus the confidence score obtained from Ali-M3 (x-coordinate). The PV+ is dark gray and the PV- is light gray. The maximum PV+ is 46.8% and the maximum PV- is 53.2%. (C) This graph shows the goodness of fit. The dashed line is an ideal line on which the probability predicted from the Ali-M3 confidence score equals the actual probability. The dotted line is the fitted line estimated with a non-linear assumption alone. The dashed line is the fitted line estimated with a non-linear assumption and accounting for bias in nonparametric estimation using the le Cessie–van Houwelingen method.

Sensitivity analysis

Moving cut-off point

Table 2 shows the relationship between the cut-off point for the confidence score and test performance. When the cut-off point was 0.2, the sensitivity and specificity were 89.2% and 43.2%, respectively.

Table 2. Moving cut-off confidence score and test performance.

| Confidence score cut-off | 0.50 | 0.40 | 0.30 | 0.20 | 0.10 |
|---|---|---|---|---|---|
| Sensitivity | 0.806 (0.755–0.850) | 0.837 (0.789–0.877) | 0.854 (0.808–0.893) | 0.892 (0.851–0.925) | 0.910 (0.870–0.940) |
| Specificity | 0.682 (0.629–0.732) | 0.612 (0.557–0.665) | 0.545 (0.490–0.600) | 0.432 (0.378–0.488) | 0.375 (0.322–0.429) |

Values are estimates (95% confidence interval).

Simulation of imperfect reference

Fig 4 shows the ROC curves with and without the assumption of an imperfect reference for the RT-PCR test. When an imperfect reference was assumed, the AUC was 0.865. With the cut-off point set at 0.5 using the Youden index, the sensitivity and specificity were 80.6% and 81.3%, respectively. With the cut-off point set at 0.2, the sensitivity and specificity were 89.2% and 51.9%, respectively.

Fig 3. Relationship between the test performance and the number of days after the onset of symptoms.


(A) Test performance versus the number of days after symptom onset with the Ali-M3 confidence-score cut-off set at 0.20. (B) The same relationship with the cut-off set at 0.50. The light gray bars show the number of patients in each stratum of days after symptom onset (right axis). Each stratum spans 2 days from day 0 to day 13; the rightmost stratum includes 14 days or more. On the left axis, solid lines represent sensitivity and dashed lines represent specificity within each stratum.

Effect of number of days after symptom onset

Of all symptomatic patients, 600 (97.2%) were included in this analysis; the remaining 17 patients did not know the number of days since symptom onset. Fig 3 shows the relationship between test performance and the number of days since symptom onset with the Ali-M3 confidence-score cut-off set at 0.5 and 0.2. At both cut-offs, sensitivity began at 0.7 and increased to 1.0 by days 10‒11, whereas specificity remained similar across the strata. With the 0.2 cut-off, sensitivity exceeded 0.9, which was higher than with the 0.5 cut-off.

Fig 4. Receiver operating characteristic (ROC) curves when ignoring imperfect reference and considering imperfect reference.


(A) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score, ignoring the imperfect reference. The area under the ROC curve is 0.797. (B) The same plot obtained at each cut-off level of the confidence score, considering the imperfect reference. The area under the ROC curve is 0.865.

Changing the eligibility criteria

The effects of changing the criteria for patient eligibility are shown in Fig 5.

Fig 5. Differential performance of Ali-M3 for Covid-19 in asymptomatic patients and patients using oxygen.


(A) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score in asymptomatic patients. The area under the receiver operating characteristic (ROC) curve is 0.623 and the Youden index is 0.25. (B) A plot of test sensitivity, specificity, positive predictive value (PV+), and negative predictive value (PV-) (y-coordinate) versus the confidence score obtained from Ali-M3 (x-coordinate) among asymptomatic patients. The PV+ is dark gray and the PV- is light gray. The maximum PV+ is 43.0% and the maximum PV- is 57.0%. (C) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score in patients using oxygen. The area under the ROC curve is 0.623 and the Youden index is 0.25. (D) A plot of test sensitivity, specificity, PV+, and PV- (y-coordinate) versus the confidence score obtained from Ali-M3 (x-coordinate) in patients using oxygen. The PV+ is dark gray and the PV- is light gray. The maximum PV+ is 43.5% and the maximum PV- is 56.5%.

Dataset focused on asymptomatic patients

There were 86 asymptomatic patients (RT-PCR positive, n = 37). In these patients only, the AUC was 0.623. With the cut-off point at 0.5, the sensitivity and specificity were 51.4% and 59.2%, respectively. With the cut-off point at 0.2, the sensitivity and specificity were 73.0% and 44.9%, respectively.

Dataset focused on patients requiring oxygen therapy

A total of 223 patients required oxygen (RT-PCR positive: 97). When using only these patients, the AUC was 0.828. When the cut-off point was set at 0.5, the sensitivity and specificity were 88.7% and 57.9%, respectively. When the cut-off point was set at 0.2, the sensitivity and specificity were 97.9% and 34.9%, respectively.

Effect of the thickness of the CT reconstruction slice

There were 320 patients (RT-PCR positive: 121) with a reconstruction slice thickness of less than 3 mm. In these patients only, the AUC was 0.825. When the cut-off point was set at 0.5, the sensitivity and specificity were 82.6% and 69.7%, respectively. When the cut-off point was set at 0.2, the sensitivity and specificity were 94.2% and 51.5%, respectively. In patients with a reconstruction slice thickness > 3 mm, the AUC was 0.789 (S1 Fig).

Discussion

In this external validation study, our results indicated that, at a lower cut-off value, Ali-M3 could be useful for the immediate triage of symptomatic patients with suspected COVID-19. In particular, greater accuracy was observed in patients with more severe disease, in patients scanned several days after symptom onset, and in images with thinner reconstructed CT slices.

Currently, all patients with symptoms such as fever are triaged as suspected COVID-19 patients. Medical practitioners must therefore use PPE for each of these patients [24], and bed zoning is essential to avoid contaminating non-infected patients [25]; these precautions must be continued until a definitive diagnosis is established. Conversely, under-triage results in hospital infections through the admission of infected patients to health care facilities. Because Ali-M3 is available on the cloud, the physician can receive results immediately by sending Digital Imaging and Communications in Medicine (DICOM) images from an ordinary picture archiving and communication system (PACS). For triage, clinicians require sufficient sensitivity, whereas specificity is less important [19]. The high sensitivity of the AI diagnosis at a cut-off of 0.2 makes it useful for excluding a diagnosis of COVID-19.
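Why a high-sensitivity cut-off supports ruling out COVID-19 can be seen by deriving the predictive values from sensitivity, specificity, and pretest prevalence with Bayes’ theorem. The Python sketch below is illustrative arithmetic rather than a result of this study: it plugs in the 0.2 cut-off estimates from Table 2 and a few prevalences, including the cohort’s RT-PCR positivity of 46.8%, which assumes that prevalence is representative of the setting where the tool is used.

```python
def predictive_values(sens: float, spec: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) from sensitivity, specificity and prevalence via Bayes' theorem."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    tp = sens * prevalence              # expected true-positive fraction
    fp = (1 - spec) * (1 - prevalence)  # expected false-positive fraction
    fn = (1 - sens) * prevalence        # expected false-negative fraction
    tn = spec * (1 - prevalence)        # expected true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity 0.892 and specificity 0.432 at the 0.2 cut-off (Table 2).
for prev in (0.1, 0.3, 0.468):
    ppv, npv = predictive_values(0.892, 0.432, prev)
    print(f"prevalence {prev:.3f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

The negative predictive value falls as prevalence rises, so the rule-out value of a negative Ali-M3 result depends on the pretest probability in the triage population.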

Ali-M3 also has the potential to support the diagnosis of COVID-19. The tools currently used for diagnosing COVID-19 are antibody, antigen, and RT-PCR tests. Both antigen and RT-PCR tests use tracheal secretions or saliva. An antigen test requires the antigen protein to be present above a detectable level and is currently inferior to the RT-PCR test; when the same patient sample is used, the antigen test cannot complement the RT-PCR test. The RT-PCR test is currently used as the gold standard, but its sensitivity changes with the number of days after symptom onset [19]. Therefore, excluding COVID-19 requires multiple tests staggered over time rather than a single negative RT-PCR test, and even when testing is performed as rapidly as possible, obtaining multiple results still takes a few days. By contrast, Ali-M3 uses structural information about the patient’s lungs, which is independent of the information obtained with RT-PCR, and may thereby complement the drawbacks of RT-PCR among symptomatic patients with suspected COVID-19.

In this study, the diagnostic accuracy at the validation stage was lower than that at the development stage. A two-gate (case-control) design was used in the development of the AI system, whereas in the present study we used a single-gate (cohort) design to evaluate the ability of Ali-M3 to diagnose COVID-19 from chest CT images. Although many studies have used the two-gate design to evaluate AI for the diagnosis of COVID-19 [26], this design is generally prone to overestimating diagnostic test accuracy [27]. Thus, applying the results of a two-gate design directly to clinical practice can be inappropriate. Other factors must also be considered. Studies using a two-gate design typically ignore the fact that RT-PCR is an imperfect reference standard, and ascertaining the true sensitivity of RT-PCR, for example through viral culture and additional testing, is difficult. In the present study, we simulated the diagnostic ability of Ali-M3 assuming that the sensitivity of the reference standard was imperfect; ignoring this imperfection leads to an underestimation of the specificity and AUC of Ali-M3 without distorting its sensitivity. Furthermore, the outcome definitions used in developing Ali-M3 and in examining its adequacy differed. Reflecting the patient flow in China, cases with negative RT-PCR results but positive CT findings were also counted as positive at the development stage [28]. This had a small effect on sensitivity but a large effect on specificity. For example, at the development stage, 33.9% of the positive patients had negative RT-PCR results and positive CT findings [28]. During development, Ali-M3 showed a sensitivity of 98.5% and a specificity of 99.2% [16]. When a positive RT-PCR result is used as the only reference, sensitivity changes from 97.7% to 100% and specificity from 80.8% to 81.6%.
Upgrading the AI so that only RT-PCR-positive cases are treated as positive at the development stage would be desirable.

This study had some limitations. First, Ali-M3 discriminated poorly in asymptomatic patients and did not show good specificity even when the cut-off was changed. Thus, Ali-M3 should not be used to screen asymptomatic patients. Although an alternative to the RT-PCR test is needed for screening for nosocomial infections and for screening patients with other diseases on admission, Ali-M3 is not recommended for these purposes. Second, we could not differentiate COVID-19 from other forms of viral pneumonia. Compared with the past five seasons, the number of people in Japan infected with influenza during this season was markedly low [29], and only a few patients in our cohort were diagnosed with other forms of viral pneumonia. Third, Ali-M3 may not reflect differences in imaging features among SARS-CoV-2 types. In addition to type A, which was initially prevalent in Asia, types B and C were prevalent in Europe and the United States. These types were not determined by the PCR test, so we could not evaluate these differences. Fourth, the decision process of AI systems is generally a black box. Although Ali-M3 also has black-box aspects, it displays the images underlying its decision [16].

Conclusions

We conducted a retrospective cohort study for the external validation of Ali-M3 using symptomatic patient data from Japanese tertiary care facilities. Despite the limited data available for analysis, our results indicated that AI-based CT diagnosis could be useful for excluding COVID-19 in symptomatic patients, particularly in patients requiring oxygen and in those scanned several days after symptom onset. Using Ali-M3 may reduce PPE consumption and prevent hospital infections caused by the admission of covertly infected patients. Moreover, Ali-M3 has the potential to complement RT-PCR in the diagnosis of patients with suspected COVID-19. However, as Ali-M3 has some limitations related to its development, further studies and additional training are warranted to update this system.

Supporting information

S1 Fig. Differential performance of Ali-M3 for coronavirus disease in patients divided by the thickness of the reconstructed slice of computed tomography.

(A) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score for reconstruction slices thinner than 3 mm. The area under the receiver operating characteristic (ROC) curve was 0.825 and the Youden index was 0.50. (B) A plot of test sensitivity, specificity, positive predictive value (PV+), and negative predictive value (PV-) (y-coordinate) versus the confidence score obtained from Ali-M3 (x-coordinate) for reconstruction slices thinner than 3 mm. PV+ is dark gray, and PV- is light gray. The maximum PV+ was 46.5%, and the maximum PV- was 53.5%. (C) A plot of test sensitivity (y-coordinate) versus its false-positive rate (x-coordinate) obtained at each cut-off level of the confidence score for reconstruction slices thicker than 3 mm. The area under the ROC curve was 0.789, and the Youden index was 0.50. (D) A plot of test sensitivity, specificity, PV+, and PV- (y-coordinate) versus the confidence score obtained from Ali-M3 (x-coordinate) for reconstruction slices thicker than 3 mm. PV+ is dark gray, and PV- is light gray. The maximum PV+ was 47.0%, and the maximum PV- was 53.0%.

(PNG)

S1 Table. The list of institutions from which patient medical data was obtained.

(DOCX)

S2 Table. Checklist of the guidelines of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis Statement.

(DOCX)

S3 Table. Inclusion criteria.

Patients who met at least one of the following criteria were considered symptomatic and were enrolled in the study.

(DOCX)

S4 Table. Computed tomography system and protocol

(DOCX)

S5 Table. Population characteristics in development of Ali-M3 from the datasheet.

(DOCX)

S1 File. The datasheet of Ali-M3.

(PDF)

S2 File. R code to evaluate the effect of the imperfect reference.

(DOCX)

Acknowledgments

We thank M3 Inc. and Clinical Porter for providing Ali-M3 and data storage free of charge; they did not participate in preparing the protocol or the manuscript. The analysis of the CT images by Ali-M3 was carried out by Nobori on behalf of M3 (M3 and Nobori were blinded to the patients’ data, including the RT-PCR results). Ali-M3 was officially approved by the Japanese PMDA using our data on June 29, 2020 (approval number from PMDA: 30200BZX00212000, https://www.pmda.go.jp/english/about-pmda/0002.html). To access the Ali-M3 system, please contact M3 (m3-ai-lab@m3.com). We also thank Ms. Kyoko Wasai, who assisted in retrieving data, and Editage (http://www.editage.com) for English language editing and review of this manuscript. The group author affiliations were as follows: Shingo Hamaguchi, Takafumi Haraguchi (St. Marianna University School of Medicine), Shungo Yamamoto (Kyoto City Hospital), Hiromitsu Sumikawa, Koji Nishida (Sakai City Medical Center), Haruka Nishida, Koichi Ariyoshi (Kobe City Medical Center General Hospital), Hiroshi Shinmoto, Hiroaki Sugiura (National Defense Medical College Hospital), Hidenori Nakagawa, Tomohiro Asaoka (Osaka City General Hospital), Naofumi Yoshida (Kobe University Graduate School of Medicine), Rentaro Oda (Tokyobay Urayasu Ichikawa Medical Center), Takashi Koyama, Yui Iwai (Hyogo Prefectural Amagasaki General Medical Center), and Yoshihiro Miyashita (Yamanashi Prefectural Central Hospital). The lead author for this group was Koichi Ariyoshi (kobe9914@yahoo.co.jp).

Data Availability

Chest CT images and individual clinical information cannot be made public because of restrictions imposed by the IRB and by Japanese domestic law and guidelines, which do not allow us to release our data under Article 16 of the "Act on the Protection of Personal Information" (http://www.japaneselawtranslation.go.jp/law/detail/?id=2781&vm=04&re=01). The Hyogo Prefectural Amagasaki General Medical Center functioned as the central ethical review committee. Data access requests may be directed to Ms. Kyoko Wasai (contact via agmc.irb@gmail.com).

Funding Statement

The authors did not receive financial funding for this study.

References

  • 1. Maves RC, Downar J, Dichter JR, Hick JL, Devereaux A, Geiling JA, et al. Triage of Scarce Critical Care Resources in COVID-19 An Implementation Guide for Regional Allocation: An Expert Panel Report of the Task Force for Mass Critical Care and the American College of Chest Physicians. Chest. 2020;158(1):212–25. Epub 2020/04/15. doi: 10.1016/j.chest.2020.03.063; PubMed Central PMCID: PMC7151463.
  • 2. Carenzo L, Costantini E, Greco M, Barra FL, Rendiniello V, Mainetti M, et al. Hospital surge capacity in a tertiary emergency referral centre during the COVID-19 outbreak in Italy. Anaesthesia. 2020;75(7):928–34. doi: 10.1111/anae.15072.
  • 3. Li Y, Xia L. Coronavirus Disease 2019 (COVID-19): Role of Chest CT in Diagnosis and Management. AJR American journal of roentgenology. 2020:1–7. Epub 2020/03/05. doi: 10.2214/AJR.20.22954.
  • 4. Salehi S, Abedi A, Balakrishnan S, Gholamrezanezhad A. Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients. AJR American journal of roentgenology. 2020:1–7. Epub 2020/03/17. doi: 10.2214/AJR.20.23034.
  • 5. Zhou S, Wang Y, Zhu T, Xia L. CT Features of Coronavirus Disease 2019 (COVID-19) Pneumonia in 62 Patients in Wuhan, China. AJR American journal of roentgenology. 2020:1–8. Epub 2020/03/07. doi: 10.2214/ajr.20.22975.
  • 6. Chaganti S, Balachandran A, Chabin G, Cohen S, Flohr T, Georgescu B, et al. Quantification of Tomographic Patterns associated with COVID-19 from Chest CT. ArXiv. 2020. Epub 2020/06/19. PubMed Central PMCID: PMC7280906.
  • 7. Liu K-C, Xu P, Lv W-F, Qiu X-H, Yao J-L, Gu J-F, et al. CT manifestations of coronavirus disease-2019: A retrospective analysis of 73 cases by disease severity. European Journal of Radiology. 2020;126:108941. doi: 10.1016/j.ejrad.2020.108941.
  • 8. Pan F, Ye T, Sun P, Gui S, Liang B, Li L, et al. Time Course of Lung Changes at Chest CT during Recovery from Coronavirus Disease 2019 (COVID-19). Radiology. 2020;295(3):715–21. Epub 2020/02/14. doi: 10.1148/radiol.2020200370; PubMed Central PMCID: PMC7233367.
  • 9. Bai HX, Hsieh B, Xiong Z, Halsey K, Choi JW, Tran TML, et al. Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT. Radiology. 2020;296(2):E46–E54. Epub 2020/03/11. doi: 10.1148/radiol.2020200823; PubMed Central PMCID: PMC7233414.
  • 10. Pesapane F, Codari M, Sardanelli F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp. 2018;2(1):35. Epub 2018/10/26. doi: 10.1186/s41747-018-0061-6; PubMed Central PMCID: PMC6199205.
  • 11. Huang L, Han R, Ai T, Yu P, Kang H, Tao Q, et al. Serial Quantitative Chest CT Assessment of COVID-19: Deep-Learning Approach. Radiology: Cardiothoracic Imaging. 2020;2(2):e200075. doi: 10.1148/ryct.2020200075.
  • 12. Li L, Qin L, Xu Z, Yin Y, Wang X, Kong B, et al. Artificial Intelligence Distinguishes COVID-19 from Community Acquired Pneumonia on Chest CT. Radiology. 2020:200905.
  • 13. Liu W, Liu M, Guo X, Zhang P, Zhang L, Zhang R, et al. Evaluation of acute pulmonary embolism and clot burden on CTPA with deep learning. European radiology. 2020;30(6):3567–75. Epub 2020/02/18. doi: 10.1007/s00330-020-06699-8.
  • 14. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ (Clinical research ed). 2020;368:m689. Epub 2020/03/28. doi: 10.1136/bmj.m689.
  • 15. Wynants L, Van Calster B, Bonten MMJ, Collins GS, Debray TPA, De Vos M, et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ (Clinical research ed). 2020;369:m1328. Epub 2020/04/09. doi: 10.1136/bmj.m1328.
  • 16. The Alibaba DAMO Academy. COVID-19 AI Assisted Analysis Based On Chest CT Imaging. The Alibaba DAMO Academy; 7 May 2020.
  • 17. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Annals of Internal Medicine. 2015;162(1):55–63. doi: 10.7326/M14-0697.
  • 18. Lippi G, Simundic AM, Plebani M. Potential preanalytical and analytical vulnerabilities in the laboratory diagnosis of coronavirus disease 2019 (COVID-19). Clin Chem Lab Med. 2020. Epub 2020/03/17. doi: 10.1515/cclm-2020-0285.
  • 19. Kucirka LM, Lauer SA, Laeyendecker O, Boon D, Lessler J. Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction-Based SARS-CoV-2 Tests by Time Since Exposure. Ann Intern Med. 2020;173(4):262–7. Epub 2020/05/19. doi: 10.7326/M20-1495; PubMed Central PMCID: PMC7240870.
  • 20. Limmathurotsakul D, Turner EL, Wuthiekanun V, Thaipadungpanit J, Suputtamongkol Y, Chierakul W, et al. Fool’s Gold: Why Imperfect Reference Tests Are Undermining the Evaluation of Novel Diagnostics: A Reevaluation of 5 Diagnostic Tests for Leptospirosis. Clinical Infectious Diseases. 2012;55(3):322–31. doi: 10.1093/cid/cis403.
  • 21. Long QX, Liu BZ, Deng HJ, Wu GC, Deng K, Chen YK, et al. Antibody responses to SARS-CoV-2 in patients with COVID-19. Nat Med. 2020;26(6):845–8. Epub 2020/05/01. doi: 10.1038/s41591-020-0897-1.
  • 22. Rubin GD, Ryerson CJ, Haramati LB, Sverzellati N, Kanne JP, Raoof S, et al. The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society. Radiology. 2020;296(1):172–80. Epub 2020/04/08. doi: 10.1148/radiol.2020201365; PubMed Central PMCID: PMC7233395.
  • 23. He L, Huang Y, Ma Z, Liang C, Liang C, Liu Z. Effects of contrast-enhancement, reconstruction slice thickness and convolution kernel on the diagnostic performance of radiomics signature in solitary pulmonary nodule. Sci Rep. 2016;6:34921. Epub 2016/10/11. doi: 10.1038/srep34921; PubMed Central PMCID: PMC5056507.
  • 24. World Health Organization. Rational use of personal protective equipment (PPE) for coronavirus disease (COVID-19): interim guidance, 19 March 2020. 2020.
  • 25. Liu J, Yang J, Li S, Chen J, Yang L, Zhao Z, et al. Gynecological prevention and control model based on ward rearrangement and zoning management in pandemic period of COVID-19. Panminerva Med. 2020. Epub 2020/05/18.
  • 26. Pham TD. A comprehensive study on classification of COVID-19 on computed tomography with pretrained convolutional neural networks. Sci Rep. 2020;10(1):16942. Epub 2020/10/11. doi: 10.1038/s41598-020-74164-z; PubMed Central PMCID: PMC7547710.
  • 27. Rutjes AW, Reitsma JB, Di Nisio M, Smidt N, van Rijn JC, Bossuyt PM. Evidence of bias and variation in diagnostic accuracy studies. CMAJ: Canadian Medical Association journal = journal de l’Association medicale canadienne. 2006;174(4):469–76. Epub 2006/02/16. doi: 10.1503/cmaj.050090; PubMed Central PMCID: PMC1373751.
  • 28. Ai T, Yang Z, Hou H, Zhan C, Chen C, Lv W, et al. Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology. 2020:200642. doi: 10.1148/radiol.2020200642.
  • 29. Sakamoto H, Ishikane M, Ueda P. Seasonal Influenza Activity During the SARS-CoV-2 Outbreak in Japan. JAMA. 2020;323(19):1969–71. Epub 2020/04/11. doi: 10.1001/jama.2020.6173; PubMed Central PMCID: PMC7149351.

Decision Letter 0

Haoran Xie

29 Mar 2021

PONE-D-21-03621

Accuracy of deep learning-based computed tomography diagnostic system of COVID-19: A consecutive sampling external validation cohort study

PLOS ONE

Dear Dr. Ikenoue,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 11 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Haoran Xie

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please ensure that you include a title page within your main document. We do appreciate that you have a title page document uploaded as a separate file, however, as per our author guidelines (http://journals.plos.org/plosone/s/submission-guidelines#loc-title-page) we do require this to be part of the manuscript file itself and not uploaded separately.

Could you therefore please include the title page into the beginning of your manuscript file itself, listing all authors and affiliations.

3. Thank you for providing the date(s) when patient medical information was initially recorded (between January 1 and April 15, 2020). Please also include the date(s) on which your research team accessed the databases/records to obtain the retrospective data used in your study.

4. In your methods section or in the supplementary material, please provide the names of the 11 institutions where patient medical data was obtained from.

5. In your methods section, please provide the names and catalog numbers of the RT-PCR tests used in this study.

6. Thank you for stating the following financial disclosure:

"NO"

At this time, please address the following queries:

  1. Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

  2. State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

  3. If any authors received a salary from any of your funders, please state which authors and which funders.

  4. If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

7. Thank you for stating the following in your Competing Interests section: 

"NO"

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

 This information should be included in your cover letter; we will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

8. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

9. One of the noted authors is a group or consortium [Japan COVID-19 AI team]. In addition to naming the author group, please list the individual authors and affiliations within this group in the acknowledgments section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.

10. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This study carried out an external validation of a commercial tool, Ali-M3. Such validation is necessary in the field of AI-based medical systems. A number of concerns should be resolved before a further decision can be made.

1. The tool is a commercial, cloud-based tool, which means the commercial provider may change the code and models at will. Moreover, the source code of Ali-M3 is not publicly available. Please clarify how this study ensures the replicability of Ali-M3.

2. The authors state that their data are also unavailable to the public. The validation data are simply chest CT images, which are easy to anonymize, and many chest CT image databases are freely available. The chest CT images, the clinical data, and the diagnostic results of the samples should therefore be released to the public after anonymization. The prediction results of Ali-M3 should also be released to the public for replication purposes.

3. Free access to the commercial tool and online data storage IS financial support. Please clarify this in the conflict of interest statement.

4. The current cohort consists of 617 patients, with 289 COVID-19-positive patients and 223 patients with severe symptoms (needing oxygen support). In practice, COVID-19-negative patients would far outnumber positive ones. Given that the specificity is only 43.2% at an Ali-M3 score threshold of 0.2, please clarify how the resulting large number of false positives would be handled.

5. The results should be discussed more rigorously. For example, the Abstract states that “sensitivity increased for both cut-off values after 5 days”, but only one threshold (0.2) was mentioned in the Abstract.

6. For the “223 patients who required oxygen support”, it is misleading to omit the specificity. If the threshold is set to an extreme value (such as 0), sensitivity reaches 100%, but that does not make an intelligent tool.

7. The commercial provider of Ali-M3 has a website in Japanese only, so it is impossible to assess whether this company is a solid AI company or merely a contractor for Ali-M3. The quality and stability of Ali-M3 are therefore unpredictable.

8. Does Ali-M3 have a medical license approved by a governmental agency?

9. This study cites Ali-M3 via an internal report from a commercial company that is not the service provider, “M3”. Please clarify this.

10. What is the online link to the validated tool Ali-M3? It is not acceptable to ask anonymous reviewers to contact the commercial provider to access the cloud-based tool.
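Reviewer #1's point 4 turns on how many false positives a low-specificity operating point produces when RT-PCR-negative patients dominate. The minimal Python sketch below is illustrative only and is not part of the study's analysis: it applies Bayes' theorem to the sensitivity/specificity pairs reported in the Abstract (80.6%/68.3% at a cut-off of 0.5; 89.2%/43.2% at 0.2). The cohort prevalence is 289/617; the other prevalence values are hypothetical, and the sketch assumes sensitivity and specificity carry over unchanged to a lower-prevalence population, which spectrum effects may not permit.

```python
from typing import Dict, Tuple

# (sensitivity, specificity) reported in the Abstract for each Ali-M3 cut-off.
OPERATING_POINTS = {0.5: (0.806, 0.683), 0.2: (0.892, 0.432)}


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at the given prevalence (Bayes' theorem)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


def expected_counts(sensitivity: float, specificity: float, prevalence: float,
                    n_patients: int) -> Dict[str, float]:
    """Expected 2x2 cell counts when n_patients are tested at the given prevalence."""
    positives = prevalence * n_patients
    negatives = n_patients - positives
    return {
        "true_positive": sensitivity * positives,
        "false_negative": (1 - sensitivity) * positives,
        "false_positive": (1 - specificity) * negatives,
        "true_negative": specificity * negatives,
    }


if __name__ == "__main__":
    # Cohort prevalence first; 0.10 and 0.02 are hypothetical settings.
    for prevalence in (289 / 617, 0.10, 0.02):
        for cutoff, (sens, spec) in OPERATING_POINTS.items():
            ppv, npv = predictive_values(sens, spec, prevalence)
            fp = expected_counts(sens, spec, prevalence, 1000)["false_positive"]
            print(f"prevalence={prevalence:.3f} cut-off={cutoff}: "
                  f"PPV={ppv:.3f} NPV={npv:.3f} false positives per 1000={fp:.0f}")
```

At a hypothetical 2% prevalence and a cut-off of 0.2, for example, roughly 557 of every 1000 patients tested would be false positives, while the NPV stays above 0.99; this is the trade-off between ruling out COVID-19 and generating false alarms that the reviewer asks the authors to address.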

Reviewer #2: The manuscript is about a system for real-time sentiment prediction on Twitter streaming data for the coronavirus pandemic. The paper is well organised, but I still have some concerns:

1) In my opinion, the paper's contributions are not significant. There is no novelty.

2) There is some repetitive information in different parts of the manuscript about Twitter, sentiment analysis, etc.

3) The Results section is merely the tables restated in prose.

4) The Discussion section does not discuss anything; it just repeats the Results section in other words.

5) There are some punctuation mistakes in the manuscript.

Reviewer #3: The contribution of this research paper is not clear. I am sorry to say that I cannot get the point of this paper from the manuscript. Although you state your purpose as "Ali-M3, ... However, Ali-M3 has not been externally validated.", this statement does not show what you want to do in this research paper.

Based on the conclusion of this paper, "Our results indicated that AI-based CT diagnosis could be useful for ...", it seems that you want to prove that Ali-M3 can be used to diagnose COVID-19, but the data samples used to evaluate Ali-M3 and the results are not good enough to support your conclusion. There are only several hundred samples in your evaluation; moreover, you did not provide background information about those samples, such as how they were collected and which groups of people they covered. So, in my opinion, they cannot represent all COVID-19 situations.

Besides the insufficient testing samples, the performance of the model, with AUCs of 0.79 and 0.82, is not very good. How could a model with such performance be used in COVID-19 diagnosis?

Another question: what is your work in this research? From the manuscript, I see that you ran the Ali-M3 model, which is already a usable deep learning model, on patient data that I cannot tell whether you collected, and performed some simple analysis of the results. Is this all you did in this research? What is the significance of what you did? Perhaps you could add more content to your manuscript about what you did, such as data collection, sample pre-processing, model adjustment, in-depth analysis, diagnostic direction, practice guidelines, or other aspects.

Much of the analysis focused on cut-off point adjustment. However, what is the meaning of those analyses? Sensitivity and specificity change substantially with different cut-off values, and they can be affected by the ratio of positive to negative samples in the testing dataset. So I think it is not necessary to analyze those values, because they cannot represent the real performance of a prediction model.
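Reviewer #1 (point 6) and Reviewer #3 both question how the choice of cut-off drives sensitivity and specificity. As an illustrative aid only, the Python sketch below computes both metrics at a given cut-off from per-patient scores and RT-PCR labels. It assumes that a score at or above the cut-off counts as a positive call (the exact rule is not stated in this section), the function name `sens_spec_at_cutoff` is hypothetical, and the scores are toy values, not study data. Because sensitivity is computed only over RT-PCR-positive patients and specificity only over RT-PCR-negative ones, adding more negatives with the same score distribution leaves both unchanged; predictive values, by contrast, shift with prevalence.

```python
from typing import Sequence, Tuple


def sens_spec_at_cutoff(scores: Sequence[float], labels: Sequence[int],
                        cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity when a score >= cutoff is called positive.

    labels follow the reference standard: 1 = RT-PCR positive, 0 = RT-PCR negative.
    """
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    negative_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one RT-PCR-positive and one negative case")
    # Sensitivity uses only reference-positive patients ...
    sensitivity = sum(s >= cutoff for s in positive_scores) / len(positive_scores)
    # ... and specificity uses only reference-negative patients.
    specificity = sum(s < cutoff for s in negative_scores) / len(negative_scores)
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical toy scores in [0, 1], not study data.
    scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    for cutoff in (0.0, 0.2, 0.5):
        print(cutoff, sens_spec_at_cutoff(scores, labels, cutoff))
    # Tripling the negatives with the same scores changes only the case mix.
    print(0.5, sens_spec_at_cutoff(scores + [0.6, 0.2, 0.1] * 2,
                                   labels + [0] * 6, 0.5))
```

With these toy scores, a cut-off of 0 calls every patient positive, giving a sensitivity of 1.0 and a specificity of 0.0, which is the degenerate case Reviewer #1 describes; a cut-off of 0.5 gives 2/3 for both metrics, with or without the extra negatives.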

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Nov 4;16(11):e0258760. doi: 10.1371/journal.pone.0258760.r002

Author response to Decision Letter 0


1 Sep 2021

Dear Editor,

We appreciate the opportunity to revise our manuscript. Please find below our responses to the editorial comments and reviewers' comments.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Response: Thank you for your comments. We have modified the manuscript format.

2. Please ensure that you include a title page within your main document. We do appreciate that you have a title page document uploaded as a separate file, however, as per our author guidelines (http://journals.plos.org/plosone/s/submission-guidelines#loc-title-page) we do require this to be part of the manuscript file itself and not uploaded separately.

Could you therefore please include the title page into the beginning of your manuscript file itself, listing all authors and affiliations.

Response: Thank you for your comments. We have added the title page to the manuscript.

3. Thank you for providing the date(s) when patient medical information was initially recorded (between January 1 and April 15, 2020). Please also include the date(s) on which your research team accessed the databases/records to obtain the retrospective data used in your study.

Response: Thank you for your advice. We added the following sentence to the Materials and Methods section:

Materials and Methods: Study design

We collected data from medical records between April 15 and May 31, 2020.

4. In your methods section or in the supplementary material, please provide the names of the 11 institutions where patient medical data was obtained from.

Response: Thank you for your advice regarding the information from the institutions. We have added the following sentence to the Materials and Methods section and the supplemental material.

Materials and Methods: Study design

The institutions where patient medical data were obtained are listed in Supplemental Table 1.

5. In your methods section, please provide the names and catalog numbers of the RT-PCR tests used in this study.

Response: Unfortunately, this study placed no restrictions on which RT-PCR tests could be used, so we are unable to provide the names and catalog numbers of the tests. However, legitimate medical institutions in Japan performed PCR tests using the following methods.

https://www.niid.go.jp/niid/images/lab-manual/2019-nCoV20200319.pdf

Shirato K, et al. (2020). Development of genetic diagnostic methods for novel coronavirus 2019 (nCoV-2019) in Japan. Jpn J Infect Dis. 2020;73(4):304–7. doi: 10.7883/yoken.JJID.2020.061.

Matsuyama S, et al. (2020). Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells. Proc Natl Acad Sci U S A. 2020 Mar 31;117(13):7001-7003.

Shirato K, et al. (2020). Performance evaluation of real-time RT-PCR assays for detection of severe acute respiratory syndrome coronavirus-2 developed by the National Institute of Infectious Diseases, Japan. Jpn J Infect Dis. 2021. https://www.niid.go.jp/niid/images/jjid/COVID19/No27_2020-1079R1_20210202.pdf.

6. Thank you for stating the following financial disclosure:

"NO"

At this time, please address the following queries:

a. Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

b. State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c. If any authors received a salary from any of your funders, please state which authors and which funders.

d. If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Response: Thank you for providing detailed information on the financial disclosure. We have answered your questions individually.

a. No financial funding was received for this study. M3 Inc. and Clinical Porter provided Ali-M3 and data storage for free.

b. M3 Inc. and Clinical Porter did not participate in preparing the protocol or the manuscript. At the initiation of the study, Ali-M3 existed as a research-use tool produced by Alibaba Damo (Hangzhou) Technology Co., Ltd that had not received any approval, and it was not known whether it would be of any value. Japanese law therefore did not allow Ali-M3 to be used as a diagnostic tool in actual practice. M3 was merely one of the channels connecting Japanese researchers with Alibaba. We approached M3 to validate Ali-M3 because, without validation, it could not be used for chest CT-based diagnosis of COVID-19. With the results of this study, we confirmed that Ali-M3 has clinical benefits. Ali-M3 was subsequently approved by the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) under a special expedited review in Japan and licensed for use as a diagnostic tool in actual practice. To obtain the license, Ali-M3 had to be a commercial tool. However, commercialization was against our wishes; we wanted Ali-M3 to be available free of charge, even if only for a limited period.

c. No author received a salary from M3 Inc. and Clinical Porter.

d. We added the sentence below to the COI section.

# Competing Interests

The authors did not receive any financial funding for this work.

7. Thank you for stating the following in your Competing Interests section:

"NO"

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

Response: Thank you for your comment on COI. The authors have no competing interests to declare. We have added the following information to the COI section of the manuscript:

Competing interests (COI).

The authors declare no competing interests. The authors did not receive financial funding for this study. At the initiation of the study, Ali-M3 existed as a research-use tool produced by Alibaba Damo (Hangzhou) Technology Co., Ltd that had not received any regulatory approval. The results of this study confirm that Ali-M3 has clinical benefits. Therefore, under a special expedited review in Japan, Ali-M3 has been approved by the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) and licensed for use as a diagnostic tool in actual practice (Approval number: 30200BZX00212000, Datasheet: https://www.pmda.go.jp/files/000235943.pdf). To obtain a license in Japan, Ali-M3 had to become a commercial tool. At the outset, it was not anticipated that M3 Inc. would benefit from the commercialization of Ali-M3 as a result of our research. M3 provided Ali-M3 and data storage free of charge.

8. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

Response: We regret that the chest CT images and individual clinical information cannot be made publicly available because of the IRB decisions and the Japanese guidelines for protecting personal information. This study was performed in a worldwide emergency setting, and we received immediate approval from the IRBs, which was not the routine approval process. The IRB of each facility approved the study and waived the need to obtain written informed consent. The Japanese guidelines concerning personal information do not allow personal data to be used by third parties without patient consent. On the other hand, the accuracy and reliability of the data were confirmed by the PMDA during the approval process, by comparing the raw data at each hospital with the analysis data. We modified the Methods section to describe the special IRB circumstances and the reliability of the data as follows:

Materials and Methods: Study design

The Institutional Review Board of each facility approved the study. The requirement to obtain written informed consent was waived because the study was judged to be an emergency study with public health implications. The accuracy and reliability of the data were confirmed by the PMDA during the approval process of Ali-M3.

9. One of the noted authors is a group or consortium [Japan COVID-19 AI team]. In addition to naming the author group, please list the individual authors and affiliations within this group in the acknowledgments section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.

Response: Thank you for your advice concerning the group authors. We listed the individual authors and their affiliations within the group in the acknowledgments section. The lead author of this group was Koichi Ariyoshi (kobe9914@yahoo.co.jp). We have added this information to the acknowledgments.

10. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Response: Thank you for your advice regarding Supporting Information files. We have modified the Supporting Information files and citations in the manuscript.

Reviewer #1:

This study carried out an external validation of a commercial tool Ali-m3. This is necessary for the area of AI-based medical systems. A number of concerns should be resolved before a further decision could be made.

Reply:

Thank you for your comments.

Ali-M3 is a diagnostic tool used for COVID-19 identification on chest CT. Although many Japanese hospitals have CT systems in their facilities, Japanese practitioners cannot effectively use chest CT in the diagnosis of COVID-19.

At the initiation of the study, Ali-M3 existed as a research-use tool produced by Alibaba Damo (Hangzhou) Technology Co., Ltd that had not received regulatory approval, and it was not known whether it would be of any value. Therefore, Japanese law did not allow Ali-M3 to be used as a diagnostic tool in actual practice. M3 was merely one of the channels connecting Japanese researchers with Alibaba. We approached M3 to validate Ali-M3 because the lack of validation prevented the use of chest CT in the diagnosis of COVID-19. The results of this study confirmed that Ali-M3 has clinical benefits. Therefore, under a special expedited review in Japan, Ali-M3 has been approved by the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) and licensed for use as a diagnostic tool in actual practice. To obtain the license, Ali-M3 had to become a commercial tool. However, commercialization was against our wishes; we wanted Ali-M3 to be available free of charge, even if only for a limited period.

1. The tool is a commercial tool on a cloud system, which means the commercial provider may change the code and models as they wish. Moreover, the source code of Ali-m3 is not publicly available. Please clarify how this study ensures the replicability of this tool Ali-m3.

Reply:

Thank you for your comment regarding replicability.

Ali-M3 is a fixed model, and the same model that we used in this study can be obtained commercially. To date, no medical device that continues learning after approval has been approved in Japan. For use in academic research, Alibaba Damo (Hangzhou) Technology Co., Ltd provided M3 with a program whose learning process had already been halted.

The source code of Ali-M3 cannot be made public because Ali-M3 has been approved as a diagnostic tool for actual practice and became a commercial product in this process.

2. The authors mentioned that their data are not available to the public either. The validation data are simply chest CT images, which are very easy to anonymize. There are many freely available databases of chest CT images. So the chest CT images, the clinical data, and the diagnosis results of the samples need to be released to the public after being anonymized. The prediction results of the tool Ali-m3 should also be released to the public for replication purposes.

Reply:

Thank you for your comment regarding external validation using other available data.

We searched as thoroughly as possible for an external dataset obtained by consecutive (sequential) sampling, but we did not find any such dataset. We discussed external validation using two-gate and single-gate designs in the third paragraph of the Discussion section. The purpose of our study was to perform external validation using a single-gate design. Although many studies have used the two-gate design to evaluate AI for the diagnosis of COVID-19, the two-gate design is generally prone to overestimating diagnostic accuracy. Thus, applying results based on a two-gate design directly to clinical situations can be inappropriate, and we believe that external validation using a two-gate design is not meaningful. If a consecutively sampled dataset were available, we would validate Ali-M3 on it.

3. The free access to the commercial tool and online data storage IS a financial support. Please clarify this in the conflict of interest statement.

Reply:

Thank you for your comment. We apologize for the complex COI situation. As previously described, Ali-M3 was not a commercial tool during the study period. Because of the worldwide pandemic emergency, approval was needed immediately so that Ali-M3 could be used in the actual diagnosis of COVID-19. In a routine situation, we believe our paper would have been published before Ali-M3 was approved and commercialized. As you recommended, we have added this sequence of events to the COI statement as follows:

Competing interests (COI).

The authors declare no competing interests. The authors did not receive financial funding for this study. At the initiation of the study, Ali-M3 existed as a research-use tool produced by Alibaba Damo (Hangzhou) Technology Co., Ltd that had not received any regulatory approval. The results of this study confirm that Ali-M3 has clinical benefits. Therefore, under a special expedited review in Japan, Ali-M3 has been approved by the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) and licensed for use as a diagnostic tool in actual practice (Approval number: 30200BZX00212000, Datasheet: https://www.pmda.go.jp/files/000235943.pdf). To obtain a license in Japan, Ali-M3 had to become a commercial tool. At the outset, it was not anticipated that M3 Inc. would benefit from the commercialization of Ali-M3 as a result of our research. M3 provided Ali-M3 and data storage free of charge.

4. The current cohort consists of 617 patients, with 289 COVID-19 positive patients, and 223 patients with severe symptoms (needing oxygen support). The practical situation has many more COVID-19 negative patients. Considering the specificity is only 43.2% using the Ali-m3 score threshold 0.2, please clarify how to handle the increasing high number of false positives.

Reply:

Thank you for your important comments.

The answer depends on the setting in which the model is used. As described in the Discussion, we warned against the use of Ali-M3 as a screening tool; instead, we defined the target population as patients with a high pretest probability. Moreover, as described in the Discussion, we used Ali-M3 to rule out (i.e., exclude) COVID-19, not to rule it in. In this setting, sensitivity is the appropriate measure for evaluation. Even for patients with severe illness requiring supplemental oxygen, ruling out should be considered as part of triage. In response to your comments, we have added the concern about specificity to the limitations.
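To make the rule-out reasoning concrete, the short sketch below applies Bayes' theorem to the cut-off 0.2 operating point reported in the Abstract (sensitivity 89.2%, specificity 43.2%) and computes the positive and negative predictive values at several pretest probabilities. Only the cohort's RT-PCR positivity (289/617) comes from the study; the pretest probabilities of 0.20 and 0.05 are hypothetical values chosen for illustration, and this is not an analysis reported in the manuscript.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) of a binary test at a given pretest probability (Bayes' theorem)."""
    tp = sensitivity * prevalence              # true positives per patient tested
    fn = (1 - sensitivity) * prevalence        # missed COVID-19 cases
    tn = specificity * (1 - prevalence)        # correctly excluded patients
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return tp / (tp + fp), tn / (tn + fn)


SENS, SPEC = 0.892, 0.432  # cut-off 0.2, as reported in the Abstract
# 289/617 is the cohort's RT-PCR positivity; 0.20 and 0.05 are hypothetical
for prevalence in (289 / 617, 0.20, 0.05):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"pretest={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, the NPV rises and the PPV falls as the pretest probability drops. This is the trade-off behind the reviewer's concern about false positives in lower-prevalence settings and behind the restriction to a high-pretest-probability population.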

# Discussion

First, the differentiation performance of Ali-M3 was poor in asymptomatic patients and Ali-M3 did not show good specificity even if the cut-off was changed; thus, Ali-M3 should not be used to screen asymptomatic patients.

5. The results should be strictly discussed. For example, in the Abstract, “sensitivity increased for both cut-off values after 5 days”. But only one threshold 0.2 was mentioned in the Abstract.

Reply:

Thank you for your comment.

We modified the Results section in the abstract as follows.

# Abstract

Results: Of the 617 patients, 289 (46.8%) were RT-PCR-positive. The area under the curve (AUC) of Ali-M3 for predicting a COVID-19 diagnosis was 0.797 (95% confidence interval: 0.762‒0.833) and the goodness-of-fit was P = 0.156. With a cut-off probability of a diagnosis of COVID-19 by Ali-M3 set at 0.5, the sensitivity and specificity were 80.6% and 68.3%, respectively. A cut-off of 0.2 yielded a sensitivity and specificity of 89.2% and 43.2%, respectively. Among the 223 patients who required oxygen, the AUC was 0.825. Sensitivity at cut-offs of 0.5 and 0.2 was 88.7% and 97.9%, respectively. Although the sensitivity was lower when fewer days had elapsed since symptom onset, the sensitivity increased for both cut-off values after 5 days.

6. And for the “223 patients who required oxygen support”, it’s misleading to skip mentioning the specificity. If we set the threshold to the extreme value (like 0), we can get 100% in sensitivity. But that is not an intelligent tool.

Reply:

Thank you for your comments.

Although discussing performance at shifting thresholds may be misleading, we discussed only the thresholds fixed at 0.2 and 0.5 and did not arbitrarily change these values in the Discussion. However, in clinical practice, it is important to shift threshold values according to the purpose of use, because the relative cost of each type of misclassification differs with the clinical situation. Clinicians are not interested in performance across all thresholds; they focus on clinically relevant thresholds. Because we could not identify a single clinically relevant threshold among all possible thresholds, we focused on the thresholds of 0.2 and 0.5.

We reported the specificity for the 223 patients who required oxygen in the Results section. Our main point is that Ali-M3 is suitable for ruling out COVID-19, and for this purpose the key information is sensitivity, not specificity. Therefore, we believe that reporting specificity for this sensitivity analysis is not required in the Abstract. Nevertheless, your concern about misleading readers is important, and we modified the conclusion in the Abstract as follows:

# Abstract

Conclusion: We evaluated Ali-M3 using external validation. Because Ali-M3 showed sufficient sensitivity despite lower specificity, it was deemed useful in excluding a diagnosis of COVID-19.
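The reviewer's point about extreme thresholds, and the choice of the fixed cut-offs 0.2 and 0.5, can be illustrated with a minimal sketch that computes sensitivity and specificity at a given cut-off. Two assumptions are made here: a score greater than or equal to the cut-off is treated as a positive call (the exact rule used by Ali-M3 is not stated in this correspondence), and the scores are hypothetical toy values, not study data.

```python
def sensitivity_specificity(scores: list[float], labels: list[int], cutoff: float) -> tuple[float, float]:
    """Sensitivity and specificity when score >= cutoff is called positive (label 1 = RT-PCR-positive)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy scores, NOT study data: five RT-PCR-positive, five negative patients.
scores = [0.92, 0.75, 0.55, 0.35, 0.15, 0.60, 0.40, 0.25, 0.10, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for cutoff in (0.5, 0.2, 0.0):
    sens, spec = sensitivity_specificity(scores, labels, cutoff)
    print(f"cut-off {cutoff:.1f}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On these toy values the cut-offs 0.5, 0.2, and 0.0 give sensitivity/specificity of 0.60/0.80, 0.80/0.40, and 1.00/0.00. Lowering the cut-off trades specificity for sensitivity, and a cut-off of 0 reaches perfect sensitivity only by calling every patient positive, which is why reporting a fixed, pre-stated cut-off matters.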

7. The commercial provider for Ali-m3 has a website in Japanese only. It’s impossible to review whether this company is a solid AI company or maybe just a contractor of this tool Ali-m3. So the quality and stability of Ali-m3 is unpredictable.

Reply:

Thank you for your comments.

Unfortunately, M3's website (https://corporate.m3.com/en/) has no specific description of its AI sector. However, information on Ali-M3 is available on the same site (https://corporate.m3.com/en/ir/20200629_2/Microsoft%20Word%20-%20AI_Ali-M3_APPROVAL_E.pdf). The quality and stability of Ali-M3 were officially confirmed by the Japanese PMDA using our data on June 29, 2020 (https://www.pmda.go.jp/english/about-pmda/0002.html).

# Acknowledgement

Ali-M3 was officially approved by the Japanese PMDA using our data on June 29, 2020. (https://www.pmda.go.jp/english/about-pmda/0002.html).

8. Does Ali-m3 have a medical license approved by some governmental agencies?

Reply:

Yes. The PMDA approved Ali-M3 (# 30200BZX00212000) on June 29, 2020. We have added this fact to the Acknowledgement section as follows:

Ali-M3 was officially approved by the Japanese PMDA using our data on June 29, 2020. (https://www.pmda.go.jp/english/about-pmda/0002.html).

9. This study cited the commercial tool Ali-m3 by an internal report of a commercial company, which is not the service provider “m3”. Please clarify this.

Reply:

Thank you for your comments. As previously described, at the initiation of the study, M3 was only one of the channels connecting Japanese researchers with Alibaba. Ali-M3 was developed by Alibaba Damo Technology Co., Ltd. At that time, Ali-M3 did not have a name; the name Ali-M3 was assigned during the approval process. The Ali-M3 datasheet was provided by Alibaba Damo Technology Co., Ltd.

10. And what is the online link to the validated tool Ali-m3? It's not acceptable to ask the anonymous reviewer to contact the commercial provider to access the cloud-based tool.

Reply:

Thank you for your comments.

As mentioned earlier, Ali-M3 was approved by PMDA. Approval letters are available on the web, although they are in Japanese. (https://www.pmda.go.jp/files/000235943.pdf) We translated the instructions as follows.

1. Preparation for use:

(1) Turn on the general-purpose IT equipment to be installed or access the product on the cloud server.

(2) Start the product.

2. Operation:

(1) Input the X-ray CT image from the X-ray CT diagnostic equipment or the server that stores these images.

(2) The confidence level of the CT image findings in COVID-19 pneumonia is presented, and the area of interest is marked on the image.

(3) Save the results.

3. Exit:

(1) Click the exit icon on the screen or select the exit function from the menu to exit the product.

(2) If necessary, turn off the general-purpose IT equipment.

The screen image of CT is like the one below.

Reviewer #2: The manuscript is about a system for real-time sentiment prediction on Twitter streaming data for coronavirus pandemic. The paper is well-organised, but I still have some concerns:

Reply:

Thank you for your constructive comments.

However, we did not discuss “a system for real-time sentiment prediction” on Twitter streaming data for the coronavirus pandemic.

Reviewer #3: The contribution of this research paper isn't clear. Sorry to say that, however, I can't get the point of this paper from the manuscript. Although you state your purpose as "Ali-M3, ... However, Ali-M3 has not been externally validated.", this statement didn't show anything about what you want to do in this research paper.

Reply:

Thank you for your clear comments.

Our aim was to perform an external validation of Ali-M3. Like other AI systems for diagnosing COVID-19, Ali-M3 showed high accuracy in internal validation. For prediction models in general, it is strongly recommended that, after development, a model's performance be evaluated on data from participants other than those used to develop it. Moreover, AI models fit their development data more closely than conventional statistical methods and can therefore easily overfit. Thus, external validation is even more important when developing an AI model than when developing a conventional prediction model. Although limited, our study evaluated the actual performance of Ali-M3 by external validation with data from a later period, a different country, and a different setting. From the perspective of the TRIPOD statement, our dataset is close to ideal for this purpose. In response to your comments, we have modified the abstract as follows:

# Abstract

Background: Ali-M3, an artificial intelligence program, analyzes chest computed tomography (CT) and detects the likelihood of coronavirus disease (COVID-19) in a range of 0 to 1. However, Ali-M3 has not been externally validated. Our aim was to evaluate the accuracy of Ali-M3 in identifying COVID-19 and to discuss its clinical value.

Purpose: To evaluate the external validity of the Ali-M3 using sequential Japanese sampling data.

Ref: Moons KG, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015; 162:W1-73. [PMID: 25560730] doi:10.7326/M14-0698

Based on the conclusion of this paper, "Our results indicated that AI-based CT diagnosis could be useful for ...", it seems that you want to prove that Ali-M3 can be used to diagnose COVID-19, but the data samples used to evaluate Ali-M3 and the results are not good enough to support your conclusion. There are only several hundred samples in your evaluation process; moreover, you didn't provide background information about those samples, such as how they were collected and which groups of people they covered. So, in my opinion, they can't represent all COVID-19 situations.

Reply:

Thank you for your comments regarding the sample size and generalizability.

* Sample Size

In external validation of a prediction model with a dichotomous outcome, it is recommended that the validation cohort contain more than 200 events and more than 200 non-events. (Vergouwe Y, 2005 Journal of Clinical Epidemiology) In this cohort, 289 patients were RT-PCR-positive and 328 patients were RT-PCR-negative. Thus, our cohort was large enough for an external validation study, and we respectfully believe that the concern about insufficient sample size does not apply to it.
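The sample-size criterion can be checked with simple arithmetic from the totals reported in the Abstract (617 patients, 289 RT-PCR-positive). The minimum of 200 events and non-events is the figure stated in this reply; it is exposed as a parameter (`minimum`, a name introduced here for illustration) because other rules of thumb are also used in practice.

```python
def events_and_non_events(n_total: int, n_events: int, minimum: int = 200) -> tuple[int, bool]:
    """Return the number of non-events and whether both events and non-events exceed `minimum`."""
    n_non_events = n_total - n_events
    return n_non_events, n_events > minimum and n_non_events > minimum


# Cohort totals from the Abstract: 617 patients, 289 RT-PCR-positive.
non_events, adequate = events_and_non_events(617, 289)
print(non_events, adequate)  # 328 True
```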

* Generalizability

We also believe that the results of this external validation cannot be applied to every situation in which physicians suspect COVID-19, and the limits on applying them need to be stated. In this external validation, the study was limited to 11 Japanese tertiary care facilities that treated COVID-19 in each region of the country; these facilities are located in COVID-19 hot spots in Japan. During the study period, patients with suspected COVID-19 in each region were directed to these facilities, regardless of severity. Whether and when patients should be seen for suspected COVID-19 was controlled by the health authorities. Patients were instructed to go to their designated healthcare facility if they had symptoms suggestive of infection, such as persistent fever or cough, or if they had been in close contact with someone already known to be infected. Physicians decided whether patients required COVID-19 testing. Although the threshold for suspecting COVID-19 likely differed between physicians (because information was scarce at the initiation of this study), almost all patients with suspected COVID-19 underwent RT-PCR and chest CT at each facility, because physicians assessing these patients were aware that COVID-19 might produce characteristic findings on chest CT. Under Japanese domestic law regarding COVID-19, RT-PCR records were not allowed to have any missing data. Sampling was consecutive: potentially eligible participants were identified on physicians' advice when patients from Japanese COVID-19 hot spots presented with symptoms and were suspected of having COVID-19, and included patients were required to have undergone both RT-PCR testing and chest CT.

Because we agreed with your opinion that our cohort could not represent all COVID-19 situations, we modified the Conclusion section in the Abstract and the Body of the manuscript.

# Abstract

Conclusion: We evaluated Ali-M3 by external validation using symptomatic patient data from Japanese tertiary care facilities. Because Ali-M3 showed sufficient sensitivity despite lower specificity, Ali-M3 was shown to be useful in excluding a diagnosis of COVID-19.

# Body

We conducted a retrospective cohort study for external validation of Ali-M3 using symptomatic patient data from Japanese tertiary care facilities. Despite the limited data analyzed, our results indicate that AI-based CT diagnosis could be useful for excluding COVID-19 in symptomatic patients, particularly in patients requiring oxygen and in those presenting a few days or more after symptom onset.

Reference:

Vergouwe Y, Steyerberg EW, Eijkemans MJC, Habbema JDF (2005) Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 58:475–483

Besides the insufficient testing samples, the performance of the model with AUC 0.79, 0.82 isn't very good. How could a model with such performance be used in COVID-19 diagnosis?

Reply:

Thank you for your comments regarding model performance and use.

As you mentioned, we used the AUC, sensitivity, and specificity. It is well established that diagnostic tests are best understood when presented in terms of gains and losses to individual patients [11]. The AUC lacks clinical interpretability because it does not reflect this. Clinicians are not interested in performance across all thresholds, which is what the AUC summarizes; they focus on clinically relevant thresholds. Thus, evaluation using the AUC alone does not produce relevant information for clinical practice. To our knowledge, validation studies of CT diagnostic systems for COVID-19 that used a consecutively sampled (sequential) dataset have not shown an excellent AUC in the validation set. (Ref)
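One way to see why the AUC alone is hard to act on clinically is its pairwise interpretation: the AUC equals the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative patient, with ties counted as one half. The sketch below computes it that way on hypothetical toy scores (not study data); it is an illustration of the definition, not the method used to produce the AUCs reported in the manuscript.

```python
from itertools import product


def auc_pairwise(pos_scores: list[float], neg_scores: list[float]) -> float:
    """AUC as P(score of a random positive > score of a random negative); ties count 1/2."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy scores, NOT study data.
positives = [0.92, 0.75, 0.55, 0.35, 0.15]
negatives = [0.60, 0.40, 0.25, 0.10, 0.05]
print(auc_pairwise(positives, negatives))  # 0.76
```

The single number ranks patients across all cut-offs at once; it does not say which cut-off to use or what sensitivity and specificity a clinician would obtain at it.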

We discuss the usefulness of Ali-M3 in the diagnosis of exclusion of COVID-19. As noted in the Discussion, RT-PCR is not a perfect test. Even as rapid tests become available, testing accuracy depends on the viral load and the variability of the sample collection method. Furthermore, there is concern about the emergence of SARS-CoV-2 variants that evade detection by the currently used RT-PCR, as reported in Brittany. To cope with this situation, it is important to use multiple diagnostic modalities. We strongly believe that the diagnosis of exclusion by Ali-M3 has clinical implications.

Another question: what is your work in this research? From the manuscript, I see that you ran the Ali-M3 model, which is already a usable deep learning model, with patient data that I don't know whether you collected or not, and performed some simple analysis of the results. Is this all you did in this research? What's the significance of what you did? Maybe you could add more content to your manuscript about what you did, such as data collection, sample pre-processing, model adjustment, deep analysis, diagnosis direction, practice guideline, or other things.

Reply:

Thank you for your comment regarding the author contributions. We understand the reviewer's concerns. We have added an explanation of the author contributions as follows:

# Author Contribution

YK, YM, JM, JK, KT, HF, TH, AK, MS, FH, and SI compiled medical and imaging information at each institution. Group author members collected substantive data at each site and added the survey items required for clinical information. TI, YK, and SF were involved in the study design and data interpretation. TI was involved in the data analysis and sample pre-processing. The CT analysis with Ali-M3 was carried out by Nobori on behalf of M3 (M3 and Nobori were blinded to the patients' data, including the RT-PCR results). All authors critically revised the report, commented on drafts of the manuscript, and approved the final report.

A lot of analyses were done focusing on cut-off point adjustment. However, what's the meaning of those analyses? Sensitivity and specificity change greatly when you use different cut-off values, and they can be affected by the ratio of positive and negative samples in the testing dataset. So I think it's not necessary to analyze those values, because they can't represent the real performance of the prediction model.

Reply:

Thank you for your comments.

Again, it is sensitivity and specificity, not the AUC, that are relevant in actual clinical practice: the sensitivity and specificity of a test at a given threshold are what is useful for an actual diagnosis. For this reason, we examined several threshold values and discussed the sensitivity and specificity at each. We believe that ruling out COVID-19 is the more important use of this test. We also believe it is important for clinicians to have both a threshold of 0.2 and a threshold of 0.5, the latter being close to the Youden-index threshold.
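Youden's index, J = sensitivity + specificity - 1, is the quantity behind the "Youden-index threshold" mentioned above. The sketch below evaluates J for the two operating points reported in the Abstract. It compares only these two cut-offs; finding the threshold that maximizes J would require scanning every cut-off over the full score distribution, which is not reproduced here.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J = sensitivity + specificity - 1 (0 = no discrimination, 1 = perfect)."""
    return sensitivity + specificity - 1


# Operating points reported in the Abstract: cut-off -> (sensitivity, specificity)
operating_points = {0.5: (0.806, 0.683), 0.2: (0.892, 0.432)}

for cutoff, (sens, spec) in operating_points.items():
    print(f"cut-off {cutoff}: J = {youden_index(sens, spec):.3f}")

best = max(operating_points, key=lambda c: youden_index(*operating_points[c]))
print("higher J at cut-off", best)
```

This gives J of about 0.489 at 0.5 and 0.324 at 0.2. The 0.5 cut-off balances sensitivity and specificity better, while the 0.2 cut-off deliberately gives up specificity for the higher sensitivity that a rule-out use requires.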

Attachment

Submitted filename: Revise_to_Editor.docx

Decision Letter 1

Haoran Xie

6 Oct 2021

Accuracy of deep learning-based computed tomography diagnostic system of COVID-19: A consecutive sampling external validation cohort study

PONE-D-21-03621R1

Dear Dr. Ikenoue,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Haoran Xie

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: No

Reviewer #3: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: In my opinion, although this study has many limitations, it can be a good starting point for using AI in clinical practice.

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Acceptance letter

Haoran Xie

19 Oct 2021

PONE-D-21-03621R1

Accuracy of deep learning-based computed tomography diagnostic system for COVID-19: A consecutive sampling external validation cohort study

Dear Dr. Ikenoue:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Haoran Xie

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    S1 Fig. Differential performance of Ali-M3 for coronavirus disease in patients divided by the thickness of the reconstructed slice of computed tomography.

    (A) Test sensitivity (y-axis) plotted against the false-positive rate (x-axis) at each confidence-score cut-off for CT images reconstructed at a slice thickness under 3 mm. The area under the receiver operating characteristic (ROC) curve was 0.825, and the Youden index was 0.50. (B) Test sensitivity, specificity, positive predictive value (PV+), and negative predictive value (PV-) (y-axis) plotted against the Ali-M3 confidence score (x-axis) for a reconstructed slice thickness under 3 mm. PV+ is shown in dark gray and PV- in light gray. The maximum PV+ was 46.5%, and the maximum PV- was 53.5%. (C) Test sensitivity (y-axis) plotted against the false-positive rate (x-axis) at each confidence-score cut-off for a reconstructed slice thickness over 3 mm. The area under the ROC curve was 0.789, and the Youden index was 0.50. (D) Test sensitivity, specificity, PV+, and PV- (y-axis) plotted against the Ali-M3 confidence score (x-axis) for a reconstructed slice thickness over 3 mm. PV+ is shown in dark gray and PV- in light gray. The maximum PV+ was 47.0%, and the maximum PV- was 53.0%.

    (PNG)
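    The S1 Fig caption reports ROC curves, areas under the curve, Youden indices, and PV+/PV- values across Ali-M3 confidence-score cut-offs. As a reader's aid, the minimal Python sketch below shows how these quantities are conventionally computed from per-patient scores and RT-PCR labels. It is not the authors' analysis code (their supplementary code, S2 File, is in R and addresses the imperfect reference standard). The helper names, the rule that a score at or above the cut-off counts as positive, and the toy data are all illustrative assumptions.

```python
"""Minimal sketch (assumption-based): ROC curve, AUC, Youden index, and
predictive values from per-patient confidence scores. Helper names are
hypothetical; the demo data are made up and are NOT study data."""
from typing import Dict, List, Sequence, Tuple


def confusion_at_cutoff(scores: Sequence[float], labels: Sequence[int],
                        cutoff: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN), assuming score >= cutoff is called positive
    and label 1 means RT-PCR-positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(num: int, den: int) -> float:
    # Undefined ratios (empty denominator) are reported as NaN.
    return num / den if den else float("nan")


def metrics_at_cutoff(scores: Sequence[float], labels: Sequence[int],
                      cutoff: float) -> Dict[str, float]:
    """Sensitivity, specificity, PV+, PV-, and Youden J at one cut-off."""
    tp, fp, tn, fn = confusion_at_cutoff(scores, labels, cutoff)
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": _ratio(tp, tp + fp),  # PV+
        "npv": _ratio(tn, tn + fn),  # PV-
        "youden_j": sens + spec - 1.0,
    }


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(false-positive rate, sensitivity) at every distinct score used as a
    cut-off, from the strictest cut-off downward, starting at (0, 0)."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        m = metrics_at_cutoff(scores, labels, cutoff)
        points.append((1.0 - m["specificity"], m["sensitivity"]))
    return points


def auc_trapezoid(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def youden_optimal_cutoff(scores: Sequence[float],
                          labels: Sequence[int]) -> float:
    """Cut-off maximizing J = sensitivity + specificity - 1. Candidates are
    scanned in ascending order, so ties resolve to the lowest cut-off."""
    return max(sorted(set(scores)),
               key=lambda c: metrics_at_cutoff(scores, labels, c)["youden_j"])


if __name__ == "__main__":
    # Toy, made-up data (NOT study data): 3 positives, 4 negatives.
    scores = [0.9, 0.8, 0.7, 0.5, 0.4, 0.3, 0.2]
    labels = [1, 0, 1, 1, 0, 0, 0]
    print(round(auc_trapezoid(roc_points(scores, labels)), 4))  # 0.8333
    print(youden_optimal_cutoff(scores, labels))                # 0.5
    print(metrics_at_cutoff(scores, labels, 0.5))
```

    Because PV+ and PV- depend on the prevalence of RT-PCR positivity in the sample, they transfer less readily across settings than sensitivity and specificity. The trapezoidal AUC used here gives tied scores half credit, matching the Mann-Whitney interpretation of the AUC.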

    S1 Table. The list of institutions from which patient medical data was obtained.

    (DOCX)

    S2 Table. Checklist for the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) Statement.

    (DOCX)

    S3 Table. Inclusion criteria.

    Patients who met at least one of the following criteria were considered symptomatic and were enrolled in the study.

    (DOCX)

    S4 Table. Computed tomography system and protocol.

    (DOCX)

    S5 Table. Characteristics of the population used to develop Ali-M3, taken from the Ali-M3 datasheet.

    (DOCX)

    S1 File. The datasheet of Ali-M3.

    (PDF)

    S2 File. R code to evaluate the effect of the imperfect reference.

    (DOCX)

    Attachment

    Submitted filename: Revise_to_Editor.docx

    Data Availability Statement

    Chest CT images and individual clinical information cannot be made public because of restrictions imposed by the IRB and by Japanese law and guidelines; Article 16 of the "Act on the Protection of Personal Information" does not permit us to release these data (http://www.japaneselawtranslation.go.jp/law/detail/?id=2781&vm=04&re=01). The Hyogo Prefectural Amagasaki General Medical Center served as the central ethical review committee. Data access requests may be directed to Ms. Kyoko Wasai (contact via agmc.irb@gmail.com).

