PLoS One. 2021 Apr 1;16(4):e0248809. doi: 10.1371/journal.pone.0248809

Artificial intelligence for the classification of fractures around the knee in adults according to the 2018 AO/OTA classification system

Anna Lind 1, Ehsan Akbarian 1, Simon Olsson 1, Hans Nåsell 1, Olof Sköldenberg 1, Ali Sharif Razavian 1, Max Gordon 1,*
Editor: Ivana Isgum
PMCID: PMC8016258  PMID: 33793601

Abstract

Background

Fractures around the knee joint are inherently complex in terms of treatment; complication rates are high, and they are difficult to diagnose on a plain radiograph. An automated way of classifying radiographic images could improve diagnostic accuracy and would enable production of uniformly classified records of fractures to be used in researching treatment strategies for different fracture types. Recently, deep learning, a form of artificial intelligence (AI), has shown promising results for interpreting radiographs. In this study, we aim to evaluate how well an AI can classify knee fractures according to the detailed 2018 AO/OTA fracture classification system.

Methods

We selected 6003 radiograph exams taken at Danderyd University Hospital between 2002 and 2016 and manually categorized them according to the AO/OTA classification system and by custom classifiers. We then trained a ResNet-based neural network on these data and evaluated its performance against a test set of 600 exams. Two senior orthopedic surgeons reviewed the test exams independently, and exams with disagreement were settled in a consensus session.

Results

We captured a total of 49 nested fracture classes. The weighted mean AUC was 0.87 for proximal tibia fractures, 0.89 for patella fractures and 0.89 for distal femur fractures. Almost three quarters of the AUC estimates were above 0.8, and more than half of those reached an AUC of 0.9 or above, indicating excellent performance.

Conclusion

Our study shows that neural networks can be used not only for fracture identification but also for more detailed classification of fractures around the knee joint.

Introduction

Fractures around the knee joint are inherently complex with a high risk of complications. For instance, during the first decade after a tibial plateau fracture, 7% of patients receive a total knee replacement, five times more than in the control population [1]. Bicondylar tibia fractures have a hazard ratio of 1.5 for total knee replacement, while high age has a hazard ratio of 1.03 [1]. While regular primary osteoarthritis replacements have a ten-year survival rate of at least 95%, post-traumatic knee replacements have both higher complication rates and survival rates as low as 80% over the same period [2]. There is a need to reduce complications from these fractures, and a reliable diagnosis and description of the fracture is crucial for providing correct treatment from the onset.

Experienced radiologists with extended orthopedic training constitute a scarce resource in many hospitals, especially in the middle of the night. Fatigue, inexperience and lack of time when interpreting diagnostic images increase the risk of human error as a cause of misdiagnosis [3–6]. Use of computed tomography (CT) might improve accuracy, but this is not universally true [7], and CT is not as readily available as plain radiographs. We believe that computer-aided interpretation of radiographs could be of use both in helping clinicians properly assess the initial fracture and in retrospectively reviewing large numbers of fractures to better understand the optimal treatment regime.

Recent studies have shown promising results in applying deep learning, also known as deep neural networks, a form of artificial intelligence [8], to image interpretation. In medicine, deep learning has notably been explored in specialties such as endocrinology for retinal photography [9], dermatology for recognizing cancerous lesions [10] and oncology for recognizing pulmonary nodules [11] as well as mammographic tumors [12]. In trauma orthopedics, the last four years have yielded several studies on deep learning for fracture recognition with very promising results [4, 13–15], yet its applications and limitations are still largely unexplored [16].

To our knowledge, there are no studies applying deep learning to knee fractures, and there are very few published studies on fracture classification [14, 17, 18]. The primary aim of this study was therefore to evaluate how well a neural network can classify knee fractures according to the detailed 2018 AO/OTA fracture classification [19].

Patients and methods

The research was approved by the ethical review committee (dnr: 2014/453-31; The Swedish Ethical Review Authority).

Study design and setting

The study is a validation study of a diagnostic method based on retrospectively collected radiographic examinations. These examinations were analyzed by a neural network for both presence and type of knee fracture. Knee fracture is defined in this study as any fracture to the proximal tibia, patella or distal femur.

Data selection

We extracted radiograph series around the knee taken between the years 2002 and 2016 from Danderyd University Hospital’s Picture Archiving and Communication System (PACS). Images along with corresponding radiologist reports were anonymized. Using the reports, we identified phrases that suggested fractures or certain fracture subtypes. We then selected random subsets of image series from both the images with phrases suggesting that there may be a fracture and those without. This selection generated a bias towards fractures and certain fracture subtypes to reduce the risk of non-fracture cases dominating the training data and rarer fractures being missed.
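As an illustration of this pre-filtering step, the sketch below stratifies exams by whether the report text matches fracture-related phrases and samples from both strata; the phrase list, field names and sample sizes are assumptions for illustration and not the study's actual code.

```python
import random
import re

# Hypothetical report phrases; the study's actual search terms are not reproduced here.
FRACTURE_PHRASES = [r"fracture", r"fraktur", r"avulsion", r"depression"]

def suggests_fracture(report: str) -> bool:
    """True if the radiologist report contains any phrase suggesting a fracture."""
    return any(re.search(p, report.lower()) for p in FRACTURE_PHRASES)

def sample_exams(exams, n_suggestive: int, n_other: int, seed: int = 1) -> list:
    """Sample from both strata so fractures and rarer subtypes are over-represented."""
    rng = random.Random(seed)
    suggestive = [e for e in exams if suggests_fracture(e["report"])]
    other = [e for e in exams if not suggests_fracture(e["report"])]
    return (rng.sample(suggestive, min(n_suggestive, len(suggestive)))
            + rng.sample(other, min(n_other, len(other))))
```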

Radiograph projections included were not standardized. Trauma protocols as well as non-trauma protocols were included. Diaphyseal femur and tibia/fibula protocols were included as these display the knee joint, although not in the center of the image. For each patient we only included the initial knee exam within a 90-day period, to avoid overestimating the network by including duplicate cases of the same fracture at different stages. Images of knee fractures in children were tagged for exclusion by the reviewer upon seeing open physes, as these are classified differently and Danderyd University Hospital only admits patients who are 15 years or older. Image series where the quality was deemed too poor to discern fracture lines were also tagged for exclusion by the reviewer. All tagged exclusions were then validated by MG before removal from the dataset.

Method of classification

In this method of machine learning, the neural network identifies patterns in images. The network is fed both the input (the radiographic images) and the expected output label (the fracture classification) in order to establish a connection between the features of a fracture and the corresponding category [8].

Prior to being fed to the network, the exams, along with the radiologists' reports, were labelled by AL, SO, MG and EA using a custom-built platform according to AO/OTA class (v. 2018). The AO/OTA classification system was chosen as it can be applied to all three segments of the knee joint [19] and because of its level of detail. The classification system has more than 60 classes of knee fractures, many of which are nested and interdependent, e.g. A1.1 is a subset of both A and A1 [19]. Fractures were classified down to the lowest discernible subgroup or qualifier (see S1 File for details). We also created custom output categories, such as displacement/no displacement and lateral/medial fracture, as it is interesting to see how well the network can discern these qualities regardless of AO/OTA class.
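Because the classes are nested, each exam effectively carries several binary labels at once. The sketch below illustrates one way such an expansion can be encoded; the naming scheme is an assumption for illustration and not taken from the study's labeling platform.

```python
def expand_label(segment: str, code: str) -> set:
    """Expand e.g. ('41', 'A1.1') into the binary targets {'41-A', '41-A1', '41-A1.1'}."""
    targets = {f"{segment}-{code[0]}"}        # fracture type, e.g. 41-A
    if len(code) >= 2:
        targets.add(f"{segment}-{code[:2]}")  # group, e.g. 41-A1
    if len(code) >= 4:
        targets.add(f"{segment}-{code}")      # subgroup, e.g. 41-A1.1
    return targets

print(expand_label("41", "A1.1"))  # {'41-A', '41-A1', '41-A1.1'}
```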

Data sets

The data was randomly split into three sets: training, validation and test. The split was constructed so that the same patient receiving an x-ray of the knee joint on multiple occasions separated by more than 90 days could be included multiple times in the same set, but there was no patient overlap between the training, validation and test sets.

The test set consisted of 600 cases, which were classified by senior orthopedic surgeons (MG, OS and EA) working independently. Any disagreement was dealt with in a joint re-evaluation session until consensus was reached. Out of the 600 cases, 71 had disagreement regarding the type of fracture (see S1 File for details). The test set then served as the ground truth that the final network was tested against. A minimum of 2 captured cases per class was required for that class to be included in the test set. All exams contained at least an AP and a lateral view and had to include the knee joint.

During training, two sets of images were used: the training set, which the network learned from, and a validation set for evaluating performance and tweaking network parameters. The validation set was prepared in the same way as the test set, but by AL and SO, two 4th-year medical students. The training set was labeled only once, by either AL or SO. MG validated all images with fractures or marked by the students for revisit. Initially, images were randomly selected for classification and fed to the network, i.e. passive learning. As the learning progressed, cases were selected based on the network's output: 1) first, cases with a high probability of a class were selected to populate each category, and then 2) cases where the network was most uncertain were selected to define the class border, i.e. active learning [20]. Due to the number of classes available, the category used for selection changed depending on which categories were performing poorly at that stage. During this process, the predictions from the network were fed back into the labeling interface as an additional feedback loop to the reviewers so that the error modes became clearer and could be addressed. The reviewers were presented with probabilities in the form of a continuous color scale, and categories with a probability over 60% were preselected by the interface.
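The two selection heuristics can be illustrated as follows, assuming a dictionary mapping exam identifiers to the network's predicted probability for the category currently being targeted (names and data structure are illustrative only, not the study's code).

```python
def select_confident(probs: dict, k: int) -> list:
    """Phase 1: exams the network is most confident contain the targeted category."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def select_uncertain(probs: dict, k: int) -> list:
    """Phase 2: exams closest to the decision boundary (p ~ 0.5), i.e. active learning."""
    return sorted(probs, key=lambda exam: abs(probs[exam] - 0.5))[:k]
```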

Neural network setup

We used a convolutional neural network that was a modification of a ResNet. The network consisted of a 26-layer architecture with batch normalization for each convolutional layer and adaptive max pooling (see Table 1 for the structure). Each class had a single endpoint that was converted into a probability using a sigmoid function. We randomly initialized the network and trained it using stochastic gradient descent.

Table 1. General network architecture.

Type Blocks Kernel size Filters Section
Convolutional 1 5x5 32 Core
Convolutional 1 3x3 64 Core
ResNet block 4x2 3x3 64 Core
ResNet block 2x2 3x3 128 Core
ResNet block 2x2 3x3 256 Core
ResNet block 2x2 3x3 512 Core
Image max 1 - - Pool
Convolutional 1 1x1 72 Classification
Fully connected 1 - 4 Classification
Fully connected 1 - 4 Classification

All images were individually processed in the core section of the network and then merged at the pool stage using the adaptive max function. The final classification section was then used for generating the AO/OTA classes.
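A minimal sketch of this multi-view design is shown below, assuming a simplified core: each radiograph passes through a shared convolutional stack, the per-image features are merged with an element-wise max, and each category receives an independent sigmoid output. The layer sizes are heavily simplified and do not reproduce the 26-layer network in Table 1.

```python
import torch
import torch.nn as nn

class ExamClassifier(nn.Module):
    """Simplified stand-in for the network in Table 1 (not the authors' exact model)."""

    def __init__(self, n_classes: int = 49):
        super().__init__()
        self.core = nn.Sequential(                       # shared per-radiograph core
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),                     # adaptive max pool per image
        )
        self.head = nn.Linear(64, n_classes)             # one endpoint per nested category

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (n_views, 1, 256, 256), i.e. all radiographs belonging to one exam
        feats = self.core(views).flatten(1)              # (n_views, 64)
        exam_feat, _ = feats.max(dim=0)                  # merge views with element-wise max
        return torch.sigmoid(self.head(exam_feat))       # independent probability per class

model = ExamClassifier()
probs = model(torch.randn(4, 1, 256, 256))               # 4 views of one exam -> 49 probabilities
```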

The training was split into several sessions with different regularizers for controlling overfitting. Between sessions we reset the learning rate and trained according to Table 2. We initially trained the network with dropout and without any noise. In subsequent sessions we applied regularizers such as white noise, autoencoders [21], semi-supervised learning with teacher-student networks [22] and stochastic weight averaging (SWA) [23]. During training we alternated with similar tasks for other anatomical sites, e.g. our ankle fracture dataset [17], using an additional 16 172 exams. During the teacher-student session, we augmented the dataset with unlabeled exams using a ratio of 1:2, where the teacher network had access to the radiologist report in addition to the images. The learning rate was adjusted at each epoch and followed the cosine function.

Table 2. The training setup of the network.

Session Epochs Initial learning rate Noise Teacher-student pseudo labels Autoencoder SWA
Initialization 70 0.025 none no no no
Noise 80 0.025 5% no no no
Teacher-student 40 0.010 5% yes no no
Autoencoder 20 0.025 10% no yes no
SWA 20 x 5 0.010 5% no no yes

All sessions used standard drop-out in addition to the above.
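The following sketch outlines one such training session with a per-epoch cosine learning-rate schedule and optional input noise, assuming the simplified classifier sketched above. It illustrates the schedule in Table 2 but is not the study's training code, and the mapping of the noise percentages to a standard deviation is an assumption.

```python
import math
import torch

def run_session(model, loader, epochs: int, lr0: float, noise_sd: float = 0.0):
    """One training session: SGD with a cosine learning-rate schedule and optional input noise."""
    opt = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)
    loss_fn = torch.nn.BCELoss()                       # one binary target per nested category
    for epoch in range(epochs):
        lr = 0.5 * lr0 * (1 + math.cos(math.pi * epoch / epochs))  # cosine schedule per epoch
        for group in opt.param_groups:
            group["lr"] = lr
        for views, targets in loader:                  # targets: multi-hot vector over the classes
            if noise_sd > 0:                           # white-noise regularization on the input
                views = views + noise_sd * torch.randn_like(views)
            opt.zero_grad()
            loss_fn(model(views), targets).backward()
            opt.step()

# e.g. run_session(model, loader, epochs=70, lr0=0.025)            # "Initialization" row of Table 2
# run_session(model, loader, epochs=80, lr0=0.025, noise_sd=0.05)  # "Noise" row; 5% read as sd=0.05
# The SWA session could additionally wrap the model in torch.optim.swa_utils.AveragedModel
# (available in newer PyTorch versions than the v1.4 used in the study).
```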

Input images

The network was presented with all available radiographs in each series. Each radiograph was automatically cropped to the active image area, i.e. any black border was removed, and the image was downscaled so that its longest side was at most 256 pixels. We then padded the rectangular image so that the network received a square format of 256 x 256 pixels.
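A sketch of this preprocessing is given below: crop away the black border, scale the longest side down to 256 pixels and zero-pad to a 256 x 256 square. The border threshold and helper structure are assumptions, not the study's pipeline.

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size: int = 256, black_thr: int = 5) -> Image.Image:
    """Crop to the active image area, scale the longest side to `size`, pad to a square."""
    gray = np.asarray(img.convert("L"))
    rows = np.where(gray.max(axis=1) > black_thr)[0]   # rows containing non-black pixels
    cols = np.where(gray.max(axis=0) > black_thr)[0]
    img = img.crop((cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))
    img.thumbnail((size, size))                        # longest side reduced to <= 256 px
    canvas = Image.new(img.mode, (size, size), 0)      # black square canvas
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas
```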

Outcome measures & statistical analysis

Network performance was measured using the area under the curve (AUC) as the primary outcome measure, with sensitivity, specificity and Youden's J as secondary outcome measures. The proportion of correctly detected fractures was estimated using the AUC, i.e. the area under a receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate and reflects the network's ability to rank cases from low to high likelihood of belonging to the class. An AUC value of 1.0 signifies a prediction that is always correct, and a value of 0.5 is no better than random chance. There is no exact guide for how to interpret AUC values, but in general an AUC of <0.7 is considered poor, 0.7–0.8 acceptable, 0.8–0.9 good to excellent and ≥0.9 excellent or outstanding [24–26]. The Youden index (J) is also used in conjunction with the ROC curve; it is a summary of sensitivity and specificity, ranges from 0 to 1 and is defined as [26]:

J = sensitivity + specificity - 1
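For a given category, these measures can be computed from the network's probability outputs as sketched below using scikit-learn; choosing the threshold that maximizes J is a common convention and an assumption here, as the operating point used in the study is not stated in this text.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def class_metrics(y_true, y_score):
    """Per-category AUC, and sensitivity/specificity/J at the J-maximizing threshold."""
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    j = tpr - fpr                                  # Youden J = sensitivity + specificity - 1
    best = int(np.argmax(j))
    return {"auc": auc, "sensitivity": tpr[best],
            "specificity": 1 - fpr[best], "youden_j": j[best]}
```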

As there are many categories, we also present a weighted mean of each measure that includes all the subclasses, e.g. the A-type measure includes not only the A-type but also all available groups and subgroups. The weighting was according to the number of positive cases, as we wanted small categories that may perform well by chance to have less influence on the weighted mean. For AUC the calculation was:

AUC_weighted = Σ_i (AUC_i × n_i) / Σ_i n_i, where the sums run over all categories i and n_i is the number of positive cases in category i.
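A minimal sketch of this weighted mean in code:

```python
def weighted_mean_auc(aucs, n_positive):
    """aucs and n_positive are equal-length sequences over the nested categories."""
    return sum(a * n for a, n in zip(aucs, n_positive)) / sum(n_positive)
```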

Cohen's kappa, a measure of inter-rater reliability [27], was used to measure the level of agreement between the two human reviewers assessing the test set, as differences in interpretation between human reviewers could confound a fair assessment of the network.
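For a single category, this agreement can be computed directly, e.g. with scikit-learn on the two reviewers' pre-consensus labels (the label vectors below are made up for illustration).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pre-consensus binary labels from the two reviewers for one fracture class.
reviewer_1 = [1, 0, 0, 1, 1, 0, 0, 0]
reviewer_2 = [1, 0, 1, 1, 0, 0, 0, 0]
print(cohen_kappa_score(reviewer_1, reviewer_2))
```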

We implemented integrated gradients [28] as a method to assess which image features the network analyzed to arrive at its output, as this is not otherwise immediately accessible. Integrated gradients displays this information as a heatmap where red illustrates image features that contribute positively to a certain output, i.e. fracture category, and blue illustrates features that contribute against that output [28].
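A sketch of such an attribution map using the Captum library is shown below; the use of Captum, the all-black baseline and the model interface are assumptions, as the paper does not state which implementation of integrated gradients was used.

```python
import torch
from captum.attr import IntegratedGradients

def attribution_map(model, images: torch.Tensor, class_index: int) -> torch.Tensor:
    """images: (batch, 1, 256, 256); returns per-pixel attributions with the same shape."""
    model.eval()
    ig = IntegratedGradients(model)                 # model: batch of images -> (batch, n_classes)
    baseline = torch.zeros_like(images)             # all-black reference radiograph
    # Positive attributions support the class (red in the heatmaps), negative oppose it (blue).
    return ig.attribute(images, baselines=baseline, target=class_index, n_steps=50)
```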

The network was implemented and trained using PyTorch (v. 1.4). Statistical analysis was performed using R (v. 4.0.0).

Results

From the 42 163 available knee examinations, 6188 exams were classified for the training set and 605 for the test set. A total of 70 images were excluded during classification, the majority because they contained open physes, leaving the training set with 6003 exams from 5657 separate patients and the test set with 600 exams from 526 patients (see Fig 1). Out of these 6003 exams, 5700 were used for training, with an average of 4.5 radiographs per exam (range 2 to 9), while the remaining 303 were used for evaluating network performance and tweaking network parameters (the validation set). The test set had slightly fewer radiographs per exam, on average 4.1 (range 2 to 7). There was no patient overlap between the test and training datasets. We evaluated the network performance for a total of 49 fracture categories, 40 of which were AO/OTA classes and 9 custom classes.

Fig 1.

Proximal tibia (AO/OTA 41)—621 training cases and 68 evaluation cases

The weighted mean AUC for all tibial plateau fractures was 0.87 (95% CI, 0.82–0.92); sensitivity, specificity and Youden's J were 0.83 (95% CI, 0.80–0.92), 0.91 (95% CI, 0.85–0.93) and 0.74 (95% CI, 0.69–0.83), respectively. As shown in Table 3, the A-types, which consisted mostly of tiny avulsions, performed the worst, with AUCs around 0.7. The B-types were closer to 0.9 and the C-types with subclasses just above 0.8. The split-depression fractures (B3 group) performed excellently, with all estimates above 0.9. Among the custom descriptors, medial and lateral performed with AUCs of 0.89 and 0.81, respectively. The custom displacement class performed well, with an AUC of 0.91.

Table 3. Network performance for proximal tibia.

Proximal tibia
Observed cases (n = 600) Sensitivity (%) Specificity (%) Youden’s J AUC (95% CI)
A
A 10 50 94 0.44 0.72 (0.52 to 0.91)
1 8 60 82 0.42 0.73 (0.52 to 0.94)
…3 5 80 79 0.59 0.78 (0.52 to 0.95)
…→a 3 100 76 0.76 0.86 (0.76 to 0.95)
A displaced 3 67 93 0.60 0.87 (0.68 to 1.00)
B
B 47 83 88 0.71 0.89 (0.83 to 0.95)
1 11 73 85 0.58 0.78 (0.60 to 0.91)
…1 6 67 81 0.48 0.76 (0.59 to 0.91)
…2 2 100 94 0.94 0.97 (0.93 to 1.00)
…3 3 67 92 0.59 0.72 (0.24 to 1.00)
2 10 67 90 0.57 0.74 (0.49 to 0.94)
…1 6 83 93 0.76 0.89 (0.73 to 0.98)
…2 4 100 81 0.81 0.88 (0.81 to 0.97)
3 26 92 92 0.84 0.97 (0.95 to 0.99)
…1 12 100 94 0.94 0.99 (0.97 to 0.99)
…3 14 93 88 0.81 0.93 (0.85 to 0.98)
B → x 5 100 93 0.93 0.97 (0.94 to 0.99)
B → t 7 100 97 0.97 0.99 (0.97 to 0.99)
B → u 6 83 93 0.76 0.88 (0.68 to 0.98)
C
C 11 82 95 0.77 0.83 (0.60 to 0.99)
1 2 50 100 0.50 0.53 (0.06 to 1.00)
2 4 100 98 0.98 0.99 (0.98 to 1.00)
3 5 80 95 0.75 0.79 (0.43 to 0.98)
…1 4 75 95 0.70 0.74 (0.30 to 0.98)
Custom classes
Displaced 29 83 97 0.80 0.91 (0.82 to 0.98)
Lateral 14 75 90 0.65 0.81 (0.62 to 0.97)
Medial 10 78 89 0.67 0.89 (0.74 to 0.98)
C2 or C3 5 80 94 0.74 0.80 (0.43 to 0.99)
Lateral B2 or B3 18 94 95 0.90 0.96 (0.88 to 0.99)
Medial B2 or B3 3 100 75 0.75 0.86 (0.76 to 0.96)

Table showing network performance for the different AO/OTA classes as well as other fracture descriptors; the first letter corresponds to fracture type, the first number to group, the second number to subgroup and the last letter to qualifiers. The observed cases column corresponds to the number of fractures observed by the reviewers. Note that an exam can appear several times, as the category A1.3 belongs to the overall A-type, the A1 group and the A1.3 subgroup at the same time.

Patella (AO/OTA 34)—525 training cases and 40 evaluation cases

The weighted mean AUC for patella was 0.89 (95% CI, 0.83–0.94); sensitivity, specificity and Youden's J were 0.89 (95% CI, 0.81–0.96), 0.88 (95% CI, 0.85–0.93) and 0.77 (95% CI, 0.70–0.87), respectively. Similar to the proximal tibia fractures, the A-types (extraarticular fractures) had the lowest performance, with an AUC just under 0.8. The B-types, partial articular sagittal fractures, had the highest AUC scores, around 0.9 for the main group and all subgroups. The C-types, complete articular fractures, also performed well; only C1.3 (fractures in the distal third of the patella) performed below 0.8 (Table 4).

Table 4. Network performance for patella.

Patella
Observed cases (n = 600) Sensitivity (%) Specificity (%) Youden’s J AUC (95% CI)
A
A 5 80 83 0.63 0.79 (0.67 to 0.86)
1 5 80 85 0.65 0.81 (0.68 to 0.89)
1a 2 100 94 0.94 0.97 (0.93 to 0.99)
B
B 6 100 90 0.90 0.94 (0.91 to 0.97)
1 6 100 90 0.90 0.93 (0.90 to 0.97)
…1 3 100 78 0.78 0.86 (0.76 to 0.95)
…2 3 100 89 0.89 0.95 (0.89 to 0.99)
C
C 29 90 86 0.75 0.90 (0.79 to 0.97)
1 11 91 86 0.76 0.89 (0.74 to 0.97)
…1 6 100 85 0.85 0.94 (0.89 to 0.99)
…3 5 60 89 0.49 0.75 (0.44 to 0.96)
2 8 100 88 0.88 0.97 (0.93 to 0.99)
3 10 80 97 0.77 0.88 (0.70 to 0.98)
Custom classes
Displaced 21 81 91 0.72 0.88 (0.76 to 0.97)

Table showing network performance for the different AO/OTA classes as well as other fracture descriptors; the letter corresponds to fracture type, the first number to group, the second number to subgroup and the last letter to qualifiers. The observed cases column corresponds to the number of fractures observed by the reviewers. Note that an exam can appear several times, as the category B1.1 belongs to the overall B-type, the B1 group and the B1.1 subgroup at the same time.

Distal femur (AO/OTA 33)—147 training cases and 12 evaluation cases

Distal femur fractures were rare in both the training and the test data. Despite this, the weighted mean AUC was 0.89 (95% CI, 0.78–0.96); sensitivity, specificity and Youden's J were 0.90 (95% CI, 0.82–1.00), 0.92 (95% CI, 0.79–0.97) and 0.81 (95% CI, 0.71–0.96), respectively. Only the B-type (partial articular fractures) performed lower, at an AUC of 0.72. However, the number of cases was small and many of the confidence intervals were wide (Table 5).

Table 5. Network performance for distal femur.

Distal femur
Observed cases (n = 600) Sensitivity (%) Specificity (%) Youden’s J AUC (95% CI)
A
A 5 100 83 0.83 0.94 (0.88 to 0.99)
2 4 100 97 0.97 0.99 (0.97 to 1.00)
B
B 4 75 83 0.58 0.72 (0.31 to 0.96)
C
C 3 100 97 0.97 0.98 (0.97 to 1.00)
Custom classes
B11 or C11 4 75 94 0.69 0.81 (0.49 to 0.98)

Table showing network performance for the different AO/OTA classes as well as other fracture descriptors; the letter corresponds to fracture type, the first number to group and the second number to subgroup. The observed cases column corresponds to the number of fractures observed by the reviewers. Note that an exam can appear several times, as the category A2 belongs to both the overall A-type and the A2 group at the same time.

Inter-rater results

The Cohen's kappa between MG and EA ranged between 0 and 1, with large variation between categories (see S2 Table in S1 File). High Cohen's kappa appeared to correspond weakly to classes where the network also performed well, and there were indications that the number of training cases facilitated this effect (Fig 2). The correlation was, however, not strong enough to provide significant results using a linear regression.

Fig 2.

Network insight and example images

We sampled cases where the network was most certain of a prediction, whether correct or incorrect, for analysis. Case images for the most common fracture type in the data, the proximal tibia B-type, and the adjacent C-type are shown below (Fig 3A to 3C). Also shown are heatmaps visualizing, as colored dots, which areas in the images the network focuses on. There were no clearly discernible trends among these cases as to what made the network fail or succeed. The colored dots were concentrated in the joint segment of the bone and often seemed to cluster close to fracture lines, suggesting that the network appropriately finds these areas to contain relevant information.

Fig 3.

Discussion

This is, to our knowledge, the first study to evaluate a deep neural network for detailed knee fracture diagnostics. We evaluated a total of 49 fracture categories. In general, the network performed well, with almost three quarters of the AUC estimates above 0.8. Of these, a little more than half reached an AUC of 0.9 or above, indicating excellent performance.

We conducted no direct comparison between network performance and the performance of clinicians. Chung et al. [14], in a similar study on deep learning for fracture classification, found that orthopedic surgeons specialized in shoulders classified shoulder fractures with a Youden's J, a summary of sensitivity and specificity, of 0.43–0.86. By that standard, our network performed with Youden's J ranging from 0.42 to 0.98 and a weighted mean Youden's J of 0.74–0.81, which would likely indicate results similar to those of orthopedic surgeons, with the caveat that fractures of the shoulder and knee might differ in diagnostic difficulty.

Some fractures were classified with significantly better prediction than others, though in many cases differences in performance between categories were not significant. During training, we could see a trend where categories with few training cases performed worse; however, this correlation diminished later on due to the active learning approach. There were also initially indications that fractures with low Cohen's kappa values were more challenging, but after revisiting all fractures in the training set this effect was no longer detectable. The importance of reducing label noise, i.e. disruptions that obscure the relationship between fracture characteristics and the correct category [29], sometimes stemming from incorrect or inconsistent labelling by the image reviewers, is well established [30], and our experience aligns with prior findings.

Our diagnostic accuracy is somewhat lower than that reported in previous studies on deep learning for fracture diagnostics. Langerhuizen et al. found in their 2019 systematic review [16] that six studies using a convolutional neural network to identify fractures on plain radiographs [3, 4, 13, 14, 31, 32] reported AUCs ranging from 0.95 to 1.0 and/or accuracies ranging from 83% to 97%. One of the studies in the review, Chung et al. [14], also investigated fracture classification using a convolutional neural network, with an AUC of 0.90–0.98 depending on category. The difference in performance could partly be due to the complexity of the task at hand; our study had 49 nested fracture categories whereas Chung et al. [14] had 4. Another likely cause is that this study made use of a less strictly controlled environment in which to train and test the network. Four of the six studies in the systematic review used only one radiographic projection [4, 14, 31, 32], and a fifth study used two projections [3]. This study made use of several projections, not all centered on the knee joint. Furthermore, our images were not centered on the fracture to the extent that images from the previous studies were, and we did not remove images containing distracting elements such as implants, as Urakawa et al. did [31].

Strengths and limitations

This study aimed to retain the full complexity that a random influx of patients brings. We did not introduce selection bias by automatically excluding knees with contractures, implants, thick casts and other visual challenges. Our study should thus be less likely to overestimate the AI by simplifying the diagnostic scenario and closer to achieving a clinically relevant setting, as requested by Langerhuizen et al. in their systematic review [16]. However, we did not avoid selection bias completely, as we removed images where the image quality was too poor for the human reviewers to establish a correct fracture label. In the test set, 5 cases were excluded: four due to open physes and one because it did not include the knee joint (see Fig 1). We actively selected rare fracture patterns, both to be able to capture all AO/OTA classes and because we believe that, in the long run, the potential clinical value of computer-assisted diagnosis will lie not only in everyday fractures but in rare cases where even the clinician is uncertain. This could, however, also be considered a limitation, as we did introduce a bias towards having rare fractures overrepresented in our data compared to how often they appear in the clinic. Fractures overall were also overrepresented, as the data would otherwise be dominated by healthy images. This would present less of a challenge for the network and would likely yield the appearance of a better-performing network, but it would hinder the goal of the study: to evaluate network performance for classification of different fracture types. We believe that the mixed inter-rater agreement between the orthopedic surgeons reviewing the test set also reflects that the network was evaluated on cases of varying difficulty for clinicians rather than on trivial cases only.

A central limitation is that we did not have a more sophisticated method of establishing ground-truth labels, such as utilizing CT/MRI scans, operative findings or other clinical data to aid the research team in interpreting the images. Including CT/MRI for 6000 exams was deemed unfeasible, as this would have vastly increased the time to review each exam, and is better suited for follow-up studies. Image annotation was instead aided by the radiologist report, written with access to patient history and other exams. Unfortunately, this report was often too simplistic to help in subgrouping AO/OTA classes. Double audits were used for fracture images, but there is still a risk of misclassification. This misclassification bias could have resulted in an underestimate of the number of complex fractures. However, we believe that fractures that may require surgery will be subjected to CT/MRI exams even with the aid of computer-assisted diagnosis, as these are incredibly useful before entering the operating theatre.

The AO/OTA classification system leaves room for differences in interpretation between image reviewers, as demonstrated by the Cohen's kappa values between MG and EA, which likely impaired a completely fair judgement of network performance. The AO/OTA fracture classification system is also perhaps not the most commonly applied knee fracture classifier, as it is impractically extensive for many clinical settings. However, its level of detail can be useful for research purposes, and while some fractures were difficult to categorize, many of the estimates improved significantly once we super-grouped categories, suggesting that this detailed classification can easily be simplified into one with fewer categories if need be.

While the fractures were collected over a period of more than a decade from a large sample of patients, our data selection was limited in that the data source is a single hospital in Stockholm. A fracture recognition tool developed from this network might not perform as well on the fracture panoramas of other cities or countries. Furthermore, the findings are only applicable to an adult population.

Clinical applications

This study evaluates a potential diagnostic tool with the ability to generate classifications or information that might otherwise fall within the expertise of an orthopedic specialist rather than a radiologist. The AO/OTA classification carries relatively detailed information on properties usually not mentioned in the radiologist report, and the addition of a network report would provide extra information of value for the clinician treating the patient. This tool could also aid in alerting clinicians to otherwise potentially missed fissures and could serve as a built-in fail-safe or second opinion.

Future studies

Future studies could likely benefit from bringing in further information from medical records and x-ray referrals, and from using more detailed imaging methods such as CT or MRI, or operative findings, as possible ways to refine the answer key the network is evaluated against. By using the pre-trained network presented here, it should be feasible to fine-tune the network on a more detailed but smaller subset of the cases used here.

In this study we relied on anonymized cases without patient data; adding patient outcomes would be of great interest, as we usually want to connect the fracture pattern to the risk of complications. Having a computer-aided diagnostic tool allows us to do this on an unprecedented scale.

Conclusion

In conclusion, we found that a neural network can be taught to apply the 2018 AO/OTA fracture classification system to diagnose knee fractures with an accuracy ranging from acceptable to excellent for most fracture classes. It can also be taught to differentiate between medial and lateral fractures as well as between non-displaced and displaced fractures. Our study shows that neural networks have potential not only for the task of fracture identification but also for more detailed description and classification.

Supporting information

S1 File

(DOCX)

Data Availability

There are legal and ethical restrictions on sharing the full data set. After discussions with the legal department at the Karolinska Institute, we have decided that the double-reviewed test set can be shared without violating EU regulations. The deidentified test dataset with 600 exams is available through the data-sharing platform provided by AIDA (https://datasets.aida.medtech4health.se/10.23698/aida/kf2020). The code used for training the network has been uploaded to GitHub, see https://github.com/AliRazavian/TU.

Funding Statement

This project was supported by grants provided by Region Stockholm (ALF project), and by the Karolinska Institute. Salaries and computational resources were funded through these grants. DeepMed AB provided no financial or intellectual property for this study.

References

  • 1.Wasserstein D, Henry P, Paterson JM, Kreder HJ, Jenkinson R. Risk of total knee arthroplasty after operatively treated tibial plateau fracture: a matched-population-based cohort study. J Bone Joint Surg Am. 2014;96(2):144–50. 10.2106/JBJS.L.01691 [DOI] [PubMed] [Google Scholar]
  • 2.Saleh H, Yu S, Vigdorchik J, Schwarzkopf R. Total knee arthroplasty for treatment of post-traumatic arthritis: Systematic review. World J Orthop. 2016;7(9):584–91. 10.5312/wjo.v7.i9.584 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Lindsey R, Daluiski A, Chopra S, Lachapelle A, Mozer M, Sicular S, et al. Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A. 2018;115(45):11591–6. 10.1073/pnas.1806905115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol. 2018;73(5):439–45. 10.1016/j.crad.2017.11.015 [DOI] [PubMed] [Google Scholar]
  • 5.Hallas P, Ellingsen T. Errors in fracture diagnoses in the emergency department—characteristics of patients and diurnal variation. BMC emergency medicine. 2006;6:4-. 10.1186/1471-227X-6-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Waite S, Scott J, Gale B, Fuchs T, Kolla S, Reede D. Interpretive Error in Radiology. American Journal of Roentgenology. 2016;208(4):739–49. 10.2214/AJR.16.16963 [DOI] [PubMed] [Google Scholar]
  • 7.te Stroet MA, Holla M, Biert J, van Kampen A. The value of a CT scan compared to plain radiographs for the classification and treatment plan in tibial plateau fractures. Emerg Radiol. 2011;18(4):279–83. 10.1007/s10140-010-0932-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Erickson BJ, Korfiatis P, Akkus Z, Kline TL. Machine Learning for Medical Imaging. RadioGraphics. 2017;37(2):505–15. 10.1148/rg.2017160130 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Ting DSW, Cheung CY-L, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes. JAMA. 2017;318(22):2211–23. 10.1001/jama.2017.18152 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115. 10.1038/nature21056 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Hua K-L, Hsu C-H, Hidayati SC, Cheng W-H, Chen Y-J. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. OncoTargets and therapy. 2015;8:2015–22. 10.2147/OTT.S80733 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kooi T, Litjens G, van Ginneken B, Gubern-Merida A, Sanchez CI, Mann R, et al. Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal. 2017;35:303–12. 10.1016/j.media.2016.07.007 [DOI] [PubMed] [Google Scholar]
  • 13.Olczak J, Fahlberg N, Maki A, Razavian AS, Jilert A, Stark A, et al. Artificial intelligence for analyzing orthopedic trauma radiographs: Deep learning algorithms-are they on par with humans for diagnosing fractures? Acta Orthopaedica. 2017;88(6):581–6. 10.1080/17453674.2017.1344459 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Chung SW, Han SS, Lee JW, Oh KS, Kim NR, Yoon JP, et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop. 2018;89(4):468–73. 10.1080/17453674.2018.1453714 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Gan KF, Xu DL, Lin YM, Shen YD, Zhang T, Hu KQ, et al. Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments. Acta Orthopaedica. 2019;90(4):394–400. 10.1080/17453674.2019.1600125 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Langerhuizen DWG, Janssen SJ, Mallee WH, van den Bekerom MPJ, Ring D, Kerkhoffs G, et al. What Are the Applications and Limitations of Artificial Intelligence for Fracture Detection and Classification in Orthopaedic Trauma Imaging? A Systematic Review. Clin Orthop Relat Res. 2019. 10.1097/CORR.0000000000000848 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Olczak J, Emilson F, Razavian A, Antonsson T, Stark A, Gordon M. Ankle fracture classification using deep learning: automating detailed AO Foundation/Orthopedic Trauma Association (AO/OTA) 2018 malleolar fracture identification reaches a high degree of correct classification. Acta Orthopaedica. 2020:1–7. 10.1080/17453674.2020.1837420 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Pranata YD, Wang KC, Wang JC, Idram I, Lai JY, Liu JW, et al. Deep learning and SURF for automated classification and detection of calcaneus fractures in CT images. Comput Methods Programs Biomed. 2019;171:27–37. 10.1016/j.cmpb.2019.02.006 [DOI] [PubMed] [Google Scholar]
  • 19.Meinberg E, Agel J, Roberts C, et al. Fracture and Dislocation Classification Compendium—2018. Journal of Orthopaedic Trauma. 2018;32(January):170. 10.1097/BOT.0000000000001063 [DOI] [PubMed] [Google Scholar]
  • 20.Smailagic A, Costa P, Noh HY, Walawalkar D, Khandelwal K, Galdran A, et al., editors. MedAL: Accurate and Robust Deep Active Learning for Medical Image Analysis. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA); 2018 17–20 Dec. 2018.
  • 21.Hinton GE, Salakhutdinov RR. Reducing the Dimensionality of Data with Neural Networks. Science. 2006;313(5786):504. 10.1126/science.1127647 [DOI] [PubMed] [Google Scholar]
  • 22.Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y. FitNets: Hints for Thin Deep Nets. arXiv. 2014. [Google Scholar]
  • 23.Izmailov P, Podoprikhin D, Garipov T, Vetrov D, Gordon Wilson A, editors. Averaging weights leads to wider optima and better generalization. 34th Conference on Uncertainty in Artificial Intelligence 2018; 2018; Monterey, United States: Association For Uncertainty in Artificial Intelligence (AUAI).
  • 24.Mandrekar JN. Receiver Operating Characteristic Curve in Diagnostic Test Assessment. Journal of Thoracic Oncology. 2010;5(9):1315–6. 10.1097/JTO.0b013e3181ec173d [DOI] [PubMed] [Google Scholar]
  • 25.Li F, He H. Assessing the Accuracy of Diagnostic Tests. Shanghai archives of psychiatry. 2018;30(3):207–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med. 2013;4(2):627–35. [PMC free article] [PubMed] [Google Scholar]
  • 27.McHugh ML. Interrater reliability: the kappa statistic. Biochemia medica. 2012;22(3):276–82. [PMC free article] [PubMed] [Google Scholar]
  • 28.Sundararajan M, Taly A, Yan Q. Axiomatic Attribution for Deep Networks. International Conference on Machine Learning (ICML); Sydney; 2017. p. 3319–28.
  • 29.Frenay B, Verleysen M. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems. 2014;25(5):845–69. 10.1109/TNNLS.2013.2292894 [DOI] [PubMed] [Google Scholar]
  • 30.Zhu X, Wu X. Class Noise vs. Attribute Noise: A Quantitative Study. Artificial Intelligence Review. 2004;22(3):177–210. [Google Scholar]
  • 31.Urakawa T, Tanaka Y, Goto S, Matsuzawa H, Watanabe K, Endo N. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol. 2019;48(2):239–44. 10.1007/s00256-018-3016-3 [DOI] [PubMed] [Google Scholar]
  • 32.Gale W, Oakden-Rayner L, Carneiro G, Bradley AP, Palmer LJ. Detecting hip fractures with radiologist-level performance using deep neural networks. Published 2017. Accessed Nov 2019. https://arxiv.org/abs/1711.06504v1.

Decision Letter 0

Ivana Isgum

22 Oct 2020

PONE-D-20-26924

Artificial intelligence for the classification of knee fractures in adults according to the 2018 AO/OTA classification system

PLOS ONE

Dear Dr. Gordon,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers are overall positive but they have also identified multiple major and minor issues that need to be addressed before the manuscript can be considered for publication. Especially, carefully update literature following comments of R1, provide detailed responses regarding the data description and experimental setup (R1, R2, R3), and provide details of the method following the comments of R3.

Please submit your revised manuscript by Dec 05 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ivana Isgum

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3.Thank you for stating the following in the Competing Interests section:

[MG, OS and AS are co-founders and shareholders in DeepMed AB.].

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

4.Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: N/A

Reviewer #4: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an interesting article but requires some significant clarifications.

The knee is a joint, thus it dislocates but does not fracture. Please use a different word or indicate the bones involved.

Please clarify your sample sizes; I had great trouble sorting this out.

Please provide the number of independent patients -with the number of radiographs per patient.

Did each patient have all the views -did every plateau fracture have an evaluation with a paired AP and Lateral

I am not fully clear on machine learning technology - were normal x-rays included if the patient had a distal femur fracture?

I find the fact that you quote that knee replacements only last a decade in 80% of the population fairly unrealistic - please review the literature to be sure your quoted reference reflects the global feeling.

In the United States very few fractures of the tibial plateau or distal femur would be treated operatively without a CT scan - and in Europe many patella cases appear to be treated without plain films so please review the 2nd paragraph of your introduction. I don't think that there is a lot of misdiagnosis in interpreting radiographs of fractures - there might be a misjudgement of the severity of the fracture pattern

Please explain on line 63 how the random images were selected.

Line 66 - When you say projections do you means the actual images?

Line 70 - Are you stating that patients with the same fracture were seen at different time points - were all your images not initial injury films - there should have been no repeat patients for the same fracture; please clarify this.

Line 72 - who made this decision

Line 85 - what is ESM

Line 89 - same comment as line 70

Line 93 -I just want to confirm that two surgeons manually classified 600 fractures - this case number continues to confuse me based on the samples in your tables

Line 101 - when you say labelled do you mean graded - I am not objecting to the word, I just want to be certain I understand?

Line 152 etc - please put the sample size in the text for each overall group of fractures

My confusion over your actual samples continues onto your tables -- please be clear as to the actual number of fractures seen - what the surgeons who manually coded cases found and what the machine identified

Did you use the modifiers as well?

I think the concept of machine learning is reasonable - I think your justification for it needs to be modified. Perhaps - fracture identification in clinics without orthopedic-trained personnel available - replacement of the need for virtual reading of films in the middle of the night by on-call personnel etc

Reviewer #2: Dear author

This is an interesting article in an area that will only become more relevant. It is well structured and written

I do think that some of the paper needs some simplification to make it more readable to understand the true use of this technology.

I have a couple of other comments

1. How were the 600 tests radiographs actually chosen. I am concerned there may have been some bias in the choice. Why was 600 chosen

2. Were the surgeons involved in reviewing the radiographs part of the design team.

3. I would like to see clearer documentation of how bias was addressed

4. I would like more information on how many fractures out of this selection were not identified compared to the radiologists report. This has a large bearing on the use of this.

Reviewer #3: This paper describes and evaluates a method for knee fracture classification from plain radiographs using a deep neural network. The authors collected around 6000 radiographs from a single hospital and manually classified knee fractures according to the 2018 AO/OTA classification system, including a few custom categories. The authors then trained a simple neural network classifier using a majority of the data and tested its perform on 600 images that were annotated by two experienced observers.

This is overall an interesting study, and the authors especially put a lot of effort into data collection and annotation. The study is focused on the validation of an automatic classification method. A lot of details of this method are missing, which would be logical for a validation study of a method described in detail in another (already published) paper, but this seems not to be the case. For this reason, details of the neural network classifier and its training process should in my opinion be included in the manuscript, preferably in the main text rather than a supplement.

For example, the output of the network is not clear. Does the network generate softmax predictions for X classes? In the supplementary material, the network is described with four filters in the last layer, but here it is not clear how these values are translated into AO/OTA categories. It is also unclear what it means that all images were processed individually by the core section of the network – does this refer to different images from the same patient? The supplement also mentions another dataset used for training, which needs to be mentioned in the main text. The training process should also be described in more detail, at least summarizing how radiological reports were fed to the network in teacher student sessions, and how the other more complex regularization techniques were used such as the autoencoder. The active learning part is also described only very briefly.

For data selection, a random subset of the available images was selected based on the likelihood that the image contained a fracture. How was this likelihood determined and how exactly was it used to select images?

How was the data split into test, training and validation sets – randomly?

The test data was annotated by two orthopedic surgeons. Please give their initials if they are co-authors. I assume that Cohen’s kappa was computed with the readings before the consensus session, it might be worth stating this explicitly.

The values in the result section would be more informative with corresponding confidence intervals.

Minor comments:

- Page 3, Line 41: This sentence is not well written and hard to read: “deep learning; a branch of machine learning, utilizing neural networks; a form of artificial intelligence”

- Page 7, Line 123: “>0.7” should be “<0.7”

- Page 9, Line 175: “seemed to correspond somewhat” is not very precise language

- Page 9, Line 184: “falter” should probably be “failed”

- The resolution of Figure 3 is very low.

Reviewer #4: This is an interesting paper looking at machine learning and its ability to recognize knee fractures. This study is important as it is the start of what will probably become the accepted way of assessing and classifying fractures. It will also allow for the appropriate classification of fractures in an unbiased format allowing classifications to be correlated with results and ultimately to treatment decisions and outcomes. The methodology and statistical evaluation are acceptable. The results are definitely encouraging showing reasonable correlation with the AO/OTA classification based only on plain radiographs. The discussion was honest and dealt with the shortcomings and strengths of the research.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: Yes: James F Kellam

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Apr 1;16(4):e0248809. doi: 10.1371/journal.pone.0248809.r002

Author response to Decision Letter 0


14 Jan 2021

__See the attached Word file for reviewers - it should be more readable than this section (the content is identical)__

Reviewer #1:

“This is an interesting article but requires some significant clarifications.”

Answer: Thank you and we understand that it is perhaps somewhat unclear; blending orthopedic and machine learning research into a readable format is challenging. We have tried to clarify according to your suggestion.

“The knee is a joint, thus it dislocates but does not fracture. Please use a different word or indicate the bones involved.”

Answer: This is correct and we have adjusted to “fractures around the knee joint”. We believe that enumerating “distal femur, patella and proximal tibia” is overly complex. Some readers will also most likely be from the machine learning community and may find it difficult to understand as they will not be familiar with medical terminology.

“Please clarify your sample sizes; I had great trouble sorting this out.”

Answer: We apologize for the confusion. We have tried to clarify this at the beginning of the results section. The sample size is actually rather simple – we have a training set of 6000 images that we use for training and tweaking the neural network. Once the training seemed optimal, we tested the model on 600 images that the network had never encountered. This is very different from traditional orthopedic research: what matters is to have a sufficiently large test set – essentially, if the model performs well, the training set is adequate in size. If we had chosen a small test set, e.g. 100 exams, there would be a risk that the results were due to chance, and we would most likely not be able to evaluate many of the rare categories.

Action: We have changed the text and included the requested information on the number of radiographs per exam; see the first paragraph in the results.

“Please provide the number of independent patients -with the number of radiographs per patient.”

Answer: We allowed patients to appear more than once if their exams were separated by at least 90 days. In the test set, about 12% of patients occurred more than once. We believe that reporting radiographs per patient is less important than the number of radiographs per exam, as each evaluation of the network is performed per exam.

Action: We have added this information to the first paragraph in the results section, together with the information from the previous question.
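For illustration only, a minimal sketch of such a 90-day rule under assumed data structures (a list of exam dates per patient); this is not the actual selection code used in the study.

```python
from datetime import date, timedelta

def keep_initial_exams(exam_dates, min_gap_days=90):
    """Keep an exam only if at least min_gap_days have passed since the
    previously kept exam for the same patient, so that follow-up images of
    the same fracture are not recruited twice."""
    kept = []
    for exam_date in sorted(exam_dates):
        if not kept or (exam_date - kept[-1]) >= timedelta(days=min_gap_days):
            kept.append(exam_date)
    return kept

# Example: the day-10 follow-up is dropped, the exam 100 days later is kept.
print(keep_initial_exams([date(2015, 1, 1), date(2015, 1, 11), date(2015, 4, 11)]))
```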

“Did each patient have all the views -did every plateau fracture have an evaluation with a paired AP and Lateral”

Answer: Yes, the majority also had oblique images that allowed us to evaluate whether depressions were located in the posterior, central or anterior portion of the knee.

Action: Added to the second paragraph under “Data sets”: All images contained at least an AP and a lateral view and had to have the knee joint represented.

“I am not fully clear on machine learning technology - were normal x-rays included if the patient had a distal femur fracture?”

Answer: We filtered as few images as possible as we wanted the radiographs to represent a true clinical setting, i.e. we included even poorly taken images as long as the knee joint was included in the image.

Action: See previous action that includes this question.

“I find the fact that you quote that knee replacements only last a decade in 80% of the population fairly unrealistic - please review the literature to be sure your quoted reference reflects the global feeling.”

Answer: The number 80% is for “post-traumatic” arthritis and not regular primary osteoarthritis, where we would expect a 95% survival rate (according to the Swedish Knee Registry). The number stems from a perhaps somewhat small study (Lunebourg et al), but a much larger study with shorter follow-up (Bala et al) also finds that the complication rate is much higher in this group.

Action: We have tried to clarify this in the introduction: “While regular primary osteoarthritis replacements have a survival rate of at least 95% in a decade, post-traumatic knee replacements have both higher complication rates and survival rates as low as 80% for the same time period”

1. Lunebourg A, Parratte S, Gay A, Ollivier M, Garcia-Parra K, Argenson J-N. Lower function, quality of life, and survival rate after total knee arthroplasty for posttraumatic arthritis than for primary arthritis. Acta Orthop. 2015 Apr;86(2):189–94.

2. Bala A, Penrose CT, Seyler TM, Mather RC, Wellman SS, Bolognesi MP. Outcomes after Total Knee Arthroplasty for post-traumatic arthritis. The Knee. 2015 Dec 1;22(6):630–9.

3. http://www.myknee.se/pdf/SVK_2019_1.0_Eng.pdf

“In the United States very few fractures of the tibial plateau or distal femur would be treated operatively without a CT scan - and in Europe many patella cases appear to be treated without plain films so please review the 2nd paragraph of your introduction. I don't think that there is a lot of misdiagnosis in interpreting radiographs of fractures - there might be a misjudgement of the severity of the fracture pattern”

Answer: We agree that few (read: none) of these would be treated operatively without a CT. The paragraph does, though, state “especially during on-call hours”, as we usually do CT exams the next day, and furthermore only if we are considering surgery. Providing a detailed description in the ER could therefore be of use, especially for selecting which fractures require additional CT scans. As the other reviewers have not objected to this paragraph we will leave it unchanged.

Action: None.

“Please explain on line 63 how the random images were selected.”

Answer: We agree that “likelihood” was perhaps not the best wording. In effect the procedure was rather simple: we looked for text strings such as “there is a fracture” or “there is a depression” and then made sure that a large proportion of the images were sampled from exams whose reports contained such fracture-related strings. The word likelihood suggests that we used a statistical method, and we agree that this is misleading. As the dataset is dominated by regular knee images, it would otherwise have been impossible to retrieve enough fractures to identify all the categories.

Action: We have changed the wording and made it clearer that the report text is guiding the selection.
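For illustration only, a minimal sketch of report-guided sampling under assumed data structures; the keywords are hypothetical English stand-ins for the Swedish report strings and this is not the actual selection code.

```python
import random

# Hypothetical keywords; the actual report strings were in Swedish and specific
# to the radiology reports at Danderyd University Hospital.
FRACTURE_KEYWORDS = ["fracture", "depression", "avulsion"]

def report_suggests_fracture(report_text: str) -> bool:
    """Return True if the free-text report contains a fracture-related string."""
    text = report_text.lower()
    return any(keyword in text for keyword in FRACTURE_KEYWORDS)

def sample_exams(exams, n_total, fracture_fraction=0.7, seed=42):
    """Oversample exams whose reports suggest a fracture (illustrative only).

    `exams` is assumed to be a list of dicts with a "report" key.
    """
    rng = random.Random(seed)
    likely = [e for e in exams if report_suggests_fracture(e["report"])]
    other = [e for e in exams if not report_suggests_fracture(e["report"])]
    n_fracture = min(int(n_total * fracture_fraction), len(likely))
    n_other = min(n_total - n_fracture, len(other))
    return rng.sample(likely, n_fracture) + rng.sample(other, n_other)
```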

“Line 66 - When you say projections do you means the actual images?”

Answer: Thank you for this note. We have clarified it to “Radiograph projections”. Many are familiar with this as the “view”, but we chose “projection” as this is generally the more technical term.

Action: Added “Radiograph” before projection.

“Line 70 - Are you stating that patients with the same fracture were seen at different time points - were all your images not initial injury films - there should have been no repeat patients for the same fracture; please clarify this.”

Answer: Thank you for this remark. The assumption was that a patient would appear with a radiograph of a fracture on day 1, perhaps have a follow-up on day 10 to look for displacement, and then a final follow-up radiograph on day 40. The most important consideration was to avoid including a follow-up image and thereby recruiting the same fracture twice. Similarly, we were less interested in looking at healed fractures and thus we extended the period to 90 days.

Action: We have clarified this by changing “first” to “initial” in the description.

“Line 72 - who made this decision”

Answer: The reviewer of the image made the decision if they saw open physes. Although we at the orthopedic department do not see children, there are some neonatal images, and occasionally another clinic has referred a younger patient to Danderyd University Hospital. While it would be interesting to include these, we would not have enough pathology to find anything useful, and immature bone has a different classification.

Action: Added “by the reviewer”.

“Line 85 - what is ESM”

Answer: The extended supplementary material.

Action: Changed to “supplement”

“Line 89 - same comment as line 70”

Answer: The key factor is that a patient should never appear in different data sets. The network will always overfit the training set, and thus it is crucial that as little information as possible leaks over into the test set; we wanted to be very explicit about it here.

“Line 93 -I just want to confirm that two surgeons manually classified 600 fractures - this case number continues to confuse me based on the samples in your tables”

Answer: Yes, this is correct. We hope that our previous answers have made it clearer.

Action: None

“Line 101 - when you say labeled do you mean graded - I am not objecting to the word I just want to be certain I understand?”

Answer: Yes, graded is in our mind a slightly narrower term. The number of classes that the reviewer could choose from was fairly large and not all classes were concerned with the AO classification, e.g. “previous fracture”.

Action: None

“Line 152 etc - please put the sample size in the text for each overall group of fractures”

Answer: Excellent suggestion.

Action: Added the training & evaluation (test) cases to the headers for each fracture group.

“My confusion over your actual samples continues onto your tables -- please be clear as to the actual number of fractures seen - what the surgeons who manually coded cases found and what the machine identified”

Answer: We apologize for this confusion. The machine provides us with a likelihood for a fracture, where the simplest measure of how many the machine identified would be to count predictions below or above 50% probability. As we have trained the network with weights proportional to each category’s prevalence, it will, though, have a bias towards predicting a fracture. The only number that we look at in practice is the AUC, as it is probably the most informative measure of performance in a clinical setting. When we have tried our network in a clinical setting, it is also beneficial that the network is more likely to guess that there is a fracture than not, as we are primarily concerned with missing fractures.

Action: We have added a clarification in the table description and renamed the cases column to “Observed cases”.
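For illustration, a minimal sketch of how per-class AUCs and a prevalence-weighted mean AUC can be computed with scikit-learn; the array names are placeholders and this is not the evaluation code used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_mean_auc(y_true, y_prob):
    """y_true, y_prob: arrays of shape (n_exams, n_classes) holding binary
    labels and predicted probabilities. Returns per-class AUCs and a mean
    AUC weighted by each class's number of observed cases."""
    aucs, weights = [], []
    for c in range(y_true.shape[1]):
        if y_true[:, c].min() == y_true[:, c].max():
            continue  # AUC is undefined when only one class is present
        aucs.append(roc_auc_score(y_true[:, c], y_prob[:, c]))
        weights.append(y_true[:, c].sum())
    aucs, weights = np.array(aucs), np.array(weights)
    return aucs, float(np.average(aucs, weights=weights))
```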

“Did you use the modifiers as well?”

Answer: We have only used “Displaced”, although we named it “Dislocation”, which was a mistake on our part (the word “dislocerad” in Swedish means displaced and hence we forgot to properly translate it). Some of the modifiers are not that obvious, e.g. poor bone quality, and combined with the fact that the number of choices in our already rather rich set of classes would have been overwhelming, we decided to focus on the classes and qualifiers. It is something we will most likely add in some form as we move forward with the project.

Action: We have changed “Dislocated” to “Displaced”

“I think the concept of machine learning is reasonable - I think your justification for it needs to be modified. Perhaps - fracture identification in clinics without orthopedic trained personnel available - replacement of the need for virtual reading of films in the middle of the night by on-call personnel etc”

Answer: Yes, we agree that it is suboptimal. The paradox in our hospital is that while we have no real shortage of radiologists, we have a hard time finding radiologists who are truly interested in orthopedics. This means that we usually receive full descriptions of fractures, but they unfortunately rarely contain the level of detail that we need to guide our treatments. The decision is in the end up to the operating orthopedic surgeon and highly influenced by the CT, but we think there is a missed opportunity to improve the initial report, as virtually all fractures have an initial radiograph taken prior to the CT scan.

Action: We changed “during on call hours” to “in the middle of the night” and we also added orthopedic expertise to the description of the radiologists.

Reviewer #2:

“Dear author, this is an interesting article in an area that will only become more relevant. It is well structured and written. I do think that some of the paper needs some simplification to make it more readable to understand the true use of this technology.”

Answer: Thank you. We appreciate that you have taken the time to help us improve the paper.

“I have a couple of other comments

1. How were the 600 test radiographs actually chosen? I am concerned there may have been some bias in the choice. Why was 600 chosen?”

Answer: When we had the 600 test exams, we felt that we had a decent number of fractures represented in each category. The difficulty is that in a regular patient flow most patients will not have a fracture, and differentiating between fracture and no fracture was not our objective. As always, more could be better, but we believed that 600 would be enough to provide a decent overview of the performance while keeping the workload reasonable, as the reviewing process for the test set is much more expensive than that for the training set. We introduced a bias by design, as we actively looked for reports suggesting fractures, as described in our reply to reviewer #1. If we had filled our dataset with more healthy individuals the results could possibly have been better, but we strongly believe that that task would have been too trivial and not of clinical interest. Our goal was to be able to distinguish between severe osteoarthritis and a depression fracture, just as we would have to do in a regular clinical setting.

Action: None

“2. Were the surgeons involved in reviewing the radiographs part of the design team.”

Answer: If you refer to the design team, then MG was involved in all the steps of the study, as well as in the reviewing of the radiographs.

Action: None

“3. I would like to see clearer documentation of how bias was addressed”

Answer: In these experimental machine learning studies the bias differs from traditional epidemiological bias. We have tried not to cherry-pick images and have excluded only a minimal number of exams, mostly due to immature bone. The test set with all the 600 images will be made available upon publication so that anyone can evaluate it for sources of bias. The training set is of interest more as an indicator of the amount of data required than as an actual source of bias; the bias in the results is all about how obvious the examples are. The more cases in the region of uncertainty, the more difficult it will be for the network to decide on the actual class. This bias is described in the inter-rater table, where a high kappa value suggests that we are evaluating obvious cases.

Action: We have added a section in the discussion that discusses these issues in more detail.

“4. I would like more information on how many fractures out of this selection were not identified compared to the radiologists’ report. This has a large bearing on the use of this.”

Answer: The fractures reached an AUC of 0.90; we do not have the true number of reports suggesting a fracture, but the number should be similar. Our paper from 2017 looked at the detection of fractures, and it is certainly an interesting topic, but after testing this in a clinical setting we have found that the true utility comes from the detailed description we can provide by helping clinicians use complex classification systems such as the AO/OTA. The difference is equivalent to seeing that there is a car versus knowing what car model it is: the former is rather obvious even without an AI, while the latter is less obvious but absolutely crucial if you want to order spare parts or do something more interesting. We have skipped this information as we wanted to keep the paper focused on the AO/OTA classification with minimal distraction.

Action: We have added the numbers for the fracture class to the supplement together with example images of detection failures.

Reviewer #3:

“This paper describes and evaluates a method for knee fracture classification from plain radiographs using a deep neural network. The authors collected around 6000 radiographs from a single hospital and manually classified knee fractures according to the 2018 AO/OTA classification system, including a few custom categories. The authors then trained a simple neural network classifier using a majority of the data and tested its performance on 600 images that were annotated by two experienced observers.

This is overall an interesting study, and the authors especially put a lot of effort into data collection and annotation.”

Answer: Thank you for taking your time to read and grasp the essence of the paper.

Action: None

“The study is focused on the validation of an automatic classification method. A lot of details of this method are missing, which would be logical for a validation study of a method described in detail in another (already published) paper, but this seems not to be the case. For this reason, details of the neural network classifier and its training process should in my opinion be included in the manuscript, preferably in the main text rather than a supplement.

For example, the output of the network is not clear. Does the network generate softmax predictions for X classes?”

Answer: The classes are binary; we predict each probability through a sigmoid function of the corresponding output neuron’s activity, which is common for these tasks.

Action: None
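For readers less familiar with this setup, the following is a minimal sketch of independent sigmoid (multi-label) outputs, assuming PyTorch; the feature dimension, class count and variable names are placeholders and this is not the network described in the supplement.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Independent binary (sigmoid) outputs on top of a shared feature vector."""
    def __init__(self, n_features: int = 512, n_classes: int = 49):
        super().__init__()
        self.fc = nn.Linear(n_features, n_classes)

    def forward(self, features):
        return self.fc(features)  # raw logits; apply sigmoid for probabilities

head = MultiLabelHead()
loss_fn = nn.BCEWithLogitsLoss()          # one binary cross-entropy term per class
features = torch.randn(8, 512)            # e.g. pooled backbone features (placeholder)
targets = torch.randint(0, 2, (8, 49)).float()
loss = loss_fn(head(features), targets)
probabilities = torch.sigmoid(head(features))
```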

“In the supplementary material, the network is described with four filters in the last layer, but here it is not clear how these values are translated into AO/OTA categories.“

Answer: The layer’s dimensions are intentionally kept low to make sure that the label information is represented in the shared model’s representation space and that the network does not overfit through the fully connected layer’s parameters.

We did not perform an extra experiment on the size of the last layers because:

1. it is expensive and time-consuming for us,

2. the purpose of the publication is not to find the best architecture for our dataset, but rather to show the power of deep learning in orthopedic tasks, and

3. even in highly cited publications where the point of the publication is to report the best network architecture and the authors have access to near-unlimited processing power, the choice of minor hyperparameters is generally left to the hunch of the authors. [1,2,3,4]

[1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90.

[2] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).

[3] Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

[4] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

Action: None

“It is also unclear what it means that all images were processed individually by the core section of the network – does this refer to different images from the same patient?”

Answer: Yes, that is correct.

Action: None

“The supplement also mentions another dataset used for training, which needs to be mentioned in the main text.”

Answer: We recently published on AO classification of ankle fractures, and this dataset together with other manually labeled data was used to augment the training set. Our goal is to one day be able to classify all major bones using AO/OTA. The additional data has some regularizing effects, but interestingly we have not seen a noticeable improvement for ankles as the dataset for knees grew in size.

Action: We have added more information on the additional datasets that we used.

“The training process should also be described in more detail, at least summarizing how radiological reports were fed to the network in teacher student sessions, and how the other more complex regularization techniques were used such as the autoencoder.”

Answer: Autoencoders are commonly used for regularization and semi-supervised learning [5]. Our approach is not in any way novel [5,6,7], and therefore we wanted to focus more on the orthopedic perspective than on the technical details. Those interested in the technical details will have the code available (we will open-source the version used in this paper), and the supplement summarizes the most important details for an overview.

[5] Myronenko, Andriy. "3D MRI brain tumor segmentation using autoencoder regularization." International MICCAI Brainlesion Workshop. Springer, Cham, 2018.

[6] Kunin, Daniel, et al. "Loss landscapes of regularized linear autoencoders." arXiv preprint arXiv:1901.08168 (2019).

[7] Hinton, Geoffrey E., and Richard S. Zemel. "Autoencoders, minimum description length and Helmholtz free energy." Advances in neural information processing systems. 1994.

Action: We have added more details regarding the network to the paper.
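As a hedged illustration of autoencoder-style regularization in general terms (not the authors' specific architecture), a classification loss can be combined with a reconstruction loss from a decoder branch that shares the encoder. The layer sizes below are arbitrary placeholders, written in PyTorch.

```python
import torch
import torch.nn as nn

class ClassifierWithAEReg(nn.Module):
    """Shared encoder with a classification head and a decoder branch whose
    reconstruction loss acts as a regularizer (illustrative sketch only)."""
    def __init__(self, n_classes=49):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2))

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)

model = ClassifierWithAEReg()
x = torch.randn(4, 1, 256, 256)                     # placeholder radiograph batch
targets = torch.randint(0, 2, (4, 49)).float()
logits, reconstruction = model(x)
# Classification loss plus a weighted reconstruction term as regularization.
loss = nn.BCEWithLogitsLoss()(logits, targets) + 0.1 * nn.MSELoss()(reconstruction, x)
```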

“The active learning part is also described only very briefly.”

Answer: Early on we focused on attaining enough representative exams for each category. Once that was done, we used an entropy measure to rank samples for annotation. The latter is what is commonly viewed as active learning, although both were part of our targeted sampling procedure. During the process we selected for annotation the classes that were performing most poorly at the time. The selection of cases was made in batches, depending on how long it took to annotate the images.

Action: We added the information about targeting the categories to the section for clarity.
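For illustration, a minimal sketch of entropy-based ranking for annotation, assuming independent sigmoid outputs; the batch size and function names are placeholders and this is not the study's sampling code.

```python
import numpy as np

def binary_entropy(p, eps=1e-7):
    """Entropy of a Bernoulli prediction; highest when p is close to 0.5."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def rank_for_annotation(probabilities, target_class, batch_size=200):
    """Return indices of the unlabeled exams whose predictions for the
    currently worst-performing class are the most uncertain."""
    scores = binary_entropy(probabilities[:, target_class])
    return np.argsort(scores)[::-1][:batch_size]
```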

“For data selection, a random subset of the available images was selected based on the likelihood that the image contained a fracture. How was this likelihood determined and how exactly was it used to select images?”

Answer: This was a poor choice of wording regarding the reports. As we progressed, we let the network evaluate a new set of 1000–3000 radiographs, from which we selected a subset of exams. Early on we wanted to populate all the categories and selected exams with a high probability of belonging to those categories. Later we switched to more traditional active learning, where we chose examples at the boundary between the classes of interest; the class was selected based on performance on the validation set. We have tried to explain this in the methods section, but it is difficult to balance the level of detail with readability.

Action: See above change regarding the report and also the added details on the active learning.

“How was the data split into test, training and validation sets – randomly?”

Answer: The entire database, including the augmentation data, was split into the three sets at the beginning based on patients. This was due to the concern that patients appearing in multiple sets would cause the expected overfitting on the training set to leak over into the test set.

Action: We have tried to clarify in the first paragraph under “Patients & methods > Data sets”
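For illustration, a minimal sketch of a patient-level split using scikit-learn's GroupShuffleSplit; the proportions are placeholders and this is not the actual split code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(exam_ids, patient_ids, test_size=0.1, val_size=0.1, seed=0):
    """Split exams so that no patient appears in more than one set."""
    exam_ids, patient_ids = np.asarray(exam_ids), np.asarray(patient_ids)
    # First split off the test set by patient group.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val_idx, test_idx = next(outer.split(exam_ids, groups=patient_ids))
    # Then split the remainder into training and validation, again by patient.
    inner = GroupShuffleSplit(n_splits=1, test_size=val_size / (1 - test_size),
                              random_state=seed)
    train_idx, val_idx = next(inner.split(exam_ids[train_val_idx],
                                          groups=patient_ids[train_val_idx]))
    return (exam_ids[train_val_idx][train_idx],
            exam_ids[train_val_idx][val_idx],
            exam_ids[test_idx])
```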

“The test data was annotated by two orthopedic surgeons. Please give their initials if they are co-authors. I assume that Cohen’s kappa was computed with the readings before the consensus session, it might be worth stating this explicitly.”

Answer: Initials of the orthopedic surgeons should indeed be given; this was an oversight.

Action: We have added the initials of the orthopedic surgeons (MG, OS and EA).

“The values in the result section would be more informative with corresponding confidence intervals.”

Answer: We agree and have added bootstrapped confidence intervals. The idea with the weighting is to provide some sort of summary value that is not too dependent on individual categories, especially the small categories where results may be mere happenstance.

Action: Bootstrapped confidence intervals have been added to the results.
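For illustration, a minimal sketch of a percentile-bootstrap 95% confidence interval for an AUC, assuming scikit-learn and NumPy; the resample count is arbitrary and this is not the exact procedure used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling exams with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain only one class
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_prob), (lower, upper)
```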

“Minor comments:

- Page 3, Line 41: This sentence is not well written and hard to read: “deep learning; a branch of machine learning, utilizing neural networks; a form of artificial intelligence””

Answer: We agree, we have tried to modify the text for clarity.

Action: Manuscript has been modified.

“- Page 7, Line 123: “>0.7” should be “<0.7””

Answer: Thanks.

Action: Fixed.

“- Page 9, Line 175: “seemed to correspond somewhat” is not very precise language”

Answer: We agree that this is not the most precise language and we have tried to modify the text for clarity. The problem is that if we write “no clear correlation”, readers will lose the important information that the noise in these classes also makes them difficult for the network to learn. As the other reviewers have not objected to this vague wording, we hope that you will find the change sufficient.

Action: Changed to ”appeared to correspond weakly”

“- Page 9, Line 184: “falter” should probably be “failed””

Answer: Yes, fail is a better choice of wording.

Action: Changed

“- The resolution of Figure 3 is very low.”

Answer: Strange, our resolution is 9730 x 3272 for all figures. The gradients are, though, low-resolution, as they correspond to the downscaled input image.

Action: We will look into it at resubmission.

Reviewer #4:

“This is an interesting paper looking at machine learning and its ability to recognize knee fractures. This study is important as it is the start of what will probably become the accepted way of assessing and classifying fractures. It will also allow for the appropriate classification of fractures in an unbiased format allowing classifications to be correlated with results and ultimately to treatment decisions and outcomes. The methodology and statistical evaluation are acceptable. The results are definitely encouraging showing reasonable correlation with the AO/OTA classification based only on plain radiographs. The discussion was honest and dealt with the shortcoming and strengths of the research.”

Answer: Thank you very much for taking the time to review our paper and for your kind words. Classification is certainly a subjective task, and it is our firm belief that the orthopedic community would have much to gain from an unbiased classification. Too many of our discussions are concerned with whether a certain study actually included fractures that correspond to the fracture type under discussion.

Attachment

Submitted filename: Rebuttal_v2.1.docx

Decision Letter 1

Ivana Isgum

12 Feb 2021

PONE-D-20-26924R1

Artificial intelligence for the classification of fractures around the knee in adults according to the 2018 AO/OTA classification system

PLOS ONE

Dear Dr. Gordon,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Both reviewers appreciate the improvements made during the revision. Nevertheless, Reviewer 3 identified several major concerns that I agree with. The manuscript is therefore not yet ready for publication. Please carefully look at all comments, especially those provided by Reviewer 3, and make sure that the description of the method allows its reimplementation.

Please submit your revised manuscript by Mar 29 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ivana Isgum

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for your thorough responses. I have a few follow-up queries

On line 65 perhaps clarify that you did not look at all plain radiographs but rather those around the knee joint.

Consider re-writing the next paragraph so it is clear who excluded images for quality or open physes etc.

Line 102 - there are three sets of initials but only 2 orthopedic surgeons in your text

I am sorry if I missed this but did you report the agreement between your human raters -- how far down into the codes did they go - how many cases needed reconciliation

You have very few distal femur training cases - is this number adequate?

Were all the A cases evaluated for being A so in the patella you actually have 12 A's, 10 A1 etc

Thank you for these clarifications on your interesting work

Reviewer #3: The revision has improved this paper in several aspects. However, the method (neural network) is still not sufficiently described in my opinion. The argument that the code will be published on acceptance does not appeal to me - the reader should not need to dig through your code to understand your method.

I would strongly recommend moving most of the neural network description from the supplement into the main body of the manuscript, and extending this description. The reader should be able to reimplement the method based on the description in the paper, but there are currently too many details missing.

The “Neural network setup” section needs to contain more details about the whole setup, most importantly how the individual regularization techniques were used together (the reference for using auto-encoders for regularization [21] also refers to a paper about Stochastic Weight Averaging; this seems to be a mistake).

The outputs of the network appear to be sigmoid units, but the table in the supplement lists only 4 output units - it is still not clear to me how these are translated into the different categories. Or is there another layer with the final output units? With sigmoid outputs, two classifications that exclude each other could both have the same probability (e.g. 1.0) - what would be the final classification in such a case?

“The test set consisted of 600 cases, which were classified by two senior orthopedic surgeons, MG, OS and EA, working independently.” - this seems to list the initials of three observers?

The data availability statement has been considerably improved, now the test set will be released. The authors could consider also uploading their evaluation pipeline and their results to a platform like grand-challenge.org where other researchers can then later upload their own predictions for comparison.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No


Decision Letter 2

Ivana Isgum

8 Mar 2021

Artificial intelligence for the classification of fractures around the knee in adults according to the 2018 AO/OTA classification system

PONE-D-20-26924R2

Dear Dr. Gordon,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ivana Isgum

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

The authors have addressed all issues raised by the reviewers and therefore, the manuscript can be accepted for publication. The authors will share the data as soon as PLOS One provides them with a DOI.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #3: All comments have been addressed and the manuscript has been considerably improved. No further concerns.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Acceptance letter

Ivana Isgum

22 Mar 2021

PONE-D-20-26924R2

Artificial intelligence for the classification of fractures around the knee in adults according to the 2018 AO/OTA classification system

Dear Dr. Gordon:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Ivana Isgum

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    Attachment

    Submitted filename: Rebuttal_v2.1.docx

    Attachment

    Submitted filename: 2nd rebuttal.docx

    Data Availability Statement

There are legal and ethical restrictions on sharing the full data set. After discussions with the legal department at the Karolinska Institute we have decided that the double-reviewed test set can be shared without violating EU regulations. The deidentified test dataset with 600 images is available through the data-sharing platform provided by AIDA (https://datasets.aida.medtech4health.se/10.23698/aida/kf2020). The version used for training the network has been uploaded to GitHub, see https://github.com/AliRazavian/TU.

