Abstract
Objectives
To investigate the intra- and inter-rater reliability of the total radiomics quality score (RQS) and the reproducibility of individual RQS item scores in a large multireader study.
Methods
Nine raters with different backgrounds were randomly assigned to three groups based on their proficiency with RQS utilization: groups 1 and 2 represented the inter-rater reliability groups with and without prior training in RQS, respectively; group 3 represented the intra-rater reliability group. Thirty-three original research papers on radiomics were evaluated by the raters of groups 1 and 2. Of the 33 papers, 17 were evaluated twice, with an interval of 1 month, by the raters of group 3. The intraclass correlation coefficient (ICC) was used for continuous variables, and Fleiss’ and Cohen’s kappa (k) statistics for categorical variables.
Results
The inter-rater reliability was poor to moderate for the total RQS (ICC 0.30–0.55, p < 0.001) and very low to good for the items’ reproducibility (k − 0.12 to 0.75) within groups 1 and 2, for both inexperienced and experienced raters. The intra-rater reliability for the total RQS was moderate for the less experienced rater (ICC 0.522, p = 0.009), whereas experienced raters showed excellent intra-rater reliability (ICC 0.91–0.99, p < 0.001) between the first and second read. The intra-rater reliability of the RQS item scores was higher, and most items had moderate to good intra-rater reliability (k − 0.40 to 1).
Conclusions
Reproducibility of the total RQS and the score of individual RQS items is low. There is a need for a robust and reproducible assessment method to assess the quality of radiomics research.
Clinical relevance statement
There is a need for reproducible scoring systems to improve the quality of radiomics research and, consequently, close the translational gap between research and clinical implementation.
Key Points
• Radiomics quality score has been widely used for the evaluation of radiomics studies.
• Although the intra-rater reliability was moderate to excellent, the inter-rater reliability of the total score and of the point-by-point scores was low with the radiomics quality score.
• A robust, easy-to-use scoring system is needed for the evaluation of radiomics research.
Supplementary Information
The online version contains supplementary material available at 10.1007/s00330-023-10217-x.
Keywords: Reproducibility of results, Artificial intelligence, Radiomics, Inter-observer variability, Intra-observer variability
Introduction
Radiomics is an analysis tool to extract information from medical images that might not be perceived by the naked eye [1]. Over the course of a decade, several thousand studies have been published spanning diverse imaging disciplines in the field of radiomics research [2]. Nevertheless, the inherent complexity of these advanced methods that are employed to extract quantitative radiomics features may make it difficult to understand all facets of the analysis and evaluate the research quality, let alone to implement these published techniques in the clinical setting [3]. It is evident that easily applicable and robust tools for assessing the quality of radiomics research are needed to move the field forward.
With the aim of improving the quality of radiomics research methods, Lambin et al [4] proposed an assessment tool, the radiomics quality score (RQS), in 2017. Following the ideal workflow of conducting radiomics research, the RQS breaks it down into several steps and aims to standardize them. As a result, the RQS includes 16 items covering the entire lifecycle of radiomics research. Since its introduction in 2017, it has been widely adopted by the radiomics research community, and numerous systematic reviews using this assessment tool have been published [5–9]. However, it can still be inherently challenging for researchers or reviewers to correctly interpret and implement the RQS and, therefore, to assign reproducible scores; as a result, RQS scores in these systematic reviews are most often determined by consensus, without a reproducibility analysis [5–7, 10–13]. Importantly, no intra- or inter-rater reproducibility analysis was presented in the original RQS publication [4].
According to a recent review of systematic reviews using the RQS, the RQS is mostly applied in a consensus approach: 27 out of 44 review articles chose consensus scoring, 10 did not even specify how the final scores were obtained, and only 7 used intraclass correlation coefficients (ICC) or kappa (k) statistics to assess inter-rater reliability [5]. Despite the positive connotation of a consensus decision, a score reached by consensus is not necessarily reproducible. A consensus decision might solely reflect the most experienced rater, as novice voices could be suppressed, resulting in an underestimation of disagreement [14]. The choice of consensus over inter-rater reliability may also stem from challenges in applying the RQS and from ratings that cannot be reliably reproduced across raters. Evidently, there is room for improvement in establishing an easily usable and reproducible tool for all researchers.
In this study, we aim to perform a large multireader study to investigate the intra- and inter-rater reliability of the total RQS score and individual RQS items. We believe that a robust method for assessing the quality of radiomics research is essential to carry the field into the future of radiology, rather than ushering in a reproducibility crisis.
Material and methods
The study was conducted in adherence to the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) reporting guidelines [15].
Paper selection
We included studies published recently in European Radiology, within an arbitrarily chosen period of 1 month until the start of our study. The following search query was used: (“European Radiology”[Journal]) AND (“radiomics”[Title/Abstract] OR “radiomic”[Title/Abstract]) AND (2022/09/01:2022/10/20[Date—Publication]). European Radiology was selected because it is a first-quartile (Q1, Scimago Journal Rank) journal with the highest number of radiomics publications among all radiology journals; e.g., a PubMed search with the keywords “radiomics” or “radiomic” in the article title/abstract returned 249 original radiomics articles between January 1, 2021, and December 31, 2022 (Fig. 1).
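For reproducibility, the same query can also be run programmatically. Below is a minimal sketch using the rentrez R package; this package is not mentioned in the paper and is shown only as one possible way to execute the stated query.

```r
# Illustrative only: running the reported PubMed query with rentrez
# (how the search was actually executed is not stated in the study).
library(rentrez)

query <- paste(
  '"European Radiology"[Journal]',
  'AND ("radiomics"[Title/Abstract] OR "radiomic"[Title/Abstract])',
  'AND ("2022/09/01"[Date - Publication] : "2022/10/20"[Date - Publication])'
)
res <- entrez_search(db = "pubmed", term = query, retmax = 100)
res$count  # number of matching records
res$ids    # PubMed identifiers of the retrieved articles
```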
We only included original research articles and excluded systematic reviews, literature reviews, editorials, letters, and corrections. After applying the inclusion and exclusion criteria, a total of 33 articles were selected for the study, above the minimum sample size of 30 required for inter-rater reliability studies according to the guideline for selecting and reporting intraclass correlation coefficients for reliability research (Fig. 2) [16].
Rater selection and raters’ survey
A total of 9 raters with different backgrounds and experience levels were recruited through an open call within the European Society of Medical Imaging Informatics (EuSoMII) Radiomics Auditing Group. All raters initially completed a survey, sent by email, to determine their level of expertise in RQS application as well as in their occupation. They were then randomly assigned to the following groups according to their level of expertise: two inter-rater reliability groups, one with and one without a training session on the use of the RQS, and one intra-rater reliability group (Table 1).
Table 1.
Rater | RQS rating experience1 | Group2 | Occupation | Years of experience3 |
---|---|---|---|---|
1 (F.V.) | Novice | 2 | Radiologist | 4 |
2 (I.A.) | Novice | 1 | Radiology resident | 4 |
3 (E.A.P) | Intermediate | 3 | Radiologist | 9 |
4 (S.C.F.) | Intermediate | 2 | Radiology resident | 4 |
5 (A.B.) | Intermediate | 1 | Radiologist | 8 |
6 (R.Ca.) | Intermediate | 3 | Radiologist | 3 |
7 (L.U.) | Advanced | 2 | Radiologist | 5 |
8 (M.K.) | Advanced | 1 | Radiology resident | 3 |
9 (A.S.) | Advanced | 3 | Radiologist | 4 |
1Novice: I have no previous experience; intermediate: I have some experience with RQS (e.g., 1–2 RQS evaluations); advanced: I have extensive experience with RQS (e.g., 3 or more RQS evaluations)
2Group 1: inter-rater reliability w/ training, group 2: inter-rater reliability w/o training, group 3: intra-rater reliability
3In occupation
The inter-rater reliability group with training (group 1) received a brief training session on the RQS assessment, during which an experienced rater (T.A.D.) instructed them on how to rate all items using a random article [17]; they then separately completed the assessment of all 33 papers. The inter-rater reliability group without training (group 2) received no training on the RQS and completed the ratings of all 33 papers. The intra-rater reliability group (group 3) received no training and was asked to score 17 of the 33 selected papers twice, 1 month apart, to minimize recall (Fig. 3). All raters provided their ratings as they read each article and any available supplementary material. A keyword search was also allowed if needed.
At the end of the study, raters received another survey to investigate the challenges they faced during the RQS assessment and their possible solutions.
Statistical analysis
We used the ICC (two-way, single-rater, agreement, random-effects model) for continuous variables, i.e., the total RQS, and Fleiss’ and Cohen’s k statistics for categorical variables, i.e., the item scores, as recommended [15, 16, 18]. Cohen’s k does not support comparisons of more than two raters/ratings, in which case Fleiss’ k should be used [19]. Therefore, Cohen’s k was used when there were two ratings/raters to compare, i.e., group 3, and Fleiss’ k when there were more than two, i.e., groups 1 and 2 [19]. We used two one-sided t-tests (TOST), a test of equivalence based on the classical t-test, to investigate group differences between mean RQS scores [20]. All statistical analyses were carried out with R software (version 4.1.1) using the “irr” and “TOSTER” packages [21].
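As a minimal sketch of how these analyses map onto the named R packages, the following uses simulated ratings (not the study data); the matrix layouts and variable names are illustrative.

```r
library(irr)
set.seed(42)

# Illustrative layout: total RQS of 33 papers scored by the 3 raters of one group
rqs_g1 <- matrix(sample(0:36, 33 * 3, replace = TRUE), nrow = 33, ncol = 3)

# Two-way, single-rater, agreement, random-effects ICC for the total RQS
icc(rqs_g1, model = "twoway", type = "agreement", unit = "single")

# Fleiss' k for one item's categorical scores across >2 raters (groups 1 and 2)
item_g1 <- matrix(sample(c(0, 1), 33 * 3, replace = TRUE), nrow = 33, ncol = 3)
kappam.fleiss(item_g1)

# Cohen's k for one rater's two readings of 17 papers (group 3)
item_r3 <- matrix(sample(c(0, 1), 17 * 2, replace = TRUE), nrow = 17, ncol = 2)
kappa2(item_r3)
```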
Results
Paper selection
A total of 33 papers were included in this study. Two papers were technical papers, i.e., phantom studies, and all others were original research articles. The characteristics of included studies are shown in Table 2.
Table 2.
Paper | First author | Journal | Publication year | Model utility | Body region | Sample size | Modality | Mean RQS1 | Self-reported RQS |
---|---|---|---|---|---|---|---|---|---|
1 | Noortman WA [22] | Eur Radiol | 2022 | Classification | Abdomen | 38 | PET-CT | 10.3 | N/A |
2 | Bao D [23] | Eur Radiol | 2022 | Prognostication | Head and neck | 216 | MRI | 16 | N/A |
3 | Chen Q [24] | Eur Radiol | 2022 | Detection and prognostication | Thorax | 240 | CT | 15.2 | N/A |
4 | von Schacky CE [25] | Eur Radiol | 2022 | Classification | Musculoskeletal | 880 | X-ray | 13.2 | N/A |
5 | Chu F [26] | Eur Radiol | 2022 | Prognostication | Abdomen | 434 | MRI | 14.7 | N/A |
6 | Xiang F [27] | Eur Radiol | 2022 | Prognostication | Abdomen | 204 | CT | 14.8 | N/A |
7 | Zhang H [28] | Eur Radiol | 2022 | Classification | Abdomen | 138 | CT | 12.7 | N/A |
8 | Zheng Y [29] | Eur Radiol | 2022 | Classification | Head and neck | 388 | CT | 13 | N/A |
9 | Lin M [30] | Eur Radiol | 2022 | Detection | Head and neck | 489 | US | 12.5 | N/A |
10 | Jiang J [31] | Eur Radiol | 2022 | Detection | Neurovascular | 403 | CT | 14.3 | N/A |
11 | Kang JJ [32] | Eur Radiol | 2022 | Detection | Neuro | 149 | MRI | 8.7 | N/A |
12 | Zhang D [33] | Eur Radiol | 2022 | Classification | Abdomen | 209 | MRI | 13.7 | N/A |
13 | Ma X [34] | Eur Radiol | 2022 | Classification | Thorax | 612 | CT | 13.2 | N/A |
14 | Li MD [54] | Eur Radiol | 2022 | Detection | Technical | 108 | US | 2.5 | N/A |
15 | Xie X [35] | Eur Radiol | 2022 | Classification | Neuro | 89 | MRI | 8.5 | N/A |
16 | Zhu C [36] | Eur Radiol | 2022 | Prognostication | Abdomen | 106 | CT | 13.2 | N/A |
17 | Fan Y [51] | Eur Radiol | 2022 | Detection | Thorax | 192 | MRI | 14.8 | 27 |
18 | Zhao M [37] | Eur Radiol | 2022 | Prognostication | Thorax | 421 | PET-CT | 11.2 | N/A |
19 | Frood R [38] | Eur Radiol | 2022 | Prognostication | Whole body | 289 | PET-CT | 8.5 | N/A |
20 | Zheng Q [39] | Eur Radiol | 2022 | Detection | Neuro | 1650 | MRI | 8.8 | N/A |
21 | Zhong J [40] | Eur Radiol | 2022 | Detection and prognostication | Musculoskeletal | 144 | MRI | 14 | N/A |
22 | Cheng B [41] | Eur Radiol | 2022 | Detection | Thorax | 636 | CT | 14.8 | N/A |
23 | Bi S [42] | Eur Radiol | 2022 | Prognostication/classification | Head and neck | 128 | MRI | 13 | N/A |
24 | Si N [43] | Eur Radiol | 2022 | Detection and classification | Cardiovascular-thorax | 105 | CT | 12.3 | N/A |
25 | Eifer M [44] | Eur Radiol | 2022 | Classification | Thorax | 99 | PET-CT | 6 | N/A |
26 | Chen H [45] | Eur Radiol | 2022 | Detection and classification | Neuro | 609 | MRI | 11.3 | N/A |
27 | Zhong J [55] | Eur Radiol | 2022 | Detection | Technical | N/A | CT | 4.5 | N/A |
28 | Zhang X [46] | Eur Radiol | 2022 | Prognostication | Thorax | 172 | CT | 14.2 | N/A |
29 | Zhang H [52] | Eur Radiol | 2022 | Detection | Neuro | 355 | CT | 12.8 | 23 |
30 | Zheng YM [47] | Eur Radiol | 2022 | Prognostication | Head and neck | 217 | CT | 15.8 | N/A |
31 | Salinas-Miranda E [48] | Eur Radiol | 2022 | Detection | Abdomen | 122 | CT | 9.3 | N/A |
32 | Nagaraj Y [49] | Eur Radiol | 2022 | Detection | Thorax | 2720 | CT | 14.3 | N/A |
33 | Bleker J [50] | Eur Radiol | 2022 | Detection | Abdomen | 524 | MRI | 11.7 | N/A |
1According to the ratings of 6 raters from groups 1 and 2
Rater selection and raters’ survey
Raters were randomly assigned to groups based on the initial survey results (Table 1). After completing the assessments, raters were given another survey to explore the challenges they faced during the RQS assessment and their possible solutions. All responses can be found in Table E1. One of the main problems they faced was the confusion caused by the lack of clear explanations of the RQS items in the main RQS paper and in the RQS checklist [4]. A list of the major issues with RQS along with our recommendations for a simpler approach is presented in Table 3.
Table 3.
Item no | Topic | Score range | Challenges | Recommendation |
---|---|---|---|---|
1 | Image protocol quality | + 1 or + 2 | Although this item looks straightforward, it is not exactly clear what the authors meant by a public protocol: is it an imaging protocol recommended by international guidelines, or the use of a previously validated protocol? | It is important to explicitly report the imaging protocol to maintain feature reproducibility; however, the level of detail required for the point to be assigned should be defined, and the imaging protocol should ideally be agreed upon by the radiomics scientific community |
2 | Multiple segmentation | + 1 | Multiple segmentation alone does not ensure quality; multiple segmentations should always include reproducibility testing to add value. It is also difficult to justify this item, since fully automated segmentation algorithms, e.g., nnU-Net, are now in use | Manual segmentation is rarely used; if multiple segmentations are performed, we recommend assigning a point only when the study includes reproducibility testing. We also recommend adapting this item to recent developments by including semiautomatic and automatic segmentation in the rating |
3 | Phantom study | + 1 | Most of the time, a study is either a phantom study dealing with feature reproducibility or a clinical study without any phantom; this item generates clutter and suits neither type of study | The quality of phantom studies can be gauged neither by this item nor by the total RQS, since phantom studies will usually receive a lower total score, as we also observed in our study. We therefore recommend removing this item from the scoring system |
4 | Imaging at multiple time points | + 1 | Most of the time, this item was not fulfilled, and even when it was, fulfillment did not always ensure study quality. Furthermore, the item could easily be misinterpreted and was therefore not reproducible, as no clear definition was provided in the checklist or in the original RQS article | This item should be clearly explained |
5 | Feature reduction or adjustment for multiple testing | − 3 or + 3 | Giving a minus score is confusing; moreover, it might cause problems when calculating the total score, as raters could forget the minus sign or it could be mistakenly deleted during the analysis | We acknowledge that the RQS works with a reward-and-penalty system, but we recommend always assigning a positive score or no score at all |
6 | Multivariable analysis with non-radiomics features | + 1 | This item was confusing and needs more clarification as to what non-radiomics features entail | We recommend adding a clear explanation of non-radiomics features (i.e., standard-of-care clinical features or semantic imaging features). It should also be denoted whether non-radiomics features are included in the feature selection process or only in the final model, i.e., whether the model is holistic or non-radiomics features are removed entirely during feature selection |
7 | Detect and discuss biological correlates | + 1 | This was the most confusing item according to all raters, as many researchers superficially claim that their results correlate with biological features | We think this item needs clarification |
8 | Cut-off analyses | + 1 | Even when included in the analysis, a cut-off analysis does not necessarily ensure quality, and it introduces an arbitrary dichotomization | It is not always necessary to conduct a cut-off analysis |
9 | Discrimination statistics | + 1 or + 2 | Some raters may be inexperienced with these statistics | Giving some examples could be helpful |
10 | Calibration statistics | + 1 or + 2 | Models that do not produce probabilistic outputs are automatically penalized | Needs some adjustment for non-probabilistic outputs |
11 | Prospective study with registration to database | + 7 | Some studies, although retrospective, include a prospectively collected test/validation set; this does not fulfill the requirements of a prospective study | It should be specified that prospective data collection must be performed explicitly for radiomics clinical trials (i.e., the point is not assigned for retrospective analyses of patients collected prospectively for other studies) and registered in a clinical trial database |
12 | Validation | − 5 to + 5 | This item is very confusing for raters, since several distinct steps need to be defined | We think the most important components are internal and external validation, and the internal validation step is sometimes fulfilled with cross-validation. We therefore recommend scoring only these validation steps and keeping the score range at + 1 to + 2, one point for each step. Moreover, an independent validation study should be rated the same as an external validation |
13 | Comparison to reference standard | + 2 | This was one of the most confusing items, since most studies either do not fulfill this step or lack a clearly defined reference standard | We recommend providing a clear definition of the reference standard and outcomes |
14 | Potential clinical utility | + 2 | This item is very open-ended, and authors often claim great potential clinical utility even when their decision curve analysis shows the opposite | We recommend abolishing this item and instead accepting external validation as a sign of potential clinical utility |
15 | Cost-effectiveness analysis | + 1 | Almost no study provides this, since the majority are exploratory and prospective studies and RCTs are still lacking | This item is unnecessary for typical radiomics studies and might be more suitable for RCTs or radiation oncology studies; we recommend omitting it |
16 | Open science and data | + 1 to + 4 | The different incremental steps create confusion | We think this item is very important, but the scoring must be simplified; rating open data, open code, and open model availability would be enough |
Total points | The total score is often converted into percentage values | − 8 to + 36 | When there are negative results, it is difficult to convert them to percentages: a paper with a score of − 8 will score the same percentage as a paper with a score of 0, so both can be assumed to be of low quality, but with a negative score the magnitude of this “lowness” is difficult to gauge. Moreover, the total score is hard to interpret when no reference point regarding quality is proposed | We recommend using only 0 or positive values for scoring. Along with the total score, we also propose implementing thresholds: 0–25% low, 25–50% average, 50–75% moderate, 75–100% excellent |
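A minimal sketch of the thresholding recommendation in the last row of Table 3; the helper rqs_percent is hypothetical and not part of the RQS or of the study code.

```r
# Hypothetical helper implementing the recommendation: non-negative scores only,
# percentage of the maximum total, and the proposed quality bands.
rqs_percent <- function(total, max_score = 36) {
  pct <- 100 * max(total, 0) / max_score  # negative totals floor at 0%
  band <- cut(pct, breaks = c(0, 25, 50, 75, 100),
              labels = c("low", "average", "moderate", "excellent"),
              include.lowest = TRUE)
  list(percent = round(pct, 1), quality = as.character(band))
}

rqs_percent(-8)  # percent 0,    quality "low" (same percentage as a score of 0)
rqs_percent(16)  # percent 44.4, quality "average"
```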
Statistical analysis
Inter-rater reliability
The inter-rater reliability was poor between the raters of group 1 (ICC 0.30; 95% CI [0.09–0.52]; p = 0.0015) and moderate between the raters of group 2 (ICC 0.55; 95% CI [0.29–0.74]; p < 0.001), and it remained low to moderate when comparing raters of groups 1 and 2 with the same level of experience (ICC 0.26–0.61). The same trend was observed in the intra-group reliability analysis: raters of group 1 showed poor and raters of group 2 moderate inter-rater reliability (Table 4).
Table 4.
ICC | 95% CI | p | |
---|---|---|---|
Inter-rater analysis | |||
Overall ICC | |||
Group 1 | 0.301 | 0.09–0.52 | 0.0015 |
Group 2 | 0.549 | 0.29–0.74 | < 0.001 |
Intra-group ICC | |||
Group 1 | |||
Raters 2 vs 5 | 0.304 | − 0.04 to 0.58 | 0.0418 |
Raters 2 vs 8 | 0.232 | − 0.07 to 0.51 | 0.0685 |
Raters 5 vs 8 | 0.358 | 0.04–0.61 | 0.0137 |
Group 2 | |||
Raters 1 vs 4 | 0.603 | − 0.05 to 0.85 | 0.0398 |
Raters 1 vs 7 | 0.510 | 0.21–0.72 | 0.001 |
Raters 4 vs 7 | 0.529 | 0.17–0.75 | < 0.001 |
Inter-group ICC (matched level of experience) | |||
Group 1 novice vs group 2 novice | 0.255 | − 0.06 to 0.54 | 0.0612 |
Group 1 intermediate vs group 2 intermediate | 0.609 | 0.34–0.79 | < 0.001 |
Group 1 advanced vs group 2 advanced | 0.349 | − 0.08 to 0.66 | 0.0649 |
Intra-rater analysis | |||
Group 3 | |||
Rater 3 | 0.522 | 0.09–0.79 | 0.009 |
Rater 6 | 0.910 | 0.77–0.96 | < 0.001 |
Rater 9 | 0.989 | 0.96–0.99 | < 0.001 |
Intra-rater reliability
In the intra-rater reliability analysis, only rater 3, with an intermediate experience level, showed moderate reliability between the first and second read (ICC 0.522; 95% CI [0.09–0.79]; p = 0.009), whereas rater 6 and rater 9 showed excellent intra-rater reliability (ICC 0.91; 95% CI [0.77–0.96]; p < 0.001 and ICC 0.99; 95% CI [0.96–0.99]; p < 0.001, respectively).
Reliability of RQS items’ score
The inter-rater reliability for the RQS item scores within groups 1 and 2 was very low. The only items with high inter-rater reliability were item 3 (phantom study) and item 15 (cost-effectiveness analysis); all other items had poor to moderate inter-rater reliability. The intra-rater reliability of the RQS item scores was higher, and most items had moderate to good, if not perfect, intra-rater reliability. The mean ± standard deviation of the k values was 0.18 ± 0.33 for group 1 and 0.43 ± 0.30 for group 2; within group 3, it was 0.70 ± 0.30 for rater 3, 0.75 ± 0.22 for rater 6, and 0.88 ± 0.27 for rater 9. Fleiss’ k for each RQS item of groups 1 and 2 and Cohen’s k for each RQS item of group 3 are summarized in Table 5.
Table 5.
Group 1 | Group 2 | Group 3 | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Rater 3 | Rater 6 | Rater 9 | ||||||||
k* | p | k* | p | k+ | p | k+ | p | k+ | p | |
Item 1 | 0.03 | 0.79 | 0.32 | < 0.001 | 0.46 | 0.008 | 0.83 | < 0.001 | − 0.03 | 0.79 |
Item 2 | 0.26 | 0.01 | 0.51 | < 0.001 | 0.57 | 0.006 | 0.72 | 0.002 | 0.82 | < 0.001 |
Item 3 | 1 | 0 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 |
Item 4 | − 0.1 | 0.24 | 0.54 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 |
Item 5 | 0.0006 | 0.99 | 0.46 | < 0.001 | 1 | < 0.001 | 0.64 | 0.004 | 1 | < 0.001 |
Item 6 | 0.35 | < 0.001 | 0.55 | < 0.001 | 0.41 | 0.09 | 1 | < 0.001 | 1 | < 0.001 |
Item 7 | 0.38 | < 0.001 | 0.19 | 0.05 | 0.82 | < 0.001 | 0.77 | 0.001 | 1 | < 0.001 |
Item 8 | − 0.16 | 0.11 | 0.52 | < 0.001 | 0.01 | 0.94 | 0.36 | 0.09 | 1 | < 0.001 |
Item 9 | 0.15 | 0.06 | 0.06 | 0.44 | 1 | < 0.001 | 0.76 | < 0.001 | 0.58 | 0.001 |
Item 10 | 0.75 | < 0.001 | 0.56 | < 0.001 | 0.37 | 0.04 | 0.79 | < 0.001 | 1 | < 0.001 |
Item 11 | − 0.02 | 0.83 | − 0.02 | 0.83 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 |
Item 12 | 0.22 | < 0.001 | 0.43 | < 0.001 | 0.59 | < 0.001 | 0.80 | < 0.001 | 0.89 | < 0.001 |
Item 13 | − 0.04 | 0.68 | 0.50 | < 0.001 | 0.45 | 0.02 | 0.54 | 0.01 | 1 | < 0.001 |
Item 14 | 0.23 | 0.01 | 0.22 | 0.02 | 0.76 | < 0.001 | 0.59 | 0.01 | 0.85 | < 0.001 |
Item 15 | − 0.01 | 0.91 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 | 1 | < 0.001 |
Item 16 | − 0.21 | < 0.001 | 0.03 | 0.72 | 1 | < 0.001 | 0.30 | 0.20 | 1 | < 0.001 |
*Fleiss’ k
+Cohen’s k
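The per-group summaries of the item kappas reported above could be computed along the following lines; this is a sketch with simulated item scores, where items_g1 is an illustrative stand-in for one papers × raters score matrix per RQS item.

```r
library(irr)
set.seed(1)

# Illustrative stand-in: 16 RQS items, each a 33-papers x 3-raters score matrix
items_g1 <- replicate(16, matrix(sample(c(0, 1), 33 * 3, replace = TRUE),
                                 nrow = 33, ncol = 3), simplify = FALSE)

item_kappas <- sapply(items_g1, function(m) kappam.fleiss(m)$value)
c(mean = mean(item_kappas), sd = sd(item_kappas))  # cf. 0.18 ± 0.33 for group 1
```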
Moreover, we found that two of the 33 manuscripts included a self-reported RQS, which was higher than the scores assigned by the raters in our study, as reported in Table 2 [51, 52].
The mean RQS was 10.2 ± 3.5 for group 1 and 13.2 ± 4.0 for group 2; for group 3, it was 12.23 ± 5.0 at the first read and 12.4 ± 4.9 at the second read (Fig. 4). Two one-sided t-tests were applied to the mean RQS values obtained by the readers of groups 1 and 2. The lower and upper equivalence bounds were calculated to provide a statistical power of 0.8 at an alpha of 0.05. With lower and upper equivalence bounds of ± 2.6 and a mean difference of − 3.1, the p value was 0.7 for the lower bound and < 0.001 for the upper bound (Fig. 5).
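As a sketch, this comparison can be reproduced from the reported summary statistics with the TOSTER package, assuming n = 33 papers per group; TOSTtwo.raw is the legacy interface taking raw-unit equivalence bounds (newer TOSTER versions provide tsum_TOST), and its output may differ slightly from the paper’s computation.

```r
library(TOSTER)

# Equivalence test on the reported group means/SDs (assumption: n = 33 per group)
TOSTtwo.raw(m1 = 10.2, sd1 = 3.5, n1 = 33,   # group 1 mean total RQS
            m2 = 13.2, sd2 = 4.0, n2 = 33,   # group 2 mean total RQS
            low_eqbound = -2.6, high_eqbound = 2.6, alpha = 0.05)
```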
Discussion
In this study, we conducted a multireader analysis to investigate the intra- and inter-rater reliability of the total RQS as well as of the individual RQS item scores, involving readers with different levels of RQS rating experience. We found that, despite being widely adopted, the RQS tool is not straightforward to comprehend and apply, and its results may not be reproducible in many cases (inter-rater reliability ICC 0.30–0.55, p < 0.001, and intra-rater reliability ICC 0.522, p = 0.009, for the total RQS; k − 0.12 to 0.75 for inter-rater and k − 0.40 to 1 for intra-rater item reproducibility). Our results suggest that there is room for improvement in establishing an easy-to-use scoring framework for authors, reviewers, and editors to assess the quality of radiomics studies.
To date, the RQS has served as a valuable tool to fill the gap in guidance on the quality assessment of radiomics research. Like Lambin et al [4], we believe that the quality of radiomics research should not be compromised and that researchers should transparently report their methods to ensure quality and reproducibility. In addition, to further advance the field, researchers should be incentivized to adopt open science practices. Nonetheless, any questionnaire or score intended for the evaluation of research or clinical practice should itself be rigorously evaluated for reliability and reproducibility. To date, this has not happened for the RQS, even though it is widely used to assess the quality of radiomics research. Therefore, we believe that, 5 years after its introduction, the RQS should be updated to be more easily used by researchers, reviewers, and editors. Recently, a new reporting guideline has been published that covers the requirements necessary to improve the quality and reliability of radiomics research [53]; we think our recommendations are in line with this new guideline.
Interestingly, we found a slightly negative effect of the training session that took place prior to the RQS application (according to the two one-sided t-tests, groups 1 and 2 were not equivalent and differed statistically, with lower and upper equivalence bounds of ± 2.6 and a mean difference of − 3.1; lower bound p = 0.7, upper bound p < 0.001). The raters of group 1 showed poor inter-rater reliability despite the training, whereas group 2 showed moderate inter-rater reliability even though they had not received any instructions beforehand. Moreover, we observed a positive effect of greater experience only in the intra-rater reliability analysis: the advanced raters showed excellent intra-rater reliability, whereas the less experienced rater showed moderate reliability. We did not observe an effect of experience in the inter-rater reliability analysis.
The raters indicated that the RQS instructions were not self-explanatory in most cases; therefore, they needed more time to interpret the RQS items and, consequently, to assign a score. For example, item 4, “imaging at multiple time points,” had low inter-rater reproducibility (k = − 0.1 in group 1; k = 0.54 in group 2) owing to an unclear item definition in the checklist as well as in the article [4]. It could be argued that this item refers to imaging at different time points within the same examination, e.g., arterial/portal venous phase imaging, inspiration/expiration, or test-retest acquisitions. On the other hand, it could also be argued that it hints at longitudinal studies where imaging is performed at different time points, e.g., within 3 months, to perform a delta radiomics analysis. Also, the non-standard ranges of values, i.e., the sudden change from + 1 to + 2 up to − 7 to + 7, caused confusion among the raters when assigning scores, without a proper justification of such non-standard ranges (e.g., for items 5, 12, and 16). A non-standard range would have been acceptable had the item scores been weighted according to their importance (Table 3).
One problem was that items unusual for the radiology workflow led to confusion instead of clarity. For example, some radiomics studies deal only with phantoms, with the intention of covering technical aspects or testing the stability of radiomics features [54, 55]. An item dealing with phantom studies (item 3) might therefore seem a good idea, but in practice, clinical radiomics studies do not necessarily use a phantom step to stabilize their features and thus do not fulfill this item. Although the transferability of feature robustness from a phantom to a specific biological tissue still needs to be demonstrated for radiomics, technically focused phantom studies typically lack clinical validation and therefore tend to achieve lower RQS scores. Similar issues were identified with item 15, which addresses cost-effectiveness analysis. Such analysis is very unusual for current radiomics studies, which are mostly retrospective and rarely prospective, let alone part of a randomized controlled trial. Also, the definition of cost for radiomics still represents a challenge and, to the best of our knowledge, no published cost-effectiveness analysis for radiomics exists in the literature [56]; the item’s value in terms of methodological quality could benefit from more research on the topic. Although items 3 and 15 were the most reproducible (Table 5), we argue that they create unnecessary clutter and have a limited impact on overall study quality, as their fulfillment tended to depend exclusively on the study aim or design.
Nowadays, more and more studies utilize deep learning for radiomics analysis; however, the current RQS tool mainly focuses on hand-crafted radiomics, and items specifically addressing the methodological challenges typical of deep learning approaches to radiomics are lacking. Consequently, robust and properly designed deep learning studies might be penalized with a low total RQS merely because the tool fails to address questions that are relevant to deep learning methodology. Moreover, the current RQS tool does not rate sample size analysis or the proper selection of study subjects. We think sample size analysis and the definition of study subjects should be included, since study design is one of the most critical steps of a study [57].
We noted that some studies included self-reported scores in their publications; unfortunately, we found these to be overly enthusiastic assessments and observed a large discrepancy when comparing them with the mean RQS from our multireader analysis [51, 52]. It is not a new phenomenon that researchers tend to overestimate their results and report them through a rose-tinted frame of enthusiasm. Based on our evidence, this is a cautionary note for reviewers, editors, and readers to aid the correct evaluation of self-reported RQS scores.
Our study had some limitations. We included only a limited number of papers, although still more than the minimum sample size required for inter-rater reliability studies according to the guidelines [16]. Moreover, we included articles only from European Radiology. However, in the field of medical imaging, European Radiology is the Q1 journal with the highest number of radiomics publications over the past 2 years, ensuring the quality of the studies from a selection of diverse radiomics research areas. In addition, although we intended to explore the effects of training, we did not find any positive effect of training on the reproducibility of the RQS. On the one hand, using only one paper as a teaching example might not be sufficient to produce a measurable difference; on the other hand, a tool that requires extensive training, even among researchers in the field, to reach adequate reproducibility reveals the limitations of the RQS. Finally, we did not investigate the effect of training on intra-rater reliability; however, we think this effect might be too small to detect, as we already found that intra-rater reliability without training was moderate to excellent.
In conclusion, we have come a long way in the field of radiomics research, but on the long road to clinical implementation, we need reproducible scoring systems as much as we need reproducible radiomics research. We hope that our recommendations for a more straightforward radiomics quality assessment tool will help researchers, reviewers, and editors to achieve this goal.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
For the EuSoMII Radiomics Auditing Group: Kevin Lipman, Gaia Spadarella, Andrea Ponsiglione, and Anna Andreychenko.
Abbreviations
- EuSoMII
European Society of Medical Imaging Informatics
- GRRAS
Guidelines for Reporting Reliability and Agreement Studies
- ICC
Intraclass correlation coefficient
- k
Kappa statistics
- Q1
First quartile
- RQS
Radiomics quality score
- TOST
Two one-sided t-tests
Funding
Open access funding provided by University of Basel. The authors state that this work has not received any funding.
Declarations
Guarantor
The scientific guarantor of this publication is Renato Cuocolo.
Conflict of interest
The following authors of this manuscript declare relationships with the following companies: Federica Vernuccio serves as an editorial board member of European Radiology and has not taken part in the review and decision process of this paper.
Roberto Cannella received support for attending meetings from Bracco and Bayer; co-funding by the European Union-FESR or FSE, PON Research and Innovation 2014–2020—DM 1062/2021.
Peter van Ooijen: speaker fees from Siemens Healthineers, Bayer, Novartis, and Advisory Board member: MedicalPHIT, ContextFlow.
Elmar Kotter: speaker fees from Siemens Healthineers and AbbVie; Advisory Board member ContextFlow, Vienna.
Daniel Pinto dos Santos serves as Deputy Editor of European Radiology and has not taken part in the review and decision process of this paper. He received speaker fees from Bayer; Advisory Board member for cook medical.
Renato Cuocolo serves as an editorial board member of European Radiology and has not taken part in the review and decision process of this paper.
Federica Vernuccio: Received support to attend meetings from Bracco Imaging S.r.l., and GE Healthcare.
Michail E. Klontzas: Meeting attendance support from Bayer.
The other authors of this manuscript declare no relationships with any companies, whose products or services may be related to the subject matter of the article.
Statistics and biometry
Armando Ugo Cavallo (co-author) and Renato Cuocolo (co-author) have significant statistical expertise.
Informed consent
Written informed consent was not necessary for this study because no human subjects were involved.
Ethical approval
Institutional Review Board approval was not required because no human subjects were involved.
Study subjects or cohorts overlap
None.
Methodology
• Retrospective
• Observational
• Multicenter study
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology. 2016;278:563–577. doi: 10.1148/radiol.2015151169. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Huang EP, O’Connor JPB, McShane LM, et al. Criteria for the translation of radiomics into clinically useful tests. Nat Rev Clin Oncol. 2022 doi: 10.1038/s41571-022-00707-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Pinto dos Santos D, Dietzel M, Baessler B (2020) A decade of radiomics research: are images really data or just patterns in the noise? Eur Radiol 2–5. 10.1007/s00330-020-07108-w [DOI] [PMC free article] [PubMed]
- 4.Lambin P, Leijenaar RTH, Deist TM, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017;14:749–762. doi: 10.1038/nrclinonc.2017.141. [DOI] [PubMed] [Google Scholar]
- 5.Spadarella G, Stanzione A, Akinci D’Antonoli T, et al. Systematic review of the radiomics quality score applications: an EuSoMII Radiomics Auditing Group Initiative. Eur Radiol. 2022 doi: 10.1007/s00330-022-09187-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Stanzione A, Gambardella M, Cuocolo R et al (2020) Prostate MRI radiomics: a systematic review and radiomic quality score assessment. Eur J Radiol 129:109095. 10.1016/j.ejrad.2020.109095 [DOI] [PubMed]
- 7.Ugga L, Perillo T, Cuocolo R, et al. Meningioma MRI radiomics and machine learning: systematic review, quality score assessment, and meta-analysis. Neuroradiology. 2021;63:1293–1304. doi: 10.1007/s00234-021-02668-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Spadarella G, Calareso G, Garanzini E et al (2021) MRI based radiomics in nasopharyngeal cancer: systematic review and perspectives using radiomic quality score (RQS) assessment. Eur J Radiol 140:109744. 10.1016/j.ejrad.2021.109744 [DOI] [PubMed]
- 9.Abdurixiti M, Nijiati M, Shen R et al (2021) Current progress and quality of radiomic studies for predicting EGFR mutation in patients with non-small cell lung cancer using PET/CT images: A systematic review. Br J Radiol:94. 10.1259/bjr.20201272 [DOI] [PMC free article] [PubMed]
- 10.Zhong J, Hu Y, Si L, et al. A systematic review of radiomics in osteosarcoma: utilizing radiomics quality score as a tool promoting clinical translation. Eur Radiol. 2021;31:1526–1535. doi: 10.1007/s00330-020-07221-w. [DOI] [PubMed] [Google Scholar]
- 11.Wang H, Zhou Y, Li L, et al. Current status and quality of radiomics studies in lymphoma: a systematic review. Eur Radiol. 2020;30:6228–6240. doi: 10.1007/s00330-020-06927-1. [DOI] [PubMed] [Google Scholar]
- 12.Ursprung S, Beer L, Bruining A, et al. Radiomics of computed tomography and magnetic resonance imaging in renal cell carcinoma—a systematic review and meta-analysis. Eur Radiol. 2020;30:3558–3566. doi: 10.1007/s00330-020-06666-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Kao YS, Te LK (2021) A meta-analysis of computerized tomography-based radiomics for the diagnosis of COVID-19 and viral pneumonia. Diagnostics 11. 10.3390/diagnostics11060991 [DOI] [PMC free article] [PubMed]
- 14.Bankier AA, Levine D, Halpern EF, Kressel HY. Consensus interpretation in imaging research: is there a better way? Radiology. 2010;257:14–17. doi: 10.1148/radiol.10100252. [DOI] [PubMed] [Google Scholar]
- 15.Kottner J, Audigé L, Brorson S, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Int J Nurs Stud. 2011;64:96–106. doi: 10.1016/j.jclinepi.2010.03.002. [DOI] [PubMed] [Google Scholar]
- 16.Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–163. doi: 10.1016/j.jcm.2016.02.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Gu D, Hu Y, Ding H, et al. CT radiomics may predict the grade of pancreatic neuroendocrine tumors: a multicenter study. Eur Radiol. 2019;29:6880–6890. doi: 10.1007/s00330-019-06176-x. [DOI] [PubMed] [Google Scholar]
- 18.Harvey ND (2021) A simple guide to inter-rater, intra-rater and test-retest reliability for animal behaviour studies. OSF Prepr:1–13. 10.31219/osf.io/8stpy. Accessed at: https://osf.io/8stpy
- 19.McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 22:276–282. 10.11613/BM.2012.031 [PMC free article] [PubMed]
- 20.Lakens D. Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Soc Psychol Personal Sci. 2017;8:355–362. doi: 10.1177/1948550617697177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.R Core Team (R Foundation for Statistical Computing) (2022) R: A language and environment for statistical computing. https://www.r-project.org/
- 22.Noortman WA, Vriens D, de Geus-Oei LF, et al. [18F]FDG-PET/CT radiomics for the identification of genetic clusters in pheochromocytomas and paragangliomas. Eur Radiol. 2022;32:7227–7236. doi: 10.1007/s00330-022-09034-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Bao D, Zhao Y, Li L, et al. A MRI-based radiomics model predicting radiation-induced temporal lobe injury in nasopharyngeal carcinoma. Eur Radiol. 2022;32:6910–6921. doi: 10.1007/s00330-022-08853-w. [DOI] [PubMed] [Google Scholar]
- 24.Chen Q, Shao JJ, Xue T et al (2022) Intratumoral and peritumoral radiomics nomograms for the preoperative prediction of lymphovascular invasion and overall survival in non-small cell lung cancer. Eur Radiol. 10.1007/s00330-022-09109-3 [DOI] [PubMed]
- 25.von Schacky CE, Wilhelm NJ, Schäfer VS, et al. Development and evaluation of machine learning models based on X-ray radiomics for the classification and differentiation of malignant and benign bone tumors. Eur Radiol. 2022;32:6247–6257. doi: 10.1007/s00330-022-08764-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Chu F, Liu Y, Liu Q, et al. Development and validation of MRI-based radiomics signatures models for prediction of disease-free survival and overall survival in patients with esophageal squamous cell carcinoma. Eur Radiol. 2022;32:5930–5942. doi: 10.1007/s00330-022-08776-6. [DOI] [PubMed] [Google Scholar]
- 27.Xiang F, Liang X, Yang L et al (2022) Contrast-enhanced CT radiomics for prediction of recurrence-free survival in gallbladder carcinoma after surgical resection. Eur Radiol 32:7087–7097. 10.1007/s00330-022-08858-5 [DOI] [PubMed]
- 28.Zhang H, Meng Y, Li Q, et al. Two nomograms for differentiating mass-forming chronic pancreatitis from pancreatic ductal adenocarcinoma in patients with chronic pancreatitis. Eur Radiol. 2022;32:6336–6347. doi: 10.1007/s00330-022-08698-3. [DOI] [PubMed] [Google Scholar]
- 29.Zheng Y, Zhou D, Liu H, Wen M. CT-based radiomics analysis of different machine learning models for differentiating benign and malignant parotid tumors. Eur Radiol. 2022;32:6953–6964. doi: 10.1007/s00330-022-08830-3. [DOI] [PubMed] [Google Scholar]
- 30.Lin M, Tang X, Cao L et al (2022) Using ultrasound radiomics analysis to diagnose cervical lymph node metastasis in patients with nasopharyngeal carcinoma. Eur Radiol. 10.1007/s00330-022-09122-6 [DOI] [PubMed]
- 31.Jiang J, Wei J, Zhu Y et al (2022) Clot-based radiomics model for cardioembolic stroke prediction with CT imaging before recanalization: a multicenter study. Eur Radiol. 10.1007/s00330-022-09116-4 [DOI] [PubMed]
- 32.Kang JJ, Chen Y, Xu GD, et al. Combining quantitative susceptibility mapping to radiomics in diagnosing Parkinson’s disease and assessing cognitive impairment. Eur Radiol. 2022;32:6992–7003. doi: 10.1007/s00330-022-08790-8. [DOI] [PubMed] [Google Scholar]
- 33.Zhang D, Cao Y, Sun Y et al (2022) Radiomics nomograms based on R2* mapping and clinical biomarkers for staging of liver fibrosis in patients with chronic hepatitis B: a single-center retrospective study. Eur Radiol. 10.1007/s00330-022-09137-z [DOI] [PubMed]
- 34.Ma X, Xia L, Chen J et al (2022) Development and validation of a deep learning signature for predicting lymph node metastasis in lung adenocarcinoma: comparison with radiomics signature and clinical-semantic model. Eur Radiol. 10.1007/s00330-022-09153-z [DOI] [PubMed]
- 35.Xie X, Yang L, Zhao F, et al. A deep learning model combining multimodal radiomics, clinical and imaging features for differentiating ocular adnexal lymphoma from idiopathic orbital inflammation. Eur Radiol. 2022;32:6922–6932. doi: 10.1007/s00330-022-08857-6. [DOI] [PubMed] [Google Scholar]
- 36.Zhu C, Hu J, Wang X, et al. A novel clinical radiomics nomogram at baseline to predict mucosal healing in Crohn’s disease patients treated with infliximab. Eur Radiol. 2022;32:6628–6636. doi: 10.1007/s00330-022-08989-9. [DOI] [PubMed] [Google Scholar]
- 37.Zhao M, Kluge K, Papp L, et al. Multi-lesion radiomics of PET/CT for non-invasive survival stratification and histologic tumor risk profiling in patients with lung adenocarcinoma. Eur Radiol. 2022;32:7056–7067. doi: 10.1007/s00330-022-08999-7. [DOI] [PubMed] [Google Scholar]
- 38.Frood R, Clark M, Burton C et al (2022) Utility of pre-treatment FDG PET/CT–derived machine learning models for outcome prediction in classical Hodgkin lymphoma. Eur Radiol:7237–7247. 10.1007/s00330-022-09039-0 [DOI] [PMC free article] [PubMed]
- 39.Zheng Q, Zhang Y, Li H, et al. How segmentation methods affect hippocampal radiomic feature accuracy in Alzheimer’s disease analysis? Eur Radiol. 2022;32:6965–6976. doi: 10.1007/s00330-022-09081-y. [DOI] [PubMed] [Google Scholar]
- 40.Zhong J, Zhang C, Hu Y, et al. Automated prediction of the neoadjuvant chemotherapy response in osteosarcoma with deep learning and an MRI-based radiomics nomogram. Eur Radiol. 2022;32:6196–6206. doi: 10.1007/s00330-022-08735-1. [DOI] [PubMed] [Google Scholar]
- 41.Cheng B, Deng H, Zhao Y, et al. Predicting EGFR mutation status in lung adenocarcinoma presenting as ground-glass opacity: utilizing radiomics model in clinical translation. Eur Radiol. 2022;32:5869–5879. doi: 10.1007/s00330-022-08673-y. [DOI] [PubMed] [Google Scholar]
- 42.Bi S, Li J, Wang T, et al. Multi-parametric MRI-based radiomics signature for preoperative prediction of Ki-67 proliferation status in sinonasal malignancies: a two-centre study. Eur Radiol. 2022;32:6933–6942. doi: 10.1007/s00330-022-08780-w. [DOI] [PubMed] [Google Scholar]
- 43.Si N, Shi K, Li N, et al. Identification of patients with acute myocardial infarction based on coronary CT angiography: The value of pericoronary adipose tissue radiomics. Eur Radiol. 2022;32:6868–6877. doi: 10.1007/s00330-022-08812-5. [DOI] [PubMed] [Google Scholar]
- 44.Eifer M, Pinian H, Klang E, et al. FDG PET/CT radiomics as a tool to differentiate between reactive axillary lymphadenopathy following COVID-19 vaccination and metastatic breast cancer axillary lymphadenopathy:a pilot study. Eur Radiol. 2022;32:5921–5929. doi: 10.1007/s00330-022-08725-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Chen H, Li S, Zhang Y, et al. Deep learning–based automatic segmentation of meningioma from multiparametric MRI for preoperative meningioma differentiation using radiomic features: a multicentre study. Eur Radiol. 2022;32:7248–7259. doi: 10.1007/s00330-022-08749-9. [DOI] [PubMed] [Google Scholar]
- 46.Zhang X, Lu B, Yang X et al (2022) Prognostic analysis and risk stratification of lung adenocarcinoma undergoing EGFR-TKI therapy with time-serial CT-based radiomics signature. Eur Radiol. 10.1007/s00330-022-09123-5 [DOI] [PMC free article] [PubMed]
- 47.Zheng Y-M, Chen J, Zhang M et al (2022) CT radiomics nomogram for prediction of the Ki-67 index in head and neck squamous cell carcinoma. Eur Radiol. 10.1007/s00330-022-09168-6 [DOI] [PubMed]
- 48.Salinas-Miranda E, Healy GM, Grünwald B, et al. Correlation of transcriptional subtypes with a validated CT radiomics score in resectable pancreatic ductal adenocarcinoma. Eur Radiol. 2022;32:6712–6722. doi: 10.1007/s00330-022-09057-y. [DOI] [PubMed] [Google Scholar]
- 49.Nagaraj Y, de Jonge G, Andreychenko A, et al. Facilitating standardized COVID-19 suspicion prediction based on computed tomography radiomics in a multi-demographic setting. Eur Radiol. 2022;32:6384–6396. doi: 10.1007/s00330-022-08730-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Bleker J, Kwee TC, Rouw D, et al. A deep learning masked segmentation alternative to manual segmentation in biparametric MRI prostate cancer radiomics. Eur Radiol. 2022;32:6526–6535. doi: 10.1007/s00330-022-08712-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Fan Y, Dong Y, Wang H, et al. Development and externally validate MRI-based nomogram to assess EGFR and T790M mutations in patients with metastatic lung adenocarcinoma. Eur Radiol. 2022;32:6739–6751. doi: 10.1007/s00330-022-08955-5. [DOI] [PubMed] [Google Scholar]
- 52.Zhang H, Chen H, Zhang C, et al. A radiomics feature-based machine learning models to detect brainstem infarction (RMEBI) may enable early diagnosis in non-contrast enhanced CT. Eur Radiol. 2022 doi: 10.1007/s00330-022-09130-6. [DOI] [PubMed] [Google Scholar]
- 53.Kocak B, Baessler B, Bakas S, et al. CheckList for EvaluAtion of Radiomics research (CLEAR): a step-by-step reporting guideline for authors and reviewers endorsed by ESR and EuSoMII. Insights Imaging. 2023;14:75. doi: 10.1186/s13244-023-01415-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.De Li M, Cheng MQ, Da CL, et al. Reproducibility of radiomics features from ultrasound images: influence of image acquisition and processing. Eur Radiol. 2022;32:5843–5851. doi: 10.1007/s00330-022-08662-1. [DOI] [PubMed] [Google Scholar]
- 55.Zhong J, Xia Y, Chen Y, et al. Deep learning image reconstruction algorithm reduces image noise while alters radiomics features in dual-energy CT in comparison with conventional iterative reconstruction algorithms: a phantom study. Eur Radiol. 2022 doi: 10.1007/s00330-022-09119-1. [DOI] [PubMed] [Google Scholar]
- 56.Miles K. Radiomics for personalised medicine: the long road ahead. Br J Cancer. 2020;122:929–930. doi: 10.1038/s41416-019-0699-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.An C, Park YW, Ahn SS et al (2021) Radiomics machine learning study with a small sample size: single random training-test set split may lead to unreliable results. PLoS One 16:e0256152. 10.1371/journal.pone.0256152 [DOI] [PMC free article] [PubMed]