Ultrasound: Journal of the British Medical Ultrasound Society. 2020 Nov 16;29(2):100–105. doi: 10.1177/1742271X20971323

Ultrasound grading of thyroid nodules using the BTA U-scoring guidelines – Is there evidence of intra-and interobserver variability?

Michael Couzins 1, Stuart Forbes 1, Ganesh Vigneswaran 1, Indu Mitra 2, Elizabeth E Rutherford 1
PMCID: PMC8083133  PMID: 33995556

Abstract

Introduction

U-score ultrasound classification (graded U1-U5) is widely used to grade thyroid nodules based on benign and malignant sonographic features. It is well established that ultrasound is an operator-dependent imaging modality and thus more susceptible to subjective variances between operators when using imaging-based scoring systems. We aimed to assess whether there is any intra- or interobserver variability when U-scoring thyroid nodules and whether previous thyroid ultrasound experience has an effect on this variability.

Methods

A total of 14 ultrasound operators were identified (five experienced thyroid operators, five with intermediate experience and four with no experience) and were asked to U-score images from 20 thyroid cases shown as a single projection, with and without Doppler flow. The cases were subsequently rescored by the 14 operators after six weeks. The first and second round U-scores for the three operator groups were then analysed using Fleiss’ kappa to assess interobserver variability and Cochran’s Q test to determine any intraobserver variability.

Results

We found no significant interobserver variability on combined assessment of all operators, with fair agreement in round 1 (Fleiss’ kappa = 0.30, p < 0.0001) and slight agreement in round 2 (Fleiss’ kappa = 0.19, p < 0.0001). Cochran’s Q test revealed no significant intraobserver variability across all 14 operators between round 1 and round 2 (all p > 0.05).

Conclusions

We found no statistically significant inter- or intraobserver variability in the U-scoring of thyroid nodules across all participants, reinforcing the validity of this scoring method in clinical practice and allaying concerns regarding potential subjective biases in reporting.

Keywords: Thyroid ultrasound, thyroid nodules, U-score, grading, observer variability

Introduction

Ultrasound is the most appropriate imaging modality for the assessment of thyroid nodules. The modality is quick, acceptable to the patient and allows both the diagnosis of thyroid carcinoma and the risk stratification of nodules. It is estimated that 30–70% of all patients have thyroid nodules visible on ultrasound1,2 and thus they will be a common finding in those undergoing neck/thyroid ultrasound.

Moon et al. demonstrated in a large retrospective study that shape, margin, echogenicity and calcification are useful sonographic features that can help discriminate benign and malignant features of thyroid nodules.3 The study did, however, acknowledge that in lesions of less than 1 cm, the sensitivity of microcalcifications was lower. Other thyroid nodule classification systems previously described include the Kim criteria,4 the American Association of Clinical Endocrinologists criteria5 and the Society of Radiologists in Ultrasound criteria.1 These systems focus on varying nodule features to aid decision making as to which nodules require further assessment.

To reduce interobserver variation in interpretation, Tappouni et al. suggested not only using standard nodule characteristics to decide on appropriate management but also utilising structured reports, repeatable image acquisition techniques and regular operator education.6

Grani et al. reassuringly showed that when utilising different risk classification systems, once benign features are identified the nodules remain fairly stable over a five-year interval with changes warranting sampling a very rare event.7 Ha et al. also reinforced that risk stratification models using sonographic appearances of thyroid nodules show acceptable predictive accuracy.8

In order to allow consistent, reproducible assessment of nodules, the British Thyroid Association (BTA) included a thyroid nodule ultrasound classification (U-scoring) table in their Guidelines for the Management of Thyroid Cancer document.1 The BTA U-classification acknowledges other classification systems and their varying specificity and sensitivity before providing a simple, succinct table with visual aids (Figure 1). The classification details differentiating benign and malignant features of a thyroid nodule to deduce a U-score (U1-5). Once scored, the ultrasound operator is aided by the remainder of the guidance which states which nodules can be reassuringly left alone, those that are indeterminate requiring fine needle aspiration cytology (FNAC) and those that are overtly malignant.

Figure 1.


Graphic compilation of signs used in thyroid nodule U-classification.1 Reproduced through kind permission from John Wiley and Sons.

As with any grading system, the use of thyroid nodule U-scoring relies on consistency between operators. Ultrasound is operator dependent and as such there will always be variation between images acquired by different operators. Application of a thyroid nodule classification will also demonstrate a degree of subjective variation between clinicians scoring a nodule. Hambly et al. suggested that a five-point scale assessing which nodules require biopsy showed good diagnostic accuracy between seven radiologists educated on evidence-based imaging criteria.9

Lam et al.10 assessed interobserver agreement between two radiologists and one sonographer when assessing imaging characteristics of thyroid nodules that subsequently demonstrated indeterminate cytology. The study identified fair agreement with echogenicity, moderate agreement with margin assessment, good agreement with composition and echogenic foci and very good agreement for extrathyroid extension and lymph node metastasis. However, there was complete disagreement in at least one feature in 22% of the nodules examined. Although this study demonstrated agreement in many aspects of nodule assessment, the varying opinions would thus allow the possibility of differing nodule grading when applied to a scoring classification.

Given the subjective interpretation of ultrasound images, we had anecdotal concerns regarding inter- and intraobserver variability of U-scoring. Therefore, our aim was to assess whether there was any significant intraobserver or interobserver variability between operators utilising the BTA thyroid nodule U-classification.

Methods

We recruited 14 ultrasound operators from our institution. Operators included a mix of radiologists, radiology registrars and sonographers. The operators were divided into three groups based on their thyroid ultrasound experience: five experienced operators performing thyroid ultrasound weekly and having at least three years of experience, five operators with intermediate experience and four operators with no thyroid ultrasound experience.

To remove variation in image acquisition between participants and to assess only the application of the U-classification, ultrasound images of thyroid nodules were provided to the operators. Twenty thyroid nodules were shown as a single ultrasound projection, with and without Doppler flow. Each participant was provided with a copy of the formal U-scoring criteria (Figure 1) for reference and scored the 20 cases. The same 20 cases were subsequently rescored by the operators after an interval of at least six weeks.

The first and second round U-scores for the three groups were then analysed to assess for any significant variability. Fleiss’ kappa was used to assess interobserver variability between the operators, and Cochran’s Q test was used to assess intraobserver variability.
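For readers unfamiliar with the statistic, Fleiss’ kappa for a fixed number of raters per case can be sketched as below. This is a minimal illustration of the standard formula, not the authors’ analysis code; the function name `fleiss_kappa` and the toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: a list of cases, each a list with one score per rater.

    Every case must be scored by the same number of raters.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of ratings")

    # counts[i][j]: number of raters who assigned category j to case i
    counts = [[Counter(case)[c] for c in categories] for case in ratings]

    # Observed agreement: mean over cases of the share of agreeing rater pairs
    p_observed = sum(
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases

    # Chance agreement, from the overall share of each category
    total = n_cases * n_raters
    p_chance = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy data: 2 cases, 3 raters, scores limited to U2 and U3
toy = [[2, 2, 2], [2, 3, 3]]
print(round(fleiss_kappa(toy, categories=[2, 3]), 2))  # 0.25
```

In a study of this shape, `ratings` would hold 20 inner lists of 14 scores each for one round, with `categories=[1, 2, 3, 4, 5]`. The sketch does not handle the degenerate case where every rating falls in one category, which makes the denominator zero.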

Results

The 14 operators completed both rounds of thyroid nodule scoring. Some participants were undecided on cases and when, for example, a nodule was scored as U2/U3 this was recorded as U3, given that in real practice the higher figure would determine the need for subsequent management and follow-up. Combining U-scores from both rounds, 16 of the 20 cases had U-scores ranging from U2 to U5. Three cases had scores ranging from U2 to U4 or U3 to U5. One case had scoring ranging from U2 to U3.
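The rule for undecided answers (record the higher grade) can be expressed as a small helper. This is an illustrative sketch only; the function name `resolve_u_score` and the string format of the answers are assumptions, not details from the study.

```python
def resolve_u_score(raw):
    """Return the integer U-score, taking the higher grade for split answers."""
    grades = [int(part.strip().upper().lstrip("U")) for part in raw.split("/")]
    if any(grade < 1 or grade > 5 for grade in grades):
        raise ValueError(f"U-scores run from U1 to U5, got {raw!r}")
    return max(grades)


print(resolve_u_score("U2/U3"))  # 3
print(resolve_u_score("U4"))     # 4
```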

Participants in the ‘experienced’ operators group appeared to group their scores closer together (Table 1). Only one case had U-scores ranging from U2 to U5. Seven cases had U-scores ranging from U2 to U4 or U3 to U5. Ten cases had U-scores varying by one point, i.e. U2-3, U3-4 or U4-5. Two cases were scored the same by all of the ‘experienced’ group in both rounds. As demonstrated in Table 1, there is a greater range of U-scores provided by the group with no previous experience than those with previous thyroid ultrasound experience.

Table 1.

U-score range per case over both rounds for each group.

| U-score range per case over both rounds | All participants | Experienced group | Intermediate experience group | No thyroid US experience |
|---|---|---|---|---|
| Same U-score | 0 | 2 | 2 | 1 |
| U2-3, U3-4, U4-5 | 1 | 10 | 6 | 5 |
| U2-4, U3-5 | 3 | 7 | 10 | 2 |
| U2-5 | 16 | 1 | 2 | 12 |
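A tally of this kind can be produced by labelling each case with the lowest and highest U-score it received across both rounds. The sketch below uses invented scores and a hypothetical helper, `spread_label`; it is not the authors’ procedure.

```python
from collections import Counter


def spread_label(scores):
    """Label a case by the lowest and highest U-score it received."""
    low, high = min(scores), max(scores)
    return "Same U-score" if low == high else f"U{low}-{high}"


# Hypothetical scores: each inner list pools both rounds for one case
cases = [[2, 2, 2, 2], [2, 3, 3, 2], [3, 5, 4, 3], [2, 5, 3, 4]]
tally = Counter(spread_label(scores) for scores in cases)
print(tally)
```

Running the same tally on one operator group at a time would give the per-group columns.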

No statistically significant difference was identified between all operators to suggest any interobserver variability in round 1 or round 2 scoring. Fair agreement was demonstrated in round 1 scores (Fleiss’ kappa = 0.30, p < 0.0001) and slight agreement in round 2 (Fleiss’ kappa = 0.19, p < 0.0001) (Table 2).

Table 2.

Fleiss’ kappa analysis of interobserver variability between all participants and the subgroups.


| Group | Round 1 Fleiss’ kappa | Round 1 p-value | Round 1 outcome | Round 2 Fleiss’ kappa | Round 2 p-value | Round 2 outcome |
|---|---|---|---|---|---|---|
| All participants | 0.30 | <0.0001 | Fair agreement | 0.19 | <0.0001 | Slight agreement |
| Experienced | 0.34 | <0.0001 | Fair agreement | 0.24 | <0.0001 | Fair agreement |
| Intermediate | 0.30 | <0.0001 | Fair agreement | 0.04 | 0.49 | Agreement accidental |
| No experience | 0.33 | <0.0001 | Fair agreement | 0.28 | <0.0001 | Fair agreement |

When analysing the practitioner subgroups independently, the experienced group did not demonstrate any significant interobserver variability in either round. The group with no previous thyroid ultrasound experience also did not demonstrate any significant interobserver variability. Within the group of intermediate experience, in round 1 there was ‘fair’ agreement and thus no significant interobserver variability. However, in round 2, analysis determined that agreement between scores was accidental (Fleiss’ kappa = 0.04, p = 0.49) and therefore there was significant interobserver variability for this subgroup in that round.
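The outcome labels in Table 2 match the widely used Landis and Koch bands for kappa, although the source does not name the convention it followed. The sketch below reproduces the labels under two assumptions: those bands apply, and a kappa whose p-value is at or above 0.05 is reported as accidental agreement. The function name `agreement_outcome` is hypothetical.

```python
def agreement_outcome(kappa, p_value, alpha=0.05):
    """Map a Fleiss' kappa and its p-value to a descriptive label."""
    if p_value >= alpha:
        # Assumption: a non-significant kappa is reported as chance agreement
        return "Agreement accidental"
    if kappa < 0:
        return "Poor agreement"
    # Upper bounds of the Landis and Koch bands (assumed convention)
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect agreement"


# Values from Table 2; "<0.0001" is represented by its upper bound
print(agreement_outcome(0.30, 0.0001))  # Fair agreement
print(agreement_outcome(0.19, 0.0001))  # Slight agreement
print(agreement_outcome(0.04, 0.49))    # Agreement accidental
```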

Cochran’s Q test revealed no significant intraobserver variability in all operators between round 1 and round 2 (Table 3).

Table 3.

Cochran’s Q test for intraobserver variability.

| Group | Q score | p-value | Outcome |
|---|---|---|---|
| All participants | 21.7 | 0.30 | No intraobserver variability |
| Experienced | 25.7 | 0.14 | No intraobserver variability |
| Intermediate | 22.0 | 0.28 | No intraobserver variability |
| No experience | 20.3 | 0.38 | No intraobserver variability |

Within each group, the number of cases scored in one round as U2 and in the other round as U3 or above by an individual participant was as follows:

Experienced group: 10 cases (10%) – 9 (U2 vs. U3) and 1 (U2 vs. U4)

Intermediate group: 15 cases (18.7%) – 12 (U2 vs. U3), 2 (U2 vs. U4) and 1 (U2 vs. U5)

No experience group: 15 cases (15%) – 10 (U2 vs. U3), 2 (U2 vs. U4) and 3 (U2 vs. U5)
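The p-values reported for Cochran’s Q can be related to the Q scores through the chi-square distribution, the usual large-sample reference distribution for this test. The sketch below assumes 19 degrees of freedom (20 cases minus one), which the source does not state. `chi2_sf` is a hypothetical standard-library helper written for this illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series expansion of the regularised lower incomplete gamma function P(a, z)
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


# Q scores as reported for Cochran's Q; df = 19 is an assumption (20 cases minus 1)
q_scores = {
    "All participants": 21.7,
    "Experienced": 25.7,
    "Intermediate": 22.0,
    "No experience": 20.3,
}
for group, q in q_scores.items():
    print(f"{group}: p = {chi2_sf(q, df=19):.2f}")
```

Under that assumption the printed values can be compared directly with the reported p-values. A mismatch would suggest a different number of degrees of freedom or an exact rather than asymptotic test.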

Discussion

In clinical practice there will always be variation between the assessment, interpretation and conclusion made by medical practitioners. Ultrasound is no different. In addition to the variation seen in the acquired operator-dependent images, the interpretation of the images will also differ between operators. When assessing thyroid nodules, the BTA U-classification guideline aids risk assessment of the nodule based on data from previous studies. By standardising nodule scoring it allows fellow clinicians to practise with a degree of consistency and confidence. The U-score classification carries a great deal of importance in directing patient management. For surgeons it can decide surgery versus no surgery, by no means a minor difference for the patient. It seems as though a lot of weight is given to a scoring system at risk of subjective variation in its interpretation and application.

Although not assessed in this study, a degree of unconscious bias may occur during a thyroid ultrasound list. Patient load and time pressures may result in underscoring, thus avoiding FNA of a lesion that other operators may well have sampled. Equally, an operator may subconsciously underscore a lesion in an elderly patient with multiple comorbidities, thus steering management towards a more conservative approach.

Our results have shown that in the combined group of ultrasound operators with a range of previous thyroid ultrasound experience, when applying BTA U-classification of a nodule, no significant interobserver difference is demonstrated. Aside from one single round of the four intermediate ultrasound operators, no significant interobserver variability is seen in each subgroup. The interobserver variability seen within the second round of assessment of the intermediate experience group could perhaps be a result of data analysis with the smallest subgroup (four participants). Reassuringly, no significant variability is seen within an individual operator’s scores when rescoring the same nodules at a later date, indicating the scores are reproducible. It is also reassuring that the experienced operators, who will undertake the largest proportion of the thyroid ultrasound, group their scores closest together as seen in Table 1.

Although not statistically significant, our anecdotal impression is that the less experienced operators tend to over-score – a U3 score gets further review and possible FNA, acting as a reassuring back-up, compared to U2 that may result in discharge and no further clinical review. It was pleasing to see that the experienced operators had a lower percentage of scans graded as benign U2 on one assessment and U3 or above on subsequent analysis (10% vs. 18.7% and 15%), suggesting patients get more consistent U-score directed discharge or follow-up advice from this group.

The Royal College of Radiologists recommends that 100% of thyroid nodules are U-scored. There is an argument to change this ‘gold standard’ to reduce pressures on ultrasound operators providing a U-score in difficult cases or small lesions that are difficult to characterise. It could perhaps be made more acceptable to grade as an ‘uncertain U-score’, allowing interval review, rather than requiring immediate grading that could lead to an inappropriate management pathway.

The number of participants in the study is relatively low, but despite this being an opportunity for false positives to be drawn, no significant inter- or intraobserver variability is demonstrated across the combined 14 operators. Some would argue that providing the operators with only a single view of a thyroid nodule, with and without Doppler flow, removes the dynamic three-dimensional ‘feel’ ultrasound can provide in clinical practice. This was done intentionally so that operators of all clinical experience could use the BTA U-classification tool to make a more comparable assessment within our study.

The main aim of the study was to analyse intra- and interobserver agreement and therefore U-score concordance with the cytological and histological outcome was not assessed. Previous studies have already demonstrated that the ultrasound assessment of benign and malignant nodule features is reflective of the histology.2,9

Conclusion

No statistically significant intra- or interobserver variability was demonstrated in the U-scoring of thyroid nodules across combined assessment of ultrasound operators of varying experience. Reassuringly, this was also the case in subgroup analysis of the experienced thyroid ultrasound operators who complete the bulk of thyroid sonography at our centre. This reinforces the validity of this scoring system in clinical practice, allaying anecdotal concerns that there is significant subjective variation in application of the BTA U-classification.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Ethics Approval: On review of the Health Research Authority decision tool our work is classified as a Service Evaluation and therefore does not require Research Ethics Committee review.

Guarantor: MC.

Contributors: MC: substantial contribution to data acquisition/analysis, drafting of article, revision of article and approval for publication; SF: substantial contribution to design of work, data acquisition/analysis, revision of article and approval for publication; GV: substantial contribution to data acquisition/analysis and interpretation of data, revision of article and approval for publication; IM: substantial contribution to concept/design of work, data acquisition, revision of article and approval for publication; EER: substantial contribution to concept/design of work, data acquisition, revision of article and approval for publication.

ORCID iD: Michael Couzins https://orcid.org/0000-0002-5686-6407

References

  • 1. British Thyroid Association. Guidelines for the management of thyroid cancer. Clin Endocrinol 2014; 81: 1–122.
  • 2. Frates MC, Benson CB, Charbonneau JW. Management of thyroid nodules detected at US: Society of Radiologists in Ultrasound consensus statement. Radiology 2005; 237: 794–800.
  • 3. Moon WJ, Jung SL, Lee JH, et al. Benign and malignant thyroid nodules: US differentiation–multicenter retrospective study. Radiology 2008; 247: 762–770.
  • 4. Kim EK, Park CS, Chung WY, et al. New sonographic criteria for recommending fine-needle aspiration biopsy of nonpalpable solid nodules of the thyroid. AJR Am J Roentgenol 2002; 178: 687–691.
  • 5. Gharib H, Papini E, Garber JR, et al. American Association of Clinical Endocrinologists and Associazione Medici Endocrinologi medical guidelines for the diagnosis and management of thyroid nodules. Endocr Pract 2006; 12: 63–102.
  • 6. Tappouni RR, Itri JN, McQueen TS, et al. ACR TI-RADS: pitfalls, solutions, and future directions. Radiographics 2019; 39: 2040–2052.
  • 7. Grani G, Lamartina L, Biffoni M, et al. Sonographically estimated risks of malignancy for thyroid nodules computed with five standard classification systems: changes over time and their relation to malignancy. Thyroid 2018; 28: 1190–1197.
  • 8. Ha SM, Ahn HS, Baek JH, et al. Validation of three scoring risk-stratification models for thyroid nodules. Thyroid 2017; 27: 1550–1557.
  • 9. Hambly NM, Gonen M, Gerst SR. Implementation of evidence-based guidelines for thyroid nodule biopsy: a model for establishment of practice standards. Am J Roentgenol 2011; 196: 655–660.
  • 10. Lam CA, McGettigan MJ, Thompson ZJ, et al. Ultrasound characterization for thyroid nodules with indeterminate cytology: inter-observer agreement and impact of combining pattern-based and scoring-based classifications in risk stratification. Endocrine 2019; 66: 278–287.

Articles from Ultrasound: Journal of the British Medical Ultrasound Society are provided here courtesy of SAGE Publications
