Dentomaxillofacial Radiology. 2020 Feb 11;49(6):20190484. doi: 10.1259/dmfr.20190484

Through the quality kaleidoscope: reflections on research in dentomaxillofacial imaging

Madeleine Rohlin 1, Keith Horner 2, Christina Lindh 1,, Ann Wenzel 3
PMCID: PMC7461737  PMID: 31971827

Abstract

The REduce research Waste And Reward Diligence statement has highlighted how weaknesses in health research can produce misleading results and waste valuable resources. Research on diagnostic efficacy in the field of dentomaxillofacial radiology (DMFR) is no exception to these criticisms and could be strengthened by more robust study designs, consistent use of a core set of outcome measures and completeness in reporting. Furthermore, we advocate that everyone participating in collaborative research on clinical interventions subscribes to the importance of methodological quality in how imaging methods are used. The aim of this paper, therefore, is to present a guide to conducting high-quality research on diagnostic efficacy in DMFR.

We initially propose a framework inspired by the hierarchical model of efficacy of Fryback and Thornbury, highlighting study designs, measures of analysis, completeness of reporting and established guidelines to assist in these aspects of research. Bias in research, and measures to prevent or limit it, are then described.

It is desirable to climb the Fryback and Thornbury “ladder” from technical efficacy, via accuracy and clinical efficacy, to societal efficacy of imaging methods. Efficacy studies on the higher steps of the ladder may be difficult to perform, but we must strive to answer questions of how useful our methods are in patient management and assess benefits, risks, costs, ethical and social issues. With the framework of six efficacy levels as the structure and based on our experience, we present information that may facilitate quality enhancement of diagnostic efficacy research in DMFR.

Keywords: imaging efficacy, reporting, guidelines, bias, research design

Background

Quality is a complex concept because it encompasses a multitude of perspectives and can be viewed from so many different angles; as such, quality can be examined and reflected as in a kaleidoscope. Ensuring high quality in research is also complex, so it is not surprising that it frequently is not achieved. It has been estimated that 85% of research is wasted, usually because it asks the wrong questions, is badly designed, not published or poorly reported.1 In 2014, the Lancet published a series of reviews, “Increasing value: reducing waste in biomedical research”,2–6 to encourage the research community to improve quality and monitor progress. In the Lancet REduce research Waste And Reward Diligence (REWARD) Statement,7 methods were formulated to promote and maintain the drive to improve quality (Table 1). Additional references concerning the REWARD campaign can be found, for example, in a follow-up review of the stimulus provided by the series.8

Table 1.

Methods to maximize the value of research according to REduce research Waste And Reward Diligence Statement7

Set the right research priorities
Use robust research design, conduct and analysis
Ensure that research regulations and management are proportionate to risks
Ensure that complete information on research methods and findings are accessible
Make reports of research complete and usable

In the field of dentomaxillofacial imaging, no comparable overall reviews have been published recently. We agree with the conclusions made by Leeflang and coworkers,9 based on overwhelming evidence from reviews of diagnostic tests, that “Challenges that remain are the poor reporting of original diagnostic test accuracy studies and difficulties with the interpretation of the results of diagnostic test accuracy research.” The new focus on overall methodological quality and the awareness of the limitations of biomedical research have the potential to steer the research community towards improvement. However, in Leeflang and coworkers’ review,9 the focus was solely on the accuracy component of study designs. The aim of this paper, therefore, is to present a comprehensive guide to conducting high-quality research on diagnostic efficacy in dentomaxillofacial radiology (DMFR). As well as being aimed at improving quality within our own research community, this guide also advocates that everyone participating in collaborative research on clinical interventions promotes the importance of methodological quality in how imaging methods are used.

First, we present a framework for research using different efficacy levels inspired by Fryback and Thornbury10; second, examples of terminology and guidelines are presented; third, some general issues to consider when planning, conducting and reporting research are discussed. Based on our experience in the knowledge field, we elaborate on specific considerations to take into account before initiating studies at different efficacy levels.

A framework for efficacy levels, measures for analysis and study design

Our framework (Table 2) illustrates six hierarchical levels for evaluation of imaging methods, together with typical measures for analysis and study designs. The imaging process (Levels 1 and 2) is embedded in the clinical process (Levels 3 and 4), whereby the clinician interprets the results as information for clinical decision-making. Further, these processes are embedded within a larger healthcare system, in which benefits to patients and to society are considered (Levels 5 and 6).

Table 2.

Framework presenting efficacy levels inspired by Fryback and Thornbury10 with examples of typical measures of analysis and study design for assessment of imaging methods

Efficacy level | Examples of measures of analysis | Study design
1. Technical efficacy | Image resolution, sharpness, grayscale range | Objective or subjective image quality studies
2. Diagnostic accuracy efficacy | Sensitivity, specificity, ROC, predictive values, likelihood ratios | In vitro, ex vivo, in vivo studies on validity of an imaging method
3. Diagnostic thinking efficacy | The change in diagnosis with the index test | Questionnaire, paper-clinic or clinical studies, in which diagnosis is performed with and without access to images (on the basis of the index test); no treatment is performed
4. Therapeutic efficacy (choice of treatment strategy) | The change in treatment strategy with and without access to images, changes in treatment | Clinical studies in which treatment planning is first based on an already recognized method and thereafter on a new method; actual treatment could be performed to determine which treatment plan was implemented
5. Patient outcome efficacy | The change in patient outcome, treatment quality or prognosis comparing a new and an already recognized method | Randomized Clinical Trials (RCTs); treatment is performed based on either the new or the recognized method decided by lot
6. Societal efficacy | Benefits or costs from societal viewpoints; analyses of cost-effectiveness or cost-benefit | Prospective clinical studies or RCTs; model studies; treatment effects evaluated from a health-economical point of view

Terminology and guidelines

To assist transparency in the reporting of diagnostic studies, it may be useful to comply with a common terminology. The terms proposed in the STARD 15 statement11 (Table 3) are helpful when defined and used in studies of imaging methods. In the following text, we adhere to this terminology when appropriate.

Table 3.

Terminology for reporting diagnostic accuracy studies according to STARD 1511

STARD 15 term | Explanation
Medical test | Any method for collecting additional information about the current health status of a patient
Index test | The test under evaluation (e.g., an imaging method)
Target condition | The disease or condition that the index test is expected to detect
Reference standard | The best available method/condition for establishing the presence or absence of the target condition; a “gold standard” would be an error-free reference standard
Intended use of the test | Whether the index test is used for diagnosis, screening, staging, monitoring, surveillance, prediction, prognosis or other reasons
Role of the test | The position of the index test relative to other tests for the same condition (e.g., triage, replacement, add-on, new test)

Several websites have emerged, such as the Equator Network (http://www.equator-network.org)12 containing over 400 guidelines. There are also other online sources, such as Penelope research (http://www.peneloperesearch.com),13 which allows for manuscripts to be uploaded and, among other things, instantly auto-checked for the use of appropriate guidelines. In Tables 4 and 5, examples of guidelines are listed. Some are published in several journals, but only one reference/website is listed in the tables. Table 4 presents examples of guidelines primarily targeting reporting of studies; these are, however, also helpful at the study design stage.

Table 4.

Examples of guidelines that will facilitate reporting of research studies related to main study type

Acronym | Main study type | Complete title and reference
CONSORT 2010 | Randomized trial (RCT) | Schultz KF et al 2010 CONsolidated Standards Of Reporting Trials.14 Full-text PDF documents of the CONSORT 2010 Statement, CONSORT 2010 checklist, CONSORT 2010 flow diagram and the CONSORT 2010 Explanation and Elaboration document are available from: http://www.consort-statement.org/downloads/consort-statement
GRRAS | Reliability and agreement study | Kottner J et al Guidelines for reporting reliability and agreement studies (GRRAS).15 http://www.equator-network.org/wp-content/uploads/2012/12/GRRAS-checklist-for-reporting-of-studies-of-reliability-and-agreement.pdf
PRISMA | Systematic review | Moher D et al 2009 Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement.16 Full-text PDF documents of the PRISMA Statement, checklist, flow diagram and the PRISMA Explanation and Elaboration document: http://www.prisma-statement.org/PRISMAStatement/Default.aspx
PRISMA-DTA | Systematic review | McInnes MDF et al 2018 Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement.17 https://jamanetwork.com/journals/jama/fullarticle/2670259
STARD 2015 | Diagnostic accuracy study | Bossuyt PM et al 2015 STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies.11 http://www.equator-network.org/wp-content/uploads/2015/03/STARD-2015-paper.pdf
STROBE | Observational study | von Elm E et al 2007 The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: guidelines for reporting observational studies.18 http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0040296
TRIPOD | Prediction model study | Collins GS et al 2015 Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement.19 https://www.bmj.com/content/350/bmj.g7594.long

Table 5.

Examples of guidelines that will facilitate quality appraisal of research studies related to main review type

Acronym | Main review type | Complete title and reference
AGREE II | Practice guidelines study | Brouwers MC et al 2016 Appraisal of guidelines for research & evaluation II.20 https://www.agreetrust.org/wp-content/uploads/2017/12/AGREE-II-Users-Manual-and-23-item-Instrument-2009-Update-2017.pdf
AMSTAR 2 | Systematic review in general | Shea BJ et al 2017 AMSTAR 2: a critical appraisal tool for systematic reviews that include randomized or non-randomized studies of healthcare interventions, or both.21 https://www.bmj.com/content/bmj/358/bmj.j4008.full.pdf
CHARMS | Prediction model study | Moons KG et al 2014 Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS Checklist.22 https://methods.cochrane.org/sites/methods.cochrane.org.prognosis/files/public/uploads/2014%20Moons%20The%20CHARMS%20checklist%20PlosMed.pdf
QAREL | Diagnostic reliability study | Lucas N et al 2013 The reliability of a quality appraisal tool for studies of diagnostic reliability.23 https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-111
QUADAS-2 | Diagnostic accuracy study | Whiting PF et al 2011 A revised tool for the quality assessment of diagnostic accuracy studies.24 http://annals.org/aim/fullarticle/474994/quadas-2-revised-tool-quality-assessment-diagnostic-accuracy-studies
ROBIS | Reviews of systematic reviews | Whiting PF et al 2016 A new tool to assess risk of bias in systematic reviews.25 https://www.jclinepi.com/article/S0895-4356(15)00308-X/pdf

Guidelines on quality appraisal of research, designed for use in systematic reviews, are also a guide to what and how to report studies effectively (Table 5). Again, being aware of these guidelines can help a lot in designing studies; it is frustrating to perform a research study and only later discover that it would be scored poorly in a systematic review because some aspect of design or reporting was inappropriate.

PICO26 and GRADE27 are other acronyms often mentioned. PICO is not a guideline per se but a framework for formulating clinical questions, particularly for RCTs (P = Patient, Population or Problem, I = Intervention, C = Control, O = Outcome). In diagnostic studies, PICO can be used to assess the applicability of a study: P = Population, I = Index test, C = Comparator (reference standard) and O = Outcome. GRADE27 is an approach to grading the quality (or certainty) of evidence and the strength of recommendations obtained from the results of systematic reviews.

General recommendation for research in dentomaxillofacial imaging

Build upon the results and conclusions of prior knowledge after careful and critical analyses of previous literature

The first step when designing a study and formulating the research questions is to pursue, find and analyse what has already been published. Locating and retrieving relevant literature is challenging, yet crucial to the success of scientific improvement. Today this is facilitated by searching publication databases. Thus, there is no excuse for omitting the results, conclusions and references of prior research addressing similar ideas, questions or, for that matter, methods. To develop and refine the research question and methodology, information from prior publications about both positive and negative outcomes will be helpful. Do read the discussion sections of publications, where authors may speculate on what needs to be studied to advance the field. The synthesis and conclusions of systematic reviews are valuable, as such reviews should identify knowledge gaps on important aspects of our knowledge field. Reports of new research should set the new findings in the context of the body of other relevant research. The replication of an important study may be worthwhile; the Lancet review by Moher et al8 recommends “Reward reproducibility practices and reproducible research and enable an efficient culture for replication of research”.

Identify sources of bias in planning, implementation, analysis and publication and take preventive measures

A bias is a systematic error, or deviation from the truth, in results or inferences. Different biases can lead to underestimation or overestimation of the true effects. Bias should not be confused with imprecision. Bias means that multiple replications of the same study would, on average, reach the wrong answer. Imprecision refers to random error, meaning that multiple replications of the same study will produce different estimates because of sampling variation, even though on average they would give the right answer.

Different types of bias can occur at any phase of research, including planning and implementation as well as analysis and publication (Figure 1). Examples and descriptions of types of bias, and of how study outcomes and conclusions may be affected, are presented in Table 6. Preventive measures in study design, implementation and analysis can be taken to limit the degree of bias, although some degree of bias is nearly always present in published studies.

Figure 1.

Major types of bias during different phases of a study in imaging research, from planning through to publication.

Table 6.

Examples of types of bias during different phases of study progression, descriptions of bias, examples of their effect on outcomes and preventive measures to be taken to minimize bias

Phase | Type of bias | Description | Effect on outcome | Preventive measures
Planning | Selection bias | Errors during identification of the study population or in any process of gathering the sample | May undermine external validity and can lead to overestimation of, e.g., accuracy | Choose a prospective study design and enrol consecutively or randomly; perform a sample size calculation aligned with the intended use of the imaging method
Planning | Spectrum bias | Included sample does not represent the intended spectrum of disease severity; subgroup differences, e.g. screening population vs specialist population | May influence disease prevalence and thereby study outcomes, e.g. accuracy may be overestimated | Ensure that included participants are representative of those for whom the imaging method is intended
Planning | Referral bias (a type of spectrum bias) | Referral pattern distorts the sample distribution | As for spectrum bias | As for spectrum bias
Implementation | Information bias | Selective revealing or suppression of information, e.g. when results are interpreted with knowledge of the reference standard results; with less or more information than in practice; or when imaging methods or treatment outcomes are compared and images are interpreted with knowledge of the intervention | May lead to a status quo in which multiple researchers discover and discard the same outcomes; may lead to overestimation of test performance | Provide detailed descriptions of the information given prior to interpretation of the index test; if possible, ensure that the intended use of the test and/or normal clinical practice is followed; strive for blinding or double-blinding of subjects, researchers, technicians, data collectors/analysts, evaluators and any other persons involved who may influence outcomes subjectively
Implementation | Performance bias | Conduct and/or interpretation of the index test is inadequately performed and/or insufficiently reported | May affect study outcomes and will not enable comparison or meta-analyses of studies of the same imaging method | Provide detailed descriptions of methods; consider an appropriate number of raters and think carefully about raters’ expertise
Implementation | Classification bias | Reference standard does not correctly classify the target condition | May affect disease prevalence and thereby study outcomes | Ensure the reference standard correctly reflects patients with the target condition
Implementation | Verification bias | A set of participants does not undergo the reference standard | Usually leads to overestimation of sensitivity | Ensure all participants undergo both the index test and the reference standard
Implementation | Incorporation bias | Reference standard is not independent of the index test | May lead to overestimation of accuracy | Ensure the reference standard and the index test are independent
Analysis/Writing up | Verification bias | Attrition rate or reasons for withdrawals are not documented | Estimates may be inconclusive | Account for all participants who entered the study; explain uninterpretable or intermediate test results
Analysis/Writing up | Citation bias | Citation or non-citation of research findings, depending on their nature and direction | May threaten the validity of future research, as some results may incorrectly receive more and more emphasis | Cite and discuss supportive as well as unsupportive high-quality studies; analyse original data
Publication | Publication bias; multiple (duplicate) publication bias | Publication or non-publication of research findings, depending on the nature and direction of the results; multiple publications overlap each other substantially or the same publication is published several times | May lead to incorrect estimation of intervention effects; may distort the results of systematic reviews and meta-analyses | Strategy depends on whether the aim is to tackle sets of missing studies or whether selective/incomplete reporting of data is the primary problem; in systematic reviews, construct a funnel plot in which estimates of efficacy are plotted against, e.g., sample size; update systematic reviews regularly

Additional literature on identifying and minimizing bias in research has been reported by Pannucci and Wilkins,28 Schmidt and Factor29 and Whiting et al.30 Without emphasizing any particular type of bias or its effects, we elaborate in more detail on a few of them below.

Spectrum bias and referral bias: Patient cases with the target condition should be recruited consecutively or randomly. They should not be cases for whom some non-random pre-selection process has been used, unless the study explicitly aims at a particular subgroup. If a study includes only referred cases for which the imaging test was prescribed, it excludes patients for whom the test was not prescribed, resulting in referral bias. The former population may differ from the latter, perhaps comprising more severe or complicated cases, and the study would then not provide generally applicable results. A prospective collection of material, with defined inclusion and exclusion criteria and clear guidance on the detail required for the study, will help to ensure validity of data. The factors used in sample size determination depend on the focus of the study. The number of particular interest may be the patients, if the impact on them is the main study outcome, or the raters (observers), if the focus is on the clinicians’ assessments.31 The results of smaller studies are subject to greater sampling variation and are hence less precise.32
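
As a minimal illustration of one common normal-approximation approach to such a calculation, the sketch below estimates how many participants would be needed to estimate sensitivity with a chosen confidence interval half-width, scaled by the expected prevalence in a consecutively recruited sample. The planning values are hypothetical and not drawn from any cited study.

```python
from math import ceil
from scipy.stats import norm

def sample_size_for_sensitivity(expected_sens, ci_half_width, prevalence, alpha=0.05):
    """Approximate sample size to estimate sensitivity with a given CI half-width
    (normal approximation); values are illustrative planning assumptions only."""
    z = norm.ppf(1 - alpha / 2)
    # Participants WITH the target condition needed for the desired CI width
    n_with_condition = (z ** 2) * expected_sens * (1 - expected_sens) / ci_half_width ** 2
    # Scale up by expected prevalence to obtain the total consecutive sample
    n_total = n_with_condition / prevalence
    return ceil(n_with_condition), ceil(n_total)

# Hypothetical planning values: expected sensitivity 0.85, CI half-width 0.07,
# expected prevalence of the target condition 0.30 in the referred population
print(sample_size_for_sensitivity(0.85, 0.07, 0.30))
```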

Information bias: If the aim is purely to measure the impact of the imaging information, then raters should be blind to all other information. In ex vivo studies in which, for example, several index imaging methods are compared but the origin of the images is not blinded, raters may have a preference for a specific method, which may then be favoured. In clinical studies, such as RCTs, if the rater knows which patients received which imaging method, he or she may also be biased in the assessment of effects. The same applies to the patient for patient-related outcomes. Therefore, both the rater and the patient should be blinded to which index method was used, so-called double-blinding. If, however, the index test is an add-on test and is normally interpreted together with other information, then blinding to other information is inappropriate.31 In clinical practice, there is normally access to patients’ history and test results. These will influence diagnostic thinking and treatment planning. A study which only provides imaging to raters may be removed from the context of the clinical situation. Some studies, however, may model the situation of a radiologist reporting an X-ray examination; in such cases, the clinical information provided to the radiologist (the rater) should be consistent with what would normally be provided by a referrer.

Performance bias: As differences in test conditions such as image acquisition, sample processing and interpretation have the potential to affect results, it is important to describe the index test fully. When assessing study quality according to QUADAS-2,24 one of the most important questions is “Was the index test adequately described to permit replication?” For example, image quality is dictated by exposure factors, field variables, the image receptor, and monitor and viewing conditions. Therefore, authors should describe all variables involved in the execution of the index test in sufficient detail to allow other researchers to replicate the study, or to allow readers to judge the feasibility of the index test in their own settings. The question about replication is equally valid for the reference standard. Unfortunately, index tests and reference standards are often insufficiently reported, which means that the effects of differences in index test and reference standard methodology among studies cannot be explored, nor whether such differences are a potential source of variation in diagnostic accuracy.33

Citation and publication bias: These types of bias are sometimes called reporting biases, as both influence the dissemination of research findings. There are many reasons for citing an article; the one that has emerged as the most important is to justify the results of one’s own study. Additionally, studies reporting higher accuracy estimates are cited more frequently than those with lower accuracy estimates.34 This may especially be the case for studies of new interventions, such as a novel imaging method. Over time, studies with positive results may come to be regarded as truisms, even when other studies suggest contradictory results. Another phenomenon in our field is “overcitation” or “cascade citation”, i.e. citation of a paper that cited another article that was frequently mentioned/cited and published in a preferred journal, without any analysis of the original sources, or “shell citation”, i.e. when the cited paper does not in fact contain the stated message.

Publication bias, or dissemination bias, concerns what is published as well as what is not. For instance, once the benefits of a scientific finding are well established, it may become difficult either to write or to publish papers that fail to affirm those benefits. It has been found that the most common reason for non-publication is simply that investigators decline to submit such results, as they assume that they must have made a mistake. On the other hand, studies with significant results are cited 1.6 times more often than studies with non-significant results, and articles in which the authors explicitly conclude that they have found support for their hypothesis are cited 2.7 times as often.35 Publication bias is an important topic in systematic reviews and meta-analyses. Within reviews, funnel plots and related statistical methods can be used to indicate the presence or absence of publication bias, although these can be unreliable in some circumstances.
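
As a minimal, hypothetical illustration of such a funnel plot, the sketch below plots invented per-study effect estimates against their standard errors; marked asymmetry around the summary estimate may suggest publication bias, although other explanations exist.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-study estimates from a meta-analysis: effect size (e.g. log diagnostic
# odds ratio) and its standard error; smaller studies have larger standard errors.
effect = np.array([1.9, 2.3, 1.7, 2.8, 2.1, 3.0, 1.5, 2.6])
std_err = np.array([0.15, 0.20, 0.25, 0.45, 0.30, 0.55, 0.18, 0.50])

plt.scatter(effect, std_err)
plt.gca().invert_yaxis()  # precise (large) studies at the top, by convention
plt.axvline(np.average(effect, weights=1 / std_err ** 2), linestyle="--")  # inverse-variance summary
plt.xlabel("Effect estimate (e.g. log diagnostic odds ratio)")
plt.ylabel("Standard error")
plt.title("Funnel plot: asymmetry may suggest publication bias")
plt.show()
```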

Plan and perform appropriate statistical analyses

Most researchers will have acquired some statistical experience, but few of us are masters of the subject. There is no substitute for including a research statistician or methodologist within a research team, ideally someone with specific experience in diagnostic methods. The worst approach is to plan and perform a study, but only seek statistical advice once the results are in your hands. The statistical collaboration should be there from the earliest planning stages, to provide input on sample size calculations, methodology and how results will be interpreted. It is beyond the remit of this review to provide statistical guidance. Sources of guidance are available from textbooks36,37 and journal publications.38–40

Nonetheless, some basic understanding of statistical methods is needed even when a statistical expert forms part of the research team. The understanding and handling of reliability and agreement, which are important concepts in the conduct of studies at all efficacy levels (Level 1 to Level 6), is essential. Estimates of reliability and agreement in clinical studies that include an imaging method to analyse the outcomes of interventions are important as well, but are often not reported or largely unknown. Results of reliability and agreement provide information about the amount of error inherent in any diagnosis, score or measurement, which determines the validity of the study results. The terms reliability and agreement are often used interchangeably, but here we keep to the definitions presented by Kottner et al15: “Reliability may be defined as the ratio of variability between subjects (e.g., patients) or objects (e.g., computed tomography scans) to the total variability of all measurements in the sample.” Thus, reliability can be seen as the ability of a measurement to differentiate among subjects or objects. Agreement “is the degree to which scores or ratings are identical.”15 Two aspects of reliability/agreement are commonly addressed: interrater and intrarater reliability/agreement.
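
One way to operationalize this definition of reliability is the intraclass correlation coefficient. The sketch below is a minimal illustration only, computing a one-way ICC from a hypothetical subjects-by-raters matrix; the appropriate ICC form should always be chosen to match the actual study design.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1): between-subject variance relative to
    total variance, estimated from a subjects x raters matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape                        # n subjects/objects, k raters (or repeats)
    grand_mean = scores.mean()
    subject_means = scores.mean(axis=1)
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical example: marginal bone level (mm) for 6 patients measured by 3 raters
bone_levels = [[2.1, 2.3, 2.2],
               [3.5, 3.4, 3.6],
               [1.0, 1.2, 1.1],
               [4.2, 4.0, 4.3],
               [2.8, 2.9, 2.7],
               [3.9, 4.1, 4.0]]
print(round(icc_oneway(bone_levels), 2))
```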

Reliability and agreement are not fixed properties but rather the product of interactions between the tools, the subjects or objects and the context of the assessments. Therefore, estimates of reliability/agreement are only interpretable when the measurement setting is sufficiently described and the methods for calculation fully explained. The guidelines for reporting reliability and agreement studies15 present important issues to consider when designing, conducting and reporting studies. Besides the population of interest, the sample size, the number of raters (observers) and the number of replicate observations are important issues. For studies of rater reliability/agreement, several raters should be included to provide some information about the generalizability of results. According to Swets and Pickett,41 including more than six or seven raters has little consequence for the results when a reasonably large sample is examined. Obuchowski42 presented a graded approach to radiological studies at the Diagnostic Accuracy Efficacy level, according to the phase of the study. Thus, initial, exploratory studies of a new diagnostic test (Phase I) might need few raters, while more challenging studies with more complex patients (Phase II) might require 5 to 10 raters. Diagnostic accuracy studies at what she refers to as the “Phase III”, or advanced, level should have more than 10 raters, from more than one institution, if the results from the sample of raters are to be generalized to a population of raters. Although Swets and Pickett41 and Obuchowski42 may offer different views on the number of raters, what is absolutely clear from both is that including only one or two raters in anything other than the most basic exploratory research study is unsatisfactory.

The study design for an imaging method and clinical problem will be influenced not only by the number of objects and raters but also by the rater selection, e.g. their expertise. General dentists and specialists will not necessarily obtain the same results. More experienced clinicians may similarly have different diagnostic thinking patterns. The use of inexperienced students as raters is probably not the best idea, unless the study is about student performance. The right approach is to match the profile of the raters to the intended use of the test under evaluation in the real clinical situation. For example, if a caries diagnostic tool is to be tested, general dentists would probably be the right group from which to recruit, rather than radiologists, as the former normally undertake this diagnostic task. In contrast, radiologists would be appropriate raters for a study on the impact of using a new MRI sequence for salivary gland disease.

Depending on the sampling, the type of data (categorical or continuous) and the treatment of errors, different statistical methods can be used for the analyses, as described in detail by Kottner et al.15 For categorical data, κ statistics, including Cohen’s κ, Cohen’s weighted κ and the intraclass κ statistics, can be applied. When the categories are ordered, it is preferable to use weighted κ, which takes into account the degree of difference between repeated assessments. In our literature, when interpreting the degree of variability or agreement using κ coefficients, the categories (poor, slight, fair, moderate, substantial or almost perfect agreement) proposed in the guidelines by Landis and Koch43 are widely adopted. However, interpretation against arbitrary standards does not distinguish between different applications of the guidelines; for example, the reliability may be acceptable for users in a research setting but inadequate for clinical decision-making on individual patients.15 For measuring the reliability of continuous scales, the Pearson correlation coefficient should not be used; a strong relationship can exist between repeated measurements despite there being a substantial difference between them. The intraclass correlation coefficient may be adopted, together with agreement measures such as the proportion of agreement or specific agreement and the standard error of measurement.15 Bland and Altman plots can be used to calculate limits of agreement between two sets of measurements, show systematic bias between them and indicate whether the level of agreement changes with the size of the measurement, e.g. if agreement falls at higher values of the measured item.44 Whatever statistical approach is used, CIs as measures of statistical uncertainty should be reported to allow readers to judge, in particular, the lower bound of reliability/agreement.
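
As a minimal illustration of some of these measures, the sketch below (with hypothetical ratings and measurements) computes unweighted and linearly weighted Cohen’s κ for two raters scoring on an ordinal scale, and the Bland-Altman bias and 95% limits of agreement for two sets of continuous measurements.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# --- Categorical ratings: two raters scoring 10 images on an ordinal 3-step scale ---
rater_a = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
rater_b = [0, 1, 1, 1, 0, 2, 2, 1, 0, 2]
print("Unweighted kappa:", cohen_kappa_score(rater_a, rater_b))
print("Linearly weighted kappa:", cohen_kappa_score(rater_a, rater_b, weights="linear"))

# --- Continuous measurements: Bland-Altman bias and limits of agreement ---
method_1 = np.array([2.1, 3.4, 1.2, 4.0, 2.9, 3.8, 1.7, 2.5])
method_2 = np.array([2.3, 3.3, 1.4, 4.4, 2.8, 4.0, 1.6, 2.9])
diff = method_1 - method_2
bias = diff.mean()                    # systematic bias between the two methods
half_width = 1.96 * diff.std(ddof=1)  # half-width of the 95% limits of agreement
print(f"Bias {bias:.2f}, limits of agreement {bias - half_width:.2f} to {bias + half_width:.2f}")
```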

We want to stress that, in the conduct of any clinical trial or epidemiological survey, it is important that the reliability or agreement of those responsible for data collection and interpretation is presented. Rater variation may be higher than the variation between methods.

Use guidelines for high-quality and complete reporting

Incomplete reporting has been identified as a major source of avoidable waste in biomedical research.45 These authors recommend that researchers use reporting guidelines, examples of which are given in the section “Terminology and guidelines” above. Reports of research should answer four questions: (i) what questions were addressed and why, (ii) what was done (sample and methods), (iii) what was found (results reported fully and clearly) and (iv) what the findings mean (in the context of other research).45 It is claimed that new studies in biomedicine rarely set their findings in the context of the body of other relevant research and cite a biased selection of previous publications.8,45 One way to avoid a skewed citation process is to interpret new primary research in the context of an existing systematic review, which the NIHR (National Institute for Health Research) in England has required applicants to do.8 The inclusion in each submitted study of a text on “Strengths and weaknesses in relation to other studies, particularly any difference in results” may also increase awareness of these issues.

Specific considerations of research at the different efficacy levels

Level 1 Technical efficacy

This is the foundation level of diagnostic efficacy and is concerned with parameters that can define the performance of the imaging system; thus, it is about the technology rather than health and disease. Measurements are usually objective in nature, with only limited contribution of subjective assessment by raters (observers). Traditionally, research at this level is led by Medical Physicists rather than clinical researchers because of their training and skills. Because the volume of publications in this area is large in our knowledge field, this efficacy level will not be elaborated on comprehensively.

(1) Measures to assess “Technical efficacy”

There is no general agreement as to what should be included in image quality, or what measures to choose for its assessment. For example, in some diagnostic uses, spatial resolution might be the most important factor, and in some imaging systems artefacts from e.g. metal objects and patient motion may be important.46,47 The technical efficacy of an imaging system might be described as a measure of the diagnostically significant attributes of the images it can produce. Unfortunately, no single or widely applicable definition identifies an ideal or perfect image, from either an objective or subjective perspective. Compounding this is that images are obtained for various clinical indications and that an image that is ideal or acceptable for one purpose might not be so for another. This has led to a wide variability in methods to assess acceptable image quality and an accompanying difficulty in determining optimal radiation exposure. Tests of image quality are normally required in the context of equipment commissioning and regular quality assurance.48 If performing a study of technical efficacy, therefore, a pragmatic approach is to use those guidelines on appropriate tests for that type of imaging.

Three approaches are presented in Table 7: objective image quality based on physical measurements, semi-subjective and subjective image quality related to the whole imaging chain, including raters.

Table 7.

Examples of measures to assess image quality/technical efficacy of imaging methods and description of the measures

Measure | Description
Objective image quality
Contrast resolution | Degree of density difference between two areas of an image, especially between the image of an object and the background
Line Spread Function (LSF) and Point Spread Function (PSF) | Spatial density distribution on an X-ray image of a narrow slit or a small point
Modulation Transfer Function (MTF) | A description of unsharpness over a range of spatial frequencies
Noise | Fluctuations of signal over an image as a result of a uniform exposure. For a digital detector, pixels may on average receive a specific number of X-ray photons, but in practice some will receive fewer and some more, resulting in a grainy pattern or mottle. As the number of X-ray photons is increased, by increasing exposure factors, the amount of noise increases but the proportion of the signal that is noise decreases
Noise Power Spectrum (NPS) | Noise variance analysed in terms of its spatial frequency content
Signal-to-Noise Ratio (SNR) | Ratio of the strength of a signal to the uncertainty with which it is measured, as a function of spatial frequency. SNR is a generic term which, in radiology, is a measure of the true signal (i.e., reflecting actual anatomy) relative to noise (e.g., random quantum mottle). A lower SNR generally results in a grainy appearance of images
Contrast-to-Noise Ratio (CNR) | The amount of lesion contrast relative to the amount of noise (mottle) is a key determinant of the visibility of a given lesion. The ratio of lesion contrast to image mottle is known as the contrast-to-noise ratio. CNR is mainly used for optimisation purposes in combination with radiation doses and can be defined as the ratio between lesion or structure contrast and image noise
Detective Quantum Efficiency (DQE) | A measure which combines the effects of modulation, spatial frequency and noise of an image receptor
Semi-subjective image quality
Spatial resolution | Ability to reproduce small objects or to separate the images of two objects close to each other. A single useful measurement is the limiting resolution, being the highest resolvable bar in a resolution (line-pair) test grating; this involves raters making assessments of images of a line-pair test grating
Subjective image quality
Contrast-detail resolution; visibility of defined structures (visual grading analyses) | Evaluation of image quality by grading the clarity of reproduction of anatomical or pathological structures
Artefacts | Evaluation of image artefacts from various sources, e.g. metal and motion

(2) Study designs

Objective image quality mostly involves purely physical characteristics of the imaging system. The use of specifically designed phantoms, with sizes and densities resembling those of dental interest, is necessary for testing the imaging performance characteristics, using special software tools for the interpretation of the results and the evaluation of image quality. Software can be used to determine, for example, voxel (pixel) value variation in Cone Beam CT (CBCT) sections, the standard deviation of the voxel value distribution in a region of interest, or other measures of shades of grey.
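
As a simple illustration of such region-of-interest analysis, the sketch below computes the mean and standard deviation of grey values in background and object regions of a simulated image, and derives noise, SNR and CNR using simplified operational definitions (only one of several conventions in use); the image and ROI positions are hypothetical.

```python
import numpy as np

def roi_stats(image, rows, cols):
    """Mean and standard deviation of voxel/pixel values in a rectangular region of interest."""
    roi = image[rows[0]:rows[1], cols[0]:cols[1]]
    return roi.mean(), roi.std(ddof=1)

# Hypothetical CBCT section simulated as a 2D array of grey values
section = np.random.default_rng(0).normal(1000, 25, size=(512, 512))
section[200:260, 200:260] += 150  # simulated higher-density structure

bg_mean, bg_sd = roi_stats(section, (50, 110), (50, 110))    # homogeneous background ROI
obj_mean, _ = roi_stats(section, (210, 250), (210, 250))     # ROI inside the structure

noise = bg_sd                            # noise as SD of a uniform region (one simple convention)
snr = bg_mean / bg_sd                    # simplified signal-to-noise ratio
cnr = abs(obj_mean - bg_mean) / bg_sd    # simplified contrast-to-noise ratio
print(f"Noise {noise:.1f}, SNR {snr:.1f}, CNR {cnr:.1f}")
```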

Subjective image quality assessment involves raters grading images of phantoms (contrast-detail or anthropomorphic phantoms) or images of patients. The clarity of reproduction of important anatomical or pathological structures is graded (visual grading). The methods include “Visual Grading Analysis (VGA)” and the “Image Criteria (IC)”. VGA can be performed in two ways: with relative grading, using one or several images as references, or with absolute grading, using no reference.49
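
As a minimal, hypothetical illustration, the sketch below summarizes absolute visual grading scores for two exposure protocols and compares them with a rank-based test, since grades are ordinal; in a real VGA study, ratings from the same raters and images are correlated, so a full analysis would need to account for that clustering.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical absolute visual grading scores (1 = very poor ... 5 = excellent),
# pooled over raters and images for two exposure protocols (illustration only)
protocol_low_dose = np.array([3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3])
protocol_standard = np.array([4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5])

print("Median grade, low dose:", np.median(protocol_low_dose))
print("Median grade, standard:", np.median(protocol_standard))

# Ordinal grades: compare the two distributions with a rank-based (non-parametric) test
stat, p = mannwhitneyu(protocol_low_dose, protocol_standard, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.3f}")
```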

(3) Some specific considerations before initiating a study of “Technical efficacy”

▪ Collaborate with a medical physicist wherever possible. They have the relevant knowledge but also have access to the necessary specialist equipment and test tools. If this is not available, carefully consider the level of knowledge in your research team and the feasibility of the proposed research.

▪ Try to include a comparator form of imaging. Measurement of objective image quality of a specific imaging system used in a specific way is unlikely to be meaningful without a comparator imaging method. This is likely to be either an alternative method, which is normally used for the diagnostic purpose (e.g., CBCT vs intraoral radiography), or the same method but with a change in some parameter (e.g., a “low dose” vs a “normal dose” protocol).

▪ Think about the clinical context in which the imaging technique might be used and perform your tests using the relevant parameters. For example, if investigating how image quality is affected by changing the manufacturer’s resolution setting in the context of CBCT for endodontics, it would make no sense to use a large field of view setting.

▪ Remember that the technical performance of an imaging modality is influenced by changes at any point along the imaging chain. This is obviously true for the exposure settings selected, but performance is also affected by reconstruction algorithms and monitor specifications. Ensure, therefore, that the experimental set-up reflects clinical practice. For example, does using a high-specification radiology monitor give different results from the type of monitor commonly used in primary healthcare clinics?

▪ Use the right test tools for objective measurements. The physical factors of spatial resolution, contrast resolution, image noise, etc. are related to image quality and are therefore used in quality control programmes for imaging modalities. For analysis of image quality, test tools and phantoms that conform to international standards are required to measure the physical factors, and these should be used rather than something specifically manufactured for your own use. As mentioned above, perform the tests using widely recognized protocols used in commissioning tests and quality assurance.48

Finally, it is important to emphasize that objective image quality measurements on their own provide no information about the adequacy of image quality for diagnostic purposes. Studies at this level alone can be helpful when comparing two or more systems used for similar clinical purposes, but assessing the clinical impact and identifying adequate image quality for a specific diagnostic task requires research at the higher levels of the hierarchy. Forty years ago, Metz50 stated: “At present one cannot confidently predict the diagnostic performance of a medical imaging procedure from knowledge of its physical characteristics”. Therefore, it is important to investigate the relationship between objective and subjective image quality. Moreover, the essential issue is whether a given modality increases accuracy, not only for obvious cases but for the whole spectrum of the target condition in the included patients.

Level 2 Diagnostic accuracy efficacy

(1) Measures to assess “Diagnostic accuracy efficacy”

The term accuracy means how well the index test actually measures what it is supposed to measure, i.e. the true state of the disease or condition. Traditional measures for accuracy and their descriptions are presented in Table 8.

Table 8.

Measures to assess diagnostic accuracy of imaging methods and description of measures

Measure | Description
Sensitivity: true positive rate | The probability that the index test correctly identifies those with the target condition (compared with the reference standard); the percentage of true positive test results found in the group that actually has the target condition
Specificity: true negative rate | The probability that the index test correctly rules out those without the target condition (compared with the reference standard); the percentage of true negative test results found in the group that is actually healthy
Positive predictive value (PV+) | True positive test results as a percentage of all positive test results, i.e. true positives/(true positives + false positives); the probability that a positive test result is correct. False positive cases may be termed “false alarm” cases
Negative predictive value (PV-) | True negative test results as a percentage of all negative test results, i.e. true negatives/(true negatives + false negatives); the probability that a negative test result is correct. False negative cases may be termed “missed” cases
Likelihood ratios (LRs) | Indicate the clinical value of the index test; if an LR is close to 1.0, the index test is of little value
Positive likelihood ratio (LR+) | The odds of correctly identifying those with the target condition versus false alarms, i.e. sensitivity/(1 - specificity)
Negative likelihood ratio (LR-) | The odds of missing cases versus correctly ruling out those without the target condition, i.e. (1 - sensitivity)/specificity
Receiver Operating Characteristic (ROC) | The relation between the true positive rate and the false positive rate, usually at five threshold steps

Every diagnostic method (index test) has its own profile. For example, the sensitivity of a method may be so high that it successfully captures all cases with the target condition, yet the method may be so non-specific that it also gives positive results for several cases without the target condition. Few methods are highly sensitive and highly specific at the same time. The predictive values of an index test are extremely dependent on the prevalence of disease in the study cases, while sensitivity and specificity are less dependent, although Bossuyt et al11 in the STARD 2015 document stated that “It is now well established that sensitivity and specificity are not fixed properties. The relative number of false-positive and false-negative test results varies across settings, depending on how patients present and which test they have already undergone”. In addition, sensitivity and specificity estimates can differ owing to variable definitions of the reference standard against which the index test is compared.33,51 Likelihood ratios (LRs) have the advantage of incorporating all four cells, i.e. the true positive, false positive, true negative and false negative test results, in contrast to sensitivity, specificity and predictive values, each of which makes use of only two.52 An LR+ equal to or higher than five is deemed to provide a moderate increase in the post-test probability of the target condition.53
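
As a minimal illustration, the sketch below computes these measures from a hypothetical 2 × 2 table of index test results against the reference standard; the counts are invented for illustration only.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Diagnostic accuracy measures from a 2x2 table of index test vs reference standard."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, npv=npv, lr_pos=lr_pos, lr_neg=lr_neg)

# Hypothetical counts: 40 true positives, 10 false positives, 8 false negatives, 142 true negatives
for name, value in accuracy_measures(tp=40, fp=10, fn=8, tn=142).items():
    print(f"{name}: {value:.2f}")
```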

The above measures of accuracy require that the outcome of the index test has been obtained (or post-processed) on a qualitative, dichotomous scale, i.e. that the index test results fall into the categories disease/non-disease. Such categorical data are tested by non-parametric statistical analyses. If disease is instead scored on a so-called confidence scale (e.g., five steps on the scale), measures of accuracy may be obtained via receiver operating characteristic (ROC) analysis. The area under the ROC curve is a quantitative measure of accuracy and should be tested by parametric statistical analyses.
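
A minimal sketch of the latter situation is given below: it computes the empirical (non-parametric) area under the ROC curve and the operating points at each confidence threshold from hypothetical ratings; parametric (e.g. binormal) ROC models, as referred to above, are an alternative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical data: reference standard (1 = target condition present) and raters'
# confidence ratings on a 5-step scale (1 = sure of no disease ... 5 = sure of disease)
reference = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1])
confidence = np.array([1, 2, 4, 2, 5, 3, 1, 5, 3, 4, 5, 2, 1, 4, 2, 3])

auc = roc_auc_score(reference, confidence)               # empirical area under the ROC curve
fpr, tpr, thresholds = roc_curve(reference, confidence)  # one operating point per threshold
print(f"AUC = {auc:.2f}")
# Skip the artificial initial point that roc_curve adds before the highest threshold
for f, t, thr in zip(fpr[1:], tpr[1:], thresholds[1:]):
    print(f"threshold >= {thr}: TPR {t:.2f}, FPR {f:.2f}")
```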

(2) Study designs

To assess diagnostic accuracy, the outcomes of one or more index tests must be held against a reference standard, which reveals the true state of the disease/condition. A robust reference standard has previously often been termed a “gold standard”.

The reference standard: What is to be expected of a reference standard that is meant to provide the true state of disease/target condition?

  1. should be established by a method that is in itself extremely reproducible, i.e. no (almost no) rater variation

  2. should reflect the ground truth, i.e. the patho-anatomical appearance that defines the target condition

  3. should be fully independent of the index test under evaluation.

First, a hypothetical ideal “reference standard” is 100% correct in identifying the presence and the absence of the target condition, i.e. it defines the target condition. If the reference standard is not extremely reproducible but rater dependent, the true state has not been established, and very different results may appear when it is used as the reference standard for the index test under evaluation. It may then often be no better than having no reference standard at all.

Second, the term ground truth refers to the underlying absolute state of the target condition, and the reference standard strives to represent the ground truth as closely as possible. The ground truth is often defined by autopsy/biopsy. In patients, a biopsy, which would provide the best ground truth, may not be applicable. For some target conditions, “the best diagnostic procedure known to date” may therefore serve as the ground truth. As new diagnostic methods become available, the reference standard may change over time. It must be borne in mind that, logically, a new diagnostic method under evaluation can never appear more accurate than the reference standard against which it is held.

Finally, when seeking to establish accuracy, it is not meaningful to compare the outcome of an index test with the outcome of the same or a very similar test simply performed by other raters, e.g. so-called experts. Such studies merely provide information on reproducibility, not accuracy. The reference standard must be distinct from the index test under evaluation in all basic parameters.

Establishing a reference standard in vitro/ex vivo/in vivo: In vitro (“in the glass”) studies are performed with microorganisms, cells or biological molecules outside their normal context, often called “test-tube experiments”. Such study designs have little use in evaluating the accuracy of imaging methods.

Ex vivo (“out of the living”) studies are performed outside an organism. The term refers to measurements made in tissues from an organism in an external environment with little alteration of the natural conditions. Ex vivo studies allow experiments on an organism’s cells or tissues under controlled conditions. A primary advantage of using ex vivo tissues is the ability to perform tests or measurements that would otherwise not be possible or ethical in living subjects. Tissues may be removed in parts, as whole organs, or as larger organ systems. This study design may be well suited for evaluation of the accuracy of diagnostic tests. Ex vivo studies in diagnostic radiology should include naturally occurring lesions/disease in, for example, extracted human teeth or human bone specimens. Studies that include “lesions” made artificially, for example simulated caries or periapical inflammatory pathosis manufactured using a bur or chemicals, may not represent the clinical situation, since artificial lesions do not resemble the biological process of natural disease. Even though such designs provide a 100% valid reference standard, i.e. it is known where the lesion was drilled, and are easy to perform, they may add little to the understanding of an index test used for clinical diagnosis. In other diagnostic tasks, e.g. root fracture, an ex vivo artificially produced fracture may look the same as a fracture in the acute state, before any resorption has occurred after trauma to a patient.

There seems to be no evidence that the outcome from studies using the original 5-step confidence scale (e.g., “sure of no caries/almost sure of no caries/uncertain/almost sure of caries/sure of caries”) for data sampling equals the outcome from studies using a 5-step disease severity scale (e.g., “no caries/caries in enamel/caries < 1/3 into dentin/caries > 1/3 into dentin/caries into pulp”). Care must be taken, therefore, when results based on disease severity scale data are treated by ROC analysis.

In vivo (“within the living”) studies are performed on whole, living organisms, including plants, animals and humans. A robust reference standard can be difficult to obtain in studies of living organisms; however, if it can be established, such studies have a higher impact than ex vivo studies. In diagnostic radiology, it is usually unethical to expose a patient several times with the same method in order to test, for example, various forms of imaging equipment, units, resolutions or other technical parameters. Only rarely can an optimal reference standard be established in patient studies to provide accuracy measures for diagnostic radiographic tests.

(3) Some specific considerations before initiating a study of “Diagnostic accuracy efficacy”

Is the research question in the ex vivo study of clinical interest, and can the outcome be transferred to the clinical situation? A sample of extracted teeth, for example, may not be representative of teeth in a patient group or population, and the disease prevalence and severity may differ from those in patient populations. The same surely goes for human specimens obtained from donations of deceased people, where the age of the deceased may have an impact. Studies should also consider the “soft tissue” issue and the positioning of the teeth, to come as close as possible to the clinical situation. Below are some examples of dental conditions where imaging methods have been tested against a reference standard to obtain the diagnostic accuracy of the method.

Numerous ex vivo studies have been performed on the depth of caries lesions, where the extracted teeth used are those “at hand” (extracted premolars or third molars), i.e. a so-called “mixed” tooth sample not necessarily representing teeth in patient populations. If the ex vivo set-up is optimized, comparable outcomes from ex vivo and in vivo studies of the same teeth have justified this design as an accuracy test of caries lesion depth with radiographic methods as index tests. The reference standard has most often been stereo-microscopy of thin sections of the teeth after imaging, a method that is fully independent of the index test and has good reproducibility among raters. Recently, micro-computed tomography (micro-CT) has been proposed as a reference standard; however, different results may be obtained when stereo-microscopy and micro-CT are compared as reference standards, and it may not be obvious which is the more correct “true” reference standard. It should be considered, though, that micro-CT is more closely related to the index test under evaluation. A different research question from lesion depth assessment is whether or not the tooth surface is cavitated. In this situation, a clinical inspection of the tooth surface, with or without other diagnostic aids, is the reference standard, and an accuracy study can also be performed in vivo in patients.

Studies on periapical lesions can be performed in human bone specimens that include teeth. After examination with the index tests, the specimens may be sectioned through the periapical bone area and the tissue in the lesions harvested for histological examination. The reference standard is thus fully independent of the index test, but care must be taken, since not all tissue may have been harvested, and even a finding of no inflammatory cells may not totally exclude inflammation. The creation of a periapical “lesion” with an instrument, to secure a 100% solid truth, should be avoided, since the appearance of such lesions does not represent true periapical disease.

The same considerations apply to the evaluation of marginal bone loss/inter-radicular bone loss, where human specimens may seem the most valid option for an accuracy study. In patients, clinically determined furcation involvement may not be an appropriate reference standard for a radiographic index test, since the two methods identify different conditions; moreover, diagnosing the degree of clinical furcation involvement may not achieve high rater agreement. For assessment of peri-implant bone level/loss, the same considerations apply.

In studies of the mandibular third molar and its relation to the inferior alveolar nerve (IAN), it is hard to imagine a proper ex vivo set-up. Instead, in vivo studies on accuracy have been performed, in which the reference standard has been either the visibility of the nerve during removal of the molar or a neuro-sensibility disturbance postoperatively in the region of IAN innervation. Both references may be quite doubtful, since pursuing a clinical observation of the exposed nerve can be unethical, and sensibility disturbances may be caused by factors other than a close relation to the nerve.

For studies of the temporomandibular joint (TMJ), it should be thoroughly considered whether the research question is of clinical interest. Conventional radiographic methods may display merely the morphology of the bony components of the TMJ, and for that purpose human skulls can be used, with clinical inspection of the condyle, fossa and tuberculum serving as the reference standard, although this reference may be quite rater-dependent. Since the bony appearance of the condyle has little relation to patients’ TMJ symptoms, however, testing it may be of little value.

Level 3 Diagnostic thinking efficacy and Level 4 Therapeutic efficacy

(1) Measures to assess “Diagnostic thinking efficacy” and “Therapeutic efficacy”

"Diagnostic thinking efficacy" and "Therapeutic efficacy" consider the impact of the imaging method on clinicians' diagnostic thinking and management decisions, respectively. Research at these levels is frequently carried out in combination, and the research design is essentially the same; these levels are therefore considered together here, although the measures used to assess them differ somewhat (Table 9). If there is no measurable diagnostic impact of an imaging procedure, then no therapeutic impact is likely; and if no therapeutic impact is observed, it is hard to see how the imaging procedure could benefit the patient. Studies at these levels can, therefore, be seen as useful preliminary research before planning studies at higher levels using RCTs.54

Table 9.

Examples of measures to assess "Diagnostic thinking efficacy" and "Therapeutic efficacy"

Level 3 Diagnostic thinking efficacy
  • The change between the raters' pre-test diagnosis (or differential diagnosis) and post-test diagnosis (or differential diagnosis)
  • The change in the raters' estimate of the probability of their diagnosis being true, pre-test and post-test
  • The change in the proportion of cases for which the index test was perceived as "helpful" to diagnosis
  • The change in the raters' confidence that their diagnoses are true, pre-test and post-test

Level 4 Therapeutic efficacy
  • The change between the raters' pre-test management strategy and post-test management strategy
  • The change in the raters' judgement of the helpfulness of the index test in planning management
  • The change in the raters' confidence that their management strategy is appropriate, pre-test and post-test
  • The number and proportion of times a procedure was changed because of information from the index test
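To make the measures in Table 9 concrete, the following is a minimal computational sketch using entirely hypothetical before-after rating data; the case structure, values and variable names are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (hypothetical data): computing some of the Table 9 measures
# from before-after ratings recorded for each case.

cases = [
    # (pre-test diagnosis, post-test diagnosis,
    #  pre-test confidence 0-100, post-test confidence 0-100,
    #  pre-test management plan, post-test management plan)
    ("no periapical lesion", "periapical lesion", 60, 85, "review", "endodontic treatment"),
    ("periapical lesion", "periapical lesion", 70, 90, "endodontic treatment", "endodontic treatment"),
    ("no periapical lesion", "no periapical lesion", 80, 80, "no treatment", "no treatment"),
]

n = len(cases)

# Level 3: proportion of cases in which the diagnosis changed after the index test
diagnosis_changed = sum(pre != post for pre, post, *_ in cases) / n

# Level 3: mean change in the raters' confidence that the diagnosis is true
confidence_shift = sum(post_c - pre_c for _, _, pre_c, post_c, _, _ in cases) / n

# Level 4: proportion of cases in which the management strategy changed
management_changed = sum(pre_m != post_m for *_, pre_m, post_m in cases) / n

print(f"Diagnosis changed in {diagnosis_changed:.0%} of cases")
print(f"Mean confidence change: {confidence_shift:+.1f} points")
print(f"Management changed in {management_changed:.0%} of cases")
```

In a real study, such summary measures would typically be reported per rater and per patient group, together with appropriate measures of uncertainty.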

(2) Study designs

To identify the impact of an imaging procedure on patients, the strongest research design would be a randomized controlled trial (RCT), but carrying out RCTs can be difficult. This is not only because of cost and complexity, but also because the imaging method may already be a part of clinical practice, and withholding it from the control group in an RCT might be judged unethical. Indeed, RCTs have been described as "too cumbersome or impractical for regular evaluation of diagnostic technologies".54 Fryback and Thornbury10 described studies at the diagnostic thinking level as an "empirically feasible proxy for measuring ultimate impact on the patient", and the same can be said for research at the therapeutic efficacy level. Important aspects to optimize the quality of studies at these levels are listed below, adapted from those described by Guyatt et al55 with some additional points.

Before-after study design: The commonest way of conducting studies at both levels is to observe and record the diagnostic thinking and/or treatment planning of clinicians before and after the introduction of the imaging method being investigated, allowing the identification of any changes resulting from the index test. This design is referred to as a "before-after study" or "pre-post study". Such studies are best suited to evaluating the clinical impact of a single test, or of one which is an "add-on" to existing tests, rather than to comparing different diagnostic options.31,54 Some authors31,55 recommend evaluation of the true disease status in before-after studies, by means of pathological evidence or long-term follow-up of patients, to provide some concurrent indication of diagnostic accuracy efficacy and to take account of false-positive and false-negative results. The convincing argument for this is that a change in diagnosis and/or therapeutic decisions resulting from using a diagnostic test is not in itself evidence of improvement; a change in diagnostic thinking or patient management may lead to harm rather than good. It is hard, however, to see how this recommendation can be put into practice in some clinical situations. For example, a new imaging method for detection of periapical lesions can have no reference standard unless the teeth of the patients in a study are to be extracted and the periapical area resected, or the target group is teeth in need of surgical endodontic treatment in which periapical tissue can be harvested.56 Furthermore, as far as outcomes are concerned, all patients undergo the same imaging, so there is no comparator group.

The before-after design was originally applied to studies looking into the impact of a new technology. An example might be to measure clinicians’ decisions for patients with a particular condition, in a particular healthcare setting, before and after the introduction of CBCT. It is obvious that this is a potentially weak design. It is uncontrolled and there may be temporal changes in patient management unrelated to the availability of CBCT, such as a change in accepted management strategies, changes in clinical staff, greater familiarity with CBCT over time and other management aspects. Any of these could lead to a change in practice not attributable to the CBCT. Some improvement in design is possible by comparing with a control population over the same time period, but identifying a good control group is not straightforward.

The weaknesses of before-after studies have been well described.55,57

  • What participating clinicians/raters say they would do may not be what they would do in real life, simply because they perceive that they are being tested.

  • Participating raters may have an unconscious bias in favour of the index test and may lower their scoring of confidence or probability of a diagnosis in the “before” part of the study.

  • Participating raters may have an unconscious bias against the index test and may lower their scoring of confidence or probability of a diagnosis in the “after” part of the study.

  • There can be no comparison of patient outcome because all patients receive the imaging, unlike an RCT.

  • Change in diagnostic thinking or therapeutic decisions may not benefit the patient.

"With/without" study design: An alternative to a genuine "chairside" clinical study, and one which has often been used, is to collect clinical data and radiological information for a cohort of patients with a particular condition and to ask questions of a panel of independent clinicians, first without the information from the imaging method and then again after supplying the images and/or radiological report. The material may be purely radiological, but it can also take the form of "paper cases" or "clinical vignettes" containing a comprehensive set of clinical information and the results of other tests. This type of study has the advantage that many raters can be involved, something that is usually impractical, or even unethical, to inflict on patients. It is important to recognize that this type of study presents different challenges from the clinical before-after study. For studies using clinical vignettes, consider including clinical photographs, a detailed patient history and a full record of intraoral and extraoral examinations.

(3) Some specific considerations before initiating a study of “Diagnostic thinking efficacy” or “Therapeutic efficacy”

It is important not to over-interpret results that show changes in diagnostic thinking or therapeutic decisions. In the absence of data on patient outcomes compared with a control group, changes at these levels may not translate into patient benefit; only a well-conducted RCT can provide that evidence.

Carefully consider the time interval between the "before" and "after" parts. For studies involving patients and clinicians, the interassessment period should be as short as possible, ideally providing the imaging information as soon as it is available,31 which is consistent with everyday clinical practice. With "paper cases"/vignettes, however, some researchers choose to ask raters to undertake two separate and complete case assessments, with a (variable) time interval between them in order to minimize recall. Nonetheless, raters may retain some memory of the cases, which introduces a risk of bias in the results. One way of overcoming this potential bias is to split the cases so that half are assessed using the normal before-after sequence for availability of the index test, and half using an after-before sequence, as sketched below.
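As an illustration of the split-sequence idea, the snippet below randomly allocates a hypothetical set of vignettes to either a before-after or an after-before assessment sequence; the number of cases and the random seed are arbitrary assumptions, not values from the paper.

```python
# Minimal sketch (illustrative only): randomly splitting a case set so that half
# of the vignettes are assessed in the usual before-after sequence and half in an
# after-before sequence, as a way of limiting recall/order bias.

import random

case_ids = list(range(1, 41))           # 40 hypothetical vignettes
random.seed(2020)                        # fixed seed so the allocation is reproducible
random.shuffle(case_ids)

half = len(case_ids) // 2
before_after = sorted(case_ids[:half])   # index test information supplied second
after_before = sorted(case_ids[half:])   # index test information supplied first

print("Before-after sequence:", before_after)
print("After-before sequence:", after_before)
```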

For reporting Level 3 and 4 studies there is no specific standard, unlike Level 2 and 5 studies for which STARD 201511 and CONSORT14 exist, respectively. However, it is possible to use the parts of STARD and/or CONSORT that are relevant. When planning and writing up the results of any study, it is worthwhile remembering that one day it may be reviewed by others carrying out a systematic review or meta-analysis. A shrewd researcher would benefit from assessing the manuscript of their study using a systematic review tool prior to submission; Meads and Davenport57 have developed such a tool by modifying QUADAS.

Level 5 Patient outcome efficacy

(1) Measures to assess “Patient outcome efficacy“

Patient outcomes pertain to the effect of imaging methods on individual patients, the ultimate goal of health care being to improve patient outcomes. It is at this level that expected benefits, such as quality-adjusted life-years (QALYs), oral health-related quality of life (OHRQoL), quality-adjusted tooth years (QATYs) and the number of teeth saved, can be weighed against adverse effects, such as burden, pain and risk. Patient satisfaction may be another outcome to consider: if a new imaging method is as effective as, and costs the same as, the conventional one, but hurts the patient or is unacceptable for other reasons, patient satisfaction might tip the balance against the new method. Given its established adverse effects, radiation dose may also be an important consideration when several imaging options are available. With information from Level 5 studies, a clinician may be able to make more informed decisions about whether or not to request or perform an imaging examination based on patient outcome efficacy.

(2) Study designs

The study design least prone to bias is the RCT, which is therefore considered one of the highest-quality study designs for providing information on the benefits to the patient. An RCT is accomplished by randomly allocating subjects to two or more groups (experimental and control groups), imaging them using different methods, and then comparing them with respect to measured patient-related outcomes. Patient outcomes are compared, whenever possible, in a blind manner; blinding can be imposed on any participant in an experiment, including subjects, researchers, technicians, data analysts and evaluators.

Decision analytic modelling is another study design, which is now recognized as a practical alternative. In decision analytic modelling, the probabilities of events along different pathways may be known and obtained from the literature, or they may be assumed. According to Buxton et al,58 modelling overcomes a number of additional methodological hurdles common to diagnostic test evaluation: the need to link evidence from a number of different sources; the lack of long-term outcome data in scenarios where only intermediate endpoints (e.g., test accuracy) have been measured or where only short-term follow-up is possible; and the need to compare many interventions (e.g., testing strategies/pathways), which may not be feasible within a single RCT. A specific form of decision analytic modelling is the Markov model, which can account for the fact that treatment involves sequential, stochastic decisions over time, as illustrated in the sketch below.
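The following is a minimal sketch of a Markov-type cohort model; the states, transition probabilities, costs and utility weights are hypothetical placeholders chosen for illustration only and are not taken from the paper or any cited study.

```python
# Minimal sketch (hypothetical numbers): a Markov cohort model with yearly cycles.
# A treated tooth is "healthy", develops a "lesion", or is "extracted"; expected
# costs and QALYs accumulate over the chosen time horizon.

states = ["healthy", "lesion", "extracted"]

# Annual transition probabilities: rows = current state, columns = next state.
transition = {
    "healthy":   {"healthy": 0.92, "lesion": 0.07, "extracted": 0.01},
    "lesion":    {"healthy": 0.30, "lesion": 0.55, "extracted": 0.15},
    "extracted": {"healthy": 0.00, "lesion": 0.00, "extracted": 1.00},
}
annual_cost = {"healthy": 20.0, "lesion": 250.0, "extracted": 0.0}   # monetary units
utility     = {"healthy": 1.00, "lesion": 0.85,  "extracted": 0.70}  # quality weights

cohort = {"healthy": 1.0, "lesion": 0.0, "extracted": 0.0}  # whole cohort starts healthy
total_cost, total_qalys = 0.0, 0.0

for year in range(10):                                   # 10-year time horizon
    total_cost  += sum(cohort[s] * annual_cost[s] for s in states)
    total_qalys += sum(cohort[s] * utility[s] for s in states)
    # Move the cohort one cycle forward according to the transition probabilities.
    cohort = {t: sum(cohort[s] * transition[s][t] for s in states) for t in states}

print(f"Expected cost per patient:  {total_cost:.0f}")
print(f"Expected QALYs per patient: {total_qalys:.2f}")
```

A full model would also apply discounting and half-cycle corrections, and would run separate cohorts for each imaging strategy so that their expected costs and effects can be compared.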

(3) Some specific considerations before initiating a study of “Patient outcome efficacy”

It is not always feasible to perform an RCT for ethical, financial or other reasons. In those cases, case-series collected before and after the introduction of a new test technology (i.e., a before-after study design) or case–control studies may provide some of the answers. Although open to bias, they may better reflect day-to-day practice compared with RCTs and could in this way provide information important to designing effective RCTs.

Whatever the study design, the impact of spectrum bias or referral bias (see Table 6 for definitions) should be considered. Offsetting referral bias mostly requires the inclusion of patients from differing practice environments, which increases the likelihood that patients presenting with low, intermediate and high probability of disease, and of subsequent therapy, will be included.

Level 6 Societal efficacy

(1) Measures to assess “Societal efficacy”

While research at Level 5 "Patient outcome efficacy" deals with individual patients, Level 6 "Societal efficacy" relates to whether an imaging method is efficacious to the extent that it is an efficient use of societal resources. It is in society's interest to achieve the greatest possible benefit from the resources allocated to health care; once resources are used in a particular way, they cannot be used in an alternative manner. As resources are limited, decision makers need to consider not only whether an intervention is effective, but also whether it is cost-effective and what resources are available. Measures of outcomes at Levels 5 and 6 are often combined to demonstrate whether methods not only improve the patient's outcome but also have an acceptable cost. Any of the measures of the previous levels can be used as input, for example cost per surgery avoided, cost per appropriately treated patient, cost per life year gained or cost per quality-adjusted life year gained. Final outcomes, such as life years gained or QALYs gained, are preferred over intermediate outcomes in economic evaluations, as they allow comparisons across a broader range of health interventions.
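A standard way of combining costs and effects at this level, common in health economics though not specific to this paper, is the incremental cost-effectiveness ratio (ICER) comparing a new imaging method with the conventional alternative:

\[
\mathrm{ICER} = \frac{C_{\text{new}} - C_{\text{conventional}}}{E_{\text{new}} - E_{\text{conventional}}}
\]

where C denotes the expected cost and E the expected effect (for example, QALYs gained) of each strategy; an ICER below the decision maker's willingness-to-pay threshold is generally taken to indicate acceptable value for money.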

(2) Study designs

A study design at this level always involves a comparative analysis of alternative courses of action.59 Figure 2 presents a simplified model for decision-making when selecting a new method such as CBCT compared with a conventional method, e.g. periapical radiography, based on cost-effectiveness. Four main types of health economic analyses are currently in focus: cost analysis, cost-effectiveness, cost-utility and cost-benefit analysis (Table 10). The difference between the analyses lies mainly in the way in which the consequences are measured and expressed. In cost analysis, the costs of different interventions are compared without any analysis of the consequences; in the other three types of analyses, both costs and consequences are measured. The expected costs may be in the form of radiation dose, monetary units, time and the discomfort/pain of an examination. These costs are weighed against the expected consequences, which may be in the form of improved quality of life/oral health, sense of well-being, or avoidance of other tests or procedures, as a rational guide for the clinician's decision about whether or not to perform an examination (Table 10).

Figure 2.

A simplified model of the decision process resulting from a health economic evaluation of a new imaging method compared with a conventional imaging method. The boxes shaded in a dark grey indicate a straightforward decision in selection of an imaging method, because one method is superior in both cost and effectiveness. The boxes shaded in lighter grey indicate that for one measure (cost or effectiveness), the two methods are the same, but for the other measure there is a difference. In such cases, one method might be superior based on higher effectiveness while costing less or the same. Alternatively, the same option may be appropriate when the effectiveness is the same but the costs are less. The boxes containing question marks indicate situations in which decisions become more complex from the health economic perspective and the final choice of method is dependent on factors other than effectiveness and costs (e.g., patient preferences or equipment reliability).

Table 10.

Health economic evaluation – types of analysis and consequences (modified after Drummond et al 2015).59 All types of analyses involve valuation of resources (monetary or other types of costs e.g. time)

Type of analysis | Identification of consequences | Measurement/valuation of consequences
Cost analysis | None | None
Cost-effectiveness | Single effect of interest, common to both alternatives, but achieved to different degrees | Natural units (e.g., life-years gained, disability days saved), number of avoided episodes of pain, number of teeth saved, number of avoided decayed tooth surfaces
Cost-utility | Single or multiple effects, not necessarily common to both alternatives | Healthy years (typically measured as quality-adjusted life-years, QALYs), oral health-related quality of life (OHRQoL), quality-adjusted tooth years (QATYs)
Cost-benefit | Single or multiple effects, not necessarily common to both alternatives | Monetary units, i.e. both costs and benefits of interventions are described in monetary terms

Because data on these outcomes and on the costs of the diagnostic and subsequent therapeutic paths are not routinely available, modelling becomes inevitable, and the validity of the model input parameters is crucial for the credibility of the model. Cost-effectiveness models can only upgrade the level of evidence if Level 5 evidence is available for the outcomes used in the model. A framework, based on the model presented by Drummond et al,59 for the systematic identification, measurement and valuation of incremental costs of diagnostic methods in oral health care was presented by Christell et al.60 The framework can be used alongside measures of incremental effects measured in accordance with the stated purpose, goals or objectives of an imaging method. Patient utility assessment measures include, for example, (a) the patient's willingness to pay, (b) standard gamble methods, (c) ad hoc surveys and (d) questionnaires with utility instruments and psychological instruments to assess factors such as anxiety, discomfort and inconvenience.61

In studies of societal efficacy, the question is "How sensitive are the results to changes in the underlying assumptions characterizing the context?" Sensitivity analysis can be used to illustrate and assess the level of confidence that may be associated with the conclusion of an economic evaluation. One aim of sensitivity analysis is to find out which variables "drive" the results. Sensitivity analysis is performed by varying key assumptions made in the evaluation (individually or severally) and recording the impact on the result (output) of the evaluation, as illustrated in the sketch below. Some variables carry greater weight than others; a sensitivity analysis might show, for example, that the size and type of sample is an important variable, as are the time horizon and the criteria for including or excluding particular cost components.62
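The following is a minimal sketch of a one-way sensitivity analysis under stated assumptions: the cost of the new imaging method is varied over a plausible range while all other inputs are held fixed, and the incremental cost-effectiveness ratio is recomputed each time. All values, including the willingness-to-pay threshold, are hypothetical.

```python
# Minimal sketch (hypothetical values) of a one-way sensitivity analysis:
# vary one key input (cost of the new imaging method) and record how the
# incremental cost-effectiveness ratio (ICER), and the conclusion, change.

cost_conventional, effect_conventional = 100.0, 0.80   # cost and QALYs, assumed
effect_new = 0.84                                       # QALYs with the new method, assumed

def icer(cost_new: float) -> float:
    """Incremental cost per incremental QALY for a given cost of the new method."""
    return (cost_new - cost_conventional) / (effect_new - effect_conventional)

threshold = 20000.0   # assumed willingness-to-pay per QALY

for cost_new in (150.0, 400.0, 800.0, 1200.0):
    ratio = icer(cost_new)
    verdict = "cost-effective" if ratio <= threshold else "not cost-effective"
    print(f"Cost of new method {cost_new:7.0f}: ICER = {ratio:8.0f} per QALY -> {verdict}")
```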

(3) Some specific considerations before initiating a study of “Societal efficacy”

The "Societal efficacy" level has not been reached in most studies of dentomaxillofacial imaging. This has become particularly salient with the introduction of CBCT. An extensive number of publications deal with "Technical efficacy" and, to some extent, "Diagnostic accuracy efficacy" for CBCT, but few deal with the actual benefits for patients and society. There are, however, RCTs comparing CBCT and conventional imaging methods combined with calculated costs; for example, Petersen et al63 showed that CBCT examination did not reduce post-surgical neurosensory disturbances after removal of a mandibular third molar but cost four times as much as panoramic imaging.64,65 A position paper from the European Academy of DentoMaxilloFacial Radiology on CBCT imaging of the mandibular third molar was recently published.66 Systematic reviews on the efficacy of CBCT concluded that there was only one study at the level of "societal efficacy" for detection of intrabony and furcation defects67 and no study on detection of periapical lesions.68 This is probably because such studies are challenging to conduct: parameter selection should reflect the best available evidence not only for the outcomes of the index test(s) but also for the economic evaluation.

It is important not to generalize the results of studies at these levels as it may be difficult to compare results from different studies even when sensitivity analyses are performed. For cost-effectiveness analyses, estimates for both costs and effectiveness must be situated firmly within the relevant context, which includes the disease burden and budget of the setting in question. Cost analysis of CBCT examinations revealed that costs varied among different healthcare systems in Europe.69

Certainly, the choice is easy when a new method is shown to be both less expensive and more effective than the current conventional method (Figure 2). But perhaps the more common situation is that a new method is less cost-effective than current options but nevertheless contributes something valuable, or that it is cost-effective in some patients but not in others. Hence, the method can defend its position in the imaging arsenal, but it should be restricted to evidence-based indications. What need to be elucidated are inappropriate indications rather than inappropriate methods. The knowledge gaps concerning cost-effectiveness studies in dentomaxillofacial imaging call for high-quality studies. The complexity and constraints of performing analyses of diagnostic tests on costs and health outcomes make this particularly challenging. A recent review70 provides a comprehensive assessment of decision modelling, highlights problems and gives recommendations that can be applied to research on "Patient outcome efficacy" and "Societal efficacy" in our field.

Concluding comments

An understanding of the efficacy of dentomaxillofacial imaging based on high-quality research evidence is essential for clinical dental practice. Furthermore, radiological images are often used as a tool in clinical research by clinical scientists in fields other than radiology. The importance of high-quality research practice is therefore self-evident. In this paper, we have, as with a kaleidoscope, tried to elucidate the quality of this research from different angles. Based on the framework of six efficacy levels as an organizing structure, we propose a set of aspects broadly useful and applicable to the vast majority of imaging research, underpinned by recognized guidelines on research quality and reporting. Some important aspects of research, such as statistical techniques, are not covered. Our paper should not be seen as a chapter of a textbook but rather as a signposting tool. We hope that it will encourage researchers, reviewers and readers to collect, appraise and apply the scientific data that will enhance future research in dentomaxillofacial imaging.

Contributor Information

Madeleine Rohlin, Email: madeleine.rohlin@mau.se.

Keith Horner, Email: keith.horner@manchester.ac.uk.

Christina Lindh, Email: christina.lindh@mau.se.

Ann Wenzel, Email: AWENZEL@dent.au.dk.

REFERENCES

  • 1. Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. The Lancet 2009; 374: 86–9. doi: 10.1016/S0140-6736(09)60329-9
  • 2. Salman RA-S, Beller E, Kagan J, Hemminki E, Phillips RS, Savulescu J, et al. Increasing value and reducing waste in biomedical research regulation and management. The Lancet 2014; 383: 176–85. doi: 10.1016/S0140-6736(13)62297-7
  • 3. Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gülmezoglu AM, et al. How to increase value and reduce waste when research priorities are set. The Lancet 2014; 383: 156–65. doi: 10.1016/S0140-6736(13)62229-1
  • 4. Glasziou P, Altman DG, Bossuyt P, Boutron I, Clarke M, Julious S, et al. Reducing waste from incomplete or unusable reports of biomedical research. The Lancet 2014; 383: 267–76. doi: 10.1016/S0140-6736(13)62228-X
  • 5. Ioannidis JPA, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, et al. Increasing value and reducing waste in research design, conduct, and analysis. The Lancet 2014; 383: 166–75. doi: 10.1016/S0140-6736(13)62227-8
  • 6. Kleinert S, Horton R. How should medical science change? The Lancet 2014; 383: 197–8. doi: 10.1016/S0140-6736(13)62678-1
  • 7. The Lancet REWARD Campaign. 2015. Available from: https://www.thelancet.com/campaigns/efficiency/statement [accessed 2019-11-10].
  • 8. Moher D, Glasziou P, Chalmers I, Nasser M, Bossuyt PM, Korevaar DA, et al. Increasing value and reducing waste in biomedical research: who's listening? The Lancet 2016; 387: 1573–86. doi: 10.1016/S0140-6736(15)00307-4
  • 9. Leeflang MMG, Deeks JJ, Gatsonis C, Bossuyt PMM, on behalf of the Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008; 149: 889–97. doi: 10.7326/0003-4819-149-12-200812160-00008
  • 10. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11: 88–94. doi: 10.1177/0272989X9101100203
  • 11. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015; 351: h5527. doi: 10.1136/bmj.h5527
  • 12. EQUATOR Network: enhancing the quality and transparency of health research. Available from: http://www.equator-network [accessed 2019-09-25].
  • 13. Penelope research. Available from: https://www.penelope.ai/equatorwizard [accessed 2019-09-25].
  • 14. Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340: c332. doi: 10.1136/bmj.c332
  • 15. Kottner J, Audige L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int J Nurs Stud 2011; 48: 661–71. doi: 10.1016/j.ijnurstu.2011.01.016
  • 16. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009; 6: e1000097. doi: 10.1371/journal.pmed.1000097
  • 17. McInnes MDF, Moher D, Thombs BD, McGrath TA, Bossuyt PM, Clifford T, et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement. JAMA 2018; 319: 388–96. doi: 10.1001/jama.2017.19163
  • 18. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med 2007; 4: e296. doi: 10.1371/journal.pmed.0040296
  • 19. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015; 350: g7594. doi: 10.1136/bmj.g7594
  • 20. Brouwers MC, Kerkvliet K, Spithoff K, AGREE Next Steps Consortium. The AGREE reporting checklist: a tool to improve reporting of clinical practice guidelines. BMJ 2016; 352: i1152. doi: 10.1136/bmj.i1152
  • 21. Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 2017; 358: j4008. doi: 10.1136/bmj.j4008
  • 22. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014; 11: e1001744. doi: 10.1371/journal.pmed.1001744
  • 23. Lucas N, Macaskill P, Irwig L, Moran R, Rickards L, Turner R, et al. The reliability of a quality appraisal tool for studies of diagnostic reliability (QAREL). BMC Med Res Methodol 2013; 13: 111. doi: 10.1186/1471-2288-13-111
  • 24. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011; 155: 529–36. doi: 10.7326/0003-4819-155-8-201110180-00009
  • 25. Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol 2016; 69: 225–34. doi: 10.1016/j.jclinepi.2015.06.005
  • 26. Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation for clinical questions. AMIA Annu Symp Proc 2006: 359–63.
  • 27. Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, et al. Grading quality of evidence and strength of recommendations. BMJ 2004; 19: 1490.
  • 28. Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg 2010; 126: 619–25. doi: 10.1097/PRS.0b013e3181de24bc
  • 29. Schmidt RL, Factor RE. Understanding sources of bias in diagnostic accuracy studies. Arch Pathol Lab Med 2013; 137: 558–65. doi: 10.5858/arpa.2012-0198-RA
  • 30. Whiting P, Rutjes AWS, Reitsma JB, Glas AS, Bossuyt PMM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004; 140: 189–202. doi: 10.7326/0003-4819-140-3-200402030-00010
  • 31. Knotterus JA, Dinant G-J, van Schayk OP. The diagnostic before-after study to assess clinical impact. In: Knotterus JA, Buntinx F, eds. The evidence base of clinical diagnosis: theory and methods of diagnostic research. 2nd ed. Chichester: Wiley-Blackwell; 2009. pp. 83–95.
  • 32. Kahneman D. Thinking, fast and slow. New York: Farrar, Straus and Giroux; 2011.
  • 33. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, QUADAS-2 Steering Group. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol 2013; 66: 1093–104. doi: 10.1016/j.jclinepi.2013.05.014
  • 34. Frank RA, Sharifabadi AD, Salameh J-P, McGrath TA, Kraaijpoel N, Dang W, et al. Citation bias in imaging research: are studies with higher diagnostic accuracy estimates cited more often? Eur Radiol 2019; 29: 1657–64. doi: 10.1007/s00330-018-5801-8
  • 35. Duyx B, Urlings MJE, Swaen GMH, Bouter LM, Zeegers MP. Scientific citations favor positive results: a systematic review and meta-analysis. J Clin Epidemiol 2017; 88: 92–101. doi: 10.1016/j.jclinepi.2017.06.002
  • 36. Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN, Gardner MJ, eds. Statistics with confidence. 2nd ed. London: BMJ Books; 2000. pp. 105–19.
  • 37. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. 2nd ed. Wiley-Blackwell; 2011.
  • 38. Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ 1994; 308: 1552. doi: 10.1136/bmj.308.6943.1552
  • 39. Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994; 309: 102. doi: 10.1136/bmj.309.6947.102
  • 40. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004; 329: 168–9. doi: 10.1136/bmj.329.7458.168
  • 41. Swets JA, Pickett RM. Evaluation of diagnostic systems: methods from signal detection theory. New York, USA: Academic Press; 1982.
  • 42. Obuchowski NA. How many observers are needed in clinical studies of medical imaging? AJR Am J Roentgenol 2004; 4: 867–9.
  • 43. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74. doi: 10.2307/2529310
  • 44. Bland JM, Altman D. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet 1986; 327: 307–10. doi: 10.1016/S0140-6736(86)90837-8
  • 45. Glasziou P. The role of open access in reducing waste in medical research. PLoS Med 2014; 11: e1001651. doi: 10.1371/journal.pmed.1001651
  • 46. Schulze R, Heil U, Gross D, Bruellmann DD, Dranischnikow E, Schwanecke U, et al. Artefacts in CBCT: a review. Dentomaxillofac Radiol 2011; 40: 265–73. doi: 10.1259/dmfr/30642039
  • 47. Spin-Neto R, Wenzel A. Patient movement and motion artefacts in cone beam computed tomography of the dentomaxillofacial region: a systematic literature review. Oral Surg Oral Med Oral Pathol Oral Radiol 2016; 121: 425–33. doi: 10.1016/j.oooo.2015.11.019
  • 48. IPEM (Institute of Physics and Engineering in Medicine). Recommended standards for the routine performance testing of diagnostic X-ray equipment. IPEM report 91. York: IPEM; 2005.
  • 49. Båth M, Månsson LG. Visual grading characteristics (VGC) analysis: a non-parametric rank-invariant statistical method for image quality evaluation. Br J Radiol 2007; 80: 169–76. doi: 10.1259/bjr/35012658
  • 50. Metz CE. Application of ROC analysis in diagnostic image evaluation. In: Haus AG, ed. The physics of medical imaging: recording system measurements and techniques. New York: American Association of Physicists in Medicine; 1994. pp. 546–72.
  • 51. Biesheuvel C, Irwig L, Bossuyt P. Observed differences in diagnostic test accuracy between patient subgroups: is it real or due to reference standard misclassification? Clin Chem 2007; 53: 1725–9. doi: 10.1373/clinchem.2007.087403
  • 52. Grimes DA, Schulz KF. Refining clinical diagnosis with likelihood ratios. The Lancet 2005; 365: 1500–5. doi: 10.1016/S0140-6736(05)66422-7
  • 53. Deeks JJ. Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001; 323: 157–62.
  • 54. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB, Drummond M. A framework for clinical evaluation of diagnostic technologies. CMAJ 1986; 134: 587–94.
  • 55. Guyatt GH, Tugwell PX, Feeny DH, Drummond MF, Haynes RB. The role of before-after studies of therapeutic impact in the evaluation of diagnostic technologies. J Chronic Dis 1986; 39: 295–304. doi: 10.1016/0021-9681(86)90051-2
  • 56. Kruse C, Spin-Neto R, Wenzel A, Vaeth M, Kirkevang L-L. Impact of cone beam computed tomography on periapical assessment and treatment planning five to eleven years after surgical endodontic retreatment. Int Endod J 2018; 51: 729–37. doi: 10.1111/iej.12888
  • 57. Meads CA, Davenport CF. Quality assessment of diagnostic before-after studies: development of methodology in the context of a systematic review. BMC Med Res Methodol 2009; 19: 3.
  • 58. Buxton MJ, Drummond MF, Van Hout BA, Prince RL, Sheldon TA, Szucs T, et al. Modelling in economic evaluation: an unavoidable fact of life. Health Econ 1997; 6: 217–27.
  • 59. Drummond MF, Sculpher MJ, Claxton K, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. 4th ed. Oxford: Oxford University Press; 2015.
  • 60. Christell H, Birch S, Horner K, Rohlin M, Lindh C, SEDENTEXCT consortium. A framework for costing diagnostic methods in oral health care: an application comparing a new imaging technology with the conventional approach for maxillary canines with eruption disturbances. Community Dent Oral Epidemiol 2012; 40: 351–61. doi: 10.1111/j.1600-0528.2012.00674.x
  • 61. Thornbury JR. Intermediate outcomes: diagnostic and therapeutic impact. Acad Radiol 1999; 6(suppl 1): S58–65. doi: 10.1016/S1076-6332(99)80088-9
  • 62. Meltzer MI. Introduction to health economics for physicians. The Lancet 2001; 358: 993–8. doi: 10.1016/S0140-6736(01)06107-4
  • 63. Petersen LB, Vaeth M, Wenzel A. Neurosensoric disturbances after surgical removal of the mandibular third molar based on either panoramic imaging or cone beam CT scanning: a randomized controlled trial (RCT). Dentomaxillofac Radiol 2016; 45: 20150224. doi: 10.1259/dmfr.20150224
  • 64. Petersen LB, Olsen KR, Christensen J, Wenzel A. Image and surgery-related costs comparing cone beam CT and panoramic imaging before removal of impacted mandibular third molars. Dentomaxillofac Radiol 2014; 43: 20140001. doi: 10.1259/dmfr.20140001
  • 65. Petersen LB, Olsen KR, Matzen LH, Vaeth M, Wenzel A. Economic and health implications of routine CBCT examination before surgical removal of the mandibular third molar in the Danish population. Dentomaxillofac Radiol 2015; 44: 20140406. doi: 10.1259/dmfr.20140406
  • 66. Matzen LH, Berkhout E. Cone beam CT imaging of the mandibular third molar: a position paper prepared by the European Academy of DentoMaxilloFacial Radiology (EADMFR). Dentomaxillofac Radiol 2019; 48: 20190039. doi: 10.1259/dmfr.20190039
  • 67. Nikolic-Jakoba N, Spin-Neto R, Wenzel A. Cone-beam computed tomography for detection of intrabony and furcation defects: a systematic review based on a hierarchical model for diagnostic efficacy. J Periodontol 2016; 87: 630–44. doi: 10.1902/jop.2016.150636
  • 68. Kruse C, Spin-Neto R, Wenzel A, Kirkevang L-L. Cone beam computed tomography and periapical lesions: a systematic review analysing studies on diagnostic efficacy by a hierarchical model. Int Endod J 2015; 48: 815–28. doi: 10.1111/iej.12388
  • 69. Christell H, Birch S, Hedesiu M, Horner K, Ivanauskaité D, Nackaerts O, et al. Variation in costs of cone beam CT examinations among healthcare systems. Dentomaxillofac Radiol 2012; 41: 571–7. doi: 10.1259/dmfr/22131776
  • 70. Yang Y, Abel L, Buchanan J, Fanshawe T, Shinkins B. Use of decision modelling in economic evaluations of diagnostic tests: an appraisal and review of health technology assessments in the UK. PharmacoEconomics - Open 2019; 3: 281–91. doi: 10.1007/s41669-018-0109-9
