
Appendix 2.

NICE guidelines for critical appraisal of quantitative studies.29

1.1 Is the source population or source area well described?
Was the country (eg developed or non-developed, type of healthcare system), setting (primary schools, community centres etc), location (urban, rural), population demographics etc adequately described?
1.2 Is the eligible population or area representative of the source population or area?
Was the recruitment of individuals, clusters or areas well defined (eg advertisement, birth register)?
Was the eligible population representative of the source? Were important groups under-represented?
1.3 Do the selected participants or areas represent the eligible population or area?
Was the method of selection of participants from the eligible population well described?
What % of selected individuals or clusters agreed to participate? Were there any sources of bias?
Were the inclusion or exclusion criteria explicit and appropriate?
2.1 Allocation to intervention (or comparison). How was selection bias minimised?
Was allocation to exposure and comparison randomised? Was it truly random ++ or pseudo-randomised + (eg consecutive admissions; see the sketch below)?
If not randomised, was significant confounding likely (−) or not (+)?
If a cross-over, was order of intervention randomised?
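To make the distinction in 2.1 concrete, the following minimal Python sketch (hypothetical arm labels, not from the source) contrasts a truly random allocation sequence with pseudo-randomisation by alternating consecutive admissions, which is fully predictable and therefore open to selection bias.

```python
import random

def random_allocation(n_participants, seed=42):
    """Truly random (++): each participant is assigned by an independent
    random draw, so upcoming assignments cannot be predicted."""
    rng = random.Random(seed)
    return [rng.choice(["intervention", "comparison"]) for _ in range(n_participants)]

def pseudo_random_allocation(n_participants):
    """Pseudo-randomised (+): alternating consecutive admissions.
    The sequence is fully predictable, inviting selection bias."""
    return ["intervention" if i % 2 == 0 else "comparison" for i in range(n_participants)]

print(random_allocation(6))         # unpredictable sequence
print(pseudo_random_allocation(6))  # always alternates
```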
2.2 Were interventions (and comparisons) well described and appropriate?
Were interventions and comparisons described in sufficient detail (ie enough for study to be replicated)?
Were comparisons appropriate (eg usual practice rather than no intervention)?
2.3 Was the allocation concealed?
Could the person(s) determining allocation of participants or clusters to intervention or comparison groups have influenced the allocation?
Adequate allocation concealment (++) would include centralised allocation or computerised allocation systems.
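As a toy illustration of why such systems score well (a sketch under assumed behaviour, not any particular trial system), the allocator below generates the sequence centrally but reveals each assignment only once a participant has been enrolled, so recruiters cannot foresee or influence upcoming allocations.

```python
import random

class ConcealedAllocator:
    """Sketch of centralised, computerised allocation (++): the full
    sequence stays hidden; one assignment is released per enrolment."""

    def __init__(self, n_slots, seed=7):
        rng = random.Random(seed)
        self._sequence = [rng.choice(["intervention", "comparison"])
                          for _ in range(n_slots)]
        self._next = 0

    def enrol(self, participant_id):
        # Assignment is revealed only after enrolment is committed.
        assignment = self._sequence[self._next]
        self._next += 1
        return assignment

allocator = ConcealedAllocator(n_slots=100)
print(allocator.enrol("P001"))
```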
2.4 Were participants or investigators blind to exposure and comparison?
Were participants and investigators (those delivering or assessing the intervention) kept blind to intervention allocation? (Triple or double blinding scores ++.)
If lack of blinding is likely to cause important bias, score −.
2.5 Was the exposure to the intervention and comparison adequate?
Is reduced exposure to intervention or control related to the intervention (eg adverse effects leading to reduced compliance) or fidelity of implementation (eg reduced adherence to protocol)?
Was lack of exposure sufficient to cause important bias?
2.6 Was contamination acceptably low?
Did anyone in the comparison group receive the intervention, or vice versa?
If so, was it sufficient to cause important bias?
If a cross-over trial, was there a sufficient wash-out period between interventions?
2.7 Were other interventions similar in both groups?
Did either group receive additional interventions or have services provided in a different manner?
Were the groups treated equally by researchers or other professionals?
Was this sufficient to cause important bias?
2.8 Were all participants accounted for at study conclusion?
Were those lost to follow-up (ie dropped or lost pre-, during or post-intervention) acceptably low (ie typically <20%)?
Did the proportion dropped differ by group? For example, were drop-outs related to the adverse effects of the intervention?
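A quick arithmetic check against the ~20% rule of thumb, using hypothetical numbers; differential attrition between arms is as informative as the overall rate.

```python
def attrition(randomised, completed):
    """Proportion lost to follow-up in one arm."""
    return (randomised - completed) / randomised

# Hypothetical trial: 120 participants randomised per arm
loss_intervention = attrition(120, 90)   # 25%: above the ~20% rule of thumb
loss_comparison = attrition(120, 110)    # ~8%
# The imbalance (25% vs 8%) suggests drop-out may be related to the
# intervention itself, eg adverse effects.
print(f"{loss_intervention:.0%} vs {loss_comparison:.0%}")
```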
2.9 Did the setting reflect usual UK practice?
Did the setting in which the intervention or comparison was delivered differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) condition in a hospital rather than a community-based setting?
2.10 Did the intervention or control comparison reflect usual UK practice?
Did the intervention or comparison differ significantly from usual practice in the UK? For example, did participants receive intervention (or comparison) delivered by specialists rather than GPs? Were participants monitored more closely?
3.1 Were outcome measures reliable?
Were outcome measures subjective or objective (eg biochemically validated nicotine levels ++ vs self-reported smoking −)?
How reliable were outcome measures (eg inter- or intra-rater reliability scores)?
Was there any indication that measures had been validated (eg validated against a gold standard measure or assessed for content validity)?
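Inter-rater reliability is commonly summarised as Cohen's kappa, the proportion of agreement corrected for chance. A worked example with hypothetical ratings from two independent assessors:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical binary outcome ratings from two assessors
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
print(round(cohens_kappa(a, b), 2))  # 0.6: raw agreement is 0.8, chance is 0.5
```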
3.2 Were all outcome measurements complete?
Were all or most study participants who met the defined study outcome definitions likely to have been identified?
3.3 Were all important outcomes assessed?
Were all important benefits and harms assessed?
Was it possible to determine the overall balance of benefits and harms of the intervention versus comparison?
3.4 Were outcomes relevant?
Where surrogate outcome measures were used, did they measure what they set out to measure? (eg a study assessing impact on physical activity measures gym membership, a potentially objective outcome measure, but is that a reliable predictor of physical activity?)
3.5 Were there similar follow-up times in exposure and comparison groups?
If groups are followed for different lengths of time, then more events are likely to occur in the group followed up for longer, distorting the comparison.
Analyses can be adjusted to allow for differences in length of follow-up (eg using person-years).
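A worked example of the person-years adjustment, with hypothetical counts: raw event totals mislead when one arm is followed for longer, whereas rates per person-year remain directly comparable.

```python
def incidence_rate(events, person_years):
    """Events per person-year of follow-up."""
    return events / person_years

# Hypothetical: 200 participants per arm, but unequal follow-up
rate_intervention = incidence_rate(events=10, person_years=200 * 1.0)  # 0.050
rate_comparison = incidence_rate(events=18, person_years=200 * 2.0)    # 0.045
# Raw counts (10 vs 18) exaggerate the difference; the per-person-year
# rates (0.050 vs 0.045) are nearly identical.
print(rate_intervention, rate_comparison)
```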
3.6 Was follow-up time meaningful?
Was follow-up long enough to assess long-term benefits or harms?
Was it too long, eg such that many participants were lost to follow-up?
4.1 Were exposure and comparison groups similar at baseline? If not, were these adjusted?
Were there any differences between groups in important confounders at baseline?
If so, were these adjusted for in the analyses (eg multivariate analyses or stratification)? A stratified example is sketched below.
Were there likely to be any residual differences of relevance?
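One simple form of such adjustment is stratification. The sketch below pools a risk ratio across strata of a baseline confounder using the Mantel-Haenszel estimator; the strata and counts are hypothetical.

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel risk ratio pooled across confounder strata.
    Each stratum: (events_exposed, n_exposed, events_unexposed, n_unexposed)."""
    numerator = denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator

# Hypothetical strata of a baseline confounder (eg age bands)
strata = [(12, 100, 6, 100),   # younger participants
          (30, 100, 20, 100)]  # older participants
print(round(mantel_haenszel_rr(strata), 2))  # 1.62, adjusted for the confounder
```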
4.2 Was intention to treat (ITT) analysis conducted?
Were all participants (including those who dropped out or did not fully complete the intervention course) analysed in the groups (ie intervention or comparison) to which they were originally allocated?
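A minimal sketch of the ITT principle with hypothetical records: every participant contributes to the arm they were allocated to, whether or not they completed the intervention.

```python
# Hypothetical records: (participant, allocated_arm, completed, outcome)
records = [
    ("P1", "intervention", True, 1),
    ("P2", "intervention", False, 0),  # dropped out, still analysed as allocated
    ("P3", "comparison", True, 0),
    ("P4", "comparison", True, 1),
]

def itt_event_rate(records, arm):
    """Intention to treat: analyse by allocated arm, ignoring drop-out."""
    arm_records = [r for r in records if r[1] == arm]
    return sum(r[3] for r in arm_records) / len(arm_records)

print(itt_event_rate(records, "intervention"))  # 0.5: the drop-out is included
```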
4.3 Was the study sufficiently powered to detect an intervention effect (if one exists)?
A power of 0.8 (that is, an 80% chance of detecting an effect of a given size if one exists) is the conventionally accepted standard.
Is a power calculation presented? If not, what is the expected effect size? Is the sample size adequate?
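For orientation, a standard normal-approximation sample-size calculation for comparing two proportions at the conventional alpha = 0.05 and power = 0.80; the target proportions below are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for comparing two proportions
    (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical: powered to detect an improvement from 30% to 45%
print(n_per_group(0.30, 0.45))  # 160 participants per arm
```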
4.4 Were the estimates of effect size given or calculable?
Were effect estimates (eg relative risks, absolute risks) given or possible to calculate?
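The usual effect estimates can be read straight off a 2x2 table; a sketch with hypothetical event counts:

```python
def effect_estimates(events_int, n_int, events_comp, n_comp):
    """Absolute and relative effect estimates from a 2x2 table."""
    risk_int, risk_comp = events_int / n_int, events_comp / n_comp
    return {
        "absolute risk difference": risk_int - risk_comp,
        "relative risk": risk_int / risk_comp,
        "number needed to treat": 1 / abs(risk_int - risk_comp),
    }

# Hypothetical: 30/100 events with intervention vs 45/100 with comparison
for name, value in effect_estimates(30, 100, 45, 100).items():
    print(f"{name}: {value:.2f}")
# absolute risk difference: -0.15, relative risk: 0.67, NNT: 6.67
```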
4.5 Were the analytical methods appropriate?
Were important differences in follow-up time and likely confounders adjusted for?
If a cluster design, were analyses of sample size (and power) and effect size performed on clusters (and not individuals)? The design-effect sketch below illustrates the correction.
Were subgroup analyses pre-specified?
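Clustering is conventionally handled by inflating the individually randomised sample size by the design effect, DEFF = 1 + (m − 1) × ICC, where m is the average cluster size and ICC the intracluster correlation coefficient. The values below are hypothetical.

```python
import math

def design_effect(cluster_size, icc):
    """Design effect for a cluster trial: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical: 20 participants per cluster, ICC = 0.05
deff = design_effect(20, 0.05)  # 1.95: clustering nearly doubles the sample
n_individual = 160              # per-arm size from an individually randomised calculation
print(math.ceil(n_individual * deff))  # 312 per arm once clustering is allowed for
```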
4.6 Was the precision of intervention effects given or calculable? Were they meaningful?
Were confidence intervals or p values for effect estimates given or possible to calculate?
Were CIs wide, or were they sufficiently precise to aid decision-making? If precision is lacking, is this because the study is under-powered?
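Precision can be checked directly; for a relative risk, the standard log-transformed 95% confidence interval is exp(ln RR ± 1.96 × SE), with SE = sqrt(1/a − 1/n1 + 1/c − 1/n0). The counts below are hypothetical.

```python
import math

def rr_confidence_interval(a, n1, c, n0, z=1.96):
    """Relative risk with a 95% CI via the log transformation."""
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

# Hypothetical: 30/100 events vs 45/100 events
rr, low, high = rr_confidence_interval(30, 100, 45, 100)
print(f"RR {rr:.2f} (95% CI {low:.2f} to {high:.2f})")  # RR 0.67 (0.46 to 0.96)
# An interval this wide may still aid decisions; a much wider one,
# eg spanning well past 1, would point to an under-powered study.
```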
5.1 Are the study results internally valid (ie unbiased)?
How well did the study minimise sources of bias (eg by adjusting for potential confounders)?
Were there significant flaws in the study design?
5.2 Are the findings generalisable to the source population (ie externally valid)?
Are there sufficient details given about the study to determine if the findings are generalisable to the source population? Consider: participants, interventions and comparisons, outcomes, resource, and policy implications.