2022 Oct 5;89(2):832–842. doi: 10.1111/bcp.15530

Revisiting the inter‐rater reliability of drug treatment assessments according to the STOPP/START criteria

Naldy Parodi López 1,2, Björn Belfrage 3, Anders Koldestam 4, Johan Lönnbro 1,5, Staffan A Svensson 1,6, Susanna M Wallerstedt 1,7
PMCID: PMC10092534  PMID: 36098258

Abstract

Aims

The aim of this study is to revisit the inter‐rater reliability of drug treatment assessments according to the Screening Tool of Older Persons' Prescriptions (STOPP)/Screening Tool to Alert to Right Treatment (START) criteria.

Methods

Potentially inappropriate medications (PIMs) and potential prescribing omissions (PPOs) were independently identified by two physicians in two cohorts of older people (I: 200 hip fracture patients, median age 85 years, STOPP/START version 1; II: 302 primary care patients, median age 74 years, STOPP/START version 2). Kappa statistics were used to evaluate inter‐rater agreement.

Results

In cohort I, a total of 782 PIMs/PPOs, related to 68 (78%) out of 87 criteria, were identified by at least one assessor, 500 (64%) of which were discordantly identified by the assessors, that is, by one assessor but not the other. For four STOPP criteria, all PIMs (n = 9) were concordantly identified. In cohort II, 955 PIMs/PPOs, related to 80 (70%) out of 114 criteria, were identified, 614 (64%) of which were discordantly identified. For three STOPP criteria, all PIMs (n = 3) were concordantly identified. For no START criterion, with ≥1 PPO identified, were all assessments concordant. The kappa value for PIM/PPO identification was 0.52 in both cohorts. In cohort II, the kappa was 0.37 when criteria regarding influenza and pneumococcal vaccines were excluded. Further analysis of discordantly identified PIMs/PPOs revealed methodological aspects of importance, including the data source used and criteria wording.

Conclusions

When the STOPP/START criteria are applied in PIM/PPO research, reliability seems to be an issue not encountered in previous reliability studies.

Keywords: assessment, methodology, pharmacoepidemiology, pharmacotherapy, quality


When the STOPP/START criteria were applied by two physicians in two separate cohorts, the inter‐rater agreement was weak, in contrast to strong to almost perfect in prior reliability studies. Methodological aspects may explain the differences.



What is already known about this subject

  • The STOPP/START criteria were developed to assist physicians in day‐to‐day practice to improve drug treatment in older people.

  • The inter‐rater agreement for these criteria is reportedly strong to almost perfect, and they have gained extensive use in research to examine the medication profile of older populations.

What this study adds

  • When physicians identified PIMs/PPOs in two separate cohorts, two‐thirds were identified by one assessor but not the other.

  • Analysis of discordantly identified PIMs/PPOs, and of the design of prior reliability studies, revealed methodological aspects important for inter‐rater agreement.

  • For enhanced scientific rigour and to facilitate interpretation of PIM/PPO research, suggestions are provided.

1. INTRODUCTION

The prescribing of medication to older people may be a challenge as these patients often have several morbidities concomitantly and, consequently, are being treated with multiple drugs. In addition, they are more sensitive to drug effects due to physiological changes of ageing, such as reduced renal function. To improve prescribing performance in this age group, the Screening Tool of Older Persons' Prescriptions (STOPP)/Screening Tool to Alert to Right Treatment (START) criteria were designed. The first version was intended to be a convenient and time‐efficient tool for physicians to use in day‐to‐day practice to assess an older patient's prescription drugs in the context of his/her concurrent diagnoses. 1 In 2015, an updated version 2 was published. 2 Over the years, the STOPP/START tool has come to be extensively used in research to examine the medication profile of older populations, 3 , 4 also illustrated by the fact that the initial validation publications 1 , 2 had been cited more than 1000 times each at the date of our Scopus search on 5 August 2022. Furthermore, potentially inappropriate medications (PIMs) and potential prescribing omissions (PPOs), which, for instance, can be detected by STOPP and START, have been suggested to be included in core outcome sets to evaluate the effects of intervention studies for improved prescribing in older people. 5 , 6 , 7

The initial STOPP/START publication assessed the inter‐rater reliability using 100 cases, with two researchers achieving a kappa coefficient of 0.75 for STOPP and 0.68 for START. 1 Two additional reliability studies, from the same research group, evaluated the inter‐rater agreement of the STOPP and START criteria using 20 cases each. The first one showed excellent median kappa values: 0.93 and 0.85, respectively, when eight physicians without previous experience of using the criteria were compared with the consensus decision of the two physicians who designed STOPP/START. 8 The second one showed similar results: median kappa values ≥0.88 and ≥0.90, respectively, when 10 pharmacists were compared with the consensus decision of two extensively trained academic pharmacists. 9 As far as we are aware, there are no other studies specifically focusing on the inter‐rater reliability of these tools.

When our research group first applied the STOPP/START criteria (first version), we were surprised by the low inter‐rater agreement. Although the tool was applied by two specialist physicians well experienced in treating older patients with drugs in their everyday practice, the kappa value achieved was considerably lower than those previously reported. 10 Subsequently, when we analysed data in another study using the STOPP/START criteria version 2, 11 discordant identification of PIMs/PPOs again appeared problematic. To understand the divergence between the reliability found in our studies and the reliability previously reported, and to provide insights for future research, we performed this study with the aim of describing the identification of PIMs/PPOs according to STOPP/START in two cohorts of older patients from hospital and primary care.

2. METHODS

In this study, we used data from two patient cohorts, collected for the purpose of four prior studies. Cohort I was based on two studies investigating (i) the effects of physician‐led medication reviews on the prescribing of fracture‐preventing and fall‐risk‐increasing drugs 12 and (ii) the association between quality of drug treatment and multi‐dose drug dispensing. 10 Cohort II was based on two studies investigating (i) the association between medication reviews and adequate drug treatment management 11 and (ii) the clinical relevance of indicators of prescribing quality. 13 The cohorts are described in Figure S1 and Table 1. Cohort I included 200 older hip fracture patients, recruited to a randomised controlled trial (RCT) in 2009. 12 Cohort II included 302 consecutive older patients requiring a planned physician consultation in either of two primary health centres over a three‐week period in 2017. 11

TABLE 1.

Characteristics of studied cohorts

| Characteristic | Cohort I | Cohort II |
| --- | --- | --- |
| Inclusion criteria | | |
| Age | ≥65 years | ≥65 years |
| Recruitment | Had hip fracture surgery | Planned physician consultation in primary care |
| Informed consent | Required | Waived |
| Recruitment year | 2009 | 2017 |
| Patients, n | 200 | 302 |
| Age, median years (range) | 85 (65–98) | 74 (65–99) |
| Female sex, n (%) | 133 (67) | 178 (59) |
| Multi‐dose drug dispensing, n (%) | 100 (50) | 33 (11) |
| Nursing home resident, n (%) | 60 (30) | 31 (10) |
| Regular drugs, median number (range) | 7 (0–21) | 5 (0–17) |
| Criteria applied | STOPP/START v1 | STOPP/START v2 |

Abbreviations: START, Screening Tool to Alert to Right Treatment; STOPP, Screening Tool of Older Persons' Prescriptions; v, version.

2.1. PIM/PPO identification

In both cohorts, the identification of PIMs/PPOs was performed by physicians who had access to clinical data from the individual patient's medical records, including primarily unstructured information in free text, organised by date. The volume of the medical records varied greatly, the printouts used in cohort II ranging, for example, from a few pages to well over 100 pages for some patients. At the time of the assessments, drugs within each STOPP/START criterion versions 1 and 2 were available in Swedish health care.

In cohort I, a final‐year resident in geriatrics (A.K.), independently and for every patient, identified PIMs/PPOs according to the STOPP/START criteria, as part of a mandatory project before specialist certification. Thereafter, independent assessments were also performed by a specialist in family medicine (B.B.). Eventually, the assessors had a consensus discussion for a joint decision on PIMs/PPOs. The assessments were performed in 2012 and based on the first version of STOPP/START, including 65 STOPP and 22 START criteria. The assessors had no consensus discussion regarding criteria definitions before or during the application. The identification of PIMs/PPOs was based on information in the electronic medical records of the hospital, where the patients had had their hip fracture surgery, as well as data collected in the original RCT. The information available included, for instance, sociodemographic data, diagnoses, drug prescriptions, laboratory tests, as well as the presence of fall risk factors. 14

In cohort II, one specialist in family medicine (N.P.L.) and one specialist in family medicine and clinical pharmacology (S.A.S.) independently identified PIMs/PPOs according to the STOPP/START criteria, to ensure a systematic screening for potentially inappropriate prescribing before arriving at an overall assessment of the adequacy of each patient's drug treatment. 11 In the process, the physicians first assessed eight cases and discussed challenges regarding the overall medical assessment that was in focus. Thereafter, they assessed the remaining cases without further discussions. Some criteria definitions emerged during the initial discussion, for instance regarding vaccinations, but there was no systematic discussion regarding the criteria before or during the application. No consensus discussion was performed regarding the identified PIMs/PPOs. The assessments were performed in 2018–2019 and based on the second version of STOPP/START, including 80 STOPP and 34 START criteria. 2 The identification of PIMs/PPOs was based on printouts from the medical records in primary care, covering the 2.5 years preceding the consultation. This information included, for instance, sociodemographic data, diagnoses, drug prescriptions, laboratory tests, hospital discharge records, vaccinations and interaction alerts integrated in the medical records system.

2.2. Analyses

Descriptive statistics were calculated using SPSS Statistics for Windows, version 27.0 (IBM Corp., Armonk, NY, USA). The total number of PIMs and/or PPOs identified by at least one assessor was summed, as well as the number of PIMs and/or PPOs discordantly identified, that is, by one assessor but not the other. Inter‐rater agreement was evaluated using kappa statistics, and interpreted as none (kappa ≤ 0.20), minimal (0.21–0.39), weak (0.40–0.59), moderate (0.60–0.79), strong (0.80–0.89) and almost perfect (≥0.90). 15 In a sensitivity analysis, as drug treatment develops over time and criteria may become outdated, we calculated kappa values after excluding criteria completely absent from both cohorts. In cohort II, we also calculated the kappa value without the START criteria concerning influenza and pneumococcal vaccinations, which are not present in the first version. In addition, we excluded the first two STOPP version 2 criteria because they are implicit rather than explicit, that is, (i) any drug prescribed without an evidence‐based clinical indication, and (ii) any drug prescribed beyond the recommended duration, where treatment duration is well defined. We also calculated a kappa value for these criteria alone.
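The interpretation bands quoted above can be expressed as a small lookup. The following Python sketch is illustrative only and is not the authors' code (the analyses were run in SPSS). The function name `interpret_kappa` is hypothetical, and treating each band's lower limit as a continuous cut-off (so that a value such as 0.395 counts as minimal) is an assumption, since the published bands leave small gaps between them.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the agreement bands quoted in the text.

    Assumption: each band's lower limit is used as a continuous cut-off.
    """
    if kappa >= 0.90:
        return "almost perfect"
    if kappa >= 0.80:
        return "strong"
    if kappa >= 0.60:
        return "moderate"
    if kappa >= 0.40:
        return "weak"
    if kappa > 0.20:
        return "minimal"
    return "none"


# Kappa values reported in this study: overall, without vaccination
# criteria (cohort II), and for the two implicit STOPP criteria alone.
for value in (0.52, 0.37, 0.05):
    print(value, interpret_kappa(value))
```

With these cut-offs, 0.52 falls in the weak band, 0.37 in the minimal band and 0.05 in the none band.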

Finally, we analysed the context contributing to discordant identification of PIMs/PPOs in cohort II, for the STOPP/START criteria that were consistently identified by one assessor but not the other in at least 10 patients. Based on this analysis, as well as a scrutiny of the design of prior reliability studies, we summarised suggestions for future PIM/PPO research.

3. RESULTS

Using the first version of STOPP/START in cohort I, ≥1 PIMs/PPOs according to 68 (78%) of 87 criteria were identified by at least one assessor. A total of 782 PIMs/PPOs were identified in 190 (95%) patients, 282 (36%) of which were concordantly identified by both assessors (Table 2). Conversely, both assessors classified 16 618 PIMs/PPOs as absent. The kappa value between the two assessors was 0.52. In all, 555 PIMs/PPOs were agreed on in the consensus process.

TABLE 2.

Number of PIMs/PPOs, according to the STOPP/START criteria versions 1 and 2, identified by at least one assessor, and number of PIMs/PPOs discordantly identified, that is, by only one of the assessors

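| Criteria | Cohort I a (STOPP/START v1): PIMs/PPOs identified by ≥1 assessor | Cohort I a: PIMs/PPOs discordantly identified | Cohort I a: Kappa | Cohort II b (STOPP/START v2): PIMs/PPOs identified by ≥1 assessor | Cohort II b: PIMs/PPOs discordantly identified | Cohort II b: Kappa |
| --- | --- | --- | --- | --- | --- | --- |
| STOPP/START | 782 | 500 (64) | 0.52 | 955 | 614 (64) | 0.52 |
| STOPP | 391 | 248 (63) | 0.53 | 460 | 324 (70) | 0.45 |
| START | 391 | 252 (64) | 0.49 | 495 | 290 (59) | 0.57 |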

Note: Data are presented as numbers (percentages).

Abbreviations: PIM, potentially inappropriate medication; PPO, potential prescribing omission; START, Screening Tool to Alert to Right Treatment; STOPP, Screening Tool of Older Persons' Prescriptions; v, version.

a Assessed by one specialist in family medicine and one geriatrician (n = 200).

b Assessed by one specialist in family medicine and one specialist in family medicine/clinical pharmacology (n = 302).

For one STOPP criterion and one START criterion in cohort I, ≥10 PIMs/PPOs were found that were discordantly identified in all cases (Table 3). These criteria concerned long‐term opiates in those with dementia without an appropriate underlying reason (n = 10) and absence of an angiotensin converting enzyme (ACE) inhibitor after myocardial infarction (n = 15). An additional 25 STOPP/START criteria resulted in 75 PIMs/PPOs that were all discordantly identified, and for 37 criteria, the proportion of discordant assessments ranged between 6% and 94%. For four STOPP criteria, all PIMs were concordantly identified by both assessors: glibenclamide in Type 2 diabetes mellitus (n = 5), a tricyclic antidepressant (TCA) with an opiate or calcium channel blocker (n = 2), a beta‐blocker in combination with verapamil (n = 1), and alfuzosin in a man with frequent urinary incontinence (n = 1). For no START criterion with ≥1 PPO identified in cohort I were all assessments concordant.

TABLE 3.

PIMs/PPOs, according to STOPP/START, identified by at least one assessor in ≥10 individuals and discordantly identified in ≥50% of the cases, that is, by only one of the assessors, in either of the cohorts

Columns: criterion (with STOPP/START version); type; PIMs/PPOs identified by ≥1 assessor, n (% of all patients), in cohort I a and cohort II b; PIMs/PPOs discordantly identified, n (% of all identified), in cohort I a and cohort II b.

Calcium and vitamin D supplement in patients with known osteoporosis (previous fragility fracture, acquired dorsal kyphosis) (v1)

Vitamin D and calcium supplement in patients with known osteoporosis and/or previous fragility fracture(s) and/or bone mineral density T‐scores more than −2.5 in multiple sites (v2)

START 133 (67) 9 (3) 86 (65) 9 (100)

Benzodiazepines (sedative, may cause reduced sensorium, impair balance) in those prone to falls (v1)

Benzodiazepines (sedative, may cause reduced sensorium, impair balance) (v2)

STOPP 76 (38) 24 (8) c 47 (62) 4 (17)
Aspirin with no history of coronary, cerebral or peripheral vascular symptoms or occlusive event (v1) STOPP 35 (18) N/A 25 (71) N/A

Aspirin or clopidogrel with a documented history of atherosclerotic coronary, cerebral or peripheral vascular disease in patients with sinus rhythm (v1)

Antiplatelet therapy (aspirin or clopidogrel or prasugrel or ticagrelor) with a documented history of coronary, cerebral or peripheral vascular disease (v2)

START 32 (16) 13 (4) 24 (75) 10 (77)
Statin therapy in diabetes mellitus if coexisting major cardiovascular risk factors present (v1) START 22 (11) N/A 14 (64) N/A

Beta‐blocker with chronic stable angina (v1)

Beta‐blocker with ischaemic heart disease (v2)

START 18 (9) 20 (7) 13 (72) 16 (80)
Vasodilator drugs with persistent postural hypotension, that is, recurrent >20 mmHg drop in systolic blood pressure (risk of syncope, falls) (v1) STOPP 18 (9) N/A 15 (83) N/A

Statin therapy with a documented history of coronary, cerebral or peripheral vascular disease, where the patient's functional status remains independent for activities of daily living and life expectancy is greater than 5 years (v1)

Statin therapy with a documented history of coronary, cerebral or peripheral vascular disease, unless the patient's status is end‐of‐life or age is > 85 years (v2)

START 17 (9) 23 (8) 15 (88) 19 (83)

Neuroleptic drugs (may cause gait dyspraxia, parkinsonism) in those prone to falls (v1)

Neuroleptics as hypnotics, unless sleep disorder is due to psychosis or dementia (risk of confusion, hypotension, extra‐pyramidal side effects, falls) (v2)

STOPP 17 (9) 1 (0.3) 15 (88) 1 (100)

Aspirin with a past history of peptic ulcer disease without histamine H2‐receptor antagonist or proton pump inhibitor (risk of bleeding) (v1)

Aspirin with a past history of peptic ulcer disease without concomitant PPI (risk of recurrent peptic ulcer) (v2)

STOPP 16 (8) 2 (0.7) 15 (94) 2 (100)

ACE inhibitor following acute myocardial infarction (v1)

ACE inhibitor with systolic heart failure and/or documented coronary artery disease (v2)

START 15 (8) 29 (10) 15 (100) 26 (90)
Long‐term (i.e., >1 month), long‐acting benzodiazepines, for example, chlordiazepoxide, fluazepam, nitrazepam, chlorazepate, and benzodiazepines with long‐acting metabolites, for example, diazepam (risk of prolonged sedation, confusion, impaired balance, falls) (v1) STOPP 15 (8) N/A 13 (87) N/A
Antiplatelet therapy in diabetes mellitus with coexisting major cardiovascular risk factor present (hypertension, hypercholesterolemia, smoking history) (v1) START 15 (8) N/A 9 (60) N/A
Metformin with type 2 diabetes ± metabolic syndrome, in the absence of renal impairment (renal impairment if serum creatinine >150 μmol/L, or estimated GFR <50 mL/min) (v1) START 12 (6) N/A 6 (50) N/A

Regular inhaled β2 agonist or anticholinergic agent for mild to moderate asthma or COPD (v1)

Regular inhaled β2 agonist or antimuscarinic bronchodilator (e.g., ipratropium, tiotropium) for mild to moderate asthma or COPD (v2)

START 10 (5) 13 (4) 8 (80) 11 (85)
Long‐term opiates in those with dementia unless indicated for palliative care or management of moderate/severe chronic pain syndrome (risk of exacerbation of cognitive impairment) (v1) STOPP 10 (5) N/A 10 (100) N/A
Any drug prescribed without an evidence‐based clinical indication (v2) STOPP N/A 93 (24) d N/A 83 (89) e
Pneumococcal vaccine at least once after age 65 according to national guidelines (v2) START N/A 63 (21) N/A 54 (86)
Any drug prescribed beyond the recommended duration, where treatment duration is well defined (v2) STOPP N/A 45 (14) f N/A 44 (98) g

Loop diuretic for dependent ankle oedema only, that is, no clinical signs of heart failure (no evidence of efficacy, compression hosiery usually more appropriate) (v1)

Loop diuretic for dependent ankle oedema without clinical, biochemical evidence or radiological evidence of heart failure, liver failure, nephrotic syndrome or renal failure (leg elevation and/or compression hosiery usually more appropriate) (v2)

STOPP 31 (16) 25 (8) 10 (32) 18 (72)

PPI for peptic ulcer disease at full therapeutic dosage for >8 weeks (dose reduction or earlier discontinuation indicated) (v1)

PPI for uncomplicated peptic ulcer disease or erosive peptic oesophagitis at full therapeutic dosage for >8 weeks (dose reduction or earlier discontinuation indicated) (v2)

STOPP 9 (5) 19 (6) 8 (89) 15 (79)
Acetylcholinesterase inhibitor (e.g., donepezil, rivastigmine, galantamine) for mild–moderate Alzheimer's dementia or Lewy body dementia (rivastigmine) (v2) START N/A 18 (6) N/A 17 (94)
Vitamin D supplement in older people who are housebound or experiencing falls or with osteopenia (bone mineral density T‐score is >−1.0 but <−2.5 in multiple sites) (v2) START N/A 16 (5) N/A 16 (100)

Anticholinergic antispasmodic drugs with chronic constipation (risk of exacerbation of constipation) (v1)

Drugs likely to cause constipation (e.g., antimuscarinic/anticholinergic drugs, oral iron, opioids, verapamil, aluminium antacids) in patients with chronic constipation where non‐constipating alternatives are available (risk of exacerbation of constipation) (v2)

STOPP 0 15 (4) h 0 13 (87) i

NSAID with moderate–severe hypertension (risk of exacerbation of hypertension) and NSAID with heart failure (risk of exacerbation of heart failure) (v1)

NSAID with severe hypertension (risk of exacerbation of hypertension) or severe heart failure (risk of exacerbation of heart failure) (v2)

STOPP 6 (3) j 14 (4) k 5 (83) 14 (100) k
Bone anti‐resorptive or anabolic therapy (e.g., bisphosphonate, strontium ranelate, teriparatide, denosumab) in patients with documented osteoporosis, where no pharmacological or clinical status contraindication exists (bone mineral density T‐scores ≥ 2.5 in multiple sites) and/or previous history of fragility fracture(s) (v2) START N/A 13 (4) N/A 13 (100)
Long‐acting opioids without short‐acting opioids for breakthrough pain (risk of persistence of severe pain) (v2) STOPP N/A 13 (4) N/A 9 (69)

Loop diuretic as first‐line monotherapy for hypertension (safer, more effective alternatives available) (v1)

Loop diuretic as first‐line treatment for hypertension (safer, more effective alternatives available) (v2)

STOPP 6 (3) 12 (4) 6 (100) 11 (92)
ACE inhibitors or angiotensin receptor blockers in patients with hyperkalaemia (v2) STOPP N/A 10 (3) N/A 8 (80)
Appropriate beta‐blocker (bisoprolol, nebivolol, metoprolol or carvedilol) with stable systolic heart failure (v2) START N/A 10 (3) N/A 10 (100)

Abbreviations: ACE, angiotensin converting enzyme; COPD, chronic obstructive pulmonary disease; GFR, glomerular filtration rate; N/A, not applicable; NSAID, non‐steroidal anti‐inflammatory drugs; PIM, potentially inappropriate medication; PPI, proton pump inhibitor; PPO, potential prescribing omission; START, Screening Tool to Alert to Right Treatment; STOPP, Screening Tool of Older Persons' Prescriptions; v, version.

a Assessed by one specialist in family medicine and one geriatrician (n = 200).

b Assessed by one specialist in family medicine and one specialist in family medicine/clinical pharmacology (n = 302).

c 23 patients had 24 PIMs.

d 71 patients had 93 PIMs.

e 66 patients had 83 PIMs.

f 41 patients had 45 PIMs.

g 40 patients had 44 PIMs.

h 11 patients had 15 PIMs.

i 9 patients had 13 PIMs.

j 5 patients had 6 PIMs.

k 13 patients had 14 PIMs.

Using the second version of STOPP/START in cohort II, ≥1 PIMs/PPOs according to 80 (70%) of 114 criteria were identified by at least one assessor. A total of 955 PIMs/PPOs were identified (35 of which were duplicates within specific criteria) in 284 (94%) patients, 341 (36%) of which were concordantly identified by both assessors (Table 2). Conversely, both assessors classified 33 508 PIMs/PPOs as absent. The kappa value between the two assessors was 0.52. In all, 172 (50%) of the concordantly identified PIMs/PPOs stemmed from the START criterion regarding absence of annual influenza vaccination. When the START criteria regarding influenza and pneumococcal vaccines were excluded, the kappa value was 0.37. When the two implicit criteria in STOPP version 2, that is, drug without an evidence‐based indication and drug prescribed beyond the recommended duration, were evaluated alone, the kappa was 0.05, with 127 out of 138 identified PIMs, in 93 patients, discordantly identified.
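The counts reported for cohort II can be cross-checked with simple arithmetic. The Python sketch below is illustrative only. It assumes that every patient-criterion pair counts as one assessment and that each of the 35 duplicates adds one further assessment; the text does not state this explicitly, and the variable names are hypothetical.

```python
# Cohort II figures as reported in the text.
patients = 302
criteria = 114
duplicates = 35             # additional PIMs within specific criteria
identified = 955            # PIMs/PPOs identified by at least one assessor
discordant = 614            # identified by one assessor but not the other
influenza_concordant = 172  # concordant PPOs from the influenza criterion

# Assumption: one assessment per patient-criterion pair, plus one per duplicate.
assessments = patients * criteria + duplicates
both_absent = assessments - identified
concordant = identified - discordant

print(both_absent)                                        # classified as absent by both
print(concordant, round(100 * concordant / identified))   # concordant n and %
print(round(100 * influenza_concordant / concordant))     # influenza share, %
```

Under these assumptions the figures agree with the text: 33 508 assessments classified as absent by both assessors, 341 (36%) concordantly identified PIMs/PPOs, and 50% of those stemming from the influenza vaccination criterion.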

For one STOPP and three START criteria in cohort II, ≥10 PIMs/PPOs were found that were discordantly identified in all cases (Table 3). These criteria concerned non‐steroidal anti‐inflammatory drugs (NSAIDs) in severe hypertension or severe heart failure (n = 14), absence of vitamin D supplement if housebound or experiencing falls or with osteopenia (n = 16), absence of bone‐active treatment in osteoporosis (n = 13), and absence of beta‐blocker in heart failure (n = 10). An additional 42 STOPP/START criteria resulted in 124 PIMs/PPOs that were all discordantly identified, and for 31 criteria, the proportion of discordant assessments ranged between 8% and 98%. For three STOPP criteria, all PIMs were concordantly identified by both assessors: a TCA as first‐line antidepressant treatment (n = 1), a beta‐blocker in combination with verapamil (n = 1), and alfuzosin in orthostatic hypotension (n = 1). For no START criterion with ≥1 PPO identified in cohort II were all assessments concordant.

A total of 18 STOPP and one START criteria (22% of all), version 1, were not identified in any patient. In the second version of the criteria, 30 STOPP and four START criteria (30% of all) were not identified in any patient (Table S1). When these criteria were excluded in a sensitivity analysis, that is, those that did not identify any PIM or PPO in the cohorts, the kappa value between the assessors was 0.51 in both cohorts I and II.

For four STOPP/START version 2 criteria identified in at least 10 patients, all PIMs/PPOs were identified by one assessor only (S.A.S.). Contextual factors that contributed to the complete discordance between the assessors in these cases are detailed in Table S2. For all criteria combined, one assessor identified 294 (A.K.), 206 (B.B.), 178 (N.P.L.) and 436 (S.A.S.) PIMs/PPOs not identified by the other. Summarising reasons underlying discordantly identified PIMs/PPOs, and our results in relation to prior reliability studies, several methodological aspects emerged, including, for instance, the importance of data source used (Table 4).
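The reported counts are sufficient to illustrate how a kappa value of this size can arise. The sketch below computes Cohen's kappa from a 2×2 table of pooled patient-criterion assessments. It is not the authors' SPSS analysis; pooling all assessments into a single table per cohort is an assumption, and the function name `cohen_kappa` is hypothetical. The cell counts are taken from the text: concordantly identified PIMs/PPOs, PIMs/PPOs identified by each assessor only, and assessments classified as absent by both.

```python
def cohen_kappa(both_yes: int, only_a: int, only_b: int, both_no: int) -> float:
    """Cohen's kappa for two raters and a binary rating, from 2x2 cell counts."""
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n            # raw agreement
    a_yes = both_yes + only_a                      # positives, assessor A
    b_yes = both_yes + only_b                      # positives, assessor B
    # Agreement expected by chance, from the marginal totals.
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / n**2
    return (observed - expected) / (1 - expected)


# Cohort I: 282 concordant, 294 by A.K. only, 206 by B.B. only, 16 618 absent for both.
kappa_i = cohen_kappa(282, 294, 206, 16618)
# Cohort II: 341 concordant, 178 by N.P.L. only, 436 by S.A.S. only, 33 508 absent for both.
kappa_ii = cohen_kappa(341, 178, 436, 33508)

print(round(kappa_i, 2), round(kappa_ii, 2))
```

Under the pooling assumption both values round to 0.52, in line with the kappa reported for each cohort. Raw agreement exceeds 97% in both cohorts because most criteria are absent in most patients, which is why the chance-corrected kappa is far lower than the raw agreement.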

TABLE 4.

Issues related to reliability of PIM/PPO research, and pertinent suggestions for identification of PIMs/PPOs and criteria construction

| Area | Issue | Suggestions |
| --- | --- | --- |
| Identification | Overall weak inter‐rater agreement in two separate cohorts | When PIMs/PPOs are used as outcomes in a study, involve at least two assessors for independent identification, followed by a consensus discussion. Report kappa values between the assessors. |
| | Specifically prepared datasets, including diagnoses, compiled from medical records, used as data source for assessments of reliability | Use patient data not specifically prepared for the purpose to identify PIMs/PPOs. |
| | Patient cases specifically selected for PIM/PPO identification, used for assessments of reliability | Avoid selecting patients. |
| | The START criterion regarding annual influenza vaccination constituted half of all concordantly identified PIMs/PPOs | Perform overall analyses with and without vaccination criteria. |
| | Differing understanding of criteria | Scrutinise the wording of used criteria, decide on a joint interpretation, and describe the approach decided upon explicitly. |
| Criteria construction | Uncertainties regarding the indication | Be explicit about criteria definitions and avoid complex criteria: restrict criteria to one indication at a time; be specific and avoid vague wordings like ‘severe’, for example, regarding blood pressure level or UCG findings required; define required event(s) to fulfil criteria regarding fall risk, for example, one or more falls over the last year; define relevant time period for diagnostic examinations, for example, DXA no more than two years ago. |
| | No inter‐rater agreement in identification of the two implicit STOPP criteria | Avoid implicit criteria in sets of explicit criteria. |
| | Methodology for evaluation of reliability | Calculate kappa between experimental assessors, not between experimental assessors and an expert consensus decision. Encourage other research groups to evaluate reliability. |

Abbreviations: DXA, dual‐energy X‐ray absorptiometry; PIM, potentially inappropriate medication; PPO, potential prescribing omission; UCG, ultrasound cardiography; START, screening tool to alert to right treatment; STOPP, screening tool of older persons' prescriptions.
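Two of the suggestions above, reporting kappa between the assessors and repeating the analysis with and without vaccination criteria, can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function `kappa_for`, the criterion labels and the toy ratings are all hypothetical, and the assessments are assumed to be stored as one pair of yes/no ratings per patient-criterion combination.

```python
from typing import Dict, Iterable, Tuple

# (patient id, criterion label) -> (assessor A identified?, assessor B identified?)
Ratings = Dict[Tuple[str, str], Tuple[bool, bool]]


def kappa_for(ratings: Ratings, exclude: Iterable[str] = ()) -> float:
    """Cohen's kappa over all patient-criterion pairs, optionally excluding criteria.

    Not handled in this sketch: empty input, or chance agreement equal to 1.
    """
    excluded = set(exclude)
    pairs = [r for (_, criterion), r in ratings.items() if criterion not in excluded]
    n = len(pairs)
    both_yes = sum(a and b for a, b in pairs)
    both_no = sum(not a and not b for a, b in pairs)
    a_yes = sum(a for a, _ in pairs)
    b_yes = sum(b for _, b in pairs)
    observed = (both_yes + both_no) / n
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: four patients, three criteria.
toy_ratings: Ratings = {
    ("p1", "influenza_vaccine"): (True, True),
    ("p2", "influenza_vaccine"): (True, True),
    ("p3", "influenza_vaccine"): (False, False),
    ("p4", "influenza_vaccine"): (True, True),
    ("p1", "beta_blocker"): (True, False),
    ("p2", "beta_blocker"): (False, False),
    ("p3", "beta_blocker"): (False, True),
    ("p4", "beta_blocker"): (False, False),
    ("p1", "benzodiazepine"): (True, True),
    ("p2", "benzodiazepine"): (False, False),
    ("p3", "benzodiazepine"): (True, False),
    ("p4", "benzodiazepine"): (False, False),
}

print(round(kappa_for(toy_ratings), 2))
print(round(kappa_for(toy_ratings, exclude={"influenza_vaccine"}), 2))
```

In the toy data the vaccination criterion is rated identically by both assessors, so removing it lowers kappa from 0.50 to about 0.14. This mirrors, in miniature, the drop reported for cohort II when the vaccination criteria were excluded.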

4. DISCUSSION

Our revisit of the inter‐rater agreement regarding the identification of PIMs/PPOs according to the STOPP/START criteria suggests that reliability may be an issue that, to the best of our knowledge, has not previously been noted. Indeed, only about one in three PIMs/PPOs were concordantly identified by two specialist physicians, and two‐thirds were identified by one assessor but not the other. Furthermore, the kappa values, of equal magnitude in two separate cohorts assessed by two separate pairs of physicians, suggest weak inter‐rater agreement. When the START criteria regarding influenza and pneumococcal vaccines, which appear only in the second version of the criteria, were excluded, the kappa value was even lower, suggesting minimal agreement.

Our results diverge from those of previous reliability studies of these criteria, where the inter‐rater agreement has reportedly been strong to almost perfect among physicians 8 as well as pharmacists. 9 Below, we outline methodological differences between our study and prior ones that may explain the different estimates of reliability.

First, our assessors applied the criteria to patient data not specifically prepared for the purpose of examining reliability – our main source was unstructured information available in patients' medical records. In contrast, the initial STOPP/START publication used prepared datasets where the criteria were applied to clinical information abstracted from case notes. 1 Similarly, in the two subsequent reliability studies, the criteria were either applied to datasets compiled from chart review and patient and/or carer interview, 8 or to datasets compiled from medical records. 9 Such preparation of datasets before they reach the assessors may conceal, for instance, uncertainties regarding diagnoses and the health condition of a patient, as well as reasonable doubts regarding the drug treatment per se. In addition, when information is filtered for a specific purpose, medical conditions outside the scope of this purpose may not be captured. In fact, part of the decision making takes place in the very process of categorising patients as having a diagnosis or not, and data extraction will thus be a matter of concern. The significance of the information source, as a factor contributing to the divergent reliability results, is supported by our analysis of the context of the four most frequent, completely discordant assessments in cohort II. Here, we found that uncertainty regarding the drug indication was the most common reason underlying identification of PPOs by one assessor only. Furthermore, as we have previously reported, 11 the most common action suggested, when drug treatment in an older patient in primary care is considered inadequately managed, is to search for additional information in the medical records to be able to make a decision regarding the initiation or withdrawal of a specific drug. 11

Second, selecting cases specifically for the purpose of examining reliability may have implications for the assessments. In our study, no selection based on PIM/PPO incidence was performed. In contrast, both prior reliability studies used specifically selected cases. In one, the cases were selected from an RCT to represent patients with complex comorbidities and an appreciable incidence of PIMs/PPOs. 8 In the other, the cases were randomly selected from those in a study cohort that were identified to have at least one PIM or at least one PPO. 9 Therefore, the assessors could expect to find PIMs/PPOs in every patient. Indeed, a median of two PIMs and one PPO per patient were identified in the prepared cases, with up to five PIMs and four PPOs in a single patient. In our cohort II, on the other hand, a median of one PIM and one PPO was found (including those discordantly identified), with up to 9 PIMs and 11 PPOs identified in a single patient. Furthermore, for 18 patients, no PIMs/PPOs at all were identified (data not shown). As most criteria were absent in most patients, one could also speculate that the reliability may be overestimated when only 20 cases are assessed, as was the case in the two prior reliability studies. 8 , 9 Nevertheless, our sensitivity analysis revealed that the kappa value was robust even when these undetected criteria were excluded.

Third, vague wording of criteria may contribute to discordant identification of PIMs/PPOs. This explanation is supported by our analysis of the STOPP criterion regarding NSAIDs in severe hypertension or severe heart failure. Indeed, the interpretation of the severity of these conditions, which is not further specified in the criterion, differed between the assessors. Regarding wording, it may also be confusing that the STOPP criterion regarding acetylsalicylic acid (ASA) in patients with a past history of peptic ulcer without a proton pump inhibitor (PPI) could just as well be seen as a START criterion; the benefits of ASA, like those of any other treatment, must be weighed against the risks, and extensive use of PPIs may also be problematic. The problem of criteria wording could perhaps have been resolved by consensus discussions early in our assessment process. Nevertheless, such discussions would primarily contribute to increased inter‐rater agreement locally, not necessarily making the assessments more comparable between research groups. To facilitate interpretation of PIM/PPO results, the outcome of initial consensus discussions could therefore preferably be described in detail. One may also speculate that uncertainties related to wording of criteria could be more easily resolved when originators of the tools are involved. Indeed, although few details of the assessment process were provided in the first reliability study, 1 the two others described that the assessments were preceded either by a teleconference to resolve any difficulties with interpretation before the application 8 or by individual telephone calls and the provision of detailed instruction on how to apply the criteria as well as two example cases. 9 For scientific rigour and to allow comparisons between studies, however, unequivocal criteria definitions are preferable. In fact, the usefulness, comprehensiveness and relevance of the STOPP/START tool have been questioned in a qualitative study. 16

Fourth, the inclusion of complex explicit criteria that require deep scrutiny of the medical records may be particularly problematic in patients with multiple chronic conditions and large amounts of information recorded in the medical records. For instance, the START criterion regarding vitamin D supplement in older people who are housebound or experiencing falls or with osteopenia required exhaustive searches within the free text of the medical records, looking for any mention of a previous dual‐energy X‐ray absorptiometry (DXA) or fall episodes. Such evaluations are time‐consuming and might lead to differences in the recognition or interpretation of the criteria between the assessors. Indeed, problematic issues related to complexity have been outlined in a previous study focusing on the development of software applications based on the STOPP/START tools. 17

Fifth, our findings indicate that implicit criteria are even more problematic in terms of reliability. When the inter‐rater agreement of the two implicit STOPP criteria (version 2) was evaluated, a very low kappa value was encountered, representing virtually no agreement. Furthermore, vaccine criteria may introduce a bias as they may contribute many PPOs. Our results suggest that this may be true in particular for influenza vaccination, and that inclusion of this criterion may mask potential reliability issues. Our results consequently support the exclusion of STOPP version 2 implicit criteria and START version 2 vaccine criteria, a procedure that has been applied in previous research. 18
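The masking effect described above can be illustrated with a minimal sketch. It assumes Cohen's kappa for two assessors and binary ratings (1 = PIM/PPO identified, 0 = not identified). The function names, criterion labels and ratings are hypothetical and are not taken from the study data.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) ratings."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance, given each assessor's marginal rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return float("nan")  # kappa is undefined when neither assessor varies
    return (observed - expected) / (1 - expected)


def kappa_excluding(assessments, excluded=()):
    """Kappa over (criterion, rating_a, rating_b) rows, skipping excluded criteria."""
    kept = [(ra, rb) for crit, ra, rb in assessments if crit not in excluded]
    return cohen_kappa([ra for ra, _ in kept], [rb for _, rb in kept])


# Hypothetical data: a vaccine criterion with full agreement and a
# vaguely worded criterion with frequent disagreement.
assessments = (
    [("influenza_vaccine", 1, 1)] * 4
    + [("influenza_vaccine", 0, 0)] * 2
    + [("nsaid_severe_hypertension", 1, 0), ("nsaid_severe_hypertension", 0, 1)]
    + [("nsaid_severe_hypertension", 0, 0)] * 3
    + [("nsaid_severe_hypertension", 1, 1)]
)

print(round(kappa_excluding(assessments), 2))
print(round(kappa_excluding(assessments, excluded={"influenza_vaccine"}), 2))
```

In this constructed example, the pooled kappa drops once the easily agreed criterion is removed, which is the pattern a sensitivity analysis of this kind is meant to expose.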

Sixth, it cannot be excluded that professional attitudes may affect the identification of PIMs/PPOs. In our study, most unique PIMs identified by one assessor only originated from a physician also specialised in clinical pharmacology. In this specialty, pharmacovigilance is a prominent part of daily work, 19 and the common approach, for instance in assessing individual case safety reports, is to be better safe than sorry, that is, to collect all potential safety issues and then make an assessment of the medical significance.

Finally, comparing the identification of PIMs/PPOs by experimental assessors only against expert consensus decisions, as performed in both prior studies focusing on reliability, 8 , 9 may involve a risk of overestimating reliability of the tools in research. As several assessors were involved, calculating kappa between all assessors would have been preferable. Furthermore, the exclusion of cases where the experts in consensus did not identify any PIM or PPO, as performed in one of the reliability studies, 9 may disregard discordant assessments where experimental assessors had identified PIMs/PPOs not identified by the experts. Indeed, all assessors, not only experts, may contribute insights, illustrated in our cohorts by the fact that all physicians identified PIMs/PPOs not identified by the other. Furthermore, subsequent scientific use, following initial reliability studies, can be expected to include independent researchers and not an expert consensus decision for comparison.
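One possible way to calculate agreement between all assessors at once, rather than against a consensus reference, is Fleiss' kappa. The sketch below is an assumption about how this could be done and is not a method used in this study or in the prior ones. It handles binary ratings only, and the function name and example ratings are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for binary ratings.

    ratings: list of rows, one per assessed item (e.g. a patient-criterion
    pair); each row holds one 0/1 rating per assessor.
    """
    if not ratings:
        raise ValueError("at least one rated item is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    total_pos = 0
    agreement_sum = 0.0
    for row in ratings:
        pos = sum(row)
        neg = n_raters - pos
        total_pos += pos
        # Share of assessor pairs that agree on this item.
        agreement_sum += (pos * pos + neg * neg - n_raters) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / len(ratings)          # mean observed agreement
    p_pos = total_pos / (len(ratings) * n_raters)  # overall share of positive ratings
    p_e = p_pos ** 2 + (1 - p_pos) ** 2           # agreement expected by chance
    if p_e == 1:
        return float("nan")  # undefined when all ratings are identical
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings from three assessors on six items.
example = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
print(fleiss_kappa(example))
```

Reporting the kappa for each pair of assessors is an alternative that also shows whether one assessor drives the disagreement.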

As reliability seems to be an issue when the STOPP/START criteria are applied, our suggestions for identification of PIMs/PPOs could be useful. When used as outcome measures, several assessors could agree on criteria definitions, independently identify PIMs/PPOs, and then jointly decide in consensus which ones fulfil each criterion. An alternative strategy could perhaps be to involve two independent assessors, and to focus on PIMs/PPOs that are concordantly identified. However, our results in cohort I suggest that a substantial number of PIMs/PPOs would probably be missed with such a procedure. On the other hand, the concordantly identified PIMs/PPOs may represent the most obvious ones, likely to be identified by other research groups as well. Whichever alternative is chosen, the reporting of kappa statistics will be crucial as this information illustrates the extent of subjectivity of the results. Nonetheless, considering that the clinical relevance of STOPP/START and other well‐known criteria has been shown to be limited, 13 , 20 the value of extensive PIM/PPO identification procedures could be questioned; caution in interpretation would still be warranted. Furthermore, in a large study involving more than 2000 older patients in four European countries, the use of these tools by a third party to improve prescribing performance has been shown not to have patient‐relevant effects. 21 Regardless, it should not be forgotten that the STOPP/START criteria were originally designed to assist physicians in daily work. The value of the tool from an educational point of view, in medical school and for resident physicians, could therefore merit further attention. Limitations regarding inter‐rater agreement may be less problematic in such research, as well as the use of prepared datasets.
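The two‐assessor strategy above can be sketched as a simple set comparison. This is a minimal illustration that assumes each identified PIM/PPO is recorded as a (patient, criterion) pair. The function names, identifiers and criterion labels are hypothetical.

```python
def split_by_concordance(found_a, found_b):
    """Split identified PIMs/PPOs into concordant and discordant sets."""
    a, b = set(found_a), set(found_b)
    concordant = a & b  # identified by both assessors
    discordant = a ^ b  # identified by one assessor but not the other
    return concordant, discordant


def discordant_share(found_a, found_b):
    """Share of all PIMs/PPOs found by at least one assessor that are discordant."""
    concordant, discordant = split_by_concordance(found_a, found_b)
    total = len(concordant) + len(discordant)
    return len(discordant) / total if total else 0.0


# Hypothetical findings, recorded as (patient_id, criterion) pairs.
assessor_a = {(1, "STOPP-NSAID"), (1, "START-vitamin-D"), (2, "STOPP-ASA")}
assessor_b = {(1, "STOPP-NSAID"), (2, "START-vaccine"), (3, "START-vitamin-D")}

concordant, discordant = split_by_concordance(assessor_a, assessor_b)
print(sorted(concordant))
print(discordant_share(assessor_a, assessor_b))
```

Applied to the figures in cohort I, where 500 of 782 PIMs/PPOs were discordantly identified, the same calculation gives the reported share of 64%, which is the proportion that would be set aside if only concordant findings were kept.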

The strengths of this study include that two patient cohorts were analysed, representing both hospital and primary care. As 87% of eligible patients were included in the first cohort which originated from an RCT, and patients in the second cohort were consecutively included, the generalizability of the results should be acceptable. Another strength is that the assessments were performed by experienced physicians well accustomed to extracting information from medical records for decision making in daily patient work. Indeed, pharmacotherapeutic assessments may vary between physicians and, for instance, pharmacists. 22 , 23 An important limitation is that the identification of PIMs/PPOs was based on information available in the medical records only. Indeed, it cannot be excluded that medical history aspects and other observations could exist that would facilitate the identification of PIMs/PPOs during a consultation. However, as one of the initial reliability studies was also performed without face‐to‐face interaction with the patients, 9 this aspect is unlikely to explain the divergent results regarding inter‐rater agreement.

5. CONCLUSIONS AND IMPLICATIONS

Revisiting the reliability of the STOPP/START criteria, it appears that inter‐rater agreement is an important and previously overlooked methodological issue. There may be several reasons for the high inter‐rater agreement shown in previous studies. For scientific rigour and to facilitate interpretation of PIM/PPO research, our summarised suggestions may be useful, both in the identification procedure and in the construction of criteria.

COMPETING INTERESTS

There are no conflicts of interest to declare.

ETHICS APPROVAL

Ethics approval was obtained from the Regional Ethical Review Board in Gothenburg, Sweden (DRN: 095‐09, T497‐12, 1046‐15).

CONTRIBUTORS

S.M.W. conceived the study, and N.P.L., J.L. and S.M.W. designed it. N.P.L., S.A.S., B.B. and A.K. performed the assessments. N.P.L., B.B. and A.K. entered the data in a database. N.P.L. and S.M.W. performed the analyses and S.M.W. drafted the manuscript. All authors contributed to the interpretation of the results and revised the manuscript for intellectual content.

Supporting information

Figure S1 Flowchart of studied cohorts

Table S1 STOPP/START criteria for which no PIM/PPO was identified by any of the assessors

Table S2 Reasons underlying discordant identification of PIMs/PPOs in cohort II

ACKNOWLEDGEMENTS

We would like to thank Christina Sjöberg and Carina Tukukino who were involved in the collection and recording of patient data. The study was funded by the Swedish Research Council (521‐2013‐2639 and 2021‐01308) and the Swedish state under the agreement between the Swedish government and the county councils (the ALF Agreement: ALFGBG‐716941 and ALFGBG‐965025). The funding sources did not influence the design, methods, data collection, analysis, preparation of the paper, or decision to submit it for publication.

Parodi López N, Belfrage B, Koldestam A, Lönnbro J, Svensson SA, Wallerstedt SM. Revisiting the inter‐rater reliability of drug treatment assessments according to the STOPP/START criteria. Br J Clin Pharmacol. 2023;89(2):832‐842. doi: 10.1111/bcp.15530

As no intervention was performed in this study, no principal investigator was assigned.

Funding information: Swedish Research Council, Grant/Award Numbers: 2021‐01308, 521‐2013‐2639; Swedish state under the agreement between the Swedish government and the county councils (the ALF agreement), Grant/Award Numbers: ALFGBG‐716941, ALFGBG‐965025

DATA AVAILABILITY STATEMENT

The datasets generated and analysed during the current study are not publicly available owing to Swedish data protection laws. The data can be shared with authorised persons after approval of an application submitted to the Swedish Ethical Review Authority (https://etikprovningsmyndigheten.se).

REFERENCES

  • 1. Gallagher P, Ryan C, Byrne S, Kennedy J, O'Mahony D. STOPP (Screening Tool of Older Person's Prescriptions) and START (Screening Tool to Alert doctors to Right Treatment). Consensus validation. Int J Clin Pharmacol Ther. 2008;46(2):72‐83. doi: 10.5414/CPP46072
  • 2. O'Mahony D, O'Sullivan D, Byrne S, O'Connor MN, Ryan C, Gallagher P. STOPP/START criteria for potentially inappropriate prescribing in older people: version 2. Age Ageing. 2015;44(2):213‐218. doi: 10.1093/ageing/afu145
  • 3. Hill‐Taylor B, Sketris I, Hayden J, Byrne S, O'Sullivan D, Christie R. Application of the STOPP/START criteria: a systematic review of the prevalence of potentially inappropriate prescribing in older adults, and evidence of clinical, humanistic and economic impact. J Clin Pharm Ther. 2013;38(5):360‐372. doi: 10.1111/jcpt.12059
  • 4. Tommelein E, Mehuys E, Petrovic M, Somers A, Colin P, Boussery K. Potentially inappropriate prescribing in community‐dwelling older people across Europe: a systematic literature review. Eur J Clin Pharmacol. 2015;71(12):1415‐1427. doi: 10.1007/s00228-015-1954-4
  • 5. Millar AN, Daffu‐O'Reilly A, Hughes CM, et al. Development of a core outcome set for effectiveness trials aimed at optimising prescribing in older adults in care homes. Trials. 2017;18(1):175. doi: 10.1186/s13063-017-1915-6
  • 6. Beuscart JB, Knol W, Cullinan S, et al. International core outcome set for clinical trials of medication review in multi‐morbid older patients with polypharmacy. BMC Med. 2018;16(1):21. doi: 10.1186/s12916-018-1007-9
  • 7. Rankin A, Cadogan CA, Ryan C, Clyne B, Smith SM, Hughes CM. Core outcome set for trials aimed at improving the appropriateness of polypharmacy in older people in primary care. J Am Geriatr Soc. 2018;66(6):1206‐1212. doi: 10.1111/jgs.15245
  • 8. Gallagher P, Baeyens J‐P, Topinkova E, et al. Inter‐rater reliability of STOPP (Screening Tool of Older Persons' Prescriptions) and START (Screening Tool to Alert doctors to Right Treatment) criteria amongst physicians in six European countries. Age Ageing. 2009;38(5):603‐606. doi: 10.1093/ageing/afp058
  • 9. Ryan C, O'Mahony D, Byrne S. Application of STOPP and START criteria: interrater reliability among pharmacists. Ann Pharmacother. 2009;43(7):1239‐1244. doi: 10.1345/aph.1M157
  • 10. Belfrage B, Koldestam A, Sjöberg C, Wallerstedt SM. Prevalence of suboptimal drug treatment in patients with and without multidose drug dispensing – a cross‐sectional study. Eur J Clin Pharmacol. 2014;70(7):867‐872. doi: 10.1007/s00228-014-1683-0
  • 11. Parodi López N, Svensson SA, Wallerstedt SM. Association between recorded medication reviews in primary care and adequate drug treatment management – a cross‐sectional study. Scand J Prim Health Care. 2021;39(4):419‐428. doi: 10.1080/02813432.2021.1973239
  • 12. Sjöberg C, Wallerstedt SM. Effects of medication reviews performed by a physician on treatment with fracture‐preventing and fall‐risk‐increasing drugs in older adults with hip fracture – a randomized controlled study. J Am Geriatr Soc. 2013;61(9):1464‐1472. doi: 10.1111/jgs.12412
  • 13. Parodi López N, Svensson SA, Wallerstedt SM. Clinical relevance of potentially inappropriate medications and potential prescribing omissions according to explicit criteria – a validation study. Eur J Clin Pharmacol. 2022;78(8):1331‐1339. doi: 10.1007/s00228-022-03337-8
  • 14. Ganz DA, Bao Y, Shekelle PG, Rubenstein LZ. Will my patient fall? JAMA. 2007;297(1):77‐86. doi: 10.1001/jama.297.1.77
  • 15. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22(3):276‐282.
  • 16. Dalleur O, Feron JM, Spinewine A. Views of general practitioners on the use of STOPP&START in primary care: a qualitative study. Acta Clin Belg. 2014;69(4):251‐261. doi: 10.1179/2295333714Y.0000000036
  • 17. Anrys P, Boland B, Degryse J‐M, et al. STOPP/START version 2 – development of software applications: easier said than done? Age Ageing. 2016;45(5):589‐592. doi: 10.1093/ageing/afw114
  • 18. Counter D, Millar JWT, McLay JS. Hospital readmissions, mortality and potentially inappropriate prescribing: a retrospective study of older adults discharged from hospital. Br J Clin Pharmacol. 2018;84(8):1757‐1763. doi: 10.1111/bcp.13607
  • 19. Aronson JK. A manifesto for clinical pharmacology from principles to practice. Br J Clin Pharmacol. 2010;70(1):3‐13. doi: 10.1111/j.1365-2125.2010.03699.x
  • 20. Lönnbro J, Wallerstedt SM. Clinical relevance of the STOPP/START criteria in hip fracture patients. Eur J Clin Pharmacol. 2017;73(4):499‐505. doi: 10.1007/s00228-016-2188-9
  • 21. Blum MR, Sallevelt B, Spinewine A, et al. Optimizing Therapy to Prevent Avoidable Hospital Admissions in Multimorbid Older Adults (OPERAM): cluster randomised controlled trial. BMJ. 2021;374:n1585. doi: 10.1136/bmj.n1585
  • 22. Page AT, Etherton‐Beer CD, Clifford RM, Burrows S, Eames M, Potter K. Deprescribing in frail older people – Do doctors and pharmacists agree? Res Social Adm Pharm. 2016;12(3):438‐449. doi: 10.1016/j.sapharm.2015.08.011
  • 23. Wallerstedt SM, Hoffmann M, Lönnbro J. Methodological issues in research on drug‐related admissions: a meta‐epidemiological review with focus on causality assessments. Br J Clin Pharmacol. 2022;88(2):541‐550. doi: 10.1111/bcp.15012



Articles from British Journal of Clinical Pharmacology are provided here courtesy of Wiley
