BMJ. 2023 Feb 7;380:e071058. doi: 10.1136/bmj-2022-071058

Table 5.

Example of electronic health records data comparing the distribution of important variables against the development data, in studies developing or validating a prediction model in clustered data. Adapted from: Collins GS, Altman DG. Identifying patients with undetected renal tract cancer in primary care: an independent and external validation of QCancer® (Renal) prediction model. Cancer Epidemiol 2013;37:115-20, with permission from Elsevier [186]

| Risk predictor | QRESEARCH: development (n=2 359 168) | QRESEARCH: internal validation (n=1 240 722) | THIN (external validation; n=2 145 133) |
| --- | --- | --- | --- |
| Median age (years) | 50.1 (SD 15.0) | 50.1 (SD 14.9) | 48 (IQR 38-61) |
| Smoking status (No (%)) | | | |
| Non-smoker | 1 197 521 (50.8) | 626 066 (50.5) | 860 217 (40.1) |
| Ex-smoker | 425 611 (18.0) | 228 649 (18.4) | 311 924 (14.5) |
| Current smoker, amount not recorded | 71 603 (3.0) | 39 396 (3.2) | 282 534 (13.2) |
| Light smoker (<10 cigarettes/day) | 148 703 (6.3) | 80 103 (6.5) | 133 657 (6.2) |
| Moderate smoker (10-19 cigarettes/day) | 180 509 (7.7) | 96 175 (7.8) | 203 954 (9.5) |
| Heavy smoker (≥20 cigarettes/day) | 134 688 (5.7) | 73 981 (6.0) | 183 590 (8.6) |
| Smoking status not recorded | 200 533 (8.5) | 96 352 (7.8) | 169 257 (7.9) |
| Current symptoms and symptoms in the preceding year (No (%)) | | | |
| Haemoglobin <11 g/dL recorded in past year | 29 720 (1.3) | 16 169 (1.3) | 16 961 (0.8) |
| Abdominal pain | 230 584 (9.8) | 128 721 (10.4) | 253 344 (11.8) |
| Appetite loss | 10 287 (0.4) | 5531 (0.4) | 6097 (0.30) |
| Weight loss | 25 897 (1.1) | 14 464 (1.2) | 29 369 (1.4) |
| Haematuria | 43 850 (1.9) | 25 553 (2.1) | 37 810 (1.8) |
| Previous diagnosis of cancer apart from renal tract cancer at study entry | 51 119 (2.2) | 27 163 (2.2) | 49 303 (2.3) |

Baseline characteristics of the development and an external validation cohort of QCancer, which predicts the risk of having undiagnosed lung, ovarian, colorectal, gastro-oesophageal, renal, or pancreatic cancer. SD=standard deviation; IQR=interquartile range; THIN=The Health Improvement Network.
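
The percentages in the table are each count divided by its cohort size. The following is a minimal sketch of that arithmetic for a few rows, useful for checking a transcribed table or for comparing case mix between cohorts. The counts and cohort sizes are copied from the table. The dictionary keys and function names (`percentage`, `percentages_by_cohort`, `difference_from_development`) are hypothetical names chosen for this illustration and are not taken from the original study.

```python
# Cohort sizes as reported in the table header.
COHORT_SIZES = {
    "development": 2_359_168,
    "internal_validation": 1_240_722,
    "thin_external": 2_145_133,
}

# Counts copied from three rows of the table (illustrative subset only).
COUNTS = {
    "Non-smoker": {
        "development": 1_197_521,
        "internal_validation": 626_066,
        "thin_external": 860_217,
    },
    "Current smoker, amount not recorded": {
        "development": 71_603,
        "internal_validation": 39_396,
        "thin_external": 282_534,
    },
    "Abdominal pain": {
        "development": 230_584,
        "internal_validation": 128_721,
        "thin_external": 253_344,
    },
}


def percentage(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


def percentages_by_cohort(counts: dict[str, int]) -> dict[str, float]:
    """Convert one row of counts into percentages of each cohort's size."""
    return {
        cohort: percentage(n, COHORT_SIZES[cohort])
        for cohort, n in counts.items()
    }


def difference_from_development(counts: dict[str, int], cohort: str) -> float:
    """Percentage-point difference between a cohort and the development cohort."""
    pct = percentages_by_cohort(counts)
    return round(pct[cohort] - pct["development"], 1)


if __name__ == "__main__":
    for predictor, counts in COUNTS.items():
        print(predictor, percentages_by_cohort(counts))
```

Under these assumptions, the row "Current smoker, amount not recorded" gives 3.0% in the development cohort and 13.2% in THIN, which is the kind of difference in predictor distribution that this table is meant to expose.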