A Validated Logistic Regression Model to Identify Coronary Heart Disease patients (CHD) within Primary Care Databases in the United Kingdom

K Thiru; P Donnan; F Sullivan

. 2003;2003:1030.

PMCID: PMC1480149 PMID: 14728533

Abstract

We established the optimal search strategy for identifying coronary heart disease (CHD) patients within the Electronic Patient Record (EPR) of ‘paperless’ family practices in the UK. Multiple logistic regression modelling (MLRM) and Receiver Operating Characteristic (ROC) curves were used to develop the query. The selected search strategy was validated at two additional paperless family practices.

Background

Under NHS directives in the UK, clinicians are increasingly required to practise in a ‘paperless’ environment and to collect clinical data electronically. These data are required to support budgetary requirements, the implementation of Evidence Based Medicine (EBM) and the ‘quality cycle’. A primary step in meeting these needs is the identification of a target population. Establishing a search strategy that is both sensitive (complete) and positive predictive (correct) is complex, owing to the granularity of coding strategies, the structure of the coding frame, and the way clinicians prioritise and use READ codes. No technique for establishing the optimal search strategy with which to identify a patient population has been described in the literature. Query strategies have tended to be generic, searching for expected diagnostic and prescribing codes. This study attempted to identify and validate the most effective and efficient search strategy for CHD patients in UK family medicine practice.

Method

The model was derived from data from a paperless practice with 13 500 patients. A total of 55 independent variables were identified by a multi-professional primary care team as desirable for the implementation and management of CHD patients. Of these, 14 redundant or highly correlated codes were removed. Three different ontological representations of CHD were used as the dependent variables (reference standards) for MLRM: (1) a World Health Organisation (WHO) based definition; (2) a national data quality group’s definition (PDQ); and (3) a model determined by a group responsible for clinical quality of care at the local level (CG). Forward stepwise MLRM, with entry and exit p-values of 0.05 and 0.1 respectively, was used to identify significant codes. The area under the ROC curve (AUC) was used to assess the performance of the models. Age and gender specific models were explored. The final model was for all patients over the age of 35. The number of significant independent variables identified was reduced further using the ROC curve. The selected model was validated using data from two independent paperless practices with populations of 11 762 (4-byte Read version) and 7545 (5-byte Read version).
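As an illustration of how the AUC summarises a model's ability to separate cases from non-cases, the following minimal sketch computes it from predicted scores using the rank-based (Mann-Whitney) identity. The function name `auc` and the six scores and labels are hypothetical and are not taken from the study data; the study's own software and procedure are not described in the source.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney identity.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen non-case (label 0); ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of CHD from a fitted model,
# with hypothetical reference-standard labels (1 = CHD, 0 = not CHD).
scores = [0.92, 0.81, 0.40, 0.35, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

print(round(auc(scores, labels), 3))
```

An AUC of 0.5 corresponds to a model that ranks cases no better than chance, and 1.0 to one that ranks every case above every non-case.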

Results

Selection of significant codes was highly influenced by the age and sex of the population considered. The codes selected as optimal for identifying patients over 35 years old depended on the ontological perspective of the disease used (table 1). PDQ and WHO defined patients were most effectively identified through the selected search. The search strategy derived from the PDQ population performed optimally in relation to sensitivity and yield (1/positive predictive value) statistics on the remaining dependent variables.

Table 1.

Summary table of reference standard specific optimal search strategies and summary statistics of effectiveness at FMC (G4 = diagnostic codes for CHD).

Definition	Codes selected	Sensitivity	Specificity	PPV	Yield
WHO	G4, anti arrhythmics and anti-lipids	98.0%	96.4%	71.0%	1.4
PDQ	G4, nitrates and digoxins	99.0%	92.1%	54.1%	1.8
CG	G4, anti hypertensive drugs and hypertension	95.1%	81.0%	35%	2.8
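The statistics reported in the table can be illustrated with a minimal sketch that runs a code-based query over patient records and compares the flagged patients with a reference standard. The record layout, the code labels, the helper names `run_query` and `evaluate`, and the eight patients are all hypothetical; only the definition of yield as 1/PPV comes from the text.

```python
def run_query(records, query_codes):
    """Return ids of patients whose record contains any of the query codes."""
    return {pid for pid, codes in records.items() if codes & query_codes}


def evaluate(flagged, reference, population):
    """Compare flagged patients with the reference standard.

    Assumes flagged and reference are subsets of population and that
    every denominator below is non-zero.
    """
    tp = len(flagged & reference)   # flagged and truly CHD
    fp = len(flagged - reference)   # flagged but not CHD
    fn = len(reference - flagged)   # CHD but missed by the query
    tn = len(population) - tp - fp - fn
    ppv = tp / (tp + fp)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "yield": 1 / ppv,  # yield defined as 1/PPV, as in the text
    }


# Hypothetical records: patient id -> set of codes present in the EPR.
records = {
    1: {"G4"}, 2: {"nitrate"}, 3: {"digoxin"}, 4: {"statin"},
    5: set(), 6: {"G4", "nitrate"}, 7: {"nitrate"}, 8: {"aspirin"},
}
reference = {1, 2, 6, 8}  # hypothetical reference-standard CHD patients

flagged = run_query(records, {"G4", "nitrate", "digoxin"})
stats = evaluate(flagged, reference, set(records))
print({name: round(value, 2) for name, value in stats.items()})
```

One way to read the yield is as the number of flagged records that must be reviewed per true case found, so a lower yield indicates a more efficient query.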

Validation results show that the PDQ strategy performed best when tested on data from a practice using the same Read version as the training site. The search was less effective on the newer 5-byte Read version (table 2).

Table 2.

Performance of diagnostic code, nitrates and digoxin based queries at training and testing sites

G4 + nitrate + digoxin queries	WHO		PDQ		CG2
G4 + nitrate + digoxin queries	Sensitivity	PPV	Sensitivity	PPV	Sensitivity	PPV
Training site (4 byte)	99%	51%	99%	54%	86%	52%
Testing 4 byte site	98%	66%	97%	74%	91%	69%
Testing 5 byte site	91%	39%	74%	48%	65%	61%

Conclusion

Generic searches may perpetuate inequalities in care: older and female patients may not be identified by non-specific search strategies. MLRM and ROC techniques can be used to identify the optimal search strategies for specific groups of CHD patients. Queries should be bespoke and mindful of the user’s conceptualisation of disease. Such techniques can improve practices’ ability to establish chronic disease registers that are sensitive, positive predictive and high in yield. The study explored the wider implications of the results for patient identification and the impact of more granular coding strategies on data collection and queries. When READ codes become incorporated within the SNOMED CT classification system, these findings will have international relevance.
