Abstract
Since its inception in the mid-1980s, the General Practice Research Database (GPRD) has undergone many changes but remains the largest validated and most utilised primary care database in the UK. Its use in pharmacoepidemiology stretches back many years, with over 800 original research papers now published. The database has been administered by the Medicines and Healthcare products Regulatory Agency since 2001, and the last 5 years have seen a rebuild of the database processing system, enhancing access to the data, and a concomitant push towards broadening the applications of the database. New methodologies including real-world harm–benefit assessment, pharmacogenetic studies and pragmatic randomised controlled trials within the database are being implemented. A substantive and unique linkage programme (using a trusted third party) has enabled access to secondary care data and disease-specific registry data as well as socio-economic data and death registration data. The utility of anonymised free text accessed in a safe and appropriate manner is being explored using both simple techniques and more complex ones such as natural language processing.
Keywords: database, pharmacoepidemiology, primary care
Introduction
The history of the General Practice Research Database (GPRD) has been described elsewhere [Parkinson et al. 2007; Wood and Coulson, 2001], and the various versions of the database can be tracked through this literature. A number of reviews throughout its history have described GPRD research in the areas of pharmacoepidemiology, epidemiology, public health and clinical outcomes research [Lawson et al. 1998; Wood and Martinez, 2004]. Few of these have addressed any aspects of organising and maintaining such a data set or the role of the organisation with which the responsibility for managing such a resource lies. These issues are, however, critical in terms of providing a vehicle for undertaking clinical research, and are applicable to all primary care data sets available in the UK. This review looks at some issues related to these questions as well as considering current research using GPRD. It also describes some of the areas of novel research being undertaken within or in collaboration with the GPRD organisation.
The GPRD evolved from an early general practice information system. Free computing equipment was supplied in return for standardised data recording and provision of patient-level data into a central database to be used for health research [Whalley and Mantgani, 1997]. Over 20 years later this concept has developed into the largest and most utilised verified primary care database in the UK and arguably the world. As of March 2011, it contains records from over 12 million patients contributing 64 million person years of prospectively recorded high-quality primary healthcare data. During its lifetime, it has evolved in terms of its data, its technology and its interface with researchers.
The UK healthcare system involves a two-level service with a gatekeeper. Primary care is the frontline for healthcare. Secondary care (predominantly hospital-based care) provides healthcare within the context of this, either at the request of primary care, or directly but with full disclosure to the primary carer. Nearly all individuals in the UK are registered with a primary care physician (general practitioner [GP]) who oversees that patient’s healthcare and acts as the gatekeeper to the National Health Service (NHS). Thus prospective follow up of individuals is possible via the healthcare records of the GP, and it is for this reason that primary care databases offer such opportunities for research [Lawrenson et al. 1999; de Lusignan and van Weel, 2006].
Primary care data
Source data for primary care research databases is generated via GP systems themselves, and download software that collects data from practice servers. All GP systems use a combination of coded and free text data (including letters and emails) to record healthcare data although the balance between coded and free text data can differ between systems. Systems are flexible in terms of the ways in which data can be recorded and, in the UK, use Read terms to code healthcare information. A migration to SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) has been planned, but has not yet occurred. Data recorded is both clinical and administrative, with entire Read term chapters for recording administrative and procedural details of patient care.
The GPRD data is collected from the Vision GP system, supplied to GPs by In Practice Systems Ltd (INPS). Vision is a relatively highly coded system based around clinical ‘entities’. An entity is a specific piece of information ranging from a medical history note or a diagnosis to a mental state questionnaire score. Coded data is available on clinical history, diagnoses, signs and symptoms, prescribing of drugs and devices, test results, referrals to secondary care and non-practice-based primary care services, immunisations and lifestyle factors. There is also a range of additional data where more detailed information can be entered for a particular predefined event type via a specifically designed structured data area. For example, a smoking habit of 10–19 cigarettes per day may be recorded using a Read code (‘1374.’) or using a specific structured data area for smoking status where individual data items can be recorded for smoking status, quantity smoked and dates of starting and stopping smoking.
Free text information is a common feature of all GP systems and is utilised to varying degrees. Within Vision, it is always linked to specific coded data as part of the record. The balance between coded data and free text is largely dependent upon the coding style of the user but system design has an effect on this balance. Free text can be classified into two broad types: short notes and annotations, and healthcare provider communications. The former are often annotations to the Read term. These types of annotations may only have specific meaning in the context of the coded term. They frequently contain standard and nonstandard abbreviations, are entered manually and can contain misspellings and typographical errors. The healthcare provider communications relate to letters and emails sent to and from the GP. These include referral letters, hospital discharge letters and other communications. Where these communications are electronic, the text of the letter is retained in full in the free text data; however, where they are on paper they are either scanned in using text recognition scanning software, or scanned in as images linked to consultations. Key information from communications such as diagnoses is often entered in as coded data abstracted from the document. These longer forms of free text are generally more grammatically constructed and can be understood in isolation from the coded data to which they are linked.
The major GP systems in the UK include EMIS (Egton Medical Information System), System-One and Vision. All of these systems use essentially similar data models for primary care data, with individual healthcare ‘events’ being stored as date-stamped coded records with a potential free text component. Some differences in organising this information exist and system design encourages slightly different recording styles; however, the baseline data is similar. The implications for research associated with these differences are likely to be minimal, as well-designed methodologies avoid making assumptions about recording strategies and focus on the raw data whilst attempting to mitigate the effects of potential recording anomalies.
Research using primary care databases
Primary care databases have been used extensively for research, and remain a major focus of observational data research where prospective healthcare records are required. The range of data sources is very wide, from groups of co-operating practices at the smaller geographical level to larger networks of practices participating in project-based data extractions on a broader geographical (but not national) scale [de Lusignan et al. 2006]. These groups may often be focused on a particular clinical area, and to a certain extent can modify recording practice in these areas. Aside from GPRD, in terms of national or UK-wide databases there are a handful of resources providing data for research purposes and in some cases market research purposes. These include non-commercial databases such as Q-Research [Hippisley-Cox et al. 2004] and DIN-Link [Carey et al. 2004]; and commercial databases such as UK IMS Disease Analyzer and The Health Improvement Network (THIN) [Bourke et al. 2004]. GPRD is the most established, and has the greatest track record in research and validation. The GPRD bibliography [GPRD, 2011] references over 850 peer-reviewed publications relating to GPRD, with over 800 of these being original research papers. Such a track record has positive implications in terms of data quality, experience of use in observational research and appropriate management of the resource. GPRD data is available to both the commercial and academic sectors but it is not available for market research.
GPRD as an organisation has an operational division, which creates and manages access to the data resource, and a research arm. The research arm of GPRD represents an expert unit in terms of the data resource as it relates to research. This remit encompasses an understanding of the data in terms of the information it holds, its provenance and how it relates to recording in general practice, and an appreciation of its temporal complexity and how that relates to proposed research. The GPRD research group provides a high level of expertise in observational data analysis and epidemiology in general, especially relating to the strengths, limitations and general idiosyncrasies of the data. This expertise is made available to clients through workshops and more individual contact relating to study advice, as well as through contracted services.
The GPRD operates on a not-for-profit basis within the terms of the MHRA’s trading fund status [Wood and Coulson, 2001]. Previously, the cost of access to the data proved a barrier to the extensive use of the data within academia. To address this and to broaden the use of the data, a joint venture with the UK Medical Research Council (MRC) was launched in November 2005. This enabled access to the data for UK academic groups working on noncommercial projects, at no cost. During the term of this MRC scheme the number of academic protocols increased by 103.5%. Over the same years, the total number of submitted protocols increased by 99%. From 2002 to 2004, 54.2% and 35.9% of submitted protocols came from academic- and industry-based organisations, respectively. From 2005 to 2010 these figures changed to 60.3% and 23.5%. After the scheme ended, access to the data for academic organisations was continued through the development of a risk-sharing licence giving access to the entire data set at a reduced cost under certain conditions.
GPRD aims to broker and undertake high-quality research, the control of which is mediated via an Independent Scientific Advisory Committee (ISAC) that assesses submitted protocols in terms of their ethical and scientific merit and also their feasibility. This is a committee set up under NHS appointment commission terms which features scientists with statistical, epidemiological and specific GPRD-related expertise and also lay members. The GPRD Group has obtained ethical approval from a Research Ethics Committee (REC) for all purely observational research using GPRD data; namely, studies which do not include patient involvement. As part of its assessment, ISAC may recommend that study-specific REC approval is sought if ethical issues arise in relation to an individual study. Separate REC approval is required for any study which includes any form of direct patient involvement.
Data quality
GPRD has historically undertaken a set of internal data quality measurements in an effort to ensure high-quality data within its subset of UK practices. This data quality assessment is undertaken at the patient level and at the practice level. The practice-level quality assessment is manifested by the practice ‘up-to-standard’ (UTS) date and the patient quality level by a patient acceptability flag.
Patients are labelled as ‘acceptable’ for use in research by a process that identifies and excludes patients with noncontiguous follow up or patients with poor data recording that raises suspicion as to the validity of that patient’s record. Patient data is checked for the issues listed in Table 1. If any of these data values are found then the patient is labelled unacceptable, and is not recommended for use in research. The data, however, remains in the database. The process is broadly unchanged since the advent of the GPRD Gold system (see the next section). No specific validation work has been conducted on this method as much of it is based on logical inconsistency of the registration data. A breakdown of acceptability status shows that 11.89% of patients are unacceptable: 10.44% are temporary patients and only 1.45% are unacceptable due to ‘inconsistent’ registration data.
Table 1.
Data item | Unacceptable value |
---|---|
First registration date | Empty; invalid date; prior to year of birth |
Year of birth | Missing |
Transferred out date | Present with no reason; prior to first registration date; prior to current registration date |
Transferred out reason | Present with no date |
Current registration date | Prior to first registration date; prior to year of birth |
Gender | Other than male, female or indeterminate |
Age | Over 115 years at end of follow up |
Registration status | Other than applied or permanent |
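The checks in Table 1 can be illustrated with a minimal Python sketch. This is an illustration only and not GPRD's own implementation: the `Registration` record and its field names are hypothetical, age is approximated from year of birth alone, and the ‘invalid date’ check is assumed to happen when the raw data is parsed into dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical record layout; the real GPRD schema is not described in the text.
@dataclass
class Registration:
    year_of_birth: Optional[int]
    first_reg_date: Optional[date]      # None stands for an empty date
    current_reg_date: Optional[date]
    transfer_out_date: Optional[date]
    transfer_out_reason: Optional[str]
    gender: str
    registration_status: str
    end_of_follow_up: date

VALID_GENDERS = {"male", "female", "indeterminate"}
VALID_STATUSES = {"applied", "permanent"}
MAX_AGE_YEARS = 115


def unacceptable_reasons(p: Registration) -> List[str]:
    """Return the Table 1 rules that a patient record violates (empty if acceptable)."""
    reasons: List[str] = []
    yob, first, current = p.year_of_birth, p.first_reg_date, p.current_reg_date

    if yob is None:
        reasons.append("year of birth missing")
    if first is None:
        reasons.append("first registration date empty")
    elif yob is not None and first.year < yob:
        reasons.append("first registration date prior to year of birth")

    if current is not None:
        if first is not None and current < first:
            reasons.append("current registration date prior to first registration date")
        if yob is not None and current.year < yob:
            reasons.append("current registration date prior to year of birth")

    if p.transfer_out_date is not None:
        if not p.transfer_out_reason:
            reasons.append("transferred out date present with no reason")
        if first is not None and p.transfer_out_date < first:
            reasons.append("transferred out date prior to first registration date")
        if current is not None and p.transfer_out_date < current:
            reasons.append("transferred out date prior to current registration date")
    elif p.transfer_out_reason:
        reasons.append("transferred out reason present with no date")

    if p.gender not in VALID_GENDERS:
        reasons.append("gender not male, female or indeterminate")
    # Age is approximated as a difference in calendar years (an assumption).
    if yob is not None and p.end_of_follow_up.year - yob > MAX_AGE_YEARS:
        reasons.append("age over 115 years at end of follow up")
    if p.registration_status not in VALID_STATUSES:
        reasons.append("registration status other than applied or permanent")
    return reasons
```

A patient would be flagged as acceptable when the returned list is empty; returning the reasons rather than a single flag makes it easy to reproduce a breakdown such as temporary versus inconsistent registrations.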
Historically the UTS date was developed to measure adherence to recording guidelines provided with the VM system and subsequently the Vision system when it was introduced after 1995. The UTS process was developed in the data set derived from the older VAMP system and involved ten data quality parameters. Each parameter generated an earliest date at which it was acceptable, and the UTS date was set to the earliest date at which nine of these were acceptable, with the exception of the mortality parameter, which was mandatory. In the new Vision-based data set, these processes were ported to the new data as closely as possible given a new, more complex data structure. Having applied the UTS parameters to Vision data for a number of years, we were able to identify the key parameters which were instrumental in determining the actual UTS date. In its current form the UTS date is based on two central concepts: assurance of continuity in data recording, and mortality rate compared with an expected range. Monitoring of mortality rate allows us to identify the point at which previously registered patients have been deleted from the system.
Recording practices and GP systems in the post-2000 era represent a vastly different situation to the one which preceded it, involving early GP systems such as VM. Today’s accredited systems are more complex but simpler to use and enable capture of a much richer data set than previous systems. Recording practices have changed enormously, with nearly all healthcare episodes being recorded electronically and often according to incentivised regulations in certain areas with Quality and Outcomes Framework (QOF) rules [Vamos et al. 2011]. This increased complexity raises issues in terms of using the data appropriately for research. For example, such changes render the original UTS parameters redundant to a degree. There is a need to conduct a scientific research-based assessment of primary care data sources in general, specifically for the purpose of characterising more accurately the strengths and weaknesses of the available data itself. A collaboration involving GPRD has recently been undertaken to develop such data quality parameters, with initial pilot phase results exploring baseline parameters across the data [Tate et al. 2011].
GPRD has a long history of validation studies which has been recently reviewed [Herret et al. 2010]. This study reviewed 212 publications involving 357 validations classifying them as either internal (manual/algorithmic review of database records or sensitivity analysis) or external (questionnaires or patient record requests to GPs and comparison with external data sources). Generally a high proportion of cases were confirmed but for the majority of studies only positive predictive values (PPVs) are obtainable. The authors also note that detailed case definitions and code lists used were seldom provided, and it is thus hard to assess whether any poor performance in terms of case identification is due to poor code selection. The importance of code selection is an area explored by a GPRD stroke study which stresses the need for transparency and an understanding of the context of code selection [Gulliford et al. 2009]. A smaller review of GPRD validation studies [Khan et al. 2010] came to similar conclusions reporting high PPVs, citing the Morbidity Statistics from General Practice 1991–1992 (MSGP4) as a frequent external comparator, and also noting the use of both Read codes and OXMIS codes (Oxford Medical Information System codes: an early clinical dictionary used by VM system) as a complication of coding. Current GPRD data contains only Read coding. Validation studies in other primary care data sources are limited and in the case of THIN mostly reassess validity in the same clinical areas already assessed using GPRD.
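The reason most validation studies yield only a PPV is that they start from cases already identified in the database, so cases the code list missed are never observed and sensitivity cannot be computed. A minimal Python sketch of the PPV calculation follows; the function name is hypothetical and the counts in the comment are invented purely for illustration.

```python
def positive_predictive_value(confirmed: int, reviewed: int) -> float:
    """PPV = confirmed cases / database-identified cases that were reviewed.

    Both counts come from records the code list already flagged, which is
    why sensitivity (requiring the number of missed true cases) is not
    obtainable from the same validation exercise.
    """
    if reviewed <= 0:
        raise ValueError("at least one reviewed case is required")
    if not 0 <= confirmed <= reviewed:
        raise ValueError("confirmed must lie between 0 and reviewed")
    return confirmed / reviewed


# Invented example: 180 of 200 reviewed database cases confirmed by the GP.
ppv = positive_predictive_value(confirmed=180, reviewed=200)
```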
Recent innovations in terms of linking data sets within GPRD have provided further opportunities to compare data across data sets, not only in terms of overall prevalence rates, but also in terms of individual concordance at the patient level. Often, neither linked data set can be considered a gold standard and there will be genuine healthcare reasons for discordance between the two. Their role in data quality is thus complex; however, the benefit of linkage is already being seen in research [Boggon et al. 2011].
Database redevelopment
In its initial form, the GPRD was derived from a system called VM, a DOS-based system which produced downloads involving five files of a simple structure. This version of the database was used until early 2000 but ceased to be updated from 2001. The Vision system superseded VM in 1995 and from then until 2000 the vast majority of practices upgraded. The data collected from the Vision system was richer and more complex and could not be housed in existing VM data repositories. In 1999 the Medicines and Healthcare products Regulatory Agency (MHRA: an executive agency of the Department of Health, then named the Medicines Control Agency) invested in the development of the Vision-based data that had been accumulating since 1995 in some practices [Wood and Coulson, 2001]. Built by external contractors to a specification supplied by GPRD staff, the new system was known as Full Feature GPRD (FF-GPRD) and its main features were a centralised online access database interfaced by a suite of data cutting and data reporting tools. Underlying these tools was a data warehouse relational database, with a proto-database in the form of an operational data store. By 2007, the FF-GPRD system was approaching the end of its natural lifetime and a replacement system was developed and implemented within the GPRD group. This system was designed to concentrate on the basic requirements of an online system. Such requirements were identified by a number of means:
An audit of study population requirements from a sample of ISAC (formerly SEAG) protocols revealed that the vast majority of study populations were identified by the presence or absence of either specific Read terms, or drugs, and a simple set of age, gender, study period and temporal ordering constraints.
An analysis of the relative strengths and weaknesses of the FF-GPRD system enabled us to identify candidate functionality for retention as well as those functions that were not of value and could be discarded.
The experience of GPRD researchers working with data to execute service provision (including data extraction to the undertaking of full scientific research projects) enabled us to easily identify the inefficiencies, generate benchmarks and detail the specific weaknesses of the existing system, as candidates for improvement.
Client feedback was essential given the funding model of GPRD, and any new system had to address the areas of dissatisfaction with FF-GPRD.
Over a period of 2 years a replacement system was developed. The new system, known as GPRD Gold, was launched in March 2009 and is the current version of the database. It processes data practice by practice, one collection at a time. Gains in performance have been significant and the process is completely scalable. These improvements, in conjunction with implementation of electronic delivery of incremental data collections, have led to a decrease in the data lag period (the difference between the date of the data in the database and the date of access). Thus, currently nearly all contributing practices have a lag period of less than 6 weeks in a monthly static database. This is limited only by the frequency of collections and not by processing limits. It is now possible to rebuild the database from scratch in a 2–3 week window, which was not possible previously. This enables substantive changes to processing to be implemented in a realistic time frame, making the database itself potentially reactive to changes in baseline systems or recording practices.
In GPRD Gold a static version of the database is produced every month and it is this which is accessed by researchers. All previous monthly databases are retained and at any one time six previous monthly versions are available for access on the online system. This ensures access to the most up-to-date data and also the ability to requery data on which a previous analysis is based if required: as the monthly versions are static, the same data will be available. The principle of using a central database via an online system is sound as it negates the need for extensive IT expertise at the client site, and it helps standardise the ‘forms’ of GPRD available for use. The term ‘GPRD’ is frequently used to describe a data source in publications; however, it is rarely specified in any more detailed terms. The nature of the resource is such that in order to exactly replicate an analysis one would have to analyse the same version of the database. A centralised database with versioned static releases goes some way towards introducing clarity. However, for an online system to be a viable option users need to be able to extract the data they require in a realistic time frame. A set of simple, fit-for-purpose and efficient data identification and extraction tools was developed that provided for the requirements of in excess of 90% of all study populations. These tools also enable users to run simple feasibility counts, often the first step in a potential project. In benchmarking comparisons with the prior system, the simplified approach means that even queries involving large numbers of patients run hundreds of times quicker. Whilst very large data sets are not extracted in real time, patient cohorts involving several hundred thousand patients can realistically be built within a 24-hour period, and can be run asynchronously without user input.
GPRD is the only one of these databases that is accessible online. Access to data via other providers is either by data extraction at the supplier organisation and subsequent supply or, in the case of Q-Research, by studies that are run within the supplier organisation and reported to the client, with no data supplied directly. These models are also run by GPRD, and can be an efficient methodology. However, with the number of skilled primary care database researchers growing in the UK and elsewhere, it is likely that the demand for raw data will grow. Allowing individuals to access the data directly gives researchers more confidence in the provenance of the data on which they are working.
GPRD linkage programme
For a number of years now, the idea of a joined-up health data network for research has been a lofty goal within the UK. Part of the NHS National Programme for Information Technology (NPfIT) was the creation of a Secondary User System: a system conceived to provide pseudonymised patient-based data for purposes including healthcare planning, clinical audit, performance monitoring and research [NHS Connecting for Health Implementation Guidance team, 2007]. More recently the UK Research Capability Programme has looked at the concept of federating data sources in the UK.
Within the field of primary care databases, linkage to non-primary-care data has long been seen as desirable. Thus far, GPRD remains the only national database to have linked on a permanent and ongoing basis to external data sets. The linkage process itself uses a common methodology which enables linkage to several disparate data sets. Practices are required to consent to having their data linked, and if they do so they download identifiers, including patient NHS number, postcode, date of birth and gender, to a trusted third party (TTP). The methodology involves a first-pass match which identifies patients on the basis of NHS number; where this fails, a second pass is undertaken matching probabilistically on gender, date of birth and postcode. The overall proportion of patients identified on NHS number is 91.73%. The TTP undertakes the linkage process and makes available a GPRD identifier–external linked data set identifier pair that can be used to integrate linked data with the primary care record. The data governance of this process is based around a separation of function and clear, unambiguous rules that prevent any one party being in possession of any data sets that have potentially recognisable data signatures that could be used to identify patients. Linkages are subject to approval and regulation by the National Information Governance Board. Currently, the linkage is restricted to English practices as all approvals are required at the national level. Future plans include the consideration of extending the linkage plan to Scotland, Wales and possibly Northern Ireland. A total of 302 of the 465 English practices have consented to linkage, which represents approximately 65% of the contributing practices in England, and roughly 5% of the population of England.
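The two-pass matching described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the TTP's actual algorithm: the `Person` record and the `link` function are hypothetical, and the second pass here accepts only a single exact agreement on gender, date of birth and postcode, whereas the text describes that pass as probabilistic.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Person(NamedTuple):
    # Hypothetical identifier record as it might be held by the trusted third party.
    record_id: str
    nhs_number: Optional[str]
    gender: str
    date_of_birth: str  # ISO format, e.g. "1950-01-31"
    postcode: str


def _demographic_key(p: Person) -> Tuple[str, str, str]:
    # Normalise postcode spacing and case before comparison.
    return (p.gender, p.date_of_birth, p.postcode.replace(" ", "").upper())


def link(primary: List[Person], external: List[Person]) -> List[Tuple[str, str, str]]:
    """Return (primary id, external id, match type) pairs."""
    by_nhs: Dict[str, Person] = {e.nhs_number: e for e in external if e.nhs_number}
    by_key: Dict[Tuple[str, str, str], List[Person]] = {}
    for e in external:
        by_key.setdefault(_demographic_key(e), []).append(e)

    pairs: List[Tuple[str, str, str]] = []
    for p in primary:
        # First pass: match on NHS number.
        match = by_nhs.get(p.nhs_number) if p.nhs_number else None
        if match is not None:
            pairs.append((p.record_id, match.record_id, "nhs_number"))
            continue
        # Second pass (simplified): accept only an unambiguous demographic match.
        candidates = by_key.get(_demographic_key(p), [])
        if len(candidates) == 1:
            pairs.append((p.record_id, candidates[0].record_id, "demographic"))
    return pairs
```

Only the identifier pairs leave the function, mirroring the separation of function described above: the party performing the match returns linkage keys rather than clinical data.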
Data sets that have been integrated include secondary care hospital episode statistics, ONS death certification data, socioeconomic classification data (aggregated at the lower super output area level) and disease registry data including the National Cancer Intelligence Network (NCIN) and the Myocardial Ischaemia National Audit Project (MINAP) register of myocardial infarctions. These linkages are ongoing and are updated on a quarterly basis. Additional linkages have been implemented for particular studies, including pollution-level data and linkages to other cohorts such as the Avon Longitudinal Study of Parents and Children (ALSPAC). Linkage to other data sets is pending.
Methodological advances
The last few years have seen an increase in the volume of research undertaken by the GPRD research department. A consequence of this has been the exploration of new areas of application and methodology, aided by access to linked data sets. Much work has been required simply to understand the parameters of a new data set comprising data streams from multiple data sources. For example, considering the composite data comprising primary care data, hospital episode data and death certification data, we have three data streams with separate left and right censoring points, each with its own characteristics in terms of purpose (clinical care or administrative), function and recording. In this situation, the simple task of attempting to match a diagnosis or a fatal event in two or all three of these data streams becomes a complex one. There is a need to investigate the best methods of utilising these data, and a number of analyses [Eaton et al. 2010] have been initiated in GPRD to begin to develop these parameters.
The development of novel methodologies is occurring in a number of areas and a brief description of some of these programmes follows.
An area of research seen as central to GPRD is that of pharmacovigilance. Many studies have used GPRD as a data set in which to test hypotheses in response to safety signals generated by other data sources such as adverse event reports. Recent examples of these include studies by Douglas and colleagues and van Staa and colleagues [Douglas et al. 2009; van Staa et al. 2008a]. The potential to generate drug event pairs by data mining within GPRD has been resisted due to the fact that once utilised for this purpose this same data cannot be used to test this signal. Despite the large size of GPRD, hypothesis testing in newly marketed drugs often lacks sufficient power due to small numbers of users and ring fencing a proportion of the database for data mining would only exacerbate this problem. One study [van Staa et al. 2008b] presents an alternative innovative methodology utilising primary care data from GPRD to provide real-world estimates of the harm–benefit balance of cyclo-oxygenase-2 (Cox-2) inhibitors. The method incorporates the relative rates (for example, from RCTs) of various outcomes for Cox-2 and sets them within the observational real-world data of GPRD. This generates a harm–benefit ratio profile of Cox-2 inhibitors in a real-life study, across specific age, gender and risk factor combinations. This work represents a departure from the standard relative rate estimation of potential adverse events within databases such as GPRD, and is potentially much more useful as it enables individualised risk–benefit decisions to be considered when deciding whether or not to prescribe to patients. It is anticipated that this model could be used as a pharmacovigilance safety monitoring tool given appropriate levels of data for newly launched medicines that will be available upon the development of systems for extracting targeted data from large collections of GP systems on a cross-platform basis.
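The general idea of setting trial-derived relative rates within observed baseline rates can be sketched as follows. This is not the published model: the function names are hypothetical, the outcomes are generic placeholders and the numbers are invented. It shows only the basic arithmetic, in which the excess event rate in a stratum equals the baseline rate multiplied by (relative rate − 1).

```python
from typing import Dict


def excess_events_per_1000(baseline_rate_per_1000: float, relative_rate: float) -> float:
    """Excess events per 1000 person-years attributable to treatment.

    Negative values are events avoided (benefit); positive values are
    events induced (harm).
    """
    return baseline_rate_per_1000 * (relative_rate - 1.0)


def harm_benefit_profile(
    baseline: Dict[str, float], relative: Dict[str, float]
) -> Dict[str, float]:
    """Apply trial relative rates to the baseline rates observed in one stratum."""
    return {
        outcome: excess_events_per_1000(baseline[outcome], relative[outcome])
        for outcome in baseline
    }


# Invented numbers for one hypothetical age/gender/risk-factor stratum.
baseline_rates = {"outcome_avoided": 10.0, "outcome_induced": 5.0}  # per 1000 person-years
relative_rates = {"outcome_avoided": 0.5, "outcome_induced": 1.4}   # assumed to come from trials
profile = harm_benefit_profile(baseline_rates, relative_rates)
net_excess = sum(profile.values())  # below zero means more events avoided than induced
```

Because the baseline rates differ by stratum, the same relative rates can produce a favourable balance in one group of patients and an unfavourable one in another, which is what makes individualised decisions possible.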
Research based upon observational data is susceptible to bias and confounding and it is an understanding of this which is critical to the interpretation of study findings. In the field of clinical trials, the double blind randomised controlled trial (RCT) is seen as the gold standard and is often lauded as the only way to obtain a true measurement of exposure effects. However, such studies operate within an artificial environment that does not reflect real-life clinical practice [van Staa et al. 2009]. GPRD is currently embarked on a project to enable randomisation of patients into a pragmatic RCT study direct from the primary care clinical practice, with standard routine collection of their electronic record continuing in the usual way. The focus of such studies will be low-risk interventions, such as randomisation to one or another statin. Such studies can have very long-term follow up from their standard primary care health record within GPRD, with cause of death being available for nearly all patients. The model has been designed to minimise the impact on the GP in terms of recruiting and administering interventions in order to reduce obstacles to recruitment. Potential study patients can be identified in a number of ways: on the basis of recruitment at next visit, via an invitation to attend a clinic, or even in real time based on a diagnosis being recorded. Mediation of the notification to the GP and also of the recruitment process is via an automated system integrated with the GP patient system, alerting the GP and directing them to a recruitment web site for consent and randomisation [Tyson et al. 2011]. As an interventional study the project is subject to full ethical approval and will be required to comply with Good Clinical Practice regulations, and systems are being developed to ensure that all aspects of these regulations are complied with.
This new methodology provides a vehicle for the running of pragmatic real-world randomised trials at a fraction of the cost of full RCTs: the randomisation will go a considerable way to eliminating bias and confounding, and the setting within a real-world context will enable the results of such studies to relate more directly to actual clinical practice. As a tool in the epidemiological toolkit this will provide a useful adjunct to current methodologies and the results and conclusions they produce.
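The randomisation step itself can be illustrated with a short sketch. This is not the GPRD recruitment system, whose internals are not described here: the arm names, the block size and the `BlockRandomiser` class are assumptions made for illustration. The sketch uses permuted blocks, one common way of keeping two arms balanced as patients are recruited one at a time.

```python
import random


class BlockRandomiser:
    """Allocate patients to arms using permuted blocks (illustrative only)."""

    def __init__(self, arms=("statin_A", "statin_B"), block_size=4, seed=None):
        if block_size % len(arms) != 0:
            raise ValueError("block_size must be a multiple of the number of arms")
        self._arms = tuple(arms)
        self._block_size = block_size
        self._rng = random.Random(seed)
        self._pending = []  # allocations left in the current block

    def allocate(self):
        """Return the arm for the next consenting patient."""
        if not self._pending:
            # Build a new block with equal numbers of each arm, then shuffle it.
            block = list(self._arms) * (self._block_size // len(self._arms))
            self._rng.shuffle(block)
            self._pending = block
        return self._pending.pop()


randomiser = BlockRandomiser(seed=2011)
allocations = [randomiser.allocate() for _ in range(8)]
print(allocations)
```

Within every completed block of four patients, two are allocated to each arm, so the arms never drift far out of balance even if recruitment stops early.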
Similarly, a current project with GPRD collaboration involves interventions randomised at the practice level. Prompted by the findings of prior work [Dregan et al. 2011], it involves cluster randomisation and patient recruitment and uses the same central methodology [Gulliford et al. 2011]. This model would also be useful to facilitate recruitment for an existing pharmacogenetic study, ‘Statin-Induced Muscle Toxicity: Exploration Using the UK General Practice Research Database (GPRD)’ [STAGE, 2011], which is being run collaboratively with the University of Liverpool and funded by the MRC and the Wellcome Trust. Consenting practices have been recruiting statin users for genetic sampling, with patients providing either a saliva or a blood sample. Once more, the intervention is minimal and follow up is taken care of by standard GPRD data collections and the linked data sets. Such studies will enable us to look at the relationship between recognised rare adverse events and genetic profiles or particular genetic markers. This study is already well into its recruiting phase, but future studies of this kind could be set up more efficiently by integrating them with the recruiting mechanism of the pragmatic RCTs.
Free text was described above as a component of primary care data. Its prevalence varies, partly in relation to the underlying system being used. The ability to utilise free text is potentially very useful. This has been demonstrated in the past for identifying cause of death from free text [Shah and Martinez, 2004] and more recently in a project exploring the role of free text in identifying diagnoses, signs and symptoms in patients with ovarian cancer [Koeling et al. 2011]. Consequently, a further area of new research in GPRD relates to the utilisation of free text collected from the practices. As a critical component of any GP system, free text is used to annotate coded data and also, in the form of letters and emails, to communicate between disparate sets of healthcare providers. Free text, by its nature, has no prescribed structure and may well contain information that would be inappropriate for researchers to view directly, such as patient, doctor and hospital names and other identifiers. Individual free text records can be coded in such a way as to prevent them being downloaded, and there is significant variation in the volume and characteristics of the free text received as part of practice data downloads. As yet no specific studies testing the validity of free text have been conducted. However, across the many studies in which free text and original patient paper records (provided by the GP and appropriately anonymised) have been utilised, no major inconsistencies with the original source data have been reported. Cross validation of free text with linked data sets will be possible in the future.
In the GPRD Gold system, free text is processed sight unseen and indexed so that it can be retrieved for anonymisation and provision to researchers if required. Free text tools were enhanced when GPRD Gold was developed. It is currently possible to extract and anonymise selected text for use in studies, and to undertake keyword searches within selected blocks of text or across the entire free text repository. For example, it has been demonstrated that keyword searches for biological anticancer treatments (not prescribed at all in primary care) can identify up to 30,000 patients: a potential, if imperfect, method for identifying patients using non-primary-care treatments.
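A keyword search of this kind can be sketched in a few lines. This is an illustration only and not the GPRD Gold tooling: the record layout (patient identifier plus text), the function name `find_patients_by_keyword` and the two drug names used as keywords are assumptions, and the example notes are invented.

```python
import re


def find_patients_by_keyword(records, keywords):
    """Return the set of patient ids with at least one free text record
    mentioning any keyword (whole-word, case-insensitive match).

    records:  iterable of (patient_id, text) pairs
    keywords: iterable of search terms
    """
    keywords = list(keywords)
    if not keywords:
        return set()
    # Escape each term so that punctuation in a drug name is matched literally.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {patient_id for patient_id, text in records if pattern.search(text)}


# Invented, already-anonymised example notes.
notes = [
    (1, "Started on trastuzumab by oncology"),
    (2, "BP check, no concerns"),
    (3, "Letter from hospital: RITUXIMAB cycle 2 given"),
    (1, "Routine review"),
]

print(sorted(find_patients_by_keyword(notes, ["trastuzumab", "rituximab"])))
```

Exact keyword matching misses misspellings and abbreviations and cannot tell a treatment that was given from one that was merely discussed, which is one reason such a method is imperfect.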
Free text is associated with all clinical event types, including diagnosis and history events, referral events, test result events, prescribing and immunisation. Different event types have characteristically different types of free text: diagnosis and history events often contain annotation, whereas referral events are associated with a higher proportion of communications such as referral, outpatient or discharge letters. Test events often have machine-generated results in text format, which have a regular structure and form. Prescribing free text consists of dosage instructions, which are largely formalised and are currently handled within existing systems. GPRD is involved in a funded research program, part of which is exploring free text using simple strategies as well as more complex natural language processing (NLP) approaches. Although the application of NLP to primary care free text is in its infancy, it has shown that shorter unstructured annotations conform to different language models from the longer, more grammatically structured letters when used as a basis for defining synonyms for terms. This is mainly because grammatically sound text can be analysed by parsing algorithms, which provide highly informative grammatical relations between words, whereas the poor success rate of standard parsing techniques on noisy unstructured annotations forces the use of much less informative proximity relations. However, there is evidence [McCarthy et al. 2007] that thesauruses built using proximity relations can be of similar quality to those built using grammatical relations when sufficient training material is available.
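To make the idea of proximity relations concrete, the following sketch builds a very small distributional thesaurus from tokenised notes. It is a generic window-based co-occurrence model with cosine similarity, not the method of McCarthy and colleagues; the window size, the function names and the toy notes are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbours(vectors, word, k=3):
    """Return the k words whose contexts are most similar to those of `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec)) for other, vec in vectors.items() if other != word
    ]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Invented, tokenised annotations.
notes = [
    ["pt", "complains", "of", "abdo", "pain"],
    ["pt", "complains", "of", "abdominal", "pain"],
    ["bp", "check", "normal"],
]

vectors = cooccurrence_vectors(notes)
print(nearest_neighbours(vectors, "abdo", k=1))
```

Because "abdo" and "abdominal" occur in identical contexts in the toy notes, each is the other's nearest neighbour, which is the behaviour a thesaurus for codifying free text would rely on. Real annotations are far noisier, and a usable thesaurus needs a large volume of training text.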
The building of a type of free-text thesaurus has potential utility in terms of attempting to codify free text automatically. Research is planned on forms of free text to develop a suite of free-text tools to maximise the utility of this data whilst conforming to the strict data governance rules required with this form of data.
Future direction
Over the last 5 years, since the latest reviews [Wood and Martinez, 2004; Parkinson et al. 2007], there have been many changes in the arena of primary care databases in the UK. Two new large-scale data sources have come into being, and access to primary care data has been broadened by schemes such as the GPRD–MRC academic funding agreement. The pool of expertise in the use of primary care data has grown correspondingly, and progress has been made in developing new methodologies. Pragmatic randomised trials within the database, cluster RCTs and pharmacogenetic studies are underway. In terms of supplementing primary care data, data from disparate NHS data sets, including hospital episode statistics, death certification and disease registries, have now been linked at the patient level. In parallel, the Research Capability Programme, part of the NHS National Institute for Health Research (NIHR), has been piloting new linkages and new honest broker methodologies for undertaking such linkages. 2012 will see a new service launched as a partnership between the MHRA, as the GPRD host organisation, and the NIHR. This new research service, called the Clinical Practice Research Datalink (CPRD), will be a national linkage service covering the 52 million population of England, will have access to many NHS data sets, and will develop and enable an embedded electronic case report form for primary care clinical trials within primary care EHR IT systems, working in conjunction with the Primary Care Research Network (PCRN). The CPRD takes a federated approach to the delivery of research data and other services and will work collaboratively with all existing databases and services with the sole objective of increasing the volume of research undertaken in the UK; research that will improve both the health and the wealth of the nation.
With such a broad range of activity and progress, it is clear there will be further work required to develop the systems needed to administer and run such research on a large scale. Data quality assessment such as that being undertaken within GPRD will need to be developed to explore how various data sources interface with one another, and methodologies to utilise disparate data sets from different healthcare contexts will be required.
Acknowledgments
The views expressed in this paper are those of the authors and do not reflect the official policy or position of the Medicines and Healthcare products Regulatory Agency (MHRA). GPRD is owned by the UK Department of Health and operates within the MHRA. GPRD has received funding from the MHRA, Wellcome Trust, Medical Research Council, NIHR Health Technology Assessment programme, Innovative Medicines Initiative, UK Department of Health, Technology Strategy Board, Seventh Framework Programme EU, various universities, contract research organisations and pharmaceutical companies.
Footnotes
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
The authors are employees within the GPRD group of MHRA.
References
- Boggon R., van Staa T.P., Timmis A., Hemingway H., Ray K.K., Begg A., et al. (2011) Clopidogrel discontinuation after acute coronary syndromes: frequency, predictors and associations with death and myocardial infarction – a hospital registry-primary care linked cohort (MINAP-GPRD). Eur Heart J 32: 2376–2386
- Bourke A., Dattani H., Robinson M. (2004) Feasibility study and methodology to create a quality-evaluated database of primary care data. Inform Prim Care 12: 171–177
- Carey I.M., Cook D.G., De Wilde S., Bremner S.A., Richards N., Caine S., et al. (2004) Developing a large electronic primary care database (Doctors’ Independent Network) for research. Int J Med Inform 73: 443–453
- de Lusignan S., van Vlymen J., Hague N., Dhoul N. (2006) Using computers to identify non-compliant people at increased risk of osteoporotic fractures in general practice: a cross-sectional study. Osteoporosis Int 17: 1808–1814
- de Lusignan S., van Weel C. (2006) The use of routinely collected computer data for research in primary care: opportunities and challenges. Fam Pract 23: 253–263
- Dregan A., Toschke M.A., Wolfe C.D., Rudd A., Ashworth M., Gulliford M.C. (2011) Utility of electronic patient records in primary care for stroke secondary prevention trials. BMC Public Health 11 (1): 86
- Douglas I.J., Evans S.J., Pocock S., Smeeth L. (2009) The risk of fractures associated with thiazolidinediones: a self-controlled case-series study. PLoS Med 6 (9): e1000154
- Eaton S., Setakis E., Williams T., van Staa T.P. (2010) Linking Primary Care Data (UK GPRD) to Hospital Records (HES). Pharmacoepidemiol Drug Saf 19: S195
- GPRD (2011) GPRD Bibliography. http://www.gprd.co.uk/_docs/Bibliography_Jan%202011.pdf
- Gulliford M.C., Charlton J., Ashworth M., Rudd A.G., Toschke M.A. (2009) Selection of medical diagnostic codes for analysis of electronic patient records. Application to stroke in a primary care database. PLoS ONE 4 (9): e7168
- Gulliford M.C., van Staa T., McDermott L., Dregan A., McCann G., Ashworth M., et al. for the electronic Cluster Randomised Trial (eCRT) Research Team (2011) Cluster randomised trial in the General Practice Research Database: 1. Electronic decision support to reduce antibiotic prescribing in primary care (eCRT study). Trials 12: 115
- Herrett E., Thomas S.L., Schoonen W.M., Smeeth L., Hall A.J. (2010) Validation and validity of diagnoses in the general practice research database: a systematic review. Br J Clin Pharmacol 69: 4–14
- Hippisley-Cox J., Stables D., Pringle M. (2004) QRESEARCH: a new general practice database for research. Inform Prim Care 12: 49–50
- Khan N.F., Harrison S.E., Rose P.W. (2010) Validity of diagnostic coding within the general practice research database: a systematic review. Br J Gen Pract 60 (572): e128–e136
- Koeling R., Tate A.R., Carroll J. (2011) Automatically estimating the incidence of symptoms recorded in GP free text notes. In Proceedings of the First International Workshop on Managing Interoperability and Complexity in Health Systems (MIXHS'11), Glasgow, UK
- Lawrenson R., Williams T., Farmer R. (1999) Clinical information for research; the use of general practice databases. J Public Health Med 21: 299–304
- Lawson D.H., Sherman V., Hollowell J. (1998) The general practice research database. Q J Med 91: 445–452
- McCarthy D., Koeling R., Weeds J., Carroll J. (2007) Unsupervised Acquisition of Predominant Word Senses. Computat Linguist 33: 553–590
- NHS Connecting for Health Implementation Guidance team (2007) The National Programme for IT Implementation Guide – version 5. Available at: http://www.connectingforhealth.nhs.uk/systemsandservices/implementation/docs/national_programme_implementation_guide.pdf
- Parkinson J., Davis S., van Staa T. (2007) The general practice research database: now and the future. In Mann R.D., Andrews E.B. (eds), Pharmacovigilance, 2nd Ed. Chichester: John Wiley & Sons, Ltd
- Shah A.D., Martinez C. (2004) An algorithm to extract medical codes for diagnoses from unstructured free text recorded by general practitioners in the UK. Pharmacoepidemiol Drug Saf 13: S41–S42
- STAGE (2011) STAtin-induced muscle toxicity: exploration using the UK GEneral Practice Research Database (GPRD). http://www.gprd.co.uk/stage/home/
- Tate A.R., Beloff N., Puri S., Williams T., Van Staa T.P. (2011) Developing quality scores for electronic health records for clinical research: a study using the General Practice Research Database. In ACM Proceedings of MIXHS11, 28 October 2011, Glasgow, Scotland, UK. Glasgow: Sheridan
- Tyson G., Taweel A., Zschaler S., van Staa T., Delaney B. (2011) A model-driven approach to interoperability and integration in systems of systems. In Proceedings of the 7th European Conference on Modelling Foundations and Applications: Workshop on Model-Based Software and Data Integration (MBSDI), Birmingham, England
- Vamos E.P., Pape U.J., Bottle A., Hamilton F.L., Curcin V., Ng A., et al. (2011) Association of practice size and pay-for-performance incentives with the quality of diabetes management in primary care. CMAJ 183 (12): E809–E816
- van Staa T.P., Leufkens H.G., Zhang B., Smeeth L. (2009) A comparison of cost effectiveness using data from randomized trials or actual clinical practice: selective Cox-2 inhibitors as an example. PLoS Med 6 (12): e1000194
- van Staa T.P., Rietbrock S., Setakis E., Leufkens H.G. (2008a) Does the varied use of NSAIDs explain the differences in the risk of myocardial infarction? J Intern Med 264: 481–492
- van Staa T.P., Smeeth L., Persson I., Parkinson J., Leufkens H.G. (2008b) What is the harm-benefit ratio of Cox-2 inhibitors? Int J Epidemiol 37: 405–413
- Whalley T., Mantgani A. (1997) The UK General Practice Research Database. Lancet 350: 1097–1099
- Wood L., Coulson R. (2001) Revitalizing the general practice research database: plans, challenges, and opportunities. Pharmacoepidemiol Drug Saf 10: 379–383
- Wood L., Martinez C. (2004) The general practice research database: role in pharmacovigilance. Drug Saf 27: 871–881