Summary
This article looks at the use of large health records datasets, typically linked with other data sources, in mental health research. The most comprehensive examples of this kind of big data are found in the Scandinavian countries; however, there are also many useful sources in the UK. Studies using big data in UK mental health research have produced a number of promising methodological innovations, including hybrid study designs, novel data linkages and enhanced study recruitment. It is, though, important to be aware of the limitations of research using big data, particularly the various analysis pitfalls. We therefore caution against throwing out the methodological baby with the bathwater and argue that other data sources are equally valuable: ideally, research should draw on a range of data.
Introduction
In recent years much has been written about ‘big data’, to such an extent that the literature on this topic is now almost as dizzying in magnitude as the data it describes. In this short article we aim to highlight, in a non-technical way, some of the advantages and disadvantages of these resources, both for those actively involved in research and for clinicians who need to assess the value and clinical relevance of research evidence. We concentrate on one kind of big data: large datasets of health records, typically linked to other large datasets, including administrative and census data. This kind of data already plays an important part in psychiatric research and its role is expanding rapidly.
What do we mean by big data?
Arguably the most successful examples in psychiatric research have come from Scandinavian population registers. These comprise health records collected for the entire population over many years, linked to a range of administrative data. They have particular value for psychiatric research for several reasons: they provide information about those who would otherwise be hard to reach using conventional survey approaches; their scale makes it possible to answer questions about disorders that are relatively rare; and, with data often collected over a long period, exposures can be measured independently of, and well before, mental health outcomes. The latter is particularly important when studying risk factors for severe mental illness. For example, studies have long shown elevated rates of psychosis in urban areas, although this could simply reflect ‘social drift’, where those who are ill or in the prodromal phase ‘drift’ into urban areas because of their illness. Using Danish whole-population data, Pedersen and Mortensen (2001) showed that urban upbringing itself was associated with greatly increased rates of psychosis in later life. By measuring the exposure during childhood, rather than adulthood as previous studies had done, a causal path could be more clearly established.
A key component of this kind of population registry data is that every citizen has a unique personal identification number included in all their official records. This makes it possible to easily link individual health records over time and to link data across a wide range of different domains. For example, records for psychiatric in-patient stays can be linked to outpatient appointments, medication use, tax and employment records, migration and educational data (Norredam et al., 2011; Pedersen and Mortensen, 2001; Schofield et al., 2017). This can also be linked to blood samples from which it is possible to extract DNA for genetic research (Agerbo et al., 2015).
Scandinavian countries are not alone in making population health records available for research. A recent (2015) OECD report also highlighted Korea, Singapore, Israel, New Zealand and the United Kingdom as scoring highly on the availability of population health data for research (OECD, 2015). However, Scandinavian countries have the advantage that, because this data has been collected in electronic form since the 1960s, it is now possible to access population cohort data over much of the life course (Rosen, 2002).
Increasingly, similar resources applicable to mental health research are being developed in the UK. These include the Scottish SHELS study (Bhopal et al., 2011) (see below), primary care data such as the Clinical Practice Research Datalink (CPRD), and linked psychiatric case register databases such as the Clinical Record Interactive Search (CRIS) (also described below). Attempting linkages across administrative and health records without an equivalent universal personal identification number can, however, be methodologically challenging.
What can ‘big data’ in mental health really achieve?
Increasingly, new resources with exciting potential applications to mental health research are being created in the UK (McIntosh et al., 2016). However, as we have highlighted, large-scale electronic data resources have existed for many decades in other settings. Some of the methodological advances made with them can inform how we use the resources now becoming available in the UK.
Big data can facilitate novel experimental designs
Some would argue that the ‘gold-standard’ for evidence is the well-conducted randomised controlled trial. Yet, in many situations it may be challenging or even impossible to conduct these types of studies. For example, for rare outcomes such as suicide following a self-harm episode, it can be difficult to ensure that sample sizes are adequate or that the follow-up period is long enough to detect the possible beneficial effects of an intervention.
Within this context, large routine electronic datasets may help to assess which types of interventions are beneficial, even when a randomised controlled trial is not possible. For example, Erlangsen and colleagues used whole-population data from Denmark to assess the role of a psychosocial intervention in reducing subsequent suicide risk in a national sample of people who had self-harmed (Erlangsen et al., 2015). The authors used propensity scores, an approach which matches individuals in observational data on variables that predict treatment receipt, thereby approximating the balance normally achieved by randomisation in a well-conducted trial. By achieving this balance in a cohort of people who had self-harmed (5,678 who had received the psychosocial therapy and 17,034 who had not), the investigators were able to establish that a psychosocial intervention focused on suicide prevention after an initial episode of self-harm reduced the risk of repeated self-harm and death from any cause at one year after the index event, and was also associated with a reduced risk of repeated self-harm, death by suicide and death from any cause 5-20 years after the intervention (Erlangsen et al., 2015). Studies such as this are a powerful example of how routine electronic data on large samples might be applied where standard randomised controlled trials are not feasible. Despite the sophisticated methodology, the authors still caution that selection bias, and the lack of more detailed information on what the ‘psychosocial therapy’ entailed, were potential limitations (Erlangsen et al., 2015). Nevertheless, work like this highlights the potential of big data to make important contributions to the mental health evidence base.
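The logic of propensity-score matching can be sketched in a few lines of code. The example below is an illustrative toy, not a reconstruction of the Erlangsen analysis: it simulates a cohort in which an invented ‘severity’ covariate drives treatment receipt, estimates each individual’s propensity score with a simple one-covariate logistic regression, and greedily matches each treated individual to the untreated individual with the nearest score. All names, figures and data-generating assumptions are ours.

```python
import math
import random

random.seed(42)

# Toy cohort: an invented 'severity' covariate drives both the chance of
# receiving the intervention and (in a real analysis) the outcome.
N = 2000
severity = [random.gauss(0.0, 1.0) for _ in range(N)]
# Treatment is uncommon and more likely at higher severity, so treated and
# untreated groups differ systematically at baseline.
treated = [1 if random.random() < 1 / (1 + math.exp(-(-2.0 + 0.8 * s))) else 0
           for s in severity]

def fit_logistic(x, y, lr=0.5, epochs=300):
    """One-covariate logistic regression fitted by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi
            g1 += (p - yi) * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# The propensity score: each person's modelled probability of treatment.
b0, b1 = fit_logistic(severity, treated)
propensity = [1 / (1 + math.exp(-(b0 + b1 * s))) for s in severity]

treated_idx = [i for i in range(N) if treated[i]]
control_pool = set(i for i in range(N) if not treated[i])

# Greedy 1:1 matching: each treated person gets the unused control with the
# nearest propensity score.
matches = []
for i in treated_idx:
    j = min(control_pool, key=lambda k: abs(propensity[k] - propensity[i]))
    control_pool.remove(j)
    matches.append((i, j))

def mean(values):
    return sum(values) / len(values)

before = abs(mean([severity[i] for i in treated_idx]) -
             mean([severity[i] for i in range(N) if not treated[i]]))
after = abs(mean([severity[i] for i, _ in matches]) -
            mean([severity[j] for _, j in matches]))
print(f"severity imbalance before matching {before:.2f}, after {after:.2f}")
```

The point of the sketch is the balance check at the end: after matching, the treated group and its matched controls resemble each other on the covariate that drove treatment assignment, mimicking the balance that randomisation would have provided.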
Big data can enhance recruitment
Recruitment to clinical trials can be difficult, with specific challenges in recruiting people with mental disorders (Howard et al., 2009). This may partly reflect clinicians acting as gatekeepers, to the extent that patients may not be offered the opportunity to participate in studies even when they wish to do so (Callard et al., 2014; Patel et al., 2017). One innovative system to enhance the recruitment of service users into research studies, developed in partnership with service users and based on an anonymised electronic health record system, is described in Box 1. C4C is an extremely innovative example of what may be possible using large electronic health record resources when models are developed in partnership with service users.
Box 1. Consent for Contact (C4C).
The ‘consent for contact’ (C4C) system in the South London and Maudsley (SLaM) Trust is an innovative example of a system whereby the autonomy of patients wishing to take part in mental health research is enhanced through a robustly anonymised electronic health record system (Callard et al., 2014). It was developed with considerable patient and service user involvement and is based on the SLaM BRC Case Register, a fully de-identified health records system for a large mental health Trust, containing over 250,000 patient records and covering a catchment area of 1.2 million people (Perera et al., 2016). In the C4C system, care coordinators or others in the patient’s team can consent patients to join the C4C register, through a procedure which makes clear that patients are joining a register through which they may be contacted in future about taking part in research across a range of areas (rather than consenting to a specific research project), and that they can refuse at any point (Callard et al., 2014). Once a patient consents to join the C4C register, this is flagged on their electronic health record. With the numbers consented now in the thousands (Oduola et al., 2017), this is an invaluable resource for investigators who might otherwise struggle to recruit hard-to-reach or underserved populations.
Big data can enhance randomised controlled trials
Embedding well-designed randomised controlled trials within everyday clinical practice, where electronic health records have fully replaced paper-based note-keeping (Gulliford et al., 2014; McIntosh et al., 2016), allowing “randomisation at point of routine care” (van Staa et al., 2012), is another innovative study design yet to be fully realised within psychiatry. As mental health Trusts increasingly move towards fully electronic medical records, such a ‘mixed design’, which intermeshes the clinical trial with the scale and automated data collection of the electronic health record, has obvious logistic and cost advantages. For example, trial participants may be followed up through data collected automatically in the electronic health record, which may be of particular value for capturing adverse events (van Staa et al., 2012). This type of design has already been employed in trials of antibiotic prescribing and stroke prevention (Gulliford et al., 2014) and could feasibly also be applied to studies of mental health interventions.
Big data can enhance health records through data linkage
Unlinked health records may be missing important information, which can hamper analyses. Important indicators of health outcomes, or key sociodemographic variables such as ethnicity, are frequently poorly recorded or of variable quality (Bhopal et al., 2011). Linked datasets (Box 2) allow information to be brought in from different sources, potentially creating large cohorts, or datasets of individuals with less prevalent conditions, at a scale difficult to achieve in traditional epidemiological studies, which may be hampered by challenges of recruitment, loss to follow-up/attrition and falling participation rates (Knudsen et al., 2010). Linkage to routine sources can additionally ‘plug the gap’ where important indicators, such as self-ascribed ethnicity, can be brought in via the linkage (Bhopal et al., 2011). For example, a linkage of health data to census records in Scotland highlighted ethnic minority mental health inequalities specific to the devolved Scottish context (Bansal et al., 2014; Bhopal et al., 2011). Traditional studies using unlinked routine data could not have achieved this, as ethnicity was not routinely recorded in Scottish health records at the time of the study. In England, the linkage of death certificate information to records from mental health Trusts has informed our understanding of premature mortality in severe mental illness (Chang et al., 2011; Das-Munshi et al., 2017) as well as in conditions such as chronic fatigue syndrome (Roberts et al., 2016). In each of these examples, linkage enabled a sample size that allowed sufficiently powered analyses.
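At its simplest, deterministic linkage of the kind described above is a keyed join. The sketch below is purely illustrative (all records, field names and the choice of date of birth plus postcode as a linkage key are invented for the example): it pulls self-ascribed ethnicity from a census-style extract into health records that lack it, leaving unmatched records as explicitly unknown.

```python
# All records, field names and the linkage key below are invented for
# illustration only.
health_records = [
    {"dob": "1980-03-02", "postcode": "SE5 8AF", "diagnosis": "F20"},
    {"dob": "1975-11-19", "postcode": "G12 8QQ", "diagnosis": "F32"},
    {"dob": "1990-07-04", "postcode": "M1 1AE", "diagnosis": "F41"},
]
# Census-style extract holding the self-ascribed ethnicity the health
# records lack, keyed on date of birth plus postcode.
census = {
    ("1980-03-02", "SE5 8AF"): "Black Caribbean",
    ("1975-11-19", "G12 8QQ"): "White Scottish",
}

linked = []
for record in health_records:
    key = (record["dob"], record["postcode"])
    # Unmatched records keep an explicit unknown rather than being dropped.
    linked.append({**record, "ethnicity": census.get(key)})

for record in linked:
    print(record)
```

In practice, without a universal personal identifier, keys such as these are error-prone, which is one reason real linkages often rely on probabilistic matching and careful quality checks.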
Box 2. Examples of linkages and clinical applications.
1. Death certificate information linked to electronic health records
In a study of severe mental illness, the investigators linked electronic health records from a large case registry from a mental health Trust in London to death certificate information. This study highlighted a substantially lower life expectancy in people with severe mental disorders, with the greatest reduction in men with schizophrenia (14.6 years lost) and women with schizoaffective disorders (17.5 years lost) (Chang et al., 2011).
2. National pupil database linkages to mental health records
A recent linkage of mental health data with the National Pupil Database has made it possible to bring together teacher-assessed measures of development and special educational needs from the schools’ database with clinical mental health data (Downs et al., 2017). Because this linkage is based on real-time electronic health records, it has the potential to inform service development and to serve as a tool for monitoring and evaluating service improvements.
3. Primary care linkages to mental health records
People with severe mental illness (SMI) experience premature mortality, with most deaths due to preventable physical conditions such as cardiovascular disease. The National Audit of Schizophrenia revealed low levels of recording of physical health indicators, such as body mass index (BMI), in people with SMI (Crawford et al., 2014). In the UK, most people are registered with a general practitioner/family doctor in primary care, which is where most physical health care is monitored and recorded. Linkage of primary care records to secondary mental healthcare records can therefore shed light on the quality of physical healthcare received by people with SMI. For example, in a study using such a linkage, patients with SMI and comorbid coronary heart disease or heart failure were more likely to receive sub-optimal treatment for these conditions than people without SMI (Woodhead et al., 2016). This was especially the case for people prescribed depot antipsychotic medications, those identified as having more severe SMI, and those with one or more recorded risk events.
What big data cannot do
While this kind of big data clearly has enormous advantages for the research that is done now and that will be possible in the future, it is easy to lose sight of some of the inherent limitations that come with these resources.
Big data cannot replace statistical analysis
A common misconception about data covering an entire population is that we no longer need to be concerned with the statistical significance of our findings. Because statistical theory presupposes that the data being analysed can be treated as a random sample of some wider population of interest, it is often assumed that, if we know the health outcomes for the entire population, there is no longer any need for burdensome statistical calculations. We could instead simply report the percentage of people with, say, a diagnosis of schizophrenia among those exposed to some risk factor and among those not exposed, and assume this covers everything. However, research is rarely only about what has already occurred. We intend research findings to be relevant to future cases and to support some over-arching theory. Even the most complete population data will only ever comprise a subset of all possible instances of the phenomenon of interest, and statistical methods are needed to account for this.
Big data cannot predict the future
This brings us to arguably the most common example of ‘big data hubris’: the assumption that the more data we collect, the more accurately we will be able to predict future events. An often-cited example is Google Flu Trends (Box 3). While Google appears to have abandoned this project, others argue that this is far from the end of the story. One wide-ranging review argues that similar algorithms could be successful, although they will require constant updating and improvement and should ideally be used alongside other epidemiological tools (Lazer et al., 2014).
Box 3. Google Flu Trends.
Originally heralded as an exemplar of the use of big data, Google Flu Trends used patterns in large numbers of Google searches to predict localised flu outbreaks (Mayer-Schönberger and Cukier, 2013). By mining how combinations of search terms related to subsequent outbreaks, the resulting algorithms were used to predict future epidemics. Initial success led to claims that Google Flu Trends would ultimately replace costly epidemiological surveys. But this was short lived: changes in the way Google searches were conducted and processed, together with flaws in some of the underlying assumptions, led to overestimates of disease incidence that were no better than predictions based on historical data alone (Lazer et al., 2014). A telling legacy is the official website to which searches for “google flu trends” (Google, 2017) are now directed. Adopting a cheery tone (“thank you for stopping by”), it documents how the models were first developed in 2008 only to be discontinued in 2014, concluding that it is “still early days for ‘nowcasting’”.
Big data cannot make up for the absence of theory
Along with the initial wave of enthusiasm about big data came the idea that mining large datasets for relevant patterns was methodologically valid in itself. Concerns about causality and the reasoning behind the resulting algorithms were seen as irrelevant so long as the algorithms worked, as with Google Flu Trends (Box 3) (Mayer-Schönberger and Cukier, 2013). This is the logic behind ‘machine learning’ approaches, in which a ‘training’ dataset is used to develop an algorithm from a large set of, often arbitrary, variables; the algorithm is then applied to a separate set of ‘test’ data until an optimal predictive tool is arrived at. This kind of ‘black box’ approach is essentially atheoretical: the authors do not need to know why the algorithm works, simply that it works when applied to the test data. It is not hard to see why this might be attractive in mental health research, where many fundamental questions about aetiology remain unanswered. Instead of trying to determine the mechanism that might lead to, say, increased rates of schizophrenia among migrants, a simpler approach might be to arrive at a predictive tool by finding patterns in available data. In the case of Google Flu Trends (Box 3), the algorithms themselves were never made public, so it was impossible to determine why they were ever successful in the first place, or why they subsequently under-performed. While this approach has enormous advantages in many fields, for example machine translation and free-text mining, it is highly problematic in epidemiological research, as Google Flu Trends demonstrated.
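The train/test logic described above can be made concrete with a deliberately trivial example (all data and names are invented): a ‘classifier’ that simply learns, from training data, the score threshold that best separates the labels, with no theory about why the score is predictive, and is then judged only on held-out test data.

```python
import random

random.seed(1)

# Invented data: the 'disorder' label depends (noisily) on a single score.
def simulate(n):
    data = []
    for _ in range(n):
        score = random.uniform(0, 10)
        label = 1 if score + random.gauss(0, 1.5) > 5 else 0
        data.append((score, label))
    return data

train, test = simulate(500), simulate(500)

def accuracy(data, threshold):
    """Fraction of cases where 'score above threshold' matches the label."""
    return sum((score > threshold) == bool(label)
               for score, label in data) / len(data)

# 'Training': pick the threshold that best separates labels in the training
# data, with no model of why the score predicts the label.
best = max((t / 10 for t in range(101)), key=lambda t: accuracy(train, t))

print(f"learned threshold: {best:.1f}")
print(f"held-out test accuracy: {accuracy(test, best):.2f}")
```

The threshold is chosen purely because it works on the training data; the only check on the ‘black box’ is its performance on the test data, which is exactly the atheoretical stance described in the text.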
Big data cannot always be taken at face value
Unlike research data, administrative data seldom comes with documentation explaining how the data was collected or how categories used in the coding were arrived at. For example, if we are interested in rates of mental disorder for different ethnic groups, with survey data we can determine whether ethnicity is self-reported or not as well as the categories used in the original questionnaire. However, if we look at health records, in the UK, it is often impossible to say who provided the ethnic classification or how this was originally coded. This could be a problem if we tried to determine ethnic health differences between different areas but were unable to distinguish between differences in coding methods and underlying health differences.
The way administrative data is presented often suggests a completeness and objectivity that can be misleading if taken at face value. For example, just because GP data contains a field for a diagnosis of depression does not mean that this field is useful for determining prevalence (see Box 4). In some instances it is possible to use a hybrid approach to make up for this: for rare disorders such as psychosis, more accurate diagnostic coding has been achieved using a combination of clinical expertise and machine learning techniques to process detailed symptom data from health records (Gorrell et al., 2016; Patel et al., 2015). Often, however, we do not have detailed symptom data. It is therefore important to be aware that all data is created in a context, whether social, administrative, technical or clinical. If we fail to take this into account, we risk misinterpreting the data we have collected.
Box 4. Depression coding in GP records.
The way data is coded can reflect administrative priorities at odds with research. For example, for some time depression has been under-recorded in primary care data compared with what we know from national surveys (Kendrick et al., 2015; Rait et al., 2009). Much of this results from changes to the way GPs are incentivised: recording a diagnosis of depression could trigger prompts in the clinical record for further action, which many GPs saw as unnecessarily burdensome and not directly relevant to clinical care. Many GPs would therefore simply enter a different term, such as “low mood”. Although this did not affect clinical care, it led to an underestimate of the prevalence of depression in primary care, as the “low mood” term was not captured within diagnostic systems. Without understanding what statisticians term the “data generating mechanism” behind this kind of health records data, it would be easy to misinterpret what appears to be a very low prevalence over this period.
Big data alone cannot solve complex analysis problems
Large datasets of population health records can help solve one of the major challenges of psychiatric research by providing adequately powered samples of the population of interest. However, the challenges of data analysis do not become easier simply because more data is collected. In fact, the larger the dataset, the greater the potential complexity to be accounted for in the analysis. With a small, well-designed trial or survey, potential confounding variables (patterns in the data that could obscure our findings) can often be easily accounted for. Population health records come with no such safeguards and are easily misinterpreted if we fail to account for these patterns. For example, we could misinterpret spatial patterns by failing to account for differences in contextual risk factors, such as urbanicity (see above), or for differences in the reporting practices of mental health Trusts in different parts of the country. Temporal patterns can similarly confound results, for example through changing ICD diagnostic categories: with the change from ICD-9 to ICD-10, the latter showed much higher sensitivity for dementia, which could easily be misinterpreted as an increase in prevalence if trends were examined using health records data alone (Quan et al., 2008).
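The danger of pooling across units with different recording practices can be shown with a small worked example using invented figures (a Simpson’s-paradox-style reversal): in each of two hypothetical trusts the recorded rate is higher in urban areas, yet the naive pooled comparison points the other way.

```python
# Invented figures: two hypothetical mental health trusts with different
# recording practices and different urban/rural catchment mixes.
cohorts = {
    "trust_A": {"urban": (30, 100), "rural": (250, 1000)},   # records generously
    "trust_B": {"urban": (50, 1000), "rural": (4, 100)},     # records sparingly
}

def rate(cases, population):
    return cases / population

# Within every single trust, the recorded rate is higher in urban areas...
for trust, groups in cohorts.items():
    print(f"{trust}: urban {rate(*groups['urban']):.1%} "
          f"vs rural {rate(*groups['rural']):.1%}")

# ...but the naive pooled comparison reverses the direction of the difference,
# because trust membership confounds the urban/rural comparison.
pooled = {
    area: sum(g[area][0] for g in cohorts.values()) /
          sum(g[area][1] for g in cohorts.values())
    for area in ("urban", "rural")
}
print(f"pooled: urban {pooled['urban']:.1%} vs rural {pooled['rural']:.1%}")
```

An analysis that stratifies by (or otherwise models) the reporting unit recovers the within-trust pattern; the pooled figures alone would support exactly the wrong conclusion.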
Accounting for this often requires a quite different approach from the statistical methods used with smaller, more theoretically determined samples. Where a well-designed randomised controlled trial could potentially be analysed using routine techniques, such as a t-test, whole-population data often requires more complex multilevel modelling or Bayesian approaches. While this is becoming easier with the widespread adoption of more advanced statistical methods, these still remain beyond the expertise of many researchers.
Big data cannot make research more replicable
In recent years, increasing concern has been raised about the “replicability crisis” in scientific research, particularly in psychological research. In one recent poll of 1,500 scientists, 70% reported having failed to reproduce another scientist’s experiment and around 50% having failed to reproduce one of their own (Baker, 2016). The examples often given are of small-scale psychology experiments yielding interesting findings that consistently fail to replicate. It could be argued that big data is one solution to this problem: more data means results are more generalisable and therefore more replicable. However, this is to misunderstand the nature of the problem. It is typically not the size of the data that is at issue but the potential for spurious results when the researcher is faced with a multitude of possible interpretations of the data. As datasets become larger and more complex, the number of potential sub-groups to analyse, analysis methods to use and alternative categorisations to adopt increases exponentially. For the unscrupulous researcher this could mean simply re-running the analysis with every possible combination of the above until the results reach the required statistical significance (or “p-value”), a practice known as “p-hacking” (Gelman and Loken, 2014). This has reached the point where the American Statistical Association (ASA) recently felt compelled to issue a formal statement on the correct use of p-values (Wasserstein and Lazar, 2016). For the ASA, the recent expansion in the use of large, complex datasets for research, while opening possibilities for novel research, increases the risk of erroneous conclusions. This need not even be deliberate: faced with many different analysis possibilities, a researcher may, consciously or not, be inclined towards the one more likely to give the desired result.
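The multiplicity problem behind p-hacking can be demonstrated with a short simulation (ours, not drawn from any of the studies cited): even when no true effect exists anywhere, the chance of finding at least one ‘significant’ subgroup grows rapidly with the number of tests.

```python
import math
import random

random.seed(7)

def p_value(a, b):
    """Two-sided p-value for a difference in means, normal approximation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

def chance_of_false_positive(n_subgroups, n_sims=400, n=50):
    """Probability that at least one of n_subgroups null tests hits p < 0.05."""
    hits = 0
    for _ in range(n_sims):
        # Pure noise: no subgroup truly differs between the two arms.
        hits += any(
            p_value([random.gauss(0, 1) for _ in range(n)],
                    [random.gauss(0, 1) for _ in range(n)]) < 0.05
            for _ in range(n_subgroups))
    return hits / n_sims

for k in (1, 10, 20):
    print(f"{k:2d} subgroup tests: "
          f"P(at least one p < 0.05) ~ {chance_of_false_positive(k):.2f}")
```

With a single test the false-positive rate sits near the nominal 5%, but across 20 independent tests the chance of at least one spurious ‘finding’ approaches 1 - 0.95^20, roughly 64%.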
It is very difficult to rule this out, although one solution is to make the analysis process more transparent (Box 5). For some types of research this can be achieved by registering the study protocol, along with details of the planned analysis, in advance. However, this is not necessarily applicable or helpful for many descriptive studies where the ultimate focus may not be pre-determined.
Box 5. Ensuring transparency.
Research using information from large-scale electronic health records has an important role to play in mental health research. However, as we have highlighted, there are major caveats to how this data is applied when attempting to answer challenging questions in mental health research. All research, including the best-designed studies, has limitations. The reporting of research methods can be strengthened and made more transparent by adhering to guidelines such as STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) (von Elm et al., 2008) and CONSORT (Consolidated Standards of Reporting Trials) (Schulz et al., 2010). Guidelines for the reporting of observational studies using routinely collected data have also been developed (Benchimol et al., 2015). Adhering to such guidelines helps ensure that research using electronic health records and other administrative data resources for mental health research is transparent and more likely to be replicable.
Big data cannot answer questions for which data has not already been collected
Big data is, by definition, collected for purposes other than research and therefore does not always fit the research questions we wish to ask. For example, with diagnoses recorded over time for the purposes of clinical care rather than aetiological research, it is often very difficult to determine exactly how the date of diagnosis relates to onset. Similarly, relying on big data alone makes it very difficult to research disorders that have not already come to the attention of mental health services; routine health-systems data may then under- or over-estimate the actual prevalence of mental disorders. In such situations, community-based cross-sectional surveys still have a major role to play. While traditional methods such as surveys, RCTs and qualitative studies allow us to determine what data is collected, with big data there is a danger of neglecting research topics for which data is not already available (Schofield, 2017). This has particular relevance to mental health research, where social factors are often inextricably linked to the aetiology and progress of mental disorders (Reininghaus and Morgan, 2014; van Os et al., 2010). A reliance on big data alone risks circularity in the way research is conducted: studies framed within a bio-medical model, using data collected from medical records alone, remove the possibility that social factors might feature in the aetiology of mental disorder.
Conclusion
There are clearly considerable advantages to the use of this kind of big data for psychiatric research. However, these data sources come with inherent limitations, as we have outlined, and should not replace methods of proven utility. Instead, we argue, they should play a complementary role alongside randomised controlled trials, representative surveys, cohort studies and qualitative studies, capitalising on the methodological advantages of each while offsetting their respective limitations. We have also outlined ways in which novel methodologies, such as quasi-experimental designs and embedded RCTs, as well as novel recruitment approaches, may be intermeshed with big data to enhance traditional research methods. We are confident that many more such applications and methods will become apparent as this rapidly changing field develops.
Multiple choice questions
Select the single best option for each question
1. A major advantage of big data in mental health research is:
   - a. we no longer need to use statistics
   - b. analysis is much simpler
   - c. it is easy to get the results we want
   - d. we no longer need other more expensive forms of research data
   - e. we can answer many research questions previously beyond the scope of research
2. When analysing big data, research questions are:
   - a. no longer important
   - b. easily matched with available data
   - c. often outside the scope of available data
   - d. the last thing we need to think about
   - e. decided by the computer algorithm
3. Big data cannot:
   - a. be combined with other kinds of research data
   - b. be used in experimental studies
   - c. be used in study recruitment
   - d. be interpreted without understanding the data-generating mechanism
   - e. be used for purposes other than that for which it was collected
4. Big data can:
   - a. allow us to predict what happens in the future
   - b. improve our ability to make causal inferences
   - c. replace the need to make causal inferences
   - d. replace most other research resources
   - e. do away with the need for epidemiologists and medical statisticians
5. Big data is:
   - a. something that has only existed in the past couple of decades
   - b. not found in the UK
   - c. confined to social media, e.g. analysing Facebook ‘likes’ and Twitter feeds
   - d. a passing fad that serious researchers should ignore
   - e. a major opportunity for enhancing mental health research

Answers: 1. e; 2. c; 3. d; 4. b; 5. e
Learning objectives
- Be aware of major big data resources relevant to mental health research
- Be aware of key advantages and innovative study designs using these data sources
- Understand the inherent limitations of studies reliant on big data alone
Biographies
Dr Peter Schofield is a Medical Research Council research fellow in the School of Population Health & Environmental Sciences, King’s College London. His research uses a mixed methods approach, including analysis of whole population data, to investigate the role of social factors in the aetiology and management of mental disorders.
Dr Jayati Das-Munshi is an Honorary Consultant Psychiatrist with the South London and Maudsley NHS Foundation Trust. She also holds a Clinician Scientist Fellowship from the Academy of Medical Sciences/Health Foundation and is based at the Institute of Psychiatry, Psychology & Neuroscience, King’s College London. Her areas of interest include the social determinants of mental disorders, including ethnic minority/migrant health inequalities, the interplay of physical and mental health, and novel methodologies to address research questions.
References
- Agerbo E, Sullivan PF, Vilhjálmsson BJ, et al. Polygenic Risk Score, Parental Socioeconomic Status, Family History of Psychiatric Disorders, and the Risk for Schizophrenia. JAMA Psychiatry. 2015;72:635. doi: 10.1001/jamapsychiatry.2015.0346.
- Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533:452–454. doi: 10.1038/533452a.
- Bansal N, Bhopal R, Netto G, et al. Disparate patterns of hospitalisation reflect unmet needs and persistent ethnic inequalities in mental health care: the Scottish health and ethnicity linkage study. Ethnicity & Health. 2014;19:217–239. doi: 10.1080/13557858.2013.814764.
- Benchimol EI, Smeeth L, Guttmann A, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine. 2015;12. doi: 10.1371/journal.pmed.1001885.
- Bhopal R, Fischbacher C, Povey C, et al. Cohort Profile: Scottish Health and Ethnicity Linkage Study of 4.65 million people exploring ethnic variations in disease in Scotland. International Journal of Epidemiology. 2011;40:1168–1175. doi: 10.1093/ije/dyq118.
- Callard F, Broadbent M, Denis M, et al. Developing a new model for patient recruitment in mental health services: a cohort study using Electronic Health Records. BMJ Open. 2014;4. doi: 10.1136/bmjopen-2014-005654.
- Chang C-K, Hayes RD, Perera G, et al. Life Expectancy at Birth for People with Serious Mental Illness and Other Major Disorders from a Secondary Mental Health Care Case Register in London. PLOS ONE. 2011;6:e19590. doi: 10.1371/journal.pone.0019590.
- Crawford MJ, Jayakumar S, Lemmey SJ, et al. Assessment and treatment of physical health problems among people with schizophrenia: National cross-sectional study. British Journal of Psychiatry. 2014;205:473–477. doi: 10.1192/bjp.bp.113.142521.
- Das-Munshi J, Chang C-K, Dutta R, et al. Ethnicity and excess mortality in severe mental illness: a cohort study. The Lancet Psychiatry. 2017;4:389–399. doi: 10.1016/S2215-0366(17)30097-4.
- Downs J, Gilbert R, Hayes RD, et al. Linking health and education data to plan and evaluate services for children. Archives of Disease in Childhood. 2017;102:599–602. doi: 10.1136/archdischild-2016-311656.
- Erlangsen A, Lind BD, Stuart EA, et al. Short-term and long-term effects of psychosocial therapy for people after deliberate self-harm: a register-based, nationwide multicentre study using propensity score matching. The Lancet Psychiatry. 2015;2:49–58. doi: 10.1016/S2215-0366(14)00083-2.
- Gelman A, Loken E. The Statistical Crisis in Science. American Scientist. 2014;102:460.
- Google. Flu Trends. 2017. [WWW Document]
- Gorrell G, Oduola S, Roberts A, et al. Identifying First Episodes of Psychosis in Psychiatric Patient Records using Machine Learning. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics; 2016. pp. 196–205.
- Gulliford MC, van Staa TP, McDermott L, et al. Cluster randomized trials utilizing primary care electronic health records: methodological issues in design, conduct, and analysis (eCRT Study). Trials. 2014;15:220. doi: 10.1186/1745-6215-15-220.
- Hennekens C, Buring J, Mayrent S, editors. Epidemiology in Medicine. Lippincott Williams and Wilkins; Boston, MA: 1987.
- Howard L, de Salis I, Tomlin Z, et al. Why is recruitment to trials difficult? An investigation into recruitment difficulties in an RCT of supported employment in patients with severe mental illness. Contemporary Clinical Trials. 2009;30:40–46. doi: 10.1016/j.cct.2008.07.007.
- Kendrick T, Stuart B, Newell C, et al. Changes in rates of recorded depression in English primary care 2003–2013: Time trend analyses of effects of the economic recession, and the GP contract quality outcomes framework (QOF). Journal of Affective Disorders. 2015;180:68–78. doi: 10.1016/j.jad.2015.03.040.
- Knudsen AK, Hotopf M, Skogen JC, et al. The health status of nonparticipants in a population-based health study. American Journal of Epidemiology. 2010;172:1306–1314. doi: 10.1093/aje/kwq257.
- Lazer D, Kennedy R, King G, et al. Big data. The parable of Google Flu: traps in big data analysis. Science. 2014;343:1203–5. doi: 10.1126/science.1248506.
- Mayer-Schönberger V, Cukier K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt; 2013.
- McIntosh AM, Stewart R, John A, et al. Data science for mental health: a UK perspective on a global challenge. The Lancet Psychiatry. 2016;3:993–998. doi: 10.1016/S2215-0366(16)30089-X.
- Norredam M, Kastrup M, Helweg-Larsen K. Register-based studies on migration, ethnicity, and health. Scandinavian Journal of Public Health. 2011;39:201–5. doi: 10.1177/1403494810396561.
- Oduola S, Wykes T, Robotham D, et al. What is the impact of research champions on integrating research in mental health clinical practice? A quasiexperimental study in South London, UK. BMJ Open. 2017;7. doi: 10.1136/bmjopen-2017-016107.
- OECD. Health Data Governance: Privacy, Monitoring and Research. OECD Publishing; Paris: 2015.
- Patel R, Jayatilleke N, Broadbent M, et al. Negative symptoms in schizophrenia: a study in a large clinical sample of patients using a novel automated method. BMJ Open. 2015;5:e007619. doi: 10.1136/bmjopen-2015-007619.
- Patel R, Oduola S, Callard F, et al. What proportion of patients with psychosis is willing to take part in research? A mental health electronic case register analysis. BMJ Open. 2017;7. doi: 10.1136/bmjopen-2016-013113.
- Pedersen CB, Mortensen PB. Evidence of a dose-response relationship between urbanicity during upbringing and schizophrenia risk. Archives of General Psychiatry. 2001;58:1039–46. doi: 10.1001/archpsyc.58.11.1039.
- Perera G, Broadbent M, Callard F, et al. Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: current status and recent enhancement of an Electronic Mental Health Record-derived data resource. BMJ Open. 2016;6:e008721. doi: 10.1136/bmjopen-2015-008721.
- Prince M, Stewart R, Ford T, et al., editors. Practical Psychiatric Epidemiology. OUP; Oxford: 2003.
- Quan H, Li B, Duncan Saunders L, et al. Assessing validity of ICD-9-CM and ICD-10 administrative data in recording clinical conditions in a unique dually coded database. Health Services Research. 2008;43:1424–1441. doi: 10.1111/j.1475-6773.2007.00822.x.
- Rait G, Walters K, Griffin M, et al. Recent trends in the incidence of recorded depression in primary care. British Journal of Psychiatry. 2009;195:520–4. doi: 10.1192/bjp.bp.108.058636.
- Reininghaus U, Morgan C. Integrated models in psychiatry: The state of the art. Social Psychiatry and Psychiatric Epidemiology. 2014;49:1–2. doi: 10.1007/s00127-013-0807-7.
- Roberts E, Wessely S, Chalder T, et al. Mortality of people with chronic fatigue syndrome: a retrospective cohort study in England and Wales from the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Clinical Record Interactive Search (CRIS) Register. The Lancet. 2016;387:1638–1643. doi: 10.1016/S0140-6736(15)01223-4.
- Rosen M. National Health Data Registers: A Nordic heritage to public health. Scandinavian Journal of Public Health. 2002;30:81–85. doi: 10.1080/140349401753683444.
- Schofield P. Big data in mental health research – do the ns justify the means? Using large data-sets of electronic health records for mental health research. BJPsych Bulletin. 2017;41:129–132. doi: 10.1192/pb.bp.116.055053.
- Schofield P, Das-Munshi J, Becares L, et al. Neighbourhood ethnic density and incidence of psychosis - First and second generation migrants compared. European Psychiatry. 2017;41:S249.
- Schulz KF, Altman DG, Moher D, et al. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332. doi: 10.1136/bmj.c332.
- van Os J, Kenis G, Rutten BP. The environment and schizophrenia. Nature. 2010;468:203–212. doi: 10.1038/nature09563.
- van Staa T-P, Goldacre B, Gulliford M, et al. Pragmatic randomised trials using routine electronic health records: putting them to the test. BMJ. 2012;344:e55. doi: 10.1136/bmj.e55.
- von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Journal of Clinical Epidemiology. 2008;61:344–349. doi: 10.1016/j.jclinepi.2007.11.008.
- Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. The American Statistician. 2016;70:129–133.
- Woodhead C, Ashworth M, Broadbent M, et al. Cardiovascular disease treatment among patients with severe mental illness: a data linkage study between primary and secondary care. British Journal of General Practice. 2016;66:e374–e381. doi: 10.3399/bjgp16X685189.