Abstract
Background: Deep Phenotyping is the precise and comprehensive analysis of phenotypic features, where the individual components of the phenotype are observed and described. In UK mental health clinical practice, most clinically relevant information is recorded as free text in the Electronic Health Record, and offers a granularity of information beyond that expressed in most medical knowledge bases. The SNOMED CT nomenclature potentially offers the means to model such information at scale, yet, given a sufficiently large body of clinical text collected over many years, it is difficult to identify the language that clinicians favour when expressing such concepts.
Methods: Vector space models of language seek to represent the relationship between words in a corpus in terms of cosine distance between a series of vectors. Using a large corpus of healthcare data combined with appropriate clustering techniques and manual curation, we explore how such models can be used to discover vocabulary relevant to the task of phenotyping Serious Mental Illness (SMI) with only a small amount of prior knowledge.
Results: 20 403 n-grams were derived and curated via a two-stage methodology. The list was reduced to 557 putative concepts based on eliminating redundant information content. These were then organised into 9 distinct categories pertaining to different aspects of psychiatric assessment. 235 (42%) concepts were found to be depictions of putative clinical significance. Of these, 53 (10%) were identified as having novel synonymy with existing SNOMED CT concepts. 106 (19%) had no mapping to SNOMED CT.
Conclusions: We demonstrate a scalable approach to discovering new depictions of SMI symptomatology based on real world clinical observation. Such approaches may offer the opportunity to consider broader manifestations of SMI symptomatology than is typically assessed via current diagnostic frameworks, and create the potential for enhancing nomenclatures such as SNOMED CT based on real world depictions.
Keywords: word2vec, natural language processing, serious mental illness, electronic health records, schizophrenia
Introduction
The dramatic decrease in genetic sequencing costs, coupled with the growth of our understanding of the molecular basis of diseases, has led to the identification of increasingly granular subsets of disease populations that were once thought of as homogeneous groups. As of 2010, the molecular basis for nearly 4 000 Mendelian disorders had been discovered 1, subsequently leading to the development of around 2 000 clinical genetic tests 2. The resulting ‘precision medicine’ paradigm has been touted as the logical evolution of evidence-based medicine.
Precision medicine has arisen in response to the fact that the ‘real world’ application of many treatments has a lower efficacy and a differential safety profile compared to clinical trials, most likely due to genetic and environmental differences in the disease population. Precision medicine seeks to obtain deeper genotypic and phenotypic knowledge of the disease population, in order to offer tailored care plans with evidence-based outcomes. Amongst the challenges presented by precision medicine is the requirement to obtain highly granular phenotypic knowledge that can adequately explain the variable manifestation of disease.
To realise the ambitions of precision medicine, large amounts of phenotypic data are required to provide sufficient statistical power in tightly defined patient cohorts (so called ‘Deep Phenotyping’ 3). Historical clinical data mined from Electronic Health Record (EHR) systems are frequently employed to meet the related use case of observational epidemiology. As such, EHRs are often posited as the means to provide extensive phenotypic information with a relatively low cost of collection 4, 5.
In order to standardise knowledge representation of clinically relevant entities and the relationships between them, phenotyping from EHRs often employs curated terminology systems, most commonly SNOMED CT. The use of such resources creates a common domain language in the clinical setting, theoretically allowing an unambiguous interpretation of events to be shared within and between healthcare organisations. The anticipated value of such a capability has prompted the UK National Information Board to recommend the adoption of SNOMED CT across all care settings by 2020 6. However, the task of representing the sprawling and ever-changing landscape of healthcare in such a fashion has proven complex 7– 10. Although a complete description of the structure and challenges of SNOMED CT is beyond the scope of this paper, we describe how aspects of these problems manifest themselves in the task of phenotyping serious mental illness (SMI) from a real world EHR system.
Phenotyping SMI
The quest for empirically validated criteria for assessing the symptomatology of mental illness has been a long-term goal of evidence-based psychiatry. SMI is a commonly used umbrella term to denote the controversial diagnoses of schizophrenia (encoded in SNOMED as SCTID: 58214004), bipolar disorder (SCTID: 13746004), and schizoaffective disorder (SCTID: 68890003). While field trials of DSM-5 have revealed promising progress in reliably delineating these three conditions in clinical assessment 11, such diagnostic entities continue to have low clinical utility 12– 14. Recent evidence from genome-wide association studies appears to suggest that such disorders share common genetic loci, further countering the argument that SMI can be classified into discrete, high level diagnostic units 15. In terms of clinical practice, the presenting phenotype of SMI is usually the basis for treatment. This is often characterised by abnormalities in various mental processes, which are in turn categorised according to broad groupings of clinically observable behaviours. For instance, ‘positive symptoms’ refer to the presence of behaviours not seen in unaffected individuals, such as hallucinations, delusional thinking and disorganised speech. Conversely, ‘negative symptoms’, such as poverty of speech and social withdrawal, refer to the absence of normal behaviours. Such symptomatology assessments are organised via an appropriate framework such as the Positive and Negative Syndrome Scale 16 or the Brief Negative Symptom Scale 17. Accordingly, SNOMED CT includes coverage for many of these symptoms, generally within the ‘Behaviour finding’ branch (SCTID: 844005).
A qualifying factor regarding the adoption of SNOMED amongst SMI specialists might therefore require that the list of clinical ‘finding’ entities in SNOMED are sufficiently expansive and diverse to represent their own experiences during patient interactions. Specifically, this may manifest as two key challenges for terminology developers:
First, insight must be obtained regarding real-world language usage such that universally understood medical entities, encompassing hypernymy, synonymy and hyponymy, adequately represent models of concepts. Similarly, the abundant use of acronyms in the medical domain has caused a large percentage of acronyms to have two or more meanings 18, creating word sense disambiguation problems. As such, significant efforts have arisen to supplement these types of knowledge bases with appropriate real world synonym usage extracted from EHR datasets 19.
Second, if there is controversy over international consensus in a particular area of medicine, the use of ‘global’ perspectives may not be sufficient to meet local reporting/investigatory requirements. Such issues are particularly pertinent in mental health where many diseases defy precise definition and biomarker development has yielded few successes 20. More generally, all medical knowledge bases are incomplete to one degree or another. The opportunity to utilise large amounts of EHR data to discover novel observations and relationships arising from real world clinical practice must not be overlooked.
Given a sufficiently large corpus of documents, typically authored by hundreds of clinical staff over several years, it is often difficult to predict the diversity of vocabulary used within the local EHR setting to describe potentially important clinical constructs. In previous work, we describe our attempts to extract fifty well known SMI symptomatology concepts from a large electronic mental health database resource 21, based upon the contents of such frameworks. During the course of manually reviewing clinical text, we made two subjective observations regarding the authorship of clinician/patient interactions:
The tendency of clinicians to use non-technical vocabulary in describing their observations
The occasional appearance of highly detailed, novel observations that do not readily fit into known symptomatology frameworks
Such observations may feasibly have clinical relevance, for example, as non-specific symptomatology prodromes 22. On the basis that the modelling of SMI for precision medicine approaches requires the full dimensionality of the disease to be considered, we sought to explore these observations further.
In this study, we present our efforts to utilise a priori knowledge discovery methods to identify patterns of real world language usage that reflect clinically relevant SMI symptomatology within the context of a large mental healthcare provider. We contrast and compare these patterns with a modern version of the UK SNOMED CT (v1.33.2), and suggest how such approaches may offer novel and/or more granular symptom depictions from patient/clinician interactions when used to supplement resources such as SNOMED CT, potentially offering alternatives to classify psychiatric disorders with finer resolution and greater real-world validity.
Methods
Our general approach for SMI knowledge discovery is composed of several discrete steps. An overview of the workflow is given in Figure 1.
Corpus creation from the Clinical Record Interactive Search
The South London and Maudsley NHS Foundation Trust (SLaM) provides mental health services to 1.2 million residents over four south London boroughs (Lambeth, Southwark, Lewisham and Croydon). Since 2007, the Clinical Record Interactive Search (CRIS) 23 infrastructure programme has been operating to offer a pseudonymised and de-identified research database of SLaM’s EHR system. As the CRIS resource received ethical approval as a pseudonymised and de-identified data source by Oxford Research Ethics Committee (reference 08/H0606/71+5), patient consent was not required for this study.
11 745 094 clinical documents were collected from the CRIS database from the period 01/01/2007 - 27/10/2016 on the basis that the 20 472 associated patients were assigned an SMI ICD10 code of F20, F25, F30 or F31 at some point during their care, in accordance with current clinical practice.
Pre-processing and vocabulary creation
Sentences and tokens were extracted from each document using the English Punkt tokeniser from the NLTK 3.0 suite 24. Each token was converted to lower case. A vocabulary was then constructed of all 1-gram types in the corpus, supplemented with frequently occurring bi-grams and tri-grams using the Gensim 25 suite and the sampling method proposed by Mikolov et al. 26. Bi-grams and tri-grams with a minimum frequency of 10 occurrences in the entire corpus were retained, to give a total vocabulary size of 896 195 n-grams. No further assumptions about the structure of the data, such as the need for stemming/lemmatisation, were made.
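A minimal sketch of this pre-processing step is given below, assuming an iterable `documents` of raw clinical document strings. The exact tokenisation calls and phrase-detection parameters are illustrative rather than the study's precise configuration; Gensim's Phrases class provides the collocation sampling approach described by Mikolov et al.

```python
import nltk
from gensim.models.phrases import Phrases

# nltk.download('punkt')  # the Punkt sentence tokeniser models must be available

def tokenise(document):
    """Return lower-cased tokens per sentence, using NLTK's English Punkt sentence tokeniser."""
    return [nltk.word_tokenize(sentence.lower())
            for sentence in nltk.sent_tokenize(document)]

# `documents` is an assumed iterable of raw clinical document strings.
sentences = [tokens for doc in documents for tokens in tokenise(doc)]

# Two passes of collocation detection: frequent bi-grams are joined with '_',
# and a second pass over the bi-gram output yields tri-grams.
bigram = Phrases(sentences, min_count=10)
trigram = Phrases(bigram[sentences], min_count=10)
corpus = [trigram[bigram[s]] for s in sentences]
```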
Building a word embedding model
The distributional hypothesis was first explored by Harris 27, who proposed that, given a sufficiently large body of text, linguistic units that co-occur in the same context are likely to have a semantically related meaning. Modelling the distribution of such units may therefore have value for a wide range of natural language processing applications. Models of distributional semantics, including word embeddings, are techniques that aim to derive models of semantically similar units in a corpus of text by co-locating them in vector space. In recent years, the use of the Continuous Bag-of-Words (CBOW) model proposed by Mikolov et al. 28 has risen to prominence, owing to its ability to accurately capture semantic relationships whilst scaling to large corpora of text 26. Recently, the CBOW model has been used to identify the semantic similarities between single word entities in biomedical literature and clinical text 29, suggesting that biomedical literature may serve as a useful proxy for clinical text for tasks such as synonym identification and word sense disambiguation under limited conditions 29.
A full description of the CBOW architecture is given in 30. For brevity, we describe only the key features used in our work here. The purpose of the architecture is to ‘learn’, in an unsupervised manner, a representation of the semantics of different n-grams, given an input set of documents. CBOW might be described as a simple feed-forward neural network consisting of three layers: an input layer X composed of o nodes (where o is the number of unique n-grams in the vocabulary produced by our pre-processing described above), a hidden layer H of a user-defined size n (usually between 100 and 300), and an output layer Y that is also composed of o nodes. Every node in X is connected to every node in H, and every node in H is connected to every node in Y. Between each pair of layers is a matrix of weight values: between the X and H layers, an ‘input’ matrix of dimensions o × n (hereafter denoted W); and between the H and Y layers, an ‘output’ matrix of dimensions n × o (denoted W′). The outcome of training the neural network is a set of learnt weights in each of these matrices. The weights learnt in the W matrix might be intuitively described as the semantic relationships between each n-gram in the vocabulary as represented in vector space, with semantically similar words located in closer proximity to each other. Weights in the W′ matrix represent the predictive model from the H to the Y layer. A training instance is composed of a group of n-grams, known as a context. A context can be composed of natural language structures, such as sentences in a document, or more complex arrangements, such as a sliding window of n-grams (usually between 5 and 10) that moves over each token in a document (potentially ignoring natural grammatical structures). For a given target n-gram, the input to the nodes of the hidden layer is the average of the rows of W corresponding to its context words. From the H to the Y layer, each n-gram is then scored using the W′ matrix, from which a posterior probability is obtained for each word in the vocabulary using the softmax function. The weights in each matrix are then updated using computationally efficient hierarchical softmax or negative sampling approaches. Once training is complete, the semantic similarity of two n-grams is typically measured via the cosine distance between their vectors in the W matrix.
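As a simplified illustration of the forward pass described above (our own sketch, not a reference implementation), the hidden layer is the average of the context word vectors and the output is a softmax over the vocabulary:

```python
import numpy as np

def cbow_forward(W, W_prime, context_ids):
    """One CBOW forward pass. W is the o x n input matrix, W_prime the n x o output matrix,
    and context_ids are the vocabulary indices of the context n-grams."""
    h = W[context_ids].mean(axis=0)       # hidden layer: average of the context word vectors
    scores = h @ W_prime                  # a raw score for every n-gram in the vocabulary
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()  # posterior probability of each n-gram (softmax)
```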
Using the Gensim implementation of CBOW and our previously constructed vocabulary, we trained a word embedding model of n = 100 over our SMI corpus to produce a vector space representation of our clinical vocabulary. Due to patient confidentiality, offline access to records was not feasible and so only a limited number of epochs of training could be performed. However, due to the relatively narrow/controlled vocabulary employed in clinical records (compared to normal speech/text) the range of possible input vectors was narrower than might otherwise be expected, and even a single epoch of training appeared to yield meaningful clusters that could be identified with SMI. As we were primarily intending to identify initial clusters for validation by clinical experts, it was felt that a single epoch of training, over the 20M clinical records available, was sufficient.
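A minimal sketch of this training step with Gensim follows. Parameter names are those of Gensim ≥ 4.0; the window size and minimum count shown are illustrative assumptions rather than values reported in this study.

```python
from gensim.models import Word2Vec

# `corpus` is assumed to be the iterable of tokenised, phrase-joined sentences built above.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # hidden layer size n ('size' in Gensim < 4.0)
    sg=0,             # 0 selects the CBOW architecture
    window=5,         # illustrative context window; not reported in this study
    min_count=10,     # illustrative; matches the phrase frequency threshold used above
    workers=8,
    epochs=1,         # a single pass over the corpus ('iter' in Gensim < 4.0)
)

# Inspect the neighbourhood of a known symptom term (assuming it survived min_count filtering).
print(model.wv.most_similar('hallucinations', topn=10))
```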
Vocabulary clustering and cluster scoring
The task of clustering seeks to group similar dataset objects together in meaningful ways. In unsupervised clustering, the definition of ‘meaningfulness’ is often subjectively defined by the human observers. In our task, we sought to identify clusters of n-grams derived from our word embedding model that represent semantically linked components of our clinical vocabulary, based on the theory that our word embedding model would cause related symptom concepts to appear close to each other within the vector space.
A particular challenge in the development of clustering algorithms is achieving scalability to large datasets. Since many clustering algorithms make use of the pairwise distance between n samples (or n-grams, in our case), the memory requirements of such algorithms tend to run in the order of n². One such algorithm that does not suffer from this limitation is k-means clustering. k-means clustering is a partitional clustering algorithm that seeks to assign n samples into a user defined k clusters by minimising the squared error between each centroid of a cluster and its surrounding points. A global (although not necessarily optimal) solution is derived when the algorithm has minimised the sum of squared errors across all k clusters, subject to some improvement threshold or other stopping criteria. For all experiments, we used the k-means++ implementation from the Scikit-Learn framework 31 with 8 runs each time, to control against centroids emerging in local minima.
The key parameter for k-means clustering is the selection of k. While techniques exist for estimating an appropriate value, such as silhouette analysis and the ‘elbow method’ 32, these utilise pairwise distances between samples, creating substantial technical limitations for large matrices in terms of memory usage. To overcome this, we opted for a memory efficient version of the elbow method, involving plotting the minimum centroid distance for different values of k. The intuition behind this approach is that every increase in k is likely to result in a smaller minimum centroid distance in vector space (subject to a random seed for the algorithm). As k increases, genuine clusters should be separated by a steady decline in minimum centroid distance. However, when the slope of the decline flattens out (i.e. the ‘elbow’ of the curve), assignment of samples to new clusters is likely to be random.
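The following sketch illustrates this memory-efficient variant of the elbow method. It assumes `model` is the trained Gensim embedding model (the vector attribute name may vary between Gensim versions), and the candidate range of k values shown is purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

X = model.wv.vectors  # matrix of n-gram vectors from the trained embedding model

min_centroid_distances = []
for k in range(25, 151, 25):  # candidate values of k; the range shown is illustrative
    km = KMeans(n_clusters=k, init='k-means++', n_init=8, random_state=0).fit(X)
    min_centroid_distances.append((k, pdist(km.cluster_centers_).min()))

# Plotting k against the minimum centroid distance reveals the 'elbow' where the decline flattens.
for k, d in min_centroid_distances:
    print(k, d)
```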
To identify a putative cluster of SMI symptomatology, we devised a simple ‘relevance’ cluster scoring approach based upon prior knowledge of common SMI symptom concepts. We selected 38 symptoms of SMI based upon their depictions in SMI frameworks and on their specificity in clinical use ( Table 1). For instance, we did not select ‘loosening of associations’, due to the different word senses in which the word ‘associations’ appears, such as ‘housing associations’, and organisational references such as ‘Stroke Association’. Rather, we chose symptoms such as ‘aggression’, ‘apathy’ and ‘agitation’, which are less likely to have different word sense interpretations in the context of SMI clinical documents. We produced stems, appropriate synonyms/acronyms and/or n-grams of each concept as described in Table 1. With this matching criterion, we scored each cluster based on the number of per-concept hits to derive a cluster/concept count matrix x where x_{i,j} represents the count of the ith concept in the jth cluster. For example, a cluster containing the terms ‘insomnia’ and ‘insomniac’ would receive a count of two for the ‘insomni’ concept. For each concept, we then calculated a vector of the minimum count per concept across all clusters:
$x_i^{\min} = \min_j x_{i,j}, \quad i = 1, \ldots, m$

where m is 38 (denoting the number of symptom concepts we describe in Table 1). Similarly, we generated a vector of maximum count per concept across all clusters:

$x_i^{\max} = \max_j x_{i,j}, \quad i = 1, \ldots, m$
Table 1. Known symptomatology concepts and vocabulary matching criteria.
core concept | concept match |
---|---|
aggression | aggress |
agitation | agitat |
anhedonia | anhedon |
apathy | apath |
affect | affect |
catalepsy | catalep |
catatonic | cataton |
circumstantial | circumstant |
concrete | concrete |
delusional | delusion |
derailment | derail |
eye contact | eye_contact |
echolalia | echola |
echopraxia | echopra |
elation | elat |
euphoria | euphor |
flight of ideas | foi |
thought disorder | thought_disorder |
grandiosity | grandios |
hallucinations | hallucinat |
hostility | hostil |
immobility | immobil |
insomnia | insomn |
irritability | irritab |
coherence | coheren |
mannerisms | mannerism |
mutism | mute |
paranoia | paranoi |
persecution | persecut |
motivation | motivat |
rapport | rapport |
posturing | postur |
rigidity | rigid |
stereotypy | stereotyp |
stupor | stupor |
tangential | tangenti |
thought block | thought_block |
waxy | waxy |
to enable us to rescale the value of each concept/cluster count to between 0 and 1 into a matrix x′:

$x'_{i,j} = \frac{x_{i,j} - x_i^{\min}}{x_i^{\max} - x_i^{\min}}$
The purpose of rescaling in such a way was to prevent overrepresented concepts unduly influencing the overall result (for instance, a concept with many hits in a cluster would unduly bias the score towards that concept, whereas we sought a scoring mechanism that would weigh all input concepts equally, regardless of their frequency).
Finally, we summed all rescaled concept counts per cluster, and divided by the total cluster size to provide a score per cluster z representing known symptomatology contents:
$z_j = \frac{1}{s_j} \sum_{i=1}^{m} x'_{i,j}$

where s is a vector of the total count of n-grams in each cluster. The purpose of dividing by cluster size was to prevent the tendency of larger clusters to score higher on account of their size.
To select clusters for further investigation, the robust median absolute deviation (MAD) statistic was chosen to identify outliers (the distribution of our cluster scores was non-normal). We adopted a conservative approach to cluster selection by choosing clusters that scored at least six MAD above the median score for further processing, which is approximately equivalent to four standard deviations for a normally distributed dataset.
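A minimal sketch of this scoring and selection procedure is shown below. It assumes the concept-by-cluster count matrix `x` and a vector of cluster sizes `cluster_sizes` have already been computed; the division-by-zero guard is our own addition for robustness.

```python
import numpy as np

def score_clusters(x, cluster_sizes):
    """x: m-by-k matrix of concept counts per cluster; cluster_sizes: n-grams per cluster."""
    lo = x.min(axis=1, keepdims=True)               # minimum count per concept across clusters
    hi = x.max(axis=1, keepdims=True)               # maximum count per concept across clusters
    span = np.where(hi > lo, hi - lo, 1)            # guard against division by zero
    x_rescaled = (x - lo) / span                    # rescale each concept's counts to [0, 1]
    return x_rescaled.sum(axis=0) / cluster_sizes   # sum per cluster, normalised by cluster size

scores = score_clusters(x, cluster_sizes)
med = np.median(scores)
mad = np.median(np.abs(scores - med))               # median absolute deviation
selected = np.where(scores > med + 6 * mad)[0]      # clusters at least six MADs above the median
```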
Expert curation of symptom concepts, frequency analysis and SNOMED CT mapping
The contents of the top scoring clusters underwent a two stage curation process. The first stage was performed by an informatician, and involved a simple manual data cleaning process to reduce the total number of concepts via grouping misspellings, tokenisation failures and other constructs that did not yield additional information content.
The second, more important stage was composed of independent annotation of the curated concept list by two psychiatrists, to identify likely synonyms and new symptomatology based on their clinical experience. Each concept was assigned to one of the 8 ‘substantive’ categories below, or to a 9th ‘other’ category.
Appearance/Behaviour Implying a real-time description of the way a patient appears or behaves (including their interaction with the recording clinician)
Speech Anything implying a description of any vocalisation (i.e. theoretically a subset of behaviour but restricted to vocalisations)
Affect/Mood Implying clinician-observed mood/emotional state (i.e. theoretically a subset of appearance but restricted to observed emotion), or implying self-reported mood/emotional state (i.e. has to imply a description that a patient would make of their own mood; theoretically a subset of thought)
Thought Implying any other thought content
Perception Implying any described perception
Cognition Implying anything relating to the patient’s cognitive function
Insight Implying anything relating to insight (awareness of health state)
Personality Anything implying a personality trait or attitude (i.e. something more long-standing than an observed behaviour at interview)
Other A mixed bag of definable terms that do not fit into the above. Common examples included anything implying information that will have been collected as part of a patient’s history, often of behaviours that would have to have been reported as occurring in the past and cannot have been observed at interview, but also which cannot be termed a personality trait. Alternatively, anything where insufficient context was available to make a decision
Inter-annotator agreement (IAA) was measured with Cohen’s Kappa agreement statistic 33.
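For reference, this statistic can be computed with scikit-learn, assuming the two annotators' category labels are held in identically ordered lists (names hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical lists of category labels from the two annotators, aligned per concept.
kappa = cohen_kappa_score(annotator_a_labels, annotator_b_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```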
To explore the frequency of both our prior symptomatology concepts and the newly curated ones in our symptom clusters, we counted the number of unique patient records and the number of unique documents that the stems of each n-gram appeared in. To protect patient anonymity, we discarded any concept that appeared in ten or fewer unique patient records. Finally, we mapped the remaining concepts to SNOMED CT, UK version v1.33.2, using the following method. First, the root morpheme of each concept was matched to a relevant finding, observable entity or disorder type in SNOMED CT. If a match could not be found, SNOMED CT was explored for potential synonymy, or other partial match. If a clear synonym could not be found, we classified the concept as novel.
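A simplified sketch of the frequency analysis follows. The corpus accessor `iter_documents()` and the stem dictionary `concept_stems` are hypothetical names introduced for illustration; the SNOMED CT mapping itself was performed manually and is not shown.

```python
from collections import defaultdict

patients_per_concept = defaultdict(set)
documents_per_concept = defaultdict(set)

# `iter_documents()` and `concept_stems` are hypothetical: an accessor yielding
# (patient_id, document_id, text) rows and a mapping such as {'insomnia': 'insomn', ...}.
for patient_id, document_id, text in iter_documents():
    lowered = text.lower()
    for concept, stem in concept_stems.items():
        if stem in lowered:
            patients_per_concept[concept].add(patient_id)
            documents_per_concept[concept].add(document_id)

# Discard concepts appearing in ten or fewer unique patient records (anonymity threshold).
retained = {c: len(p) for c, p in patients_per_concept.items() if len(p) > 10}
```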
Results
Word embedding model training
Processing the corpus of SMI clinical documents took approximately 100 hours on an 8-core commodity hardware server. Documents were fed sequentially from an SQL Server 2008 database operating as a shared resource, with additional overhead likely resulting from network latency.
Parameter selection for k-means clustering
Figure 2 shows a scatterplot of variable values of k and the resulting minimum centroid distance. This suggests a k value of around 50-75 may be optimal for our data. On this basis, we chose a k value of 75.
Cluster scoring
The application of our relevancy scoring algorithm to the 75 derived clusters resulted in a median score of 0.000229 and a MAD of 0.000277; the distribution is visualised in Figure 3.
Three clusters emerged with a score at least six MADs outside of the median cluster score: No. 52 (score: 0.002883), containing 6 665 terms; No. 69 (score: 0.002282), containing 9 314 terms; and No. 49 (score: 0.001940), containing 4 424 terms. Taken together, these three clusters contained a total of 20 403 n-grams.
Term curation, frequency analysis and SNOMED CT mapping
The combined 20 403 n-grams were taken forward for curation as described above. The first phase of curation reduced the list to 519 putative concepts. The majority of eliminated concepts were morphological variations, misspellings and tokenisation anomalies of singular concepts. For instance, 84 variations were detected for the stem ‘irrit*’ (as in ‘irritable’). Other n-grams were removed because insufficient context was available for a reasonable clinical interpretation, such as ‘fundamentally unchanged’, ’amusing’ and ‘formally tested’. Finally, n-grams that appeared to have no relevance to symptomatology at all were removed, such as dates and clinician names.
Expert curation by two psychiatrists of the 557 terms (519 discovered concepts and 38 prior concepts) produced a Cohen’s Kappa agreement score of 0.45, with 337 concepts independently assigned to the same one of our 9 categories by both annotators. Of these 337 concepts, 235 were assigned to a substantive category (i.e. not the indeterminate ‘other’ group). Table 2 shows the number of n-grams per category where agreement was reached.
Table 2. Counts of n-grams where annotators independently agreed by category.
Category | Count |
---|---|
Affect/Mood | 6 |
Appearance/Behaviour | 78 |
Cognition | 6 |
Insight | 2 |
Mood/Anxiety/Affect | 26 |
Other | 102 |
Perception | 9 |
Personality | 23 |
Speech | 63 |
Thought | 22 |
Supplementary File 1 is a CSV table of all 557 n-grams. In addition to the n-gram itself, the table contains the following information: the counts of the unique patient records of our 20 472 patient SMI cohort in which the n-gram was detected; the counts of the unique documents of the 11 745 094 clinical document corpus wherein the n-gram was detected; the category assigned to the n-gram by each of our clinical annotators; and the SNOMED CT ID code for each n-gram, where mapping was possible.
The most frequently detected concept mentions include ‘affect’ (detected in 91% of patients), ‘eye contact’ (85%), ‘hallucinations’ (85%), ’delusions’ (83%) and ‘rapport’ (81%). Other concepts follow a long tailed distribution, with mentions of the top 407 concepts found in at least 100 unique patient records.
Regarding SNOMED CT mapping, it was possible to suggest direct mappings for 177 concepts and to suggest synonymy or partial mapping for another 53 concepts. This left a remaining 327 concepts that did not appear to be referenced in SNOMED CT, of which 106 were classified as belonging to a substantive symptom category by independent curation.
Figure 4 visualises the top 20% most frequent n-grams by appearance in unique patient records, where annotators agreed and the n-gram was not classified in our ‘other’ grouping. Owing to the difficulty of the IAA and categorisation task, an extended analysis of the top 40% most frequent n-grams by appearance in unique patient records, irrespective of IAA and categorisation, is provided in Supplementary Figure 1.
Discussion
In this project, we sought to explore SMI symptomatology and other language constructs as expressed by clinicians in their own words, using more than ten years of observations made during real world clinician/patient interactions from more than 20 000 unique SMI cases. Within the context of a large mental healthcare provider, the results of our vocabulary curation efforts suggest that psychiatrists make use of a wide range of vocabulary to describe detailed symptomatic observations.
Many of the curated entities where both annotators agreed upon a substantive category map directly to preferred terms or synonyms of well known symptomatology constructs as described in SNOMED CT. Reassuringly, many of the most frequently encountered entities as represented by unique patient count are represented in SNOMED CT, suggesting that SNOMED CT offers a reasonable coverage of what clinicians deem to be the most salient features of a psychiatric examination.
Nevertheless, our work produces evidence to suggest that many suitable synonyms are currently missing from SNOMED CT symptom entities. For instance, ‘aggression’ is commonly observed in SMI patients. Our results indicate that this construct might also be referred to by adjectives and phrases such as ‘combatative’ [sic], ‘assaultative’ [sic], ‘truculent’, ‘stared intimidatingly’ and ‘stared menacingly’, amongst others. Similarly, direct synonyms of ‘paranoia’ might include ‘suspiciousness’, ‘mistrustful’ and ‘conspirational’ [sic].
In addition, many of the curated constructs appear to reflect more granular observations of known symptomatology. For example, the Positive and Negative Syndrome Scale (PANSS) utilises a 30-item scale of different symptomatology constructs. Specifically regarding abnormal speech, the PANSS provides guidance amounting to the high level clinical scrutiny of ‘lack of spontaneity & flow of conversation’. However, clinical depictions of speech within our dataset suggest around 68 distinct states, including ‘making animal noises’, ‘staccato quality’, ‘easily interruptible’, ‘prosody’ and ‘silently mouthing’.
We note the occurrence of several constructs that defy classification under existing schemas of SMI symptomatology, such as behaviours of ‘over politeness’, ‘over complimentary’, ‘spending recklessly’ and ‘shadow boxing’. The clinical interpretation of such entities is a non-trivial exercise, and is out of scope for this piece. Nevertheless, word embedding models may offer the potential to gain insight into potentially novel symptomatology constructs observed from real world clinician/patient interactions. Future work might explore the context for such constructs in more detail.
The emergence of such diverse language in turn has implications for how SNOMED CT might be implemented within an SMI context, raising the question of whether such gaps represent significant barriers to the use of SNOMED CT as a phenotyping resource. The issue of SNOMED CT’s sufficiency in this context has previously been raised for other areas, such as rare disease 34, psychological assessment instruments 35 and histopathology findings 36. However, in fairness, SNOMED CT is not a static resource, but an international effort dependent on the contributions of researchers. Perhaps a more pertinent question for the future development of SNOMED CT concerns balancing its objective to be a comprehensive terminology of clinical language (capable of facilitating interoperability and modelling deep phenotypes within disparate healthcare organisations across the globe) and the overwhelming complexity it would need to encompass in order to not constrain its users. Certainly, at over 300 000 entities in its current incarnation, its size already presents problems in biomedical applications 37.
Limitations and future work
On the basis that manifestations of symptoms are the result of abnormal mental processes, novel symptom entities possibly represent observations of clinical significance. However, one particular complication in validating the clinical utility of novel symptomatology constructs with historic routinely recorded notes arises from systemic biases in EHR data. Specifically, the breadth and depth of symptomatic reporting is likely to be highly variable for a number of reasons. For instance, established symptoms as defined by current diagnostic frameworks are likely to be preferentially recorded, as clinicians are mandated to capture such entities in their assessments. On the other hand, constructs that fall outside of such frameworks may only be recorded as tangential observations made during patient/clinician interactions. Regardless of whether they are observed or not, without an established precedent of their clinical utility, they may be subject to random variation as to whether they are documented in a patient’s notes. This is borne out by the tendency of SNOMED CT-ratified concepts to appear more frequently in unique documents compared to our derived expressions. The validation of new symptoms from historic data is therefore something of a ‘chicken and the egg’ situation, a widely discussed limitation of the reuse of EHR data 38, 39. Nevertheless, our frequency analysis of our discovered constructs suggests that there is evidence that many are observed often enough to warrant their consideration within an expanded framework. Similarly, older frameworks with a limited scope of symptomatic expression were likely designed with pragmatic constraints around speed and reproducibility of assessment in mind. However, modern technology allows for a far greater scope of data capture and validation going forward, creating opportunities to develop new frameworks that maximise the value of psychiatric assessment. Future work in this domain might seek statistical validation via randomised experimental design, as opposed to observational study.
Our work suggests an approximate correlation between patient and document count, such that intra and inter patient symptomatological clinical language usage varies relatively consistently. However, some notable exceptions to this correlation (i.e. with a higher document level frequency to patient record level frequency) include ‘aggression’, ‘pacing’, ‘sexual inappropriateness’, ‘sexual disinhibition’ and ‘mutism’. Further work might seek to study these effects in greater detail, to uncover whether they represent a systemic bias in how such concepts are represented in the EHR.
The results of our IAA exercise between two experienced psychiatrists suggested a moderate level of agreement in categorising the newly identified constructs. Given that this annotation exercise did not provide any context beyond the n-gram, and that the nature of SMI symptom observation is somewhat subjective, perhaps it is to be expected that agreement was not higher. Nevertheless, future work might seek more formal definitions of such constructs via expert panel discussion and engagement with international collaborative efforts in SMI research.
Our method for vocabulary building produced nearly 1 million n-grams. A manual annotation of this list may have resulted in further discoveries, although it would have been intractable in practical terms. To reduce the volume of n-grams taken forward for curation, we employed a word embedding model with a clustering algorithm. With our cluster scoring methodology, we were able to successfully produce meaningful clusters of n-grams reflecting the semantics of SMI symptomatology. However, as with many unsupervised tasks, it is difficult to determine whether an optimal solution has been achieved. In particular, the emergence of three ‘symptom’ clusters instead of one indicates sub-optimal localisation of symptom constructs in vector space. Addressing such a problem is multifaceted. For technical reasons, only a single epoch of training was possible in this exercise. Additional epochs would likely contribute to better cluster definition, in turn allowing us to reduce the value of our k parameter. In addition, spell checking and collapsing n-grams into their root forms may also have assisted. However, the latter may have also created new word sense disambiguation problems if common, symptom-like morphemes also appear in non-symptomatological assessment contexts.
After clustering, a two stage manual curation of more than 20 000 n-grams was necessary. Methods that produce a smaller vocabulary might conceivably reduce annotator burden. This might include the use of spell checkers and stemming/lemmatisation to correct and normalise tokens, at the risk of introducing new issues associated with morphological forms in word embedding model building. For this attempt, we took the conscious decision to make as few assumptions about the underlying structure of the data as possible.
Conclusions
Evidence-based mental health has long sought to produce disease model definitions that are both valid, in the sense they represent useful clinical representations that can inform treatment, and reliable, in that they can be consistently applied by different clinicians to achieve the same outcomes. In practice this has proven difficult, due to the often subjective nature of psychiatric examination/phenotyping and insufficient knowledge about the underlying mechanisms of disorders such as SMI. Here, we demonstrate that clinical staff make use of a diverse vocabulary in the course of their interactions with patients. This vocabulary often references findings that are not represented in SNOMED CT, raising questions about whether clinicians should observe the constraints of SNOMED CT or whether SNOMED CT should incorporate greater flexibility to reflect the nature of mental health. It is outside the scope of this work to explore how the granularity of symptom-based phenotyping affects patient outcomes, although the possibility of offering a fully realised picture of symptom manifestation may prove valuable in future endeavours of precision medicine.
Data availability
The CRIS dataset is a pseudonymised and de-identified case register of electronic health records of the South London and Maudsley NHS Trust. It operates under a security model that does not allow for open publication of raw data. However, access can be granted for research use cases under a patient-led security model. For further information and details on the application process, please contact cris.administrator@kcl.ac.uk or visit the website. Alternatively, you may write to the CRIS team at: PO Box 92, Institute of Psychiatry, Psychology & Neuroscience at King’s College London, 16 De Crespigny Park, London SE5 8AF.
Funding Statement
This paper represents independent research funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. RP has received support from a Medical Research Council (MRC) Health Data Research UK Fellowship and a Starter Grant for Clinical Lecturers (SGL015/1020) supported by the Academy of Medical Sciences, The Wellcome Trust, MRC, British Heart Foundation, Arthritis Research UK, the Royal College of Physicians and Diabetes UK. SV is supported by the Swedish Research Council (2015-00359), Marie Sklodowska Curie Actions, Cofund, Project INCA 600398
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; referees: 2 approved with reservations]
Supplementary material
Supplementary File 1: This file contains all of the 557 terms taken forward for expert annotation. It includes SNOMED mappings where possible, unique document and patient counts within the corpus, and the annotations provided by RP and RS.
Supplementary Figure 1: This file is an expanded visualisation of the frequency analysis figure contained in the main manuscript, with the agreement and non-substantive ‘other’ classification restrictions lifted.
References
- 1. Amberger J, Bocchini C, Hamosh A: A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). 2011;32(5):564–567. ISSN 1059-7794. 10.1002/humu.21466
- 2. Mirnezami R, Nicholson J, Darzi A: Preparing for precision medicine. 2012;366(6):489–491. ISSN 0028-4793, 1533-4406. 10.1056/NEJMp1114866
- 3. Robinson PN: Deep phenotyping for precision medicine. 2012;33(5):777–780. ISSN 1059-7794. 10.1002/humu.22080
- 4. Pathak J, Kho AN, Denny JC: Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. 2013;20(e2):e206–11. 10.1136/amiajnl-2013-002428
- 5. Castro VM, Minnier J, Murphy SN, et al.: Validation of electronic health record phenotyping of bipolar disorder cases and controls. 2015;172(4):363–372. ISSN 0002-953X, 1535-7228. 10.1176/appi.ajp.2014.14030423
- 6. National Information Board: Personalised Health and Care 2020. 2014.
- 7. Lee D, Cornet R, Lau F, et al.: A survey of SNOMED CT implementations. 2013;46(1):87–96. ISSN 1532-0464. 10.1016/j.jbi.2012.09.006
- 8. Barnes M: Lessons learned from the implementation of clinical messaging systems. 2007;36–40. ISSN 1942-597X.
- 9. The future of healthcare informatics: it is not what you think. 2012;1(4):5–6. ISSN 2164-957X, 2164-9561. 10.7453/gahmj.2012.1.4.001
- 10. Gordon D: Merging multiple institutions: Information architecture problems and solutions. 1999;785–789. ISSN 1531-605X.
- 11. Freedman R, Lewis DA, Michels R, et al.: The initial field trials of DSM-5: new blooms and old thorns. 2013;170(1):1–5. ISSN 0002-953X, 1535-7228. 10.1176/appi.ajp.2012.12091189
- 12. Kendell R, Jablensky A: Distinguishing between the validity and utility of psychiatric diagnoses. 2003;160(1):4–12. ISSN 0002-953X, 1535-7228. 10.1176/appi.ajp.160.1.4
- 13. Chmielewski M, Bagby RM, Markon K, et al.: Openness to experience, intellect, schizotypal personality disorder, and psychoticism: resolving the controversy. 2014;28(4):483–99. ISSN 1943-2763. 10.1521/pedi_2014_28_128
- 14. Adam D: Mental health: On the spectrum. 2013;496(7446):416–418. ISSN 0028-0836, 1476-4687. 10.1038/496416a
- 15. Cross-Disorder Group of the Psychiatric Genomics Consortium: Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. 2013;381(9875):1371–1379. ISSN 0140-6736. 10.1016/S0140-6736(12)62129-1
- 16. Kay SR, Fiszbein A, Opler LA: The positive and negative syndrome scale (PANSS) for schizophrenia. 1987;13(2):261–76. ISSN 0586-7614. 10.1093/schbul/13.2.261
- 17. Kirkpatrick B, Strauss GP, Nguyen L, et al.: The brief negative symptom scale: psychometric properties. 2010;37(2):300–305. ISSN 0586-7614, 1745-1701. 10.1093/schbul/sbq059
- 18. Liu H, Aronson AR, Friedman C: A study of abbreviations in MEDLINE abstracts. 2002;464–468. ISSN 1531-605X.
- 19. Henriksson A, Conway M, Duneld M, et al.: Identifying synonymy between SNOMED clinical terms of varying length using distributional analysis of electronic health records. 2013;2013:600–609. ISSN 1942-597X.
- 20. Boksa P: A way forward for research on biomarkers for psychiatric disorders. 2013;38(2):75–55. ISSN 1180-4882. 10.1503/jpn.130018
- 21. Jackson RG, Patel R, Jayatilleke N, et al.: Natural language processing to extract symptoms of severe mental illness from clinical text: The Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. 2017;7(1):e012012. ISSN 2044-6055. 10.1136/bmjopen-2016-012012
- 22. McGorry PD: The next stage for diagnosis: Validity through utility. 2013;12(3):213–215. ISSN 1723-8617. 10.1002/wps.20080
- 23. Perera G, Broadbent M, Callard F, et al.: Cohort profile of the South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLaM BRC) Case Register: Current status and recent enhancement of an Electronic Mental Health Record-derived data resource. 2016;6(3):e008721. ISSN 2044-6055. 10.1136/bmjopen-2015-008721
- 24. Bird S, Klein E, Loper E: Natural Language Processing with Python. O’Reilly, Beijing; Cambridge [Mass.], 1st edition, 2009. ISBN 978-0-596-51649-9.
- 25. Řehůřek R, Sojka P: Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. 2010;45–50. Valletta, Malta, May. ELRA. 10.13140/2.1.2393.1847
- 26. Mikolov T, Sutskever I, Chen K, et al.: Distributed representations of words and phrases and their compositionality. 2013;3111–3119.
- 27. Harris ZS: Distributional Structure. 1954;10(2–3):146–162. ISSN 0043-7956, 2373-5112. 10.1080/00437956.1954.11659520
- 28. Mikolov T, Chen K, Corrado G, et al.: Efficient estimation of word representations in vector space. 2013.
- 29. Pakhomov SV, Finley G, McEwan R, et al.: Corpus domain effects on distributional semantic modeling of medical terms. 2016;32(23):3635–3644. ISSN 1367-4803, 1460-2059. 10.1093/bioinformatics/btw529
- 30. Rong X: Word2vec parameter learning explained. 2014.
- 31. Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine Learning in Python. 2011;12:2825–2830.
- 32. Kodinariya TM, Makwana PR: Review on determining number of Cluster in K-Means Clustering. 2013;1(6):90–95.
- 33. Cohen J: A Coefficient of Agreement for Nominal Scales. 1960;20(1):37–46. ISSN 0013-1644, 1552-3888. 10.1177/001316446002000104
- 34. Sollie A, Sijmons RH, Lindhout D, et al.: A new coding system for metabolic disorders demonstrates gaps in the international disease classifications ICD-10 and SNOMED-CT, which can be barriers to genotype-phenotype data sharing. 2013;34(7):967–973. ISSN 1059-7794. 10.1002/humu.22316
- 35. Ranallo PA, Adam TJ, Nelson KJ, et al.: Psychological assessment instruments: a coverage analysis using SNOMED CT, LOINC and QS terminology. 2013;2013:1333–1340. ISSN 1942-597X.
- 36. Campbell WS, Campbell JR, West WW, et al.: Semantic analysis of SNOMED CT for a post-coordinated database of histopathology findings. 2014;21(5):885–892. ISSN 1067-5027, 1527-974X. 10.1136/amiajnl-2013-002456
- 37. López-García P, Schulz S: Can SNOMED CT be squeezed without losing its shape? 2016;7(1):56. ISSN 2041-1480. 10.1186/s13326-016-0101-1
- 38. Weiskopf NG, Weng C: Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. 2013;20(1):144–151. ISSN 1067-5027, 1527-974X. 10.1136/amiajnl-2011-000681
- 39. Chan KS, Fowles JB, Weiner JP: Review: electronic health records and the reliability and validity of quality measures: a review of the literature. 2010;67(5):503–527. ISSN 1077-5587, 1552-6801. 10.1177/1077558709359007