Author manuscript; available in PMC: 2018 Dec 11.
Published in final edited form as: Am J Med Genet B Neuropsychiatr Genet. 2017 Sep 1;177(7):613–624. doi: 10.1002/ajmg.b.32579

Phelan-McDermid Syndrome Data Network: Integrating Patient Reported Outcomes with Clinical Notes and Curated Genetic Reports

Cartik Kothari 1, Maxime Wack 2, Claire Hassen-Khodja 3, Sean Finan 4, Guergana Savova 5, Megan O’Boyle 6, Geraldine Bliss 7, Andria Cornell 8, Elizabeth J Horn 9, Rebecca Davis 10, Jacquelyn Jacobs 11, Isaac Kohane 12, Paul Avillach 13
PMCID: PMC5832521  NIHMSID: NIHMS903521  PMID: 28862395

Abstract

The heterogeneity of patient phenotype data is an impediment to the research into the origins and progression of neuropsychiatric disorders. This difficulty is compounded in the case of rare disorders such as Phelan-McDermid Syndrome (PMS) by the paucity of patient clinical data. PMS is a rare syndromic genetic cause of autism and intellectual deficiency. In this paper, we describe the Phelan-McDermid Syndrome Data Network (PMS_DN), a platform that facilitates research into phenotype-genotype correlation and progression of PMS by: a) integrating knowledge of patient phenotypes extracted from Patient Reported Outcomes (PRO) data and clinical notes - two heterogeneous, underutilized sources of knowledge about patient phenotypes - with curated genetic information from the same patient cohort and b) making this integrated knowledge, along with a suite of statistical tools, available free of charge to authorized investigators on a Web portal https://pmsdn.hms.harvard.edu. PMS_DN is a patient-centered research network funded by the Patient-Centered Outcomes Research Institute (PCORI), in which patients and their families are involved in all aspects of the management of patient data and in driving research into PMS. To foster collaborative research, PMS_DN also makes patient aggregates from this knowledge available to authorized investigators using distributed research networks such as the PCORnet PopMedNet. PMS_DN is hosted on a scalable cloud-based environment and complies with all patient data privacy regulations. As of October 31, 2016, PMS_DN integrates high-quality knowledge extracted from the clinical notes of 112 patients and curated genetic reports of 214 patients with preprocessed PRO data from 415 patients.

Keywords: Knowledge extraction, clinical notes, patient reported outcomes, knowledge integration, rare neuropsychiatric disorders

INTRODUCTION

Genetic causes of neuropsychiatric disorders are not well understood in general [Kerner 2015]. Research investigations using Genome Wide Association Study (GWAS) [Wood 2013; Network 2015], exome-based sequencing [Iossifov et al 2012; O’Roak et al 2011; Vissers et al 2010; Girard et al 2011; Xu et al 2012], and whole genome sequencing [Kong et al 2012] techniques have revealed several candidate genes that are associated with common neuropsychiatric disorders such as Autism Spectrum Disorder (ASD), intellectual disability, and schizophrenia. However, in the case of rare disorders, understanding the genetic origins and progressions of disorders - one of the key objectives of Precision Medicine research [Kohane 2015; Kohane et al 2012a; Collins and Varmus 2015] - is hindered by small patient population size, the consequent paucity of patient data, and the lack of robust phenotyping protocols [Baynam et al 2015; Robinson et al 2015; Delude 2015].

One such rare neuropsychiatric disorder is Phelan-McDermid Syndrome (PMS) or 22q13 deletion syndrome (OMIM 606232) [Phelan and McDermid 2012; Phelan 2008; Cusmano-Ozog K et al 2007], with approximately 1400 cases diagnosed worldwide, mostly in children. PMS is caused by deletion of the terminal end of the long arm of chromosome 22 or by mutation and loss of function of the SHANK3 gene [Macedoni-Lukšič et al 2013], which is also implicated in ASD [Uchino and Waga 2013; Gauthier et al 2008]. Diagnosis is only possible with genetic testing and is often delayed. Early studies have looked at the effect of intranasal insulin therapy [Maxonus et al 2012; Zwanenberg et al 2016] and the role of Insulin-like Growth Factor-1 (IGF-1) [Kolevzon et al 2014] in reversing some of the symptoms of PMS, but there is currently no known treatment for the disorder. A wide variety of symptoms have been observed in individuals with PMS, including poor muscle tone, intellectual disability, developmental delays, dysmorphic facial features, vesicoureteral reflux, gastroesophageal reflux, congenital cardiac diseases, and behavioral disorders. Given the scarcity of patient data resulting from the small patient population size, clinical notes and Patient Reported Outcomes (PRO) data - previously underutilized sources of detailed information about patient conditions - assume significant importance in Precision Medicine research into PMS. Comparative analysis of the genetic profiles of the cohort of PMS patients with patient phenotypes reported in the clinical notes and PRO data has the potential to identify correlations between polymorphisms and deletions of specific genes and patient phenotypes, as well as to identify patient subtypes based upon genotypic and phenotypic profiles.

The Phelan-McDermid Syndrome Data Network (PMS_DN), a Patient Powered Research Network [Fleurence et al 2014; Daugherty et al 2014; Frank et al 2015] funded by the Patient Centered Outcomes Research Institute (PCORI, www.pcori.org), leverages patient clinical notes and PRO data to achieve its objective of furthering Precision Medicine research into PMS. The PMS_DN project is an example of a patient driven clinical research initiative where patients and their families are the primary stakeholders, managing all aspects of data governance and directing a patient-centered research agenda in collaboration with academic research institutions. The objective of this paper is to demonstrate how PMS_DN facilitates research into PMS by:

  1. Extracting knowledge from clinical notes by using a combination of Optical Character Recognition (OCR) and Natural Language Processing (NLP) methods

  2. Ensuring the high-quality and trustworthiness of the knowledge extracted from clinical notes by allowing experts to crosscheck the knowledge against the de-identified source raw text

  3. Integrating the knowledge extracted from clinical notes with PRO data and curated genetic reports from the same cohort of PMS patients, facilitating comparative analyses

  4. Provisioning free, multi-level access to the integrated knowledge for clinical practitioners and investigators researching neuropsychiatric disorders, over a Web portal and over distributed research networks, while complying with all the stipulations of patient privacy regulations, including the Health Insurance Portability and Accountability Act (HIPAA)

  5. Allaying concerns about long term scalability and viability of the project by adopting a cloud based computation environment

MATERIALS AND METHODS

Data Acquisition

The PMS Foundation (PMSF, 22q13.org) is a nonprofit foundation founded and run by PMS families that promotes awareness and research into PMS. Through patient outreach activities, PMSF collected patient data from hundreds of families of PMS patients. The collected data included:

  • Patient Reported Outcomes: Patient Reported Outcomes (PRO) data comprises responses by parents and caregivers of PMS patients to detailed questionnaires about diagnoses, procedures, lab tests, medications, patient behavior, and patient conditions, which were collected and stored in the PMS Information Registry (PMSIR, pmsiregistry.patientcrossroads.org).

  • Clinical notes: The families of PMS patients provided consent to CareSync (caresync.com), a third-party vendor, to request and obtain their health records, including clinical notes, from various healthcare providers on their behalf. CareSync collected the clinical notes and shared the PDF scans with the patients’ families and with PMSF. This process greatly simplified the cumbersome and time-consuming process of patients obtaining access to their health records [Lester et al 2016].

  • Curated Genetic Reports: Reports of PMS patients from genetic tests including Comparative Genomic Hybridization (CGH) arrays, Single Nucleotide Polymorphism (SNP) arrays, and microarrays were collected, curated by trained genetic counselors, and stored in the PMSIR.

With periodic patient outreach activities, PMSF has progressively improved patient participation in terms of the number of families consenting to share their data with PMSIR and with PMS_DN.

Data Processing

Clinical Notes

We used the open source Tesseract OCR tool [Smith 2007] to extract raw text content from the curated clinical notes. Then, the MITRE MIST tool [Aberdeen et al 2010] and the Scrubber toolkit [McMurry et al 2013] in the Apache cTAKES NLP engine were used to erase Protected Health Information (PHI) elements from the text. Following de-identification, the Apache cTAKES NLP engine [Savova et al 2010] was deployed to extract knowledge by identifying occurrences of concepts defined in the Unified Medical Language System (UMLS) [Bodenreider 2004] in the text. Apache cTAKES also identifies the context in which the concepts are mentioned in the sentence including negation, patient history, family history, and uncertainty. The identified UMLS concepts were mapped to concept definitions in 20 clinical terminologies (Figure 1) including ICD-9/10 (www.icd9data.com, www.icd10data.com), MeSH [Rogers 1963], SNOMED CT [Schulz and Klein 2008], and the Human Phenotype Ontology [Robinson et al 2008].
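The role of the context attributes can be illustrated with a minimal sketch. It is not the PMS_DN or Apache cTAKES code: the `ConceptMention` record, its field names, and the placeholder CUIs are assumptions made for illustration only. The sketch keeps the mentions that affirm a finding for the patient and sets aside those that are negated, uncertain, or about a family member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptMention:
    """One extracted concept mention with the context flags named in the text."""
    patient_id: str
    cui: str                      # UMLS CUI; the values below are placeholders
    negated: bool = False
    uncertain: bool = False
    historical: bool = False      # mentioned as part of the patient history
    family_history: bool = False  # refers to a relative, not the patient


def asserted_for_patient(mentions):
    """Keep mentions that affirm a finding for the patient.

    Historical mentions are kept here; excluding them is a policy choice.
    """
    return [
        m for m in mentions
        if not (m.negated or m.uncertain or m.family_history)
    ]


# Hypothetical mentions, not real patient data.
mentions = [
    ConceptMention("P1", "C0000001"),
    ConceptMention("P1", "C0000002", negated=True),
    ConceptMention("P2", "C0000001", family_history=True),
    ConceptMention("P3", "C0000001", historical=True),
]
kept = asserted_for_patient(mentions)
```

Only the mentions for P1 (first concept) and P3 survive the filter. Without the context flags, a negated or family-history mention would be counted as a phenotype of the patient.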

Figure 1.

PMS_DN uses the Apache cTAKES NLP engine to extract occurrences of UMLS concepts in the clinical notes of PMS patients. The UMLS concepts are mapped to 20 different terminologies including ICD-9, ICD-10, SNOMED, MeSH, and NDFRT. The i2b2/tranSMART user interface allows for easy browsing - starting with broad biomedical concepts and drilling down to find specific patients and data of interest. The i2b2/tranSMART user interface also displays the counts of patients (PC) and distinct terms (DTC) associated with each concept at all levels of the hierarchy.

Genetic Reports

The genetic reports include results from sequencing, CGH arrays, and Fluorescence In Situ Hybridization (FISH) probes. Genetic reports are first curated by trained genetic counselors who fill 57 structured fields to represent the genetic abnormalities. Because the genetic data are obtained with disparate techniques, all the curated genetic test result information was manually reviewed to extract the coordinates and genome assembly of the chromosomal abnormalities. Chromosomal coordinates for CGH were extracted from the relevant structured fields (chromosome, gain/loss, start, end), and from the International System for Human Cytogenetic Nomenclature (ISCN 2013) standard [Simons et al 2013] and comments where necessary. Chromosomal coordinates for FISH results were directly obtained in the GRCh38/hg38 genome assembly [Miga et al 2014] from the National Center for Biotechnology Information (NCBI) Clone database [Schneider et al 2013]. When multiple assays were available for the same region, the most recent or the most precise - in terms of resolution - assay was used. In order of decreasing resolution of the sequence data, sequencing output was preferred over array CGH, and array CGH was preferred over FISH. Chromosomal coordinates were transformed from each original human genome assembly to the latest one available at the time of this study, GRCh38/hg38, using the University of California - Santa Cruz (UCSC) liftOver tool (genome.ucsc.edu/cgi-bin/hgLiftOver). All duplications, deletions, and mutations were retained along with the original fields for the standard nomenclature, karyotype, and parental results; the only exceptions were chromosome alterations with coordinates that did not map to GRCh38/hg38.
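The assay preference rule can be sketched as a small selection function. This is an illustration under stated assumptions, not the curation code used for PMS_DN: the text says the "most recent or the most precise" assay was used, and this sketch assumes resolution is compared first, with recency as the tie-breaker. The `AssayResult` record and the coordinates are invented.

```python
from dataclasses import dataclass
from datetime import date

# Higher rank means higher resolution, following the order given in the text.
RESOLUTION_RANK = {"sequencing": 3, "array_cgh": 2, "fish": 1}


@dataclass(frozen=True)
class AssayResult:
    method: str         # one of the keys of RESOLUTION_RANK
    performed_on: date
    chrom: str
    start: int
    end: int


def pick_assay(results):
    """Return the preferred result for one region.

    Assumption: best resolution wins; the newest assay breaks ties.
    """
    return max(results, key=lambda r: (RESOLUTION_RANK[r.method], r.performed_on))


# Invented results for a single region; the coordinates are not patient data.
candidates = [
    AssayResult("fish", date(2015, 3, 1), "22", 50_000_000, 50_800_000),
    AssayResult("array_cgh", date(2012, 6, 1), "22", 50_100_000, 50_750_000),
    AssayResult("array_cgh", date(2014, 1, 15), "22", 50_120_000, 50_760_000),
]
best = pick_assay(candidates)
```

Here the 2014 array CGH result is selected: it outranks FISH on resolution even though the FISH assay is more recent, and it is newer than the 2012 array CGH result. Swapping the order of the two elements in the sort key would encode the alternative reading, in which recency takes priority.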

Patient Reported Outcomes (PRO)

The Patient Reported Outcomes (PRO) data stored in the PMSIR comprises 1,300 questions over three distinct questionnaires:

  1. A “clinical” questionnaire with questions regarding diagnosed comorbidities, symptoms, tests, and treatments for the whole range of known pathologies and features associated with PMS,

  2. A “developmental” questionnaire, focusing on physical, motor, behavioral, cognitive, and social development, and

  3. An “adult” questionnaire with specific questions aimed at patients aged 12 or more, regarding the evolution of symptoms after puberty.

All the questions from the PRO dataset were manually mapped to UMLS Concept Unique Identifiers (CUIs) by a clinical expert before being preprocessed for statistical analysis.

The knowledge extracted from clinical notes was loaded by dedicated Extract Transform Load (ETL) pipelines into the PMS_DN data repository along with the PRO data and the processed curated genetic reports of the PMS patients.

Data Integration on PMS_DN: Leveraging the i2b2/tranSMART platform

PMS_DN leverages the capabilities of the i2b2/tranSMART knowledge management platform [Scheufele et al 2014; Szalma et al 2010; Perakslis et al 2010; Patel et al 2016] to integrate heterogeneous datasets - including phenome, exposome, and genome data - and to facilitate browsing and comparative analysis of these datasets. The i2b2/tranSMART platform is layered upon the Informatics for Integrating Biology and the Bedside (i2b2) clinical and biomedical data integration platform [Kohane et al 2012b; Murphy et al 2010]. The i2b2 platform uses a simple and intuitive “observation centric” star schema data model that accommodates a variety of longitudinal patient level datasets including clinical data, prescriptions, and laboratory values. Multiple hierarchical ontologies describe the types of data contained within i2b2, allowing users to start with broad biomedical concepts and drill down to find specific patients and data of interest (Figure 1). New data types can be added to i2b2 by modifying the ontology but without changing the underlying database schema or the software. The ease of use of i2b2 has led to its adoption by over 150 university hospital research centers worldwide.
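A toy version of an observation-centric star schema shows why hierarchical drill-down and schema-free extension are straightforward. The sketch uses Python's built-in `sqlite3` module. The table and column names echo i2b2 but are heavily simplified, and the concept paths (written with "/" for readability), the `DEMO:` codes, and the rows are all invented for illustration.

```python
import sqlite3

# In-memory toy database with one fact table and one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_dimension (
        concept_path TEXT PRIMARY KEY,  -- position in the hierarchy
        concept_cd   TEXT NOT NULL      -- code stored on each fact
    );
    CREATE TABLE observation_fact (
        patient_num INTEGER NOT NULL,
        concept_cd  TEXT NOT NULL,
        start_date  TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?)",
    [
        ("/Phenotype/Neurologic/Hypotonia/", "DEMO:HYPOTONIA"),
        ("/Phenotype/Neurologic/Seizures/", "DEMO:SEIZURES"),
        ("/Phenotype/Renal/Vesicoureteral reflux/", "DEMO:VUR"),
    ],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?)",
    [
        (1, "DEMO:HYPOTONIA", "2014-01-10"),
        (1, "DEMO:SEIZURES", "2015-03-02"),
        (2, "DEMO:SEIZURES", "2015-06-20"),
        (2, "DEMO:SEIZURES", "2016-02-11"),  # repeat observation, same patient
        (3, "DEMO:VUR", "2013-09-05"),
    ],
)


def patient_count(path_prefix):
    """Count distinct patients with any observation under a hierarchy node."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT f.patient_num)
        FROM observation_fact AS f
        JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
        WHERE c.concept_path LIKE ?
        """,
        (path_prefix + "%",),
    ).fetchone()
    return row[0]


broad = patient_count("/Phenotype/")
neurologic = patient_count("/Phenotype/Neurologic/")
narrow = patient_count("/Phenotype/Neurologic/Hypotonia/")

# A new data type is only new rows: no table is altered.
conn.execute(
    "INSERT INTO concept_dimension VALUES (?, ?)",
    ("/Genotype/Deletion/22q13/", "DEMO:DEL22Q13"),
)
conn.execute(
    "INSERT INTO observation_fact VALUES (?, ?, ?)",
    (2, "DEMO:DEL22Q13", None),
)
genotype = patient_count("/Genotype/")
```

Drilling down is a prefix match on the concept path, so the same query serves every level of the hierarchy. The genotype concept was added by inserting rows only, which mirrors the statement that new data types require an ontology change but no schema change.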

Authorized user access to PMS_DN

The primary target audience for PMS_DN is clinical practitioners and researchers working in the areas of autism and other neuropsychiatric disorders. Qualified applicants affiliated with research institutions with an active interest in the research into neuropsychiatric disorders can request access to PMS_DN by filling out a registration form and agreeing to the terms of use. The registration request is reviewed by a Data Network Specialist at PMSF before approval.

Access to PMS_DN is granted at one of two levels: a basic level (Level 1) or an advanced level (Level 2). Level 1 access allows users to browse through and interrogate the patient aggregates of the integrated datasets on PMS_DN’s Web portal. Figure 2 demonstrates the use of the i2b2/tranSMART interface to test a hypothesis about the relationship between patient age and hypotonia, a commonly reported symptom in PMS patients. Users with Level 2 access privileges, obtained from PMSF after mandatory Institutional Review Board (IRB) clearances from their institutions of affiliation, can see the raw, de-identified patient level data (Figure 3) and download it as well. In addition, investigators with Level 2 access privileges can access a novel validation tool, which allows them to verify the accuracy of the knowledge extracted from clinical notes by cross-checking the identified concepts against the anonymized sentences from which they were extracted (Figure 4). While eliminating residual errors of cTAKES caused by ambiguous context of the raw text, the validation tool improves the trustworthiness of the knowledge by allowing authorized investigators to see the raw text source of the knowledge. The input of the investigators is used to immediately update the knowledge in the PMS_DN repository (Figure 5). In a future release of PMS_DN, we will display the credentials of the experts performing the validation to other users, so the credibility of the validation input can be independently assessed. It must be noted that the knowledge validation step is not an exhaustive review of the entirety of the NLP engine’s output. Instead, it is an open-ended process where experts choose to crosscheck specific concepts of interest identified by the NLP pipeline against the raw anonymized sentences in which they occur.

Figure 2.

Hypothesis testing on the i2b2/tranSMART interface of PMS_DN. In STEP 1, the user drags and drops the “hypotonia” concept and the “Yes” and “No” values for this concept into the two different subset boxes. Then the user clicks the “Generate Summary Statistics” button. In STEP 2, the user drags and drops the “AGE IN YEARS” concept into the Summary section to test the hypothesis that Hypotonia is correlated with age of the patient. The RESULT shows that no significant correlation can be found.

Figure 3.

PMS_DN users with advanced access privileges obtained from the PMS Foundation (following appropriate IRB clearances) can view the raw data and perform basic sorting operations on the raw PMS patient data on PMS_DN (A) and also export it (B)

Figure 4.

The pop-up validation window allows clinical experts to cross check the extracted instance of the “Pes Cavus” concept from the Human Phenotype Ontology (“BEFORE Validation” screenshot) against the raw text from which it was extracted (“Pop-up Validation Window” screenshots). Clicking on the grey icon next to the “Pes Cavus” concept brings up the Pop-up Validation window where the user can see the raw sentences from which the concept was extracted. Verification by the expert (by deselecting the checkbox against the raw sentence for patient 2) results in the “Pes cavus” concept being displayed in a green colored font (“AFTER Validation” screenshot) indicating to future users that it has been verified by clinical experts. Note the change in the Patient Count value: from 2 in the “BEFORE Validation” screenshot to 1 in the “AFTER Validation” screenshot. This indicates the immediate update of the knowledge base with the expert’s input on the validation window.

Figure 5.

The novel validation tool that can be used by clinical experts to crosscheck the identified concepts against the sentences from which they were extracted. a) Apache cTAKES extracts instances of UMLS concepts from the raw text of clinical notes. b) The output of cTAKES is loaded into the PMS_DN database. c) An expert uses the validation tool to verify the extracted UMLS concept against the raw text source. d) The expert verifies that the extraction of the UMLS concept is valid (or otherwise). e) The input of the experts is used to update the knowledge in the PMS_DN database immediately.
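The validation loop described in Figures 4 and 5 can be sketched as follows. This is a simplified model, not the PMS_DN implementation: the class names, the `accepted` flag, and the placeholder sentences are assumptions. It reproduces the behaviour shown in Figure 4, where rejecting one source sentence lowers the patient count for a concept from 2 to 1 and marks the concept as validated.

```python
from dataclasses import dataclass


@dataclass
class ExtractedSentence:
    patient_id: str
    text: str              # de-identified source sentence (placeholder here)
    accepted: bool = True  # set to False when an expert rejects the extraction


class ConceptRecord:
    """A concept with the sentences it was extracted from."""

    def __init__(self, name, sentences):
        self.name = name
        self.sentences = list(sentences)
        self.validated = False  # would drive the green font in the interface

    def patient_count(self):
        """Count distinct patients with at least one accepted sentence."""
        return len({s.patient_id for s in self.sentences if s.accepted})

    def apply_validation(self, rejected_indexes):
        """Record expert input; counts reflect it on the next read."""
        for i in rejected_indexes:
            self.sentences[i].accepted = False
        self.validated = True


record = ConceptRecord(
    "Pes cavus",
    [
        ExtractedSentence("patient-1", "<sentence affirming the finding>"),
        ExtractedSentence("patient-2", "<sentence the expert judges a false hit>"),
    ],
)
before = record.patient_count()
record.apply_validation(rejected_indexes=[1])
after = record.patient_count()
```

Because the count is derived from the accepted sentences instead of being stored separately, the expert's input takes effect immediately, as the caption describes.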

PMS_DN uses the single sign-on feature of the OAuth2 authorization protocol (oauth.net/2/), so users log in to the i2b2/tranSMART Web portal with their existing credentials from: a) Harvard Medical School, Boston Children’s Hospital, or the University of Pittsburgh, b) NIH eRA Commons, c) Google Mail, or d) GitHub (github.com). Because PMS_DN does not store user login credentials, the OAuth2-based single sign-on obviates a potential security loophole.

To foster collaborative research with similar Patient Powered Research Networks, snapshots of PMS_DN data in the form of patient counts for queried parameters are available to authorized investigators using distributed research networks such as SHRINE [Weber et al 2009] and the PCORnet PopMedNet (www.popmednet.org).

Cloud Hosting

To ensure long-term scalability and to eliminate concerns about data archival and hardware maintenance and procurement, we have ported the PMS_DN application to a HIPAA compliant cloud-based environment hosted by Amazon Web Services (AWS, aws.amazon.com). The PMS_DN data repository is hosted on a Relational Database Service (RDS) instance of AWS. The ETL pipelines are hosted on dedicated Elastic Compute Cloud (EC2) instances of AWS. The raw clinical notes are stored in a secure Simple Storage Service (S3) bucket of AWS prior to processing. Figure 6 displays the entire cloud-based architecture and data flows of PMS_DN.

Figure 6.

Cloud-hosted architecture and data flow of PMS_DN

RESULTS

As of October 31, 2016, 623 families (334 in the USA) provided consent to PMSF to share their data with PMS_DN. PMS_DN integrates: a) the knowledge extracted by Apache cTAKES from the clinical notes of 112 patients comprising 40,320 pages in 2202 files, b) preprocessed PRO data from 415 patients, and c) curated genetic information from 214 patients. Following integration, 70 patients were linked across the three datasets i.e., PMS_DN has the full complement of clinical notes, genetic reports, and PRO data for 70 patients, enabling comparative analyses across the datasets (Figure 7). This number is expected to increase as more patient data becomes available. Authorized users can access and interrogate the integrated PMS patient data on the i2b2/tranSMART Web user interface of PMS_DN at https://pmsdn.hms.harvard.edu. Level 2 users with advanced access privileges and the appropriate IRB clearances can: i) obtain advanced, raw data download privileges on PMS_DN from PMSF and also ii) verify the accuracy of the knowledge extracted from clinical notes by cross-checking the identified concepts with anonymized sentences from which they were extracted using the validation tool.

Figure 7.

As of Oct 31, 2016, PMS_DN integrates the knowledge extracted from the clinical notes of 112 patients with the curated genetic reports of 214 patients and the Patient Reported Outcomes data obtained from 415 patients. PMS_DN contains all three datasets - PRO, clinical notes, and curated genetic reports - of 70 patients.
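Linking patients across the three datasets amounts to intersecting the sets of patient identifiers. The following minimal sketch uses invented identifiers; it does not reproduce the counts above (112, 214, 415, and 70), and the function name is hypothetical.

```python
def linked_patients(*datasets):
    """Return the patient IDs that appear in every dataset."""
    id_sets = [set(d) for d in datasets]
    return set.intersection(*id_sets) if id_sets else set()


# Invented identifiers, for illustration only.
clinical_notes = {"P01", "P02", "P03"}
genetic_reports = {"P02", "P03", "P04", "P05"}
pro_data = {"P01", "P02", "P03", "P05", "P06"}

complete = linked_patients(clinical_notes, genetic_reports, pro_data)
```

In this toy example only P02 and P03 have all three data types. Patients outside the intersection remain usable for analyses that need one or two of the datasets.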

DISCUSSION

PMS_DN facilitates research into the origins and treatment of PMS by making high quality, trustworthy knowledge available to clinical practitioners and investigators researching neuropsychiatric disorders, while safeguarding patient privacy through rigorous patient de-identification methods.

Patient De-Identification

PMS_DN uses a combination of two independent anonymizers - the MITRE MIST anonymizer [Aberdeen et al 2010] and the Scrubber toolkit [McMurry et al 2013] in the Apache cTAKES NLP engine - to remove PHI elements from the clinical notes. In a study [McMurry et al, 2013], the Scrubber toolkit in Apache cTAKES identified and removed approximately 98% of the PHI elements (Recall = 98%) from a test corpus of clinical notes selected from the i2b2 De-Identification Challenge dataset [Uzuner et al, 2007]. However, the same study reported a very low precision score, i.e. a number of useful non-PHI elements were removed from the clinical notes by the Apache cTAKES Scrubber in addition to the PHI elements. Another investigation studied the effectiveness of the MITRE MIST tool in removing PHI elements from clinical notes [Deleger et al 2013] and reported F-Scores (the harmonic mean of precision and recall metrics) [Hripcsak and Rothschild, 2005] of 93.48% and 95.2% at sentence-level and word-level de-identification. These performance metrics were comparable with the performance of human experts in identifying PHI elements in the same corpus of clinical notes.
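The relationship between these metrics can be made concrete with a short calculation. The counts below are invented to show how a de-identifier can combine high recall with low precision; they are not the figures reported in the cited studies, and only the 98% recall is chosen to match the text.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (F-score)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 98 of 100 PHI elements removed (recall 0.98),
# but 102 non-PHI elements removed as well (precision 0.49).
p, r, f = precision_recall_f1(
    true_positives=98, false_positives=102, false_negatives=2
)
```

With these counts the F-score is about 0.65, well below the recall. The harmonic mean penalizes the imbalance, which is why a tool that removes nearly all PHI can still score poorly if it also removes much useful text.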

Because a maximally effective de-identifier with maximal precision and recall performance metrics has yet to be developed, the PMS_DN combines the two independent anonymizers to try and remove PHI elements from the PMS patients’ clinical notes to the maximum extent possible. Despite these efforts, the likelihood of the appearance of PHI elements in the clinical notes cannot be ruled out. We have attempted to mitigate this limitation by restricting the visibility of the anonymized raw text of the clinical notes (on the validation window) to only those users with Level 2 advanced access privileges.
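One way to combine two independent anonymizers is to redact the union of the character spans that either tool flags. The sketch below assumes this union strategy; the source does not state how the outputs of MIST and the Scrubber are combined, and neither tool's API is used here. The example sentence, the helper names, and the `[PHI]` mask are invented.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) character spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text, spans_a, spans_b, mask="[PHI]"):
    """Replace the union of both detectors' spans with a mask token."""
    out, cursor = [], 0
    for start, end in merge_spans(list(spans_a) + list(spans_b)):
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def span_of(text, token):
    """Helper for the example: locate a token and return its span."""
    start = text.index(token)
    return (start, start + len(token))


# Invented sentence; the two "detectors" are hard-coded stand-ins.
text = "Seen by Dr. Smith on 01/02/2014 for follow-up."
detector_a = [span_of(text, "Smith")]                               # name only
detector_b = [span_of(text, "Smith"), span_of(text, "01/02/2014")]  # name and date
clean = redact(text, detector_a, detector_b)
```

Taking the union can only raise recall relative to either tool alone, at the cost of precision, since every false positive from either tool is also removed. That trade-off is consistent with the priority on removing PHI to the maximum extent possible.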

Given the early stage of deployment of PMS_DN, the patient de-identification pipeline has not been observed to adversely impact the comprehensibility of the content of the clinical notes so far. A typical example of an anonymized sentence from the clinical notes can be seen in Figure 4 as displayed in the validation window to a Level 2 user for verification. At present, only the exact sentence from which the concept was extracted is displayed in the validation window. In a future version of PMS_DN, we plan to display, in addition to the source sentence for the concept, the sentences immediately preceding and following this source sentence to try and make the context clearer to the user accessing the validation window.

Knowledge Extraction from Clinical Notes and Expert Validation

From the anonymized text in clinical notes, the Apache cTAKES NLP engine identified the mentions of concepts defined in the UMLS in addition to the appropriate context - including negation, uncertainty, patient history, and family history - in which the extracted concept is mentioned. The validation tool allows experts to cross-check whether the context has been correctly identified by the NLP engine and make corrections where necessary. This is intended to be an open-ended process and not an exhaustive review of the functional efficiency of the NLP engine. The credentials of the experts who perform these validations - including research background and interests, institutions of affiliation, and the relevance of their proposed research work with PMS patient data to the objective of PMS Foundation - are carefully reviewed by the steering committee at the PMS Foundation before access is granted. This ensures a certain level of credibility to the validation input from the experts. At present, the identities of the users who perform these validations are logged by PMS_DN but not displayed to the end users. In future, the credentials of the users will also be displayed to authorized users with Level 2 access privileges so the quality of the input can be independently assessed by users.

PRO Questionnaire

The PRO dataset comprises answers to approximately 1,300 questions that were sourced by the PMSF from existing surveys and databases, including the Autism Genetic Resource Exchange (AGRE) [Lajonchere, 2010], the PMS survey by Dr. Katy Phelan, and Common Data Elements and questions in other specific condition surveys about phenotypes reported by PMS patients including seizures, lymphedema, sleep disorders, behavioral disorders, and developmental delays as well as cardiac and renal abnormalities. Expert researchers reviewed and edited the initial draft of questions and delivered two sets of questions: A Clinical Survey of 100 questions split into 23 topics such as Cardiovascular, Seizures, and Sleep and a Developmental Survey split into 11 topics such as Fine and Gross Motor Skills, Puberty Status, and Communication Development. Some of these questions are specific to the symptoms exhibited by PMS patients such as dysplastic toenails. There are also a number of questions that ask about more common conditions such as seizures, reflux, and behavioral patterns associated with ASD. The ASD related questions are relevant given that studies have reported the prevalence of symptoms of ASD in PMS patients [Oberman et al, 2015] and gene-linkage studies have associated SHANK3 mutations with ASD [Leblond et al, 2014; Uchino and Waga, 2013].

Data Sharing

It would be desirable to promote data sharing between PMS_DN and the other PPRNs to foster collaborative research between these projects. However, the stipulations of patient privacy regulations preclude easy data sharing. Therefore, at present, only patient counts can be shared across these projects over distributed research networks such as SCILHS SHRINE and PCORnet PopMedNet. A unified questionnaire pertinent to PMS patients as well as patients diagnosed with disorders related to other PPRNs would be highly desirable. This would spare the families of patients with rare disorders from the hassle of having to repeatedly provide the same information across different questionnaires. The Research Domain Criteria (RDoC) framework from the National Institute of Mental Health (NIMH) [Insel et al, 2010] could be useful in addressing this concern. The objective of the RDoC framework is to bring about synergy between the diverse research projects into mental and behavioral disorders and, by extension, between the various surveys that are used in the research into these disorders. RDoC provides a rigid framework comprising units of analysis (from molecules to self-report) for behavioral and developmental domains including cognitive, positive valence, negative valence, social processes, and arousal and regulatory systems. As a first step in this direction, we are in the process of mapping the questions from the PRO dataset of PMS_DN to the domains of the RDoC framework. With similar mapping initiatives from other PPRNs over time, the vision of a unified questionnaire for all rare disorders can be achieved.

CONCLUSION

In this paper, we have described PMS_DN, a Patient Powered Research Network that exemplifies the potential of collaborations between academic researchers and family organizations such as PMSF to drive research into a rare genetic disorder: PMS. PMS_DN addresses the paucity of patient data in rare disease research by exploiting the rich yet underutilized sources of knowledge about patient conditions: clinical notes and self-reported outcomes. PMS_DN uses a state-of-the-art NLP engine, Apache cTAKES, to extract context sensitive knowledge from rich text descriptions in patient clinical notes before making this knowledge, along with self-reported outcomes and genetic reports of the same patient cohort, available to authorized investigators. Further, to minimize inaccuracies in the extracted knowledge, PMS_DN implements a novel knowledge validation tool that utilizes clinical expert input to eliminate residual ambiguities. PMS_DN is hosted in a cloud computing environment guaranteeing scalability while mitigating concerns regarding long term viability of the project. By integrating diverse and heterogeneous data about patient phenotypes and genotypes, PMS_DN facilitates research that can identify patient subgroups for targeted therapies based upon genomic and phenotypic profiles. The comparative analyses of integrated datasets, made possible by PMS_DN, have the potential to yield an improved understanding of the associations between genotypic profiles and patient phenotypes.

Acknowledgments

The Phelan-McDermid Syndrome Foundation, the Phelan-McDermid Syndrome International Registry, the patients and their families, Chris Botka, and the Harvard Medical School Research Computing center.

FINANCIAL SUPPORT

This work was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (PPRN-1306-04814) phase I and II for development of the National Patient-Centered Clinical Research Network, known as PCORnet; by Research Grant EDU_R_FY2015_Q2_HarvardMedicalSchool_Avillach-NEW from Amazon Inc.; and National Institutes of Health — RFA-HG-13-009 — Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) — Grant Number 1U54HG007963-01

Footnotes

CONFLICT OF INTEREST

None

DISCLAIMER

The statements presented in this article are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee or other participants in PCORnet.

Contributor Information

Cartik Kothari, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Maxime Wack, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Claire Hassen-Khodja, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Sean Finan, Boston Children’s Hospital, Boston, MA, USA.

Guergana Savova, Boston Children’s Hospital, Boston, MA, USA.

Megan O’Boyle, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Geraldine Bliss, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Andria Cornell, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Elizabeth J Horn, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Rebecca Davis, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Jacquelyn Jacobs, Phelan-McDermid Syndrome Foundation, Venice, FL, USA.

Isaac Kohane, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Paul Avillach, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
