Author manuscript; available in PMC: 2021 Jun 1.
Published in final edited form as: Int J Med Inform. 2020 Mar 19;138:104121. doi: 10.1016/j.ijmedinf.2020.104121

Pilot Evaluation of Sensitive Data Segmentation Technology for Privacy

Adela Grando a, Davide Sottara b, Ripudaman Singh c, Anita Murcko a, Hiral Soni a, Tianyu Tang d, Nassim Idouraine a, Michael Todd e, Mike Mote f, Darwyn Chern g, Christy Dye g, Mary Jo Whitfield h
PMCID: PMC7229704  NIHMSID: NIHMS1585630  PMID: 32278288

Abstract

Background

Consent2Share (C2S) is an open source software created by the Office of the National Coordinator Data Segmentation for Privacy initiative to support electronic health record (EHR) granular segmentation. To date, there are no published formal evaluations of Consent2Share.

Method

Structured data (e.g. medications) codified using standard clinical terminologies (e.g. RxNorm) was extracted from the EHR of 36 patients with behavioral health conditions from study sites. EHRs were available through a health information exchange and two sites. The EHR data was already classified into data types (e.g. procedures and services). Both Consent2Share and health providers classified EHR data based on value sets (e.g. mental health) and sensitivity (e.g. not sensitive). Descriptive statistics and Chi-square analysis were used to compare differences between data categorizations.

Results

From the resulting 1,080 medical record items, 584 were distinct. Significant differences were found between sensitivity classifications by Consent2Share and providers (χ2 (2, N = 584) = 114.74, p < 0.0001). Sensitivity comparisons led to 56.0% agreements, 31.2% disagreements, and 12.8% partial agreements. Most (97.8%) disagreements resulted from information classified as not sensitive by Consent2Share, but sensitive by providers (e.g. behavioral health prevention education service). In terms of data types, most disagreements (57.1%) focused on procedures and services information (e.g. ligation of fallopian tube). When considering value sets, most disagreements focused on genetic data (100.0%), followed by sexual and reproductive health (88.9%).

Conclusions

There is a need to further validate Consent2Share before broad use in health care settings. The outcomes from this pilot study will help guide improvements in segmentation logic of tools like Consent2Share and may set the stage for a new generation of personalized consent engines.

Keywords: data segmentation, data privacy, electronic medical records

1. Introduction

Health information exchanges (HIEs) pose opportunities for sharing comprehensive patient information from electronic health records (EHRs) among a larger pool of providers. Benefits of data sharing between organizations include fewer duplicated procedures, reduced imaging, lower costs, improved accessibility, efficiency, quality and safety, and better patient experience.1–3 The shift to HIEs and EHRs, however, may conflict with patients’ privacy and confidentiality needs.4–10 For that reason, in the United States the National Committee on Vital and Health Statistics (NCVHS) recommended allowing individuals to have limited control, in a uniform manner, over the disclosure of certain sensitive health information for purposes of treatment.4 The proposed approach is consistent with Title 42 of the Code of Federal Regulations Part 2 (42 CFR Part 2) passed by the United States Congress in 1972.11 This law requires that a program receiving federal funding not use or disclose information about an individual who has applied for or been given diagnosis or treatment for alcohol or drug abuse without the individual’s express consent, with limited exceptions. Other federal and state laws and regulations restrict disclosure of HIV test results, genetic test results, and other sensitive health information.12

The NCVHS acknowledges that individuals differ in their opinions about which categories of health information should be considered sensitive. The NCVHS recognizes that designating particular categories of sensitive information and defining what information is included in each category “will be a complex and difficult undertaking.” Nevertheless, they believe that “it is important to designate categories of sensitive health information with precise definitions.” 4 Data segmentation refers to the process of sequestering from capture, access, or view certain data elements that are perceived by a legal entity, institution, organization, or individual as being undesirable to share.13 The 2014 Substance Abuse and Mental Health Services Administration (SAMHSA)-Health Resources and Services Administration Center for Integrated Health Solutions report stated that a key technical barrier preventing behavioral health and primary care data integration is the “all or nothing” consent format used by most systems because they lack capabilities to automatically segment sensitive medical records.5 This constraint negates a 42 CFR Part 2 facility’s ability to fully participate in data exchange, leading to a negative incentive for providers/facilities that operate with sensitive data to participate in an HIE. Choosing to err on the side of caution, many HIEs are restricting information sharing to physical health data.

The technical challenge is to come to a consensus on a standard method that will support the automatic management and implementation of data segmentation policies driven by patients or required by law. This must be done within HIE and EHR environments so that individually identifiable medical information may be appropriately shared. As part of the Office of the National Coordinator (ONC) Data Segmentation for Privacy (DS4P) initiative,14 SAMHSA and the Veteran Administration (VA) created the Consent2Share software to support automatic granular data segmentation.15 The Consent2Share platform relies on terminology-based data sharing models to specify privacy-meaningful categories of health information (e.g., ‘mental health’ information) and classify data from the patient’s EHR into those categories (e.g. the ICD10CM concept ‘depression’ is tied to a type of ‘mental health’ information). Consent2Share was created to support granular consent options aligning with various federal and state data sharing requirements. To the best of our knowledge, Consent2Share is the only software available to provide automatic segmentation of sensitive medical records. While the deployment of Consent2Share in a production environment has been pilot tested, there has been no formal evaluation of how Consent2Share identifies and segments sensitive medical records.16 There is a need to formally assess Consent2Share’s accuracy in segmenting sensitive medical records before its broad use in health care settings. Lessons learned from this exercise may also guide a new generation of personalized granular data segmentation consent engines.

Individuals diagnosed with behavioral health conditions (BHCs) may have privacy concerns that are different from those of individuals with primarily physical conditions. The significance of these concerns is reflected by the prevalence of BHCs. BHCs include substance use disorders, serious psychological distress, suicide, and mental health disorders.17 In the US, BHCs affect over 44 million (18.3%) adults, including 10.4 million (4.2%) adults who have a serious mental illness (SMI), such as major depressive disorder or schizophrenia.18 There is limited research on the data privacy concerns of individuals,19 and in particular of those with BHCs.10

As part of a 5-year grant funded by the National Institute of Mental Health, we aim to understand data sharing preferences of individuals with BHCs, including patients with SMI. The goal is to guide the development and evaluation of a granular electronic informed consent tool, MyChoice, developed as an extension of Consent2Share. The outcomes of this research will help define sensitive data categories supporting patient preferences and develop educational material to guide informed choices.

2. Objectives

Manual segmentation of sensitive information accessible through HIEs and EHRs may be onerous and impractical to implement, but automated methods may help; however, little is known about automatic means to segment sensitive information. The main objective of this study was to assess Consent2Share’s accuracy in segmenting sensitive medical records and to inform the development of a new generation of personalized consent engines. Specific aims of this study are to: 1) quantify differences in sensitive data categorizations between Consent2Share and a gold standard (health care providers), 2) identify and explain main areas of disagreement, and 3) suggest ways to improve Consent2Share’s segmentation logic and inform the future development of similar technologies.

3. Methods

3.1. Study Sites

Our study sites include two integrated (physical and behavioral health care) clinics in Arizona. Site 1 offers general behavioral health and social services to children, families, and adults of all ages. Site 2 offers a range of recovery-focused services to adult patients with SMI. Both sites receive federal assistance for the diagnosis, treatment, and referral of patients for alcohol or drug abuse conditions and are therefore subject to 42 CFR Part 2. Both sites have been participating agencies in Arizona’s HIE since 2015.

HealthCurrent, Arizona’s state-wide physical and behavioral health HIE, supports both directed and query-based exchange.1,20 The HIE follows an opt-out consent model where patient data are automatically shared unless the individual explicitly declines to share. 42 CFR Part 2 gives special confidentiality protection to substance abuse treatment records from federally-assisted substance abuse treatment programs. Because a patient’s substance abuse treatment information may be commingled with the patient’s other health information from these programs, HealthCurrent keeps all health information (including physical care records) it receives from these integrated clinic providers separate from the rest of the patient’s health information. A patient’s physical and behavioral health information from participating health care providers (including Sites 1 and 2) is available through the HIE following an “all or nothing” model, only if a patient gives written consent or in a medical emergency.21

From 2015 until January 2020, the number of patients sharing data through the HIE has increased from 6.3 to 9.6 million. During this same period, organizations participating in the HIE grew from 73 to 608.

3.2. Study Participants

In 2017, participants in a data privacy survey were asked permission to be re-contacted for a follow-up study.9 In 2018, survey participants who provided contact information were invited to participate in this follow-up study. In 2019, we obtained access to the EHRs of those who consented to participate. The inclusion criteria for this study were: patients from Sites 1 and 2, diagnosed with general mental health illnesses or SMI, and 21 years old or older. Participants were compensated for their time.

3.3. Medical Record Access

Following Institutional Review Board approval, patients were asked to consent for access to the last five years (2015 to 2019) of their structured, codified medical records (e.g. depression diagnosis, codified with ICD9CM) available through Site 1 or Site 2 and any other HIE participating health care facility where the patient has received care. As part of the consent, participants signed a HIPAA authorization to release access to their EHRs. This was the first time that the state’s HIE medical records were accessed for research.

3.4. Value Set Access

The Consent2Share tool adopts seven extensional value sets (defined by Health Level 7 (HL7) as an explicitly enumerated set of codes 22) to list and group concepts associated with:

  1. drug abuse,

  2. alcohol abuse and alcoholism,

  3. mental health,

  4. HIV/AIDS and other communicable diseases,

  5. genetic information,

  6. sexual and reproductive health, and

  7. other addictions.

SAMHSA value sets are composed of triplets ‘term, code, code system’, where code system can take the value of ICD10CM, ICD9CM, RXNORM, LOINC, CPT, HCPCS or SNOMED CT. For instance, ‘Paranoid schizophrenia, F20.0, ICD10CM’ belongs to the ‘mental health’ value set.* The extensional value sets were created by SAMHSA in April 2016 and were made available through the Value Set Authority Center (VSAC) before being reviewed by subject matter experts.23 Table 1 provides details on the content of the value sets created by SAMHSA and adopted by Consent2Share. Take for instance the ‘drug abuse’ value set, which has been subdivided into five subsets: (1) amphetamine use disorders, (2) inhalants, (3) other psychoactive substance use disorders, (4) sedative hypnotic or anxiolytic related disorders and (5) substance use information. The ‘substance use information’ subset contains nine LOINC codes, one RXNORM code, and one HCPCS code, but no ICD9CM, ICD10CM, SNOMED CT or CPT codes. Value sets are disjoint. For more details on the rationale used by SAMHSA to choose and define the value sets, we refer the reader to 24.
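The triplet structure and the disjointness property described above can be illustrated with a minimal sketch. The names `CodedConcept`, `VALUE_SETS`, `value_sets_containing` and `are_disjoint` are hypothetical and are not part of Consent2Share; the miniature value sets hold only the one example given in the text, plus the empty ‘genetic information’ set.

```python
from typing import NamedTuple


class CodedConcept(NamedTuple):
    """One value set entry: the 'term, code, code system' triplet."""
    term: str         # human-readable label, used for presentation only
    code: str
    code_system: str  # e.g. "ICD10CM", "RXNORM", "LOINC"


# Hypothetical miniature value sets; the real ones hold thousands of entries.
VALUE_SETS = {
    "mental health": {
        CodedConcept("Paranoid schizophrenia", "F20.0", "ICD10CM"),
    },
    "genetic information": set(),  # empty, as in the SAMHSA release
}


def value_sets_containing(code, code_system, value_sets=VALUE_SETS):
    """Return the names of all value sets listing the code/code system pair."""
    return sorted(
        name
        for name, concepts in value_sets.items()
        if any(c.code == code and c.code_system == code_system for c in concepts)
    )


def are_disjoint(value_sets):
    """True if no code/code system pair appears in more than one value set."""
    seen = set()
    for concepts in value_sets.values():
        keys = {(c.code, c.code_system) for c in concepts}
        if keys & seen:
            return False
        seen |= keys
    return True
```

Matching is done on the code and code system only, since the term is a label for display; this is an assumption consistent with the description of coded concepts in the text.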

Table 1:

Details on SAMHSA value sets content

| Value Set (terminology and frequency) | ICD9 | ICD10 | LOINC | SNOMED | RXNORM | CPT | HCPCS |
|---|---|---|---|---|---|---|---|
| Drug Abuse | 88 | 199 | 383 | 693 | 2 | 0 | 8 |
| • Amphetamine Use Disorders | 17 | 45 | 374 | 44 | 1 | 0 | 7 |
| • Inhalants | 25 | 40 | 0 | 21 | 0 | 0 | 0 |
| • Other Psychoactive Substance Use Disorders | 23 | 57 | 0 | 123 | 0 | 0 | 0 |
| • Sedative Hypnotic or Anxiolytic related Disorders | 23 | 57 | 0 | 122 | 0 | 0 | 0 |
| • Substance Use Information | 0 | 0 | 9 | 0 | 1 | 0 | 1 |
| Alcohol Abuse and Alcoholism | 42 | 124 | 58 | 179 | 1 | 0 | 0 |
| Mental Health | 270 | 308 | 19 | 1,589 | 131 | 0 | 0 |
| HIV/AIDs and Communicable Diseases | 0 | 0 | 89 | 2 | 2 | 46 | 0 |
| Sexual and Reproductive Health Information | 26 | 0 | 0 | 0 | 4 | 0 | 0 |
| Other Addictions | 75 | 270 | 1,246 | 160 | 9 | 1 | 0 |
| • Cannabis Use Disorders | 11 | 48 | 285 | 24 | 0 | 0 | 0 |
| • Cocaine Use Disorders | 37 | 92 | 0 | 68 | 0 | 0 | 0 |
| • Hallucinogens | 23 | 67 | 70 | 59 | 0 | 0 | 0 |
| • Opioids | 0 | 43 | 885 | 0 | 6 | 1 | 0 |
| • Screening for Tobacco Use in Prior 30 days | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| • Tobacco Use Assessment | 0 | 0 | 5 | 0 | 0 | 0 | 0 |
| • Tobacco Use Disorders | 4 | 20 | 0 | 9 | 3 | 0 | 0 |
| Genetic Information | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 664 | 1,370 | 3,424 | 3,093 | 160 | 48 | 16 |

3.5. Consent2Share Deployment

Consent2Share is open source software that supports the segmentation of structured medical record information codified in standard terminologies (e.g. ICD10CM).25 Figure 1 describes the Consent2Share process by which patients upload medical records, select which sensitive medical records to share, with whom and for which purposes, and have sensitive EHR data filtered out based on granular data sharing choices and value sets:

  1. Upload EHR data through the Upload Document functionality. The data must be structured, codified and standardized using HL7 interchange formats such as Consolidated Clinical Document Architecture (C-CDA) or, more recently, Fast Healthcare Interoperability Resources (FHIR).

  2. Provide sharing preferences through the Consent Management functionality. Select what sensitive health data to share (drug abuse, alcohol abuse and alcoholism, mental health, HIV/AIDS and other communicable diseases, genetic information, sexual and reproductive health and other addictions), with whom (individuals or health care organizations) and for which purposes (treatment, payment or research).

  3. Redact sensitive information based on sharing preferences and return the filtering results to the patient.

    1. Trigger the Document Segmentation Service (DSS) engine to analyse and segment the EHR data uploaded through the Upload Document Module in compliance with data sharing results (filter results) uploaded through the Consent Management Module. Data gets segmented based on eXtensible Access Control Markup Language (XACML) rules. The process identifies the clinical statements and returns the coded concepts to the Filtering Engine. A coded concept is identified by a code/code system pair and usually has a human readable label that is used for presentation but not for computation.

    2. Trigger the Filtering Engine to check whether the concepts (codes) from Step 3.1 are members of any Consent2Share value sets. If a lookup is successful, any EHR statement codified with that concept is considered sensitive, and it is classified under the category associated with the corresponding value set; otherwise it is considered ‘other information’ and therefore ‘not sensitive’. For instance, the codified concept ‘Paranoid schizophrenia, F20.0, ICD10CM’ belongs to the ‘mental health’ value set, and therefore it is considered sensitive and categorized as mental health information.
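The lookup and redaction steps above can be sketched as follows. This is an illustration under stated assumptions, not the Consent2Share implementation: the names `VALUE_SET_INDEX`, `classify` and `redact` are hypothetical, clinical statements are modeled as plain dictionaries rather than C-CDA or FHIR resources, and the XACML rules are reduced to a set of categories the patient agreed to share.

```python
# Hypothetical index: (code, code system) -> sensitive category.
VALUE_SET_INDEX = {
    ("F20.0", "ICD10CM"): "mental health",  # Paranoid schizophrenia
}


def classify(code, code_system):
    """Binary decision: value set membership means 'sensitive'."""
    category = VALUE_SET_INDEX.get((code, code_system))
    if category is None:
        return ("other information", "not sensitive")
    return (category, "sensitive")


def redact(statements, shared_categories):
    """Keep non-sensitive statements and sensitive ones the patient shares."""
    kept = []
    for statement in statements:
        category, sensitivity = classify(statement["code"], statement["system"])
        if sensitivity == "not sensitive" or category in shared_categories:
            kept.append(statement)
    return kept


record = [
    {"label": "Paranoid schizophrenia", "code": "F20.0", "system": "ICD10CM"},
    {"label": "Hypertension", "code": "I10", "system": "ICD10CM"},
]

# The patient shares no sensitive category: only hypertension passes through.
shared_with_nobody = redact(record, shared_categories=set())
```

Note that the label plays no role in the decision, and that a code missing from every value set is treated as ‘not sensitive’ by construction.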

Figure 1:

Filtering process of Consent2Share

3.6. Categorization of Medical Records into Data Types, Value Sets and Sensitive Categories

First, two biomedical informatics graduate students selected 30 different structured, codified medical record items (e.g. depression) from each participant’s EHR. Unstructured medical records (e.g. clinical notes) were excluded from the analysis. The structured medical record items that were accessed were already categorized by the sites and the HIE into the following data types: allergies, diagnoses, laboratory results, medications, and procedures and services. When possible, it was also desirable to select 30 medical record items that represent different data types.

Second, each medical record item was independently categorized by one internist and one psychiatrist (from now on we refer to them as “providers”) based on the seven value sets supported by Consent2Share (see section 3.4) and sensitivity categories (Table 2). Value set definitions were provided in the Consent2Share tool.25 In the selected 30 medical record items, we tried to include 20 items that, based on providers’ categorizations, corresponded to different Consent2Share’s value sets. During the selection process, patients may or may not have information belonging to each of the seven value sets. Disagreements between providers were resolved by consensus. More details on the categorization process are provided in Soni et al.26

Table 2:

Sensitive categories’ definitions and examples.

Sensitive Classification Description
Sensitive The information is classified using one or more Consent2Share value sets
Examples:
▪ ‘Depression’ can be classified as ‘mental health’ information and therefore ‘sensitive’
▪ ‘Human papilloma virus 16 and 18+45 E6+E7 mRNA in Cervix by NAA with probe detection’ is considered ‘sensitive’. It can be classified as ‘sexual and reproductive health’ and ‘HIV/AIDs and other communicable disease’ information.
Not sensitive The information is classified as ‘other information’
Example: ‘Hypertension’ can be classified as ‘other information’ and therefore ‘not sensitive’
Possibly Sensitive The information is classified into more than one Consent2Share value sets, depending on contextual information that is not available at the time of classification, and is also considered as ‘other information’.
Example: The medication ‘Vicodin’ (ingredient: hydrocodone/paracetamol) can be classified as both ‘drug abuse’ and ‘other information’. Abuse of Vicodin medication can be considered ‘sensitive’ information, while the use of Vicodin to manage pain could be categorized as ‘not sensitive’.27

3.7. Consent2Share Data Segmentation

For each participant, the 30 medical record items selected were inputted into the Consent2Share Filtering Engine. Consent2Share determined for each record item if it corresponded to one of the seven value sets and therefore was ‘sensitive’. Otherwise, Consent2Share considered the information ‘other information’ and therefore ‘not sensitive’.

3.8. Comparison of Provider and Consent2Share Categorizations

In accordance with previous studies on assessing the accuracy of clinical decision support systems, we evaluated how Consent2Share segments sensitive data by comparing its outcomes with providers’ decisions and created a list of common disagreements.28–31

Categorizations of sensitive types were automatically compared between Consent2Share and providers to identify agreements, partial agreements, and disagreements (Table 3). Descriptive statistics were used to quantify frequencies and percentages. A Chi-square test was used to test the differences in the sensitivity perceptions between Consent2Share and providers. Providers were consulted on the rationale for discovered disagreements with Consent2Share. Discussions with providers were summarized as a list of common disagreements.

Table 3:

Definition of agreement, partial agreement and disagreement based on value sets.

Agreement Type Description
Agreement Provider and Consent2Share both assign the same value set or ‘other information’ to a medical record item.
Example: Both assign ‘depression’ to the ‘mental health’ category; both assign ‘hypertension’ to ‘other information’.
Partial Agreement Provider and Consent2Share assign at least one value set in common.
Example: Provider assigns the medication ‘Vicodin’ (ingredient: hydrocodone/paracetamol) to ‘drug abuse’ and ‘other information’, whereas Consent2Share categorizes it as ‘drug abuse’.
Disagreement Provider and Consent2Share assign different value sets to a medical record item.
Example: Consent2Share assigns ‘amnesia’ to ‘other information’ and the provider assigns it to ‘mental health’.
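The definitions in Table 3 amount to a set comparison, which the following minimal sketch makes explicit. The function name `compare` is hypothetical, and treating ‘other information’ as one more category label is an assumption drawn from the examples in the table.

```python
def compare(provider_categories, consent2share_categories):
    """Label one medical record item following the Table 3 definitions."""
    provider = frozenset(provider_categories)
    consent2share = frozenset(consent2share_categories)
    if provider == consent2share:
        return "agreement"          # identical assignments
    if provider & consent2share:
        return "partial agreement"  # at least one category in common
    return "disagreement"           # nothing in common


# Examples taken from Table 3.
depression = compare({"mental health"}, {"mental health"})
vicodin = compare({"drug abuse", "other information"}, {"drug abuse"})
amnesia = compare({"mental health"}, {"other information"})
```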

Because to the best of our knowledge there are no other available tools to automatically segment sensitive data, we did not compare Consent2Share data segmentations against outcomes from other sensitive data segmentation technologies.

3.9. Categorization Disagreement Analysis

Areas of agreements, partial agreements and disagreements (Table 3) between Consent2Share and providers were automatically examined using frequency and percentages. Outcomes were manually analyzed by two biomedical informatics students and verified by providers to identify main areas of improvement and refinement of value sets and the Consent2Share reasoning engine.

4. Results

4.1. Medical Record Access

Seventy-six out of 89 patients who participated in a data privacy survey agreed to be re-contacted for this study.9 Thirty-six subjects consented to provide access to their de-identified EHRs. All of the participants had EHRs available through the integrated clinics and the HIE. Out of the selected 1,080 medical record items, 496 were aggregated as duplicates. For instance, diagnosis of type ’schizophrenia (disorder)’ appeared three times in the selected medical record items. On average, participants had 11 (Minimum = 7, Maximum=17) duplicated medical record items. In the rest of our analysis we will focus on the 584 distinct medical record items.
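The item counts above follow from simple arithmetic: 36 participants with 30 selected items each give 1,080 items, and removing 496 duplicates leaves 584. A minimal sketch of the aggregation step is shown below; identifying an item by its data type and label is an assumption made for illustration, and `aggregate_duplicates` is a hypothetical helper.

```python
from collections import Counter


def aggregate_duplicates(items):
    """Return the distinct items and the number of duplicate occurrences."""
    counts = Counter(items)
    distinct = list(counts)
    duplicates = sum(counts.values()) - len(distinct)
    return distinct, duplicates


# Toy pool: 'schizophrenia (disorder)' appears three times, as in the text.
selected = [
    ("diagnosis", "schizophrenia (disorder)"),
    ("diagnosis", "schizophrenia (disorder)"),
    ("diagnosis", "schizophrenia (disorder)"),
    ("diagnosis", "hypertension"),
]
distinct, duplicates = aggregate_duplicates(selected)

# Counts reported in the study.
total_selected = 36 * 30                   # participants x items each
distinct_reported = total_selected - 496   # items left after aggregation
```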

4.2. Comparison of Provider and Consent2Share Categorizations

When only comparing provider and Consent2Share sensitive categorizations (Table 4), we found that 56.0% were agreements, 31.2% were disagreements, and 12.8% were partial agreements. Significant differences were found between sensitivity classifications by Consent2Share and providers (χ2 (2, N = 584) = 114.74, p < 0.0001).

Table 4:

Comparison of provider and Consent2Share classification of non-duplicated terms, differentiating between agreements (A), partial agreements (PA), and disagreements (D).

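Columns give the providers' classification (Sensitive, Not sensitive, Possibly sensitive), each split into A, PA and D; rows give the Consent2Share classification.

| Consent2Share | Sensitive: A | Sensitive: PA | Sensitive: D | Not sensitive: A | Not sensitive: PA | Not sensitive: D | Possibly sensitive: A | Possibly sensitive: PA | Possibly sensitive: D | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Sensitive | 88 | 3 | 4 | 0 | 0 | 0 | 0 | 5 | 0 | 100 |
| Not sensitive | 0 | 0 | 178 | 239 | 0 | 0 | 0 | 67 | 0 | 484 |
| Total | 88 | 3 | 182 | 239 | 0 | 0 | 0 | 72 | 0 | 584 |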
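The reported Chi-square statistic can be checked from Table 4 with a short, dependency-free sketch. It assumes that the test was run on the 2 × 3 table obtained by summing the A, PA and D sub-columns within each provider classification; the source does not state this explicitly, but the degrees of freedom (2) and the resulting statistic are consistent with that reading. The closed-form p-value used here holds only for 2 degrees of freedom.

```python
import math

# Rows: Consent2Share (sensitive, not sensitive).
# Columns: providers (sensitive, not sensitive, possibly sensitive).
observed = [
    [88 + 3 + 4, 0, 5],
    [178, 239, 67],
]


def chi_square(table):
    """Pearson Chi-square statistic, degrees of freedom and sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, n


statistic, dof, n = chi_square(observed)

# For 2 degrees of freedom the upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)
```

Under this assumption the statistic rounds to 114.74 with N = 584, and the p-value is far below 0.0001.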

When considering sensitive categories and data types (e.g. allergy), it was found that diagnoses had the highest agreements (76.9%), followed by lab test results (62.6%), and allergies (52.9%) (Table 5). Partial agreements were the highest for allergies (38.2%), followed by medications (22.5%) and procedures/services (10.7%). For example, allergy to ‘Morphine’ was considered by providers as ‘possibly sensitive’ and categorized as ‘not sensitive’ or ‘drug abuse’ information which is ‘sensitive’. Consent2Share, instead, categorized it as ‘not sensitive’. Disagreements were the highest for procedures/services (57.1%), followed by medications (38.2%), and lab test results (32.2%). For example, ‘behavioral health prevention education service’ was considered by providers as ‘mental health’ information and by Consent2Share as ‘not sensitive’ information.

Table 5:

Comparison of provider and Consent2Share agreements, partial agreements, and disagreements based on data type available through the EHR.

| Data type | Agreements | Partial Agreements | Disagreements |
|---|---|---|---|
| Allergy | 18/34 (52.9%) | 13/34 (38.2%) | 3/34 (8.8%) |
| Diagnosis | 113/147 (76.9%) | 4/147 (2.7%) | 30/147 (20.4%) |
| Lab | 107/171 (62.6%) | 9/171 (5.3%) | 55/171 (32.2%) |
| Med | 80/204 (39.2%) | 46/204 (22.5%) | 78/204 (38.2%) |
| Proc/Serv | 9/28 (32.1%) | 3/28 (10.7%) | 16/28 (57.1%) |
| Total | 327 | 75 | 182 |

When considering sensitive categories and Consent2Share value sets assigned by providers, it was found that ‘other information’ had the highest agreements (100.0%), followed by ‘alcohol abuse’ (50.0%) and ‘drug abuse’ (37.1%) (Table 6). Partial agreements were the highest for ‘alcohol abuse’ (41.7%), ‘drug abuse’ (33.9%) and ‘mental health’ (21.6%). For example, ‘Neurontin 600 mg tablet’ was categorized by providers as both ‘drug abuse’ and ‘other information’ because it can generate addiction. This medication was categorized by Consent2Share as ‘not sensitive’. Disagreements were the highest for ‘genetic data’ (100.0%), ‘sexual and reproductive health’ (88.9%) and ‘HIV/AIDs and communicable diseases’ (77.1%). For example, ‘vaginal pain’ was categorized by providers as ‘sexual and reproductive health’ and by Consent2Share as ‘mental health’.

Table 6:

Comparison of provider and Consent2Share agreements, partial agreements, and disagreements based on value sets.

| Value set | Agreements | Partial Agreements | Disagreements |
|---|---|---|---|
| Drug Abuse | 23/62 (37.1%) | 21/62 (33.9%) | 18/62 (29.0%) |
| Alcohol Abuse | 6/12 (50.0%) | 5/12 (41.7%) | 1/12 (8.3%) |
| Mental Health | 54/185 (29.2%) | 40/185 (21.6%) | 91/185 (49.2%) |
| Comm diseases | 3/35 (8.6%) | 5/35 (14.3%) | 27/35 (77.1%) |
| Genetic Data | 0/2 (0.0%) | 0/2 (0.0%) | 2/2 (100.0%) |
| Sex & Rep Health | 2/45 (4.4%) | 3/45 (6.7%) | 40/45 (88.9%) |
| Other Addictions | 1/5 (20%) | 1/5 (20%) | 3/5 (60%) |
| Other Information | 238/238 (100.0%) | 0/238 (0.0%) | 0/238 (0.0%) |
| Total | 327 | 75 | 182 |
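The percentages in Table 6 are row-wise shares, and the ranking of value sets by disagreement rate can be reproduced from the counts. The sketch below is only a restatement of the table; the names `TABLE_6`, `rates` and `ranked_by_disagreement` are hypothetical.

```python
# Counts from Table 6: (agreements, partial agreements, disagreements).
TABLE_6 = {
    "Drug Abuse": (23, 21, 18),
    "Alcohol Abuse": (6, 5, 1),
    "Mental Health": (54, 40, 91),
    "Comm diseases": (3, 5, 27),
    "Genetic Data": (0, 0, 2),
    "Sex & Rep Health": (2, 3, 40),
    "Other Addictions": (1, 1, 3),
    "Other Information": (238, 0, 0),
}


def rates(counts):
    """Percentage of each outcome within one value set, to one decimal."""
    total = sum(counts)
    return tuple(round(100 * c / total, 1) for c in counts)


# Value sets ordered from highest to lowest disagreement rate.
ranked_by_disagreement = sorted(
    TABLE_6, key=lambda name: rates(TABLE_6[name])[2], reverse=True
)
```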

When the resulting 327 agreements were analyzed (Table 7), Consent2Share and providers agreed most frequently (72.8%) when classifying medical record items as ‘other information’, followed by agreements in the classification of ‘mental health’ (16.5%) and ‘drug abuse’ (7.0%) record items.

Table 7:

Provider and Consent2Share agreements, based on value sets and data type.

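| Data type | Drug Abuse | Alcohol Abuse | Mental Health | Comm Disease | Genetic Data | Sex & Rep Health | Other Addictions | Other Info | Total |
|---|---|---|---|---|---|---|---|---|---|
| Allergy | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 17 | 19 |
| Diagnosis | 4 | 2 | 43 | 0 | 0 | 1 | 1 | 62 | 113 |
| Lab | 18 | 4 | 0 | 3 | 0 | 0 | 0 | 82 | 107 |
| Med | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 70 | 79 |
| Proc/Serv | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 7 | 9 |
| Total | 23 | 6 | 54 | 3 | 0 | 2 | 1 | 238 | 327 |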

When the 75 partial agreements were analyzed, it was found that 89.3% of the medical record items were considered by providers as ‘possibly sensitive’ and as ‘not sensitive’ by Consent2Share (Table 8). For example: ‘Morphine 2 mg 0.5 mL’ was classified as both ‘other information’ and ‘drug abuse’ information by providers and as ‘not sensitive’ by Consent2Share. It was also found that eight record items (10.7%) were classified by Consent2Share as ‘sensitive’ and by providers as ‘possibly sensitive’. Five of those eight record items were classified by Consent2Share as ‘mental health’ information while providers classified them as both ‘mental health’ and ‘other information’. For example, ‘Amitriptyline’ can be used for neuropathic (nerve) pain, insomnia, and/or depression.32 Three out of those eight medical record items, such as ‘HIV antibody/antigen (Ag/Ab) screen’, were classified as ‘HIV/AIDs and communicable diseases’ by Consent2Share and as both ‘HIV/AIDs and communicable diseases’ and ‘sexual and reproductive health’ by providers.

Table 8:

Provider and Consent2Share partial agreements, when providers classified medical record items as possibly sensitive information and Consent2Share classified them as not sensitive.

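| Data type | Drug Abuse & Other Info | Alcohol Abuse & Other Info | Mental Health & Other Info | Comm Disease & Other Info | Sex & Rep Health & Other Info | Addictions & Other Info | Total |
|---|---|---|---|---|---|---|---|
| Allergy | 5 | 0 | 8 | 0 | 0 | 0 | 13 |
| Diagnosis | 0 | 1 | 0 | 0 | 2 | 0 | 3 |
| Lab | 0 | 4 | 0 | 2 | 0 | 0 | 6 |
| Med | 16 | 0 | 24 | 0 | 1 | 1 | 42 |
| Proc/Serv | 0 | 0 | 3 | 0 | 0 | 0 | 3 |
| Total | 21 | 5 | 35 | 2 | 3 | 1 | 67 |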

When the 182 disagreements were analyzed (Table 9), it was found that Consent2Share and providers disagreed the most on ‘mental health’ information (50.0%), followed by ‘sexual and reproductive health’ (21.4%), and the classification as both ‘HIV/AIDs and other communicable diseases’ and ‘sexual and reproductive health’ (11.5%), and ‘drug abuse’ (9.9%). For example: ‘suicidal ideation’ was classified as sensitive by providers with category ‘mental health’ and as ‘not sensitive’ by Consent2Share.

Table 9:

Provider and Consent2Share disagreements based on value sets and data type.

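| Data type | Drug Abuse | Alcohol Abuse | Mental Health | Comm Disease | Genetic Data | Sex & Rep Health | Comm Disease & Sex and Rep Health | Other Addictions | Total |
|---|---|---|---|---|---|---|---|---|---|
| Allergy | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 3 |
| Diagnosis | 5 | 0 | 10 | 1 | 0 | 12 | 1 | 1 | 30 |
| Lab | 10 | 1 | 1 | 6 | 2 | 13 | 20 | 2 | 55 |
| Med | 2 | 0 | 69 | 0 | 0 | 7 | 0 | 0 | 78 |
| Proc/Serv | 0 | 0 | 10 | 0 | 0 | 6 | 0 | 0 | 16 |
| Total | 18 | 1 | 91 | 7 | 2 | 39 | 21 | 3 | 182 |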

4.3. Lessons Learned from Provider and Consent2Share Categorizations of Sensitive Information

In-depth review of disagreements and partial agreements between provider and Consent2Share categorizations revealed that the accuracy of Consent2Share to determine the sensitivity of medical record items strongly depends on the content of the value sets and the decision rules implemented in the XACML-based engine:

  1. The value sets are incomplete or SAMHSA categorized more information as not sensitive than the providers involved in the evaluation (n=178). This led to 178 (97.8%) of the 182 disagreements found (Table 6). Disagreements occurred mainly within ‘mental health’, ‘sexual and reproductive health’, and the classification as both ‘HIV/AIDS and other communicable diseases’ and ‘sexual and reproductive health’ (Table 9). Providers classified 45 medical record items as ‘sexual and reproductive health’, but the ‘sexual and reproductive health’ value set only has 30 codes (Table 1). For example, ‘amenorrhea, unspecified’ was present in the EHR data but absent from the value set. Suicide related information (e.g. ‘suicidal attempt (disorder)’) is missing from the ‘mental health’ value set. While the EHR data contained genetic data (e.g. ‘Factor V Leiden Mutation’), the ‘genetic information’ value set is empty (Table 1).

  2. There is no homogeneous approach to listing medications in the value sets (n=87). For example, medications are sometimes listed in the value sets only as ingredients (e.g. ‘Clozapine’), only as brand names (e.g. ‘Geodon’), or as both brand and ingredient (e.g. ‘Clozaril’ and ‘Clozapine’), with or without additional information about dose, administration form, etc. (e.g. ‘Zolpidem tartrate 12.5 MG Extended Release Oral Tablet’). This led to 61 (33.5%) out of 182 disagreements and 26 (34.7%) out of the 75 partial agreements, because the EHRs, the HIE, and the value sets adopt different strategies to describe the same medication.

  3. Medical records are classified as ‘not sensitive’ by Consent2Share’s decision engine when they do not belong to any of the seven SAMHSA sensitive value sets. The XACML-based decision engine in Consent2Share is binary and determines the sensitivity of a medical record item only based on presence/absence of information from the value sets. Any other piece of information is disregarded by the decision engine, including additional information contained in the FHIR resources from which the medical record item has been extracted.

  4. The value sets and Consent2Share’s decision engine do not support “possibly sensitive” categorizations (n=75). Providers, in contrast, chose to categorize some medical record items from the patients’ EHR as possibly sensitive. For example, providers categorized the medication ‘Vicodin’ (ingredient: hydrocodone/paracetamol) as ‘drug abuse’ and ‘other information’. Abuse of Vicodin can be considered ‘sensitive’ information, while the use of Vicodin to manage pain could be categorized as ‘not sensitive’. However, this medical record item does not belong to any Consent2Share sensitive value set, leading to a classification of ‘not sensitive’. It is not known whether the developers of the SAMHSA value sets chose to categorize medical record items as ‘not sensitive’ when in doubt about their possible sensitivity. Even so, we know of five cases (e.g. ‘amitriptyline’) in which Consent2Share classified as ‘sensitive’ a medical record item that the providers considered ‘possibly sensitive’. Out of the 584 medical record items considered, we found 75 partial agreements (12.8%) (Table 6).

  5. Providers classified some information as both “HIV/AIDS and other communicable diseases” and “sexual and reproductive health” (n=24). For example, ‘HIV disease’ was classified by providers as both ‘HIV/AIDS and other communicable diseases’ and ‘sexual and reproductive health’. This led to 21 (11.5%) of the 182 disagreements (Table 9) and 3 (4.0%) of the 75 partial agreements (Table 6).

  6. There are some disagreements in sensitive categorizations between the value sets and the providers (n=5). While the value sets contain ‘cannabis use disorders’, ‘cocaine use disorders’, ‘hallucinogens’, and ‘opioids’ as ‘other addictions’, providers classified them as ‘drug abuse’. For example, ‘Cannabis abuse, uncomplicated’ was classified by providers as ‘drug abuse’, while it appears as ‘other addictions’ in the value sets. Of the 182 disagreements, 5 (2.7%) were of this sort.
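
The binary, membership-only behavior described in items 2 and 3 can be illustrated with a minimal Python sketch. It is not Consent2Share's XACML implementation: the `VALUE_SETS` contents and the function names `classify_binary` and `sensitivity` are hypothetical, and plain terms stand in for the coded concepts that real value sets hold.

```python
from typing import Dict, Set

# Hypothetical, heavily truncated value sets keyed by sensitive category.
# Real value sets hold coded concepts; plain terms are used for readability.
VALUE_SETS: Dict[str, Set[str]] = {
    "mental health": {"clozapine", "geodon"},
    "other addictions": {"cannabis use disorders"},
    "genetic information": set(),  # reported as empty in Table 1
}


def classify_binary(item: str) -> Set[str]:
    """Return the sensitive categories whose value set contains the item.

    Membership is the only signal used; any other context attached to the
    medical record item is ignored.
    """
    key = item.strip().lower()
    return {category for category, terms in VALUE_SETS.items() if key in terms}


def sensitivity(item: str) -> str:
    """Map an item to 'sensitive' or 'not sensitive' (no third option)."""
    return "sensitive" if classify_binary(item) else "not sensitive"


if __name__ == "__main__":
    for term in ["Clozapine", "Geodon 20 mg tablet", "Factor V Leiden Mutation"]:
        print(term, "->", sensitivity(term))
```

In this toy setting, an exact match such as ‘Clozapine’ is flagged, whereas a differently worded entry for the same drug (‘Geodon 20 mg tablet’) or a term from an empty value set (‘Factor V Leiden Mutation’) falls through to ‘not sensitive’, which mirrors the kinds of disagreement listed above.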

5. Discussion

Consent2Share is the first software available to automatically segment structured medical record data. Differences were found between sensitivity classifications by Consent2Share and providers (our gold standard). Comparisons based only on sensitive categories led to 56.0% agreements, 31.2% disagreements, and 12.8% partial agreements. Most (97.8%) sensitivity disagreements resulted from information classified as not sensitive by Consent2Share, but sensitive by providers. Comparison based on data types revealed that the highest disagreement occurred for procedures and services (57.1%). When value sets were considered, ‘genetic data’ (100%) followed by ‘sexual and reproductive health’ (88.9%) showed the highest disagreement. Based on our findings, further validation of the Consent2Share tool is recommended before its deployment and use in health care settings. In particular, it is advisable to:

  • Involve patients, health care providers, policy makers, lawyers and ethicists, and other relevant stakeholders in the process of defining value sets and establishing procedures to achieve consensus. The goal should be to develop granular-based segmentation technology that is meaningful and reflective of preferences of stakeholders.

  • Involve informaticists with expertise in medical terminologies and knowledge representation to review the value sets and propose well-principled and consistent approaches to define and populate them. This approach should also be informed by the coding strategies employed by EHRs and HIEs. For example, all medications in the value sets could be specified in terms of ingredients (e.g. ‘Ziprasidone’ belongs to the ‘mental health’ value set) and the reasoning engine of Consent2Share could be extended to determine the sensitivity of a given medication based on its ingredients (e.g. ‘Geodon 20 mg tablet’ has ingredient ‘Ziprasidone’ and therefore can be classified as ‘mental health’ information). This approach could help resolve disagreements and partial agreements, thereby more accurately reflecting patient consent intent.

  • Propose mechanisms to better adapt value sets to updates in medical terminologies. Currently, if a medical terminology changes, the impact those changes have on the value sets needs to be manually checked. Protocols to manually revise value sets and document changes should be created.

  • Replace the SAMHSA extensional value sets with intensional value sets. Extensional value sets are not easily updated and tend to become obsolete more quickly than intensional value sets. In contrast, an intensional value set is typically algorithmically defined. That is, the membership of a code in a group can be established by means of a computable definition. For instance, the definition could include “all medications that have Clozapine as ingredient”. Intensional code groups can therefore be dynamically updated. Algorithmic approaches to converting the current extensional value sets into intensional value sets should be considered to automate the process.33

  • Involve medical subject matter experts in the revision of the value sets to minimize missed, duplicated, or contradictory information. If disagreements in categorizations are identified, they should be resolved by consensus, and the rationale for the resulting categorization should be systematically documented. This approach could help resolve the disagreements found.

  • Extend the value sets and Consent2Share’s decision engine to support possibly sensitive information. Mechanisms to automatically identify data that are frequently categorized as ‘possibly sensitive’ information could be created and later associated with a contextual probability factor. For example, medications commonly prescribed for both sensitive and nonsensitive conditions could be organized and iteratively integrated into the Consent2Share logic. A subject matter expert could manually decide the sensitivity of difficult to classify information. This approach could help to resolve the partial agreements found.

  • Involve medical subject matter experts in the revision of value set definitions and content to achieve better alignment with existing sensitive data categorizations. For example, merging the ‘HIV/AIDS and other communicable diseases’ and ‘sexual and reproductive health’ categories from Consent2Share into one data category would help to achieve better alignment with existing NCVHS sensitive data categories.
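
Two of the recommendations above, ingredient-based (intensional) value set definitions and support for ‘possibly sensitive’ information, can be sketched together in Python. This is an illustration under stated assumptions, not a proposed Consent2Share design: the lookup tables `INGREDIENTS`, `SENSITIVE_INGREDIENTS`, and `POSSIBLY_SENSITIVE_INGREDIENTS` and the function `classify_medication` are hypothetical, and a real system would resolve ingredients through a drug terminology rather than a hand-written table.

```python
from typing import Dict, Set, Tuple

# Hypothetical product-to-ingredient lookup (assumption: in practice this
# would be resolved through a drug terminology, not a hand-written table).
INGREDIENTS: Dict[str, Set[str]] = {
    "geodon 20 mg tablet": {"ziprasidone"},
    "clozaril": {"clozapine"},
    "vicodin": {"hydrocodone", "paracetamol"},
}

# Intensional rule: a category is defined by ingredient, not by an
# enumerated list of product names.
SENSITIVE_INGREDIENTS: Dict[str, str] = {
    "ziprasidone": "mental health",
    "clozapine": "mental health",
}

# Assumed examples of ingredients used for both sensitive and
# nonsensitive conditions.
POSSIBLY_SENSITIVE_INGREDIENTS: Dict[str, str] = {
    "hydrocodone": "drug abuse",
}


def classify_medication(name: str) -> Tuple[str, Set[str]]:
    """Return (sensitivity label, categories) for a medication name."""
    key = name.strip().lower()
    # If the product is unknown, treat the name itself as an ingredient.
    ingredients = INGREDIENTS.get(key, {key})

    sensitive = {
        SENSITIVE_INGREDIENTS[i] for i in ingredients if i in SENSITIVE_INGREDIENTS
    }
    if sensitive:
        return "sensitive", sensitive

    possible = {
        POSSIBLY_SENSITIVE_INGREDIENTS[i]
        for i in ingredients
        if i in POSSIBLY_SENSITIVE_INGREDIENTS
    }
    if possible:
        return "possibly sensitive", possible

    return "not sensitive", set()


if __name__ == "__main__":
    for med in ["Geodon 20 mg tablet", "Ziprasidone", "Vicodin"]:
        print(med, "->", classify_medication(med))
```

Under these assumptions, ‘Geodon 20 mg tablet’ and ‘Ziprasidone’ receive the same ‘mental health’ classification regardless of how the source system names the drug, and ‘Vicodin’ is returned as ‘possibly sensitive’ so that it can be routed to a subject matter expert or combined with contextual information.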

Manually selecting 30 representative medical record items from each patient’s EHR was labor intensive. Natural language processing techniques might be employed in the future to reduce the effort for such tasks. Alternatively, instead of using patients’ personal EHR data, we might have used a generic comparison between Consent2Share and provider data categorizations of clinical concepts codified by the standard clinical terminologies used in Consent2Share: ICD10CM, ICD9CM, RXNORM, LOINC, CPT, HCPCS or SNOMED CT. The scope (SNOMED CT has over 311,000 concepts34) and our desire to focus on medical information of clinical relevance to behavioral health led us to our current study design. Furthermore, the process of selecting medical record items from the EHRs resulted in the separation of some contextual information. For example, though available in the EHRs, we did not provide information to providers on diagnoses/symptoms that motivated the prescription of medications. We also recognized that the patient records accessed contained minimal genetic data, due in part to our state HIE statute that limits types of use and disclosures of genetic data.35 Limited information from specific data categories, such as ‘genetic data’, could introduce bias and limit the representativeness of our data. A future study will focus on how the availability and use of contextual information from the EHR may influence categorization of data into sensitive or possibly sensitive classes.

Health providers involved in this study included an internist and a psychiatrist, thereby reflecting sensitive data views of both physical and behavioral care medical specialties. Involving providers with different medical training, specialties, and experience with various sensitive data categories may have resulted in different agreement rates.

Health providers chose to categorize some medical information as ‘possibly sensitive’, leading to partial agreements with Consent2Share. Future work could look at implications of merging partial agreements with disagreements.
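
As a rough illustration of what such a merge would mean numerically, the following Python sketch recomputes the reported proportions from the counts given above (584 items, 182 disagreements, 75 partial agreements). The number of full agreements is derived by subtraction rather than quoted from the text, and the merged figure is simple arithmetic, not a result of this study.

```python
TOTAL_ITEMS = 584
DISAGREEMENTS = 182
PARTIAL_AGREEMENTS = 75
# Derived by subtraction (assumption: the three outcomes are exhaustive).
AGREEMENTS = TOTAL_ITEMS - DISAGREEMENTS - PARTIAL_AGREEMENTS


def pct(count: int, total: int = TOTAL_ITEMS) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Three-way split, matching the reported percentages.
three_way = {
    "agreements": pct(AGREEMENTS),
    "disagreements": pct(DISAGREEMENTS),
    "partial agreements": pct(PARTIAL_AGREEMENTS),
}

# Two-way split if partial agreements are counted as disagreements.
merged = {
    "agreements": pct(AGREEMENTS),
    "disagreements": pct(DISAGREEMENTS + PARTIAL_AGREEMENTS),
}

if __name__ == "__main__":
    print(three_way)
    print(merged)
```

Under this merge the agreement share is unchanged, while the disagreement share grows by the partial-agreement share.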

This study was conducted with a limited number of participants from a specific clinical environment (behavioral health). Given the limited statistical power that results from the small sample, the statistical comparisons may not support firm inferences.

We plan next to compare data categorizations between patients, Consent2Share and current data privacy laws and policies. As stated by Campbell et al., there is a need to understand how Consent2Share impacts compliance with existing and evolving policies and laws that control sensitive data sharing.36 Our recent review of state and federal laws on granular data sharing regulations indicates that evaluating data segmentation by Consent2Share in the context of data privacy laws will be a daunting task.12 Most sensitive data types have concurrent state and federal laws and regulations that control their sharing. For instance, federal law (42 CFR Part 2) allows patients to restrict access to substance abuse information.11 Conversely, all states require providers to access individual opioid prescription history when prescribing pain medications as part of each state’s prescription drug monitoring program to combat the opioid epidemic.37

We will ultimately compare Consent2Share, patient, provider and law/policy sensitive data categorizations to understand the agreements and to analyze the partial agreements and disagreements. Lessons learned will guide the future implementation and deployment of MyChoice, the electronic consent management tool under construction as an extension of Consent2Share.

6. Conclusion

This retrospective pilot evaluation of the SAMHSA Consent2Share software using physical and behavioral health medical records revealed inconsistency between the value sets of Consent2Share and the items available in the EHR. Most disagreements resulted when Consent2Share categorized data as not sensitive while providers considered the information sensitive. There is a need to further validate Consent2Share before its broad use in health care settings. The outcomes from this pilot study will help guide improvements in the segmentation logic of tools like Consent2Share and may set the stage for a new generation of personalized consent engines.

Highlights.

  • Little is known about automatic means to segment sensitive medical records

  • Consent2Share is the first tool to automatically segment sensitive data

  • Accuracy of Consent2Share to segment sensitive data was unknown

  • Poor Consent2Share accuracy to segment sensitive data was found

  • There is a need to further validate Consent2Share before its use

Summary points.

Before this study, it was known that:

  • There is a need for designating particular categories of sensitive health information (e.g. mental health) and defining what information is included in each category (e.g. depression can be categorized as mental health)

  • Consent2Share was introduced as the first tool to automatically segment sensitive data using sensitive data category definitions

  • The accuracy of Consent2Share to segment sensitive data has not been formally assessed

This study has added knowledge about:

  • SAMHSA definitions of sensitive data categories and information included in each category

  • Accuracy of Consent2Share to segment sensitive structured health data when compared against health providers’ categorizations

  • Recommendations to guide improvements of segmentation logic of tools like Consent2Share

Acknowledgements

This work was supported by the National Institute of Mental Health through the grant My Data Choices, evaluation of effective consent strategies for patients with behavioral health conditions (R01 MH108992). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIMH.

Footnotes

Conflicts of interest

None

* In the rest of this paper, codes will sometimes be omitted in favor of the corresponding terms for readability reasons, but it should be assumed that every term is appropriately codified.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References
