AMIA Annual Symposium Proceedings. 2025 May 22;2024:262–270.

Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world

Fangyi Chen 1, Kenrick Cato 2,4, Gamze Gürsoy 1, Patricia C Dykes 3,4, Graham Lowenthal 3, Sarah Rossetti 1,5
PMCID: PMC12099381  PMID: 40417480

Abstract

Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, considering the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study describing the rationale for de-identification decisions in the LLM era and delineating the consequent problems that should be considered when using our data set. We advocate for transparent disclosure of de-identification decisions and associated limitations and biases with all openly available datasets.

Introduction

Open science is a priority of the National Institutes of Health and other federally funded agencies, with the goal of making research inputs, activities, and outputs accessible to all to enable principles such as reproducibility and fairness while maintaining security and privacy1. There are some published accessible clinical datasets, particularly in the setting of intensive care units (ICUs) and critical care units, such as MIMIC-IV (Medical Information Mart for Intensive Care)2, HiRID (a high time-resolution ICU dataset in Switzerland)3, PIC (Pediatric Intensive Care database in China)4, eICU-CRD (eICU Collaborative Research multi-center Database in the United States)5, and AmsterdamUMCdb (Amsterdam University Medical Centers database)6. MIMIC-IV is particularly notable for the extensive value provided to the data science community in creating and maintaining a large openly available clinical data set. However, the availability and diversity of open datasets remain limited in scale. External validation of data-driven clinical models, such as those used for prediction, is necessary to demonstrate performance in real-world settings7,8, given that models built on one medical center's data can be highly biased and do not necessarily capture the diverse characteristics of the population well. Moreover, the limited availability of data representing multiple and diverse patient populations for use by data scientists has demonstrated clinical implications. A widely circulated study9 performed an external validation of the Epic Sepsis Model (ESM), which had already been widely adopted by many hospitals, and found that the model did not effectively detect the early onset of sepsis in their cohort, highlighting the importance of external model validation across diverse clinical sites prior to local adoption.
It is not uncommon for predictive models, such as those predicting the risk of adverse events or short-term mortality for patients in the ICU10–13, to have been developed based on a single medical site; while these practices raise questions about robustness and generalizability, they are understandable given the limited openly available clinical data sets that exist and the resource challenges of conducting multi-site studies. Making clinical data derived from electronic health records (EHRs) publicly available empowers the reproducibility and transparency of scientific research and advances knowledge discovery14,15.

In light of the open science initiative and the advancement of nursing research activities, we plan to release a subset of patient data from acute care and intensive care units (ICUs) as part of the Communicating Narrative Concerns Entered by RNs (CONCERN: https://www.dbmi.columbia.edu/concern-study/) study16 by the end of 2024. The CONCERN data set comprises structured and unstructured (narrative nursing notes) clinical data that serve as the input for the CONCERN model's features and additionally includes care provider information (e.g., specialty and type) and linkage between the actions taken and the respective providers. Our first release will include only structured data. To our knowledge, this is the first dataset for open public sharing that captures both care providers' information and patient data simultaneously, offering insight into the behaviors of care providers in response to patients' health status.

As mandated by the Health Insurance Portability and Accountability Act (HIPAA)17, all clinical data are required to undergo a de-identification procedure for privacy protection prior to public release or any sharing across healthcare institutions. De-identification refers to the removal of protected health information (PHI) that could potentially reveal an individual's or a small subgroup's identity. Safe Harbor is one of the de-identification approaches recommended by HIPAA, under which clinical data (both structured and unstructured) are deemed free of PHI and can be shared after the removal of all 18 pre-defined PHI categories (e.g., names, dates, SSNs, locations). However, there remains a risk of re-identification under such an approach18,19, and the emergence of large language models (LLMs) raises further concerns associated with data privacy and security20. LLMs have been shown to achieve state-of-the-art performance on various natural language processing tasks21,22, and they can easily integrate diverse sources to expand their knowledge domain and optimize decision-making23. The advancement of LLMs has unexplored implications for current best practices for de-identification. While these implications are unexplored, it is believed that LLMs will likely have capabilities that did not previously exist for re-identifying data sets22, substantially and necessarily changing today's decision-making considerations for what to make openly available and how. In this study, we considered the potential re-identification risks imposed by LLMs when designing our de-identification algorithm. Importantly, we also considered our future plans to include de-identified narrative notes24 in subsequent versions of our openly available data set.

The aim of this paper is to describe the considerations for releasing a clinical data set in the era of LLMs, the specific decisions our team arrived at through consensus sessions, the pros and cons of each decision, and the known and unknown implications of each decision for secondary analyses by other researchers once the data set is made public. We present a preliminary design of the databases, provide a description of the de-identification procedure, and identify potential bias introduced by the design of the de-identification algorithm. We end with a robust discussion of our resultant decisions and their implications. To the best of our knowledge, this is the first study describing the rationale for de-identification decisions in the era of LLMs and explicitly delineating the consequent limitations and biases imposed on future users of the data set. To ensure the transparency and validity of subsequent scientific research using any openly available data set, it is necessary to acknowledge and disclose any limitations of the dataset resulting from the de-identification algorithms used, and it is also imperative for researchers to understand these issues when using and analyzing openly available datasets.

Methods

The dataset was collected from acute care and ICU settings at 2 large health systems in the Northeastern United States, spanning the years 2020 to 2022.

We followed a 5-step approach to reach consensus on our de-identification methods and their implications:

  1. Review our team's prior findings regarding de-identification of narrative notes to inform decisions for de-identification of structured data, given the plan to release narrative notes at a later stage.

  2. Identify de-identification methods from: a) safe harbor, b) other studies using similar EHR data.

  3. Identify novel sources of potential risk for re-identification in the era of LLMs.

  4. Iteratively review each de-identification approach and risks during team consensus sessions for pros and cons along the following factors: a) balance between data utility and privacy protection, b) risk of re-identification

  5. Finalize de-identification decisions.

Results

Step 1: Review previous outcomes regarding de-identification of narrative notes.

In our previous study24 examining the generalizability of pre-trained de-identification transformers on narrative nursing notes, we found that the F1-measure in detecting PHI tokens was 0.932, and the models (RoBERTa, ClinicalBERT) did not perfectly capture all PHI instances in the notes. Information such as phone numbers, patients' affiliated organization names, or locations can remain present in the notes even after de-identification algorithms are applied. We do not plan to release the notes in our initial release, and we will conduct thorough manual validation to ensure all PHI is removed before proceeding with the release of notes. We recognize, and it is important to convey, that one can still identify an individual by deliberately combining residual information from unstructured and structured data. Therefore, a cautious and conservative approach to de-identification decisions for our structured data is preferred, given our eventual plan to release both structured data and narrative notes.
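As an illustration of how a token-level F1-measure for PHI detection can be computed, the following sketch is our own minimal illustration (it is not the evaluation code from the cited study; the binary labeling scheme is an assumption for clarity):

```python
# Illustrative sketch: token-level precision/recall/F1 for PHI detection.
# Labels: 1 = token is PHI, 0 = token is not PHI (an assumed encoding).
def phi_f1(gold_tokens, pred_tokens):
    """Compute (precision, recall, F1) over binary PHI token labels."""
    tp = sum(1 for g, p in zip(gold_tokens, pred_tokens) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold_tokens, pred_tokens) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold_tokens, pred_tokens) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that an F1 below 1.0, as reported above, means some PHI tokens are missed (false negatives), which is precisely why residual identifiers can survive automated de-identification.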

Step 2) Identify de-identification methods from: a) Safe Harbor, b) other studies using similar EHR data (summarized in Table 1)

Table 1.

Summary of de-identification methods employed by other studies.


Given that the primary target population of the CONCERN study is adults, patients aged below 18 at the time of visit were excluded from the data. The de-identification followed the Safe Harbor method by removing all 18 PHI categories, as applicable. In selecting the specific strategies for our de-identification, we reviewed the main de-identification strategies implemented by other studies, which are summarized in Table 1.

Step 3) Identify novel sources of potential risk for re-identification in the era of LLMs

De-identifying structured data can be regarded as relatively straightforward25, whereas de-identifying narrative notes is perceived as a challenging task. In anticipation of eventually incorporating the unstructured component to offer auxiliary enriching clinical details, we must also consider additional privacy and re-identification risks, particularly in an era dominated by LLMs. While LLMs demonstrate remarkable abilities in understanding, reasoning, and in-context learning, there have been increasing concerns about data privacy and safety. First, LLMs can combine multiple sources available online to potentially infer an individual's identity. As demonstrated by one study26, pre-trained language models can leverage different kinds of knowledge to generate accurate predictions of a target, and with stronger and larger-scale models, more personal information can be extracted and recovered. With exponential growth in the size and complexity of these models, they become increasingly susceptible to incidental leakage of private information27.

Step 4) Iteratively review each de-identification approach and risk during team consensus sessions for pros and cons along the following factors: a) balance between data utility (secondary purposes) and privacy protection, b) risk of re-identification.

We identified three main de-identification needs: 1) dealing with dates/times; 2) dealing with ages; and 3) dealing with rare data, which encompasses all conditions with low occurrence in our dataset, including but not limited to rare diseases.

Step 5) Final De-identification Decisions

We transformed the dates and specific timestamps into relative time in minutes since admission. Some studies31–33 have revealed the effect of night-time and weekend admission on in-hospital mortality. Therefore, we included binary day/night and weekend indicators to capture these important factors, which may contribute to patients' ultimate outcomes. To comply with HIPAA regulations, which classify ages above 89 as PHI, we obscured such ages by replacing the specific age with "age above 89" for privacy protection. Furthermore, we removed all conditions with low occurrence in the cohort based on a threshold determined by analyzing the distribution of condition occurrences.
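The timestamp transformation described above can be sketched as follows. This is our own minimal illustration, not the study's actual code; in particular, the 07:00-19:00 daytime window is an assumed convention, since the paper does not specify the day/night boundary:

```python
from datetime import datetime

# Illustrative sketch: replace an absolute event timestamp with minutes
# since admission, and derive binary day/night and weekend indicators
# BEFORE the absolute date is discarded. The 07:00-19:00 daytime window
# is an assumption for illustration only.
def deidentify_timestamp(admission: datetime, event: datetime,
                         day_start: int = 7, day_end: int = 19):
    """Return (minutes_since_admission, is_daytime, is_weekend)."""
    minutes = int((event - admission).total_seconds() // 60)
    is_daytime = day_start <= event.hour < day_end
    is_weekend = event.weekday() >= 5  # Monday=0 ... Saturday=5, Sunday=6
    return minutes, is_daytime, is_weekend
```

The key point is that the indicators are computed from the true timestamp and retained, while only the relative offset (not the calendar date) appears in the released data.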

The elements of the databases are shown in Figure 1. Tables can be linked together using identifiers such as the dummy MRN (patient ID) and encountered_id.

Figure 1: An overview of the data elements. The final version is subject to change and might differ from the preliminary version depicted above.

Discussion

We identified several major issues arising from the design choices of our proposed de-identification algorithm. Below, we describe several implications of each de-identification decision.

Dates / Time: Relative to the reference point (admission date)

We obscured the dates and specific timestamps by using relative time (in minutes) from the individual's corresponding admission date. Admittedly, keeping only the time interval inevitably results in the loss of temporal information, restricting the ability to examine seasonal effects. It is important to acknowledge that our dataset is not ideal for supporting temporal analyses examining the effect of seasonality on patients' outcomes, or how seasonality may alter providers' behaviors and decision-making processes. Some datasets, such as MIMIC-IV2, PIC4, and AmsterdamUMCdb6, retained seasonality and fine temporal details such as time of day and day of the week; however, these require large datasets spanning wide ranges of calendar years. The dataset we intend to release is relatively small, covering a short timeframe of approximately one to two years. The risk of re-identifying the actual dates is much higher when maintaining such a fine level of temporal detail. As indicated by Sarpatwari et al.29, even with only time intervals it is still possible to make some inferences about true temporal details by analyzing interactions with clinical study periods. Essentially, both the time-shifting approach without truncation of data points and the relative-time approach pose a certain level of re-identification risk, and both are less robust for continuously updated datasets. One study suggested the Shift and Truncate (SANT) method28, which performs random shifting of patients' records and removes data points within the truncation period, helping to preserve temporal relations while ensuring the protection of privacy. Several drawbacks of this approach were mentioned. First, the date-shifting method fails to maintain seasonality, and truncation of data points may result in the loss of potentially clinically useful information.

For instance, diseases such as new infectious diseases will be largely underrepresented in the dataset, particularly when they occur close to the time boundary of the dataset28. Therefore, researchers should select the ideal date/time obscuring approach by weighing the limitations it may impose on potential research questions and balancing clinical utility against the risk of exposure. In our case, a combination of relative time in minutes with binary day/night and weekend indicators is more suitable for the initial version of our dataset release, thereby maximizing the protection of privacy while preserving the completeness of patients' records. However, those who intend to use the dataset should be mindful of these acknowledged limitations beforehand.
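For contrast with the relative-time decision adopted here, the per-patient date-shifting alternative discussed above can be sketched as follows. This is our own illustration of the general technique, not the SANT implementation from the cited study, and the 365-day shift range is an assumed parameter:

```python
import random
from datetime import timedelta

# Sketch of per-patient random date shifting (an alternative considered,
# not the method adopted for this data set). Each patient's timestamps
# are shifted by ONE consistent random offset, so intervals within a
# record are preserved while absolute dates are obscured; as noted in
# the text, seasonality is not maintained.
def shift_patient_dates(records, max_shift_days=365, seed=None):
    """records: dict mapping patient_id -> list of datetime timestamps.

    Returns a new dict with each patient's timestamps shifted by a
    patient-specific random number of days.
    """
    rng = random.Random(seed)
    shifted = {}
    for pid, timestamps in records.items():
        offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        shifted[pid] = [ts + offset for ts in timestamps]
    return shifted
```

Because the offset is constant within a patient, intra-patient intervals survive; the residual risk described above comes from records falling near the boundaries of the shifted period.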

Age-based Exclusion

The CONCERN study targets adult patients in acute care and intensive care settings; thus, in the original CONCERN database, we removed patients whose age at admission was under 18. MIMIC-IV had a similar exclusion criterion for the protection of children and for ethical considerations. One distinction is that MIMIC removed patients whose age was below 18 at the first visit, while we excluded patients with ages below 18 at admission; however, they could be added into the cohort once they turned 18. Both the MIMIC-IV and CONCERN databases are inapplicable for studying pediatric populations, and findings derived from these data cannot be readily generalized to minors. Beyond that, MIMIC-IV specified the removal of patients on an enhanced protection list to maintain privacy and confidentiality; nevertheless, we perceived some ambiguity in the definition of enhanced protection, which is assumed to encompass a broad spectrum of criteria. To the best of our knowledge, a standardized enhanced protection list is not available; such lists are usually curated locally and may vary between institutions. We note the need for explicit definitions to guide data scientists and other researchers in their use of these data sets, for transparency of scientific research and to drive reproducibility.

In our study, the ages of patients above 89 were changed to "age above 89". Alternatively, the most conservative approach is to remove patients above 89. This is most appropriate when only a small number of individuals in the cohort meet such criteria, since removing those records can reduce the risk of identity exposure. In our cohort, the number of patients with age above 89 is quite large (above 100,000), making identification at the individual level less likely. Hence, we retain these patients' records at this stage. However, when narrative notes are added on top of the structured information in the future, and with the advancement of large language models (LLMs), the joining of all available information sources (e.g., the open internet, social media, released medical datasets, and clinical documentation) could potentially pinpoint a subgroup of the population. The level of risk should be re-evaluated carefully, considering the evolving capabilities of LLMs and the inclusion of new medical data such as images and clinical notes.
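The two age rules above (excluding patients under 18 and obscuring ages above 89) can be sketched together. This is our own illustration; the record layout and the field name "age" are assumptions, not the study's actual schema:

```python
# Illustrative sketch of the age handling described above:
# - patients under 18 are excluded (study inclusion criterion);
# - ages above 89 are replaced with the category "age above 89" (HIPAA).
# The 'age' field name is an assumed schema, for illustration only.
def deidentify_ages(patients):
    """patients: list of dicts, each with an integer 'age' key."""
    released = []
    for patient in patients:
        if patient["age"] < 18:
            continue  # pediatric records are excluded from the release
        record = dict(patient)  # copy; do not mutate the input
        if record["age"] > 89:
            record["age"] = "age above 89"
        released.append(record)
    return released
```

Note that the obscured value is a category label rather than a number, so downstream analyses must treat this field as mixed-type or bucket all ages before modeling.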

Low Occurrence Concepts Removal

We have taken additional steps to enhance privacy protection by removing concept codes (ICD codes, procedure codes, etc.) with low occurrence within our cohort; we denote these concepts as rare conditions in this dataset. Note that our definition of rare conditions is not limited to the list of rare diseases published by the National Institutes of Health34. The removed concept terms were determined using a data-driven approach by analyzing their respective frequency distributions. This procedure was conducted to protect patients with rare diseases and the initial occurrences of unknown diseases, for instance, the first COVID-19 case, which was well documented in the media. It is a more dynamic and flexible strategy compared to a static list of uncommon conditions. The trade-off of this approach is the loss of information in patients' records, depending on the number of excluded conditions and the number of patients affected. Importantly, given this approach, the dataset may not be appropriate for studies focusing on clinical edge cases. Additionally, it is well known that data collected from the EHR are subject to various issues, including systematic errors, incompleteness, and inaccuracy, presenting significant bias when used for clinical research purposes35. Deleting certain conditions from patients' records can further exacerbate bias issues when conducting secondary analyses or building predictive models. In the era of LLMs, with growing dependence on their powerful capabilities for automating large-scale computational analyses and significantly reducing manual effort, more investigation and regulatory processes are needed to oversee any unintended consequences and algorithmic bias.

There have been some ongoing investigations36,37 into mitigating bias in downstream applications, and the first important step is to identify and understand the kinds of biases that exist in the open data sources commonly used by data scientists, as we have aimed to do in this paper.
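The data-driven removal of low-occurrence concept codes can be sketched as follows. This is our own minimal illustration; the threshold value and the list-of-codes-per-patient layout are assumptions, as the paper does not publish its actual cutoff:

```python
from collections import Counter

# Illustrative sketch of low-occurrence concept removal: count each
# concept code across the whole cohort, then drop codes whose cohort-wide
# frequency falls below a threshold. The threshold (min_count) is an
# assumed, illustrative value, not the study's actual cutoff.
def remove_rare_codes(record_codes, min_count=10):
    """record_codes: list of lists of concept codes, one list per patient.

    Returns the same structure with codes occurring fewer than min_count
    times across the cohort removed; patients' other data are preserved.
    """
    counts = Counter(code for codes in record_codes for code in codes)
    return [[c for c in codes if counts[c] >= min_count]
            for codes in record_codes]
```

Because only the rare codes are dropped rather than the whole record, information loss is bounded, matching the third option in Table 2; the bias implications described above follow directly from which codes fall below the cutoff.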

Limitations

This analysis is based on our best current knowledge of potential re-identification risks with the use of LLMs. We expect this knowledge to evolve rapidly, and the actual risks may differ from those we describe. Nonetheless, our stance is that a conservative approach is required for data sets that are being released now, as those data sets will be publicly available to LLMs in the future.

Conclusion

In summary, we presented our de-identification decisions regarding structured data by comprehensively considering data utility, privacy protection, and the risk of re-identification. Although the extent to which LLMs can re-identify or recover individual information based on structured data remains uncertain, we are inclined toward a more conservative stance when making de-identification decisions. The limitations and biases induced by our de-identification approach have been explicitly discussed and acknowledged, and such transparency will be beneficial for future utilization of the datasets.

Acknowledgements

This study was supported and funded by the National Institute of Nursing Research (1R01NR016941) and the American Nurses Foundation (ANF) Reimagining Nursing Initiative. The authors are solely responsible for the content of this work, and it does not necessarily reflect the official view of the National Institutes of Health. We would like to thank the MIMIC team for their generosity in providing expertise and guidance to our team related to the careful release of clinical data sets.

Figures & Tables

Table 2.

Summary of several de-identification methods, evaluated under three aspects: data utility, privacy protection, and risk of re-identification.

De-identification method | Data utility | Privacy protection | Risk of re-identification

Dealing with dates/times
Remove all times | Low. Temporality is critical in understanding and studying patients' health status. | High. Removing all timestamps can prevent the inadvertent disclosure of behavior patterns and clinical routines. | Low. Much more difficult to link conditions/procedures to an individual.
Random shifting of dates and times | High. Most of the temporality is preserved. | Low. One can narrow down the date ranges by observing patients with data at the end of the shifted period28, allowing inferences about individuals. | High. Preserving temporal details at a fine level can impose high re-identification risk for a dataset with a small time span.
Change to relative time | Moderate. Some temporality can be retained, knowing the order of conditions and procedures relative to the admission time. | Moderate. Durations, although less likely to, can still reflect temporal information via interactions with clinical study periods29. | Moderate. It is still possible to leverage the relative times to make some inferences, for instance, by combining the relative times and hospital units.

Dealing with ages
Remove patients aged above 89 | Moderate. Less representative of the population. | High. Completely excludes information from a protected population. | Low. No information can be used to re-identify such a group of individuals.
Obscure ages above 89 | Moderate. Enables secondary analysis of the older population. | Moderate. The cohort population aged above 89 is relatively large (above 100,000). | Moderate. Their identity could be further narrowed by combining narrative notes released later.

Dealing with rare data
Remove only rare disease diagnostic codes | High. In general, we anticipate the prevalence of each rare disease to be less than 1%30. | Low. The list of rare disease codes is incomplete and requires regular, dynamic updates. | High. The risk of re-identification is high when information that can be linked with rare diseases is retained in the chart.
Remove all data for patients with any condition that has a low occurrence in the data set | Moderate. Truncation of the dataset may lead to loss of information. | High. All relevant information is removed, providing high protection of privacy. | Low. The complete removal of patients' records makes it difficult to re-identify them.
Remove only the specific conditions that have low occurrence in the data set, preserving the rest of the patient's data | Moderate. Less information loss. | High. Rare concept codes such as diagnosis or procedure codes are removed. | Low. Excluding the low-occurrence concept codes minimizes the risk of re-identifying individuals.

References

  • 1. National Academies of Sciences, Engineering, and Medicine; Committee on Toward an Open Science Enterprise. Open science by design: Realizing a vision for 21st century research. 2018.
  • 2. Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data. 2023;10(1):1. doi: 10.1038/s41597-022-01899-x.
  • 3. Faltys M, Zimmermann M, Lyu X, Hüser M, Hyland S, Rätsch G, et al. HiRID, a high time-resolution ICU dataset (version 1.1.1). PhysioNet; 2021.
  • 4. Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Scientific Data. 2020;7(1):14. doi: 10.1038/s41597-020-0355-4.
  • 5. Pollard TJ, Johnson AE, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data. 2018;5(1). doi: 10.1038/sdata.2018.178.
  • 6. Thoral PJ, Peppink JM, Driessen RH, Sijbrands EJ, Kompanje EJ, Kaplan L, et al. Sharing ICU patient data responsibly under the Society of Critical Care Medicine/European Society of Intensive Care Medicine joint data science collaboration: the Amsterdam University Medical Centers Database (AmsterdamUMCdb) example. Critical Care Medicine. 2021;49(6):e563. doi: 10.1097/CCM.0000000000004916.
  • 7. König IR, Malley J, Weimar C, Diener H, Ziegler A. Practical experiences on the necessity of external validation. Statistics in Medicine. 2007;26(30):5499–511. doi: 10.1002/sim.3069.
  • 8. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. Journal of Clinical Epidemiology. 2016;69:245. doi: 10.1016/j.jclinepi.2015.04.005.
  • 9. Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine. 2021;181(8):1065–70. doi: 10.1001/jamainternmed.2021.2626.
  • 10. Kramer AA, Zimmerman JE. A predictive model for the early identification of patients at risk for a prolonged intensive care unit length of stay. BMC Medical Informatics and Decision Making. 2010;10(1):1–16. doi: 10.1186/1472-6947-10-27.
  • 11. Wang RZ, Sun CH, Schroeder PH, Ameko MK, Moore CC, Barnes LE. Predictive models of sepsis in adult ICU patients. IEEE; 2018. pp. 390–1.
  • 12. Qian Q, Wu J, Wang J, Sun H, Yang L. Prediction models for AKI in ICU: a comparative study. International Journal of General Medicine. 2021:623–32.
  • 13. Iwase S, Nakada T, Shimada T, Oami T, Shimazui T, Takahashi N, et al. Prediction algorithm for ICU mortality and length of stay using machine learning. Scientific Reports. 2022;12(1):12912. doi: 10.1038/s41598-022-17091-5.
  • 14. Elliott KC, Resnik DB. Making open science work for science and society. Environmental Health Perspectives. 2019;127(7):075002. doi: 10.1289/EHP4808.
  • 15. de Kok JW, de la Hoz MÁA, de Jong Y, Brokke V, Elbers PW, Thoral P, et al. A guide to sharing open healthcare data under the General Data Protection Regulation. Scientific Data. 2023;10(1):404. doi: 10.1038/s41597-023-02256-2.
  • 16. Rossetti SC, Knaplund C, Albers D, Dykes PC, Kang MJ, Korach TZ, et al. Healthcare process modeling to phenotype clinician behaviors for exploiting the signal gain of clinical expertise (HPM-ExpertSignals): development and evaluation of a conceptual framework. Journal of the American Medical Informatics Association. 2021;28(6):1242–51. doi: 10.1093/jamia/ocab006.
  • 17. Health Insurance Portability and Accountability Act of 1996. Public Law. 1996;104:191.
  • 18. Rothstein MA. Is deidentification sufficient to protect health privacy in research? The American Journal of Bioethics. 2010;10(9):3–11. doi: 10.1080/15265161.2010.494215.
  • 19. Zhang Z, Yan C, Malin BA. Membership inference attacks against synthetic health data. Journal of Biomedical Informatics. 2022;125:103977. doi: 10.1016/j.jbi.2021.103977.
  • 20. Neel S, Chang P. Privacy issues in large language models: A survey. arXiv preprint arXiv:2312.06717. 2023.
  • 21. Espejel JL, Ettifouri EH, Alassan MSY, Chouham EM, Dahhane W. GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts. Natural Language Processing Journal. 2023;5:100032.
  • 22. Li H, Chen Y, Luo J, Kang Y, Zhang X, Hu Q, et al. Privacy in large language models: Attacks, defenses and future directions. arXiv preprint arXiv:2310.10383. 2023.
  • 23. Karabacak M, Margetis K. Embracing large language models for medical applications: opportunities and challenges. Cureus. 2023;15(5). doi: 10.7759/cureus.39305.
  • 24. Chen F, Bokhari SMA, Cato K, Gürsoy G, Rossetti SC. Examining the generalizability of pretrained de-identification transformer models on narrative nursing notes. Applied Clinical Informatics. 2024.
  • 25. Hartman T, Howell MD, Dean J, Hoory S, Slyper R, Laish I, et al. Customization scenarios for de-identification of clinical notes. BMC Medical Informatics and Decision Making. 2020;20(1):1–9. doi: 10.1186/s12911-020-1026-2.
  • 26. Huang J, Shao H, Chang KCC. Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628. 2022.
  • 27. Plant R, Giuffrida V, Gkatzia D. You are what you write: Preserving privacy in the era of large language models. arXiv preprint arXiv:2204.09391. 2022.
  • 28. Hripcsak G, Mirhaji P, Low AF, Malin BA. Preserving temporal relations in clinical data while maintaining privacy. Journal of the American Medical Informatics Association. 2016;23(6):1040–5. doi: 10.1093/jamia/ocw001.
  • 29. Sarpatwari A, Kesselheim AS, Malin BA, Gagne JJ, Schneeweiss S. Ensuring patient privacy in data sharing for postapproval research. New England Journal of Medicine. 2014;371(17):1644–9. doi: 10.1056/NEJMsb1405487.
  • 30. Richter T, Nestler-Parr S, Babela R, Khan ZM, Tesoro T, Molsen E, et al. Rare disease terminology and definitions - a systematic global review: report of the ISPOR Rare Disease Special Interest Group. Value in Health. 2015;18(6):906–14. doi: 10.1016/j.jval.2015.05.008.
  • 31. Manadan A, Arora S, Whittier M, Edigin E, Kansal P. Patients admitted on weekends have higher in-hospital mortality than those admitted on weekdays: Analysis of National Inpatient Sample. American Journal of Medicine Open. 2023;9:100028. doi: 10.1016/j.ajmo.2022.100028.
  • 32. Bell CM, Redelmeier DA. Mortality among patients admitted to hospitals on weekends as compared with weekdays. New England Journal of Medicine. 2001;345(9):663–8. doi: 10.1056/NEJMsa003376.
  • 33. Mizuno S, Kunisawa S, Sasaki N, Fushimi K, Imanaka Y. Effects of night-time and weekend admissions on in-hospital mortality in acute myocardial infarction patients in Japan. PLoS One. 2018;13(1):e0191460. doi: 10.1371/journal.pone.0191460.
  • 34. National Institutes of Health. Advancing Rare Disease Research: The Intersection of Patient Registries, Biospecimen Repositories and Clinical Data.
  • 35. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20(1):117–21. doi: 10.1136/amiajnl-2012-001145.
  • 36. Li F, Wu P, Ong HH, Peterson JF, Wei WQ, Zhao J. Evaluating and mitigating bias in machine learning models for cardiovascular disease prediction. Journal of Biomedical Informatics. 2023;138:104294. doi: 10.1016/j.jbi.2023.104294.
  • 37. Barda N, Yona G, Rothblum GN, Greenland P, Leibowitz M, Balicer R, et al. Addressing bias in prediction models by improving subpopulation calibration. Journal of the American Medical Informatics Association. 2021;28(3):549–58. doi: 10.1093/jamia/ocaa283.
