Abstract
Objectives:
We evaluated methods for preparing electronic health record data to reduce bias before applying artificial intelligence (AI).
Methods:
We created methods for transforming raw data into a data framework for applying machine learning and natural language processing techniques to predict falls and fractures. To promote a reduction in bias, strategies such as inclusion and reporting of multiple races, use of mixed data sources (outpatient, inpatient, structured codes, and unstructured notes), and addressing missingness were applied to the raw data. The raw data were carefully curated using validated definitions to create variables such as age, race, gender, and healthcare utilization, drawing on clinical, statistical, and data expertise. The research team included experts with diverse professional and demographic backgrounds to incorporate diverse perspectives.
Results:
For the prediction of falls, information extracted from radiology reports was converted to a matrix for applying machine learning. The processing of the data resulted in an input of 5,377,673 reports to the machine learning algorithm, of which 45,304 were flagged as positive and 5,332,369 as negative for falls. Processed data resulted in lower missingness and a better representation of race and diagnosis codes. For fractures, specialized algorithms extracted snippets of text around the keyword “femoral” from dual x-ray absorptiometry (DXA) scan reports to identify femoral neck T-scores that are important for predicting fracture risk. The natural language processing algorithms yielded 98% accuracy and a 2% error rate. The methods to prepare data for input to artificial intelligence processes are reproducible and can be applied to other studies.
Conclusion:
The life cycle of data from raw to analytic form includes data governance, cleaning, management, and analysis. When applying artificial intelligence methods, input data must be prepared optimally to reduce algorithmic bias, as biased output is harmful. Building AI-ready data frameworks that improve efficiency can contribute to transparency and reproducibility. The roadmap for the application of AI involves applying specialized techniques to input data, some of which are suggested here. This study highlights data curation aspects to be considered when preparing data for the application of artificial intelligence to reduce bias.
Keywords: Fairness, Algorithms, Data preparation, Artificial Intelligence, Inclusion, Diversity
Graphical Abstract
1. Introduction
Fairness in data has deep nuances that have resulted in many statistical and mathematical definitions in the literature. However, there is a less explored interplay between lack of bias and implementation of equity. Here we use the term fairness in AI as a combined concept of reduction in bias along with a methodological approach to implementing equity within the limitations of the given data. Data-driven decision-making is dependent on algorithms, which explains the focus on algorithmic bias in COVID-19's disproportionate infection and morbidity rates by race [1]. Aspects of artificial and augmented intelligence such as reporting quality, reproducibility, and transparency depend on the foundational data upon which these algorithms are based. Understanding the context of data collection is foundational to research [2]. We may risk misinterpreting data if we do not consider its provenance in design and analysis. We may even introduce bias if the underlying data misrepresent sample characteristics such as race, prevalence of disease, or clinical outcomes [3]. The literature suggests that the most important application of machine learning (ML) to health data so far has been phenotyping, classifying, and categorizing patients with a disease, as patients can be retrospectively selected using already existing data. Foundational work is required to prepare electronic health record (EHR) data, including unstructured text notes, in a reproducible manner that is based on principles of data structuring and continuous vigilance of algorithms for fairness.
Data goes through extensive curation from its raw form to the state appropriate for inclusion in artificial intelligence (AI) models. EHR data is especially challenging because it typically comprises multiple data sources and formats. Data curation methods may themselves introduce bias. For example, past studies [4] have found evidence of bias in algorithms introduced by factors such as predicting disease based on healthcare utilization or cost, as many individuals with the disease may not have access to healthcare. Unequal access to healthcare by race may lead to under-representation by race when predicting an illness based on cost data.
To demonstrate the role data preparation methods play in supporting algorithmic fairness, we evaluated the data processing methods used in two studies focused on falls and fractures before applying AI and natural language processing (NLP). We used clinical data from the Veterans Affairs (VA) EHR, which contains data on over 13 million Veterans in care. The goal was to prepare data with a reproducible and rigorous approach for the foundational work required to create machine learning algorithms that reliably predict the risk of falls and NLP algorithms that identify T-scores (bone density compared with a normal reference value). At each step, data cleaning and query design were performed with a holistic approach to fairness and a reduction in bias. The sequential flow of steps suggested in this study presents a package, an innovative approach to preparing AI-ready data that reduces bias and promotes fairness.
| Statement of Significance | |
|---|---|
| Problem | Data preparation for AI lacks standards for reducing bias and building fairness. |
| What is already known | AI output may be biased due to methods that lack fairness and transparency. Prior studies have covered AI models, but less focus is given to the data preparation that is used as input to AI models. |
| What this paper adds | This study showcases examples from Veterans Affairs data that used AI for predicting risk of falls and NLP for identifying T-scores (fracture risk). We present recommendations on strategies for promoting fairness and reducing bias in data preparation methods that are foundational to AI. |
2. METHODS
In this section, we present the methods employed for reducing bias and ensuring the reproducibility and transparency of our algorithms for the falls and fractures work. Some strategies overlap and others are specific to the goal of each research question: identification of falls with risk calculation was achieved with ML and NLP algorithms, and T-score extraction (indicative of bone mineral density) for fractures was done using NLP techniques.
2.1. Data preparation pipeline
The pyramid structure in Figure 1 denotes the foundational steps that form the base of each process in the pipeline. The steps taken to get the data ready for applying AI methods were:

1. Data collection using VA EHR data, for which we augmented our primary data source with additional sources such as the Centers for Medicare and Medicaid Services (CMS) to include care received outside the VA. Examples of data domains are visits, diagnoses, demographics, and radiology.
2. Data cleaning performed with validated definitions that have been used across studies conducted at the Veterans Aging Cohort Study (VACS). These definitions have been validated by clinical experts and have been tested on multiple data sources. Data missingness is mitigated by imputation or other relevant strategies.
3. Radiology report text used to create snippets of data with relevant key terms. The snippets were validated by chart reviews performed by clinical experts.
4. Output from NLP and ML models validated against other data sources, such as diagnostic ICD codes.

A minimal sketch of the first two steps is shown after Figure 1.
Figure 1:
Data preparation pyramid
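The study's data preparation was implemented with SQL and SAS inside the VA environment; as a minimal illustration of the first two pyramid layers (multi-source collection and missingness checks), the following Python/pandas sketch uses hypothetical table and column names and is not the study's code.

```python
# A minimal, hypothetical sketch of the first two pyramid layers: augmenting the
# primary VA EHR extract with CMS records for care received outside the VA, then
# profiling missingness before cleaning. DataFrame and column names are illustrative.
import pandas as pd

va_visits  = pd.DataFrame({"patient_id": [1, 2], "source": "VA",
                           "race": ["Black", None]})
cms_visits = pd.DataFrame({"patient_id": [2, 3], "source": "CMS",
                           "race": ["White", "Asian"]})

# 1) Data collection: combine sources so out-of-VA care is represented.
combined = pd.concat([va_visits, cms_visits], ignore_index=True)

# 2) Data cleaning: measure missingness per variable; fill race from the other
#    source when one record for the same patient has it (a simple stand-in for
#    the validated multi-source cleaning steps described in the text).
print(combined.isna().mean())                      # fraction missing per column
combined["race"] = combined.groupby("patient_id")["race"].transform(
    lambda s: s.ffill().bfill())
print(combined)
```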
2.2. Falls
For the falls study, we created a cohort of Veterans who had a serious fall (significant enough to result in a visit to a healthcare provider). Our ICD codes (E880.X, E881.X, E884.X, E885.9, E886.9, E888.X) were derived using the Agency for Healthcare Research and Quality (AHRQ) general equivalence mapping (GEM) database and are extensively validated. We then created a cohort database by applying data management techniques to include data from different domains of the EHR. Missing data and lower-than-expected frequencies were additionally checked against the foundational data and were corrected or documented as missing. For example, if radiology reports from a specific VA site were lower than expected, additional checks were conducted to ensure data from all sites visited by the Veteran were extracted. Another example was in extracting data for specific ICD codes: if data were missing or low for a specific code, we ensured visits from all sites were included and that the ICD code dimension tables were properly joined on the linking variables. The data domains included pharmacy, visits or encounters, diagnoses, demographics, and radiology notes. Definitions for variables such as gender can be differentiated from sex at birth and validated with self-reported data in our cohort; this analysis was restricted to sex at birth. Race and ethnicity were cleaned with validated cleaning steps that included multiple sources to fill in missing data. Missingness in the EHR data was imputed from self-report or from validated sources [5].

VA data include millions of patients with records across multiple domains, which require extract, transform, load (ETL) routines to create cohorts of interest. Programming in structured query language (SQL) and Statistical Analysis System (SAS) software was then applied to create clinically meaningful and validated variables [6]. Figure 2 is a simplified depiction of the flow of data from the VA corporate data warehouse (CDW) in its raw form through SQL table joins for the relevant domains, using programmatic routines for both the falls and fracture work (a minimal sketch of this join logic follows Figure 2). The dimension and source tables used for joins form the second layer in the figure. The SQL algorithms included the programming for creating the cohort and extracting demographics, diagnoses, clinical notes, labs, and visits for the cohort. Cleaning steps were applied to these data tables to create analytically meaningful datasets. This included, but was not limited to, making essential joins with dimension tables, calculating active medication counts and refills, linking a visit to diagnosis, notes, and pharmacy data, and analyzing it longitudinally over a defined period of time. The performance of these routines was optimized based on the processing power of the environment. The VA Informatics and Computing Infrastructure (VINCI) [7] is a unique environment available to researchers, with real-time electronic health record data and high-efficiency computing power.
Figure 2:
Data collection process, starting from extracting raw data from corporate data warehouse (CDW) to joining the tables for cohort extraction with diagnoses and radiology reports
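The joins themselves were written in SQL against CDW tables; the following pandas sketch is only a toy illustration of the underlying idea depicted in Figure 2 (pooling visits from all stations a Veteran used and joining the ICD dimension table on its linking key). Table and column names (visits, icd_dim, patient_id, sta3n, icd_sid) are illustrative, not the CDW schema.

```python
# A minimal, hypothetical sketch of the join step: keep visits from every station
# and join them to the ICD dimension table so reports and diagnoses are not lost.
import pandas as pd

visits = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "sta3n":      ["506", "523", "506"],   # station where care was received
    "icd_sid":    [1001, 1002, 1001],      # surrogate key into the ICD dimension
})
icd_dim = pd.DataFrame({
    "icd_sid":  [1001, 1002],
    "icd_code": ["E885.9", "E888.X"],      # validated fall codes
})

# Join on the linking variable and keep visits from ALL stations for each patient.
cohort = visits.merge(icd_dim, on="icd_sid", how="left")
falls_per_patient = cohort.groupby("patient_id")["icd_code"].nunique()
print(cohort)
print(falls_per_patient)
```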
2.2.1. Implementing a rule-based algorithm
Structured health record data often lack concepts that may be identifiable with NLP in clinical notes, free text, or unstructured data. For the notes, we followed specialized data preparation techniques with NLP to identify falls. In unstructured data, there are no standard entries or numerical results filled into a structured field, and clinicians may use different sentences and words to describe the same concept. Variations include word order, word variants, abbreviations, acronyms, synonyms, punctuation, misspellings, etc. Therefore, NLP was an important step before we could feed clinical notes into ML classifier algorithms. Data representation was the first and most critical task in building the classifier. For the study analysis, an ML model was developed to identify the fall event for each radiology report. Each radiology report was represented by a feature set of words and Unified Medical Language System (UMLS) concepts.
We utilized YTEX, an NLP tool built on top of the Apache Clinical Text Analysis and Knowledge Extraction System (cTAKES) [8], which comprises sentence splitting, tokenization, part-of-speech tagging, shallow parsing, named entity recognition, and storage of all annotations in a SQL database. The cTAKES/YTEX application created normalized output from the tokenizer and the concept unique identifiers from the named entity recognition component and saved them in database tables. The tokenizer component included word tokenization, normalization (lowercasing, stemming, and lemmatization), and stop word removal. The named entity recognition component maps a span of text to a dictionary lookup table, which is prepopulated with UMLS concepts based on the study subject. For this study, a combination of words and concept identifiers was used as the feature set: a “bag of concepts” was extracted from each radiology report and represented as a binary feature vector.

We considered both variable-ranking feature selection and embedded feature selection. Mutual information (MI) was used to rank features according to their predictive power for the target “fall” value, and features with the highest MI with respect to the training labels were selected. Using the training set, the 100 features with the highest MI for the “fall” label and 100 for the “not fall” label were chosen. These were then used to train the support vector machine (SVM) classifiers. The extracted information from the radiology reports was converted into a structured form (matrix) to which the ML techniques were applied. The data were split 80/20 into training and testing sets. Figure 3b illustrates the fall classification workflow: radiology report data were input into the cTAKES/YTEX pipeline, which generated the NLP results, and these were input into an ML classifier to produce the classification results (a minimal sketch of this classification step follows Figure 3a).

At each stage, steps were taken to promote fairness by reducing bias. The selection of radiology reports was diversified within the cohort selection criterion, and data cleaning included multiple data sources to augment the missingness from the EHR data, which was the main source. If missingness still persisted, additional imputation strategies (such as multiple imputation) were selected based on the kind of missing data. The output of NLP was chart reviewed by clinical experts for validation, and the results were compared to other classification methods, such as ICD codes. Figure 3b also illustrates the classifier using the trained model to make the prediction. Figure 3a depicts the bag of words and concepts model. The output of the model was “1” indicating a fall and “0” indicating no fall. We identified the first fall that occurred in a time window for a patient and then used that in developing the model. The algorithm operates at the level of the fall and is then rolled up to the patient level. All tables presented in this study are based on analyses exploring falls in the patient population.
Figure 3b:
Falls classification workflow, including stages from radiology reports to classification results
Figure 3a:
Bag of words and concepts model example to show flagging of words
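The classification step in the study was built on cTAKES/YTEX output stored in SQL tables; the sketch below is a hypothetical, self-contained Python illustration of the same ingredients (binary bag-of-words-and-concepts features, mutual-information feature ranking, an 80/20 split, and an SVM), using toy reports and scikit-learn rather than the study's data and tooling.

```python
# A minimal, hypothetical sketch of the report-classification step described above.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Toy stand-ins for radiology report text augmented with UMLS concept IDs.
reports = [
    "patient slipped and fell on ice C0000921 left hip pain",
    "mechanical fall from standing height C0000921 no fracture",
    "found down after fall C0000921 head ct ordered",
    "fall from ladder C0000921 wrist radiograph",
    "routine chest radiograph no acute findings",
    "follow up imaging of known pulmonary nodule",
    "screening mammogram bilateral",
    "abdominal ultrasound for elevated liver enzymes",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])        # 1 = fall, 0 = not fall

X_train, X_test, y_train, y_test = train_test_split(
    reports, labels, test_size=0.20, random_state=42, stratify=labels)

vec = CountVectorizer(binary=True)                  # binary "bag of words and concepts"
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

# Rank features by mutual information with the fall label and keep the top k
# (the study kept the 100 highest-MI features for each label).
mi = mutual_info_classif(Xtr, y_train, discrete_features=True, random_state=0)
k = min(200, Xtr.shape[1])
top = np.argsort(mi)[::-1][:k]

clf = LinearSVC().fit(Xtr[:, top], y_train)         # SVM: output 1 = fall, 0 = no fall
print(classification_report(y_test, clf.predict(Xte[:, top])))
```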
2.3. Fractures
For the fractures study, the cohort was created by identifying people based on diagnostic ICD codes for fractures (a link to the complete list is in the appendix). We included wrist, hip, vertebral, and upper arm fractures. These codes were validated with a chart review of both radiology reports and progress notes. Data were stratified by age, race, and sex, with the majority of patients between the ages of 45 and 65. The stratification process involved applying validated definitions for the above variables and representing every subgroup in the cohort; SQL algorithms were applied to perform the stratification. A diverse inclusion based on race and gender was achieved. To identify the data for the ML algorithms, we used Current Procedural Terminology (CPT) codes from procedures to identify dual x-ray absorptiometry (DXA) scans. The codes used (77080, 77085, G8861) were validated by chart reviews. Once people with the relevant procedure codes were identified, we collected 1,387,479 DXA reports related to these scans from unstructured radiology exam tables. From these report texts, we created snippets of text related to key terms (“femoral”) identified by clinicians that would be relevant in mining these reports for additional context. NLP methods were run on the DXA scan reports to create snippets that provide context for the algorithm to extract T-scores. Data management and querying techniques were applied using SQL Server analysis and Microsoft .NET tools.
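As a simple illustration of the DXA identification step, the sketch below filters a hypothetical procedures table to the validated CPT codes listed above; table and column names are illustrative, not the CDW schema.

```python
# A minimal, hypothetical sketch: find patients with DXA scans by CPT code,
# whose radiology exam text is then pulled for snippet creation.
import pandas as pd

DXA_CPT_CODES = {"77080", "77085", "G8861"}          # validated by chart review

procedures = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "cpt_code":   ["77080", "99213", "G8861", "71020"],
})

dxa_patients = procedures.loc[
    procedures["cpt_code"].isin(DXA_CPT_CODES), "patient_id"].unique()
print(dxa_patients)          # patients whose DXA reports will be collected
```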
For the fractures study, identifying T-scores was a key factor for future fracture prediction work. To prepare the data for NLP, we developed a rule-based algorithm to retrieve T-scores from the snippets. We developed two groups of regular expressions. The first group was used to retrieve T-score values in table format: we first matched the table header in the snippet and then matched the value in the following context. The second group of regular expressions was used to match the T-score in plain language; in radiology notes, we were able to detect sentence patterns containing the T-score. We annotated many snippets with a T-score in plain language before developing the pattern-matching regular expressions.
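The study's regular expressions were developed and refined against annotated snippets; the sketch below is a simplified, hypothetical illustration of the two groups described above (a table-style match and a plain-language match) and is not the validated pattern set.

```python
# A minimal, hypothetical sketch of rule-based T-score extraction from DXA snippets.
import re

# Group 1: T-score reported in a table-like layout after a "T-score" header.
TABLE_PATTERN = re.compile(r"t[-\s]?score\s*[:|]?\s*(-?\d+\.\d+)", re.IGNORECASE)

# Group 2: T-score expressed in plain language within a sentence.
PROSE_PATTERN = re.compile(
    r"t[-\s]?score\s+(?:of|is|was|=)\s*(-?\d+\.\d+)", re.IGNORECASE)

def extract_t_scores(snippet: str) -> list[float]:
    """Return all candidate T-score values found in a DXA report snippet."""
    matches = TABLE_PATTERN.findall(snippet) or PROSE_PATTERN.findall(snippet)
    return [float(m) for m in matches]

print(extract_t_scores("Femoral neck: T-score -2.6  Z-score -1.1"))   # [-2.6]
print(extract_t_scores("The femoral neck T-score was -1.8."))         # [-1.8]
```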
2.4. Snippet formulation
Unstructured clinical notes were originally envisioned as a tool for humans to provide context to other humans involved in patient care. Their purpose evolved to bring accountability to care and to become a valuable resource for information that was not structurally coded. These are large bodies of text with meaningful information hidden in short snippets around relevant terms like “fall”, “trip”, “slip”, and “femoral”. These snippets were extracted from the unstructured notes to pull meaningful short passages from the large body of notes, providing context for input to AI methods. The snippets were generated with specialized algorithms to extract relevant context for the risk prediction analysis. We included a diverse set of individuals from the cohort in the snippet extraction. Multiple iterations of the snippet extraction process were typically required to provide enough context in the text for the ML algorithms. We did a chart review on a set of notes to rule out the possibility of irrelevant imaging data. A review of terms to promote a reduction in bias in the context of T-score extraction was performed by experts. With each iteration, we refined the snippet length and negated terms that were out of context. SQL functions were utilized to perform the extraction of the snippets for relevant keywords; the functions were called within queries used to pull notes with the relevant snippets. This helped optimize snippet size for the NLP algorithms.
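Snippet extraction in the study was implemented with SQL string functions inside the VA environment; the following Python sketch illustrates the general idea of a keyword-centered window (the 50-word window reflects the length reported in the Results), with a hypothetical function name and toy input.

```python
# A minimal, hypothetical sketch of keyword-centered snippet extraction.
def make_snippets(note_text: str, keyword: str, window: int = 50) -> list[str]:
    """Return snippets of `window` words on each side of every keyword occurrence."""
    words = note_text.split()
    snippets = []
    for i, word in enumerate(words):
        if keyword.lower() in word.lower():            # matches "femoral", "femoral." etc.
            start, end = max(0, i - window), i + window + 1
            snippets.append(" ".join(words[start:end]))
    return snippets

# Example: pull the context around "femoral" from a DXA report.
report = "DXA bone densitometry report: femoral neck T-score -2.6, lumbar spine T-score -1.4"
for snip in make_snippets(report, "femoral", window=50):
    print(snip)
```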
3. RESULTS
The methodologies for promoting fairness and reducing bias were embedded in the data preparation process for application of AI methods. These techniques are reusable and can be applied to other studies.
3.1. Falls
For the falls study, 5,377,673 reports were processed by the ML algorithm, of which 45,304 were flagged as a positive prediction and 5,332,369 as negative. There were 170,179 people in the cohort, with 2.06% of falls identified with ICD codes only, 2.5% with ML only, and 3.85% with the composite approach. Table 1 summarizes the output in terms of the number of people with positive and negative predictions by sex, race, and ethnicity. Table 1A presents the accuracy metrics of the model.
Table 1:
Prediction output for Falls by sex, race & ethnicity
| | Number of people with radiology reports | Prediction | Percent prediction (%) |
|---|---|---|---|
| Sex | | | |
| Female | 4,300 | Negative | 81 |
| | 1,022 | Positive | 19 |
| Male | 144,811 | Negative | 88 |
| | 20,046 | Positive | 12 |
| Race | | | |
| Black | 65,389 | Negative | 87 |
| | 9,538 | Positive | 13 |
| White | 55,161 | Negative | 87 |
| | 8,106 | Positive | 13 |
| Asian | 667 | Negative | 93 |
| | 47 | Positive | 7 |
| American Indian/Alaska Native | 794 | Negative | 86 |
| | 130 | Positive | 14 |
| Native Hawaiian/Other Pacific Islander | 1,221 | Negative | 88 |
| | 168 | Positive | 12 |
| Other | 26,440 | Negative | 92 |
| | 2,518 | Positive | 8 |
| Ethnicity | | | |
| Hispanic | 14,560 | Negative | 89 |
| | 1,816 | Positive | 11 |
| Non-Hispanic | 134,777 | Negative | 87 |
| | 19,026 | Positive | 12 |
3.2. Fractures
For the DXA analysis, we found that the length of the snippet is an important factor for match accuracy. Shorter snippets missed important sentence parts and led to missed T-score values. Longer snippets were desirable, but if snippets are too long, they generate false T-score matches (Table 2A). After testing different snippet lengths (including the whole report), we found that snippets of 50 words before and after the keyword were the proper length: more than 50 words increased the noise in the data, and fewer than 50 words did not provide sufficient context. The final predictive model for fragility fractures had 98% accuracy and a 2% error rate (number of false predictions / total number of reports).
3.3. Raw vs Processed data
Data processing strategies applied to the raw EHR data resulted in a more complete dataset by addressing missing data, integrating multiple data sources, and curating variables such as race and ethnicity. The VA has over 1,300 healthcare facilities across the United States. It is common for Veterans to access care at multiple sites, which is made easier by an integrated EHR system. In the raw data, the linkages across sites need to be made by specialized algorithms to get a complete picture of the patient's health records. This data integration helped to get all radiology reports from each station a patient may have travelled to. With processed data, we extracted 5,377,673 reports, whereas without the station linkage the raw data provided only 2,732,494 reports. From this we concluded that, even if the number of people is the same, specialized processes are needed to get all the data from the EHR; not doing so resulted in losing radiology reports and visits. Race and ethnicity in EHR data come from self-report, and we often found these data to be missing, especially for minority groups. The race categories were primarily White, Black, Asian, Pacific Islander, and American Indian. Ethnicity was categorized as Hispanic, Latino, or non-Hispanic. For race and ethnicity, missingness was lower in the processed data (Table 2) for all categories.
Table 2:
Difference between raw and processed data for race and diagnosis codes for falls
| Category | Processed data (%) | Raw data (%) |
|---|---|---|
| Race | | |
| White | 46 | 42 |
| Black | 51 | 49 |
| Asian | <1 | <1 |
| American Indian | <1 | <1 |
| Pacific Islander | 1 | <1 |
| Missing | <1 | 5 |
| Ethnicity | | |
| Hispanic or Latino | 8 | 8 |
| Not Hispanic or Latino | 91 | 87 |
| Missing | 1 | 5 |
| ICD codes: Falls | | |
| ICD9 codes | 52 | 27 |
| ICD10 codes | 48 | 24 |
| Missing | 0 | 49 |
| ICD codes: Fractures | | |
| ICD9 codes | 61 | 32 |
| ICD10 codes | 39 | 20 |
| Missing | 0 | 48 |
ICD codes almost doubled when visits from all stations and outside-VA care were included, for both falls and fractures (Table 2). The data curation was most impactful for the NLP algorithm (Table 3): the number of concepts was substantially higher in the processed data. The output generated by feeding processed data to the ML algorithm directly impacts the model's performance, as the results of the ML algorithm depend on the amount and quality of data used for training. The quality of raw versus processed data (Table 3) directly impacts the interpretation of falls by demographic characteristics like race and gender. For example, if missing data misrepresent a specific demographic group, they may lead to misinterpretation of clinical findings for that group. The raw data predicted positive falls for 6% of people in the cohort, whereas the processed data predicted that 12% of people had a fall.
Table 3:
Difference between output for raw vs processed data from ML and NLP for falls
| NLP output for falls | Processed data | Raw data |
|---|---|---|
| Number of UMLS concepts | 101,734,369 | 84,533,827 |
| Number of annotations | 260,694,068 | 201,938,057 |
| Predictions for fall (N = 170,169) | Processed data (%) | Raw data (%) |
| Positive | 12 | 6 |
| Negative | 88 | 94 |
4. DISCUSSION
Algorithmic bias can introduce inequity into clinical decision-making and healthcare research [9,10]. To prevent bias and produce reliable results, the data that are the foundation for the algorithms need careful curation. Conscious cohort selection and definition can help reduce bias and bring awareness to limitations. Our cohort definition was made with awareness of the limitations of our data environment: VA data are racially diverse and include many women, but women represent a smaller proportion than male Veterans. We used additional data sources, such as data from CMS, to augment VA data to include people who may have received care outside of the VA. The data preparation procedures were curated with the intention of preventing demographic bias. Since every data environment is unique, data readiness for the application of artificial intelligence is not a one-size-fits-all solution [11]. The preparation of data depends on the context of the research question's data domain as well as the specific AI method it will be used for. To avoid bias based on race or ethnicity, we did not limit ourselves to a specific method for cohort selection but rather used a combination of ICD codes and text notes so that we did not underrepresent people who may not have a diagnosis code. Cohort creation can be inherently biased if it makes assumptions about access to care or diagnosis.
There is no single specialized technique that can address the fairness concern; it is a combination of strategies and steps that the data curator must keep in mind (Table 4). Stratifying data during preparation is one example; without it, a subgroup may be missing or underrepresented in the dataset. We used validated data cleaning strategies to create variables such as race and ethnicity. If the training sample is not a fair representation of the population, bias can be inherent in the underlying dataset; oversampling minority groups helps to address this issue. We excluded missing or erroneous ICD codes, missing visit dates, and people of invalidated race or sex; this approach may increase the underrepresentation of groups that are missing data. Past work has shown that removing race and ethnicity as predictors in a prognostic study of racial bias worsened fairness, leading to inappropriate care recommendations in a cancer recurrence risk algorithm [12,13]. Standardizing data preparation methods, such as using multiple data sources, including different races and genders in the data, and factoring them into prediction models, can help reduce bias and promote fairness [14]. There has been a recent drive toward paying attention to social determinants of health in EHR data. While including social determinants may promote fairness, what is done to those data in preparation before analysis is equally important. Standardizing reproducible methods across teams working on designing and building AI-ready data can promote improving these methods as they are reused over time.
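As one concrete illustration of the oversampling strategy mentioned above, the sketch below upsamples an under-represented subgroup with scikit-learn's resample utility; the data are hypothetical, and the appropriate strategy in practice depends on the study design.

```python
# A minimal, hypothetical sketch of oversampling an under-represented subgroup
# so the training sample better reflects the population.
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame({
    "race":  ["White"] * 8 + ["Asian"] * 2,
    "label": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

majority = train[train["race"] == "White"]
minority = train[train["race"] == "Asian"]

# Sample the minority subgroup with replacement up to the size of the majority.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["race"].value_counts())
```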
Table 4:
Summary of steps to promote reduction in bias
| Step | Details |
|---|---|
| 1. Diversify research team and expertise | a) Different perspectives and strategies may emerge with a diverse research team; include not only demographic diversity but also skill diversity in research informatics groups. b) Review the literature for existing data preparation strategies. |
| 2. Standardize data cleaning strategies | a) Create standardized procedures for data cleaning that can be applied across studies, such as how to address missingness, duplication, and redundancy. b) Check for available standard definitions that are validated and created with input from clinical experts, for example defining a “serious fall” in EHR data. c) Share and solicit feedback on data cleaning methods from experts on the team through communication strategies like code review sessions. |
| 3. Validate cleaning procedures | a) Perform post-analysis checks to confirm that cleaning steps worked optimally; validate methods by applying them to other cohorts/studies and comparing results. b) Check for representation of subgroups; do not negate populations conditionally; check for adequate representation of samples within the cohort, and if this is not possible, state the limitation clearly. For example, VA data have more males than females; stating that this is because the Veteran population is male dominated presents the limitation. c) Check whether results vary when cleaning strategies differ; for example, missingness may be augmented when using an additional data source versus using default values. |
| 4. Mitigate missingness by imputation or other suitable strategies | a) Allow multiple sources or other strategies to address missing data. Running analyses with missingness carries the risk of underreporting or misrepresentation of facts. |
| 5. Update metadata for new knowledge or edits on existing data | a) If enhancements are made to existing processes, include them in metadata for consistency. b) Do the same for identifiable limitations. |
| 6. Share code for reproducibility | a) Share code publicly where possible, or within the team. b) Provide feedback on shared code when it is reused. |
| 7. Include data preparation strategies in publications and publicly available spaces | a) Transparency improves when published results include behind-the-scenes methods. b) Review existing literature when deciding on methodology. c) Data preparation and method details for this study were shared on the Centralized Interactive Phenomics Resource (CIPHER) [15]. |
From this study, the steps taken to reduce bias and promote fairness in data that are foundational to AI algorithms can be summarized as follows: 1) diversify research team expertise and review existing data preparation methodologies for your research question; 2) standardize and validate cleaning strategies before applying them in research; 3) validate cleaning strategies against other cohorts and with experts, and make representation of subgroups as diverse as possible within the scope of the research question; 4) address missing data with imputation techniques suitable for the study design; 5) update metadata and code to reflect enhancements; 6) share code for transparency and reproducibility; and 7) include data preparation strategies in publications. This is an iterative process, and every step feeds into the others. Table 4 includes additional details that may be useful at each stage of the data preparation process. Examples of code used for implementing some of these steps can be found at the code-sharing link in the appendix.
5. CONCLUSION
One of the main goals of practicing algorithmic fairness is to reduce bias while applying methods that promote equity. Bias is multidimensional and operates at many levels; this study showcased bias reduction methods at the methodologic level, while factors operating at a social or institutional level require input at a systemic level. The roadmap of data for the application of AI involves applying specialized techniques to develop a framework that promotes transparency and improves efficiency. Building AI-ready data frameworks, and establishing the informatics processes to do so, can help promote equity in decision-making and healthcare research. Metrics to measure fairness are not standardized and may vary depending on institutional policies; methodologies like the ones discussed in this study can contribute to defining these metrics by improving measures like transparency and reproducibility. There has been recent focus on machine learning algorithms, but not enough dialogue around the preparatory work that goes into building the foundational data for these algorithms. Some studies have explored factors such as demographics and healthcare utilization, but additional work is needed on specific methodological strategies data experts can apply to reduce bias at a foundational level. For future work, we propose keeping the provenance of data as a guiding factor when deciding on methodologies for data curation. Fairness analysis can be part of the data cleaning process by studying trends and frequencies of diversity in sample populations.
Algorithmic discrimination in healthcare data is one part of a larger systemic issue in healthcare, research, the social practice of medicine, and the application of data. Guidelines for methods and best practices in each of these areas will help facilitate fair implementation in the others. We encourage readers to consider spaces within their own domains where fairness and bias reduction actions can be applied and shared with others in similar roles and responsibilities.
ACKNOWLEDGEMENTS
The authors acknowledge the Veterans without whom the data at the VA would not be available for this research. The views and opinions expressed in this manuscript are those of the authors and do not necessarily represent those of the Department of Veterans Affairs or the United States government.
FUNDING STATEMENT
This work was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) R01 AR078715; and National Institute on Alcohol Abuse and Alcoholism grants: U10 AA013566, U01 AA020790, and U24 AA020794.
Appendix
List of complete ICD codes and additional algorithm components available at https://phenomics.va.ornl.gov/web/
Table 1A:
Performance of classifiers on radiology report test set
| Regularization | AUC | F1 | PPV | Sensitivity | Mutual information | Size-adjusted cost |
|---|---|---|---|---|---|---|
| L2 | 92.53 | 87.94 | 90.51 | 85.52 | | |
| L2 | 95.28 | 91.03 | 91.03 | 91.03 | X | |
| L2 | 97.04 | 93.52 | 92.57 | 94.48 | X | |
| L2 | 97.04 | 93.52 | 92.57 | 94.48 | X | X |
| SCAD | 97.04 | 93.52 | 92.57 | 94.48 | | |

AUC = area under curve; PPV = positive predictive value; SCAD = smoothly clipped absolute deviation.
Table 2A:
Performance of the regular expressions at multiple rounds of annotation
| Dataset | Size of context | Round | # Snippets | # Annotations | # Errors | Accuracy |
|---|---|---|---|---|---|---|
| Training | 30 words left & right of the keyword | 1 | 67 | 75 | 3 | 96% |
| | | 2 | 100 | 114 | 4 | 96% |
| | | 3 | 80 | 103 | 16 | 85% |
| | | 4 | 171 | 201 | 24 | 88% |
| | | 5 | 171 | 205 | 27 | 87% |
| | 50 words left & right of the keyword | 6 | 100 | 125 | 10 | 92% |
| | | 7 | 100 | 127 | 9 | 93% |
| | | 8 | 100 | 133 | 5 | 96% |
| Testing | 50 words | - | 200 | 263 | 4 | 98% |
Table 3A.
Examples of how T-scores appeared with sentence patterns
Figure 1A:
Most predictive features. Parentheses around a word, e.g., “(fall)”, indicate a Unified Medical Language System concept.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
COMPETING INTERESTS STATEMENT
The authors have no conflicts or competing interests in the preparation of this manuscript.
ETHICS STATEMENT
The study was approved by the Internal Review Boards (IRB) for the Department of Veterans Affairs (VA) and Yale University. The IRB numbers for VACS are: VA IRB: AJ0001, VA IRB net – 1583210, Yale IRB – 0309025943.
DATA AVAILABILITY STATEMENT
Due to VA regulations and our ethics agreements, the analytic data sets used for this study are not permitted to leave the VA firewall without a Data Use Agreement. This limitation is consistent with other studies based on VA data.
REFERENCES
- [1] Rentsch CT, Kidwai-Khan F, Tate JP, Park LS, King JT, Skanderson M, Hauser RG, Schultze A, Jarvis CI, Holodniy M, Lo Re V, Akgun KM, Crothers K, Taddei TH, Freiberg MS, Justice AC (2020). Patterns of COVID-19 testing and mortality by race and ethnicity among United States veterans: A nationwide cohort study. PLOS Medicine, 17(9), e1003379. 10.1371/journal.pmed.1003379
- [2] Hannan MT, Felson DT, Dawson-Hughes B, Tucker KL, Cupples LA, Wilson PW, Kiel DP (2010). Risk factors for longitudinal bone loss in elderly men and women: The Framingham Osteoporosis Study. Journal of Bone and Mineral Research, 15(4), 710–720. 10.1359/jbmr.2000.15.4.710
- [3] Solomonides A, Koski E, Atabaki SM, Weinberg SS, McGreevey JD, Kannry JL, Petersen C, Lehmann CU (2021). Defining AMIA's artificial intelligence principles. Journal of the American Medical Informatics Association, 29(4), 585–591. 10.1093/jamia/ocac006
- [4] Obermeyer Z, Powers B, Vogeli C, Mullainathan S (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. 10.1126/science.aax2342
- [5] Ochoa-Allemant P, Tate JP, Williams EC, Gordon KS, Marconi VC, Bensley KM, … Justice AC (2023). Enhanced identification of Hispanic ethnicity using clinical data: A study in the largest integrated United States health care system. Medical Care, 61(4), 200–205; McGinnis KA, et al. (2011). Validating smoking data from the Veteran's Affairs Health Factors dataset, an electronic data source. Nicotine & Tobacco Research, 13(12), 1233–1239.
- [6] Feder S, Redeker NS, Jeon S, Schulman-Green D, Womack JA, Tate JP, Bedimo R, Budoff MJ, Butt AA, Crothers K, Akgün KM (2017). Validation of the ICD-9 diagnostic code for palliative care in patients hospitalized with heart failure within the Veterans Health Administration. American Journal of Hospice and Palliative Medicine, 35(7), 959–965. 10.1177/1049909117747519
- [7] Velarde K, Romesser J, Johnson MR, Clegg DO, Efimova O, Oostema SJ, Scehnet JS, DuVall SL, Huang GD (2018). An initiative using informatics to facilitate clinical research planning and recruitment in the VA health care system. Contemporary Clinical Trials Communications, 11, 107–112. 10.1016/j.conctc.2018.07.001
- [8] Bates J, Fodeh SJ, Brandt C, Womack JA (2015). Classification of radiology reports for falls in an HIV study cohort. Journal of the American Medical Informatics Association, 23(e1), e113–e117. 10.1093/jamia/ocv155
- [9] Johnson K, Kamineni A, Fuller S, Olmstead D, Wernli KJ (2014). How the provenance of electronic health record data matters for research: A case example using system mapping. EGEMS, 2(1), 4. 10.13063/2327-9214.1058
- [10] Savova G, Masanz J, Ogren PV, Zheng J, Sohn S, Kipper-Schuler K, Chute CG (2010). Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation, and applications. Journal of the American Medical Informatics Association, 17(5), 507–513. 10.1136/jamia.2009.001560
- [11] Paulus JK, Kent DM (2020). Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digital Medicine, 3(1). 10.1038/s41746-020-0304-9
- [12] Khor S, Haupt EC, Hahn EE, Lyons LJ, Shankaran V, Bansal A (2023). Racial and ethnic bias in risk prediction models for colorectal cancer recurrence when race and ethnicity are omitted as predictors. JAMA Network Open, 6(6), e2318495. 10.1001/jamanetworkopen.2023.18495
- [13] Straw I, Wu H (2022). Investigating for bias in healthcare algorithms: a sex-stratified analysis of supervised machine learning models in liver disease prediction. BMJ Health & Care Informatics, 29(1), e100457. 10.1136/bmjhci-2021-100457
- [14] Jöhnk J, Weißert M, Wyrtki K (2020). Ready or not, AI comes: An interview study of organizational AI readiness factors. Business & Information Systems Engineering, 63(1), 5–20. 10.1007/s12599-020-00676-7
- [15] Honerlaw J, Ho YL, Fontin F, Gosian J, Maripuri M, Murray M, … Cho K (2023). Framework of the Centralized Interactive Phenomics Resource (CIPHER) standard for electronic health data-based phenomics knowledgebase. Journal of the American Medical Informatics Association, 30(5), 958–964.
- [16] Castelnovo A, Crupi R, Greco G, Regoli D, Penco IG, Cosentini AC (2022). A clarification of the nuances in the fairness metrics landscape. Scientific Reports, 12(1), 4209. 10.1038/s41598-022-07939-1