Abstract
For evidence from observational studies to be reliable, researchers must ensure that the patient populations of interest are accurately defined. However, disease definitions can be extremely difficult to standardize and implement consistently across different datasets and study requirements. Researchers must also ensure that populations are represented fairly, so that results accurately reflect each population's demographic dynamics and are not overgeneralized to populations to which they do not apply. In this work, we present a generalized tool to assess the fairness of disease definitions by evaluating their implementations against common fairness metrics. Our approach calculates these metrics and provides a robust method to examine both coarse and strongly intersecting populations across many characteristics. We highlight workflows for working with disease definitions, provide an example analysis using an OMOP CDM patient database, and discuss potential directions for future improvement and research.
Introduction
Observational health research uses data collected by healthcare providers (e.g. hospital systems, clinics, etc.) to conduct retrospective, population-level analyses of patient populations. [2] These data, commonly referred to as "Real-World Data" (RWD) [3], relate to patient health status and/or the delivery of health care and are routinely collected from a variety of sources, including electronic health records, medical claims and billing activities, and product and disease registries. Evidence generated from analyses of RWD is referred to as "Real-World Evidence" (RWE) and can be used to evaluate the effectiveness and safety of medical treatments, interventions, and healthcare practices in everyday clinical settings, outside of controlled clinical trials. [3]
Within observational health research, several communities of practice have emerged dedicated to making the best use of these data. [4, 5, 6] One community that has emerged over the past decade is the Observational Health Data Sciences and Informatics (OHDSI) open science collaborative. [7] This collective consists of hundreds of researchers with expertise in clinical sciences, health informatics, healthcare economics, and many other domains where observational health has found application. One technology that emerged from the OHDSI collaborative is the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM).
The OMOP CDM is not a specific technology or tool but rather a description of how data engineering teams can transform real world data into a common database schema. Additionally, a variety of commonly used analytic research tools have been developed under the Health Analytics Data to Evidence Suite (HADES) to analyze patient data within OMOP CDM-formatted databases. An informal interface common among these tools is the notion of "phenotype definitions," which rigorously and reproducibly specify the patient populations researchers are investigating and the specific conditions under which a patient cohort of interest is formed. [8] As an example, a phenotype definition could define a cohort of patients who had a previous diagnosis of type-2 diabetes and then went on to have an occurrence of myocardial infarction.
While the usage of phenotype definitions in observational health research has become ubiquitous and continues to grow, an oft-overlooked aspect of these definitions is how fairly they represent the patient populations they aim to capture. Here, a phenotype definition is considered fair when it is constructed to maximize accuracy in representing its intended patient cohort across patient diagnostics and demographics. An unfair phenotype definition could fail to account for how diseases present differently across demographics, such as gender [9], leading to potentially inaccurate results concerning certain patient cohorts.
To illustrate the potential impact of varying phenotype definitions on fairness, in the paper "Assessing Phenotype Definitions for Algorithmic Fairness" [10], Sun et al. show how three different phenotype definitions for type-2 diabetes can capture three distinctly different patient populations. Examining the captured populations across single demographic features (such as gender or race), Sun et al. showed that fairness metrics (e.g. demographic parity, equality of opportunity, etc.) can differ between them. Although one might assume that published definitions for type-2 diabetes could be used interchangeably, Sun et al. demonstrated that subtle differences across definitions lead to differently captured populations. As a result, if one does not carefully assess the populations captured by phenotype definitions across observational health databases, even when using the same definitions, there is a risk of overgeneralizing or comparing research outcomes to populations who are not fairly represented in such studies. To ameliorate this risk, this paper builds upon the work of Sun et al. and introduces a generalized tool and associated workflows to evaluate the fairness of disease phenotype definitions across common metrics, improving both the reproducibility and the generalizability of large-scale observational health research.
Methods
For the work described in this paper, we used data provided by IQVIA in the IQVIA Pharmetrics Plus® dataset. This dataset provided medical histories of 34,808,145 unique patients from the years 2017 to 2022. It consists of closed, adjudicated US medical and pharmacy claims made available in the OMOP CDM format and includes patient demographics, diagnosis and procedure histories, medication usage, and associated healthcare costs.
Following this, we generated three separate cohorts against the dataset using three type-2 diabetes phenotype definitions: LEGEND [11], Miller [12], and a forthcoming definition for type 2 diabetes from the OHDSI PhenotypeLibrary (referred to as "PL 957" within figures and tables) [13]. These were the same phenotype definitions used in "Assessing Phenotype Definitions for Algorithmic Fairness," with PL 957 substituted for the PheKB definition [14]. Although PL 957 was not yet clinically validated at the time of this writing, it was chosen intentionally to illustrate that, even without clinical review, one can develop a phenotype definition that can be iteratively improved toward a clinically validated one, while allowing a researcher to clearly state the differences or restrictions imposed by their particular definition and thereby explain more comprehensively why it differs from other definitions.
We evaluated the fairness of each cohort generated by a phenotype definition using the three metrics described in Sun et al.'s paper: demographic parity, equality of opportunity, and predictive rate parity. To summarize: 1) demographic parity is the difference in the proportion of each protected class that receives the positive and negative outcome; 2) equality of opportunity measures whether the true positive rates of a model (here, a disease definition) are equal across demographic groups; and 3) predictive rate parity compares predicted labeling accuracy across classes.
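To make these definitions concrete, the following is a minimal sketch of the three metrics computed from per-group counts. It is written in Python for illustration (the tool described in this paper is implemented in Julia), and all names and numbers are hypothetical, not taken from the study:

```python
# Hedged sketch: the three fairness metrics over per-group counts.
# "captured" = patients the phenotype definition places in the cohort,
# "actual" = patients truly having the condition (e.g. the silver standard).
# All numbers below are hypothetical.

def demographic_parity(captured, total):
    """Proportion of a demographic group captured by the definition."""
    return captured / total

def equality_of_opportunity(true_positives, actual_positives):
    """True positive rate: truly-ill patients captured over all truly-ill."""
    return true_positives / actual_positives

def predictive_rate_parity(true_positives, captured):
    """Precision: truly-ill patients captured over all patients captured."""
    return true_positives / captured

groups = {
    "female": dict(captured=120, total=1000, tp=90, actual=100),
    "male":   dict(captured=80,  total=1000, tp=60, actual=100),
}

# A definition looks "fair" under a metric when the per-group values are close.
for name, g in groups.items():
    print(name,
          demographic_parity(g["captured"], g["total"]),
          equality_of_opportunity(g["tp"], g["actual"]),
          predictive_rate_parity(g["tp"], g["captured"]))
```

In this toy example, predictive rate parity is equal across groups (0.75 each) while equality of opportunity differs (0.9 vs. 0.6), illustrating why a definition can look fair under one metric and unfair under another.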
To calculate each of these metrics, we first had to determine a standard to compare populations against. The "gold standard" within observational health research is to harmonize secondary use patient data against actual patient outcomes. However, as the dataset we used did not include outcomes, we adopted Sun et al.'s approach of defining a "silver standard" for calculating fairness metrics. To calculate the "silver population," a majority algorithm searches for patients who appear, across the cohorts generated from the phenotype definitions, in more than half of the phenotype definitions studied at once. For example, if a researcher is comparing 5 phenotype definitions together and a patient appears in 3 or more of the associated cohorts, that patient is counted in the "silver population."
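The majority rule above can be sketched in a few lines. This is an illustrative Python rendering (the actual tool is written in Julia), with hypothetical patient IDs:

```python
# Minimal sketch of the "silver standard" majority algorithm: a patient
# joins the silver population when they appear in more than half of the
# cohorts being compared (e.g. 3 or more of 5). Patient IDs are hypothetical.

def silver_standard(cohorts):
    """cohorts: a list of sets of patient IDs, one set per phenotype definition."""
    threshold = len(cohorts) / 2  # strict majority of the definitions
    counts = {}
    for cohort in cohorts:
        for patient in cohort:
            counts[patient] = counts.get(patient, 0) + 1
    return {patient for patient, n in counts.items() if n > threshold}

# Five hypothetical cohorts: patient 1 appears in 3 of 5 and is included.
cohorts = [{1, 2}, {1, 3}, {1, 4}, {2, 4}, {5}]
print(sorted(silver_standard(cohorts)))  # prints [1]
```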
Once this silver standard was computed, the remaining metrics were implemented in a straightforward manner across cohorts. For demographic parity, cohort subjects were grouped by demographic variables, counted, and then divided by the number of patients across the database belonging to each respective demographic grouping. For equality of opportunity and predictive rate parity, the same algorithm used to compute the silver standard was used to determine all subjects considered to have type-2 diabetes. Equality of opportunity was computed by grouping cohort subjects by demographic variables, removing subjects who did not appear in the silver standard population, counting the remainder, and dividing by the number of patients across the database belonging to each respective demographic grouping. Predictive rate parity is computed in a near-identical manner to equality of opportunity but divides by the number of patients belonging to both the cohort being examined and each respective demographic grouping.
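The per-group computations described above can be sketched as follows. This Python rendering mirrors the textual description (the tool itself is written in Julia), with plain sets standing in for OMOP CDM query results; all names and data are hypothetical:

```python
# Hedged sketch of the stratified metric computations described above.
# Sets of patient IDs stand in for OMOP CDM query results; "demographics"
# maps each patient ID to a demographic group label. All data are hypothetical.

def by_group(patients, demographics):
    """Partition a set of patient IDs by their demographic group."""
    groups = {}
    for p in patients:
        groups.setdefault(demographics[p], set()).add(p)
    return groups

def demographic_parity_by_group(cohort, database, demographics):
    """Per group: cohort members divided by all database patients in the group."""
    db = by_group(database, demographics)
    captured = by_group(cohort, demographics)
    return {g: len(captured.get(g, set())) / len(m) for g, m in db.items()}

def equality_of_opportunity_by_group(cohort, silver, database, demographics):
    """Per group: cohort members also in the silver standard, divided by the
    database patients in the group (following the description in the text)."""
    db = by_group(database, demographics)
    hits = by_group(cohort & silver, demographics)
    return {g: len(hits.get(g, set())) / len(m) for g, m in db.items()}

def predictive_rate_parity_by_group(cohort, silver, demographics):
    """Per group: cohort members also in the silver standard, divided by the
    cohort members in the group."""
    captured = by_group(cohort, demographics)
    hits = by_group(cohort & silver, demographics)
    return {g: len(hits.get(g, set())) / len(m) for g, m in captured.items()}
```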
Finally, we computed these fairness metrics for the cohorts generated by the three phenotype definitions. We selected the covariates of gender, age groups bucketed into 10-year intervals, and combinations of gender and age group following the same bucketing scheme. Due to limitations of the IQVIA Pharmetrics Plus® dataset, a patient's race was not available as a potential covariate for this replication study.
Results
First, we characterized the IQVIA Pharmetrics Plus® database, Figure 1 (additional summary statistics for this database characterization can be found in Appendix A). After this assessment, we similarly summarized the patients captured by each of our 3 cohorts. We examined the intersections of the cohorts with respect to patient inclusion, including stratification across gender and age groups, in Figure 2 (for more details and summary statistics for the unique patients across all cohorts, please see Appendix B). Next, we calculated demographic parity and equality of opportunity across the cohorts (predictive rate parity was excluded from this review as it is not a commonly used metric in the health equity and fairness space [10]).
Initially, demographic parity and equality of opportunity were calculated across all cohorts by a single variable (age group), shown in Figure 3. From there, intersectional populations were explored by stratifying patient cohorts by their definition, gender, and age group (again, bucketed in 10-year increments). [15] After this stratification, demographic parity was calculated for each cohort in Figure 4 and equality of opportunity in Figure 5.
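The intersectional stratification amounts to crossing each demographic variable. A small illustrative Python sketch (the tool itself is Julia-based, and all data here are hypothetical) of the 10-year age bucketing and the gender-by-age strata:

```python
# Sketch of intersectional stratification: patients are bucketed into
# 10-year age groups and crossed with gender to form the sub-populations
# that each fairness metric is computed over. All data are hypothetical.

def age_group(age, width=10):
    """Map an age to its 10-year bucket label, e.g. 37 -> '30 - 39'."""
    lo = (age // width) * width
    return f"{lo} - {lo + width - 1}"

def intersectional_strata(patients):
    """patients: dict of id -> (gender, age). Returns (gender, age group) -> ids."""
    strata = {}
    for pid, (gender, age) in patients.items():
        strata.setdefault((gender, age_group(age)), set()).add(pid)
    return strata

patients = {1: ("Female", 37), 2: ("Female", 33), 3: ("Male", 64)}
for stratum, ids in sorted(intersectional_strata(patients).items()):
    print(stratum, sorted(ids))
```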
Discussion
Monolithic vs. Intersectional Population Fairness Assessment. As shown in the results of this paper, tremendous differences can exist between phenotype definitions, even when they are defined for the same disease. Of the 3 phenotype definitions, the clinically validated Miller and LEGEND definitions had similar fairness metrics in Figure 4 and Figure 5, while the under-development definition PL 957 clearly demonstrated the potential risk of not accounting for differences between disease definitions. Using the tool, cohorts could be compared across monolithic stratifications (i.e. one variable, as in Figure 3) and across several covariates (as in Figure 4 and Figure 5) to identify where a phenotype definition may become unfair. The calculated fairness metrics help researchers decide how to construct and compare their definitions, account for potentially unfair behavior in a phenotype definition, and raise questions about the underlying populations within their OMOP CDM. They also better equip researchers not only to consider what trade-offs to make in a research program but also to explain what trade-offs they made and how those could affect their results.
Workflows with Fairness Tool. Although workflows can vary significantly depending on a researcher's requirements, the general phenotype development process is collaborative and iterative [8]. One potential workflow arises when a researcher wants to assess the fairness of a single cohort. This cohort could have been created via a phenotype definition co-developed with their clinician team or via an early prototype of a definition under development. Using the generated cohort within their OMOP CDM database, the fairness tool can quickly assess a battery of fairness metrics on the patient cohort and report these metrics alongside other statistics of interest.
While a researcher may want to explore a single phenotype definition, it can be more productive to compare a variety of previously clinically validated phenotype definitions to determine whether a pre-defined definition can be used for their work. In this second workflow, a researcher takes a variety of different phenotype definitions and generates their corresponding cohorts within an OMOP CDM database. Using the tool, the cohorts can be compared together, and a silver standard is computed to support generating fairness metrics across all cohorts under comparison. At this stage, a researcher can compare their definition against the others to see how each represents multiple populations, and either select or modify the definition that best matches their needs and requirements while promoting fairness. For example, as can be seen in Figure 3a and Figure 3b, PL 957 is a significant outlier against the other generated cohorts. This could indicate that further adjustment is needed for that particular "in development" phenotype definition for type 2 diabetes and can thus drive additional development.
Finally, across both workflows, researchers can stratify their cohorts across an arbitrary depth of covariates (the code samples within Appendix C, Listing 1 show how the subfigures for Figure 4 were created). Using the tool, one can add a variety of covariates as needed and build measurements for correspondingly more specific populations (an example can be seen in Appendix C, Listing 2, where a custom cofactor function was written using FunSQL.jl [a SQL domain-specific language in the Julia programming language] to stratify along a person's state [16]). This ability to create increasingly niche sub-populations comes at the cost of generalizability of the results but offers better insight into the specific populations of interest [17]. With any of these workflows, it is up to the researcher using this tool to assess the approach chosen and to validate the results to ensure they are congruent with expectations.
Potential Limitations. While such a tool can explore fairness across an array of population factors, there are some significant limitations to its use. For example, for the specific analysis conducted on the IQVIA Pharmetrics Plus® dataset, patient race information was not made available. As a result, fairness assessments could only be conducted with respect to the information present. This could lead to overstating the fairness of population studies in the future if one does not explicitly state across which factors populations are deemed fairly represented. Furthermore, while demographic parity and equality of opportunity are standard metrics across the fairness literature, they are not the only metrics for assessing fairness. If used as the only measures of fairness, a study risks oversimplifying the harmful dynamics present in an analysis and failing to capture where disparities may still exist.
Potential Future Work. Although being able to rapidly assess the fairness of phenotype definitions is a great first step toward more equitable and fair phenotype definitions when investigating disease, there is still much that can be done. For example, where the underlying OMOP CDM database has populations that skew heavily toward one particular characteristic or combination of covariates, the fairness tool could provide support to ameliorate such imbalances by offering class-balancing algorithms or methods to correctly interpret the "true" fairness of a phenotype definition. Additionally, further fairness metrics could be implemented and added to the tool to enable automatic reporting of a regimen of fairness metrics across one or many phenotype definitions. This would help avoid accidental oversimplification when determining the fairness of a study's findings. Finally, a more ambitious goal for the fairness tool is incorporation into standards of practice within the observational health research community, becoming a means to audit fairness across a wide variety of observational health research where fairness and equity research may not yet be commonplace.
Conclusion
This paper introduced a novel tool for assessing phenotype fairness that builds upon the work done within "Assessing Phenotype Definitions for Algorithmic Fairness". As demonstrated, this tool can be used to assess phenotype definitions in isolation, compare several phenotype definitions together, and support analysis of fairness across niche cohort sub-populations. In conclusion, while the work presented here is a positive step toward ensuring fairness and equitable representation of patients in observational health research, there remain many opportunities for future work in this space.
Acknowledgements
The authors would like to thank Dr. Tony Sun for assisting in code review and providing some initial comments, the OHDSI Lab at Northeastern University’s Roux Institute, Drs. Jaan Altosaar Li and Shenita Freeman for discussing ideas of fairness within observational health research, and the Julia community for their support.
Appendices
Appendix A Patient Count Summaries across IQVIA Pharmetrics Plus®
Table 1:
Summary Counts
Gender | Total Count |
---|---|
Female | 18,031,369 |
Male | 16,775,478 |

Age Group | Total Count |
---|---|
0 - 9 | 1,537,780 |
10 - 19 | 4,056,358 |
20 - 29 | 4,694,506 |
30 - 39 | 5,713,845 |
40 - 49 | 4,808,294 |
50 - 59 | 4,678,632 |
60 - 69 | 5,000,587 |
70 - 79 | 2,769,352 |
80 - 89 | 1,547,493 |

Total Count: | 34,806,847 |
Appendix B Patient Breakdowns and Summaries across Generated Cohorts
Table 2:
Unique patients across cohorts breakdown by gender and age group.
Age Group | Gender | Count |
---|---|---|
0 - 9 | Female | 425 |
10 - 19 | Female | 3,608 |
20 - 29 | Female | 14,679 |
30 - 39 | Female | 46,331 |
40 - 49 | Female | 86,087 |
50 - 59 | Female | 169,225 |
60 - 69 | Female | 303,104 |
70 - 79 | Female | 265,915 |
80 - 89 | Female | 210,854 |
0 - 9 | Male | 399 |
10 - 19 | Male | 3,434 |
20 - 29 | Male | 7,873 |
30 - 39 | Male | 23,020 |
40 - 49 | Male | 76,441 |
50 - 59 | Male | 189,233 |
60 - 69 | Male | 329,464 |
70 - 79 | Male | 275,249 |
80 - 89 | Male | 191,729 |
Table 3:
Summary counts of unique patients across generated cohorts.
Gender | Total Count |
---|---|
Female | 1,100,228 |
Male | 1,096,842 |

Age Group | Total Count |
---|---|
0 - 9 | 824 |
10 - 19 | 7,042 |
20 - 29 | 22,552 |
30 - 39 | 69,351 |
40 - 49 | 162,528 |
50 - 59 | 358,458 |
60 - 69 | 632,568 |
70 - 79 | 541,164 |
80 - 89 | 402,583 |

Total Count: | 2,197,070 |
Appendix C Code Examples
Listing 1:
Code to calculate demographic parity of 3 cohorts stratified by age group and gender.
Listing 2:
Example showing how to define custom covariate function to get a patient’s home state and then calculating demographic parity across cohorts stratified by state.
Footnotes
All code used to generate figures and perform analyses is available at the following GitHub repository: https://github.com/TheCedarPrince/PhenoFairPaper. As we are unable to share the actual data used for our analysis, this repository includes analyses performed on a similar but synthetic dataset generated by Synthea [1]. The software package discussed in this paper is available here: https://github.com/JuliaHealth/OMOPCDMMetrics.
Abbreviated often as the “OMOP CDM”, “OMOP”, or even the “CDM” when the context is clear; we will use these terms interchangeably in this paper to refer to this data model.
For more details on the “silver standard” calculation, please see the section Approach: Assessing Fairness of Crohn’s Disease and Diabetes Phenotype Definitions in “Assessing Phenotype Definitions for Algorithmic Fairness”.
Conflict of Interest
The authors have no conflicts of interest to declare.
Figures & Table
Figure 1:
Counts of the IQVIA Pharmetrics Plus® database broken down by gender and age group. As a brief assessment, a minor characterization of the database was carried out. Figure 1a gives the raw counts of these populations stratified by gender and age in 10-year intervals. Figure 1b visualizes these raw counts using a grouped bar plot with age group intervals on the x-axis and patient counts on the y-axis.
Figure 2:
Patient overlaps between cohorts and patient breakdown. Figure 2a shows a heatmap indicating the extent to which patients are shared between cohorts intersecting on the matrix (e.g. at the intersection of PL 957 and LEGEND, only 24,146 patients are shared between the two cohorts generated by their respective definitions). Figure 2b shows the number of unique patients identified across the 3 cohorts stratified by gender and age group (for additional details, please see Appendix B).
Figure 3:
Fairness Metrics Assessed over One Variable. In Figure 3a, demographic parity was calculated across all cohorts per age grouping. In Figure 3b, equality of opportunity was similarly calculated across all cohorts per age grouping. To support examining the differences between the cohorts, the y-axis extends from 80% to 100%.
Figure 4:
Demographic parity calculated by gender and age group per cohort. Within these plots, each cohort is broken down into sub-populations by their respective gender and age group combinations. Then, demographic parity is calculated for each of these sub-populations.
Figure 5:
Equality of opportunity calculated by gender and age group per cohort. Within these plots, each cohort is broken down into sub-populations by their respective gender and age group combinations. Then, equality of opportunity is calculated for each of these sub-populations and reported in a similar fashion to Figure 3b, with the y-axis extending from 80% to 100%.
References
- 1.Walonoski J, et al. Journal of the American Medical Informatics Association. Vol. 25. Publisher: Oxford University Press; 2018. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record; pp. 230–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Overhage JM, Ryan PB, Reich CG, Hartzema AG, PE Stang. Journal of the American Medical Informatics Association. Vol. 19. Publisher: BMJ Group BMA House, Tavistock Square, London, WC1H 9JR; 2012. Validation of a common data model for active safety surveillance research; pp. 54–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.U.S. Food and Drug Administration Real-world evidence. 2021 Sep Available from: https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence . [Google Scholar]
- 4.Ergina PL, Barkun JS, McCulloch P, Cook JA, DG Altman. BMJ (Clinical research ed.) Vol. 346. Publisher: British Medical Journal Publishing Group; 2013. IDEAL framework for surgical innovation 2: observational studies in the exploration and assessment stages. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Mandel JC, Kreda DA, Mandl KD, Kohane IS, RB Ramoni. SMART on FHIR: a standards-based, interoperable apps platform for electronic health records. Journal of the American Medical Informatics Association. 2016 Sep;23:899–908. doi: 10.1093/jamia/ocv189. Available from: https://doi.org/10.1093/jamia/ocv189. [Accessed on: 2024 Jan 6] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bender D, Sartipi K. HL7 FHIR: An Agile and RESTful approach to healthcare information exchange. Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems. ISSN: 1063-7125. 2013 Jun:326–31. Doi: 10.1109/CBMS.2013.6627810. Available from: https://ieeexplore.ieee.org/document/6627810 [Accessed on: 2024 Jan 6] [Google Scholar]
- 7.Hripcsak G, et al. MEDINFO 2015: eHealth-enabled health. IOS Press; 2015. Observational health data sciences and informatics (OHDSI): opportunities for observational researchers; pp. 574–8. [PMC free article] [PubMed] [Google Scholar]
- 8.Zelko JS, et al. Developing a Robust Computable Phenotype Definition Workflow to Describe Health and Disease in Observational Health Research. arXiv:2304.06504 [cs] 2023 Mar Doi: 10.48550/arXiv.2304.06504. Available from: http://arxiv.org/abs/2304.06504. [Accessed on: 2023 Apr 16] [Google Scholar]
- 9.Milner KA, Funk M, Richards S, Wilmes RM, Vaccarino V, HM Krumholz. The American journal of cardiology. Vol. 84. Publisher: Elsevier; 1999. Gender differences in symptom presentation associated with coronary heart disease; pp. 396–9. [DOI] [PubMed] [Google Scholar]
- 10.Sun TY, Bhave S, Altosaar J, N Elhadad. Assessing Phenotype Definitions for Algorithmic Fairness. en. arXiv:2203.05174 [cs, q-bio] 2022 Mar. Available from: http://arxiv.org/abs/2203.05174 [Accessed on: 2022 Apr 29] [PMC free article] [PubMed] [Google Scholar]
- 11.Khera R, et al. BMJ open. Vol. 12. Publisher: British Medical Journal Publishing Group; 2022. Large-scale evidence generation and evaluation across a network of databases for type 2 diabetes mellitus (LEGEND-T2DM): a protocol for a series of multinational, real-world comparative cardiovascular effectiveness and safety studies; p. e057977. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Miller DR, Safford MM, LM Pogach. Diabetes care. Vol. 27. Publisher: Am Diabetes Assoc; 2004. Who has diabetes? Best estimates of diabetes prevalence in the Department of Veterans Affairs based on computerized patient data; pp. b10–b21. [DOI] [PubMed] [Google Scholar]
- 13.Rao G. PhenotypeLibrary: The OHDSI phenotype library. manual. 2024 [Google Scholar]
- 14.Kirby JC, et al. Journal of the American Medical Informatics Association. Vol. 23. Publisher: Oxford University Press; 2016. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability; pp. 1046–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Choo HY, Ferree MM. Sociological theory. Vol. 28. CA: Los Angeles, CA: Publisher: SAGE Publications Sage; 2010. Practicing intersectionality in sociological research: A critical analysis of inclusions, interactions, and institutions in the study of inequalities; pp. 129–49. [Google Scholar]
- 16.Simonov K, Evans CC, JS Zelko. FunSQL: Julia library for compositional construction of SQL queries. 2023 Mar. doi: 10.5281/zenodo.7705325. Available from: https://doi.org/10.5281/zenodo.7705325. [Google Scholar]
- 17.Homan P, Brown TH, B King. Structural Intersectionality as a New Direction for Health Disparities Research. en. Journal of Health and Social Behavior. 2021 Sep;62:350–70. doi: 10.1177/00221465211032947. Available from: http://journals.sagepub.com/doi/10.1177/00221465211032947 [Accessed on: 2022 Oct 21] [DOI] [PMC free article] [PubMed] [Google Scholar]