Abstract
Health disparities and solutions are heterogeneous within and among racial and ethnic groups, yet existing administrative databases lack the granularity to reflect important sociocultural distinctions. We measured the efficacy of a natural-language–processing algorithm to identify a specific immigrant group. The algorithm demonstrated accuracy and precision in identifying Somali patients from the electronic medical records at a single institution. This technology holds promise to identify and track immigrants and refugees in the United States in local health care settings.
Characterizing and closing the gap of racial and ethnic health disparities is a national priority,1(p4) but disparities and solutions are heterogeneous for different groups.2 For example, a specific health-related assessment and intervention may take very different forms when applied to a Somali American community than to an ancestral African American community. This example reveals an important limitation of health disparities research: existing regional and national databases lack the granularity to reflect this sociocultural heterogeneity. Therefore, assessment of disease prevalence and intervention impact is compromised by the labeling of both communities in our example as African American in existing databases. Adding this texture to administrative databases has been recommended, but implementation is costly and many years away.3
Natural-language processing (NLP) holds the potential to bypass these limitations. NLP is an informatics discipline that allows computers to process and understand human languages. Application of NLP to the health care arena is an active area of research with escalating opportunity for impact in the context of a national mandate to expand electronic medical record (EMR) infrastructure. A recent demonstration project showed that NLP review of a health care system EMR outperformed administrative databases in documenting postoperative complications.4
We tested the hypothesis that application of NLP to EMRs can identify a subset racial/ethnic group for the purposes of eventually documenting and tracking health disparities. Persons from Somalia compose the largest African refugee population in the United States, with a particular concentration in Minnesota. Furthermore, data support the existence of health care disparities among this population.5,6 Therefore, we designed our NLP tool to identify this population.
METHODS
We conducted our study at a large academic medical center in the midwestern United States that serves a relatively large regional Somali population, Mayo Clinic in Rochester, Minnesota. For Somali cohort identification, we used a tool with proven effectiveness in finding specific, customized clinical terms for high-throughput phenotyping.7 This is a rule-based NLP algorithm in which it is possible to encode a customized dictionary and search the unstructured text of EMRs for inclusion and exclusion criteria. We reconfigured the tool for cohort identification so that descriptive terms such as “Somali” and “refugee” (and their variants) ruled in patients, and other terms ruled them out. We constructed the algorithm for local demographic context as follows:
Somali OR Somalian OR Somalia OR Immigrant OR Refugee NOT Spanish NOT Hispanic NOT Latino NOT Latina NOT Mexican NOT Mexico NOT Cambodian NOT Cambodia NOT Vietnamese NOT Vietnam NOT Sudanese NOT Sudan.
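The Boolean expression above can be read as "at least one inclusion term and no exclusion term." The following Python sketch illustrates that reading and is not the tool used in the study, whose internals are not described here. The names `compile_terms` and `is_flagged`, the case-insensitive whole-word matching, and the choice to evaluate one block of note text at a time are all assumptions made for illustration.

```python
import re

# Terms copied from the Boolean expression above.
INCLUDE_TERMS = ["Somali", "Somalian", "Somalia", "Immigrant", "Refugee"]
EXCLUDE_TERMS = [
    "Spanish", "Hispanic", "Latino", "Latina", "Mexican", "Mexico",
    "Cambodian", "Cambodia", "Vietnamese", "Vietnam", "Sudanese", "Sudan",
]


def compile_terms(terms):
    """Build one case-insensitive, whole-word pattern for a list of terms.

    Assumption: whole-word matching; the study does not state how the
    tool tokenizes text or handles plurals and other variants.
    """
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


INCLUDE_RE = compile_terms(INCLUDE_TERMS)
EXCLUDE_RE = compile_terms(EXCLUDE_TERMS)


def is_flagged(note_text):
    """Return True if any inclusion term appears and no exclusion term does."""
    has_inclusion = INCLUDE_RE.search(note_text) is not None
    has_exclusion = EXCLUDE_RE.search(note_text) is not None
    return has_inclusion and not has_exclusion


if __name__ == "__main__":
    # Hypothetical note fragments, invented for illustration only.
    print(is_flagged("Patient is a Somali refugee."))
    print(is_flagged("Somali interpreter present; spouse speaks Spanish."))
```

Under this reading, a single exclusion term anywhere in the text overrides every inclusion term, so a record that mentions "Somali" can still be ruled out.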
We applied the algorithm to a set of all patients aged 18 years and older who were seen in the outpatient primary care clinics in the divisions of Primary Care Internal Medicine and Family Medicine during a 15-day period in March 2011. A single clinician with experience caring for Somali patients manually reviewed all charts to identify patients as Somali or not Somali. First and last names were used to identify possible Somali patients; EMRs of these patients were then reviewed for direct or indirect documentation of Somali ancestry.
We used the results of this manual chart review as a gold standard for evaluating the efficacy of the algorithm for identifying Somali patients. We calculated sensitivity, specificity, positive predictive value, and negative predictive value for the algorithm.
RESULTS
We identified 5782 patients during the study interval; the NLP algorithm identified 122 of these patients as Somali. Compared with manual identification, the algorithm demonstrated sensitivity of 92.2%, specificity of 99.9%, positive predictive value of 97.5%, and negative predictive value of 99.8%.
Error analysis showed that the EMR for each of the 10 false negatives contained the term Somali, but the algorithm ruled the patient out because of other factors.
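The reported figures can be checked against each other by reconstructing the underlying 2×2 table. The sketch below takes the 5782 patients, the 122 flagged patients, and the 10 false negatives from the text. The count of 3 false positives is not stated in the article; it is an assumption inferred from the positive predictive value of 97.5% among 122 flagged patients. The function name `diagnostic_metrics` is illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return the four standard accuracy measures for a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all Somali patients
        "specificity": tn / (tn + fp),  # true negatives among all non-Somali patients
        "ppv": tp / (tp + fp),          # true positives among flagged patients
        "npv": tn / (tn + fn),          # true negatives among unflagged patients
    }


TOTAL = 5782    # patients seen during the study interval (reported)
FLAGGED = 122   # patients the algorithm identified as Somali (reported)
FN = 10         # false negatives (reported in the error analysis)
FP = 3          # ASSUMPTION: inferred from PPV of 97.5% among 122 flagged patients

TP = FLAGGED - FP            # 119 true positives
TN = TOTAL - TP - FP - FN    # 5650 true negatives

metrics = diagnostic_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts, the four measures round to the reported 92.2%, 99.9%, 97.5%, and 99.8%, and the manual review would have identified 129 Somali patients in total.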
DISCUSSION
In this demonstration project, an NLP algorithm identified patients from a subset immigrant group with high sensitivity (92.2%) and positive predictive value (97.5%). Because this was a single-center study, the findings may not generalize to other institutions or populations.
Future research should work to develop and disseminate a generalized NLP cohort identification tool for use in identifying patients from other vulnerable populations. It will be important that these tools have the ability to interact with users to create customizable cohorts that incorporate regional knowledge of populations. For example, exclusionary terms used in our algorithm were informed by local demographic data. This technology holds promise to identify and track immigrants and refugees in the United States at a local health care level, paving the way for improved patient care and reduction of health disparities.
Acknowledgements
This work was funded in part by Mayo Clinic's Center for Individualized Medicine and the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data (US Department of Health and Human Services grant 90TR000201).
Human Participant Protection
The study procedures were approved by the institutional review board at Mayo Clinic.
References
- 1. Institute of Medicine. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. Washington, DC: National Academies Press; 2003
- 2. Sue S, Dhindsa MK. Ethnic and health disparities research: issues and problems. Health Educ Behav. 2006;33(4):459–469
- 3. Desai J. State-based diabetes surveillance among minority populations. Prev Chronic Dis. 2004;1(2):A03
- 4. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA. 2011;306(8):848–855
- 5. Wieland ML, Morrison TB, Cha SS, Rahman AS, Chaudhry R. Diabetes care among Somali immigrants and refugees. J Community Health. 2012;37(3):680–684
- 6. Morrison TB, Wieland ML, Cha SS, Rahman AS, Chaudhry R. Disparities in preventive health services among Somali immigrants and refugees. J Immigr Minor Health. 2012;14(6):968–974
- 7. Savova GK, Fan J, Ye Z, et al. Discovering peripheral arterial disease cases from radiology notes using natural language processing. In: AMIA Annual Symposium Proceedings. Bethesda, MD: American Medical Informatics Association; 2010:722–726