So, you want to do GI clinical research but aren’t sure how to get started? Although randomized clinical trials provide the highest level of evidence to inform clinical practice, guidelines, and policy, they may not be possible for a fellow or junior investigator to initiate, and even in the best-case scenario, may not be published for many years. Database-oriented research can overcome many of the daunting impediments that junior researchers face. Furthermore, it can jump-start a clinical research career as a vehicle to develop knowledge and experience in epidemiologic analyses and publication. It offers a way to test clinically meaningful hypotheses and, most importantly, to provide salient data that could ultimately impact clinical decision making.
What are the next steps to engaging in database research? Finding a mentor with experience in clinical and database research is critical. If necessary, mentorship can be split between a gastrointestinal (GI) clinical researcher who can help identify clinical questions and another mentor with experience in epidemiologic analyses (eg, a health services researcher, epidemiologist, or biostatistician). The next step is to identify the clinical area within GI that is of most interest to you and possibly your future clinical focus (eg, cancer screening and prevention, motility, hepatology, inflammatory bowel disease [IBD]).
Although ideally you would have a hypothesis to be tested, in database-oriented research, it can be advantageous to explore the specific database to see what hypotheses can be investigated, using an iterative approach to home in on a meaningful analysis. Because of this low barrier to entry, however, a thorough literature review is important: many of the analyses that you consider may already have been done. Thus, a willingness to be flexible is critical to success.
In this piece, we provide a synoptic review of numerous databases suited to GI-focused analyses (Table 1). The goal is not to be comprehensive but to be focused and to provide illustrative examples, with the ultimate aim of pointing an eager GI fellow or junior investigator in the right direction.
Table 1.
Review of Databases
| Database name | Access | Cost | Type of database | Data included | Why you would consider using this |
|---|---|---|---|---|---|
| NHANES https://www.cdc.gov/nchs/nhanes/index.htm | Publicly available | Free | National survey | Interview-obtained medical information, physical exam and laboratory data | Investigate diseases and associations with medications, physical exam, and laboratory data |
| National Health Interview Survey https://www.cdc.gov/nchs/nhis/index.htm | Publicly available | Free | National survey | Interview-obtained medical information | Investigate association between diseases and health care access and barriers |
| SEER/SEER Research Plus https://seer.cancer.gov/ | Publicly available; an application is required for certain restricted data | Free | Registry | Patient demographics and cancer-specific data (such as primary tumor site, morphology, and stage) | Assess cancer incidence and mortality, survival, and limited-duration prevalence |
| GI Quality Improvement Consortium (GIQuIC) https://giquic.gi.org/ | Application process | Fee | Registry | Endoscopy center and provider characteristics, patient demographics, endoscopy reports and quality metrics | Describe endoscopy and colonoscopy measures |
| Veterans Affairs https://www.va.gov/vetdata/ | Affiliation required | Free | Electronic health records | Entire medical charts | Evaluate longitudinal care of a large cohort of patients |
| MarketScan https://www.ibm.com/products/marketscan-research-databases | Publicly available | Fee | Claims data | Insurance claims, patient demographics. Supplements include laboratory data, disability data, weather data, and others | Examine treatment patterns, patient adherence, and natural history of disease across both inpatient and outpatient care |
| SEER-Medicare https://healthcaredelivery.cancer.gov/seermedicare/ | Application process | Fee | Combined registry and claims data | Includes SEER registry data along with inpatient, outpatient, and medication claims data | Perform epidemiological and health services research in patients with cancer |
| Nationwide Inpatient Sample https://www.hcup-us.ahrq.gov/nisoverview.jsp | Application process | Fee (discounts available for students) | Combined registry and claims data | Claims data, patient and provider characteristics | Evaluate inpatient care |
| All of Us https://allofus.nih.gov/ | Two access types: anonymized aggregate data publicly available and more comprehensive individual data for registered researchers | Free | Prospectively enrolling registry, including survey data, electronic health records, wearable device data, and biosamples; data are both prospective and retrospective | Patient health information from survey and electronic health records, biosamples | Evaluate the health of a large, diverse cohort of patients, including access to biosamples |
Types of Health Research Databases
Survey Data
There are many survey databases obtained through participant interview that are readily available to researchers. Two examples are the National Health Interview Survey (NHIS) and the National Health and Nutrition Examination Survey (NHANES), which are conducted by the National Center for Health Statistics (NCHS).1,2 Both are large national surveys, with NHANES including 5000 people per year and NHIS including 30,000 households per year. While both surveys collect sociodemographic, health, and disease information, NHIS is solely interview based and is focused on health care utilization and access. A recent study used NHIS to evaluate food insecurity, social support, and financial toxicity in patients with IBD.3 In addition, every 5 years, NHIS includes cancer-related questions,4 which can be used to study cancer screening and risk factors, such as a recent study assessing adherence to colorectal cancer screening guidelines in African Americans.5
NHANES, on the other hand, also includes data from physical examination and laboratory testing. Kim et al6 used NHANES to show that reduced thyroid function predicted mortality in patients with nonalcoholic fatty liver disease. Despite the advantages, these surveys may be difficult to navigate without a knowledge of statistics. To make the sample population’s answers to survey questions representative of a larger population, the samples need to be weighted, or corrected, by demographic characteristics to improve the accuracy of survey estimates.
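To illustrate why weighting matters, the following is a minimal sketch of a weighted prevalence estimate. The `Respondent` class, its field names, and the toy sample are hypothetical and are not taken from NHANES or NHIS files; the real surveys supply their own weight variables and documentation. The sketch covers only the point estimate and assumes each weight is the number of people in the population that a respondent represents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Respondent:
    has_condition: bool  # hypothetical survey answer (e.g., self-reported diagnosis)
    weight: float        # assumed: number of people this respondent represents


def unweighted_prevalence(respondents: List[Respondent]) -> float:
    """Share of sampled respondents reporting the condition."""
    return sum(r.has_condition for r in respondents) / len(respondents)


def weighted_prevalence(respondents: List[Respondent]) -> float:
    """Share of the represented population reporting the condition."""
    total_weight = sum(r.weight for r in respondents)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    affected_weight = sum(r.weight for r in respondents if r.has_condition)
    return affected_weight / total_weight


# Toy sample: the two respondents with the condition come from an
# oversampled group, so each represents fewer people.
sample = [
    Respondent(True, 1000.0),
    Respondent(True, 1000.0),
    Respondent(False, 4000.0),
    Respondent(False, 4000.0),
]

naive = unweighted_prevalence(sample)    # half of the sample
adjusted = weighted_prevalence(sample)   # a fifth of the represented population
```

In this toy sample the unweighted estimate overstates prevalence because the affected respondents were oversampled. Standard errors for complex surveys also depend on the sampling design, which this sketch does not address; that is one reason to involve a statistician early.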
Registries
There are also databases of national registries. A registry collects detailed information about a set of patients, such as their age, race and ethnicity, sex, diagnosis, and treatments. One example is the Surveillance, Epidemiology, and End Results (SEER) program, which coalesces data from cancer registries and is funded by the National Cancer Institute (NCI).7 SEER includes patient demographics as well as cancer incidence, mortality, tumor stage, and morphology. For example, Hur et al8 found an increase in esophageal adenocarcinoma incidence and mortality rates from 1975 to 2009.
To perform analyses, researchers must use the free software, SEER*Stat. While this may be a barrier, there are free tutorials available on the SEER website and a robust and responsive helpline. There are also software applications that use SEER*Stat output and expand the type of statistical tests that can be done, such as Joinpoint for trend analysis. While SEER is publicly available, certain demographic and cancer data are restricted and are available only via SEER Research Plus, which requires an application.7
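As a rough illustration of what trend analysis on registry output involves, the sketch below estimates an annual percent change (APC) by fitting a straight line to the logarithm of yearly rates. This is a simplified, single-segment calculation and not a reproduction of the Joinpoint software, which additionally searches for years where the trend changes. The function name and the example rates are hypothetical.

```python
import math
from typing import Sequence


def annual_percent_change(years: Sequence[float], rates: Sequence[float]) -> float:
    """Estimate APC from a least-squares fit of log(rate) on year."""
    if len(years) != len(rates) or len(years) < 2:
        raise ValueError("need at least two (year, rate) pairs of equal length")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive to take logarithms")

    log_rates = [math.log(r) for r in rates]
    n = len(years)
    mean_year = sum(years) / n
    mean_log = sum(log_rates) / n

    sxx = sum((x - mean_year) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_year) * (y - mean_log) for x, y in zip(years, log_rates))

    slope = sxy / sxx  # change in log(rate) per year
    return (math.exp(slope) - 1.0) * 100.0


# Made-up rates that grow by exactly 5% per year.
years = list(range(2000, 2006))
rates = [10.0 * 1.05 ** i for i in range(6)]
apc = annual_percent_change(years, rates)
```

For these made-up rates the function returns approximately 5, the built-in yearly growth. Real incidence series are noisier, and a single straight line can hide a change in trend.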
Another registry is the GI Quality Improvement Consortium (GIQuIC), which is intended as a repository for quality improvement measures.9 This registry includes patient demographics, provider characteristics, American Society of Anesthesiologists status, anticoagulation use, and endoscopic procedure measures beginning in 2010 from provider practices, ambulatory surgical centers, endoscopy suites, and hospitals. A recent study used this database to describe polyps and neoplasia in patients aged 45 to 49, which in light of recently changing colon cancer screening guidelines is important to informing adenoma detection rates for this age group.10,11
Electronic Health Records
Electronic health records can serve as a research tool that has the benefit of providing a more complete representation of patient care. Records can be obtained from local hospital systems or from large health systems such as Kaiser or Geisinger. The benefit of local data is that they may be accessible from your own institution, and the patient population will be familiar to you. The downsides of electronic health record data, however, are that the results may not be generalizable outside of a particular geographic area, and the data may be hard to access and may not be immediately user-friendly. Additionally, health records are by definition retrospective and only allow for observational data collection.
One notable example is the Veterans Affairs (VA) records, a large database of electronic health records that can be accessed through collaboration with VA staff. The VA record has been in place for several decades, providing a unique longitudinal data set. The records include diagnostic and procedural codes, laboratory data, vital signs, imaging, and pathology. One example is a study published in Hepatology in 2021 that described clinical characteristics and outcomes of veterans with and without cirrhosis who tested positive for severe acute respiratory syndrome coronavirus 2.12 Because of the large population, the authors were able to quickly publish a study at the start of the pandemic that included 3306 patients with cirrhosis among 88,747 tested for coronavirus disease. However, because veterans are predominantly English-speaking and male, findings from these studies may not be generalizable,13 and information may be missing if care was provided outside of the VA system.
Claims Data
Data from insurance claims are a robust means of conducting research and can provide information about how care is provided “in the real world.” Data often include patient and provider information, diagnostic and procedural codes, and costs of care. These data sets often lack clinically important information such as laboratory data or social and family history. Another limitation is that claims data depend on provider reporting, which may not be entirely accurate, although it tends to be accurate for procedures (eg, endoscopy and colonoscopy).14,15
One example is MarketScan, which is a family of databases that include insurance claims from participating providers for their employees and employee dependents. Specifically, the claims data come from employer-sponsored insurance, employer-sponsored Medicare supplement, and Medicaid in 11 states for inpatient, outpatient, and prescription drug claims, as well as expenditure data. Supplemental data include workplace and disability measures, weather patterns, benefit plan design, and inpatient drug use. Kulaylat et al16 used the MarketScan database to evaluate postoperative complications of patients with ulcerative colitis who received preoperative anti-tumor necrosis factor therapy. Some limitations of this particular data set are incomplete follow-up if patients change employers and the inclusion of only employed patients who are of working age.
Another notable database is the SEER-Medicare linkage, which combines cancer data from SEER with Medicare claims.17 This linkage allows for cancer research that includes measures of comorbidities, receipt of screening and evaluation tests, and detailed treatment data. For example, Rustgi et al18 found a rise in use of endoscopic ultrasound to diagnose patients with pancreatic cancer from 2000 through 2015 as well as a survival benefit. SEER-Medicare also provides a random 5% sample of patients without cancer to serve as controls so that researchers can conduct population-based analyses within the SEER registry areas.17 Specific limitations include inclusion of only older patients, cost of data purchase, and an extensive application and data use agreement that must be approved by NCI. Unless a collaborating researcher already has the database, the time from application to data in hand can be many months.
Finally, the National Inpatient Sample is a comprehensive database consisting of national and state-specific data on inpatient stays, ambulatory surgery, and readmissions.19 It is the largest publicly available all-payer inpatient database designed to produce estimates of inpatient use, access, cost, quality, and outcomes and represents 7 million inpatient stays per year or 35 million per year with weighting. It includes patient demographics and hospital characteristics, International Classification of Diseases codes, charges, discharge status, length of stay, and severity and comorbidity measures. This database was used by Joo et al20 to describe hospital costs and use of palliative care consults and procedures for patients with gastric cancer.
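The relationship between 7 million sampled stays and 35 million weighted stays can be sketched as follows: each sampled record carries a weight, and a national estimate is the sum of the weights of the records that meet a case definition. The `Discharge` class, its field names, the toy records, and the uniform weight of 5 (simply 35 million divided by 7 million) are assumptions for illustration; actual files assign a weight to each record, and the database documentation should be consulted for the real variable names and variance methods. The prefix `C16` is the ICD-10 category for malignant neoplasm of the stomach, written here without the decimal point.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Discharge:
    diagnosis_codes: List[str]  # hypothetical: ICD-10 codes stored without dots
    discharge_weight: float     # assumed: national stays this record represents


def national_estimate(records: Iterable[Discharge], code_prefix: str) -> float:
    """Sum the weights of stays with any diagnosis code starting with code_prefix."""
    return sum(
        r.discharge_weight
        for r in records
        if any(code.startswith(code_prefix) for code in r.diagnosis_codes)
    )


# Toy records, each with an assumed weight of 5.
records = [
    Discharge(["C169", "E119"], 5.0),  # stomach cancer plus another diagnosis
    Discharge(["K219"], 5.0),          # no stomach cancer code
    Discharge(["C160"], 5.0),          # stomach cancer, different subsite
]

estimate = national_estimate(records, "C16")
```

Here two sampled stays with a matching code stand in for an estimated 10 stays nationally. Case definitions built from diagnosis codes are only as reliable as the coding itself, which is the limitation of claims data noted above.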
Comprehensive Data Sets
The All of Us Registry is a unique database that is a rich source of data for researchers.21 This prospective registry includes survey data at the time of entry and biosamples for future genomic studies and is also linked to health records for a subset of patients. The goal of this registry is to promote health equity, improve wellness and health outcomes, and inform earlier diagnosis of disease in a diverse population of patients. A published study by Renedo et al22 used novel definitions of underrepresented groups, besides race and ethnicity, to demonstrate disparities in revascularization after stroke. This registry may be subject to selection bias because all participants volunteer to participate.
Conclusion
Our review of the many types of databases that can be used for GI-focused research will hopefully provide the reader with some ideas and help “jump-start” a clinical research career. Although some of these databases can cost as much as several thousand dollars, others are free. Collaboration with other fellows and junior faculty can offset the cost and lead to more fruitful research endeavors.
Funding
Chin Hur was supported by National Institutes of Health National Cancer Institute grants R01 CA247790 and U01 CA265729.
Footnotes
Conflicts of interest
Chin Hur discloses relationships with Value Analytics Labs and Exact Sciences.
The other authors disclose no conflicts.
References
1. Centers for Disease Control and Prevention. National Center for Health Statistics. National Health Interview Survey. Available at: https://www.cdc.gov/nchs/nhis/about_nhis.htm. Accessed March 13, 2022.
2. Centers for Disease Control and Prevention. National Center for Health Statistics. About the National Health and Nutrition Examination Survey. Available at: http://www.cdc.gov/nchs/nhanes/about_nhanes.htm. Accessed March 13, 2022.
3. Nguyen NH, Khera R, Ohno-Machado L, Sandborn WJ, et al. Prevalence and effects of food insecurity and social support on financial toxicity in and healthcare use by patients with inflammatory bowel diseases. Clin Gastroenterol Hepatol 2021;19:1377–1386.e1375.
4. National Cancer Institute. Division of Cancer Control & Population Sciences. National Health Interview Survey (NHIS) Cancer Control Supplement (CCS). Available at: https://healthcaredelivery.cancer.gov/nhis/. Accessed March 13, 2022.
5. Millien VO, Levine P, Suarez MG. Colorectal cancer screening in African Americans: are we following the guidelines? Cancer Causes Control 2021;32:943–951.
6. Kim D, Vazquez-Montesino LM, Escober JA, et al. Low thyroid function in nonalcoholic fatty liver disease is an independent predictor of all-cause and cardiovascular mortality. Am J Gastroenterol 2020;115(9):1496–1504.
7. National Cancer Institute, National Institutes of Health. Overview of the SEER Program. NCI’s Division of Cancer Control and Population Sciences. Available at: http://seer.cancer.gov/about/overview.html. Accessed March 13, 2022.
8. Hur C, Miller M, Kong CY, et al. Trends in esophageal adenocarcinoma incidence and mortality. Cancer 2013;119:1149–1158.
9. GIQuIC Research Overview. Available at: https://giquic.gi.org/research.asp. Accessed March 13, 2022.
10. Bilal M, Holub J, Greenwald D, et al. Adenoma detection rates in 45–49 year old persons undergoing screening colonoscopy: analysis from the GIQuIC Registry. Am J Gastroenterol 2022;117:806–808.
11. Trivedi PD, Mohapatra A, Morris MK, et al. Prevalence and predictors of young-onset colorectal neoplasia: insights from a nationally representative colonoscopy registry. 2022;162(4):1136–1146.e5.
12. Ioannou GN, Liang PS, Locke E, et al. Cirrhosis and severe acute respiratory syndrome coronavirus 2 infection in US Veterans: risk of infection, hospitalization, ventilation, and mortality. Hepatology 2021;74:322–335.
13. Kumar S, Metz DC, Kaplan DE, et al. Seroprevalence of Helicobacter pylori infection in a national cohort of veterans with noncardia gastric adenocarcinoma. Clin Gastroenterol Hepatol 2020;18:1235–1237.e1231.
14. Cooper GS, Virnig B, Klabunde CN, et al. Use of SEER-Medicare data for measuring cancer surgery. Med Care 2002;40(Suppl):IV43–IV48.
15. Warren JL, Harlan LC, Fahey A, et al. Utility of the SEER-Medicare Data to Identify Chemotherapy Use. Med Care 2002;40(Suppl):IV55–IV61.
16. Kulaylat AS, Kulaylat AN, Schaefer EW, et al. Association of preoperative anti-tumor necrosis factor therapy with adverse postoperative outcomes in patients undergoing abdominal surgery for ulcerative colitis. JAMA Surg 2017;152:e171538.
17. National Cancer Institute. SEER-Medicare Linked Database. Available at: https://healthcaredelivery.cancer.gov/seermedicare/. Accessed March 13, 2022.
18. Rustgi SD, Zylberberg HM, Amin S, et al. Use of endoscopic ultrasound for pancreatic cancer from 2000 to 2016. Endosc Int Open 2021;09:E1–E11.
19. Agency for Healthcare Research and Quality. Overview of the National (Nationwide) Inpatient Sample. Available at: https://www.hcup-us.ahrq.gov/nisoverview.jsp. Accessed March 13, 2022.
20. Joo MK, Yoo JW, Mojtahedi Z, et al. Ten-year trends of utilizing palliative care and palliative procedures in patients with gastric cancer in the United States from 2009 to 2018—a nationwide database study. BMC Health Serv Res 2022;22:20.
21. National Institutes of Health. All of Us Research Program. Available at: https://allofus.nih.gov/. Accessed March 13, 2022.
22. Renedo D, Acosta JN, Sujijantarat N, et al. Carotid artery disease among broadly defined underrepresented groups: the All of Us Research Program. Stroke 2022;53:e88–e89.
