Abstract
Introduction
Multi‐site research collaboration is necessary to increase generalizability, diversity, and innovation; however, there are complexities and challenges surrounding research processes, including regulatory oversight, data management, and data sharing activities. This report highlights the specific challenges of collaborative research identified in the conduct of a federally funded multi‐site study and presents lessons learned to inform ways to overcome these challenges. The purpose of our current research project is to determine how interprofessional teamwork affects quality outcomes and develop tools to improve teamwork in cancer care.
Methods
Our research team comprises 23 members across five academic institutions. Our study cohort includes approximately 20 000 adult patients with breast, colorectal, and non‐small cell lung cancers diagnosed at the respective sites between January 1, 2016, and December 31, 2021. Electronic health record (EHR) access log data was extracted 12 months pre‐ and 24 months post‐diagnosis for each patient, and outcome data on potentially preventable emergency department visits and unplanned hospitalizations was extracted from the California state All‐Payer Claims database.
Results
The major challenges we experienced related to single institutional review board processes, establishment of contract agreements, and data management, analysis, acquisition, and transfer. Lessons learned included: (1) start research planning as early as possible, including engaging with information technology and compliance teams to identify processes and to develop and share data dictionaries; (2) work closely with the institution's contracting team to identify the optimal timing and ordering for multiple data use agreement (DUA) contracts; and (3) ensure that research team members are abreast of current Health Insurance Portability and Accountability Act and institutional guidelines as they pertain to research and data practices.
Conclusions
Multi‐site research involving big data from the EHR requires ample planning and execution time. Adopting a single standardized DUA and developing data dictionaries that can be shared for research will improve the data acquisition phase of multi‐site research studies.
Keywords: access log data, data use agreements, electronic health record (EHR), multi‐site research, research challenges
1. INTRODUCTION
1.1. Multi‐site research collaboration
Multi‐site research collaboration is necessary to increase generalizability, diversity, and innovation; however, there are complexities and challenges surrounding research processes, including regulatory oversight, data management, and data sharing activities.
The purpose of this report is to highlight the specific challenges of collaborative research identified in the conduct of a federally funded multi‐site study and to present lessons learned to inform ways to overcome these challenges. As this research is ongoing, discussions will focus on research‐specific (i.e., not involving issues related to general grants administration such as budgeting, invoicing, etc.) and study start‐up challenges to date.
1.2. Case study
1.2.1. Project background
Cancer continues to rank as the second leading cause of mortality in the United States, with approximately 2 million projected new cancer cases in 2025, and 618 120 cancer deaths. 1 The National Academy of Medicine described a cancer care system in crisis, with interprofessional (IP) teamwork and coordination largely the exception rather than the rule. 2 The National Cancer Institute prioritized the need to improve IP team‐based cancer care. Aligned with these priorities, the overall goal of our research is to determine how IP teamwork affects quality outcomes and develop tools to improve teamwork in cancer care. The multiteam system (MTS) perspective offers a theoretical framework to examine IP teamwork among multiple groups of healthcare professionals (HCPs). Our current research leverages social network analysis to examine theory‐informed, targeted electronic health record (EHR) network structures at three study sites that all use Epic. This includes characterizing within‐ and between‐group EHR communication in cancer care, understanding EHR communication structures, and developing machine learning (ML)‐assisted visual analytics tools to identify patients with communication structures that are associated with poor quality outcomes.
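As a deliberately simplified sketch of the kind of network construction this approach implies, the snippet below links healthcare professionals who accessed the same patient's record; the event format, role names, and co‐access weighting scheme are our illustrative assumptions, not the study's actual analysis pipeline.

```python
from collections import defaultdict
from itertools import combinations

def build_coaccess_network(access_events):
    """Build an undirected, weighted provider network from EHR access logs.

    `access_events` is an iterable of (patient_id, provider_id) pairs.
    Two providers are linked when they accessed the same patient's record;
    the edge weight counts shared patients. This is one common, simplified
    way to derive a multiteam-system network from access log data.
    """
    providers_by_patient = defaultdict(set)
    for patient_id, provider_id in access_events:
        providers_by_patient[patient_id].add(provider_id)

    edges = defaultdict(int)
    for providers in providers_by_patient.values():
        # Sorting gives a canonical (a, b) key for each undirected edge.
        for a, b in combinations(sorted(providers), 2):
            edges[(a, b)] += 1
    return dict(edges)

events = [("p1", "oncologist"), ("p1", "nurse"), ("p1", "pharmacist"),
          ("p2", "oncologist"), ("p2", "nurse")]
network = build_coaccess_network(events)
# ("nurse", "oncologist") share two patients; the other pairs share one
```

Real analyses would additionally distinguish access types, time windows, and within‐ versus between‐group ties, but the co‐access principle is the same.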
1.2.2. Study details
Our research involves five academic research institutions across three states (University of California, Davis [UCD]; University of California, San Diego; University of California, Los Angeles; University of Iowa; Clemson University), including 23 study personnel—11 of whom are faculty investigators.
The study has three main phases: (1) Phase 1: Local and state‐level data extraction and social network analysis; (2) Phase 2: Interviews of patients, caregivers, and healthcare providers, as well as focus groups to ascertain pain points in the receipt and delivery of cancer patient care and incorporation of the qualitative findings back into the social network analysis of EHR communication; (3) Phase 3: Development of visual analytics interfaces and testing their usability in a clinical setting through case studies with clinical team researchers. Our experiences described here relate to Phase 1.
The data obtained in the first phase of this research consist of site‐specific cancer center patient‐level data from the three University of California (UC) health systems. Our cohort includes adult patients with breast, colorectal, and non‐small cell lung cancers diagnosed at the respective sites between January 1, 2016, and December 31, 2021. Across all three sites, our total cohort consisted of approximately 20 000 patients. Data extracted from the three UC health systems for this cohort include cancer‐specific variables (e.g., cancer type, stage, diagnosis date, etc.) and EHR access log data (e.g., access date, type, note ID, access user, etc.). Access log data was extracted 12 months pre‐ and 24 months post‐diagnosis for each patient. We also sought outcome variables of potentially preventable emergency department visits and unplanned hospitalizations from the state All‐Payer Claims database, specifically the California Department of Health Care Access and Information (HCAI) data through the California Cancer Registry (CCR). Given that our study involves three of the six key characteristics that define "big data" 3 (volume, variety, and variability; the remaining three being velocity, veracity, and value), we will henceforth refer to it as a big data study. An overview of our research processes and details related to our data extraction are presented in Figures 1 and 2, respectively. In Figure 1, we outline the primary study start‐up tasks for our current research study, including local and state institutional review board (IRB) approval, internal data extraction, state‐level (CCR/HCAI) data extraction, and data de‐identification and delivery to the research team. Understanding this process is important because some activities could not move forward until others were complete. For example, local IRB approval is needed before requests for internal site‐specific data, data use agreements (DUAs), and California state IRB approval.
Once the local primary (UCD) IRB was established, the remaining tasks could be initiated simultaneously. Once California state IRB approval was obtained, state‐level data requests could be submitted, and the data then extracted, de‐identified, and sent to the research team. In the data extraction process (Figure 2), cohort identification based on specific cancer diagnosis eligibility criteria was done locally at each site's cancer center via CNEXT, a cancer registry management system (Box A). Once the patient cohort was identified, a list of patient medical record numbers (MRNs) was used to extract patient‐level data (demographics, comorbidity diagnosis dates, encounter data, access log data, etc.) from Epic Clarity (Box B). The honest broker at each site generated study IDs for the patients in the cohort, as well as for any providers who interacted with a given patient's record within the defined study timeframe. The honest broker retained the primary key that was used to de‐identify patients and providers. A limited dataset (LDS) was then made available to the research team via a remote advanced computing environment (ACE) for local users or transferred to external research members via secure file transfer protocol (SFTP). A patient cohort dataset containing identifiers was sent to the CCR to extract patient cancer data (from CCR's database) and emergency department and hospitalization data (Box C). This state‐level data was sent back to the respective sites via SFTP for each site to create an LDS, which was then sent to the research team via SFTP.
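The honest broker's de‐identification step can be sketched as follows. This is a minimal illustration under our own assumptions (the field name `mrn` and the study ID format are invented for the example); the sites' actual tooling is not described here in detail.

```python
import secrets

def assign_study_ids(identifiers):
    """Honest-broker step (illustrative): map each MRN or provider ID to a
    random study ID. The returned key is retained only by the honest broker
    and is never released with the limited dataset."""
    key, used = {}, set()
    for ident in identifiers:
        sid = f"S{secrets.randbelow(10**8):08d}"
        while sid in used:  # guard against rare collisions
            sid = f"S{secrets.randbelow(10**8):08d}"
        used.add(sid)
        key[ident] = sid
    return key

def to_limited_dataset(records, key, id_field="mrn"):
    """Swap direct identifiers for study IDs; a limited dataset may retain
    care and diagnosis dates, as noted in the figure legends."""
    out = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != id_field}
        row["study_id"] = key[rec[id_field]]
        out.append(row)
    return out

mrn_key = assign_study_ids(["11111", "22222"])
lds = to_limited_dataset(
    [{"mrn": "11111", "cancer_type": "breast", "dx_date": "2016-03-01"}],
    mrn_key)
# lds[0] keeps dx_date, drops the MRN, and carries a random study_id
```

Using random (rather than hashed) IDs means the mapping cannot be reversed without the retained key, which is why the honest broker alone holds it.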
FIGURE 1.

Overall institutional review board (IRB), data use agreement, and data extraction processes. This figure outlines the flow of our research processes for the current study. At the start of our research study, we developed study protocols and submitted them for the establishment of a single IRB. After the local single IRB was established, several activities proceeded in tandem: data sharing agreements were requested; the California (CA) state IRB application was prepared and submitted; the application to the California Cancer Registry (CCR) was prepared; and honest brokers at each site identified all eligible patients and extracted patient‐level and healthcare provider access log data. Once the California state IRB was approved, the approval letter was submitted as part of our application to CCR. While CCR reviewed our application, honest brokers at each of the data‐supplying sites prepared limited datasets to be shared with the research team. Once our CCR application was approved, honest brokers at each site sent CCR datasets containing patient identifiers for their respective patients, which CCR required in order to link our patient cohort to its datasets. Once CCR extracted and transferred identified data back to each site, the honest brokers transformed the identified dataset into a limited dataset and released the data to the research team. *Data are fully identified for the purpose of extracting and linking across multiple sets. These sets are not made available to the research team. Only the site‐specific honest brokers or the California Cancer Registry handle the identified data. **Data are limited (containing care and diagnosis dates) and made available to the research team. EHR, electronic health record; HCAI, California Department of Health Care Access and Information.
FIGURE 2.

Detailed data extraction and delivery schema. This figure provides a more detailed overview of the data systems used and the overall data management for the current research. As a first step, local cohort identification based on specific cancer diagnosis eligibility criteria was done at each site's cancer center via CNEXT, a cancer registry management system (Box A). Once the patient cohort was identified, a list of patient medical record numbers (MRNs) was used to extract patient‐level data (demographics, comorbidity diagnosis dates, encounter data, access log data, etc.) from Epic Clarity (Box B). The honest broker at each site generated study IDs for the patients in the cohort, as well as for any providers who interacted with a given patient's record within the defined study timeframe. The honest broker retained the primary key that was used to de‐identify patients and providers. A limited dataset (LDS) was then made available to the research team via a remote advanced computing environment (ACE) for local users or transferred to external research members via secure file transfer protocol (SFTP). A patient cohort dataset containing identifiers was sent to the California Cancer Registry (CCR) to extract patient cancer data (from CCR's database) and emergency department and hospitalization data (Box C). This state‐level data was sent back to the respective data‐supplying sites via SFTP for each site to create a limited dataset and send it to the research team via SFTP. DADT, Data Access and Delivery Team; HCAI, California Department of Health Care Access and Information; SSN, social security number.
2. RESEARCH PROCESSES
2.1. Ethical use of data and privacy preservation
Completion of two essential activities ensures ethical use of patient data: approval of the research project by an IRB and execution of the appropriate contracts to share and analyze patient data. While these two processes are related, and IRB approval is needed before a DUA can be completed, they are handled by entirely different offices that very often operate independently of one another. The IRB governs the ethical use of human subjects' data, while the DUA legally binds institutions to uphold the processes described in the IRB protocol.
To streamline the IRB process and eliminate potentially contradictory IRB determinations and regulations among institutions involved in multi‐site research, the National Institutes of Health (NIH) implemented a policy pertaining to studies subject to the Revised Common Rule Cooperative Research Provision (45 CFR 46.114[b]). 4 , 5 This policy requires all NIH proposals submitted from January 25, 2018, onward to use a single IRB (sIRB) as required by the terms and conditions of the award. While this policy was intended to simplify the IRB process for multi‐site research, many have stated that it further convolutes the process and creates duplicative efforts. 6 , 7
The overall establishment of a sIRB involves many steps: identifying whether an investigator's research falls under the scope of the sIRB requirement, identifying an institution to serve in this capacity, and subsequently maintaining responsibility for coordinating all communications between the research teams, the reviewing IRB, and the local or relying IRBs. This process is complex for any researcher to manage and may significantly constrain early‐stage investigators who have not worked with or supervised a designated research coordinator familiar with the sIRB process and internal IRB processes and with a firm grasp of all the requirements and research activities involved in the project. Inadequate knowledge, experience, and resources to initiate a sIRB can lead to major problems, including inefficient and confusing communication among all involved parties, poor quality or incomplete submissions to the sIRB, and significant delays in study start‐up. 6 , 7
With funding of our research grant in July 2022, our team began developing the documents necessary to submit for our sIRB, which would be managed at our primary site, UCD. Establishing the sIRB among the participating sites required 26 local IRB documents. Additional review by the California Department of Public Health IRB, a requirement for receiving data from CCR, brought the combined local and state total to 55 documents (Table 1). Figure 3 shows the timeline of our research activities, including what we initially proposed and planned for. Starting with the IRB, we initially allotted 9 months for protocol development and establishment of the local sIRB across all sites. Development of data extraction protocols, along with other study‐related protocols and documents, took 5 months. While IRB completion itself occurred within the originally allotted 9‐month timeframe, protocol development (overall and data extraction‐specific protocols) took longer than anticipated, adding 5 months to the total IRB completion time (13 months total). The primary reason for our protocol delays was the input needed from compliance and data information technology (IT) teams regarding whether specific data elements could be released and whether those elements were still collected and maintained at each site for our specific cohort timeframe. The time each sub‐site took to cede review to the primary IRB also varied, ranging from 3 to 7 months. The time needed to negotiate and incorporate site‐specific language requirements for all relevant documents, along with the overall coordination across five sites, proved to be one of the most challenging aspects contributing to the extended timeline. Another concern emphasized by others has been establishing interoperability between institutions' differing electronic IRB configurations. 7 We found this to be the case at one of our study sites, which needed additional time compared to the other sites to execute the sIRB, as it had to integrate with and gain access to a SMART IRB system compatible with the UCD IRB system.
TABLE 1.
Required Forms from local institutional review boards (IRBs), Committee for the Protection of Human Subjects (CPHS), and the California Cancer Registry (CCR).
| Entity (total #) | Required forms (#) |
|---|---|
| Local institutional review boards (IRBs) (26) | Initial Review Application (5) |
| Initial Scientific Review Committee Administrative Approval (1) | |
| Consent Forms for Case and Usability Study (3) | |
| Consent Forms for Interviews and Focus Groups (3) | |
| Collaborating Site Principal Investigator Curriculum Vitae (4) | |
| Reliance Fee Forms (4) | |
| Overall Study Protocol (1) a | |
| Data Extraction Protocol (1) a | |
| Interview Guides for Patients, Caregivers, and Healthcare Providers (2) a | |
| Care Coordination Survey for Patients (1) a | |
| Interview Guide for Case and Usability Study (1) a | |
| Committee for the Protection of Human Subjects (CPHS) (13) | CPHS IRB Application (1) |
| Chronic Disease Surveillance and Research Branch Letter of Support (1) | |
| Cover Letter (1) | |
| Data Extraction Budget (1) | |
| Site Principal Investigator Curriculum Vitae (4) | |
| CPHS Data Security Letter (2) | |
| Local IRB Health Insurance Portability and Accountability Act (HIPAA) Waiver (1) | |
| California Cancer Registry (CCR) Data Dictionary and Requested Variables (1) | |
| California Department of Health Care Access and Information (HCAI) Data Dictionary and Requested Variables (1) | |
| California Cancer Registry (CCR) (16) | CCR Application for Disclosure of Confidential Registry Data (1) |
| CPHS Letter of Approval (1) | |
| Chronic Disease Surveillance and Research Branch Letter of Support (1) | |
| CPHS IRB Application (1) | |
| Confidentiality Agreement for Disclosure of CCR Data (4) | |
| Local IRB Letters of Approval (4) | |
| Local IRB Overall Study Protocol (1) | |
| Grant Notice of Award (1) | |
| CCR Data Dictionary and Requested Variables (1) | |
| HCAI Data Dictionary and Requested Variables (1) |
a Forms that cover study‐wide activities and apply to all sites.
FIGURE 3.

Timeline of SMART Cancer Care Teams project events. This figure outlines the timeline of research activities (activities originally proposed and actual time to completion) across the five research institutions in our study, three of which are data‐supplying sites. The five study sites include the University of California, Davis (UCD), University of California, San Diego, University of California, Los Angeles, University of Iowa, and Clemson University. The three UC sites are data‐supplying sites. To allow for anonymity, all sites have been masked except for UCD (Primary Site). Specific activities noted include local, multi‐site, and state‐level institutional review board (IRB) submission, review, and approval activities; data use agreement (DUA) submission, review, and approval activities; and state data request submission, review, approval, and estimated date of delivery of the data. Protocol development (overall and data extraction specific) took 5 months. Establishment of the local single IRB took 9 months. The time from California (CA) state IRB submission to approval was 3 months. Execution of all five site‐level DUAs took 26 months. Of note, one data‐supplying site took the full 26 months, with the next longest DUA execution time being 10 months. The length of time to extract internal electronic health record data from the three data‐supplying sites varied, ranging from 14 to 22 months. The time from submitting the request for state‐level data to delivery of that data was 23 months. CCR, California Cancer Registry.
Establishing DUAs for multi‐site research is another intricate process that has the potential to delay research activities and impact grant deadlines. Mello and colleagues interviewed contracting individuals from 48 of the top 50 US universities with the largest total research expenditures and identified the following major barriers to timely DUA execution 8 : (i) high contracting request volume and insufficient contract staffing; (ii) prioritization of contracting requests; (iii) poor or untimely communication between involved parties; (iv) time for ancillary reviews and compliance with privacy and security policies; (v) unclear allocation of DUA responsibilities across offices; and (vi) lack of incentives for the data supplier.
In our case, UCD pursued a single, multi‐site DUA that specified the data each site would send and receive at each phase of the research. Our multi‐site DUA was sent to all sites simultaneously in May 2023. This later proved problematic: the first sites to sign the agreement were the non‐data‐contributing sites, leaving no room for the data‐contributing sites to make changes or negotiate terms. Furthermore, despite the data‐contributing sites being sister campuses within the same University system, one site strongly opposed the way activities were described, including the clarity of the process, internal and external data movement (i.e., due to size and security concerns), the data limitation and de‐identification methodology, and the language used. The time from the initial DUA request to the signature of the last site was 26 months (Figure 3). Notably, one data‐supplying site was an outlier driving the total contracting time of 26 months; the next longest negotiation for the agreement took 10 months.
Our study also includes requesting All‐Payer Claims Data from the CCR, a state‐managed system, which requires California state IRB approval from the Committee for the Protection of Human Subjects (CPHS) and a disclosure agreement from all sites using HCAI/CCR data. This agreement is non‐negotiable, meaning that sites had to accept the terms as‐is. Considerable time was spent getting all necessary sites to agree to the terms of the disclosure agreement. To further complicate matters, one UC data‐contributing site firmly requested that a business associate agreement (BAA) be drawn up between the site and CCR. To resolve the disagreements over accepting the disclosure agreement as‐is, UCD worked closely with the CCR team to demonstrate precedent: CCR data had been approved for use at the sister UC campus for a different project. Further discussions with the team requesting the BAA revealed that the contradictory stances of the two campuses on the project stemmed from new staff who were unfamiliar with the process.
2.2. Data management, analysis, acquisition, and transfer—internal and external
The literature supports that sharing research data increases transparency and accountability and incentivizes researchers to maintain quality research methodologies and documentation. 9 To promote the sharing of scientific data, NIH issued a Data Management and Sharing (DMS) policy. 10 This policy became effective January 25, 2023, and requires that, unless otherwise restricted based on NIH's acceptable justifiable reasons for limiting sharing, research data must be submitted to a data repository. Data that must be shared are those commonly accepted in the scientific community as of "sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications." While sharing data and following transparent research practices is ideal, challenges exist when sharing data, particularly big and protected patient health data from multiple sites: data quality and harmonization, data privacy and security, and data infrastructure and scalability.
2.2.1. Data quality and harmonization
The healthcare industry generates massive amounts of data, constituting approximately 30% of the world's stored data. Approximately 80 megabytes of imaging and EHR data are generated per patient per year. 11 As this data volume continues to grow, the EHR has become a highly sought‐after data source for researchers. While the EHR offers a wealth of detailed information on a wide range of target populations, differences in what is captured across institutions using the same EHR system (e.g., site customizations) create a fragmented medical record with significant missingness and non‐standardized data, both of which add a layer of complexity to obtaining acceptably clean data for analysis in multi‐site research. In addition, a substantial amount of key patient data exists within unstructured text, such as physician notes, making it difficult to leverage without sophisticated natural language processing. For our study, we also utilized EHR access log data. This presented challenges both in understanding what information was available and in securing the computing power needed to extract a large volume of data from a system that is not designed to surface this specific data as readily as patient‐level data.
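Extracting access log data over the study's 12‐month pre‐ and 24‐month post‐diagnosis window can be sketched as a simple per‐patient filter. The field names and the day‐based approximation of calendar months below are our assumptions, not the sites' actual queries.

```python
from datetime import date, timedelta

# Study window per the protocol: 12 months pre- and 24 months post-diagnosis.
# Approximated here in days; a production query would use calendar arithmetic.
PRE_DAYS, POST_DAYS = 365, 730

def in_study_window(access_date, diagnosis_date):
    """True when an access log event falls within the patient's window."""
    return (diagnosis_date - timedelta(days=PRE_DAYS)
            <= access_date
            <= diagnosis_date + timedelta(days=POST_DAYS))

def filter_access_log(events, diagnosis_dates):
    """Keep only events inside each patient's pre/post-diagnosis window.

    `events`: dicts with 'patient_id' and 'access_date';
    `diagnosis_dates`: maps patient_id to that patient's diagnosis date.
    """
    return [e for e in events
            if in_study_window(e["access_date"],
                               diagnosis_dates[e["patient_id"]])]

dx = {"p1": date(2018, 6, 1)}
events = [{"patient_id": "p1", "access_date": date(2018, 1, 15)},  # kept
          {"patient_id": "p1", "access_date": date(2021, 1, 1)}]   # outside window
kept = filter_access_log(events, dx)
```

In practice this windowing would be pushed into the database query itself, since access log tables are far too large to filter in application memory.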
We applied a staggered approach to extracting data at the three sites, allowing our primary site to lead and resolve major issues to inform the two remaining sites. Cohort identification and internal cancer center data extraction began in May 2023 (Figure 3, Month 11 from the grant start). While full draft data from all internal sources have been extracted, cleaning and troubleshooting data issues (identifying additional data elements in different tables, exploring field missingness, etc.) continue. Data extraction activities concluded at two of the three data‐supplying sites (in the third quarter of Year 3 of the grant). The remaining data‐supplying site anticipates finalizing its access log data extraction at the end of Year 3, needing 17 months in total to complete. By starting data extraction with the primary site, the other two sites were able to bypass major data issues, such as identification of the variables and data tables to use for the extraction as well as patient and provider random‐ID generation and linkage across sets, because the primary site shared its code and queries. Having detailed study and data extraction protocols, as well as connecting each site's IT analysts with one another, allowed for the most efficient approach to troubleshooting data problems. Though the analysts are not part of the research team, it is important to include them in research team meetings to align data curation and extraction with the research goals and to obtain real‐time feedback from expert investigators when troubleshooting.
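One small, shareable example of the kind of cross‐site check involved in exploring field missingness might look like the following; the field names are illustrative. Sharing such a function verbatim, as the primary site did with its queries, ensures "missing" is defined identically at every site.

```python
def missingness_profile(rows, fields):
    """Fraction of missing values per field for one site's extract.

    A value counts as missing when it is None or an empty string; sites
    running the same definition produce directly comparable profiles.
    """
    missing = {f: 0 for f in fields}
    for row in rows:
        for f in fields:
            v = row.get(f)
            if v is None or v == "":
                missing[f] += 1
    n = len(rows) or 1  # avoid division by zero on an empty extract
    return {f: missing[f] / n for f in fields}

site_rows = [{"stage": "II", "dx_date": "2016-05-01"},
             {"stage": "", "dx_date": "2017-01-10"},
             {"stage": "III", "dx_date": None}]
profile = missingness_profile(site_rows, ["stage", "dx_date"])
# stage and dx_date are each missing in 1 of 3 rows
```

Comparing these profiles across the three sites quickly flags fields that one site stopped collecting or stores in a different table.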
2.2.2. Data privacy and security
At the core of conducting research involving patient data is the balance of leveraging the rich potential of the data while protecting patient privacy and confidentiality. Privacy and security concerns have only grown as reports show that healthcare institutions experience the largest share of ransomware attacks, 12 with a 264% increase in attacks and a 256% increase in data breaches within the health sector over the past 5 years. 13 To address this concern, healthcare institutions implement strict data policies, which often do not translate well to a research setting. For our research, changes in such data policies and the creation of new procedures produced several iterations of the set of data elements allowed to be shared with the research team. These included restricting or re‐structuring data (e.g., complete aggregation or collapsing of data element categories) to meet limited or de‐identified data requirements; setting up a secured, dedicated computing environment requiring special authentication methods, accounts, and permissions, with restrictions on allowable analysis software; and limiting data release for internal and external parties (collaborating sites that are part of the research team). These changes further extended our data extraction timeline, delaying research activities.
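As a hedged illustration of the kind of re‐structuring mentioned above (collapsing categories and coarsening values to meet limited or de‐identified dataset requirements), consider the sketch below. The specific bins and category sets are our own examples, though pooling ages of 90 and over does mirror the HIPAA Safe Harbor treatment of extreme ages.

```python
def collapse_category(value, allowed, other="Other"):
    """Collapse rare or potentially identifying categories into a catch-all
    bucket so that small cell counts cannot single out a patient."""
    return value if value in allowed else other

def bin_age(age):
    """Replace an exact age with a coarse decade band; ages 90 and over are
    pooled, mirroring the HIPAA Safe Harbor standard for extreme ages."""
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

row = {"age": bin_age(67),
       "race": collapse_category("X", {"White", "Black", "Asian"})}
# row == {"age": "60-69", "race": "Other"}
```

Applying such transformations before release lets the honest broker satisfy changing policy requirements without re‐extracting the underlying data.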
2.2.3. Data infrastructure and scalability
Health records hold a considerable volume of data, which can quickly exhaust existing data support infrastructure. As noted by Anderson et al., the major barriers in biomedical research data management and analysis include the financial burden of acquiring new expertise or tools; lack of time to invest in changing work practices to incorporate new technologies; limited availability of institutional support; and environment and file size limitations. 14 Scaling storage and analytical infrastructure can become costly quickly. The complexity of healthcare data often surpasses traditional IT infrastructures, demanding scalable solutions capable of handling exponential growth. This means that there must be a dedicated team with specialized training to maintain oversight of these activities for the purpose of research, which is secondary to the top priority of immediate patient care. While NIH allows for budgeting of data management costs, it does not add additional funds to the original maximum allowed budget amount. We experienced two major challenges related to infrastructure. The first was limited resources for extracting the data. Data size, system timeouts, and competing draws on network resources meant that data extraction was limited to overnight and weekend queries, making data checking time‐consuming, as multiple extractions and iterations were needed. Second, we faced challenges in transferring the data files both internally and externally with our collaborating sites. Due to the file sizes and the limited identifiers contained in the files, we needed to use an ACE with restricted software for internal data access and an SFTP connected to the ACE to allow external collaborating sites to pull the data into their own secure computing environments.
The ACE is located on‐site and is a self‐service environment allowing researchers to provision Linux or Windows computing resources "on demand." These systems are situated in a secure network enclave with limited access to external resources and can be used to process data using open‐source tools. Major benefits of this infrastructure are that it is Health Insurance Portability and Accountability Act (HIPAA) compliant and that the systems can be scaled up to include large random access memory and central processing unit allocations, extensive storage, and graphics processing units.
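For large file transfers such as those between the ACE and external sites over SFTP, verifying integrity with a streamed checksum is a common, low‐cost safeguard. The sketch below is a generic illustration, not the sites' actual procedure: the sender publishes a digest alongside each file, and the receiver recomputes and compares it after the transfer.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB chunks so multi-gigabyte extracts never need
    to fit in memory; returns the hex SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

A mismatch between the published and recomputed digests indicates a truncated or corrupted transfer, which matters when overnight SFTP sessions are interrupted by timeouts.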
3. LESSONS LEARNED
We have experienced several main challenges in the conduct of this research; Table 2 summarizes them, highlighting the delays in protocol development, IRB, and data extraction, as well as major lessons learned to date from conducting multi‐site, big data research. The first and most important lesson is to start as many pieces of research planning as early as possible. Starting early means working with your internal and external teams to identify the details and logistics of conducting the research. In particular, it is key to initiate discussions with IT and compliance teams to determine their internal processes for working with and transferring large data, and to work with IT at the collaborating sites to reconcile data structure and other data challenges that arise when extracting and preparing the data. A driver for working through these challenges is the development and sharing of local and EHR‐specific data dictionaries. Because this information was not readily available, identifying the data elements required for the research, along with their availability and locations in the system, added considerable time. We stress this point because our research was the first project of this complexity and scale at our institution. This meant that several of our teams were creating procedures and policies as our study was being implemented, which required more time to ensure that these new procedures and policies would meet the needs of our study and future studies, and would fall within the capabilities of current infrastructure or allow teams to scale up their infrastructure as needed.
TABLE 2.
Study challenges, results of challenges, and proposed solutions to challenges.
| Challenge | Reason for challenge | Result of challenge | Proposed solution |
|---|---|---|---|
| Delays in study protocol development | | Protocol development took 5 months to finalize, extending the full sIRB execution to 13 months compared to the 9 months that we originally proposed. | |
| Delays in contract agreement execution | | Delays in contracting were substantial, resulting in the final site‐level DUA being executed at 26 months. This was over twice the amount of time we originally planned for (12 months). | |
| Delays in data extraction | | Time to completion of internal data extraction varied by site, ranging from 14 to 22 months, and was longer than originally planned (12 months). Compounded by the delays in study protocol development and contract execution, the data delays had larger consequences with our funder (NIH). For our Year 2 annual report, NIH heavily scrutinized our budget due to the delays and requested extensive details regarding a modified timeline for the data‐supplying site that was behind schedule compared to our other sites. With extensive justification in our annual report, we did not receive any budget cuts because of the delays. | |
Abbreviations: DUA, data use agreement; EHR, electronic health record; HIPAA, Health Insurance Portability and Accountability Act; NIH, National Institutes of Health; sIRB, single institutional review board.
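A shared, EHR‐specific data dictionary of the kind described in the first lesson can be as simple as a machine‐readable mapping from each common data element to its local field at every site, which lets teams spot unmapped elements before extraction begins. The sketch below is purely illustrative; all element, site, and field names are hypothetical and do not come from our study.

```python
# Illustrative sketch of a cross-site data dictionary (hypothetical names).
# Each entry maps a common data element to its local EHR location at each
# site, so collaborating teams can reconcile data structures before
# extraction starts.

SHARED_DICTIONARY = {
    "diagnosis_date": {
        "description": "Date of primary cancer diagnosis",
        "value_type": "date (YYYY-MM-DD)",
        "site_fields": {
            "site_a": "tumor_registry.dx_date",
            "site_b": "onc_dx.diagnosis_dt",
        },
    },
    "ed_visit_flag": {
        "description": "Potentially preventable ED visit in follow-up window",
        "value_type": "boolean",
        "site_fields": {
            "site_a": "claims.ed_preventable",
            # site_b mapping not yet identified -- flagged for follow-up
        },
    },
}

def unmapped_elements(dictionary, sites):
    """Return element names that lack a local mapping at any listed site."""
    return sorted(
        name
        for name, entry in dictionary.items()
        if any(site not in entry["site_fields"] for site in sites)
    )

print(unmapped_elements(SHARED_DICTIONARY, ["site_a", "site_b"]))
# prints ['ed_visit_flag']
```

Circulating such a file among sites early makes gaps like the unmapped element above visible before they become extraction delays.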
The second lesson learned relates to the ordering and timing of implementing a single, multi‐site DUA. While we attempted to streamline the number of DUAs needed for this study, it would have been more efficient to establish DUAs with the data‐supplying sites first, leaving flexibility to negotiate, since the data‐supplying sites bore the most risk and had the most reservations regarding the language and terms of the DUA. We believe it may be beneficial to adopt an approved and standardized single DUA to facilitate multi‐site research. Similar to its sIRB policy, NIH could implement an analogous process for DUAs, which may help minimize the effort institutions must spend setting up these agreements.
Another lesson was the importance of having research team members thoroughly understand HIPAA and institutional guidelines as they relate to data and research. Because this research was the first of its kind at our institution, we were often given conflicting information by different teams (e.g., compliance, IT, IRB) on several topics, including which data elements could be released to internal and external teams. We found that this stemmed from high staff turnover, which created gaps in the expertise needed to make decisions affecting the research and ultimately delayed research activities. Internal policies also changed frequently, contributing to further delays. While these challenges are difficult to plan for, having research team members who are knowledgeable in policy and able to cite precedent, whether from prior institutional decisions or from the literature on similar research, allows the work to move forward.
4. CONCLUSIONS AND FUTURE DIRECTIONS
The challenges described here resulted in extensive delays in some areas of our current research and, while our study was not designed to capture the specific analytical impacts of these delays, this paper adds to the literature by documenting the complexity of multi‐site research and the extensive coordination it requires across multiple groups. To date, our study has involved contact and collaboration with 95 personnel who are not part of the research team, drawn from compliance (20.0%), contracts (32.6%), IRB (13.7%), IT and data (23.2%), and other (10.5%) teams. The volume of communication in the form of emails (~2000 sent and received) and meetings (164 h) is also staggering.
While multi‐site research is critical for rigorous and generalizable research findings, it remains challenging to execute due to a number of factors, which are further complicated when the research involves extensive, patient‐level health data. Despite these challenges, there have been moves in the right direction to make this type of research easier. At the UC level, these challenges motivated the creation of a system‐wide consortium, the University of California Biomedical Research Acceleration, Integration, and Development (UC BRAID). Formed in 2010, UC BRAID aims to accelerate research and improve health through collaboration, resource sharing, and infrastructure development. 15 Specific areas of focus include identifying more efficient ways to negotiate and finalize contracts, developing strategies to improve participant recruitment, and testing processes to increase and facilitate multi‐site research in the context of sIRB establishment processes. To streamline contracting, UC created a new master contract to improve turnaround time for contract execution; contracts executed under the UC Master Agreements and the Accelerated Clinical Trial Agreement (ACTA) were 45% faster than those that did not use a master agreement. 16 UC BRAID also created a toolkit of guidelines and best practices to aid researchers in creating a multi‐site EHR recruitment plan that is respectful to patients, minimizes the risk of loss of patient confidentiality, and helps researchers anticipate and prepare for patient feedback. UC BRAID is also working to facilitate the use of sIRB review, assess current processes, determine potential causes of variability and delay, and develop methods for improvement.
While this paper describes the challenges we experienced in the UC system, similar challenges may also be encountered by other academic health institutions looking to engage in multi‐site research involving patient data. From the outside, it is easy to assume that because our UC sites are governed by the same body, collaboration and data sharing would be more seamless; our study demonstrated that this is not the case. The barriers noted here related to data acquisition and sharing can be faced by any institution, because they depend on the resources, support, and policies at each institution, which often differ. Lessons learned on IRBs and DUAs from clinical data network projects such as the National Patient‐Centered Clinical Research Network, Clinical and Translational Science Awards (CTSA) Evolve to Next‐Gen ACT, CTSA National Clinical Cohort Collaborative, NIH All of Us, Epic Cosmos, and TriNetX would greatly enhance the literature in this area. As a possible solution to simplify contracting, the Federal Demonstration Partnership has partnered with 40 organizations to create data transfer and use agreements to facilitate research between US‐based nonprofit and government organizations. 17 If federally funded research were to adopt a standardized multi‐party DUA instead of site‐to‐site agreements, this could streamline the process while minimizing staff burden, costs, and delays.
We emphasize the specifics of our documentation and timeline of activities since, to our knowledge, this is not explicitly discussed in the literature on start‐up time required for non‐industry funded, non‐clinical trial, multi‐site research. However, this information is key to investigators planning their multi‐site proposals so they can proactively identify and budget for appropriate personnel time, address institutional processes, and adjust their research timelines to reflect feasible and realistic timeframes to successfully complete the research activities. Overall, this study's findings provide a detailed look at the complexities of conducting multi‐site research, particularly when it involves sensitive patient data. Our experience emphasizes the importance of pulling together a strong team of researchers with expertise in the specialized needs of the study to ensure successful troubleshooting and continued research progress. By sharing our experience, we hope to provide a valuable reference for other researchers, helping them to proactively plan and budget for the extensive time and resources required for productive collaboration involving multi‐institutional real‐world clinical big data.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest. Dr. Michael Hogarth maintains affiliations with Medeloop.ai and LifeLink Systems.
ACKNOWLEDGMENTS
Our research is supported by grants R01CA273058 and R01CA273058‐S1 from the National Cancer Institute. Contents of this manuscript are solely the responsibility of the authors and do not represent the official view of the National Cancer Institute.
Garcia B, Hogarth M, Wang Y, Zhu X, Tu S‐P. Multi‐site research using electronic health record data: Lessons learned from a case study. Learn Health Sys. 2025;9(4):e70039. doi: 10.1002/lrh2.70039
DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
REFERENCES
- 1. American Cancer Society . Cancer facts & figures. 2025. https://www.cancer.org/research/cancer‐facts‐statistics/all‐cancer‐facts‐figures/2025‐cancer‐facts‐figures.html. Accessed 2024.
- 2. Levit LA, Balogh E, Nass S, Ganz PA. Delivering High‐Quality Cancer Care: Charting a New Course for a System in Crisis. National Academy of Sciences; 2013. [PubMed] [Google Scholar]
- 3. Andreu‐Perez J, Poon CCY, Merrifield RD, Wong STC, Yang GZ. Big data for health. IEEE J Biomed Health Inform. 2015;19(4):1193‐1208. [DOI] [PubMed] [Google Scholar]
- 4. National Institutes of Health (NIH) . Final NIH policy on the use of a single institutional review board for multi‐site research NOT‐OD‐16‐094. 2016. https://grants.nih.gov/grants/guide/notice-files/not-od-16-094.html. Accessed 2024.
- 5. National Institutes of Health (NIH) . Single IRB for multi‐site or cooperative research. 2024. https://grants.nih.gov/policy‐and‐compliance/policy‐topics/human‐subjects/single‐irb‐policy‐multi‐site‐research. Accessed 2024.
- 6. Green JM, Goodman P, Kirby A, Cobb N, Bierer BE. Implementation of single IRB review for multisite human subjects research: persistent challenges and possible solutions. J Clin Transl Sci. 2023;7(1):e99. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Murray A, Pivovarova E, Klitzman R, Stiles DF, Appelbaum P, Lidz CW. Reducing the single IRB burden: streamlining electronic IRB systems. AJOB Empir Bioeth. 2021;12(1):33‐40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Mello MM, Triantis G, Stanton R, Blumenkranz E, Studdert DM. Waiting for data: barriers to executing data use agreements. Science. 2020;367(6474):150‐152. [DOI] [PubMed] [Google Scholar]
- 9. Devriendt T, Shabani M, Borry P. Data sharing in biomedical sciences: a systematic review of incentives. Biopreserv Biobank. 2021;19(3):219‐227. [DOI] [PubMed] [Google Scholar]
- 10. National Institutes of Health (NIH) . Data management & sharing policy overview. 2023. https://sharing.nih.gov/data‐management‐and‐sharing‐policy/about‐data‐management‐and‐sharing‐policies/data‐management‐and‐sharing‐policy‐overview#after. Accessed 2024.
- 11. Huesch MD, Mosher TJ. Using it or losing it? The case for data scientists inside health care. NEJM Catal. 2017;3(3). [Google Scholar]
- 12. Federal Bureau of Investigation (FBI) . Internet crime report 2023. 2023. https://www.ic3.gov/annualreport/reports/2023_ic3report.pdf. Accessed 2024.
- 13. U.S. Department of Health and Human Services . HHS Office for Civil Rights issues letter and opens investigation of change healthcare cyberattack. 2024. https://www.hhs.gov/about/news/2024/03/13/hhs‐office‐civil‐rights‐issues‐letter‐opens‐investigation‐change‐healthcare‐cyberattack.html. Accessed 2024.
- 14. Anderson NR, Lee ES, Brockenbrough JS, et al. Issues in biomedical research data management and analysis: needs and barriers. J Am Med Inform Assoc. 2007;14(4):478‐488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. University of California Biomedical Research Acceleration Integration & Development (UC BRAID) . https://www.ucbraid.org/. Accessed 2024.
- 16. Tran T, Bowman‐Carpio L, Buscher N, et al. Collaboration in action: measuring and improving contracting performance in the University of California contracting network. Res Manag Rev. 2017;22(1):28‐41. [PMC free article] [PubMed] [Google Scholar]
- 17. The Federal Demonstration Partnership . Data transfer and use agreements (DTUAs). 2025. https://thefdp.org/demonstrations-resources/dtuas/. Accessed 2025.
