BMC Medical Research Methodology. 2025 Nov 14;25:256. doi: 10.1186/s12874-025-02705-z

Characteristics of cohort data management systems (CDMS): a scoping review

Arezoo Abasi 1, Haleh Ayatollahi 2, Seyed Abbas Motevalian 3,4
PMCID: PMC12619332  PMID: 41239233

Abstract

Cohort studies are a core component of clinical research and gather large volumes of data over time. As digital technologies evolve, managing these data has become increasingly complex. Therefore, the use of cohort data management systems (CDMS) has been suggested to enhance data accuracy, confidentiality, and consistency. However, the functional and non-functional requirements of these systems have not been adequately addressed in the literature. This study aimed to identify the key functional and non-functional requirements of these systems. This scoping review was conducted in 2025. Articles were searched in the PubMed, Scopus, Web of Science, ProQuest, IEEE Xplore, and Cochrane Library databases, as well as Google Scholar. Initially, 843 articles were retrieved, and 45 articles published between 1st January 2005 and 31st June 2025 were finally selected. Nine functional and eight non-functional requirements were identified for CDMS. These systems are essential for facilitating cohort studies through data management, data processing, and analysis. Advanced tools like AI, visual dashboards, and automation have improved CDMS functionalities. The most important non-functional requirements included flexibility, security, and usability. CDMS must support comprehensive data operations, secure access, user engagement, and interoperability while ensuring scalability, privacy, and regulatory compliance. Requirements such as maintainability, although less emphasized, are essential for the long-term development and optimization of data management systems. Future research should focus on emerging technologies like blockchain and the Internet of Things (IoT) to enhance the security, integrity, and performance of CDMS.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-025-02705-z.

Keywords: Cohort studies, Data management systems, Functional requirements, Non-functional requirements

Introduction

Cohort studies have long been recognized as one of the fundamental pillars of clinical research, providing researchers with valuable insights into disease progression, treatment effectiveness, and health outcomes [1]. These studies involve tracking large groups of participants and collecting their data in complex datasets. With the rapid advancement of medical research and digital health tools, the volume of data collected in cohort studies has increased exponentially, leading to significant challenges in data management, validation, and sharing [2, 3]. Managing such large datasets using traditional tools is not only time-consuming and error-prone but also jeopardizes data integrity, consistency, and timely reporting of research findings [4, 5].

Given the increasing reliance of clinical research on data-driven methods, the need for efficient, secure, and scalable cohort data management systems (CDMS) has become more critical than ever [6–8]. These systems are designed to overcome traditional barriers in data management [9] and facilitate collection, validation, and analysis of cohort data at a large scale [6, 10–12]. They play a crucial role in enhancing data quality through automated cleaning, ensuring data security via controlled access mechanisms, and integrating seamlessly with other research tools [11, 13–15]. By defining data entry rules, automating validation checks, and resolving inconsistencies, these systems help researchers to optimize data processing workflows, reduce errors, and save significant time [12, 13, 16]. Moreover, using emerging technologies like blockchain in electronic medical records [17] and Internet of Things (IoT) devices [18] can enhance interoperability, streamline data validation, and build trust among stakeholders, ultimately increasing the efficiency and reliability of cohort studies.

Despite advancements in this field, existing literature fails to provide a comprehensive overview of CDMS requirements. Previous studies have often focused on specific diseases, isolated functionalities, or case-based implementations, leading to significant gaps in understanding the general functional requirements (FRs) and non-functional requirements (NFRs) of these systems across different clinical settings [9, 13, 19–21]. Many investigations have concentrated on disease-specific contexts, particularly within Human immunodeficiency virus (HIV)/Acquired immunodeficiency syndrome (AIDS) [22–27], oncology [28, 29], or cardiovascular research [6, 21, 30, 31], without assessing the flexibility or adaptability of CDMS to other therapeutic domains. For example, while CDMS platforms have been extensively used in cancer genomics for managing sequencing data and biosample tracking, these configurations cannot be effectively translated to mental health or infectious disease cohorts, where data types, volume, and workflows differ significantly [32, 33]. In addition, some research has concentrated on specific technical aspects such as data security, user interfaces, or system integration [13, 31, 34], without addressing broader characteristics such as interoperability, scalability, long-term sustainability, and real-world usability.

Studies involving rare disease cohorts like Primary Mitochondrial Diseases (PMD), MELAS syndrome, and genetic disorders often yield insights with limited relevance for large-scale, population-based research. Even conditions like Primary Sjögren’s Syndrome (pSS), where only rare complications are studied, face similar challenges that limit generalizability [4, 35, 36]. Integration with Electronic Health Records (EHRs) and third-party analytics tools remains an underexplored area, leaving persistent operational and technical gaps that hinder seamless data exchange and longitudinal patient tracking. Moreover, the limited inclusion of end-user perspectives, such as those of clinicians, data managers, and coordinators, means that many CDMS, though technically robust, may fall short in practical usability, user engagement, and workflow integration.

Therefore, challenges in handling large, heterogeneous datasets, achieving cross-system interoperability, and ensuring robust data quality and security remain prevalent. These issues stem from an incomplete understanding of both FRs and NFRs, as well as the real needs and goals of organizations. They impede research efficiency by necessitating extensive data cleaning and validation, delaying the generation of critical insights. Limited interoperability constrains data integration across disparate sources, while insufficient security and privacy measures jeopardize patient confidentiality, potentially restricting data sharing and collaborative research efforts. Additionally, usability deficiencies may hinder system adoption among clinical researchers and data managers, thereby limiting the platforms' effectiveness in supporting longitudinal cohort analyses [37–40]. As a result, the literature remains fragmented and insufficient for guiding the development of scalable, interoperable, and user-centered CDMS that meet the diverse needs of modern cohort studies [5, 24, 25, 37, 41–45].

As research tools become more sophisticated, ensuring the interoperability of CDMS with other clinical data analysis platforms remains an unresolved challenge [46, 47]. Evaluating the alignment of these systems with user needs, in terms of performance, accessibility, and usability, is essential for ensuring their successful adoption and scalability in clinical research environments [12, 48]. Addressing these gaps is important for optimizing the design and usability of CDMS, thereby supporting clinical researchers, software developers, and stakeholders in the healthcare sector [37, 45].

This study aimed to identify the FRs and NFRs of CDMS in order to inform the design and improve the performance of these systems for software developers, clinical researchers, and other key stakeholders in the healthcare domain.

Theoretical background

The development of CDMS is fundamentally rooted in the historical evolution of epidemiological cohort studies and the subsequent integration of computing technologies into medical research. Early cohort-like investigations, such as those by Snow (1854 cholera study) [49] and Weinberg (1913) [50], relied on manual data recording and analysis to study disease patterns and population-level risk. Frost formalized the methodological foundation for modern cohort studies in the 1930s through systematic longitudinal tracking, which emphasized the role of temporality in disease causation [51].

With the initiation of large-scale prospective studies, most notably the Framingham Heart Study (1948) [52] and the British Doctors’ Study (1951 to 2001) [53], the volume and complexity of longitudinal data necessitated more structured approaches to data management. The subsequent advent of computer-based systems in the late twentieth century, including relational databases and electronic data capture (EDC) tools, marked a transformative phase in cohort data infrastructure. These systems enabled scalable, efficient, and standard storage, retrieval, and processing of high-dimensional longitudinal data [54].

In the contemporary era, CDMS have evolved to incorporate data from electronic health records (EHRs), registries, genomics, and other real-world data sources [55, 56]. Their design is informed by theoretical frameworks from biostatistics (e.g., survival analysis, repeated measures modeling), epidemiology (e.g., bias control, confounding adjustment), and medical informatics (e.g., data standardization, privacy preservation, interoperability) [30, 37, 57]. Standard data models (e.g., the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)) and regulatory frameworks (e.g., General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA)) further define the architecture and function of CDMS [56, 58]. Conceptually, CDMS are specialized, computer-based platforms engineered to support the longitudinal, secure, and high-fidelity management of cohort data, facilitating causal inference, risk prediction, disease surveillance, and translational research.

Methods

This scoping review was conducted in 2025 using the Arksey and O’Malley methodology to identify FRs and NFRs of CDMS. The PRISMA-ScR guideline was used for screening and reviewing relevant articles [59].

Stage 1: identifying the research question

The research question for the present study was generated as follows:

  • What are the FRs and NFRs of CDMS?

Stage 2: identifying relevant studies

To identify relevant studies, databases including PubMed, Scopus, Web of Science, ProQuest, IEEE Xplore, Cochrane Library, and Google Scholar were searched. The search covered articles published from 1st January 2005 to 31st June 2025. Google Scholar was included as a supplementary search engine to ensure a comprehensive search of the literature. Unlike traditional databases, Google Scholar indexes a wide range of sources, including grey literature, conference proceedings, and dissertations, which might not be indexed elsewhere. This strategy was adopted to reduce publication bias and capture relevant studies that could otherwise be missed.

Additional relevant studies were retrieved through reference tracking, citation analysis, and reviewing academic records and publications of key authors. For developing the search strategy, a combination of keywords related to "cohort studies" and "data management" was used. These keywords included synonyms and MeSH terms, combined using the Boolean operators "AND" and "OR". The detailed search strategies for each database are provided in Supplementary Material 1.
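The way such a strategy combines terms can be sketched as follows: synonyms within one concept are joined with "OR", and the two concepts are then joined with "AND". This is only a minimal illustration. The synonym lists and the function name `build_query` are hypothetical placeholders, not the terms used in this review; the actual strategies are those reported in Supplementary Material 1.

```python
def build_query(concepts):
    """Join synonyms with OR inside each concept, then join concepts with AND."""
    groups = []
    for synonyms in concepts:
        quoted = [f'"{term}"' for term in synonyms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical synonym lists for the two concepts named in the text.
cohort_terms = ["cohort studies", "cohort study", "longitudinal studies"]
data_management_terms = ["data management", "data management system"]

query = build_query([cohort_terms, data_management_terms])
print(query)
```

Each database has its own field tags and syntax, so a string built this way would still need to be adapted per database.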

Stage 3: study selection

The study selection process was conducted in two phases: first, by screening titles and abstracts, and second, through full-text assessment. Two authors independently assessed each record to ensure consistency and to reduce selection bias. Discrepancies were resolved through discussion with the third author.

Studies were included if they met all of the following criteria: (1) published in English; (2) were peer-reviewed journal articles, dissertations, conference papers, or review articles; (3) used quantitative, qualitative, or mixed-methods approaches; and (4) explicitly described or evaluated features of cohort data management systems (CDMS).

Studies were excluded if they fell into any of the following categories: (1) letters to the editor; (2) study protocols without reported outcomes; and (3) articles whose full text was unavailable, even after contacting the authors.
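The inclusion and exclusion criteria above can be read as a single eligibility rule: a record is kept only if all inclusion criteria hold and no exclusion category applies. The sketch below expresses that rule in Python. The `Record` fields and category labels are assumptions introduced for illustration; in the review itself, the screening was performed by the authors, not by software.

```python
from dataclasses import dataclass

# Publication types and study designs named in the inclusion criteria.
INCLUDED_TYPES = {"journal article", "dissertation", "conference paper", "review article"}
INCLUDED_DESIGNS = {"quantitative", "qualitative", "mixed-methods"}
# Publication types named in the exclusion criteria.
EXCLUDED_TYPES = {"letter to the editor", "protocol without outcomes"}


@dataclass
class Record:
    """Hypothetical screening record; field names are illustrative."""
    language: str
    publication_type: str
    design: str
    describes_cdms_features: bool
    full_text_available: bool


def is_eligible(record: Record) -> bool:
    """Return True only if every inclusion criterion holds and no exclusion applies."""
    if record.publication_type in EXCLUDED_TYPES:
        return False
    if not record.full_text_available:
        return False
    return (
        record.language == "English"
        and record.publication_type in INCLUDED_TYPES
        and record.design in INCLUDED_DESIGNS
        and record.describes_cdms_features
    )


candidates = [
    Record("English", "journal article", "qualitative", True, True),
    Record("English", "letter to the editor", "qualitative", True, True),
    Record("English", "conference paper", "mixed-methods", True, False),
]
decisions = [is_eligible(r) for r in candidates]
print(decisions)
```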

The screening process followed the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) guideline [59]. After retrieving relevant articles, references were managed using EndNote software (version X21, Clarivate) to remove duplicates. Subsequently, the titles, abstracts, and full texts of the retrieved studies were thoroughly screened.

Two authors (AA, HA) independently screened all titles and abstracts based on the predefined inclusion and exclusion criteria. In the next phase, full texts of the potentially eligible studies were assessed. Disagreements between authors were resolved through discussion or consultation with the third author (SAM), if necessary. The reasons for exclusion at the full-text screening stage were documented to ensure transparency.

Stage 4: charting the data

A data extraction form was completed for all selected studies to systematically collect the required data. The form included the authors' names, year of the study, country of the study, research objectives, cohort information system coverage (disease or research focus), name of the CDMS, research methodology, and functional and non-functional requirements. The data extraction process was conducted independently by two authors, and the results were cross-checked to ensure accuracy and completeness.
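The extraction form and the cross-check between two reviewers can be sketched as a simple record type plus a field-by-field comparison. This is an illustrative sketch only: the class name `ExtractionForm`, its field names, and the helper `differing_fields` are assumptions, and the second reviewer's form below is invented to show a disagreement. The first form uses the entry for Bartolacelli et al. [27] from Table 1.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ExtractionForm:
    """Hypothetical representation of the form's fields."""
    authors: str
    year: int
    country: str
    research_objective: str
    context: str
    cdms_name: str
    methodology: str
    functional_requirements: list[str] = field(default_factory=list)
    non_functional_requirements: list[str] = field(default_factory=list)


def differing_fields(a: ExtractionForm, b: ExtractionForm) -> list[str]:
    """List the fields on which two independently completed forms disagree."""
    form_a, form_b = asdict(a), asdict(b)
    return [name for name in form_a if form_a[name] != form_b[name]]


reviewer_1 = ExtractionForm(
    authors="Bartolacelli et al.",
    year=2021,
    country="Italy",
    research_objective="Web-based data collection and management for cohort studies",
    context="Cardiovascular diseases",
    cdms_name="SE2030 (Stress Echo 2030)",
    methodology="Qualitative Study",
    functional_requirements=["Data Management", "Protection and Access Control"],
    non_functional_requirements=["Flexibility", "Reliability", "Security", "Usability"],
)
# Invented second form that omits one non-functional requirement.
reviewer_2 = ExtractionForm(**{
    **asdict(reviewer_1),
    "non_functional_requirements": ["Flexibility", "Reliability", "Security"],
})

to_discuss = differing_fields(reviewer_1, reviewer_2)
print(to_discuss)
```

Any field returned by the comparison would then be resolved through discussion between the reviewers.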

Stage 5: collecting, summarizing, and reporting the results

A structured data extraction form was used for all selected studies to systematically collect the required data in a consistent and comprehensive manner. After data extraction, the data were carefully reviewed and categorized to identify patterns and commonalities across studies. The FRs and NFRs were then summarized, described, and presented in a tabular format. To ensure accuracy and consistency, two reviewers independently extracted the data, and any discrepancies or ambiguities were discussed and resolved through reaching a consensus.

Results

Characteristics of the selected studies

Initially, a total of 843 articles were identified from searching databases and various scientific resources (Fig. 1). Using EndNote software, 319 duplicate articles were removed, leaving 543 articles for further review. In the screening phase, 126 articles were excluded because their titles did not match the research objectives. Then, during the abstract review, 342 articles were excluded because their abstracts were inconsistent with the study objectives. Subsequently, 66 articles were selected for full-text retrieval and review. After full-text evaluation, 60 articles were identified as eligible, but 15 of them were excluded from the final study for specific reasons: limited focus on cohort data management (three articles), a focus on policy-making (two articles), a focus on infrastructure development or implementation (five articles), limitation to specific diseases or registry-based studies (two articles), observational design without a focus on system functionalities (one article), and insufficient coverage of FRs and NFRs (two articles). Finally, 45 articles were found eligible for the final review and entered into the study.
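The final exclusion step can be tallied as a short worked example. The sketch below covers only the last stage, using the counts stated in the paragraph above; the variable names are illustrative.

```python
# Reasons for excluding articles after full-text evaluation, as stated in the text.
exclusion_reasons = {
    "limited focus on cohort data management": 3,
    "focus on policy-making": 2,
    "focus on infrastructure development or implementation": 5,
    "limited to specific diseases or registry-based studies": 2,
    "observational study without a focus on system functionalities": 1,
    "insufficient coverage of FRs and NFRs": 2,
}

eligible_after_full_text = 60
excluded = sum(exclusion_reasons.values())
included = eligible_after_full_text - excluded

print(excluded, included)  # 15 45
```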

Fig. 1. Article selection process based on the PRISMA-ScR guideline

As Fig. 2 shows, the year 2020 had the highest number of related publications (eight studies), highlighting a peak that may be attributed to an increase in research activity during the COVID-19 pandemic.

Fig. 2. Distribution of the selected studies based on the publication year

In terms of geographical distribution (Fig. 3), most of the reviewed studies were from the United States (eleven studies) and Germany (six studies), followed by the United Kingdom and Switzerland (three studies each). China, France, Italy, and South Korea each contributed two studies, while Malaysia, Thailand, the European Union, Cameroon, Bangladesh, Senegal, Spain, Ireland, Canada, the Netherlands, Japan, Poland, Uganda, and Sweden each contributed one study. A summary of the reviewed articles is presented in Table 1.
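As a quick check, the per-country counts given above can be summed and compared with the 45 included studies. The sketch below does this with the figures from the paragraph; the variable names are illustrative.

```python
# Per-country study counts as reported in the text.
studies_per_country = {
    "United States": 11,
    "Germany": 6,
    "United Kingdom": 3,
    "Switzerland": 3,
}
for country in ["China", "France", "Italy", "South Korea"]:
    studies_per_country[country] = 2
for country in [
    "Malaysia", "Thailand", "European Union", "Cameroon", "Bangladesh",
    "Senegal", "Spain", "Ireland", "Canada", "Netherlands", "Japan",
    "Poland", "Uganda", "Sweden",
]:
    studies_per_country[country] = 1

total_studies = sum(studies_per_country.values())
print(len(studies_per_country), total_studies)  # 22 45
```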

Fig. 3. Distribution of the selected studies based on the geographical areas

Table 1. Summary of the selected studies to compare research objectives, context, and methodology, and to report FRs and NFRs used for CDMS

No. 1: Zeeb et al. [22], 2025, Switzerland
Research objective: To establish a centralized storage and compute-orchestration solution for HIV-1 NGS data, ensuring streamlined bioinformatics workflows, reproducibility with version tracking, and adherence to FAIR data management principles while maintaining a design to accommodate evolving research needs
Context: HIV/AIDS and Neurocognitive disorders (in context of HIV genome-wide association study)
Name of CDMS: SHCND (Swiss HIV Cohort Study Viral Next Generation Sequencing Database)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations; 4. Protection and Access Control; 5. Participant and ID Management; 6. Longitudinal Cohort Studies Tracking; 7. User Support and Interaction; 8. Research Facilitation
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Performance Efficiency; 5. Reliability; 6. Compatibility; 7. Maintainability

No. 2: Klein et al. [60], 2025, USA
Research objective: To design and build a secure, participant-centric, privacy-preserving digital health research platform (DHRP) to support recruitment, enrollment, multimodal data collection, and long-term engagement for the All of Us longitudinal cohort study involving 1 million diverse U.S. participants
Context: Precision medicine and population-based longitudinal health research
Name of CDMS: DHRP (Digital Health Research Platform)
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Infrastructure and Operations; 3. Data Reporting, Visualization, and Reports Generation; 4. Participant and ID Management; 5. Longitudinal Cohort Studies Tracking; 6. User Support and Interaction
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Privacy; 5. Performance Efficiency; 6. Compatibility; 7. Maintainability

No. 3: Liu et al. [61], 2025, China
Research objective: To develop the Cardiopulmonary Physiotherapists Database System (CPPTherapists-DBS) to enhance data management and analysis for researching risk factors of postoperative pulmonary complications (PPCs) in cardiac surgery patients
Context: Postoperative pulmonary complications (PPCs) and Cardiac surgery-related conditions
Name of CDMS: CPPTherapists-DBS (Cardiopulmonary Physiotherapists Database System)
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations; 4. Protection and Access Control; 5. Data Reporting, Visualization, and Reports Generation; 6. Participant and ID Management; 7. Longitudinal Cohort Studies Tracking
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Performance Efficiency; 5. Compatibility; 6. Maintainability; 7. Privacy

No. 4: Tripathi et al. [28], 2024, USA
Research objective: To propose and implement Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, cloud-native data architecture that enables the integration, harmonization, and machine learning-readiness of large-scale, multimodal oncology datasets for personalized cancer research and treatment
Context: Cancer (specifically Lung cancer, Pancreatic cancer)
Name of CDMS: MINDS (Multimodal Integration of Oncology Data System)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Performance Efficiency; 5. Reliability; 6. Compatibility; 7. Privacy

No. 5: MacMullen et al. [35], 2024, USA
Research objective: To develop and implement an electronic data capture, integration, and visualization platform (MMFP-Tableau) to systematically assess patient-reported outcomes (PROs) in mitochondrial medicine, enabling precision care for primary mitochondrial disease (PMD) patients
Context: Primary Mitochondrial Diseases (PMD), Mitochondrial Encephalomyopathy, Lactic Acidosis, Stroke-like Episodes (MELAS) syndrome, and other neuromuscular disorders
Name of CDMS: MMFP-Tableau (Mitochondrial Medicine Frontier Program)
Research methodology: Quantitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations; 4. Protection and Access Control; 5. Data Reporting, Visualization, and Reports Generation; 6. Longitudinal Cohort Studies Tracking; 7. User Support and Interaction
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Performance Efficiency; 5. Compatibility

No. 6: Footer et al. [23], 2024, USA
Research objective: To design a publicly available, interactive epidemiological dashboard for sharing de-identified, aggregated HIV surveillance data from the Rakai Community Cohort Study (RCCS) while protecting participant privacy and facilitating research collaboration
Context: HIV/AIDS
Name of CDMS: RCCS (Rakai Community Cohort Study); RHSP (Rakai Health Sciences Program) Data Mart
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations; 4. Protection and Access Control; 5. Data Reporting, Visualization, and Reports Generation; 6. Participant and ID Management; 7. Longitudinal Cohort Studies Tracking; 8. Research Facilitation
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Privacy

No. 7: Abdullah et al. [30], 2024, Malaysia
Research objective: To describe the development and implementation of digital health management systems in The Malaysian Cohort (TMC) study, focusing on managing large-scale longitudinal data while ensuring data integrity, security, and scalability
Context: Cardiovascular disease and Colorectal cancer
Name of CDMS: TMC (The Malaysian Cohort); CIMS (Cohort Information Management System), which includes several subsystems: eCIMS (Electronic Cohort Information Management System), DIMS (Diet Information Management System), HeDIMS (Health Diary Information Management System), TSIMS (Tube and Sample Information Management System)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Protection and Access Control; 4. Data Reporting, Visualization, and Reports Generation; 5. Participant and ID Management; 6. Longitudinal Cohort Studies Tracking
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Reliability; 5. Privacy

No. 8: Kusejko et al. [62], 2023, Switzerland
Research objective: To describe the process and challenges of migrating a longitudinal cohort study, the Swiss Mother and Child HIV Cohort Study (MoCHiV), from a relational Oracle SQL database to the Research Electronic Data Capture (REDCap) system and provide a more user-friendly electronic data entry tool for study nurses and physicians, while maintaining data quality and streamlining the data management process
Context: HIV (Consisted of Women living with HIV, Mother–child pairs, Children living with HIV)
Name of CDMS: MoCHiV (Swiss Mother and Child HIV Cohort Study)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Protection and Access Control; 3. Data Reporting, Visualization, and Reports Generation
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Performance Efficiency; 5. Compatibility; 6. Privacy

No. 9: Schmidt et al. [63], 2023, Germany
Research objective: To provide an overview of key tools used in managing the SHIP (Study of Health in Pomerania) study and how these tools contribute to improving the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of the study’s data and metadata and the structural limitations of applying the FAIR principles, especially considering legal restrictions, and outlines future challenges and developments for the tools used in SHIP
Context: General population health with a wide range of Diseases (Cardiovascular Diseases, Metabolic Disorders, Mental Health Issues, Lung Diseases, Kidney Diseases)
Name of CDMS: SHIP (Study of Health in Pomerania), which consists of three cohorts: SHIP-START, SHIP-TREND, and SHIP-NEXT
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Infrastructure and Operations; 3. Participant and ID Management
Non-functional requirements: 1. Flexibility; 2. Security; 3. Usability; 4. Compatibility; 5. Maintainability

No. 10: Nye et al. [37], 2022, USA
Research objective: To outline the structure and processes involved in data collection, storage, and linkage for the SHARE study, a multicenter cohort study of pediatric palliative care patients and their parents and provide insights into the effective management of data for such studies, highlighting the advantages and challenges of the data management system used
Context: Pediatric palliative care
Name of CDMS: SHARE (Pediatric Palliative Care Research Network’s SHAred Data and REsearch) project
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Participant and ID Management; 3. Infrastructure and Operations; 4. User Support and Interaction
Non-functional requirements: 1. Flexibility; 2. Security; 3. Privacy

No. 11: Sinitkul et al. [47], 2022, Thailand
Research objective: To present the design and implementation of an electronic data capture and management system (EDCM) tailored for a birth cohort study on children's environmental health in Thailand, aimed at supporting efficient data collection and study monitoring
Context: Birth cohort study
Name of CDMS: WICARE
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Protection and Access Control; 3. Infrastructure and Operations
Non-functional requirements: 1. Flexibility; 2. Usability; 3. Performance Efficiency; 4. Reliability

No. 12: Pezoulas et al. [4], 2022, European Union
Research objective: To delineate the clinical picture and unmet needs of primary Sjögren's Syndrome (pSS) through the development and application of the HarmonicSS platform, which utilizes a federated approach for data management, AI model development, and cohort study analysis
Context: Primary Sjögren's Syndrome (pSS) and its association with lymphoma
Name of CDMS: HarmonicSS (HARMONization and integrative analysis of regional, national and international Cohorts on primary Sjögren’s Syndrome)
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Protection and Access Control; 3. Data Processing and Analysis; 4. User Support and Interaction
Non-functional requirements: 1. Flexibility; 2. Security; 3. Performance Efficiency; 4. Reliability; 5. Privacy

No. 13: Li et al. [64], 2022, China
Research objective: To describe and assess the current state of data management systems in cohort studies and to provide insights into how these systems facilitate data sharing and integration, ultimately enhancing reproducibility and the identification of small effects across various health studies
Context: Cardiovascular diseases, Neurodegenerative diseases, Mental disorders, Infectious diseases, Cancer
Name of CDMS: XNAT (Extensible Neuroimaging Archive Toolkit); BBRC Cohort Platform (Barcelonaβeta Brain Research Center); UK Biobank; ABCD (Adolescent Brain Cognitive Development); IMAGEN
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Data Reporting, Visualization, and Reports Generation
Non-functional requirements: 1. Compatibility; 2. Security; 3. Usability; 4. Privacy

No. 14: Huguet et al. [41], 2021, Spain
Research objective: To describe a data management system designed for cohort studies, particularly those involving large-scale neuroimaging studies related to neurodegenerative diseases and to enhance the quality control, automation, and reproducibility of imaging data processing and validation workflows in cohort studies, while ensuring scalability and ease of use
Context: Neurodegenerative diseases
Name of CDMS: ALFA project (Alzheimer's and Families); XNAT (Extensible Neuroimaging Archive Toolkit)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Data Reporting, Visualization, and Reports Generation; 4. Infrastructure and Operations
Non-functional requirements: 1. Performance Efficiency; 2. Maintainability; 3. Flexibility; 4. Usability

No. 15: Feric et al. [65], 2021, USA
Research objective: To present a complete, secure, and customizable open-source web-based software architecture that aids data administrators, scientists, and statisticians in various tasks surrounding data harmonization projects including the cleaning, storage, transformation, visualization, and analysis of health data collected across multiple cohort studies
Context: -
Name of CDMS: Harmonizing data from three NIH-supported birth cohorts includes NBCS (The Navajo Birth Cohort Study), The New Hampshire Birth Cohort, PROTECT (The Puerto Rico Testsite for Exploring Contamination Threats)
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Protection and Access Control; 4. Data Reporting, Visualization, and Reports Generation; 5. User Support and Interaction
Non-functional requirements: 1. Flexibility; 2. Compatibility; 3. Maintainability; 4. Performance Efficiency; 5. Reliability; 6. Security; 7. Usability; 8. Privacy

No. 16: DeMerle et al. [42], 2021, USA
Research objective: To develop and test VESPRE, a virtually enabled biorepository system embedded in the electronic health record (EHR), aimed at encouraging precision medicine in acute care by overcoming barriers related to scale, detail, time, and cost
Context: Sepsis
Name of CDMS: VESPRE (virtually enabled biorepository and electronic health record (EHR)-embedded, scalable cohort for precision medicine)
Research methodology: Quantitative Study
Functional requirements: 1. Data Management; 2. Participant and ID Management; 3. Protection and Access Control; 4. Data Processing and Analysis
Non-functional requirements: 1. Performance Efficiency; 2. Flexibility; 3. Security; 4. Privacy

No. 17: Bartolacelli et al. [27], 2021, Italy
Research objective: To discuss the implementation and advantages of using a web-based computerized data collection and management system (specifically REDCap) for cohort studies like SE2030 and highlighting how such systems can optimize data management processes, ensure data quality, and enhance the overall efficiency of research studies
Context: Cardiovascular diseases
Name of CDMS: SE2030 (Stress Echo 2030)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Protection and Access Control
Non-functional requirements: 1. Flexibility; 2. Reliability; 3. Security; 4. Usability

No. 18: Adhikari et al. [39], 2021, Canada
Research objective: To describe data harmonization strategies that enable the creation of comparable datasets across two cohort studies, facilitating the pooling of data for enhanced research capabilities and to answer clinically relevant research questions while providing a large sample size, additional variables, and measurements from multiple different scales
Context: Preterm birth
Name of CDMS: AOF (All Our Families); APrON (Alberta Pregnancy Outcomes and Nutrition); SAGE (Secondary Analysis to Generate Evidence)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis
Non-functional requirements: 1. Performance Efficiency; 2. Flexibility

No. 19: Tran et al. [11], 2020, France
Research objective: To propose a new infrastructure for clinical research, termed the COOP’ e-cohort, which aims to facilitate patient-centered research, enhance data reusability, address research questions not tackled by traditional clinical studies, and improve the efficiency of data collection and management
Context: -
Name of CDMS: COOP’ e-cohort (COllaborative Open Platform E-cohort); ComPaRe (Community of Patients for Research)
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Research Facilitation
Non-functional requirements: 1. Performance Efficiency; 2. Flexibility

No. 20: Smith et al. [36], 2020, USA
Research objective: To present a distributed analytics framework for performing precision medicine queries, specifically using a real-world use case from UCLA, and to demonstrate scalability and efficiency in distributed loading, storage, access, and aggregate queries for increasing numbers of samples in genomic studies
Context: Genetic disorders and diseases
Name of CDMS: ODA (Omics Data Automation) framework
Research methodology: Quantitative Study
Functional requirements: 1. Data Management; 2. Data Processing and Analysis; 3. Infrastructure and Operations
Non-functional requirements: 1. Performance Efficiency; 2. Maintainability; 3. Reliability; 4. Flexibility; 5. Usability; 6. Privacy

No. 21: Kim et al. [24], 2020, Korea
Research objective: To improve the quality of data management in the Korea HIV/AIDS Cohort Study by establishing a cohort-customized data quality management strategy that reflects the characteristics of cohort subjects and data
Context: HIV/AIDS
Name of CDMS: Korea HIV/AIDS Cohort
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Protection and Access Control; 3. Infrastructure and Operations; 4. Research Facilitation
Non-functional requirements: 1. Performance Efficiency; 2. Reliability; 3. Security; 4. Usability

No. 22: Sandifer et al. [34], 2020, USA
Research objective: To describe the design and implementation of the Gulf of Mexico Community Health Observing System (GoM CHOS) to collect, curate, and disseminate health-related data and biospecimens from residents, particularly focusing on the most vulnerable populations for enhancing the understanding of health effects related to disasters, improve public health responses, and increase individual and community resilience
Context: Various diseases and health concerns related to disasters emphasizing COVID-19, chronic diseases, and mental health issues
Name of CDMS: GoM CHOS (Gulf of Mexico Community Health Observing System)
Research methodology: Mixed-Methods approach
Functional requirements: 1. Data Management; 2. Data Reporting, Visualization, and Reports Generation; 3. Longitudinal Cohort Studies Tracking
Non-functional requirements: 1. Performance Efficiency; 2. Flexibility; 3. Compatibility; 4. Reliability; 5. Security

No. 23: Butters et al. [66], 2020, UK
Research objective: Addressing the issue of data curation debt in cohort studies, emphasizing the need for strategic recommendations to manage and reduce debt effectively, the importance of proper data management practices, and the risks associated with neglecting data curation
Context: -
Name of CDMS: Not mentioned
Research methodology: Qualitative Study
Functional requirements: 1. Data Management; 2. Protection and Access Control; 3. Data Reporting, Visualization, and Reports Generation
Non-functional requirements: 1. Compatibility; 2. Flexibility; 3. Security; 4. Usability

24 Lacey Jr et al. [29] 2020 USA To describe the implementation and benefits of a centralized, integrated, and automated data management system (data warehouse) for a cohort study (CTS) that facilitates data sharing, analysis, and collaboration among users Cancers CTS (California Teachers Study) Qualitative Study

1. Data Management

2. Data Processing and Analysis

3. Protection and Access Control

4. Data Reporting, Visualization, and Reports Generation

5. User Support and Interaction

1. Flexibility

2. Security

3. Reliability

4. Privacy

25 Zondergeld et al. [8] 2020 Netherlands To describe the data management strategies employed in the YOUth cohort study, particularly focusing on ensuring the privacy, confidentiality, safety, and quality of the data collected from participants, especially children Child health research Yoda (a research data management system), the UMCU Research Data Platform, and the Utrecht Biobank Qualitative Study

1. Data Management

2. Participant and ID Management

3. Protection and Access Control

1. Reliability

2. Flexibility

3. Security

4. Privacy

26 Bauermeister et al. [5] 2020 UK To emphasize the role of the DPUK Data Portal in enhancing data access and management for cohort studies, particularly in the context of dementia research Dementia DPUK (The Dementias Platform UK) Data Portal Qualitative Study

1. Data Management

2. Data Processing and Analysis

3. Protection and Access Control

4. Infrastructure and Operations

5. User Support and Interaction

1. Flexibility

2. Security

3. Usability

4. Privacy

27 Murtagh et al. [67] 2018 UK To discuss the strategic investment in optimizing research data use, especially in integrating health and social care data, to enhance evidence-based clinical care and public health, and the need for effective data management systems in cohort studies to address the increasing volume and complexity of data generated from various sources - METADAC (Managing Ethico-social, Technical and Administrative Issues in Data Access) Qualitative Study

1. Data Management

2. Participant and ID Management

3. Longitudinal Cohort Studies Tracking

4. Data Reporting, Visualization, and Reports Generation

5. Infrastructure and Operations

1. Reliability

2. Flexibility

3. Security

4. Privacy

28 Bialke et al. [20] 2018 Germany To support researchers with their research projects by providing a flexible and easy-to-use software solution (the Toolbox for Research) for comprehensive data management in cohort studies Burns MOSAIC (Modular Approach to Data Management in Epidemiological Studies) Mixed-Methods approach

1. Data Management

2. Data Reporting, Visualization, and Reports Generation

3. Infrastructure and Operations

1. Security

2. Reliability

3. Usability

4. Privacy

29 Koshiba et al. [32] 2018 Japan To outline the methodologies and findings from the Tohoku Medical Megabank Project (TMM), particularly focusing on integrating multi-omics analyses (genome, epigenome, and transcriptome) from cohort studies and to identify effective biomarkers for personal healthcare and elucidate the relationships between genetic variants, metabolome data, and disease phenotypes Metabolic disorders

TMM CommCohort Study (TMM Community-Based Cohort Study)

TMM BirThree Cohort Study (TMM Birth and Three-Generation Cohort Study)

jMorp (Japanese Multi Omics Reference Panel)

Mixed-Methods approach

1. Data Management

2. Data Processing and Analysis

1. Performance Efficiency

2. Flexibility

3. Usability

30 Barger et al. [68] 2018 France To describe the development and implementation of an electronic Patient-Reported Outcomes (ePRO) system in a cohort study of people living with HIV (PLWH) and to improve patient-physician communication, enhance patient engagement in HIV care, and monitor health-related quality of life (HRQoL) among participants HIV and other chronic diseases (diabetes and heart disease) ARPEGE (ANRS CO3 Aquitaine cohort’s data capture and visualization system) Mixed-Methods approach

1. Data Management

2. Data Reporting, Visualization, and Reports Generation

3. Data Processing and Analysis

4. Protection and Access Control

5. User Support and Interaction

1. Flexibility

2. Security

3. Usability

4. Privacy

31 Steiner et al. [43] 2016 Switzerland To present and evaluate the odk_planner, an open-source, user-friendly, and flexible data management tool designed for real-time monitoring of cohort studies, particularly in under-resourced settings, and to showcase how odk_planner extends the ODK software package by offering additional features like automated SMS reminders and integrating multiple data collection modalities to improve data management and communication in scientific studies Tuberculosis odk_planner Qualitative Study (Descriptive Approach)

1. Data Management

2. Longitudinal Cohort Studies Tracking

3. Data Reporting, Visualization, and Reports Generation

4. Infrastructure and Operations

5. User Support and Interaction

1. Performance Efficiency

2. Usability

3. Privacy

32 Rahman et al. [69] 2016 Bangladesh To describe the development and implementation of a data management system for the ANISA cohort study, focusing on challenges encountered and lessons learned during data collection, entry, and management across multiple sites and create a centralized, efficient, and reliable system that ensures data quality and consistency across all participating sites Respiratory and diarrheal infections in newborns and infants ANISA (Aetiology of Neonatal Infection in South Asia) data management system Qualitative Study (Case Study)

1. Data Management

2. Infrastructure and Operations

1. Performance Efficiency

2. Reliability

3. Flexibility

4. Security

5. Usability

33 Byun et al. [44] 2016 Korea To establish the first web-based database (KORCC) for collecting and analyzing clinicopathological characteristics of renal cell carcinoma (RCC) cases in Korea to improve the understanding and management of RCC through a comprehensive data collection system Renal Cell Carcinoma (RCC) KORCC (KOrean Renal Cell Carcinoma) Quantitative Study

1. Data Management

2. Longitudinal Cohort Studies Tracking

3. Data Processing and Analysis

4. Infrastructure and Operations

1. Reliability

2. Security

3. Usability

34 Bialke et al. [21] 2015 Germany To demonstrate how generic software modules developed by the MOSAIC project can be utilized to meet essential requirements for implementing a Trusted Third Party (TTP) in the management of participant-identifying data for cohort studies and registries and facilitating compliance with ethical and legal data protection requirements while reducing the technical efforts needed for TTP implementation Cardiovascular diseases A part of CDMS, Trusted Third Party (TTP) Dispatcher, which is composed of functionalities from independent service modules (E-PIX Identity Management, gICS Informed Consent Management, gPAS Pseudonym Management) Qualitative Study

1. Data Management

2. Protection and Access Control

3. Infrastructure and Operations

4. User Support and Interaction

1. Maintainability

2. Flexibility

3. Compatibility

4. Privacy

35 Bialke et al. [6] 2015 Germany To deduce core elements of Clinical Data Management (CDM) for cohort studies and registries in terms of functional and non-functional requirements, while highlighting associated challenges related to IT security and legal data protection regulations Cardiovascular diseases MOSAIC (Modular Approach to Data Management in Epidemiological Studies) Qualitative Study

1. Data Management

2. Participant and ID Management

3. Data Processing and Analysis

4. Protection and Access Control

5. Infrastructure and Operations

1. Flexibility

2. Security

3. Usability

36 Lu et al. [70] 2014 USA To characterize a population of over 12,000 chronic viral hepatitis patients through automated electronic health record (EHR) and chart abstraction data collection and improving cohort identification for chronic HBV and HCV through enhanced data management methods, specifically using CART models to refine electronic cohort identification criteria Chronic Hepatitis B (HBV), Chronic Hepatitis C (HCV) CHeCS (Chronic Hepatitis Cohort Study) Mixed-Methods approach

1. Data Management

2. Participant and ID Management

3. Longitudinal Cohort Studies Tracking

4. Protection and Access Control

1. Performance Efficiency

2. Reliability

3. Privacy

37 Grabe et al. [14] 2014 Germany To explore the integration of biological markers (genomics, transcriptomics, metabolomics, proteomics) into medical research to discover, validate, and implement novel biomarkers that improve individualized diagnostics and therapies through cohort profile Hypertension, Depression, Inflammation, Pathological states identified through incidental findings, Dilated cardiomyopathy Cohort data system is part of the Greifswald Approach to Individualized Medicine (GANI_MED) project Mixed-Methods approach

1. Data Management

2. Participant and ID Management

3. Longitudinal Cohort Studies Tracking

4. Data Processing and Analysis

5. Infrastructure and Operations

6. User Support and Interaction

1. Compatibility

2. Maintainability

3. Flexibility

4. Security

5. Privacy

38 Asiki et al. [26] 2013 Uganda To detail the data management systems used in cohort studies, particularly focusing on the data collected through the GPC (General Population Cohort) in Uganda which includes aspects related to data capture, analysis, and its application for understanding disease trends, risk factors, and outcomes HIV/AIDS, Non-Communicable Diseases (NCDs), Sexually Transmitted Infections (STIs), Infectious diseases GPC (General Population Cohort) Mixed-Methods approach

1. Data Management

2. Participant and ID Management

3. Infrastructure and Operations

4. User Support and Interaction

1. Performance Efficiency

2. Privacy

39 Abugessaisa et al. [45] 2013 Sweden To tackle the challenges biomedical and clinical researchers encounter in a translational research environment. The paper focuses on implementing the CDC (Clinical Data Commons) as a platform for integrating clinical and biomedical databases specifically for studying Rheumatoid Arthritis (RA) and supporting cohort studies Rheumatoid Arthritis (RA) CDC (Clinical Development Center) Mixed-Methods approach

1. Data Management

2. Data Processing and Analysis

3. User Support and Interaction

1. Reliability

2. Flexibility

3. Security

4. Usability

5. Privacy

40 Wawrzyniak et al. [7] 2011 Poland To present a new advanced method for the practical design of data collection management and a management system that ensures intensive data quality control and data security for high-quality data assessment of health indicators in large-scale population studies, specifically in the context of the PONS project - PONS project (The Polish-Norwegian Study) Quantitative Study

1. Data Management

2. Data Reporting, Visualization, and Reports Generation

1. Performance Efficiency

2. Reliability

3. Flexibility

4. Security

5. Usability

41 Garzotto et al. [71] 2011 Italy To evaluate the implementation of a database collection tool specifically designed for cohort studies in clinical settings, focusing on assessing its usability and effectiveness in managing data Diabetes, Cardiovascular diseases, Respiratory diseases, Cancer NEFROINT System Qualitative Study

1. Data Management

2. Protection and Access Control

1. Performance Efficiency

2. Reliability

3. Flexibility

4. Security

5. Usability

42 Newman et al. [25] 2011 Cameroon To describe the development, implementation, and impact of the IeDEA Central Africa region cohort's data management system for HIV/AIDS patient care in improving patient management, continuity of care, and fulfilling reporting requirements through systematic data collection, the creation of an electronic database, and the use of a reporting instrument to track patient progress and trends HIV/AIDS IeDEA Central Africa region database Qualitative Study

1. Data Management

2. Participant and ID Management

3. Longitudinal Cohort Studies Tracking

4. Data Reporting, Visualization, and Reports Generation

1. Reliability

2. Usability

3. Privacy

43 Mbacké et al. [72] 2008 Senegal To emphasize the contributions of longitudinal community studies (specifically Demographic Surveillance Systems—DSS) to public health knowledge in Africa, focusing on their impact on understanding health and disease ecology, addressing ethical challenges, improving research capacity, and enhancing data management systems for better health interventions Child Mortality, AIDS Epidemic, Childhood Illnesses Household Registration System (HRS) and INDEPTH Network Qualitative Study (Descriptive Approach)

1. Data Management

2. Longitudinal Cohort Studies Tracking

3. Data Processing and Analysis

4. Data Reporting, Visualization, and Reports Generation

1. Performance Efficiency

2. Flexibility

3. Security

4. Usability

44 O'Mahony et al. [73] 2007 Ireland To describe the methodologies and instruments used in the Lifeways Cross-Generation Cohort Study to collect and manage data related to the health and development of participants, specifically focusing on the babies, mothers, fathers, and grandparents over a longitudinal period Myocardial Infarction, Stroke, Cancer, Asthma Lifeways Cross-Generation Cohort Study Mixed-Methods approach

1. Data Management

2. Participant and ID Management

3. Longitudinal Cohort Studies Tracking

1. Reliability

2. Security

3. Usability

45 Holle et al. [31] 2005 Germany To describe the cooperation, quality assurance, data management practices, and perspectives of the KORA research platform, which focuses on population-based epidemiological studies Cardiovascular diseases KORA (Cooperative Health Research in the Region Augsburg) Qualitative Study (Descriptive Approach)

1. Data Management

1. Performance Efficiency

2. Flexibility

3. Security

4. Privacy

The reviewed studies span a diverse range of disease contexts, reflecting the multifaceted applications of CDMS across global health research. Figure 4 illustrates major disease categories addressed in the reviewed studies. The distribution highlights an emphasis on infectious diseases, cancers, and cardiovascular conditions, reflecting global health priorities for creating CDMS with precise FRs and NFRs. Several studies also targeted underrepresented areas such as autoimmune, mitochondrial, and pediatric disorders, underscoring the diversity and evolving scope of cohort-based research.

Fig. 4. Major disease categories and conditions addressed in the reviewed studies

Figure 5 illustrates the overall distribution of FRs and NFRs across the reviewed studies. It also helps to identify gaps where certain requirements might be underrepresented or less emphasized.

Fig. 5. Distribution of functional and non-functional requirements of CDMS in the reviewed studies

Functional Requirements (FRs)

FRs define the specific functions, behaviors, and services that a system or system component must provide to meet users' and stakeholders' needs. FRs describe what the system must do, such as specific actions, tasks, or interactions, without specifying how these functions are achieved. According to ISO/IEC/IEEE 29148:2018, functional requirements focus solely on the required capabilities of the system, such as processing data, supporting user transactions, or generating reports to fulfill its intended purpose [74].

The findings showed that the main FRs for CDMS fell into nine categories: data management; participant and ID management; data processing and analysis; longitudinal cohort studies tracking; protection and access control; data reporting, visualization, and reports generation; infrastructure and operations; user support and interaction; and research facilitation. These categories are explained below in more detail. Figure 6 presents a hierarchical breakdown of FRs and their respective subcategories.

Fig. 6. Breakdown of Main Functional Requirements of CDMS by Subcategories (footnote 1)

Data management

Data management specifies the basic functions a system must provide to handle data effectively. These consist of data collection from various sources, storage and retrieval for efficient access, integration of data from heterogeneous systems, curation and harmonization, metadata management for organizing contextual information, and quality assurance to maintain data accuracy and completeness. Across 38 studies [4–8, 11, 14, 20, 22–25, 27–32, 34, 35, 37, 39, 42–44, 47, 60–63, 65, 66, 68–73], diverse strategies for data collection were discussed, ranging from manual entry [7, 72, 73] to advanced electronic systems including Research Electronic Data Capture (REDCap), Study of Health in Pomerania (SHIP), SHIPdesigner, and the Open Data Kit planner (odk_planner) [27, 43, 62, 63]. Electronic health records, automated summarization, and clinical data management systems could improve accuracy, efficiency, and adherence to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles [5, 6, 63, 70]. Tools like VESPRE (virtually enabled biorepository and electronic health record (EHR)-embedded) for biobank tracking [20, 42] and self-reported data collection methods [8, 14, 68] enhanced research capabilities. Data sources spanned EHRs, genomics, wearable devices, biomarkers, cancer genomics platforms, and custom solutions like eCIMS and HeDIMS [22, 23, 28, 30, 35, 60, 61].

Efficient data storage and retrieval were addressed in 28 studies [5–8, 20, 22, 23, 25–30, 35–37, 45, 47, 60–63, 65–67, 69, 70, 73]. Solutions such as Next-Generation Sequencing (NGS) databases, MySQL, the Rakai Health Sciences Program (RHSP) Data Mart, and the Training Support Information Management System (TSIMS) [22, 23, 30, 60, 61], as well as queryable warehouses and cloud-based platforms [28, 35, 36], provided scalability and analytical power. Data integration was emphasized in 28 studies [7, 11, 20–22, 26–29, 31, 32, 35–37, 41, 42, 45, 60–65, 67–69, 71, 72], where linking various components of the Health Information System (HIS), Electronic Health Records (EHRs), the Picture Archiving and Communication System (PACS), survey data, and biobank repositories improved patient profiling, precision medicine, and longitudinal analysis [21, 22, 32, 42, 60, 61, 67, 71]. Platforms like Aetiology of Neonatal Infection in South Asia (ANISA) [69] facilitated data integration.

Curation and harmonization were reported in 23 studies [4–8, 22, 23, 26–28, 30, 31, 34, 39, 42, 47, 60, 61, 65, 66, 69–71], which emphasized cleaning, validation, standardization, and harmonization across datasets. Automated checks, real-time validation, and harmonization tools maintained consistency in multi-center studies [7, 22, 26, 28, 42, 60, 65, 69]. Standardization practices, including the International Classification of Diseases, 10th Revision (ICD-10), Health Level Seven (HL7), and genomic data models, were crucial for interoperability [28, 61], while Classification and Regression Tree (CART) models and alert systems supported EHR preprocessing and validation [34, 70, 71]. Metadata management, as discussed in four studies [22, 29, 45, 64], featured standardized cataloging, versioning, and graph databases to enhance reproducibility and discovery. Data quality assurance was examined in 28 studies [6–8, 20–22, 24–27, 30, 31, 35–37, 39, 41, 44, 45, 61–63, 67–72] and relied on automated checks, audit trails, error management protocols, and compliance with frameworks like ISO/IEC 27001 and GDPR to ensure accuracy, reliability, and security [6, 8, 20, 22, 25, 27, 30, 31, 35, 37, 39, 41, 44, 61–63, 69–72].
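To make the idea of automated checks more concrete, the following minimal Python sketch applies rule-based validation to a single cohort record. It is an illustration only: the field names (`age_years`, `sex`, `systolic_bp`), the plausibility thresholds, and the `validate_record` function are hypothetical and are not taken from any of the reviewed systems.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    """One plausibility rule for a single field (hypothetical structure)."""
    field: str
    description: str
    check: Callable[[Any], bool]


def _in_range(low: float, high: float) -> Callable[[Any], bool]:
    """Build a check that accepts numbers within [low, high]."""
    return lambda v: isinstance(v, (int, float)) and low <= v <= high


# Assumed example rules; a real study would derive these from its data dictionary.
RULES = [
    Rule("age_years", "age must be between 0 and 120", _in_range(0, 120)),
    Rule("sex", "sex must be 'F' or 'M'", lambda v: v in {"F", "M"}),
    Rule("systolic_bp", "systolic BP must be 50-300 mmHg", _in_range(50, 300)),
]


def validate_record(record: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the record passed."""
    problems = []
    for rule in RULES:
        value = record.get(rule.field)
        if value is None:
            problems.append(f"{rule.field}: missing value")      # completeness check
        elif not rule.check(value):
            problems.append(f"{rule.field}: {rule.description}")  # plausibility check
    return problems


good = {"age_years": 54, "sex": "F", "systolic_bp": 128}
bad = {"age_years": 154, "sex": "F"}
print(validate_record(good))
print(validate_record(bad))
```

Keeping the rules as data rather than hard-coded conditions is one possible design choice; it would allow the same checks to run at data entry (real-time validation) and again in batch before analysis.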

Participant and ID management

Participant and ID management involved organizing and tracking individuals within a study and ensuring accurate linkage between participants and their data. It included participant management and automatic ID generation to uniquely and securely identify each subject while maintaining confidentiality. Twelve studies [6, 8, 14, 23, 25, 26, 42, 60, 63, 67, 70, 73] emphasized the importance of automated registration, real-time monitoring, and efficient consent procedures as key requirements for ensuring data quality and retention. Advanced EHR systems like VESPRE [42] and CART-based automated cohort creation [70] streamlined participant management by reducing manual effort and improving efficiency, especially for enhancing real-time data monitoring and informed decision-making [67]. Engagement tools, including reminders, newsletters, and interactive strategies, improved adherence and long-term retention [25, 26, 73]. Practical systems like SHIP and a web-based modular control and documentation system offered modular, flexible solutions for tracking participants, managing workflows, and ensuring anonymization [23, 26, 60]. The importance of automatic ID generation in cohort studies was demonstrated in four studies [22, 30, 37, 61]. These studies used a Universally Unique Identifier version 4 (UUIDv4) for each NGS record and related files, registration or medical record numbers for patient linkage, and 10-digit SIDs for participants [22, 30, 37, 61].
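Automatic ID generation of the kind described above can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration rather than the approach of any reviewed system: `new_record_id` and `new_participant_id` are hypothetical helper names, and drawing the 10-digit identifier at random is our own assumption, since the reviewed studies do not state how their 10-digit IDs were produced.

```python
import secrets
import uuid


def new_record_id() -> str:
    """Return a random UUID version 4 as a string, e.g. for one sequencing record."""
    return str(uuid.uuid4())


def new_participant_id(existing: set[str]) -> str:
    """Draw a random, zero-padded 10-digit ID that is not already in use.

    Random assignment is an assumption made for this sketch; a study could
    equally use sequential or site-prefixed numbers.
    """
    while True:
        candidate = f"{secrets.randbelow(10**10):010d}"
        if candidate not in existing:
            existing.add(candidate)  # reserve the ID so it cannot be issued twice
            return candidate


used_ids: set[str] = set()
print(new_record_id())
print(new_participant_id(used_ids))
```

Because neither identifier encodes personal attributes such as a name or date of birth, identifiers of this kind can be shared with analysts without revealing who the participant is, provided the linkage table is stored separately.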

Data processing and analysis

Data processing and analysis involved transforming raw data into meaningful insights through structured processing techniques and analytical methods. It also included AI workflows and disease subtyping to uncover patterns, support clinical decision-making, and enable precision medicine. Seven studies [6, 22, 35, 41, 64, 65, 72] considered data processing techniques, including analytical processing, data manipulation, and data transformation, in large-scale research projects. Anonymization, data quality control, configurable open-source architectures, bioinformatics pipelines, and a hypermutation filter ensured security, reliability, and analytical rigor [6, 22, 35, 41, 64, 65, 72].

Cohort studies relied on data entry validation, consistency checks, and effective provisioning systems to meet CDMS needs [6, 72]. Sixteen studies [5, 14, 22, 23, 28–30, 32, 35, 36, 39, 44, 45, 61, 64, 72] identified the importance of data analysis tools, including statistical analyses, search systems, and query analysis frameworks, in cohort research. Statistical and computational tools were essential for epidemiological, genomic, and clinical research because they enabled harmonization, query analysis, and reproducible results [5, 14, 22, 23, 28–30, 32, 35, 36, 39, 44, 45, 61, 64, 72]. Advanced analyses included biomarker identification, multi-omic data integration, centralized search systems, genome-wide association studies, phylogenetics, drug resistance detection, predictive modeling, HIV incidence estimation, cardiovascular risk prediction, and cancer research [22, 23, 28–30, 32, 35, 61]. Five studies [4, 28, 30, 42, 68] addressed the role of artificial intelligence (AI) workflows and disease subtyping in advancing cohort research and precision medicine. AI workflows used machine learning for disease subtyping and biomarker discovery, with explainable AI models like VESPRE integrated into EHRs to enable personalized treatments [4, 28, 30, 42, 68]. Electronic patient-reported outcomes (ePRO) systems also enhanced patient-doctor relationships by combining self-reported and clinical data in cohort studies [68].

Longitudinal cohort studies tracking

Tracking in longitudinal cohort studies enables continuous monitoring of participants, records changes in health outcomes, and supports robust follow-up strategies. Longitudinal cohort studies tracking was addressed in 15 studies [14, 22, 23, 25, 30, 34, 35, 43, 44, 60, 61, 67, 70, 72, 73]. It is particularly important in the context of crises, disasters, chronic diseases, and disease trends, especially among vulnerable populations, where continuous monitoring ensures reliable data collection and aids prevention and treatment decisions [34, 43, 67]. Examples included the Swiss HIV Cohort Study, with structured tracking since 1988, and DHRP [22, 60]; repeated patient-reported outcome (PRO) assessments at regular intervals [35]; pre- and postoperative follow-up with the EQ-5D, a concise, generic measure of self-reported health [61]; and long-term surveys [23]. Follow-up statuses were systematically categorized [30], while systems developed for cancer and other diseases improved long-term data quality [44]. Self-reports, ongoing participant interaction, and systematic solutions to manage loss to follow-up in conditions like HIV/AIDS, cancer, and cardiovascular diseases strengthened retention and accuracy [14, 25, 70, 73]. Longitudinal tracking also informs global health research, addressing challenges such as child mortality and epidemics [72].
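The systematic categorization of follow-up status mentioned above could, for example, be derived from the date of the last completed visit. The following Python sketch is purely illustrative: the status labels, the 365-day visit interval, the 90-day grace period, and the `follow_up_status` function are assumptions and do not reproduce the scheme used in any reviewed study.

```python
from datetime import date, timedelta


def follow_up_status(
    last_visit: date,
    today: date,
    interval_days: int = 365,  # assumed planned time between visits
    grace_days: int = 90,      # assumed tolerance before a visit counts as overdue
) -> str:
    """Classify a participant by how far they are from the next planned visit."""
    due = last_visit + timedelta(days=interval_days)
    if today < due:
        return "on_schedule"
    if today <= due + timedelta(days=grace_days):
        return "due"
    return "overdue"


# Hypothetical participants and visit dates.
last_visits = {"P001": date(2024, 1, 1), "P002": date(2023, 3, 15)}
today = date(2025, 1, 15)
for pid, visit in last_visits.items():
    print(pid, follow_up_status(visit, today))
```

A status derived in this way could feed reminder tools or retention dashboards, although the thresholds would need to be set by each study protocol.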

Protection and access control

Protection and access control ensure that the system implements mechanisms to restrict data access to authorized users only, enforce privacy and regulatory compliance, and maintain data integrity. The system must also include safeguards to protect patient safety by preventing unauthorized modifications or misuse of clinical data. Patient safety, discussed in five studies [8, 21, 42, 68, 70], was ensured through GDPR-compliant consent management [8], anonymization via trusted systems [21], precise participant selection using CART models [70], and the ethical use of pre-collected samples to reduce risks [42]. Treatment monitoring and response evaluation further improved participant care [68]. Data access control was highlighted in 19 studies [4–6, 8, 21–24, 27, 29, 30, 35, 42, 47, 61, 62, 65, 66, 71], in which encryption, strong authentication, GDPR compliance, and role-based permissions were used to secure sensitive data [4, 6, 8, 27, 30, 61, 62]. Additional measures included pseudonymization and audit trails for transparency [8, 21, 27], secure data sharing across centers [4, 65], and integration of access protocols with EHRs and biobanks for real-time validation, alerts, secure Application Programming Interfaces (APIs), and dual one-time password systems [8, 24, 42, 47, 66, 71]. Restricted aggregate models and managed access frameworks provided an added layer of control [23]. These strategies ensure data security, compliance, and ethical management in medical research [4, 8, 24, 29, 42, 47, 65, 66, 71].
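The combination of role-based permissions, pseudonymization, and audit trails described above can be illustrated with a short Python sketch. All names in it (`ROLE_PERMISSIONS`, `pseudonymize`, `fetch_record`, the two roles, and the record fields) are hypothetical, and keyed hashing with HMAC-SHA256 is only one possible way to derive pseudonyms; the reviewed systems may use different mechanisms, such as a dedicated pseudonym management service.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Assumed role model: which permissions each role holds.
ROLE_PERMISSIONS = {
    "data_manager": {"read_identified"},
    "analyst": {"read_pseudonymized"},
}
DIRECT_IDENTIFIERS = {"participant_id", "name"}  # assumed identifying fields
AUDIT_LOG: list[dict] = []


def pseudonymize(participant_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with a keyed hash (HMAC-SHA256, truncated)."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def fetch_record(user: str, role: str, record: dict, secret_key: bytes) -> dict:
    """Return the view of a record that the role may see and log the attempt."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    if "read_identified" in allowed:
        view, outcome = dict(record), "identified"
    elif "read_pseudonymized" in allowed:
        view = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        view["pseudonym"] = pseudonymize(record["participant_id"], secret_key)
        outcome = "pseudonymized"
    else:
        view, outcome = None, "denied"
    # Every attempt, including denied ones, is written to the audit trail.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "outcome": outcome,
    })
    if view is None:
        raise PermissionError(f"role '{role}' may not read participant data")
    return view


key = b"demo-key-for-illustration-only"
record = {"participant_id": "P001", "name": "Jane Doe", "hba1c": 6.1}
print(fetch_record("alice", "analyst", record, key))
```

In practice the secret key would be held outside the analysis environment, for instance by a trusted third party, so that analysts cannot recompute the mapping from pseudonyms back to participants.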

Data reporting, visualization, and reports generation

Data reporting, visualization, and report generation are essential for effectively communicating insights derived from cohort data and for presenting complex findings clearly. These requirements cover the creation of visualizations and dynamic reports, as well as data sharing, to support decision-making and the dissemination of research data. The findings showed that 19 studies [7, 20, 23, 25, 29, 30, 34, 35, 41, 43, 60–62, 64–68, 72] addressed reporting, visualization, and data integration tools in cohort studies. Customizable dashboards, automated summarization, and quality control reporting enhanced the management and monitoring of complex patient data [41, 60, 62, 64, 65]. Centralized systems with user-friendly interfaces supported multi-center and international collaborations, and improved data exchange and health interventions [29, 34, 66]. Real-time monitoring charts aided clinical decisions and patient follow-up [25, 43, 67], and continuous data reporting enabled targeted public health actions [7, 72]. Advanced visualization methods, including CCVPRA (CPPTherapists-DBS system, a tool providing advanced visualization and data modeling capabilities for a system that integrates cardiac surgery and cardiopulmonary rehabilitation data management), business intelligence cubes, and interactive dashboards like MMFP-Tableau, further enhanced data interpretation and cohort analysis [23, 30, 35, 61].

Infrastructure and operations

Infrastructure and operations ensure that the system is supported by reliable and accessible IT infrastructure, enabling seamless user access and system performance. This category also includes process management and automation to streamline workflows, reduce manual tasks, and enhance operational efficiency. Twelve studies [6, 14, 21–23, 28, 36, 43, 44, 60, 63, 69] used centralized, scalable, and modular solutions for imaging, genomic data [36, 63], and multi-site integrity [69]. Overall, cloud-based architectures, APIs, and visualization tools enhanced interoperability, accessibility, and sharing [22, 23, 28, 60]. Web-based and modular systems were noted for integrating biomarkers and supporting diagnosis and treatment [6, 14, 21, 44], with lightweight local systems offering options in resource-limited settings [43]. Process management and automation, discussed in 15 studies [5, 20–22, 24, 26, 35, 37, 41, 43, 47, 60, 61, 67, 69], optimized efficiency through structured workflows, real-time monitoring, and automation, ensuring accuracy, compliance, and synchronization with study goals [20, 21, 24, 26, 37, 43, 47, 67, 69]. Tools like XNAT in neuroimaging [41, 64], together with Jenkins-based automation, extract, transform, and load (ETL) scheduling, and real-time analytics, enhanced data validation, coordination, and timely clinical decision-making [5, 22, 35, 60, 61].

User support and interaction

User support and interaction focus on enabling intuitive and efficient user interaction with the system, ensuring ease of use and engagement. This category also includes provisions for training, collaboration, and ongoing support to assist users and promote effective system adoption and use. Eight studies [21, 22, 35, 37, 43, 60, 65, 68] emphasized tools including automatic reminders, multilingual support, multi-user management, and personalized dashboards, which enhanced adherence to study protocols, facilitated system access, and streamlined communication in large-scale, multi-location cohort studies. User-friendly web interfaces and APIs further supported efficient data entry, retrieval, and interaction between researchers and participants [22, 37, 65, 68]. Training, collaboration, and support were discussed in eight studies [4, 5, 14, 26, 29, 45, 60, 65], focusing on the role of educational programs, collaborative analysis tools, and platforms such as HarmonicSS and the Clinical Development Center (CDC) in facilitating data sharing, integration, and analysis, particularly in public health and cancer research [4, 26, 29, 45, 60, 65]. Collaboration between domestic and international researchers, alongside analyst training and participant sensitization, was identified as critical for improving adoption, strengthening engagement, and advancing medical research [5, 14].

Research facilitation

Research facilitation in cohort studies included research proposals and patient-centered approaches, which enhanced data efficiency and quality, as highlighted in four studies [11, 22–24]. Transparent and ethical data access through proposal submissions improved data utilization [24]. Incorporating patients’ experiences fostered better communication between researchers and participants and boosted research quality and applicability [38]. Additionally, platforms designed for broad HIV research supported hypothesis generation and collaboration [22, 23].

Non-Functional Requirements (NFRs)

NFRs define criteria for assessing the performance of a system and focus on attributes such as performance and security. Unlike FRs, which define specific behaviors, NFRs impact the overall architecture of the system. The main non-functional requirements for CDMS were categorized into eight groups based on ISO/IEC 25010. These were performance efficiency, compatibility, usability, flexibility, reliability, security, maintainability, and privacy. These requirements are explained below in more detail. Figure 7 illustrates the hierarchical breakdown of NFRs into their respective subcategories.

Fig. 7. Breakdown of Main Non-Functional Requirements of CDMS by Subcategories (see footnote 2)

Performance efficiency

Performance efficiency refers to the extent to which a system effectively executes its intended functions within defined time constraints and throughput levels, while optimizing resource utilization under specified conditions. This characteristic encompasses aspects such as efficiency, cost-effectiveness, and performance. Efficiency, emphasized in 12 studies [7, 22, 26, 28, 31, 36, 41, 42, 47, 69–71], could be achieved through electronic systems enabling rapid data collection [47, 69], automated pipelines and distributed frameworks for faster processing [7, 22, 28, 36, 41, 42], and optimized data cleaning to enhance quality and cohort identification [26, 31, 70, 71]. Cost-effectiveness, addressed in 10 studies [11, 34–36, 39, 42, 60, 62, 70, 72], emphasized using digital platforms like REDCap, VESPRE, and Gulf of Mexico Community Health Observing System (GoM CHOS) for reducing costs while improving efficiency [34, 36, 42, 62, 70]. Systems such as COllaborative open platform E-cohorts (COOP’e-cohort) [11] and low-cost customization approaches [35, 60] further improved value in clinical research, including in resource-limited environments [39, 72]. Performance, reported in 13 studies [4, 22, 24, 26, 28, 32, 36, 42, 43, 60, 61, 65, 71], was strengthened by high-performance computing, containerized pipelines, distributed Elasticsearch, and efficient querying systems [22, 28, 60, 61]. Examples include HarmonicSS executing analyses in less than 30 s, GenomicsDB supporting distributed genetic research [4, 36], and lightweight tools like odk_planner for constrained settings [43]. Optimized workflows for multi-omics [32], HIV/AIDS, and acute care [24, 42] further illustrated improvements, though challenges like limited population diversity [26] remain.

Compatibility

Compatibility refers to the extent to which a system or component can operate effectively within a shared environment, exchanging and utilizing information with other systems. It primarily includes interoperability, which is the ability to exchange shared information and use it meaningfully. Interoperability enables seamless data exchange and integration across systems, enhancing standardization, compatibility, accuracy, and reproducibility. Thirteen studies [6, 14, 22, 28, 34, 35, 60–66] have examined its role in improving data coordination and management through system integration tools including REDCap, SHIP, Electronic Medical Record (EMR), and MMFP-Tableau [62–64]. Studies emphasized supporting FAIR principles via JavaScript Object Notation (JSON) metadata, Docker containers, and integration with EHRs, wearables, and apps through HL7, Fast Healthcare Interoperability Resources (FHIR), Clinical Data Interchange Standards Consortium (CDISC), ICD-10, and device APIs [22, 28, 35, 60, 61]. Open-source platforms like Modular Approach to Data Management in Epidemiological Studies (MOSAIC) fostered data harmonization by improving software module compatibility [6, 14, 65]. Standard protocols and scalable methods enhanced data quality, integrity, and long-term interoperability [34, 64–66].

Usability

Usability is defined as a system’s ability to facilitate easy and effective interaction. It has been identified as a key NFR in 31 studies within CDMS [5–7, 20, 22–25, 27, 28, 30, 32, 35, 36, 41, 43–45, 47, 60–66, 68, 69, 71–73]. Enhancements included user-friendly, web-based, and mobile-accessible interfaces, dashboards, and real-time visualizations to support users with low digital literacy and diverse cultural contexts [22, 23, 27, 28, 30, 35, 36, 60–62]. Usability contributes to sustainability and efficient data management by simplifying collection, monitoring, and retrieval [25, 45, 47, 63, 69, 71]. It also facilitates collaboration, data sharing, and harmonization in scalable systems, particularly for neuroimaging and cohort studies [41, 64, 65]. Standard data formats, accessibility guidelines, and reduced researcher workload have improved usability in HIV/AIDS and Alzheimer’s research [5, 24, 66]. For researchers with limited IT expertise, tools emphasize simplicity in cohort studies [6, 20, 32, 68]. Systems like Korean Renal Cell Carcinoma (KORCC) and odk_planner demonstrated usability benefits by reducing complexity and ensuring efficient management in clinical, population, and public health research [7, 43, 44, 72, 73].

Flexibility

Flexibility is the extent to which a system can adapt to changing requirements, usage contexts, or environments. It includes scalability, which refers to the system’s ability to adjust its capacity in response to varying workloads. Twenty studies [5, 6, 11, 21, 27, 29–31, 34, 35, 37, 39, 41, 42, 47, 60, 62, 63, 65, 68] emphasized the role of flexibility for accommodating diverse research needs and adapting seamlessly to the study requirements and contexts. Systems such as REDCap, ePRO, SHIP, VESPRE, and GoM CHOS demonstrated flexibility in harmonizing data, supporting chronic disease management, and coordinating multi-project studies [11, 27, 39, 41, 47, 62, 63, 65, 68]. Flexible protocol updates including surveys, age-specific adaptations, and integration of new data types were also identified as crucial requirements [30, 35, 60]. Scalability was examined in 27 studies [4, 7, 8, 14, 22, 23, 27–29, 32, 35–37, 39, 41, 45, 47, 60–63, 65–67, 69, 71, 72] and ensured that large datasets, multi-center cohorts, and large sample sizes were managed efficiently while maintaining data quality. Scalable strategies included cloud autoscaling, distributed analytical frameworks, horizontal and vertical expansion, integration of new devices, and handling of multi-omics data [14, 22, 23, 28, 32, 35, 36, 60, 61, 63]. Centralized, automated systems enhanced large-scale data coordination, enabling multi-study collaboration and improving overall research efficiency [7, 29, 41, 66, 67, 69, 72].

Reliability

Reliability refers to the degree to which a system, product, or component consistently performs its intended functions under defined conditions over a specified period of time. This characteristic also encompasses sustainability, which emphasizes the system’s long-term dependability and stability. Nineteen studies [4, 7, 8, 22, 24, 25, 27, 28, 34, 36, 44, 45, 47, 65, 67, 69–71, 73] addressed system reliability through stable cloud hosting, Multi-AZ deployment for fault tolerance, rigorous quality control, and reproducible data processing to strengthen data integrity and reduce errors [8, 22, 28, 36, 47, 65, 69, 70]. Reliability was also supported by interpretable AI models, systematic data cleaning, governance frameworks, and controlled execution environments, ensuring crisis resistance and long-term credibility [4, 24, 27, 34, 67]. Several studies emphasized the role of uptime, intensive monitoring, and structured audits in maintaining data validity, particularly in low-resource and translational research settings [7, 25, 44, 45, 71, 73]. Sustainability was discussed in five studies [4, 20, 25, 29, 30], which focused on the long-term viability of CDMS to support ongoing research. Approaches included Platform-as-a-Service (PaaS) models such as HarmonicSS [4], long-term strategies exemplified by IeDEA [25], and a system that has functioned effectively for nearly two decades [30]. Future-proof infrastructures such as the California Teachers Study (CTS) and tools like the Toolbox for Research ensure adaptability and efficient expansion, strengthening sustainability in data-driven studies [20, 46].

Security

Security refers to the extent to which a system protects against threats and unauthorized access, ensuring that data and resources are accessible only to users or systems with appropriate authorization. This characteristic includes compliance with relevant regulations and confidentiality, which safeguards sensitive information from exposure or misuse. A total of 33 studies [4–8, 14, 20, 22–24, 28–31, 34, 35, 37, 44, 45, 60–69, 71–73] discussed the integration and implementation of security measures in various CDMS. Multi-layered security requirements included Advanced Encryption Standard (AES) encryption such as AES-256, Transport Layer Security (TLS) protocols, secure APIs (HTTPS), UUIDs, and role-based access control aligned with HIPAA, National Institute of Standards and Technology (NIST), and Federal Information Security Modernization Act (FISMA) guidelines [22, 23, 28, 30, 35, 60, 61]. Additional protections included firewalls, endpoint security, identity and access management (IAM) policies, encrypted storage, anonymization, pseudonymization, prevention of unauthorized access, and separation of participant identifiers from research data [6, 8, 20, 24, 27, 29, 31, 34, 37, 44, 64–66, 71, 73]. Privacy-preserving techniques, encrypted cloud infrastructures, routine audits, daily backups, and secure troubleshooting protocols collectively enhanced system security and minimized breach risks in federated and resource-limited settings [4, 6–8, 27, 31, 44, 45, 62, 64, 69, 72]. Compliance, mentioned in 12 studies [14, 20, 30, 31, 35, 42, 61, 63, 64, 66, 67, 71], reinforced security through adherence to the Helsinki Declaration, the Royal College of Pathologists of Australasia (RCPA) external quality assurance programs, GDPR, HIPAA, ISO/IEC standards, institutional review board (IRB) approvals, and FAIR principles to promote ethical research and responsible data use [14, 30, 31, 35, 42, 61, 63, 64, 66, 67, 71].
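
The role-based access control mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed systems: the role names, the `Action` values, and the `is_allowed` helper are hypothetical and chosen only to show the deny-by-default idea, in which a role can perform an action only when that action has been explicitly granted.

```python
from enum import Enum


class Action(Enum):
    READ_DEIDENTIFIED = "read_deidentified"
    READ_IDENTIFIERS = "read_identifiers"
    EDIT_RECORDS = "edit_records"
    EXPORT_DATA = "export_data"


# Hypothetical role-to-permission mapping (assumption for illustration);
# a real CDMS would typically load this from a managed policy store.
ROLE_PERMISSIONS = {
    "analyst": {Action.READ_DEIDENTIFIED},
    "data_manager": {Action.READ_DEIDENTIFIED, Action.EDIT_RECORDS, Action.EXPORT_DATA},
    "study_coordinator": {Action.READ_DEIDENTIFIED, Action.READ_IDENTIFIERS},
}


def is_allowed(role: str, action: Action) -> bool:
    """Return True only if the role is known and explicitly granted the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles receive an empty permission set, so any request that is not explicitly covered by the mapping is refused.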

Maintainability

Maintainability represents the degree to which a system or product can be efficiently and effectively modified to correct faults, improve performance, or adapt to changes in the environment or requirements. This characteristic includes extensibility, which refers to the ease with which new capabilities can be added without negatively affecting existing functionality. Five studies accentuated the importance of maintainability in cohort systems [14, 22, 36, 60, 65]. Frameworks like GenomicsDB simplified system updates in Spark clusters through enhanced maintainability [36], and communication-enabled IT networks improved medical research by integrating biomarkers [14]. To ensure long-term upkeep, studies recommended version-controlled tools, modular design, and microservices architecture [22, 60]. Extensibility was emphasized in five studies [21, 41, 61, 63, 65] to ensure the rapid adaptability and reusability of systems for emerging research needs. The SHIP project tools and modular neuroimaging data systems demonstrated effective reusability for new diseases [41, 63]. Iterative updates using spiral models, open-source architectures that facilitated harmonization, and reusable software modules like those of the MOSAIC project further strengthened extensibility and long-term maintainability [21, 41, 61, 63, 65].

Privacy

Privacy in CDMS involves a comprehensive framework of legal, technical, and ethical safeguards to protect sensitive personal and health information. Although privacy is a subset of security, it plays a significantly more prominent and independent role in the context of medical research and health-related data, as reflected in 26 studies [4–6, 8, 14, 20, 21, 23, 25, 26, 28–31, 36, 37, 42, 43, 45, 60, 61, 64, 65, 67, 68, 70]. Regulatory compliance is foundational, requiring systems to adhere to national and international laws such as HIPAA in the USA [23, 28, 29, 36, 42, 60, 65, 70], GDPR in the EU [4, 5, 8, 20, 21, 31], the UK Data Protection Act [5], and regional legislation like Germany’s BDSG (Bundesdatenschutzgesetz) [6, 21], the Swedish legal framework [45], the National Commission on Informatics and Liberty (CNIL) [68], and the Human Tissue Act in the UK [67]. Many studies obtained approval from IRBs and ethics committees [25, 26, 31, 61, 62, 70], and several systems followed recognized security standards such as ISO/IEC 27001 [5, 8], FISMA, and NIST guidelines [29, 60, 65]. At the system level, role-based and fine-grained access, AES-128/256, TLS/SSL, and infrastructure like Amazon Web Services (AWS) IAM, virtual private networks (VPNs), and private clouds strengthen protection [4, 5, 8, 28–30, 42, 43, 45, 60, 65]. Audit trails maintain accountability [28, 30, 45]. Informed consent and eConsent ensure participant autonomy, with withdrawal and GDPR’s “right to be forgotten” supported by tools like gICS [4, 6, 8, 14, 21, 26, 30, 31, 60–62, 64]. Data separation and minimization safeguard personally identifiable information (PII) by storing it apart from medical data [6, 8, 14, 20, 21, 26, 30, 31, 37, 65, 70].
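
The data separation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of any reviewed system: the field names, the `split_record` helper, and the use of a random UUID as the pseudonym are all hypothetical. The idea is that identifiers and research data are stored apart and linked only through a pseudonym that cannot be derived from the identifiers themselves.

```python
import uuid


def split_record(record: dict, pii_fields: set) -> tuple:
    """Split one participant record into an identifier part and a research part.

    Both parts share only a random pseudonym, so the research part carries
    no direct identifiers.
    """
    pseudonym = uuid.uuid4().hex  # random, not derived from any identifier
    identifiers = {k: v for k, v in record.items() if k in pii_fields}
    research = {k: v for k, v in record.items() if k not in pii_fields}
    identifiers["pseudonym"] = pseudonym
    research["pseudonym"] = pseudonym
    return identifiers, research


# Hypothetical field names and values, for illustration only.
PII_FIELDS = {"name", "national_id", "phone"}
raw = {"name": "Jane Doe", "national_id": "X123", "phone": "555-0100",
       "age": 54, "hba1c": 6.8}
identity_row, research_row = split_record(raw, PII_FIELDS)
```

In practice the two outputs would be written to separately secured stores, and the identifier store would be accessible only to roles that need to re-contact participants or process withdrawal requests.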

Discussion

This research presented the functional and non-functional requirements of data management systems in cohort studies. The findings articulated the importance of data management, data processing and analysis, flexibility, and security in successfully managing complex and long-term studies within CDMS. These were identified as the most frequently emphasized requirements, highlighting their foundational role in CDMS design. Analyzing the NFRs showed that CDMS priorities strongly focused on flexibility, security, usability, privacy, and performance efficiency.

Among FRs, data management as a foundational requirement in CDMS was responsible for data collection, integration, curation, harmonization, and quality assurance to ensure effective data use. However, limited focus on metadata management creates a critical gap affecting data discoverability, provenance, interoperability, reusability, and core FAIR principles. This imbalance indicates that systems are often built to gather data, but may not be optimally designed to understand or utilize it effectively across studies or institutions [75]. Data processing and analysis were moderately addressed; however, AI workflows and disease subtyping remained underrepresented despite their potential. Advanced AI analysis requires robust, scalable infrastructure and automation, yet process management receives limited focus [76], suggesting many CDMS lack the sophistication needed for next-generation research. In Protection and Access Control, there is a strong emphasis on data access control due to security and privacy regulations, but patient safety within research modules is under-prioritized [77]. This under-prioritization is concerning, as CDMS should support clinical safety. Balancing strict access control with the need to protect patient safety in research requires more nuanced system design.

The Reporting and Visualization category was well-represented, highlighting the translation of complex data into actionable insights. Longitudinal tracking was also emphasized, particularly for chronic diseases and real-world research. However, this required integration with participant management and Auto ID generation [30]. By contrast, major gaps were observed in user support, training, collaboration, and research facilitation, as well as the lack of tools for protocol management, recruitment tracking, budget oversight, and regulatory compliance, which reduce efficiency and delay implementation [78].

Several studies in our review demonstrated how FRs like process management and automation support streamlined workflows. For instance, the VESPRE system used in sepsis research enables automated EHR-based data extraction and biospecimen linkage, accelerating cohort assembly and minimizing manual labor [42]. In terms of data integration, platforms like MINDS (used in oncology), ABCD (Adolescent Brain Cognitive Development), and BBRC (used in neurodegenerative research) integrated multimodal data, clinical imaging, and genomic data into cohesive datasets for predictive modeling and personalized cohort analysis [28, 64]. Integration not only enhances analytical capacity but also improves data reusability, which is vital in longitudinal research. The HarmonicSS platform supports integration across national and international registries, enabling large-scale analysis of autoimmune diseases [4].

Some studies showed that there is a misalignment between the FRs and the strategic vision. The FRs appear heavily skewed toward data management, collection, integration, quality assurance, and regulatory readiness, but there is insufficient attention to advanced analytics readiness, long-term system adaptability, and research team productivity [79–82]. Therefore, more emphasis on CDMS maintainability, sustainability, and extensibility is required to develop more resilient and adaptive systems for long-term research [83–85]. The inclusion of automation, reporting, and process management highlights the growing demand for efficiency and scalability in data handling [85, 86]. However, sometimes short-term gains have been prioritized over long-term perspectives [87]. The current priorities optimize initial data handling and risk mitigation, but by underestimating AI integration, user training, collaboration support, and workflow integration, CDMS risk reduced adoption, increased inefficiency, and fragmented operations, especially as studies scale up or diversify [88, 89]. Over time, this could translate into technical debt, costly system overhauls, or limited reuse value [90].

NFRs reflect the key priorities and essential needs for creating an optimized and efficient system. Flexibility emerged as the most frequent non-functional requirement, indicating the need for adaptable systems that can evolve with changing research needs. Security remains a top concern, emphasizing data protection and risk mitigation due to the sensitivity of clinical data. Establishing trust among users and stakeholders is critical. Usability is also essential, as it directly impacts productivity and reduces human errors. In CDMS, a user-friendly interface, quick access to information, and process simplification enhance efficiency and minimize errors.

Among NFRs, the prominence of flexibility signals a fundamental shift towards designing for change and growth. This reflects the reality of modern software ecosystems, such as volatile business requirements, unpredictable user loads, rapid technological evolution, and cloud-native architectures [91, 92]. Scalability specifically highlights the critical need to handle increasing demand efficiently, often a primary business driver [93]. Security and privacy were emphasized as critical factors in protecting systems amid advanced cyber threats and strict regulations like GDPR, CCPA, and HIPAA [94]. The focus on security highlights core measures such as authentication, authorization, encryption, and intrusion detection. The strong presence of compliance reflects the necessity of meeting legal requirements as an essential part of system design [95]. Privacy’s distinct prominence underscores its recognition as a fundamental right and a core system property, closely linked to but separate from security.

The substantial count of NFRs for usability underscores its enduring importance. While sometimes under-prioritized against functional needs, this data confirms its recognition as a critical success factor influencing user adoption, productivity, satisfaction, and ultimately, the system's value proposition [96]. The near-exclusive focus within this category suggests a potential area for future refinement.

Performance Efficiency is a well-balanced concern, with attention evenly distributed among performance, efficiency, and cost-effectiveness, reflecting a mature understanding of the need for speed, resource optimization, and financial management, especially in cloud-based environments where resource use impacts costs [97]. By contrast, reliability receives less emphasis than security or flexibility, possibly because robustness, fault tolerance, and high availability are now assumed or are being overshadowed by newer priorities. According to the results, sustainability was rarely addressed, despite increasing societal and regulatory focus on environmental and social responsibility, indicating a disconnection between global priorities and current requirements engineering practices [98].

In this study, compatibility is solely defined by interoperability, emphasizing the need for systems to seamlessly exchange data and functions via APIs, data formats, and protocols to integrate into digital ecosystems and support workflows [99]. Maintainability receives minimal attention, indicating a lack of awareness of its impact on long-term software costs. Poor maintainability leads to higher technical debt, increased modification costs, slower feature delivery, and more defects [100]. The limited emphasis on extensibility suggests a focus on short-term delivery at the expense of long-term system health and adaptability.

Overall, CDMS implementations have been successful across various medical research domains by emphasizing flexibility, scalability, and usability. Examples include the Swiss HIV Cohort Study Viral Next Generation Sequencing Database (SHCND), managing complex NGS data for over 21,000 individuals using high-performance computing [22]; the Digital Health Research Platform (DHRP), supporting the "All of Us" cohort with cloud-based scalability, secure multi-source data integration, and user-friendliness [60]; and the Multimodal Integration of Oncology Data System (MINDS), efficiently handling multimodal oncology data for over 41,000 cases [28]. Federated clinical research networks such as PCORnet have demonstrated their ability to manage massive patient datasets across multiple institutions, highlighting scalability as a critical enabler of collaborative research [101]. Usability has been addressed for clinical data warehouses of chronic disease cohorts, and intuitive interfaces have been suggested to facilitate data access, analysis, and sharing by researchers [102]. Some platforms have achieved a balance between scalability and usability, with i2b2 serving as a prominent example by supporting very large patient datasets across research sites while also offering intuitive query tools for clinicians and researchers [103, 104].

The results also demonstrated a strong alignment with immediate business priorities, such as security, privacy, scalability, and usability. However, long-term engineering fundamentals, including maintainability, reliability, and sustainability, are often undervalued, leading to potential technical debt and higher lifecycle costs. Cost-effectiveness and efficiency, though important for reducing costs and improving productivity, are usually secondary considerations in early CDMS stages [30, 105]. Security and scalability are prioritized initially, as deficiencies in these areas can lead to system failure, whereas cost and efficiency issues are more manageable after deployment [105–107]. Successful CDMS design requires balancing all requirements based on risk and long-term goals to ensure sustainability, security, and scalability over time [108]. However, scalability and cost-effectiveness often conflict: supporting longitudinal data and complex queries demands significant infrastructure investment, while minimizing costs can result in under-provisioned systems [109]. Cohort systems can address this through architectural strategies like tiered storage, serverless queries, materialized view caching, and ML-driven auto-scaling, which reduce storage and compute costs while maintaining performance [110–113]. However, balancing scalability with efficiency remains a key design challenge.

Next-generation CDMS must prioritize maintainability, sustainability, and extensibility as core design principles to evolve from short-term solutions to resilient, scalable systems [114]. Neglecting maintainability fosters technical debt, raises maintenance costs, and limits innovation [115]. Without extensibility, adapting systems for new studies may require costly workarounds or complete redesigns [116]. To ensure long-term viability and mitigate technical debt, modern CDMS must shift from fragile, short-term success to embracing sustainability, maintainability, and integration with real-world healthcare ecosystems as primary design imperatives [114]. Platforms like DHRP demonstrated that cloud-native architectures can balance scalability, cost-efficiency, and long-term resilience [60]. Furthermore, the strong focus on interoperability within the compatibility requirement lays the groundwork for real-world healthcare ecosystem integration, as demonstrated by federated networks like PCORnet [101] and HarmonicSS [4]. These networks, along with systems like i2b2 [103, 117, 118], enable seamless data exchange across institutions. Sustainability extends beyond environmental considerations to encompass financial and operational viability over decades of cohort research, ensuring that platforms remain functional without prohibitive costs or energy demands [119]. Maintainability is equally vital, as systems that are difficult to update or extend accumulate technical debt that may erode efficiency and reliability over time [120], a challenge that modular platforms like REDCap [27, 62] have successfully addressed through a vibrant ecosystem of shared modules and consistent updates.
Integration with real-world healthcare ecosystems, particularly EHRs, registries, and clinical workflows, enables CDMS to move from isolated research silos to embedded, interoperable infrastructures that generate value in both research and care delivery [121, 122], as exemplified by MINDS [28], which directly integrates multimodal clinical data for translational research. Prioritizing these three dimensions positions CDMS not only as research enablers but as resilient, adaptive infrastructures capable of supporting long-term, large-scale, and clinically relevant cohort studies [123].

Building an effective CDMS is akin to laying a strong foundation for future research. Incorporating emerging technologies and adapting to evolving requirements further enhances the potential of clinical cohort studies and healthcare research. Emerging technologies like AI and blockchain are expected to significantly transform CDMS operations. AI can enhance CDMS through automated data preprocessing, pattern recognition, and predictive modeling, improving data quality, enabling disease subtyping, and supporting personalized cohort analysis. Machine learning algorithms help to detect anomalies and handle incomplete records in large datasets, leading to more accurate results [124, 125]. Blockchain offers a decentralized, tamper-evident system for secure data sharing and audit trails, supporting data integrity, transparent consent management, and patient trust through immutable records and fine-grained access control [126, 127]. Together, these technologies address key challenges in scalability, privacy, and interoperability, marking a shift in how longitudinal research data is managed and shared.

Limitations and future works

Since this study included research with various methodologies, sample sizes, and evaluation criteria, it was not possible to combine the data or perform a meta-analysis. The inclusion of heterogeneous study designs, particularly qualitative evaluations and case-specific implementations, may also limit the generalizability of the findings to all CDMS contexts, especially in large-scale or cross-national cohort studies. Additionally, some relevant studies may have been overlooked due to access limitations to databases and language barriers. However, efforts were made to identify the functional and non-functional requirements of CDMS using the available articles. To address these limitations, future research should be more inclusive, covering non-English publications and grey literature.

Future studies should also explore underrepresented requirements like maintainability, sustainability, and integration of emerging technologies, and evaluate them in diverse CDMS settings. From a practical perspective, the findings of this review can guide the design of real-world CDMS by offering a structured framework of requirements prioritized by frequency and relevance. Developers should use high-priority requirements as a baseline and iteratively integrate lower-frequency but critical elements like privacy, maintainability, and scalability. Stakeholders are encouraged to align system features with both user needs and project-specific goals to ensure usability and long-term adaptability.

Additionally, mapping these requirements to specific implementation phases (design, testing, deployment) can facilitate more efficient and needs-driven CDMS development. Moreover, future work should assess how different combinations of functional and non-functional requirements influence system outcomes, user satisfaction, and research efficiency across various healthcare organizations and research infrastructures.

Conclusion

Requirement analysis indicated that the primary focus of CDMS is on critical requirements for establishing a reliable and scalable system while optimizing and standardizing data management processes. Meanwhile, specific and less frequently mentioned requirements serve as key support mechanisms for achieving special objectives. A balanced integration of priorities, requirements, and needs can provide a stable framework for the successful implementation of CDMS. The design of CDMS requires a well-balanced approach that simultaneously considers both immediate and long-term needs. By achieving this, CDMS can effectively support both current and future research needs, ensuring long-term success.

To improve CDMS design, developers should prioritize modular architectures that allow for easy integration of evolving components such as AI-driven analytics, data visualization dashboards, and real-time monitoring tools. Implementing standard data models (e.g., OMOP CDM), role-based access controls, and automated data validation pipelines can further enhance reliability, security, and data quality. Moreover, designing user-centered interfaces and embedding structured training modules can improve usability and long-term adoption by clinical teams and researchers.
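
As a minimal sketch of what an automated data validation step might look like, the following Python example checks each record against plausibility ranges and reports problems instead of silently accepting the data. The field names, the ranges in `RULES`, and the `validate_record` helper are assumptions made for illustration; actual limits would come from the study protocol.

```python
def validate_record(record: dict, rules: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (low, high) in rules.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not (low <= value <= high):
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical plausibility ranges; real limits depend on the study protocol.
RULES = {"age": (18, 110), "systolic_bp": (60, 260)}

clean = validate_record({"age": 47, "systolic_bp": 128}, RULES)
flagged = validate_record({"age": 147}, RULES)
```

A record with no problems yields an empty list, so the same function can gate data entry in real time or run as a scheduled batch check over the whole cohort.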

Advancements in CDMS can directly enhance clinical research outcomes by enabling more efficient cohort identification, reducing time-to-insight through automated data processing, and supporting predictive modeling for early disease detection or treatment response. Improved data integration and harmonization across sites also facilitate multicenter studies, leading to higher statistical power and more generalizable findings. These improvements collectively accelerate the research cycle, reduce operational burden, and support evidence-based decision-making in healthcare. Continuous system updates, based on user feedback and technological advancements, are essential to ensure that the CDMS evolves in alignment with clinical research demands and regulatory standards.

Supplementary Information

Supplementary Material 1. (12.7KB, docx)

Acknowledgements

This research was supported by Iran University of Medical Sciences, Tehran, Iran.

Abbreviations

CDMS: Cohort Data Management Systems

IoT: Internet of Things

ISO: International Organization for Standardization

IEC: International Electrotechnical Commission

IEEE: Institute of Electrical and Electronics Engineers

FRs: Functional Requirements

NFRs: Non-Functional Requirements

Authors’ contributions

AA contributed to research conceptualization, methodology, data analysis, and manuscript writing. HA supervised the research and manuscript editing. SAM helped with data analysis and manuscript editing. All authors reviewed the manuscript.

Funding

This research was funded by Iran University of Medical Sciences, Tehran, Iran (1402–11-37–28302).

Data availability

The data that support the findings of this study are available from the corresponding author (Haleh Ayatollahi), upon reasonable request.

Declarations

Ethics approval and consent to participate

The research was conducted in accordance with the ethical principles for medical research outlined in the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Iran University of Medical Sciences (IR.IUMS.REC.1403.074). As this was a scoping review study, it involved only the analysis of previously published literature and did not include any direct human participation or intervention. Therefore, informed consent to participate was not required.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

1. The total count for each requirement refers to the number of specific requirements that were addressed through one or more of its subcategories. A single article might cover multiple subcategories under the same main requirement, or it might address only one subcategory.

2. The total count for each main requirement refers to the number of specific requirements that are addressed through one or more of its subcategories. A single article may cover multiple subcategories under the same main requirement, or it may address only one subcategory.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Haleh Ayatollahi, Email: Ayatollahi.h@iums.ac.ir.

Seyed Abbas Motevalian, Email: Motevalian.a@iums.ac.ir.

References

  • 1.Szklo M. Population-based cohort studies. Epidemiol Rev. 1998;20(1):81–90. [DOI] [PubMed] [Google Scholar]
  • 2.Abdullah N, Husin NF, Goh Y-X, Kamaruddin MA, Abdullah MS, Yusri AF, et al. Development of digital health management systems in longitudinal study: the Malaysian cohort experience. Digit Health. 2024;10:20552076241277480. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Dellacasa C, Ortali M, Rossi E, Abu Attieh H, Osmo T, Puskaric M, et al. An innovative technological infrastructure for managing SARS-CoV-2 data across different cohorts in compliance with General Data Protection Regulation. Digit Health. 2024;10:20552076241248920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Pezoulas VC, Goules A, Kalatzis F, Chatzis L, Kourou KD, Venetsanopoulou A, et al. Addressing the clinical unmet needs in primary Sjögren’s syndrome through the sharing, harmonization and federated analysis of 21 European cohorts. Comput Struct Biotechnol J. 2022;20:471–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bauermeister S, Orton C, Thompson S, Barker RA, Bauermeister JR, Ben-Shlomo Y, et al. The dementias platform UK (DPUK) data portal. Eur J Epidemiol. 2020;35(6):601–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Bialke M, Bahls T, Havemann C, Piegsa J, Weitmann K, Wegner T, et al. MOSAIC–a modular approach to data management in epidemiological studies. Methods Inf Med. 2015;54(4):364–71. [DOI] [PubMed] [Google Scholar]
  • 7.Wawrzyniak ZM, Paczesny D, Mańczuk M, Zatoński WA. Application of advanced data collection and quality assurance methods in open prospective study - a case study of PONS project. Ann Agric Environ Med. 2011;18(2):207–14. [PubMed] [Google Scholar]
  • 8.Zondergeld JJ, Scholten RHH, Vreede BMI, Hessels RS, Pijl AG, Buizer-Voskamp JE, et al. FAIR, safe and high-quality data: the data infrastructure and accessibility of the YOUth cohort study. Dev Cogn Neurosci. 2020;45:100834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Alencar GA, Felipe VDS, Correia-Neto JDS, Teixeira MM. Non-functional requirements in health information systems. In: 2019 14th Iberian Conference on Information Systems and Technologies (CISTI). Coimbra, Portugal; 2019. pp. 1–5.
  • 10.Almeida JR, Silva LB, Bos I, Visser PJ, Oliveira JL. A methodology for cohort harmonisation in multicentre clinical research. Inform Med Unlocked. 2021;27:100760. [Google Scholar]
  • 11.Tran V-T, Ravaud P. Collaborative open platform E-cohorts for research acceleration in trials and epidemiology. J Clin Epidemiol. 2020;124:139–48. [DOI] [PubMed] [Google Scholar]
  • 12.Das S, Zijdenbos AP, Harlap J, Vins D, Evans AC. LORIS: a web-based data management system for multi-center studies. Front Neuroinform. 2011;5:37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Maier C, Kapsner LA, Mate S, Prokosch HU, Kraus S. Patient cohort identification on time series data using the OMOP common data model. Appl Clin Inform. 2021;12(1):57–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Grabe HJ, Assel H, Bahls T, Dörr M, Endlich K, Endlich N, et al. Cohort profile: Greifswald approach to individualized medicine (GANI_MED). J Transl Med. 2014;12:144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Toledano MB, Smith RB, Brook JP, Douglass M, Elliott P. How to establish and follow up a large prospective cohort study in the 21st century-lessons from UK COSMOS. PLoS One. 2015;10(7):e0131521. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Knake LA, Ahuja M, McDonald EL, Ryckman KK, Weathers N, Burstain T, et al. Quality of EHR data extractions for studies of preterm birth in a tertiary care center: guidelines for obtaining reliable data. BMC Pediatr. 2016;16:59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Dubovitskaya A, Xu Z, Ryu S, Schumacher M, Wang F. Secure and trustable electronic medical records sharing using blockchain. AMIA Annu Symp Proc. 2018;2017:650–9. [PMC free article] [PubMed]
  • 18.Osama M, Ateya AA, Sayed MS, Hammad M, Pławiak P, Abd El-Latif AA, et al. Internet of medical things and healthcare 4.0: trends, requirements, challenges, and research directions. Sensors. 2023;23(17):7435. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Rahy S, Bass JM. Managing non-functional requirements in agile software development. IET Softw. 2022;16(1):60–72. [Google Scholar]
  • 20.Bialke M, Rau H, Thamm OC, Schuldt R, Penndorf P, Blumentritt A, et al. Toolbox for research, or how to facilitate a central data management in small-scale research projects. J Transl Med. 2018;16(1):16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Bialke M, Penndorf P, Wegner T, Bahls T, Havemann C, Piegsa J, et al. A workflow-driven approach to integrate generic software modules in a trusted third party. J Transl Med. 2015;13:176. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Zeeb M, Frischknecht P, Balakrishna S, Jörimann L, Tschumi J, Zsichla L, et al. Addressing data management and analysis challenges in viral genomics: the Swiss HIV cohort study viral next generation sequencing database. PLoS Digit Health. 2025;4(4):e0000825. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Footer K, Lake CM, Porter JR, Ha GK, Ahmed T, Glogowski A, et al. Using publicly available, interactive epidemiological dashboards: an innovative approach to sharing data from the Rakai Community Cohort Study. JAMIA Open. 2024;7(3):ooae069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Kim SM, Choi Y, Choi BY, Kim M, Kim SI, Choi JY, et al. Prospective cohort data quality assurance and quality control strategy and method: Korea HIV/AIDS Cohort Study. Epidemiol Health. 2020;42:e2020063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Newman J, Torres P, Azinyue I, Hemingway-Foday J, Atibu J, Wilfred A, Balimba A, Kalenga L, Mbaya M, Mukumbi H. Improvement of service capabilities following the establishment of an electronic database to evaluate AIDS in Central Africa. J Health Inform Dev Countries. 2011;5(2). Available from: https://www.jhidc.org/index.php/jhidc/article/view/70
  • 26.Asiki G, Murphy G, Nakiyingi-Miiro J, Seeley J, Nsubuga RN, Karabarinde A, et al. The general population cohort in rural south-western Uganda: a platform for communicable and non-communicable disease studies. Int J Epidemiol. 2013;42(1):129–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Bartolacelli Y, Barbieri A, Antonini-Canterin F, Pepi M, Monte IP, Trocino G, et al. Imaging quality control, methodology harmonization and clinical data management in stress echo 2030. J Clin Med. 2021;10(14):3020. 10.3390/jcm10143020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Tripathi A, Waqas A, Venkatesan K, Yilmaz Y, Rasool G. Building flexible, scalable, and machine learning-ready multimodal oncology datasets. Sensors. 2024;24(5):1634. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Lacey JV Jr, Chung NT, Hughes P, Benbow JL, Duffy C, Savage KE, et al. Insights from adopting a data commons approach for large-scale observational cohort studies: the California teachers study. Cancer Epidemiol Biomarkers Prev. 2020;29(4):777–86. [DOI] [PMC free article] [PubMed]
  • 30.Abdullah N, Husin NF, Goh YX, Kamaruddin MA, Abdullah MS, Yusri AF, et al. Development of digital health management systems in longitudinal study: the Malaysian cohort experience. Digit Health. 2024;10:20552076241277481. 10.1177/20552076241277481. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Holle R, Happich M, Löwel H, Wichmann HE. KORA–a research platform for population based health research. Gesundheitswesen. 2005;67(Suppl 1):S19–25. [DOI] [PubMed] [Google Scholar]
  • 32.Koshiba S, Motoike I, Saigusa D, Inoue J, Shirota M, Katoh Y, et al. Omics research project on prospective cohort studies from the Tohoku Medical Megabank Project. Genes Cells. 2018;23(6):406–17. [DOI] [PubMed] [Google Scholar]
  • 33.Prins BP, Leitsalu L, Pärna K, Fischer K, Metspalu A, Haller T, et al. Advances in genomic discovery and implications for personalized prevention and medicine: Estonia as example. J Pers Med. 2021;11(5):358. 10.3390/jpm11050358. [DOI] [PMC free article] [PubMed]
  • 34.Sandifer P, Knapp L, Lichtveld M, Manley R, Abramson D, Caffey R, et al. Framework for a community health observing system for the Gulf of Mexico region: preparing for future disasters. Front Public Health. 2020;8:578463. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.MacMullen LE, George-Sankoh I, Stanley K, McCormick EM, Muraresku CC, Goldstein A, et al. Bridging the clinical-research gap: harnessing an electronic data capture, integration, and visualization platform to systematically assess prospective patient-reported outcomes in mitochondrial medicine. Mol Genet Metab. 2024;142(1):108348. [DOI] [PubMed] [Google Scholar]
  • 36.Smith JM, Lathara M, Wright H, Hill B, Ganapati N, Srinivasa G, et al. Advancing clinical cohort selection with genomics analysis on a distributed platform. PLoS One. 2020;15(4):e0231826. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Nye RT, Hill DL, Carroll KW, Boyden JY, Katcoff H, Griffis H, et al. The design of a data management system for a multicenter palliative care cohort study. J Pain Symptom Manage. 2022;64(1):e53–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Kayaba K. Overcoming the difficulties of cohort studies. J Epidemiol. 2013;23(3):156–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Adhikari K, Patten SB, Patel AB, Premji S, Tough S, Letourneau N, et al. Data harmonization and data pooling from cohort studies: a practical approach for data management. Int J Popul Data Sci. 2021;6(1):1680. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Du J, Lu Q, Jin G, Xia Y, Shen H, Hu Z. Data management and quality control strategies for population based cohort study. Zhonghua yu fang yi xue za zhi [Chin J Prev Med]. 2018;52:1078–81. [DOI] [PubMed] [Google Scholar]
  • 41.Huguet J, Falcon C, Fusté D, Girona S, Vicente D, Molinuevo JL, Gispert JD, Operto G, Study A. Management and quality control of large neuroimaging datasets: developments from the Barcelonaβeta Brain Research Center. Front Neurosci. 2021;15:633438. [DOI] [PMC free article] [PubMed]
  • 42.DeMerle KM, Kennedy JN, Palmer OMP, Brant E, Chang CCH, Dickson RP, et al. Feasibility of embedding a scalable, virtually enabled biorepository in the electronic health record for precision medicine. JAMA Netw Open. 2021;4(2):e2037739-e2037739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Steiner A, Hella J, Grüninger S, Mhalu G, Mhimbira F, Cercamondi CI, et al. Managing research and surveillance projects in real-time with a novel open-source emanagement tool designed for under-resourced countries. J Am Med Inform Assoc. 2016;23(5):916–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Byun SS, Hong SK, Lee S, Kook HR, Lee E, Kim HH, et al. The establishment of KORCC (KOrean renal cell carcinoma) database. Investig Clin Urol. 2016;57(1):50–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Abugessaisa I, Gomez-Cabrero D, Snir O, Lindblad S, Klareskog L, Malmström V, et al. Implementation of the CDC translational informatics platform–from genetic variants to the national Swedish Rheumatology Quality Register. J Transl Med. 2013;11:85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Liu S, Wang Y, Wen A, Wang L, Hong N, Shen F, et al. Implementation of a cohort retrieval system for clinical data repositories using the Observational Medical Outcomes Partnership Common Data Model: proof-of-concept system validation. JMIR Med Inform. 2020;8(10):e17376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Sinitkul R, Maude RJ, Nithirochananont U. Design of an integrated clinical research informatics system for a multi-centre and multi-visit prospective birth cohort study. Stud Health Technol Inform. 2022;290:125–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.AbuHalimeh A. Improving data quality in clinical research informatics tools. Front Big Data. 2022;5:871897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Morabia A. History of epidemiological methods and concepts. In: Handbook of epidemiology 2023. New York, NY: Springer New York, pp. 1–33 .
  • 50.Morabia A, Guthold R. Wilhelm Weinberg’s 1913 large retrospective cohort study: a rediscovery. Am J Epidemiol. 2007;165(7):727–33. [DOI] [PubMed] [Google Scholar]
  • 51.Morabia A. Snippets from the past: the evolution of Wade Hampton Frost’s epidemiology as viewed from the American Journal of Hygiene/Epidemiology. Am J Epidemiol. 2013;178(7):1013–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Mahmood SS, Levy D, Vasan RS, Wang TJ. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. Lancet. 2014;383(9921):999–1008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Doll R, Hill AB. Lung cancer and other causes of death in relation to smoking; a second report on the mortality of British doctors. Br Med J. 1956;2(5001):1071–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Pavlović I, Kern T, Miklavcic D. Comparison of paper-based and electronic data collection process in clinical trials: costs simulation study. Contemp Clin Trials. 2009;30:300–16. [DOI] [PubMed] [Google Scholar]
  • 55.Feigelson HS, Clarke CL, Van Den Eeden SK, Weinmann S, Burnett-Hartman AN, Rowell S, et al. The Kaiser Permanente research bank cancer cohort: a collaborative resource to improve cancer care and survivorship. BMC Cancer. 2022;22(1):209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Belenkaya R, Watson A, Bethusamy S, Patel M, Sandler T, Schwartz J. Abstract PO-061: Data harmonization for COVID-19 and cancer research registries. Clin Cancer Res. 2020;26(18_Supplement):PO-061. 10.1158/1557-3265.COVID-19-PO-061. [Google Scholar]
  • 57.Tang H, Jiang X, Lou J, Chen T. Methodology for survival assessment of cancer patients using population-based cancer registration data. Zhejiang Da Xue Xue Bao Yi Xue Ban. 2018;47(1):104–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Zhao Y, Wang Y, Wang H, Yan B, Shen F, Peterson KJ, Rocca WA, Sauver JS, Liu H. Annotating Cohort Data Elements with OHDSI Common Data Model to Promote Research Reproducibility. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): 3-6 Dec. Madrid, Spain; 2018. pp. 1310–7. 10.1109/BIBM.2018.8621269.
  • 59.Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. 10.1136/bmj.n71. [DOI] [PMC free article] [PubMed]
  • 60.Klein D, Montgomery A, Begale M, Sutherland S, Sawyer S, McCauley JL, et al. Building a digital health research platform to enable recruitment, enrollment, data collection, and follow-up for a highly diverse longitudinal US cohort of 1 million people in the all of us research program: design and implementation study. J Med Internet Res. 2025;27:e60189. 10.2196/60189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Liu X, Wang Y, Luo Z, Xi T, Huang W, Zhang X, et al. Automated cohort database system for cardiopulmonary physiotherapy: a comprehensive tool supporting research on cardiac surgery patients—framework design, development and validation. Comput Methods Programs Biomed. 2025;268:108825. 10.1016/j.cmpb.2025.108825. [DOI] [PubMed] [Google Scholar]
  • 62.Kusejko K, Smith D, Scherrer A, Paioni P, Kohns Vasconcelos M, Aebi-Popp K, et al. Migrating a well-established longitudinal cohort database from Oracle SQL to Research electronic data entry (REDCap): data management research and design study. JMIR Form Res. 2023;7:e44567. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Schmidt CO, Struckmann S, Scholz M, Schössow J, Radke D, Richter A, et al. Conducting an epidemiologic study and making it FAIR: reusable tools and procedures from a population-based cohort study. Stud Health Technol Inform. 2023;302:871–5. [DOI] [PubMed] [Google Scholar]
  • 64.Li X, Liang H. Project, toolkit, and database of neuroinformatics ecosystem: A summary of previous studies on “Frontiers in Neuroinformatics.” Front Neuroinform. 2022;16:902452. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Feric Z, Agostini NB, Beene D, Signes-Pastor AJ, Halchenko Y, Watkins D, MacKenzie D, Karagas M, Manjourides J, Alshawabkeh A. A secure and reusable software architecture for supporting online data harmonization. In: 2021 IEEE International Conference on big Data (Big Data): Orlando, FL, USA; 2021. pp. 2801-12. 10.1109/BigData52589.2021.9671538. [DOI] [PMC free article] [PubMed]
  • 66.Butters OW, Wilson RC, Burton PR. Recognizing, reporting and reducing the data curation debt of cohort studies. Int J Epidemiol. 2020;49(4):1067–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Murtagh MJ, Blell MT, Butters OW, Cowley L, Dove ES, Goodman A, et al. Better governance, better access: practising responsible data sharing in the METADAC governance infrastructure. Hum Genomics. 2018;12(1):24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Barger D, Leleux O, Conte V, Sapparrart V, Gapillout M, Crespel I, et al. Integrating electronic patient-reported outcome measures into routine HIV care and the ANRS CO3 Aquitaine cohort’s data capture and visualization system (QuAliv): protocol for a formative research study. JMIR Res Protoc. 2018;7(6):e147. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Rahman QS, Islam MS, Hossain B, Hossain T, Connor NE, Jaman MJ, et al. Centralized Data Management in a Multicountry, Multisite Population-based Study. Pediatr Infect Dis J. 2016;35(5 Suppl 1):S23-28. [DOI] [PubMed] [Google Scholar]
  • 70.Lu M, Rupp LB, Moorman AC, Li J, Zhang T, Lamerato LE, et al. Comparative effectiveness research of chronic hepatitis B and C cohort study (CHeCS): improving data collection and cohort identification. Dig Dis Sci. 2014;59(12):3053–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Garzotto F, Piccinni P, Cruz D, Gramaticopolo S, Dal Santo M, Aneloni G, et al. Rifle-based data collection/management system applied to a prospective cohort multicenter Italian study on the epidemiology of acute kidney injury in the intensive care unit. Blood Purif. 2011;31(1–3):159–71. [DOI] [PubMed] [Google Scholar]
  • 72.Mbacké CS, Phillips JF. Longitudinal community studies in Africa: challenges and contributions to health research. Cover Photograph. 2008;23(3):23. [Google Scholar]
  • 73.O'Mahony D, Fallon UB, Hannon F, Kloeckner K, Avalos G, Murphy AW, Kelleher CC. The Lifeways Cross-Generation Study: design, recruitment and data management considerations. Ir Med J. 2007;100(8):suppl 3–6. [PubMed]
  • 74.ISO/IEC/IEEE International Standard - Systems and software engineering -- Life cycle processes -- Requirements engineering. In: ISO/IEC/IEEE 29148:2018(E), 2018. pp. 1-104. 10.1109/IEEESTD.2018.8559686.
  • 75.Mayernik MS, Liapich Y. The role of metadata and vocabulary standards in enabling scientific data interoperability: a study of earth system science data facilities. J eSci Librarianship. 2022;11(2). 10.7191/jeslib.619.
  • 76.Gebler R, Reinecke I, Sedlmayr M, Goldammer M. Enhancing clinical data infrastructure for AI research: comparative evaluation of data management architectures. J Med Internet Res. 2025;27:e74976. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Mukasa CDM, Kovacheva VP. Development and implementation of databases to track patient and safety outcomes. Curr Opin Anaesthesiol. 2022;35(6):710–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Kiesewetter J, Hege I, Sailer M, Bauer E, Schulz C, Platz M, et al. Implementing remote collaboration in a virtual patient platform: usability study. JMIR Med Educ. 2022;8(3):e24306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Rinaldi E, Stellmach C, Rajkumar NMR, Caroccia N, Dellacasa C, Giannella M, et al. Harmonization and standardization of data for a pan-European cohort on SARS-CoV-2 pandemic. NPJ Digit Med. 2022;5(1):75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Syed R, Eden R, Makasi T, Chukwudi I, Mamudu A, Kamalpour M, et al. Digital health data quality issues: systematic review. J Med Internet Res. 2023;25:e42615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Chen Z, Ning J, Shen Y, Qin J. Combining primary cohort data with external aggregate information without assuming comparability. Biometrics. Biometrics. 2021;77(3):1024–36. 10.1111/biom.13356. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Richesson RL, Hammond WE, Nahm M, Wixted D, Simon GE, Robinson JG, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH health care systems collaboratory. J Am Med Inform Assoc. 2013;20(e2):e226–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Liu S, Wang Y, Wen A, Wang L, Hong N, Shen F, et al. Implementation of a cohort retrieval system for clinical data repositories using the observational medical outcomes partnership common data model: proof-of-concept system validation. JMIR medical informatics. 2020;8(10):e17376. 10.2196/17376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Abrahão MTF, Nobre MRC, Gutierrez MA. A method for cohort selection of cardiovascular disease records from an electronic health record system. Int J Med Inform. 2017;102:138–49. [DOI] [PubMed] [Google Scholar]
  • 85.Nind T, Galloway J, McAllister G, Scobbie D, Bonney W, Hall C, et al. The research data management platform (RDMP): a novel, process driven, open-source tool for the management of longitudinal cohorts of clinical data. Gigascience. 2018;7(7):giy060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Rao HRM, Chen T, Shah SL. Advances in the smart data analytics framework: integrating data extraction and automated reporting. In: 2024 IEEE 3rd Industrial Electronics Society Annual On-Line Conference (ONCON): Beijing, China: IEEE; 2024. pp.1–6.10.1109/ONCON62778.2024.10931649.
  • 87.Clivio L, Tinazzi A, Mangano S, Santoro E. The contribution of information technology: towards a better clinical data management. Drug Dev Res. 2006;67(3):245–50. [Google Scholar]
  • 88.Kaur A, Garg R, Gupta P. Challenges facing AI and big data for resource-poor healthcare system. In: 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC): Coimbatore, India: IEEE; 2021. pp. 1426–33. 10.1109/ICESC51422.2021.9532955.
  • 89.Rodrigues DA, Roque M, Mateos-Campos R, Figueiras A, Herdeiro MT, Roque F. Barriers and facilitators of health professionals in adopting digital health-related tools for medication appropriateness: a systematic review. Digit Health. 2024;10:20552076231225132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Lu Z, Su J. Clinical data management: Current status, challenges, and future directions from industry perspectives. Open Access J Clin Trials. 2010;2:93–105. 10.2147/OAJCT.S8172.
  • 91.Rahman MM, Ripon S. Elicitation and modeling non-functional requirements-a POS case study. arXiv preprint arXiv:1403.1936. 2014
  • 92.Phalnikar R. Validation of non-functional requirements in cloud based systems (short paper). In: 2016 5th IEEE international conference on cloud networking (Cloudnet): Pisa, Italy, IEEE; 2016. pp. 142–5. 10.1109/CloudNet.2016.18.
  • 93.Coviello N, Autio E, Nambisan S, Patzelt H, Thomas LDW. Organizational scaling, scalability, and scale-up: definitional harmonization and a research agenda. J Bus Venturing. 2024;39(5):106419. [Google Scholar]
  • 94.Hawamdeh SS. Cybersecurity and data privacy laws: balancing innovation and protection in the digital age. Middle East J Econ Law Soc Sci (MEJELSS). 2025;3(2):36–46.
  • 95.Riou C, El Azzouzi M, Hespel A, Guillou E, Coatrieux G, Cuggia M. Ensuring general data protection regulation compliance and security in a clinical data warehouse from a university hospital: implementation study. JMIR Med Inform. 2025;13:e63754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Muhammad A, Siddique A, Mubasher M, Aldweesh A, Naveed QN. Prioritizing non-functional requirements in agile process using multi criteria decision making analysis. IEEE Access. 2023;11:24631–54. [Google Scholar]
  • 97.Mbau R, Musiega A, Nyawira L, Tsofa B, Mulwa A, Molyneux S, et al. Analysing the efficiency of health systems: a systematic review of the literature. Appl Health Econ Health Policy. 2023;21(2):205–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Chitchyan R, Groher I, Noppen J. Uncovering sustainability concerns in software product lines. J Softw Evol Process. 2017;29(2):e1853. [Google Scholar]
  • 99.Fernandez M, Pinto HA, Fernandes LM, Oliveira JASd, Lima AMFdS, Santana JSS, et al. Interoperability in universal healthcare systems: insights from Brazil’s experience integrating primary and hospital health care data. Front Digit Health. 2025;7:1622302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Albuquerque D, Guimarães E, Tonin G, Rodríguezs P, Perkusich M, Almeida H, et al. Managing technical debt using intelligent techniques-a systematic mapping study. IEEE Trans Software Eng. 2022;49(4):2202–20. [Google Scholar]
  • 101.Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21(4):578–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Wang Z, Craven C, Syed M, Greer M, Seker E, Syed S, Zozus MN, Syed S, Zozus MN, Craven CK. Clinical data warehousing: a scoping review. J Soc Clin Data Manag. 2024;4(1):1-19. 10.47912/jscdm.320.
  • 103.Murphy S, Wilcox A. Mission and sustainability of informatics for integrating biology and the bedside (i2b2). eGEMs (Generating Evidence & Methods to improve patient outcomes). 2014;2(2):1074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.AbdElazim K, Moawad R, Elfakharany E. A framework for requirements prioritization process in agile software development. In: Journal of Physics: Conference Series 2020;1454(1):012001. IOP Publishing.
  • 106.Yuan J, Malin B, Modave F, Guo Y, Hogan WR, Shenkman E, et al. Towards a privacy preserving cohort discovery framework for clinical research networks. J Biomed Inform. 2017;66:42–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Konidena S. Cost-effective scalability in cloud monitoring systems: a comparative stud. Int J Innov Sci Res Technol. 2024;9(8):382–5. 10.38124/ijisrt/IJISRT24AUG641.
  • 108.Yaseen M. Exploratory study of existing research on software requirements prioritization: a systematic literature review. J Softw Evol Process. 2023;36(6):e2613. 10.1002/smr.2613. [Google Scholar]
  • 109.Parmar T. Scaling data infrastructure for high-volume manufacturing: challenges and solutions in big data engineering. Int Sci J Eng Manag. 2024;3:1. [Google Scholar]
  • 110.Yadav S. Cloud database optimization: strategies for performance, scalability, and cost-efficiency. Int J Sci Res Comput Sci Eng Inform Technol. 2025;11:2958–67. [Google Scholar]
  • 111.Li W, Feng C, Jin C, Chen Q, Liu H, Zhao D. A two-level cloud storage system based on asynchronous message for medical image big data. In: 2019 5th International Conference on Big Data Computing and Communications (BIGCOM): QingDao, China, IEEE; 2019. pp. 54–8. 10.1109/BIGCOM.2019.00017.
  • 112.Prakash B, Reddy Rella B. Mlops and dataops integration for scalable machine learning deployment. Int J Multidisciplinary Res. 2022;4:20. [Google Scholar]
  • 113.Alharthi S, Alshamsi A, Alseiari A, Alwarafy A. Auto-scaling techniques in cloud computing: issues and research directions. Sensors (Basel). 2024;24(17):5551. 10.3390/s24175551. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Friedman SH, Anderson AR, Bortz DM, Fletcher AG, Frieboes HB, Ghaffarizadeh A, et al. Multicellds: a community-developed standard for curating microenvironment-dependent multicellular data. bioRxiv. 2016:090456. 10.1101/090456.
  • 115.Franco EF, Hirama K, Armenia S, dos Santos JR. A systems interpretation of the software evolution laws and their impact on technical debt management and software maintainability. Softw Qual J. 2023;31(1):179–209. [Google Scholar]
  • 116.KnutsenGlette M, Ludlow K, Wiig S, Bates DW, Austin EE. Resilience perspective on healthcare professionals’ adaptations to changes and challenges resulting from the COVID-19 pandemic: a meta-synthesis. BMJ Open. 2023;13(9):e071828. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Kucan G, Tan T, Grossmann D, Graser K, Hall D. Sustainable future-proofing healthcare facilities: a modular and adaptable design approach. J Manag Eng. 2024;40(6):04024053.
  • 118.Toheeb OA, Esther S, Becan D. Future-proofing healthcare through advanced technologies: a strategic framework for sustainable development. 2025.
  • 119.Hu H, Cohen G, Sharma B, Yin H, McConnell R. Sustainability in health care. Annu Rev Environ Resour. 2022;47(1):173–96.
  • 120.Ferreira Franco E. A dynamical evaluation framework for technical debt management in software maintenance process. 2020.
  • 121.Adeshina YT. Interoperable IT architectures enabling business analytics for predictive modeling in decentralized healthcare ecosystems. Int J Adv Res Publication Rev. 2025;2(5):128–52. 10.5281/zenodo.15393463.
  • 122.Ehrenstein V, Kharrazi H, Lehmann H, Taylor CO. Obtaining data from electronic health records. In: Gliklich RE, Leavy MB, Dreyer NA, editors. Tools and Technologies for Registry Interoperability, Registries for Evaluating Patient Outcomes: A User’s Guide, 3rd Edition, Addendum 2 [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2019, Chapter 4.
  • 123.Boiten J-W, Ohmann C, Adeniran A, Canham S, Cano-Abadia M, Chassang G, Chiusano M-L, David R, Fratelli M, Gribbon P. EOSC-Life D4.3 - Guidance and policy on standards and tools to facilitate sharing and reuse of multimodal data (including imaging), cohort integration, and biosamples. 4.3, ECRIN; BBMRI; IRFMN; EU-OPENSCREEN; ELIXIR; VHIR; Charité; UNITO; Lygature; INSERM; EMBRC; ERINHA; EATRIS. 2021. ⟨hal-04161974⟩
  • 124.Friedrich S, Groß S, König IR, Engelhardt S, Bahls M, Heinz J, et al. Applications of artificial intelligence/machine learning approaches in cardiovascular medicine: a systematic review with recommendations. Eur Heart J Digit Health. 2021;2(3):424–36.
  • 125.Liu C, Zhang W, Ooi BC, Yip JWL, Zeng L, Zheng K. Toward cohort intelligence: a universal cohort representation learning framework for electronic health record analysis. arXiv preprint arXiv:2304.04468. 2023. 10.48550/arXiv.2304.04468.
  • 126.Ali A, Ali H, Saeed A, Ahmed Khan A, Tin TT, Assam M, et al. Blockchain-powered healthcare systems: enhancing scalability and security with hybrid deep learning. Sensors. 2023;23(18):7740. 10.3390/s23187740.
  • 127.Richard F, Agordzo GK. E-healthcare security and privacy using advanced blockchain-based algorithm in data storage. In: Evolutionary Artificial Intelligence. Singapore: Springer Nature Singapore; 2025. p. 683–98.

Associated Data

Supplementary Materials

Supplementary Material 1. (12.7KB, docx)

Data Availability Statement

The data that support the findings of this study are available from the corresponding author (Haleh Ayatollahi) upon reasonable request.


Articles from BMC Medical Research Methodology are provided here courtesy of BMC