Indian Journal of Occupational and Environmental Medicine. 2023 Dec 30;27(4):359–363. doi: 10.4103/ijoem.ijoem_342_22

Data Management: The First Step in Reproducible Research

Soundarya Soundararajan, Sukhdev Mishra
PMCID: PMC10880825  PMID: 38390491

Abstract

Reproducibility is a preferred aim in any scientific research, including occupational health research. Data management is an essential step in marching towards reproducibility. Good data management helps us stay organized, improves transparency and quality, and fosters collaboration. Here we discuss how to organize and prepare for data management, how data management facilitates interoperability and accessibility, and how to store and disseminate data. We wrap up by providing pointers on what needs to be included in data management plans.

Keywords: Data management, data quality, data integrity, interoperability, occupational health, reproducibility, research data

INTRODUCTION

Reproducibility, that is, producing the same results by the same methods, is the cornerstone of scientific research.[1] It is an essential component for the assurance of the integrity of research data. According to Nature’s online survey, more than 50% of researchers failed to reproduce their own experimental findings, and 70% could not reproduce another scientist’s work.[2] Evidently, reproducibility is an essential scientific aim that remains far from reach; a majority of current science may be non-reproducible. There is a growing recognition of this problem[3] in the allied sciences of occupational health, such as environmental health and epidemiology.[4,5] Yet the occupational health field still has a lot of work to do in this domain. Unrecognized non-reproducibility can cost as much as 28 billion USD per year globally.[6] The corresponding figures in the Indian context may be even more significant. To add to the problem, new occupational health researchers rarely receive formal training in reproducible research practices. Lack of reproducibility in public health research may erode transparency and trust in data.

Non-reproducibility is multifactorial. One of the most critical contributing factors is “a lack of access to methodological details, raw data, and research materials.”[2] Fortunately, we can tackle this. The best way is to educate researchers on feasible workflows for making their methods, data, and research materials accessible to others. The first imperative step is to create a quality dataset that is trustworthy to the researcher and related stakeholders. As the open access movement catches up in India and is recognized as a way to foster collaboration and knowledge sharing,[7] good data management will help researchers share their data and relevant codes with confidence. Learning to maintain a good data management plan and sharing it with others can build a positive loop of good data management practices for future research.

Recognizing the gravity of this problem, and to enable the validation and replication of research findings, health funding agencies have made the submission of a data management plan (DMP) mandatory along with research proposals.[8] Unlike their Western counterparts, Indian funding institutions typically neither require a DMP nor assess one in subsequent project updates. Still, this should not keep researchers from having one. Filling out the essentials of a template DMP should be part and parcel of proposal writing, even if not imposed. Having a DMP increases the rigor of the proposed research.[9] It is high time we consider data management more than just a requirement, because efficient and reliable data management is a bedrock of scientific advancement. A scientist holds the public’s confidence that what they state is based on evidence and that the evidence is derived from data. So, rather than a choice, maintaining the rigor of the data is a commitment every scientist should uphold.

Here, we briefly explain why and how to conduct research data management in three key sections: organization and preparation, interoperability and accessibility, and storage and dissemination. Toward the end, we lay out a DMP and its key elements. We also provide a list of valuable resources that can be used to implement the constructs discussed. This understanding will aid in practicing data management skills and, once adopted by most researchers, will substantially improve reproducible research practices and thus contribute to great science.

Data analysis and management are the cornerstones of occupational health. Whether you are a beginner or a seasoned researcher, this paper will provide you with useful information. Our combined experience in this field is that of a statistician with years of experience and a new scientist learning the nuances of data management. We have gone through trial and error with several best practices[6] and present here our current workflow, which is readily adaptable. Several examples in this paper are drawn from the authors’ current project.

Organization and Preparation

Data management is not an item to strike off a to-do list but an ongoing process. A good data management workflow is well organized and needs preparation.[10] The need to prepare should not dampen a beginner’s enthusiasm for starting a good management plan. Instead, a considerable amount of time spent initially will create a good base that only needs updates afterward.

Consistent Hierarchy of Files

A strategic way to begin is to start structuring the project folders. The focus here should not be on elaborate folder/subfolder structures but on having consistency across projects.

For example, a simple project folder structure is provided below (See Figure 1). Let’s assume an occupational health researcher is studying the prevalence of occupational stress among healthcare workers. This example can be restructured to suit the researcher’s requirements.

Figure 1. A template for project folder structure

README.txt: Add a brief description of your project, including your research question, for anyone landing on the project to understand it better.

1_Proposal: Add your full proposals here, including developing draft versions.

2_Data Management: Add your Data Management Plan (DMP) here; Refer to the later section for developing a DMP.

3_Data: Add information on how you calculate scores; for example, for the stress scale, how the overall scores are calculated; if possible, copy and paste a relevant reference here.

Having a consistent folder structure helps us to be organized in the first place. When projects get older, a consistent folder/subfolder structure will help us better locate the files rather than relying on memory. Consistency is the first essential step in data management.

An advanced user may try project management in R using the packages “makeProject” or “ProjectTemplate.” These packages create structured project folders with a few lines of code.
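For readers who do not use R, the same idea can be sketched in Python. This is only an illustration, not the authors' tooling: the folder names are taken from the template in Figure 1, while the function name `create_project`, the project name, and the README text are hypothetical.

```python
import tempfile
from pathlib import Path

# Folder names follow the template in Figure 1; adapt them to your project.
TEMPLATE_FOLDERS = ["1_Proposal", "2_Data Management", "3_Data"]


def create_project(root, description):
    """Create the template folders and a README.txt under `root`."""
    root = Path(root)
    for name in TEMPLATE_FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite a description that is already there
        readme.write_text(description + "\n", encoding="utf-8")
    return root


# Demonstration in a scratch directory so nothing is left behind.
with tempfile.TemporaryDirectory() as scratch:
    project = create_project(
        Path(scratch) / "occupational_stress_hcw",  # hypothetical project name
        "Research question: prevalence of occupational stress among healthcare workers.",
    )
    print(sorted(entry.name for entry in project.iterdir()))
```

Because `exist_ok=True` is used and the README is never overwritten, the function can be re-run on an existing project to add missing folders without touching existing files.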

Escaping the Trap of Versions

For someone constantly working on and improving manuscripts or data files, it is crucial to escape the habit of creating files with version numbers such as v1, v2, etc.; for example, draft1.docx, draft2.docx, finaldraft.docx, and final-to-share.docx. Looking back at the project folder after a considerable time has passed, a researcher stumbles on many files with slightly different version names. This creates confusion about which file to work on. A better alternative is to add the date at the beginning of the file name, for example, 202203_manuscript_firstdraft.docx and 202204_manuscript_introadded.docx. With this method, the files sort themselves along the timeline automatically. Similarly, when multiple authors suggest corrections and share revised files, their initials can be added to the file names. An advanced user can explore version control with Git and hosting services such as GitHub (github.com) to maintain project folders and file versions.
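The date-first naming convention can be generated rather than typed, which keeps it consistent. The sketch below is illustrative; the helper name `dated_name` and its parameters are assumptions, and only the YYYYMM prefix pattern comes from the examples above.

```python
from datetime import date


def dated_name(label, extension, when=None, initials=None):
    """Return a name such as 202203_manuscript_firstdraft.docx.

    `initials` is optional and marks who edited the file.
    """
    when = when or date.today()
    parts = [f"{when:%Y%m}", "_".join(label.lower().split())]
    if initials:
        parts.append(initials)
    return "_".join(parts) + "." + extension.lstrip(".")


names = [
    dated_name("manuscript introadded", "docx", date(2022, 4, 1)),
    dated_name("manuscript firstdraft", "docx", date(2022, 3, 1)),
]
# Plain alphabetical sorting now matches the timeline.
print(sorted(names))
```

Putting the year before the month is what makes alphabetical order and chronological order coincide.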

Templates for re-use

Once the folder structure is laid out, it can be used as a template for the next project. A template folder structure can also be updated by adding folders as they become necessary. When beginning a new project, it is easy to copy and paste the template folder structure to a new location. As stated at the beginning of this section, this saves the considerable time otherwise spent rebuilding the folder structure. An advanced user may try making template repositories on GitHub[11] and start projects quickly with a click.
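Copying a template folder can also be scripted. The following is a minimal sketch using Python's standard library; the function name `start_from_template` and the folder names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def start_from_template(template_dir, new_project_dir):
    """Copy the whole template tree; refuse to overwrite an existing project."""
    template_dir, new_project_dir = Path(template_dir), Path(new_project_dir)
    if new_project_dir.exists():
        raise FileExistsError(f"{new_project_dir} already exists")
    shutil.copytree(template_dir, new_project_dir)
    return new_project_dir


# Demonstration in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    template = Path(scratch) / "project_template"
    (template / "1_Proposal").mkdir(parents=True)
    (template / "README.txt").write_text("Describe the project here.\n", encoding="utf-8")
    new_project = start_from_template(template, Path(scratch) / "2023_sleep_quality_nurses")
    print(sorted(entry.name for entry in new_project.iterdir()))
```

The explicit existence check is deliberate: a new project should never silently merge into an older one.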

Interoperability and Accessibility

Interoperability is the “ability to access and process data from multiple sources without losing meaning and then integrate that data for mapping, visualization, and other forms of representation and analysis,” according to the data interoperability collaborative initiative.[12]

Put yourself in others’ shoes

Imagine a reader trying to access a dataset in the contributing researchers’ absence. The first hindrance will be understanding what variables are used. The second will be not knowing which codes generate the derived data (data for analysis) from the raw data. For data to be interoperable, the standardization procedures that were followed need to be documented. Imagine this reader being able to understand what you did without you being there to guide them. The details that make this possible constitute the metadata, which typically include:

  • Title

  • Brief description of the project

  • Tags/keywords

  • Creator

  • Last modified

  • Contact

  • How to cite?

  • Format

  • Size

  • Codebook

  • Associated research articles

  • Variables in the dataset

    • Number of columns

    • Number of rows.
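Part of such a metadata record (format, size, last modified, variables, number of rows and columns) can be read from the data file itself, while the descriptive fields are written by hand. The sketch below assumes the data are stored as a CSV file with a header row; the function name `describe_dataset`, the file names, and all field values are hypothetical.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path


def describe_dataset(csv_path, **descriptive_fields):
    """Combine hand-written fields with facts read from the CSV file itself."""
    csv_path = Path(csv_path)
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        variables = next(reader, [])      # first line holds the variable names
        n_rows = sum(1 for _ in reader)   # remaining lines are records
    stat = csv_path.stat()
    record = dict(descriptive_fields)     # title, creator, contact, and so on
    record.update({
        "last_modified": date.fromtimestamp(stat.st_mtime).isoformat(),
        "format": "CSV",
        "size_bytes": stat.st_size,
        "variables": variables,
        "n_columns": len(variables),
        "n_rows": n_rows,
    })
    return record


# Demonstration with a tiny made-up file in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    data_file = Path(scratch) / "lead_study.csv"
    data_file.write_text("Is_Male,Is_BPL,Age,Lead\n1,0,34,4.2\n0,1,41,6.8\n", encoding="utf-8")
    metadata = describe_dataset(
        data_file,
        title="Blood lead levels (illustrative)",
        keywords=["lead", "occupational health"],
        creator="Name of the researcher",
        contact="researcher@example.org",
        codebook="codebook.txt",
    )
    (Path(scratch) / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    print(json.dumps(metadata, indent=2))
```

Storing the record as JSON next to the data keeps it both human-readable and machine-readable.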

A good codebook is a great cookbook

A good codebook speaks for the data. Unfortunately, it is not uncommon for a reader accessing a dataset to be unable to tell whether 1 denotes females or males. There are two simple ways to tackle this. One obvious way is to have an extensive codebook for all the variables in the dataset that describes the levels of each variable, as below.

Let’s assume the researcher studies how lead levels vary among men and women. The researcher collects the following variables: gender, age, whether the participant holds a below poverty line (BPL) card, and the level of lead in blood. For this purpose, the codebook can look like Table 1.

Table 1. Sample codebook

| Variable | Description | Type | Levels | Note |
| --- | --- | --- | --- | --- |
| Is_Male | Gender | Categorical | 0 = females, 1 = males | |
| Is_BPL | Below poverty line (BPL) status | Categorical | 0 = Not BPL, 1 = BPL | |
| Age | Age of the participant | Numeric | - | |
| Lead | Lead levels | Numeric | - | Units: micrograms of lead per deciliter of blood (μg/dL) |

Another intuitive alternative is to use self-explanatory variable names. For example, when the variable is named Is_Male and coded as 0 and 1, it is understandable that 1 = males and 0 = females. However, using descriptive variable names does not replace the need for a codebook.
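A codebook kept in machine-readable form can also be used to check the data against it. The sketch below encodes the sample codebook for the lead example as a Python dictionary; the function names `check_row` and `label`, and the example record, are hypothetical.

```python
# The sample codebook, written as a dictionary.
CODEBOOK = {
    "Is_Male": {"description": "Gender", "type": "categorical",
                "levels": {0: "females", 1: "males"}},
    "Is_BPL": {"description": "Below poverty line (BPL) status", "type": "categorical",
               "levels": {0: "Not BPL", 1: "BPL"}},
    "Age": {"description": "Age of the participant", "type": "numeric"},
    "Lead": {"description": "Lead levels", "type": "numeric", "unit": "μg/dL"},
}


def check_row(row, codebook=CODEBOOK):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, spec in codebook.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        if spec["type"] == "categorical" and value not in spec["levels"]:
            problems.append(f"{name}: unexpected code {value!r}")
        elif spec["type"] == "numeric" and not isinstance(value, (int, float)):
            problems.append(f"{name}: not numeric ({value!r})")
    return problems


def label(row, name, codebook=CODEBOOK):
    """Translate a coded categorical value into its documented meaning."""
    return codebook[name]["levels"][row[name]]


record = {"Is_Male": 1, "Is_BPL": 0, "Age": 34, "Lead": 4.2}  # made-up values
print(check_row(record))          # an empty list means the record matches the codebook
print(label(record, "Is_Male"))
```

Any code that is not documented in the codebook, such as Is_Male = 2, is reported instead of being silently accepted.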

From raw to data for analysis

Data management includes both raw and derived data. Once raw data is available from the variables collected, cleaning it produces derived data for analysis. When handling derived data, other users might want to know how the data for analysis were derived from the raw data. For example, a researcher studies smoking behavior among the study participants. After recording how many cigarettes are smoked per day, a variable called pack years is calculated by multiplying the number of packs smoked per day (one pack being 20 cigarettes) by the number of years of smoking.

Pack years is a derived variable, whereas the raw data might contain only the number of cigarettes per day and the number of years of smoking.

Giving a description: Pack Years (PY) = (Number of cigarettes per day ÷ 20) × Years of smoking.

Reference: https://www.cancer.gov/publications/dictionaries/cancer-terms/def/pack-year.

Providing the detailed steps taken to clean the raw data and produce the derived data makes data usage more confident, open, and transparent. Thus, the project folder should include both raw and derived data. Providing raw and derived data and presenting the steps to move from one to the other is crucial for reproducibility. Those who use R may be familiar with script files, which can be shared as codes to reproduce the analysis. Many might hesitate because the codes they have written do not look clean and tidy enough to disseminate. The idea behind sharing codes is not to be artistic but to share usable code. The crucial questions to ask are whether the code works for you and whether others can reproduce the results with it, not whether the code looks tidy.
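As an illustration of a shareable raw-to-derived step, the sketch below computes pack years following the NCI definition referenced above (packs smoked per day multiplied by years of smoking, with one pack taken as 20 cigarettes). It is written in Python rather than R or SPSS; the column names and the sample values are made up for the example.

```python
import csv
import io

CIGARETTES_PER_PACK = 20  # one pack is taken as 20 cigarettes


def pack_years(cigarettes_per_day, years_smoked):
    """Packs smoked per day multiplied by the number of years smoked."""
    return (cigarettes_per_day / CIGARETTES_PER_PACK) * years_smoked


def derive(raw_rows):
    """Yield a copy of each raw row with the derived variable added."""
    for row in raw_rows:
        derived = dict(row)  # the raw values themselves are left untouched
        derived["pack_years"] = round(
            pack_years(float(row["cigarettes_per_day"]), float(row["years_smoked"])), 2
        )
        yield derived


# Made-up raw data; in practice this would be the raw CSV file in the project folder.
RAW = "id,cigarettes_per_day,years_smoked\n1,20,10\n2,10,5\n"
rows = list(derive(csv.DictReader(io.StringIO(RAW))))
for row in rows:
    print(row["id"], row["pack_years"])
```

Keeping the derivation in a script, rather than editing the spreadsheet by hand, means the derived file can always be regenerated from the raw file.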

As many researchers work with SPSS, we suggest an important method to share data cleaning and analysis codes: the Syntax Editor (See Figure 2). Even long-time users seem to be unaware that a syntax file can be maintained in SPSS. Before an analysis is run, choosing Paste in the analysis dialog opens a new window containing the syntax (the codes for the analysis). This syntax file can be saved just like the output files in SPSS. For re-analysis, the syntax file and the dataset can be made available, from which the same results can be reproduced. Even if it is not shared, saving the syntax for re-analysis is a good practice. A stored syntax file makes it possible to retrace the analysis steps, which is otherwise very difficult in point-and-click software like SPSS.

Figure 2. A screenshot from SPSS syntax

RStudio offers script files that store your codes and provide a platform to run them. Readers who are starting to learn R should get used to writing codes in R scripts instead of running commands in the console. While the console offers a temporary space to run your codes, a script file is permanent, which is essential for retracing or reproducing your analysis. A new script file can be opened in RStudio through File → New File → R Script, or with Ctrl + Shift + N on the keyboard.

How to conduct a quality check for data?

It is a good practice to run quality checks on datasets. Transcribing numbers from printed papers to Excel sheets may generate errors. To catch these, a random 5% of the data can be checked by another person. For example, in a dataset of 100 participants, the person conducting the quality check can select five random numbers and check whether the variables are correctly coded from the paper to the datasheet. We recently conducted a data quality check for our project on stress and sleep quality in nurses. For this, two independent technical staff picked random numbers from the data IDs and manually checked the raw data. The raw data and the complete interview sheets were provided to the staff. Any discrepancies between the recorded interview and the raw data were brought to the attention of the Principal Investigator (PI) and corrected. Once all the variables were confirmed to be accurately transcribed from the interview sheets, the quality check was marked complete.
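The random selection for such a check can be made reproducible by fixing the random seed, so that anyone can later confirm which records were inspected. This is a minimal sketch of the idea, not the procedure used in the project described above; the function names, the seed value, and the example data are assumptions.

```python
import random


def qc_sample(record_ids, percent=5, seed=2023):
    """Pick `percent` of the record IDs at random, at least one, reproducibly."""
    ids = sorted(record_ids)
    if not ids:
        return []
    k = max(1, -(-len(ids) * percent // 100))  # ceiling division on integers
    return sorted(random.Random(seed).sample(ids, k))


def find_discrepancies(paper, entered, ids):
    """Compare the paper value with the entered value for the sampled IDs."""
    return {
        i: {"paper": paper[i], "entered": entered[i]}
        for i in ids
        if paper[i] != entered[i]
    }


sample = qc_sample(range(1, 101))  # 100 participants, so five IDs are drawn
print(sample)

# Made-up ages: ID 2 was mistyped during data entry.
paper = {1: 34, 2: 41, 3: 29}
entered = {1: 34, 2: 14, 3: 29}
print(find_discrepancies(paper, entered, [1, 2, 3]))
```

Recording the seed together with the list of sampled IDs in the project folder documents the quality check itself.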

Storage and Dissemination

Good data management culminates in sharing the data; even if the data are not shared, they should be stored efficiently to facilitate retrieval when needed. Sharing the data often depends on institutional data-sharing policies and storage mandates. Readers can familiarize themselves with such policies through their institute statistician or data management experts. If such mandates are unavailable, storing the data for at least three years after publication is a good idea. The storage format should be accessible. For instance, an SPSS output file is not an accessible format, as a user without access to SPSS cannot open it. An Excel file is acceptable, but a CSV file is more widely accepted. Having a universal format (.txt, .csv) lets many standard software programs access the data. Data-sharing sites like Dataverse or OSF[13] can be explored to disseminate the data. One great advantage of sharing data is getting cited.[14] A data management plan written earlier can also be linked to this data, and both the data and the DMP can be cited in the research publication. This is not only about visibility but also about leveraging the fact that transparency builds trust and confidence. When many can access the data, understand it, and reproduce or replicate the analysis, science will rise from the reproducibility crisis. For that, good data management is the first essential step.
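Writing data in a universal format needs nothing beyond a standard library in most languages. The sketch below uses Python's `csv` module to produce both comma-separated and tab-delimited text from the same records; the records and the helper name `write_delimited` are made up for illustration.

```python
import csv
import io

# Made-up records using the variable names from the sample codebook.
ROWS = [
    {"Is_Male": 1, "Is_BPL": 0, "Age": 34, "Lead": 4.2},
    {"Is_Male": 0, "Is_BPL": 1, "Age": 41, "Lead": 6.8},
]


def write_delimited(rows, handle, delimiter=","):
    """Write records with a header row; use delimiter='\\t' for a .txt file."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


csv_text = io.StringIO()
write_delimited(ROWS, csv_text)
print(csv_text.getvalue())

tab_text = io.StringIO()
write_delimited(ROWS, tab_text, delimiter="\t")
print(tab_text.getvalue())
```

To write to disk instead of memory, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as the handle.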

DATA MANAGEMENT PLAN

The ideal time to start data management is while developing the proposal from the research idea. Even if it was not started at the outset, data management can be initiated at any time during the project cycle. Having realized why a DMP is important and should be developed, now is the time for action: adding a data management plan to the project folder. A data management plan need not be developed from scratch. Multiple templates are available on sites like OSF or dmptool.org, including templates that follow specific funders’ requirements. At the most basic level, as a text file, a DMP should hold the following information:

  1. Data and datatypes to be generated from the project.

  2. How will they be preserved?

  3. What are the plans to share?

  4. What tools are required to access data?

  5. Is this data findable? Depositing the data in a data repository generates a DOI, which can be used as a unique identifier for the data, just like a published article.

  6. Is the data accessible? Some accessible formats include CSV and txt files.
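The six points above can be turned into a plain-text skeleton that lives in the 2_Data Management folder and is filled in as the project progresses. The following is a sketch only; the wording of the questions paraphrases the list, and the function name `render_dmp`, the project title, and the sample answer are hypothetical.

```python
# Questions paraphrase the six items listed above.
DMP_QUESTIONS = [
    "Data and data types to be generated from the project",
    "How will the data be preserved?",
    "What are the plans to share the data?",
    "What tools are required to access the data?",
    "Is the data findable (for example, through a DOI)?",
    "Is the data accessible (for example, CSV or txt files)?",
]


def render_dmp(project_title, answers=None):
    """Return the DMP as plain text; unanswered items are flagged clearly."""
    answers = answers or {}
    lines = [f"DATA MANAGEMENT PLAN: {project_title}", ""]
    for number, question in enumerate(DMP_QUESTIONS, start=1):
        lines.append(f"{number}. {question}")
        lines.append("   " + answers.get(number, "TO BE COMPLETED"))
        lines.append("")
    return "\n".join(lines)


dmp_text = render_dmp(
    "Occupational stress among healthcare workers",          # hypothetical project
    {1: "Questionnaire responses stored as a CSV file."},    # hypothetical answer
)
print(dmp_text)
```

Flagging unanswered items explicitly makes it easy to see what remains to be decided at each project update.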

Imagine the data management plan as a stand-alone document that tells the reader everything about the data that is necessary to understand and redo the analysis. Data management is not restricted to big data or multi-collaborator projects. It applies equally to smaller projects and to those maintained by only one person. Data management helps the researcher in day-to-day practice and paves the way for other researchers who want to work with data initially put together by a single researcher. In other words, a sound data management plan makes the data interoperable, which is an essential concept in reproducible research.

TAKE HOME AND CONCLUSION

Data stewardship is an essential yet enjoyable process. Once the benefits are reaped, one might never want to return to an unorganized state. Data management requires only fundamental skills and can be carried out even at a minimal level. We believe data management should be part and parcel of the research training curriculum, as it is often overlooked or considered the responsibility of data curators alone. Interested researchers can go further by incorporating GitHub workflows and developing the required skills in R. Even without these, data management can be done and kept up to date successfully.

Marching toward fully reproducible research is the ultimate goal in the era of open and reproducible science (See Figure 3). Imagine the relief of having adequately managed data that can be shared with collaborators, followers, and journals with just the click of a link. By following a detailed data management plan, one invites collaboration and citation and fosters rigor in the collected data and, thus, in future research. New research is built on the foundation of existing research[15] and should be built on robust data. Building robust data starts with a data management plan, which encourages us to think about several aspects of the data and thereby improves its transparency and quality. We therefore encourage readers to engage in good data management to increase their confidence in sharing the data (as required) and to take the first step in reproducible research.

Figure 3. Key elements in venturing reproducible research

List of useful resources

  1. Tools for creating data management plans: https://dmptool.org/, https://dmpg.nfdi4plants.org

  2. Free and reusable resources for reproducibility: https://www.repro4everyone.org/resources

  3. Data management plans from Harvard: https://datamanagement.hms.harvard.edu/plan/data-management-plans

  4. List of data repositories: https://www.nature.com/sdata/policies/repositories

  5. 2023 updated NIH document on writing a Data Management and Sharing Plan: https://sharing.nih.gov/data-management-and-sharing-policy/planning-and-budgeting-for-data-management-and-sharing/writing-a-data-management-and-sharing-plan#writing-a-data-management-and-sharing-plan

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

REFERENCES


Articles from Indian Journal of Occupational and Environmental Medicine are provided here courtesy of Wolters Kluwer -- Medknow Publications
