Skip to main content
Indian Journal of Psychological Medicine logoLink to Indian Journal of Psychological Medicine
. 2022 Sep 20;44(6):618–619. doi: 10.1177/02537176221127369

The Young Researcher’s Guide to Creating Data Mastersheets

Chittaranjan Andrade 1,
PMCID: PMC9615445  PMID: 36339688

Abstract

Researchers who have completed a study need to transfer their data from individual patient research files to a mastersheet. This article is a beginner’s guide to creating data spreadsheets that can be shared with others and imported into statistical programs. The article discusses choice of spreadsheet software, use of separate worksheets for main data and granular data, use of columns for variables and rows for patient data, coding data, entry of data into cells in the worksheets, and other matters.

Keywords: Datasheets, mastersheets, spreadsheets, worksheets, data coding, data entry


Patients in research studies have research files in addition to their hospital files. Whereas the hospital files contain clinical and treatment data as required by the hospital, the research files contain clinical and treatment data that are related to the study being undertaken.

Source Data

A typical research file for a study subject contains, for example, separate (hardcopy) pages for study identification details, subject identification details, signed and witnessed informed consent form, diagnostic criteria fulfilled, inclusion/exclusion criteria fulfilled, subject sociodemographic data, subject clinical data, assessments on each instrument at baseline and at each follow up visit, treatment and adverse event details at each study visit, and subject disposal details at study endpoint; the actual particulars would depend on study specifics.

The research file should contain full information and not redacted data. For example, it should contain the actual age and the actual educational attainments of the subject rather than the age band into which the subject falls or the number of years of education the subject has received. And, it should contain the original rating instrument sheets for each study visit with scoring marked for each item rather than columns of numbers for ratings for each item on each instrument at each study visit. Thus, the research file will eventually become quite thick as the study progresses.

The hospital records and the research file together comprise the source data to which reference can be made should data need to be verified or retrieved at any future time. Source data need to be preserved for years, even decades, after completion of the study. This is to permit audits, defend against allegations of research misconduct, and allow re-extraction of data for future analyses, as necessary.

Spreadsheets

Data from research files need to be transferred to data mastersheets so that statistical analysis can be performed. In pre-computer days, mastersheets were created on large sheets of chart paper that were pasted together, with rows and columns ruled thereon, and with data painstakingly filled into little squares. Today, spreadsheet software performs the same function with far greater ease.

It is better to use popular spreadsheet software than specialized software that works only on a specific platform. Spreadsheets created using Microsoft Excel or Google Sheets can easily be shared across users and imported into various statistical programs for analysis. Spreadsheets have columns and rows.

Columns and Rows

Each column represents a different variable. The first column should be serial number. Serial numbers, as recorded on the patient research files, start with 1 for the first patient and end with the last patient recruited. No more columns are necessary for patient identification. Thus, the datasheet is anonymized and confidentiality of patient information is protected.

If the study is a clinical trial, the next column could be study group; otherwise, the next columns are usually sociodemographic datapoints such as age, sex, and years of education, followed by clinical data such as diagnosis, age at onset of illness, and number of previous episodes of illness, followed by baseline ratings on various research instruments, followed by treatments administered at baseline, followed by follow-up ratings, treatments, adverse effects, and so on. A sample datasheet is provided as Supplementary Material.

Column headings are best restricted to eight self-explanatory alphanumeric characters because many statistical programs (into which the datasheet must be imported) set this restriction. So, serial number can be represented by SLNo, education by Educ, age at illness onset by AgeOnset, baseline Hamilton Rating Scale for Depression (HAM-D) total score by HDTot0, HAM-D total score at the first follow up by HDTot 1 , and so on. If individual item scores on the HAM-D need to be listed, the column headings can run as HD0i1, HD0i2, HD0i3, and so on, where HD is HAM-D, 0 is baseline, i1 is item 1, i2 is item 2, and so on.

Each row represents a different patient; data from different patients should not be entered into the same row; data from a single patient should not be entered into more than one row.

Cells

Data are entered into cells in the spreadsheet. Each cell contains data for the column variable for the row patient. Data should be entered into cells as numbers, only. So, an age of 5 years should be entered as 5 and not as ‘5 years’ or ‘5 y’. Each cell should contain only one number; so, if a patient is receiving two drugs, the data should be entered in the patient’s row under two columns, one column for each drug.

Continuous variables such as age, years of education, or depression ratings do not require coding; they are entered as actual values. Continuous variables should never be categorized; thus, age should be entered as the actual age and not as 0 for <18 years, 1 for 18–45 years, and so on. If at all categorization is needed, this can be done using the statistical software Recode function at a later time.

Dichotomous variables are best coded as 0 and 1; for example, Male = 0 and Female = 1. Group can be coded as 0 for control and 1 for experimental. No and Yes are 0 and 1; absent and present are 0 and 1. Thus, what 0 and 1 are is often intuitive.

Socioeconomic status can be coded as 0 for lower, 1 for middle, and 2 for upper because these are ordinal data. Religion can be coded as Hindu = 1, Muslim = 2, Christian = 3, and Other Religion = 4. However, at the time of analysis, because religion is not an ordinal variable, the Recode function in the statistical software should be used to create new variables: Hindu, No = 0, 1 = Yes; Muslim, No = 0, 1 = Yes; and so on. If this is not done, the statistical software will interpret Other Religion (code, 4) to be 4 times as much as Hindu (code, 1) and twice as much as Muslim (code, 2). How data should be coded should be decided before data entry starts, in consultation with the statistician who will analyze the data; else, poorly coded data would require data to be entered all over again.

If data are missing, it is best to leave the cell empty. Sometimes, values of 8, or 88, or 888 are used to denote ‘other’ for single-, double- or triple-digit data, and 9, 99, or 999 to denote ‘not known’ or missing data. Again, it’s best to consult the statistician because 8s and 9s are treated as real values by statistical software and not as comments.

Worksheets

A single spreadsheet file can contain many worksheets. The first worksheet should contain the data on which the main statistical analysis is done. Subsequent worksheets can contain granular data, such as item-wise ratings for each rating instrument; these are seldom required in the first worksheet in which, usually, it suffices to enter subscale and total scale scores. When there are many data worksheets, the first column (SLNo), representing patient identification data, should be the same for every worksheet. This would allow correct merging of individual patient data when merging worksheets for additional statistical analysis. The last worksheet should contain explanations for the coding used for the different variables in the different worksheets (Supplementary Material).

Parting Notes

Data, once entered, should be checked for correctness, preferably by a different person. Descriptive statistics, such as range or frequency counts, should be obtained for every variable before starting analysis. This way, if Relapse (No/Yes) is coded as 0 and 1, and if some cells contain other values, the errors can be spotted; or, if only subjects aged 18–60 years were recruited, if some subjects have an age value <18 or >60 years, the errors can be picked up and corrected after reference to the source documents. Finally, the guidance in this article is of a general nature; different studies will have different study-specific requirements.

Supplemental Material

Supplemental material for this article is available online.

Footnotes

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author received no financial support for the research, authorship, and/or publication of this article.

Reference

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental material for this article is available online.


Articles from Indian Journal of Psychological Medicine are provided here courtesy of Indian Psychiatric Society South Zonal Branch

RESOURCES