PLOS ONE. 2021 Jun 21;16(6):e0251194. doi: 10.1371/journal.pone.0251194

A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

Heidi Seibold 1,2,3,4,*, Severin Czerny 1, Siona Decke 1, Roman Dieterle 1, Thomas Eder 1, Steffen Fohr 1, Nico Hahn 1, Rabea Hartmann 1, Christoph Heindl 1, Philipp Kopper 1, Dario Lepke 1, Verena Loidl 1, Maximilian Mandl 1, Sarah Musiol 1, Jessica Peter 1, Alexander Piehler 1, Elio Rojas 1, Stefanie Schmid 1, Hannah Schmidt 1, Melissa Schmoll 1, Lennart Schneider 1, Xiao-Yin To 1, Viet Tran 1, Antje Völker 1, Moritz Wagner 1, Joshua Wagner 1, Maria Waize 1, Hannah Wecker 1, Rui Yang 1, Simone Zellner 1, Malte Nalenz 1
Editor: Jelte M Wicherts
PMCID: PMC8216542  PMID: 34153038

Abstract

Computational reproducibility is a cornerstone of sound and credible research. In complex statistical analyses—such as the analysis of longitudinal data—reproducing results is far from simple, particularly if no source code is available. In this work we aimed to reproduce the longitudinal data analyses of 11 articles published in PLOS ONE. Inclusion criteria were the availability of data and author consent. We investigated which methods and software were used and whether we were able to reproduce the data analysis using open source software. Most articles provided overview tables and simple visualisations. Generalised estimating equations (GEEs) were the most popular statistical models among the selected articles. Only one article used open source software and only one published part of the analysis code. Reproduction was difficult in most cases and required reverse engineering of results or contacting the authors. For three articles we were not able to reproduce the results, and for another two only in part. For all but two articles we had to contact the authors to be able to reproduce the results. Our main learning is that reproducing papers is difficult if no code is supplied and places a high burden on those conducting the reproductions. Open data policies in journals are good, but to truly boost reproducibility we suggest adding open code policies.

Introduction

Reproducibility is—or should be—an integral part of science. While computational reproducibility is only one part of the story, it is an important one. Studies on computational reproducibility (e.g. [1–6]) have found that reproducing findings in papers is far from simple. Obstacles include missing methods descriptions and unavailable source code or even data. Researchers can choose from a multitude of analysis strategies, and if these are not sufficiently described, the likelihood of being able to reproduce the results is low [7, 8]. Even in cases where results can be reproduced, it is often tedious and time-consuming to do so [6].

We conducted a reproducibility study based on articles published in the journal PLOS ONE to learn about reporting practices in longitudinal data analyses. All PLOS ONE papers which fulfilled our selection criteria (see Fig 1) in April 2019 were chosen [9–19].

Fig 1. Data selection. Data selection procedure according to our requirements and number of papers fulfilling the respective requirements.

Longitudinal data are data containing repeated observations or measurements of the objects of study over time. For example, consider a study investigating the effect of alcohol and marijuana use by college students on their academic performance [10]. Students completed a monthly survey on their alcohol and marijuana use and consented to the collection of their grade point averages (GPAs) each semester during the study period. In this study not only the outcome of interest (GPAs over several semesters) is longitudinal, but the covariates (alcohol and marijuana use) also change over time. This does not always have to be the case in longitudinal data analysis. Covariates may also be constant over time (e.g. sex) or baseline values (e.g. alcohol consumption during the month before enrollment).

Due to the clustered nature of longitudinal data, with several observations per subject, special statistical methods are required. Common statistical models for longitudinal data are mixed effect models and generalized estimating equations. These models can have complex structures, and rigorous reporting is required for reproducing model outputs. A study on reporting of generalized linear mixed effect models (GLMMs) in papers from 2000 to 2012 found that there is room for improvement in the reporting of these models [20]. Alongside the models, visualization of the data often plays an important role in analyzing longitudinal data. An example is the spaghetti plot, a line graph with the outcome on the y-axis and time on the x-axis. Research on computational reproducibility when methods are complex—such as in this case—is still in its infancy. With this study we aim to add to this field and to provide some insights into the challenges of reproducibility in the 11 papers investigated. Furthermore, we would like to note that each reproduced paper is another paper that we can put more trust in. As such, reproducing a single paper is already a relevant addition to science.
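As an illustration of such a plot, the following minimal R sketch draws a spaghetti plot from simulated, purely illustrative data (all variable names are ours and not taken from any of the reproduced papers):

```r
library(ggplot2)

## Simulated, purely illustrative long-format data:
## one row per subject ("id") and time point ("semester")
set.seed(42)
dat <- data.frame(
  id       = rep(1:20, each = 4),
  semester = rep(1:4, times = 20),
  gpa      = rep(rnorm(20, mean = 3, sd = 0.3), each = 4) + rnorm(80, sd = 0.15)
)

## Spaghetti plot: one line per subject, outcome on the y-axis, time on the x-axis,
## plus a thicker mean curve
ggplot(dat, aes(x = semester, y = gpa, group = id)) +
  geom_line(alpha = 0.4) +
  stat_summary(aes(group = 1), fun = mean, geom = "line", size = 1.2) +
  labs(x = "Semester", y = "GPA")
```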

Computational (or analytic [21]) reproducibility studies—as we define them for this work—take existing papers and corresponding data sets and aim to obtain the same results from the statistical analyses. One prerequisite for such a study is access to the data set which was used for the original analyses. A clear description of the methods used is also essential. An easily reproducible paper provides openly licensed data alongside openly licensed source code written in a programming language commonly used for statistical analyses and itself available under a free open source software license (e.g. R [22] or Python [23]). If the source code is accompanied by a detailed description of the computing environment (e.g. operating system and versions of R packages) or the computing environment itself (e.g. a Docker container [24]), we believe the chances of obtaining the exact same results to be highest. It is difficult to determine whether a scientific project is reproducible: Is it possible to obtain exactly the same values? Is the (relative) deviation lower than a certain value? Is the difference in p-value lower than a certain value? These and more are questions that can be asked, and if answered “yes” the results can be marked as reproducible. Yet all of these criteria come with downsides, including being too strict, incomparable, uncomputable, or downright uninteresting. Here, we use the definition of leading to the same interpretation, without a rigorous formal definition. The reason is that the papers analysed here use very different models, so it is hard to compare them on a single scale (such as absolute relative deviation, see e.g. [6]). We argue that, in combination with a qualitative description of the challenges and difficulties that we faced in each reproduction process, this definition fits our small-scale, heterogeneous setting better.
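As noted above, documenting the computing environment increases the chances of exact reproduction. In R, for example, the loaded package versions and the operating system can be recorded and stored next to the analysis results; a minimal sketch (the file name is our choice, not prescribed by any tool):

```r
## Record the computing environment at the end of an analysis script or
## R Markdown report and store it alongside the results
si <- sessionInfo()
print(si)
writeLines(capture.output(si), "session-info.txt")
```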

In this work we investigated longitudinal data analyses published in PLOS ONE. The multidisciplinarity of PLOS ONE is a benefit for our study, as longitudinal data play a role in various fields. Additionally, the requirement for a data availability statement in PLOS ONE (see https://journals.plos.org/plosone/s/data-availability) facilitates the endeavour of a reproducibility study. Note that we only selected papers which provided data openly online and whose authors agreed to be included in this study. We assume that this leads to a positive bias in the sense that other papers would be more difficult to reproduce.

In the following we discuss the questions we asked in this reproducibility study, the setup of the study within the context of a university course, the procedure of paper selection, and describe the process of reproducing the results.

Materials and methods

Study questions

The aim of this study is to investigate reproducibility in a sample of 11 PLOS ONE papers dealing with longitudinal data. We also collect information on which methods are used, how they are made available, and which computing environments are used. We expect that this study will help future authors in making their work reproducible, even in complex settings such as when working with longitudinal data. Note that based on the selection of 11 papers we cannot make inferences about papers in general or in the journal. We can, however, learn from the obstacles we encountered in the given papers. Also, even reproducing a single paper creates scientific value. It provides a scientific check of the work and increases (or, in case of failure, decreases) trust in the results.

With the reproducibility study we want to answer the following questions:

  1. Which methods are used?
    • (a) What types of tables are shown?
    • (b) What types of figures are shown?
    • (c) What types of statistical models are used?
  2. Which software is used?
    • (a) Is the software free and open source?
    • (b) Is the source code available?
    • (c) Is the computing environment described (or delivered)?
  3. Are we able to reproduce the data analysis?
    • (a) Are the methods used clearly documented in the paper or supplementary material (e.g. analysis code)?
    • (b) Do we have to contact the authors in order to reproduce the analysis? If so, are authors responsive and helpful? How many e-mails are needed to reproduce the results?
    • (c) Do we receive the same (or very similar) numbers in tables, figures and models?
  4. What are characteristics of papers which make reproducibility easy/possible or difficult/impossible?

  5. What are learnings from this study? What recommendations can we give future authors for describing their methods and reporting their results?

Project circumstances

This project was conducted as part of the master level course Analysis of Longitudinal Data running during the summer term 2019 (23.01.19–27.07.19) at the Ludwig-Maximilians-Universität München. The course is a 6 ECTS (credit points according to the European Credit Transfer and Accumulation System) course aimed at statistics master students (compulsory in the biostatistics master, elective in other statistics masters) with 4 hours of class each week: 3 hours with a professor (Heidi Seibold) and 1 with a teaching assistant (Malte Nalenz). The course teaches how to work with longitudinal data, discusses appropriate models such as mixed effect models and generalized estimating equations, and shows how to apply them in different scenarios. As part of this course, student groups (2–3 students) were each assigned a paper for which they aimed to reproduce the analysis of longitudinal data. In practical sessions the students received help with programming-related problems and with understanding the general theory of longitudinal data analysis. To limit the likelihood of bias due to differing skills of students, all groups received support from the teachers. Students were advised to contact the authors directly in case of unclear specifications of methods. Internal peer reviews, in which one group of students checked the setup of all other groups, ensured that all groups had the same solid technical and organizational setup. Finally, all projects were carefully evaluated by the teachers and updated in case of problems. The reproductions and a student paper were the outputs of the course for each student group and were handed in in August 2019. We believe that this reproducibility study benefits from the large time commitment the students put into reproducing the papers. Also, having several students and two researchers work on each paper ensures a high quality of the study.

This project involved secondary analyses of existing data sets. We had not worked with the data sets in question before.

Selection of papers

For a paper to be eligible for the reproducibility study it has to fulfill the following requirements:

  • R.1 The paper deals with longitudinal data and uses mixed effect models or generalized estimating equations for analysis.

  • R.2 The paper is accompanied by data. This data is freely available online without registration.

  • R.3 At least one author is responsive to e-mails.

Requirement R.1 allows us to select only papers relevant to the topic of this project. Requirement R.2 is necessary to allow for reproducing results without burdens (e.g. applying for data access). Although PLOS ONE does have an open data policy (https://journals.plos.org/plosone/s/data-availability), we found many articles with statements such as “Data cannot be made publicly available due to ethical and legal restrictions”. Issues with data policies in journals have been studied in [25]. Requirement R.3 is important to be able to contact the authors later on in case of questions. Fig 1 shows the selection procedure. All papers which did not fulfill the criteria were excluded. The PLOS website search function was utilized to scan through PLOS ONE published works. Key words used were “mixed model”, “generalized estimating equations”, “longitudinal study” and “cohort study”. This key word search—performed for us by a contact at PLOS ONE—resulted in 57 papers. Of these, 14 papers fulfilled all criteria and were selected. Two authors objected to the use of their work within our study. We note that authors do not have the right to prohibit the reuse of their work, as all papers are published under a CC-BY license. However, the negative responses led us to drop these papers, as we expected to need to contact the authors with questions. For one paper we did not receive any response. Discussions on the selection criteria of all proposed papers are documented at https://osf.io/dx5mn/?branch=public.

Table 1 shows a summary of the selected papers.

Table 1. Selected papers.

Citation Title
[9] Wagner et al (2017) Airway Microbial Community Turnover Differs by BPD Severity in Ventilated Preterm Infants
[10] Meda et al (2017) Longitudinal Influence of Alcohol and Marijuana Use on Academic Performance in College Students
[11] Visaya et al (2015) Analysis of Binary Multivariate Longitudinal Data via 2-Dimensional Orbits: An Application to the Agincourt Health and Socio-Demographic Surveillance System in South Africa
[12] Vo et al (2018) Optimizing Community Screening for Tuberculosis: Spatial Analysis of Localized Case Finding from Door-to-Door Screening for TB in an Urban District of Ho Chi Minh City, Viet Nam
[13] Aerenhouts et al (2015) Estimating Body Composition in Adolescent Sprint Athletes: Comparison of Different Methods in a 3 Years Longitudinal Design
[14] Tabatabai et al (2016) Racial and Gender Disparities in Incidence of Lung and Bronchus Cancer in the United States: A Longitudinal Analysis
[15] Rawson et al (2015) Association of Functional Polymorphisms from Brain-Derived Neurotrophic Factor and Serotonin-Related Genes with Depressive Symptoms after a Medical Stressor in Older Adults
[16] Kawaguchi, Desrochers (2018) A Time-Lagged Effect of Conspecific Density on Habitat Selection by Snowshoe Hare
[17] Lemley et al (2016) Morphometry Predicts Early GFR Change in Primary Proteinuric Glomerulopathies: A Longitudinal Cohort Study Using Generalized Estimating Equations
[18] Carmody et al (2018) Fluctuations in Airway Bacterial Communities Associated with Clinical States and Disease Stages in Cystic Fibrosis
[19] Villalonga-Olives et al (2017) Longitudinal Changes in Health Related Quality of Life in Children with Migrant Backgrounds

Replication

In the reproducibility study we adhered to open science best practices. (1) We contacted all corresponding authors of papers we aimed to reproduce via e-mail; (2) all of our source code and data used is available; (3) any potential errors in the original publications were reported immediately to the corresponding author.

In our study we conducted all analyses as close to the original analyses as possible. If many analyses were performed in the original paper, we focused on the analyses of longitudinal data. We conducted all analyses using R [22] regardless of the software used in the original paper to mimic a situation where no access to licensed software is available (R was the only open source software used in the 11 papers).

Each analysis consisted of the following steps:

  1. Read the data into R.

  2. Prepare data for analysis.

  3. Produce overview figure(s) with outcome(s) on the y-axis and time on the x-axis.

  4. Reproduce analysis results (e.g. model coefficients, tables, figures).

The descriptions of all these steps were generally vague (see the classification of reported results in [6]), meaning that there were multiple ways of preparing or analysing the data that were in line with the descriptions in the original paper. This study thus exposed a large amount of “researcher degrees of freedom” [26], coupled with a lack of transparency in the original studies. We aimed to take steps that align as closely as possible with the original paper and the results therein. That means, if the methods description in the paper or supplementary material was clear, we used it; if not, we tried different possible strategies that we assumed could be correct; if this was not possible or did not lead to the expected results, we contacted the authors to ask for help. All code used by us is publicly available, including software versions, and in a format easily readable by humans (literate programming; for further information see the section on technical details).
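A minimal R sketch of the four steps above, assuming a hypothetical CSV file with columns id, time, group and outcome and a GEE as the model to be reproduced (the actual data formats, variables and models differed from paper to paper):

```r
library(ggplot2)
library(geepack)

## 1. Read the data into R (file and column names are hypothetical)
dat <- read.csv("study_data.csv")

## 2. Prepare the data for analysis: GEEs require observations ordered by cluster
dat <- dat[order(dat$id, dat$time), ]
dat$group <- factor(dat$group)

## 3. Overview figure: outcome on the y-axis, time on the x-axis, one line per subject
ggplot(dat, aes(x = time, y = outcome, group = id)) +
  geom_line(alpha = 0.3)

## 4. Reproduce the reported analysis, here a GEE with an
##    exchangeable working correlation as an example
fit <- geeglm(outcome ~ time * group, id = id, data = dat,
              family = gaussian, corstr = "exchangeable")
summary(fit)
```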

Results

The results of our study are summarized in Tables 2–4. As each paper has its own story and its own reasons why it was or was not reproducible and what the barriers were, we provide a short description of each individual paper reproduction.

Table 2. Which statistical methods were used by the papers?.

Overview Tables Visualisations Models Used
[9] Baseline demographics Several, e.g. spaghetti plot Beta Binomial Mixed Model
[10] Baseline demographics, model output Several, e.g. scatter plots (alcohol vs. marijuana use) of different time points LMM
[11] Overview of household types Several, e.g. lasagna plot GEE
[12] Baseline demographics none GEE
[13] Correlation none LMM (cross-classified)
[14] Many, especially smoking and lung cancer incidence rates for different years, genders, races and regions Mean curves LMM
[15] Baseline demographics Mean curves GEE
[16] Data overview Mean curves GEE
[17] Correlation matrix Mean curves GEE
[18] Sample characteristics Several, e.g. FEV1 over time GEE
[19] Baseline demographics DAG GEE

Table 4. Were the results reproducible?.

Method documentation Contact Attempts Author Responses Models Computable Same Interpretation Classification of Failure
[9] Missing Details 2 1 partly no Software differences
[10] Missing Details 0 0 yes yes
[11] yes 1 1 partly yes Software differences
[12] Missing Details 1 1 yes yes
[13] Missing Details 3 2 partly no Software differences
[14] yes 1 0 no no Software differences, Model Description
[15] Correlation Structure missing 1 1 yes yes
[16] Correlation Structure missing 1 1 yes yes
[17] Correlation Structure missing 3 1 yes yes
[18] 4 1 no Data and Model description
[19] yes 0 0 yes yes

Which methods are used?

For an overview on the following questions we refer to Table 2.

What types of tables are shown?

Most of the papers show tables on characteristics of the observation units at baseline or other summary tables (similar to the so-called “Table 1” commonly used in biomedical research), which give a good overview of the data.

What types of figures are shown?

Few papers include classical visualizations taught in courses on longitudinal data, such as spaghetti plots. They mostly present other visualizations (for details, see Table 2).

What types of statistical models are used?

Although in most cases (G)LMMs are superior to GEEs (see [27] for an in-depth discussion and further references), 7 out of the 11 papers used GEEs for their analyses [11, 12, 15–19]. There is, in fact, only one complex mixed model among the methods used (a Beta Binomial Mixed Model, [9]). The other articles [10, 13, 14] use LMMs, which are equivalent to GEEs for normally distributed response variables. It should be noted that the selection of papers may not be representative of the general use of GEEs and (G)LMMs. Nevertheless, it seems that the statistics community's reluctance to use GEEs has not spilled over to some other fields, which we speculate to have historical reasons, as GLMMs used to be difficult to compute.
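To illustrate the equivalence for a normally distributed response, the following minimal sketch (simulated data, not taken from any of the reproduced papers) fits a marginal GEE and a random-intercept LMM to the same data; for a Gaussian outcome the fixed-effect estimates typically agree closely and coincide in balanced designs:

```r
library(geepack)
library(lme4)

## Simulated, purely illustrative longitudinal data with a Gaussian outcome
set.seed(1)
n_subj <- 30
dat <- data.frame(
  id    = rep(seq_len(n_subj), each = 5),
  time  = rep(0:4, times = n_subj),
  treat = rep(rbinom(n_subj, 1, 0.5), each = 5)
)
dat$y <- 1 + 0.5 * dat$time + 0.8 * dat$treat +
  rep(rnorm(n_subj, sd = 0.7), each = 5) + rnorm(nrow(dat), sd = 0.5)

## Marginal model: GEE with exchangeable working correlation
gee_fit <- geeglm(y ~ time + treat, id = id, data = dat,
                  family = gaussian, corstr = "exchangeable")

## Conditional model: LMM with a subject-specific random intercept
lmm_fit <- lmer(y ~ time + treat + (1 | id), data = dat, REML = FALSE)

## Compare the fixed-effect estimates of the two models
cbind(GEE = coef(gee_fit), LMM = fixef(lmm_fit))
```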

Which software is used?

The results of this section are summarized in Table 3.

Table 3. Which software was used by the papers?.
Software Open Source Source Code Computing Environment
[9] SAS no partly SAS version
[10] SPSS no no SPSS version
[11] no information (email contact states Stata) no no no information
[12] no information (email contact states Stata) no no no information
[13] SAS no no SAS version
[14] SAS no no SAS version
[15] SAS no no SAS version
[16] R yes upon request Package version
[17] SAS no no SAS version
[18] SPSS no no SPSS version
[19] MPlus no no MPlus version

Is the software free and open source?

All except one paper (paper [16]) used closed source software. As our goal was to evaluate how hard reproducing results is when licenses for software products are not available, we worked with the open source software R. Implementations of complex methods such as GEEs and (G)LMMs in different software products may give slightly different results even for the same inputs, so we expected difficulties in reproducing exactly the same numbers for all papers using software other than R.

Is the source code available?

Only one paper (paper [9]) provided source code. The source code provided was only a small part of the entire code needed to reproduce the results. Nevertheless it was a major help in obtaining the specifications of the models. For one paper we received the code through our email conversations [16]. For all other papers we had to rely on the methods and results sections of the papers. Often we resorted to reverse engineering the results as the methods sections were not sufficiently detailed.

Is the computing environment described (or delivered)?

In most cases the authors provided information on the software used and the software version (9 out of 11). None of the papers described the operating system or provided a computing environment (e.g. Docker container).

Are we able to reproduce the data analysis?

The results of this section are summarized in Table 4.

Are the methods used clearly documented in paper or supplementary material (e.g. analysis code)?

Although all papers in question had methods sections, for most papers we were not able to extract all needed information to reproduce the results by ourselves. The most common issue was that papers did not provide enough detail about the methods used (e.g. model type was mentioned but no detailed model specifications, for details see Table 4). Since, in addition, no source code was provided (except for paper [9]), reproducing results was generally only possible by reverse engineering and/or contacting the authors. As most authors used licensed software which was not available to us, we could not determine if we would have reached the same results using default settings in the respective software. A clear documentation therefore requires enough detail to explicitly specify all necessary parameters for the model, even when using a different software.

Do we have to contact the authors in order to reproduce the analysis? How many e-mails are needed to reproduce the results?

In all but two cases (papers [10, 19]) we contacted the authors to ask questions on how the results were generated (for four of them several emails were exchanged). All but one of the authors responded, which was to be expected as we had previously contacted them asking whether they would agree with us doing this project and only papers were chosen where authors responded positively. In most cases responses by authors were helpful.

Do we receive the same (or very similar) numbers in tables, figures and models?

As the articles use different models and present their main results in terms of different statistics (model coefficients, F-statistics, correlations), the purely numerical deviation between our results and the original results is not informative in isolation. Also, as we used different software implementations, some deviation was to be expected. Therefore, we define similar results as having the same implied interpretations regarding sign and magnitude of effects. If the signs of the coefficients were the same and the ordering and magnitude of the coefficients roughly the same, we regarded the results as successfully reproduced. We were able to fully reproduce 6 out of 11 articles (see also Table 4). Here differences were marginal and did not lead to a change of interpretation. An example (original and reproduced coefficients of article [15]) can be seen in Fig 2. For another two articles at least parts of the analysis could be reproduced (e.g. one out of two models used by the authors). For the 8 articles that we found to be fully or partly reproducible, we were able to follow the data preprocessing and identify the most likely model specifications. Three out of the 11 papers could not be reproduced at all: one because of implementation differences [13] and one due to problems preparing the data set used by the authors [18]. In [14] it was unclear how the data were originally analysed, and without responses from the authors to our contact attempts via e-mail we were not able to determine whether the different conclusions reached by our analysis are due to an incorrect analysis on the side of the authors or to missing information.
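The following sketch illustrates this criterion with made-up coefficient vectors (the numbers are illustrative only and not taken from any of the reproduced articles):

```r
## Illustrative (made-up) original and reproduced coefficients
original   <- c("(Intercept)" = -0.42, age = 0.08, genotype = 1.15)
reproduced <- c("(Intercept)" = -0.40, age = 0.07, genotype = 1.21)

## Same sign and roughly the same magnitude?
data.frame(
  original,
  reproduced,
  same_sign = sign(original) == sign(reproduced),
  rel_dev   = round(abs(reproduced - original) / abs(original), 2)
)
```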

Fig 2. Original and reproduced model parameter estimates for the ewbGEE model of article [15]. In this article the differences in parameters do not lead to a different interpretation.

Note that for some of the results, a considerable amount of time and effort had to be invested to reverse engineer model settings. In the following we summarize the reproduction process for each paper individually, in order to give more insight into the specific problems and challenges that we encountered (see also Table 4).

In [9] problems arose with the provided data set. The data description was found to be insufficient. Variable names in the data set differed from the ones in the code provided by the authors. We were able to resolve this problem based on feedback from the authors. When running the analysis using R and the R package PROreg [28], results differed from the original results due to details of the implementation and a different optimization procedure. The reproduced coefficients had the same sign as in the original study. However, differences in magnitude were large for some of the coefficients, likely due to differences in the optimization procedure. Given our definition, we were unable to reproduce the results. A second model fitted by the authors could not be reproduced due to convergence problems (the model could not be fitted at all).

We were able to reproduce the results in [10] without contacting the authors. Some difficulty arose from the very sparse model description in the publication, for instance regarding which variables were included as fixed or random effects. Also, no source code was available. However, within a reasonable number of trials of different model specifications we obtained results very similar to those in the original publication.

In [11] the number of observations differed between the publication and the provided data set. Upon request one of the authors provided a data set that was almost identical to the one used in the study. The descriptive analysis and correlation analysis yielded the same results. A second difficulty arose because the authors did not specify the correlation structure used in their model, but instead relied on the Stata routine to determine the best fitting correlation structure using the quasi-likelihood information criterion. If the correlation structure yielding the coefficients closest to the ones in [11] is used, the coefficients are almost identical. However, we also performed the aforementioned model search procedure in R and ended up with a different correlation structure as the best fitting one. Using the correlation structure found to be best by our R implementation would lead to a change in the interpretation of the coefficients.
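Such a model search can be mimicked in R; a minimal sketch with simulated, purely illustrative data (geepack; the QIC() function assumes a reasonably recent geepack version), fitting the same GEE under several working correlation structures and comparing them via the quasi-likelihood information criterion:

```r
library(geepack)

## Simulated, purely illustrative clustered binary data
set.seed(2)
dat <- data.frame(
  id   = rep(1:40, each = 4),
  time = rep(1:4, times = 40),
  x    = rnorm(160)
)
dat$y <- rbinom(160, 1, plogis(-0.5 + 0.6 * dat$x))

## Fit the same GEE under several working correlation structures
structures <- c("independence", "exchangeable", "ar1", "unstructured")
fits <- lapply(structures, function(cs)
  geeglm(y ~ x + time, id = id, data = dat, family = binomial, corstr = cs)
)

## Compare the fits via the quasi-likelihood information criterion
setNames(sapply(fits, function(f) QIC(f)["QIC"]), structures)
```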

In [12] difficulties arose from different implementations in the software used. Also, the model description was incomplete, which required us to try all possible combinations of variables to include. However, the correlation structure was well described, and with feedback from the authors we were able to obtain the same results, deviating only in the third decimal place.

[13] used a cross-classified LMM via the SAS procedure PROC MIXED. Reproduction in R was difficult, as no R package offered the exact same functionality. After trying several R implementations, we settled on the nlme R package [29]. The random effects were not specified in the publication, and no SAS code was available to shed light on this question. Other questions regarding preprocessing and model specifications could be resolved through feedback from the authors, but we did not receive the needed information on the random effects. As such, we could not reproduce the results.
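For orientation, the sketch below shows what a cross-classified random-intercept specification looks like. It is not the model of [13], whose random-effects structure we could not determine; it uses simulated data, hypothetical grouping factors, and lme4 rather than nlme, because crossed random effects can be written directly there:

```r
library(lme4)

## Simulated, purely illustrative data with two crossed grouping factors
set.seed(3)
dat <- expand.grid(subject = factor(1:30), rater = factor(1:5))
dat$y <- 2 +
  rnorm(30, sd = 1.0)[dat$subject] +   # subject-level deviations
  rnorm(5,  sd = 0.5)[dat$rater] +     # rater-level deviations
  rnorm(nrow(dat), sd = 0.3)           # residual noise

## Cross-classified random intercepts: subjects and raters are crossed,
## not nested, so each grouping factor gets its own random-effects term
fit <- lmer(y ~ 1 + (1 | subject) + (1 | rater), data = dat)
summary(fit)
```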

In [14] the data set used for modeling was not given as a file. Instead the authors provided links to the websites from which the data had initially been obtained. We were not able to obtain the same data set given the sources and the description. This might be due to changes in the online sources. Still, differences in summary statistics were not substantial. We were unable to reproduce the same model due to the unclear model specification. Our attempts led to some vastly different estimates. Possible reasons for failure are an insufficient model description or even an incorrect analysis.

We were able to reproduce the results in [15] with only minor differences in the estimated coefficients. Feedback from the authors was required to find the correct correlation structure used in their GEE model, which was not explicitly stated in the paper.

The results in [16] were computationally reproducible. Despite minor differences in the coefficients we arrived at the same interpretations; the differences were most likely due to different optimization procedures in the software used. The correlation structure was not stated in the article, but we were able to find the correct one by reverse engineering (grid search).

For the reproduction of [17] we had problems with the data preprocessing. This was partly due to the unclear handling of missing values and partly due to details of the dimensionality reduction procedure used in preprocessing. The authors provided the final data set when we contacted them. The model specifications of the GEE used by the authors were not stated, but we were able to reproduce the exact same results as the authors by reverse engineering the correlation structure and link function. During this we found that using different model specifications or slightly different versions of the data set leads to substantially different results. Given the above definition, this article was reproducible.

The results in [18] could not be reproduced. The (DNA) data were given in raw format as a collection of hundreds of individual files, without any provided code or step-by-step guide for preprocessing, making reproduction of the data set to be used in the statistical analysis impossible for us. Figures and tables of the clinical data were reproducible.

The results in [19] were reproducible. All necessary model specifications for their GEE model and the reasoning behind them were explicitly stated in the paper. The original analysis was carried out in Mplus, but reproduction in R gave almost identical results.

What are characteristics of papers which make reproducibility easy/possible or difficult/impossible?

Based on the discussion of the individual papers we identified determinants of successes and failures. We found that the simpler the methods used in a paper, the easier it was to reproduce. Papers dealing with classical LMMs (papers [10, 14]) were reasonably easy to reproduce.

The data provided by the authors played a major role as well. If the clean data was provided, reproducing was much easier than for papers providing raw data (papers [14, 17, 18]), where preprocessing was still necessary. For one paper [18] getting and preparing the data was so complex that we gave up. Even after the authors provided us with an online tutorial on working with this type of data, we were far from understanding what needed to be done. If specialists (e.g. bioinformaticians) in working with this type of data had been involved, we might have had better chances.

We believe that with code provided—even if it is written using software we do not have access to—computational reproducibility is easier to obtain. It is hard to draw this conclusion based on the 11 papers we worked with, because only one provided partial code and one provided code on request, but they also did not contradict our prior beliefs.

What are learnings from this study? What recommendations can we give future authors for describing their methods and reporting their results?

Trying to reproduce 11 papers gave us a glimpse of how hard computational reproducibility is. We used papers published in an open access journal which provided data, and whose authors were supportive of the project. We think it is fair to assume that these papers are among the most open projects available in the academic literature at the moment. Nevertheless, we were able to reproduce the results without contacting the authors for only two papers.

We not only recommend that authors provide data and code with their papers, but also suggest that journals make this a requirement.

Further points

One paper published the raw names of study participants, which we saw as unnecessary information and therefore as an unreasonable breach of the participants' privacy. We informed the authors, who updated the data on the journal website.

Discussion

In this study we aimed at reproducing the results from 11 PLOS ONE papers dealing with statistical methods for longitudinal data. We found that most authors use tables and figures as tools for presenting research results. Although all papers in question had data available for download, only one paper came with accompanying source code. From our point of view the lack of source code is the main barrier to reproducing the results of the papers. For some papers we were still able to reproduce results by reverse engineering them and by asking the authors. In an ideal situation, however, the information needed should not be hidden within the computers and minds of the original authors, but should be shared as part of the article (optimally in the form of a research compendium with paper, data, code, and metadata).

One of the authors we initially contacted asked us to refrain from reproducing their paper on the grounds that students would not have the capabilities to do such complex analyses. We did not include the article in our study, but strongly disagree with this statement, especially since the students in question all have a strong statistics background and benefited from the guidance of researchers. Furthermore, the students checked each other's work in an internal peer review. We would even go so far as to claim that a lot of other statistical work is less well understood by the researcher and less thoroughly checked by peers before it is turned into a publication. Working as a big team gave us the option to conduct time-intensive reverse engineering of results, which small research teams or single researchers would potentially not have been able to do.

We did not choose the papers randomly, but based our selection on the set of potential papers given to us by PLOS ONE and then selected all papers meeting our criteria (see Fig 1). We cannot and should not draw conclusions about the broader scientific landscape from our findings on the 11 selected papers. Our work does, however, give us some insights into what researchers, reviewers, editors and publishers could focus on improving in the future: publish code next to the data. To PLOS ONE we propose to include code in their open data policy.

Reproducing a scientific article is an important contribution to science and knowledge discovery. It increases trust in the research which is computationally reproducible and raises doubt in the research which is not.

Technical details

All results, including detailed reports and code for each of the 11 papers, are available in the GitLab repository https://gitlab.com/HeidiSeibold/reproducibility-study-plos-one. All files can also be accessed through the Open Science Framework (https://osf.io/xqknz). For all computations, the relevant computational information (R and package versions, operating system) is given below the respective computations. The relevant information for this article itself is shown below.

  • R version 4.0.3 (2020-10-10), x86_64-pc-linux-gnu

  • Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=de_DE.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=de_DE.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=de_DE.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=de_DE.UTF-8, LC_IDENTIFICATION=C

  • Running under: Ubuntu 20.04.2 LTS

  • Matrix products: default

  • BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0

  • LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

  • Base packages: base, datasets, graphics, grDevices, methods, stats, tools, utils

  • Other packages: data.table 1.13.0, dplyr 1.0.2, ggplot2 3.3.3, googlesheets 0.3.0, kableExtra 1.3.1, knitr 1.32, plyr 1.8.6, rcrossref 1.1.0

  • Loaded via a namespace (and not attached): cellranger 1.1.0, cli 2.4.0, codetools 0.2-18, colorspace 2.0-0, compiler 4.0.3, crayon 1.4.1, crul 1.1.0, curl 4.3, digest 0.6.27, DT 0.18, ellipsis 0.3.1, evaluate 0.14, fansi 0.4.2, farver 2.1.0, fastmap 1.1.0, generics 0.0.2, glue 1.4.2, grid 4.0.3, gtable 0.3.0, hms 0.5.3, htmltools 0.5.1.1, htmlwidgets 1.5.3, httpcode 0.3.0, httpuv 1.5.5, httr 1.4.2, jsonlite 1.7.2, labeling 0.4.2, later 1.1.0.1, lifecycle 1.0.0, magrittr 2.0.1, mime 0.10, miniUI 0.1.1.1, munsell 0.5.0, pillar 1.6.0, pkgconfig 2.0.3, promises 1.2.0.1, ps 1.6.0, purrr 0.3.4, R6 2.5.0, Rcpp 1.0.6, readr 1.4.0, reshape2 1.4.4, rlang 0.4.10, rmarkdown 2.7, rstudioapi 0.13, rvest 0.3.6, scales 1.1.1, shiny 1.6.0, stringi 1.5.3, stringr 1.4.0, tibble 3.1.1, tidyselect 1.1.0, utf8 1.2.1, vctrs 0.3.7, viridisLite 0.4.0, webshot 0.5.2, withr 2.4.2, xfun 0.22, xml2 1.3.2, xtables 1.8-4

Data Availability

All results, including detailed reports and code for each of the 11 papers, are available in the GitLab repository https://gitlab.com/HeidiSeibold/reproducibility-study-plos-one. All files can also be accessed through the Open Science Framework (https://osf.io/xqknz).

Funding Statement

This research has been supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A (Munich Center of Machine Learning) to HS.

References

  • 1. Stodden V, Seiler J, Ma Z. An Empirical Analysis of Journal Policy Effectiveness for Computational Reproducibility. Proceedings of the National Academy of Sciences. 2018;115(11):2584–2589. doi: 10.1073/pnas.1708290115
  • 2. Kirouac DC, Cicali B, Schmidt S. Reproducibility of Quantitative Systems Pharmacology Models: Current Challenges and Future Opportunities. CPT: Pharmacometrics & Systems Pharmacology. 2019. doi: 10.1002/psp4.12390
  • 3. Hardwicke TE, Bohn M, MacDonald K, Hembacher E, Nuijten MB, Peloquin BN, et al. Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: an observational study.
  • 4. Obels P, Lakens D, Coles NA, Gottfried J, Green SA. Analysis of open data and computational reproducibility in registered reports in psychology. Advances in Methods and Practices in Psychological Science. 2020;3(2):229–237. doi: 10.1177/2515245920918872
  • 5. Maassen E, van Assen MA, Nuijten MB, Olsson-Collentine A, Wicherts JM. Reproducibility of individual effect sizes in meta-analyses in psychology. PLOS ONE. 2020;15(5):e0233107. doi: 10.1371/journal.pone.0233107
  • 6. Artner R, Verliefde T, Steegen S, Gomes S, Traets F, Tuerlinckx F, et al. The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods. 2020.
  • 7. Hoffmann S, Schönbrodt FD, Elsas R, Wilson R, Strasser U, Boulesteix AL. The multiplicity of analysis strategies jeopardizes replicability: lessons learned across disciplines; 2020. Available from: osf.io/preprints/metaarxiv/afb9p.
  • 8. Baumgaertner B, Devezer B, Buzbas EO, Nardin LG. Openness and Reproducibility: Insights from a Model-Centric Approach; 2019.
  • 9. Wagner BD, Sontag MK, Harris JK, Miller JI, Morrow L, Robertson CE, et al. Airway Microbial Community Turnover Differs by BPD Severity in Ventilated Preterm Infants. PLOS ONE. 2017;12(1):e0170120. doi: 10.1371/journal.pone.0170120
  • 10. Meda SA, Gueorguieva RV, Pittman B, Rosen RR, Aslanzadeh F, Tennen H, et al. Longitudinal Influence of Alcohol and Marijuana Use on Academic Performance in College Students. PLOS ONE. 2017;12(3):e0172213. doi: 10.1371/journal.pone.0172213
  • 11. Visaya MV, Sherwell D, Sartorius B, Cromieres F. Analysis of Binary Multivariate Longitudinal Data via 2-Dimensional Orbits: An Application to the Agincourt Health and Socio-Demographic Surveillance System in South Africa. PLOS ONE. 2015;10(4):e0123812. doi: 10.1371/journal.pone.0123812
  • 12. Vo LNQ, Vu TN, Nguyen HT, Truong TT, Khuu CM, Pham PQ, et al. Optimizing Community Screening for Tuberculosis: Spatial Analysis of Localized Case Finding from Door-to-Door Screening for TB in an Urban District of Ho Chi Minh City, Viet Nam. PLOS ONE. 2018;13(12):e0209290. doi: 10.1371/journal.pone.0209290
  • 13. Aerenhouts D, Clarys P, Taeymans J, Cauwenberg JV. Estimating Body Composition in Adolescent Sprint Athletes: Comparison of Different Methods in a 3 Years Longitudinal Design. PLOS ONE. 2015;10(8):e0136788. doi: 10.1371/journal.pone.0136788
  • 14. Tabatabai MA, Kengwoung-Keumo JJ, Oates GR, Guemmegne JT, Akinlawon A, Ekadi G, et al. Racial and Gender Disparities in Incidence of Lung and Bronchus Cancer in the United States: A Longitudinal Analysis. PLOS ONE. 2016;11(9):e0162949. doi: 10.1371/journal.pone.0162949
  • 15. Rawson KS, Dixon D, Nowotny P, Ricci WM, Binder EF, Rodebaugh TL, et al. Association of Functional Polymorphisms from Brain-Derived Neurotrophic Factor and Serotonin-Related Genes with Depressive Symptoms after a Medical Stressor in Older Adults. PLOS ONE. 2015;10(3):e0120685. doi: 10.1371/journal.pone.0120685
  • 16. Kawaguchi T, Desrochers A. A time-lagged effect of conspecific density on habitat selection by snowshoe hare. PLOS ONE. 2018;13(1):e0190643. doi: 10.1371/journal.pone.0190643
  • 17. Lemley KV, Bagnasco SM, Nast CC, Barisoni L, Conway CM, Hewitt SM, et al. Morphometry Predicts Early GFR Change in Primary Proteinuric Glomerulopathies: A Longitudinal Cohort Study Using Generalized Estimating Equations. PLOS ONE. 2016;11(6):e0157148. doi: 10.1371/journal.pone.0157148
  • 18. Carmody LA, Caverly LJ, Foster BK, Rogers MAM, Kalikin LM, Simon RH, et al. Fluctuations in Airway Bacterial Communities Associated with Clinical States and Disease Stages in Cystic Fibrosis. PLOS ONE. 2018;13(3):e0194060. doi: 10.1371/journal.pone.0194060
  • 19. Villalonga-Olives E, Kawachi I, Almansa J, von Steinbüchel N. Longitudinal Changes in Health Related Quality of Life in Children with Migrant Backgrounds. PLOS ONE. 2017;12(2):e0170891. doi: 10.1371/journal.pone.0170891
  • 20. Casals M, Girabent-Farrés M, Carrasco JL. Methodological Quality and Reporting of Generalized Linear Mixed Models in Clinical Medicine (2000–2012): A Systematic Review. PLOS ONE. 2014;9(11):e112653. doi: 10.1371/journal.pone.0112653
  • 21. LeBel EP, McCarthy RJ, Earp BD, Elson M, Vanpaemel W. A Unified Framework to Quantify the Credibility of Scientific Findings. Advances in Methods and Practices in Psychological Science. 2018;1(3):389–402. doi: 10.1177/2515245918787489
  • 22. R Core Team. R: A Language and Environment for Statistical Computing; 2020. Available from: https://www.R-project.org/.
  • 23. Python Software Foundation. Python Software; 2020. Available from: http://www.python.org.
  • 24. Boettiger C. An Introduction to Docker for Reproducible Research. SIGOPS Oper Syst Rev. 2015;49(1):71–79. doi: 10.1145/2723872.2723882
  • 25. Couture JL, Blake RE, McDonald G, Ward CL. A Funder-Imposed Data Publication Requirement Seldom Inspired Data Sharing. PLOS ONE. 2018;13(7):e0199789. doi: 10.1371/journal.pone.0199789
  • 26. Simmons JP, Nelson LD, Simonsohn U. False-Positive Psychology. Psychological Science. 2011;22(11):1359–1366. doi: 10.1177/0956797611417632
  • 27. Muff S, Held L, Keller LF. Marginal or conditional regression models for correlated non-normal data? Methods in Ecology and Evolution. 2016;7(12):1514–1524. doi: 10.1111/2041-210X.12623
  • 28. Najera J, Lee DJ, Arostegui I. PROreg: Patient Reported Outcomes Regression Analysis; 2017. Available from: https://CRAN.R-project.org/package=PROreg.
  • 29. Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team. nlme: Linear and Nonlinear Mixed Effects Models; 2020. Available from: https://CRAN.R-project.org/package=nlme.

Decision Letter 0

Jelte M Wicherts

6 Oct 2020

PONE-D-20-25993

A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

PLOS ONE

Dear Dr. Seibold,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The two reviewers and I agree that your study was well done, clearly reported, and relevant, and that your work contributes to our understanding of reproducibility of studies using longitudinal analyses. Yet, several issues need to be dealt with in your revision.

First, the two reviewers requested some additional clarity of the type of submission (registered report or not). I apologize for not indicating to the reviewers that your work was originally submitted as a registered report but later handled as a standard submission. Please clarify this. 

Second, both reviewers asked you to provide additional details of the methods, sampling, operationalizations, and data access. Both reviewers provided detailed feedback on the reporting and analyses that I ask you to consider as you revise your submission.

Third, the reviewers indicate that the main goals and main results could be presented more clearly in several sections of the manuscript. It is also important to be clear on your definition of reproducibility as you present the main results in the abstract.

Please submit your revised manuscript by Nov 16 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Jelte M. Wicherts

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.  

Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services.  If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free.

Upon resubmission, please provide the following:

  • The name of the colleague or the details of the professional service that edited your manuscript

  • A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

  • A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

3.  In your manuscript text, we note that "We did not choose the papers randomly, but based on the set of potential papers given to us by PLOS ONE and then selected all papers meeting our criteria". In the Methods, please ensure that you have included the following:

- Information about how the initial set of 57 papers was selected, including any inclusion/exclusion criteria applied, so that other interested researchers can reproduce this analysis. We would also recommend that the complete set of 57 initial papers be provided as a supplementary information file.

- The complete inclusion/exclusion criteria used to select the initial set of 14 papers from the 57 papers that were identified.

- The complete inclusion/exclusion criteria used to select the final set of 11 papers.

4. Please amend the manuscript submission data (via Edit Submission) to include authors Severin Czerny,  Siona Decke, Roman Dieterle, Thomas Eder, Steffen Fohr, Nico Hahn, Rabea Hartmann, Christoph Heindl, Philipp Kopper, Dario Lepke, Verena Loidl, Maximilian Mandl, Sarah Musiol, Jessica Peter, Alexander Piehler, Elio Rojas, Stefanie Schmid, Hannah Schmidt, Melissa Schmoll, Lennart Schneider, Xiao-Yin To, Viet Tran, Antje Volker, Moritz Wagner, Joshua Wagner, Maria Waize, Hannah Wecker, Rui Yang, Simone Zellner.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Manuscript ID: PONE-D-20-25993

Manuscript title: A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

Summary

This article reports a retrospective observational study designed to test the analytic reproducibility of a small set of PLOS ONE articles containing longitudinal analyses. Specifically, as a part of a class exercise, the authors and their students attempted to repeat the original analyses performed in the selected articles to see if they could obtain similar results. A range of difficulties reproducing the original results were encountered – some of which could be resolved through contact with the original authors. The generalizability of the results is quite limited due to the small sample size and somewhat ad-hoc sampling procedures; however, the authors appropriately calibrate their conclusions to the evidence, for example stating that “We can and should not draw conclusions from our findings on the 11 selected papers on the broader scientific landscape.”

Generally, the paper is clearly written and is concise, though I think some important information is absent (see detailed comments below). The study appears to be transparently reported with analysis scripts and data for each reproducibility attempt made publicly available in an Open Science Framework repository. I only checked this repository superficially – one issue is that I could not seem to identify a data file for the study (see comment below) which needs to be addressed.

Important note: After reading the paper and writing my review, I was surprised to find a document included after the reference section with the title “Review and response to the registered report”. I was not informed that this study was a registered report and as far as I can tell this is not mentioned in the manuscript aside from this appended document. Can the circumstances of this study be clarified? If this is a registered report then I feel I should be given more information. Most importantly, I need to see the original registered protocol. I would also like to see the prior review history and know whether the original stage 1 reviewers are also appraising the stage 2 manuscript.

Major Comments

- The exact operationalization of several of the outcome variables could be made clearer. For example, for the question “Are the methods used clearly documented in paper or supplementary material (e.g. analysis code)?” what was considered ‘clear’ vs. ‘unclear’? For “Do we receive the same (or very similar) numbers in tables, figures and models?” what was considered ‘similar’ vs. ‘dissimilar’? For “What are characteristics of papers which make reproducibility easy/possible or difficult/impossible?” – exactly what characteristics were examined?

- The oversight of the student work could be described in more detail – did teachers fully

- In the methods section, the sampling procedure is somewhat unclear – for example, “57 papers were initially screened. The PLOS website search function was utilized to scan through PLOS ONE published works. Key words used were “mixed model”, “generalized estimating equations”, “longitudinal study” and “cohort study”. 14 papers fulfilled the criteria and were selected”. Where does the number 57 come from? For the 14 papers – did they fulfill the search criteria? Or the criteria in Fig 1? Or both?

- It seems to me that the selection criteria will have heavily biased the results in favour of positive reproducibility outcomes – for example, only studies that had confirmed data availability were selected and only studies where authors replied to contact and were favourable to the reproducibility attempt were included. Because these factors probably influenced the results quite substantially, I’d suggest this bias is mentioned in key parts of the paper like the abstract and introduction.

- I examined the OSF repository (https://osf.io/xqknz/) for this study and it was unclear to me where to find the data for this study (I could find data for the studies that the authors attempted to reproduce). Could clear instructions be provided on how to find and access the study data?

- Could it be clarified if all analyses reported in eligible papers were examined or just the subset pertaining to longitudinal analyses?

- It is reported that for one article partial analysis code was available and for a second article the full analysis code was made available during email exchanges with the original authors. It's not clear whether all original authors were explicitly asked to make their code available – could this be clarified? If so, what were the responses to this query? Were reasonable justifications provided for not sharing the original code?

- The operational definition of reproducibility could be made clearer in the methods section (and perhaps also in the introduction) – in the results section the authors state “we define similar results as having the same implied interpretations” – this seems to be a less strict definition than used in other studies of analytic reproducibility (e.g., Hardwicke et al., 2018; 2020; Minocher et al., 2020). Some clarification and comment on this would be helpful for understanding the results.

- I think it would be informative to mention in the abstract how many analyses were reproducible only when assistance was provided by original authors.

- This sentence in the discussion is unclear – “We did not choose the papers randomly, but based on the set of potential papers given to us by PLOS ONE and then selected all papers meeting our criteria (see Figure 1).” If the papers were given to the authors by PLOS ONE then this needs to be mentioned and explained at least in the methods section.

Minor Comments

- Terminology usage in this domain is diverse and sometimes contradictory (see e.g., https://doi.org/10.1126/scitranslmed.aaf5027) – I’d recommend including explicit definitions and avoiding use of loaded terminology if possible. For example, it would be good to have a clear definition of ‘computational reproducibility’ in the opening paragraph. The authors may also want to consider using the term ‘analytic reproducibility’ instead of computational reproducibility. Researchers in this domain have recently started to draw a distinction between the two concepts and the former seems more applicable to what the present study has addressed. The distinction is discussed in this article - https://doi.org/10.31222/osf.io/h35wt – specifically, “Computational reproducibility is often assessed by attempting to re-run original computational code and can therefore fail if original code is unavailable or non-functioning (e.g., Stodden et al., 2018; Obels et al., 2019). By contrast, analytic reproducibility is assessed by attempting to repeat the original analysis procedures, which can involve implementing those procedures in new code if necessary (e.g., Hardwicke et al., 2018; Minocher et al., 2020)”

- An additional point on terminology - use of the term ‘replication’ (e.g., in the abstract and introduction) should perhaps be avoided if possible in this context because it is often used to mean “repeating original study methods and obtaining new data” – whereas here it is being used synonymously with computational reproducibility to mean “repeating original study analyses on the original data” (see http://arxiv.org/abs/1802.03311)

- I felt the study design could be made much more explicit in the introduction. For example, “The articles we chose are [1–11]” – briefly mentioning the sampling procedure would be helpful here so the reader can understand the study design (e.g., was it a random sample, arbitrary sample, etc).

- The rationale for the study could be made clearer in the introduction. The review of existing literature in this domain is sparse – it is not clear what knowledge gap the study is trying to fill. How does this work build on previous studies and/or extant knowledge in this domain? Why focus on these 11 papers? Why focus on PLOS ONE?

- It would be helpful to define acronyms e.g., what is a “6 ECTS course”?

- This is unclear and perhaps needs rewording: “For problems with implementation specifics for methods described in the papers”

- “Requirement R.3 is important to be able to contact the authors.” – this appears to just be a restatement of the requirement rather than a justification for including it.

- To ensure the reproducibility of their own analyses, the authors may wish to consider saving and sharing the computational environment in which the analyses were performed. Various tools are available to achieve this e.g., Docker, Binder, Code Ocean.

Reviewer #2: What did they do

The authors tried to reproduce 11 statistical analyses published in PLOS ONE that used longitudinal data. This was done by cleverly making use of student labor in the context of a university course. For each paper, a detailed summary file on the OSF describes the study, the model, the analyses, and potential deviations in results. Those files further contain the R code used, allowing verification of this reproducibility study (Personally, I did not make use of that possibility!).

General remarks

I believe this work to be an important contribution to open science and a service to the scientific community in general. The manuscript is well-written and the authors delightfully refrained from being unnecessarily complicated. To put this work into perspective with similar empirical work on reproducibility in psychology, I suggest giving a more detailed description of methods, results, and implications in the main manuscript. As of now, I am not sure which conclusions to reach about the state of reproducibility in PLOS ONE. A more detailed summary of the findings is particularly important in this case because each summary was written by a different team of students, making it very time-consuming to extract all the important information.

Major remarks

• I am confused as to the nature of this manuscript. Does it constitute a registered report? If so, the manuscript should clearly indicate what part of the work was done prior to the submission of the registered report and what was done afterward.

• I am missing a (short) Method section where you describe the timeline of the conducted reproductions (when were authors contacted to provide analysis code? how did the students work on the assignment? how (much) assistance did they receive from the teaching team?). In line 340 an internal peer review is mentioned – please provide more information on that.

• I agree with what is being said in lines 208-212, however, I would like to have precise information about when the magnitude of the effect is the same. Further, the possibility of achieving the same numbers by deliberately deviating from the method description of the paper should be discussed as this has implications on the implied interpretations.

• Roughly, we can group reproduction failures into 3 groups: reporting errors in the paper, insufficient/incorrect description of methods or data that prevents reproduction, and software/algorithm differences. I would like to know for each reproduction failure the group to which it likely belongs. Since you exclusively used R in your reanalyses whereas only 1 of the 11 papers did so, I think it is important to discuss software differences (including differences in algorithms, default/starting values, etc.) in detail. Whereas software differences are negligible for simple designs such as ANOVA and t-tests, this cannot just be assumed for GEEs and GLMMs. A discussion of software differences is, for example, important to interpret the results for paper #1 (lines 225 to 234) and also line 257. Looking at your summary file for this paper, “essay_01.pdf”, it turns out that you have deviated in multiple ways and for a multitude of reasons from what was described in the paper. As a result, it is hard to judge whether the original analysis of paper #1 contains reporting errors or not. A related issue is when you apply a different optimization algorithm. It might be of interest to try to reproduce those papers where the reproduction attempt was unsuccessful (and where the provided data does not seem to be the culprit) via the software package (and the functions therein) used by the respective authors.

• 233 – If you believe that your R code does not converge properly, it should be changed until it does, no? If you are unable to fit the model in R, it cannot be judged whether the published results are approximately correct or not. Now, all we know is that the students assigned to this paper were unable to properly fit the statistical model to the data via R.

• 316 – I would mention that in the abstract. The current abstract might give an incorrect impression as it is nowhere mentioned that the stated results involved author assistance (ideally, reproducibility in an open data journal should be possible without contacting the authors!)

Minor remarks

• 4 – use reproduction instead of replication. More generally, I suggest using reproduce/reproducibility to describe computational reproducibility and replicate/replicability for new studies involving different data, as this terminology is most commonly used in Psychology nowadays.

• 12 - Longitudinal data includes variables that are measured repeatedly over time but those variables do not necessarily have to do with humans.

• 111 – I would choose a more descriptive figure caption.

• 112 – I would refer to R.1 in singular (i.e. requirement R1)

• 130 – Is there the possibility of including additional papers? If so, I would like to see reproduction attempts of the 2 papers where the authors “prohibited” their work from being reproduced.

• 139-143 – Did you try to reproduce ALL figures and numbers reported in the paper that were related to the longitudinal study? If not, what was omitted and why? Please add 1 or 2 clarification sentences.

• 144 – Reproducing someone else’s analysis typically involves many RDFs, yes. But, it does not make sense to say (line 145) that there were many decisions that would adhere to YOUR steps 1-4. Instead, you should write that there are multiple ways to read in/process/analyze the provided data that are not in disagreement with what is stated in the paper or the supplementary material.

• 147 – Please be more specific.

• 153 – The title is not self-explanatory, especially because it is written in the present tense. Maybe “Which statistical methods were used by the papers” instead.

• 162 – “according to statisticians” I would refrain from using such a phrase. Instead, just cite relevant papers arguing for GLMMs over GEEs and, potentially, summarize some of their advantages.

• 172 – See comments for 153 above

• 184 – Did you always ask the authors for their source code? If not, when (before or after the 1st reproduction attempt?) did you ask for it? You provide some information in lines 201 to 207, but I would like to know the specific timeline, and I want to know what was planned a priori and what was ad hoc.

• 247 – Where is the search procedure mentioned?

• I would like to see the implications of the non-reproducible findings discussed. How many unreasonable original analyses (& conclusions drawn from them) could be identified? I know that this type of finger-pointing is uncomfortable, especially since only work from authors who both provided their data and responded to your e-mails was included in your sample; yet, it is important to estimate the rate of reporting errors and irreproducible findings in Psychology.

Comments about Review and response to the registered report

• The updated outline of the aim of this study is “Our aim with this study is to better understand the current practices in 11 PLOS ONE papers dealing with longitudinal data in terms of methodology applied but also in how results were computed and how it is made available for readers.” I find this unnecessarily complicated and, more importantly, it does not reflect the content of your study well at all. Wasn't the aim of this study to assess the extent to which papers analyzing longitudinal data in PLOS ONE could be reproduced by independent researchers?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Richard Artner

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jun 21;16(6):e0251194. doi: 10.1371/journal.pone.0251194.r003

Author response to Decision Letter 0


7 Jan 2021


# PONE-D-20-25993: A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

PLOS ONE

Dear Dr. Seibold,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The two reviewers and I agree that your study was well done, clearly reported, and relevant, and that your work contributes to our understanding of reproducibility of studies using longitudinal analyses. Yet, several issues need to be dealt with in your revision.

First, the two reviewers requested some additional clarity of the type of submission (registered report or not). I apologize for not indicating to the reviewers that your work was originally submitted as a registered report but later handled as a standard submission. Please clarify this.

Second, both reviewers asked you to provide additional details of the methods, sampling, operationalizations, and data access. Both reviewers provided detailed feedback on the reporting and analyses that I ask you to consider as you revise your submission.

Third, the reviewers indicate that the main goals and main results could be presented more clearly in several sections of the manuscript. It is also important to be clear on your definition of reproducibility as you present the main results in the abstract.

Please submit your revised manuscript by Nov 16 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Jelte M. Wicherts

Academic Editor

PLOS ONE

#### Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

**Ok.**

2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.

Upon resubmission, please provide the following:

- The name of the colleague or the details of the professional service that edited your manuscript

- A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

- A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

**Ok.**

3. In your manuscript text, we note that "We did not choose the papers randomly, but based on the set of potential papers given to us by PLOS ONE and then selected all papers meeting our criteria". In the Methods, please ensure that you have included the following:

- Information about how the initial set of 57 papers was selected, including any inclusion/exclusion criteria applied, so that other interested researchers can reproduce this analysis. We would also recommend that the complete set of 57 initial papers be provided as a supplementary information file.

- The complete inclusion/exclusion criteria used to select the initial set of 14 papers from the 57 papers that were identified.

- The complete inclusion/exclusion criteria used to select the final set of 11 papers.

**Thank you. We updated the manuscript accordingly. Please let us know if we should move any further information from Figure 1 to the text.**

4. Please amend the manuscript submission data (via Edit Submission) to include authors Severin Czerny, Siona Decke, Roman Dieterle, Thomas Eder, Steffen Fohr, Nico Hahn, Rabea Hartmann, Christoph Heindl, Philipp Kopper, Dario Lepke, Verena Loidl, Maximilian Mandl, Sarah Musiol, Jessica Peter, Alexander Piehler, Elio Rojas, Stefanie Schmid, Hannah Schmidt, Melissa Schmoll, Lennart Schneider, Xiao-Yin To, Viet Tran, Antje Volker, Moritz Wagner, Joshua Wagner, Maria Waize, Hannah Wecker, Rui Yang, Simone Zellner.

**Done.**

### 5. Review Comments to the Author

**Thank you so much for your thorough and constructive feedback. We truly think that your input has improved the paper and have rarely seen such helpful reviews. We hope that we have answered all your questions to your satisfaction. Please let us know if we missed or misunderstood anything. All text changes are marked in blue.**

#### Reviewer #1: Manuscript ID: PONE-D-20-25993

Manuscript title: A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

##### Summary

This article reports a retrospective observational study designed to test the analytic reproducibility of a small set of PLOS ONE articles containing longitudinal analyses. Specifically, as a part of a class exercise, the authors and their students attempted to repeat the original analyses performed in the selected articles to see if they could obtain similar results. A range of difficulties reproducing the original results were encountered – some of which could be resolved through contact with the original authors. The generalizability of the results is quite limited due to the small sample size and somewhat ad-hoc sampling procedures; however, the authors appropriately calibrate their conclusions to the evidence, for example stating that “We can and should not draw conclusions from our findings on the 11 selected papers on the broader scientific landscape.”

Generally, the paper is clearly written and is concise, though I think some important information is absent (see detailed comments below). The study appears to be transparently reported with analysis scripts and data for each reproducibility attempt made publicly available in an Open Science Framework repository. I only checked this repository superficially – one issue is that I could not seem to identify a data file for the study (see comment below) which needs to be addressed.

Important note: After reading the paper and writing my review, I was surprised to find a document included after the reference section with the title “Review and response to the registered report”. I was not informed that this study was a registered report and as far as I can tell this is not mentioned in the manuscript aside from this appended document. Can the circumstances of this study be clarified? If this is a registered report then I feel I should be given more information. Most importantly, I need to see the original registered protocol. I would also like to see the prior review history and know whether the original stage 1 reviewers are also appraising the stage 2 manuscript.

**We had initially submitted a registered report for this study to PLOS ONE, but since the review process took longer than the project, we decided -- together with PLOS ONE editors -- that it would be more scientifically sound to withdraw the submission and only submit the final paper. We submitted the manuscript including the full history of reviews and hope you will receive the full information in the upcoming round of reviews.**

##### Major Comments

- The exact operationalization of several of the outcome variables could be made clearer. For example, for the question “Are the methods used clearly documented in paper or supplementary material (e.g. analysis code)?” what was considered ‘clear’ vs. ‘unclear’? For “Do we receive the same (or very similar) numbers in tables, figures and models?” what was considered ‘similar’ vs. ‘dissimilar’? For “What are characteristics of papers which make reproducibility easy/possible or difficult/impossible?” – exactly what characteristics were examined?

**We added a more detailed explanation of what we considered a clear description in the methods section. As we only used R in our analyses, we consider the documentation clear if it allowed us to follow the authors' steps in a different software environment. We defined outputs as similar if they have the same interpretation, since some deviations are expected. If deviations already occurred in simple summary tables, we immediately contacted the authors to get more information on the data. The characteristics examined were based on the discussion of the individual papers and what we identified as the most likely reasons for success/failure. We added a line to point this out.**

- The oversight of the student work could be described in more detail – did teachers fully

**We added two sentences which we hope will clarify the setup further: "Internal peer-reviews ensured that all groups had a solid technical setup. Finally all projects were carefully evaluated by the teachers and updated in case of problems."**

- In the methods section, the sampling procedure is somewhat unclear – for example, “57 papers were initially screened. The PLOS website search function was utilized to scan through PLOS ONE published works. Key words used were “mixed model”, “generalized estimating equations”, “longitudinal study” and “cohort study”. 14 papers fulfilled the criteria and were selected”. Where does the number 57 come from? For the 14 papers – did they fulfill the search criteria? Or the criteria in Fig 1? Or both?

**The phrasing was a bit unfortunate. Thank you for spotting this. We hope it is more clear now.**

- It seems to me that the selection criteria will have heavily biased the results in favour of positive reproducibility outcomes – for example, only studies that had confirmed data availability were selected and only studies where authors replied to contact and were favourable to the reproducibility attempt were included. Because these factors probably influenced the results quite substantially, I’d suggest this bias is mentioned in key parts of the paper like the abstract and introduction.

**Very good point. We included a sentence ("Inclusion criteria were the availability of data and author consent.") in the abstract and another ("Note that we only selected papers which provided data openly online and where authors agreed with being included in the study. We assume that this leads to a positive bias in the sense that other papers would be more difficult to reproduce.") in the introduction.**

- I examined the OSF repository (https://osf.io/xqknz/) for this study and it was unclear to me where to find the data for this study (I could find data for the studies that the authors attempted to reproduce). Could clear instructions be provided on how to find and access the study data?

**The data can either be found in the data folder within the respective paper folder (e.g. `papers/01_2017_Wagner/data`) or it is automatically downloaded by the provided code.**

- Could it be clarified if all analyses reported in eligible papers were examined or just the subset pertaining to longitudinal analyses?

**We focused on the longitudinal data analyses. We clarified this in the text.**

- It is reported that for one article partial analysis code was available and for a second article the full analysis code was made available during email exchanges with the original authors. It's not clear whether all original authors were explicitly asked to make their code available – could this be clarified? If so, what were the responses to this query? Were reasonable justifications provided for not sharing the original code?

**We are unsure how much detail of the email exchanges we are allowed to share openly. Aside from the initial contact, email exchanges were not standardized and authors were not asked whether we would be allowed to share the details of the conversation. Not all authors were asked to share the code. We hope this answers your questions.**

- The operational definition of reproducibility could be made clearer in the methods section (and perhaps also in the introduction) – in the results section the authors state “we define similar results as having the same implied interpretations” – this seems to be a less strict definition than used in other studies of analytic reproducibility (e.g., Hardwicke et al., 2018; 2020; Minocher et al., 2020). Some clarification and comment on this would be helpful for understanding the results.

**Thank you for the interesting references. We agree that our operationalisation is less strict than others, such as the ones you quoted. The main reason for this is that we deliberately deviated from the software used by the authors, so deviations are to be expected to a certain degree and do not necessarily imply any kind of error on the side of the authors. However, a method description detailed enough to allow similar results to be reached with a different software implementation is, in our eyes, a minimal requirement for successful reporting. We added more explanation to the methods section.**

- I think it would be informative to mention in the abstract how many analyses were reproducible only when assistance was provided by original authors.

**Thank you for the suggestion, we added the number to the abstract.**

- This sentence in the discussion is unclear – “We did not choose the papers randomly, but based on the set of potential papers given to us by PLOS ONE and then selected all papers meeting our criteria (see Figure 1).” If the papers were given to the authors by PLOS ONE then this needs to be mentioned and explained at least in the methods section.

**Thank you for spotting. We added the information in the methods section.**

##### Minor Comments

- Terminology usage in this domain is diverse and sometimes contradictory (see e.g., https://doi.org/10.1126/scitranslmed.aaf5027) – I’d recommend including explicit definitions and avoiding use of loaded terminology if possible. For example, it would be good to have a clear definition of ‘computational reproducibility’ in the opening paragraph. The authors may also want to consider using the term ‘analytic reproducibility’ instead of computational reproducibility. Researchers in this domain have recently started to draw a distinction between the two concepts and the former seems more applicable to what the present study has addressed. The distinction is discussed in this article - https://doi.org/10.31222/osf.io/h35wt – specifically, “Computational reproducibility is often assessed by attempting to re-run original computational code and can therefore fail if original code is unavailable or non-functioning (e.g., Stodden et al., 2018; Obels et al., 2019). By contrast, analytic reproducibility is assessed by attempting to repeat the original analysis procedures, which can involve implementing those procedures in new code if necessary (e.g., Hardwicke et al., 2018; Minocher et al., 2020)”

**Thank you very much for pointing us to the article by Hardwicke et al. This is indeed highly relevant to our study, as their setup is similar to ours. Furthermore, we now mention the term analytic reproducibility and cite LeBel et al.**

- An additional point on terminology - use of the term ‘replication’ (e.g., in the abstract and introduction) should perhaps be avoided if possible in this context because it is often used to mean “repeating original study methods and obtaining new data” – whereas here it is being used synonymously with computational reproducibility to mean “repeating original study analyses on the original data” (see http://arxiv.org/abs/1802.03311)

**You are completely right. We believe this increases clarity in our manuscript. Thank you!**

- I felt the study design could be made much more explicit in the introduction. For example, “The articles we chose are [1–11]” – briefly mentioning the sampling procedure would be helpful here so the reader can understand the study design (e.g., was it a random sample, arbitrary sample, etc).

**We added the sentence "They are all PLOS ONE papers which fulfilled our selection criteria [...] in March 2019."**

- The rationale for the study could be made clearer in the introduction. The review of existing literature in this domain is sparse – it is not clear what knowledge gap the study is trying to fill. How does this work build on previous studies and/or extant knowledge in this domain? Why focus on these 11 papers? Why focus on PLOS ONE?

**Thank you. This is a very important question which we now answer in the introduction.**

- It would be helpful to define acronyms e.g., what is a “6 ECTS course”?

**Thanks for spotting. ECTS are credit points according to the European Credit Transfer and Accumulation System. We updated the sentence.**

- This is unclear and perhaps needs rewording: “For problems with implementation specifics for methods described in the papers”

**We agree. We did so and hope it is more understandable now: "Students were advised to contact the authors directly in case of unclear specifications of methods."**

- “Requirement R.3 is important to be able to contact the authors.” – this appears to just be a restatement of the requirement rather than a justification for including it.

**Thanks for spotting all these small yet important things. We updated this.**

- To ensure the reproducibility of their own analyses, the authors may wish to consider saving and sharing the computational environment in which the analyses were performed. Various tools are available to achieve this e.g., Docker, Binder, Code Ocean.

**Some of the computations run for a very long time and require server usage, which makes it difficult to use the suggested solutions (our university servers do not allow the usage of Docker). As we are of course aware of the issue, we decided to provide the `sessionInfo()` output for each paper instead. We realized that this information is not provided in the paper and we added it (see Computational Details).**
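**For illustration, the following is a minimal sketch (assumed workflow and output file name, not the project's actual scripts) of how such session details can be recorded at the end of an R analysis script:**

```r
## Minimal sketch, assuming the session details are to be stored next to the
## results of the respective paper; the file name is hypothetical.
info <- sessionInfo()               # R version, platform, attached packages
print(info)

## store the printed session details as a plain text file for readers
writeLines(capture.output(info), "sessionInfo.txt")
```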

#### Reviewer #2: What did they do

The authors tried to reproduce 11 statistical analyses published in PLOS ONE that used longitudinal data. This was done by cleverly making use of student labor in the context of a university course. For each paper, a detailed summary file on the OSF describes the study, the model, the analyses, and potential deviations in results. Those files further contain the R code used, allowing verification of this reproducibility study (Personally, I did not make use of that possibility!).

##### General remarks

I believe this work to be an important contribution to open science and a service to the scientific community in general. The manuscript is well-written and the authors delightfully refrained from being unnecessarily complicated. To put this work into perspective with similar empirical work on reproducibility in psychology, I suggest giving a more detailed description of methods, results, and implications in the main manuscript. As of now, I am not sure which conclusions to reach about the state of reproducibility in PLOS ONE. A more detailed summary of the findings is particularly important in this case because each summary was written by a different team of students, making it very time-consuming to extract all the important information.

##### Major remarks

- I am confused as to the nature of this manuscript. Does it constitute a registered report? If so, the manuscript should clearly indicate what part of the work was done prior to the submission of the registered report and what was done afterward.

**We had initially submitted a registered report for this study to PLOS ONE, but since the review process took longer than the project, we decided -- together with PLOS ONE editors -- that it would be more scientifically sound to withdraw the submission and only submit the final paper. We submitted the manuscript including the full history of reviews and hope you will receive the full information in the upcoming round of reviews.**

- I am missing a (short) Method section where you describe the timeline of the conducted reproductions (when were authors contacted to provide analysis code? how did the students work on the assignment? how (much) assistance did they receive from the teaching team?). In line 340 an internal peer review is mentioned – please provide more information on that.

**Thank you. We incorporated this suggestion.**

- I agree with what is being said in lines 208-212, however, I would like to have precise information about when the magnitude of the effect is the same. Further, the possibility of achieving the same numbers by deliberately deviating from the method description of the paper should be discussed as this has implications on the implied interpretations.

**We agree and added such comments in the discussion of the individual papers when the difference was not negligible. In general, we only deviated from the method description when it was unclear how to proceed, even after contacting the authors. In some cases, we also performed sensitivity analyses and mention the results, but they are not used to judge reproducibility.**

- Roughly, we can group reproduction failures into 3 groups: reporting errors in the paper, insufficient/incorrect description of methods or data that prevents reproduction, and software/algorithm differences. I would like to know for each reproduction failure the group to which it likely belongs. Since you exclusively used R in your reanalyses whereas only 1 of the 11 papers did so, I think it is important to discuss software differences (including differences in algorithms, default/starting values, etc.) in detail. Whereas software differences are negligible for simple designs such as ANOVA and t-tests, this cannot just be assumed for GEEs and GLMMs. A discussion of software differences is, for example, important to interpret the results for paper #1 (lines 225 to 234) and also line 257. Looking at your summary file for this paper, “essay_01.pdf”, it turns out that you have deviated in multiple ways and for a multitude of reasons from what was described in the paper. As a result, it is hard to judge whether the original analysis of paper #1 contains reporting errors or not. A related issue is when you apply a different optimization algorithm. It might be of interest to try to reproduce those papers where the reproduction attempt was unsuccessful (and where the provided data does not seem to be the culprit) via the software package (and the functions therein) used by the respective authors.

**We agree with this insightful comment. In this study, software/algorithm differences are expected, as we use different software than most of the original studies. Due to this, differences in coefficients, especially in complex models, are expected to occur. We cannot differentiate between reporting errors and errors due to different implementations. The only errors that we can spot with reasonable confidence are cases where the methods and data descriptions are not clear enough for us to reach similar results using different implementations. This means that starting values and other parameters used in the optimization need to be provided if they can have a high impact on the results. If they were not provided, we relied on the standard settings of the respective R packages and reasonable trial and error. So for this study, we can only answer the question "can we reach the same interpretations following the authors' descriptions but using different implementations?". This is less strict than other operationalisations of reproducibility, but in our opinion it reflects a very common situation for researchers. We added a column to the results table that gives the most likely reason for failure to reproduce part or all of the results.**
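**As an illustration of the point about implementation defaults, the following minimal sketch (hypothetical variable and data set names, not code from any of the reproduced papers) fits a GEE with the geepack package and states the working correlation structure explicitly instead of relying on the package default:**

```r
## Illustrative sketch only; 'longdata', 'outcome', 'time', 'treatment' and
## 'subject' are hypothetical names. Settings such as the working correlation
## structure have package-specific defaults (geepack uses "independence"),
## so they need to be reported for a refit in other software to match.
library(geepack)

fit <- geeglm(outcome ~ time + treatment,
              id     = subject,             # cluster identifier, one row per visit
              data   = longdata,            # long-format longitudinal data
              family = binomial("logit"),
              corstr = "exchangeable")      # stated explicitly
summary(fit)
```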

- 233 – If you believe that your R code does not converge properly, it should be changed until it does, no? If you are unable to fit the model in R, it cannot be judged whether the published results are approximately correct or not. Now, all we know is that the students assigned to this paper were unable to properly fit the statistical model to the data via R.

**Thank you for your comment. We did this study under the assumption of having access only to free software. We believe that it is too much to ask of someone aiming to reproduce a paper to write new software. Using licensed products comes with the possibility of using methods that are not openly available at all, which was the case here. We added more information on this in the manuscript.**
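**For context, a hedged sketch (hypothetical model and data, not tied to the specific paper discussed here) of the kind of adjustments one can try in R when a generalized linear mixed model does not converge with default settings, before concluding that a refit in open source software is not possible:**

```r
## Illustrative only: try an alternative optimiser and a higher iteration limit
## for a GLMM in lme4 when the defaults produce convergence warnings.
library(lme4)

fit <- glmer(outcome ~ time + treatment + (1 | subject),   # hypothetical formula
             data    = longdata,                           # hypothetical data set
             family  = binomial,
             control = glmerControl(optimizer = "bobyqa",
                                    optCtrl   = list(maxfun = 2e5)))
summary(fit)
```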

- 316 – I would mention that in the abstract. The current abstract might give an incorrect impression as it is nowhere mentioned that the stated results involved author assistance (ideally, reproducibility in an open data journal should be possible without contacting the authors!)

**Thanks for spotting this. We completely agree and added the sentence "For all but two articles we had to contact the authors to reproduce results."**

##### Minor remarks

- 4 – use reproduction instead of replication. More generally, I suggest using reproduce/reproducibility to describe computational reproducibility and replicate/replicability for new studies involving different data, as this terminology is most commonly used in Psychology nowadays.

**You are completely right. We believe this increases clarity in our manuscript. Thank you!**

- 12 - Longitudinal data includes variables that are measured repeatedly over time but those variables do not necessarily have to do with humans.

**Done, thanks.**

- 111 – I would choose a more descriptive figure caption.

**Done, thanks.**

- 112 – I would refer to R.1 in singular (i.e. requirement R1)

**Done, thanks.**

- 130 – Is there the possibility of including additional papers? If so, I would like to see reproduction attempts of the 2 papers where the authors “prohibited” their work from being reproduced.

**We deliberately chose not to work with papers where the authors were not comfortable with us reproducing their work, as this did not fit the context of our study. We hope you understand.**

- 139-143 – Did you try to reproduce ALL figures and numbers reported in the paper that were related to the longitudinal study? If not, what was omitted and why? Please add 1 or 2 clarification sentences.

**Thank you for finding this missing information. We added "If many analyses were performed, we focused on the analyses of longitudinal data."**

- 144 – Reproducing someone else’s analysis typically involves many RDFs, yes. But, it does not make sense to say (line 145) that there were many decisions that would adhere to YOUR steps 1-4. Instead, you should write that there are multiple ways to read in/process/analyze the provided data that are not in disagreement with what is stated in the paper or the supplementary material.

**You are absolutely right. We updated the text.**

- 147 – Please be more specific.

**We added more context and hope it is more understandable now.**

- 153 – The title is not self-explanatory, especially because it is written in the present tense. Maybe “Which statistical methods were used by the papers” instead.

**Good point. We updated the caption.**

- 162 – “according to statisticians” I would refrain from using such a phrase. Instead, just cite relevant papers arguing for GLMMs over GEEs and, potentially, summarize some of their advantages.

**Thank you. Now that I read it again, it sounds funny to me, too. We updated the sentence.**

- 172 – See comments for 153 above

**Good point. We updated the caption.**

- 184 – Did you always ask the authors for their source code? If not, when (before or after the 1st reproduction attempt?) did you ask for it? You provide some information in lines 201 to 207, but I would like to know the specific timeline, and I want to know what was planned a priori and what was ad hoc.

**We did attempt to do everything on our own. Only if necessary information was not available did we contact the authors. We are now making this clearer in the text.**

- 247 – Where is the search procedure mentioned?

**We updated the description to make it more clear how the search procedure was conducted: "The PLOS website search function was utilized to scan through PLOS ONE published works. Key words used were ..."**

- I would like to see the implications of the non-reproducible findings discussed. How many unreasonable original analyses (& conclusions drawn from them) could be identified? I know that this type of finger-pointing is uncomfortable, especially since only work from authors who both provided their data and responded to your e-mails was included in your sample; yet, it is important to estimate the rate of reporting errors and irreproducible findings in Psychology.

**Out of the papers that were not reproducible at all for us, in one case we were unable to preprocess the data, in one the failure was clearly due to implementation differences, and only in one case did we identify a potentially unreasonable analysis. Of course, for the two papers mentioned previously we just can't know, as we were unable to reach results ourselves.**

- The updated outline of the aim of this study is “Our aim with this study is to better understand the current practices in 11 PLOS ONE papers dealing with longitudinal data in terms of methodology applied but also in how results were computed and how it is made available for readers.” I find this unnecessarily complicated and, more importantly, it does not reflect the content of your study well at all. Wasn't the aim of this study to assess the extent to which papers analyzing longitudinal data in PLOS ONE could be reproduced by independent researchers?

**You are right, thanks. We updated the section.**

Attachment

Submitted filename: 2020_10_response.pdf

Decision Letter 1

Jelte M Wicherts

9 Mar 2021

PONE-D-20-25993R1

A computational reproducibility study of PLOS ONE articles

featuring longitudinal data analyses

PLOS ONE

Dear Dr. Seibold,

Thank you for revising your manuscript submitted for publication in PLOS ONE. Please accept my apologies for the relatively slow processing of your revision, caused by several factors including me being on parental leave around the birth of our daughter and me having to homeschool her three proud brothers during the lockdown.

The remaining reviewer (Reviewer 1 was unfortunately no longer available) and I agree that you responded very well to the issues raised in the earlier round and that your submission is very close to being publishable in PLOS ONE. The reviewer raises some minor issues that can be readily dealt with in the revision or responded to in your letter (in case you choose not to follow the suggestion). Also, please consider adding references to some relevant recent studies on reproducibility and sharing of syntax and computer code, and update the references referring to pre-prints that have appeared in the meantime. If you respond well to the remaining minor issues, I expect to make a quick decision on your manuscript without needing to resend it out for review. I am looking forward to seeing your rigorous and interesting work appearing in print.

Please submit your revised manuscript by Apr 23 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Jelte M. Wicherts

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I find this revised paper well-structured and easy to read, and I can recommend its publication, as new insights on reproducibility are much needed. What I would like to see added is a more thorough discussion of the state of reproducibility in PLOS ONE (as well as more generally in psychology) together with its implications (both in the introduction and the discussion section). In particular, this paper should include references to all relevant literature on this topic (see minor remarks below). Otherwise, a reader of this paper will not be made aware of other important empirical findings on this topic.

Below are some minor remarks that hopefully further improve the quality of this paper.

• Abstract: The last sentence only states quite obvious things. I would prefer to read about non-obvious insights gained in light of this study.

• Line 4: please include important recent studies on reproducibility such as

o Artner, R., Verliefde, T., Steegen, S., Gomes, S., Traets, F., Tuerlinckx, F., & Vanpaemel, W. (2020). The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods.

o Maassen, E., van Assen, M. A., Nuijten, M. B., Olsson-Collentine, A., & Wicherts, J. M. (2020). Reproducibility of individual effect sizes in meta-analyses in psychology. PloS one, 15(5), e0233107.

o Obels, P., Lakens, D., Coles, N. A., Gottfried, J., & Green, S. A. (2020). Analysis of open data and computational reproducibility in registered reports in psychology. Advances in Methods and Practices in Psychological Science, 3(2), 229-237.

• Line 33: What did other empirical investigations on reproducibility find?

• Lines 160-162: I would not label a lack of knowledge of the exact calculations performed by the authors due to a lack of information provided as RDFs. RDFs are what the original authors had! Maybe you can write: “The description of all these steps was generally vague (see the classification of reported results in Artner et al., 2020), meaning that there were multiple ways in line with the descriptions in the original paper. This study thus exposed a large amount of researcher degrees of freedom [23] coupled with a lack of transparency in the original studies.”

• I find the style in which the results section is written weird (until line 251). Why not just describe the results with a reference to the respective table? Now we have the tables first and it is not really clear if the text reiterates the information in the tables or if additional information is provided. Also, why not merge Tables 1, 2 and 3? Table 1 alone does not provide enough information to be included in the main text. Maybe the results section can be structured as follows: 1

• Line 239: When was the magnitude considered to be the same? It is important to exactly describe your criteria here to allow the reader to gauge the overall results of your study. Without knowing whether your criterion was lenient or rather strict, it cannot be done as each and every one of us uses individual metrics to gauge similarity.

• Fig 2: General comment - Without knowing the range of parameter values it is hard to interpret differences between the original and reproduced results. Why did you choose to report on the estimates of this article? Why not report, for example, on the results of article [1] instead?

• Lines 261-264: Large differences in magnitude should result in a different interpretation, even in the case of equal signs!

• Line 299: substantial!

• Line 351: Nevertheless we were only able ….

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Richard Artner

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jun 21;16(6):e0251194. doi: 10.1371/journal.pone.0251194.r005

Author response to Decision Letter 1


9 Apr 2021

## Review Comments to the Author

Reviewer #2:

I find this revised paper well-structured and easy to read, and I can recommend its publication, as new insights on reproducibility are much needed. What I would like to see added is a more thorough discussion of the state of reproducibility in PLOS ONE (as well as more generally in psychology) together with its implications (both in the introduction and the discussion section). In particular, this paper should include references to all relevant literature on this topic (see minor remarks below). Otherwise, a reader of this paper will not be made aware of other important empirical findings on this topic.

**Thank you so much, Mr Artner, for your insightful review and constructive feedback. Please see our response to each point below. New changes in the manuscript are marked in green.**

Below are some minor remarks that will hopefully further improve the quality of this paper.

- Abstract: The last sentence only states quite obvious things. I would prefer to read about non-obvious insights gained in light of this study.

**Thanks for this suggestion. As per the *writing in the sciences* course (see https://youtu.be/xmzUQ46YFiE), we added a sentence on the implications of our findings.**

- Line 4: please include important recent studies on reproducibility such as

- Artner, R., Verliefde, T., Steegen, S., Gomes, S., Traets, F., Tuerlinckx, F., & Vanpaemel, W. (2020). The reproducibility of statistical results in psychological research: An investigation using unpublished raw data. Psychological Methods.

- Maassen, E., van Assen, M. A., Nuijten, M. B., Olsson-Collentine, A., & Wicherts, J. M. (2020). Reproducibility of individual effect sizes in meta-analyses in psychology. PloS one, 15(5), e0233107.

- Obels, P., Lakens, D., Coles, N. A., Gottfried, J., & Green, S. A. (2020). Analysis of open data and computational reproducibility in registered reports in psychology. Advances in Methods and Practices in Psychological Science, 3(2), 229-237.

**Thank you so much for sharing these interesting articles with us! I devoured them and gained many insights into what we ourselves could have done differently :). We included all suggested references and made some changes to the manuscript text.**

- Line 33: What did other empirical investigations on reproducibility find?

**I am not sure whether this answers the question, but we write about this in the first paragraph of the introduction. See lines 2-9.**

- Lines 160-162: I would not label a lack of knowledge of the exact calculations performed by the authors, due to a lack of information provided, as RDFs. RDFs are what the original authors had! Maybe you can write: "The description of all these steps was generally vague (see the classification of reported results in Artner et al., 2020), meaning that there were multiple ways of proceeding in line with the descriptions in the original paper. This study thus exposed a large amount of researcher degrees of freedom [23] coupled with a lack of transparency in the original studies."

**That is a great suggestion and now thinking about it more, I completely agree with you. We changed the text according to your suggestion (with a few tweaks). Thank you!**

- I find the style in which the results section is written weird (until line 251). Why not just describe the results with a reference to the respective table? Now we have the tables first, and it is not really clear whether the text reiterates the information in the tables or whether additional information is provided. Also, why not merge Tables 1, 2 and 3? Table 1 alone does not provide enough information to be included in the main text. Maybe the results section can be structured as follows: 1

**Each paper has its own story and reasons why it was or wasn't reproducible and what the barriers were. With the individual sections we want to show how diverse the papers were and what types of challenges we faced. We now clarify this in the paper. We use three tables instead of one because a single table would be too large for the PDF page.**

- Line 239: When was the magnitude considered to be the same? It is important to describe your criteria exactly here to allow the reader to gauge the overall results of your study. Without knowing whether your criterion was lenient or rather strict, this cannot be done, as each and every one of us uses individual metrics to gauge similarity.

**Very good point. We use a rather lenient definition in the result tables, but describe each result in more detail to highlight problems or borderline cases. A formal definition is difficult, as the analyzed models are very different and it is hard to define a single criterion that captures the results adequately. We believe that our lenient definition, together with the descriptive part, works well in this smaller-scale study, but it would probably be infeasible for large-scale studies on reproducibility.**
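**For illustration only (the paper does not fix a formal criterion), a minimal sketch of one possible lenient check, namely same sign and a bounded relative deviation, is shown below; the tolerance value and function name are hypothetical.**

```python
# Hypothetical sketch of a lenient reproduction check: same sign and a
# relative deviation below a chosen tolerance. The 10% tolerance and the
# function name are illustrative assumptions, not the criterion used in the study.

def roughly_reproduced(original, reproduced, rel_tol=0.10):
    """True if the reproduced coefficient has the same sign as the original
    and deviates from it by at most rel_tol in relative terms."""
    same_sign = (original >= 0) == (reproduced >= 0)
    rel_dev = abs(reproduced - original) / max(abs(original), 1e-12)
    return same_sign and rel_dev <= rel_tol

print(roughly_reproduced(0.42, 0.40))   # True: same sign, ~5% relative deviation
print(roughly_reproduced(0.42, -0.05))  # False: sign flip
```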

- Fig 2: General comment - Without knowing the range of parameter values it is hard to interpret differences between the original and reproduced results. Why did you choose to report on the estimates of this article? Why not report, for example, on the results of article [1] instead?

**This is a very good point. The plot shows regression coefficients from the fitted GEE model. We added a clarification to the paper. We chose to report on this article as an example of our definition.**
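**The reproduction code for each paper is available in the GitLab repository listed in the Data Availability Statement. Purely as an illustration of the kind of model whose coefficients are compared in Fig 2, a minimal GEE fit in Python with statsmodels could look as follows; the synthetic data, variable names, and exchangeable working correlation are assumptions for this sketch and do not reproduce any specific article's analysis.**

```python
# Illustration only: fitting a GEE and extracting the coefficients that would
# be compared with the originally reported estimates. The synthetic data,
# variable names, and exchangeable working correlation are assumptions made
# for this sketch; they do not reproduce any specific article's analysis.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects, n_visits = 50, 4
data = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_subjects), n_visits),
    "time": np.tile(np.arange(n_visits), n_subjects),
    "treatment": np.repeat(rng.integers(0, 2, n_subjects), n_visits),
})
subject_effect = np.repeat(rng.normal(0, 1, n_subjects), n_visits)
data["outcome"] = (0.5 * data["treatment"] + 0.3 * data["time"]
                   + subject_effect + rng.normal(0, 1, len(data)))

# GEE with repeated measurements clustered by subject and an exchangeable
# working correlation structure.
model = smf.gee(
    "outcome ~ treatment + time",
    groups="subject_id",
    data=data,
    cov_struct=sm.cov_struct.Exchangeable(),
    family=sm.families.Gaussian(),
)
result = model.fit()

# Coefficients (and confidence intervals) of the kind plotted against the
# originally reported estimates.
print(result.params)
print(result.conf_int())
```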

- Lines 261-264: Large differences in magnitude should result in a different interpretation, even in the case of equal signs!

**We agree that our conclusion was inconsistent in this case and changed the classification of article [1] to not reproducible. In the other, reproducible papers, the deviations of the coefficients were much smaller.**

- Line 299: substantial!

**Thanks for spotting!**

- Line 351: Nevertheless we were only able ….

**Thanks for spotting!**

Attachment

Submitted filename: 2021_03_response.pdf

Decision Letter 2

Jelte M Wicherts

22 Apr 2021

A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

PONE-D-20-25993R2

Dear Dr. Seibold,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Jelte M. Wicherts

Academic Editor

PLOS ONE

Acceptance letter

Jelte M Wicherts

10 Jun 2021

PONE-D-20-25993R2

A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses

Dear Dr. Seibold:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Jelte M. Wicherts

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: 2019_08_response.pdf

    Attachment

    Submitted filename: 2020_10_response.pdf

    Attachment

    Submitted filename: 2021_03_response.pdf

    Data Availability Statement

    All results including detailed reports and code for each of the 11 papers are available in the GitLab repository https://gitlab.com/HeidiSeibold/reproducibility-study-plos-one. All files can also be accessed through the Open Science Framework (https://osf.io/xqknz).

