PLOS ONE. 2021 Jul 9;16(7):e0254090. doi: 10.1371/journal.pone.0254090

Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis

Nicole C Nelson 1,*, Kelsey Ichikawa 2, Julie Chung 2, Momin M Malik 3
Editor: Sergi Lozano
PMCID: PMC8270481  PMID: 34242331

Abstract

To those involved in discussions about rigor, reproducibility, and replication in science, conversations about the “reproducibility crisis” appear ill-structured. Seemingly very different issues concerning the purity of reagents, the accessibility of computational code, or misaligned incentives in academic research writ large are all collected under this label. Prior work has attempted to address this problem by creating analytical definitions of reproducibility. We take a novel empirical, mixed methods approach to understanding variation in reproducibility discussions, using a combination of grounded theory and correspondence analysis to examine how a variety of authors narrate the story of the reproducibility crisis. Contrary to expectations, this analysis demonstrates that there is a clear thematic core to reproducibility discussions, centered on the incentive structure of science, the transparency of methods and data, and the need to reform academic publishing. However, we also identify three clusters of discussion that are distinct from the main body of articles: one focused on reagents, another on statistical methods, and a final cluster focused on the heterogeneity of the natural world. Although there are discursive differences between scientific and popular articles, we find no strong differences in how scientists and journalists write about the reproducibility crisis. Our findings demonstrate the value of using qualitative methods to identify the bounds and features of reproducibility discourse, and identify distinct vocabularies and constituencies that reformers should engage with to promote change.

Introduction

A unique characteristic of recent conversations about rigor, reproducibility, and replication is that they are a truly transdisciplinary phenomenon, not confined to any single scientific discipline. A 2016 survey in Nature found that a majority of scientists across a wide range of disciplines had personal experience of failing to reproduce a result, and that a majority of these same scientists believed that science was presently facing a “significant” reproducibility crisis [1]. Reproducibility conversations are also unique compared to other methodological conversations because they have received sustained attention in both the scientific literature and the popular press. Major outlets such as the Wall Street Journal [2], the Economist [3], and the Atlantic [4–6] have all published feature-length articles on reproducibility.

The scale and scope of reproducibility problems, however, mean that conversations about them appear ill structured. In a news feature accompanying the Nature survey, one scientist described the results as “a confusing snapshot of attitudes” which demonstrated that there was “no consensus on what reproducibility is or should be” [7]. Thought leaders in the reproducibility space similarly describe these discussions as unwieldy. In a 2016 perspective article, Steven Goodman, Daniele Fanelli, and John Ioannidis of the Meta-Research Innovation Center at Stanford argued that “the lexicon of reproducibility to date has been multifarious and ill-defined,” and that a lack of clarity about the specific types of reproducibility being discussed was an impediment to making progress on these issues [8]. Many commentators have noted that there is considerable confusion between the terms reproducibility and replicability, and that those terms are often used interchangeably in the literature [9–13]. Victoria Stodden has argued that there are three major types of reproducibility—empirical, statistical, and computational—each of which represents a distinct conversation tied to a different discipline [14].

The attention devoted to reproducibility issues in the popular press adds another dimension of variation. Some commentators have suggested that journalists and the media are responsible for the crisis narrative, translating “rare instances of misconduct or instances of irreproducibility…into concerns that science is broken,” as a Science policy forum article puts it [15]. Content analysis of news media has similarly suggested that coverage of reproducibility issues is promoting a “science in crisis” story, raising concerns among scientists that overgeneralized media narratives may decrease public trust in science [16].

To date, scholars have attempted to address these concerns by proposing clarifying definitions or typologies to guide discussions. The National Academies’ 2019 report on reproducibility [17] notes the problem of terminological confusion and creates a definitional distinction between reproducibility and replicability—a distinction that aligns with the usage of these terms in the computational sciences, but which is at odds with the more flexible way they are used by major organizations such as the Center for Open Science and the National Institutes of Health [18, 19]. Numerous other attempts have been made by scholars from both the sciences and humanities/social sciences to clarify the terms of the discussion through conceptual analyses of reproducibility and related concepts such as replication, rigor, validity, and robustness [13, 20–23].

We take an empirical approach to systematizing conversations about reproducibility. Rather than developing analytical definitions, we look for underlying patterns of similarity and difference in existing discussions. Our approach to understanding variation in reproducibility conversations is also more expansive than previous approaches. Rather than focusing solely on differences in terminology, we examine differences in how authors tell the story of the reproducibility crisis. This approach offers insight not just into what authors refer to when writing about reproducibility, but also why they believe reproducibility is important (or unimportant), how they came to this realization, and what they think should be done about these issues.

Using a mixed-methods approach, we created a curated data set of 353 English-language articles on reproducibility issues in biomedicine and psychology (Figs 1 and 2) and analyzed the thematic components of each article’s narrative of the reproducibility crisis. We hand-coded the articles for four themes: what the authors saw as 1) the signs that there is a reproducibility crisis (e.g. a high profile failure to replicate, or an action taken by the NIH), 2) the sources of the crisis (e.g. poorly standardized reagents, or misaligned incentives in academic research), 3) the solutions to the crisis (e.g. greater transparency in data and methodology, or increased training in methods), and 4) the stakes of the crisis (e.g. public loss of confidence in science, or the potential for public policy to be built on faulty foundations). The combination of themes discussed and amount of text devoted to each theme creates a unique narrative profile for each article, which can then be compared to the mean article profile for the data set as a whole.

Fig 1. Data set by audience and author type.


Each block represents one article.

Fig 2. Data set by year of publication and audience.


Given that those at the center of reproducibility discussions experience those discussions as ill-structured, we expected to find distinct clusters of discourse: for example, a group of popular articles focusing on fraud as the source of irreproducible results, another group of scientific articles focusing on misuses of null hypothesis significance testing and proposals to change how p-values are reported and interpreted, and so on. Instead, we found that the majority of articles in our data set shared a common narrative structure which identified a lack of transparency, misaligned incentives, and problems with the culture of academic publishing as the core causes of irreproducible research.

Materials and methods

Qualitative research methods are rarely used explicitly in metascience, but they hold great value for understanding the subjective perceptions of scientists. Qualitative research is typically exploratory rather than confirmatory, and uses iterative, non-probabilistic methods for data collection and analysis. The many “researcher degrees of freedom” [24] inherent to qualitative research may raise concerns for readers more well-versed in quantitative paradigms, and can lead to misinterpretations about the conclusions that can be drawn from a qualitative data set. For the present study, the data collection and analysis methods were chosen to allow us to characterize the range and variability of the discursive landscape. However, they do not allow for conclusions to be drawn regarding the relative prevalence of different themes or narratives in a larger body of reproducibility discussions.

Data collection

We collected English-language articles discussing reproducibility issues using a maximum variation sampling strategy. Nonrandom, purposive sampling strategies such as maximum variation sampling are common in qualitative research because they yield “information rich” cases that are most productive for answering the research question at hand [25–27]. In the present case, maximizing variation increases the chances of identifying rare narratives that might be difficult to see in a random sample, while at the same time allowing us to identify shared patterns that cut across cases. If reproducibility discussions are ill structured or consist of distinct clusters of conversation, maximum variation sampling aids in characterizing the full extent of that variation. If reproducibility conversations are homogeneous, maximum variation sampling allows us to draw even stronger conclusions than a random sample would—any “signal [that emerges] from all the static of heterogeneity,” as Michael Quinn Patton puts it, is of “particular interest and value in capturing the core experiences and central, shared dimensions of a setting or phenomenon” [25].

We employed an iterative version of maximum variation sampling: We first collected a sample of articles that maximized variation along dimensions suggested by the existing literature. We then analyzed that sample to identify rare article types, and finally collected a second sample to maximize those rare article types. Based on existing commentaries about the structure of reproducibility conversations, we aimed to maximize variation in 1) discipline, 2) terminology, and 3) audience. We chose to focus on biomedicine and psychology, since reproducibility discussions have been especially active in these fields and have generated substantial popular press coverage (compared to fields such as computer science, where reproducibility issues have been extensively discussed by scientists but not the popular press). We used two databases specializing in scientific literature (Web of Science, PubMed), and two databases specializing in print mass media (Nexis Uni, ProQuest). Using multiple databases introduces redundancy that can compensate for the potential weaknesses of each individual database. For example, not all subfields of psychology may be equally well represented in PubMed, but they may be better captured in Web of Science, which includes humanities and social science indexes.

To maximize heterogeneity in terminology, we used multiple search strings with variations and combinations of the following terms: “reproducibility,” “irreproducibility,” “credibility,” “confidence,” or “replicability;” “translational research,” “medical,” “clinical,” or “research;” “crisis,” “problem,” or “issue.” For each query, we reviewed the results and collected articles relevant to reproducibility, excluding articles on clearly unrelated topics (e.g. DNA replication). When search strings retrieved many relevant results (e.g. >500) and could not be feasibly narrowed by modifying the search string, we purposively sampled the relevant articles by selecting rare article types (e.g. non-US articles, articles in smaller journals or newspapers, blog posts). We stopped searching when new permutations of searches revealed few novel articles.

In the resulting data set, the following types of articles were rare: articles published before 2014, articles published in online venues and aimed at popular audiences (e.g. blog posts, online magazines such as Slate), non-US articles, political opinion articles, conference proceedings, and white papers from professional societies. To further maximize heterogeneity in our data set, we searched specifically for those rare article types by: 1) following links/citations to rare article types within the articles we had already collected, 2) searching for earlier publications by the authors already identified, and 3) searching for media coverage of key events that took place prior to 2014. It should be noted that these search strategies may have minimized variation along some dimensions while maximizing variation along others. For example, searching for earlier publications by the authors identified in the first round maximized variation in year of publication, but likely did not increase the number of unique authors included.

This search process began in September 2018 and was completed in December 2018, and resulted in a total data set of 464 articles. This number may seem small, but it is in line with prior research: One study using similar search strings in Web of Science found only 99 articles that the author identified as discussing the reproducibility crisis (searching across all scientific disciplines and publication years ranging from 1933 to 2017) [28]. We refined our data set to select for “information-rich” articles. In searching for articles that described irreproducibility as a “crisis,” “problem,” or “issue,” we aimed to identify articles that included a narrative recounting of the reproducibility crisis as a scientific/intellectual movement [29], rather than merely describing a method for enhancing the reproducibility or validity of a specific technique. We excluded research articles that did not contain such a narrative, as well as hearing transcripts and press releases, since these genres did not routinely include a reproducibility narrative. Our final data set contained 353 articles. Complete bibliographic information for the data set (including the articles excluded from the analysis) is available at: https://www.zotero.org/groups/2532824/reproducibility-ca.

Qualitative data analysis

We employed grounded theory methodology [30] to develop a coding scheme to analyze the themes present in the data set. Grounded theory methodology is widely used in qualitative research to derive new theory inductively from empirical data. It is especially useful when little is known about the phenomena under study, or when existing theories do not adequately capture or explain the phenomenon. It proceeds in two phases. The “open coding” phase involves a process of generating and iteratively refining “codes” that capture particular themes in the data. This is followed by a “focused coding” phase where the entire data set is then re-coded using the coding scheme generated during the first phase. While grounded theory methodology asks researchers to pay close attention to the themes as they are expressed by the people under study, it also recognizes that all researchers carry “conceptual baggage” [31] which influences their interpretation of the empirical material. The three authors (N.C.N., K.I., J.C.) who developed the coding scheme each have different disciplinary backgrounds and began with different degrees of familiarity with reproducibility discussions: K.I. and J.C. were undergraduate students with training in neuroscience and anthropology, respectively, and had relatively little familiarity with qualitative data analysis or reproducibility discussions on beginning the project. N.C.N.’s training is in the field of science and technology studies, and she began the project with an extensive background in qualitative data analysis and a moderate degree of familiarity with reproducibility discussions.

To make use of the differing perspectives embodied in our research team, N.C.N., K.I., and J.C. each independently selected a random sample of 15–25 articles from our data set and created a list of the themes present in those articles. We compared our lists, noting common themes and generating broader umbrella themes that summarized more specific themes. We refined these thematic codes until 1) our codes covered most of the themes that could reasonably arise in our articles, and 2) each code was specific enough to reveal meaningful trends in code frequency patterns, but general enough to apply to common ideas across articles. To maximize inter-rater reliability (IRR) in applying the codes to the data set, we then performed three rounds of code refinement. All three coders independently coded a group of articles randomly selected from our data set, and we compared our applications of the coding scheme to the articles and revised the coding scheme to improve consistency.

We then coded all 353 articles afresh using this refined coding scheme. Articles were assigned to each coder using a random number generator, and each then independently coded her assigned articles using the qualitative data analysis software NVivo12 [32]. In cases where a sentence contained more than one theme, we coded that sentence with both themes in most cases (S1 Appendix provides a complete description of instances where we opted not to “double code” passages with more than one theme). For the purposes of later calculating IRR scores, fifty-three of these articles (randomly selected but distributed evenly throughout the data set, to account for potential drift in our application of the codes over time) were assigned to and coded by all three coders. In addition to coding the articles for thematic content, we hand-coded the intended audience of each article, the author type, and the main term used in the article (reproducibility, replication, or another term such as “credibility crisis”). The full coding scheme, including the description of each code and examples of the type of discussion included under that code, is available as an appendix (S1 Appendix).

When coding was complete, we merged the NVivo files and conducted pairwise comparisons of IRR on the 53 articles coded by all three researchers using NVivo’s “Coding Comparison” query at the paragraph level. Achieving strong IRR scores is a common difficulty in qualitative research, particularly as the codes increase in conceptual complexity and for themes with a low prevalence in the data set [33]. This is reflected in our average Kappa scores (S1 Table): we achieved excellent agreement on codes relating to specific individuals or events (e.g., discussions of John Ioannidis’s work or Nature’s 2016 reader survey on reproducibility), but much lower agreement on codes describing more complex ideas (e.g. that scientists need to change their expectations about what degree or type of reproducibility should be expected). We modified several codes with poor Kappa scores by combining codes that overlapped or by narrowing their scope (details about the specific modifications made are described in S1 Appendix). Only codes that reached an average Kappa score of 0.60 or higher, indicating moderate to substantial agreement between raters [33, 34], were used in the next stage of analysis.
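
For readers unfamiliar with how paragraph-level agreement translates into a Kappa score, the sketch below computes Cohen’s Kappa for a single code from two coders’ paragraph-level presence/absence judgments. It is an illustrative re-implementation rather than the NVivo “Coding Comparison” query used in the study, and the coder_a and coder_b vectors are hypothetical.

```r
# Illustrative only: Cohen's Kappa for one code, treating each paragraph as a
# unit and each coder's judgment as presence (1) or absence (0) of the code.
# (The study itself obtained Kappa from NVivo's Coding Comparison query.)
cohen_kappa <- function(coder_a, coder_b) {
  stopifnot(length(coder_a) == length(coder_b))
  po   <- mean(coder_a == coder_b)                      # observed agreement
  p_a1 <- mean(coder_a == 1)
  p_b1 <- mean(coder_b == 1)
  pe   <- p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)         # expected agreement by chance
  (po - pe) / (1 - pe)
}

# Hypothetical paragraph-level judgments for one code in one article
coder_a <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0)
coder_b <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)
cohen_kappa(coder_a, coder_b)
```

Pairwise scores of this kind for the three coder pairs can then be averaged per code, which is the form reported in S1 Table.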

Correspondence analysis

To visualize similarities and differences between articles and authors, we chose correspondence analysis (CA), a dimensionality reduction technique akin to principal component analysis (PCA) but applied to contingency tables and hence appropriate for categorical and count data [35]. CA also has a long history of use as part of mixed methods approaches for answering sociological questions. Social theorist Pierre Bourdieu famously used multiple correspondence analysis to study the structure of fields and social spaces beginning in the 1970s [36]. CA also has a history of use in examining differences in authors’ writing styles, and authorship data sets frequently appear as canonical examples in CA textbooks and methodological discussions [35, 37].

CA is particularly suited for our analysis because, unlike PCA, it normalizes by the length of a text. Existing analyses show that shorter and longer texts by the same author will appear to be stylistically different when analyzed through PCA, but not through CA [37]. This is important because our data set contains articles ranging from less than one page to more than a hundred, and authors also devote different amounts of space within each article to discussing the reproducibility crisis (e.g. a newspaper article may focus entirely on the crisis, while a longer academic article may only devote a few pages to the subject). CA normalizes the amount of text tagged with each code against the total amount of text coded in the article as a whole, thereby only considering relative proportions within parts of an article that are coded.
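
Concretely, CA operates on row profiles: each article’s vector of coded-text amounts is divided by its row total, so only relative shares matter. The following toy example in R, with hypothetical numbers, shows why a short and a long article with the same relative emphasis are treated identically:

```r
# Toy example: two articles tagged with three themes, in arbitrary text units.
# The longer article has ten times as much coded text but the same *profile*
# as the shorter one, so CA treats the two rows as identical.
counts <- rbind(short_article = c(transparency = 2,  incentives = 1,  reagents = 1),
                long_article  = c(transparency = 20, incentives = 10, reagents = 10))
prop.table(counts, margin = 1)   # both rows: 0.50, 0.25, 0.25
```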

We exported tables from NVivo summarizing the percentage of text coded for each theme in each article as .xlsx spreadsheets using the “Summary View” option in the Node Summary pane (available in NVivo for Windows only), as well as tables summarizing article metadata and IRR. We compiled these files into a data frame and performed correspondence analysis using the FactoMineR package (version 2.3) [38]. The 29 codes reaching the Kappa score threshold of 0.60 were treated as active variables in the CA, and word frequency variables and metadata were treated as supplementary variables. Based on a scree plot of the variance explained per dimension, we chose to interpret the first three dimensions of the CA. The data files exported from NVivo and the code for the analysis are available at: https://github.com/nicole-c-nelson/reproducibility-CA.
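
As a rough sketch of this step (not a substitute for the code in the repository linked above), the call below fits a CA of the same shape with FactoMineR, treating the theme columns as active and the word-frequency and metadata columns as supplementary. The file name article_profiles.csv and the column indices are placeholders for the exported spreadsheet layout.

```r
library(FactoMineR)  # version 2.3 was used for the published analysis

# profiles: one row per article; the 29 active theme columns (percentage of
# coded text per theme) come first, followed by supplementary word-frequency
# and metadata columns. The indices below are placeholders for that layout.
profiles <- read.csv("article_profiles.csv", row.names = 1)  # hypothetical file

wordfreq_cols <- 30:35   # word-frequency variables (quantitative supplementary)
meta_cols     <- 36:38   # audience, author type, main term (qualitative supplementary)

res.ca <- CA(profiles,
             ncp        = 18,             # keep enough dimensions for the later clustering step
             quanti.sup = wordfreq_cols,  # correlated with, but not constructing, the dimensions
             quali.sup  = meta_cols,
             graph      = FALSE)

# Scree plot of variance explained per dimension, used to decide how many
# dimensions to interpret (the first three in the published analysis)
barplot(res.ca$eig[, "percentage of variance"],
        names.arg = seq_len(nrow(res.ca$eig)),
        xlab = "Dimension", ylab = "% of variance")
```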

Word frequency variables

In CA, supplementary variables are frequently used to assist in the qualitative interpretation of the meaning of the dimensions. These variables do not participate in the construction of the dimensions but can be correlated with the dimensions after they are constructed. We used NVivo’s “Text Search” query function to construct several word frequency variables by counting mentions of the following terms (and stemmed/related terms) and expressing those word counts as a percentage of the total words in the article: “Gelman”, “Ioannidis”, “NIH”, “psychology”, “questionable research practices”, and “reagent/antibody/cell line”. We used the “Gelman”, “Ioannidis”, and “reagent/antibody/cell line” variables as an internal double check on our analysis, comparing the position of those word frequency variables to the position of the Andrew Gelman, John Ioannidis, and reagent variables assessed through qualitative data analysis. We also used word frequency variables to aid in the interpretation of Dimension 1. We selected “NIH” as a term that might be more strongly associated with biomedicine, and “psychology” and “questionable research practices” as terms that might be more strongly associated with psychology, to assess whether Dimension 1 could be interpreted as capturing disciplinary differences.
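
The counts themselves came from NVivo’s “Text Search” query; the snippet below is only an approximate re-implementation in R, included to make the construction of these variables concrete. The article_text object and the stem pattern are hypothetical.

```r
# Approximate re-implementation of a word-frequency variable: count matches of
# a stem pattern in an article's text and express them as a percentage of the
# article's total word count. (The study used NVivo's Text Search query.)
word_freq_pct <- function(article_text, pattern) {
  text  <- tolower(article_text)
  words <- unlist(strsplit(text, "\\s+"))
  m     <- gregexpr(pattern, text)[[1]]
  hits  <- if (m[1] == -1) 0 else length(m)
  100 * hits / length(words)
}

# e.g., the reagent/antibody/cell line variable might be approximated as:
word_freq_pct(article_text, "reagent|antibod|cell line")
```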

Hierarchical clustering

We used the FactoMineR package [38] to perform hierarchical clustering on the articles in our data set using Euclidean distance. While the distances between rows in the latent dimensions are χ2 distances, the projection is onto an orthonormal basis, and so it is appropriate to cluster using Euclidean distance [39]. To eliminate noise and obtain a more stable clustering, we retained only the first 18 dimensions from the CA (representing ~75% of the variance). We chose to cluster the data into four classes based on measurements of the loss of between-class inertia (the inertia is a measure of variance appropriate for categorical data, defined as the weighted average of the squared χ2-distances between the articles’ profiles and the average profile).
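
A minimal sketch of this clustering step, assuming the res.ca object from the earlier sketch was fitted with ncp = 18 so that the first 18 dimensions are retained:

```r
# Hierarchical clustering on the articles' coordinates in the first 18 CA
# dimensions (Euclidean distance, Ward aggregation), cut into four classes.
res.hcpc <- HCPC(res.ca,
                 nb.clust = 4,           # four classes, chosen from the loss of between-class inertia
                 metric   = "euclidean",
                 method   = "ward",
                 graph    = FALSE)

head(res.hcpc$data.clust$clust)          # cluster label assigned to each article
```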

Bootstrap analysis

Bootstrapping has been applied to CA, but only to capture the variability of the relationship between rows and columns: in our case, that would correspond to uncertainty from ambiguity in coding [40]. But given the importance of maximum variation sampling to our argument, the uncertainty we are interested in quantifying is that resulting from our choice of articles. To simulate this, we constructed 1,000 bootstrap samples by resampling with replacement from the set of 353 articles. These cannot be analyzed with individual correspondence analyses, as the resulting coordinates would not be comparable (CA solutions can be reflected and still be equivalent, and some rotation or scaling might make for a fairer comparison). We experimented with Procrustes analysis, but in order to also make the inertia of each sample comparable, we arrived at Multiple Factor Analysis [41] as a more principled framework [42] for carrying out the bootstrap analysis. Multiple factor analysis (MFA) allows the analyst to subdivide a matrix into groups and then compare, for example, how groups of people differ in their responses to survey questions or their evaluations of the qualities of an object (e.g., how experts and consumers rate the sensory qualities of the same wines) [42]. In our case, we used MFA to analyze how our 29 themes were positioned in 1) each of the 1,000 individual bootstrap samples, 2) our original sample, and 3) all 1,001 samples considered together. Using the FactoMineR package [38], we applied MFA to the 353 × 1,001 = 353,353 observations of the original and bootstrap samples. To estimate a 95% confidence region in the first factor plane for each theme, we plotted points for each theme’s position in each of the 1,001 samples, calculated the convex hull around the point cloud for each theme, and calculated peeled convex hulls consisting of 95% of the points, using the method described by Greenacre [35].
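
The sketch below illustrates two pieces of this procedure: drawing the bootstrap samples and peeling a 95% convex hull around one theme’s cloud of positions. The MFA call shows only one possible arrangement of the samples (the original and bootstrap theme-by-article tables bound together as frequency-type column groups); the exact layout used in the published analysis is in the project repository, and theme_by_article and coords are placeholder objects.

```r
set.seed(1)

# theme_by_article: hypothetical 29 x 353 matrix (themes in rows, articles in
# columns) holding the coded text shares. Resample articles (columns) with
# replacement to build 1,000 bootstrap tables, keeping the original as well.
n_boot  <- 1000
samples <- c(list(theme_by_article),
             lapply(seq_len(n_boot), function(i)
               theme_by_article[, sample(ncol(theme_by_article), replace = TRUE)]))

# One possible arrangement for the MFA (computationally heavy; shown only to
# make the layout concrete): bind the 1,001 tables column-wise and treat each
# as a frequency-type group, so every theme gets an overall position plus one
# partial position per sample.
stacked <- do.call(cbind, samples)
colnames(stacked) <- paste0("a", seq_len(ncol(stacked)))
res.mfa <- MFA(as.data.frame(stacked),
               group = rep(ncol(theme_by_article), length(samples)),
               type  = rep("f", length(samples)),
               graph = FALSE)

# Peeled convex hull: repeatedly strip the outermost hull until roughly 95% of
# a theme's positions remain, then return the vertices of the remaining hull.
peel_hull <- function(coords, keep = 0.95) {
  target <- ceiling(keep * nrow(coords))
  pts <- coords
  while (nrow(pts) > target) {
    hull <- chull(pts)
    if (nrow(pts) - length(hull) < 3) break   # stop before the hull degenerates
    pts <- pts[-hull, , drop = FALSE]
  }
  pts[chull(pts), , drop = FALSE]
}

# coords: hypothetical two-column matrix of one theme's first-factor-plane
# positions across the 1,001 samples
hull95 <- peel_hull(coords)
```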

Analysis by author and audience

To examine differences between journalist/scientist and popular/scientific articles, we again used MFA implemented with the FactoMineR package [38]. We grouped the articles using our hand-coded metadata (journalist/scientist author; scientific/popular venue). For the analysis by author type, the groups of articles authored by journalists and by scientists were treated as active (included in the analysis with full mass), and the group of articles by other authors (e.g., members of the general public, employees of policy think tanks) was treated as supplementary (retained in the analysis, but with mass set to zero and hence contributing nothing to inertia). For the analysis by intended audience, the groups of articles (popular and scientific venues) were both treated as active. All code used for analysis is available at: https://github.com/nicole-c-nelson/reproducibility-CA.
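
A minimal sketch of the grouped analysis by author type, under the same assumed layout as in the previous sketch (themes in rows, articles in columns, ordered so that journalist-authored, scientist-authored, and other-authored articles form contiguous blocks); the group sizes n_journalist, n_scientist, and n_other are placeholders.

```r
# Articles grouped by author type; the "other" group is declared supplementary,
# so it is projected into the space but contributes nothing to the inertia.
res.mfa.author <- MFA(theme_by_article_ordered,
                      group         = c(n_journalist, n_scientist, n_other),
                      type          = rep("f", 3),
                      name.group    = c("journalist", "scientist", "other"),
                      num.group.sup = 3,
                      graph         = FALSE)

res.mfa.author$group$RV   # RV coefficients between the group configurations
```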

Results

Correspondence analysis

Correspondence analysis (CA) uses spatial embeddings to visualize relationships between the rows and columns of a matrix. In our case, the unique narrative profile for each article (consisting of the amount of text devoted to each theme in that article) was compared to the mean article profile for the data set as a whole. The first factor plane (Fig 3), representing 16% of the variance, captures two distinctions: Dimension 1 separates articles focused on bench work from articles focused on statistical methods, and Dimension 2 separates articles that focus on technical problems from those that focus on the stakes of the crisis.

Fig 3. Correspondence analysis biplot of 353 articles discussing reproducibility, analyzed for 29 themes.


Articles that are close together have similar narrative profiles. The closer an article appears to the center of the plot, the more closely it resembles the mean profile for all articles. The further away a theme is from the origin, the more variation there is in how authors discuss that theme. The color of an article’s plotted point (a circle) indicates the main term used in the article, and the size of a theme’s plotted point (a square) represents the contribution of that theme to constructing the dimensions. The eight most contributing themes are labeled. Supplementary variables (not used to construct the dimensions) are labeled in red.

The distribution of articles across the first factor plane shows many articles clustered around the origin. If reproducibility discourse were composed of distinct clusters of conversation, we would expect to see relatively few articles at the origin—the mean article profile would be a theoretical entity representing the average of several clusters distributed across the first factor plane. Instead, we find that many actual articles resemble the mean article profile in their relative coverage across themes. The proximity of many articles to each other and to the origin of the first factor plane indicates an overall lack of variation in reproducibility discussions. On average, articles in our data set devoted a large percentage of text to discussing transparency, incentives, and publishing culture (9.25%, 8.34%, and 5.82% respectively) but these three themes contributed little to the construction of the first three dimensions of the correspondence analysis (S2 Table). This large presence but low variability suggests that these themes constitute the core of contemporary discussions of reproducibility, with most authors discussing these themes to similar degrees. Given that the data set was constructed to maximize heterogeneity, this is a finding that is generalizable to the discourse as a whole.

Two unique clusters of articles are evident in the upper left and upper right quadrants of the first factor plane. Highly contributing articles located in the upper left extremes include scientific publications on antibody and cell line authentication [43, 44], as well as Nature News pieces covering these same issues [45]. In the upper right extremes are articles appearing in Nature News and The New York Times comparing frequentist and Bayesian statistical approaches, and citing thought leaders in psychology such as Eric-Jan Wagenmakers and Uri Simonsohn [46, 47]. These articles are unique not simply because of the extent to which they discuss reagents and statistical issues, but because they discuss these issues without making strong connections to other themes in reproducibility discourse. For example, the title of the Nature News feature “Reproducibility crisis: blame it on the antibodies” suggests a singular explanation for reproducibility problems, which is reinforced in the body of the article with a quote asserting that “poorly characterized antibodies probably contribute to the problem more than any other laboratory tool” [45]. While articles in this cluster typically assert that reagents are only one contributor to irreproducibility, they do not discuss other contributors in depth.

An analysis of the supplementary variables correlated with the dimensions suggests that Dimension 1 captures some disciplinary differences and Dimension 2 captures some differences in the intended audience of the article. The word frequency variables “psychology” and “NIH” are, respectively, positively correlated (r = 0.25, 95% CI [0.15,0.35], p = 1.24e-06) and negatively correlated (r = -0.29, 95% CI [-0.38, -0.19], p = 3.32e-08) with Dimension 1, meaning that discussions of statistical techniques are more closely associated with mentions of psychology or psychologists, while discussions of reagents are associated with mentions of the NIH. This suggests that Dimension 1 could also be interpreted as separating biomedicine and psychology. However, it should be noted that these correlations are weak and that mentions of “questionable research practices” are not well correlated with Dimension 1 (r = 0.11, 95% CI [0.01, 0.21] p = 3.48e-02), which is surprising given that this term was coined in the context of discussions of statistical practices in psychology [48].

The intended audience of an article is correlated (r = 0.34, 95% CI [0.23, 0.42], p = 9.07e-11) with Dimension 2. This indicates that the distinction between articles examining technical problems and those that focus on social problems is related to differences in audience. Articles at the negative extreme of Dimension 2 include newspaper op-eds and short articles that focus on problems in science broadly construed [49–51], typically drawing on empirical examples from several disciplines or using evidence from one field to draw conclusions about another (e.g., claiming that evidence of irreproducibility in biomedicine means that half of all findings about climate change might also be untrue). These articles share a focus on the stakes of the reproducibility crisis, directly questioning the legitimacy of the scientific enterprise or expressing the fear that reproducibility issues may cause others to lose faith in science. As one op-ed puts it succinctly, “the house of science seems at present to be in a state of crisis” [49]. These articles support existing analyses arguing that media coverage of the reproducibility crisis may undermine public trust in science, although it is worth noting that a number of the articles at the extreme of this dimension are specific to climate science and authored by individuals associated with conservative think tanks (e.g., the Pacific Research Institute for Public Policy).

Dimension 3 (Fig 4) separates out a cluster of articles focusing on the heterogeneity and intrinsic complexity of the natural world. This cluster includes articles by psychologists and neuroscientists arguing that it should not be surprising that many results fail to replicate because of differences in experimenter gender [52] and the “contextual sensitivity” of many phenomena [53, 54]. It also includes articles on animal experiments which argue that increasing standardization might counterintuitively decrease reproducibility by generating results that are idiosyncratic to a particular laboratory environment [55–57]. The clusters of articles discussing reagents and Bayesian statistics fall at the opposite extreme of this dimension, representing a distinction between authors who tend to focus on standardization and statistics as the solution to reproducibility problems, versus those who see reproducibility problems as the natural consequence of complexity and intentional deployment of variation as the solution. As one article at the negative extreme of Dimension 3 puts it, “experiments conducted under highly standardized conditions may reveal local ‘truths’ with little external validity,” contributing to “spurious and conflicting findings in the literature” [56]. The authors argue that intentional deployment of variation, such as allocating mice to different housing conditions, offers a more promising approach to addressing reproducibility issues than further standardization.

Fig 4. Correspondence analysis biplot depicting dimensions 1 and 3 of the analysis.


Articles that are close together have similar narrative profiles. The closer an article appears to the center of the plot, the more closely it resembles the mean profile for all articles. The further away a theme is from the origin, the more variation there is in how authors discuss that theme. The color of an article’s plotted point (a circle) indicates their label in a hierarchical clustering, based on Euclidean distance in the latent dimensions. The size of a theme’s plotted point (a square) represents the contribution of that theme to constructing the dimensions. The eight most contributing themes are labeled.

All three of the unique article clusters appearing in Fig 4 (reagents, p-values/Bayesian statistics, and heterogeneity) fall into clusters separate from the main body of articles when analyzed using hierarchical clustering. The position of these hierarchical clusters on the factor plane again suggests that reproducibility discourse as a whole is more unified than commentators have assumed. While there are several unique clusters of conversation, the largest cluster is centered at the origin, indicating a low degree of variation in reproducibility narratives.

Bootstrap analysis

To gain additional insight into themes that cluster together in reproducibility conversations, we conducted bootstrap resampling of our original data set combined with multiple factor analysis (Fig 5). The 95% confidence region for Bayesian statistics overlaps with the confidence regions for p-values and sample size and power, indicating that these three themes cannot be reliably distinguished and should be treated as part of a shared discussion about statistical methods. The themes reagents and Bayesian statistics have elongated confidence regions pointing towards the origin, indicating variation in how these themes are discussed along both of the first two dimensions. While some articles discuss reagents and statistics to the exclusion of other themes, other articles discuss them alongside other potential sources of irreproducibility. The asymmetry of the bootstrapped peeled convex hulls arises because few articles focus uniquely on reagents or statistics; when those articles are not sampled into a particular bootstrap draw, the closer association of those themes with others draws their positions closer to the origin. The overlapping confidence regions for fraud, legitimacy of science, retractions, and impact on policy or habits show a similar structure, but with variation primarily in the second dimension. This suggests that the degree to which an article discusses these four themes varies along with the audience of the article but not with the type of science or disciplines it discusses. The core themes transparency, incentives, and publishing culture are overlapping and found near the origin, as are pre-registration and Brian Nosek/Center for Open Science. Although the confidence regions for these two central clusters are distinct in our analysis, this is likely an artifact of our coding scheme (S1 Appendix). When authors discussed the Center for Open Science’s efforts to enhance transparency of data and methods, we elected to code these passages as Brian Nosek/Center for Open Science and not transparency, and this decision is likely responsible for their apparent distinction.

Fig 5. Multiple factor analysis of bootstrap samples drawn from a set of 353 articles on reproducibility.


Shaded areas indicate the peeled convex hulls of points based on 1000 bootstrap replicates plus the original sample, showing an approximate 95% confidence region for each theme. Twenty-nine codes were included in the analysis, but only select codes are displayed for legibility. Colored squares indicate the position of each theme based on the original sample, and black squares indicate the position of each theme based on the overall bootstrap analysis.

The bootstrap analysis also provides insight into the effect of the maximum variation sampling strategy on the analysis. The position of each theme in the overall bootstrap analysis is closer to the origin than its position in the original analysis, indicating that the total inertia (recall: the weighted average of the squared χ2-distances between the articles’ profiles and the average profile, used as a measure of variance) of the bootstrap samples is on average smaller than that of the maximum variation sample. The difference in position between the original and bootstrap analyses is greater for themes with greater variation, such as reagents and Bayesian statistics. This suggests that the maximum variation sampling strategy has, as expected, captured more variation than would be expected in a random sample or in the population of articles as a whole.

Analysis by author and audience

Finally, we used multiple factor analysis to compare the profiles of articles authored by journalists versus those by scientists, and articles published in scientist-facing venues versus public-facing venues. When articles authored by journalists and scientists are considered separately, the first factor plane resembles the results of the correspondence analysis performed on the entire data set (Fig 6). The themes with the greatest within-theme inertia on the first factor plane (indicating larger differences in how scientists and journalists discuss these themes) are Bayesian statistics, sample size and power, p-values, and reagents. Scientists tend to write more than journalists about these technical issues, although journalists do sometimes devote substantial attention to highly technical topics [46]. Journalists discuss questions about the legitimacy of science and Bayer/Amgen scientists’ reports of their experiences of failures to replicate [58, 59] at greater length than scientists. Overall, however, scientists and journalists address similar themes to similar extents when writing about reproducibility.

Fig 6. Multiple factor analysis of 353 articles discussing reproducibility, analyzed for 29 themes, with articles grouped by author type.


The size of the square indicates the degree of within-theme inertia. The eight themes with the greatest within-theme inertia are labeled, and partial points for those themes are displayed. Red points indicate the location of those themes in the group of articles authored by scientists, and blue points indicate the theme location for the journalist group. Grey points indicate the location of the articles on the first factor plane.

Larger differences are evident when comparing articles written for a scientific audience to those appearing in popular venues (Fig 7). The RV coefficient (a measure of similarity between squared symmetric matrices) for the journalist and scientist author groups is 0.71, while for the scientific and popular article groups it is 0.52. In the analysis by audience, Dimension 1 separates articles focused on statistics from the main body of articles, and Dimension 2 makes visible two narratives that are prominent in popular articles but not in scientific ones. At one extreme of Dimension 2 are articles focusing on heterogeneity, as well as a cluster of popular articles found near the themes Brian Nosek/Center for Open Science, failures to replicate important findings, and heterogeneity. These articles share a narrative of the replication crisis that is specific to psychology, which discusses the failure to replicate long-established findings such as priming effects alongside the COS’s efforts to estimate the reproducibility of other findings in the Reproducibility Project: Psychology [60]. Articles authored by journalists in this group argue that there is a crisis in the field [61, 62], and articles authored by psychologists in this group rebut the idea that there is a crisis by appealing to the heterogeneity and complexity of the phenomena psychologists study [53, 54].

Fig 7. Multiple factor analysis of 353 articles discussing reproducibility, analyzed for 29 themes, with articles grouped by intended audience.


The size of the square indicates the degree of within-theme inertia. The eight themes with the greatest within-theme inertia are labeled, and partial points for those themes are displayed. Red points indicate the location of those themes in the group of articles aimed at a scientific audience, and blue points indicate the theme location for the popular audience group. Grey points indicate the location of the articles on the first factor plane.

At the other extreme of Dimension 2 are articles focusing on reagent issues, as well as a more loosely clustered collection of popular articles that are generally pessimistic about science and scientists. This group includes articles that question whether academic science is really as good at self-correction [63] or at generating novel ideas [64] as it is generally presumed to be. Misconduct, retractions, and the idea that there is “rot” at the core of science are prominent themes [65]. The tone of this cluster is perhaps best encapsulated by a Fox News opinion piece titled “Has science lost its way?,” which argues that “the single greatest threat to science right now comes from within its own ranks” [66].

Discussion

Taken together, our findings suggest that conversations about the reproducibility crisis are far from ill structured. The specific structure we describe here should not be taken as a definitive mapping of reproducibility discourse—it reflects the particular themes included in our analysis, and including additional themes or using a different ontology of themes would result in a different structure. Nevertheless, this model of the reproducibility crisis’s discursive structure is useful in that it points towards potential strategies that reformers could adopt to advance reproducibility-oriented change in science.

Our findings suggest that there is a clear thematic core to reproducibility discussions, one that is shared by biomedical scientists and psychologists, and by scientists and journalists. Given that our data set was constructed to maximize heterogeneity, this is an especially noteworthy finding. This core pertains primarily to the sources of irreproducibility and associated solutions; more variation was present in what authors identified as the signs of crisis or the stakes of continued irreproducibility. This is good news for reformers, in that it suggests that consensus is present around the important question of what reforms should look like. Individuals may be convinced by a variety of signs and motivated by a variety of rationales (e.g., the need to reduce costs in drug development, or to preserve the public credibility of science), but these diverse experiences and motives appear to point towards similar interventions.

Our analysis did not identify any strong differences in how journalists and scientists write about the reproducibility crisis. It is worth noting that we did not distinguish between different ways that authors might discuss the same theme (e.g., arguing for or against the need for transparency), and the cluster of articles in Fig 7 discussing the crisis in psychology raises the possibility that journalists and scientists may address the same themes but arrive at different conclusions. However, the supplementary variables correlated with the second dimension of the correspondence analysis and the RV coefficients for the multiple factor analyses both indicate a relationship between the venue in which an article is published and the themes it addresses. This suggests that scientists and journalists alike may foreground different aspects of reproducibility when writing for a popular audience, and therefore both bear some responsibility for popular narratives that call the “well-being” of science into question [16].

It is tempting to attribute the coherence of reproducibility discourse to the emerging discipline of metascience, given the strong resemblance between the core themes identified in this study and the core concerns of this new field. Metascientists have been central players in discussions about reproducibility, and have been especially active in research related to scholarly communication and open science [67]. However, time series analysis shows that the thematic core identified here has been present even from the very early days of reproducibility discussions [68], suggesting that it emerged before or alongside metascience rather than as a result of the formation of the field. The cluster of themes identified here is also present in articles by authors who would be unlikely to self-identify as metascientists, such as Francis Collins and Lawrence Tabak’s early paper outlining the NIH’s plans for enhancing reproducibility [19].

Although there is a clear core to reproducibility discussions, some elements are better integrated than others. Correspondence analysis, subsequent hierarchical clustering in the latent dimensions, and our bootstrap analysis all indicate that some articles discussing reagents, statistics, and heterogeneity are distinct from the main body of articles we analyzed. Reformers should take note of these minority constituencies in crafting their arguments and interventions, because these single-issue and heterogeneity-focused groups of authors are less likely to see a need for systemic reform. Our findings suggest that there is a subset of scientists, not confined to any particular discipline, who see the natural world as more intrinsically variable than their colleagues do and are therefore less inclined to see failures to replicate as problematic. We also generally observed in our close reading of the articles (although we did not code for this explicitly) that those authors who saw reproducibility problems as largely attributable to a single factor tended to focus either on antibody specificity or overuse/misuse of p-values. While we are not able to draw conclusions about the size of these constituencies based on our analysis, we are able to identify their distinct orientation towards reproducibility problems.

Differences in core assumptions about heterogeneity also explain why the distinction between direct and conceptual replications remains controversial: to those who see the natural world as deeply variable, few (if any) replications would truly count as direct replications. While those who employ the direct/conceptual distinction acknowledge that it is one of degree rather than of kind, and that not everyone will agree on what counts as a replication [69], they do not acknowledge that individuals appear to vary widely in their baseline assumptions about variation. For reformers, this suggests that attempts to use replication to advance theory development are likely to be frustrated if they do not take into account this diversity in scientists’ worldviews.

Articles also differed in their usage of the terms reproducibility and replication. The primary term used in an article is correlated (r = 0.38, 95% CI [0.29, 0.47], p = 2.87e-12) with Dimension 1, suggesting that these terms might be markers of methodological or disciplinary difference. While our maximum variation sampling strategy may have exaggerated the degree of variation in terminology, our analysis points towards the presence of established patterns of use unrelated to the terms’ meanings, which may interfere with attempts to create definitional distinctions. This does not mean that there is no value for reformers in making distinctions between types of reproducibility/replication problems; rather, it is to say that hanging distinctions on these two terms seems likely to generate more confusion than clarity.

Finally, this analysis illustrates the value of bringing qualitative approaches to bear on reproducibility. While metascientists have had great success in employing quantitative methods to understand reproducibility problems, these approaches are limited in their ability to identify subjective differences in the meaning individuals give to different terms or events, or to explain why scientists act (or fail to act) in particular ways. Hand-coding is necessary to overcome problems as simple as different words and terms expressing similar ideas (or conversely, the same word or terms expressing different ideas). But the data set produced from this analysis, with high-quality manual annotations for content and themes, will also support research and benchmarking in areas like topic modeling.

Many commentators argue that reproducibility is a social problem that will require changes to the culture of science [70], and yet methodologies designed for studying cultural variation and change—participant observation, ethnography, cross-cultural comparisons, and qualitative data analysis—are only rarely employed in metascientific or reproducibility-oriented research. Achieving lasting change in scientific cultures will first require a more systematic understanding of variation in how scientists interpret reproducibility problems in order to create “culturally competent” interventions.

Supporting information

S1 Checklist. SRQR checklist.

(PDF)

S1 Table. Inter-rater reliability Kappa scores for all themes coded.

(PDF)

S2 Table. Percentage share of the mean article profile, coordinates, contribution, and cos2 for all themes included in the correspondence analysis.

(PDF)

S1 Appendix. Code book for reproducibility data set.

(PDF)

Acknowledgments

We thank Ashley Wong for her assistance with article collection, Harald Kliems for his assistance with producing the figures, and Megan Moreno for providing advice about inter-rater reliability issues. We also thank the Department of Medical History and Bioethics at the University of Wisconsin—Madison for providing publication fee assistance.

Data Availability

Complete bibliographic information for the data set is available at: https://www.zotero.org/groups/2532824/reproducibility-ca. The data files exported from NVivo and the code book (which provides the definition for each thematic code and illustrative examples) are available at: https://github.com/nicole-c-nelson/reproducibility-CA. The NVivo file containing the coded articles is available on request. All code used for analysis is available at: https://github.com/nicole-c-nelson/reproducibility-CA.

Funding Statement

NCN received financial support from the Radcliffe Institute for Advanced Study in the form of a Residential Fellowship (no grant/award number, https://www.radcliffe.harvard.edu/fellowship-program), and KI and JC received financial support through the Radcliffe Research Partnership Program at the Radcliffe Institute for Advanced Study (no grant/award number, https://www.radcliffe.harvard.edu/fellowship-program/radcliffe-research-partnership). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Decision Letter 0

Sergi Lozano

19 Nov 2020

PONE-D-20-26565

Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis

PLOS ONE

Dear Dr. Nelson,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

As you can see below, the reviewers found the topic under study very interesting. Nevertheless, they all raised concerns about methodological choices made and their justification in the text. Considering PLOS ONE's publication criterion #3 (https://journals.plos.org/plosone/s/criteria-for-publication#loc-3), these comments must be properly addressed when revising the manuscript. Please notice that, when presenting mixed-methods studies, we recommend that authors use the COREQ checklist, or other relevant checklists listed by the Equator Network, such as the SRQR, to ensure complete reporting (http://journals.plos.org/plosone/s/submission-guidelines#loc-qualitative-research).

Moreover, Reviewer 3 made some comments on the results (more specifically, about the interpretation of Dimensions 1 & 2). This should also be addressed, as the journal's publication criterion #4 reads "Conclusions are presented in an appropriate fashion and are supported by the data."

Please submit your revised manuscript by Jan 03 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Sergi Lozano

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. We noted in your submission details that a portion of your manuscript may have been presented or published elsewhere:

'One related manuscript is under consideration for a special issue of the journal _Review of General Psychology_, to be published in 2021. The manuscript under review is based on the same data set but analyses this data using a different technique (subset correspondence analysis), answers a different question (how have reproducibility conversations changed over time?), and is intended for a different audience (historians and philosophers of psychology).'

Please clarify whether this publication was peer-reviewed and formally published.

If this work was previously peer-reviewed and published, in the cover letter please provide the reason that this work does not constitute dual publication and should be included in the current manuscript.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In “Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis,” the authors analyze a large set of writings about reproducibility in biomedicine and psychology. They sample writings on this topic from a broad set of texts, both scientific and popular. They then manually code the content and themes of each article and report the dispersion of content across the most important dimensions. If articles are thematically similar they group together. While the authors initially believed that the discourse of reproducibility is sprawling and badly coordinated, the results of the paper seem to indicate the opposite. Indeed, most articles are thematically close, even though the study includes research from both social and natural sciences.

I thank the authors for a very interesting read. Having worked with reproducibility and open science for a long time, but never with qualitative methods I see the submitted work as part of a very welcome and important methodological development of research into open and reproducible science. While I lack deep knowledge of the methods used, it seems to me the study fulfils the requirements for publication in PLOS One. I have a number of comments mostly related to the structure of the paper and some suggestions for additional analyses which I believe would make the results and conclusions more clear.

Since I am not experienced in the methodology I will in this report focus on the other aspects of the paper. My main comment relates to the motivation and current structure of the paper. The authors could do a better job at outlining exactly what it is they do and what their results are. Currently, the connection between the motivation and the results that are being presented is weak. The abstract and introduction give the impression that the main contribution of the paper is the mapping of topics, but at least after the fact most topics seem obvious. Instead, the discussion section, which I found very interesting, seems to focus on the finding that despite sampling to maximize heterogeneity, most papers group closely together. This seems to me like the most important finding of the paper and is something I think the authors should state much more clearly in the abstract and introduction. But the importance of these results completely depends on what one would expect. The authors claim in the introduction that the discourse is ill-structured, but this is not motivated by any previous research or even quotations. It seems important to know why the authors thought so at the outset, and especially how such dispersion would look in terms of results. As a reader, I have little ability to benchmark the results. How much dispersion would we want to see to say that the discussion is ill-structured? An analysis with papers in the same field but on different topics could help here, but is of course a considerable endeavor.

I am uncertain why the authors decided to focus on biomedicine and psychology. Is it specifically because the fields are different? It becomes a bit strange when they write that themes around “reagents” and Bayesian statistics are outside of the core of the discussion. Indeed, these topics are mechanically distinct since they are field-specific. But they could be (and likely are) at the core of the discussion within each field. It is not obvious to me that we should expect, or even want, a complete coordination of discourse between fields as different as biomedicine and psychology. Perhaps it would be worthwhile to run some analysis on the topics separately, or to compare psychology to the discussion in other social sciences. This could also be useful as a benchmark.

I wonder if it could be that the authors have in fact identified a distinct discipline: that papers group so closely on this topic across two such different fields as biomedicine and psychology because they are more connected across fields than within their own respective fields. It would be really interesting if the authors could discuss this and perhaps show some results. It could very well be a problem for the reproducibility movement if it is too closed off from its different fields. That everyone who is working with reproducibility speaks the same language is good, but not if it means that they cannot communicate with the rest of the researchers in their own fields.

Short comments:

As a paper using mixed methods and about scientific discourse, I miss quotations and examples from the included texts. How different can two very different papers be? And what are examples of surprisingly similar articles?

The authors sample to maximize heterogeneity in an attempt to “stack the cards against them”. Finding that articles are closely grouped together thematically is more surprising the more heterogeneous the sample is. This is great but not explained clearly enough!

Reviewer #2: This paper makes use of some of the methodological tendencies of the metascience movement-- namely, statistical analyses of variables derived from word searches of an arbitrarily selected body of texts-- in order to purportedly make a contribution to the growing literature on the problems of reproducibility/replication in modern science. Much of my reaction is grounded in an attempt to understand the larger motivations of the metascience movement growing out of science studies; and here I would encourage the authors to consult the paper “metascience as a scientific social movement” by David Peterson and Aaron Panofsky at UCLA. This will provide some background for my current comments.

There is a long history of work done on the relative scarcity of replication and its variants in science, especially dating back to the work of Harry Collins in the 1980s. This literature explained why replication was difficult, and how scientists dealt with this obstacle in their own research. The problems were there approached both as a matter of practical methodology and of profound philosophical complications, particularly as science studies held as a tenet that there did not exist any unique transdisciplinary ‘scientific method’. Starting after the turn of the millennium, there grew up a movement that ignored all that previous work, but began positing the existence of a putative distinct contemporary crisis of reproducibility (particularly in biomedicine and psychology) that could be rectified by application of the monolithic “scientific method” to the problem; dubbed henceforth as ‘metascience’. One characteristic attribute of the latter was to become especially focused on techniques of formal inductive inference, and allied dependence upon Big Data and computational considerations. At this point a number of semantic shifts were evoked to differentiate ‘reproducibility’ from replicability, itself a topic strewn with pitfalls. This turn received benediction from the National Academies volume on Reproducibility and Replicability in Science (2019).

There is much about the metascience movement to give one pause; Peterson and Panofsky do a good job of raising the issues. Yet, whatever one’s attitudes towards metascience as such, the authors of the current paper push the practices one step further, resulting in an exercise that lacks credibility. They in effect propose a meta-metascience exercise, where they code the appearance of certain words in a set of articles appearing in arbitrarily chosen databases (described on pp.5-6) that they have decided bear upon discussions of problems of reproducibility, and then subject them to statistical analysis of variance, all with the avowed purpose of clarification of the state of discussion of replicability within science. This leads the tendencies of metascience to bedlam, since statistical/computational exercises on their own applied to methodological/philosophical discussions of the nature of science (as opposed to actual empirical scientific research) can do almost nothing to clarify what are inherently abstract disputes over the nature of science and the conditions of its success or failure.

Since there is a looming danger of misunderstanding, let me rephrase this objection. Statistical analysis of variance of what an arbitrary crowd of authors say about replicability cannot serve to clarify what are essentially sociological and philosophical problems of research behavior and organization. (This is doubly compromised by the authors renouncing any interest in the one quantitative aspect of their coding exercise, insisting they are unconcerned about ‘how much’ their authors talked about reproducibility on p.8.) The very notion that there is an abstract ‘scientific method’ that can be applied to long-standing disputes over the nature and conduct of scientific research is itself a category mistake, although it is a hallmark of the tendency of metascience to instrumentalize what are rather deep-seated organizational and conceptual problems besetting modern scientific research. Scientists impatient with philosophical discussion of replication and the social organization of science will simply repeat the errors of their forebears.

The essential irrelevance of the authors’ statistical exercise is itself illustrated by the relatively jejune and pedestrian ‘findings’ they claim in this paper: that some papers focus on individual difficulties of replication such as lack of transparency and variance of inputs to the experimental apparatus (all long long ago covered in Collins), or that “there are no strong difference in how scientists and journalists write about the reproducibility crisis.” These could readily be developed and recounted in any competent survey article, without all the superfluous rigmarole of ‘correspondence analysis’ and the like. The only reason for this mountain to bring forth such an unprepossessing mouse is that the paper wants to assume the trappings of ‘scientific quantitative analysis’ when that recourse is lacking in any real substance or intrinsic motivation. Treating statistics as a self-sufficient ‘method’ is how we got to the modern impasse in the first place.

One might conclude by suggesting this complaint is sometimes broached by science studies scholars concerning the metascience movement in general; I merely mention this to suggest this review is written from within a particular perspective, which is not represented in any of the citations to this article.

For these reasons, I cannot support the publication of this article.

Reviewer #3: Referee report on PONE-D-20-26565

Mapping the discursive dimensions of the reproducibility crisis:

A mixed methods analysis

Using a mixed methods approach, the authors identify the discursive dimensions in the discussion about reproducibility in science. In particular, the authors conduct correspondence and multi-factor analyses on 350 articles, analyzed for 30 themes, to address the question of whether discussions about the reproducibility crisis systematically differ along several dimensions like the type of authors (scientists vs. journalists) and the type of the intended audience (popular vs. scientific outlets). The paper identifies the incentive structure of science, transparency of methods, and the academic publishing culture as the main themes of discussions about reproducibility. The authors report systematic differences in the discursive structure depending on whether the article addresses a popular or scientific audience, but relatively few differences in how scientists and journalists write about reproducibility issues. Despite the scale and scope of the crisis, the results suggest that discussions about reproducibility follow a relatively clear underlying structure.

First, I want to admit that I am not at all an expert on the analysis methods applied by the authors. Apparently, this should be considered when reading my comments on the manuscript. As I would expect that a large share of the potential audience of the paper are not too familiar with details of the methods either, I hope that my comments help the authors to revise the paper in such a way that it is more accessible to readers.

Second, I want to emphasize that I very much sympathize with the research question addressed and that I acknowledge the contribution to the literature.

Overall, the paper is well written, convincingly motivated, and clearly structured. I think the paper is novel in terms of the research question addressed and might be of interest to a broad audience. However, the description of the methods is a bit sparse in some aspects and lacks arguments for particular design choices (see below for details and examples). Moreover, for an audience who is not familiar with the methods used (as I am), some claims and interpretations do not seem to follow “naturally” from the results presented in the paper. Below, I outline my main concerns in some detail and hope that my comments will help to improve the paper further.

Comments on methods and materials:

Data collection:

The authors collected articles from various databases using combinations of different search strings. The results were reviewed by the authors and only “articles relevant to reproducibility” (p. 5) were included in the sample. “To further maximize heterogeneity,” the authors handpicked additional articles that were rare in the article type. This process apparently involves ample degrees of freedom. The authors address neither how they came up with the selection of the databases nor how they generated the list of search strings.

Skimming the bibliographic information for the data set provided via Zotero further raises some questions. First, some key contributions to the discussion on replicability are not included in the sample (e.g., the Many-Labs studies, etc.). In more general terms, I wonder how the list of search strings used translates into the rather small sample of only 350 articles. Given the scale and scope of the discussion across the social and natural sciences in the scientific literature and the popular press, I would have expected a much larger number of articles. Second, and potentially more relevant to the methodology, several authors appear repeatedly (e.g., Aschenwald, Baker, Engber, Gelman, Ioannidis, Lowe, Nosek, Yong, etc.), and articles in the sample are not balanced across relevant dimensions of the analysis (type of authors, type of audience, year; Figs. 1 and 2). How does this fit the intent to maximize variability in the sample?

While I understand that jointly maximizing variation across various dimensions is a non-trivial task, I wonder according to which criteria the sample was actually constructed. Given the rather vague description of the sampling strategy and the various degrees of freedom in the data collection, to me the final sample appears as being generated by a black box. Providing the interested reader with a more thorough description of the data collection process (perhaps in a detailed section in the appendix) could considerably improve the paper.

Coding Scheme:

Related to the previous comment, I consider it difficult for readers to get a clear idea of how the authors derived the thematic codes for their analysis. The authors’ approach involves a seemingly infinite number of degrees of freedom. For instance, the authors counted mentions of particular strings (e.g., “Gelman,” “psychology,” etc.; p. 7), but it is not at all clear what the rationale is for counting exactly these words.

Another issue that puzzles me with respect to the coding is that several of the codes ended up with very low average interrater reliability scores. Terms that appear highly relevant to the analysis – such as “Replication,” “Failures to replicate,” “Effect size,” or “Selective reporting” – have average Kappa values way below 0.6. Moreover, several codes that I would consider important (e.g., “harking”/”forking,” “open science,” “publication bias,” “researcher degrees of freedom,” “transparency,” etc.) do not appear on the list at all. I think the methods section, overall, could be improved by describing the derivation of the coding scheme more thoroughly (potentially in a separate section in an annex to the paper).

Overall, given the topic addressed and the concerns raised in the comments above, it is a pity that the methodology used in the paper was not pre-registered. Yet, I commend the authors for transparently sharing their data and analysis scripts.

Comments on the results:

Overall, I commend the authors for presenting the results in a way that is accessible even to readers like me, i.e., someone without detailed knowledge about the methods used. The presentation and discussion of the results are easy to follow and present novel insights into the research question. The figures neatly depict the main findings and the figure captions are clear and self-contained (small exceptions are Figs. 5 and 6, where the labels of the dimensions are missing).

Indeed, from my point of view, there is quite little to criticize about the results section. A minor point is that Dimension 1 is initially referred to as separating articles focused on bench work from studies on statistical methods, whereas Dimension 2 is defined as separating articles focusing on technical issues from articles focusing on the stakes of the crisis. In the following, however, the authors state that “Dimension 1 can be interpreted more broadly as representing disciplinary differences between biomedicine and psychology/statistics, and Dimension 2 can be interpreted as representing differences in the intended audience of the article.” While I agree that this broader definition seems to be supported, I do not see that this interpretation clearly follows from the analysis. Other minor points that puzzle me – such as the choice of themes and supplementary variables, for instance – are likely due to the issues addressed in my comments on the methodology.

One more minor point: The discussion of Fig. 6 only refers to Dimension 2 but does not address Dimension 1. Although the pattern of clusters along Dimension 1 appears to be similar to the pattern identified in Fig. 5, it might be worthwhile to address the variation along Dimension 1 for the intended audience too (for the sake of completeness).

However, given the concerns discussed in my first comments on the degrees of freedom in the data collection and data analysis stage, a thought that comes immediately to my mind is how robust the results are to different samples and/or subsamples. Personally, I think that a robustness analysis (e.g., on bootstrapped subsamples) could be a worthwhile exercise to rule out that the main findings are driven by the sampling strategy, for instance. Another related issue is the question of how generalizable the results are. Given the sampling strategy aiming at maximum heterogeneity in the sample, the articles are likely not representative of the population. The discussion section, however, tends to generalize the patterns identified in the sample – justifiably so?

An aspect not addressed by the analysis is the time dimension. It would be interesting to see whether differences in the words and terms used in articles along the various dimensions analyzed are correlated with the year in which the article appeared. Put differently: did the patterns of discursive dimensions systematically change over time? I think this question arises quite naturally and addressing it could add an interesting result to the paper.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 9;16(7):e0254090. doi: 10.1371/journal.pone.0254090.r002

Author response to Decision Letter 0


17 May 2021

Editor:

1. Complete COREQ or equivalent reporting checklist. A completed SRQR checklist has been submitted with the revised manuscript. The COREQ checklist is formatted for interview-based qualitative studies, and so we opted to use the SRQR checklist instead.

2. Address formatting issues. We have re-generated the bibliography in PLOS ONE format.

3. Explain why some data is not publicly available. We have made the data and code needed to reproduce the analyses in the manuscript available on GitHub. This includes the outputs of the qualitative data analysis performed using NVivo, in spreadsheet format. We have not included the original NVivo file since this file contains the full text pdfs of the articles we analyzed and making this publicly available may constitute a copyright violation.

4. Clarify the status of the Review of General Psychology manuscript. The related manuscript has been accepted with minor revisions at RGP but has not yet been published. We have added a citation to this forthcoming paper in the revised text. As stated previously in the submission letter, the RGP manuscript is based on the same data set but analyses this data using a different technique (subset correspondence analysis), answers a different research question (how have reproducibility conversations changed over time?), and is intended for a different audience (historians and philosophers of psychology). It does not constitute dual publication because very little text is shared between the two articles, apart from the methodological description of the sample collection and coding process.

Reviewer One:

1. The initial assumption that reproducibility conversations are ill-structured is not supported by references to existing research. We have added multiple references to the introduction to better support this claim.

2. It is difficult for readers unfamiliar with CA to “benchmark” the results. The point is well-taken that the correspondence analysis is less meaningful without some background expectation of what the CA would look like under greater heterogeneity. It is difficult to provide quantitative benchmarks since the numeric scales produced through CA cannot be meaningfully compared across data sets, but we have added additional text at the beginning of the results section that describes qualitatively what our results may have looked like if reproducibility discourse were more heterogeneous.

3. The rationale for focusing on biomedicine and psychology is not clear. We have added additional text to the Materials and Methods/Data collection section to explain that we selected biomedicine and psychology because reproducibility issues in these fields have been discussed extensively both in the scientific community and the popular press.

4. The findings may suggest that reproducibility research might constitute a distinct discipline of its own. This is an interesting suggestion, which we now take up in the revised conclusion. We now argue that the presence of a shared discourse may be partially, but not fully, explained by the emergence of metascience as a discipline, since the shared discourse extends beyond the community of authors who self-identify as metascientists.

5. Include quotations and examples from the texts analyzed. We appreciate this invitation to expand the presentation of our results in this direction and have included more representative quotes throughout the results section.

6. Provide a clearer explanation of how the sampling strategy employed is useful for addressing the research question posed. We appreciate the feedback on our description of the sampling strategy. We have added additional text describing why purposive sampling strategies are common in qualitative research, and why maximum variation sampling is especially appropriate for characterizing the range and variability of reproducibility narratives.

Reviewer Two:

1. The manuscript conducts statistical analyses of variables derived from word searches of an arbitrarily selected body of texts. We regret that we failed to convey to this reviewer key pieces of our methodological approach. First, the frequencies are not based on word searches, but on a qualitative coding scheme that we have now described in greater detail, both in the Methods section and in the S1 Appendix. Second, the texts are not arbitrarily selected, but have been selected through maximum variation sampling. We have expanded our description of this sampling strategy, as described above in R1.6.

2. The reviewer argues that “statistical/computational exercises on their own applied to methodological/philosophical discussions of the nature of science... can do almost nothing to clarify what are inherently abstract disputes over the nature of science.” We cannot answer the question of whether computational techniques alone (e.g., topic modeling/natural language processing) can be useful in clarifying reproducibility discussions, since this is not the approach that we employed here. Our mixed methods approach uses qualitative techniques that are appropriate for parsing the different positions taken by actors within a social world, in combination with quantitative approaches to visualize the results of that analysis. We suspect that this concern is rooted in the misunderstandings identified in R2.1 above.

3. The decision to normalize the counts of coded text using the total amount of coded text (rather than using raw counts or normalizing the counts against the total length of the text) is inappropriate. We have added additional text to explain why this choice is appropriate. As explained in the Abdi and Williams (2010) article that we cite, using PCA in our case would fail to reveal meaningful similarities in style between authors because of the differences in the length and format of the texts. A minimal numerical illustration of this row-profile normalization is sketched below, after the responses to Reviewer Two.

4. The findings presented in the manuscript could be derived through alternative methods such as systematic review, and the use of correspondence analysis is “lacking in any real substance or intrinsic motivation.” We agree that it may be possible to arrive at similar results using alternative methods, but it does not follow that there is no value to the methods we use here. Given the large number of texts, themes, and metadata variables we analyzed, exploratory data analysis techniques are helpful for summarizing and visualizing the main features of the data set. We believe that the results of our analysis are more succinctly presented through visualizations in combination with narrative than they would be through a narrative format alone.
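As a minimal numerical sketch of the normalization discussed in point 3 above (made-up counts, not the study's data or analysis code), correspondence analysis compares articles by the share of their coded text devoted to each theme rather than by raw counts, so a long article and a short article with the same thematic mix receive identical profiles, whereas raw counts, or a PCA on them, would separate the two mainly by length:

```python
import numpy as np

# Made-up document-by-theme counts (3 articles x 4 themes); values stand in for
# the amount of coded text assigned to each theme and are illustrative only.
counts = np.array([
    [120.0, 30.0, 10.0, 40.0],  # long article
    [ 12.0,  3.0,  1.0,  4.0],  # short article with the same thematic mix
    [  5.0, 50.0, 60.0,  5.0],  # article with a different thematic mix
])

# Row profiles: divide each article's counts by its total coded text.
# The first two rows become identical despite a tenfold difference in length.
profiles = counts / counts.sum(axis=1, keepdims=True)
print(np.round(profiles, 3))
```

Normalizing against the total amount of coded text in this way keeps the comparison focused on how an article distributes its coded material across themes, rather than on how much coded material it contains.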

Reviewer Three:

1. The manuscript needs a more thorough description of the article selection method, including rationales for the selection of databases, selection of search strings, and how maximizing variation along some dimensions (e.g. year of publication) may reduce variation in others (e.g. the appearance of multiple articles by the same authors in the data set). We have added more detail on the rationales for selecting the fields studied, databases employed, and search strings used to the methods section (see also R1.3). We have also added a note to the methods section acknowledging that maximizing variation in some areas may have reduced variation in others.

2. The number of articles is smaller than expected given the scale and scope of reproducibility discussions. We have added a citation to Fanelli (2018), which describes a bibliographic data set generated using search terms that are comparable to ours, and which identifies only 99 publications related to the reproducibility crisis.

3. The process for deriving the thematic codes is not clear and involves many “degrees of freedom.” Some expected codes (e.g., “harking”/”forking,” “open science”) do not appear on the list. We have added additional text to the methods section to describe the process of code development through grounded theory methodology, including the important distinction between the “open coding” phase where codes are generated, and the “focused coding” phase where established codes are applied to the body of texts as a whole. We hope that this alleviates some of the concerns about researcher degrees of freedom, although we also note in the revised methods text that some of these concerns may arise from differences inherent to quantitative and qualitative research paradigms. We have also incorporated the code book (originally provided as a standalone document on the GitHub repository) as a new appendix to the paper (S1 Appendix). The code book describes in detail what was included under each code, which will help readers identify how themes of interest were parsed according to our scheme (e.g., discussions of open science would be classified under “transparency of data and methods” in our analysis).

4. The rationale for performing the text search queries is not clear. We have added additional text to the Materials and Methods section to clarify the purpose of creating the word frequency variables, which serve as a check on the qualitative analysis and to enhance the interpretation of the CA dimensions.

5. The Kappa values of many thematic codes appear unexpectedly low and raise concerns about the exclusion of these codes from the analysis. Low Kappa scores are very common in qualitative research, particularly as codes increase in conceptual complexity and for codes that appear at low frequency in the corpus. We have added additional text in the Materials and Methods/Qualitative data analysis section to clarify why it is difficult to achieve strong Kappa scores. We have also added text to the discussion section to explain that the CA mapping presented here should not be interpreted as a definitive mapping, since the configuration of the map would shift depending on the codes included (or if a different coding scheme altogether were used). A small worked example of how a low-frequency code can depress Kappa is sketched below, after the responses to Reviewer Three.

6. The methodology was not pre-registered. We agree that this is a limitation. At the time we began this study, we were not aware of any options for pre-registration that were tailored to qualitative research. The OSF, for example, released its first template for pre-registration of qualitative research in September 2018, very shortly after we began our study. To compensate for this limitation, we have attempted to be as transparent as possible in reporting our methods and data.

7. The rationale for the expanded interpretation of Dimensions 1 and 2 is not clear. We are grateful for the encouragement to revisit our interpretations here because in rechecking our analysis we discovered an error in the r value originally reported for the “psychology” word frequency variable, which suggests that our original interpretation that Dimension 1 was also capturing differences in discipline was not as sound as we had initially thought. Accordingly, we have revised the manuscript so that Dim 1 is presented as separating bench vs. statistical techniques and Dim 2 is interpreted as separating social vs. technical issues. We still discuss the relationship of these two dimensions to discipline and audience, but are now more careful in our presentation of the potential weaknesses of these interpretations.

8. The dimensions have not been labeled in Figures 5 and 6. A discussion of Dimension 1 in Figure 6 should be included for completeness. We have corrected these omissions.

9. A robustness/bootstrap analysis would provide insight into how the sampling method has impacted the results. We thank the reviewer for this generative suggestion. We have implemented a bootstrap analysis by using the same multiple factor analysis design used to analyze articles by author and by audience. This analysis does demonstrate the impact of the sampling strategy on the results, but it has also allowed us to identify additional clusters of themes at the core of the conversation that can be differentiated with confidence. The results of this new analysis are presented in Fig 5 of the revised text.

10. It is unclear whether the results of the analysis can be generalized to the population as a whole. We have added additional description about the sampling strategy (see also R1.6) as well as a general caveat to the beginning of the methods sections about what kinds of conclusions can be generalized to the population as a whole, and what kinds of conclusions cannot be drawn from our analysis. We have also added several caveats to the discussion section to remind readers of these limitations.

11. The manuscript could explore how the discursive dimensions might have changed over time. We address this question in a separate manuscript: the forthcoming Review of General Psychology manuscript described in point Editor.4 above.
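On the point about Kappa and low-frequency codes (response 5 above), a small worked example with hypothetical ratings, not drawn from the study's data, illustrates the arithmetic: when a code applies to only a handful of segments, chance agreement is already very high, so even near-perfect raw agreement between two coders can translate into a Kappa below the conventional 0.6 threshold.

```python
import numpy as np

# Hypothetical ratings (not the study's data): two coders rate 100 text
# segments for a theme that is present in only a few of them.
rater1 = np.zeros(100, dtype=int)
rater2 = np.zeros(100, dtype=int)
rater1[[0, 1, 2, 3, 4]] = 1      # coder 1 applies the code to 5 segments
rater2[[0, 1, 2, 5, 6]] = 1      # coder 2 applies it to 5 segments, 3 in common

p_observed = np.mean(rater1 == rater2)            # raw agreement: 0.96
p1, p2 = rater1.mean(), rater2.mean()             # each coder applies the code 5% of the time
p_chance = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement: 0.905
kappa = (p_observed - p_chance) / (1 - p_chance)  # Cohen's kappa: ~0.58

print(f"raw agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")
```

Here the two hypothetical coders disagree on only 4 of 100 segments, yet Kappa is roughly 0.58 because about 90.5% agreement would already be expected by chance alone.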

Attachment

Submitted filename: PLOS ONE response to reviewers.docx

Decision Letter 1

Sergi Lozano

21 Jun 2021

Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis

PONE-D-20-26565R1

Dear Dr. Nelson,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sergi Lozano

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I thank the authors for addressing my thoughts and comments and for the interesting read. I am happy to recommend the paper for publication.

Reviewer #3: All comments have been properly addressed and the manuscript has been considerably improved. Particularly, the description of the methodology is way clearer now.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Acceptance letter

Sergi Lozano

1 Jul 2021

PONE-D-20-26565R1

Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis

Dear Dr. Nelson:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sergi Lozano

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Checklist. SRQR checklist.

    (PDF)

    S1 Table. Inter-rater reliability Kappa scores for all themes coded.

    (PDF)

    S2 Table. Percentage share of the mean article profile, coordinates, contribution, and cos2 for all themes included in the correspondence analysis.

    (PDF)

    S1 Appendix. Code book for reproducibility data set.

    (PDF)

    Attachment

    Submitted filename: PLOS ONE response to reviewers.docx

    Data Availability Statement

    Complete bibliographic information for the data set is available at: https://www.zotero.org/groups/2532824/reproducibility-ca. The data files exported from NVivo and the code book (which provides the definition for each thematic code and illustrative examples) are available at: https://github.com/nicole-c-nelson/reproducibility-CA. The NVivo file containing the coded articles is available on request. All code used for analysis is available at: https://github.com/nicole-c-nelson/reproducibility-CA.

