. 2023 Jan 26;18(1):e0274429. doi: 10.1371/journal.pone.0274429

Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process

Hannah Fraser 1,, Martin Bush 1,‡,*, Bonnie C Wintle 1,2, Fallon Mody 1, Eden T Smith 1, Anca M Hanea 1,3, Elliot Gould 1,2,3, Victoria Hemming 1,4, Daniel G Hamilton 1, Libby Rumpff 1,2, David P Wilkinson 1,2, Ross Pearson 1, Felix Singleton Thorn 1, Raquel Ashton 1, Aaron Willcox 1, Charles T Gray 1,5, Andrew Head 1, Melissa Ross 1, Rebecca Groenewegen 1,2, Alexandru Marcoci 6, Ans Vercammen 7,8, Timothy H Parker 9, Rink Hoekstra 10, Shinichi Nakagawa 11, David R Mandel 12, Don van Ravenzwaaij 10, Marissa McBride 7, Richard O Sinnott 13, Peter Vesk 1,2, Mark Burgman 7, Fiona Fidler 1
Editor: Ferrán Catalá-López
PMCID: PMC9879480  PMID: 36701303

Abstract

As replications of individual studies are resource intensive, techniques for predicting replicability are required. We introduce the repliCATS (Collaborative Assessments for Trustworthy Science) process, a new method for eliciting expert predictions about the replicability of research. The process is a structured expert elicitation approach based on a modified Delphi technique, applied to the evaluation of research claims in the social and behavioural sciences. The utility of processes that predict replicability lies in their capacity to assess scientific claims without the costs of full replication. Experimental data support the validity of this process: a validation study produced a classification accuracy of 84% and an Area Under the Curve of 0.94, meeting or exceeding the accuracy of other techniques used to predict replicability. The repliCATS process provides other benefits. It is highly scalable, and can be deployed both for rapid assessment of small numbers of claims and for assessment of high volumes of claims over an extended period through an online elicitation platform, having been used to assess 3000 research claims over an 18-month period. It can be implemented in a range of ways, and we describe one such implementation. An important advantage of the repliCATS process is that it collects qualitative data with the potential to provide insight into the limits of generalizability of scientific claims. The primary limitation of the repliCATS process is its reliance on human-derived predictions, with consequent costs in terms of participant fatigue, although careful design can minimise these costs. The repliCATS process has potential applications in alternative peer review and in the allocation of effort for replication studies.

1. Introduction

Scientific claims should be held to a strong standard. The strongest claims will fulfil multiple criteria. They will be reliable and reproducible, in that multiple observers examining the same data should agree on the facts and the results of analyses; they will be replicable, in that repetitions of the methods and procedures should produce the same facts and results; and they will be generalizable, in that the claims should extend beyond a single dataset or process. Evaluating the strength of individual scientific claims on the basis of any one of these criteria, however, is not straightforward, let alone all of them.

The focus of the work underlying this paper is on assessing the replicability of published research claims. In assessing the strengths of scientific claims from different experimental and/or disciplinary contexts, the criteria listed above will be of varying importance; nevertheless, replication studies are clearly a key technique for assessing certain kinds of research and the resultant literature. Such studies thereby contribute to the progressive development of robust knowledge with respect to both inferences from empirical data and deductions made from empirical evidence; policy-making and public trust both draw on such developments. Notably, several large-scale replication projects in the social sciences and other disciplines have shown that many published claims do not replicate [1–4]. These failures to replicate may indicate false positives in the original study, or they may have other explanations, such as differences in statistical power between the original and the replication, or unknown moderators in either.

This paper outlines an expert elicitation protocol designed to evaluate the replicability of a large volume of claims across the social and behavioural sciences. We describe the utility, accuracy and scalability of this method, how it has been validated, and discuss how this technique has the potential to provide insight into aspects of the credibility of scientific claims beyond replicability, such as the generalizability of claims.

The elicitation procedure described in this paper is a modified Delphi process called the IDEA protocol [5, 6]. It involves experts working in small groups to provide quantitative predictions of the probability of successful replication, together with their associated (qualitative) reasoning. The qualitative data provides information about participants’ judgements, including about credibility signals of claims beyond replicability.

The IDEA protocol has been applied to assessing the replicability of research claims in the social and behavioural sciences by the repliCATS (Collaborative Assessments for Trustworthy Science) project, as a component of the Systematizing Confidence in Open Research and Evidence (SCORE) program funded by the US Defense Advanced Research Projects Agency. The overall goal of the SCORE program is to create automated tools for forecasting the replicability of research claims made within the social and behavioural science literature [7]. In Phase 1 of the SCORE program, we elicited expert assessments of replicability for 3000 research claims. These assessments formed a benchmark against which the performance of automated tools developed by other teams within the SCORE program could be compared. In Phase 2 of the SCORE program, we elicited assessments of replicability for a further 900 research claims, while expanding the elicitation to include additional credibility signals for 200 papers.

The repliCATS project introduces a new approach to predicting the replicability of research claims. It is neither a prediction market nor a simple one-off survey. The repliCATS approach involves four steps, represented in the acronym IDEA: ‘Investigate’, ‘Discuss’, ‘Estimate’ and ‘Aggregate’ (Fig 1). Each individual is provided with a scientific claim and the original research paper to read, and provides an estimate of whether or not the claim will replicate (Investigate). They then see the group’s judgements and reasoning, and can interrogate these (Discuss). Following this, each individual provides a second private assessment (Estimate). A mathematical aggregation of the individual estimates is taken as the final assessment (Aggregate).

Fig 1. Overview of the IDEA protocol, as adopted in the repliCATS project.


2. Utility, accuracy and validity, scalability, and insight

2.1 Utility

Whilst replication studies have significant epistemic value, they are expensive, labour intensive, face logistical barriers, and offer few career incentives. In other cases, they are simply not feasible at all: national censuses, for example, cannot feasibly be re-run, and historical circumstances cannot be studied again. Even where the contextual factors are more favourable, the expense of full replication studies requires weighing their potential benefits against their costs [8]. Analytical techniques that provide a prognostic assessment of the reliability, replicability and generalizability of research claims without attempting full replications therefore have substantial utility, because they can provide similar benefits at much lower cost. They can also, in some cases, inform decisions about where to direct scarce resources for full replications [8, 9]. The method described in this paper is one such process, and thereby provides this kind of cost-saving utility.

2.2 Accuracy and validity

To measure the accuracy of the repliCATS process, we compare it with existing techniques for predicting the outcomes of replication studies. The two main techniques that have been used for human-derived predictions are surveys and prediction markets [2, 10, 11]. In the former, experts give independent judgements which are aggregated into quantitative predictions of replicability. In the latter, participants trade contracts that give a small payoff if and only if the study replicates, and contract prices are used to derive the probability of replication. Both of these have typically been implemented alongside large-scale replication projects and used replication outcomes as the ground truth to test prediction accuracy. There are other techniques for predicting replicability, such as machine learning methods [12, 13], but such fully automated techniques are not considered further in this paper.

There are multiple ways of measuring the success of predictions (Table 1). The most straightforward is ‘classification accuracy’. This treats predictions >50% as predictions of replication success and <50% as predictions of replication failure. Classification accuracy is the percentage of predictions that were correct (i.e. on the right side of 50%), excluding predictions of 50%.

Table 1. Summary of different metrics commonly used to evaluate predictions.

Classification Accuracy
Definition: Percentage of assessments (probability estimates) above 50% that successfully replicated, plus those below 50% that did not successfully replicate.
Advantages: Easy to calculate; intuitively understood; comparable across datasets.
Disadvantages: Loses information.

Area Under the Curve (AUC)
Definition: Obtained by plotting the true positive rate against the false positive rate. The best possible predictor, with no false negatives and no false positives, produces a point at (0, 1); a random assessment falls along the diagonal.
Advantages: Does not rely on a single cutoff; considers all possible thresholds.
Disadvantages: Loses information.

Brier Score
Definition: The mean squared difference between an estimated probability (a participant’s best estimate) and the actual outcome.
Advantages: Uses all the information; easily interpretable as a mean square error.
Disadvantages: Not easily comparable across different datasets; only useful for long-term accuracy.

Informativeness
Definition: The average width of the intervals between the upper and lower bounds elicited alongside best estimates.
Advantages: Does not require performance information.
Disadvantages: Not properly operationalised in a (classical) probabilistic framework.
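To make the definitions in Table 1 concrete, the following is a minimal sketch (in Python; illustrative only, not project code) of how these metrics could be computed. The arrays of predicted probabilities, replication outcomes and elicited bounds are hypothetical values introduced purely for the example.

```python
# Minimal sketch of the metrics in Table 1, assuming hypothetical arrays of
# predicted probabilities of replication ('probs'), binary replication
# outcomes ('outcomes'), and elicited lower/upper bounds from a
# three-point elicitation.
import numpy as np
from sklearn.metrics import roc_auc_score

probs = np.array([0.8, 0.3, 0.6, 0.45, 0.9, 0.2])   # illustrative values only
outcomes = np.array([1, 0, 1, 1, 1, 0])              # 1 = replicated, 0 = did not

# Classification accuracy: predictions >50% count as predicted successes,
# <50% as predicted failures; predictions of exactly 50% are excluded.
decided = probs != 0.5
classification_accuracy = np.mean((probs[decided] > 0.5) == outcomes[decided])

# Area Under the Curve: threshold-free measure of discrimination.
auc = roc_auc_score(outcomes, probs)

# Brier score: mean squared difference between estimates and outcomes.
brier = np.mean((probs - outcomes) ** 2)

# Informativeness: average width of the elicited intervals.
lower = np.array([0.6, 0.1, 0.4, 0.3, 0.8, 0.1])
upper = np.array([0.95, 0.5, 0.8, 0.6, 0.99, 0.4])
informativeness = np.mean(upper - lower)

print(classification_accuracy, auc, brier, informativeness)
```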

Existing techniques for predicting replicability perform well. The classification accuracy reported in previous prediction studies has ranged between 61% and 86%. The lower limit was reported by Camerer et al. [1] for both surveys and prediction markets on the replicability of 18 laboratory experiments in economics. The higher limit was reported by Camerer et al. [2] for surveys and prediction markets predicting the replicability of 21 social science experiments published in Nature and Science between 2010 and 2015. In both studies, surveys and prediction markets performed well. Other studies have reported small but noteworthy differences between the two techniques: for the surveys and markets running alongside the Many Labs 2 project [3], prediction markets had a 75% classification accuracy while pre-market surveys reached 67%. There is also evidence that, for some kinds of social science research claims, non-experts can make fairly accurate predictions of replicability [14].

An essential goal of the repliCATS project is to improve upon the already good accuracy of current techniques for predicting replicability. Fine-tuning expert accuracy in forecasting replicability is a non-trivial challenge. The primary ways in which the IDEA protocol is expected to meet the exacting demands of this task are by harnessing the wisdom of the crowd and diverse ideas, by sharing information between participants to resolve misunderstandings and promote counterfactual thinking, and by aggregating the individual judgements mathematically. These features help to improve accuracy and reduce overconfidence by limiting biases such as anchoring, groupthink, and confirmation bias.

The IDEA protocol aims to control groupthink by aggregating group assessments mathematically rather than behaviourally [15–18]. That is, group members are not forced to agree on a single final judgement that reflects the whole group. Mathematical aggregation can be more complex than taking the arithmetic mean. Several aggregation techniques have been pre-registered by repliCATS (https://osf.io/5rj76/). Our mathematical aggregation models are detailed in Hanea et al. [19]; they fall into three main groupings: 1) linear combinations of best estimates, transformed best estimates [20] and distributions [21]; 2) Bayesian approaches, one of which incorporates characteristics of a claim drawn directly from the paper, such as sample size and effect size [22]; and 3) linear combinations of best estimates weighted by potential proxies for good forecasting performance, such as demonstrated breadth of reasoning, engagement in the task, openness to changing opinion, informativeness of judgements, and prior knowledge (inspired by Mellers et al. [23, 24]). Groupthink is also controlled by recruiting groups that are, ideally, as diverse as possible [25].

Work in structured elicitation has described how the sharing of information can improve the accuracy of group judgements [26]. The repliCATS implementation of the IDEA protocol includes this feature, in contrast with survey-based methods of prediction, which generally do not allow information to be shared between participants. The IDEA protocol also reduces overconfidence in individual judgements through three-point elicitation, that is, asking participants to provide lower and upper bounds for their assessment as well as their best estimate [27, 28]. This technique [29, 30] is thought to encourage information sampling [29] and prompt participants to consider counter-arguments [31]. However, see also [32] for counter-evidence regarding this sequencing.
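As an illustration of the simplest aggregation ideas mentioned above, the sketch below shows an unweighted linear pool of best estimates and a mean taken on the logit scale (in the spirit of transformed linear combinations [20]). It is a minimal sketch under those assumptions; the pre-registered repliCATS aggregators described in Hanea et al. [19] are more elaborate, and the group estimates shown are hypothetical.

```python
# Minimal sketch (illustrative only) of two simple aggregation rules:
# an unweighted linear pool and a logit-scale average of best estimates.
import numpy as np

def linear_pool(best_estimates):
    """Arithmetic mean of participants' best estimates (probabilities in (0, 1))."""
    return float(np.mean(best_estimates))

def logit_pool(best_estimates, a=1.0):
    """Average of logit-transformed estimates, mapped back to a probability.
    'a' is an optional extremizing exponent; a = 1 gives a plain logit average."""
    p = np.clip(np.asarray(best_estimates, dtype=float), 1e-6, 1 - 1e-6)
    mean_logit = np.mean(a * np.log(p / (1 - p)))
    return float(1 / (1 + np.exp(-mean_logit)))

# Hypothetical round-2 best estimates from one group of five assessors:
group_estimates = [0.55, 0.70, 0.62, 0.80, 0.45]
print(linear_pool(group_estimates), logit_pool(group_estimates))
```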

To test the quantitative predictions of the replicability of research claims produced by the repliCATS process, we conducted a validation experiment [33] alongside the SIPS conference in 2019. This experiment was conducted independently of the SCORE program. A pre-registration for this experiment can be seen at https://osf.io/e8dh7/. Five groups of five participants separately assessed the replicability of 25 claims drawn from previous replication projects, including projects that had run prediction studies using alternative methods for predicting replicability. A summary of results from this experiment is provided in Fig 2, showing all group assessments for each of the 25 claims. The specific elicitation used appears in the S1 Table. The full anonymized dataset from this validation experiment is available at https://osf.io/hkmv3/.

Fig 2. Summary of results from validation experiment.


Each dot represents a particular group’s aggregated assessment of replicability for each claim, with five groups assessing all 25 claims drawn from previous replication projects. Claims for which the replication study did not produce a statistically significant result are shown in the top pane, and claims for which the replication study did produce a statistically significant result are shown in the bottom pane.

When all judgements from the validation experiment were aggregated, participants using the repliCATS process achieved a classification accuracy of 84% (and an Area Under the Curve of 0.94). Compared with published accuracies for other processes [1–3], this indicates that the repliCATS technique performed comparably to, or better than, existing methods for predicting replicability. We rely on previously published accuracies for this comparison because it was not possible to directly test the performance of these five groups using alternative prediction techniques without duplicating the experiment, and thus the burden on participants.

We note that the results of this validation study should be interpreted with caution. The majority of claims in this study were drawn from psychology papers, and most participants at the SIPS conference were psychology researchers. They may therefore have known more about the signals of replicability for these papers than participants would for a typical claim outside their disciplinary expertise. It was also expected that in some instances participants were familiar with the results of replications for the claims they evaluated, as these results were publicly available prior to SIPS 2019. We conclude that the study validates the use of the repliCATS process, but a more precise measurement of its accuracy will depend on further work.

2.3 Scalability

In addition to accuracy, repliCATS aims to provide a scalable process. The IDEA protocol is typically implemented with groups of between four and seven experts, with perhaps one or two groups per problem, each working under the guidance of a facilitator. This process allows for rapid evaluation of claims, in contrast with prediction markets, which rely on many participants engaging in multiple trades. At the same time, an online platform implementation of the IDEA protocol allowed many experts to address many problems. The technical details of this specific implementation of the repliCATS process are described by Pearson et al. [34]. However, the process described in this paper could also be implemented in other ways.

The primary application of the repliCATS process to date has been through the SCORE program. The scalability of the process is demonstrated by the volume of claims assessed as part of this program: 3000 claim assessments in 18 months in Phase 1 of the SCORE project, which ran from February 2019 to November 2020, and a further 900 claim assessments and 200 paper-level assessments in Phase 2, which ran from December 2020 to May 2022. Alongside this, the online platform developed for this program has been used in a number of experiments and classroom teaching contexts, as briefly described below.

In Phase 1 of the SCORE project, on which we concentrate, the repliCATS project was provided with 3000 social and behavioural science claims for evaluation by another team involved in the SCORE program. This independent team was also responsible for conducting replication studies for a subset of the 3900 claims. The claims were drawn from 62 peer-reviewed journals across eight social and behavioural science disciplines: business research, criminology, economics, education, political science, psychology, and sociology. Full details of these journals are provided in the S2 Table.

The repliCATS project gathered expert predictions of replicability through a combination of face-to-face and asynchronous workshops, as well as fully remote elicitations. One third of the claims, as well as the data for the validation study presented above, were elicited in three large workshops held alongside relevant conferences. Two of the workshops were held alongside the Society for the Improvement of Psychological Science (SIPS) conference in July 2019 (Netherlands) and May 2020 (virtual workshop); the third was held alongside the inaugural Association for Interdisciplinary Metaresearch & Open Science (AIMOS) conference in November 2019 (Melbourne). In workshops, participants were divided into small groups with a facilitator and assessed claims synchronously. Different workshop groups took differing amounts of time for each claim, but on average assessments were completed in under 30 minutes.

2.4 Insight

The repliCATS process is capable of generating valuable qualitative data that provides insight into issues beyond the direct replicability of specific evidentiary claims. Such issues include: identifying the precise areas of concern for the replicability of a given claim; understanding the limits of generalizability, and hence the potential applicability, of a claim for a research end-user; and assessing the quality of the operationalization of a given research study design. For example, a research claim may be highly replicable and yet offer a poorly operationalized test of the target hypothesis [35]. In such a case, knowing only the predicted replicability of the claim says little about the status of the overall hypothesis. Similarly, even if a claim is well-operationalized, understanding how applicable it is for research end-users means understanding its limits of generalizability. This aspect of the repliCATS project, as an implementation of the IDEA protocol, is arguably the most critical point of difference from previous approaches to predicting replicability. Although both surveys and prediction markets could be adapted to collect this kind of data, the advantage of the repliCATS process described here is that such data is produced directly through the elicitation itself, resulting in more straightforward and richer data generation, collection and subsequent analysis.

This elicited data describes participants’ reasoning about the claims being evaluated, typically including justifications for their assessments of the target question about replicability, as well as judgements about the papers’ importance, clarity and logical structure. While such reasoning data has been a part of previous applications of the IDEA protocol, it has not been the focus of previous studies. The repliCATS project thus extends the IDEA protocol through a structured qualitative analysis of a substantial corpus. This data has the potential to address a number of questions that are of interest for replication studies, for research end-users, and more broadly in metaresearch and the philosophy of the sciences. Some of these target problems, such as understanding the quality of operationalization of a given research design or the limits of generalizability of a particular claim, were described earlier in this section. Other issues include the identification of particular strengths or weaknesses in a given research claim, with implications for how the claim might be further investigated. Participants’ justifications might indicate weaknesses with a particular kind of measurement technique, suggesting that further work should focus on that aspect of experimental design. Alternatively, they might indicate a very high level of prior plausibility for the claim, in which case further work on its theoretical background might be more fruitful.

There are many ways to approach such qualitative data. Within the repliCATS project, the qualitative analysis proceeded, briefly, as follows. A subset of the data (776 of the 3000 claims in Phase 1 of the SCORE program, comprising 13901 unique justifications from a total dataset of 46408 justifications) was coded by a team of five analysts against a shared codebook, with each justification coded by at least two coders. Several principles were applied in developing the codebook. Inclusion and exclusion criteria were developed for the codes most relevant to the research questions; in particular, these included both direct markers of replicability (e.g. statistical details) and proxy markers of replicability (e.g. clarity of writing). The total size of the codebook was limited to ensure its usability. The codebook was developed iteratively, with discussion among coders between each round of development, providing a form of contextualised content analysis. In addition, the inter-coder reliability (ICR) for each code was assessed at various points in this process. The ICR results were used to review the convergence of textual interpretation between analysts, and, after coding, mixed-method aggregation techniques utilised only those codes that had met a pre-registered ICR threshold.
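The following is a minimal sketch of an inter-coder-reliability check of the kind described above: Cohen’s kappa between two coders who have each marked whether a code applies to the same set of justifications. The code ratings and the acceptance threshold are hypothetical illustrations, not the project’s pre-registered values or analysis code.

```python
# Minimal sketch (illustrative only): Cohen's kappa between two coders on a
# single binary code applied to the same set of justifications.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = code applied, 0 = not applied
coder_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(coder_a, coder_b)
THRESHOLD = 0.7  # hypothetical acceptance level for retaining a code
print(f"kappa = {kappa:.2f}; code retained for aggregation: {kappa >= THRESHOLD}")
```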

3. Current implementations

The repliCATS process described here could be implemented in several ways, including a simple pen-and-paper version. We have developed a cloud-based, multi-user software platform (‘the repliCATS platform’) that supports both synchronous face-to-face workshops and asynchronous remote group elicitations [34]. A full description of the elicitation used on this platform for the validation experiment appears in the S1 Table. Fig 3 provides a snapshot of the technical operationalization of this elicitation as supported in the repliCATS platform. In particular, Fig 3 shows an example of aggregated group judgements and reasoning from round 1, as shown to participants in round 2 prior to submitting their final judgements and reasoning.

Fig 3. The repliCATS platform.


Anticlockwise from centre top (a) complete layout, plus expanded details of: (b) claim information; (c) and (d) responses to elicitation and aggregated feedback; and (e) participants’ reasoning comments.

This specific repliCATS platform incorporated several features beyond the IDEA protocol that were designed to improve participants’ confidence in completing their assessments, as well as to support or enhance the elicitation process. These features include ready access to general decision support materials and a glossary; tooltips providing additional guidance on answering each question; additional interactive discussion elements, such as the ability to upvote and downvote comments and to hold threaded discussions; and gamification in the form of participant badges. The badges were designed to reward participation as well as behaviours considered beneficial for improving individual and group judgements. For example, participants were awarded badges for seeking out decision support materials, interacting with group members’ reasoning, and consistently submitting their reasoning while evaluating claims.

This implementation was used in Phase 1 of the SCORE program by 759 registered participants. Of these, 550 users assessed one or more of the 3000 claims. All participants explicitly provided consent through the online platform and participant activities were approved by the Human Research Ethics Committee of The University of Melbourne (ID 1853445). However, some participants used multiple identities and not all student participants could be individually identified due to the additional ethics clearance obtained from the University of North Carolina for the teaching use of the repliCATS platform described in section 5.1.

4. Limitations of the repliCATS process

The advantages of the IDEA process come with challenges relating to elicitation, including the recruitment of participants, that are likely to apply to most implementations of the repliCATS process. Other limitations faced by the repliCATS project arose from the specific program context in which this process was developed and implemented. These will be relevant for some but not all implementations of the repliCATS process and are not discussed further, beyond noting that principal amongst them was the inability, given the recruitment burden, to have claims assessed by multiple groups.

4.1 Challenges with elicitation

A problem for any research involving human subjects is elicitation burden—both the quantity and quality of information provided by experts are thought to decline the longer the elicitation process takes, similar to participant fatigue in surveys and experiments [36]. There is thus a trade-off between the extent of information desired for research purposes and the number and richness of questions asked. As noted, the average time taken for a group to assess a claim in repliCATS workshops was just under 30 minutes. Such a bounded time of participation was also necessary because the majority of research participants were volunteers who were unpaid, or only minimally compensated for their time. Although aspects of the elicitation are believed to be intrinsically rewarding—by design—this is another reason for minimizing the burden on participants.

In order to address the elicitation burden, the repliCATS platform was designed to be user-friendly, and online workshops were flexible in terms of participant schedules. The trade-off for such elicitations, however, is a reduced capacity for participants to engage in discussion, which is a key aspect of the IDEA protocol. Participants in widely-spaced time zones could not have synchronous discussions and could only exchange comments over a period of days. Other online workshops using the repliCATS process have grouped participants, as far as possible, into compatible time zones, although other logistical constraints do not always make this possible. (By contrast, face-to-face workshops allow for full discussions, at the cost of committing participants to a set schedule and physical co-location.) Given that the information-gathering phase of the process constituted around one-third of the time spent by participants (approximately 30 minutes per claim assessed), it is difficult to imagine a method that would substantially reduce this burden. Prediction markets have the potential for shorter engagement, but only if participants do not return to engage in repeated trades. In this case, as with all methods, there is a trade-off between participant effort and performance. Ultimately, the problem of participant fatigue is inherent to using human-derived judgements; it is best addressed by user-centred design appropriate to the particular method.

A related difficulty is that research participants might have multiple interpretations of key terms for the elicitation. Indeed, there can be considerable conceptual slippage around many of these terms, such as conflicting taxonomies attempting to define replication practices. While the discussion phase of the IDEA protocol can be used to work on resolving such ambiguities, it can be difficult or impossible to completely eliminate such differences in interpretation. This difficulty was addressed by encouraging participants to describe their interpretations through textual responses, allowing both team members and qualitative analysts to understand these ambiguities better. The repliCATS platform also provides the definitions for key terms such as ‘successful replication’ where specific definitions are used within the SCORE program. The provision of such information also involves trade-offs between definitional specificity and richness of response. This problem of ambiguity is inherent to human assessment of claims and so is relevant to alternative techniques, although the impacts will be different for each technique.

Another challenge is that participants may hold opinions about aspects of a study that could bias their judgements about replicability. One such issue relates to the perceived importance of a study, whilst another relates to stylistic features of how the paper is written. While both of these may be proxies for the replicability of a study in some contexts, they need not be. Even more directly relevant markers, like the quality of the experimental design, need not suggest non-replicability. For example, a poorly operationalized study where the dependent variable is auto-correlated with the independent variable will, in fact, be highly replicable if you repeat the study exactly. For these reasons, the elicitation was designed to separate out judgements about replicability while allowing participants to explicitly express their opinions about matters like the clarity, plausibility and importance of the study. This is done through two questions prior to the question about replicability, and one question afterwards. Like the problem of ambiguity, this challenge affects alternative techniques for predicting replicability, although, also like the previous case, the discussion phase of the IDEA protocol has the advantage of allowing the problem to be directly addressed.

4.2 Challenges with recruitment

Ideally, for most forecasting problems, IDEA groups are deliberately constructed to be diverse. People differ in the way they perceive and analyze problems, as well as in the knowledge they bring to bear on them. Access to a greater variety of information and analyses may improve individual judgements. The accuracy of participants’ answers to quantitative and probabilistic questions has been shown to improve between their independent initial judgements and the final judgements they submit through the IDEA protocol [5, 6]. Offset against this is that domain-specific knowledge is clearly relevant to understanding the details of research claims, and thus to assessing their replicability. The balance between diversity and domain knowledge in good assessments of replicability is poorly understood and warrants further research. The repliCATS recruitment strategy meant that we were not able to fully control for, or consistently recruit, diversity in groups. Where participants are allowed to self-select claims to assess, such diversity is even harder to achieve, even if the overall participant pool is highly diverse. Maximising diversity will be a challenge for any implementation of the repliCATS process, subject to the specific constraints of the implementation context.

5. Potential applications

The repliCATS process is intended to be used by researchers well beyond its application to the SCORE program. We describe potential applications of this method that we intend to pursue and that may interest others.

5.1 Capacity building in peer review

There are few peer review training opportunities available to researchers [37] and even fewer that have clearly demonstrated evidence of success [38, 39]. As a consequence, many early career researchers report that they frequently learn how to review in passive and fragmentary ways, such as performing joint reviews with their advisors and senior colleagues, e.g. via participation in journal clubs or from studying reviews of their own submissions [40].

Participants in the repliCATS project noted the benefits of the feedback and calibration in the repliCATS process: “I got a lot of exposure to a variety of research designs and approaches (including some fun and interesting theories!) and was afforded [the] opportunity to practice evaluating evidence. In practicing evidence evaluation, I feel like I sharpened my own critical evaluative skills and learned from the evaluations of others.” This type of feedback was particularly common from early career researchers, although more experienced researchers also described the value of the process as professional development in peer review.

This feedback suggests the potential to deploy evaluations of research papers through the repliCATS process as explicit training in both research design and peer review. A pilot application of such student training has already been undertaken at the University of North Carolina, Chapel Hill where the repliCATS process was deployed as an extra credit undergraduate student activity.

5.2 Alternative peer review models

Surprisingly little is known about how reviewers conduct peer review. More than a quarter of a century after the remark was made, it is still the case that “we know surprisingly little about the cognitive aspects of what a reviewer does when he or she assesses a study” [41]. Nor are journal instructions to reviewers especially informative here, with criteria for acceptance usually expressed in general terms, such as “how do you rate the quality of the work?”.

Initiatives such as Transparency and Openness Promotion (TOP) guidelines offer guidance for authors and reviewers at signatory journals [42] (see also https://www.cos.io/initiatives/top-guidelines). These are important for ensuring completeness of scientific reporting. The repliCATS process is not an attempt at competing guidelines. Its goal is not to provide checklists for reviewers (or authors or editors). Rather, we conceptualize peer review as a process that takes advantage of collective intelligence through an expert deliberation and decision-making process, and thus one to which a structured elicitation and decision protocol is applicable [43].

Despite limited implementation among popular journals in the social sciences [44], there are existing models of interactive peer review that encourage increased dialogue between reviewers, such as the one used in the Frontiers journals. However, these typically rely on behavioural consensus, with its attendant disadvantages. In contrast, the repliCATS process has the strong advantage of a predefined end-point, avoiding ‘consensus by fatigue’. It is transparent by design, and the underlying IDEA protocol is directly informed by developments in the expert elicitation, deliberation and decision making literature.

5.3 Commissioned reviews

One example of an alternative form of review is the commissioned review, for instance of papers or research proposals prior to submission. This potential application was suggested by a number of repliCATS participants. Variations on this theme include reviews of a suite of existing published research intended as the basis of a research project, or as the evidence base for specific policy, action or management decisions. The former is likely to appeal to early career researchers, for example at the start of their PhD candidature, and the latter to end users and consumers of research. The scalability of the repliCATS process makes it suitable for this use.

5.4 Allocating replication effort

Replication of studies is not always possible, nor is it always desirable. Some studies cannot be replicated, for example because of their historical nature, as described above. Such studies may still have the potential to inform decisions, and assessment of their reliability may still be valuable. Nor is replication always ideally suited to assessing the reliability of a claim, even when it is a viable approach. In any case, replications are typically resource-intensive.

There are several approaches to determining how best to allocate scarce resources to replication studies. One suggested approach is to apply the results of prediction markets to this question [10]. Other approaches propose selection of studies for replication based on a trade-off between the existing strength of evidence for a focal effect and the utility of replicating a given study. In Field et al. [45], a worked example is given on how to combine strength of evidence (quantified with a Bayesian reanalysis of published studies) with theoretical and methodological considerations. Pittelkow et al. [46] follow a similar procedure, applied to studies from clinical psychology. Isager et al. [9] outline a model for deciding on the utility of replicating a study, given known costs, by calculating the value of a claim and the uncertainty about that claim prior to replication. The key variables in this model—costs, value, and uncertainty—remain undefined, with the expectation that each can be specified outside the model (as relevant to a given knowledge domain). This approach to formalizing decisions can help clarify how to justify allocating resources toward specific replication practices.

While all techniques for predicting replicability can generate data about specific research claims that can be used as inputs to such formal models, the repliCATS process can provide further inputs. For example, the repliCATS approach is able to provide data on uncertainty, such as the extent of agreement or disagreement within groups about the replicability of a given claim, and the uncertainty of individual assessments as reflected in interval widths. Additionally, qualitative data from the repliCATS process provides information on the theoretical or practical value of specific claims.
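As an illustration, the sketch below turns repliCATS-style group judgements into the kinds of uncertainty inputs a formal replication-selection model might use: disagreement across group members and average interval width. This is a minimal sketch under stated assumptions; the per-participant estimates are hypothetical, and the choice and weighting of these proxies is not part of the models cited above.

```python
# Minimal sketch (illustrative only): uncertainty proxies derived from one
# group's three-point estimates for a single claim, of the kind that could
# feed a formal replication-selection model alongside cost and value.
import numpy as np

def uncertainty_proxies(best, lower, upper):
    """best/lower/upper: per-participant probability estimates for one claim."""
    disagreement = float(np.std(best))                              # spread of best estimates
    mean_width = float(np.mean(np.asarray(upper) - np.asarray(lower)))  # average interval width
    return disagreement, mean_width

# Hypothetical estimates from a group of five assessors:
best = [0.30, 0.65, 0.50, 0.40, 0.70]
lower = [0.10, 0.45, 0.30, 0.20, 0.55]
upper = [0.55, 0.85, 0.70, 0.60, 0.90]

disagreement, mean_width = uncertainty_proxies(best, lower, upper)
# Higher disagreement and wider intervals suggest greater uncertainty, making
# the claim a stronger replication candidate, all else (cost, value) being equal.
print(disagreement, mean_width)
```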

6. Conclusion

Assessment of the credibility of scientific papers in general, and prediction of the replicability of published research claims in particular, is an expert decision-making problem of considerable importance to the scientific community. In such cases the use of a structured protocol has known advantages. The repliCATS process is a practical realization of the IDEA protocol for the assessment of published research papers. A validation experiment has shown that this process meets or exceeds the accuracy of current techniques for predicting replicability. The repliCATS process has also been demonstrated to be scalable, from rapid review of small numbers of claims to assessment of large volumes of claims over an extended period. A particular advantage of the repliCATS process is the collection of rich qualitative data that can be used to address questions beyond direct replicability, such as the generalizability of given claims. Future work that builds on the repliCATS process, incorporating broader sets of credibility signals beyond replicability across a range of applications, has the potential to improve the scientific literature by supporting alternative models of peer review.

Supporting information

S1 Table. Details of elicitation questions used in the repliCATS process for the validation experiment.

(TIF)

S2 Table. List of source journals for research claims in the SCORE program.

(TIF)

Data Availability

The anonymized data underlying the validation experiment for the method described in this paper are available from a publicly accessible OSF repository (at https://osf.io/hkmv3/). This link is provided in the manuscript.

Funding Statement

This project is sponsored by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No.HR001118S0047. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, et al. Evaluating replicability of laboratory experiments in economics. Science; 351(6280):1433–1436. doi: 10.1126/science.aaf0918 [DOI] [PubMed] [Google Scholar]
  • 2. Camerer CF, Dreber A, Holzmeister F, Ho TH, Huber J, Johannesson M, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour; 2(9):637–644. doi: 10.1038/s41562-018-0399-z [DOI] [PubMed] [Google Scholar]
  • 3. Klein RA, Ratliff KA, Vianello M, Adams RB, Bahník S, Bernstein MJ, et al. Investigating variation in replicability: “Many Labs” Replication project. Social Psychology; 45(3):142–152. doi: 10.1027/1864-9335/a000178 [DOI] [Google Scholar]
  • 4. Open Science Collaboration. Estimating the reproducibility of psychological science. Science; 349(6251):aac4716. doi: 10.1126/science.aac4716 [DOI] [PubMed] [Google Scholar]
  • 5. Hanea AM, Burgman M, Hemming V. IDEA for uncertainty quantification. In: Dias LC, Morton A, Quigley J, editors. Elicitation: the science and art of structuring judgement. International Series in Operations Research & Management Science. Springer International Publishing; pp. 95–117. [Google Scholar]
  • 6. Hemming V, Burgman MA, Hanea AM, McBride MF, Wintle BC. A practical guide to structured expert elicitation using the IDEA protocol. Methods in Ecology and Evolution; 9(1):169–180. doi: 10.1111/2041-210X.12857 [DOI] [Google Scholar]
  • 7.Alipourfard N, Arendt B, Benjamin DM, Benkler N, Bishop MM, Burstein M, et al. Systematizing Confidence in Open Research and Evidence (SCORE). SocArXiv 46mnb [Preprint]. 2021 [posted 2021 May 4; cited 2021 Aug 17]: [33 p.]. Available from: https://osf.io/preprints/socarxiv/46mnb
  • 8.Coles N, Tiokhin L, Scheel AM, Isager PM, Lakens D. The costs and benefits of replication studies. PsyArXiv [Preprint]. 2018 PsyArXiv c8akj [posted 2018 January 18; revised 2018 July 2; cited 2021 Feb 17]: [7 p.]. Available from: https://psyarxiv.com/c8akj/ [DOI] [PubMed]
  • 9. Isager PM, van Aert RCM, Bahník S, Brandt M, DeSoto KA, Giner-Sorolla R, et al. Deciding what to replicate: a formal definition of “replication value” and a decision model for replication study selection. MetaArXiv 2gurz [Preprint] [posted 2020 September 2; cited 2021 Feb 17]: [14 p.]. Available from: https://osf.io/preprints/metaarxiv/2gurz/
  • 10. Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, et al. Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences; 112(50):15343–15347. doi: 10.1073/pnas.1516179112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Benjamin D, Mandel DR, Kimmelman J Can cancer researchers accurately judge whether preclinical reports will reproduce?. PLOS Biology 15(6): e2002212. doi: 10.1371/journal.pbio.2002212 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Altmejd A, Dreber A, Forsell E, Huber J, Imai T, Johannesson M, et al. Predicting the replicability of social science lab experiments. PLOS ONE; 14(12):e0225826. doi: 10.1371/journal.pone.0225826 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Yang Y, Youyou W, Uzzi B. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proceedings of the National Academy of Sciences; p. 201909046. doi: 10.1073/pnas.1909046117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Hoogeveen S, Sarafoglou A, Wagenmakers EJ. Laypeople can predict which social science studies replicate. Advances in Methods and Practices in Psychological Science; September 2020:267–285. [Google Scholar]
  • 15. French S. Aggregating expert judgement. Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales. Serie A. Matematicas; 105(1):181–206. doi: 10.1007/s13398-011-0018-6 [DOI] [Google Scholar]
  • 16. Hanea AM, McBride MF, Burgman MA, Wintle BC. The value of performance weights and discussion in aggregated expert judgments. Risk Analysis: An Official Publication of the Society for Risk Analysis; 38(9):1781–1794. doi: 10.1111/risa.12992 [DOI] [PubMed] [Google Scholar]
  • 17. Hemming V, Hanea AM, Walshe T, Burgman MA. Weighting and aggregating expert ecological judgments. Ecological Applications; 30(4):e02075. [DOI] [PubMed] [Google Scholar]
  • 18. McAndrew T, Wattanachit N, Gibson GC, Reich NG. Aggregating predictions from experts: a review of statistical methods, experiments, and applications. WIREs Computational Statistics; e1514. doi: 10.1002/wics.1514 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Hanea A, Wilkinson D P, McBride M, Lyon A, van Ravenzwaaij D, Singleton Thorn F., et al. Mathematically aggregating experts’ predictions of possible futures: PLOS ONE; 16(9): e0256919. doi: 10.1371/journal.pone.0256919 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Satopää VA, Baron J, Foster DP, Mellers BA, Tetlock PE, Ungar LH. Combining multiple probability predictions using a simple logit model. International Journal of Forecasting; 30(2):344–356. [Google Scholar]
  • 21. Cooke RM, Marti D, Mazzuchi T. Expert forecasting with and without uncertainty quantification and weighting: what do the data say?. International Journal of Forecasting; 37(1):378–387. doi: 10.1016/j.ijforecast.2020.06.007 [DOI] [Google Scholar]
  • 22.Gould, E., Willcox, A., Fraser, H., Singleton Thorn, F., Wilkinson, D. P. Using model-based predictions to inform the mathematical aggregation of human-based predictions of replicability. MetaArXiv [Preprint]. 2021 MetaArXiv f675q [posted 2021 May 1]: [18 p.]. Available from: 10.31222/osf.io/f675q/ [DOI]
  • 23. Mellers B, Stone E, Atanasov P, Rohrbaugh N, Metz SE, Ungar L, et al. The psychology of intelligence analysis: drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied; 21(1):1–14. [DOI] [PubMed] [Google Scholar]
  • 24. Mellers B, Stone E, Murray T, Minster A, Rohrbaugh N, Bishop M, et al. Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science; 10(3):267–281. doi: 10.1177/1745691615577794 [DOI] [PubMed] [Google Scholar]
  • 25. Page SE. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton: Princeton University Press; 2008. [Google Scholar]
  • 26. Kerr NL, Tindale RS. Group performance and decision making. Annual Review of Psychology; 55(1):623–655. doi: 10.1146/annurev.psych.55.090902.142009 [DOI] [PubMed] [Google Scholar]
  • 27. Burgman MA. Trusting judgements: how to get the best out of experts. Cambridge: Cambridge University Press; 2016. [Google Scholar]
  • 28. Speirs-Bridge A, Fidler F, McBride M, Flander L, Cumming G, Burgman M. Reducing overconfidence in the interval judgments of experts. Risk Analysis; 30(3):512–523. doi: 10.1111/j.1539-6924.2009.01337.x [DOI] [PubMed] [Google Scholar]
  • 29. Soll JB, Klayman J. Overconfidence in interval estimates. Journal of Experimental Psychology: Learning, Memory, and Cognition; 30(2):299–314. [DOI] [PubMed] [Google Scholar]
  • 30. Teigen KH, Jørgensen M. When 90% confidence intervals are 50% certain: on the credibility of credible intervals. Applied Cognitive Psychology; 19(4):455–475. doi: 10.1002/acp.1085 [DOI] [Google Scholar]
  • 31. Koriat A, Lichtenstein S, Fischhoff B. Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory; 6(2):107–118. [Google Scholar]
  • 32. Mandel DR, Collins RN, Risko EF, Fugelsang JA. Effect of confidence interval construction on judgment accuracy: Judgment and Decision Making; 15(5):783–797. doi: 10.1017/S1930297500007920 [DOI] [Google Scholar]
  • 33.Wintle B, Mody F, Smith E T, Hanea A, Wilkinson D P, Hemming V. Predicting and reasoning about replicability using structured groups. MetaArXiv [Preprint]. 2021 MetaArXiv vtpmb [posted 2021 May 4]: [31 p.]. Available from: https://osf.io/preprints/metaarxiv/vtpmb/ [DOI] [PMC free article] [PubMed]
  • 34.Pearson R, Fraser H, Bush M, Mody F, Widjaja I, Head A, et al. Eliciting group judgements about replicability: a technical implementation of the IDEA protocol: Hawaii International Conference on System Science. 2021 [posted 2021 January 5; cited 2021 Feb 17]: [10 p.]. Available from: http://hdl.handle.net/10125/70666
  • 35.Yarkoni T. The generalizability crisis. PsyArXiv [Preprint]. 2019 PsyArXiv jqw35 [posted 2019 November 22; revised 2020 November 3; cited 2021 Feb 17]: [27 p.]. Available from: https://psyarxiv.com/jqw35
  • 36. Lavrakas P. Respondent fatigue. In: Encyclopedia of survey research methods. Sage Publications, Inc. Available from: http://methods.sagepub.com/reference/encyclopedia-of-survey-research-methods/n480.xml [Google Scholar]
  • 37. Patel J. Why training and specialization is needed for peer review: A Case Study of Peer Review for Randomized Controlled Trials. BMC medicine; 12(1):128. doi: 10.1186/s12916-014-0128-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Bruce R, Chauvin A, Trinquart L, Ravaud P, Boutron I. Impact of interventions to improve the quality of peer review of biomedical journals: a systematic review and meta-analysis. BMC medicine; 14(1):85. doi: 10.1186/s12916-016-0631-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Callaham ML, Tercier J. The relationship of previous training and experience of journal peer reviewers to subsequent review quality. PLoS medicine; 4(1):e40. doi: 10.1371/journal.pmed.0040040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. McDowell GS, Knutsen JD, Graham JM, Oelker SK, Lijek RS. Research culture: co-reviewing and ghostwriting by early-career researchers in the peer review of manuscripts. eLife; 8:e48425. Available from: https://elifesciences.org/articles/48425 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Kassirer JP, Campion EW. Peer review: crude and understudied, but indispensable. JAMA; 272(2): 96–97. doi: 10.1001/jama.1994.03520020022005 [DOI] [PubMed] [Google Scholar]
  • 42. Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, et al. Promoting an open research culture. Science; 348(6242): 1422. doi: 10.1126/science.aab2374 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Marcoci A, Vercammen A, Bush M, Hamilton DG, Hanea AM, et al. Imagining peer review as an expert elicitation process. BMC Res Notes; 15: 127. doi: 10.1186/s13104-022-06016-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Hamilton DG, Fraser H, Hoekstra R, Fidler F. Meta-research: journal policies and editors’ opinions on peer review. eLife; 9: e62529. Available from: https://elifesciences.org/articles/62529 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Field SM, Hoekstra R, Bringmann L, van Ravenzwaaij D. When and why to replicate: as easy as 1, 2, 3?. Collabra: Psychology; 5(1): 46. doi: 10.1525/collabra.218 [DOI] [Google Scholar]
  • 46. Pittelkow M, Hoekstra R, Karsten J, van Ravenzwaaij D. Replication target selection in clinical psychology: a Bayesian and qualitative re-evaluation. Clinical Psychology: Science and Practice; Forthcoming. [Google Scholar]

Decision Letter 0

Sherief Ghozy

4 Jun 2021

PONE-D-21-05317

Predicting reliability through structured expert elicitation with repliCATS (Collaborative Assessments for Trustworthy Science)

PLOS ONE

Dear Dr. Bush,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we have decided that your manuscript does not meet our criteria for publication and must therefore be rejected.

I am sorry that we cannot be more positive on this occasion, but hope that you appreciate the reasons for this decision.

Yours sincerely,

Sherief Ghozy, M.D., Ph.D. candidate

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper “outlines an expert elicitation protocol (repliCATS) designed to accurately predict the replicability of a large volume of claims across the social and behavioural sciences.” Its main research goals are: accuracy, scalability and insight. Nowhere is the claim of accuracy supported by argument, data or even elaborated. As far as I can see the paper consists only of describing their platform and their aspirations. There is no scientific claim to review and nothing for a reviewer to do except to say that this manuscript doesn’t belong in a scientific journal. Indeed it reads more like an interim report to a funding agency. I will say that judging “classification accuracy” by treating “predictions >50% as predictions of replication success and <50% as predictions of replication failure” is extremely maladroit. Why go to the trouble of eliciting probabilistic predictions?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

- - - - -


PLoS One. 2023 Jan 26;18(1):e0274429. doi: 10.1371/journal.pone.0274429.r002

Author response to Decision Letter 0


14 Oct 2021

Responses to the reviewer were included in the PointResponses attachment. I copy them below.

Point-by-point responses

Reviewer’s comments

“Nowhere is the claim of accuracy supported by argument, data or even elaborated.”

As a Methods paper, we are required to provide validation of our method. This was done in §4. We welcome comments that will improve this discussion. We disagree that validation was nowhere supported.

“As far as I can see the paper consists only of describing their platform and their aspirations. There is no scientific claim to review and nothing for a reviewer to do except to say that this manuscript doesn’t belong in a scientific journal.”

These reviewer comments suggest that the reviewer is not aware of the fact that PLOS ONE explicitly accepts Methods papers, and is unfamiliar with the guidelines by which such papers should be assessed. §2 describes the background to the field and existing alternatives. §3 describes our method (not our platform).

“I will say that judging “classification accuracy” by treating “predictions >50% as predictions of replication success and <50% as predictions of replication failure” is extremely maladroit.”

As described in §2 and throughout, there are multiple measures for prediction accuracy. Classification accuracy is a standard metric within this field, is reported by previous authors, and our use of it allows us to benchmark our method's validity against the existing literature. We would happily describe in more detail the advantages and disadvantages of the various standard metrics for prediction outcomes.
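As a concrete illustration of the distinction at issue, the sketch below compares three standard ways of scoring the same set of probabilistic predictions: classification accuracy at a 0.5 threshold, the Brier score, and the area under the ROC curve. This is illustrative only – it is not the repliCATS analysis code, and the probabilities and outcomes are invented for demonstration.

    # Illustrative toy example (Python); not the repliCATS analysis code.
    # Probabilities and outcomes below are invented for demonstration.

    def classification_accuracy(probs, outcomes, threshold=0.5):
        """Share of claims whose thresholded prediction matches the outcome."""
        preds = [1 if p > threshold else 0 for p in probs]
        return sum(int(p == o) for p, o in zip(preds, outcomes)) / len(outcomes)

    def brier_score(probs, outcomes):
        """Mean squared error of the probabilities; lower is better."""
        return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

    def auc(probs, outcomes):
        """Probability that a randomly chosen replicating claim receives a
        higher predicted probability than a non-replicating one (ties = 0.5)."""
        pos = [p for p, o in zip(probs, outcomes) if o == 1]
        neg = [p for p, o in zip(probs, outcomes) if o == 0]
        wins = [1.0 if pp > pn else 0.5 if pp == pn else 0.0
                for pp in pos for pn in neg]
        return sum(wins) / len(wins)

    probs = [0.9, 0.7, 0.55, 0.45, 0.3, 0.2]   # elicited P(claim replicates)
    outcomes = [1, 1, 0, 1, 0, 0]              # 1 = replicated, 0 = did not

    print(classification_accuracy(probs, outcomes))  # ~0.67
    print(brier_score(probs, outcomes))              # ~0.14
    print(auc(probs, outcomes))                      # ~0.89

Classification accuracy discards the graded information in the probabilities, whereas the Brier score and AUC retain it, which is why reporting several metrics side by side is common in this literature.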

“Why go to the trouble of eliciting probabilistic predictions?”

We have described in §3 and throughout the intrinsic advantages of probabilistic estimation, including reducing overconfidence and harnessing diverse viewpoints through mathematical rather than behavioural aggregation. We would happily go into more detail if these arguments were unclear.
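To make "mathematical aggregation" concrete, the minimal sketch below combines several experts' final-round probabilities by averaging them on the log-odds scale and transforming the result back to a probability. This is one common aggregation approach, offered only as an illustration; the aggregation methods actually used by the repliCATS project are described in the cited work.

    # Illustrative sketch (Python) of mathematical aggregation of expert
    # probabilities; not the repliCATS project's own aggregation code.
    import math

    def log_odds(p):
        return math.log(p / (1.0 - p))

    def inverse_log_odds(x):
        return 1.0 / (1.0 + math.exp(-x))

    def aggregate(probabilities):
        """Unweighted mean of individual judgements on the log-odds scale."""
        mean_lo = sum(log_odds(p) for p in probabilities) / len(probabilities)
        return inverse_log_odds(mean_lo)

    # Four hypothetical experts' final-round probabilities that a claim replicates.
    print(round(aggregate([0.8, 0.6, 0.7, 0.55]), 3))  # ~0.67

Because each expert contributes a number rather than a negotiated group verdict, no single voice dominates the aggregate; this is the sense in which mathematical aggregation differs from behavioural (consensus-seeking) aggregation.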

Attachment

Submitted filename: PointResponse_PredictingReliability_repliCATS_20120820.docx

Decision Letter 1

Ferrán Catalá-López

9 Feb 2022

PONE-D-21-05317R1

Predicting reliability through structured expert elicitation with repliCATS (Collaborative Assessments for Trustworthy Science)

PLOS ONE

Dear Dr. Bush,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript: 

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ferrán Catalá-López

Academic Editor

PLOS ONE

Journal Requirements:

1. PLOS ONE has specific requirements for studies that are presenting a new method or tool as the primary focus (https://journals.plos.org/plosone/s/submission-guidelines#loc-methods-software-databases-and-tools.) One requirement is that the tool must meet the criteria of validation, which may be met by including a proof-of-principle experiment or analysis. To this effect, please ensure that you have described your pilot study in sufficient detail so as to adequately demonstrate that the new tool achieves its intended purpose.

2. Thank you for stating the following financial disclosure:

"This project is sponsored by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No.HR001118S0047. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred."

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. Please upload a new copy of Figure 2 as the detail is not clear. Please follow the link for more information: https://blogs.plos.org/plos/2019/06/looking-good-tips-for-creating-your-plos-figures-graphics/

5. We note that Figure 2 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 2 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

6. Please provide additional details regarding participant consent. In the Methods section, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

Additional Editor Comments (if provided):

For further considering your manuscript at PLOS ONE, please see and address the comments and requests by two independent reviewers below.

PLOS ONE will consider submissions that present methods as the primary focus of the manuscript if they meet specific criteria. Please visit the journal website:

https://journals.plos.org/plosone/s/submission-guidelines#loc-methods-software-databases-and-tools


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: No

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: No

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: No

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Dear Authors,

Thank you for the opportunity to review this manuscript. I realize there has been a round 1 and disagreements with one of the reviewers, and given the deadline I was set – Feb 28 – I am not sure of the effect of this review, as the form says the data will already have been collected and deposited by March 2021 (possibly a typo). In order to evaluate the accuracy and validity of the data, as the other reviewer said, I would need access to the data that support these claims.

Nevertheless, here are some of my suggestions:

1) It is not clear when reading the abstract that this is a methods paper, and that it demonstrates utility, validation or addresses availability of the method or the platform. So please restructure the paper so that it has clear sections with methods and results for utility and validity, and comment on availability of the questions as well as the platform that you used.

2) If it is a methods paper, then the full data which demonstrate validity, feasibility, accuracy, and availability need to be made available for reviewers to evaluate them. If, on the other hand, this is more a protocol or proof of concept, then details of how validity and utility will be assessed when the work is finished need to be described in detail, and pilot results can be moved to an appendix and used only to show that they demonstrate the feasibility of the proposed assessment of utility/validity/accuracy.

2) Intro – "replicable, in that repetitions of the methods and procedures should produce the same facts/results" – I believe it is more important to state here that they are first reproducible; the difference between reproducibility and replicability needs to be stressed in this paper. Additionally, as this project in a way also evaluates the reproducibility of the papers, a clear description needs to be made of how reproducibility influences replicability.

3) Intro – I would move all text starting from "The elicitation procedure in this paper is based on the IDEA protocol" into the methods, explaining how the process will be done. End the intro with your objective for this paper.

4) While your first question was whether or not the claim will replicate, it sounds very much as though this question translates to whether participants believe the results of the study are reliable/valid. If there were time to amend your protocols, I would insist you ask this of one group of participants before the question on replicability, to see how often the answers to these two questions differ. Additionally, when a reviewer finds that a study would replicate but its findings are nevertheless useless, the question remains whether resources should be spent on its replication.

2) Research context – structure suggestion – the objectives stated in this section should be numbered and the methods to assess each explained in detail.

3) I suggest moving the platform section to the appendix and focusing exclusively on the method for prediction, or make it clear if this is a paper about the platform – or about both.

4) As you stated that "Full details of current performance are presented in Wintle et al. (in review) and Hanea et al. (in review)", it is not clear what the purpose of those papers is versus this one. Which one represents the validity / feasibility / methods paper, and why is there a need for this paper if the methods are validated in other papers?

5) A validation method in my view requires the use of one or more different methods on the same studies, and showcasing the accuracy of yours vs the other methods. A clear table with this is needed. If it is about the platform, then user experience surveys, accessibility and other items need to be addressed. Please state clearly – using subtitles – what you are demonstrating and the methods you use for that.

6) While I appreciate the challenges section, please focus this paper on its primary goal – is it a protocol of your method, a methods paper, or a protocol on how full validation will be done? The purpose of reviewers should be to be able to say whether your validation methods are sufficient to say this method is good for what you propose. The challenges could be renamed as advice for those planning to implement the method, and you should consider creating a checklist or flowchart that researchers could follow, but these sections should come only after we agree that you have provided evidence of the utility/validity/superiority of your method. Similarly for future applications. These sections are currently too long and take away from the focus of the paper – you can move them to the appendix. Focus on the goal – to demonstrate that this method can be used to evaluate replicability. Include examples of discussions, reflections on time to provide feedback and similar aspects.

7) If in the conclusions you say that you still do not have an answer to whether repliCATS reliably improves replication prognostics, then you need to be clear what exactly you mean you are validating in this paper under section 4. PLOS guidelines say to demonstrate that the method achieves its intended purpose, so structure the paper in that way. To me this read as a methods paper with a proof of principle – and the title and the structure should reflect that. Papers on its accuracy compared to other methods are pending, and that needs to be made clear in the abstract and in the main message of the paper. You could also consider structuring this paper in a way that would help others using your method to evaluate a particular study's replicability. Finally, please rewrite your conclusions to say – In this paper we showcase a new method or a new platform that can be used to …. We have also demonstrated its x and y. Our pilot data indicate z, with studies on a, b, c pending. We also believe this method could be used for…

8) My final recommendation would be for the editor to provide you with an example of a methods paper published in PLOS or elsewhere according to which you could structure this paper. Alternatively, you could cite, at the beginning, a methods template that you followed. Currently the structure is not easy to follow, nor does it clearly state the advantages and limitations of the methods you used to demonstrate the purpose, validity and superiority of your approach.

Reviewer #3:

Authors have appealed against the initial decision to reject their paper. They argue that:

- there was no major issue raised by the reviewer;

- this is a methods paper, and therefore the fact that there are no specific data/interpretation was appropriate.

I must say that I agree with the authors. This is a paper describing a method. The main difficulty is that it reviews and summarizes previous material / preliminary evidence (including preprints) suggesting that this new method is appropriate and validated. The boundary between a methods paper and a review is therefore thin, and I can also understand the opinion of the first editor and reviewer.

However, if one follows PLOS ONE guidance, this manuscript reports data on:

- Utility;

- Validity (although this is preliminary evidence);

- And, to a lesser extent, availability (i.e. material from various projects may be available through the Open Science Framework).

Regarding this last point, the following data sharing statement is not acceptable under PLOS ONE policy:

"This paper describes a research methodology prior to final results. Not all results from the research have been finalised and so results cannot yet be made available.

The data underlying this research will be made available via an OSF repository after March 2021. Data will be anonymised before being made publicly available. Some participant data were collected on a confidential basis and cannot be made publicly available due to ethical concerns."

I mean that, as a reviewer of this specific paper, I don't care about data that are being collected using the method proposed by the authors of this paper (because those data will need to be shared for the future papers reporting on the results), but I am interested in all material described in this paper. For instance, this very relevant link that is cited in the paper (https://osf.io/m6gdp/) is not publicly available and one needs to ask for access. It needs to be made publicly available. No restrictions are acceptable.

Authors should provide a detailed table reporting, for each section of the manuscript:

- All links to registered protocols;

- A link to aggregated results (e.g. publications, pre-prints);

- All links to datasets and code (when available).

Of course, on some occasions this can be found in the text, but a clear table detailing all the outputs and describing where data can be found would be great.

This was my major point regarding this paper. I have a few minor points:

- In the title, make it explicit that it is a "METHOD" paper. You may want to add that you are elaborating on "UTILITY, VALIDITY and AVAILABILITY";

- Please avoid acronyms in the abstract:

"We introduce a new technique to evaluating replicability, the repliCATS (Collaborative Assessments for Trustworthy Science) process, a structured expert elicitation approach based on the IDEA protocol." A reader may have no idea of what the IDEA protocol is...

- In the abstract, again, please elaborate more on the 3 categories "UTILITY, VALIDITY and AVAILABILITY" and detail how each was investigated or addressed.

- Please also add in the abstract the main limitation of the approach to be fair as there are some limitations.

- Please elaborate a little bit more on the feasibility of claims like "the IDEA protocol for many experts addressing many problems, with the capacity for the assessment of 3000 claims in 18 months."

- I do think that results of the pilot study may nicely be illustrated with a figure;

- Regarding your answer to the reviewer : "We would happily describe in more detail the advantages and disadvantages of the various standard metrics for prediction outcomes." I agree and support you in working on a figure and or a table to detail this point.

- Last panels in Figure 2 are not appropriately ordered and the order should be a, b, c... I understand the restrictions in terms of space, but I really invite you to organise it better.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes:

Reviewer #3: Yes:


While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jan 26;18(1):e0274429. doi: 10.1371/journal.pone.0274429.r004

Author response to Decision Letter 1


20 Jun 2022

Copyright statement:

Neither Figure 1 nor 3 have been previously copyrighted nor contain any elements which are under copyright restriction. All elements of these figures were either generated by the authors of the manuscript, are images copyright for which resides with the authors of the manuscript, or are images that have been used under the licence CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (‘No copyright’).

Responses to editor

“One requirement is that the tool must meet the criteria of validation, which may be met by including a proof-of-principle experiment or analysis.”

We have described in §2.5 more detail regarding the validation experiment and provided a new figure including results. While we regard the results of this as more than a proof-of-concept experiment, we have adjusted the wording of the paper to address this concern, shared by Reviewer 2. The SCORE program has been a large, complex, multi-disciplinary and multi-team project and it has not been possible to cover all of the project's aims, methods and results in a single paper. Moreover, some of the data from the overall program is still embargoed for public release, as the work of some teams is ongoing. However, all data underlying our validation experiment is now publicly available and we have included a link.

“Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.”

The updated funder statement was provided in the previous cover letter. We will include it in this cover letter as well.

“Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data.”

Data underlying the validation experiment reported in this paper is now publicly available and we have clarified this in re-submission.

“Please upload a new copy of Figure 2 as the detail is not clear.”

A new Figure 2 was uploaded with the previous resubmission, and this new figure was specifically referred to by Reviewer 3. Could the editor please confirm whether further revision is required.

“We note that Figure 2 in your submission contains copyrighted images”

All graphics in Figure 2 were created by the repliCATS project. Could the editor please confirm which graphics are of concern.

“Please provide additional details regarding participant consent.”

Accepted; additional details provided in §3 as requested.

Responses to Reviewer 2

“So please restructure the paper so that it has clear sections with methods and results for utility and validity, and comment on availability of the questions as well as the platform that you used.”

Accepted; we have restructured as requested so that results on validity are made clear in §2.5. The precise elicitation we used is provided in the Supporting Information section. The process itself does not rely on our specific implementation; we have clarified in §3 that the description of our platform is provided as an example.

“If it is a methods paper, then the full data which demonstrate validity, feasibility, accuracy, and availability need to be made available for reviewers to evaluate them.”

Accepted; we have included in §2.5 more details about the accuracy of our validation experiment and a link to our full data from this experiment.

“I believe it is more important to state here that they are first reproducible”

Accepted; the term used in §1 Introduction, “reliable”, was intended to include computational reproducibility and we have clarified this. A full discussion of issues around computational reproducibility, however, is beyond the scope of this paper.

“make it clear if this is a paper about the platform – or about both.”

Accepted; we have clarified in §3 that the online platform is just an example of a practical implementation of the method, which is based on the elicitation procedure plus aggregation methods. We have described elsewhere the technical implementation of the platform so that others could use that, but the method itself could be operationalised in many different ways to suit different contexts.

“A validation method in my view requires the use of one or more different methods on the same studies, and showcasing the accuracy of yours vs the other methods.”

Partly accepted; the studies evaluated by participants in the repliCATS project were also the subject of prediction experiments using other techniques. We use the reported results of those experiments for comparison. We have clarified this.

“While I appreciate the challenges section, … Similarly for future applications. These sections are currently too long,”

Accepted; we have reduced the length of these sections.

Responses to Reviewer 3

“this very relevant link that is cited in the paper (https://osf.io/m6gdp/) is not publicly available and one needs to ask for access. It needs to be made publicly available. No restrictions are acceptable.”

Accepted, and with our apologies. We included the private rather than the public link and this error has been fixed. We have confirmed the data availability.

“A reader may have no idea of what the IDEA protocol is”

Accepted; change made as requested. We outline the basic characteristics of the protocol and direct the reader to previously published work that comprehensively describes it.

“In the abstract, again, please elaborate more on the 3 categories "UTILITY, VALIDITY and AVAILABILITY" and detail how each was investigated or addressed.”

Accepted; we emphasise that the “utility of processes to predict replicability is their capacity to test scientific claims without the costs of full replication” and in §3 that this method is available for a range of implementations, not only our implementation.

“Please also add in the abstract the main limitation of the approach to be fair as there are some limitations.”

Accepted; major limitation added to abstract as requested.

“Please elaborate a little bit more on the feasibility of claims like "the IDEA protocol for many experts addressing many problems, with the capacity for the assessment of 3000 claims in 18 months.””

Accepted; more detail provided in §3 about the specific implementation within the SCORE project.

“I do think that results of the pilot study may nicely be illustrated with a figure”

Accepted; figure provided in §2.5 as requested.

“Regarding your answer to the reviewer: "We would happily describe in more detail the advantages and disadvantages of the various standard metrics for prediction outcomes." I agree and support you in working on a figure and or a table to detail this point.”

Accepted; table added in §2.2 as requested.

“Last panels in Figure 2 are not appropriately ordered and the order should be a, b, c... I understand the restrictions in terms of space, but I really invite you to organise it better.”

Not accepted; the panels are labelled anticlockwise from the centre top. We believe this is a straightforward and readily comprehensible way to organise the information. This has been clarified in the figure caption.

Attachment

Submitted filename: Responses_Predicting_reliability_repliCATS_20220601.pdf

Decision Letter 2

Ferrán Catalá-López

6 Jul 2022

PONE-D-21-05317R2

Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process

PLOS ONE

Dear Dr. Bush,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 20 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ferrán Catalá-López

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Dear Dr. Bush,

Thank you very much for your improved manuscript. Before I can consent to publication in PLOS ONE, however, there are still the following issues you need to address.

Points:

• Methodological reports must meet the criteria of utility, validation, and availability, which are described in detail at https://journals.plos.org/plosone/s/submission-guidelines

• Specifically, authors should adequately describe the validation of their methods in the pilot study, and confirm whether they have included sufficient detail on the results of the study.

• Authors should provide detailed responses to the peer reviewers’ comments (especially Reviewer 2) and consider the additional requests for your improved manuscript.

My decision to recommend publication will depend on how these issues are resolved and worked into your fine manuscript. From personal experience, I know that this is not the answer you wanted. But before you feel disheartened, I can assure you that addressing these issues should substantially improve the message your manuscript sends to potential readers, as well as make the text more coherent.

Thank you.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: No

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: No

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: No

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Thank you for inviting me to look at this again. I am disappointed that the authors did not provide a full copy of my previous comments, nor did I receive a message from the editor on the expected and agreed-upon structure for this paper. The following are changes I would like to see before the paper is published:

1) Please rewrite the abstract, to showcase actual data/numbers on Utility, Accuracy, Scalability, Insight and Validity

2) Please expand your definition of reproducibility in the intro. It is not only about the fact that “multiple observers examining the same data should agree on the facts and the results of analyses” – it is also about the reproducibility of the manuscript itself, and all data/figures produced in it – this aspect requires more attention in the introduction and in the discussion, as does its relation to replicability. Additionally, the authors should give clear examples of disciplines that replicate their own findings in the primary articles – and therefore do not require replication to be done after study publication. If the culture change would require replications to be included in the publication of a primary study, then the evaluation of the strength of claims would be different than explained here. Please mention examples of papers and fields that already do this, and why this is not necessary or feasible for all types of studies. Reproducible manuscripts are, however, feasible.

3) Please be consistent – you stated in the intro: “describe the utility, accuracy and scalability of this method” – and yet section 2 is titled “Utility, Accuracy, Scalability, Insight and Validity”. If all will be described, that should be stated.

4) I am puzzled again by section 2 – utility should be about the utility of the repliCATS project, not of replication studies themselves. The same applies to the rest. I would advise arranging a call with the editor and agreeing on the structure of this paper.

5) Section 2.2 did not specify the accuracy of repliCATS vs prediction markets vs surveys – therefore this section also requires rewriting. If, as I mentioned in the round 1 review, this paper is about how accuracy will be measured for repliCATS, then this paper needs to be turned into a protocol for all items in section 2.

6) I strongly suggest the authors create a table where the rows are Utility, Accuracy, Scalability, Insight and Validity and there are 3 columns – surveys, prediction markets, and the repliCATS project; additional columns can be added to include actual replication studies or other methods the authors believe should be included. The paper should then be structured in such a way that all of these are covered. Until this is done, I will refrain from making comments on the rest of the manuscript.

Reviewer #3: Thank you for addressing my comments. I think the manuscript has improved now and is suitable for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Mario Malicki

Reviewer #3: Yes: Florian NAUDET

**********



PLoS One. 2023 Jan 26;18(1):e0274429. doi: 10.1371/journal.pone.0274429.r006

Author response to Decision Letter 2


25 Aug 2022

This text is also included in the response to reviewers document.

In this revision, only Reviewer 2 has requested changes. We accept Reviewer 2’s points 1, 3, 5 and partially point 4. We do not accept Reviewer 2’s points 2 and 6. Details are provided below.

Responses to Reviewer 2

1) Please rewrite the abstract, to showcase actual data/numbers on Utility, Accuracy, Scalability, Insight and Validity

We accept this recommendation. The abstract has been rewritten to include specific figures for Accuracy, Validity, and Scalability. We note that Utility and Insight are qualitative features of the repliCATS process rather than quantitative ones, and so specific figures are not included.

2) Please expand your definition of reproducibility in the intro. It is not only about the fact that “multiple observers examining the same data should agree on the facts and the results of analyses” – it is also about the reproducibility of the manuscript itself, and all data/figures produced in it – this aspect requires more attention in the introduction and in the discussion, as does its relation to replicability. Additionally, the authors should give clear examples of disciplines that replicate their own findings in the primary articles – and therefore do not require replication to be done after study publication. If the culture change would require replications to be included in the publication of a primary study, then the evaluation of the strength of claims would be different than explained here. Please mention examples of papers and fields that already do this, and why this is not necessary or feasible for all types of studies. Reproducible manuscripts are, however, feasible.

We do not accept this recommendation. We do not agree with Reviewer 2 that reproducibility is necessarily a prior goal in the assessment of scientific manuscripts, nor is the discussion about practices in different fields in scope for this paper, which has a well-defined disciplinary range. This recommendation appears to be based on a specific opinion of Reviewer 2, which we do not share, and we do not feel we need to rewrite our manuscript accordingly. We have, however, substantially rewritten the second paragraph of the paper (lines 9-13) to clarify the scope even more explicitly.

3) Please be consistent – you stated in the intro: “describe the utility, accuracy and scalability of this method” – and yet section 2 is titled “Utility, Accuracy, Scalability, Insight and Validity”. If all will be described, that should be stated.

We accept the recommendation and thank the reviewer for noting the inconsistency in the heading to Section 2. This has been fixed (lines 52-53).

4) I am puzzled again by section 2 – utility should be about the utility of the repliCATS project, not of replication studies themselves. The same applies to the rest. I would advise arranging a call with the editor and agreeing on the structure of this paper.

We partially accept this recommendation. While we believe we had discussed the cost-saving utility of the repliCATS (and similar) processes, we have added additional detail to Section 2.1 (lines 64-66) to make this even more explicit.

5) Section 2.2 did not specify the accuracy of repliCATS vs prediction markets vs surveys – therefore this section also requires rewriting. If, as I mentioned in the round 1 review, this paper is about how accuracy will be measured for repliCATS, then this paper needs to be turned into a protocol for all items in section 2.

As discussed, we have restructured Section 2 (lines 67-155) to combine Accuracy and Utility into a single sub-section.

6) I strongly suggest the authors create a table where the rows are Utility, Accuracy, Scalability, Insight and Validity and there are 3 columns – surveys, prediction markets, and the repliCATS project; additional columns can be added to include actual replication studies or other methods the authors believe should be included. The paper should then be structured in such a way that all of these are covered. Until this is done, I will refrain from making comments on the rest of the manuscript.

We do not accept this recommendation. This would require a complete re-write of the manuscript according to the stylistic preference of Reviewer 2 and we do not consider this to be a reasonable request.

Attachment

Submitted filename: Responses_Predicting_reliability_repliCATS.pdf

Decision Letter 3

Ferrán Catalá-López

30 Aug 2022

Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process

PONE-D-21-05317R3

Dear Dr. Bush,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ferrán Catalá-López

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you for this revised version of your manuscript, and for your detailed answers to reviewer(s) suggestions.

Reviewers' comments:

Acceptance letter

Ferrán Catalá-López

19 Sep 2022

PONE-D-21-05317R3

Predicting reliability through structured expert elicitation with the repliCATS (Collaborative Assessments for Trustworthy Science) process

Dear Dr. Bush:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ferrán Catalá-López

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Details of elicitation questions used in the repliCATS process for the validation experiment.

    (TIF)

    S2 Table. List of source journals for research claims in the SCORE program.

    (TIF)

    Attachment

    Submitted filename: PointResponse_PredictingReliability_repliCATS_20120820.docx

    Attachment

    Submitted filename: Responses_Predicting_reliability_repliCATS_20220601.pdf

    Attachment

    Submitted filename: Responses_Predicting_reliability_repliCATS.pdf

    Data Availability Statement

    The anonymized data underlying the validation experiment for the method described in this paper are available from a publicly accessible OSF repository (at https://osf.io/hkmv3/). This link is provided in the manuscript.

