PLOS One. 2024 Nov 18;19(11):e0301111. doi: 10.1371/journal.pone.0301111

Testing for reviewer anchoring in peer review: A randomized controlled trial

Ryan Liu 1,*, Steven Jecmen 1, Vincent Conitzer 1, Fei Fang 1, Nihar B Shah 1
Editor: Stephan Leitner 2
PMCID: PMC11573134  PMID: 39556577

Abstract

Objective

Peer review frequently follows a process where reviewers first provide initial reviews, authors respond to these reviews, then reviewers update their reviews based on the authors’ response. There is mixed evidence regarding whether this process is useful, including frequent anecdotal complaints that reviewers insufficiently update their scores. In this study, we aim to investigate whether reviewers anchor to their original scores when updating their reviews, which serves as a potential explanation for the lack of updates in reviewer scores.

Design

We design a novel randomized controlled trial to test if reviewers exhibit anchoring. In the experimental condition, participants initially see a flawed version of a paper that is corrected after they submit their initial review, while in the control condition, participants only see the correct version. We take various measures to ensure that in the absence of anchoring, reviewers in the experimental group should revise their scores to be identically distributed to the scores from the control group. Furthermore, we construct the reviewed paper to maximize the difference between the flawed and corrected versions, and employ deception to hide the true experiment purpose.

Results

Our randomized controlled trial consists of 108 researchers as participants. First, we find that our intervention was successful at creating a difference in perceived paper quality between the flawed and corrected versions: Using a permutation test with the Mann-Whitney U statistic, we find that the experimental group’s initial scores are lower than the control group’s scores in both the Evaluation category (Vargha-Delaney A = 0.64, p = 0.0096) and Overall score (A = 0.59, p = 0.058). Next, we test for anchoring by comparing the experimental group’s revised scores with the control group’s scores. We find no significant evidence of anchoring in either the Overall (A = 0.50, p = 0.61) or Evaluation category (A = 0.49, p = 0.61). The Mann-Whitney U represents the number of individual pairwise comparisons across groups in which the value from the specified group is stochastically greater, while the Vargha-Delaney A is the normalized version in [0, 1].

1 Introduction

Peer review is the primary method for systematically evaluating scientific research. Many peer-review processes involve reviewers submitting an initial review, following which they may be presented with additional information. This additional information frequently takes the form of a response from the authors. The reviewers are then requested to read the response and adapt their stated opinions and evaluations accordingly. In this work, we put this potential change under the microscope, investigating whether reviewers anchor to their original opinions. For concreteness, we instantiate our study in the setting of conference peer review, a large human-centric system that has been widely adopted in computer science academia. (In computer science, leading conferences are typically rated at least on par with leading journals, with full paper submissions and competitive acceptance rates of 15–25%, and are often terminal venues for publication.) Across conference peer review, the author response mechanism is termed the “rebuttal stage”. It is placed between the initial reviews and the final review score decisions, and is an opportunity for the author(s) to provide additional information or arguments in response to the initial reviews. (Depending on the specific review setting, there may also be alternative forms of information made available to the reviewer, such as the evaluations of other reviewers. In this work, we focus on author rebuttals due to their widespread use and the frequently raised questions about their efficacy.) In computer science conferences, rebuttal stages are a widely adopted practice, with a large number of recent conferences having instituted such periods [1, 2].

Despite its pervasiveness, there is so far mixed evidence regarding the usefulness of rebuttals. A program chair of the NAACL 2013 conference described the rebuttal phase as “useless, except insofar as it can be cathartic to authors and thereby provide some small psychological benefit” [3]. A study on the NeurIPS 2016 conference found that only 4180 of 12154 (34.4%) reviews had reviewers participate in the discussion after the rebuttal, and only 1193 (9.8%) of reviews subsequently changed in score [4]. Furthermore, adjustments in reviewer scores do not necessarily affect paper decisions: in the ACL 2018 conference, 13% of reviewer scores changed after rebuttals, but the number of papers whose acceptances were likely affected was only 6.6% [1]. In addition, authors from various conferences have shared many anecdotes on social media regarding the limited impact of their rebuttal statements on reviewer evaluations, including cases where they had written a strong rebuttal but reviewers did not respond to it in a fair and reasonable way [5–7]. Rogers and Augenstein [8] find that in the natural language processing community, Twitter posts drastically spike both during the rebuttal phase and at acceptance notifications (corresponding to when authors create their rebuttals and when they see the results after rebuttals), with these tweets often including bitter complaints and reform suggestions.

One potential explanation behind the limited effect of the rebuttal stage on overall acceptances is that, due to anchoring, reviewers are simply not changing their scores as much as they should. Anchoring [9] is formally defined as the bias where people who make an estimate by starting from an initial value and then adjusting it to yield their answer typically make insufficient adjustments. Anchoring effects have been found in many applications, including responses to factual questions, probability estimates, legal judgments, purchasing decisions, future forecasting, negotiation resolutions, and judgements of self-efficacy [10–15]. However, despite the high stakes of peer review, anchoring has not yet been studied in the context of conferences and the rebuttal process.

1.1 Research question

In this paper, we test for the existence of anchoring in reviewers to verify whether reviewers are biased in a systematic manner. Our research question compares the following two scenarios in which a reviewer evaluates an academic paper.

  • Scenario A: The reviewer evaluates the paper’s quality and provides a set of numeric scores (termed initial scores). The reviewer is then presented with additional evidence proving that their initial evaluation was mistaken. Subsequently, the reviewer optionally adjusts their previous scores to new values (termed revised scores).

  • Scenario B: The reviewer is simultaneously presented with the same paper and the additional evidence from the previous scenario. They then provide a numeric evaluation of the paper’s quality (termed control scores).

Here, scenario A is a situation that may occur in a typical rebuttal process. Scenario B is a counterfactual where the additional evidence of scenario A is incorporated into the paper and presented to the reviewer during their initial reading of the paper. If anchoring is present in the rebuttal process, reviewers’ revised scores in scenario A would remain closer to their lower initial scores, and not be identical to the scores they would have given if they had been in scenario B. In aggregate, this would lead to a muted change in acceptances and a less effective rebuttal process.

Altogether, we study the following research question: Are the revised scores given by reviewers when placed in scenario A lower than the control scores that those reviewers would have given if they had been placed in scenario B?

We hypothesize that, in line with the existing literature on anchoring, reviewers in scenario A will anchor to their initial review scores, causing their revised scores to be lower than the control scores they would have given if they had been in scenario B.
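The research question and hypothesis above can be written compactly as a pair of statistical hypotheses. The following is a sketch under stated assumptions rather than the authors' formal statement: the symbol R for the revised scores is introduced here for illustration (C denotes the control scores), and the form of the null and alternative mirrors the one-sided comparison of score distributions that the paper describes for its analysis.

```latex
% R: revised scores from scenario A (symbol assumed for illustration)
% C: control scores from scenario B
\begin{align*}
H_0 &: \; R \overset{d}{=} C
      && \text{(no anchoring: identical score distributions)} \\
H_1 &: \; \Pr(C > x) \ge \Pr(R > x) \ \text{for all } x,
      \ \text{strictly for some } x
      && \text{(anchoring: $C$ stochastically greater than $R$)} \\
A   &= \Pr(C > R) + \tfrac{1}{2}\,\Pr(C = R)
      && \text{(effect size; } A = 0.5 \text{ under } H_0\text{)}
\end{align*}
```

Under this reading, an effect size A noticeably above 0.5 would point toward anchoring, while A near 0.5 is what one would expect if reviewers fully update.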

1.2 Our contributions

To answer the research question, we designed and conducted a study to analyze the reviewer anchoring effect.

  1. We recruited 108 participants who have recently published in a computer science-related field and are currently pursuing or have completed their PhD, and randomly assigned them to the control or experimental group. Each participant was placed in the role of a reviewer in a mock conference setting and was asked to review one paper.

  2. We constructed a fake paper for participants to review, and showed different versions of the paper to the different groups. The control group was given a paper with an animated GIF graphic (shown in Fig 1A) that contains the main evaluation results of the paper’s proposed framework, while the experimental group was instead given a frozen frame of the GIF (Fig 1B) that showed a much weaker result. After experimental group participants completed their review, they were falsely told that the GIF had frozen as the result of a technical error, and were shown the proper animated GIF, upon which they were given the opportunity to revise their scores. Our experiment was carefully designed to avoid several confounders and challenges in simulating an anchoring effect under the rebuttal setting, which we detail in Section 3.1.1.

  3. For the paper, each reviewer was asked to provide an overall score, five category scores, and text comments justifying each category score. We collected this data once from the control group (control scores) and twice from the experimental group (initial and revised scores). We also collected participant data such as self-reported confidence, PhD year and institution. The de-identified data and analysis code are available on GitHub at https://github.com/theryanl/ReviewerAnchoring.

  4. In our analysis, we first checked whether our GIF manipulation created a difference in reviewer ratings. We compared the initial scores and control scores, in both the Overall rating and the Evaluation category (which directly corresponds to the aspect of the paper we manipulated). We conducted a one-sided permutation test with the Mann-Whitney U statistic and measured the effect size in terms of the Vargha-Delaney A [16], representing the probability that a randomly-chosen control score is greater than a randomly-chosen experimental score (breaking ties uniformly at random). We found that the initial scores were lower than the control scores in both the Evaluation category (effect size A = 0.64, p = 0.0096) and Overall scores (effect size A = 0.59, p = 0.058), with moderate effect sizes. Thus, our experimental setup successfully introduced a difference in paper quality that enabled our test for anchoring.

    To test for the anchoring effect, we compared the revised scores with the control scores using a one-sided permutation test with the Mann-Whitney U statistic. We did not find significant evidence of reviewer anchoring in either the Overall scores (effect size A = 0.50, p = 0.61) or Evaluation category scores (effect size A = 0.49, p = 0.61).
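The comparison described in item 4 can be sketched in Python as follows. This is a minimal illustration and not the authors' analysis code, which is available in the GitHub repository cited above. The function names, the default permutation count (the paper reports using 100,000 permutations in Section 3.3), the add-one p-value correction, and the example scores are all assumptions made for illustration.

```python
import numpy as np


def mann_whitney_u(x, y):
    """Count pairs (x_i, y_j) with x_i > y_j; each tie counts as 1/2."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.sum(x > y) + 0.5 * np.sum(x == y))


def vargha_delaney_a(x, y):
    """Normalize U to [0, 1]: estimated P(x > y) + 0.5 * P(x == y)."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


def permutation_test(x, y, n_permutations=10_000, seed=0):
    """One-sided permutation test of H1: x is stochastically greater than y.

    Returns (effect size A, p-value). The add-one correction keeps the
    p-value strictly positive; it is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = mann_whitney_u(x, y)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # Relabel groups at random and recompute the statistic.
        if mann_whitney_u(shuffled[:n_x], shuffled[n_x:]) >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed / (len(x) * len(y)), p_value


# Hypothetical scores on a 1-4 scale, NOT data from the study.
control_scores = [3, 4, 3, 2, 4, 3, 3, 4]
initial_scores = [2, 3, 2, 2, 3, 1, 3, 2]
effect_size, p = permutation_test(control_scores, initial_scores)
print(f"A = {effect_size:.2f}, p = {p:.4f}")
```

Passing the control scores as the first argument matches the direction of the effect size described above: A is the estimated probability that a control score exceeds an experimental score, with ties split evenly.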

Fig 1. The evaluation results used in the fake paper.

A: Chronological frames (from left to right) demonstrating the animated result GIF. B: Frozen GIF initially shown to the experimental group. The animation compresses the existing data points to the left, introducing more data points to the right in chronological fashion. Larger improvements on the y-axis correspond to a better evaluation result for the paper’s method. The baseline is the leftmost point in all frames, 2.21 on a 1–5 scale. In the frozen figure (B), the rightmost point is 2.23, representing an improvement of 0.02 (< 2%). In the animated figure (A), the rightmost point is 2.63, representing an improvement of 0.4 (> 33%). The animated figure can be viewed at https://github.com/theryanl/ReviewerAnchoring/blob/main/fake_paper/images/animated_plot.gif.

Although our experiment imitates a specific rebuttal process in conference peer review, we take the first step in extending the literature on anchoring bias to the academic peer review setting, where individual expertise and knowledge may interact differently with human biases. To our knowledge, this is the first randomized controlled trial on anchoring in peer review. Our work could potentially be informative for similar academic settings, such as anchoring in reviewer discussion phases and longer-term author feedback processes.

In the following sections, we give a more comprehensive view on our work. In Section 2, we give context to how our work fits into the broader literature on conference peer review and human biases. In Section 3, we detail our experimental design, data collection, and analysis methods, and describe the various challenges that our design addresses. In Section 4, we report the results for our analyses. In Section 5, we present the takeaways and discuss the limitations for our current work, and propose directions for future research.

2 Related work

In this section, we give a brief outline of the work done in several areas: Research done to improve the conference peer review process, studies on cognitive biases in academic reviewers, sources relating to the rebuttal process in particular, and psychology literature regarding the anchoring bias.

2.1 Conference peer review

Conference peer review has been an increasingly active area of research due to the need for automated and scalable solutions, especially in the field of computer science [17]. Work has focused on improving the quality of reviewer assignments [18–22], providing robustness to malicious behavior [23–25], and addressing issues of miscalibration [26, 27] and subjectivity [28] between reviewers. Of particular relevance is the literature investigating cognitive biases in reviewers. These include studies on confirmation bias [29], commensuration bias [30], the effects of revealing author identities to reviewers [31–34], reviewer herding [35], resubmission bias [36], citation bias [37], and others [38]. Other works propose methodology for detecting such biases [33, 39].

Research has also focused on the reviewer discussion phase of peer review, which has some similarities to the rebuttal process we study. Most peer review processes include a reviewer discussion phase after initial reviews are submitted, where reviewers can read and respond to each other’s reviews. Similar to the author rebuttal process, reviewers are allowed to update their reviews after receiving this new information. Several studies [40–42] on reviewer discussions in grant proposal reviews have found that disagreement between reviewers greatly decreases after discussion, indicating that reviewers do update their scores to reach consensus. In one experiment [43], 47% of reviewers updated their review scores after being shown scores from other fictitious reviewers. The authors of [35] conducted a randomized controlled trial in the ICML conference to investigate the existence of herding in reviewer discussions, but found no evidence for this effect. While these studies provide insights into how reviewers update their opinions, the present work focuses specifically on anchoring in the rebuttal process.

2.2 Rebuttal processes

Many conference organizers have analyzed the rebuttal process within their own conferences, and the common finding is that rebuttals only make a meaningful difference to a small fraction of submissions. Out of the 2273 rebuttals at CHI 2020, 931 (41%) did not result in a mean score change, 183 (8%) resulted in an absolute mean score change of 0.5 or more, and only 6 (0.3%) saw the mean score change by 1 or more [44]. In ICML 2020, only 43% of reviewers updated their review in response to author rebuttals [45]. In ACL 2018, 13% of review scores changed after rebuttals, affecting 26.9% of all papers, but only 6.6% of papers were likely impacted in terms of acceptance [1]. At the same venue, though author responses had a marginal but statistically significant influence on final scores, a reviewer’s final score was largely determined by their initial score and distances to scores given by other reviewers [46].

Despite these statistics, there is overwhelming support for the rebuttal stage from the research community. A set of surveys from PLDI 2015 [47] found that authors strongly value the rebuttal process; 96% of authors agreed (with 88% strongly agreeing) that they should be provided the opportunity to rebut reviews. Meanwhile, only 44% of authors agreed with the statement that their reviews were constructive and professional, and only 41% of authors agreed that their reviewers had sufficient expertise. Rogers and Augenstein [8] found that both the rebuttal stage and the acceptance results after rebuttals yield large increases in the number of tweets in the NLP research community, often including bitter complaints and reform suggestions. In an author survey for IEEE S&P 2017 [48], which did not have a rebuttal phase, approximately 30% of less experienced and 20% of experienced authors felt that they could have convinced their reviewers to accept their paper if they had been given an opportunity for a rebuttal. Together, these results send the message that authors are often dissatisfied with their reviews, and that they strongly value the rebuttal mechanism as a method to address bad reviewing.

2.3 Anchoring bias

Anchoring (more specifically, the anchor-and-adjust hypothesis) was initially described by Tversky and Kahneman [9], who defined it as the effect where people who make an estimate by starting from an initial value and then adjusting it to yield their answer typically make insufficient adjustments. The initial value can be irrelevant to the question asked, and can also be a partial computation by the person themselves. One basis to interpret this behavior [49] is to view it as a cognitive shortcut: to reduce the mental strain of incorporating new evidence, individuals take their starting estimate and integrate new information in a naive, insufficient way. The anchoring effect has been shown to be present in a variety of domains and applications [10–15]. However, to our knowledge, our study is the first randomized controlled trial to analyze whether reviewers exhibit anchoring behaviors in peer review.

3 Methods

In this section, we describe the experiment we conducted and the analysis methods we employed to investigate the research question specified in Section 1. We first define the experimental procedure along with associated justifications, and then describe participant recruitment and data collection. Lastly, we describe the analysis we performed on the data. Our research question and study design were pre-registered at https://aspredicted.org/W94_GD3. This experiment was approved by the Carnegie Mellon University Institutional Review Board (Federalwide Assurance No: FWA00004206, IRB Registration No: IRB00000603).

3.1 Experiment design

In this subsection, we first describe the challenges inherent to this problem setting before concretely defining the experimental procedure. We then articulate how our key design choices allow us to surmount these challenges.

3.1.1 Challenges for the design

First and foremost, our hypothesis cannot be tested with an experiment in a real conference environment as it is impossible to control the quality of papers and the strength of rebuttals. Thus, we carefully designed an environment for our experiment that simulates a real conference. In designing our experiment and simulated environment, we address four main challenges:

  1. Clarity and objectivity of the quality of rebuttal. In a real conference environment, the impact of a rebuttal argument on its paper’s quality is often subjective. This makes it hard to distinguish between an anchoring effect and a genuine belief that the rebuttal was weak. In our experiment, the rebuttal must clearly and objectively improve the quality of the paper. Furthermore, the participants chosen need to be able to detect this improvement. Lastly, the rebuttal should be meaningful no matter what participants write in their initial review.

  2. Addressing “author mistake” confounder. When reviewing, reviewers find and comment about mistakes in the submission that are important to the quality of the paper. Even when authors address these mistakes, if these mistakes were influential enough in the first place, reviewers may choose to take them into account and penalize the authors by giving a lower score. In this study, we explicitly choose to focus on anchoring with respect to reviewer opinions about the paper itself and not their opinions about the authors. As such, we label this phenomenon as the author mistake confounder, and consider it to be distinct from the anchoring effect in our research question. In our experiment, we want to account for this confounder, and separate its effects from the anchoring bias.

  3. Equality of the experimental and control experiences. In the experiment, we want to compare between an experimental group, which sees a rebuttal and adjusts their scores, and a control group, which gives the ground truth scores that the experimental group should ideally adjust to. In order to make a meaningful comparison between groups, we want the control group’s paper to be equivalent to the experimental group’s initial paper combined with the rebuttal. In the traditional conference form, this is paradoxical to recreate; rebuttals are constructed to directly address initial reviews, but the control group cannot give initial reviews without being potentially subjected to anchoring bias themselves.

  4. Participant obliviousness to true purpose of study. Since anchoring would usually be unnoticed by reviewers themselves, it is important to replicate this condition in the experiment. Informing participants of the true purpose of the study could potentially change their behavior according to the demand characteristics effect [50]. In our experiment, we need to conceal the purpose of the study and make it such that participants do not suspect that the study concerns reviewer anchoring.

Addressing challenge 1 enables us to measure an anchoring effect if it exists, while addressing challenges 2–4 ensures that in the absence of an anchoring effect, the ratings received from the control and experimental groups should be equivalent.

These challenges are difficult to address simultaneously. For example, consider a simple experimental design in which reviewers are randomly assigned to either a high-quality or a low-quality version of a paper; then, after the reviews, experimenters construct a rebuttal to address the points raised in the review. The criticisms raised by the reviewers could concern naturally subjective topics such as the paper’s significance. In these cases, we would not be able to refute the reviewer with an objective response in the rebuttal and would struggle to distinguish reviewer anchoring from genuine subjective beliefs (challenge 1). Since the errors in the low-quality paper are due to mistakes by the authors, we would not be able to distinguish between reviewers exhibiting anchoring and reviewers penalizing the author mistakes (challenge 2). Even for the same version of the paper, the criticisms raised by reviewers will likely be widely varied in topic. Thus, if the same rebuttals are used for all reviews, the rebuttals may not match the concerns in each review (challenge 1). Alternatively, if the experimenter generates individualized rebuttals for each review, we cannot guarantee that the post-rebuttal version of the low-quality paper has equivalent quality to the high-quality paper (challenge 3). Finally, if the experiment places significant focus on the rebuttal, participants may suspect the true purpose of the study and modify their behavior accordingly (challenge 4).

3.1.2 Experimental procedure

In this subsection, we present our experimental procedure, which addresses each of the aforementioned challenges.

3.1.2.1 Experimental setting. The experiment procedure consists of a 30-minute, 1-on-1 Zoom meeting with each participant. Each participant takes the role of a reviewer for one paper within a simulated peer review process, and all participants review the same paper. A snapshot of the paper reviewed is provided in Fig 2. Participants are falsely told that the purpose of the study is to analyze the effect of new types of media (such as animations) on reviews, and are informed that the paper should be reviewed as a submission to an application-focused track of a large AI conference. Participants are given a reviewer form constructed based on the reviewer guidelines in the AAAI 2020 [51] and NeurIPS 2022 [52] conferences. The reviewer form contains scores in five sub-categories {Significance, Novelty, Soundness, Evaluation, Clarity}, one sentence justifications for these scores, as well as an Overall score and a confidence rating. Following the fictitious purpose of the study, the form also asked participants to “Please comment on the use of animated figures. (If you did not see this form of media, please answer ‘N/A’)”. This question regarding animated figures plays an important part in our experimental intervention, which we detail in the following paragraph. After the review, we also record participants’ institution, program, and year of study.

Fig 2. A snapshot of the constructed paper reviewed by participants.

The paper is hosted online and viewed through the participant’s browser, allowing for the natural use of an animated GIF figure.

3.1.2.2 Intervention. The key difference between the conditions lies in the presentation of the main evaluation result of the paper. In the control group, this result is presented as an animated GIF graphic (shown in Fig 1A), whereas the experimental group is initially presented with a broken version of the GIF that is stuck on the first frame (Fig 1B), which shows a significantly weaker result. Then, when experimental group participants are asked the aforementioned question to comment on animated figures, they indicate that they have not seen any by answering ‘N/A’. After these participants submit their reviews, the experimenter deceives them by saying that their answer was unexpected and that they should have seen an animated figure. In parallel, the experimenter secretly changes the contents of the webpage displaying the paper such that all new visits see the animated GIF in the paper working properly. The experimenter then suggests that the participants refresh the website, upon which the animation loads and they are asked to revise their scores and comments accordingly.

We performed a pilot study with 14 participants before full deployment to test for feasibility and practice the deception. For more details on the deception and score revision process, as well as how deviations from the procedure due to unexpected participant behavior were handled, we refer the reader to S1 Appendix. All of the instructions, interfaces, and the paper contents are available at https://github.com/theryanl/ReviewerAnchoring.

3.1.3 Design justification

We now highlight some key aspects of our experimental design and how they address the aforementioned challenges.

  • Construction of the reviewed paper. In order to ensure that the change in quality between the initial and revised versions of the paper was clear and objective (challenge 1), we manually constructed a single paper for all participants to review. The initial and revised versions differed in the paper’s numerical results, as this was an area where the paper’s quality could be changed objectively. To make the change in quality clearer, the results between the initial and revised/control versions of the paper were very different, and the paper was constructed to emphasize this result. Additionally, we made the paper heavily application-focused and made its metrics easily interpretable such that our participants (who were at minimum computer science PhD students) would not need any specific technical background to interpret the results.

  • Technical error in displaying the GIF. In the experimental group, the issue in the initial version of the paper was presented as the result of a technical error (the frozen GIF). Since the error was clearly not attributable to the authors, reviewers could not reasonably justify reflecting the error in their scores, which allowed us to circumvent the author mistake confounder (challenge 2). Additionally, the frozen GIF issue in the initial paper could be corrected for all participants regardless of the specifics of their review. Thus, we were able to ensure that the change seen by the experimental group was both relevant and identical across participants (challenge 1), while the changed paper was also equal to the paper reviewed by the control group (challenge 3).

  • Deceptive experimental purpose. We created the alternate experimental purpose, “To study the effect of new types of media on reviews”, to accomplish three objectives. First, we were able to justify the perceived experimental procedure without mentioning anchoring to participants (challenge 4). Second, we enabled the natural use of animated GIFs in the paper, while not raising suspicion in the case where no GIF was seen. Third, we were able to naturally include the question asking for comments on the use of animated figures. On one hand, this enabled the experimenter to easily convince participants that there was a technical error by citing their answer. On the other hand, it allowed for the experimenter to naturally ask the participant to refresh the page, allowing the change in the paper to be shown immediately after. Participants were debriefed about the deception and true purpose of the experiment immediately after the study.

3.2 Participation and data collection

We recruited 108 participants, who were separated at random into control and experimental groups and were unaware of their assignment. Participants were either PhD students or PhDs with at least one publication in a computer science-related field in the last 5 years (see Table 1). Participants were recruited across nine research universities in the United States through various methods including physical posters, university mailing lists, and social media posts (see S3 Appendix). We conducted a power analysis to determine the target number of participants (see S2 Appendix). As a large fraction of reviewers in computer science conferences are PhD students (e.g., 33% in the NeurIPS 2016 conference [4]), our participant pool is fairly representative of the conference reviewer population we aim to study.

Table 1. Distribution of participant years of study.

| Year of PhD studies | 1st | 2nd | 3rd | 4th | 5th | 6th+ | Post-PhD |
|---|---|---|---|---|---|---|---|
| # Participants | 17 | 28 | 18 | 20 | 12 | 6 | 7 |

For each participant, we gathered the following data:

  1. Overall scores on a 1–10 scale.

  2. Category scores in {Significance, Novelty, Soundness, Evaluation, Clarity} on a 1–4 scale and 1-sentence comments justifying each.

  3. Confidence in their evaluation on a 1–5 scale.

  4. Comments on the hyperlinks and animated figures.

  5. Participant-specific information: Institution, program and (if PhD student) year.

The score categories and scales were modeled after those of NeurIPS and AAAI, two of the largest annual computer science conferences. In the experimental group, participants were given a chance to revise all review information after seeing the figure change. In this case, both initial and revised versions were recorded. This resulted in the collection of 3 different sets of data: scores from the control group, initial scores from the experimental group, and revised scores from the experimental group.
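The collected fields and their scales can be pictured as a small record type. The sketch below is purely illustrative: the class names, field names, and validation rules are assumptions and do not describe the schema of the authors' released data. It encodes the 1–10 Overall scale, the 1–4 category scale, the 1–5 confidence scale, and the fact that only the experimental group has a revised review.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CATEGORIES = ("Significance", "Novelty", "Soundness", "Evaluation", "Clarity")


@dataclass
class Review:
    overall: int                     # Overall score, 1-10
    category_scores: Dict[str, int]  # one score per category, 1-4
    confidence: int                  # confidence in the evaluation, 1-5
    category_comments: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if not 1 <= self.overall <= 10:
            raise ValueError("overall score must be in 1..10")
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must be in 1..5")
        if set(self.category_scores) != set(CATEGORIES):
            raise ValueError("exactly one score per category is required")
        if any(not 1 <= s <= 4 for s in self.category_scores.values()):
            raise ValueError("category scores must be in 1..4")


@dataclass
class ParticipantRecord:
    group: str                               # "control" or "experimental"
    first_review: Review                     # control scores, or initial scores
    revised_review: Optional[Review] = None  # experimental group only
    institution: str = ""
    program: str = ""
    phd_year: Optional[str] = None           # None for post-PhD participants

    def __post_init__(self):
        if self.group not in ("control", "experimental"):
            raise ValueError("unknown group")
        if self.group == "control" and self.revised_review is not None:
            raise ValueError("control participants submit a single review")


# Hypothetical experimental-group participant, NOT data from the study.
initial = Review(5, {c: 2 for c in CATEGORIES}, confidence=3)
revised = Review(7, {**{c: 2 for c in CATEGORIES}, "Evaluation": 3}, confidence=3)
record = ParticipantRecord("experimental", initial, revised, phd_year="3rd")
```

Keeping the initial and revised reviews as separate objects on one record makes the three score sets (control, initial, revised) easy to extract for the later comparisons.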

After the study, we asked participants a few questions to determine the effectiveness of the deception and ensure that they were oblivious to the true study purpose (i.e., challenge 4 in Section 3.1.1). Before debriefing participants, we asked them if they suspected that the study featured deception; if they answered affirmatively, we asked them to describe what they believed the true study purpose was. If they were able to detect that we deceived them on the study purpose and specifically identify that the true purpose was about re-reviewing or rebuttals, we would exclude them from the study. Along with this, we also included two trivial exclusion criteria: (i) If participants do not consent to their data being collected for the true study purpose, and (ii) if participants do not finish the study. For reference, participants were compensated $20 for participation in the study, and were allowed to withdraw at any time for partial ($10-$15) compensation. No participants withdrew or were excluded due to these criteria (or for any other reason), demonstrating the effectiveness of the deception in our experiment design.

3.3 Analysis

We first performed a preliminary test of the validity of our experimental setup by comparing the initial scores I provided by the experimental group with the scores C provided by the control group. If our experimental setup was successful at inducing a perceived difference in paper quality, we should see that the initial experimental scores are generally lower than the control scores. To compare the distributions of these scores, we performed a non-parametric test of the null hypothesis that the control and initial scores have the same distribution. Specifically, we conducted a one-sided permutation test (with 100000 permutations) with the Mann-Whitney U statistic against the alternative hypothesis that the distribution of the control scores C is stochastically greater than the distribution of the initial scores I. The test statistic is

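$$U = \sum_{C_i \in C} \sum_{I_j \in I} S(C_i, I_j), \quad \text{where } S(a, b) = \begin{cases} 1 & \text{if } a > b \\ 1/2 & \text{if } a = b \\ 0 & \text{if } a < b. \end{cases} \tag{1}$$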

We performed two tests between these groups, comparing both the Overall scores and the Evaluation category scores. We chose to analyze the Evaluation category, defined as “a score for how its evidence supports its conclusions […]”, as we expected our experimental manipulation to have the greatest effect in this category. Across the two tests, we controlled the false discovery rate using the Benjamini-Hochberg correction under the assumption that the test statistics are positively dependent [53], and the p-values we report are adjusted for this correction [54]. As effect sizes, we also report point estimates of the Vargha-Delaney A statistic [16], computed as $A = U / (|C| \cdot |I|)$, along with 95% bootstrapped confidence intervals (using 100000 samples).
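The test statistic in Eq (1), the effect size A, and the one-sided permutation test can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' released code: the function names, the add-one smoothing of the p-value, the reduced number of permutations, and the toy scores are all assumptions made for this example.

```python
import random


def pair_score(a, b):
    """S(a, b) from Eq (1): 1 if a > b, 1/2 if tied, 0 otherwise."""
    if a > b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0


def mann_whitney_u(control, other):
    """U: sum of S(c, x) over every (control, other) pair of scores."""
    return sum(pair_score(c, x) for c in control for x in other)


def vargha_delaney_a(control, other):
    """A = U / (|C| * |I|), a value in [0, 1]."""
    return mann_whitney_u(control, other) / (len(control) * len(other))


def permutation_p_value(control, other, n_permutations=10000, seed=0):
    """One-sided p-value: share of label shufflings with U >= observed U."""
    rng = random.Random(seed)
    observed = mann_whitney_u(control, other)
    pooled = list(control) + list(other)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mann_whitney_u(pooled[:n_control], pooled[n_control:]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive
    # (an assumed convention, not stated in the paper).
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy scores, NOT data from the study.
toy_control = [6, 7, 5, 8, 6, 7]
toy_initial = [5, 6, 4, 6, 5, 7]
print(vargha_delaney_a(toy_control, toy_initial))
print(permutation_p_value(toy_control, toy_initial, n_permutations=2000))
```

Because the null hypothesis states that both groups share one distribution, group labels are exchangeable under it, which is what justifies shuffling the pooled scores.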

Our primary analysis aims to detect anchoring in reviewers. To test for the anchoring effect, we compared the revised scores R provided by the experimental group with the scores C provided by the control group. For this, we performed a non-parametric test of the null hypothesis that the control and revised scores have the same distribution. We again used a one-sided permutation test with the Mann-Whitney U statistic against the alternative hypothesis that the distribution of the control scores is stochastically greater than the distribution of the revised experimental scores. The test statistic is

$$U = \sum_{C_i \in C} \sum_{R_j \in R} S(C_i, R_j). \tag{2}$$

We performed two tests to compare both the Overall scores and the Evaluation category scores, and again controlled the false discovery rate at α = 0.05 across the two tests using the Benjamini-Hochberg correction (again assuming positive dependence). We report the Vargha-Delaney A statistic as the effect size, with estimates computed as $A = U / (|C| \cdot |R|)$.
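To illustrate how Benjamini-Hochberg adjusted p-values behave for a family of two tests, the sketch below applies the standard step-up adjustment. The function name and the raw p-values are hypothetical and chosen only for illustration. The raw p-values from the study are not reported in this section.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for two tests (not from the study).
print(benjamini_hochberg([0.01, 0.04]))  # smaller p doubled, larger unchanged
print(benjamini_hochberg([0.5, 0.4]))    # both adjusted values coincide
```

With two tests, the larger raw p-value is left unchanged and the smaller one is doubled, then capped at the larger. When doubling the smaller value exceeds the larger one, both adjusted p-values are equal, which is one way identical adjusted values for a pair of tests can arise.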

As stated in Section 3, our research question and study design were pre-registered. However, the analysis specified here differs from the analysis plan specified in the preregistration. In the preregistration, the test statistic was specified to be the difference between the mean scores of each group, and only the Overall scores were to be analyzed. However, as the scores are not necessarily on a linear scale (in fact, they were each given a description on the review form), the arithmetic means of the scores are not as meaningful. We also analyzed Evaluation category scores since our experimental design specifically manipulates the paper quality in this category. The tests of the validity of our experimental setup were also not preregistered. The preregistered original analysis is available at https://aspredicted.org/W94_GD3.

Code for all analyses is provided at https://github.com/theryanl/ReviewerAnchoring.

4 Results

4.1 Main results

The results of our main hypothesis tests introduced in Section 3.3 are reported in Table 2.

Table 2. Results of comparisons between initial or revised scores from the experimental group and scores from the control group, with respect to both Overall and Evaluation scores.

Experimental Condition Score Type A 95% Confidence Interval p-value Experimental Condition Mean Control Mean
Initial Overall 0.5857 [0.4793, 0.6881] 0.0575 5.519 6.037
Initial Evaluation 0.6375 [0.5381, 0.7327] 0.0096 1.908 2.352
Revised Overall 0.5048 [0.3981, 0.6109] 0.6064 5.907 6.037
Revised Evaluation 0.4863 [0.3836, 0.5888] 0.6064 2.389 2.352

The effect size A is the Vargha-Delaney A statistic, with 95% confidence intervals constructed via bootstrap. Benjamini-Hochberg adjusted p-values are reported. The rightmost two columns show the mean Overall (1–10 scale) or Evaluation (1–4 scale) scores within the experimental group (initial or revised score) and the control group.
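A bootstrapped confidence interval for A can be sketched as follows. The text does not state which bootstrap variant was used, so this example assumes a simple percentile bootstrap that resamples each group independently with replacement. The function names, the number of resamples, and the toy scores are hypothetical.

```python
import random


def vargha_delaney_a(control, other):
    """A = U / (|C| * |other|), counting ties as one half."""
    wins = sum((c > x) + 0.5 * (c == x) for c in control for x in other)
    return wins / (len(control) * len(other))


def bootstrap_ci(control, other, n_samples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for A (assumed variant)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Resample each group with replacement, keeping group sizes fixed.
        c = rng.choices(control, k=len(control))
        x = rng.choices(other, k=len(other))
        estimates.append(vargha_delaney_a(c, x))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_samples)]
    upper = estimates[min(n_samples - 1, int((1.0 - tail) * n_samples))]
    return lower, upper


# Hypothetical toy scores, NOT data from the study.
toy_control = [6, 7, 5, 8, 6, 7]
toy_revised = [6, 6, 5, 7, 6, 8]
print(bootstrap_ci(toy_control, toy_revised))
```

Smaller groups yield wider intervals under this procedure, which is consistent with the wide intervals reported for the subgroup analyses.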

Our comparisons between the initial scores and control scores to test the validity of our experimental setup resulted in effect sizes A = 0.5857 with respect to the Overall scores (adjusted p = 0.0575) and A = 0.6375 with respect to the Evaluation category scores (adjusted p = 0.0096). The effect sizes can be interpreted as the probability that a randomly chosen control score is greater than a randomly chosen initial score, breaking ties uniformly at random. An effect size of A = 0.5 means that the two distributions are stochastically similar, and higher values of A indicate the extent to which the distribution of control scores is stochastically greater. If our experimental setup successfully created a perceived difference in paper quality between the conditions, we expect the control scores to be higher than the initial scores (corresponding to effect sizes A > 0.5). While both comparisons had moderate effect sizes, the comparison in the Evaluation category is significant at α = 0.01, while the comparison in Overall scores is significant at α = 0.1. This provides evidence that the paper quality was perceived as different between the two groups, although reviewers may not have reflected this difference as much in their Overall scores.
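As a rough check on this interpretation, the reported effect sizes can be converted back into approximate values of U, since A is U divided by the number of control-versus-initial pairs. The arithmetic below assumes that each group contributed 54 scores (the group size stated in Sections 4.2.2 and 4.2.4), and the results are only approximate because the A values in Table 2 are rounded to four decimal places.

```latex
\begin{align*}
|C| \cdot |I| &= 54 \times 54 = 2916 \\
U_{\text{Evaluation}} &\approx 0.6375 \times 2916 \approx 1859 \\
U_{\text{Overall}} &\approx 0.5857 \times 2916 \approx 1708
\end{align*}
```

Under these assumptions, roughly 1859 of the 2916 pairs (counting ties as one half) favored the control score in the Evaluation category, compared with 1458 pairs when A = 0.5.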

Given that our experiment successfully constructed an environment where anchoring could occur, we turn to our analysis of whether anchoring did occur. Our comparisons between the revised scores and control scores, which test for the anchoring effect, resulted in effect sizes A = 0.5048 with respect to the Overall scores (adjusted p = 0.6064) and A = 0.4863 with respect to the Evaluation category scores (adjusted p = 0.6064). Recall that in the presence of an anchoring effect, we expect the control scores to be higher than the revised scores (corresponding to effect sizes A > 0.5). Both statistics are insignificant at α = 0.1 (and would have been insignificant even without Benjamini-Hochberg correction), indicating that our analysis failed to reject the null hypothesis that reviewers do not anchor. In other words, we did not find any evidence of anchoring bias.

4.2 Supplemental results

In addition to the main test statistic, we also performed the following informal supplemental analyses. As these analyses were exploratory and data-dependent, the observations we made in these analyses should be interpreted primarily as motivation for future work and not as support for statistically significant conclusions.

4.2.1 Other category scores

In Table 3, we show the results of additional comparisons conducted between revised scores and control scores. We compared scores for each of the categories on the review form apart from the Evaluation category analyzed earlier. We used the same methodology as in our main analysis to compute the effect sizes and 95% confidence intervals. Overall, these results do not indicate that other categories showed signs of anchoring.

Table 3. Results of comparisons between revised and control scores in the remaining categories.
Category A 95% Confidence Interval Revised Mean Control Mean
Significance 0.5065 [0.4119, 0.6010] 2.78 2.83
Novelty 0.4746 [0.3745, 0.5736] 2.48 2.46
Soundness 0.4863 [0.3849, 0.5898] 2.76 2.69
Clarity 0.4609 [0.3614, 0.5621] 3.31 3.17

The effect size A is the Vargha-Delaney A statistic, with 95% confidence intervals constructed via bootstrap. The rightmost two columns show the mean scores (1–4 scale).

4.2.2 Confidence

We additionally conducted comparisons to investigate whether anchoring was associated with the self-reported confidence of reviewers. In Table 4, we separate participants into two groups based on their self-reported confidence score, given on a scale of 1–5: confident, where participants reported a score of 3 (“Fairly Confident”) or higher, and unconfident, where they reported a score of 2 (“Willing to defend”) or lower. This threshold between confident and unconfident reviewers was chosen before the analysis based on the stated descriptions of the scores. In both the control and experimental groups, there were 41 confident reviewers and 13 unconfident reviewers. We conducted comparisons between the revised Overall scores from the experimental group and the Overall scores from the control group, and found that confident reviewers had the same mean revised and control scores, while unconfident reviewers had generally lower revised scores (indicated by the A = 0.63 effect size). This could indicate that unconfident reviewers are more likely to exhibit anchoring. However, since there were fewer unconfident reviewers, the uncertainty around this effect size is large.

Table 4. Results of comparisons between revised and control Overall scores for both confident (3+, 1–5 scale) and unconfident (2-) reviewers.
Subgroup # Experimental # Control A 95% Confidence Interval Revised Mean Control Mean
Confident 41 41 0.4685 [0.3453, 0.5925] 6.00 6.00
Unconfident 13 13 0.6331 [0.4201, 0.8284] 5.62 6.15

The effect size A is the Vargha-Delaney A statistic, with 95% confidence intervals constructed via bootstrap. The rightmost columns show the mean scores (1–10 scale).

4.2.3 Seniority

Next, we split participants into less experienced (“junior”) and more experienced (“senior”) reviewers, and conducted a comparison between the revised and control Overall scores for each subgroup in Table 5. Junior reviewers were PhD year 3 and under, whereas senior reviewers were PhD year 4 and over or beyond their PhD. This threshold was chosen before the analysis to produce the most equally-sized groups. We found similar results across the two subgroups, suggesting that our study results may not depend on the large number of junior participants in our sample relative to real conference settings, though the uncertainty around the effect size is large.

Table 5. Results of comparisons between revised and control Overall scores, for both junior (PhD years 1–3) and senior (4+) reviewers.
Subgroup # Experimental # Control A 95% Confidence Interval Revised Mean Control Mean
Junior 26 37 0.4865 [0.3415, 0.6284] 5.96 6.00
Senior 28 17 0.5357 [0.3739, 0.6975] 5.86 6.12

The effect size A is the Vargha-Delaney A, with 95% confidence intervals constructed via bootstrap. The rightmost two columns show the mean scores (1–10 scale).

4.2.4 Counts of score changes

Though our main analysis did not find evidence of anchoring (as shown in Section 4.1), we observe that, consistent with the findings from previous conference organizers in Section 2.2, a majority of the reviewers in the experimental group did not change their given scores (see Table 6). Out of 54 experimental group participants, 15 (28%) changed their Overall score, with nine participants raising their Overall scores by 1 and six raising their Overall scores by 2. Meanwhile, 25 (46%) participants changed one or more category scores, with 22 (41%) participants including a change in the Evaluation category. Other category scores were changed by only a few participants, which was expected as our manipulation primarily targeted the Evaluation category. In Table 7, we further break down the scores and comments updated by experimental group participants.

Table 6. Breakdown of experimental group participants by those who changed either their Overall score or their score in at least one category.
Overall score unchanged Overall score changed Total
Category scores unchanged 28 1 29
Category scores changed 11 14 25
Total 39 15 54

Most (> 50%) participants changed neither their category nor their Overall scores.
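The marginal totals and percentages quoted in this subsection follow directly from the four cells of Table 6. The short sketch below recomputes them. The dictionary layout and variable names are illustrative choices, while the counts are transcribed from the table.

```python
# Cell counts transcribed from Table 6 (experimental group only).
table6 = {
    ("category unchanged", "overall unchanged"): 28,
    ("category unchanged", "overall changed"): 1,
    ("category changed", "overall unchanged"): 11,
    ("category changed", "overall changed"): 14,
}

total = sum(table6.values())
overall_changed = sum(
    n for (_, overall), n in table6.items() if overall == "overall changed"
)
category_changed = sum(
    n for (category, _), n in table6.items() if category == "category changed"
)
neither_changed = table6[("category unchanged", "overall unchanged")]


def percent(count):
    """Share of the experimental group, rounded to a whole percent."""
    return round(100 * count / total)


print(total, overall_changed, category_changed, neither_changed)
print(percent(overall_changed), percent(category_changed), percent(neither_changed))
```

The 28 participants who changed nothing correspond to about 52% of the group, which matches the statement that most participants changed neither their category scores nor their Overall score.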

Table 7. Number of experimental group participants (out of 54 total) who changed their scores or comments in each category and Overall.
Category # Participant scores changed # Participant comments changed
Significance 7 (13%) 9 (17%)
Novelty 1 (2%) 0 (0%)
Soundness 6 (11%) 5 (9%)
Evaluation 22 (41%) 31 (57%)
Clarity 0 (0%) 2 (4%)
Overall 15 (28%)

Due to timing constraints, comments on the Overall score were not collected.

5 Conclusion and discussion

In this paper, we presented the design and results of a randomized controlled experiment to test for reviewer anchoring bias in conference peer review. Our design carefully addresses various challenges and confounders through the employment of animated media, deception, and an overarching cover story.

Our main analysis did not find evidence of the existence of reviewer anchoring effects in peer review. In the absence of anchoring, the lack of change in scores and decisions observed in conference rebuttal phases may be due to other reasons, such as rebuttals having a relatively weak impact on the quality of the paper, or reviewers penalizing the paper for statements that were unclear or misunderstood in the initial submitted version. Another significant issue concerning the rebuttal process is the limited participation from reviewers [1, 4]. Regardless of the prevalence of anchoring, it is essential for conferences to address this lack of active participation in the review processes.

Our study had several limitations which we now discuss. One potential limitation was that our sample size could have resulted in insufficient statistical power to detect an effect. Although we estimated the sample size needed for our experiment using real conference data (see S2 Appendix), the variance in the collected scores was higher than that of the data we used. This variance in scores could have been due to the lack of a unifying context or set of norms that conference reviewers in the same subfield would have. Thus, future studies can consider recruiting participants with expertise in one particular subfield to help increase the calibration between reviewers.

Another possibility is that, even if anchoring is prevalent in real conference settings, the experimental conditions of our study failed to replicate the conference environment sufficiently to induce this same effect. For example, a common piece of feedback we received from participants in the study was that there was no context behind the result in the paper. Some participants expressed uncertainty in their review as to whether the weak initial result is significant, and retained this even for the larger corrected result. In contrast, reviewers in a real conference may have better knowledge to more accurately judge the significance of a paper’s contributions. In future studies, the aspects of the paper that are updated during the rebuttal may need to be more clearly interpretable to the entire study population, which could also be resolved by recruiting participants with expertise in a particular subfield.

Additionally, our experiment intentionally omits certain elements that are typically present in a real conference environment, some of which may be responsible for reviewer anchoring in the real setting. One such aspect is the social dynamic of reviewers. For example, if reviewers know that other reviewers and area chairs can observe their reviews, it is possible that they would choose to defend their initial position more due to concerns about their image in front of others. Similar social dynamics may be present when reviewers are asked to engage directly with authors in discussions. However, the social aspect may also introduce various confounding effects such as reviewers being influenced by the scores of other reviews [46]. We decided to forgo the capturing of these secondary social effects, instead leaving them to future work.

Another limitation of our work is that we ran our experiment with only one paper, which could make our findings less generalizable. There is precedent for research in which reviewers evaluate fake papers, and each of these studies constructed only 1 to 3 papers [55–59]. Due to the high sample size determined from the power analysis (see S2 Appendix) and the limited pool of eligible participants (see S3 Appendix), we chose to have one paper to reduce the sample size needed in order to test for statistical significance, as having multiple papers would require an additional random effect to be modeled. Future work may also include papers from multiple domains to bolster the generalizability of the study.

Finally, there are other variations of our research question that future work could consider. Our supplemental analysis with respect to reviewer confidence suggests that the answer to our research question may not be homogeneous across the entire reviewer pool. Future work may want to design experiments that more carefully take this consideration into account by testing for effects within subpopulations.

Supporting information

S1 Appendix. Deception and revision process. (PDF: pone.0301111.s001.pdf, 48.9 KB)

S2 Appendix. Power analysis. (PDF: pone.0301111.s002.pdf, 93.8 KB)

S3 Appendix. Participant recruitment. (PDF: pone.0301111.s003.pdf, 46.2 KB)

Data Availability

All code and data files are available on GitHub at the following repository: https://github.com/theryanl/ReviewerAnchoring.

Funding Statement

NS received the National Science Foundation CAREER Award 1942124 (https://www.nsf.gov/). NS and FF received the National Science Foundation Division of Information and Intelligent Systems award 2200410 (https://new.nsf.gov/cise/iis). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Dershowitz N, Verma RM. Rebutting Rebuttals. Communications of the ACM. 2023. doi: 10.1145/3584664
  • 2. Frachtenberg E, Koster N. A survey of accepted authors in computer systems conferences. PeerJ Computer Science. 2020;6. doi: 10.7717/peerj-cs.299
  • 3. Daumé III H. Some NAACL 2013 statistics on author response, review quality, etc. Natural Language Processing Blog. 2015.
  • 4. Shah NB, Tabibian B, Muandet K, Guyon I, Von Luxburg U. Design and analysis of the NIPS 2016 review process. The Journal of Machine Learning Research. 2018;19(1):1913–1946.
  • 5. Huang F. Eye-opening rebuttal #NeurIPS22; 2022.
  • 6. Upadhyay J. Reviewer: why not paper X + noise work? Rebuttal:; 2020.
  • 7. Modi D. I had a racist reviewer who said this work is from India and its authenticity is doubtful.; 2020.
  • 8. Rogers A, Augenstein I. What Can We Do to Improve Peer Review in NLP? In: Findings of the Association for Computational Linguistics: EMNLP 2020; 2020. p. 1256–1262.
  • 9. Tversky A, Kahneman D. Judgment under Uncertainty: Heuristics and Biases. Science. 1974;185(4157):1124–1131. doi: 10.1126/science.185.4157.1124
  • 10. Furnham A, Boo HC. A literature review of the anchoring effect. The Journal of Socio-Economics. 2011;40(1):35–42. doi: 10.1016/j.socec.2010.10.008
  • 11. McAlvanah P, Moul CC. The house doesn’t always win: Evidence of anchoring among Australian bookies. Journal of Economic Behavior & Organization. 2013;90:87–99. doi: 10.1016/j.jebo.2013.03.009
  • 12. Bucchianeri GW, Minson JA. A homeowner’s dilemma: Anchoring in residential real estate transactions. Journal of Economic Behavior & Organization. 2013;89:76–92. doi: 10.1016/j.jebo.2013.01.010
  • 13. Meub L, Proeger TE. Anchoring in social context. Journal of Behavioral and Experimental Economics. 2015;55:29–39. doi: 10.1016/j.socec.2015.01.004
  • 14. Verousis T, ap Gwilym O. The implications of a price anchoring effect at the upstairs market of the London Stock Exchange. International Review of Financial Analysis. 2014;32:37–46. doi: 10.1016/j.irfa.2013.12.001
  • 15. Marchiori D, Papies EK, Klein O. The portion size effect on food intake. An anchoring and adjustment process? Appetite. 2014;81:108–115. doi: 10.1016/j.appet.2014.06.018
  • 16. Vargha A, Delaney HD. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics. 2000;25(2):101–132. doi: 10.3102/10769986025002101
  • 17. Shah NB. An Overview of Challenges, Experiments, and Computational Solutions in Peer Review; 2022. Communications of the ACM.
  • 18. Charlin L, Zemel R. The Toronto paper matching system: an automated paper-reviewer assignment system; 2013.
  • 19. Stelmakh I, Shah NB, Singh A. PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review. Journal of Machine Learning Research. 2021;22:163–1.
  • 20. Kobren A, Saha B, McCallum A. Paper matching with local fairness constraints. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2019. p. 1247–1257.
  • 21. Payan J, Zick Y. I Will Have Order! Optimizing Orders for Fair Reviewer Assignment. In: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems; 2022. p. 1711–1713.
  • 22. Leyton-Brown K, Mausam, Nandwani Y, Zarkoob H, Cameron C, Newman N, et al. Matching Papers and Reviewers at Large Conferences. arXiv preprint arXiv:2202.12273. 2022.
  • 23. Jecmen S, Zhang H, Liu R, Shah N, Conitzer V, Fang F. Mitigating manipulation in peer review via randomized reviewer assignments. Advances in Neural Information Processing Systems. 2020;33:12533–12545.
  • 24. Wu R, Guo C, Wu F, Kidambi R, Van Der Maaten L, Weinberger K. Making paper reviewing robust to bid manipulation attacks. In: International Conference on Machine Learning. PMLR; 2021. p. 11240–11250.
  • 25. Dhull K, Jecmen S, Kothari P, Shah NB. The Price of Strategyproofing Peer Assessment. In: The 9th AAAI Conference on Human Computation and Crowdsourcing. vol. 2; 2022.
  • 26. Ge H, Welling M, Ghahramani Z. A Bayesian model for calibrating conference review scores; 2013. Manuscript.
  • 27. Wang J, Shah NB. Your 2 is my 1, your 3 is my 9: Handling arbitrary miscalibrations in ratings. International Conference on Autonomous Agents and Multiagent Systems. 2018.
  • 28. Noothigattu R, Shah N, Procaccia A. Loss functions, axioms, and peer review. Journal of Artificial Intelligence Research. 2021;70:1481–1515. doi: 10.1613/jair.1.12554
  • 29. Mahoney MJ. Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive therapy and research. 1977;1(2):161–175. doi: 10.1007/BF01173636
  • 30. Lee CJ. Commensuration Bias in Peer Review. Philosophy of Science. 2015;82(5):1272–1283. doi: 10.1086/683652
  • 31. Blank RM. The Effects of Double-Blind versus Single-Blind Reviewing: Experimental Evidence from The American Economic Review. American Economic Review. 1991;81(5):1041–1067.
  • 32. Tomkins A, Zhang M, Heavlin WD. Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences. 2017;114(48):12708–12713. doi: 10.1073/pnas.1707323114
  • 33. Manzoor E, Shah NB. Uncovering latent biases in text: Method and application to peer review. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. p. 4767–4775.
  • 34. Huber J, Inoua S, Kerschbamer R, König-Kersting C, Palan S, Smith VL. Nobel and novice: Author prominence affects peer review. Proceedings of the National Academy of Sciences. 2022;119(41):e2205779119. doi: 10.1073/pnas.2205779119
  • 35. Stelmakh I, Rastogi C, Shah NB, Singh A, Daumé III H. A large scale randomized controlled trial on herding in peer-review discussions. PLOS ONE. 2023;18(7):e0287443. doi: 10.1371/journal.pone.0287443
  • 36. Stelmakh I, Shah NB, Singh A, Daumé III H. Prior and Prejudice: The Novice Reviewers’ Bias against Resubmissions in Conference Peer Review. In: ACM Conference on Computer-Supported Cooperative Work and Social Computing; 2021.
  • 37. Stelmakh I, Rastogi C, Liu R, Chawla S, Echenique F, Shah NB. Cite-seeing and reviewing: A study on citation bias in peer review. PLOS ONE. 2023;18(7):e0283980. doi: 10.1371/journal.pone.0283980
  • 38. Rastogi C, Stelmakh I, Shen X, Meila M, Echenique F, Chawla S, et al. To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online. arXiv preprint arXiv:2203.17259. 2022.
  • 39. Stelmakh I, Shah N, Singh A. On testing for biases in peer review. Advances in Neural Information Processing Systems. 2019;32.
  • 40. Obrecht M, Tibelius KH, d’Aloisio G. Examining the value added by committee discussion in the review of applications for research awards. Research Evaluation. 2007;16:79–91. doi: 10.3152/095820207X223785
  • 41. Fogelholm M, Leppinen S, Auvinen A, Raitanen J, Nuutinen A, Väänänen K. Panel discussion does not improve reliability of peer review for medical research grant proposals. Journal of clinical epidemiology. 2012;65(1):47–52. doi: 10.1016/j.jclinepi.2011.05.001
  • 42. Pier EL, Raclaw J, Kaatz A, Brauer M, Carnes ML, Nathan MJ, et al. ‘Your comments are meaner than your score’: Score calibration talk influences intra- and inter-panel variability during scientific grant peer review. Research Evaluation. 2017;26(1):1–14. doi: 10.1093/reseval/rvw025
  • 43. Teplitskiy M, Ranu H, Gray GS, Menietti M, Guinan EC, Lakhani KR. Social influence among experts: Field experimental evidence from peer review; 2019.
  • 44. McGrenere J, Cockburn A, Gould S. CHI 2020—The effect of rebuttals; 2019.
  • 45. Stelmakh I, Shah NB, Singh A, Daumé III H. A novice-reviewer experiment to address scarcity of qualified reviewers in large conferences. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. p. 4785–4793.
  • 46. Gao Y, Eger S, Kuznetsov I, Gurevych I, Miyao Y. Does My Rebuttal Matter? Insights from a Major NLP Conference. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019. p. 1274–1290.
  • 47. Blackburn S, Grove D, Pingali K, McKinley K, Berger E, Eide E, et al. PLDI 2015 Surveys; 2015.
  • 48. Parno B, Erlingsson U, Enck W. Report on the IEEE S&P 2017 submission and review process and its experiments; 2017.
  • 49. Slovic P, Lichtenstein S. Comparison of Bayesian and Regression Approaches to the Study of Information Processing in Judgment. Organizational Behavior and Human Performance. 1971;6:649–744. doi: 10.1016/0030-5073(71)90033-X
  • 50. Orne MT. On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American psychologist. 1962;17(11):776. doi: 10.1037/h0043424
  • 51. Association for the Advancement of Artificial Intelligence. AAAI 2020 Reviewer Guidelines; 2019. Available from: https://aaai.org/conference/aaai/aaai-20/.
  • 52. Neural Information Processing Systems Foundation. NeurIPS 2022 Reviewer Guidelines; 2022. Available from: https://neurips.cc/Conferences/2022/ReviewerGuidelines.
  • 53. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001;29(4):1165–1188. doi: 10.1214/aos/1013699998
  • 54. Benjamini Y, Heller R, Yekutieli D. Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2009;367(1906):4255–4271. doi: 10.1098/rsta.2009.0127
  • 55. Baxt WG, Waeckerle JF, Berlin JA, Callaham ML. Who reviews the reviewers? Feasibility of using a fictitious manuscript to evaluate peer reviewer performance. Annals of emergency medicine. 1998;32(3):310–317. doi: 10.1016/S0196-0644(98)70006-X
  • 56. Emerson GB, Warme WJ, Wolf FM, Heckman JD, Brand RA, Leopold SS. Testing for the presence of positive-outcome bias in peer review: a randomized controlled trial. Archives of internal medicine. 2010;170(21):1934–1939. doi: 10.1001/archinternmed.2010.406
  • 57. Godlee F, Gale CR, Martyn CN. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial. JAMA. 1998;280(3):237–240. doi: 10.1001/jama.280.3.237
  • 58. Schroter S, Black N, Evans S, Carpenter J, Godlee F, Smith R. Effects of training on quality of peer review: randomised controlled trial. BMJ. 2004;328(7441):673. doi: 10.1136/bmj.38023.700775.AE
  • 59. Schroter S, Black N, Evans S, Godlee F, Osorio L, Smith R. What errors do peer reviewers detect, and does training improve their ability to detect them? Journal of the Royal Society of Medicine. 2008;101(10):507–514. doi: 10.1258/jrsm.2008.080062

Decision Letter 0

Stephan Leitner

16 Oct 2023

PONE-D-23-21814

Testing for Reviewer Anchoring in Peer Review: A Randomized Controlled Trial

PLOS ONE

Dear Dr. Liu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 30 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Stephan Leitner

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This work was supported in part by NSF CAREER Award 1942124 and NSF 2200410.”

We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“NS received the National Science Foundation CAREER Award 1942124 (https://www.nsf.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

NS and FF received the National Science Foundation Communications and Information Foundations 2200410 (https://new.nsf.gov/funding/opportunities/ccf-communications-information-foundations-cif). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. We note that Figure 2 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 2 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear authors,

We have received two reviews for your paper from expert reviewers. While one is very positive, the other is slightly more critical. I would like to offer you the chance to address these issues in a revision, particularly by extending the discussion of results and limitations as suggested by the more critical reviewer.

Regards,

Stephan Leitner

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thanks for this nicely designed and conducted experiment. The paper is innovative and the mild deception you use tolerable. I like the paper and results and thus support publication. Here just a few minor points you may change in the paper:

- page 1, second line from bottom: replace "change" with "adapt"

- page 11, line under Section 4 - delete this unnecessary sentence "In this section..."

Reviewer #2: In this paper, the authors investigate an intriguing phenomenon in peer reviews, known as "Anchoring", where reviewers simply do not adjust their scores as much as they should. Through a randomized controlled trial, the paper studies the anchoring effect in peer reviews. The experiment designed by the authors involved 108 researchers with varied research backgrounds. The main findings of the authors include: 1) paper quality was perceived differently between the two groups; 2) no evidence of anchoring bias was found.

The paper is meticulously designed, addressing some challenges, which showcases the substantial effort put forth by the authors. The writing is clear, the results are presented succinctly, and the overall presentation is good. Additionally, the authors have engaged in a degree of discussion regarding the limitations of their method. Overall, the article makes a certain contribution to anchoring in peer review.

Pros:

1. The research questions are interesting and novel.

2. The experiments are well designed. The experiment design of the article is commendable and incurred certain costs, including the recruitment of trial participants and data collection, etc.

3. The writing of the paper is very good, including sections like the abstract, introduction, and conclusion. The authors have highlighted their main conclusions and also discussed other findings like the involvement of junior participants.

Cons:

1. The experimental setup of the article has certain limitations, such as the construction of only one fake paper.

2. The data provided by the authors is somewhat scant, although they have discussed the limitations.

3. The gap between the authors' experiments and the real world is considerable, making it challenging to assure the validity of the findings on real-world data.

My primary concerns regarding the article are as follows:

As mentioned in the paper, the experimental design is overly idealized and deviates significantly from real-world scenarios. Firstly, in real-world situations, especially in computer conference papers, a reviewer typically reviews multiple papers, and a paper receives reviews from multiple reviewers. This introduces elements of peer effect and peer pressure, which are considered crucial factors [1]. Furthermore, the revision of scores by reviewers may also depend on the authors' rebuttal skills or rebuttal politeness [2,3,4].

These aspects are not reflected in the experiments conducted in the paper. While considering these factors is challenging, if the authors claim to study the anchoring phenomenon in peer reviews, then these factors cannot be overlooked.

Other minor suggestions:

Please replace some arXiv paper references with their proceedings versions.

References:

[1] Gao, Yang, et al. "Does My Rebuttal Matter? Insights from a Major NLP Conference." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

[2] Huang, Junjie, et al. "What makes a successful rebuttal in computer science conferences?: A perspective on social interaction." Journal of Informetrics 17.3 (2023): 101427.

[3] Rogers, Anna, and Isabelle Augenstein. "What Can We Do to Improve Peer Review in NLP?." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.

[4] Bharti, Prabhat Kumar, et al. "PolitePEER: does peer review hurt? A dataset to gauge politeness intensity in the peer reviews." Language Resources and Evaluation (2023): 1-23.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Nov 18;19(11):e0301111. doi: 10.1371/journal.pone.0301111.r002

Author response to Decision Letter 0


2 Mar 2024

Reviewer #1:

Thank you for your careful review of our manuscript. We have made changes to the paper according to your following suggestions:

- page 1, second line from bottom: replace "change" with "adapt"

- page 11, line under Section 4 - delete this unnecessary sentence "In this section..."

You can find these changes in the “track changes” version of the revision.

Reviewer #2:

Thank you for your detailed comments and suggestions for our manuscript. In particular, we appreciate your careful consideration of the various facets of our experiment design. We would like to briefly discuss your concerns regarding our paper:

1. The experimental setup of the article has certain limitations, such as the construction of only one fake paper.

Response:

We agree with the reviewer that this is indeed a limitation to our work. We have added a paragraph in the discussion specifically discussing this limitation. We reproduce the paragraph here for the reviewer's convenience:

“Another limitation of our work is that we run our experiment with only one paper, which could make our findings less generalizable. There is precedent for research in which reviewers review fake papers, and in each of these studies only 1 to 3 papers are constructed [1–5]. Due to the high sample size determined from the power analysis (see S2 Appendix) and the limited pool of eligible participants (see S3 Appendix), we chose to have one paper to reduce the sample size needed in order to test for statistical significance, as having multiple papers would require an additional random effect to be modeled. Future work may also include papers from multiple domains to bolster the generalizability of the study.”
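For readers who want to see what the significance test mentioned above operates on, the following is a minimal sketch, not the authors' analysis code (which is in the repository cited in the Data Availability Statement). It computes the Mann-Whitney U statistic, the Vargha-Delaney A reported in the abstract, and a permutation p-value. The function names, the tie convention (a tie counts as one half), the one-sided direction of the test, and the scores themselves are all assumptions made for illustration.

```python
import random


def mann_whitney_u(xs, ys):
    """Count pairs (x, y) with x > y; ties count as one half (assumed convention)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def vargha_delaney_a(xs, ys):
    """Normalize U by the number of cross-group pairs so the result lies in [0, 1]."""
    return mann_whitney_u(xs, ys) / (len(xs) * len(ys))


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """One-sided p-value: share of random relabelings whose U is at least the observed U."""
    rng = random.Random(seed)
    observed = mann_whitney_u(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mann_whitney_u(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical scores for illustration only; these are not data from the study.
control = [6, 7, 5, 8, 7, 6]
experimental = [4, 5, 6, 3, 5, 4]

a = vargha_delaney_a(control, experimental)
p = permutation_p_value(control, experimental)
print(f"A = {a:.2f}, p = {p:.4f}")
```

With a single paper, every score can be treated as exchangeable under the null hypothesis, which is what the shuffling step relies on. Scores collected on several different papers would not be exchangeable across papers without extra modeling, which is one way to read the remark about an additional random effect.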

2. The data provided by the authors is somewhat scant, although they have discussed the limitations.

Response:

We provide the full anonymized numerical responses, along with participant institution and year, in the data folder of https://github.com/theryanl/ReviewerAnchoring. Our IRB approval and the consent of participants do not permit us to include participant comments in the dataset, because of potential breaches of anonymity (e.g., “Since I work in the neighboring field of xxx, …”). The participant pool consists mainly of PhDs in CS or CS-related fields, and their institution and year are provided. If additional information about their area of study could be inferred from their comments, certain participants might become identifiable.

3. The gap between the authors' experiments and the real world is considerable (e.g., peer effect, peer pressure, rebuttal skills, rebuttal politeness), making it challenging to assure the validity of the findings on real-world data. While considering these factors is challenging, if the authors claim to study the anchoring phenomenon in peer reviews, then these factors cannot be overlooked.

Response:

As you mention, there are many other peer effects in the rebuttal process. The design of this experiment deliberately avoids these additional effects in order to isolate the effect we wish to understand: anchoring in its base form (as defined by [6]), without social or author effects. We believe this base form is important to analyze, because the existence of anchoring in this setting would mean that the re-reviewing paradigm itself causes anchoring bias, so altering social pressures and other effects would be insufficient to avoid bias. This setting is also consistent with anchoring bias studies in psychology, which did not involve external social pressures, author skill, or politeness [6].

In this case, adding other peer effects or author skills and politeness could confound any observations that we make. These factors are orthogonal to anchoring, and our objective in the experiment design is to single out anchoring and avoid these other confounders. This is actually a flaw in the study by Gao et al. that you cited: the paper claims "peer pressure" as a reason for its observation, but the study does not isolate such effects, and hence the claim can be confounded by various factors such as program chair instructions to arrive at a consensus. Our study, in contrast, asks a specific question (of anchoring), and the experiment design deliberately and carefully avoids other such confounders in order to answer this specific question. Reaching a conclusion on anchoring alone also allows us to better evaluate the existence of other effects in future studies, and may help inform the design of future peer review paradigms.

4. Please replace some arXiv paper references with their proceedings versions.

As you suggested, we have updated the references, replacing arXiv paper references with their proceedings versions. They should be up-to-date with Google Scholar as of 2/9/2024.

[1] W. G. Baxt et al. “Who reviews the reviewers? Feasibility of using a fictitious manuscript to evaluate peer reviewer performance”. In: Annals of emergency medicine (1998).

[2] G. B. Emerson et al. “Testing for the presence of positive-outcome bias in peer review: a randomized controlled trial”. In: Archives of internal medicine (2010).

[3] F. Godlee, C. R. Gale, and C. N. Martyn. “Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial”. In: JAMA (1998).

[4] S. Schroter et al. “Effects of training on quality of peer review: randomised controlled trial”. In: BMJ (2004).

[5] S. Schroter et al. “What errors do peer reviewers detect, and does training improve their ability to detect them?” In: Journal of the Royal Society of Medicine (2008).

[6] A. Tversky and D. Kahneman. “Judgment under uncertainty: Heuristics and biases.” In: Science (1974).

Attachment

Submitted filename: Reviewer Letter for Anchoring Submission - Google Docs.pdf

pone.0301111.s004.pdf (102.4KB, pdf)

Decision Letter 1

Stephan Leitner

12 Mar 2024

Testing for Reviewer Anchoring in Peer Review: A Randomized Controlled Trial

PONE-D-23-21814R1

Dear Dr. Liu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at http://www.editorialmanager.com/pone/ and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Stephan Leitner

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

We have now received two reviews from the experts who evaluated your manuscript in the previous round. I am very pleased to inform you that, based on the reviewers' assessments and my own evaluation of your manuscript, we can now accept it for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: Thanks to the author for the reply, although I still think that peer pressure must be considered when analyzing the anchoring effect of peer review (peer review opinions are obviously given by multiple people together, and few papers are reviewed by only one reviewer).

The reference [6] is not the definition of anchoring effect in peer review. However, I agree that the author used randomized experiments to verify the anchoring effect.

While PLOS ONE does not attempt to use the peer review process to determine whether or not an article reaches the level of 'importance' required by a given journal, PLOS ONE uses peer review to determine whether a paper is technically rigorous and meets the scientific and ethical standards for inclusion in the published scientific record.

So I am OK with the current revisions.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Acceptance letter

Stephan Leitner

17 Jul 2024

PONE-D-23-21814R1

PLOS ONE

Dear Dr. Shah,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Stephan Leitner

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Deception and revision process.

    (PDF)

    pone.0301111.s001.pdf (48.9KB, pdf)
    S2 Appendix. Power analysis.

    (PDF)

    pone.0301111.s002.pdf (93.8KB, pdf)
    S3 Appendix. Participant recruitment.

    (PDF)

    pone.0301111.s003.pdf (46.2KB, pdf)
    Attachment

    Submitted filename: Reviewer Letter for Anchoring Submission - Google Docs.pdf

    pone.0301111.s004.pdf (102.4KB, pdf)

    Data Availability Statement

    All code and data files are available on GitHub at the following repository: https://github.com/theryanl/ReviewerAnchoring.


    Articles from PLOS ONE are provided here courtesy of PLOS
