PLOS Biology
. 2023 Aug 29;21(8):e3002238. doi: 10.1371/journal.pbio.3002238

Relationship between journal impact factor and the thoroughness and helpfulness of peer reviews

Anna Severin 1,2, Michaela Strinzel 3, Matthias Egger 1,3,4,*, Tiago Barros 5, Alexander Sokolov 6, Julia Vilstrup Mouatt 7, Stefan Müller 8
Editor: Ulrich Dirnagl9
PMCID: PMC10464996  PMID: 37643173

Abstract

The Journal Impact Factor is often used as a proxy measure for journal quality, but the empirical evidence is scarce. In particular, it is unclear how peer review characteristics for a journal relate to its impact factor. We analysed 10,000 peer review reports submitted to 1,644 biomedical journals with impact factors ranging from 0.21 to 74.7. Two researchers hand-coded sentences using categories of content related to the thoroughness of the review (Materials and Methods, Presentation and Reporting, Results and Discussion, Importance and Relevance) and helpfulness (Suggestion and Solution, Examples, Praise, Criticism). We fine-tuned and validated transformer machine learning language models to classify sentences. We then examined the association between the number and percentage of sentences addressing different content categories and 10 groups defined by the Journal Impact Factor. The median length of reviews increased with higher impact factor, from 185 words (group 1) to 387 words (group 10). The percentage of sentences addressing Materials and Methods was greater in the highest Journal Impact Factor journals than in the lowest Journal Impact Factor group. The results for Presentation and Reporting went in the opposite direction, with the highest Journal Impact Factor journals giving less emphasis to such content. For helpfulness, reviews for higher impact factor journals devoted relatively less attention to Suggestion and Solution than lower impact factor journals. In conclusion, peer review in journals with higher impact factors tends to be more thorough, particularly in addressing study methods while giving relatively less emphasis to presentation or suggesting solutions. Differences were modest and variability high, indicating that the Journal Impact Factor is a bad predictor of the quality of peer review of an individual manuscript.


An analysis of the content of 10,000 peer review reports reveals that reports submitted to journals with higher impact factors pay more attention to the materials and methods of a study but less attention to presentation and reporting, whereas journals with low impact factors provide more suggestions, solutions and examples.

Introduction

Peer review is a process of scientific appraisal by which manuscripts submitted for publication in journals are evaluated by experts in the field for originality, rigour, and validity of methods and potential impact [1]. Peer review is an important scientific contribution and is increasingly visible on databases and researcher profiles [2,3]. In medicine, practitioners rely on sound evidence from clinical research to make a diagnosis or prognosis and choose a therapy. Recent developments, such as the retraction of peer-reviewed COVID-19 publications in prominent medical journals [4] or the emergence of predatory journals [5,6], have prompted concerns about the rigour and effectiveness of peer review. Despite these concerns, research into the quality of peer review is scarce. Little is known about the determinants and characteristics of high-quality peer review. The confidential nature of many peer review reports and the lack of databases and tools for assessing their quality have hampered larger-scale research on peer review.

The impact factor was originally developed to help libraries make indexing and purchasing decisions for their collections. It is a journal-based metric calculated each year by dividing the number of citations received in that year for papers published in the 2 preceding years by the number of “citable items” published during the 2 preceding years [7]. The reputation of a journal, its impact factor, and the perceived quality of peer review are among the most common criteria authors use to select journals to publish their work [8–10]. Assuming that citation frequency reflects a journal’s importance in the field, the impact factor is often used as a proxy for journal quality [11]. It is also used in academic promotion, hiring decisions, and research funding allocation, leading scholars to seek publication in journals with high impact factors [12].

Despite using the Journal Impact Factor as a proxy for a journal’s quality, empirical research on the impact factor as a measure of journal quality is scarce [11]. In particular, it is unclear how the peer review characteristics for a journal relate to this metric. We combined human coding of peer review reports and quantitative text analysis to examine the association between peer review characteristics and Journal Impact Factor in the medical and life sciences, based on a sample of 10,000 peer review reports. Specifically, we examined the impact factor’s relationship with the absolute number and the percentages of sentences related to peer review thoroughness and helpfulness.

Results

Characteristics of the study sample

The sample included 5,067 reviews from Essential Science Indicators (ESI) [13] research field Clinical Medicine, 943 from Environment and Ecology, 942 from Biology and Biochemistry, 733 from Psychiatry and Psychology, 633 from Pharmacology and Toxicology, 576 from Neuroscience and Behaviour, 566 from Molecular Biology and Genetics, 315 from Immunology, and 225 from Microbiology.

Across the 10 groups of journals defined by Journal Impact Factor deciles (1 = lowest, 10 = highest), the median Journal Impact Factor ranged from 1.23 to 8.03, the minimum ranged from 0.21 to 6.51 and the maximum from 1.45 to 74.70 (Table 1). The proportion of reviewers from Asia, Africa, South America, and Australia/Oceania declined when moving from Journal Impact Factor group 1 to group 10. In contrast, there was a trend in the opposite direction for Europe and North America. Information on the continent of affiliation was missing for 43.5% of reviews (4,355). The median length of peer review reports increased by about 202 words from group 1 (median number of words 185) to group 10 (387). S1 File details the 10 journals from each Journal Impact Factor group that provided the highest number of peer review reports, gives the complete list of journals, and shows the distribution of reviews across the 9 ESI disciplines.

Table 1. Characteristics of peer review reports by Journal Impact Factor group.

Journal Impact Factor group
1 2 3 4 5 6 7 8 9 10
Median JIF (range) 1.23 (0.21–1.45) 1.68 (1.46–1.93) 2.07 (1.93–2.22) 2.42 (2.23–2.54) 2.77 (2.54–3.01) 3.26 (3.01–3.55) 3.83 (3.55–4.20) 4.53 (4.21–5.16) 5.67 (5.163–6.5) 8.03 (6.51–74.70)
No. of review reports 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000 1,000
No. of journals 256 224 151 146 183 156 155 129 98 146
No. of reviewers 967 960 969 958 965 973 961 939 970 962
No. of sentences (median; IQR) 9 (4–18) 11 (6–22) 12 (5–22) 13 (6–23) 14 (7–25) 14 (7–25) 16 (8–28) 17 (8–27) 16.5 (9–27) 18 (10–30)
No. of words (median; IQR) 185 (84–359) 232.5 (116–426) 225 (104–419) 256.5 (116–478) 284.5 (146–506) 271 (142–495) 346 (170–581) 344.5 (176–555) 350.5 (195–567) 387 (213–672)
Continent of reviewers’ affiliation
Asia 139 107 163 115 93 135 98 93 80 62
Africa 15 14 18 9 5 14 8 6 5
Europe 119 156 187 190 231 250 268 273 280 241
North America 97 113 105 153 162 151 191 180 166 213
Central/South America 61 42 36 25 38 22 22 20 23 10
Australia/Oceania 50 55 36 46 64 37 26 37 38 52
Missing 519 513 455 462 407 391 387 391 408 422
Gender of reviewer
Female 242 262 261 254 241 211 216 189 260 206
Male 518 516 478 549 548 551 575 584 543 599
Unknown 240 222 261 197 211 238 209 227 197 195

IQR, interquartile range; JIF, Journal Impact Factor.

Continents are ordered by population size.

JIF group defined by deciles (1 = lowest, 10 = highest).

Performance of coders and classifiers

The training of coders resulted in acceptable to good between-coder agreement, with an average Krippendorff’s α across the 8 categories of 0.70. The final analyses included 10,000 review reports, comprising 188,106 sentences, which were submitted by 9,259 reviewers to 1,644 journals. In total, 9,590 unique manuscripts were reviewed.

In the annotated dataset, the most common categories based on human coding were Materials and Methods (coded in 823 sentences or 41.2% out of 2,000 sentences), Suggestion and Solution (638 sentences; 34.2%), and Presentation and Reporting (626 sentences; 31.3%). In contrast, Praise (210; 10.5%) and Importance and Relevance (175; 8.8%) were the least common. On average, the training set had 444 sentences per category, as 1,160 sentences were allocated to more than 1 category. In out-of-sample predictions based on DistilBERT, a transformer model for text classification [14], precision, recall, and F1 scores (binary averages across both classes [absent/present]) were similar within categories (see S2 File). The classification was most accurate for Example and Materials and Methods (F1 score 0.71) and least accurate for Criticism (0.57) and Results and Discussion (0.61). The prevalence predicted from the machine learning model was generally close to the human coding: Point estimates did not differ by more than 3 percentage points. Overall, the machine learning classification closely mirrored human coding. Further details are given in S2 File.

Descriptive analysis: Thoroughness and helpfulness of peer review reports

The majority of sentences (107,413 sentences, 57.1%) contributed to more than 1 content category; a minority (23,997 sentences, 12.8%) were not assigned to any category. The average number of sentences addressing each of the 8 content categories in the set of 10,000 reviews ranged from 1.6 sentences on Importance and Relevance to 9.2 sentences on Materials and Methods (upper panel of Fig 1). The percentages of sentences addressing each category are shown in the lower panel of Fig 1. The content categories Materials and Methods (46.7% of sentences), Suggestion and Solution (34.5%), and Presentation and Reporting (30.0%) were most extensively covered. The category Results and Discussion was present in 16.3% of the sentences, and 13.1% were assigned to the category Examples. In contrast, only 8.4% of sentences addressed the Importance and Relevance of the study. Criticism (16.5%) was slightly more common than Praise (14.9%). Most distributions were wide and skewed to the right, with a peak at 0 sentences or 0% corresponding to reviews that did not address the content category (Fig 1).

Fig 1. Distribution of sentences in peer review reports allocated to 8 content categories.


The number (upper panel) and percentage (lower panel) of sentences in a review allocated to the 8 peer review content categories are shown. A sentence could be allocated to no, one, or several categories. Vertical dashed lines show the average number (upper panel) and average percentage of sentences (lower panel) after aggregating them to the level of reviews. Analysis based on 10,000 review reports. The data underlying this figure can be found in S1 Data.

Fig 2 shows the estimated number of sentences addressing the 8 content categories across the 10 Journal Impact Factor groups. For all categories, the number of sentences increased from Journal Impact Factor groups 1 to 10. However, increases were modest on average, amounting to 2 or fewer additional sentences. The exception was Materials and Methods, where the difference between Journal Impact Factor groups 1 and 10 was 6.5 sentences on average.

Fig 2. Distribution of sentences in peer review reports allocated to 8 content categories by Journal Impact Factor group.


A sentence could be allocated to no, one, or several categories. Vertical dashed lines show the average number of sentences after aggregating numbers to the level of reviews. The number of sentences is displayed on a log scale. Analysis based on 10,000 review reports. The data underlying this figure can be found in S2 Data.

Fig 3 shows the estimated percentage of sentences across content categories and Journal Impact Factor groups. Among thoroughness categories, the percentage of sentences addressing Materials and Methods increased from 40.4% to 51.8% from Journal Impact Factor groups 1 to 10. In contrast, attention to Presentation and Reporting declined from 32.9% in group 1 to 25.0% in group 10. No clear trends were evident for Results and Discussion or Importance and Relevance. For helpfulness, the percentage of sentences including Suggestion and Solution declined from 36.9% in group 1 to 30.3% in group 10. The prevalence of sentences providing Examples increased from 11.0% (group 1) to 13.3% (group 10). Praise decreased slightly, whereas Criticism increased slightly when moving from group 1 to group 10. The distributions were broad, even within the groups of journals with similar impact factors.

Fig 3. Distribution of sentences in peer review reports allocated to 8 content categories by Journal Impact Factor group.


The percentage of sentences in a review allocated to the 8 peer review quality categories is shown. A sentence could be allocated to no, one, or several categories. Analysis based on 10,000 review reports. Vertical dashed lines show the average prevalence after aggregating prevalences to the level of reviews. The data underlying this figure can be found in S3 Data.

Regression analyses

The association between journal impact factor and the 8 content categories was analysed in 2 regression analyses. The first predicted the number of sentences of each content category across the 10 Journal Impact Factor groups; the second, the changes in the percentage of sentences addressing content categories. All coefficients and standard errors are available from S3 File.

The predicted numbers of sentences are shown in Fig 4 with their 95% confidence intervals (CI). The results confirm those observed in the descriptive analyses. There was a substantial increase in the number of sentences addressing Materials and Methods from Journal Impact Factor group 1 (6.1 sentences; 95% CI 5.3 to 6.8) to group 10 (12.5 sentences; 95% CI 11.6 to 13.5), for a difference of 6.4 sentences. For the other categories, only small increases were predicted, in line with the descriptive analyses.

Fig 4. Predicted number of sentences addressing thoroughness and helpfulness categories across the 10 Journal Impact Factor groups.


Predicted values and 95% confidence intervals are shown. Analysis based on 10,000 review reports. All negative binomial mixed-effects models include random intercepts for the journal name and reviewer ID. The data underlying this figure can be found in S4 Data.

The predicted differences in the percentage of sentences addressing content categories are shown in Fig 5. Again, the results confirm those observed in the descriptive analyses. The prevalence of sentences on Materials and Methods in the journals with the highest impact factor was higher (+11.0 percentage points; 95% CI +7.9 to +14.1) than in the group of journals with the lowest impact factors. The trend for sentences addressing Presentation and Reporting went in the opposite direction, with reviews submitted to the journals with the highest impact factor giving less emphasis to such content (−7.7 percentage points; 95% CI −10.0 to −5.4). There was slightly less focus on Importance and Relevance in the group of journals with the highest impact factors relative to the group with the lowest impact factors (−1.9 percentage points; 95% CI −3.5 to −0.4) and little evidence of a difference for Results and Discussion (+1.1 percentage points; 95% CI −0.54 to +2.8). Reviews for higher impact factor journals devoted less attention to Suggestion and Solution. The group with the highest Journal Impact Factor had 6.2 percentage points fewer sentences addressing Suggestion and Solution (95% CI −8.5 to −3.8). No substantive differences were observed for Examples (0.3 percentage points; 95% CI −1.7 to +2.3), Praise (1.6 percentage points; 95% CI −0.5 to +3.7), and Criticism (0.5 percentage points; 95% CI −1.0 to +2.0).

Fig 5. Percentage point change in the proportion of sentences addressing thoroughness and helpfulness categories relative to the lowest Journal Impact Factor group.


Regression coefficients and 95% confidence intervals are shown. Analysis based on 10,000 review reports. All linear mixed-effects models include random intercepts for the journal name and reviewer ID. The data underlying this figure can be found in S5 Data.

Sensitivity analyses

We performed several sensitivity analyses to assess the robustness of the findings. In the first, we removed reviews with 0 sentences or 0% in the respective content category, resulting in similar regression coefficients and predicted counts. In the second, the sample was limited to reviews with at least 10 sentences (sentence models) or 200 words (percentage models). This analysis showed that short reviews did not drive the associations. In the third sensitivity analysis, the regression models adjusted for additional variables (discipline, career stage of reviewers, and log number of reviews submitted by reviewers). The addition of these variables reduced the sample size from 10,000 to 5,806 reviews because of missing reviewer-level data. Again, the relationships between content categories and Journal Impact Factor persisted. The fourth sensitivity analysis revealed that results were generally similar for male and female reviewers. The fifth showed that the results changed little when replacing the Journal Impact Factor groups with the raw Journal Impact Factor (S3 File).

Typical words in content categories

A keyness analysis [15] extracts typical words for each content category across the full corpus of the 188,106 sentences. The analysis is based on χ2 tests comparing the frequencies of each word in sentences assigned to a content category and other sentences. Table 2 reports the 50 words appearing more frequently in sentences assigned to the respective content category than in other sentences (according to the DistilBERT classification). The table supports the validity of the classification. Common terms in the thoroughness categories were “data”, “analysis”, “method” (Materials and Methods); “please”, “text”, “sentence”, “line”, “figure” (Presentation and Reporting); “results”, “discussion”, “findings” (Results and Discussion); and “interesting”, “important”, “topic” (Importance and Relevance). For helpfulness, common unique words included “please”, “need”, “include” (Suggestion and Solution); “line”, “page”, “figure” (Examples); “interesting”, “good”, “well” (Praise); and “however”, “(un)clear”, “mistakes” (Criticism).

Table 2. The 50 key terms for each content category.

Results rely on keyness analyses using χ2 tests for each word, comparing the frequency of words in sentences where a content characteristic was present (target group) with sentences where the characteristic was absent (reference group). The table reports the 50 words with the highest χ2 values per category.

Content category Words
Materials and Methods data, methods, analysis, model, patients, method, sample, used, analyses, test, treatment, models, performed, using, criteria, control, experiments, statistical, samples, measures, population, group, parameters, measure, approach, methodology, size, measured, procedure, cohort, groups, variables, scale, controls, design, tests, experiment, experimental, selection, testing, tested, measurements, regression, compared, procedures, measurement, analyzed, trials, score, sampling
Presentation and Reporting please, text, sentence, line, figure, written, table, section, page, paragraph, figures, references, introduction, tables, english, abstract, language, word, sentences, description, reference, mention, explain, information, detail, specify, reader, clarify, legend, well, needs, lines, described, mentioned, clearly, describe, term, summarize, details, informative, errors, abbreviations, read, well-written, grammar, explained, remove, check, need, clarified
Results and Discussion results, discussion, findings, conclusions, conclusion, result, outcome, correlation, effect, outcomes, section, finding, interpretation, discussed, correlations, confidence, variance, supported, statistical, regression, significant, implications, discuss, statistically, presented, summarize, main, significance, predictions, analysis, values, deviation, comparison, error, difference, obtained, comparisons, estimates, value, drawn, uncertainty, likelihood, draw, conclude, observed, objective, deviations, discussions, differences, variables
Importance and Relevance interesting, important, topic, interest, research, contribution, field, novel, importance, work, study, audience, relevance, literature, understanding, paper, useful, future, valuable, insights, knowledge, quality, focus, provides, great, originality, overall, rigor, timely, addresses, approach, clinical, significance, relevant, scientific, implications, usefulness, review, general, insight, context, innovative, readership, area, community, revision, comprehensive, findings, perspective, practical
Suggestion and Solution please, need, needs, better, suggest, provide, consider, clarify, recommend, helpful, include, must, section, required, needed, discussion, line, revision, table, detail, remove, discuss, explain, sentence, specify, help, check, revise, text, improve, think, reader, added, delete, make, replace, useful, highlight, minor, comment, might, clarified, details, clearer, paragraph, worth, references, information, adding, perhaps
Example line, page, figure, lines, sentence, paragraph, table, example, replace, delete, legend, remove, please, word, change, line, panel, comma, column, reference, typo, instead, pages, last, page, caption, statement, shown, mean, bottom, sentences, figures, phrase, rephrase, shows, panels, replaced, section, correct, indicate, write, missing, first, figure1, says, confusing, starting, figs, text, meant
Criticism unclear, clear, however, difficult, confusing, don’t, missing, hard, lack, sure, lacks, seem, seems, understand, little, misleading, doesn’t, enough, vague, confused, incorrect, lacking, unfortunately, somewhat, problematic, insufficient, although, convinced, major, wrong, statement, mistakes, quite, poorly, conclusion, incomplete, questionable, weak, grammatical, inconsistent, errors, sentence, remains, speculative, limited, really, follow, makes, figure, concerns
Praise interesting, well, good, written, well-written, topic, manuscript, paper, important, interest, excellent, overall, satisfactory, comments, timely, nice, great, valuable, author, work, appreciate, review, provides, publication, comprehensive, contribution, article, study, research, novel, useful, enjoyed, field, concise, sound, impressive, improved, dear, easy, nicely, congratulate, thorough, worthy, addresses, relevant, appreciated, appropriate, presents, designed, adequate

Discussion

This study used fine-tuned transformer language models to analyse the content of peer review reports and investigate the association of content with the Journal Impact Factor. We found that the impact factor was associated with the content of peer review reports and the characteristics of reviewers. The length of reports increased with increasing Journal Impact Factor, with the number of relevant sentences increasing for all content categories, but in particular for Materials and Methods. Expressed as the percentage of sentences addressing a category (and thus standardising for the different lengths of peer review reports), the prevalence of sentences providing suggestions and solutions, examples, or addressing the reporting of the work declined with increasing Journal Impact Factor. Finally, the proportion of reviewers from Asia, Africa, and South America also declined, whereas the proportion of reviewers from Europe and North America increased.

The limitations of the Journal Impact Factor are well documented [16–18], and there is increasing agreement that it should not be used to evaluate the quality of research published in a journal. The San Francisco Declaration on Research Assessment (DORA) calls for the elimination of any journal-based metrics in funding, appointment, and promotion [19]. DORA is supported by thousands of universities, research institutes, and individuals. Our study shows that the peer reviews submitted to journals with higher Journal Impact Factors may be more thorough than those submitted to lower impact journals. Should, therefore, the Journal Impact Factor be rehabilitated and used as a proxy measure for peer review quality? Similar to the distribution of citations in a journal, the length of reports and the prevalence of content related to thoroughness and helpfulness varied widely, both within journals and between journals with similar Journal Impact Factors. In other words, the Journal Impact Factor is a poor proxy measure for the thoroughness or helpfulness of peer review that authors may expect when submitting their manuscripts.

The increase in the length of peer review reports with increasing Journal Impact Factor might be explained by the fact that reviewers from Europe and North America and reviewers with English as their first language tend to write longer reports and to review for higher impact journals [20]. Further, high impact factor journals may be more prestigious to review for and can thus afford to recruit more senior scholars. Of note, there is evidence suggesting that the quality of reports decreases with age or years of reviewing [21,22]. Interestingly, several medical journals with high impact factors have recently committed to improving diversity among their reviewers [23–25]. Unfortunately, due to incomplete data, we could not examine the importance of the seniority of reviewers. Independently of seniority, reviewers may be brief when reviewing for a journal with a low impact factor, believing a more superficial review will suffice. On the other hand, brief reviews are not necessarily superficial: The review of a very poor paper may not warrant a long text.

Peer review reports have been hidden for many years, hampering research on their characteristics. Previous studies were based on smaller, selected samples. An early randomised trial evaluating the effect of blinding reviewers to the authors’ identity on the quality of peer review was based on 221 reports submitted to a single journal [26]. Since then, science has become more open, embracing open access to publications and data and open peer review. Some journals now publish peer reviews and authors’ responses along with the articles [27–29]. Bibliographic databases have also started to publish reviews [30]. The European Cooperation in Science and Technology (COST) Action on new frontiers of peer review (PEERE), established in 2017 to examine peer review in different areas, was based on data from several hundred Elsevier journals from a wide range of disciplines [31].

To our knowledge, the Publons database is the largest database of peer review reports, and the only one not limited to individual publishers or journals, making it a unique resource for research on peer review. Based on 10,000 peer review reports submitted to medical and life science journals, this is likely the largest study of peer review content ever done. It built on a previous analysis of the characteristics of scholars who review for predatory and legitimate journals [32]. Other strengths of this study include the careful classification and validation step, based on the coding by hand of 2,000 sentences by trained coders. The performance of the classifiers was high, which is reassuring given that the sentence-level classification tasks deal with imbalanced and sometimes ambiguous categories. Performance is in line with recent studies. For example, a study using an extension of BERT to classify concepts such as nationalism, authoritarianism, and trust reported results for precision and recall similar to the present study [33]. We trained the algorithm on journals from many disciplines, which should make it applicable to fields other than medicine and the life sciences. Journals and funders could use our approach to analyse the thoroughness and helpfulness of their peer review. Journals could submit their peer review reports to an independent organisation for analysis. The results could help journals improve peer review, give feedback to peer reviewers, inform the training of peer reviewers, and help readers gauge the quality of the journals in their field. Further, such analyses could inform a reviewer credit system that could be used by funders and research institutions.

Our study has several weaknesses. Reviewers may be more likely to submit their review to Publons if they feel it meets general quality criteria. This could have introduced bias if the selection process into Publons’ database depended on the Journal Impact Factor. However, the large number of journals within each Journal Impact Factor group makes it likely that the patterns observed are real and generalizable. We acknowledge that our findings are more reliable for the more common content categories than for the less common. We only examined peer review reports and could not consider the often extensive contributions made by journal editors and editorial staff to improve articles. In other words, although our results provide valuable insights into the peer review process, they give an incomplete picture of the general quality assurance processes of journals. Due to the lack of information in the database, we could not analyse any differences between open (signed) and anonymous peer review reports. Similarly, we could not distinguish between reviews of original research articles and other article types, for example, narrative review articles. Some journals do not consider importance and relevance when assessing submissions, and these journals may have influenced results for this category. We lacked the resources to identify these journals among the over 1,600 outlets included in our study to examine their influence. Finally, we could not assess to what extent the content of peer review reports affected acceptance or rejection of the paper.

Conclusions

This study of peer review characteristics indicates that peer review in journals with higher impact factors tends to be more thorough, particularly in addressing the study’s methods while giving relatively less emphasis to presentation or suggesting solutions. Our findings may have been influenced by differences in reviewer characteristics, quality of submissions, and the attitude of reviewers towards the journals. Differences were modest, and the Journal Impact Factor is therefore a bad predictor of the quality of peer review of an individual manuscript.

Methods

Our study was based on peer review reports submitted to Publons from January 24, 2014, to May 23, 2022. Publons (part of Web of Science) is a platform for scholars to track their peer review activities and receive recognition for reviewing [34]. A total of 2,000 sentences from peer review reports were hand-coded and assigned to none, one, or more than one of 8 content categories related to thoroughness and helpfulness. The transformer model DistilBERT [14,35] was then used to classify each sentence in the peer review reports as contributing or not contributing to these categories. More details are provided in the Section “Classification and validation” below and S2 File. After validating the classification performance using out-of-sample predictions, the association between the 2019 Journal Impact Factors [36] and the prevalence of relevant sentences in peer review reports was examined. The sample was limited to review reports submitted to medical and life sciences journals with an impact factor. The analysis took the hierarchical nature of the data into account.

Data sources

As of May 2022, the Publons database contained information on 15 million reviews performed and submitted by more than 1,150,000 scholars for about 55,000 journals and conference proceedings. Reviews can be submitted to Publons in different ways. When scholars review for journals partnering with Publons and wish to receive recognition, Publons receives the review and some metadata directly from the journal. For other journals, scholars can upload the review and verify it by forwarding the confirmation email from the journal to Publons or by sending a screenshot from the peer review submission system. Publons audits a random subsample of emails and screenshots by contacting editors or journal administrators.

Publons randomly selected English-language peer review reports for training the classifiers from a broad spectrum of journals, covering all ESI fields [37] except Physics, Space Science, and Mathematics. Reviews from the latter fields contained many mathematical formulae, which were difficult to categorise. In the next step, a stratified random sample of 10,000 verified prepublication reviews written in English was drawn. First, the Publons database was limited to reviews from medical and life sciences journals based on ESI research fields, resulting in a data set of approximately 5.2 million reviews. The ESI field Multidisciplinary was excluded as these journals publish articles not within the medical and life sciences field (e.g., PLOS ONE, Nature, Science). Second, these reviews were divided into 10 equal groups based on Journal Impact Factor deciles. Third, 1,000 reviews were selected randomly from each of the 10 groups. Second-round peer review reports were excluded whenever this information was available. The continent of the reviewer’s institutional affiliation, the total number of publications of the reviewer, the start and end year of the reviewers’ publications, and gender were available for a subset of reviews. The gender of reviewers was classified with the gender-guesser Python package (version 0.4.0). Since the data on reviewer characteristics are incomplete and automated gender classification suffers from misclassification, these variables were only included in the regression models reported in S3 File.
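For illustration, the following minimal sketch shows how such first-name-based classification can be done with the gender-guesser package; the helper function and the mapping of the package’s output labels onto the Female/Male/Unknown categories of Table 1 are assumptions for illustration, not necessarily the authors’ exact implementation.

```python
# Minimal sketch of first-name-based gender classification with the
# gender-guesser package (version 0.4.0 was used in the study).
# The helper function and label mapping are illustrative assumptions.
import gender_guesser.detector as gender

detector = gender.Detector(case_sensitive=False)

def classify_reviewer_gender(first_name: str) -> str:
    """Map gender-guesser output onto the three categories reported in Table 1."""
    label = detector.get_gender(first_name)  # 'female', 'mostly_female', 'male', 'mostly_male', 'andy', or 'unknown'
    if label in ("female", "mostly_female"):
        return "Female"
    if label in ("male", "mostly_male"):
        return "Male"
    return "Unknown"  # androgynous ('andy') and unrecognised names

print(classify_reviewer_gender("Anna"))      # Female
print(classify_reviewer_gender("Matthias"))  # Male
```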

Classification and validation

Two authors (ASE and MS) were trained in coding sentences. After piloting and refining the coding scheme and establishing intercoder reliability, the two coders labelled 2,000 sentences (1,000 sentences each). They allocated sentences to none, one, or several of 8 content categories. We selected the 8 categories based on prior work, including the Review Quality Instrument and other scales and checklists [38], and previous studies using text analysis or machine learning to assess student and peer review reports [39–43]. In the manual coding process, the categories were refined, taking into account the ease of operationalising categories and their intercoder reliability. Based on the pilot data, Krippendorff’s α, a measure of reliability in content analysis, was calculated [44].
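To illustrate the reliability calculation, the sketch below computes Krippendorff’s α for one binary content category with the krippendorff Python package; the ratings are invented toy data, not the study’s coding.

```python
# Illustrative sketch: Krippendorff's alpha for one binary content category,
# computed with the 'krippendorff' package. The ratings are invented toy data.
import numpy as np
import krippendorff

# Rows = coders, columns = sentences; 1 = category present, 0 = absent,
# np.nan = sentence not coded by that coder.
reliability_data = np.array([
    [1, 0, 0, 1, 1, 0, np.nan, 1],  # coder A
    [1, 0, 1, 1, 1, 0, 0,      1],  # coder B
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```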

The categories describe, first, the Thoroughness of a review, measuring the degree to which a reviewer comments on (1) Materials and Methods (Did the reviewer comment on the methods of the manuscript?); (2) Presentation and Reporting (Did the reviewer comment on the presentation and reporting of the paper?); (3) Results and Discussion (Did the reviewer comment on the results and their interpretation?); and (4) the paper’s Importance and Relevance (Did the reviewer comment on the importance or relevance of the manuscript?). Second, the Helpfulness of a review was examined based on comments on (5) Suggestion and Solution (Did the reviewer provide suggestions for improvement or solutions?); (6) Examples (Did the reviewer give examples to substantiate his or her comments?); (7) Praise (Did the reviewer identify strengths?); and (8) Criticism (Did the reviewer identify problems?). Categories were rated on a binary scale (1 for yes, 0 for no). A sentence could be coded as 1 for multiple categories. S4 File gives further details.

We used the transformer model DistilBERT to predict the absence or presence of the 8 characteristics in each sentence of the peer review reports [45]. For validation, data were split randomly into a training set of 1,600 sentences and a held-out test set of 400 sentences. Eight DistilBERT models (one for each content category) were fine-tuned on the set of 1,600 sentences and used to predict the categories in the remaining 400 sentences. Performance measures, including precision (i.e., the positive predictive value), recall (i.e., sensitivity), and the F1 score, were calculated. The F1 score is the harmonic mean of precision and recall and an overall measure of accuracy. The F1 score can range between 0 and 1, with higher values indicating better classification performance [46].
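The condensed sketch below illustrates this fine-tuning and out-of-sample prediction step for a single binary category with the Hugging Face transformers library; the file name, column names, and hyperparameters are placeholders rather than the authors’ exact configuration.

```python
# Condensed sketch: fine-tune one binary DistilBERT classifier (here for the
# Materials and Methods category) on 1,600 labelled sentences and predict the
# 400 held-out sentences. File name, column names, and hyperparameters are
# illustrative placeholders, not the authors' exact configuration.
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("annotated_sentences.csv")  # columns: 'sentence', 'materials_methods' (0/1)
train_df, test_df = train_test_split(df, test_size=400, random_state=1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def to_dataset(frame: pd.DataFrame) -> Dataset:
    ds = Dataset.from_pandas(frame.rename(columns={"materials_methods": "label"}))
    return ds.map(lambda batch: tokenizer(batch["sentence"], truncation=True,
                                          padding="max_length", max_length=128),
                  batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                           num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mm_classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=to_dataset(train_df),
)
trainer.train()

# Out-of-sample predictions for the held-out sentences (1 = category present)
logits = trainer.predict(to_dataset(test_df)).predictions
predicted = np.argmax(logits, axis=1)
```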

Overall, the classification performance of the fine-tuned DistilBERT language models was high. The average F1 score for the presence of a characteristic was 0.75, ranging from 0.68 (Praise) to 0.88 (Suggestion and Solution). For most categories, precision and recall were similar, indicating the absence of systematic measurement error. Importance and Relevance and Results and Discussion were the exceptions, with lower recall for characteristics being present. Balanced accuracy (the arithmetic mean of sensitivity and specificity) was also high, ranging from 0.78 to 0.91 (with a mean of 0.83 across the 8 categories). S2 File gives further details.
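These performance measures can be computed with scikit-learn, as in the short sketch below; the labels are invented toy data used only to show the metric definitions.

```python
# Sketch of the validation metrics, computed with scikit-learn on toy labels.
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])  # human coding (1 = category present)
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])  # classifier predictions

print("precision:        ", precision_score(y_true, y_pred))          # positive predictive value
print("recall:           ", recall_score(y_true, y_pred))             # sensitivity
print("F1 score:         ", f1_score(y_true, y_pred))                 # harmonic mean of the two
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # mean of sensitivity and specificity
```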

We compared the percentages of sentences addressing each category between the human annotation dataset and the output from the machine learning model. For the test set of 400 sentences, the percentage of sentences falling into each of the 8 categories was calculated, separately for the human coding and the DistilBERT predictions. There was a close match between the two: DistilBERT overestimated Importance and Relevance by 3.0 percentage points and underestimated Materials and Methods by 2.3 percentage points. For all other content categories, smaller differences were observed. After the validity of the classification had been assessed, the machine learning classifiers were fine-tuned using all 2,000 labelled sentences, and the 8 classifiers were used to predict the presence or absence of content in the full text corpus consisting of 188,106 sentences.

Finally, we identified unique words in each quality category using a “keyness” analysis [47]. The words retrieved from the keyness analyses reflect typical words used in each content category.
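The keyness statistic itself is simple: for each word, a 2 × 2 χ2 test compares the word’s frequency in sentences where a category is present (target) with all other sentences (reference). The study used quanteda’s keyness routines in R; the sketch below illustrates the same idea in Python with invented toy sentences.

```python
# Illustrative Python sketch of the keyness idea: for each word, a 2x2 chi-squared
# test compares its frequency in target sentences (category present) with
# reference sentences (category absent). The study itself used quanteda's
# keyness routines in R; the sentences below are invented toy data.
from collections import Counter
from scipy.stats import chi2_contingency

target = ["the sample size and statistical analysis need justification",
          "please describe the regression model and the control group"]
reference = ["the introduction is well written",
             "this is an interesting and timely topic"]

def word_counts(sentences):
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts

t_counts, r_counts = word_counts(target), word_counts(reference)
t_total, r_total = sum(t_counts.values()), sum(r_counts.values())

keyness = {}
for word in set(t_counts) | set(r_counts):
    if t_counts[word] / t_total <= r_counts[word] / r_total:
        continue  # keep only words over-represented in the target sentences
    table = [[t_counts[word], t_total - t_counts[word]],
             [r_counts[word], r_total - r_counts[word]]]
    chi2, _, _, _ = chi2_contingency(table)
    keyness[word] = chi2

# Words most characteristic of the target category (highest chi-squared values)
print(sorted(keyness.items(), key=lambda item: item[1], reverse=True)[:5])
```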

Statistical analysis

The association between peer review characteristics and Journal Impact Factor groups was examined in 2 ways. The analysis of the number of sentences for each category used negative binomial regression models. The analysis of the percentages of sentences addressing content categories relied on linear mixed-effects models. To account for the clustered nature of the data, we included random intercepts for journals and reviewers [48]. The regression models take the form

$$Y_i = \alpha_{j[i],\,k[i]} + \sum_{m=2}^{10} \beta_m \cdot I(\mathrm{JIF}_i = m) + \epsilon_i$$

with

$$\alpha_j \sim N\left(\mu_{\alpha_j}, \sigma^2_{\alpha_j}\right), \quad \text{for journal } j = 1, \dots, J,$$

$$\alpha_k \sim N\left(\mu_{\alpha_k}, \sigma^2_{\alpha_k}\right), \quad \text{for reviewer } k = 1, \dots, K,$$

$$\epsilon_i \sim N\left(0, \sigma^2\right),$$

where Yi is the count of sentences addressing a content category (for the negative binomial regression models) or the percentage of such sentences (for the linear mixed-effects models) in review i, βm are the coefficients for the m = 2,…,10 categories of the categorical Journal Impact Factor variable (with m = 1 as the reference category), and ϵi is the unobserved error term. The model includes varying intercepts αj[i],k[i] for the J journals and K reviewers. I(·) denotes the indicator function.

All regression analyses were done in R (version 4.2.1). The fine-tuning of the classifier and sentence-level predictions were done in Python (version 3.8.13). The libraries used for data preparation, text analysis, supervised classification, and regression models were transformers (version 4.20.1) [49], quanteda (version 3.2.3) and quanteda.textstats (version 0.95) [50], lme4 (version 1.1.30) [51], glmmTMB (version 1.1.7) [52], ggeffects (version 1.1.5) [53], and tidyverse (version 1.3.2) [54].
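As noted, the models were fitted in R with lme4 and glmmTMB. For readers working in Python, the simplified sketch below approximates one of the linear (percentage) models with statsmodels, using a random intercept for journal only; the column names are hypothetical, and the paper’s full specification additionally includes a crossed random intercept for reviewer as well as negative binomial models for sentence counts.

```python
# Simplified Python analogue of one linear mixed-effects (percentage) model.
# The study fitted these models in R (lme4/glmmTMB); this sketch uses
# statsmodels with a random intercept for journal only, and the column names
# ('pct_materials_methods', 'jif_group', 'journal') are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

reviews = pd.read_csv("review_level_data.csv")  # hypothetical file: one row per review

model = smf.mixedlm("pct_materials_methods ~ C(jif_group)",  # JIF group 1 as reference
                    data=reviews,
                    groups="journal")                         # random intercept per journal
result = model.fit()
print(result.summary())  # fixed-effect coefficients correspond to the contrasts in Fig 5
```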

Supporting information

S1 File. Journals and disciplines included in the study.

The 10 journals from each Journal Impact Factor group that provided the largest number of peer review reports and all 1,644 journals included in the analysis, listed in alphabetical order. The numbers in parentheses represent the JIF and the number of reviews included in the sample.

(PDF)

S2 File. Further details on classification and validation.

Further information on the hand-coded set of sentences and the classification approach. Provides metrics on the classification performance and shows that the aggregated classifications closely mirror human coding of the same set of sentences. All results are out-of-sample predictions, meaning that the data in the held-out test set were not used for training the classifier during the validation steps.

(PDF)

S3 File. Additional details on regression analyses and sensitivity analyses.

All regression tables for the analysis reported in the paper, and plots and regression tables relating to the 5 sensitivity analyses. All sensitivity analyses are conducted for the prevalence-based and sentence-based models.

(PDF)

S4 File. Codebook and instructions.

Coding instructions and examples for each of the 8 characteristics of peer review reports.

(PDF)

S1 Data. Supporting data for Fig 1.

(XLSX)

S2 Data. Supporting data for Fig 2.

(XLSX)

S3 Data. Supporting data for Fig 3.

(XLSX)

S4 Data. Supporting data for Fig 4.

(XLSX)

S5 Data. Supporting data for Fig 5.

(XLSX)

S6 Data. Supporting data for Fig 1 in S1 File.

(XLSX)

S7 Data. Supporting data for Fig 1 in S2 File.

(XLSX)

S8 Data. Supporting data for Fig 2 in S2 File.

(XLSX)

S9 Data. Supporting data for Fig 3 in S2 File.

(XLSX)

S10 Data. Supporting data for Fig 4 in S2 File.

(XLSX)

S11 Data. Supporting data for Fig 1 in S3 File.

(XLSX)

S12 Data. Supporting data for Fig 2 in S3 File.

(XLSX)

S13 Data. Supporting data for Fig 3 in S3 File.

(XLSX)

S14 Data. Supporting data for Fig 4 in S3 File.

(XLSX)

S15 Data. Supporting data for Fig 5 in S3 File.

(XLSX)

S16 Data. Supporting data for Fig 6 in S3 File.

(XLSX)

S17 Data. Supporting data for Fig 7 in S3 File.

(XLSX)

S18 Data. Supporting data for Fig 8 in S3 File.

(XLSX)

S19 Data. Supporting data for Fig 9 in S3 File.

(XLSX)

S20 Data. Supporting data for Fig 10 in S3 File.

(XLSX)

Acknowledgments

We are grateful to Anne Jorstad and Gabriel Okasa from the Swiss National Science Foundation (SNSF) data team for valuable comments on an earlier draft of this paper. We would also like to thank Marc Domingo (Publons, part of Web of Science) for help with the sampling procedure.

Abbreviations

CI

confidence interval

COST

European Cooperation in Science and Technology

DORA

San Francisco Declaration on Research Assessment

ESI

Essential Science Indicators

Data Availability

All relevant data are within the paper and its Supporting Information files. The fine-tuned DistilBERT models, data, and code to verify the reproducibility of all tables and graphs are available at https://doi.org/10.5281/zenodo.8006829. Publons’ data sharing policy prohibits us from publishing the raw text of the reviews and the annotated sentences.

Funding Statement

This study was supported by Swiss National Science Foundation (SNSF) grant 32FP30-189498 to ME, see https://data.snf.ch/grants/grant/189498) and internal SNSF resources. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Severin A, Chataway J. Purposes of peer review: A qualitative study of stakeholder expectations and perceptions. Learn Publ. 2021;34:144–155. doi: 10.1002/leap.1336 [DOI] [Google Scholar]
  • 2.ORCID Support. Peer Review. In: ORCID [Internet]. [cited 2022 Jan 20]. Available from: https://support.orcid.org/hc/en-us/articles/360006971333-Peer-Review
  • 3.Malchesky PS. Track and verify your peer review with Publons. Artif Organs. 2017;41:217. doi: 10.1111/aor.12930 [DOI] [PubMed] [Google Scholar]
  • 4.Ledford H, Van Noorden R. Covid-19 retractions raise concerns about data oversight. Nature. 2020;582:160–160. doi: 10.1038/d41586-020-01695-w [DOI] [PubMed] [Google Scholar]
  • 5.Grudniewicz A, Moher D, Cobey KD, Bryson GL, Cukier S, Allen K, et al. Predatory journals: no definition, no defence. Nature. 2019;576:210–212. doi: 10.1038/d41586-019-03759-y [DOI] [PubMed] [Google Scholar]
  • 6.Strinzel M, Severin A, Milzow K, Egger M. Blacklists and whitelists to tackle predatory publishing: a cross-sectional comparison and thematic analysis. MBio. 2019;10:e00411–e00419. doi: 10.1128/mBio.00411-19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Garfield E. The history and meaning of the journal impact factor. JAMA. 2006;295:90–93. doi: 10.1001/jama.295.1.90 [DOI] [PubMed] [Google Scholar]
  • 8.Frank E. Authors’ criteria for selecting journals. JAMA. 1994;272:163–164. [PubMed] [Google Scholar]
  • 9.Regazzi JJ, Aytac S. Author perceptions of journal quality. Learn Publ. 2008;21:225−+. doi: 10.1087/095315108X288938 [DOI] [Google Scholar]
  • 10.Rees EL, Burton O, Asif A, Eva KW. A method for the madness: An international survey of health professions education authors’ journal choice. Perspect Med Educ. 2022;11:165–172. doi: 10.1007/s40037-022-00698-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Saha S, Saint S, Christakis DA. Impact factor: a valid measure of journal quality? J Med Libr Assoc. 2003;91:42–46. [PMC free article] [PubMed] [Google Scholar]
  • 12.McKiernan EC, Schimanski LA, Muñoz Nieves C, Matthias L, Niles MT, Alperin JP. Use of the journal impact factor in academic review, promotion, and tenure evaluations. elife. 2019;8:e47338. doi: 10.7554/eLife.47338 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Essential Science Indicators. In: Overview [Internet]. [cited 2023 Mar 9]. Available from: https://esi.help.clarivate.com/Content/overview.htm?Highlight=esi%20essential%20science%20indicators
  • 14.Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv. 2020. Available from: http://arxiv.org/abs/1910.01108 [Google Scholar]
  • 15.Bondi M, Scott M, editors. Keyness in Texts. Amsterdam: John Benjamins Publishing Company; 2010. doi: 10.1075/scl.41 [DOI] [Google Scholar]
  • 16.Seglen PO. Why the impact factor of journals should not be used for evaluating research. BMJ. 1997;314(7079):498–502. doi: 10.1136/bmj.314.7079.497 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.de Rijcke S, Wouters PF, Rushforth AD, Franssen TP, Hammarfelt B. Evaluation practices and effects of indicator use—a literature review. Res Eval. 2016;25:161–169. doi: 10.1093/reseval/rvv038 [DOI] [Google Scholar]
  • 18.Bornmann L, Marx W, Gasparyan AY, Kitas GD. Diversity, value and limitations of the journal impact factor and alternative metrics. Rheumatol Int. 2012;32:1861–1867. doi: 10.1007/s00296-011-2276-1 [DOI] [PubMed] [Google Scholar]
  • 19.DORA–San Francisco Declaration on Research Assessment (DORA). [cited 2019 Oct 2]. Available from: https://sfdora.org/
  • 20.Global State of peer review report. In: Clarivate [Internet]. [cited 2023 Mar 10]. Available from: https://clarivate.com/lp/global-state-of-peer-review-report/
  • 21.Callaham M, McCulloch C. Longitudinal trends in the performance of scientific peer reviewers. Ann Emerg Med. 2011;57:141–148. doi: 10.1016/j.annemergmed.2010.07.027 [DOI] [PubMed] [Google Scholar]
  • 22.Evans AT, Mcnutt RA, Fletcher SW, Fletcher RH. The characteristics of peer reviewers who produce good-quality reviews. J Gen Intern Med. 1993;8:422–428. doi: 10.1007/BF02599618 [DOI] [PubMed] [Google Scholar]
  • 23.The Editors of the Lancet Group. The Lancet Group’s commitments to gender equity and diversity. Lancet. 2019;394:452–453. doi: 10.1016/S0140-6736(19)31797-0 [DOI] [PubMed] [Google Scholar]
  • 24.A commitment to equality, diversity, and inclusion for BMJ and our journals. In: The BMJ [Internet]. 2021 Jul 23 [cited 2022 Apr 12]. Available from: https://blogs.bmj.com/bmj/2021/07/23/a-commitment-to-equality-diversity-and-inclusion-for-bmj-and-our-journals/
  • 25.Fontanarosa PB, Flanagin A, Ayanian JZ, Bonow RO, Bressler NM, Christakis D, et al. Equity and the JAMA Network. JAMA. 2021;326:618–620. doi: 10.1001/jama.2021.9377 [DOI] [PubMed] [Google Scholar]
  • 26.Godlee F, Gale CR, Martyn C. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports. A randomized controlled trial. JAMA. 1998;280:237–240. [DOI] [PubMed] [Google Scholar]
  • 27.Open Peer Review. In: PLOS [Internet]. [cited 2022 Mar 1]. Available from: https://plos.org/resource/open-peer-review/
  • 28.Wolfram D, Wang P, Hembree A, Park H. Open peer review: promoting transparency in open science. Scientometrics. 2020;125:1033–1051. doi: 10.1007/s11192-020-03488-4 [DOI] [Google Scholar]
  • 29.A decade of transparent peer review–Features–EMBO. [cited 2023 Mar 10]. Available from: https://www.embo.org/features/a-decade-of-transparent-peer-review/
  • 30.Clarivate AHSPM. Introducing open peer review content in the Web of Science. In: Clarivate; [Internet]. 2021. Sep 23 [cited 2022 Mar 1]. Available from: https://clarivate.com/blog/introducing-open-peer-review-content-in-the-web-of-science/ [Google Scholar]
  • 31.Squazzoni F, Ahrweiler P, Barros T, Bianchi F, Birukou A, Blom HJJ, et al. Unlock ways to share data on peer review. Nature. 2020;578:512–514. doi: 10.1038/d41586-020-00500-y [DOI] [PubMed] [Google Scholar]
  • 32.Severin A, Strinzel M, Egger M, Domingo M, Barros T. Characteristics of scholars who review for predatory and legitimate journals: linkage study of Cabells Scholarly Analytics and Publons data. BMJ Open. 2021;11:e050270. doi: 10.1136/bmjopen-2021-050270 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Bonikowski B, Luo Y, Stuhler O. Politics as usual? Measuring populism, nationalism, and authoritarianism in U.S. presidential campaigns (1952–2020) with neural language models. Sociol Methods Res. 2022;51:1721–1787. doi: 10.1177/00491241221122317 [DOI] [Google Scholar]
  • 34.Publons. Track more of your research impact. In: Publons; [Internet]. [cited 2022 Jan 18]. Available from: http://publons.com [Google Scholar]
  • 35.Tunstall L, von Werra L, Wolf T. Natural Language Processing with Transformers: Building Language Applications with Hugging Face. 1st ed. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly Media; 2022. [Google Scholar]
  • 36.2019 Journal Impact Factors. Journal Citation Reports. London, UK: Clarivate Analytics; 2020.
  • 37.Scope Notes [cited 2022 Jun 20]. Available from: https://esi.help.clarivate.com/Content/scope-notes.htm
  • 38.Superchi C, González JA, Solà I, Cobo E, Hren D, Boutron I. Tools used to assess the quality of peer review reports: a methodological systematic review. BMC Med Res Methodol. 2019;19:48. doi: 10.1186/s12874-019-0688-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Ramachandran L, Gehringer EF. Automated assessment of review quality using latent semantic analysis. 2011 IEEE 11th International Conference on Advanced Learning Technologies. Athens, GA, USA: IEEE; 2011. p. 136–138. doi: 10.1109/ICALT.2011.46 [DOI]
  • 40.Ghosal T, Kumar S, Bharti PK, Ekbal A. Peer review analyze: A novel benchmark resource for computational analysis of peer reviews. PLoS ONE. 2022;17:e0259238. doi: 10.1371/journal.pone.0259238 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Thelwall M, Papas E-R, Nyakoojo Z, Allen L, Weigert V. Automatically detecting open academic review praise and criticism. Online Inf Rev. 2020;44:1057–1076. doi: 10.1108/OIR-11-2019-0347 [DOI] [Google Scholar]
  • 42.Buljan I, Garcia-Costa D, Grimaldo F, Squazzoni F, Marušić A. Large-scale language analysis of peer review reports. Rodgers P, Hengel E, editors. elife. 2020;9:e53249. doi: 10.7554/eLife.53249 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Luo J, Feliciani T, Reinhart M, Hartstein J, Das V, Alabi O, et al. Analyzing sentiments in peer review reports: Evidence from two science funding agencies. Quant Sci Stud. 2022;2:1271–1295. doi: 10.1162/qss_a_00156 [DOI] [Google Scholar]
  • 44.Krippendorff K. Reliability in content analysis—Some common misconceptions and recommendations. Hum Commun Res. 2004;30:411–433. doi: 10.1111/j.1468-2958.2004.tb00738.x [DOI] [Google Scholar]
  • 45.Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. New York: Cambridge University Press; 2008. [Google Scholar]
  • 46.Olczak J, Pavlopoulos J, Prijs J, Ijpma FFA, Doornberg JN, Lundström C, et al. Presenting artificial intelligence, deep learning, and machine learning studies to clinicians and healthcare stakeholders: an introductory reference with a guideline and a Clinical AI Research (CAIR) checklist proposal. Acta Orthop. 2021;92:513–525. doi: 10.1080/17453674.2021.1918389 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Gabrielatos C. Chapter 12: Keyness Analysis: nature, metrics and techniques. In: Taylor C, Marchi A, editors. Corpus Approaches to Discourse: A critical review. Oxford: Routledge; 2018. p. 31. [Google Scholar]
  • 48.Jayasinghe UW, Marsh HW, Bond N. A multilevel cross-classified modelling approach to peer review of grant proposals: the effects of assessor and researcher attributes on assessor ratings. J R Stat Soc Ser A Stat Soc. 2003;166:279–300. doi: 10.1111/1467-985X.00278 [DOI] [Google Scholar]
  • 49.Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. HuggingFace’s transformers: state-of-the-art natural language processing. arXiv. 2020. Available from: http://arxiv.org/abs/1910.03771 [Google Scholar]
  • 50.Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, et al. quanteda: An R package for the quantitative analysis of textual data. J Open Source Softw. 2018;3:774. doi: 10.21105/joss.00774 [DOI] [Google Scholar]
  • 51.Bates D, Maechler M, Bolker BM, Walker SC. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67:1–48. [Google Scholar]
  • 52.Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J. 2017;9:378. doi: 10.32614/RJ-2017-066 [DOI] [Google Scholar]
  • 53.Lüdecke D. ggeffects: Tidy data frames of marginal effects from regression models. J Open Source Softw. 2018;3:772. doi: 10.21105/joss.00772 [DOI] [Google Scholar]
  • 54.Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4:1686. doi: 10.21105/joss.01686 [DOI] [Google Scholar]

Decision Letter 0

Roland G Roberts

23 Aug 2022

Dear Matthias,

Thank you for submitting your manuscript entitled "Journal Impact Factor and Peer Review Thoroughness and Helpfulness: A Supervised Machine Learning Study" for consideration as a Meta-Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. After your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by Aug 25 2022 11:59PM.

If your manuscript has been previously peer-reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like us to consider previous reviewer reports, please edit your cover letter to let us know and include the name of the journal where the work was previously considered and the manuscript ID it was given. In addition, please upload a response to the reviews as a 'Prior Peer Review' file type, which should include the reports in full and a point-by-point reply detailing how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli

Roland Roberts, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

Decision Letter 1

Roland G Roberts

12 Oct 2022

Dear Matthias,

Thank you for your patience while your manuscript "Journal Impact Factor and Peer Review Thoroughness and Helpfulness: A Supervised Machine Learning Study" was peer-reviewed at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

As you will see in the reviewer reports, which can be found at the end of this email, although the reviewers find the work potentially interesting, they have also raised a substantial number of important concerns. Based on their specific comments and following discussion with the Academic Editor, it is clear that a substantial amount of work would be required to meet the criteria for publication in PLOS Biology.

However, given our and the reviewers' interest in your study, we would be open to inviting a comprehensive revision of the study that thoroughly addresses all the reviewers' comments. Given the extent of revision that would be needed, we cannot make a decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript would need to be seen by the reviewers again, but please note that we would not engage them unless their main concerns have been addressed.

You'll see that most of reviewer #1's requests are for methodological clarification, but he also wonders whether your results may be skewed by the greater length of the reviews in high-impact journals. He also points out the underappreciated contribution of professional editors (!!). Reviewer #2 is also broadly positive, though regrets the fact that data are from a single source; he recommends that you try to validate with respect to another source, and suggests several substantial additional analyses. Like reviewer #1, he requests some methodological clarifications, mentions some possible confounders, and makes some very helpful textual and interpretational suggestions. Reviewer #3, who is a machine learning expert, is similarly positive, but has a long series of quite severe-sounding criticisms of your methodology and its reporting.

We appreciate that these requests represent a great deal of extra work, and we are willing to relax our standard revision time to allow you 6 months to revise your study. Please email us (plosbiology@plos.org) if you have any questions or concerns, or envision needing a (short) extension.

At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw it.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted " file type.

*Resubmission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli

Roland Roberts, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

-----------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

[identifies himself as Ludo Waltman]

Please find my review at https://ludowaltman.pubpub.org/pub/review-jif-pr/release/1

[the editor has here pasted the text of the review from that location]

This paper presents a large-scale analysis of the content of peer review reports, focusing on different types of comments provided in review reports and the association with the impact factors of journals. The scale of the analysis is impressive. Studies of the content of such a large number of review reports are exceptional. I enjoyed reading the paper, even though I did not find the results presented in the paper to be particularly surprising.

Feedback and suggestions for improvements are provided below.

The methods used by the authors would benefit from a significantly more detailed explanation:

“Scholars can submit their reviews for other journals by either forwarding the review confirmation emails from the journals to Publons or by sending a screenshot of the review from the peer review submission system.”: This sentence is unclear. Review confirmation emails often do not include the review itself, only a brief ‘thank you’ message, so it is not clear to me how a review can be obtained from such a confirmation email. I also do not understand how a review can be obtained from a screenshot. A screenshot may show only part of the review, not the entire review, and there would be a significant technical challenge in converting the screenshot, which is an image, to machine-readable text.

I would like to know whether all reviews are in English or whether there are also reviews in other languages.

Impact factors change over time. New impact factors are calculated each year. The authors need to explain which impact factors they used.

There are many journals that do not have an impact factor. The authors need to explain how these journals were handled.

The authors also need to discuss how reviewers were linked to publication profiles. This is a non-trivial step that needs to be taken to determine the number of publications of a reviewer and the start and end year of the publications of a reviewer. The authors do not explain how this step was taken in their analysis. It is important to provide this information.

“We used a Naïve Bayes algorithm to train the classifier and predict the absence or presence of the eight characteristics in each sentence of the peer review report.”: The machine learning approach used by the authors is explained in just one sentence. A more elaborate explanation is needed. There are lots of machine learning approaches. The authors need to explain why they use Naïve Bayes. They also need to briefly discuss how Naïve Bayes performs the classification task.

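[For illustration only, not the authors' method: a minimal sketch of the kind of Naïve Bayes sentence classifier the reviewer asks to see explained, using scikit-learn with invented example sentences and labels and one binary classifier per content category; the published study ultimately used fine-tuned DistilBERT models.]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-coded sentences, labelled 1 if they address Materials and Methods, else 0
sentences = [
    "The sample size calculation is not reported.",
    "Please describe how participants were randomised.",
    "Figure 2 would be clearer with labelled axes.",
    "The abstract contains several typographical errors.",
]
labels = [1, 1, 0, 0]

# Bag-of-words features + multinomial Naive Bayes: the model learns word frequencies
# for each class and predicts whether a new sentence addresses the category.
clf = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
clf.fit(sentences, labels)
print(clf.predict(["Was the allocation sequence concealed?"]))
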
Likewise, I would like to see a proper discussion of the statistical model used by the authors. The authors informally explain their statistical approach. I would find it helpful to see a more formal description (in mathematical notation) of the statistical model used by the authors.

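[For illustration only: one possible formalisation of the kind the reviewer asks for, assuming a mixed model for the percentage of sentences in a given content category with a journal-level random intercept; the authors' actual specification may differ.]

y_{ij} = \beta_0 + \sum_{g=2}^{10} \beta_g \, \mathrm{JIF}_{g[j]} + u_j + \varepsilon_{ij}, \qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

where y_{ij} is the percentage of sentences addressing the category in review i of journal j, \mathrm{JIF}_{g[j]} indicates that journal j belongs to impact factor group g (with group 1 as reference), and u_j is a random intercept capturing between-journal variation.
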
“Most distributions were skewed right, with a peak at 0% showing the number of reviews that did not address the content category (Fig 1).”: I do not understand how the peaks at 0% can be explained. Could this be due to problems in the data (e.g., missing or empty review reports)? The authors need to explain this.

“the prevalence of content related to thoroughness and helpfulness varied widely even between journals with similar journal impact factor”: I am not sure whether the word ‘between’ is correct in this sentence. My understanding is that the authors did not distinguish between variation between journals and variation within journals.

“Some journals now publish peer reviews and authors' responses with the articles”: Consider citing the following paper: https://doi.org/10.1007/s11192-020-03488-4. I also recently published a blog post on this topic: https://www.leidenmadtrics.nl/articles/the-growth-of-open-peer-review.

“Bibliographic databases have also started to publish reviews.”: In addition to Web of Science, I think the work done by Europe PMC needs to be acknowledged as well. See for instance this poster presented at the recent OASPA conference: https://oaspa.org/wp-content/uploads/2022/09/Melissa-Harrison_COASP-2022-poster_V2.pdf.

“peer review in journals with higher impact factors tends to be more thorough in addressing study methods but less helpful in suggesting solutions or providing examples”: I wonder whether this conclusion is justified. Relatively speaking sentences in reviews for higher impact factor journals are indeed more likely to address methods and less likely to suggest solutions or to provide examples. However, as shown by the authors, reviews for higher impact factor journals tend to be substantially longer than reviews for lower impact factor journals. Therefore it seems that the total number of sentences (as opposed to the proportion of sentences) suggesting solutions or providing examples may be higher in reviews for higher impact factor journals than in reviews for lower impact factor journals. If that is indeed the case, it seems to me the conclusion should be that peer review in higher impact factor journals is both more thorough and more helpful.

Finally, I think it needs to be acknowledged that quality assurance processes of journals consist not only of the work done by peer reviewers but also of the work done by the editorial staff of journals. This seems important in particular for more prestigious journals, which presumably make more significant investments in editorial quality assurance processes. The results presented in the paper offer valuable insights into peer review processes, but they provide only a partial picture of the overall quality assurance processes of journals.

Reviewer #2:

[identifies himself as Bernd Pulverer]

Severin et al. add to their previous work (ref 23) on analyzing attributes of scholarly referee reports. Peer review is generally regarded as a pivotal component of the scholarly process and as such quantitative analysis is to be welcomed.

The study is based on a large set of about 10,000 referee reports across a broad set of biomedical disciplines and uses human annotation to train machine learning-based extraction according to 8 pre-identified categories. The study limits itself to analyzing how the 8 referee report attributes compare across 10 Journal Impact Factor (JIF) bins. Regression modelling is applied. Several of the attributes exhibit no trends, others at best very weak trends. That in itself is notable, as for example the minor negative trend of comments on 'Importance and Relevance' vs. JIF is surprising as higher JIF journals tend to instruct referees to comment specifically on these attributes, which form a core part of the selection criteria of such journals. Stronger correlations are reported for the categories 'Materials and Methods' (positive), 'Presentation and Reporting' and 'Suggestion and Solution' (both negative). The authors conclude that referee reports for higher JIF journals may be more 'thorough' but less 'helpful in suggesting solutions and providing examples'. These trends are notable and not predictable - they are also somewhat difficult to rationalize and it is to the authors' credit that they don't overinterpret these numbers beyond the conclusions that 'JIF is a bad predictor of peer review' and in fact end the paper with a balanced strength/weakness analysis. The data and approach are reported in detail, but source data for Figs 1-3 should be added.

This is an important area of analysis of general interest and the study is thorough. The conclusions are somewhat limited by restricting the analysis to one variable, the JIF, and by limiting the referee report attributes to 8 categories (see below). With the heavy lifting of the human-curated training set in hand, it is a pity that the study was not developed beyond the JIF correlations. As such, this specific analysis appears novel and it is based on a large dataset, albeit from a single source.

Major comments:

1) The dataset is large, but limited to one database (Publons). This may well add biases to the data, as the authors note themselves. It would have been helpful to expand the analysis to other databases hosting referee reports, such as ORCID, as well as to journals that publish referee reports alongside their papers, such as BMJ, EMBO, eLife and some Nature-branded journals. Minimally, to test whether the reported trends hold up.

2) It would also have been useful to test for another potential bias: open reports vs. closed reports (still the majority): a collaboration with journals that do not publish their reports (and filtering out referees who posted on Publons or ORCID) would have allowed an interesting comparison of whether the trends are identical when peer review is confidential. Since only aggregate data are reported, a journal/publisher collaboration should be feasible.

3) The study is based on 8 categories. It is unclear how these categories were chosen, and the detail of how the annotators defined them is limited. More importantly, other important attributes are missing, for example the number of experimental requests made vs. the number of textual requests made, or the % of a referee report dedicated to specific points vs. general discussion/subjective points. Expanding the set would add value. As a minor point, it is noted that one category, 'Importance and Relevance', is explicitly excluded from a number of major journals, such as PLOS ONE. This could be a confounding factor. I realize that 'multidisciplinary' journals have already been excluded, and maybe this covers all such journals, but please comment.

4) It is unclear if only research papers were analyzed. It is recommended that other peer reviewed papers such as reviews are excluded.

5) The study shows that the JIF does not predict many of the attributes. With the same dataset other variables could be assessed, such as 'subject area' (already defined in the study as ESI research field, in particular clinical research vs. 'basic research'). This is particularly important as baseline JIF is rather different between such categories, which may be a confounding factor in this analysis, but it may also lead to stronger correlations than JIF. Other variables could be category of paper (short report vs. full research paper) or length of paper (report correlation between paper and referee report length). Other interesting areas would be journal name, journal editorial process, referee age or experience, referee gender, referee affiliation. The authors note that referee age could not be analyzed and discuss other variables, noting 'adjusting for additional variables strengthened relationships'. It is recommended that this section be expanded and the data added. The referee geography as a function of JIF is reported in Table 1: it would be interesting to correlate this with that of the corresponding authors of the papers refereed, if that is feasible.

6) The 'trends' seen for 'Importance and Relevance' and 'Example' (Fig 3) are reported as statistically significant, but they are very small and arguably hard to interpret against the background of complex confounding factors. 'Criticism' shows arguably a similar range of variation and yet is classed as 'no effect'. I would recommend not emphasizing these.

Minor Comments:

1) The very first sentence of the abstract states that JIF is used as a proxy for journal quality and thus peer review quality. First of all, JIF claims to measure 'impact', not quality, and this is a distortion, although both may correlate. Also, as stated, it is implied that referees select what is published, which is not the case (editors select, assisted by referee input). Thus, even assuming editors select for JIF maximization, the JIF-to-peer-review connection is indirect at best.

2) Please explain why the regression analyses were controlled for review length. 'since longer texts …address more categories' seems tenuous as multiple categories can be assigned to each sentence.

3) Discussion, second paragraph: a key outcome of studies such as this is to develop 'referee credit' systems. Processes such as the referee report analysis applied here can be applied to individual referee reports and referees to aid such a system.

4) The 'Typical words' section could be removed as it is covered in S4.

5) Table 2 is of limited value and could be removed or added as a supplementary figure.

Text suggestions:

1) Abstract, line 3: also add the no. of papers analyzed here.

2) Abstract, line 9: state whole range to avoid confusion: 0.21-74.70, median 1.2-8.0

3) Introduction, lines 4-9: I suggest removing the claim that peer review is 'particularly critical for the medical sciences'. This is debatable, but the paper is not restricted to the medical sciences (in fact, as noted above, a comparison between medical and biological sciences in this dataset would add considerably).

4) Introduction, second para: the claim 'in the absence of evidence on the quality of peer review ….proxy measures like JIF…' is tenuous at best. JIF is used as a proxy for 'impact', maybe even 'quality', and peer review is a key part of quality assurance but does not in itself define journal selection. In fact, one could highlight two functions: aiding journal selection; improving the paper. Please adapt.

5) Introduction, second para: change 'articles published' to 'articles classed as citable (by ISI-Clarivate)'

6) Discussion, line 14: I am not sure the data definitively show high JIF reports are 'more thorough'.

7) Discussion, 3rd para: I disagree with the hypothesis that 'junior referees might be less able to comment on methodology'. All the evidence points the other way, and this is not surprising since ECRs are practitioners. It is fine to pose a hypothesis, of course, and then cite evidence against it, but this section could also be deleted as it is - unfortunately - not tested here.

8) Discussion, 4th para: Refs 19, 20 are cited in support of transparent/open peer review. Nature was actually rather late in adopting this, and others, like the BMJ group, EMBO Press, the BMC series and eLife, could be cited.

9) Discussion 5th para: ORCID should be discussed here.

10) Methods: Publons has been part of Clarivate for years, not 'now'.

This referee is not an expert in machine learning or statistical analysis and therefore did not assess these aspects of the work in detail.

Reviewer #3:

While this is a highly interesting study, there are several major questions and issues that preclude a favorable assessment at this point.

1) Introduction

It's relatively well known, at least in my circles, that the impact factor (IF) is misused to assess journal quality and even single-paper quality. However, I do not necessarily agree with the notion that this includes an overestimation of peer review quality as well. Curiously, the manuscript also does not provide a single reference to back this extension ("are used to assess the quality of journals and, by extension, the quality of peer review."). This is problematic since the premise of the introduction rests upon this idea.

The authors should provide evidence for the notion that IF and peer review quality are linked or have been perceived as linked.

2) Methods

General comment on the methods: The machine learning pipeline is not very well described, even with the added supplement. A supplement should not contain information that is absolutely necessary to understand the methodology.

I strongly suggest a general revision for clarity using standard machine learning terminology and phrasings, and review what is in the main manuscript and the supplement.

One example:

"We divided the sample into five equally sized subsets and ran the cross-validation five times."

This is how cross-validation was explained. While this explanation _could_ in theory mean cross-validation, this definition could also support other data splits which are not cross-validation.

3) Methods / Classification and Validation

In the section that describes the categories, I noted that some points were labeled as "did the reviewers _discuss_" a certain topic, whereas in others the label was "did the reviewers _comment on_" a topic.

To comment on something or to discuss something is a clear qualitative difference. Is there a reason why these phrasings were used?

4) Methods / metrics

The methods section reads as if the authors only calculated PPV, sensitivity, and the F1 score. However, supplement S2 also describes and shows the accuracy. I assume that the authors calculated an even bigger set of metrics. So please justify why these particular 3 (or 4) metrics were chosen to be presented in the paper (and why others were not).

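[For reference: the metrics named here have their standard definitions in terms of the confusion-matrix counts, i.e., true/false positives (TP/FP) and true/false negatives (TN/FN).]

\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
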
5) Methods / Results

Generally, the performance of the models is pretty bad. Even the top three that were chosen for further analysis perform pretty badly. Also, the authors have compared only NB and an SVM. If this was a data science project in a bootcamp, the authors would fail it as they kind of stopped after 40% of the work. Especially boosted trees would have been worth exploring as they consistently rank highest in the literature and in competition compared to simpler algorithms. Together with point 6 (below) and proper hyperparameter tuning it is likely that a boosted tree model would lead to better results.

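[For illustration only: a minimal sketch, on synthetic data, of the stronger boosted-tree baseline with basic hyperparameter tuning that the reviewer suggests; the data, parameter grid, and scoring choice are placeholders, not the study's setup.]

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary task standing in for one content category
X, y = make_classification(n_samples=500, n_features=50, weights=[0.85, 0.15], random_state=0)

# Gradient-boosted trees with a small, illustrative hyperparameter grid
search = GridSearchCV(
    HistGradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [3, None]},
    scoring="f1",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
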
6) Methods

Looking at S2, it becomes pretty clear that there is an imbalance problem, definitely present for the 5 least common categories. I did not find any mention that the authors adjusted their k-fold cross-validation for imbalanced data. In this case, stratified sampling for the k-fold cross-validation is the right method, which would likely lead to better and more stable results. To assess this, the authors should also report the standard deviation across the cross-validation folds. I would generally also suggest 10 instead of 5 folds when dealing with such a complicated setup.

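[For illustration only: a minimal sketch, on synthetic imbalanced data, of stratified 10-fold cross-validation with the fold-to-fold standard deviation the reviewer asks to see reported; the classifier and data are placeholders, not the study's models.]

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for a rare content category
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=1)

# StratifiedKFold keeps the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print("F1 per fold:", np.round(scores, 2))
print("mean = %.2f, SD = %.2f" % (scores.mean(), scores.std()))
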
7) Results

As the paper is written, it is very suggestive that the authors believe the correlation found between high-impact journals and peer review focusing more on methods is also causal. While the authors do not claim causality (which by definition cannot be shown by ML alone), the phrasings are still very suggestive, e.g. from the abstract: "In conclusion, peer review in journals with higher journal impact factors tends to be more thorough in discussing the methods used...". There is, however, also the mention of a confounding factor. The authors say that reviewers for high impact journals tended to come from a certain geographic region (Europe/NA). So, maybe researchers in Europe/NA are trained to be more focused on methods? Given this confounder and the fact that ML is based on correlation, any conclusions drawn from this study should be phrased much more carefully than currently.

8) General comment

Given that the study was based on available peer reviews, the categories for very high impact of course contain those high-impact journals, but very well known highest-impact journals like the New England Journal of Medicine are not in the top ten. Are such flagship journals even present in the sample? If not, that's a shortcoming and a limitation. The authors should comment on this.

Overall, the methodological shortcomings of the study make it hard for me to find the results trustworthy. The ML modelling should be performed completely and according to the state of the art, and conclusions should only be drawn on data generated by those final models.

Decision Letter 2

Roland G Roberts

9 May 2023

Dear Dr Egger,

Thank you for your patience while we considered your revised manuscript "Journal Impact Factor and Peer Review Thoroughness and Helpfulness: A Supervised Machine Learning Study" for consideration as a Meta-Research Article at PLOS Biology. Your revised study has now been evaluated by the PLOS Biology editors, the Academic Editor, and the original reviewers.

In light of the reviews, which you will find at the end of this email, we are pleased to offer you the opportunity to address the [comments/remaining points] from the reviewers in a revision that we anticipate should not take you very long. We will then assess your revised manuscript and your response to the reviewers' comments with our Academic Editor, aiming to avoid further rounds of peer review, although we might need to consult with the reviewers, depending on the nature of the revisions.

IMPORTANT - Please attend to the following:

a) Reviewer #1 raises a potentially important point about your decision to normalise by length, and indeed all three reviewers mention the way that review length was treated in this study. Please address these and the other concerns raised by the reviewers.

b) Please could you change the Title to "Relationship between Journal Impact Factor and the thoroughness and helpfulness of peer reviews"? Normally we would ask you to incorporate the specific finding(s) in the title, but these are somewhat complex and nuanced, and may change in response to the reviewers' comments.

c) Please ensure that you comply with our Data Policy; specifically, we need you to supply the numerical values underlying Figs 1, 2, 3, S1.1, S2.1, S2.2, S2.3, S2.4, and the 4 Figs in “S3 File”, either as a supplementary data file or as a permanent DOI’d deposition. We note that you have plans to deposit the data and R code in the Harvard Dataverse; however, I will need to see this before accepting the paper for publication, and we will also need you to make a permanent DOI’d version (e.g. in Zenodo).

d) Please cite the location of the data clearly in all relevant main and supplementary Figure legends, e.g. “The data underlying this Figure can be found in S1 Data” or “The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.XXXXX”

We expect to receive your revised manuscript within 1 month. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension.

At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw the manuscript.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted " file type.

*Resubmission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this resubmission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland Roberts, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

----------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

[identifies himself as Ludo Waltman]

I am pleased to see the significant improvements made by the authors to their paper. I have one remaining comment.

The authors conclude that "this study of peer review characteristics indicates that peer review in journals with higher impact factors tends to be more thorough in addressing study methods but less helpful in suggesting solutions or providing examples". As pointed out in my previous review, I don't think this conclusion is warranted. It disregards the fact that review reports in journals with higher impact factors are much longer, on average, than review reports in journals with lower impact factors. The percentage of sentences in review reports that suggest solutions or provide examples is lower for higher impact factor journals than for lower impact factor journals, but the absolute number of sentences suggesting solutions or providing examples is higher, not lower. In my view, the conclusion therefore should be that journals with higher impact factors provide reviews that are more, not less, helpful in suggesting solutions or providing examples.

In their response to my previous report, the authors point out that "our analyses controlled for the length of peer review". This is exactly the problem. All statistics presented in the paper are percentages rather than absolute numbers, so the authors indeed control for the length of a review report. However, length is a relevant factor that, I would argue, one should not necessarily control for. For instance, suppose we have two review reports. One has a length of 100 words, 50 of which are used to provide suggestions. The other has a length of 1000 words, 200 of which are used to provide suggestions. From a relative point of view, the former report is more helpful in providing suggestions (50% vs. 20% of the words are used to provide suggestions), but from an absolute point of view, the latter report is more helpful in providing suggestions (50 vs. 200 words are used to provide suggestions). In my view, the absolute perspective is more relevant. The latter report is the one that will be more helpful for authors to improve their work.

More generally, the fact that review reports in the highest impact factor category are more than twice as long as review reports in the lowest impact factor category is of major importance and, in my view, needs to be emphasized more strongly. It indicates that higher impact factor journals tend to offer more in-depth peer review than lower impact factor journals. This is an important finding that I believe should be mentioned in the abstract and in the concluding section.

Ludo Waltman

PS I published my previous review online. I had hoped to also publish my new review. However, it seems the authors haven't posted their revised paper on a preprint server. I therefore consider the revised paper to be confidential and I won't publish my review.

Reviewer #2:

[identifies himself as Bernd Pulverer]

Ref #2 Re-review:

The authors are to be commended for the thorough responses.

A number of comments:

1) I appreciate the point that many journals with open review processes share these on Publons (now part of Clarivate 'Web of Science'). However, journals with public but unsigned reports do so less often. Nonetheless, I agree that scraping the literature for non-Publons-listed reports in the absence of standardized identifiers is not trivial. I do believe ORCID profiles can point to referee reports from the 'Review URL' field (cf. https://support.orcid.org/hc/en-us/articles/360006971333-Peer-Review). This study is based on a large dataset and it is certainly reasonable to restrict the study to this dataset, as it is unclear whether a broader set of input data would alter the conclusions significantly. These points could be discussed.

2) Thank you for pointing out that signed vs. unsigned reports and referee reports on research articles vs. reviews were not assessed - that is fine, but I am unclear why signed reports could not be automatically identified and compared with unsigned reports. Note that I had suggested a third comparison between published and unpublished referee reports, but acknowledge that while very interesting, this would be a complex undertaking that can be discussed.

3) - 6) Thank you for the constructive comments and revision

The minor points are addressed, apart from point 3): I would recommend highlighting more clearly that automated analysis of referee reports for quality attributes may inform a referee credit system that could be used objectively and at scale in research assessment by funders and research institutions.

I assessed the responses to ref #1 and #3 and, leaving aside the technical details on statistics and models, which I did not judge, I believe the responses are thorough and the revisions comprehensive, leading to a more informative and balanced manuscript. In particular, the causality point by ref #3 (no. 7) is important and was addressed.

It may be worth emphasizing the referee report length more, both as a correlation with JIF and in the context of the length control applied here (as discussed in ref #1, point 10; ref #2, minor point 2).

Reviewer #3:

The re-work of the manuscript was extensive, the authors have addressed all relevant shortcomings very well.

I have only two minor comments, that should be addressed imo before publication.

1) discussion, p. 12

"Our study shows that the peer reviews submitted to journals with higher Journal Impact Factor may be more thorough than those submitted to lower impact journals. Should, therefore, the Journal Impact Factor be rehabilitated, and used as a proxy measure for peer review quality? "

In the following discussion of this question, and also in other parts of the manuscript, there is the implicit assumption that submitted peer reviews are independent of the impact factor and journal, i.e. the same effort is put into providing a review. But of course that is likely not true. That reviews tend to be shorter and less thorough with respect to methodology in lower impact journals can have two additional confounding factors:

a) People are less thorough _because_ it is a low impact journal, believing they do not need to provide a review of as good quality as for a journal with a higher impact factor.

b) People might also be less thorough, when only basing this on the length(!), because the quality does not warrant more text. Let me explain. If I am confronted with an applied-AI-in-healthcare methodology that is completely not up to standard, I might just write exactly that, give some examples in bullet points, and suggest rejection. Confronted with a good methodology that has only _some_ major shortcomings, I will likely take the time (and words) to explain these few shortcomings. My point here is that longer -> "more thorough" does not necessarily mean more useful or better. The former case does not _warrant_ more text (and it also does not warrant many suggestions). Is it true that I find the former more often in low impact journals? I do not know. I had one of my worst experiences in this regard in the flagship journal of my field (and the paper was accepted despite the fact that I suggested a reject as the only AI methodology expert). But I think this point could still be considered.

I believe that the discussion would improve if these additional points were also discussed as potentially confounding factors and why this topic is very hard to assess.

2) conclusion, p. 14

"This study of peer review characteristics indicates that peer review in journals with higher impact factors tends to be more thorough in addressing study methods but less helpful in suggesting solutions or providing examples."

I believe that this sentence should be followed by something like (authors should modify as they wish): "These differences may also be influenced by differences in geographical reviewer characteristics, quality of submissions, and the attitude of reviewers towards the journals".

Otherwise the conclusion implies, at least to a degree, that higher impact may lead to more thorough reviews.

Decision Letter 3

Roland G Roberts

21 Jun 2023

Dear Dr Egger,

Thank you for your patience while we considered your revised manuscript "Relationship between Journal Impact Factor and the Thoroughness and Helpfulness of Peer Reviews" for publication as a Meta-Research Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors and the Academic Editor.

Based on our Academic Editor's assessment of your revision, we are likely to accept this manuscript for publication, provided you satisfactorily address the following data and other policy-related requests.

IMPORTANT - please attend to the following:

a) Please ensure that you comply with our Data Policy; specifically, we need you to supply the numerical values underlying Figs 1, 2, 3, S1.1, S2.1, S2.2, S2.3, S2.4, and the 4 Figs in “S3 File”, either as a supplementary data file or as a permanent DOI’d deposition. We note that you mention a Zenodo URL (https://doi.org/10.5281/zenodo.8006829); however, this is not accessible, and I will need to see it before accepting the paper for publication.

b) Please cite the location of the data clearly in all relevant main and supplementary Figure legends, e.g. “The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.8006829”

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland Roberts, PhD

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1, 2, 3, S1.1, S2.1, S2.2, S2.3, S2.4, and the 4 Figs in “S3 File.” NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

Decision Letter 4

Roland G Roberts

6 Jul 2023

Dear Dr Egger,

Thank you for the submission of your revised Meta-Research Article "Relationship between Journal Impact Factor and the Thoroughness and Helpfulness of Peer Reviews" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Ulrich Dirnagl, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

IMPORTANT: I note that you mention the reviewers ("Ludo Waltman, Bernd Pulverer and an anonymous reviewer") in the Acknowledgements. While we appreciate the sentiment, this is against PLOS policy, so please could you remove this? I will ask my colleagues to include this request in their list of issues to attend to.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Roli Roberts

Roland G Roberts, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Journals and disciplines included in the study.

    The 10 journals from each journal impact factor group that provided the largest number of peer review reports and all 1,644 journals included in the analysis, listed in alphabetical order. The numbers in parentheses represent the JIF and the number of reviews included in the sample.

    (PDF)

    S2 File. Further details on classification and validation.

    Further information on the hand-coded set of sentences and the classification approach; provides metrics on the classification performance and shows that aggregating the classifications closely mirrors human coding of the same set of sentences. All results are out-of-sample predictions, meaning that the data in the held-out test set are not used for training the classifier during validation steps.

    (PDF)

    S3 File. Additional details on regression analyses and sensitivity analyses.

    All regression tables for the analysis reported in the paper, and plots and regression tables relating to the 5 sensitivity analyses. All sensitivity analyses are conducted for the prevalence-based and sentence-based models.

    (PDF)

    S4 File. Codebook and instructions.

    Coding instructions and examples for each of the 8 characteristics of peer review reports.

    (PDF)

    S1 Data. Supporting data for Fig 1.

    (XLSX)

    S2 Data. Supporting data for Fig 2.

    (XLSX)

    S3 Data. Supporting data for Fig 3.

    (XLSX)

    S4 Data. Supporting data for Fig 4.

    (XLSX)

    S5 Data. Supporting data for Fig 5.

    (XLSX)

    S6 Data. Supporting data for Fig 1 in S1 File.

    (XLSX)

    S7 Data. Supporting data for Fig 1 in S2 File.

    (XLSX)

    S8 Data. Supporting data for Fig 2 in S2 File.

    (XLSX)

    S9 Data. Supporting data for Fig 3 in S2 File.

    (XLSX)

    S10 Data. Supporting data for Fig 4 in S2 File.

    (XLSX)

    S11 Data. Supporting data for Fig 1 in S3 File.

    (XLSX)

    S12 Data. Supporting data for Fig 2 in S3 File.

    (XLSX)

    S13 Data. Supporting data for Fig 3 in S3 File.

    (XLSX)

    S14 Data. Supporting data for Fig 4 in S3 File.

    (XLSX)

    S15 Data. Supporting data for Fig 5 in S3 File.

    (XLSX)

    S16 Data. Supporting data for Fig 6 in S3 File.

    (XLSX)

    S17 Data. Supporting data for Fig 7 in S3 File.

    (XLSX)

    S18 Data. Supporting data for Fig 8 in S3 File.

    (XLSX)

    S19 Data. Supporting data for Fig 9 in S3 File.

    (XLSX)

    S20 Data. Supporting data for Fig 10 in S3 File.

    (XLSX)

    Attachment

    Submitted filename: Authors response.docx

    Attachment

    Submitted filename: Authors response R3.docx

    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files. The fine-tuned DistilBERT models, data, and code to verify the reproducibility of all tables and graphs are available at https://doi.org/10.5281/zenodo.8006829. Publons’ data sharing policy prohibits us from publishing the raw text of the reviews and the annotated sentences.

