PLOS One. 2023 Jul 26;18(7):e0284469. doi: 10.1371/journal.pone.0284469

An examination of psychometric properties of study quality assessment scales in meta-analysis: Rasch measurement model applied to the firefighter cancer literature

Soyeon Ahn 1,*, Paulo S Pinheiro 2, Laura A McClure 2,3, Diana R Hernandez 3, Alberto J Caban-Martinez 2,3, David J Lee 2,3
Editor: Simon Grima
PMCID: PMC10370747  PMID: 37494348

Abstract

Most existing quality scales have been developed with minimal attention to accepted standards for psychometric properties. Even for those that have been used widely in medical research, limited evidence exists supporting their psychometric properties. The focus of our current study is to address this gap by evaluating the psychometric properties of two existing quality scales that are frequently used in cancer observational research: (1) the Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International and (2) the Newcastle-Ottawa Quality Assessment Scale (NOQAS). We used the Rasch measurement model to evaluate the psychometric properties of the two quality scales based on the ratings of 49 studies that examine firefighters’ cancer incidence and mortality. Our study found that both the RTI scale and the NOQAS have acceptable item reliability. The two raters were consistent in their assessments, demonstrating high interrater reliability. We also found that the NOQAS has more items that show better fit than the RTI scale. The NOQAS produced lower study quality scores with smaller variation, suggesting that NOQAS items are easier to rate. Our findings accord with a previous study, which concluded that the RTI scale was harder to apply and thus produced more heterogeneous quality scores than the NOQAS. Although both the RTI scale and the NOQAS showed high item reliability, NOQAS items fit the underlying construct better, showing higher validity of internal structure and stronger psychometric properties. The current study adds to our understanding of the psychometric properties of the NOQAS and RTI scales for future meta-analyses of observational studies, particularly in the firefighter cancer literature.

Introduction

Assessment of study quality is a critical aspect of conducting meta-analyses. Study quality varies considerably across studies and may lead to heterogeneity in study findings [1–5]. Bérard and Bravo warned that overall effect size estimates obtained from meta-analyses that do not account for variation in study quality may suffer from inflated Type I error rates [6]. In addition, investigators ought to be concerned with the sources, directions, and even plausible magnitudes of such biases when evaluating studies [7, 8]. Therefore, many researchers suggest that the quality of primary studies should be accurately assessed and used in meta-analysis [5, 7]. Despite the importance of assessing study quality, many researchers have identified challenges in dealing with the quality of primary studies in meta-analyses [3, 6, 9]. One critical issue is that although a variety of scales exist to assess the quality of primary studies, none has been universally adopted [10]. In fact, there is no consensus about how study quality should be conceptualized or measured in the existing quality scales [5, 11–13]. Moreover, most existing quality scales have been developed with minimal attention to accepted standards for psychometric properties such as reliability and validity [14]. Most of the research has focused on interrater reliability measures, such as kappa statistics or percentage of agreement, rather than item reliability, content validity, or construct validity. In addition, even for scales that have been used widely in medical research, little to no evidence exists supporting their psychometric properties.

Therefore, the focus of our current study is to address this gap by evaluating the psychometric properties (i.e., item reliability, interrater reliability, and construct validity) of two existing quality scales that are frequently used in cancer observational research: (1) the Item Bank on Risk of Bias and Precision of Observational Studies developed by the Research Triangle Institute (RTI) International [15] and (2) the Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16]. Specifically, we used the Rasch measurement model [17] to evaluate the psychometric properties of these two quality scales based on the ratings of 49 studies that examine firefighters’ cancer incidence and mortality. The present study focuses on three primary research questions:

  1. Can the RTI or the NOQAS scale be considered reliable?

  2. Do the items of RTI or NOQAS fit the overall quality score?

  3. Do the individual studies fit the overall quality score?

Study quality

Two different frameworks have been proposed in the literature to define and measure the quality of primary studies in meta-analysis [18]. One is based on the validity framework developed by Campbell and his associates [19] and the other, called “quality assessment”, was proposed by Chalmers and his colleagues [8].

The former approach, based on the idea of Campbell and his associates, suggests a matrix of designs and their features or threats to validity. The validity framework includes 33 separate threats to validity based on four distinct categories: internal, external, statistical, and construct validity [18]. This validity framework for assessing the quality of primary studies in a meta-analysis is mainly used in the social sciences. For instance, Devine and Cook [20] evaluated the quality of primary studies based on the validity framework by examining six design features representing internal, external, statistical, and construct validity (e.g., floor effect, publication bias, attrition, and domains of content).

The second approach, proposed by Chalmers and his associates, has been applied primarily to medical research [8, 18, 2124]. The objective of Chalmers’ system is to quantify the overall quality of primary studies based on in-depth criteria for assessing randomized controlled trials. Chalmers and his colleagues mainly focused on construct validity and statistical conclusion validity, examining such features as randomization, blinding of the statistician, and minimization of data-extraction bias [18].

Study quality assessments in observational studies

An informal PubMed search of published meta-analyses and systematic reviews in the cancer literature revealed that the Newcastle-Ottawa Quality Assessment Scale (NOQAS) [16] was the most widely employed tool in review articles focused on risk factor association studies [25–31]. This tool was employed in a recent meta-analysis of the firefighter cancer literature [32]. The second identified assessment tool, the Research Triangle Institute (RTI) International Item Bank on Risk of Bias and Precision of Observational Studies [35, 36], was less commonly employed in cancer-focused meta-analyses and systematic reviews [33, 34]. Although not commonly employed in cancer meta-analyses [33], it has been utilized in a variety of syntheses of other disease outcome association studies [36–41] and was employed in a systematic review of lung function in firefighters [42]. Of note, some investigators have employed both the NOQAS and the RTI item bank to assess quality in meta-analyses and systematic reviews [37, 42, 43]. The RTI item bank comprises 29 multiple-choice questions designed to assess a range of risk of bias and precision domains for a variety of observational study designs [36, 37]. These domains include: sample definition and selection, interventions/exposure, outcomes, creation of treatment groups, blinding, soundness of information, follow-up, analysis comparability, analysis outcome, interpretation, and presentation and reporting. Investigators are encouraged to select items from the bank that are most appropriate to the content area and study design of the studies under assessment.

The 8-item NOQAS was developed to assess the quality of nonrandomized studies, with specific assessment forms for case-control and cohort study designs [16]. Several questions are designed to be tailored to the content being assessed. A simple summary quality score can be obtained by summing each individual item judged to be of high quality, although given the scale’s relatively short length, investigators often report quality levels on the individual 8 items for each study under review. The NOQAS has been recommended for the assessment of quality of observational study designs [41, 44].

Our literature search using PubMed, PsycInfo, and Medline identified one published study that compares the psychometric evidence for the NOQAS with that for the RTI item bank. In a study by Margulis and her colleagues [40], two raters independently assessed the quality of 44 primary studies with the RTI item bank and the NOQAS. After coding the quality of studies, Margulis and her colleagues computed interrater agreement using percentage of agreement and first-order agreement coefficient statistics. In their study, the relationship between the NOQAS and the RTI item bank for rank-ordering studies in terms of risk of bias was moderate, as indicated by Spearman’s rank correlation coefficients of .35 and .38. The authors also stated that the NOQAS is easier to apply than the RTI item bank but more limited in its scope, although the scope of quality assessed was similar between the two. Lastly, the interrater reliabilities between raters were reported to be fair for both the NOQAS and the RTI item bank.

Like the study by Margulis and her colleagues [40], the few published studies that addressed the quality of either the NOQAS or the RTI item bank used interrater reliability measures such as kappa statistics or percentage of agreement between raters [41, 44]. All of these studies evaluated the psychometric properties of the quality assessment tools under the Classical Test Theory (CTT) framework, which is relatively simple in that it analyzes the raw scores of the instrument. Moreover, most of the existing studies focused on interrater reliability or face validity of the items used to measure the quality of individual studies.

Rasch measurement model

Whereas classical test theory (CTT) has been frequently used in evaluating the validity and reliability of study quality ratings, issues have arisen regarding the calibration of item difficulty, the sample dependence of coefficient measures, and the estimation of measurement error. The Rasch model enables us to address these issues by (1) assessing the dimensionality of an assessment; (2) identifying redundant items, or items that measure a different construct or construct-irrelevant factors, through item fit; (3) identifying items that should be flagged based on their difficulty levels; and (4) assessing whether response categories are appropriate for distinguishing items by their quality.

Rasch Measurement Theory (RMT) is a psychometric model for analyzing categorical data (particularly dichotomous responses) as a function of a person’s (e.g., a rater’s or reviewer’s) ability on a trait and the item difficulty [17]. Andrich [45] later developed the Rasch Rating Scale Model (RSM, also called the polytomous Rasch model) for polytomous data, that is, data with more than two ordered categories. The RSM provides estimates of person locations on a continuous latent variable (θ), item difficulties (δ), and an overall set of thresholds that are fixed across items (τ).

RMT combines information from the person and the item to estimate the probability that a person with a given level of ability answers a given item correctly, thereby connecting person ability to item difficulty [46]. This probabilistic framework allows RMT to be falsifiable and to meet the linearity assumptions of parametric statistical tests. Accordingly, fit statistics for both persons and items can be obtained, which provide evidence of validity, namely how well the model can predict the response to each item.

In addition, RMT transforms ordinal data into logits, which permits the proper use of parametric statistical analyses without the assumption violations associated with Type I and Type II error inflation. Lastly, the item parameters estimated by RMT are generally invariant to the population used to generate them. In other words, parameter estimates obtained from one sufficient sample should be equivalent to those obtained from another sufficient sample, regardless of the average ability level in each sample [46]. This property of RMT allows for greater generalization of results as well as more sophisticated applications.
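To make these ideas concrete, here is a minimal Python sketch, ours rather than the authors’ code (their analyses used the FACETS program), of the dichotomous Rasch probability and the RSM category probabilities:

```python
import math

def rasch_dichotomous(theta: float, delta: float) -> float:
    """P(correct response) under the dichotomous Rasch model:
    a logistic function of the difference (theta - delta) in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def rsm_category_probs(theta: float, delta: float, taus: list) -> list:
    """Category probabilities under Andrich's Rating Scale Model (RSM).

    `taus` holds the thresholds shared across items; the unnormalized
    log-measure of category k is the cumulative sum of (theta - delta - tau_j)
    for j = 1..k, with category 0 fixed at 0.
    """
    log_numerators = [0.0]  # category 0
    cum = 0.0
    for tau in taus:
        cum += theta - delta - tau
        log_numerators.append(cum)
    denom = sum(math.exp(v) for v in log_numerators)
    return [math.exp(v) / denom for v in log_numerators]
```

Because only the difference θ − δ enters each expression, the item calibrations do not depend on the ability distribution of the particular sample, which is the invariance property described above.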

Psychometric properties in Rasch measurement model

For any quality test or assessment, the supporting evidence must address three psychometric properties: validity, reliability, and fairness [47]. This section briefly reviews how each of these properties can be assessed under RMT. In this study, our focus is on reliability and validity.

Reliability

Reliability refers to the consistency or precision of scores across replications of a testing procedure. Under RMT, a Rasch-based reliability index, called the reliability of separation, is used to measure the reliability of a test or assessment. The reliability of separation index is obtained from latent measures with equal intervals along the underlying continuum; it reflects how distinct the latent scores are along the scale and ranges from 0 to 1. It is defined as

$$\text{Reliability} = \frac{SD^2 - MSE}{SD^2} \quad (1)$$

where $SD$ is the standard deviation of the Rasch measures of a specific facet (e.g., students, tasks, or raters) and $MSE$ is the mean of the squared standard errors of the Rasch measures for that facet. Higher values indicate higher reliability and are preferred because they indicate a good separation of Rasch measures across the entire range of the latent scale.
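As a worked illustration of Eq 1, a short sketch (ours; the item measures and standard errors below are made up, and whether SD² uses n or n − 1 is a software convention):

```python
import statistics

def separation_reliability(measures, standard_errors):
    """Reliability of separation per Eq 1: (SD^2 - MSE) / SD^2, where SD^2 is
    the observed variance of the facet's Rasch measures and MSE is the mean
    of the squared standard errors of those measures."""
    sd2 = statistics.pvariance(measures)  # population variance of the measures
    mse = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (sd2 - mse) / sd2

# Hypothetical item measures (logits) and their standard errors:
print(round(separation_reliability([-2.5, -0.4, 0.1, 0.9, 3.1],
                                   [0.2, 0.2, 0.2, 0.3, 0.5]), 2))  # -> 0.97
```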

Validity

Validity refers to the degree to which theory and evidence support the interpretation of test scores [47]. Under RMT, the Infit and Outfit Mean Square (MnSq) statistics can be used to evaluate how well the measures of an individual facet (i.e., item, study, or rater) fit the constructed latent scale (i.e., the study quality score). In particular, the Infit MnSq identifies irregular response patterns, and the Outfit MnSq detects large residual values. The expected value for both Infit and Outfit MnSq statistics is 1.0, which indicates a perfect fit to the underlying scale. The fit indices provide diagnostic information for identifying misfitting elements on each facet (e.g., item, study, or rater), supporting validity arguments about internal structure. The fit can be rated on a scale ranging from A (the item, study, or rater fits the scale very well) to D (the item, study, or rater does not fit the scale). See Table 1 for guidelines for interpreting the Infit and Outfit MnSq values.
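The two statistics differ mainly in weighting: Outfit is the unweighted mean of the squared standardized residuals, while Infit weights each squared residual by its model variance (information), making it less sensitive to occasional extreme responses. A minimal sketch of both computations (ours; it assumes the model-expected ratings and their variances have already been obtained from a fitted Rasch model):

```python
def infit_outfit(observed, expected, variances):
    """Infit and Outfit mean squares from per-response observed ratings,
    model-expected ratings, and model variances of those ratings.

    Outfit: unweighted mean of squared standardized residuals.
    Infit: squared residuals weighted by the model variance (information).
    """
    sq_resid = [(o - e) ** 2 for o, e in zip(observed, expected)]
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(sq_resid)
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit
```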

Table 1. Fit categories for interpreting Infit and Outfit mean square (MnSq) statistics.
Mean Square Residual (MnSq) Interpretation Fit Category
0.5 ≤ MnSq ≤ 1.5 Productive for measurement A
MnSq < 0.5 Less productive for measurement, but not distorting of measures B
1.5 < MnSq ≤ 2.0 Unproductive for measurement, but not distorting of measures C
2.0 < MnSq Unproductive for measurement, distorting of measures D
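The category assignment in Table 1 is a simple threshold rule; a hypothetical helper mirroring the table:

```python
def fit_category(mnsq: float) -> str:
    """Map an Infit/Outfit mean square to the fit categories in Table 1."""
    if 0.5 <= mnsq <= 1.5:
        return "A"  # productive for measurement
    if mnsq < 0.5:
        return "B"  # less productive, but not distorting
    if mnsq <= 2.0:
        return "C"  # unproductive, but not distorting
    return "D"      # unproductive and distorting of measures
```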

Methods

Description of 49 studies on firefighter cancer incidence and mortality

The studies evaluated in this quality assessment were gathered for a meta-analysis project examining cancer incidence and mortality risk among firefighters. The included studies were identified through a comprehensive literature search of multiple databases, including ERIC, PsycINFO, ProQuest Dissertations & Theses, PubMed, and MEDLINE via EBSCO, and online search engines including Embase, Web of Science Core Collection, Google Scholar, and SCOPUS. A total of 49 studies met the inclusion and exclusion criteria.

Two independent raters were responsible for coding (1) study design characteristics, (2) outcome type, (3) cancer coding system, (4) cancer types, (5) source of occupation designations, (6) type of incident that firefighters attended, (7) sample characteristics, and (8) study characteristics. Two additional reviewers were responsible for coding the statistical estimates presented in these studies for computing (1) a standardized incidence ratio and (2) a standardized mortality ratio.

Procedure

Two content experts in epidemiology independently rated the 49 observational studies using the RTI item bank and the NOQAS. The two independent raters were: (1) a cancer epidemiologist who holds a PhD in epidemiology and has 20 years of experience in cancer research and teaching; and (2) a chronic disease and occupational epidemiologist who holds a PhD in preventive medicine and community health, has over 30 years of teaching and research experience, and is the Principal Investigator of the Florida cancer registry (Florida Cancer Data System).

The two study quality scales were first tested by the independent reviewers on a sample of studies (i.e., a random sample of 5–7 studies) to ensure that raters employed consistent assumptions and criteria. Slight modifications were then made to the original quality assessments to better align with the methods of the studies evaluated, and items that were not relevant were removed. The items evaluated, along with their modifications (italicized) and specific instructions (13 RTI and 8 NOQAS items), are displayed in Table 2.

Table 2. Revised risk of bias and precision of observational studies & Newcastle-Ottawa quality assessment scales.

SECTION STATEMENT, INSTRUCTIONS & SCORING
NEWCASTLE-OTTAWA QUALITY ASSESSMENT SCALE
Selection 1. Representativeness of the exposed cohort—Newcastle1
Item is assessing the representativeness of exposed individuals in the community, not the representativeness of the sample of women from some general population.
a) truly representative of the average _______________ (describe) in the community = 1 star
b) somewhat representative of the average ______________ in the community = 1 star
c) selected group of users eg nurses, volunteers = 0 stars
d) no description of the derivation of the cohort = 0 stars
2. Selection of the non-exposed cohort—Newcastle2
a) drawn from the same community as the exposed cohort = 1 star
b) drawn from a different source = 0 stars
c) no description of the derivation of the non-exposed cohort = 0 stars. Note: In the case of general population can code = 1 star; if other occupational groups only then code 0.5 star given possible overlapping exposures.
3. Ascertainment of exposure—Newcastle3
a) secure record (e.g. surgical records) = 1 star
b) structured interview = 1 star
c) written self report = 0 stars
d) No description = 0 stars
Note: if self-report = 0 stars. Exposure based on registry/death records = 0.5 star. Disregard other confounders when assessing on this item.
4. Demonstration That Outcome of Interest Was Not Present at Start of Study—Newcastle4
In the case of mortality studies, outcome of interest is still the presence of a disease/ incident, rather than death. That is to say that a statement of no history of disease or incident earns a star.
a) yes = 1 star
b) no = 0 stars
Note: For mortality studies code = 1 star if you can assume that persons were alive at enrollment into the cohort.
COMPARABILITY 1. Comparability of Cohorts on the Basis of the Design or Analysis—Newcastle5
Age and one other control OR stratified analysis will qualify for two star rating
a) study controls for ______ (select the most important factor) = 1 star
b) study controls for any additional factor = additional 1 star
OUTCOME 1. Assessment of outcome—Newcastle6
a) independent blind assessment (independent or blind assessment stated in the paper, or confirmation of the outcome by reference to secure records) = 1 star
b) record linkage (e.g. identified through ICD codes on database records) = 1 star
c) self-report (i.e. no reference to original medical records to confirm the outcome) = 0 stars
d) no description = 0 stars
Note: ICD version should be specified in order to earn 1 star.
2. Was follow-up long enough for outcomes to occur—Newcastle7
An acceptable length of time should be decided before quality assessment begins
a) yes (select an adequate follow up period for outcome of interest) = 1 star
b) no = 0 stars
Note: Must mention a lag of ≥ 2 years; if not, then assign = 0 stars
3. Adequacy of follow up of cohorts—Newcastle8
This item assesses the follow-up of the exposed and non-exposed cohorts to ensure that losses are not related to either the exposure or the outcome.
a) complete follow up—all subjects accounted for = 1 star
b) subjects lost to follow up unlikely to introduce bias = 1 star
c) low follow up rate or no description of those lost = 0 stars
d) no statement = 0 stars
Note: No star assignment if loss to follow-up exceeds 10%. Active follow-up (e.g., last date of contact reported when the event could be verified) = 1 star; passive follow-up or no mention of follow-up = 0 stars
RISK OF BIAS AND PRECISION OF OBSERVATIONAL STUDIES (RTI)
SAMPLE DEFINITION AND SELECTION Is the study design prospective, retrospective, or mixed? [Abstractor: Prospective design requires that the outcome has not occurred at the time the study is initiated and information is collected over time to assess relationships with the outcome (and includes nested case-control studies). Mixed design includes case-control or cohort studies in which one group is studied prospectively and the other retrospectively. A retrospective design analyzes data from past records. The question is not applicable to cross-sectional studies.
Note: retrospective cohort designs should be coded as retrospective]
Prospective / Mixed / Retrospective / Cannot determine or not applicable
Are critical inclusion/exclusion criteria clearly stated (does not require the reader to infer)?—RTI1 [Principal Investigator (PI): Provide direction to abstractors by listing individual criteria of a priori significance and minimal requirements for criteria to be considered “clearly stated.” Include this question to identify specific inclusion/exclusion criteria that should be consistently recorded across studies] [Abstractor: Use “Partially” if only some criteria are stated or if some criteria are not clearly stated (corresponding to directions provided by the PI). Note that studies may describe inclusion criteria alone (i.e., include x), exclusion criteria (i.e., do not include x), or a combination of inclusion and exclusion criteria.
Note: many studies will describe these criteria on the basis of identifying firefighters thru employment/ certification/death certificate coding records which can be classified as ‘yes’]
Yes / Partially: some, but not all, criteria stated or some criteria not clearly stated / No
Are the inclusion/exclusion criteria measured using valid and reliable measures?—RTI2 [PI: Separately specify each criterion that abstractors should consider based on its relevance to study bias. It is unlikely that all criteria will need to be evaluated in relation to this question. Provide direction to abstractors on valid and reliable measurement of each criterion that is to be considered. For example, prior exposure or disease status is a frequent inclusion/exclusion criterion, particularly in inception cohorts. Subjective measures based on self-report tend to have lower reliability and validity than objective measures such as clinical reports and lab findings. Replicate question to evaluate each individual inclusion/exclusion criterion.
Note, in most cases firefighter studies will be coded yes since most studies rely upon administrative/employment/ death certificate records to identify firefighters for inclusion in the study.]
Yes / No / Cannot determine; measurement approach not reported
Did the study apply inclusion/exclusion criteria uniformly to all comparison groups/arms of the study?—RTI3 [PI: Drop question if not relevant to entire body of evidence (e.g., all case-series, single- arm studies).
Note: it may be possible that criteria are not uniformly applied when assembling firefighter cohorts or comparison groups]
Yes / Partially: some, but not all criteria, applied to all arms or not clearly stated if some criteria are applied to all arms / No / Cannot determine: article does not specify / Not applicable: study has only one arm and so does not include comparison groups
Was the strategy for recruiting participants into the study the same across study groups/arms of the study?—RTI4 [PIs: This question is likely to be more relevant for prospective or mixed designs than retrospective designs. Drop question if not relevant to entire body of evidence (e.g., all studies generally have only one arm).
Note: true recruitment into a cohort study is rare for firefighter cancer studies so most studies can be judged as meeting this criterion. In most cases for case-control studies this will be coded as “yes”, unless case and control recruitment differs.]
Yes / No / Cannot determine / Not applicable: one study group or arm
INTERVENTIONS/EXPOSURE What is the level of detail in describing the intervention or exposure?—RTI5 [PI: Specify which details need to be stated (e.g., intensity, duration, frequency, route, setting, and timing of intervention/exposure). For case-control studies, consider whether the condition, timing, frequency, and setting of symptoms are provided in the case definition. PI needs to establish criteria for high, medium, or low response.
Note: Many firefighter exposures are often based on occupational title/certification/hospital-based record or death certificate which can be judged as medium given that it is a crude measure of exposure. Important to consider if the comparison group is truly unexposed or reasonably complete in order to mark as high]
High: very clear, all PI-required details provided / Medium: somewhat clear, majority of PI- required details provided / Low: unclear, many PI-required details missing
CREATION OF TREATMENT GROUPS Is the selection of the comparison group appropriate, after taking into account feasibility and ethical considerations?—RTI6 [PI: Provide instruction to the abstractor based on the type of study. Interventions with community components are likely to have contamination if all groups are drawn from the same community. Interventions without community components should select groups from the same source (e.g., community or hospital) to reduce baseline differences across groups. For case-control studies, controls should represent the population from which cases arose; that is, controls should have met the case definition if they had the outcome.
Note: For most firefighter cohort studies the comparison group will be non-firefighters drawn from the same registry while case-control studies will draw controls from the same hospital system(s) or surrounding communities. In most cases these can be rated as “yes”. Cases in which selected controls clearly do not reflect the same community in cohort studies can be rated as “no” (e.g., comparing deaths in firefighters in a particular state to nation-wide mortality). In cases where the time periods of exposure and disease are different across groups in comparison, rate as “no”. In cases when multiple comparators/controls are applied, select and rate the most favorable case (i.e., if at least one assessment is “yes” then report “yes” for this item).]
Yes / No / Cannot determine or no description of the derivation of the comparison group / Not applicable: study does not include a comparison group (case series, one study arm)
Any attempt to balance the allocation between the groups (e.g., through stratification, matching, propensity scores).—RTI7 [PI: This is most likely to be used in case-control study designs. Drop if not relevant to the body of evidence.
Note: Score all prospective studies as “Not applicable”; for case-control studies that employ any of these in the design phase—not the analytic phase]
Yes or study accounts for imbalance between groups through a post hoc approach such as multivariate analysis / No or cannot determine / Not applicable: study does not include a comparison group (case series or one study arm)
SOUNDNESS OF INFORMATION Are interventions/exposures assessed using valid and reliable measures, implemented consistently across all study participants?—RTI8 [PI: Important measures may be listed separately. PI may need to establish a threshold for what would constitute acceptable measures based on study topic. When subjective or objective measures could be collected, subjective measures based on self- report may be considered as being less reliable and valid than objective measures such as clinical reports and lab findings. Replicate question when needed.
Note: Code High if there is a job title certification. If death certificate or registry records, code Medium; if no specification, code Low.]
High / Medium / Low
Are outcomes assessed using valid and reliable measures, implemented consistently across all study participants?—RTI9 [PI: Primary outcomes should be identified for abstractors and if there is more than one, they may be listed separately. Also, identify any relevant secondary outcomes and harms. Subjective measures based on self-report tend to have lower reliability and validity than objective measures such as clinical reports and lab findings. Note for case-control studies: consider whether the ascertainment of cases was independent of exposure.
Note: In most cases this will be coded “yes” since outcomes are typically derived from registry/death certificate records and are collected in the same manner across exposure groups. In rare cases this information may not be collected in the same manner across exposure groups, in which case a “no” rating would be appropriate. Additional rare events could include poorly specified sources of incidence or mortality events, which would warrant “Cannot determine…”]
Yes / No / Cannot determine or measurement approach not reported
FOLLOW-UP Is the length of follow-up the same for all groups?—RTI10 [For case-control studies, are cases and controls matched on length of follow-up? Abstractor: When follow-up was the same for all study participants, the answer is yes. If different lengths of follow-up were adjusted by statistical techniques (e.g., survival analysis), the answer is yes. Studies in which differences in follow-up were ignored should be answered no.
Note: Select “no” if exposure-outcome time/date imbalances vary across groups. There should be a lag of 2 years or more from exposure to outcome; if so, code Yes.]
Yes / No or cannot determine / Not applicable: cross-sectional or only one group followed over time
Is the length of time following the intervention/exposure sufficient to support the evaluation of primary outcomes and harms?—RTI11 [PI: Primary outcomes (including harms) should be identified for abstractors. Important measures may be listed separately. Abstractors should be provided with specific criteria for sufficient length of follow-up based on prior research or theory. Drop if entire body of evidence is cross-sectional or if minimal length of follow-up period is specified through inclusion criteria.
Note: If cross-sectional, list non-applicable. Otherwise, in most cases this will be coded “yes” if the follow-up period only includes events taking place at least two years after joining the cohort. If the follow-up period includes events taking place less than two years after joining the cohort, then code “No”. If there is no information provided on the follow-up period, then report “Cannot determine”]
Yes / Partially: some primary outcomes are followed for a sufficient length of time / No / Cannot determine / Not applicable: cross-sectional
ANALYSIS COMPARABILITY Were the important confounding and effect modifying variables taken into account in the design and/or analysis (e.g., through matching, stratification, interaction terms, multivariate analysis, or other statistical adjustment)?—RTI12 [PI: Provide instruction to abstractors on adequate adjustment for confounding and testing for effect modification.
Note: if only age is accounted for in the analysis or via age group stratification, then select “partially”; select “yes” if two or more variables are accounted for in the analysis or via stratification; otherwise select “no”]
Yes / Partially: some variables taken into account or adjustment achieved to some extent / No: not accounted for or not identified / Cannot determine
INTERPRETATION Are results believable taking study limitations into consideration?—RTI13 [Abstractor: This question is intended to capture the overall quality of the study. Consider issues that may limit your ability to interpret the results of the study. Review responses to earlier questions for specific criteria.
Note: Most firefighter linkage studies have inherent limitations such as job title as proxy for exposures that limit study ‘believability’. These studies and those employing case-control designs can also be impacted by limited statistical power. Coding of “partially” and “no” should be reserved for studies that have limitations beyond those that broadly impact firefighter cancer studies.]
Yes / Partially / No
PRESENTATION AND REPORTING Is the source of funding identified? [PI: The relevance of this question will depend upon the topic. This question may be modified to identify particular sources of funding (e.g., industry, government, university, or foundation funding).]
Yes / No

Note. Any modification to the original scale is italicized.

Model specification

The FACETS computer program [48, 49] for Rasch analysis was used to examine the quality of the two study quality assessments using a Many-Facet Rating Scale Model (MFRM) [48]. The MFRM is expressed as follows:

$$\ln\!\left(\frac{P_{jik}}{P_{ji(k-1)}}\right) = \theta_j - \delta_i - \tau_k \quad (3)$$

where

$P_{jik}$ = probability of study $j$ receiving a rating $k$ on item $i$;

$P_{ji(k-1)}$ = probability of study $j$ receiving a rating $k-1$ on item $i$;

$\theta_j$ = quality measure of study $j$;

$\delta_i$ = difficulty of endorsing item $i$;

$\tau_k$ = difficulty of endorsing category $k$ relative to category $k-1$.

Analyses

The ratio between $P_{jik}$ and $P_{ji(k-1)}$ specified in Eq 3 is the odds of receiving rating k rather than k−1, so the log-odds (logits) are a linear combination of the latent measures for the different facets. Since all measures are on a common scale with logits as the units, the MFRM creates measures on an additive interval scale. Higher logit values reflect higher quality for studies and, for items, greater difficulty to endorse. These values were presented using a Wright map, an empirical display of study quality scores and item difficulties.
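To see how the adjacent-category log-odds in Eq 3 translate into category probabilities and an expected rating for one study-item encounter, a brief sketch reusing the rsm_category_probs helper sketched earlier (the θ, δ, and τ values are hypothetical):

```python
theta, delta = 1.32, 0.11   # hypothetical study quality and item difficulty (logits)
taus = [-1.0, 1.0]          # hypothetical thresholds for a 0/1/2 rating scale

probs = rsm_category_probs(theta, delta, taus)
expected_rating = sum(k * p for k, p in enumerate(probs))
print([round(p, 3) for p in probs], round(expected_rating, 2))
# -> [0.047, 0.427, 0.526] 1.48
```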

In addition to logit values, we computed the reliability of separation indices for items, studies, and raters, together with Infit and Outfit MnSq statistics. The reliability of separation index shows how reproducible the scale would be with a different but equivalent study sample. Infit and Outfit MnSq statistics demonstrate how well items and studies fit the latent scale. Lastly, a chi-square test was performed to examine whether all elements of a facet (e.g., all studies) can be viewed as equal; a significant result indicates that the studies are distinct from each other.
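One common form of this test, and a reasonable reading of what FACETS reports as the fixed “all same” chi-square, sums the information-weighted squared deviations of a facet’s measures from their precision-weighted mean; a sketch under that assumption:

```python
def fixed_chisq(measures, standard_errors):
    """Chi-square test that all elements of a facet share one measure:
    sum of information-weighted squared deviations from the
    precision-weighted mean, with df = n - 1."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    wmean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chisq = sum(w * (m - wmean) ** 2 for w, m in zip(weights, measures))
    return chisq, len(measures) - 1
```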

Results

Descriptive statistics of study quality scale scores for firefighters’ cancer literature

Table 3 displays the summary statistics of observed and Rasch scores for each of the two study quality measures: (1) RTI and (2) NOQAS. The RTI scale items had an observed mean of 1.51 (SD = .66) and a Rasch score mean of 0.00 (SD = 3.85), while the NOQAS items had an observed mean of 1.84 (SD = .37) and a Rasch score mean of 0.00 (SD = 2.29). In addition, the overall quality scores of the 49 studies had an observed mean of 1.48 (SD = 0.24) and a Rasch mean of 1.32 (SD = 2.24) when measured by the RTI scale, but an observed mean of 1.84 (SD = 0.17) and a Rasch mean of 0.21 (SD = 0.78) when measured by the NOQAS. These results indicate that, on average, the RTI item bank produced much higher Rasch quality scores across the 49 studies than the NOQAS.

Table 3. Summary statistics.

Facets RTI NOQAS
Items
Observed Scores M 1.51 1.84
SD 0.66 0.37
Rasch Measures M 0.00 0.00
SD 3.85 2.29
Infit MnSq M 1.13 1.08
SD 0.52 0.33
Outfit MnSq M 0.84 0.93
SD 0.40 0.40
Studies
Observed Scores M 1.48 1.84
SD 0.24 0.17
Rasch Measures M 1.32 0.21
SD 2.24 0.78
Infit MnSq M 1.00 0.90
SD 1.00 0.50
Outfit MnSq M 0.85 0.92
SD 1.14 0.95
Raters
Observed Scores M 1.51 1.84
SD 0.66 0.05
Rasch Measures M .00 .00
SD 3.85 0.20
Infit MnSq M 1.13 0.99
SD 0.52 0.24
Outfit MnSq M 0.85 0.94
SD 0.40 0.46

Note. RTI: Item Bank on Risk of Bias and Precision of Observational Studies developed by RTI International; NOQAS: Newcastle-Ottawa Quality Assessment Scale; MnSq: mean square fit statistic.

Figs 1 and 2 display the Wright maps, empirical displays of the RTI scale (Fig 1) and the NOQAS (Fig 2), respectively.

Fig 1. Wright map for risk of bias and precision of observational studies.


Fig 2. Wright map for Newcastle-Ottawa quality assessment scale.


In each figure, the first column shows the Rasch score on a logit scale. The items (column 4), raters (column 3), and individual studies (column 2) are located on the Wright map according to their Rasch scores. The last column displays threshold estimates of the response categories on the Likert scale. As shown in Fig 1, the latent Rasch scores of study quality measured by the RTI scale were skewed to the left, indicating that most studies appeared to be low in study quality, while the Rasch scores measured by the NOQAS followed a normal distribution (see Fig 2). Of the 13 items in the RTI item bank, 9 were above the mean of 0 (column 4 in Fig 1), indicating that most were quite difficult to endorse. In contrast, items on the NOQAS were distributed relatively evenly in terms of item difficulty (column 4 in Fig 2), except for item #5 (very easy; located 5 standard deviations below the mean). Lastly, the two raters (column 3) were quite consistent in their ratings of study quality using both the NOQAS and the RTI item bank.

Psychometric evidence for the RTI items for firefighters’ cancer literature

Dimensionality

Results from the Many-Facet Rating Scale Model (MFRM) indicated that a single underlying factor explained 79.82% of the variance in the 13 items. This result suggests that the RTI is unidimensional for measuring the quality of individual studies, exceeding the 20% threshold [49].

Reliability

The reliability of separation for the RTI scale items was .99 (near 1.0), implying that the distribution of item measures represents the entire range of the latent scale well. A reliability value higher than .80 suggests that the RTI item bank shows acceptable reproducibility and consistency in the ordering of the Rasch scale scores.

Validity

As shown in Table 4, the Infit and Outfit MnSq for RTI scale items 6, 9, 10, and 12 fell into fit category A, indicating a good fit of each item to the study quality scale. Although items 2 and 11 fell into fit category A based on the Infit MnSq, item 11 was flagged on the Outfit MnSq. Items 1, 3, 5, 7, 8, and 11 were less productive for measurement based on either the Infit or Outfit MnSq (see Table 4).

Table 4. Summary of Rasch measures for RTI risk of bias and precision of observational studies.
Item Number  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
1 2.97 -8.52 0.59 1.70 C 0.94 A
8 2.92 -7.45 0.38 1.90 C 1.64 C
13 1.94 -2.53 0.21 0.40 B 0.40 B
6 1.48 -.38 0.22 1.15 A 1.14 A
12 1.38 0.11 0.23 0.92 A 0.92 A
10 1.39 0.24 0.23 1.06 A 0.91 A
11 1.23 0.92 0.26 0.21 A 0.97 C
7 1.20 1.13 0.27 1.06 B 0.99 D
3 1.05 2.82 0.48 1.70 C 0.66 A
5 1.05 2.82 0.48 1.77 C 0.87 A
9 1.04 3.06 0.53 1.01 A 1.16 A
2 1.03 3.35 0.59 0.86 A 0.37 B
4 1.00 4.43 0.85 0.01 B 0.01 B

Note. Observed score indicates the mean of all response ratings for each item. Rasch measure reflects the location of an item on the Rasch scale. SE stands for the standard error of each Rasch measure. The Infit and Outfit mean square (MnSq) statistics are fit indices for identifying misfit items.

The quality measures of the 49 studies were significantly different, χ2(46) = 177.7, p < .01. As shown in Table 5, study 29 had the highest study quality score, while study 8 had the lowest. Most studies fit the scale well, with a fit category of A or B; six studies fell into category C or D.

Table 5. Summary for study quality scores by RTI risk of bias and precision of observational studies.
Study ID  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
8 0 -12.77 1.99 - - - -
13 1.38 0.33 0.71 0.21 B 0.1 B
14 1.38 0.33 0.71 0.21 B 0.1 B
16 1.38 0.33 0.71 0.21 B 0.1 B
25 1.38 0.33 0.71 0.21 B 0.1 B
37 1.38 0.33 0.71 0.21 B 0.1 B
38 1.38 0.33 0.71 5.1 D 4.29 D
40 1.38 0.33 0.71 0.21 B 0.1 B
1 1.42 0.81 0.67 0.37 B 0.17 B
2 1.42 0.81 0.67 0.37 B 0.17 B
5 1.42 0.81 0.67 2.8 D 2.28 D
7 1.42 0.81 0.67 0.5 A 0.3 B
11 1.42 0.81 0.67 0.56 A 0.5 A
21 1.42 0.81 0.67 0.37 B 0.17 B
22 1.42 0.81 0.67 0.37 B 0.17 B
23 1.42 0.81 0.67 0.37 B 0.17 B
28 1.42 0.81 0.67 0.57 A 0.55 A
43 1.42 0.81 0.67 3.41 D 5.36 D
48 1.42 0.81 0.67 0.47 B 0.26 B
49 1.42 0.81 0.67 0.47 B 0.26 B
12 1.46 1.23 0.63 0.54 A 0.28 B
19 1.46 1.23 0.63 0.56 A 0.3 B
45 1.46 1.23 0.63 0.51 A 0.25 B
6 1.5 1.62 0.6 0.6 A 0.3 B
47 1.5 1.62 0.6 0.8 A 1.42 A
3 1.54 1.96 0.57 0.6 A 0.31 B
18 1.54 1.96 0.57 0.64 A 0.33 B
26 1.54 1.96 0.57 1.05 A 1.35 A
31 1.54 1.96 0.57 0.62 A 0.32 B
32 1.54 1.96 0.57 0.75 A 0.45 B
33 1.54 1.96 0.57 0.97 A 0.58 A
46 1.54 1.96 0.57 0.65 A 0.37 B
34 1.56 2.11 0.58 0.92 A 0.51 A
10 1.58 2.27 0.55 1.21 A 0.64 A
20 1.58 2.27 0.55 0.86 A 0.49 B
27 1.58 2.27 0.55 1.18 A 1.5 A
35 1.58 2.27 0.55 0.56 A 0.31 B
36 1.58 2.27 0.55 0.52 A 0.29 B
44 1.58 2.27 0.55 3.03 D 2.97 D
17 1.62 2.56 0.53 0.6 A 0.35 B
41 1.62 2.56 0.53 1.08 A 0.59 A
42 1.62 2.56 0.53 1.33 A 0.74 A
4 1.64 2.57 0.53 2.85 D 3.84 D
9 1.65 2.83 0.52 0.75 A 0.45 B
15 1.65 2.83 0.52 0.93 A 1.08 A
39 1.65 2.83 0.52 1.84 C 1.43 A
24 1.69 3.1 0.51 1.29 A 1.09 A
30 1.73 3.35 0.5 0.55 A 0.42 B
29 1.88 4.09 0.48 3.1 D 2.47 D

Note. Observed score indicates the mean of all response ratings for each study. Rasch measure reflects the location of a study on the Rasch scale. SE stands for the standard error of each Rasch measure. The Infit and Outfit mean square (MnSq) statistics are fit indices for identifying misfit studies. A list of the studies can be found in S1 Appendix.

Psychometric evidence for the NOQAS items for firefighters’ cancer literature

Dimensionality

Results from the MFRM indicated that a single underlying factor explained 41.41% of the variance in the 8 items, suggesting that the NOQAS is unidimensional for measuring the quality of individual studies, exceeding the 20% threshold [49].

Reliability

The reliability of separation for items was .99, implying that the NOQAS shows acceptable reproducibility and consistency in the ordering of the Rasch scale scores.

Validity

As shown in Table 6, the Infit and Outfit MnSq for most items fell into fit category A, indicating a good fit of each item to the study quality scale. Exceptions were item 1 (B for Outfit MnSq) and item 7 (C for both Infit and Outfit MnSq). In particular, item 7 had high Infit and Outfit MnSq values, indicating that this item is unproductive for measurement. Content specialists should be consulted regarding future uses of item 7 on this scale.

Table 6. Summary of Rasch measures for Newcastle-Ottawa quality assessment scale.
Item Number  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
5 2.66 -5.56 0.23 1.47 A 1.30 A
1 1.98 -0.58 0.41 0.65 A 0.33 B
4 1.96 -0.29 0.38 1.11 A 0.93 A
7 1.91 0.16 0.31 1.52 C 1.73 C
3 1.76 1.02 0.21 1.18 A 0.68 A
6 1.67 1.32 0.18 0.96 A 0.82 A
2 1.45 1.86 0.15 1.18 A 1.00 A
8 1.35 2.07 0.15 0.55 A 0.67 A

Note. Observed score indicates the mean of all response ratings for each item. Rasch measure reflects the location of an item on the Rasch scale. SE stands for the standard error of each Rasch measure. The Infit and Outfit mean square (MnSq) statistics are fit indices for identifying misfit items.

Results indicated that the study quality measures of the 49 studies were significantly different, χ2(46) = 71.9, p = .05. As shown in Table 7, study 12 had the highest study quality score, while study 40 had the lowest. Most studies fit the scale well, with a fit category of A or B; eight studies fell into category C or D.

Table 7. Summary for study quality scores by NOQAS scale.
Study ID  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
40 1.44 -1.07 0.38 1.06 A 0.8 A
24 1.50 -0.92 0.39 1.09 A 0.89 A
30 1.50 -0.92 0.39 1.19 A 0.96 A
35 1.56 -0.77 0.39 2.16 D 3.76 D
17 1.63 -0.61 0.4 0.65 A 0.49 B
43 1.63 -0.61 0.4 0.66 A 0.56 A
48 1.63 -0.61 0.4 0.9 A 0.66 A
15 1.69 -0.45 0.42 0.8 A 0.6 A
34 1.69 -0.45 0.42 1.32 A 0.92 A
42 1.69 -0.45 0.42 1.77 C 2 C
47 1.69 -0.45 0.42 0.33 B 0.32 B
4 1.75 -0.26 0.44 2.02 C 2.15 D
9 1.75 -0.26 0.44 1.17 A 0.74 A
18 1.75 -0.26 0.44 1.15 A 2.62 D
27 1.75 -0.26 0.44 0.84 A 0.55 A
28 1.75 -0.26 0.44 0.84 A 0.55 A
31 1.75 -0.26 0.44 0.51 A 0.43 B
33 1.75 -0.26 0.44 1.43 A 1.02 A
20 1.75 -0.07 0.61 0.8 A 0.53 A
10 1.81 -0.06 0.46 1.55 C 3.35 D
36 1.81 -0.06 0.46 0.72 A 0.49 B
37 1.81 -0.06 0.46 1.02 A 0.87 A
19 1.88 -0.05 0.71 0.47 B 0.39 B
25 1.88 -0.05 0.71 0.47 B 0.39 B
3 1.88 0.17 0.5 1.53 C 2.63 D
5 1.88 0.17 0.5 1.47 A 0.83 A
23 1.88 0.17 0.5 1.01 A 0.7 A
39 1.88 0.17 0.5 1.26 A 0.9 A
1 1.94 0.45 0.56 1.65 C 3.28 D
2 1.94 0.45 0.56 1.65 C 3.28 D
22 1.94 0.45 0.56 0.46 B 0.28 B
41 1.94 0.45 0.56 0.42 B 0.24 B
46 1.94 0.45 0.56 0.6 A 0.48 B
50 1.94 0.45 0.56 0.48 B 0.33 B
11 2.0 0.81 0.66 1.41 A 0.66 A
16 2.0 0.81 0.66 0.54 A 0.29 B
21 2 0.81 0.66 0.49 B 0.24 B
26 2 0.81 0.66 0.54 A 0.29 B
32 2 0.81 0.66 0.7 A 0.51 A
38 2 0.81 0.66 0.54 A 0.29 B
49 2 0.81 0.66 0.66 A 0.59 A
13 2.06 1.34 0.81 0.6 A 0.38 B
14 2.06 1.34 0.81 0.6 A 0.38 B
29 2.06 1.34 0.81 0.6 A 0.38 B
6 2.13 2.12 0.99 0.05 B 0.04 B
7 2.13 2.12 0.99 0.05 B 0.04 B
12 2.13 2.12 0.99 0.05 B 0.04 B

Note. Observed score indicates the mean of all response ratings for each study. Rasch measure reflects the location of a study on the Rasch scale. The Infit and Outfit mean square (MnSq) statistics are fit indices for identifying misfit studies. A list of the studies can be found in S1 Appendix.

Comparison between RTI and NOQAS for firefighters’ cancer literature

The reliability of item separation indices for the RTI scale and the NOQAS were both high (approaching 1), indicating that both scales are reproducible and consistent in the ordering of Rasch scale scores. In terms of rater agreement, the two coders rated study quality with comparable consistency using the NOQAS and the RTI, as shown in Tables 8 and 9. The NOQAS produced much lower Rasch latent study quality scores with less variation, and its items were much easier to rate than those of the RTI. The NOQAS also had more items that showed good fit between the items and the overall quality scores. One reason could be that the NOQAS was adapted to assess the quality of the firefighter cancer literature, which may have enabled ratings to be more closely aligned and less varied, resulting in better fit between items and the quality scores. Additionally, study quality scores measured by the NOQAS were found to follow a normal distribution. Both measures were found to be unidimensional.

Table 8. Summary for rater score by RTI risk of bias and precision of observational studies.

Rater Number  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
1 1.53 -0.17 0.12 1.16 A 1.07 A
2 1.49 0.17 0.12 0.92 A 0.62 A

Note. Observed score indicates the mean of all response ratings by each rater. Rasch measure reflects the location of a rater on the Rasch scale. The Infit and Outfit mean square (MnSq) statistics are fit indices.

Table 9. Summary for rater scores by NOQAS scale.

Rater ID  Observed Score  Rasch Measures  SE  Infit MnSq  Category  Outfit MnSq  Category
Rater 1 1.79 0.20 0.10 1.23 A 1.40 A
Rater 2 1.89 -0.20 0.11 0.75 A 0.48 B

Note. Observed score indicates the mean of all response ratings by each rater. Rasch measure reflects the location of a rater on the Rasch scale. The Infit and Outfit mean square (MnSq) statistics are fit indices.

Discussion

Using the firefighter cancer literature, the current study is the first attempt to examine the psychometric properties of two commonly used study quality assessment measures using Rasch measurement theory. Among its many strengths, the Rasch model can be used to (a) produce invariant study quality measures on a latent continuum, (b) assess the validity, reliability, and fairness of latent measures, and (c) use latent scores to explain variation in outcome measures. These characteristics of Rasch measurement theory offer practical applications in meta-analysis. For instance, study quality scores estimated by the Rasch measurement model can be directly compared across different studies and further used to model variation in study effects.

Our study found that the RTI scale and the NOQAS were reproducible and consistent in evaluating the quality of the firefighter cancer literature, showing high item reliability. In terms of interrater reliability, the two raters were quite consistent in their assessments of study quality when using both the RTI and NOQAS scales. In terms of validity, we found that the NOQAS has more items that show good fit to the underlying construct of study quality than the RTI scale. This result indicates that the NOQAS demonstrates better validity of internal structure for measuring the quality of the firefighter cancer literature. Lastly, latent scores measured using the NOQAS were distributed across the full range of the latent scale, with much lower study quality scores and smaller variation. These results suggest that NOQAS items are much easier to use when rating the quality of the firefighter cancer literature. Our findings accord with a previous study conducted by Margulis and her colleagues [40], which concluded that the RTI was harder to apply and thus produced more heterogeneous quality scores than the NOQAS.

The present study is significant in at least two major respects. First, it is the first of its kind to assess the psychometric properties, namely reliability and validity, of the two quality assessment tools most used in observational studies. Previous studies focused on the interrater reliability of the NOQAS and RTI scales, leaving their item reliability and validity unexamined. The current study provides evidence on the reliability and validity of the NOQAS and RTI for future use beyond interrater reliability. Second, and more importantly, we used Rasch Measurement Theory (RMT), which produces comparable quality scores for the studies included in a meta-analysis and thereby enhances generalizability and applicability in meta-analysis. This is because Rasch scores allow the use of parametric statistical analyses, which mostly assume normal distributions. When utilizing the Rasch scores of the NOQAS and RTI in a meta-analysis of firefighters’ cancer incidence and mortality, we found that NOQAS scores significantly predicted variation in the effect sizes. Specifically, results from a mixed-effects model indicated a significant and positive relationship between quality scores and firefighters’ cancer incidence and mortality. Lastly, the item parameters estimated by RMT are generally invariant to the population, which offers greater generalization of meta-analytic results.
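As an illustration of how Rasch-based quality scores can enter such a model, here is a simplified weighted least squares meta-regression sketch (ours; a stand-in for the authors’ mixed-effects model, with made-up effect sizes, variances, and quality scores):

```python
import numpy as np

# Hypothetical log effect sizes, their sampling variances, and Rasch quality scores
yi = np.array([0.10, 0.25, 0.05, 0.30, 0.18])
vi = np.array([0.02, 0.03, 0.01, 0.04, 0.02])
quality = np.array([-0.4, 0.9, -1.1, 1.3, 0.2])  # Rasch logits

# Weighted least squares meta-regression: yi = b0 + b1 * quality + error
X = np.column_stack([np.ones_like(quality), quality])
W = np.diag(1.0 / vi)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ yi)
print(beta)  # b1 > 0 would indicate higher-quality studies report larger effects
```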

In this study, we did not address one important psychometric property: whether the NOQAS and RTI showed fairness in their assessments. If the NOQAS and RTI are equally applicable to any study, their scores should be invariant to study characteristics such as sampling method, funding sources, inclusiveness of samples, and whether a study used a good-quality instrument. Despite this limitation, the current study adds to our understanding of the psychometric properties of the NOQAS and RTI for future meta-analyses of observational studies similar to the firefighter cancer literature.

Supporting information

S1 Appendix. Included 49 studies.

(DOCX)

Data Availability

All relevant data are within the paper and its Supporting information files.

Funding Statement

This study was supported by funds from Florida State Appropriation #2382A (Principal Investigator: Kobetz). Research reported in this publication was also supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA240139. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Barley Z. Assessing the influence of poor studies on meta-analytic results. Paper presented at the Annual Meeting of the American Educational Research Association; Chicago, IL; 1991.
2. Cao H. A random effect model with quality score for meta-analysis. Master’s thesis, University of Toronto; 2001.
3. Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323(7303):42–6. doi: 10.1136/bmj.323.7303.42
4. Ahn S, Becker BJ. Incorporating quality scores in meta-analysis. Journal of Educational and Behavioral Statistics. 2011;36(5):555–85.
5. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282(11):1054–60. doi: 10.1001/jama.282.11.1054
6. Bérard A, Bravo G. Combining studies using effect sizes and quality scores: application to bone loss in postmenopausal women. J Clin Epidemiol. 1998;51(10):801–7. doi: 10.1016/s0895-4356(98)00073-0
7. Colditz GA, Miller JN, Mosteller F. How study design affects outcomes in comparisons of therapy. I: Medical. Stat Med. 1989;8(4):441–54. doi: 10.1002/sim.4780080408
8. Chalmers TC, Smith H Jr, Blackburn B, Silverman B, Schroeder B, Reitman D, et al. A method for assessing the quality of a randomized control trial. Control Clin Trials. 1981;2(1):31–49. doi: 10.1016/0197-2456(81)90056-8
9. Detsky AS, Naylor CD, O’Rourke K, McGeer AJ, L’Abbé KA. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol. 1992;45(3):255–65. doi: 10.1016/0895-4356(92)90085-2
10. Clarke M, Oxman A. Assessment of study quality. In: Cochrane Reviewers’ Handbook 4.1.5 [Internet]. The Cochrane Collaboration. Oxford: Update Software; 2002.
11. Greenland S, O’Rourke K. On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions. Biostatistics. 2001;2(4):463–71. doi: 10.1093/biostatistics/2.4.463
12. Linde K, Scholz M, Ramirez G, Clausius N, Melchart D, Jonas WB. Impact of study quality on outcome in placebo-controlled trials of homeopathy. J Clin Epidemiol. 1999;52(7):631–6. doi: 10.1016/s0895-4356(99)00048-7
13. Valentine JC, Cooper H. Can we measure the quality of causal research in education? In: Phye GD, Robinson DH, Levin J, editors. Experimental methods for educational interventions: Prospects, pitfalls and perspectives. San Diego: Elsevier Press; 2005. p. 85–112.
14. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess. 1999;3(12):i–iv, 1–98.
15. Viswanathan M, Berkman ND, Dryden DM, Hartling L. Assessing risk of bias and confounding in observational studies of interventions or exposures: Further development of the RTI Item Bank. In: AHRQ Methods for Effective Health Care [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013.
16. Wells GA, Shea B, O’Connell D, Robertson J, Petersen JA, Peterson VW, et al., editors. The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. Ottawa: Ottawa Hospital Research Institute; 2014.
17. Rasch G. Probabilistic models for some intelligence and attainment tests (Studies in Mathematical Psychology I). Copenhagen: Danmarks Pædagogiske Institut; 1960.
18. Cooper H, Hedges LV, Valentine JC, editors. The handbook of research synthesis and meta-analysis. 2nd ed. New York, NY: Russell Sage Foundation; 2009. p. 97–109.
19. Campbell DT. Factors relevant to the validity of experiments in social settings. Psychol Bull. 1957;54(4):297–312. doi: 10.1037/h0040950
20. Devine EC, Cook TD. A meta-analytic analysis of effects of psychoeducational interventions on length of postsurgical hospital stay. Nurs Res. 1983;32(5):267–74.
21. Bérard A, Andreu N, Tétrault J, Niyonsenga T, Myhal D. Reliability of Chalmers’ scale to assess quality in meta-analyses on pharmacological treatments for osteoporosis. Ann Epidemiol. 2000;10(8):498–503. doi: 10.1016/s1047-2797(00)00069-7
22. Griffiths AM, Ohlsson A, Sherman PM, Sutherland LR. Meta-analysis of enteral nutrition as a primary treatment of active Crohn’s disease. Gastroenterology. 1995;108(4):1056–67. doi: 10.1016/0016-5085(95)90203-1
23. Treadwell JR, Tregear SJ, Reston JT, Turkelson CM. A system for rating the stability and strength of medical evidence. BMC Medical Research Methodology. 2006;6(1):52. doi: 10.1186/1471-2288-6-52
24. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17(1):1–12. doi: 10.1016/0197-2456(95)00134-4
25. Blond K, Brinkløv CF, Ried-Larsen M, Crippa A, Grøntved A. Association of high amounts of physical activity with mortality risk: a systematic review and meta-analysis. Br J Sports Med. 2020;54(20):1195–201. doi: 10.1136/bjsports-2018-100393
26. Michels N, van Aart C, Morisse J, Mullee A, Huybrechts I. Chronic inflammation towards cancer incidence: A systematic review and meta-analysis of epidemiological studies. Crit Rev Oncol Hematol. 2021;157:103177. doi: 10.1016/j.critrevonc.2020.103177
27. Patterson R, McNamara E, Tainio M, de Sá TH, Smith AD, Sharp SJ, et al. Sedentary behaviour and risk of all-cause, cardiovascular and cancer mortality, and incident type 2 diabetes: a systematic review and dose response meta-analysis. Eur J Epidemiol. 2018;33(9):811–29. doi: 10.1007/s10654-018-0380-1
28. Ji LW, Jing CX, Zhuang SL, Pan WC, Hu XP. Effect of age at first use of oral contraceptives on breast cancer risk: An updated meta-analysis. Medicine (Baltimore). 2019;98(36):e15719. doi: 10.1097/MD.0000000000015719
29. Yan S, Gan Y, Song X, Chen Y, Liao N, Chen S, et al. Association between refrigerator use and the risk of gastric cancer: A systematic review and meta-analysis of observational studies. PLOS ONE. 2018;13(8):e0203120. doi: 10.1371/journal.pone.0203120
30. Qin L, Deng HY, Chen SJ, Wei W. Relationship between cigarette smoking and risk of chronic myeloid leukaemia: a meta-analysis of epidemiological studies. Hematology. 2017;22(4):193–200. doi: 10.1080/10245332.2016.1232011
31. Jalilian H, Ziaei M, Weiderpass E, Rueegg CS, Khosravi Y, Kjaerheim K. Cancer incidence and mortality among firefighters. Int J Cancer. 2019;145(10):2639–46. doi: 10.1002/ijc.32199
  • 31.Jalilian H, Ziaei M, Weiderpass E, Rueegg CS, Khosravi Y, Kjaerheim K. Cancer incidence and mortality among firefighters. Int J Cancer. 2019;145(10):2639–46. doi: 10.1002/ijc.32199 [DOI] [PubMed] [Google Scholar]
  • 32.Lange L, Peikert ML, Bleich C, Schulz H. The extent to which cancer patients trust in cancer-related online information: a systematic review. PeerJ. 2019;7:e7634. doi: 10.7717/peerj.7634 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Al-Saleh MA, Armijo-Olivo S, Thie N, Seikaly H, Boulanger P, Wolfaardt J, et al. Morphologic and functional changes in the temporomandibular joint and stomatognathic system after transmandibular surgery in oral and oropharyngeal cancers: systematic review. J Otolaryngol Head Neck Surg. 2012;41(5):345–60. [PubMed] [Google Scholar]
  • 34.Viswanathan M, Berkman ND. Development of the RTI item bank on risk of bias and precision of observational studies. J Clin Epidemiol. 2012;65(2):163–78. doi: 10.1016/j.jclinepi.2011.05.008 [DOI] [PubMed] [Google Scholar]
  • 35.Viswanathan M, Berkman ND. Development of the RTI Item Bank on Risk of Bias and Precision of Observational Studies. Methods Research Report. (Prepared by the RTI International–University of North Carolina Evidence-based Practice Center under Contract No. 290-2007-0056-I.) January 6, 2022. p. 77 pages.
  • 36.Varas-Lorenzo C, Margulis AV, Pladevall M, Riera-Guardia N, Calingaert B, Hazell L, et al. The risk of heart failure associated with the use of noninsulin blood glucose-lowering drugs: systematic review and meta-analysis of published observational studies. BMC Cardiovasc Disord. 2014;14:129. doi: 10.1186/1471-2261-14-129 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.O’Dwyer T, O’Shea F, Wilson F. Physical activity in spondyloarthritis: a systematic review. Rheumatol Int. 2015;35(3):393–404. doi: 10.1007/s00296-014-3141-9 [DOI] [PubMed] [Google Scholar]
  • 38.Bijle MNA, Yiu CKY, Ekambaram M. Can oral ADS activity or arginine levels be a caries risk indicator? A systematic review and meta-analysis. Clin Oral Investig. 2018;22(2):583–96. doi: 10.1007/s00784-017-2322-9 [DOI] [PubMed] [Google Scholar]
  • 39.Senra H, Barbosa F, Ferreira P, Vieira CR, Perrin PB, Rogers H, et al. Psychologic adjustment to irreversible vision loss in adults: a systematic review. Ophthalmology. 2015;122(4):851–61. doi: 10.1016/j.ophtha.2014.10.022 [DOI] [PubMed] [Google Scholar]
  • 40.Slattery F, Johnston K, Paquet C, Bennett H, Crockett A. The long-term rate of change in lung function in urban professional firefighters: a systematic review. BMC Pulm Med. 2018;18(1):149. doi: 10.1186/s12890-018-0711-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Margulis AV, Pladevall M, Riera-Guardia N, Varas-Lorenzo C, Hazell L, Berkman ND, et al. Quality assessment of observational studies in a drug-safety systematic review, comparison of two tools: the Newcastle-Ottawa Scale and the RTI item bank. Clin Epidemiol. 2014;6:359–68. doi: 10.2147/CLEP.S66677 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Reinold J, Schäfer W, Christianson L, Barone-Adesi F, Riedel O, Pisa FE. Anticholinergic burden and fractures: a protocol for a methodological systematic review and meta-analysis. BMJ Open. 2019;9(8):e030205. doi: 10.1136/bmjopen-2019-030205 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Zeng X, Zhang Y, Kwong JS, Zhang C, Li S, Sun F, et al. The methodological quality assessment tools for preclinical and clinical studies, systematic review and meta-analysis, and clinical practice guideline: a systematic review. J Evid Based Med. 2015;8(1):2–10. doi: 10.1111/jebm.12141 [DOI] [PubMed] [Google Scholar]
  • 44.Bae JM. A suggestion for quality assessment in systematic reviews of observational studies in nutritional epidemiology. Epidemiol Health. 2016;38:e2016014. doi: 10.4178/epih.e2016014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Andrich D. An Index of Person Separation in Latent Trait Theory, the Traditional KR-20 Index, and the Guttman Scale Response Pattern,. Education Research and Perspectives. 1982;9:1:95–104. [Google Scholar]
  • 46.Kleppang AL, Steigen AM, Finbråten HS. Using Rasch measurement theory to assess the psychometric properties of a depressive symptoms scale in Norwegian adolescents. Health and Quality of Life Outcomes. 2020;18(1):127. doi: 10.1186/s12955-020-01373-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.American Educational Research Association APA, National Council on Measurement in Education. Standards for educational and psychological testing.. Association AER, editor 2014.
  • 48.Linacre JM. Inter-rater reliability. Rasch Measurement Transactions 1991;5(3) (166). [Google Scholar]
  • 49.Reckase MD. Unifactor Latent Trait Models Applied to Multifactor Tests: Results and Implications. Journal of Educational Statistics. 1979;4(3):207–30. [Google Scholar]

Decision Letter 0

Simon Grima

24 Feb 2023

PONE-D-22-30598

An Examination of Psychometric Properties of Study Quality Assessment Scales in Meta-analysis: Rasch Measurement Model Applied to the Firefighter Cancer Literature

PLOS ONE

Dear Dr. Ahn,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.


Please submit your revised manuscript by Apr 10 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Simon Grima, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure: 

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

At this time, please address the following queries:

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution. 

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c) If any authors received a salary from any of your funders, please state which authors and which funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3.  Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is a nicely done paper examining psychometric properties of two study quality assessment tools used in systematic reviews and meta-analysis. I am not familiar with Rasch measures or the RMT, so was not able to assess the statistical components of the analysis. However, I overall think this study is well conducted and would be a useful addition to the literature, as there are few such evaluations of the various quality assessment tools out there.

I have a few more substantial questions which I think, if addressed, might help strengthen the manuscript:

1. The NOQAS, by design, is meant to be adapted to different topics, so is functionally not the same scale for different users. The authors adapted it to their topic (firefighters/cancer), which was appropriate. However, I think this aspect should be discussed when comparing psychometric properties of this tool.

2. I think it would be helpful to add a sentence describing the training/characteristics of coders for this systematic review – this will help in understanding how easy/hard it was for them to apply these quality assessment tools.

3. The authors note that of 49 studies included in their review, only 47 were able to be evaluated by raters using NOQAS (page 7 in the second start of page numbering). Why?

I also noticed a few minor suggestions for correction as I read through the manuscript:

1. Can the Rasch measurement model be cited when first mentioned in the introduction? (Page 3)

2. PsycINFO does not have an H in it. (Page 6)

3. Need to add a (1) before standardized incidence ratio. (Page 10 if the page numbering were consistent)

4. Table 7 is a bit long and may be more helpful as supplemental material.

5. Discussion: Need to add citation to Margulis and her colleague(s) – page 11 in the new numbering.

6. Some reference formatting looks like it got off (perhaps with reference management software) so should be double-checked prior to publication.

Reviewer #2: Please add more recent citations to the paper.

Compare the discussion with the latest existing literature.

Please explain the research gap in detail.

Please add the research hypothesis and research question.

Check the style of the references against the requirements of the journal.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Caitlin E. Kennedy

Reviewer #2: Yes: Sanjay Taneja

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jul 26;18(7):e0284469. doi: 10.1371/journal.pone.0284469.r002

Author response to Decision Letter 0


30 Mar 2023

Dear Editor and Reviewers:

Thank you for the valuable feedback on our paper, which has improved it tremendously. Please find our responses to each comment from the editor and reviewers in the table added to the "Response to Reviewers" document. We look forward to publishing our paper in PLOS ONE.

Thank you so much.

Attachment

Submitted filename: PLOS ONE RR_Letter_responses.docx

Decision Letter 1

Simon Grima

3 Apr 2023

An Examination of Psychometric Properties of Study Quality Assessment Scales in Meta-analysis: Rasch Measurement Model Applied to the Firefighter Cancer Literature

PONE-D-22-30598R1

Dear Dr. Ahn,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Simon Grima, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:


Acceptance letter

Simon Grima

11 Apr 2023

PONE-D-22-30598R1

An examination of psychometric properties of study quality assessment scales in meta-analysis: Rasch measurement model applied to the firefighter cancer literature

Dear Dr. Ahn:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Simon Grima

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Included 49 studies.

    (DOCX)

    Attachment

    Submitted filename: PLOS ONE RR_Letter_responses.docx

    Data Availability Statement

    All relevant data are within the paper and its Supporting information files.

