ANCOVA (analysis of covariance): Test of whether means differ between groups after controlling for covariate(s). Useful for removing bias introduced by nonexperimental independent variables that are likely to influence the dependent variable. Multiple regression with dummy-coded variables is an alternative method for examining group effects while controlling for confounds. |
Central tendency: Refers to a central value that summarizes with one number a cluster of numerical data. The most common measures of central tendency are mean, median, and mode. Less common measures include geometric mean and harmonic mean. |
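As a minimal sketch of the measures named above, the following uses Python's standard `statistics` module (Python 3.8 or later) on a small set of made-up scores. The data and variable names are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical scores, invented for this illustration.
scores = [2, 3, 3, 4, 8]

mean = statistics.mean(scores)                  # arithmetic average
median = statistics.median(scores)              # middle value of the sorted data
mode = statistics.mode(scores)                  # most frequent value
geometric = statistics.geometric_mean(scores)   # n-th root of the product
harmonic = statistics.harmonic_mean(scores)     # reciprocal of the mean reciprocal

print(mean, median, mode, round(geometric, 3), round(harmonic, 3))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean. The single high score (8) pulls the mean above the median here.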
Chi-square test: Useful as a test for association between two nominal variables that each have two or more possible values. Tests whether the relative proportions of one variable are independent of the other. Consider using Fisher's exact test when the sample is small or some response option frequencies are very low. |
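The computation behind the test can be sketched as follows, assuming the uncorrected Pearson statistic (no continuity correction). The function name `chi_square` and the 2 × 2 counts are hypothetical. The p-value shortcut shown applies only when there is one degree of freedom; larger tables would need a chi-square distribution function from a statistics library.

```python
import math

def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = two groups, columns = agree / not agree.
table = [[30, 10], [20, 20]]
stat, df = chi_square(table)

# For df == 1 only, the upper-tail probability reduces to erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 3), df, round(p_value, 4))
```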
Cochran-Mantel-Haenszel test: Tests for association between two dichotomous variables while controlling for another dichotomous variable. Example situation: recoding a Likert item as a binary variable (1 = agree, 0 = neutral or disagree), then analyzing whether being in one of two treatment groups is associated with responding “agree,” while controlling for another dichotomous variable, such as gender. |
Coefficient alpha (aka Cronbach's alpha [α]; KR-20 is the special case for dichotomous items): An estimate of a scale's internal consistency (a form of reliability). Based on item covariances, quantifies the homogeneity of items that make up a scale. Typically ranges from 0 to 1 (negative values can occur when items covary negatively), with α ≥ 0.80 and α ≥ 0.90 commonly described, respectively, as good and excellent levels of internal consistency. |
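One common way to compute coefficient alpha uses the item variances and the variance of the total score: α = k / (k − 1) × (1 − Σ item variances / variance of totals), where k is the number of items. The sketch below assumes that formula with sample variances. The function name `cronbach_alpha` and the response data are hypothetical.

```python
import statistics

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, where each respondent
    is a list of item scores (all respondents answer the same k items)."""
    k = len(rows[0])
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_var_sum = sum(statistics.variance(item) for item in items)
    total_var = statistics.variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 respondents x 3 Likert items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

With so few respondents the estimate is unstable; the example only illustrates the arithmetic.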
Item: The basic component of a test or attitudinal measure. Refers to the text of the question or item itself, as opposed to the response format of the item. |
Item response theory (IRT; latent trait theory; Rasch model): Psychological measurement theory in which responses to items can be accounted for by latent traits that are fewer in number than the items on a test. Most applications of this theory assume a single latent trait and enable the creation of a mathematical model of how examinees at different ability levels for the latent trait should respond to an individual item. This theory does not assume that all items are of equal difficulty, as is commonly done in classical test theory. Because it allows for comparison of performance between individuals, even if they have taken different tests, it is the preferred method for developing scales (especially for high-stakes testing, such as the Graduate Record Exam). Danish mathematician Georg Rasch, one of the three pioneers of IRT in the 1950s and 1960s, is credited with developing one of the most commonly used IRT models: the one-parameter logistic model, in which all items are assumed to have the same discrimination parameter. |
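The one-parameter logistic model mentioned above is usually written as P(correct) = 1 / (1 + exp(−(θ − b))), where θ is the examinee's ability and b is the item's difficulty. A minimal sketch of that formula follows. The function name `rasch_probability` and the example values are hypothetical, and the discrimination parameter is taken to be fixed at 1.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct (or endorsing) response under the
    one-parameter logistic model, with discrimination fixed at 1."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item of difficulty 0.0 and three examinee ability levels.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(rasch_probability(theta, 0.0), 3))
```

When ability equals difficulty the probability is exactly 0.5, and it rises as ability exceeds difficulty.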
Kruskal-Wallis H (aka Kruskal-Wallis one-way ANOVA): A nonparametric overall test of whether medians differ between groups. Useful when there is one categorical independent variable with three or more groups and an ordinal dependent variable. |
Level of measurement: Refers to theoretical descriptions of data types. Includes nominal data (categorical with no inherent order), ordinal (ordered categories), interval (quantitative data with equal distances from one unit to the next), and ratio (all properties of interval plus a true zero). |
Mann-Whitney U (aka MWW; Wilcoxon rank-sum): A nonparametric test of whether two medians are different. Useful when there is one categorical independent variable with two groups and an ordinal dependent variable. |
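The U statistic itself can be sketched from pooled ranks: rank all observations together (tied values share the average of their ranks), sum the ranks in one group, and subtract n(n + 1)/2 for that group. The helper names `midranks` and `mann_whitney_u` and the Likert responses below are hypothetical. The sketch returns only the statistic; a p-value would come from tables or a statistics library.

```python
def midranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U for group a, U for group b) from pooled midranks."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical 5-point Likert responses from two groups.
group_a = [1, 2, 2, 3, 4]
group_b = [3, 4, 4, 5, 5]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```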
Question stem: The first part of an item that presents the problem to be addressed or statement to which the examinee is asked to respond. |
Reliability: The consistency or stability of a measure. The extent to which an item, scale, test, etc., would provide consistent results if it were administered again under similar circumstances. Types of reliability include test–retest reliability, internal consistency, and interrater reliability. |
Response format: Following a question stem or item on a test or attitudinal measure, an array of options that may be used to respond to the item. |
Dichotomous: Examples include true/false, yes/no, and agree/disagree formats. |
Semantic differential: Two bipolar descriptors are situated at opposite ends of a horizontal line or series of numbers (e.g., “agree” to “disagree”), which respondents use to indicate the point on the scale that best represents their position on the item. |
Likert: Response levels anchored with consecutive integers, verbal labels intended to have more or less even differences in meaning from one category to the next, and symmetrical response options. Example: Strongly disagree (1), Somewhat disagree (2), Neither agree nor disagree (3), Somewhat agree (4), Strongly agree (5). |
Scale: Although there are several different common usages for scale in psychometric literature (the metric of a measure, such as inches; collection of related test items; an entire psychological test), we use the term to mean a collection of empirically related items on a measure. A Likert scale refers to multiple Likert items measuring a single conceptual domain. A Likert item, on the other hand, refers to a single statement or question paired with multiple Likert response options. |
t test (aka Student's t test): A statistical test of whether two means differ. Useful when there is one categorical independent variable with two groups and one quantitative dependent variable. |
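As a rough sketch, the independent-samples form of the statistic divides the difference between the two group means by a standard error built from the pooled variance. The version below assumes equal variances in the two groups. The function name `pooled_t` and the data are hypothetical, and the p-value lookup is left to tables or a statistics library.

```python
import math
import statistics

def pooled_t(a, b):
    """Return Student's t statistic and degrees of freedom for two
    independent samples, assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, df

# Hypothetical scores for two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
t_stat, df = pooled_t(group_a, group_b)
print(round(t_stat, 3), df)
```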
Validity: In psychometrics, the extent to which an instrument measures what it is designed to measure. Demonstrated by a body of research evidence, not by a single statistical test. Types of validity include content validity, criterion validity, and construct validity. |