Small class sizes for improving student achievement in primary and secondary schools: a systematic review

2018 Oct 11;14(1):1–107. doi: 10.4073/csr.2018.10

Study	Used/not used in data synthesis	Notes
Achilles, 1993a	Not used in data synthesis	STAR. Reproduction of the results in Word et al. 1990 (significance levels from analysis‐of‐variance models) and further results on various subgroups (for example entering STAR in grade 1 or results on retained/not retained etc.)
Achilles, 1993b	Provide effect sizes from other studies.	Grade 4 results reproduced from Finn 1989 and Grade 5 results reproduced from Nye, 1992 and judged 5 in the other risk of bias data item. Separate results for S vs R and R vs RA
Balestra, 2014	Provide no results that can be used in data synthesis	STAR (quantile regression) only reported for kindergarten and 1. grade and Lasting Benefit Study reanalysis of graduation from high school (not an outcome of this review)
Bingham, 1994	Provide no results that can be used in data synthesis	STAR reanalysis. No useful data provided (only means)
Chetty, 2011	Provide no results that can be used in data synthesis	STAR. No useful outcomes provided. The test score, defined as the average mathematics and reading percentile rank score attained in the student's year of entry into the experiment, is the only relevant outcome reported for this review.
Ding, 2005	Provide no results that can be used in data synthesis	STAR reanalysis. None of the analyses can be used for this review. Analyses the effect of each class size in the range 12‐28 relative to 22. Further report results from regressions where class size is interacted with several covariates.
Ding, 2010	Not used in data synthesis	STAR reanalysis. Structural equation model. Effects of number of years (and sequence) treated
Ding, 2011	Provide no results that can be used in data synthesis	STAR reanalysis. Uses KG data only. Do not separate R and RA. Regression with small class interacted with covariates
Douglas, 1989	Provide no results that can be used in data synthesis	Report percent of variance accounted for by factors (class size among others) affecting mean class achievement
Finn, 1989	Provide effect sizes for grade 4. Too high risk of bias (other bias item)	Report means, SD's and effect sizes for grade 4
Finn, 1990a	Provide results and data that can be used in data synthesis (although only for grade 1)	Report effect sizes, comparing small classes to the mean of regular and regular with aide. Report means for each of the three conditions and report standard deviations based on students in regular classes. Report total number of students and number of classes in the three conditions. Results divided on location (inner‐city, rural etc.) also provided. A growth analysis of students who participated in the same classroom arrangement for both years and had complete data (35%) was also performed but is judged 5 on the incomplete outcome data item.
Finn, 1990b	Too high RoB	STAR reanalysis for those in the same class arrangement for 3 years (K‐2. grade). Judged 5 in RoB (incomplete outcome data)
Finn, 1998	Provide effect sizes from other studies.	Reporting of effect sizes (KG‐3) from Nye, 1993 and Nye, 1992/1994.
Finn, 1999	Provide results from the LBS technical reports grade 4‐7. Could use results for grade 6 and 7 as the technical reports for these grades are not available (scores 5 on the other risk of bias item though). Otherwise no results are provided that can be used in data synthesis.	Reporting of effect sizes (KG‐3) from Finn, 1998 (who reports effect sizes from other studies). Reporting of effect sizes for grades 4, 5, 6 and 7 from Finn et al. 1989; the LBS Technical Reports: Nye et al., 1992; Nye et al., 1993 (study not available) and Nye et al., 1994 (study not available). The result for 6. Grade differs substantially from the result reported in Finn, 2001. Calculate Grade Equivalence effect sizes (not an outcome of this review) and behaviour effect sizes
Finn, 2001	Provide effect sizes for grade KG‐3 and grade 4, 6 and 8. Grade 4, 6 and 8 judged 5 on the other risk of bias item.	Reanalysis of STAR and LBS. Report effect sizes, comparing small classes to regular classes. Do not report whether classes of trained teachers or out‐of‐range classes are excluded or not. Report the total number of students used, though not per grade for KG‐3. Results differ slightly from those reported in Folger 1989 for KG‐3 grade and in Finn 1989 for 4. Grade, and differ substantially from the result reported in Finn 1999 for grade 6. LBS results judged 5 in RoB (other bias)
Finn, 2005	Too high RoB	Analysis of high school graduation. Judged 5 in RoB (other bias)
Folger, 1989	Provide effect sizes for grade KG‐3. Used in data synthesis	It is most likely small classes compared to regular classes. Includes the teachers receiving STAR training although it is unclear how many teachers were trained. According to Word (1990) and this study, 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3 received Project STAR training. According to Word et al. (1994) p. 73, 67 teachers received training in grade 2 and on page 117 it is stated that all teachers (57 teachers and 57 classes) from 13 schools received training in 2. Grade and all teachers from the same 13 schools (57 classes) received training in 3. Grade. The distribution of class type is not constant in these 13 schools; in 2. Grade it is reported there are 21 S, 19 R and 17 RA and in 3. Grade there are 25 S, 15 R and 17 RA. According to Finn et al. (2007): Second, during the summer between grade 1 and grade 2 (summer 1987), a three‐day training course was given to 54 second‐grade teachers (out of 340) from 15 STAR schools. The training was the same for all 54 teachers, since the assignment to class types had not yet been made. Excludes out‐of‐range classes although it is unclear how they are defined. Uses a range of 21‐28 students for regular classes (originally the range was 22‐25). Analysis of STAR includes the 67 teachers receiving STAR training (although reports that it is 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and excludes out‐of‐range classes; results are also shown in Word (1990 and 1994).
Hanushek, 1999	Provide effect sizes for grade KG‐3. Used in data synthesis	Compares small classes to the mean of regular and regular with aide. Do not explicitly report the numbers used for analysis but probably include the classes of trained teachers and out‐of‐range classes. Report the numbers with achievement data.
Harvey, 1994	Too high RoB	STAR data, only retainees used (reanalysis). Judged 5 in RoB (other bias)
Jackson, 2013	Provide no results that can be used in data synthesis	Reanalysis uses only kindergarten and 1. Grade and a composite z‐score (average of mathematics, reading and word scores).
Jacobs, 1987	Provide no results that can be used in data synthesis and too high RoB	Is judged 5 in RoB (incomplete outcome data). Results in tables 3, 4 and 5 (for three different outcomes) have main effect for class type (not small separated out). Cross tabulation of the 3 outcomes in tables 6, 7 and 8 but only raw totals and percent scoring low/middle/high and other tables subdivided on several covariates. Scores for small class size are given in fig. 20 and 38, but no standard deviation
Konstantopoulos, 2008	Provide no results that can be used in data synthesis	STAR reanalysis. Quantile regression with covariates (gender, ethnicity and SES). Whether achievement distribution used is taken over Treated/Control or Treated+Control is not reported
Konstantopoulos, 2009	Provide no results that can be used in data synthesis and too high RoB	Reanalysis of STAR and Lasting Benefits Study data. ITT and IV analyses (same quantile regression effect of 3. grade treatment in 4‐8 grade separately), also available, and a dose analysis (judged 5 in RoB, other bias). Unclear what their achievement distribution is.
Konstantopoulos, 2011	Not used in data synthesis. Too high RoB	Reanalysis of STAR data. ITT analysis. Each school treated as an individual RCT ‐ effect size from linear regression (with small class and regular with aide compared to regular classes in the same model, cannot separate teacher effect from treatment effect in schools with only one small class and/or only one regular class (approximately 43% of schools had only one small class and 81% had only one small and/or one regular class)) ‐ overall mean calculated by inverse variance weighted random effects model. Judged 5 in RoB (other bias)
Krueger, 1999	Provide no results that can be used in data synthesis	STAR reanalysis. Average percentile scores in mathematics, reading and word (not shown separately) used for analysis.
Krueger, 2001a	Too high RoB and provide no results that can be used in data synthesis	Same analyses as Krueger & Whitmore, 2001, with updated data (in addition they only report weighted averages of percentages and do not report the numbers used for analysis, so results cannot be used).
Krueger, 2001b	Too high RoB and provide no results that can be used in data synthesis	STAR follow up. Analysis of scores on two high school entrance exams is judged 5 in RoB (other bias). Analysis of entrance exam taken or not is also available (not an outcome of this review)
McKee, 2010	Not used in data synthesis	STAR reanalysis. Only KG and merge R and RA. OLS with and without school fixed effects, controlling for teachers with fewer than three years of experience and teachers with an advanced degree, and for the student's race‐ethnicity, gender, age, special education status, whether or not they are repeating kindergarten, attendance record, and subsidized lunch eligibility. Specifications that do not include school fixed effects also include indicators for community type (suburban, rural, urban, and inner‐city). Transform test scores to have zero mean and SD of one
McKee, 2015	Not used in data synthesis	STAR reanalysis. Use only KG and pool R and RA classes and transform test scores to have zero mean and SD of one and include covariates
Mosteller, 1995	Provides results from other articles only	Provides results from other articles: Finn, J.D., and Achilles, C.M. Answers and questions about class size: A state‐wide experiment. American Educational Research Journal (1990) 27, 3:557–77, Table 5. And Word, E., Johnston, J., Bain, H.P., et al. Student/Teacher Achievement Ratio (STAR): Tennessee's K‐3 class size study, Nashville: Tennessee Department of Education, Figures 1 and 2.
Nye, 1992	Too high RoB. Not used in the data synthesis	Technical report for fifth grade of the Lasting Benefits Study. Scores 5 on the incomplete outcome data item (and other risk of bias)
Nye, 1993	Results for KG‐3 grade used in the data synthesis. Results for grade 4 and 5 are reproduced from Finn, 1989, Nye et al., 1991 (not available) and Nye, 1992.	Results for grade KG‐3 are obtained comparing small classes to the mean of regular and regular with aide, also divided on white/minority (same analysis and results as in Nye, 1992/1994). Excludes the 67 teachers receiving STAR training (it is 67 teachers according to the technical report (Word 1994) page 73 (text and table IV‐12 providing the numbers used for analysis) but on page 117 and 192 and according to Word (1990) and Folger & Breda (1989) it was 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and includes out‐of‐range classes. Numbers used for KG and 1 grade are 5734 and 5905. Do not report the numbers used for 2. and 3. Grade analyses. Report effect sizes for grade 4 and 5 comparing small to regular. Grade 4 results reproduced from Finn, 1989 and Nye et al., 1991 (not available) and Grade 5 results reproduced from Nye, 1992.
Nye, 1992/1994	Results for KG‐3 grade used in the data synthesis. Results for grade 4 and 5 are reproduced from Finn, 1989, Nye et al., 1991 (not available) and Nye, 1992.	Compares small classes to the mean of regular and regular with aide, also divided on white/minority (same analysis and results as in Nye, 1993). Excludes the 67 teachers receiving STAR training (it is 67 teachers according to the technical report (Word, 1994) page 73 (text and table IV‐12 providing the numbers used for analysis) but on page 117 and 192 and according to Word (1990) and Folger & Breda (1989) it was 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and includes out‐of‐range classes. Numbers used for KG and 1 grade are 5734 and 5905. Do not report the numbers used for 2. and 3. Grade analyses. Report effect sizes for grade 4 and 5 comparing small to regular. Grade 4 results reproduced from Finn, 1989 and Nye et al., 1991 (not available) and Grade 5 results reproduced from Nye, 1992.
Nye, 2000a	Provide no results that can be used in data synthesis	Hierarchical linear regression model separate for each grade and reading and mathematics including gender, SES and minority status, interaction of small class and gender, SES and minority respectively and (three way) interaction of small class, gender and minority and a similar analysis with three way interaction: small class, gender and SES. Coefficient estimates with stars. Cannot be used. Also available are effect sizes (d's) separated by white/minority and high/low SES and ES's by gender within race (white/minority) and SES (high/low) (but do not report number of observations used so we cannot calculate standard errors).
Nye, 2000b	Provide no results that can be used in data synthesis	Three analyses (two separate models for treatment as received (a two level and a three level model) and a three level model for treatment as assigned) each comparing regular to small and (for the two level model only) regular with aide (in the three level model regular and regular with aide are assumed to be the same). Analysis separate for each grade and reading and mathematics including gender and SES, interaction of small class and gender (although the coefficients shown are reported to be for the gender and minority interaction?), geographic location of school, teacher experience, school SES and school minority. Effect size estimates with stars (indicating significance level).
Nye, 2001a	Too high RoB	STAR follow up (9. Grade). Two analyses: 1) students who participated at least 1 year and were part of the trial in 3. Grade; 2) students participating all 4 years. Judged 5 in RoB (incomplete outcome data)
Nye, 2001b	Provide no results that can be used in data synthesis and too high RoB	STAR reanalysis, grade 1‐3, special sample: it is unclear whether some of the students in the control group they use have spent some years in a small class (the control group is characterised by: small class in some or no grades, see table 1). In the analysis for each grade they include only treated who were in small class for that grade and all previous grades. Unclear whether the control group is required to have been in the experiment for all previous grades but probably not, the total sample size increases from grade 1 to 3 whereas the treated group considerably decreases. Grade 2 and 3 judged 5 in RoB (incomplete outcome data) and it is not possible to calculate standard errors (so results for grade 1 cannot be used either)
Nye, 2002	Provide no results that can be used in data synthesis	Analysis separate for each grade and reading and mathematics including gender, SES, minority status, low achiever (below median within classes at end of kindergarten) and interaction of small class and low achiever. Coefficient estimates with stars (indicating significance level). Cannot be used. Table 1 provides effect sizes (d's) separated by low/high achievers (relative within class at end of kindergarten) (but do not report number of observations used so we cannot calculate standard errors).
Prais, 1996	Provides results from other articles and otherwise provide no results that can be used in data synthesis	STAR reanalysis. Reproduction of the Technical reports (Word, 1994) (mathematics/reading average scores) table p. 47/47 and figure p. 54/53, figure p.65/64, figure p.78/77 and figure p. 92/93 and (own) calculation of yearly value added and 3 years average of value added.
Schanzenbach, 2007	Not used in data synthesis	ITT reanalysis using composite mathematics and reading. Also provide results for composite test score for 4, 5, 6, 7 and 8 grade.
Shin, 2012	Provide no results that can be used in data synthesis	STAR reanalysis using newcomers each year only and separated by race. Several analyses: 1) ITT (by IV, random assignment as IV for actual class size, i.e. multiple CS reduction levels and include new students each year also) separated by race and controlling for race and the race difference in same equation; 2) same as 1) but in a structural simultaneous model. They investigate whether there is school‐level confounding, by comparing a model with school‐level fixed‐effects to a model without fixed‐effects (comparison of 3L ITT and 2L ITT in table 2 and 3)
Shin, 2011	Provide no results that can be used in data synthesis	Same analyses as Shin, 2012, but not separated by race. They investigate whether there is school‐level confounding, by comparing a model with school‐level fixed‐effects to a model without fixed‐effects (comparison of 3L ITT and 2L ITT in table 4 and 5)
Sohn, 2015	Too high RoB	LBS reanalysis (CTBS data) 4., 6. and 8. grade. Analyse number of years in small class and divide on ‘effective’ (i.e. significant difference) and ineffective schools (also show total). Results cannot be used
Word, 1990 and 1994	Final report for grade KG‐3. Only significance levels are reported (cannot be used). Summary of relevant results (effect sizes) from Folger, 1989 can be used.	Summary of original results. Only significance levels are reported (analysis‐of‐variance model results cannot be used as they are only reported as a summary of the analyses showing significance levels (.05, .01, .001, all levels are <=)). Provide effect sizes for KG‐3 grade from an analysis conducted by Folger (also provided in Folger & Breda, 1989).
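
Several notes above exclude reported effect sizes because the number of observations is missing, so standard errors cannot be calculated. The sketch below illustrates why group sizes are needed, using the common large-sample approximation for the variance of a standardized mean difference. It is an illustration only, not the review's own procedure: the function name and the numbers are hypothetical, and the formula ignores the clustering of students within classes, which an analysis of STAR data would likely need to address.

```python
import math


def smd_standard_error(d, n_treat, n_control):
    """Approximate standard error of a standardized mean difference d.

    Uses the common large-sample approximation
        Var(d) = (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)).
    Assumption: independent observations (no adjustment for clustering).
    """
    if n_treat <= 0 or n_control <= 0:
        raise ValueError("group sizes must be positive")
    n_total = n_treat + n_control
    variance = n_total / (n_treat * n_control) + d ** 2 / (2 * n_total)
    return math.sqrt(variance)


# Hypothetical numbers, not taken from any study in the table.
se = smd_standard_error(d=0.20, n_treat=100, n_control=100)
print(f"SE = {se:.4f}")
```

Both terms of the variance depend on the group sizes, so an effect size reported without them cannot be given a weight in an inverse-variance meta-analysis.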