Summary
This paper presents a simple decision-theoretic economic approach for analyzing social experiments with compromised random assignment protocols that are only partially documented. We model administratively constrained experimenters who satisfice in seeking covariate balance. We develop design-based small-sample hypothesis tests that use worst-case (least favorable) randomization null distributions. Our approach accommodates a variety of compromised experiments, including imperfectly documented re-randomization designs. To make our analysis concrete, we focus much of our discussion on the influential Perry Preschool Project. We reexamine previous estimates of program effectiveness using our methods. The choice of how to model reassignment vitally affects inference.
Keywords: Randomized Controlled Trial, Randomization Tests, Worst-Case Inference, Least Favorable Null Distributions, Partial Identification, Small-Sample Hypothesis Tests
1. INTRODUCTION
This paper develops a finite-sample, design-based approach for analyzing data from compromised social experiments using a satisficing model of experimenter behavior. Compromises can take many forms, including exchanges or transfers of subjects across the experimental groups based on post-randomization considerations that are not fully documented. For specificity, we motivate our approach by drawing on the influential Perry Preschool Project, an experimental high-quality preschool program targeted toward disadvantaged African-American children in the 1960s.1
Previous studies of the Perry program report substantial treatment effects on numerous outcomes.2 These studies have greatly influenced discussions about the benefits of early childhood programs.3 However, critics of the Perry program question the validity of these conclusions. They point to the small sample size of the experiment—just over a hundred observations. They also mention incomplete knowledge of, and compromises in, the randomization protocol used to form control and treatment groups. Problems with attrition and non-response are also cited. Previous research (Heckman et al., 2010a; Heckman et al., 2020) addresses some of these concerns.4 We offer an alternative approach that models experimenter decision-making in conducting the experiment.
The Perry randomization protocol was a multi-stage process. Its main compromised feature is shared by many randomized controlled trials: undocumented re-randomization. This involves reassignment of treatment status after initial random assignment in order to improve balance between experimental groups with respect to baseline covariates, but without a pre-specified, fully documented reassignment plan.
This practice occurs often. Bruhn and McKenzie (2009) survey 25 leading researchers using randomized experiments and report a typical response:
“[Experimenters] regressed variables like education on assignment to treatment, and then re-did the assignment if these coefficients were ‘too big.’”
Some 52% admit to “subjectively deciding whether to redraw” and 15% admit to “using a statistical rule to decide whether to redraw” the treatment assignment vector in at least one of the experiments they conducted.5 The authors conclude that
“this reveals common use of methods to improve baseline balance, including several rerandomization methods not discussed in print.”
The approach developed in this paper applies to experiments conducted in such a subjective and incompletely documented manner. If rerandomization criteria are specified and adhered to before carrying out final treatment assignment, there exist simpler methods for conducting valid inference.6 We supplement the literature by considering the case where the reassignment rule is only partially documented. We build on and complement the analysis of Heckman et al. (2020) with an explicit model of experimenter behavior.
We model experimenters as decision-makers who satisfice in seeking to achieve covariate balance with a “suitable” metric. Implicit decision rules underlie all covariate balancing procedures. The decision-makers forming the experimental groups do not necessarily have a precise rule in mind and satisfice in the sense of Simon (1955). Even if experimenters have a specific rule in mind, it may not be carefully documented.
This paper proceeds in the following way. Section 2 illustrates the class of problems addressed in this paper by reexamining the reassignment protocols of an influential compromised small-sample social experiment, to which we apply our methods both here and in a more extensive analysis (Heckman et al., 2020). Section 3 presents a satisficing model of experimenter behavior consistent with the available information on it from published and informal accounts. We partially identify the set of randomization protocols consistent with our model. We consider the generality of our approach by discussing the class of experiments to which our model applies. In Section 4, we first discuss hypotheses of interest and conventional testing procedures used in the literature. We then construct worst-case randomization tests using stochastic approximations of least favorable randomization null distributions. We also compare our approach with that of Heckman et al. (2010a). Section 5 presents our test statistics and uses our methodology to reexamine the inference reported by Heckman et al. (2020). Section 6 concludes.
2. THE MOTIVATING PROBLEM
To give specificity to our analysis we draw on a prototypical social experiment, the Perry Preschool Project, which was conducted in the early 1960s. The original sample for the experiment consisted of 128 children. Five of these children were dropped from the study due to extraneous reasons.7 Starting at age 3, treatment in the following two years included preschool for 2.5 hours per day on weekdays during the academic year. The program also offered 1.5-hour weekly home visits by the Perry teachers to promote parental engagement with the child.8 For more details on the background and eligibility criteria of the program, see Heckman et al. (2010a) and Appendix A.
2.1. Randomization Protocol
Understanding the randomization protocol is essential for constructing valid frequentist inference for any experiment. As Bruhn and McKenzie (2009) emphasize, many experimental studies in economics do not report the complete set of rules (e.g., balancing criteria) used to form experimental samples. They conduct hypothesis tests that ignore the randomization protocols actually used. In analyzing the Perry data, this issue is salient. Reports vary about the procedure used and the exact rules followed in creating experimental samples. We discuss the various descriptions of the randomization protocols. While the core descriptions of the procedure followed are broadly consistent across texts, some of the details provided are vague and inconsistent, even those by the same authors. We account for this ambiguity in designing and interpreting our hypothesis tests. While the details are Perry-specific, the general principles involved are not.
Before the initiation of the randomization procedure by the Perry staffers in each of the last four Perry cohorts, any younger siblings of participants enrolled in previous waves are separated from children of freshly recruited families, whom we term “singletons” (Schweinhart et al., 1985; Schweinhart, 2013). As Schweinhart et al. (1985) explain,
“[A]ny siblings [are] assigned to the same group [either treatment or control] as their older siblings in order to maintain the independence of the groups.”
By construction this does not apply to the very first cohort.
The singletons from new families are then randomized into the two experimental groups as follows. Weikart et al. (1978) detail the second step of the randomization protocol:
“First, all [singletons] are rank-ordered according to Stanford–Binet [IQ] scores. Next, they are sorted (odd / even) into two groups.”
Singletons are then divided into two groups, one comprising those with even IQ ranks and another with odd IQ ranks. The latter group has one additional person if the singletons are odd in number; otherwise, the sizes of the two groups are equal.
In the third step, children are exchanged between the two groups to balance the vector of means of an index of socioeconomic status (SES), the proportions of boys and girls, and the proportion of children with working mothers, in addition to mean IQ (Weikart et al., 1964; Schweinhart et al., 1993). The exact balancing criteria and the number of exchanges are not specified, and the exchanges are not necessarily restricted to those between consecutively ranked IQ pairs,9 as is sometimes assumed, e.g., in Heckman et al. (2020). After the first three steps, there are two undesignated groups that differ in number by at most one, and the two groups are balanced with respect to mean IQ, mean SES, percentage of boys, and the proportion of children with working mothers, in a manner acceptable to the staffers, using balancing rules that are undocumented.
All sources agree that in the fourth step a toss of a fair coin decides assignment of the two groups to treatment and control conditions. The fifth step concerns children with working mothers who are placed in the treatment group after the fourth step. In the fifth step, some of these children are transferred to the control group.10 Although there is no consistent account of the number of transfers, the sources describe the fifth step as involving one-way transfers of some children of working mothers from the treatment group to the control group.11 Weikart et al. (1978) provide reasons for the transfers: “no funds were available [to provide all working mothers with logistical support, and] special arrangements could not always be made.” We interpret this statement as implying that special arrangements could be made for at least some working mothers to enable their children to attend preschool and participate in home visits if placed in the treatment group. The constraints facing program administrators in doing so likely vary across cohorts. We assume that the Perry staffers are impartial as to which working mothers get special arrangements.
Table 1 summarizes the randomization protocol. The main sources of ambiguity are in boldface: (a) the undocumented balancing criteria and rules used to satisfactorily balance the two undesignated groups with respect to the mean levels of baseline variables in the third step; and (b) the nature of constraints on the provision of special home visitation arrangements for children of working mothers in the fifth step.
Table 1.
Schematic of the Actual Randomization Protocol
| 1. Recruit participants and separate any younger siblings of participants enrolled in previous waves from singletons (children of freshly recruited families) |
| ↓ |
| 2. Rank singletons by IQ and split into two groups based on whether the rank is even or odd |
| ↓ |
| 3. Exchange singletons between the two groups to satisfactorily balance the mean levels of a vector of IQ, SES, gender, and mother's working status |
| ↓ |
| 4. Toss a fair coin to determine which of the two groups becomes the initial treatment group |
| ↓ |
| 5. Transfer some children of working mothers from the treatment group to the control group impartially if special arrangements for home visits can be made for only a limited number |
| ↓ |
| 6. Assign any eligible younger siblings to the same group as their enrolled older siblings |
3. MODELING THE RANDOMIZATION PROTOCOL AND BOUNDING THE UNKNOWN PARAMETERS
Since no precise description of the full Perry randomization protocol exists, we do not know who was exchanged in the third step and who was transferred in the fifth step, making a standard bounding analysis intractable. To address this problem, we assume that experimenters satisfice12 in seeking “balance” in the baseline covariate means of treatment and control groups, while facing capacity constraints on special home visits for children of working mothers.
Using this model, we bound the level of covariate balance deemed acceptable by the experimenters at the end of the first three stages of the protocol. We also bound the number of possible transfers at the fifth stage of the assignment procedure. Our model and the identified bounds are used to construct worst-case randomization tests using least favorable null distributions (for treatment effects). While the details differ, the approach readily generalizes to the class of compromised re-randomization designs discussed by Bruhn and McKenzie (2009).
3.1. Formalizing the Randomization Protocol
We first model the Perry randomization protocol and later discuss its generalizability. Let be the set of unique identifiers of participants in cohort13 c ∈ {0, 1, 2, 3, 4} with no elder siblings already enrolled in the Perry Preschool Project. The cardinality of the set of singletons is .14 The participants in the set are ranked according to their IQs by the Perry staffers, using an undocumented method to break any ties. The participants with odd and even ranks are then split into two undesignated groups, with and members, respectively.15 Staffers exchange participants between the two groups until the mean levels of four variables (Stanford–Binet IQ, index of socioeconomic status, gender, and mother’s working status) are balanced to their satisfaction.16 The exact metric the staffers used to determine satisfactory covariate balance is not documented.
We assume that they use Hotelling’s two-sample t-squared statistic , which is equivalent to the Mahalanobis distance metric often used in matching.17 However, for each cohort’s initial groups (partially identified in Section 3.2), the Hotelling statistic and raw mean differences do not correspond to their possible minimum values and are sometimes far away from them.18 Thus, it appears in terms of this model that program officials were satisficing rather than optimizing (minimizing covariate imbalance) in constructing the two groups.
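As a concrete illustration of the imbalance metric we assume, the following sketch computes the two-sample Hotelling statistic from raw covariate matrices. The function name and interface are ours, not the study's; it is a minimal implementation of the standard formula, not a description of any code the Perry staffers used.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T-squared statistic for covariate imbalance.

    x, y: (n1, k) and (n2, k) arrays of baseline covariates for the two
    undesignated groups. Up to a monotone transformation, this equals the
    Mahalanobis distance between the two group mean vectors, so the
    satisficing set is unaffected by which of the two is used.
    """
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled sample covariance matrix of the covariates
    s_pooled = ((n1 - 1) * np.cov(x, rowvar=False)
                + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    return float((n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(s_pooled, diff))
```

The statistic is invariant to affine transformations of the covariates, consistent with the observation above that the satisficing set does not depend on strictly increasing transformations of the metric and the corresponding threshold.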
This process results in a partition of the set chosen uniformly from
| (3.1) |
where δc is a satisficing threshold that captures how stringent or lax the Perry staffers were in trying to balance the mean levels of the two groups.19 Note that the above set is invariant to the choice of any strictly increasing transformation of the Hotelling statistic and the corresponding satisficing threshold. Define as an indicator of whether participant belongs to . In other words, .
In the next stage, the Perry staffers flip a fair coin to determine whether or becomes the preliminary treatment group. Let Qc be an indicator of whether the coin flip results in a head. If Qc = 1, then becomes the treatment group. If Qc = 0, then becomes the treatment group. Let denote membership in the preliminary treatment group. Thus
| (3.2) |
In the next step, some children of working mothers initially placed in the treatment group are transferred to the control group.20 To model this process, we introduce additional notation. Define Mi as an indicator of whether participant i’s mother was working at baseline. Cohorts 0 and 1 were both randomized in the fall of 1962, while the remaining cohorts were randomized in successive years from 1963 through 1965. For cohorts 2 and higher, i.e., for c ∈ {2, 3, 4}, let mc be the number of children of working mothers initially placed in the treatment group: . For the entry cohorts, let m0,1 be the number of children of working mothers initially placed in the treatment group for cohorts 0 and 1, that is, .
Define ηc as a parameter indicating the maximum number of children of working mothers in cohort c ∈ {2, 3, 4} for whom special arrangements could be made to enable special home visits.21 We define η0,1 to be the parameter indicating the maximum number of children of working mothers in the pooled cohorts 0 and 1 for whom special home visitation arrangements could be made, averting their transfer to the control group if placed in the initial treatment group.22
Special arrangements are made for min(η0,1, m0,1) children of working mothers in the entry cohorts and for min(ηc, mc) such children in each cohort c ∈ {2, 3, 4} to enable special home visits, as opposed to weekday home visits for children of non-working mothers. If there are any remaining children with working mothers in the initial treatment group, they are transferred to the control group, potentially increasing covariate imbalance.23 We assume that the Perry staffers impartially choose (with equal probability) the children for whom the special accommodations are made.24 To formalize this assumption, let Vi,c be a binary indicator for whether the participant was placed in the initial treatment group, had a working mother, and remained in the treatment group through special accommodations for home visits. The vector is assumed to be drawn uniformly from the set for all c ∈ {2, 3, 4}. Since the two entry cohorts face a common capacity constraint with respect to special home visitation accommodations, the vector is assumed to be drawn uniformly from the set . In addition, Vi,c = 0 for a participant if for all c ∈ {0, 1, 2, 3, 4}.25 In this notation, the participant’s final treatment status is given by
| (3.3) |
Any Perry subjects with identifiers not in receive the same treatment status as their elder siblings already enrolled in the Perry study. Thus, the final treatment status Di of the i-th subject is given by if . Otherwise, if participant i is not from a freshly recruited family, the assignment is given by Di = Dh, where the h-th subject is the i-th subject’s eldest sibling enrolled in the Perry study, if , where is the set of all Perry subjects.
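The assignment process formalized above (steps 2–5, for a single cohort of singletons) can be sketched as a simulation. This is a minimal sketch under our satisficing model: the function, the swap-until-satisfied loop, and all names are our constructs, the threshold is assumed attainable so the loop terminates, and the exchange rule here is a stand-in for the staffers' undocumented procedure.

```python
import numpy as np

def assign_cohort(iq, mother_works, eta, delta, imbalance, rng):
    """Simulate steps 2-5 of the modeled protocol for one cohort of singletons.

    iq: baseline IQ scores; mother_works: 0/1 indicators; eta: capacity for
    special home visits; delta: satisficing threshold; imbalance: maps a 0/1
    group-membership vector to a scalar imbalance measure (e.g. Hotelling's
    statistic). Returns the final 0/1 treatment vector.
    """
    n = len(iq)
    order = np.argsort(-iq)                  # step 2: rank singletons by IQ
    g = np.zeros(n, dtype=int)
    g[order[::2]] = 1                        # odd ranks in one group, even in the other
    # Step 3: exchange members until imbalance is acceptable (satisficing).
    # We assume delta is attainable so the loop terminates.
    while imbalance(g) > delta:
        i = rng.choice(np.flatnonzero(g == 1))
        j = rng.choice(np.flatnonzero(g == 0))
        g[i], g[j] = 0, 1                    # swap one pair, then re-check
    # Step 4: fair coin decides which group is the initial treatment group.
    t = g.copy() if rng.integers(2) else 1 - g
    # Step 5: transfer working-mother children in excess of capacity eta,
    # choosing impartially (uniformly) which children keep their slots.
    wm_treated = np.flatnonzero((t == 1) & (mother_works == 1))
    if len(wm_treated) > eta:
        drop = rng.choice(wm_treated, size=len(wm_treated) - eta, replace=False)
        t[drop] = 0
    return t
```

Drawing many assignments from this simulator is also how the randomization null distributions used later in the paper can be approximated stochastically.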
3.2. Partially Identifying Satisficing Thresholds and Capacity Constraints
Using the Perry data, we now demonstrate how we can partially identify the satisficing thresholds δc and the special home visitation capacity constraints ηc using the last three cohorts as examples. We then present a general framework for partially identifying these parameters.
Example 1: Wave 2 (A Case with One Transfer in the Last Stage)
| Wave 2 | Di= 0 | Di = 1 | Total |
|---|---|---|---|
| Mi = 0 | 9 | 7 | 16 |
| Mi = 1 | 3 | 3 | 6 |
| Total | 12 | 10 | 22 |
This example discusses the steps for bounding the parameters δ2 and η2 in wave 2. Shown above is a contingency table of mother’s working status Mi and final treatment status Di for participants in cohort 2 with no elder siblings already enrolled in the Perry study. There are 22 such participants in total. Since there are an even number of participants, each of the initial two undesignated groups (as well as the initial treatment and control groups in the next stage) would have been 11 in size. However, we observe only 10 members in the final treatment group but 12 members in the final control group. This implies that there must have been one transfer from the initial treatment group to the control group. Thus, one of the 3 children of working mothers in the final control group was in the initial treatment group. However, we do not know exactly which one of these children was transferred, so there are 3 possibilities for the initial treatment group. Each of these possibilities yields a Hotelling two-sample statistic. One of these three statistics was the actual level of covariate imbalance between the initial treatment and control groups, and this level of imbalance is, by construction, within the satisficing threshold δ2 of the Perry staffers. Thus, δ2 must be at least as large as the smallest of the three statistics. In addition, m2 = 4, since there must have been 4 children of working mothers in the initial treatment group, consisting of the 3 participants who remain in the final treatment group and the 1 participant who was transferred to the control group. Since 3 of the initial 4 participants remained in the final treatment group, min(η2, m2) = min(η2, 4) = 3, implying that η2 = 3, the only solution that satisfies the equality. We next present two other examples.
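The enumeration used in Example 1 can be sketched as follows. The function name and the generic imbalance argument are ours; the logic is simply to undo every possible combination of last-stage transfers and take the smallest resulting imbalance as the lower bound on the satisficing threshold.

```python
from itertools import combinations

import numpy as np

def delta_bound(final_t, mother_works, n_transfers, imbalance):
    """Lower bound on the satisficing threshold delta_c for one cohort.

    Enumerates every candidate initial treatment group: the observed final
    treatment group plus n_transfers of the working-mother children now in
    the control group. One candidate is the true initial group, whose
    imbalance was (by construction) within delta_c, so delta_c is at least
    the smallest imbalance over all candidates.
    """
    candidates = np.flatnonzero((final_t == 0) & (mother_works == 1))
    stats = []
    for moved in combinations(candidates, n_transfers):
        g = final_t.copy()
        g[list(moved)] = 1        # undo the hypothetical transfers
        stats.append(imbalance(g))
    return min(stats)
```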
Example 2: Wave 3 (A Case with 1 or 2 Transfers in the Last Stage)
| Wave 3 | Di = 0 | Di= 1 | Total |
|---|---|---|---|
| Mi = 0 | 7 | 9 | 16 |
| Mi = 1 | 5 | 0 | 5 |
| Total | 12 | 9 | 21 |
In this example, we show a contingency table of Mi and Di for the 21 participants in cohort 3. The sizes of the larger and smaller undesignated groups would have been 11 and 10, respectively. However, either of these two groups could have been the initial treatment group. Since there are 12 members in the final control group and 9 in the final treatment group, there are two possible cases: if the initial treatment group had 10 members, there would have been 10 − 9 = 1 transfer; but if it had 11 members, there would have been 11 − 9 = 2 transfers. Since the number of transfers involving children of working mothers is either 1 or 2, and all 5 children of working mothers in this cohort are in the control group, the number of possibilities for the initial treatment group is 5 + 10 = 15 (choosing which 1 or which 2 of the 5 were transferred). Each of these 15 possibilities yields a Hotelling statistic, and δ3 must be at least as large as the smallest of them. In addition, m3 ∈ {1, 2}, since m3 is the sum of the number of transfers (either 1 or 2) and the number of remaining children in the final treatment group (0 in this cohort). As no working mother remained in the treatment group, min(η3, m3) = 0, implying that η3 = 0, which is the only number consistent with this equality. Thus, the Perry staffers were unable to provide special home visitation accommodations for any of the participants in this cohort.
Example 3: Wave 4 (A Case with No Transfers in the Last Stage)
| Wave 4 | Di = 0 | Di = 1 | Total |
|---|---|---|---|
| Mi = 0 | 5 | 10 | 15 |
| Mi = 1 | 4 | 0 | 4 |
| Total | 9 | 10 | 19 |
In this example, we show a contingency table of Mi and Di for the 19 participants in cohort 4. The sizes of the larger and smaller undesignated groups would have been 10 and 9, respectively. These coincide with the final sizes of the treatment and control groups, respectively. Accordingly, we can conclude that the observed final treatment group was indeed the initial treatment group for this cohort. Otherwise, the control group would have had at least 10 members. The satisficing threshold δ4 is therefore bounded below by the Hotelling statistic for the observed partition based on final treatment status. In addition, note that there are no children of working mothers in the final treatment group, which was also the initial treatment group, and so m4 = 0. Since min(η4, m4) = min(η4, 0) = 0 and there are 4 members with working mothers in total, it follows that η4 ∈ {0, 1, 2, 3, 4}, because any of these values satisfies the equality. Thus, the observed data for cohort 4 is not helpful in bounding η4.
Partial Identification of the Satisficing Thresholds and Capacity Constraints in General
We now present a general characterization of how to partially identify the satisficing thresholds and capacity constraints on special home visits.
| Wave c | Di = 0 | Di = 1 | Total |
|---|---|---|---|
| Mi = 0 | ω 0,0 | ω 0,1 | ω 0,* |
| Mi = 1 | ω 1,0 | ω 1,1 | ω 1,* |
| Total | ω *,0 | ω *,1 | |Sc| |
The above contingency table shows that there are ωm,d participants with (Mi, Di) = (m, d) ∈ {0, 1}2 among the set of participants in cohort c.26 The total number of children with non-working mothers is ω0,* = ω0,0 + ω0,1 and that of working mothers is ω1,* = ω1,0 + ω1,1. The total number of participants in the final control group is ω*,0 = ω0,0 + ω1,0 and that in the final treatment group is ω*,1 = ω0,1 + ω1,1. The partial identification of the satisficing thresholds and capacity constraints varies depending on whether the number of singletons in the cohort is even or odd, and also depending on whether ω*,1 ≥ ω*,0 or ω*,0 > ω*,1. We discuss each of these cases separately.
First, consider the case where ω*,1 ≥ ω*,0, whether the number of singletons is even or odd. In this case, since the size of the final treatment group remains the same as that of the initial treatment group, there must have been no transfers of children with working mothers from the treatment group to the control group. Since the final treatment group is the same as the initial one, the satisficing threshold δc is bounded below by the Hotelling statistic for the partition of the singletons based on the final treatment status. In addition, since there are no transfers, the number of children of working mothers in the initial treatment group mc equals ω1,1. Since min(ηc, ω1,1) = ω1,1, it follows that ηc ∈ {ω1,1,…,ω1,*}, i.e., the number of slots available for special home visits must be at least the number ω1,1 observed in the data.
Second, consider the case where the number of singletons is even and ω*,0 > ω*,1. As in Example 1, in this case it is clear that the number of transfers in the final stage must have been χc = (ω*,0 − ω*,1)/2, which is a positive number. The χc transferred children must be among the ω1,0 members with working mothers in the final control group. Thus, there are (ω1,0 choose χc) possibilities for the initial treatment group, and the satisficing threshold δc is bounded below by the smallest of the Hotelling statistics for those possibilities. In addition, there must have been mc = ω1,1 + χc children with working mothers in the initial treatment group. It remains to determine which values of ηc are consistent with the equality min(ηc, ω1,1 + χc) = ω1,1. Since χc > 0, it follows that ηc = ω1,1.
Third, consider the case where the number of singletons is odd and ω*,0 > ω*,1. As in Example 2, in this case there are two possibilities for the number χc of transfers in the final stage. Specifically, χc ∈ {(ω*,0 − ω*,1 − 1)/2, (ω*,0 − ω*,1 + 1)/2}, depending on whether the initial treatment group was the smaller or the larger of the two undesignated groups. These χc transferred children must be among the ω1,0 members with working mothers in the final control group. Thus, there are (ω1,0 choose χc) possibilities for the initial treatment group for each possible value of χc, and the satisficing threshold δc is bounded below by the smallest of the Hotelling statistics over all of these possibilities. The number mc of children with working mothers initially assigned treatment is either equal to ω1,1 + (ω*,0 − ω*,1 − 1)/2 or equal to ω1,1 + (ω*,0 − ω*,1 + 1)/2. If (ω*,0 − ω*,1 − 1)/2 > 0, then ηc = ω1,1, since min(ηc, mc) = ω1,1 with mc > ω1,1 in both cases. However, if (ω*,0 − ω*,1 − 1)/2 = 0, there are two sub-cases: if one transfer occurred, then ηc = ω1,1; but if no transfer occurred, then ηc ∈ {ω1,1, …, ω1,*}. Therefore, the special home visiting slots can be partially identified as ηc = ω1,1 if (ω*,0 − ω*,1 − 1)/2 > 0, and ηc ∈ {ω1,1, …, ω1,*} if (ω*,0 − ω*,1 − 1)/2 = 0.
This general characterization of the partial identification of satisficing thresholds δc applies to all cohorts c ∈ {0, 1, 2, 3, 4} but that of the special home visiting capacity constraints ηc applies only to cohorts c ∈ {2, 3, 4}. However, similar reasoning can be used to partially identify the capacity constraint η0,1 for pooled cohorts 0 and 1.27
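Our reading of the three cases can be summarized in the following sketch, which maps a cohort's contingency-table counts to the possible numbers of transfers, the count of candidate initial treatment groups, and the identified set for ηc. The case conditions and formulas are our reconstruction, and all names are ours; the three waves worked through above serve as checks.

```python
from math import comb

def identify_bounds(w00, w01, w10, w11):
    """Map a cohort contingency table to partial-identification results.

    w[m][d] counts participants with mother-working indicator m and final
    treatment status d. Returns (possible numbers of transfers, number of
    candidate initial treatment groups, identified set for eta_c).
    """
    n = w00 + w01 + w10 + w11
    ctrl, treat = w00 + w10, w01 + w11
    wm = w10 + w11                       # total children of working mothers
    if treat >= ctrl:
        # Case 1: final treatment group equals the initial one; no transfers.
        return [0], 1, list(range(w11, wm + 1))
    if n % 2 == 0:
        # Case 2: even cohort; the number of transfers is pinned down.
        chi = (ctrl - treat) // 2
        return [chi], comb(w10, chi), [w11]
    # Case 3: odd cohort; either chi or chi + 1 transfers occurred.
    chi = (ctrl - treat - 1) // 2
    n_candidates = comb(w10, chi) + comb(w10, chi + 1)
    eta = [w11] if chi > 0 else list(range(w11, wm + 1))
    return [chi, chi + 1], n_candidates, eta
```

For instance, the wave-2 table (9, 7, 3, 3) should yield one transfer, three candidate initial groups, and η2 = 3, matching Example 1.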
3.3. Applicability of Our Approach to Other Compromised Experiments
Our approach can be applied to many of the studies that Bruhn and McKenzie (2009) criticize, especially experiments using undocumented re-randomization. All of these experiments have the feature that some criterion determines “satisfactory balance.” For example, Bruhn and McKenzie (2009) quote a survey response that says, “[experimenters] regressed variables like education on assignment to treatment, and then re-did the assignment if these coefficients were ‘too big.’” With appropriate modifications, our model of satisficing thresholds directly applies to experiments conducted in such a subjective and incompletely documented manner. Suitable adjustments include replacing Hotelling’s statistic in our model with studentized regression coefficients (selected by pretesting or otherwise) or other metrics actually used to measure covariate imbalance between the treatment and control groups. Our methods for partially identifying the underlying randomization rules can be used when the subjective satisficing thresholds are not documented. Even though we only use one balancing criterion (Hotelling’s statistic) for dimensionality reduction in our definition of , it can be trivially modified to accommodate multiple balancing criteria. In addition, if the experiment has strata instead of cohorts, the c’s in our model would correspond to strata.
If an experiment does not have transfers after forming the intermediate treatment and control groups, then there are no capacity constraints, i.e., the ηc’s play no role. However, in some social experiments, post-randomization transfer of some participants from the control to the treatment group can occur if additional funding for the intervention becomes available. For example, wait-list control groups are used in some clinical studies. While this is the reverse of what occurred in the Perry experiment, our model (with appropriate modifications) can be readily applied. Overall, our approach can be adapted to analyze a variety of compromised experiments across multiple disciplines.
4. HYPOTHESES OF INTEREST AND INFERENCE
The conventional way to analyze randomized experiments is to posit a null hypothesis that the average effect of treatment is zero and to test it with large-sample methods using asymptotic or bootstrap distributions. Given the relatively small size of many experimental samples, reliance on large-sample methods can be problematic.28
In some settings permutation tests can be used to test the null hypothesis that the outcomes in the control group have the same distribution as those in the treatment group without relying on large-sample theory. Permutation tests exploit the property that treatment and control labels within the same strata are exchangeable under the null hypothesis of a common outcome distribution. If randomization of the treatment status did not involve explicit stratification on baseline covariates, permutation tests must make restrictive assumptions about the strata within which treatment and control labels are exchangeable. This approach is used by Heckman et al. (2010a).29 They assume that conditioning on covariates solves the problem of post-random-assignment reallocation, but they do not provide an explicit model for why it is effective in doing so.30
This paper uses knowledge of the randomization protocol to draw inferences about treatment effects. Once a precise null hypothesis is specified, we can determine the distribution of estimates generated by the randomization scheme and assess the statistical significance of the observed treatment effects.
In this section, we first formulate our hypotheses of interest. We then discuss conventional inferential procedures. Finally, we introduce worst-case (least favorable) randomization tests and discuss how to conduct them using stochastic approximations, and then we compare our methods with alternative approaches for inference with imperfect randomization.
4.1. Hypotheses of Interest
Let Y1 be the treated outcome, Y0 be the untreated outcome, Z represent background variables, and F be their joint distribution at the population level. The conventional approach tests the hypothesis of equality of means, i.e.,
| (4.4) |
assuming that the realizations of those variables for individual i are distributed according to F for all , where is the set of experimental subjects. Because each participant in our sample is assigned to either the treatment group or the control group, we only observe either or for each . The hypothesis is traditionally tested by applying large-sample methods to the observed data , where Di is the treatment status, , and Zi is the vector of pre-program covariates. It is of interest to conduct tests about the missing counterfactual outcomes within our sample, even though tests of population-level parameters are more commonly employed.
Instead of appealing to some hypothetical long-run sampling experiment to conduct inference, we seek knowledge of the sample in hand. One hypothesis of interest is whether the average treatment effect within the sample is zero, i.e.,
| (4.5) |
31 where . A special case of is the sharp null hypothesis of no treatment effects whatsoever for each participant:
| (4.6) |
for all ,32 Fisher’s (1935; 1925) null hypothesis. It involves a joint test of zero individual treatment effects and is trivially equivalent to if there is no heterogeneity in the treatment effects. The advantage of Fisher’s hypothesis is that it provides a testable model in which all the counterfactual outcomes are specified.33 Such hypothesis testing can be conducted using our knowledge of the randomization protocol without relying on large-sample theory. With all the counterfactual outcomes specified, we can learn about the randomization distribution of treatment effects, and we can gauge the extent to which the observed data can be rationalized using the specified null model.34
Hypothesis (4.5) nests the sharp null hypothesis (4.6). In general, many configurations of the individual treatment effects are consistent with (4.5). Thus, to test (4.5) using only limited knowledge of the randomization protocol, we would need to test every sharp null hypothesis, such as (4.6), that implies (4.5).35 However, non-rejection of the sharp null (4.6) implies non-rejection of (4.5), and so testing other sharp null hypotheses may not be necessary if we are unable to reject (4.6). Of course, a rejection of (4.6) would not imply a rejection of (4.5); the latter criterion is very conservative. We next discuss conventional hypothesis testing procedures.
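To make the mechanics concrete, a Monte Carlo randomization test of a sharp null hypothesis can be sketched as follows. This is our own illustrative code, not the paper's implementation; it assumes a completely randomized assignment, with the difference in means standing in for the test statistic T(·):

```python
import numpy as np

def sharp_null_p_value(y, d, test_stat, draw_assignment, n_draws=10_000, seed=0):
    """Monte Carlo randomization p-value under a sharp null hypothesis.

    Under the sharp null Y_{1,i} = Y_{0,i}, the outcome vector y is fixed
    no matter how treatment is assigned, so only assignments are re-drawn.
    """
    rng = np.random.default_rng(seed)
    t_obs = test_stat(y, d)
    t_draws = np.array([test_stat(y, draw_assignment(rng)) for _ in range(n_draws)])
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (1 + np.sum(t_draws >= t_obs)) / (1 + n_draws)

def diff_in_means(y, d):
    """Difference in mean outcomes between treated and control groups."""
    return y[d == 1].mean() - y[d == 0].mean()
```

For a completely randomized design, `draw_assignment` can simply permute the observed treatment vector; richer protocols replace it with draws from the modeled assignment mechanism.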
4.2. Conventional Hypothesis Testing Procedures
For tests of population-level parameters such as in equation (4.4), the most commonly reported measure of statistical significance is the asymptotic p-value. For completely randomized experiments, it can be interpreted as the p-value based on a large-sample approximation of the distribution of an estimator, say the difference in means, over all possible randomizations under the null hypothesis (Neyman, 1923). Li et al. (2018) derive an asymptotic theory of the difference-in-means estimator in experiments involving re-randomization with a pre-specified balancing rule based on the Mahalanobis distance, for which the asymptotic distribution of the estimator is a linear combination of normal and truncated normal random variables. Resampling methods are also widely used to quantify statistical uncertainty. For example, many research papers report the bootstrap standard error with an associated bootstrap p-value.
Permutation tests are often used when researchers are interested in testing whether treatment and control groups have a common outcome distribution without relying on large-sample theory. Such tests rely on the property that the treatment and control labels are exchangeable within each stratum of the experiment under the null hypothesis of a common distribution. In their permutation tests, Heckman et al. (2010a) use strata defined by wave, gender, and an indicator for above-median socioeconomic status, assuming that experimental labels within each stratum are exchangeable. To compare their permutation procedures with the methods developed in this paper, we use a simplified version of their permutation tests using block permutations within cohorts of eldest participant-siblings (whose treatment statuses determine those of their younger participant-siblings).
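A block permutation of the kind just described can be sketched as follows; the stratum labels below are generic placeholders for the wave-gender-SES strata (or sibling cohorts) named in the text:

```python
import numpy as np

def stratified_permutation(d, strata, rng):
    """Permute treatment labels independently within each stratum.

    Exchangeability within strata means the per-stratum counts of treated
    and control units are preserved while labels are shuffled.
    """
    d_new = d.copy()
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        d_new[idx] = rng.permutation(d[idx])
    return d_new
```

Repeated calls generate the block-permutation null distribution of any chosen test statistic.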
In the Perry context, Heckman et al. (2020) develop an extension of permutation tests to account for imperfect randomization. In this paper, we offer an alternative design-based approach to conduct inference for a broader class of compromised experiments. We first present our approach and then compare it with theirs.
4.3. Worst-Case Randomization Tests
This paper advocates and uses worst-case approximate randomization tests to analyze the Perry data. Fisher’s sharp null hypothesis specifies all the counterfactual outcomes, which are imputed from the observed data according to the hypothesis. If we knew the exact randomization protocol of the Perry experiment, we could measure where the observed test statistic falls in its exact randomization distribution, i.e., the distribution of the test statistic over all possible treatment status vectors that could have been generated by the protocol. The further in the tail of this null distribution the observed test statistic falls, the more incompatible the observed data are with the sharp null hypothesis. However, for Perry and many other social experiments, the exact randomization protocol is unknown: even in our stylized model of the randomization protocol, the satisficing thresholds and capacity constraints are only partially identified. To account for this ambiguity, we could in principle conduct randomization tests36 over the set of all possible randomization protocols. In particular, we conduct the worst-case randomization test using the least favorable distribution among all the possible randomization distributions. This yields the following worst-case p-value, which serves as an upper bound for the true randomization p-value:
$p_w = \sup_{\gamma^* \in \Xi} \Pr\nolimits_{\Lambda_{\gamma^*}}\!\left(T(\tilde{D}_{\gamma^*}) \geq T(D)\right)$ | (4.7) |
where Ξ is the partially identified set for γ = (δ0, …, δ4, η0,1, η2, η3, η4), the vector of true parameter values (satisficing thresholds and capacity constraints); PrΛγ* represents probability under the probability space Λγ* of randomizations generated by the protocol parameterized by γ*; D̃γ* represents a random treatment status vector defined on Λγ*; D denotes the observed treatment status vector; and T(·) is the chosen test statistic, which maps a treatment status vector to a real number measuring the magnitude of the outcome difference between the treatment and control groups. Since the sharp null hypothesis specifies the counterfactual outcomes, the data are fixed under the null, and the only random variation in the above construction comes from the randomization protocol. The sample space Ωγ* of the uniform probability space Λγ*, on which the random treatment status vector D̃γ* is defined, is given by
$\Omega_{\gamma^*} = \mathcal{P}_{\gamma^*} \times \mathcal{W}$ | (4.8) |
where 𝒫γ* is the Cartesian product, across all cohorts c ∈ {0, …, 4}, of the sets of admissible partitions of cohort c in the initial step of the protocol, and 𝒲 is the Cartesian product of the sample spaces of all other random variables used in the randomization protocol parameterized by γ*: the outcomes Qc of the coin flips and the vectors of variables Vi,c that determine which children of working mothers are transferred from the treatment group to the control group in the last step, across all cohorts c ∈ {0, …, 4}.37 Using this notation, we establish the following proposition for any α ∈ (0, 1):
Proposition 1.
Under the model of randomization protocol in Section 3, the hypothesis test that rejects the sharp null hypothesis whenever pw ≤ α controls the Type I error rate at level α.
Proof.
Let pγ*(D) = PrΛγ*(T(D̃γ*) ≥ T(D)) for all γ* ∈ Ξ, let pw(D) = supγ*∈Ξ pγ*(D) represent the worst-case p-value, and let φ(D) = 1{pw(D) ≤ α} be the test for a given D, a realization of the random treatment status vector defined on the probability space Λγ, where γ is the true value of the model parameter. Since pγ(D) ≤ pw(D) by construction, {pw(D) ≤ α} ⊆ {pγ(D) ≤ α}, and so PrΛγ(pw(D̃γ) ≤ α) ≤ PrΛγ(pγ(D̃γ) ≤ α) ≤ α under the sharp null hypothesis. This is a trivial extension of the standard argument used to establish the finite-sample validity of randomization tests (see Lehmann and Romano, 2005). The proposition can be equivalently stated in terms of a critical value for the test statistic, as in Heckman et al. (2020).
Although it would be ideal to compute the exact value of pw, doing so is computationally infeasible. As is common practice in computing permutation and randomization p-values (see Lehmann and Romano, 2005), we resort to stochastic approximations. Even so, there are two challenges in estimating the worst-case p-value. First, approximating the tail probability PrΛγ*(T(D̃γ*) ≥ T(D)) for a given value γ* ∈ Ξ is computationally demanding. Second, estimating pw from such tail probability estimates at a finite number of points in Ξ is also challenging. We tackle these two challenges sequentially and discuss how we handle some forms of stochastic approximation error.
4.3.1. Approximating Tail Probabilities of Randomization Distributions
The first challenge is to approximate the tail probability PrΛγ*(T(D̃γ*) ≥ T(D)) for a given value γ* in the partially identified set Ξ. Our approach is to break up the sample space of Λγ* into two parts, compute the tail probability (measuring how extreme the observed test statistic is in its randomization null distribution) for each of the two parts, and then combine them using the law of total probability and Monte Carlo methods. To do so, we introduce additional notation. Let δ̲c be the lower bound of the partially identified set for the true value of the satisficing threshold δc, for c ∈ {0, …, 4}. Then, for any given value δc* ≥ δ̲c, observe that
$\mathcal{P}_{c,\gamma^*} = \mathcal{P}_c^{1} \cup \mathcal{P}_c^{2}(\delta_c^*)$ | (4.9) |
where
$\mathcal{P}_c^{1} = \left\{\omega \in \mathcal{P}_c(\infty) : H_c(\omega) < \underline{\delta}_c\right\}$ | (4.10) |
and
$\mathcal{P}_c^{2}(\delta_c^*) = \left\{\omega \in \mathcal{P}_c(\infty) : \underline{\delta}_c \leq H_c(\omega) \leq \delta_c^*\right\}$ | (4.11) |
for all c ∈ {0, …, 4}, where Hc(ω) denotes the Hotelling balance statistic of partition ω. Thus, we can use 𝒫c(∞), the set obtained with an infinite satisficing threshold so that all allowed partitions of cohort c are satisfactory, to construct 𝒫c,γ*, 𝒫c¹, and 𝒫c²(δc*). The set 𝒫c¹ contains the elements with Hotelling statistics below the lower bound δ̲c of the partially identified set for the satisficing threshold. The other set, 𝒫c²(δc*), contains the elements with Hotelling statistics between δ̲c and δc*. Let 𝒫¹ be the Cartesian product of the sets 𝒫c¹ across cohorts, and let Ω¹ = 𝒫¹ × 𝒲 and Ω²γ* = Ωγ* ∖ Ω¹. Both 𝒫c¹ and 𝒫c²(δc*) can be constructed from 𝒫c(∞) by discarding the elements in their respective complements. Since the sets 𝒫c¹ do not depend on the values δc*, the set Ω¹ remains constant. Notice that
$\Omega_{\gamma^*} = \Omega^{1} \cup \Omega^{2}_{\gamma^*}$ | (4.12) |
Let Λ¹ and Λ²γ* be the uniform probability spaces over the sample spaces Ω¹ and Ω²γ*, respectively. In addition, let
$x(\gamma^*) = \frac{|\Omega^{1}|}{|\Omega_{\gamma^*}|} = \prod_{c=0}^{4} \frac{|\mathcal{P}_c^{1}|}{|\mathcal{P}_c^{1}| + |\mathcal{P}_c^{2}(\delta_c^*)|}$ | (4.13) |
which is the proportion of elements of the sample space Ωγ* belonging to Ω¹. Note that the last equality above implies that x(γ*) can be computed simply from the sets 𝒫c¹ and 𝒫c²(δc*) constructed using 𝒫c(∞).38 Then, by the law of total probability, we have that
$\Pr\nolimits_{\Lambda_{\gamma^*}}\!\left(T(\tilde{D}_{\gamma^*}) \geq T(D)\right) = x(\gamma^*)\Pr\nolimits_{\Lambda^{1}}\!\left(T(\tilde{D}^{1}) \geq T(D)\right) + y(\gamma^*)\Pr\nolimits_{\Lambda^{2}_{\gamma^*}}\!\left(T(\tilde{D}^{2}_{\gamma^*}) \geq T(D)\right)$ | (4.14) |
where D̃¹ and D̃²γ* represent random treatment status vectors defined on the probability spaces Λ¹ and Λ²γ*, respectively, and y(γ*) = 1 − x(γ*). Since the sample spaces in the model are large, we use Monte Carlo draws from these probability spaces, obtained through rejection sampling, to stochastically approximate the tail probability in equation (4.14).39,40 This decomposition provides a feasible way to estimate the tail probability efficiently at points γ* in Ξ.
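The rejection-sampling step can be sketched abstractly: draw candidates from an easy-to-sample superset (here, plain integers standing in for candidate randomizations), keep only those an acceptance rule admits, and estimate the tail probability among accepted draws. This is a simplified sketch of the general technique, not the paper's code:

```python
import numpy as np

def rejection_sample_tail_prob(draw_candidate, accept, test_stat, t_obs,
                               n_accept=1000, seed=0):
    """Estimate Pr(T >= t_obs) under the uniform distribution on the
    accepted set, by drawing candidates and discarding rejected ones."""
    rng = np.random.default_rng(seed)
    kept = extreme = 0
    while kept < n_accept:
        cand = draw_candidate(rng)
        if accept(cand):          # e.g., the satisficing balance rule
            kept += 1
            extreme += test_stat(cand) >= t_obs
    return extreme / n_accept
```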
4.3.2. Estimating and Bounding the Worst-Case Tail Probability
The second challenge is to estimate or bound the worst-case tail probability pw. Taking the supremum of tail probabilities over points in the set Ξ may seem intractable, since Ξ is the Cartesian product of a finite set and a non-compact set.41 However, because each 𝒫c(∞) is a finite set, the tail probability is unchanged for all δc* ≥ δ̄c, where δ̄c is the largest Hotelling statistic attained on 𝒫c(∞), for all c ∈ {0, …, 4}. Let Ξ° be the compact subset of Ξ given by
$\Xi^{\circ} = \left\{\gamma^* \in \Xi : \delta_c^* \leq \bar{\delta}_c \ \text{for all}\ c \in \{0,\ldots,4\}\right\}$ | (4.15) |
It then follows that
$p_w = \sup_{\gamma^* \in \Xi^{\circ}} \Pr\nolimits_{\Lambda_{\gamma^*}}\!\left(T(\tilde{D}_{\gamma^*}) \geq T(D)\right)$ | (4.16) |
Thus, it suffices to estimate the worst-case tail probability over the set Ξ°, which is compact.42 We use stochastic approximations for this purpose as well, since it is computationally infeasible to compute a p-value at every point of Ξ° and take the maximum of those p-values. To deal with this challenge, we first write Ξ° = Ξ1 ∪ ⋯ ∪ ΞL, where Ξ1, …, ΞL are disjoint hyper-rectangles that form a partition of the set Ξ°. In our application, L = 20, and each hyper-rectangle represents the partially identified set for (δ0, …, δ4) at fixed values of (η0,1, η2, η3, η4).43 Then, note that
$p_w = \max_{l \in \{1,\ldots,L\}} p_l$ | (4.17) |
where
$p_l = \sup_{\gamma^* \in \Xi_l} \Pr\nolimits_{\Lambda_{\gamma^*}}\!\left(T(\tilde{D}_{\gamma^*}) \geq T(D)\right)$ | (4.18) |
for l ∈ {1, …, L}. We approximate pl for each l ∈ {1, …, L} using the p-values computed at S = 900 points sampled uniformly at random from the set Ξl, arranged in ascending order as pl,(1) ≤ ⋯ ≤ pl,(S).44
We estimate pl for each l ∈ {1, …, L} using the maximum order statistic pl,(S):
$\hat{p}_l = p_{l,(S)} = \max_{s \in \{1,\ldots,S\}} p_{l,s}$ | (4.19) |
which converges almost surely to pl as S → ∞. However, this estimate is subject to stochastic approximation error. One way to deal with this uncertainty in p̂l is to construct a confidence band for pl. To do so, we construct an upper bound based on de Haan’s (1981) 90% asymptotic confidence band for the true maximum using the S randomly sampled p-values. The upper confidence bound is given by
$\hat{p}_l^{\,dH} = p_{l,(S)} + c_{0.90}\left(p_{l,(S)} - p_{l,(S-1)}\right)$ | (4.20) |
where c0.90 is a factor provided by de Haan (1981) for the 90% asymptotic confidence bound.45 Thus, the interval between the estimate in (4.19) and the upper bound in (4.20) is a 90% confidence interval for pl. Finally, the true worst-case p-value pw can be approximated by the worst-case maximum (max.) p-value given by
$\hat{p}_w^{\,\max} = \max_{l \in \{1,\ldots,L\}} \hat{p}_l$ | (4.21) |
and its upper confidence bound is given by the worst-case de Haan p-value as follows:
$\hat{p}_w^{\,dH} = \max_{l \in \{1,\ldots,L\}} \hat{p}_l^{\,dH}$ | (4.22) |
which provides at least 90% coverage as S → ∞. Of course, these stochastic approximations affect the exact finite-sample validity of the resulting hypothesis tests, but the quality of the approximations can be made arbitrarily high with additional computational power. This issue is common to most resampling methods in statistics (see Lehmann and Romano, 2005).
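The order-statistic estimate in (4.19) and an upper bound of the form in (4.20) can be sketched as follows. The exact de Haan (1981) factor is not reproduced here: `c90` is a user-supplied parameter, and scaling the gap between the two largest order statistics is our assumption about the form of the bound:

```python
import numpy as np

def worst_case_p_hat(p_samples):
    """Point estimate of the supremum over a hyper-rectangle:
    the largest of the S sampled p-values, as in eq. (4.19)."""
    return float(np.max(p_samples))

def de_haan_upper_bound(p_samples, c90):
    """Upper confidence bound for the true maximum p-value built from
    the two largest order statistics; c90 is a de Haan-type factor."""
    s = np.sort(np.asarray(p_samples, dtype=float))
    return float(min(1.0, s[-1] + c90 * (s[-1] - s[-2])))
```

Taking the maximum of these quantities across the L hyper-rectangles gives the worst-case max. and de Haan p-values of (4.21) and (4.22).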
In the previous discussion, the test statistic T(·) used to compute the worst-case tail probability is left general. There is reason to suspect that the choice of test statistic matters, as Chung and Romano (2013, 2016) show for permutation tests. Wu and Ding (2020) show that using studentized test statistics in certain randomization tests controls the Type I error asymptotically under certain weak null hypotheses while preserving finite-sample validity under sharp null hypotheses. Their theory ignores covariates and is limited to completely randomized factorial experiments and stratified or clustered experiments. However, they conjecture that “the strategy [of using studentized test statistics to make randomization tests asymptotically robust under weak null hypotheses while retaining their finite-sample validity under sharp null hypotheses may also be] applicable for experiments with general treatment assignment mechanisms” (Wu and Ding, 2020). While we do not attempt to prove or disprove their conjecture in the Perry experimental setting, we take it seriously given their results for certain randomization tests along with Chung and Romano’s (2013, 2016) results for permutation tests. Thus, we provide worst-case p-values using both nonstudentized and studentized test statistics.
4.3.3. Multiple Testing
Since the worst-case test controls the Type I error rate at level α under the sharp null hypothesis for any α ∈ (0, 1) by Proposition 1, Holm (1979) tests of multiple hypotheses based on the worst-case p-values also have finite-sample validity. Multiplicity-adjusted p-values can be computed as follows. Let ρ(1), …, ρ(K) be the single worst-case p-values arranged in ascending order. Then, the Holm stepdown p-values adjusted for multiple testing are given by ρ(k)adj = maxj≤k min{(K − j + 1)ρ(j), 1} for k ∈ {1, …, K}. These adjusted p-values can be even more conservative because they assume the least favorable dependence structure among the single worst-case p-values (Romano et al., 2010), making this the “worst case” of the “worst case.” Slightly less conservative multiple hypothesis tests are available in the literature (see Romano and Wolf, 2005; Romano and Shaikh, 2010). Since it is unclear how much power they would gain relative to Holm tests in our context, we do not pursue those more computationally involved stepdown procedures in this paper.
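The Holm adjustment described above is straightforward to implement; a minimal sketch:

```python
import numpy as np

def holm_adjusted(p_values):
    """Holm (1979) stepdown multiplicity-adjusted p-values, returned in
    the original order of the K hypotheses: the k-th smallest p-value is
    multiplied by (K - k + 1), capped at 1, with monotonicity enforced."""
    p = np.asarray(p_values, dtype=float)
    K = p.size
    order = np.argsort(p)
    adj = np.minimum(1.0, (K - np.arange(K)) * p[order])
    adj = np.maximum.accumulate(adj)  # adjusted p-values must be nondecreasing
    out = np.empty(K)
    out[order] = adj
    return out
```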
4.4. Comparing Methods for Inference with Imperfect Randomization
Our approach complements that of Heckman et al. (2020), who improve on the methodology of Heckman et al. (2010a) by (i) exploiting a symmetry generated by the Perry randomization protocol, (ii) using finite-sample inference that accounts for imperfect randomization, and (iii) making transfers in the fifth step of the randomization protocol depend on a binary variable indicating whether the mother is available for home visits given the supporting program infrastructure; this variable is only partially observed in their model. We also exploit the symmetry: Qc represents the result of a fair coin flip that determines which of the two initially undesignated groups becomes the intended treatment group. However, we model other features of the protocol differently.
Heckman et al. (2020) model the reassignment of children of some working women by introducing a partially observed binary variable Ui that equals 1 if the mother of participant i was unavailable for home visits and 0 otherwise. It is known only for children of non-working mothers, for whom Ui = 0, and for the children of working mothers in the final treatment group, who also have Ui = 0. For children of working mothers in the control group, Ui is not known and could be either 0 or 1. To deal with this difficulty, Heckman et al. (2020) construct two permutation tests. The first test sets Ui to 0 for all children of working mothers in the final control group and conducts a generalized permutation test accordingly. The second test: (i) samples a vector of Ui from the space of possibilities for Ui; (ii) conducts a generalized permutation test given the sampled vector of Ui and obtains the corresponding permutation p-value; and (iii) repeats steps (i) and (ii) until the space of possibilities is exhausted. It then takes the maximum p-value among the computed p-values. Our worst-case inferential methods are similar in spirit. However, there are three key differences between our approach and theirs.
First, Heckman et al. (2020) interpret Ui as a fixed trait of mothers regardless of the (random) circumstances facing program administrators. However, whether or not a working mother and her child are visited at home (through special arrangements, e.g., on a weekend) depends, at least in part, on the availability and capacity constraints of the Perry staff. While Ui = 0 for non-working mothers in both papers, we do not view Ui as a fixed binary trait of working mothers. Consistent with our review of the randomization protocol, we assume that children of working mothers are able to participate in the program if special arrangements, such as weekend home visits, are made for them. In our model, there are capacity constraints for making special arrangements, so only a limited number of slots are available.46 In their model, if Ui = 1 for a working mother, her child would always be placed in the control group, because she would not accept any special accommodations even if provided by the Perry staff. Unlike the Vi,c variable that determines post-randomization transfers in our model, the Ui characteristic in their model is allowed to be related to potential outcomes, but this is a consequence of its interpretation as a fixed trait of mothers independent of the capacities of program administrators.
Second, their procedure assumes that “some participants were exchanged between the treatment and control groups in order to ‘balance’ gender and socio-economic status score while keeping Stanford-Binet IQ score roughly constant.”47 However, as shown in Appendix B, Perry data from wave 4 reveal that the exchanges were not necessarily between consecutively ranked IQ pairs. Our approach accommodates this feature while also making more explicit the notion of balance.
Third, on a more minor note, we incorporate the five children (out of the original 128) who dropped out of the study due to extraneous reasons, since those five children were also a part of the initial randomization protocol. Our approach can also more readily be applied than that of Heckman et al. (2020) to a variety of compromised experiments, including many discussed by Bruhn and McKenzie (2009). We next demonstrate that there are important differences between inferences obtained from our procedure and theirs.
5. REANALYSIS OF HECKMAN ET AL. (2020)
This section uses the methods developed in this paper to reconsider the conclusions reached by Heckman et al. (2020) for the Perry participants through age 40. In a companion paper (Heckman, 2020), we apply these methods to analyze fresh samples of participants through age 55 and their adult children. Analyzing the new wave of data in this paper would raise a variety of new issues about sampling procedures better left for another occasion.
We first list our estimators of treatment effects. Using the corresponding test statistics, we then apply our worst-case inferential methods to reanalyze the results in Heckman et al. (2020).
5.1. Estimators and Test Statistics for Hypothesis Testing
A variety of test statistics and estimators can be used in our approach and in that of Heckman et al. (2020). Our empirical work focuses on conventional ones often used in practice. Let Di represent the treatment status of participant i, and let Zi be the vector of baseline variables.48 In addition, let Yi denote the observed outcome of interest of participant i in a relevant subsample containing ns participants, and let Yd,i be the counterfactual outcome of participant i when his or her treatment status Di is fixed at d ∈ {0, 1}. In switching regression notation (Quandt, 1958, 1972),
$Y_i = D_i Y_{1,i} + (1 - D_i) Y_{0,i}.$ | (5.23) |
The average treatment effect in the subsample is given by
$\Delta = \frac{1}{n_s}\sum_{i=1}^{n_s}\left(Y_{1,i} - Y_{0,i}\right)$ | (5.24) |
and is conventionally estimated by a difference-in-means (DIM) estimator that takes raw mean differences between non-attrited treated and control observations. However, the randomization procedures used in Perry and other similar experiments only justify conditional independence: (Y1,i, Y0,i) ⊥ Di | Zi. Exploiting this property and controlling for Zi in a regression of Yi on Di and Zi using complete-case observations, we obtain the ordinary least squares (OLS) estimator.49 It would be desirable to relax linearity, but the Perry sample size makes this impractical.
All of these estimators assume that non-response is determined at random, or at random conditional on observed covariates. Let Ri be an indicator of whether Yi is observed. It could depend on the treatment status Di and the pre-program covariates Zi. The augmented inverse probability weighting (AIPW) estimator allows for this possibility by using the weaker assumption that Yd,i ⊥ Ri | (Di, Zi) for d ∈ {0, 1}. The AIPW estimator of the treatment effect is
$\hat{\Delta}_{AIPW} = \frac{1}{n_s}\sum_{i=1}^{n_s}\left(\hat{Y}_{1,i} - \hat{Y}_{0,i}\right)$ | (5.25) |
where
$\hat{Y}_{d,i} = \hat{\mu}_d(Z_i) + \frac{\mathbf{1}\{D_i = d\}\,R_i\left(Y_i - \hat{\mu}_d(Z_i)\right)}{\hat{\pi}_d(Z_i)\,\hat{\rho}_d(Z_i)}$ | (5.26) |
In this expression, μ̂d(Zi) is the ordinary least squares (projection) estimator of the conditional expectation E[Yd,i | Zi] for d ∈ {0, 1}; π̂d(Zi) is an estimator of Pr(Di = d | Zi), the i-th participant’s propensity of being in experimental state d; and ρ̂d(Zi) is an estimator of Pr(Ri = 1 | Di = d, Zi), the propensity of having a non-missing outcome after fixing the treatment status Di at d, for d ∈ {0, 1}. Propensity scores are often estimated using logits. The AIPW estimator adjusts the outcome based on pre-program covariates and corrects for covariate imbalance and various forms of non-response.50 It is asymptotically unbiased and has a double-robustness property: the estimator is robust to misspecification of either the propensity score models or the models for counterfactual outcomes, but not both.51 For this reason, the AIPW estimator is sometimes preferred over the DIM and OLS estimators.52 We use the studentized version of the AIPW estimate as our main test statistic in our empirical analysis.53
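A minimal sketch of the attrition-adjusted AIPW computation in equations (5.25) and (5.26), taking the fitted nuisance functions as given (their estimation by least squares projections and logits, as described above, is omitted); the argument names are our own:

```python
import numpy as np

def aipw_estimate(y, d, r, mu1, mu0, e1, q1, q0):
    """AIPW treatment-effect estimate with attrition adjustment.

    y: observed outcome (arbitrary where r == 0); d: treatment indicator;
    r: response indicator (1 if y observed); mu1, mu0: fitted outcome
    regressions E[Y_d | Z]; e1: fitted Pr(D = 1 | Z); q1, q0: fitted
    response propensities Pr(R = 1 | D = d, Z), all evaluated at each Z_i.
    """
    y = np.where(r == 1, y, 0.0)  # missing outcomes enter only terms multiplied by r
    y1_hat = mu1 + d * r * (y - mu1) / (e1 * q1)
    y0_hat = mu0 + (1 - d) * r * (y - mu0) / ((1 - e1) * q0)
    return float(np.mean(y1_hat - y0_hat))
```

When the outcome regressions are exactly right, the correction terms vanish and the estimate reduces to the mean of the fitted individual effects, which is one way to see the double-robustness property.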
We could use a local average treatment effect (LATE) estimator, and other standard estimation methods dealing with imperfect compliance, if we knew each observation’s initial treatment status. However, in the Perry example, we do not know which members were transferred from the initial treatment group to the control group in the last step of the randomization protocol. Given this problem, we do not present LATE estimates.54
5.2. Empirical Analysis
We first conduct a head-to-head comparison of Heckman et al.’s (2020) methods and ours using the same outcomes they analyze. Additionally, to compare the impact of using mean differences versus AIPW test statistics in the conventional inferential approaches and our design-based worst-case inference, we extend the outcomes they study and analyze data on violent crime.
Tables 2 and 3 below report our reanalyses of Heckman et al. (2020), analyzing each outcome one at a time using the doubly robust attrition-adjusted AIPW estimator. Tables 4 and 5 provide stepdown p-values for the outcomes based on multiple testing. Extended versions of these tables are presented in Appendices E through H using alternative test statistics for inference.55
Table 2.
Reanalysis of Male Outcomes in Heckman et al. (2020) Using Single Tests
| Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | 0.0000 | 0.0000 | 0.0004 | 0.0049 | 0.0056 |
| Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | 0.0000 | 0.0002 | 0.0004 | 0.0071 | 0.0071 |
| Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | 0.0557 | 0.0512 | 0.0712 | 0.0872 | 0.1229 |
| Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.2002 | 0.2198 |
| Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | 0.0512 | 0.0719 | 0.0556 | 0.1461 | 0.2396 |
| Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | 0.0398 | 0.0577 | 0.0472 | 0.1289 | 0.1457 |
| Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | 0.0225 | 0.0412 | 0.0292 | 0.0707 | 0.1022 |
|
| |||||||||
| CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.2957 | 0.3221 | 0.3112 | 0.3488 | 0.3823 |
| CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.2410 | 0.2629 | 0.2608 | 0.2909 | 0.3216 |
| CAT language score | 14 | 6.536 | 14.333 | 5.029 | 0.0815 | 0.0995 | 0.1076 | 0.1771 | 0.2098 |
| CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | 0.0538 | 0.0638 | 0.0712 | 0.1333 | 0.1467 |
| CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.2652 | 0.2865 | 0.2600 | 0.2734 | 0.3016 |
|
| |||||||||
| High school graduate | 19 | 0.513 | 0.485 | 0.015 | 0.4550 | 0.4540 | 0.4868 | 0.5607 | 0.6036 |
| Vocational training | 40 | 0.333 | 0.394 | 0.071 | 0.2762 | 0.2886 | 0.2932 | 0.3984 | 0.4353 |
| Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 0.3902 | 0.3901 | 0.4240 | 0.4589 | 0.6118 |
| Grade point average | 19 | 1.794 | 1.814 | −0.035 | 0.4366 | 0.4336 | 0.4328 | 0.5414 | 0.6563 |
|
| |||||||||
| Total non-juvenile arrests | 40 | 11.718 | 7.455 | −3.895 | 0.0461 | 0.0368 | 0.0668 | 0.1019 | 0.1795 |
| Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1376 | 0.1361 | 0.1764 | 0.2060 | 0.2329 |
| Total charges | 40 | 13.385 | 9.000 | −4.132 | 0.0678 | 0.0579 | 0.0920 | 0.1242 | 0.1789 |
| Non-victimless charges | 40 | 3.077 | 1.485 | −1.444 | 0.0274 | 0.0238 | 0.0372 | 0.0755 | 0.1332 |
|
| |||||||||
| Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.1263 | 0.1315 | 0.1292 | 0.2989 | 0.3351 |
| Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.1817 | 0.1827 | 0.2148 | 0.3050 | 0.4203 |
| Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.2572 | 0.2501 | 0.2928 | 0.3371 | 0.4217 |
|
| |||||||||
| Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.2156 | 0.2259 | 0.2452 | 0.3348 | 0.3703 |
| Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.2238 | 0.2190 | 0.2388 | 0.3460 | 0.3776 |
| Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | 0.0438 | 0.0430 | 0.0588 | 0.1193 | 0.2030 |
|
| |||||||||
| Currently employed | 40 | 0.500 | 0.700 | 0.266 | 0.0089 | 0.0075 | 0.0204 | 0.0444 | 0.0640 |
| Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | 0.0843 | 0.0957 | 0.0912 | 0.1629 | 0.1671 |
| Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | 0.0154 | 0.0200 | 0.0188 | 0.0632 | 0.0722 |
Note: This table reports p-values for single hypothesis tests of treatment effects on various outcomes of male participants at the given ages. The inferences are based on the studentized AIPW test statistic.
Table 3.
Reanalysis of Female Outcomes in Heckman et al. (2020) Using Single Tests
| Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 83.692 | 96.360 | 13.425 | 0.0000 | 0.0000 | 0.0004 | 0.0034 | 0.0040 |
| Stanford–Binet IQ | 5 | 81.650 | 94.316 | 14.157 | 0.0008 | 0.0006 | 0.0064 | 0.0273 | 0.0382 |
| Stanford–Binet IQ | 6 | 87.160 | 90.913 | 5.271 | 0.0365 | 0.0281 | 0.0636 | 0.0820 | 0.0959 |
| Stanford–Binet IQ | 7 | 86.000 | 92.520 | 7.347 | 0.0313 | 0.0154 | 0.0564 | 0.0858 | 0.1232 |
| Stanford–Binet IQ | 8 | 83.600 | 87.840 | 4.669 | 0.1144 | 0.0896 | 0.1704 | 0.2040 | 0.2141 |
| Stanford–Binet IQ | 9 | 83.043 | 86.739 | 4.809 | 0.0633 | 0.0679 | 0.1128 | 0.1992 | 0.2628 |
| Stanford–Binet IQ | 10 | 81.789 | 86.750 | 6.480 | 0.0277 | 0.0323 | 0.0596 | 0.1600 | 0.1976 |
|
| |||||||||
| CAT reading score | 14 | 8.444 | 16.500 | 7.345 | 0.0130 | 0.0128 | 0.0268 | 0.0573 | 0.0935 |
| CAT arithmetic score | 14 | 6.889 | 11.818 | 6.227 | 0.0102 | 0.0138 | 0.0284 | 0.0544 | 0.0710 |
| CAT language score | 14 | 7.833 | 19.455 | 11.923 | 0.0009 | 0.0013 | 0.0044 | 0.0178 | 0.0232 |
| CAT mechanics score | 14 | 8.833 | 20.636 | 12.425 | 0.0014 | 0.0015 | 0.0072 | 0.0208 | 0.0269 |
| CAT spelling score | 14 | 10.722 | 29.500 | 18.270 | 0.0017 | 0.0042 | 0.0064 | 0.0180 | 0.0253 |
|
| |||||||||
| High school graduate | 19 | 0.231 | 0.840 | 0.570 | 0.0000 | 0.0000 | 0.0004 | 0.0051 | 0.0075 |
| Vocational training | 40 | 0.077 | 0.240 | 0.183 | 0.0286 | 0.0494 | 0.0420 | 0.1165 | 0.2231 |
| Highest grade completed | 19 | 10.750 | 11.760 | 1.202 | 0.0023 | 0.0106 | 0.0120 | 0.0284 | 0.0500 |
| Grade point average | 19 | 1.527 | 2.415 | 0.958 | 0.0000 | 0.0155 | 0.0004 | 0.0112 | 0.0381 |
|
| |||||||||
| Total non-juvenile arrests | 40 | 4.423 | 2.160 | −1.938 | 0.0657 | 0.0795 | 0.0880 | 0.1566 | 0.1925 |
| Total crime cost | 40 | 293.497 | 22.165 | −246.242 | 0.1475 | 0.1227 | 0.2436 | 0.2615 | 0.3036 |
| Total charges | 40 | 4.923 | 2.240 | −2.309 | 0.0439 | 0.0528 | 0.0580 | 0.1366 | 0.1407 |
| Non-victimless charges | 40 | 0.308 | 0.040 | −0.249 | 0.0365 | 0.0263 | 0.0612 | 0.0853 | 0.1201 |
|
| |||||||||
| Currently employed | 19 | 0.154 | 0.440 | 0.297 | 0.0054 | 0.0048 | 0.0152 | 0.0585 | 0.0619 |
| Unemployed last year | 19 | 0.577 | 0.240 | −0.354 | 0.0029 | 0.0033 | 0.0104 | 0.0313 | 0.0377 |
| Jobless months (past 2 yrs) | 19 | 10.421 | 5.217 | −4.197 | 0.0723 | 0.1386 | 0.1140 | 0.2020 | 0.2780 |
|
| |||||||||
| Currently employed | 27 | 0.545 | 0.800 | 0.215 | 0.0471 | 0.0604 | 0.0648 | 0.0960 | 0.1237 |
| Unemployed last year | 27 | 0.542 | 0.250 | −0.269 | 0.0523 | 0.0457 | 0.0728 | 0.1663 | 0.2242 |
| Jobless months (past 2 yrs) | 27 | 10.455 | 6.240 | −1.298 | 0.3328 | 0.3449 | 0.2916 | 0.4526 | 0.7373 |
|
| |||||||||
| Currently employed | 40 | 0.818 | 0.833 | −0.016 | 0.4536 | 0.4586 | 0.4912 | 0.6550 | 0.6802 |
| Unemployed last year | 40 | 0.409 | 0.160 | −0.194 | 0.0807 | 0.1079 | 0.1324 | 0.1934 | 0.2386 |
| Jobless months (past 2 yrs) | 40 | 5.045 | 4.000 | 0.057 | 0.4910 | 0.4927 | 0.4700 | 0.6114 | 0.6379 |
Note: This table reports p-values for single hypothesis tests of treatment effects on various outcomes of female participants at the given ages. The inferences are based on the studentized AIPW test statistic.
Table 4.
Reanalysis of Male Outcomes in Heckman et al. (2020) Using Stepdown Tests
| Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 83.077 | 94.909 | 8.988 | 0.0001 | 0.0002 | 0.0028 | 0.0346 | 0.0394 |
| Stanford–Binet IQ | 5 | 84.793 | 95.400 | 9.167 | 0.0003 | 0.0012 | 0.0028 | 0.0425 | 0.0425 |
| Stanford–Binet IQ | 6 | 85.821 | 91.485 | 3.056 | 0.1593 | 0.2058 | 0.1888 | 0.3534 | 0.5112 |
| Stanford–Binet IQ | 7 | 87.711 | 91.121 | 1.576 | 0.2040 | 0.2143 | 0.2104 | 0.3866 | 0.5112 |
| Stanford–Binet IQ | 8 | 89.054 | 88.333 | −3.829 | 0.1593 | 0.2058 | 0.1888 | 0.3866 | 0.5112 |
| Stanford–Binet IQ | 9 | 89.026 | 88.394 | −4.167 | 0.1593 | 0.2058 | 0.1888 | 0.3866 | 0.5112 |
| Stanford–Binet IQ | 10 | 86.026 | 83.697 | −4.722 | 0.1126 | 0.2058 | 0.1460 | 0.3534 | 0.5112 |
|
| |||||||||
| CAT reading score | 14 | 9.000 | 13.926 | 1.815 | 0.7229 | 0.7886 | 0.7800 | 0.8202 | 0.9047 |
| CAT arithmetic score | 14 | 8.107 | 16.000 | 3.095 | 0.7229 | 0.7886 | 0.7800 | 0.8202 | 0.9047 |
| CAT language score | 14 | 6.536 | 14.333 | 5.029 | 0.3260 | 0.3980 | 0.4304 | 0.7084 | 0.8390 |
| CAT mechanics score | 14 | 6.964 | 15.556 | 5.979 | 0.2690 | 0.3190 | 0.3560 | 0.6667 | 0.7335 |
| CAT spelling score | 14 | 11.536 | 18.519 | 3.171 | 0.7229 | 0.7886 | 0.7800 | 0.8202 | 0.9047 |
|
| |||||||||
| High school graduate | 19 | 0.513 | 0.485 | 0.015 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Vocational training | 40 | 0.333 | 0.394 | 0.071 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Highest grade completed | 19 | 11.282 | 11.364 | 0.087 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Grade point average | 19 | 1.794 | 1.814 | −0.035 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
|
| |||||||||
| Total non-juvenile arrests | 40 | 11.718 | 7.455 | −3.895 | 0.1384 | 0.1103 | 0.2004 | 0.3057 | 0.5366 |
| Total crime cost | 40 | 775.901 | 424.679 | −313.263 | 0.1384 | 0.1361 | 0.2004 | 0.3057 | 0.5366 |
| Total charges | 40 | 13.385 | 9.000 | −4.132 | 0.1384 | 0.1158 | 0.2004 | 0.3057 | 0.5366 |
| Non-victimless charges | 40 | 3.077 | 1.485 | −1.444 | 0.1096 | 0.0952 | 0.1488 | 0.3020 | 0.5327 |
| | | | | | | | | | |
| Currently employed | 19 | 0.410 | 0.545 | 0.147 | 0.3790 | 0.3946 | 0.3876 | 0.8968 | 1.0000 |
| Unemployed last year | 19 | 0.128 | 0.242 | 0.102 | 0.3790 | 0.3946 | 0.4296 | 0.8968 | 1.0000 |
| Jobless months (past 2 yrs) | 19 | 3.816 | 5.281 | 1.367 | 0.3790 | 0.3946 | 0.4296 | 0.8968 | 1.0000 |
| | | | | | | | | | |
| Currently employed | 27 | 0.564 | 0.600 | 0.089 | 0.4313 | 0.4380 | 0.4776 | 0.6697 | 0.7405 |
| Unemployed last year | 27 | 0.308 | 0.242 | −0.081 | 0.4313 | 0.4380 | 0.4776 | 0.6697 | 0.7405 |
| Jobless months (past 2 yrs) | 27 | 8.795 | 5.133 | −3.868 | 0.1313 | 0.1290 | 0.1764 | 0.3580 | 0.6089 |
| | | | | | | | | | |
| Currently employed | 40 | 0.500 | 0.700 | 0.266 | 0.0268 | 0.0225 | 0.0564 | 0.1333 | 0.1920 |
| Unemployed last year | 40 | 0.462 | 0.364 | −0.143 | 0.0843 | 0.0957 | 0.0912 | 0.1629 | 0.1920 |
| Jobless months (past 2 yrs) | 40 | 10.750 | 7.233 | −4.758 | 0.0309 | 0.0399 | 0.0564 | 0.1333 | 0.1920 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on various outcomes of male participants at the given ages. The inferences are based on the studentized AIPW test statistic. The blocks used for multiple testing are indicated above using divider lines.
Table 5.
Reanalysis of Female Outcomes in Heckman et al. (2020) Using Stepdown Tests
| Variable | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 83.692 | 96.360 | 13.425 | 0.0000 | 0.0000 | 0.0028 | 0.0239 | 0.0278 |
| Stanford–Binet IQ | 5 | 81.650 | 94.316 | 14.157 | 0.0046 | 0.0035 | 0.0384 | 0.1637 | 0.2290 |
| Stanford–Binet IQ | 6 | 87.160 | 90.913 | 5.271 | 0.1387 | 0.1125 | 0.2820 | 0.4099 | 0.4796 |
| Stanford–Binet IQ | 7 | 86.000 | 92.520 | 7.347 | 0.1387 | 0.0771 | 0.2820 | 0.4099 | 0.4929 |
| Stanford–Binet IQ | 8 | 83.600 | 87.840 | 4.669 | 0.1387 | 0.1359 | 0.2820 | 0.4801 | 0.5928 |
| Stanford–Binet IQ | 9 | 83.043 | 86.739 | 4.809 | 0.1387 | 0.1359 | 0.2820 | 0.4801 | 0.5928 |
| Stanford–Binet IQ | 10 | 81.789 | 86.750 | 6.480 | 0.1387 | 0.1125 | 0.2820 | 0.4801 | 0.5928 |
| | | | | | | | | | |
| CAT reading score | 14 | 8.444 | 16.500 | 7.345 | 0.0205 | 0.0255 | 0.0536 | 0.1088 | 0.1421 |
| CAT arithmetic score | 14 | 6.889 | 11.818 | 6.227 | 0.0205 | 0.0255 | 0.0536 | 0.1088 | 0.1421 |
| CAT language score | 14 | 7.833 | 19.455 | 11.923 | 0.0043 | 0.0064 | 0.0220 | 0.0889 | 0.1162 |
| CAT mechanics score | 14 | 8.833 | 20.636 | 12.425 | 0.0056 | 0.0064 | 0.0256 | 0.0889 | 0.1162 |
| CAT spelling score | 14 | 10.722 | 29.500 | 18.270 | 0.0056 | 0.0127 | 0.0256 | 0.0889 | 0.1162 |
| | | | | | | | | | |
| High school graduate | 19 | 0.231 | 0.840 | 0.570 | 0.0000 | 0.0000 | 0.0016 | 0.0202 | 0.0299 |
| Vocational training | 40 | 0.077 | 0.240 | 0.183 | 0.0286 | 0.0494 | 0.0420 | 0.1165 | 0.2231 |
| Highest grade completed | 19 | 10.750 | 11.760 | 1.202 | 0.0046 | 0.0318 | 0.0240 | 0.0567 | 0.1142 |
| Grade point average | 19 | 1.527 | 2.415 | 0.958 | 0.0000 | 0.0318 | 0.0016 | 0.0336 | 0.1142 |
| | | | | | | | | | |
| Total non-juvenile arrests | 40 | 4.423 | 2.160 | −1.938 | 0.1461 | 0.1589 | 0.2320 | 0.4098 | 0.4803 |
| Total crime cost | 40 | 293.497 | 22.165 | −246.242 | 0.1475 | 0.1589 | 0.2436 | 0.4098 | 0.4803 |
| Total charges | 40 | 4.923 | 2.240 | −2.309 | 0.1461 | 0.1585 | 0.2320 | 0.4098 | 0.4803 |
| Non-victimless charges | 40 | 0.308 | 0.040 | −0.249 | 0.1461 | 0.1051 | 0.2320 | 0.3413 | 0.4803 |
| | | | | | | | | | |
| Currently employed | 19 | 0.154 | 0.440 | 0.297 | 0.0107 | 0.0099 | 0.0312 | 0.1171 | 0.1237 |
| Unemployed last year | 19 | 0.577 | 0.240 | −0.354 | 0.0088 | 0.0099 | 0.0312 | 0.0939 | 0.1132 |
| Jobless months (past 2 yrs) | 19 | 10.421 | 5.217 | −4.197 | 0.0723 | 0.1386 | 0.1140 | 0.2020 | 0.2780 |
| | | | | | | | | | |
| Currently employed | 27 | 0.545 | 0.800 | 0.215 | 0.1412 | 0.1371 | 0.1944 | 0.2879 | 0.3712 |
| Unemployed last year | 27 | 0.542 | 0.250 | −0.269 | 0.1412 | 0.1371 | 0.1944 | 0.3325 | 0.4485 |
| Jobless months (past 2 yrs) | 27 | 10.455 | 6.240 | −1.298 | 0.3328 | 0.3449 | 0.2916 | 0.4526 | 0.7373 |
| | | | | | | | | | |
| Currently employed | 40 | 0.818 | 0.833 | −0.016 | 0.9072 | 0.9173 | 0.9400 | 1.0000 | 1.0000 |
| Unemployed last year | 40 | 0.409 | 0.160 | −0.194 | 0.2421 | 0.3237 | 0.3972 | 0.5803 | 0.7157 |
| Jobless months (past 2 yrs) | 40 | 5.045 | 4.000 | 0.057 | 0.9072 | 0.9173 | 0.9400 | 1.0000 | 1.0000 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on various outcomes of female participants at the given ages. The inferences are based on the studentized AIPW test statistic. The blocks used for multiple testing are indicated above using divider lines.
In Tables 6 and 7, we reproduce Heckman et al.’s (2020) results and provide a side-by-side comparison of their inferences with our own. The most stringent (max-U) single p-values they report for the effects on the California Achievement Test (CAT) reading, arithmetic, language, mechanics, and spelling scores at age 14 in the male sample using the studentized DIM test statistic are 0.036, 0.086, 0.012, 0.023, and 0.012, respectively, which are lower than the asymptotic p-values we report in Table 2. After adjusting for multiple testing, their adjusted max-U p-values are no more than 0.086, based on which they conclude that these effects are statistically significant. In contrast, using our approach, the worst-case maximum (single) p-values based on the studentized DIM test statistic are 0.171, 0.119, 0.075, 0.054, and 0.123, respectively. As shown in our Table 2, using the studentized AIPW test statistic, our worst-case maximum p-values are 0.349, 0.291, 0.177, 0.133, and 0.273, respectively,56 implying that the effects on the CAT scores for males are not statistically significant. Of course, the stepdown p-values for these outcomes shown in Table 4 are also insignificant. Our inference for the female sample is qualitatively similar to theirs. As shown in Table 3, the block of CAT scores for females is statistically significant at the 10% level. However, the multiplicity-adjusted stepdown worst-case de Haan p-values in Table 5 are 0.116 or larger.
Table 6.
Comparing Heckman et al.’s (2020) DIM-Based Inference with Ours for Male Sample
| Variable | Age | Heckman et al.: U = 0 p-value (unadj.) | Heckman et al.: U = 0 p-value (adj.) | Heckman et al.: max-U p-value (unadj.) | Heckman et al.: max-U p-value (adj.) | Ours: worst-case max. p (unadj.) | Ours: worst-case max. p (adj.) | Ours: worst-case de Haan p (unadj.) | Ours: worst-case de Haan p (adj.) |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 0.001 | 0.001 | 0.008 | 0.008 | 0.0048 | 0.0333 | 0.0053 | 0.0368 |
| Stanford–Binet IQ | 5 | 0.022 | 0.691 | 0.077 | 0.800 | 0.0076 | 0.0456 | 0.0095 | 0.0572 |
| Stanford–Binet IQ | 6 | 0.033 | 0.034 | 0.094 | 0.102 | 0.0249 | 0.1247 | 0.0306 | 0.1531 |
| Stanford–Binet IQ | 7 | 0.103 | 0.172 | 0.247 | 0.374 | 0.0782 | 0.3128 | 0.1185 | 0.4741 |
| Stanford–Binet IQ | 8 | 0.599 | 0.691 | 0.733 | 0.800 | 0.5480 | 1.0000 | 0.5743 | 1.0000 |
| Stanford–Binet IQ | 9 | 0.450 | 0.548 | 0.631 | 0.680 | 0.5580 | 1.0000 | 0.6044 | 1.0000 |
| Stanford–Binet IQ | 10 | 0.684 | 0.691 | 0.790 | 0.800 | 0.2568 | 0.7704 | 0.3509 | 1.0000 |
| | | | | | | | | | |
| CAT reading score | 14 | 0.017 | 0.035 | 0.036 | 0.086 | 0.1711 | 0.3572 | 0.1776 | 0.5135 |
| CAT arithmetic score | 14 | 0.032 | 0.035 | 0.086 | 0.086 | 0.1191 | 0.3572 | 0.1555 | 0.5135 |
| CAT language score | 14 | 0.001 | 0.004 | 0.012 | 0.027 | 0.0748 | 0.2994 | 0.1284 | 0.5135 |
| CAT mechanics score | 14 | 0.006 | 0.007 | 0.023 | 0.035 | 0.0535 | 0.2673 | 0.0681 | 0.3407 |
| CAT spelling score | 14 | 0.003 | 0.035 | 0.012 | 0.086 | 0.1225 | 0.3572 | 0.1503 | 0.5135 |
| | | | | | | | | | |
| High school graduate | 19 | 0.614 | 0.674 | 0.704 | 0.716 | 0.6322 | 1.0000 | 0.6718 | 1.0000 |
| Vocational training | 40 | 0.341 | 0.567 | 0.547 | 0.608 | 0.3674 | 1.0000 | 0.4281 | 1.0000 |
| Highest grade completed | 19 | 0.383 | 0.622 | 0.410 | 0.669 | 0.3659 | 1.0000 | 0.4031 | 1.0000 |
| Grade point average | 19 | 0.457 | 0.674 | 0.567 | 0.716 | 0.5133 | 1.0000 | 0.5493 | 1.0000 |
| | | | | | | | | | |
| Total non-juvenile arrests | 40 | 0.036 | 0.038 | 0.100 | 0.115 | 0.0701 | 0.2493 | 0.0896 | 0.3249 |
| Total crime cost | 40 | 0.037 | 0.049 | 0.042 | 0.143 | 0.1695 | 0.2493 | 0.2133 | 0.3249 |
| Total charges | 40 | 0.049 | 0.049 | 0.143 | 0.143 | 0.0946 | 0.2493 | 0.1038 | 0.3249 |
| Non-victimless charges | 40 | 0.025 | 0.037 | 0.063 | 0.091 | 0.0623 | 0.2493 | 0.0812 | 0.3249 |
| | | | | | | | | | |
| Currently employed | 19 | 0.050 | 0.164 | 0.224 | 0.290 | 0.2809 | 0.7459 | 0.3864 | 1.0000 |
| Unemployed last year | 19 | 0.901 | 0.901 | 0.922 | 0.922 | 0.2486 | 0.7459 | 0.3673 | 1.0000 |
| Jobless months (past 2 yrs) | 19 | 0.821 | 0.849 | 0.873 | 0.890 | 0.3291 | 0.7459 | 0.4019 | 1.0000 |
| | | | | | | | | | |
| Currently employed | 27 | 0.268 | 0.295 | 0.485 | 0.512 | 0.3629 | 0.6294 | 0.4130 | 0.7278 |
| Unemployed last year | 27 | 0.235 | 0.295 | 0.360 | 0.512 | 0.3147 | 0.6294 | 0.3639 | 0.7278 |
| Jobless months (past 2 yrs) | 27 | 0.020 | 0.020 | 0.036 | 0.051 | 0.0965 | 0.2894 | 0.1265 | 0.3795 |
| | | | | | | | | | |
| Currently employed | 40 | 0.103 | 0.116 | 0.130 | 0.146 | 0.0566 | 0.1697 | 0.0853 | 0.2559 |
| Unemployed last year | 40 | 0.154 | 0.154 | 0.216 | 0.216 | 0.1684 | 0.1697 | 0.2494 | 0.2559 |
| Jobless months (past 2 yrs) | 40 | 0.064 | 0.116 | 0.070 | 0.146 | 0.0693 | 0.1697 | 0.0957 | 0.2559 |
Note: This table compares inferences reported by Heckman et al. (2020) with the inferences obtained using our worst-case tests. The first two columns list the blocks of outcomes analyzed by Heckman et al. (2020). The next four columns reproduce their zero-U (U = 0) p-values and max-U p-values before and after adjusting for multiplicity of hypotheses. Since all of their tests are based on the studentized DIM test statistic, we report our inferences using the same statistic side by side for comparison. The last four columns report our worst-case maximum p-values and worst-case de Haan p-values before and after adjusting for multiplicity of hypotheses. The unadjusted p-values are single p-values that are unadjusted for multiplicity of hypotheses; the adjusted p-values are stepdown p-values after adjusting for multiple testing.
Table 7.
Comparing Heckman et al.’s (2020) DIM-Based Inference with Ours for Female Sample
| Variable | Age | Heckman et al.: U = 0 p-value (unadj.) | Heckman et al.: U = 0 p-value (adj.) | Heckman et al.: max-U p-value (unadj.) | Heckman et al.: max-U p-value (adj.) | Ours: worst-case max. p (unadj.) | Ours: worst-case max. p (adj.) | Ours: worst-case de Haan p (unadj.) | Ours: worst-case de Haan p (adj.) |
|---|---|---|---|---|---|---|---|---|---|
| Stanford–Binet IQ | 4 | 0.008 | 0.008 | 0.020 | 0.020 | 0.0052 | 0.0362 | 0.0066 | 0.0460 |
| Stanford–Binet IQ | 5 | 0.012 | 0.203 | 0.014 | 0.354 | 0.0183 | 0.1095 | 0.0500 | 0.2999 |
| Stanford–Binet IQ | 6 | 0.094 | 0.164 | 0.160 | 0.346 | 0.1397 | 0.5589 | 0.1476 | 0.5904 |
| Stanford–Binet IQ | 7 | 0.133 | 0.137 | 0.191 | 0.222 | 0.0734 | 0.3671 | 0.0978 | 0.4890 |
| Stanford–Binet IQ | 8 | 0.152 | 0.164 | 0.339 | 0.346 | 0.1487 | 0.5589 | 0.2143 | 0.5904 |
| Stanford–Binet IQ | 9 | 0.203 | 0.203 | 0.354 | 0.354 | 0.2134 | 0.5589 | 0.2272 | 0.5904 |
| Stanford–Binet IQ | 10 | 0.203 | 0.203 | 0.267 | 0.354 | 0.1398 | 0.5589 | 0.1748 | 0.5904 |
| | | | | | | | | | |
| CAT reading score | 14 | 0.078 | 0.082 | 0.136 | 0.167 | 0.0413 | 0.0825 | 0.0771 | 0.1542 |
| CAT arithmetic score | 14 | 0.035 | 0.082 | 0.074 | 0.167 | 0.1000 | 0.1000 | 0.1357 | 0.1542 |
| CAT language score | 14 | 0.008 | 0.070 | 0.020 | 0.144 | 0.0111 | 0.0514 | 0.0235 | 0.0941 |
| CAT mechanics score | 14 | 0.047 | 0.082 | 0.097 | 0.167 | 0.0116 | 0.0514 | 0.0287 | 0.0941 |
| CAT spelling score | 14 | 0.043 | 0.082 | 0.082 | 0.167 | 0.0103 | 0.0514 | 0.0115 | 0.0577 |
| | | | | | | | | | |
| High school graduate | 19 | 0.008 | 0.008 | 0.020 | 0.020 | 0.0050 | 0.0200 | 0.0075 | 0.0299 |
| Vocational training | 40 | 0.078 | 0.078 | 0.144 | 0.144 | 0.1233 | 0.1233 | 0.1489 | 0.1489 |
| Highest grade completed | 19 | 0.070 | 0.070 | 0.113 | 0.113 | 0.0249 | 0.0497 | 0.0474 | 0.0948 |
| Grade point average | 19 | 0.039 | 0.039 | 0.082 | 0.082 | 0.0079 | 0.0237 | 0.0094 | 0.0299 |
| | | | | | | | | | |
| Total non-juvenile arrests | 40 | 0.020 | 0.133 | 0.121 | 0.158 | 0.1150 | 0.2641 | 0.1337 | 0.3160 |
| Total crime cost | 40 | 0.024 | 0.133 | 0.082 | 0.158 | 0.0660 | 0.2641 | 0.0790 | 0.3160 |
| Total charges | 40 | 0.020 | 0.067 | 0.043 | 0.090 | 0.0979 | 0.2641 | 0.1382 | 0.3160 |
| Non-victimless charges | 40 | 0.125 | 0.133 | 0.158 | 0.158 | 0.0693 | 0.2641 | 0.0803 | 0.3160 |
| | | | | | | | | | |
| Currently employed | 19 | 0.008 | 0.031 | 0.035 | 0.090 | 0.0518 | 0.1200 | 0.0705 | 0.1742 |
| Unemployed last year | 19 | 0.024 | 0.031 | 0.074 | 0.090 | 0.0400 | 0.1200 | 0.0581 | 0.1742 |
| Jobless months (past 2 yrs) | 19 | 0.125 | 0.125 | 0.206 | 0.206 | 0.0807 | 0.1200 | 0.1044 | 0.1742 |
| | | | | | | | | | |
| Currently employed | 27 | 0.110 | 0.149 | 0.175 | 0.198 | 0.0712 | 0.2137 | 0.0867 | 0.2600 |
| Unemployed last year | 27 | 0.078 | 0.149 | 0.128 | 0.175 | 0.1026 | 0.2137 | 0.1325 | 0.2650 |
| Jobless months (past 2 yrs) | 27 | 0.110 | 0.149 | 0.166 | 0.198 | 0.1881 | 0.2137 | 0.2449 | 0.2650 |
| | | | | | | | | | |
| Currently employed | 40 | 0.442 | 0.442 | 0.567 | 0.567 | 0.4786 | 0.8773 | 0.5336 | 0.9701 |
| Unemployed last year | 40 | 0.047 | 0.070 | 0.113 | 0.160 | 0.0825 | 0.2475 | 0.1250 | 0.3750 |
| Jobless months (past 2 yrs) | 40 | 0.352 | 0.367 | 0.540 | 0.540 | 0.4386 | 0.8773 | 0.4850 | 0.9701 |
Note: This table compares inferences reported by Heckman et al. (2020) with the inferences obtained using our worst-case tests. The first two columns list the blocks of outcomes analyzed by Heckman et al. (2020). The next four columns reproduce their zero-U (U = 0) p-values and max-U p-values before and after adjusting for multiplicity of hypotheses. Since all of their tests are based on the studentized DIM test statistic, we report our inferences using the same statistic side by side for comparison. The last four columns report our worst-case maximum p-values and worst-case de Haan p-values before and after adjusting for multiplicity of hypotheses. The unadjusted p-values are single p-values that are unadjusted for multiplicity of hypotheses; the adjusted p-values are stepdown p-values after adjusting for multiple testing.
Table 4 reports stepdown p-values for male outcomes. No estimated effect (after age 5) remains statistically significant at the 10% level after adjusting for multiple hypothesis testing using the worst-case maximum or worst-case de Haan p-values. However, in Table 5, which presents the stepdown analysis of female outcomes, the treatment effects on post-program outcomes (related to CAT scores and education) are statistically significant at the 10% level using our worst-case maximum p-values. Nevertheless, all of these effects on female outcomes, except one (high school graduation), disappear when worst-case de Haan p-values are used.
Tables 2 through 5 use the studentized AIPW test statistic for inference. Heckman et al. (2020) use the studentized DIM test statistic instead. Tables 6 and 7 compare their inferences with ours using the same test statistic. The effects for males on post-program outcomes remain statistically insignificant at the 10% level using stepdown worst-case de Haan p-values, whereas treatment effects on CAT scores are statistically significant in Heckman et al.’s (2020) analysis.
The contrast between Tables 5 and 7 reveals the importance of the choice of test statistic. In Table 7, several effects on female outcomes are statistically significant at the 10% level using stepdown worst-case de Haan p-values based on the studentized DIM test statistic. However, in Table 5, using the studentized AIPW test statistic, only one effect (on high school graduation) after age 5 is statistically significant based on the worst-case de Haan p-values.
Heckman et al. (2020) do not analyze the Perry treatment effects on convictions for violent crime, which are substantial and play an important role in cost-benefit analyses of early childhood programs (see Heckman et al., 2010b). Using administrative data on the criminal activity of participants, we illustrate their importance and, at the same time, the importance of long-term follow-up. Tables 8, 9, and 10 provide estimates and measures of statistical significance of treatment effects in the pooled sample (of all participants) on cumulative convictions for violent misdemeanors and felonies at various ages. Appendix D presents expanded versions of these tables reporting inference for various estimators and test statistics for the pooled sample as well as the male and female subsamples. As shown in Table 9, the AIPW estimates of the treatment effect on cumulative violent misdemeanor convictions are below −0.5 at ages 30 and 40. The treatment effects on violent misdemeanor convictions are statistically significant at the 2% and 10% levels before and after multiple hypothesis testing, respectively, regardless of the method used for inference on the pooled sample (see Appendix D).
Table 8.
DIM-Based Single Hypothesis Tests on Cumulative Convictions for Violent Crime
| Type | Age | Untreated mean | Treated mean | DIM estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Misdemeanor | 30 | 0.5231 | 0.0517 | −0.4714 | 0.0109 | 0.0021 | 0.0036 | 0.0122 | 0.0154 |
| Misdemeanor | 40 | 0.6825 | 0.0877 | −0.5948 | 0.0033 | 0.0005 | 0.0004 | 0.0053 | 0.0065 |
| Felony | 30 | 0.2846 | 0.1897 | −0.0950 | 0.2301 | 0.2263 | 0.2624 | 0.4057 | 0.5318 |
| Felony | 40 | 0.4762 | 0.1930 | −0.2832 | 0.0333 | 0.0332 | 0.0384 | 0.0708 | 0.0900 |
Note: This table reports p-values for single hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized DIM (difference-in-means) test statistic.
Table 9.
AIPW-Based Single Hypothesis Tests on Cumulative Convictions for Violent Crime
| Type | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Misdemeanor | 30 | 0.5231 | 0.0517 | −0.5300 | 0.0064 | 0.0020 | 0.0024 | 0.0099 | 0.0154 |
| Misdemeanor | 40 | 0.6825 | 0.0877 | −0.6491 | 0.0021 | 0.0010 | 0.0008 | 0.0133 | 0.0181 |
| Felony | 30 | 0.2846 | 0.1897 | −0.0561 | 0.3174 | 0.3217 | 0.3488 | 0.4820 | 0.5078 |
| Felony | 40 | 0.4762 | 0.1930 | −0.2052 | 0.0664 | 0.0778 | 0.0708 | 0.1543 | 0.1754 |
Note: This table reports p-values for single hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized AIPW test statistic.
Table 10.
AIPW-Based Multiple Hypothesis Tests on Cumulative Convictions for Violent Crime
| Type | Age | Untreated mean | Treated mean | AIPW estimate | Asymptotic p-value | Bootstrap p-value | Permutation p-value | Worst-case max. p | Worst-case de Haan p |
|---|---|---|---|---|---|---|---|---|---|
| Misdemeanor | 30 | 0.5231 | 0.0517 | −0.5300 | 0.0192 | 0.0059 | 0.0072 | 0.0394 | 0.0617 |
| Misdemeanor | 40 | 0.6825 | 0.0877 | −0.6491 | 0.0085 | 0.0039 | 0.0032 | 0.0398 | 0.0617 |
| Felony | 30 | 0.2846 | 0.1897 | −0.0561 | 0.3174 | 0.3217 | 0.3488 | 0.4820 | 0.5078 |
| Felony | 40 | 0.4762 | 0.1930 | −0.2052 | 0.1327 | 0.1556 | 0.1416 | 0.3086 | 0.3507 |
Note: This table reports Holm stepdown p-values for multiple hypothesis tests of treatment effects on cumulative misdemeanor and felony convictions for violent crime at ages 30 and 40, using the pooled sample of participants. The inferences are based on the studentized AIPW test statistic. All the above four variables, which represent cumulative crime outcomes at different ages, are treated as a block for multiple testing.
The choice of inferential method becomes more important in analyzing treatment effects on cumulative convictions for felonies. At age 30, there are no statistically significant treatment effects. At age 40, as shown in Table 9, the treatment effect is larger in magnitude, at about −0.21, representing a reduction of more than four-tenths of the control mean. However, using simple difference-in-means estimates and conventional p-values can be misleading. Using conventional p-values, the effect at age 40 appears to be statistically significant at the 10% level, as shown in Table 8. However, the design-based worst-case p-values, especially those associated with the AIPW estimate, are much higher: the worst-case de Haan p-values for the studentized DIM and AIPW test statistics are about 0.090 and 0.175, respectively.
The four variables at ages 30 and 40 considered in Tables 8 and 9 are conceptually related, since they are cumulative crime outcomes measured at different ages. To account for this, we treat these outcomes as a single block and conduct multiple hypothesis testing using the more conservative Holm stepdown procedure, with results reported in Table 10. After adjusting for multiple testing, the effects on cumulative convictions for violent misdemeanors remain statistically significant at the 10% level at both ages 30 and 40, whereas the effects on violent felonies are insignificant at both ages. These analyses, and those in the appendices, show that both the use of small-sample inference and the method used to account for compromised randomization matter in analyzing the data. Failure to account for either can give a very positive spin to the Perry program; accounting for them qualifies such conclusions. We have not, however, established the superiority of our approach. We have established that a very cautious design-based approach produces conservative inference, which by itself is not surprising. Our reanalysis of Heckman et al. (2020) is very conservative. Nonetheless, a few conclusions survive. We test Fisher’s sharp null hypothesis HF of no treatment effect for each participant. It may well be that there are treatment effects for many participants and yet we fail to reject the sharp null hypothesis because of our worst-case approach.
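To make the stepdown adjustment concrete, the following sketch (our own illustrative code, not the code used to produce the tables) implements the Holm procedure for a block of p-values:

```python
def holm_adjust(pvals):
    """Holm stepdown adjustment for a block of m hypotheses.

    Returns multiplicity-adjusted p-values in the original order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # step down: the k-th smallest p-value is scaled by (m - k + 1)
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# the four asymptotic single p-values from Table 9's crime block
print(holm_adjust([0.0064, 0.0021, 0.3174, 0.0664]))
```

Applied to the four asymptotic single p-values in Table 9, this yields approximately 0.0192, 0.0084, 0.3174, and 0.1328, matching the adjusted asymptotic p-values in Table 10 up to rounding.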
6. CONCLUSION
In this paper, we develop and apply a design-based finite-sample inferential method for analyzing social experiments with compromised randomization. Compromises come in many forms. They include incompletely documented re-randomization procedures used to improve baseline covariate balance between treatment and control groups. They also include reassignment of treatment status due to administrative constraints.
We build a behavioral model of satisficing experimenters who seek balance in baseline covariates across treatments and controls and who provide readers of their reports qualitative, and sometimes conflicting, summaries of the actual experimental protocols used. We model the randomization protocol as only partially known to the user of experimental data. The empirical researcher recognizes and tries to account for the guiding principles experimenters used in the reassignment of treatment status for balancing baseline covariates while operating under administrative constraints. We show how to partially identify model parameters and construct worst-case (least favorable) randomization tests over a set of possibilities for the actual treatment assignment mechanism.
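In skeletal form, the worst-case construction can be sketched as follows. This is an illustrative reduction (our own, not the production code), under the simplifying assumption that each candidate assignment mechanism is summarized by simulated draws of the test statistic under the sharp null:

```python
def randomization_p(t_obs, null_stats):
    """One-sided randomization p-value: the share of null draws of the
    test statistic at least as extreme as the observed value."""
    return sum(s >= t_obs for s in null_stats) / len(null_stats)

def worst_case_p(t_obs, null_stats_by_mechanism):
    """Least favorable p-value: the maximum randomization p-value over
    a set of candidate assignment mechanisms, each represented here by
    its simulated null distribution of the test statistic."""
    return max(randomization_p(t_obs, stats) for stats in null_stats_by_mechanism)

# two hypothetical mechanisms: the first yields p = 0.5, the second p = 0.0
print(worst_case_p(2.0, [[1.0, 2.5, 0.3, 2.0], [0.1, 0.2]]))  # → 0.5
```

Maximizing over the identified set of mechanisms is what makes the resulting test valid without knowledge of the actual protocol, at the cost of conservatism.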
Our analysis of the Perry program serves as a proof of concept of the usefulness of our worst-case finite-sample testing approaches, which are applicable to other compromised experiments, such as those discussed by Bruhn and McKenzie (2009). Our approach is more portable than that of Heckman et al. (2020), which utilizes very specific features of the Perry randomization protocol. Application of our procedures results in conservative finite-sample inferences. It is remarkable that when we apply our worst-case methods to the latest wave of Perry data on the participants at late midlife and their adult children (Heckman, 2020), we find many statistically significant, policy-relevant, beneficial treatment effects that survive even our worst-case inferential procedures.
Supplementary Material
ACKNOWLEDGEMENTS
This paper was delivered as the Sargan Lecture of The Royal Economic Society, Warwick, England, March 2019. It has been subject to the usual refereeing standards of this journal. We thank Juan Pantano and Azeem Shaikh for comments on early drafts of this paper. We thank Cheryl Polk and Lawrence Schweinhart of the HighScope Educational Research Foundation for their assistance in data acquisition, sharing historical documentation, and their longstanding partnership with the Center for the Economics of Human Development. This research was supported in part by: the Buffett Early Childhood Fund; NIH Grants R01AG042390, R01AG05334301, and R37HD065072; and the American Bar Foundation. The views expressed in this paper are solely those of the authors and do not necessarily represent those of the funders or the official views of the National Institutes of Health.
Footnotes
See Schweinhart et al. (1985, 1993, 2005). Heckman et al. (2010a) describe the program in detail. See also Appendix A.
See, e.g., Heckman et al. (2010a) and Heckman et al. (2020).
See Obama (2013).
We compare in detail the approaches of Heckman et al. (2010a) and Heckman et al. (2020) with our methods in Section 4.
These percentages are calculated by weighting each survey respondent by the number of experiments in which the respondent had participated.
See, e.g., Morgan and Rubin (2012, 2015) and Li et al. (2018). Morgan and Rubin (2012) state that they “only advocate rerandomization if the decision to rerandomize or not is based on a pre-specified criterion.” Their inferential methods require knowledge of such pre-specified criteria. Although rerandomization methods reduce the variance of the null distribution asymptotically in certain settings (Morgan and Rubin, 2012, 2015; Li et al., 2018), this property is not guaranteed in the finite-sample setting we consider.
According to Schweinhart et al. (2005), “4 children did not complete the preschool program because they moved away and 1 child died [in a fire accident] shortly after the study began.” We are missing the following data (on some of these children) that are necessary for inference procedures. We do not know the mother’s working status at baseline of a subject in wave 0 (who has a sibling in wave 1) among the five children who dropped out of the original sample of 128 for extraneous reasons. We also do not know the gender of a subject in wave 1. (We use the Perry convention that wave 0 is the first wave and wave 4 is the last one.) The baseline information on these subjects is important in our formal model of the randomization protocol. We do not make assumptions regarding the mother’s working status at baseline of the subject in wave 0 and the gender of the other subject in wave 1. We run our testing procedures for each of the possible values of the variables. While we use the data on the five dropped children in our simulations of the randomization protocol for our worst-case tests, we treat the five participants as ignorable in our estimation of the treatment effects. Thus, our effective sample for estimation and inference is the core sample of 123 children.
Those in the treatment group of the first entry cohort (wave 0) were provided the intervention for only one year, starting at age 4, and thus constitute an exception. Our estimates of treatment effects pool all five cohorts, even though the lower program intensity in the first cohort might in principle attenuate the magnitudes of the estimated effects.
See Appendix B. According to Schweinhart et al. (1993), “[The staffers] exchanged several similarly ranked pair members so the two groups would be matched on [the baseline variables].” Even though the phrase “similarly ranked pair members” might suggest consecutively ranked members, this is not necessarily the case. In Appendix B, we use Perry data from wave 4 to demonstrate that the exchanges were not necessarily between consecutively ranked pairs.
This is also manifested in the observed data. For example, as explained later in Section 3.2, the number of singletons in wave 2 is 22, with 12 in the control group and 10 in the treatment group. If there were exchanges between the initial experimental groups instead of one-way transfers to the control group, there would have been 11 singletons in both the control and treatment groups instead of 12 and 10, respectively.
See Simon (1955), an early paper in behavioral economics.
Each of the cohorts corresponds to one of the five waves (labeled 0 through 4) of study participants recruited in the fall seasons of 1962 through 1965. Waves 0 and 1 were randomized in the fall of 1962, while waves 2, 3, and 4 were randomized in the falls of 1963, 1964, and 1965, respectively. We follow the Perry analysts’ labeling convention, which designates the first cohort as “0.”
Note that the other participants in cohort c who are not singletons have older siblings already enrolled in the Perry experiment in a previous wave. The non-singletons are not randomized but rather assigned the same treatment status as their elder siblings.
Note that ⌈·⌉ is the ceiling function and ⌊·⌋ is the floor function. They assign the least integer upper bound and the greatest integer lower bound to their argument, respectively.
An exchange means a swap between two participants belonging to different undesignated groups. Since the Perry experiment did not use a matched pair design, an exchange or swap is not restricted to occur between participants with consecutive IQ ranks. Exchanges between participants with non-consecutive IQ ranks can occur. See Appendix B.
Hotelling’s multivariate two-sample t-squared statistic maps a partition of the cohort into a treatment group T and a control group C (such that T and C are disjoint and together exhaust the cohort) to a nonnegative real number and is given by H(T, C) = [nT nC/(nT + nC)] (Z̄T − Z̄C)′ S⁻¹ (Z̄T − Z̄C), where Zi is the vector containing the i-th participant’s IQ, index of socioeconomic status, gender, and mother’s working status; Z̄T and Z̄C are the group means of Zi; nT and nC are the group sizes; and S is the pooled sample covariance matrix of Zi. We use this metric for dimensionality reduction and computational feasibility. Chung and Romano (2016) show, without assuming normality, that the permutation distribution of the statistic is asymptotically chi-squared. If adequate computational power were available, we could also incorporate into our model the raw mean differences in the four variables, their studentized versions, or other measures of mean differences between two groups. Of course, it is possible that the Perry staffers were just eyeballing mean differences and did not use any formal metric.
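A minimal computational sketch of this balance metric (our own illustrative code, assuming the standard pooled-covariance form of the two-sample statistic):

```python
import numpy as np

def hotelling_t2(Z_T, Z_C):
    """Two-sample Hotelling statistic for covariate balance between a
    treatment group and a control group.

    Z_T, Z_C: (n_T, k) and (n_C, k) arrays of baseline covariates
    (here, e.g., IQ, SES index, gender, mother's working status).
    """
    n_T, n_C = len(Z_T), len(Z_C)
    diff = Z_T.mean(axis=0) - Z_C.mean(axis=0)
    # pooled sample covariance of the two groups
    S = ((n_T - 1) * np.cov(Z_T, rowvar=False)
         + (n_C - 1) * np.cov(Z_C, rowvar=False)) / (n_T + n_C - 2)
    scale = n_T * n_C / (n_T + n_C)
    return float(scale * diff @ np.linalg.solve(S, diff))
```

The statistic is zero when the two groups have identical covariate means and grows with imbalance, weighted by the inverse pooled covariance.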
For cohort 0, the proportion of possible group formations with a lower Hotelling statistic is at least 29.24%. The corresponding numbers for cohorts 1, 2, 3, and 4 are 64.51%, 14.79%, 9.76%, and 75.56%, respectively. Similarly, the raw mean differences in baseline covariates for the initial groups also do not correspond to their minimum possible values.
The satisficing threshold δc is the maximum level of covariate imbalance that satisficed the Perry staffers. The threshold δc is unknown to the analyst but can be partially identified, as explained later. We assume a uniform probability over the set of satisficing partitions to keep the model simple and computationally feasible. In general, we might suspect the following: given two partitions with the same value of Hotelling’s statistic, there might have been higher probability mass on the partition closer to the initial grouping based on odd and even IQ ranks. In addition, the staffers might have preferred not to make additional exchanges if they expected relatively insignificant reductions in covariate imbalance. In other words, the probability that the Perry staffers chose a particular partition could have depended on their preferences trading off two things: similarity of the chosen partition to the initial IQ rank-based grouping, and the level of covariate imbalance (as measured by Hotelling’s statistic) resulting from the partition. However, there is no unique way to formalize this notion, and such a general model may not even be computationally feasible.
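As a stylized illustration of the satisficing set, the following sketch (our own, with made-up scores and threshold) substitutes a hypothetical one-dimensional imbalance metric, the absolute difference in group mean scores, for Hotelling’s statistic:

```python
import itertools
import random

def satisficing_partitions(scores, delta):
    """All equal-split partitions of a cohort whose imbalance is within
    the satisficing threshold delta. For illustration, the imbalance
    metric is the absolute difference in group mean scores (a
    one-dimensional stand-in for Hotelling's statistic)."""
    n = len(scores)
    ok = []
    for treat in itertools.combinations(range(n), n // 2):
        control = tuple(i for i in range(n) if i not in treat)
        t_mean = sum(scores[i] for i in treat) / len(treat)
        c_mean = sum(scores[i] for i in control) / len(control)
        if abs(t_mean - c_mean) <= delta:
            ok.append((treat, control))
    return ok

# the model places uniform probability on the satisficing set
cohort = [82, 85, 88, 91, 94, 97]  # hypothetical baseline IQ scores
chosen = random.choice(satisficing_partitions(cohort, delta=3.0))
```

Raising delta enlarges the satisficing set (and the set of null assignments the worst-case test must guard against); at the maximum, every equal split satisfices.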
The Perry teachers conducted special home visits for working mothers at times other than weekday afternoons, when they visited the homes of non-working mothers. Because of logistical and financial constraints, the teachers were able to visit the homes of only a limited number of working mothers at times other than weekday afternoons. Thus, the children of working mothers in the preliminary treatment group for whom these special arrangements could not be made were transferred to the control group.
Thus, ηc can be thought of as the number of slots available for special visits to the homes of working mothers. Equivalently, it is the number of children of working mothers who would remain in the final treatment group if all of them were placed in the preliminary treatment group.
Since cohorts 0 and 1 had a common set of teachers, they share the number of slots available for the special home visits. Thus, we pool these two cohorts while defining m0,1 and η0,1. However, cohorts 2 through 4 have separate parameters for the slots available for special home visits.
It is possible that the Perry staffers engaged in another round of satisficing at this step. In principle, this could be incorporated into our model but would increase its dimensionality. Since the published accounts do not mention another round of balancing, we do not add this feature to our model to keep it computationally feasible.
We are implicitly assuming that all working mothers would be able to send their children to preschool and participate in weekly home visits if special arrangements could be made for them. A model allowing for heterogeneity in availability of working mothers (for special arrangements) does not appear to be computationally feasible.
In other words, Vi,c = 0 for the participants who were either initially placed in the control group or placed in the initial treatment group but have non-working mothers.
Note that ωm,d ≡ ωm,d,c for all (m, d) ∈ {0, 1}2 but we suppress the subscript c for simplicity.
Specifically, we construct lower bounds for the satisficing thresholds δc for c ∈ {0, 1}. In our application, η0,1 ∈ {3}, so the pooled capacity constraint is point-identified. Since we do not make assumptions about the missing mother's working status at baseline for a subject in wave 0 or the missing gender of another subject in wave 1 (among the five who dropped out of the initial sample of 128 for extraneous reasons), our partial identification of δ0 and δ1 depends on the values in the partially identified set for these missing variables. Leaving the two missing binary variables unrestricted is a strength of our analysis, although it quadruples the computational cost. We also use the known information that there was at least one transfer in wave 0 (Weikart et al., 1964) to narrow the partially identified set for that cohort.
In a sample of 53 randomized controlled trials published in leading economics journals, Young (2019) likewise finds that experimental results obtained using asymptotic theory can be misleading relative to results based on randomization tests.
However, unless the permutation method reflects the method used for random assignment of the treatment, permutation tests do not in general allow us to test hypotheses about counterfactual outcomes of the individual Perry participants.
In practice, their approach relies on large-sample methods in using regression analysis to condition on covariates.
This is attributed to Neyman (1923).
While this formulation states that each individual treatment effect τi is zero, the analyst may fix each τi at a desired value for hypothesis testing. Such a hypothesis is often called sharp because it specifies one set of counterfactual outcomes for the participants.
Note that we observe either $Y_i(0)$ or $Y_i(1)$ for each participant $i$. Thus, under the null model (4.6), the other counterfactual outcome can be imputed according to the fact that $Y_i(1) = Y_i(0) + \tau_i$. In general, if $\tau_i$ is hypothesized to be equal to a number $\tau_i^0$, the counterfactual outcomes under the null model are given by $Y_i(1) = Y_i + \tau_i^0$ if $D_i = 0$ and $Y_i(0) = Y_i - \tau_i^0$ if $D_i = 1$, for all participants $i$.
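The imputation step under a sharp null is mechanical, as the following minimal sketch shows (here `tau0` denotes a hypothesized common effect, a simplification of the participant-specific values):

```python
import numpy as np

def impute_counterfactuals(y, d, tau0):
    """Impute both potential outcomes under the sharp null
    Y_i(1) = Y_i(0) + tau0 for every participant."""
    y0 = np.where(d == 1, y - tau0, y)  # observed Y is Y(1) when d = 1
    y1 = np.where(d == 1, y, y + tau0)  # observed Y is Y(0) when d = 0
    return y0, y1

y = np.array([5.0, 3.0, 7.0, 2.0])
d = np.array([1, 0, 1, 0])
y0, y1 = impute_counterfactuals(y, d, tau0=2.0)
# y0 -> [3., 3., 5., 2.], y1 -> [5., 5., 7., 4.]
```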
See Athey and Imbens (2017) and Abadie et al. (2020) for background on this topic. Also, note that our randomization tests are conditional tests that exploit random variation in the treatment status but fix the other observed data. See Lehmann (1993).
When the outcomes under consideration are binary and the experiment involves a completely randomized design, there are strategies to test the weak null hypothesis in a computationally feasible way (see, e.g., Li and Ding, 2016; Rigdon and Hudgens, 2015).
These tests are not strictly exact because our model simplifies the actual randomization procedure and can at best be considered a useful approximation of the true model of the protocol.
Specifically, the lower bounds for the satisficing thresholds δc are defined analogously for all c ∈ {2, 3, 4}.
We use 500,000 Monte Carlo draws from the corresponding sample space, which is very large, to approximate x(γ*).
We use 400 Monte Carlo draws to approximate one of the tail probabilities; this is effectively importance sampling. In addition, we use 2,600 Monte Carlo draws together with rejection sampling to approximate the other. It takes much longer to compute these tail probabilities than to compute x(γ*), and limited computational power restricted the number of Monte Carlo draws.
Since the randomly sampled treatment status vectors are i.i.d. and uniformly distributed on the corresponding sample spaces, for a given γ* the associated stochastic approximations of the p-value can be used to construct valid tests. For details, see Section 4 of Romano (1989), Section 3.2 of Romano and Wolf (2005), or Section 15.2.1 of Lehmann and Romano (2005). Although this holds when γ* is taken as given, our main object of interest is the worst-case p-value in equation (4.7). Since it is infeasible to compute a p-value for every γ* ∈ Ξ, we also resort to stochastic approximation of the supremum in equation (4.7). In Section 4.3.2, we discuss how we account for uncertainty in the stochastic approximation of the worst-case p-value.
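The two-layer approximation just described can be sketched as follows. The difference-in-means statistic and the candidate assignment mechanisms below are illustrative placeholders for the paper's test statistics and satisficing-model draws; the outer maximum over a finite list of mechanisms stands in for the stochastic approximation of the supremum over γ*.

```python
import numpy as np

def diff_in_means(y, d):
    return y[d == 1].mean() - y[d == 0].mean()

def mc_pvalue(y, d, draw_assignment, n_draws, rng):
    """Monte Carlo randomization p-value under the sharp null of zero
    effect: redraw treatment vectors from the assumed mechanism and
    compare the statistic with its observed value."""
    t_obs = abs(diff_in_means(y, d))
    exceed = sum(abs(diff_in_means(y, draw_assignment(rng))) >= t_obs
                 for _ in range(n_draws))
    return (1 + exceed) / (1 + n_draws)  # +1 correction keeps the test valid

def worst_case_pvalue(y, d, mechanisms, n_draws, rng):
    """Approximate the supremum over nuisance parameters gamma* by
    maximizing over a finite list of candidate mechanisms."""
    return max(mc_pvalue(y, d, m, n_draws, rng) for m in mechanisms)

rng = np.random.default_rng(2)
y = rng.normal(size=20)
d = np.zeros(20, dtype=int)
d[:10] = 1
# Candidate mechanisms: complete randomization with 10 or 8 treated units.
def complete(k):
    def draw(rng):
        d_new = np.zeros(20, dtype=int)
        d_new[rng.permutation(20)[:k]] = 1
        return d_new
    return draw

p = worst_case_pvalue(y, d, [complete(10), complete(8)], n_draws=500, rng=rng)
```

Taking the maximum over mechanisms makes the resulting p-value conservative with respect to the unknown nuisance parameter, which is the logic behind the worst-case construction.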
Specifically, the parameter space is constructed from the lower bound for each satisficing threshold δc and the finite partially identified set for each capacity constraint ηc.
In fact, we can simplify the worst-case tail probability further: because the relevant sets of parameter values are finite for all c ∈ {0, …, 4}, the supremum over Ξ can be replaced by a maximum over a finite set ΞΓ. However, even though the set ΞΓ is finite, its size is too large in practice, making stochastic approximations still necessary.
Note that in our application, η0,1, η2, and η3 are point-identified while η4 is partially identified to be in the set {0, …, 4}. Thus, (η0,1, η2, η3, η4) has 5 possible values. In addition, since we do not know the mother’s working status at baseline for a subject in wave 0 and the gender of a subject in wave 1 (both of whom are among the 5 participants who dropped out of the study for extraneous reasons), there are 4 possible configurations of the two missing binary variables. Thus, in total there are L = 5 × 4 = 20 hyper-rectangles that make up Ξ°.
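The count of hyper-rectangles follows directly from enumerating the identified set for η4 and the configurations of the two missing binary covariates:

```python
from itertools import product

eta4_values = range(5)                              # eta_4 in {0, ..., 4}
missing_configs = list(product([0, 1], repeat=2))   # two missing binary covariates
hyper_rectangles = list(product(eta4_values, missing_configs))
L = len(hyper_rectangles)                           # 5 * 4 = 20
```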
To ensure that we cover Ξ° and its edges well when sampling the random points, we use a normalization: the distribution of Hotelling statistics over the possible partitions is used to normalize δc so that the normalized threshold lies in a compact set for all c ∈ {0, …, 4}. Thus, γ and Ξ° are monotonically transformed accordingly in practice. We can do this because the set of attainable Hotelling statistics is finite.
In the context of estimating the minimum of a function over a compact set using order statistics, de Haan (1981) proposes the construction of a confidence band for the minimum based on randomly sampled function values. We apply this result, without loss of generality, to estimation of the maximum rather than the minimum.
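A sketch of an order-statistics bound in the spirit of de Haan (1981), applied to a maximum rather than a minimum, is given below. The 2/dim exponent assumes locally quadratic behavior of the objective near its maximizer, and the exact constant should be taken from the original result; this is an illustrative construction, not the paper's formula.

```python
import numpy as np

def de_haan_upper_bound(values, dim, alpha=0.05):
    """Upper confidence bound for the maximum of a function from its
    values at uniformly sampled points, using the two largest order
    statistics (a sketch in the spirit of de Haan, 1981; the 2/dim
    exponent assumes a locally quadratic maximum)."""
    y = np.sort(np.asarray(values))
    spread = y[-1] - y[-2]
    return y[-1] + (alpha ** (-2.0 / dim) - 1.0) * spread

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(5000, 2))
vals = -np.sum(x**2, axis=1)       # true maximum is 0, attained at the origin
ub = de_haan_upper_bound(vals, dim=2)
```

The bound exceeds the sample maximum by a multiple of the gap between the two largest sampled values, so it accounts for the chance that the random points missed the true maximizer.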
Our model is limited in that it does not allow for heterogeneity among working mothers in their availability for special arrangements. We assume that the Perry administrators chose, with equal probability, which working mothers received special arrangements.
This is Step 4′ in their paper. Accordingly, their tests involve “permuting treatment status among those families with the same observed and unobserved characteristics (defined by the characteristics of the eldest child in the case of families with multiple children).” In practice, they discretize socioeconomic status (SES) into a binary indicator of above-median SES.
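Permuting treatment status within groups of families sharing the same characteristics can be sketched as follows; the strata here are illustrative (e.g., the binary above-median-SES indicator mentioned above):

```python
import numpy as np

def permute_within_strata(d, strata, rng):
    """Permute treatment status independently within each stratum,
    preserving each stratum's number of treated families."""
    d_new = d.copy()
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        d_new[idx] = d[idx][rng.permutation(len(idx))]
    return d_new

rng = np.random.default_rng(4)
d = np.array([1, 0, 1, 0, 1, 0, 0, 1])
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # e.g., below/above-median SES
d_perm = permute_within_strata(d, strata, rng)
```

Because each stratum is permuted separately, the number of treated units in every stratum is unchanged, which is what restricts the permutation distribution to assignments consistent with the stratified design.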
In the Perry context, it consists of the four pre-program covariates used during the randomization phase, i.e., Stanford-Binet IQ, index of socioeconomic status, gender, and mother’s working status.
Both OLS and DIM estimators can be studentized using their cluster-robust asymptotic standard errors, allowing for correlation between error terms of the participant-siblings in the Perry experiment.
The AIPW estimator also assumes conditional independence of the counterfactual outcomes and the treatment status, which is valid because of the random assignment of the treatment status conditional on pre-program variables. Note that the propensity score model used in the AIPW estimator is a direct consequence of the law of conditional probability for d ∈ {0, 1}. In the econometrics literature, the AIPW estimator is better known as a type of efficient influence function (EIF) estimator (Cattaneo, 2010). The estimator given by equation (5.25) can be studentized using the empirical sandwich standard error under the assumption that the propensity score and regression models are both correctly specified (Lunceford and Davidian, 2004). For studentization, we use a cluster-robust version of this asymptotic standard error, computed over the clusters of participant-siblings in the set J of clusters. Our studentized test statistics are based on the asymptotic standard error mainly for computational ease, but studentization based on the bootstrap standard error would be superior in theory.
See Robins et al. (1994), Lunceford and Davidian (2004), and Kang and Schafer (2007). The double-robustness property (consistency despite certain forms of misspecification) is easier to understand by rewriting equation (5.26) for d ∈ {0, 1}. If the propensity score models are correctly specified, the sample average of the second term in the rewritten expression tends to zero. If, on the other hand, the counterfactual outcome models are correctly specified, the sample average of that term again converges to zero. Thus, the AIPW estimator remains consistent for the average treatment effect if either the propensity score models or the counterfactual outcome models are misspecified, but not both.
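A minimal numpy sketch of the AIPW point estimator follows. The fitted propensity scores and outcome regressions are assumed to come from some consistent first-stage models; here they are supplied directly in a toy example with a known effect of 2 and correctly specified models.

```python
import numpy as np

def aipw_ate(y, d, p_hat, mu1_hat, mu0_hat):
    """AIPW (doubly robust) estimate of the average treatment effect
    from fitted propensity scores p_hat = P(D=1|X) and outcome
    regressions mu_d(X)."""
    psi1 = mu1_hat + d * (y - mu1_hat) / p_hat
    psi0 = mu0_hat + (1 - d) * (y - mu0_hat) / (1 - p_hat)
    return float(np.mean(psi1 - psi0))

# Toy data: outcome y = x + 2*d, randomized treatment, correct models.
rng = np.random.default_rng(5)
x = rng.normal(size=200)
d = rng.integers(0, 2, size=200)
y = x + 2.0 * d
est = aipw_ate(y, d, p_hat=np.full(200, 0.5), mu1_hat=x + 2.0, mu0_hat=x)
# est -> 2.0 exactly, since both first-stage models are correct here
```

Replacing either `p_hat` or the `mu_d` fits with a misspecified model (but not both) leaves the estimator consistent, which is the double-robustness property discussed above.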
However, we present estimates from all of these procedures in the appendix as a form of sensitivity analysis. The AIPW estimator can become unstable if both the propensity score models and the counterfactual outcome models are misspecified (Kang and Schafer, 2007). Thus, we do not rely solely on the AIPW estimator but use it in conjunction with the DIM and OLS estimators.
Since the AIPW estimator has an asymptotic justification, it is not strictly a small-sample procedure from an estimation perspective. Nevertheless, we can conduct inference based on its finite-sample worst-case randomization null distribution using our design-based methods.
In theory, we could bound the LATE estimate by considering all possible values for each observation’s initial treatment status, and then we could use the LATE bound as a test statistic for inference. However, this is very demanding computationally and thus not feasible in practice.
In these appendices, for each outcome we include the conventional p-values (i.e., asymptotic, bootstrap, and permutation p-values) and design-based p-values (i.e., worst-case maximum and worst-case de Haan p-values) associated with each of the DIM, OLS, and AIPW estimators of treatment effects. We also include permutation and worst-case p-values based on both nonstudentized and studentized test statistics. In addition, we include stepdown versions of the worst-case p-values.
The corresponding worst-case de Haan (single) p-values are 0.382, 0.322, 0.210, 0.147, 0.302, respectively.
REFERENCES
- Abadie A, Athey S, Imbens GW, and Wooldridge JM (2020). Sampling-based versus design-based uncertainty in regression analysis. Econometrica 88 (1), 265–296.
- Athey S and Imbens GW (2017). The econometrics of randomized experiments. In Handbook of Economic Field Experiments, Volume 1, pp. 73–140. Amsterdam: Elsevier.
- Bruhn M and McKenzie D (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics 1 (4), 200–232.
- Cattaneo MD (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2), 138–154.
- Chung E and Romano JP (2013). Exact and asymptotically robust permutation tests. The Annals of Statistics 41 (2), 484–507.
- Chung E and Romano JP (2016). Multivariate and multiple permutation tests. Journal of Econometrics 193 (1), 76–91.
- de Haan L (1981). Estimation of the minimum of a function using order statistics. Journal of the American Statistical Association 76 (374), 467–469.
- Fisher RA (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.
- Fisher RA (1935). The design of experiments. Edinburgh: Oliver and Boyd.
- Heckman JJ (2020). Intergenerational impacts of a program designed to promote the social mobility of disadvantaged African Americans. Unpublished manuscript, The University of Chicago.
- Heckman JJ, Moon SH, Pinto R, Savelyev PA, and Yavitz A (2010a). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. Quantitative Economics 1 (1), 1–46.
- Heckman JJ, Moon SH, Pinto R, Savelyev PA, and Yavitz A (2010b). The rate of return to the HighScope Perry Preschool Program. Journal of Public Economics 94 (1–2), 114–128.
- Heckman JJ, Pinto R, and Shaikh AM (2020). Inference with imperfect randomization: The case of the Perry Preschool Program. Unpublished manuscript, The University of Chicago.
- Holm S (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (2), 65–70.
- Kang JDY and Schafer JL (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22 (4), 523–539.
- Lehmann EL (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association 88 (424), 1242–1249.
- Lehmann EL and Romano JP (2005). Testing statistical hypotheses. New York: Springer.
- Li X and Ding P (2016). Exact confidence intervals for the average causal effect on a binary outcome. Statistics in Medicine 35 (6), 957–960.
- Li X, Ding P, and Rubin DB (2018). Asymptotic theory of rerandomization in treatment–control experiments. Proceedings of the National Academy of Sciences 115 (37), 9157–9162.
- Lunceford JK and Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine 23 (19), 2937–2960.
- Morgan KL and Rubin DB (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics 40 (2), 1263–1282.
- Morgan KL and Rubin DB (2015). Rerandomization to balance tiers of covariates. Journal of the American Statistical Association 110 (512), 1412–1421.
- Neyman JS (1923). Próba uzasadnienia zastosowań rachunku prawdopodobieństwa do doswiadczeń polowych (On the application of probability theory to agricultural experiments: Essay on principles). Roczniki Nauk Rolniczych (Annals of Agricultural Sciences) 10, 1–51. Reprinted in Statistical Science 5 (4), 465–472, as a translation by D. M. Dabrowska and T. P. Speed (1990) from section 9 (29–42) of the original Polish article.
- Obama B (2013). The 2013 State of the Union Address. Washington, DC: The White House Office of the Press Secretary.
- Quandt RE (1958). The estimation of the parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association 53 (284), 873–880.
- Quandt RE (1972). A new approach to estimating switching regressions. Journal of the American Statistical Association 67 (338), 306–310.
- Rigdon J and Hudgens MG (2015). Randomization inference for treatment effects on a binary outcome. Statistics in Medicine 34 (6), 924–935.
- Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (427), 846–866.
- Romano JP (1989). Bootstrap and randomization tests of some nonparametric hypotheses. The Annals of Statistics 17 (1), 141–159.
- Romano JP and Shaikh AM (2010). Inference for the identified set in partially identified econometric models. Econometrica 78 (1), 169–211.
- Romano JP, Shaikh AM, and Wolf M (2010). Hypothesis testing in econometrics. Annual Review of Economics 2 (1), 75–104.
- Romano JP and Wolf M (2005). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association 100 (469), 94–108.
- Schweinhart LJ (2013). Long-term follow-up of a preschool experiment. Journal of Experimental Criminology 9 (4), 389–409.
- Schweinhart LJ, Barnes HV, Weikart DP, Barnett W, and Epstein A (1993). Significant benefits: The High/Scope Perry Preschool Study through age 27 (Monographs of the High/Scope Educational Research Foundation, 10). Ypsilanti, MI: HighScope Educational Research Foundation.
- Schweinhart LJ, Berrueta-Clement JR, Barnett WS, Epstein AS, and Weikart DP (1985). The promise of early childhood education. The Phi Delta Kappan 66 (8), 548–553.
- Schweinhart LJ, Montie J, Xiang Z, Barnett WS, Belfield CR, and Nores M (2005). Lifetime effects: The High/Scope Perry Preschool Study through age 40 (Monographs of the High/Scope Educational Research Foundation, 14). Ypsilanti, MI: HighScope Educational Research Foundation.
- Schweinhart LJ and Weikart DP (1980). Young Children Grow Up: The Effects of the Perry Preschool Program on Youths Through Age 15. Ypsilanti, MI: HighScope Educational Research Foundation.
- Simon HA (1955). A behavioral model of rational choice. The Quarterly Journal of Economics 69 (1), 99–118.
- Weikart DP, Bond JT, and McNeil JT (1978). The Ypsilanti Perry Preschool Project: Preschool years and longitudinal results through fourth grade. Number 3. Ypsilanti, MI: HighScope Educational Research Foundation.
- Weikart DP, Kamii CK, and Radin NL (1964). Perry Preschool Project progress report. Technical report, Ypsilanti Public Schools.
- Wu J and Ding P (2020). Randomization tests for weak null hypotheses in randomized experiments. Journal of the American Statistical Association, in press.
- Young A (2019). Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results. The Quarterly Journal of Economics 134 (2), 557–598.
- Zigler E and Weikart DP (1993). Reply to Spitz's comments. American Psychologist 48 (8), 915–916.
