Abstract
Survey data on earnings tend to contain measurement error. Administrative data are superior in principle, but they are worthless in case of a mismatch. We develop methods for prediction in mixture factor analysis models that combine both data sources to arrive at a single earnings figure. We apply the methods to a Swedish data set. Our results show that register earnings data perform poorly if there is a (small) probability of a mismatch. Survey earnings data are more reliable, despite their measurement error. Predictors that combine both and take conditional class probabilities into account outperform all other predictors.
Keywords: Factor score, structural equation model, finite mixture, administrative data, validation study
1 Introduction
When both survey and administrative data are available, for example measuring individual earnings, the question arises how to make the best use of the available information in empirical applications. Measurement error in individual earnings data has received considerable attention in the literature; see Bound et al. (2001) for an extensive overview. Typically, the presence and extent of measurement error are studied by comparing survey answers with matched administrative data, either from company records or government agencies (e.g., Bound and Krueger, 1991; Pischke, 1995). In these studies, it is assumed that the administrative data are error-free. One of the findings in this literature has been that measurement error is nonclassical in the sense that it is negatively correlated with true earnings (mean reverting). Kapteyn and Ypma (2007) conduct a similar study, but they allow for errors in the matching of the administrative data to the survey data, so that the administrative data may be incorrect. One of their findings is that mean reversion almost disappears after allowing for mismatch. This is an important finding, as it calls into question the approach commonly adopted in empirical applications of discarding the self-reported data in favor of the supposedly more accurate administrative data. While using administrative data may solve some of the precision issues, it may at the same time introduce new issues, a fact that has largely been ignored so far. In case of a mismatch, for example, the (reliably measured) earnings of one individual are attributed to another individual, introducing an error that may be much worse than the imprecision of a self-reported figure. It seems intuitive that one could make better use of the available information by combining the observations from the survey data and the administrative data.
The objective of this paper is to derive several predictors of a single earnings variable and compare their statistical properties. We build on the model by Kapteyn and Ypma (2007), who introduce a mixture factor analysis model for survey and register earnings data. They estimate their model using data from Sweden. In this model, the true earnings variable is a latent variable. Our aim is to predict this latent variable taking the parameter estimates as given. In fact, throughout the paper, we abstract from model specification, identification, and estimation issues.
While both mixture models and factor score prediction have been extensively studied, the intersection of the two has received very little attention. The work by Zhu and Lee (2001) is most closely related to ours. They offer a Bayesian approach, proposing the Gibbs sampler to estimate the joint posterior distribution of the parameters and the latent variables, conditional on the observed variables. However, they do not explicitly consider prediction issues, which is the focus of our paper.
In our search for an optimal combination of survey and administrative data, we develop a theory of factor score prediction in mixture factor analysis models. We formulate the prediction problem in a generalized manner so that it covers the whole class of mixtures of linear structural equation models as well as mixtures of random coefficient linear regression models.
In the next section we introduce the Kapteyn-Ypma model. This model is a special case of a mixture factor analysis model, which we present in its full generality in section 3. To introduce notation and to provide the context for discussing the mixture case we briefly recapitulate the basics of factor score prediction in section 4. In section 5 we develop an array of predictors for the mixture model. In section 6, we adapt the Kapteyn-Ypma model to our model specification, apply the various predictors from section 5 and compare their performance in terms of mean squared error and reliability. Section 7 concludes.
2 The Kapteyn-Ypma model
The data used in Kapteyn and Ypma (2007) come from Sweden and were collected as part of a validation study to inform subsequent data collection efforts in the Survey of Health, Ageing and Retirement in Europe (SHARE). The objective was to compare self-reported information obtained in a survey with information obtained from administrative data sources (registers) to learn about measurement error, in particular in respondents’ self-reports. The survey was fielded in the spring of 2003, shortly after tax preparation season, when people are most likely to have relevant financial information fresh in their minds. The sample was designed to cover individuals age 50 and older and their spouses; it was drawn from an existing database of administrative records (LINDA) which contains a five percent random subsample of the entire Swedish population. Out of 1431 sampled individuals 881 responded. Kapteyn and Ypma (2007) investigate the relation between survey data and register data for three different variables. We use their model on individual earnings without covariates, which is based on 400 observations with positive values recorded both in the register and in the survey data.
In the Kapteyn-Ypma model, there is a single latent variable ξ, representing the true value of the logarithm of earnings of an individual. (We omit the index denoting the individual.) This true value is not measured directly. However, there are two indicators available, one coming from administrative data and the other from individuals’ self-reports collected in a survey. Figure 1 depicts the observations: the logarithms of the survey earnings are plotted against the logarithms of the register earnings. Most points are close to the 45° line. Thus, for most respondents, earnings as reported in the survey are close to the earnings observed in the administrative records. Still, for a noticeable number of points this is not the case.
Figure 1.
Survey data versus register data.
Unlike most of the literature in this area, the Kapteyn-Ypma model does not assume that this is necessarily due to measurement error in the survey earnings. Rather, Kapteyn and Ypma assume that both indicators may contain error. However, the structure of the error differs between the indicators.
Let ξ and three other random variables, ζ, η, and ω, be independent of each other and normally distributed with means and variances $(\mu_\xi, \sigma_\xi^2)$, $(\mu_\zeta, \sigma_\zeta^2)$, $(\mu_\eta, \sigma_\eta^2)$, and $(\mu_\omega, \sigma_\omega^2)$, respectively. Let r denote the value of log earnings from the administrative (register) data and let s be the corresponding value from the survey.
Kapteyn and Ypma (2007) assumed that any errors in the administrative data are due to a mismatch. Let πr be the probability with which the observed administrative value, r, equals the true value of the respondent's earnings, ξ. Then 1 – πr is the probability of a mismatch. In case of a mismatch, the administrative value r corresponds to the true value of someone else. This mismatched value will be denoted by ζ. According to the assumptions made above, ζ is not correlated with ξ. According to the same assumptions, the means and variances of ζ and ξ are allowed to differ. This reflects the background of the data collection. The survey data only cover a subset of the population restricted to individuals age 50 and older and their spouses, but a mismatch may involve an individual from the entire administrative database, which is a random sample of the entire Swedish population. To formalize this part of the model, the register data r are mixture normal with
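$$r = \begin{cases} \xi & \text{with probability } \pi_r, \\ \zeta & \text{with probability } 1 - \pi_r, \end{cases}$$

where the first case reflects a match and the second case reflects a mismatch.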
The second part of the model covers the survey data. Three cases are distinguished: earnings are either reported without error, with error, or with error and contamination. The probability that the observed survey value is correct is denoted by πs. With probability 1 – πs the survey data contain response error, part of which is mean-reverting, as expressed by the term ρ(ξ – μξ), where ρ < 0 implies a mean-reverting response error in the sense of Bound and Krueger (1991). When the data contain response error, there is a probability of πω that the data are contaminated, modeled by adding an additional error term, ω. Contamination can, for example, result from erroneously reporting monthly earnings as annual or vice versa. Collecting the elements of this second part of the model, the survey data s are mixture normal with
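$$s = \begin{cases} \xi & \text{with probability } \pi_s, \\ \xi + \rho(\xi - \mu_\xi) + \eta & \text{with probability } (1 - \pi_s)(1 - \pi_\omega), \\ \xi + \rho(\xi - \mu_\xi) + \eta + \omega & \text{with probability } (1 - \pi_s)\pi_\omega. \end{cases}$$

Having established the distributions of r and s separately, the distribution of (r, s) is a mixture of bivariate normals, with 2 × 3 = 6 classes.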
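The data-generating process described in this section can be sketched as a small simulation. This is a minimal illustration, not the authors' code: the function name `draw` is hypothetical, the parameter values are the rounded estimates reported later in Table 4 and section 6.2, and the classes are numbered 1 to 6 in the order of Table 2.

```python
import random

# Rounded estimates from Table 4 and section 6.2 (assumed here for illustration).
MU = {"xi": 12.28, "zeta": 9.20, "eta": -0.05, "omega": -0.31}
VAR = {"xi": 0.51, "zeta": 3.27, "eta": 0.01, "omega": 1.52}
PI_R, PI_S, PI_W, RHO = 0.96, 0.15, 0.16, -0.01


def draw(rng):
    """Draw one (xi, r, s, class) tuple from the six-class mixture."""
    def normal(name):
        return rng.gauss(MU[name], VAR[name] ** 0.5)

    xi = normal("xi")                      # true log earnings
    match = rng.random() < PI_R            # register record correctly matched?
    r = xi if match else normal("zeta")    # mismatch: someone else's earnings

    if rng.random() < PI_S:                # survey value reported without error
        s, s_case = xi, 1
    else:                                  # response error, partly mean-reverting
        s, s_case = xi + RHO * (xi - MU["xi"]) + normal("eta"), 2
        if rng.random() < PI_W:            # additional contamination
            s, s_case = s + normal("omega"), 3

    cls = (0 if match else 3) + s_case     # class index 1..6 as in Table 2
    return xi, r, s, cls
```

Calling `draw(random.Random(0))` repeatedly gives a synthetic sample in which class 1 observations lie exactly on the line r = s, mirroring the pattern discussed for Figure 1.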
3 The general mixture factor analysis model
As we will show in section 6, the Kapteyn-Ypma model is a special case of a mixture factor analysis model, which has been applied in the literature before. Therefore, it is worthwhile to develop our theory in a generalized form.
Let yn be a vector of observed random variables for observation n. We assume that there are J types or classes of observations, denoted by j = 1, . . . , J. Let πnj be the (unconditional) probability that observation n is from class j. In our model, πnj will typically be strictly between 0 and 1, so that class membership is not known with certainty, although we discuss known membership below as well. In each class the standard factor analysis (FA) model (e.g., Wansbeek and Meijer, 2000, chap. 7) is assumed to hold under normality. Thus, it is assumed that an observation from class j is generated from the model
$$y_n = \tau_{nj} + B_{nj}\,\xi_n + \varepsilon_n, \tag{1}$$

where $\xi_n \sim N(\kappa_{nj}, \Phi_{nj})$ is a vector of unobserved (latent) random variables for observation n and $\varepsilon_n \sim N(0, \Omega_{nj})$ is a vector of random errors, which are assumed to be independent of ξn, given the value of j. The errors may be interpreted as measurement errors or disturbances, depending on the application. Our aim is to predict ξn given an observed value of yn.
The vector of observed variables yn follows a multivariate finite mixture distribution, in particular a finite mixture of multivariate normals. Because of the FA structure of (1), this is a mixture FA model (e.g., Yung, 1997). Unlike the standard FA case where observations are routinely centered and hence intercepts play no role, we explicitly incorporate intercepts since they will generally vary across classes and constitute an essential ingredient of the predictors. The parameter vectors τnj and κnj and matrices Bnj, Φnj, and Ωnj are allowed to differ across classes.
We have not attached j subscripts to ξn and εn, because our aim is to predict ξn and it is conceptually easier if the meaning of this is the same for all classes. For example, in the Kapteyn-Ypma model, ξn captures the true value of log earnings, which is defined irrespective of whether earnings are reported with measurement error or whether there is a mismatch of records, or both. However, we allow the distributions of ξn and εn to be different for different classes, so omitting the j subscript does not reduce the generality of the model. Applications with class-specific factors ξnj can be incorporated into our notation by stacking all ξnj, j = 1, . . . , J into the vector ξn, and letting the columns of Bnj corresponding with ξnk, k ≠ j be zero. Our analysis below accommodates rank-deficient matrices, so our setup allows for such a specification.
In the notation used so far, we have also allowed the parameter vectors τnj and κnj, the parameter matrices Bnj, Φnj, and Ωnj, and the class probabilities πnj to differ across observations n. In a typical application of mixture FA models, the parameters will be assumed constant within the class, but allowing them to be observation-specific includes several models of interest. In particular, this includes all mixtures of linear structural equation models (e.g., Jedidi et al., 1997) and mixtures of linear random coefficient models. These models write the parameters as functions of covariates or of “deeper” parameters, or both. See the supplemental online appendix for further elaboration.
In line with the factor score prediction literature, we assume the parameters to be known and we abstract from estimation issues, and hence from issues of identification. Furthermore, we assume independence of observations. Consequently, only the observed variables yn and observation-specific parameters of observation n are relevant for our purpose of predicting the latent variable ξn. All other observations can be ignored. Therefore, in the remainder of this paper, we drop the observation index n. Hence, our model for a generic observation from class j is now written as
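$$y = \tau_j + B_j\,\xi + \varepsilon, \tag{2a}$$

$$\xi \sim N(\kappa_j, \Phi_j), \tag{2b}$$

$$\varepsilon \sim N(0, \Omega_j), \tag{2c}$$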
and the probability that the observation falls in class j is πj. A direct implication of (2) is
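$$\begin{pmatrix} y \\ \xi \end{pmatrix} \Big|\; j \;\sim\; N\!\left( \begin{pmatrix} \mu_j \\ \kappa_j \end{pmatrix}, \begin{pmatrix} \Sigma_j & B_j\Phi_j \\ \Phi_j B_j' & \Phi_j \end{pmatrix} \right), \tag{3}$$

with μj ≡ τj + Bjκj and $\Sigma_j \equiv B_j\Phi_j B_j' + \Omega_j$. The mean μy and covariance matrix Σy of y, the mean μξ and covariance matrix Σξ of ξ, and the covariance matrix Σξy of ξ and y follow by combining these expressions and the class probabilities πj.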
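As an illustration of how the class moments combine into the moments of y, the following sketch handles the case of a single (scalar) factor, which is the case relevant for the Kapteyn-Ypma model. The function names are hypothetical, and the mixture covariance is computed with the usual law of total variance, which is an assumption about how "combining these expressions" is carried out.

```python
def class_moments(tau, b, kappa, phi, omega):
    """Mean vector and covariance matrix of y within one class (scalar factor).

    tau, b : lists of length m (intercepts and loadings)
    kappa, phi : mean and variance of the factor in this class
    omega : m x m error covariance matrix as a list of lists
    """
    m = len(tau)
    mu = [tau[i] + b[i] * kappa for i in range(m)]                 # tau_j + B_j kappa_j
    sigma = [[b[i] * phi * b[k] + omega[i][k] for k in range(m)]   # B_j Phi_j B_j' + Omega_j
             for i in range(m)]
    return mu, sigma


def mixture_moments(pis, mus, sigmas):
    """Unconditional mean and covariance of y from class probabilities and class moments."""
    m = len(mus[0])
    mu_y = [sum(p * mu[i] for p, mu in zip(pis, mus)) for i in range(m)]
    # E(yy') is the pi-weighted average of Sigma_j + mu_j mu_j'; subtract mu_y mu_y'.
    sigma_y = [[sum(p * (sg[i][k] + mu[i] * mu[k]) for p, mu, sg in zip(pis, mus, sigmas))
                - mu_y[i] * mu_y[k] for k in range(m)] for i in range(m)]
    return mu_y, sigma_y
```

The covariance of the mixture exceeds the weighted average of the class covariances whenever the class means differ, which is why intercepts matter in the mixture case.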
4 Factor score prediction with one class
For reference later on, we recapitulate some basics of factor score prediction when there is just one class. Given our earlier assumptions, all variables are normally distributed in this case, although a large part of the theory of factor score prediction still holds under nonnormality, see, e.g. Wansbeek and Meijer (2000, sec. 7.3). We use the notation as above, but without the class identifier j. Let
$$\hat\xi \equiv \kappa + \Phi B'\Sigma^{-1}(y - \mu), \tag{4}$$

$$F \equiv \Phi - \Phi B'\Sigma^{-1}B\Phi. \tag{5}$$

Using the expression for the conditional normal distribution we obtain $\xi \mid y \sim N(\hat\xi, F)$. We consider the minimization of the mean-squared error of a function of y as a predictor of ξ. Using a standard result in statistics, the MSE is minimized by the conditional expectation. So $\hat\xi$ is the minimum MSE predictor of ξ.
In general, for reasons of simplicity, we are particularly interested in predictors that are linear in y. Since the minimum MSE predictor is already linear in y, imposing the restriction that our minimum MSE predictor be linear leads to $\hat\xi_L = \hat\xi$, where the subscript L denotes restriction to the class of linear predictors. Below, we will see that this does not carry over to the mixture model and that linearity is a binding restriction there.
Now suppose that we view (2) as a linear regression model, with a fixed unknown coefficient vector ξ that we would like to estimate, instead of as a random vector that we would like to predict. Or, equivalently, consider conditioning on the true value of ξ. (See Anderson and Rubin 1956, for these two views.) Then it follows that the predictor $\hat\xi$ is biased in the sense that

$$\mathrm{E}(\hat\xi \mid \xi) = \kappa + \Phi B'\Sigma^{-1}B(\xi - \kappa) \neq \xi.$$

This may be considered a drawback of $\hat\xi$. An unbiased predictor can be obtained as follows. The model, rewritten as y – τ = Bξ + ε, can be interpreted as a regression model with ξ as the vector of regression coefficients, B the matrix of regressors, and y – τ as the dependent variable. In this model the GLS estimator is the best linear unbiased estimator (BLUE) of ξ, and under normality it is also the best unbiased estimator (BUE) as it attains the Cramér-Rao bound (e.g., Amemiya 1985, sec. 1.3.3). So, letting subscript U denote an unbiasedness restriction in the sense of $\mathrm{E}(\hat\xi_U \mid \xi) = \xi$ for all ξ, and LU the predictor that is obtained by imposing both linearity and unbiasedness, we obtain that

$$\hat\xi_{LU} = \hat\xi_U = (B'\Omega^{-1}B)^{-1}B'\Omega^{-1}(y - \tau) \tag{6}$$

minimizes the conditional MSE, $\mathrm{E}[(\hat\xi_U - \xi)(\hat\xi_U - \xi)' \mid \xi]$, for every possible value of ξ, conditional on unbiasedness. This implies that it must also minimize the unconditional MSE, $\mathrm{E}[(\hat\xi_U - \xi)(\hat\xi_U - \xi)']$, among unbiased predictors and thus that it is the best unbiased predictor of ξ.
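To make the contrast between the two single-class predictors concrete, here is a minimal sketch for the special case of one scalar factor and a diagonal, nonsingular error covariance matrix. Under these simplifying assumptions the conditional mean can be written in precision-weighted form, which avoids a matrix inverse; the function name is hypothetical.

```python
def predictors_single_class(y, tau, b, omega_diag, kappa, phi):
    """Return (xi_hat, xi_hat_U) for a scalar factor and diagonal Omega.

    xi_hat   : conditional mean E(xi | y), the minimum MSE predictor
    xi_hat_U : GLS-based predictor, unbiased conditional on xi
    """
    # B' Omega^{-1} B and B' Omega^{-1} (y - tau) for diagonal Omega
    info = sum(bi * bi / w for bi, w in zip(b, omega_diag))
    score = sum(bi * (yi - ti) / w for bi, yi, ti, w in zip(b, y, tau, omega_diag))

    xi_hat_u = score / info
    # Precision-weighted combination of the prior mean kappa and the GLS signal;
    # algebraically equal to kappa + phi * B' Sigma^{-1} (y - mu) in this special case.
    xi_hat = (kappa / phi + score) / (1.0 / phi + info)
    return xi_hat, xi_hat_u
```

For example, with two indicators that both equal 3, unit loadings, unit error variances, and a factor with mean 1 and variance 1, the unbiased predictor returns 3 while the conditional mean is shrunk toward the factor mean and returns 7/3.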
5 A taxonomy of predictors
With known class membership, the problem of factor score prediction in the mixture FA model reduces to factor score prediction in the single class model discussed in the previous section. The resulting within-class predictors have the same optimality properties as in the single class model. In general, class membership is unknown, and there appear to be three natural ways to proceed: (1) compute the within-class predictors for each class and combine them in a weighted average; (2) predict class membership and then use the within-class predictor for the predicted class; and (3) derive predictors that minimize the total mean squared prediction error.
The weighted average is based on the analogy with the means of y and ξ. In these means, the (unconditional) class probabilities πj are the appropriate weights, but for predictors of ξ, we can also consider using conditional class probabilities given y. Using Bayes’ rule, these conditional probabilities are
$$p_j(y) \equiv \Pr(\text{class} = j \mid y) = \frac{\pi_j f_j(y)}{\sum_{k=1}^{J} \pi_k f_k(y)}, \tag{7}$$
where fj(y) is the normal density with mean μj and variance Σj, evaluated in y. We will consider both alternatives.
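The Bayes' rule computation of the conditional class probabilities can be sketched as follows for two observed indicators. The helper names are hypothetical, and the sketch assumes every class has a nonsingular covariance matrix; degenerate classes need the adapted expressions mentioned in section 5.5.

```python
import math


def normal_pdf2(y, mu, sigma):
    """Density of a bivariate normal with nonsingular 2x2 covariance matrix sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    u, v = y[0] - mu[0], y[1] - mu[1]
    # Quadratic form (y - mu)' Sigma^{-1} (y - mu), using the explicit 2x2 inverse.
    quad = (d * u * u - (b + c) * u * v + a * v * v) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_class_probs(y, pis, mus, sigmas):
    """Conditional class probabilities p_j(y) from unconditional probabilities pi_j."""
    weighted = [p * normal_pdf2(y, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(weighted)   # mixture density of y; assumed positive here
    return [w / total for w in weighted]
```

For observations far from all class means the densities can underflow to zero, in which case working with log densities would be the safer choice.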
The two-stage predictor would share the desired statistical properties of its second-stage single-class predictor if the prediction of class membership in the first stage were error-free. We expect that if class membership can be predicted with a high probability, or if the within-class predictors for the most likely misclassifications in this first stage are very similar, then the two-stage predictor will have desirable properties, such as small MSE.
The system-wide predictors that minimize total MSE apply the methods for deriving single-class predictors to the mixture problem. If we can find solutions to these problems, these are optimal by construction, but their expressions may be complicated and it may not be possible to find solutions.
These three general approaches, the unconditional and conditional weighting for the weighted predictors, and the linearity and/or unbiasedness restrictions lead to an array of predictors for the mixture model, which can be grouped into a taxonomy. The resulting predictors are grouped in Table 1. The rows indicate the approaches and the type of weighting. The columns reflect the linearity and unbiasedness restrictions. In the next subsections we systematically develop all these predictors, although this turns out to be impossible for two of the system-wide predictors (in brackets in the table).
Table 1.
A taxonomy of predictors in a mixture structural equation model.
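| Approach | Linear unbiased (LU) | Unbiased (U) | Linear (L) | Unrestricted |
|---|---|---|---|---|
| Within class (j) | | $\hat\xi_{U,j}$ | | $\hat\xi_j$ |
| Weighted, unconditional | | $\hat\xi_U$ | | $\hat\xi$ |
| Weighted, conditional | | $\hat\xi_U^{*}$ | | $\hat\xi^{*}$ |
| Two-stage | | $\hat\xi_{U,\hat J}$ | | $\hat\xi_{\hat J}$ |
| System-wide | [$\tilde\xi_{LU}$] | [$\tilde\xi_U$] | $\tilde\xi_L$ | $\tilde\xi$ |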
5.1 Within-class predictors
The first row of Table 1 presents the predictors from the classes. Within a class the distribution is normal, so the linearity restriction is redundant, as argued above, and just two instead of four predictors are given. We extend the notation of section 4 by adding the class subscript j.
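Note that the first predictor, $\hat\xi_{U,j}$, is placed under the heading ‘unbiased’ due to the reasoning along the lines of a single class; if class j were the only class present, this predictor would be unbiased, but this quality is lost in a multiclass context. It is of interest to elaborate this. From (6),

$$\hat\xi_{U,j} = B_j^{-}(y - \tau_j), \qquad B_j^{-} \equiv (B_j'\Omega_j^{-1}B_j)^{-1}B_j'\Omega_j^{-1},$$

where the notation $B_j^{-}$ is chosen since it is a generalized inverse of Bj. Consequently,

$$\mathrm{E}(\hat\xi_{U,j} \mid \xi) = B_j^{-}\,[\mathrm{E}(y \mid \xi) - \tau_j].$$

Let

$$p_j(\xi) \equiv \Pr(\text{class} = j \mid \xi) = \frac{\pi_j g_j(\xi)}{\sum_{k=1}^{J} \pi_k g_k(\xi)}, \tag{8}$$

where gj(ξ) is the normal density with mean κj and variance Φj, evaluated in ξ. Then

$$\mathrm{E}(y \mid \xi) = \sum_{j=1}^{J} p_j(\xi)(\tau_j + B_j\xi) = \bar\tau(\xi) + \bar B(\xi)\,\xi, \tag{9}$$

with $\bar\tau(\xi) \equiv \sum_j p_j(\xi)\tau_j$, $\bar B(\xi) \equiv \sum_j p_j(\xi)B_j$, and $\mathrm{E}(\hat\xi_{U,j} \mid \xi) = B_j^{-}[\bar\tau(\xi) - \tau_j + \bar B(\xi)\xi]$. For unbiasedness, the latter should be equal to ξ for all ξ. This will only hold in special cases. The only case we know of is when τj and Bj do not vary across classes. The heterogeneity is then restricted to the case where the intercepts and factor loadings are homogeneous, and only the distributions of ξ and ε have a mixture character.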
5.2 Weighted predictors
Given the predictors for a single class, a natural way to proceed to obtain a predictor in the mixture model is to combine them into a weighted average of the predictors per class. Potential weights for these weighted averages are the unconditional class membership probabilities πj, by analogy with the mean of ξ, and the conditional class membership probabilities pj(y). The second row of Table 1 lists the predictors using the unconditional probabilities,
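$$\hat\xi_U = \sum_{j=1}^{J} \pi_j\,\hat\xi_{U,j}, \qquad \hat\xi = \sum_{j=1}^{J} \pi_j\,\hat\xi_j.$$

Using the results from section 5.1, we can write

$$\mathrm{E}(\hat\xi_U \mid \xi) = \bar B^{-}[\bar\tau(\xi) + \bar B(\xi)\xi] - \overline{B^{-}\tau},$$

where $\bar B^{-} \equiv \sum_j \pi_j B_j^{-}$ and $\overline{B^{-}\tau} \equiv \sum_j \pi_j B_j^{-}\tau_j$. Again, $\hat\xi_U$ will be unbiased only in special cases.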
The results for weighting with the conditional class membership probabilities are displayed in the next row of the table. An asterisk distinguishes these predictors from the previous ones. Then

$$\hat\xi_U^{*} = \sum_{j=1}^{J} p_j(y)\,\hat\xi_{U,j}, \qquad \hat\xi^{*} = \sum_{j=1}^{J} p_j(y)\,\hat\xi_j. \tag{10}$$

Notice that the weights depend nonlinearly on y and hence the linearity property is lost under conditional weighting. Also, these predictors will generally not be unbiased.
5.3 Two-stage predictors
An intuitively appealing alternative to computing a weighted predictor is to predict class membership first and then proceed as if class membership were known and use the class-specific predictor of the predicted class. Hence, this leads to two-stage predictors, analogous to two-step estimators. The obvious class predictor is the class with the highest conditional probability, i.e., Ĵ ≡ arg maxj pj(y). Then the two-stage predictors, listed in the fourth row of Table 1, are $\hat\xi_{U,\hat J}$ and $\hat\xi_{\hat J}$, i.e., the within-class predictors evaluated at the predicted class.
Again, unbiasedness and linearity are lost.
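The two-stage rule can be written in a few lines once the conditional class probabilities and the within-class predictions are available. The sketch below uses a hypothetical function name and zero-based class indices; ties are broken in favor of the lowest index, which is an arbitrary choice not discussed in the text.

```python
def two_stage_predictor(post, within):
    """Two-stage prediction: pick the most probable class, then use its predictor.

    post   : conditional class probabilities p_j(y), one per class
    within : within-class predictions of xi, in the same class order
    Returns the (zero-based) predicted class index and the resulting prediction.
    """
    j_hat = max(range(len(post)), key=lambda j: post[j])
    return j_hat, within[j_hat]
```

With `post = [0.2, 0.7, 0.1]` and `within = [1.0, 2.0, 3.0]`, the two-stage rule selects the second class and predicts 2.0, whereas conditional weighting of the same inputs would give 0.2·1 + 0.7·2 + 0.1·3 = 1.9.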
5.4 System-wide predictors
The bottom row of Table 1 presents the system-wide predictors that are obtained by minimizing the MSE with or without the restrictions. Since in the mixture model normality does not hold any longer, linearity is not obtained as a byproduct, and the full range of four different predictors has to be considered. We use a tilde to denote system-wide predictors.
Prediction under linearity and unbiasedness
The starting point for finding an unbiased predictor that is linear in y is to find a matrix L and a vector b such that $\tilde\xi_{LU} = Ly + b$ satisfies $\mathrm{E}(\tilde\xi_{LU} \mid \xi) = \xi$ for all ξ. Using (9) we obtain

$$\mathrm{E}(\tilde\xi_{LU} \mid \xi) = L[\bar\tau(\xi) + \bar B(\xi)\xi] + b.$$

This should be equal to ξ for any ξ, so also for ξ = 0. Consequently, $b = -L\bar\tau(0)$. Hence

$$L[\bar\tau(\xi) - \bar\tau(0) + \bar B(\xi)\xi] = \xi$$

should hold for all ξ. Because of the (nonlinear) dependence of $\bar\tau(\xi)$ and $\bar B(\xi)$ on ξ, a matrix L with this property only exists under special circumstances. We conjecture that it only exists when τj and Bj do not vary across classes. Hence, a linear unbiased predictor will generally not exist. If τj and Bj do not vary across classes, the model can be reinterpreted as a single-class FA model, in which the factors and errors follow nonnormal distributions, in particular mixture normal distributions. Then the minimum MSE linear unbiased predictor follows from the expressions in section 4, which do not depend on normality.
Prediction under unbiasedness
Imposing linearity may preclude a solution. We relax this requirement, but maintain unbiasedness. We then consider a predictor $\tilde\xi_U = h(y)$, where h(y) is a function of y such that E(h(y) | ξ) = ξ for all ξ. Using (8), this requirement for h(y) can be rewritten as

$$\sum_{j=1}^{J} p_j(\xi)\,\mathrm{E}[h(y) \mid \xi, j] = \xi$$

for all ξ. Attempts to derive the predictor led to expressions without tractable solutions, except for the special case in which τj and Bj do not vary across classes, which was discussed above.
Prediction under linearity
This case is conceptually the simplest one. It is based on the idea of linear projection. According to a well-known result (e.g., Angrist and Pischke, 2009, Theorem 3.1.5 for the univariate case in a slightly different form), the MSE is minimized over all linear functions by
$$\tilde\xi_L = \mu_\xi + \Sigma_{\xi y}\Sigma_y^{-1}(y - \mu_y). \tag{11}$$
Note that the structure of the model is only used in computing the moments μξ, μy, Σξy, and Σy. Derivation and optimality (minimum MSE within the class of linear predictors) of the predictor follow from general principles and do not use the model structure explicitly.
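As a sketch of the linear projection for a scalar factor and two indicators, the function below takes the moments as given and applies the explicit inverse of a 2 × 2 matrix. The function name and argument layout are hypothetical, and a nonsingular Σy is assumed.

```python
def linear_projection(y, mu_xi, mu_y, sigma_xiy, sigma_y):
    """Best linear predictor mu_xi + Sigma_xiy Sigma_y^{-1} (y - mu_y).

    y, mu_y   : lists of length 2
    sigma_xiy : covariances of xi with the two indicators (list of length 2)
    sigma_y   : nonsingular 2x2 covariance matrix of y (list of lists)
    """
    (a, b), (c, d) = sigma_y
    det = a * d - b * c
    # Projection weights w = Sigma_xiy Sigma_y^{-1}, using the explicit 2x2 inverse.
    w0 = (sigma_xiy[0] * d - sigma_xiy[1] * c) / det
    w1 = (-sigma_xiy[0] * b + sigma_xiy[1] * a) / det
    return mu_xi + w0 * (y[0] - mu_y[0]) + w1 * (y[1] - mu_y[1])
```

The function only needs first and second moments, which reflects the point made above that the model structure enters solely through μξ, μy, Σξy, and Σy.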
Prediction without restrictions
The approach here rests on the fact that the MSE is minimized by the mean of the conditional distribution. As with the linear projection result above, this result is generic and does not depend on the structure of the model. Of course, for the implementation, we need an expression of this conditional mean, which is derived from the model structure. Adapting results from section 4, we have $\xi \mid y, j \sim N(\hat\xi_j, F_j)$, with $\hat\xi_j = \kappa_j + \Phi_j B_j'\Sigma_j^{-1}(y - \mu_j)$ and $F_j = \Phi_j - \Phi_j B_j'\Sigma_j^{-1}B_j\Phi_j$. So ξ | y is mixture normal with class membership probabilities Pr(class = j | y) = pj(y) with pj(y) given by (7). Thus, the system-wide predictor without restrictions is given by

$$\tilde\xi = \mathrm{E}(\xi \mid y) = \sum_{j=1}^{J} p_j(y)\,\hat\xi_j,$$

which equals $\hat\xi^{*}$, the predictor obtained by weighting of the unrestricted predictors per class using the conditional class probabilities. The conditional variance of this predictor is

$$\mathrm{Var}(\xi \mid y) = \sum_{j=1}^{J} p_j(y)F_j + \sum_{j=1}^{J} p_j(y)\,\hat\xi_j\hat\xi_j' - \tilde\xi\,\tilde\xi'.$$
The first term on the right-hand side is the weighted average of the variances within the classes, and the other terms represent the between-class variance.
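For a scalar factor, the conditional mean and the within/between decomposition of the conditional variance can be sketched as follows. The function name is hypothetical; the inputs are the conditional class probabilities together with the within-class conditional means and variances.

```python
def mixture_posterior_moments(post, means, variances):
    """Mean and variance of a scalar mixture normal xi | y.

    post      : conditional class probabilities p_j(y)
    means     : within-class conditional means of xi
    variances : within-class conditional variances of xi
    """
    mean = sum(p * m for p, m in zip(post, means))
    within = sum(p * v for p, v in zip(post, variances))                 # average within-class variance
    between = sum(p * m * m for p, m in zip(post, means)) - mean * mean  # between-class variance
    return mean, within + between
```

With two equally likely classes whose conditional means are 0 and 2 and whose conditional variances are both 1, the predictor is 1 and the conditional variance is 1 + 1 = 2, so disagreement between the classes adds directly to the prediction uncertainty.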
5.5 Extensions
The derivations up till now assumed that all covariance matrices are nonsingular and that no extraneous information is available about class membership. However, in some cases, including the Kapteyn-Ypma model, these assumptions are not met. In this section, we generalize our methods to deal with this.
In the case of a labeled class, it is known a priori to which class an observation belongs, or this follows deterministically from the covariates. Thus, we have observed (measured) rather than unobserved (unmeasured) heterogeneity. More generally, we may know that an observation belongs to a certain subset of the classes. This can be viewed as a limiting case of observation-specific unconditional class probabilities πnj, and thus is accommodated by our general setup.
Degeneracy denotes singular covariance matrices or factor loadings matrices that do not have full column rank. Several types of degeneracies are conceivable in a single-class factor analysis model. For substantive reasons, the errors ε of some of the indicators may be assumed to be identically zero, leading to a singular covariance matrix Ω. This may also be the result of the estimation procedure without a priori restrictions, i.e., a Heywood case. Similarly, the covariance matrix Φ of the factors may be singular. Finally, the factor loadings matrix B may not be of full column rank. The latter does not necessarily lead to a degenerate distribution, but it does imply that some of the formulas presented earlier cannot be applied. In a single-class factor analysis setting, singularity of Φ or deficient column rank of B are not particularly interesting or relevant, because the model is then equivalent to a model with fewer factors and thus can be reduced to a more parsimonious one. However, in a mixture setting, such forms of degeneracy may occur quite naturally. If Ω is nonsingular, then the covariance matrix Σ of the observed indicators is also nonsingular. Thus, a necessary condition for singularity of Σ is singularity of Ω.
The formulas for the single-class predictors cannot be directly applied in the case of degeneracies, because the required inverses may not exist. Furthermore, when there are classes with degenerate distributions of the indicators y, (7) does not hold as y does not have a proper density. The supplemental online appendix gives the expressions for the within-class predictors and conditional probabilities in the case of degeneracies.
With these adapted expressions, prediction of the factor scores is relatively straightforward. For the non-system-wide predictors, we can apply the standard expressions, using the adapted formulas for the within-class predictors and the conditional probabilities. For the system-wide linear predictor, (3) is still valid, and thus the predictor is still given by (11), with $\Sigma_y^{-1}$ replaced by the Moore-Penrose generalized inverse $\Sigma_y^{+}$ if $\Sigma_y$ is singular. The system-wide unrestricted predictor is again equivalent to the conditionally weighted unrestricted predictor $\hat\xi^{*}$. We will apply these methods in section 6.
6 Application to the Kapteyn-Ypma model
We can now express the parameters of the general mixture FA model in terms of the parameters of the Kapteyn-Ypma model. We have y = (r, s)’ and κj = μξ and $\Phi_j = \sigma_\xi^2$ for all classes. The correspondence between the remaining parameters and the general setup is shown in Table 2. A notable feature is that Ωj is singular for classes 1 through 4. So there is a degree of degeneracy in the model, cf. the discussion in section 5.5. Moreover Σ1 is singular, because in class 1, y takes on values in the subspace y1 = y2, i.e., r = s, whereas Σj is nonsingular for j = 2, . . . , 6.
Table 2.
Parameterization of the Swedish earnings data model in terms of a mixture structural equation model.
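| Class (j) | r | s | πj | τj | Bj | Ωj |
|---|---|---|---|---|---|---|
| 1 | R1 | S1 | πrπs | $(0, 0)'$ | $(1, 1)'$ | $\mathrm{diag}(0, 0)$ |
| 2 | R1 | S2 | πr(1 – πs)(1 – πω) | $(0, \mu_\eta - \rho\mu_\xi)'$ | $(1, 1 + \rho)'$ | $\mathrm{diag}(0, \sigma_\eta^2)$ |
| 3 | R1 | S3 | πr(1 – πs)πω | $(0, \mu_\eta + \mu_\omega - \rho\mu_\xi)'$ | $(1, 1 + \rho)'$ | $\mathrm{diag}(0, \sigma_\eta^2 + \sigma_\omega^2)$ |
| 4 | R2 | S1 | (1 – πr)πs | $(\mu_\zeta, 0)'$ | $(0, 1)'$ | $\mathrm{diag}(\sigma_\zeta^2, 0)$ |
| 5 | R2 | S2 | (1 – πr)(1 – πs)(1 – πω) | $(\mu_\zeta, \mu_\eta - \rho\mu_\xi)'$ | $(0, 1 + \rho)'$ | $\mathrm{diag}(\sigma_\zeta^2, \sigma_\eta^2)$ |
| 6 | R2 | S3 | (1 – πr)(1 – πs)πω | $(\mu_\zeta, \mu_\eta + \mu_\omega - \rho\mu_\xi)'$ | $(0, 1 + \rho)'$ | $\mathrm{diag}(\sigma_\zeta^2, \sigma_\eta^2 + \sigma_\omega^2)$ |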
Note: R1 = correct match; R2 = mismatch; S1 = no measurement error; S2 = measurement error; S3 = measurement error and contamination.
6.1 Predictors
Given this structure, we now turn to the issue of prediction. We discuss a number of predictors, following the taxonomy presented in Table 1.
Within-class predictors
Formulas for the predictors per class are given in Table 3. Notice that for classes 5 and 6, the expressions for $\hat\xi_{U,j}$ can be obtained from the corresponding expressions for $\hat\xi_j$ by setting $\sigma_\eta^2 = \sigma_\omega^2 = 0$, i.e., by ignoring the measurement error variances in the computation.
Table 3.
Expressions for the within-class predictors for the Swedish earnings data model as functions of the parameters.
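| Class (j) | r | s | $\hat\xi_{U,j}$ | $\hat\xi_j$ |
|---|---|---|---|---|
| 1 | R1 | S1 | $(r + s)/2$ | $(r + s)/2$ |
| 2 | R1 | S2 | r | r |
| 3 | R1 | S3 | r | r |
| 4 | R2 | S1 | s | s |
| 5 | R2 | S2 | $\mu_\xi + \dfrac{s'}{1 + \rho}$ | $\mu_\xi + \dfrac{(1 + \rho)\sigma_\xi^2\, s'}{(1 + \rho)^2\sigma_\xi^2 + \sigma_\eta^2}$ |
| 6 | R2 | S3 | $\mu_\xi + \dfrac{s' - \mu_\omega}{1 + \rho}$ | $\mu_\xi + \dfrac{(1 + \rho)\sigma_\xi^2\,(s' - \mu_\omega)}{(1 + \rho)^2\sigma_\xi^2 + \sigma_\eta^2 + \sigma_\omega^2}$ |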
Notes: R1 = correct match; R2 = mismatch; S1 = no measurement error; S2 = measurement error; S3 = measurement error and contamination; s’ ≡ s – μξ – μη.
Weighted predictors
Moving down the rows of Table 1, we obtain overall predictors by weighting the $\hat\xi_{U,j}$ and $\hat\xi_j$ with the probabilities πj to obtain $\hat\xi_U = ar + bs + c$ and $\hat\xi = a'r + b's + c'$, where a, b, c, a’, b’, and c’ depend on the various parameters. The detailed expressions can be found in the supplemental online appendix.
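The weighting step can be illustrated numerically. The sketch below derives the coefficients (a, b, c) of the within-class predictors directly from the model of section 2, using the rounded estimates reported in Table 4 and section 6.2, and then averages them with the class probabilities. The function names are hypothetical, and because the inputs are rounded the results will only approximately agree with the figures reported in Table 5.

```python
# Rounded estimates from Table 4 and section 6.2 (assumed here for illustration).
MU_XI, MU_ETA, MU_OM = 12.28, -0.05, -0.31
V_XI, V_ETA, V_OM = 0.51, 0.01, 1.52
PI_R, PI_S, PI_W, RHO = 0.96, 0.15, 0.16, -0.01


def survey_coeffs(err_mean, err_var, unbiased):
    """(a, b, c) of a*r + b*s + c in a mismatch class with survey response error."""
    g = 1 + RHO
    # Unbiased predictor inverts s = g*xi - rho*mu_xi + error; the conditional
    # mean shrinks toward mu_xi according to the signal-to-total variance ratio.
    slope = 1 / g if unbiased else g * V_XI / (g * g * V_XI + err_var)
    return (0.0, slope, MU_XI - slope * (MU_XI + err_mean))


def class_coeffs(unbiased):
    """Coefficients for classes 1..6, in the order of Table 2."""
    half = (0.5, 0.5, 0.0)                       # class 1: r = s, equal weights as in Table 5
    use_r, use_s = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    return [half, use_r, use_r, use_s,
            survey_coeffs(MU_ETA, V_ETA, unbiased),
            survey_coeffs(MU_ETA + MU_OM, V_ETA + V_OM, unbiased)]


def weighted_coeffs(unbiased):
    """Coefficients of the unconditionally weighted predictor."""
    s_probs = (PI_S, (1 - PI_S) * (1 - PI_W), (1 - PI_S) * PI_W)
    pis = [pr * ps for pr in (PI_R, 1 - PI_R) for ps in s_probs]
    coeffs = class_coeffs(unbiased)
    return tuple(sum(p * c[i] for p, c in zip(pis, coeffs)) for i in range(3))
```

Because about 96 percent of the probability mass sits in the matched classes, the weight on r dominates in both weighted predictors, which is the feature that makes them vulnerable when r is in fact a mismatch.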
Conditionally weighted predictors
Consider the event r = s. If the observation is drawn from class 1, the conditional probability of observing this event is 1, whereas this conditional probability is 0 if the observation is drawn from any of the other five classes. Hence, the conditional probability that j = 1 given this event is 1: p1((r, r)’) = 1 for all r and pj((r, r)’) = 0 for j = 2, . . . , 6. Conversely, using the same logic, p1((r, s)’) = 0 for all s ≠ r. In the latter case,
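$$p_j((r, s)') = \frac{\pi_j f_j((r, s)')}{\sum_{k=2}^{6} \pi_k f_k((r, s)')}, \qquad j = 2, \ldots, 6,$$

where $f_j$ is the bivariate normal density of class $j$, as in (7). It follows that, if r = s, $\hat\xi^{*} = r$ and, if r ≠ s,

$$\hat\xi^{*} = \sum_{j=2}^{6} p_j((r, s)')\,\hat\xi_j. \tag{12}$$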
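The conditionally weighted predictor for this model can be sketched end to end. The code below is an illustration under stated assumptions rather than the authors' implementation: the class means and covariances of (r, s) are derived from the model in section 2, the parameter values are the rounded estimates from Table 4 and section 6.2, and the function names are hypothetical.

```python
import math

# Rounded estimates from Table 4 and section 6.2 (assumed here for illustration).
MU_XI, MU_ZETA, MU_ETA, MU_OM = 12.28, 9.20, -0.05, -0.31
V_XI, V_ZETA, V_ETA, V_OM = 0.51, 3.27, 0.01, 1.52
PI_R, PI_S, PI_W, RHO = 0.96, 0.15, 0.16, -0.01


def pdf2(r, s, mu, cov):
    """Bivariate normal density for a symmetric, nonsingular 2x2 covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    u, v = r - mu[0], s - mu[1]
    quad = (d * u * u - 2 * b * u * v + a * v * v) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def conditional_predictor(r, s):
    """E(xi | r, s): within-class predictors weighted by conditional class probabilities."""
    if r == s:
        return r                       # only class 1 puts probability mass on r = s
    g = 1 + RHO
    errors = {2: (MU_ETA, V_ETA), 3: (MU_ETA + MU_OM, V_ETA + V_OM)}
    terms = []                         # pairs (pi_j * f_j(r, s), within-class predictor)
    for case, (m_e, v_e) in errors.items():
        share = (1 - PI_S) * ((1 - PI_W) if case == 2 else PI_W)
        mean_s, var_s = MU_XI + m_e, g * g * V_XI + v_e
        # Classes 2 and 3: correct match, so r equals xi exactly.
        cov = ((V_XI, g * V_XI), (g * V_XI, var_s))
        terms.append((PI_R * share * pdf2(r, s, (MU_XI, mean_s), cov), r))
        # Classes 5 and 6: mismatch, so only s is informative about xi.
        cov = ((V_ZETA, 0.0), (0.0, var_s))
        xi_j = MU_XI + g * V_XI * (s - mean_s) / var_s
        terms.append(((1 - PI_R) * share * pdf2(r, s, (MU_ZETA, mean_s), cov), xi_j))
    # Class 4: mismatch, survey value reported without error.
    cov = ((V_ZETA, 0.0), (0.0, V_XI))
    terms.append(((1 - PI_R) * PI_S * pdf2(r, s, (MU_ZETA, MU_XI), cov), s))
    total = sum(w for w, _ in terms)
    return sum(w * x for w, x in terms) / total
```

Near the 45° line the predictor stays close to the register value, while for a pair such as r = 8.0 and s = 12.3 almost all conditional probability shifts to the mismatch classes and the prediction follows the survey value instead.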
Two-stage predictors
In the first stage, class membership is predicted according to the maximum of the pj(y). Thus, for r = s, class 1 is predicted and for r ≠ s, class 1 is dropped from consideration. Figure 2 shows which class is predicted for a given (r, s) combination with r ≠ s. This class predictor was also computed for the observed data points by Kapteyn and Ypma (2007, Fig. C1). From this figure, we see that close to the 45° line, where most of the observations lie in Figure 1, membership of class 2 is predicted, and farther away from this line but still in the middle of the distribution of r, membership of class 3 is predicted. Perhaps most interesting is the area in the middle-left where membership of class 5 is predicted. There are about a dozen points in this area in Figure 1, and this is where it is predicted that r is a mismatch. The resulting factor score predictor is the within-class predictor corresponding with the predicted class, given in Table 3.
Figure 2.
Predicted class membership Ĵ for each combination of register and survey data (excluding r = s).
System-wide predictors
The last row of Table 1 presents four predictors. As indicated in section 5.4, expressions for the two unbiased predictors $\tilde\xi_{LU}$ and $\tilde\xi_U$ could not be found. This leaves us with the projection-based predictor $\tilde\xi_L$ and the conditional expectation-based predictor $\tilde\xi$. The expression for the former is given in (11). We need to elaborate the mean and covariance matrices of y and ξ. By weighting the rows of Table 2 with the π's we obtain, after some straightforward algebra, expressions for the elements of μy, Σy and Σξy. Substitution in (11) yields again a linear expression, $\tilde\xi_L = \tilde a r + \tilde b s + \tilde c$. The full expression for the coefficients is given in the supplemental online appendix. Finally, $\tilde\xi = \hat\xi^{*}$, as was already stated in section 5.4. The formula for the latter has been given by (12).
6.2 Empirical results
We now turn to the empirical outcomes of the prediction process using the Kapteyn-Ypma estimates of the model parameters. Our objective is to predict earnings by combining the information from the survey and administrative data. We do so on the basis of a simple model, and of the estimation results from that model. Kapteyn and Ypma (2007) provide an extensive discussion of their results. Here, we only reproduce the relevant parameter estimates that we will use in implementing the various predictors. Estimated means and variances of the normal distributions are given in Table 4. The estimates of the other parameters are πr = .96, πs = .15, πω = .16, and ρ = –.01.
Table 4.
Parameter estimates for the Swedish earnings data model.
| | ξ | ζ | η | ω |
|---|---|---|---|---|
| μ | 12.28 | 9.20 | –0.05 | –0.31 |
| σ² | 0.51 | 3.27 | 0.01 | 1.52 |
Source: Kapteyn and Ypma (2007).
Table 5 presents results per class and the resulting unconditionally weighted predictors. The fourth column gives the class probabilities. The largest class is the second one, representing the case of measurement error, but no mismatch. Columns 5–10 present the numerical results corresponding with Table 3. The predictors in Table 3 have been written as linear functions ar + bs + c of r and s, the three weights a, b, and c being given first for the unbiased predictor and next for the unrestricted predictor.
Table 5.
Unconditional class membership probabilities for the Swedish earnings data model and predictors as linear expressions ar + bs + c.
| Class (j) | r | s | πj | a (unbiased) | b (unbiased) | c (unbiased) | a (unrestricted) | b (unrestricted) | c (unrestricted) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | R1 | S1 | .15 | 0.5 | 0.5 | 0 | 0.5 | 0.5 | 0 |
| 2 | R1 | S2 | .69 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | R1 | S3 | .13 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | R2 | S1 | .01 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | R2 | S2 | .03 | 0 | 1.01 | –0.11 | 0 | 0.99 | 0.12 |
| 6 | R2 | S3 | .01 | 0 | 1.01 | 0.19 | 0 | 0.25 | 9.32 |
| Weighted by πj | | | | 0.89 | 0.11 | –0.00 | 0.89 | 0.11 | 0.05 |
Notes: R1 = correct match; R2 = mismatch; S1 = no measurement error; S2 = measurement error; S3 = measurement error and contamination; probabilities do not sum to 1 due to rounding.
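The class probabilities in the fourth column of Table 5, and the weighted coefficients in its last row, can be approximately recovered from the three mixing parameters quoted in the text. The sketch below assumes that match status and survey-error type are drawn independently, so that each class probability is a product; this factorization is an assumption made for illustration, and because the inputs are rounded the results agree with the table only up to about 0.01.

```python
PI_R, PI_S, PI_OMEGA = 0.96, 0.15, 0.16  # estimates quoted in section 6.2

# Assumed factorization: match status and survey-error type are independent.
P_REGISTER = {"R1": PI_R, "R2": 1 - PI_R}
P_SURVEY = {
    "S1": PI_S,
    "S2": (1 - PI_S) * (1 - PI_OMEGA),
    "S3": (1 - PI_S) * PI_OMEGA,
}
CLASSES = {1: ("R1", "S1"), 2: ("R1", "S2"), 3: ("R1", "S3"),
           4: ("R2", "S1"), 5: ("R2", "S2"), 6: ("R2", "S3")}
CLASS_PROB = {j: P_REGISTER[rt] * P_SURVEY[st]
              for j, (rt, st) in CLASSES.items()}

# Within-class coefficients (a, b, c), copied from Table 5.
UNBIASED = {1: (0.5, 0.5, 0.0), 2: (1.0, 0.0, 0.0), 3: (1.0, 0.0, 0.0),
            4: (0.0, 1.0, 0.0), 5: (0.0, 1.01, -0.11), 6: (0.0, 1.01, 0.19)}
UNRESTRICTED = {1: (0.5, 0.5, 0.0), 2: (1.0, 0.0, 0.0), 3: (1.0, 0.0, 0.0),
                4: (0.0, 1.0, 0.0), 5: (0.0, 0.99, 0.12), 6: (0.0, 0.25, 9.32)}


def weighted_row(coefs):
    """Average the (a, b, c) rows using the unconditional class probabilities."""
    return tuple(sum(CLASS_PROB[j] * coefs[j][k] for j in CLASSES)
                 for k in range(3))


for j, p in CLASS_PROB.items():
    print(j, round(p, 3))
print("unbiased:    ", [round(x, 2) for x in weighted_row(UNBIASED)])
print("unrestricted:", [round(x, 2) for x in weighted_row(UNRESTRICTED)])
```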
As was already apparent from Table 3, the rows for classes 5 and 6 are the most interesting ones. These represent the cases of a mismatch, hence the value from the register is discarded and only the value from the survey is used; and the latter suffers from measurement error. If there is no (further) contamination, the unbiased predictor is obtained from the survey figure by slightly overweighting it to compensate for the mean-reverting measurement error. The unrestricted predictor is obtained by slightly downweighting it, due to shrinkage to the overall mean (cf. Wansbeek and Meijer 2000, p. 165). If there is contamination, the results diverge strongly. The unrestricted predictor is predominantly a constant, plus some effect from the survey, reflecting a large amount of shrinkage due to the large variance of the contamination term, so that more weight is given to the population mean. This phenomenon is sometimes called borrowing strength in the random coefficients literature (e.g., Raudenbush et al., 2004, p. 9). In contrast, the unbiased predictor is very similar to the predictor obtained when there is no contamination.
The last row of the table gives the results for the two unconditionally weighted predictors, which are obtained by weighting the preceding rows by the unconditional class membership probabilities given in the fourth column. Apart from the constant c, the results from the two predictors only differ from the third digit onwards, and imply an 89% weight to the register data and an 11% weight to the survey data. The linear projection-based predictor can also be expressed as a linear combination of r and s:
This is a striking result, with very low weight given to the register data, vastly unlike the case with the two unconditionally weighted predictors.
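The projection coefficients depend only on the first two moments of (ξ, r, s), so they can be recomputed from the mixture structure. The sketch below does this under the same illustrative assumptions about the model (r = ξ or ζ; s = ξ, ξ + ρ(ξ − μξ) + η, or that plus ω; independent normal components; rounded parameter values from section 6.2). It is not the closed-form expression from the online appendix, and its output should be read as approximate.

```python
import itertools

MU_XI, VAR_XI = 12.28, 0.51
MU_ZETA, VAR_ZETA = 9.20, 3.27
MU_ETA, VAR_ETA = -0.05, 0.01
MU_OMEGA, VAR_OMEGA = -0.31, 1.52
PI_R, PI_S, PI_OMEGA, RHO = 0.96, 0.15, 0.16, -0.01
K = 1.0 + RHO

# Register types: (probability, E[r], Var(r), 1 if r = xi else 0).
REGISTER = [(PI_R, MU_XI, VAR_XI, 1.0),
            (1 - PI_R, MU_ZETA, VAR_ZETA, 0.0)]
# Survey types: (probability, E[s], Var(s), Cov(xi, s)).
V_S2 = K * K * VAR_XI + VAR_ETA
SURVEY = [(PI_S, MU_XI, VAR_XI, VAR_XI),
          ((1 - PI_S) * (1 - PI_OMEGA), MU_XI + MU_ETA, V_S2, K * VAR_XI),
          ((1 - PI_S) * PI_OMEGA, MU_XI + MU_ETA + MU_OMEGA,
           V_S2 + VAR_OMEGA, K * VAR_XI)]


def mixture_moments():
    """Mean vector and covariance matrix of (xi, r, s) over all six classes."""
    mean = [0.0] * 3
    second = [[0.0] * 3 for _ in range(3)]
    for (p_r, e_r, v_r, match), (p_s, e_s, v_s, c_xs) in itertools.product(
            REGISTER, SURVEY):
        w = p_r * p_s
        m = (MU_XI, e_r, e_s)
        c_xr = match * VAR_XI  # r co-varies with xi only for a correct match
        c_rs = match * c_xs    # and with s only through xi
        cov = [[VAR_XI, c_xr, c_xs], [c_xr, v_r, c_rs], [c_xs, c_rs, v_s]]
        for i in range(3):
            mean[i] += w * m[i]
            for j in range(3):
                second[i][j] += w * (cov[i][j] + m[i] * m[j])
    total = [[second[i][j] - mean[i] * mean[j] for j in range(3)]
             for i in range(3)]
    return mean, total


def linear_projection():
    """Coefficients (a, b, c) of the projection of xi on (r, s), and its reliability."""
    mean, cov = mixture_moments()
    v_xx, c_xr, c_xs = cov[0]
    v_rr, c_rs, v_ss = cov[1][1], cov[1][2], cov[2][2]
    det = v_rr * v_ss - c_rs ** 2
    a = (c_xr * v_ss - c_xs * c_rs) / det
    b = (c_xs * v_rr - c_xr * c_rs) / det
    c = mean[0] - a * mean[1] - b * mean[2]
    reliability = (a * c_xr + b * c_xs) / v_xx
    return a, b, c, reliability
```

Under these assumptions the weight on r comes out well below the weight on s, in line with the qualitative statement above. The same moment matrix also gives the closed-form reliabilities of r and s as squared correlations with ξ.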
The predictors discussed so far were all based on uniform weights across the (r, s) space. The conditionally weighted predictors and the two-stage predictors cannot be expressed as a simple function of r and s since the weights now depend on y = (r, s)’ through the densities pj(y), cf. (10). For the two-stage predictors, Figure 2 indicates which class-specific predictor is used for a given value of (r, s), and thus from Table 5 the corresponding weights can be found. Similarly, we can express the conditionally weighted predictors as “linear” combinations of r and s with the weights depending on the values of r and s. The unrestricted version equals ā(y)r + b̄(y)s + c̄(y), where ā(y), b̄(y), and c̄(y) are weighted averages of the a, b, and c coefficients in the last three columns of Table 5, with weights pj(y), and similarly for the unbiased version, using the coefficients in columns 5–7. We have computed the resulting relative weight given to r, as compared to s, computed as ā(y)/[ā(y) + b̄(y)], over the (r, s) space. The result for one of the two conditionally weighted predictors is given in Figure 3. The figure for the other is very similar and is omitted. We see that in the middle of the figure the predictor is (almost) fully dominated by r, but further from the mean of ξ (μξ = 12.28), more weight is attached to s. Since the variance of ζ is large, the relative weight largely depends on the value of r only and is rather insensitive to the value of s.
Figure 3.
Relative weight of log(register earnings) r in the conditionally weighted predictor. The lines connect points with the same relative weight given to r, computed as ā(y)/[ā(y) + b̄(y)].
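The relative weight plotted in Figure 3 is a simple function of the conditional class probabilities. The following sketch takes a dictionary of class probabilities pj(y) as given (how they are obtained is outside this snippet) and averages the unrestricted coefficients from the last three columns of Table 5. The function names are hypothetical.

```python
# Unrestricted within-class coefficients (a, b, c), copied from the last
# three columns of Table 5.
UNRESTRICTED = {
    1: (0.5, 0.5, 0.0),
    2: (1.0, 0.0, 0.0),
    3: (1.0, 0.0, 0.0),
    4: (0.0, 1.0, 0.0),
    5: (0.0, 0.99, 0.12),
    6: (0.0, 0.25, 9.32),
}


def averaged_coefficients(post):
    """Weighted averages (a_bar, b_bar, c_bar) with weights p_j(y)."""
    return tuple(
        sum(post.get(j, 0.0) * UNRESTRICTED[j][k] for j in UNRESTRICTED)
        for k in range(3)
    )


def relative_weight_of_r(post):
    """Relative weight a_bar / (a_bar + b_bar) given to r."""
    a_bar, b_bar, _ = averaged_coefficients(post)
    return a_bar / (a_bar + b_bar)


def conditional_predictor(r, s, post):
    """Conditionally weighted predictor a_bar * r + b_bar * s + c_bar."""
    a_bar, b_bar, c_bar = averaged_coefficients(post)
    return a_bar * r + b_bar * s + c_bar
```

If all probability mass sits on class 2 or 3, the relative weight of r is 1; if it sits on class 5 or 6, it is 0. Intermediate values arise only where the class probabilities are spread over both groups.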
It is also interesting to compare Figure 3 with Figure 2. This shows that in the areas of the (r, s) space where classes 2 or 3 are predicted for the two-stage predictors, which corresponds to these two-stage predictors being equal to r and thus giving a 100% weight to r, the relative weight for r in the conditionally weighted predictors is also very high, more than 90%. Conversely, in the areas where classes 5 or 6 are predicted for the two-stage predictors, r is predicted to be a mismatch and its weight in computing the two-stage predictors is zero. In these areas, the relative weight of r in the conditionally weighted predictors is very low, less than 10%. The weight for r in the conditionally weighted predictors changes continuously, whereas it changes discretely for the two-stage predictors, but Figure 3 shows that “continuously” is still quite fast, so that the two-stage predictors and conditionally weighted predictors are very similar.
In order to obtain an overall comparison of the performance of the seven predictors we have computed their reliabilities. The reliability of a proxy is a measure for the precision with which it measures the latent variable of interest. When ξ is one-dimensional, the reliability of a proxy x is the squared correlation between ξ and x. The reliability of a linear combination x = ar + bs + c can be straightforwardly obtained in closed form. For the conditionally weighted and two-stage predictors, reliabilities cannot be computed analytically since they are nonlinear combinations of r and s. We have computed the reliabilities of these predictors by simulating 100,000 draws from the model and computing the squared sample correlation between factor and predictor from these. (We have also computed the reliabilities of the linear proxies using these draws, which corresponded closely to the results obtained from the closed-form expressions for these proxies.)
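A simulation of this kind can be sketched as follows for the two raw measurements r and s. The data-generating process is an assumption about the model structure (r = ξ or ζ; s = ξ, ξ + ρ(ξ − μξ) + η, or that plus ω; independent normal components) combined with the rounded estimates from section 6.2; the seed, the function names, and the use of Python's standard library are illustrative choices and not the authors' code.

```python
import math
import random

MU_XI, VAR_XI = 12.28, 0.51
MU_ZETA, VAR_ZETA = 9.20, 3.27
MU_ETA, VAR_ETA = -0.05, 0.01
MU_OMEGA, VAR_OMEGA = -0.31, 1.52
PI_R, PI_S, PI_OMEGA, RHO = 0.96, 0.15, 0.16, -0.01


def draw(rng):
    """One draw of (xi, r, s) from the assumed mixture model."""
    xi = rng.gauss(MU_XI, math.sqrt(VAR_XI))
    # Register value: correct match with probability PI_R, else a mismatch.
    if rng.random() < PI_R:
        r = xi
    else:
        r = rng.gauss(MU_ZETA, math.sqrt(VAR_ZETA))
    # Survey value: exact, with measurement error, or also contaminated.
    if rng.random() < PI_S:
        s = xi
    else:
        s = xi + RHO * (xi - MU_XI) + rng.gauss(MU_ETA, math.sqrt(VAR_ETA))
        if rng.random() < PI_OMEGA:
            s += rng.gauss(MU_OMEGA, math.sqrt(VAR_OMEGA))
    return xi, r, s


def squared_corr(x, y):
    """Squared sample correlation between two equally long sequences."""
    n = len(x)
    m_x, m_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    s_xx = sum((a - m_x) ** 2 for a in x)
    s_yy = sum((b - m_y) ** 2 for b in y)
    return s_xy * s_xy / (s_xx * s_yy)


def simulate_reliabilities(n=100_000, seed=0):
    """Simulated reliabilities of r and s as proxies for xi."""
    rng = random.Random(seed)
    xi, r, s = zip(*(draw(rng) for _ in range(n)))
    return {"r": squared_corr(xi, r), "s": squared_corr(xi, s)}
```

The same draws could be passed through any nonlinear predictor of ξ to obtain its reliability, which is how the conditionally weighted and two-stage predictors are handled in the text.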
The results are given in Table 6, for the seven predictors considered throughout, plus r and s as a reference. We also report the (unconditional) mean squared errors of the predictors, MSE = E[(predictor – ξ)²], which are, of course, strongly negatively related to the reliabilities, although they are not perfect substitutes. For example, unlike the reliability, the MSE of a linear combination x = ar + bs + c does depend on the constant c. In addition, we present the (unconditional) bias E(predictor – ξ) and (unconditional) variance Var(predictor – ξ) of the prediction errors.
Table 6.
Precision of the predictors.
| Predictor | Reliability | MSE | Bias | Variance |
|---|---|---|---|---|
| Register, r | .47 | .54 | –.13 | .52 |
| Survey, s | .69 | .23 | –.08 | .22 |
| Weighted (unconditional), unbiased | .53 | .43 | –.12 | .41 |
| Weighted (unconditional) | .53 | .43 | –.12 | .41 |
| Weighted (conditional), unbiased | .97 | .01 | .00 | .01 |
| Two-stage, unbiased | .97 | .02 | .00 | .02 |
| Two-stage | .97 | .01 | .00 | .01 |
| System-wide linear | .76 | .12 | .00 | .12 |
| System-wide | .98 | .01 | .00 | .01 |
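The last three columns of Table 6 are linked by the standard decomposition of the mean squared error into squared bias and variance. Writing ξ̂ for a generic predictor (a notation used here only for illustration):

```latex
\mathrm{MSE}(\hat\xi) = E\big[(\hat\xi - \xi)^2\big]
  = \big[E(\hat\xi - \xi)\big]^2 + \mathrm{Var}(\hat\xi - \xi)
% Register r:  (-0.13)^2 + 0.52 = 0.0169 + 0.52 = 0.5369 \approx 0.54
% Survey s:    (-0.08)^2 + 0.22 = 0.0064 + 0.22 = 0.2264 \approx 0.23
```

With the rounded entries of the table, the squared bias contributes less than 0.02 to the MSE of r and less than 0.01 to the MSE of s, so the variance term accounts for nearly all of each MSE.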
The table shows some striking results. The register data r have a squared correlation with the true log earnings ξ of less than .5. Informally stated, r is a clear loser, which is surprising. Apparently a high probability of being exactly correct is not sufficient and the small probability of being an independent draw from a different distribution has dramatic consequences for the statistical properties of the indicator. The survey data s perform considerably better than r. The predictors obtained by unconditional weighting per class perform poorly, but the predictor based on linear projection performs much better. However, all predictors that use the conditional class membership probabilities are nearly perfect (with the system-wide conditional expectation-based predictor performing marginally better than the others), again, of course, against the background of the postulated model.
In the last two columns of Table 6, we see that the biases of the first four predictors are sizeable, but the squared biases are negligible compared to the variances, so that the variances dominate the MSEs. It is also noteworthy that both r and s have negative bias. The bias of r is due to mismatch, whereas the bias of s is due to the negative means of the measurement error η and contamination ω. The biases of r and s are identified on the basis of the r = s cases, which are higher up on the earnings distribution (on average) than the other cases.
The excellent performance in terms of reliability and MSE of the two-stage predictor, both in an absolute sense and compared to the minimum MSE predictor, is related to the high probability with which class membership is correctly predicted. Overall, the probability of correctly classifying an observation is 96%. Among the misclassifications, a classification of an observation from class 3 as an observation from class 2 takes more than 2 of the remaining 4 percent. But because the within-class predictor is r in both class 2 and class 3, this has no consequence for the precision of the two-stage predictors. We expect that in situations where the class probabilities are less concentrated and the within-class predictors vary more widely across classes, the two-stage predictors will be noticeably less precise.
7 Discussion
This paper has dealt with a very simple question. We are given two numerical values for the same quantity, which are often not equal. What is the best guess as to the truth? As always, getting to a sensible answer involves some model building, necessarily based on some assumptions. The model in question was of the mixture factor analysis type, and answering our question amounted to factor score prediction in that model. Since that topic appeared to be hardly researched, we first derived the methods needed. Taking the standard factor score prediction literature as our point of departure, we systematically explored the options for constructing predictors for the mixture case. This produced seven predictors for further consideration.
We then explored the consequences of two extensions of the model. The first is the presence of labeled classes, which means that class membership is either known a priori or is a deterministic function of the covariates. The second extension is the incorporation of degenerate distributions and rank-deficient coefficient matrices in the model.
We applied the tools we developed to the case of Swedish earnings data, where measurements were available for each individual both from an administrative database and from a survey. We took the Kapteyn-Ypma (Kapteyn and Ypma, 2007) model as given, including their parameter estimates, and studied prediction of true (log) earnings from the two observed measurements. The major conclusion is that conditionally weighted predictors and the two-stage predictors perform extremely well, much better than either the register data or the survey data by themselves, predictors based on unconditional weights, or the linear projection-based predictor.
A topic for further research is to study the generality of our empirical findings. In particular, researchers increasingly use administrative data matched to surveys and simply treat the administrative data as if they contained the correct values. Our empirical results show that, if there is a potential for mismatch, this approach may be counterproductive, because we showed that in the Kapteyn-Ypma model, the survey earnings have a higher reliability and lower MSE than the register earnings.
In the U.S., administrative data from the Social Security Administration (SSA) are a primary source of such matched data sources. They have been matched to nationally representative surveys, such as the Health and Retirement Study (HRS), Survey of Income and Program Participation (SIPP), and the Current Population Survey (CPS). The algorithm that the SSA uses for matching administrative records to survey records tends to be conservative in declaring a correct match, but it is certainly not guaranteed to be error-free. Given the findings in this paper, it seems worthwhile to study the prevalence of mismatches using a model like the Kapteyn-Ypma one, and their potential consequences using an analysis like the one in the current paper.
There remains the puzzle why the register data perform so poorly. It is beyond the scope of this paper to resolve this puzzle, but a speculative explanation may be the following. In the Kapteyn-Ypma model, the register data are assumed to deviate from the true earnings values only in case of a mismatch. However, the register data might contain some measurement error themselves, possibly in the case of unreported earnings from a second or third job. Failure to report these earnings could, for example, be due to the earnings being minor or because they pertain to unreported employment in the underground economy (see Kapteyn and Ypma 2007, footnote 5, for a similar example). If this is the explanation, “mismatch” should be reinterpreted as “misreport”. Under such a reinterpretation, the model specification would likely need to be changed slightly, because then register earnings under misreport would not be independent from the true earnings. It is not immediately clear, though, whether they would be positively or negatively correlated with true earnings, and thus whether the reliability of register earnings as a measure of true earnings would be higher or even lower than estimated in the current paper. In any case, it seems that even under the most favorable circumstances, the quality of the register data is lower than is typically assumed in validation studies.
Acknowledgments
We would like to thank Jelmer Ypma and Arie Kapteyn for sharing their code with us and for discussions about their model, and Jos ten Berge and conference participants at Aston Business School, Birmingham, in particular Battista Severgnini, for stimulating discussions and comments on an earlier version of this paper. We would also like to thank the editors (Arthur Lewbel and Keisuke Hirano), an associate editor, and a referee for helpful comments. The National Institute on Aging supported the collection of the survey data and subsequent match to administrative records (grant R03 AG21780). Rohwedder acknowledges financial support from NIA grant P01 AG08291.
Contributor Information
Erik Meijer, RAND Corporation, Santa Monica, CA 90407-2138.
Susann Rohwedder, RAND Corporation, Santa Monica, CA 90407-2138.
Tom Wansbeek, University of Groningen, 9700 AV Groningen, The Netherlands.
References
- Amemiya T. Advanced econometrics. Harvard University Press; Cambridge, MA: 1985.
- Anderson TW, Rubin H. Statistical inference in factor analysis. In: Neyman J, editor. Proceedings of the third Berkeley symposium on mathematical statistics and probability V. University of California Press; Berkeley: 1956. pp. 111–150. [Available from http://projecteuclid.org/euclid.bsmsp/1200511860]
- Angrist JD, Pischke J-S. Mostly harmless econometrics: An empiricist's companion. Princeton University Press; Princeton, NJ: 2009.
- Bound J, Brown C, Mathiowetz N. Measurement error in survey data. In: Heckman JJ, Leamer E, editors. Handbook of econometrics. Vol. 5. North-Holland; Amsterdam: 2001. pp. 3705–3843.
- Bound J, Krueger AB. The extent of measurement error in longitudinal earnings data: Do two wrongs make a right? Journal of Labor Economics. 1991;9:1–24.
- Jedidi K, Jagpal HS, DeSarbo WS. Finite-mixture structural equation models for response-based segmentation and unobserved heterogeneity. Marketing Science. 1997;16:39–59.
- Kapteyn A, Ypma JY. Measurement error and misclassification: A comparison of survey and administrative data. Journal of Labor Economics. 2007;25:513–551.
- Pischke J-S. Measurement error and earnings dynamics: Some estimates from the PSID validation study. Journal of Business & Economic Statistics. 1995;13:305–314.
- Raudenbush SW, Bryk AS, Cheong YF, Congdon R. HLM 6: Hierarchical linear and nonlinear modeling. Scientific Software International; Chicago: 2004.
- Wansbeek TJ, Meijer E. Measurement error and latent variables in econometrics. North-Holland; Amsterdam: 2000.
- Yung Y-F. Finite mixtures in confirmatory factor-analysis models. Psychometrika. 1997;62:297–330.
- Zhu H-T, Lee S-Y. A Bayesian analysis of finite mixtures in the LISREL model. Psychometrika. 2001;66:133–152.