Abstract
The integration of different data sources is a widely discussed topic among both researchers and Official Statistics institutes. Integrating data helps to contain the costs and time required by new data collections. Non-parametric micro Statistical Matching (SM) allows the integration of ‘live’ data resorting only to the observed information, potentially avoiding misspecification bias and reducing the computational effort. Despite these advantages, no robust approach exists to assess the integration goodness when this method is used. Moreover, several applications comply with some commonly accepted practices which recommend, e.g., using the biggest data set as donor. We propose a validation strategy to assess the integration goodness. We apply it to investigate these practices and to explore how different combinations of SM techniques and distance functions perform in terms of the reliability of the synthetic (complete) data set generated. The validation strategy takes advantage of the relation existing among the variables before and after the integration. The results show that ‘the biggest, the best’ rule need not be considered mandatory anymore. Indeed, the integration goodness increases in relation to the variability of the matching variables rather than with respect to the dimensionality ratio between the recipient and the donor data set.
Keywords: Statistical matching, hot deck method, data integration, matching goodness
AMS CLASSIFICATIONS: 62, 62-07, 62G
1. Introduction
The ongoing debate on big data, as well as the increasing number of ad hoc project surveys and privately-owned (web) data sources, is fostering the idea that data are abundant and that researchers are experiencing an unforeseen data flood. Nevertheless, while new data collection scenarios are emerging, researchers often have to face growing limits on the availability (and accessibility) of the elementary data provided by Official Statistics. Access to secondary micro-data is more and more restricted due to privacy constraints that, on the one hand, make the existing data releasing procedures stiffer and more time-demanding and, on the other, allow the dissemination of data only at a very aggregated level. Moreover, the classic massive data collections, e.g. the national general censuses, are being scaled back due to a cost-reduction rationale [3].
Ideally, researchers could tackle the shrinking availability of elementary data by planning specific data collections, e.g. by designing and carrying out ad hoc surveys. However, collecting primary data is very demanding in terms of both costs and time. A potential solution to these limits is offered by integration methods, which allow the exploitation of information collected by already available data. This solution is profitable not only to overcome the shortage of elementary data but also to enrich secondary data with specific information that is timely collected. Data integration strategies are increasingly addressed not only by researchers but also by national statistical agencies. For example, the High Level Group for the Modernisation of Official Statistics (HLG-MOS) strategy points out that the main aim for the near future of Official Statistics is to provide more timely and disaggregated statistics, at higher frequencies than the traditional ones, by combining already existing data, administrative registers and official surveys [39].
Data integration can be performed applying different methods such as Record Linkage (RL), Statistical Up(Down)scaling (SUD) and Statistical Matching (SM). Originally, RL was implemented with the specific purpose of identifying duplicated records in data sets where unique identifiers are unavailable. RL was then further developed and is currently applied also to match records referring to the same units collected in different data sets [41]. Having originated as a very practical ‘data cleaning’ procedure, RL is nowadays applied as an ‘entity resolution’ method, and its use within Official Statistics is widespread (see, for example, [6] and the references therein). SUD has been developed in the environmental and meteorological research fields with the primary purpose of adapting data collected at different spatial levels and time scales [7], enlarging or narrowing the information referred to a specific area or an aggregated level.
SM is the most recent data integration method among the three [13]. It is applied to transfer key information from the only data set where it is observed to a data set of interest. The former data set, containing the information that we want to transfer, is called the ‘donor’, while the latter is called the ‘recipient’. This process of information transfer is usually called ‘imputation’. SM can be applied for several purposes, ranging from the imputation of missing values (macro SM) up to the integration of different data sources and the building of new data sets (micro SM) [21]. The latter case is considered in the present work. The goal of generating a synthetic (complete) data set from different data sources already available can be approached as a ‘missing data’ problem. In the micro SM framework the targets of the imputation are entire data set columns that are essential for the research purpose, meaning that the ‘missing data’ scenario differs from the one more often tackled by means of the Multiple Imputation method, i.e. the case in which one (or more) variable(s) is subject to missingness. Moreover, the missing data mechanism is assumed to be the most trivial one, i.e. the sets of variables that are not observed are missing completely at random (MCAR), as defined by [27]. SM is commonly divided into the parametric approach and the non-parametric one. The latter, when applied for micro purposes, is usually called the ‘hot deck’ method. The non-parametric SM offers some relevant advantages with respect to the parametric one [29], i.e.:
the integration concerns ‘live’ information (only the real observed data)
the potential model misspecification bias can be avoided
the computational effort required for the integration is minimal
For these reasons the non-parametric SM is particularly valuable when researchers want to deal with the real observed information and avoid the specification of a model, as is the case, e.g., in the analysis of non-experimental/observational studies.
The hot deck method not only offers some advantages but also presents unsolved challenges. First, the SM literature is rather poor in well-known (robust) approaches for assessing the goodness of the integration. Second, the non-parametric micro SM applications comply with some practices which are neither empirically based nor properly theoretically formalized [22]. Third, the distance-based techniques of the hot deck method rely by default on the Mahalanobis distance function [21].
Comparing the parametric and the non-parametric SM in terms of the assessment of the integration goodness, when we apply the parametric SM we end up analyzing the uncertainty behind the matching by focusing on the joint distribution function (see, for example, [11] and the references therein). This joint distribution involves Z and K, which are two variables not jointly observed, and X, which is observed both in the donor data set and in the recipient one. The prominent solution usually applied to minimize this uncertainty is to select the matching distribution from a plausible set of joint distributions of (X, Z, K), as is done, e.g., in [13]. This approach is not different from the one applied within the macro SM, usually used to tackle the issue of missing values. On the other hand, moving from the problem of uncertainty minimization to the quantification of the error that is produced by the integration and retained in the synthetic (complete) data set generated, this issue, as well as the assessment of the reliability of the final data set, is seldom treated in the literature. This shortage is even more evident with respect to the non-parametric SM.
Considering these challenges, the present work pursues one main research objective, namely to:
propose and apply a validation strategy useful to assess the integration goodness within the non-parametric framework
Furthermore, we use the validation strategy for two secondary objectives, i.e.:
to investigate the validity of the main practices holding behind several applications of the hot deck method
to assess how new combinations of the distance-based techniques of the hot deck method with non-default distance functions perform
To the best of our knowledge, this work constitutes a first effort to propose a cohesive validation strategy that can be applied to assess the integration goodness of the synthetic (complete) data set generated when the non-parametric micro SM is used. Moreover, it constitutes the first effort to explore and validate the common practices assumed as necessary (or mandatory) by several applications of the hot deck method. Finally, we apply the new validation strategy to new combinations of the hot deck distance-based techniques with distance functions beyond the default Mahalanobis one, also in order to identify the best performing combination and to propose a few guidelines for their application. The paper is structured as follows: in Section 2 we review the state of the art in SM, both from the methodological point of view and with respect to the SM applications; we then present our validation strategy and the further developments proposed, related to the combinations of techniques and distance functions. In Section 3 we describe the simulation study carried out to assess the integration goodness within different integration scenarios by means of our validation strategy; in the same Section we also present the results of the testing and validation procedure related to the practices commonly accepted by SM practitioners. In Section 4 we present the results of a real-world application with administrative and survey data on agricultural holdings. Finally, in Section 5 we discuss the results and conclude.
2. Approach
2.1. State-of-the-art methodology
SM was first developed by [32], which is considered by academics to be the first work that approached the issue of data integration. Nevertheless, there is also information about the so-called ‘1966 merge file’ that can be considered the first original attempt to integrate different data sources. This mismatch is a representative example of the fact that from the 60s until, more or less, the mid 90s the SM framework was developed without being defined in an unequivocal way. Indeed, practitioners labeled different data integration procedures and methods interchangeably as object or instance identification, data fusion, data completion, data combining, record matching and statistical matching [18]. The parametric SM has received more attention than the non-parametric one, having been improved by several authors (see, for example, [23,26,35]) who mainly discussed, at various levels, the seriousness of the Conditional Independence Assumption (CIA). For the sake of brevity, we refer to [17] for a complete discussion of the CIA, which constitutes the first theoretical approach adopted in the SM framework. Let it be the case that we observe (X, Z) in a generic data set R and (X, K) in a generic data set D; then, through the CIA it is possible to identify the joint distribution of (X, Z, K) by means of the following assumption:

f(x, z, k) = f(z | x) f(k | x) f(x),

where Z and K are two sets of variables that are exclusively observed in the data sets at disposal, while the set of variables X is observed both in R and in D. Due to the fact that this assumption cannot be tested on the whole sample obtainable by joining the two data sets [21], if the statistical relation assumed by the CIA on (Z, K) given X does not hold, we will simply perform a biased data integration, ending up with an uninformative synthetic (complete) data set [5].
In order to overcome the CIA, [36] proposed to use auxiliary information to estimate an intermediate value of the variable of interest by means of a regression predictor based on the variable(s) observed both in R and D and on (at least) two other variables that are exclusively observed, one in R and the other in D. In this sense, further developments were provided by [37], who show both how it is possible to avoid the computational limitations of the integration process and the CIA bias by using some ‘additional information’. Moreover, [37] offer evidence that the hot deck method generally outperforms the regression and log-linear techniques in terms of the robustness of the estimation of the marginal distributions of the variables Z and K in the integrated data set.
Recently, some remarkable methodological developments concerned the parametric SM with respect to the assessment of the ‘imputation noise’ and the quantification of the integration bias [8,14,31], but also the ‘identifiability’ issue [12,13,29]. Moreover, some software advances have been proposed [20]. From the point of view of the non-parametric SM instead, the most relevant advances were limited to the proper definition of the theoretical framework as well as to the set-up of a cohesive reference literature [2,21,33].
2.1.1. Hot deck method techniques
The non-parametric micro SM is applied to generate a synthetic (complete) data set by integrating two or more data sets which are characterized by the fact that:
they collect information on (at least) one variable that is observed in both data sets and (at least) two variables that are exclusively observed (one in each data set)
the observations are (potentially) disjoint sets of units
The main peculiarity of the hot deck method is that the integration is based only on the exploitation of the jointly observed information, such that neither model assumptions nor the estimation of any distribution family for these variables (nor any definition of the model's parameters) is necessary [10].
Let R and D be two data sets, respectively defined as the recipient data set (R) and the donor data set (D). Let n_R and n_D be the number of observations in the recipient data set R and in the donor data set D, respectively; in other words, these are the dimensions of the two data sets. Note that R and D can differ in terms of their dimensionality. Let these two data sets collect a set of jointly observed variables and two different sets of variables which are observed exclusively: one in R, the other in D (Z and K, respectively). The synthetic (complete) data set is generated by means of the imputation in R of some variables of interest observed only in D. Let i and j be the ith and jth observations collected in the two data sets, with i = 1, ..., n_R and j = 1, ..., n_D. We define:
X = (X_1, ..., X_L), that is the set of variables jointly observed in R and D (observed as a matrix with n_R rows in R and with n_D rows in D)

Z, that is the set of variables observed only in the recipient data set (a matrix with n_R rows)

K, that is the set of variables observed only in the donor data set (a matrix with n_D rows)
In other words, we have at disposal two data sets:
A recipient data set R = {X, Z}
A donor data set D = {X, K}
Let then K_d be a generic subset of d variables of interest (with d not larger than the number of variables in K), chosen among the variables in K. We are interested in the imputation from D to R of this subset K_d, the so-called ‘imputed’ variables. The imputation is based on some jointly observed variables X, defined as the ‘matching’ variables. The synthetic (complete) data set generated by means of the integration will be {X, Z, K_d}. For the sake of simplicity, we assume that L = 1, i.e. X is a single (continuous) variable jointly observed in R and D.
The following assumptions hold:
R and D are two data sets containing information on two representative samples of the same target population [21]
R ∪ D has to be considered as a unique sample of n_R + n_D iid observations from the joint distribution of (X, Z, K) [21]
R (with n_R observations) and D (with n_D observations) have to be selected as recipient and donor data sets, respectively, such that the condition n_R ≤ n_D holds [21,33]
Due to the fact that R ∪ D constitutes a unique data set of independent units generated from the same joint distribution, the missing data mechanism is MCAR. Indeed, imagining a missingness indicator for the units, this would be independent of both the observed and the unobserved values and, by conditioning on it, the resulting distribution of (X, Z, K) would remain unchanged, as [35] show.
There are four non-parametric micro SM techniques [21], namely:
the Nearest Neighbour Distance Hot Deck (nnd)
the Constrained Nearest Neighbour Hot Deck (cnnd)
the Random Hot Deck (rhd)
the Rank Hot Deck (rkhd)
Applying the nnd technique, the ith recipient observation and the j*th donor observation are matched when:

d_{ij*} = min_{j = 1, ..., n_D} d(x_i, x_j),

where d_{ij*} is the absolute minimum value of the distance between the ith and the jth unit, computed on the matching variables, and the j*th unit is the donor unit chosen to be matched.
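As an illustration, the following is a minimal sketch of this unconstrained matching step for a single continuous matching variable; the data values and function names are ours and purely illustrative, not the implementation used in the paper or in any specific SM package.

```python
import numpy as np

def nnd_hot_deck(x_recipient, x_donor):
    """Nearest Neighbour Distance hot deck: for each recipient unit, return
    the index of the donor unit at minimum distance on the matching variable
    (donors can be reused)."""
    x_r = np.asarray(x_recipient, dtype=float)
    x_d = np.asarray(x_donor, dtype=float)
    dist = np.abs(x_r[:, None] - x_d[None, :])   # n_R x n_D distance matrix
    return dist.argmin(axis=1)                   # closest donor for each recipient

# toy example: impute a donor-only variable k into the recipient data set
x_r = [1.2, 3.4, 7.8]                   # matching variable in R
x_d = [0.9, 3.1, 5.0, 8.1]              # matching variable in D
k_d = np.array([10., 20., 30., 40.])    # variable observed only in D
matched = nnd_hot_deck(x_r, x_d)
k_imputed = k_d[matched]                # values transferred to R
print(matched, k_imputed)
```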
If we want to exclude an already matched observation from the set of the possible donors, we can perform a constrained version of the aforementioned nnd technique. This is done by selecting, for each recipient observation, the donor one that is closest in terms of the lowest aggregated distance among the distances between the ith and jth units, under the constraint that each donor observation can be matched with at most one recipient observation. If we approach this problem as a linear programming problem [9] and we assume that n_R = n_D, we can define the donor pattern as follows:

Φ = {φ_ij : i = 1, ..., n_R; j = 1, ..., n_D},

where φ_ij = 1 if the units i and j form a matched pair and φ_ij = 0 otherwise, with i = 1, ..., n_R and j = 1, ..., n_D.

In the linear programming framework our goal is then to minimize the overall matching distance Σ_i Σ_j d_ij φ_ij such that the following constraints hold:

Σ_{j=1}^{n_D} φ_ij = 1 for each i = 1, ..., n_R, (1a)

Σ_{i=1}^{n_R} φ_ij = 1 for each j = 1, ..., n_D. (1b)

If the simple case of equal dimensionality does not hold, we assume that we are in the case of n_R < n_D. Hence, Equations (1a) and (1b) change such that:

Σ_{j=1}^{n_D} φ_ij = 1 for each i = 1, ..., n_R, (2a)

Σ_{i=1}^{n_R} φ_ij ≤ 1 for each j = 1, ..., n_D. (2b)
The cnnd technique better preserves the marginal distribution of the imputed variable in the synthetic (complete) data set generated, but it is far more time-demanding from a computational point of view and it increases, on average, the distances on the matching variables between the matched units i and j [26].
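Since the constrained matching in (2a) and (2b) is a standard assignment problem, it can be solved with an off-the-shelf solver; the sketch below uses scipy's linear_sum_assignment on a rectangular cost matrix. It is our illustrative code, under the assumption n_R ≤ n_D, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cnnd_hot_deck(x_recipient, x_donor):
    """Constrained hot deck: every recipient gets exactly one donor and every
    donor is used at most once, minimising the total matching distance."""
    cost = np.abs(np.asarray(x_recipient, dtype=float)[:, None]
                  - np.asarray(x_donor, dtype=float)[None, :])
    rows, cols = linear_sum_assignment(cost)   # rows come back in recipient order
    return cols                                # donor index assigned to each recipient

print(cnnd_hot_deck([1.2, 3.4, 7.8], [0.9, 3.1, 5.0, 8.1]))
```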
The rhd technique picks at random the donor unit to be matched with the recipient one. Considering the (initial) potential set of donor and recipient units' pairs:

C = {(i, j) : i = 1, ..., n_R; j = 1, ..., n_D}, (3)

we can decide to sharpen it and reduce the level of randomness of the paired units by defining donation classes. We define a donation class as a possible subset of the whole jointly observed set of information. In other words, we can select some of the jointly observed variables in R and D (which are not chosen as matching variables) to constrain the integration. For example, suppose we want to integrate data referring to individuals collected by means of two different sources. We can restrict the imputation of the variables of interest to homogeneous sub-groups of individuals (i.e. observations), e.g. by defining the subsets of observations that have the same gender, age class, living area, etc. Formally, let X_1 and X_2 be two variables jointly observed both in R and in D; if a donation class defined by these variables holds, the set in (3) is restricted to the following one:

C* = {(i, j) ∈ C : x_{i1} = x_{j1} and x_{i2} = x_{j2}}.
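A minimal sketch of the random hot deck restricted to donation classes is given below; the class labels and function names are illustrative assumptions, and a real application would also need to handle classes with no eligible donors.

```python
import numpy as np

rng = np.random.default_rng(0)

def rhd_hot_deck(class_recipient, class_donor):
    """Random hot deck within donation classes: each recipient draws a donor
    at random among the donors sharing its donation class."""
    class_donor = np.asarray(class_donor)
    matched = []
    for c in class_recipient:
        candidates = np.flatnonzero(class_donor == c)  # eligible donors
        matched.append(rng.choice(candidates))
    return np.asarray(matched)

# donation classes defined, e.g., by a jointly observed categorical variable
print(rhd_hot_deck(["A", "B", "A"], ["B", "A", "A", "B"]))
```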
The previous three techniques (nnd, cnnd, rhd) are distance-based techniques (we discuss the distance functions in Section 2.1.2). The fourth technique (the rkhd one) is not a distance-based technique and it is based on two steps. First, the donor and recipient units are ranked by means of the empirical cumulative ranks of the matching variable,

F_R(x_i) = rank(x_i)/n_R and F_D(x_j) = rank(x_j)/n_D,

considering R and D, respectively. Second, each recipient unit is associated with a donor one and a matched units' pair (i, j*) is constituted as follows:

|F_R(x_i) − F_D(x_{j*})| = min_{j = 1, ..., n_D} |F_R(x_i) − F_D(x_j)|,

i.e. the minimum of the difference between the two empirical ranks is computed over all the donor units.
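A sketch of our reading of these two steps, using empirical ranks rescaled to (0, 1], is given below; it is an illustrative approximation of the rank hot deck (ties broken arbitrarily), not necessarily the exact formulation adopted in the references.

```python
import numpy as np

def rkhd_hot_deck(x_recipient, x_donor):
    """Rank hot deck: transform both samples into empirical ranks in (0, 1],
    then match each recipient with the donor whose rank is closest."""
    x_r = np.asarray(x_recipient, dtype=float)
    x_d = np.asarray(x_donor, dtype=float)
    rank_r = (np.argsort(np.argsort(x_r)) + 1) / len(x_r)   # empirical ranks in R
    rank_d = (np.argsort(np.argsort(x_d)) + 1) / len(x_d)   # empirical ranks in D
    return np.abs(rank_r[:, None] - rank_d[None, :]).argmin(axis=1)

print(rkhd_hot_deck([1.2, 3.4, 7.8], [0.9, 3.1, 5.0, 8.1]))
```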
2.1.2. Distance functions
The hot deck distance-based techniques (nnd, cnnd and rhd) apply the Mahalanobis distance as default distance function [21]. Nevertheless, different distance functions can be used to compute the distance between a recipient unit and a potential donor.
A generic function δ is a distance function iff the following three properties hold [28]:

δ(i, j) = δ(j, i),   δ(i, j) ≥ 0,   δ(i, i) = 0,

meaning, respectively, that the distance between two units is symmetric, that a distance function is always non-negative and that the identity property must hold.

Let h be the index of a generic observation in the data set D (with h ≠ i, j). Therefore, given the generic distance function δ, we define Δ as a metric iff the following two assumptions hold [28]:

A.2.1: δ(i, j) = 0 iff i = j

A.2.2: δ(i, j) ≤ δ(i, h) + δ(h, j)

In other words, we assume that the principle of the ‘identity of the equals’ holds for each metric as well as that the metrics always satisfy the triangle inequality.
We propose to combine the aforementioned hot deck distance-based techniques with the following distance functions:
Manhattan (mn)
Mahalanobis (ms)
Exact (et)
The Manhattan metric computes the distance between the ith and jth units such that:

d_mn(i, j) = Σ_{l=1}^{L} |x_{il} − x_{jl}|,

i.e. by means of the sum of the absolute values of the differences between the donor and the recipient in terms of the values of the chosen matching variables.
The Mahalanobis metric computes the distance between the ith and jth units, taking into account the statistical relation among the observed covariates, such that:

d_ms(i, j) = [(x_i − x_j)' Σ^{-1} (x_i − x_j)]^{1/2},

where Σ is the covariance matrix of the matching variables X.
The third distance function has to be conceived more properly as a semi-metric since assumption A.2.2 (i.e. the triangle inequality) does not hold for it. This distance has to be thought of as an index à la ‘Sørensen-Dice’ [24] or as Gower's dissimilarity index [25]. It is defined as a sum of variable-wise contributions, where δ_l is a scaling factor for the lth variable (with l = 1, ..., L) that is equal to 1 for binary variables and rescales the contributions of continuous and categorical ones. This distance can provide a general measure of ‘proximity’ between observations when variables of different natures are used.
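The Manhattan and Mahalanobis distances between all recipient-donor pairs can be computed with standard routines, as in the following sketch; the variable names are illustrative, and the Exact/Gower-type distance is omitted because its exact form depends on the variable types at hand.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X_r = rng.normal(size=(5, 2))            # matching variables in R (n_R x L)
X_d = rng.normal(size=(8, 2))            # matching variables in D (n_D x L)

# Manhattan (city-block) distances between every recipient-donor pair
d_mn = cdist(X_r, X_d, metric="cityblock")

# Mahalanobis distances, using the covariance of the pooled matching variables
cov = np.cov(np.vstack([X_r, X_d]), rowvar=False)
d_ms = cdist(X_r, X_d, metric="mahalanobis", VI=np.linalg.inv(cov))

print(d_mn.shape, d_ms.shape)            # both (5, 8)
```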
2.2. Application strategy
Considering the SM literature, from the mid 70s until the mid 90s there were several applications on income and taxes as well as on the integration of different administrative registers on households (see, for example, [6,18,40] and the references therein). Recently, new fields of application have emerged concerning households' living standards, income and consumption attitudes [19,38], accounting and structural information on agricultural holdings [4,34], policy evaluation [16] and pharmaceutical data [1].
The majority of these works apply the parametric SM, which has been widely studied from the theoretical point of view compared to the non-parametric one. Indeed, the latter has received minor attention both theoretically and practically. Moreover, the hot deck method has been developed by practitioners and statistical agencies through a marked learn-by-doing and problem-solving approach [31]. This is the main reason why several applications go along with some key practices which have often been adopted until becoming part of the ‘common knowledge’ of the SM framework. These practices lack robust theoretical bases, being neither explicitly stated nor formalized by most of the applications.
Let R be the recipient data set and D the donor one. We can summarize the practices of the non-parametric SM as follows:

P.1 Given pairs of data sets R and D with an equal dimensionality ratio, it is always highly preferable that the variability of the matching variables in R is lower than the variability of the matching variables in D

P.2 If the dimensionality ratio between the pairs of data sets R and D is not equal, it is always highly preferable to have the (potentially) widest dimensionality ratio between R and D

P.3 ‘The biggest, the best’ rule: it prescribes, in a mandatory way, choosing the biggest available data set as donor

P.4 The donation classes always benefit the integration goodness

These practices were (and are) adopted mainly due to a computational-time reduction rationale and because of the advantages offered by the practical choice of having the potentially biggest information pattern from which to select the information of interest and impute it into the recipient data set. To the best of our knowledge, these undoubtedly reasonable motivations on how to carry out the integration were neither properly defined nor robustly proved.
2.3. Our proposals
2.3.1. Validation strategy
At the end of the integration process there is the need to assess its goodness, i.e. we should evaluate the optimality of the donor and recipient units' pairs which have been matched in order to impute the information of interest. In other words, we have to assess the reliability of the synthetic (complete) data set generated. When we apply the parametric SM, this validation usually concerns the study of the statistical relation between the variables Z and K, considering the level of uncertainty hidden in their joint distribution function, in which the jointly observed variables X are assumed to act as a surrogate set of variables for the variables K. Assuming such a statistical relation in the model, we proceed to evaluate the uncertainty by minimizing the error of the estimated parameter of interest.
Nevertheless, when we integrate data by means of the non-parametric micro SM, the study of such a relation between two not-jointly-observed variables (Z and K) is not a priority. The amount of uncertainty hidden in the completed integration process is not the primary concern of a researcher willing to quantify the error resulting from such a process. Moreover, a researcher who wants to resort as much as possible to the ‘real’ observed information only may prefer to avoid the validation path applied by the parametric approach, the assumption of any distribution family for the not-jointly-observed variables and the choice of the model's parameters. Hence, we propose a validation strategy that exploits solely the observed information by means of a logical bond extrapolated from the observed variables.
For the sake of simplicity, consider the most trivial case: in addition to the matching variables X, in the recipient data set we observe at least one variable that is, more or less, the same as a variable observed in the donor data set. This latter variable can be imputed from the donor to the recipient on the basis of the matching variables. We define the variable W as the difference (computed after the integration) between the former variable, originally observed in the recipient data set, and the variable imputed from the donor data set. The variable W is defined as follows:

W = Z_p − K_p, (4)

for at least the pth variable, where Z_p and K_p are, basically, the ‘same’ variable. In other words, we propose to exploit all the observed information in the recipient and donor data sets, structuring the after-matching validation on the difference between at least one of the variables originally observed in the recipient data set R and some variables that can be imputed from the donor data set D.
We analyze the integration goodness by validating the synthetic (complete) data set generated by means of three single but synergistic tools:
the graphical analysis of the distribution of the variables pre-and-post the integration
the graphical analysis of the distribution of the variable W
the MSE (mean square error) of the variable W
These tools are relevant for assessing the goodness of the integration performed by means of the non-parametric micro SM because, within the non-parametric framework, they offer an easy and descriptive way to statistically evaluate the integration. The graphical analysis of the distribution of the variables originally collected in the recipient data set and of the ones associated to them by the imputation allows the researcher to quickly focus on the distributions of both, as well as to easily grasp their shapes and, eventually, the presence of outliers. This is done without assuming any distributional form for the variables. The graphical analysis of the distribution of the variable W, defined as in Equation (4), is particularly useful to compare integrations carried out, for example, by means of different techniques and/or resulting from the selection of different matching variables. The area under the curve of the distribution of W, its (a)symmetry and its short or long tails are all elements which help the researcher to understand whether the original variable has been underestimated or overestimated by the imputed variable(s) and whether there are outlying values which have not been properly represented by the imputed information. This graphical analysis of the distribution of W is then strictly associated with the MSE evaluation, which gives a numeric idea of the amount of error generated by the integration.
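As a sketch of how these tools could be computed on the synthetic data set, consider the code below; the names z_obs and k_imp are illustrative placeholders for the variable originally observed in R and the ‘same’ variable imputed from D, respectively.

```python
import numpy as np

def validate_matching(z_obs, k_imp, bins=30):
    """Validation tools of Section 2.3.1: the variable W = Z - K (Eq. (4)),
    its MSE, and binned frequencies for the pre-/post-integration comparison."""
    z_obs = np.asarray(z_obs, dtype=float)
    k_imp = np.asarray(k_imp, dtype=float)
    w = z_obs - k_imp                    # Eq. (4): observed minus imputed
    mse = float(np.mean(w ** 2))         # MSE of W
    # common bin edges so the two histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([z_obs, k_imp]), bins=bins)
    hist_obs, _ = np.histogram(z_obs, bins=edges)
    hist_imp, _ = np.histogram(k_imp, bins=edges)
    return w, mse, hist_obs, hist_imp

# toy usage with made-up values
w, mse, h_obs, h_imp = validate_matching([3.1, 5.0, 2.2], [2.9, 5.4, 2.0])
print(mse)
```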
2.3.2. New combinations of techniques and distance functions
The present paper examines the possibility of combining the existing distance-based techniques (nnd, cnnd, rhd) with two distance functions beyond the default Mahalanobis one, namely the Manhattan and the Exact ones. In the simulation study, we also consider the rkhd technique, which is not a distance-based technique of the hot deck method.
Table 1 summarizes the combinations of the hot deck distance-based techniques with the distance functions (as well as the rkhd technique) applied.
Table 1. Combinations of the distance-based techniques and the distance functions (and the rkhd technique) applied in the paper.
| Technique | Distance function | Combination |
|---|---|---|
| Nearest Neighbour Distance Hot Deck (nnd) | Manhattan (mn) | nnd.mn |
| | Mahalanobis (ms) | nnd.ms |
| | Exact (et) | nnd.et |
| Constrained Nearest Neighbour Hot Deck (cnnd) | Manhattan | cnnd.mn |
| | Mahalanobis | cnnd.ms |
| | Exact | cnnd.et |
| Random Hot Deck (rhd) | Manhattan | rhd.mn |
| | Mahalanobis | rhd.ms |
| | Exact | rhd.et |
| Rank Hot Deck (rkhd) | – | rkhd |
3. Simulation study
3.1. Simulated data application
The validation strategy that we propose to assess the integration goodness is applied in the present paper also to two secondary objectives. First, it is applied to investigate and assess the validity and the coherence of the practices (from P.1 to P.4) applied in several non-parametric SM applications. Second, we use it to analyze the performances of the different combinations depicted in Table 1.
We simulate two data sets, R and D, such that they differ with respect to two main aspects:
the dimensionality ratio
the variability of the matching variables
The data sets R and D are simulated such as:
and
and
and
Therefore, we simulate two data sets:
A recipient data set
A donor data set
X is the set of the potential matching variables while K is the set of the imputed variables. Differently from a real-life situation, we assume that the imputed variables K are the same variables observed in the recipient data set (i.e. the variables Z). This is done in order to test the robustness of the validation strategy in the most trivial scenario. In each of the two data sets, the continuous potential matching variables are simulated as the realization of a log-Normal multiplied by a Bernoulli(θ), the binary ones are simulated as the realization of a Bernoulli(θ), and a categorical variable indicates the main variable's value between the former ones. The variables Z and K are simulated as the sum of two log-Normals. Therefore, we simulate the most trivial and ‘unlucky’ situation, in which it can reasonably be assumed that no connection and/or relation exists between the three sets of variables of interest. This is because, if it were the other way round, the parametric SM could be applied more profitably in order to integrate the two data sets of interest.
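For concreteness, a minimal sketch of this kind of data-generating step is given below; all distributional parameters (means, σ, θ) and sample sizes are illustrative placeholders, not the actual settings of our simulation study.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_data_set(n, sigma=0.5, theta=0.7):
    """Illustrative generator: a semi-continuous matching variable
    (log-Normal times Bernoulli), a binary matching variable, and a
    variable of interest built as the sum of two log-Normals.
    All parameter values are placeholders, not the paper's settings."""
    x_cont = rng.lognormal(mean=0.0, sigma=sigma, size=n) * rng.binomial(1, theta, size=n)
    x_bin = rng.binomial(1, theta, size=n)
    z = rng.lognormal(0.0, sigma, n) + rng.lognormal(0.0, sigma, n)
    return {"x_cont": x_cont, "x_bin": x_bin, "z": z}

R = simulate_data_set(n=100, sigma=0.8)    # recipient: higher matching-variable variance
D = simulate_data_set(n=1000, sigma=0.4)   # donor: 1-to-10 ratio, lower variance
```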
Considering the simulation study, Equation (4) becomes the following one:
| (5) |
The simulation study takes into account two different conditions of dimensionality ratio between R and D: 1 to 10 and 1 to 3. For both these conditions we simulate two different conditions of matching-variable variance, namely the case in which the variance of the matching variables in R is higher than in D and the case in which it is lower. For the sake of simplicity, we will refer to the variance of the matching variables in R as var(R) and to the variance of the matching variables in D as var(D). Table 2 summarizes the simulated scenarios, with the integrations carried out both with and without donation classes.
Table 2. Scenarios of the simulation study and the related Integrations.
| Scenario Nr. | 1 | | 2 | | 3 | | 4 | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 1 to 10 | | 1 to 10 | | 1 to 3 | | 1 to 3 | |
| Variability | var(R) > var(D) | | var(R) < var(D) | | var(R) > var(D) | | var(R) < var(D) | |
| Donation classes | with | without | with | without | with | without | with | without |
| Integration Nr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Additionally, in the Appendix, we present and discuss the results related to the situation of equal dimensionality ratio (n_R = n_D). In this situation we examine both the variance conditions var(R) > var(D) and var(R) < var(D), carrying out the integrations both with and without donation classes.
3.2. Validation strategy results
For the sake of brevity, we refer to [15] for a fully detailed display of the results related to both the graphical analysis of the distribution of the variables pre-and-post the integration and the distribution of the variable W . We provide here two examples related to the tools applied in the validation strategy referred to the application of the cnnd.ms combination in Integration Nr. 4 and the nnd.et combination in Integration Nr. 1, respectively.
Figure 1 shows the distribution of the variable that is originally observed in the recipient data set. Figure 2 shows the distribution of the variable that is imputed from the donor data set (and that corresponds to the variable ). These results are referred to the synthetic (complete) data set generated by means of the cnnd.ms combination in Integration Nr. 4.
Figure 1.
Distribution of the variable observed in the synthetic (complete) data set generated applying the cnnd.ms combination in Integration Nr. 4.
Figure 2.
Distribution of the variable imputed in the synthetic (complete) data set generated applying the cnnd.ms combination in Integration Nr. 4.
Figures 1 and 2 show that the variable originally observed in R is overestimated for the values in the class 0–5 and underestimated for the values in the classes 10–15 and 20–25 by the variable imputed from D, although neither difference is statistically significant. Moreover, a few units with values in the class 115–120 are not matched with donors at all.
Figure 3 shows the overlap between the two density distributions of the observed and the imputed variable in the synthetic (complete) data set generated by the cnnd.ms combination in Integration Nr. 4. In order to provide the reader with a quick measure of the similarity of the two distributions, we calculate the Hellinger distance index (equal to 0.01022), stressing that above the threshold of 5% two distributions should not be considered similar [30]. Therefore, the imputed variable represents the originally observed variable extremely well.
Figure 3.
Density distributions of the variables (observed) and (imputed) in the synthetic (complete) data set generated applying the cnnd.ms combination in Integration Nr. 4.
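The Hellinger distance index we report can be obtained, for example, from the binned (discretised) distributions of the observed and the imputed variable; the sketch below is one such implementation, with the binning choice left to the analyst, and is not necessarily the exact computation used for the values reported here.

```python
import numpy as np

def hellinger(observed, imputed, bins=50):
    """Hellinger distance between the binned distributions of two samples."""
    a = np.asarray(observed, dtype=float)
    b = np.asarray(imputed, dtype=float)
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = p / p.sum()                      # relative frequencies
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

print(hellinger(np.random.default_rng(0).normal(size=500),
                np.random.default_rng(1).normal(size=500)))
```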
Figure 4 shows the distribution of the variable that is originally observed in the recipient data set. Figure 5 shows the distribution of the variable that is imputed from the donor data set. These results are referred to the synthetic (complete) data set generated by the nnd.et combination in Integration Nr. 1.
Figure 4.
Distribution of the variable observed in the synthetic (complete) data set generated applying the nnd.et combination in Integration Nr. 1.
Figure 5.
Distribution of the variable imputed in the synthetic (complete) data set generated applying the nnd.et combination in Integration Nr. 1.
Figures 4 and 5 show that the variable originally observed in R is underestimated in a statistically significant way for the values in the classes 0–10 and 30–50 by the variable imputed from D. Moreover, several units with values in the classes above 130–140 are not matched with donors at all, meaning that the outliers are not well represented in the integrated data set.
Figure 6 shows the overlap between the two density distributions of the observed and the imputed variable in the synthetic (complete) data set generated by the nnd.et combination in Integration Nr. 1. The Hellinger distance index is equal to 0.15941, well above the 5% threshold: the distribution of the imputed variable is clearly dissimilar from that of the originally observed one.
Figure 6.
Density distributions of the variables (observed) and (imputed) in the synthetic (complete) data set generated applying the nnd.et combination in Integration Nr. 1.
The examples discussed are two out of the several graphical distributions of the observed and imputed variables (pre- and post-integration) and of the overlaps between their density distributions. A complete appendix containing all the material related to the eight Integrations in the four scenarios depicted in Table 2 is available upon request.
In the following paragraphs we present the graphical distributions of the variable W related to the previous two examples, and we also present and discuss the MSE evaluation related to all the proposed combinations for the eight Integrations in the four scenarios. Namely, Figure 7 shows an example of the graphical analysis of the distribution of the variable W observed in Integration Nr. 4, while Figure 8 refers to Integration Nr. 1. We stress, again, how much worse the nnd.et combination performs in controlling for the outliers and in properly pairing donors and recipients compared to the cnnd.ms combination.
Figure 7.
Distribution of the variable resulting from the cnnd.ms combination in Integration Nr. 4.
Figure 8.
Distribution of the variable resulting from the nnd.et combination in Integration Nr. 1.
Table 3 shows that, when the variability of the matching variables in R is higher than the variability of the matching variables in D, a bigger dimensionality ratio between the two data sets guarantees the highest goodness and, the dimensionality ratio being equal, the integration with donation classes performs better. Indeed, when var(R) > var(D), it is preferable to have a bigger dimensionality ratio (1 to 10) between the two data sets in order to obtain a higher integration goodness. Moreover, when donation classes hold, the goodness increases compared to the integrations carried out without donation classes. Comparing Integrations Nr. 1 and 5 (both with donation classes), we obtain lower MSE values for almost all the combinations when we integrate data sets with dimensionality ratio 1 to 10; exceptions are represented by the combinations of the nnd, cnnd and rhd techniques with the Exact distance function (see Table 3). Comparing Integrations Nr. 2 and 6 (both without donation classes), we obtain lower MSE values when we integrate data sets with dimensionality ratio 1 to 10; exceptions are represented by the rhd.mn combination and by the rkhd technique. The dimensionality ratio being equal, by comparing Integrations Nr. 1 and 2 and Integrations Nr. 5 and 6, we can assess that the donation classes benefit the integration goodness (with the only exception represented by the cnnd.et combination).
Table 3. MSE of the variable W (Integrations Nr. 1, 2, 5, 6).
| Integration Nr. | 1 | | 5 | | 2 | | 6 | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 1 to 10 | | 1 to 3 | | 1 to 10 | | 1 to 3 | |
| Variability | var(R) > var(D) | | var(R) > var(D) | | var(R) > var(D) | | var(R) > var(D) | |
| Donation classes | with | | with | | without | | without | |
| nnd.mn | 101.536 | 9.617 | 102.534 | 10.017 | 176.171 | 83.896 | 182.890 | 90.273 |
| nnd.ms | 101.536 | 9.617 | 102.534 | 10.017 | 176.171 | 83.896 | 182.890 | 90.273 |
| nnd.et | 1972.411 | 136.508 | 2113.379 | 121.772 | 1850.420 | 180.590 | 2047.865 | 187.587 |
| cnnd.mn | 101.527 | 9.608 | 102.679 | 10.293 | 175.903 | 83.628 | 183.459 | 90.858 |
| cnnd.ms | 101.526 | 9.606 | 102.815 | 10.368 | 176.010 | 83.734 | 183.573 | 90.964 |
| cnnd.et | 2688.750 | 139.780 | 2728.813 | 131.305 | 108.465 | 14.920 | 108.465 | 14.920 |
| rhd.mn | 1000.011 | 15.570 | 1186.610 | 19.674 | 1253.199 | 85.351 | 1192.059 | 73.047 |
| rhd.ms | 1005.479 | 17.575 | 1121.168 | 16.839 | 1257.923 | 90.165 | 1465.474 | 105.852 |
| rhd.et | 1794.635 | 127.224 | 1756.882 | 137.068 | 1798.596 | 162.784 | 1883.323 | 164.871 |
| rkhd | 165.375 | 45.464 | 133.446 | 23.293 | 281.824 | 167.775 | 203.317 | 99.555 |
Table 4 shows that the condition of a lower variability of the matching variables in R compared to the variability of the matching variables in D (var(R) < var(D)) can benefit the integration goodness regardless of the dimensionality ratio between the recipient and the donor data sets. Comparing Integrations Nr. 3 and 7 and Integrations Nr. 4 and 8, we can see that even a lower dimensionality ratio (1 to 3) can guarantee a high integration goodness. Therefore, the most relevant factor to take into account is the variability of the matching variables in R, which should preferably be lower than the variability of the matching variables in D. Moreover, when the donation classes hold, the goodness increases with respect to data integrations carried out without the donation classes.
Table 4. MSE of the variable W (Integrations Nr. 3, 4, 7, 8).
| Integration Nr. | 3 | | 7 | | 4 | | 8 | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 1 to 10 | | 1 to 3 | | 1 to 10 | | 1 to 3 | |
| Variability | var(R) < var(D) | | var(R) < var(D) | | var(R) < var(D) | | var(R) < var(D) | |
| Donation classes | with | | with | | without | | without | |
| nnd.mn | 9.532 | 9.528 | 7.872 | 7.945 | 77.918 | 77.904 | 87.838 | 88.045 |
| nnd.ms | 9.532 | 9.528 | 7.872 | 7.945 | 77.918 | 77.904 | 87.838 | 88.045 |
| nnd.et | 444.579 | 157.936 | 477.174 | 158.138 | 786.865 | 208.549 | 666.437 | 205.484 |
| cnnd.mn | 9.466 | 9.465 | 7.867 | 7.976 | 84.813 | 84.770 | 95.708 | 95.738 |
| cnnd.ms | 9.494 | 9.492 | 7.913 | 8.022 | 84.515 | 84.474 | 77.219 | 77.183 |
| cnnd.et | 343.698 | 163.905 | 420.386 | 169.801 | 46.965 | 37.842 | 46.965 | 37.842 |
| rhd.mn | 8.273 | 7.295 | 12.321 | 16.484 | 78.321 | 81.351 | 104.761 | 99.260 |
| rhd.ms | 9.421 | 9.767 | 9.950 | 16.915 | 92.751 | 88.203 | 85.926 | 87.745 |
| rhd.et | 407.317 | 94.668 | 573.707 | 106.418 | 583.777 | 121.647 | 334.443 | 76.499 |
| rkhd | 2943.404 | 98.975 | 2834.001 | 86.592 | 2963.817 | 160.906 | 2953.937 | 143.025 |
Table 5 shows that, comparing Integrations Nr. 1 and 3 and Integrations Nr. 2 and 4, the dimensionality ratio between R and D being equal, the condition of a lower variability of the matching variables in R compared to the variability of the matching variables in D benefits the integration goodness when the donation classes are built (the exceptions are the nnd.et combination and the rkhd technique). Similar results are obtained without the donation classes (with, nevertheless, more exceptions: the nnd.et, cnnd.mn, cnnd.ms and cnnd.et combinations). This is due to the fact that the Constrained Nearest Neighbour Hot Deck technique cannot optimally tackle the higher variance of the matching variables in the recipient data set without the donation classes. Further evidence of this is given by the graphical analysis of the distributions of the variable W, which shows a worse control of the outliers (in the tails) if compared with the distributions of the variable W resulting from Integration Nr. 3.
Table 5. MSE of the variable W (Integrations Nr. 1, 2, 3, 4).
| Integration Nr. | 1 | | 3 | | 2 | | 4 | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 1 to 10 | | 1 to 10 | | 1 to 10 | | 1 to 10 | |
| Variability | var(R) > var(D) | | var(R) < var(D) | | var(R) > var(D) | | var(R) < var(D) | |
| Donation classes | with | | with | | without | | without | |
| nnd.mn | 101.536 | 9.617 | 9.532 | 9.528 | 176.171 | 83.896 | 77.918 | 77.904 |
| nnd.ms | 101.536 | 9.617 | 9.532 | 9.528 | 176.171 | 83.896 | 77.918 | 77.904 |
| nnd.et | 1972.411 | 136.508 | 444.579 | 157.936 | 1850.420 | 180.590 | 786.865 | 208.549 |
| cnnd.mn | 101.527 | 9.608 | 9.466 | 9.465 | 175.903 | 83.628 | 84.813 | 84.770 |
| cnnd.ms | 101.526 | 9.606 | 9.494 | 9.492 | 176.010 | 83.734 | 84.515 | 84.474 |
| cnnd.et | 2688.750 | 139.780 | 343.698 | 163.905 | 108.465 | 14.920 | 46.965 | 37.842 |
| rhd.mn | 1000.011 | 15.570 | 8.273 | 7.295 | 1253.199 | 85.351 | 78.321 | 81.351 |
| rhd.ms | 1005.479 | 17.575 | 9.421 | 9.767 | 1257.923 | 90.165 | 92.751 | 88.203 |
| rhd.et | 1794.635 | 127.224 | 407.317 | 94.668 | 1798.596 | 182.784 | 583.777 | 121.647 |
| rkhd | 165.375 | 45.464 | 2943.404 | 98.975 | 281.824 | 167.775 | 2963.817 | 160.906 |
Finally, by comparing Integrations Nr. 5 and 7 and Integrations Nr. 6 and 8 in Table 6, we can see that, the dimensionality ratio between R and D being equal, the integration goodness increases under the condition of a lower variability of the matching variables in R compared to the variability of the matching variables in D, even though the dimensionality ratio is lower (1 to 3).
Table 6. MSE of the variable W (Integrations Nr. 5, 6, 7, 8).
| Integration Nr. | 5 | | 7 | | 6 | | 8 | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 1 to 3 | | 1 to 3 | | 1 to 3 | | 1 to 3 | |
| Variability | var(R) > var(D) | | var(R) < var(D) | | var(R) > var(D) | | var(R) < var(D) | |
| Donation classes | with | | with | | without | | without | |
| nnd.mn | 102.534 | 10.017 | 7.872 | 7.945 | 182.890 | 90.273 | 87.838 | 88.045 |
| nnd.ms | 102.534 | 10.017 | 7.872 | 7.945 | 182.890 | 90.273 | 87.838 | 88.045 |
| nnd.et | 2113.379 | 121.772 | 477.174 | 158.138 | 2047.865 | 87.587 | 666.437 | 205.484 |
| cnnd.mn | 102.679 | 10.293 | 7.867 | 7.976 | 183.459 | 90.858 | 95.708 | 95.738 |
| cnnd.ms | 102.815 | 10.368 | 7.913 | 8.022 | 183.573 | 90.964 | 77.219 | 77.183 |
| cnnd.et | 2728.813 | 131.305 | 420.386 | 169.801 | 108.465 | 14.920 | 46.965 | 37.842 |
| rhd.mn | 1186.610 | 19.674 | 12.321 | 6.484 | 1192.059 | 73.047 | 104.761 | 99.260 |
| rhd.ms | 1121.168 | 16.839 | 9.950 | 16.915 | 1465.474 | 105.852 | 85.926 | 87.745 |
| rhd.et | 1756.882 | 137.068 | 573.707 | 106.418 | 1883.323 | 164.871 | 334.443 | 76.499 |
| rkhd | 133.446 | 23.293 | 2834.001 | 86.592 | 203.317 | 99.555 | 2953.937 | 143.025 |
The Appendix is dedicated to the situation in which the dimensionality ratio is 1 to 1. In this peculiar case, with the exception of the Constrained Nearest Neighbour Hot Deck technique, which cannot be used, the results of the simulation study and the hints gathered from it are all confirmed. Indeed, a lower variability of the matching variable in the recipient data set, given the variability of the matching variable in the donor one, allows better integrations to be performed. At the same time, by building the donation classes we can increase the integration goodness.
The practices P.1, P.2 and P.4 are verified. The practice P.3, which prescribes the mandatory dimensionality ratio condition between R and D, is not verified. Indeed, the so-called ‘the biggest, the best’ rule need no longer be considered mandatory. It can be relaxed whenever we have a small dimensionality ratio but the variance of the matching variables observed in D is higher than the one in R. It is hence this condition on the variability of the chosen matching variables that must guide the researcher towards a higher integration goodness. When the variance of the matching variables in R is lower than the variance of the matching variables in D, and as long as this condition holds (regardless of the dimensionality ratio), the researchers can attempt to generate an optimal synthetic (complete) data set.
Summing up the results referring to the MSE of the variable W, but also considering the results from the validation strategy as a whole (i.e. taking into account also the graphical analysis of the distributions of the variables pre- and post-integration and the graphical analysis of the distribution of the variable W), we can conclude that:
the combinations of the Nearest Neighbour Distance Hot Deck (nnd) technique with the Manhattan (mn) and the Mahalanobis (ms) distance functions, both when we build the donation classes and when we do not, perform a good integration. They guarantee that the variables originally observed in the recipient data set are well ‘estimated’ by the imputed ones (we have symmetric, short-tailed distributions, i.e. an optimal control of the outliers)
the combinations of the Constrained Nearest Neighbour Hot Deck (cnnd) technique with the Manhattan and Mahalanobis distance functions, depending on the characteristics of the data sets, perform rather similarly to the aforementioned combinations, sometimes under- or overestimating the variables originally observed in the recipient data set, even if the magnitude of the under/overestimation is statistically significant only in a few cases
the combinations of both the Nearest Neighbour Distance Hot Deck and the Constrained Nearest Neighbour Hot Deck with the Exact (et) distance function usually guarantee neither a proper estimation of the variables originally observed in the recipient data set (often overestimating them) nor a good control of the outliers (we have asymmetric, long-tailed distributions). These combinations also generate a worse integration goodness if the donation classes are not built
the combinations of the Random Hot Deck (rhd) technique with the Manhattan, Mahalanobis and Exact distance functions, applied both with and without donation classes, usually perform poorly compared to the previously discussed combinations, with the Exact distance function performing the worst
the Rank Hot Deck (rkhd) technique produces the lowest integration goodness, with a statistically significant tendency to overestimate the variables originally observed in the recipient data set, and it does not guarantee control of the outliers at all
Considering the validation strategy as a whole, we suggest that the best synthetic (complete) data set can be generated by applying the combinations of the Nearest Neighbour Distance Hot Deck (nnd) and the Constrained Nearest Neighbour Hot Deck (cnnd) techniques with the Manhattan (mn) or the Mahalanobis (ms) distance functions, preferably exploiting all the information at disposal to build the donation classes.
4. Real data application
4.1. FADN and CAP-IRE data
We present here an application with real data referred to the year 2009. We use two different data sets on representative samples of the agricultural holdings of the Emilia-Romagna region (Italy). They are the Farm Accountancy Data Network (FADN) and the CAP-IRE 2009 survey. The former is generated from the official accounting data source of the European Commission on the farms of the European Union (EU) member states. The latter has been produced by an ad hoc project survey built within the CAP-IRE project (EU FP7) that aimed at assessing the multiple impacts of the Common Agricultural Policy (CAP) reform on the European rural economies.
The FADN data set collects general, accounting and structural information on the farms (e.g. altitude, location, costs, inputs and outputs, type of crops, hectares of land per type of crops, etc.). The CAP-IRE 2009 data set collects socio-demographic and policy information on the farms as well as information on the behavior of the farmers (e.g. farmers' age, gender, education, characteristics of the household, type of payments received by the CAP, plans and intentions related to future CAP scenarios, etc.).
We are interested in combining the socio-demographic, behavioral and policy information collected by the CAP-IRE 2009 survey (the recipient data set) with the official information on the structure and accountancy of the farms collected by the FADN data (the donor data set). The FADN data set contains 782 observations while CAP-IRE 2009 contains 300 observations. Little information is jointly collected in FADN and in the CAP-IRE 2009 survey, namely the variables:
taa, indicating, in hectares, the Total Agricultural Area (TAA) of the farm (i.e. the farm ‘size’)
tf14, factor indicating the agricultural specialization of the farm
alt, factor indicating the altitude of the farm (plain, hill, mountain area)
legal_status, factor indicating the type of farm ownership
Among the information collected exclusively by the FADN data, we are interested in the variables that indicate the hectares of Utilised Agricultural Area (UAA) allocated to the individual crops (e.g. cereals, vegetables, industrial_crops, permanent_fruit, oth_fruit, oth_arable, fodder). We want then to impute these variables from FADN to the CAP-IRE 2009 data set. We use the variable taa as the matching variable, possibly building donation classes upon the farm specialization (variable tf14) and the farm altitude (variable alt). We do not consider the legal status of the farm due to the infeasibility of creating fully populated sub-groups of observations.
The validation of the synthetic (complete) data set generated exploits the information related to the total UAA of the farm (uaa_tot). This variable is observed in the recipient data set CAP-IRE 2009 and indicates the total hectares of land actively cultivated by the farm (both owned and rented-in hectares of UAA). The validation strategy is then based on the variable W of Equation (4), applied as the difference between the observed total UAA and the sum of the imputed crop-level UAA variables. Therefore, after having imputed the UAA of the individual crops from the FADN donor to the CAP-IRE 2009 recipient, we compute for each observation the difference (W) between the originally observed total UAA (Z) and the imputed total UAA (uaa_tot_imp), which results from the sum (with P = 7) of the UAA of the individual crops imputed from the FADN donor.
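Following Section 2.3.1, the computation of W and of its MSE for this application could look like the sketch below; the column names follow the text, while the data frame construction and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# crop-level UAA variables imputed from FADN (names as listed in the text)
crop_cols = ["cereals", "vegetables", "industrial_crops", "permanent_fruit",
             "oth_fruit", "oth_arable", "fodder"]

def validate_uaa(synthetic: pd.DataFrame) -> float:
    """MSE of W = uaa_tot - uaa_tot_imp, where uaa_tot_imp is the sum of the
    P = 7 imputed crop-level UAA variables (Eq. (4) applied to this case)."""
    uaa_tot_imp = synthetic[crop_cols].sum(axis=1)
    w = synthetic["uaa_tot"] - uaa_tot_imp
    return float(np.mean(w ** 2))

# illustrative two-farm synthetic data set (values are made up)
toy = pd.DataFrame({"uaa_tot": [50.0, 12.5],
                    **{c: [7.0, 1.5] for c in crop_cols}})
print(validate_uaa(toy))
```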
4.2. Integration results
The integration scenario is characterized by a dimensionality ratio of 1 to 2.6 while the variability of the chosen matching variable taa in CAP-IRE 2009 (recipient data set) is smaller (almost one third) than the variability of the variable taa in FADN (donor data set). We apply the combinations of techniques and distance functions depicted in Table 1, running the integration in three different situations: first, with donation classes built upon both the variables tf14 and alt; second, with donation classes built only upon the variable tf14; third, without building donation classes. This is done in order to validate a less and less ‘conservative’ integration scenario up to the one where farms with different specializations are matched (an undesirable case).
The best performing combination in terms of the graphical analysis of the distribution of the variables pre-and-post the integration, the graphical analysis of the distribution of the variable W and the MSE of the variable W is nnd.mn (i.e. the combination of the Nearest Neighbour Distance Hot Deck with the Manhattan distance function), building the donation classes only upon the variable tf14 (i.e. the farm specialization). We show here the results of this integration but we discuss them also considering the other combinations (nnd.ms, nnd.et, cnnd.mn, cnnd.ms, cnnd.et, rhd.mn, rhd.ms, rhd.et and the rkhd technique) applied to the three possible situations of donation classes previously mentioned.
Figure 9 shows the distribution of the variable uaa_tot that is originally observed in the recipient data set CAP-IRE 2009 while Figure 10 shows the distribution of the variable uaa_tot_imp that is the sum of the UAA of the individual crops imputed from the donor data set FADN.
Figure 9.
Distribution of the variable uaa_tot originally observed in the recipient data set CAP-IRE 2009.
Figure 10.
Distribution of the variable uaa_tot_imp imputed from the donor data set FADN.
Figures 9 and 10 show that the total UAA originally observed in CAP-IRE 2009 is underestimated for the values in the class 30–40 while no matches exist for the values in the class 90–100 and the extreme ones (classes of UAA 390–400 and 590–600).
Figure 11 shows the overlap between the two density distributions of the variables uaa_tot and uaa_tot_imp in the synthetic (complete) data set generated by the nnd.mn combination. The Hellinger distance index is equal to 0.04007, acceptably under the 5% threshold. As the figure shows, the farms with the largest UAA are underestimated by the imputation.
Figure 11.
Density distributions of the variables uaa_tot (observed) and uaa_tot_imp (imputed) in the synthetic (complete) data set generated applying the nnd.mn combination.
Figure 12 depicts the graphical analysis of the distribution of the variable W , whose MSE is equal to 607.816. The Figure clearly shows how the highest values of UAA are underestimated by the imputation.
Figure 12.
Distribution of the variable W resulting from the nnd.mn combination.
The other combinations, as well as the integrations carried out by means of the nnd.mn combination both with the donation classes built upon the variables tf14 and alt and without donation classes, perform worse. In terms of the Hellinger distance index, none of them is under the 5% threshold, ranging, for example, from 0.05258 (nnd.mn) or 0.06295 (nnd.et) up to 0.18657 (rhd.mn) or 0.19899 (rhd.et) in the integrations with donation classes built upon both the variables tf14 and alt. In terms of the MSE of the variable W, considering these same integrations, values range from 1793.002 (nnd.mn) or 4763.97 (nnd.et) up to 2074.044 (rhd.mn) or 2848.313 (rhd.et). The general tendency of the other combinations, both when the donation classes are conservative (i.e. when both tf14 and alt are used) and when only the variable tf14 is used, is to overestimate the originally observed variable uaa_tot. Instead, when the donation classes are not built, the integration goodness worsens, with very high MSE values and unequivocal differences between the distributions of the variables uaa_tot and uaa_tot_imp, coherently with the results of the simulation study.
5. Discussion and conclusions
This work proposes a validation strategy for assessing the integration goodness. Moreover, we inspect the validity of the commonly accepted practices applied in many applications of the hot deck method, i.e. the non-parametric micro SM. We explore different combinations of the hot deck distance-based techniques with non-default distance functions, comparing their performances and also taking into account the Rank Hot Deck technique. This is done in order to identify the best performing combination, i.e. the one generating the best synthetic (complete) data set, and to offer some insights on data integration with the non-parametric micro SM.
All the practices are confirmed except the so-called ‘the biggest, the best’ rule. This practice must no longer be considered mandatory whenever the variability of the matching variable(s) in the donor data set is higher than the variability of the matching variable(s) in the recipient one. Under this condition, the goodness of the integration can be high regardless of the dimensionality ratio between the donor and the recipient data sets. Researchers should therefore give more relevance to the choice of the matching variable(s) rather than binding the whole integration process to the dimensional characteristics of the data sets at their disposal.
Considering the simulated scenarios of different dimensionality ratios and of different variances of the matching variables between the donor and the recipient data sets, and carrying out the integrations both with and without donation classes, we find that the best performing combinations are the Nearest Neighbour Distance Hot Deck (nnd) and the Constrained Nearest Neighbour Hot Deck (cnnd) techniques with the Manhattan (mn) and Mahalanobis (ms) distance functions. The core of our validation strategy is to analyze, by means of the variable W, the difference between the variable(s) originally observed in the recipient data set and the one(s) imputed from the donor data set. W is defined as the difference between the variable(s) originally observed in the recipient data set R and the ‘same’ one(s) imputed from the donor data set D. One criticism of the proposed validation strategy could be that, in real-life data integration, it is really difficult to observe the same variables in both R and D (beyond the few potential matching variables X). Nevertheless, in several situations the data sets at disposal contain functionally related variable(s) and/or the same ones. For example, we can observe in R a variable that is the sum of some individual variables observed in D. In data sources on households and individuals we could use the total household income, which is the sum of the incomes of the individual household members. Similarly, as we show in the application in Section 4, in data sources on agricultural holdings we can use the Utilised Agricultural Area (UAA) in hectares of the farm, which is the sum of the hectares of UAA of the individual crops. The total variable observed in the recipient data set can then be used to assess the integration goodness, comparing it with the individual variables observed in the donor data set, summed and/or adjusted after the imputation.
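In the farm example, this validation step reduces to a few lines once the synthetic data set has been generated: the imputed crop-level UAA variables are summed into uaa_tot_imp, W is obtained as the difference with the observed uaa_tot, and its MSE is computed. A minimal sketch, with hypothetical crop-level column names, could be:

```python
import numpy as np

# `fused` is the synthetic (complete) data set: the recipient records plus the
# crop-level UAA variables imputed from the donor (hypothetical column names).
crop_cols = ["uaa_crop1_imp", "uaa_crop2_imp", "uaa_crop3_imp"]

# Rebuild the total UAA from the imputed crop-level values.
fused["uaa_tot_imp"] = fused[crop_cols].sum(axis=1)

# W = originally observed total UAA minus the total rebuilt from the imputed
# components; a well matched record gives a value of W close to 0.
fused["W"] = fused["uaa_tot"] - fused["uaa_tot_imp"]

mse_W = np.mean(fused["W"] ** 2)
print(f"MSE of W: {mse_W:.3f}")
```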
Further developments of the proposed validation strategy could be explored. For example, in order to increase the integration goodness, we could impose at the beginning of the matching procedure a constraint that holds on the relation between the functional variable(s) and the individual ones; in other words, we could use the logical constraints exploited by the validation strategy at the beginning of the matching procedure rather than ex post, for the assessment of the integration goodness. Moreover, it could be useful to elaborate a synthetic index summarizing the information given by our three single tools. Such an index would help the researcher to assess the goodness of the integration process quickly and synthetically.
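As a purely illustrative sketch of the first of these developments (imposing the logical constraint ex ante rather than using it ex post), the donor pool of each recipient record could be restricted, before the distance is minimized, to the donors whose summed crop-level UAA lies within a given tolerance of the recipient's observed total UAA. The function below is a hypothetical outline of this idea; the column names and the 10% tolerance are assumptions, not part of the paper.

```python
import numpy as np

def constrained_donor_pool(rec_total, donor, crop_cols, tol=0.10):
    """Keep only donors whose summed crop-level UAA is within a relative
    tolerance `tol` of the total UAA observed for the recipient record.

    This enforces the logical relation 'total = sum of components' before the
    nearest-neighbour search instead of checking it after the matching.
    """
    donor_totals = donor[crop_cols].sum(axis=1)
    ok = np.abs(donor_totals - rec_total) <= tol * max(rec_total, 1e-9)
    return donor[ok]

# Inside the hot deck loop, the nearest donor (e.g. by Manhattan distance on
# the matching variables) would then be searched only within
# constrained_donor_pool(rec_row["uaa_tot"], donor_class,
#                        ["uaa_crop1", "uaa_crop2"], tol=0.10).
```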
Methodologically speaking, it could be relevant to address the integration of two (or more) data sets collecting information about samples that do not refer, even by assumption, to the same target population. Indeed, to the best of our knowledge, there are no SM applications that do not rely on the assumption that the recipient and donor data sets contain information on two representative samples of the same target population. Last but not least, further analysis could concern the inclusion in the integration process of the information on the sample weights and on the sampling design, also properly taking into account the fact that, so far, the missing completely at random situation is the main assumption considered in the data integration literature.
Appendix.
Note on the ‘1 to 1’ dimensionality ratio
This Appendix discusses the results for the integration scenarios in which R and D have the same dimensions, i.e. the dimensionality ratio between the recipient and the donor is 1 to 1, under the different variability scenarios of the matching variable and both with and without donation classes. Table A1 reports the corresponding MSE of the variable W.
Table A1. MSE of the variable W when the dimensionality ratio is 1 to 1, for the different variability scenarios of the matching variable, with and without donation classes.
| Technique | with donation classes | | | | without donation classes | | | |
|---|---|---|---|---|---|---|---|---|
| nnd.mn | 107.257 | 13.678 | 5.836 | 6.083 | 179.474 | 85.020 | 96.779 | 85.872 |
| nnd.ms | 107.257 | 13.678 | 5.836 | 6.083 | 179.474 | 85.020 | 96.779 | 85.872 |
| nnd.et | 3368.738 | 120.481 | 562.360 | 237.360 | 2735.232 | 158.254 | 554.838 | 160.562 |
| rhd.mn | 1242.158 | 20.818 | 17.184 | 11.548 | 1463.593 | 107.653 | 112.489 | 99.766 |
| rhd.ms | 1138.947 | 18.295 | 15.135 | 13.921 | 1375.964 | 80.764 | 111.357 | 82.065 |
| rhd.et | 1808.278 | 122.391 | 221.126 | 92.479 | 1932.221 | 138.015 | 297.469 | 86.866 |
| rkhd | 123.261 | 26.572 | 56.913 | 40.255 | 211.712 | 119.535 | 164.047 | 154.632 |
Disclosure statement
No potential conflict of interest was reported by the author(s).
ORCID
Riccardo D'Alberto http://orcid.org/0000-0002-7227-7485
Meri Raggi http://orcid.org/0000-0001-6960-1099
References
- 1. Abello R. and Phillips B., Statistical matching of the HES and NHS: An exploration of issues in the use of unconstrained and constrained approaches in creating a basefile for a microsimulation model of the pharmaceutical benefits scheme, Tech. Rep., Australian Bureau of Statistics, Canberra, 2004.
- 2. Andridge R.R. and Little R.J.A., A review of hot deck imputation for survey non-response, Int. Stat. Rev. 78 (2010), pp. 40–64. doi: 10.1111/j.1751-5823.2010.00103.x.
- 3. Australian Bureau of Statistics, ABS data integration (2017). Available at http://www.abs.gov.au – accessed: 25th June 2018.
- 4. Balin M., D'Orazio M., Di Zio M., Scanu M., and Torelli N., On the evaluation of matching noise produced by non-parametric imputation techniques, Tech. Rep., ISTAT, Roma, 2009.
- 5. Barry J.T., An investigation of statistical matching, J. Appl. Stat. 15 (1988), pp. 275–283. doi: 10.1080/02664768800000038.
- 6. Blackwell L., Charlesworth A., and Rogers N.J., Linkage of census and administrative data to quality assure the 2011 census for England and Wales, J. Off. Stat. 31 (2015), pp. 453–473. doi: 10.1515/JOS-2015-0027.
- 7. Blöschl G., Statistical upscaling and downscaling in hydrology, in Encyclopedia of Hydrological Sciences, M.G. Anderson, J.J. McDonnell, and T. Gale, eds., Wiley, New York, 2006. doi: 10.1002/0470848944.hsa008.
- 8. Brozzi A., Capotorti A., and Vantaggi B., Incoherence correction strategies in statistical matching, Int. J. Approximate Reasoning 53 (2012), pp. 1124–1136. doi: 10.1016/j.ijar.2012.06.009.
- 9. Burkard R. and Derigs U., Assignment and Matching Problems: Solution Methods with FORTRAN-Programs, Springer-Verlag, Berlin, 1980.
- 10. Conti P.L., Marella D., and Scanu M., Evaluation of matching noise for imputation techniques based on nonparametric local linear regression estimators, Comput. Stat. Data Anal. 53 (2008), pp. 354–365. doi: 10.1016/j.csda.2008.07.041.
- 11. Conti P.L., Marella D., and Scanu M., Uncertainty analysis in statistical matching, J. Off. Stat. 28 (2012), pp. 69–88.
- 12. Conti P.L., Marella D., and Scanu M., Uncertainty analysis for statistical matching of ordered categorical variables, Comput. Stat. Data Anal. 68 (2013), pp. 311–325. doi: 10.1016/j.csda.2013.07.004.
- 13. Conti P.L., Marella D., and Scanu M., How far from identifiability? A systematic overview of the statistical matching problem in a non parametric framework, Commun. Stat. Theory Methods 46 (2017), pp. 967–994. doi: 10.1080/03610926.2015.1010005.
- 14. Conti P.L. and Scanu M., On the evaluation of matching noise produced by non-parametric imputation techniques, Tech. Rep. DSPSA 7/2005, La Sapienza University, Roma, 2005.
- 15. D'Alberto R., Statistical matching imputation among different farm data sources, Ph.D. thesis, Alma Mater Studiorum – University of Bologna, IT, 2017. doi: 10.6092/unibo/amsdottorato/7788.
- 16. D'Alberto R., Zavalloni M., Raggi M., and Viaggi D., AES impact evaluation with integrated farm data: Combining statistical matching and propensity score matching, Sustainability 10 (2018), pp. 1–24. doi: 10.3390/su10114320.
- 17. Dawid A.P., Conditional independence in statistical theory, J. R. Stat. Soc. Ser. B Methodol. 41 (1979), pp. 1–31. Available at https://www.jstor.org/stable/2984718.
- 18. Denk M. and Hackl P., Data integration and record matching: An Austrian contribution to research in official statistics, Aust. J. Stat. 32 (2003), pp. 305–321. doi: 10.17713/ajs.v32i4.464.
- 19. Donatiello G., D'Orazio M., Frattarola D., Rizzi A., Scanu M., and Spaziani M., Statistical matching of income and consumption expenditures, Int. J. Econ. Sci. 3 (2014), pp. 50–65.
- 20. D'Orazio M., Statistical matching and imputation of survey data with StatMatch (2015). Software available at http://CRAN.R-project.org/package=StatMatch.
- 21. D'Orazio M., Di Zio M., and Scanu M., Statistical Matching: Theory and Practice, Wiley, Hoboken, 2006. doi: 10.1002/0470023554.
- 22. European Commission, Insights on data integration methodologies, Tech. Rep. Methodologies & WP 2009, EUROSTAT, Luxembourg, 2009.
- 23. Fellegi I.P. and Holt D., A systematic approach to automatic edit and imputation, J. Am. Stat. Assoc. 71 (1976), pp. 17–35. doi: 10.1080/01621459.1976.10481472.
- 24. Gallagher E.D., Compah documentation, Tech. Rep., University of Massachusetts, Boston, 1999.
- 25. Gower J.C., A general coefficient of similarity and some of its properties, Biometrics 27 (1971), pp. 857–871. doi: 10.2307/2528823.
- 26. Kadane J.B., Some statistical problems in merging data files, Tech. Rep., Office of Tax Analysis, U.S. Department of the Treasury, Washington DC, 1978.
- 27. Little R.J.A., A test of missing completely at random for multivariate data with missing values, J. Am. Stat. Assoc. 83 (1988), pp. 1198–1202. doi: 10.2307/2290157.
- 28. Mardia K.V., Kent J.T., and Bibby J.M., Multivariate Analysis (Probability and Mathematical Statistics), Academic Press, London, 1980.
- 29. Marella D., Scanu M., and Conti P.L., On the matching noise of some nonparametric imputation procedures, Stat. Probab. Lett. 78 (2008), pp. 1593–1600. doi: 10.1016/j.spl.2008.01.020.
- 30. Markatou M., Chen Y., Afendras G., and Lindsay B.G., Statistical distances and their role in robustness, in New Advances in Statistics and Data Science, D.G. Chen, Z. Jin, G. Li, Y. Li, A. Liu, and Y. Zhao, eds., Springer, New York, 2017.
- 31. Moriarity C. and Scheuren F., Statistical matching: A paradigm for assessing the uncertainty in the procedure, J. Off. Stat. 17 (2001), pp. 407–422.
- 32. Okner B.A., Constructing a new data base from existing microdata sets: The 1966 merge file, Ann. Econ. Soc. Meas. 1 (1972), pp. 325–342.
- 33. Rässler S., Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches, Lecture Notes in Statistics, Vol. 168, Springer, New York, 2002.
- 34. Roesch A. and Lips M., Sampling design for two combined samples of the Farm Accountancy Data Network (FADN), J. Agric. Biol. Environ. Stat. 18 (2013), pp. 178–203. doi: 10.1007/s13253-013-0130-5.
- 35. Rubin D.B., Inference and missing data, Biometrika 63 (1976), pp. 581–592. doi: 10.1093/biomet/63.3.581.
- 36. Rubin D.B., Statistical matching using file concatenation with adjusted weights and multiple imputations, J. Bus. Econ. Stat. 4 (1986), pp. 87–94. doi: 10.2307/1391390.
- 37. Singh A.C., Mantel H.J., Kinack M.D., and Rowe G., Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption, Surv. Methodol. 19 (1993), pp. 59–79.
- 38. Sutherland H., Taylor R., and Gomulka J., Combining household income and expenditure data in policy simulations, Rev. Income Wealth 48 (2002), pp. 517–536. doi: 10.1111/1475-4991.00066.
- 39. UNECE, UNECE Stats (2017). Available at https://www.unece.org/stats/stats_h.html – accessed: 14th December 2017.
- 40. van der Laan P., Integrating administrative registers and household surveys, Neth. Off. Stat. 15 (2000), pp. 7–15.
- 41. Winkler W.E., Overview of record linkage and current research directions, Tech. Rep. Statistics #2006-2, Bureau of the Census, Washington DC, 2006.