Abstract
Sample surveys are an essential approach used in veterinary research and investigation. A sample obtained from a well-designed sampling process along with robust data analysis can provide valuable insight into the attributes of the target population. Two approaches, design-based or model-based, can be used as inferential frameworks for analysing survey data. Compared to the model-based approach, the design-based approach is usually more straightforward and directly makes inferences about the finite target population (such as the dairy cows in a herd or dogs in a region) rather than an infinite superpopulation. In this paper, the concept of probability sampling and the design-based approach is briefly reviewed, followed by a discussion of the estimations and their justifications in the context of several different elementary sampling methods, including simple random sampling, stratified random sampling, and one-stage cluster sampling. Finally, a concrete example of a complex survey design (involving multistage sampling and stratification) is demonstrated, illustrating how finding unbiased estimators and their corresponding variance formulas for a complex survey builds on the techniques used in elementary sampling methods.
Keywords: sampling, survey methodology, design-based approach, unbiasedness, variance estimation
1. Introduction
Sample surveys, where data from a subset, or sample, of a population are used to make inferences about that population, are a traditional research methodology which has been widely used in veterinary research and investigation [1,2]. However, in this era of “big data”, with modern techniques such as machine learning, bioinformatics, or other computer-based technologies being increasingly used in veterinary research [3] across areas such as animal behaviour [4] and disease detection [5] and prediction [6], the sample survey is in danger of appearing “old fashioned” and “out-dated”.
However, the “old-fashioned” sample survey still has some advantages over cutting-edge big data methodologies. Firstly, in a sample survey, information or data can be collected actively in order to answer a specific research question, whereas the research question that can be answered by “big data” techniques is dependent on what information is available in the big data source. Secondly, in a well-planned sample survey the target population can be framed in advance and followed by a well-designed sampling process so that the samples are representative of the population [7]. This representativeness is often not achieved during the passive “big data” collection process, with data often being collected only from a particular subset of the target population—e.g., Revilla, et al. [8] analysed more than 10.5 million measurements from ~13,000 pigs obtained using automatic feeding systems. However, this dataset was collected from only one boar testing station, making generalisation to the wider population potentially difficult. Finally, it is not economically feasible to collect “big data” when novel information is required for some specific research topics.
Once the specific data for a research topic are collected, a rigorous and robust data analysis is essential to gain insight from the sample survey. As per Alexis de Tocqueville: “when statistics are not based on strictly accurate calculations, they mislead instead of guide” [9]. Generally, there are two approaches for analysing survey data: the model-based approach and the design-based approach [10]. The former is possibly better understood by many veterinary researchers who have undertaken standard quantitative research methodology training, as mainstream statistical training usually treats the observed data—e.g., production data or diagnostic test outcomes—as realisations of some relevant random variables. However, one important assumption in this method is usually overlooked—that the underlying population is treated as a “superpopulation” which contains infinitely many animals [11]. Strictly speaking, an estimated model parameter therefore refers to a property of the hypothetical “superpopulation” rather than a characteristic of the finite population which is of actual interest [7].
For example, suppose that simple random sampling had been implemented on a dairy farm to study the prevalence of bovine digital dermatitis using a large sampling fraction (~70% of the herd size). The analyst then fitted an intercept-only logistic regression to estimate the intercept. As the intercept represents the logit of the prevalence, using a suitable back-transformation, the analyst was then able to report the estimated prevalence of digital dermatitis on the farm. However, given a large sampling fraction, this estimate actually represents the prevalence of a hypothetical superpopulation from which the sample was drawn rather than the prevalence on the farm of interest [12]. To make an inference to the actual finite population, a superpopulation approach can be used [12]; however, although mathematically correct, explaining the approach is likely to create difficulties in communications with other rural professionals or companion animal practitioners.
The design-based approach for analysing survey data avoids the complexity in analysis and communication seen with the model-based approach. First, it focuses on inferences related to the target finite population(s) without introducing extra assumptions about the parametric form of the outcome variable. Second, the analysis steps are consistent with the sampling steps, so the process of checking for potential mistakes during the analysis is clearer [13]. Third, the design-based approach does not require an “assumed” probability distribution: the only probability distribution involved is the one dictated by the sampling design itself [14]. The aim of this review article is to provide a comprehensive introduction to the design-based approach for analysing survey data by (1) describing the analytical methods for elementary probability sampling methods, including simple random sampling, stratified random sampling, and cluster sampling, and (2) demonstrating the key ideas necessary to understand and interpret those analytical methods, as well as how those ideas can be used to develop methods for any specific complex survey design.
2. Overview of Probability Sampling
First, we define a set $U$ as a target population including $M$ animals in a finite population (e.g., animals on a farm or all >2-year-old Jersey cows in a region). There are various ways to obtain samples, which are just subsets of $U$. Let $S$ denote a particular sample chosen from $U$; then $S \subset U$. With the proper subset notation “$\subset$”, we restrict the sample size to being smaller than the population size. Suppose we want to obtain a random sample with $m$ animals; we could have $L$ different samples: $S_1, S_2, S_3, \ldots, S_L$. We can then define a set, the “sample space” (denoted as $\Omega$), that contains all these samples. With probability sampling, a probability $P(S_l)$ can be explicitly assigned to each of the samples, with the constraint that $\sum_{l=1}^{L} P(S_l) = 1$, as the axiom states that the probability of a sample space is 1 and the union of all the samples forms the sample space. The probability of obtaining each of the $L$ samples does not have to be constant (i.e., $P(S_1) \neq P(S_2)$ is absolutely acceptable), and we can also restrict the probability of a particular sample to 0 if some animals within the sample are considered inappropriate as study units. The other feature of these samples is that two samples can include the same animals, and the probability of an animal $k$ being selected (its inclusion probability, $\pi_k$) is calculated by summing the probabilities of all samples including this animal, i.e., $\pi_k = \sum_{l:\, k \in S_l} P(S_l)$. An intuitive numeric example is displayed in Figure 1. Finally, we define the sampling weight as the reciprocal of the inclusion probability, $w_k = 1/\pi_k$, for any type of sampling method [15]. Generally, it is recommended that the veterinary researcher interprets the sampling weight of the animal $k$ as the number of animals in the target population represented by this animal (a deeper treatment of sampling weights can be found in Gelman [16]; however, non-response adjustments are beyond the scope of this article).
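The relationship between sample probabilities, inclusion probabilities, and sampling weights can be sketched in a few lines of Python. The four-animal population and the three samples below are hypothetical values chosen for illustration (they are not the example shown in Figure 1), and the function name is likewise an assumption.

```python
from fractions import Fraction

# Hypothetical design: a population of four animals (A-D) and three possible
# samples with unequal selection probabilities.
design = {
    frozenset({"A", "B"}): Fraction(1, 2),
    frozenset({"A", "C"}): Fraction(1, 4),
    frozenset({"B", "D"}): Fraction(1, 4),
}

def inclusion_probabilities(design):
    """Sum the probabilities of every sample that contains each animal."""
    pi = {}
    for sample, prob in design.items():
        for animal in sample:
            pi[animal] = pi.get(animal, Fraction(0)) + prob
    return pi

# The sample probabilities must sum to 1 over the sample space.
assert sum(design.values()) == 1

pi = inclusion_probabilities(design)
# Sampling weight = reciprocal of the inclusion probability.
weights = {animal: 1 / p for animal, p in pi.items()}
```

In this sketch, animals A and B each appear in two samples and so have a larger inclusion probability (and a smaller weight) than animals C and D.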
3. Design-Based and Model-Based Approaches
3.1. Overview of Design-Based Approach
In the design-based approach, the observed value $y_k$ (production record or test outcome) of the kth animal is not considered to be a realisation generated from some data generation mechanism (or “population”); instead, it is regarded as a fixed constant, with the randomness arising solely from the sample selection [17]. In plainer language, although $y_k$ is fixed, it remains unknown unless the animal is selected in the sample [18]. If all animals are tested from a given population, then the test results for all the animals are known without any uncertainty. In contrast, if we only test a sample of these animals, the only outcomes that we know are those of the animals included in the sample. If we randomly draw samples of a fixed size repeatedly from a target population, a particular animal will not necessarily be included in every sample due to the randomness in the sample selection; hence, the sample statistics can vary across the samples. That is the only source of randomness. Therefore, it is natural to define a Bernoulli random variable $I_k$ to indicate whether an animal in the target population is also in the sample. For example, if the kth animal is included in the sample, the Bernoulli random variable $I_k = 1$; otherwise, it is $I_k = 0$. This random variable maps the (hypothetical) animal ID in the population into numeric values for selection status. This idea is essential when studying the properties of the estimators (unbiasedness) and deriving the variances of these estimators [19].
The design-based approach is of particular value when the finite population characteristics are of interest, because it allows the researcher to make direct inferences about the finite target population, even if the sample size to population size ratio is not small (i.e., when the finite population correction must be considered). For example, the prevalence in a finite population is interpreted as the proportion of diseased animals in that population. Assuming that 70% of the population is sampled, a design-based approach gives a direct estimate of this population proportion, which is often a key target of veterinary investigations. In addition, the estimators (i.e., the rules or formulas) for estimating the finite population characteristics are consistent with the sampling method. Therefore, the estimation process is naturally understandable and easier to communicate to non-statistically inclined veterinarians and researchers [20]. Furthermore, with the design-based approach, the analyst does not need to decide which potential model generated the data, as the observed values are treated as fixed constants. For example, if the average milk production in a herd is of interest, one does not need to assume that milk yield from a cow is generated from a normal random variable, particularly when it is not. One just needs to calculate the sample mean as an estimate of the average milk production in the herd. Finally, whether an estimator is unbiased (i.e., whether its expected value equals the true value of the parameter) does not depend on the parametric form of the observed value.
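To give a feel for why the sampling fraction matters, the sketch below compares the estimated variance of a prevalence under simple random sampling without replacement (which includes the finite population correction discussed in the SRS section) with the usual binomial variance that ignores the finite population. The herd size, sample size, and number of positives are hypothetical, and the function names are assumptions made for illustration.

```python
def design_based_variance(n_positive, m, M):
    """Estimated variance of a prevalence under SRS without replacement."""
    p_hat = n_positive / m
    return (1 - m / M) * p_hat * (1 - p_hat) / (m - 1)

def binomial_variance(n_positive, m):
    """Usual infinite-population (binomial) variance of a sample proportion."""
    p_hat = n_positive / m
    return p_hat * (1 - p_hat) / m

# Hypothetical herd: 140 of 200 cows (70%) are sampled and 42 test positive.
M, m, n_positive = 200, 140, 42
v_design = design_based_variance(n_positive, m, M)
v_binomial = binomial_variance(n_positive, m)
ratio = v_design / v_binomial  # how much smaller the design-based variance is
```

With a 70% sampling fraction, the design-based variance in this sketch is less than a third of the binomial one, which reflects how little of the finite herd remains unobserved.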
3.2. Overview of Model-Based Approach
Although the model-based approach may be more familiar to researchers, we do not advocate it in this paper from a practical point of view. As with most mainstream statistical methods, this approach treats the observed value as a realisation of an underlying random variable. For example, the test result of the kth animal is generated from a random variable $Y_k$, whose parametric form must be decided. If $Y_k$ is a Bernoulli random variable, then with the model-based approach a likelihood-based method is the most common approach for estimating the probability that a random animal will test positive. However, this probability is not a finite population prevalence; it is more correctly interpreted as a hypothetical infinite superpopulation prevalence. Although extra steps can help to make an inference back to the finite population, this adds significant complexity to the analysis and to communicating the results to stakeholders without a statistical background [12]. The other major disadvantage of the model-based approach is that the estimated parameters may be biased if the model is mis-specified [20].
4. Sampling Methods
4.1. Simple Random Sampling
Simple random sampling (SRS) is the most basic form of probability sampling. In this process, all possible samples of a given size have the same probability of being selected, i.e., the sample probability $P(S_l)$ is constant for every possible sample. As a result, all the animals in the population have an equal probability of being included in the sample, i.e., the inclusion probability $\pi_k$ is the same for every animal [21]. This sampling process has been applied to many veterinary studies, including recent investigations of lumpy skin disease [22], bovine mastitis [23], and foot-and-mouth disease [24]. Despite its simplicity, in the right situation it can be a powerful sampling method and provide the theoretical basis for more complicated sampling methods. There are two forms of SRS: with and without replacement. In this article, we will limit the discussion to SRS without replacement (the sample contains no duplicated animals), as this is by far the most common practice in veterinary research.
The statistics in which a researcher is usually interested are the properties of the population, e.g., the average milk production of the herd or the prevalence of a disease within the herd. We denote this finite population mean as $\bar{y}_U = \frac{1}{M}\sum_{k=1}^{M} y_k$; prevalence, $p$, is just a special case of the finite population mean when the individual outcome value $y_k$ can only be 1 or 0. In the SRS setting, estimating the mean is straightforward. However, for other sampling processes this is not always the case; hence, it is easier to start by estimating the finite population total, $t = \sum_{k=1}^{M} y_k$, before moving on to the mean (which is a linear function of the total). To ensure a consistent methodology is used in this review, we will stick with the two-step process: estimating the total first and then the mean or prevalence.
Suppose we have a herd with M animals, of which a sample of m animals has been obtained using SRS. The Horvitz–Thompson (HT) estimator of the finite population total is [25]:
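$$\hat{t} = \sum_{k \in S} w_k y_k = \sum_{k \in S} \frac{y_k}{\pi_k} \tag{1}$$
where $y_k$ is the value recorded for animal $k$, $\pi_k$ is its inclusion probability, and $w_k = 1/\pi_k$ is its sampling weight. In the SRS setting, the sampling weight is a constant as the inclusion probability is the same for every animal, such that $w_k = M/m$ because $\pi_k = E[I_k] = m/M$ (see Appendix A for technical details), where $I_k$ is the Bernoulli random variable for selection and $I_k = 1$ if the animal $k$ is selected; otherwise, $I_k = 0$. The HT estimator is, by design, unbiased, i.e., its expected value is equal to the true value of the finite population characteristic [7]:
$$E[\hat{t}] = E\left[\sum_{k=1}^{M} I_k \frac{y_k}{\pi_k}\right] = \sum_{k=1}^{M} \frac{y_k}{\pi_k} E[I_k] = \sum_{k=1}^{M} y_k = t$$
where $E[\cdot]$ is the expectation operator which takes all the possible values generated by the random variable and returns the probability-weighted average value, so $E[I_k] = 1 \times \pi_k + 0 \times (1 - \pi_k) = \pi_k$.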
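The unbiased estimator for the mean, $\hat{\bar{y}}_U$, or the prevalence, $\hat{p}$ (the same expression with binary $y_k$), is, therefore:
$$\hat{\bar{y}}_U = \frac{\hat{t}}{M} = \frac{1}{M} \cdot \frac{M}{m} \sum_{k \in S} y_k = \frac{1}{m} \sum_{k \in S} y_k \tag{2}$$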
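The proof is immediate: because the population size is a fixed constant, dividing an unbiased estimator of the total by it gives an unbiased estimator of the mean. By observing Equation (2), we see that, in SRS, the sample mean or sample proportion is the unbiased estimator for the population mean or prevalence. This means that, for other sampling strategies, an estimator built up from SRS sample means will also be unbiased, provided that each step mirrors the actual sampling process.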
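Unbiasedness can also be checked by brute force on a small population: enumerate every possible SRS sample, compute the Horvitz–Thompson total for each, and average over the (equally likely) samples. The five milk-yield values below are hypothetical and the helper name is an assumption; exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import combinations

y = [12, 15, 9, 20, 14]   # hypothetical fixed values for a herd of M = 5 animals
M, m = len(y), 2          # population size and SRS sample size

samples = list(combinations(range(M), m))   # all equally likely SRS samples
p_sample = Fraction(1, len(samples))        # constant sample probability

def ht_total(sample):
    """Horvitz-Thompson total under SRS: every sampled animal has weight M / m."""
    return Fraction(M, m) * sum(y[k] for k in sample)

# Expected value of the estimator = probability-weighted average over all samples.
expected_total = sum(p_sample * ht_total(s) for s in samples)
expected_mean = expected_total / M
```

Here `expected_total` equals the true herd total and `expected_mean` equals the true herd mean, even though individual samples over- or under-estimate them.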
To derive the variance of the estimator for the mean or the prevalence, it is also easier to start with the variance of the total. The detailed derivation can be seen in Appendix B; here, we only provide the formulas for the variances. First, the variance for the estimated population total is:
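$$\mathrm{Var}(\hat{t}) = M^2 \left(1 - \frac{m}{M}\right) \frac{S_y^2}{m} \tag{3}$$
where $S_y^2 = \frac{1}{M-1}\sum_{k=1}^{M} (y_k - \bar{y}_U)^2$ is the variance of the finite population. In the special case where we estimate prevalence, we can replace $\bar{y}_U$ with $p$ and, with some algebra (see Appendix B), obtain $S_y^2 = \frac{M}{M-1}\, p(1-p)$. Therefore, the variances for $\hat{\bar{y}}_U$ and $\hat{p}$ are given as follows:
$$\mathrm{Var}(\hat{\bar{y}}_U) = \left(1 - \frac{m}{M}\right) \frac{S_y^2}{m} \tag{4}$$
$$\mathrm{Var}(\hat{p}) = \left(1 - \frac{m}{M}\right) \frac{M}{M-1} \cdot \frac{p(1-p)}{m} \tag{5}$$
However, the finite population variance depends on an unknown quantity, $\bar{y}_U$ or $p$, which we are attempting to estimate; in practice, we often replace $S_y^2$ with $s_y^2 = \frac{1}{m-1}\sum_{k \in S} (y_k - \hat{\bar{y}}_U)^2$, which is the sample variance (or with $\frac{m}{m-1}\,\hat{p}(1-\hat{p})$). Therefore, the estimated variance for $\hat{\bar{y}}_U$ and $\hat{p}$ is:
$$\widehat{\mathrm{Var}}(\hat{\bar{y}}_U) = \left(1 - \frac{m}{M}\right) \frac{s_y^2}{m} \tag{6}$$
$$\widehat{\mathrm{Var}}(\hat{p}) = \left(1 - \frac{m}{M}\right) \frac{\hat{p}(1-\hat{p})}{m-1} \tag{7}$$
where $1 - m/M$ is usually referred to as the finite population correction factor [26].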
To illustrate this process, consider an investigator who wants to estimate the prevalence of digital dermatitis in lactating cows in a dairy herd. A random sample of 100 cows is obtained from a herd of 300 cows, of which 35 sampled cows are diagnosed as diseased. These 35 cows have records $y_k = 1$ and the remaining 65 sampled cows have records $y_k = 0$. The estimated prevalence is calculated using Equation (2); thus, it is $\hat{p} = 35/100 = 0.35$. The variance of this estimate is calculated using Equation (7). As the actual prevalence is unknown, we need to use the estimated prevalence to calculate the estimated variance: $\widehat{\mathrm{Var}}(\hat{p}) = \left(1 - \frac{100}{300}\right) \times \frac{0.35 \times 0.65}{100 - 1} \approx 0.0015$.
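The digital dermatitis example can be reproduced with a short function. This is a minimal sketch that assumes the estimated variance takes the form "finite population correction × p̂(1 − p̂)/(m − 1)", which is the standard design-based result for SRS without replacement; the function name is an assumption.

```python
def srs_prevalence(n_positive, m, M):
    """Estimated prevalence and its estimated variance under SRS without replacement.

    n_positive: number of diseased animals in the sample
    m: sample size, M: herd (population) size
    """
    p_hat = n_positive / m
    fpc = 1 - m / M                                # finite population correction
    var_hat = fpc * p_hat * (1 - p_hat) / (m - 1)
    return p_hat, var_hat

# 35 diseased cows in a sample of 100 drawn from a herd of 300.
p_hat, var_hat = srs_prevalence(35, 100, 300)
se = var_hat ** 0.5   # standard error of the estimated prevalence
```

The estimated prevalence is 0.35, with an estimated variance of about 0.0015 and a standard error of roughly 0.039.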
4.2. Stratified Random Sampling
In the stratified random sampling procedure (STRRS), the target finite population (e.g., the total number of animals within a herd) is partitioned into non-overlapping groups based on some pre-defined attributes and each of the groups is referred to as a stratum. These strata constitute the entire population; therefore, each animal belongs to a specific stratum. Within each stratum, SRS is commonly used to sample animals, and the sampling processes in the different strata are independent [27]. There is no requirement to select all strata within a population. If only some strata are of interest (e.g., only those which include lactating cows), these can be selected and strata that are not of interest can be excluded. If this approach is used, it needs to be made clear that the target population is no longer the entire finite population, but rather the population represented by the selected strata.
The finite population mean or prevalence is then estimated by pooling the information from all the strata. Like SRS, STRRS is commonly used in veterinary research, for example with stratification by area. This allows the researcher to investigate prevalences and associations across a country or a region; e.g., Heayns and Baugh [28] investigated the opinions of veterinarians across the UK about serological testing to assess revaccination requirements in dogs. In this study, each county of Great Britain was considered as a stratum and 10% of the small animal veterinary practices within each stratum were randomly selected (if there were fewer than 10 practices in a county, one practice was randomly chosen to represent the county). Similarly, Atuman et al. [29] investigated dog ecology, dog bites, and rabies vaccination rates in Bauchi, Nigeria, using STRRS. They stratified Bauchi into five areas, and within each area randomly selected 10% of the streets for direct street counts and the administration of a questionnaire. However, other sources of strata are also used; e.g., as part of a randomised clinical trial of footrot treatments in Kashmir, India, Kaler, et al. [30] allocated sheep with acute footrot to one of three treatments using STRRS, with the strata being based on each sheep’s maximum footrot score. Stratification is useful to ensure that the sample includes individuals which could otherwise be missed by chance in SRS due to the limited number of individuals in their stratum. For example, at a certain period a pig farm in Hong Kong may keep few finisher pigs, while many piglets and sows are present on the farm. With SRS, it is quite possible that none of the finisher pigs is included in the sample; one can therefore argue that the population is poorly represented, which could potentially diminish the accuracy of the estimate. For this reason, it is also common to sample a fixed number of individuals in each stratum. Compared to SRS, however, extra information such as the variable used for stratification (membership) must be obtained for all sampling units.
If STRRS has been used, care is required when pooling the information from the strata in order to obtain an unbiased estimator for the finite population mean or prevalence. A “natural” estimator for the mean/prevalence might involve summing up all the observed values in the sample and dividing by the sample size (equivalent to the process of the SRS). However, this estimator is, in general, unbiased only if the sample size in each of the strata is proportional to the actual size of the stratum, i.e., there has been proportional allocation (this is demonstrated in more detail in Appendix C). The more general approach to obtaining an unbiased estimator for the finite population mean or prevalence follows the two principles we have mentioned: (1) following the actual sampling process and (2) starting with the finite population total. Consider a farm with $M$ animals. A researcher has created $J$ strata based on the ages of the animals. For the jth stratum, there are $M_j$ animals, and clearly $\sum_{j=1}^{J} M_j = M$. Suppose that $m_j$ animals are sampled using SRS independently from each of the strata and that the value of the variable of interest is denoted as $y_{jk}$ for the kth animal in the jth stratum.
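The unbiased estimator (using weight notation) for the finite population total is:
$$\hat{t} = \sum_{j=1}^{J} \sum_{k \in S_j} w_{jk}\, y_{jk} \tag{8}$$
where $S_j$ is the set of sampled animals in the jth stratum, $y_{jk}$ is the value recorded for the kth animal in the jth stratum, and $w_{jk}$ is the sampling weight, which is the reciprocal of the inclusion probability $\pi_{jk}$. For STRRS, this is the probability of the kth animal in the jth stratum being selected. However, writing the estimator in this form is not very intuitive, and it can be rewritten into a different formula in order to provide a more intuitive and meaningful picture for veterinary researchers. As SRS has been implemented within each of the strata, the inclusion probability for the kth animal in the jth stratum is simply the stratum sample size $m_j$ divided by the stratum size $M_j$, i.e., $\pi_{jk} = m_j/M_j$, which leads to $w_{jk} = M_j/m_j$. Now, Equation (8) can be rewritten as:
$$\hat{t} = \sum_{j=1}^{J} M_j \left(\frac{1}{m_j} \sum_{k \in S_j} y_{jk}\right) \tag{9}$$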
This formula says that in order to estimate the finite population total, we need to first compute the mean/prevalence for each of the strata using the estimator we have seen in SRS and then multiply it by the stratum size to obtain the estimated total for each stratum. We then sum up all these estimated stratum totals to obtain the estimated finite population total. This is consistent with and follows the actual sampling process, as well as producing an unbiased estimator:
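$$E[\hat{t}] = E\left[\sum_{j=1}^{J} \sum_{k=1}^{M_j} I_{jk} \frac{M_j}{m_j}\, y_{jk}\right] = \sum_{j=1}^{J} \sum_{k=1}^{M_j} \frac{M_j}{m_j}\, y_{jk}\, E[I_{jk}] = \sum_{j=1}^{J} \sum_{k=1}^{M_j} y_{jk} = t$$
where $I_{jk}$ is the Bernoulli random variable for selection, representing whether the kth animal in the jth stratum is selected with an inclusion probability $\pi_{jk}$, and $E[I_{jk}] = \pi_{jk} = m_j/M_j$ due to SRS (with $M_j$ and $m_j$ the size of, and the sample size in, the jth stratum). Once the estimated total is found, the estimated finite population mean or prevalence is just the total divided by the population size $M$:
$$\hat{\bar{y}}_U = \frac{\hat{t}}{M} = \sum_{j=1}^{J} \frac{M_j}{M} \left(\frac{1}{m_j} \sum_{k \in S_j} y_{jk}\right) \tag{10}$$
where $S_j$ is the set of sampled animals in the jth stratum; $\hat{p}$ is given by the same expression when $y_{jk}$ is binary.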
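The difference between the “natural” pooled sample mean and the stratum-weighted estimator can be made concrete by enumerating every possible stratified sample from a tiny population. The two strata, their disease outcomes, and the (deliberately non-proportional) sample sizes below are hypothetical, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical farm: stratum 1 has 4 animals (3 diseased), stratum 2 has 8 (1 diseased).
strata = [[1, 1, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1]]
sample_sizes = [2, 2]                    # not proportional to the stratum sizes
M = sum(len(values) for values in strata)
true_prev = Fraction(sum(sum(values) for values in strata), M)

# SRS within each stratum, independently: every joint outcome is equally likely.
per_stratum = [list(combinations(values, m)) for values, m in zip(strata, sample_sizes)]
outcomes = list(product(*per_stratum))

def naive(outcome):
    """Pooled sample proportion that ignores the stratification."""
    return Fraction(sum(sum(s) for s in outcome), sum(sample_sizes))

def weighted(outcome):
    """Stratum sample means weighted by the stratum share M_j / M."""
    return sum(Fraction(len(values), M) * Fraction(sum(s), m)
               for values, m, s in zip(strata, sample_sizes, outcome))

e_naive = sum(naive(o) for o in outcomes) / len(outcomes)
e_weighted = sum(weighted(o) for o in outcomes) / len(outcomes)
```

In this sketch the weighted estimator averages exactly to the true prevalence (1/3), whereas the pooled sample proportion averages to 7/16 because the small, heavily affected stratum is over-represented in the sample.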
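Since each stratum is independently sampled, building on the SRS, the variances for $\hat{\bar{y}}_U$ and $\hat{p}$ using STRRS are also straightforward:
$$\mathrm{Var}(\hat{\bar{y}}_U) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{S_{y,j}^2}{m_j} \tag{11}$$
$$\mathrm{Var}(\hat{p}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{M_j}{M_j - 1} \cdot \frac{p_j(1-p_j)}{m_j} \tag{12}$$
where $M_j$ and $m_j$ are the size of, and the sample size in, the jth stratum, and both $S_{y,j}^2$ and $p_j$ are unknown quantities representing the population variance and prevalence in the jth stratum. Similar to the SRS, the estimated variances are obtained by substituting estimated quantities for the unknowns, such as:
$$\widehat{\mathrm{Var}}(\hat{\bar{y}}_U) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{s_{y,j}^2}{m_j} \tag{13}$$
$$\widehat{\mathrm{Var}}(\hat{p}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{\hat{p}_j(1-\hat{p}_j)}{m_j - 1} \tag{14}$$
where $s_{y,j}^2$ is the sample variance of the jth stratum (the formula is given in the SRS section) and $\hat{p}_j$ is the sample prevalence in the jth stratum.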
To illustrate this, consider an investigation of the seroprevalence of pseudorabies on a farm where STRRS is used. First, pigs are divided into groups based on the five production stages (strata): piglets, weaners, growers, finishers, and sows (breeding herds). The total numbers of pigs in each stratum are 30, 30, 40, 20, and 60, respectively. Within each stratum, a fixed number of pigs (10) are sampled using SRS and the numbers of infected pigs are 5, 6, 3, 2, and 7. The estimated prevalence can then be calculated using Equation (10): $\hat{p} = \frac{1}{180}(30 \times 0.5 + 30 \times 0.6 + 40 \times 0.3 + 20 \times 0.2 + 60 \times 0.7) = \frac{91}{180} \approx 0.506$. The variance of this prevalence estimate can then be estimated using Equation (14). This is carried out stratum by stratum; for example, for the piglets, $\left(\frac{30}{180}\right)^2 \left(1 - \frac{10}{30}\right) \frac{0.5 \times 0.5}{10 - 1} \approx 0.0005$. This process is then repeated for all the strata, and the estimated variance is the sum of the quantities calculated for each stratum. In the example, the final estimated variance is 0.004.
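The pseudorabies example can be reproduced with the sketch below. It assumes that each stratum contributes "(stratum share)² × finite population correction × p̂(1 − p̂)/(sample size − 1)" to the estimated variance, which is the standard form for SRS within strata; the function name is an assumption.

```python
def stratified_prevalence(stratum_sizes, sample_sizes, positives):
    """Estimated prevalence and its estimated variance under STRRS.

    Each stratum is assumed to have been sampled by SRS without replacement.
    """
    M = sum(stratum_sizes)
    p_hat = 0.0
    var_hat = 0.0
    for M_j, m_j, x_j in zip(stratum_sizes, sample_sizes, positives):
        p_j = x_j / m_j                      # sample prevalence in stratum j
        share = M_j / M                      # stratum share of the population
        p_hat += share * p_j
        var_hat += share ** 2 * (1 - m_j / M_j) * p_j * (1 - p_j) / (m_j - 1)
    return p_hat, var_hat

# Piglets, weaners, growers, finishers, sows.
p_hat, var_hat = stratified_prevalence(
    stratum_sizes=[30, 30, 40, 20, 60],
    sample_sizes=[10, 10, 10, 10, 10],
    positives=[5, 6, 3, 2, 7],
)
```

This gives an estimated prevalence of 91/180 (about 0.506) and an estimated variance of about 0.004, in agreement with the worked example.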
4.3. Cluster Sampling
In this sampling method, the animals in a finite population (animals in a herd, region, or country) are aggregated into larger sampling units: clusters. A cluster is similar to a stratum; however, the sampling process is different. In a cluster sampling procedure, a set of n clusters is sampled using SRS from a population with N clusters. These clusters are usually referred to as primary sampling units, and the members within each cluster as secondary sampling units. Within the primary sampling units, all secondary sampling units may be measured or observed (one-stage cluster sampling) or the secondary sampling units may be sampled using SRS (two-stage cluster sampling). The selected individuals within the selected clusters then form a sample of the finite population [26]. In contrast, in STRRS all strata of interest must be included, and SRS is usually used to sample individuals within each stratum. These different sampling strategies mean that the sources of variability in cluster sampling are different from those in STRRS. In STRRS, the variability of the estimated population mean/prevalence arises only from individual variability within a stratum. For cluster sampling, the variability of the estimated population mean/prevalence comes from one or more sources [27]. In one-stage cluster sampling, where all individuals in a selected cluster are included, the variability of the estimated population characteristic or quantity is dependent on the variability between clusters. In two-stage cluster sampling, where only a sub-sample is collected from selected clusters, the variability of the estimated population characteristic comes from two sources: the within- and between-cluster variabilities [31]. One advantage of cluster sampling is that it overcomes some of the logistics issues associated with SRS or STRRS and therefore generally requires less spending on administration and travel expenses. However, the estimates provided by cluster sampling are usually less precise than those provided by SRS, given the same sample size [27].
Cluster sampling is possibly the most widely used approach in livestock research. Usually, a farm or a herd is regarded as a cluster and a number of farms/herds are selected. This was the approach adopted by Getahun, et al. [32], who studied mastitis and antibiotic resistance patterns in dairy cows in central Ethiopia. This design treated a farm as a cluster and a number of farms were chosen using SRS; within each farm, all the dairy cows were sampled. A similar approach was later used to estimate the prevalence of bovine tuberculosis in southern Ethiopia [33]. In this study, the target population was only cows above 6 months of age, and all cows above 6 months old on the selected dairy farms (clusters) were included. We list here three examples of two-stage cluster sampling in veterinary research for interested readers [34,35,36]. In the rest of this section, we will first provide insights into the estimation process for one-stage cluster sampling and then do the same, in detail, for a two-stage cluster sampling design in which STRRS instead of SRS is used at the second stage (essentially a complex survey design).
4.3.1. One-Stage Cluster Sampling
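In one-stage cluster sampling, all animals within a farm are sampled; therefore, the farm total $t_i = \sum_{k=1}^{M_i} y_{ik}$ is directly measured, where $y_{ik}$ is the value of the variable of interest measured for the kth animal on the ith farm given the herd size of $M_i$. Common research tasks might be to estimate the farm-level and animal-level averages, such as the average milk production or average number of positive animals per farm and average milk production per cow or overall prevalence at the animal level. Suppose $n$ farms are sampled from $N$ farms in a region using SRS, and let $S$ denote the set of sampled farms. As before, to estimate the population mean or prevalence it is always recommended to start by estimating the total. Since SRS is used for sampling clusters, the unbiased estimator for the finite population total (e.g., the number of all diseased dairy cows in a region) is straightforward and therefore given without proof:
$$\hat{t} = \frac{N}{n} \sum_{i \in S} t_i \tag{15}$$
The variance and estimated variance for this estimator can also be straightforwardly determined by applying the theory introduced in the SRS section: $\mathrm{Var}(\hat{t}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}$ and $\widehat{\mathrm{Var}}(\hat{t}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$, where $S_t^2$ and $s_t^2$ are the finite population variance and sample variance (at the farm level), such that $S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2$ and $s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(t_i - \frac{\hat{t}}{N}\right)^2$, with $t = \sum_{i=1}^{N} t_i$ the finite population total. The estimated farm-level average, $\hat{\bar{t}}$, and its corresponding variance and estimated variance are straightforward:
$$\hat{\bar{t}} = \frac{\hat{t}}{N} = \frac{1}{n} \sum_{i \in S} t_i \tag{16}$$
$$\mathrm{Var}(\hat{\bar{t}}) = \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} \tag{17}$$
$$\widehat{\mathrm{Var}}(\hat{\bar{t}}) = \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} \tag{18}$$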
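The total number of animals in the region is $M = \sum_{i=1}^{N} M_i$, where $M_i$ is the herd size of the ith farm and $N$ is the number of farms in the region. Hence, the estimated average at the cow level is given by:
$$\hat{\bar{y}}_U = \frac{\hat{t}}{M} = \frac{N}{nM} \sum_{i \in S} t_i \tag{19}$$
where $t_i$ is the total for the ith farm, $n$ is the number of sampled farms, and $S$ is the set of sampled farms. The variances and estimated variances for the cow-level average or overall prevalence are given as:
$$\mathrm{Var}(\hat{\bar{y}}_U) = \frac{N^2}{M^2} \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} \tag{20}$$
$$\widehat{\mathrm{Var}}(\hat{\bar{y}}_U) = \frac{N^2}{M^2} \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} \tag{21}$$
where $S_t^2$ and $s_t^2$ are the finite population variance and the sample variance of the farm totals. Note that at the farm level, we work on counts of positive animals instead of binary values even if we are estimating prevalence; therefore, the variance formulas for $\hat{\bar{y}}_U$ and $\hat{p}$ are indistinguishable.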
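The one-stage estimators can be sketched as follows. The farm counts, the number of farms in the region, and the regional cow total are hypothetical values chosen for illustration; the sketch also assumes that the regional cow total is known, and the function name is an assumption.

```python
from statistics import variance

def one_stage_cluster(farm_totals, N, M):
    """Design-based estimates for one-stage cluster sampling (SRS of farms).

    farm_totals: total (e.g., number of diseased cows) on each sampled farm
    N: number of farms in the region, M: number of cows in the region
    """
    n = len(farm_totals)
    t_hat = N / n * sum(farm_totals)          # estimated regional total
    s2_t = variance(farm_totals)              # sample variance of farm totals (n - 1)
    var_t_hat = N ** 2 * (1 - n / N) * s2_t / n
    return {
        "total": t_hat,
        "farm_mean": t_hat / N,
        "prevalence": t_hat / M,
        "var_total": var_t_hat,
        "var_farm_mean": var_t_hat / N ** 2,
        "var_prevalence": var_t_hat / M ** 2,
    }

# Hypothetical survey: 5 of 50 farms sampled; 5000 cows in the region.
est = one_stage_cluster([4, 10, 6, 0, 5], N=50, M=5000)
```

Note that only the between-farm variability of the farm totals enters the variance, because every cow on a selected farm is observed.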
4.3.2. Two-Stage Cluster Sampling
The main purpose of this section is to illustrate the estimation process for a complex survey, i.e., how to obtain the unbiased estimators and derive their corresponding variances. Suppose there are $M$ dairy cows in a region with $N$ dairy herds. The herd size for herd $i$ is $M_i$. The cows are separately managed based on a certain criterion; that is, within the ith herd there are $J$ groups, and within each of the groups there are $M_{ij}$ cows. The groups can be treated as strata, as they are not overlapping and constitute the entire herd. A research team is interested in knowing the prevalence of a disease among cows in this region. Based on the demographic information, a two-stage cluster sampling design is chosen. First, $n$ herds will be selected using SRS. Within each of the sampled herds, STRRS will be used to sample $m_{ij}$ cows from each of the strata in each of the herds. Before going to the estimation process, we define some notation (Table 1).
Table 1.
Definition | Symbol
The number of dairy herds in the region. | $N$
The number of dairy herds in the sample. | $n$
The number of cows in the jth stratum in the ith herd. | $M_{ij}$
The sample size in the jth stratum in the ith herd. | $m_{ij}$
The number of cows in the ith herd (herd size for herd i), $M_i = \sum_{j=1}^{J} M_{ij}$. | $M_i$
The sample size for herd i. | $m_i$
The total number of cows in the region, $M = \sum_{i=1}^{N} M_i$. | $M$
The disease outcome (1/0) of the kth cow in the jth stratum in the ith herd. | $y_{ijk}$
The total number of diseased cows in the ith herd, $t_i = \sum_{j=1}^{J} \sum_{k=1}^{M_{ij}} y_{ijk}$. | $t_i$
The herd prevalence for the ith herd, $p_i = t_i / M_i$. | $p_i$
The total number of diseased cows in the region, $t = \sum_{i=1}^{N} t_i$. | $t$
The overall prevalence in the region, $p = t / M$. | $p$
The ultimate goal for this sample survey is to estimate the overall prevalence $p$; however, as in the previous examples it is best to start by estimating the total $t$. Additionally, the computation process needs to be consistent with the actual sampling steps. Thus, we start by estimating the total number of diseased animals in the jth stratum in the ith herd. Within each stratum, SRS is used; therefore, the estimated total can be computed based on Equation (1). The second step is to estimate the total number of diseased animals in the ith herd. Because we used STRRS, this can be achieved by adopting Equation (9). Finally, we can estimate the total number of diseased animals in the region by using Equation (15), as SRS is used to select herds. Hence, the unbiased estimated region total is computed in the following way:
$$\hat{t} = \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{J} \frac{M_{ij}}{m_{ij}} \sum_{k \in S_{ij}} y_{ijk} \tag{22}$$
where $S$ is the set of sampled herds and $S_{ij}$ is the set of sampled cows in the jth stratum in the ith herd.
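To prove that the outcome of this process is unbiased, we simplify the notation, letting $\hat{t}_i = \sum_{j}\frac{M_{ij}}{m_{ij}}\sum_{k \in S_{ij}} y_{ijk}$. We know that $\hat{t}_i$ is unbiased (namely, $E(\hat{t}_i) = t_i$), because we have used STRRS. Secondly, we specify a binary indicator variable $Z_i = 1$ if herd i is selected or $Z_i = 0$ if it is not. Let $\pi_i$ denote the probability that herd i is selected (inclusion probability of a herd); we then have $\pi_i = P(Z_i = 1) = E(Z_i) = n/N$, since SRS is used for the first stage of selection (i.e., the selection of herds). Writing $\mathbf{Z} = (Z_1, \ldots, Z_N)$, so that $\hat{t} = \frac{N}{n}\sum_{i=1}^{N} Z_i\hat{t}_i$, and given that sampling within any herd is independent of the sampling in any other herd and that $\hat{t}_i$ is independent of $\mathbf{Z}$, we have:
$E(\hat{t}) = E\left[E(\hat{t} \mid \mathbf{Z})\right] = E\left[E\left(\frac{N}{n}\sum_{i=1}^{N} Z_i\hat{t}_i \,\Big|\, \mathbf{Z}\right)\right]$ (partition theorem for expectations)
$= E\left[\sum_{i=1}^{N} E\left(\frac{N}{n} Z_i\hat{t}_i \,\Big|\, \mathbf{Z}\right)\right]$ (the conditional expectation of a sum is the sum of the conditional expectations)
$= \frac{N}{n}\sum_{i=1}^{N} E\left[E(Z_i\hat{t}_i \mid \mathbf{Z})\right]$ (expectation is a linear operator and $N/n$ is a constant)
$= \frac{N}{n}\sum_{i=1}^{N} E\left[Z_i E(\hat{t}_i \mid \mathbf{Z})\right]$ (knowing a vector means the same as knowing every element of the vector; conditional on the selection status of every herd means knowing the selection status of any herd)
$= \frac{N}{n}\sum_{i=1}^{N} E\left[Z_i E(\hat{t}_i)\right]$ ($\hat{t}_i$ and $\mathbf{Z}$ are independent)
$= \frac{N}{n}\sum_{i=1}^{N} E(Z_i t_i)$ (unbiased estimator for stratified random sampling for each herd)
$= \frac{N}{n}\sum_{i=1}^{N} t_i E(Z_i) = \frac{N}{n}\sum_{i=1}^{N} t_i \frac{n}{N} = t$ (linear property of expectation).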
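As a sanity check on the argument above, the expectation of the estimated region total can be computed exactly for a toy population by enumerating every possible two-stage sample. The following is only a minimal sketch under stated assumptions, not the authors' supplementary code: the function names, the nested-list data layout and the toy herds are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def herd_estimates(herd, sizes):
    """All equally likely values of the stratified estimate of one herd total.

    herd  : list of strata, each a list of 0/1 outcomes for every cow (assumed layout)
    sizes : SRS sample size taken in each stratum of this herd
    """
    per_stratum = [
        [Fraction(len(y), m_j) * sum(s) for s in combinations(y, m_j)]
        for y, m_j in zip(herd, sizes)
    ]
    # Strata are sampled independently, so every combination is equally likely.
    return [sum(combo) for combo in product(*per_stratum)]


def expected_region_total(herds, sizes, n):
    """Exact expectation of (N/n) * (sum of estimated herd totals)."""
    N = len(herds)
    per_herd = [herd_estimates(h, s) for h, s in zip(herds, sizes)]
    subset_means = []
    for subset in combinations(range(N), n):  # each SRS of n herds is equally likely
        ests = [
            Fraction(N, n) * sum(combo)
            for combo in product(*(per_herd[i] for i in subset))
        ]
        subset_means.append(sum(ests) / len(ests))
    return sum(subset_means) / len(subset_means)


# Toy region (hypothetical data): 3 herds with 2 strata each.
herds = [[[1, 0, 0], [0, 1]], [[0, 0], [1, 1, 0]], [[1, 1], [0, 0, 0, 1]]]
sizes = [[2, 1], [1, 2], [1, 2]]
true_total = sum(sum(sum(y) for y in h) for h in herds)
print(expected_region_total(herds, sizes, n=2), true_total)
```

Exact rational arithmetic (`Fraction`) is used so that the comparison between the expectation and the true total is not blurred by rounding; enumeration is only feasible for very small populations.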
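Therefore, the unbiased estimator for the overall prevalence is simply the estimated region total $\hat{t}$ divided by the known total number of cows in the region, $M$:
$$\hat{p} = \frac{\hat{t}}{M} \tag{23}$$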
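The three estimation steps (stratum, herd, region) translate directly into code. The sketch below is illustrative only and is not the authors' File S1; the dictionary layout (`"M"` for the stratum size, `"y"` for the sampled 0/1 outcomes), the function names and the example numbers are assumptions made for this example.

```python
from typing import Dict, List

# Assumed layout: a sampled herd is a list of strata; each stratum is a dict
# with "M" (number of cows in the stratum) and "y" (0/1 outcomes of sampled cows).
Herd = List[Dict]


def estimate_herd_total(herd: Herd) -> float:
    """Stratified estimate of diseased cows in one herd: sum_j M_ij * mean(y_ij)."""
    return sum(s["M"] * sum(s["y"]) / len(s["y"]) for s in herd)


def estimate_region_total(sampled_herds: List[Herd], N: int) -> float:
    """Scale the sum of estimated herd totals by N/n (SRS of herds)."""
    n = len(sampled_herds)
    return N / n * sum(estimate_herd_total(h) for h in sampled_herds)


def estimate_prevalence(sampled_herds: List[Herd], N: int, M_region: int) -> float:
    """Estimated overall prevalence: estimated region total over total cows."""
    return estimate_region_total(sampled_herds, N) / M_region


# Hypothetical survey: 2 herds sampled out of N = 10, 1000 cows in the region.
sampled = [
    [{"M": 40, "y": [1, 0, 0, 0]}, {"M": 60, "y": [0, 0, 1, 1, 0, 0]}],
    [{"M": 50, "y": [0, 0, 0, 0, 1]}, {"M": 30, "y": [0, 0, 0]}],
]
print(estimate_region_total(sampled, N=10))                   # estimated total
print(estimate_prevalence(sampled, N=10, M_region=1000))      # estimated prevalence
```

In this made-up example the estimated herd totals are 30 and 10, so the estimated region total is (10/2) × 40 = 200 and the estimated prevalence is 200/1000 = 0.2.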
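In order to find the variance formula for $\hat{p}$, it is easier to start with $\hat{t}$. It is necessary to first identify the sources of variability. In this two-stage cluster sampling process, we have between- and within-herd variances. With $\mathbf{Z}$ denoting the vector of herd selection indicators, the variance partition formula decomposes the total variance into two parts: $V(\hat{t}) = V\left[E(\hat{t} \mid \mathbf{Z})\right] + E\left[V(\hat{t} \mid \mathbf{Z})\right]$, where $V\left[E(\hat{t} \mid \mathbf{Z})\right]$ measures the variability between herds and $E\left[V(\hat{t} \mid \mathbf{Z})\right]$ measures the variability within a herd. Since SRS is implemented at the herd level, according to Equation (3), we have:
$$V\left[E(\hat{t} \mid \mathbf{Z})\right] = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n}$$
where $S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \frac{t}{N}\right)^2$. This part of the variance is the same as that of one-stage cluster sampling, since the herd sampling procedures are exactly the same. The detailed derivation is essentially the same as the derivation of the variance in SRS (see Appendix B).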
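For the within-herd component of the variance, $E\left[V(\hat{t} \mid \mathbf{Z})\right]$, the formula inside the expectation operator is $V(\hat{t} \mid \mathbf{Z}) = E(\hat{t}^2 \mid \mathbf{Z}) - \left[E(\hat{t} \mid \mathbf{Z})\right]^2$, according to the conditional variance formula. The detailed mathematical derivation is available in Appendix D, and Appendix E provides the statistical theorems required in this paper. Here, we only give an essential intermediate result: $E\left[V(\hat{t} \mid \mathbf{Z})\right] = \frac{N}{n}\sum_{i=1}^{N} V(\hat{t}_i)$. Since STRRS is implemented within each herd, $V(\hat{t}_i)$ can be easily obtained from Equation (11) or Equation (12) depending on the nature of $y_{ijk}$. In our particular example, where $y_{ijk}$ takes a binary value (either 1 or 0), we have $V(\hat{t}_i) = \sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{M_{ij}}{M_{ij}-1}\frac{p_{ij}(1-p_{ij})}{m_{ij}}$, where $p_{ij}$ is the prevalence in the jth stratum of the ith herd. Generally, $V(\hat{t}_i) = \sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{S_{ij}^2}{m_{ij}}$.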
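Therefore, the general formula for the variance of $\hat{t}$ is given as:
$$V(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{S_{ij}^2}{m_{ij}} \tag{24}$$
where $S_{ij}^2 = \frac{1}{M_{ij}-1}\sum_{k=1}^{M_{ij}}\left(y_{ijk} - \bar{y}_{ij}\right)^2$, and $\bar{y}_{ij}$ is the unknown mean of the jth stratum in the ith herd. When $y_{ijk}$ takes a binary value, the special form is given by applying the method introduced in the SRS section (see Equation (5)):
$$V(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{M_{ij}}{M_{ij}-1}\frac{p_{ij}(1-p_{ij})}{m_{ij}} \tag{25}$$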
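Again, this variance depends on unknown quantities. As before, we replace them with their estimates from the sample. Thus, the estimated variance (general form) is:
$$\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{n}\sum_{i \in S}\sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{s_{ij}^2}{m_{ij}} \tag{26}$$
where $s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(\hat{t}_i - \frac{\hat{t}}{N}\right)^2$. Note the difference between this $s_t^2$, which is based on the estimated herd totals $\hat{t}_i$, and the $s_t^2$ in the one-stage cluster sampling, which is based on the observed herd totals $t_i$; $s_{ij}^2 = \frac{1}{m_{ij}-1}\sum_{k \in S_{ij}}\left(y_{ijk} - \hat{\bar{y}}_{ij}\right)^2$ is the sample variance within the jth stratum in the ith herd, with $\hat{\bar{y}}_{ij}$ being the estimated sample mean. When $y_{ijk}$ takes a binary value, the special form is given as:
$$\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{n}\sum_{i \in S}\sum_{j} M_{ij}^2\left(1 - \frac{m_{ij}}{M_{ij}}\right)\frac{\hat{p}_{ij}(1-\hat{p}_{ij})}{m_{ij}-1} \tag{27}$$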
Finally, the variance and estimated variance for $\hat{p}$ are found simply by multiplying the results of Equations (25) and (27) by the constant $1/M^2$. The same process can be applied to find the variance and estimated variance for the estimated mean $\hat{t}/M$ when $y_{ijk}$ is not limited to binary values. A numerical illustration of this design would be tedious to present by hand; we have therefore provided Python code for the computation (see the Supplementary Materials: Python code for the two-stage cluster sampling where stratification is implemented within the clusters).
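To make the two variance components concrete, the following sketch computes the estimated variance of the region total as a between-herd term plus a within-herd term, using sample variances with an $m - 1$ denominator. It is not the authors' supplementary code; the data layout, function names and numbers are hypothetical, and the sketch assumes at least two sampled herds and at least two sampled cows per stratum.

```python
from math import sqrt


def sample_variance(values):
    """Sample variance with an (m - 1) denominator; needs at least two values."""
    m = len(values)
    mean = sum(values) / m
    return sum((v - mean) ** 2 for v in values) / (m - 1)


def herd_total(herd):
    """Stratified estimate of the herd total: sum_j M_ij * mean(y_ij)."""
    return sum(s["M"] * sum(s["y"]) / len(s["y"]) for s in herd)


def herd_within_variance(herd):
    """Estimated variance of one herd total under stratified random sampling."""
    total = 0.0
    for s in herd:
        M_ij, m_ij = s["M"], len(s["y"])
        total += M_ij**2 * (1 - m_ij / M_ij) * sample_variance(s["y"]) / m_ij
    return total


def estimated_variance_of_total(sampled_herds, N):
    """Between-herd term plus within-herd term for the estimated region total."""
    n = len(sampled_herds)
    t_hats = [herd_total(h) for h in sampled_herds]
    between = N**2 * (1 - n / N) * sample_variance(t_hats) / n
    within = N / n * sum(herd_within_variance(h) for h in sampled_herds)
    return between + within


# Hypothetical survey: 2 herds sampled out of N = 10, 1000 cows in the region.
sampled = [
    [{"M": 40, "y": [1, 0, 0, 0]}, {"M": 60, "y": [0, 0, 1, 1, 0, 0]}],
    [{"M": 50, "y": [0, 0, 0, 0, 1]}, {"M": 30, "y": [0, 0, 0]}],
]
var_total = estimated_variance_of_total(sampled, N=10)
var_prev = var_total / 1000**2          # multiply by 1 / M^2 for the prevalence
print(var_total, sqrt(var_prev))        # variance of total, SE of prevalence
```

With these made-up numbers the between-herd term is 8000 and the within-herd term is 1620, so most of the uncertainty comes from having sampled only two herds.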
5. Sample Size Consideration
Although this paper has focused principally on the estimation of outcomes of interest, sample size calculations are also a critical part of the study design process. For ready-to-use sample size calculation formulas, readers are directed to Stevenson 2021 [37]. However, for a complex survey, where the formula needs to be derived on a case-by-case basis, it is valuable to briefly introduce the principles behind the sample size calculation. The formulas for sample size calculations are closely related to the sampling distributions of the estimators. The investigator needs to specify an expected value for the finite population characteristic of interest and then decide how precise the estimate needs to be. The more precise the estimate needs to be, the narrower the sampling distribution of the estimator must be (and thus the larger the sample size). It is therefore natural to ask what the sampling distribution is and which quantity defines its spread.
Here, we use SRS as an example, as it serves as the theoretical basis for other more complicated sampling methods. Suppose that in the SRS for estimating prevalence in a finite herd of size $M$, the sampling distribution for $\hat{p}$ is approximately normal, with mean = $p$ and variance = $\left(1 - \frac{m}{M}\right)\frac{M}{M-1}\frac{p(1-p)}{m}$ (see Equation (5) of this paper for the variance formula) [27]. Note that for sample size calculations, as opposed to calculating the variance of an estimator from empirical data, we use the theoretical variance formula instead of the estimated variance formula. Clearly, the variance determines the spread of a distribution and it is a function of the sample size $m$; thus, this is the equation we are targeting.
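The investigator then needs to specify the expected prevalence $p$ and think about the quantiles ($q_{\alpha/2}$ and $q_{1-\alpha/2}$) of this sampling distribution at the confidence level $1-\alpha$ (usually, we set $\alpha = 0.05$). These quantiles can be interpreted as the farthest acceptable estimates from the expected prevalence, and $p - q_{\alpha/2} = q_{1-\alpha/2} - p$ due to the symmetry of the normal density curve. This suggests that $P(q_{\alpha/2} \le \hat{p} \le q_{1-\alpha/2}) = 1-\alpha$. Let $\sigma$ denote the square root of the variance in this specific example. Standardisation gives $\frac{q_{1-\alpha/2} - p}{\sigma} = z_{1-\alpha/2}$; thus, $q_{1-\alpha/2} - p = z_{1-\alpha/2}\,\sigma$, where $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the quantiles of the standard normal distribution that we know; for example, $z_{0.975} \approx 1.96$ for 95% confidence. We thus need to find a sample size $m$ which makes $z_{1-\alpha/2}\,\sigma$ equal to the absolute difference ($e$) between the farthest acceptable estimate and the expected prevalence. Now, it is just a matter of solving this equation. With a little algebra, and with $M$ denoting the herd size, the sample size is:
$$m = \frac{M z_{1-\alpha/2}^2\, p(1-p)}{e^2 (M-1) + z_{1-\alpha/2}^2\, p(1-p)} \tag{28}$$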
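The calculation described above can be sketched in a few lines of Python. This is an illustrative implementation, not taken from the paper or from reference [37]: it assumes the SRS variance with the finite population correction, solves $z\sigma = e$ for the sample size and rounds up. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def half_width(M, m, p, conf=0.95):
    """z * sigma for SRS of m animals from a herd of M with expected prevalence p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    variance = (1 - m / M) * (M / (M - 1)) * p * (1 - p) / m
    return z * math.sqrt(variance)


def srs_sample_size(M, p, e, conf=0.95):
    """Smallest m whose half-width z * sigma does not exceed e."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    a = z**2 * p * (1 - p)
    return math.ceil(M * a / (e**2 * (M - 1) + a))


# Hypothetical inputs: herd of 500 cows, expected prevalence 20%,
# acceptable absolute error of 5 percentage points, 95% confidence.
m = srs_sample_size(M=500, p=0.2, e=0.05)
print(m, half_width(500, m, 0.2))
```

Because of the finite population correction, the required sample size for a herd of 500 is noticeably smaller than the figure obtained from the familiar infinite-population formula $z^2 p(1-p)/e^2$, which the function approaches as `M` grows.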
6. Conclusions
This paper has provided a brief overview of the principles of probability sampling and the design-based approach for estimating finite population characteristics. In addition, we summarised the analytical methods for various commonly used sampling methods in detail. Instead of feeding the formulas to the readers, we have attempted to introduce and illustrate the ideas to help the readers understand, interpret, derive, and prove the unbiased estimators of the design-based samples and their corresponding variance formulas. We hope the ideas and methods presented in this paper can inspire the readers, so that the readers are encouraged to find the proper estimators and corresponding variances in their own sample surveys.
Acknowledgments
We acknowledge the technical support offered by Geoff Jones at Massey University, New Zealand and Ciprian Giurcaneanu at The University of Auckland, New Zealand.
Supplementary Materials
The following are available online at https://www.mdpi.com/article/10.3390/vetsci8060105/s1, File S1: Python code for the two-stage cluster sampling where stratification is implemented within the clusters.
Appendix A
First, we consider the simple random sampling scenario. In a population with M animals, let a Bernoulli random variable $Z_k$ denote whether animal k is selected: $Z_k = 1$ if individual k is selected; otherwise, $Z_k = 0$. Let $\pi_k = P(Z_k = 1)$ be the inclusion probability of animal k. We claim that $\pi_k = m/M$ for a sample with a fixed sample size m. We apply the principle introduced in the section “Overview of probability sampling” to prove this. For a fixed sample size m, there are in total $\binom{M}{m}$ ways to obtain a sample. In simple random sampling, all possible samples of a given size have the same probability of being selected, that is, $P(S) = 1/\binom{M}{m}$. The inclusion probability is calculated by summing the probabilities of all samples that include animal k, i.e., $\pi_k = \sum_{S \ni k} P(S)$. There are $\binom{M-1}{m-1}$ samples that contain animal k, because once animal k is fixed, we need to choose another m − 1 animals from the remaining M − 1 animals to form a sample. We then just need to add up $\binom{M-1}{m-1}$ copies of $P(S)$ to obtain $\pi_k$. Therefore, $\pi_k = \binom{M-1}{m-1} \big/ \binom{M}{m} = m/M$.
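The counting argument for simple random sampling can be checked by brute force on a small population. The sketch below, with a hypothetical function name, lists every possible sample of size m, gives each the same probability, and sums the probabilities of the samples containing each animal.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_probabilities(M, m):
    """Inclusion probability of each of M animals under SRS of fixed size m."""
    samples = list(combinations(range(M), m))   # all C(M, m) possible samples
    prob = Fraction(1, len(samples))            # each sample is equally likely
    return [sum(prob for s in samples if k in s) for k in range(M)]


print(inclusion_probabilities(5, 2))  # every entry equals Fraction(2, 5)
```

Every animal ends up with the same inclusion probability m/M, here 2/5, which is what the binomial-coefficient argument predicts.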
Building on the idea of simple random sampling, the inclusion probability for stratified random sampling is easy to derive. Let $Z_{jk}$ represent whether the kth animal in the jth stratum is selected, where $\pi_{jk} = P(Z_{jk} = 1)$ is the inclusion probability. For the jth stratum, $\pi_{jk} = m_j/M_j$ (with $m_j$ animals sampled from the $M_j$ animals in the stratum), as simple random sampling is implemented within the stratum.
For cluster sampling, let $Z_{ik}$ represent whether the kth animal on the ith farm is selected. Let the random variable $Z_i$ represent whether the ith farm is selected. In a one-stage cluster sampling scenario, if the farm is selected, all animals within the farm must be selected, namely, $P(Z_{ik} = 1 \mid Z_i = 1) = 1$. As simple random sampling is used to select n of the N farms, $P(Z_i = 1) = n/N$. By the law of total probability, $\pi_{ik} = P(Z_{ik} = 1) = P(Z_{ik} = 1 \mid Z_i = 1)P(Z_i = 1) + P(Z_{ik} = 1 \mid Z_i = 0)P(Z_i = 0)$. If the farm is not selected, the probability of an animal on this farm being selected is 0, namely, $P(Z_{ik} = 1 \mid Z_i = 0) = 0$; therefore $\pi_{ik} = n/N$. The inclusion probability for two-stage cluster sampling is similar; the difference is that $P(Z_{ik} = 1 \mid Z_i = 1)$ is no longer 1, but depends on the sampling method used within a farm.
Appendix B
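The well-known results for the variance formulas in simple random sampling are derived here. For a finite population of size M, we have Bernoulli random variables $Z_1, \ldots, Z_M$ representing the selection of the animals. These are identically distributed with the same marginal distribution, such that $P(Z_k = 1) = m/M$; however, they are not independent, as $P(Z_l = 1 \mid Z_k = 1) = \frac{m-1}{M-1} \neq \frac{m}{M}$, where $k \neq l$. The probability of the second animal being selected depends on the first animal’s selection result. For a Bernoulli random variable, we know that $E(Z_k) = E(Z_k^2) = \frac{m}{M}$ if the sample size is m. Therefore, $V(Z_k) = E(Z_k^2) - [E(Z_k)]^2 = \frac{m}{M}\left(1 - \frac{m}{M}\right)$. If $k \neq l$, the probability that both animals are selected is $P(Z_k = 1, Z_l = 1) = \frac{m}{M}\cdot\frac{m-1}{M-1}$, and $E(Z_k Z_l) = \frac{m}{M}\cdot\frac{m-1}{M-1}$. Therefore, the covariance of $Z_k$ and $Z_l$ is computed as:
$$\mathrm{Cov}(Z_k, Z_l) = E(Z_k Z_l) - E(Z_k)E(Z_l) = \frac{m}{M}\cdot\frac{m-1}{M-1} - \left(\frac{m}{M}\right)^2 = -\frac{1}{M-1}\cdot\frac{m}{M}\left(1 - \frac{m}{M}\right)$$
Now the variance of the estimated finite population total, $\hat{t} = \frac{M}{m}\sum_{k=1}^{M} Z_k y_k$, is derived as:
$$V(\hat{t}) = \left(\frac{M}{m}\right)^2\left[\sum_{k=1}^{M} y_k^2 V(Z_k) + \sum_{k=1}^{M}\sum_{l \neq k} y_k y_l \mathrm{Cov}(Z_k, Z_l)\right] = \frac{M^2}{m}\left(1 - \frac{m}{M}\right)\frac{1}{M-1}\sum_{k=1}^{M}(y_k - \bar{y})^2$$
Observe that $\frac{1}{M-1}\sum_{k=1}^{M}(y_k - \bar{y})^2$ is the finite population variance $S^2$, therefore:
$$V(\hat{t}) = M^2\left(1 - \frac{m}{M}\right)\frac{S^2}{m}$$
which is Equation (3) shown in the main body of the text. In our particular case, $y_k$ is binary valued, so $\sum_{k=1}^{M} y_k^2 = \sum_{k=1}^{M} y_k = Mp$ and $\bar{y} = p$, therefore:
$$S^2 = \frac{1}{M-1}\left(\sum_{k=1}^{M} y_k^2 - M\bar{y}^2\right) = \frac{M}{M-1}p(1-p)$$
and the variance of the estimated finite population prevalence is:
$$V(\hat{p}) = \frac{V(\hat{t})}{M^2} = \left(1 - \frac{m}{M}\right)\frac{M}{M-1}\frac{p(1-p)}{m}$$
which is Equation (5) shown in the main body of the text. Since the variance depends on the unknown quantity $p$, we replace the finite population variance $S^2$ with the sample variance $s^2 = \frac{1}{m-1}\sum_{k \in S}(y_k - \hat{p})^2$. With the same algebra, $s^2 = \frac{m}{m-1}\hat{p}(1-\hat{p})$. Therefore, the estimated variance is:
$$\hat{V}(\hat{p}) = \left(1 - \frac{m}{M}\right)\frac{\hat{p}(1-\hat{p})}{m-1}$$
which is Equation (7) shown in the main body of the text.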
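The variance formula for the estimated total under simple random sampling can also be verified numerically on a tiny population by enumerating every possible sample. The sketch below uses exact fractions; the function names and the example population are hypothetical, and the formula checked is $M^2(1 - m/M)S^2/m$ with $S^2$ the finite population variance.

```python
from fractions import Fraction
from itertools import combinations


def exact_mean_and_variance_of_total(y, m):
    """Mean and variance of (M/m) * (sample sum) over all SRS samples of size m."""
    M = len(y)
    estimates = [Fraction(M, m) * sum(y[k] for k in s)
                 for s in combinations(range(M), m)]
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, variance


def formula_variance_of_total(y, m):
    """M^2 * (1 - m/M) * S^2 / m, with S^2 using an (M - 1) denominator."""
    M = len(y)
    ybar = Fraction(sum(y), M)
    S2 = sum((v - ybar) ** 2 for v in y) / (M - 1)
    return M**2 * (1 - Fraction(m, M)) * S2 / m


y = [1, 0, 0, 1, 0, 0]          # hypothetical herd: 2 diseased cows out of 6
mean, var = exact_mean_and_variance_of_total(y, m=3)
print(mean, var, formula_variance_of_total(y, m=3))
```

For this population the enumerated mean equals the true total of 2 (unbiasedness), and the enumerated variance agrees with the closed-form expression.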
Appendix C
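In the stratified random sampling section, we mentioned a “natural” estimator for the prevalence on a farm, that is:
$$\hat{p}_{\mathrm{nat}} = \frac{1}{m}\sum_{j}\sum_{k \in S_j} y_{jk}, \qquad m = \sum_{j} m_j$$
This sums up all the observed values in the sample and divides by the sample size. This estimator is generally biased; to see why, note that its expectation weights each stratum prevalence $p_j$ by $m_j/m$, whereas if we sampled the entire herd (herd size = M) each stratum would be weighted by $M_j/M$:
$$E(\hat{p}_{\mathrm{nat}}) = \frac{1}{m}\sum_{j} m_j p_j \neq \frac{1}{M}\sum_{j} M_j p_j = p$$
However, under proportional allocation, the sample size in each stratum is taken in proportion to the actual size of the corresponding stratum, that is, $m_j/m = M_j/M$, so that $M_j/m_j = M/m$. With a little algebra, we can see that $\hat{p}_{\mathrm{nat}} = \frac{1}{M}\sum_{j}\frac{M_j}{m_j}\sum_{k \in S_j} y_{jk}$, which is the stratified estimator; therefore, $E(\hat{p}_{\mathrm{nat}}) = p$.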
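The bias of the “natural” estimator, and its disappearance under proportional allocation, can be illustrated by enumerating all stratified samples from a small herd. This is a sketch with hypothetical function names and made-up data; the stratified estimator used for comparison weights each stratum sample mean by the stratum size.

```python
from fractions import Fraction
from itertools import combinations, product


def expected_estimates(strata, sizes):
    """Exact expectations of the natural and the stratified prevalence estimators.

    strata : list of lists with the 0/1 outcome of every animal in each stratum
    sizes  : SRS sample size taken in each stratum
    """
    M = sum(len(y) for y in strata)
    m = sum(sizes)
    per_stratum = [list(combinations(y, m_j)) for y, m_j in zip(strata, sizes)]
    all_samples = list(product(*per_stratum))   # equally likely joint samples
    natural = stratified = Fraction(0)
    for sample in all_samples:
        natural += Fraction(sum(sum(s) for s in sample), m)
        stratified += sum(
            Fraction(len(y), m_j) * sum(s)
            for y, m_j, s in zip(strata, sizes, sample)
        ) / M
    return natural / len(all_samples), stratified / len(all_samples)


# Hypothetical herd: stratum 1 has 3 of 4 diseased, stratum 2 has 1 of 6 diseased.
strata = [[1, 1, 1, 0], [0, 0, 0, 0, 0, 1]]
print(expected_estimates(strata, [2, 2]))   # equal allocation
print(expected_estimates(strata, [2, 3]))   # proportional allocation (4:6 = 2:3)
```

With equal allocation the natural estimator has expectation 11/24 rather than the true prevalence 2/5, because the high-prevalence stratum is over-represented in the sample; with proportional allocation both estimators have expectation 2/5.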
Appendix D
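In the two-stage cluster sampling, the variance of the estimated total comes from two sources. Based on the variance partition formula, $V(\hat{t}) = V\left[E(\hat{t} \mid \mathbf{Z})\right] + E\left[V(\hat{t} \mid \mathbf{Z})\right]$, where $V\left[E(\hat{t} \mid \mathbf{Z})\right]$ measures the variability between herds and $E\left[V(\hat{t} \mid \mathbf{Z})\right]$ measures the variability within a herd. Since simple random sampling is implemented at the herd level, $V\left[E(\hat{t} \mid \mathbf{Z})\right]$ is relatively easy to derive by following the method shown in the simple random sampling scenario.
Here, we focus on deriving the variance formula for the within-herd component $E\left[V(\hat{t} \mid \mathbf{Z})\right]$. We apply the conditional variance formula inside the expectation operator: $V(\hat{t} \mid \mathbf{Z}) = E(\hat{t}^2 \mid \mathbf{Z}) - \left[E(\hat{t} \mid \mathbf{Z})\right]^2$. We will work on the two terms inside the expectation operator, then apply the expectation. To recap the notation: a Bernoulli random variable $Z_i$ represents whether herd i is selected, $\mathbf{Z} = (Z_1, \ldots, Z_N)$, and we have $\hat{t} = \frac{N}{n}\sum_{i=1}^{N} Z_i\hat{t}_i$. Sampling within any herd is independent of the sampling in any other herd and $\hat{t}_i$ is independent of $\mathbf{Z}$.
$E(\hat{t}^2 \mid \mathbf{Z}) = \left(\frac{N}{n}\right)^2\left[\sum_{i=1}^{N} Z_i^2 E(\hat{t}_i^2 \mid \mathbf{Z}) + \sum_{i=1}^{N}\sum_{l \neq i} Z_i Z_l E(\hat{t}_i\hat{t}_l \mid \mathbf{Z})\right] = \left(\frac{N}{n}\right)^2\left[\sum_{i=1}^{N} Z_i^2 E(\hat{t}_i^2) + \sum_{i=1}^{N}\sum_{l \neq i} Z_i Z_l E(\hat{t}_i\hat{t}_l)\right]$
(if $\hat{t}_i$ and $Z_i$ are independent random variables, real valued functions f and g defined for $\hat{t}_i$ and $Z_i$ are also independent random variables; this justifies $E(\hat{t}_i^2 \mid Z_i) = E(\hat{t}_i^2)$, as f and g are a quadratic and an identity function, respectively; more generally, $(\hat{t}_i, \hat{t}_l)$ and $\mathbf{Z}$ are independent, thus $f(\hat{t}_i, \hat{t}_l)$ and $g(\mathbf{Z})$ are independent; this justifies $E(\hat{t}_i\hat{t}_l \mid \mathbf{Z}) = E(\hat{t}_i\hat{t}_l)$, as $f(\hat{t}_i, \hat{t}_l) = \hat{t}_i\hat{t}_l$ and $g(\mathbf{Z}) = \mathbf{Z}$)
$= \left(\frac{N}{n}\right)^2\left[\sum_{i=1}^{N} Z_i^2 E(\hat{t}_i^2) + \sum_{i=1}^{N}\sum_{l \neq i} Z_i Z_l t_i t_l\right]$
(sampling independently from the farms implies independence between $\hat{t}_i$ and $\hat{t}_l$; for two independent random variables, $E(XY) = E(X)E(Y)$)
$\left[E(\hat{t} \mid \mathbf{Z})\right]^2 = \left(\frac{N}{n}\sum_{i=1}^{N} Z_i t_i\right)^2 = \left(\frac{N}{n}\right)^2\left[\sum_{i=1}^{N} Z_i^2 t_i^2 + \sum_{i=1}^{N}\sum_{l \neq i} Z_i Z_l t_i t_l\right]$
(sampling independently from the farms).
Subtracting the second term from the first and using $Z_i^2 = Z_i$ gives $V(\hat{t} \mid \mathbf{Z}) = \left(\frac{N}{n}\right)^2\sum_{i=1}^{N} Z_i\left[E(\hat{t}_i^2) - t_i^2\right] = \left(\frac{N}{n}\right)^2\sum_{i=1}^{N} Z_i V(\hat{t}_i)$. Taking the expectation with $E(Z_i) = n/N$ then yields $E\left[V(\hat{t} \mid \mathbf{Z})\right] = \frac{N}{n}\sum_{i=1}^{N} V(\hat{t}_i)$.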
Appendix E
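Joint probability: $P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$
Independence of two events: $P(A \cap B) = P(A)P(B)$
Law of total probability: $P(A) = \sum_{i} P(A \mid B_i)P(B_i)$, where the events $B_i$ partition the sample space
Expectation of a discrete random variable: $E(X) = \sum_{x} x\,P(X = x)$
Property of expectation: $E(aX + bY + c) = aE(X) + bE(Y) + c$
Variance of a random variable: $V(X) = E\left[(X - E(X))^2\right] = E(X^2) - [E(X)]^2$
- Property of variance: $V(aX + b) = a^2 V(X)$
- If $X_1, \ldots, X_n$ are mutually independent, then $V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} V(X_i)$
Bias: $\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$; unbiasedness implies $E(\hat{\theta}) = \theta$
Expectation for a function of two random variables: $E[g(X, Y)] = \sum_{x}\sum_{y} g(x, y)\,P(X = x, Y = y)$
Covariance: $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$
- Properties of covariance: $\mathrm{Cov}(X, X) = V(X)$; $V\left(\sum_{i} a_i X_i\right) = \sum_{i} a_i^2 V(X_i) + \sum_{i}\sum_{l \neq i} a_i a_l \mathrm{Cov}(X_i, X_l)$
- Independence between $X$ and $Y$ implies $\mathrm{Cov}(X, Y) = 0$
Conditional expectation: $E(X \mid Y)$ is a random variable subject to the variation of $Y$
- Properties of conditional expectation: $E(aX_1 + bX_2 \mid Y) = aE(X_1 \mid Y) + bE(X_2 \mid Y)$
- Independence between $X$ and $Y$ implies $E(X \mid Y) = E(X)$ and $E[f(X) \mid Y] = E[f(X)]$
- $E[g(Y)X \mid Y] = g(Y)E(X \mid Y)$ and $E[g(Y) \mid Y] = g(Y)$
Partition theorem for expectations: $E(X) = E[E(X \mid Y)]$
Variance partition formula: $V(X) = V[E(X \mid Y)] + E[V(X \mid Y)]$
Conditional variance formula: $V(X \mid Y) = E(X^2 \mid Y) - [E(X \mid Y)]^2$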
Author Contributions
Writing—original draft preparation, D.A.Y.; writing—review and editing, R.A.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analysed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Sano H., Barker K., Odom T., Lewis K., Giordano P., Walsh V., Chambers J.P. A survey of dog and cat anaesthesia in a sample of veterinary practices in New Zealand. N. Z. Vet. J. 2018;66:85–92. doi: 10.1080/00480169.2017.1413959.
2. Thomson K., Rantala M., Hautala M., Pyörälä S., Kaartinen L. Cross-sectional prospective survey to study indication-based usage of antimicrobials in animals: Results of use in cattle. BMC Vet. Res. 2008;4:15. doi: 10.1186/1746-6148-4-15.
3. Ouyang Z., Sargeant J., Thomas A., Wycherley K., Ma R., Esmaeilbeigi R., Versluis A., Stacey D., Stone E., Poljak Z., et al. A scoping review of ‘big data’, ‘informatics’, and ‘bioinformatics’ in the animal health and veterinary medical literature. Anim. Health Res. Rev. 2019;20:1–18. doi: 10.1017/S1466252319000136.
4. Valletta J.J., Torney C., Kings M., Thornton A., Madden J. Applications of machine learning in animal behaviour studies. Anim. Behav. 2017;124:203–220. doi: 10.1016/j.anbehav.2016.12.005.
5. Cernek P., Bollig N., Anklam K., Döpfer D. Hot topic: Detecting digital dermatitis with computer vision. J. Dairy Sci. 2020;103:9110–9115. doi: 10.3168/jds.2019-17478.
6. Astill J., Dara R.A., Fraser E.D.G., Sharif S. Detecting and predicting emerging disease in poultry with the implementation of new technologies and big data: A focus on avian influenza virus. Front. Vet. Sci. 2018;5. doi: 10.3389/fvets.2018.00263.
7. Skinner C., Wakefield J. Introduction to the design and analysis of complex survey data. Stat. Sci. 2017;32:165–175. doi: 10.1214/17-STS614.
8. Revilla M., Lenoir G., Flatres-Grall L., Muñoz-Tamayo R., Friggens N.C. Quantifying growth perturbations over the fattening period in swine via mathematical modelling. bioRxiv. 2020. doi: 10.1101/2020.10.22.349985.
9. Mansfield H.C., Winthrop D. Alexis de Tocqueville, Democracy in America. University of Chicago Press; Chicago, IL, USA: 2000.
10. Gregoire T.G. Design-based and model-based inference in survey sampling: Appreciating the difference. Can. J. For. Res. 1998;28:1429–1447. doi: 10.1139/x98-166.
11. Jones G., Johnson W.O. A Bayesian superpopulation approach to inference for finite populations based on imperfect diagnostic outcomes. J. Agric. Biol. Environ. Stat. 2016;21:314–327. doi: 10.1007/s13253-015-0239-9.
12. Yang D.A., Johnson W.O., Müller K.R., Gates M.C., Laven R.A. Estimating the herd and cow level prevalence of bovine digital dermatitis on New Zealand dairy farms: A Bayesian superpopulation approach. Prev. Vet. Med. 2019;165:76–84. doi: 10.1016/j.prevetmed.2019.02.014.
13. Little R.J. To model or not to model? Competing modes of inference for finite population sampling. J. Am. Stat. Assoc. 2004;99:546–556. doi: 10.1198/016214504000000467.
14. Baffetta F., Fattorini L., Franceschi S., Corona P. Design-based approach to k-nearest neighbours technique for coupling field and remotely sensed data in forest surveys. Remote Sens. Environ. 2009;113:463–475. doi: 10.1016/j.rse.2008.06.014.
15. Pfeffermann D. The use of sampling weights for survey data analysis. Stat. Methods Med. Res. 1996;5:239–261. doi: 10.1177/096228029600500303.
16. Gelman A. Struggles with survey weighting and regression modeling. Stat. Sci. 2007;22:153–164. doi: 10.1214/088342306000000691.
17. Chen Q., Elliott M.R., Haziza D., Yang Y., Ghosh M., Little R.J., Sedransk J., Thompson M. Approaches to improving survey-weighted estimates. Stat. Sci. 2017;32:227–248. doi: 10.1214/17-STS609.
18. Stehman S.V. Practical implications of design-based sampling inference for thematic map accuracy assessment. Remote Sens. Environ. 2000;72:35–45. doi: 10.1016/S0034-4257(99)00090-5.
19. Tate J.E., Hudgens M.G. Estimating population size with two-and three-stage sampling designs. Am. J. Epidemiol. 2007;165:1314–1320. doi: 10.1093/aje/kwm005.
20. Dorazio R.M. Design-based and model-based inference in surveys of freshwater mollusks. J. N. Am. Benthol. Soc. 1999;18:118–131. doi: 10.2307/1468012.
21. West P.W. Simple random sampling of individual items in the absence of a sampling frame that lists the individuals. N. Z. J. For. Sci. 2016;46:15. doi: 10.1186/s40490-016-0071-1.
22. Abera Z., Degefu H., Gari G., Kidane M. Sero-prevalence of lumpy skin disease in selected districts of West Wollega zone, Ethiopia. BMC Vet. Res. 2015;11:135. doi: 10.1186/s12917-015-0432-7.
23. Abebe R., Hatiya H., Abera M., Megersa B., Asmare K. Bovine mastitis: Prevalence, risk factors and isolation of Staphylococcus aureus in dairy herds at Hawassa milk shed, South Ethiopia. BMC Vet. Res. 2016;12:270. doi: 10.1186/s12917-016-0905-3.
24. Sulayeman M., Dawo F., Mammo B., Gizaw D., Shegu D. Isolation, molecular characterization and sero-prevalence study of foot-and-mouth disease virus circulating in central Ethiopia. BMC Vet. Res. 2018;14:110. doi: 10.1186/s12917-018-1429-9.
25. Horvitz D.G., Thompson D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952;47:663–685. doi: 10.1080/01621459.1952.10483446.
26. Cochran W.G. Sampling Techniques. 3rd ed. John Wiley & Sons, Inc.; New York, NY, USA: 1977.
27. Lohr S.L. Sampling: Design and Analysis. Cengage Learning; Boston, MA, USA: 2010.
28. Heayns B., Baugh S. Survey of veterinary surgeons on the introduction of serological testing to assess revaccination requirements. Vet. Rec. 2012;170:74. doi: 10.1136/vr.100147.
29. Atuman Y.J., Ogunkoya A.B., Adawa D.A.Y., Nok A.J., Biallah M.B. Dog ecology, dog bites and rabies vaccination rates in Bauchi State, Nigeria. Int. J. Vet. Sci. Med. 2014;2:41–45. doi: 10.1016/j.ijvsm.2014.04.001.
30. Kaler J., Wani S.A., Hussain I., Beg S.A., Makhdoomi M., Kabli Z.A., Green L.E. A clinical trial comparing parenteral oxytetracyline and enrofloxacin on time to recovery in sheep lame with acute or chronic footrot in Kashmir, India. BMC Vet. Res. 2012;8:12. doi: 10.1186/1746-6148-8-12.
31. Wickham J.D., Stehman S.V., Smith J.H., Wade T.G., Yang L. A priori evaluation of two-stage cluster sampling for accuracy assessment of large-area land-cover maps. Int. J. Remote Sens. 2004;25:1235–1252. doi: 10.1080/0143116031000149998.
32. Getahun K., Kelay B., Bekana M., Lobago F. Bovine mastitis and antibiotic resistance patterns in Selalle smallholder dairy farms, central Ethiopia. Trop. Anim. Health Prod. 2008;40:261–268. doi: 10.1007/s11250-007-9090-5.
33. Regassa A., Tassew A., Amenu K., Megersa B., Abunna F., Mekibib B., Macrotty T., Ameni G. A cross-sectional study on bovine tuberculosis in Hawassa town and its surroundings, Southern Ethiopia. Trop. Anim. Health Prod. 2010;42:915–920. doi: 10.1007/s11250-009-9507-4.
34. Solís-Calderón J.J., Segura-Correa J.C., Aguilar-Romero F., Segura-Correa V.M. Detection of antibodies and risk factors for infection with bovine respiratory syncytial virus and parainfluenza virus-3 in beef cattle of Yucatan, Mexico. Prev. Vet. Med. 2007;82:102–110. doi: 10.1016/j.prevetmed.2007.05.013.
35. Hotchkiss J.W., Reid S., Christley R. A survey of horse owners in Great Britain regarding horses in their care. Part 1: Horse demographic characteristics and management. Equine Vet. J. 2007;39:294–300. doi: 10.2746/042516407X177538.
36. Bisson A., Maley S., Rubaire-Akiiki C., Wastling J. The seroprevalence of antibodies to Toxoplasma gondii in domestic goats in Uganda. Acta Trop. 2000;76:33–38. doi: 10.1016/S0001-706X(00)00086-3.
37. Stevenson M.A. Sample size estimation in veterinary epidemiologic research. Front. Vet. Sci. 2021;7. doi: 10.3389/fvets.2020.539573.