PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA

Yu Gu; John S Preisser; Donglin Zeng; Poojan Shrestha; Molina Shah; Miguel A Simancas-Pallares; Jeannie Ginnis; Kimon Divaris

doi:10.1214/21-aoas1516

Author manuscript; available in PMC: 2022 Mar 29.

Published in final edited form as: Ann Appl Stat. 2022 Mar 28;16(1):551–572. doi: 10.1214/21-aoas1516


PMCID: PMC8963777 NIHMSID: NIHMS1731052 PMID: 35356492

Abstract

Community water fluoridation is an important component of oral health promotion, as fluoride exposure is a well-documented dental caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals’ fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches—two conventional spatial interpolation methods (i.e., inverse-distance weighting, IDW and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.

Keywords and phrases: missing values, spatial interpolation, clustering, multiple imputation, random forest

1. Introduction.

Community water fluoridation is an important component of oral health promotion, and its introduction has been linked to substantial reductions in caries experience at the population level (Brunelle and Carlos, 1990). Although high-quality, contemporary evidence of water fluoridation’s caries-preventive efficacy is lacking (Iheozor-Ejiofor et al., 2015), fluoride has been shown to promote tooth surface remineralization, which can prevent or arrest caries lesion development (Cate, 1999). Because fluoride exposure is considered an important influence on populations’ and individuals’ dental caries experience (Selwitz, Ismail and Pitts, 2007; Fisher-Owens et al., 2007), it is important to account for it in oral health and dental caries-focused research. For example, recent genome-wide association studies of dental caries conducted fluoride exposure-stratified or interaction analyses in the hopes of identifying significant genetic loci (Shaffer et al., 2011; Eckert et al., 2017).

Despite the benefits of obtaining measurements of domestic water fluoride content as a measure of individuals’ fluoride exposure and thus caries risk, such undertakings are logistically challenging to carry out at a large scale in oral health research. Frequently, fluoride exposure is ignored or self-reported via questionnaires. A recent large-scale study of early childhood oral health (ZOE 2.0) in North Carolina (NC), U.S.A., conducted state-wide clinical examinations of preschool-age children, and among other data and specimen types (e.g., oral health-related behaviors via questionnaires, human DNA via saliva samples), it sought to collect domestic water samples from participants’ domestic water sources (Divaris et al., 2020; Divaris and Joshi, 2020; Ginnis et al., 2019). However, approximately 75% of study participants did not return water samples, creating an important missing data problem. Excluding subjects with missing values from analyses (i.e., in a complete case analysis scenario) would result in substantial reduction of the effective sample size, with obvious negative impacts on statistical power and precision.

This investigation is motivated by the missing data problem of domestic water fluoride concentrations in the ZOE 2.0 study. While existing imputation methods can be implemented to address this problem, the choice of approach was guided by two features of the parent study’s data. First, because neighboring households tend to be on common water systems, directly measured fluoride concentration levels (n = 1,501) exhibit strong spatial autocorrelation (Figure 1.1). Second, preliminary data analysis found that participants’ questionnaire responses of having ‘well water’ as their primary water source were strongly associated with lower fluoride levels. This is also expected, because most wells in NC contain negligible or no fluoride. With these features in mind, the goal of the present study is to identify and implement an efficient imputation method with a low error rate.

Fig 1.1. — Geographical distribution of fluoride levels based on the fluoride and geocoding information from the 1,501 study participants with non-missing data

The original geocoded dataset in the ZOE 2.0 study contains 6,336 subjects across 85 of NC’s 100 counties, among which only 1,530 subjects have observed fluoride information. Fluoride, the variable of interest, is a left-truncated continuous variable categorized into three levels, where fluoride concentration can be none/non-detectable (<0.20ppm; coded as 0), sub-optimal (0.20–0.59ppm; coded as 1), or optimal (⩾0.60ppm; coded as 2). Fluoride level defined by these three categories has been determined to have an effect on dental disease (Ha et al., 2019). This categorization circumvents the problem of distinguishing between true zero and non-detectable, nonzero fluoride. Covariates under consideration include geographic coordinates (i.e., X and Y) of participants’ residential home addresses, obtained from ArcGIS (Johnston et al., 2001), and participants’ answers to a question about their primary water source at home (i.e., well water vs. other). Twenty-nine participants are excluded from the original dataset since they have no clinical or geocoding information. An overall summary of area and fluoride-related information for the remaining 6,307 study participants with clinical dental data and geocoded residential addresses is shown in Table 1.1.

Table 1.1.

Summary of area and fluoride-related information for the 6,307 study participants with clinical dental data and geocoded residential addresses

Variable	Min	1st Q	Median	Mean	3rd Q	Max	#Missing
X¹	−84.13	−80.62	−79.08	−79.41	−78.13	−75.61	0
Y¹	33.89	35.18	35.52	35.51	35.93	36.55	0
Fluoride value²	0.00	0.00	0.48	0.42	0.67	3.10	4806
Water source	well water: 1003			other: 5304			0
Fluoride level	none/non-detectable: 548			sub-optimal: 342	optimal: 611		4806


¹ X and Y coordinates of participants’ geocoded residential addresses

² Measured fluoride concentration (ppm), with fluoride levels below the lower limit of detection recorded as zero

As mentioned above, the geographical distribution of fluoride levels was initially explored. Figure 1.1 is a map generated from ArcGIS based on the residential locations of the 1,501 ZOE 2.0 study participants who underwent a clinical examination and provided a useable water sample, where each point represents a participant and different shades of gray represent different fluoride levels. Evidently, participants living near each other tend to have similar fluoride levels, suggesting a strong spatial autocorrelation of fluoride concentration. Specifically, spatial autocorrelation was tested using Moran’s I (Moran, 1950; Cliff and Ord, 1981), which is defined as

I = \frac{n}{\sum_{i} \sum_{j} w_{i j}} \times \frac{\sum_{i} \sum_{j} w_{i j} (X_{i} - \bar{X}) (X_{j} - \bar{X})}{\sum_{i} {(X_{i} - \bar{X})}^{2}},

(1.1)

where X_i is the fluoride level of the i-th subject, $\bar{X}$ is the mean of the X_i, and {w_ij} is a matrix of spatial weights. Here w_ij is set to 1 if subject j is among the 5 nearest neighbors of subject i and to 0 otherwise. The observed value of Moran’s I is 0.5528, compared with a null mean of −0.0007 and standard error of 0.0150 under the null hypothesis of no spatial autocorrelation, which yields an extremely small p-value (< 2.2 × 10⁻¹⁶) and supports the earlier observation of strong spatial autocorrelation.
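As an illustration of (1.1), the following Python sketch computes Moran's I with binary k-nearest-neighbor weights. The function names (`knn_weights`, `morans_i`) and the toy coordinates are hypothetical and are not taken from the ZOE 2.0 analysis. Breaking distance ties by input order is an assumption of this sketch.

```python
import math


def knn_weights(coords, k):
    """Binary weights: w[i][j] = 1 if j is one of the k nearest neighbors of i."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in others[:k]:
            w[i][j] = 1.0
    return w


def morans_i(values, w):
    """Moran's I as in (1.1) for a list of values and a weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in w)
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Toy example: four points on a line, one nearest neighbor each.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
w = knn_weights(coords, k=1)
clustered = morans_i([0, 0, 2, 2], w)    # neighbors share the same level
alternating = morans_i([0, 2, 0, 2], w)  # neighbors have different levels
```

In this toy example the clustered pattern gives a positive statistic and the alternating pattern a negative one, which is the qualitative behavior the test above relies on.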

The investigators’ objective is to impute the missing fluoride levels for participants’ domestic water source, using the measured values from their closest neighbors. This can be done with several popular spatial interpolation methods (Lam, 1983; Mitas and Mitasova, 1999), including nearest neighbor interpolation, inverse-distance weighted (IDW) interpolation, and ordinary kriging. However, in addition to geographical location, water source (i.e., well water vs. other) is another key predictive factor for fluoride level in the dataset: most well water has zero or near-zero fluoride concentration, while city water is mostly fluoridated. The chi-squared test of the association between water source and fluoride level is highly significant, with p-value < 2.2 × 10⁻¹⁶. For this reason, more complex, related methods, including k-nearest neighbors (KNN), classification and regression trees (CART), and universal kriging (UK), are considered to account for the additional effect of water source. Further, a new imputation method, PAMRF, is proposed, which is a combination of partitioning around medoids (PAM) clustering and random forest classification, two powerful machine learning approaches. Multiple imputation is incorporated into PAMRF to address the problem of underrepresentation of uncertainty with single imputation.

Section 2 introduces the methodological framework for each of the five imputation methods, including IDW, UK, KNN, CART, and PAMRF. Section 3 presents evaluations of the five methods and an application of PAMRF, the best imputation method. Finally, Section 4 summarizes and discusses the study’s findings, and offers recommendations for future research.

2. Methodology.

2.1. Inverse Distance Weighting (IDW).

A very popular interpolation method that is implemented in many GIS software packages is IDW. An IDW interpolation is defined as a spatially weighted average of the sample values within a search neighborhood (Shepard, 1968; Franke, 1982), under the assumption that neighbors that are closer to each other are more correlated than those that are further away. This assumption appears to hold true in the ZOE 2.0 data. Missing fluoride values can be imputed by

\hat{Z} (s_{0}) = \frac{\sum_{i = 1}^{n} Z (s_{i}) / d {(s_{0}, s_{i})}^{p}}{\sum_{i = 1}^{n} 1 / d {(s_{0}, s_{i})}^{p}},

(2.1)

where Z(·) denotes the measured continuous fluoride concentration value, s₁, …, s_n are the n neighbors of the interpolated point s₀, d(·, ·) is the Euclidean distance between two points based on their X and Y coordinates, and p is a positive real number, called the power parameter. The imputed continuous fluoride value is converted into the categorical fluoride level using the same rule as in the ZOE 2.0 study. The implementation of IDW is straightforward in ArcGIS.
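A minimal Python sketch of (2.1), followed by the conversion to the three fluoride levels defined in Section 1, is given below. The function names and toy data are hypothetical. Two choices are assumptions of this sketch rather than details from the study: a neighbor at distance zero returns its observed value directly, and imputed values between 0.59 and 0.60 ppm are treated as sub-optimal.

```python
import math


def fluoride_level(ppm):
    """Map a concentration in ppm to the coded level 0, 1, or 2."""
    if ppm < 0.20:
        return 0  # none/non-detectable
    if ppm < 0.60:
        return 1  # sub-optimal
    return 2      # optimal


def idw_impute(target, neighbors, power=3.0):
    """Inverse-distance weighted average as in (2.1).

    target: (x, y) of the location to impute.
    neighbors: list of ((x, y), measured_ppm) for the n nearest neighbors.
    """
    num = 0.0
    den = 0.0
    for coord, value in neighbors:
        d = math.dist(target, coord)
        if d == 0.0:
            return value  # assumption: coincident location uses its own value
        weight = 1.0 / d ** power
        num += weight * value
        den += weight
    return num / den


# Toy example: the closer neighbor dominates when the power parameter is 3.
neighbors = [((1.0, 0.0), 0.8), ((2.0, 0.0), 0.0)]
ppm = idw_impute((0.0, 0.0), neighbors, power=3.0)
level = fluoride_level(ppm)
```

Larger values of the power parameter concentrate the weight on the closest neighbors, so the imputed value approaches nearest-neighbor interpolation as p grows.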

2.2. Universal Kriging (UK).

Kriging, or Gaussian process regression, is another popular interpolation method. It is based on statistical models that include autocorrelation. Like IDW, kriging measures the Euclidean distance based on X and Y coordinates, and imputes missing values by weighting the observed values of closest neighbors, which is formulated by

\hat{Z} (s_{0}) = \sum_{i = 1}^{n} λ_{i} Z (s_{i}),

(2.2)

where Z(·) denotes the measured continuous fluoride concentration value, s₁, …, s_n are the n neighbors of the interpolated point s₀, and λ_i is an unknown weight for the observed value from the i-th neighbor. One significant difference between IDW and kriging is that the weight λ_i in IDW depends only on the distance to the imputation location, while in kriging, the weights depend not only on the distance between the neighbors and the imputation location but also on the spatial autocorrelation across the entire surface. To be specific, in kriging, λ_i’s are chosen such that the predictor is uniformly unbiased and the estimation variance is minimized (Cressie, 1993). In UK, the following assumptions are made:

1. Z(s) is a normally distributed random variable for any location s.
2. Within the neighborhood of s₀, denoted by D, the mean of Z(s) is a linear combination of known functions {f₀(s), …, f_p(s)}, which can be written as
$E [Z (s)] = f (s)^{'} β, \quad s \in D,$ (2.3)
where f(s) = (f₀(s), …, f_p(s))′ and $β \in \mathbb{R}^{p + 1}$.
3. The correlation between two random variables depends only on the distance between them and is independent of their locations. That is, for each distance h and each pair of locations s₁ and s₂ separated by distance h, their covariance is a constant:
$\mathrm{Cov} (Z (s_{1}), Z (s_{2})) = C (s_{1}, s_{2}) \equiv C (h) .$ (2.4)

The semivariogram is defined as
$γ (s_{1}, s_{2}) = \frac{1}{2} \mathrm{Var} (Z (s_{1}) - Z (s_{2})) .$ (2.5)
Clearly, the covariance function is related to the semivariogram by
$2 γ (s_{1}, s_{2}) = C (s_{1}, s_{1}) + C (s_{2}, s_{2}) - 2 C (s_{1}, s_{2}) .$ (2.6)
Under the third assumption above, the semivariogram depends only on the distance between two points regardless of their locations. Therefore, the semivariogram can also be expressed as a function of distance γ(h), like the covariance function C(h). Figure 2.1 shows the anatomy of a typical semivariogram and its corresponding covariance function.

Fig 2.1. — Semivariogram (left) and covariance function (right)

The semivariogram γ(h) and the covariance function C(h) can be estimated by

\begin{matrix} \hat{γ} (h) = \frac{1}{2 | N (h) |} \sum_{(i, j) \in N (h)} {[Z (s_{i}) - Z (s_{j})]}^{2} \\ \hat{C} (h) = \frac{1}{| N (h) |} \sum_{(i, j) \in N (h)} [Z (s_{i}) - m (h)] [Z (s_{j}) - m (h)], \end{matrix}

(2.7)

where

m (h) = \frac{1}{2 | N (h) |} \sum_{(i, j) \in N (h)} [Z (s_{i}) + Z (s_{j})],

(2.8)

N(h) denotes the set of pairs of locations at distance h, and |N(h)| denotes the number of corresponding pairs of locations (Cressie, 1993). The estimated semivariogram can be modeled by various models, such as the spherical, Gaussian, exponential models, etc. The model that best fits the data will be selected. After modeling the spatial autocorrelation, the optimal weights are obtained by

{c + X {(X^{'} Σ^{- 1} X)}^{- 1} (x - X^{'} Σ^{- 1} c)}^{'} Σ^{- 1},

(2.9)

where c = (C(s₀, s₁), …, C(s₀, s_n))′, Σ is an n × n matrix whose (i, j)-th element is C(s_i, s_j), X is an n × (p + 1) matrix whose i-th row is f(s_i)′, and x = f(s₀) (Cressie, 1993). By plugging the optimal weights into (2.2), the imputed value for s₀ is finally obtained. For the ZOE 2.0 fluoride data, water source (i.e., well water vs. other) is the only covariate other than the intercept in (2.3). Like IDW, the continuous fluoride concentration value is first imputed, then converted into the categorical fluoride level using the rule described in Section 1. UK is straightforward to implement in ArcGIS.
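To make the semivariogram estimator in (2.7) concrete, the Python sketch below groups pairs of locations into distance bins and averages the squared differences within each bin. Binning is an assumption of this sketch, introduced because exact distances rarely repeat in scattered data; the function name, bin width, and toy data are hypothetical.

```python
import math
from itertools import combinations


def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Estimate gamma(h) per distance bin, following the first line of (2.7).

    Bin b collects pairs whose distance h satisfies
    b * bin_width <= h < (b + 1) * bin_width.
    Returns a list with one estimate per bin (None for empty bins).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(values)), 2):
        h = math.dist(coords[i], coords[j])
        b = int(h // bin_width)
        if b < n_bins:
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # Divide by 2|N(h)|, where |N(h)| is the number of pairs in the bin.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Toy example: three locations on a line with increasing fluoride values.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
values = [0.0, 1.0, 3.0]
gamma_hat = empirical_semivariogram(coords, values, bin_width=1.0, n_bins=3)
```

A parametric model (spherical, Gaussian, or exponential) would then be fitted to these binned estimates before the weights in (2.9) are computed; that fitting step is not shown here.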

2.3. K-Nearest Neighbors (KNN).

KNN is a well-known learning algorithm that is widely used in missing value imputation problems (Hastie et al., 1999; Chen and Shao, 2000; Falkowski et al., 2010). All these applications assume an autocorrelated data structure, which exists in the ZOE 2.0 fluoride data. Therefore, KNN is a good candidate method for imputation, although there is no literature discussing the application of KNN to this specific problem. Because KNN works very well for categorical variables, it can be used directly to impute missing fluoride levels using neighbors’ observed fluoride levels, without dealing with continuous fluoride concentration values. To select nearest neighbors, the distance between two subjects is computed from their X and Y coordinates and water source (i.e., well water vs. other). KNN accommodates a variety of distance metrics, such as Euclidean, Manhattan, and Mahalanobis. KNN-based imputation is straightforward: for a subject with a missing value, the value is imputed according to the majority vote of the k nearest neighbors’ observed fluoride levels, with ties broken at random. If there are ties for the k-th nearest neighbor, all candidate neighbors can be included in the vote, or one of them can be randomly selected.
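The majority-vote rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: water source is encoded as a 0/1 indicator and entered as a third coordinate in the Euclidean distance, ties for the k-th neighbor are resolved by input order, and the function name and toy data are hypothetical.

```python
import math
import random
from collections import Counter


def knn_impute(target, observed, k=3, rng=None):
    """Impute a fluoride level by majority vote among the k nearest neighbors.

    target: (x, y, well_indicator) for the subject with a missing value.
    observed: list of ((x, y, well_indicator), level) for measured subjects.
    Ties in the vote are broken at random using rng.
    """
    rng = rng or random.Random(0)
    ranked = sorted(observed, key=lambda rec: math.dist(target, rec[0]))
    votes = Counter(level for _, level in ranked[:k])
    top = max(votes.values())
    winners = sorted(level for level, count in votes.items() if count == top)
    return rng.choice(winners)


# Toy example: two nearby non-well neighbors at level 2 outvote a well at level 0.
observed = [
    ((0.0, 0.0, 0), 2),
    ((0.1, 0.0, 0), 2),
    ((0.2, 0.0, 1), 0),
    ((5.0, 5.0, 0), 1),
]
imputed = knn_impute((0.05, 0.0, 0), observed, k=3)
```

Because the coordinates are in degrees while the indicator is 0 or 1, a difference in water source acts as a large distance penalty in this encoding; other scalings are possible.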

2.4. Classification and Regression Trees (CART).

CART, fully introduced in Breiman et al. (1984), is a very popular tree-based method for classification or regression predictive modeling problems. It works by creating a set of binary rules that recursively splits the feature space into partitions with homogeneous values of the dependent variable. As a result, a CART model can be represented graphically as a binary decision tree (Figure 2.2) that is constructed by splitting a node into two child nodes repeatedly. Classification trees are designed for categorical dependent variables, while regression trees are for continuous or ordered discrete dependent variables. Starting from the root node, the CART algorithm works repeatedly in three steps for each terminal node of the tree:

1. Find each feature’s best split. For each feature, find the split that maximizes the splitting criterion. The resulting set contains one best split per feature.
2. Find the node’s best split. Among the best splits found in Step 1, choose the one that maximizes the splitting criterion.
3. Split the node using the best node split from Step 2 and repeat from Step 1 until the stopping criterion is satisfied.

Fig 2.2. — An example binary tree representation of a CART model based on all observed fluoride data in a county. At each intermediate node, a case goes to the left child node if and only if the condition is satisfied. The predicted fluoride level is given beneath each leaf node.

For classification trees, the Gini impurity index is commonly used in the splitting criterion, which is defined for node t as

I_{G} (t) = \sum_{i} p (i ∣ t) (1 - p (i ∣ t)),

(2.10)

where p(i|t) is the probability of an object in class i given that it falls into node t. The corresponding Gini splitting criterion is the decrease of impurity at node t, which is defined as

Δ I_{G} (t) = I_{G} (t) - p_{L} I_{G} (t_{L}) - p_{R} I_{G} (t_{R}),

(2.11)

where p_L (p_R) is the probability of assigning an object to the left (right) child node t_L (t_R), and I_G(·) is the Gini impurity measure as defined in (2.10). There are a variety of stopping criteria; the most common is to set a minimum node size, so that a node will not be split if its size is smaller than that minimum. A class label is then assigned to each terminal node by majority vote. In this way, every point in the feature space is assigned a class label. The fully grown tree is usually pruned using cross-validation to reduce its complexity and improve its generalization. This also helps prevent overfitting and balance the bias-variance trade-off.
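The Gini impurity (2.10) and the decrease in impurity (2.11) can be computed directly from class labels, as in the Python sketch below. The function names are hypothetical, and the class probabilities p(i|t) are estimated by within-node proportions, which is the usual empirical choice.

```python
from collections import Counter


def gini(labels):
    """Gini impurity (2.10) of a node, using class proportions as p(i|t)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Decrease in impurity (2.11) when parent is split into left and right."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


# Toy example: a split that separates the two classes perfectly versus one
# that leaves both child nodes as mixed as the parent.
parent = [0, 0, 2, 2]
perfect = gini_decrease(parent, [0, 0], [2, 2])
useless = gini_decrease(parent, [0, 2], [0, 2])
```

The perfect split removes all impurity from the parent node, while the uninformative split leaves the criterion at zero, so Step 2 of the algorithm would select the former.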

Regression trees are similar to classification trees with a few modifications. First, the least squared deviation (LSD) impurity measure is used in the splitting criterion (2.11) for regression trees, where LSD minimizes the sum of squared residuals. Second, instead of taking the majority vote as for classification trees, the within-node average value of the dependent variable is used as the predicted value at each terminal node. This yields piecewise constant models that are easy to interpret.

One of the appealing aspects of CART is that, unlike many linear combination methods like logistic regression or support vector machines (SVMs), it works well with a mixture of numerical and categorical features. Therefore, it is suitable for the ZOE 2.0 fluoride data. Classification trees will be used in this article to directly impute the fluoride level, with features including X and Y coordinates, and water source. CART has been implemented in R via the ‘rpart’ package (Therneau and Atkinson, 2008).

2.5. PAMRF.

Before formally introducing the PAMRF method, a brief overview of partitioning around medoids (PAM) clustering and random forest methods is provided, as well as the Gower distance that might be useful in cluster analysis of mixed data.

2.5.1. PAM Clustering.

PAM clustering, also referred to as k-medoids clustering, where medoids represent cluster centers, is one of the most popular algorithms for clustering mixed data with non-Euclidean measures, under which circumstances the widely used k-means clustering method is no longer applicable. The original PAM algorithm is fully described in Chapter 2 of Kaufman and Rousseeuw (2009), which uses a greedy search to divide n objects into k clusters. It contains two sub-algorithms: BUILD to choose an initial clustering, and SWAP to further improve the clustering towards a local optimum. The PAM algorithm has been implemented in R via the ‘cluster’ package (Maechler et al., 2019).


The silhouette method proposed in Rousseeuw (1987) provides a good interpretation of cluster analysis. For any object i, denote by S_i the cluster to which it has been assigned. The average dissimilarity of i to all other objects of S_i is computed by

a (i) = \frac{1}{| S_{i} | - 1} \sum_{j \in S_{i}, j \neq i} d (i, j) .

(2.12)

Also, the smallest average dissimilarity of i to all objects of any other cluster S ≠ S_i is computed by

b (i) = \min_{S \neq S_{i}} \frac{1}{| S |} \sum_{j \in S} d (i, j) .

(2.13)

Then the silhouette width of i is defined as

s (i) = \frac{b (i) - a (i)}{\max \{a (i), b (i)\}}, \quad \text{if } | S_{i} | > 1.

(2.14)

When S_i contains only object i, s(i) is simply set to zero. The silhouette width s(i) measures how well i has been classified, and an s(i) close to one means i is appropriately clustered. The average silhouette width $\bar{s}$ over all objects is usually used to evaluate the clustering validity and select an appropriate number of clusters.
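The silhouette widths in (2.12) to (2.14) can be computed from a dissimilarity matrix and a vector of cluster labels, as sketched below in Python. The function name and toy data are hypothetical. The sketch assumes at least two clusters, and it returns zero when both a(i) and b(i) are zero, which is a convention chosen here to avoid division by zero.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s(i) for every object.

    dist: symmetric matrix (list of lists) of pairwise dissimilarities d(i, j).
    labels: cluster label of each object; at least two distinct labels.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            widths.append(0.0)  # singleton cluster: s(i) is set to zero
            continue
        # a(i): average dissimilarity to the other members of the own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy example: four points on a line forming two well-separated pairs.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_widths(dist, [0, 0, 1, 1])
bad = silhouette_widths(dist, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

Comparing the average width across candidate clusterings, as done here for two labelings, mirrors how the number of clusters is selected: the clustering with the highest average silhouette width is preferred.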

2.5.2. Random Forest.

Random forest is another well-known tree-based method that can be used to build predictive models for both regression and classification problems. The original random forest algorithm, proposed in Breiman (2001), was essentially an application of bagging (Breiman, 1996) and random feature selection to CART (see Section 2.4). In a random forest, each tree is constructed using a different bootstrap sample of the data, and each node of each tree is split using the best split among a random subset of all features. This strategy prevents problems resulting from correlated individual trees, and is robust against overfitting. At each bootstrap iteration, data not in the bootstrap sample are called “out of bag” (OOB). An error estimate can be obtained by aggregating all OOB predictions. To predict new data, all individual tree predictions are aggregated (i.e., majority vote for classification and average for regression). For classification, the proportion of votes can be taken as predicted probabilities for each class.
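Two ingredients of this procedure, the bootstrap/OOB split and the aggregation of tree votes into class proportions, are sketched below in Python. The sketch does not grow any trees; the function names are hypothetical and the per-tree predictions are supplied as toy input.

```python
import random
from collections import Counter


def bootstrap_split(n, rng):
    """Draw a bootstrap sample of indices 0..n-1 and return (in_bag, oob)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def aggregate_votes(tree_predictions):
    """Majority class and vote proportions over the individual tree predictions."""
    votes = Counter(tree_predictions)
    total = len(tree_predictions)
    shares = {label: count / total for label, count in votes.items()}
    winner = max(shares, key=shares.get)
    return winner, shares


# Toy example: one bootstrap draw, and four trees voting on a single subject.
in_bag, oob = bootstrap_split(10, random.Random(1))
winner, shares = aggregate_votes([2, 2, 0, 2])
```

The vote proportions returned by `aggregate_votes` play the role of the predicted class probabilities mentioned above.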


Random forest can be used to measure the importance of variables in a natural way. Breiman (2001) described the “permutation accuracy importance” measure, one of the most advanced variable importance measures available in random forest. In this measure, variable importance can be calculated as follows: for classification, it is the average increase in OOB error rate when the variable is randomly permuted among the OOB samples of each tree; for regression, it is the average increase in squared OOB residuals when the variable is randomly permuted among the OOB samples of each tree. The other three measures of importance based on changes in margins or the Gini index were introduced in Breiman (2002).

Several characteristics make random forest ideal for our fluoride data. First, like CART, it works very well with mixed data types. Second, random forest is intrinsically suited for multi-class problems, eliminating the need to reduce a multi-class classification problem into multiple binary classification problems as in SVMs. In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values (Liaw et al., 2002). The implementation of Breiman’s original random forest method is available in R via the ‘randomForest’ package (RColorBrewer and Liaw, 2018).

2.5.3. Gower Distance.

Gower Distance is a distance measure that can be used to calculate distance between two subjects whose attributes are a mixture of categorical and numerical values. It derives from Gower’s coefficient of similarity, which was first introduced in Gower (1971). The Gower distance of two distinct subjects i and j, each of whom has p variables of mixed type, is calculated by

D_{i j} = \frac{\sum_{l = 1}^{p} w_{l} δ_{i j l} d_{i j l}}{\sum_{l = 1}^{p} w_{l} δ_{i j l}},

(2.15)

where w_l is the constant weight for variable l, and δ_ijl is an indicator of whether variable l can be compared for i and j. When δ_ijl = 1, the contribution d_ijl is assigned as follows:

For nominal or binary variables, d_ijl = 0 if the two values are equal and 1 otherwise.
For other quantitative variables, d_ijl is the absolute difference of the two values, divided by the total range R_l of that variable, which can be expressed as d_ijl = |x_il − x_jl| / R_l.

Variable weights are usually equal to 1 unless a variable’s value for one or both subjects is missing, when the corresponding weight is 0. Differential weighting can be used to express domain-specific knowledge on variable importance (see, e.g., Hennig and Liao (2013)). However, there is no general way to choose the variable weights (van de Velden, Iodice D’Enza and Markos, 2019). Gower distance is ideal for our mixed-type fluoride data.
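A short Python sketch of (2.15) for mixed data is given below. The function name, variable encoding, and toy values are hypothetical; in particular, treating the fluoride level as a nominal variable and the ranges used for X and Y are assumptions of this example. Missing values are represented by `None` and are skipped, which corresponds to δ_ijl = 0.

```python
def gower_distance(a, b, kinds, ranges, weights=None):
    """Gower distance (2.15) between two subjects a and b.

    kinds[l] is "num" for quantitative or "cat" for nominal/binary variables.
    ranges[l] is the total range R_l of a quantitative variable (unused for "cat").
    weights[l] is the constant weight w_l (default 1 for every variable).
    """
    p = len(kinds)
    weights = weights or [1.0] * p
    num = 0.0
    den = 0.0
    for l in range(p):
        if a[l] is None or b[l] is None:
            continue  # variable cannot be compared: delta_ijl = 0
        if kinds[l] == "cat":
            d = 0.0 if a[l] == b[l] else 1.0
        else:
            d = abs(a[l] - b[l]) / ranges[l]
        num += weights[l] * d
        den += weights[l]
    if den == 0.0:
        raise ValueError("no comparable variables")
    return num / den


# Toy example with variables (X, Y, water source, fluoride level).
kinds = ["num", "num", "cat", "cat"]
ranges = [8.0, 2.0, None, None]
subject_i = (-80.0, 35.0, "well", 0)
subject_j = (-78.0, 35.5, "other", 0)
equal_w = gower_distance(subject_i, subject_j, kinds, ranges)
custom_w = gower_distance(subject_i, subject_j, kinds, ranges, weights=[2.0, 1.0, 1.0, 8.0])
```

Up-weighting a variable on which two subjects agree, such as the fluoride level here, pulls them closer together, which illustrates how differential weights shape the resulting clusters.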

2.5.4. PAMRF.

The new imputation method, PAMRF, is a combination of unsupervised learning (i.e., PAM clustering) and supervised learning (i.e., random forest) for multiple imputation. The rationale for this machine-learning-based method is the clustering characteristic shown in Figure 1.1. In addition to the obvious separation of different counties, which leads to natural clusters, there also appear to be latent clusters inside those counties that have dozens of data points. Therefore, the first and most critical step is to employ PAM clustering to identify those latent clusters within each county. The subjects in a county are divided into clusters, using Gower distance to measure dissimilarities between subjects in fluoride level, geographical location (i.e., X and Y coordinates), and water source (i.e., well water vs. other). The silhouette method is employed to determine the optimal number of clusters, selecting the value with the highest average silhouette width. After clustering, a random forest model is trained for each county, accounting for the effects of both geographical location (i.e., X and Y coordinates) and water source on the number (i.e., “id”) of the cluster to which a subject belongs. Each subject with a missing value is then assigned into a specific cluster based on these models. Finally, multiple imputation is applied by randomly picking an observed fluoride level from a cluster for each imputation replicate. The proportion of votes for a predicted cluster from all individual trees is a good measure of the imputation confidence.
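The final multiple-imputation step can be sketched in Python as follows. The sketch takes the outputs of the earlier steps as given: the cluster predicted by the random forest for each subject with a missing value, the corresponding proportion of tree votes, and the observed fluoride levels in each cluster. All names and the toy inputs are hypothetical, and filtering by a vote-share threshold is shown as one possible way to apply the confidence measure.

```python
import random


def pamrf_draws(predicted_cluster, vote_share, cluster_levels, m=10, threshold=0.0, seed=0):
    """Draw m imputed fluoride levels per subject from the predicted cluster.

    predicted_cluster: subject id -> cluster id predicted by the classifier.
    vote_share: subject id -> proportion of trees voting for that cluster.
    cluster_levels: cluster id -> list of observed fluoride levels in the cluster.
    Subjects whose vote share is below threshold are left unimputed.
    """
    rng = random.Random(seed)
    imputations = {}
    for subject, cluster in predicted_cluster.items():
        if vote_share[subject] < threshold:
            continue  # imputation confidence too low for this application
        pool = cluster_levels[cluster]
        imputations[subject] = [rng.choice(pool) for _ in range(m)]
    return imputations


# Toy example: subject "a" is confidently assigned to a homogeneous cluster,
# while subject "b" falls below the confidence threshold and is skipped.
predicted_cluster = {"a": 1, "b": 2}
vote_share = {"a": 0.90, "b": 0.55}
cluster_levels = {1: [2, 2, 2], 2: [0, 1, 0]}
draws = pamrf_draws(predicted_cluster, vote_share, cluster_levels, m=5, threshold=0.6)
```

Drawing at random from the cluster, rather than always taking its most common level, is what lets the imputed datasets reflect within-cluster uncertainty.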


3. Application and Results.

All five imputation methods introduced in Section 2 are applied to the fluoride data, and a leave-one-out cross validation (LOOCV) is conducted to evaluate the performance of each method in terms of error rate. For the two methods with the lowest LOOCV error rates, a series of simulation studies are designed and conducted, mimicking the distribution of missingness in the real fluoride data. The estimation of association between fluoride level and dental caries status is assessed, assuming a binary logistic regression model. The influence of missingness proportion on the bias difference between the two methods is also explored in the simulation studies. Returning to the real data, the associations between fluoride level and four clinical traits of dental caries (two binary case status classifications and two continuous caries experience measures) are examined based on the observed data and three different sets of data multiply imputed using PAMRF; specifically, these datasets have increasing proportions of missing values imputed, while their average confidences decrease. Finally, the average confidence as a function of the number of imputed values, as well as the pattern of error rate and sample size over different threshold values in PAMRF, are explored and presented. This provides a transparent and useful means for investigators to decide the optimal threshold (i.e., depending on allowable error rate or desirable final sample size) according to their application.

3.1. Methods Assessment.

Since county is a natural unit for domestic water fluoride level, county-level imputation and assessment are performed for each method (i.e., apply each method within counties and then aggregate the results over all counties). This way, LOOCV is more appropriate compared with other cross validation techniques such as leave-p-out or k-fold, which may increase the variability in the assessment, making results difficult to interpret, as there are only a small number of data points within counties. Random resampling approaches are not considered for assessment, because they might destroy the spatial dependence considering the existence of latent clustering structures within counties. The procedure of LOOCV is as follows: each time, the imputation model is trained on all the subjects with measured fluoride concentration except for one, and a prediction is made for that subject. Then the predicted value is compared to the observed value for each subject and an error rate is computed across all trials, using

\text{LOOCV error rate} = \frac{\# \text{ incorrect predictions}}{\# \text{ predictions}} .

(3.1)
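The LOOCV procedure and the error rate in (3.1) can be sketched generically in Python. The function `loocv_error_rate` accepts any imputation function; the one-nearest-neighbor imputer and the toy records are hypothetical stand-ins used only to make the example runnable. In the study, this procedure would be applied within each county and the results aggregated.

```python
import math


def loocv_error_rate(records, impute):
    """Leave-one-out error rate as in (3.1).

    records: list of (features, observed_level).
    impute: function (target_features, training_records) -> predicted level.
    """
    errors = 0
    for i, (features, level) in enumerate(records):
        training = records[:i] + records[i + 1:]  # all subjects except the i-th
        if impute(features, training) != level:
            errors += 1
    return errors / len(records)


def nearest_neighbor_impute(target, training):
    """Stand-in imputer: level of the single closest training subject."""
    _, level = min(training, key=lambda rec: math.dist(target, rec[0]))
    return level


# Toy example: the last subject differs from its closest neighbor.
records = [
    ((0.0, 0.0), 0),
    ((0.0, 1.0), 0),
    ((5.0, 5.0), 2),
    ((5.0, 6.0), 2),
    ((0.0, 2.0), 2),
]
rate = loocv_error_rate(records, nearest_neighbor_impute)
```

Because each prediction uses every other measured subject, no observed value is wasted, which matters when counties contain only a handful of measurements.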

Firstly, all counties with fewer than three observations are identified; in these counties, all subjects with missing values are excluded from the imputation, because there is too little information to impute the missing values with a reasonably high level of confidence. After this preprocessing step, there remain 72 counties and 4,779 missing values for imputation. For each method, parameters are tuned using LOOCV if there are any. When comparing the methods’ performances, the best value that leads to the lowest LOOCV error rate is set for each parameter. Generally speaking, the value of a tuning parameter might not make much difference as long as it is within a reasonable range of values. Indeed, it turns out that the value of a tuning parameter does not affect the error rate considerably within the range set in the tuning process for the ZOE 2.0 data (see results of tuning for IDW, UK, and KNN in the top panel of Table 3.1).

Table 3.1.

Results of LOOCV error rate (%) for all 1,501 training data in the ZOE 2.0 study, obtained using IDW, UK, KNN with different parameter values, and PAMRF with different number of imputations and distance metrics (i.e., Gower vs. Euclidean).

	IDW					UK	KNN
	p = 1	p = 2	p = 3	p = 4	p = 5
n = 3	29.18	28.37	27.63	27.76	27.43	32.21	23.05
n = 4	29.18	28.37	27.22	27.70	27.36	33.33	24.53
n = 5	30.73	29.31	27.96	27.63	27.70	34.51	23.38
n = 6	30.66	29.18	27.63	27.43	27.76	33.07	24.06
n = 7	30.73	29.25	27.56	27.36	27.70	30.06	23.65
n = 8	31.13	29.04	27.56	27.43	27.63	30.01	23.99
n = 9	31.81	29.51	28.03	27.63	27.63	29.93	23.72
n = 10	31.94	29.38	28.10	27.56	27.63	28.88	24.12

	PAMRF (Gower)	PAMRF (Euclidean)
M = 10	21.25	20.82
M = 25	20.87	21.19
M = 50	20.88	20.96
M = 100	21.58	21.22
M = 150	21.14	21.68
M = 200	21.33	20.65
M = 250	21.67	20.99

n: Number of nearest neighbors used for imputation in IDW, UK, and KNN;

p: Power parameter in IDW;

M: Number of imputations in PAMRF.

For IDW, the fluoride information of the 4 nearest neighbors is used, and the power parameter is set to 3. For UK, Figure 3.1 suggests that, compared with the other candidate models, the semivariogram is best fit by the exponential model with range parameter 0.45, partial sill 0.1, and nugget effect 0.01. The number of nearest neighbors used for imputation with UK is 10. For KNN, the Euclidean distance is computed and the missing value is imputed based on the information of the 3 nearest neighbors. The Mahalanobis distance is not used because the sample covariance matrix is singular in some counties. For any of these three methods, if the number of observations in a county is smaller than the required number of nearest neighbors, then all observations in that county are used for imputation. For CART, the default settings of the R function rpart are used without tuning any parameters.

For PAMRF, the Gower distance is computed, with the X coordinate and fluoride level up-weighted by factors of 2 and 8, respectively. This set of weights is selected because it achieves the lowest LOOCV error rate among all sets of weights explored for this analysis. In fact, the weights in the Gower distance have only a minor impact on the performance of PAMRF. Moreover, substituting Euclidean distance for Gower distance causes only a slight variation in error rate (see the bottom panel of Table 3.1). To determine the number of clusters in a county, its value is allowed to increase from 2 up to the number of observed subjects in that county, and the corresponding average silhouette width is calculated for each value. The optimal number of clusters for that county is the one with the highest average silhouette width. After clustering, random forest models are fitted within counties via the R function randomForest, using the default settings. Associations between fluoride level and dental caries are explored using 10, 25, 50, 100, 150, 200, and 250 imputations. Eventually, 50 imputations are used in all subsequent analyses because the results in Tables 3.3 and 3.4 do not change much beyond 50 imputations.
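To make the weighted Gower distance concrete, here is a minimal Python sketch. It assumes that numeric variables contribute a range-scaled absolute difference, that the categorical water source contributes a 0/1 mismatch, and that fluoride level is treated as numeric. The variable order, the ranges, and the unit weights on the Y coordinate and water source are assumptions for illustration and are not taken from the authors' R code.

```python
def gower_distance(a, b, kinds, ranges, weights):
    """Weighted Gower distance between two records.

    kinds[j] is "num" or "cat"; ranges[j] is the observed range of a numeric
    variable (ignored for categorical ones); weights[j] is its weight.
    """
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "num":
            d = abs(x - y) / rng if rng > 0 else 0.0   # scaled to [0, 1]
        else:
            d = 0.0 if x == y else 1.0                 # simple mismatch
        total += w * d
    return total / sum(weights)

# Assumed variable order: X coordinate, Y coordinate, water source, fluoride level.
kinds = ["num", "num", "cat", "num"]
ranges = [2.0, 2.0, None, 2.0]       # hypothetical within-county ranges
weights = [2.0, 1.0, 1.0, 8.0]       # X and fluoride up-weighted by 2 and 8

s1 = (0.0, 0.0, "well", 0)
s2 = (1.0, 2.0, "municipal", 2)
d12 = gower_distance(s1, s2, kinds, ranges, weights)
```

With these weights, two subjects that differ in fluoride level are pushed far apart even when they are geographically close, which is the intended effect of up-weighting that variable.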

Fig 3.1. — Plots of the semivariogram under exponential, spherical, Gaussian, circular, linear, and power models

Table 3.3.

Estimation of effect of fluoride level on caries in logistic regression based on observed + imputed data by KNN and PAMRF, with 75%, 50%, and 25% missingness.

Missingness	KNN				PAMRF
Proportion	Mean	SE	SEE	CP	Mean	SE	SEE	CP
75%	−0.239	0.053	0.051	78%	−0.269	0.050	0.051	91%
50%	−0.271	0.052	0.051	90%	−0.286	0.052	0.051	93%
25%	−0.287	0.051	0.051	94%	−0.293	0.051	0.051	95%

Note: Mean and SE denote the empirical mean and standard error of the β estimator, SEE denotes the empirical mean of the standard error estimator, and CP denotes the empirical coverage percentage of the 95% confidence interval. The true value of β is −0.3.

Table 3.4.

Results of association testing between fluoride and clinical traits using different clinical definitions of dental caries and different sets of fluoride data (imputed by PAMRF based on 50 multiple imputations)

	Observed data	Observed + imputed with CS=1	Observed + imputed with CS⩾0.85	Observed + all imputed
Caries case status (sensitive)¹
Fluoride level 0: n (%)^*	493 (90.96)	645 (91.91)	1099 (93.14)	1683 (93.00)
1: n (%)^*	311 (89.37)	384 (90.80)	644 (90.37)	1252 (92.05)
2: n (%)^*	556 (91.00)	857 (89.99)	1895 (90.48)	2825 (90.86)
2 vs. 0: β; SE; p-value	−0.01; 0.21; 0.96	−0.23; 0.18; 0.18	−0.36; 0.14; 0.01	−0.29; 0.11; 0.01
1 vs. 0: β; SE; p-value	−0.21; 0.23; 0.35	−0.14; 0.22; 0.53	−0.37; 0.17; 0.03	−0.14; 0.14; 0.32
2 vs. 1 or 0: β; SE; p-value	0.08; 0.18; 0.68	−0.18; 0.15; 0.24	−0.20; 0.11; 0.07	−0.23; 0.09; 0.01
Caries experience (sensitive)²
Fluoride level 0: Mean (SD)	14.77 (15.08)	16.21 (16.33)	15.89 (15.64)	16.61 (16.68)
1: Mean (SD)	13.26 (14.45)	13.28 (14.38)	14.43 (14.93)	15.91 (15.91)
2: Mean (SD)	13.39 (14.37)	13.21 (14.14)	13.66 (14.30)	13.66 (14.23)
2 vs. 0: β; SE; p-value	−0.08; 0.06; 0.21	−0.18; 0.05; <0.01	−0.16; 0.04; <0.01	−0.18; 0.03; <0.01
1 vs. 0: β; SE; p-value	−0.15; 0.08; 0.05	−0.21; 0.07; <0.01	−0.14; 0.05; 0.01	−0.05; 0.04; 0.17
2 vs. 1 or 0: β; SE; p-value	−0.02; 0.06; 0.71	−0.10; 0.05; 0.03	−0.11; 0.03; <0.01	−0.15; 0.03; <0.01
Caries case status (less sensitive)³
Fluoride level 0: n (%)^*	295 (54.43)	393 (56.01)	669 (56.72)	1035 (57.19)
1: n (%)^*	179 (51.44)	224 (52.87)	402 (56.40)	793 (58.25)
2: n (%)^*	279 (45.66)	464 (48.76)	1046 (49.96)	1546 (49.73)
2 vs. 0: β; SE; p-value	−0.36; 0.12; <0.01	−0.29; 0.10; <0.01	−0.27; 0.07; <0.01	−0.30; 0.06; <0.01
1 vs. 0: β; SE; p-value	−0.13; 0.14; 0.35	−0.13; 0.12; 0.31	−0.01; 0.10; 0.90	0.04; 0.07; 0.55
2 vs. 1 or 0: β; SE; p-value	−0.31; 0.11; <0.01	−0.24; 0.09; <0.01	−0.27; 0.06; <0.01	−0.32; 0.05; <0.01
Caries experience (less sensitive)⁴
Fluoride level 0: Mean (SD)	7.95 (13.99)	8.90 (15.06)	8.59 (14.51)	9.33 (15.77)
1: Mean (SD)	6.81 (12.60)	6.89 (12.86)	7.82 (13.41)	9.02 (14.88)
2: Mean (SD)	6.58 (13.00)	6.55 (12.64)	6.79 (12.81)	6.80 (12.73)
2 vs. 0: β; SE; p-value	−0.22; 0.08; <0.01	−0.24; 0.07; <0.01	−0.21; 0.05; <0.01	−0.24; 0.04; <0.01
1 vs. 0: β; SE; p-value	−0.15; 0.09; 0.11	−0.19; 0.08; 0.02	−0.06; 0.07; 0.39	0.00; 0.05; 0.99
2 vs. 1 or 0: β; SE; p-value	−0.16; 0.07; 0.02	−0.17; 0.06; <0.01	−0.18; 0.04; <0.01	−0.24; 0.03; <0.01

¹ Binary variable (i.e., disease/no disease) based on the detection of one or more caries lesions at the ICDAS⩾1 threshold (i.e., including early-stage lesions)

² Corresponding count variable (i.e., sum of decayed, restored or missing tooth surfaces using ICDAS⩾1 diagnostic criteria)

³ Binary variable (i.e., disease/no disease) based on the detection of one or more caries lesions at the ICDAS⩾3 threshold (i.e., not including early-stage lesions)

⁴ Corresponding count variable (i.e., sum of decayed, restored or missing tooth surfaces using ICDAS⩾3 diagnostic criteria)

* Number (percentage) of study subjects diagnosed as dental caries “cases” in each fluoride level group

The LOOCV error rates for assessing the five imputation methods are presented in Table 3.2(a). Considering fluoride level as an ordinal variable, the Spearman’s rank correlation coefficient is computed between the predicted values achieved using different imputation methods in LOOCV and the true values. These correlations are presented in Table 3.2(b). According to the LOOCV results, our new method, PAMRF, has the best performance among all of the five imputation methods, with the lowest error rate and the highest correlation with the true values. KNN imputation is also reasonable—its performance is only slightly worse than that of PAMRF. The other three methods, IDW, UK, and CART, will no longer be considered in the rest of Section 3.
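For readers who want to reproduce this kind of comparison, the sketch below computes Spearman's rank correlation as the Pearson correlation of tie-adjusted (average) ranks, which matters here because the three ordinal fluoride levels produce many ties. The two vectors are hypothetical and only illustrate the calculation.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1          # average of 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the tie-adjusted ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

true_levels = [0, 0, 1, 1, 2, 2, 2, 0]      # hypothetical observed levels
predicted = [0, 1, 1, 1, 2, 2, 1, 0]        # hypothetical LOOCV predictions
rho = spearman(true_levels, predicted)
```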

Table 3.2.

Results of LOOCV for all 1,501 training data in the ZOE 2.0 study, obtained using IDW, UK, KNN, CART, and PAMRF, and the best values for all tuning parameters.

(a) LOOCV error rate		(b) Spearman’s correlation matrix
Method	Error rate (%)		True	IDW	UK	KNN	CART
IDW	27.22	IDW	0.70
UK	28.88	UK	0.63	0.64
KNN	23.11	KNN	0.74	0.72	0.69
CART	27.63	CART	0.65	0.69	0.57	0.72
PAMRF	20.96	PAMRF	0.77	0.78	0.68	0.84	0.77

Note: Each entry related to PAMRF is the mean among all 50 imputations, using Gower distance.

3.2. Simulation Studies.

The association between fluoride exposure and caries experience is of interest in the parent study. Since nearly 75% of the fluoride data are imputed rather than observed, the estimated association might deviate from the truth. From this perspective, a good imputation method should lead to an association estimator with small bias and variance. To further compare KNN and PAMRF, a series of simulation studies is conducted based on an artificial dataset, and the performance of each method in estimating the association between fluoride level and caries status is evaluated.

Similar to the real fluoride data obtained from the ZOE 2.0 study, the artificial dataset contains 6,280 subjects and 4 variables: X and Y coordinates, domestic water source, and fluoride level. It is generated as follows: within each latent cluster identified in PAMRF, each missing fluoride level is replaced by a random draw from all the observed fluoride levels in that cluster. Note that the artificial dataset is complete and fixed. Missingness is created subsequently to mimic the missingness pattern in the real data.
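The within-cluster random draw used to build the artificial dataset can be sketched as follows. This is an illustrative sketch only: it assumes that a cluster label is already available for every subject, including those with missing fluoride, and the labels, levels, and seed are hypothetical.

```python
import random

def fill_within_clusters(levels, clusters, rng):
    """Replace each missing level (None) by a random draw from the observed
    levels of the same cluster; observed levels are left unchanged."""
    observed = {}
    for level, c in zip(levels, clusters):
        if level is not None:
            observed.setdefault(c, []).append(level)
    return [level if level is not None else rng.choice(observed[c])
            for level, c in zip(levels, clusters)]

levels = [0, None, 0, None, 2, 2, None, 1]            # hypothetical fluoride levels
clusters = ["a", "a", "a", "a", "b", "b", "b", "b"]   # hypothetical cluster labels
complete = fill_within_clusters(levels, clusters, random.Random(2021))
```

Because the draw is made from the observed levels themselves, a cluster whose observed subjects all share one level is filled with that level only.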

Within each replicate, the binary dental caries status variable is generated from the logistic regression model

$$\log \frac{\pi}{1 - \pi} = \alpha + \beta \times I(\text{fluoride level} = 2), \tag{3.2}$$

where π is the probability of disease conditional on fluoride level, and the true values of α and β are set to 0.1 and −0.3, respectively. To capture the distribution of missingness in the real data, a beta-binomial distribution is fitted to the observed (M_i, N_i) among all 85 counties, where M_i and N_i are the number of subjects with missing fluoride levels and the total sample size in the i-th county, respectively. Then a dataset with 75% missingness is created from the original artificial dataset by randomly selecting ${\tilde{M}}_{i}$ subjects within the i-th county and removing their fluoride levels, where ${\tilde{M}}_{i}$ follows the fitted beta-binomial distribution with the mean adjusted to 0.75. This way, the simulated dataset has a missingness distribution that is similar to the real dataset. The missing fluoride levels in the simulated dataset are imputed by KNN and PAMRF separately, and the same logistic regression model described in (3.2) is fitted to the observed + imputed data. Note that for PAMRF, 50 imputations are used, and the pooled parameter estimates and standard errors are calculated using Rubin’s rules (Rubin, 1987), which have been implemented in R via the ‘mice’ package (Buuren and Groothuis-Oudshoorn, 2010).
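For reference, the pooling step of Rubin's rules can be sketched in a few lines of Python. The sketch covers only the pooled point estimate and its standard error and omits the degrees-of-freedom calculation; the five estimates below are hypothetical and stand in for the 50 per-imputation fits.

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool M point estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                     # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m                 # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical estimates of beta and their standard errors from M = 5 imputed datasets.
betas = [-0.28, -0.31, -0.27, -0.30, -0.29]
ses = [0.05, 0.05, 0.05, 0.05, 0.05]
beta_pooled, se_pooled = pool_rubin(betas, ses)
```

The pooled standard error exceeds each per-imputation standard error because the between-imputation variance carries the extra uncertainty due to the missing values.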

All steps described in the previous paragraph are repeated 1,000 times, and the estimation results for β are summarized based on these 1,000 replicates. In addition to 75% missingness, which corresponds to the missingness in the ZOE 2.0 data, 50% and 25% missingness are also studied in the simulations. The simulation results on the β estimator for the two imputation methods and three missingness proportions are presented in Table 3.3. As shown in the table, when the missingness proportion is small (e.g., 25%), both methods perform well: the biases of the β estimator are small (slightly larger for KNN), the standard error estimators are accurate, and the confidence intervals have proper coverage probabilities. However, as the missingness proportion increases, the difference in bias between the two methods also increases, with the bias for KNN rising sharply. In particular, for 75% missingness, the absolute relative bias is 10.3% for PAMRF versus 20.3% for KNN, roughly a twofold difference. The relatively large bias for KNN adversely impacts its confidence interval coverage, as shown in the table. In addition, in the 75% missingness case, KNN causes approximately 11% loss of statistical efficiency compared with PAMRF. The relatively stable and satisfactory performance of PAMRF in the simulation studies, together with the fact that PAMRF outperforms all the other methods in LOOCV as demonstrated in Section 3.1, provides convincing evidence that PAMRF is the most suitable method for imputing the missing fluoride levels in the ZOE 2.0 data.
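The relative bias and efficiency figures quoted for 75% missingness can be checked against the first row of Table 3.3. The worked arithmetic below assumes that relative efficiency is measured by the ratio of squared empirical standard errors, which is one common convention and is not stated explicitly in the text.

```latex
\begin{align*}
\text{relative bias (PAMRF)} &= \frac{\lvert -0.269 - (-0.3) \rvert}{\lvert -0.3 \rvert} = \frac{0.031}{0.3} \approx 10.3\%, \\
\text{relative bias (KNN)} &= \frac{\lvert -0.239 - (-0.3) \rvert}{\lvert -0.3 \rvert} = \frac{0.061}{0.3} \approx 20.3\%, \\
\text{relative efficiency (KNN vs. PAMRF)} &= \left( \frac{0.050}{0.053} \right)^{2} \approx 0.89 .
\end{align*}
```

Under that convention, a relative efficiency of about 0.89 corresponds to the efficiency loss of approximately 11% for KNN.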

3.3. Imputation of Fluoride.

To impute missing fluoride values in the real data, the steps for PAMRF described in Section 3.1 are followed. As discussed in Section 2.5.2, each imputed fluoride level’s confidence score (CS) is defined by the proportion of the majority votes obtained from the random forest. It is clear that 0 < CS ⩽ 1. The output data contain 50 imputed datasets for all 4,779 subjects and their corresponding confidence scores. Note that confidence score is invariant over the 50 imputed fluoride levels for a fixed subject.
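As an illustration of this definition, the sketch below computes a confidence score from the per-tree votes of a forest for one subject. The class labels and the number of trees are hypothetical, and how the votes are extracted from a fitted forest depends on the software used.

```python
from collections import Counter

def confidence_score(tree_votes):
    """Return (majority class, proportion of trees voting for it)."""
    counts = Counter(tree_votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(tree_votes)

# Hypothetical votes from a forest of 10 trees for one subject.
votes = ["A"] * 7 + ["B"] * 2 + ["C"]
label, cs = confidence_score(votes)
```

Since the winning class always receives at least one vote, the score is strictly positive, and it equals 1 only when all trees agree.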

Although PAMRF has the best performance, its LOOCV error rate is still high (20.96% without any filtration). To control the error rate (e.g., <10%), subjects whose missing values cannot be confidently imputed are identified and excluded. This is done by setting a threshold for the confidence score and removing imputed values with confidence scores below it from the imputation results. A filtration with a specific confidence score threshold is applied in the LOOCV step and the error rate is recalculated. A lower error rate can be achieved by a stricter filtration (i.e., a larger confidence score threshold), but this comes at the expense of fewer imputed subjects; that is, there is a trade-off between error rate and sample size. Naturally, the choice of the confidence score threshold depends on the specific application. If fluoride is the main exposure, then one might need to be very conservative in fluoride imputation and require a very low error rate (e.g., 5%). If fluoride is a covariate for adjustment in another model, then one will likely prefer to maintain the analytical sample size and include all possible imputed values despite a relatively high error rate. Figure 3.2 displays the patterns of the average confidence score among all imputed values remaining after filtration, the LOOCV error rate, and the sample size over different values of the threshold. The joint presentation of error rate and sample size can help investigators choose an optimal confidence score threshold for their specific problem or application.
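The trade-off between error rate and sample size can be sketched as follows. The LOOCV records below are hypothetical; in practice each record would hold a subject's confidence score together with the predicted and observed fluoride levels.

```python
def filtered_error_rate(results, threshold):
    """results: list of (confidence_score, predicted, observed) from LOOCV.
    Keeps predictions with CS >= threshold; returns (error rate, number kept)."""
    kept = [(pred, obs) for cs, pred, obs in results if cs >= threshold]
    if not kept:
        return None, 0
    wrong = sum(pred != obs for pred, obs in kept)
    return wrong / len(kept), len(kept)

# Hypothetical LOOCV output: (CS, predicted level, observed level).
loocv = [(1.00, 2, 2), (0.95, 0, 0), (0.90, 2, 2), (0.85, 1, 1), (0.80, 2, 0),
         (0.70, 0, 0), (0.60, 1, 2), (0.55, 2, 2), (0.50, 0, 1), (0.40, 1, 1)]
for threshold in (1.0, 0.85, 0.0):
    rate, n = filtered_error_rate(loocv, threshold)
    print(f"CS >= {threshold:.2f}: error rate = {rate:.2f}, n = {n}")
```

Scanning a grid of thresholds in this way yields the kind of error-rate and sample-size curves that an investigator would inspect before fixing a threshold.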

Fig 3.2. — Patterns of average confidence score, LOOCV error rate, and sample size over different values of threshold. (a) and (b) display the function of average confidence score among all the imputed values after a filtration versus the number of imputed values and the total sample size, respectively, starting from the 576 imputed subjects with CS=1 towards all of the 4,779 imputed subjects. (c) and (d) show patterns of LOOCV error rate, imputed sample size and total sample size over different values of threshold, decreasing from 1 to 0 by 0.1 each time.

By setting the confidence score threshold at 1, 0.85, and 0, three 4,779×50 datasets of multiply imputed values are obtained (each row contains the 50 imputed values or “NA” for a subject, depending on whether its CS reaches the threshold): “imputed with CS=1”, “imputed with CS⩾0.85”, and “all imputed” values, respectively. By combining each dataset of imputed values with the 1,501 observed values, three complete datasets with dimensions 6,280×50 are further obtained, namely “observed + imputed with CS=1”, “observed + imputed with CS⩾0.85”, and “observed + all imputed” values, respectively. The confidence score threshold 0.85 is chosen rather than other values within the range of 0 and 1 because as is illustrated in Figure 3.2(d), setting a threshold 0.85 leads to the largest total sample size (n=3,988) while controlling the error rate below 10%, which is the desired maximum error rate in the ZOE 2.0 study. For each complete dataset, a row-wise majority vote among all 50 imputed values is taken to create a fluoride map using the obtained vector of single fluoride values. The three maps generated in this way are displayed in Figure 3.3, where the original fluoride map containing only observed data points is also included for reference. The spatial correlation pattern appears to be well preserved after having all missing values imputed by PAMRF.
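The row-wise majority vote used to collapse the multiply imputed values into a single value per subject can be sketched as below. The tie-breaking rule (toward the lowest level) is an assumption made for this sketch, and the small matrix is hypothetical.

```python
from collections import Counter

def rowwise_majority(imputed_rows):
    """For each subject, return the most frequent imputed level, or None if
    the subject was filtered out (row of None values)."""
    result = []
    for row in imputed_rows:
        values = [v for v in row if v is not None]
        if not values:
            result.append(None)
            continue
        counts = Counter(values)
        best = max(counts.values())
        # Ties are broken toward the lowest level (an arbitrary, assumed rule).
        result.append(min(v for v, n in counts.items() if n == best))
    return result

# Hypothetical 3 subjects x 5 imputations (the study uses 50 imputations).
imputed = [[2, 2, 1, 2, 2],
           [0, 1, 0, 0, 1],
           [None, None, None, None, None]]   # CS below the threshold
single_values = rowwise_majority(imputed)
```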

Fig 3.3. — Fluoride maps based on different sets of observed and imputed fluoride values, with each imputed value obtained from a majority vote among all 50 imputations

3.4. Association between Fluoride and Caries.

Associations between fluoride exposure and caries experience are estimated using four clinical traits defined for caries in the parent study. The two binary classifications (i.e., disease/no disease) include a sensitive (i.e., ICDAS¹⩾1) and a less sensitive (i.e., ICDAS⩾3) criterion for the detection of caries lesions (Ginnis et al., 2019). Their corresponding count counterparts (i.e., number of affected tooth surfaces) are also considered as measures of participants’ dental caries experience (ranging from 0 to 88). Theoretically, a higher fluoride level is expected to be associated with both lower prevalence and lower severity of disease. It is of interest to determine if and how the estimated associations change with the inclusion of different sets of multiply imputed fluoride values. When estimating or testing the associations between fluoride level and clinical traits, three different categorical contrasts for fluoride level (0: none/non-detectable, 1: sub-optimal, 2: optimal) are considered: (2 vs. 0), (1 vs. 0), and (2 vs. 1 or 0).

Given a vector of fluoride levels, the association between fluoride level and caries is explored as follows:

For the two binary dental caries status variables, the prevalence (number and proportion of caries cases) within each fluoride level group is summarized, and a logistic regression model is fitted, with binary case status as the response and fluoride level as the predictor.
For the two count variables of caries experience, the means and standard deviations (SD) within each fluoride level group are summarized, and a linear regression model is fitted with logarithmic transformation of count as the response and fluoride level as the predictor.

The results based on 50 column vectors in the completed datasets are pooled using Rubin’s rules. The regression coefficient β in each model can be used for pairwise contrasts between different fluoride levels. Separate analyses are performed for the observed data and the three complete datasets.

Results of the association analyses are presented in Table 3.4. As the table shows, the results for contrast (2 vs. 0) and contrast (2 vs. 1 or 0) generally become increasingly significant as more multiply imputed values are included, suggesting that estimates of association between fluoride level and clinical traits become more precise with a larger sample size. Specifically, with the inclusion of multiply imputed values in the regression models, p-values for testing the hypothesis H₀ : β = 0 vs. H₁ : β ≠ 0 in these two contrasts move from non-significant to significant in the top two panels of the table, whereas in the bottom two panels, the β estimates remain essentially unchanged but their standard errors decrease. In other words, these associations might be under-detected with insufficient sample size, and our imputation of missing fluoride values is meaningful in this context.

The results for contrast (1 vs. 0), representing none/non-detectable (0) and sub-optimal (1) groups, are sensitive to the choice of CS, owing to the combination of the relatively small sample sizes for these groups and the overall missing data rate of 75%, which limits the amount of observed information to perform imputation. Even for the observed data only, considering the detection limit and the potential measurement error, some participants who truly belong to the sub-optimal group might be misclassified to the none/non-detectable group. Thus, the results of contrast (1 vs. 0) may not be completely reliable. In fact, such an issue of sensitivity is not unique to PAMRF, and similar results can be obtained using other imputation methods such as KNN.

Finally, the absence of statistical significance in contrast (1 vs. 0) in the last column of Table 3.4 indicates that there is no appreciable difference in caries experience or severity between none/non-detectable and sub-optimal level of fluoride. This result suggests that these two fluoride levels may be combined as “sub-optimal” (<0.60ppm) fluoridation versus optimal (⩾0.60ppm) fluoridation in future studies.

4. Discussion.

This article addressed the imputation of missing fluoride values in a unique, real-world, spatially-correlated dataset from a community-based epidemiologic study of early childhood oral health. A new method called PAMRF was proposed and demonstrated to outperform four other popular imputation methods, IDW, UK, KNN, and CART, in both leave-one-out cross-validation and simulation studies. A procedure was described to fine-tune PAMRF by adding a threshold for the imputation confidence score, which excludes imputed values with relatively low confidence scores and thus reduces the LOOCV error rate. The error rate and imputed sample size were shown to increase as the confidence score threshold decreases from 1 to 0, which should aid investigators in choosing an optimal threshold value according to the needs of their specific application (e.g., conservative vs. inclusive). When multiply imputed data are included in the examination of associations between fluoride level and clinically determined dental caries experience in the ZOE 2.0 study, more precise (i.e., statistically stronger) estimates of association are achieved compared to analysis restricted to fully observed data.

An issue that deserves discussion is that one possible source of bias in the ZOE 2.0 data is geography, as the nonresponse rate may vary over counties. To address potential sources of bias at the area-based level, PAMRF adopts a finely stratified imputation approach, by imputing missing data within each cluster within each county. This approach is reasonable because clustering is based on all observed variables which can be conceived to contribute to nonresponse bias. In one sense, fluoride level is missing completely at random within each cluster, but is more accurately characterized as missing-at-random upon considering that observed information is used to define clusters. In fact, PAMRF is still valid when within-cluster missingness is missing at random.

This article focuses on imputation of domestic water fluoride level. In this sense, county is regarded as the appropriate geographical unit for defining domestic water supply. However, it is not a perfect measure of fluoride exposure. First, considering that two children within the same county can get their water from different sources, ZOE 2.0 additionally asked respondents whether their children’s primary source of drinking water was well water, municipal water, bottled water, or some combination thereof; this added information was incorporated in the proposed PAMRF multiple imputation method. However, to the extent that children may travel across county lines and drink water from other sources, domestic water supply may not fully account for their exposure to fluoridated water. In a broader context, the ZOE 2.0 study that motivates the proposed methods has considered other factors related to dental health and other ways of clustering based on its sampling design (i.e., ‘Head Start Program’ or ‘Head Start Center’) in conceptual and statistical models (Divaris and Joshi, 2020; Divaris et al., 2020).

To elaborate further, a county-level assessment is conducted mainly because county is a natural administrative unit for the determination of water systems fluoridation. However, some other factors may cause spatial correlation in fluoride levels between two counties (e.g., the nature of watersheds, possible shared water treatment plants, shared wells across county lines, etc.). Lacking observed information on such variables, each imputation method can only be applied either within counties or without subdividing by county. It was found that IDW, UK, CART, and PAMRF all perform better when applied within counties, while there is no appreciable difference in performance for KNN. This is another reason for county-level assessment in this article. In fact, even if the between-county correlation is substantial, PAMRF can automatically address the issue to some extent through the incorporated multiple imputation. To see this, suppose two clusters have similar structures; then randomly picking from each cluster separately is almost equivalent to randomly picking from the union of these two clusters. Of note, when the outcome of interest is some measurement other than water fluoride and its association with young children’s oral health, county-level assessment may no longer be appropriate, as individuals may travel to adjacent counties for work or use health services elsewhere, which calls for pooling counties into larger Health Service Areas that tend to utilize the same health services throughout the area.

Here are some closing remarks on the five imputation methods considered in this article.

IDW and UK are quite straightforward to use in ArcGIS, but they are not as flexible as the other methods. For example, IDW does not allow any covariates, and UK assumes a Gaussian process and stationarity, which may not always hold. Unlike the other three methods, IDW and UK are targeted at continuous fluoride values, which are left-truncated. This is a problem because these two methods make no distinction between true zeros and non-detectable, non-zero fluoride concentration values. One possible way to solve this problem is to impute the left-truncated zeros for the neighbors using a two-part semi-continuous distribution with left-truncation allowed for the continuous part (e.g., log-normal) before applying IDW or UK. This can be achieved by fitting a two-part mixed model (Su, Tom and Farewell, 2009) and conducting multiple imputation.
KNN might be a good choice for data with a low missingness rate (e.g., ≤ 25%) because it is very simple. However, it is less suitable for the ZOE 2.0 fluoride data with 75% missingness for two reasons. First, it is sensitive to outliers and usually causes more variability, as shown in the simulation studies. Second, it cannot be combined with multiple imputation to appropriately account for the uncertainty associated with the missing values. This article considered some common distance metrics such as Euclidean and Mahalanobis. Given additional information, some weighted distance metric accounting for the structure of county and data points in the sample could be considered. Such information is lacking in the ZOE 2.0 data that motivated this study, so this topic is left for future research.
CART is the only method that is not distance-based. In other words, it fails to make full use of the spatial dependence, which is one possible reason for its moderate performance.
As a combination of unsupervised learning and supervised learning, PAMRF captures the latent structure of spatial autocorrelation without making any distributional assumptions. It is also more robust to noise and outliers compared with the other methods. Moreover, it is easy to implement in R, and example R code is provided in the Supplemental Material (Gu et al., 2021).

Finally, model-based approaches such as Bayesian methods were not considered in this article, because they require strong distributional assumptions and sometimes can be very complicated.

Acknowledgements.

This work was supported by a grant from the National Institute of Dental and Craniofacial Research, U01-DE025046. The authors thank the editor, the associate editor and three referees for helpful comments.

Footnotes

SUPPLEMENTARY MATERIAL

Code

Example code for implementing PAMRF in R.

¹ The International Caries Detection and Assessment System (ICDAS)

REFERENCES

Breiman L (1996). Bagging Predictors. Machine learning 24 123–140.
Breiman L (2001). Random Forests. Machine learning 45 5–32.
Breiman L (2002). Manual on Setting up, Using, and Understanding Random Forests v3. 1. Statistics Department University of California Berkeley, CA, USA 1 58.
Breiman L, Friedman J, Stone CJ and Olshen RA (1984). Classification and Regression Trees. CRC press.
Brunelle J and Carlos J (1990). Recent Trends in Dental Caries in US Children and the Effect of Water Fluoridation. Journal of Dental Research 69 723–727.
Buuren SV and Groothuis-Oudshoorn K (2010). mice: Multivariate Imputation by Chained Equations in R. Journal of statistical software 1–68.
Cate JMT (1999). Current Concepts on the Theories of the Mechanism of Action of Fluoride. Acta Odontologica Scandinavica 57 325–329.
Chen J and Shao J (2000). Nearest Neighbor Imputation for Survey Data. Journal of official statistics 16 113.
Cliff AD and Ord JK (1981). Spatial Processes: Models & Applications. Taylor & Francis.
Cressie N (1993). Statistics for Spatial Data. John Wiley & Sons.
Divaris K and Joshi A (2020). The Building Blocks of Precision Oral Health in Early Childhood: the ZOE 2.0 Study. Journal of public health dentistry 80 S31–S36.
Divaris K, Slade GD, Ferreira Zandona AG, Preisser JS, Ginnis J, Simancas-Pallares MA, Agler CS, Shrestha P, Karhade DS, Ribeiro ADA et al. (2020). Cohort Profile: ZOE 2.0—A Community-Based Genetic Epidemiologic Study of Early Childhood Oral Health. International Journal of Environmental Research and Public Health 17 8056.
Eckert S, Feingold E, Cooper M, Vanyukov MM, Maher BS, Slayton RL, Willing MC, Reis SE, McNeil DW, Crout RJ et al. (2017). Variants on Chromosome 4q21 Near PKD2 and SIBLINGs Are Associated with Dental Caries. Journal of human genetics 62 491–496.
Falkowski MJ, Hudak AT, Crookston NL, Gessler PE, Uebler EH and Smith AM (2010). Landscape-Scale Parameterization of A Tree-Level Forest Growth Model: A K-Nearest Neighbor Imputation Approach Incorporating LiDAR Data. Canadian Journal of Forest Research 40 184–199.
Fisher-Owens SA, Gansky SA, Platt LJ, Weintraub JA, Soobader M-J, Bramlett MD and Newacheck PW (2007). Influences on Children’s Oral Health: A Conceptual Model. Pediatrics 120 e510–e520.
Franke R (1982). Scattered Data Interpolation: Tests of Some Methods. Mathematics of computation 38 181–200.
Ginnis J, Zandoná AGF, Slade GD, Cantrell J, Antonio ME, Pahel BT, Meyer BD, Shrestha P, Simancas-Pallares MA, Joshi AR et al. (2019). Measurement of Early Childhood Oral Health for Research Purposes: Dental Caries Experience and Developmental Defects of the Enamel in the Primary Dentition. In Odontogenesis 511–523. Springer.
Gower JC (1971). A General Coefficient of Similarity and Some of Its Properties. Biometrics 857–871.
Gu Y, Preisser JS, Zeng D, Shrestha P, Shah M, Simancas-Pallares MA, Ginnis J and Divaris K (2021). Supplement to “Partitioning Around Medoids Clustering and Random Forest Classification for GIS-Informed Imputation of Fluoride Concentration Data”.
Ha D, Spencer A, Peres K, Rugg-Gunn A, SCOTT J and DO L (2019). Fluoridated Water Modifies the Effect of Breastfeeding on Dental Caries. Journal of dental research 98 755–762.
Hastie T, Tibshirani R, Sherlock G, Eisen M, Brown P and Botstein D (1999). Imputing Missing Data for Gene Expression Arrays. Stanford University Statistics Department Technical report.
Hennig C and Liao TF (2013). How to Find An Appropriate Clustering for Mixed-Type Variables with Application to Socio-Economic Stratification. Journal of the Royal Statistical Society: Series C (Applied Statistics) 62 309–369.
Iheozor-Ejiofor Z, Worthington HV, Walsh T, O’Malley L, Clarkson JE, Macey R, Alam R, Tugwell P, Welch V and Glenny A-M (2015). Water Fluoridation for the Prevention of Dental Caries. Cochrane Database of Systematic Reviews 6.
Johnston K, Ver Hoef JM, Krivoruchko K and Lucas N (2001). Using ArcGIS Geostatistical Analyst 380. Esri Redlands.
Kaufman L and Rousseeuw PJ (2009). Finding Groups in Data: An Introduction to Cluster Analysis 344. John Wiley & Sons.
Lam NS-N (1983). Spatial Interpolation Methods: A Review. The American Cartographer 10 129–150.
Liaw A, Wiener M et al. (2002). Classification and Regression by randomForest. R news 2 18–22.
Maechler M, Rousseeuw P, Struyf A, Hubert M and Hornik K (2019). cluster: Cluster Analysis Basics and Extensions. R Package Version 2.1. 0. 2019.
Mitas L and Mitasova H (1999). Spatial Interpolation. In Geographical Information Systems: Principles, Techniques, Management and Applications, (Longley PA, Goodchild MF, Maguire DJ and Rhind DWE, eds.) 1 34, 481–492. Wiley.
Moran PA (1950). Notes on Continuous Stochastic Phenomena. Biometrika 37 17–23.
RColorBrewer S and Liaw MA (2018). Package ‘randomForest’. University of California, Berkeley: Berkeley, CA, USA.
Rousseeuw PJ (1987). Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of computational and applied mathematics 20 53–65.
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc. [Google Scholar]
Selwitz RH, Ismail AI and Pitts NB (2007). Dental Caries. The Lancet 369 51–59. [DOI] [PubMed] [Google Scholar]
Shaffer J, Wang X, Feingold E, Lee M, Begum F, Weeks D, Cuenco K, Barmada M, Wendell S, Crosslin D et al. (2011). Genome-Wide Association Scan for Childhood Caries Implicates Novel Genes. Journal of dental research 90 1457–1462. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shepard D (1968). A Two-Dimensional Interpolation Function for Irregularly-Spaced Data. In Proceedings of the 1968 23rd ACM national conference 517–524. ACM. [Google Scholar]
Su L, Tom BD and Farewell VT (2009). Bias in 2-Part Mixed Models for Longitudinal Semicontinuous Data. Biostatistics 10 374–389. [DOI] [PMC free article] [PubMed] [Google Scholar]
Therneau T and Atkinson B (2008). rpart: Recursive Partitioning. R port by Brian Ripley. R package version 3–1. [Google Scholar]
van de Velden M, Iodice D’Enza A and Markos A (2019). Distance-Based Clustering of Mixed Data. Wiley Interdisciplinary Reviews: Computational Statistics 11 e1456. [Google Scholar]

Associated Data

Supplementary Materials

example code

NIHMS1731052-supplement-example_code.html (636.2KB, html)

