Skip to main content
Food Safety logoLink to Food Safety
. 2020 Dec 25;8(4):67–89. doi: 10.14252/foodsafetyfscj.D-20-00007

Attempt of Bayesian Estimation from Left-censored Data Using the Markov Chain Monte Carlo Method: Exploring Cr(VI) Concentrations in Mineral Water Products

Yoshinari Suzuki 1, Noriko Tanaka 2, Hiroshi Akiyama 1
PMCID: PMC7765759  PMID: 33409115

Abstract

Hexavalent chromium (Cr(VI)) is toxic, carcinogenic, and mutagenic substances. Oral exposure to Cr(VI) is thought to be primarily from drinking water. However, under the certain reporting limit (~0.1 µg/L), percentage of Cr(VI) concentration in mineral water products under the reporting limit were estimated higher than 50%. Data whose values are below certain limits and thus cannot be accurately determined are known as left-censored. The high censored percentage leads to estimation of Cr(VI) exposure uncertain. It is well known that conventional substitution method often used in food analytical science cause severe bias. To estimate appropriate summary statistics on Cr(VI) concentration in mineral water products, parameter estimation using the Markov chain Monte Carlo (MCMC) method under assumption of a lognormal distribution was performed. Stan, a probabilistic programming language, was used for MCMC. We evaluated the accuracy, coverage probability, and reliability of estimates with MCMC by comparison with other estimation methods (discard nondetects, substituting half of reporting limit, Kaplan-Meier, regression on order statistics, and maximum likelihood estimation) using 1000 randomly generated data subsets (n = 150) with the obtained parameters. The evaluation shows that MCMC is the best estimation method in this context with greater accuracy, coverage probability, and reliability over a censored percentage of 10-90%. The mean concentration, which was estimated with MCMC, was 0.289×10−3 mg/L and this value was sufficiently lower than the regulated value of 0.05 mg/L stipulated by the Food Sanitation Act.

Key words: nondetects, left-censored data, Bayesian model, MCMC, Stan, Cr(VI)

1. Introduction

Chromium (Cr) is a metal widely distributed in the environment. Chromium has various oxidation states from −2 to +6, but the most predominant forms are +3 and +61). Trivalent Cr (Cr(III)) is an essential nutrient for cholesterol, fat and glucose metabolism in human bodies, while hexavalent Cr (Cr(VI)) is toxic, carcinogenic, mutagenic, and mobile in nature2). Continuous exposure to Cr(VI) may cause pulmonary congestions, liver damage, skin irritation, and kidney failure3). Once Cr(VI) enters inside the cell, it undergoes a rapid metabolic reduction and is converted ultimately to Cr(III)4). Thus, the Food Safety Commission of Japan estimated daily intake of Cr(VI) from consumption of mineral water and tap water5). According to the reports on Cr(VI) concentration in mineral water (MW) products6,7,8), detected concentrations ranged from 0.1 to 3.4 µg/L, and the data contained many non-detected values with censored percentage from 50 to 75% under the certain reporting limit (~0.1 µg/L). When nondetects are present in data, they lead to difficulties in computing statistical metrics. Estimates from the data with nondetects could vary widely depends upon not only the censored proportion due to the properties of the measured substance and the performance of the analytical instrument, but also the performance and the assumption of the statistical analysis methods. In other words, estimation of Cr(VI) exposure via MW products becomes uncertain.

Since foods are major exposure source to chemicals, assessment of chemical exposure via foods is important for health protection. Non-detected values often occur for chemical concentrations in foods because the analytical methods for chemicals in foods always have the limit of detection (LOD) and the limit of quantification (LOQ)9,10,11,12,13). Data lacking certain values which are lower ​​than LOD and LOQ, are known as left-censored. In the field of food analytical science, many studies used substitution (010,11), LOD/210,13), or LOD9,12)) for left-censored data on the exposure estimation. The estimation of chemical exposure via foods is likely to be evaluated by the conservative method using substitution of LOD or LOQ for left-censored data on the safety side. Therefore, such conservative assessment using those methods may cause a severe overestimation that is meaningless and impractical from the viewpoint of health risk and food distribution. False concerns and excessive regulations lead to strict rules onto the society and the food community. In 2006, EPA guidance supported the use of 0, LOD/2, or LOD substitution in data with less than 15% nondetects14). However, in 2015 the EPA revised its guidelines stating that substitution by LOD/2 should be used only if the percentage of censored data is less than 5% and if the data are mildly skewed15). Helsel16) advocates that academic journals should consider substitution as flawed and reject the papers that such methods are implemented. Helsel recommended three methods—Kaplan-Meier (KM), maximum likelihood estimation (MLE), and regression on order statistics (ROS)—as more accurate alternatives for computing statistics on data with nondetects17,18). Helsel18) gave recommendations of usage of KM, MLE, and ROS methods as follows: 1: KM should be used for data with censored percentage of <50%; 2: ROS should be used for small sample size (<50) and 50-80% censored percentage; 3: MLE should be used for large sample size (≥50) and 50-80% censored percentage. It can be expected that the MLE shows good performance for large sample size, however it is also reported that MLE shows poor performance for high-skewed data19). On the contrary, it is well known that incorporating prior information in Bayesian framework result in a reduced sample size, and Bayesian method allows for great flexibility in dealing with missing data20). Thus, we expect that Bayesian framework may give a unified solution for different situations of left-censored data.

In singular models with many hidden parameters, such as multinomial mixture model, change-point model, neural networks, hidden Markov models, and so on, it was reported that Bayesian estimation make the generalization error smaller21). Since left-censored data have many hidden values as nondetects, Bayesian estimation is expected to provide appropriate estimates. Thus, the estimation method based on Bayesian modeling using the Markov chain Monte Carlo (MCMC) method has recently been applied in left-censored data. To compute Bayesian models, probabilistic programing languages such as BUGS and JAGS have been used. For example, Bayesian modeling using BUGS and JAGS has been applied to left-censored data for microbiological contamination data22), hexabromocyclododecane diastereomer compositions in water23), fluoride in drinking water24), and so on.

While there are various reports on comparisons among non-Bayesian methods (substitution, non-parametric, semi-parametric, and parametric) in the context of left-censored data analysis25,26,27,28,29,30), the reports comparing the performance of Bayesian estimation with other estimation methods are limited. Huynh et al31) evaluated the estimation performance of a Bayesian method and substitution by simulation with real data in the specific area except for food safety science. Nie et al32), Hady and Rain33), and Feroze and Aslam34) also evaluated by comparison of Bayesian and likelihood-based estimation for left-censored data using simulations. Although Nie el al32) adopted Bayesian approach to the real data in the specific area except for food safety science, Hady and Rain33), and Feroze and Aslam34) performed a simulation study without real data. They revealed that Bayesian method resulted in less bias. Although Bayesian estimation has been reported to have satisfactory coverage probability, as well as less bias, the comparable performance and relative reliability (i.e., the stability of results given alternative datasets) of the Bayesian method approach in food safety science have been never evaluated. A method with less reliability may fail to fit additional data or predict future observations. We consider that it is important to validate the Bayesian method approach for the standardization in food safety science.

In this study, in order to elaborate exposure assessment, we applied Bayesian modeling using MCMC, in addition to non-Bayesian methods, to obtain less biased summary statistics on Cr(VI) concentrations in MW products from Kataoka et al (2017)8), which has over 50% nondetects and no specified criterion concerning nondetects. Since the true values of probability density distribution parameter are unknown, we have conducted a simulation study. In addition, we validated the proposed MCMC method on the left-censored data analysis by comparison of several estimation methods to explore the accuracy, coverage probability, and reliability of the proposed MCMC method in food safety science. The process was divided into two stages as follows: 1: Estimate the concentration distribution from the original data using MCMC and other methods; 2: Evaluate MCMC performance compared with other statistical methods from simulation study. Finally, this study can contribute to the reliability of food safety evaluation by indicating the possibility of evaluating more accurate concentration in a similar situation to Cr(VI).

2. Methods

2.1 Data

Data on Cr(VI) concentrations in MW products reported by Kataoka et al8) were used. One hundred and fifty MW products were purchased in 2016. Among them, 110 products are domestic, and 40 products are imported. The number of purchases was assigned according to the number of products for each prefecture and country in the Japanese market. The lower LOQ was calculated and data below LOQ are marked as “Tr.”. However, no description about the value of LOQ can be found in the text. Although summary statistics for the observed data are reported in the paper, appropriate analysis for data containing nondetects has not been performed. Herein, unless otherwise specified, the censoring criterion is described as reporting limit (RL).

2.2 Inferential Analysis

Statistical analysis was performed using R (ver. 3.4.0). R packages EnvStat (ver. 2.3.1) and rstan (ver. 2.16.2) were used to estimate parameters from left-censored data.

2.3 Parameter Estimation

To compare parameters estimated from left-censored data with MCMC, we used 5 other methods (1: Discard nondetects (DN); 2: Substitution of one half of RL (RL/2); 3: KM; 4: MLE; 5: ROS). ROS is a method corresponding to a normal or lognormal distribution. We performed statistical analysis under assumption that the original data follows lognormal distribution (see Supporting Information S4. for further detail). KM, MLE, and ROS methods were implemented using the EnvStats package. The minimum value of detected data was used as a censoring point for convenience. Then a location parameter (geometric mean, m^), a shape parameter (geometric standard deviation, s^), a mean (μ^), a standard deviation (σ^), and lower (LCI, L^) and upper (UCI, U^) 95% confidence intervals of mean Cr(VI) concentration were estimated for data reported by Kataoka et al8) using the 6 methods.

2.3.1 Discard nondetects

Entries with non-detected values are eliminated and only detected values are used for further calculation. This approach is attractive because of its simplicity but the data may be distorted.

2.3.2 Substitution of one half of reporting limit

Non-detected values are replaced with one half of RL and both detected and replaced values are used for further calculation. Substitution is still widely used in various fields because of its tractability, and it is perhaps acceptable under certain conditions.

2.3.3 Kaplan-Meier

Kaplan-Meier is a nonparametric estimator since it does not make distributional assumptions. More specifically, KM is a non-parametric version of maximum likelihood estimation, and this method estimates the percentiles, or cumulative distribution function (CDF), for the data, where the mean equals the area beneath the CDF35). It is widely used in survival analysis to estimate survival functions, which are then used to estimate different summary statistics35) (see Supporting Information S6.1 for the detailed algorithm). The KM method can provide useful estimates when sample sizes are small; however, it does not perform well if more than 50% of data are nondetects or if fewer than eight detections are available for evaluation18).

2.3.4 Regression on order statistics

Regression on order statistics is a semi-parametric, simple imputation method that transforms nondetects on the basis of a probability plot of detects17,18,36,37). ROS is also known as “Imputation Using Quantile-Quantile Regression” (see Supporting Information S6.2 for the detailed algorithm). ROS performs better than MLE and some substitution methods when the sample size is small (less than 50) and where data do not fit a distribution38).

2.3.5 Maximum likelihood estimation

Maximum likelihood estimation solves a likelihood equation to estimate the parameter(s) using both detected observations and the proportion of data falling below RL17). The observed data (x) enter the likelihood function through the probability density function (PDF, f(x|m, s)) of lognormal distribution and the censored observations can be accounted for by the CDF (F(RL|m,s) = P(x ≤ RL|m,s)) of lognormal distribution as follows:

L(m,s|x1,x2,,xn)=xDf(x|m,s)xCF(RL|m,s) ,  Eq.1

where D is the set of all observed values and C is the set of all left-censored values. MLE finds the parameter values (m, s) that maximize the likelihood function against the observations (see Supporting Information S6.3 for the detailed algorithm).

2.3.6 Markov chain Monte Carlo

Among probabilistic programing languages, Stan which was developed by Gelman et al39) was used in this study. A key feature of Stan is that NUTS (No-U-Turn Sampler), which is an implementation of Hamiltonian Monte Carlo, one of the algorithms of the MCMC method, is adopted. Various Bayesian modeling sources using Stan have been published and documentation is abundantly available online40).

In addition to two parameters of distribution (GM and GSD), we estimated RL with using MCMC (see Supporting Information S1. for furter details). Four parallel Markov chains are calculated using MCMC to check convergence. The numbers of iterations, warm-up, and thinning were set to 2×103, 1×103, and 2, respectively (see Supporting Information S2. for further details). For parameters estimated by MCMC, we used an expected a posteriori (EAP), which is mean of the posterior predictive distribution, as a point estimate. The reason is that an EAP reflect the information of entire distribution, while maximum a posteriori can be significantly affected by a small number of outliers. Then EAP was used for further calculation. After execution of the Stan model, we calculated the widely applicable information criterion (WAIC)21) (see Supporting Information S3. for further details).

2.4 Monte Carlo Simulations

We generated 105 random numbers according to the lognormal distribution with the parameters estimated in section 2.3. Using MCMC, estimated parameter(s) can be obtained as posterior predictive distribution(s). Since we obtained a set of 2×103 estimates for each parameter, 50 random numbers are generated against posterior predictive distributions, and a total of 105 random numbers are obtained (see line 42 of Fig. S1).

2.5 Simulation and Comparison of Various Methods

To explore the accuracy, coverage probability, and reliability of parameter estimation from left-censored data by MCMC, simulation and comparison with other methods (DN, RL/2, KM, ROS, and MLE) was carried out. To rank the 6 methods for estimating left-censored data, simulations were performed with the following algorithm:

  • 1. Generate a random number of size 150 according to a lognormal distribution with the estimated parameters in section 2.3.6 as true values.

  • 2. Simulate censoring by using a particular theoretical percentile of the sample data as an RL.

  • 3. Execute the various methods to estimate parameters.

  • 4. Repeat steps 1-3 1000 times.

  • 5. Using the true values, compute some criteria.

When simulating censoring, we used theoretical percentiles of simulated data. For example, to simulate data with 40% nondetects, we considered any value falling below the theoretical 40th percentile of the data as nondetects. We considered the minimum value of detected data as RL. The simulations cover multiple censored percentages from 10 to 90%. Then 6 parameters (m^ , s^ , μ^ , σ^ , L^, , and U^) were estimated using the 6 methods.

To evaluate accuracy, bias and mean squared error (MSE) were calculated. The bias of an estimator θ^, for estimating parameter θ, is defined as follows:

B(θ^)=E(θ^θ) .  Eq.2

MSE of an estimator θ^ is defined as follows:

MSE(θ^)=E[(θ^θ)2] .  Eq.3

We calculated the coverage probability of 95% CIs for mean (CP), which indicates the fraction of computed CIs that include the desired but unobservable parameter value. The 95% CIs for means estimated by DN and RL/2 were calculated using Cox’s method41) (see Supporting Information S5. for further details). The 95% CIs for means estimated by KM, ROS, and MLE were calculated by the bootstrap with bias-corrected and accelerated method (with 1000 repetitions). For the MCMC method, 95% credible intervals (CrI) for the mean were calculated (see line 43 of Fig. S1). Then the CP was evaluated by calculating whether the expected value was included within the 95% CI or 95% CrI (see Supporting Information S7. for further information of CI and CrI).

To evaluate reliability, which indicates the stability of results given alternative data, we calculated the relative interquartile range (RIQR) and relative standard deviation (RSD) for estimated parameters of GM and GSD from 1000 different datasets.

3. Results

3.1 Method Inter-comparison for Cr(VI) concentrations with Nondetects

To validate the MCMC results, we conducted a method inter-comparison. Table 1 shows summary statistics and estimated parameters from left-censored data by various methods. The following describes the results from each estimation method.

Table 1.  Summary statistics and estimated parameters from left-censored data based on six inferential methods.

Method Mean (×10−3) SD (×10−3) LCIb (×10−3) UCIb (×10−3) GM (×10−3) GSD
DN 0.472 0.433 0.380 0.587 0.348 2.18
RL/2 0.233 0.353 0.187 0.291 0.114 2.97
KM 0.261 0.338 0.216 0.305 0.160 2.69
ROS 0.257 0.343 0.208 0.398 0.148 2.78
MLE 0.266 0.846 0.201 0.342 0.080 4.72
MCMC 0.289a 1.06 a 0.194 0.460 0.082c 4.90c

a: Expected a posteriori of generated quantities (mean_est or sd_est) calculated by MCMC.

b: Lower and upper 95% confidence interval of the mean. For MCMC, 95% credible interval of posterior predictive distribution for generated quantities (mean_est).

c: Expected a posteriori of target parameters (GM and GSD).

3.1.1 Discard nondetects

Among the estimation methods, μ^ DN (0.472×10−3) and s^ DN(2.18) were highest and lowest, respectively. Given that nondetects are ignored, it seems reasonable to have such a result.

3.1.2 Substitution of one half of reporting limit

Although the 95% CI of μ^ RL/2 overlapped with that estimated by other methods except for DN, μ^ RL/2 (0.233×10−3) showed the minimum value among the estimation methods. Since the dataset used in this study does not satisfy the EPA’s proposed requirements for RL/215), we should not use the RL/2 method.

3.1.3 Kaplan-Meier

If there is only one RL, the estimation result is that μ^ KM is identical to a substitution of the RL for censored data18). Thus, it is reasonable that μ^ KM (0.261×10−3) was higher than μ^ RL/2 (0.233×10−3). The value of μ^ KM was similar to μ^ ROS (0.257×10−3) and μ^ MLE (0.266×10−3). The value of σ^ KM (0.338×10−3) was similar to σ^ RL/2 (0.353×10−3) and σ^ ROS (0.343×10−3). Since the data used herein have more than 50% nondetects, KM estimates may be less relevant.

3.1.4 Regression on order statistics

μ^ ROS (0.257×10−3) was relatively close to what was obtained using other methods except for DN. σ^ ROS (0.355×10−3) was lower than σ^ MLE (0.846×10−3) and σ^ MCMC (1.06×10−3).

3.1.5 Maximum likelihood estimation

μ^ MLE (0.266×10−3) was comparable to what was obtained using other methods except for DN. However, σ^ MLE (0.846×10−3) and s^ MLE (4.72) were higher compared to the other methods except for MCMC.

3.1.6 Markov chain Monte Carlo

The posterior predictive distributions of parameters (GM, GSD, and RL) and generated quantities (means and SD) are shown in Fig. 1. The value of μ^ MCMC (0.289×10−3) was slightly higher compared to KM, ROS, and MLE. m^ MCMC (0.082×10−3) showed the second lowest value among all other methods. σ^ MCMC (1.06×10−3) and s^ MCMC (4.94) were higher than corresponding values from all other methods.

Fig. 1.

Fig. 1.

 Violin plots for posterior predictive distribution of parameters (geometric mean (GM), geometric standard deviation (GSD), reporting limit (RL), mean, and standard deviation (SD)) about Cr(VI) concentration in mineral water products reported by Kataoka et al (2017) under assumption that original data follow a lognormal distribution. Violins, boxes, horizontal solid lines, and open circles indicate probability density, interquartile range, median, and arithmetic mean, respectively.

The 95% CrI for the mean by MCMC was wider than other methods. There are two possible reasons for this. The first is that, according to the Cox method, the larger GSD resulted in the wider CrI. Another reason is that the estimated parameters themselves have a distribution in MCMC, and the uncertainty was combined, resulting in wider CrI than other methods.

3.1.7 Distribution of Cr(VI) concentrations in mineral water products

To estimate the distribution of Cr(VI), a total of 105 Cr(VI) random numbers were generated (Fig. 2). Regardless of estimation method, Cr(VI) concentrations in MW products were sufficiently lower than the regulated value of 0.05 mg/L stipulated by the Food Sanitation Act42). The mean values were higher in the order of DN (0.472×10−3) > MCMC (0.289×10−3) > MLE (0.268×10−3) > KM (0.2602×10−3) > ROS (0.2599×10−3) > RL/2 (0.205×10−3). The 97.5th percentile values were higher in the order of MCMC (1.82×10−3) > MLE (1.69×10−3) > DN (1.61×10−3) > ROS (1.123×10−3) > KM (1.117×10−3) > RL/2 (0.96×10−3). Since the actual 97.5th percentile value is 1.38×10−3, it can be concluded that the methods explored herein express the original data reasonably well. The estimated mean values showed similar values except for DN, but the 97.5th percentile value was larger when MCMC was used.

Fig. 2.

Fig. 2.

 Histograms of generated random numbers for Cr(VI) concentrations in mineral water products following a lognormal distribution with estimated parameters by DN, RL/2, KM, ROS, MLE, and MCMC. Vertical solid, dashed, and dotted lines indicate regulated concentration (0.05 mg/L), 97.5th percentiles, and mean, respectively.

3.2 Comparison the Different Methods

Since the true values for parameters of probability density distribution are unknown, we conducted the simulation study to determine the most appropriate methods. We compared the performances of each method with using assessment criteria (bias, MSE, CP, RIQR, and RSD). Ideally, we want the methods being studied to have a bias as close to 0 as possible and an MSE as small as possible. In addition to high accuracy, we want to know that a particular method is appropriate in terms of estimating CI for means, and reliable when faced with alternative data sets.

3.2.1 Accuracy

The results of accuracy assessment under censored percentages from 10 to 90% are shown in Fig. 3. The DN method overestimates GM and underestimates GSD. This tendency became more pronounced as the censored percentage increases. We should refrain from calculating summary statistics after discarding nondetects. MCMC tended to overestimate the GSD slightly (<101% overestimated when the censored percentage was between 10 and 50), but other methods tended to underestimate the GSD.

Fig. 3.

Fig. 3.

 Bias and mean squared error (MSE) for GM and GSD estimated by DN, RL/2, KM, ROS, MLE, and MCMC from 1000 randomly generated left-censored data subsets (n = 150) which follow a lognormal distribution with certain parameters (GM = 0.082×10−3; GSD = 4.9) over a 10 to 90% censoring range. For MCMC, the expected a posteriori was used for further calculation.

Even when the censored percentage is small, s^ DN and s^ KM exhibited high MSE values. Where the censored percentage was between 10 and 50, the ROS, MLE, and MCMC methods exhibited similar MSE values for both GM and GSD. However, in terms of the ROS method, MSE increased from around 50% censoring upwards for both GM and GSD.

m^ MLE and m^ MCMC showed similar values in the censoring range of 10 to 90%, and these estimation methods exhibited better accuracy. The MSE of both m^ MCMC and s^ MCMC were lower compared to other estimation methods when the censored percentage was in the range of 10 to 90%. From these results, we concluded that MCMC showed the best accuracy to estimate distribution parameters from the original data.

3.2.2 Coverage probability

The results of CPs assessment under censored percentages from 10 to 90% are shown in Fig. 4. The CPs decreased sharply from 10% for DN and 60% for the KM method. In terms of the RL/2 method, the CP gradually decreased from 10%, and rapidly decreased around 70%. In ROS and MLE, the CP showed similar values (75 to 85%) depending on the censoring ratio. MCMC showed the most stable CPs, and the CPs were generally close to the target coverage of 0.95.

Fig. 4.

Fig. 4.

 Coverage probability of 95% CI or 95% CrI for mean estimated by DN, RL/2, KM, ROS, MLE, and MCMC from 1000 randomly generated left-censored data subsets (n = 150) which follow a lognormal distribution with certain parameters (GM = 0.082×10−3; GSD = 4.9) over a 10 to 90% censoring range.

3.2.3 Reliability

The violin plots for m^ and s^ at the censored ratio of 85/150 are shown in Fig. 5. The true values ​​were distributed within the IQR estimated by ROS, MLE, and MCMC, but not within those by DN, RL, and KM. Among these three estimation methods, MCMC showed the smallest variation in RIQR and RSD for both m^ and s^.

Fig. 5.

Fig. 5.

 Violin plots of GM and GSD estimated by DN, RL/2, KM, ROS, MLE, and MCMC from 1000 randomly generated left-censored data subsets (n = 150) which follow a lognormal distribution with certain parameters (GM = 0.082×10−3; GSD = 4.9) at a censoring ratio of 85/150. For MCMC, the expected a posteriori was used for further calculation. Horizontal dashed lines indicate true values. Violins, boxes, horizontal solid lines, and open circles indicate probability density, IQR, median, and arithmetic mean, respectively.

The results of reliability assessment under censored percentages from 10 to 90% are shown in Fig. 6. The variations for both m^ and s^ were generally in the order of RL/2 < DN < MCMC < MLE < ROS. When the censored rate was high, KM exhibited lower RSD and RIQR compared to MLE and ROS. Even when the censored percentage was low, KM exhibited higher variation in both RSD and RIQR compared to other methods.

Fig. 6.

Fig. 6.

 Relative standard deviation (RSD) and relative interquartile range (IQR) of GM and GSD estimated by DN, RL/2, KM, ROS, MLE, and MCMC from 1000 randomly generated left-censored data subsets (n = 150) which follow a lognormal distribution with certain parameters (GM = 0.082×10−3; GSD = 4.9) over a 10 to 90% censoring range. For MCMC, the expected a posteriori was used for further calculation.

Since DN, RL/2, and KM showed lower accuracy, when these three methods were excluded, the MCMC estimation exhibited lower variation in both RSD and RIQR compared to ROS and MLE at censored percentages from 10 to 90%. From these results, it was concluded that MCMC is the most reliable estimation method.

Although s^ MCMC showed reasonable reliability (the maximum RSD of s^ MCMC was 0.18 at 90% censoring), the variation in m^ MCMC was as high as RSD=0.37 and RIQR=0.44 at a censoring rate of 90%. A reliable estimation method for left-censored data with a high censoring rate is required.

4. Discussions

We have described the MCMC method, as well as the commonly used methods, for summarizing and analyzing left-censored Cr(V) concentration in MW products. We used a simulation study to demonstrate the properties of these methods. MCMC showed the most stable CPs with the highest accuracy, and the CPs were generally close to the target coverage of 0.95. These results are consistent with previous reports32,34). MLE showed the second-best performance next to MCMC. Since the original data is sufficient for MLE requirements (more than 50 observations and censoring ratio of 50–80%) recommended by Helsel18), the performance of MLE was superior to those of ROS and KM. Other studies have also reported that MLE has a smaller bias than ROS (or probability plotting method) and KM25,27). Although RL/2 showed the best reliability, RL/2 showed lower accuracy and coverage probability under censored percentage from 20 to 90%. The RL/2 estimation results are considered to have a serious bias. As reported previously, substitution is essentially creating data which does not have an empirical basis. Since our simulation results are consistent with previous reports, the simulation in this study is considered to be valid.

Although the reliability of Bayesian approaches on left-censored data is less understood, our study shows that MCMC performed with the smaller variation in estimates compared to KM, ROS, and MLE. Since MCMC showed better performance on accuracy, coverage probability, and reliability compared to other methods, estimates by MCMC for Cr(VI) concentration in MW products are the most appropriate estimates among the methods used in this study. The most appropriate estimates of mean and those of 97.5th percentile of Cr(VI) concentration, which were estimated with MCMC, were 0.289×10−3 and 1.82×10−3 mg/L, respectively. These estimates were sufficiently lower than the regulated value of 0.05 mg/L stipulated by the Food Sanitation Act42).

Huynh et al31) reported that the use of more informative priors generally improved the Bayesian method’s performance, making both the bias and the root MSE lower. The adoption of the appropriate prior distribution is sometimes difficult under the conditions when they are needed most, i.e. small sample sizes or high censored percentage. Since MCMC showed better performance on accuracy, coverage probability, and reliability, we have successfully adopted informative prior distribution to the original data. We consider that this is the main reason why the MCMC showed better performance compared to other methods. It is well known that MCMC can accommodate outliers by describing data via heavy-tailed distributions to the extent implied by the data43). Since lognormal distributions with GSD greater than 4 are considered to have heavy tails, the performance of MCMC was superior to that of the other methods. In this study, we used the EAPs as point estimator in MCMC. Other point estimators (maximum a posteriori and median of posterior predictive distribution) for Bayesian estimation might show different performances. These are other reasons why the MCMC is better than other methods.

To perform safe assessment of exposure to chemicals, substituting RL for nondetects may be a conservative and preferable method. However, when the dissociation between reality and assumption is large, such a conservative method may generate overestimates that are meaningless for health risk, leading to false concern and excessive regulation. Rather than imposing strict rules onto society and the food community, it is important to introduce new, scientifically sound methods and resolve the lack of transparency arising from inadequate evaluation. The use of prior distributions in Bayesian estimation can be considered as bridging a prior knowledge to MLE. Although it is necessary to carefully check whether the prior distribution has been adopted appropriately, utilizing knowledge of society and food community, as well as researchers, is able to help to reduce the gap between science-technology and society. Bayesian estimation provides the whole (posterior predictive) distribution for each parameter estimated, instead of one single-value estimate in MLE, thus it can provide estimates on the safe side with considering uncertainty. We expect that Bayesian estimation with incorporating consumer and regulatory science reduce this gap in food safety fields.

Since the original data are univariate, we have not extended the regression model in this study. However, it may be possible to estimate individual values ​​of nondetects using data covering signal intensity, production location, and other information. If sufficient information is available on the manufacturing process, storage method, factory, manufacturer, and so on, a hierarchical Bayesian model that considers these factors as random effects and incorporates them into the model could be considered (see Supporting Information S8. as example of comparison between Japanese and imported MW products). By virtue of such a complex analysis, it may be possible to evaluate manufacturing process risks such as contamination or changes in chemical species.

This study demonstrates that using the MCMC method provide the most appropriate estimation result for Cr(VI) concentration in MW products. On the other hands, as mentioned in many articles18,27,44), our simulation result indicate that the use of RL/2 (and DN method) should be avoided. Evaluation, prediction, and judgment based on scientific evidences are the basis of regulatory science and are important for risk assessment in food safety administration. MCMC, as well as other methods based on statistical validity such as MLE, is important in the field of food safety as an estimation method instead of conventional substitution method. On the other hand, it should be discussed carefully whether the MCMC estimation results are appropriate for other data. Evaluation of the validity for different probability density distributions, sample sizes, and parameters is necessary.

5. Conclusions

We demonstrate that parameter estimation from left-censored data with MCMC show better performance on accuracy, coverage probability, and reliability compared to other methods (DN, RL/2, KM, MLE, and ROS) for Cr(VI) concentration in MW products. The most appropriate estimate of mean concentration, which was estimated with MCMC, was 0.289×10−3 mg/L and this estimate was sufficiently lower than the regulated value of 0.05 mg/L stipulated by the Food Sanitation Act.

In the field of food safety, although chemical substances in food are rarely detected, there are many chemical substances that have to be periodically investigated for protecting public health. This study suggests that MCMC could be a powerful estimation method in exposure assessment contexts in food safety science. However, there is ample scope of future research for the standardization to explore this further for the model with various sample size, alternative parameters, and different distributions, as well as multiple RLs due to the performance difference of analytical instruments.

Supporting Information

Acknowledgments

This work was partially supported by a Health Sciences Research Grant (19KA1002 and 19KA2001) from the Ministry of Health, Labour, and Welfare in Japan.

Abbreviations: CDF: cumulative distribution function, CI: confidence interval, CP: coverage probability, CrI: credible interval, DN: discarding nondetects, EAP: expected a posteriori, GM: parameter of lognormal distribution, geometric mean, GSD: parameter of lognormal distribution, geometric standard deviation, IQR: interquartile range, KM: Kaplan-Meier, LCI: lower confidence interval, LOD: limit of detection, LOQ: limit of quantification, m^: estimate of geometric mean. The letters on the upper right indicate the estimation method, MCMC: Markov chain Monte Carlo, MLE: maximum likelihood estimation, MSE: mean of squared error, MW: mineral water, PDF: probability density function, RIQR: relative inter quartile range, RL: reporting limit, RL/2: substitution of half of RL, RSD: relative standard deviation, s^: estimate of geometric standard deviation. The letters on the upper right indicate the estimation method, SD: standard deviation, UCI: upper confidence interval, WAIC: widely applicable information criterion, μ^: estimate of expected value. The letters on the upper right indicate the estimation method, σ^: estimate of standard deviation. The letters on the upper right indicate the estimation method,

Footnotes

Conflict of interest: The authors declare no conflicts of interest.

References

  • 1.Marqués MJ,Salvador A,Morales-Rubio A,de la Guardia M. Chromium speciation in liquid matrices: a survey of the literature. Fresenius J Anal Chem. 2000; 367(7): 601–613. 10.1007/s002160000422 [DOI] [PubMed] [Google Scholar]
  • 2.Zhitkovich A. Chromium in drinking water: sources, metabolism, and cancer risks. Chem Res Toxicol. 2011; 24(10): 1617–1629. 10.1021/tx200251t [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Sun H,Brocato J,Costa M. Oral chromium exposure and toxicity. Curr Environ Health Rep. 2015; 2(3): 295–303. 10.1007/s40572-015-0054-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Zhitkovich A. Importance of chromium-DNA adducts in mutagenicity and toxicity of chromium(VI). Chem Res Toxicol. 2005; 18(1): 3–11. 10.1021/tx049774+ [DOI] [PubMed] [Google Scholar]
  • 5.Food Safety Commission of Japan. Hexavalent chromium (Contaminants). Food Safety. 2019; 7(2): 56–57. 10.14252/foodsafetyfscj.D-1900002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Catalani S,Fostinelli J,Gilberti ME,Apostoli P. Application of a metal free high performance liquid chromatography with inductively coupled plasma mass spectrometry (HPLC–ICP-MS) for the determination of chromium species in drinking and tap water. Int J Mass Spectrom. 2015; 387: 31–37. 10.1016/j.ijms.2015.06.015 [DOI] [Google Scholar]
  • 7.Marcinkowska M,Komorowicz I,Barałkiewicz D. New procedure for multielemental speciation analysis of five toxic species: As(III), As(V), Cr(VI), Sb(III) and Sb(V) in drinking water samples by advanced hyphenated technique HPLC/ICP-DRC-MS. Anal Chim Acta. 2016; 920: 102–111. 10.1016/j.aca.2016.03.039 [DOI] [PubMed] [Google Scholar]
  • 8.Kataoka Y,Watanabe T,Hayashi K,Ozawa R,Takizawa K,Akiyama H. Surveillance of chromium(VI) concentrations in mineral water products. Food Hyg. Saf. Sci. 2017; 58(6): 275–280. 10.3358/shokueishi.58.275 [DOI] [PubMed] [Google Scholar]
  • 9.Nabeshi H,Tsutsumi T,Uekusa Y,Hachisuka A,Matsuda R,Teshima R. Surveillance of strontium-90 in foods after the Fukushima Daiichi Nuclear Power Plant Accident. Food Hyg. Saf. Sci. 2015; 56(4): 133–143. 10.3358/shokueishi.56.133 [DOI] [PubMed] [Google Scholar]
  • 10.Tsutsumi T,Matsuda R,Yanagi T,et al. Dietary intake of dioxins in Japan in 2016 with time trends since 1998. Food Addit Contam Part A Chem Anal Control Expo Risk Assess. 2018; 35(8): 1553–1564. 10.1080/19440049.2018.1469052 [DOI] [PubMed] [Google Scholar]
  • 11.Uekusa Y,Takatsuki S,Tsutsumi T,et al. Determination of polychlorinated biphenyls in marine fish obtained from tsunami-stricken areas of Japan. PLoS One. 2017; 12(4): e0174961. 10.1371/journal.pone.0174961 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Cao P,Liu AD,Yang DJ,et al. Assessment of dietary exposure of young Chinese children to aluminium residues. Food Addit Contam Part A Chem Anal Control Expo Risk Assess. 2019; 36(4): 582–591. 10.1080/19440049.2019.1579369 [DOI] [PubMed] [Google Scholar]
  • 13.Oguri T,Yoshinaga J,Tao H,Nakazato T. Daily intake of inorganic arsenic and some organic arsenic species of Japanese subjects. Food Chem Toxicol. 2012; 50(8): 2663–2667. 10.1016/j.fct.2012.05.010 [DOI] [PubMed] [Google Scholar]
  • 14.US Environmental Protection Agency. Data quality assessment: Statistical methods for practitioners. EPA/240-B-06/003, Office of Environmental Information, Washington, DC, USA. 2006. [Google Scholar]
  • 15.US Environmental Protection Agency. ProUCL Version 5.1.002 Technical Guide, EPA/600/R-07/041, Office of Research and Development, Washington, DC. October. 2015. [Google Scholar]
  • 16.Helsel D. Much ado about next to nothing: incorporating nondetects in science. Ann Occup Hyg. 2010; 54(3): 257–262. [DOI] [PubMed] [Google Scholar]
  • 17.Helsel DR. More than obvious: better methods for interpreting nondetect data. Environ Sci Technol. 2005; 39(20): 419A–423A. 10.1021/es053368a [DOI] [PubMed] [Google Scholar]
  • 18.Helsel DR. Statistics for Censored Environmental Data Using Minitab® and R, 2nd edition. Hoboken, NJ: Wiley; 2012. [Google Scholar]
  • 19.Gilliom RJ,Helsel DR. Estimation of distributional parameters for censored trace level water quality data: 1. Estimation techniques. Water Resour. Res. 1986; 22(2): 135–146. 10.1029/WR022i002p00135 [DOI] [Google Scholar]
  • 20.Food and Drug Administration. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. U.S. Department of Health and Human Services, ed. 2010. [Google Scholar]
  • 21.Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res. 2010; 11: 3571–3594. [Google Scholar]
  • 22.Busschaert P,Geeraerd AH,Uyttendaele M,Van Impe JF. Hierarchical Bayesian analysis of censored microbiological contamination data for use in risk assessment and mitigation. Food Microbiol. 2011; 28(4): 712–719. 10.1016/j.fm.2010.06.006 [DOI] [PubMed] [Google Scholar]
  • 23.Ichihara M,Yamamoto A,Kakutani N,Sudo M,Takakura K. A Bayesian approach for estimating hexabromocyclododecane (HBCD) diastereomer compositions in water using data below limit of quantification. Environ Sci Pollut Res Int. 2017; 24(3): 2667–2674. 10.1007/s11356-016-7990-5 [DOI] [PubMed] [Google Scholar]
  • 24.Qian SS,Schulman A,Koplos J,Kotros A,Kellar P. A hierarchical modeling approach for estimating national distributions of chemicals in public drinking water systems. Environ Sci Technol. 2004; 38(4): 1176–1182. 10.1021/es020686q [DOI] [PubMed] [Google Scholar]
  • 25.She N. Analyzing censored water quality data using a non-parametric approach. J Am Water Resour As. 1997; 33(3): 615–624. 10.1111/j.1752-1688.1997.tb03536.x [DOI] [Google Scholar]
  • 26.Sinha P,Lambert MB,Trumbull VL. Evaluation of statistical methods for left-censored environmental data with nonuniform detection limits. Environ Toxicol Chem. 2006; 25(9): 2533–2540. 10.1897/05-548R.1 [DOI] [PubMed] [Google Scholar]
  • 27.LaFleur B,Lee W,Billhiemer D,Lockhart C,Liu J,Merchant N. Statistical methods for assays with limits of detection: Serum bile acid as a differentiator between patients with normal colons, adenomas, and colorectal cancer. J Carcinog. 2011; 10(1): 12. 10.4103/1477-3163.79681 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Huynh T,Ramachandran G,Banerjee S,et al. Comparison of methods for analyzing left-censored occupational exposure data. Ann Occup Hyg. 2014; 58(9): 1126–1142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Antweiler RC. Group comparisons evaluation of statistical treatments of left-censored environmental data using coincident uncensored data sets. II. Environ Sci Technol. 2015; 49(22): 13439–13446. 10.1021/acs.est.5b02385 [DOI] [PubMed] [Google Scholar]
  • 30.Shoari N,Dubé JS. An investigation of the impact of left-censored soil contamination data on the uncertainty of descriptive statistical parameters. Environ Toxicol Chem. 2016; 35(10): 2623–2631. 10.1002/etc.3420 [DOI] [PubMed] [Google Scholar]
  • 31.Huynh T,Quick H,Ramachandran G,et al. A comparison of the β-Substitution method and a Bayesian method for analyzing left-censored data. Ann Occup Hyg. 2016; 60(1): 56–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Nie L,Chu H,Feng S. Estimating variance parameters from multivariate normal variables subject to limit of detection: MLE, REML, or Bayesian approaches? Stat Med. 2009; 28(20): 2605–2616. 10.1002/sim.3644 [DOI] [PubMed] [Google Scholar]
  • 33.Hady DHA, Rania MS. Bayesian and non bayesian parameter estimation for bivariate pareto distribution based on censored samples. Int J Eng Res Appl. 2015; 5(10): 01-06.
  • 34.Feroze N,Aslam M. Comparison of Bayesian and maximum likelihood estimation for analysis of burr type V distribution under left censored samples. J Stat. 2015; 22(1): 116–192. [Google Scholar]
  • 35.Klein JP,Moeschberger ML Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer; 2003. [Google Scholar]
  • 36.Cohen AC. Simplified estimators for the normal distribution when samples are singly censored or truncated. Technometrics. 1959; 1(3): 217–237. 10.1080/00401706.1959.10489859 [DOI] [Google Scholar]
  • 37.Helsel DR,Cohn TA. Estimation of descriptive statistics for multiply censored water quality data. Water Resour. Res. 1988; 24(12): 1997–2004. 10.1029/WR024i012p01997 [DOI] [Google Scholar]
  • 38.Shumway RH,Azari RS,Kayhanian M. Statistical approaches to estimating mean water quality concentrations with detection limits. Environ Sci Technol. 2002; 36(15): 3345–3353. 10.1021/es0111129 [DOI] [PubMed] [Google Scholar]
  • 39.Gelman A,Lee D,Guo J Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics. 2015; 40(5): 530–543. 10.3102/1076998615606113 [DOI] [Google Scholar]
  • 40.Carpenter B,Gelman A,Hoffman MD,et al. Stan: A probabilistic programming language. Journal of Statistical Software. 2017; 76(1): 1–29. 10.18637/jss.v076.i01 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Land CE. An evaluation of approximate confidence interval estimation methods for lognormal means. Technometrics. 1972; 14(1): 145–158. 10.1080/00401706.1972.10488891 [DOI] [Google Scholar]
  • 42.Ministry of Health, Labour and Welfare. Partial revision of ministerial ordinance concerning ingredient specifications of milk and dairy products (Notification No. 106) and standards of foods and additives (Notification No. 302) [In Japanese]. https://www.mhlw.go.jp/web/t_doc?dataId=00tc3580&dataType=1&pageNo=1. Published on August 2018. Accessed on January 17, 2020.
  • 43.Kruschke JK. Bayesian estimation supersedes the t test. J Exp Psychol Gen. 2013; 142(2): 573–603. 10.1037/a0029146 [DOI] [PubMed] [Google Scholar]
  • 44.Hewett P,Ganser GH. A comparison of several methods for analyzing censored data. Ann Occup Hyg. 2007; 51(7): 611–632. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials


Articles from Food Safety are provided here courtesy of Food Safety Commission of Japan

RESOURCES