G3: Genes|Genomes|Genetics. 2016 Feb 25;6(5):1165–1177. doi: 10.1534/g3.116.028118

Genomic Bayesian Prediction Model for Count Data with Genotype × Environment Interaction

Abelardo Montesinos-López *, Osval A Montesinos-López , José Crossa †,1, Juan Burgueño , Kent M Eskridge , Esteban Falconi-Castillo §, Xinyao He , Pawan Singh , Karen Cichy **
PMCID: PMC4856070  PMID: 26921298

Abstract

Genomic tools allow the study of the whole genome, and facilitate the study of genotype-environment combinations and their relationship with phenotype. However, most genomic prediction models developed so far are appropriate for Gaussian phenotypes. For this reason, appropriate genomic prediction models are needed for count data, since the conventional regression models used on count data with a large sample size (nT) and a small number of parameters (p) cannot be used for genomic-enabled prediction where the number of parameters (p) is larger than the sample size (nT). Here, we propose a Bayesian mixed-negative binomial (BMNB) genomic regression model for counts that takes into account genotype by environment (G×E) interaction. We also provide all the full conditional distributions to implement a Gibbs sampler. We evaluated the proposed model using a simulated data set, and a real wheat data set from the International Maize and Wheat Improvement Center (CIMMYT) and collaborators. Results indicate that our BMNB model provides a viable option for analyzing count data.

Keywords: Bayesian model, count data, genome enabled prediction, Gibbs sampler, GenPred, shared data resource, genomic selection


In most living organisms, phenotype is the result of genotype (G), environment (E) and genotype by environment interactions (G×E). Garrod (1902) observed that the effect of genes on phenotype could be modified by the environment (E). Similarly, Turesson (1922) demonstrated that the development of a plant is often influenced by its surroundings. He postulated the existence of a close relationship between crop plant varieties and their environment, and stressed that the presence of a particular variety in a given locality is not a chance occurrence; rather, there is a genetic component that helps the individual adapt to that area.

For these reasons, today the consensus is that G×E is useful for understanding genetic heterogeneity under different environmental exposures (Kraft et al. 2007; Van Os and Rutten 2009), and for identifying high-risk or productive subgroups in a population (Murcray et al. 2009); it also provides insight into the biological mechanisms of complex traits such as disease resistance and yield (Thomas 2011), and improves the ability to discover resistance genes that interact with other factors that have few marginal effects (Thomas 2011). However, finding significant G×E interactions is challenging. Model misspecification, inconsistent definition of environmental variables, and insufficient sample sizes are just a few of the issues that often lead to low-power and nonreproducible findings in G×E studies (Jiao et al. 2013; Winham and Biernacka 2013).

Genomics and its breeding applications are developing very quickly with the goal of predicting yet-to-be observed phenotypes, or unobserved genetic values for complex traits, and inferring the underlying genetic architecture utilizing large collections of markers (Goddard and Hayes 2009; Zhang et al. 2014). Genomics is also useful when dealing with complex traits that are multigenic in nature and have a major environmental influence (Pérez-de-Castro et al. 2012). For these reasons, the use of whole-genome prediction models continues to increase. In genomic prediction, all marker effects are fitted simultaneously in a single model, and simulation studies support the use of this methodology to increase genetic progress in less time. For continuous phenotypes, models have been developed to regress phenotypes on all available markers using a linear model (Goddard and Hayes 2009; de los Campos et al. 2013). However, in plant breeding, the response variable for many traits is a count (y = 0, 1, 2, …), for example, the number of panicles per plant, the number of seeds per panicle, or the weed count per plot. Count data are discrete, non-negative, integer-valued, and typically have right-skewed distributions (Yaacob et al. 2010).

Poisson and negative binomial regression are often used to deal with count data. These models have a number of advantages over an ordinary linear regression model, including the ability to handle a skewed, discrete distribution (0, 1, 2, 3, …) and the restriction of predicted phenotypic values to non-negative numbers (Yaacob et al. 2010). These models differ from an ordinary linear regression model in two ways. First, they do not assume that counts follow a normal distribution. Second, rather than modeling y as a linear function of the regression coefficients, they model a function of the response mean as a linear function of the coefficients (Cameron and Trivedi 1986). Regression models for counts are usually nonlinear and have to take into consideration the specific properties of counts, including discreteness and non-negativity, and counts are often characterized by overdispersion (variance greater than the mean) (Zhou et al. 2012).

However, in the context of genomic selection, it is still common practice to apply linear regression models to these data, or to transformed data (Montesinos-López et al. 2015a, 2015b). This ignores the facts that: (a) many distributions of count data are positively skewed, many observations in the data set have a value of 0, and the high number of 0s does not allow a skewed distribution to be transformed into a normal one (Yaacob et al. 2010); and (b) it is quite likely that the regression model will produce negative predicted values, which are theoretically impossible (Yaacob et al. 2010; Stroup 2015). When a transformation is used, it is not always possible to obtain normally distributed data, and transformations often not only do not help but are counterproductive. There is also mounting evidence that transformations do more harm than good for the models required by the vast majority of contemporary plant and soil science researchers (Stroup 2015). To the best of our knowledge, only the model of Montesinos-López et al. (2015c) is appropriate for genomic prediction of count data in a Bayesian framework; however, it does not take into account G×E interaction.

In this paper, we extend the negative binomial (NB) regression model for counts proposed by Montesinos-López et al. (2015c) to take into account G×E by using a data augmentation approach. A Gibbs sampler was derived since all full conditional distributions were obtained, which allows samples to be drawn from them to estimate the required parameters. In addition, we provide all the details of the efficient derived Gibbs sampler so that it can be implemented easily by most plant and animal scientists. We illustrate our proposed methods with a simulated data set and a real data set on wheat Fusarium head blight. We compare our proposed models (NB and Poisson) with the Normal and Log-Normal models commonly implemented for analyzing count data. We also provide R code for implementing the proposed models.

Materials and Methods

The data used in this study were taken from a PhD thesis (Falconi-Castillo 2014) aimed at identifying sources of resistance to Fusarium head blight (FHB), caused by Fusarium graminearum, and at identifying genomic regions and molecular markers linked to FHB resistance through association analysis.

Experimental data

Phenotypic data:

A total of 297 spring wheat lines developed by the International Maize and Wheat Improvement Center (CIMMYT) was assembled and evaluated for resistance to F. graminearum. Phenotyping was done at CIMMYT’s El Batan experimental station in Mexico over two years (2012 and 2014), and at the Santa Catalina Experimental Station of the National Institute for Agricultural Research (INIAP), Ecuador, for one year (2014). For the application, we considered these three environments, which we named Batan 2012, Batan 2014, and Ecuador 2014. In all the experiments (environments), the genotypes were arranged in a randomized complete block design, in which each plot comprised two 1-m double rows separated by a 0.25 m space. In Ecuador 2014, the nursery was inoculated with maize seeds infected with a local F. graminearum isolate (SC01). The inoculum was broadcast in the field at 3 and 2 wk before anthesis, at a rate of 50 g/m2.

FHB severity data were collected shortly before maturity by counting symptomatic spikelets on 10 randomly selected spikes in each plot. In Mexico, plots were inoculated with a mixture of five F. graminearum isolates (CIMFU235, 702, 715, 720, and 770) at each line’s flowering period by spraying 30 ml of an F. graminearum macroconidial suspension (50,000 spores/ml) using a CO2-powered backpack sprayer (model T R&D Sprayers, Opelousas, LA) calibrated to 40 psi. High humidity was maintained in the field by a mist irrigation system controlled by a programmable timer that applied 10 min of spray every hour from 9:00 to 20:00. FHB severity data were collected at 25 days after inoculation by counting spikelets showing FHB symptoms on 10 spikes that had been tagged at anthesis. In this study, we used only 182 spring wheat lines because we had complete marker information only for those lines.

Genotypic data:

DNA samples were extracted from young leaves (2- to 3-wk old) taken from each line using the Wizard Genomic DNA purification kit (Promega), following the manufacturer's protocol. DNA samples were genotyped using an Illumina 9K SNP chip with 8632 SNPs (Cavanagh et al. 2013). For a given marker, the genotype of the ith line was coded as the number of copies of a designated marker-specific allele carried by that line (absence coded as zero and presence as one). SNP markers with the unexpected genotype AB (heterozygous) were recoded as either AA or BB, based on the graphical interface visualization tool of the GenomeStudio (Illumina) software. SNP markers that did not show clear clustering patterns were excluded. In addition, 66 simple sequence repeat (SSR) markers were screened. After filtering the markers for a minor allele frequency (MAF) of 0.05 and deleting markers with more than 10% missing calls, the final set comprised 1635 SNPs.

Data and software availability

The phenotypic (FHB) and genotypic (marker) data used in this study, as well as basic R codes (R Core Team 2015), for fitting the models can be downloaded directly from the repository at http://hdl.handle.net/11529/10575.

Statistical models

We assume that, in each environment, the J genotypes were grown in a randomized complete block design. We let $y_{ijkt}$ represent the count response for the tth replication of the jth line in the kth block in the ith environment, with $i = 1, \ldots, I$; $j = 1, \ldots, J$; $k = 1, \ldots, K$; $t = 1, \ldots, n_{ijk}$, and we propose the following combined linear predictor for the response variable:

$\eta_{ijk} = E_i + R(E)_{ik} + g_j + gE_{ij}$ (1)

where $E_i$ represents the effect of environment i, $R(E)_{ik}$ represents the effect of block k within environment i, $g_j$ is the marker effect of genotype j, and $gE_{ij}$ is the interaction between markers and the environment. Here I = 3, since we have three environments (Batan 2012, Batan 2014, and Ecuador 2014); J = 182, the number of lines under study; K = 2, since only two blocks are available per environment; and $n_{ijk}$ represents the number of replicates of each line in each block and environment, which was the same ($n_{ijk} = n$) for all combinations of i, j, and k (n = 10, since 10 spikes were selected at random from each plot). The number of observations in each environment i is $n_i = JKn$, while the total number of observations is $n_T = IJKn$; IJ is the product of the number of environments and the number of lines. Four models were implemented using the linear predictor given in expression (1).

Model NB:

Model NB stands for the negative binomial model and is defined by three distributions: $y_{ijkt}|g_j, gE_{ij} \sim NB(\mu_{ijk}, r)$, with r being the scale parameter and $\mu_{ijk} = \exp(\eta_{ijk})$; $g = (g_1, \ldots, g_J)^T \sim N(0, G_1\sigma_g^2)$; and $gE = (gE_{11}, \ldots, gE_{IJ})^T \sim N(0, G_2\sigma_{gE}^2)$. Note that the NB distribution has expected value $E(y_{ijkt}|g_j, gE_{ij}) = \mu_{ijk}$ and variance $Var(y_{ijkt}|g_j, gE_{ij}) = \mu_{ijk} + \mu_{ijk}^2/r$, so that $Var(y_{ijkt}|g_j, gE_{ij}) > E(y_{ijkt}|g_j, gE_{ij})$ for $r > 0$. $G_1$ and $G_2$ were assumed known, with $G_1$ computed from the marker data W (with m = 1, ..., q markers) as $G_1 = WW^T/q$; this matrix is called the genomic relationship matrix (GRM) (VanRaden 2008). The $G_1$ matrix defines the covariance between individuals based on observed similarity at the genomic level, rather than on expected similarity based on pedigree, so that more accurate predictions of merit can be achieved. $G_2$ is computed as $G_2 = I_I \otimes G_1$, of order IJ × IJ, where ⊗ denotes the Kronecker product and $I_I$ is an identity matrix of order I, which means that we assume independence between environments.
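
As a concrete illustration of how these relationship matrices can be built, the short R sketch below computes $G_1 = WW^T/q$ and $G_2 = I_I \otimes G_1$ for a toy marker matrix; the random marker matrix, the optional centering step, and all object names are our own assumptions rather than the paper's code. With the real data, W would be the 182 × 1635 matrix of coded SNPs available from the repository given above.

```r
# Sketch: building G1 = W W^T / q and G2 = I_I %x% G1 from a toy 0/1 marker matrix.
set.seed(1)
J <- 182    # number of lines
q <- 1635   # number of SNP markers
I <- 3      # number of environments

W  <- matrix(rbinom(J * q, 1, 0.5), nrow = J, ncol = q)  # hypothetical marker matrix
W  <- scale(W, center = TRUE, scale = FALSE)  # centering is a common choice, not stated in the text
G1 <- tcrossprod(W) / q                       # genomic relationship matrix (J x J)
G2 <- diag(I) %x% G1                          # (I*J) x (I*J): independence between environments

dim(G1)   # 182 x 182
dim(G2)   # 546 x 546
```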

Model Pois:

The Poisson model (Model Pois) is the same as Model NB, except that $y_{ijkt}|g_j, gE_{ij} \sim \mathrm{Poisson}(\mu_{ijk})$. Since, according to Zhou et al. (2012) and Teerapabolarn and Jaioun (2014), $\lim_{r\to\infty} NB(\mu_{ijk}, r) = \mathrm{Pois}(\mu_{ijk})$, Model Pois was implemented using the same method as Model NB, but fixing r at a large value that depends on the mean count. We used r = 1000, which is a good choice when the mean count is less than 50 (see Figure 1). However, when the mean count is between 50 and 200, we suggest using r = 5000, and when the mean count is larger than 200, we suggest a value of r = 10,000 or larger. These suggestions are supported by Figure 1, where we plot the mean and variance of Model NB as a function of the scale parameter r for three values of r (1000; 5000; 10,000). Good approximations to Model Pois with Model NB occur when the mean and variance are very similar; for this reason, good approximations are those that follow the diagonal in Figure 1, where μ = σ². We can see that the mean counts and variances are very similar for mean counts of less than 50 with r = 1000; however, when the mean count is larger than 50 and less than 200, we should use r = 5000, and for mean counts greater than 200, we suggest using a value of r = 10,000 or larger. In our applications with simulated and real data, the mean count is less than 50; for this reason, we used a value of r = 1000.
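
To make the choice of r concrete, the following R check (mean counts chosen by us purely for illustration) compares the NB variance $\mu + \mu^2/r$ with the Poisson variance $\mu$ for the suggested values of r:

```r
# Variance of NB(mu, r) is mu + mu^2 / r, while the Poisson variance is mu.
mu <- c(10, 50, 200)
for (r in c(1000, 5000, 10000)) {
  cat("r =", r, "| NB variance:", round(mu + mu^2 / r, 2),
      "| Poisson variance:", mu, "\n")
}
# For mu = 50 and r = 1000 the NB variance is 52.5 (5% above the Poisson value),
# which is why a larger r is suggested once the mean count exceeds 50.
```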

Figure 1. Plot of the mean count vs. the variance of Model NB as a function of the scale parameter (r). Good approximations are obtained when the mean and variance are very similar; in the plot, they should follow the diagonal where μ = σ².

Model Normal:

Model Normal is similar to Model NB, except that $y_{ijkt}|g_j, gE_{ij} \sim N(\eta_{ijk}, \sigma_e^2)$ with identity link function ($\eta_{ijk} = \mu_{ijk}$), where $\sigma_e^2$ is the scale parameter of the normal distribution associated with the residual of replication t of line j in block k and environment i. The $\sigma_e^2$ parameter must be estimated because the normal, log-normal, and negative binomial distributions belong to the two-parameter exponential family; in the NB distribution this scale parameter is represented by r. In contrast, the Poisson distribution belongs to the one-parameter exponential family, so only $\mu_{ijk}$ needs to be estimated, since its mean is equal to its variance.

Model LN:

The Log-Normal model (Model LN) is similar to Model NB, except that $\log(y_{ijkt}+1)|g_j, gE_{ij} \sim N(\eta_{ijk}, \sigma_e^2)$ with identity link function ($\eta_{ijk} = \mu_{ijk}$), where $\sigma_e^2$ is the scale parameter associated with the residual of replication t of line j in block k and environment i.

When the number of markers (q) is larger than the number of observations ($n_T$), implementing Models NB and Pois is challenging. For this reason, we propose a Bayesian method for dealing with situations where $q > n_T$; our model takes into account all markers through the GRM ($G_1 = WW^T/q$) described above. Models Normal and LN were implemented in the BGLR package of de los Campos et al. (2014). Therefore, our proposed Bayesian model for count data is a so-called genomic best linear unbiased prediction (GBLUP) method, since it utilizes genomic relationships to predict the genetic value of an individual.

Bayesian mixed negative binomial regression:

We rewrite the linear predictor (1) as $\eta_{ijk} = x_{ik}^T\beta + b_{1j} + b_{2ij}$, with $x_{ik}^T = [x_1, x_2, x_3, x_{11}, x_{12}, x_{21}, x_{22}, x_{31}, x_{32}]$, where $x_1$, $x_2$, and $x_3$ are indicator variables that take the value of 1 if the observed environment i is 1, 2, and 3, respectively, and 0 otherwise, and $x_{ik}$, i = 1, 2, 3, k = 1, 2, are indicator variables that take the value of 1 if block k is observed within environment i, and 0 otherwise. $\beta^T = [\beta_1, \beta_2, \beta_3, \beta_{11}, \beta_{12}, \beta_{21}, \beta_{22}, \beta_{31}, \beta_{32}]$, where the first three coefficients correspond to the effects of the environments and the last six correspond to the block effects within each environment (that is, β is a vector of coefficients of order p × 1, with p = I + I × K). Therefore, $x_{ik}^T\beta = E_i + R(E)_{ik}$, $b_{1j} = g_j$, and $b_{2ij} = gE_{ij}$. Note that, under Model NB, because $\mu_{ijk} = E(y_{ijkt}|b_{1j}, b_{2ij}) = \exp(\eta_{ijk})$, conditionally on $b_{1j}$ and $b_{2ij}$ the probability that the random variable $Y_{ijkt}$ takes the value $y_{ijkt}$ is equal to

$\Pr(Y_{ijkt} = y_{ijkt}|b_{1j}, b_{2ij}) = \binom{y_{ijkt}+r-1}{y_{ijkt}}\left(1 - \frac{\mu_{ijk}}{r+\mu_{ijk}}\right)^{r}\left(\frac{\mu_{ijk}}{r+\mu_{ijk}}\right)^{y_{ijkt}}$ for $y_{ijkt} = 0, 1, 2, \ldots$
$= \frac{\Gamma(y_{ijkt}+r)}{y_{ijkt}!\,\Gamma(r)}\,\frac{[\exp(\eta_{ijk}^{*})]^{y_{ijkt}}}{[1+\exp(\eta_{ijk}^{*})]^{y_{ijkt}+r}}, \quad y_{ijkt} = 0, 1, 2, \ldots$ (2)

We arrive at Equation (2) because $\frac{\mu_{ijk}}{r+\mu_{ijk}} = \frac{\mu_{ijk}/r}{1+\mu_{ijk}/r} = \frac{\exp(\eta_{ijk})\exp(-\log(r))}{1+\exp(\eta_{ijk})\exp(-\log(r))} = \frac{\exp(\eta_{ijk}-\log(r))}{1+\exp(\eta_{ijk}-\log(r))} = \frac{\exp(\eta_{ijk}^{*})}{1+\exp(\eta_{ijk}^{*})}$, with $\eta_{ijk}^{*} = x_{ik}^T\beta^{*} + b_{1j} + b_{2ij}$, $\beta^{*} = [\beta_1^{*}, \beta_2^{*}, \beta_3^{*}, \beta_{11}, \beta_{12}, \beta_{21}, \beta_{22}, \beta_{31}, \beta_{32}]$, and $\beta_i^{*} = \beta_i - \log(r)$. Therefore, Equation (2) gives the connection between the probability distribution of the response ($Y_{ijkt}$) and the assumed relation between the linear predictor ($\eta_{ijk}$) and the expected value of $Y_{ijkt}$ ($\mu_{ijk}$) under Model NB. We can then rewrite the $\Pr(Y_{ijkt}=y_{ijkt}|b_{1j},b_{2ij})$ given in Equation (2) as:

$\frac{\Gamma(y_{ijkt}+r)}{y_{ijkt}!\,\Gamma(r)}\,2^{-(y_{ijkt}+r)}\exp\left(\frac{y_{ijkt}-r}{2}\,\eta_{ijk}^{*}\right)\int_{0}^{\infty}\exp\left[-\frac{\omega_{ijkt}(\eta_{ijk}^{*})^{2}}{2}\right]f(\omega_{ijkt};\, y_{ijkt}+r,\, 0)\,d\omega_{ijkt}$ (3)

Expression (3) was obtained using the following equality given by Polson et al. (2013): $\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}} = 2^{-b}e^{\kappa\psi}\int_{0}^{\infty}e^{-\omega\psi^{2}/2}f(\omega;\, b,\, 0)\,d\omega$, where $\kappa = a - b/2$ and $f(\cdot\,; b, 0)$ denotes the density of a Pólya-Gamma random variable ω with parameters b and c = 0 [PG(b, c = 0)] (see Definition 1 in Polson et al. 2013). From here, conditioning on $\omega_{ijkt} \sim PG(y_{ijkt}+r, c = 0)$, we have that

$\Pr(Y_{ijkt} = y_{ijkt}|b_{1j}, b_{2ij}, \omega_{ijkt}) = \frac{\Gamma(y_{ijkt}+r)}{y_{ijkt}!\,\Gamma(r)}\,2^{-(y_{ijkt}+r)}\exp\left(\frac{y_{ijkt}-r}{2}\,\eta_{ijk}^{*}\right)\exp\left[-\omega_{ijkt}(\eta_{ijk}^{*})^{2}/2\right]$ (4)
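
As a quick numerical sanity check on the rewriting that leads to Equations (2)-(4), the following R snippet (with arbitrary values of y, r, and η chosen by us) verifies that the logistic-link form equals the standard NB probability mass function:

```r
# Check that Gamma(y+r) / (y! Gamma(r)) * exp(eta*)^y / (1 + exp(eta*))^(y+r)
# equals the NB pmf with mean mu = exp(eta) and size r, where eta* = eta - log(r).
y   <- 3
r   <- 5
eta <- 1.2
mu  <- exp(eta)
eta_star <- eta - log(r)

p1 <- exp(lgamma(y + r) - lfactorial(y) - lgamma(r) +
          y * eta_star - (y + r) * log1p(exp(eta_star)))
p2 <- dnbinom(y, size = r, mu = mu)
all.equal(p1, p2)   # TRUE
```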

To obtain the full conditional distributions, we specify the prior distributions, f(θ), for all the unknown model parameters $\theta = (\beta^{*}, \sigma_\beta^2, b_1, \sigma_{b1}^2, b_2, \sigma_{b2}^2, r)$. We assume prior independence between the parameters, that is,

$f(\theta) = f(\beta^{*})\,f(\sigma_\beta^2)\,f(b_1)\,f(\sigma_{b1}^2)\,f(b_2)\,f(\sigma_{b2}^2)\,f(r).$

We assign conditionally conjugate but weakly informative prior distributions to the parameters because we have no prior information. Prior specification in terms of $\beta^{*}$ instead of β is for convenience. We adopt proper priors with known hyper-parameters, whose values we specify in the model implementation, to guarantee proper posteriors. We assume that $\beta^{*}|\sigma_\beta^2 \sim N_p(\beta_0, \Sigma_0\sigma_\beta^2)$; $\sigma_\beta^2 \sim \chi^{-2}(\nu_\beta, S_\beta)$, where $\chi^{-2}(\nu_\beta, S_\beta)$ denotes a scaled inverse chi-square distribution with shape parameter $\nu_\beta$ and scale parameter $S_\beta$; $b_1|\sigma_{b1}^2 \sim N_{n_{b1}}(0, G_1\sigma_{b1}^2)$, with $n_{b1} = J$; $\sigma_{b1}^2 \sim \chi^{-2}(\nu_{b1}, S_{b1})$; $b_2|\sigma_{b2}^2 \sim N_{n_{b2}}(0, G_2\sigma_{b2}^2)$, with $n_{b2} = IJ$; $\sigma_{b2}^2 \sim \chi^{-2}(\nu_{b2}, S_{b2})$; and $r \sim G(a_0, 1/b_0)$. Next, we combine Equation (4), using all the data, with these priors to obtain the full conditional distributions for the parameters $\beta^{*}$, $\sigma_\beta^2$, $b_1$, $\sigma_{b1}^2$, $b_2$, $\sigma_{b2}^2$, and r.

Full conditional distributions:

The full conditional distribution of $\beta^{*}$ is given as:

$f(\beta^{*}|y, ELSE) \sim N(\tilde{\beta}_0, \tilde{\Sigma}_0)$ (5)

where $\tilde{\Sigma}_0 = (\Sigma_0^{-1}\sigma_\beta^{-2} + X^T D_\omega X)^{-1}$, $\tilde{\beta}_0 = \tilde{\Sigma}_0\left(\Sigma_0^{-1}\sigma_\beta^{-2}\beta_0 - X^T D_\omega\sum_{h=1}^{2}Z_h b_h + X^T\kappa\right)$, $y_{ijk} = [y_{ijk1}, \ldots, y_{ijkn}]^T$, $y_{ij} = [y_{ij1}^T, \ldots, y_{ijK}^T]^T$, $y_i = [y_{i1}^T, \ldots, y_{iJ}^T]^T$, $y = [y_1^T, \ldots, y_I^T]^T$, $\kappa_{ijk} = \frac{1}{2}[y_{ijk1}-r, \ldots, y_{ijkn}-r]^T$, $\kappa_{ij} = [\kappa_{ij1}^T, \ldots, \kappa_{ijK}^T]^T$, $\kappa_i = [\kappa_{i1}^T, \ldots, \kappa_{iJ}^T]^T$, $\kappa = [\kappa_1^T, \ldots, \kappa_I^T]^T$, $X_{ijk} = [\mathbf{1}_n^T \otimes x_{ik}]^T$, $X_{ij} = [X_{ij1}^T, \ldots, X_{ijK}^T]^T$, $X_i = [X_{i1}^T, \ldots, X_{iJ}^T]^T$, $X = [X_1^T, \ldots, X_I^T]^T$, $D_{\omega_{ijk}} = \mathrm{diag}(\omega_{ijk1}, \ldots, \omega_{ijkn})$, $D_{\omega_{ij}} = \mathrm{diag}(D_{\omega_{ij1}}, \ldots, D_{\omega_{ijK}})$, $D_{\omega_i} = \mathrm{diag}(D_{\omega_{i1}}, \ldots, D_{\omega_{iJ}})$, $D_\omega = \mathrm{diag}(D_{\omega_1}, \ldots, D_{\omega_I})$, $b_1 = [b_{11}, \ldots, b_{1J}]^T$, $b_{2i} = [b_{2i1}, \ldots, b_{2iJ}]^T$, $b_2 = [b_{21}^T, \ldots, b_{2I}^T]^T$, $Z_{1i}$ is the block-diagonal incidence matrix with the vectors of ones $\mathbf{1}_n$ on its diagonal and zeros elsewhere, $Z_1 = [Z_{11}^T, \ldots, Z_{1I}^T]^T$, and $Z_2 = Z_1 * X$, where $*$ indicates the horizontal Kronecker product between $Z_1$ and X. The horizontal Kronecker product forms the Kronecker product of each row of $Z_1$ with the corresponding row of X and stacks these row vectors into a new matrix. $Z_1$ and X must have the same number of rows, which is also the number of rows of the resulting matrix; the number of columns of the resulting matrix is the product of the numbers of columns of $Z_1$ and X. When the prior for $\beta^{*}$ is constant, the posterior distribution of $\beta^{*}$ is also normally distributed, $N(\tilde{\beta}_0, \tilde{\Sigma}_0)$, but with the term $\Sigma_0^{-1}\sigma_\beta^{-2}$ set to zero in both $\tilde{\Sigma}_0$ and $\tilde{\beta}_0$.
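
The horizontal (row-wise) Kronecker product used to build $Z_2$ can be illustrated with a small R helper; the function name and the toy matrices below are our own and are not part of the paper's code:

```r
# Row-wise (horizontal) Kronecker product: row i of the result is kronecker(Z1[i, ], X[i, ]).
# Z1 and X must have the same number of rows.
row_kronecker <- function(Z1, X) {
  stopifnot(nrow(Z1) == nrow(X))
  t(sapply(seq_len(nrow(Z1)), function(i) kronecker(Z1[i, ], X[i, ])))
}

Z1 <- matrix(c(1, 0,
               0, 1), nrow = 2, byrow = TRUE)
X  <- matrix(c(1, 2,
               3, 4), nrow = 2, byrow = TRUE)
row_kronecker(Z1, X)
# Row 1: 1 2 0 0; row 2: 0 0 3 4 -- a 2 x 4 matrix, as described above.
```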

The full conditional distribution of $\omega_{ijkt}$ is

$f(\omega_{ijkt}|y, ELSE) \sim PG(y_{ijkt}+r,\; x_{ik}^T\beta^{*} + b_{1j} + b_{2ij})$ (6)

Defining $\eta_1 = X\beta^{*} + Z_2 b_2$, the conditional distribution of $b_1$ is given as

$f(b_1|y, ELSE) \sim N(\tilde{b}_1, F_1)$ (7)

with $F_1 = (\sigma_{b1}^{-2}G_1^{-1} + Z_1^T D_\omega Z_1)^{-1}$ and $\tilde{b}_1 = F_1(Z_1^T\kappa - Z_1^T D_\omega\eta_1)$. Similarly, defining $\eta_2 = X\beta^{*} + Z_1 b_1$, the conditional distribution of $b_2$ is

$f(b_2|y, ELSE) \sim N(\tilde{b}_2, F_2)$ (8)

where $F_2 = (\sigma_{b2}^{-2}G_2^{-1} + Z_2^T D_\omega Z_2)^{-1}$ and $\tilde{b}_2 = F_2(Z_2^T\kappa - Z_2^T D_\omega\eta_2)$.

The full conditional distribution of $\sigma_{bh}^{2}$, for h = 1, 2, is

$f(\sigma_{bh}^{2}|y, ELSE) \sim \chi^{-2}\left(\tilde{\nu}_b = \nu_{bh} + n_{bh},\; \tilde{S}_b = \frac{b_h^T G_h^{-1} b_h + \nu_{bh}S_{bh}}{\nu_{bh} + n_{bh}}\right)$ (9)

with $n_{b1} = J$ and $n_{b2} = IJ$.

The conditional distribution of $\sigma_{\beta^*}^{2}$ is

$f(\sigma_{\beta^*}^{2}|y, ELSE) \sim \chi^{-2}\left(\tilde{\nu}_{\beta^*} = \nu_{\beta^*} + I + IK,\; \tilde{S}_\beta = \frac{(\beta^{*}-\beta_0)^T\Sigma_0^{-1}(\beta^{*}-\beta_0) + \nu_{\beta^*}S_{\beta^*}}{\tilde{\nu}_{\beta^*}}\right)$ (10)

We take advantage of the fact that the NB distribution can also be generated using a Poisson representation (Quenouille 1949) as $Y = \sum_{l=1}^{L} u_l$, where $u_l \sim \mathrm{Log}(\pi)$, with $\pi = \frac{\mu}{r+\mu}$, is independent of $L \sim \mathrm{Pois}[-r\log(1-\pi)]$, and Log and Pois denote the logarithmic and Poisson distributions, respectively. We then infer a latent count L for each $Y \sim NB(\mu, r)$, conditional on Y and r. Therefore, following Zhou et al. (2012), we obtain the full conditional of r by alternating

$f(r|y, ELSE) \sim G\left(a_0 + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}} L_{ijkt},\; \frac{1}{b_0 - \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}}\log(1-\pi_{ijk})}\right)$ (11)
$f(L_{ijkt}|y, ELSE) \sim \mathrm{CRT}(y_{ijkt}, r)$ (12)

where $\mathrm{CRT}(y_{ijkt}, r)$ denotes a Chinese restaurant table (CRT) count random variable that can be generated as $L_{ijkt} = \sum_{l=1}^{y_{ijkt}} d_l$, where $d_l \sim \mathrm{Bernoulli}\left(\frac{r}{l-1+r}\right)$. For details of the derivation of the CRT random variable, see Zhou and Carin (2012, 2015).
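
A CRT(y, r) draw follows directly from this Bernoulli representation; a minimal R sketch (the helper name is ours) is:

```r
# Sample L ~ CRT(y, r) as a sum of independent Bernoulli(r / (l - 1 + r)) draws, l = 1, ..., y.
rcrt <- function(y, r) {
  if (y == 0) return(0L)
  l <- seq_len(y)
  sum(rbinom(y, size = 1, prob = r / (l - 1 + r)))
}

set.seed(2)
rcrt(y = 7, r = 2.5)   # one latent table count for an observed count of 7
```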

Gibbs sampler:

The Gibbs sampler for the latent parameters of the NB model with G×E can be implemented by sampling repeatedly from the following loop (a minimal R sketch of the core updates is given after the list):

  1. Sample ωijkt values from the Pólya-Gamma distribution in (6).

  2. Sample LijktCRT(yijkt,r) from (12).

  3. Sample the scale parameter (r) from the gamma distribution in (11).

  4. Sample the location effects ( β*) from the normal distribution in (5).

  5. Sample the random effects (b1)   from the normal distribution in (7).

  6. Sample the random effects (b2)   from the normal distribution in (8).

  7. Sample the variance effects (σbh2) with h=1,2,  from the scaled inverted χ2 distribution in (9).

  8. Sample the variance effect (σβ*2) from the scaled inverted χ2 distribution in (10).

  9. Return to step 1 or terminate when chain length is adequate to meet convergence diagnostics.
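
To make steps 1 and 4 concrete, the sketch below runs the core Pólya-Gamma updates for a deliberately stripped-down version of the model (fixed effects only, r held fixed, no G×E terms), so it is an illustration under our own simplifying assumptions rather than the full BMNB sampler; it also assumes that the BayesLogit package exposes rpg(num, h, z) for Pólya-Gamma draws. Steps 2-3 and 5-8 would update $L_{ijkt}$, r, $b_1$, $b_2$, and the variance components analogously, using (7)-(12).

```r
# Minimal sketch of the Polya-Gamma Gibbs updates (steps 1 and 4) for an NB regression
# with fixed effects only and r fixed; assumes BayesLogit::rpg(num, h, z) is available.
library(BayesLogit)

set.seed(3)
n <- 200; p <- 3; r <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(1.0, 0.5, -0.5)
y <- rnbinom(n, size = r, mu = exp(X %*% beta_true))

Sigma0_inv <- diag(1 / 10000, p)   # weakly informative prior beta* ~ N(0, 10000 I)
beta_star  <- rep(0, p)            # beta* has the intercept shifted by -log(r)
kappa      <- (y - r) / 2
n_iter     <- 2000
draws      <- matrix(NA_real_, n_iter, p)

for (s in seq_len(n_iter)) {
  eta_star <- drop(X %*% beta_star)
  omega    <- rpg(n, h = y + r, z = eta_star)           # step 1: omega ~ PG(y + r, eta*)
  V        <- solve(Sigma0_inv + crossprod(X * omega, X))
  m        <- V %*% crossprod(X, kappa)                 # step 4: beta* | omega ~ N(m, V)
  beta_star  <- drop(m + t(chol(V)) %*% rnorm(p))
  draws[s, ] <- beta_star
}

beta_hat <- colMeans(draws[-(1:1000), ])
beta_hat + c(log(r), 0, 0)   # back-transform the intercept: beta = beta* + log(r)
```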

Model implementation:

The Gibbs sampler described above for the BMNB model was implemented in R (R Core Team 2015). Implementation was done under a Bayesian approach using Markov Chain Monte Carlo (MCMC) through the Gibbs sampler algorithm, which samples sequentially from the full conditional distributions until it reaches a stationary process that converges to the joint posterior distribution (Gelfand and Smith 1990). To decrease the potential impact of MCMC errors on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 30,000, so that 30,000 samples were used for inference. We did not thin the chains, following the suggestions of Geyer (1992), MacEachern and Berliner (1994), and Link and Eaton (2012), who argue against subsampling MCMC output when approximating simple features of the target distribution (e.g., means, variances, and percentiles). We implemented the prior specification given in the section Bayesian mixed negative binomial regression with $\beta^{*}|\sigma_\beta^2 \sim N_p(\beta_0 = \mathbf{0}_9, \Sigma_0 = I_9 \times 10{,}000)$; $b_1|\sigma_{b1}^2 \sim N_{n_{b1}}(\mathbf{0}, G_1\sigma_{b1}^2)$, where $G_1$ is the GRM, that is, the covariance matrix of the random effects; $\sigma_{b1}^2 \sim \chi^{-2}(\nu_{b1} = 3, S_{b1} = 0.001)$; $b_2|\sigma_{b2}^2 \sim N_{n_{b2}}(\mathbf{0}, G_2\sigma_{b2}^2)$, where $G_2$ is the covariance matrix of the random effects that belong to the G×E term; $\sigma_{b2}^2 \sim \chi^{-2}(\nu_{b2} = 3, S_{b2} = 0.001)$; and $r \sim G(a_0 = 0.01, 1/b_0)$ with $b_0 = 0.01$. All these hyper-parameters were chosen to yield weakly informative priors. The convergence of the MCMC chains was monitored using trace plots and autocorrelation functions. We also conducted a sensitivity analysis on the use of the inverse gamma priors for the variance components, and we observed that the results are robust under different choices of priors.
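
For reference, this kind of monitoring can be done with the coda package; a minimal sketch, assuming a matrix of saved samples such as the hypothetical `draws` object from the sketch above, is:

```r
# Inspect convergence of saved MCMC draws with the coda package.
library(coda)
chain <- mcmc(draws[-(1:1000), ])   # discard burn-in samples
summary(chain)                      # posterior means, SDs, and quantiles
traceplot(chain)                    # trace plots per parameter
autocorr.plot(chain)                # autocorrelation functions
effectiveSize(chain)                # effective sample sizes
```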

Assessing prediction accuracy:

We used cross-validation to compare the prediction accuracy of the proposed models for count phenotypes. We implemented a 10-fold cross-validation; that is, the data set was divided into 10 mutually exclusive subsets, and each time we used nine subsets for the training set and the remaining one for the validation set. The training set was used to fit the model, and the validation set was used to evaluate the prediction accuracy of the proposed models. To compare the prediction accuracy of the proposed models, we calculated the Spearman correlation (Cor) and the mean square error of prediction (MSEP), both computed from the observed and predicted response variables of the validation set. Large values of Cor indicate better prediction accuracy, while small values of MSEP indicate better prediction performance. The predicted observations, $\hat{y}_{ijkt}$, were calculated with the M Gibbs samples collected after discarding the burn-in period. For Models NB and Pois, the predicted values were calculated as $\hat{y}_{ijkt} = \frac{1}{M}\sum_{s=1}^{M}\exp\left(x_{ik}^T\hat{\beta}^{*(s)} + \log(\hat{r}^{(s)}) + \hat{g}_j^{(s)} + \hat{gE}_{ij}^{(s)}\right)$, where $\hat{r}^{(s)}$, $\hat{\beta}^{*(s)}$, $\hat{g}_j^{(s)}$, and $\hat{gE}_{ij}^{(s)}$ are the estimates of r, $\beta^{*}$, $g_j$, and $gE_{ij}$ for line j and block k in environment i obtained in the sth collected sample. For Model Normal, the predicted values were calculated as $\hat{y}_{ijkt} = \frac{1}{M}\sum_{s=1}^{M}\left(x_{ik}^T\hat{\beta}^{(s)} + \hat{g}_j^{(s)} + \hat{gE}_{ij}^{(s)}\right)$, and for Model LN as $\hat{y}_{ijkt} = \frac{1}{M}\sum_{s=1}^{M}\exp\left(x_{ik}^T\hat{\beta}^{(s)} + \hat{g}_j^{(s)} + \hat{gE}_{ij}^{(s)} + \frac{\hat{\sigma}_e^{2(s)}}{2}\right) - 1$, using the corresponding estimates from each model.
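
The two predictive checks are simple to compute once predictions for a validation fold are in hand; the small R sketch below (the function name, the hypothetical vectors y_obs and y_pred, and the fold-building step are our own) illustrates the idea:

```r
# Spearman correlation and mean squared error of prediction for one validation fold.
predictive_checks <- function(y_obs, y_pred) {
  c(Cor  = cor(y_obs, y_pred, method = "spearman"),
    MSEP = mean((y_obs - y_pred)^2))
}

# One simple way to build 10 folds (indices only) for nT observations:
set.seed(5)
nT    <- 10920                         # 3 environments x 182 lines x 2 blocks x 10 spikes
folds <- sample(rep(1:10, length.out = nT))
table(folds)                           # roughly equal fold sizes
```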

Simulation study:

To show the performance of the proposed Gibbs sampler for count phenotypes that takes into account G×E, we performed a simulation study under model (1) with the linear predictor $\eta_{ij} = E_i + g_j + gE_{ij}$ and two scenarios (S1 and S2). Scenario 1 had three environments (I = 3), 20 genotypes (J = 20), $G_1 = I_{20}$, $G_2 = I_I \otimes G_1$, and $\sigma_{b1}^2 = \sigma_{b2}^2 = 0.5$, with four different numbers of replicates of each genotype in each environment: n = 5, 10, 20, and 40. Scenario 2 is identical to Scenario 1, except that $G_1 = 0.7 I_{20} + 0.3 J_{20}$, where $J_{20}$ is a square matrix of ones of order 20 × 20. In this second scenario, we imitated the correlation between lines found in real genomic selection data. The priors used for the simulation study in both scenarios (S1 and S2) were approximately flat for all parameters: $\beta|\sigma_\beta^2 \sim N(\beta_0 = [0, 0, 0]^T, I_3 \times 10{,}000)$, $r \sim G(0.001, 1/0.001)$, $\sigma_{b1}^2$ and $\sigma_{b2}^2 \sim \chi^{-2}(0.50002, 4.0002)$, $b_1|\sigma_{b1}^2 \sim N(0, G_1\sigma_{b1}^2)$, and $b_2|\sigma_{b2}^2 \sim N(0, G_2\sigma_{b2}^2)$. We computed 20,000 MCMC samples; Bayes estimates were computed with 10,000 samples, since the first 10,000 were discarded as burn-in. We report the average estimates obtained with the proposed Gibbs sampler, along with their SDs (Table 1). All the results in Table 1 are based on 50 replications.
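
For completeness, one way to simulate a data set under Scenario 1 is sketched below in R; the coding of the environment effects from ($\beta_0$, $\beta_1$, $\beta_2$) and all object names are our own assumptions:

```r
# Simulate NB counts under Scenario 1: eta_ij = E_i + g_j + gE_ij, with
# g ~ N(0, 0.5 * G1), gE ~ N(0, 0.5 * (I_3 %x% G1)), G1 = I_20, and r = 5.
library(MASS)   # for mvrnorm

set.seed(6)
I <- 3; J <- 20; n_rep <- 10; r <- 5
E  <- c(1.5, 0.5, 2.5)   # one possible coding of beta0 = 1.5, beta1 = -1, beta2 = 1
G1 <- diag(J)
g  <- mvrnorm(1, mu = rep(0, J),     Sigma = 0.5 * G1)
gE <- mvrnorm(1, mu = rep(0, I * J), Sigma = 0.5 * kronecker(diag(I), G1))

dat <- expand.grid(rep = 1:n_rep, line = 1:J, env = 1:I)
eta <- E[dat$env] + g[dat$line] + gE[(dat$env - 1) * J + dat$line]
dat$y <- rnbinom(nrow(dat), size = r, mu = exp(eta))
head(dat)
```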

Table 1. Posterior mean and posterior SD of the Bayesian method with four sample sizes (n) for Model NB.
n=5 n=10 n=20 n=40
Parameter True Mean SD Mean SD Mean SD Mean SD
Scenario 1
β0 1.5 1.48 0.36 1.49 0.27 1.54 0.23 1.55 0.21
β1 −1 −0.98 0.26 −0.99 0.25 −1.08 0.25 −1.02 0.19
β2 1 1.00 0.27 0.99 0.22 0.99 0.27 0.95 0.22
r 5 5.08 0.92 5.08 0.52 5.02 0.47 5.03 0.33
σ12 0.5 0.54 0.20 0.59 0.18 0.58 0.18 0.59 0.22
σ22 0.5 0.50 0.13 0.52 0.14 0.53 0.11 0.51 0.11
Scenario 2
β0 1.5 1.48 0.50 1.46 0.50 1.56 0.61 1.47 0.50
β1 −1 −1.06 0.23 −1.00 0.20 −1.01 0.22 −1.03 0.19
β2 1 0.95 0.24 1.03 0.22 0.99 0.20 0.97 0.20
r 5 5.10 0.81 4.99 0.59 5.04 0.35 5.03 0.20
σ12 0.5 0.54 0.18 0.57 0.22 0.58 0.19 0.53 0.18
σ22 0.5 0.50 0.12 0.51 0.14 0.53 0.13 0.51 0.10

Results

Table 1 lists the results of the simulation study for both scenarios (S1 and S2). The bias when estimating the parameters is a little larger in S1 than in S2. Also, β0 is the parameter with the largest bias (it is underestimated). Both variances (σ12, σ22) are overestimated in scenario 1, but only σ12 is overestimated in scenario 2. With a sample size of n = 5, parameter r had a larger SD; however, for larger sample sizes (n = 20, 40), the SDs were considerably reduced. In general, there was not a large reduction in SD when the sample size increased from 5 to 10, 20, and 40, the exceptions being the estimation of r in both scenarios and the estimation of β0 in S1, where there was a large reduction in SD as the sample size increased. Although the estimates do not agree exactly with the true values of the parameters, the proposed Gibbs sampler for count data that takes into account G×E did a good job of estimating the parameters, since the estimates are close to the true values, with SDs of reasonable size.

In all the experiments (environments) using the real data set, the genotypes were arranged in a randomized complete block design with two blocks; thus the linear predictor used was that given in Equation (1). Using the real data set, we compared four scenarios (S1–S4, given in Table 2) for each model. Table 2 shows that, in the linear predictor, S1 and S2 do not take into account interaction effects between genotypes and environments, only the main effects of these factors. Also, S1 and S3 do not use marker information. These four scenarios were studied to investigate the gain in model fit and prediction ability taking into account the interaction effects, and using the marker information available.

Table 2. Scenarios proposed to fit the real data set with Models NB, Pois, Normal and LN.

Scenario E L G R(E) EL EG
S1 X X – X – –
S2 X – X X – –
S3 X X – X X –
S4 X – X X – X

E, environment; L, lines; G, lines taking markers into account; R(E), blocks nested within environment; EL and EG, interaction effects of E with L and of E with G, respectively. Main effects: E, L, G; nested effect: R(E); interaction effects: EL, EG. –, effect not included in the scenario.

The posterior means (Mean) and posterior SDs of the scalar parameters, together with the posterior predictive checks for each scenario of the proposed models, are given in Table 3. For the four models, the posterior means of the beta regression coefficients, variance components, and overdispersion parameter (r) are similar between S1 and S2, and between S3 and S4. In terms of goodness-of-fit measured by the posterior mean of the log-likelihood (Loglik), the scenarios rank as follows: S3, rank 1; S4, rank 2; S1, rank 3; and S2, rank 4, for the four proposed models, with the exception of Model Pois, where the ranking was S3, rank 1; S4, rank 2; S2, rank 3; and S1, rank 4. Therefore, in terms of goodness-of-fit, there is evidence that S3 is the best scenario for the four proposed models. Of the four models under study, Table 3 shows that Model LN gives the best fit, since it has the largest Loglik.

Table 3. Estimated beta coefficients, variance components, and posterior predictive checks for the four scenarios (S1, S2, S3, S4) for each proposed model.

S1 S2 S3 S4
Parameter Mean SD Mean SD Mean SD Mean SD
Model NB
β1* −0.93 0.60 −1.05 0.61 −2.52 0.71 −2.38 0.99
β2* −0.83 0.71 −1.16 0.66 −2.27 0.58 −2.73 1.00
β3* −0.03 0.48 −0.15 0.56 −1.69 0.85 −1.96 0.78
β11 −0.09 0.52 −0.06 0.65 −0.02 0.54 −0.25 0.67
β12 0.05 0.51 0.08 0.60 0.10 0.53 −0.13 0.66
β21 −0.20 0.62 0.05 0.70 −0.27 0.47 0.09 0.67
β22 −0.05 0.61 0.20 0.66 −0.15 0.46 0.21 0.65
β31 0.07 0.42 0.11 0.61 0.11 0.61 0.32 0.50
β32 −0.14 0.41 −0.10 0.59 −0.10 0.60 0.11 0.48
σ12 0.43 0.05 1.37 0.17 0.34 0.05 1.03 0.15
σ22 – – – – 0.38 0.03 1.04 0.10
r 2.80 0.12 2.81 0.12 11.87 1.12 11.55 1.17
 Loglik −1526.65 −1526.88 −1268.83 −1275.25
 Cor 0.69 0.69 0.90 0.89
 MSEP 2.13 2.12 0.75 0.77
Model Pois
β1* −7.14 0.22 −7.21 0.39 −6.69 0.11 −6.80 0.33
β2* −7.08 0.13 −7.17 0.11 −7.07 0.16 −7.27 0.19
β3* −5.97 0.43 −6.46 0.29 −5.88 0.16 −6.66 0.28
β11 0.12 0.17 0.07 0.29 −0.25 0.11 −0.34 0.23
β12 0.27 0.17 0.23 0.29 −0.13 0.11 −0.22 0.23
β21 0.06 0.14 0.03 0.15 0.14 0.15 0.13 0.17
β22 0.22 0.14 0.18 0.15 0.25 0.15 0.24 0.17
β31 0.04 0.34 0.41 0.21 −0.09 0.13 0.51 0.19
β32 −0.20 0.33 0.17 0.21 −0.31 0.13 0.28 0.19
σ12 0.44 0.05 1.46 0.17 0.35 0.05 1.03 0.14
σ22 – – – – 0.38 0.03 1.05
r 1000.00 1000.00 1000.00 1000.00
 Loglik −1477.63 −1477.52 −1228.73 −1234.97
 Cor 0.66 0.66 0.90 0.89
 MSEP 1.87 1.86 0.74 0.76
Model Normal
β1 −12.30 5.86 7.90 4.36 13.70 3.69 9.22 3.11
β2 −12.20 5.80 7.93 4.41 13.60 3.73 9.11 3.16
β3 −10.40 5.87 9.66 4.36 15.50 3.69 10.94 3.10
σ12 0.96 0.16 1.42 0.35 0.72 0.18 1.58 0.40
σ22 – – – – 1.33 0.18 1.13 0.34
r 2.75 0.14 2.91 0.15 1.67 0.11 2.23 0.17
 Loglik −1918.00 −1957.00 −1542.00 −1747.00
 Cor 0.60 0.56 0.83 0.71
 MSEP 2.41 2.60 1.07 1.68
Model LN
β1 −3.95 0.51 −6.34 3.33 1.41 0.48 3.32 1.31
β2 −3.95 0.48 −6.33 3.32 1.41 0.49 3.32 1.29
β3 −3.51 0.49 −5.85 3.33 1.86 0.49 3.79 1.31
σ12 0.09 0.01 0.15 0.03 0.07 0.01 0.16 0.03
σ22 – – – – 0.08 0.01 0.05 0.02
r 0.17 0.01 0.181 0.009 0.11 0.01 0.15 0.01
 Loglik −484.00 −518.00 −125.00 −354.00
 Cor 0.71 0.68 0.88 0.79
 MSEP 2.50 2.63 1.25 1.97

The beta coefficients corresponding to the effects of environments (β1, β2, β3) are given for Models Normal and LN only. Mean, posterior mean; SD, posterior SD; –, not applicable (the G×E variance σ22 is estimated only under scenarios S3 and S4, which include interaction effects).

Table 4 presents the mean and SD of the posterior predictive checks (Cor and MSEP) for each location (Batan 2012, Batan 2014, and Ecuador 2014) resulting from the 10-fold cross-validation implemented for the four models and four scenarios. The predictive checks given in Table 4 were calculated using the testing set. For Model NB, according to the Spearman correlation, the ranking of scenarios was as follows: in Batan 2012, 1 for S4, 2 for S3, 3 for S1, and 4 for S2; in Batan 2014, 1 for S4 and 2 for S3, with S1 and S2 tied for the remaining positions; and in Ecuador 2014, 1 for S3, 2 for S2, 3 for S1, and 4 for S4. In terms of MSEP, the ranking for Model NB in Batan 2012 was 1 for S3 and 2 for S4, with S1 and S2 tied for the remaining positions; in Batan 2014, 1 for S2, 2 for S1, 3 for S3, and 4 for S4; and in Ecuador 2014, 1 for S3, 2 for S2, 3 for S4, and 4 for S1. Under Model Pois, the ranking of the four scenarios in each location was exactly the same as the ranking reported for Model NB. For Model Normal, in terms of the Spearman correlation, S1 had the best prediction accuracy in Batan 2012, while S4 was the worst in all three locations; in terms of MSEP, the best scenario was S3 in Batan 2012 and Ecuador 2014, and the worst was S4 in Batan 2014 and Ecuador 2014. For Model LN, in terms of the Spearman correlation, the best scenarios in Batan 2012 were S1, S2, and S3 (tied) and the worst was S4; in Batan 2014, the best scenario was S1, followed by S3, and the worst was S4; and in Ecuador 2014, the best scenarios were S1 and S3, followed by S2 and S4. In terms of MSEP, in Batan 2012 the best scenario was S3, followed by S1 and S2, and the worst was S4; in Batan 2014, the best scenario was S1, followed by S2, and the worst was S4; and, finally, in Ecuador 2014, the best scenario was S3, followed by S2, and the worst was S1.

Table 4. Estimated posterior predictive checks with cross-validation for Models NB, Pois, Normal and LN.

Batan 2012 Batan 2014 Ecuador 2014
Scenario Cor MSEP Cor MSEP Cor MSEP
Model NB
 S1 Mean 0.43 (3) 0.98 (3.5) 0.43 (3.5) 1.39 (2) 0.18 (3) 11.733 (4)
SD 0.33 0.72 0.33 1.35 0.40 9.471
 S2 Mean 0.42 (4) 0.98 (3.5) 0.43 (3.5) 1.38 (1) 0.20 (2) 11.222 (2)
SD 0.33 0.72 0.33 1.36 0.37 8.614
 S3 Mean 0.54 (2) 0.49 (1) 0.52 (2) 1.48 (3) 0.22 (1) 8.645 (1)
SD 0.28 0.38 0.29 2.32 0.39 5.688
 S4 Mean 0.56 (1) 0.61 (2) 0.56 (1) 1.85 (4) 0.12 (4) 11.343 (3)
SD 0.24 0.44 0.22 2.68 0.41 8.154
Model Pois
 S1 Mean 0.43 (3) 0.98 (3.5) 0.43 (3.5) 1.39 (2) 0.18 (3) 11.733 (4)
SD 0.33 0.72 0.33 1.35 0.40 9.471
 S2 Mean 0.42 (4) 0.98 (3.5) 0.43 (3.5) 1.38 (1) 0.20 (2) 11.222 (2)
SD 0.33 0.72 0.33 1.36 0.37 8.614
 S3 Mean 0.54 (2) 0.48 (1) 0.52 (2) 1.48 (3) 0.22 (1) 8.645 (1)
SD 0.28 0.38 0.29 2.32 0.39 5.688
 S4 Mean 0.56 (1) 0.61 (2) 0.56 (1) 1.85 (4) 0.12 (4) 11.343 (3)
SD 0.24 0.44 0.22 2.68 0.41 8.154
Model Normal
 S1 Mean 0.36(1) 1.10 (4) 0.37 (1.5) 1.79 (1) 0.15 (1.5) 7.425 (2)
SD 0.28 0.88 0.39 1.70 0.32 4.151
 S2 Mean 0.34 (2) 0.99 (2) 0.33 (3) 2.01 (3) 0.07 (3) 7.454 (3)
SD 0.33 0.65 0.44 2.46 0.33 4.339
 S3 Mean 0.33 (3) 0.81 (1) 0.37 (1.5) 1.96 (2) 0.15 (1.5) 7.318 (1)
SD 0.30 0.46 0.40 2.99 0.29 4.159
 S4 Mean 0.27 (4) 1.03 (3) 0.24 (4) 2.37 (4) 0.04 (4) 8.482 (4)
SD 0.34 0.73 0.45 3.42 0.24 4.326
Model LN
 S1 Mean 0.51 (2) 0.66 (2.5) 0.46 (1) 1.60 (1) 0.15 (1.5) 8.10 (4)
SD 0.21 0.42 0.31 2.35 0.38 5.11
 S2 Mean 0.51 (2) 0.66 (2.5) 0.43 (3.5) 1.78 (2) 0.09 (3.5) 7.82 (2)
SD 0.22 0.39 0.35 2.82 0.46 5.31
 S3 Mean 0.51 (2) 0.64 (1) 0.45 (2) 1.871 (3) 0.15 (1.5) 7.76 (1)
SD 0.21 0.45 0.31 3.16 0.37 5.21
 S4 Mean 0.43 (4) 0.72 (4) 0.43 (3.5) 1.95 (4) 0.09 (3.5) 8.04(3)
SD 0.25 0.42 0.33 3.15 0.41 5.18

The numbers in parentheses denote the ranking of the four scenarios for each posterior predictive check.

Table 5 gives the average of the ranks of the two posterior predictive checks (Cor and MSEP) that were used. Since we are comparing four scenarios for each model, the rank values range from 1 to 4, and the lower the value, the better the scenario. For ties, we assigned the average of the ranks that would have been assigned had there been no ties. Table 5 shows that the best scenarios were S3 and S4 under Models NB and Pois in Batan 2012. In Batan 2014, under Models NB and Pois, the best scenario was S2, while in Ecuador 2014 the best scenario was S3. Under Model Normal, the best scenario was S1 in Batan 2014, S1 and S3 in Ecuador 2014, and S2 and S3 in Batan 2012. Finally, under Model LN, the best scenario was S3 in Ecuador 2014, S3 in Batan 2012, and S1 in Batan 2014.

Table 5. Rank averages for the four scenarios for each model resulting from the 10-fold cross-validation implemented.

Scenario Batan 2012 Batan 2014 Ecuador 2014 Batan 2012 Batan 2014 Ecuador 2014
Model NB Model Normal
S1 3.25 2.75 3.5 2.5 1.25 1.75
S2 3.75 2.25 2 2 3 3
S3 1.5 2.5 1 2 1.75 1.75
S4 1.5 2.5 3.5 3.5 4 4
Model Pois Model LN
S1 3.25 2.75 3.5 2.25 1 2.75
S2 3.75 2.25 2 2.25 2.75 2.75
S3 1.5 2.5 1 1.5 2.5 1.25
S4 1.5 2.5 3.5 4 3.75 3.25

Each average was obtained as the mean of the rankings given in Table 4 for the two posterior predictive checks (Cor and MSEP) in each scenario.

Results in Table 4 and Table 5 indicate that the best models in terms of prediction accuracy are Models NB and Pois, since they had better predictions in the validation set based on both posterior predictive checks (Cor and MSEP), although in terms of goodness-of-fit Model LN was the best. These results are in partial agreement with the findings of Montesinos-López et al. (2015c), who concluded that Models NB and Pois are good alternatives for modeling count data, although in their study the best predictions were produced by Model LN; however, their models did not take into account G×E interaction.

Discussion

Generalized linear mixed models (GLMM) are widely recognized as one of the major methodological developments of the second half of the twentieth century. The main factor contributing to their wide applicability over the last 30 years or so has been their flexibility, since they can be applied to many different types of data (Berridge and Crouchley 2011). These types of data include continuous (interval/scale), categorical (including binary and ordinal), count, beta, and other data. Each member of the GLMM family is appropriate for a specific type of data (Berridge and Crouchley 2011). However, GLMM for non-normal data are scarce in the context of genome-enabled prediction, since most of the models developed so far are linear mixed models (mixed models for Gaussian data). For this reason, we believe that developing methods specific to count data for genome-enabled prediction can help to improve the early selection of candidate genotypes when the phenotypes are counts, because using a transformation to approximate the counts to normality, or assuming that the counts are normally distributed, frequently produces poor parameter estimates and lower power; parameter interpretation is also more difficult when a transformation is used (Stroup 2015). However, in genomic selection, the nature of the phenotypic data (dependent variable) is not currently taken into account before deciding on the modeling approach to be used, mainly due to the lack of genome-enabled prediction models for non-normal phenotypes. Although our proposed Bayesian regression models are only for count data, they help fill this lack of genome-enabled prediction models for non-normal data.

Another advantage of our proposed methods for count data is that they take into account the nonlinear relationship between the response and the predictors, and consider the specific properties of counts, including discreteness, non-negativity, and overdispersion (variance greater than the mean); this guarantees that the predicted response will not be negative, which would make no sense for count data. In addition, our methods help in modeling G×E for count data in the context of genome-enabled prediction, which plays a central role in plant breeding for the selection of candidate genotypes that show high stability over a wide range of environmental conditions, and for the prediction of yet-to-be observed phenotypes when the relative performance of genotypes varies across environments.

Another advantage of our proposed method is that the Gibbs sampler has an analytical solution, because we were able to obtain all the required full conditional distributions in closed form. This is important because, of all the computationally intensive methods for fitting complex multilevel models, the Gibbs sampler is the most popular, due to its simplicity and its ability to effectively generate samples from high-dimensional probability distributions (Park and van Dyk 2009). This was possible because we constructed our Gibbs sampler using the data augmentation approach proposed by Polson et al. (2013). For this reason, we believe it is an attractive alternative for fitting the complex count data that arise in the context of genomic selection.

Our proposed methods showed superior performance in terms of prediction accuracy compared to Models Normal and LN. We also observed that, in Models NB and Pois, taking into account G×E considerably increased the prediction accuracy, which was expected since there is ample scientific evidence that including G×E interaction improves prediction accuracy. However, to use these models correctly, it is important to first understand the type of data at hand before deciding on the modeling approach to be used. If the phenotypic data are normally distributed, the linear mixed models for genome-enabled prediction developed so far for Gaussian phenotypes should be used. If the phenotypic data are binary or ordinal categorical, the methods proposed by Montesinos-López et al. (2015a, 2015b) for genome-enabled prediction of ordinal data are preferred. If the phenotypic data are counts (number of panicles per plant, number of seeds per panicle, weed count per plot, etc.), and the counts are small, the models developed in this study, and those proposed by Montesinos-López et al. (2015c), are the best option, since they have clear advantages over the conventional linear mixed models with Gaussian response, as was observed when we applied them to the real data set. We also need to keep in mind that Model Pois will be sufficient when equi-dispersion (equality of mean and variance) is supported by the data at hand. However, when this assumption is violated and the variance of the counts exceeds the mean count, overdispersion is present; in this situation, the most appropriate model is the NB model, because it can control the overdispersion through the scale parameter (r) and improve parameter estimates, power, and predictions (Yaacob et al. 2010). Finally, more research is needed to study the proposed methods using other real data sets, and to extend the proposed genomic-enabled prediction models to deal with a large number of zeros in the count response variable and to model multiple traits.

Acknowledgments

We very much appreciate the International Maize and Wheat Improvement Center (CIMMYT) field collaborators, laboratory assistants, and technicians who collected the phenotypic and genotypic data used in this study.

Appendix A

Derivation of full conditional distribution for all parameters.

Full conditional for  β*

$f(\beta^{*}|y, ELSE) = \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\prod_{t=1}^{n_{ijk}}\Pr(Y_{ijkt}=y_{ijkt}|x_{ik}^T, r, \omega_{ijkt}, b_{1j}, b_{2ij})\,f(\beta^{*})$
$\propto \exp\left(\kappa^T X\beta^{*} + \kappa^T\sum_{h=1}^{2}Z_h b_h - \frac{1}{2}\left(X\beta^{*}+\sum_{h=1}^{2}Z_h b_h\right)^T D_\omega\left(X\beta^{*}+\sum_{h=1}^{2}Z_h b_h\right) - \frac{1}{2}(\beta^{*}-\beta_0)^T\Sigma_0^{-1}\sigma_\beta^{-2}(\beta^{*}-\beta_0)\right)$
$\propto \exp\left(-\frac{1}{2}\left[\beta^{*T}\left(\Sigma_0^{-1}\sigma_\beta^{-2} + X^T D_\omega X\right)\beta^{*} - 2\left(\Sigma_0^{-1}\sigma_\beta^{-2}\beta_0 - X^T D_\omega\sum_{h=1}^{2}Z_h b_h + X^T\kappa\right)^T\beta^{*}\right]\right)$
$\propto \exp\left(-\frac{1}{2}(\beta^{*}-\tilde{\beta}_0)^T\tilde{\Sigma}_0^{-1}(\beta^{*}-\tilde{\beta}_0)\right) \propto N(\tilde{\beta}_0, \tilde{\Sigma}_0)$

where $\tilde{\Sigma}_0 = (\Sigma_0^{-1}\sigma_\beta^{-2} + X^T D_\omega X)^{-1}$ and $\tilde{\beta}_0 = \tilde{\Sigma}_0\left(\Sigma_0^{-1}\sigma_\beta^{-2}\beta_0 - X^T D_\omega\sum_{h=1}^{2}Z_h b_h + X^T\kappa\right)$.

Full conditional for ωijkt

$f(\omega_{ijkt}|y, ELSE) \propto \exp\left[-\frac{\omega_{ijkt}(x_{ik}^T\beta^{*}+b_{1j}+b_{2ij})^{2}}{2}\right]f(\omega_{ijkt};\, y_{ijkt}+r,\, 0) \sim PG(y_{ijkt}+r,\; x_{ik}^T\beta^{*}+b_{1j}+b_{2ij})$

Full conditional for b1:

Defining $\eta_1 = X\beta^{*} + Z_2 b_2$, the conditional distribution of $b_1$ is given as

$f(b_1|y, ELSE) \propto \exp\left(\kappa^T Z_1 b_1 - \frac{1}{2}(Z_1 b_1 + \eta_1)^T D_\omega(Z_1 b_1 + \eta_1)\right)f(b_1|\sigma_{b1}^{2})$
$\propto \exp\left\{-\frac{1}{2}\left[b_1^T\left(\sigma_{b1}^{-2}G_1^{-1} + Z_1^T D_\omega Z_1\right)b_1 - 2\left(Z_1^T\kappa - Z_1^T D_\omega\eta_1\right)^T b_1\right]\right\}$
$\propto \exp\left\{-\frac{1}{2}(b_1-\tilde{b}_1)^T F_1^{-1}(b_1-\tilde{b}_1)\right\} \propto N(\tilde{b}_1, F_1)$

where $F_1 = (\sigma_{b1}^{-2}G_1^{-1} + Z_1^T D_\omega Z_1)^{-1}$ and $\tilde{b}_1 = F_1(Z_1^T\kappa - Z_1^T D_\omega\eta_1)$.

Full conditional for b2:

Defining $\eta_2 = X\beta^{*} + Z_1 b_1$, the conditional distribution of $b_2$ is given as

$f(b_2|y, ELSE) \propto \exp\left(\kappa^T Z_2 b_2 - \frac{1}{2}(Z_2 b_2 + \eta_2)^T D_\omega(Z_2 b_2 + \eta_2)\right)f(b_2|\sigma_{b2}^{2})$
$\propto \exp\left\{-\frac{1}{2}\left[b_2^T\left(\sigma_{b2}^{-2}G_2^{-1} + Z_2^T D_\omega Z_2\right)b_2 - 2\left(Z_2^T\kappa - Z_2^T D_\omega\eta_2\right)^T b_2\right]\right\}$
$\propto \exp\left\{-\frac{1}{2}(b_2-\tilde{b}_2)^T F_2^{-1}(b_2-\tilde{b}_2)\right\} \propto N(\tilde{b}_2, F_2)$

where $F_2 = (\sigma_{b2}^{-2}G_2^{-1} + Z_2^T D_\omega Z_2)^{-1}$ and $\tilde{b}_2 = F_2(Z_2^T\kappa - Z_2^T D_\omega\eta_2)$.

Full conditional for σbh2

$f(\sigma_{bh}^{2}|y, ELSE) \propto \frac{1}{(\sigma_{bh}^{2})^{\frac{\nu_{bh}+n_{bh}}{2}+1}}\exp\left(-\frac{b_h^T G_h^{-1}b_h + \nu_{bh}S_{bh}}{2\sigma_{bh}^{2}}\right)$
$\sim \chi^{-2}\left(\tilde{\nu}_b = \nu_{bh}+n_{bh},\; \tilde{S}_b = \frac{b_h^T G_h^{-1}b_h + \nu_{bh}S_{bh}}{\nu_{bh}+n_{bh}}\right)$

with $n_{b1} = J$, $n_{b2} = IJ$, and h = 1, 2.

Full conditional for σβ*2

$f(\sigma_{\beta^*}^{2}|y, ELSE) \propto \frac{1}{(\sigma_{\beta^*}^{2})^{\frac{\nu_{\beta^*}+I+IK}{2}+1}}\exp\left(-\frac{(\beta^{*}-\beta_0)^T\Sigma_0^{-1}(\beta^{*}-\beta_0) + \nu_{\beta^*}S_{\beta^*}}{2\sigma_{\beta^*}^{2}}\right)$
$\sim \chi^{-2}\left(\tilde{\nu}_{\beta^*} = \nu_{\beta^*}+I+IK,\; \tilde{S}_\beta = \frac{(\beta^{*}-\beta_0)^T\Sigma_0^{-1}(\beta^{*}-\beta_0) + \nu_{\beta^*}S_{\beta^*}}{\tilde{\nu}_{\beta^*}}\right)$

Full conditional for r

To make inference on r, we first place a gamma prior on it, $r \sim G(a_0, 1/b_0)$. Then we infer a latent count L for each observed count, conditional on Y and r. To derive the full conditional of r, we use the following parameterization of the NB distribution: $Y \sim NB(\pi, r)$ with $\pi = \frac{\mu}{r+\mu}$. Since, by construction, $L \sim \mathrm{Pois}[-r\log(1-\pi)]$, we can use the gamma-Poisson conjugacy to update r. Therefore,

$f(r|y, ELSE) \propto f(r)\prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\prod_{t=1}^{n_{ijk}} f(y_{ijkt}|L_{ijkt})\,f(L_{ijkt})$
$\propto r^{a_0-1}\exp(-rb_0)\prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\prod_{t=1}^{n_{ijk}}[-r\log(1-\pi_{ijk})]^{L_{ijkt}}\exp[r\log(1-\pi_{ijk})]$
$\propto r^{a_0+\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}}L_{ijkt}-1}\exp\left[-\left(b_0-\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}}\log(1-\pi_{ijk})\right)r\right]$
$\sim G\left(a_0+\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}}L_{ijkt},\; \frac{1}{b_0-\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{t=1}^{n_{ijk}}\log(1-\pi_{ijk})}\right)$ (A5)

According to Zhou et al. (2012), the conditional posterior distribution of $L_{ijkt}$ is a Chinese restaurant table (CRT) count random variable. That is, $L_{ijkt} \sim \mathrm{CRT}(y_{ijkt}, r)$, and we can sample it as $L_{ijkt} = \sum_{l=1}^{y_{ijkt}} d_l$, where $d_l \sim \mathrm{Bernoulli}\left(\frac{r}{l-1+r}\right)$. For details of the derivation of the CRT random variable, see Zhou and Carin (2012, 2015).

Appendix B

The Pólya-Gamma distribution

According to Polson et al. (2013), a random variable ω has a Pólya-Gamma distribution with parameters b > 0 and $d \in \mathbb{R}$, denoted $\omega \sim PG(b, d)$, if

$\omega \overset{D}{=} \frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{g_k}{(k-\frac{1}{2})^{2} + d^{2}/(4\pi^{2})}$ (B1)

where $g_k \sim \mathrm{Gamma}(b, 1)$ are independent gamma random variables and $\overset{D}{=}$ indicates equality in distribution (Polson et al. 2013). However, it is not easy to simulate Pólya-Gamma random variables from Equation (B1), which is an infinite sum of weighted gamma random variables. To avoid the difficulties that can result from truncating the infinite sum given in Equation (B1), the density of a Pólya-Gamma random variable is expressed as an alternating-sign sum of inverse-Gaussian densities:

$P(\omega; b, d) = \cosh^{b}\left(\frac{d}{2}\right)\frac{2^{b-1}}{\Gamma(b)}\sum_{n=0}^{\infty}(-1)^{n}\frac{\Gamma(n+b)(2n+b)}{\Gamma(n+1)\sqrt{2\pi\omega^{3}}}\exp\left(-\frac{(2n+b)^{2}}{8\omega}-\frac{d^{2}\omega}{2}\right)$ (B2)

where cosh denotes the hyperbolic cosine. A further useful fact is that all finite moments of a Pólya-Gamma random variable are available in closed form (Polson et al. 2013). In particular, the expectation may be calculated directly and is equal to $E(\omega) = \frac{b}{2d}\tanh\left(\frac{d}{2}\right)$, where tanh denotes the hyperbolic tangent. Also, the Pólya-Gamma distribution is closed under convolution for random variates with the same parameter d: if $\omega_1 \sim PG(b_1, d)$ and $\omega_2 \sim PG(b_2, d)$ are independent, then $\omega_1 + \omega_2 \sim PG(b_1+b_2, d)$ (Polson et al. 2013). This property is used to construct the proposed Gibbs sampler. More details on simulating Pólya-Gamma random variates can be found in section 4 of Polson et al. (2013). It is also important to point out that this method for simulating Pólya-Gamma random variables is implemented in the R package BayesLogit.
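
As a quick check on Equation (B1) and the moment formula, the R sketch below draws approximate PG(b, d) variates by truncating the infinite sum at K terms (a crude approximation we use only for illustration; the exact samplers in BayesLogit are preferable in practice) and compares their average with $E(\omega) = \frac{b}{2d}\tanh(\frac{d}{2})$:

```r
# Approximate PG(b, d) draws by truncating the sum in (B1) at K terms.
rpg_trunc <- function(n, b, d, K = 200) {
  k <- 1:K
  denom <- (k - 0.5)^2 + d^2 / (4 * pi^2)
  replicate(n, sum(rgamma(K, shape = b, rate = 1) / denom) / (2 * pi^2))
}

set.seed(4)
b <- 3; d <- 1.5
mean(rpg_trunc(5000, b, d))    # Monte Carlo mean, close to the value below
(b / (2 * d)) * tanh(d / 2)    # closed-form expectation, about 0.635
```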

Footnotes

Communicating editor: D. J. de Koning

Literature Cited

  1. Berridge, D. M., and R. Crouchley, 2011 Multivariate Generalized Linear Mixed Models Using R. CRC Press, Boca Raton.
  2. Cameron A. C., Trivedi P. K., 1986 Econometric models based on count data: comparisons and applications of some estimators and tests. J. Appl. Econ. 1(1): 29–53.
  3. Cavanagh C. R., Chao S., Wang S., Huang B. E., Stephen S., et al., 2013 Genome-wide comparative diversity uncovers multiple targets of selection for improvement in hexaploid wheat landraces and cultivars. Proc. Natl. Acad. Sci. USA 110: 8057–8062.
  4. de los Campos G., Vazquez A. I., Fernando R., Klimentidis Y. C., Sorensen D., 2013 Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9: e1003608.
  5. de los Campos, G., A. Pataki, and P. Pérez, 2014 The BGLR (Bayesian Generalized Linear Regression) R-Package. Available at: http://bglr.r-forge.r-project.org/BGLR-tutorial.pdf. Accessed: November 1, 2015.
  6. Falconi-Castillo, E., 2014 Association mapping for detecting QTLs for Fusarium head blight and yellow rust resistance in bread wheat. Ph.D. Thesis, Michigan State University, East Lansing, Michigan.
  7. Garrod A. E., 1902 The incidence of alkaptonuria: a study in chemical individuality. Lancet 160: 1616–1620.
  8. Gelfand A. E., Smith A. F., 1990 Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85: 398–409.
  9. Geyer C. J., 1992 Practical Markov Chain Monte Carlo. Stat. Sci. 7: 473–483.
  10. Goddard M. E., Hayes B. J., 2009 Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10: 381–391.
  11. Jiao S., Hsu L., Bézieau S., Brenner H., Chan A. T., et al., 2013 SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet. Epidemiol. 37: 452–464.
  12. Kraft P., Yen Y. C., Stram D. O., Morrison J., Gauderman W. J., 2007 Exploiting gene-environment interaction to detect genetic associations. Hum. Hered. 63: 111–119.
  13. Link W. A., Eaton M. J., 2012 On thinning of chains in MCMC. Methods Ecol. Evol. 3: 112–115.
  14. MacEachern S. N., Berliner L. M., 1994 Subsampling the Gibbs sampler. Am. Stat. 48: 188–190.
  15. Montesinos-López O. A., Montesinos-López A., Pérez-Rodríguez P., de los Campos G., Eskridge K. M., et al., 2015a Threshold models for genome-enabled prediction of ordinal categorical traits in plant breeding. G3 (Bethesda) 5: 291–300.
  16. Montesinos-López O. A., Montesinos-López A., Crossa J., Burgueño J., Eskridge K. M., 2015b Genomic-enabled prediction of ordinal data with Bayesian logistic ordinal regression. G3 (Bethesda) 5: 2113–2126.
  17. Montesinos-López O. A., Montesinos-López A., Pérez-Rodríguez P., Eskridge K. M., He X., et al., 2015c Genomic prediction models for count data. J. Agric. Biol. Environ. Stat. 20: 533–554. doi: 10.1007/s13253-015-0223-4
  18. Murcray C. E., Lewinger J. P., Gauderman W. J., 2009 Gene-environment interaction in genome-wide association studies. Am. J. Epidemiol. 169: 219–226.
  19. Park T., van Dyk D. A., 2009 Partially collapsed Gibbs samplers: illustrations and applications. J. Comput. Graph. Stat. 18(2): 283–305.
  20. Pérez-de-Castro A. M., Vilanova S., Cañizares J., Pascual L., Blanca J. M., et al., 2012 Application of genomic tools in plant breeding. Curr. Genomics 13(3): 179.
  21. Polson N. G., Scott J. G., Windle J., 2013 Bayesian inference for logistic models using Pólya-Gamma latent variables. J. Am. Stat. Assoc. 108: 1339–1349.
  22. Quenouille M. H., 1949 A relation between the logarithmic, Poisson, and negative binomial series. Biometrics 5: 162–164.
  23. R Core Team, 2015 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org/. Accessed: September 1, 2015.
  24. Stroup W. W., 2015 Rethinking the analysis of non-normal data in plant and soil science. Agron. J. 107: 811–827.
  25. Teerapabolarn K., Jaioun K., 2014 An improved Poisson approximation for the negative binomial distribution. Applied Mathematical Sciences 8(89): 4441–4445.
  26. Thomas D., 2011 Response to 'Gene-by-environment experiments: a new approach to finding the missing heritability' by Van Ijzendoorn et al. Nat. Rev. Genet. 12: 881.
  27. Turesson G., 1922 The genotypical response of the plant species to the habitat. Hereditas 3: 211–350.
  28. Van Os J., Rutten B., 2009 Gene-environment-wide interaction studies in psychiatry. Am. J. Psychiatry 166: 964–966.
  29. VanRaden P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
  30. Winham S. J., Biernacka J. M., 2013 Gene-environment interactions in genome-wide association studies: current approaches and new directions. J. Child Psychol. Psychiatry 54: 1120–1134.
  31. Yaacob, W. F. W., M. A. Lazim, and Y. B. Wah, 2010 A practical approach in modelling count data, pp. 176–183 in Proceedings of the Regional Conference on Statistical Sciences, Malaysia.
  32. Zhang Z., Ober U., Erbe M., Zhang H., Gao N., et al., 2014 Improving the accuracy of whole genome prediction for complex traits using the results of genome wide association studies. PLoS One 9: e93017.
  33. Zhou, M., L. Li, D. Dunson, and L. Carin, 2012 Lognormal and gamma mixed negative binomial regression, p. 1343 in Proceedings of the International Conference on Machine Learning, vol. 2012. NIH Public Access.
  34. Zhou, M., and L. Carin, 2012 Augment-and-conquer negative binomial processes. Adv. Neural Inf. Process. Syst. 4: 2546–2554.
  35. Zhou M., Carin L., 2015 Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 37(2): 307–320.
