Abstract
Validation of genomic predictions or polygenic risk scores is key for model selection and evaluating the performance of the chosen prediction machinery. Non-parametric validation, such as cross-validation, is popular but does not account for population structure and the fact that the interest could be in validating a set of individuals and not the entire population. Semi-parametric methods, such as the LR method, also use removed records to validate predictions, account for population structure, and allow focus on a specific set of individuals of interest. Confidence intervals are obtained using semi-parametric methods without the need for repeated cross-validation. We developed a tool within the Blupf90 software suite, called validationf90, that allows researchers to conduct semi-parametric validation from the solutions obtained from that software suite. validationf90 calculates different validation statistics and their confidence intervals for a pre-defined set of individuals of interest, reflecting the bias and accuracy of genomic predictions. The program allows for genomic predictions obtained from frequentist and Bayesian methods, as well as for categorical data. validationf90 can validate any model supported by the Blupf90 software suite and can be used with animal, plant, and human datasets. Predictions obtained with other software can be provided to validationf90 as long as the input format matches with the Blupf90 format.
Keywords: validation, accuracy, genomic predictions, polygenic risk scores
Validation of genomic predictions or polygenic risk scores is crucial for model selection and for assessing the performance of the chosen model. Non-parametric cross-validation does not account for population structure or for the fact that not all individuals are of equal interest for validation. Semi-parametric validation overcomes these issues. Bermann et al. developed a software tool called validationf90, which performs semi-parametric validation within the Blupf90 software suite. This type of validation, as well as the software tool, is applicable to animal, plant, and human datasets.
Introduction
Validation of genomic predictions and polygenic risk scores is crucial for model selection and assessing the performance of the current prediction machinery. Statistical validation could be parametric (e.g. Rencher and Schaalje 2008), semi-parametric (Thompson 2001; Legarra et al. 2008; Legarra and Reverter 2018), or non-parametric (Hastie et al. 2009). Parametric validation consists of statistics calculated from evaluating the likelihood, such as the logarithm of the likelihood, Akaike Information Criterion (AIC), or Bayesian Information Criterion (BIC), or as closed-form functions of estimators such as Mallows' Cp, R² from linear regression, and the receiver operating characteristic (ROC) curve, among others. Although parametric validation could be used for genomic prediction models, it emphasizes the fit of the model to the data, whereas the goal of genomic prediction is to obtain estimated breeding values, marker effects, or polygenic risk scores with high accuracy and low bias (Thompson 2001; Daetwyler et al. 2013). Non-parametric validation comprises cross-validation (CV) and bootstrapping (Efron and Tibshirani 1993). On the one hand, CV and its variants, like stratified CV or k-fold CV, consist of splitting the data into training and testing sets and predicting the testing set using predictions obtained from the training set (Hastie et al. 2009). On the other hand, bootstrapping works by sampling testing and training datasets with replacement. Then, in both cases, statistics like correlations and slopes of regressions are obtained by comparing predictions and observations. Non-parametric validation is popular for evaluating genomic predictions due to its simplicity and because some statistics resemble genetic parameters, such as the accuracy of estimated breeding values. However, it does not consider the distributional properties of the genomic predictions (Henderson 1975) and does not directly consider population structure (Thompson 2001; Daetwyler et al. 2013). Legarra and Reverter (2018) developed a semi-parametric method, named the LR method, for validating genomic predictions while considering their distributional properties. The LR method is similar to CV but compares predictions from the complete (whole) dataset against those from a reduced (partial) dataset and provides expected values of those statistics. Thanks to its simplicity and because it considers population structure, the LR method is widely used in the animal breeding field (e.g. Vallejo et al. 2019; Bermann et al. 2021; Massender et al. 2022; Aliakbari et al. 2022; Lázaro et al. 2024) and is gaining importance in the plant breeding field (e.g. Bornhofen et al. 2022; Callister et al. 2022; Guo et al. 2023; Tessema et al. 2024; Raffo et al. 2025).
The Animal Breeding and Genetics group at the University of Georgia has been actively conducting research on genomic selection and genetic prediction methods (https://nce.ads.uga.edu). Many of the methods developed by the group were programmed in the Blupf90 software suite (Lourenco et al. 2022), which is available for researchers and companies and has an average of 2,000 monthly downloads. The Blupf90 software suite is programmed in Fortran and works with executable binaries and plain text parameter files. The software supports genomic information, has great flexibility for statistical modeling, and is programmed to maximize computing efficiency.
Validation statistics from the LR method can be easily calculated with the output files provided by some of the programs of the Blupf90 software suite. As they are computed from random variables, validation statistics have associated uncertainty. Therefore, inference on those validation statistics should account for both their point estimate and their standard error. Although the LR method validation statistics could be obtained directly from the genomic predictions, their standard errors are hard to calculate (Bermann et al. 2024), which makes proper statistical inference infeasible without dedicated software. Therefore, we developed a software tool called validationf90, which helps the user conduct a proper semi-parametric validation by obtaining the LR method statistics and their standard errors. This article aims to show how to validate genomic predictions within the Blupf90 software suite with validationf90 and to introduce its main functionalities.
Materials and methods
The LR method
For genomic predictions, we consider a linear mixed model of the following form:
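$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{W}\mathbf{p} + \mathbf{e} \tag{1}$$
where $\mathbf{y}$ is the vector of phenotypes, $\mathbf{b}$ is the vector of fixed effects, $\mathbf{u}$ is the vector of additive genetic effects, $\mathbf{p}$ is the vector of random effects except for the additive genetic effect, $\mathbf{e}$ is the vector of errors, and $\mathbf{X}$, $\mathbf{Z}$, and $\mathbf{W}$ are incidence matrices.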
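The LR method creates a subset of the whole dataset, called the partial dataset, by removing observations for the testing or focal set. Genomic predictions obtained with the whole dataset are denoted by $\hat{\mathbf{u}}_w$, whereas genomic predictions obtained from the partial dataset are denoted by $\hat{\mathbf{u}}_p$. Based on Henderson (1975), the joint distribution of $\hat{\mathbf{u}}_p$ and $\hat{\mathbf{u}}_w$, assuming that selection is properly taken into account, is:
$$\begin{pmatrix} \hat{\mathbf{u}}_p \\ \hat{\mathbf{u}}_w \end{pmatrix} \sim N\left( \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \begin{pmatrix} \mathbf{G} - \mathbf{C}_p & \mathbf{G} - \mathbf{C}_p \\ \mathbf{G} - \mathbf{C}_p & \mathbf{G} - \mathbf{C}_w \end{pmatrix} \right) \tag{2}$$
where $\mathbf{G} = \mathbf{K}\sigma^2_u$ is the variance of $\mathbf{u}$, and $\mathbf{C}_p$ and $\mathbf{C}_w$ are the prediction error (co)variance matrices of $\mathbf{u}$ obtained with the partial and whole datasets, respectively. Depending on the application, $\mathbf{K}$ could be the additive relationship matrix, a genomic relationship matrix (e.g. VanRaden 2008; Hayes et al. 2009; Yang et al. 2010), a single-step genomic relationship matrix (Legarra et al. 2009), or a kernel-based relationship matrix (Morota and Gianola 2014).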
The validation statistics of the LR method are:
- Bias: the mean difference between $\hat{\mathbf{u}}_p$ (partial dataset) and $\hat{\mathbf{u}}_w$ (whole dataset), which has an expected value of zero if the predictions are unbiased. In the presence of bias, the genetic trend is estimated incorrectly.
- Dispersion or inflation: the slope of the regression of $\hat{\mathbf{u}}_w$ on $\hat{\mathbf{u}}_p$, which has an expected value of one if there is no over- or underestimation of $\hat{\mathbf{u}}_p$. Values lower than one indicate that predictions are overestimated or inflated, while values greater than one indicate that predictions are underestimated or deflated.
- Ratio of accuracies: the Pearson correlation coefficient between $\hat{\mathbf{u}}_w$ and $\hat{\mathbf{u}}_p$, which reflects the increase in accuracy when adding observations to the partial data.
- Reliability: the covariance between $\hat{\mathbf{u}}_w$ and $\hat{\mathbf{u}}_p$, divided by the genetic variance of the focal individuals, which reflects the squared correlation between true and estimated breeding values. The square root of the reliability is the accuracy, that is, the correlation between true and estimated breeding values.
- Predictivity: the correlation between $\hat{\mathbf{u}}_p$ and adjusted phenotypes $\mathbf{y}^*$, divided by the square root of the heritability, which reflects the accuracy of estimated breeding values.
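The point estimates listed above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the code used by validationf90; the function and argument names (`lr_statistics`, `u_whole`, `u_partial`, `var_focal`, `y_adj`, `h2`) are hypothetical, and the direction of the bias (partial minus whole) follows Legarra and Reverter (2018).

```python
from math import sqrt


def mean(x):
    """Sample mean."""
    return sum(x) / len(x)


def cov(x, y):
    """Sample covariance with an n - 1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def corr(x, y):
    """Pearson correlation coefficient."""
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))


def lr_statistics(u_whole, u_partial, var_focal, y_adj=None, h2=None):
    """Point estimates of the LR statistics for one trait and one focal set.

    u_whole, u_partial: predictions for the same focal individuals, same order.
    var_focal: genetic variance of the focal individuals (assumed known).
    y_adj, h2: adjusted phenotypes and heritability, needed only for predictivity.
    """
    stats = {
        "bias": mean(u_partial) - mean(u_whole),
        "dispersion": cov(u_whole, u_partial) / cov(u_partial, u_partial),
        "ratio_of_accuracies": corr(u_whole, u_partial),
        "reliability": cov(u_whole, u_partial) / var_focal,
    }
    if y_adj is not None and h2 is not None:
        stats["predictivity"] = corr(u_partial, y_adj) / sqrt(h2)
    return stats
```

For example, `lr_statistics([2, 4, 6, 8], [1, 2, 3, 4], var_focal=10.0)` gives a dispersion of 2, which in the terminology above would indicate deflated partial predictions.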
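Statistical inference for the validation statistics is carried out with Wald confidence intervals (Wald 1943), $\mathrm{CI} = \theta \pm z_{1-\alpha/2}\,\mathrm{se}(\theta)$, where CI is the confidence interval, $\theta$ is the validation statistic, $z_{1-\alpha/2}$ is the value of the standard normal distribution quantile function for the confidence level $1-\alpha$, $\alpha$ is the significance level, and $\mathrm{se}(\theta)$ is the asymptotic standard error of $\theta$.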
The mathematical expressions for the validation statistics were derived in Legarra and Reverter (2018), whereas Bermann et al. (2024) provided formulae for their asymptotic variances. Details are provided in Table 1.
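As a small worked illustration of the Wald interval, the following Python sketch computes the interval from a point estimate and a standard error that are assumed to be already available. The function name `wald_ci` and the numbers in the usage note are hypothetical.

```python
from statistics import NormalDist


def wald_ci(theta, se, alpha=0.05):
    """Wald confidence interval: theta +/- z_{1 - alpha/2} * se."""
    # Quantile of the standard normal distribution at 1 - alpha/2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return theta - z * se, theta + z * se
```

With a hypothetical dispersion of 0.95 and a standard error of 0.10, `wald_ci(0.95, 0.10)` gives an interval of roughly 0.75 to 1.15, which contains the expected value of one.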
Table 1.
Mathematical expression for the validation statistics and their exact and approximated confidence intervals.
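| Statistic | Mathematical expression | Exact confidence intervals | Approximated confidence intervals |
|---|---|---|---|
| Bias | $\mathrm{mean}(\hat{\mathbf{u}}_p) - \mathrm{mean}(\hat{\mathbf{u}}_w)$ | | |
| Dispersion | $\mathrm{cov}(\hat{\mathbf{u}}_w, \hat{\mathbf{u}}_p) / \mathrm{var}(\hat{\mathbf{u}}_p)$ | | |
| Ratio of accuracies | $\mathrm{corr}(\hat{\mathbf{u}}_w, \hat{\mathbf{u}}_p)$ | | |
| Reliability | $\mathrm{cov}(\hat{\mathbf{u}}_w, \hat{\mathbf{u}}_p) / \sigma^2_{u^*}$ | | |
| Predictivity | $\mathrm{corr}(\mathbf{y}^*, \hat{\mathbf{u}}_p) / h$ | | |
mean, sample mean; cov, sample covariance; var, sample variance; corr, sample correlation coefficient; $\sigma^2_{u^*}$, genetic variance of the focal or testing individuals; $\mathbf{y}^*$, adjusted phenotypes; $h$, square root of the heritability; $\mathbf{C}_p$, prediction error (co)variance for the focal set obtained with the partial dataset; $\mathbf{C}_w$, prediction error (co)variance for the focal set obtained with the complete dataset; $\mathbf{G}$, genetic (co)variance for the focal set; tanh, hyperbolic tangent; $rel_w$, reliabilities for the focal set obtained with the complete dataset; $rel_p$, reliabilities for the focal set obtained with the partial dataset; $\overline{rel}_w$, average of $rel_w$; $\overline{rel}_p$, average of $rel_p$; $\sigma^2_u$, genetic variance.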
Main functionalities
Binaries of validationf90 for different operating systems are available at https://nce.ads.uga.edu/wiki/doku.php?id=distribution. As with many of the programs in the Blupf90 software suite, validationf90 works with a parameter file and specific options to manage its behavior. The program calculates validation statistics and, optionally, their standard errors and 95% confidence intervals for a random effect specified by the user (usually the genetic effect, but it could also be a maternal or permanent environmental effect) and for all the traits available in the model. validationf90 is activated with OPTION validation eff list_1 list_2 … list_n, where eff is the effect number for which validation will be performed and list_1 list_2 … list_n are files containing the identification of the individuals in the focal or testing set. The number of files with identifications can be one (the same testing group is used for all the traits) or equal to the number of traits (each trait has its own testing group).
The program automatically reads two solution files generated by two different runs of blupf90+, one with the whole and the other with the partial dataset, and extracts $\hat{\mathbf{u}}_w$ and $\hat{\mathbf{u}}_p$ for the focal set. The two runs must be carried out by the user, while the extraction of $\hat{\mathbf{u}}_w$ and $\hat{\mathbf{u}}_p$ for the focal set is done internally by validationf90. The default names for the solution files are solutions_whole and solutions_partial for the whole and partial datasets, respectively. The prefix of the solution files can be changed by an option as long as the suffixes _whole and _partial are maintained. Solutions from other statistical packages can be provided to validationf90, as long as the input format matches the Blupf90 solution file format and a parameter file is provided.
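To make the extraction step concrete, the following Python sketch pulls the solutions of a focal set out of the lines of a solution file. It is not the reader implemented in validationf90; it assumes the usual Blupf90 layout of one header line followed by whitespace-separated columns for trait, effect, level, and solution, and the function name `read_focal_solutions` is hypothetical.

```python
def read_focal_solutions(lines, effect, focal_ids, trait=1):
    """Return {level: solution} for the focal individuals of one trait and effect.

    lines: iterable of text lines, e.g. an open solutions_whole file.
    Assumed columns: trait, effect, level, solution (extra columns are ignored).
    """
    focal = set(focal_ids)
    solutions = {}
    for line in lines:
        fields = line.split()
        # Skip the header and any line that does not start with a trait number
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        t, e, level = int(fields[0]), int(fields[1]), fields[2]
        if t == trait and e == effect and level in focal:
            solutions[level] = float(fields[3])
    return solutions
```

Applying the same function to solutions_whole and solutions_partial with the same focal identifiers yields two aligned sets of predictions, provided the identifiers are the renumbered ones used in the solution files.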
After reading the solutions, the program calculates the traits' heritabilities based on the covariance matrices read from the parameter file. These heritabilities are used to calculate the predictivity and the variance of the focal individuals needed for the reliability estimator (see Table 1). Alternatively, the user can provide the variance of the focal set.
validationf90 calculates bias, dispersion, ratio of accuracies, and reliability by default. The user can add OPTION predictive_ability to the parameter file to force the program to calculate the predictivity (i.e. predictive ability) for each trait based on the file yhat_residual, which should be obtained beforehand using the program predictf90.
By default, the program does not calculate standard errors and confidence intervals. To activate their calculation, the user should add OPTION se mode to the parameter file, where mode can be exact for exact asymptotic variances and confidence intervals (see column 3 in Table 1), approx for approximated asymptotic variances (see column 4 in Table 1), or boot for variances calculated by bootstrapping (Efron and Tibshirani 1993). Users may need to add different options when running blupf90+, depending on the chosen mode.
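The options described in this section can be assembled programmatically before appending them to a parameter file. The Python helper below is only a convenience sketch; `validation_options` and its arguments are hypothetical names, and the option strings are the ones given in the text above.

```python
def validation_options(eff, id_files, se_mode=None, predictive_ability=False):
    """Build the OPTION lines for validationf90 described in the text.

    eff: effect number to validate.
    id_files: one file name (shared by all traits) or one file name per trait.
    se_mode: None, "exact", "approx", or "boot".
    """
    if se_mode not in (None, "exact", "approx", "boot"):
        raise ValueError("se_mode must be exact, approx, or boot")
    lines = ["OPTION validation {} {}".format(eff, " ".join(id_files))]
    if predictive_ability:
        lines.append("OPTION predictive_ability")
    if se_mode is not None:
        lines.append("OPTION se {}".format(se_mode))
    return lines
```

For instance, `validation_options(2, ["ids"], se_mode="boot")` returns the two lines `OPTION validation 2 ids` and `OPTION se boot`, which the user would append to the parameter file read by validationf90.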
Running validationf90
Henceforth, eff refers to the effect number to validate, n refers to the number of traits, and ids_i refers to a file containing a column with the identification of the validation individuals for the ith trait; if no subscript is provided, the same list is assumed to hold for all the traits.
The main steps for conducting a validation with validationf90 are as follows:
1. Run renumf90 to renumber the whole dataset.
2. Create the partial dataset by subsetting the renumbered data from the first step.
3. Run blupf90+ with the whole dataset. For blupf90+ versions 2.60 or earlier, the predictions are written to the file solutions, which should be renamed to solutions_whole right after the program finishes. For versions 2.61 or later, the user can either set the name of the solutions file with OPTION solfile solutions_whole before running the program, or run the program without that option and rename the file as for the earlier versions.
4. Run blupf90+ with the partial dataset. The solutions file should be named solutions_partial.
5. Run validationf90 by adding OPTION validation eff ids_1 ids_2 … ids_n if each trait has a different testing set, or OPTION validation eff ids otherwise.
Steps 1 and 2 are common for all ways of running validationf90; therefore, they will be omitted from the following descriptions. The same holds for OPTION validation eff ids_1 ids_2 … ids_n.
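Step 2 can be sketched in Python as a simple filter over the renumbered data. This is one possible way to build the partial dataset under the assumption that every record of a focal individual is dropped; the function name `make_partial_dataset` and the position of the identification column are hypothetical and depend on the user's data file.

```python
def make_partial_dataset(records, focal_ids, id_column):
    """Drop the records of focal individuals from the renumbered data.

    records: list of whitespace-separated data lines (renumbered data file).
    focal_ids: renumbered identifiers of the focal or testing individuals.
    id_column: zero-based position of the individual's identifier in each line.
    """
    focal = set(focal_ids)
    return [r for r in records if r.split()[id_column] not in focal]
```

The focal identifiers must be the renumbered ones, since the subsetting is applied to the output of renumf90 rather than to the raw data.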
To include the calculation of predictive ability, the user should run the following sequence of programs:
1. Run blupf90+ with the whole dataset to obtain solutions_whole.
2. Run predictf90 with OPTION include_effects eff and OPTION solfile solutions_whole.
3. Run blupf90+ with the partial dataset to obtain solutions_partial.
4. Run validationf90 with OPTION predictive_ability.
By running validationf90 as shown, the user will obtain only validation statistics but not confidence intervals. Although the option for calculating predictivity will be omitted from the following sequences, the user can add it as previously explained.
For calculating confidence intervals by bootstrap, the user should:
1. Run blupf90+ with the whole dataset to obtain solutions_whole.
2. Run blupf90+ with the partial dataset to obtain solutions_partial.
3. Run validationf90 with OPTION se boot.
The number of bootstrap samples is 10,000 by default. To modify it, the user can add OPTION nrepboot x to the parameter file of validationf90, where x is the desired number of bootstrap samples.
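The general idea of a bootstrap standard error (Efron and Tibshirani 1993) can be sketched in Python as follows. The text does not describe the resampling scheme used internally by validationf90, so this sketch assumes the simplest one, resampling focal individuals with replacement, which treats them as independent. The names `bias` and `bootstrap_se` are hypothetical.

```python
import random
from math import sqrt


def bias(u_whole, u_partial):
    """LR bias: mean of partial predictions minus mean of whole predictions."""
    n = len(u_whole)
    return sum(u_partial) / n - sum(u_whole) / n


def bootstrap_se(u_whole, u_partial, statistic, n_boot=10000, seed=1):
    """Standard deviation of a statistic over bootstrap samples of individuals."""
    rng = random.Random(seed)
    n = len(u_whole)
    reps = []
    for _ in range(n_boot):
        # Resample individuals, keeping each (whole, partial) pair together
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(statistic([u_whole[i] for i in idx],
                              [u_partial[i] for i in idx]))
    centre = sum(reps) / n_boot
    return sqrt(sum((r - centre) ** 2 for r in reps) / (n_boot - 1))
```

The default of 10,000 replicates mirrors the default stated above; a smaller `n_boot` is enough for a quick check.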
For obtaining confidence intervals based on exact asymptotic variances, the sequence should be:
1. Run blupf90+ with the whole dataset to obtain solutions_whole, adding OPTION store_pev_pec eff full to the parameter file. This will create n binary files named ebv_pev_i_1, i = 1, …, n, which the user should rename to ebv_pev_i_1_whole.
2. Run blupf90+ with the partial dataset to obtain solutions_partial, again adding OPTION store_pev_pec eff full to the parameter file. As before, the user should rename ebv_pev_i_1 to ebv_pev_i_1_partial.
3. Run validationf90 with OPTION se exact.
It is worth noting that calculating exact asymptotic variances requires inverting the mixed model equations; therefore, it could be computationally costly. Inverting the mixed model equations is done internally in blupf90+ when OPTION store_pev_pec eff full is included in the parameter file.
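The renaming required after each blupf90+ run can be scripted. The Python sketch below assumes that the files follow the naming pattern given above, ebv_pev_i_1 for i = 1, …, n, and that they sit in the working directory of the run; the function name `tag_pev_files` is hypothetical.

```python
from pathlib import Path


def tag_pev_files(directory, n_traits, suffix):
    """Rename ebv_pev_i_1 to ebv_pev_i_1_<suffix> for i = 1, ..., n_traits.

    suffix: "whole" after the whole-data run, "partial" after the partial run.
    Returns the new file names.
    """
    renamed = []
    for i in range(1, n_traits + 1):
        src = Path(directory) / "ebv_pev_{}_1".format(i)
        dst = src.with_name("{}_{}".format(src.name, suffix))
        src.rename(dst)
        renamed.append(dst.name)
    return renamed
```

Calling `tag_pev_files(".", n, "whole")` right after the whole-data run, and before starting the partial run, prevents the second run from overwriting the first set of files.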
Finally, obtaining confidence intervals based on approximated asymptotic variances requires approximating accuracies with accf90GS2 or accf90GS3 (hereafter referred to collectively as accf90GSx), which are available under a research agreement. Although blupf90+ can also calculate accuracies, doing so costs as much as inverting the mixed model equations; in that case, one may as well use confidence intervals based on exact asymptotic variances. In any case, the steps for obtaining confidence intervals based on approximated asymptotic variances are:
1. Obtain solutions for the whole dataset and run accf90GSx. Rename the output sol_and_acc to sol_and_acc_whole.
2. Repeat the previous step with the partial dataset to obtain sol_and_acc_partial.
3. Run validationf90 with OPTION se approx and OPTION prefix sol_and_acc.
Details on the optional arguments of validationf90 are provided on the website (https://nce.ads.uga.edu/wiki/doku.php?id=readme.validationf90). Details on how to run renumf90, blupf90+, and other programs from the Blupf90 software suite are in Lourenco et al. (2020).
Application example
We used a simulated dairy cattle dataset, which can be found at https://github.com/masuday/data/tree/master/tutorial/rawfiles. These data consist of a simulated dairy cattle population with a pedigree of 4,641 animals spanning 11 generations. Four traits were simulated as the sum of an overall mean, generation, and sex as fixed effects, and the additive genetic effect and residual as random effects. Semi-parametric validation was carried out on the last generation, which had 461 animals. The parameter files for the full pipeline are in the Supplementary File.
Results and discussion
The parameter file of validationf90 for the simulated dataset is shown in Fig. 1, and the results for the validation statistics are shown in Fig. 2. The full pipeline for replicating the analysis can be found in the Supplementary File. Due to the small size of the data, we chose to use exact confidence intervals. The values of the validation statistics are consistent with their expected values; however, some of the confidence intervals are wide due to the small number of validation individuals. In theory, the values of the reliability should match the squared predictive ability. Disagreements could be explained by the small sample size, the estimates of heritability, and the approximation of the variance of focal individuals (Macedo et al. 2020).
Fig. 1.
Parameter file of validationf90 for the simulated dataset.
Fig. 2.
Output of validationf90 for the simulated dataset. Point estimates, standard errors (SE), and 95% confidence intervals are provided.
Validation of genomic predictions is essential for model selection and assessing the chosen model's performance for genetic and genomic predictions. Prediction of the genetic merit and polygenic risk scores involves unbalanced datasets, the presence of population structure, and uneven importance of individuals for prediction purposes (Daetwyler et al. 2013). In such cases, conducting non-parametric validations such as k-fold cross-validation is unsuitable for evaluating the quality of genomic predictions (Legarra and Reverter 2018). For instance, genomic predictions for dairy cattle involve pedigrees that trace back to 1910; however, the interest is in predicting the genetic merit of young selection candidates. In such a case, one would want to calculate the accuracy of predictions for those young selection candidates while disregarding older animals in the pedigree. Thus, where the population cannot be split randomly or where there is only one way to subset the data for validation, semi-parametric validation, such as the LR method, provides the proper statistical framework for evaluating genomic models. Considering the variation of validation statistics is crucial for comparing methods and for avoiding the conclusion that differences exist between models or methods where there are none. In other words, considering confidence intervals of validation statistics reduces the probability of committing type I errors. In a previous study (Bermann et al. 2024), we showed the importance of considering genomic predictions as random variables when calculating validation statistics such as those presented in this study. If confidence intervals for validation statistics were calculated as if genomic predictions were fixed (i.e. using known formulas for simple linear regression), the confidence intervals would be narrower than they should be. Therefore, researchers could conclude that significant differences exist where there are none. Semi-parametric validation with proper confidence intervals (Bermann et al. 2024) helps avoid such a problem. Since calculating confidence intervals for semi-parametric validation statistics is not an easy task, validationf90 helps researchers properly validate genetic and genomic predictions.
Designing validationf90 as a separate program allows the users to fully customize and run their analysis in steps, ensuring reliable results and a robust pipeline. The software is designed to efficiently read solution files given by blupf90+, making it possible to process the genomic predictions of a few focal individuals from a solution file of hundreds of millions of lines. This design also allows researchers to use different methods for predicting the genetic merit, such as Bayesian methods, and to conduct validation for categorical traits (Bermann et al. 2021). For example, Albiñana et al. (2023) investigated the increase in predictive accuracy for different psychiatric phenotypes, such as attention-deficit/hyperactivity disorder, bipolar disorder, and schizophrenia, among others, by moving from a single-trait to a multiple-trait prediction scenario. They calculated what they called multi-polygenic risk scores from publicly available GWAS using generalized linear models, tree gradient boosting, and SNP-BLUP. The authors performed a 5-fold cross-validation and evaluated predictions by the adjusted variance explained on the liability scale. Using the Blupf90 software suite, studies like that of Albiñana et al. (2023) could also calculate polygenic risk scores from threshold models using gibbsf90+ and perform a semi-parametric validation with validationf90. Mahjani et al. (2020) investigated the presence of maternal effects on the risk for obsessive-compulsive disorder (OCD). In their study, they applied a threshold model with the Blupf90 software suite and concluded that maternal effects on OCD are present because the maternal variance of their model was significantly different from zero. Possible further steps of their study could include validating the maternal effect estimates using validationf90.
Recently, many studies compared genomic predictions obtained from linear mixed models vs those obtained from machine learning methods (e.g. John et al. 2022; Lee et al. 2023; Li et al. 2024). In most studies, the criterion for comparison is the correlation between estimated breeding values and adjusted phenotypes or between predicted and observed phenotypes, while the bias of the predictions is usually overlooked. In any case, inference on those validation statistics is generally done by k-fold cross-validation, which overlooks the covariance structure among the testing individuals and could underestimate the variation of the validation statistics (Bermann et al. 2024). In such cases, we recommend performing a semi-parametric validation for the predictions obtained from linear mixed models, which would possibly provide more accurate confidence intervals and could change the conclusion of the comparison between genomic predictions obtained from linear mixed models and those obtained from machine learning methods.
The LR method validation statistics are conditional on the underlying statistical model, hence, on the chosen covariance matrix for the estimated breeding values. Thus, if the model or the covariance matrix is improper, inference with the LR method could be misleading. All the programs from the Blupf90 software suite allow the user to choose between three different definitions (Leutenegger et al., 2003; Amin et al. 2007; VanRaden 2008; Yang et al. 2010) and scalings (VanRaden 2008; Hayes et al. 2009; Gianola et al. 2009) of the genomic relationship matrix. Furthermore, the user can provide the entire covariance matrix for random effects from an external file. Thanks to these functionalities, the user has full control of the predictions and their validation.
Conclusions
validationf90 is a computing tool integrated into the Blupf90 software suite that allows users to conduct proper semi-parametric validation of genomic predictions or polygenic risk scores. The program gives estimates and confidence intervals of many validation statistics, reflecting the bias and accuracy of predictions. validationf90 accepts outputs from frequentist and Bayesian methods, as well as categorical data. Coupled with high computational efficiency, validationf90 is suitable for performing validation of predictions of any animal, plant, or human genomic datasets.
Supplementary Material
Contributor Information
Matias Bermann, Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA.
Alejandra Álvarez Múnera, Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA.
Ignacy Misztal, Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA.
Daniela Lourenco, Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA.
Data availability
The data are available at https://github.com/masuday/data/tree/master/tutorial/rawfiles. Binaries for different operating systems are available at https://nce.ads.uga.edu/wiki/doku.php?id=distribution. The parameter files to run the full pipeline are provided in the Supplementary File. Supplemental material available at G3 online.
Funding
This study was partially funded by Grant no. 2020-67015-31030 from the National Institute of Food and Agriculture, United States.
Literature cited
- Albiñana C, Zhu Z, Schork AJ, Ingason A, Aschard H, Brikell I, Bulik CM, Petersen LV, Agerbo E, Grove J, et al. 2023. Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nat Commun. 14(1):4702. doi: 10.1038/s41467-023-40330-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aliakbari A, Zemb O, Cauquil L, Barilly C, Billon Y, Gilbert H. 2022. Microbiability and microbiome-wide association analyses of feed efficiency and performance traits in pigs. Genet Select Evol. 54(1):29. doi: 10.1186/s12711-022-00717-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Amin N, van Duijn CM, Aulchenko YS. 2007. A genomic background based method for association analysis in related individuals. PLoS One. 2(12):e1274. doi: 10.1371/journal.pone.0001274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bermann M, Legarra A, Hollifield MK, Masuda Y, Lourenco D, Misztal I. 2021. Validation of single-step GBLUP genomic predictions from threshold models using the linear regression method: an application in chicken mortality. J Animal Breed Genet. 138(1):4–13. doi: 10.1111/jbg.12507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bermann M, Legarra A, Munera AA, Misztal I, Lourenco D. 2024. Confidence intervals for validation statistics with data truncation in genomic prediction. Genet Select Evolut. 56(1):18. doi: 10.1186/s12711-024-00883-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bornhofen E, Fè D, Lenk I, Greve M, Didion T, Jensen CS, Asp T, Janss L. 2022. Leveraging spatiotemporal genomic breeding value estimates of dry matter yield and herbage quality in ryegrass via random regression models. Plant Genome. 15(4):e20255. doi: 10.1002/tpg2.20255. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Callister AN, Bermann M, Elms S, Bradshaw BP, Lourenco D, Brawner JT. 2022. Accounting for population structure in genomic predictions of Eucalyptus globulus. G3 (Bethesda). 12(9):jkac180. doi: 10.1093/g3journal/jkac180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daetwyler HD, Calus MP, Pong-Wong R, de Los Campos G, Hickey JM. 2013. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 193(2):347–365. doi: 10.1534/genetics.112.147983. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Efron B, Tibshirani RJ. 1993. An introduction to the Bootstrap. Chapman and Hall. [Google Scholar]
- Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R. 2009. Additive genetic variability and the Bayesian alphabet. Genetics. 183(1):347–363. doi: 10.1534/genetics.109.103952. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guo X, Sarup P, Jahoor A, Jensen J, Christensen OF. 2023. Metabolomic-genomic prediction can improve prediction accuracy of breeding values for malting quality traits in barley. Genet Select Evol. 55(1):61. doi: 10.1186/s12711-023-00835-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer. [Google Scholar]
- Hayes BJ, Visscher PM, Goddard ME. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res. 91(1):47–60. doi: 10.1017/S0016672308009981. [DOI] [PubMed] [Google Scholar]
- Henderson CR. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics. 31(2):423–447. doi: 10.2307/2529430. [DOI] [PubMed] [Google Scholar]
- John M, Haselbeck F, Dass R, Malisi C, Ricca P, Dreischer C, Schultheiss SJ, Grimm DG. 2022. A comparison of classical and machine learning-based phenotype prediction methods on simulated data and three plant species. Front Plant Sci. 13:932512. doi: 10.3389/fpls.2022.932512. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lázaro SF, Tonhati H, Oliveira HR, Silva AA, Scalez DCB, Nascimento AV, Santos DJA, Stefani G, Carvalho IS, Sandoval AF, et al. 2024. Genetic parameters and genome-wide association studies for mozzarella and milk production traits, lactation length, and lactation persistency in Murrah buffaloes. J Dairy Sci. 107(2):992–1021. doi: 10.3168/jds.2023-23284. [DOI] [PubMed] [Google Scholar]
- Lee HJ, Lee JH, Gondro C, Koh YJ, Lee SH. 2023. deepGBLUP: joint deep learning networks and GBLUP framework for accurate genomic prediction of complex traits in Korean native cattle. Genet Select Evolut. 55(1):56. doi: 10.1186/s12711-023-00825-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Legarra A, Aguilar I, Misztal I. 2009. A relationship matrix including full pedigree and genomic information. J Dairy Sci. 92(9):4656–4663. doi: 10.3168/jds.2009-2061. [DOI] [PubMed] [Google Scholar]
- Legarra A, Reverter A. 2018. Semi-parametric estimates of population accuracy and bias of predictions of breeding values and future phenotypes using the LR method. Genet Select Evol. 50(1):53. doi: 10.1186/s12711-018-0426-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Legarra A, Robert-Granié C, Manfredi E, Elsen JM. 2008. Performance of genomic selection in mice. Genetics 180(1):611–618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leutenegger AL, Prum B, Génin E, Verny C, Lemainque A, Clerget-Darpoux F, Thompson EA. 2003. Estimation of the inbreeding coefficient through use of genomic data. Am J Hum Genet. 73(3):516–523. doi: 10.1086/378207.
- Li M, Hall T, MacHugh DE, Chen L, Garrick D, Wang L, Zhao F. 2024. KPRR: a novel machine learning approach for effectively capturing nonadditive effects in genomic prediction. Brief Bioinformatics. 26(1):1–9. doi: 10.1093/bib/bbae683.
- Lourenco D, Legarra A, Tsuruta S, Masuda Y, Aguilar I, Misztal I. 2020. Single-step genomic evaluations from theory to practice: using SNP chips and sequence data in BLUPF90. Genes (Basel). 11(7):790. doi: 10.3390/genes11070790.
- Lourenco D, Tsuruta S, Masuda Y, Bermann M, Legarra A, Misztal I. 2022. Recent updates in the BLUPF90 software suite. Proceedings of the 12th World Congress on Genetics Applied to Livestock Production; 3–8 July 2022; Rotterdam.
- Macedo F, Reverter A, Legarra A. 2020. Behavior of the linear regression method to estimate bias and accuracies with correct and incorrect genetic evaluation models. J Dairy Sci. 103(1):529–544. doi: 10.3168/jds.2019-16603.
- Mahjani B, Klei L, Hultman CM, Larsson H, Devlin B, Buxbaum JD, Sandin S, Grice DE. 2020. Maternal effects as causes of risk for obsessive–compulsive disorder. Biol Psychiatry. 87(12):1045–1051. doi: 10.1016/j.biopsych.2020.01.006.
- Massender E, Brito LF, Maignel L, Oliveira HR, Jafarikia M, Baes CF, Sullivan B, Schenkel FS. 2022. Single-step genomic evaluation of milk production traits in Canadian alpine and Saanen dairy goats. J Dairy Sci. 105(3):2393–2407. doi: 10.3168/jds.2021-20558.
- Morota G, Gianola D. 2014. Kernel-based whole-genome prediction of complex traits: a review. Front Genet. 5:363. doi: 10.3389/fgene.2014.00363.
- Raffo MA, Sarup P, Jensen J, Guo X, Jensen JD, Orabi J, Jahoor A, Christensen OF. 2025. Genomic prediction for yield and malting traits in barley using metabolomic and near-infrared spectra. Theor Appl Genet. 138(1):24. doi: 10.1007/s00122-024-04806-7.
- Rencher A, Schaalje B. 2008. Linear Models in Statistics. Hoboken: Wiley.
- Tessema BB, Raffo MA, Guo X, Svane SF, Krusell L, Jensen JD, Ruud AK, Malinowska M, Thorup-Kristensen K, Jensen J. 2024. Genomic prediction for root and yield traits of barley under a water availability gradient: a case study comparing different spatial adjustments. Plant Methods. 20(1):8. doi: 10.1186/s13007-023-01121-y.
- Thompson R. 2001. Statistical validation of genetic models. Livestock Prod Sci. 72(1–2):129–134. doi: 10.1016/S0301-6226(01)00273-1.
- Vallejo RL, Cheng H, Fragomeni BO, Shewbridge KL, Gao G, MacMillan JR, Towner R, Palti Y. 2019. Genome-wide association analysis and accuracy of genome-enabled breeding value predictions for resistance to infectious hematopoietic necrosis virus in a commercial rainbow trout breeding population. Genet Select Evol. 51(1):47. doi: 10.1186/s12711-019-0489-z.
- VanRaden PM. 2008. Efficient methods to compute genomic predictions. J Dairy Sci. 91(11):4414–4423. doi: 10.3168/jds.2007-0980.
- Wald A. 1943. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans Am Math Soc. 54(3):426–482. doi: 10.1090/S0002-9947-1943-0012401-3.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 42(7):565–569. doi: 10.1038/ng.608.
Data Availability Statement
The data are available at https://github.com/masuday/data/tree/master/tutorial/rawfiles. Binaries for different operating systems are available at https://nce.ads.uga.edu/wiki/doku.php?id=distribution. The parameter files to run the full pipeline are provided in the Supplementary File. Supplemental material is available at G3 online.


