Abstract
Genetic dissection and breeding by design for polygenic traits remain substantial challenges. To address these challenges, it is important to identify as many genes as possible, including key regulatory genes. Here, we developed a genome-wide scanning plus machine learning framework, integrated with advanced computational techniques, to propose a novel algorithm named Fast3VmrMLM. This algorithm aims to enhance the identification of abundant and key genes for polygenic traits in the era of big data and artificial intelligence. The algorithm was extended to identify haplotype (Fast3VmrMLM-Hap) and molecular (Fast3VmrMLM-mQTL) variants. In simulation studies, Fast3VmrMLM outperformed existing methods in detecting dominant, small, and rare variants, requiring only 3.30 and 5.43 h (20 threads) to analyze the 18K rice and UK Biobank-scale datasets, respectively. Fast3VmrMLM identified more known (211) and candidate (384) genes for 14 traits in the 18K rice dataset than FarmCPU (100 known genes). Additionally, it identified 26 known and 24 candidate genes for seven yield-related traits in a maize NC II design; Fast3VmrMLM-mQTL identified two known soybean genes near structural variants. We demonstrated that this novel two-step framework outperformed genome-wide scanning alone. In breeding by design, a genetic network constructed via machine learning using all known and candidate genes identified in this study revealed 21 key genes associated with rice yield-related traits. All associated markers yielded high prediction accuracies in rice (0.7443) and maize (0.8492), enabling the development of superior hybrid combinations. A new breeding-by-design strategy based on the identified key genes was also proposed. This study provides an effective method for gene mining and breeding by design.
Key words: Fast3VmrMLM, machine learning, large-scale data, polygenic trait, efficient gene mining, breeding by design
This study presents Fast3VmrMLM, a novel algorithm that integrates genome-wide scanning with machine learning and computational techniques to accelerate gene mining and breeding by design for polygenic traits using large-scale SNP-, haplotype-, and molecular-based GWAS datasets. The software runs efficiently on a Linux server with 20 threads and 1 TB of memory. Applying this approach to rice, the study identifies 21 key genes associated with yield-related traits through a genetic network constructed from both known and candidate genes, thereby proposing a new strategy for breeding by design based on these key genetic targets.
Introduction
Polygenic traits contribute to many critical breeding and production traits in crops, including resistance to rice blast, wheat Fusarium head blight, Sclerotinia in oilseed rape, and cotton Fusarium wilt. For these traits, gene identification remains difficult in genetic studies (Purugganan and Jackson, 2021), and phenotype improvement through the alteration of only a few genes is often ineffective because these traits are controlled by numerous polygenes (Yi and Xu, 1999; Kroymann and Mitchell-Olds, 2005). In particular, during the post-Green Revolution era, many key genes have already been utilized (Kumar et al., 2023), but effective resistant and high-yielding varieties have not been developed for most highly destructive diseases.
The genetic dissection and breeding of polygenic traits remain major challenges (Purugganan and Jackson, 2021; Lappalainen et al., 2024). In early efforts to mine genes, marker analysis was proposed; genome scanning is now widely used in quantitative trait locus (QTL) mapping (Lander and Botstein, 1989), genome-wide association studies (GWAS) (Risch and Merikangas, 1996), and bulked segregant analysis (Schneeberger et al., 2009). These methods often focus on strong signals from genes with large effects, substantially reducing the power to detect genes with smaller effects (Wang et al., 2016; Wen et al., 2018). In GWAS, commonly used approaches mainly consider the allelic substitution effect rather than incorporating both additive and dominant effects. Additionally, they control for the polygenic background of the allelic substitution effect but not for the complete additive and dominant polygenic backgrounds, further reducing gene detection power (Li et al., 2022a). To address this issue, Li et al. (2022a) established a compressed variance component mixed model, termed 3VmrMLM. However, the increasing availability of large genomic and phenomic datasets poses additional challenges for statistical computation. In crop breeding by design for polygenic traits, more than a century of recurrent selection has been validated in maize and soybean (Werner and Wilcox, 1990; Lambert, 1994; Dudley and Lambert, 2010). Nevertheless, recurrent selection requires a long time, and the absence of known genes contributes to uncertainty in crop improvement efforts. Recently, Wang et al. (2019) sequenced the genome of Tetep, a rice cultivar with broad-spectrum resistance to blast, cloned and tested a set of NLR genes, and designed an extensive set of molecular markers to rapidly introduce clustered and paired NLRs from the Tetep genome into new resistant cultivars. This work underscores the genetic foundation for gene identification in the improvement of polygenic traits.
To support gene mining and crop breeding by design for polygenic traits, we proposed a fast, efficient, and large-scale GWAS method, Fast3VmrMLM, to identify more genes through genetic analysis, then constructed a gene network for rice yield-related traits to determine key genes for breeding. In Fast3VmrMLM, the conventional genome-wide scanning step for identifying significantly associated markers (SAMs) is replaced by a two-step approach. The first step involves genome-wide scanning combined with fast and large-scale data algorithms (Loh et al., 2015; Zhou et al., 2018) to select potentially associated markers. The second step uses machine learning to identify SAMs. Additionally, Fast3VmrMLM has been extended to detect haplotype (Fast3VmrMLM-Hap) and molecular (Fast3VmrMLM-mQTL) variants. In the construction of the genetic network, machine learning was utilized to capture complex genetic relationships by identifying quantitative trait nucleotide (QTN)-by-QTN interactions. This new framework was validated through a series of simulation studies and analyses of real datasets in rice and maize.
Results
Strategies for genetic dissection and breeding by design of polygenic traits
Polygenic traits are crucial in crop breeding and production. However, challenges persist in identifying their underlying genes and enhancing phenotypic values. In gene mining, commonly used methods rely on genome-wide scanning and focus on only the strongest signals, which may overlook important loci and fail to consider the complexity of polygenic traits. To address these limitations, we developed a novel framework and proposed a fast, efficient, and large-scale algorithm named Fast3VmrMLM. In this method, genome-wide scanning is first used to select potentially associated markers using a relatively loose threshold, rather than to identify SAMs. Machine learning is subsequently applied to select markers with non-zero effects, and likelihood ratio testing is used to identify SAMs (Figure 1). Specifically, a full model based on 3VmrMLM (Li et al., 2022a)—which incorporates additive and dominant effects, along with their respective polygenic backgrounds—is combined with the new genome-wide scanning and machine learning framework to enhance the power of gene detection. Nonetheless, the algorithm continues to face computational challenges in the era of big data and artificial intelligence (AI).
Figure 1.
Strategies for genetic dissection and crop breeding of polygenic traits.
In genetic dissection, genome-wide scanning is used to select potentially associated markers rather than to identify SAMs. These selected markers are subsequently evaluated to identify SAMs for polygenic traits using machine learning and a likelihood ratio test. In crop breeding, all QTNs with known or candidate genes in one or multiple datasets are used to identify QTN-by-QTN interactions (QQIs) for each polygenic trait via machine learning. These QQIs for a given polygenic trait are used to construct a genetic network to identify key genes. All significant loci are also used to perform genomic selection and to predict superior hybrid combinations. Figures depicting rice and maize plants were created using BioGDP.com (Jiang et al., 2025).
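The two-step logic described above can be illustrated with a minimal Python sketch. It is not the Fast3VmrMLM implementation: the scan uses ordinary single-marker regression with a normal approximation, a plain coordinate-descent lasso stands in for the empirical Bayes step, and dominance effects and polygenic backgrounds are ignored. All function names, the loose threshold (0.01), the penalty, and the LOD cut-off of 3.0 are assumptions chosen for illustration.

```python
import math
import numpy as np

def scan_pvalues(X, y):
    """Step 1: single-marker regression scan, one p value per marker."""
    n = len(y)
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    # Two-sided p value from a normal approximation to the t statistic.
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

def lasso_cd(X, y, lam, n_iter=200):
    """Step 2a: coordinate-descent lasso (stand-in for empirical Bayes)."""
    n, m = X.shape
    Xc, resid = X - X.mean(axis=0), y - y.mean()
    beta, col_ss = np.zeros(m), ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(m):
            resid = resid + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_ss[j]
            resid = resid - Xc[:, j] * beta[j]
    return beta

def rss(X, y):
    """Residual sum of squares of y regressed on an intercept plus X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ coef) ** 2).sum())

def two_step(X, y, loose_p=0.01, lam=0.05, lod_min=3.0):
    """Scan with a loose threshold, shrink, then keep markers passing an LRT."""
    cand = np.flatnonzero(scan_pvalues(X, y) < loose_p)
    kept = cand[lasso_cd(X[:, cand], y, lam) != 0.0]
    rss_full, n, sams = rss(X[:, kept], y), len(y), []
    for j in kept:
        # Step 2b: likelihood ratio test of marker j, expressed as a LOD score.
        reduced = rss(X[:, kept[kept != j]], y)
        lod = n * math.log(reduced / rss_full) / (2.0 * math.log(10.0))
        if lod >= lod_min:
            sams.append(int(j))
    return sams

# Toy data: 500 individuals, 200 markers, two simulated causal markers.
rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(500, 200)).astype(float)
y = 1.0 * X[:, 10] - 0.8 * X[:, 50] + rng.normal(size=500)
sams = two_step(X, y)
print(sams)
```

The point of the design is that the scan only has to avoid missing true loci, so its threshold can be loose; the multi-locus step then decides which candidates survive.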
To achieve fast computing on large genomic datasets, the five variance components in the full model described above were compressed into three (Li et al., 2022a). The preconditioned conjugate gradient (PCG) method (Kaasschieter, 1988) was used to replace the decomposition and inversion of large matrices in the mixed linear model solution and machine learning process. Marker effect estimation and the corresponding Wald test in 3VmrMLM were replaced by a vectorized Wald test (Li et al., 2024), based on the conditional expectation method (Henderson, 1975; Xu et al., 2014) and the GRAMMAR-gamma approximation (Svishcheva et al., 2012), which reduces the computational complexity of variance-covariance estimation.
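As a rough illustration of how PCG sidesteps explicit matrix inversion, the sketch below solves V x = b for V = σg²K + σe²I using a Jacobi (diagonal) preconditioner. The kinship is assumed to be K = ZZᵀ/m for a standardized genotype matrix Z, so each product V v is computed from Z directly and the n × n matrix is never formed. The preconditioner, the kinship definition, the variance values, and all names are assumptions made for this example rather than details taken from the Fast3VmrMLM software.

```python
import numpy as np

def pcg_solve(matvec, b, diag, tol=1e-8, max_iter=500):
    """Solve V x = b for symmetric positive-definite V.

    matvec(v) returns V @ v; diag is the diagonal of V, used as a
    Jacobi preconditioner.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = r / diag
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Vp = matvec(p)
        step = rz / (p @ Vp)
        x = x + step * p
        r = r - step * Vp
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy genotypes: 300 individuals, 1000 markers, standardized by column.
rng = np.random.default_rng(0)
n, m = 300, 1000
G = rng.binomial(2, 0.4, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)
sigma_g2, sigma_e2 = 0.6, 0.4  # illustrative variance components

def V_times(v):
    """V @ v with V = sigma_g2 * Z Z^T / m + sigma_e2 * I, never forming V."""
    return sigma_g2 * (Z @ (Z.T @ v)) / m + sigma_e2 * v

diag_V = sigma_g2 * (Z**2).sum(axis=1) / m + sigma_e2
y = rng.normal(size=n)
x = pcg_solve(V_times, y, diag_V)
print(np.max(np.abs(V_times(x) - y)))
```

Each iteration costs two products with Z, so memory stays proportional to the genotype matrix rather than to n².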
To accommodate large-scale datasets with sample sizes in the tens of thousands, the kinship matrix was first integrated into the PCG iterations (Kang et al., 2008; Loh et al., 2015) to avoid storing an n × n matrix in memory. The Woodbury matrix identity (Golub and van Loan, 1996) was then used to reduce the dimensionality of the large matrix in the machine learning process. Additionally, a binary data structure was utilized to store the genotypic data.
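The Woodbury matrix identity, (A + UCUᵀ)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + UᵀA⁻¹U)⁻¹UᵀA⁻¹, replaces an n × n inversion with a k × k one when k candidate markers are far fewer than n individuals. The following sketch checks the identity numerically for A = σe²I, U = X (an n × k matrix of selected markers), and C a diagonal matrix of prior variances. This particular choice of A, U, and C is an assumption for illustration; how the identity is applied inside Fast3VmrMLM is described in the supplemental texts, not here.

```python
import numpy as np

def woodbury_inverse_apply(X, prior_var, sigma_e2, b):
    """Return (sigma_e2 * I + X diag(prior_var) X^T)^{-1} b via a k x k solve."""
    small = np.diag(1.0 / prior_var) + (X.T @ X) / sigma_e2  # k x k system
    correction = X @ np.linalg.solve(small, X.T @ b) / sigma_e2**2
    return b / sigma_e2 - correction

rng = np.random.default_rng(0)
n, k = 400, 15
X = rng.normal(size=(n, k))
prior_var = rng.uniform(0.1, 1.0, size=k)
b = rng.normal(size=n)

fast = woodbury_inverse_apply(X, prior_var, 0.5, b)
# Reference answer from the full n x n system, for comparison only.
direct = np.linalg.solve(0.5 * np.eye(n) + (X * prior_var) @ X.T, b)
print(np.max(np.abs(fast - direct)))
```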
To identify key genes for polygenic traits, all QTNs closely linked to known or candidate genes (Wei et al., 2021) were used to detect QTN-by-QTN interactions via a machine learning approach. These interactions, obtained across multiple datasets, were used to construct a genetic network. Hub genes in the resulting network were considered key genes for polygenic traits.
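Once significant interactions have been mapped to gene pairs, hub detection reduces to counting distinct partners per gene. The sketch below assumes the interactions are already available as a list of gene pairs; the gene names are placeholders, and the default cut-off of seven partners follows the definition of hub genes used in the rice analysis of this study.

```python
from collections import defaultdict

def hub_genes(interactions, min_degree=7):
    """Return genes with at least min_degree distinct interaction partners."""
    partners = defaultdict(set)
    for gene_a, gene_b in interactions:
        if gene_a != gene_b:
            partners[gene_a].add(gene_b)
            partners[gene_b].add(gene_a)
    degree = {gene: len(p) for gene, p in partners.items()}
    # Highest degree first; ties broken alphabetically.
    return sorted((g for g, d in degree.items() if d >= min_degree),
                  key=lambda g: (-degree[g], g))

# Placeholder network: geneA interacts with eight partners.
toy_pairs = [("geneA", f"gene{i}") for i in range(8)] + [("gene0", "gene1")]
print(hub_genes(toy_pairs))
```

Using sets means that several QQIs between the same two genes count once, so the degree reflects gene pairs rather than individual QTN pairs.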
The above strategy is summarized in Figure 1.
Monte Carlo simulation studies
Comparison of four mixed models using the Fast3VmrMLM method
To evaluate the advantage of the compressed variance component mixed model used in Fast3VmrMLM, four models (Figure 2A) were compared in Monte Carlo simulation studies under four scenarios: no polygenic background (I-1), additive-dominant background (I-2), epistatic background (I-3), and additive-dominant-epistatic background (I-4) (Supplemental Table 1). Statistical power and false-positive rates (FPRs) across different p value thresholds (from 1.0e−39 to 9.0e−3) were used to generate receiver operating characteristic curves. The novel model implemented in Fast3VmrMLM exhibited the largest area under the curve (Figure 2A), demonstrating superior performance compared with the other models.
Figure 2.
Comparison of different models and methods in Monte Carlo simulation studies.
(A) Comparison of the average power (y axis) of four mixed linear models at different observed FPRs (x axis) in simulation datasets I-1 to I-4 (1000 replicates).
(B) Comparison of the average power (y axis) of the Fast3VmrMLM with existing methods (3VmrMLM, FarmCPU, and EMMAX) at different p value thresholds (x axis) (100 replicates).
(C) Effect of genetic modes and minor allele frequencies (x axis) on average power (y axis) for new and existing methods (100 replicates).
(D) Running times (y axis) of Fast3VmrMLM, FarmCPU, and EMMAX with different sample sizes (x axis).
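The ROC-style comparison in Figure 2A rests on computing power and FPR at a series of p value thresholds and then integrating the resulting curve. A simplified sketch is given below. It treats each marker as either causal or null and counts a causal marker as detected only if its own p value passes the threshold, which is a simplification; the function names and toy p values are assumptions for illustration.

```python
def roc_points(pvals, is_causal, thresholds):
    """Return (FPR, power) at each p value threshold."""
    n_causal = sum(is_causal)
    n_null = len(is_causal) - n_causal
    points = []
    for t in thresholds:
        tp = sum(1 for p, c in zip(pvals, is_causal) if c and p <= t)
        fp = sum(1 for p, c in zip(pvals, is_causal) if not c and p <= t)
        points.append((fp / n_null, tp / n_causal))
    return points

def trapezoid_auc(points):
    """Area under the (FPR, power) curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy example: three causal and four null markers.
pvals = [1e-10, 1e-4, 0.2, 1e-3, 0.03, 0.5, 0.8]
is_causal = [True, True, True, False, False, False, False]
points = roc_points(pvals, is_causal, [1e-8, 1e-3, 0.05, 1.0])
auc = trapezoid_auc(points)
print(points, auc)
```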
Performance comparison of Fast3VmrMLM, FarmCPU, and EMMAX
To confirm the advantage of the new method (Fast3VmrMLM) over existing methods (3VmrMLM, FarmCPU, and EMMAX), simulation datasets I-1 to I-4 were used to evaluate the four approaches. The average power under the Bonferroni correction threshold (p < 5.00e−7) for all 10 simulated QTNs was 92.12%, 97.00%, 46.20%, and 36.00% for Fast3VmrMLM, 3VmrMLM, FarmCPU, and EMMAX, respectively (Supplemental Table 2; Figure 2B). The first two methods exhibited significantly higher power than the latter two, primarily due to the lower power of FarmCPU and EMMAX in identifying the fifth to eighth QTNs with α = 0 (Supplemental Tables 1 and 2). The average FPRs for the above four methods were 0.02‱, 0.20‱, 0.04‱, and 0.00‱, respectively, indicating that Fast3VmrMLM offers a superior balance between high power and low FPR (Supplemental Table 3). Additionally, Fast3VmrMLM yielded lower mean squared error (MSE) and mean absolute deviation (MAD) for QTN effect estimates compared with existing methods (Supplemental Tables 4 and 5).
To further validate these findings using real marker genotypic datasets, simulated phenotypic datasets II-1 to II-4, based on the true parameters listed in Supplemental Table 6, were analyzed with the same four methods. The average power under the Bonferroni correction threshold was 78.94%, 76.81%, 60.79%, and 31.08%, respectively (Supplemental Table 7). The average FPRs were 0.53‱, 1.34‱, 0.54‱, and 2.98‱, respectively, with similar trends observed for false discovery rate (FDR), false negative rate (FNR), and F1 score (Supplemental Table 8). Fast3VmrMLM also showed lower MSE and MAD for both QTN effect estimates and positional estimates compared with the other methods (Supplemental Tables 9 and 12). These results support a similar conclusion.
Comparison of Fast3VmrMLM with existing methods under different genetic modes and allelic frequency combinations
To investigate the effect of genetic mode and allelic frequency combinations on the performance of the new and existing methods, nine scenarios were considered in simulation dataset III, comprising three genetic modes (additive, dominant, and additive-dominant) and three allelic frequencies (0.50, 0.25, 0.05) (Supplemental Text 1). Fast3VmrMLM showed consistent average power (64.89%–66.58%) across the three genetic modes, while FarmCPU and EMMAX exhibited varying average powers in the additive (67.06% and 54.47%), dominant (25.56% and 19.28%), and additive-dominant (48.58% and 35.53%) modes (Supplemental Table 13; Figure 2C). The reduced power of FarmCPU and EMMAX resulted from their poor ability to detect dominant QTNs with allelic frequencies of 0.25 (≤8.50%) and 0.50 (≈0.00%). This is theoretically expected because α = a + d(q − p), where a and d represent the additive and dominant effects of a QTN, respectively, and q and p denote the allelic frequencies of alleles A and a, respectively. For a purely dominant QTN (a = 0), α approaches zero as the two allelic frequencies approach 0.50, so methods that test only α lose power. Additionally, all three methods exhibited low FPRs and MSEs (Supplemental Tables 14 and 15).
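The dependence of α on allele frequency can be checked with a few lines of arithmetic. The sketch below evaluates α = a + d(q − p) with q = 1 − p for a purely dominant QTN (a = 0, d = 1) at the three simulated frequencies; the function name and the unit dominance effect are assumptions for illustration.

```python
def allele_substitution_effect(a, d, p):
    """alpha = a + d(q - p), where q = 1 - p."""
    q = 1.0 - p
    return a + d * (q - p)

# Purely dominant QTN: a = 0, d = 1.
dominant_alpha = {freq: allele_substitution_effect(0.0, 1.0, freq)
                  for freq in (0.50, 0.25, 0.05)}
for freq, alpha in dominant_alpha.items():
    print(f"frequency {freq:.2f}: alpha = {alpha:.2f}")
```

At a frequency of 0.50 the substitution effect vanishes entirely, and at 0.25 it is only half of d, which matches the pattern of near-zero and low power reported for methods that model α alone.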
To further assess type I error rates for the new and existing methods, simulation dataset IV, which contained no QTNs, was analyzed. Association tests were conducted for 10^8 genetic variants (Supplemental Text 1; Kang et al., 2008). Type I error rates under the Bonferroni correction threshold (p < 5.00e−7) were 4.80e−7, 1.90e−7, and 1.0e−8 for Fast3VmrMLM, FarmCPU, and EMMAX, respectively, indicating no inflation of type I error in Fast3VmrMLM or the existing methods.
Comparison of running times and memory consumption for Fast3VmrMLM and existing methods across different large-scale sample sizes
To evaluate the performance of these GWAS methods on large-scale datasets, the new and existing methods were compared across five sample size levels: 4000 (V-1), 8000 (V-2), 12 000 (V-3), 16 000 (V-4), and 20 000 (V-5), each with 100 000 SNPs. The average running time over five replicates ranged from 44.57 s to 289.19 s for Fast3VmrMLM, from 39.76 s to 546.88 s for FarmCPU, and from 322.73 s to 14 370.00 s for EMMAX in the five simulated datasets (Supplemental Table 16; Figure 2D). Although the shortest FarmCPU run was marginally faster than the shortest Fast3VmrMLM run, Fast3VmrMLM was nearly twice as fast at the upper end of the range, and both methods were considerably faster than EMMAX. Memory consumption ranged from 0.70 to 1.28 GB for Fast3VmrMLM, from 1.73 to 25.76 GB for FarmCPU, and from 1.22 to 26.16 GB for EMMAX across the five datasets, indicating lower memory requirements for Fast3VmrMLM compared with the other two methods. Furthermore, Fast3VmrMLM (6.53 min) ran much faster than FarmCPU (75.67 min) on a dataset comprising 40 000 individuals and 100 000 SNPs (20 replicates).
To assess the performance of Fast3VmrMLM on a UK Biobank-scale dataset, the method was applied to a simulated dataset of 500 000 individuals, each with one million SNPs. The runtime and memory consumption were 5.43 h and 120.29 GB, respectively (Supplemental Table 16). Notably, Fast3VmrMLM successfully identified QTNs with frequencies as low as 0.03% (Supplemental Table 17), highlighting its potential utility for Biobank-scale analyses. All simulation studies were conducted on a server with 1.0 TB of memory and 20 CPUs. We observed that both memory usage and runtime linearly increased with the number of markers, rather than the number of individuals.
Comparison of Fast3VmrMLM-Hap with existing methods in identifying rare haplotypic variants
In simulated dataset VI, haplotypes were used to simulate functional elements in real rice genomics (Supplemental Text 1). Among the 10 simulated causal variants, four were rare (minor allele frequency <5%; Supplemental Table 18). The power to detect all 10 variants was highest for 3VmrMLM (81.30%), followed by Fast3VmrMLM-Hap (76.20%), FarmCPU (74.60%), Fast3VmrMLM (66.10%), and EMMAX (56.70%) (Supplemental Table 18). When specifically focusing on the four rare variants, Fast3VmrMLM-Hap exhibited the highest power (92.25%), followed by 3VmrMLM (83.00%), FarmCPU (69.50%), Fast3VmrMLM (56.50%), and EMMAX (50.50%) (Supplemental Table 18), indicating the superior performance of Fast3VmrMLM-Hap in detecting rare variants.
Performance of Fast3VmrMLM-mQTL in identifying mQTLs
To evaluate the performance of Fast3VmrMLM-mQTL in detecting mQTLs, the Monte Carlo simulation dataset VII (1000 individuals, each with 33 341 structural variants [SVs], 10 of which were causal SVs with heritabilities ranging from 5% to 10%; 100 replicates) was analyzed (Supplemental Table 19). The results showed a power of 74.8% and an FPR of 2.0281‱ (Supplemental Table 19), demonstrating the method’s effectiveness in mQTL detection.
Re-analysis of the 18K rice dataset
Gene mining for 14 traits in the 18K rice population
Fourteen traits across three environments in the large-scale 18K rice dataset of Wei et al. (2024) were re-analyzed using Fast3VmrMLM, and the results were compared with those obtained via FarmCPU (Wei et al., 2024). Fast3VmrMLM identified 1555 SAMs at the Bonferroni correction threshold (1.71e−8), compared with 1117 identified by FarmCPU (Supplemental Tables 20–33) (Wei et al., 2024). Of the 1555 SAMs identified by Fast3VmrMLM, 1222 overlapped with those detected by FarmCPU; 333 were uniquely identified by Fast3VmrMLM.
Within ±200-kb windows surrounding all SAMs identified above, we mined known genes supported by transgenic experimental evidence from http://www.ricedata.cn/. The results indicated that 100, 211, and 215 known genes were identified by FarmCPU, Fast3VmrMLM, and Fast3VmrMLM-Hap, respectively (Supplemental Tables 34–47; Figures 3 and 4A; Supplemental Figures 1–5). Among the 359 known genes identified in this study, 62 were detected jointly by the new methods (Fast3VmrMLM and Fast3VmrMLM-Hap) and FarmCPU; 259 were detected exclusively by the new methods (Figure 4A). Of the 70 known genes identified in at least two environments, Fast3VmrMLM and Fast3VmrMLM-Hap identified 39 and 41, respectively, compared with 17 identified by FarmCPU. Eleven were jointly detected by the new and FarmCPU methods, whereas 53 were identified only by the two new methods (Supplemental Figure 6A). In summary, the new methods identified a greater number of known genes, with higher repeatability, than FarmCPU.
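The window-based gene lookup can be sketched as an interval search per chromosome. In the example below, a gene counts as a hit if any SAM lies within 200 kb of the gene body; whether the distance is measured to the gene boundary or to another reference point is an assumption, as are the function name, the data layout, and the placeholder coordinates.

```python
from bisect import bisect_left, bisect_right

WINDOW_BP = 200_000  # ±200-kb window

def genes_near_markers(markers, genes, window=WINDOW_BP):
    """Return names of genes with at least one marker within the window.

    markers: iterable of (chromosome, position)
    genes:   iterable of (name, chromosome, start, end)
    """
    by_chrom = {}
    for chrom, pos in markers:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()
    hits = []
    for name, chrom, start, end in genes:
        positions = by_chrom.get(chrom, [])
        lo = bisect_left(positions, start - window)
        hi = bisect_right(positions, end + window)
        if hi > lo:  # at least one marker falls inside the extended interval
            hits.append(name)
    return hits

# Placeholder coordinates, not real rice annotations.
markers = [("chr1", 1_000_000), ("chr3", 5_500_000)]
genes = [
    ("geneX", "chr1", 1_150_000, 1_153_000),
    ("geneY", "chr1", 1_300_000, 1_302_000),
    ("geneZ", "chr3", 5_290_000, 5_301_000),
]
print(genes_near_markers(markers, genes))
```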
Figure 3.
Manhattan plots of heading date in the 18K rice dataset.
The x axis represents the chromosomal and physical coordinates of the markers; the left y axis shows the negative logarithmic transformation of p values obtained from genome-wide scanning, and the right y axis displays LOD scores obtained from the likelihood ratio test using the new methods. Known genes located within ±200-kb windows of significant QTNs (p value <1.71e−8, indicated by the dashed line) identified by the new and FarmCPU methods are marked in dark red; those identified by the new method in more than two environments are marked in blue, and those identified by FarmCPU alone are marked in gray. Shanghai, Hangzhou, and Hainan represent the three environments. SNP, Fast3VmrMLM; Hap, Fast3VmrMLM-Hap. Some values were transformed before plotting; in particular, variants with LOD scores greater than 20 were transformed.
Figure 4.
Known, candidate, and key genes for yield-related traits in rice.
(A) Number of known genes identified by Fast3VmrMLM (light green), Fast3VmrMLM-Hap (blue), and FarmCPU (light pink) in the 18K rice dataset.
(B) Frequency distributions of the allele substitution effect (α, x axis) of SAMs identified by Fast3VmrMLM (light green) and FarmCPU (light pink) in the 18K rice dataset. Annular charts indicate the proportion of SAMs detected exclusively by Fast3VmrMLM, and those detected by both Fast3VmrMLM and FarmCPU, for the first (left) and second (right) groups.
(C) Subset network showing genes with degree values greater than five in the 18K and 1439 rice datasets.
(D) Evidence for hub gene pathways reported in previous studies.
To identify candidate genes, all genes within ±200-kb windows of the SAMs were subjected to gene differential expression analysis, Gene Ontology (GO) annotation analysis (http://systemsbiology.cau.edu.cn/agriGOv2/), and haplotype analyses (Supplemental Text 3). In total, 384 and 367 candidate genes were identified by Fast3VmrMLM (Supplemental Table 48) and Fast3VmrMLM-Hap (Supplemental Table 49), respectively. Among them, 184 were shared by both methods, while 200 and 183 were uniquely identified by Fast3VmrMLM and Fast3VmrMLM-Hap, respectively (Supplemental Figure 6B).
Comparison of the new and FarmCPU methods
Among the 333 SAMs exclusively detected by Fast3VmrMLM, 35 had large absolute dominance ratios (|d/a| ≥ 5.0), and 28 exhibited dominant variance ratios greater than 50%. Of the 212 SAMs associated with known genes that were identified by Fast3VmrMLM alone but not by FarmCPU (Supplemental Table 50), 17 had |d/a| ≥ 5.0 and 13 had dominant variance ratios ≥50% (Supplemental Table 51). These findings demonstrate the advantage of Fast3VmrMLM in detecting SAMs with dominant effects.
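The two dominance summaries used above can be computed from a, d, and the allele frequency. The sketch below assumes the standard single-locus decomposition under Hardy-Weinberg equilibrium, with additive variance 2pqα² and dominance variance (2pqd)², and takes the dominant variance ratio to be the dominance share of their sum. That definition, the function name, and the example values are assumptions; the exact definition used in this study may differ.

```python
def dominance_summary(a, d, p):
    """Return (|d/a|, dominance share of single-locus genetic variance)."""
    q = 1.0 - p
    alpha = a + d * (q - p)            # allele substitution effect
    var_additive = 2.0 * p * q * alpha**2
    var_dominance = (2.0 * p * q * d) ** 2
    ratio = abs(d / a) if a != 0 else float("inf")
    return ratio, var_dominance / (var_additive + var_dominance)

# Illustrative locus with a small additive and a larger dominance effect.
ratio, dom_share = dominance_summary(a=0.1, d=0.6, p=0.5)
print(f"|d/a| = {ratio:.2f}, dominance variance share = {dom_share:.3f}")
```

A locus like this one has a small α, and therefore little additive variance, even though its total genetic effect is not small, which is the situation in which α-only tests are least informative.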
In total, 71 of the 333 SAMs had small standardized allelic substitution effects (α) (|α| ≤ 0.05), and two (r² > 1%) exhibited large genetic effects (Supplemental Tables 20–33; Figure 4B). Among the 181 SAMs with known genes in Supplemental Table 50 that had small α (|α| < 0.5), 32 showed large genetic effects (r² > 0.5%; Supplemental Table 52). These results highlight the advantage of Fast3VmrMLM in identifying SAMs with small allelic substitution effects.
Of the 1692 variants identified by Fast3VmrMLM-Hap (Supplemental Tables 53–66), 123 were jointly identified by Fast3VmrMLM and/or FarmCPU, whereas 1569 were detected exclusively by Fast3VmrMLM-Hap. Among the 1569 variants, 685 were rare (minor allele frequency <2%). Fast3VmrMLM-Hap identified more rare variants (785/1692; 46.39%) than Fast3VmrMLM (61/1555; 3.92%) and FarmCPU (2/908; 0.22%) (Supplemental Tables 34–47). Of the 310 variants associated with known genes and detected only by Fast3VmrMLM-Hap, 86 were rare (Supplemental Table 67). These results demonstrate the advantage of Fast3VmrMLM-Hap in identifying rare variants.
Key genes of yield-related traits for rice breeding by design
To identify key genes associated with yield-related traits for rice breeding by design, a QTN-by-QTN interaction (QQI) network was constructed for nine yield-related traits using the 18K and 1439 (Huang et al., 2015) rice datasets, as described in Wei et al. (2024) (Figure 4C; Supplemental Figure 7). In the 18K rice dataset, 399 known or candidate genes were found by Fast3VmrMLM to be located near 371 QTNs (Wei et al., 2021) associated with heading date (HD), plant height (PH), panicle number (PN), grain length (GL), grain yield (GY), and grain width (GW) (Supplemental Tables 34, 35, 38, and 42–44). In the 1439 rice dataset (Huang et al., 2015), 297 known genes were found near 271 QTNs associated with HD, PH, PN, GY, grain number (GN), seed setting rate (SSR), and 1000-grain weight (TGW) (Supplemental Tables 68–74; Supplemental Text 2). Significant QQIs for each trait in each dataset were identified via empirical Bayes at the 0.01 probability level (Wei et al., 2024), and all significant QQIs were used to construct the QQI network. Overall, 527 QQIs involving 388 gene pairs were identified as significant (Supplemental Table 75) and used to build the QQI network. Twenty-one hub genes, each involved in at least seven epistatic pairs, were identified. These hub genes were involved in 261 (49.53%) of all epistatic interactions. Notably, six hub genes—Ghd8, OsSOC1, Ghd7.1, sd1, OsFTL1, and OsDEP1—were consistent with those reported in Wei et al. (2024).
All 21 hub genes, including Ghd8 (Yan et al., 2011), OsSOC1 (Lee et al., 2004), Ghd7.1 (Yan et al., 2013), sd1 (Sasaki et al., 2002), TAD1 (Xu et al., 2012), LYL1 (Li et al., 2019), and others (Supplemental Table 75), have been reported as key genes in yield-related pathways. For example, our network identified epistatic interactions of Ghd8 with both Ghd7.1 and Hd1; these genes co-regulate HD and play a major role in determining the yield potential of rice cultivars (Zhang et al., 2019) (Figure 4D). Additional supporting evidence from our network is presented in Supplemental Table 76 and Supplemental Text 2, demonstrating the effectiveness of Fast3VmrMLM in identifying key genes associated with polygenic traits.
In addition, the prediction accuracy of phenotypes in genomic selection, based on all SAMs identified by Fast3VmrMLM, was evaluated using five-fold cross-validation. The results showed prediction accuracies for the nine traits ranging from 0.6900 to 0.9185 (Supplemental Figure 8).
Analysis of maize NC II mating design datasets
Gene mining for seven traits in two maize NC II mating design populations
Three best linear unbiased prediction (BLUP) phenotypic datasets for seven yield-related traits from 2019, 2022, and the combined dataset were analyzed in association with 13 528 478 SNPs using Fast3VmrMLM, Fast3VmrMLM-Hap, and 3VmrMLM. The analyzed traits included PH, ear height (EH), seed weight (SW), ear thickness (ET), kernel row number (KRN), kernel number per row (KNR), and yield per plot (Yield). On average, Fast3VmrMLM identified 59.81 SAMs and 2.81 suggested associated markers per trait per year (Supplemental Tables 77–83), Fast3VmrMLM-Hap identified 59.52 SAMs and 10.43 suggested associated markers (Supplemental Tables 84–90), and 3VmrMLM identified 53.10 SAMs and 4.69 suggested associated markers (Supplemental Tables 91–97).
Within ±200-kb windows of all significantly and suggested associated markers (S2AMs), 26, 10, and 26 known genes (Supplemental Tables 98–100) were detected by Fast3VmrMLM, 3VmrMLM, and Fast3VmrMLM-Hap, respectively. The same three methods detected 24, 23, and 33 candidate genes, respectively, which were identified through gene differential expression analysis (Lawrence et al., 2004), GO annotation analysis (Lawrence et al., 2004; Love et al., 2014), rice homologous genes (Emms and Kelly, 2015), and haplotype analyses (Supplemental Figure 9; Supplemental Table 101; Supplemental Text 3). Among the known genes, four were commonly identified by both the new methods and 3VmrMLM, whereas 41 and six were identified exclusively by the new methods and 3VmrMLM, respectively. Among the candidate genes, seven were commonly identified by both approaches; 41 and 16 were uniquely identified by the new methods and 3VmrMLM, respectively. These results demonstrate the effectiveness of the new methods and the complementarity between the new and 3VmrMLM approaches (Supplemental Figures 6C and 6D).
Comparison of the new and 3VmrMLM methods
Among all S2AMs, 133 were jointly identified by Fast3VmrMLM and 3VmrMLM, demonstrating consistency between the two methods. The proportions of S2AMs with dominant variance ratios greater than 50% were 52.73% for Fast3VmrMLM and 45.49% for 3VmrMLM (Supplemental Figures 10A and 10B), indicating the effectiveness of both methods in detecting S2AMs with dominant effects.
Prediction accuracy of genomic selection for maize yield-related traits
All S2AMs identified by Fast3VmrMLM and 3VmrMLM were used to perform genomic selection with the rrBLUP package (Endelman, 2011). Five-fold cross-validation with 20 replicates was conducted to obtain average prediction accuracies. The results showed prediction accuracies for the seven traits ranging from 0.6900 to 0.9185; notably, the accuracies for PH and Yield were 0.9013 and 0.9185, respectively (Supplemental Figure 11; Supplemental Table 102). Superior hybrid combinations were then selected based on the predicted phenotypes. Accordingly, 400 hybrid combinations with strong overall performance were listed in Supplemental Table 103 for further breeding validation.
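The cross-validation scheme can be sketched as follows. This is not the rrBLUP package: the example uses ridge regression with a fixed penalty as a stand-in (rrBLUP estimates the shrinkage from the data), and it takes prediction accuracy to be the Pearson correlation between predicted and observed phenotypes in each held-out fold. The penalty value, function names, and simulated data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimates of marker effects with a fixed penalty lam."""
    mu, x_mean = y.mean(), X.mean(axis=0)
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - mu))
    return mu, x_mean, beta

def cv_accuracy(X, y, lam=1.0, k=5, n_rep=20, seed=0):
    """Mean held-out correlation over n_rep repeats of k-fold CV."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(y))
    accuracies = []
    for _ in range(n_rep):
        for test_idx in np.array_split(rng.permutation(len(y)), k):
            train_idx = np.setdiff1d(all_idx, test_idx)
            mu, x_mean, beta = ridge_fit(X[train_idx], y[train_idx], lam)
            predicted = mu + (X[test_idx] - x_mean) @ beta
            accuracies.append(np.corrcoef(predicted, y[test_idx])[0, 1])
    return float(np.mean(accuracies))

# Simulated data: 300 individuals, 50 associated markers.
rng = np.random.default_rng(42)
X = rng.binomial(2, 0.5, size=(300, 50)).astype(float)
effects = rng.normal(scale=0.3, size=50)
y = X @ effects + rng.normal(size=300)
acc = cv_accuracy(X, y)
print(round(acc, 3))
```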
Analysis of soybean structural variant dataset
To improve the universality of Fast3VmrMLM, we developed Fast3VmrMLM-mQTL for the detection of mQTLs, such as SVs. Fast3VmrMLM-mQTL was implemented to analyze soybean seed oil content datasets from Wuhan 2014 and Nanjing 2016 (Zuo et al., 2022). Two known genes—Glyma.08G213100 (Li et al., 2023) and Glyma.03G130900 (Kanai et al., 2019) (Supplemental Table 104)—were identified within 300-kb physical distance of significantly associated SVs, indicating the method’s effectiveness.
Comparison of genome-wide scanning plus machine learning framework with genome-wide scanning alone
To demonstrate the advantage of the genome-wide scanning plus machine learning framework (Fast3VmrMLM) over genome-wide scanning alone (MLM-scan), we compared the two methods using simulation datasets I-1 to I-4. Fast3VmrMLM outperformed MLM-scan in statistical power (92.12% versus 91.67%), FPR (0.02‱ versus 0.43‱), FDR (2.25% versus 24.96%), FNR (7.88% versus 8.33%), and F1 score (0.95 versus 0.81) (Figure 2B; Supplemental Tables 2 and 3). These results indicate that Fast3VmrMLM achieves a better balance between power and FPR than MLM-scan. Similar trends were observed in simulation datasets II-1 to II-4 (Supplemental Tables 7 and 8).
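The relationships among these metrics can be made explicit: FNR is one minus power, precision is one minus FDR, and the F1 score is the harmonic mean of precision and power. The short sketch below applies these identities to the averaged Fast3VmrMLM values reported above; the function name is an assumption, and an F1 score computed from averaged inputs need not equal an F1 score averaged over datasets or replicates.

```python
def f1_from_power_fdr(power, fdr):
    """F1 score as the harmonic mean of precision (1 - FDR) and recall (power)."""
    precision = 1.0 - fdr
    return 2.0 * precision * power / (precision + power)

# Averaged values reported for Fast3VmrMLM in simulation datasets I-1 to I-4.
power, fdr = 0.9212, 0.0225
fnr = 1.0 - power
f1 = f1_from_power_fdr(power, fdr)
print(f"FNR = {fnr:.4f}, F1 = {f1:.2f}")
```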
Consistent with this conclusion, MLM-scan failed to identify 716 of the 1555 SAMs detected by Fast3VmrMLM in the 18K rice dataset. Among these missed loci, 126 were located near known genes (Supplemental Table 105).
Discussion
Two important advances were achieved in this study. First, genome-wide scanning for gene identification in current statistical genetics methods was replaced by a genome-wide scanning plus machine learning framework, adapted for the era of large-scale data and AI. This framework aims to identify more genes to address the challenges of genetic dissection in polygenic traits (Purugganan and Jackson, 2021; Lappalainen et al., 2024). For example, Fast3VmrMLM (211 genes) and Fast3VmrMLM-Hap (215 genes) identified more than twice as many known genes as FarmCPU (100 genes) for 14 polygenic traits in the 18K rice dataset (Figure 4A). To handle large-scale datasets, several fast, efficient, and large-scale algorithms were integrated with our compressed variance component mixed model to improve computational and storage efficiency (Supplemental Table 16). Fast3VmrMLM successfully processed both the 18K rice and UK Biobank-scale datasets in 3.30 and 5.43 h, respectively, using a server with 20 threads and 1 TB of memory (Supplemental Tables 16 and 17). Second, all QTNs associated with known or candidate genes for nine rice yield-related traits were used to detect QQIs for each trait using a machine learning approach (empirical Bayes). These QQIs, identified across multiple datasets, were used to construct a genetic network to identify key genes for a specific type of polygenic trait. In the 1439 and 18K rice datasets, 21 hub genes associated with rice yield-related traits were identified within a network comprising 527 QQIs across 388 gene pairs. Evidence presented in Supplemental Table 76 supports the relevance of these hub genes. These genes are expected to play a critical role in improving rice yield-related traits in future breeding efforts.
The genome-wide scanning plus machine learning framework for gene mining in the era of large-scale data and AI
Since the introduction of the genome-wide scanning strategy for gene identification in linkage analysis (Lander and Botstein, 1989), GWAS (Risch and Merikangas, 1996), and bulked segregant analysis (Schneeberger et al., 2009), a substantial number of genes have been identified for polygenic traits across animal, plant, and human genetics. However, the genetic dissection of polygenic traits remains challenging (Purugganan and Jackson, 2021; Lappalainen et al., 2024) for two main reasons: first, existing methods focus primarily on a limited number of loci with strong signals and large effects; second, single-locus models exhibit low power for gene detection. To address these challenges, it is essential to integrate genome-wide scanning with machine learning in the era of big data and AI. The Fast3VmrMLM algorithm was developed with this goal. In the first stage, genome-wide scanning is used to select as many potentially associated markers as possible with a relaxed significance threshold. In the second stage, all potentially associated markers are analyzed in a multi-locus model using an empirical Bayes-based machine learning method, to identify S2AMs. Simulation studies (Supplemental Tables 7 and 8) and real-data analyses (Supplemental Table 105) demonstrated that this new framework outperforms genome-wide scanning alone. Compared with FarmCPU, Fast3VmrMLM considers both additive and dominant effects for each potentially associated marker and controls for additive and dominant polygenic backgrounds, thus addressing the limitations of existing GWAS methods in detecting small α and dominant effects (Supplemental Tables 2–15 and 50–52; Figure 2). 
To handle large-scale datasets, all Fast3VmrMLM procedures were optimized using several computational techniques, including the PCG method (Kaasschieter, 1988), GRAMMAR-gamma approximation (Svishcheva et al., 2012), the Woodbury matrix identity (Golub and van Loan, 1996), and the conditional expectation method (Henderson, 1975; Xu et al., 2014) (Supplemental Texts 4–7). To enable the identification of additional rare variants using bin markers, Fast3VmrMLM-Hap was developed (Supplemental Tables 18 and 67). To further extend its applicability, Fast3VmrMLM-mQTL was developed for the detection of mQTLs, such as SVs. In summary, the genome-wide scanning plus machine learning framework enables comprehensive gene mining, particularly suited for the demands of large-scale data and AI.
Conventional gene-mining methods for polygenic traits typically select a few strong signals from genome-wide scans and rely on multi-omics evidence for candidate genes near those loci—especially those repeatedly identified across multiple environments—to support gene-trait associations. In contrast, this study utilizes Fast3VmrMLM, which integrates differential expression analysis, gene annotation analysis, and haplotype analysis for candidate genes located around all S2AMs to support gene-trait associations (Supplemental Tables 48, 49, and 101). This approach provides unprecedented insights into the genetic architecture of polygenic crop traits and offers valuable references for experimental validation after GWAS. The goal is to identify as many genes related to polygenic traits as possible, addressing the persistent challenge of gene mining in this context. In the real-data analysis, 233 and 51 known genes associated with complex traits in rice and maize, respectively, were identified based on previous studies (Supplemental Tables 34–47 and 98–100). In addition, this study identified 381 and 64 candidate genes for complex traits in rice and maize (Supplemental Tables 48, 49, and 101), respectively, which have not yet been confirmed by transgenic experiments. For example, LOC_Os02g57660, which encodes a key enzyme (PIP5K) involved in the phosphatidylinositol signaling pathway—known to regulate plant growth, development, and stress responses (Zhang et al., 2020)—is a homolog of the known gene OsPIP5K1, which has been linked to yield-related traits (Fang et al., 2020).
Mining key genes of polygenic traits for breeding by design
Many polygenic traits, such as yield, are important production traits and major breeding objectives. However, these traits are typically controlled by numerous polygenes. In crop breeding, identifying which genes are critical remains a fundamental challenge. This study aims to address that question. We propose that three criteria can be used to determine whether a gene is key for yield-related traits: (1) pleiotropic genes involved in the regulation of multiple yield-related component traits (Huang et al., 2016); (2) hub genes with central roles in interaction networks; and (3) genes previously used in breeding programs. Using these criteria, 21 hub genes were identified in this study (Supplemental Tables 75 and 106). Specifically, Ghd8, OsSOC1, Ghd7.1, sd1, TAD1, and OsDEP1 satisfied all three criteria. LYL1 and WG7 met the first two criteria; OsPIN5b and TSCD11 met the first criterion; and SDG711 and OsFTL1 met the second criterion. The remaining nine hub genes were classified as candidate genes based on multi-omics evidence and require further validation of their function and breeding potential. Among the 21 hub genes, Ghd8, OsSOC1, Ghd7.1, sd1, OsFTL1, and OsDEP1 were consistent with key genes in the yield trait network identified by Wei et al. (2024). Additionally, Ghd8, Ghd7.1, sd1, and OsDEP1 were reported as key genes for rice yield improvement in breeding by Wu et al. (2018); Ghd8, Ghd7.1, OsSOC1, sd1, and OsDEP1 were recognized as representative key functional genes in rice molecular design breeding (Guo et al., 2019); Ghd8, Ghd7.1, and OsDEP1 were identified as key genes in green super rice breeding (Zhang et al., 2024); and Ghd8, Ghd7.1, TAD1, and OsDEP1 were documented as key genes involved in complex agronomic traits in rice (Zuo and Li, 2014). These findings highlight the utility of machine learning in identifying key genes for polygenic traits and demonstrate its potential to facilitate breeding by design.
In human genetics, many complex diseases, such as type 2 diabetes (DeForest and Majithia, 2022), are also polygenic, and key genes have similarly been identified in disorders such as rheumatoid arthritis (Holmdahl, 2000). Therefore, this method may also be applicable to human genetic studies.
Once key genes for a specific type of polygenic trait have been identified, they are first introduced into an elite cultivar through backcrossing and accelerated breeding in pure-line breeding, or into elite male and female inbred lines in hybrid breeding. Subsequently, multi-way crossing and doubled haploid technology are employed to pyramid and fix these key genes into a superior accession. This results in an elite cultivar in pure-line breeding, or elite male and female inbred lines that, when crossed, produce a superior hybrid in hybrid breeding. Such an approach represents a new strategy for the improvement of polygenic traits through breeding.
Although genomic selection has proven highly successful in animal breeding (Meuwissen et al., 2016), its high cost has limited broader application in crop breeding. It is therefore essential to reduce its cost. In this study, genomic selection based on all SAMs identified by Fast3VmrMLM achieved high prediction accuracies, ranging from 0.59 to 0.91 for rice and from 0.69 to 0.92 for maize. These results highlight the substantial value of the identified S2AMs in improving phenotypic trait prediction and accelerating crop breeding. Notably, the cost of genotyping could be significantly reduced by developing a chip based on these S2AMs. More importantly, genomic selection can also be applied to predict superior hybrid combinations in the partial maize NC II design experiment (Supplemental Table 103). Thus, the proposed framework supports both genomic selection and molecular breeding by design.
Comparison of Fast3VmrMLM with existing methods
Fast3VmrMLM differs fundamentally from existing GWAS methods. It is based on a genome-wide scanning plus machine learning framework, whereas most existing methods rely solely on genome-wide scanning. As a result, Fast3VmrMLM demonstrates significantly higher statistical power for multiple reasons. First, Fast3VmrMLM incorporates both additive and dominant effects and controls for their polygenic backgrounds, whereas most existing methods consider only the allelic substitution effect (α) and its corresponding polygenic background. This distinction underscores the efficiency of Fast3VmrMLM in identifying S2AMs with dominant effects and those with α ≈ 0.0, as well as in estimating comprehensive genetic effects (Supplemental Tables 2–15 and 50–52; Figure 2). Second, the statistical model of Fast3VmrMLM relies on a multi-locus model that simultaneously includes all potentially associated markers, in contrast to the single-locus model commonly used in genome-wide scanning by most existing methods. The two-step approach of Fast3VmrMLM first applies a loose significance threshold during genome-wide scanning to reduce the likelihood of missing S2AMs due to proximal contamination (Jiang et al., 2021), followed by a stringent threshold (Bonferroni correction) in the likelihood ratio test after the machine learning stage to control the FPR. Most existing methods, in contrast, often use a one-step strategy with a strict threshold to identify SAMs (Segura et al., 2012; Liu et al., 2016). This design gives Fast3VmrMLM an advantage in achieving both high power and low FPR, while mitigating the dimensionality challenge of multi-locus modeling. Third, the unified three-component (compressed) variance component mixed model in Fast3VmrMLM can be readily extended to detect gene-by-environment and gene-by-gene interactions without introducing additional terms into the mixed model. This avoids the complexity associated with algorithm redesign required by many existing methods when adapting to new model structures.
To meet the challenges posed by large-scale data and AI, algorithmic capabilities for large-scale datasets, fast computation, and low memory consumption were incorporated into the compressed variance component mixed model of Fast3VmrMLM, as described above. Fast3VmrMLM, developed based on the comprehensive genetic effect framework of 3VmrMLM, retains its advantages in identifying loci with small α and dominant effects (Supplemental Tables 2, 7, and 13; Figure 2C), while offering faster computation and improved suitability for large-scale data. In real-data analysis, we observed that the gene-mining results of Fast3VmrMLM complemented those of 3VmrMLM (Supplemental Tables 98–101; Supplemental Figures 6C and 6D). Therefore, this study presents Fast3VmrMLM as a fast, efficient, and large-scale GWAS algorithm that complements and accelerates the performance of 3VmrMLM.
Finally, the superior performance of Fast3VmrMLM over FarmCPU is attributed to its novel framework (Figure 2B; Supplemental Table 105), as well as its high power in identifying dominant (Figure 2C; Supplemental Table 51), small-effect (Supplemental Tables 2 and 52), and rare QTNs (Supplemental Tables 18 and 67).
Detection and estimation of the dominant effect of SAMs
Dominance represents a key genetic basis for complex traits and heterosis and plays a significant role in genomic selection. Although the 18K rice dataset contains a low proportion of heterozygous genotypes (Supplemental Figure 10C), Fast3VmrMLM identified 100 SAMs with dominant variance ratios exceeding 50% (Supplemental Figure 10D), underscoring the importance of dominance in GWAS for this dataset. Additionally, SAMs with both additive and dominant effects identified by Fast3VmrMLM achieved high prediction accuracies in rice (0.7443) and maize (0.8492). These findings reinforce the importance of incorporating dominance effects in GWAS and genomic selection.
Fast3VmrMLM software is designed to run on a Linux server equipped with 60 CPUs and 1 TB of memory. For optimal results, users may adjust the “svrad” parameter according to specific species and traits. The sparse matrix technique was not employed in this study because most kinship coefficients exceeded 0.3 in the 1439 and 18K rice datasets (Supplemental Figure 12). In future work, we plan to extend Fast3VmrMLM to detect QTN-by-QTN and QTN-by-environment interactions.
Methods
Compressed variance component mixed linear model
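The phenotypic vector $\mathbf{y}$ of the $n$ individuals for a complex trait in an association mapping population is modeled as:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_{ak}a_k + \mathbf{Z}_{dk}d_k + \mathbf{u}_a + \mathbf{u}_d + \boldsymbol{\varepsilon} \tag{Equation 1}$$
where $\mathbf{X}$ is the incidence matrix for fixed effect $\boldsymbol{\beta}$; $a_k$ and $d_k$ are additive and dominant effects of the kth locus, respectively, and $\mathbf{Z}_{ak}$ and $\mathbf{Z}_{dk}$ are the corresponding dummy variables that indicate genotypes of the kth locus, coded for qq, Qq, and QQ as in Xu (2013). The additive and dominant polygenic effects are modeled as $\mathbf{u}_a \sim N(\mathbf{0}, \mathbf{K}_a\sigma_{ua}^2)$ and $\mathbf{u}_d \sim N(\mathbf{0}, \mathbf{K}_d\sigma_{ud}^2)$, where $\mathbf{K}_a$ and $\mathbf{K}_d$ are additive and dominant kinship matrices, respectively; m represents the number of markers. The parameters $\sigma_{ua}^2$ and $\sigma_{ud}^2$ are variances of the additive and dominant polygenic effects. The residual error is $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, with residual variance $\sigma_e^2$ and unit matrix $\mathbf{I}$.
To reduce the number of variance components in mixed model (1), as described in Li et al. (2022a), model (1) is reformulated as:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_k\boldsymbol{\gamma}_k + \mathbf{u} + \boldsymbol{\varepsilon} \tag{Equation 2}$$
where vector $\boldsymbol{\gamma}_k = (a_k, d_k)^{\mathrm{T}}$, with $\mathbf{Z}_k = (\mathbf{Z}_{ak}, \mathbf{Z}_{dk})$, replaces $a_k$ and $d_k$ in model (1), or the allelic substitution effect α in existing methods (Lippert et al., 2011; Yang et al., 2011; Zhou et al., 2018; Jiang et al., 2019, 2021). The polygenic background term $\mathbf{u}$ follows a normal distribution with mean $\mathbf{0}$ and variance $\mathbf{K}\sigma_g^2$, where $\mathbf{K}$ is the corresponding kinship matrix; $\sigma_g^2$ is the generalized genetic variance, which replaces the separate additive and dominant polygenic background components (Xu, 2013) in model (1) or the allelic substitution polygenic background used in SAIGE (Zhou et al., 2018) and fastGWA (Jiang et al., 2019, 2021). Thus, the five variance components in model (1) are reduced to three in model (2).
To reduce the computational burden of estimating the variance for each marker, we first considered the null model derived from model (2):
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} + \boldsymbol{\varepsilon} \tag{Equation 3}$$
This null model was used to estimate the parameters $\sigma_g^2$ and $\sigma_e^2$, where $\mathbf{u} = \mathbf{Z}\boldsymbol{\gamma}$ represents the polygenic effects, and $\boldsymbol{\gamma}$ is the random effect vector for all markers, whose covariance matrix is proportional to the unit matrix $\mathbf{I}$. Subsequently, all marker effects were predicted using the conditional expectation method (Henderson, 1975; Xu et al., 2014).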
Restricted maximum log-likelihood function and its solution
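The restricted maximum log-likelihood (REML) function of model (3), given $\sigma_g^2$ and $\sigma_e^2$, is defined as:
$$L(\sigma_g^2, \sigma_e^2) = -\frac{1}{2}\ln|\mathbf{V}| - \frac{1}{2}\ln|\mathbf{X}^{\mathrm{T}}\mathbf{V}^{-1}\mathbf{X}| - \frac{1}{2}\mathbf{y}^{\mathrm{T}}\mathbf{P}\mathbf{y} \tag{Equation 4}$$
where $\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^{\mathrm{T}}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{V}^{-1}$ and $\mathbf{V} = \mathbf{K}\sigma_g^2 + \mathbf{I}\sigma_e^2$ is the variance-covariance matrix of $\mathbf{y}$. The average information REML (AI-REML) method (Arthur et al., 1995) is used to iteratively estimate the variance components $\sigma_g^2$ and $\sigma_e^2$ (Supplemental Text 4). To reduce the computational burden of inverting the variance-covariance matrix during each iteration, particularly in large-scale datasets, the PCG method (Kaasschieter, 1988) is utilized (Supplemental Text 4).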
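To illustrate why an iterative solver helps, note that each AI-REML iteration needs products such as $\mathbf{V}^{-1}\mathbf{y}$ rather than $\mathbf{V}^{-1}$ itself, and these can be obtained by solving $\mathbf{V}\mathbf{x} = \mathbf{y}$. The following is a minimal sketch of a preconditioned conjugate gradient solver in plain Python. The function name `pcg_solve`, the diagonal (Jacobi) preconditioner, and the toy kinship matrix are assumptions made for illustration; they are not taken from Supplemental Text 4.

```python
def pcg_solve(V, b, tol=1e-10, max_iter=1000):
    """Solve V x = b for a symmetric positive-definite V using
    Jacobi-preconditioned conjugate gradients (illustrative sketch)."""
    n = len(b)

    def matvec(x):
        return [sum(V[i][j] * x[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    m_inv = [1.0 / V[i][i] for i in range(n)]   # assumed preconditioner: inverse diagonal
    x = [0.0] * n
    r = list(b)                                  # residual b - V x for the start x = 0
    z = [m_inv[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:               # stop once the residual norm is small
            break
        Vp = matvec(p)
        alpha = rz / dot(p, Vp)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Vp[i] for i in range(n)]
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


# Toy example (made-up numbers): V = K * sigma_g2 + I * sigma_e2
K = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
sigma_g2, sigma_e2 = 2.0, 1.0
V = [[K[i][j] * sigma_g2 + (sigma_e2 if i == j else 0.0) for j in range(3)]
     for i in range(3)]
y = [1.0, 2.0, 3.0]
x = pcg_solve(V, y)                              # x approximates V^{-1} y
```

Each iteration costs one matrix-vector product, so no explicit inverse of $\mathbf{V}$ is ever formed; this is the property that makes the approach attractive when n is large.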
Fast3VmrMLM algorithm
Fast3VmrMLM consists of two steps: (1) single-locus genome-wide scanning to select potentially associated markers and (2) machine learning to identify S2AMs. In the first step, the conditional expectation method (Henderson, 1975) is applied to model (3) to predict the conditional expectations and corresponding variances of marker effects $\boldsymbol{\gamma}$, given the estimated parameters $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$ (Supplemental Text 5). The computational cost of variance and covariance estimation for each marker in this step grows with the sample size n and becomes inefficient when n is large. To address this issue, the GRAMMAR-gamma approximation method (Svishcheva et al., 2012) is used to estimate the gamma ratio by randomly selecting a subset of markers and approximating the variance-covariance matrices for all markers (Supplemental Text 6).
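The idea behind the gamma ratio can be sketched as follows, in the spirit of Svishcheva et al. (2012): the exact quadratic form $\mathbf{z}^{\mathrm{T}}\mathbf{V}^{-1}\mathbf{z}$ is computed only for a random subset of markers, the average ratio to $\mathbf{z}^{\mathrm{T}}\mathbf{z}$ is taken as gamma, and the quadratic form for every other marker is approximated as gamma times $\mathbf{z}^{\mathrm{T}}\mathbf{z}$. The function names, the scalar (single-column) marker coding, and the toy inputs below are assumptions for illustration; the procedure actually used by Fast3VmrMLM is given in Supplemental Text 6.

```python
import random


def quad_form(z, Vinv):
    """Exact z' V^{-1} z; needs a double loop over individuals for each marker."""
    n = len(z)
    return sum(z[i] * Vinv[i][j] * z[j] for i in range(n) for j in range(n))


def estimate_gamma(markers, Vinv, n_sample, seed=0):
    """Average of (z' V^{-1} z) / (z' z) over a random subset of markers."""
    rng = random.Random(seed)
    subset = rng.sample(markers, n_sample)
    ratios = [quad_form(z, Vinv) / sum(v * v for v in z) for z in subset]
    return sum(ratios) / len(ratios)


def approx_quad_form(z, gamma):
    """Single-pass approximation: z' V^{-1} z is replaced by gamma * z' z."""
    return gamma * sum(v * v for v in z)


# Toy inputs (made-up): three centred marker vectors and a small V^{-1}
markers = [[-1.0, 0.0, 1.0, 0.0], [1.0, 1.0, -1.0, -1.0], [0.0, 1.0, 0.0, -1.0]]
Vinv = [[0.6, -0.1, 0.0, 0.0],
        [-0.1, 0.6, -0.1, 0.0],
        [0.0, -0.1, 0.6, -0.1],
        [0.0, 0.0, -0.1, 0.6]]
gamma = estimate_gamma(markers, Vinv, n_sample=2)
approx = [approx_quad_form(z, gamma) for z in markers]
```

The exact quadratic form is evaluated only `n_sample` times; all remaining markers need just one pass over their genotype vector.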
Based on the estimated effects and variance-covariance matrices of all markers, a vectorized Wald test strategy (Li et al., 2024) is used to calculate genome-wide Wald statistics via $W_k = \hat{\boldsymbol{\gamma}}_k^{\mathrm{T}}\left[\widehat{\mathrm{var}}(\hat{\boldsymbol{\gamma}}_k)\right]^{-1}\hat{\boldsymbol{\gamma}}_k$, where $W_k$ is the Wald statistic for the kth marker. Using a relatively loose p value threshold (Wellcome Trust Case Control Consortium, 2007; Wang et al., 2016; Li et al., 2022a), variants with p values ≤1.00e−5 were regarded as potentially associated variants. After initial selection, we removed some potentially associated variants around the strongest variant. The window size for exclusion depends on the species (Li et al., 2022a, 2022b); for example, 20 kb for PH, EH, and SW in maize, and 200 kb for ET, KRN, KNR, and Yield in maize, and all traits in rice.
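The first-step selection described above can be sketched in a few lines. The example assumes that each marker carries a two-element effect vector (additive, dominant), so that the Wald statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(−W/2). It also assumes a greedy reading of the window rule: the strongest variant is kept and weaker candidates on the same chromosome within the window are dropped. Function names and toy values are hypothetical.

```python
import math


def wald_2df(effect, cov):
    """Wald statistic g' C^{-1} g for a 2-vector of effects and its
    chi-square (2 df) p value, exp(-W / 2)."""
    (a, d), ((c11, c12), (c21, c22)) = effect, cov
    det = c11 * c22 - c12 * c21
    w = (a * (c22 * a - c12 * d) + d * (-c21 * a + c11 * d)) / det
    return w, math.exp(-w / 2.0)


def select_candidates(hits, p_threshold=1.0e-5):
    """Keep variants whose p value does not exceed the loose threshold."""
    return [h for h in hits if h[2] <= p_threshold]


def prune_by_window(hits, window_bp):
    """hits: list of (chrom, pos, p). Visit hits from strongest to weakest and
    keep a hit only if no kept hit on the same chromosome lies within window_bp."""
    kept = []
    for chrom, pos, p in sorted(hits, key=lambda h: h[2]):
        if all(c != chrom or abs(pos - q) > window_bp for c, q, _ in kept):
            kept.append((chrom, pos, p))
    return kept


# Toy example (made-up positions and p values), using a 200-kb window
hits = [(1, 1000, 1e-8), (1, 150000, 1e-6), (1, 500000, 1e-7),
        (2, 1200, 1e-6), (2, 900000, 1e-3)]
candidates = prune_by_window(select_candidates(hits), window_bp=200_000)
```

In this toy run the variant at position 150000 on chromosome 1 is removed because it lies within 200 kb of a stronger variant, and the variant with p = 1e-3 never enters the candidate set.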
In the second step, all selected potentially associated markers were entered into Equation 5, and a machine learning method—expectation-maximization (EM) empirical Bayes (Xu, 2010; Li et al., 2022a)—was used to estimate all genetic effects.
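$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \sum_{k=1}^{q}\mathbf{Z}_k\gamma_k + \boldsymbol{\varepsilon} \tag{Equation 5}$$
where $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$, and $\boldsymbol{\varepsilon}$ are as defined in Equation 1; $q$ is the number of potentially associated variants selected from the Wald test; and $\gamma_k$ represents the genetic effect associated with the indicator vector $\mathbf{Z}_k$. Based on a normal prior for $\gamma_k$, $\gamma_k \sim N(0, \sigma_k^2)$, with a scaled inverse $\chi^2$ prior for $\sigma_k^2$, all effects can be estimated using the EM empirical Bayes framework (Supplemental Text 7). Matrix inversion poses a major computational challenge in large-scale datasets. Therefore, this study introduces two strategies to improve computational efficiency and reduce memory usage within the EM empirical Bayes procedure (Supplemental Text 7). When $q$ is smaller than the sample size ($q < n$), the Woodbury matrix identity (Golub and van Loan, 1996) is used to calculate $\mathbf{V}^{-1}$ via $\mathbf{V}^{-1} = \sigma_e^{-2}\left[\mathbf{I} - \mathbf{Z}\left(\mathbf{Z}^{\mathrm{T}}\mathbf{Z} + \sigma_e^2\mathbf{D}^{-1}\right)^{-1}\mathbf{Z}^{\mathrm{T}}\right]$, where $\mathbf{Z} = (\mathbf{Z}_1, \ldots, \mathbf{Z}_q)$ and $\mathbf{D}$ is a diagonal matrix with diagonal elements $\sigma_k^2$. When both $q$ and $n$ exceed 4000, the PCG method (Kaasschieter, 1988) is used.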
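The benefit of the Woodbury identity is that only a q × q matrix has to be inverted instead of an n × n one. The sketch below checks this numerically on a toy example, assuming the covariance structure V = Z D Zᵀ + I σ²ₑ with D a diagonal matrix of marker variances. The helper names, the small Gauss-Jordan inverse, and all numbers are assumptions for illustration and are not the implementation described in Supplemental Text 7.

```python
def inv(A):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(A)
    M = [list(map(float, A[i])) + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def woodbury_vinv(Z, d, sigma_e2):
    """V^{-1} for V = Z D Z' + I * sigma_e2, inverting only a q x q matrix."""
    n, q = len(Z), len(d)
    Zt = transpose(Z)
    inner = matmul(Zt, Z)                        # q x q
    for k in range(q):
        inner[k][k] += sigma_e2 / d[k]           # Z'Z + sigma_e2 * D^{-1}
    core = matmul(matmul(Z, inv(inner)), Zt)     # n x n correction term
    return [[((1.0 if i == j else 0.0) - core[i][j]) / sigma_e2 for j in range(n)]
            for i in range(n)]


# Toy check (made-up values): n = 4 individuals, q = 2 selected markers
Z = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
d = [0.5, 2.0]                                   # marker variances on the diagonal of D
sigma_e2 = 1.5
V = [[sum(Z[i][k] * d[k] * Z[j][k] for k in range(2)) + (sigma_e2 if i == j else 0.0)
      for j in range(4)] for i in range(4)]
direct, fast = inv(V), woodbury_vinv(Z, d, sigma_e2)
max_diff = max(abs(direct[i][j] - fast[i][j]) for i in range(4) for j in range(4))
```

Here `max_diff` is at the level of floating-point rounding, confirming that both routes give the same inverse. Note that this form requires every diagonal element of D to be strictly positive.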
All markers with non-zero effects from the machine learning method are subsequently evaluated using a likelihood ratio test. The critical p value of 0.05/m, based on Bonferroni correction, and the threshold of LOD score ≥3.0 (Wang et al., 2016) are used to identify S2AMs.
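The two thresholds can be related through the identity LOD = LRT / (2 ln 10). The sketch below converts a likelihood ratio statistic into a LOD score and a p value and reports both criteria separately, because the text does not specify how they are combined. It assumes one degree of freedom per tested effect, for which the chi-square upper-tail probability equals erfc(√(x/2)). The function names are hypothetical.

```python
import math


def lrt_to_lod(lrt):
    """LOD = log10 of the likelihood ratio = LRT / (2 ln 10)."""
    return lrt / (2.0 * math.log(10.0))


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df (assumed df)."""
    return math.erfc(math.sqrt(x / 2.0))


def evaluate_marker(lrt, m, alpha=0.05, lod_cutoff=3.0):
    """Return the p value, LOD score, and the outcome of each criterion."""
    p = chi2_sf_1df(lrt)
    lod = lrt_to_lod(lrt)
    return {
        "p": p,
        "lod": lod,
        "passes_bonferroni": p <= alpha / m,     # critical value 0.05 / m
        "passes_lod": lod >= lod_cutoff,         # LOD score >= 3.0
    }


# Example with the marker count of the 18K rice dataset and a made-up LRT value
result = evaluate_marker(lrt=40.0, m=2_929_530)
```

With m = 2 929 530 markers the Bonferroni critical value is about 1.7e−8, which is considerably stricter than LOD ≥ 3.0 under the 1-df assumption.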
Fast3VmrMLM-Hap and Fast3VmrMLM-mQTL algorithms
To identify rare variants and additional trait-associated genes, the Fast3VmrMLM-Hap algorithm was developed. In this approach, adjacent SNP markers are first grouped into bin markers—also referred to as dosage markers (Zhou et al., 2022)—based on linkage disequilibrium, as described by An et al. (2020). Haplotypes are then derived based on these dosage values. To minimize the number of rare haplotype variables, samples with similar dosages are grouped into a single haplotype. These genome-wide haplotypes across chromosomal regions are subsequently used in trait association analysis with Fast3VmrMLM, constituting the Fast3VmrMLM-Hap algorithm. To identify associations between molecular markers and traits, Fast3VmrMLM was further extended to Fast3VmrMLM-mQTL, which accommodates markers with differing numbers of genotypes across loci.
Monte Carlo simulation studies
In Monte Carlo simulation studies, PLINK software (Chang et al., 2015) was used to simulate datasets I, III, V, and VII, whereas the real genotypic dataset of Simmental Beef (Zhu et al., 2016) was used to simulate dataset II. The rice genotypic dataset (Huang et al., 2015) was used to simulate datasets IV and VI. The evaluation indicators followed those described in Li et al. (2022a). Technical details are provided in Supplemental Text 1.
Real-data re-analyses of the 18K rice dataset
The 18K rice dataset consisted of 15 recombinant inbred line (RIL) populations and one four-way MAGIC population (Wei et al., 2024). The RIL populations were derived from crosses between Huanghuazhan and 15 other accessions, whereas the MAGIC population originated from a four-way cross involving Huanghuazhan, Kasalath, Nipponbare, and IAC25. In this study, 14 traits from 18 421 lines in the 18K rice dataset were analyzed in association with 2 929 530 SNPs for GWAS using Fast3VmrMLM and Fast3VmrMLM-Hap. The top two principal components were used to correct for population structure (Wei et al., 2024).
Genes located within a ±200-kb window of SAMs were examined. Genes with at least one piece of evidence from overexpression, CRISPR, RNAi experiments, mutant versus wild type comparisons, or molecular mechanism studies were considered known genes (https://www.ricedata.cn/ontology/default.aspx). Candidate genes were identified through differential expression analysis (Wang et al., 2010; Sato et al., 2011; Jathar et al., 2022), GO annotation (http://systemsbiology.cau.edu.cn/agriGOv2/), and haplotype analyses. All physical positions of markers and genes are based on the rice MSU Reference Genome Release 7.0 (https://rice.uga.edu/).
Genomic selection for rice and maize was conducted using the R package rrBLUP (Endelman, 2011), based on BLUP values and all identified S2AMs. Five-fold cross-validation was used to assess prediction accuracy, implemented via the createFolds function in the R package caret (Kuhn, 2008).
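The cross-validation scheme can be sketched independently of the prediction model. In the example below, individuals are randomly split into five folds, each fold is predicted from a model trained on the other four, and accuracy is summarized as the mean Pearson correlation between predicted and observed values. The correlation-based definition of accuracy is an assumption, and the single-covariate least-squares predictor is only a stand-in for the rrBLUP model used in the study. All names and data are hypothetical.

```python
import random


def k_folds(n, k=5, seed=1):
    """Randomly partition indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cross_validate(X, y, fit_predict, k=5, seed=1):
    """Mean correlation between predictions and observations over k folds."""
    accs = []
    for test in k_folds(len(y), k, seed):
        held = set(test)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        accs.append(pearson(preds, [y[i] for i in test]))
    return sum(accs) / len(accs)


def ols_one_covariate(X_train, y_train, X_test):
    """Stand-in predictor: least squares on the first column only."""
    x = [row[0] for row in X_train]
    mx, my = sum(x) / len(x), sum(y_train) / len(y_train)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y_train))
             / sum((a - mx) ** 2 for a in x))
    return [my + slope * (row[0] - mx) for row in X_test]


# Toy data (made-up): one covariate with an exactly linear relationship
X = [[float(i)] for i in range(20)]
y = [1.0 + 2.0 * i for i in range(20)]
accuracy = cross_validate(X, y, ols_one_covariate)
```

Any function with the same `(X_train, y_train, X_test)` signature can be passed as `fit_predict`, which keeps the fold logic separate from the choice of prediction model.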
Genetic interaction detection and network construction
To identify key genes associated with nine rice yield-related traits—HD, PH, PN, GL, GY, and GW in the 18K rice dataset (Wei et al., 2024), and HD, PH, PN, GY, GN, SSR, and TGW in the 1439 rice dataset (Huang et al., 2015)—all QTNs linked to known or candidate genes in the 18K rice dataset (Supplemental Tables 34, 35, 38, and 42–44) and all known genes identified in the 1439 rice dataset (Supplemental Tables 68–74) were used to detect epistatic interactions via the EM empirical Bayes method. The epistatic effects of all QTN pairs were estimated using the following multi-locus model
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \sum_{i=1}^{q}\mathbf{Z}_{ai}a_i + \sum_{i<j}\left(\mathbf{Z}_{ai}\circ\mathbf{Z}_{aj}\right)(aa)_k + \boldsymbol{\varepsilon} \tag{Equation 6}$$
where $(aa)_k$ denotes the kth additive-by-additive interaction effect between the ith and jth SNPs among the $q$ QTNs, with $i < j$ and $\circ$ denoting the element-wise product; $\mathbf{y}$ represents the BLUP values calculated from multi-environment phenotypes for each trait. Other terms are the same as in model (1). All effects and variances in model (6) were estimated using the fast version of the EM empirical Bayes method described in Supplemental Text 7. A Wald test was then performed on each SNP pair based on the estimated effects and variances to identify significant epistatic interactions (p < 0.01) (Wei et al., 2024). Considering the low SNP heterozygosity rate in the 18K rice dataset (97.52% of SNPs had heterozygosity rates below 4.0%; Supplemental Figure 10C), only additive and additive-by-additive epistatic effects were considered in model (6).
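To make the pairwise terms concrete, the sketch below builds the additive-by-additive covariates for all SNP pairs. It assumes the common coding in which the interaction covariate of a pair is the element-wise product of the two additive codes; the function name and the toy genotype codes are hypothetical.

```python
from itertools import combinations


def epistasis_design(Za):
    """Za: n x q matrix of additive codes (rows = individuals).
    Returns the list of SNP index pairs (i < j) and the n x q(q-1)/2 matrix
    whose columns are element-wise products of the paired additive codes."""
    q = len(Za[0])
    pairs = list(combinations(range(q), 2))
    Zaa = [[row[i] * row[j] for i, j in pairs] for row in Za]
    return pairs, Zaa


# Toy example (made-up codes): 2 individuals, 3 QTNs coded -1 / 1
Za = [[-1, 1, 1],
      [1, 1, -1]]
pairs, Zaa = epistasis_design(Za)
```

The number of interaction columns grows as q(q − 1)/2, which is one reason a shrinkage-based multi-locus method is needed to estimate all pairwise effects jointly.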
All known and candidate genes associated with the nine traits, along with the significant epistatic pairs, were used to construct a genetic network using Cytoscape software (Shannon et al., 2003). The degree of each gene in the network was calculated and visually represented by node size and label font size, both proportional to the degree value. Connecting lines, representing epistatic interactions specific to each trait, were color-coded to distinguish genes involved in multiple yield-related traits. Genes with high connectivity were identified as hub genes.
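The degree calculation can be reproduced outside Cytoscape with a few lines of code. The sketch below assumes that the degree of a gene is the number of distinct partner genes, so that several trait-specific interactions between the same two genes count once, and that hub genes are those whose degree reaches a chosen cutoff. The cutoff value and the gene labels are placeholders, not results from this study.

```python
def gene_degrees(interactions):
    """interactions: iterable of (gene1, gene2, trait) tuples.
    Returns {gene: number of distinct partner genes}."""
    neighbours = {}
    for g1, g2, _trait in interactions:
        if g1 == g2:
            continue                             # ignore self-pairs
        neighbours.setdefault(g1, set()).add(g2)
        neighbours.setdefault(g2, set()).add(g1)
    return {g: len(partners) for g, partners in neighbours.items()}


def hub_genes(degrees, min_degree):
    """Genes with degree >= min_degree, ordered by decreasing degree."""
    return sorted((g for g, d in degrees.items() if d >= min_degree),
                  key=lambda g: (-degrees[g], g))


# Placeholder network: labels and traits are illustrative only
interactions = [
    ("geneA", "geneB", "HD"), ("geneA", "geneB", "PH"),
    ("geneA", "geneC", "GY"), ("geneB", "geneC", "GY"),
    ("geneA", "geneD", "PN"),
]
degrees = gene_degrees(interactions)
hubs = hub_genes(degrees, min_degree=2)
```

Keeping the trait label on each interaction also allows edges to be grouped or colored by trait, as done in the network figure.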
Construction, planting, and phenotyping of the maize population
The North Carolina Design II (NC II) maize population, planted in 2019 and 2022, consisted of 145 parental lines (56 male and 89 female) and 1166 hybrids. In 2019, 97 inbred lines, including eight male and 89 female, along with 712 hybrids, were planted in Baixiang, Luohe, and Zhengzhou, China. In 2022, 65 inbred lines, including 55 male and 10 female, along with 521 hybrids, were planted in Shijiazhuang and Zhengzhou. Seventeen inbred lines and 67 hybrids were shared across both years. Pedigree information is presented in Supplemental Table 107. The NC II maize population was phenotyped for seven yield-related traits including PH, EH, KRN, KNR, SW, ET, and yield per plant (Yield). Control varieties were planted at each site to correct for environmental variation in phenotypes.
Whole-genome resequencing and genotyping
Fifty seeds were collected from each inbred line, and young seedlings were harvested 4 days after sowing. The seedlings were thoroughly ground in liquid nitrogen, and DNA was extracted using the CTAB method. DNA samples were submitted to BGI for 10× whole-genome resequencing on the BGI platform. Raw sequencing data were quality-controlled and filtered using SOAPnuke (version: v2.2.6) (Chen et al., 2018). Sequencing reads were aligned to the B73V5 (https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/) (Lawrence et al., 2004) reference genome using the MEM algorithm of BWA (version V0.7.17) (Li and Durbin, 2009), generating alignment files in SAM format. These SAM files were converted into sorted BAM files using the view and sort commands of Samtools (version: V1.16.1) (Li et al., 2009). The MarkDuplicates tool in Picard (v2.22.8) (https://broadinstitute.github.io/picard/) was used to mark PCR duplicates. The Qualimap2 (version v.2.3) (Okonechnikov et al., 2016) bamqc tool was used to perform quality control analyses of the BAM files. Based on the BAM files aligned to the reference genome, GATK (version v4.1.8.1) (McKenna et al., 2010) was used to call SNPs and InDels across all samples.
Real-data analyses for the maize dataset
The R package lme4 (version 1.1.31) (Bates et al., 2015) was used to calculate BLUP values for all seven traits across multiple locations in 2019, 2022, and the combined dataset. PLINK (Chang et al., 2015) was used for genotype quality control by filtering variants with a missing rate >0.3. Beagle was then used for imputation, and PLINK was subsequently used to filter variants with a minor allele frequency <0.01 to obtain the final genotype dataset. This process yielded 13 528 478 markers for all 145 parental lines. All ungenotyped hybrids were inferred from their corresponding parents using Fast3VmrMLM (Supplemental Text 8; Supplemental Table 107).
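The two filtering criteria can be written out explicitly. The sketch below computes the per-variant missing rate and minor allele frequency from genotypes coded 0, 1, or 2 (with `None` for missing calls) and applies the thresholds given above. It illustrates the quantities only: the study used PLINK for filtering and Beagle for imputation between the two filters, and the function names and toy genotypes are hypothetical.

```python
def missing_rate(genotypes):
    """Proportion of missing calls (None) for one variant."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from allele counts coded 0/1/2, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2.0 * len(called))
    return min(p, 1.0 - p)


def passes_missing_filter(genotypes, max_missing=0.3):
    """Variants with a missing rate > 0.3 are removed."""
    return missing_rate(genotypes) <= max_missing


def passes_maf_filter(genotypes, min_maf=0.01):
    """Variants with a minor allele frequency < 0.01 are removed."""
    return minor_allele_frequency(genotypes) >= min_maf


# Toy variants (made-up genotypes)
variant_ok = [0, 1, 2, None]                     # missing rate 0.25, MAF 0.5
variant_sparse = [0, 0, 0, None, None]           # missing rate 0.4
variant_rare = [0] * 99 + [1]                    # MAF 0.005
```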
The top three principal components, computed using the flashpcaR package (version 2.1) (Abraham et al., 2017), were used to correct for population structure. A ±200-kb window around S2AMs was used to identify known and candidate genes. Candidate gene confirmation (Supplemental Text 3) and genomic selection procedures were consistent with those used for the rice dataset.
Data and code availability
The Fast3VmrMLM, Fast3VmrMLM-Hap, and Fast3VmrMLM-mQTL algorithms proposed in this study have been integrated into the R software package Fast3VmrMLM v1.0, which is freely available at https://github.com/YuanmingZhang65/.
The 18K (Wei et al., 2024) and 1439 (Huang et al., 2015) rice datasets were obtained from https://figshare.com/s/12978737918eecb74903 and http://www.ncgr.ac.cn/RiceHap4, respectively. The simulation datasets are available from the corresponding author upon request.
Funding
This study was supported by the National Natural Science Foundation of China (32470657 and 32270673).
Acknowledgments
We thank Prof. Xuehui Huang from the College of Life Sciences of Shanghai Normal University for reading and commenting on the manuscript draft and for providing the 1439 and 18K rice datasets used in this study. No conflict of interest is declared.
Author contributions
Y.-M.Z. conceived and supervised the study and revised this manuscript. Y.W. and G.S. assisted in the supervision of the study and conducted the maize experiment. Y.-M.Z. and J.W. developed the methods and designed the study. Y.-M.Z., J.W., Y.C., and M.Z. wrote the draft. J.W. developed the software tool. J.W. and G.L. performed the simulation studies. J.W., Y.C., M.Z., A.Z., and X.C. contributed to GWAS and genetics analysis of the maize, rice, and soybean datasets. All the authors reviewed and approved the final manuscript.
Published: May 22, 2025
Footnotes
Supplemental information is available at Plant Communications Online.
Contributor Information
Yibo Wang, Email: chigohut@163.com.
Yuan-Ming Zhang, Email: soyzhang@mail.hzai.edu.cn.
References
- Abraham G., Qiu Y., Inouye M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics. 2017;33:2776–2778. doi: 10.1093/bioinformatics/btx299. [DOI] [PubMed] [Google Scholar]
- An B., Gao X., Chang T., et al. Genome-wide association studies using binned genotypes. Heredity. 2020;124:288–298. doi: 10.1038/s41437-019-0279-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arthur R.G., Robin T., Brian R.C. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics. 1995;51:1440–1450. [Google Scholar]
- Bates D., Mächler M., Bolker B., Walker S. Fitting linear mixed-effects models using lme4. J. Stat. Software. 2015;67:1–48. [Google Scholar]
- Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen Y., Chen Y., Shi C., et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience. 2018;7 doi: 10.1093/gigascience/gix120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeForest N., Majithia A.R. Genetics of type 2 diabetes: Implications from large-scale studies. Curr. Diab. Rep. 2022;22:227–235. doi: 10.1007/s11892-022-01462-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dudley J.W., Lambert R.J. 100 generations of selection for oil and protein in corn. Plant Breed. Rev. 2010:79–110. doi: 10.1002/9780470650240.ch5. [DOI] [Google Scholar]
- Emms D.M., Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Endelman J.B. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 2011;4:250–255. [Google Scholar]
- Fang F., Ye S., Tang J., Bennett M.J., Liang W. DWT1/DWL2 act together with OsPIP5K1 to regulate plant uniform growth in rice. New Phytol. 2020;225:1234–1246. doi: 10.1111/nph.16216. [DOI] [PubMed] [Google Scholar]
- Golub G.H., van Loan C.F. 3rd, The Johns Hopkins University Press; Baltimore and London: 1996. Matrix Computations. [Google Scholar]
- Guo T., Yu H., Qiu J., Li J.Y., Han B., Lin H.X. Advances in rice genetics and breeding by molecular design in China (in Chinese) Sci Sin Vitae. 2019;49:1185–1212. [Google Scholar]
- Henderson C.R. Best linear unbiased estimation and prediction under a selection mode. Biometrics. 1975;31:423–447. [PubMed] [Google Scholar]
- Holmdahl R. Association of MHC and rheumatoid arthritis: Why is rheumatoid arthritis associated with the MHC genetic region? An introduction. Arthritis Res. 2000;2:203–204. doi: 10.1186/ar87. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang X., Yang S., Gong J., Zhao Y., Feng Q., Gong H., Li W., Zhan Q., Cheng B., Xia J., et al. Genomic analysis of hybrid rice varieties reveals numerous superior alleles that contribute to heterosis. Nat. Commun. 2015;6:6258. doi: 10.1038/ncomms7258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang Y., Hu Y., Fu X.D., Xing Y.Z. Functional genes for grain yield related traits and their application in rice breeding (in Chinese) Chin. Bull. Life Sci. 2016;28:1147–1155. [Google Scholar]
- Jathar V., Saini K., Chauhan A., Rani R., Ichihashi Y., Ranjan A. Spatial control of cell division by GA-OsGRF7/8 module in a leaf explaining the leaf length variation between cultivated and wild rice. New Phytol. 2022;234:867–883. doi: 10.1111/nph.18029. [DOI] [PubMed] [Google Scholar]
- Jiang S., Li H., Zhang L., Mu W., Zhang Y., Chen T., Wu J., Tang H., Zheng S., Liu Y., et al. Generic Diagramming Platform (GDP): a comprehensive database of high-quality biomedical graphics. Nucleic Acids Res. 2025;53:D1670–D1676. doi: 10.1093/nar/gkae973. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang L., Zheng Z., Fang H., Yang J. A generalized linear mixed model association tool for biobank-scale data. Nat. Genet. 2021;53:1616–1621. doi: 10.1038/s41588-021-00954-4. [DOI] [PubMed] [Google Scholar]
- Jiang L., Zheng Z., Qi T., Kemper K.E., Wray N.R., Visscher P.M., Yang J. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019;51:1749–1755. doi: 10.1038/s41588-019-0530-8. [DOI] [PubMed] [Google Scholar]
- Kaasschieter E.F. Preconditioned conjugate gradients for solving singular systems. J. Comput. Appl. Math. 1988;24:265–275. [Google Scholar]
- Kanai M., Yamada T., Hayashi M., Mano S., Nishimura M. Soybean (Glycine max L.) triacylglycerol lipase GmSDP1 regulates the quality and quantity of seed oil. Sci. Rep. 2019;9:8924. doi: 10.1038/s41598-019-45331-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kroymann J., Mitchell-Olds T. Epistasis and balanced polymorphism influencing complex trait variation. Nature. 2005;435:95–98. doi: 10.1038/nature03480. [DOI] [PubMed] [Google Scholar]
- Kuhn M. Building predictive models in R using the caret package. J. Stat. Softw. 2008;28:1–26. [Google Scholar]
- Kumar A., Pandey S.S., Kumar D., Tripathi B.N. Genetic manipulation of photosynthesis to enhance crop productivity under changing environmental conditions. Photosynth. Res. 2023;155:1–21. doi: 10.1007/s11120-022-00977-w. [DOI] [PubMed] [Google Scholar]
- Lambert R.J. High-oil corn hybrids. In: Hallauer A.R., editor. Specialty Corns. 1994. pp. 123–145. [Google Scholar]
- Lander E.S., Botstein D. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121:185–199. doi: 10.1093/genetics/121.1.185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lappalainen T., Li Y.I., Ramachandran S., Gusev A. Genetic and molecular architecture of complex traits. Cell. 2024;187:1059–1075. doi: 10.1016/j.cell.2024.01.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lawrence C.J., Dong Q., Polacco M.L., Seigfried T.E., Brendel V. MaizeGDB, the community database for maize genetics and genomics. Nucleic Acids Res. 2004;32:393–397. doi: 10.1093/nar/gkh011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee S., Kim J., Han J.J., Han M.J., An G. Functional analyses of the flowering time gene OsMADS50, the putative SUPPRESSOR OF OVEREXPRESSION OF CO 1/AGAMOUS-LIKE 20 (SOC1/AGL20) ortholog in rice. Plant J. 2004;38:754–764. doi: 10.1111/j.1365-313X.2004.02082.x. [DOI] [PubMed] [Google Scholar]
- Li C., Liu X., Pan J., Guo J., Wang Q., Chen C., Li N., Zhang K., Yang B., Sun C., et al. A lil3 chlp double mutant with exclusive accumulation of geranylgeranyl chlorophyll displays a lethal phenotype in rice. BMC Plant Biol. 2019;19:456. doi: 10.1186/s12870-019-2028-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Zhou R., Liu P., et al. Design of high-monounsaturated fatty acid soybean seed oil using GmPDCTs knockout via a CRISPR-Cas9 system. Plant Biotechnol. J. 2023;21:1317–1319. doi: 10.1111/pbi.14060. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H.F., Wang J.T., Zhao Q., Zhang Y.M. BLUPmrMLM: A fast mrMLM algorithm in genome-wide association studies. Genom. Proteom. Bioinform. 2024;22 doi: 10.1093/gpbjnl/qzae020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li M., Zhang Y.W., Zhang Z.C., Xiang Y., Liu M.H., Zhou Y.H., Zuo J.F., Zhang H.Q., Chen Y., Zhang Y.M. A compressed variance component mixed model for detecting QTNs and QTN-by-environment and QTN-by-QTN interactions in genome-wide association studies. Mol. Plant. 2022;15:630–650. doi: 10.1016/j.molp.2022.02.012. [DOI] [PubMed] [Google Scholar]
- Li P., Li G., Zhang Y.W., Zuo J.F., Liu J.Y., Zhang Y.M. A combinatorial strategy to identify various types of QTLs for quantitative traits using extreme phenotype individuals in an F2 population. Plant Commun. 2022;3 doi: 10.1016/j.xplc.2022.100319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681. [DOI] [PubMed] [Google Scholar]
- Liu X., Huang M., Fan B., Buckler E.S., Zhang Z. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet. 2016;12 doi: 10.1371/journal.pgen.1005767. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B., et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meuwissen T., Hayes B., Goddard M. Genomic selection: A paradigm shift in animal breeding. Animal Frontiers. 2016;6:6–14. [Google Scholar]
- Okonechnikov K., Conesa A., García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32:292–294. doi: 10.1093/bioinformatics/btv566. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Purugganan M.D., Jackson S.A. Advancing crop genomics from lab to field. Nat. Genet. 2021;53:595–601. doi: 10.1038/s41588-021-00866-3. [DOI] [PubMed] [Google Scholar]
- Risch N., Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516. [DOI] [PubMed] [Google Scholar]
- Sasaki A., Ashikari M., Ueguchi-Tanaka M., Itoh H., Nishimura A., Swapan D., Ishiyama K., Saito T., Kobayashi M., Khush G.S., et al. A mutant gibberellin-synthesis gene in rice. Nature. 2002;416:701–702. doi: 10.1038/416701a. [DOI] [PubMed] [Google Scholar]
- Sato Y., Antonio B., Namiki N., Motoyama R., Sugimoto K., Takehisa H., Minami H., Kamatsuki K., Kusaba M., Hirochika H., et al. Field transcriptome revealed critical developmental and physiological transitions involved in the expression of growth potential in japonica rice. BMC Plant Biol. 2011;11:10. doi: 10.1186/1471-2229-11-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schneeberger K., Ossowski S., Lanz C., Juul T., Petersen A.H., Nielsen K.L., Jørgensen J.E., Weigel D., Andersen S.U. SHOREmap: simultaneous mapping and mutation identification by deep sequencing. Nat. Methods. 2009;6:550–551. doi: 10.1038/nmeth0809-550. [DOI] [PubMed] [Google Scholar]
- Segura V., Vilhjálmsson B.J., Platt A., Korte A., Seren Ü., Long Q., Nordborg M. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 2012;44:825–830. doi: 10.1038/ng.2314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B., Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. doi: 10.1101/gr.1239303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Svishcheva G.R., Axenovich T.I., Belonogova N.M., van Duijn C.M., Aulchenko Y.S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 2012;44:1166–1170. doi: 10.1038/ng.2410. [DOI] [PubMed] [Google Scholar]
- Wang L., Xie W., Chen Y., Tang W., Yang J., Ye R., Liu L., Lin Y., Xu C., Xiao J., et al. A dynamic gene expression atlas covering the entire life cycle of rice. Plant J. 2010;61:752–766. doi: 10.1111/j.1365-313X.2009.04100.x. [DOI] [PubMed] [Google Scholar]
- Wang L., Zhao L., Zhang X., Zhang Q., Jia Y., Wang G., Li S., Tian D., Li W.H., Yang S. Large-scale identification and functional analysis of NLR genes in blast resistance in the Tetep rice genome sequence. Proc. Natl. Acad. Sci. USA. 2019;116:18479–18487. doi: 10.1073/pnas.1910229116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang S.B., Feng J.Y., Ren W.L., Huang B., Zhou L., Wen Y.J., Zhang J., Dunwell J.M., Xu S., Zhang Y.M. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci. Rep. 2016;6 doi: 10.1038/srep19444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wei X., Qiu J., Yong K., Fan J., Zhang Q., Hua H., Liu J., Wang Q., Olsen K.M., Han B., Huang X. A quantitative genomics map of rice provides genetic insights and guides breeding. Nat. Genet. 2021;53:243–253. doi: 10.1038/s41588-020-00769-9. [DOI] [PubMed] [Google Scholar]
- Wei X., Chen M., Zhang Q., Gong J., Liu J., Yong K., Wang Q., Fan J., Chen S., Hua H., et al. Genomic investigation of 18,421 lines reveals the genetic architecture of rice. Science. 2024;385 doi: 10.1126/science.adm8762. [DOI] [PubMed] [Google Scholar]
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wen Y.J., Zhang H., Ni Y.L., Huang B., Zhang J., Feng J.Y., Wang S.B., Dunwell J.M., Zhang Y.M., Wu R. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief. Bioinform. 2018;19:700–712. doi: 10.1093/bib/bbw145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Werner B.K., Wilcox J.R. Recurrent selection for yield in Glycine max using genetic male-sterility. Euphytica. 1990;50:19–26. [Google Scholar]
- Wu B., Hu W., Xing Y.Z. The history and prospect of rice genetic breeding in China (in Chinese) Yi Chuan. 2018;40:841–857. doi: 10.16288/j.yczz.18-213. [DOI] [PubMed] [Google Scholar]
- Xu C., Wang Y., Yu Y., Duan J., Liao Z., Xiong G., Meng X., Liu G., Qian Q., Li J. Degradation of MONOCULM 1 by APC/C(TAD1) regulates rice tillering. Nat. Commun. 2012;3:750. doi: 10.1038/ncomms1743. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu S. An expectation-maximization algorithm for the Lasso estimation of quantitative trait locus effects. Heredity. 2010;105:483–494. doi: 10.1038/hdy.2009.180. [DOI] [PubMed] [Google Scholar]
- Xu S. Mapping quantitative trait loci by controlling polygenic background effects. Genetics. 2013;195:1209–1222. doi: 10.1534/genetics.113.157032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu S., Zhu D., Zhang Q. Predicting hybrid performance in rice using genomic best linear unbiased prediction. Proc. Natl. Acad. Sci. USA. 2014;111:12456–12461. doi: 10.1073/pnas.1413750111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yan W.-H., Wang P., Chen H.-X., et al. A major QTL, Ghd8, plays pleiotropic roles in regulating grain productivity, plant height, and heading date in rice. Mol. Plant. 2011;4:319–330. doi: 10.1093/mp/ssq070. [DOI] [PubMed] [Google Scholar]
- Yan W., Liu H., Zhou X., et al. Natural variation in Ghd7.1 plays an important role in grain yield and adaptation in rice. Cell Res. 2013;23:969–971. doi: 10.1038/cr.2013.43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yi N., Xu S. A random model approach to mapping quantitative trait loci for complex binary traits in outbred populations. Genetics. 1999;153:1029–1040. doi: 10.1093/genetics/153.2.1029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang B., Liu H., Qi F., et al. Genetic interactions among Ghd7, Ghd8, OsPRR37 and Hd1 contribute to large variation in heading date in rice. Rice. 2019;12:48. doi: 10.1186/s12284-019-0314-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang J., Che J., Ouyang Y. Engineering rice genomes towards green super rice. Curr. Opin. Plant Biol. 2024;82 doi: 10.1016/j.pbi.2024.102664. [DOI] [PubMed] [Google Scholar]
- Zhang Z., Li Y., Huang K., Xu W., Zhang C., Yuan H. Genome-wide systematic characterization and expression analysis of the phosphatidylinositol 4-phosphate 5-kinases in plants. Gene. 2020;756 doi: 10.1016/j.gene.2020.144915. [DOI] [PubMed] [Google Scholar]
- Zhou W., Bi W., Zhao Z., Dey K.K., Jagadeesh K.A., Karczewski K.J., Daly M.J., Neale B.M., Lee S. SAIGE-GENE+ improves the efficiency and accuracy of set-based rare variant association tests. Nat. Genet. 2022;54:1466–1469. doi: 10.1038/s41588-022-01178-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou W., Nielsen J.B., Fritsche L.G., Dey R., Gabrielsen M.E., Wolford B.N., LeFaive J., VandeHaar P., Gagliano S.A., Gifford A., et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhu B., Zhu M., Jiang J., Niu H., Wang Y., Wu Y., Xu L., Chen Y., Zhang L., Gao X., et al. The impact of variable degrees of freedom and scale parameters in Bayesian methods for genomic prediction in Chinese simmental beef cattle. PLoS One. 2016;11 doi: 10.1371/journal.pone.0154118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zuo J.F., Chen Y., Ge C., Liu J.Y., Zhang Y.M. Identification of QTN-by-environment interactions and their candidate genes for soybean seed oil-related traits using 3VmrMLM. Front. Plant Sci. 2022;13 doi: 10.3389/fpls.2022.1096457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zuo J., Li J. Molecular dissection of complex agronomic traits of rice: a team effort by Chinese scientists in recent years. Natl. Sci. Rev. 2014;1:253–276. [Google Scholar]
Data Availability Statement
The Fast3VmrMLM, Fast3VmrMLM-Hap, and Fast3VmrMLM-mQTL algorithms proposed in this study have been integrated into the R software package Fast3VmrMLM v1.0, which is freely available at https://github.com/YuanmingZhang65/.
The 18K (Wei et al., 2024) and 1439 (Huang et al., 2015) rice datasets were obtained from https://figshare.com/s/12978737918eecb74903 and http://www.ncgr.ac.cn/RiceHap4, respectively. The simulation datasets are available from the corresponding author upon request.




