PLOS Computational Biology. 2025 Sep 12;21(9):e1012734. doi: 10.1371/journal.pcbi.1012734

Sparse multitask group Lasso for genome-wide association studies

Asma Nouira 1,2,3,4,*, Chloé-Agathe Azencott 1,2,3
Editor: Adam Charles5
PMCID: PMC12448984  PMID: 40938940

Abstract

A critical hurdle in Genome-Wide Association Studies (GWAS) involves population stratification, wherein differences in allele frequencies among subpopulations within samples are influenced by distinct ancestry. This stratification implies that risk variants may be distinct across populations with different allele frequencies. This study introduces Sparse Multitask Group Lasso (SMuGLasso) to tackle this challenge. SMuGLasso is based on MuGLasso, which formulates this problem using a multitask group lasso framework in which tasks are subpopulations, and groups are population-specific Linkage-Disequilibrium (LD)-groups of strongly correlated Single Nucleotide Polymorphisms (SNPs). The novelty in SMuGLasso is the incorporation of an additional $\ell_1$-norm regularization for the selection of population-specific genetic variants. Like MuGLasso, SMuGLasso uses a stability selection procedure to improve robustness and gap-safe screening rules for computational efficiency. We evaluate MuGLasso and SMuGLasso on simulated data sets as well as on a case-control breast cancer data set and a quantitative GWAS in Arabidopsis thaliana. We show that SMuGLasso is well suited to addressing linkage disequilibrium and population stratification in GWAS data, and show the superiority of SMuGLasso over MuGLasso in identifying population-specific SNPs. On real data, we confirm the relevance of the identified loci through pathway and network analysis, and observe that the findings of SMuGLasso are more consistent with the literature than those of MuGLasso. All in all, SMuGLasso is a promising tool for analyzing GWAS data and furthering our understanding of population-specific biological mechanisms.

Author summary

Genome-Wide Association Studies (GWAS) scan thousands of genomes to identify loci associated with a complex trait. However, population stratification, which is the presence in the data of multiple subpopulations with differing allele frequencies, can lead to false associations or mask true population-specific associations. We recently proposed MuGLasso, a new computational method to address this issue. However, MuGLasso relied on an ad-hoc post-processing of the results to identify population-specific associations. Here, we present SMuGLasso, which directly identifies both global and population-specific associations. We evaluate both MuGLasso and SMuGLasso on several datasets, including both case-control (such as breast cancer vs. controls) and quantitative (for example, plant flowering time) traits, and show on simulations that SMuGLasso is better suited than MuGLasso for the identification of population-specific associations. In addition, SMuGLasso’s findings on real case studies are more consistent with the literature than those of MuGLasso, possibly because of false discoveries by MuGLasso. These results show that SMuGLasso could be applied to other complex traits to better elucidate the underlying biological mechanisms.

Introduction

Feature selection methods have emerged as a popular way of framing Genome-Wide Association Studies (GWAS) to uncover the genetic underpinnings of complex diseases, such as cancer. GWAS aim at establishing associations between genetic variants, more specifically Single Nucleotide Polymorphisms (SNPs), and the presence/absence of a disease or a quantitative trait [1-3]. However, their ability to identify relevant variants is limited by several difficulties, including the curse of dimensionality, population stratification, linkage disequilibrium, and the lack of stability of feature selection procedures with respect to small changes in the input samples. Consequently, the application of feature selection requires careful attention to mitigate false discoveries. The challenge in this context is optimizing the stability of selection to identify regions of interest while minimizing false positives [4].

Contrary to the assumption in many existing feature selection methods that SNPs associated with a phenotype are shared across diverse populations, numerous studies highlight population-specific genetic associations with certain diseases [5]. Notably, diseases can manifest distinct prevalence patterns across populations, leading to variations in risk variants from one genetic ancestry to another [6]. For instance, multiple studies underscore that Africans and Europeans exhibit dissimilar genes associated with the lactase-persistence phenotype [7], emphasizing the population-specific nature of genetic influences. Moreover, recent research has revealed significant differences in genetic risk factors of type 2 diabetes among East Asian and European individuals, highlighting the importance of considering population-specific genetic architectures in disease studies [8].

In previous work, we have introduced the Multitask Group Lasso (MuGLasso) framework, designating groups as blocks of SNPs in strong Linkage Disequilibrium (LD) and tasks as subpopulations [9]. We demonstrated its effectiveness in stably identifying SNPs associated with breast cancer.

Despite its effectiveness, the original MuGLasso design required additional post-processing steps to discern task-specific LD-groups [9], typically by filtering out LD-groups with near-zero coefficients in each task after model training. This limitation prompted the introduction of a second regularization term to enhance population-specific sparsity.

Hence, this paper introduces the Sparse Multitask Group Lasso (SMuGLasso), an extension of MuGLasso aimed at refining the population-specific selection of LD-groups. By combining the $\ell_{1,2}$-norm penalty of MuGLasso with an additional $\ell_1$-norm at the LD-group level, SMuGLasso seeks to improve the precision of LD-group selection.

We evaluate the performance of SMuGLasso against MuGLasso using simulated data and the DRIVE breast cancer dataset. In addition to these qualitative phenotypes, we assess both MuGLasso’s and SMuGLasso’s effectiveness on a quantitative Arabidopsis thaliana phenotype, further validating our approaches on non-human data with a large number of subpopulations.

We then use enrichment analyses and protein-protein interaction networks to analyze SMuGLasso’s findings on the DRIVE breast cancer data set, shedding light on the molecular mechanisms underlying breast cancer tumor growth.

Finally, we compare the stability of SMuGLasso, MuGLasso, and other existing methods in identifying LD-groups and SNPs associated with a phenotype, aiming to provide a comprehensive understanding of the proposed framework’s advantages in addressing the challenges inherent to GWAS.

Materials and methods

Ethics statement

The study used data from the DRIVE Breast Cancer OncoArray Genotypes dataset (dbGaP Study Accession phs001265.v1.p1), obtained from NIH after ethical review of project #17707, titled “Network-guided multi-locus biomarker discovery”, and used under approval of this request (#67806-4).

General framework

We introduce SMuGLasso, a four-step framework designed to enhance the precision of population-specific causal variant selection. The steps are similar to those of MuGLasso and are outlined as follows:

  1. Populations assignment: Each sample is assigned to a genetic population using PCA and k-means clustering. This results in the assignment of each population to an input task within the multitask framework, facilitating a tailored analysis for distinct subpopulations.

  2. LD-Groups formation: LD-groups consisting of strongly correlated SNPs are formed using adjacency-constrained hierarchical clustering through the adjclust [10] package to alleviate the curse of dimensionality by conducting feature selection at the group level. More specifically, we performed this clustering on each chromosome and subpopulation before merging the common boundaries across populations to construct a common set of shared LD groups (see S1 Appendix). Note that LD groups are non-overlapping and exhaustive, meaning each SNP is assigned to exactly one LD group and does not appear in multiple groups.

  3. Model fitting with dual penalty: The model is fitted with a regularization comprising two penalty terms. First, the MuGLasso penalty involves an $\ell_{1,2}$-norm, fostering sparsity at the LD-group level across all tasks and populations. However, this penalty does not promote sparsity at the task/population level, and does not allow the identification of population-specific LD-groups. For this reason, we add in SMuGLasso a second, $\ell_1$-norm penalty to enforce sparsity at the LD-group level for individual populations. To address computational complexity, the optimization problem is solved using coordinate descent with gap safe screening rules [11].

  4. Stability selection: To improve the robustness of the algorithm, we incorporate a stability selection procedure [12] to ensure a more stable genetic variants selection, contributing to the overall resilience of SMuGLasso.
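Step 1 of the list above can be illustrated with a minimal sketch. It assumes a small dense genotype matrix, computes the principal components with an SVD, and clusters them with plain Lloyd iterations; the function names `top_pcs` and `kmeans`, the farthest-point initialization, and the toy data are assumptions made for this example, not the authors' implementation (which computes the PCs with PLINK).

```python
import numpy as np

def top_pcs(genotypes, n_components):
    """Project samples onto the top principal components (PCA via SVD)."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Squared distance of every point to its closest already-chosen center.
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy data (hypothetical): two subpopulations with different allele frequencies,
# genotypes encoded as 0, 1 or 2.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = rng.uniform(0.1, 0.9, size=200)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(60, 200)),
    rng.binomial(2, freq_b, size=(40, 200)),
]).astype(float)

# Each cluster label plays the role of a task in the multitask model.
tasks = kmeans(top_pcs(geno, n_components=2), k=2)
```

In this toy setting the two clusters coincide with the two simulated subpopulations; on real data the number of PCs and clusters has to be chosen from the PCA plots.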

Unlike MuGLasso, the proposed setting eliminates the need for additional post-processing steps to obtain population-specific LD-groups. In MuGLasso, such population-specific groups are identified through a post-processing step, which removes LD-groups with near-zero coefficients for each task. SMuGLasso stands out by offering a more precise and refined approach to the selection of population-specific causal SNPs, thereby streamlining the process and enhancing the accuracy of the analysis.

As the population assignment, LD-groups formation, and stability selection procedures are similar to those presented in MuGLasso, we refer the reader to [9] for details and proceed with a detailed discussion of the SMuGLasso model fitting itself. The details of the model and its implementation are provided in S1 Appendix.

Notations

Given a set of $p$ SNPs measured on $n$ samples, we split the $n$ samples into $T$ subpopulations/tasks, each of size $n_t$ for $t = 1, \dots, T$, and the $p$ SNPs into $G$ LD-groups, each of size $p_g$ for $g = 1, \dots, G$. For each population $t$, we denote by $x_m^{(t)}$ the $p$-dimensional vector of SNPs of the $m$-th sample in the population ($m = 1, \dots, n_t$), and by $y_m^{(t)}$ its phenotype. In practice, we consider SNPs encoded as 0, 1 or 2 depending on the number of reference alleles, but the framework applies to any one-dimensional encoding.

MuGLasso and its formulation

In what follows, we recall the formulation of MuGLasso as presented in [9]. MuGLasso leverages a penalized regression framework to model the relationship between SNPs and phenotypes. The formulation seeks to achieve sparsity at the LD-group level and smoothness of regression coefficients within and across tasks. We formulate the MuGLasso optimization problem as follows:

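$$\min_{B \in \mathbb{R}^{p \times T}} \frac{1}{n} \sum_{t=1}^{T} \left( \sum_{m=1}^{n_t} \mathcal{L}\left( y_m^{(t)}, \sum_{j=1}^{p} \beta_j^{(t)} x_{mj}^{(t)} \right) \right) + \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|B^{(g)}\|_F, \quad (1)$$

where $\beta^{(t)} \in \mathbb{R}^{p}$ represents the regression coefficients specific to task $t$, denoted as $\beta^{(t)} = (B_{1t}, \dots, B_{pt})$, so that $B_{jt} = \beta_j^{(t)}$ represents the effect of SNP $j$ in task $t$. The loss function $\mathcal{L}$ takes the form of the quadratic loss for quantitative phenotypes ($y \in \mathbb{R}$) and of the logistic loss for qualitative phenotypes ($y \in \{0, 1\}$). Here, the outer summation is over tasks ($t = 1, \dots, T$), and the inner summation is over the individuals ($m = 1, \dots, n_t$) within each task, which makes explicit the two-level structure of the data: individuals nested within populations. The Frobenius norm $\|\cdot\|_F$ is used to quantify the size of matrices, and $B^{(g)} \in \mathbb{R}^{p_g \times T}$ refers to the submatrix of the full coefficient matrix $B \in \mathbb{R}^{p \times T}$ corresponding to the $p_g$ SNPs in LD group $g$ across all $T$ tasks.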

MuGLasso can be reformulated by transforming the original dataset into a new one, denoted $(\tilde{X}, \tilde{y})$, in which the design matrix $\tilde{X}$ is block-diagonal. This reformulation expresses our problem in a standard single-task regression framework that supports structured penalties such as the group Lasso. Unlike typical multitask learning settings, where each task shares the same samples but predicts different phenotypes, our setup involves different input populations predicting the same single phenotype. Stacking the design matrices $X^{(t)}$ block-diagonally into $\tilde{X}$ and concatenating the phenotype vectors into $\tilde{y}$ enables compatibility with common optimization algorithms and accommodates multiple input tasks. Here, $\tilde{X} \in \mathbb{R}^{n \times pT}$ is a block-diagonal matrix in which each of the $T$ diagonal blocks corresponds to the SNP matrix $X^{(t)} \in \mathbb{R}^{n_t \times p}$ for task $t$. Additionally, $\tilde{y}$ is an $n$-dimensional vector obtained by stacking the phenotype vectors of all tasks. With this transformation, we derive an adjusted optimization problem. Let $b \in \mathbb{R}^{pT}$ be the vector of regression coefficients, where $pT$ is the total number of features across all tasks. The reformulated optimization problem is then:

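$$\min_{b \in \mathbb{R}^{pT}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( \tilde{y}_i, \sum_{k=1}^{pT} b_k \tilde{x}_{ik} \right) + \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|b^{(g)}\|_2,$$

where $\tilde{y}_i$ is the $i$-th entry of the transformed phenotype vector $\tilde{y}$, and $\tilde{x}_{ik}$ is the $(i,k)$-th entry of the block-diagonal matrix $\tilde{X}$. Additionally, $b^{(g)} \in \mathbb{R}^{p_g T}$ denotes the regression coefficients associated with the SNPs of group $g$, and $\sqrt{p_g}$ is the square root of the size of group $g$.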
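The block-diagonal reformulation can be made concrete with a small NumPy sketch. The helper names `stack_tasks` and `group_columns`, the toy matrices, and the task-major column ordering (the $p$ columns of task 1 first, then those of task 2, and so on) are assumptions made for illustration only.

```python
import numpy as np

def stack_tasks(X_tasks, y_tasks):
    """Build the block-diagonal design X_tilde (n x pT) and the stacked phenotype y_tilde (n)."""
    p = X_tasks[0].shape[1]
    T = len(X_tasks)
    n = sum(X.shape[0] for X in X_tasks)
    X_tilde = np.zeros((n, p * T))
    row = 0
    for t, X in enumerate(X_tasks):
        n_t = X.shape[0]
        # Samples of task t only see the p columns reserved for task t.
        X_tilde[row:row + n_t, t * p:(t + 1) * p] = X
        row += n_t
    return X_tilde, np.concatenate(y_tasks)

def group_columns(group_of_snp, T):
    """Columns of X_tilde belonging to each LD-group: its p_g SNPs in all T tasks."""
    group_of_snp = np.asarray(group_of_snp)
    p = group_of_snp.size
    return {
        int(g): np.concatenate(
            [t * p + np.flatnonzero(group_of_snp == g) for t in range(T)]
        )
        for g in np.unique(group_of_snp)
    }

# Toy example: T = 2 populations, p = 3 SNPs, G = 2 LD-groups.
X1 = np.array([[0., 1., 2.], [1., 1., 0.]])                 # n_1 = 2
X2 = np.array([[2., 0., 1.], [0., 0., 1.], [1., 2., 2.]])   # n_2 = 3
y1, y2 = np.array([1., 0.]), np.array([0., 1., 1.])

X_tilde, y_tilde = stack_tasks([X1, X2], [y1, y2])
cols = group_columns([0, 0, 1], T=2)   # SNPs 0 and 1 form group 0, SNP 2 forms group 1
```

With this layout, a group-Lasso solver applied to `X_tilde` with the column groups in `cols` penalizes each LD-group jointly over all tasks, which is the structure of the reformulated problem.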

SMuGLasso

Problem formulation

To address potential limitations of the MuGLasso framework, we introduce the SMuGLasso method, which adds an $\ell_1$ penalty to the original formulation. This additional penalty aims to better control the selection of LD groups across different populations. MuGLasso applies the same group-level penalty for all tasks, but it cannot directly exclude a group from one population while keeping it in another; this requires a manual post-processing step to identify near-zero coefficients. In contrast, the SMuGLasso formulation encourages sparsity at the population level: it can automatically select or remove an LD group for a specific population during training. This leads to more accurate identification of population-specific variants and avoids extra steps after fitting the model. The optimization problem of SMuGLasso is written as follows:

$$\min_{B \in \mathbb{R}^{p \times T}} \frac{1}{n} \sum_{\substack{t=1 \\ \text{(tasks)}}}^{T} \left( \sum_{\substack{m=1 \\ \text{(samples in } t)}}^{n_t} \mathcal{L}\left( y_m^{(t)}, \sum_{j=1}^{p} \beta_j^{(t)} x_{mj}^{(t)} \right) \right) + \underbrace{\lambda_1 \sum_{g=1}^{G} \sqrt{p_g} \, \|B^{(g)}\|_F}_{\text{MuGLasso}} + \lambda_2 \sum_{g=1}^{G} \sqrt{p_g} \, \|B^{(g)}\|_1, \quad (2)$$

The penalization parameters $\lambda_1$ and $\lambda_2$ control the respective strengths of the two regularization terms. Here, $\|B^{(g)}\|_1$ denotes the elementwise $\ell_1$ norm of the matrix $B^{(g)}$, defined as $\sum_{j=1}^{p_g} \sum_{t=1}^{T} |B^{(g)}_{jt}|$. This penalty encourages sparsity within LD-groups at the population level, complementing the group-wise selection enforced by the Frobenius norm. The combination of the group-level Frobenius norm and the elementwise $\ell_1$ norm allows the model to identify LD-groups that are relevant across tasks while discarding irrelevant coefficients for specific populations. This improves interpretability by distinguishing shared from population-specific signals, and helps reduce false positives by avoiding the selection of entire groups when only a few elements are relevant.
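To make the two penalty terms concrete, the following sketch evaluates them on a toy coefficient matrix. It assumes that both terms are weighted by the square root of the group size; the function names `smuglasso_penalty` and `active_groups` and the example values are hypothetical and only serve the illustration.

```python
import numpy as np

def smuglasso_penalty(B, group_of_snp, lam1, lam2):
    """lam1 * sum_g sqrt(p_g) ||B^(g)||_F  +  lam2 * sum_g sqrt(p_g) ||B^(g)||_1."""
    group_of_snp = np.asarray(group_of_snp)
    total = 0.0
    for g in np.unique(group_of_snp):
        block = B[group_of_snp == g, :]                  # p_g x T submatrix B^(g)
        weight = np.sqrt(block.shape[0])                 # sqrt(p_g), assumed for both terms
        total += lam1 * weight * np.linalg.norm(block)   # Frobenius norm (group level)
        total += lam2 * weight * np.abs(block).sum()     # elementwise l1 norm
    return total

def active_groups(B, group_of_snp):
    """Boolean G x T table: does LD-group g have a non-zero coefficient in population t?"""
    group_of_snp = np.asarray(group_of_snp)
    return np.array([
        np.any(B[group_of_snp == g, :] != 0, axis=0)
        for g in np.unique(group_of_snp)
    ])

# p = 3 SNPs, T = 2 populations; SNPs 0 and 1 form LD-group 0, SNP 2 forms LD-group 1.
B = np.array([[3., 0.],
              [4., 0.],
              [0., 0.]])
groups = [0, 0, 1]

penalty = smuglasso_penalty(B, groups, lam1=1.0, lam2=0.5)
active = active_groups(B, groups)
```

Here LD-group 0 is selected for the first population only and LD-group 1 for none, which is the kind of population-specific pattern that the second penalty term is meant to produce directly, without post-processing.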

Optimization

The formulation of SMuGLasso can be transformed exactly as that of MuGLasso shown above. The reformulated model is then:

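$$\min_{b \in \mathbb{R}^{pT}} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( \tilde{y}_i, \sum_{k=1}^{pT} b_k \tilde{x}_{ik} \right) + \lambda_1 \sum_{g=1}^{G} \sqrt{p_g} \, \|b^{(g)}\|_2 + \lambda_2 \sum_{g=1}^{G} \sqrt{p_g} \, \|b^{(g)}\|_1, \quad (3)$$

where $b^{(g)} \in \mathbb{R}^{p_g T}$ is the vector of regression coefficients corresponding to all SNPs of group $g$ for all tasks.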

Gap safe screening rules

Gap safe screening [11] is a method designed to speed up the solution of regularized problems in statistical learning and high-dimensional data analysis. It uses a set of rules to identify and eliminate irrelevant features from the optimization problem, significantly reducing computational cost. These rules use the duality gap to rigorously guarantee that the discarded features have zero coefficients in the optimal solution, preserving the accuracy of the model while improving computational speed. This approach is particularly useful with large datasets and numerous features, which makes it valuable for GWAS data analysis. We detailed the fundamentals of these rules in the MuGLasso paper [9]. Code is available at https://github.com/asmanouira/SMuGLasso.
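The principle of gap safe screening can be illustrated on a deliberately simplified case. The sketch below applies the gap safe sphere test to the plain Lasso with quadratic loss, $\frac{1}{2}\|y - Xb\|_2^2 + \lambda \|b\|_1$, rather than to the group-structured SMuGLasso objective, for which the test is applied to group norms instead of single coefficients. The function names, the simple proximal-gradient solver, and the toy data are assumptions made for the example and are not taken from the authors' code.

```python
import numpy as np

def gap_safe_screen(X, y, b, lam):
    """Mask of features guaranteed to be zero at the optimum of
    0.5 * ||y - X b||^2 + lam * ||b||_1, computed from any current iterate b."""
    residual = y - X @ b
    # Rescale the residual to get a dual-feasible point: ||X^T theta||_inf <= 1.
    theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
    primal = 0.5 * residual @ residual + lam * np.abs(b).sum()
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = max(primal - dual, 0.0)            # duality gap (clipped against round-off)
    radius = np.sqrt(2.0 * gap) / lam        # the dual optimum lies within this ball around theta
    return np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0) < 1.0

def ista(X, y, lam, n_iter):
    """Proximal gradient descent (ISTA) for the same Lasso objective."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return b

# Toy data (hypothetical): only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)
lam = 0.5 * np.max(np.abs(X.T @ y))

b_rough = ista(X, y, lam, n_iter=20)       # early iterate, larger duality gap
b_final = ista(X, y, lam, n_iter=2000)     # near-optimal solution
screened_early = gap_safe_screen(X, y, b_rough, lam)
screened_late = gap_safe_screen(X, y, b_final, lam)
```

As the duality gap shrinks along the iterations, the safe ball gets smaller and more features can be discarded; every feature flagged by the rule, even from the early iterate, has a zero coefficient in the final solution.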

Related work

Our method is related to the group Lasso and multitask Lasso, which both rely on an $\ell_{1,2}$-norm penalty [13,14]. Building on these, several studies have proposed multitask variants with either two or three regularization terms [15-18]. However, these models scale poorly to high-dimensional data, which makes them inapplicable in our context.

To effectively select the additional population-specific regularization term for SMuGLasso, we conducted a comprehensive investigation into the applicability of existing methods. Notably, we found that the proposed sparsity-enforcing penalties were not suited to our specific problem. Our objective is to implement a regularization term that enforces sparsity for specific populations at the level of LD-groups.

We specifically examine the method proposed by Li et al. [18], which implements a multitask model based on three regularization terms. Their optimization problem is reformulated as follows:

$$\min_{\beta \in \mathbb{R}^{p \times T}} \frac{1}{2} \sum_{t=1}^{T} \sum_{m=1}^{n_t} \left\| y_m^{(t)} - \sum_{j=1}^{p} \beta_j^{(t)} x_{mj}^{(t)} \right\|_2^2 + \lambda_1 \underbrace{\sum_{j=1}^{p} \|\beta_j\|_2}_{\mathcal{R}_1(\beta)} + \lambda_2 \underbrace{\sum_{t=1}^{T} \sum_{g=1}^{G} \sqrt{p_g} \, \|\beta_g^{(t)}\|_2}_{\mathcal{R}_2(\beta)} + \lambda_3 \underbrace{\sum_{t=1}^{T} \|\beta^{(t)}\|_1}_{\mathcal{R}_3(\beta)}. \quad (4)$$

Here, the authors aim to enforce population-specific group sparsity using the term $\mathcal{R}_2(\beta)$, so as to select certain groups only for specific tasks or subpopulations. However, relying only on this regularization term results in an optimization problem that separates across tasks, meaning that the selection is performed independently for each task. To ensure that tasks are fitted simultaneously, the authors introduce the term $\mathcal{R}_1(\beta)$, corresponding to multitask regularization at the single-SNP level across all $T$ tasks. Additionally, they seek to enforce sparsity within groups using an $\ell_1$-norm over all SNPs, represented by a third regularization term $\mathcal{R}_3(\beta)$, corresponding to the second regularization term of the well-known sparse group Lasso.

In our study, we aim to enhance the selection of population-specific LD-groups. As mentioned above, integrating $\mathcal{R}_2(\beta)$ into SMuGLasso would not maintain multitasking across the $T$ tasks, while implementing $\mathcal{R}_3(\beta)$ alongside SMuGLasso would impede the interpretability of the selected features: selecting SNPs within groups for specific populations complicates determining the number of SNPs within an LD-group $g$ that must be set to 0 for the group to be considered not selected for a particular task $t$. Additionally, including both of these penalties in SMuGLasso would substantially increase computational demands at GWAS scale.

Another approach presents a sparse group multitask feature selection model for GWAS data that leverages pleiotropy, i.e., SNPs associated with multiple complex diseases [19]. However, this method addresses a different scenario from ours: the tasks are output phenotypes rather than input population samples. In that setting, the goal is to select groups of SNPs targeting the same gene or pathway. Additionally, the authors combine multiple GWAS datasets and retain only the SNPs shared between them, resulting in a substantial reduction in the number of SNPs (down to 3,766 SNPs) and thereby reducing computational complexity significantly.

Experiments

Data

Simulated data.

Using GWAsimulator [20], we simulate GWAS data following LD patterns of two populations (CEU: Utah residents with Northern and Western European ancestry and YRI: Yoruba in Ibadan, Nigeria) from HapMap3 [21]. We generate different numbers of samples across subpopulations to mimic the structure of real data, where samples are not necessarily equally distributed across subpopulations. We also introduce population stratification as a confounder by varying the case:control ratio within each subpopulation (CEU 1 300:1 700 and YRI 400:600).

We predefine a total of 200 causal SNPs (non-null hypotheses) as shown in Table 1, in which 50 SNPs (respectively 50 SNPs) are specific to the CEU (respectively YRI) and 100 shared between both populations. All other SNPs were considered non-causal (null hypotheses). We locate the predefined disease loci (and their corresponding LD-groups) randomly and without loss of generality through chromosomes 12, 19, 21, and 22, as shown in Table 2. In total, the data is composed of 4 000 samples and 50 000 SNPs. For CEU, there are 1,407 LD-groups, each containing an average of 35 SNPs. For YRI, there are 995 LD-groups, each containing an average of 50 SNPs. To evaluate method performance, we considered a SNP (or LD-group) as a true positive if it overlapped with any of the predefined causal SNPs (or LD-groups containing them). Conversely, selected SNPs or LD-groups not overlapping with any causal loci were counted as false positives. Note that we used a single simulation replicate due to the high computational cost of running all comparison methods with stability selection on large-scale data (4,000 samples and 50,000 SNPs).

Table 1. For simulated data, number of predefined causal SNPs.

| Populations | Number of SNPs |
|---|---|
| Specific-CEU | 50 |
| Specific-YRI | 50 |
| Shared (CEU+YRI) | 100 |
| Total | 200 |

Table 2. For simulated data, location of predefined disease loci represented by start/end positions and its corresponding LD-groups number in each subpopulation through chromosomes: 12, 19, 21 and 22.
Chromosome Subpopulations
CEU YRI
loci (# LD-groups) loci (# LD-groups)
12 4 000 - 4 050 (3) 4 000 - 4 050 (1)
19 1 000 - 1 050 (2) 1 000 - 1 050 (2)
21 10 000 - 10 050 (1)
22 1 000 - 1 050 (3)

DRIVE breast cancer OncoArray.

The DRIVE OncoArray dataset contains 28 281 individuals that were genotyped for 582 620 SNPs. 13 846 samples are cases and 14 435 are controls. The dataset contains data for the following countries: USA, Uganda, Nigeria, Cameroon, Australia and Denmark. Additional information about data access and ethical approval is presented in S2 Appendix.

Arabidopsis thaliana.

Our Arabidopsis thaliana dataset comes from the 1001 Genomes Project [22] (Build TAIR10). We study the DTF3 phenotype, which is the time until the first open flower, in days. The dataset obtained from easyGWAS [23] contains 923 samples and 6 973 565 SNPs divided into 5 chromosomes. This dataset contains plant samples coming from 44 countries.

Preprocessing

Quality control and imputation.

For the simulated dataset and DRIVE breast cancer, we exclude SNPs with a minor allele frequency below 5%, a p-value for Hardy-Weinberg Equilibrium in controls below $10^{-4}$, or a missing genotype rate above 10%. We also remove duplicate SNPs, as well as samples with over 10% missing SNPs. We impute missing genotypes in DRIVE using IMPUTE2 [24].

For Arabidopsis thaliana, we perform the quality control steps recommended by [23]. We use a Box-Cox transformation [25] of the phenotype to improve the normality of the measurements. We remove SNPs with a minor allele frequency lower than 5%.

LD pruning.

We perform LD pruning using PLINK [26] with an LD cutoff of $r^2 > 0.85$ and a sliding window of 50 Mb for the simulated data and DRIVE. For Arabidopsis thaliana, we use an LD cutoff of $r^2 > 0.75$ and a window size of 50 Mb. After preprocessing steps, 50 000 SNPs remain in the simulated data, 312 237 SNPs in DRIVE and 564 291 SNPs in the Arabidopsis thaliana data.

Population structure.

We use PLINK [26] to compute the principal components of the genotype matrix. In the simulated dataset, we find two populations, corresponding to the CEU and YRI populations (see S1 Fig). In DRIVE, we identify two populations (see S2 Fig) that we call in this paper POP1 (samples from the USA, Australia and Denmark) and POP2 (samples from the USA, Cameroon, Nigeria and Uganda).

In the Arabidopsis thaliana dataset, among 44 countries, we retrieve 5 populations using k-means clustering of the top 4 principal components (see S3 Fig, S4 Fig and S5 Fig). These 5 populations are detailed in S1 Table.

LD-groups choice.

For simulated and DRIVE data, we determine the LD-groups for each subpopulation and each chromosome using adjclust [10]. However, for Arabidopsis thaliana, adjclust did not scale computationally to the very large number of SNPs in the five chromosomes. Thus, we first split each chromosome into independent LD-blocks using the snp_ldsplit function [27] from the bigsnpr R package [28]. We then form the LD-groups by applying adjclust on the obtained chunks of independent LD-blocks.

For all three datasets, we then combine these LD-groups across populations by merging their boundary coordinates to obtain shared LD-groups.
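The merging step can be sketched as follows, under the assumption that "merging boundary coordinates" means taking the union of the LD-group boundaries found in each population (the exact procedure is described in S1 Appendix). The function name `merge_boundaries` and the toy intervals are hypothetical.

```python
def merge_boundaries(*groupings):
    """Each grouping is a sorted list of (start, end) SNP-index intervals, end exclusive,
    covering the same chromosome. The shared LD-groups are delimited by the union
    of all boundaries."""
    cuts = sorted({pos for grouping in groupings for interval in grouping for pos in interval})
    return list(zip(cuts[:-1], cuts[1:]))

# Toy chromosome with 10 SNPs, clustered separately in two populations.
ceu = [(0, 3), (3, 7), (7, 10)]   # 3 LD-groups
yri = [(0, 5), (5, 10)]           # 2 LD-groups

shared = merge_boundaries(ceu, yri)
```

Under this assumption the shared partition is at least as fine as each population-specific one, which is consistent with the number of shared LD-groups exceeding the per-population counts.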

Table 3 shows the number of LD-groups obtained for each subpopulation and the final number of shared groups.

Table 3. Number of LD groups for each subpopulation of the studied datasets (simulated, DRIVE and Arabidopsis thaliana), and after combination across subpopulations.

| Data | Subpopulation | # LD-groups | # shared LD-groups |
|---|---|---|---|
| Simulated data | CEU | 1 407 | 1 566 |
| | YRI | 995 | |
| DRIVE real data | POP1 | 8 152 | 17 782 |
| | POP2 | 5 032 | |
| A. thaliana data | POP1 | 1 846 | 7 080 |
| | POP2 | 1 950 | |
| | POP3 | 2 002 | |
| | POP4 | 1 728 | |
| | POP5 | 1 834 | |

Comparison partners

As a baseline, we use PLINK to conduct association studies between each SNP individually and the phenotype, employing either the top PCs as covariates (Adjusted GWAS) or treating each population separately (Stratified GWAS). We also add FastLMM [29] as a representative baseline of linear-mixed model approaches. FastLMM is designed to adjust globally for population stratification but cannot identify population-specific SNPs; note also that there is no point in running it separately on each subpopulation.

Additionally, we derive a PCA-adjusted phenotype by regressing the phenotype on the top PCs and computing the residuals. To explore the impact of grouping correlated SNPs, segregating populations into tasks, and using an additional penalty to automatically select population-specific SNPs, we compare SMuGLasso with MuGLasso and various other methods. These include a single-task Lasso without groups applied to each population separately (Stratified Lasso) or to the adjusted phenotype (Adjusted Lasso), as well as a single-task group Lasso applied to each population separately (Stratified group Lasso) or to the adjusted phenotype (Adjusted group Lasso). Details are given in Table 4.

Table 4. Summary of baseline methods compared to SMuGLasso.

Population Handling: Separated means each population is analyzed independently; Handled together indicates populations are pooled into a single dataset; Modeled jointly refers to multi-task learning where populations are treated as related but distinct tasks.

| Method | Population Handling | LD Grouping | Adj. Pheno |
|---|---|---|---|
| Adjusted GWAS | Handled together | No | No |
| FastLMM | Handled together | No | No |
| Stratified Lasso | Separated | No | No |
| Adjusted Lasso | Handled together | No | Yes |
| Stratified Group Lasso | Separated | Yes | No |
| Adjusted Group Lasso | Handled together | Yes | Yes |
| MuGLasso | Modeled jointly | Yes | No |
| SMuGLasso | Modeled jointly | Yes | No |

Note that for Adjusted Lasso and Adjusted Group Lasso, when handling qualitative phenotypes (such as case-control in DRIVE), we employ logistic regression on the top principal components (PCs) for adjustment. The resulting residuals are subtracted from the actual phenotype values (1 for cases or 0 for controls) to generate a newly adjusted phenotype [30]. This adjustment method, while not widely used, was chosen because it effectively reduced population stratification effects in our experiments, achieving an inflation factor close to 1.

Similarly, for quantitative phenotypes (e.g., DTF3 in Arabidopsis thaliana), we apply the same procedure but with linear regression. Traditionally, PCA-based methods or linear mixed models are commonly used for population stratification adjustment in GWAS. With such approaches, it is always possible to add top PCs as additional features, but there is no guarantee that the method will select, and therefore use, them. For instance, FastLMM [29] is recommended for Arabidopsis thaliana. However, integrating linear mixed models for feature selection poses challenges in machine learning applications.
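For the quantitative case, the adjustment amounts to an ordinary least-squares fit of the phenotype on the top PCs followed by taking residuals. The sketch below is a minimal illustration with simulated values; the function name `adjust_phenotype`, the inclusion of an intercept, and the toy data are assumptions made for the example.

```python
import numpy as np

def adjust_phenotype(y, pcs):
    """Residuals of the least-squares regression of y on an intercept and the top PCs."""
    design = np.column_stack([np.ones(len(y)), pcs])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Toy data (hypothetical): a phenotype largely driven by two principal components.
rng = np.random.default_rng(1)
pcs = rng.standard_normal((100, 2))
y = 3.0 * pcs[:, 0] - 1.0 * pcs[:, 1] + 0.5 * rng.standard_normal(100)

y_adj = adjust_phenotype(y, pcs)
```

By construction, the adjusted phenotype is uncorrelated with the PCs used in the regression, so a model fitted on `y_adj` can no longer pick up the structure those PCs capture.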

In practice, we use bigLasso [31] for the lassos and gap safe screening rules [11] for the group Lasso to optimize computational efficiency. Across all methods, we determine the regularization hyperparameter(s) through cross-validation. More specifically, we use the F1-score as a criterion for binary phenotypes and the RMSE for quantitative phenotypes. We select the hyperparameters $\lambda_1$ and $\lambda_2$ using 3-fold cross-validation on simulated data generated with GWAsimulator, which uses real HapMap3 CEU and YRI genotypes, closely matching real datasets like DRIVE. Due to computational constraints, we reused these tuned values for the DRIVE dataset. For Arabidopsis thaliana, we performed 3-fold cross-validation directly on the dataset (without stability selection), and found that the selected hyperparameters were similar to those from simulation-based tuning. To ensure a fair comparison, 3-fold cross-validation was consistently used across all datasets and methods. For models that jointly handle all populations (e.g., SMuGLasso, MuGLasso, Adjusted Lasso/Group Lasso), folds were stratified by both phenotype and population label to preserve balance. For methods applied independently per population (Stratified Lasso/Group Lasso), folds were stratified only by phenotype within each subpopulation.

To assess methodological performance, we analyze runtime, the ability to identify true causal SNPs (in simulated data), and the stability of feature selection. To quantify the stability of the feature selection procedure with respect to perturbations of the input, we repeat the feature selection process on 10 subsamples of the data and report the average Pearson’s correlation among all pairs of indicator vectors representing the selected features for each subsample (see S1 Appendix). Our choice of Pearson’s correlation is based on the benchmark of Nogueira et al. [32], which shows it is superior to alternatives such as the Jaccard index as it adjusts for chance agreement and allows for selection sizes that vary across repetitions.
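The stability measure described above can be sketched in a few lines. This is a minimal illustration, assuming each subsample yields a 0/1 indicator vector over the same features; the function name `selection_stability` and the toy selections are hypothetical.

```python
import numpy as np
from itertools import combinations

def selection_stability(selections):
    """Average Pearson correlation over all pairs of 0/1 selection indicator vectors."""
    selections = np.asarray(selections, dtype=float)
    corrs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(selections, 2)]
    return float(np.mean(corrs))

# Three subsamples, six features; 1 means the feature was selected on that subsample.
runs = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
]
stability = selection_stability(runs)   # (1 + 0.25 + 0.25) / 3 = 0.5
```

Note that the correlation is undefined when a run selects no feature or all features, since the indicator vector is then constant; such runs would need separate handling.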

Biological interpretation

Functional mapping and annotations analysis.

We use FUMA [33] to map functionally annotated SNPs to genes according to their physical position in the genome and to eQTL mapping. The tool uses information from multiple biological data sources to perform these mapping analyses. We used Ensembl version 110 as the reference, and the 1000 Genomes Project Phase 3 as the reference panel. A physical mapping window of 10 kb was employed to map variants to nearby genes, a commonly used range to capture potential regulatory effects of nearby SNPs while limiting false positives. We ensured that FUMA would interpret all the SNPs we had selected by assigning them an arbitrary placeholder p-value below the significance threshold of $5 \times 10^{-8}$ that it uses for selecting SNPs of interest. The eQTL mapping was performed using GTEx data version 6, including all available tissue types.

For Arabidopsis thaliana dataset, we map SNPs identified by Adjusted GWAS, MuGLasso, and SMuGLasso to genes using TAIR10, which provides genomic location data of Arabidopsis thaliana genes in GFF3 format.

Gene set enrichment analysis.

We use Metascape [34] to perform gene set enrichment analysis to understand the functional relevance of the gene lists that Adjusted GWAS, MuGLasso and SMuGLasso have discovered. This tool performs pathway and process enrichment analysis using multiple pathway databases: KEGG Pathway, Reactome Pathway, WikiPathways, PID, BioCarta, Panther Pathway, SMPDB, GO Biological Processes, CORUM, TargetScan Pathway, TF targets and PharmGKB. Metascape also performs gene set enrichment analysis against cell type signatures, the gene-disease database DisGeNET, the pattern gene database PaGenBase, and transcription factor targets. We selected Metascape over other tools because it integrates a wide range of annotation resources in a single platform, offers high-quality visualizations, and supports both human and non-human species, including Arabidopsis thaliana.

Metascape collects terms with an enrichment p-value <0.01, a minimum number of occurrences of 3, and an enrichment factor >1.5 and groups them into clusters based on membership similarities. The term with the smallest p-value within a cluster then represents the cluster.
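The filtering and cluster-representative rules stated above can be sketched as follows. This is a simplified illustration and not Metascape's code: the dictionary layout, the function names and the toy terms are hypothetical, and cluster labels are taken as given rather than computed from membership similarities.

```python
def filter_terms(terms, max_p=0.01, min_count=3, min_enrichment=1.5):
    """Keep terms with p-value < max_p, at least min_count occurrences,
    and an enrichment factor > min_enrichment."""
    return [
        t for t in terms
        if t["p"] < max_p and t["count"] >= min_count and t["enrichment"] > min_enrichment
    ]


def cluster_representatives(terms):
    """For each cluster label, return the term with the smallest p-value."""
    representatives = {}
    for t in terms:
        best = representatives.get(t["cluster"])
        if best is None or t["p"] < best["p"]:
            representatives[t["cluster"]] = t
    return representatives


# Made-up terms, for illustration only.
terms = [
    {"name": "term_a", "p": 1e-6, "count": 5, "enrichment": 4.0, "cluster": 1},
    {"name": "term_b", "p": 1e-4, "count": 4, "enrichment": 3.0, "cluster": 1},
    {"name": "term_c", "p": 0.005, "count": 2, "enrichment": 6.0, "cluster": 2},  # too few occurrences
    {"name": "term_d", "p": 0.02, "count": 8, "enrichment": 2.0, "cluster": 2},   # p-value too large
    {"name": "term_e", "p": 0.001, "count": 3, "enrichment": 1.8, "cluster": 3},
]
kept = filter_terms(terms)
for cluster, term in sorted(cluster_representatives(kept).items()):
    print(cluster, term["name"])
```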

Protein-protein interaction network analysis.

This analysis adds a layer of biological validation and helps assess whether the selected genes are not only statistically significant, but also biologically connected through shared functional pathways. We further use Metascape [34] to construct a protein-protein interaction network between enriched genes. Metascape uses multiple sources to this end, including experimental and predicted interactions, and identifies densely connected network components using MCODE [35]. We finally visualize the obtained network using Cytoscape [36] to enable the discovery of functionally related gene groups for breast cancer on the DRIVE dataset and for DTF3 on the Arabidopsis thaliana dataset.

Results

SMuGLasso and MuGLasso rely on both LD-groups and the multitask approach to recover disease SNPs in simulated data

On simulated data, we observe that SMuGLasso and MuGLasso outperform the other methods at recovering the predefined disease loci (see Fig 1). In addition, we confirm that performing feature selection at the level of LD-groups provides better performance than conventional single-SNP selection. This confirms that grouping SNPs helps to alleviate the curse of dimensionality and improves the identification of causal variants.

Fig 1. On simulated data, ability of different methods to retrieve causal disease SNPs as a ROC plot.

Table 7 gives, for both SMuGLasso and MuGLasso, the number of selected LD-groups and SNPs across and per subpopulation for each dataset. Compared to MuGLasso, we notice that SMuGLasso ensures more sparsity for shared selection across all subpopulations thanks to its additional 1-norm penalty.

Table 7. Number of selected LD-groups/SNPs, across and per population, for the three data sets, for both SMuGLasso and MuGLasso.

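| Data | Population | SMuGLasso: # selected LD-groups (and SNPs) | MuGLasso: # selected LD-groups (and SNPs) |
| --- | --- | --- | --- |
| Simulated data | CEU | 2 (104 SNPs) | 2 (88 SNPs) |
| Simulated data | YRI | 3 (64 SNPs) | 1 (14 SNPs) |
| Simulated data | shared (CEU and YRI) | 3 (122 SNPs) | 6 (261 SNPs) |
| DRIVE | POP1 | 5 (155 SNPs) | 6 (148 SNPs) |
| DRIVE | POP2 | 1 (21 SNPs) | 2 (43 SNPs) |
| DRIVE | shared (POP1 and POP2) | 52 (1 103 SNPs) | 54 (1 166 SNPs) |
| A. thaliana | POP1 | 3 (247 SNPs) | 2 (164 SNPs) |
| A. thaliana | POP2 | 5 (381 SNPs) | 4 (303 SNPs) |
| A. thaliana | POP3 | 1 (81 SNPs) | 0 |
| A. thaliana | POP4 | 3 (232 SNPs) | 3 (232 SNPs) |
| A. thaliana | POP5 | 1 (72 SNPs) | 0 |
| A. thaliana | shared (5 populations) | 67 (5 354 SNPs) | 95 (7 555 SNPs) |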

SMuGLasso provides a more precise selection at the population-specific level. Indeed, SMuGLasso successfully recovers causal LD-groups/SNPs that MuGLasso missed in simulated data (see Fig 2).

Fig 2. For simulated data, precision and recall of SMuGLasso, MuGLasso and the stratified approaches on the population-specific SNPs.

We note that SMuGLasso is more intensive computationally compared to MuGLasso and any other tested method (see Fig 3). This computational cost is caused by the additional population-specific regularization term. However, the implementation is efficient enough to scale to high-dimensional GWAS data thanks to gap-safe screening rules. Here, the reported runtimes for both SMuGLasso and MuGLasso include the stability selection procedure.

Fig 3. Runtimes of Lasso approaches for simulated, DRIVE and Arabidopsis thaliana datasets.

MuGLasso and SMuGLasso point to genes of interest not identified by classical GWAS in real data

Genes identified by physical mapping annotation of SNPs selected in the DRIVE data.

The breast cancer risk genes identified by physical mapping of the SNPs selected by adjusted GWAS, SMuGLasso and MuGLasso on DRIVE are detailed in S2 Table.

SMuGLasso and MuGLasso both recover the 9 risk genes identified by classical GWAS. SMuGLasso identifies 27 more risk genes. Of those, 17 have been previously identified in a meta-GWAS analysis containing the DRIVE data (see S3 Table), and another 8 have been found to be associated with breast cancer risk in other studies, leaving 2 genes with no previous evidence supporting their relation to the disease (see S4 Table).

MuGLasso selects the same genes as SMuGLasso, and an additional 5 genes. We have found in the literature evidence of the association with breast cancer of only 2 of those, leaving another 3 genes with no previous supporting evidence.

To summarize these findings, the distribution of evidence types for genes identified by each method is illustrated in Fig 4.

Fig 4. On DRIVE and through physical mapping annotation: stacked bar plot presenting the number of genes identified by different methods on the x-axis (GWAS genes, shared SMuGLasso and MuGLasso genes not identified by GWAS, and MuGLasso-only genes not identified by SMuGLasso and GWAS), categorized by evidence type (no evidence, evidence in the literature, and evidence in meta-GWAS).

Note that MuGLasso finds all genes selected by SMuGLasso, and SMuGLasso finds all genes selected by a classical GWAS.

Genes identified by physical mapping annotation of SNPs selected in the Arabidopsis thaliana data.

The genes associated with the Arabidopsis thaliana DTF3 phenotype according to adjusted GWAS, SMuGLasso and MuGLasso are presented in S5 Table. Again, SMuGLasso and MuGLasso both recover all the risk genes (7 in total) identified by classical GWAS. SMuGLasso identifies an additional 41 genes, including 8 population-specific findings, and MuGLasso identifies 7 more genes on top of those selected by SMuGLasso. Only 4 of the 55 genes selected by MuGLasso are population-specific.

Genes identified by eQTL mapping annotation of SNPs selected in the DRIVE data.

In addition to physical mapping, our use of eQTL functional annotations aims to uncover supplementary information about the genetic basis of breast cancer disease in DRIVE. S6 Table presents the genes obtained by both physical and eQTL mapping of the loci identified by the adjusted GWAS, SMuGLasso and MuGLasso. Using eQTL mapping adds 25 genes to the 9 identified by physical mapping of the adjusted GWAS results. 2 of those (PTHLH and TNRC6B) had been identified through the physical mapping of loci selected by SMuGLasso (and hence MuGLasso) but not the classical GWAS.

Using eQTL mapping also adds 30 genes to the list of those selected by SMuGLasso but not the classical GWAS. Of these, 26 are confirmed by the literature (see S7 Table). Finally, the loci selected only by MuGLasso point to an additional 12 genes through eQTL mapping, only 2 of which are linked to breast cancer in the literature. The distribution of evidence types for these gene discoveries is illustrated in Fig 5.

Fig 5. On DRIVE and through eQTL mapping annotation: stacked bar plot presenting the number of genes identified by different methods on the x-axis (GWAS genes, shared SMuGLasso and MuGLasso genes not identified by GWAS, and MuGLasso-only genes not identified by SMuGLasso and GWAS), categorized by evidence type (no evidence or evidence in the literature).

SMuGLasso and MuGLasso outperform the other methods in terms of stability

Table 5, Table 6 and S8 Table report, for the simulated, DRIVE and Arabidopsis thaliana datasets respectively, the stability index of each tested method alongside the number of selected LD-groups and SNPs and the selection level (LD-groups or single-SNP). We use 100 subsamples to perform stability selection [12]. The obtained metrics show that stability selection increases the robustness of SMuGLasso, MuGLasso and Adjusted group Lasso on all three datasets.

Table 5. Stability index and number of selected features for different methods, on simulated data.

| Methods | # selected LD-groups | # selected SNPs | Stability index | Selection level |
| --- | --- | --- | --- | --- |
| SMuGLasso | 8 | 290 | 0.5811 | LD-groups |
| SMuGLasso without stability selection | 9 | 328 | 0.5045 | LD-groups |
| MuGLasso | 10 | 363 | 0.7015 | LD-groups |
| MuGLasso without stability selection | 11 | 402 | 0.6124 | LD-groups |
| Adjusted group Lasso + stability selection | 11 | 374 | 0.5929 | LD-groups |
| Adjusted group Lasso | 12 | 392 | 0.5340 | LD-groups |
| Stratified group Lasso | 13 | 452 | 0.4491 | LD-groups |
| Adjusted Lasso | 12 | 422 | 0.4053 | Single-SNP |
| Stratified Lasso | 13 | 441 | 0.3140 | Single-SNP |
| Adjusted GWAS | 3 | 109 | 0.9834 | Single-SNP |
| FastLMM | 3 | 73 | 0.9714 | Single-SNP |

Table 6. Stability index and number of selected features for different methods, on the DRIVE data set.

| Methods | # selected LD-groups | # selected SNPs | Stability index | Selection level |
| --- | --- | --- | --- | --- |
| SMuGLasso | 58 | 1 279 | 0.3881 | LD-groups |
| SMuGLasso without stability selection | 60 | 1 354 | 0.3325 | LD-groups |
| MuGLasso | 62 | 1 357 | 0.4312 | LD-groups |
| MuGLasso without stability selection | 72 | 1 524 | 0.3911 | LD-groups |
| Adjusted group Lasso + stability selection | 59 | 1 293 | 0.3234 | LD-groups |
| Adjusted group Lasso | 68 | 1 466 | 0.2613 | LD-groups |
| Stratified group Lasso | 58 | 1 119 | 0.2498 | LD-groups |
| Adjusted Lasso | 41 | 874 | 0.2068 | Single-SNP |
| Stratified Lasso | 38 | 789 | 0.1581 | Single-SNP |
| Adjusted GWAS | 16 | 306 | 0.7724 | Single-SNP |
| FastLMM | 13 | 227 | 0.7471 | Single-SNP |

For the simulated data (Table 5), SMuGLasso and MuGLasso exhibit noteworthy stability, surpassing other methods. SMuGLasso selects 8 LD-groups and 290 SNPs, demonstrating a stability index of 0.5811. Similarly, MuGLasso selects 10 LD-groups and 363 SNPs with a higher stability index of 0.7015. Even without stability selection, both methods marginally increase the number of selected LD-groups and SNPs compared to classical approaches while maintaining relatively high stability indices. Adjusted GWAS and FastLMM, based on single-marker testing, show even greater stability, with indices of 0.8834 (109 SNPs on 3 LD-groups) and 0.8714 (73 SNPs on 3 LD-groups), respectively.

For the DRIVE dataset (Table 6), SMuGLasso and MuGLasso continue to exhibit better stability than their comparison partners. SMuGLasso selects 58 LD-groups and 1,279 SNPs with a stability index of 0.3881, while MuGLasso selects 62 LD-groups and 1,357 SNPs with a stability index of 0.4312. Adjusted GWAS and FastLMM achieve higher stabilities, with indices of 0.7724 (306 SNPs on 16 LD-groups), and 0.7471 (227 SNPs on 13 LD-groups) respectively.

For the Arabidopsis thaliana dataset (see S8 Table), SMuGLasso and MuGLasso once again present the best stability indices among the feature selection methods. SMuGLasso selects 80 LD-groups and 6,367 SNPs with a stability index of 0.4315, while MuGLasso selects 104 LD-groups and 8,254 SNPs with a stability index of 0.5733. Thus, SMuGLasso and MuGLasso demonstrate good performance even with datasets containing relatively few samples. However, Adjusted GWAS (0.7129 for 31 SNPs on 7 LD-groups) and FastLMM (0.8106 for 12 SNPs on 5 LD-groups) again yield the highest stability scores overall.

In summary, traditional GWAS and FastLMM, which rely on single-marker association tests, achieve the highest stability indices across all datasets. This observation reflects the robustness of univariate testing strategies when measuring stability across subsamples [4]. However, it is important to note that these methods tend to detect only a small subset of the causal variants, often restricted to the most strongly associated signals. Moreover, a major limitation of these methods is their inability to identify population-specific associations, as they do not model population structure in a multivariate framework as feature selection models do. While FastLMM is designed to account for population structure by globally adjusting for diversity or admixture, it does not allow detection of signals specific to individual populations. Similarly, stratifying the analysis in GWAS by population reduces statistical power, often resulting in very few or no detected associations, especially for populations with few samples.

Note that for methods offering selection at the single-SNP level, once a SNP is selected, we consider that the entire LD-group is selected. This approach enables direct comparison across methods with different genomic selection levels (single-SNP level versus LD-groups level). We observe that LD-group-based methods tend to produce more stable results as they reduce the number of candidate options to be selected. Indeed, this makes the selection more consistent across subsamples.
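The convention described above, in which a single selected SNP marks its entire LD-group as selected, can be sketched as follows. The SNP-to-group dictionary and the identifiers are hypothetical.

```python
def lift_to_ld_groups(selected_snps, snp_to_group):
    """Return the LD-groups containing at least one selected SNP."""
    return {snp_to_group[snp] for snp in selected_snps}


def expand_groups(selected_groups, snp_to_group):
    """Return every SNP that belongs to one of the selected LD-groups."""
    return {snp for snp, group in snp_to_group.items() if group in selected_groups}


# Hypothetical assignment of SNPs to LD-groups.
snp_to_group = {"rs1": "ld1", "rs2": "ld1", "rs3": "ld2", "rs4": "ld3"}
groups = lift_to_ld_groups({"rs1", "rs4"}, snp_to_group)
print(sorted(groups))
print(sorted(expand_groups(groups, snp_to_group)))
```

In this toy example, selecting rs1 alone is enough to count rs2 as selected as well, since both belong to the same LD-group.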

Because MuGLasso does not include an additional 1-norm penalty, it selects more LD-groups than SMuGLasso overall, which affords it higher stability. This is an expected trade-off in sparse modeling, similar to what is observed when comparing Lasso with Elastic Net [37]: enforcing an additional penalty yields a sparser model, at the expense of stability. Among the feature selection methods, MuGLasso gives the best stability values on all datasets, followed by SMuGLasso, which selects fewer SNPs and LD-groups than MuGLasso and outperforms the other applied feature selection methods.

SMuGLasso and MuGLasso select both population-specific and shared LD-groups on both simulated and real data

SMuGLasso ensures the selection of both shared (across tasks) and task-specific LD-groups. MuGLasso can also provide such a selection at the cost of a post-processing step. Specifically, for each task, LD-groups with regression coefficients below a small threshold of 10^-2 are considered inactive and removed. This post-processing was consistently applied to MuGLasso during our comparison experiments to ensure a fair evaluation.
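A minimal sketch of this post-processing step is given below. The data layout and names are hypothetical, and summarizing each LD-group by its largest absolute coefficient is our assumption: the text only states that groups whose coefficients fall below the threshold are treated as inactive.

```python
def active_tasks_per_group(coefs, groups, threshold=1e-2):
    """List, for each LD-group, the tasks in which the group is active.

    coefs:  {task: [coefficient per SNP]}
    groups: {group_id: [SNP indices belonging to the group]}
    A group is active in a task if its largest absolute coefficient
    reaches the threshold (assumed summary statistic).
    """
    result = {}
    for group_id, indices in groups.items():
        active = [
            task
            for task, beta in coefs.items()
            if max(abs(beta[j]) for j in indices) >= threshold
        ]
        result[group_id] = sorted(active)
    return result


# Made-up coefficients for two populations and two LD-groups.
coefs = {"POP1": [0.5, 0.3, 0.001, 0.0], "POP2": [0.4, 0.0, 0.2, 0.05]}
groups = {"g1": [0, 1], "g2": [2, 3]}
print(active_tasks_per_group(coefs, groups))
```

A group that remains active in every task can then be read as shared, and a group active in a single task as specific to that population.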

Table 7 presents the number of shared and population-specific LD-groups (along with the corresponding number of SNPs) selected respectively by SMuGLasso and MuGLasso on the simulated, DRIVE, and Arabidopsis thaliana data sets. This underscores the ability of both methods to capture task-specific genetic features while also identifying shared patterns across different populations.

For comparison, feature selection in stratified models is conducted separately for each task. Thus, the population-specific LD-groups in stratified models correspond to LD-groups that were only selected in one population. Notably, the adjusted methods for population stratification (Adjusted group Lasso, Adjusted Lasso, and Adjusted GWAS) do not allow the selection of population-specific LD-groups.

These findings have implications for the practical application of the methods. For instance, SMuGLasso has better recall for population-specific SNPs, as illustrated by Fig 2. This figure shows the precision and recall of SMuGLasso, MuGLasso, and the stratified approaches on the population-specific SNPs, highlighting the improved performance of SMuGLasso in reducing the number of falsely selected SNPs, thanks to its additional 1-norm regularization.
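For reference, precision and recall on a set of population-specific SNPs can be computed as in the sketch below. The function name and SNP identifiers are hypothetical, and returning 0 for an empty set is a convention we assume here.

```python
def precision_recall(selected, causal):
    """Precision and recall of a selected SNP set against the causal SNP set."""
    true_positives = len(selected & causal)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(causal) if causal else 0.0
    return precision, recall


# Made-up example: 3 of the 4 selected SNPs are causal, out of 6 causal SNPs.
selected = {"rs1", "rs2", "rs3", "rs9"}
causal = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}
print(precision_recall(selected, causal))
```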

Genes selected by SMuGLasso on DRIVE show breast cancer related expression

Gene set enrichment analysis.

To further explore the usefulness of SMuGLasso, we investigated the genes it selected on the DRIVE dataset by performing gene set enrichment analyses.

S9 Table shows the top 10 pathways and processes enriched in genes selected by SMuGLasso on DRIVE. These gene sets reveal ontology terms related to biological processes and pathways implicated in breast cancer development and progression, as supported by the literature. For instance, “mammary gland morphogenesis” and “intracellular signaling by second messengers” highlight key pathways involved in breast development and cellular signaling [38,39]. Furthermore, terms such as “ectoderm differentiation” and “endoderm differentiation” point to the importance of cellular differentiation processes in breast tissue homeostasis and tumor formation [40]. Hence, the enrichment of SMuGLasso genes in these pathways underscores their potential roles as regulators in breast cancer biology.

S10 Table, S11 Table, S12 Table, S13 Table present further enrichment analyses against various ontologies. Many of the DisGeNET disease terms that are significantly (corrected p-values < 0.05) enriched in genes selected by SMuGLasso (see S10 Table) correspond to subtypes of breast cancer (estrogen receptor-positive breast cancer, luminal A breast carcinoma, luminal B breast carcinoma, estrogen receptor-negative breast cancer, stage 0 breast carcinoma, mammary neoplasms). Several other enriched disease terms pertain to related diseases (uterine fibroids, squamous cell carcinoma of lung). Finally, the “breast size” trait could indeed be associated with breast cancer risk through breast tissue density, which contributes to an increased risk of developing breast cancer [41].

One cell type signature is significantly (corrected p-value < 0.05) enriched in genes selected by SMuGLasso: fetal thymic epithelial cells (see S11 Table). While the relationship with breast cancer genes is not immediately obvious, this could be related to the role of thymic function in mammary gland development and tumorigenesis [42].

Finally, although not significant after correction for multiple hypothesis testing, transcription factor target enrichment analysis shows enrichment of targets of the Brn-2 transcription factor (see S13 Table). These targets (FGFR2, PTHLH, ELL, and ZMIZ1) have already been identified through meta-GWAS of breast cancer (see S3 Table). In addition, this is consistent with findings demonstrating that this transcription factor promotes invasion and metastasis in triple-negative breast cancer cells [43].

Protein-protein interaction analysis.

We also used Metascape to identify the modules of the protein-protein interaction network formed from known interactions between genes identified through physical mapping of the SNPs selected by the Adjusted GWAS approach, SMuGLasso and MuGLasso, which are shown in Fig 6. The modules obtained when adding genes identified through eQTL mapping can be visualized in S6 Fig. Pathway and process enrichment analysis of the genes in the two modules identified by SMuGLasso (Fig 6C) highlights three significant processes, described in Table 8. These highlight the ability of SMuGLasso to identify more relevant disease genes than a classical GWAS approach. Indeed, as breast cancer often originates from aberrant growth and dysfunction within mammary gland structures, mammary gland morphogenesis processes may serve as crucial drivers or modifiers of tumor initiation and progression [38,44]. Furthermore, phosphorylation is well known to play a role in regulating cellular processes such as proliferation, migration, and survival, all of which are dysregulated in cancer in general [45,46] and in breast cancer in particular [47].

Fig 6. Modules of the PPI of known interactions between genes identified through physical mapping of the SNPs selected by (A) Adjusted GWAS, (B) MuGLasso and (C) SMuGLasso on DRIVE.

Table 8. Pathway and process enrichment analysis of the modules of the PPI of known interactions between genes identified through physical mapping of the SNPs selected by SMuGLasso on DRIVE.

| GO | Description | Log10(P) | Gene Hits |
| --- | --- | --- | --- |
| GO:0060443 | mammary gland morphogenesis | -6.4 | ESR1, FGFR2, TGFBR2 |
| GO:0016310 | phosphorylation | -5.5 | FGFR2, HK1, MAP3K1, TGFBR2, NEK10 |
| GO:0022612 | gland morphogenesis | -5.2 | ESR1, FGFR2, TGFBR2 |

All in all, gene set enrichment analysis of the genes pinpointed by SMuGLasso highlights the relevance of the disease genes it detects in addition to a classical GWAS, suggesting that this tool can be used to provide valuable insights into the molecular mechanisms underlying phenotypes.

Quantitative pathway enrichment analysis comparison.

To compare pathway enrichments between Adjusted GWAS, MuGLasso and SMuGLasso, we conduct a quantitative analysis to assess their biological significance. For this analysis, we consider the top 50 pathways with the highest Z-scores for each method and then focus on the pathways common to these top-ranked sets, for which we extract the Z-scores obtained by each of the three methods. For each shared pathway, we compute the Z-score ratio for each pairwise comparison of methods (MuGLasso/Adjusted GWAS to assess MuGLasso vs. Adjusted GWAS, SMuGLasso/Adjusted GWAS to assess SMuGLasso vs. Adjusted GWAS, and SMuGLasso/MuGLasso to assess SMuGLasso vs. MuGLasso). Thus, a Z-score ratio higher than 1 indicates that the first method in the pair (e.g., MuGLasso in the MuGLasso/Adjusted GWAS comparison) shows greater enrichment for that pathway than the second method (e.g., Adjusted GWAS). Conversely, a Z-score ratio of less than 1 indicates that the second method in the pair shows greater enrichment for that pathway than the first method.

We further assess the significance of differences in pathway enrichments using paired t-tests.
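The comparison can be sketched for one pair of methods as follows. This is a simplified illustration with made-up Z-scores and hypothetical function names: it restricts the shared pathways to a single pair rather than to all three methods, it assumes the paired t-test is applied to the Z-scores of the shared pathways, and it only computes the paired t statistic, leaving out the conversion to a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_ratios(z_first, z_second, top_k=50):
    """Z-score ratios (first/second) over pathways in both top-k lists.

    z_first, z_second: {pathway: Z-score} for the two methods compared.
    """
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

    common = top(z_first) & top(z_second)
    return {pathway: z_first[pathway] / z_second[pathway] for pathway in sorted(common)}


def paired_t_statistic(x, y):
    """t statistic of a paired t-test on two equal-length samples."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Made-up Z-scores, for illustration only.
z_method_a = {"p1": 4.0, "p2": 3.0, "p3": 1.0}
z_method_b = {"p1": 2.0, "p2": 3.0, "p4": 5.0}
print(zscore_ratios(z_method_a, z_method_b, top_k=3))
print(round(paired_t_statistic([4.0, 3.0, 5.0], [2.0, 3.0, 2.0]), 4))
```

A ratio above 1 for a pathway indicates greater enrichment under the first method of the pair, in line with the definition above.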

In Fig 7, the box plots illustrate comparisons of pathway enrichments between SMuGLasso, Adjusted GWAS, and MuGLasso. Notably, SMuGLasso shows higher pathway and process enrichment compared to both Adjusted GWAS and MuGLasso, as indicated by the higher Z-score ratios. Furthermore, the performed paired t-tests reveal statistically significant differences between SMuGLasso and MuGLasso, confirming that SMuGLasso identifies pathways with greater biological significance compared to MuGLasso.

Fig 7. On DRIVE, box plots representing the distribution of Z-score ratios for gene sets enrichments across three pairwise comparisons: MuGLasso vs. Adjusted GWAS, SMuGLasso vs. Adjusted GWAS, and SMuGLasso vs. MuGLasso.

The Z-score ratio is computed for each shared pathway between the methods. The green triangle indicates significant differences in enrichment (p-value < 0.005) as determined by paired t-tests. The red dashed line at y = 1 represents equal enrichment by both methods.

Further figures illustrate that SMuGLasso exhibits greater enrichment than MuGLasso across all gene sets (S7 Fig), including DisGeNET gene sets (S8 Fig). We emphasize the DisGeNET gene sets in this analysis, as they are the only ones for which classical GWAS shows enrichment. Compared to GWAS, SMuGLasso demonstrates higher enrichment in 4 out of 6 pathways, equivalent enrichment in 1 pathway, and lower enrichment in 1 other pathway (see S9 Fig), whereas MuGLasso shows better enrichment than GWAS in 3 pathways, equal enrichment in 1 pathway, and lower enrichment in 2 pathways (S10 Fig).

Notably, the top enrichment results determined through pathway analysis are similar regardless of whether or not eQTL gene lists are included.

Genes selected by SMuGLasso on the Arabidopsis thaliana data show flowering time related expression

We present in S5 Table the list of genes obtained by TAIR10 GFF3 mapping of the SNPs selected on the Arabidopsis thaliana data set by Adjusted GWAS, SMuGLasso and MuGLasso. We conducted pathway enrichment analysis and observed distinct differences in enrichments among the tested methods. SMuGLasso identified 48 genes, among which two pathways are significantly overrepresented: gravitropism and response to carbohydrate. MuGLasso discovered 55 genes, among which only the gravitropism pathway is overrepresented. By contrast, Adjusted GWAS identified 7 genes, for which the enrichment analysis did not yield any pathway.

An advantage of SMuGLasso compared to MuGLasso is that it finds more pathways with fewer gene discoveries, suggesting it is likely more efficient. The pathways identified by SMuGLasso and MuGLasso, namely gravitropism and response to carbohydrate, have potential relevance to the time until the first open flower phenotype. Gravitropism, the orientation or growth of plants in response to gravity, could influence floral development by affecting how plants orient their growth and allocate resources [48]. The response to carbohydrate pathway discovered by SMuGLasso may also be significant, as carbohydrates are crucial for energy storage and signaling, which can impact plant growth and development, including flowering time [49]. These observations suggest that SMuGLasso and MuGLasso methods are more effective in uncovering biologically relevant pathways that could explain variations in the flowering time phenotype compared to Adjusted GWAS, which did not identify any related pathways.

Finally, Fig 8 shows the modules of the protein-protein interaction network of known interactions between the genes identified by SMuGLasso and MuGLasso. The MuGLasso PPI network appears denser and more informative, with additional interactions and nodes such as FUS9 and ICK1, compared to the SMuGLasso PPI network, which is sparser and has fewer connections. However, upon further investigation, we did not find any evidence linking these nodes, particularly FUS9 and ICK1, to flowering time regulation in Arabidopsis thaliana.

Fig 8. Modules of the PPI of known interactions between genes identified through physical mapping of the SNPs selected by (A) MuGLasso and (B) SMuGLasso on the Arabidopsis thaliana data set.

Discussion

We have presented in this paper SMuGLasso, an extension of MuGLasso for the identification of relevant SNPs from GWAS data across multiple populations. The proposed model is based on a multitask framework in which the tasks are genetic populations and features are clustered in groups. The selection is performed at the scale of LD-groups. The populations are identified using PCA and k-means to assign each sample to a subpopulation. This setting alleviates the curse of dimensionality and addresses population stratification in diverse populations. Compared to MuGLasso, SMuGLasso includes an additional regularization term which enforces task-specific sparsity at the level of LD-groups. Our model thus provides a more precise recovery of the risk regions related to the phenotype at the population-specific level.

Our simulations demonstrate that SMuGLasso outperforms MuGLasso and other methods in accurately identifying population-specific disease loci, while also minimizing potential false discoveries. While MuGLasso shows commendable stability, SMuGLasso closely follows, exhibiting robust stability indexes across various datasets with a reduced number of selected LD-groups/SNPs. The application of stability selection techniques further bolsters SMuGLasso’s reliability in terms of stability measurements.

A significant advancement in our study is addressing the computational challenges posed by the additional penalty in SMuGLasso, achieved through the implementation of gap-safe screening rules. This ensures efficient processing for both qualitative and quantitative phenotypes.

Lastly, we have detailed the genes identified by both SMuGLasso and MuGLasso in our real data analyses, and we performed pathway analysis with biological interpretation for the entire gene lists. Interestingly, SMuGLasso’s findings are more consistent with the literature than those that are specific to MuGLasso. However, we encountered limitations in investigating pathway analysis specific to populations due to the absence of tools that adequately consider population structure. Despite our efforts, we were unable to find evidence in our findings for pathway enrichment specific to particular populations. Looking ahead, our goal is to delve into pathway analysis to unravel the biological mechanisms underpinning the phenotypes of interest in diverse population studies, as revealed by the identified risk genes.

Despite the implementation of gap-safe screening rules, the computational load is still significant, especially when dealing with extremely large datasets. This could limit the applicability of SMuGLasso to broader GWAS data where computational resources are a constraint.

The efficacy of SMuGLasso also relies heavily on the ability of PCA and k-means clustering to identify subpopulations. Misclassification or suboptimal clustering can potentially impact the final results. One solution is to focus on enhancing the clustering stability through a hierarchical structure. Moreover, given the additional regularization term, there is a risk that the model might become biased towards tasks with more samples, potentially overlooking key insights in the less-represented tasks/populations. Introducing a weighting scheme that balances the influence of each task, in particular by giving more weight to those with fewer samples, might help address this imbalance. Integrating additional external datasets to bolster the sample sizes of underrepresented tasks could also be beneficial.

In addition, investigating different regularization terms for the population-specific LD-groups selection, or hybrid approaches, could potentially improve the model’s performance in identifying disease-relevant loci. Furthermore, there remains an essential avenue for future work in rigorously integrating alternative stability selection methods, which could further improve SMuGLasso’s robustness.

In conclusion, while SMuGLasso presents a novel framework in the field of GWAS analysis, especially in the precise identification of population-specific risk loci, our ongoing efforts to refine its computational efficiency, enhance clustering accuracy, and balance task representation will be essential in realizing its full potential in unraveling the complex genetic mechanisms of diseases.

Supporting information

S1 Appendix. SMuGLasso method details.

(PDF)

pcbi.1012734.s001.pdf (138.8KB, pdf)
S2 Appendix. DRIVE.

(PDF)

pcbi.1012734.s002.pdf (56.1KB, pdf)
S1 Fig. PCA for simulated dataset.

(PDF)

pcbi.1012734.s003.pdf (45KB, pdf)
S2 Fig. PCA for DRIVE dataset.

(PDF)

pcbi.1012734.s004.pdf (428.6KB, pdf)
S3 Fig. PCA for Arabidopsis thaliana dataset.

Projection of the Arabidopsis thaliana genotypes on the first two PCA components. Samples originate from 44 countries.

(PDF)

pcbi.1012734.s005.pdf (45.6KB, pdf)
S4 Fig. PCA for Arabidopsis thaliana dataset.

Projection of the Arabidopsis thaliana genotypes, showing the 5 subpopulations identified through K-means clustering of the data.

(PDF)

pcbi.1012734.s006.pdf (28.8KB, pdf)
S5 Fig. Inertia Plot for K-means Clustering of Arabidopsis thaliana.

(PDF)

pcbi.1012734.s007.pdf (13.6KB, pdf)
S6 Fig. PPI Modules from eQTL and Physical Mapping of GWAS/SMuGLasso-Selected SNPs on DRIVE.

Modules of the PPI of known interactions between genes identified through physical and eQTL mapping of the SNPs selected by Adjusted GWAS, SMuGLasso and MuGLasso on DRIVE.

(TIF)

S7 Fig. Comparison of All gene set Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. MuGLasso.

On DRIVE, comparison of All gene sets enrichment between SMuGLasso and MuGLasso based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/MuGLasso) for top common gene sets.

(PNG)

pcbi.1012734.s009.png (719.4KB, png)
S8 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. MuGLasso.

On DRIVE, comparison of DisGeNET gene sets enrichment between SMuGLasso and MuGLasso based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/MuGLasso) for top DisGeNET common gene sets.

(PNG)

pcbi.1012734.s010.png (733.9KB, png)
S9 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. Adjusted GWAS.

On DRIVE, comparison of DisGeNET gene sets enrichment between SMuGLasso and Adjusted GWAS based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/Adjusted GWAS) for top DisGeNET common gene sets.

(PNG)

pcbi.1012734.s011.png (328.5KB, png)
S10 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: MuGLasso vs. Adjusted GWAS.

On DRIVE, comparison of DisGeNET gene sets enrichment between MuGLasso and Adjusted GWAS based on Z-score ratios. Bar heights represent the ratio of Z-scores (MuGLasso/Adjusted GWAS) for top DisGeNET common gene sets.

(PNG)

pcbi.1012734.s012.png (323.3KB, png)
S1 Table. Subpopulations of Arabidopsis thaliana with the corresponding countries and the number of samples included in each subpopulation.

(PDF)

pcbi.1012734.s013.pdf (48.1KB, pdf)
S2 Table. Potential breast cancer risk genes identified through physical (within 10 kb) mapping of the loci selected by Adjusted GWAS, SMuGLasso and MuGLasso.

CEU-specific selected genes are highlighted in blue and YRI-specific selected genes are highlighted in red. The remaining genes (in black) are risk genes shared across all populations.

(PDF)

pcbi.1012734.s014.pdf (40.9KB, pdf)
S3 Table. MuGLasso and/or SMuGLasso specific Genes in Meta-GWAS via Physical and eQTL Mapping.

Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, found in meta-GWAS including the samples used in this work.

(PDF)

pcbi.1012734.s015.pdf (77.2KB, pdf)
S4 Table. MuGLasso and/or SMuGLasso specific Genes linked to Breast Cancer in Literature.

Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, found to be associated with breast cancer risk or tumor growth in the literature.

(PDF)

pcbi.1012734.s016.pdf (91.4KB, pdf)
S5 Table. DTF3 loci detected by SMuGLasso and MuGLasso on the Arabidopsis thaliana dataset.

Genes identified through physical mapping of SNPs selected as associated with flowering time in Arabidopsis thaliana using SMuGLasso, MuGLasso and Adjusted GWAS.

(PDF)

pcbi.1012734.s017.pdf (58KB, pdf)
S6 Table. Breast cancer risk loci detected by SMuGLasso and MuGLasso on DRIVE.

Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by Adjusted GWAS, SMuGLasso and MuGLasso. CEU-specific selected genes are highlighted in blue and YRI-specific selected genes are highlighted in red. The remaining genes (in black) are risk genes shared across all populations.

(PDF)

pcbi.1012734.s018.pdf (44.9KB, pdf)
S7 Table. MuGLasso and/or SMuGLasso specific eQTL Genes linked to Breast Cancer.

Potential breast cancer risk genes within 10 kb of loci obtained through eQTL analysis, identified by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, found to be associated with breast cancer risk or tumor growth in the literature.

(PDF)

pcbi.1012734.s019.pdf (116.8KB, pdf)
S8 Table. Stability index and number of selected features for different methods on Arabidopsis thaliana.

(PDF)

pcbi.1012734.s020.pdf (46.4KB, pdf)
S9 Table. Summary of pathway and process enrichment analysis: Top 10 clusters of enriched terms, each described by one representative enriched term.

“#” is the number of genes in the user-provided lists with membership in the given ontology term. “%” is the percentage of genes selected by SMuGLasso that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). “Log10(P)” is the p-value in log base 10. “Log10(q)” is the multi-test adjusted p-value in log base 10.

(PDF)

pcbi.1012734.s021.pdf (55.5KB, pdf)
S10 Table. Summary of enrichment analysis in DisGeNET.

(PDF)

pcbi.1012734.s022.pdf (45.7KB, pdf)
S11 Table. Summary of enrichment analysis in Cell Type Signatures.

(PDF)

pcbi.1012734.s023.pdf (42.7KB, pdf)
S12 Table. Summary of enrichment analysis in PaGenBase.

(PDF)

pcbi.1012734.s024.pdf (39.9KB, pdf)
S13 Table. Summary of enrichment analysis in Transcription Factor Targets.

(PDF)

pcbi.1012734.s025.pdf (43KB, pdf)

Acknowledgments

The authors would like to thank Adeline Fermanian, Vivien Goepp, Héctor Climente-González, Gwenaëlle Lemoine, Antoine Poirier and Lotfi Slim for fruitful discussions. OncoArray genotyping and phenotype data harmonization for the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) breast cancer case-control samples was supported by X01 HG007491 and U19 CA148065 and by Cancer Research UK (C1287/A16563).

Data Availability

Code is available at https://github.com/asmanouira/SMuGLasso. The dataset “General Research Use” in DRIVE Breast Cancer OncoArray Genotypes is available from the dbGaP controlled-access portal, under Study Accession phs001265.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001265.v1.p1). Researchers can gain access to the data by applying to the data access committee; see https://dbgap.ncbi.nlm.nih.gov.

Funding Statement

This work was supported by the French Agence Nationale de la Recherche (ANR-18-CE45-0021-01 and ANR-19-P3IA-0001 to C-A.A.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012734.r001

Decision Letter 0

Adam Charles

16 May 2025

PCOMPBIOL-D-24-02202

Sparse Multitask group Lasso for Genome-Wide Association Studies

PLOS Computational Biology

Dear Dr. Nouira,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Jul 16 2025 11:59PM. If you need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Adam Charles

Guest Editor

PLOS Computational Biology

Shihua Zhang

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Dear Drs. Nouira and Azencott,

Thank you for your patience in this review process. As you know, it is increasingly difficult to solicit reviewers. Despite this, we were able to receive sufficient reviewer feedback for your manuscript. Given the reviewer comments, the current manuscript lacks sufficient clarity to be published in its current form, in particular in the details of the mathematical notation, the selection of some of the model regularization, the experimental results, and the validation procedures. However, the reviewers did note the benefits of the approach, and so I am suggesting a major revision, which should carefully address the specific reviewer feedback to improve the clarity of the manuscript.

Best,

-Adam

Journal Requirements:

1) Please ensure that the CRediT author contributions listed for every co-author are completed accurately and in full.

At this stage, the following authors require contributions: Asma Nouira and Chloé-Agathe Azencott. Please ensure that the full contributions of each author are acknowledged in the "Add/Edit/Remove Authors" section of our submission form.

The list of CRediT author contributions may be found here: https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions

2) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type 'LaTeX Source File' and leave your .pdf version as the item type 'Manuscript'.

3) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/ploscompbiol/s/figures

4) We notice that your supplementary Figures, and Tables are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

5) Please ensure that the funders and grant numbers match between the Financial Disclosure field and the Funding Information tab in your submission form. Note that the funders must be provided in the same order in both places as well.

- State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

- State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.".

If you did not receive any funding for this study, please simply state: "The authors received no specific funding for this work."

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In the manuscript, the authors extended the previously developed MuGLasso method to select population-specific risk variants while also selecting risk LD groups. The novel method, namely SMuGLasso, is based on a multitask group lasso framework. The manuscript is well written, but I have some major concerns related to the implementation details of SMuGLasso:

1. The original MuGLasso requires additional post-processing steps to discern task-specific LD groups. What are the post-processing steps? Are they also used for MuGLasso in the comparison experiments?

2. In the model, is X used to denote the raw genotypes of SNPs, or the normalized genotypes?

3. How are the hyperparameters lambda 1 and lambda 2 selected?

4. How are the LD groups pre-defined?

5. Are you using different variable selection methods on SNPs after LD pruning? How is the performance on SNPs without pruning?

6. Why are different r2 thresholds used for the different datasets in LD pruning?

7. In Figure 2, do runtimes of SMuGLasso and MuGLasso also incorporate the stability selection steps?

Minor issue:

Page 12, Line 363: For the Arabidopsis thaliana dataset (see ??)

Reviewer #2: I have uploaded my review as a Word document, as it includes mathematical notation.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No: At the time of my review, I did not have full access to the authors' simulation code or all simulation parameters. However, the real data components are clear.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Seungjun Ahn

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: PLOS_CB_ReviewerComments.docx

pcbi.1012734.s026.docx (19KB, docx)
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012734.r003

Decision Letter 1

Adam Charles

27 Aug 2025

Dear Dr. Nouira,

We are pleased to inform you that your manuscript 'Sparse Multitask group Lasso for Genome-Wide Association Studies' has been provisionally accepted for publication in PLOS Computational Biology. The reviewer with the main concerns has re-reviewed the manuscript and found that the authors have answered their concerns and made the necessary modifications to the manuscript. While the other reviewer was unable to re-review, I have reviewed the responses and found that they address the main concerns. 

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Adam Charles

Guest Editor

PLOS Computational Biology

Shihua Zhang

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer #2:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: Thank you for your careful revisions. I find that you have addressed my previous comments and concerns, and I have no additional comments at this stage.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Seungjun Ahn

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1012734.r004

Acceptance letter

Adam Charles

PCOMPBIOL-D-24-02202R1

Sparse Multitask group Lasso for Genome-Wide Association Studies

Dear Dr Nouira,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. SMuGLasso method details.

    (PDF)

    pcbi.1012734.s001.pdf (138.8KB, pdf)
    S2 Appendix. DRIVE.

    (PDF)

    pcbi.1012734.s002.pdf (56.1KB, pdf)
    S1 Fig. PCA for simulated dataset.

    (PDF)

    pcbi.1012734.s003.pdf (45KB, pdf)
    S2 Fig. PCA for DRIVE dataset.

    (PDF)

    pcbi.1012734.s004.pdf (428.6KB, pdf)
    S3 Fig. PCA for Arabidopsis thaliana dataset.

    Projection of the Arabidopsis thaliana genotypes on the first two PCA components. Samples originate from 44 countries.

    (PDF)

    pcbi.1012734.s005.pdf (45.6KB, pdf)
    S4 Fig. PCA for Arabidopsis thaliana dataset.

    Projection of the Arabidopsis thaliana. The identified 5 subpopulations through K-means clustering of the data.

    (PDF)

    pcbi.1012734.s006.pdf (28.8KB, pdf)
    S5 Fig. Inertia Plot for K-means Clustering of Arabidopsis thaliana.

    (PDF)

    pcbi.1012734.s007.pdf (13.6KB, pdf)
    S6 Fig. PPI Modules from eQTL and Physical Mapping of GWAS/SMuGLasso-Selected SNPs on DRIVE.

    Modules of the PPI of known interactions between genes identified through physical and eQTL mapping of the SNPs selected by Adjusted GWAS, SMuGLasso and MuGLasso on DRIVE.

    (TIF)

    S7 Fig. Comparison of All gene set Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. MuGLasso.

    On DRIVE, comparison of All gene sets enrichment between SMuGLasso and MuGLasso based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/MuGLasso) for top common gene sets.

    (PNG)

    pcbi.1012734.s009.png (719.4KB, png)
    S8 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. MuGLasso.

    On DRIVE, comparison of DisGeNET gene sets enrichment between SMuGLasso and MuGLasso based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/MuGLasso) for top DisGeNET common gene sets.

    (PNG)

    pcbi.1012734.s010.png (733.9KB, png)
    S9 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: SMuGLasso vs. Adjusted GWAS.

    On DRIVE, comparison of DisGeNET gene sets enrichment between SMuGLasso and Adjusted GWAS based on Z-score ratios. Bar heights represent the ratio of Z-scores (SMuGLasso/Adjusted GWAS) for top DisGeNET common gene sets.

    (PNG)

    pcbi.1012734.s011.png (328.5KB, png)
    S10 Fig. Comparison of DisGeNET Enrichment (Z-ratios) on DRIVE: MuGLasso vs. Adjusted GWAS.

    On DRIVE, comparison of DisGeNET gene sets enrichment between MuGLasso and Adjusted GWAS based on Z-score ratios. Bar heights represent the ratio of Z-scores (MuGLasso/Adjusted GWAS) for top DisGeNET common gene sets.

    (PNG)

    pcbi.1012734.s012.png (323.3KB, png)
    S1 Table. Subpopulations of Arabidopsis thaliana with the corresponding countries and the number of samples included in each subpopulation.

    (PDF)

    pcbi.1012734.s013.pdf (48.1KB, pdf)
    S2 Table. Potential breast cancer risk genes identified through physical (within 10 kb) mapping of the loci selected by Adjusted GWAS, SMuGLasso and MuGLasso.

    CEU-specific selected genes are highlighted in blue and YRI-specific selected genes are highlighted in red. The remaining genes (in black) are risk genes shared across all populations.

    (PDF)

    pcbi.1012734.s014.pdf (40.9KB, pdf)
    S3 Table. MuGLasso and/or SMuGLasso specific Genes in Meta-GWAS via Physical and eQTL Mapping.

    Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, and found in meta-GWAS including the samples used in this work.

    (PDF)

    pcbi.1012734.s015.pdf (77.2KB, pdf)
    S4 Table. MuGLasso and/or SMuGLasso specific Genes linked to Breast Cancer in Literature.

    Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, and found to be associated with breast cancer risk or tumor growth in the literature.

    (PDF)

    pcbi.1012734.s016.pdf (91.4KB, pdf)
    S5 Table. DTF3 loci detected by SMuGLasso and MuGLasso on Arabidopsis thaliana dataset.

    Genes identified through physical mapping of SNPs selected as associated with flowering time in Arabidopsis thaliana using SMuGLasso, MuGLasso and Adjusted GWAS.

    (PDF)

    pcbi.1012734.s017.pdf (58KB, pdf)
    S6 Table. Breast cancer risk loci detected by SMuGLasso and MuGLasso on DRIVE.

    Potential breast cancer risk genes identified through both physical (within 10 kb) and eQTL mapping of the loci selected by Adjusted GWAS, SMuGLasso and MuGLasso. CEU-specific selected genes are highlighted in blue and YRI-specific selected genes are highlighted in red. The remaining genes (in black) are risk genes shared across all populations.

    (PDF)

    pcbi.1012734.s018.pdf (44.9KB, pdf)
    S7 Table. MuGLasso and/or SMuGLasso specific eQTL Genes linked to Breast Cancer.

    Potential breast cancer risk genes within 10 kb of loci obtained through eQTL analysis, identified by MuGLasso and/or SMuGLasso but not by the adjusted GWAS, and found to be associated with breast cancer risk or tumor growth in the literature.

    (PDF)

    pcbi.1012734.s019.pdf (116.8KB, pdf)
    S8 Table. Stability index and number of selected features for different methods on Arabidopsis thaliana.

    (PDF)

    pcbi.1012734.s020.pdf (46.4KB, pdf)
    S9 Table. Summary of pathway and process enrichment analysis: Top 10 clusters of enriched terms, each described by one representative enriched term.

    “#” is the number of genes in the user-provided lists with membership in the given ontology term. “%” is the percentage of genes selected by SMuGLasso that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). “Log10(P)” is the base-10 logarithm of the p-value. “Log10(q)” is the base-10 logarithm of the p-value adjusted for multiple testing.

    (PDF)

    pcbi.1012734.s021.pdf (55.5KB, pdf)
    S10 Table. Summary of enrichment analysis in DisGeNET.

    (PDF)

    pcbi.1012734.s022.pdf (45.7KB, pdf)
    S11 Table. Summary of enrichment analysis in Cell Type Signatures.

    (PDF)

    pcbi.1012734.s023.pdf (42.7KB, pdf)
    S12 Table. Summary of enrichment analysis in PaGenBase.

    (PDF)

    pcbi.1012734.s024.pdf (39.9KB, pdf)
    S13 Table. Summary of enrichment analysis in Transcription Factor Targets.

    (PDF)

    pcbi.1012734.s025.pdf (43KB, pdf)
    Attachment

    Submitted filename: PLOS_CB_ReviewerComments.docx

    pcbi.1012734.s026.docx (19KB, docx)
    Attachment

    Submitted filename: SMuGLasso_ReviewerResponses.pdf

    pcbi.1012734.s027.pdf (162.2KB, pdf)

    Data Availability Statement

    Code is available at https://github.com/asmanouira/SMuGLasso. The “General Research Use” dataset in DRIVE Breast Cancer OncoArray Genotypes is available from the dbGaP controlled-access portal, under Study Accession phs001265.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001265.v1.p1). Researchers can gain access to the data by applying to the data access committee; see https://dbgap.ncbi.nlm.nih.gov.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
