Anal Chem. 2025 May 28;97(22):11563–11571. doi: 10.1021/acs.analchem.5c00539

Integrative Analysis of Nontargeted LC-HRMS and High-Throughput Metabarcoding Data for Aquatic Environmental Studies Using Combined Multivariate Statistical Approaches

Maryam Vosough †,‡,§,*, Felix Drees †, Guido Sieber ‡, Tom L Stach ‡, Daniela Beisser #, Alexander J Probst ‡,⊥, Jens Boenigk ‡, Torsten C Schmidt †,‡
PMCID: PMC12163877  PMID: 40436373

Abstract

Significant progress in high-throughput analytical techniques has paved the way for novel approaches to integrating data sets from different compartments. This study leverages nontarget screening (NTS) via liquid chromatography-high-resolution mass spectrometry (LC-HRMS), a crucial technique for analyzing organic micropollutants and their transformation products, in combination with biological indicators. We propose a combined multivariate data processing framework that integrates LC-HRMS-based NTS data with other high-throughput data sets, exemplified here by 18S V9 rRNA and full-length 16S rRNA gene metabarcoding data sets. The power of data fusion is demonstrated by systematically evaluating the impact of treated wastewater (TWW) over time on an aquatic ecosystem through a controlled mesocosm experiment. Highly compressed NTS data were compiled using the region of interest-multivariate curve resolution-alternating least-squares (MCR-ALS) method, known as ROIMCR. By integrating ANOVA-simultaneous component analysis (ASCA) with structural learning and integrative decomposition (SLIDE), the innovative SLIDE-ASCA approach decomposes globally common, partially common, and distinct sources of variation arising from experimental factors and their possible interactions. SLIDE-ASCA results indicate that temporal variability explains a much larger portion of the variance (74.6%) than the treatment effect, with both contributing to global shared space variation (41%). Accounting for the design structure enhances interpretability, improves detection of key features, and gives a more accurate representation of complex interactions between chemical and biological data. This approach offers a greater understanding of the natural and wastewater-influenced temporal patterns for each data source and reveals associations between chemical and biological markers in an exemplified perturbed aquatic ecosystem.



Introduction

Significant advancements in environmental analysis have emerged through nontarget screening (NTS) using liquid chromatography-high-resolution mass spectrometry (LC-HRMS), enabling comprehensive detection of organic micropollutants, their transformation products (TPs), and metabolites in aquatic ecosystems. A growing number of platforms and technologies in water research enable the assessment of diverse but interconnected biological, chemical, and other environmental variables through integrated data analysis strategies. One promising area in this regard is the development and exploitation of multiomics or multiblock data processing strategies in combination with NTS data. Metabarcoding, which is based on amplification of DNA sequences, provides valuable insights into the unknown compositions of biological communities, offering a deeper understanding of the ecological status and health of water systems.

A multiblock data set can be created through a multiplatform analysis of the same samples, resulting in data matrices with an equal number of rows but different numbers of variables. Analyzing bio/chemical features in the same set of water samples using high-throughput technologies calls for multiblock data analysis methods to reveal complex interactions between key micropollutants and microbial communities that isolated analyses might miss. Shared variation across data sets enables more in-depth interpretation of variables and their interrelationships. Several multiblock methods have been proposed for estimating the shared parts of data sets while disregarding some of the specific characteristics of each data set. Examples of unsupervised methods are sparse multiblock partial least-squares as well as regularized and sparse generalized canonical correlation analysis (sGCCA). Supervised integration methods, such as concatenation-based and ensemble-based frameworks using sparse partial least-squares discriminant analysis (sPLS-DA), and data integration analysis for biomarker discovery using latent components (DIABLO), a supervised extension of sGCCA, have shown utility in discriminative tasks by simultaneously identifying key features from each data block. However, most of these methods are primarily optimized to extract shared structures across data sets and often overlook block-specific variations, which may hold critical platform-specific or domain-relevant information. Methods such as joint and individual variation explained (JIVE), distinct and common simultaneous component analysis (DISCO-SCA), structural learning and integrative decomposition (SLIDE), and penalized exponential simultaneous component analysis (P-ESCA) address this gap by estimating both common and specific sources of variation.
Nonetheless, a major shortcoming of many existing data fusion approaches is their lack of integration with experimental design frameworks, which limits their ability to resolve structured variation and interaction effects and, ultimately, to gain deeper analytical insight. The integration of P-ESCA with ANOVA-simultaneous component analysis (ASCA) into PE-ASCA can achieve this goal by first decomposing the data matrix into common and distinct variations and then applying ASCA to each submatrix.

NTS studies in environmental monitoring often cover various influencing factors, such as spatiotemporal dynamics, treatment effects, and environmental disturbances from diverse sources. Incorporating different sources of variation into multiblock models can significantly improve the interpretation of both common and distinct components, leading to more accurate and holistic environmental assessments. ASCA is well-suited for modeling variations in individual data sources while accounting for design structure, and has been widely used in chemometrics. A recent study has demonstrated the complementary strengths of NTS, 18S V9 rRNA, and full-length 16S rRNA gene metabarcoding for assessing the impact of treated wastewater (TWW) in stream ecosystems. While NTS effectively detects chemical changes, the rRNA-based methods are more sensitive to microbial variation, and all three showed strong covariation. Despite these advances, capturing the full complexity of environmental data still requires the development of more robust, multilayered integration strategies across data sources. This highlights the benefits of combining ASCA or other ANOVA-based methods with fusion frameworks such as SLIDE or JIVE, as previously recommended.

Building on this concept, we introduce SLIDE-ASCA, a novel integrative approach that couples the SLIDE framework, which can model globally shared, locally shared, and distinct components, with ASCA to link each source of variation to experimental factors. This method enables structured decomposition aligned with factorial design, helping to resolve complex interactions while preserving block-specific detail. To recover NTS components, we used the region of interest (ROI) approach coupled with multivariate curve resolution-alternating least-squares (MCR-ALS), known as ROIMCR. By using bilinear matrix factorization, ROIMCR avoids conventional peak-picking steps such as retention time alignment and peak shape modeling. This approach improves the resolution of coeluting compounds, separates true signals from irrelevant peaks, and groups related features into single components, reducing redundancy and simplifying interpretation. This makes it a robust alternative for processing complex, high-dimensional LC-HRMS data sets. We also investigate the overlap between the SLIDE-ASCA results on joint spaces and the outputs of DIABLO, which identifies key features for each data block based on the supervised discrimination of water samples. Although we focus on the impact of TWW as an illustrative case, the SLIDE-ASCA framework addresses broader analytical challenges and holds potential for application in other design-based multivariate analyses involving multiple platforms or measurement types. Modern high-throughput experiments increasingly produce multiple omics, chromatographic, and spectrometric data sets from the same set of samples, necessitating integrative approaches that can separate different sources of variation while also incorporating experimental design structure.

Methods

Samples and Data Collection

This study uses experimental data sets from a previous investigation carried out in six identical large-scale circular flow mesocosms, referred to as the AquaFlow systems. The original work provides complete details of the experimental setup and sampling procedure. Figure S1 outlines the complete study workflow, while Table S1 provides definitions of key terms used throughout the study. Briefly, we systematically analyzed 42 river water grab samples across two experimental factors: sample type (factor α: treatment effect) and collection time (factor β: time effect). The mesocosms were divided into two groups: a control group, consisting of three replicates (C1, C2, C3) filled solely with river water, and a treatment group, comprising three replicates (T1, T2, T3) that contained a mixture of two-thirds river water and one-third effluent from a municipal wastewater plant. Samples were collected from each mesocosm at seven time points (S1 to S7) over 10 days (see Table S2). This structured sampling strategy, based on mesocosm ecosystems, enabled assessment of both immediate and cumulative effects of TWW, providing robust temporal and comparative data for evaluating its impact on aquatic ecosystems. Moreover, the workflow incorporates quality control samples and laboratory and system blanks to ensure comprehensive QC verification. These measures support the reliability of the protocol by optimizing parameter settings and monitoring measurement stability.

Initial Data Processing

Detailed descriptions of the HRMS analysis, ROI matrices, and MCR-ALS processing, as well as the methodologies for microeukaryote and prokaryote data processing, are summarized in SI-3–5 and Tables S3–S6. Briefly, HRMS data were preprocessed and analyzed using the ROIMCR approach to recover chemical components, while 18S V9 rRNA and 16S rRNA data were processed with targeted amplification and advanced bioinformatics workflows. These procedures enable comprehensive insights into the chemical, microeukaryotic, and prokaryotic compositions. For more methodological details, please refer to the SI and the original publication.

Data Analysis: Single and Multiblock Approaches

Statistical evaluations of the final NTS matrix derived from the ROIMCR approach were performed on peak areas after preprocessing steps such as logarithmic transformation and total area normalization. Operational taxonomic unit (OTU) tables were first prefiltered by removing near-zero-variance predictors. They were then normalized by total sum scaling followed by the centered log-ratio (CLR) transformation. To check the assumptions of the statistical tests, namely normally distributed variables and homogeneity of variance, Shapiro-Wilk and Bartlett’s tests were used (p < 0.01).
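The OTU normalization described above can be illustrated with a minimal sketch. It assumes a samples × OTUs count matrix and adds a pseudocount before scaling so that zero counts do not produce log(0); the pseudocount and the toy table are assumptions for illustration, since the handling of zeros is not stated here.

```python
import numpy as np


def tss_clr(counts, pseudocount=1.0):
    """Total sum scaling followed by the centered log-ratio (CLR) transform.

    counts: (samples x OTUs) array of non-negative read counts.
    pseudocount: assumed offset that avoids log(0); not taken from the study.
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    # Total sum scaling: each sample (row) becomes relative abundances.
    proportions = counts / counts.sum(axis=1, keepdims=True)
    # CLR: log-proportions centered on their per-sample mean,
    # i.e. log(x / geometric mean of x).
    log_p = np.log(proportions)
    return log_p - log_p.mean(axis=1, keepdims=True)


# Toy OTU table (2 samples x 3 OTUs), purely illustrative.
otu_table = np.array([[10, 0, 90], [5, 5, 40]])
clr = tss_clr(otu_table)
```

Each CLR-transformed row sums to zero, which removes the dependence on sequencing depth while keeping the relative ordering of OTUs within a sample.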

The data analysis began with exploratory two-block PLS analyses to assess pairwise relationships between data sets. The goal was to extract latent components that maximize the covariance between each data set pair, guiding the structure of the subsequent multiblock integration. We then applied ASCA decomposition to each data set separately to assess the design factors and their possible interactions and to identify key features associated with them. This method merges ANOVA’s variance decomposition capability with the comprehensive variable effect assessment provided by SCA. The significance of each factor and interaction is determined by comparing the sum-of-squares (SSQ) value of the actual data with the SSQ values from permuted data (10,000 permutations). The p-value reflects the probability of obtaining an SSQ at least as large as the observed one if the null hypothesis (i.e., no effect) is true. Further, SCA scores and loadings provide insights into sample patterns and variables associated with each design component (see SI-6 for more details).
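The ASCA steps described above can be sketched for a balanced two-factor design as follows. This is a simplified illustration on simulated data, not the MEDA-Toolbox implementation: function names are hypothetical, only a main effect is tested, labels are permuted without restriction, and 999 permutations are used instead of 10,000 to keep the example fast.

```python
import numpy as np


def level_means(X, labels):
    """Replace each row of X by the mean of all rows sharing its label."""
    M = np.zeros_like(X, dtype=float)
    for level in np.unique(labels):
        mask = labels == level
        M[mask] = X[mask].mean(axis=0)
    return M


def ssq(M):
    """Sum of squares (squared Frobenius norm) of a matrix."""
    return float(np.sum(M ** 2))


def asca_decompose(X, a, b):
    """Split column-centered X into effect matrices for a, b, and a x b."""
    Xc = X - X.mean(axis=0)
    Xa = level_means(Xc, a)
    Xb = level_means(Xc, b)
    cells = np.array([f"{i}|{j}" for i, j in zip(a, b)])
    Xab = level_means(Xc, cells) - Xa - Xb
    residual = Xc - Xa - Xb - Xab
    return Xc, {"a": Xa, "b": Xb, "ab": Xab}, residual


def permutation_p_value(X, labels, n_perm=999, seed=0):
    """Permutation p-value for a main effect, using SSQ as the statistic."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    observed = ssq(level_means(Xc, labels))
    hits = sum(
        ssq(level_means(Xc, rng.permutation(labels))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Toy balanced design: 2 treatments x 3 time points x 2 replicates.
rng = np.random.default_rng(1)
treatment = np.repeat([0, 1], 6)
time = np.tile(np.repeat([0, 1, 2], 2), 2)
X = rng.normal(scale=0.1, size=(12, 5))
X[:, 0] += 2.0 * time  # strong simulated time effect on variable 0

Xc, effects, residual = asca_decompose(X, treatment, time)
explained = {k: ssq(M) / ssq(Xc) for k, M in effects.items()}
p_time = permutation_p_value(X, time)

# SCA step: PCA (via SVD) of one effect matrix gives scores and loadings.
U, s, Vt = np.linalg.svd(effects["b"], full_matrices=False)
time_scores, time_loadings = U * s, Vt.T
```

In a balanced design the effect matrices and the residual are mutually orthogonal, so their SSQ values add up to the SSQ of the centered data, which is what allows the variance to be reported as percentages per factor.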

SLIDE-ASCA

Integrated analysis of three data blocks with a shared object mode can be conducted using the SLIDE method after column-centering and block scaling to unit Frobenius norm. SLIDE is a model that incorporates globally and locally common components as well as block-specific variation to analyze multiblock data structure (SI-6). One further step is to incorporate the design structure into the data fusion approach. A potential strategy is to first extract latent components from the data sets using SLIDE and then break down the common and distinct variations with ASCA (see Figure S2). This combined SLIDE-ASCA approach enables the structured decomposition of each data block into globally shared, locally shared, and distinct sources of variation, each aligned with the design structure. The model organizes latent variation in terms of its origin (shared or distinct) and its experimental relevance (α and β effects and the αβ interaction), enabling clearer interpretation of multiblock data sets using score and loading matrices (Table S7). A full mathematical formulation of the SLIDE-ASCA model is provided in SI-6.
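The preprocessing step mentioned above (column-centering followed by block scaling to unit Frobenius norm) can be sketched as below. The blocks are simulated, and the block widths other than the 176 NTS variables are placeholders rather than values from the study; the SLIDE decomposition itself is not reproduced here.

```python
import numpy as np


def center_and_block_scale(block):
    """Column-center a block, then scale it to unit Frobenius norm."""
    centered = block - block.mean(axis=0)
    return centered / np.linalg.norm(centered, "fro")


rng = np.random.default_rng(0)
n_samples = 42
# Simulated blocks on very different scales; widths of the two
# metabarcoding blocks are placeholders, not values from the study.
blocks = {
    "NTS": rng.normal(size=(n_samples, 176)),
    "16S": 50.0 * rng.normal(size=(n_samples, 300)),
    "18S": 0.1 * rng.normal(size=(n_samples, 120)),
}
scaled = {name: center_and_block_scale(B) for name, B in blocks.items()}

# Blocks share the sample (row) mode, so they are concatenated column-wise.
concatenated = np.hstack([scaled[name] for name in ("NTS", "16S", "18S")])
```

After this scaling every block contributes the same total sum of squares to the concatenated matrix, so no block dominates the shared components merely because of its measurement scale or number of variables.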

As some data frames have high dimensionality in the variable direction, a further variable selection strategy was used to facilitate interpretation. For the purpose of this study, all variables were initially preserved within each model. To identify the key features in each data source, we first employed sparse PCA (sPCA) based on regularized low-rank matrix approximation. By incorporating ℓ1 regularization, sPCA sets the loadings of less important features to zero, focusing the analysis on the features that contribute most. We then selected the top 15 features from each data set as the key features contributing to the variance in each submodel based on their absolute loading values (see SI-7). Moreover, ASCA was incorporated into the JIVE model, creating JIVE-ASCA. Initially, the data sets were block-scaled, and ranks were estimated using permutation tests.
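To illustrate how ℓ1-type shrinkage zeroes out loadings, the following sketch computes one sparse principal component by alternating soft-thresholding, in the spirit of regularized low-rank approximation. It is an assumption-laden toy (hypothetical function name, simulated data, a fixed number of retained variables instead of a tuned penalty) and not the mixOmics implementation used in the study.

```python
import numpy as np


def sparse_pc1(X, keep, n_iter=100, tol=1e-8):
    """First sparse loading vector with exactly `keep` non-zero entries."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    u = U[:, 0]                      # start from the ordinary first PC
    v = np.zeros(Xc.shape[1])
    for _ in range(n_iter):
        z = Xc.T @ u
        # Threshold at the (keep+1)-th largest |z| so `keep` entries survive.
        abs_sorted = np.sort(np.abs(z))[::-1]
        thr = abs_sorted[keep] if keep < z.size else 0.0
        v_new = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft threshold
        u_new = Xc @ v_new
        u_new /= np.linalg.norm(u_new)
        converged = np.linalg.norm(v_new - v) < tol
        u, v = u_new, v_new
        if converged:
            break
    return v / np.linalg.norm(v)


# Simulated data: variables 0-2 share one latent signal, the rest is noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=42)
X = 0.3 * rng.normal(size=(42, 30))
X[:, :3] += 3.0 * latent[:, None]

loadings = sparse_pc1(X, keep=5)
# Rank features by absolute loading, as done for the top-15 selection.
top = np.argsort(-np.abs(loadings))[:3]
```

Ranking by absolute loading, as in the last line, mirrors the selection rule described in the text; only the number of retained features differs in this toy example.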

sPLS-DA and DIABLO

The use of supervised PLS-DA models might be effective for processing design-based data sets, provided the factors/levels are distinguished. In this study, we considered factor-wise ASCA models to define individual or global classification problems. Specifically, the multiomics integrative method DIABLO was utilized to explore similarities with SLIDE-ASCA global shared components. A description of the implications of these methods can be found in SI-7.

Global ROI LC-HRMS data matrices for multiple chromatographic runs were compiled using the MSroi GUI app. The MCR-ALS calculations were performed in MATLAB (The MathWorks Inc., version 9.9, 2020b, Natick, MA, U.S.A.) using the MCR-ALS 2.0 toolbox available at www.mcrals.info. The identification of prioritized MCR compounds involved searching public libraries such as mzCloud (https://www.mzcloud.org) and PubChem (pubchem.ncbi.nlm.nih.gov/). Tentatively identified features were classified using the scheme of Schymanski. The ASCA models were obtained using the MATLAB source code (github.com/josecamachop/MEDA-Toolbox). MATLAB 2020b was used in all data processing steps except sparse PLS-DA, sparse PCA, and DIABLO, for which R software version 3.0.1 with the mixOmics package was used (http://cran.r-project.org/web/packages/mixOmics). SLIDE and JIVE were performed in R and can be found at the GitHub repository github.com/irinagain/slide and at cran.r-project.org/web/packages/r.jive, respectively. A relevance network graph showing strong associations (absolute value ≥ 0.8) between variables was visualized using Cytoscape v3.7.2 (www.cytoscape.org).

Results and Discussion

Resolution of LC-HRMS Data: NTS Data Block

This study introduces a peak area table containing highly abstracted chemical information to represent the NTS data in subsequent integrative analysis. This was accomplished by processing the LC-HRMS data using the ROIMCR approach. An individual model was built for each augmented submatrix using the key parameters (see SI-4). MCR-ALS modeling extracted all components causing systematic variations across the samples, explaining over 96% of the variance in all data subsets, and was also able to resolve LC profiles of components and irrelevant peaks; examples are provided in Figure S3. Additionally, the uniqueness of the resolved components was evaluated and confirmed using the MCR-BANDS method under the implemented constraints. The differences between f_max and f_min were near zero (≤0.0012) for the considered species in the current data set, indicating stable solutions. Figures S4 and S5 further illustrate this, showing that the resolved components either coincide with extreme values of the profiles or fall within very narrow profile bands. Consequently, a total of 203 (ESI+) and 147 (ESI-) components were used to elucidate the variance within the entire data set, encompassing all detected species, solvent contributions, noisy signals, artifacts, and background signal contributions. Then, following matrix cleaning and removal of components with near-zero variance (SI-4), the global peak area matrix (dimension of 42 × 176) was prepared for further multivariate data assessments.

Multivariate Statistical Analysis of NTS and Metabarcoding Data

Explorative Analysis and ASCA Models

To ensure comparability between NTS data and other blocks, we first created exploratory PLS models for each pair of data sets. The correlation patterns among the top 50 features of each pair are shown in Figure S6. The correlation values suggest robust interrelationships among the data sets, with the 16S rRNA and 18S V9 rRNA pair showing the strongest correlation (0.98, 0.97), followed by NTS and 18S V9 rRNA (0.94, 0.95) and then NTS and 16S rRNA (0.93, 0.91), for the first and second latent variables (LVs), respectively. The results are consistent with those obtained from coinertia analysis of the NTS data (using XCMS) with the other data sets, and they confirm the general coherence of the NTS data with patterns in microeukaryotic and prokaryotic communities. To focus the analysis on the effects of treatment (α), time (β), and their interaction (αβ), we built balanced ASCA models for the individual data sets. Across the three data blocks, both the treatment effect and the time effect significantly contributed to the variation, as confirmed by permutation tests (p-values <0.0001; Table S8). The time effect consistently explains a larger portion of the variation than the treatment effect in all blocks. The treatment-time interactions are more prominent than the overall treatment effect in the 16S rRNA and 18S V9 rRNA data, explaining 14.3% and 13.6% of the variation, respectively. Conversely, the treatment-time interaction is considered minor (p-values >0.1) for the NTS data, accounting for 6.6% of the total variance. The residual matrices also explain a high percentage of variance, with the 16S rRNA block showing a smaller value than the other blocks. PCA, however, revealed no significant pattern in these residuals, suggesting natural experimental uncertainty.
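The per-LV correlations quoted above come from two-block PLS models. A minimal sketch of the idea, assuming a PLS-SVD variant in which paired weight vectors are taken from the singular value decomposition of the cross-product matrix (without the deflation step used by some PLS implementations), is shown below on simulated blocks. The function name and data are hypothetical.

```python
import numpy as np


def pls_svd_score_correlations(X, Y, n_components=2):
    """Correlation between paired X- and Y-scores for the first LVs.

    Weights are the singular vectors of Xc.T @ Yc, which maximize the
    covariance between the two score vectors of each pair.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    corrs = []
    for k in range(n_components):
        t = Xc @ U[:, k]   # X-block scores for LV k
        q = Yc @ Vt[k]     # Y-block scores for LV k
        corrs.append(float(np.corrcoef(t, q)[0, 1]))
    return corrs


# Two simulated blocks measured on the same 42 samples, sharing one
# latent pattern plus independent noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=42)
X = np.outer(shared, rng.normal(size=20)) + 0.1 * rng.normal(size=(42, 20))
Y = np.outer(shared, rng.normal(size=15)) + 0.1 * rng.normal(size=(42, 15))

corrs = pls_svd_score_correlations(X, Y)
```

A high correlation on the first LV indicates that the two blocks carry a common dominant pattern, which is the kind of evidence used here to justify proceeding to a joint multiblock model.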

SLIDE-ASCA and JIVE-ASCA Models

We then implemented SLIDE on the concatenated data matrix to integrate the analysis and disentangle the global and local common structures, the latent spaces specific to each data set, and residual noise. Table 1 summarizes the results of rank estimation and the percentage of variation explained, computed as the ratio of the squared Frobenius norm of the estimated signal to the squared Frobenius norm of the data. The SLIDE results indicate that a substantial part of the variance in all data blocks is accounted for by the global common structure, which has an estimated rank of three. Local common structures contribute notably in 16S rRNA, while distinct variations are more pronounced in NTS. The corresponding results for data decomposition using JIVE are also presented in this table. These include the low-rank approximations of global common variation across data types, variations individual to each data type, and residual noise. SLIDE explains more of the global common variance across all data sets than JIVE, which tends to allocate more variance to individual structured variations and residual noise. Moreover, SLIDE unravels local common structures that JIVE cannot identify. This is the main reason for JIVE’s suboptimal solutions, as evidenced by the local common contributions obtained by SLIDE for the 16S rRNA and 18S V9 rRNA data sets.
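The variance measure defined above, the squared Frobenius norm of an estimated signal relative to that of the data, can be sketched as follows. The "estimated signal" here is simply a rank-3 SVD approximation of simulated data, used as a stand-in for a SLIDE or JIVE component; it is an assumption for illustration only.

```python
import numpy as np


def variance_explained(signal, data):
    """Ratio of squared Frobenius norms of an estimated signal and the data."""
    return np.linalg.norm(signal, "fro") ** 2 / np.linalg.norm(data, "fro") ** 2


rng = np.random.default_rng(0)
X = rng.normal(size=(42, 20))
X -= X.mean(axis=0)

# Stand-in for an estimated structure: the best rank-3 approximation of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
signal = (U[:, :3] * s[:3]) @ Vt[:3]
residual = X - signal

share_signal = variance_explained(signal, X)
share_residual = variance_explained(residual, X)
```

Because the signal and residual in this toy case are orthogonal, the two shares add up to one, which is why the structured and residual percentages for a block can be read as a partition of its total variation.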

Table 1. Variance Explanation and Rank Estimation using SLIDE and JIVE, and their Relevant Factor Effects for Integrated Processing of NTS, 16S rRNA, and 18S V9 rRNA Datasets (a)

| Variance | NTS | 16S rRNA | 18S V9 rRNA | Total | Rank Estimation | Factor involved |
|---|---|---|---|---|---|---|
| Common (global) | 48.4% (40.9%) | 38.6% (32.7%) | 35.9% (27.5%) | 41.0% | 3 (2) | α (p < 0.0001), β (p < 0.0001), αβ (p < 0.0001) |
| Common (local) | - | 9.6% | 6.4% | 5.3% | 1 | β (p < 0.0001) |
| Common (local) | 8.7% | - | 3.7% | 4.2% | 1 | N.S. (b) |
| Distinctive | 13.4% (23.2%) | - | - | 4.5% | 3 (6) | N.S. (b) |
| Distinctive | - | 7.8% (13.1%) | - | 2.6% | 2 (4) | αβ (p < 0.0001) |
| Distinctive | - | - | 3.7% (22.2%) | 1.2% | 1 (4) | αβ (p < 0.0001) |
| Residual | 29.5% (36.0%) | 43.9% (54.2%) | 50.2% (50.3%) | 41.2% | - | - |

(a) The percent of variance and rank estimation for JIVE are provided in the parentheses.

(b) N.S.: not significant.

The SLIDE-ASCA model was then used to further decompose the global common latent space of the three data sets according to the design elements of the study. By applying the ASCA model to the shared space of the three data blocks, common variations representing the effects of α, β, and their interaction term αβ were distinctly separated, allowing previously untraceable patterns of effect and interaction in the SLIDE shared space to be quantitatively assessed. Permutation tests showed that both factors and their interaction are significant (p-values = 0.0001). Figure 1 displays ASCA score patterns for the time effect, the primary driver of variation in each data set and in their global joint space obtained by SLIDE. Panels A–C show score scatter plots from sampling times 1 h to 10 days for the NTS, 16S rRNA, and 18S V9 rRNA data, highlighting significant temporal changes in water composition, including chemicals and prokaryotic and microeukaryotic communities. The results clearly show that different layers of temporal variability are captured by PC1 and PC2 and that the samples cluster according to each PC subspace. Comparing the NTS data with the other blocks (panels B and C) reveals that the early sampling points are more aligned with the 16S rRNA gene data, while the 18S V9 rRNA data show a steadily evolving pattern across clusters S1–S4, S5, and S6, with increasing variability in later stages. Prokaryotes respond faster to environmental chemicals due to their rapid metabolism and short generation times, allowing them to utilize compounds efficiently. In contrast, microeukaryotes, with longer generation cycles and reliance on prokaryotes as a food source, show delayed responses. This leads to continuous clustering of microeukaryotic patterns over time, as they reflect the cumulative effects of chemicals processed by prokaryotes.
This difference suggests that prokaryotes react swiftly to contaminants due to simpler defenses, while microeukaryotes exhibit a slower, sequential response as they indirectly accumulate these chemicals. The more uniform temporal patterns of eukaryotes and prokaryotes are highly consistent with what was previously reported using three-dimensional principal coordinate analysis (PCoA, p < 0.001) based on the Bray–Curtis dissimilarity measure. ASCA modeling of the global common components of SLIDE identified a joint temporal space for factor β (panel D) that explains 74.6% of the total variation of the global shared space.

Figure 1. ASCA score plots for time effect (factor β) for NTS (A), 16S rRNA (B), 18S V9 rRNA (C), and global common space of three data blocks using SLIDE-ASCA (D), respectively.

Regarding the treatment effect, the ASCA score plots (Figure 2) reveal consistent patterns across the three data blocks, with the 18S V9 rRNA block showing the highest impact at 10.4% of total variation, compared to 6.8% and 6.5% for NTS and 16S rRNA, respectively. In the global common space, factor α accounted for 18.3% of the variation based on the ASCA model.

Figure 2. ASCA score plots for PC1 of treatment effect for individual data blocks and global common space of three data blocks using SLIDE-ASCA.

The treatment–time interaction (αβ) submodels in both ASCA and SLIDE-ASCA visualize sample group deviations from the overall α and β effects (Figure S7 and SI-9). The corresponding loading plots for factors β, α, and αβ from the global common space of the SLIDE-ASCA model are visualized in Figure S8. In this figure, the global shared loading plots for each data set are superimposed on the loading values from the initial individual ASCA models, clearly showing that the variables responsible for joint variation across data sets exhibit a consistently reduced set of loading values compared to the initial ASCA models. This further verifies the SLIDE-ASCA model performance, which is primarily due to reliable component estimation by SLIDE.

The most relevant features were identified using sPCA. For the joint α submodel, 15, 45, and 90 features explained 67%, 62%, and 44% of the variance in the NTS, 16S rRNA, and 18S V9 rRNA blocks, respectively. For the β effect, PC1 accounted for 25–57% (using 45–50 features) and PC2 for 4–17% (using 15–90 features). To ensure consistency and interpretability, we adopted a fixed set of 15 variables per data set, guided by the lowest optimal value during tuning and ranked by absolute loading values. These represent the most influential contributors to the latent variation patterns. Table S9 summarizes the top chemical and microbial features across globally shared components in the SLIDE-ASCA model. There is a strong overlap (87–100%) between α effect and αβ interaction features, indicating that treatment-related features evolve temporally across data sets. Figure 3 displays the heatmap resulting from hierarchical clustering of prioritized features related to the time effect. A set of key markers, specifically 4 chemical compounds, 8 microeukaryotic OTUs, and 9 prokaryotic OTUs, are closely linked with the second tier of time variability (β-PC2), distinguishing the first time point. This is evident in the red cluster in the figure, which shows high positive values and strong correlations, highlighting their distinct role in this time frame. Among the prioritized NTS features, nine compounds were tentatively identified (confirmation level 2a), while the remaining compounds were unknown (Table S10).

Figure 3. Heatmap analysis of key bio/chemical features associated with the time effect (β submodel) through SLIDE-ASCA modeling of NTS, 16S rRNA, and 18S V9 rRNA. The color scale represents time-associated changes, with red (up to +3) and blue (down to −3) for strong positive and negative associations, respectively.

The JIVE-ASCA output, however, is inconsistent with that obtained with the SLIDE-ASCA models for the current data sets. In fact, the suboptimal estimation of the global joint variations leads to ASCA submodels that overweight some variables compared to the initial individual models (Figure S9). Nevertheless, while JIVE-ASCA is less suited for data sets with locally shared variation, its implementation here provides a useful comparison for assessing model behavior and the impact of local common structures on global integration.

Further assessment of ASCA submodels for the SLIDE-estimated local common components shows that the component shared by 18S V9 rRNA and 16S rRNA contributes significantly to the β design element (p < 0.0001) and accounts for 5.3% of the total variation in the concatenated data. Score and loading plots of the β submodel for the locally shared space between the 18S V9 rRNA and 16S rRNA data frames are provided in Figures S10 and S11 and Table S11. Initially, both control and treated samples exhibit minor variations until day four, when a pronounced shift occurs, characterized by notable negative scores for both sample types; from day seven onward, the samples exhibit increasing positive scores. The score pattern, recovered clearly by this approach across extended time intervals, reveals a broader level of temporal dynamics specific to the biological domain. In fact, biotic interactions between organisms, such as predation and competition, are partly species-specific. Therefore, it is to be expected that the eukaryotic and prokaryotic data sets covary to some extent independently of shifts in the chemical data.

ASCA was then applied to the distinctive component space of the NTS data block (i.e., NTS variation not explained by globally or locally shared structures) and showed no statistically significant design effects (p > 0.05). This suggests that the remaining NTS variation is likely due to platform-specific characteristics or experimental variation rather than any factor related to the study design (Figure S12). In contrast, the distinct variations of the 16S rRNA and 18S V9 rRNA data sets, with two and one components, account for 7.8% and 3.7% of the variation in each data frame, respectively. These components play a statistically significant role in explaining certain residual αβ variations within the integrated data that are not captured by either global or local latent spaces (Figure S13). The first PC scores indicate that the unique variation in 16S rRNA is primarily associated with the effect of TWW treatment at the initial time point (1 h). Meanwhile, the variability pattern specific to 18S V9 rRNA shows divergence at most time points in TWW-treated samples relative to control samples. Key OTUs contributing to this source-specific variation are also highlighted in the loading patterns (see Table S11). The overall results highlight a key strength of SLIDE-ASCA: while it disentangles shared and block-specific sources of variation, it also enables the evaluation of residual block-specific signals to determine whether they align with design factors and reflect meaningful chemical/biological information or whether they more likely stem from confounding, source-specific structure. This demonstrates that the method not only preserves source-specific variation but also organizes it for systematic evaluation. In this regard, a prior individual analysis of each data set is highly beneficial for establishing baseline insights before applying SLIDE-ASCA.
This strategy helps verify patterns, understand loading structures, and ensure that data integration highlights important trends.

While SLIDE-ASCA offers clear advantages in terms of interpretability and structured decomposition of effects, it also inherits limitations from both constituent methods. SLIDE assumes orthogonality of components and is applied to continuous, complete data, making it less suited for data sets with missing values or mixed types. Its performance further relies on the optimal recovery of low-rank structures and well-aligned sample designs across data blocks. Data sets with high levels of noise, strong group imbalance, or incompatible scaling between blocks can lead to instability in component estimation. Proper normalization is essential to prevent dominant data sets from distorting shared components and should follow individual block preprocessing. ASCA introduces additional assumptions, such as within-group variable independence, and is sensitive to scaling. Computationally, SLIDE-ASCA is more intensive than either method alone, relying on heuristics like penalized factorization and bicross-validation. The complexity increases with the number of blocks, components, and permutations. Future improvements may focus on enhancing scalability, incorporating mixed data types, and improving robustness to missing values.

sPLS-DA and DIABLO Models

We addressed a four-class classification problem based on α and β effects identified in ASCA (see Table S2). Optimal sPLS-DA performance was achieved by selecting variables minimizing the balanced error rate, validated via 5-fold cross-validation (50 repetitions). The best model, with 50 and 12 stable variables (>50%) on LV1 and LV2, yielded zero classification error. The resulting 2D score plots revealed temporal separation on LV1 (33% variance) and treatment effect on LV2 (7% variance) (Figure S14 and SI-10). Feature overlap between sPLS-DA and ASCA was 50% and 90% for LV1 and LV2, respectively, among the top 10 variables. Lower β-related concordance aligns with prior findings for highly correlated variables. We then applied DIABLO for integrated multiomics analysis using the same cross-validation strategy. The chosen design matrix enabled both class discrimination and feature correlation. Sample groups were clearly separated in the integrated space, confirming sPLS-DA results and revealing strong cross-block coherence at early time points (Figure S15). The model with the lowest error (2.2%) selected 10 + 10 microeukaryotic OTUs, 50 + 50 prokaryotic OTUs, and 30 + 10 chemical compounds for LV1 and LV2 (Figures S16–S18). Selected chemo/bio variables contributing to both LVs are illustrated in the correlation circle plot (Figure 4). Consistent with expectations, the chemical features identified by DIABLO for the NTS block represented a subset of those selected by individual sPLS-DA modeling, with 30 of 50 for LV1 and 10 of 12 for LV2. Comparison of the top 10 features from the globally shared components of SLIDE-ASCA and the LV2 and LV1 components of DIABLO revealed an overall overlap of 50–60%, reaching 90% for features in the NTS LV2 space (Table S9), consistent with the individual NTS block comparison.
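Two quantities used in this comparison are simple to state in code: the balanced error rate (the mean of the per-class error rates, so that small classes weigh as much as large ones) and the top-k feature overlap between two rankings. The sketch below is illustrative only; the class labels A to D and the feature names are hypothetical placeholders and do not reproduce the classes of Table S2 or any reported feature.

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class misclassification rates."""
    rates = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        rates.append(wrong / len(idx))
    return sum(rates) / len(rates)

def top_k_overlap(ranked_a, ranked_b, k=10):
    """Fraction of the top-k features of two rankings that coincide."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

# Hypothetical 4-class problem with three samples per class and one error.
y_true = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
y_pred = ["A", "A", "B"] + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
ber = balanced_error_rate(y_true, y_pred)

# Hypothetical rankings sharing five of their ten top features.
rank_model_1 = [f"f{i}" for i in range(1, 11)]
rank_model_2 = [f"f{i}" for i in range(6, 16)]
overlap = top_k_overlap(rank_model_1, rank_model_2, k=10)

print(f"BER = {ber:.3f}, top-10 overlap = {overlap:.0%}")
```

In a cross-validated setting, the balanced error rate would be computed on the held-out fold of each repetition and averaged before comparing candidate numbers of selected variables.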

Figure 4. Correlation circle plot for integrated modeling of NTS, 16S rRNA, and 18S V9 rRNA with DIABLO, for a 4-class classification problem.

Associations between Chemical Markers with Prokaryotic/Microeukaryotic OTUs

To investigate specific associations between prioritized organic compounds and key prokaryotic/microeukaryotic OTUs, an integrative network analysis utilizing SLIDE-ASCA temporal and treatment-related submodels (Figures S19 and S20) in the global joint space was performed. For visual clarity, associations between prokaryotic and microeukaryotic OTUs are excluded from the figures but detailed in Tables S12–S15. Every network exhibits two types of temporal variation, of which the first dimension is highly distinctive. The total number of strong associations (|r| ≥ 0.8) for each NTS feature with OTUs is depicted in Figure S21. Overall, 271 and 294 temporal associations and 187 and 243 treatment-relevant associations were found between key NTS compounds and prokaryotic and microeukaryotic OTUs, respectively. In summary, our integrative analysis highlighted several key organic compounds, such as caprolactam (CAP), 3-indoleacetonitrile (IAN), 4-dodecylbenzenesulfonic acid (DBSA), and 8-(4-sulfophenyl)octanoic acid (8-SPOA), showing strong associations with specific prokaryotic and microeukaryotic OTUs. These associations suggest notable ecological interactions, with CAP potentially serving as a nutrient source and influencing microbial community structure, while other compounds like IAN and DBSA emerge as byproducts of biodegradation pathways. Additionally, various OTUs demonstrated both positive and negative associations with these compounds, reflecting complex dynamics within the microbial community that align with previous studies on micropollutant impacts. Further details are provided in SI-11 and Figures S22 and S23.
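The counting of strong associations (|r| ≥ 0.8) can be sketched as follows. This is a simplification under stated assumptions: it uses plain Pearson correlation on raw profiles, whereas the networks above are built within the SLIDE-ASCA submodels, and the profiles and OTU names below are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no association
    return sxy / math.sqrt(sxx * syy)

def strong_associations(nts, otus, threshold=0.8):
    """List (compound, OTU, r) for every pair with |r| at or above the threshold."""
    edges = []
    for cname, cvals in nts.items():
        for oname, ovals in otus.items():
            r = pearson(cvals, ovals)
            if abs(r) >= threshold:
                edges.append((cname, oname, r))
    return edges

# Hypothetical profiles over six samples.
nts = {"CAP": [1, 2, 3, 4, 5, 6]}
otus = {
    "OTU_a": [2, 4, 6, 8, 10, 12],   # rises with the compound
    "OTU_b": [6, 5, 4, 3, 2, 1],     # falls as the compound rises
    "OTU_c": [1, -1, 1, -1, 1, -1],  # essentially unrelated
}
edges = strong_associations(nts, otus)
for compound, otu, r in edges:
    print(f"{compound} - {otu}: r = {r:+.2f}")
```

Keeping the sign of r in each edge preserves the distinction between positive and negative associations discussed above.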

Conclusions

This study demonstrates the benefits of incorporating experimental design structure into multiblock data analysis using the SLIDE-ASCA framework. By integrating nontargeted LC-HRMS data with 16S rRNA and 18S V9 rRNA gene metabarcoding, SLIDE-ASCA enabled structured decomposition of variation into globally shared, locally shared, and block-specific components, each aligned with specific experimental factors. This provided a more comprehensive interpretation of induced bio/chemical alterations across platforms while facilitating the identification of key markers and chemical–biological associations in specific submodels. By quantifying the contributions of temporal, treatment, and interaction effects, SLIDE-ASCA identified time as the primary driver of globally shared variation across the three data sets and revealed two layers of temporal variability. This statistical framework effectively distinguished natural from wastewater-influenced dynamics, offering clearer insights into ecosystem responses and cross-domain associations. The method revealed that locally shared variation between biological data sets reflects purely temporal changes, independent of NTS, highlighting a distinct time-driven biological pattern. A key strength of SLIDE-ASCA lies in its ability to preserve and organize block-specific signals, allowing researchers to determine whether they reflect meaningful design-driven effects or potential confounding variations. Taken together, the notable advantages of this approach include (i) integration of multisource data in a design-aware framework, (ii) separation of overlapping sources of variation into all types of contributions, and (iii) enhanced interpretability and feature selection.
However, the method inherits certain limitations from its constituents, including assumptions of component orthogonality and complete, continuous data, as well as sensitivity to noise and group imbalance, and should be applied with careful consideration of its underlying assumptions and data characteristics. Despite this, SLIDE-ASCA provides a powerful and flexible framework for disentangling complex variation in multiomics data sets, improving marker discovery and interpretation in environmental monitoring and other high-throughput applications.

Supplementary Material

ac5c00539_si_001.pdf (3.2MB, pdf)
ac5c00539_si_002.xlsx (216.6KB, xlsx)

Acknowledgments

This research was supported by the German Research Foundation (DFG) under grant number 520243139. Additionally, we appreciate the support provided by the DFG under CRC 1439/1 (project number 426547801), within the framework of the Collaborative Research Center (CRC) RESIST.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.analchem.5c00539.

  • Workflow overview; sampling design; LC-HRMS methodology; MCR-ALS modeling and compound identification; processing of microeukaryotic/prokaryotic data; single/multiblock data processing methods; score and loading plots for submodels; table of key features; relevance network graphs (PDF)

  • Associations between chemical markers and prokaryotic/microeukaryotic OTUs; list of OTUs (XLSX)

The authors declare no competing financial interest.

References

  1. Aceña J., Stampachiacchiere S., Pérez S., Barceló D. Advances in liquid chromatography–high-resolution mass spectrometry for quantitative and qualitative environmental analysis. Anal. Bioanal. Chem. 2015;407(21):6289–6299. doi: 10.1007/s00216-015-8852-6.
  2. Vosough M., Schmidt T. C., Renner G. Non-target screening in water analysis: Recent trends of data evaluation, quality assurance, and their future perspectives. Anal. Bioanal. Chem. 2024;416(9):2125–2136. doi: 10.1007/s00216-024-05153-8.
  3. Shi C., Mahadwar G., Dávila-Santiago E., Bambakidis T., Crump B. C., Jones G. D. Nontarget Chemical Composition of Surface Waters May Reflect Ecosystem Processes More than Discrete Source Contributions. Environ. Sci. Technol. 2023;57(46):18296–18305. doi: 10.1021/acs.est.2c08540.
  4. Sieber G., Drees F., Shah M., Stach T. L., Hohrenk-Danzouma L., Bock C., Vosough M., Schumann M., Sures B., Probst A. J., et al. Exploring the efficacy of metabarcoding and non-target screening for detecting treated wastewater. Sci. Total Environ. 2023;903:167457. doi: 10.1016/j.scitotenv.2023.167457.
  5. Santiago-Rodriguez T. M., Hollister E. B. Multi ‘omic data integration: A review of concepts, considerations, and approaches. Semin. Perinatol. 2021;45(6):151456. doi: 10.1016/j.semperi.2021.151456.
  6. Adamo M., Voyron S., Chialva M., Marmeisse R., Girlanda M. Metabarcoding on both environmental DNA and RNA highlights differences between fungal communities sampled in different habitats. PLoS One. 2020;15(12):e0244682. doi: 10.1371/journal.pone.0244682.
  7. Bock C., Jensen M., Forster D., Marks S., Nuy J., Psenner R., Beisser D., Boenigk J. Factors shaping community patterns of protists and bacteria on a European scale. Environ. Microbiol. 2020;22(6):2243–2260. doi: 10.1111/1462-2920.14992.
  8. Adyari B., Shen D., Li S., Zhang L., Rashid A., Sun Q., Hu A., Chen N., Yu C. P. Strong impact of micropollutants on prokaryotic communities at the horizontal but not vertical scales in a subtropical reservoir, China. Sci. Total Environ. 2020;721:137767. doi: 10.1016/j.scitotenv.2020.137767.
  9. Li W., Zhang S., Liu C.-C., Zhou X. J. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics. 2012;28(19):2458–2466. doi: 10.1093/bioinformatics/bts476.
  10. Tenenhaus A., Tenenhaus M. Regularized Generalized Canonical Correlation Analysis. Psychometrika. 2011;76(2):257–284. doi: 10.1007/s11336-011-9206-8.
  11. Lê Cao K.-A., Boitard S., Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinf. 2011;12(1):253. doi: 10.1186/1471-2105-12-253.
  12. Singh A., Shannon C. P., Gautier B., Rohart F., Vacher M., Tebbutt S. J., Le Cao K. A. DIABLO: An integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35(17):3055–3062. doi: 10.1093/bioinformatics/bty1054.
  13. Lock E. F., Hoadley K. A., Marron J. S., Nobel A. B. Joint and Individual Variation Explained (Jive) for Integrated Analysis of Multiple Data Types. Ann. Appl. Stat. 2013;7(1):523–542. doi: 10.1214/12-AOAS597.
  14. Schouteden M., Van Deun K., Pattyn S., Van Mechelen I. SCA with rotation to distinguish common and distinctive information in linked data. Behav. Res. Methods. 2013;45(3):822–833. doi: 10.3758/s13428-012-0295-9.
  15. Gaynanova I., Li G. Structural learning and integrative decomposition of multi-view data. Biometrics. 2019;75(4):1121–1132. doi: 10.1111/biom.13108.
  16. Song Y., Westerhuis J. A., Smilde A. K. Separating common (global and local) and distinct variation in multiple mixed types data sets. J. Chemom. 2020;34(1):e3197. doi: 10.1002/cem.3197.
  17. Jansen J. J., Hoefsloot H. C. J., van der Greef J., Timmerman M. E., Westerhuis J. A., Smilde A. K. ASCA: Analysis of multivariate data obtained from an experimental design. J. Chemom. 2005;19(9):469–481. doi: 10.1002/cem.952.
  18. Alinaghi M., Bertram H. C., Brunse A., Smilde A. K., Westerhuis J. A. Common and distinct variation in data fusion of designed experimental data. Metabolomics. 2020;16:1. doi: 10.1007/s11306-019-1622-2.
  19. Tauler R., Maeder M., de Juan A. 2.15 - Multiset Data Analysis: Extended Multivariate Curve Resolution. In Comprehensive Chemometrics; Brown S., Tauler R., Walczak B., Eds.; Elsevier, 2020; pp 305–336. doi: 10.1016/B978-0-12-409547-2.14702-X.
  20. Gorrochategui E., Jaumot J., Tauler R. ROIMCR: A powerful analysis strategy for LC-MS metabolomic datasets. BMC Bioinf. 2019;20(1):256. doi: 10.1186/s12859-019-2848-8.
  21. Gorrochategui E., Jaumot J., Lacorte S., Tauler R. Data analysis strategies for targeted and untargeted LC-MS metabolomic studies: Overview and workflow. TrAC, Trends Anal. Chem. 2016;82:425–442. doi: 10.1016/j.trac.2016.07.004.
  22. Perez-Lopez C., Oro-Nolla B., Lacorte S., Tauler R. Regions of Interest Multivariate Curve Resolution Liquid Chromatography with Data-Independent Acquisition Tandem Mass Spectrometry. Anal. Chem. 2023;95(19):7519–7527. doi: 10.1021/acs.analchem.2c05704.
  23. Hohrenk L. L., Vosough M., Schmidt T. C. Implementation of Chemometric Tools To Improve Data Mining and Prioritization in LC-HRMS for Nontarget Screening of Organic Micropollutants in Complex Water Matrixes. Anal. Chem. 2019;91(14):9213–9220. doi: 10.1021/acs.analchem.9b01984.
  24. Vosough M., Salemi A., Rockel S., Schmidt T. C. Enhanced efficiency of MS/MS all-ion fragmentation for non-targeted analysis of trace contaminants in surface water using multivariate curve resolution and data fusion. Anal. Bioanal. Chem. 2024;416(5):1165–1177. doi: 10.1007/s00216-023-05102-x.
  25. Dalmau N., Bedia C., Tauler R. Validation of the Regions of Interest Multivariate Curve Resolution (ROIMCR) procedure for untargeted LC-MS lipidomic analysis. Anal. Chim. Acta. 2018;1025:80–91. doi: 10.1016/j.aca.2018.04.003.
  26. Rohart F., Gautier B., Singh A., Le Cao K. A. mixOmics: An R package for 'omics feature selection and multiple data integration. PLoS Comput. Biol. 2017;13(11):e1005752. doi: 10.1371/journal.pcbi.1005752.
  27. Arumugam M., Raes J., Pelletier E., Le Paslier D., Yamada T., Mende D. R., Fernandes G. R., Tap J., Bruls T., Batto J.-M., et al. Enterotypes of the human gut microbiome. Nature. 2011;473(7346):174–180. doi: 10.1038/nature09944.
  28. Le Cao K. A., Costello M. E., Lakis V. A., Bartolo F., Chua X. Y., Brazeilles R., Rondeau P. MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLoS One. 2016;11(8):e0160169. doi: 10.1371/journal.pone.0160169.
  29. Shen H., Huang J. Z. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 2008;99(6):1015–1034. doi: 10.1016/j.jmva.2007.06.007.
  30. Pérez-Cova M., Bedia C., Stoll D. R., Tauler R., Jaumot J. MSroi: A pre-processing tool for mass spectrometry-based studies. Chemom. Intell. Lab. Syst. 2021;215:104333. doi: 10.1016/j.chemolab.2021.104333.
  31. Schymanski E. L., Jeon J., Gulde R., Fenner K., Ruff M., Singer H. P., Hollender J. Identifying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence. Environ. Sci. Technol. 2014;48(4):2097–2098. doi: 10.1021/es5002105.
  32. Jaumot J., Tauler R. MCR-BANDS: A user friendly MATLAB program for the evaluation of rotation ambiguities in Multivariate Curve Resolution. Chemom. Intell. Lab. Syst. 2010;103(2):96–107. doi: 10.1016/j.chemolab.2010.05.020.
  33. Lotfi Khatoonabadi R., Vosough M., Hohrenk L. L., Schmidt T. C. Employing complementary multivariate methods for a designed nontarget LC-HRMS screening of a wastewater-influenced river. Microchem. J. 2021;160:105641. doi: 10.1016/j.microc.2020.105641.
  34. Smith C. A., Want E. J., O’Maille G., Abagyan R., Siuzdak G. A. XCMS: Processing mass spectrometry data for metabolite profiling using Nonlinear Peak Alignment, Matching, and Identification. Anal. Chem. 2006;78(3):779–787. doi: 10.1021/ac051437y.
  35. Bertinetto C., Engel J., Jansen J. ANOVA simultaneous component analysis: A tutorial review. Anal. Chim. Acta: X. 2020;6:100061. doi: 10.1016/j.acax.2020.100061.
  36. Hohrenk-Danzouma L. L., Vosough M., Merkus V. I., Drees F., Schmidt T. C. Non-target Analysis and Chemometric Evaluation of a Passive Sampler Monitoring of Small Streams. Environ. Sci. Technol. 2022;56(9):5466–5477. doi: 10.1021/acs.est.1c08014.
  37. Otzen M., Palacio C., Janssen D. B. Characterization of the caprolactam degradation pathway in Pseudomonas jessenii using mass spectrometry-based proteomics. Appl. Microbiol. Biotechnol. 2018;102(15):6699–6711. doi: 10.1007/s00253-018-9073-7.
  38. Patten C. L., Blakney A. J. C., Coulson T. J. D. Activity, distribution and function of indole-3-acetic acid biosynthetic pathways in bacteria. Crit. Rev. Microbiol. 2013;39(4):395–415. doi: 10.3109/1040841X.2012.716819.
  39. Gu Y., Qiu Y., Hua X., Shi Z., Li A., Ning Y., Liang D. Critical biodegradation process of a widely used surfactant in the water environment: Dodecyl benzene sulfonate (DBS). RSC Adv. 2021;11(33):20303–20312. doi: 10.1039/D1RA02791C.
  40. Blunt S. M., Sackett J. D., Rosen M. R., Benotti M. J., Trenholm R. A., Vanderford B. J., Hedlund B. P., Moser D. P. Association between degradation of pharmaceuticals and endocrine-disrupting compounds and microbial communities along a treated wastewater effluent gradient in Lake Mead. Sci. Total Environ. 2018;622–623:1640–1648. doi: 10.1016/j.scitotenv.2017.10.052.

Articles from Analytical Chemistry are provided here courtesy of American Chemical Society
