Abstract
High-throughput sequencing generates vast amounts of data that often contain low-quality bases, chimeras, and artifacts, which can mislead taxonomic classification and diversity assessments. The divisive amplicon denoising algorithm 2 (DADA2) enhances taxonomic resolution by excluding low-quality bases and optimizing amplicon sequence variant inference. Proper truncation reduces computational load while maintaining key hypervariable regions for accurate classification. In this study, we examine how various truncation lengths applied during DADA2 analysis affect statistical robustness and the reliability of microbial community profiling in ecological and environmental studies. Truncating reads to a length of 175–185 bp improves the recovery rate of quality reads and preserves microbial diversity in the V4 hypervariable region of Illumina paired-end reads. Adopting this truncation length strategy optimizes read recovery and preserves the richness and evenness of microbial communities.
Keywords: microbiome, QIIME, DADA2, read truncation, quality filtering, paired-end reads, alpha-diversity
Introduction
In microbial ecology and environmental biology, analyzing microbial communities through high-throughput sequencing has revolutionized our understanding of biodiversity, ecological interactions, and ecosystem functioning. QIIME 2 (Quantitative Insights Into Microbial Ecology), one of the most widely used platforms for analyzing and interpreting microbiome data, offers comprehensive tools for processing and visualizing large sets of sequencing data. Among the various preprocessing steps, trimming the sequence length of reads is crucial to ensure the accuracy and reliability of downstream analyses [1, 2]. High-throughput sequencing technologies, such as Illumina sequencing, generate vast amounts of sequencing data that provide insights into the composition and function of microbial communities. However, raw sequences often come with inherent issues such as low-quality bases, chimeras, and non-target (host or contaminant) sequences. These factors can dramatically affect the quality of the data and, consequently, the interpretations drawn from it [3]. Quality control is a fundamental aspect of sequence analysis that determines the integrity and validity of results. Trimmed sequences help polish the dataset’s quality, allowing for better detection of true microbial diversity and accurate taxonomic assignment [4]. Low-quality sequences can introduce noise into the dataset, potentially leading researchers to erroneous conclusions regarding the composition and structure of microbial communities. By trimming sequences, researchers can remove parts of reads that do not meet quality standards, significantly increasing the overall data quality [5]. Sequencing artifacts, including errors introduced during the sequencing process, can result in misinterpretations of microbial community structure. This can involve substitutions, insertions, deletions, or biases associated with specific sequences [6]. 
Trimming identified low-quality bases from the beginning and end of sequences helps eliminate potential mismatches during alignment and reduces the likelihood of assigning incorrect taxonomy to amplicon sequence variants (ASVs) [7]. Effective trimming can also reduce the creation of chimeric sequences, which arise when two sequences join together, leading to false inflation of biodiversity estimates [8].
Performance in downstream statistical analyses relies heavily on the quality of input data. Tools used for diversity analyses, such as alpha and beta diversity metrics, depend on an accurate representation of community richness and evenness. If sequences are left untrimmed, analyses can produce misleading results, with potential increases in false positives and negatives [9]. For instance, alpha-diversity measures such as Shannon or Simpson indices can be disproportionately influenced when low-quality sequences are included in these calculations. Addressing these concerns through effective trimming ensures that all statistical assumptions and models are based on high-quality, reliable data. The specifics of trimming often involve determining an appropriate “cut-off” length for sequences. This cut-off is typically chosen based on several criteria. The quality score associated with each base in a sequence is a critical determinant of the decision to trim. The Phred score, commonly used in sequencing data, provides a logarithmic representation of base call accuracy. Sequences that fall below a predetermined threshold (e.g., Q20 or Q30) are often candidates for trimming or exclusion [10]. Sequences that are shorter than a defined minimum length after trimming may not contain enough information for accurate taxonomic assignment. Trimming sequences to a consistent length ensures that subsequent analyses within a study are based on comparable data breadth across samples [2]. For paired-end sequencing, overlapping regions between forward and reverse reads may need to be trimmed to yield optimal merged reads. This step enhances the accuracy of read merging and helps reduce the generation of uninformative sequences [8]. In many studies, particular regions (like the 16S rRNA gene) are targeted for amplification. Trimming ensures that only the regions of interest are retained, mitigating the influence of extraneous information that can arise from untrimmed reads [3].
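As a brief illustration of the logarithmic Phred scale mentioned above, a quality score Q corresponds to an error probability of 10^(−Q/10). The sketch below is illustrative only; the function names are hypothetical and are not part of QIIME 2 or DADA2.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability back into a Phred quality score."""
    return -10 * math.log10(p)


# Compare the thresholds discussed in the text (Q18, Q20, Q30).
for q in (18, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} base calls")
```

Under this relationship, Q20 corresponds to one expected error per 100 base calls and Q30 to one per 1000, which is why small changes in the threshold can noticeably change how many bases are trimmed.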
Within the QIIME 2 framework, the divisive amplicon denoising algorithm 2 (DADA2) plays a crucial role in the analysis pipeline, facilitating accurate taxonomic classification and community structure analysis. The integration of DADA2 in QIIME enables researchers to obtain more reliable and reproducible results in ecological studies of microbial populations [2]. Truncating sequences to a specific length during DADA2 analysis in QIIME 2 is a crucial strategy for improving the quality and accuracy of microbial community profiling. DADA2 improves its performance by using quality scores to inform truncation length decisions [2]. DADA2 employs a denoising algorithm that is sensitive to the quality of the input sequences. When poor-quality bases are present, the algorithm may struggle to accurately distinguish true sequences from artifacts or noise [7]. Truncating the sequences to a specific length helps create a more stable input by excluding these problematic ends, thereby allowing DADA2 to generate more reliable ASVs representative of the actual microbial community. Truncating sequences to a length that maintains the core hypervariable regions (e.g. V3-V4 or V4) enables effective taxonomic resolution while reducing the likelihood of including less informative or error-prone sequence data. Many studies have shown that retaining specific length ranges where the sequence variability is high contributes positively to the precision of taxonomic classification across complex microbial communities [10]. This approach is particularly useful for distinguishing closely related taxa, allowing for more robust interpretations of microbial diversity and community structure.
Truncating sequences also streamlines the computational burden in downstream analyses. Longer reads can lead to increased processing times and memory usage. By specifying an optimal truncation length, researchers can reduce the complexity of the data, making analyses such as diversity assessments or statistical comparisons more manageable [11]. This efficiency is especially important during large-scale microbiome studies, where datasets can become quite extensive.
This study aims to contribute to the body of knowledge by conducting a comprehensive analysis of truncation length through a comparative lens. We will identify best practices by assessing already published datasets and provide evidence-based recommendations for determining optimal truncation lengths. The discussion will highlight how these practices enhance the clarity and precision of research findings, ensuring that valuable information is not inadvertently discarded.
Materials and methods
Sample data
The sample data set consists of 100 mouse fecal samples selected from our study: 50 samples from the group fed an atherogenic diet (Group A) and 50 samples from the group fed a normal chow diet (Group B) (BioProject no. PRJNA1143515). These samples were sequenced on an Illumina MiSeq using 2 × 250 bp paired-end technology. We randomly selected samples of the V4 hypervariable region; the accession numbers are available in Supplementary Table S1 (see online supplementary material).
QIIME 2 analysis
QIIME 2 (v2024.2) [12] was used for this study. The raw FASTQ files were imported into a QIIME 2 artifact (.qza) file using the input format “PairedEndFastqManifestPhred33V2”. Using DADA2 [2], the reads were quality-filtered, denoised, and joined, chimeras were removed, and similar sequences were dereplicated into ASVs. In the DADA2 analysis, the Phred score threshold was kept at 18 [13], the V4 primer sequences were trimmed, and the truncation length was set to 0, 170, 175, 180, 185, 190, 195, or 200 for both forward and reverse reads. The remaining parameters were left at their defaults. ASV inference was done using the 16S_full-length Greengenes2 (version 2022.10) database at 99% similarity. Taxonomic classification was performed with the Naïve Bayes classifier trained on the Greengenes2 2022.10 515F/806R V4 region. The classified features were subjected to stringent quality filters. Host-derived sequences (including mitochondria and chloroplast), features with a frequency of one or two, and features and frequencies below 0.01% were removed during the filtering steps. These filtering steps were introduced to reduce residual noise from extremely low-frequency ASVs, improve the stability of diversity metrics by removing singleton and doubleton ASVs, and enhance the robustness of downstream analyses by filtering low-abundance features.
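The effect of a fixed truncation length can be sketched as follows. In q2-dada2, a read is cut at the truncation position and reads shorter than that position are discarded, while a value of 0 disables truncation. The amplicon length of 253 bp used below is an assumed typical value for the primer-free V4 region, not a figure reported in this study, and the helper names are hypothetical. The minimum overlap of 12 bp is the DADA2 default for merging.

```python
from typing import Optional

AMPLICON_LEN = 253  # assumption: typical primer-free V4 amplicon length
MIN_OVERLAP = 12    # DADA2 default minimum overlap for merging read pairs


def truncate_read(seq: str, trunc_len: int) -> Optional[str]:
    """Cut a read at trunc_len; drop it (None) if it is shorter than trunc_len."""
    if trunc_len == 0:          # 0 means "do not truncate"
        return seq
    if len(seq) < trunc_len:    # too short to reach the truncation position
        return None
    return seq[:trunc_len]


def expected_overlap(trunc_f: int, trunc_r: int, amplicon_len: int) -> int:
    """Rough number of bases shared by the forward and reverse reads."""
    return trunc_f + trunc_r - amplicon_len


# Truncation lengths evaluated in this study (identical for both reads).
for t in (170, 175, 180, 185, 190, 195, 200):
    overlap = expected_overlap(t, t, AMPLICON_LEN)
    print(f"trunc{t}: overlap ~{overlap} bp, mergeable: {overlap >= MIN_OVERLAP}")
```

Under these assumptions every tested truncation length leaves far more than the minimum overlap, so differences between settings would be driven mainly by how many reads survive filtering rather than by failed merging. The estimate ignores primer trimming and natural amplicon length variation.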
Core microbiome
The core microbiome for each truncation parameter was analyzed using QIIME 2. A prevalence threshold of ≥0.60 fraction and a relative abundance of ≥0.1% were established to define the core microbiome in this study. A heatmap of the core microbiome was generated using the R packages “ggplot2” (v0.6.2) [14] and “reshape2” (v1.4.2) [15].
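One way to read the two thresholds is sketched below: a taxon counts as core when its relative abundance reaches at least 0.1% in at least 60% of samples. This is one common interpretation and not necessarily the exact rule applied by the QIIME 2 plugin; the function name and the toy data are hypothetical.

```python
def core_taxa(rel_abund, min_prevalence=0.60, min_abundance=0.001):
    """Return taxa that reach min_abundance in at least min_prevalence of samples.

    rel_abund maps each taxon to a list of per-sample relative abundances
    expressed as fractions (0.001 == 0.1%).
    """
    core = []
    for taxon, values in rel_abund.items():
        if not values:
            continue
        n_present = sum(1 for v in values if v >= min_abundance)
        if n_present / len(values) >= min_prevalence:
            core.append(taxon)
    return sorted(core)


# Toy example with five samples and three hypothetical taxa.
toy = {
    "TaxonA": [0.20, 0.10, 0.00, 0.30, 0.15],       # present in 4 of 5 samples
    "TaxonB": [0.0005, 0.002, 0.0, 0.0, 0.0001],    # above 0.1% in 1 of 5 samples
    "TaxonC": [0.01, 0.0, 0.02, 0.03, 0.0],         # present in 3 of 5 samples
}
print(core_taxa(toy))
```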
Alpha rarefaction
The alpha-rarefaction curve was analyzed for the observed richness metric with a maximum sequencing depth of 10,000 in QIIME 2, while all other parameters were set to default. The .csv files were then imported into RStudio (v2025.09.02) [16] and R (v4.5.1) [17], where differences in observed richness between Group A and Group B were assessed using the R packages “tidyr” (v1.3.2) [18] and “dplyr” (v1.2.0) [19].
Alpha diversity
In this study, we employed alpha diversity indices, including observed richness, Shannon, Simpson, and Pielou’s evenness, to determine the richness and evenness of taxa within samples. Alpha diversity was calculated from the ASV feature table using R packages such as “phyloseq” (v1.52.0) [20], “vegan” (v2.8-0) [21], “microbiome” (v1.30.0) [22], “ggpubr” (v0.6.2) [23], and “ggplot2”. Alpha-diversity metrics were used as technical indicators of feature retention, stability of diversity estimates, and denoising efficiency during truncation optimization, rather than for biological inference across truncation settings.
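The four indices can be written out in a few lines. The sketch below assumes the conventions used by the “vegan” package (natural logarithm for Shannon, Simpson reported as 1 − Σp²) and Pielou’s evenness defined as H / ln(S); it is an illustration of the formulas and not the code used in this study.

```python
import math


def alpha_diversity(counts):
    """Compute observed richness, Shannon, Simpson (1 - D), and Pielou's evenness."""
    counts = [c for c in counts if c > 0]   # ignore absent features
    richness = len(counts)
    if richness == 0:
        return {"observed": 0, "shannon": 0.0, "simpson": 0.0, "pielou": 0.0}
    total = sum(counts)
    props = [c / total for c in counts]
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1 - sum(p * p for p in props)
    # Evenness is undefined for a single feature; report 0.0 in that case.
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    return {"observed": richness, "shannon": shannon,
            "simpson": simpson, "pielou": pielou}


# A perfectly even community of four features versus a dominated one.
print(alpha_diversity([10, 10, 10, 10]))
print(alpha_diversity([37, 1, 1, 1]))
```

The two toy samples have the same observed richness but very different Shannon, Simpson, and evenness values, which is why richness and evenness are reported separately.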
Beta diversity
Beta-diversity measures the differences in composition among microbial communities, which provides insight into ecological differentiation, environmental selection, and host-associated microbiome variability. In this study, we utilized the Bray–Curtis dissimilarity matrix to determine the difference in abundance between the groups. We performed two-dimensional PCoA using the Bray–Curtis dissimilarity matrix of the feature ASV table. The analysis was conducted using the R packages “vegan” and “ape” (v5.8-1) [24]. To evaluate significant group separations in ordination space, we applied a permutational multivariate analysis of variance (PERMANOVA) with 999 permutations.
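For reference, the Bray–Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples. The following minimal sketch shows the calculation and the construction of a pairwise matrix; the function names and toy counts are hypothetical, and the study itself used the “vegan” implementation.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def pairwise_bray_curtis(samples):
    """Symmetric dissimilarity matrix for a list of abundance vectors."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


# Three toy samples described by counts of the same three features.
toy_samples = [[10, 0, 5], [5, 5, 5], [0, 20, 0]]
for row in pairwise_bray_curtis(toy_samples):
    print([round(x, 3) for x in row])
```

A value of 0 means identical composition and a value of 1 means that the two samples share no features.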
To assess the concordance among community profiles obtained under different truncation settings, pairwise Bray–Curtis dissimilarities were calculated using the “vegan” package in R. Subsequently, Mantel tests (999 permutations) were performed between the Bray–Curtis distance matrices from each truncation run to quantify their correlation (Mantel r statistic) and assess statistical significance (P-value). A high Mantel r value (approaching 1) indicates strong concordance and minimal compositional distortion across truncation settings.
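The logic of the Mantel test can be sketched as a permutation procedure: correlate the upper triangles of two distance matrices, then repeatedly shuffle the sample order of one matrix to build a null distribution. The code below is a simplified illustration that assumes the Pearson-based, one-sided convention used by “vegan”, with P = (hits + 1) / (permutations + 1); the function names and the toy matrices are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after reordering the samples."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d1, d2, permutations=999, seed=0):
    """Return the Mantel r statistic and a one-sided permutation P-value."""
    rng = random.Random(seed)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    hits = 0
    for _ in range(permutations):
        perm = list(range(len(d2)))
        rng.shuffle(perm)  # permute rows and columns of d2 together
        if pearson(x, upper_triangle(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


d_a = [[0.0, 0.2, 0.5, 0.9], [0.2, 0.0, 0.4, 0.7],
       [0.5, 0.4, 0.0, 0.3], [0.9, 0.7, 0.3, 0.0]]
d_b = [[0.0, 0.25, 0.5, 0.8], [0.25, 0.0, 0.45, 0.7],
       [0.5, 0.45, 0.0, 0.35], [0.8, 0.7, 0.35, 0.0]]
r, p = mantel(d_a, d_b)
print(f"Mantel r = {r:.3f}, P = {p:.3f}")
```

With this formula, the smallest P-value attainable with 999 permutations is 1/1000 = 0.001.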
ANCOMBC2 analysis
The differential abundance of taxa between Groups A and B for the various truncation parameters was assessed from the ASV feature table and metadata using the function ancombc2() from the “ANCOMBC” package (v2.10.1) [25], along with other packages such as “phyloseq” and “tidyverse” (v2.0.0) [26]. P-values were adjusted using Holm’s sequential Bonferroni correction. Setting “p_adj_method” to “holm” controls the family-wise error rate while being less conservative than the classical Bonferroni correction, so that taxa or associations detected as significant are robust without an excessive loss of sensitivity.
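Holm’s step-down procedure can be illustrated in a few lines: the smallest P-value is multiplied by the number of tests m, the next by m − 1, and so on, while enforcing that adjusted values never decrease. The sketch below mirrors the standard definition (as in R’s p.adjust with method “holm”); the function name is hypothetical and the input values are made up for illustration.

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment; returns adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending P-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # keep the sequence monotone
        adjusted[i] = running_max
    return adjusted


# Illustrative raw P-values for three hypothetical taxa.
print(holm_adjust([0.01, 0.04, 0.03]))
```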
Mock community
Mock community sequence data (SRR33838054) [27] were retrieved from NCBI using the SRA Toolkit. Similar truncation parameters and the same database were used to determine the best truncation strategy. Because of the low quality of the reads remaining after the filtering step, the untrimmed dataset was excluded from the downstream analysis. Our analysis pipeline was unable to classify the mock community to the species level. Of the bacterial species in the mock community, only six genera and one family (Enterobacteriaceae) were classified. The percentage of retained reads was calculated from the DADA2 denoising statistics (Supplementary Table S2; see online supplementary material). Bray–Curtis similarity and Pearson’s correlation were assessed using the R package “vegan,” with the mock community composition data provided for the DNA extraction method “supplier M” (ZymoBIOMICS® Microbial Community Standard) as the reference.
Statistics
All statistical analyses were conducted in QIIME 2 (v2024.2), R (v4.5.1), and RStudio (v2025.09.02). The Shapiro–Wilk test was applied to check the normality of the data. Because the data were not normally distributed, two-sided Wilcoxon rank-sum tests were used to assess group differences, with P-values adjusted by the Benjamini–Hochberg (BH) procedure. Group-level compositional differences were evaluated by PERMANOVA (999 permutations, P < 0.05). Differential abundance analysis was performed with ANCOMBC2, using Holm-adjusted P-values (p_adj_method = “holm”) to control false positives. Significance levels were denoted as follows: *P < 0.05, **P < 0.01, and ***P < 0.001. All plots were generated using ggplot2.
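The BH procedure differs from Holm’s correction in that it targets the false discovery rate: each P-value is scaled by m divided by its ascending rank, working downward from the largest value so that the adjusted values stay monotone. A minimal sketch follows, matching the standard definition (as in R’s p.adjust with method “BH”); the function name is hypothetical and the input values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjustment; returns adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this P-value in ascending order
        candidate = min(1.0, pvalues[i] * m / rank)
        running_min = min(running_min, candidate)  # keep the sequence monotone
        adjusted[i] = running_min
    return adjusted


# The same illustrative P-values give smaller adjusted values than Holm's method.
print(bh_adjust([0.01, 0.04, 0.03]))
```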
Results
Quality control assessment
A rigorous quality assessment of the sequences was conducted before the execution of DADA2 and subsequent analyses. Forward reads exhibited high-quality bases at the start of the sequencing cycles, with a notable decline in quality toward the end. In contrast, reverse reads displayed comparatively lower base quality throughout. Illumina MiSeq platforms are known to generate relatively lower-quality reverse reads, which often necessitates careful optimization of trimming parameters. To systematically evaluate this issue, we assessed the effect of truncation by applying a series of truncation lengths to both forward and reverse reads: 0 (trunc0), 170 (trunc170), 175 (trunc175), 180 (trunc180), 185 (trunc185), 190 (trunc190), 195 (trunc195), and 200 (trunc200) bases.
Good reads count
The DADA2 analysis indicated that a truncation length of 170 produced the highest counts of filtered, denoised, merged, and non-chimeric reads (Fig. 1). The longer truncation lengths of 175, 180, 185, 190, 195, and 200 produced progressively lower read counts, and trunc0 yielded the lowest. The highest percentage of filtered reads was recorded at trunc170, with 42.3%, followed closely by trunc175 at 41.4%. In comparison, trunc0 had the lowest percentage at just 9.2%. Similarly, trunc170 and trunc175 also had the highest percentages of denoised reads (42% and 41.2%, respectively), merged reads (39.7% and 39.1%), and non-chimeric reads (35.3% and 34.8%). In contrast, trunc0 showed the lowest percentages for denoised, merged, and non-chimeric reads. Raw data for the DADA2 denoising statistics for each truncation parameter can be found in Supplementary Tables S3–S10 (see online supplementary material).
Figure 1.
Percentage of input reads retained at each step of the DADA2 pipeline across truncation lengths. Read retention declined with increasing truncation length, with the best balance of recovery and quality observed at truncation lengths of 170–180 bp.
ASV inference, taxonomic classification, removal of singletons and doubletons, and filtering of low-frequency and low-abundance features and frequencies
The frequencies and features were quantified at every stage: ASV inference, taxonomic classification, removal of singletons and doubletons, and filtering of low-frequency and low-abundance features. The trunc170 configuration yielded the highest frequency count, followed closely by trunc175. As anticipated, trunc0 recorded the lowest frequency (Table 1). The maximum number of features was observed at trunc200, with 280 features (Table 2). In contrast, trunc170 and trunc175 had the fewest features, with 265 each.
Table 1.
Feature frequencies after each filtering step: DADA2 denoising, ASV inference, taxonomic classification, singleton and doubleton filtering, low-frequency filtering, and low-abundance filtering.
| SL no. | Parameter | DADA2 denoising | ASV inference | Taxonomic classification | Singleton and doubleton filtering | Low frequency filtering | Low abundant filtering |
|---|---|---|---|---|---|---|---|
| 1 | Trunc0 | 1 006 538 | 1 006 538 | 437 621 | 437 595 | 435 341 | 435 341 |
| 2 | Trunc200 | 2 999 655 | 2 999 655 | 1 405 622 | 1 405 604 | 1 396 753 | 1 396 753 |
| 3 | Trunc195 | 3 164 294 | 3 164 294 | 1 476 568 | 1 476 552 | 1 466 696 | 1 466 696 |
| 4 | Trunc190 | 3 420 320 | 3 420 320 | 1 595 844 | 1 595 812 | 1 584 923 | 1 584 923 |
| 5 | Trunc185 | 3 620 415 | 3 620 415 | 1 670 922 | 1 670 892 | 1 659 936 | 1 659 936 |
| 6 | Trunc180 | 3 839 352 | 3 839 352 | 1 738 470 | 1 738 444 | 1 726 524 | 1 726 524 |
| 7 | Trunc175 | 4 204 977 | 4 204 977 | 1 874 615 | 1 874 587 | 1 861 446 | 1 861 446 |
| 8 | Trunc170 | 4 272 049 | 4 272 049 | 1 903 939 | 1 903 910 | 1 890 988 | 1 890 988 |
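To put the counts in Table 1 on a common scale, the share of the denoised frequency that survives all filtering steps can be computed per truncation setting. The short sketch below uses the first and last numeric columns of Table 1 as input; the variable and function names are hypothetical.

```python
# Frequencies copied from Table 1: (after DADA2 denoising, after all filtering steps).
TABLE1 = {
    "trunc0": (1_006_538, 435_341),
    "trunc200": (2_999_655, 1_396_753),
    "trunc195": (3_164_294, 1_466_696),
    "trunc190": (3_420_320, 1_584_923),
    "trunc185": (3_620_415, 1_659_936),
    "trunc180": (3_839_352, 1_726_524),
    "trunc175": (4_204_977, 1_861_446),
    "trunc170": (4_272_049, 1_890_988),
}


def percent_retained(denoised: int, final: int) -> float:
    """Percentage of the denoised frequency left after the last filtering step."""
    return 100.0 * final / denoised


summary = {name: percent_retained(*counts) for name, counts in TABLE1.items()}
for name, pct in summary.items():
    print(f"{name}: {pct:.1f}% of denoised frequency retained")
```

Reporting both the absolute counts and these percentages helps separate the effect of truncation on read recovery from the effect of the downstream filters.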
Table 2.
Number of features after different filtering steps: DADA2 denoising, ASV inference, taxonomic classification, singleton and doubleton filtering, low-frequency filtering, and low abundant filtering.
| SL no. | Parameter | DADA2 denoising | ASV inference | Taxonomic classification | Singleton and doubleton filtering | Low frequency filtering | Low abundant filtering |
|---|---|---|---|---|---|---|---|
| 1 | Trunc0 | 894 | 718 | 419 | 406 | 267 | 267 |
| 2 | Trunc200 | 1067 | 840 | 485 | 476 | 280 | 280 |
| 3 | Trunc195 | 1064 | 834 | 478 | 470 | 275 | 275 |
| 4 | Trunc190 | 1084 | 855 | 495 | 479 | 272 | 272 |
| 5 | Trunc185 | 1085 | 857 | 497 | 482 | 275 | 275 |
| 6 | Trunc180 | 1086 | 856 | 496 | 483 | 269 | 269 |
| 7 | Trunc175 | 1090 | 865 | 497 | 483 | 265 | 265 |
| 8 | Trunc170 | 1076 | 855 | 490 | 475 | 265 | 265 |
Core microbiome
The core microbiome represents the stable community of microorganisms that inhabit a specific environment, playing a crucial role in maintaining the health and functionality of that ecosystem. Here, we established the core microbiome at a prevalence threshold of 0.60 with a relative abundance of ≥0.1%. At trunc0 (Fig. 2), several taxa, such as Treponema I, Muribaculum, Lactobacillus, Desulfovibrio, and Fimenecus, show very high abundance values that are likely inflated due to untrimmed or low-quality reads. These spikes in abundance suggest the presence of sequencing noise and a poor representation of the true microbial composition. At moderate truncation lengths (trunc170 to trunc185), the community composition becomes more stable and biologically consistent, with moderate abundances of key genera such as Lactobacillus, Muribaculum I, Desulfovibrio, and Fimenecus. The relative abundances of dominant taxa remain steady across these truncation lengths, indicating minimal sequence loss and high reproducibility. Notably, taxa such as Staphylococcus and Aeromonas maintain consistent levels (∼16%–17%), while other core taxa show stable proportions, suggesting accurate retention of biological signals. At the longer truncation lengths (trunc190 to trunc200), slight declines or fluctuations are observed in several core taxa, including the Muribaculaceae family, which denotes the onset of sequence loss due to over-trimming. Although Staphylococcus and Aeromonas remain stable, there is a slight reduction in richness and in some minor taxa, such as UBA3263 and UBA7173. This over-trimming could exclude informative sequence regions and bias relative abundance toward more abundant or shorter amplicons.
Figure 2.
Core gut microbiota composition across DADA2 truncation settings. Heatmap depicts the relative abundance (%) of dominant bacterial taxa retained after quality filtering under varying truncation lengths.
Alpha rarefaction
In the present study, multiple truncation lengths (0–200 bp) were evaluated to determine the optimal trimming threshold for quality filtering and denoising during amplicon sequence processing. As shown in Fig. 3, the rarefaction curves illustrate the relationship between sequencing depth and observed feature richness across different truncation lengths. At trunc0 (Fig. 3A), reads were retained at full length; however, the inclusion of low-quality bases at the 3′ ends likely introduced sequencing noise and spurious features, reducing reliability. Truncation at 170 and 175 bp (Fig. 3B and C) improved overall quality and curve smoothness, with most samples approaching saturation, although slight variability remained. Truncation at 180 bp (Fig. 3D) provided the most balanced outcome, showing consistent plateauing of rarefaction curves across samples while maintaining high feature richness and minimal read loss. Further increases in truncation length (≥185 bp) (Fig. 3E–H) led to a gradual reduction in observed richness and premature curve truncation, indicating loss of informative reads due to overly stringent trimming.
Figure 3.
Rarefaction curves showing observed feature richness across truncation settings. Each panel, (A) trunc0, (B) trunc170, (C) trunc175, (D) trunc180, (E) trunc185, (F) trunc190, (G) trunc195, (H) trunc200, illustrates sequencing depth versus the number of observed features for two experimental groups (Group A, blue; Group B, red). Curves plateau at higher sequencing depths, indicating adequate sampling coverage, with consistent richness patterns across truncations.
The richness between the groups was analyzed across different truncation lengths. Richness differed significantly between Group A and Group B across all truncation parameters (Supplementary Fig. 1; see online supplementary material for a colour version of this figure). Specifically, truncation at 175 bp (Supplementary Fig. 1C) showed the highest Wilcoxon test statistic (W = 29 655, P < 2.2 × 10⁻¹⁶), indicating a clear separation and a wide range of richness between the groups. We also observed a similar pattern at a truncation length of 170 bp (W = 29 512, P < 2.2 × 10⁻¹⁶) (Supplementary Fig. 1B), as well as at 180 bp (W = 27 637, P < 2.2 × 10⁻¹⁶) and 185 bp (W = 28 186, P < 2.2 × 10⁻¹⁶) (Supplementary Fig. 1D and E), though with lower richness than at the 175 bp truncation.
Therefore, a truncation length of 175 bp was identified as the optimal threshold, offering the best compromise between sequence quality, read retention, and coverage depth for downstream analyses.
Diversity analysis
Alpha diversity analysis is a pivotal component in microbiome research, offering insights into the richness and evenness of microbial communities within individual samples. High diversity is often associated with ecosystem stability, resilience, and functionality, particularly in gut microbiota studies, whereas low diversity has been linked to dysbiosis and various diseases, including inflammatory bowel disease (IBD), obesity, and diabetes. Since the truncation length is a technical preprocessing parameter, variation in alpha-diversity metrics across truncation settings reflects differences in read-quality filtering, denoising efficiency, and feature recovery.
To evaluate the impact of sequence truncation length on microbial community resolution and data quality, alpha diversity indices (Observed richness, Simpson, Shannon, and Pielou’s evenness) were compared across truncation lengths ranging from 0 bp (untrimmed) to 200 bp. At 0 bp (Fig. 4A), all diversity indices were markedly low, suggesting the inclusion of untrimmed or poor-quality reads. Such reads likely contained sequencing noise or low-quality bases, reducing the ability to detect actual biological variation. Truncation at 170 bp (Fig. 4B) resulted in a substantial improvement across all metrics, with high richness, good evenness, and balanced diversity, indicating an optimal balance between read length and quality. The truncation at 175 bp (Fig. 4C) yielded the highest and most consistent diversity values, including the highest richness and stable Shannon and Simpson indices, alongside good evenness. This truncation point provided the best overall diversity estimates, indicating that it retained informative reads while removing low-quality tails.
Figure 4.
Distribution of alpha diversity indices at different truncation lengths during DADA2 analysis: (A) trunc0, (B) trunc170, (C) trunc175, (D) trunc180, (E) trunc185, (F) trunc190, (G) trunc195, (H) trunc200.
At trunc180 (Fig. 4D) and trunc185 (Fig. 4E), diversity metrics remained high, but small declines in observed richness indicated the beginning of minor sequence loss, though community evenness remained stable. A further increase to 190 bp (Fig. 4F and Supplementary Fig. 1F) showed a progressive decline in richness, while evenness and Simpson’s index remained relatively stable, suggesting loss of lower-abundance taxa due to excessive trimming. At trunc195 (Fig. 4G and Supplementary Fig. 1G) and trunc200 (Fig. 4H and Supplementary Fig. 1H), a sharp drop in richness and Shannon diversity was observed, reflecting over-truncation and reduced diversity profiles. These lengths likely excluded informative reads, leading to an underestimation of feature recovery.
The beta diversity analysis across different truncation lengths revealed distinct clustering strengths among samples, as reflected by the principal coordinate analysis (PCoA) metrics and PERMANOVA results. The trunc0 dataset exhibited the weakest group separation (R² = 0.321), indicating that untrimmed or noisy reads likely compromised sample differentiation. In contrast, truncations at 170–180 bp (Fig. 5B–D) demonstrated the strongest and most stable group separations, with R² values of around 0.452–0.453 and a highly significant P = 0.001. Notably, trunc175 (Fig. 5C) and trunc180 (Fig. 5D) exhibited nearly identical ordination structures (PC1 ≈ 47%, PC2 ≈ 7.6%–7.8%), suggesting that optimal read trimming was achieved, which maximized both diversity resolution and data quality. Slightly longer truncations at 185 and 190 bp (Fig. 5E and F) maintained good stability (R² ≈ 0.447–0.448) but exhibited marginal reductions in community separation, implying the onset of minor sequence loss. Furthermore, trunc195 and trunc200 (Fig. 5G and H) resulted in a noticeable decline in group differentiation (R² ≈ 0.438–0.442), likely due to over-trimming and reduced feature recovery. Collectively, these results indicate that truncation at 175–180 bp provides the most robust and stable beta diversity structure, effectively balancing sequencing quality and biological resolution.
Figure 5.
Bray–Curtis PCoA plots of beta diversity: (A) trunc0, (B) trunc170, (C) trunc175, (D) trunc180, (E) trunc185, (F) trunc190, (G) trunc195, (H) trunc200.
Additionally, the Mantel concordance test demonstrated consistently high correlations (r ≥ 0.95; P < 0.001) among all truncation settings, indicating robust compositional stability (Supplementary Fig. 2; see online supplementary material for a colour version of this figure). Notably, truncation lengths between 170 and 200 bp exhibited near-perfect concordance (r ≈ 1.0), while the untruncated dataset (trunc0) showed slightly lower similarity (r ≈ 0.95). These findings suggest that truncation at 170 bp or longer minimizes the effects of low-quality tails without distorting community composition. Based on this concordance pattern, a truncation length of 175 bp can be selected for subsequent analyses as the optimal balance between sequence quality and data retention.
Differential abundance analysis
The differential abundance analysis across truncation lengths revealed clear trends in statistical strength, clarity, and biological interpretability of the results. The untrimmed dataset (Fig. 6A) exhibited moderate overall significance, but the presence of weak and noisy signals suggested instability often associated with early truncation or unfiltered low-quality reads. As truncation increased to 170 bp (Fig. 6B), the clarity of differential taxa improved, showing stronger and more defined group distinctions. The truncation of 175 bp (Fig. 6C) and 180 bp (Fig. 6D) demonstrated the most robust and balanced outcomes, with consistently strong significance (up to ∼25) and clearly interpretable taxa patterns, indicating optimal trimming that preserved both data quality and biological signal. Similarly, the truncation of 185 bp (Fig. 6E) produced highly defined clusters with minimal noise, consistent with Fig. 6C and D. Beyond this point, however, the significance strength began to decline. Truncation at 190 bp (Fig. 6F) and 195 bp (Fig. 6G) retained stable yet slightly weaker differentiation, suggesting minor data loss from over-trimming. The trunc200 bp (Fig. 6H) showed the lowest performance, with noticeably fewer significant taxa and diminished interpretability, indicating over-trimming and the loss of biologically relevant information. Overall, truncations between 175 and 185 bp provided the most reliable and biologically meaningful results, balancing significance strength, taxa clarity, and stability across analyses.
Figure 6.
ANCOMBC2 volcano plots of differential abundance between Group A and Group B: (A) trunc0, (B) trunc170, (C) trunc175, (D) trunc180, (E) trunc185, (F) trunc190, (G) trunc195, (H) trunc200.
Mock community validation
To evaluate the impact of read truncation length on microbial community reconstruction, we compared observed microbial compositions at different truncation thresholds (trunc170–trunc200) with the mock community, ZymoBIOMICS Microbial Community Standard, using both Bray–Curtis similarity and Pearson correlation analyses (Fig. 7A). As truncation length increased, both metrics showed a gradual improvement in concordance with the theoretical profile. Specifically, Bray–Curtis similarity increased from approximately 0.78 at trunc170 to 0.81 at trunc200, while Pearson correlation (r) rose from ∼0.87 to ∼0.90 across the same range. The trends plateaued beyond trunc190, suggesting a stabilization of compositional accuracy. Overall, trunc200 exhibited the highest similarity and correlation, indicating the most unbiased and faithful reconstruction of the theoretical microbial composition.
Figure 7.
Assessment of the effects of sequencing read truncation on microbiome data quality and similarity across truncation lengths in the mock community data set. (A) Bray–Curtis similarity and Pearson’s correlation across the truncation parameters. (B) Percentage of reads retained at successive DADA2 processing steps for each truncation parameter.
In contrast, read retention analysis (Fig. 7B) demonstrated that truncation at 170–185 bp retained more than 10% of the input reads, whereas truncation lengths above 185 bp resulted in less than 10% of reads successfully passing the denoising and merging steps, indicating a loss of quality reads during the filtering step. Even though untrimmed reads exhibited the highest percentage of retained non-chimeric reads, the pronounced losses during the filtering stage are consistent with the accumulation of low-quality bases at read termini that compromise downstream processing. Collectively, these findings highlight a critical balance between read length and data quality in amplicon sequencing. While longer reads (≥190 bp) yield marginally higher compositional fidelity, moderate truncation lengths (around 175–185 bp) offer an optimal trade-off between accuracy and read retention. Therefore, a truncation length within this range is recommended for minimizing quality-driven read loss while maintaining robust concordance with mock community microbial profiles.
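The retention percentages discussed above are simple ratios of per-step read counts to the input read count. The Python sketch below shows the calculation; the step names follow the filtered, denoised, merged, and non-chimeric stages referred to in this article, while the function name and read counts are hypothetical and are not data from this study.

```python
def retention_percentages(counts):
    """Percentage of input reads remaining after each processing step.

    `counts` maps a step name to its read count and must include "input".
    """
    total = counts["input"]
    if total <= 0:
        raise ValueError("input read count must be positive")
    return {
        step: 100.0 * n / total
        for step, n in counts.items()
        if step != "input"
    }


# Hypothetical read counts for one sample (illustration only).
example_counts = {
    "input": 100_000,
    "filtered": 40_000,
    "denoised": 38_000,
    "merged": 15_000,
    "non-chimeric": 12_000,
}

for step, pct in retention_percentages(example_counts).items():
    print(f"{step}: {pct:.1f}% of input reads retained")
```

Comparing the percentage at the filtered step with the percentage at the merged step helps separate losses caused by low-quality bases from losses caused by insufficient overlap between forward and reverse reads.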
Discussion
The analysis of different truncation lengths in sequencing data is integral to optimizing the accuracy and completeness of microbiome studies. In our research, we systematically evaluated multiple truncation lengths and concluded that a truncation length of around 175–185 bp strikes an ideal balance between maintaining high read recovery rates and preserving microbial diversity. A similar study by Lee et al. [28] examined trimming conditions for DADA2 analysis in QIIME 2. Using the V3–V4 region of human oral microbiota samples, they optimized the minimum overlap length at 16 bp with a Phred quality score of 20. In our study, we kept the Phred quality score at 18 [13] and left the overlap length at its default so that maximum overlap could be achieved without compromising the richness of the microbial community. This finding aligns with and expands upon existing literature, demonstrating that careful parameter selection is essential for interpreting microbial community structures effectively.
The observation that a truncation length of 170 bp yielded the highest counts of filtered, denoised, merged, and non-chimeric reads aligns with existing literature advocating for the careful consideration of read quality when processing sequencing data (Table 1). Notably, past studies have shown that retaining high-quality sequences is essential for accurate taxonomic classification and reliable ecological inference [3]. Our results support the hypothesis that moderate truncation thresholds can significantly improve sequencing quality by effectively filtering out low-quality reads, a strategy that has been successfully employed in various microbiome studies [1].
Previous studies have highlighted the impact of truncation lengths on data quality and community representation. Callahan et al. [2] emphasized that excessive truncation can lead to significant reductions in the number of usable sequences, thereby diminishing the presence of less abundant taxa. Their analysis found that a truncation length of 220 bp optimized read quality while still capturing the richness of the microbial community. Furthermore, the quantification of core microbiome composition elucidated clear distinctions across different truncation thresholds. At truncation lengths of 0 and within the range of 170–185 bp, we observed significant differences in abundance and representation of various microbial taxa. The inflated abundance values of taxa such as Treponema I and Fimenecus at truncation length 0 underscore the dangers of untrimmed reads (Fig. 2). These spikes can distort the true microbial community dynamics and suggest the presence of sequencing noise, which has been documented as a significant challenge in microbial ecology [29]. Conversely, the stability of key genera like Lactobacillus and Muribaculum I over moderate truncation lengths reflects a more biologically accurate representation of the microbial community, reinforcing the need for optimal trimming to capture the proper structure and function of microbial ecosystems. This observation is consistent with previous research that has advocated for stringent quality control measures to ensure the reliability of microbial community assessments [20]. Interestingly, our analysis also uncovered potential adverse effects associated with over-trimming at lengths exceeding 185 bp. The observed decline in richness and fluctuations in core taxa abundance, particularly for the Muribaculaceae family and other less abundant taxa, suggest a loss of informative sequences under these conditions, potentially skewing the representation of locally dominant species vital for ecosystem functioning [30]. 
This critical observation emphasizes the delicate balance required in sequence processing, where overly stringent quality filtering may inadvertently exclude biologically informative data and bias relative abundance estimates toward more prevalent or shorter amplicons. The alpha rarefaction analysis reinforced our findings, illustrating a clear relationship between truncation length and observed feature richness. The rarefaction curves demonstrated that while higher read retention was achieved at truncation length 0, the inclusion of low-quality bases, particularly at the 3' ends, likely introduced noise and spurious features, reducing the reliability of community assessment. Notably, moderate truncation lengths at 170 and 175 bp yielded significant improvements in overall quality and smoothness of rarefaction curves. These curves approached saturation, indicating that most samples adequately represented the underlying biological diversity, albeit with slight variability. The optimal performance observed at a truncation length of 180 bp further suggests that this threshold provides a balanced outcome, maintaining high feature richness while minimizing read loss across samples.
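The idea behind an alpha rarefaction curve can be sketched as repeated random subsampling of a sample's reads at increasing depths, counting the distinct features observed each time. The Python example below is a minimal illustration under stated assumptions, not the QIIME 2 implementation; the function name, feature IDs, and counts are hypothetical.

```python
import random


def rarefaction_curve(feature_counts, depths, iterations=10, seed=0):
    """Mean number of observed features at each subsampling depth.

    `feature_counts` maps a feature ID to its read count in one sample.
    Depths larger than the sample's total read count are skipped.
    """
    rng = random.Random(seed)
    # Expand the count table into one entry per read so that reads can be
    # drawn without replacement.
    reads = [fid for fid, n in feature_counts.items() for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(reads):
            continue
        observed = [
            len(set(rng.sample(reads, depth))) for _ in range(iterations)
        ]
        curve[depth] = sum(observed) / iterations
    return curve


# Hypothetical sample with five features and 1000 reads (illustration only).
sample = {"ASV_a": 500, "ASV_b": 300, "ASV_c": 150, "ASV_d": 45, "ASV_e": 5}

for depth, richness in rarefaction_curve(sample, [10, 100, 500, 1000]).items():
    print(f"depth={depth}\tmean observed features={richness:.1f}")
```

A curve that flattens as depth increases indicates that additional reads reveal few new features, which is the saturation behavior described above.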
Our analysis showed significant differences in community composition associated with varying truncation lengths, suggesting that the choice of truncation can influence the perception of beta diversity. At truncation lengths of 170 and 175 bp, community composition appeared more consistent and biologically meaningful, highlighting a robust differentiation between samples (Fig. 5). This stability in community structure is critical for accurately assessing ecological relationships and can inform our understanding of how microbial communities respond to environmental gradients [12]. In contrast, excessive truncation at lengths of 190–200 bp led to community impoverishment and reduced dissimilarity scores, indicative of the loss of less abundant but ecologically relevant microorganisms. These findings align with the literature, which reports that effective beta diversity assessment is heavily contingent upon the quality and completeness of input data [12]. Differential abundance analysis plays a pivotal role in identifying taxa that contribute significantly to differences in community composition across conditions or treatments. Our assessment revealed clear trends in differential abundances tied to the choice of truncation lengths. Truncation at 175–185 bp provided the best balance of statistical significance and biological clarity, leading to reliable results (Fig. 6C–E). In contrast, lengths beyond 185 bp resulted in decreased interpretability and potential data loss, emphasizing the need for careful truncation choices in ecological research.
By systematically comparing the microbial compositions derived from different truncation thresholds (trunc170–200) against a mock community, we observed a clear trend of increasing concordance with theoretical profiles as truncation length increased. Notably, metrics such as Bray–Curtis similarity and Pearson correlation demonstrated improvements, suggesting that longer reads provide a richer representation of the underlying microbial diversity. While longer reads (≥190 bp) offer marginal improvements in compositional fidelity, moderate truncation lengths around 175–185 bp strike an optimal balance between accuracy and read retention.
Among all evaluated metrics, DADA2 read retention, alpha rarefaction, and alpha diversity indices were prioritized for truncation optimization, as they directly reflect data quality, sequencing depth sufficiency, and biological interpretability. Based on these primary criteria, truncation at 175–185 bp provided the optimal balance between read retention, microbial richness, and community stability. Secondary analyses, including beta diversity, differential abundance, and mock community validation, further corroborated the robustness of this selection.
This study is limited to V4 metagenomic data from gut microbiota, which may limit the generalizability of the results to other variable regions and different microbiota ecologies. Additionally, we used a symmetric truncation length for both forward and reverse reads to ensure a practical, transparent, and reproducible framework that a broad range of users can readily adopt. Our findings and recommendations do not depend on enforcing equal truncation lengths; rather, they illustrate a general optimization strategy. Our workflow is flexible and can be extended by users to asymmetric truncation, but the core conclusions remain independent of this choice.
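A truncation sweep of this kind, whether symmetric or asymmetric, can be scripted by generating one `qiime dada2 denoise-paired` command per truncation setting. The Python sketch below only builds and prints the commands; it is an illustration under stated assumptions rather than the exact commands used in this study. The input and output file names are hypothetical, and all parameters other than the truncation lengths are left at the plugin defaults.

```python
import shlex


def denoise_paired_command(trunc_len_f, trunc_len_r, demux="demux.qza"):
    """Build the argument list for one QIIME 2 DADA2 paired-end run.

    File names are hypothetical placeholders. In QIIME 2, a truncation
    length of 0 means that no truncation is performed.
    """
    prefix = f"trunc{trunc_len_f}-{trunc_len_r}"
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", demux,
        "--p-trunc-len-f", str(trunc_len_f),
        "--p-trunc-len-r", str(trunc_len_r),
        "--o-table", f"{prefix}-table.qza",
        "--o-representative-sequences", f"{prefix}-rep-seqs.qza",
        "--o-denoising-stats", f"{prefix}-stats.qza",
    ]


# Symmetric sweep over the truncation lengths evaluated in this article.
symmetric_lengths = [0, 170, 175, 180, 185, 190, 195, 200]
commands = [denoise_paired_command(n, n) for n in symmetric_lengths]

# An asymmetric setting uses different forward and reverse lengths
# (the values here are arbitrary examples).
commands.append(denoise_paired_command(185, 175))

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be run in a shell with QIIME 2 installed, or passed as a list to `subprocess.run`. Writing each run to its own set of output files keeps the denoising statistics for every truncation setting available for later comparison.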
In summary, our study underscores the critical significance of rigorous quality control and a judicious selection of truncation lengths in metagenomic analyses. The methodology presented herein not only aligns with established practices in the field but also contributes valuable insights that may guide future investigations aiming to accurately explore and characterize microbial communities. As microbiome research continues to expand into diverse ecological contexts, it is imperative that researchers prioritize high-quality sequencing data and appropriate processing protocols to ensure meaningful biological interpretations. Future studies should seek to explore the implications of these findings in varying ecosystems and their responses to environmental changes, thereby advancing our collective understanding of microbial dynamics and their roles in ecosystem health and stability.
Conclusion
Based on our findings and corroborated by established literature, truncation at 175–185 bp emerges as the best truncation strategy for microbial sequencing studies. This length not only optimizes read recovery but also preserves the richness and evenness of microbial communities, facilitating more nuanced interpretations of ecological dynamics. Future investigations should incorporate these insights and benchmark against established datasets to refine truncation protocols. By ensuring that methodological choices in high-throughput sequencing are informed by empirical evidence, researchers can enhance the reliability and applicability of microbiome research outcomes.
Acknowledgments
M.G.S. thanks the Indian Council of Medical Research (ICMR), New Delhi, India, for providing the Junior Research Fellowship No.: 3/1/3/JRF-2019/HRD-070(31035). The Council of Scientific and Industrial Research (CSIR), New Delhi, and the CSIR-NEIST, Jorhat, Assam, are gratefully acknowledged for providing the necessary facilities, thereby facilitating the smooth progression of the study. The authors also thank the Publication & Intellectual Property Rights Committee, CSIR-NEIST, Jorhat for approving the manuscript for publication (Manuscript No: CSIR-NEIST/PUB/2025/026, dated 11-02-2025).
Contributor Information
Moirangthem Goutam Singh, Center for Infectious Diseases, Biological Sciences and Technology Division, CSIR-North East Institute of Science and Technology, Jorhat 785006, Assam, India; Biological Sciences Division, Faculty of Science, Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, Uttar Pradesh, India.
Romi Wahengbam, Center for Infectious Diseases, Biological Sciences and Technology Division, CSIR-North East Institute of Science and Technology, Jorhat 785006, Assam, India; Biological Sciences Division, Faculty of Science, Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, Uttar Pradesh, India.
Author contributions
Romi Wahengbam developed the detailed protocols and oversaw all stages of the study, including critical inputs and the manuscript drafting process. Moirangthem Goutam Singh conceptualized the study, performed data analysis, generated the empirical data, and wrote the manuscript. All authors reviewed and edited the final draft of the manuscript and collectively approved the decision to submit the manuscript for publication consideration.
Moirangthem Singh (Conceptualization [equal], Data curation [equal], Formal analysis [lead], Investigation [equal], Methodology [equal], Resources [supporting], Software [lead], Visualization [lead], Writing – original draft [lead], Writing – review & editing [equal]) and Romi Wahengbam (Conceptualization [lead], Data curation [lead], Formal analysis [equal], Funding acquisition [lead], Investigation [lead], Methodology [lead], Project administration [lead], Resources [lead], Software [equal], Supervision [lead], Validation [equal], Visualization [equal], Writing – original draft [equal], Writing – review & editing [equal]).
Supplementary material
Supplementary material is available at Biology Methods and Protocols online.
Conflicts of interest
The authors declare that they have no competing interests.
Funding
This work was supported by the Council of Scientific & Industrial Research (CSIR) through the in-house grant number OLP2083.
Data availability
Raw data are available in NCBI SRA under Bioproject No. PRJNA1143515.
References
- 1. Bokulich NA, Subramanian S, Faith JJ et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods 2013;10:57–9. 10.1038/nmeth.2276
- 2. Callahan BJ, McMurdie PJ, Rosen MJ et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3. 10.1038/nmeth.3869
- 3. Kozich JJ, Westcott SL, Baxter NT et al. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol 2013;79:5112–20. 10.1128/AEM.01043-13
- 4. Caporaso JG, Kuczynski J, Stombaugh J et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7:335–6. 10.1038/nmeth.f.303
- 5. Schirmer M, Ijaz UZ, D’Amore R et al. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res 2015;43:e37. 10.1093/nar/gku1341
- 6. Amir A, McDonald D, Navas-Molina JA et al. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2017;2:e00191-16. 10.1128/msystems.00191-16
- 7. Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 2013;10:996–8. 10.1038/nmeth.2604
- 8. Haas BJ, Gevers D, Earl AM et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 2011;21:494–504. 10.1101/gr.112730.110
- 9. Lundberg DS, Yourstone S, Mieczkowski P et al. Practical innovations for high-throughput amplicon sequencing. Nat Methods 2013;10:999–1002. 10.1038/nmeth.2634
- 10. Schloss PD, Westcott SL, Ryabin T et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75:7537–41. 10.1128/AEM.01541-09
- 11. Hu T, Chitnis N, Monos D et al. Next-generation sequencing technologies: An overview. Hum Immunol 2021;82:801–11. 10.1016/j.humimm.2021.02.012
- 12. Bolyen E, Rideout JR, Dillon MR et al. Author Correction: Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 2019;37:1091. 10.1038/s41587-019-0252-6
- 13. Mohsen A, Park J, Chen Y-A et al. Impact of quality trimming on the efficiency of reads joining and diversity analysis of Illumina paired-end reads in the context of QIIME1 and QIIME2 microbiome analysis frameworks. BMC Bioinformatics 2019;20:581. 10.1186/s12859-019-3187-5
- 14. Wickham H. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Cham: Springer International Publishing, 2016. 10.1007/978-3-319-24277-4
- 15. Wickham H. Reshaping data with the reshape Package. J Stat Soft 2007;21:1–20. 10.18637/jss.v021.i12
- 16. Posit team. RStudio: Integrated Development Environment for R, 2025. Posit Software, PBC, Boston, MA. http://www.posit.co/
- 17. R Core Team. R: A Language and Environment for Statistical Computing, 2023. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- 18. Wickham H, Vaughan D, Girlich M. tidyr: Tidy Messy Data. R package version 1.3.2, 2026. https://tidyr.tidyverse.org
- 19. Wickham H, François R, Henry L et al. dplyr: A Grammar of Data Manipulation. R package version 1.2.0, 2026. https://dplyr.tidyverse.org
- 20. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 2013;8:e61217. 10.1371/journal.pone.0061217
- 21. Oksanen J, Simpson GL, Blanchet FG et al. vegan: Community Ecology Package. R package version 2.8-0, 2025. https://vegandevs.github.io/vegan/
- 22. Lahti L, Shetty S. microbiome R package. Bioconductor, 2017. 10.18129/b9.bioc.microbiome
- 23. Kassambara A. ggpubr: “ggplot2” Based Publication Ready Plots. R package version 0.6.1, 2025. https://rpkgs.datanovia.com/ggpubr
- 24. Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 2004;20:289–90. 10.1093/bioinformatics/btg412
- 25. Lin H, Peddada SD. Multigroup analysis of compositions of microbiomes with covariate adjustments and repeated measures. Nat Methods 2024;21:83–91. 10.1038/s41592-023-02092-7
- 26. Wickham H, Averick M, Bryan J et al. Welcome to the Tidyverse. J Open Source Softw 2019;4:1686. 10.21105/joss.01686
- 27. Pačes J, Malinská N, Tušková L et al. Microbiota modulate immune cell populations and drive dynamic structural changes in gut-associated lymphoid tissue. Gut Microbes 2025;17:2543908. 10.1080/19490976.2025.2543908
- 28. Lee S-Y, Yu Y, Chung J et al. Trimming conditions for DADA2 analysis in QIIME2 platform. Int J Oral Biol 2021;46:146–53. 10.11620/IJOB.2021.46.3.146
- 29. DeSantis TZ, Hugenholtz P, Larsen N et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006;72:5069–72. 10.1128/AEM.03006-05
- 30. Nikodemova M, Holzhausen E, Deblois C et al. The effect of low-abundance OTU filtering methods on the reliability and variability of microbial composition assessed by 16S rRNA amplicon sequencing. Front Cell Infect Microbiol 2023;13:1165295. 10.3389/fcimb.2023.1165295