ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference

Jacob L Steenwyk; Thomas J Buida, III; Yuanning Li; Xing-Xing Shen; Antonis Rokas

doi:10.1371/journal.pbio.3001007

2020 Dec 2;18(12):e3001007. doi: 10.1371/journal.pbio.3001007

Jacob L Steenwyk ¹,*, Thomas J Buida III ², Yuanning Li ¹, Xing-Xing Shen ³, Antonis Rokas ¹,*

Editor: Andreas Hejnol⁴

PMCID: PMC7735675 PMID: 33264284

Abstract

Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.

Highly divergent sites in multiple sequence alignments are thought to negatively impact phylogenetic inference; trimming methods aim to remove these sites, but recent analysis suggests that doing so can worsen inference. This study introduces ClipKIT, a trimming method that instead aims to retain parsimony-informative sites; phylogenetic inference using ClipKIT-trimmed alignments is accurate, robust and time-saving.

Introduction

Multiple sequence alignment (MSA) of a set of homologous sequences is an essential step of molecular phylogenetics, the science of inferring evolutionary relationships from molecular sequence data. Errors in phylogenetic analysis can be caused by erroneously inferring site homology or saturation of multiple substitutions [1], which often present as highly divergent sites in MSAs. To remove errors and phylogenetically uninformative sites, several methods “trim” or filter highly divergent sites using calculations of site/region dissimilarity from MSAs [1–4]. A beneficial by-product of MSA trimming, especially for studies that analyze hundreds of MSAs from thousands of taxa [5], is that trimming MSAs reduces the computational time and memory required for phylogenomic inference. Nowadays, MSA trimming is a routine part of molecular phylogenetic inference [6].

Despite the overwhelming popularity of MSA trimming strategies, a recent study revealed that trimming often decreases, rather than increases, the accuracy of phylogenetic inference [7]. This decrease suggests that current strategies may remove phylogenetically informative sites (e.g., parsimony-informative and variable sites) that have previously been shown to contribute to phylogenetic accuracy [8]. Furthermore, it was shown that phylogenetic inaccuracy is positively associated with the number of removed sites [7], revealing a speed–accuracy trade-off wherein trimmed MSAs decrease the computation time of phylogenetic inference but at the cost of reduced accuracy. More broadly, these findings highlight the need for alternative MSA trimming strategies.

To address this need, we developed ClipKIT, an MSA trimming algorithm based on a conceptually novel framework. Rather than aiming to identify and remove putatively phylogenetically uninformative sites in MSAs, ClipKIT instead focuses on identifying and retaining parsimony-informative sites, which (alongside other types of sites and features of MSAs, such as variable sites and alignment length) have previously been shown to be phylogenetically informative [8]. ClipKIT implements a total of 5 different trimming strategies. Certain ClipKIT trimming strategies allow users to also retain constant sites, which inform base frequencies in substitution models [9], and/or trim alignments based on the fraction of taxa represented by gaps per site (or site gappyness). We tested the accuracy and support of phylogenetic inferences using ClipKIT and other alignment trimming software using nearly 140,000 alignments from empirical datasets of mammalian and budding yeast sequences [8] and simulated datasets of metazoans, plants, filamentous fungi, and a larger sampling of budding yeasts sequences [10–13]. We found that ClipKIT-trimmed alignments led to accurate and well-supported phylogenetic inferences that consistently outperformed other alignment trimming software. Additionally, we note that ClipKIT-trimmed alignments can save computation time during phylogenetic inference. Taken together, our results demonstrate that alignment trimming based on identifying and retaining parsimony-informative sites is a robust alignment trimming strategy.

Results

To test the efficacy of ClipKIT, we examined the accuracy and support of single-gene and species-level phylogenetic trees inferred from untrimmed MSAs and MSAs trimmed using 14 different strategies (Table 1) across 4 empirical genome-scale datasets and 4 simulated datasets. The 4 empirical datasets correspond to the untrimmed amino acid and nucleotide MSAs from 24 mammals (N_alignments = 4,004) and 12 budding yeasts (N_alignments = 5,664) [8]. The 4 simulated datasets (N_alignments = 50 alignments per dataset or 200 total) stem from simulated nucleotide sequence evolution along the species phylogeny of 93 filamentous fungi [10] and from simulated amino acid sequence evolution along the species phylogenies of 70 metazoans [11], 46 flowering plants [12], and 96 budding yeasts [13]. MSAs were trimmed using popular alignment trimming software (Table 1), generating a total of 138,152 MSAs [(4,004 mammalian + 5,664 yeast + 200 simulated MSAs) * (14 trimming strategies, including a “no trimming” strategy) = 138,152 MSAs]. However, Gblocks and Block Mapping and Gathering with Entropy (BMGE) with an entropy threshold of 0.3 were not used for performance assessment of simulated datasets because they frequently removed entire MSAs.

Table 1. The 14 different MSA trimming strategies tested in this study.

Software	MSA trimming strategy	Approach	Parameter(s)	Reference
ClipKIT	ClipKIT: k	Keep parsimony-informative sites	kpi mode	This study
ClipKIT	ClipKIT: kg	Keep parsimony-informative sites and remove highly gappy sites	kpi-gappy mode; remove sites with ≥90% gaps	This study
ClipKIT	ClipKIT: kc	Keep parsimony-informative and constant sites	kpic mode	This study
ClipKIT	ClipKIT: kcg	Keep parsimony-informative and constant sites and remove highly gappy sites	kpic-gappy mode; remove sites with ≥90% gaps	This study
ClipKIT	ClipKIT: g	Remove highly gappy sites	gappy mode; remove sites with ≥90% gaps	This study
BMGE	BMGE 0.3	Remove sites with high entropy	Entropy threshold of 0.3	[3]
BMGE	BMGE	Remove sites with high entropy	Default entropy threshold of 0.5	[3]
BMGE	BMGE 0.7	Remove sites with high entropy	Entropy threshold of 0.7	[3]
Gblocks	Gblocks	Remove sites that are gap rich and highly variable	Default	[1]
Noisy	Noisy	Predict homoplastic sites and remove them	Default	[15]
trimAl	trimAl: s	Remove highly gappy and variable sites	strict mode	[2]
trimAl	trimAl: sp	Remove highly gappy and variable sites	strictplus mode	[2]
trimAl	trimAl: go	Remove highly gappy sites	gappyout mode	[2]
No trimming	No trim	N/A	N/A	N/A

Each MSA trimming strategy tested by our study, the software used, a general description of its trimming approach, its parameters, and a citation for the software used are described here.

BMGE, Block Mapping and Gathering with Entropy; MSA, multiple sequence alignment; N/A, not applicable.

We found that the 14 strategies examined occupied distinct regions of feature space suggestive of substantial differences between MSAs (Fig 1). Variation in feature space was largely driven by normalized Robinson–Foulds (nRF) and average bipartition support (ABS) measures along the first dimension and alignment length along the second dimension for both empirical and simulated datasets (S1 Fig). In empirical datasets, we found that some ClipKIT strategies removed few sites, while others removed many and, at times, the most sites (S2 Fig). Among simulated datasets, ClipKIT trimmed substantial portions of MSAs, but variation was observed across MSAs and datasets (S3 Fig). Examination of nRF and ABS values revealed that ClipKIT performed well, and at times the best, among the MSA trimming strategies tested, suggesting that phylogenetic inferences made with ClipKIT-trimmed MSAs were both accurate and well supported (S4 and S5 Figs). Finally, counter to previous evidence suggestive of a trade-off between trimming and phylogenetic accuracy [7], we found that ClipKIT aggressively trimmed MSAs in the empirical datasets without compromising phylogenetic tree accuracy and support (S2 and S4 Figs).

To obtain a summary of overall performance, we ranked the 14 strategies’ performance for each dataset using objective desirability–based integration of nRF and ABS values [14] (Fig 2). We found that the 5 ClipKIT strategies outperformed all others for amino acid sequences in the empirical mammalian dataset (Fig 2A) as well as in the simulated metazoan and flowering plant datasets (Fig 2E and 2F). Other strategies that performed well included trimAl with the “gappyout” parameter for empirical datasets and Noisy for simulated datasets [2,15]. To evaluate MSA trimming strategy performance for empirical and simulated datasets, we examined average ranks across each set of 4 datasets and found that ClipKIT strategies were among the best performing (Fig 2I and 2J). In empirical datasets, ClipKIT’s gappy strategy outperformed all others followed by no trimming, trimAl with the “gappyout” parameter, and then 4 other ClipKIT strategies (Fig 2I). In simulated datasets, all strategies generally performed better than in empirical datasets; the no trimming strategy ranked best followed by all 5 ClipKIT strategies (Fig 2J). These results suggest that ClipKIT, which focuses on retaining parsimony-informative sites, was on par with no trimming and frequently outperformed strategies that focus on removing highly divergent sites.

Fig 2. Desirability-based integration of accuracy and support metrics per MSA facilitated the comparison of relative performance of the 14 different MSA trimming strategies for empirical (A–D) and simulated (E–H) datasets. Examination of performance for individual datasets and average performance across empirical (I) and simulated (J) datasets revealed that ClipKIT is a top-performing software. MSA trimming strategies are ordered along the x-axis from the highest-performing strategy to the lowest-performing one according to average desirability-based rank. Boxplots embedded in violin plots have lower, middle, and upper hinges that represent the first, second, and third quartiles. Whiskers extend to 1.5 times the interquartile range. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). AA, amino acid; BMGE, Block Mapping and Gathering with Entropy; MSA, multiple sequence alignment; NT, nucleotide.

To examine the accuracy of branch lengths across MSA trimming strategies, we conducted a correlation analysis between individual branch lengths in gene trees inferred from trimmed alignments (treatment) and those inferred from untrimmed alignments (control). Because this analysis requires that untrimmed alignments are highly accurate, we conducted it only for individual simulated gene alignments. Notably, in our experimental setup, branch length estimates using trimmed alignments cannot be “more” accurate than those using untrimmed alignments. Thus, an alignment trimming algorithm that does not negatively influence branch length estimates will have a Spearman rank correlation coefficient of 1.0. Examination of Spearman rank correlation coefficients revealed that branch lengths of trimmed alignments were typically very highly correlated with the branch lengths of untrimmed alignments (S6–S9 Figs); ClipKIT strategies had correlation coefficients of 1.0, suggesting that branch lengths inferred using ClipKIT-trimmed alignments are accurate.

To evaluate the performance of the 14 strategies for species-level phylogenetic inference, we conducted concatenation- and quartet-based phylogenetic inference using IQ-TREE and ASTRAL v. 5.7.3 [16], respectively. We found that all strategies resulted in nearly identical and well-supported phylogenies (S10–S12 Figs). We also calculated tree certainty, an information theory-based measure of tree incongruence, which was used to summarize the degree of agreement with the reference topology across gene trees. The output from this analysis is a single value ranging from 0 to 1, where low values reflect high levels of incongruence among gene trees in the reference topology, and high values reflect low levels of incongruence with the reference topology among gene trees [17]. Tree certainty values were typically high and similar across all trimming strategies except for a few instances where certain strategies, which do not include ClipKIT strategies, significantly underperformed compared to all the others (S13 Fig). Among simulated datasets, we found that ClipKIT strategies reduced computation time by an average of approximately 20% compared to no trimming.

Discussion

Current state-of-the-art MSA trimming strategies focus on the removal of highly divergent sites. Highly divergent sites are thought to lack phylogenetic signal either because they represent sites that have become mutationally saturated due to the occurrence of multiple substitutions or because they are the result of inaccurate inference of homology [18]. A previous analysis suggested that MSA trimming strategies often decreased the accuracy of phylogenetic inference [7], highlighting the need for new strategies.

To address this need, we developed ClipKIT, an alignment trimming software that focuses on identifying and retaining parsimony-informative sites. Examination of the accuracy and support of phylogenetic inferences revealed that ClipKIT strategies consistently and frequently outperformed other MSA trimming strategies and were on par with no trimming. These results suggest that MSA trimming strategies focused on retaining phylogenetically informative sites, such as parsimony-informative sites, hold promise for developing more accurate MSA trimming strategies. We anticipate that ClipKIT will be useful for phylogenomic inference and the quest to build the tree of life.

Methods

ClipKIT availability and usage

ClipKIT is stand-alone software written in the Python programming language (https://www.python.org/) and is available from GitHub (https://github.com/JLSteenwyk/ClipKIT) and PyPi (https://pypi.org/). Complete documentation is available online (https://jlsteenwyk.com/ClipKIT/). ClipKIT differs from most MSA trimming software in that it focuses on identifying and retaining parsimony-informative sites from MSAs rather than on removing highly divergent ones. To do so, ClipKIT examines MSAs site by site and determines whether each site should be retained or trimmed based on the ClipKIT strategy being used and how the site has been classified. During this examination, each site is classified as parsimony-informative, constant, or neither. Note that other types of sites and features of MSAs have previously been shown to be phylogenetically informative (e.g., variable sites and MSA length); however, ClipKIT focuses on parsimony-informative sites. Parsimony-informative sites are defined as sites that contain at least 2 character states, each of which occurs in at least 2 taxa. Constant sites are defined as sites with only 1 character state that appears in at least 2 taxa [19]. Across the various ClipKIT strategies, parsimony-informative sites are always retained, constant sites are either retained or removed, and sites that are neither parsimony-informative nor constant are always removed.
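The two site definitions above can be illustrated with a short Python sketch. This is not ClipKIT's own source code: the function name `classify_site` and the choice of which symbols count as gaps (`GAP_CHARS`) are assumptions made for illustration only.

```python
from collections import Counter

# Assumption: these symbols mark gaps or missing data, not character states.
GAP_CHARS = frozenset("-?")


def classify_site(column):
    """Classify one alignment column, given as a string with one character per taxon."""
    counts = Counter(ch.upper() for ch in column if ch not in GAP_CHARS)
    # Character states that are observed in at least 2 taxa.
    shared_states = [state for state, n in counts.items() if n >= 2]
    if len(shared_states) >= 2:
        return "parsimony-informative"
    if len(counts) == 1 and len(shared_states) == 1:
        return "constant"
    return "neither"
```

Under these assumptions, a column such as `AATT` is parsimony-informative (2 states, each in 2 taxa), `AAAA` is constant, and `AAAT` is neither, because the state `T` occurs in only 1 taxon.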

Previous work [4] identified 2 types of “aberrant sites”: (1) sites where only 1 sequence is represented in the alignment; and (2) sites where only 2 taxa are represented and lack homology (defined by a null model of genome-wide sequence similarity based on species-level divergences) to any other taxa in the alignment. Sites of the first type may stem from a genuine insertion event in 1 taxon or from assembly, annotation, and/or alignment errors. ClipKIT removes any sites that are not parsimony-informative or constant, and it also removes sites that contain high percentages of gaps. Thus, such “aberrant sites” are typically removed by ClipKIT.

Lastly, ClipKIT can also perform alignment trimming based on site gappyness, which is defined as the percentage of taxa that contain a gap character state (as opposed to a nucleotide or amino acid character state) at a given site. The 5 ClipKIT trimming strategies are summarized as follows:

(1) kpi: a strategy that retains sites that are parsimony-informative, which is specified with the following command:

clipkit <MSA> -m kpi

This strategy executes the following pseudocode:

FOR site in alignment:
    IF site is parsimony-informative
        keep the site
    ELSE
        remove the site
ENDFOR

(2) kpic: a strategy that retains sites that are either parsimony-informative or constant, which is specified with the following command:

clipkit <MSA> -m kpic

This strategy executes the following pseudocode:

FOR site in alignment:
    IF site is parsimony-informative or constant
        keep the site
    ELSE
        remove the site
ENDFOR

(3) gappy: a strategy that removes sites that are gap-rich (defined as sites with ≥90% gaps), which is specified with the following command:

clipkit <MSA> -m gappy

Because gappy-based trimming is the default strategy, it can also be executed with the following command:

clipkit <MSA>

This strategy executes the following pseudocode:

FOR site in alignment:
    IF site has <90% gaps
        keep the site
    ELSE
        remove the site
ENDFOR

(4) kpi-gappy: a combination of strategies 1 and 3, which is specified with the following command:

clipkit <MSA> -m kpi-gappy

This strategy executes the following pseudocode:

FOR site in alignment:
    IF site is parsimony-informative AND has <90% gaps
        keep the site
    ELSE
        remove the site
ENDFOR

(5) kpic-gappy: a combination of strategies 2 and 3, which is specified with the following command:

clipkit <MSA> -m kpic-gappy

This strategy executes the following pseudocode:

FOR site in alignment:
    IF site is (parsimony-informative OR constant) AND has <90% gaps
        keep the site
    ELSE
        remove the site
ENDFOR
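The five strategies listed above can be combined into one small Python sketch. It is an illustration of the keep/remove logic only, not ClipKIT's implementation: the function names, the `GAP_CHARS` set, and the dictionary-based alignment representation (taxon name mapped to aligned sequence) are all assumptions.

```python
from collections import Counter
from typing import Dict

GAP_CHARS = frozenset("-?")  # assumption: symbols treated as gaps
MODES = ("kpi", "kpic", "gappy", "kpi-gappy", "kpic-gappy")


def gappyness(column):
    """Fraction of taxa that have a gap character at this site."""
    return sum(ch in GAP_CHARS for ch in column) / len(column)


def is_parsimony_informative(column):
    counts = Counter(ch for ch in column if ch not in GAP_CHARS)
    # At least 2 character states, each present in at least 2 taxa.
    return sum(n >= 2 for n in counts.values()) >= 2


def is_constant(column):
    counts = Counter(ch for ch in column if ch not in GAP_CHARS)
    # Exactly 1 character state, present in at least 2 taxa.
    return len(counts) == 1 and next(iter(counts.values())) >= 2


def keep_site(column, mode="gappy", gap_threshold=0.9):
    """Return True if the site is retained under the given strategy."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    pi = is_parsimony_informative(column)
    pic = pi or is_constant(column)
    low_gaps = gappyness(column) < gap_threshold
    decisions = {
        "kpi": pi,
        "kpic": pic,
        "gappy": low_gaps,
        "kpi-gappy": pi and low_gaps,
        "kpic-gappy": pic and low_gaps,
    }
    return decisions[mode]


def trim_alignment(seqs: Dict[str, str], mode="gappy", gap_threshold=0.9) -> Dict[str, str]:
    """Trim an alignment given as {taxon: aligned sequence}."""
    names = list(seqs)
    columns = ["".join(col) for col in zip(*seqs.values())]
    kept = [col for col in columns if keep_site(col, mode, gap_threshold)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}
```

For a toy alignment of 4 taxa with sequences `AAC-`, `AAC-`, `TAG-`, and `TAGA`, the `kpi` mode keeps only the first and third sites, `kpic` additionally keeps the constant second site, and `gappy` with the 0.9 threshold keeps all 4 sites because the last site has 75% gaps.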

All output files have the same name as the input files with the addition of the suffix “.clipkit”. Users can specify output file names with the -o/--output option. For example, an alignment may have the output name “ClipKIT_trimmed_aln.fa” with the following command:

clipkit <MSA> -o ClipKIT_trimmed_aln.fa

To enable users to fine-tune alignment trimming parameters, we provide an additional option for users to specify their own gappyness threshold, which can range between 0 and 1. For example, to retain sites with <95% gaps, the following command would be used:

clipkit <MSA> -g 0.95

This gappyness threshold would execute the following pseudocode:

FOR site in alignment:
    IF site has <95% gaps
        keep the site
    ELSE
        remove the site
ENDFOR

In practice, we recommend the use of very high gappyness thresholds; the use of lower thresholds may remove too many sites, which may worsen phylogenetic inferences [8].

To enable users to examine the trimmed sites/regions from MSAs, we have also implemented a logging option in ClipKIT. When used, the logging option outputs an additional 4-column file with the following information: column 1, the position in the alignment (starting at 1); column 2, whether the site was trimmed or kept; column 3, whether the site was parsimony-informative, constant, or neither; and column 4, the gappyness of the site. Log files are generated using the -l/--log option:

clipkit <MSA> -l

We anticipate that this information will be helpful for alignment diagnostics, fine-tuning of trimming parameters, and other reasons.

To enable seamless integration of ClipKIT into preexisting pipelines, 8 file types can be used as input. More specifically, ClipKIT can input and output fasta, clustal, maf, mauve, phylip, phylip-sequential, phylip-relaxed, and stockholm formatted MSAs. By default, ClipKIT automatically determines the input file format and creates an output file of the same format; however, users can specify either with the -if/--input_file_format and -of/--output_file_format options. For example, an input file of fasta format and a desired output file of clustal format can be specified using the following command:

clipkit <MSA> -if fasta -of clustal
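For pipelines that call ClipKIT from Python, the options described in this section (-m, -g, -o, -l, -if, -of) can be assembled programmatically. The helper below is a hypothetical sketch: `build_clipkit_command` is not part of ClipKIT, and it only builds the argument list without running anything.

```python
from typing import List, Optional

MODES = ("kpi", "kpic", "gappy", "kpi-gappy", "kpic-gappy")


def build_clipkit_command(
    msa: str,
    mode: str = "gappy",
    gaps: Optional[float] = None,
    output: Optional[str] = None,
    log: bool = False,
    input_format: Optional[str] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the clipkit command line (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if gaps is not None and not 0 <= gaps <= 1:
        raise ValueError("gappyness threshold must be between 0 and 1")
    cmd = ["clipkit", msa, "-m", mode]
    if gaps is not None:
        cmd += ["-g", str(gaps)]
    if output is not None:
        cmd += ["-o", output]
    if log:
        cmd.append("-l")
    if input_format is not None:
        cmd += ["-if", input_format]
    if output_format is not None:
        cmd += ["-of", output_format]
    return cmd
```

Assuming ClipKIT is installed and on the PATH, the resulting list could be passed to `subprocess.run(cmd, check=True)`; looping over `MODES` is one way to apply several strategies to the same alignment.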

Recent analyses indicate that approximately 28% of available computational tools fail to install due to implementation errors [20]. To overcome this hurdle and ensure archival stability of ClipKIT, we implemented state-of-the-art software development practices and design principles. More specifically, ClipKIT is composed of highly modular, extensible, and reusable code, which allows for easy debugging and seamless integration of new functions and features. We wrote a total of 118 unit and integration tests resulting in 97% code coverage. We also implemented a robust continuous integration (CI) pipeline to automatically build, package, and test ClipKIT whenever code is modified. This CI pipeline runs a testing matrix for Python versions 3.6, 3.7, and 3.8. Given the current configuration, building and testing ClipKIT for future versions of Python will be straightforward. Lastly, central ClipKIT functions rely on few dependencies (i.e., BioPython [21] and NumPy [22]). In summary, we have taken several measures to ensure that ClipKIT not only implements MSA trimming strategies that do not sacrifice the accuracy of phylogenetic inference but also remains a long-lasting computational tool for the field of molecular phylogenetics.

Practical considerations when using ClipKIT

Although ClipKIT strategies performed well across empirical genome-scale and simulated datasets, we acknowledge that testing every possible evolutionary scenario is impossible. This is further complicated by the lack of large-scale phylogenomic data matrices in which the true evolutionary relationships among organisms are known. Therefore, we recommend using multiple trimming strategies available in ClipKIT and examining the resulting ABS values for trees. Considering that high ABS values often corresponded to lower nRF values (S4 and S5 Figs), the resulting phylogeny with the highest ABS value may be the one that most closely approximates the true evolutionary history. This approach may require substantially greater computation time. To potentially ameliorate this issue, we recommend creating subsets of larger datasets that span alignments of various lengths and testing multiple trimming strategies on the reduced dataset.

Although constant sites are thought to be important for informing parameters of substitution models [9], we observed variation between the performance of ClipKIT strategies that retain only parsimony-informative sites (kpi and kpi-gappy) and the performance of strategies that retain parsimony-informative and constant sites (kpic and kpic-gappy). More specifically, at times, strategies kpi and kpi-gappy outperformed kpic and kpic-gappy, suggesting that constant sites may not always be informative to substitution models. However, we note that trimming nucleotide sequences with strategies kpi and kpi-gappy may warrant ascertainment bias correction because constant sites are absent from the trimmed alignments.

Dataset acquisition and generation

To test the efficacy of strategies from ClipKIT and other alignment trimming software (Table 1), we used a total of 8 empirical and simulated datasets. For empirical datasets, we obtained publicly available untrimmed amino acid and nucleotide MSAs from 24 mammals (N_alignments = 4,004) and 12 budding yeasts (N_alignments = 5,664) totaling 4 datasets [8]. Publicly available amino acid alignments were generated with MAFFT v. 7.164 using the G-INS-I strategy with a gap penalty of 1.53 [23]. Publicly available nucleotide alignments were generated by mapping nucleotide sequences onto the amino acid alignments. For simulated datasets, we simulated nucleotide sequence evolution along the proposed species phylogeny of 93 filamentous fungi [10] and amino acid sequence evolution along the proposed species phylogenies of 70 metazoans [11], 46 flowering plants [12], and 96 budding yeasts [13] (N_alignments = 50 alignments per dataset or 200 total).

Simulated sequences were generated with INDELible v. 1.03 [24] using parameters suggested by the software developers. INDELible was chosen to generate simulated sequences because of its ability to also simulate insertion and deletion events, which are represented by gaps, a common feature in MSAs. Nucleotide alignments for filamentous fungi were generated using the general time reversible (GTR) substitution model [25]. Additional parameters specified were state frequency values of 0.1, 0.2, 0.3, and 0.4 for T, C, A, and G nucleotides, respectively. We specified the substitution rate matrix using the scheme outlined in S1 Table. Insertion and deletion rates were set to be 5% as frequent as single substitutions. Insertions and deletions occurred according to a power law distribution (a = 1.7, M = 500). The tree’s root length was set to 1,000. For amino acid alignments, all parameters were the same except that insertion and deletion rates were set to be 1% as frequent as single substitutions; substitutions followed the Whelan and Goldman (WAG) model, which was also used to specify state frequencies [26].

The resulting empirical and simulated MSAs were trimmed using 14 popular alignment trimming strategies (Table 1). Altogether, we generated a total of 138,152 MSAs [(4,004 mammalian + 5,664 yeast + 200 simulated MSAs) * (14 trimming strategies, including a “no trimming” strategy) = 138,152 MSAs], which were used to evaluate the performance of ClipKIT and other alignment trimming strategies.
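The total number of MSAs follows directly from the dataset sizes reported above. The short Python check below reproduces the arithmetic; the variable names are illustrative.

```python
# Alignment counts as reported in the text.
empirical = {"mammalian": 4004, "yeast": 5664}
simulated = 4 * 50          # 4 simulated datasets, 50 alignments each
n_strategies = 14           # 13 trimming strategies plus "no trimming"

total_msas = (sum(empirical.values()) + simulated) * n_strategies
print(total_msas)  # 138152
```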

Measuring accuracy and support of phylogenetic inferences

Phylogenetic inferences from MSAs were made using IQ-TREE v. 1.6.11 [9]. For nucleotide sequences, we used a GTR substitution model [27] with empirical base frequencies and a discrete Gamma model with 4 rate categories [28] or “GTR+F+G”; for amino acid sequences, we used the general WAG model of substitutions [26] with empirical base frequencies and a discrete Gamma model with 4 rate categories [28] or “WAG+F+G.”

Tree accuracy was measured using nRF distances, calculated with the R package ape v. 5.3 [29] (https://cran.r-project.org/), by comparing the inferred gene phylogenies to their species phylogenies. Tree support was measured using ABS from 5,000 ultrafast bootstrap approximations in IQ-TREE [30]. To determine if alignment trimming strategies resulted in substantially different alignment lengths, nRF values, and ABS values, we conducted principal component analysis using the R packages FactoMineR v. 2.3 [31] and factoextra v. 1.0.6 [32]. All plots were made with FactoMineR, factoextra, and ggplot2 v. 2.3.1 [33] in the R v. 3.6.2 (https://cran.r-project.org/) programming environment.
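To make the two metrics concrete, the Python sketch below computes an nRF distance and an ABS value from simplified inputs. It is not the ape or IQ-TREE implementation: trees are assumed to be supplied as sets of non-trivial bipartitions, the normalization by 2(n − 3) is the common choice for fully resolved unrooted trees and is an assumption here, and all function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence, Set

Split = FrozenSet[str]


def canonical_split(side: Iterable[str], taxa: Iterable[str]) -> Split:
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side, taxa = frozenset(side), frozenset(taxa)
    reference = min(taxa)
    return taxa - side if reference in side else side


def normalized_rf(splits_a: Set[Split], splits_b: Set[Split], n_taxa: int) -> float:
    """RF distance divided by 2 * (n_taxa - 3), its maximum for binary unrooted trees."""
    if n_taxa < 4:
        raise ValueError("need at least 4 taxa")
    rf = len(splits_a ^ splits_b)  # bipartitions found in only one of the two trees
    return rf / (2 * (n_taxa - 3))


def average_bipartition_support(supports: Sequence[float]) -> float:
    """Mean support value across the internal branches of one tree."""
    return sum(supports) / len(supports)


taxa = ["A", "B", "C", "D", "E"]
# Gene tree ((A,B),C,(D,E)) versus species tree ((A,C),B,(D,E)).
gene = {canonical_split(s, taxa) for s in (["A", "B"], ["D", "E"])}
species = {canonical_split(s, taxa) for s in (["A", "C"], ["D", "E"])}
print(normalized_rf(gene, species, len(taxa)))  # 0.5
```

In this toy example the two trees share the bipartition separating D and E from the rest and disagree on the other, so 2 of a possible 4 differences are observed.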

To summarize nRF and ABS values into a single summary metric, we used desirability functions. Desirability functions rescale a distribution of values to be between 0 and 1 depending on whether or not low values (e.g., nRF) or high values (ABS) are best. More specifically, these transformations were conducted using the following approach:

For nRF values:

\mathrm{desirability}_{low} = \begin{cases} 0 & \text{if } Y > A \\ \frac{Y - A}{B - A} & \text{if } B \leq Y \leq A \\ 1 & \text{if } Y < B \end{cases},

where Y is the variable value, A is the maximum nRF value, and B is the minimum nRF value.

For ABS values:

\mathrm{desirability}_{high} = \begin{cases} 0 & \text{if } Y < A \\ \frac{Y - A}{B - A} & \text{if } A \leq Y \leq B \\ 1 & \text{if } Y > B \end{cases},

where Y is the variable value, A is the minimum ABS value, and B is the maximum ABS value.

These transformations were conducted for the 14 different trimming strategies on a per gene basis. The resulting values were used to rank the relative performance of the 14 trimming strategies.
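A minimal Python sketch of this per-gene ranking is given below. The two desirability functions follow the linear rescaling described above, using the minimum and maximum observed values as bounds. How the two desirabilities are integrated into one score is not specified in this section; the geometric mean used here is an assumption, as are all function and variable names.

```python
from math import sqrt
from typing import Dict, List


def desirability_high(y: float, low: float, high: float) -> float:
    """Rescale to [0, 1] so that high values (e.g., ABS) score best."""
    if high == low:
        return 1.0  # assumption: all strategies tie when values are identical
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return (y - low) / (high - low)


def desirability_low(y: float, low: float, high: float) -> float:
    """Rescale to [0, 1] so that low values (e.g., nRF) score best."""
    return 1.0 - desirability_high(y, low, high)


def rank_strategies(nrf: Dict[str, float], support: Dict[str, float]) -> List[str]:
    """Rank trimming strategies for one gene, best first."""
    nrf_lo, nrf_hi = min(nrf.values()), max(nrf.values())
    abs_lo, abs_hi = min(support.values()), max(support.values())
    overall = {
        name: sqrt(
            desirability_low(nrf[name], nrf_lo, nrf_hi)
            * desirability_high(support[name], abs_lo, abs_hi)
        )
        for name in nrf
    }
    return sorted(overall, key=overall.get, reverse=True)


# Made-up values for one gene and three hypothetical strategies.
nrf_values = {"strategy_a": 0.1, "strategy_b": 0.2, "strategy_c": 0.5}
abs_values = {"strategy_a": 90.0, "strategy_b": 95.0, "strategy_c": 60.0}
print(rank_strategies(nrf_values, abs_values))
```

With these illustrative numbers, the strategy with the lowest nRF and near-highest ABS ranks first, and the strategy that is worst on both metrics receives an overall desirability of 0.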

To examine the accuracy of branch lengths among single-gene trees, Spearman rank correlations of branch lengths were calculated between untrimmed (control) and trimmed (treatment) simulated MSAs. To do so, the topologies of the untrimmed and trimmed phylogenies must be identical. Therefore, branch lengths were inferred along phylogenies that were constrained to match the reference tree topology using IQ-TREE [9]. This analysis was only done for simulated sequences because high confidence in alignment quality and true tree topology is required. Spearman rank correlations were conducted using the ggpubr v. 0.2.5 [34] package in the R v. 3.6.2 (https://cran.r-project.org/) programming environment.
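The branch length comparison can be sketched in plain Python as follows. The study itself used ggpubr in R; this stand-alone version computes the Spearman coefficient as the Pearson correlation of average ranks, and the branch length values are made up for illustration.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Branch lengths for the same branches of a constrained topology (illustrative values).
untrimmed = [0.10, 0.25, 0.05, 0.40]
trimmed = [0.12, 0.30, 0.04, 0.35]
rho = spearman(untrimmed, trimmed)
```

Because the two lists order the branches identically, the coefficient is 1 (up to floating-point rounding), which corresponds to trimming that leaves the relative branch lengths undisturbed.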

For species-level phylogenetic inferences, we used concatenated alignments of trimmed MSAs as input to IQ-TREE [9]. Species-level phylogenetic inferences were also examined when using the quartet-based approach implemented in ASTRAL v. 5.7.3 [16], in which single-gene trees were used as input. Lastly, support among single-gene trees for reference topologies was assessed using the information theory-based measure tree certainty [17,35,36], which is implemented in RAxML v. 8.2.10 [37].

Software availability

ClipKIT is available from GitHub (https://github.com/JLSteenwyk/ClipKIT) and PyPi (https://pypi.org/project/clipkit). Complete ClipKIT documentation is available online (https://jlsteenwyk.com/ClipKIT/).

Supporting information

S1 Fig. Data are well represented in principal component analysis.

Examination of the factors contributing to the variation of the 14 trimming strategies in feature space for empirical datasets (panels A–D) and simulated datasets (panels E–H). Examination of variable representation along the first and second dimensions of the principal component analysis (see Fig 1) revealed that alignment length, nRF, and ABS were well represented in empirical (A) and simulated (E) datasets. Variable correlation plots for empirical (B) and simulated (F) datasets show the relationship among variables. Broadly, we found that variable types (i.e., alignment length, nRF, and ABS) were correlated with one another across datasets. Examination of contribution of variables along the first and second dimensions for empirical datasets (C and D, respectively) as well as simulated datasets (G and H, respectively) revealed that ABS and nRF contributed the most along the first dimension and alignment length contributed the most along the second dimension. In these figures, the red dashed line represents the expected average contribution if all variables contributed equally. Abbreviations used in the figure are as follows: ABS, average bipartition support; Aln. len, alignment length; A.P., Aspergillus and Penicillium; F.P., Flowering plants; Met, Metazoans; nRF, normalized Robinson–Foulds; Sac, Saccharomycotina. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618).

(TIF)

S2 Fig. Lengths of trimmed MSAs and the associated nRF and ABS values across empirical datasets.

For the mammalian AA and NT sequences (A and B) as well as the yeast AA and NT sequences (C and D), wide variation was observed in the fraction of the original alignment trimmed by different trimming strategies. MSA trimming strategies are ordered along the x-axis from the least aggressive trimmer to the most aggressive trimmer according to the average fraction of the original alignment that remained in the trimmed alignment (y-axis). Distributions of nRF (E–H) and ABS (I–L) values were subsequently examined. MSA trimming strategies are ordered along the x-axis from highest- to lowest-performing according to average nRF or ABS value. nRF and ABS values were Z transformed per gene prior to plotting their distribution. Boxplots embedded in violin plots have lower, middle, and upper hinges that represent the first, second, and third quartiles. Whiskers extend to 1.5 times the interquartile range. Note that the yeast dataset (N = 12) is at the lower threshold of Noisy’s recommended minimum number of sequences. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). AA, amino acid; ABS, average bipartition support; MSA, multiple sequence alignment; nRF, normalized Robinson–Foulds; NT, nucleotide.
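The per-gene Z transformation described above can be sketched as follows. This is a minimal illustration that assumes the population standard deviation and a hypothetical nested-dictionary layout (gene, then trimming strategy); the original analysis may have used a different estimator or tool.

```python
from statistics import mean, pstdev


def z_transform_per_gene(values_by_gene):
    """Z transform metric values (e.g., nRF or ABS) within each gene.

    values_by_gene: {gene: {strategy: value}} (hypothetical layout).
    Returns a dictionary of the same shape holding z-scores, so that
    strategies are compared relative to each gene's own mean and spread.
    """
    transformed = {}
    for gene, by_strategy in values_by_gene.items():
        values = list(by_strategy.values())
        mu = mean(values)
        sd = pstdev(values)  # assumption: population standard deviation
        transformed[gene] = {
            # A gene with no variation across strategies maps to 0.0.
            strategy: ((value - mu) / sd if sd > 0 else 0.0)
            for strategy, value in by_strategy.items()
        }
    return transformed


# Toy nRF values for two genes and three illustrative strategy labels.
toy_nrf = {
    "gene1": {"no_trim": 0.10, "kpic": 0.20, "other": 0.30},
    "gene2": {"no_trim": 0.50, "kpic": 0.50, "other": 0.50},
}
toy_z = z_transform_per_gene(toy_nrf)
```

Standardizing within each gene keeps genes that are intrinsically hard to resolve from dominating the comparison among trimming strategies.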

(TIF)

S3 Fig. Lengths of trimmed MSAs and the associated nRF and ABS values across simulated datasets.

Extensive variation was observed in the fraction of the original alignment trimmed by the various MSA trimming strategies (A–D). MSA trimming strategies are ordered along the x-axis from the least aggressive trimmer to the most aggressive trimmer according to the average fraction of the original alignment that remained in the trimmed alignment (y-axis). Among the various datasets, we examined the distributions of nRF (E–H) and ABS (I–L) values. MSA trimming strategies are ordered along the x-axis from highest- to lowest-performing according to average nRF or ABS value. Prior to plotting, nRF and ABS values were Z transformed on a per gene basis. Boxplots embedded in violin plots have lower, middle, and upper hinges that represent the first, second, and third quartiles. Whiskers extend to 1.5 times the interquartile range. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). ABS, average bipartition support; MSA, multiple sequence alignment; nRF, normalized Robinson–Foulds.

(TIF)

S4 Fig. Pairwise examination of the relationship between alignment length, nRF, and ABS values among empirical datasets.

Broadly, we found that higher ABS values corresponded with lower nRF values across the various datasets (A–D). Additionally, lower nRF values were typically associated with longer alignment lengths (E–H). The only MSA trimming strategies that did not follow this trend were the ClipKIT strategies kpic, kpic-gappy, kpi, and kpi-gappy, where the alignments were shorter than others but still resulted in accurate phylogenetic inferences. Similarly, we found that longer alignments were associated with higher ABS values (I–L). Again, the only strategies that resulted in substantially shorter alignments but still produced well-supported phylogenetic trees were the ClipKIT strategies kpic, kpic-gappy, kpi, and kpi-gappy. This suggests that ClipKIT was able to trim substantial portions of alignments without compromising phylogenetic accuracy and support. Bidirectional error bars extend 1 standard deviation from the mean. Error bars cross at the average of the 2 variables being examined. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). ABS, average bipartition support; MSA, multiple sequence alignment; nRF, normalized Robinson–Foulds.

(TIF)

S5 Fig. Pairwise examination of the relationship between alignment length, nRF, and ABS values among simulated datasets.

Higher ABS values were often associated with lower nRF values (A–D). Lower nRF values were often associated with shorter alignment lengths (E–H). Longer alignments were often associated with higher ABS values (I–L). Bidirectional error bars extend 1 standard deviation from the mean. Error bars cross at the average of the 2 variables being examined. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). ABS, average bipartition support; nRF, normalized Robinson–Foulds.

(TIF)

S6 Fig. ClipKIT branch lengths are accurate using simulated alignments from a metazoan phylogeny.

Spearman rank correlation coefficients reveal that ClipKIT is a top-performing software when assessing branch length accuracy. Hex plots depict the number of datasets for a given set of x- and y-coordinates. Trimmed alignment branch lengths are depicted along the x-axis. Control or untrimmed branch lengths are depicted along the y-axis. Perfect 1-to-1 correlations are represented by the red line. The line of best fit from a linear model is represented by the blue line. All Spearman rank correlation coefficients are shown in the top right portion of each figure. All Spearman rank correlations were statistically significant (p < 0.01). Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618).
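For readers unfamiliar with the statistic, the Spearman rank correlation between trimmed and untrimmed branch lengths can be sketched as the Pearson correlation of their ranks. The code below is a minimal pure-Python illustration with hypothetical function names and toy branch lengths; it does not compute p-values and is not the code used to produce the figure.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy branch lengths from a trimmed and an untrimmed alignment.
trimmed = [0.10, 0.20, 0.40, 0.80]
untrimmed = [0.15, 0.25, 0.30, 1.20]
rho = spearman_rho(trimmed, untrimmed)
```

Because the statistic depends only on ranks, it rewards trimmed alignments that preserve the relative ordering of branch lengths even when absolute lengths are rescaled.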

(TIF)

S7 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of flowering plants.

(TIF)

S8 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of Aspergillus and Penicillium species.

(TIF)

S9 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of Saccharomycotina yeast.

(TIF)

S10 Fig. Concatenated alignment lengths varied by dataset and alignment trimming strategy.

Alignment lengths of concatenated data matrices for mammals (A, B), yeasts (C, D), which are the empirical datasets, and metazoans (E), flowering plants (F), Aspergillus and Penicillium (G), and Saccharomycotina yeasts (H), which are simulated datasets, varied greatly. For example, ClipKIT with the gappy strategy trimmed the least among the empirical datasets, while Noisy often trimmed the least among simulated datasets. MSA trimming strategies are ordered along the x-axis from the ones that trimmed the most to those that trimmed the least. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). MSA, multiple sequence alignment.

(TIF)

S11 Fig. Alignment trimming strategies resulted in nearly identical metrics of accuracy and support for species-level inferences using empirical datasets.

Species-level phylogenies were inferred using concatenation and coalescence approaches. All phylogenies inferred received nearly full support using the concatenation (A–D) and coalescence approaches (E–H). Similarly, phylogenies inferred using the concatenation (I–L) and coalescence approaches (M–P) were accurate. Variation in species-level inferences for these datasets has been previously reported. Thus, we consider these phylogenies nearly identical. Distributions among concatenation-based metrics stem from 5 independent tree searches. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618).

(TIF)

S12 Fig. Alignment trimming strategies resulted in nearly identical metrics of accuracy and support for species-level inferences using simulated datasets.

Species-level phylogenies were inferred using concatenation and coalescence approaches. Nearly full support was observed for phylogenies inferred using the concatenation (A–D) and coalescence approaches (E–H). Low nRF values indicate the concatenation (I–L) and coalescence approaches (M–P) inferred accurate species-level phylogenies. Concatenation-based metrics are derived from a single tree search. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618).

(TIF)

S13 Fig. Alignment trimming strategies have similar tree certainty values.

Tree certainty values were calculated using all single-gene trees and the reference tree as input. (A–D) Across empirical datasets, tree certainty values were similar across alignment trimming strategies, with the exception of Gblocks, Noisy, and BMGE with an entropy threshold of 0.3, which frequently had lower tree certainty values than other alignment trimming strategies. (E–H) Among simulated datasets, a similar pattern was observed, except that BMGE often had a lower tree certainty value than other alignment trimming strategies. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). BMGE, Block Mapping and Gathering with Entropy.

(TIF)

S1 Table. GTR substitution rate matrix.

Simulated NT alignments were generated using this substitution rate matrix. GTR, general time reversible; NT, nucleotide.

(XLSX)

Abbreviations

ABS: average bipartition support
BMGE: Block Mapping and Gathering with Entropy
CI: continuous integration
GTR: general time reversible
MSA: multiple sequence alignment
nRF: normalized Robinson–Foulds
WAG: Whelan and Goldman

Data Availability

All alignments and phylogenies inferred in this study will be available from figshare (doi: 10.6084/m9.figshare.12401618) upon publication. The following link is provided for review purposes only https://figshare.com/s/bd07b70b510bca3155b9.

Funding Statement

J.L.S. and A.R. were supported by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. A.R. was supported by the National Science Foundation (DEB-1442113), the Guggenheim Foundation, the Burroughs Wellcome Fund, and the National Institutes of Health / National Institute of Allergy and Infectious Diseases (1R56AI146096-01A1). X.X.S. was supported by the start-up grant from the “Hundred Talents Program” at Zhejiang University and the Fundamental Research Funds for the Central Universities. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Talavera G, Castresana J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Syst Biol. 2007;56:564–577. doi: 10.1080/10635150701472164
2. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3. doi: 10.1093/bioinformatics/btp348
3. Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol. 2010;10:210. doi: 10.1186/1471-2148-10-210
4. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–31. doi: 10.1126/science.1253451
5. Shen X-X, Steenwyk JL, Labella AL, Opulente DA, Zhou X, Kominek J, et al. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. bioRxiv. 2020. doi: 10.1126/sciadv.abd0079
6. Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020. doi: 10.1038/s41576-020-0233-0
7. Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, et al. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Syst Biol. 2015;64:778–91. doi: 10.1093/sysbio/syv033
8. Shen X-X, Salichos L, Rokas A. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biol Evol. 2016;8:2565–80. doi: 10.1093/gbe/evw179
9. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol Biol Evol. 2015;32:268–74. doi: 10.1093/molbev/msu300
10. Steenwyk JL, Shen X-X, Lind AL, Goldman GH, Rokas A. A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. MBio. 2019;10. doi: 10.1128/mBio.00925-19
11. Whelan NV, Kocot KM, Moroz LL, Halanych KM. Error, signal, and the placement of Ctenophora sister to all other animals. Proc Natl Acad Sci U S A. 2015;112:5773–8. doi: 10.1073/pnas.1503453112
12. Xi Z, Liu L, Rest JS, Davis CC. Coalescent versus Concatenation Methods and the Placement of Amborella as Sister to Water Lilies. Syst Biol. 2014;63:919–32. doi: 10.1093/sysbio/syu055
13. Shen X-X, Zhou X, Kominek J, Kurtzman CP, Hittinger CT, Rokas A. Reconstructing the Backbone of the Saccharomycotina Yeast Phylogeny Using Genome-Scale Data. G3 (Bethesda). 2016;6:3927–39. doi: 10.1534/g3.116.034744
14. Eidem HR, Steenwyk JL, Wisecaver JH, Capra JA, Abbot P, Rokas A. integRATE: a desirability-based data integration framework for the prioritization of candidate genes across heterogeneous omics and its application to preterm birth. BMC Med Genet. 2018;11:107. doi: 10.1186/s12920-018-0426-y
15. Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, et al. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008;3:7. doi: 10.1186/1748-7188-3-7
16. Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31:i44–52. doi: 10.1093/bioinformatics/btv234
17. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–31. doi: 10.1038/nature12130
18. Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991. doi: 10.1093/oxfordjournals.molbev.a040654
19. Kumar S, Stecher G, Tamura K. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol. 2016. doi: 10.1093/molbev/msw054
20. Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 2019;17:e3000333. doi: 10.1371/journal.pbio.3000333
21. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3. doi: 10.1093/bioinformatics/btp163
22. Van Der Walt S, Colbert SC, Varoquaux G. The NumPy array: A structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30. doi: 10.1109/MCSE.2011.37
23. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–80. doi: 10.1093/molbev/mst010
24. Fletcher W, Yang Z. INDELible: A Flexible Simulator of Biological Sequence Evolution. Mol Biol Evol. 2009;26:1879–88. doi: 10.1093/molbev/msp098
25. Waddell PJ, Steel M. General Time-Reversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites. Mol Phylogenet Evol. 1997;8:398–414. doi: 10.1006/mpev.1997.0452
26. Whelan S, Goldman N. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol Biol Evol. 2001;18:691–9. doi: 10.1093/oxfordjournals.molbev.a003851
27. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. In Lectures on Mathematics in the Life Sciences, vol. 17. 1986. p. 57–86.
28. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol. 1994;39:306–14. doi: 10.1007/BF00160154
29. Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–90. doi: 10.1093/bioinformatics/btg412
30. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–22. doi: 10.1093/molbev/msx281
31. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25:1–18. doi: 10.18637/jss.v025.i01
32. Kassambara A, Mundt F. factoextra. R package, v. 1.0.5. 2017.
33. Wickham H. ggplot2. Elegant Graphics for Data Analysis. New York, NY: Springer New York; 2009. doi: 10.1007/978-0-387-98141-3
34. Kassambara A. ‘ggpubr’: “ggplot2” Based Publication Ready Plots. R Packag version 025. 2020.
35. Kobert K, Salichos L, Rokas A, Stamatakis A. Computing the Internode Certainty and Related Measures from Partial Gene Trees. Mol Biol Evol. 2016;33:1606–17. doi: 10.1093/molbev/msw040
36. Salichos L, Stamatakis A, Rokas A. Novel Information Theory-Based Measures for Quantifying Incongruence among Phylogenetic Trees. Mol Biol Evol. 2014;31:1261–71. doi: 10.1093/molbev/msu061
37. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3. doi: 10.1093/bioinformatics/btu033

PLoS Biol. doi: 10.1371/journal.pbio.3001007.r001

Decision Letter 0

Roland G Roberts

1 Jul 2020

Dear Dr Rokas,

Thank you for submitting your manuscript entitled "ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference" for consideration as a Methods and Resources by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Jul 03 2020 11:59PM.

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

PLoS Biol. doi: 10.1371/journal.pbio.3001007.r002

Decision Letter 1

Roland G Roberts

21 Aug 2020

Dear Dr Rokas,

Thank you very much for submitting your manuscript "ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference" for consideration as a Methods and Resources at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by four independent reviewers.

You’ll see that three of the reviewers (#1, #2, #4) find your method useful, but rev #1 thinks that you need to improve the description of what it actually does (in contrast to other approaches); we agree, and think that this is doubly important for our broader readership. Several of the reviewers have further technical concerns, some of which may need additional analysis.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 2 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

General comments:

The study by Steenwyk addresses an important and unappreciated issue in genomics, that sequence alignments have problems, which, if not fixed, can negatively impact downstream studies, such as generating a phylogenetic tree. They came up with a solution for phylogeny that identifies and saves phylogenetically informative sites of a multiple sequence alignment, as opposed to deleting sites that are uninformative. They performed a thorough comparison of multiple methods, and find that their ClipKIT methods outperform the other methods. Overall, I think the study has made important advances. But, I have some concerns that I feel need to be addressed.

The authors present one view on issues with alignments, but there are really two: 1) Determining which sites are phylogenetically informative, as ClipKIT is designed to do; 2) Removing errors in the alignment, which some other tools are designed to do, but it is not clear that ClipKIT is designed to remove errors? Such errors include over-alignment (non-homologous sequences that should not be aligned but were aligned) and under-alignment (homologous sequences that should be aligned but were not aligned). These errors too affect phylogenetic inference and many other analyses. The authors do not seem to address how ClipKIT handles such errors? In a study of alignment of 48 bird genomes, Jarvis et al 2014 (Science) developed alignment filtering tools (described in the supplement) to remove such errors before they inferred a phylogeny. Does ClipKIT work like these?

A big issue is that the authors are not clear enough on what ClipKIT does. I looked in the methods and in their online website about it. Do the authors have equations that they programmed? What criterion is used to determine a site is phylogenetically informative? What does constant sites mean (identical sequence in all species)? What does ClipKIT do with the non-phylogenetically informative sites?

Also what type of alignment software was used to make the alignments? Different alignment algorithms give more or less errors in the alignment (e.g. Jarvis et al 2014).

After the authors demonstrate that ClipKIT outperforms maintaining phylogenetic sites, they infer phylogenetic trees comparing the trimmed alignments from multiple methods, and find almost identical, well-supported phylogenies with all methods. So, then what is the need of ClipKIT? This makes me wonder, if their simulated alignments reflected real world situations?

Overall, the paper is not prepared well enough. So, it is hard to tell issues due to missing information, or a flaw in design of the analyses.

Specific comments:

Page and line numbers need to be included, otherwise writing a review is more difficult.

If ClipKIT is supposed to identify and retain phylogenetically informative sites, the term itself implies removal of sites.

In the introduction, the authors need to mention the primary goal of many trimming/filtering tools is to remove alignment errors. They should describe what ClipKIT does with such errors.

The authors are inconsistent in stating the number of trimming approaches used; 13, 14, 6. In table 1, I count 6 trimming approaches, 12 subapproaches, plus a no trimming control. They need to be consistent.

nRF and ABS need to be spelled out on first use in the paper, in the results section.

The authors state that Gblocks and BMGE trimming was not evaluated on simulated data sets, because they could remove entire alignments. However, values from these methods are in the figures, including in the simulated data sets of Figure 1B.

Where is the evidence for the following sentence? "Finally, counter to previous evidence suggestive of a trade-off between trimming and phylogenetic accuracy [6], we found that ClipKIT aggressively trimmed MSAs in the empirical datasets without compromising phylogenetic tree accuracy and support."

The following sentence in the discussion seems contradictory to the finding in the results that ClipKIT trimming did not increase or decrease the accuracy of the phylogeny. "A previous analysis suggested that MSA trimming methods often decreased the accuracy of phylogenetic inference [6], highlighting the need for alternative alignment trimming approaches."

Reviewer #2:

Trimming MSA in phylogenomics

# Summary

This study describes a new approach to trimming multiple sequence alignments (MSAs) for phylogenomics. While most available methods focus on identifying and trimming putatively phylogenetically-uninformative sites, the new method (ClipKIT) aims to identify and retain phylogenetically-informative sites. The authors demonstrate the accuracy of ClipKIT with multiple empirical and simulated datasets using split-based distance metrics for phylogenetic trees. Overall, the method performs extremely well (as judged by the metrics used) and sometimes is on par with no-trimming, yet it saves computation time.

# Assessment

In general, there are no major issues with this study. It is a clever idea to focus on site retention instead of removal. The rationale and implementation of the algorithm are clear and well-justified (though I wish the pseudo-code was provided in the Suppl. Info). Speed and accuracy are a major improvement over alternative approaches indicating that ClipKIT will likely be appealing to most empirical phylogenomicists.

My biggest quarrel with the study is the use of split-based distance metrics like the RF distance to assess accuracy (and thus making the PCAs not easily interpretable). It is well known that RF is a highly biased and sometimes problematic metric, even if normalized. However, I am not sure what would be reasonable to ask the authors to do here given the large numbers of MSAs they examined. On one hand, it would be useful at least to provide raw, as opposed to normalized RF. On the other hand, it would be better to use alternative metrics of tree comparison. Not sure if the information theoretic-based metrics developed by the senior author (tree certainty (TC) and AllTC) would be a good addition to the ms. Alternatively, using information theoretic-based RF (see https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa614/5866976) could be more informative. I know this may be a lot of work, and hence I am not sure what would be the best way to proceed. Results may not change, but I worry about describing the accuracy of a new method using a metric that is highly biased like RF (and normalized RF). ABS is not significantly better.
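To make the distinction between raw and normalized RF concrete, the following is a minimal sketch of how the two quantities relate. It is not taken from the manuscript or from any tree-comparison package. It assumes trees are written as nested Python tuples of taxon names, that both trees share the same taxa, and that normalization divides by 2(n − 3), the maximum RF distance between two fully resolved unrooted trees on n taxa. The function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set of tree, list of leaf sets for every internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def splits(tree):
    """Nontrivial bipartitions of the tree, treated as unrooted."""
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    result = set()
    for clade in found:
        # Orient every split so that it never contains the anchor taxon.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Raw RF: number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def normalized_rf(tree_a, tree_b):
    """Raw RF divided by 2(n - 3); assumes n > 3 and the same taxa in both."""
    n_taxa = len(clades(tree_a)[0])
    return rf_distance(tree_a, tree_b) / (2 * (n_taxa - 3))


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_a, tree_b))    # 2
print(normalized_rf(tree_a, tree_b))  # 0.5
```

Because the normalizing constant depends only on the number of taxa, raw and normalized RF rank trees identically within a dataset; they differ only when datasets of different sizes are compared.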

A second issue is that MSA site trimming may not only affect topology but also branch lengths, an equally important parameter in empirical phylogenomics and molecular evolution. Comparing branch lengths and defining null models can also be tricky, but I wonder if the authors have any insight here.

Addressing or at least discussing these two issues may help improve an already excellent, well-written manuscript.

Reviewer #3:

In this work, Steenwyk and collaborators developed an alignment-trimming algorithm aiming to identify phylogenetically-informative sites.

The authors tested its effectiveness under different conditions (simulated and not) and showed that it consistently outperformed other trimming methods across diverse datasets.

I have some significant concerns about this work:

1) The software removes parsimony-uninformative sites; however, the same sites can be informative when analysed under an ML or Bayesian framework.

2) Removing the parsimony-uninformative sites affects the branch lengths and potentially the topology.

3) It is already possible to remove parsimony-uninformative sites, as well as gaps, in MEGA. ClipKIT allows the user to do this from the command line, but this does not justify a publication in PLOS Biology.
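For reference, the three points above rest on the standard definition of a parsimony-informative site: a column with at least two character states that each occur in at least two sequences. The sketch below illustrates that definition only; it is not ClipKIT's code. Treating `-` and `?` as missing data rather than as character states is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter

GAP_CHARS = {"-", "?"}  # assumption: treated as missing data, not as states


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(ch.upper() for ch in column if ch not in GAP_CHARS)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def classify_columns(alignment):
    """Classify every column of an alignment given as equal-length strings."""
    return [is_parsimony_informative(col) for col in zip(*alignment)]


alignment = ["AAGT", "AAGC", "ATCA", "ATC-"]
print(classify_columns(alignment))  # [False, True, True, False]
```

In this toy alignment the first column is constant and the last contains only singletons, so neither is parsimony-informative. These are the kinds of sites that can still contribute to branch-length and model-parameter estimates under likelihood-based methods.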

Reviewer #4:

The submission addresses the issue of alignment trimming for phylogenetic inference. This is a routine step in phylogenetic inference, and some of the trimming software have thousands of citations. Recently, the benefit of alignment trimming has been called into question. The software introduced here, ClipKIT, purports to address this issue.

The work is generally well done, clearly written, and well illustrated. I only have a few comments.

Major:

- The use of "Desirability-based integration of accuracy and support metrics" makes it easier to rank the methods and summarise the results, but makes it harder to interpret the differences. Please include a precise mathematical definition of the compound measure.

Minor:

- Abstract: "Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time-saving". This statement is too absolute, particularly the accuracy claim (e.g. if the input alignment is fundamentally flawed, trimming alone cannot possibly turn it into something "accurate"). It could, however, be said that, in contrast to other methods, the trees don't worsen after ClipKIT trimming, and the method saves time.

- Define ABS (average bootstrap support?) and nRF as normalised Robinson-Foulds.

- In Supplementary figure 4, I find the z-score transform confusing. I would be interested to see how the ABS and nRF values compare for the different methods. The z-score makes things less interpretable. For one thing, over what population were the mean and variance computed? (across all methods and datasets? across all datasets separately for each method?)
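The question about the reference population matters because the same raw values give different z-scores depending on what the mean and standard deviation are computed over. The following sketch uses made-up nRF values for two hypothetical methods and compares a pooled transform with a per-method transform. It does not reproduce the authors' analysis, and the choice of the population standard deviation (`pstdev`) is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values against their own mean and population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


# Hypothetical nRF values for two trimming methods on three datasets.
nrf = {
    "method_a": [0.10, 0.20, 0.30],
    "method_b": [0.30, 0.40, 0.50],
}

# Option 1: one population pooled across all methods and datasets.
pooled = z_scores(nrf["method_a"] + nrf["method_b"])

# Option 2: a separate population for each method.
per_method = {name: z_scores(values) for name, values in nrf.items()}

print(pooled)
print(per_method)
```

With the pooled transform, method_a keeps its lower nRF values relative to method_b. With the per-method transform, both methods receive the same z-scores and the difference between them disappears, which is why the population needs to be stated.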

PLoS Biol. 2020 Dec 2;18(12):e3001007. doi: 10.1371/journal.pbio.3001007.r003

Author response to Decision Letter 1

8 Sep 2020

Attachment

Submitted filename: Steenwyk_etal_response_to_reviewers.docx

PLoS Biol. doi: 10.1371/journal.pbio.3001007.r004

Decision Letter 2

Roland G Roberts

26 Oct 2020

Dear Dr Rokas,

Thank you for submitting your revised Methods and Resources paper entitled "ClipKIT: a multiple sequence alignment-trimming software for accurate phylogenomic inference" for publication in PLOS Biology. I have now obtained advice from two of the original reviewers and have discussed their comments with the Academic Editor.

Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the data and other policy-related requests noted at the end of this email.

IMPORTANT: Please attend to the following:

a) Please address the remaining concerns raised by reviewer #1.

b) Please address my Data Policy requests (see further down).

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer. In addition to the remaining revisions and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

*Copyediting*

Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Many thanks for depositing your alignments and phylogenies in Figshare and making your code available on Github. However, we also ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1, 2, S1-S13. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

[identifies himself as Erich Jarvis]

The authors were very responsive to the reviews, and this led to a significant improvement in the manuscript. I just have a few conceptual concerns in how things are presented, that can easily be fixed with changes in the text.

The authors were responsive to my comments about the difference between ClipKIT identifying and removing naturally phylogenetically uninformative sites versus removing sites that result from alignment errors. However, they did not clearly state this distinction in the main text, nor its consequences for phylogenetic inference. This needs to be stated clearly, both in terms of the theory and, when possible, in terms of the proportion of sites suspected to be uninformative versus those suspected to be alignment errors.

The authors seem to present a contradictory message about alignment filtering strategies that are meant to improve phylogenetic inference, when some make no change at all and others sometimes make the phylogeny worse. ClipKIT trimming, however, is supposed to make the phylogenetic inference "better" or to make no change to an already accurate phylogeny. The contradiction is that, in the response to reviewer 3, not removing the uninformative sites does not change anything, and removing the uninformative sites does not change the branch lengths. But isn't the point of ClipKIT that removing the uninformative sites should improve phylogenetic inference? Shouldn't the branch lengths become more accurate? Isn't removing sequence alignment errors supposed to improve phylogenetic inference? I believe the authors show improvements, but I also think they are unnecessarily trying to play two sides of the same coin: no change or improvement in phylogeny. By definition, no change also means no improvement.

Lines 157-159. Highly divergent sites may arise not only from natural mutations that are phylogenetically uninformative, but also from alignment errors. In fact, I think the latter is the more likely explanation for many highly divergent sites in an alignment. This should be mentioned.

Reviewer #2:

The authors have addressed all my concerns and have added more nuance to other sections of the manuscript. This is an excellent contribution to the field. I personally look forward to using ClipKIT!

PLoS Biol. 2020 Dec 2;18(12):e3001007. doi: 10.1371/journal.pbio.3001007.r005

Author response to Decision Letter 2

4 Nov 2020

Attachment

Submitted filename: Steenwyk_etal_second_resubmission_response_to_reviewers.docx

PLoS Biol. doi: 10.1371/journal.pbio.3001007.r006

Decision Letter 3

Roland G Roberts

10 Nov 2020

Dear Dr Rokas,

On behalf of my colleagues and the Academic Editor, Andreas Hejnol, I am pleased to inform you that we will be delighted to publish your Methods and Resources in PLOS Biology.

PRODUCTION PROCESS

Before publication you will see the copyedited word document (within 5 business days) and a PDF proof shortly after that. The copyeditor will be in touch shortly before sending you the copyedited Word document. We will make some revisions at copyediting stage to conform to our general style, and for clarification. When you receive this version you should check and revise it very carefully, including figures, tables, references, and supporting information, because corrections at the next stage (proofs) will be strictly limited to (1) errors in author names or affiliations, (2) errors of scientific fact that would cause misunderstandings to readers, and (3) printer's (introduced) errors. Please return the copyedited file within 2 business days in order to ensure timely delivery of the PDF proof.

If you are likely to be away when either this document or the proof is sent, please ensure we have contact information of a second person, as we will need you to respond quickly at each point. Given the disruptions resulting from the ongoing COVID-19 pandemic, there may be delays in the production process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

EARLY VERSION

The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards,

Vita Usova

Publication Assistant,

PLOS Biology

on behalf of

Roland Roberts,

Senior Editor

PLOS Biology

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Data are well represented in principal component analysis. (TIF)

S2 Fig. Lengths of trimmed MSAs and the associated nRF and ABS values across empirical datasets. (TIF)

S3 Fig. Lengths of trimmed MSAs and the associated nRF and ABS values across simulated datasets. (TIF)

S4 Fig. Pairwise examination of the relationship between alignment length, nRF, and ABS values among empirical datasets. (TIF)

S5 Fig. Pairwise examination of the relationship between alignment length, nRF, and ABS values among simulated datasets. (TIF)

S6 Fig. ClipKIT branch lengths are accurate using simulated alignments from a metazoan phylogeny. (TIF)

S7 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of flowering plants. (TIF)

S8 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of Aspergillus and Penicillium species. (TIF)

S9 Fig. ClipKIT branch lengths are accurate using simulated alignments from a phylogeny of Saccharomycotina yeast. (TIF)

S10 Fig. Concatenated alignment lengths varied by dataset and alignment trimming strategy. (TIF)

S11 Fig. Alignment trimming strategies resulted in nearly identical metrics of accuracy and support for species-level inferences using empirical datasets. (TIF)

S12 Fig. Alignment trimming strategies resulted in nearly identical metrics of accuracy and support for species-level inferences using simulated datasets. (TIF)

S13 Fig. Alignment trimming strategies have similar tree certainty values. (TIF)

S1 Table. GTR substitution rate matrix. Simulated NT alignments were generated using this substitution rate matrix. GTR, general time reversible; NT, nucleotide. (XLSX)

Data Availability Statement

References

1. Talavera G, Castresana J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Syst Biol. 2007;56:564–577. doi:10.1080/10635150701472164

2. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3. doi:10.1093/bioinformatics/btp348

3. Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol. 2010;10:210. doi:10.1186/1471-2148-10-210

4. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–31. doi:10.1126/science.1253451

5. Shen X-X, Steenwyk JL, Labella AL, Opulente DA, Zhou X, Kominek J, et al. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. bioRxiv. 2020. doi:10.1126/sciadv.abd0079

6. Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020. doi:10.1038/s41576-020-0233-0

7. Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, et al. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Syst Biol. 2015;64:778–91. doi:10.1093/sysbio/syv033

8. Shen X-X, Salichos L, Rokas A. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biol Evol. 2016;8:2565–80. doi:10.1093/gbe/evw179

9. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol Biol Evol. 2015;32:268–74. doi:10.1093/molbev/msu300

10. Steenwyk JL, Shen X-X, Lind AL, Goldman GH, Rokas A. A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. MBio. 2019;10. doi:10.1128/mBio.00925-19

11. Whelan NV, Kocot KM, Moroz LL, Halanych KM. Error, signal, and the placement of Ctenophora sister to all other animals. Proc Natl Acad Sci U S A. 2015;112:5773–8. doi:10.1073/pnas.1503453112

12. Xi Z, Liu L, Rest JS, Davis CC. Coalescent versus Concatenation Methods and the Placement of Amborella as Sister to Water Lilies. Syst Biol. 2014;63:919–32. doi:10.1093/sysbio/syu055

13. Shen X-X, Zhou X, Kominek J, Kurtzman CP, Hittinger CT, Rokas A. Reconstructing the Backbone of the Saccharomycotina Yeast Phylogeny Using Genome-Scale Data. G3 (Bethesda). 2016;6:3927–39. doi:10.1534/g3.116.034744

14. Eidem HR, Steenwyk JL, Wisecaver JH, Capra JA, Abbot P, Rokas A. integRATE: a desirability-based data integration framework for the prioritization of candidate genes across heterogeneous omics and its application to preterm birth. BMC Med Genet. 2018;11:107. doi:10.1186/s12920-018-0426-y

15. Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, et al. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008;3:7. doi:10.1186/1748-7188-3-7

16. Mirarab S, Warnow T. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015;31:i44–52. doi:10.1093/bioinformatics/btv234

17. Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–31. doi:10.1038/nature12130

18. Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991. doi:10.1093/oxfordjournals.molbev.a040654

19. Kumar S, Stecher G, Tamura K. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol. 2016. doi:10.1093/molbev/msw054

20. Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 2019;17:e3000333. doi:10.1371/journal.pbio.3000333

21. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3. doi:10.1093/bioinformatics/btp163

22. Van Der Walt S, Colbert SC, Varoquaux G. The NumPy array: A structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30. doi:10.1109/MCSE.2011.37

23. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–80. doi:10.1093/molbev/mst010

24. Fletcher W, Yang Z. INDELible: A Flexible Simulator of Biological Sequence Evolution. Mol Biol Evol. 2009;26:1879–88. doi:10.1093/molbev/msp098

25. Waddell PJ, Steel M. General Time-Reversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites. Mol Phylogenet Evol. 1997;8:398–414. doi:10.1006/mpev.1997.0452

26. Whelan S, Goldman N. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol Biol Evol. 2001;18:691–9. doi:10.1093/oxfordjournals.molbev.a003851

27. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. In Lectures on Mathematics in the Life Sciences, vol. 17. 1986. p. 57–86.

28. Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol. 1994;39:306–14. doi:10.1007/BF00160154

29. Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–90. doi:10.1093/bioinformatics/btg412

30. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–22. doi:10.1093/molbev/msx281

31. Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25:1–18. doi:10.18637/jss.v025.i01

32. Kassambara A, Mundt F. factoextra. R package, v. 1.0.5. 2017.

33. Wickham H. ggplot2. Elegant Graphics for Data Analysis. New York, NY: Springer New York; 2009. doi:10.1007/978-0-387-98141-3

34. Kassambara A. ‘ggpubr’: “ggplot2” Based Publication Ready Plots. R Packag version 025. 2020.

35. Kobert K, Salichos L, Rokas A, Stamatakis A. Computing the Internode Certainty and Related Measures from Partial Gene Trees. Mol Biol Evol. 2016;33:1606–17. doi:10.1093/molbev/msw040

36. Salichos L, Stamatakis A, Rokas A. Novel Information Theory-Based Measures for Quantifying Incongruence among Phylogenetic Trees. Mol Biol Evol. 2014;31:1261–71. doi:10.1093/molbev/msu061

37. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3. doi:10.1093/bioinformatics/btu033
