PLOS One. 2023 Feb 24;18(2):e0279597. doi: 10.1371/journal.pone.0279597

Roadmap to the study of gene and protein phylogeny and evolution—A practical guide

Florian Jacques 1,2, Paulina Bolivar 1, Kristian Pietras 1, Emma U Hammarlund 1,2,*
Editor: Arndt von Haeseler 3
PMCID: PMC9955684  PMID: 36827278

Abstract

Developments in sequencing technologies and the sequencing of an ever-increasing number of genomes have revolutionised studies of biodiversity and organismal evolution. This accumulation of data has been paralleled by the creation of numerous public biological databases through which the scientific community can mine the sequences and annotations of genomes, transcriptomes, and proteomes of multiple species. However, to find the appropriate databases and bioinformatic tools for respective inquiries and aims can be challenging. Here, we present a compilation of DNA and protein databases, as well as bioinformatic tools for phylogenetic reconstruction and a wide range of studies on molecular evolution. We provide a protocol for information extraction from biological databases and simple phylogenetic reconstruction using probabilistic and distance methods, facilitating the study of biodiversity and evolution at the molecular level for the broad scientific community.

Introduction

Living organisms are characterized by an astonishing array of phenotypes. This diversity is the result of billions of years of evolution, from the first primitive cells to modern cells and multicellular organisms, including bacteria, archaea, protists, plants, and animals. How organismal functions, adaptations, and diversifications are related can be studied through molecular evolution, a field that studies variation in the information content in the genetic material through time, namely the evolution of genes, proteins, or other markers such as ribosomal RNA, transposable elements, or other parts of the genome, that have common ancestry and are, therefore, homologous. Homology can result from speciation events, that create homologous genes in different species, or gene or genome duplication, that generate homologous genes in the same genome. Homologous genes in different species (e.g., human and murine hemoglobin) are called orthologues, and homologous genes present in the same genome (e.g., human hemoglobin and human myoglobin) are called paralogues. Gene homology resulting from horizontal transfer is called xenology. Phylogenetics and evolutionary biology study how homologues are related to each other to retrace their evolutionary history.

Evolutionary studies can be carried out in the context of comparative genomics between different species or, alternatively, in the context of population genetics, comparing molecular variation between the populations or individuals of a single species. In the first case, one considers long-term evolution and compares mutations (including substitutions) that have been fixed between different species. The second case considers short-term evolution and concerns the study of variation that is still segregating in a population. In both cases, the evolution of species is usually presented as a phylogenetic tree, a diagram displaying the evolutionary relationships between the sequences or taxa. The tools and methods for phylogenetic inference have become more complex over the past four decades and their use can be challenging.

Molecular evolutionary studies aim at reconstructing the evolutionary histories and relationships of different taxa, genes or genomic components (e.g., transposable elements), as well as understanding the diverse mechanisms and factors underlying evolutionary change, such as mutation, selection, recombination, genetic drift, demographic processes, or biased gene conversion. For these purposes, the integration of novel genomic technologies with evolutionary studies is invaluable. For example, in systematics, the description of new species necessitates knowing how they are related to other species. In epidemiology, the emergence of new infectious diseases and antibiotic resistance requires studying genetic variation of infectious agents and identifying adaptive mutations leading to pathological conditions. A focus on the process of adaptation is also valuable in biological, agricultural and environmental sciences to, for example, protect endangered species or limit the spread of invasive species.

In recent years, technological advances in molecular biology, in particular the sequencing of DNA and RNA, have allowed for an exponential increase of available sequences of nucleotides and amino acids. In addition, these data are coupled with annotations regarding biological functions. The genomic, transcriptomic, and proteomic data are curated in specialized public databases, assets that are paralleled by the development of new statistical methods and computational technology to study gene and protein functions and evolution. While these resources are generally accessible to biologists for studying molecular diversity and evolution, sorting and navigating through them can be challenging. Here, we outline a selection of molecular databases as well as bioinformatic tools and methods for retrieving sequences and reconstructing evolutionary history and processes. In doing so, we follow a recently published phylogenetic protocol [1]. We focus on databases that are maintained and popular. The aim is to provide a practical guide for beginners and more advanced explorers into protein and gene evolution. This is followed by a tutorial on the reconstruction of the evolution of two families of cell cycle-related proteins: P53 and cyclins/cyclin-dependent kinases (CDKs), over organismal history.

Materials and methods

Data included in this study (sequences and accession numbers) are available in S1–S4 Files. There are no ethical or legal restrictions on sharing these data sets. The protocol described in this peer-reviewed article is published on protocols.io, https://protocols.io/view/roadmap-to-the-study-of-gene-and-protein-phylogeny-cknkuvcw and included as S5 File.

Collecting genomic and proteomic information

Dozens of databases store sequences and other biological information about genes and proteins; for a complete list, see [2, 3]. These databases offer query tools to retrieve DNA or amino-acid sequences and other information such as gene architecture or protein structure. They also provide annotations with information about gene or protein properties such as function, polymorphism, activity and pathways, subcellular localization, and tissue expression (Fig 1).

Fig 1. Protocol for reconstructing the phylogeny and evolutionary history of genes and proteins using molecular databases and bioinformatic tools.

Solid arrows indicate the order of actions for the phylogenetic analysis and evolutionary studies. Dashed arrows indicate feedback loops that are needed during the process. A subset of available databases and bioinformatic programs are depicted in the figure. This roadmap is mostly based on a recently published phylogenetic protocol [1].

DNA databases

GenBank [4] and Entrez [5], both maintained at the National Center for Biotechnology Information (NCBI) [6], store nucleotide sequences of all living organisms and, when applicable, their translation into protein, with biological annotation and supporting bibliography (Table 1). They include integrated search tools to retrieve sequences, structures, genetic cartography and bibliography about genes [6]. Ensembl [7] is a genome browser that focuses on chordates and contains information about gene sequence and structure, expression, location on the chromosome, transcript variants, homologues, and gene ontologies. The browser is further expanded into specific databases for invertebrates, plants, fungi, protists, and bacteria in EnsemblMetazoa, EnsemblPlants, EnsemblFungi, EnsemblProtists and EnsemblBacteria. Ensembl is relevant for evolutionary analyses, comparative genomics, and population genetics studies. Data on gene expression patterns across animal species, including anatomical and embryonic information, are stored in the database Bgee [8]. GeneCards [9] stores information on human genes, including biological function, genomics, transcription factor binding sites and protein products, as well as assay products (e.g., siRNA, inhibitors or CRISPR products) and crosslinks to many other databases.

Table 1. List of nucleic acid databases.

Database | Features | Link | References
BAR | Database of plant genes and proteins | http://bar.utoronto.ca/ | [15]
Bgee | Gene expression patterns | https://bgee.org/ | [8]
Ensembl | Genome browser of vertebrates, includes tools for identification of homology | https://www.ensembl.org/index.html | [7]
Entrez | Gene sequences and structures | https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html | [5]
FlyBase | Genome and proteome of the model insect D. melanogaster | https://flybase.org/ | [11]
GeneCards | Human gene function, genomics, transcription factor binding sites and protein products | https://www.genecards.org/ | [9]
GenBank | Annotated DNA sequences | https://www.ncbi.nlm.nih.gov/genbank/ | [4]
NCBI | Collection of databases for molecular biology and medicine providing many bioinformatics tools and services | https://www.ncbi.nlm.nih.gov/ | [6]
PomBase | Genome and proteome of the model yeast S. pombe | https://www.pombase.org/ | [13]
TAIR | Genome and proteome of the model plant A. thaliana | https://www.arabidopsis.org/ | [14]
WormBase | Genome and proteome of the model nematode C. elegans | https://wormbase.org//#012-34-5 | [12]
Xenbase | Genome and proteome of the model amphibian X. laevis | http://www.xenbase.org/entry/ | [10]

Particular model organisms that have provided extensive biological data are stored in organism-specific databases such as XenBase [10], FlyBase [11], and WormBase [12]. They include data concerning genomics, development, gene expression and variants of the amphibian Xenopus laevis, the fruit fly Drosophila melanogaster, and the nematode Caenorhabditis elegans, respectively. Data from the fission yeast Schizosaccharomyces pombe is stored in PomBase [13], which includes the complete genome, gene and protein sequences, and annotations. The Arabidopsis Information Resource (TAIR) [14] provides the complete genome sequence of the model plant Arabidopsis thaliana and information on gene sequence and structure, gene expression, protein sequence and literature. The Bio-Analytic Resource (BAR) [15] for plant biology also provides access to several plant-specific databases, including gene expression and protein tools such as the eFP Browser (https://bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi) [16], that displays gene expression patterns in Arabidopsis, molecular markers and mapping and genomic tools.

Nucleotide sequences can be downloaded from these databases in the FASTA format (a text-based format for representing either nucleotide or amino-acid sequences, where nucleotides or amino acids are represented by single-letter codes). It is also possible to batch-download a large number of sequences from NCBI, by entering their identifiers (accession numbers, GI numbers or GeneIDs) in Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez).
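As an illustration of the FASTA format described above, the following minimal Python sketch reads FASTA-formatted text into a dictionary. The function name `parse_fasta` and the two example records are hypothetical and serve only to show the layout of the format (a header line starting with ">", followed by one or more sequence lines); dedicated libraries handle many more edge cases.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[current_id] += line.upper()
    return records


# Hypothetical records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TTAG
>seq2 another hypothetical record
ATGGCA
"""
records = parse_fasta(example.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence), sequence)
```

The same function can be applied to a downloaded file by passing an open file handle instead of the example lines.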

Protein databases

General information on proteins

The Universal protein resource (UniProt) [17] is currently the main source of information for proteins (Table 2). UniProt contains published amino-acid sequences and open-reading-frame translations, with various annotations including structure, classification, biological function, and subcellular localization of the protein. Protein sequences can be downloaded from UniProt in the FASTA format. The resource also provides links to many other public databases. The “retrieve/ID mapping” tool of UniProt (https://www.uniprot.org/id-mapping) facilitates batch downloads of information on a set of proteins using UniProt identifiers. This tool can also be used to convert UniProt identifiers to the identifiers of external databases such as NCBI, GenBank, Ensembl or the Protein Data Bank (PDB). Gene Ontology [18] provides a unified annotation system of the molecular function, biological processes, and cellular components of proteins across all species. Information about genes, proteins, and genomes that is acquired from several ‘omics technologies is further gathered in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [19]. The KEGG database focuses on metabolism, biological pathways, and human diseases. Summaries of the entire human proteome, obtained using antibody-based proteomics, transcriptomics, and integration of other omics technologies, are gathered in the Human Protein Atlas [20]. This atlas displays expression profiles, subcellular localization, tissue and organ distribution, and protein function in human metabolism, as well as information about diseases such as cancer. The PHAROS database [21, 22] provides an overview of the literature on human proteins, including their classification, pathways, expression data, and related diseases.

Table 2. List of protein databases.
Database | Features | Link | References
CATH | Classification of protein domains based on their structure, functionality, and evolution | https://www.cathdb.info/ | [23]
FSSP | Classification of protein domains based on their structural similarity | https://archive.ph/20121222235655/http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+LibInfo+-id+5Ti2u1RffMj+-lib+FSSP | [24, 25]
Gene Ontology | Unified annotation of molecular function, biological processes, and cellular components of proteins | http://geneontology.org/ | [18]
Human Protein Atlas | Information on human proteins and their link with diseases | https://www.proteinatlas.org/ | [20]
InterPro | Classification of protein domains and functional sites | https://www.ebi.ac.uk/interpro/ | [17]
KEGG | Protein function and biological pathways | https://www.genome.jp/kegg/ | [19]
PDB | 3D structures of proteins | https://www.rcsb.org/ | [26]
Pfam | Information about protein families and domains | http://pfam.xfam.org/ | [27]
PHAROS | Centralizes literature for human proteins | https://pharos.nih.gov/ | [21, 22]
PRINTS | Protein fingerprints classification database | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/ | [28]
PROSITE | Protein family database | https://prosite.expasy.org/ | [29]
SCOP | Classification of protein domains based on their structure, functionality and evolution | https://scop.mrc-lmb.cam.ac.uk/ | [30]
SUPERFAMILY | Protein structure and functions | https://supfam.org/ | [31]
UniProt | General information on proteins, including sequence, structure, classification, function, subcellular localization and simple homology identification | https://www.uniprot.org/uniprot/ | [17]

Protein structure databases

Protein structures are described by databases such as the PDB [26]. The PDB provides three-dimensional (3D) structures of proteins and their interacting ligands established by X-ray crystallography, electron microscopy, or NMR spectroscopy, which can be retrieved as pdb files. The PDB also displays a 3D visualization tool, programs for 3D analyses such as pairwise structure alignment and pairwise symmetry, and cross links to other protein databases. Annotation for protein families based on fingerprints (i.e., conserved 3D motifs specific for a protein family), are gathered in the database PRINTS [28]. PRINTS includes a 3D visualization software and search tools for protein sequence homology and pairwise or multiple sequence alignments.

Protein classification

Proteins are classified into different categories based on structural similarity, functionality and evolutionary relationship (Table 2). The 3D structural classification of proteins (SCOP) [30] classifies protein domains according to their class, fold, superfamily, and family. Large proteins can have several domains belonging to different categories. The class and fold levels are based on protein structures. Most proteins belong to one of the five structural classes (α, β, α/β, α+β, multi-domain), defined respectively by the presence of α-helices, β-strands, both α-helices and β-strands, segregated α-helices and β-strands, or none of these characteristics. Below this primary level, a protein’s secondary structure is reflected in folds. The other levels of protein classification (superfamily and family) are based on evolutionary relationships. Proteins with shared ancestry are classified in the same superfamily, and proteins sharing 30% or more sequence identity are classified in the same family [32]. Similarly, the Class, Architecture, Topology, Homology (CATH) database proposes a five-level classification of protein domains. The first three levels: class, architecture and topology, are based on structural homology. The last two, homologous superfamily and family, are based on sequence, structure and functional specificities, and sequence identity, respectively [23]. The Families of Structurally Similar Proteins (FSSP) provides a classification of proteins in the PDB based on a structure comparison algorithm, that calculates a structural similarity score between protein chains. These similarity scores are used to create a classification of protein structures [24, 25].

SUPERFAMILY [31] is a database of structurally and functionally annotated proteins. The database Protein families (Pfam) [27] classifies protein domains based on multiple sequence alignment. Pfam contains information on protein domain structures and their occurrence in living organisms. The homologues of a protein are listed in Pfam and their sequences can be downloaded in the FASTA format. For any family, Pfam displays a graphical view of all species possessing the protein domain. The domain families and functional sites of proteins are the focus of the InterPro database [17]. It combines structure-based and phylogenetic classifications. InterPro uses predictive models called signatures, which are used to infer functions for a sequence in association with the database Gene Ontology [18]. Biological information about the protein family, domains, and functional sites is gathered in the PROSITE database [29]. PROSITE also provides tools for identifying distant homology between sequences.

Homologue research

Studying the evolution of a family of genes or proteins requires the identification of homologues (i.e., genes or proteins with shared ancestry). Homologues of a gene or protein can be identified using appropriate tools (Table 3). The Basic Local Alignment Search Tool (BLAST) [33], maintained at the NCBI, is the most widely used heuristic algorithm for researching sequence homology. Homologues of a gene or protein can be retrieved from the genomes or proteomes of specified taxa with a significance score called E-value (Expect value), which is defined as the number of expected hits of similar quality that can be obtained by chance [34]. A lower E-value corresponds to a higher statistical significance of the match. A BLAST search also provides the identity and similarity of the hits. BLAST can be used for nucleotide (BLASTn) or amino-acid sequences (BLASTp), translated nucleotide to protein (BLASTx) and protein to translated nucleotide (tBLASTn). UniProt [17] also provides a tool to identify proteins in the database sharing 50% or 90% identity with any protein, including paralogues and orthologues. Pfam [27] and Ensembl [7] also include tools to identify homologues of any given gene and retrieve their sequences. Other search tools include HMMER [35] of the HH-suite software, FASTA [36], SSAHA [37] and BLAT [38]. Homologues of a given sequence should be compiled into a single FASTA file. FASTA files can be made by retrieving sequences from the databases and manually adding the sequences one by one. For large datasets, for example the results of an exhaustive search for homologues, it is possible to directly export a large number of sequences in the FASTA format from NCBI or UniProt [17].
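To illustrate how E-values can be used to select candidate homologues, the sketch below filters BLAST hits given in the standard 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `filter_hits`, the E-value cutoff of 1e-5, and the three example rows are assumptions chosen for illustration; the appropriate cutoff depends on the dataset and the question.

```python
import csv
from io import StringIO

# Column order of the default BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_identity=0.0):
    """Return (subject, percent identity, E-value) for hits passing both cutoffs."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        identity = float(hit["pident"])
        if evalue <= max_evalue and identity >= min_identity:
            hits.append((hit["sseqid"], identity, evalue))
    # Lowest E-value (most significant match) first.
    return sorted(hits, key=lambda h: h[2])


# Hypothetical BLAST results, for illustration only.
table = (
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t2e-100\t380\n"
    "query1\tsubjB\t45.0\t180\t99\t4\t5\t184\t10\t189\t1e-20\t95\n"
    "query1\tsubjC\t30.0\t50\t35\t2\t1\t50\t3\t52\t0.5\t25\n"
)
hits = filter_hits(StringIO(table))
for subject, identity, evalue in hits:
    print(subject, identity, evalue)
```

With these example rows, the third hit is discarded because its E-value of 0.5 means that a match of this quality is expected to occur by chance about once every two searches.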

Table 3. List of bioinformatic tools for identification of gene and protein homologues.

Database | Features | Link | References
BLAST | Protein or DNA homology search in NCBI | https://blast.ncbi.nlm.nih.gov/Blast.cgi | [33]
BLAT | Protein or DNA homology search in animal genomes | https://genome.ucsc.edu/cgi-bin/hgBlat | [38]
Ensembl | Genome browser of vertebrates, includes tools for identification of homology | https://m.ensembl.org/index.html | [7]
FASTA | Protein or DNA homology search and sequence alignment | https://www.ebi.ac.uk/Tools/sss/fasta/ | [36]
HMMER | Gene and protein homology search | http://hmmer.org/ | [35]
Pfam | Protein families and domains, includes tools for identification of homology | http://pfam.xfam.org/ | [27]
SSAHA | DNA sequence search and alignment | https://www.sanger.ac.uk/tool/ssaha/ | [37]
UniProt | General information on proteins, includes tools for identification of similarity | https://www.uniprot.org/uniprot/ | [17]

Phylogenetic analysis

Exploring molecular evolution necessitates studying how species, genes, and proteins are related to each other in an evolutionary sense. Phylogenetic relationships can apply to species, genes, or proteins even within the same genome. Reconstructing evolution from molecular data (amino-acid or nucleotide sequences) includes the steps of sequence alignment and trimming, phylogenetic analysis, and study of molecular evolution using a phylogenetic tree. Below, we describe the tools used in these different steps.

Multiple sequence alignment

Aligning gene or protein sequences consists in inferring homology between bases or amino acids. The sequences are placed in the rows of a matrix, one per row, so that homologous bases or amino acids fall in the same column. Alignment of the homologous residues necessitates adding gaps, indicated by the symbol “-” and corresponding to insertions or deletions (indels), into the sequences. Sequence alignment methods include the progressive approach and the consistency-based method. The progressive approach aligns progressively from the two closest to the most distant sequences. It is used by CLUSTALW [39], CLUSTAL Omega [40], MUSCLE [41], PRANK [42, 43], Kalign [44] and MAFFT [45]. Consistency-based methods calculate the best multiple sequence alignment (MSA) after different pairwise alignments using information from a third sequence as intermediate [46–48]. They are used by T-COFFEE [49], PROBCONS [50] and its successor CONTRAlign [51], the latter for amino acid sequences only (Table 4). Other approaches include the iterative refinement method, which is also included in MAFFT and MUSCLE [52], the genetic algorithms [53], and methods that use hidden Markov models [54].
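Progressive methods build a multiple alignment from pairwise alignments. As a minimal sketch of that building block, the following Python function performs a global pairwise alignment by dynamic programming (the Needleman-Wunsch algorithm) and inserts “-” symbols where indels are inferred. The scoring values (match = 1, mismatch = -1) and the single linear gap penalty of -2 are arbitrary assumptions for illustration; the MSA programs listed above use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATCA")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

When several alignments share the best score, the traceback returns only one of them, which is one reason why an optimal score does not guarantee a biologically correct alignment.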

Table 4. List of programs for multiple sequence alignment.

Software | Features | Link | References
BAli-Phy | Multiple sequence alignment of nucleotide and amino acid sequences and phylogenetic analysis using BI | http://www.bali-phy.org | [59]
CLUSTAL Omega* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data | https://www.ebi.ac.uk/Tools/msa/clustalo/ | [40]
CLUSTALW* | Multiple sequence alignment for nucleotide or amino acid data | https://www.genome.jp/tools-bin/clustalw | [39]
CONTRAlign (ProbCons) | Accuracy-oriented multiple sequence alignment for amino acid data | http://contra.stanford.edu/contralign/ | [50, 51]
Kalign* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data | https://www.ebi.ac.uk/Tools/msa/kalign/ | [44]
MAFFT* | Multiple sequence alignment for nucleotide or amino acid data | https://mafft.cbrc.jp/alignment/server/ | [45]
MUSCLE* | Multiple sequence alignment for nucleotide or amino acid data | https://www.ebi.ac.uk/Tools/msa/muscle/ | [41]
PASTA | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, designed for very large datasets | https://bioinformaticshome.com/tools/msa/descriptions/PASTA.html | [60]
PRANK/WebPRANK* | Multiple sequence alignment for nucleotide or amino acid data, should be preferred for close sequences | http://wasabiapp.org/software/prank/; https://www.ebi.ac.uk/goldman-srv/webprank/ | [42, 43]
SATé | Software package for multiple sequence alignments and phylogenetic inference | https://phylo.bio.ku.edu/software/sate/sate.html | [62]
T-COFFEE* | Multiple sequence alignment of nucleotide and amino acid sequences | http://tcoffee.crg.cat/ | [49]
UPP | Speed-oriented multiple sequence alignment of nucleotide and amino acid sequences, designed for very large data sets | https://github.com/smirarab/sepp | [61]

(*: include online version).

Several MSA tools, including CLUSTALW [39], MUSCLE [41], MAFFT [45], Kalign [44] and PRANK [42, 43], display inferred MSAs using user interfaces. CLUSTALW and MUSCLE are also included in MEGA [55]. PROBCONS [50], T-COFFEE [49] and MAFFT [45] are described to have particularly high accuracy but also high execution times [56]. Their use should be restricted to small and intermediate datasets. CLUSTAL Omega [40] and Kalign [44] are particularly fast, but less accurate [57]. They can be used to analyse datasets of up to 4,000 and 2,000 sequences, respectively [56, 57]. The performance of MUSCLE is intermediate [57]. PRANK is meant for closely-related sequences [58]. BAli-Phy [59] performs a Bayesian co-estimation of alignment, phylogeny, and other parameters and is also argued to be very reliable. PASTA [60] and UPP [61], which uses a machine-learning technique, are designed for very large datasets. MAFFT offers a wide range of methods, which can be accuracy-oriented, such as L-INS-i, G-INS-i and E-INS-i; or speed-oriented, such as FFT-NS-2. The latter can be used for up to 30,000 sequences. Simultaneous Alignment and Tree Estimation (SATé) [62] is a software package providing several tools for sequence alignment and phylogenetic analysis.

In practice, finding the accurate MSA can be challenging for several reasons. First, one should keep in mind that the alignment with the best score is not necessarily biologically correct. Computer programs are not based explicitly on the hypothesis of homology between aligned residues [63]. Furthermore, it is difficult to get a good alignment for sequences that have diverged significantly and share low identity. In this case, for protein-coding sequences, amino-acid data should be preferred over nucleotide data, since it is possible to consider the biochemical similarity of amino acids [64]. Alignment programs require defining a gap-opening penalty and a gap-extension penalty, but these values are arbitrary. It is common that different sequences in the alignment do not have the same length, for biological or experimental reasons. It is recommended to keep end-gaps unpenalized [64]. Furthermore, indels are reported to affect the accuracy of MSA programs. It is recommended to use several MSA programs for sequences that contain indels [65]. MAFFT is reported to be the most accurate program in the cases of sequences with non-overlapping deletions and alternatively spliced gene products [65]. Furthermore, single nucleotides, small sequences (e.g., microsatellites) or entire protein domains, can be repeated in a gene or protein sequence. If the number of repeats differs between sequences, one domain of a sequence can be homologous to several domains of another sequence. It is recommended to excise the repeated domains [64].

Alignment trimming

Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or might be subject to saturation. They should be excluded prior to the phylogenetic analysis to maximize the phylogenetic signal of the alignment [66]. A minimum reporting standard has been developed to quantify the alignment completeness, and implemented in AliStat [67]. Phylogenetically informative regions of the alignment can be selected using appropriate tools, such as Guidance 2 [68], GBlocks [69], trimAl [70], BMGE [71] and Noisy [72] (Table 5).
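The idea of excluding unreliable positions can be sketched with a simple filter that removes alignment columns in which the fraction of gaps exceeds a threshold. This is a deliberately simplified illustration, not the algorithm used by any of the programs cited above, which rely on more elaborate criteria; the function name `trim_gappy_columns`, the 50% threshold, and the example alignment are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence names to aligned sequences of equal length.
    """
    if not alignment:
        return {}
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    # Keep the indices of columns that are sufficiently complete.
    keep = [col for col in range(length)
            if sum(seq[col] == "-" for seq in sequences) / len(sequences)
            <= max_gap_fraction]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in alignment.items()}


# Hypothetical three-sequence alignment, for illustration only.
alignment = {"a": "ATG-CA", "b": "ATG-CA", "c": "A-GTCA"}
trimmed = trim_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

In this example, the fourth column (two gaps out of three sequences) is removed, while the second column (one gap out of three) is retained.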

Table 5. List of programs for sequence alignment trimming.

Software | Features | Link | References
AliStat | Quantification of alignment completeness for alignment refinement | https://github.com/thomaskf/AliStat | [67]
BMGE | Selection of informative regions on multiple sequence alignments | https://gitlab.pasteur.fr/GIPhy/BMGE | [71]
GBlocks | Selection of informative regions on multiple sequence alignments | http://molevol.cmima.csic.es/castresana/Gblocks.html | [69]
Guidance 2* | Selection of informative regions on multiple sequence alignments | http://guidance.tau.ac.il/ | [68]
Noisy | Selection of informative regions on multiple sequence alignments | http://www.bioinf.uni-leipzig.de/Software/noisy/ | [72]
trimAl | Selection of informative regions on multiple sequence alignments | http://trimal.cgenomics.org/ | [70]

(*: tools that include online version).

Assessing phylogenetic assumptions

Phylogenetic methods rely on simple assumptions about the evolutionary processes, stating, for example, that all sites in the alignment evolved under the same tree (treelikeness), that mutation rates have remained constant over time (time-homogeneity), and that substitutions are reversible and, therefore, also stationary (for details on these assumptions, see [73]). If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses may become biased [74–76]. Once the alignment is performed and the sites have been selected for phylogenetic inference, it is recommended to assess those phylogenetic assumptions when possible [1]. Statistical methods allowing users to test stationarity and homogeneity of the evolutionary processes (along diverging lineages), and treelikeness, have been developed and included in IQ-TREE and IQ-TREE2 [77, 78]. Homo2.1 (https://github.com/lsjermiin/Homo.v2.1) [79] is designed for the analysis of compositional heterogeneity in sequence alignments. It is also possible to use the R package MOTMOT [80].

Phylogenetic protocol selection

The choice of a phylogenetic method can be challenging. The appropriate phylogenetic method depends on the phylogenetic assumptions of each method. Inferring the correct phylogenetic tree requires that the data do not violate the assumptions of the method. Most phylogenetic methods assume that the sequences have evolved under time-reversible Markovian conditions (i.e., the nucleotides or amino-acids have evolved independently of time and their past history). Most model selectors consider only time-reversible Markovian models. However, if the data have evolved under more complex, non-time reversible Markovian conditions, identifying the sequence evolution model that fits the data and reconstructing the phylogenetic tree may be complex, since phylogenetic methods for such data are lacking [1]. A model of sequence evolution selected as the best fit does not necessarily imply that it adequately describes the data. Poorly fitting models are inadequate approximations of the evolutionary processes and can lead to errors. In this case, it is recommended to test the goodness of fit between the phylogenetic tree, the substitution model, and the data (see paragraph Test of goodness of Fit).

Selection of the optimal model of sequence evolution

Probabilistic and distance methods require selection of the model of molecular evolution that best describes the data. Several models of nucleotide or amino-acid substitution exist [81]. The nucleotide substitution models differ in the number of parameters considered, like mutation rates and base frequencies [82] (for a review see [83]). The main nucleotide substitution models are, from the simplest to the most complex: JC69 [84], K80 [85], F81 [86], HKY85 [87], TN93 [88] and GTR [89]. The main amino acid substitution models include JTT [90], WAG [91], LG [92] and Dayhoff [93]. Models of codon evolution also exist [94].
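As a worked example of the simplest of these models, the JC69 model assumes equal base frequencies and equal rates for all substitutions, which leads to the distance d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites. The Python sketch below computes p and the JC69-corrected distance for two aligned nucleotide sequences; the function names and the example sequences are hypothetical, and columns containing a gap are simply ignored, which is one of several possible conventions.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences (gaps ignored)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(seq1, seq2):
    """JC69 distance: expected number of substitutions per site."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        # The correction is undefined at saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences differing at 1 of 10 sites.
s1 = "ACGTACGTAC"
s2 = "ACGTACGTAT"
print("p-distance:", p_distance(s1, s2))
print("JC69 distance:", round(jc69_distance(s1, s2), 4))
```

The corrected distance is slightly larger than the observed proportion of differences because the model accounts for multiple substitutions occurring at the same site.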

These models can be associated with models of substitution rate heterogeneity across sites. Mutation rates and selective pressure may vary among sites, due to different roles in the structure and function of the gene or protein. The most common rate heterogeneity across sites models are the Gamma distribution (G) and the proportion of invariant nucleotide or amino acid sites (I). Every substitution model can be associated with G, I or both. The FreeRate model (R), a more complex model of rate heterogeneity [95], is included in ModelFinder, PhyML and IQ-TREE. More recently, the GHOST model for alignments with variation in mutation rate was introduced and implemented in IQ-TREE [96].

The likelihood of the different models should be computed by appropriate software (Table 6). For every model of sequence evolution (i.e., a combination of a substitution model and a model of rate heterogeneity across sites), these tools calculate the Bayesian information criterion (BIC) [97] and the Akaike information criterion (AIC) [98] from the log-likelihood scores. Both criteria penalize the number of free parameters, so the model with the lowest BIC or AIC score offers the best trade-off between fit and complexity and should be selected. ModelTest and jModelTest [99, 100] perform this comparison for nucleotide sequences and ProtTest [100] for amino acid sequences. ModelFinder [101] is a model selection method, for alignments of nucleotides, codons or amino acids, implemented in IQ-TREE [77, 78]. PartitionFinder 2 [102] can be used with nucleotide and amino acid data. Model selectors are also included in programs such as MEGA [55] and PhyML (SMS) [103].
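As a minimal sketch of how the two criteria are derived from log-likelihood scores, the following Python example ranks three candidate models. The log-likelihoods, the parameter counts and the alignment length are hypothetical values chosen for illustration, and only the free parameters of the substitution model are counted (model selectors also count branch lengths, which are shared by all models compared on the same tree).

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL, with n the number of sites."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical results: model name -> (log-likelihood, free parameters of the model)
candidates = {
    "JC69": (-5230.4, 0),
    "HKY85": (-5102.7, 4),
    "GTR+G": (-5098.9, 9),
}
n_sites = 1200  # hypothetical alignment length

scores = {
    name: (aic(lnl, k), bic(lnl, k, n_sites))
    for name, (lnl, k) in candidates.items()
}

# The model with the lowest score is preferred under each criterion.
best_aic = min(scores, key=lambda name: scores[name][0])
best_bic = min(scores, key=lambda name: scores[name][1])
print("Best model by AIC:", best_aic)
print("Best model by BIC:", best_bic)
```

In this toy comparison the small gain in likelihood of the richest model does not compensate for its additional parameters, so both criteria prefer the intermediate model.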

Table 6. List of programs for molecular evolution model selection.

Software Features Link References
ModelFinder Fast model selection method with a model of rate heterogeneity between sites (nucleotides, amino acids, codons) Implemented in IQ-Tree [101]
ModelTest / jModelTest Nucleotide substitution model selection http://evomics.org/resources/software/molecular-evolution-software/modeltest/ [99]
PartitionFinder 2 Molecular evolution model selection http://www.robertlanfear.com/partitionfinder/ [102]
ProtTest Amino-acid substitution model selection (nucleotides, amino acids) https://github.com/ddarriba/prottest3 [100]
SMS Nucleotide or amino-acid model selection included in PhyML (nucleotides, amino acids) http://www.atgc-montpellier.fr/sms/ [103]

Phylogenetic analysis

A phylogenetic tree is a graphical illustration of the evolutionary relationships between taxa, genes, or proteins. For comprehensive reviews, see [104, 105]. Phylogenetic trees may consider the topology and the branch lengths (phylograms) or just the topology (cladograms). Several tree-building methods exist. Distance methods create a matrix of molecular distances, defined by the numbers and types of differences between the sequences, and they use this matrix to reconstruct the phylogenetic tree. Character-based methods compare all sequences at the same time, site by site. They include Maximum Parsimony (MP) and the probabilistic methods: Maximum Likelihood (ML) [106] and Bayesian Inference (BI) [107–109]. Maximum parsimony is a classical and simple method, now rarely used with molecular data, that calculates the minimum number of evolutionary steps, including nucleotide insertions, deletions or substitutions, between species. The main weakness of this method is that it ignores hidden mutations and does not consider branch lengths. This can lead to incorrect clustering of unrelated taxa, a phenomenon known as long branch attraction (for a review, see [110]). Probabilistic methods are the most recent and today the most widely used methods for phylogenetic inference. They are more relevant for molecular phylogenetics because they use specified models of molecular evolution and rely on likelihood calculations, but their execution time is longer.

Phylogenetic analysis using distance methods

Distance methods include the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [111], Neighbor Joining (NJ) [112], and Minimum Evolution (ME) [113] tree-inference methods. FastME [114] is designed for phylogenetic inference using diverse distance methods with nucleotide or amino-acid data or distance matrices (Table 7). PyCogent [115] is a software library for genomic biology, allowing for phylogenetic analysis and a large number of evolutionary, statistical and genomic analyses, including partition models, as well as graphical display and annotation of phylogenetic trees. SplitsTree [116, 117] is used for inference of unrooted phylogenetic trees and phylogenetic networks from sequence alignments or distance matrices. APE (Analysis of Phylogeny and Evolution) [118] is a package written in the R language that provides a wide range of evolutionary analyses, including calculating genetic distances and computing phylogenetic trees using distance-based methods. PAUP [119], MEGA [55], FastTree [120] and PHYLIP [121] can also be used for distance-based methods, as well as for ML and/or MP [122, 123].
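The distance matrix that these methods start from can be sketched in a few lines of Python. The example below is an illustration only: it computes the proportion of differing sites (p-distance) between aligned nucleotide sequences and applies the JC69 correction for multiple substitutions, d = -3/4 ln(1 - 4p/3). The toy alignment and the function names are assumptions made for this example, and the programs cited above offer many more models.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """JC69 correction for multiple hits: d = -3/4 ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(alignment):
    """Pairwise JC69 distances for a dict {name: aligned sequence}."""
    names = list(alignment)
    return {
        (x, y): jc69_distance(p_distance(alignment[x], alignment[y]))
        for x in names
        for y in names
    }

# Toy alignment (hypothetical sequences of equal length)
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "ACGAACGTTA"}
matrix = distance_matrix(toy)
for (x, y), d in matrix.items():
    if x < y:
        print(x, y, round(d, 4))
```

A tree-building algorithm such as UPGMA or NJ would then take this matrix as its only input.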

Table 7. List of programs for phylogenetic analysis using distance methods, maximum parsimony, maximum likelihood and Bayesian inference.
Software Features Link References
APE R-written package for molecular phylogenetics using distance-based methods http://ape-package.ird.fr [118]
BAli-Phy Phylogenetic inference using BI http://www.bali-phy.org [59]
BayesTraits Phylogenetic inference and other evolutionary analyses using BI http://www.evolution.reading.ac.uk/BayesTraitsV4.0.0/BayesTraitsV4.0.0.html [136]
FastME* Phylogenetic inference using distance methods http://www.atgc-montpellier.fr/fastme/ [114]
GARLI Phylogenetic inference using ML http://evomics.org/resources/software/molecular-evolution-software/garli/ [137]
HYPHY* Phylogenetic inference using ML and distance methods https://www.hyphy.org/ [128]
IQ-TREE* Phylogenetic inference using ML, including model selection and a very fast bootstrapping method http://www.iqtree.org/ [77, 78]
MEGA Sequence alignment, model selection, phylogenetic analysis using distance methods, MP and ML, and other evolutionary analyses https://www.megasoftware.net/ [55]
MrBayes Phylogenetic inference using BI and diverse evolutionary analyses, including ancestral states reconstruction and time calibration http://nbisweden.github.io/MrBayes/ [133]
PAML Phylogenetic inference using ML, estimation of selection strength, ancestral states reconstruction and other evolutionary analyses http://abacus.gene.ucl.ac.uk/software/paml.html [127]
PAUP Phylogenetic inference using MP and ML http://paup.phylosolutions.com/ [119]
PHYLIP Phylogenetic inference using MP, distance methods and ML https://evolution.genetics.washington.edu/phylip.html [121]
PhyloBayes Phylogenetic inference with protein data using BI using a specific probabilistic model http://www.atgc-montpellier.fr/phylobayes/ [134]
PhyML* Phylogenetic inference using ML, ancestral states reconstruction and various evolutionary analyses http://atgc.lirmm.fr/phyml/ [125]
PyCogent Phylogenetic inference, tree drawing, various evolutionary analyses, including partition models and ancestral states reconstruction https://github.com/pycogent/pycogent [115]
RAxML* Phylogenetic inference using ML https://cme.h-its.org/exelixis/web/software/raxml/ [126]
SeaView Sequence alignment and phylogenetic inference using MP, NJ and ML http://doua.prabi.fr/software/seaview [124]
SplitsTree Phylogenetic inference for unrooted trees and phylogenetic networks https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/ [117]

Phylogenetic analysis using maximum likelihood

Maximum-likelihood methods calculate the likelihood of observing the data under different explicit models of molecular evolution. Maximum likelihood aims to identify the best fit model by exploring multiple combinations of trees and model parameters. Programs for ML phylogenetic analysis include MEGA [55], SeaView [124], PhyML [125], RAxML [126], FastTree [120], PAML [127], PAUP [119], IQ-TREE [77, 78], HYPHY [128], PHYLIP [121] and GARLI [129] (Table 7). All of them can be used with nucleotide or amino-acid data. MEGA [55] and SeaView [124] are known to be very user-friendly. They include sequence alignment tools and tree manipulators. PhyML [125] is reported as being accurate, easy to use and, like PAUP and MEGA [55], includes many common models of substitution. RAxML [126] and particularly FastTree [120] are fast and well suited for large datasets (up to 1 million sequences with FastTree). In addition to assuming Gamma-distributed rate-heterogeneity across sites and the proportion of invariant sites, they include CAT, a specific model of rate heterogeneity [130]. IQ-TREE [77, 78], which includes ModelFinder [101] and the very fast bootstrapping method UFBoot2 [131], is reported to be both fast and accurate [132].
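To illustrate the principle in its simplest form, the sketch below estimates a single parameter, the distance between two aligned nucleotide sequences, by maximizing the likelihood under the JC69 model. It is an assumption-laden toy: the grid search, the function names and the site counts are chosen for this example, whereas the programs listed above optimize the tree topology, all branch lengths and the model parameters together.

```python
import math

def jc69_log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of a pairwise alignment under JC69 for a distance d > 0."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    n_same = n_sites - n_diff
    # 0.25 is the equilibrium frequency of the base observed in the first sequence
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_distance(n_sites, n_diff, step=0.0001, upper=2.0):
    """Grid search for the distance that maximizes the JC69 log-likelihood."""
    grid = [step * i for i in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(n_sites, n_diff, d))

# Hypothetical data: 100 aligned sites, 10 of which differ
d_hat = ml_distance(100, 10)
print("ML estimate of the distance:", round(d_hat, 4))
```

For this two-sequence case the maximum of the likelihood coincides with the analytical JC69 distance, -3/4 ln(1 - 4p/3), which gives a convenient check of the search.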

Phylogenetic analysis using Bayesian Inference

The most recently developed method for phylogenetic reconstruction uses Bayesian Inference (BI). This method calculates the posterior probability of the tree and model of sequence evolution, given the data. The main software used for BI-based phylogenetics is MrBayes [133], which relies on the Markov Chain Monte Carlo (MCMC) algorithm (Table 7). PhyloBayes [134, 135] is a Bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model. It is well adapted for large datasets and phylogenomics. BAli-Phy [59] can also be used for phylogenetic analysis using BI.

Test of reliability of the inferred tree

It is recommended to estimate the reliability of the clades of the inferred phylogenetic tree. Most programs of phylogenetic analysis use the non-parametric bootstrapping method [138]. Bootstrapping is a resampling technique used to assess the repeatability of the clade, and estimate how consistently it is supported by the data [139]. The sites in the alignment (nucleotides or amino acids) are randomly resampled with replacement and a new phylogeny is calculated for each replicate [138]. A bootstrap value, corresponding to the proportion of replicate phylogenies that recovered the clade, is calculated for every internal branch. A bootstrap value of 100% means that the branch is supported by all resampled datasets, while low values mean that only few of these datasets support the branch. Bootstrap values depend on both data and the method used. Users should keep in mind that bootstrapping gives a measure of the consistency of the estimate, but it is not a measure of the accuracy of the tree [140]. The number of replicates that are necessary to obtain a good accuracy of the bootstrap depends on the bootstrap value. For example, for a 1% confidence interval on a bootstrap value of 95, 2,000 replicates are necessary [139].
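The resampling step and the relation between the number of replicates and the precision of a bootstrap value can be sketched as follows. This is a minimal illustration, not the procedure of any specific program: the function names and the toy alignment are assumptions, and the replicate count uses a normal approximation to the binomial distribution with a 95% confidence level (z = 1.96).

```python
import math
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are drawn for every sequence, so sites stay aligned.
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def clade_support(replicate_has_clade):
    """Bootstrap value (%): share of replicate trees that recovered the clade."""
    return 100.0 * sum(replicate_has_clade) / len(replicate_has_clade)

def replicates_needed(support, half_width, z=1.96):
    """Replicates for a confidence interval of +/- half_width around a support value.

    Both arguments are proportions (0.95 for a bootstrap value of 95).
    """
    return math.ceil(z ** 2 * support * (1.0 - support) / half_width ** 2)

rng = random.Random(42)  # fixed seed so that the example is reproducible
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACGT"}
replicate = bootstrap_alignment(toy, rng)
print(replicate)
print(replicates_needed(0.95, 0.01))
```

Under this approximation, a bootstrap value of 95 with a margin of 1% requires roughly 1,800 replicates, which is in line with the figure of about 2,000 quoted above.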

Since bootstrapping can be time consuming, fast approximation methods for phylogenetic bootstrap, UFBoot and UFBoot2, have been developed and implemented in IQ-TREE [131, 141, 142]. They are also reported to be less biased than the standard non-parametric bootstrap and robust against moderate model violations. Whereas the standard bootstrap tends to underestimate the probability that a clade is correct, the support values from UFBoot and UFBoot2 are designed to approximate this probability more closely, which simplifies their interpretation [141].

The approximate likelihood ratio test (aLRT), implemented in PhyML [143], is an alternative to the non-parametric bootstrap. Bayesian inference methods use posterior probabilities (PP) to measure branch support. It is also possible to compare the topology of different trees. In ML-based phylogenetic analysis, the Shimodaira-Hasegawa (SH) test and its improved version, the approximately unbiased (AU) test [144], have been designed to evaluate alternative phylogenetic hypotheses and to test whether one tree is better supported than another. These tests can be used with PAUP, PhyML, FastTree and IQ-TREE.

Tree rooting

The root of a phylogenetic tree is the hypothetical last common ancestor of all the sequences present in the tree. Depending on the question asked, phylogenetic trees can be unrooted or rooted. Rooting makes it possible to distinguish ancestral from derived states and thus to study the direction of the evolution of the sequences [69]. Diverse methods have been developed to root phylogenetic trees. The most common consists in including outgroups (i.e., sequences that are known to fall outside the ingroup of interest while remaining closely related to it) in the analysis. Typically, two outgroups are selected, one being more closely related to the ingroup than the other, allowing for a proper identification of the states of characters. Correctly rooting a phylogeny can be challenging, for example in the case of rapid evolutionary radiations. The outgroups can be subject to long-branch artifacts and tend to cluster with the longest branches of the tree [145]. One study suggests reconstructing the trees with and without outgroups: when the outgroup affects the topology, the tree with no outgroups should be preferred [146]. When outgroups are not included, alternative methods can be used. For example, midpoint rooting places the root at the mid-point between the two most dissimilar sequences in the tree, and molecular clock rooting assumes that the rate of evolution is constant across the sequences [69].

Test of goodness-of-fit

The inferred optimal model of sequence evolution used for the phylogenetic analysis can be inadequate. Once a phylogenetic tree has been inferred, it is recommended to test the goodness-of-fit (i.e., the adequacy) between the tree, the model, and the data [1]. A good fit means that the tree and the model of sequence evolution provide a good explanation of the data but does not indicate whether the tree is correct. The goodness of fit can be tested using a parametric bootstrap [147], a method that consists in simulating sequence evolution to generate pseudo-data, using the optimal tree and the optimal model as input. Sequence-generating programs, such as Seq-Gen (https://github.com/rambaut/Seq-Gen/releases/tag/1.3.4, [148]), can be used. The goodness-of-fit is calculated from the difference between the unconstrained and constrained (i.e., assuming the optimal tree and model) log-likelihoods of the real data and the pseudo-data. If the fit is poor, it is recommended to check the alignment, the selected set of sites, and the sequence evolution model (feedback loops in Fig 1). The adequacy of the data can be tested using the frequentist Goldman-Cox (GC) test, which can be performed with PAUP [149]. Most Bayesian phylogenetic programs employ the posterior predictive (PP) test [150].

Visualization tools

Once the phylogenetic tree has been computed, it can be visualized using graphical software such as FigTree [151], ETE Toolkit [152] or iTOL [153] (Table 8). MEGA [55] and SeaView [124] also include visualization tools. Using different sets of options, several types of phylogenetic trees can be drawn (rooted or not, cladogram or phylogram), and branch support values (bootstrap values or posterior probabilities) can be displayed.

Table 8. List of tools for graphical visualization and annotation of phylogenetic trees.

Software Features Link References
ETE Toolkit Visualization and analysis of phylogenetic trees http://etetoolkit.org/ [152]
FigTree Graphic software for phylogenetic trees https://github.com/rambaut/figtree/releases [151]
iTOL* Visualization and annotation of phylogenetic trees https://itol.embl.de/ [153]
MEGA Sequence alignment, model selection, phylogenetic analysis, includes tree visualization and annotation tools https://www.megasoftware.net/ [55]
SeaView Sequence alignment and phylogenetic inference, includes tree visualization and annotation tools http://doua.prabi.fr/software/seaview [124]

(*: includes online version).

Integrative services for phylogenetic workflows

Packages for phylogenetic analysis can facilitate phylogenetic inference, analysis and other evolutionary studies (Table 9). A complete series of libraries for bioinformatics including tools for sequence alignment, phylogenetic analysis, study of molecular evolution and population genetics is available through the Bio++ suite [154] and HyPhy [128]. The Cyberinfrastructure for phylogenetic research (CIPRES science gateway) [155] is a public resource for phylogenetic analysis that includes many tools and software for sequence alignment, model selection, and phylogenetic inference. Other packages include NGPhylogeny [156] and Phylemon [157]. Geneious is a platform for DNA, RNA and protein studies that provides tools for NGS assembly, sequence alignment, phylogenetic analysis using NJ, UPGMA, ML and BI, 3-D structures study, and SNP analysis [158].

Table 9. List of packages for phylogenetic analysis and evolutionary studies.

Software Features Link References
Bio++ Software package for phylogenetic analysis https://github.com/BioPP [154]
CIPRES Resource for evolutionary studies, that includes programs for sequence alignment, model selection, and phylogenetic inference https://www.phylo.org/ [155]
HyPhy Software package for phylogenetic analysis https://www.hyphy.org/ [128]
Geneious Platform for NGS data assembly and diverse evolutionary studies https://www.geneious.com/ [158]
NGPhylogeny Workflow that integrates numerous methods and tools for phylogenetic analysis https://ngphylogeny.fr [156]
Phylemon2 Web-tools for molecular evolution, phylogenetics, phylogenomics and hypotheses testing. http://phylemon.bioinfo.cipf.es/index.html [157]

Study of molecular evolution

Inferring phylogenetic relationships allows users to study many aspects of molecular evolution. Here, we propose a non-exhaustive list of studies that can be carried out using phylogenetic trees and the above-mentioned bioinformatic tools.

Reconstitution of ancestral states

Retracing the functional evolution of genes, proteins, or biological traits often requires the reconstitution of ancestral states. Ancestral states can be inferred from a phylogenetic tree using MP, ML, or BI. Inferring ancestral states also requires the aligned sequences and, when using probabilistic and distance methods, the model of sequence evolution that has been used for the phylogenetic analysis. For ML-based reconstructions, MEGA [55], PAML [127], IQ-TREE and IQ-TREE 2 [77, 78], HyPhy [128], Bio++ [154], and Mesquite [159] can be used (Table 10). BEAST [160], MrBayes [133], and BayesTraits [136] use BI. RASP (Reconstruct Ancestral State in Phylogenies) [161] can be used for both ML- and BI-based ancestral states reconstruction. PyCogent [115] provides a large number of evolutionary analyses, including ancestral states reconstruction.

Table 10. List of programs and databases for diverse evolutionary analyses in complement to phylogenetic analysis.

Software Features Link References
Arlequin Population genetics analyses http://cmpg.unibe.ch/software/arlequin35/ [162]
BayesTraits Evolutionary analyses using Bayesian inference http://www.evolution.reading.ac.uk/BayesTraitsV4.0.0/BayesTraitsV4.0.0.html [136]
BEAST Diverse evolutionary analyses using BI, including time-calibration of phylogenetic trees http://www.beast.community [160]
CAFE Gene family evolution https://github.com/hahnlab/CAFE5 [163]
CoGe Comparative genomics analyses https://genomevolution.org/coge/ [164]
Copycat Coevolution studies http://www.cophylogenetics.com/ [165]
CoRe-PA Coevolution studies http://pacosy.informatik.uni-leipzig.de/49-1-CoRe-PA.html [166]
DnaSP Analysis of DNA polymorphism http://www.ub.edu/dnasp/ [167]
Genepop Population genetics analyses https://genepop.curtin.edu.au/ [168]
HGT-Finder Horizontal gene transfer finding https://github.com/yinlabniu/HGT-Finder [169]
IQ-TREE, IQ-TREE 2 Ancestral states reconstruction http://www.iqtree.org/ [77, 78]
Jane Coevolution studies https://www.cs.hmc.edu/~hadas/jane/ [170]
LSD Time-calibration of phylogenetic trees http://www.atgc-montpellier.fr/LSD/ [178]
Mesquite Comparative analyses and statistics http://www.mesquiteproject.org/ [159]
Ohnologs Database of vertebrate ohnologues, resulting from whole genome duplications http://ohnologs.curie.fr/ [171]
PyCogent Numerous evolutionary analyses, including partition models and phylogenetic analysis, tree drawing and ancestral states reconstruction https://github.com/pycogent/pycogent [115]
RASP Ancestral states reconstruction http://mnh.scu.edu.cn/soft/blog/RASP/index.html [161]
SNiplay SNP detection and other population genetics analyses https://sniplay.southgreen.fr/cgi-bin/home.cgi [172]
TimeTree Time-calibration of phylogenetic trees http://www.timetree.org/ [173]
TreeMap Software for studying co-evolution https://sites.google.com/site/cophylogeny/treemap [174]

Measure of selection strength

The type and strength of selection acting on protein-coding genes may be of interest. They are estimated from the ratio of the number of non-synonymous substitutions (substitutions changing the protein sequence) per non-synonymous site (dN) to the number of synonymous substitutions (substitutions with no effect on the protein sequence due to the redundancy of the genetic code) per synonymous site (dS). If dN/dS > 1, non-synonymous substitutions are more frequent than expected under neutrality and the gene is inferred to be under positive selection. If dN/dS < 1, the gene is under purifying selection, and if dN/dS = 1, it is evolving neutrally. It is recommended not to use the dN/dS ratio for closely related species [175]. The ratio can be calculated using PAML [127], MEGA [55], Bio++ [154] and HyPhy [128] (Table 10).
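The interpretation of the ratio can be sketched as a short Python function. The dN and dS values below are hypothetical, and the tolerance around 1 is an arbitrary choice made for this illustration; the programs listed above rely on statistical tests rather than on a fixed threshold.

```python
def omega(dn, ds):
    """dN/dS ratio; undefined when no synonymous substitution is observed."""
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds

def selection_regime(w, tolerance=0.05):
    """Classify a dN/dS ratio (the tolerance is a hypothetical, illustrative cut-off)."""
    if w > 1 + tolerance:
        return "positive selection"
    if w < 1 - tolerance:
        return "purifying selection"
    return "neutral evolution"

# Hypothetical estimates for one gene
w = omega(dn=0.02, ds=0.10)
print(round(w, 2), selection_regime(w))
```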

Time-calibration of phylogenetic trees

Time calibration of phylogenetic trees consists in estimating divergence times, using events of known age, such as fossils and other geological data (which can only give minimum ages), as calibration points. Alternatively, mutation rates can be used to calculate the divergence time between two sequences. However, since the mutation rate can vary considerably during long-term evolution and differ between taxonomic groups, using mutation rates should be avoided for distantly related species [176]. The estimated divergence times between species are summarized in TimeTree [173] (Table 10). It is noteworthy that divergence-time estimates from the literature, based on calibration points from fossil data and molecular clocks, are prone to error and illusory precision [177]. TimeTree can be used with MEGA [55] to calibrate a phylogeny. The BEAST package [160] uses BI to estimate mutation rates and calibrate phylogenies. LSD [178], recent versions of PhyloBayes [134] and APE [118] can also be used for molecular dating of evolutionary events.
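The rate-based calculation can be illustrated with a short sketch. It assumes a strict molecular clock, in which both lineages accumulate substitutions at the same constant rate, so that the divergence time is T = d / (2r), with d the distance between the two sequences (substitutions per site) and r the rate (substitutions per site per year). The numerical values are hypothetical.

```python
def divergence_time(distance, rate_per_site_per_year):
    """Divergence time in years under a strict molecular clock: T = d / (2r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("the substitution rate must be positive")
    # The factor 2 accounts for changes accumulating along both lineages.
    return distance / (2.0 * rate_per_site_per_year)

# Hypothetical values: 0.02 substitutions per site, 1e-9 substitutions per site per year
years = divergence_time(0.02, 1e-9)
print(f"{years:.3g} years")
```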

Study of host/parasite co-evolution

Co-evolution refers to the genetic or morphological changes (or both) that occur in different interacting species. The concept is widely used in evolutionary ecology and parasitology to study the evolution of hosts and parasites. Co-evolutionary events include co-speciation, host change, duplication, and loss of interaction. The evolution of the parasite is partly driven by the evolution of the host, which is considered independent from the evolution of the parasite [179]. The co-evolutionary history can be presented as a co-phylogeny with the two entities. Some programs for studying co-evolution, including Jane [170], CoRe-PA [166] and TreeMap [174] (Table 10), are based on the hypothesis that the evolution of the parasite is driven by the evolution of the host. Others, such as Copycat [165], reconcile the two phylogenies under the hypothesis that the situation is symmetric and evaluate the significance of co-evolution under a statistical framework. Co-evolution of genes or proteins can also be studied using these tools.

Phylogenetic comparative analysis

Evolutionary biology often employs the so-called phylogenetic comparative methods to study the adaptive significance of biological traits. These methods aim at identifying biological characters, in terms of morphology, physiology or ecology, that result from a shared ancestry. Comparative analysis uses a correlative approach between traits, taking into account the phylogenetic constraints [180]. Comparative analyses can be performed for quantitative or qualitative variables. Suitable programs include Mesquite [159] and BayesTraits [136] (Table 10).

Genome evolution

Phylogenetic trees, in complement with genomics tools and databases, can be used to study genome evolution, and identify evolutionary events such as mutations, insertions, deletions, gene or genome duplications, genome re-organization, chromosomal rearrangements, polyploidization events or genetic exchanges. Molecular databases, such as Ensembl [7] and GenBank [4] (Fig 1, Table 1), can be used to study genome evolution. Ohnologs [171] summarizes the whole genome duplication events during the evolution of vertebrates. This database can be used to interpret the duplication events and identify paralogues resulting from a whole genome duplication. Horizontal gene transfer, i.e., the gene exchanges between different organisms, can be estimated using HGT-Finder [169] (Table 10). CoGe [164] provides many tools for comparative genomic research, including BLAST [33] and tools for studying synteny, genomic inversions or horizontal gene transfers. Computational analysis of gene family evolution (CAFE) [163] is a program for studying the evolution of gene family sizes. It can be used to calculate the birth and death rates of gene families over phylogenies.

Population genetics

Genetic diversity can also be explored at the population level by analyzing polymorphism between members of the same species. Population geneticists often study allele diversity within a population, including single nucleotide polymorphisms (SNP), indels, microsatellites or transposable elements. Mathematical models have been developed to describe polymorphism. For instance, nucleotide diversity (π) measures the degree of polymorphism in a population as the average number of nucleotide differences per site between two sequences drawn from that population [181]. The fixation index (FST) is a statistic of genetic distance between populations based on their allelic composition using multiple alleles [182]. Linkage disequilibrium measures the association between alleles at different loci in a population. Several programs are suitable for population genetics studies; for a full review, see [183]. Arlequin [162], SNiPlay [172], DnaSP [184] and GENEPOP [168] can be used to compute statistics describing genetic diversity in populations, as can the R package APE [118] (Table 10). Arlequin and GENEPOP are also relevant for inferring the strength of genetic drift and selection. The Bio++ suite [154] and HyPhy [128] also include tools for population genetics analyses.
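As an illustration of one of these statistics, the sketch below computes nucleotide diversity (π) as the average number of nucleotide differences per site over all pairs of sequences in a sample. The sample is hypothetical, the sequences are assumed to be aligned, of equal length and free of gaps, and each sequence is treated as one sampled individual.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average number of pairwise nucleotide differences per site."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and of equal length")
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * length)

# Hypothetical sample of four aligned sequences from one population
sample = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
pi = nucleotide_diversity(sample)
print(pi)  # 6 differences over 6 pairs and 8 sites: 0.125
```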

Protein structure study

The study of protein functional evolution can require bioinformatic tools for protein structural analyses (Table 11). 3D structure comparisons can be performed using PyMOL [185]. With this program, structures can be aligned and the mean distance in ångström between homologous residues can be calculated. I-TASSER [186], HHpred of the HH suite [187] and AlphaFold [188] can be used to predict the 3D structure of proteins from their amino-acid sequences. FoRSA [189] is able to identify a protein fold from its amino-acid sequence or a protein sequence in the proteome of a species from a crystal structure.

Table 11. List of programs for protein structure analyses.

Software Features Link References
AlphaFold Protein structure prediction from amino-acid sequence https://alphafold.ebi.ac.uk/ [188]
FoRSA Protein structure prediction from amino-acid sequence http://www.bo-protscience.fr/forsa/ [189]
HHpred Protein structure prediction from amino-acid sequence https://toolkit.tuebingen.mpg.de/tools/hhpred [187]
I-TASSER Protein structure prediction from amino-acid sequence https://zhanglab.dcmb.med.umich.edu/I-TASSER/ [186]
PyMOL 3D visualization of molecules https://pymol.org/2/ [185]

Two test-cases of evolutionary analyses of proteins

Following the roadmap for evolutionary analyses of proteins that is presented above (Fig 1), we now demonstrate how to track the evolution of the p53 family and human cyclins and CDKs.

Reconstructing the evolutionary history of the p53 family

TP53 is a transcription factor regulating genes involved in DNA repair and cell cycle control, inducing growth arrest or apoptosis depending on the physiological conditions and cell type. TP53 has been extensively studied for its role in development and cancer. Two paralogues of p53 are identified in vertebrate genomes: p63 and p73. Here, we propose a simple bioinformatic study to reconstruct the evolutionary history of the proteins p53, p63 and p73, to illustrate our roadmap. We investigate how the paralogues of different animal species are related to each other, when they appeared and diverged, and when they evolved new protein domains. We describe step-by-step the methods and tools used, from the selection of sequences to the phylogenetic inference and reconstruction of the evolutionary history of the TP53 family, using data from reference [190].

1. State of the art and protein classification of p53

According to UniProt [17], the human p53 (cellular tumor antigen p53, hereafter named HsTP53) is located in the nucleus, the endoplasmic reticulum, the cytoskeleton and the mitochondrion. HsTP53 is labelled as P04637 in UniProt [17] (https://www.uniprot.org/uniprot/P04637, accessed September 06, 2021), where the full sequence can be downloaded in the FASTA format. HsTP53 contains 393 amino acids. According to Pfam 35.0 [27], HsTP53 contains four main protein domains: P53 TAD (transactivating domain), TAD2, P53 DNA binding domain, and P53 tetramer. p63 and p73 also contain the P53 DNA binding domain and the P53 tetramer domain. They lack the P53 TAD and TAD2 domains; instead, each includes a single SAM_2 domain.

P53 (PF00870 in Pfam) is the main domain of the p53 protein, covering the amino acids 99 to 289. Pfam contains 1,765 P53-domain-containing sequences from 382 species, all in choano-organisms (metazoans and choanoflagellates), including 5 sequences in choanoflagellates and 13 sequences in the genome of Homo sapiens (Fig 2A, the figure can also be accessed here: https://pfam.xfam.org/family/PF00870#tabview=tab7). P53 TAD and TAD2 are two transcription scaffold domains. The Pfam database [27] includes 253 sequences containing the P53 TAD domain, in bilaterians only. The TAD2 domain is present in 81 sequences, from primates only. P53 tetramer serves for the oligomerization of the protein. The database includes 1,392 sequences, in animals only, containing the P53 tetramer domain. The SAM_2 (sterile alpha motif) domain is a putative protein interaction domain. More than 20,000 sequences containing this domain are present in Pfam [27], in more than 1,400 species.

Fig 2. Sunburst plot of the distribution of the p53 protein domain (PF00870) in living organisms according to Pfam (accessed September 06, 2021).


(A) Sunburst plot showing the distribution of the 1,765 sequences containing the P53 DNA binding domain across 382 species. Every bar on the periphery represents a single species, whose genome contains one or several p53 paralogues. (B) Percent identity matrix, created with CLUSTALW 2.1, of the 7 human p53 domain-containing proteins and two of their most distant homologues from the choanoflagellate Monosiga brevicollis.

The SCOP classification of p53 is as follows (accessed September 06, 2021):

2. Identification of homologues

To reconstruct the evolutionary history of the p53 domain in animals, the selection of TP53 homologues covering the diversity of the family is necessary.

Using a protein BLAST (Blastp) (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins), paste the sequence of HsTP53 in the FASTA format, and select the genomes of the species of interest. Launch a BLAST search and download the amino-acid sequences in the FASTA format. Then, paste all the sequences in a single file using ’.fasta’ as filename extension.
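The last step, gathering the downloaded sequences in a single FASTA file, can also be scripted. The following sketch uses only the Python standard library; the function names and the file names in the usage comment are hypothetical and are not part of the original protocol.

```python
from pathlib import Path

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def write_fasta(records, path, width=60):
    """Write records to a FASTA file, wrapping sequences at `width` characters."""
    with open(path, "w") as handle:
        for name, seq in records:
            handle.write(f">{name}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")

def merge_fasta(paths, out_path):
    """Concatenate several FASTA files into one and return the number of records."""
    records = []
    for path in paths:
        records.extend(read_fasta(Path(path).read_text()))
    write_fasta(records, out_path)
    return len(records)

# Usage with hypothetical file names:
# merge_fasta(["human_p53.fasta", "hydra_p53.fasta"], "p53_family.fasta")
```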

In this example, the p53 homologues of diverse animals (the cnidarian Hydra vulgaris, four insect species: Drosophila melanogaster, Apis mellifera, Bombus terrestris and Aedes aegypti, and the tunicate Ciona intestinalis) and the p53, p63, and p73 of diverse vertebrates (the teleost fish Danio rerio, the coelacanth Latimeria chalumnae, the amphibian Xenopus tropicalis, the lizard Anolis carolinensis, the bird Gallus gallus, and the mammals Bos taurus and H. sapiens) were chosen. The p53 of the choanoflagellate Monosiga brevicollis (a protist related to animals), also retrieved from a BLAST search, was chosen as outgroup (S1 File).

3. Multiple sequence alignment and alignment trimming

Use an alignment tool (e.g., MAFFT, https://www.ebi.ac.uk/Tools/msa/mafft/ [45]) to align the sequences. Paste the sequences in the FASTA format and submit. Save the alignment in a new FASTA file. You can also upload the sequences directly to the Guidance 2 server (http://guidance.tau.ac.il/) [68] and run the alignment with MAFFT. Open the color-coded MSA to identify poorly aligned and highly variable regions. You can delete them manually from the alignment or remove unreliable columns that score below a chosen cutoff. The new MSA, hereafter called the sub-MSA, will be used for the phylogenetic analysis.
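To make the idea of cutoff-based column removal concrete, here is a minimal sketch in plain Python. It is not the GUIDANCE2 scoring scheme: as a simplifying assumption, the fraction of gaps in a column stands in for the column reliability score, and the function names and the 0.5 cutoff are hypothetical.

```python
def column_gap_fractions(alignment):
    """Return the fraction of gap characters ('-') in each alignment column.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [sum(s[i] == "-" for s in seqs) / len(seqs) for i in range(length)]


def trim_alignment(alignment, max_gap_fraction=0.5):
    """Keep only the columns whose gap fraction is at most `max_gap_fraction`."""
    fractions = column_gap_fractions(alignment)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}
```

Dedicated trimming tools use richer criteria than gap content alone, so this sketch only illustrates the principle of filtering columns against a threshold.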

Optional: calculate the identity matrix of the sequences using alignment tools (e.g., CLUSTALW 2.1).

The 13 human p53 paralogues share 36% to 100% identity, and the two paralogues of Monosiga brevicollis share 21.4% identity (Fig 2B). Human and Monosiga orthologues share 17% to 25% identity. Hence, all human paralogues are more similar to each other than to any of the Monosiga orthologues.
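For readers who want to see what a percent identity matrix contains, the following minimal Python sketch computes pairwise identities from an existing alignment. It is an illustration under stated assumptions, not the CLUSTALW implementation: identity is counted over columns where neither sequence has a gap, and CLUSTALW may use a different denominator. The function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def identity_matrix(alignment):
    """Return a dict mapping (name_x, name_y) to the percent identity of the pair."""
    names = list(alignment)
    return {
        (x, y): percent_identity(alignment[x], alignment[y])
        for x in names
        for y in names
    }
```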

4. Sequence evolution model selection

We propose to perform a phylogenetic analysis of the p53 family using a distance-based method (NJ) and a probabilistic method (ML) and compare the results. First, it is necessary to identify the optimal model of sequence evolution.

Here, we are using protein sequences. Use ProtTest 3.4.2 [100] to calculate the log-likelihoods of a panel of 56 amino-acid substitution models, and select the model with the lowest BIC or AIC score.

Alternatively, use the substitution model selectors included in IQ-TREE or MEGA. For example, with the IQ-TREE web server (http://iqtree.cibiv.univie.ac.at/), open the Model Selection panel, upload the sub-MSA, select “protein sequences”, choose a selection criterion (AIC or BIC) and proceed to the analysis. With MEGA 11, load the sub-MSA, and select “Find best DNA/protein models” in the Model panel.

The model JTT+G [90] (JTT with Gamma-distributed rate heterogeneity across sites), which minimizes the BIC score, was selected.
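The two criteria can be computed by hand from the log-likelihood reported for each model. The sketch below uses the standard definitions AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the sample size; taking n as the number of alignment sites is an assumption, and the function names are hypothetical. The log-likelihoods in the usage note are invented for illustration and are not results from this analysis.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL, with n = alignment sites."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(fits, n_sites, criterion="BIC"):
    """Return the model name with the lowest score.

    `fits` maps a model name to a (log_likelihood, n_params) tuple.
    """
    def score(item):
        lnl, k = item[1]
        return bic(lnl, k, n_sites) if criterion == "BIC" else aic(lnl, k)

    return min(fits.items(), key=score)[0]
```

With made-up values such as fits = {"JTT": (-5100.0, 30), "JTT+G": (-5000.0, 31)} and n_sites = 200, best_model(fits, 200) returns the model whose gain in likelihood outweighs the penalty for its extra parameter.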

5. Phylogenetic inference

For beginners, we recommend using programs that include a user interface or an online version, such as MEGA, SeaView, or the IQ-TREE server. The phylogenetic trees were inferred using MEGA 11 [55] for the NJ-based analysis, and IQ-TREE 2 [77, 78] for the ML-based analysis, using the sub-MSA and the appropriate model (JTT+G).

With MEGA 11, in the Phylogeny panel, perform a phylogenetic analysis using NJ with the sub-MSA. Select the appropriate substitution model (e.g., JTT+G) and the bootstrap method with e.g., 1000 replicates.
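The bootstrap option resamples the alignment columns with replacement, rebuilds a tree from each pseudo-alignment, and reports how often each clade recurs. The resampling step alone can be sketched in a few lines of Python; this is a generic illustration of the standard bootstrap, not MEGA's code, and the function names and the seed parameter are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence so that
    # whole columns (sites) are resampled, not individual residues.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Return a list of `n_replicates` pseudo-alignments."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]
```

Each pseudo-alignment would then be passed to the tree-building method, and the support of a branch is the percentage of replicate trees that contain it.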

With IQ-TREE 2, upload the alignment file, select the appropriate sequence type (DNA or protein) and the appropriate substitution model (e.g., JTT+G). In the panel “branch support analysis”, select the Ultrafast Bootstrap analysis with e.g., 1000 replicates. For single branch tests, you can also select the SH-aLRT test.
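The same settings can be passed to a local installation of IQ-TREE 2 on the command line. The sketch below only assembles the argument list; it assumes that the executable is installed under the name iqtree2 and that the sub-MSA is saved as sub_msa.fasta, both of which are placeholders. The options used are -s (alignment), -st (sequence type), -m (model), -B (ultrafast bootstrap replicates) and -alrt (SH-aLRT replicates).

```python
def iqtree_command(alignment_path, model="JTT+G", ufboot=1000, alrt=1000, seqtype="AA"):
    """Build the argument list for an IQ-TREE 2 run with branch support tests.

    The executable name 'iqtree2' is an assumption about the local installation.
    """
    cmd = [
        "iqtree2",
        "-s", alignment_path,   # input alignment
        "-st", seqtype,         # sequence type (AA for proteins)
        "-m", model,            # substitution model
        "-B", str(ufboot),      # ultrafast bootstrap replicates
    ]
    if alrt:
        cmd += ["-alrt", str(alrt)]  # SH-aLRT single branch test replicates
    return cmd
```

The list can be run with subprocess.run(iqtree_command("sub_msa.fasta"), check=True); IQ-TREE then writes the tree with support values in the Newick format next to the alignment file.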

Save the phylogenetic tree including the Bootstrap/SH-aLRT values and branch lengths in the Newick format and open it with FigTree or ITOL for a graphical display of the tree. You can also paste the tree in the Newick format directly into the graphical program.

Both methods reveal four major clades containing respectively the p53 of insects and the p53, p63, and p73 of all vertebrates (Fig 3). The p53, p63, and p73 of vertebrates are more closely related to each other than to any other p53. Furthermore, the p63 and p73 of vertebrates are more closely related to each other than to vertebrate p53. This indicates that two duplication events in the p53 family preceded the origin of vertebrates. The first duplication separated the p53 family from the p63/p73 cluster. The second separated the p63 and p73 families (Fig 3). The p53 of insects cluster together, which indicates that insects diverged from the other bilaterians before these two duplications. These results are in accordance with the existing literature on the evolutionary history of the p53 family [190].

Fig 3.

Phylogenetic trees of the P53 domain-containing proteins of metazoans using Neighbor Joining (A) and Maximum Likelihood (B). The trees were built with the model JTT+G [90], as selected by ModelFinder [101] using the AIC [98]. The numbers on the internal edges/branches indicate the bootstrap values as calculated by the standard bootstrapping method [138] and UFBoot2 [131], respectively. The phylogenetic trees were inferred using MEGA 11 [55] and IQ-TREE 2 [77, 78], respectively, and the figures were generated using ITOL [153]. Green branches represent the p63 family, orange branches represent the p73 family and blue branches represent the p53 family.

6. Molecular dating of speciation events

We propose to estimate the age of the speciation and duplication events in our phylogenetic tree. TimeTree was used to retrieve age estimates of speciation events, and these events were used as calibration points. Molecular clocks can also be used to calibrate the phylogeny in MEGA.

Load the alignment file in the FASTA format and the phylogenetic tree in the Newick format into MEGA 11. In the Compute panel, select “Compute TimeTree” and “internal nodes constraints”. In TimeTree (http://www.timetree.org/), enter the names of two species of interest. For example, Homo and Drosophila diverged between 630 and 830 million years ago, with 694 million years as median time. In MEGA, click “add new calibration point” and select the node in the phylogenetic tree, or enter the names of the two taxa, and define the speciation age with a minimum, maximum or fixed time (for example, 694 million years between Homo and Drosophila). Use TimeTree to define several calibration points, before and after the duplication events, and save the calibrated tree.

The time-calibrated phylogeny of the TP53 family suggests that the duplication event between p53 and the p63/p73 cluster occurred around 502 million years ago, and that p63 and p73 diverged 452 million years ago (Fig 4). One should keep in mind that these evolutionary ages are only estimates based on a few calibration points. According to the database Ohnologs [171], p53, p63 and p73 result from the two-round whole genome duplication event that preceded the origin of chordates.

Fig 4. Time-calibrated phylogenetic tree of the P53 domain-containing proteins of metazoans.

The tree was built with the model JTT+G [90], as selected by ModelFinder [101] using the AIC [98]. The phylogenetic tree and the figure were produced using MEGA 11 [55]. Time calibration was performed using TimeTree [173]. The values at the nodes and the scale indicate the divergence time in million years. Green branches represent the p63 family, orange branches represent the p73 family and blue branches represent the p53 family. Gray spots on the branches indicate the appearance of the different protein domains during the evolution of the TP53 family.

7. Reconstruction of the evolutionary history of the P53 family

By combining this phylogenetic tree and the database Pfam [27], the evolutionary history of the protein family can be traced. The p53 DNA binding domain, shared by all proteins in this analysis, appeared before the divergence between choanoflagellates and metazoans (Fig 4). The SAM2 domain, present in p63 and p73 sequences of vertebrates only, appeared after the p53-p63/p73 duplication and before the p63-p73 duplication. The P53 TAD domain is restricted to vertebrate p53. It appeared after the first whole genome duplication and before the speciation of vertebrates. Finally, the TAD2 domain evolved recently and is restricted to primate p53.

Evolutionary history of human cyclins and CDKs

Cyclin-dependent kinases (CDKs) are protein kinases involved in the control of the cell cycle. They are responsible for the activation of specific target proteins. CDKs are activated by regulatory proteins called cyclins, whose concentration varies cyclically over the cell cycle. Cyclin binding to the CDK activates specific kinases and phosphatases that in turn activate the CDK. Subsequent ubiquitination and proteolysis of cyclins by the anaphase promoting complex then inactivate the CDK. All steps of the cell cycle (mitosis, G1, G2, S) depend on the activation of specific CDKs by specific cyclins. Cyclins and CDKs represent large families of proteins. Twenty-one CDKs and twenty-one cyclins are present in the human genome. Here, we present a protocol for studying the evolutionary history of human cyclin and CDK paralogues and their coevolution, following the roadmap presented above.

1. Phylogenetic analyses

To study the coevolution between human cyclins and CDKs, we first need to reconstruct their phylogenies separately. In this example, two human proteins related to CDKs, GSK3 and MAK, were chosen as outgroups for CDKs [191] (S2 File). Cables1 and Cables2, related to cyclins, were used as outgroups for the phylogenetic analysis of cyclins [191] (S3 File). Sequences were aligned using CLUSTALW 2.1 [39], and ModelFinder [101] was used to determine the most relevant evolutionary model based on the AIC. The amino acid substitution model LG [192] was selected for both families. Then, phylogenetic analysis was performed using maximum likelihood with IQ-TREE 2 [77, 78], and the consistency of the phylogenetic estimate was assessed using the bootstrapping method UFBoot2 [131]. The figures were generated using ITOL [153].

2. Identification of homologues resulting from whole genome duplications

Using the phylogeny and the database Ohnologs 2.0 [171], ohnologues, i.e., paralogues resulting from a whole genome duplication, can be identified (Fig 5). For example, in the cyclin family, Cyclins T1 and T2, Cyclins B1 and B2, Cyclins A1 and A2, and Cyclins E1 and E2 are ohnologues. In the CDK family, CDK12 and CDK13, CDK4 and CDK6, CDK14 and CDK15, and CDK19 and CDK8 are ohnologues. These ohnologues likely resulted from the two-round whole genome duplication that occurred before the origin of chordates [193, 194].

Fig 5.

Phylogenetic trees of human CDKs (A) and Cyclins (B) using Maximum likelihood. The trees were built with the substitution model LG [192], as selected by ModelFinder [101] using the AIC [98]. The phylogenetic analyses were performed using IQ-TREE 2 [77, 78] and the figure was produced using ITOL [153]. The numbers indicate the bootstrap values as calculated by UFBoot2 [130]. Red squares indicate whole genome duplications according to the database Ohnologs 2.0 [171].

3. Study of the coevolution between cyclins and CDKs

To study the coevolution between the two gene families, the phylogenetic trees of cyclins and CDKs and their associations are needed. Jane 4 [170] and TreeMap 3 [174], two programs designed for studying coevolution between hosts and parasites, were used to reconstruct the co-phylogeny of the two gene families. This example uses the cyclin/CDK associations from a publication on the evolution of the Cyclin and CDK families [195].

Jane and TreeMap both require a single nexus file containing the phylogenies of cyclins and CDKs and their associations. Create a nexus file (starting with #NEXUS). This file should contain the two trees in the Newick format, in the sections BEGIN HOST and BEGIN PARASITE, and the associations in the section BEGIN DISTRIBUTION. This section should list every association between Cyclins and CDKs following the pattern “Host: Parasite,”. All three sections should end with “ENDBLOCK;”. The names of the taxa in the three sections should be identical. Cyclins interacting with several CDKs, and vice versa, should be repeated (S4 File).
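Assembling this file by hand is error-prone, mainly because taxon names must match across the three sections. The Python sketch below writes the layout described above and checks the names first. It is a hedged illustration only: the function names, the toy trees, and the choice of cyclins as hosts and CDKs as parasites are assumptions, and the exact statement syntax inside each block (for example a tree or range keyword, or how the association list is terminated) is not specified in the text and should be checked against S4 File and the Jane and TreeMap manuals.

```python
import re


def newick_leaves(newick):
    """Extract leaf names from a Newick string (names without spaces or quotes).

    A leaf name directly follows '(' or ','; internal labels follow ')'.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def build_cophylogeny_nexus(host_newick, parasite_newick, associations):
    """Return nexus text with HOST, PARASITE and DISTRIBUTION sections.

    `associations` is a list of (host, parasite) name pairs. Every name must
    occur as a leaf in the corresponding tree.
    """
    hosts = set(newick_leaves(host_newick))
    parasites = set(newick_leaves(parasite_newick))
    for host, parasite in associations:
        if host not in hosts:
            raise ValueError(f"host '{host}' is not a leaf of the host tree")
        if parasite not in parasites:
            raise ValueError(f"parasite '{parasite}' is not a leaf of the parasite tree")
    # Assumption: associations are comma-separated and the list is closed by ';'.
    pairs = ",\n".join(f"    {h}: {p}" for h, p in associations)
    lines = [
        "#NEXUS",
        "BEGIN HOST;",
        f"    {host_newick}",
        "ENDBLOCK;",
        "BEGIN PARASITE;",
        f"    {parasite_newick}",
        "ENDBLOCK;",
        "BEGIN DISTRIBUTION;",
        pairs,
        "    ;",
        "ENDBLOCK;",
    ]
    return "\n".join(lines) + "\n"
```

A toy call such as build_cophylogeny_nexus("((CCND1,CCND2),CCNE1);", "((CDK4,CDK6),CDK2);", [("CCND1", "CDK4"), ("CCND1", "CDK6"), ("CCND2", "CDK4"), ("CCNE1", "CDK2")]) repeats CCND1 once per partner, as the protocol requires for proteins with several interactors.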

Import this file to Jane and launch the analysis in the Solve Mode. The costs of coevolutionary events can be set. The stats mode can be used to compute the cost range of the solutions. With TreeMap, import the nexus file and launch the analysis in “Solve the tanglegram”. We obtain a coevolutionary scenario that represents the best way to associate the two trees. You can test the significance of the reconstruction in “estimate significance” or perform a heuristic test.

The co-phylogenies of cyclins and CDKs are presented in Fig 6. These figures retrace the evolutionary history of cyclins and CDKs. Several coevolution events were identified, including co-speciation, duplication, duplication with interaction switch, loss of interaction, and numerous failures to diverge, for example the duplication of Cyclins L1 and L2 without a duplication in CDK9 (Fig 6A). Significant co-evolution events are identified by both programs, such as between the 3 Cyclin D paralogues and the CDK4 and CDK6 cluster, and between Cyclins A, B, D and E and CDK 1, 2, 3, 4, 6, 14 and 16 (Fig 6A and 6B).

Fig 6. Two co-evolutionary scenarios between human cyclins and human CDKs.

The co-phylogenies were built using Jane [170] (A) and TreeMap [174] (B). (A) Cyclins (black lines) and CDKs (blue lines) that cluster together depict an interaction (the cyclin can bind the CDK to activate it). Hollow red circles indicate co-speciation events, solid red circles indicate a duplication, and yellow circles indicate a duplication with host switch. Dashed lines illustrate a loss of interaction, and jagged lines indicate a failure to diverge. (B) Significant co-speciation events between cyclins and CDKs are indicated by red filled circles (graded). A more intense red color indicates a more significant congruence.

Supporting information

S1 File. Sequences and accession numbers of p53 proteins.

These sequences and accession numbers were used for phylogenetic analysis.

(PDF)

S2 File. Sequences and accession numbers of human CDKs.

These sequences and accession numbers were used for phylogenetic analysis.

(PDF)

S3 File. Sequences and accession numbers of human cyclins.

These sequences and accession numbers were used for phylogenetic analysis.

(PDF)

S4 File. Nexus code.

This code was used to reconstruct the co-phylogeny of human CDKs and Cyclins.

(PDF)

S5 File. Protocol.

The protocol is also available on protocols.io.

(PDF)

Acknowledgments

We are grateful to Sarah Amend, Kenneth Pienta, and Laurie Kostecka at the Brady Urological Institute, Johns Hopkins School of Medicine, and to Stina Andersson, Chris Carroll, and Sinan Karakaya at the Tissue Development and Evolution (TiDE) group, Lund University, for carefully reading the manuscript and providing useful comments that improved the paper.

Data Availability

Data are within the Supporting Information files.

Funding Statement

EH received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 949538). The funder had and will not have a role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genomics Bioinforma. 2020. Jun 1;2(2):lqaa041. doi: 10.1093/nargab/lqaa041
  • 2.Chen C, Huang H, Wu CH. Protein Bioinformatics Databases and Resources. In: Wu CH, Arighi CN, Ross KE, editors. Protein Bioinformatics [Internet]. New York, NY: Springer New York; 2017. [cited 2021 Aug 5]. p. 3–39. (Methods in Molecular Biology; vol. 1558). Available from: http://link.springer.com/10.1007/978-1-4939-6783-4_1
  • 3.Rigden DJ, Fernández XM. The 2021 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 2021. Jan 8;49(D1):D1–9.
  • 4.Benson DA. GenBank. Nucleic Acids Res. 2002. Jan 1;30(1):17–20. doi: 10.1093/nar/30.1.17
  • 5.Schuler GD, Epstein JA, Ohkawa H, Kans JA. [10] Entrez: Molecular biology database and retrieval system. In: Methods in Enzymology [Internet]. Elsevier; 1996. [cited 2021 Dec 20]. p. 141–62. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0076687996660121
  • 6.NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012. Nov 26;41(D1):D8–20.
  • 7.Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Res. 2021. Jan 8;49(D1):D884–91. doi: 10.1093/nar/gkaa942
  • 8.Bastian F, Parmentier G, Roux J, Moretti S, Laudet V, Robinson-Rechavi M. Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species. In: Bairoch A, Cohen-Boulakia S, Froidevaux C, editors. Data Integration in the Life Sciences [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. [cited 2021 Aug 5]. p. 124–31. (Lecture Notes in Computer Science; vol. 5109). Available from: http://link.springer.com/10.1007/978-3-540-69828-9_12
  • 9.Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, et al. GeneCards Version 3: the human gene integrator. Database. 2010. Aug 5;2010(0):baq020–baq020. doi: 10.1093/database/baq020
  • 10.Bowes JB, Snyder KA, Segerdell E, Gibb R, Jarabek C, Noumen E, et al. Xenbase: a Xenopus biology and genomics resource. Nucleic Acids Res. 2007. Dec 23;36(Database):D761–7. doi: 10.1093/nar/gkm826
  • 11.Drysdale RA. FlyBase: genes and gene models. Nucleic Acids Res. 2004. Dec 17;33(Database issue):D390–5.
  • 12.Stein L. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001. Jan 1;29(1):82–6. doi: 10.1093/nar/29.1.82
  • 13.Wood V, Harris MA, McDowall MD, Rutherford K, Vaughan BW, Staines DM, et al. PomBase: a comprehensive online resource for fission yeast. Nucleic Acids Res. 2012. Jan 1;40(D1):D695–9. doi: 10.1093/nar/gkr853
  • 14.Rhee SY. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003. Jan 1;31(1):224–8. doi: 10.1093/nar/gkg076
  • 15.Waese J, Provart NJ. The Bio-Analytic Resource: Data visualization and analytic tools for multiple levels of plant biology. Curr Plant Biol. 2016. Nov;7–8:2–5.
  • 16.Winter D, Vinegar B, Nahal H, Ammar R, Wilson GV, Provart NJ. An “Electronic Fluorescent Pictograph” Browser for Exploring and Analyzing Large-Scale Biological Data Sets. Baxter I, editor. PLoS ONE. 2007. Aug 8;2(8):e718. doi: 10.1371/journal.pone.0000718
  • 17.Bairoch A. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2004. Dec 17;33(Database issue):D154–9.
  • 18.Consortium TGO. Creating the Gene Ontology Resource: Design and Implementation. Genome Res. 2001. Aug 1;11(8):1425–33. doi: 10.1101/gr.180801
  • 19.Aoki KF, Kanehisa M. Using the KEGG Database Resource. Curr Protoc Bioinforma [Internet]. 2005. Sep [cited 2021 Aug 5];11(1). Available from: https://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0112s11
  • 20.Digre A, Lindskog C. The Human Protein Atlas—Spatial localization of the human proteome in health and disease. Protein Sci. 2021. Jan;30(1):218–33. doi: 10.1002/pro.3987
  • 21.Bouthors V, Dedieu O. Pharos, a Collaborative Infrastructure for Web Knowledge Sharing. In: Abiteboul S, Vercoustre AM, editors. Research and Advanced Technology for Digital Libraries [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 1999. [cited 2021 Aug 5]. p. 215–33. (Goos G, Hartmanis J, van Leeuwen J, editors. Lecture Notes in Computer Science; vol. 1696). Available from: http://link.springer.com/10.1007/3-540-48155-9_15
  • 22.Sheils TK, Mathias SL, Kelleher KJ, Siramshetty VB, Nguyen DT, Bologa CG, et al. TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res. 2021. Jan 8;49(D1):D1334–46. doi: 10.1093/nar/gkaa993
  • 23.Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, et al. The CATH classification revisited—architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 2009. Jan 1;37(Database):D310–4. doi: 10.1093/nar/gkn877
  • 24.Holm L, Sander C. The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res. 1996. Jan 1;24(1):206–9. doi: 10.1093/nar/24.1.206
  • 25.Holm L, Sander C. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res. 1997. Jan 1;25(1):231–4. doi: 10.1093/nar/25.1.231
  • 26.Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007. Jan 3;35(Database):D301–3. doi: 10.1093/nar/gkl971
  • 27.El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019. Jan 8;47(D1):D427–32. doi: 10.1093/nar/gky995
  • 28.Attwood TK, Beck ME, Bleasby AJ, Parry-Smith DJ. PRINTS a database of protein motif fingerprints.: 7.
  • 29.Hulo N. The PROSITE database. Nucleic Acids Res. 2006. Jan 1;34(90001):D227–30. doi: 10.1093/nar/gkj063
  • 30.Conte LL, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C. SCOP: a Structural Classification of Proteins database.: 3.
  • 31.Madera M. The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res. 2004. Jan 1;32(90001):235D – 239. doi: 10.1093/nar/gkh117
  • 32.Hubbard TJP, Murzin AG, Brenner SE, Chothia C. SCOP: a Structural Classification of Proteins database. Nucleic Acids Research. 1996;25(1):4.
  • 33.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. 1990;8.
  • 34.Kerfeld CA, Scott KM. Using BLAST to Teach “E-value-tionary” Concepts. Kerfeld CA, editor. PLoS Biol. 2011. Feb 1;9(2):e1001014. doi: 10.1371/journal.pbio.1001014
  • 35.Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011. Jul 1;39(suppl):W29–37.
  • 36.Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci. 1988. Apr 1;85(8):2444–8. doi: 10.1073/pnas.85.8.2444
  • 37.Ning Z. SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 2001. Oct 1;11(10):1725–9. doi: 10.1101/gr.194201
  • 38.Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002;(12):656–64. doi: 10.1101/gr.229202
  • 39.Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007. Nov 1;23(21):2947–8. doi: 10.1093/bioinformatics/btm404
  • 40.Sievers F, Higgins DG. Clustal Omega. Curr Protoc Bioinforma [Internet]. 2014. Dec [cited 2022 Jun 4];48(1). Available from: https://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0313s48
  • 41.Edgar RC. MUSCLE: multiple sequence alignment with improved accuracy and speed. In: Proceedings 2004 IEEE Computational Systems Bioinformatics Conference, 2004 CSB 2004 [Internet]. Stanford, CA, USA: IEEE; 2004 [cited 2021 Aug 10]. p. 689–90. Available from: http://ieeexplore.ieee.org/document/1332560/
  • 42.Löytynoja A. Phylogeny-aware alignment with PRANK. In: Russell DJ, editor. Multiple Sequence Alignment Methods [Internet]. Totowa, NJ: Humana Press; 2014. [cited 2021 Sep 9]. p. 155–70. (Methods in Molecular Biology; vol. 1079). Available from: http://link.springer.com/10.1007/978-1-62703-646-7_10
  • 43.Löytynoja A, Goldman N. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics. 2010. Dec;11(1):579. doi: 10.1186/1471-2105-11-579
  • 44.Lassmann T. Kalign 3: multiple sequence alignment of large datasets. Mathelier A, editor. Bioinformatics. 2019. Oct 26;btz795. doi: 10.1093/bioinformatics/btz795
  • 45.Katoh K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002. Jul 15;30(14):3059–66. doi: 10.1093/nar/gkf436
  • 46.Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009. Oct 1;25(19):2455–65. doi: 10.1093/bioinformatics/btp452
  • 47.Do CB, Katoh K. Protein Multiple Sequence Alignment. In: Thompson JD, Ueffing M, Schaeffer-Reiss C, editors. Functional Proteomics [Internet]. Totowa, NJ: Humana Press; 2008. [cited 2022 Oct 22]. p. 379–413. (Walker JM, editor. Methods in Molecular Biology; vol. 484). Available from: http://link.springer.com/10.1007/978-1-59745-398-1_25
  • 48.Pei J. Multiple protein sequence alignment. Curr Opin Struct Biol. 2008. Jun;18(3):382–6. doi: 10.1016/j.sbi.2008.03.007
  • 49.Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment 1 1Edited by J. Thornton. J Mol Biol. 2000. Sep;302(1):205–17.
  • 50.Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005. Feb;15(2):330–40. doi: 10.1101/gr.2821705
  • 51.Do CB, Gross SS, Batzoglou S. CONTRAlign: Discriminative Training for Protein Sequence Alignment. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in Computational Molecular Biology [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. [cited 2022 Jun 16]. p. 160–74. (Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, et al., editors. Lecture Notes in Computer Science; vol. 3909). Available from: http://link.springer.com/10.1007/11732990_15
  • 52.Gotoh O. Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments. J Mol Biol. 1996. Dec;264(4):823–38. doi: 10.1006/jmbi.1996.0679
  • 53.Notredame C. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 1996. Apr 15;24(8):1515–24. doi: 10.1093/nar/24.8.1515
  • 54.Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Bioinformatics. 1996;12(2):95–107. doi: 10.1093/bioinformatics/12.2.95
  • 55.Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Battistuzzi FU, editor. Mol Biol Evol. 2018. Jun 1;35(6):1547–9. doi: 10.1093/molbev/msy096
  • 56.Pais FSM, Ruy P de C, Oliveira G, Coimbra RS. Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol. 2014. Dec;9(1):4. doi: 10.1186/1748-7188-9-4
  • 57.Mohamed EM, Mousa HM, keshk AE. Comparative Analysis of Multiple Sequence Alignment Tools. Int J Inf Technol Comput Sci. 2018. Aug 8;10(8):24–30.
  • 58.Anderson C, Strope C, Moriyama E. Assessing multiple sequence alignments using visual tools. In: Bioinformatic—trends and methodologies. InTech Publications. 2011.
  • 59.Redelings BD. BAli-Phy version 3: model-based co-estimation of alignment and phylogeny. Ponty Y, editor. Bioinformatics. 2021. Sep 29;37(18):3032–4.
  • 60.Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. J Comput Biol. 2015. May;22(5):377–86. doi: 10.1089/cmb.2014.0156
  • 61.Nguyen N phuong D, Mirarab S, Kumar K, Warnow T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 2015. Dec;16(1):124. doi: 10.1186/s13059-015-0688-z
  • 62.Liu K, Warnow TJ, Holder MT, Nelesen SM, Yu J, Stamatakis AP, et al. SATé-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Syst Biol. 2012. Jan 1;61(1):90.
  • 63.Morrison DA. Is Sequence Alignment an Art or a Science? Syst Bot. 2015. Feb 1;40(1):14–26.
  • 64.Lemey P, Salemi M, Van Damme AM. The Phylogenetic Handbook, A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge University Press; 2009.
  • 65.Golubchik T, Wise MJ, Easteal S, Jermiin LS. Mind the Gaps: Evidence of Bias in Estimates of Multiple Sequence Alignments. Mol Biol Evol. 2007. Aug 16;24(11):2433–42. doi: 10.1093/molbev/msm176
  • 66.Talavera G, Castresana J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Kjer K, Page R, Sullivan J, editors. Syst Biol. 2007. Aug 1;56(4):564–77.
  • 67.Wong TKF, Kalyaanamoorthy S, Meusemann K, Yeates DK, Misof B, Jermiin LS. A minimum reporting standard for multiple sequence alignments. NAR Genomics Bioinforma. 2020. Jun 1;2(2):lqaa024. doi: 10.1093/nargab/lqaa024
  • 68.Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 2015. Jul 1;43(W1):W7–14. doi: 10.1093/nar/gkv318
  • 69.Kinene T, Wainaina J, Maina S, Boykin LM. Rooting Trees, Methods for. In: Encyclopedia of Evolutionary Biology [Internet]. Elsevier; 2016. [cited 2021 Dec 17]. p. 489–93. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9780128000496002158
  • 70.Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009. Aug 1;25(15):1972–3. doi: 10.1093/bioinformatics/btp348
  • 71.Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol. 2010;10(1):210. doi: 10.1186/1471-2148-10-210
  • 72.Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, et al. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008. Dec;3(1):7. doi: 10.1186/1748-7188-3-7
  • 73.Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. Mixture Models of Nucleotide Sequence Evolution that Account for Heterogeneity in the Substitution Process Across Sites and Across Lineages. Syst Biol. 2014. Sep 1;63(5):726–42. doi: 10.1093/sysbio/syu036
  • 74.Naser-Khdour S, Minh BQ, Zhang W, Stone EA, Lanfear R. The Prevalence and Impact of Model Violations in Phylogenetic Analysis. Bryant D, editor. Genome Biol Evol. 2019. Dec 1;11(12):3341–52. doi: 10.1093/gbe/evz193
  • 75.Ho SYW, Jermiin LS. Tracing the Decay of the Historical Signal in Biological Sequence Data. Lockhart P, editor. Syst Biol. 2004. Aug 1;53(4):623–37. doi: 10.1080/10635150490503035
  • 76.Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD. The Biasing Effect of Compositional Heterogeneity on Phylogenetic Estimates May be Underestimated. Lockhart P, editor. Syst Biol. 2004. Aug 1;53(4):638–43. doi: 10.1080/10635150490468648
  • 77.Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol Biol Evol. 2015. Jan;32(1):268–74. doi: 10.1093/molbev/msu300
  • 78.Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Mol Biol Evol. 2020. May 1;37(5):1530–4. doi: 10.1093/molbev/msaa015
  • 79.Jermiin LS, Lovell DR, Misof B, Foster PG, Robinson J. Detecting and visualising the impact of heterogeneous evolutionary processes on phylogenetic estimates [Internet]. Evolutionary Biology; 2019. Nov [cited 2022 Oct 22]. Available from: http://biorxiv.org/lookup/doi/10.1101/828996 [Google Scholar]
  • 80.Thomas GH, Freckleton RP. MOTMOT: models of trait macroevolution on trees. Methods Ecol Evol. 2012. Feb;3(1):145–51. [Google Scholar]
  • 81.Arenas M. Trends in substitution models of molecular evolution. Front Genet [Internet]. 2015. Oct 26 [cited 2022 Oct 22];6. Available from: http://journal.frontiersin.org/Article/10.3389/fgene.2015.00319/abstract [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Posada D, Crandall KA. Selecting the Best-Fit Model of Nucleotide Substitution. Syst Biol. 2001;50:22. [PubMed] [Google Scholar]
  • 83.Yang Z. Molecular Evolution: A Statistical Approach. Oxford University Press; 2014. 512 p. [Google Scholar]
  • 84.Jukes TH, Cantor CR. Evolution of Protein Molecules. In: Mammalian Protein Metabolism [Internet]. Elsevier; 1969. [cited 2021 Dec 20]. p. 21–132. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9781483232119500097 [Google Scholar]
  • 85.Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980. Jun;16(2):111–20. doi: 10.1007/BF01731581 [DOI] [PubMed] [Google Scholar]
  • 86.Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981. Nov;17(6):368–76. doi: 10.1007/BF01734359 [DOI] [PubMed] [Google Scholar]
  • 87.Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985. Oct;22(2):160–74. doi: 10.1007/BF02101694 [DOI] [PubMed] [Google Scholar]
  • 88.Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993. May;10(3):512–26. doi: 10.1093/oxfordjournals.molbev.a040023 [DOI] [PubMed] [Google Scholar]
  • 89.Miura RM, American Association for the Advancement of Science, editors. Some mathematical questions in biology: DNA sequence analysis. Providence, R.I: American Mathematical Society; 1986. 124 p. (Lectures on mathematics in the life sciences). [Google Scholar]
  • 90.Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Bioinformatics. 1992;8(3):275–82. doi: 10.1093/bioinformatics/8.3.275 [DOI] [PubMed] [Google Scholar]
  • 91.Whelan S, Goldman N. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol Biol Evol. 2001. May;18(5):691–9. doi: 10.1093/oxfordjournals.molbev.a003851 [DOI] [PubMed] [Google Scholar]
  • 92.Le SQ, Gascuel O. An Improved General Amino Acid Replacement Matrix. Mol Biol Evol. 2008. Apr 3;25(7):1307–20. doi: 10.1093/molbev/msn067 [DOI] [PubMed] [Google Scholar]
  • 93.Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure. 1978. p. 345–52. [Google Scholar]
  • 94.Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994. Sep;11(5):715–24. doi: 10.1093/oxfordjournals.molbev.a040152 [DOI] [PubMed] [Google Scholar]
  • 95.Yang Z. A space-time process model for the evolution of DNA sequences. Genetics. 1995. Feb 1;139(2):993–1005. doi: 10.1093/genetics/139.2.993 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, et al. GHOST: Recovering Historical Signal from Heterotachously Evolved Sequence Alignments. Smith S, editor. Syst Biol. 2019. Jul 31;syz051. [DOI] [PubMed] [Google Scholar]
  • 97.Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. WIREs Comput Stat. 2012. Mar;4(2):199–203. [Google Scholar]
  • 98.Bozdogan H. Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika. 1987. Sep;52(3):345–70. [Google Scholar]
  • 99.Posada D, Crandall KA. MODELTEST: testing the model of DNA substitution. Bioinformatics. 1998. Oct 1;14(9):817–8. doi: 10.1093/bioinformatics/14.9.817 [DOI] [PubMed] [Google Scholar]
  • 100.Abascal F, Zardoya R, Posada D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 2005. May 1;21(9):2104–5. doi: 10.1093/bioinformatics/bti263 [DOI] [PubMed] [Google Scholar]
  • 101.Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017. Jun;14(6):587–9. doi: 10.1038/nmeth.4285 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses. Mol Biol Evol. 2016. Dec 23;msw260. [DOI] [PubMed] [Google Scholar]
  • 103.Lefort V, Longueville JE, Gascuel O. SMS: Smart Model Selection in PhyML. Mol Biol Evol. 2017. Sep 1;34(9):2422–4. doi: 10.1093/molbev/msx149 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Roy SS, Dasgupta R, Bagchi A. A Review on Phylogenetic Analysis: A Journey through Modern Era. Comput Mol Biosci. 2014;04(03):39–45. [Google Scholar]
  • 105.Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet. 2020. Jul;21(7):428–44. doi: 10.1038/s41576-020-0233-0 [DOI] [PubMed] [Google Scholar]
  • 106.Goldman N. Maximum Likelihood Inference of Phylogenetic Trees, with Special Reference to a Poisson Process Model of DNA Substitution and to Parsimony Analyses. Syst Zool. 1990. Dec;39(4):345. [Google Scholar]
  • 107.Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J Mol Evol. 1996;43:304–11. [DOI] [PubMed] [Google Scholar]
  • 108.Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method. Mol Biol Evol. 1997. Jul 1;14(7):717–24. doi: 10.1093/oxfordjournals.molbev.a025811 [DOI] [PubMed] [Google Scholar]
  • 109.Mau B, Newton MA, Larget B. Bayesian Phylogenetic Inference via Markov Chain Monte Carlo Methods. Biometrics. 1999. Mar;55(1):1–12. doi: 10.1111/j.0006-341x.1999.00001.x [DOI] [PubMed] [Google Scholar]
  • 110.Bergsten J. A review of long-branch attraction. Cladistics. 2005. Apr;21(2):163–93. doi: 10.1111/j.1096-0031.2005.00059.x [DOI] [PubMed] [Google Scholar]
  • 111.Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 1958;(38):1409–38. [Google Scholar]
  • 112.Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4(4):406–25. doi: 10.1093/oxfordjournals.molbev.a040454 [DOI] [PubMed] [Google Scholar]
  • 113.Rzhetsky A, Nei M. A Simple Method for Estimating and Testing Minimum-Evolution Trees. Mol Biol Evol. 1992. Sep 1;9(5):945–67. [Google Scholar]
  • 114.Lefort V, Desper R, Gascuel O. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol Biol Evol. 2015. Oct;32(10):2798–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007;8(8):R171. doi: 10.1186/gb-2007-8-8-r171 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Huson DH, Bryant D. Estimating phylogenetic trees and networks using SplitsTree 4. Center for Bioinformatics: Tübingen University; 2004. [Google Scholar]
  • 117.Huson DH, Kloepper T, Bryant D. SplitsTree 4.0—Computation of phylogenetic trees and networks. 2004. [Google Scholar]
  • 118.Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004. Jan 22;20(2):289–90. doi: 10.1093/bioinformatics/btg412 [DOI] [PubMed] [Google Scholar]
  • 119.Swofford DL. Phylogenetic analysis using parsimony. Laboratory of Molecular Systematics Smithsonian Institution; 1998. [Google Scholar]
  • 120.Price MN, Dehal PS, Arkin AP. FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix. Mol Biol Evol. 2009. Jul 1;26(7):1641–50. doi: 10.1093/molbev/msp077 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Felsenstein J. phylogenetic inference program Version 3.6. Seattle: University of Washington; 2005. [Google Scholar]
  • 122.Misener S, Krawetz SA, editors. Bioinformatics methods and protocols. Totowa, N.J: Humana Press; 2000. 500 p. (Methods in molecular biology). [Google Scholar]
  • 123.Swofford DL, Sullivan J. Phylogeny inference based on parsimony and other methods using PAUP*. 2003;160. [Google Scholar]
  • 124.Gouy M, Guindon S, Gascuel O. SeaView Version 4: A Multiplatform Graphical User Interface for Sequence Alignment and Phylogenetic Tree Building. Mol Biol Evol. 2010. Feb 1;27(2):221–4. doi: 10.1093/molbev/msp259 [DOI] [PubMed] [Google Scholar]
  • 125.Guindon S, Delsuc F, Dufayard JF, Gascuel O. Estimating Maximum Likelihood Phylogenies with PhyML. In: Posada D, editor. Bioinformatics for DNA Sequence Analysis [Internet]. Totowa, NJ: Humana Press; 2009. [cited 2021 Sep 9]. p. 113–37. (Methods in Molecular Biology; vol. 537). Available from: http://link.springer.com/10.1007/978-1-59745-251-9_6 [DOI] [PubMed] [Google Scholar]
  • 126.Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014. May 1;30(9):1312–3. doi: 10.1093/bioinformatics/btu033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol. 2007. Apr 18;24(8):1586–91. doi: 10.1093/molbev/msm088 [DOI] [PubMed] [Google Scholar]
  • 128.Kosakovsky Pond SL, Poon AFY, Velazquez R, Weaver S, Hepler NL, Murrell B, et al. HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Crandall K, editor. Mol Biol Evol. 2020. Jan 1;37(1):295–9. doi: 10.1093/molbev/msz197 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129.Lewis PO. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol Biol Evol. 1998. Mar 1;15(3):277–83. doi: 10.1093/oxfordjournals.molbev.a025924 [DOI] [PubMed] [Google Scholar]
  • 130.Stamatakis A. Phylogenetic models of rate heterogeneity: a high performance computing perspective. In: Proceedings 20th IEEE International Parallel & Distributed Processing Symposium [Internet]. Rhodes Island, Greece: IEEE; 2006. [cited 2022 Oct 22]. p. 8 pp. Available from: http://ieeexplore.ieee.org/document/1639535/ [Google Scholar]
  • 131.Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018. Feb 1;35(2):518–22. doi: 10.1093/molbev/msx281 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Zhou X, Shen XX, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Mol Biol Evol. 2018. Feb 1;35(2):486–503. doi: 10.1093/molbev/msx302 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001. Aug 1;17(8):754–5. doi: 10.1093/bioinformatics/17.8.754 [DOI] [PubMed] [Google Scholar]
  • 134.Lartillot N, Lepage T, Blanquart S. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 2009. Sep 1;25(17):2286–8. doi: 10.1093/bioinformatics/btp368 [DOI] [PubMed] [Google Scholar]
  • 135.Lartillot N, Philippe H. A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process. Mol Biol Evol. 2004. Jun;21(6):1095–109. doi: 10.1093/molbev/msh112 [DOI] [PubMed] [Google Scholar]
  • 136.BayesTraits Pagel M. Computer program and documentation. Meade A, editor. PLoS Comput Biol [Internet]. 2007. [cited 2022 Jun 4]; Available from: https://dx.plos.org/10.1371/journal.pcbi.0010003 [Google Scholar]
  • 137.Bazinet AL, Zwickl DJ, Cummings MP. A Gateway for Phylogenetic Analysis Powered by Grid Computing Featuring GARLI 2.0. Syst Biol. 2014. Sep 1;63(5):812–8. doi: 10.1093/sysbio/syu031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 138.Felsenstein J. CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAP. Evolution. 1985. Jul;39(4):783–91. doi: 10.1111/j.1558-5646.1985.tb00420.x [DOI] [PubMed] [Google Scholar]
  • 139.Hedges SB. The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol Biol Evol. 1992;9(2):366–9. doi: 10.1093/oxfordjournals.molbev.a040725 [DOI] [PubMed] [Google Scholar]
  • 140.Jermiin LS, Poladian L, Charleston MA. Is the ‘Big Bang’ in Animal Evolution Real? Science. 2005. Dec 23;310(5756):1910–1. [DOI] [PubMed] [Google Scholar]
  • 141.Minh BQ, Nguyen MAT, von Haeseler A. Ultrafast Approximation for Phylogenetic Bootstrap. Mol Biol Evol. 2013. May 1;30(5):1188–95. doi: 10.1093/molbev/mst024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 142.Stamatakis A, Hoover P, Rougemont J. A Rapid Bootstrap Algorithm for the RAxML Web Servers. Renner S, editor. Syst Biol. 2008. Oct 1;57(5):758–71. doi: 10.1080/10635150802429642 [DOI] [PubMed] [Google Scholar]
  • 143.Anisimova M, Gascuel O. Approximate Likelihood-Ratio Test for Branches: A Fast, Accurate, and Powerful Alternative. Sullivan J, editor. Syst Biol. 2006. Aug 1;55(4):539–52. doi: 10.1080/10635150600755453 [DOI] [PubMed] [Google Scholar]
  • 144.Shimodaira H, Hasegawa M. Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference. Mol Biol Evol. 1999. Aug 1;16(8):1114–6. [Google Scholar]
  • 145.Shavit L, Penny D, Hendy MD, Holland BR. The Problem of Rooting Rapid Radiations. Mol Biol Evol. 2007. Nov;24(11):2400–11. doi: 10.1093/molbev/msm178 [DOI] [PubMed] [Google Scholar]
  • 146.Holland BR, Penny D, Hendy MD. Outgroup Misplacement and Phylogenetic Inaccuracy Under a Molecular Clock—A Simulation Study. Sullivan J, editor. Syst Biol. 2003. Apr 1;52(2):229–38. doi: 10.1080/10635150390192771 [DOI] [PubMed] [Google Scholar]
  • 147.Goldman N. Statistical tests of models of DNA substitution. J Mol Evol. 1993. Feb;36(2):182–98. doi: 10.1007/BF00166252 [DOI] [PubMed] [Google Scholar]
  • 148.Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997;13(3):235–8. doi: 10.1093/bioinformatics/13.3.235 [DOI] [PubMed] [Google Scholar]
  • 149.Shepherd DA, Klaere S. How Well Does Your Phylogenetic Model Fit Your Data? Foster P, editor. Syst Biol. 2019. Jan 1;68(1):157–67. doi: 10.1093/sysbio/syy066 [DOI] [PubMed] [Google Scholar]
  • 150.Lewis PO, Xie W, Chen MH, Fan Y, Kuo L. Posterior Predictive Bayesian Phylogenetic Model Selection. Syst Biol. 2014. May;63(3):309–21. doi: 10.1093/sysbio/syt068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151.Rambaut A. FigTree v1.3.1. Edinburgh: Institute of Evolutionary Biology, University of Edinburgh; (2010). [Google Scholar]
  • 152.Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol. 2016. Jun;33(6):1635–8. doi: 10.1093/molbev/msw046 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 153.Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021. Jul 2;49(W1):W293–6. doi: 10.1093/nar/gkab301 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 154.Dutheil J, Gaillard S, Bazin E, Glémin S, Ranwez V, Galtier N, et al. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics. 2006. Dec;7(1):188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 155.Miller MA, Pfeiffer W, Schwartz T. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In: 2010 Gateway Computing Environments Workshop (GCE) [Internet]. New Orleans, LA, USA: IEEE; 2010 [cited 2021 Sep 9]. p. 1–8. Available from: http://ieeexplore.ieee.org/document/5676129/
  • 156.Lemoine F, Correia D, Lefort V, Doppelt-Azeroual O, Mareuil F, Cohen-Boulakia S, et al. NGPhylogeny.fr: new generation phylogenetic services for non-specialists. Nucleic Acids Res. 2019. Jul 2;47(W1):W260–5. doi: 10.1093/nar/gkz303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 157.Sanchez R, Serra F, Tarraga J, Medina I, Carbonell J, Pulido L, et al. Phylemon 2.0: a suite of web-tools for molecular evolution, phylogenetics, phylogenomics and hypotheses testing. Nucleic Acids Res. 2011. Jul 1;39(suppl):W470–4. doi: 10.1093/nar/gkr408 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 158.Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, et al. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012. Jun 15;28(12):1647–9. doi: 10.1093/bioinformatics/bts199 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 159.Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary analysis. [Internet]. 2021. Available from: http://www.mesquiteproject.org [Google Scholar]
  • 160.Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. Pertea M, editor. PLOS Comput Biol. 2019. Apr 8;15(4):e1006650. doi: 10.1371/journal.pcbi.1006650 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161.Yu Y, Harris AJ, Blair C, He X. RASP (Reconstruct Ancestral State in Phylogenies): A tool for historical biogeography. Mol Phylogenet Evol. 2015. Jun;87:46–9. doi: 10.1016/j.ympev.2015.03.008 [DOI] [PubMed] [Google Scholar]
  • 162.Excoffier L, Laval G, Schneider S. Arlequin (version 3.0): An integrated software package for population genetics data analysis. Evol Bioinforma. 2005. Jan;1:117693430500100. [PMC free article] [PubMed] [Google Scholar]
  • 163.De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006. May 15;22(10):1269–71. doi: 10.1093/bioinformatics/btl097 [DOI] [PubMed] [Google Scholar]
  • 164.Lyons EH. CoGe, a new kind of comparative genomics platform. University of California, Berkeley; 2008. [Google Scholar]
  • 165.Meier-Kolthoff JP, Auch AF, Huson DH, Goker M. COPYCAT: cophylogenetic analysis tool. Bioinformatics. 2007. Apr 1;23(7):898–900. doi: 10.1093/bioinformatics/btm027 [DOI] [PubMed] [Google Scholar]
  • 166.Merkle D, Middendorf M, Wieseke N. A parameter-adaptive dynamic programming approach for inferring cophylogenies. BMC Bioinformatics. 2010. Jan;11(S1):S60. doi: 10.1186/1471-2105-11-S1-S60 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 167.Rozas J, Rozas R. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics. 1999. Feb 1;15(2):174–5. doi: 10.1093/bioinformatics/15.2.174 [DOI] [PubMed] [Google Scholar]
  • 168.Rousset F. genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol Ecol Resour. 2008. Jan;8(1):103–6. doi: 10.1111/j.1471-8286.2007.01931.x [DOI] [PubMed] [Google Scholar]
  • 169.Nguyen M, Ekstrom A, Li X, Yin Y. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes. Toxins. 2015. Oct 9;7(10):4035–53. doi: 10.3390/toxins7104035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 170.Conow C, Fielder D, Ovadia Y, Libeskind-Hadas R. Jane: a new tool for the cophylogeny reconstruction problem. Algorithms Mol Biol. 2010. Dec;5(1):16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 171.Singh PP, Isambert H. OHNOLOGS v2: a comprehensive resource for the genes retained from whole genome duplication in vertebrates. Nucleic Acids Res. 2019. Oct 15;gkz909. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 172.Dereeper A, Nicolas S, Le Cunff L, Bacilieri R, Doligez A, Peros JP, et al. SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects. BMC Bioinformatics. 2011. Dec;12(1):134. doi: 10.1186/1471-2105-12-134 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 173.Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics. 2006. Dec 1;22(23):2971–2. doi: 10.1093/bioinformatics/btl505 [DOI] [PubMed] [Google Scholar]
  • 174.Charleston MA, Robertson DL. Preferential Host Switching by Primate Lentiviruses Can Account for Phylogenetic Similarity with the Primate Phylogeny. Sanderson M, editor. Syst Biol. 2002. May 1;51(3):528–35. doi: 10.1080/10635150290069940 [DOI] [PubMed] [Google Scholar]
  • 175.Kryazhimskiy S, Plotkin JB. The Population Genetics of dN/dS. Gojobori T, editor. PLoS Genet. 2008. Dec 12;4(12):e1000304. doi: 10.1371/journal.pgen.1000304 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 176.Britten RJ. Rates of DNA Sequence Evolution Differ Between Taxonomic Groups. Science. 1986. Mar 21;231(4744):1393–8. doi: 10.1126/science.3082006 [DOI] [PubMed] [Google Scholar]
  • 177.Graur D, Martin W. Reading the entrails of chickens: molecular timescales of evolution and the illusion of precision. Trends Genet. 2004. Feb;20(2):80–6. doi: 10.1016/j.tig.2003.12.003 [DOI] [PubMed] [Google Scholar]
  • 178.To TH, Jung M, Lycett S, Gascuel O. Fast Dating Using Least-Squares Criteria and Algorithms. Syst Biol. 2016. Jan;65(1):82–97. doi: 10.1093/sysbio/syv068 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 179.Stevens J. Computational aspects of host-parasite phylogenies. Brief Bioinform. 2004. Jan 1;5(4):339–49. doi: 10.1093/bib/5.4.339 [DOI] [PubMed] [Google Scholar]
  • 180.Felsenstein J. Phylogenies and the Comparative Method. Am Nat. 1985. Jan;125(1):1–15. [Google Scholar]
  • 181.Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci. 1979. Oct 1;76(10):5269–73. doi: 10.1073/pnas.76.10.5269 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 182.Nei M. Analysis of Gene Diversity in Subdivided Populations. Proc Natl Acad Sci USA. 1973;70(12):3321–3. doi: 10.1073/pnas.70.12.3321 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 183.Excoffier L, Heckel G. Computer programs for population genetics data analysis: a survival guide. Nat Rev Genet. 2006. Oct;7(10):745–58. doi: 10.1038/nrg1904 [DOI] [PubMed] [Google Scholar]
  • 184.Rozas J, Ferrer-Mata A, Sánchez-DelBarrio JC, Guirao-Rico S, Librado P, Ramos-Onsins SE, et al. DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets. Mol Biol Evol. 2017. Dec 1;34(12):3299–302. doi: 10.1093/molbev/msx248 [DOI] [PubMed] [Google Scholar]
  • 185.DeLano WL. Pymol: An open-source molecular graphics tool. CCP4 Newsl Protein Crystallogr. 2002;40(1):82–92. [Google Scholar]
  • 186.Zhang Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics. 2008. Dec;9(1):40. doi: 10.1186/1471-2105-9-40 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 187.Hildebrand A, Remmert M, Biegert A, Söding J. Fast and accurate automatic structure prediction with HHpred. Proteins Struct Funct Bioinforma. 2009;77(S9):128–32. [DOI] [PubMed] [Google Scholar]
  • 188.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021. Aug 26;596(7873):583–9. doi: 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 189.Mahajan S, de Brevern AG, Sanejouand YH, Srinivasan N, Offmann B. Use of a structural alphabet to find compatible folds for amino acid sequences. Protein Sci. 2015. Jan;24(1):145–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 190.dos Santos HG, Nunez-Castilla J, Siltberg-Liberles J. Functional Diversification after Gene Duplication: Paralog Specific Regions of Structural Disorder and Phosphorylation in p53, p63, and p73. Roemer K, editor. PLOS ONE. 2016. Mar 22;11(3):e0151961. doi: 10.1371/journal.pone.0151961 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 191.Cao L, Chen F, Yang X, Xu W, Xie J, Yu L. Phylogenetic analysis of CDK and cyclin proteins in premetazoan lineages. BMC Evol Biol. 2014;14(1):10. doi: 10.1186/1471-2148-14-10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 192.Le SQ, Gascuel O. An Improved General Amino Acid Replacement Matrix. Mol Biol Evol. 2008. Apr 3;25(7):1307–20. doi: 10.1093/molbev/msn067 [DOI] [PubMed] [Google Scholar]
  • 193.Dehal P, Boore JL. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. Holland P, editor. PLoS Biol. 2005. Sep 6;3(10):e314. doi: 10.1371/journal.pbio.0030314 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 194.Holland LZ, Ocampo Daza D. A new look at an old question: when did the second whole genome duplication occur in vertebrate evolution? Genome Biol. 2018. Dec;19(1):209. doi: 10.1186/s13059-018-1592-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 195.Peyressatre M, Prével C, Pellerano M, Morris M. Targeting Cyclin-Dependent Kinases in Human Cancers: From Small Molecules to Peptide Inhibitors. Cancers. 2015. Jan 23;7(1):179–237. doi: 10.3390/cancers7010179 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Arndt von Haeseler

23 May 2022

PONE-D-22-06715

Roadmap to the study of gene and protein evolution

PLOS ONE

Dear Dr. Jacques,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 07 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Arndt von Haeseler

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections does not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ.

Additional Editor Comments:

The ms has been evaluated by two experts in the field. I agree with the reviewers' remarks and would like to ask you to address the criticism and to rewrite the ms accordingly.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

Descriptions of methods and reagents contained in the step-by-step protocol should be reported in sufficient detail for another researcher to reproduce all experiments and analyses. The protocol should describe the appropriate controls, sample sizes and replication needed to ensure that the data are robust and reproducible.

Reviewer #1: Partly

Reviewer #2: No

**********

3. Does the protocol describe a validated method?

The manuscript must demonstrate that the protocol achieves its intended purpose: either by containing appropriate validation data, or referencing at least one original research article in which the protocol was used to generate data.

Reviewer #1: Yes

Reviewer #2: No

**********

4. If the manuscript contains new data, have the authors made this data fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: N/A

Reviewer #2: N/A

**********

5. Is the article presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please highlight any specific errors that need correcting in the box below.

Reviewer #1: Yes

Reviewer #2: No: The manuscript can be written more succinctly

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this manuscript, Jacques et al. present a guideline for studying gene and protein evolution. In summary, several available public resources for DNAs and proteins are introduced, which provide the sequence data and their related characters such as gene functions, protein structures, the associated biological pathways or diseases, etc. They can be specific, focusing only on a particular model organism such as TAIR, WormBase or The human protein atlas, or can also be very widespread like NCBI or UniProt. Next to those databases, the authors listed different tools and methods for reconstructing and analyzing the phylogenetics of genes or proteins, starting from the alignment of the homologous genes to tree building and visualizing the result trees. They furthermore highlighted some applications that can be utilized from the phylogenetic analysis, such as measure of selection strength, study of co-evolution, or genome evolution, etc. Finally, the manuscript is completed with two tutorials to reconstruct the evolutionary history of p53 protein family and the coevolution between human cyclins and CDKs.

Overall, this manuscript provides a practical protocol for biologists to design a basic analysis of gene evolution. Still, I see some issues and improvements.

Major issues:

1. In the sequence alignment section, the authors already mentioned that ClustalW is suitable for large datasets, while MAFFT has the highest accuracy. Why hasn’t MAFFT been used in the two examples instead of Clustal2.1, even though the dataset was small enough for MAFFT? How will using a different alignment tool affect the result?

2. Both two examples considered small datasets. For example, the reconstruction of the evolutionary history of p53 family has been done with only 11 taxa. It is, therefore, doable by either manually typing the species to the search organism field or downloading protein sequences for each species from the online BLAST search result. However, in many cases, especially while studying about the evolutionary history of a gene, one needs a large and more diverse taxa set (i.e. https://doi.org/10.3389/fmicb.2021.739000). It would be appropriate to have an instruction for effectively batch downloading the required data. Or, in general, how to deal with retrieving large data?

3. The databases are intensively represented. The information, such as the gene or protein sequences, can be redundant between databases. Nevertheless, their identifiers are often different. It would be helpful to quickly outline the approaches to communicate between different resources, for example by using the ID mapping tool of UniProt (https://www.uniprot.org/uploadlists/).

4. Similar to the issue with the databases, a large number of tools has been introduced. However, many of them are just superficially described, especially the tools for the phylogenetic analysis in table 6 and 7. Thus, this can puzzle the users for choosing the right approach for their study, which consequently defeats the purpose of this manuscript.

Minor issues:

1. FigTree should not belong to the list of nucleic acid databases (table 1).

2. Link for UniProt in table 2 should not be directed to a specific protein (P04637) but be generalized.

3. Gblocks Server is no longer available (link in table 4).

4. Links for BayesTraits, HGT-Finder in table 6 and Bio++ in table 7 do not work (tested on May 15th).

5. Link for PyMol in table 8 is directed to Mesquite.

6. The reference for some tools are missing in the main text and tables. Namely, ClustalOmega, Probcons (table 4), BayesTraits, FigTree, PAUL (table 6), and Bio++ Suite (table 7).

Reviewer #2: GENERAL COMMENTS

In general, I think there is merit in publishing a manuscript like the one submitted. However, I also think you could make it more comprehensive and more accurate, especially regarding the section on the protocol and phylogenetic analysis.

Regarding the protocol, I think readers would need to have a more detailed decision tree that offers them alternative paths, depending on what their objectives and data are. You need to incorporate some measures of quality control at different steps in the protocol, and you need feedback loops that are followed in case the result of a quality control is that the preceding analysis did not yield the expected result (otherwise you will perpetuate errors made at the early steps). In this regard, your protocol is a little like that in Ciccarelli et al. (2006; Science 311, 1283-1287).

I welcome that you cite a lot of databases, but could you move their citations from where they are to right after the name? Sometimes, a citation is at the end of the sentence and might be mistaken for a citation to something else (e.g., other databases or multiple sequence alignment).

Regarding the phylogenetic analysis (L194-L364), I compliment you on a valiant attempt to cover a large and complex field. However, I do not think you succeeded because I found gaping holes:

1. You seem unaware of previous phylogenetic protocols, one of which appeared recently in NAR Genomics & Bioinformatics (2, lqaa041).

2. Because the manuscript does not present something novel but summarises bioinformatics tools and resources, chiefly databases, you need to be comprehensive. Unfortunately, you were not comprehensive. This applies to both multiple sequence alignment methods and phylogenetic methods.

3. Currently, I am comparing the accuracy of 20 multiple sequence alignment methods (i.e., Clustal Omega, CONTRAlign, DIALIGN-TX, Fsa, GramAlign, KAlign, MAFFT (default), MAFFT (EINSI), MAFFT (GINSI), MAFFT (LINSI), MSAProbs, Muscle, Pasta, PicXAA, Poa, Prank, ProbAlign, ProbCons, T-Coffee (fast), and T-Coffee (regressive)). You considered only a few of these.

4. You mention one method for trimming sites from multiple sequence alignment (GBlocks; in Table 4). There is a suite of other and more suitable methods available (see NAR Genomics & Bioinformatics 2, lqaa024; see also citations 13-21 in that paper).

5. You mention three model-selection methods but overlooked other more flexible methods (Nature Methods 14, 587-589; Mol. Biol. Evol. 29, 1695-1701; Syst. Biol. 63, 726-742; Mol. Biol. Evol. 34, 772-773; Syst. Biol. 69, 249-264).

6. Your understanding of the relationship between log-likelihood and the AIC and BIC is wrong (L232-L233), suggesting confusion. You should read Briefings in Bioinformatics (21, 533-565).

7. Your understanding of what the non-parametric bootstrap scores indicate is wrong (see Science 310, 1911-1912).

8. You mention several distance matrix-based phylogenetic methods but do not cite papers describing them.

9. Your list of phylogenetic methods is incomplete and misses many important programs (e.g., IQTREE (Mol. Biol. Evol. 32, 268-274), IQTREE2 (Mol. Biol. Evol. 37, 1530-1534), PHYLIP (Felsenstein, 2005), PyCogent (Genome Biology 8, R171), FastME v2.0 (Mol. Biol. Evol. 32, 2798-2800), LSD (Syst. Biol. 65, 82-97) Garli (Syst. Biol. 63, 812-818), PhyloBayes 3 (Bioinformatics 25, 2286-2288), SplitsTree (Mol. Biol. Evol. 23, 254-267)).

10. You really need to ensure that software and methods referred to are cited properly, preferentially every time and with version numbers included.

11. Your figures are unclear, and the colours used are not consistent with a colour palette fit for colourblind people.

SPECIFIC COMMENTS

L47-L48. Why start with “Although … community”?

L61. “allowed” > “has allowed”

L61-L62. “nucleic acid and protein” > “nucleotides and amino acids” [DNA is a sequence of nucleotides and protein is a sequence of amino acids; therefore, it is not correct to write DNA sequences or protein sequences]

L62. “In addition, this data is” > “Typically, these data are” [datum is singular of data]

L66-L67. “navigate these resources can be puzzling” > “navigate through these resources can be challenging”

L81. Delete “considerable accompanying”

L84-L86. The figure is not a fully developed protocol like those in NAR Genomics & Bioinformatics 2, lqaa041 [2020].

L89. “DNA sequences” > “nucleotide sequences (DNA)” [I am pedantic here, but DNA is a sequence (of nucleotides) so writing DNA sequences is the same as saying a sequence of a sequence of nucleotides]

L89. “and their protein translations” > “and, where applicable, their translations into protein”

L103. FigTree is not a database!!! It is a phylogenetic tool, so it should be in a different table.

L119. “protein sequences” > “amino-acid sequences” [I use a hyphen because amino and acid form a compound adjective]

L123. See above.

L148. “sequences alignment” > “pairwise sequence alignment” OR “multiple sequence alignment”

L159. “evolutionary relationships” > “evolutionary relationships (see below)”

L167 “under FASTA” > “in FASTA”

L177. “, i.e., … ancestry” > “(i.e., … ancestry)” [I prefer “that is,” in the sentence and “i.e.,” inside parentheses]

L177-L179. This sentence should be so that the terms homology, orthology, paralogy and xenology are clear and unambiguous.

L184. Include “(define/explain the E-value)”

L185. “amino acid sequences” > “amino-acid sequences”

L186. Explain “HH”

L196. “proteins within” > “proteins, even within”

L197. “data, (“ > “data (“

L198. “), includes” > “) includes”

L202. Delete “sequences”

L203. “one after the other” > “, one after the other,”

L213-L214. Not “distantly related”; “closely related” [I spoke to Ary and Nick about this]

L227. You list the two least realistic models of rate heterogeneity across sites. More realistic models of rate heterogeneity across sites are described in Kalyaanamoorthy et al. (Nature Methods 14, 587-589) and Crotty et al. (Syst. Biol. 69, 249-264).

L232-L233. This sentence is incorrect. The AIC and BIC are computed from the log-likelihood scores; not the other way around, as you write.

L235-L239. Your description of the reason for doing a non-parametric bootstrap analysis is misleading and is not supported by relevant literature (e.g., Mol. Biol. Evol. 9, 366-369; Mol. Biol. Evol. 30, 1188-1195; Mol. Biol. Evol. 35, 518-522).

L247-L250. Include citations after each method or program mentioned, please.

I will stop my general review here, but urge you to revise every page in accordance with the suggestions given above.
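For reference, the relationship raised in the comments on the AIC and BIC (both criteria are computed from the maximised log-likelihood, not the reverse) can be written out in a few lines. This is a minimal sketch using the standard textbook definitions; the function names are illustrative, and taking the number of alignment sites as the sample size for the BIC is an assumption, since other choices of sample size are also used.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: BIC = k ln n - 2 ln L.

    n_sites is used as the sample size n (an assumption of this sketch).
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: (maximised log-likelihood, number of free parameters).
fits = {"model_A": (-2500.0, 20), "model_B": (-2480.0, 29)}
n_sites = 400  # hypothetical alignment length

for name, (lnl, k) in fits.items():
    print(name, round(aic(lnl, k), 2), round(bic(lnl, k, n_sites), 2))

# The preferred model is the one with the lowest score.
best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
```

Because the BIC penalises each extra parameter by ln n rather than by 2, the two criteria can rank the same candidate models differently.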

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Feb 24;18(2):e0279597. doi: 10.1371/journal.pone.0279597.r002

Author response to Decision Letter 0


24 Aug 2022

We thank the reviewers for their helpful and insightful comments. We have revised the manuscript in accordance with most of their comments and concerns. We believe that their thorough review and our revision have resulted in a much-improved contribution. Below, we reply to all the comments one by one. We refer to new intervals of text by their row number in the manuscript.

Reviewer 1:

1. In the sequence alignment section, the authors already mentioned that ClustalW is suitable for large datasets, while MAFFT has the highest accuracy. Why hasn’t MAFFT been used in the two examples instead of Clustal2.1, even though the dataset was small enough for MAFFT? How will using a different alignment tool affect the result?

Yes, we agree and thank the reviewer for pointing it out. We used MAFFT in our first example and ClustalW in the second example, because they are two of the most user-friendly tools for beginners, since they have a web interface. In our first example, ClustalW was used only to generate the percent identity matrix. In the introduction, we have now also clarified the overall aim of highlighting the databases and tools that offer a web user interface. In our examples, the differences in results (tree topology and node robustness) between MAFFT and ClustalW are minor. Although a comparison of the different results could be interesting, we refrain from discussing it here since it is beyond the scope of this guide to the practical tools. More specifically, edits are made at

Row 69-70: ‘The aim is to provide a practical guide for beginners and more advanced explorers into protein and gene evolution.’

Row 233-235: ‘ PROBCONS [42], T-COFFEE [41] and MAFFT [40] are described to have particularly high accuracy but also high calculation times [45]. They are suitable for small and intermediate datasets.’
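To make the notion of a percent identity matrix concrete, the following minimal sketch computes pairwise identities from sequences that are already aligned. It is an illustration under stated assumptions, not the calculation performed by ClustalW: here, columns in which either sequence has a gap are skipped, and the sequences and names are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(alignment: dict) -> dict:
    """Return a nested dict: matrix[name_1][name_2] = percent identity."""
    return {
        n1: {n2: percent_identity(s1, s2) for n2, s2 in alignment.items()}
        for n1, s1 in alignment.items()
    }


# Hypothetical toy alignment (not real p53 sequences).
toy_alignment = {"s1": "MEEPQ-SD", "s2": "MEESQASD", "s3": "MDD-QASD"}
matrix = identity_matrix(toy_alignment)
for name, row in matrix.items():
    print(name, [round(value, 1) for value in row.values()])
```

Different tools handle gaps and the choice of denominator differently, so values from this sketch may not match those reported by a given alignment program.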

2. Both two examples considered small datasets. For example, the reconstruction of the evolutionary history of p53 family has been done with only 11 taxa. It is, therefore, doable by either manually typing the species to the search organism field or downloading protein sequences for each species from the online BLAST search result. However, in many cases, especially while studying about the evolutionary history of a gene, one needs a large and more diverse taxa set (i.e. https://doi.org/10.3389/fmicb.2021.739000). It would be appropriate to have an instruction for effectively batch downloading the required data. Or, in general, how to deal with retrieving large data?

Yes, this is a valid comment. We have now revised the text to clarify how to extract a large number of sequences from NCBI using different identifiers, and how to save them into a fasta file. Also, we now introduce the Batch Entrez tool. More specifically, edits are made at

Row 121-123: ‘It is also possible to batch download a large number of sequences from NCBI, by entering their identifiers (accession numbers, gi numbers or GeneIDs) in Batch Entrez.’

Row 133-134: ‘The “retrieve/ID mapping” tool of UniProt allows to batch download the information on a list of proteins using UniProt identifiers.’
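For readers who prefer scripting to the Batch Entrez web form, a comparable batch retrieval can be sketched with the NCBI E-utilities `efetch` endpoint. Note that E-utilities is not mentioned in the response above; it is introduced here only as an illustration. The helper name `build_efetch_url` and the example accessions are illustrative, and the sketch only builds the request URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="protein"):
    """Build an efetch URL that returns the given records as FASTA text."""
    ids = [acc.strip() for acc in accessions if acc.strip()]
    if not ids:
        raise ValueError("at least one accession is required")
    params = {
        "db": db,              # NCBI database to query
        "id": ",".join(ids),   # comma-separated list of identifiers
        "rettype": "fasta",    # return sequences in FASTA format
        "retmode": "text",     # plain-text output
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Illustrative accessions; replace with the identifiers of interest.
url = build_efetch_url(["NP_000537.3", "P04637"])
print(url)
```

The resulting URL can be opened with any HTTP client and the response saved to a `.fasta` file. For long identifier lists, it is probably wiser to split the list into smaller batches and follow NCBI's usage guidelines on request rates.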

3. The databases are intensively represented. The information, such as the gene or protein sequences, can be redundant between databases. Nevertheless, their identifiers are often different. It would be helpful to quickly outline the approaches to communicate between different resources, for example by using the ID mapping tool of UniProt (https://www.uniprot.org/uploadlists/).

Yes, we thank the reviewer also for pointing this out. We have revised the text to clarify how to communicate between the identifiers of several databases and introduced the ID mapping tool of Uniprot. More specifically, edits are made at

Row 130-132: ‘The “retrieve/ID mapping” tool of UniProt allows to batch download the information on a list of proteins using UniProt identifiers.’

4. Similar to the issue with the databases, a large number of tools has been introduced. However, many of them are just superficially described, especially the tools for the phylogenetic analysis in table 6 and 7. Thus, this can puzzle the users for choosing the right approach for their study, which consequently defeats the purpose of this manuscript.

We understand the reviewer's concern here. There is a delicate balance when condensing a complex field into a practical guide that is also aimed at beginners (as well as somewhat more advanced users). We have tried to go into some more depth while clarifying how to choose the paths forward. More specifically, we have added several tools and bioinformatic methods for sequence alignment and trimming, model selection and phylogenetic inference, as well as comments on their specificities, including models of molecular evolution and bootstrapping methods. We edited the protocol so that it is more comprehensive, clearer and more user-friendly for non-bioinformatic users (and also stated this in the introduction). We described the specificities of databases and tools in more detail in Tables 1 to 10 and their corresponding paragraphs, to facilitate the choice of an appropriate method by the protocol users.

We explained the different methods for sequence alignment in more detail, and added several programs that can be used, with their specificities and comments on how to choose between them, mostly based on the size of the dataset. Specifically, edits are made on:

Row 232-243: ‘CLUSTALW [34] and MUSCLE [36] are included in MEGA [44]. They display web interfaces, as well as MAFFT [40], Kalign [39], and PRANK [37,55]. PROBCONS [42], T-COFFEE [41] and MAFFT [40] are described to have particularly high accuracy but also high calculation times [45]. They should be restricted to small and intermediate datasets. CLUSTAL Omega [35] and Kalign [39] are particularly fast, but less accurate [46]. They can be used for datasets of up to 4000 and 2000 sequences, respectively [45,46]. The performances of MUSCLE are intermediate [46]. PRANK is particularly accurate for large sets and closely related sequences. Bali-Phy [47] performs a Bayesian co-estimation of alignment, phylogeny, and other parameters and is also argued to be very reliable. PASTA [48] and UPP [49], which uses a machine learning technique, are designed for very large datasets. MAFFT offers a wide range of methods, which can be accuracy-oriented, such as L-INS-i, G-INS-i and E-INS-i; or speed-oriented, such as FFT-NS-2, which can be used for up to 30 000 sequences.’
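The size-based guidance in the passage above can be summarised as a small decision helper. This is only a sketch: the limits of 2000, 4000 and 30 000 sequences are taken from the quoted text, whereas the cutoff for "small and intermediate" datasets (`small_cutoff`) is a hypothetical value chosen for illustration, and the function name is likewise illustrative.

```python
def suggest_aligners(n_sequences: int, small_cutoff: int = 200) -> list:
    """Suggest alignment programs for a dataset of n_sequences sequences.

    small_cutoff is a hypothetical boundary for "small and intermediate"
    datasets; the quoted text does not give a number for it.
    """
    if n_sequences < 2:
        raise ValueError("an alignment needs at least two sequences")
    suggestions = []
    if n_sequences <= small_cutoff:
        # Accuracy-oriented but slow methods.
        suggestions += ["PROBCONS", "T-COFFEE", "MAFFT (L-INS-i, G-INS-i or E-INS-i)"]
    if n_sequences <= 2000:
        suggestions.append("Kalign")
    if n_sequences <= 4000:
        suggestions.append("CLUSTAL Omega")
    if n_sequences <= 30000:
        suggestions.append("MAFFT (FFT-NS-2)")
    else:
        # Methods described as designed for very large datasets.
        suggestions += ["PASTA", "UPP"]
    return suggestions


for size in (11, 3000, 50000):
    print(size, suggest_aligners(size))
```

Dataset size is only one criterion; sequence divergence and the availability of a web interface, both discussed in the text, may also guide the choice.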

We also added information on the different phylogenetic methods. They are now in separate paragraphs. We added several programs that can be used for phylogenetic inference, with some of their specificities and comments on how to choose between them, mostly based on the size of the datasets, but also their diverse options (type of data, models implemented, branch support test). Specifically, edits are made on:

Row 332-341: ‘MEGA [44] and SeaView [104] are known to be very user-friendly. They include sequence alignment tools and tree editors. PhyML [105] is accurate, easy to use and, like PAUP [103] and MEGA [44], includes all common models of molecular evolution. RAxML [106] and particularly FastTree [100] are fast and well suited for large datasets (up to 1 million sequences with FastTree). They use a specific model of rate heterogeneity, in addition to Gamma law and proportion of invariant sites. IQ-TREE [60,61], which includes ModelFinder [84] and a very fast bootstrapping method, is reported to be both fast and accurate [111]. PAUP is slower than other programs and uses nucleotide data only.’

Row 345-349: ‘The main software used for BI-based phylogenetics is MrBayes [112] that uses the Markov Chain Monte Carlo (MCMC) algorithm (Table 7). PhyloBayes [113,114] is a bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model, well adapted for large datasets and phylogenomics. Bali-Phy [47] can also be used for phylogenetic analysis using Bayesian inference.’

To make the protocol more comprehensible and easier to use for beginners, we also added information on bioinformatic tools and how to use them, in the text and in the tables. More specifically, we added details on the tools used by the workflows.

Row 399-406: ‘The Cyberinfrastructure for phylogenetic research (CIPRES science gateway) [129] is a public resource for phylogenetic analysis that includes most tools and software for sequence alignment, model selection, and phylogenetic inference, including BEAST, FastTree, GARLI, IQ-TREE, jModelTest, MAFFT, MrBayes, PAUP, PhyloBayes and RAxML. Other packages include NGPhylogeny [130], a web service for phylogenetic analysis from sequence alignment to tree inference, and Phylemon [131], a suite of web tools for phylogenetics, phylogenomics, molecular evolution studies and hypothesis testing.’

Table 8: List of packages for phylogenetic analysis and evolutionary studies (*: tools that include web interface)

Software | Features | Link | References
Bio++ | Software package for sequence analyses, phylogenetic analysis, molecular evolution studies and population genetics analyses | https://github.com/BioPP | [128,132]
CIPRES | Server providing a software package for diverse phylogenetic analyses, including BEAST, FastTree, GARLI, IQ-TREE, jModelTest, MAFFT, MrBayes, PAUP, PhyloBayes and RAxML | https://www.phylo.org/ | [129]
HyPhy | Software package for evolutionary analyses including evolution model selection, phylogenetic inference using ML and distance methods and sequence evolution studies | https://www.hyphy.org/ | [108]
NGPhylogeny* | Workflows that integrate numerous methods and tools for multiple sequence alignment (MAFFT), trimming (BMGE), tree inference (FastTree, FastME or PhyML) and newick display | https://ngphylogeny.fr | [130]
Phylemon2* | Web-tools for molecular evolution, phylogenetics, phylogenomics, molecular evolution studies and hypotheses testing | http://phylemon.bioinfo.cipf.es/index.html | [131]

Row 507-513: ‘Structure alignments can be realized to compare protein functions and evolution, and the mean distance in Å between homologous residues can be calculated. I-TASSER [159] and HHPred of the HH-suite software [160] can predict 3-dimensional structure for protein sequences using homology information. FoRSA [161] uses a structural alphabet known as Protein Blocks to identify a protein fold from its amino acid sequence, or to identify a protein sequence in the proteome of a species from a crystal structure by calculating a likelihood score.’

Table 10: List of programs for protein structure analyses

Software | Features | Link | References
FoRSA | Protein structure prediction using a structural alphabet | http://www.bo-protscience.fr/forsa/ | [161]
HHPred | Protein structure prediction using homology information | https://toolkit.tuebingen.mpg.de/tools/hhpred | [160]
I-TASSER | Protein structure prediction and structure-based function annotation | https://zhanglab.dcmb.med.umich.edu/I-TASSER/ | [159]
PyMOL | 3D visualization of molecules and diverse analyses on protein structures | https://pymol.org/2/ | [158]

1. FigTree should not belong to the list of nucleic acid databases (table 1).

We removed FigTree from Table 1.

2. Link for UniProt in table 2 should not be directed to a specific protein (P04637) but be generalized.

We changed the link to the generalized web page.

3. Gblocks Server is no longer available (link in table 4).

The broken link was replaced with the valid one.

4. Links for BayesTraits, HGT-Finder in table 6 and Bio++ in table 7 do not work (tested on May 15th).

The links were replaced with valid ones.

5. Link for PyMol in table 8 is directed to Mesquite.

The link was replaced with a link to PyMol

6. The reference for some tools are missing in the main text and tables. Namely, ClustalOmega, Probcons (table 4), BayesTraits, FigTree, PAUL (table 6), and Bio++ Suite (table 7).

References were added to each of these tools.

Reviewer 2:

In general, I think there is merit in publishing a manuscript like the one submitted. However, I also think you could make it more comprehensive and more accurate, especially regarding the section on the protocol and phylogenetic analysis. Regarding the protocol, I think readers would need to have a more detailed decision tree that offers them alternative paths, depending on what their objectives and data are. You need to incorporate some measures of quality control at different steps in the protocol, and you need feedback loops that are followed in case the result of a quality control is that the preceding analysis did not yield the expected result (otherwise you will perpetuate errors made at the early steps). In this regard, your protocol is a little like that in Ciccarelli et al. (2006; Science 311, 1283-1287).

We are grateful that the reviewer also sees the merit in this practical guide to the study of gene and protein phylogeny and evolution for those willing to get into the field. This first comment is certainly also valid. Therefore, we have edited the text so that the protocol now includes an extra step to assess phylogenetic assumptions after alignment trimming, along with several tools for doing so. We also added feedback loops to the protocol (Figure 1).

Row 88-91: ‘Feedback loops illustrate the necessity to control the quality of the alignment, to assess phylogenetic assumptions and to test the robustness of the tree, and to go back to previous steps to redo the analysis if necessary.’

We also added several tools proposed by Reviewer 2. This makes the manuscript more comprehensive, especially for sequence alignment and trimming, model selection and phylogenetic analysis. We also edited the text to clarify the specificities of numerous tools and methods so that users can select the best-fitting method depending on their data, their objectives, and the size of their datasets. More specifically, we have edited the following paragraphs:

Row 248-256: ‘Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or might be subject to saturation. These positions should be manually or automatically excluded prior to the phylogenetic analysis. The resulting sub-MSA maximizes the phylogenetic signal of the alignment [51]. Alignment trimming can be done manually or using appropriate programs. The completeness of alignments can be quantified and phylogenetically informative regions of the alignment can be selected using appropriate tools, such as Guidance 2 [52], AliStat [53], Gblocks [51,54], trimAl [55], BMGE [56] and Noisy [57] (Table 5).’
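The idea of automatically excluding poorly aligned positions can be illustrated with a deliberately simple rule: drop every alignment column whose fraction of gaps exceeds a threshold. This minimal sketch is not the algorithm used by Gblocks, trimAl, BMGE or the other tools named above, which apply more elaborate criteria; the function name, the default threshold and the toy alignment are all hypothetical.

```python
def trim_gappy_columns(alignment: dict, max_gap_fraction: float = 0.5) -> dict:
    """Remove columns in which the fraction of gap characters exceeds the threshold."""
    if not alignment:
        raise ValueError("the alignment is empty")
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that pass the gap filter.
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


# Hypothetical toy alignment: the third column is a gap in two of three sequences.
toy = {"a": "MK-LV", "b": "MK-LV", "c": "MKALV"}
trimmed = trim_gappy_columns(toy)
print(trimmed)
```

A gap-fraction filter addresses only alignment completeness. It does not detect saturated or misaligned but gap-free regions, which is one reason the dedicated tools listed in the text are preferable for real analyses.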

We also added more steps to the protocol, including a paragraph on the validation of phylogenetic assumptions, based on a new tool that has been implemented in IQ-TREE, and another paragraph on bootstrapping methods.

Row 261-269: ‘Most phylogenetic methods rely on simplifying assumptions stating for example that all sites in the alignment evolved under the same tree, that mutation rates have remained constant, and that substitutions are reversible. Once the alignment is performed and the sites selected for phylogenetic inference, a recent phylogenetic protocol recommends assessing those phylogenetic assumptions when possible [58]. If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses can be biased with most common phylogenetic programs [59]. Several statistical methods have been developed. Recently, tests for all these assumptions have been included in IQ-TREE [60,61]. It is also possible to use the R package MOTMOT [62].’

I welcome that you cite a lot of databases, but could you move their citations from where they are to right after the name? Sometimes, a citation is at the end of the sentence and might be mistaken for a citation to something else (e.g., other databases or multiple sequence alignment).

Yes, thanks for pointing this out. We have now edited the text throughout the manuscript so that citations consistently appear right after the name of every database, and after the name of each software tool every time it appears in a paragraph.

Regarding the phylogenetic analysis (L194-L364), I compliment you on a valiant attempt to cover a large and complex field. However, I do not think you succeeded because I found gaping holes:

1. You seem unaware of previous phylogenetic protocols, one of which appeared recently in NAR Genomics & Bioinformatics (2, lqaa041).

Thanks for pointing this paper out to us. Now, we have edited figure 1 and the text according to this recently published protocol, and cite this work. We have edited the text on:

Row 264-265: ‘a recent phylogenetic protocol recommends assessing those phylogenetic assumptions when possible [58]’

Row 370-373: ‘For model-based methods, a recent phylogenetic protocol recommends to test the goodness of fit between tree, model and data using a parametric bootstrap [58,128]. Bayesian inference method calculates posterior probabilities, which measure branch support instead of bootstrap values.’

2. Because the manuscript does not present something novel but summarizes bioinformatics tools and resources, chiefly databases, you need to be comprehensive. Unfortunately, you were not comprehensive. This applies to both multiple sequence alignment methods and phylogenetic methods.

We have edited the text to highlight that the main aim of this work is to provide a roadmap for the study of gene and protein evolution for the broad scientific community and for scientists who are new to the field. Our intention is indeed to be comprehensive, but also to simplify the decision-making process for readers choosing the most appropriate tools for their scientific endeavor. Therefore, we believe it is important to strike a balance between being exhaustive and being practical enough to include beginners. This paper does not aim to present all tools that exist for MSA or phylogenetic analysis but rather a set of tools that are diverse, widely used in the scientific community, and user-friendly. Especially for beginners in the field, it is easier to navigate a list of software that is actively maintained by the scientific community, and for which support (in the form of tutorials, documentation, publications, and online support groups) is easier to obtain. Still, we added several tools and methods for sequence alignment, trimming, model selection and phylogenetic inference (see responses below). We have edited the text:

Row 5: ‘a practical guide’

Row 69-70: ‘The aim is to provide a practical guide for beginners and more advanced explorers into protein and gene phylogeny and evolution.’

3. Currently, I am comparing the accuracy of 20 multiple sequence alignment methods (i.e., Clustal Omega, CONTRAlign, DIALIGN-TX, Fsa, GramAlign, KAlign, MAFFT (default), MAFFT (EINSI), MAFFT (GINSI), MAFFT (LINSI), MSAProbs, Muscle, Pasta, PicXAA, Poa, Prank, ProbAlign, ProbCons, T-Coffee (fast), and T-Coffee (regressive)). You considered only a few of these.

Thanks for also pointing out these methods. As in our response to the previous point, we recognize that several other tools have been developed. We have therefore added a majority of these methods to the text and to Table 4.

Thus, we have edited the following:

Row 226-243: ‘The progressive approach aligns progressively from the closest to the most distant sequences. It is used by CLUSTALW [34], CLUSTAL Omega [35], MUSCLE [36], PRANK [37,55], Kalign [39] and MAFFT [40]. Consistency-based methods calculate the best alignment after different pairwise alignments. They are used by T-COFFEE [41] and PROBCONS [42] and its successor CONTRAlign [43], the latter for amino acid sequences only (Table 4). CLUSTALW [34] and MUSCLE [36] are included in MEGA [44]. They display web interfaces, as well as MAFFT [40], Kalign [39], and PRANK [37,55]. PROBCONS [42], T-COFFEE [41] and MAFFT are described to have particularly high accuracy but also high calculation times [45]. They should be restricted to small and intermediate datasets. CLUSTAL Omega [35] and Kalign [39] are particularly fast, but less accurate [46]. They can be used for datasets of up to 4000 and 2000 sequences, respectively [45,46]. The performances of Muscle are intermediate [46]. PRANK is particularly accurate for large sets and closely related sequences. Bali-Phy [47] performs a bayesian co-estimation of alignment, phylogeny, and other parameters and is also argued to be very reliable. PASTA [48] and UPP [49], that uses a machine learning technique, are designed for very large datasets. MAFFT offers a wide range of methods, which can be accuracy-oriented, such as L-INS-i, G-INS-I and E-INS-i; or speed-oriented, such as FFT-NS-2, which can be used for up to 30 000 sequences.’

Table 4: List of programs for multiple sequence alignment (*: tools that include web interface)

Software | Features | Link | References
BAli-Phy | Multiple sequence alignment of nucleotide and amino acid sequences and phylogenetic analysis using a bayesian approach | http://www.bali-phy.org | [47]
CLUSTAL Omega* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, suitable for large datasets | https://www.ebi.ac.uk/Tools/msa/clustalo/ | [35]
CLUSTALW* | Multiple sequence alignment for nucleotide or amino acid data | https://www.genome.jp/tools-bin/clustalw | [34]
CONTRAlign (ProbCons) | Accuracy-oriented multiple sequence alignment for amino acid data | http://contra.stanford.edu/contralign/ | [42,43]
Kalign* | Multiple sequence alignment for nucleotide or amino acid data, suitable for large datasets | https://www.ebi.ac.uk/Tools/msa/kalign/ | [39]
MAFFT* | Accuracy-oriented multiple sequence alignment for nucleotide or amino acid data | https://mafft.cbrc.jp/alignment/server/ | [40]
MUSCLE* | Multiple sequence alignment for nucleotide or amino acid data | https://www.ebi.ac.uk/Tools/msa/muscle/ | [36]
PASTA | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, designed for very large datasets | https://bioinformaticshome.com/tools/msa/descriptions/PASTA.html | [48]
PRANK/WebPRANK* | Speed-oriented multiple sequence alignment for nucleotide or amino acid data, should be preferred for close sequences and large datasets | http://wasabiapp.org/software/prank/ ; https://www.ebi.ac.uk/goldman-srv/webprank/ | [37,38]
SATé | Software package for multiple sequence alignments and phylogenetic inference | https://phylo.bio.ku.edu/software/sate/sate.html | [50]
T-COFFEE* | Accuracy-oriented multiple sequence alignment of nucleotide and amino acid sequences | http://tcoffee.crg.cat/ | [41]
UPP | Speed-oriented multiple sequence alignment of nucleotide and amino acid sequences, designed for very large data sets | https://github.com/smirarab/sepp | [49]

However, several of the tools mentioned by reviewer #2 are not maintained, or have not been updated in several years. Sometimes they are not accessible (links in the publications are broken, do not exist and so one needs to contact the authors to get the source code). We recognize the value of these tools and their contribution to the field, but we believe including them in our manuscript may only create confusion. For those reasons, we have not included Probalign, PicXAA, FSA, GramAlign, and MSAProbs.

4. You mention one method for trimming sites from multiple sequence alignment (GBlocks; in Table 4). There is a suite of other and more suitable methods available (see NAR Genomics & Bioinformatics 2, lqaa024; see also citations 13-21 in that paper).

Yes, we thank the reviewer for this thought. We added a full paragraph on alignment trimming and mention several tools from these publications.

Row 248-256: ‘Once the alignment is completed, it is necessary to select the positions and regions that will be used for the phylogenetic inference. Poorly aligned positions and highly variable regions are not phylogenetically informative, because these positions might not be homologous or subject to saturation. These positions should be manually or automatically excluded prior to the phylogenetic analysis. The resulting sub-MSA maximizes the phylogenetic signal of the alignment [51]. Alignment trimming can be done manually or using appropriate programs. The completeness of alignments can be quantified and phylogenetic informative regions of the alignment can be selected using appropriate tools, such as Guidance 2 [52], AliStat [53], Gblocks [51,54], trimAl [55], BMGE [56] and Noisy [57] (Table 5).’
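The automatic exclusion of poorly aligned positions described in the passage above can be illustrated with a minimal Python sketch. It is not the algorithm used by Gblocks, trimAl, BMGE or any other tool named in the text; it only assumes a simple, hypothetical rule in which a column is dropped when the fraction of gap characters exceeds a user-chosen threshold. The function name, the `max_gap_fraction` parameter and the toy alignment are illustrative assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-?"):
    """Keep only alignment columns whose gap fraction is <= max_gap_fraction.

    `alignment` maps sequence names to aligned sequences of equal length.
    This is a deliberately simple rule for illustration, not the method
    implemented by any published trimming program.
    """
    if not alignment:
        return {}
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")

    kept_columns = []
    for i in range(length):
        # Count sequences that have a gap (or missing character) at column i.
        gaps = sum(1 for name in names if alignment[name][i] in gap_chars)
        if gaps / len(names) <= max_gap_fraction:
            kept_columns.append(i)

    # Rebuild every sequence from the retained columns only.
    return {
        name: "".join(alignment[name][i] for i in kept_columns)
        for name in names
    }


# Hypothetical toy alignment: the third column is a gap in two of three sequences.
toy_alignment = {"seqA": "AC-GT", "seqB": "A--GT", "seqC": "ACTG-"}
trimmed = trim_gappy_columns(toy_alignment, max_gap_fraction=0.5)
```

In practice the threshold would be reported together with the trimming tool and its settings, so that the resulting sub-alignment can be reproduced.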

5. You mention three model-selection methods but overlooked other more flexible methods (Nature Methods 14, 587-589; Mol. Biol. Evol. 29, 1695-1701; Syst. Biol. 63, 726-742; Mol. Biol. Evol. 34, 772-773; Syst. Biol. 69, 249-264).

Yes, we thank the reviewer also for pointing this out. We included several new tools in the text and in Table 4 (see response above). More specifically, the edits are made in

Row 279-281: ‘More recently, the GHOST model for alignments with variation in mutation rate was introduced and implemented in IQ-TREE.’

Row 288-292: ‘PartitionFinder 2 [82] can be used with nucleotide and amino acid data. Model test selectors are also included in programs such as MEGA [44] and PhyML (SMS) [83]. ModelFinder [84] is a model selection method for alignments of nucleotides, codons or amino acids implemented in IQ-TREE [60,61], that includes a flexible model of rate-heterogeneity between sites.’

Table 6: List of programs for molecular evolution model selection

Software | Features | Link | References
ModelFinder | Fast model selection with a model of rate heterogeneity between sites (nucleotide, amino acids or codons) | Implemented in IQ-TREE | [84]
ModelTest / jModelTest | Nucleotide substitution model selection | http://evomics.org/resources/software/molecular-evolution-software/modeltest/ | [80]
PartitionFinder 2 | Molecular evolution model selection (nucleotide or amino acids) | http://www.robertlanfear.com/partitionfinder/ | [82]
ProtTest | Aminoacid substitution model selection | https://github.com/ddarriba/prottest3 | [81]
SMS | Molecular evolution model selection included in PhyML (nucleotide or aminoacid) | http://www.atgc-montpellier.fr/sms/ | [83]

6. Your understanding of the relationship between log-likelihood and the AIC and BIC is wrong (L232-L233), suggesting confusion. You should read Briefings in Bioinformatics (21, 533-565).

We have now edited this sentence to make it more accurate and clear.

Row 283-284: ‘For every substitution model, these tools calculate the Bayesian information criterion (BIC) [78] and the Akaike information criterion (AIC) [79] from the log-likelihood scores.’
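The direction of the computation stated in the corrected sentence (information criteria are derived from the log-likelihood, not the reverse) can be made concrete with a short Python sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The model names, log-likelihood values and parameter counts below are hypothetical, and using the alignment length as the sample size n is a common convention that is assumed here rather than taken from the manuscript.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L.

    n_sites is the sample size; the number of alignment sites is assumed here.
    """
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: (log-likelihood, number of free parameters) per model.
candidate_fits = {
    "model_A": (-5230.4, 10),
    "model_B": (-5198.7, 14),
    "model_C": (-5197.9, 19),
}
n_sites = 1200  # hypothetical alignment length

scores = {
    name: {"AIC": aic(lnl, k), "BIC": bic(lnl, k, n_sites)}
    for name, (lnl, k) in candidate_fits.items()
}
# Lower values indicate a better trade-off between fit and complexity.
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
```

Because BIC penalises each parameter by ln(n) rather than 2, it tends to prefer simpler models than AIC for alignments longer than about seven sites.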

7. Your understanding of what the non-parametric bootstrap scores indicate is wrong (see Science 310, 1911-1912).

We acknowledge that the text can be much clearer. Therefore, we have added a full paragraph with information, precision and references on bootstrapping method and other tests of tree robustness. See edits on

Row 355-373: ‘Once the phylogenetic tree is obtained, it is recommended to estimate the robustness of the nodes. Most programs of phylogenetic analysis use the non-parametric bootstrapping method [121]. Bootstrapping is an estimate of error used to assess the repeatability of the clade and how consistently the data support the nodes [121,122]. The characters (e.g., nucleotides or amino acids) are randomly resampled with replacement and a new phylogeny is calculated for each replicate. A bootstrap value is calculated for every node, indicating the proportion of replicate phylogenies that recovered the node from the initial tree. A bootstrap value of 100% means that the node is supported by all informative characters, while low values mean that only few characters support the node. A bootstrap value above 95% is usually considered very good and a bootstrap value below 75% is generally considered a poor support for the clade. 1000 replicates are often used in phylogenetic analysis. Since bootstrapping can be time consuming, fast approximation methods for phylogenetic bootstrap have been proposed and are implemented in programs such as RAxML or IQ-TREE [123–125]. Other tests of branch support robustness exist, such as the Shimodaira-Hasegawa test [126] and the approximate likelihood ratio test [127], both implemented in PhyML along with the bootstrapping method. FastTree includes only the Shimodaira-Hasegawa test. For model-based methods, a recent phylogenetic protocol recommends to test the goodness of fit between tree, model and data using a parametric bootstrap, rather than non-parametric bootstrap [58,128]. Bayesian inference method calculates posterior probabilities, which measure branch support instead of bootstrap values.’
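The resampling step in the quoted paragraph (alignment columns drawn at random with replacement, a tree inferred for each replicate, and the proportion of replicates recovering each group reported) can be sketched in Python as follows. This is a minimal illustration, not the implementation used by RAxML, IQ-TREE or PhyML. The `infer_clades` argument is a hypothetical callable standing in for any tree-building method; it is assumed to return the set of clades (as frozensets of taxon names) found in the tree inferred from an alignment.

```python
import random


def resample_sites(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # The same column indices are applied to every sequence so that
    # each resampled site remains a complete alignment column.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][i] for i in columns)
        for name in names
    }


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Proportion of replicates that recover each clade of the original tree.

    `infer_clades` is a user-supplied (hypothetical) function mapping an
    alignment to a set of frozensets of taxon names.
    """
    rng = random.Random(seed)
    reference_clades = infer_clades(alignment)
    counts = {clade: 0 for clade in reference_clades}
    for _ in range(n_replicates):
        replicate_clades = infer_clades(resample_sites(alignment, rng))
        for clade in reference_clades:
            if clade in replicate_clades:
                counts[clade] += 1
    return {clade: counts[clade] / n_replicates for clade in reference_clades}
```

The returned proportions depend on both the data and the inference method passed in, so a high value indicates that the sites consistently yield the same grouping under that method; it does not by itself show that the grouping is correct.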

8. You mention several distance matrix-based phylogenetic methods but do not cite papers describing them.

To clarify, we have complemented citations in the text. Edits are made on

Row 314-315: ‘Distance methods include the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [91], Neighbor Joining (NJ) [92], and Minimum Evolution (ME) [93].’
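Distance methods such as UPGMA, NJ and ME take a matrix of pairwise distances as input. As a hedged illustration of how such a distance can be obtained, the Python sketch below computes the proportion of differing sites (p-distance) between two aligned nucleotide sequences and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The choice of the Jukes-Cantor model, the function names and the toy sequences are assumptions made for illustration; the manuscript does not prescribe a particular correction.

```python
import math


def p_distance(seq1, seq2, gap_chars="-?"):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq1, seq2):
        if x in gap_chars or y in gap_chars:
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for nucleotide data.

    The correction is undefined for p >= 0.75 (saturation); infinity is
    returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical toy sequences: one difference over four compared sites.
p = p_distance("ACGT", "ACGA")
d = jukes_cantor_distance(p)
```

The corrected distance is always at least as large as the observed proportion of differences, because the model accounts for multiple substitutions at the same site.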

9. Your list of phylogenetic methods is incomplete and misses many important programs (e.g., IQTREE (Mol. Biol. Evol. 32, 268-274), IQTREE2 (Mol. Biol. Evol. 37, 1530-1534), PHYLIP (Felsenstein, 2005), PyCogent (Genome Biology 8, R171), FastME v2.0 (Mol. Biol. Evol. 32, 2798-2800), LSD (Syst. Biol. 65, 82-97) Garli (Syst. Biol. 63, 812-818), PhyloBayes 3 (Bioinformatics 25, 2286-2288), SplitsTree (Mol. Biol. Evol. 23, 254-267)).

Yes, the reviewer is correct that not all programs are presented. This is, as mentioned above, a deliberate choice to balance the overview with a guide that is also accessible to novices in the field. For phylogenetic analysis tools we applied the same criteria as for the MSA tools (we did not include tools that are not maintained or are difficult to access). We therefore added most of these tools to be more comprehensive, with comments on their specificities. This concerns particularly ML methods and Bayesian inference (see also responses to comments 2-4, and 9). More specifically, the following is edited:

Row 329-339: ‘Programs for ML phylogenetic analysis include MEGA [44], SeaView [104], PhyML [105], RAxML [106], FastTree [100], PAML [107], PAUP [99,103], IQ-TREE [60,61], HYPHY [108], PHYLIP [101] and GARLI [109,110] (Table 7). All of them can be used with nucleotide or amino acid data. MEGA [44] and SeaView [104] are known to be very user-friendly. They include sequence alignment tools and tree editors. PhyML [105] is accurate, easy to use and, like PAUP [103] and MEGA [44], includes all common models of molecular evolution. RAxML [106] and particularly FastTree [100] are fast and well suited for large datasets (up to 1 million sequences with FastTree). They use a specific model of rate heterogeneity, in addition to Gamma law and proportion of invariant sites. IQ-TREE [60,61], that includes ModelFinder [84] and a very fast bootstrapping method, is reported to be both fast and accurate [111]. PAUP [99,103] is slower than other programs, and uses nucleotide data only.’

Row 343-349: ‘The most recent method for phylogenetic reconstruction uses Bayesian inference, that calculates the probability of the molecular evolution model given the data. The main software used for BI-based phylogenetics is MrBayes [112] that uses the Markov Chain Monte Carlo (MCMC) algorithm (Table 7). PhyloBayes [113,114] is a bayesian MCMC sampler for phylogenetic reconstruction with protein data using a specific probabilistic model, well adapted for large datasets and phylogenomics. Bali-Phy [47] can also be used for phylogenetic analysis using Bayesian inference.’

Table 7: List of programs for phylogenetic analysis using distance methods, maximum parsimony, maximum likelihood (ML) and bayesian inference (*: tools that include web interface)

Software | Features | Link | References
APE | R-written package for molecular phylogenetics | http://ape-package.ird.fr | [98]
BAli-Phy | Sequence alignment and phylogenetic inference using a bayesian approach | http://www.bali-phy.org | [47]
BayesTraits | Phylogenetic inference and other evolutionary analyses using Bayesian inference | http://www.evolution.reading.ac.uk/BayesTraitsV4.0.0/BayesTraitsV4.0.0.html | [115]
ETE Toolkit | Visualization and analysis of phylogenetic trees | http://etetoolkit.org/ | [116]
FastMe | Fast phylogenetic inference using distance methods. | http://www.atgc-montpellier.fr/fastme/ | [94]
FastTree | Phylogenetic inference using ML for nucleotide (GTR and JC models) and amino acid (JTT and WAG models), and Shimodaira-Hasegawa test. Suitable for very large datasets. | http://www.microbesonline.org/fasttree/ | [100]
FigTree | Graphic software for phylogenetic trees | http://tree.bio.ed.ac.uk/software/figtree/ | [117]
GARLI | Phylogenetic inference using ML for nucleotide (GTR model), aminoacid (most models) or codon data, with Gamma law and proportion of invariant sites. | http://evomics.org/resources/software/molecular-evolution-software/garli/ | [110]
HYPHY* | Diverse evolutionary analyses including evolution model selection, phylogenetic inference using ML and distance methods and sequence evolution studies | https://www.hyphy.org/ | [108]
IQ-TREE | ML phylogenetic inference, including model selection and ultrafast bootstrapping method. Includes the GHOST evolution model and tests for phylogenetic assumptions. | http://www.iqtree.org/ | [60,61]
ITOL | Visualization and annotation of phylogenetic trees | https://itol.embl.de/ | [118]
MEGA | Sequence alignment, model selection, phylogenetic analysis (parsimony, distance methods). Includes all common nucleotide and amino acid evolution models, Gamma law and proportion of invariant sites, and Bootstrapping method. | https://www.megasoftware.net/ | [44]
MrBayes | Bayesian phylogenetic inference, ancestral states reconstruction, phylogenetic calibration and other evolutionary analyses | http://nbisweden.github.io/MrBayes/ | [112]
PAML | Maximum likelihood phylogenetic inference, estimation of selection strength, ancestral states reconstruction and other analyses | http://abacus.gene.ucl.ac.uk/software/paml.html | [107]
PAUP | Phylogenetic inference using maximum parsimony and ML on nucleotide sequences (all ModelTest models), with Gamma law and proportion of invariant sites and Bootstrapping method. | http://paup.phylosolutions.com/ | [99]
PHYLIP | Phylogenetic inference using parsimony, distance methods and ML | https://evolution.genetics.washington.edu/phylip.html | [101]
PhyloBayes | Phylogenetic inference using Bayesian inference on proteins using a specific probabilistic model | http://www.atgc-montpellier.fr/phylobayes/ | [113]
PhyML* | Phylogenetic inference using ML, ancestral states reconstruction and various evolutionary analyses. Includes all common DNA and protein evolution models and diverse branch support methods (Bootstrap, Shimodaira-Hasegawa, aLRT…). | https://github.com/stephaneguindon/phyml ; web interface: http://atgc.lirmm.fr/phyml/ | [105]
PyCogent | Phylogenetic inference and phylogeny drawing, various evolutionary analyses including partition models and ancestral states reconstruction | https://github.com/pycogent/pycogent | [95]
RAxML | Phylogenetic inference using ML with nucleotide (GTR) or amino acid data (all common models) with Gamma law or CAT and proportion of invariant sites. Suitable for large datasets. | https://cme.h-its.org/exelixis/web/software/raxml/ | [106]
SeaView | Sequence alignment and phylogenetic inference using maximum parsimony, NJ and ML | http://doua.prabi.fr/software/seaview | [104]
SplitsTree | Phylogenetic inference, in particular unrooted trees, or phylogenetic networks | https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/splitstree/ | [97]

10. You really need to ensure that software and methods referred to are cited properly, preferentially every time and with version numbers included.

Yes, we have now been more careful on this. See response above (comments 8 and 10). In essence, we have edited the text to include the historical papers describing the different distance methods, as well as all software and methods.

11. Your figures are unclear, and the colours used are not consistent with a colour palette fit for colourblind people.

We thank the reviewer for pointing this out. Although most colors in our figures are defined by others (webtools and logos), we have changed the red color to yellow in Figure 3.

SPECIFIC COMMENTS

We thank the reviewer for the detailed comments below. To all, we have edited sentences to clarify and increase accuracy as suggested by the reviewers. Specific replies follow below (row numbers refer to those in the originally submitted MS).

L47-L48. Why start with “Although … community”?

Thanks. We removed “Although … community”.

L61. “allowed” > “has allowed”

Yes, thanks. We replaced « allowed » with « has allowed ».

L61-L62. “nucleic acid and protein” > “nucleotides and amino acids” [DNA is a sequence of nucleotides and protein is a sequence of amino acids; therefore, it is not correct to write DNA sequences or protein sequences]

True. We replaced « nucleic acids » and « protein » with « nucleotide and amino acids ».

L62. “In addition, this data is” > “Typically, these data are” [datum is singular of data]

Yes, we changed « is » to « are ».

L66-L67. “navigate these resources can be puzzling” > “navigate through these resources can be challenging”

We changed « puzzling » for « challenging ».

L81. Delete “considerable accompanying”

We deleted « considerable accompanying ».

L84-L86. The figure is not a fully developed protocol like those in NAR Genomics & Bioinformatics 2, lqaa041 [2020].

We have updated the figure and added steps, in accordance with this protocol. We added alignment trimming, validation of phylogenetic assumptions, and bootstrapping as specific tools.

L89. “DNA sequences” > “nucleotide sequences (DNA)” [I am pedantic here, but DNA is a sequence (of nucleotides) so writing DNA sequences is the same as saying a sequence of a sequence of nucleotides]

We changed « DNA sequences » to « nucleotide sequences ».

L89. “and their protein translations” > “and, where applicable, their translations into protein”

We changed « and their protein translations » to « and, where applicable, their translations into protein ».

L103. FigTree is not a database!!! It is a phylogenetic tool, so it should be in a different table.

FigTree has been moved to the appropriate table.

L119. “protein sequences” > “amino-acid sequences” [I use a hyphen because amino and acid form a compound adjective]

We changed « protein sequences » into « amino-acid sequences ».

L123. See above.

We changed « protein sequences » into « amino-acid sequences ».

L148. “sequences alignment” > “pairwise sequence alignment” OR “multiple sequence alignment”

We changed “sequences alignment” to “pairwise sequence alignment or multiple sequence alignment”.

L159. “evolutionary relationships” > “evolutionary relationships (see below)”

We added « see below ».

L167 “under FASTA” > “in FASTA”

We changed « under FASTA » to « in FASTA ».

L177. “, i.e., … ancestry” > “(i.e., … ancestry)” [I prefer “that is,” in the sentence and “i.e.,” inside parentheses]

Parentheses are added.

L177-L179. This sentence should be so that the terms homology, orthology, paralogy and xenology are clear and unambiguous.

We rephrased the paragraph to explain more clearly the concepts of homology, orthology, paralogy and xenology.

L184. Include “(define/explain the E-value)”

Thanks, we added a definition of the E-value.

L185. “amino acid sequences” > “amino-acid sequences”

We added the hyphen.

L186. Explain “HH”

We introduced the HHsuite.

L196. “proteins within” > “proteins, even within”

We changed “proteins within” into “proteins, even within”.

L197. “data, (“ > “data (“

The comma is removed.

L198. “), includes” > “) includes”

The comma is removed.

L202. Delete “sequences”

We deleted « sequences ».

L203. “one after the other” > “, one after the other,”

Two commas were added.

L213-L214. Not “distantly related”; “closely related” [I spoke to Ary and Nick about this]

Edited from “distantly related” to “closely related”.

L227. You list the two least realistic models of rate heterogeneity across sites. More realistic models of rate heterogeneity across sites are described in Kalyaanamoorthy et al. (Nature Methods 14, 587-589) and Crotty et al. (Syst. Biol. 69, 249-264).

We introduced these models and added the references.

L232-L233. This sentence is incorrect. The AIC and BIC are computed from the log-likelihood scores; not the other way around, as you write.

We rewrote the sentence to clarify.

Row 279-280: ‘For every substitution model, these tools calculate the Bayesian information criterion (BIC) [76] and the Akaike information criterion (AIC) [77] from the log-likelihood scores.’

L235-L239. Your description of the reason for doing a non-parametric bootstrap analysis is misleading and is not supported by relevant literature (e.g., Mol. Biol. Evol. 9, 366-369; Mol. Biol. Evol. 30, 1188-1195; Mol. Biol. Evol. 35, 518-522).

We added a paragraph to explain the bootstrapping and other methods to test tree robustness. We added the references.

L247-L250. Include citations after each method or program mentioned, please.

We added the citation immediately after every method and program described.

Attachment

Submitted filename: Rebuttal_letter_PlosOne_Jacques et al.pdf

Decision Letter 1

Arndt von Haeseler

6 Oct 2022

PONE-D-22-06715R1

Roadmap to the study of gene and protein phylogeny and evolution – a practical guide

PLOS ONE

Dear Dr. Hammarlund,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 20 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Arndt von Haeseler

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

Descriptions of methods and reagents contained in the step-by-step protocol should be reported in sufficient detail for another researcher to reproduce all experiments and analyses. The protocol should describe the appropriate controls, sample sizes and replication needed to ensure that the data are robust and reproducible.

Reviewer #1: Yes

**********

3. Does the protocol describe a validated method?

The manuscript must demonstrate that the protocol achieves its intended purpose: either by containing appropriate validation data, or referencing at least one original research article in which the protocol was used to generate data.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. If the manuscript contains new data, have the authors made this data fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: N/A

Reviewer #2: No

**********

5. Is the article presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please highlight any specific errors that need correcting in the box below.

Reviewer #1: Yes

Reviewer #2: No: The manuscript requires further editing to improve clarity

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have adequately addressed the most important issues in this revised manuscript. I have no further comment.

Reviewer #2: GENERAL COMMENTS

In general, I still think there is merit in publishing a manuscript like the one submitted. However, I also think the manuscript can be improved further. Several aspects of the revision suggest you were in a hurry to finish it, leading, in some cases, to confusion rather than clarity.

If you want the manuscript to become the paper that educators recommend as an introduction to the study of gene and protein phylogeny and evolution, then you may need to invest a bit more of your time on achieving an accurate, easy-to-read presentation of the topics. There is also evidence that you do not understand some of the aspects that you cover. Those have to be addressed so that you present a sound and reliable protocol, with references to critical databases, alignment tools, and phylogenetic methods.

In the following, I list some of the major issues:

1. An example of the above-mentioned misunderstanding appears in Figure 1: The feedback loops should be distinguished from the forward arrows (use different colour or dashed arrows). There are no arrows between “Select phylogenetic method” and the two boxes below it with methods.

2. Distance methods are model-based, so they should be listed with the maximum-likelihood and Bayesian methods.

3. You write “Test of robustness”, which is misleading. Robustness refers to the quality or condition of being strong and in good condition. A phylogenetic estimate can be robust, yet it may be incorrect. We would like the estimates to be accurate and the methods to be precise and accurate. As for the accuracy, we may test the goodness of fit between the tree, the model of sequence evolution, and the data. A good fit means that the tree and model of sequence evolution provide a good explanation of the data. Yet, we will never know whether that explanation is the correct one (others may exist). A poor fit means that something is not right about the tree, the model of sequence evolution, or both. This is fundamental and it should be clear from your manuscript (which it isn’t).

4. The sections on alignment and phylogenetic analysis (L212-L395) have been improved a lot. However, you appear to ignore the new phylogenetic protocol [1], which you cited in the Introduction. I think it would serve you better if you described the different steps and procedures in the context of the new phylogenetic protocol [1] and/or Figure 1. You know the details and order of actions but newcomers to the field do not and would need a clear framework. The new protocol provides that framework.

5. Regarding multiple sequence alignment, I think you need to consider and cite some of the excellent papers by Morrison (e.g., Is Sequence Alignment an Art or a Science; Syst. Bot. 40, 14-26 [2015]). You gloss over the challenges and pitfalls, some of which are illustrated in Golubchik et al. (Mol. Biol. Evol. 24, 2433–2442 [2007]), thereby belittling the challenge it is to obtain a trustworthy multiple sequence alignment. Apart from the progressive and consistency-based approach (mentioned on L224-L227), there are other ways to obtain a multiple sequence alignment. You should at least mention some of these other strategies, and cite the relevant literature (or reviews of that literature). Further, the consistency-based approach is not well described (L226-L227).

6. You should mention the minimum reporting standard for multiple sequence alignments (it was presented in Ref. 53) when introducing trimming of multiple sequence alignment. The reason for doing so is given in Ref. 53.

7. The section entitled “Test of robustness of phylogenetic trees” needs to be rewritten. For a start, choosing to use the word “robustness” is not good. Robustness usually refers to how accurate methods are when the assumptions of the methods are violated by the data (see Yang 2014; Molecular Evolution). Other words like accuracy, consistency and precision are more appropriate. In this context, note that:

• Accuracy means ‘the quality or state of being correct or precise’

• Consistency means ‘the quality of achieving a level of performance which does not vary greatly in quality over time’

• Precision means ‘the quality, condition, or fact of being exact and accurate’

Your description of bootstrapping is incorrect; it is not “an estimate of error” (L349). In my previous review, I drew your attention to the correct interpretation of bootstrapping, but you ignored that, which was unwise. Instead, you cite Hedges [120], which focuses on the number of bootstrap replicates needed to gain an accurate estimate of the bootstrap P value, and not on the interpretation. You also need to note that bootstrap P values are both data and method dependent. It is the sites in the alignment, not “characters” (L350), which are sampled randomly with replacement, and the correct citation is missing: “each replicate” > “each replicate [Felsenstein 1985; Evolution 39, 783-791]”. Bootstrapping, as described by Felsenstein and implemented by many since then, is a non-parametric bootstrap approach. Using it gives us an insight into the consistency of the data (i.e., do we have enough sites in the data to consistently infer the same tree). It does not tell us whether the inferred tree is accurate.

You write that a “bootstrap value is calculated for every node” (L352). This is incorrect. The value is computed for every internal edge/branch/split that separates the sequences.

You write “from the initial tree” (L353). This is incorrect. It can be done for any tree, even the consensus tree (IQ-TREE often lists the values for the ML tree and the consensus tree; and these trees may not be identical).

You write “A bootstrap value …for the clade”. Who did you quote? There has been some disagreement on the interpretation of bootstrap P values. New methods like those from von Haeseler’s group [121, 122] have clarified the matter considerably.

You write “1000 replicates …” (L356). This should be spelled out, and you should probably mention how that fits with Hedges [120].

You write “Other branch support methods exist” (L359) and then refer to the Shimodaira-Hasegawa test [124]. The bootstrap method is not a “branch support method”, so it is not correct to say “Other branch …”. Moreover, the Shimodaira-Hasegawa test is not a branch support test either. It is a test designed to compare the ML tree to other trees.

8. I have not included any specific comments on the two case studies, mainly because I have already used far more time on this review than I had available. That said, the two sections do need some revision.

9. None of the figures are as clear as is necessary for publication. In fact, I can’t read them.

The Bibliography is clearly set using a French application and does not appear to comply with PLoS One’s style. Please ensure that the Bibliography complies with the journal’s formatting style. The Bibliography should have been fixed after the first review!

SPECIFIC COMMENTS

L25: “Here a” > “Here we present a”

L26: “nucleic acid” > “DNA”

L27: Delete “is presented”

L33: “proto cells to modern organisms” > “primitive cells to modern cells and multicellular organisms, ”

L36: “sequence composition of” > “information content in”

L38-L40: Poor sentence structure

L41: “mutations that” > “mutations (including substitutions) that”

L43: “all” > “both”

L45: “past decades” > “past four decades”

L48: “taxa” > “taxa, genes or genomic components (e.g., transposable elements)”

L52: “the others” > “other species”

L53: “infection diseases” > “infectious diseases”

L59: Delete “both”

L65: Delete “the”

L66: “ and of” > “as well as”

L67: “processes, following” > “processes. In so doing, we follow”

L68: “used by many and maintained” > “maintained and used by many”

L75: Delete “peer-reviewed”

L80: Delete “digitalized”

L82: “nucleic acid or amino acid sequence” > “DNA or amino-acid sequences”

L88: Perhaps state that the protocol is based on that in [1].

L89: “Nucleic acid” > “DNA”

L95: ”browser focusing on chordates that contains” > ”browser that focuses on chordates and contains”

L97: “This” > “The”

L99: Delete “Overall,”

L105: “rendered” > “provided”

L114: “provides several” > “provides access to several”

L116: “under” > “in the”

L117: “nucleotide sequences” > “nucleotide”

L118: Delete “pasted to fasta file”

L129: “format and the” > “format. The”

L129: “interlinks” > “links”

L130: “allows to batch download the” > “facilitates batch downloads of”

L132: “databases” > “databases (such as …)”

L136: “The KEGG focuses” > “The KEGG database focuses”

L148-L156: You alternate between “3-dimensional” and “3D”. Be consistent

L149: Delete “the”

L153: Place “i.e., … family” between parentheses

L159: You write “based on …, or both”. Perhaps separate the databases accordingly

L163: You write “based on the presence or absence of … strands”. I think you need to elaborate a bit further on this. There is SCOPe classification system to consider but you also need to consider the work by Sun et al. (Current Protocols in Protein Science (2004) 17.1.1-17.1.189)

L164: “class” > “classification”

L177: “[16], that combines” > “[16]. It combines”

L186: Delete “whole”

L188: “hemoglobine” > “haemoglobin”

L189: “hemoglobine” > “haemoglobin”

L190: “myoglobine” > “myoglobin”

L197: Include a citation to the E-value. You might know it, but you are targeting beginners who can be assumed to be ignorant.

L206: “retrieving from” > “retrieving sequences from”

L213: “related to each other” > “related to each other in an evolutionary sense” [they can be related to each other on other senses]

L220: “row one” > “row of a matrix, one”

L230-L231: Poor sentence structure and you probably mean “user interface”, rather than “web interfaces”

L233: “calculation times” > “execution times”

L236-L237: You need a reference to support this statement, and you then need to cite Golubchik et al. (Mol. Biol. Evol. 24, 2433–2442 [2007]), who reported the opposite for PRANK.

L238: “that” > “which”

L242: “web interface” > “user interface”

L253: “quantified and phylogenetic informative” > “quantified using AliStat [53] and phylogenetically informative”

L258: “stating for example that” > “stating, for example, that”

L259: “constant” > “constant over time (i.e., time-homogeneity)”

L260: “reversible” > “reversible and, therefore, also stationary (for details on these assumptions, see Jayaswal et al. (Syst. Biol. 63, 726-742 [2014]) and papers cited therein)”

L261: You write “can be biased [58]”. It would be more appropriate to cite Syst. Biol. (53, 623-637; 53, 638-643) because these papers used data generated by simulation (hence, the truth is known).

L262: “recent phylogenetic protocols recommend … possible [1]” > “a recent phylogenetic protocol [1] recommends … possible”

L264: “developped” > “developed”

L264: You write “tests for all these assumptions”. This is not true. Some have been included in IQ-TREE and IQ-TREE2. I agree that the matched-pairs tests of homogeneity (Bioinformatics 22, 1225-1231) are of relevance, but only for some Markovian conditions. One of these is implemented in Homo 2.1 (https://github.com/lsjermiin/Homo.v2.1).

L266: “Selection of the molecular evolution model” > “Selection of the optimal model of sequence evolution”

L267: Several papers and book chapters have described model selection in general terms. I think it would be wise to cite some of these (e.g., Front. Genet. 6, 319 [2015]; Meth. Mol. Biol. 1525, 379-420 [2017]; DOI 10.1007/978-1-4939-6622-6_15).

L268: “Nucleotide or amino acid substitution exist” > “Several models of nucleotide or amino-acid substitution exist [REF]”. The statement requires some citations.

L270: You cite Yang 2006; you should cite his more recent book from 2014.

L273: You write “Each model … IQ-TREE [75]”. The set of models described here differ from the substitution models described in L271-L274. They are homotachous models of rate-heterogeneity across sites, so name them appropriately. You forgot to include the PDF/FreeRate model proposed by Yang (Genetics 139, 993-1005 [1995]) and now included in ModelFinder, IQ-TREE and PhyML.

L279: “every substitution model” > “model of sequence evolution (i.e., combination of substitution model and rate-heterogeneity across sites model)”

L281: “optimizing” > “minimising”

L284: Move “PartitionFinder 2 …[81]” to L288. The logic here is that ModelFinder is for a single partition and it supersedes ModelTest, jModelTest and ProtTest. PartitionFinder 2 is for multiple partitions.

L293: Replace the sentence with “Phylogenetic trees may consider the topology and the branch lengths (phylograms) or just the topology (cladograms)”

L294: “tree building methods” > “tree-building methods” [compound adjective]

L295: “distance” > “distances”

L295-L296: “number of differences” > “numbers and types of differences”

L296: “to reconstruct” > “and they use this matrix to reconstruct”

L296-L297: The sentence is wrong. Distance methods are also character-based methods.

L299: “classic” > “classical”

L303: Include citation after “long branch attraction”

L309: “mean” > “Mean”

L310: “[91].” > “[91] tree-inference methods.”

L312: “amino acid data” > “amino-acid data” [compound adjective]

L315: “but also” > “and”

L316: Delete “,”

L317: “in R language that” > “in the R language. It”

L320: “ML” > “maximum likelihood [ML]”. I think this is the first time that you typed ML, so it has to be defined, even though you and I both know what it refers to

L322: “ML methods” > “Maximum-likelihood methods”. When starting a sentence, it is good practice to spell out the abbreviation (or number)

L322: “probability” > “likelihood”

L323: “ML aims” > “Maximum likelihood aims”

L323: “combinations of model” > “combinations of trees and model”

L327: “amino acid data” > “amino-acid data”

L328: “is accurate” > “is reported as being accurate”

L329: “all common models of molecular evolution” > “many common models of substitution” or “many common models of sequence evolution” [choose the most accurate statement]

L331: “rate heterogeneity … invariant sites” > “rate-heterogeneity across sites (…)”. Here, you replace “…” with the model that you refer to

L333: “that” > “which”

L333: “method” > “method [REF]”. Here you need to replace “REF” with the relevant citation to the method implemented in IQ-TREE and IQ-TREE 2

L336: “recent method” > “recently-developed method”

L336: “inference, that” > “inference (BI). It”

L337: “calculates the probability” > “calculates the posterior probability”

L337: “the molecular evolution model given” > “the tree and model of sequence evolution, given”

L338: “[109] that” > “[109]. It”

L362: “inbcludes” > “includes”

L365: “taxa” > “sequences”

L366: “addressed question” > “question asked”

L370: “taxa” > “sequences”

L370: “studied ingroup” > “ingroup of interest”

L373: “identified, … viruses,” > “included”

L375: “of the longest branches” > “between the most dissimilar sequences in the tree”

L378: “exported using … a graphical” > “visualised using a graphical”

L394: You should include Geneious (https://www.geneious.com/) in the table and mention it in the text.

L397: “allows” > “allow users”

L404: Here you should only write MP, ML and BI. Note that you need to define MP somewhere earlier on in the text

L405: “molecular” > “sequence”

L416: In this section, you use the term “mutations”; typically the term “substitutions” is used

L417: “The strength” > “The type and strength”

L417: “can be calculated. This” > “may be of interest. It”

L418: “ration” > “ratio”

L421: “The ratio … 1, that” > “If dN/dS > 1, then the”

L426: “Phylogenetic calibration” > “Molecular dating of events”

L427: “event using events” > “events, using other events”

L440-L444: This can be stated more succinctly

L450: Why do you refer to Table 8 here?

L457: “species” > “traits”

L462-L466: This needs to be stated better

L477: “Population genetics often studies” > “Population geneticists often study”

L483: “of” > “at”

L484: “studies, for a full review see” > “studies; for a full review, see”

L493: “in Å” > “in ångström”

L494: You write “can be calculated”. Which program?

L495: “3-dimentional” > “3D”

L496: “amino acid” > “amino-acid”

L498: Here I think you need to include AlphaFold (Nature 596, 583–589 [2021])

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

2. Has the protocol been described in sufficient detail?

To answer this question, please click the link to protocols.io in the Materials and Methods section of the manuscript (if a link has been provided) or consult the step-by-step protocol in the Supporting Information files.

The step-by-step protocol should contain sufficient detail for another researcher to be able to reproduce all experiments and analyses.

Reviewer #2: Partly

**********

PLoS One. 2023 Feb 24;18(2):e0279597. doi: 10.1371/journal.pone.0279597.r004

Author response to Decision Letter 1


25 Nov 2022

We thank the reviewers for their helpful and insightful comments. We have revised the manuscript and responded to all comments and concerns of Reviewer #2. We believe that this review and our revision have resulted in a much-improved contribution. Below, we reply to all the comments one by one. We refer to new intervals of text by their row number in the manuscript.

Reviewer #1: The authors have adequately addressed the most important issues in this revised manuscript. I have no further comment.

Reviewer #2: GENERAL COMMENTS

In general, I still think there is merit in publishing a manuscript like the one submitted. However, I also think the manuscript can be improved further. Several aspects of the revision suggest you were in a hurry to finish it, leading, in some cases, to confusion rather than clarity.

If you want the manuscript to become the paper that educators recommend as an introduction to the study of gene and protein phylogeny and evolution, then you may need to invest a bit more of your time on achieving an accurate, easy-to-read presentation of the topics. There is also evidence that you do not understand some of the aspects that you cover. Those have to be addressed so that you present a sound and reliable protocol, with references to critical databases, alignment tools, and phylogenetic methods.

In the following, I list some of the major issues:

1. An example of the above-mentioned misunderstanding appears in Figure 1: The feedback loops should be distinguished from the forward arrows (use different colour or dashed arrows). There are no arrows between “Select phylogenetic method” and the two boxes below it with methods.

Response: We changed the feedback loops to dashed arrows in Figure 1. We added an arrow between “select phylogenetic method” and the methods below.

2. Distance methods are model-based, so they should be listed with the maximum-likelihood and Bayesian methods.

Response: We added the box “selection of molecular evolution model” under the distance methods.

3. You write “Test of robustness”, which is misleading. Robustness refers to the quality or condition of being strong and in good condition. A phylogenetic estimate can be robust, yet it may be incorrect. We would like the estimates to be accurate and the methods to be precise and accurate. As for the accuracy, we may test the goodness of fit between the tree, the model of sequence evolution, and the data. A good fit means that the tree and model of sequence evolution provide a good explanation of the data. Yet, we will never know whether that explanation is the correct one (others may exist). A poor fit means that something is not right about the tree, the model of sequence evolution, or both. This is fundamental and it should be clear from your manuscript (which it isn’t).

Response: We changed “test of robustness” to “test of reliability” in Figure 1 and in the text. We explain in more detail what a good fit means, and we stress in the manuscript that a good fit does not mean that the tree is correct; it only indicates how well the tree and the model explain the data:

L455-457: “A good fit means that the tree and the model of sequence evolution provide a good explanation of the data but doesn’t indicate if the tree is correct or not.”

4. The sections on alignment and phylogenetic analysis (L212-L395) have been improved a lot. However, you appear to ignore the new phylogenetic protocol [1], which you cited in the Introduction. I think it would serve you better if you described the different steps and procedures in the context of the new phylogenetic protocol [1] and/or Figure 1. You know the details and order of actions but newcomers to the field do not and would need a clear framework. The new protocol provides that framework.

Response: We described the new phylogenetic protocol in more detail and added information on the two new steps from this protocol: “assessing phylogenetic assumptions” and “test of goodness of fit”.

L295-305 “Phylogenetic models rely on simplifying assumptions stating, for example, that all sites in the alignment evolved under the same tree (treelikeness), that mutation rates have remained constant over time (i.e., time-homogeneity), and that substitutions are reversible and, therefore, also stationary (for details on these assumptions, see [72]). If the phylogenetic data violate these assumptions, the phylogeny and evolutionary analyses can be biased [73–75]. Once the alignment is performed and the sites selected for phylogenetic inference, a recent phylogenetic protocol recommends assessing those phylogenetic assumptions when possible [1]. Statistical methods allowing to test stationarity, homogeneity under certain markovian conditions, and treelikeness have been developed and included in IQ-TREE and IQ-TREE2 [76,77]. Homo2.1 [78] is designed for the analysis of compositional heterogeneity in sequence alignments. It is also possible to use the R package MOTMOT [79].”

L453-467: “It is possible that the inferred optimal model of sequence evolution used for the phylogenetic analysis is inadequate. Once a phylogenetic tree has been inferred, it is recommended to test the goodness of fit (i.e., the adequacy) between the tree, the model and the data [1]. A good fit means that the tree and the model of sequence evolution provide a good explanation of the data but doesn’t indicate if the tree is correct or not. The goodness of fit can be tested using a parametric bootstrap [149]. A parametric bootstrap uses the optimal tree and the optimal model as input to simulate sequence evolution and generate pseudo-data with a Monte Carlo simulation. Sequence-generating programs, such as SeqGen (https://github.com/rambaut/Seq-Gen/releases/tag/1.3.4, [150]), can be used. The goodness of fit is calculated from the difference between the unconstrained and constrained (i.e., assuming the optimal tree and the optimal model) log-likelihoods of the real data and the pseudo-data. If the fit is poor, it is advisable to check the alignment and the selected set of sites and/or the sequence evolution model (feedback loops on Fig. 1). Several methods have been developed to test the adequacy of the data [151]. The Goldman-Cox (GC) test can be used with several phylogenetic programs such as PAUP. Most Bayesian phylogenetic programs employ the posterior predictive (PP) test.”

We stress the importance of testing phylogenetic assumptions prior to phylogenetic analysis and the goodness of fit. Our Figure 1 and our manuscript is now based on this new protocol.
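
The parametric-bootstrap test of goodness of fit quoted above can be illustrated with a short Python sketch. It computes the unconstrained (multinomial) log-likelihood of an alignment from its site-pattern counts, and a Monte Carlo P value from a set of simulated test statistics. The function names, the toy alignment, the numerical values, and the use of a simple proportion for the P value are all assumptions made for illustration. In practice the constrained log-likelihoods would come from a phylogenetic program and the pseudo-data from a simulator such as Seq-Gen, neither of which is called here.

```python
import math
from collections import Counter


def unconstrained_log_likelihood(sequences):
    """Multinomial log-likelihood: sum over site patterns of n_i * ln(n_i / N)."""
    # Each alignment column (site pattern) is a tuple with one residue per sequence.
    patterns = Counter(zip(*sequences))
    n_sites = sum(patterns.values())
    return sum(n * math.log(n / n_sites) for n in patterns.values())


def monte_carlo_p_value(observed_delta, simulated_deltas):
    """Proportion of simulated statistics at least as large as the observed one."""
    exceed = sum(1 for d in simulated_deltas if d >= observed_delta)
    return exceed / len(simulated_deltas)


# Toy aligned sequences of equal length (hypothetical data).
alignment = ["AAC", "AAG"]
lnl_unconstrained = unconstrained_log_likelihood(alignment)

# Hypothetical numbers: the constrained log-likelihood would be reported by a
# tree-inference program, and the simulated deltas would be obtained from
# pseudo-data generated under the optimal tree and model.
lnl_constrained = -5.0
observed_delta = lnl_unconstrained - lnl_constrained
simulated_deltas = [2.1, 2.8, 3.3, 3.9, 4.5]
p_value = monte_carlo_p_value(observed_delta, simulated_deltas)
```

A small P value would suggest that the real data fit the tree and model worse than data simulated under them, which is the situation in which the feedback loops of Figure 1 apply.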

5. Regarding multiple sequence alignment, I think you need to consider and cite some of the excellent papers by Morrison (e.g., Is Sequence Alignment an Art or a Science; Syst. Bot. 40, 14-26 [2015]). You gloss over the challenges and pitfalls, some of which are illustrated in Golubchik et al. (Mol. Biol. Evol. 24, 2433–2442 [2007]), thereby belittling the challenge it is to obtain a trustworthy multiple sequence alignment. Apart from the progressive and consistency-based approach (mentioned on L224-L227), there are other ways to obtain a multiple sequence alignment. You should at least mention some of these other strategies and cite the relevant literature (or reviews of that literature). Further, the consistency-based approach is not well described (L226-L227).

Response: We added a paragraph that discusses in more detail the challenge of obtaining an accurate MSA, in which we cite Morrison’s paper and Golubchik’s work.

L258-277: “In practice, finding an accurate multiple sequence alignment can be challenging for several reasons. For example, computer programs are not based explicitly on the hypothesis of homology between aligned residues. They consider homology exclusively as similarity, and do not consider the other conditions for homology such as congruence and conjunction [61]. Furthermore, one should keep in mind that the alignment with the best score is not necessarily biologically correct. It is also difficult to get a good alignment for sequences that have diverged significantly and share low identity. In this case, for protein-coding sequences, amino-acid data should be preferred over nucleotide data, since it is possible, for example, to consider the biochemical similarity of substitutions of amino acids. Alignment software requires defining a gap-opening penalty and a gap-extension penalty, but these values are arbitrary. It is common that different sequences in the alignment do not have the same length, for biological or experimental reasons. It is recommended to keep end gaps unpenalized [62]. Furthermore, indels are reported to affect the accuracy of MSA programs. It is recommended to use several MSA programs for sequences that contain indels [63]. MAFFT is reported to be the most accurate program in the case of sequences with non-overlapping deletions and in the case of alternatively spliced gene products [63]. Another difficulty concerns the case of repeated sequences with different numbers of repeats. Here, a domain of one sequence can be homologous to several domains of another sequence. Single nucleotides or small sequences can also be repeated like in the case of microsatellites, as well as entire protein domains. In the last case, it is recommended to excise the repeated domains [62].”

We describe consistency-based methods in more detail, and we cite the other MSA methods:

L242-244: “Other approaches include the iterative refinement method, which is also included in MAFFT and Muscle [51], the genetic algorithms, used by SAGA [52], and the methods using hidden Markov models, such as SAM [53].”.

6. You should mention the minimum reporting standard for multiple sequence alignments (it was presented in Ref. 53) when introducing trimming of multiple sequence alignment. The reason for doing so is given in Ref. 53.

Response: We added a mention to this minimum reporting standard:

L287-288: “A minimum reporting standard has been developed to quantify the alignment completeness, and implemented in AliStat [66].”

7. The section entitled “Test of robustness of phylogenetic trees” needs to be rewritten. For a start, choosing to use the word “robustness” is not good. Robustness usually refers to how accurate methods are when the assumptions of the methods are violated by the data (see Yang 2014; Molecular Evolution). Other words like accuracy, consistency and precision are more appropriate. In this context, note that:

• Accuracy means ‘the quality or state of being correct or precise’

• Consistency means ‘the quality of achieving a level of performance which does not vary greatly in quality over time’

• Precision means ‘the quality, condition, or fact of being exact and accurate’

Your description of bootstrapping is incorrect; it is not “an estimate of error” (L349).

Response: We rewrote the part on the non-parametric bootstrap:

L411-437: “Once the phylogenetic tree is obtained, it is recommended to estimate the reliability of the clades. Most programs of phylogenetic analysis use the non-parametric bootstrapping method [142]. It is a resampling technique used to assess the repeatability of the clade and how consistently the data support the nodes [143]. The sites in the alignment (e.g., nucleotides or amino acids) are randomly resampled with replacement and a new phylogeny is calculated for each replicate [142]. A bootstrap value is calculated for every internal branch, indicating the proportion of replicate phylogenies that recovered the clade. A bootstrap value of 100% means that the node is supported by all informative characters, while low values mean that only few characters support the node. Bootstrapping gives a measure of the consistency of the estimate, but it is not a measure of the accuracy of the tree [144]. The number of replicates that are necessary to obtain a good accuracy of the bootstrap value depends on the bootstrap value. For example, for a 1% confidence interval on a bootstrap value of 95, 2000 replicates are necessary [143].

Since bootstrapping can be time consuming, UFBoot and UFBoot2, fast approximation methods for phylogenetic bootstrap, have been developed and are implemented in programs such as RAxML or IQ-TREE [130,145,146]. They are also less biased than other non-parametric bootstrapping methods and robust against moderate model violations. While other bootstrapping methods tend to underestimate the probability of a clade being correct, the bootstrap values from UFBoot and UFBoot2 truly reflect the probability of the clade being correct, simplifying the interpretation of bootstrap values (a bootstrap value of 95% indicates a 95% probability of being correct) [145].

With ML-based phylogenetic inference, the Shimodaira-Hasegawa test, and its improved version AU [147], are designed to evaluate alternative phylogenetic hypotheses, and test if a tree is better supported than another one. It can be used with PAUP, PhyML, FastTree and IQ-TREE. The approximate likelihood ratio test [148] is another test of phylogenetic relationship, implemented in PhyML along with the bootstrapping method. Bayesian inference method calculates posterior probabilities, which measure branch support instead of bootstrap values.”.

We changed the title to “Test of reliability of phylogenetic trees”. We replaced “nodes” with “branches”. We also added a few comments on the interpretation of the bootstrap and on how UFBoot provides a less biased bootstrap estimate, which facilitates the interpretation of bootstrap values. We changed “characters” to “sites”. We explain the non-parametric bootstrap in more detail.
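
The resampling step of the non-parametric bootstrap described above (alignment sites drawn at random with replacement) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the toy alignment are hypothetical, and the tree inference that would follow for each pseudo-replicate is left to a phylogenetic program and is not shown.

```python
import random


def bootstrap_alignment(sequences, rng):
    """Return one pseudo-replicate: alignment columns sampled with replacement."""
    n_sites = len(sequences[0])
    # Draw as many column indices as there are sites, with replacement.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column choice to every sequence so that sites stay aligned.
    return ["".join(seq[i] for i in cols) for seq in sequences]


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]  # hypothetical toy alignment
replicate = bootstrap_alignment(alignment, rng)
```

Each pseudo-replicate has the same number of sequences and sites as the original alignment, and every one of its columns is a column of the original, possibly repeated.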

In my previous review, I drew your attention to the correct interpretation of bootstrapping, but you ignored that, which was unwise. Instead, you cite Hedges [120], which focuses on the number of bootstrap replicates needed to gain an accurate estimate of the bootstrap P value, and not on the interpretation. You also need to note that bootstrap P values are both data and method dependent. It is the sites in the alignment, not “characters” (L350), which are sampled randomly with replacement, and the correct citation is missing: “each replicate” > “each replicate [Felsenstein 1985; Evolution 39, 783-791]”.

Response: We added a reference to Felsenstein’s paper on the non-parametric bootstrap.

Bootstrapping, as described by Felsenstein and implemented by many since then, is a non-parametric bootstrap approach. Using it gives us an insight into the consistency of the data (i.e., do we have enough sites in the data to consistently infer the same tree). It does not tell us whether the inferred tree is accurate.

Response: We added a comment to stress that the non-parametric bootstrap is not a measure of the accuracy of the tree: “Bootstrapping gives a measure of the consistency of the estimate, but it is not a measure of the accuracy of the tree.”

You write that a “bootstrap value is calculated for every node” (L352). This is incorrect. The value is computed for every internal edge/branch/split that separates the sequences.

You write “from the initial tree” (L353). This is incorrect. It can be done for any tree, even the consensus tree (IQ-TREE often lists the values for the ML tree and the consensus tree; and these trees may not be identical).

Response: We corrected the sentence into “A bootstrap value is calculated for every internal branch, indicating the proportion of replicate phylogenies that recovered the clade.”
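
The corrected sentence can be made concrete with a hedged Python sketch that computes, for each internal branch (split) of a reference tree, the percentage of replicate trees containing the same split. The representation is an assumption chosen for brevity: each tree is a set of splits, and each split is stored as the frozenset of sequence labels on the side of the branch that excludes one arbitrary fixed label. The labels and replicate trees are hypothetical.

```python
from collections import Counter


def split_support(reference_splits, replicate_trees):
    """Percentage of replicate trees that contain each reference split."""
    counts = Counter()
    for tree in replicate_trees:
        for split in reference_splits:
            if split in tree:
                counts[split] += 1
    n = len(replicate_trees)
    return {split: 100.0 * counts[split] / n for split in reference_splits}


# Hypothetical splits for sequences A to E; the label E is never listed, so
# each split is written as the side of the branch that excludes E.
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
abc = frozenset({"A", "B", "C"})

reference = [ab, abc]
replicates = [{ab, abc}, {ab, abc}, {ac, abc}, {ab, abc}]
support = split_support(reference, replicates)
```

In this toy case the split separating A and B from the rest is recovered in three of four replicates, and the split separating A, B and C from the rest in all four.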

You write “A bootstrap value …for the clade”. Who did you quote? There has been some disagreement on the interpretation of bootstrap P values. New methods like those from von Haeseler’s group [121, 122] have clarified the matter considerably.

Response: We added a comment on the interpretation of bootstrap values and how the UFBoot method facilitates their interpretation. L424-431: “Since bootstrapping can be time consuming, UFBoot and UFBoot2, fast approximation methods for phylogenetic bootstrap, have been developed and are implemented in programs such as RAxML or IQ-TREE [130,145,146]. They are also less biased than other non-parametric bootstrapping methods and robust against moderate model violations. While other bootstrapping methods tend to underestimate the probability of a clade being correct, the bootstrap values from UFBoot and UFBoot2 truly reflect the probability of the clade being correct, simplifying the interpretation of bootstrap values (a bootstrap value of 95% indicates a 95% probability of being correct) [145].”

You write “1000 replicates …” (L356). This should be spelled out, and you should probably mention how that fits with Hedges [120].

Response: We added comments on the number of bootstrap replicates according to Hedges: “The number of replicates that are necessary to obtain a good accuracy of the bootstrap value depends on the bootstrap value. For example, for a ±1 confidence interval on a bootstrap value of 95, 2000 replicates are necessary”.
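
The order of magnitude quoted above can be checked with a short Python sketch. It assumes that a bootstrap proportion behaves like a binomial proportion and uses the normal approximation for its confidence interval; this is one plausible way to arrive at such numbers and is not claimed to be the exact derivation in Hedges. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def replicates_needed(support, half_width, confidence=0.95):
    """Replicates needed so that the confidence interval on a bootstrap
    proportion `support` has the requested half-width (both given as fractions)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return math.ceil(z * z * support * (1.0 - support) / half_width ** 2)

n_95 = replicates_needed(0.95, 0.01)  # bootstrap value of 95%, interval of +/- 1%
n_50 = replicates_needed(0.50, 0.01)  # bootstrap value of 50%, same interval
```

Under these assumptions a value of 95% needs a little over 1,800 replicates, consistent with the figure of about 2,000, whereas a value near 50% needs several times more. This is why the required number depends on the bootstrap value itself.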

You write “Other branch support methods exist” (L359) and then refer to the Shimodaira-Hasegawa test [124]. The bootstrap method is not a “branch support method”, so it is not correct to say “Other branch …”. Moreover, the Shimodaira-Hasegawa test is not a branch support test either. It is a test designed to compare the ML tree to other trees.

Response: We explain the SH test in more detail and rewrote that part of the section:

L432-437: “With ML-based phylogenetic inference, the Shimodaira-Hasegawa test, and its improved version AU [147], are designed to evaluate alternative phylogenetic hypotheses, and test if a tree is better supported than another one. It can be used with PAUP, PhyML, FastTree and IQ-TREE. The approximate likelihood ratio test [148] is another test of phylogenetic relationship, implemented in PhyML along with the bootstrapping method. Bayesian inference method calculates posterior probabilities, which measure branch support instead of bootstrap values.”

8. I have not included any specific comments on the two case studies, mainly because I have already used far more time on this review than I had available. That said, the two sections do need some revision.

9. None of the figures are as clear as is necessary for publication. In fact, I can’t read them.

The Bibliography is clearly set using a French application and does not appear to comply with PLoS One’s style. Please ensure that the Bibliography complies with the journal’s formatting style.

Response: We modified the bibliography according to PLoS One’s style.

SPECIFIC COMMENTS

L25: “Here a” > “Here we present a”

Response: We changed “Here a” into “Here we present a”.

L26: “nucleic acid” > “DNA”

Response: We changed “nucleic acid” into “DNA”.

L27: Delete “is presented”

Response: We deleted “is presented”.

L33: “proto cells to modern organisms” > “primitive cells to modern cells and multicellular organisms,”

Response: We changed “proto cells to modern organisms” into “primitive cells to modern cells and multicellular organisms,”.

L36: “sequence composition of” > “information content in”

Response: We changed “sequence composition of” into “information content in”.

L38-L40: Poor sentence structure

L41: “mutations that” > “mutations (including substitutions) that”

Response: We changed “mutations that” into “mutations (including substitutions) that”.

L43: “all” > “both”

Response: We changed “all” into “both”.

L45: “past decades” > “past four decades”

Response: We added “four”.

L48: “taxa” > “taxa, genes or genomic components (e.g., transposable elements)”

Response: We added “, genes or genomic components (e.g., transposable elements)” in the sentence.

L52: “the others” > “other species”

Response: We changed “the others” into “other species”.

L53: “infection diseases” > “infectious diseases”

Response: We changed “infection” into “infectious”.

L59: Delete “both”

Response: We deleted “both”.

L65: Delete “the”

Response: We deleted “the”.

L66: “and of” > “as well as”

Response: We changed “and of” into “as well as”.

L67: “processes, following” > “processes. In so doing, we follow”

Response: We changed “processes, following” into “processes. In so doing, we follow”.

L68: “used by many and maintained” > “maintained and used by many”

Response: We changed “used by many and maintained” into “maintained and used by many”.

L75: Delete “peer-reviewed”

Response: We did not delete “peer-reviewed” because this exact sentence has to be included in the manuscript for protocols, according to the editor’s instructions for authors.

L80: Delete “digitalized”

Response: We deleted “digitalized”.

L82: “nucleic acid or amino acid sequence” > “DNA or amino-acid sequences”

Response: We changed “nucleic acid or amino acid sequence” into “DNA or amino-acid sequences”.

L88: Perhaps state that the protocol is based on that in [1].

Response: We added a reference to the new protocol (ref [1]) in the figure’s caption.

L89: “Nucleic acid” > “DNA”

Response: We changed “nucleic acid” into “DNA”.

L95: “browser focusing on chordates that contains” > “browser that focuses on chordates and contains”

Response: We changed “browser focusing on chordates that contains” into “browser that focuses on chordates and contains”.

L97: “This” > “The”

Response: We changed “This” into “The”.

L99: Delete “Overall,”

Response: We deleted “Overall”.

L105: “rendered” > “provided”

Response: We changed “rendered” into “provided”.

L114: “provides several” > “provides access to several”

Response: We changed “provides several” into “provides access to several”.

L116: “under” > “in the”

Response: We changed “under” into “in the”.

L117: “nucleotide sequences” > “nucleotide”

Response: We changed “nucleotide sequences” into “nucleotide”.

L118: Delete “pasted to fasta file”

Response: We deleted “pasted to fasta file”.

L129: “format and the” > “format. The”

Response: We changed “format and the” into “format. The”.

L129: “interlinks” > “links”

Response: We changed “interlinks” into “links”.

L130: “allows to batch download the” > “facilitates batch downloads of”

Response: We changed “allows to batch download the” into “facilitates batch downloads of”.

L132: “databases” > “databases (such as …)”

Response: We changed “databases” into “databases (such as …)”.

L136: “The KEGG focuses” > “The KEGG database focuses”

Response: We changed “The KEGG focuses” into “The KEGG database focuses”.

L148-L156: You alternate between “3-dimensional” and “3D”. Be consistent

Response: We wrote “3-dimensional” in the whole manuscript to homogenize.

L149: Delete “the”

Response: We deleted “the”.

L153: Place “i.e., … family” between parentheses

Response: We added parentheses.

L159: You write “based on …, or both”. Perhaps separate the databases accordingly.

Response: We agree the sentence was unclear. All three databases use structural similarity, functionality, and evolutionary relationship. We changed the sentence to make it more understandable:

L162-163: “Proteins are classified into different categories based on structural similarity, functionality and evolutionary relationship (Table 2).”

L163: You write “based on the presence or absence of … strands”. I think you need to elaborate a bit further on this. There is the SCOPe classification system to consider but you also need to consider the work by Sun et al. (Current Protocols in Protein Science (2004) 17.1.1-17.1.189)

Response: We added details on the structural classification.

L166-169: “Most proteins belong to one of the five main structural classes (α, β, α/β, α+β, multi-domain), defined respectively by the presence in the structure of α-helices, β-strands, both α-helices and β-strands, segregated α-helices and β-strands, or none of these characteristics.”

We added mention of two other classification systems:

L172-179: “Similarly, the Class, Architecture, Topology, Homology database (CATH) proposes a five-level classification of protein domains. The first three levels, namely class, architecture and topology, are based on structural homology. The last two, homologous superfamily and family, are based on sequence, structure and functional data, and sequence identity, respectively [29]. The Families of Structurally Similar Proteins (FSSP) provides a classification of proteins of the PDB based on a structure comparison algorithm, that calculate a structural similarity score between protein chains. These similarity scores are used to create a classification of protein structures [30,31].”

L164: “class” > “classification”

Response: We kept class instead of classification because we are talking about the protein “class” in the SCOPe system, not the protein classification.

L177: “[16], that combines” > “[16]. It combines”

Response: We changed “[16], that combines” into “[16]. It combines”.

L186: Delete “whole”

Response: We deleted “whole”.

L188: “hemoglobine” > “haemoglobin”

L189: “hemoglobine” > “haemoglobin”

L190: “myoglobine” > “myoglobin”

Response: We changed “hemoglobine” into “haemoglobin” and “myoglobine” into “myoglobin”.

L197: Include a citation to the E-value. You might know it, but you are targeting beginners who can be assumed to be ignorant.

Response: We added a citation of Kerfeld and Scott, 2011, for the E-value.

L206: “retrieving from” > “retrieving sequences from”

Response: We changed “retrieving from” into “retrieving sequences from”.

L213: “related to each other” > “related to each other in an evolutionary sense” [they can be related to each other on other senses]

Response: We changed “related to each other” into “related to each other in an evolutionary sense”.

L220: “row one” > “row of a matrix, one”

Response: We changed “row one” into “row of a matrix, one”.

L230-L231: Poor sentence structure and you probably mean “user interface”, rather than “web interfaces”

Response: We changed “web interface” into “user interface” and rewrote the sentence.

L233: “calculation times” > “execution times”

Response: We changed “calculation times” into “execution times”.

L236-L237: You need a reference to support this statement, and you then need to cite Golubchik et al. (Mol. Biol. Evol. 24, 2433–2442 [2007]), who reported the opposite for PRANK.

Response: We rewrote the sentence on the specificity of PRANK: L252: “PRANK is meant for closely related sequences [57].”

L238: “that” > “which”

Response: We changed “that” into “which”.

L242: “web interface” > “user interface”

Response: We changed “web interface” into “user interface”.

L253: “quantified and phylogenetic informative” > “quantified using AliStat [53] and phylogenetically informative”

Response: We changed “quantified and phylogenetic informative” into “quantified using AliStat [53] and phylogenetically informative”.

L258: “stating for example that” > “stating, for example, that”

Response: We changed “stating for example that” into “stating, for example, that”.

L259: “constant” > “constant over time (i.e., time-homogeneity)”

Response: We changed “constant” into “constant over time (i.e., time-homogeneity)”.

L260: “reversible” > “reversible and, therefore, also stationary (for details on these assumptions, see Jayaswal et al. (Syst. Biol. 63, 726-742 [2014]) and papers cited therein)”

Response: We changed “reversible” into “reversible and, therefore, also stationary (for details on these assumptions, see Jayaswal et al. (Syst. Biol. 63, 726-742 [2014]) and papers cited therein)”.

L261: You write “can be biased [58]”. It would be more appropriate to cite Syst. Biol. (53, 623-637; 53, 638-643) because these papers used data generated by simulation (hence, the truth is known).

Response: We added a citation of Syst. Biol. (53, 623-637; 53, 638-643).

L262: “recent phylogenetic protocols recommend … possible [1]” > “a recent phylogenetic protocol [1] recommends … possible”

Response: We changed “recent phylogenetic protocols recommend … possible [1]” into “a recent phylogenetic protocol [1] recommends … possible”.

L264: “developped” > “developed”

Response: We changed “developped” into “developed”.

L264: You write “tests for all these assumptions”. This is not true. Some have been included in IQ-TREE and IQ-TREE2. I agree that the matched-pairs tests of homogeneity (Bioinformatics 22, 1225-1231) are of relevance, but only for some Markovian conditions. One of these is implemented in Homo 2.1 (https://github.com/lsjermiin/Homo.v2.1).

Response: We changed “tests for all these assumptions” into “Statistical methods allowing to test stationarity, homogeneity under certain markovian conditions, and treelikeness”.

L266: “Selection of the molecular evolution model” > “Selection of the optimal model of sequence evolution”

Response: We changed “Selection of the molecular evolution model” into “Selection of the optimal model of sequence evolution”.

L267: Several papers and book chapters have described model selection in general terms. I think it would be wise to cite some of these (e.g., Front. Genet. 6, 319 [2015]; Meth. Mol. Biol. 1525, 379-420 [2017]; DOI 10.1007/978-1-4939-6622-6_15).

Response: We added references to those papers.

L268: “Nucleotide or amino acid substitution exist” > “Several models of nucleotide or amino-acid substitution exist [REF]”. The statement requires some citations.

Response: We changed “Nucleotide or amino acid substitution exist” into “Several models of nucleotide or amino-acid substitution exist” and added reference to the phylogenetic handbook (Yang, Oxford University Press, 2014).

L270: You cite Yang 2006; you should cite his more recent book from 2014.

Response: We added a reference to this book.

L273: You write “Each model … IQ-TREE [75]”. The set of models described here differs from the substitution models described in L271-L274. They are homotachous models of rate-heterogeneity across sites, so name them appropriately. You forgot to include the PDF/FreeRate model proposed by Yang (Genetics 139, 993-1005 [1995]) and now included in ModelFinder, IQ-TREE and PhyML.

Response: We rewrote the section to better separate the substitution models and the rate-heterogeneity models. We explain in more detail what rate-heterogeneity refers to. We also added a reference to the FreeRate model and the relevant literature.

L331-338: “These substitution models can be associated with models of substitution rate-heterogeneity between sites. Mutation rates and selective pressure may vary among sites, due to different roles in the structure and function of the gene or protein. The most common rate heterogeneity model are the Gamma distribution (G) and the proportion of invariant nucleotide or amino acid sites (I). Every substitution model can be associated with G, I or both. The FreeRate model, a more complex model of rate heterogeneity [94], is included in ModelFinder, PhyML and IQ-TREE. More recently, the GHOST model for alignments with variation in mutation rate was introduced and implemented in IQ-TREE [95].”

L279: “every substitution model” > “model of sequence evolution (i.e., combination of substitution model and rate-heterogeneity across sites model)”

Response: We changed “every substitution model” into “model of sequence evolution (i.e., combination of substitution model and rate-heterogeneity across sites model)”.

L281: “optimizing” > “minimising”

Response: We changed “optimizing” into “minimising”.

L284: Move “PartitionFinder 2 …[81]” to L288. The logic here is that ModelFinder is for a single partition and it supersedes ModelTest, jModelTest and ProtTest. PartitionFinder 2 is for multiple partitions.

Response: We moved the mention of PartitionFinder2 to the end of the section.

L293: Replace the sentence with “Phylogenetic trees may consider the topology and the branch lengths (phylograms) or just the topology (cladograms)”

Response: We replaced the sentence with “Phylogenetic trees may consider the topology and the branch lengths (phylograms) or just the topology (cladograms)”.
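
The difference between the two kinds of tree can be shown with a minimal Python sketch that turns a phylogram into a cladogram. It assumes the trees are written in Newick format with numeric branch lengths after a colon and without quoted labels; the function name and the example tree are hypothetical.

```python
import re

def to_cladogram(newick):
    """Drop ':length' annotations so that only the topology remains."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

phylogram = "((A:0.10,B:0.20):0.05,C:0.30);"  # topology plus branch lengths
cladogram = to_cladogram(phylogram)           # topology only
```

The conversion is one-way: branch lengths cannot be recovered from the cladogram, which is why a phylogram carries strictly more information.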

L294: “tree building methods” > “tree-building methods” [compound adjective]

Response: We changed “tree building methods” into “tree-building methods”.

L295: “distance” > “distances”

Response: We changed “distance” into “distances”.

L295-L296: “number of differences” > “numbers and types of differences”

Response: We changed “number of differences” into “numbers and types of differences”.

L296: “to reconstruct” > “and they use this matrix to reconstruct”

Response: We changed “to reconstruct” into “and they use this matrix to reconstruct”.
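
What “numbers and types of differences” means in practice can be sketched in Python as follows. The sketch assumes aligned DNA sequences of equal length, distinguishes transitions from transversions, and uses the uncorrected p-distance, which is only one of several possible distance measures. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_differences(seq1, seq2):
    """Return (transitions, transversions, compared sites) for two aligned DNA sequences."""
    transitions = transversions = compared = 0
    for x, y in zip(seq1, seq2):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous characters
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions, compared

def p_distance_matrix(alignment):
    """Symmetric matrix of uncorrected pairwise distances, stored as a dict."""
    names = sorted(alignment)
    dist = {(a, a): 0.0 for a in names}
    for a, b in combinations(names, 2):
        ts, tv, n = count_differences(alignment[a], alignment[b])
        dist[(a, b)] = dist[(b, a)] = (ts + tv) / n if n else float("nan")
    return dist

alignment = {"s1": "ACGTACGT", "s2": "ACGTACGC", "s3": "AAGT-CGT"}
matrix = p_distance_matrix(alignment)
```

A tree-building method such as neighbor joining would then take `matrix` as its only input, which is the sense in which distance methods use the matrix to reconstruct the tree.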

L296-L297: The sentence is wrong. Distance methods are also character-based methods.

Response: We left the distinction between distance methods and character-based methods, since this classification is common in reviews and books in the field (e.g. De Bruijn 2014, Muntjal et al. 2019…).

L299: “classic” > “classical”

Response: We changed “classic” into “classical”.

L303: Include citation after “long branch attraction”

Response: We added a reference to the review by Bergsten, 2005.

L309: “mean” > “Mean”

Response: We changed “mean” into “Mean”.

L310: “[91].” > “[91] tree-inference methods.”

Response: We added “tree-inference methods.”

L312: “amino acid data” > “amino-acid data” [compound adjective]

Response: We changed “amino acid data” into “amino-acid data”.

L315: “but also” > “and”

Response: We changed “but also” into “and”.

L316: Delete “,”

Response: We deleted this comma.

L317: “in R language that” > “in the R language. It”

Response: We changed “in R language that” into “in the R language. It”.

L320: “ML” > “maximum likelihood [ML]”. I think this is the first time that you typed ML, so it has to be defined, even though you and I both know what it refers to

Response: We left ML because the term has been defined earlier in the text (L361).

L322: “ML methods” > “Maximum-likelihood methods”. When starting a sentence, it is good practice to spell out the abbreviation (or number)

Response: We changed “ML methods” into “Maximum-likelihood methods”.

L322: “probability” > “likelihood”

Response: We changed “probability” into “likelihood”.

L323: “ML aims” > “Maximum likelihood aims”

Response: We changed “ML” into “Maximum likelihood”.

L323: “combinations of model” > “combinations of trees and model”

Response: We changed “combinations of model” into “combinations of trees and model”.

L327: “amino acid data” > “amino-acid data”

Response: We changed “amino acid data” into “amino-acid data”.

L328: “is accurate” > “is reported as being accurate”

Response: We changed “is accurate” into “is reported as being accurate”.

L329: “all common models of molecular evolution” > “many common models of substitution” or “many common models of sequence evolution” [choose the most accurate statement]

Response: We changed “all common models of molecular evolution” into “many common models of sequence evolution”.

L331: “rate heterogeneity … invariant sites” > “rate-heterogeneity across sites (…)”. Here, you replace “…” with the model that you refer to

Response: We changed the sentence and cited the method mentioned and the relevant publication:

L396: “They use CAT, a specific model of rate heterogeneity, faster than G [129], in addition to Gamma law and proportion of invariant sites.”

L333: “that” > “which”

Response: We changed “that” into “which”.

L333: “method” > “method [REF]”. Here you need to replace “REF” with the relevant citation to the method implemented in IQ-TREE and IQ-TREE 2

Response: We added a reference to the paper by Hoang et al, 2017, describing UFBoot2, and mention the tool in the text.

L336: “recent method” > “recently-developed method”

Response: We changed “recent method” into “recently-developed method”.

L336: “inference, that” > “inference (BI). It”

Response: We changed “inference, that” into “inference (BI). It”.

L337: “calculates the probability” > “calculates the posterior probability”

Response: We changed “calculates the probability” into “calculates the posterior probability”.

L337: “the molecular evolution model given” > “the tree and model of sequence evolution, given”

Response: We changed “the molecular evolution model given” into “the tree and model of sequence evolution, given”.

L338: “ [109] that” > “ [109]. It”

Response: We changed “[109] that” into “[109]. It”.

L362: “inbcludes” > “includes”

Response: We corrected the typo.

L365: “taxa” > “sequences”

Response: We changed “taxa” into “sequences”.

L366: “addressed question” > “question asked”

Response: We changed “addressed question” into “question asked”.

L370: “taxa” > “sequences”

Response: We changed “taxa” into “sequences”.

L370: “studied ingroup” > “ingroup of interest”

Response: We changed “studied ingroup” into “ingroup of interest”.

L373: “identified, … viruses,” > “included”

Response: We changed “identified, … viruses,” into “included”.

L375: “of the longest branches” > “between the most dissimilar sequences in the tree”

Response: We changed “of the longest branches” into “between the most dissimilar sequences in the tree”.

L378: “exported using … a graphical” > “visualised using a graphical”

Response: We changed “exported using … a graphical” into “visualised using a graphical”.

L394: You should include Geneious (https://www.geneious.com/) in the table and mention it in the text.

Response: We added Geneious in the table and in the text.

L397: “allows” > “allow users”

Response: We changed “allows” into “allow users”.

L404: Here you should only write MP, ML and BI. Note that you need to define MP somewhere earlier on in the text

Response: We wrote MP, ML and BI. MP is defined earlier in the text.

L405: “molecular” > “sequence”

Response: We changed “molecular” into “sequence”.

L416: In this section, you use the term “mutations”; typically the term “substitutions” is used

Response: We changed “mutations” into “substitutions” in the whole section.

L417: “The strength” > “The type and strength”

Response: We changed “The strength” into “The type and strength”.

L417: “can be calculated. This” > “may be of interest. It”

Response: We changed “can be calculated. This” into “may be of interest. It”.

L418: “ration” > “ratio”

Response: We corrected the typo.

L421: “The ratio … 1, that” > “If dN/dS > 1, then the”

Response: We changed “The ratio … 1, that” into “If dN/dS > 1, then the”.
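
The usual reading of the dN/dS ratio can be summarised in a minimal Python sketch. It assumes that dN and dS have already been estimated by another program; the function name and the tolerance around 1 are hypothetical, and real analyses typically rely on a statistical test rather than a fixed cut-off.

```python
def selection_regime(dn, ds, tolerance=0.05):
    """Classify the selective regime from dN and dS (omega = dN/dS)."""
    if ds <= 0:
        raise ValueError("dS must be positive for the ratio to be defined")
    omega = dn / ds
    if omega > 1 + tolerance:
        return omega, "positive (diversifying) selection"
    if omega < 1 - tolerance:
        return omega, "purifying (negative) selection"
    return omega, "approximately neutral evolution"

omega_high, regime_high = selection_regime(0.30, 0.10)
omega_low, regime_low = selection_regime(0.02, 0.20)
omega_one, regime_one = selection_regime(0.10, 0.10)
```

Values well above 1 point to positive selection, values well below 1 to purifying selection, and values close to 1 to neutral evolution.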

L426: “Phylogenetic calibration” > “Molecular dating of events”

Response: We changed “Phylogenetic calibration” into “Molecular dating of events”.

L427: “event using events” > “events, using other events”

Response: We changed “event using events” into “events, using other events”.

L440-L444: This can be stated more succinctly

Response: We rewrote the sentences in a more understandable way to introduce coevolution.

L450: Why do you refer to Table 8 here?

Response: We moved the reference to Table 8 to the appropriate sentence.

L457: “species” > “traits”

Response: We changed “species” into “traits”.

L462-L466: This needs to be stated better

Response: We rewrote the sentence:

L554-557: “Phylogenetic trees, in complement with genomics tools and databases, can be used to study genome evolution; and identify evolutionary events such as be mutations, insertions, deletions, gene or genome duplications, genome re-organization, chromosomal rearrangements, polyploidization events or genetic exchanges.”

L477: “Population genetics often studies” > “Population geneticists often study”

Response: We changed “genetics” into “geneticists”.

L483: “of” > “at”

Response: We changed “of” into “at”.

L484: “studies, for a full review see” > “studies; for a full review, see”

Response: We changed “studies, for a full review see” into “studies; for a full review, see”.

L493: “in Å” > “in ångström”

Response: We changed “in Å” into “in ångström”.

L494: You write “can be calculated”. Which program?

Response: We added “using PyMol”.

L495: “3-dimentional” > “3D”

Response: To be more consistent, we wrote “3-dimensional” in the whole manuscript.

L496: “amino acid” > “amino-acid”

Response: We changed “amino acid” into “amino-acid”.

L498: Here I think you need to include AlphaFold (Nature 596, 583–589 [2021])

Response: We included AlphaFold in the text and the table.

Attachment

Submitted filename: Response to Reviewers.pdf

Decision Letter 2

Arndt von Haeseler

12 Dec 2022

Roadmap to the study of gene and protein phylogeny and evolution – a practical guide

PONE-D-22-06715R2

Dear Dr. Hammarlund,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Arndt von Haeseler

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

You have addressed all major comments. I trust you to make the minor correction suggested by the reviewer.

Otherwise the manuscript is now ready for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript report a protocol which is of utility to the research community and adds value to the published literature?

Reviewer #2: Yes

**********

2. Has the protocol been described in sufficient detail?

To answer this question, please click the link to protocols.io in the Materials and Methods section of the manuscript (if a link has been provided) or consult the step-by-step protocol in the Supporting Information files.

The step-by-step protocol should contain sufficient detail for another researcher to be able to reproduce all experiments and analyses.

Reviewer #2: Yes

**********

3. Does the protocol describe a validated method?

The manuscript must demonstrate that the protocol achieves its intended purpose: either by containing appropriate validation data, or referencing at least one original research article in which the protocol was used to generate data.

Reviewer #2: No

**********

4. If the manuscript contains new data, have the authors made this data fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the article presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please highlight any specific errors that need correcting in the box below.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: GENERAL COMMENTS

The manuscript has improved substantially and is pretty much ready for publication.

Below is a list of minor issues that I suggest you attend to before returning the final version to the journal.

Lastly, well done on assembling so much knowledge. I might use your paper as reading material for 3rd-year biology students doing bioinformatics. Take that as a compliment!

SPECIFIC COMMENTS

L21: “into” > “of”

L61: “coupled to” > “coupled with” [I think this is the correct expression]

L69: “used by many” > “popular”

L86: Figure 1 is almost ready. However, the feedback loops are not accurate. You need to go back to the corresponding figure in Jermiin et al. (2020) and replicate those feedback loops.

L104: “of” > “on”

L118: “plant specific” > “plant-specific” [compound adjective, so it is hyphenated]

L119: What is the “efP” browser? It is listed without citation and URL. Should it be spelled out?

L123: “batch download” > “batch-download” [compound verb]

L129: Is “General information on proteins” a subheading under “Protein databases”? If so, then use a smaller font.

L131: “open reading frame translations” > “open-reading-frame translations” [compound adjective]

L138: You use the abbreviation “PDB” before you define it [line 153]. Please fix it.

L153: PDB should be defined earlier in the document and its citation should be moved to L138.

L154: “3-dimensional” > “three-dimensional (3D)”

L156: “three-dimensional (3D)” > “3D”

L158: “3-dimensional” > “3D”

L164: “The structural” > “The 3D structural”

L174: “database (CATH)” > “(CATH) database”

L178: “of” > “in”

L196: The definition of ‘homologues’ should be moved to where one of the following three terms is first used (‘homology’, ‘homologous’, and ‘homologues’) [perhaps L100; in fact, it may be sensible to define ‘homologues’, ‘orthologues’, ‘paralogues’ and ‘xenologues’ in the Introduction, because these terms are used widely].

L217: “exhaustive research of homologues” > “exhaustive search for homologues” [I think this is what you mean]

L226: “amino acid” >“amino-acid”

L228: “tackle the tools these different steps” > “describe the tools used in these different steps”

L245: “] display user interfaces” > “] display inferred MSAs using/in user interfaces”

L247: “They” > “Their use”

L249: “for” > “to analyse”

L251: “closely related” > “closely-related”

L251: “bayesian” > “Bayesian”

L253: “machine learning” > “machine-learning”

L271: You need more than one nucleotide to make a sequence, so revise the sentence.

L286: You should include AliStat in this list.

L291: “Phylogenetic models” > “Phylogenetic methods”

L291: “simplifying assumptions stating” > “simple assumptions about the evolutionary processes, stating”

L295: “can be” > “may become”

L296: “sites selected” > “sites have been selected”

L297: “allowing to” > “allowing users to”

L298: “stationarity, homogeneity under certain Markovian conditions” > “stationarity and homogeneity of the evolutionary processes (along diverging lineages)”

L300: “)[“ > “) [“

L307: “, and most” > “. Most”

L311: “sequence evolution model” > “model of sequence evolution”

L313: “an inappropriate approximation” > “inappropriate approximations”

L325: Delete “substitution”

L326: “between” > “across”

L327: “rate heterogeneity models” > “rate heterogeneity across sites models”

L334: “combination of substitution model and rate-heterogeneity” > “a combination of a substitution model and a rate-heterogeneity”

L340: “for alignments … acids” > “, for alignments … acids, ”

L342: Delete “test”

L372: “analyses including” > “analyses, including”

L386: Do you mean “tree manipulators”?

L386: “of” > “to”

L389: You use the term “Gamma law”. I don’t like it, and suggest you call it something else (e.g., “to assuming Gamma-distributed rate-heterogeneity across sites”).

L394: “recently developed” > “recently-developed”

L398: “bayesian” > “Bayesian”

L399: “model well” > “model. It is well”

L412: “all informative characters” > “all resampled datasets”

L413: “few characters” > “a few of these datasets”

L426: “version” > “version,”

L432: In this section, you should consult and cite Syst. Biol. 52, 229-238 and Mol. Biol. Evol. 24, 2400-2411. These papers are highly relevant.

L450: “in” > “is”

L456: “good” > “recommended”

L459: After the PP test, you should cite Syst. Biol. 63, 309-321.

L462: Delete “a”

L480: “3-dimensional” > “3D”

L494: IQ-TREE and IQ-TREE 2 also allow users to infer ancestral sequences, so it should also be listed here.

L512: In this section, you need to consult and cite Trends in Genetics 20, 80-86. It is extremely funny and highly relevant.

L548: You refer to “steps 1 and 2” in Figure 1, but they are not numbered, and the figure is very blurry (I cannot read the letters). You should fix this.

L575: “3-dimensional” > “3D”

L578: Great to see “ångström” spelled correctly.

L580: “amino acid” > “amino-acid”

L597: “step by step” > “step-by-step”

L610: Move “instead” to after “but” [you used a Swedish sentence structure]

L613: “1765 … in Pfam” > “Pfam contains 1765 P53 sequences from 382 species” [you never start a sentence with a number, and if you do have to, then that number must be spelled out]

L617: Sentence starts with a number [see previous comment]

L628: Sentence starts with a number [see previous comment]

L619: Sentence starts with a number [see previous comment]

L623: I cannot read Figure 2A. The colours make it impossible for me to see what you are trying to show. I am not colour-blind. Something about the figure needs to change.

L636: “3” > “Three”

L645: Is there a reason for typesetting this paragraph in italics? It looks as if it is an invitation to the reader. Should this not be made clearer, for example on L588?

L656: “has been used” > “was chosen”

L662: The text in this paragraph assumes a multiple sequence alignment involving all sequences in question. You might instead have discussed percent identities based on pairwise alignments. I think you need to move this paragraph to after subsection “3. Multiple sequence alignment …”.

L668: Is there a reason for typesetting this paragraph in italics? It looks as if it is an invitation to the reader. Should this not be made clearer, for example on L588?

L679: “select” > “identify” [you can only select something after having identified it]

L691: Don’t write “Gamma law”. Find a more accurate term.

L694: “user interface or online” > “a user interface or an online”

L723: “[101] using” > “[101], using”

L723: “numbers indicate” > “numbers on the internal edges/branches indicate”

L762: Has anyone published an evolutionary history of P53 before this submission? If so, you could add a sentence or two about this and about how your result compares to the previous one.

L780: “21” > “Twenty-one”

L789: “has been” > “was”

L792: “tested with” > “consistency of the phylogenetic estimate was assessed using”

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Arndt von Haeseler

22 Dec 2022

PONE-D-22-06715R2

Roadmap to the study of gene and protein phylogeny and evolution – a practical guide

Dear Dr. Hammarlund:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Arndt von Haeseler

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Sequences and accession numbers of p53 proteins.

    These sequences and accession numbers were used for phylogenetic analysis.

    (PDF)

    S2 File. Sequences and accession numbers of human CDKs.

    These sequences and accession numbers were used for phylogenetic analysis.

    (PDF)

    S3 File. Sequences and accession numbers of human cyclins.

    These sequences and accession numbers were used for phylogenetic analysis.

    (PDF)

    S4 File. Nexus code.

    The code was used to reconstruct the co-phylogeny of human CDKs and cyclins.

    (PDF)

    S5 File. Protocol.

    The protocol is also available on protocols.io.

    (PDF)

    Attachment

    Submitted filename: Rebuttal_letter_PlosOne_Jacques et al.pdf

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    Data are within the Supporting Information files.


    Articles from PLOS ONE are provided here courtesy of PLOS
