AvP: A software package for automatic phylogenetic detection of candidate horizontal gene transfers

Georgios D Koutsovoulos; Solène Granjeon Noriot; Marc Bailly-Bechet; Etienne G J Danchin; Corinne Rancurel

doi:10.1371/journal.pcbi.1010686

PLoS Comput Biol. 2022 Nov 9;18(11):e1010686. doi: 10.1371/journal.pcbi.1010686


Editor: Mark Ziemann

PMCID: PMC9678320 PMID: 36350852

Abstract

Horizontal gene transfer (HGT) is the transfer of genes between species outside the transmission from parent to offspring. Due to their impact on the genome and biology of various species, HGTs have gained broader attention, but high-throughput methods to robustly identify them are lacking. One rapid method to identify HGT candidates is to calculate the difference in similarity between the most similar gene in closely related species and the most similar gene in distantly related species. Although metrics on similarity associated with taxonomic information can rapidly detect putative HGTs, these methods are hampered by false positives that are difficult to track. Furthermore, they do not inform on the evolutionary trajectory and events such as duplications. Hence, phylogenetic analysis is necessary to confirm HGT candidates and provide a more comprehensive view of their origin and evolutionary history. However, phylogenetic reconstruction requires several time-consuming manual steps to retrieve the homologous sequences, produce a multiple alignment, construct the phylogeny and analyze the topology to assess whether it supports the HGT hypothesis. Here, we present AvP which automatically performs all these steps and detects candidate HGTs within a phylogenetic framework.

This is a PLOS Computational Biology Software paper.

Introduction

The acquisition of genes through horizontal gene transfer (HGT) is mostly observed in prokaryotes, where it plays a significant role in adaptive evolution (e.g. antibiotic resistance). To a lesser degree, cases of HGT have also been observed in eukaryotes, with important consequences for the biology of the organism [1]. The growing number of sequenced genomes and newly predicted gene sets represents an opportunity to detect additional HGT cases and to characterize the possible donors more precisely. To meet these needs, high-throughput yet robust HGT detection methods are required.

One method to predict potential HGTs is to calculate the difference in similarity using BLAST [2] (or other sequence similarity search software) between phylogenetically closely related and distant species. The Alien Index (AI) metric uses the difference in e-value between the best hit from closely (Ingroup) and distantly (Donor) related taxa [3]. Positive AI means that the gene is more similar to a distant taxon and indicates a potential HGT. In the past, different values of AI have been used as a cutoff to decrease false positives but with the potential risk of missing HGTs. Similarly, the HGT Index (h) [4] uses the difference in bit scores but is hampered by the same limitations in terms of a trade-off between reducing false positives without missing valid cases. However, tracking these false positives from homology search results alone is not possible.
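The two best-hit metrics described above can be sketched as follows. This is a minimal illustration, not the AvP implementation: the function names are hypothetical, and the exact formulas (a natural-log difference of e-values with a 1e-200 offset for AI, and a plain bit score difference for h) follow the definitions commonly attributed to [3] and [4] rather than anything spelled out in this text, so the constants should be treated as assumptions.

```python
import math


def alien_index(best_ingroup_evalue: float, best_donor_evalue: float) -> float:
    """Alien Index from the best Ingroup and best Donor e-values.

    Assumption: AI = ln(E_ingroup + 1e-200) - ln(E_donor + 1e-200).
    The small offset avoids taking the logarithm of zero.
    """
    offset = 1e-200
    return math.log(best_ingroup_evalue + offset) - math.log(best_donor_evalue + offset)


def hgt_index(best_donor_bitscore: float, best_ingroup_bitscore: float) -> float:
    """HGT Index h, assumed here to be the Donor minus Ingroup best bit score."""
    return best_donor_bitscore - best_ingroup_bitscore


# A query whose best Donor hit is far more significant than its best Ingroup hit
# receives a positive AI, which flags it as a potential HGT.
ai = alien_index(best_ingroup_evalue=1e-5, best_donor_evalue=1e-50)
h = hgt_index(best_donor_bitscore=250.0, best_ingroup_bitscore=100.0)
```

Under this formulation both metrics look only at one hit per side, which is why a single misannotated database entry can flip the sign of either score.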

Even if different cutoffs are applied to AI, the underlying best BLAST hit analysis oversimplifies the evolutionary complexity of HGT. Recently, an additional metric called outg_pct, which is the percentage of species from Donor lineage in the top hits that have different taxonomic species names, has been used in conjunction with AI to filter out some of the false positives resulting from erroneous taxonomic annotation of the best BLAST hits [5]. A more evolutionarily comprehensive method is to extract the results from the whole BLAST analysis and infer a phylogenetic tree. The phylogenetic position of the potential HGT candidate in relation to the other genes and their taxonomy provides an evolutionary framework and will validate or reject the HGT hypothesis. However, manually producing and then checking each phylogenetic tree is a labour-intensive and time-consuming process. In addition, contamination or symbionts in genome sequencing, unless handled properly, can produce false positives that pass both AI and phylogenetic analysis [6]. External information, such as the target gene structure, the taxonomic affiliation of genes near the target gene, and support from transcription data, is necessary to eliminate such false positives. Combining all this information will lead to a more accurate prediction of putative HGTs.

Methods exist to perform gene tree-species tree reconciliation to detect xenologs (i.e. HGTs) [7, 8], and they are able to distinguish genes that were transferred horizontally with or without duplication events. However, these methods require a species tree to be provided together with the gene tree. Therefore, testing hundreds of genes requires either creating different species trees according to the input sequences or comparing everything against the whole NCBI tree of life, which contains hundreds of thousands of branches.

In this study, we present AvP (short for ‘Alienness vs Predictor’) to automate the robust identification of HGTs at high throughput with no need to provide a reference species tree. AvP extracts all the information needed to produce input files for phylogenetic reconstruction, evaluates HGTs from the phylogenetic trees, and combines several sources of external information for additional support (e.g. gff3 annotation file, transcript quantification file). Our method does not rely on an explicit reference species tree and only uses a simplified take on the species phylogeny, according to the organism tested. This allows for a rapid phylogenetic detection of HGTs that can then be used as input for more sophisticated analyses.

Design and implementation

Software description

AvP performs automatic detection of HGT candidates within a phylogenetic framework. The pipeline comprises two major steps: (i) prepare, and (ii) detect, and three optional steps: (iii) classify, (iv) evaluate, and (v) hgt_local_score (Fig 1). Although the pipeline has been extensively tested with protein datasets, it should work at the DNA level, including non-coding sequences (see GitHub documentation). For the rest of the article we assume a protein dataset.

Fig 1 — Dashed lines indicate optional routes and analyses.

AvP requires three primary files: (i) a fasta file containing the proteins of the species being studied, (ii) a tabular results file from a similarity search (e.g. BLAST or DIAMOND [9]) against a protein database, and (iii) an AI features file. Furthermore, the user must provide two config files: one with information on the taxonomic ingroup in the study (defining which group of species is considered closely related and which group is distantly related) and one defining multiple software parameters. The AI features file can be created with the script calculate_ai.py, which can be found in the repository.

AvP prepare

The software selects proteins for downstream analyses based on any combination of the metrics AI, outg_pct, and AHS (described below). Then, the software collects all protein sequences corresponding to significant hits from the database based on the tabular results file of the homology search and groups the query species sequences based on the percentage of shared hits (by default 70%) using single linkage clustering. Alternatively, the user can specify a file containing user-generated groups of queries and hits (e.g. from OrthoFinder [10] or protein domain analysis). For each group, a fasta file is created containing the query species sequences and their respective database hits. Each file is then aligned using MAFFT [11] with an option for alignment trimming with trimAl [12].
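The grouping step can be illustrated with a small single linkage sketch. It is not taken from the AvP code base: the function names are hypothetical, and "percentage of shared hits" is assumed here to mean the number of shared hits divided by the size of the smaller hit set, a definition the text does not specify.

```python
from itertools import combinations


def shared_fraction(hits_a: set, hits_b: set) -> float:
    """Fraction of shared hits relative to the smaller hit set (assumed definition)."""
    if not hits_a or not hits_b:
        return 0.0
    return len(hits_a & hits_b) / min(len(hits_a), len(hits_b))


def single_linkage_groups(query_hits: dict, threshold: float = 0.7) -> list:
    """Group queries so that any chain of pairwise links >= threshold forms one group."""
    parent = {query: query for query in query_hits}

    def find(query):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[query] != query:
            parent[query] = parent[parent[query]]
            query = parent[query]
        return query

    for a, b in combinations(query_hits, 2):
        if shared_fraction(query_hits[a], query_hits[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for query in query_hits:
        groups.setdefault(find(query), set()).add(query)
    return sorted(sorted(members) for members in groups.values())


hits = {
    "q1": {"h1", "h2", "h3", "h4"},
    "q2": {"h1", "h2", "h3", "h5"},
    "q3": {"h2", "h3", "h5", "h6"},
    "q4": {"h9"},
}
groups = single_linkage_groups(hits)
```

In this toy input q1 and q3 share only half of their hits, yet they end up in the same group because each is linked to q2. That chaining behaviour is the defining property of single linkage clustering.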

AvP detect

There are two options available for phylogenetic inference within AvP: (i) FastTree [13], and (ii) IQ-TREE [14]. The defaults for these programs are [-gamma -lg] for FastTree and [-mset WAG,LG,JTT -AICc -mrate E,I,G,R] for IQ-TREE. The user can change the IQ-TREE parameters in the config file. These two approaches vary in time and compute requirements, and consequently in tree reconstruction accuracy [15]. Alternatively, the software can use user-generated phylogenetic trees, inferred from the alignment files created by AvP prepare with any program that produces a valid Newick format file. By default, AvP does not impose a branch support threshold. However, the user can define a support threshold in the config file, below which branches are collapsed into polytomies.
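Collapsing weakly supported branches into polytomies can be sketched on a toy tree structure. The `Node` class and `collapse_low_support` function are hypothetical names introduced for illustration only; AvP's own tree handling is not described in the text and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Minimal tree node: leaves have a name, internal nodes may carry a support value."""
    name: str = ""
    support: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node: Node, threshold: float) -> Node:
    """Dissolve internal branches whose support is below threshold (in place)."""
    new_children = []
    for child in node.children:
        collapse_low_support(child, threshold)
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < threshold:
            # Attach the grandchildren directly to this node, creating a polytomy.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


weak = Node(support=40.0, children=[Node("B"), Node("C")])
strong = Node(support=95.0, children=[Node("D"), Node("E")])
root = Node(children=[Node("A"), weak, strong])
collapse_low_support(root, threshold=70.0)
```

After the call, the weakly supported (B, C) branch has been dissolved so that A, B and C attach directly to the root, while the well supported (D, E) branch is kept.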

Each phylogenetic tree is then processed (midpoint rooting) and each query sequence is classified into one of the following three categories: HGT candidate (✓), Complex topology (?), or No evidence for HGT (X). The taxonomic assignment of genes and their position in the tree relative to the query gene are used to characterise the gene as HGT or not. Two branches are taken into account: the sister branch of the gene of interest and the ancestral sister branch (Fig 2). Each of these branches is tagged independently, depending on the sequences it contains, as Donor (i.e. distantly related species), Ingroup (i.e. closely related species), or both. Ingroup is defined by the user and Donor is all species not in Ingroup. The Ingroup tag is applied if most of the sequences (default 80%) belong to taxa inside the taxonomic group closely related to the species studied. Conversely, the Donor tag is applied if most of the sequences belong to taxa that fall outside the Ingroup taxonomic clade. If the branch contains taxa from both groups at a ratio higher than 1 to 5, the branch is tagged as both. The tags of these two branches are then processed according to Table 1. For example, if we are searching a eukaryotic species for HGT originating from prokaryotic species, the Ingroup is set to Eukaryota and the Donor to non-Eukaryota (bacteria, viruses, etc.). If the sister branch of the query contains sequences that belong to eukaryotic species, it is tagged as Ingroup and the gene is not considered an HGT. In another example, if both the sister branch and the ancestral sister branch contain mostly sequences from non-eukaryotic species, both branches are tagged as Donor and the gene is considered a potential HGT.

Fig 2 — Sister branch positions on the phylogenetic tree.

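Table 1. Detection table indicating whether the tested gene is an HGT candidate, based on the tags of the sister branch (SB, columns) and of the ancestral sister branch (Ancestral SB, rows). ✓: HGT candidate; ?: Complex topology; X: No evidence for HGT.

Ancestral SB	SB: Donor	SB: Ingroup	SB: Donor + Ingroup
Donor	✓	X	?
Ingroup	?	X	X
Donor + Ingroup	?	X	?
Not present	✓	X	?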
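The tagging rule and the lookup in Table 1 can be expressed compactly in code. The sketch below is illustrative only: the function and variable names are hypothetical, a branch is represented simply as a list of booleans (True for an Ingroup sequence), and treating exactly 80% as sufficient for a single tag is an assumption about the boundary case.

```python
def tag_branch(is_ingroup: list, majority: float = 0.8) -> str:
    """Tag a branch as Ingroup, Donor, or Donor + Ingroup from its leaf affiliations."""
    if not is_ingroup:
        return "Not present"  # only meaningful for the ancestral sister branch
    n_total = len(is_ingroup)
    n_ingroup = sum(1 for flag in is_ingroup if flag)
    if n_ingroup / n_total >= majority:
        return "Ingroup"
    if (n_total - n_ingroup) / n_total >= majority:
        return "Donor"
    return "Donor + Ingroup"


# Table 1 as a lookup keyed by (ancestral sister branch tag, sister branch tag).
HGT, COMPLEX, NO_HGT = "HGT candidate", "Complex topology", "No evidence for HGT"
DETECTION_TABLE = {
    ("Donor", "Donor"): HGT,
    ("Donor", "Ingroup"): NO_HGT,
    ("Donor", "Donor + Ingroup"): COMPLEX,
    ("Ingroup", "Donor"): COMPLEX,
    ("Ingroup", "Ingroup"): NO_HGT,
    ("Ingroup", "Donor + Ingroup"): NO_HGT,
    ("Donor + Ingroup", "Donor"): COMPLEX,
    ("Donor + Ingroup", "Ingroup"): NO_HGT,
    ("Donor + Ingroup", "Donor + Ingroup"): COMPLEX,
    ("Not present", "Donor"): HGT,
    ("Not present", "Ingroup"): NO_HGT,
    ("Not present", "Donor + Ingroup"): COMPLEX,
}


def classify_query(sister: list, ancestral: list) -> str:
    """Classify a query from the leaf affiliations of its two reference branches."""
    return DETECTION_TABLE[(tag_branch(ancestral), tag_branch(sister))]


# Sister and ancestral sister branches made up only of non-Ingroup sequences.
result = classify_query(sister=[False] * 5, ancestral=[False] * 4)
```

Note that an Ingroup sister branch always yields "No evidence for HGT", whatever the ancestral branch contains, which matches the middle column of Table 1.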


For each query sequence, the software produces a nexus formatted file containing the phylogenetic tree, the taxonomic information for each sequence, and each sequence coloured by the taxonomic affiliation for quick visual parsing. The nexus file can be visualised with the tree visualisation software FigTree [16].

AvP classify

This step allows the further classification of HGT candidates into user-generated nested taxonomic ranks for their putative origins. It follows the same logic as in the step AvP detect described previously in terms of tagging the clades to a specific taxonomic affiliation. For example, the HGTs can be classified based on their origin, such as Fungi, Viridiplantae, Viruses etc., according to the NCBI taxonomy.

AvP evaluate

For each HGT candidate, the topology is constrained to form a single monophyletic group containing the query sequence and all the Ingroup sequences. A phylogenetic tree is inferred with FastTree or IQ-TREE and the likelihoods of the initial and constrained topologies are compared with IQ-TREE, which supports several tree topology tests. This step can inform whether the topology supporting HGT is more likely than the alternative constrained topology that does not support HGT.
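The constrained topology described above can be written as a multifurcating Newick tree in which the query and all Ingroup sequences form one clade and every other sequence is left unresolved. The helper below is a hypothetical sketch of how such a constraint string might be assembled; a tree of this form is the kind of input accepted by, for example, the IQ-TREE `-g` option, but whether AvP builds its constraint file in exactly this way is not stated in the text.

```python
def constraint_newick(query: str, ingroup: list, others: list) -> str:
    """Build a Newick constraint forcing query + Ingroup sequences to be monophyletic."""
    if not ingroup:
        raise ValueError("at least one Ingroup sequence is needed to form the clade")
    if not others:
        raise ValueError("at least one non-Ingroup sequence is needed outside the clade")
    clade = ",".join([query] + list(ingroup))
    rest = ",".join(others)
    # One resolved clade; everything else hangs off the root as a polytomy.
    return f"(({clade}),{rest});"


constraint = constraint_newick(
    query="query_gene",
    ingroup=["metazoa_1", "metazoa_2"],
    others=["bacteria_1", "bacteria_2", "fungi_1"],
)
```

Leaving everything outside the forced clade unresolved means the tree search stays free to optimise the rest of the topology, so the likelihood comparison isolates the effect of the monophyly constraint.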

AvP hgt_local_score

Given a gff3 file containing the genomic location of the genes of the query species and the results of the AvP analyses, this step calculates a score for each HGT candidate that reflects whether the HGT candidate is surrounded by genes from the query genome or by ‘alien’ genes, including possible contaminants. The score ranges between -1 and +1, with -1 strongly indicating contamination and +1 strongly indicating an HGT candidate (Fig 3). The rationale is that a candidate HGT surrounded by genes that were also detected as candidate HGTs might be part of a contaminant insertion in the genome assembly (although HGT of a whole block of genes or duplications after acquisition are also possible). Hence, this step alerts the user to possible contamination. Conversely, if the candidate HGT is surrounded by genes that were more likely inherited vertically, the contamination hypothesis can reasonably be ruled out.
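The idea of a neighbourhood score bounded between -1 and +1 can be illustrated with a toy calculation. The per-neighbour values used by AvP are given only in Fig 3 and are not reproduced in the text, so the weights, the class labels, and the use of a plain average below are all placeholders chosen for illustration, not the actual scoring scheme.

```python
# Hypothetical per-neighbour contributions (placeholders, not AvP's values).
NEIGHBOUR_WEIGHTS = {
    "no_hgt": 1.0,    # neighbour looks vertically inherited
    "complex": 0.0,   # neighbour with an uninformative topology
    "unknown": 0.0,   # neighbour that was not analysed
    "hgt": -1.0,      # neighbour is itself an HGT candidate
}


def local_score(neighbour_classes: list) -> float:
    """Average the contributions of the neighbouring genes; result lies in [-1, +1]."""
    if not neighbour_classes:
        return 0.0
    total = sum(NEIGHBOUR_WEIGHTS[label] for label in neighbour_classes)
    return total / len(neighbour_classes)


# A candidate flanked mostly by vertically inherited genes scores above 0.
score = local_score(["no_hgt", "no_hgt", "hgt", "unknown"])
```

With any weights confined to the interval from -1 to +1, averaging keeps the score in the same interval, so the sign can be read as in the text: positive values favour a genuine insertion and negative values point to contamination or an HGT-rich region.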

Fig 3 — Each neighbouring gene contributes to the score based on its classification getting a value described in the top left panel. In the example, the score is equal to 0.34, most likely indicating an HGT insertion. Overall, a score above 0 indicates an HGT insertion, while a score below 0 indicates a possible contamination or HGT rich region.

AHS: A new contamination-aware metric

The two widely used metrics (AI and h) use only the best Ingroup and best Donor hits from the BLAST output. This poses several potential issues. AI in particular can be 0 if both hits have an e-value of 0, even though they may differ in similarity. The h metric resolves this problem by using the bitscore instead of the e-value. However, both of these metrics are sensitive to taxonomically misclassified or contaminating sequences in databases because they rely only on the best hits. For instance, if the best hit is wrongly assigned a Donor taxid in the database, these metrics will erroneously detect a candidate HGT. Conversely, a wrongly assigned Ingroup taxid in the database would necessarily result in no HGT detection if it is the best hit. A different approach to circumvent this issue is to aggregate all the bitscores of Donor and Ingroup sequences and then calculate h. Although this approach will minimise erroneous results, it will still suffer from sampling biases.

In order to minimise all these effects, we developed a new metric called Aggregate Hit Support (AHS). We first normalise each bitscore (Eq 1). We then sum the normalised bitscores of the Donor hits and the normalised bitscores of the Ingroup hits separately and calculate the difference (Eq 2). A positive AHS score suggests a potential HGT candidate.

\begin{equation}
\mathrm{Bitscore}_{N} = \mathrm{Bitscore} \cdot e^{-10 \cdot \frac{\mathrm{HBitscore} - \mathrm{Bitscore}}{\mathrm{Bitscore}}} \tag{1}
\end{equation}

\begin{equation}
\mathrm{AHS} = \sum \mathrm{Bitscore}_{N}^{\mathrm{Donor}} - \sum \mathrm{Bitscore}_{N}^{\mathrm{Ingroup}} \tag{2}
\end{equation}
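Eqs (1) and (2) translate directly into a few lines of code. The sketch below is not the AvP implementation: the function names are hypothetical, and HBitscore, which the text does not define, is assumed here to be the highest bitscore among all hits of the query.

```python
import math


def normalise_bitscore(bitscore: float, highest_bitscore: float) -> float:
    """Eq (1): down-weight a hit exponentially as it falls below the highest bitscore."""
    return bitscore * math.exp(-10.0 * (highest_bitscore - bitscore) / bitscore)


def aggregate_hit_support(donor_bitscores: list, ingroup_bitscores: list) -> float:
    """Eq (2): summed normalised Donor bitscores minus summed normalised Ingroup ones."""
    all_scores = list(donor_bitscores) + list(ingroup_bitscores)
    if not all_scores:
        return 0.0
    # Assumption: HBitscore is the best bitscore over all hits of this query.
    highest = max(all_scores)
    donor_sum = sum(normalise_bitscore(b, highest) for b in donor_bitscores)
    ingroup_sum = sum(normalise_bitscore(b, highest) for b in ingroup_bitscores)
    return donor_sum - ingroup_sum


# One strong but isolated Donor hit against several consistent Ingroup hits:
# the aggregate is negative even though the single best hit is a Donor sequence.
ahs = aggregate_hit_support(donor_bitscores=[500.0],
                            ingroup_bitscores=[495.0, 490.0, 480.0])
```

The exponential term means that hits well below the top score contribute almost nothing, while near-top hits count almost fully, so a single misannotated best hit can be outvoted by several slightly weaker hits from the other group.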

Results

HGT pipeline

We tested our pipeline using the predicted protein set for the tardigrade species Hypsibius exemplaris (previously named H. dujardini) [17]. We used the NCBI nr database instead of the SwissProt+TrEMBL libraries used in the original publication, and selected candidates with AI > 30 instead of h_ST > 30 (HGT Index), while the phylogenetic inference was performed with FastTree instead of RAxML [18]. The final selection was 401 proteins (386 genes) compared to 463 proteins (463 genes), and based on the phylogenetic trees, we detected a total of 379 candidate HGTs (95%) instead of 357 (77%). Overall, 342 candidate HGTs were common to AvP and the previously published analysis; those not identified by our pipeline had an AI below 30. We then evaluated the candidate HGTs by comparing the likelihoods of the original HGT-supporting trees to those of constrained trees in which tardigrade and other metazoan proteins were forced to form a monophyletic group. Equally likely topologies were observed for 27 proteins, bringing the total number of strongly supported candidate HGTs to 352 (1.7% of the total proteins present in the genome). To assess the effect of using different databases, we performed two more searches against SwissProt (SP) and Uniref90 (UR). A total of 196 / 333 / 401 proteins were selected when using SP / UR / NR, resulting in 127 / 292 / 352 candidate HGTs after alternative topology tests (AvP evaluate). Hence, depending on the sampling of the sequence diversity present in the sequence database, the number of detectable candidate HGTs varies considerably.

In the original publication describing Alien Index (AI) [3], the authors considered AI > 45 to be a good indication of foreign origin, while genes with 0 < AI < 45 were designated intermediate. However, this AI threshold value was originally defined on a single species, the bdelloid rotifer, and further analyses on plant-parasitic nematodes have shown that AI > 45 might be too stringent, leaving several true positives undetected [19]. Here, we calculated the F1 score (Eq 3) for all N with AI > 0 in H. exemplaris to decide the optimal threshold between precision and sensitivity. We found that selecting genes with AI > 10 represented an optimal balance between sensitivity and precision (Fig 4). Therefore, we propose running AvP with AI > 0 and the FastTree option to minimize the risk of missing HGT cases, then using the scripts provided to calculate the F1 score and, based on that, deciding the optimal AI threshold (which is 10 for the tardigrade example) for more sophisticated and time-consuming phylogenetic analyses.

\begin{equation}
F1_{N} = 2 \cdot \frac{\mathrm{HGT}_{AI > N}}{\mathrm{HGT}_{AI > 0} + \mathrm{Genes}_{AI > N}} \tag{3}
\end{equation}
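Eq (3) is the usual harmonic mean of precision and sensitivity: taking precision as HGT_{AI>N} / Genes_{AI>N} and sensitivity as HGT_{AI>N} / HGT_{AI>0}, the expression 2PS / (P + S) simplifies to the form above. The following sketch scans a set of thresholds and reports the F1 score at each. The function name and the input representation (one tuple per gene holding its AI and whether the phylogeny supported HGT) are assumptions made for illustration, and the gene list is invented toy data, not results from the paper.

```python
def f1_by_threshold(genes: list, thresholds: list) -> dict:
    """Return {N: F1_N} following Eq (3).

    genes: (alien_index, is_supported_hgt) tuples for the genes analysed with AI > 0.
    """
    hgt_total = sum(1 for ai, is_hgt in genes if ai > 0 and is_hgt)
    scores = {}
    for n in thresholds:
        selected = [(ai, is_hgt) for ai, is_hgt in genes if ai > n]
        hgt_selected = sum(1 for _, is_hgt in selected if is_hgt)
        denominator = hgt_total + len(selected)
        scores[n] = 2 * hgt_selected / denominator if denominator else 0.0
    return scores


# Toy data only: (AI, phylogenetically supported HGT?)
toy_genes = [(3, True), (5, False), (8, False), (12, True), (20, True), (35, True)]
toy_scores = f1_by_threshold(toy_genes, thresholds=[0, 10, 30])
best_threshold = max(toy_scores, key=toy_scores.get)
```

In the toy data the middle threshold wins because it removes both false positives while losing only one true HGT, which is the same trade-off the F1 curve in Fig 4 is meant to expose.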

Fig 4 — Sensitivity, Precision, and F1 Score were calculated for Alien Index (AI) up to 40 for the proteins of the tardigrade *Hypsibius exemplaris*. The dashed line indicates the AI with the highest F1 score indicating the most accurate AI threshold.

AHS metric

To test the new metric, we performed an AvP analysis on the nematode Caenorhabditis elegans, excluding members of the phylum Nematoda from the metazoan matches. Hence, the analysis was configured to identify HGTs of non-metazoan origin in C. elegans that may also be present in other nematode species. We compared the list of potential HGTs reported by Crisp et al. [20] with the results obtained by AvP (see S1 File for full results of the comparison). By using an initial filter of AI > 0 or AHS > 0, we recovered 5 cases that would have been missed if they had been filtered on AI alone.

We thoroughly checked two of these cases where AI and AHS disagreed to identify the cause. In the first case, AI is 318, indicating a strong HGT candidate, while AHS is -44029, indicating the opposite. The two proteins that are identified as Bacteria, and thus as best non-metazoan hits, are most likely taxonomically misclassified, since they are almost identical to the nematode protein, are nested in a branch otherwise containing only nematode sequences, and are not found in any other bacterium (S1 Fig). An alternative, less likely hypothesis is that these proteins represent a recent transfer from nematodes to bacteria. In any case, this does not represent HGT from bacteria to nematodes, and the AHS metric is not misled by this likely erroneous taxonomic annotation.

In the second case, AI is 7, indicating a poorly supported HGT candidate, while AHS is 10356, indicating a strong HGT candidate (S2 Fig). The closest non-nematode metazoan hit is annotated as a rodent protein. Running a BLAST search for this protein against nr shows that it is very similar to a protein from a nematode of the genus Trichuris, some members of which are known to be rodent parasites. Thus, the rodent protein more likely represents contamination from a nematode and should have been excluded from the calculation of AI and AHS. Although for C. elegans the difference between AI and AHS appears small, performing the AI and AHS calculation on the Trichuris suis protein results in AI < −50 while AHS > 10000, further indicating that AHS is much less sensitive to taxonomic annotation errors than AI.

Consequently, it seems that this new AHS metric is able to correct errors due to contamination and taxonomic assignment bias. We thus implemented this new metric in AvP and recommend using it in combination with AI or another metric.

Future directions and availability

We propose AvP to facilitate the identification and evaluation of candidate HGTs in sequenced genomes across multiple branches of the tree of life. The most common methods used so far have been based on the difference in similarity between Donor and Ingroup sequences. We also propose and have implemented AHS, a new metric aiming to address contamination and erroneous taxonomic annotation. Performing phylogenetic reconstruction and alternative topology evaluation creates a framework under which more robust HGT analyses can be performed. AvP can help to rapidly populate a reliable dataset with phylogenetically supported HGT cases across the tree of life. Such a dataset could eventually be used in machine learning approaches that attempt to predict HGT events from the sequence features themselves. Furthermore, calculating the hgt_local_score can help identify contamination and HGT hot spots in the genome. In the future, we aim to incorporate a basic module of AvP into the Alienness webserver [19], to facilitate usage of AvP by biologists not familiar with command line software.

The AvP software is available at https://github.com/GDKO/AvP. It is released under GNU General Public License v3.0.

Supporting information

S1 File. Comparing AvP results on C. elegans to Crisp et al., 2015.

(PDF)


S1 Fig. Phylogenetic tree for protein F40E10.3.

Nematoda proteins are excluded from the analysis (dark orange). Bacteria proteins are coloured green while Metazoan proteins are coloured orange. The two bacterial proteins returning the best non-metazoan hits belong to Escherichia coli and Nitriliruptoraceae bacterium and are almost identical to the protein from C. elegans, indicating that they are misclassified.

(PDF)


S2 Fig. Phylogenetic tree for protein F44B9.9.

Nematoda proteins are excluded from the analysis (dark orange). Fungal proteins are coloured light green, Metazoan proteins are coloured orange, Viridiplantae proteins are coloured teal, and other non-metazoan eukaryotic proteins are coloured blue. The best Metazoan hit (excluding nematode proteins) marked with the arrow most likely belongs to a nematode from Trichuris genus.

(PDF)


Acknowledgments

We are grateful to the genotoul bioinformatics platform Toulouse Occitanie (Bioinfo Genotoul, doi: 10.15454/1.5572369328961167E12) for providing computing resources. We are also grateful to the OPAL infrastructure from Université Côte d’Azur and to the Université Côte d’Azur’s Center for High-Performance Computing for providing resources and support.

Data Availability

All relevant data are within the manuscript.

Funding Statement

This work has been supported by the French government, through the UCA-JEDI “Investments in the Future” project managed by the National Research Agency (ANR) with the reference number ANR-15-IDEX-01. GDK has received the support of the EU in the framework of the Marie-Curie FP7 COFUND People Programme, through the award of an AgreenSkills+ fellowship (under grant number 609398).

References

1. Danchin EGJ. Lateral gene transfer in eukaryotes: tip of the iceberg or of the ice cube? BMC Biology. 2016;14(1):101. doi: 10.1186/s12915-016-0330-x
2. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1):421. doi: 10.1186/1471-2105-10-421
3. Gladyshev EA, Meselson M, Arkhipova IR. Massive Horizontal Gene Transfer in Bdelloid Rotifers. Science. 2008;320(5880):1210–1213. doi: 10.1126/science.1156407
4. Boschetti C, Carr A, Crisp A, Eyres I, Wang-Koh Y, Lubzens E, et al. Biochemical Diversification through Foreign Gene Expression in Bdelloid Rotifers. PLOS Genetics. 2012;8(11):1–13. doi: 10.1371/journal.pgen.1003035
5. Li Y, Liu Z, Liu C, Shi Z, Pang L, Chen C, et al. HGT is widespread in insects and contributes to male courtship in lepidopterans. Cell. 2022;185(16):2975–2987. doi: 10.1016/j.cell.2022.06.014
6. Koutsovoulos G, Kumar S, Laetsch DR, Stevens L, Daub J, Conlon C, et al. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences. 2016;113(18):5053–5058. doi: 10.1073/pnas.1600338113
7. Jacox E, Chauve C, Szöllősi GJ, Ponty Y, Scornavacca C. ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics. 2016;32(13):2056–2058. doi: 10.1093/bioinformatics/btw105
8. Darby CA, Stolzer M, Ropp PJ, Barker D, Durand D. Xenolog classification. Bioinformatics. 2016;33(5):640–649.
9. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12(1):59–60. doi: 10.1038/nmeth.3176
10. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019;20(1):238. doi: 10.1186/s13059-019-1832-y
11. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution. 2013;30(4):772–780. doi: 10.1093/molbev/mst010
12. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–1973. doi: 10.1093/bioinformatics/btp348
13. Price MN, Dehal PS, Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE. 2010;5(3):1–10. doi: 10.1371/journal.pone.0009490
14. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution. 2020;37(5):1530–1534. doi: 10.1093/molbev/msaa015
15. Zhou X, Shen XX, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Molecular Biology and Evolution. 2018;35(2):486–503. doi: 10.1093/molbev/msx302
16. Rambaut A. FigTree v1.4.4. Available from: http://tree.bio.ed.ac.uk/software/figtree/.
17. Yoshida Y, Koutsovoulos G, Laetsch DR, Stevens L, Kumar S, Horikawa DD, et al. Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLOS Biology. 2017;15(7):1–40. doi: 10.1371/journal.pbio.2002266
18. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–1313. doi: 10.1093/bioinformatics/btu033
19. Rancurel C, Legrand L, Danchin EGJ. Alienness: Rapid Detection of Candidate Horizontal Gene Transfers across the Tree of Life. Genes. 2017;8(10). doi: 10.3390/genes8100248
20. Crisp A, Boschetti C, Perry M, Tunnacliffe A, Micklem G. Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome Biology. 2015;16(1):1–50. doi: 10.1186/s13059-015-0607-3

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010686.r001

Decision Letter 0

William Stafford Noble, Mark Ziemann

25 Aug 2022

Dear Dr Koutsovoulos,

Thank you very much for submitting your manuscript "AvP: a software package for automatic phylogenetic detection of candidate horizontal gene transfers." for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mark Ziemann

Academic Editor

PLOS Computational Biology

William Noble

Section Editor

PLOS Computational Biology

***********************

Dear Dr Koutsovoulos,

The manuscript has been reviewed by two experts in the field and overall impressions of the software and manuscript were positive, although both reviewers requested greater clarity in the writing and explanation of certain points. Reviewer 2 point 2 has requested the software to detect non-protein coding genes, which is an excellent suggestion and of great value to the field; however, I am mindful that this may require extensive retooling of the software. I would like to flag this software feature as optional for the authors to address; however, it must be mentioned as a limitation/future direction in the discussion of the manuscript, along with any challenges that may be apparent when detecting HGT of non-coding genes. I look forward to receiving a revised version of the manuscript.

Regards,

Mark Ziemann, PhD

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Koutsovoulos et al. present AvP, a pipeline to automatically identify horizontally transferred genes, along with a set of metrics that allow users to evaluate these predictions. By using phylogenetics at its core, AvP represents a substantial improvement compared to existing best BLAST hit-based approaches. Moreover, the metrics they develop help to address many of the issues with HGT detection (e.g. contamination in assemblies and databases). I was able to install the software from GitHub and the documentation is clear. I have no doubt that AvP will be well used and well cited by researchers interested in HGT. I struggled to find anything to complain about and, as such, I would be happy to recommend the manuscript once a few of my very minor comments (below) have been addressed/considered.

1. Lines 32-38: It took me a few readings of this paragraph to work out that the authors were suggesting that the existing methods that reconcile gene/species trees were not ideal because of the requirement of a species tree - could it be reworked slightly to make the message clearer? Even adding “However” before “To achieve this” would help. Note there are a few typos in here too: hundred > hundreds, compare > comparing, wth > with.

2. Lines 57-59: “AI features file can be automatically generated with the Alienness webserver” - I note that this is scored out in the GitHub instructions; is this no longer true?

3. Lines 91-92: “is all not in Ingroup” > “is all species not in Ingroup”

4. Lines 153-154: note that the sequenced Sciento strain of tardigrades was redescribed as H. exemplaris. I’ll let the authors decide if they'd rather use H. dujardini for consistency with previous publications or H. exemplaris for taxonomic correctness.

5. Lines 185-204: The authors only discuss the two problematic cases in their run on C. elegans, which is valuable, but what about all the other predicted HGT loci? How many were there? Were any of interest or consistent with previous work? I understand that analysing this in-depth is beyond the remit of this manuscript, but it seems odd to not mention anything other than where this had problems. A few examples of what can be discovered with the pipeline would be a valuable addition.

6. Lines 201-203: “Consequently, it seems that this new AHS metric is able to correct errors due to contamination and taxonomic assignment bias.” Perhaps I’m confused, but doesn’t the AHS value, which suggests that the rodent protein is a strong HGT candidate (10356), suggest the opposite? Surely AHS is positively misleading in this case, and even more so than AI? Or are the authors suggesting that it’s the difference between AHS and AI that is indicative of a false positive?

7. GitHub: it would be great if the authors could add some worked examples (e.g. the tardigrade and/or C. elegans examples in the paper) to show users how to set this up.

Reviewer #2: I appreciate that Koutsovoulos et al. put multiple steps that are used to detect HGT into one automatic tool called “AvP”. The study is written well and the tool is friendly to users.

I have four main concerns that the authors might wish to address.

1. AvP incorporates the AI values and phylogenetic robustness, but I am not sure how different AI cutoffs will influence the outputs. Of course, there is no standard criterion. Given the complexity of the database, it can sometimes contain contamination, so that the query protein is highly similar to a hit from a contaminant. A recent paper (https://pubmed.ncbi.nlm.nih.gov/35853453/) considered another parameter, “outg_pct”, which is the percentage of species from the OUTGROUP lineage in the list of the top 1,000 hits that have different taxonomic species names. Did you consider this too? If not, please explain why or discuss it.

2. AvP aims to detect HGT based on protein-coding genes (or protein sequences), which has been done in many previous studies (although they built their own pipelines). Could you please extend your tool to non-coding sequences? If you can do so, that would be very helpful, because almost no published studies have done this, and I think non-coding sequences could also be horizontally transferred.

3. It seems that graph neural networks (or machine learning) have been used to predict HGT. If possible, please compare your tool with such approaches in terms of the accuracy of HGT identification. If this is hard, it would be good to discuss this aspect in your study as a future direction.

4. It would be better if the AvP tool could also report the characteristics of HGT candidates, including codon usage bias, dN/dS, and the sequence similarity between donor and recipient.
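To make the “outg_pct” parameter from point 1 concrete, the following is a minimal sketch of one possible reading of the definition quoted there: take the top 1,000 hits, collapse them to distinct species names, and report the percentage of those species that belong to the OUTGROUP lineage. The function name, the `(species_name, is_outgroup)` input layout and the decision to truncate before collapsing species are assumptions made for illustration; this is neither the implementation from the cited paper nor part of AvP.

```python
from typing import Iterable, Tuple


def outg_pct(hits: Iterable[Tuple[str, bool]], top_n: int = 1000) -> float:
    """Percentage of distinct species among the top hits that are outgroup.

    `hits` is assumed to be ordered best hit first; each item is
    (species_name, is_outgroup). Both this layout and the choice to
    truncate to `top_n` hits before collapsing species are assumptions.
    """
    species_is_outgroup = {}
    for rank, (species, is_outgroup) in enumerate(hits):
        if rank >= top_n:
            break
        # Keep the label of the best-ranked hit for each species name.
        species_is_outgroup.setdefault(species, is_outgroup)
    if not species_is_outgroup:
        return 0.0  # no hits: report 0 rather than dividing by zero
    n_outgroup = sum(species_is_outgroup.values())
    return 100.0 * n_outgroup / len(species_is_outgroup)


# Hypothetical hit list for illustration only.
example_hits = [
    ("Escherichia coli", True),
    ("Escherichia coli", True),  # same species, counted once
    ("Caenorhabditis briggsae", False),
    ("Bacillus subtilis", True),
    ("Homo sapiens", False),
]
print(outg_pct(example_hits))
```

Under this reading, a single contaminant sequence in the database contributes at most one outgroup species, so a candidate supported by only one outgroup hit among many ingroup species would receive a low value.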

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2022 Nov 9;18(11):e1010686. doi: 10.1371/journal.pcbi.1010686.r002

Author response to Decision Letter 0

7 Oct 2022

Attachment

Submitted filename: response.rtf (17KB, rtf)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010686.r003

Decision Letter 1

William Stafford Noble, Mark Ziemann

17 Oct 2022

Dear Dr Koutsovoulos,

Thank you for the revised manuscript which I see has addressed most of the reviewer comments.

For your general information, if you agree with the reviewers' comments, it is good form to address them in the manuscript, as it is likely that readers will have similar concerns. In the rebuttal letter, please indicate which lines of the manuscript have been added or amended. This is relevant to points R1P2, R2P1 and R2P2. If you could amend the manuscript and rebuttal letter to this effect, it can be sent to Reviewer 2 again.

Regards,

Mark Ziemann, PhD

---------------

When you are ready to resubmit, please upload the following:

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Sincerely,

Mark Ziemann

Academic Editor

PLOS Computational Biology

William Noble

Section Editor

PLOS Computational Biology


PLoS Comput Biol. 2022 Nov 9;18(11):e1010686. doi: 10.1371/journal.pcbi.1010686.r004

Author response to Decision Letter 1

21 Oct 2022

Attachment

Submitted filename: response.rtf (16.7KB, rtf)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010686.r005

Decision Letter 2

William Stafford Noble, Mark Ziemann

26 Oct 2022

Dear Dr Koutsovoulos,

We are pleased to inform you that your manuscript 'AvP: a software package for automatic phylogenetic detection of candidate horizontal gene transfers.' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Mark Ziemann

Academic Editor

PLOS Computational Biology

William Noble

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: Thank you very much for including my suggestions! I don’t have any concerns and think it’s ready for publication.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Xing-Xing Shen

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010686.r006

Acceptance letter

William Stafford Noble, Mark Ziemann

4 Nov 2022

PCOMPBIOL-D-22-01025R2

AvP: a software package for automatic phylogenetic detection of candidate horizontal gene transfers.

Dear Dr Koutsovoulos,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Associated Data


Supplementary Materials

S1 File. Comparing AvP results on C. elegans to Crisp et al., 2015. (PDF, 75.2KB)

S1 Fig. Phylogenetic tree for protein F40E10.3. (PDF, 2.3MB)

S2 Fig. Phylogenetic tree for protein F44B9.9. (PDF, 2.8MB)


Data Availability Statement

All relevant data are within the manuscript.

1. Danchin EGJ. Lateral gene transfer in eukaryotes: tip of the iceberg or of the ice cube? BMC Biology. 2016;14(1):101. doi: 10.1186/s12915-016-0330-x

2. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1):421. doi: 10.1186/1471-2105-10-421

3. Gladyshev EA, Meselson M, Arkhipova IR. Massive Horizontal Gene Transfer in Bdelloid Rotifers. Science. 2008;320(5880):1210–1213. doi: 10.1126/science.1156407

4. Boschetti C, Carr A, Crisp A, Eyres I, Wang-Koh Y, Lubzens E, et al. Biochemical Diversification through Foreign Gene Expression in Bdelloid Rotifers. PLOS Genetics. 2012;8(11):1–13. doi: 10.1371/journal.pgen.1003035

5. Li Y, Liu Z, Liu C, Shi Z, Pang L, Chen C, et al. HGT is widespread in insects and contributes to male courtship in lepidopterans. Cell. 2022;185(16):2975–2987. doi: 10.1016/j.cell.2022.06.014

6. Koutsovoulos G, Kumar S, Laetsch DR, Stevens L, Daub J, Conlon C, et al. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences. 2016;113(18):5053–5058. doi: 10.1073/pnas.1600338113

7. Jacox E, Chauve C, Szöllősi GJ, Ponty Y, Scornavacca C. ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics. 2016;32(13):2056–2058. doi: 10.1093/bioinformatics/btw105

8. Darby CA, Stolzer M, Ropp PJ, Barker D, Durand D. Xenolog classification. Bioinformatics. 2016;33(5):640–649.

9. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12(1):59–60. doi: 10.1038/nmeth.3176

10. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019;20(1):238. doi: 10.1186/s13059-019-1832-y

11. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution. 2013;30(4):772–780. doi: 10.1093/molbev/mst010

12. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–1973. doi: 10.1093/bioinformatics/btp348

13. Price MN, Dehal PS, Arkin AP. FastTree 2—Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE. 2010;5(3):1–10. doi: 10.1371/journal.pone.0009490

14. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution. 2020;37(5):1530–1534. doi: 10.1093/molbev/msaa015

15. Zhou X, Shen XX, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Molecular Biology and Evolution. 2018;35(2):486–503. doi: 10.1093/molbev/msx302

16. Rambaut A. FigTree v1.4.4. Available from: http://tree.bio.ed.ac.uk/software/figtree/.

17. Yoshida Y, Koutsovoulos G, Laetsch DR, Stevens L, Kumar S, Horikawa DD, et al. Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLOS Biology. 2017;15(7):1–40. doi: 10.1371/journal.pbio.2002266

18. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–1313. doi: 10.1093/bioinformatics/btu033

19. Rancurel C, Legrand L, Danchin EGJ. Alienness: Rapid Detection of Candidate Horizontal Gene Transfers across the Tree of Life. Genes. 2017;8(10). doi: 10.3390/genes8100248

20. Crisp A, Boschetti C, Perry M, Tunnacliffe A, Micklem G. Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome Biology. 2015;16(1):1–50. doi: 10.1186/s13059-015-0607-3
