Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv logoLink to bioRxiv
[Preprint]. 2024 Nov 22:2023.12.20.572577. [Version 3] doi: 10.1101/2023.12.20.572577

Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling

Marzieh Khodaei, Scott V Edwards, Peter Beerli
PMCID: PMC11601389  PMID: 39605625

A bstract

Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package T opic C ontml . The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation or LDA) to extract ‘topic’ frequencies from k -mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program C ontml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of T opic C ontml on simulated datasets with gaps and three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous P ac B io sequences from 12 bird species. Our empirical results and simulated data suggest that our method is efficient and statistically robust. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure.

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints

RESOURCES