
This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

medRxiv [Preprint]. 2024 Apr 19:2024.04.18.24306052. [Version 1] doi: 10.1101/2024.04.18.24306052

Early Detection of Novel SARS-CoV-2 Variants from Urban and Rural Wastewater through Genome Sequencing and Machine Learning

Xiaowei Zhuang, Van Vo, Michael A Moshi, Ketan Dhede, Nabih Ghani, Shahraiz Akbar, Ching-Lan Chang, Angelia K Young, Erin Buttery, William Bendik, Hong Zhang, Salman Afzal, Duane Moser, Dietmar Cordes, Cassius Lockett, Daniel Gerrity, Horng-Yuan Kan, Edwin C Oh
PMCID: PMC11065002  PMID: 38699326

Abstract

Genome sequencing from wastewater has emerged as an accurate and cost-effective tool for identifying SARS-CoV-2 variants. However, existing methods for analyzing wastewater sequencing data are not designed to detect novel variants that have not been characterized in humans. Here, we present an unsupervised learning approach that clusters co-varying and time-evolving mutation patterns to identify SARS-CoV-2 variants. To build our model, we sequenced 3,659 wastewater samples collected over more than two years from urban and rural locations in Southern Nevada. We then developed a multivariate independent component analysis (ICA)-based pipeline that transforms mutation frequencies into independent sources with co-varying and time-evolving patterns, and we compared its variant predictions to >5,000 SARS-CoV-2 clinical genomes isolated from Nevadans. Using the source patterns as data-driven reference “barcodes”, we demonstrated the model’s accuracy by successfully detecting the Delta variant in late 2021, Omicron variants in 2022, and emerging recombinant XBB variants in 2023. Our approach revealed the spatial and temporal dynamics of variants in both urban and rural regions; achieved earlier detection of most variants compared to other computational tools; and uncovered unique co-varying mutation patterns not associated with any known variant. The multivariate nature of our pipeline boosts statistical power and can support accurate and early detection of SARS-CoV-2 variants. This feature offers a unique opportunity for novel variant and pathogen detection, even in the absence of clinical testing.
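The idea of grouping mutations whose frequencies rise and fall together can be illustrated with a small sketch. This is not the authors' ICA pipeline: it swaps in a much simpler technique, pairwise Pearson correlation followed by connected-component grouping, and it runs on made-up data. The mutation labels (`mut_A1`, ...), the frequency values, the `threshold` parameter, and the function names are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def group_covarying(freqs, threshold=0.9):
    """Group mutations whose frequency time series correlate at or above threshold.

    freqs maps a mutation label to its frequency at successive sampling dates.
    Mutations are linked when their correlation meets the threshold, and the
    connected components of that graph are returned as sorted lists.
    """
    names = list(freqs)
    parent = {m: m for m in names}

    def find(m):
        # Union-find lookup with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(freqs[a], freqs[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for m in names:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())


# Synthetic, hypothetical mutation frequencies over six sampling dates.
example_freqs = {
    "mut_A1": [0.00, 0.10, 0.40, 0.80, 0.90, 0.50],
    "mut_A2": [0.00, 0.12, 0.38, 0.79, 0.92, 0.48],
    "mut_B1": [0.90, 0.80, 0.50, 0.10, 0.00, 0.00],
    "mut_B2": [0.88, 0.82, 0.47, 0.12, 0.02, 0.00],
    "mut_C1": [0.00, 0.00, 0.00, 0.00, 0.30, 0.60],
}

if __name__ == "__main__":
    for group in group_covarying(example_freqs):
        print(group)
```

Each printed group plays a role loosely analogous to one data-driven "barcode": a set of mutations that co-vary over time without reference to any catalog of known variants. A multivariate method such as ICA goes further than this pairwise sketch, because it separates overlapping sources from all mutations jointly rather than thresholding correlations one pair at a time.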

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
