This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv [Preprint]. 2025 Dec 7:2025.12.03.692191. [Version 1] doi: 10.64898/2025.12.03.692191

MosaicSim: A Novel Mosaic Variant Simulator Reveals Diminishing Returns of Ultra-High Coverage for Mosaic Variant Detection

Erik Stricker, Farhang Jaryani, Michal Izydorczyk, Chi-Lam Poon, Philippe Sanio, Adam Alexander, Sontosh K Deb, Fritz Sedlazeck, Jeffrey Rogers, Elizabeth G Atkinson
PMCID: PMC12822621  PMID: 41573921

Abstract

Genetic mutations present in only a subset of cells within a tissue, termed mosaic variants (MVs), are increasingly recognized for their role in human disease. This growing interest underscores the need for specialized tools to detect and analyze MVs. However, such detection methods still lack thorough evaluation, largely because benchmarking datasets that are large, reliable, and reflective of the complexity of biological samples are missing. To address this gap, we developed MosaicSim, a tool for simulating variants in realistic sequencing data. At the tool’s core is the TweakVar workflow, a unique simulation pipeline that layers simulated MVs onto empirical whole genome sequencing data, generating a large, realistic ground truth dataset that combines the strengths of both simulation and biological data. To demonstrate the functionality of the workflow, we used TweakVar to simulate 1,000 mosaic single nucleotide polymorphisms within whole genome sequencing files of different coverages. MVs were called with Illumina’s DRAGEN and compared to the ground truth. Our results show that coverages from 150× to 445× performed comparably, with a true-positive rate between 50.4% (300×) and 54.9% (150×) and no false positives detected. Across all samples, increasing variant allele frequency had a significant positive effect on call success. Additionally, we observed that call rates for variants in lower-complexity regions improved with increasing read depth. We did not find significant effects attributable to specific mutation patterns or mean read mapping quality. MosaicSim fills a critical unmet need by providing representative, customizable ground truth datasets for MV benchmarking, enabling systematic evaluation and optimization of variant calling methods.
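
The comparison of called variants against a simulated ground truth, as described in the abstract, can be illustrated with a minimal Python sketch. This is not MosaicSim's or DRAGEN's interface: the `Variant` type, the `benchmark_calls` function, and the toy variants are all hypothetical, and the sketch assumes exact matching on chromosome, position, reference allele, and alternate allele.

```python
from __future__ import annotations

from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant key: exact match on all four fields."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark_calls(truth: set[Variant], called: set[Variant]) -> dict[str, float]:
    """Compare a call set against a simulated ground truth set."""
    true_positives = truth & called    # simulated variants that were called
    false_positives = called - truth   # calls not present in the ground truth
    false_negatives = truth - called   # simulated variants that were missed
    tpr = len(true_positives) / len(truth) if truth else 0.0
    return {
        "true_positives": len(true_positives),
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
        "true_positive_rate": tpr,
    }


# Toy data for illustration only (not taken from the preprint).
truth = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2000, "C", "T"),
    Variant("chr2", 3000, "G", "A"),
    Variant("chr2", 4000, "T", "C"),
}
called = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 3000, "G", "A"),
}

summary = benchmark_calls(truth, called)
print(summary)
```

Because the simulated variants are known exactly, any call absent from the truth set counts as a false positive under this scheme. A real evaluation would also need to decide how to treat variants already present in the empirical data before simulation, which this sketch does not address.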

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
