Figure - PMC

[Preprint]. 2022 Jun 27:2022.06.24.497555. [Version 1] doi: 10.1101/2022.06.24.497555

PMC9258296.1; 2022 Jun 27
PMC9258296.2; 2022 Jul 22
PMC9258296.3; 2023 Mar 13
PMC9258296.4; 2023 Jul 31

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.

A. Overview of NOMAD (green) vs. existing methods (red). Typical workflows (red) remove reads during fastq preprocessing and alignment, and only then perform statistical significance testing; each desired inferential task requires a different inference pipeline. NOMAD performs significance testing directly on raw fastq reads, bypassing alignment and enabling data-scientifically driven inference, with optional ontology mapping for interpretation. If this optional mapping is desired, typically 1000-fold fewer reads than in the initial fastq files must be aligned.

B. Overview of NOMAD statistics: raw fastq files are parsed into k-mer anchors (red) and targets (blue and yellow), separated by a lookahead sequence of length L. For each anchor, statistical inference is performed on a contingency table of targets by samples. The reads depicted show sample-dependent sequence diversification due to alternative splicing. For each significant anchor, a per-sample consensus sequence is built, which in the case of alternative splicing can be interpreted as the dominant isoform.
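The parsing step in panel B can be illustrated with a minimal Python sketch. It is not NOMAD's implementation: the function names, the toy reads, the choice to slide the anchor over every read position, and the values of `k` and `lookahead` are all assumptions made for illustration. The sketch only builds the targets-by-samples counts for each anchor; the statistical test applied to that table is not shown.

```python
from collections import Counter, defaultdict

def extract_pairs(read, k, lookahead):
    """Yield (anchor, target) k-mer pairs separated by `lookahead` bases.

    Assumption: the anchor is slid over every position of the read.
    """
    span = 2 * k + lookahead
    for i in range(len(read) - span + 1):
        anchor = read[i:i + k]
        target = read[i + k + lookahead:i + span]
        yield anchor, target

def contingency_tables(samples, k, lookahead):
    """Map each anchor to {target: Counter(sample -> count)}."""
    tables = defaultdict(lambda: defaultdict(Counter))
    for sample, reads in samples.items():
        for read in reads:
            for anchor, target in extract_pairs(read, k, lookahead):
                tables[anchor][target][sample] += 1
    return tables

# Hypothetical toy input: two samples whose reads diverge after a shared anchor.
samples = {
    "s1": ["ACGTTTGCA", "ACGTTTGCA"],
    "s2": ["ACGTTTGGA"],
}
tables = contingency_tables(samples, k=3, lookahead=2)

# Rows are targets, columns are samples, for the anchor "ACG".
for target, counts in tables["ACG"].items():
    print(target, dict(counts))
```

For the anchor `ACG`, the target `TGC` is seen only in `s1` and the target `TGG` only in `s2`, which is the kind of sample-dependent target distribution the contingency table is meant to expose.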

C. Consensus building denoises inputs to aligners before the alignment process. Sequencing errors (red X’s) are randomly distributed across reads and are corrected by plurality vote across the reads from a given sample as the consensus is built. Without this step, aligners will (a) fail to align, (b) yield misaligned reads, or (c) align reads correctly but with sequencing errors. Even when alignments are correct, the resulting mismatches with the reference must be post-processed further before inference can discriminate sequencing errors from SNPs.
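The plurality vote in panel C can be sketched as follows. This is a simplified illustration rather than NOMAD's consensus procedure: it assumes that all reads from a sample start at the same offset (for example, immediately after a shared anchor), and the function name and toy reads are hypothetical.

```python
from collections import Counter

def plurality_consensus(sequences):
    """Per-position plurality vote over sequences that share a start offset."""
    consensus = []
    longest = max(map(len, sequences), default=0)
    for i in range(longest):
        # Only sequences long enough to cover position i get a vote.
        votes = Counter(seq[i] for seq in sequences if i < len(seq))
        consensus.append(votes.most_common(1)[0][0])
    return "".join(consensus)

# Hypothetical reads from one sample; two carry isolated sequencing errors.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAG", "ACGT"]
print(plurality_consensus(reads))  # prints ACGTAC
```

Because the errors fall at different positions in different reads, each is outvoted at its position and the consensus recovers the error-free sequence.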

D. Left: NOMAD takes in fastq data, extracts (anchor, target) pairs of k-mers, which are sorted and counted, and performs statistical inference. Right: After compressing and denoising via per-sample consensus sequences, NOMAD reduces the number of alignments required by a factor of 10³.
