Data types |
Short reads |
Accurate DNA sequences, typically 150 bp in length and generated on Illumina platforms [14]. Reads may be paired or unpaired depending on whether both ends of each DNA fragment were sequenced |
Linked reads |
Short reads, but with molecular barcodes that tag reads from the same DNA fragment, creating “read clouds” that leverage long-range information [15]. Sometimes referred to as 10x Linked reads after 10x Genomics, a company that provided this type of sequencing prior to 2020 |
Long reads |
Sequences produced directly from long fragments of DNA, thus providing long-range information in the form of intact reads. Typically generated using PacBio or Nanopore platforms, long reads have historically been more error-prone than short reads, though read accuracy continues to improve (e.g. PacBio HiFi reads; [14]) |
Chromosome conformation capture |
A method used to map the spatial organization of chromatin across genomes [16]. A suite of techniques can be used to cross-link loci and sequence DNA fragments as paired-end short reads linked by unknown proximity. The higher-order structure of sequences (e.g. chromosomes) can be inferred because interactions between loci become more frequent the closer the loci lie on the linear genome. Data are generated on the same kinds of platforms as short reads, e.g. Illumina |
Optical genome mapping |
A restriction enzyme is applied to highly intact DNA and the lengths and order of the resulting fragments are measured. This information is used to guide the order and orientation of assembly fragments by matching patterns in the occurrence of sequence motifs [17, 18]. Note that the data represent mapping information, i.e. the physical locations of sequence motifs, not sequence data. Bionano is currently the main provider of optical genome mapping services |
Reference genome quality |
K-mer |
Substrings of length k within DNA sequence data |
Coverage |
The number of times, on average, a genomic region or complete genome has been sequenced. Often synonymous with depth, i.e. the number of unique reads in a dataset that overlap a given region |
Contig |
A DNA sequence assembled by overlapping k-mers or reads |
Scaffold |
Contigs ordered and oriented into longer sequences, typically with gaps represented as Ns in between contigs [19] |
Contiguity |
The degree to which a reference genome is assembled into continuous sequences; a genome fragmented into a larger number of smaller sequences is less contiguous |
Quantitative parameters |
N50 |
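The minimum sequence length at or above which 50% of the reference genome is represented, i.e. the length of the shortest sequence in the smallest set of longest sequences that together cover half of the assembly. A proxy for contiguity |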
L50 |
The minimum number of sequences, counted from the longest downwards, that together represent 50% of the reference genome. A proxy for contiguity |
Completeness |
The proportion of the genome captured in a reference assembly. This is typically benchmarked using the proportion of observed vs expected single-copy orthologues appearing in an assembly (i.e. BUSCO scores; [10]) |
Qualitative parameters |
Accuracy |
A general term for how closely an assembly matches a hypothetical complete and error-free assembly |
Precision |
A general term for how replicable an assembly is when similar or alternative methods are used |
Certainty/uncertainty |
A general term for the level of confidence surrounding a genomic sequence or assembly |
Error/mis-join/mis-assembly |
An incorrect inference regarding the order and/or orientation of a particular genomic sequence |
Discrepancy |
An inconsistency between two reference genomes that could be due to an error or to inter- or intraspecific variation |
Discrepancies |
Debris |
Segments of DNA, typically contigs, not assimilated into higher order scaffolding of chromosome sequences |
Gaps |
Runs of Ns, typically 10-100, that appear between contigs within scaffolds, representing uncertainty between the adjoining sequences |
Translocation |
A unique DNA segment appearing on different chromosomes between two assemblies |
Inversion |
A unique DNA segment running in opposite directions between two assemblies |
Relocation |
Unique DNA segments appearing in a different order between two assemblies |
General terms |
Restriction enzyme |
A protein that cleaves DNA at sites with a particular sequence, or restriction site |
Orthologous |
A DNA segment or gene appearing in separate species and inherited from a common ancestor, typically retaining similar function |
Repetitive element |
Patterns of DNA sequences that occur as multiple copies throughout a genome |
Transposable element |
DNA sequences, typically genes, that can move location within a genome |
Reference genome |
A representation or estimation of the entire genomic sequence of a species or individual |
End-user |
Someone seeking to leverage a previously generated reference genome for applied purposes. For example, an end-user might use a reference genome to map sequences and call variant positions in a set of samples |
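
Several of the definitions above (k-mer, N50, L50 and gaps) describe simple computations, so a minimal Python sketch may help make them concrete. It is an illustration only, not the method of any particular assembler or quality-assessment tool. The function names `kmers`, `n50_l50` and `gap_runs` are hypothetical, and the sketch assumes that N50 and L50 are computed against the total assembled length rather than an independent estimate of genome size.

```python
import re


def kmers(seq, k):
    """Return every substring of length k in seq, in left-to-right order."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    Assumption: 50% is measured against the summed length of the
    sequences supplied, not an external genome size estimate.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest sequence first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the shortest sequence needed to reach 50% (N50);
            # 'count' is how many sequences were needed (L50).
            return length, count


def gap_runs(scaffold):
    """Return (start, length) for each run of Ns in a scaffold sequence."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"N+", scaffold.upper())]
```

For example, for sequence lengths of 80, 70, 50, 40, 30 and 20 (total 290), the two longest sequences sum to 150, which passes the halfway point of 145, so `n50_l50` returns an N50 of 70 and an L50 of 2. A more contiguous assembly of the same genome would show a higher N50 and a lower L50.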