2023 Nov 20;24:693. doi: 10.1186/s12864-023-09779-3

Table 1.

Definitions for terms used in the current study

Term Definition
Data types
 Short reads Accurate sequences of DNA, typically 150 bp in length and usually generated on Illumina platforms [14]. Sequences may be paired or unpaired depending on whether both ends of DNA fragments were sequenced
 Linked reads Short reads, but with molecular barcodes that tag reads from the same DNA fragment, creating “read clouds” that leverage long-range information [15]. Sometimes referred to as 10x Linked reads after 10x Genomics, a company that provided this type of sequencing prior to 2020
 Long reads Sequences produced directly from long fragments of DNA, thus providing long-range information in the form of intact reads. Typically generated using PacBio or Nanopore platforms, long reads have historically been more error-prone than short reads, though read accuracy continues to improve (e.g. PacBio HiFi reads; [14])
 Chromosome conformation capture A method used to map the spatial organization of chromatin across genomes [16]. A suite of techniques can be used to cross-link loci and sequence DNA fragments as paired-end short reads linked by unknown proximity. The higher-order structure of sequences (e.g. chromosomes) can be inferred because interactions between loci become more frequent the closer the loci are along the linear genome. Data are generated on platforms similar to those used for short reads, e.g. Illumina
 Optical genome mapping A restriction enzyme is applied to highly intact DNA and the lengths and order of fragments are measured. This information is used to guide the order and orientation of assembly fragments by matching patterns in the occurrence of sequence motifs [17, 18]. Note, the data represents mapping information or physical locations of sequence motifs, not sequence data. Bionano is currently the main provider for optical genome mapping services
Reference genome quality
 K-mer Substrings of length k within DNA sequence data
 Coverage The number of times, on average, a genomic region or complete genome has been sequenced. Oftentimes synonymous with the depth, or number, of uniquely overlapping reads in a dataset
 Contig A DNA sequence assembled by overlapping k-mers or reads
 Scaffold Contigs ordered and oriented into longer sequences, typically with gaps represented as Ns in between contigs [19]
 Contiguity The level to which a reference genome is assembled into continuous sequences representing DNA; a genome fragmented into a larger number of smaller sequences is less contiguous
Quantitative parameters
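 N50 The sequence length such that sequences of this length or longer together contain at least 50% of the reference genome, i.e. the length of the shortest sequence in the smallest set of longest sequences covering half the assembly. A proxy for contiguity
 L50 The smallest number of sequences, taken from longest to shortest, that together contain at least 50% of the reference genome. A proxy for contiguity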
 Completeness The proportion of the genomic sequences captured in a reference assembly. This is typically benchmarked using the proportion of observed vs expected single copy orthologues appearing in an assembly (i.e. BUSCO scores; [10])
Qualitative parameters
 Accuracy A general term to scale the match between an assembly and a hypothetical complete and error-free assembly
 Precision A general term to scale the replicability of the assembly using similar or alternative methods
 Certainty/uncertainty A general term to scale the confidence surrounding a genomic sequence or assembly
 Error/mis-join/mis-assembly An incorrect inference regarding the order and/or orientation of a particular genomic sequence
 Discrepancy An inconsistency between two reference genomes which could be due to an error or inter- or intraspecific variation
Discrepancies
 Debris Segments of DNA, typically contigs, not assimilated into higher order scaffolding of chromosome sequences
 Gaps Runs of Ns, typically 10-100 bases long, that appear between contigs within scaffolds, representing uncertainty between the adjoining sequences
 Translocation A unique DNA segment appearing on different chromosomes between two assemblies
 Inversion A unique DNA segment running in opposite directions between two assemblies
 Relocation Unique DNA segments appearing in a different order between two assemblies
General terms
 Restriction enzyme A protein that cleaves DNA at sites with a particular sequence, or restriction site
 Orthologous A DNA segment or gene appearing in separate species and inherited from a common ancestor, typically retaining similar function
 Repetitive element Patterns of DNA sequences that occur as multiple copies throughout a genome
 Transposable element DNA sequences, typically genes, that can move location within a genome
 Reference genome A representation or estimation of the entire genomic sequence of a species or individual
 End-user Someone seeking to leverage a previously generated reference genome for applied purposes. For example, an end-user might use a reference genome to map sequences and call variant positions in a set of samples
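
Several of the terms defined in Table 1 (K-mer, Gaps, N50 and L50) are computational and can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not a method taken from the study: the function names `n50_l50`, `count_kmers` and `split_scaffold`, and the `min_gap` parameter, are hypothetical, and the default gap threshold of 10 Ns simply mirrors the lower end of the "typically 10-100" range given for Gaps.

```python
import re
from collections import Counter


def n50_l50(lengths, fraction=0.5):
    """Return (N50, L50) for a collection of sequence lengths.

    Sequences are taken from longest to shortest until their summed
    length reaches `fraction` of the total. N50 is the length of the
    last sequence added; L50 is how many sequences were needed.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must contain at least one sequence")


def count_kmers(sequence, k):
    """Count every substring of length k (k-mer) in a DNA sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def split_scaffold(scaffold, min_gap=10):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    The threshold is an assumption for illustration; real assemblies
    may encode gaps with other run lengths.
    """
    return [c for c in re.split("N{%d,}" % min_gap, scaffold) if c]
```

For example, for sequences of length 80, 70, 50, 40, 30, 20 and 10 (total 300), the two longest sequences sum to 150, which is half the total, so `n50_l50` gives an N50 of 70 and an L50 of 2. A more contiguous assembly of the same total length would have a larger N50 and a smaller L50.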