2014 Jun 24;7(9):1026–1042. doi: 10.1111/eva.12178
Alignment Similarity-based arrangement of DNA, RNA or protein sequences. In this context, subject and query sequences should be orthologous and reflect evolutionary, not functional or structural, relationships
Annotation Computational process of attaching biologically relevant information to genome sequence data
Assembly Computational reconstruction of a longer sequence from smaller sequence reads
Barcode Short-sequence identifier for individual labelling (barcoding) of sequencing libraries
BAC (Bacterial artificial chromosome) DNA construct of variable length (150–350 kb)
cDNA Complementary DNA synthesized from an mRNA template
Contig A contiguous linear stretch of DNA or RNA consensus sequence. Constructed from a number of smaller, partially overlapping, sequence fragments (reads)
Coverage Also known as ‘sequencing depth’. Sequence coverage refers to the average number of reads per locus and differs from physical coverage, a term often used in genome assembly referring to the cumulative length of reads or read pairs expressed as a multiple of genome size
De novo assembly Refers to the reconstruction of contiguous sequences without making use of any reference sequence
EST library Expressed sequence tag library. An EST is a short subsequence of a cDNA transcript sequence
Fosmid A vector for bacterial cloning of genomic DNA fragments that usually holds inserts of around 40 kb
GC content The proportion of guanine and cytosine bases in a DNA/RNA sequence
Gene ontology (GO) Structured, controlled vocabularies and classifications of gene function across species and research areas
InDel Insertion/deletion polymorphism
Insert size Length of randomly sheared fragments (from the genome or transcriptome) sequenced from both ends
K-mer Short, unique element of DNA sequence of length k, used by many assembly algorithms
Library Collection of DNA (or RNA) fragments modified in a way that is appropriate for downstream analyses, such as high-throughput sequencing in this case
Mapping A term routinely used to describe alignment of short sequence reads to a longer reference sequence
Masking Converting a DNA sequence [A,C,G,T] (usually repetitive or of low quality) to the uninformative character state N or to lower case characters [a,c,g,t] (soft masking)
Massively parallel (or next generation) sequencing High-throughput sequencing nano-technology used to determine the base-pair sequence of DNA/RNA molecules in much larger quantities than previous chain-termination-based sequencing techniques (e.g. Sanger sequencing)
Mate-pair Sequence information from two ends of a DNA fragment, usually several thousand base-pairs long
N50 A statistic of a set of contigs (or scaffolds). It is defined as the length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs
N90 Equivalent to the N50 statistic describing the length for which the collection of all contigs of that length or longer contains at least 90% of the total of the lengths of the contigs
Optical map Genomewide, ordered, high-resolution restriction map derived from single, stained DNA molecules. It can be used to improve a genome assembly by matching it to the genomewide pattern of expected restriction sites, as inferred from the genome sequence
Paired-end sequencing Sequence information from two ends of a short DNA fragment, usually a few hundred base pairs long
Read Short base-pair sequence inferred from the DNA/RNA template by sequencing
RNA-Seq High-throughput shotgun transcriptome (cDNA) sequencing. Usually not used synonymously with RNA sequencing, which implies direct sequencing of RNA molecules, skipping the cDNA generation step
Scaffold Two or more contigs joined together using read-pair information
Transcriptome Set of all RNA molecules transcribed from a DNA template
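Several of the terms defined above (GC content, K-mer, N50/N90 and sequence coverage) are simple computations, and a short Python sketch may help make the definitions concrete. The function names and example values below are illustrative assumptions, not taken from the source. The coverage function additionally assumes that reads are spread evenly over the genome, so that total sequenced bases divided by genome size approximates the average number of reads per locus.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Proportion of guanine and cytosine bases in a DNA/RNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmers(seq: str, k: int) -> Counter:
    """Count every overlapping subsequence of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def nx(lengths: list[int], fraction: float) -> int:
    """Length L such that all contigs of length >= L together contain
    at least `fraction` of the summed contig lengths
    (fraction=0.5 gives N50, fraction=0.9 gives N90)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("no contigs given")


def sequence_coverage(read_lengths: list[int], genome_size: int) -> float:
    """Average number of reads per locus, assuming evenly distributed reads:
    total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    print("GC content:", gc_content("ACGT"))
    print("2-mers:", dict(kmers("ACGTAC", 2)))
    print("N50:", nx(contigs, 0.5))
    print("N90:", nx(contigs, 0.9))
    print("Coverage:", sequence_coverage([100] * 300, 1000))
```

Sorting the contigs from longest to shortest and accumulating their lengths mirrors the wording of the N50 definition directly: the first contig at which the running total reaches half of the summed length is the N50, and the same loop with 0.9 yields the N90.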