2017 Aug 31;6:1618. [Version 1] doi: 10.12688/f1000research.12344.1

Table 3. Overview of common standard data formats for ‘omics data.

Columns: Data type; Format name; Description; Reference or URL for format specification; URLs for repositories accepting data in this format
Data type: Raw DNA/RNA sequence
Format name: FASTA; FASTQ; HDF5; SAM/BAM/CRAM
Description: FASTA is a common text format for storing DNA, RNA or protein sequences, and FASTQ combines base quality information with the nucleotide sequence. HDF5 is a newer sequence read format used by long-read sequencers, e.g. PacBio and Oxford Nanopore. Raw sequence can also be stored in unaligned SAM/BAM/CRAM format.
Reference or URL for format specification:
  [41]
  [42]
  https://support.hdfgroup.org/HDF5/
  https://samtools.github.io/hts-specs/
URLs for repositories accepting data in this format:
  https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/
  http://www.ebi.ac.uk/ena/submit/data-formats

Data type: Assembled DNA sequence
Format name: FASTA; Flat file; AGP
Description: Assemblies without annotation are generally stored in FASTA format. Annotation can be integrated with assemblies in contig, scaffold or chromosome flat file format. AGP files are used to describe how smaller fragments are placed in an assembly but do not contain the sequence information themselves.
Reference or URL for format specification:
  [41]
  http://www.ebi.ac.uk/ena/submit/contig-flat-file
  http://www.ebi.ac.uk/ena/submit/scaffold-flat-file
  https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/
URLs for repositories accepting data in this format:
  http://www.ebi.ac.uk/ena/submit/genomes-sequence-submission

Data type: Aligned DNA sequence
Format name: SAM/BAM; CRAM
Description: Sequences aligned to a reference are represented in the Sequence Alignment/Map (SAM) format. Its binary version is called BAM, and further compression can be achieved using the CRAM format.
Reference or URL for format specification:
  https://samtools.github.io/hts-specs/
URLs for repositories accepting data in this format:
  https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam

Data type: Gene model or genomic feature annotation
Format name: GTF/GFF/GFF3; BED; GB/GBK
Description: General feature format (GFF) and general transfer format (GTF) are commonly used to store genomic features in tab-delimited flat text format. GFF3 is a more advanced version of the basic GFF that allows description of more complex features. BED is a tab-delimited text format that also allows definition of how a feature should be displayed (e.g. on a genome browser). GenBank flat file format (GB/GBK) is also commonly used but is not well standardised.
Reference or URL for format specification:
  https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
  https://genome.ucsc.edu/FAQ/FAQformat.html
  https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
URLs for repositories accepting data in this format:
  http://www.ensembl.org/info/website/upload/gff.html
  http://www.ensembl.org/info/website/upload/gff3.html

Data type: Gene functional annotation
Format name: GAF (GPAD and RDF will also be available in 2018)
Description: A GAF file is a GO Annotation File containing annotations made to the GO by a contributing resource such as FlyBase or PomBase. However, the GAF standard is applicable outside of GO, e.g. using other ontologies such as PO. GAF (v2) is a simple tab-delimited file format with 17 columns to describe an entity (e.g. a protein), its annotation and some annotation metadata.
Reference or URL for format specification:
  http://geneontology.org/page/go-annotation-file-format-20
URLs for repositories accepting data in this format:
  http://geneontology.org/page/submitting-go-annotations

Data type: Genetic/genomic variants
Format name: VCF
Description: A tab-delimited text format that stores meta-information as header lines, followed by information about the positions of variants in the genome. The current version is VCF4.2.
Reference or URL for format specification:
  https://samtools.github.io/hts-specs/VCFv4.2.pdf
URLs for repositories accepting data in this format:
  http://www.ensembl.org/info/website/upload/var.html

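Data type: Interaction data
Format name: PSI-MI XML; MITAB
Description: Data formats developed to exchange molecular interaction data and related metadata, and to fully describe molecule constructs.
Reference or URL for format specification:
  http://psidev.info/groups/molecular-interactions
URLs for repositories accepting data in this format:
  http://www.ebi.ac.uk/intact
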
Data type: Raw metabolite profile
Format name: mzML; nmrML
Description: XML-based data formats that define mass spectrometry and nuclear magnetic resonance raw data in metabolomics.
Reference or URL for format specification:
  http://www.psidev.info/mzml
  http://nmrml.org/

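Data type: Protein sequence
Format name: FASTA
Description: A text-based format for representing nucleotide sequences or protein sequences, in which nucleotides or amino acids are represented using single-letter codes.
Reference or URL for format specification:
  [41]
URLs for repositories accepting data in this format:
  www.uniprot.org
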
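Data type: Raw proteome profile
Format name: mzML
Description: A formally defined XML format for representing mass spectrometry data. Files typically contain sequences of mass spectra, plus metadata about the experiment.
Reference or URL for format specification:
  http://www.psidev.info/mzml
URLs for repositories accepting data in this format:
  www.ebi.ac.uk/pride
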
Data type: Organisms and specimens
Format name: Darwin Core
Description: The Darwin Core (DwC) standard facilitates the exchange of information about the geographic location of organisms and associated collection specimens.
Reference or URL for format specification:
  http://rs.tdwg.org/dwc/
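
As an illustration of the difference between the FASTA and FASTQ formats listed in Table 3, the following minimal Python sketch reads four-line FASTQ records, decodes the per-base quality string, and writes the same read as FASTA, which drops the quality information. It is not taken from the article: the function names `read_fastq` and `to_fasta` are hypothetical, and the sketch assumes unwrapped four-line records with Phred+33 quality encoding, which may not hold for every FASTQ file.

```python
from typing import Iterator, List, Tuple

# Assumption: qualities use the Phred+33 ASCII offset; some older files differ.
PHRED_OFFSET = 33


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    # Drop line endings and skip empty lines (e.g. a trailing newline).
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # The read identifier is the text after '@' up to the first whitespace.
        parts = header[1:].split(maxsplit=1)
        read_id = parts[0] if parts else ""
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield read_id, sequence, scores


def to_fasta(read_id: str, sequence: str) -> str:
    """Format one read as a FASTA record; quality scores are not kept."""
    return f">{read_id}\n{sequence}\n"


# A made-up single-read example, used only to exercise the functions.
EXAMPLE_FASTQ = "@read1 example\nACGT\n+\nII5!\n"

if __name__ == "__main__":
    for read_id, sequence, scores in read_fastq(EXAMPLE_FASTQ.splitlines()):
        print(read_id, sequence, scores)
        print(to_fasta(read_id, sequence), end="")
```

For real data, an established parser is usually a better choice than hand-written code; the sketch only shows what each format stores.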