Table 3. Overview of common standard data formats for ‘omics data.
Data type | Format name | Description | Reference or URL for format specification | URLs for repositories accepting data in this format |
---|---|---|---|---|
Raw DNA/RNA sequence | FASTA, FASTQ, HDF5, SAM/BAM/CRAM | FASTA is a common text format for storing DNA, RNA or protein sequences; FASTQ combines base quality information with the nucleotide sequence. HDF5 is a newer sequence read format used by long-read sequencers, e.g. PacBio and Oxford Nanopore. Raw sequence can also be stored in unaligned SAM/BAM/CRAM format | [41]; [42]; https://support.hdfgroup.org/HDF5/; https://samtools.github.io/hts-specs/ | https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/; http://www.ebi.ac.uk/ena/submit/data-formats |
Assembled DNA sequence | FASTA, Flat file, AGP | Assemblies without annotation are generally stored in FASTA format. Annotation can be integrated with assemblies in contig, scaffold or chromosome flat file format. AGP files describe how smaller fragments are placed in an assembly but do not contain the sequence information themselves | [41]; http://www.ebi.ac.uk/ena/submit/contig-flat-file; http://www.ebi.ac.uk/ena/submit/scaffold-flat-file; https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/ | http://www.ebi.ac.uk/ena/submit/genomes-sequence-submission |
Aligned DNA sequence | SAM/BAM, CRAM | Sequences aligned to a reference are represented in the Sequence Alignment/Map (SAM) format. Its binary version is called BAM, and further compression is possible with the CRAM format | https://samtools.github.io/hts-specs/ | https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam |
Gene model or genomic feature annotation | GTF/GFF/GFF3, BED, GB/GBK | General feature format (GFF) and general transfer format (GTF) are commonly used to store genomic features in tab-delimited flat text. GFF3 is a more advanced version of the basic GFF that allows description of more complex features. BED is a tab-delimited text format that also allows definition of how a feature should be displayed (e.g. on a genome browser). GenBank flat file format (GB/GBK) is also commonly used but is not well standardised | https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md; https://genome.ucsc.edu/FAQ/FAQformat.html; https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html | http://www.ensembl.org/info/website/upload/gff.html; http://www.ensembl.org/info/website/upload/gff3.html |
Gene functional annotation | GAF (GPAD and RDF will also be available in 2018) | A GAF file is a GO Annotation File containing annotations made to the GO by a contributing resource such as FlyBase or PomBase. However, the GAF standard is also applicable outside of GO, e.g. with other ontologies such as PO. GAF (v2) is a simple tab-delimited file format with 17 columns that describe an entity (e.g. a protein), its annotation and some annotation metadata | http://geneontology.org/page/go-annotation-file-format-20 | http://geneontology.org/page/submitting-go-annotations |
Genetic/genomic variants | VCF | A tab-delimited text format that stores meta-information as header lines, followed by information about the positions of variants in the genome. The current version is VCF4.2 | https://samtools.github.io/hts-specs/VCFv4.2.pdf | http://www.ensembl.org/info/website/upload/var.html |
Interaction data | PSI-MI XML, MITAB | Data formats developed to exchange molecular interaction data and related metadata, and to fully describe molecule constructs | http://psidev.info/groups/molecular-interactions | http://www.ebi.ac.uk/intact |
Raw metabolite profile | mzML, nmrML | XML-based data formats that define mass spectrometry and nuclear magnetic resonance raw data in metabolomics | http://www.psidev.info/mzml; http://nmrml.org/ | |
Protein sequence | FASTA | A text-based format for representing nucleotide or protein sequences, in which nucleotides or amino acids are represented using single-letter codes | [41] | www.uniprot.org |
Raw proteome profile | mzML | A formally defined XML format for representing mass spectrometry data. Files typically contain sequences of mass spectra, plus metadata about the experiment | http://www.psidev.info/mzml | www.ebi.ac.uk/pride |
Organisms and specimens | Darwin Core | The Darwin Core (DwC) standard facilitates the exchange of information about the geographic location of organisms and associated collection specimens | http://rs.tdwg.org/dwc/ |