Author manuscript; available in PMC: 2022 Apr 19.

Published in final edited form as: Mol Omics. 2021 Apr 19;17(2):170–185. doi: 10.1039/d0mo00041h

Table 1. Available resources for big data sets.

The table lists resources for downloading data sets from various omics platforms, as well as sequence and annotation information.

Resource	Data type	Link	Reference
SILVA is a resource of databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains.	gene sequences of 16S for prokaryotes and 18S for Eukarya	https://www.arb-silva.de/	[121]
Ribosomal Database Project: aligned and annotated rRNA gene sequence data	16S rRNA sequences	http://rdp.cme.msu.edu/	[122]
Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference.	Taxonomy based on the 16S rRNA gene	https://greengenes.secondgenome.com/	[123]
Genome Taxonomy Database is an initiative to establish a standardized microbial taxonomy based on genome phylogeny. The genomes used to construct the phylogeny are obtained from RefSeq and GenBank.	a comprehensive and phylogenomic-based taxonomy for bacterial and archaeal taxa	https://gtdb.ecogenomic.org/	[52, 53]
Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data	protein sequence and annotation database	https://www.uniprot.org/	[124]
NIH National Center for Biotechnology Information (NCBI) GenBank is an annotated collection of all publicly available DNA sequences. Complete bimonthly release updates are available. Data is exchanged daily with the DNA DataBank of Japan and the European Nucleotide Archive.	genomic sequence and annotation	https://www.ncbi.nlm.nih.gov/genbank/	[125]
NIH/NCBI Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins	genomic, transcriptomics, and proteomic sequence and annotation	https://www.ncbi.nlm.nih.gov/refseq/	[126]
University of California Santa Cruz (UCSC) Genome Browser for exploring genome sequences and annotation. GenBank updates for mRNA, RefSeq, and EST data occur on a semi-quarterly basis.	genome sequence and annotation database	http://genome.ucsc.edu/	[127]
NIH National Human Genome Research Institute Encyclopedia of DNA Elements (ENCODE) Consortium project uses Reference Genomes from NCBI or UCSC	DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, modified histones, transcription factors, chromatin regulators, and RNA-binding proteins. Genome sequence and annotation database.	https://www.encodeproject.org/	[128]
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Updates are released every 2–3 months.	genome sequence and annotation, gene models, transcriptional data, genetic variation and comparative analysis	http://ensembl.org/	[129]
The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This is a joint effort between the National Cancer Institute and the National Human Genome Research Institute.	Individual patient tumor samples: DNA, RNA, Protein, epigenetic changes	https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga	[130]
Cancer Cell Line Encyclopedia (CCLE) is a collaboration between the Broad Institute and the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation to conduct a detailed genetic and pharmacologic characterization of a large panel of human cancer models. CCLE contains genomics data and visualization for over 1400 cell lines.	Copy Number, mRNA expression (Affy), RPPA, RRBS, and mRNA expression (RNAseq)	https://portals.broadinstitute.org/ccle	[131]
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is a community resource project. TARGET is organized into a collaborative network of disease-specific project teams with the goal of identifying molecular changes that drive childhood cancers.	clinical information, gene expression, miRNA expression, copy number, sequencing data for cancers	https://ocg.cancer.gov/programs/target	Initiative phs000218
Omics Discovery Index (OmicsDI) is an open-source platform that enables access, discovery and dissemination of omics data sets.	genomics, transcriptomics, proteomics, metabolomics	https://www.omicsdi.org/	[132]
Multi-Omics Profiling Expression Database (MOPED) is a repository for multi-omics data of human and model organisms.	transcriptomics and proteomics data and visualization	https://omictools.com/moped-tool	[133]
ProteomeXchange (PX) Consortium consists of PRIDE, PeptideAtlas, PASSEL, MassIVE and jPOST. Devoted to mass spectrometry (MS)-based proteomics data.	proteomics data sets	http://www.proteomexchange.org/	[134, 135]