Table 2: Test datasets
A) Genome sequence datasets | ||||
---|---|---|---|---|
Category | Organism | Accession | Size | |
Virus | Gordoniaphage GAL1 [61] | GCF_001884535.1 | 50.7 kB | |
Bacteria | WS1 bacterium JGI 0000059-K21 [60] | GCA_000398605.1 | 522 kB | |
Protist | Astrammina rara [60] | GCA_000211355.2 | 1.71 MB | |
Fungus | Nosema ceranae [60] | GCA_000988165.1 | 5.81 MB | |
Protist | Cryptosporidium parvum Iowa II [60] | GCA_000165345.1 | 9.22 MB | |
Protist | Spironucleus salmonicida [60] | GCA_000497125.1 | 13.1 MB | |
Protist | Tieghemostelium lacteum [60] | GCA_001606155.1 | 23.7 MB | |
Fungus | Fusarium graminearum PH-1 [61] | GCF_000240135.3 | 36.9 MB | |
Protist | Salpingoeca rosetta [60] | GCA_000188695.1 | 56.2 MB | |
Algae | Chondrus crispus [60] | GCA_000350225.2 | 106 MB | |
Algae | Kappaphycus alvarezii [60] | GCA_002205965.2 | 341 MB | |
Animal | Strongylocentrotus purpuratus [61] | GCF_000002235.4 | 1.01 GB | |
Plant | Picea abies [60] | GCA_900067695.1 | 13.4 GB | |
B) Other DNA datasets | ||||
Dataset | No. of sequences | Size | Source | Date |
Mitochondrion [61] | 9,402 | 245 MB | RefSeq ftp: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz, ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.2.1.genomic.fna.gz | 15 March 2019 |
NCBI Virus Complete Nucleotide Human [62] | 36,745 | 482 MB | NCBI Virus: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/ | 11 May 2020 |
Influenza [63] | 700,001 | 1.22 GB | Influenza Virus Database: ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz | 27 April 2019 |
Helicobacter [60] | 108,292 | 2.76 GB | NCBI Assembly: https://www.ncbi.nlm.nih.gov/assembly | 24 April 2019 |
C) RNA datasets | ||||
SILVA 132 LSURef [64] | 198,843 | 610 MB | Silva database: https://ftp.arb-silva.de/release_132/Exports/SILVA_132_LSURef_tax_silva.fasta.gz | 11 December 2017 |
SILVA 132 SSURef Nr99 [64] | 695,171 | 1.11 GB | Silva database: https://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva.fasta.gz | 11 December 2017 |
SILVA 132 SSURef [64] | 2,090,668 | 3.28 GB | Silva database: https://ftp.arb-silva.de/release_132/Exports/SILVA_132_SSURef_tax_silva.fasta.gz | 11 December 2017 |
D) Multiple DNA sequence alignments | ||||
UCSC hg38 7way knownCanonical-exonNuc [65] | 1,470,154 | 340 MB | UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz7way/alignments/knownCanonical.exonNuc.fa.gz | 6 June 2014 |
UCSC hg38 20way knownCanonical-exonNuc [65] | 4,211,940 | 969 MB | UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz20way/alignments/knownCanonical.exonNuc.fa.gz | 30 June 2015 |
E) Protein datasets | ||||
PDB [66] | 109,914 | 67.6 MB | PDB database FTP: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz | 9 April 2019 |
Homo sapiens GRCh38 [67] | 105,961 | 73.2 MB | Ensembl ftp: ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz | 12 March 2019 |
NCBI Virus RefSeq Protein [62] | 373,332 | 122 MB | NCBI Virus: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/ | 10 May 2020 |
UniProtKB Reviewed (Swiss-Prot) [68] | 560,118 | 277 MB | UniProt ftp: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | 2 April 2019 |
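
The "No. of sequences" column can be cross-checked against a downloaded file by counting FASTA header lines. The following is a minimal sketch, assuming the files are plain or gzip-compressed FASTA as the `.gz` URLs above suggest. The helper names `open_fasta` and `fasta_stats` are hypothetical and do not come from the source. The residue total is only a rough sanity check, since the table does not state how "Size" was measured.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="ascii", errors="replace")
    return open(path, "rt", encoding="ascii", errors="replace")


def fasta_stats(path):
    """Return (number of records, total residues) for one FASTA file.

    A record is counted for every line starting with '>'. Residues are
    counted as the characters on all other lines, excluding surrounding
    whitespace.
    """
    n_records = 0
    n_residues = 0
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_records += 1
            else:
                n_residues += len(line.strip())
    return n_records, n_residues
```

For example, calling `fasta_stats("pdbaa.gz")` on a locally downloaded copy would return a record count to compare with the table. Database files behind "current" URLs change over time, so counts from a fresh download may differ from those listed.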