Table 2. Data Sets Used for Cross-Species Evaluation
| Organism | Protein sequences | Assigned by EUCLID |
|---|---|---|
| Homo sapiens | 23,740 | 13,419 |
| Drosophila melanogaster | 14,334 | 7235 |
| Caenorhabditis elegans | 20,263 | 7840 |
| Arabidopsis thaliana | 25,617 | 11,771 |
| Schizosaccharomyces pombe | 4952 | 2786 |
| Saccharomyces cerevisiae | 6329 | 3302 |
| Aeropyrum pernix | 2694 | 684 |
| Pyrobaculum aerophilum | 2605 | 867 |
| Sulfolobus solfataricus | 2977 | 1186 |
| Sulfolobus tokodaii | 2826 | 1045 |
| Archaeoglobus fulgidus | 2407 | 1074 |
| Methanobacterium thermoautotrophicum | 1869 | 867 |
| Methanococcus jannaschii | 1715 | 781 |
| Methanosarcina mazei | 3371 | 1420 |
| Methanopyrus kandleri | 1691 | 653 |
| Methanosarcina acetivorans | 4540 | 1850 |
| Pyrococcus abyssi | 1765 | 855 |
| Pyrococcus horikoshii | 2064 | 786 |
| Thermoplasma acidophilum | 1031 | 783 |
| Thermoplasma volcanium | 1499 | 792 |
| Anabaena sp. | 5366 | 2444 |
| Aquifex aeolicus | 1522 | 926 |
| Borrelia burgdorferi | 850 | 461 |
| Bacillus halodurans | 4066 | 2223 |
| Bacillus subtilis | 4100 | 2240 |
| Buchnera sp. | 564 | 469 |
| Campylobacter jejuni | 1654 | 975 |
| Chlamydia pneumoniae | 1052 | 530 |
| Chlamydia trachomatis | 894 | 498 |
| Deinococcus radiodurans | 2937 | 1332 |
| Escherichia coli | 4289 | 2883 |
| Fusobacterium nucleatum | 2068 | 1083 |
| Haemophilus influenzae | 1709 | 1183 |
| Helicobacter pylori | 1566 | 815 |
| Lactococcus lactis | 2266 | 1229 |
| Mycoplasma genitalium | 480 | 332 |
| Mycoplasma pneumoniae | 677 | 441 |
| Mycobacterium tuberculosis | 3918 | 1973 |
| Neisseria meningitidis ser. A | 2121 | 1132 |
| Neisseria meningitidis ser. B | 2025 | 1088 |
| Rickettsia prowazekii | 834 | 548 |
| Streptomyces coelicolor | 7848 | 3625 |
| Synechocystis sp. | 3169 | 1598 |
| Thermotoga maritima | 1846 | 1064 |
| Treponema pallidum | 1031 | 507 |
| Vibrio cholerae | 3828 | 2054 |
| Xylella fastidiosa | 2766 | 1184 |
| Yersinia pestis | 4008 | 2566 |
The “Protein sequences” column lists the number of protein-coding regions annotated in each genome, except for H. sapiens, D. melanogaster, C. elegans, and A. thaliana (see text for details on these data sets). The “Assigned by EUCLID” column gives the number of protein sequences that the EUCLID method could assign to a cellular role; this is the amount of data available for validating the ProtFun method in each organism.
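To make the two columns easier to compare across organisms, the fraction of sequences with a EUCLID assignment can be computed per organism. The following is a minimal sketch using four rows copied from the table. The function name `euclid_coverage` and the idea of reporting a coverage fraction are illustrative assumptions, not part of the original analysis.

```python
# Illustrative sketch: fraction of protein sequences per organism that
# received a EUCLID cellular-role assignment. Rows are copied from Table 2;
# the helper name and the "coverage" measure are assumptions for illustration.

ROWS = [
    # (organism, protein sequences, assigned by EUCLID)
    ("Homo sapiens", 23740, 13419),
    ("Escherichia coli", 4289, 2883),
    ("Aeropyrum pernix", 2694, 684),
    ("Buchnera sp.", 564, 469),
]


def euclid_coverage(rows):
    """Return {organism: assigned / total} for each (organism, total, assigned) row."""
    coverage = {}
    for organism, total, assigned in rows:
        if total <= 0:
            raise ValueError(f"non-positive sequence count for {organism}")
        coverage[organism] = assigned / total
    return coverage


coverage = euclid_coverage(ROWS)

# Organisms ordered from highest to lowest assigned fraction.
ranked = sorted(coverage, key=coverage.get, reverse=True)

for organism in ranked:
    print(f"{organism}: {coverage[organism]:.1%}")
```

In this subset, the assigned fraction ranges from roughly a quarter of the sequences (A. pernix) to more than four-fifths (Buchnera sp.), so the amount of validation data differs considerably between organisms.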