(A) Pipeline for identifying families that have no assigned domain and families that are not represented in RefSeq genomes. Upper path of the flow diagram: only a small subset of the ~4,000 small protein families was assigned a protein domain (identified by RPS-BLAST against CDD position-specific scoring matrices, PSSMs). Lower path of the flow diagram: representatives of all ~4,000 families were searched with BLAST against ~3,000,000 small proteins annotated in RefSeq, originating from ~70,000 RefSeq genomes, and against ~7,000,000 putative small proteins that we annotated using Prodigal with adjusted thresholds. The second search allowed the identification of an additional set of homologs that are encoded but not annotated in RefSeq genomes.
(B) Domains identified among the ~4,000 families. Domains that were assigned to ≥5 families and/or ≥50 species are shown. A complete list of domains can be found in Table S3.
(C) Histogram of the number of species encoding small proteins from families with no known domain.
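
The display criterion in (B), ≥5 families and/or ≥50 species, can be illustrated with a minimal sketch. The input layout (one `(domain, family_id, species)` tuple per assignment) and the function name `domains_to_display` are assumptions made for illustration only; they are not taken from the authors' pipeline.

```python
from collections import defaultdict


def domains_to_display(assignments, min_families=5, min_species=50):
    """Return the domains that meet the panel (B) display criterion.

    `assignments` is a hypothetical iterable of (domain, family_id, species)
    tuples; the real data layout used by the authors is not specified.
    """
    families = defaultdict(set)  # domain -> distinct families carrying it
    species = defaultdict(set)   # domain -> distinct species encoding it
    for domain, family_id, sp in assignments:
        families[domain].add(family_id)
        species[domain].add(sp)
    # "and/or" in the legend is an inclusive OR: either threshold suffices.
    return sorted(
        d for d in families
        if len(families[d]) >= min_families or len(species[d]) >= min_species
    )
```

Because the criterion is an inclusive OR, a domain seen in a single family still qualifies if that family spans at least 50 species, and a domain shared by five families qualifies even if all of them come from one species.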