Supplementary material for Ferretti et al. (2001) Proc. Natl. Acad. Sci. USA 98 (8), 4658-4663. (10.1073/pnas.071559398)

Annotation procedure for Streptococcus pyogenes

Initial ORF prediction for S. pyogenes SF370 was made by using GLIMMER 2.0 (http://www.tigr.org/) with default settings.

The coding regions were extracted from the genome and loaded into the relational database of the Genome Annotation Tool Kit from the Los Alamos National Laboratory (Los Alamos, NM). The corresponding translation products were analyzed by using several database comparison packages [BLASTP, PRO-DOM, BLOCKS, COGs, SIGNALP (http://www.cbs.dtu.dk/services/SignalP-2.0/), PFAM].

Functional assignment was based on the COGs functional categories with additional categories for virulence factors, bacteriophage genomes, and stable RNA molecules. Assignment was based primarily on the results of COGs and BLASTP analysis. Product definition was based on analysis of all search data.

Sequence trace data in regions of suspected frameshifts were examined in CONSED, and the bases were re-called on the basis of analysis of the trace data or after resequencing of the region. ORFs containing confirmed frameshift or point mutations within the coding region were designated as pseudogenes.

Internal ORFs and small peptides lacking distinguishable promoter or termination regions were manually removed from the database on an individual basis.
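
The removal itself was a manual, case-by-case decision, but the candidates for such a review could be shortlisted automatically. The following is a minimal sketch only, not the procedure used for SF370: the `Orf` record, the `flag_candidates` function, and the 150-nt length cutoff are all hypothetical, and the check for promoter or termination regions is not modeled at all.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    name: str
    start: int  # 1-based, inclusive; start < end regardless of strand
    end: int    # 1-based, inclusive


def flag_candidates(orfs, min_nt=150):
    """Return {orf name: [reasons]} for ORFs worth a manual look.

    min_nt is an illustrative cutoff, not a value taken from the paper.
    """
    flagged = {}
    for orf in orfs:
        reasons = []
        length = orf.end - orf.start + 1
        # "internal": the ORF lies entirely inside a longer ORF
        for other in orfs:
            other_length = other.end - other.start + 1
            if (other is not orf
                    and other.start <= orf.start
                    and orf.end <= other.end
                    and other_length > length):
                reasons.append("internal")
                break
        # "short": the ORF is below the (hypothetical) length cutoff
        if length < min_nt:
            reasons.append("short")
        if reasons:
            flagged[orf.name] = reasons
    return flagged
```

A flagged ORF would still need inspection of its upstream and downstream sequence before any decision to remove it.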

tRNAs were identified by using tRNAscan-SE with the prokaryotic training set.

Paralogs and paralogous families were identified by BLASTP analysis of each individual ORF against an S. pyogenes ORF database. Sequences showing 30% identity over >60% of the query protein were considered homologous.
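
The homology criterion above can be sketched as a filter over parsed BLASTP hits. This is an illustration under stated assumptions rather than the code used in the study: the `Hit` record and both function names are hypothetical, the identity threshold is read as "at least 30%", query coverage is taken as the aligned query span divided by the query length, and grouping pairs into families by single linkage is an assumption, since the text does not say how families were assembled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pct_identity: float  # percent identity over the aligned region
    q_start: int         # 1-based, inclusive position on the query
    q_end: int           # 1-based, inclusive position on the query
    q_len: int           # full length of the query protein


def is_homologous(hit, min_identity=30.0, min_coverage=0.60):
    """Apply the 30% identity / >60% query coverage criterion to one hit."""
    coverage = (hit.q_end - hit.q_start + 1) / hit.q_len
    return (hit.query != hit.subject          # ignore self-hits
            and hit.pct_identity >= min_identity
            and coverage > min_coverage)


def paralog_families(hits):
    """Group ORFs linked by homologous hits (single linkage, an assumption)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for hit in hits:
        if is_homologous(hit):
            parent[find(hit.query)] = find(hit.subject)

    groups = {}
    for orf in list(parent):
        groups.setdefault(find(orf), []).append(orf)
    return sorted(sorted(members) for members in groups.values())
```

Single linkage can chain distantly related proteins into one family, so a stricter grouping rule may be preferable in practice.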

Multisequence alignments and phylogenetic trees were generated by using CLUSTALX and TREEVIEW or the ALLALL server (http://cbrg.inf.ethz.ch/subsection3_1_1.html), in both cases with a PAM250 matrix.